Fix new kernel-doc warnings in fs/bio.c:
Warning(fs/bio.c:316): No description found for parameter 'bio'
Warning(fs/bio.c:316): No description found for parameter 'parent'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
We had a number of new features in ext4 during this merge window
(ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
there were many more regression and bug fixes this time around. It
didn't help that xfstests hadn't been fully updated to fully stress
test COLLAPSE_RANGE until after -rc1.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTVIEUAAoJENNvdpvBGATwnKkQANlzQv6BhgzCa0b5Iu0SkHeD
OuLAtPFYE5OVEK22oWT0H76gBi71RHLboHwThd+ZfEeEPvyfs42wY0J/PV/R9dHx
kwhU+MaDDzugfVj3gg29DpYNLQkL/evq0vlNbrRk5je877c2I8JbXV/aAoTVFZoH
NGOsagwBqWCsgL5nSOk/nEZSRX2AzSCkgmOVxylLzFoyTUkX3vZx8G8XtS1zRgbH
b1yOWIK1Ifj7tmBZ4HLpNiK6/NpHAHeHRFiaCQxY0hkLjUeMyVNJfZzXS/Fzp8DP
p1/nm5z9PaFj4nyBC1Wvh9Z6Lj0zQ0ap73LV+w4fHM1SZub3XY+hvyXj/8qMNaSc
lLIGwa2AZFpurbKKn6MZTi5CubVLZs6PZKzDgYURnEcJCgeMujMOxbKekcL5sP9E
Gb6Hh9I/f08HagCRox5O0W7f0/TBY5bFryx5kQQZUtpcRmnY3m7cohSkn6WriwTZ
zYApOZMZkFX5spSeYsfyi8K8wHij/5mXvm7qeqQ0Rj4Ehycd+7jwltOCVXAYN29+
zSKaBaxH2+V7zuGHSxjDFbOOlPotTFNzGmFh08DPTF4Vgnc9uMlLo0Oz8ADFDcT2
JZ4pAFTEREnHOATNl5bAEi8wNrU/Ln9IGhlYCYI9X5BQXjf9oPXcYwQT/lKCb07s
ks8ujfry1R/gjQGuv+LH
=gi42
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"These are regression and bug fixes for ext4.
We had a number of new features in ext4 during this merge window
(ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
there were many more regression and bug fixes this time around. It
didn't help that xfstests hadn't been fully updated to fully stress
test COLLAPSE_RANGE until after -rc1"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
ext4: disable COLLAPSE_RANGE for bigalloc
ext4: fix COLLAPSE_RANGE failure with 1KB block size
ext4: use EINVAL if not a regular file in ext4_collapse_range()
ext4: enforce we are operating on a regular file in ext4_zero_range()
ext4: fix extent merging in ext4_ext_shift_path_extents()
ext4: discard preallocations after removing space
ext4: no need to truncate pagecache twice in collapse range
ext4: fix removing status extents in ext4_collapse_range()
ext4: use filemap_write_and_wait_range() correctly in collapse range
ext4: use truncate_pagecache() in collapse range
ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE
ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled
ext4: always check ext4_ext_find_extent result
ext4: fix error handling in ext4_ext_shift_extents
ext4: silence sparse check warning for function ext4_trim_extent
ext4: COLLAPSE_RANGE only works on extent-based files
ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches
ext4: use i_size_read in ext4_unaligned_aio()
fs: disallow all fallocate operation on active swapfile
fs: move falloc collapse range check into the filesystem methods
...
Once COLLAPSE RANGE is be disable for ext4 with bigalloc feature till finding
root-cause of problem. It will be enable with fixing that regression of
xfstest(generic 075 and 091) again.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When formatting with 1KB or 2KB(not aligned with PAGE SIZE) block
size, xfstests generic/075 and 091 are failing. The offset supplied to
function truncate_pagecache_range is block size aligned. In this
function start offset is re-aligned to PAGE_SIZE by rounding_up to the
next page boundary. Due to this rounding up, old data remains in the
page cache when blocksize is less than page size and start offset is
not aligned with page size. In case of collapse range, we need to
align start offset to page size boundary by doing a round down
operation instead of round up.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
in non-lazy walk we need to be careful about dentry switching from
negative to positive - both ->d_flags and ->d_inode are updated,
and in some places we might see only one store. The cases where
dentry has been obtained by dcache lookup with ->i_mutex held on
parent are safe - ->d_lock and ->i_mutex provide all the barriers
we need. However, there are several places where we run into
trouble:
* do_last() fetches ->d_inode, then checks ->d_flags and
assumes that inode won't be NULL unless d_is_negative() is true.
Race with e.g. creat() - we might have fetched the old value of
->d_inode (still NULL) and new value of ->d_flags (already not
DCACHE_MISS_TYPE). Lin Ming has observed and reported the resulting
oops.
* a bunch of places checks ->d_inode for being non-NULL,
then checks ->d_flags for "is it a symlink". Race with symlink(2)
in case if our CPU sees ->d_inode update first - we see non-NULL
there, but ->d_flags still contains DCACHE_MISS_TYPE instead of
DCACHE_SYMLINK_TYPE. Result: false negative on "should we follow
link here?", with subsequent unpleasantness.
Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one
Reported-and-tested-by: Lin Ming <minggr@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull cifs fixes from Steve French:
"A set of 5 small cifs fixes"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cif: fix dead code
cifs: fix error handling cifs_user_readv
fs: cifs: remove unused variable.
Return correct error on query of xattr on file with empty xattrs
cifs: Wait for writebacks to complete before attempting write.
Here are some driver core fixes for 3.15-rc2. Also in here are some
documentation updates, as well as an API removal that had to wait for
after -rc1 due to the cleanups coming into you from multiple developer
trees (this one and the PPC tree.)
All have been in linux next successfully.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlNRl6cACgkQMUfUDdst+yllxACfV9fZ/A6IQja60AdPEo+oa6Cw
RiIAoJtH0D0G0eC4+/Qs9GSRMoB4jPPC
=Wi3a
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some driver core fixes for 3.15-rc2. Also in here are some
documentation updates, as well as an API removal that had to wait for
after -rc1 due to the cleanups coming into you from multiple developer
trees (this one and the PPC tree.)
All have been in linux next successfully"
* tag 'driver-core-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
drivers/base/dd.c incorrect pr_debug() parameters
Documentation: Update stable address in Chinese and Japanese translations
topology: Fix compilation warning when not in SMP
Chinese: add translation of io_ordering.txt
stable_kernel_rules: spelling/word usage
sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()
kernfs: protect lazy kernfs_iattrs allocation with mutex
fs: Don't return 0 from get_anon_bdev
There is a bug in ext4_ext_shift_path_extents() where if we actually
manage to merge a extent we would skip shifting the next extent. This
will result in in one extent in the extent tree not being properly
shifted.
This is causing failure in various xfstests tests using fsx or fsstress
with collapse range support. It will also cause file system corruption
which looks something like:
e2fsck 1.42.9 (4-Feb-2014)
Pass 1: Checking inodes, blocks, and sizes
Inode 20 has out of order extents
(invalid logical block 3, physical block 492938, len 2)
Clear? yes
...
when running e2fsck.
It's also very easily reproducible just by running fsx without any
parameters. I can usually hit the problem within a minute.
Fix it by increasing ex_start only if we're not merging the extent.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Currently in ext4_collapse_range() and ext4_punch_hole() we're
discarding preallocation twice. Once before we attempt to do any changes
and second time after we're done with the changes.
While the second call to ext4_discard_preallocations() in
ext4_punch_hole() case is not needed, we need to discard preallocation
right after ext4_ext_remove_space() in collapse range case because in
the case we had to restart a transaction in the middle of removing space
we might have new preallocations created.
Remove unneeded ext4_discard_preallocations() ext4_punch_hole() and move
it to the better place in ext4_collapse_range()
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We're already calling truncate_pagecache() before we attempt to do any
actual job so there is not need to truncate pagecache once more using
truncate_setsize() after we're finished.
Remove truncate_setsize() and replace it just with i_size_write() note
that we're holding appropriate locks.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently in ext4_collapse_range() when calling ext4_es_remove_extent() to
remove status extents we're passing (EXT_MAX_BLOCKS - punch_start - 1)
in order to remove all extents from start of the collapse range to the
end of the file. However this is wrong because we might miss the
possible extent covering the last block of the file.
Fix it by removing the -1.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
Currently we're passing -1 as lend argumnet for
filemap_write_and_wait_range() which is wrong since lend is signed type
so it would cause some confusion and we might not write_and_wait for the
entire range we're expecting to write.
Fix it by using LLONG_MAX instead.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We should be using truncate_pagecache() instead of
truncate_pagecache_range() in the collapse range because we're
truncating page cache from offset to the end of file.
truncate_pagecache() also get rid of the private COWed pages from the
range because we're going to shift the end of the file.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Since we're still limiting attributes to a page, the result here is that
a large getattr result will return NFS4ERR_REP_TOO_BIG/TOO_BIG_TO_CACHE
instead of NFS4ERR_RESOURCE.
Both error returns are wrong, and the real bug here is the arbitrary
limit on getattr results, fixed by as-yet out-of-tree patches. But at a
minimum we can make life easier for clients by sticking to one broken
behavior in released kernels instead of two....
Trond says:
one immediate consequence of this patch will be that NFSv4.1
clients will now report EIO instead of EREMOTEIO if they hit the
problem. That may make debugging a little less obvious.
Another consequence will be that if we ever do try to add client
side handling of NFS4ERR_REP_TOO_BIG, then we now have to deal
with the “handle existing buggy server” syndrome.
Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
...otherwise the logic in the timeout handling doesn't work correctly.
Spotted-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A fl->fl_break_time of 0 has a special meaning to the lease break code
that basically means "never break the lease". knfsd uses this to ensure
that leases don't disappear out from under it.
Unfortunately, the code in __break_lease can end up passing this value
to wait_event_interruptible as a timeout, which prevents it from going
to sleep at all. This causes __break_lease to spin in a tight loop and
causes soft lockups.
Fix this by ensuring that we pass a minimum value of 1 as a timeout
instead.
Cc: <stable@vger.kernel.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Reported-by: Terry Barnaby <terry1@beam.ltd.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Old values of user quota limits were being used and
could allow users to exceed their allotted quotas.
This patch refreshes the limits to the latest values
so that quotas are enforced correctly.
Resolves: rhbz#1077463
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This issue was found by Coverity (CID 1202536)
This proposes a fix for a statement that creates dead code.
The "rc < 0" statement is within code that is run
with "rc > 0".
It seems like "err < 0" was meant to be used here.
This way, the error code is returned by the function.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steve French <smfrench@gmail.com>
Coverity says:
*** CID 1202537: Dereference after null check (FORWARD_NULL)
/fs/cifs/file.c: 2873 in cifs_user_readv()
2867 cur_len = min_t(const size_t, len - total_read, cifs_sb->rsize);
2868 npages = DIV_ROUND_UP(cur_len, PAGE_SIZE);
2869
2870 /* allocate a readdata struct */
2871 rdata = cifs_readdata_alloc(npages,
2872 cifs_uncached_readv_complete);
>>> CID 1202537: Dereference after null check (FORWARD_NULL)
>>> Comparing "rdata" to null implies that "rdata" might be null.
2873 if (!rdata) {
2874 rc = -ENOMEM;
2875 goto error;
2876 }
2877
2878 rc = cifs_read_allocate_pages(rdata, npages);
...when we "goto error", rc will be non-zero, and then we end up trying
to do a kref_put on the rdata (which is NULL). Fix this by replacing
the "goto error" with a "break".
Reported-by: <scan-admin@coverity.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
xfstests generic/004 reproduces an ilock deadlock using the tmpfile
interface when selinux is enabled. This occurs because
xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
latter eventually calls into xfs_xattr_get() which attempts to get the
lock again. E.g.:
xfs_io D ffffffff81c134c0 4096 3561 3560 0x00000080
ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
Call Trace:
[<ffffffff8177f969>] schedule+0x29/0x70
[<ffffffff81783a65>] rwsem_down_read_failed+0xc5/0x120
[<ffffffffa05aa97f>] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
[<ffffffff813b3434>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff810ed179>] ? down_read_nested+0x89/0xa0
[<ffffffffa05aa7f2>] ? xfs_ilock+0x122/0x250 [xfs]
[<ffffffffa05aa7f2>] xfs_ilock+0x122/0x250 [xfs]
[<ffffffffa05aa97f>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
[<ffffffffa05701d0>] xfs_attr_get+0x90/0xe0 [xfs]
[<ffffffffa0565e07>] xfs_xattr_get+0x37/0x50 [xfs]
[<ffffffff8124842f>] generic_getxattr+0x4f/0x70
[<ffffffff8133fd9e>] inode_doinit_with_dentry+0x1ae/0x650
[<ffffffff81340e0c>] selinux_d_instantiate+0x1c/0x20
[<ffffffff813351bb>] security_d_instantiate+0x1b/0x30
[<ffffffff81237db0>] d_instantiate+0x50/0x70
[<ffffffff81237e85>] d_tmpfile+0xb5/0xc0
[<ffffffffa05add02>] xfs_create_tmpfile+0x362/0x410 [xfs]
[<ffffffffa0559ac8>] xfs_vn_tmpfile+0x18/0x20 [xfs]
[<ffffffff81230388>] path_openat+0x228/0x6a0
[<ffffffff810230f9>] ? sched_clock+0x9/0x10
[<ffffffff8105a427>] ? kvm_clock_read+0x27/0x40
[<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
[<ffffffff8123101a>] do_filp_open+0x3a/0x90
[<ffffffff817845e7>] ? _raw_spin_unlock+0x27/0x40
[<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
[<ffffffff8121e3ce>] do_sys_open+0x12e/0x210
[<ffffffff8121e4ce>] SyS_open+0x1e/0x20
[<ffffffff8178eda9>] system_call_fastpath+0x16/0x1b
xfs_vn_tmpfile() also fails to initialize security on the newly created
inode.
Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
has been committed and the inode unlocked. Also, initialize security on
the inode based on the parent directory provided via the tmpfile call.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When testing exhaustion of dm snapshots, the following appeared
with CONFIG_DEBUG_OBJECTS_FREE enabled:
ODEBUG: free active (active state 0) object type: work_struct hint: xfs_buf_iodone_work+0x0/0x1d0 [xfs]
indicating that we'd freed a buffer which still had a pending reference,
down this path:
[ 190.867975] [<ffffffff8133e6fb>] debug_check_no_obj_freed+0x22b/0x270
[ 190.880820] [<ffffffff811da1d0>] kmem_cache_free+0xd0/0x370
[ 190.892615] [<ffffffffa02c5924>] xfs_buf_free+0xe4/0x210 [xfs]
[ 190.905629] [<ffffffffa02c6167>] xfs_buf_rele+0xe7/0x270 [xfs]
[ 190.911770] [<ffffffffa034c826>] xfs_trans_read_buf_map+0x7b6/0xac0 [xfs]
At issue is the fact that if IO fails in xfs_buf_iorequest,
we'll queue completion unconditionally, and then call
xfs_buf_rele; but if IO failed, there are no IOs remaining,
and xfs_buf_rele will free the bp while work is still queued.
Fix this by not scheduling completion if the buffer has
an error on it; run it immediately. The rest is only comment
changes.
Thanks to dchinner for spotting the root cause.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We negate the error value being returned from a generic function
incorrectly. The code path that it is running in returned negative
errors, so there is no need to negate it to get the correct error
signs here.
This was uncovered by generic/019.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
And interesting situation can occur if a log IO error occurs during
the unmount of a filesystem. The cases reported have the same
signature - the update of the superblock counters fails due to a log
write IO error:
XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa08a44a1
XFS (dm-16): Log I/O Error Detected. Shutting down filesystem
XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount.
XFS (dm-16): xfs_log_force: error 5 returned.
XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s)
It can be seen that the last line of output contains a corrupt
device name - this is because the log and xfs_mount structures have
already been freed by the time this message is printed. A kernel
oops closely follows.
The issue is that the shutdown is occurring in a separate IO
completion thread to the unmount. Once the shutdown processing has
started and all the iclogs are marked with XLOG_STATE_IOERROR, the
log shutdown code wakes anyone waiting on a log force so they can
process the shutdown error. This wakes up the unmount code that
is doing a synchronous transaction to update the superblock
counters.
The unmount path now sees all the iclogs are marked with
XLOG_STATE_IOERROR and so never waits on them again, knowing that if
it does, there will not be a wakeup trigger for it and we will hang
the unmount if we do. Hence the unmount runs through all the
remaining code and frees all the filesystem structures while the
xlog_iodone() is still processing the shutdown. When the log
shutdown processing completes, xfs_do_force_shutdown() emits the
"Please umount the filesystem and rectify the problem(s)" message,
and xlog_iodone() then aborts all the objects attached to the iclog.
An iclog that has already been freed....
The real issue here is that there is no serialisation point between
the log IO and the unmount. We have serialisations points for log
writes, log forces, reservations, etc, but we don't actually have
any code that wakes for log IO to fully complete. We do that for all
other types of object, so why not iclogbufs?
Well, it turns out that we can easily do this. We've got xfs_buf
handles, and that's what everyone else uses for IO serialisation.
i.e. bp->b_sema. So, lets hold iclogbufs locked over IO, and only
release the lock in xlog_iodone() when we are finished with the
buffer. That way before we tear down the iclog, we can lock and
unlock the buffer to ensure IO completion has finished completely
before we tear it down.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Bob Mastors <bob.mastors@solidfire.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
FSX has been detecting data corruption after to collapse range
calls. The key observation is that the offset of the last extent in
the file was not being shifted, and hence when the file size was
adjusted it was truncating away data because the extents handled
been correctly shifted.
Tracing indicated that before the collapse, the extent list looked
like:
....
ino 0x5788 state idx 6 offset 26 block 195904 count 10 flag 0
ino 0x5788 state idx 7 offset 39 block 195917 count 35 flag 0
ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0
and after the shift of 2 blocks:
ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0
ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0
ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0
Note that the last extent did not change offset. After the changing
of the file size:
ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0
ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0
ino 0x5788 state idx 8 offset 86 block 195964 count 30 flag 0
You can see that the last extent had it's length truncated,
indicating that we've lost data.
The reason for this is that the xfs_bmap_shift_extents() loop uses
XFS_IFORK_NEXTENTS() to determine how many extents are in the inode.
This, unfortunately, doesn't take into account delayed allocation
extents - it's a count of physically allocated extents - and hence
when the file being collapsed has a delalloc extent like this one
does prior to the range being collapsed:
....
ino 0x5788 state idx 4 offset 11 block 4503599627239429 count 1 flag 0
....
it gets the count wrong and terminates the shift loop early.
Fix it by using the in-memory extent array size that includes
delayed allocation extents to determine the number of extents on the
inode.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Al Viro tracked down the problem that has caused generic/263 to fail
on XFS since the test was introduced. If is caused by
xfs_get_blocks() mapping a single extent that spans EOF without
marking it as buffer-new() so that the direct IO code does not zero
the tail of the block at the new EOF. This is a long standing bug
that has been around for many, many years.
Because xfs_get_blocks() starts the map before EOF, it can't set
buffer_new(), because that causes he direct IO code to also zero
unaligned sectors at the head of the IO. This would overwrite valid
data with zeros, and hence we cannot validly return a single extent
that spans EOF to direct IO.
Fix this by detecting a mapping that spans EOF and truncate it down
to EOF. This results in the the direct IO code doing the right thing
for unaligned data blocks before EOF, and then returning to get
another mapping for the region beyond EOF which XFS treats correctly
by setting buffer_new() on it. This makes direct Io behave correctly
w.r.t. tail block zeroing beyond EOF, and fsx is happy about that.
Again, thanks to Al Viro for finding what I couldn't.
[ dchinner: Fix for __divdi3 build error:
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
All device_schedule_callback_owner() users are converted to use
device_remove_file_self(). Remove now unused
{sysfs|device}_schedule_callback_owner().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_iattrs is allocated lazily when operations which require it
take place; unfortunately, the lazy allocation and returning weren't
properly synchronized and when there are multiple concurrent
operations, it might end up returning kernfs_iattrs which hasn't
finished initialization yet or different copies to different callers.
Fix it by synchronizing with a mutex. This can be smarter with memory
barriers but let's go there if it actually turns out to be necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/533ABA32.9080602@oracle.com
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # 3.14
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 9e30cc9595 removed an internal mount. This
has the side-effect that rootfs now has FSID 0. Many
userspace utilities assume that st_dev in struct stat
is never 0, so this change breaks a number of tools in
early userspace.
Since we don't know how many userspace programs are affected,
make sure that FSID is at least 1.
References: http://article.gmane.org/gmane.linux.kernel/1666905
References: http://permalink.gmane.org/gmane.linux.utilities.util-linux-ng/8557
Cc: 3.14 <stable@vger.kernel.org>
Signed-off-by: Thomas Bächler <thomas@archlinux.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Tested-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In SMB2_set_compression(), the "res_key" variable is only initialized to NULL
and later kfreed. It is therefore useless and should be removed.
Found with the following semantic patch:
<smpl>
@@
identifier foo;
identifier f;
type T;
@@
* f(...) {
...
* T *foo = NULL;
... when forall
when != foo
* kfree(foo);
...
}
</smpl>
Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
xfstest 020 detected a problem with cifs xattr handling. When a file
had an empty xattr list, we returned success (with an empty xattr value)
on query of particular xattrs rather than returning ENODATA.
This patch fixes it so that query of an xattr returns ENODATA when the
xattr list is empty for the file.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Problem reported in Red Hat bz 1040329 for strict writes where we cache
only when we hold oplock and write direct to the server when we don't.
When we receive an oplock break, we first change the oplock value for
the inode in cifsInodeInfo->oplock to indicate that we no longer hold
the oplock before we enqueue a task to flush changes to the backing
device. Once we have completed flushing the changes, we return the
oplock to the server.
There are 2 ways here where we can have data corruption
1) While we flush changes to the backing device as part of the oplock
break, we can have processes write to the file. These writes check for
the oplock, find none and attempt to write directly to the server.
These direct writes made while we are flushing from cache could be
overwritten by data being flushed from the cache causing data
corruption.
2) While a thread runs in cifs_strict_writev, the machine could receive
and process an oplock break after the thread has checked the oplock and
found that it allows us to cache and before we have made changes to the
cache. In that case, we end up with a dirty page in cache when we
shouldn't have any. This will be flushed later and will overwrite all
subsequent writes to the part of the file represented by this page.
Before making any writes to the server, we need to confirm that we are
not in the process of flushing data to the server and if we are, we
should wait until the process is complete before we attempt the write.
We should also wait for existing writes to complete before we process
an oplock break request which changes oplock values.
We add a version specific downgrade_oplock() operation to allow for
differences in the oplock values set for the different smb versions.
Cc: stable@vger.kernel.org
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
deletes aio context and all resources related to. It makes sense that
no IO operations connected to the context should be running after the context
is destroyed. As we removed io_context we have no chance to
get requests status or call io_getevents().
man page for io_destroy says that this function may block until
all context's requests are completed. Before kernel 3.11 io_destroy()
blocked indeed, but since aio refactoring in 3.11 it is not true anymore.
Here is a pseudo-code that shows a testcase for a race condition discovered
in 3.11:
initialize io_context
io_submit(read to buffer)
io_destroy()
// context is destroyed so we can free the resources
free(buffers);
// if the buffer is allocated by some other user he'll be surprised
// to learn that the buffer still filled by an outstanding operation
// from the destroyed io_context
The fix is straight-forward - add a completion struct and wait on it
in io_destroy, complete() should be called when number of in-fligh requests
reaches zero.
If two or more io_destroy() called for the same context simultaneously then
only the first one waits for IO completion, other calls behaviour is undefined.
Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and
do not see the race condition anymore.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
A fl->fl_break_time of 0 has a special meaning to the lease break code
that basically means "never break the lease". knfsd uses this to ensure
that leases don't disappear out from under it.
Unfortunately, the code in __break_lease can end up passing this value
to wait_event_interruptible as a timeout, which prevents it from going
to sleep at all. This makes __break_lease to spin in a tight loop and
causes soft lockups.
Fix this by ensuring that we pass a minimum value of 1 as a timeout
instead.
Cc: <stable@vger.kernel.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Reported-by: Terry Barnaby <terry1@beam.ltd.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Pointer 'newargs' is used after the memory that it points to has already
been freed.
Picked up by Coverity - CID 1201425.
Fixes: 0723a0473f ("btrfs: allow mounting btrfs subvolumes with
different ro/rw options")
Signed-off-by: Christoph Jaeger <christophjaeger@linux.com>
Signed-off-by: Chris Mason <clm@fb.com>
According to commit 5f16f3225b
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
When we are zeroing space andit is covered by a delalloc range, we
need to punch the delalloc range out before we truncate the page
cache. Failing to do so leaves and inconsistency between the page
cache and the extent tree, which we later trip over when doing
direct IO over the same range.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Similar to the write_begin problem, xfs-vm_write_end will truncate
back to the old EOF, potentially removing page cache from over the
top of delalloc blocks with valid data in them. Fix this by
truncating back to just the start of the failed write.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If we fail a write beyond EOF and have to handle it in
xfs_vm_write_begin(), we truncate the inode back to the current inode
size. This doesn't take into account the fact that we may have
already made successful writes to the same page (in the case of block
size < page size) and hence we can truncate the page cache away from
blocks with valid data in them. If these blocks are delayed
allocation blocks, we now have a mismatch between the page cache and
the extent tree, and this will trigger - at minimum - a delayed
block count mismatch assert when the inode is evicted from the cache.
We can also trip over it when block mapping for direct IO - this is
the most common symptom seen from fsx and fsstress when run from
xfstests.
Fix it by only truncating away the exact range we are updating state
for in this write_begin call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When a write fails, if we don't clear the delalloc flags from the
buffers over the failed range, they can persist beyond EOF and cause
problems. writeback will see the pages in the page cache, see they
are dirty and continually retry the write, assuming that the page
beyond EOF is just racing with a truncate. The page will eventually
be released due to some other operation (e.g. direct IO), and it
will not pass through invalidation because it is dirty. Hence it
will be released with buffer_delay set on it, and trigger warnings
in xfs_vm_releasepage() and assert fail in xfs_file_aio_write_direct
because invalidation failed and we didn't write the corect amount.
This causes failures on block size < page size filesystems in fsx
and fsstress workloads run by xfstests.
Fix it by completely trashing any state on the buffer that could be
used to imply that it contains valid data when the delalloc range
over the buffer is punched out during the failed write handling.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
On 32 bit, size_t is "unsigned int", not "unsigned long", causing the
following warning when comparing with PAGE_SIZE, which is always "unsigned
long":
fs/cifs/file.c: In function ‘cifs_readdata_to_iov’:
fs/cifs/file.c:2757: warning: comparison of distinct pointer types lacks a cast
Introduced by commit 7f25bba819 ("cifs_iovec_read: keep iov_iter
between the calls of cifs_readdata_to_iov()"), which changed the
signedness of "remaining" and the code from min_t() to min().
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Where are some places where logic guaranties us that extent we are
searching exits, but this may not be true due to on-disk data
corruption. If such corruption happens we must prevent possible
null pointer dereferences.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Unfortunately, we weren't checking to make sure of this the inode was
extent-based before attempt operate on it. Hilarity ensues.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Pull yet more networking updates from David Miller:
1) Various fixes to the new Redpine Signals wireless driver, from
Fariya Fatima.
2) L2TP PPP connect code takes PMTU from the wrong socket, fix from
Dmitry Petukhov.
3) UFO and TSO packets differ in whether they include the protocol
header in gso_size, account for that in skb_gso_transport_seglen().
From Florian Westphal.
4) If VLAN untagging fails, we double free the SKB in the bridging
output path. From Toshiaki Makita.
5) Several call sites of sk->sk_data_ready() were referencing an SKB
just added to the socket receive queue in order to calculate the
second argument via skb->len. This is dangerous because the moment
the skb is added to the receive queue it can be consumed in another
context and freed up.
It turns out also that none of the sk->sk_data_ready()
implementations even care about this second argument.
So just kill it off and thus fix all these use-after-free bugs as a
side effect.
6) Fix inverted test in tcp_v6_send_response(), from Lorenzo Colitti.
7) pktgen needs to do locking properly for LLTX devices, from Daniel
Borkmann.
8) xen-netfront driver initializes TX array entries in RX loop :-) From
Vincenzo Maffione.
9) After refactoring, some tunnel drivers allow a tunnel to be
configured on top itself. Fix from Nicolas Dichtel.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
vti: don't allow to add the same tunnel twice
gre: don't allow to add the same tunnel twice
drivers: net: xen-netfront: fix array initialization bug
pktgen: be friendly to LLTX devices
r8152: check RTL8152_UNPLUG
net: sun4i-emac: add promiscuous support
net/apne: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
net: ipv6: Fix oif in TCP SYN+ACK route lookup.
drivers: net: cpsw: enable interrupts after napi enable and clearing previous interrupts
drivers: net: cpsw: discard all packets received when interface is down
net: Fix use after free by removing length arg from sk_data_ready callbacks.
Drivers: net: hyperv: Address UDP checksum issues
Drivers: net: hyperv: Negotiate suitable ndis version for offload support
Drivers: net: hyperv: Allocate memory for all possible per-pecket information
bridge: Fix double free and memory leak around br_allowed_ingress
bonding: Remove debug_fs files when module init fails
i40evf: program RSS LUT correctly
i40evf: remove open-coded skb_cow_head
ixgb: remove open-coded skb_cow_head
igbvf: remove open-coded skb_cow_head
...
The vfs merge caused a latent bug to show up:
In file included from fs/ceph/super.h:4:0,
from fs/ceph/ioctl.c:3:
include/linux/ceph/ceph_debug.h:4:0: warning: "pr_fmt" redefined [enabled by default]
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
^
In file included from include/linux/kernel.h:13:0,
from include/linux/uio.h:12,
from include/linux/socket.h:7,
from include/uapi/linux/in.h:22,
from include/linux/in.h:23,
from fs/ceph/ioctl.c:1:
include/linux/printk.h:214:0: note: this is the location of the previous definition
#define pr_fmt(fmt) fmt
^
where the reason is that <linux/ceph_debug.h> is included much too late
for the "pr_fmt()" define.
The include of <linux/ceph_debug.h> needs to be the first include in the
file, but fs/ceph/ioctl.c had for some reason missed that, and it wasn't
noticeable until some unrelated header file changes brought in an
indirect earlier include of <linux/kernel.h>.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.
Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.
This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...
Pull audit updates from Eric Paris.
* git://git.infradead.org/users/eparis/audit: (28 commits)
AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
audit: do not cast audit_rule_data pointers pointlesly
AUDIT: Allow login in non-init namespaces
audit: define audit_is_compat in kernel internal header
kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
sched: declare pid_alive as inline
audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
syscall_get_arch: remove useless function arguments
audit: remove stray newline from audit_log_execve_info() audit_panic() call
audit: remove stray newlines from audit_log_lost messages
audit: include subject in login records
audit: remove superfluous new- prefix in AUDIT_LOGIN messages
audit: allow user processes to log from another PID namespace
audit: anchor all pid references in the initial pid namespace
audit: convert PPIDs to the inital PID namespace.
pid: get pid_t ppid of task in init_pid_ns
audit: rename the misleading audit_get_context() to audit_take_context()
audit: Add generic compat syscall support
audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
...
This commit tries to fix some byte order issues that is found by sparse
check.
$ make M=fs/ext4 C=2 CF=-D__CHECK_ENDIAN__
...
CHECK fs/ext4/extents.c
fs/ext4/extents.c:5232:41: warning: restricted __le32 degrades to integer
fs/ext4/extents.c:5236:52: warning: bad assignment (-=) to restricted __le32
fs/ext4/extents.c:5258:45: warning: bad assignment (-=) to restricted __le32
fs/ext4/extents.c:5303:28: warning: restricted __le32 degrades to integer
fs/ext4/extents.c:5318:18: warning: incorrect type in assignment (different base types)
fs/ext4/extents.c:5318:18: expected unsigned int [unsigned] [usertype] ex_start
fs/ext4/extents.c:5318:18: got restricted __le32 [usertype] ee_block
fs/ext4/extents.c:5319:24: warning: restricted __le32 degrades to integer
fs/ext4/extents.c:5334:31: warning: incorrect type in assignment (different base types)
...
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently some file system have IS_SWAPFILE check in their fallocate
implementations and some do not. However we should really prevent any
fallocate operation on swapfile so move the check to vfs and remove the
redundant checks from the file systems fallocate implementations.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently in do_fallocate in collapse range case we're checking
whether offset + len is not bigger than i_size. However there is
nothing which would prevent i_size from changing so the check is
pointless. It should be done in the file system itself and the file
system needs to make sure that i_size is not going to change. The
i_size check for the other fallocate modes are also done in the
filesystems.
As it is now we can easily crash the kernel by having two processes
doing truncate and fallocate collapse range at the same time. This
can be reproduced on ext4 and it is theoretically possible on xfs even
though I was not able to trigger it with this simple test.
This commit removes the check from do_fallocate and adds it to the
file system.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently punch hole and collapse range fallocate operation are not
allowed on append only file. This should be case for zero range as well.
Fix it by allowing only pure fallocate (possibly with keep size set).
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
All the checks IS_APPEND and IS_IMMUTABLE for the fallocate operation on
the inode are done in vfs. No need to do this again in ext4. Remove it.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
O_APPEND handling there hadn't been completely fixed by Pavel's
patch; it checks the right value, but it's racy - we can't really
do that until i_mutex has been taken.
Fix by switching to __generic_file_aio_write() (open-coding
generic_file_aio_write(), actually) and pulling mutex_lock() above
inode_size_read().
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull aio ctx->ring_pages migration serialization fix from Ben LaHaise.
* git://git.kvack.org/~bcrl/aio-next:
aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration
Pull second set of btrfs updates from Chris Mason:
"The most important changes here are from Josef, fixing a btrfs
regression in 3.14 that can cause corruptions in the extent allocation
tree when snapshots are in use.
Josef also fixed some deadlocks in send/recv and other assorted races
when balance is running"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
Btrfs: fix compile warnings on on avr32 platform
btrfs: allow mounting btrfs subvolumes with different ro/rw options
btrfs: export global block reserve size as space_info
btrfs: fix crash in remount(thread_pool=) case
Btrfs: abort the transaction when we don't find our extent ref
Btrfs: fix EINVAL checks in btrfs_clone
Btrfs: fix unlock in __start_delalloc_inodes()
Btrfs: scrub raid56 stripes in the right way
Btrfs: don't compress for a small write
Btrfs: more efficient io tree navigation on wait_extent_bit
Btrfs: send, build path string only once in send_hole
btrfs: filter invalid arg for btrfs resize
Btrfs: send, fix data corruption due to incorrect hole detection
Btrfs: kmalloc() doesn't return an ERR_PTR
Btrfs: fix snapshot vs nocow writting
btrfs: Change the expanding write sequence to fix snapshot related bug.
btrfs: make device scan less noisy
btrfs: fix lockdep warning with reclaim lock inversion
Btrfs: hold the commit_root_sem when getting the commit root during send
Btrfs: remove transaction from send
...
Several spots in the kernel perform a sequence like:
skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);
But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.
Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.
And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.
So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.
Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.
Signed-off-by: David S. Miller <davem@davemloft.net>
The function ext4_update_i_disksize() is used in only one place, in
the function mpage_map_and_submit_extent(). Move its code to simplify
the code paths, and also move the call to ext4_mark_inode_dirty() into
the i_data_sem's critical region, to be consistent with all of the
other places where we update i_disksize. That way, we also keep the
raw_inode's i_disksize protected, to avoid the following race:
CPU #1 CPU #2
down_write(&i_data_sem)
Modify i_disk_size
up_write(&i_data_sem)
down_write(&i_data_sem)
Modify i_disk_size
Copy i_disk_size to on-disk inode
up_write(&i_data_sem)
Copy i_disk_size to on-disk inode
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
fs/btrfs/scrub.c: In function 'get_raid56_logic_offset':
fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast
fs/btrfs/scrub.c:2269: warning: right shift count >= width of type
fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type
Since @rot is an int type, we should not use do_div(), fix it.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Return ENOMEM rather than EIO when find_get_page() fails in
ext4_mb_get_buddy_page_lock() and find_or_create_page() fails in
ext4_mb_load_buddy().
Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When mounting ext4 with data=journal option, xfstest shared/002 and
shared/004 are currently failing as checksum computed for testfile
does not match with the checksum computed in other journal modes.
In case of data=journal mode, a call to filemap_write_and_wait_range
will not flush anything to disk as buffers are not marked dirty in
write_end. In collapse range this call is followed by a call to
truncate_pagecache_range. Due to this, when checksum is computed,
a portion of file is re-read from disk which replace valid data with
NULL bytes and hence the reason for the difference in checksum.
Calling ext4_force_commit before filemap_write_and_wait_range solves
the issue as it will mark the buffers dirty during commit transaction
which can be later synced by a call to filemap_write_and_wait_range.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Pull exofs updates from Boaz Harrosh:
"Trivial updates to exofs for 3.15-rc1
Just a few fixes sent by people"
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
MAINTAINERS: Update email address for bhalevy
fs: Mark functions as static in exofs/ore_raid.c
fs: Mark function as static in exofs/super.c
Given the following /etc/fstab entries:
/dev/sda3 /mnt/foo btrfs subvol=foo,ro 0 0
/dev/sda3 /mnt/bar btrfs subvol=bar,rw 0 0
you can't issue:
$ mount /mnt/foo
$ mount /mnt/bar
You would have to do:
$ mount /mnt/foo
$ mount -o remount,rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo
$ mount /mnt/bar
or
$ mount /mnt/bar
$ mount --rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo
With this patch you can do
$ mount /mnt/foo
$ mount /mnt/bar
$ cat /proc/self/mountinfo
49 33 0:41 /foo /mnt/foo ro,relatime shared:36 - btrfs /dev/sda3 rw,ssd,space_cache
87 33 0:41 /bar /mnt/bar rw,relatime shared:74 - btrfs /dev/sda3 rw,ssd,space_cache
Signed-off-by: Chris Mason <clm@fb.com>
Pull block layer fixes from Jens Axboe:
"A small collection of fixes that should go in before -rc1. The pull
request contains:
- A two patch fix for a regression with block enabled tagging caused
by a commit in the initial pull request. One patch is from Martin
and ensures that SCSI doesn't truncate 64-bit block flags, the
other one is from me and prevents us from double using struct
request queuelist for both completion and busy tags. This caused
anything from a boot crash for some, to crashes under load.
- A blk-mq fix for a potential soft stall when hot unplugging CPUs
with busy IO.
- percpu_counter fix is listed in here, that caused a suspend issue
with virtio-blk due to percpu counters having an inconsistent state
during CPU removal. Andrew sent this in separately a few days ago,
but it's here. JFYI.
- A few fixes for block integrity from Martin.
- A ratelimit fix for loop from Mike Galbraith, to avoid spewing too
much in error cases"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix regression with block enabled tagging
scsi: Make sure cmd_flags are 64-bit
block: Ensure we only enable integrity metadata for reads and writes
block: Fix integrity verification
block: Fix for_each_bvec()
drivers/block/loop.c: ratelimit error messages
blk-mq: fix potential stall during CPU unplug with IO pending
percpu_counter: fix bad counter state during suspend
We'd occasionally attempt to generate protection information for flushes
and other requests with a zero payload. Make sure we only attempt to
enable integrity for reads and writes.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit bf36f9cfa6 caused a regression by effectively reverting Nic's
fix from 5837c80e87 that ensures we traverse the full bio_vec list
upon completion.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull nfsd updates from Bruce Fields:
"Highlights:
- server-side nfs/rdma fixes from Jeff Layton and Tom Tucker
- xdr fixes (a larger xdr rewrite has been posted but I decided it
would be better to queue it up for 3.16).
- miscellaneous fixes and cleanup from all over (thanks especially to
Kinglong Mee)"
* 'for-3.15' of git://linux-nfs.org/~bfields/linux: (36 commits)
nfsd4: don't create unnecessary mask acl
nfsd: revert v2 half of "nfsd: don't return high mode bits"
nfsd4: fix memory leak in nfsd4_encode_fattr()
nfsd: check passed socket's net matches NFSd superblock's one
SUNRPC: Clear xpt_bc_xprt if xs_setup_bc_tcp failed
NFSD/SUNRPC: Check rpc_xprt out of xs_setup_bc_tcp
SUNRPC: New helper for creating client with rpc_xprt
NFSD: Free backchannel xprt in bc_destroy
NFSD: Clear wcc data between compound ops
nfsd: Don't return NFS4ERR_STALE_STATEID for NFSv4.1+
nfsd4: fix nfs4err_resource in 4.1 case
nfsd4: fix setclientid encode size
nfsd4: remove redundant check from nfsd4_check_resp_size
nfsd4: use more generous NFS4_ACL_MAX
nfsd4: minor nfsd4_replay_cache_entry cleanup
nfsd4: nfsd4_replay_cache_entry should be static
nfsd4: update comments with obsolete function name
rpc: Allow xdr_buf_subsegment to operate in-place
NFSD: Using free_conn free connection
SUNRPC: fix memory leak of peer addresses in XPRT
...
My static checker suggests adding curly braces here. Probably that was
the intent, but actually the code works the same either way. I've just
changed the indenting and left the code as-is.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Acked-by: Dave Chiluk <chiluk@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conversions to ncp_dbg showed some format/argument mismatches so fix
them.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uses are gone, remove the macro.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a more current logging style.
Convert the paranoia debug statement to vdbg.
Remove the embedded function names as dynamic_debug can do that.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a more current logging style and enable use of dynamic debugging.
Remove embedded function names, dynamic debug can add this instead.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert to a more current logging style.
Add pr_fmt to prefix with "ncpfs: ".
Remove the embedded function names and use "%s: ", __func__
Some previously unprefixed messages now have "ncpfs: "
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There wasn't any check of the size passed from userspace before trying
to allocate the memory required.
This meant that userspace might request more space than allowed,
triggering an OOM.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The special handling of PF_MEMALLOC callers in ext4_write_inode()
shouldn't be necessary as there shouldn't be any. Warn about it. Also
update comment before the function as it seems somewhat outdated.
(Changes modeled on an ext3 patch posted by Jan Kara to the linux-ext4
mailing list on Februaryt 28, 2014, which apparently never went into
the ext3 tree.)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Pull ext3 improvements, cleanups, reiserfs fix from Jan Kara:
"various cleanups for ext2, ext3, udf, isofs, a documentation update
for quota, and a fix of a race in reiserfs readdir implementation"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: fix race in readdir
ext2: acl: remove unneeded include of linux/capability.h
ext3: explicitly remove inode from orphan list after failed direct io
fs/isofs/inode.c add __init to init_inodecache()
ext3: Speedup WB_SYNC_ALL pass
fs/quota/Kconfig: Update filesystems
ext3: Update outdated comment before ext3_ordered_writepage()
ext3: Update PF_MEMALLOC handling in ext3_write_inode()
ext2/3: use prandom_u32() instead of get_random_bytes()
ext3: remove an unneeded check in ext3_new_blocks()
ext3: remove unneeded check in ext3_ordered_writepage()
fs: Mark function as static in ext3/xattr_security.c
fs: Mark function as static in ext3/dir.c
fs: Mark function as static in ext2/xattr_security.c
ext3: Add __init macro to init_inodecache
ext2: Add __init macro to init_inodecache
udf: Add __init macro to init_inodecache
fs: udf: parse_options: blocksize check
Merge second patch-bomb from Andrew Morton:
- the rest of MM
- zram updates
- zswap updates
- exit
- procfs
- exec
- wait
- crash dump
- lib/idr
- rapidio
- adfs, affs, bfs, ufs
- cris
- Kconfig things
- initramfs
- small amount of IPC material
- percpu enhancements
- early ioremap support
- various other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (156 commits)
MAINTAINERS: update Intel C600 SAS driver maintainers
fs/ufs: remove unused ufs_super_block_third pointer
fs/ufs: remove unused ufs_super_block_second pointer
fs/ufs: remove unused ufs_super_block_first pointer
fs/ufs/super.c: add __init to init_inodecache()
doc/kernel-parameters.txt: add early_ioremap_debug
arm64: add early_ioremap support
arm64: initialize pgprot info earlier in boot
x86: use generic early_ioremap
mm: create generic early_ioremap() support
x86/mm: sparse warning fix for early_memremap
lglock: map to spinlock when !CONFIG_SMP
percpu: add preemption checks to __this_cpu ops
vmstat: use raw_cpu_ops to avoid false positives on preemption checks
slub: use raw_cpu_inc for incrementing statistics
net: replace __this_cpu_inc in route.c with raw_cpu_inc
modules: use raw_cpu_write for initialization of per cpu refcount.
mm: use raw_cpu ops for determining current NUMA node
percpu: add raw_cpu_ops
slub: fix leak of 'name' in sysfs_slab_add
...
Pointer 'usb3' to struct ufs_super_block_third acquired via
ubh_get_usb_third() is never used in function
ufs_read_cylinder_structures(). Thus remove it.
Detected by Coverity: CID 139939.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pointer 'usb2' to struct ufs_super_block_second acquired via
ubh_get_usb_second() is never used in function ufs_statfs(). Thus
remove it.
Detected by Coverity: CID 139940.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove occurences of unused pointers to struct ufs_super_block_first
that were acquired via ubh_get_usb_first().
Detected by Coverity: CID 139929 - CID 139936, CID 139940.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
init_inodecache is only called by __init init_ufs_fs.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/self/make-it-fail is a boolean, but accepts any number, including
negative ones. Change variable to unsigned, and cap upper bound at 1.
[akpm@linux-foundation.org: don't make make_it_fail unsigned]
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
init_inodecache is only called by __init init_bfs_fs
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Normal behavior for filenames exceeding specific filesystem limits is to
refuse operation.
AFFS standard name length being only 30 characters against 255 for usual
Linux filesystems, original implementation does filename truncate by
default with a define value AFFS_NO_TRUNCATE which can be enabled but
needs module compilation.
This patch adds 'nofilenametruncate' mount option so that user can
easily activate that feature and avoid a lot of problems (eg overwrite
files ...)
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0edf977d2a ("[readdir] convert affs") returns directly -EIO
without unlocking dir inode and releasing dir bh when second affs_bread
sequence fails. This patch restores initial behaviour. It also fixes
pr_debug and affs_error to fit in 80 columns + removes reference to
filldir (replaced by dir_emit in the commit above).
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
init_inodecache is only called by __init init_affs_fs
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
init_inodecache is only called by __init init_adfs_fs.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently when an empty PT_NOTE is detected, vmcore initialization
fails. It sounds too harsh. Because PT_NOTE could be empty, for
example, one offlined a cpu but never restarted kdump service, and after
crash, PT_NOTE program header is there but no data contains. It's
better to warn about the empty PT_NOTE and continue to initialise
vmcore.
And ultimately the multiple PT_NOTE are merged into a single one, all
empty PT_NOTE are discarded naturally during the merge. So empty
PT_NOTE is not visible to user space and vmcore is as good as expected.
Signed-off-by: WANG Chao <chaowang@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Greg Pearson <greg.pearson@hp.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_task_state() uses the most significant bit to report the state to
user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
can be noticed via /proc as Z -> X -> Z change. Note that this was
possible even before EXIT_TRACE was introduced.
This is not really bad but imho it make sense to hide EXIT_TRACE from
user-space completely. So the patch simply swaps EXIT_ZOMBIE and
EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Lennart Poettering <lpoetter@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Starting from commit c4ad8f98be ("execve: use 'struct filename *' for
executable name passing") bprm->filename can not go away after
flush_old_exec(), so we do not need to save the binary name in
bprm->tcomm[] added by 96e02d1586 ("exec: fix use-after-free bug in
setup_new_exec()").
And there was never need for filename_to_taskname-like code, we can
simply do set_task_comm(kbasename(filename).
This patch has to change set_task_comm() and trace_task_rename() to
accept "const char *", but I think this change is also good.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/*/pagemap contain sensitive information and currently its mode
is 0444. Change this to 0400, so the VFS will prevent unprivileged
processes from getting file descriptors on arbitrary privileged
/proc/*/pagemap files.
This reduces the scope of address space leaking and bypasses by protecting
already running processes.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These procfs files contain sensitive information and currently their
mode is 0444. Change this to 0400, so the VFS will be able to block
unprivileged processes from getting file descriptors on arbitrary
privileged /proc/*/{stack,syscall,personality} files.
This reduces the scope of ASLR leaking and bypasses by protecting already
running processes.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)
The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure. And in the
case of the NULL pointer, there is no structure to initialize. So,
rcu_assign_pointer(p, NULL) can be safely converted to
RCU_INIT_POINTER(p, NULL)
Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we don't have a way how to determing from which mount point
file has been opened. This information is required for proper dumping
and restoring file descriptos due to presence of mount namespaces. It's
possible, that two file descriptors are opened using the same paths, but
one fd references mount point from one namespace while the other fd --
from other namespace.
$ ls -l /proc/1/fd/1
lrwx------ 1 root root 64 Mar 19 23:54 /proc/1/fd/1 -> /dev/null
$ cat /proc/1/fdinfo/1
pos: 0
flags: 0100002
mnt_id: 16
$ cat /proc/1/mountinfo | grep ^16
16 32 0:4 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,size=1013356k,nr_inodes=253339,mode=755
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Rob Landley <rob@landley.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It should read "reclaimable slab" and not "reclaimable swap".
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.
We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.
The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:
1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+
2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+
3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+
4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+
[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.
It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
load_elf_binary() sets current->mm->def_flags = def_flags and def_flags
is always zero. Not only this looks strange, this is unnecessary
because mm_init() has already set ->def_flags = 0.
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Convert spinlock/static array to va_format (inspired by Joe Perches
help on previous logging patches).
- Convert printk(KERN_ERR to pr_warn in __ntfs_warning.
- Convert printk(KERN_ERR to pr_err in __ntfs_error.
- Convert printk(KERN_DEBUG to pr_debug in __ntfs_debug. (Note that
__ntfs_debug is still guarded by #if DEBUG)
- Improve !DEBUG to parse all arguments (Joe Perches).
- Sparse pr_foo() conversions in super.c
NTFS, NTFS-fs prefixes as well as 'warning' and 'error' were removed :
pr_foo() automatically adds module name and error level is already
specified.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph updates from Sage Weil:
"The biggest chunk is a series of patches from Ilya that add support
for new Ceph osd and crush map features, including some new tunables,
primary affinity, and the new encoding that is needed for erasure
coding support. This brings things into parity with the server side
and the looming firefly release. There is also support for allocation
hints in RBD that help limit fragmentation on the server side.
There is also a series of patches from Zheng fixing NFS reexport,
directory fragmentation support, flock vs fnctl behavior, and some
issues with clustered MDS.
Finally, there are some miscellaneous fixes from Yunchuan Wen for
fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
behavior"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
ceph: skip invalid dentry during dcache readdir
libceph: dump pool {read,write}_tier to debugfs
libceph: output primary affinity values on osdmap updates
ceph: flush cap release queue when trimming session caps
ceph: don't grabs open file reference for aborted request
ceph: drop extra open file reference in ceph_atomic_open()
ceph: preallocate buffer for readdir reply
libceph: enable PRIMARY_AFFINITY feature bit
libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
libceph: add support for osd primary affinity
libceph: add support for primary_temp mappings
libceph: return primary from ceph_calc_pg_acting()
libceph: switch ceph_calc_pg_acting() to new helpers
libceph: introduce apply_temps() helper
libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
libceph: ceph_can_shift_osds(pool) and pool type defines
libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
libceph: enable OSDMAP_ENC feature bit
libceph: primary_affinity decode bits
libceph: primary_affinity infrastructure
...
This patch-set includes the following major enhancement patches.
o introduce large directory support
o introduce f2fs_issue_flush to merge redundant flush commands
o merge write IOs as much as possible aligned to the segment
o add sysfs entries to tune the f2fs configuration
o use radix_tree for the free_nid_list to reduce in-memory operations
o remove costly bit operations in f2fs_find_entry
o enhance the readahead flow for CP/NAT/SIT/SSA blocks
The other bug fixes are as follows.
o recover xattr node blocks correctly after sudden-power-cut
o fix to calculate the maximum number of node ids
o enhance to handle many error cases
And, there are a bunch of cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJTQiQrAAoJEEAUqH6CSFDSlbIP/iq06BrUeMDLoQFhA2GQFKFD
wd0A5h9hCiFcKBcI/u/aAQqj/a5wdwzDl9XzH2PzJ45IM6sVGQZ0lv+kdLhab6rk
ipNbV7G0yLAX+8ygS6GZF7pSKfMzGSGTrRvfdtoiunIip1jCY1IkUxv1XMgBSPza
wnWYrE5HXEqRUDCqPXJyxrPmx0/0jw8/V82Ng9stnY34ySs+l/3Pvg65Kh0QuSSy
BRjJUGlOCF68KUBKd+6YB2T5KlbQde3/5lhP+GMOi+xm5sFB+j+59r/WpJpF2Nxs
ImxQs5GkiU01ErH/rn5FgHY/zzddQenBKwOvrjEeUA1eVpBurdsIr1JN0P6qDbgB
ho5U8LzCQq+HZiW444eQGkXSOagpUKqDhTVJO7Fji/wG88Atc9gLX3ix8TH2skxT
C5CvvrJM7DKBtkZyTzotKY/cWorOZhge6E/EkbGaM1sSHdK5b1Rg4YlFi9TDyz0n
QjGD1uuvEeukeKGdIG9pjc7o5ledbMDYwLpT2RuRXenLOTsn8BqDOo9aRTg+5Kag
tJNJLFumjPR2mEBNKjicJMUf381J/SKDwZszAz9mgvCZXldMza/Ax0LzJDJCVmkP
UuBiVzGxVzpd33IsESUDr0J9hc+t8kS10jfAeKnE3cpb6n7/RYxstHh6CHOFKNXM
gPUSYPN3CYiP47DnSfzA
=eSW+
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch-set includes the following major enhancement patches.
- introduce large directory support
- introduce f2fs_issue_flush to merge redundant flush commands
- merge write IOs as much as possible aligned to the segment
- add sysfs entries to tune the f2fs configuration
- use radix_tree for the free_nid_list to reduce in-memory operations
- remove costly bit operations in f2fs_find_entry
- enhance the readahead flow for CP/NAT/SIT/SSA blocks
The other bug fixes are as follows:
- recover xattr node blocks correctly after sudden-power-cut
- fix to calculate the maximum number of node ids
- enhance to handle many error cases
And, there are a bunch of cleanups"
* tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (62 commits)
f2fs: fix wrong statistics of inline data
f2fs: check the acl's validity before setting
f2fs: introduce f2fs_issue_flush to avoid redundant flush issue
f2fs: fix to cover io->bio with io_rwsem
f2fs: fix error path when fail to read inline data
f2fs: use list_for_each_entry{_safe} for simplyfying code
f2fs: avoid free slab cache under spinlock
f2fs: avoid unneeded lookup when xattr name length is too long
f2fs: avoid unnecessary bio submit when wait page writeback
f2fs: return -EIO when node id is not matched
f2fs: avoid RECLAIM_FS-ON-W warning
f2fs: skip unnecessary node writes during fsync
f2fs: introduce fi->i_sem to protect fi's info
f2fs: change reclaim rate in percentage
f2fs: add missing documentation for dir_level
f2fs: remove unnecessary threshold
f2fs: throttle the memory footprint with a sysfs entry
f2fs: avoid to drop nat entries due to the negative nr_shrink
f2fs: call f2fs_wait_on_page_writeback instead of native function
f2fs: introduce nr_pages_to_write for segment alignment
...
Introduce a block group type bit for a global reserve and fill the space
info for SPACE_INFO ioctl. This should replace the newly added ioctl
(01e219e806) to get just the 'size' part
of the global reserve, while the actual usage can be now visible in the
'btrfs fi df' output during ENOSPC stress.
The unpatched userspace tools will show the blockgroup as 'unknown'.
CC: Jeff Mahoney <jeffm@suse.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Reproducer:
mount /dev/ubda /mnt
mount -oremount,thread_pool=42 /mnt
Gives a crash:
? btrfs_workqueue_set_max+0x0/0x70
btrfs_resize_thread_pool+0xe3/0xf0
? sync_filesystem+0x0/0xc0
? btrfs_resize_thread_pool+0x0/0xf0
btrfs_remount+0x1d2/0x570
? kern_path+0x0/0x80
do_remount_sb+0xd9/0x1c0
do_mount+0x26a/0xbf0
? kfree+0x0/0x1b0
SyS_mount+0xc4/0x110
It's a call
btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size);
with
fs_info->scrub_wr_completion_workers = NULL;
as scrub wqs get created only on user's demand.
Patch skips not-created-yet workqueues.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
- A few SPI NOR ID definitions
- Kill the NAND "max pagesize" restriction
- Fix some x16 bus-width NAND support
- Add NAND JEDEC parameter page support
- DT bindings for NAND ECC
- GPMI NAND updates (subpage reads)
- More OMAP NAND refactoring
- New STMicro SPI NOR driver (now in 40 patches!)
- A few other random bugfixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTP6x+AAoJEFySrpd9RFgtit0P/jLWsjMK8G2ldPC4bMZsXmDF
n3c71GcCRlUq4Qzb4rtZx9DANLAh+JyRrMOKCPg6dAMegFdmUqDOpZpNp0vF57KG
myFjqTk+n5y0tfSkWLMUFt0tQ8ArDp3IBkQCUWkD5LgG50EWmjveIQGH0kFnkE39
Kytqvw17RV7f81tIs+WvKt8++YWD2X1VTpTi0S4fx2bJ99bJDBf/GgdoQpj2oirt
igXmloUFEsob9JHZ3qumcUm9vaHwv2TiouZTvRyGdJCCoPdpJEZO4Ka6e4uAVarT
6kMKXBk3lj2GsilOSFFCNetXfy5Bf0TkJkv4rDjh3R1Y4J/hSgraVCbWXdKhb6tj
RmwesdFMjsyS4f/Rhk5PXwJgGL9uK2mi6bk/SmXU0AMgCDSa5zjshY8Wq6C6uXwk
LqlnK8l3h8Txotbc/XJIL+QGMbMkYQI8gxWTHFaqzDtkMe36mnGs9Zec3oso/s2d
CNRpq5+dMZ6qF0z3zpOQHmFbaOekivMy7kCKMXer6ONsrQBNwTdmkwy+SdqnWsLF
YdJttwV/RRcE0SRvK6GrhvzkGlV83Z8RPny6hC1kbrgQ0ffoy2CmIqyWNObK1RXf
sYqoF8TCtR6Y8rHHi5dzZ1lria+nm8pb4+UfQLRI0mK8og7YW+fIfHXxqRrZZIrt
No8NCPBKWzyew2UE0AiQ
=CKih
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20140405' of git://git.infradead.org/linux-mtd
Pull MTD updates from Brian Norris:
- A few SPI NOR ID definitions
- Kill the NAND "max pagesize" restriction
- Fix some x16 bus-width NAND support
- Add NAND JEDEC parameter page support
- DT bindings for NAND ECC
- GPMI NAND updates (subpage reads)
- More OMAP NAND refactoring
- New STMicro SPI NOR driver (now in 40 patches!)
- A few other random bugfixes
* tag 'for-linus-20140405' of git://git.infradead.org/linux-mtd: (120 commits)
Fix index regression in nand_read_subpage
mtd: diskonchip: mem resource name is not optional
mtd: nand: fix mention to CONFIG_MTD_NAND_ECC_BCH
mtd: nand: fix GET/SET_FEATURES address on 16-bit devices
mtd: omap2: Use devm_ioremap_resource()
mtd: denali_dt: Use devm_ioremap_resource()
mtd: devices: elm: update DRIVER_NAME as "omap-elm"
mtd: devices: elm: configure parallel channels based on ecc_steps
mtd: devices: elm: clean elm_load_syndrome
mtd: devices: elm: check for hardware engine's design constraints
mtd: st_spi_fsm: Succinctly reorganise .remove()
mtd: st_spi_fsm: Allow loop to run at least once before giving up CPU
mtd: st_spi_fsm: Correct vendor name spelling issue - missing "M"
mtd: st_spi_fsm: Avoid duplicating MTD core code
mtd: st_spi_fsm: Remove useless consts from function arguments
mtd: st_spi_fsm: Convert ST SPI FSM (NOR) Flash driver to new DT partitions
mtd: st_spi_fsm: Move runtime configurable msg sequences into device's struct
mtd: st_spi_fsm: Supply the W25Qxxx chip specific configuration call-back
mtd: st_spi_fsm: Supply the S25FLxxx chip specific configuration call-back
mtd: st_spi_fsm: Supply the MX25xxx chip specific configuration call-back
...
I'm not sure why we weren't aborting here in the first place, it is obviously a
bad time from the fact that we print the leaf and yell loudly about it. Fix
this up, otherwise we panic because our path could be pointing into oblivion.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_drop_extents can now return -EINVAL, but only one caller
in btrfs_clone was checking for it. This adds it to the
caller for inline extents, which is where we really need it.
Signed-off-by: Chris Mason <clm@fb.com>
This patch fix a regression caused by the following patch:
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
break while loop will make us call @spin_unlock() without
calling @spin_lock() before, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
# mount /dev/sda8 /mnt
# btrfs scrub start -BR /mnt
# echo $? <--unverified errors make return value be 3
This is because we don't setup right mapping between physical
and logical address for raid56, which makes checksum mismatch.
But we will find everthing is fine later when rechecking using
btrfs_map_block().
This patch fixed the problem by settuping right mappings and
we only verify data stripes' checksums.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
To compress a small file range(<=blocksize) that is not
an inline extent can not save disk space at all. skip it can
save us some cpu time.
This patch can also fix wrong setting nocompression flag for
inode, say a case when @total_in is 4096, and then we get
@total_compressed 52,because we do aligment to page cache size
firstly, and then we get into conclusion @total_in=@total_compressed
thus we will clear this inode's compression flag.
An exception comes from inserting inline extent failure but we
still have @total_compressed < @total_in,so we will still reset
inode's flag, this is ok, because we don't have good compression
effect.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we don't reschedule use rb_next to find the next extent state
instead of a full tree search, which is more efficient and safe
since we didn't release the io tree's lock.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
There's no point building the path string in each iteration of the
send_hole loop, as it produces always the same string.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Originally following cmds will work:
# btrfs fi resize -10A <mnt>
# btrfs fi resize -10Gaha <mnt>
Filter the arg by checking the return pointer of memparse.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The error handling was copy and pasted from memdup_user(). It should be
checking for NULL obviously.
Fixes: abccd00f8a ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
While running fsstress and snapshots concurrently, we will hit something
like followings:
Thread 1 Thread 2
|->fallocate
|->write pages
|->join transaction
|->add ordered extent
|->end transaction
|->flushing data
|->creating pending snapshots
|->write data into src root's
fallocated space
After above work flows finished, we will get a state that source and
snapshot root share same space, but source root have written data into
fallocated space, this will make fsck fail to verify checksums for
snapshot root's preallocating file extent data.Nocow writting also
has this same problem.
Fix this problem by syncing snapshots with nocow writting:
1.for nocow writting,if there are pending snapshots, we will
fall into COW way.
2.if there are pending nocow writes, snapshots for this root
will be blocked until nocow writting finish.
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When testing fsstress with snapshot making background, some snapshot
following problem.
Snapshot 270:
inode 323: size 0
Snapshot 271:
inode 323: size 349145
|-------Hole---|---------Empty gap-------|-------Hole-----|
0 122880 172032 349145
Snapshot 272:
inode 323: size 349145
|-------Hole---|------------Data---------|-------Hole-----|
0 122880 172032 349145
The fsstress operation on inode 323 is the following:
write: offset 126832 len 43124
truncate: size 349145
Since the write with offset is consist of 2 operations:
1. punch hole
2. write data
Hole punching is faster than data write, so hole punching in write
and truncate is done first and then buffered write, so the snapshot 271 got
empty gap, which will not pass btrfsck.
To fix the bug, this patch will change the write sequence which will
first punch a hole covering the write end if a hole is needed.
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Print the message only when the device is seen for the first time.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
When encountering memory pressure, testers have run into the following
lockdep warning. It was caused by __link_block_group calling kobject_add
with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
which gets us into reclaim context. The kobject doesn't actually need
to be added under the lock -- it just needs to ensure that it's only
added for the first block group to be linked.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc8-default #1 Not tainted
---------------------------------------------------------
kswapd0/169 just changed the state of lock:
(&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
(&found->groups_sem){+++++.}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&found->groups_sem);
local_irq_disable();
lock(&delayed_node->mutex);
lock(&found->groups_sem);
<Interrupt>
lock(&delayed_node->mutex);
*** DEADLOCK ***
2 locks held by kswapd0/169:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
#1: (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We currently rely too heavily on roots being read-only to save us from just
accessing root->commit_root. We can easily balance blocks out from underneath a
read only root, so to save us from getting screwed make sure we only access
root->commit_root under the commit root sem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When heavily exercising xattr code the assertion that
jbd2_journal_dirty_metadata() shouldn't return error was triggered:
WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237
jbd2_journal_dirty_metadata+0x1ba/0x260()
CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G W 3.10.0-ceph-00049-g68d04c9 #1
Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968
ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80
0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978
Call Trace:
[<ffffffff816311b0>] dump_stack+0x19/0x1b
[<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0
[<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260
[<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140
[<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0
[<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910
[<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0
[<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140
[<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50
[<ffffffff811935ce>] generic_setxattr+0x6e/0x90
[<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0
[<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0
[<ffffffff8119421e>] setxattr+0x13e/0x1e0
[<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0
[<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
[<ffffffff8118c65c>] ? fget_light+0x3c/0x130
[<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
[<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70
[<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100
[<ffffffff816407c2>] system_call_fastpath+0x16/0x1b
The reason for the warning is that buffer_head passed into
jbd2_journal_dirty_metadata() didn't have journal_head attached. This is
caused by the following race of two ext4_xattr_release_block() calls:
CPU1 CPU2
ext4_xattr_release_block() ext4_xattr_release_block()
lock_buffer(bh);
/* False */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
} else {
le32_add_cpu(&BHDR(bh)->h_refcount, -1);
unlock_buffer(bh);
lock_buffer(bh);
/* True */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
get_bh(bh);
ext4_free_blocks()
...
jbd2_journal_forget()
jbd2_journal_unfile_buffer()
-> JH is gone
error = ext4_handle_dirty_xattr_block(handle, inode, bh);
-> triggers the warning
We fix the problem by moving ext4_handle_dirty_xattr_block() under the
buffer lock. Sadly this cannot be done in nojournal mode as that
function can call sync_dirty_buffer() which would deadlock. Luckily in
nojournal mode the race is harmless (we only dirty already freed buffer)
and thus for nojournal mode we leave the dirtying outside of the buffer
lock.
Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
ext4_end_bio() currently throws away the error that it receives. Chances
are this is part of a spate of errors, one of which will end up getting
the error returned to userspace somehow, but we shouldn't take that risk.
Also print out the errno to aid in debug.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
With EXT4FS_DEBUG ext4_count_free_clusters() will call
ext4_read_block_bitmap() without s_group_info initialized, so we need to
initialize multi-block allocator before.
And dependencies that must be solved, to allow this:
- multi-block allocator needs in group descriptors
- need to install s_op before initializing multi-block allocator,
because in ext4_mb_init_backend() new inode is created.
- initialize number of group desc blocks (s_gdb_count) otherwise
number of clusters returned by ext4_free_clusters_after_init() is not correct.
(see ext4_bg_num_gdb_nometa())
Here is the stack backtrace:
(gdb) bt
#0 ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
#1 ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
#2 0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
block_group=block_group@entry=0,
bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
#3 0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
block_group=block_group@entry=0) at balloc.c:489
#4 0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
#5 0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
sb=0xffff880079a10000) at super.c:2143
#6 ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
...
Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we try to get 2^32-1 block of the file which has the extent
(ee_block=2^32-2, ee_len=1) with FIBMAP ioctl, it causes BUG_ON
in ext4_ext_put_gap_in_cache().
To avoid the problem, ext4_map_blocks() needs to check the file logical block
number. ext4_ext_put_gap_in_cache() called via ext4_map_blocks() cannot
handle 2^32-1 because the maximum file logical block number is 2^32-2.
Note that ext4_ind_map_blocks() returns -EIO when the block number is invalid.
So ext4_map_blocks() should also return the same errno.
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
'0x7FDEADBEEF' will be truncated to 32-bit number under unicore32. Need
append 'ULL' for it.
The related warning (with allmodconfig under unicore32):
CC [M] fs/ext4/extents_status.o
fs/ext4/extents_status.c: In function "__es_remove_extent":
fs/ext4/extents_status.c:813: warning: integer constant is too large for "long" type
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we remove a file that has inline data after mount, our statistics turns to
inaccurate.
cat /sys/kernel/debug/f2fs/status
- Inline_data Inode: 4294967295
Let's add stat_inc_inline_inode() to stat inline info of the file when lookup.
Change log from v1:
o stat in f2fs_lookup() instead of in do_read_inode() for excluding wrong stat.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Before setting the acl, call posix_acl_valid() to check if it is
valid or not.
Signed-off-by: zhangzhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Some storage devices show relatively high latencies to complete cache_flush
commands, even though their normal IO speed is prettry much high. In such
the case, it needs to merge cache_flush commands as much as possible to avoid
issuing them redundantly.
So, this patch introduces a mount option, "-o flush_merge", to mitigate such
the overhead.
If this option is enabled by user, F2FS merges the cache_flush commands and then
issues just one cache_flush on behalf of them. Once the single command is
finished, F2FS sends a completion signal to all the pending threads.
Note that, this option can be used under a workload consisting of very intensive
concurrent fsync calls, while the storage handles cache_flush commands slowly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Lets try this again. We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send. So fix this by making sure looking at the
commit roots is always going to be consistent. We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once. Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes. With this patch we no longer
deadlock and pass all the weird send/receive corner cases. Thanks,
Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel. This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed. Turns out this is because of a
bad interaction of balance and send/receive. Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening. So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid. But because we think it is invalid we clear uptodate and
re-read in the block. If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.
Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data. So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We could have possibly added an extent_op to the locked_ref while we dropped
locked_ref->lock, so check for this case as well and loop around. Otherwise we
could lose flag updates which would lead to extent tree corruption. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This was done to allow NO_COW to continue to be NO_COW after relocation but it
is not right. When relocating we will convert blocks to FULL_BACKREF that we
relocate. We can leave some of these full backref blocks behind if they are not
cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
then just drop the reloc tree. Then when we go to cow the block again we won't
lookup the extent flags because we won't think there has been a snapshot
recently which means we will do our normal ref drop thing instead of adding back
a tree ref and dropping the shared ref. This will cause btrfs_free_extent to
blow up because it can't find the ref we are trying to free. This was found
with my ref verifying tool. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Highlights include:
- Stable fix for a use after free issue in the NFSv4.1 open code
- Fix the SUNRPC bi-directional RPC code to account for TCP segmentation
- Optimise usage of readdirplus when confronted with 'ls -l' situations
- Soft mount bugfixes
- NFS over RDMA bugfixes
- NFSv4 close locking fixes
- Various NFSv4.x client state management optimisations
- Rename/unlink code cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTQBayAAoJEGcL54qWCgDyUzgQAKzSlbcksMQT55M/KZJXabNW
KSctJeDrkTkRxOXTNxuF9NbIgeqenLijCokXty6BIUgup0zkOPMzFfRfgdQvplnp
YEj4sOEXEZ8CX+PoUTYOEayzt0ssEAOyidumiM+Gx2LD/E1d2xyCL7YaAOjIhVQS
OnXcX1cZw+dZSUxC9vu5fVDjrphJTnp4CXdbvR5PiJiXeKqzZd9e5M3hXgpAQ/AS
mWjYeUvM9mwyz7UmbLKkWEmzB3tFlGdTzDPxLRrkfcOSKI2Ham0lL3/Uv50/nRTu
99ts6KH8KLGcUuL9vD9KRebht2f71usBrWAdvpy1cUcf1Fh6lmEg4ktGfkqldaUu
9kNu9d5DCxJoGc6R2UTw5FeyPwYuDWoBwEGy1DcguJ5CeQn2R2nH4ps/P3J3DX4d
DZsJqCY9idKZCQhtyR0iF9j3x2bNFoENaL6WHI6b0J+xjMedIbHgeUQzIQP0RLBJ
h0IcjK0D+e7WdyC7jk4Nm3krtms5SNUG5/N9OUO36a7v8735PJBcbcgm9hZJt8Fh
t/4vqUmKIBXHioHsMhaFslqTWlYIR9a3MYmN7QtHFYbqUfNxH69v9y3d6jb4Igck
kqoEiui5aJOCR76s7oVdHCcm+klBwEPiACT+H9CUMzSoKzHSWsBSNZbJR3BEia4M
7dwScS1OfI2KuutshGQA
=weNx
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
- Stable fix for a use after free issue in the NFSv4.1 open code
- Fix the SUNRPC bi-directional RPC code to account for TCP segmentation
- Optimise usage of readdirplus when confronted with 'ls -l' situations
- Soft mount bugfixes
- NFS over RDMA bugfixes
- NFSv4 close locking fixes
- Various NFSv4.x client state management optimisations
- Rename/unlink code cleanups"
* tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
nfs: pass string length to pr_notice message about readdir loops
NFSv4: Fix a use-after-free problem in open()
SUNRPC: rpc_restart_call/rpc_restart_call_prepare should clear task->tk_status
SUNRPC: Don't let rpc_delay() clobber non-timeout errors
SUNRPC: Ensure call_connect_status() deals correctly with SOFTCONN tasks
SUNRPC: Ensure call_status() deals correctly with SOFTCONN tasks
NFSv4: Ensure we respect soft mount timeouts during trunking discovery
NFSv4: Schedule recovery if nfs40_walk_client_list() is interrupted
NFS: advertise only supported callback netids
SUNRPC: remove KERN_INFO from dprintk() call sites
SUNRPC: Fix large reads on NFS/RDMA
NFS: Clean up: revert increase in READDIR RPC buffer max size
SUNRPC: Ensure that call_bind times out correctly
SUNRPC: Ensure that call_connect times out correctly
nfs: emit a fsnotify_nameremove call in sillyrename codepath
nfs: remove synchronous rename code
nfs: convert nfs_rename to use async_rename infrastructure
nfs: make nfs_async_rename non-static
nfs: abstract out code needed to complete a sillyrename
NFSv4: Clear the open state flags if the new stateid does not match
...
a staging driver; fix included. Greg KH said he'd take the patch
but hadn't as the merge window opened, so it's included here
to avoid breaking build.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTQMH9AAoJENkgDmzRrbjxo4UP/jwlenP44v+RFpo/dn8Z8E2n
SREQscU5ZZKvuyFD6kUdvOz8YC/nTrJvXoVkMUF05GVbuvb8/8UPtT9ECVemd0rW
xNy4aFfv9rbrqRLBLpLK9LAgTuhwlbTgGxgL78zRn3hWmf1hBZWCY+cEvKM8l/+9
oEQdORL0sUpZh7iryAeGqbOrXT4gqJEvSLOFwiYTSo6ryzWIilmdXSUAh6s8MIEX
PR1+oH9J8B6J29lcXKMf8/sDI1EBUeSLdBmMCuN5Y7xpYxsQLroVx94kPbdBY+XK
ZRoYuUGSUJfGRZY46cFKApIGeF07z1DGoyXghbSWEQrI+23TMUmrKUg47LSukE4Y
yCUf8HAtqIA3gVc9GKDdSp/2UpkAhTTv5ogKgnIzs1InWtOIBdDRSVUQXDosFEXw
6ZZe1pQs2zfXyXxO4j0Wq36K4RgI0aqOVw+dcC+w5BidjVylgnYRV0PSDd72tid7
bIfnjDbUBo+o4LanPNGYK474KyO7AslgTE50w6zwbJzgdwCQ36hCpKqScBZzm60a
42LrgTVoIHHWAL1tDzWL/LzWflZGdJAezzNje0/f2Q3bGMiNHWoljAvUphkTZ7qt
E8+jWqmM+riH3e8Y5wKpO1BKt7NGHISEy//bUlnqTwisjIzVILZ6VjfugQ1AI+0x
llTXPBotFvfvXqxunBg7
=yzUO
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
"Nothing major: the stricter permissions checking for sysfs broke a
staging driver; fix included. Greg KH said he'd take the patch but
hadn't as the merge window opened, so it's included here to avoid
breaking build"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
staging: fix up speakup kobject mode
Use 'E' instead of 'X' for unsigned module taint flag.
VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.
kallsyms: fix percpu vars on x86-64 with relocation.
kallsyms: generalize address range checking
module: LLVMLinux: Remove unused function warning from __param_check macro
Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
module: remove MODULE_GENERIC_TABLE
module: allow multiple calls to MODULE_DEVICE_TABLE() per module
module: use pr_cont
skip dentries that were added before MDS issued FILE_SHARED to
client.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
There is no guarantee that the strings in the nfs_cache_array will be
NULL-terminated. In the event that we end up hitting a readdir loop, we
need to ensure that we pass the warning message the length of the
string.
Reported-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
ceph_atomic_open() calls ceph_open() after receiving the MDS reply.
ceph_open() grabs an extra open file reference. (The open request
already holds an open file reference)
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Preallocate buffer for readdir reply. Limit number of entries in
readdir reply according to the buffer size.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
If buffer size is zero, return the size of layout vxattr. If buffer
size is not zero, check if it is large enough for layout vxattr.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
send_mds_reconnect() may call discard_cap_releases() after all
release messages have been dropped by cleanup_cap_releases()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When adjusting caps client wants, MDS does not record caps that are
not allowed. For non-auth MDS, it does not record WR caps. So when
a MDS reply changes a non-auth cap to auth cap, client needs to set
cap's mds_wanted according to the reply.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).
The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.
The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
VFS does not directly pass flock's operation code to filesystem's
flock callback. It translates the operation code to the form how
posix lock's parameters are presented.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
handle following sequence of events:
- client releases a inode with i_max_size > 0. The release message
is queued. (is not sent to the auth MDS)
- a 'lookup' request reply from non-auth MDS returns the same inode.
- client opens the inode in write mode. The version of inode trace
in 'open' request reply is equal to the cached inode's version.
- client requests new max size. The MDS ignores the request because
it does not affect client's write range
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Only auth MDS can issue write caps to clients, so don't consider
write caps registered with non-auth MDS as valid.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The main changes in the XFS tree for 3.15-rc1 are:
- O_TMPFILE support
- allowing AIO+DIO writes beyond EOF
- FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
implementation
- FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
implementation
- IO verifier cleanup and rework
- stack usage reduction changes
- vm_map_ram NOIO context fixes to remove lockdep warings
- various bug fixes and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJTPykMAAoJEK3oKUf0dfod/KoP/jKQwzQPdtT8EtAu5vENh9AO
55zwCDXXFjCNIGIFPkrUBQbbARVAqhLZn3vuLUUhqtRRELdgJy/yFKZ37MPd8bhU
dKetivEB192Jcd6Sn74vsOsNLm1u9mJqbQ1aothz0TiOrkkWFZlz4Otu36MZRHN3
9WgZXWSxr6I/hYHGyCorJWZ5ISs0XD3vR5dYXYeZChbTpTxlCT4X/YgUtW4WH/Tq
y4gG0fKfwr9KK07/LXuQgUuZGU8vwVuNNsXPhqh+FZ39SLD2Ea83h46Hzf/+vVNI
kCIyYN1y40uBWczmwAptVEnUwhpGK8PzNrhKwTZICDtuct9sikf7c+o0aEE9lcqo
8IBt0Dy4l7BQVFSZOjYo5Jw5a8jAbkh47zru31HxogEVqafdz80iWB12JagOOnXM
v/McvDvZMyfgGckih32FM4G7ElvTYgGai5/3dLhfMuhc4/DdwcBOF1yHmFmnjhWO
QRsQxLdefUtP3MfMYKaJHM6v2wE1S2l0owgp+HdPluNiOUmH/fqFq1WpHxqqeRPz
nuHF8oYlxaZP5WAarz6Yf1/twIeZJ1rTD8np8uocvMqQJzMYJgrQyH+xJqjJaITR
iveQcEoRB8D7/fXMGDdcjZYE2fG4l4JE2kuh97k5NZw76e3v2YXSGh0kd9WqR1uN
t07joLRQKR2pJuSmuD5E
=uSkJ
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs
Pull xfs update from Dave Chinner:
"There are a couple of new fallocate features in this request - it was
decided that it was easiest to push them through the XFS tree using
topic branches and have the ext4 support be based on those branches.
Hence you may see some overlap with the ext4 tree merge depending on
how they including those topic branches into their tree. Other than
that, there is O_TMPFILE support, some cleanups and bug fixes.
The main changes in the XFS tree for 3.15-rc1 are:
- O_TMPFILE support
- allowing AIO+DIO writes beyond EOF
- FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
implementation
- FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
implementation
- IO verifier cleanup and rework
- stack usage reduction changes
- vm_map_ram NOIO context fixes to remove lockdep warings
- various bug fixes and cleanups"
* tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits)
xfs: fix directory hash ordering bug
xfs: extra semi-colon breaks a condition
xfs: Add support for FALLOC_FL_ZERO_RANGE
fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
xfs: inode log reservations are still too small
xfs: xfs_check_page_type buffer checks need help
xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
xfs: use NOIO contexts for vm_map_ram
xfs: don't leak EFSBADCRC to userspace
xfs: fix directory inode iolock lockdep false positive
xfs: allocate xfs_da_args to reduce stack footprint
xfs: always do log forces via the workqueue
xfs: modify verifiers to differentiate CRC from other errors
xfs: print useful caller information in xfs_error_report
xfs: add xfs_verifier_error()
xfs: add helper for updating checksums on xfs_bufs
xfs: add helper for verifying checksums on xfs_bufs
xfs: Use defines for CRC offsets in all cases
xfs: skip pointless CRC updates after verifier failures
xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
...
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
+165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
DrPKlXwIgva7zq5/S0Vr
=4s1Z
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
ext4: fix premature freeing of partial clusters split across leaf blocks
ext4: remove unneeded test of ret variable
ext4: fix comment typo
ext4: make ext4_block_zero_page_range static
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
ext4: optimize Hurd tests when reading/writing inodes
ext4: kill i_version support for Hurd-castrated file systems
ext4: each filesystem creates and uses its own mb_cache
fs/mbcache.c: doucple the locking of local from global data
fs/mbcache.c: change block and index hash chain to hlist_bl_node
ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
ext4: refactor ext4_fallocate code
ext4: Update inode i_size after the preallocation
ext4: fix partial cluster handling for bigalloc file systems
ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
ext4: only call sync_filesystm() when remounting read-only
fs: push sync_filesystem() down to the file system's remount_fs()
jbd2: improve error messages for inconsistent journal heads
jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
jbd2: minimize region locked by j_list_lock in journal_get_create_access()
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTO0ytAAoJEKurIx+X31iBRlMP/33oGKncx/iQOi8MqNOvOzXG
o6y6jVXosn/0sbiSnqcDtne/VstVP4RbmLpwm22tSD5QALGVubdhK+ryJ9PGC7gZ
2wXUP75UN3BXTqzCSE3vXchQnNJooT2EjX16sZ/PvR4lO1QrWKlozuWvTR9Qhbdb
czWh/pWrlckhgl1CR01haAdlmFo3M2/6CvkpBW7hDiY8yvBl9g0MrFPHLQH6Y15p
kAIXQSYYVctT0xhOiMSodl6K0hpH4iRi8t2c5jUfRL6M56Ggfw1h7tVgc25ZU5NO
/aG7k3MtKosxd7yk+Nz2bArOki22w+jXFSP45hjnZwOP096m4z6jBRawjb9/nNla
WNfrRCTnBn8caBAasz+b3obAuX25w2YW86ITq+G54541llQCmj07KXBx8Z6TC0bI
MVFG9wFbCghd0dal4DWkQ+tOLyZFVWWwChNnZahiPONzSqSMMR+Xr6dNNwMjPMm/
HSM+PNUxvgLRuVvn3zLKaJp+2WugVxqBFyRdMeKRPsqzBqZ9HSkPfqkAuZg/7CZa
Yq1HBMR8ctWGbpTdxMZ1WEh2FTfuL8l/hr+Ns3Efe7+G2muXmCP2x1TvPOF2djlK
f7udwp62ZR6D5SZvWW9nUBZTmgCJ264ypk3hS4f4Sbu42bvttyHkYPKMDuGFioYR
PUXMWhcQoyB4PGh09NMr
=zOOz
-----END PGP SIGNATURE-----
Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore fixes from Tony Luck:
"Series of small bug fixes for pstore"
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
pstore: Fix memory leak when decompress using big_oops_buf
pstore: Fix buffer overflow while write offset equal to buffer size
pstore: Correct the max_dump_cnt clearing of ramoops
pstore: Fix NULL pointer fault if get NULL prz in ramoops_get_next_prz
pstore: skip zero size persistent ram buffer in traverse
pstore: clarify clearing of _read_cnt in ramoops_context
Pull fuse update from Miklos Szeredi:
"This series adds cached writeback support to fuse, improving write
throughput"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix "uninitialized variable" warning
fuse: Turn writeback cache on
fuse: Fix O_DIRECT operations vs cached writeback misorder
fuse: fuse_flush() should wait on writeback
fuse: Implement write_begin/write_end callbacks
fuse: restructure fuse_readpage()
fuse: Flush files on wb close
fuse: Trust kernel i_mtime only
fuse: Trust kernel i_size only
fuse: Connection bit for enabling writeback
fuse: Prepare to handle short reads
fuse: Linking file to inode helper
This set includes a couple trivial cleanups and changes recovery
log messages from DEBUG to INFO.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJTPCtSAAoJEDgbc8f8gGmqn3IQALyqRO6spT5fGgPHktLo5lZx
Qp6uFn1dHwPncQbdCOQPsFNBnCKE76+lfY09NjABCYnJdEyZFye1MwwYDyqQjqEt
6kUi8LLrkF39KUkOPivV19fUfwNVMoNwAIUKzTSo8wuWE/1+0NL+FIxCUd3rGqxv
A5yg1nXGTIBPNvZArq8bi49tfriCu167/QXLPlVWIRkyJ69H5ErC+sewufSP/Z5m
S4yDFGOLy6pu+snT0GHGZcTRAfz37YpHdpAjlxUiurEZFXGvCagF55DaZIiRhf0K
RDXrm1rgyNQ7H1gfKa92TOijGx62nOYR9F3zlc8c4tQB+eDaWTQWGR0+lgI/PklZ
VPMHG+lqauaEv9YwKS5UL+E3BuyiAVuwF41mue8lOSA0t+D5P4O2gB+BFuYSpUy1
Kwt2+S38sqiPwEqRcS/n+x/qx+uaGHxtsQh6cEJw4YM5cslJukkRSx8SUqgoETNB
uUGjD+pG1lReyUUzzjG74UKRvb7dV+psWRDb56hx2nJrZ2H24s4h7FEHaHJWKoX+
5H4K3b+E3lqwHp4ARu/yuUYgg/PCsGTsL986135nVnvjV/cHaG4u5N8KP+a9hSyW
PkubO7kUO34KKH45yhSh79ELsWyw4gFgy0mls2BoDpE31phGkoYbKB+hbIkbeKVQ
EZLmwKt2aijdo3z0g4bO
=Ljna
-----END PGP SIGNATURE-----
Merge tag 'dlm-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes a couple trivial cleanups and changes recovery log
messages from DEBUG to INFO"
* tag 'dlm-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: use INFO for recovery messages
fs: Include appropriate header file in dlm/ast.c
dlm: silence a harmless use after free warning
Pull btrfs changes from Chris Mason:
"This is a pretty long stream of bug fixes and performance fixes.
Qu Wenruo has replaced the btrfs async threads with regular kernel
workqueues. We'll keep an eye out for performance differences, but
it's nice to be using more generic code for this.
We still have some corruption fixes and other patches coming in for
the merge window, but this batch is tested and ready to go"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
Btrfs: fix a crash of clone with inline extents's split
btrfs: fix uninit variable warning
Btrfs: take into account total references when doing backref lookup
Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
Btrfs: fix incremental send's decision to delay a dir move/rename
Btrfs: remove unnecessary inode generation lookup in send
Btrfs: fix race when updating existing ref head
btrfs: Add trace for btrfs_workqueue alloc/destroy
Btrfs: less fs tree lock contention when using autodefrag
Btrfs: return EPERM when deleting a default subvolume
Btrfs: add missing kfree in btrfs_destroy_workqueue
Btrfs: cache extent states in defrag code path
Btrfs: fix deadlock with nested trans handles
Btrfs: fix possible empty list access when flushing the delalloc inodes
Btrfs: split the global ordered extents mutex
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
Btrfs: reclaim delalloc metadata more aggressively
Btrfs: remove unnecessary lock in may_commit_transaction()
Btrfs: remove the unnecessary flush when preparing the pages
Btrfs: just do dirty page flush for the inode with compression before direct IO
...
but instead the widening contributor base. It is good to see that
interest is increasing in GFS2, and I'd like to thank all the
contributors to this patch set.
In addition to the usual set of bug fixes and clean ups, there are
patches to improve inode creation performance when xattrs are
required and some improvements to the transaction code which is
intended to help improve scalability after further changes in due
course.
Journal extent mapping is also updated to make it more efficient
and again, this is a foundation for future work in this area.
The maximum number of ACLs has been increased to 300 (for a 4k
block size) which means that even with a few additional xattrs
from selinux, everything should fit within a single fs block.
There is also a patch to bring GFS2's own copy of the writepages
code up to the same level as the core VFS. Eventually we may be
able to merge some of this code, since it is fairly similar.
The other major change this time, is bringing consistency to
the printing of messages via fs_<level>, pr_<level> macros.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJTOn+JAAoJEMrg3m4a/8jSSoYQALctSOmyGW978JMAKiwuUeSr
367ho/I/WfZWybWH7iZ/hdEMNCUCnP3C1ZJhYKJ6J60h35p1hIK7DYp9tOy0RsTS
JD3VamE/jboljXyZaaMCtly7HPQMV82rRmI3+bSoXpT4mPz+PB+kRCe2QkvyVAsh
5tojtLz6L/In/eo4UlqZjn1BITcYRL5AgMi+8h8h6Foi4MgnFISZbezC6U5eO46P
DT/xwd0fw+o5ZTm/dTQmhCCH30y4cpKZnNhi+xhHrEm95gBZWcONHD0qyNZe3fBc
WuGUU9hURHHkqT671T7sBGzfNrsKk1OgNzFNy1YrF5C+t6hpG9iAKRIHtuVqSqPx
OblhKP0lebDY1L41NqZR4Up+pUjCMxOs3f+FAl2rlHRBIQdroOu82CZHdTBfM/HJ
1ZvkMrIkxMKb7RtSnTdXsPxcPPZNakHhDaNxHMmMlFlflbXGQqWZaMMhK181d7dn
Y0WU2ayPmjjUdO5OnekMV5J/hNYNLobnV9OO75j4pyqlnHLIIycc/wgNULcU+OJ6
GooOQJNnnAo+2JUvS+Ejn88q2if05HOg4fCXRfu4bdA2zDehei1jr5xz5IWj0OAM
AlmTgUYzK7osvA1XtNd6naCmes+fnm3+Jfh0+YtpeZKgCvaYYCoZCHJccb+a0AIq
7dTkyCQtgsKE+yjPKose
=IWmZ
-----END PGP SIGNATURE-----
Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 updates from Steven Whitehouse:
"One of the main highlights this time, is not the patches themselves
but instead the widening contributor base. It is good to see that
interest is increasing in GFS2, and I'd like to thank all the
contributors to this patch set.
In addition to the usual set of bug fixes and clean ups, there are
patches to improve inode creation performance when xattrs are required
and some improvements to the transaction code which is intended to
help improve scalability after further changes in due course.
Journal extent mapping is also updated to make it more efficient and
again, this is a foundation for future work in this area.
The maximum number of ACLs has been increased to 300 (for a 4k block
size) which means that even with a few additional xattrs from selinux,
everything should fit within a single fs block.
There is also a patch to bring GFS2's own copy of the writepages code
up to the same level as the core VFS. Eventually we may be able to
merge some of this code, since it is fairly similar.
The other major change this time, is bringing consistency to the
printing of messages via fs_<level>, pr_<level> macros"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (29 commits)
GFS2: Fix address space from page function
GFS2: Fix uninitialized VFS inode in gfs2_create_inode
GFS2: Fix return value in slot_get()
GFS2: inline function gfs2_set_mode
GFS2: Remove extraneous function gfs2_security_init
GFS2: Increase the max number of ACLs
GFS2: Re-add a call to log_flush_wait when flushing the journal
GFS2: Ensure workqueue is scheduled after noexp request
GFS2: check NULL return value in gfs2_ok_to_move
GFS2: Convert gfs2_lm_withdraw to use fs_err
GFS2: Use fs_<level> more often
GFS2: Use pr_<level> more consistently
GFS2: Move recovery variables to journal structure in memory
GFS2: global conversion to pr_foo()
GFS2: return -E2BIG if hit the maximum limits of ACLs
GFS2: Clean up journal extent mapping
GFS2: replace kmalloc - __vmalloc / memset 0
GFS2: Remove extra "if" in gfs2_log_flush()
fs: NULL dereference in posix_acl_to_xattr()
GFS2: Move log buffer accounting to transaction
...
Pull file locking updates from Jeff Layton:
"Highlights:
- maintainership change for fs/locks.c. Willy's not interested in
maintaining it these days, and is OK with Bruce and I taking it.
- fix for open vs setlease race that Al ID'ed
- cleanup and consolidation of file locking code
- eliminate unneeded BUG() call
- merge of file-private lock implementation"
* 'locks-3.15' of git://git.samba.org/jlayton/linux:
locks: make locks_mandatory_area check for file-private locks
locks: fix locks_mandatory_locked to respect file-private locks
locks: require that flock->l_pid be set to 0 for file-private locks
locks: add new fcntl cmd values for handling file private locks
locks: skip deadlock detection on FL_FILE_PVT locks
locks: pass the cmd value to fcntl_getlk/getlk64
locks: report l_pid as -1 for FL_FILE_PVT locks
locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
locks: rename locks_remove_flock to locks_remove_file
locks: consolidate checks for compatible filp->f_mode values in setlk handlers
locks: fix posix lock range overflow handling
locks: eliminate BUG() call when there's an unexpected lock on file close
locks: add __acquires and __releases annotations to locks_start and locks_stop
locks: remove "inline" qualifier from fl_link manipulation functions
locks: clean up comment typo
locks: close potential race between setlease and open
MAINTAINERS: update entry for fs/locks.c
Pull renameat2 system call from Miklos Szeredi:
"This adds a new syscall, renameat2(), which is the same as renameat()
but with a flags argument.
The purpose of extending rename is to add cross-rename, a symmetric
variant of rename, which exchanges the two files. This allows
interesting things, which were not possible before, for example
atomically replacing a directory tree with a symlink, etc... This
also allows overlayfs and friends to operate on whiteouts atomically.
Andy Lutomirski also suggested a "noreplace" flag, which disables the
overwriting behavior of rename.
These two flags, RENAME_EXCHANGE and RENAME_NOREPLACE are only
implemented for ext4 as an example and for testing"
* 'cross-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ext4: add cross rename support
ext4: rename: split out helper functions
ext4: rename: move EMLINK check up
ext4: rename: create ext4_renament structure for local vars
vfs: add cross-rename
vfs: lock_two_nondirectories: allow directory args
security: add flags to rename hooks
vfs: add RENAME_NOREPLACE flag
vfs: add renameat2 syscall
vfs: rename: use common code for dir and non-dir
vfs: rename: move d_move() up
vfs: add d_is_dir()
Any setattr of the ACL attribute, even if it sets just the basic 3-ACE
ACL exactly as it was returned from a file with only mode bits, creates
a mask entry, and it is only the mask, not group, entry that is changed
by subsequent modifications of the mode bits.
So, for example, it's surprising that GROUP@ is left without read or
write permissions after a chmod 0666:
touch test
chmod 0600 test
nfs4_getfacl test
A::OWNER@:rwatTcCy
A::GROUP@:tcy
A::EVERYONE@:tcy
nfs4_getfacl test | nfs4_setfacl -S - test #
chmod 0666 test
nfs4_getfacl test
A::OWNER@:rwatTcCy
A::GROUP@:tcy
D::GROUP@:rwa
A::EVERYONE@:rwatcy
So, let's stop creating the unnecessary mask ACL.
A mask will still be created on non-trivial ACLs (ACLs with actual named
user and group ACEs), so the odd posix-acl behavior of chmod modifying
only the mask will still be left in that case; but that's consistent
with local behavior.
Reported-by: Soumya Koduri <skoduri@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This reverts the part of commit 6e14b46b91
that changes NFSv2 behavior.
Mark Lord found that it broke nfs-root for Linux clients, because it
broke NFSv2.
In fact, from RFC 1094:
"Notice that the file type is specified both in the mode bits
and in the file type. This is really a bug in the protocol and
will be fixed in future versions."
So NFSv2 clients really are expected to depend on the high bits of the
mode.
Cc: stable@kernel.org
Reported-by: Mark Lord <mlord@pobox.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Merge first patch-bomb from Andrew Morton:
- Various misc bits
- kmemleak fixes
- small befs, codafs, cifs, efs, freexxfs, hfsplus, minixfs, reiserfs things
- fanotify
- I appear to have become SuperH maintainer
- ocfs2 updates
- direct-io tweaks
- a bit of the MM queue
- printk updates
- MAINTAINERS maintenance
- some backlight things
- lib/ updates
- checkpatch updates
- the rtc queue
- nilfs2 updates
- Small Documentation/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (237 commits)
Documentation/SubmittingPatches: remove references to patch-scripts
Documentation/SubmittingPatches: update some dead URLs
Documentation/filesystems/ntfs.txt: remove changelog reference
Documentation/kmemleak.txt: updates
fs/reiserfs/super.c: add __init to init_inodecache
fs/reiserfs: move prototype declaration to header file
fs/hfsplus/attributes.c: add __init to hfsplus_create_attr_tree_cache()
fs/hfsplus/extents.c: fix concurrent acess of alloc_blocks
fs/hfsplus/extents.c: remove unused variable in hfsplus_get_block
nilfs2: update project's web site in nilfs2.txt
nilfs2: update MAINTAINERS file entries fix
nilfs2: verify metadata sizes read from disk
nilfs2: add FITRIM ioctl support for nilfs2
nilfs2: add nilfs_sufile_trim_fs to trim clean segs
nilfs2: implementation of NILFS_IOCTL_SET_SUINFO ioctl
nilfs2: add nilfs_sufile_set_suinfo to update segment usage
nilfs2: add struct nilfs_suinfo_update and flags
nilfs2: update MAINTAINERS file entries
fs/coda/inode.c: add __init to init_inodecache()
BEFS: logging cleanup
...
init_inodecache is only called by __init init_reiserfs_fs.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move prototype declaration to header file reiserfs/reiserfs.h from
reiserfs/super.c because they are used by more than one file.
This eliminates the following warning in reiserfs/bitmap.c:
fs/reiserfs/bitmap.c:647:6: warning: no previous prototype for `show_alloc_options' [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfsplus_create_attr_tree_cache is only called by __init init_hfsplus_fs
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Concurrent access to alloc_blocks in hfsplus_inode_info() is protected
by extents_lock mutex. This patch fixes two instances where
alloc_blocks modification was not protected with this lock.
This fixes possible allocation bitmap corruption in race conditions
while extending and truncating files.
[akpm@linux-foundation.org: take extents_lock before taking a copy of ->alloc_blocks]
[akpm@linux-foundation.org: remove now-unused label `out']
Signed-off-by: Sougata Santra <sougata@tuxera.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable is defined but not used. Generally it compiles away with
-O2 optimization hence it does not show a warning.
Signed-off-by: Sougata Santra <sougata@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add code to check sizes of on-disk data of metadata files such as inode
size, segment usage size, DAT entry size, and checkpoint size. Although
these sizes are read from disk, the current implementation doesn't check
them.
If these sizes are not sane on disk, it can cause out-of-range access to
metadata or memory access overrun on metadata block buffers due to
overflow in sundry calculations.
Both lower limit and upper limit of metadata sizes are verified to
prevent these issues.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for the FITRIM ioctl, which enables user space tools to
issue TRIM/DISCARD requests to the underlying device. Every clean
segment within the specified range will be discarded.
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add nilfs_sufile_trim_fs(), which takes an fstrim_range structure and
calls blkdev_issue_discard for every clean segment in the specified
range. The range is truncated to file system block boundaries.
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this ioctl the segment usage entries in the SUFILE can be updated
from userspace.
This is useful, because it allows the userspace GC to modify and update
segment usage entries for specific segments, which enables it to avoid
unnecessary write operations.
If a segment needs to be cleaned, but there is no or very little
reclaimable space in it, the cleaning operation basically degrades to a
useless moving operation. In the end the only thing that changes is the
location of the data and a timestamp in the segment usage information.
With this ioctl the GC can skip the cleaning and update the segment
usage entries directly instead.
This is basically a shortcut to cleaning the segment. It is still
necessary to read the segment summary information, but the writing of
the live blocks can be skipped if it's not worth it.
[konishi.ryusuke@lab.ntt.co.jp: add description of NILFS_IOCTL_SET_SUINFO ioctl]
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce nilfs_sufile_set_suinfo(), which expects an array of
nilfs_suinfo_update structures and updates the segment usage information
accordingly.
This is basically a helper function for the newly introduced
NILFS_IOCTL_SET_SUINFO ioctl.
[konishi.ryusuke@lab.ntt.co.jp: use put_bh() instead of brelse() because we know bh != NULL]
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
init_inodecache is only called by __init init_coda
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Summary:
- all printk(KERN_foo converted to pr_foo()
- add pr_fmt and remove redundant prefixes
- convert befs_() to va_format (based on patch by Joe Perches)
- remove non standard %Lu
- use __func__ for all debugging
[akpm@linux-foundation.org: fix printk warnings, reported by Fengguang]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
init_inodecache is only called by __init init_befs_fs.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kzalloc for clean fs_info allocation like other filesystems.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
init_inodecache is only called by __init init_minix_fs.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A missing 'break' statement in bm_status_write() results in a user program
receiving '3' when doing the following:
write(fd, "-1", 2);
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
init_inodecache is only called by __init init_efs_fs.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
uselib hasn't been used since libc5; glibc does not use it. Support
turning it off.
When disabled, also omit the load_elf_library implementation from
binfmt_elf.c, which only uselib invokes.
bloat-o-meter:
add/remove: 0/4 grow/shrink: 0/1 up/down: 0/-785 (-785)
function old new delta
padzero 39 36 -3
uselib_flags 20 - -20
sys_uselib 168 - -168
SyS_uselib 168 - -168
load_elf_library 426 - -426
The new CONFIG_USELIB defaults to `y'.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 6307f8fee2 ("security: remove dead hook task_setgroups"),
set_groups will always return zero, so we could just remove return value
of set_groups.
This patch reduces code size, and simplfies code to use set_groups,
because we don't need to check its return value any more.
[akpm@linux-foundation.org: remove obsolete claims from set_groups() comment]
Signed-off-by: Wang YanQing <udknight@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_sysfs is an obsolete system call no longer supported by libc.
- This patch adds a default CONFIG_SYSFS_SYSCALL=y
- Option can be turned off in expert mode.
- cond_syscall added to kernel/sys_ni.c
[akpm@linux-foundation.org: tweak Kconfig help text]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is plenty of anecdotal evidence and a load of blog posts
suggesting that using "drop_caches" periodically keeps your system
running in "tip top shape". Perhaps adding some kernel documentation
will increase the amount of accurate data on its use.
If we are not shrinking caches effectively, then we have real bugs.
Using drop_caches will simply mask the bugs and make them harder to
find, but certainly does not fix them, nor is it an appropriate
"workaround" to limit the size of the caches. On the contrary, there
have been bug reports on issues that turned out to be misguided use of
cache dropping.
Dropping caches is a very drastic and disruptive operation that is good
for debugging and running tests, but if it creates bug reports from
production use, kernel developers should be aware of its use.
Add a bit more documentation about it, a syslog message to track down
abusers, and vmstat drop counters to help analyze problem reports.
[akpm@linux-foundation.org: checkpatch fixes]
[hannes@cmpxchg.org: add runtime suppression control]
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes read_cache_page_async() which wasn't really needed
anywhere and simplifies the code around it a bit.
read_cache_page_async() is useful when we want to read a page into the
cache without waiting for it to complete. This happens when the
appropriate callback 'filler' doesn't complete its read operation and
releases the page lock immediately, and instead queues a different
completion routine to do that. This never actually happened anywhere in
the code.
read_cache_page_async() had 3 different callers:
- read_cache_page() which is the sync version, it would just wait for
the requested read to complete using wait_on_page_read().
- JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
function it supplied doesn't do any async reads, and would complete
before the filler function returns - making it actually a sync read.
- CRAMFS would call it using the read_mapping_page_async() wrapper, with
a similar story to JFFS2 - the filler function doesn't do anything that
reminds async reads and would always complete before the filler function
returns.
To sum it up, the code in mm/filemap.c never took advantage of having
read_cache_page_async(). While there are filler callbacks that do async
reads (such as the block one), we always called it with the
read_cache_page().
This patch adds a mandatory wait for read to complete when adding a new
page to the cache, and removes read_cache_page_async() and its wrappers.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page. As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently. At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.
Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate. Reclaim will check
for this flag before installing shadow pages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shmem mappings already contain exceptional entries where swap slot
information is remembered.
To be able to store eviction information for regular page cache, prepare
every site dealing with the radix trees directly to handle entries other
than pages.
The common lookup functions will filter out non-page entries and return
NULL for page cache holes, just as before. But provide a raw version of
the API which returns non-page entries as well, and switch shmem over to
use it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The radix tree hole searching code is only used for page cache, for
example the readahead code trying to get a a picture of the area
surrounding a fault.
It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot". But this is about to change, though, as shadow page
descriptors will be stored in the page cache after the actual pages get
evicted from memory.
Move the functions over to mm/filemap.c and make them native page cache
operations, where they can later be adapted to handle the new definition
of "page cache hole".
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This code used to have its own lru cache pagevec up until a0b8cab3 ("mm:
remove lru parameter from __pagevec_lru_add and remove parts of pagevec
API"). Now it's just add_to_page_cache() followed by lru_cache_add(),
might as well use add_to_page_cache_lru() directly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, to track reserved and allocated regions, we use two different
ways, depending on the mapping. For MAP_SHARED, we use
address_mapping's private_list and, while for MAP_PRIVATE, we use a
resv_map.
Now, we are preparing to change a coarse grained lock which protect a
region structure to fine grained lock, and this difference hinder it.
So, before changing it, unify region structure handling, consistently
using a resv_map regardless of the kind of mapping.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We know that "ret > 0" is true here. These tests were left over from
commit 02afc27fae ('direct-io: Handle O_(D)SYNC AIO') and aren't
needed any more.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The return value of bio_get_nr_vecs() cannot be bigger than
BIO_MAX_PAGES, so we can remove redundant the comparison between
nr_pages and BIO_MAX_PAGES.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes the following crash:
kernel BUG at fs/ocfs2/uptodate.c:530!
Modules linked in: ocfs2(F) ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs xen_privcmd sunrpc 8021q garp stp llc bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iTCO_wdt iTCO_vendor_support dcdbas coretemp freq_table mperf microcode pcspkr serio_raw bnx2 lpc_ich mfd_core i5k_amb i5000_edac edac_core e1000e sg shpchp ext4(F) jbd2(F) mbcache(F) dm_round_robin(F) sr_mod(F) cdrom(F) usb_storage(F) sd_mod(F) crc_t10dif(F) pata_acpi(F) ata_generic(F) ata_piix(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) radeon(F)
ttm(F) drm_kms_helper(F) drm(F) hwmon(F) i2c_algo_bit(F) i2c_core(F) dm_multipath(F) dm_mirror(F) dm_region_hash(F) dm_log(F) dm_mod(F)
CPU 5
Pid: 21303, comm: xattr-test Tainted: GF W 3.8.13-30.el6uek.x86_64 #2 Dell Inc. PowerEdge 1950/0M788G
RIP: ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]
Process xattr-test (pid: 21303, threadinfo ffff880017aca000, task ffff880016a2c480)
Call Trace:
ocfs2_init_xattr_bucket+0x8a/0x120 [ocfs2]
ocfs2_cp_xattr_bucket+0xbb/0x1b0 [ocfs2]
ocfs2_extend_xattr_bucket+0x20a/0x2f0 [ocfs2]
ocfs2_add_new_xattr_bucket+0x23e/0x4b0 [ocfs2]
ocfs2_xattr_set_entry_index_block+0x13c/0x3d0 [ocfs2]
ocfs2_xattr_block_set+0xf9/0x220 [ocfs2]
__ocfs2_xattr_set_handle+0x118/0x710 [ocfs2]
ocfs2_xattr_set+0x691/0x880 [ocfs2]
ocfs2_xattr_user_set+0x46/0x50 [ocfs2]
generic_setxattr+0x96/0xa0
__vfs_setxattr_noperm+0x7b/0x170
vfs_setxattr+0xbc/0xc0
setxattr+0xde/0x230
sys_fsetxattr+0xc6/0xf0
system_call_fastpath+0x16/0x1b
Code: 41 80 0c 24 01 48 89 df e8 7d f0 ff ff 4c 89 e6 48 89 df e8 a2 fe ff ff 48 89 df e8 3a f0 ff ff 48 8b 1c 24 4c 8b 64 24 08 c9 c3 <0f> 0b eb fe 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 66 66
RIP ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]
It hit the BUG_ON() in ocfs2_set_new_buffer_uptodate():
void ocfs2_set_new_buffer_uptodate(struct ocfs2_caching_info *ci,
struct buffer_head *bh)
{
/* This should definitely *not* exist in our cache */
if (ocfs2_buffer_cached(ci, bh))
printk(KERN_ERR "bh->b_blocknr: %lu @ %p\n", bh->b_blocknr, bh);
BUG_ON(ocfs2_buffer_cached(ci, bh));
set_buffer_uptodate(bh);
ocfs2_metadata_cache_io_lock(ci);
ocfs2_set_buffer_uptodate(ci, bh);
ocfs2_metadata_cache_io_unlock(ci);
}
The problem here is:
We cached a block, but the buffer_head got reused. When we are to pick
up this block again, a new buffer_head created with UPTODATE flag
cleared. ocfs2_buffer_uptodate() returned false since no UPTODATE is
set on the buffer_head. so we set this block to cache as a NEW block,
then it failed at asserting block is not in cache.
The fix is to add a new parameter indicating the bucket is a new
allocated or not to ocfs2_init_xattr_bucket().
ocfs2_init_xattr_bucket() assert block not cached accordingly.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following case may lead to the same system inode ref in confusion.
A thread B thread
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
because of *arr == NULL,
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
gets first ref thru
_ocfs2_get_system_file_inode,
gets second ref thru igrab and
set *arr = inode
at the moment, B thread also gets
two refs, so lead to one more
inode ref.
So add mutex lock to avoid multi thread set two inode ref once at the
same time.
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_info_handle_freeinode() and ocfs2_test_inode_bit() func, after
calls ocfs2_get_system_file_inode() to get inode ref, if calls
ocfs2_info_scan_inode_alloc() or ocfs2_inode_lock() failed, we should
iput inode alloc to avoid leaking the inode.
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 17330860
When accepting an incomming connection o2net_accept_one clones a child
data socket from the parent listening socket. It then proceeds to setup
the child with callback o2net_data_ready() and sk_user_data to NULL. If
data arrives in this window, o2net_listen_data_ready will be called with
some non-deterministic value in sk_user_data (not inherited). We panic
when we page fault on sk_user_data -- in parent it is
sock_def_readable().
The fix is to recognize that this is a data socket being set up by
looking at the socket state and do nothing.
Signed-off-by: Tariq Saseed <tariq.x.saeed@oracle.com>
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After updating alloc_dinode counts in ocfs2_alloc_dinode_update_counts(),
if ocfs2_alloc_dinode_update_bitmap() failed, there is a rare case that
some space may be lost.
So, roll back alloc_dinode counts when ocfs2_block_group_set_bits()
failed.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_do_flock() calls ocfs2_file_lock() to get the cross-node clock and
then call flock_lock_file_wait() to compete with local processes. In
case flock_lock_file_wait() failed, say -ENOMEM, clean up work is not
done. This patch adds the cleanup --drop the cross-node lock which was
just granted.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ensure that ocfs2_update_inode_fsync_trans() is called any time we touch
an inode in a given transaction. This is a follow-on to the previous
patch to reduce lock contention and deadlocking during an fsync
operation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Wengang <wen.gang.wang@oracle.com>
Cc: Greg Marsden <greg.marsden@oracle.com>
Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not put bh when buffer_uptodate failed in ocfs2_write_block and
ocfs2_write_super_or_backup, because it will put bh in b_end_io.
Otherwise it will hit a warning "VFS: brelse: Trying to free free
buffer".
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When ocfs2_create_new_inode_locks() return error, inode open lock may
not be obtainted for this inode. So other nodes can remove this file
and free dinode when inode still remain in memory on this node, which is
not correct and may trigger BUG. So __ocfs2_mknod_locked should return
error when ocfs2_create_new_inode_locks() failed.
Node_1 Node_2
create fileA, call ocfs2_mknod()
-> ocfs2_get_init_inode(), allocate inodeA
-> ocfs2_claim_new_inode(), claim dinode(dinodeA)
-> call ocfs2_create_new_inode_locks(),
create open lock failed, return error
-> __ocfs2_mknod_locked return success
unlink fileA
try open lock succeed,
and free dinodeA
create another file, call ocfs2_mknod()
-> ocfs2_get_init_inode(), allocate inodeB
-> ocfs2_claim_new_inode(), as Node_2 had freed dinodeA,
so claim dinodeA and update generation for dinodeA
call __ocfs2_drop_dl_inodes()->ocfs2_delete_inode()
to free inodeA, and finally triggers BUG
on(inode->i_generation != le32_to_cpu(fe->i_generation))
in function ocfs2_inode_lock_update().
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 18108070
ocfs2_xattr_extend_allocation() hits panic when creating xattr during
data extent alloc phase. The problem occurs if due to local alloc
fragmentation, clusters are spread over multiple extents. In this case
ocfs2_add_clusters_in_btree() finds no space to store more than one
extent record and therefore fails returning RESTART_META. The situation
is anticipated for xattr update case but not xattr create case. This
fix simply ports that code to create case.
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dlm_query_region_handler(), once kmalloc failed, it will unlock
dlm_domain_lock without lock first, then deadlock happens.
Signed-off-by: Zhonghua Guo <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Tested-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
llseek requires ocfs2 inode lock for updating the file size in SEEK_END.
because the file size maybe update on another node.
This bug can be reproduce the following scenario: at first, we dd a test
fileA, the file size is 10k.
on NodeA:
---------
1) open the test fileA, lseek the end of file. and print the position.
2) close the test fileA
on NodeB:
1) open the test fileA, append the 5k data to test FileA.
2) lseek the end of file. and print the position.
3) close file.
At first we run the test program1 on NodeA , the result is 10k. And
then run the test program2 on NodeB, the result is 15k. At last, we run
the test program1 on NodeA again, the result is 10k.
After applying this patch the three step result is 15k.
test result: 1000000 times lseek call;
index lseek with inode lock (unit:us) lseek without inode lock (unit:us)
1 1168162 555383
2 1168011 549504
3 1170538 549396
4 1170375 551685
5 1170444 556719
6 1174364 555307
7 1163294 551552
8 1170080 549350
9 1162464 553700
10 1165441 552594
avg 1168317 552519
avg with lock - avg without lock = 615798
(avg with lock - avg without lock)/1000000=0.615798 us
Signed-off-by: Jensen <shencanquan@huawei.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In o2nm_cluster, cl_idle_timeout_ms, cl_keepalive_delay_ms, as well as
cl_reconnect_delay_ms, are defined as type of unsigned int. So we
should also use unsigned int in the helper functions.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following patches are reverted in this patch because these patches
caused performance regression in the remote unlink() calls.
ea455f8ab6 - ocfs2: Push out dropping of dentry lock to ocfs2_wq
f7b1aa69be - ocfs2: Fix deadlock on umount
5fd1318937 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount
Previous patches in this series removed the possible deadlocks from
downconvert thread so the above patches shouldn't be needed anymore.
The regression is caused because these patches delay the iput() in case
of dentry unlocks. This also delays the unlocking of the open lockres.
The open lockresource is required to test if the inode can be wiped from
disk or not. When the deleting node does not get the open lock, it
marks it as orphan (even though it is not in use by another
node/process) and causes a journal checkpoint. This delays operations
following the inode eviction. This also moves the inode to the orphaned
inode which further causes more I/O and a lot of unneccessary orphans.
The following script can be used to generate the load causing issues:
declare -a create
declare -a remove
declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
unique="`mktemp -u XXXXX`"
script="/tmp/idontknow-${unique}.sh"
cat <<EOF > "${script}"
for n in {1..8}; do mkdir -p test/dir\${n}
eval touch test/dir\${n}/foo{1.."\$1"}
done
EOF
chmod 700 "${script}"
function fcreate ()
{
exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
}
function fremove ()
{
exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
}
function fcp ()
{
exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
}
echo -------------------------------------------------
echo "| # files | create #s | copy #s | remove #s |"
echo -------------------------------------------------
for ((x=0; x < ${#iterations[*]} ; x++)) do
create[$x]="`fcreate ${iterations[$x]}`"
copy[$x]="`fcp ${iterations[$x]}`"
remove[$x]="`fremove`"
printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
done
rm "${script}"
echo "------------------------"
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we are dropping last inode reference from downconvert thread, we will
end up calling ocfs2_mark_lockres_freeing() which can block if the lock
we are freeing is queued thus creating an A-A deadlock. Luckily, since
we are the downconvert thread, we can immediately dequeue the lock and
thus avoid waiting in this case.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We cannot drop last dquot reference from downconvert thread as that
creates the following deadlock:
NODE 1 NODE2
holds dentry lock for 'foo'
holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
dquot_initialize(bar)
ocfs2_dquot_acquire()
ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
...
downconvert thread (triggered from another
node or a different process from NODE2)
ocfs2_dentry_post_unlock()
...
iput(foo)
ocfs2_evict_inode(foo)
ocfs2_clear_inode(foo)
dquot_drop(inode)
...
ocfs2_dquot_release()
ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
- blocks
finds we need more space in
quota file
...
ocfs2_extend_no_holes()
ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
- deadlocks waiting for
downconvert thread
We solve the problem by postponing dropping of the last dquot reference to
a workqueue if it happens from the downconvert thread.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide dqgrab() function to get quota structure reference when we are
sure it already has at least one active reference. Make use of this
function inside quota code.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move dquot_initalize() call in ocfs2_delete_inode() after the moment we
verify inode is actually a sane one to delete. We certainly don't want
to initialize quota for system inodes etc. This also avoids calling
into quota code from downconvert thread.
Add more details into the comment why bailing out from
ocfs2_delete_inode() when we are in downconvert thread is OK.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The flag was never set, delete it.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a part of the nocontrold feature which was incorporated sometime
back.
This is required for backward compatibility of the tools, specifically
the scenario where the tools with recovery callback is used with a
kernel not using the recovery callbacks (older kernel + newer tools).
The tools look for this file to understand if the kernel supports DLM
recovery callbacks.
For kernels which support recovery callbacks but will miss this patch,
ocfs2 will continue to use the older API and would still be able to
mount the filesystem.
[akpm@linux-foundation.org: simplify]
[sfr@canb.auug.org.au: VERIFY_OCTAL_PERMISSIONS fix up]
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race window in dlm_do_recovery() between dlm_remaster_locks()
and dlm_reset_recovery() when the recovery master nearly finish the
recovery process for a dead node. After the master sends FINALIZE_RECO
message in dlm_remaster_locks(), another node may become the recovery
master for another dead node, and then send the BEGIN_RECO message to
all the nodes included the old master, in the handler of this message
dlm_begin_reco_handler() of old master, dlm->reco.dead_node and
dlm->reco.new_master will be set to the second dead node and the new
master, then in dlm_reset_recovery(), these two variables will be reset
to default value. This will cause new recovery master can not finish
the recovery process and hung, at last the whole cluster will hung for
recovery.
old recovery master: new recovery master:
dlm_remaster_locks()
become recovery master for
another dead node.
dlm_send_begin_reco_message()
dlm_begin_reco_handler()
{
if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
return -EAGAIN;
}
dlm_set_reco_master(dlm, br->node_idx);
dlm_set_reco_dead_node(dlm, br->dead_node);
}
dlm_reset_recovery()
{
dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
}
will hang in dlm_remaster_locks() for
request dlm locks info
Before send FINALIZE_RECO message, recovery master should set
DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery done,
this can break the race windows as the BEGIN_RECO messages will not be
handled before DLM_RECO_STATE_FINALIZE flag is cleared.
A similar race may happen between new recovery master and normal node
which is in dlm_finalize_reco_handler(), also fix it.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This issue was introduced by commit 800deef3f6 ("ocfs2: use
list_for_each_entry where benefical") in 2007 where it replaced
list_for_each with list_for_each_entry. The variable "lock" will point
to invalid data if "tmpq" list is empty and a panic will be triggered
due to this. Sunil advised reverting it back, but the old version was
also not right. At the end of the outer for loop, that
list_for_each_entry will also set "lock" to an invalid data, then in the
next loop, if the "tmpq" list is empty, "lock" will be an stale invalid
data and cause the panic. So reverting the list_for_each back and reset
"lock" to NULL to fix this issue.
Another concern is that this seemes can not happen because the "tmpq"
list should not be empty. Let me describe how.
old lock resource owner(node 1): migratation target(node 2):
image there's lockres with a EX lock from node 2 in
granted list, a NR lock from node x with convert_type
EX in converting list.
dlm_empty_lockres() {
dlm_pick_migration_target() {
pick node 2 as target as its lock is the first one
in granted list.
}
dlm_migrate_lockres() {
dlm_mark_lockres_migrating() {
res->state |= DLM_LOCK_RES_BLOCK_DIRTY;
wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
//after the above code, we can not dirty lockres any more,
// so dlm_thread shuffle list will not run
downconvert lock from EX to NR
upconvert lock from NR to EX
<<< migration may schedule out here, then
<<< node 2 send down convert request to convert type from EX to
<<< NR, then send up convert request to convert type from NR to
<<< EX, at this time, lockres granted list is empty, and two locks
<<< in the converting list, node x up convert lock followed by
<<< node 2 up convert lock.
// will set lockres RES_MIGRATING flag, the following
// lock/unlock can not run
dlm_lockres_release_ast(dlm, res);
}
dlm_send_one_lockres()
dlm_process_recovery_data()
for (i=0; i<mres->num_locks; i++)
if (ml->node == dlm->node_num)
for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
list_for_each_entry(lock, tmpq, list)
if (lock) break; <<< lock is invalid as grant list is empty.
}
if (lock->ml.node != ml->node)
BUG() >>> crash here
}
I see the above locks status from a vmcore of our internal bug.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
transaction to complete. This isn't terribly efficient, since sync_file
really only needs to wait for the last transaction involving that inode
to complete, and this doesn't require i_mutex.
Therefore, implement the necessary bits to track the newest tid
associated with an inode, and teach sync_file to wait for that instead
of waiting for everything in the journal to commit. Furthermore, only
issue the flush request to the drive if jbd2 hasn't already done so.
This also eliminates the deadlock between ocfs2_file_aio_write() and
ocfs2_sync_file(). aio_write takes i_mutex then calls
ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
However, if that dio completion involves calling fsync, then we can get
into trouble when some ocfs2_sync_file tries to take i_mutex.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Variable uuid_net_key in ocfs2_initialize_super() is not used. Clean it
up.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a problem that waitqueue_active() may check stale data thus miss
a wakeup of threads waiting on ip_unaligned_aio.
The valid value of ip_unaligned_aio is only 0 and 1 so we can change it to
be of type mutex thus the above prolem is avoid. Another benifit is that
mutex which works as FIFO is fairer than wake_up_all().
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When mounting an ocfs2 volume, it will firstly generate a file
/sys/kernel/debug/o2dlm/<uuid>/dlm_state, and then launch the dlm thread.
So the following situation will cause a null pointer dereference.
dlm_debug_init -> access file dlm_state which will call dlm_state_print ->
dlm_launch_thread
Move dlm_debug_init after dlm_launch_thread and dlm_launch_recovery_thread
can fix this issue.
Signed-off-by: Zongxun Wang <wangzongxun@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move code moving event structure to access_list from copy_event_to_user()
to fanotify_read() where it is more logical (so that we can immediately
see in the main loop that we either move the event to a different list
or free it). Also move special error handling for permission events
from copy_event_to_user() to the main loop to have it in one place with
error handling for normal events. This makes copy_event_to_user()
really only copy the event to user without any side effects.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Swap the error / "read ok" branches in the main loop of fanotify_read().
We will grow the "read ok" part in the next patch and this makes the
indentation easier. Also it is more common to have error conditions
inside an 'if' instead of the fast path.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
access_mutex is used only to guard operations on access_list. There's
no need for sleeping within this lock so just make a spinlock out of it.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, fanotify creates new structure to track the fact that
permission event has been reported to userspace and someone is waiting
for a response to it. As event structures are now completely in the
hands of each notification framework, we can use the event structure for
this tracking instead of allocating a new structure.
Since this makes the event structures for normal events and permission
events even more different and the structures have different lifetime
rules, we split them into two separate structures (where permission
event structure contains the structure for a normal event). This makes
normal events 8 bytes smaller and the code a tad bit cleaner.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The prepare_for_access_response() function checks whether
group->fanotify_data.bypass_perm is set. However this test can never be
true because prepare_for_access_response() is called only from
fanotify_read() which means fanotify group is alive with an active fd
while bypass_perm is set from fanotify_release() when all file
descriptors pointing to the group are closed and the group is going
away.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nameidata was replaced by flags in commit 00cd8dd3bf ("stop passing
nameidata to ->lookup()").
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cifs_init_inodecache is only called by __init init_cifs.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 839a8e8660 ("writeback: replace custom worker pool
implementation with unbound workqueue") when device is removed while we
are writing to it we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.
This can happen because even though bdi_unregister() cancels all pending
flushing work, nothing really prevents new ones from being queued from
balance_dirty_pages() or other places.
Fix the problem by clearing BDI_registered bit in bdi_unregister() and
checking it before scheduling of any flushing work.
Fixes: 839a8e8660
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
schedule work to writeback dirty inodes. The problem with this is that
it can delay work that is scheduled for immediate execution, such as the
work from sync_inodes_sb(). This can happen since mod_delayed_work()
can now steal work from a work_queue. This fixes the problem by using
queue_delayed_work() instead. This is a regression caused by commit
839a8e8660 ("writeback: replace custom worker pool implementation with
unbound workqueue").
The reason that this causes a problem is that laptop-mode will change
the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
In the case that bdi_wakeup_thread_delayed() races with
sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
task. Even if dirty_writeback_centisecs is not long enough to cause a
hung task, we still don't want to delay sync for that long.
We fix the problem by using queue_delayed_work() when we want to
schedule writeback sometime in future. This function doesn't change the
timer if it is already armed.
For the same reason, we also change bdi_writeback_workfn() to
immediately queue the work again in the case that the work_list is not
empty. The same problem can happen if the sync work is run on the
rescue worker.
[jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zento.linux.org.uk>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benson Leung <bleung@chromium.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Luigi Semenzato <semenzato@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced
in 3.10 incorrectly converted the btree hash index array pointer in
xfs_da3_fixhashpath(). It resulted in the the current hash always
being compared against the first entry in the btree rather than the
current block index into the btree block's hash entry array. As a
result, it was comparing the wrong hashes, and so could misorder the
entries in the btree.
For most cases, this doesn't cause any problems as it requires hash
collisions to expose the ordering problem. However, when there are
hash collisions within a directory there is a very good probability
that the entries will be ordered incorrectly and that actually
matters when duplicate hashes are placed into or removed from the
btree block hash entry array.
This bug results in an on-disk directory corruption and that results
in directory verifier functions throwing corruption warnings into
the logs. While no data or directory entries are lost, access to
them may be compromised, and attempts to remove entries from a
directory that has suffered from this corruption may result in a
filesystem shutdown. xfs_repair will fix the directory hash
ordering without data loss occuring.
[dchinner: wrote useful a commit message]
cc: <stable@vger.kernel.org>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Pull cgroup updates from Tejun Heo:
"A lot updates for cgroup:
- The biggest one is cgroup's conversion to kernfs. cgroup took
after the long abandoned vfs-entangled sysfs implementation and
made it even more convoluted over time. cgroup's internal objects
were fused with vfs objects which also brought in vfs locking and
object lifetime rules. Naturally, there are places where vfs rules
don't fit and nasty hacks, such as credential switching or lock
dance interleaving inode mutex and cgroup_mutex with object serial
number comparison thrown in to decide whether the operation is
actually necessary, needed to be employed.
After conversion to kernfs, internal object lifetime and locking
rules are mostly isolated from vfs interactions allowing shedding
of several nasty hacks and overall simplification. This will also
allow implmentation of operations which may affect multiple cgroups
which weren't possible before as it would have required nesting
i_mutexes.
- Various simplifications including dropping of module support,
easier cgroup name/path handling, simplified cgroup file type
handling and task_cg_lists optimization.
- Prepatory changes for the planned unified hierarchy, which is still
a patchset away from being actually operational. The dummy
hierarchy is updated to serve as the default unified hierarchy.
Controllers which aren't claimed by other hierarchies are
associated with it, which BTW was what the dummy hierarchy was for
anyway.
- Various fixes from Li and others. This pull request includes some
patches to add missing slab.h to various subsystems. This was
triggered xattr.h include removal from cgroup.h. cgroup.h
indirectly got included a lot of files which brought in xattr.h
which brought in slab.h.
There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch"
* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
cgroup: remove useless argument from cgroup_exit()
cgroup: fix spurious lockdep warning in cgroup_exit()
cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
cgroup: break kernfs active_ref protection in cgroup directory operations
cgroup: fix cgroup_taskset walking order
cgroup: implement CFTYPE_ONLY_ON_DFL
cgroup: make cgrp_dfl_root mountable
cgroup: drop const from @buffer of cftype->write_string()
cgroup: rename cgroup_dummy_root and related names
cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup: reorganize cgroup bootstrapping
cgroup: relocate setting of CGRP_DEAD
cpuset: use rcu_read_lock() to protect task_cs()
cgroup_freezer: document freezer_fork() subtleties
cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup: drop task_lock() protection around task->cgroups
cgroup: update how a newly forked task gets associated with css_set
...
There were some extra semi-colons here which mean that we return true
unintentionally.
Fixes: a49935f200 ('xfs: xfs_check_page_type buffer checks need help')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Mark functions as static in exofs/ore_raid.c because they are not used
outside this file.
This also eliminates the following warning in exofs/ore_raid.c:
fs/exofs/ore_raid.c:24:14: warning: no previous prototype for _raid_page_alloc [-Wmissing-prototypes]
fs/exofs/ore_raid.c:29:6: warning: no previous prototype for _raid_page_free [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Mark function as static in exofs/super.c because it is not used outside
this file.
This also eliminates the following warning in exofs/super.c:
fs/exofs/super.c:546:5: warning: no previous prototype \
for __alloc_dev_table[-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
ceph_fh_to_parent() returns dentry that corresponds to the 'ino' field
of struct ceph_nfs_confh. This is wrong, it should return dentry that
corresponds to the 'parent_ino' field.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
MDS handles LOOKUPHASH and LOOKUPINO MDS requests in the same way.
So __cfh_to_dentry() is redundant.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The object store limit needs to be updated after writing,
and this can be done provided the corresponding object has already
been initialized. Current object initialization is done asynchrously,
which introduce a race if a file is opened, then immediately followed
by a writing, the initialization may have not completed, the code will
reach the ASSERT in fscache_submit_exclusive_op() to cause kernel
bug.
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Synchronize object->store_limit[_l] with new inode->i_size after file writing.
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Add an interface to explicitly synchronize object->store_limit[_l]
with inode->i_size
Tested-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
Signed-off-by: Min Chen <minchen@ubuntukylin.com>
Signed-off-by: Li Wang <liwang@ubuntukylin.com>