Return device pointer (with the IS_ERR semantics) from
btrfs_scan_one_device so we don't have to return in through pointer.
And since btrfs_fs_devices can be obtained from btrfs_device, return that.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fixed conflics after recent changes to btrfs_scan_one_device ]
Signed-off-by: David Sterba <dsterba@suse.com>
fs_devices is always passed to btrfs_scan_one_device which overrides it.
In the call stack below fs_devices is passed to btrfs_scan_one_device
from btrfs_mount_root. In btrfs_mount_root the output fs_devices of
this call stack is not used.
btrfs_mount_root
btrfs_parse_early_options
btrfs_scan_one_device
So, it is not necessary to pass fs_devices from btrfs_mount_root, using
a local variable in btrfs_parse_early_options is enough.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Technically this extends the critical section covered by uuid_mutex to:
- parse early mount options -- here we can call device scan on paths
that can be passed as 'device=/dev/...'
- scan the device passed to mount
- open the devices related to the fs_devices -- this increases
fs_devices::opened
The race can happen when mount calls one of the scans and there's
another one called eg. by mkfs or 'btrfs dev scan':
Mount Scan
----- ----
scan_one_device (dev1, fsid1)
scan_one_device (dev2, fsid1)
add the device
free stale devices
fsid1 fs_devices::opened == 0
find fsid1:dev1
free fsid1:dev1
if it's the last one,
free fs_devices of fsid1
too
open_devices (dev1, fsid1)
dev1 not found
When fixed, the uuid mutex will make sure that mount will increase
fs_devices::opened and this will not be touched by the racing scan
ioctl.
Reported-and-tested-by: syzbot+909a5177749d7990ffa4@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+ceb2606025ec1cc3479c@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to take a big lock, move resource initialization before
the critical section. It's not obvious from the diff, the desired order
is:
- initialize mount security options
- allocate temporary fs_info
- allocate superblock buffers
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepartory work to fix race between mount and device scan.
btrfs_parse_early_options calls the device scan from mount and we'll
need to let mount completely manage the critical section.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepartory work to fix race between mount and device scan.
The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepartory work to fix race between mount and device scan.
The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_stale_devices() finds a stale (not opened) device matching
path in the fs_uuid list. We are already under uuid_mutex so when we
check for each fs_devices, hold the device_list_mutex too.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Over the years we named %fs_devices and %devices to represent the
struct btrfs_fs_devices and the struct btrfs_device. So follow the same
scheme here too. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make sure the device_list_lock is held the whole time:
* when the device is being looked up
* new device is initialized and put to the list
* the list counters are updated (fs_devices::opened, fs_devices::total_devices)
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_stale_devices() looks for device path reused for another
filesystem, and deletes the older fs_devices::device entry.
In preparation to handle locking in device_list_add, move
btrfs_free_stale_devices outside as these two functions serve a
different purpose.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 88c14590cd ("btrfs: use RCU in btrfs_show_devname for
device list traversal") btrfs_show_devname no longer takes
device_list_mutex. As such the deadlock that 0ccd05285e ("btrfs: fix a
possible umount deadlock") aimed to fix no longer exists, we can free
the devices immediatelly and remove the code that does the pending work.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is not used since the alloc_start parameter has been
obsoleted in commit 0d0c71b317 ("btrfs: obsolete and remove
mount option alloc_start").
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since parameter flags is no more used since commit d740760656 ("btrfs:
split parse_early_options() in two"), remove it.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case of deleting the seed device the %cur_devices (seed) and the
%fs_devices (parent) are different. Now, as the parent
fs_devices::total_devices also maintains the total number of devices
including the seed device, so decrement its in-memory value for the
successful seed delete. We are already updating its corresponding
on-disk btrfs_super_block::number_devices value.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in an easier to follow code but
it also introduced a subtle bug. It changed the timing when the initial
transaction rescan was happening:
- before the commit: it would happen after transaction commit had occured
- after the commit: it might happen before the transaction was committed
This results in failure to correctly rescan the quota since there could
be data which is still not committed on disk.
This patch aims to fix this by moving the transaction creation/commit
inside btrfs_quota_enable, which allows to schedule the quota commit
after the transaction has been committed.
Fixes: 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable")
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Link: https://marc.info/?l=linux-btrfs&m=152999289017582
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add fall-back code to catch failure of full_stripe_write. Proper error
handling from inside run_plug would need more code restructuring as it's
called at arbitrary points by io scheduler.
Signed-off-by: David Sterba <dsterba@suse.com>
Add helper that schedules a given function to run on the rmw workqueue.
This will replace several standalone helpers.
Signed-off-by: David Sterba <dsterba@suse.com>
The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions that get btrfs inode can simply reach the fs_info by
dereferencing the root and this looks a bit more straightforward
compared to the btrfs_sb(...) indirection.
If the transaction handle is available and not NULL it's used instead.
Signed-off-by: David Sterba <dsterba@suse.com>
There are several places when the btrfs inode is converted to the
generic inode, back to btrfs and then passed to btrfs_ino. We can remove
the extra back and forth conversions.
Signed-off-by: David Sterba <dsterba@suse.com>
io_ctl_set_generation() assumes that the generation number shares
the same page with inline CRCs. Let's make sure this is always true.
Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is only usage of the declared devices variable, instead use its
value directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many instances of the %fs_info->fs_devices pointer
dereferences, use a temporary variable instead.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Invalid reloc tree can cause kernel NULL pointer dereference when btrfs
does some cleanup of the reloc roots.
It turns out that fs_info::reloc_ctl can be NULL in
btrfs_recover_relocation() as we allocate relocation control after all
reloc roots have been verified.
So when we hit: note, we haven't called set_reloc_control() thus
fs_info::reloc_ctl is still NULL.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A crafted image has empty root tree block, which will later cause NULL
pointer dereference.
The following trees should never be empty:
1) Tree root
Must contain at least root items for extent tree, device tree and fs
tree
2) Chunk tree
Or we can't even bootstrap as it contains the mapping.
3) Fs tree
At least inode item for top level inode (.).
4) Device tree
Dev extents for chunks
5) Extent tree
Must have corresponding extent for each chunk.
If any of them is empty, we are sure the fs is corrupted and no need to
mount it.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A crafted image with invalid block group items could make free space cache
code to cause panic.
We could detect such invalid block group item by checking:
1) Item size
Known fixed value.
2) Block group size (key.offset)
We have an upper limit on block group item (10G)
3) Chunk objectid
Known fixed value.
4) Type
Only 4 valid type values, DATA, METADATA, SYSTEM and DATA|METADATA.
No more than 1 bit set for profile type.
5) Used space
No more than the block group size.
This should allow btrfs to detect and refuse to mount the crafted image.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The v0 extent type checks are the right case for the unlikely
annotations as we don't expect to ever see them, so let's give the
compiler some hint.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Following the removal of the v0 handling code let's be courteous and
print an error message when such extents are handled. In the cases
where we have a transaction just abort it, otherwise just call
btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.
In case the error handling would be too intrusive, leave the BUG_ON in
place, like extent_data_ref_count, other proper handling would catch
that earlier.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The v0 compat code was introduced in commit 5d4f98a28c
("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are using
btrfs on such an old kernel, you have much bigger problems. This coupled
with the fact that no one is likely testing/maintining this code likely
means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.
This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case
we have a v0 with no support intact as a sort of safety-net.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's only coding style fix not functinal change. When if/else has only
one statement then the braces are not needed.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not good to override the error code when failing from
btrfs_getxattr() in btrfs_get_acl() because it hides the real reason of
the failure.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no chance to get into -ERANGE error condition because we first
call btrfs_getxattr to get the length of the attribute, then we do a
subsequent call with the size from the first call. Between the 2 calls
the size shouldn't change. So remove the unnecessary -ERANGE error
check.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_get_acl() the first call of btr_getxattr() is for getting the
length of attribute, the value buffer is never used in this case. So
it's better to replace empty string with NULL.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The caller of btrfs_get_acl() checks error condition so there is no
impact from this change. In practice there is no chance to get into
default case of switch statement because VFS has already checked the
type.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If type of extent_inline_ref found is not expected, filesystem may have
been corrupted, should return EUCLEAN instead of EINVAL.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct kiocb carries the ki_pos, so there is no need to pass it as
a separate function parameter.
generic_file_direct_write() increments ki_pos, so we now assign pos
after the function.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
[ rename to btrfs_buffered_write ]
Signed-off-by: David Sterba <dsterba@suse.com>
For easier debugging, print eb->start if level is invalid. Also make
clear if bytenr found is not expected.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the function uses 2 goto labels to properly handle allocation
failures. This could be simplified by simply re-arranging the code so
that allocations are the in the beginning of the function. This allows
to use simple return statements. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Under certain KVM load and LTP tests, it is possible to hit the
following calltrace if quota is enabled:
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
RIP: 0010:blk_status_to_errno+0x1a/0x30
Call Trace:
submit_extent_page+0x191/0x270 [btrfs]
? btrfs_create_repair_bio+0x130/0x130 [btrfs]
__do_readpage+0x2d2/0x810 [btrfs]
? btrfs_create_repair_bio+0x130/0x130 [btrfs]
? run_one_async_done+0xc0/0xc0 [btrfs]
__extent_read_full_page+0xe7/0x100 [btrfs]
? run_one_async_done+0xc0/0xc0 [btrfs]
read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
? run_one_async_done+0xc0/0xc0 [btrfs]
btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
read_tree_block+0x31/0x60 [btrfs]
read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
btrfs_search_slot+0x46b/0xa00 [btrfs]
? kmem_cache_alloc+0x1a8/0x510
? btrfs_get_token_32+0x5b/0x120 [btrfs]
find_parent_nodes+0x11d/0xeb0 [btrfs]
? leaf_space_used+0xb8/0xd0 [btrfs]
? btrfs_leaf_free_space+0x49/0x90 [btrfs]
? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
btrfs_find_all_roots+0x45/0x60 [btrfs]
btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
? pick_next_task_fair+0x2cd/0x530
? __switch_to+0x92/0x4b0
btrfs_worker_helper+0x81/0x300 [btrfs]
process_one_work+0x1da/0x3f0
worker_thread+0x2b/0x3f0
? process_one_work+0x3f0/0x3f0
kthread+0x11a/0x130
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x35/0x40
BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
BTRFS info (device vda2): forced readonly
BTRFS error (device vda2): pending csums is 2887680
[CAUSE]
It's caused by race with block group auto removal:
- There is a meta block group X, which has only one tree block
The tree block belongs to fs tree 257.
- In current transaction, some operation modified fs tree 257
The tree block gets COWed, so the block group X is empty, and marked
as unused, queued to be deleted.
- Some workload (like fsync) wakes up cleaner_kthread()
Which will call btrfs_delete_unused_bgs() to remove unused block
groups.
So block group X along its chunk map get removed.
- Some delalloc work finished for fs tree 257
Quota needs to get the original reference of the extent, which will
read tree blocks of commit root of 257.
Then since the chunk map gets removed, the above warning gets
triggered.
[FIX]
Just let btrfs_delete_unused_bgs() skip block group which still has
pinned bytes.
However there is a minor side effect: currently we only queue empty
blocks at update_block_group(), and such empty block group with pinned
bytes won't go through update_block_group() again, such block group
won't be removed, until it gets new extent allocated and removed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With gcc 4.1.2:
fs/btrfs/inode-map.c: In function ‘btrfs_unpin_free_ino’:
fs/btrfs/inode-map.c:241: warning: ‘count’ may be used uninitialized in this function
While this warning is a false-positive, it can easily be killed by
refactoring the code.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>