kernel_optimize_test

Author	SHA1	Message	Date
Johannes Thumshirn	3831bf0094	btrfs: add sha256 to checksumming algorithm Add sha256 to the list of possible checksumming algorithms used by BTRFS. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:43 +01:00
Johannes Thumshirn	3951e7f050	btrfs: add xxhash64 to checksumming algorithms Add xxhash64 to the list of possible checksumming algorithms used by BTRFS. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:43 +01:00
David Sterba	8530c37a70	btrfs: get bdev from latest_dev for dio bh_result To remove use of extent_map::bdev we need to find a replacement, and the latest_bdev is the only one we can use here, because inode::i_bdev and superblock::s_bdev are NULL. The DIO code uses bdev in two places: * to read blocksize to perform alignment checks in do_blockdev_direct_IO, but we do them in btrfs code before any call to DIO * in the following call chain: do_direct_IO get_more_blocks sdio->get_block() <-- this is btrfs_get_blocks_direct subsequently the map_bh->b_dev member is used in clean_bdev_aliases and dio_new_bio to set the bio's bdev to that of the buffer_head. However, because we have provided a submit function dio_bio_submit calls our submission function and ignores the bdev. So it's safe to pass any valid bdev that's used within the filesystem. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:47:01 +01:00
David Sterba	c3e14909d3	btrfs: assert extent_map bdevs and lookup_map and split This is a preparatory patch for removing extent_map::bdev. There's some history behind the code so this is only precaution to catch if things break before the actual removal happens. Logically, comparing a raw low-level block device (bdev) does not make sense for extent maps (high-level objects). This had no effect in practice but was quite confusing in the code. The lookup_map is set iff EXTENT_FLAG_FS_MAPPING is set. The two pointers were stored in the same bytes and used potentially in two meanings. Now they're split, so the asserts are in place to check that the condition will not change. The lookup map pointer misused bdev, this has been changed in commit `95617d6932` ("btrfs: cleanup, stop casting for extent_map->lookup everywhere") to the explicit type. But the semantics hasn't changed and bdev was not actually used to decide if maps are mergeable. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:47:01 +01:00
Johannes Thumshirn	32ab3d1b4d	btrfs: remove pointless indentation in btrfs_read_sys_array() Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and then process it we could just bail out early if the read disk key wasn't a BTRFS_CHUNK_ITEM_KEY. This removes a level of indentation and makes the code nicer to read. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:47:01 +01:00
Johannes Thumshirn	5ae2169290	btrfs: reduce indentation in btrfs_may_alloc_data_chunk In btrfs_may_alloc_data_chunk() we're checking if the chunk type is of type BTRFS_BLOCK_GROUP_DATA and if it is we process it. Instead of checking if the chunk type is a BTRFS_BLOCK_GROUP_DATA chunk we can negate the check and bail out early if it isn't. This makes the code a bit more readable. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:47:00 +01:00
Johannes Thumshirn	721860d578	btrfs: remove pointless local variable in lock_stripe_add() In lock_stripe_add() we're caching the bucket for the stripe hash table just for a single call to dereference the stripe hash. If we just directly call rbio_bucket() we can safe the pointless local variable. Also move the dereferencing of the stripe hash outside of the variable declaration block to not break over the 80 characters limit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:47:00 +01:00
Johannes Thumshirn	9d6cb1b0f9	btrfs: raid56: reduce indentation in lock_stripe_add In lock_stripe_add() we're traversing the stripe hash list and check if the current list element's raid_map equals is equal to the raid bio's raid_map. If both are equal we continue processing. If we'd check for inequality instead of equality we can reduce one level of indentation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:47:00 +01:00
Nikolay Borisov	bc80230e0e	btrfs: Return offset from find_desired_extent Instead of using an input pointer parameter as the return value and have an int as the return type of find_desired_extent, rework the function to directly return the found offset. Doing that the 'ret' variable in btrfs_llseek_file can be removed. Additional (subjective) benefit is that btrfs' llseek function now resemebles those of the other major filesystems. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:47:00 +01:00
Nikolay Borisov	2034f3b470	btrfs: Simplify btrfs_file_llseek Handle SEEK_END/SEEK_CUR in a single 'default' case by directly returning from generic_file_llseek. This makes the 'out' label redundant. Finally return directly the vale from vfs_setpos. No semantic changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:59 +01:00
Nikolay Borisov	d79b7c26b1	btrfs: Speed up btrfs_file_llseek Modifying the file position is done on a per-file basis. This renders holding the inode lock for writing useless and makes the performance of concurrent llseek's abysmal. Fix this by holding the inode for read. This provides protection against concurrent truncates and find_desired_extent already includes proper extent locking for the range which ensures proper locking against concurrent writes. SEEK_CUR and SEEK_END can be done lockessly. The former is synchronized by file::f_lock spinlock. SEEK_END is not synchronized but atomic, but that's OK since there is not guarantee that SEEK_END will always be at the end of the file in the face of tail modifications. This change brings ~82% performance improvement when doing a lot of parallel fseeks. The workload essentially does: for (d=0; d<num_seek_read; d++) { /* offset %= 16777216; / fseek (f, 256 d % 16777216, SEEK_SET); fread (buffer, 64, 1, f); } Without patch: num workprocesses = 16 num fseek/fread = 8000000 step = 256 fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 real 0m41.412s user 0m28.777s sys 2m16.510s With patch: num workprocesses = 16 num fseek/fread = 8000000 step = 256 fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 real 0m11.479s user 0m27.629s sys 0m21.040s Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:59 +01:00
David Sterba	0cf2521313	btrfs: compression: remove ops pointer from workspace_manager We can infer the ops from the type that is now passed to all functions that would need it, this makes workspace_manager::ops redundant and can be removed. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:59 +01:00
David Sterba	1e00235160	btrfs: compression: inline free_workspace Replace indirect calls to free_workspace by switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:59 +01:00
David Sterba	a3bbd2a9ee	btrfs: compression: pass type to btrfs_put_workspace We can infer the workspace_manager from type and the type will be used in the following patch to call a common helper for free_workspace. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:59 +01:00
David Sterba	c778df1406	btrfs: compression: inline alloc_workspace Replace indirect calls to alloc_workspace by switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:58 +01:00
David Sterba	5907a9bb13	btrfs: compression: pass type to btrfs_get_workspace We can infer the workspace_manager from type and the type will be used in the following patch to call a common helper for alloc_workspace. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:58 +01:00
David Sterba	bd3a5287cc	btrfs: compression: inline put_workspace Similar to get_workspace, majority of the callbacks is trivial, we don't gain anything by the indirection, so replace them by a switch function. Trivial callback implementations use the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:58 +01:00
David Sterba	6a0d12724b	btrfs: compression: inline get_workspace Majority of the callbacks is trivial, we don't gain anything by the indirection, so replace them by a switch function. ZLIB needs to adjust level in the callback and ZSTD workspace management is complex, the rest is call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:58 +01:00
David Sterba	d20f395f98	btrfs: compression: export alloc/free/get/put callbacks of all algos The indirect calls will be replaced by a switch in compression.c. (Switch is faster than indirect calls with when Spectre mitigations are enabled). Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:58 +01:00
David Sterba	2510307e6c	btrfs: compression: inline cleanup_workspace_manager Replace loop calling to all algos with a list of direct calls to the cleanup manager callback. When that becomes trivial it is replaced by direct call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:57 +01:00
David Sterba	2dba714390	btrfs: compression: let workspace manager cleanup take only the type With the access to the workspace structures, we can look it up together with the compression ops inside the workspace manager cleanup helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:57 +01:00
David Sterba	d551703347	btrfs: compression: inline init_workspace_manager Replace loop calling to all algos with a list of direct calls to the init manager callback. When that becomes trivial it is replaced by direct call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:57 +01:00
David Sterba	975db48330	btrfs: compression: let workspace manager init take only the type With the access to the workspace structures, we can look it up together with the compression ops inside the workspace manager init helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:57 +01:00
David Sterba	be95104531	btrfs: compression: attach workspace manager to the ops There's a lot of indirection when the generic code calls into algo-specific callbacks to reach the private workspace manager structure and back to the generic code. To simplify that, export the workspace manager for heuristic, LZO and ZLIB, while ZSTD is going to use it's own manager. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:57 +01:00
David Sterba	1e4eb74654	btrfs: switch compression callbacks to direct calls The indirect calls bring some overhead due to spectre vulnerability mitigations. The number of cases is small and below the threshold (10-20) where indirect call would be better. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:57 +01:00
David Sterba	c4bf665a31	btrfs: export compression and decompression callbacks Export compress_pages, decompress_bio and decompress callbacks for all compression algos. The indirect calls will be replaced by a switch. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:56 +01:00
Josef Bacik	a60adce85f	btrfs: use btrfs_block_group_cache_done in update_block_group When free'ing extents in a block group we check to see if the block group is not cached, and then cache it if we need to. However we'll just carry on as long as we're loading the cache. This is problematic because we are dirtying the block group here. If we are fast enough we could do a transaction commit and clear the free space cache while we're still loading the space cache in another thread. This truncates the free space inode, which will keep it from loading the space cache. Fix this by using the btrfs_block_group_cache_done helper so that we try to load the space cache unconditionally here, which will result in the caller waiting for the fast caching to complete and keep us from truncating the free space inode. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:56 +01:00
Josef Bacik	3797136b62	btrfs: check page->mapping when loading free space cache While testing 5.2 we ran into the following panic [52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001 [52238.105608] RIP: 0010:drop_buffers+0x3d/0x150 [52238.304051] Call Trace: [52238.308958] try_to_free_buffers+0x15b/0x1b0 [52238.317503] shrink_page_list+0x1164/0x1780 [52238.325877] shrink_inactive_list+0x18f/0x3b0 [52238.334596] shrink_node_memcg+0x23e/0x7d0 [52238.342790] ? do_shrink_slab+0x4f/0x290 [52238.350648] shrink_node+0xce/0x4a0 [52238.357628] balance_pgdat+0x2c7/0x510 [52238.365135] kswapd+0x216/0x3e0 [52238.371425] ? wait_woken+0x80/0x80 [52238.378412] ? balance_pgdat+0x510/0x510 [52238.386265] kthread+0x111/0x130 [52238.392727] ? kthread_create_on_node+0x60/0x60 [52238.401782] ret_from_fork+0x1f/0x30 The page we were trying to drop had a page->private, but had no page->mapping and so called drop_buffers, assuming that we had a buffer_head on the page, and then panic'ed trying to deref 1, which is our page->private for data pages. This is happening because we're truncating the free space cache while we're trying to load the free space cache. This isn't supposed to happen, and I'll fix that in a followup patch. However we still shouldn't allow those sort of mistakes to result in messing with pages that do not belong to us. So add the page->mapping check to verify that we still own this page after dropping and re-acquiring the page lock. This page being unlocked as: btrfs_readpage extent_read_full_page __extent_read_full_page __do_readpage if (!nr) unlock_page <-- nr can be 0 only if submit_extent_page returns an error CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add callchain ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:56 +01:00
Filipe Manana	536870071d	Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc In the fixup worker, if we fail to mark the range as delalloc in the io tree, we must release the previously reserved metadata, as well as update the outstanding extents counter for the inode, otherwise we leak metadata space. In pratice we can't return an error from btrfs_set_extent_delalloc(), which is just a wrapper around __set_extent_bit(), as for most errors __set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as well) and returning an -EEXIST error doesn't happen in this case since the exclusive bits parameter always has a value of 0 through this code path. Nevertheless, just fix the error handling in the fixup worker, in case one day __set_extent_bit() can return an error to this code path. Fixes: `f3038ee3a3` ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:56 +01:00
Filipe Manana	a0e248bb50	Btrfs: fix negative subv_writers counter and data space leak after buffered write When doing a buffered write it's possible to leave the subv_writers counter of the root, used for synchronization between buffered nocow writers and snapshotting. This happens in an exceptional case like the following: 1) We fail to allocate data space for the write, since there's not enough available data space nor enough unallocated space for allocating a new data block group; 2) Because of that failure, we try to go to NOCOW mode, which succeeds and therefore we set the local variable 'only_release_metadata' to true and set the root's sub_writers counter to 1 through the call to btrfs_start_write_no_snapshotting() made by check_can_nocow(); 3) The call to btrfs_copy_from_user() returns zero, which is very unlikely to happen but not impossible; 4) No pages are copied because btrfs_copy_from_user() returned zero; 5) We call btrfs_end_write_no_snapshotting() which decrements the root's subv_writers counter to 0; 6) We don't set 'only_release_metadata' back to 'false' because we do it only if 'copied', the value returned by btrfs_copy_from_user(), is greater than zero; 7) On the next iteration of the while loop, which processes the same page range, we are now able to allocate data space for the write (we got enough data space released in the meanwhile); 8) After this if we fail at btrfs_delalloc_reserve_metadata(), because now there isn't enough free metadata space, or in some other place further below (prepare_pages(), lock_and_cleanup_extent_if_need(), btrfs_dirty_pages()), we break out of the while loop with 'only_release_metadata' having a value of 'true'; 9) Because 'only_release_metadata' is 'true' we end up decrementing the root's subv_writers counter to -1 (through a call to btrfs_end_write_no_snapshotting()), and we also end up not releasing the data space previously reserved through btrfs_check_data_free_space(). As a consequence the mechanism for synchronizing NOCOW buffered writes with snapshotting gets broken. Fix this by always setting 'only_release_metadata' to false at the start of each iteration. Fixes: `8257b2dc3c` ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume") Fixes: `7ee9e4405f` ("Btrfs: check if we can nocow if we don't have data space") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:56 +01:00
Marcos Paulo de Souza	b929c1d831	btrfs: ioctl: Try to use btrfs_fs_info instead of file Some functions are doing some unnecessary indirection to reach the btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info struct instead of a file. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:55 +01:00
Anand Jain	4273eaff9b	btrfs: use bool argument in free_root_pointers() We don't need int argument bool shall do in free_root_pointers(). And rename the argument as it confused two people. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:55 +01:00
Chengguang Xu	ce96b7ffd1	btrfs: use better definition of number of compression type The compression type upper limit constant is the same as the last value and this is confusing. In order to keep coding style consistent, use BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of 'NR' being one more than the last value. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:55 +01:00
Chengguang Xu	b9b1a53e18	btrfs: use enum for extent type defines Use enum to replace macro definitions of extent types. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:55 +01:00
Chengguang Xu	b2cd295964	btrfs: props: remove unnecessary hash_init() DEFINE_HASHTABLE itself has already included initialization code, we don't have to call hash_init() again, so remove it. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:55 +01:00
Nikolay Borisov	8d510121bf	btrfs: Rename btrfs_join_transaction_nolock This function is used only during the final phase of freespace cache writeout. This is necessary since using the plain btrfs_join_transaction api is deadlock prone. The deadlock looks like: T1: btrfs_commit_transaction commit_cowonly_roots btrfs_write_dirty_block_groups btrfs_wait_cache_io __btrfs_wait_cache_io btrfs_wait_ordered_range <-- Triggers ordered IO for freespace inode and blocks transaction commit until freespace cache writeout T2: <-- after T1 has triggered the writeout finish_ordered_fn btrfs_finish_ordered_io btrfs_join_transaction <--- this would block waiting for current transaction to commit, but since trans commit is waiting for this writeout to finish The special purpose functions prevents it by simply skipping the "wait for writeout" since it's guaranteed the transaction won't proceed until we are done. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:54 +01:00
Nikolay Borisov	ce6d3eb6fd	btrfs: User assert to document transaction requirement Using an ASSERT in btrfs_pin_extent allows to more stringently observe whether the function is called under a transaction or not. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:54 +01:00
David Sterba	67439dadb0	btrfs: opencode extent_buffer_get The helper is trivial and we can understand what the atomic_inc on something named refs does. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:54 +01:00
Tejun Heo	f7bddf1e27	btrfs: Avoid getting stuck during cyclic writebacks During a cyclic writeback, extent_write_cache_pages() uses done_index to update the writeback_index after the current run is over. However, instead of current index + 1, it gets to to the current index itself. Unfortunately, this, combined with returning on EOF instead of looping back, can lead to the following pathlogical behavior. 1. There is a single file which has accumulated enough dirty pages to trigger balance_dirty_pages() and the writer appending to the file with a series of short writes. 2. balance_dirty_pages kicks in, wakes up background writeback and sleeps. 3. Writeback kicks in and the cursor is on the last page of the dirty file. Writeback is started or skipped if already in progress. As it's EOF, extent_write_cache_pages() returns and the cursor is set to done_index which is pointing to the last page. 4. Writeback is done. Nothing happens till balance_dirty_pages finishes, at which point we go back to #1. This can almost completely stall out writing back of the file and keep the system over dirty threshold for a long time which can mess up the whole system. We encountered this issue in production with a package handling application which can reliably reproduce the issue when running under tight memory limits. Reading the comment in the error handling section, this seems to be to avoid accidentally skipping a page in case the write attempt on the page doesn't succeed. However, this concern seems bogus. On each page, the code either: * Skips and moves onto the next page. * Fails issue and sets done_index to index + 1. * Successfully issues and continue to the next page if budget allows and not EOF. IOW, as long as it's not EOF and there's budget, the code never retries writing back the same page. Only when a page happens to be the last page of a particular run, we end up retrying the page, which can't possibly guarantee anything data integrity related. Besides, cyclic writes are only used for non-syncing writebacks meaning that there's no data integrity implication to begin with. Fix it by always setting done_index past the current page being processed. Note that this problem exists in other writepages too. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:54 +01:00
Marcos Paulo de Souza	a9143bd31c	btrfs: block-group: Rework documentation of check_system_chunk function Commit `4617ea3a52` (" Btrfs: fix necessary chunk tree space calculation when allocating a chunk") removed the is_allocation argument from check_system_chunk, since the formula for reserving the necessary space for allocation or removing a chunk would be the same. So, rework the comment by removing the mention of is_allocation argument. Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:54 +01:00
Qu Wenruo	c06631b0d8	btrfs: Enhance error output for write time tree checker Unlike read time tree checker errors, write time error can't be inspected by "btrfs inspect dump-tree", so we need extra information to determine what's going wrong. The patch will add the following output for write time tree checker error: - The content of the offending tree block To help determining if it's a false alert. - Kernel WARN_ON() for debug build This is helpful for us to detect unexpected write time tree checker error, especially fstests could catch the dmesg. Since the WARN_ON() is only triggered for write time tree checker, test cases utilizing dm-error won't trigger this WARN_ON(), thus no extra noise. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:54 +01:00
Qu Wenruo	80d7fd1e09	btrfs: tree-checker: Refactor prev_key check for ino into a function Refactor the check for prev_key->objectid of the following key types into one function, check_prev_ino(): - EXTENT_DATA - INODE_REF - DIR_INDEX - DIR_ITEM - XATTR_ITEM Also add the check of prev_key for INODE_REF. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:53 +01:00
Chris Mason	dbb70becde	Btrfs: extent_write_locked_range() should attach inode->i_wb extent_write_locked_range() is used when we're falling back to buffered IO from inside of compression. It allocates its own wbc and should associate it with the inode's i_wb to make sure the IO goes down from the correct cgroup. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:53 +01:00
Chris Mason	ec39f7696c	Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios Async CRCs and compression submit IO through helper threads, which means they have IO priority inversions when cgroup IO controllers are in use. This flags all of the writes submitted by btrfs helper threads as REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg work items to avoid the priority inversion. For the compression code, we take a reference on the wbc's blkg css and pass it down to the async workers. For the async CRCs, the bio already has the correct css, we just need to tell the block layer to use REQ_CGROUP_PUNT. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Chris Mason <clm@fb.com> Modified-and-reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:53 +01:00
Chris Mason	1d53c9e672	Btrfs: only associate the locked page with one async_chunk struct The btrfs writepages function collects a large range of pages flagged for delayed allocation, and then sends them down through the COW code for processing. When compression is on, we allocate one async_chunk structure for every 512K, and then run those pages through the compression code for IO submission. writepages starts all of this off with a single page, locked by the original call to extent_write_cache_pages(), and it's important to keep track of this page because it has already been through clear_page_dirty_for_io(). The btrfs async_chunk struct has a pointer to the locked_page, and when we're redirtying the page because compression had to fallback to uncompressed IO, we use page->index to decide if a given async_chunk struct really owns that page. But, this is racey. If a given delalloc range is broken up into two async_chunks (chunkA and chunkB), we can end up with something like this: compress_file_range(chunkA) submit_compress_extents(chunkA) submit compressed bios(chunkA) put_page(locked_page) compress_file_range(chunkB) ... Or: async_cow_submit submit_compressed_extents <--- falls back to buffered writeout cow_file_range extent_clear_unlock_delalloc __process_pages_contig put_page(locked_pages) async_cow_submit The end result is that chunkA is completed and cleaned up before chunkB even starts processing. This means we can free locked_page() and reuse it elsewhere. If we get really lucky, it'll have the same page->index in its new home as it did before. While we're processing chunkB, we might decide we need to fall back to uncompressed IO, and so compress_file_range() will call __set_page_dirty_nobufers() on chunkB->locked_page. Without cgroups in use, this creates as a phantom dirty page, which isn't great but isn't the end of the world. What can happen, it can go through the fixup worker and the whole COW machinery again: in submit_compressed_extents(): while (async extents) { ... cow_file_range if (!page_started ...) extent_write_locked_range else if (...) unlock_page continue; This hasn't been observed in practice but is still possible. With cgroups in use, we might crash in the accounting code because page->mapping->i_wb isn't set. BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0 IP: percpu_counter_add_batch+0x11/0x70 PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 16 PID: 2172 Comm: rm Not tainted RIP: 0010:percpu_counter_add_batch+0x11/0x70 RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286 RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115 RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090 RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000 R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001 FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: account_page_cleaned+0x15b/0x1f0 __cancel_dirty_page+0x146/0x200 truncate_cleanup_page+0x92/0xb0 truncate_inode_pages_range+0x202/0x7d0 btrfs_evict_inode+0x92/0x5a0 evict+0xc1/0x190 do_unlinkat+0x176/0x280 do_syscall_64+0x63/0x1a0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 The fix here is to make asyc_chunk->locked_page NULL everywhere but the one async_chunk struct that's allowed to do things to the locked page. Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/ Fixes: `771ed689d2` ("Btrfs: Optimize compressed writeback and reads") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Chris Mason <clm@fb.com> [ update changelog from mail thread discussion ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:53 +01:00
Chris Mason	ba8a9d0795	Btrfs: delete the entire async bio submission framework Now that we're not using btrfs_schedule_bio() anymore, delete all the code that supported it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:53 +01:00
Chris Mason	08635bae0b	Btrfs: stop using btrfs_schedule_bio() btrfs_schedule_bio() hands IO off to a helper thread to do the actual submit_bio() call. This has been used to make sure async crc and compression helpers don't get stuck on IO submission. To maintain good performance, over time the IO submission threads duplicated some IO scheduler characteristics such as high and low priority IOs and they also made some ugly assumptions about request allocation batch sizes. All of this cost at least one extra context switch during IO submission, and doesn't fit well with the modern blkmq IO stack. So, this commit stops using btrfs_schedule_bio(). We may need to adjust the number of async helper threads for crcs and compression, but long term it's a better path. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:52 +01:00
David Sterba	e1f60a6580	btrfs: add __pure attribute to functions The attribute is more relaxed than const and the functions could dereference pointers, as long as the observable state is not changed. We do have such functions, based on -Wsuggest-attribute=pure . The visible effects of this patch are negligible, there are differences in the assembly but hard to summarize. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:52 +01:00
David Sterba	4143cb8b6f	btrfs: add const function attribute For some reason the attribute is called __attribute_const__ and not __const, marks functions that have no observable effects on program state, IOW not reading pointers, just the arguments and calculating a value. Allows the compiler to do some optimizations, based on -Wsuggest-attribute=const . The effects are rather small, though, about 60 bytes decrese of btrfs.ko. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:52 +01:00
David Sterba	b105e92755	btrfs: add __cold attribute to more functions The attribute can mark functions supposed to be called rarely if at all and the text can be moved to sections far from the other code. The attribute has been added to several functions already, this patch is based on hints given by gcc -Wsuggest-attribute=cold. The net effect of this patch is decrease of btrfs.ko by 1000-1300, depending on the config options. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:52 +01:00
David Sterba	4c66e0d424	btrfs: drop unused parameter is_new from btrfs_iget The parameter is now always set to NULL and could be dropped. The last user was get_default_root but that got reworked in `05dbe6837b` ("Btrfs: unify subvol= and subvolid= mounting") and the parameter became unused. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:52 +01:00
Josef Bacik	baf320b9d5	btrfs: use refcount_inc_not_zero in kill_all_nodes We hit the following warning while running down a different problem [ 6197.175850] ------------[ cut here ]------------ [ 6197.185082] refcount_t: underflow; use-after-free. [ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60 [ 6197.521792] Call Trace: [ 6197.526687] __btrfs_release_delayed_node+0x76/0x1c0 [ 6197.536615] btrfs_kill_all_delayed_nodes+0xec/0x130 [ 6197.546532] ? __btrfs_btree_balance_dirty+0x60/0x60 [ 6197.556482] btrfs_clean_one_deleted_snapshot+0x71/0xd0 [ 6197.566910] cleaner_kthread+0xfa/0x120 [ 6197.574573] kthread+0x111/0x130 [ 6197.581022] ? kthread_create_on_node+0x60/0x60 [ 6197.590086] ret_from_fork+0x1f/0x30 [ 6197.597228] ---[ end trace 424bb7ae00509f56 ]--- This is because the free side drops the ref without the lock, and then takes the lock if our refcount is 0. So you can have nodes on the tree that have a refcount of 0. Fix this by zero'ing out that element in our temporary array so we don't try to kill it again. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:51 +01:00
Anand Jain	aa6c0df73e	btrfs: print process name and pid that calls device scanning Its very helpful if we had logged the device scanner process name to debug the race condition between the systemd-udevd scan and the user initiated device forget command. This patch adds process name and pid to the scan message. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add pid to the message ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:51 +01:00
Nikolay Borisov	725af92a62	btrfs: Open-code name_in_log_ref in replay_one_name That function adds unnecessary indirection between backref_in_log and the caller. Furthermore it also "downgrades" backref_in_log's return value to a boolean, when in fact it could very well be an error. Rectify the situation by simply opencoding name_in_log_ref in replay_one_name and properly handling possible return codes from backref_in_log. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:51 +01:00
Nikolay Borisov	d3316c8233	btrfs: Properly handle backref_in_log retval This function can return a negative error value if btrfs_search_slot errors for whatever reason or if btrfs_alloc_path runs out of memory. This is currently problemattic because backref_in_log is treated by its callers as if it returns boolean. Fix this by adding proper error handling in callers. That also enables the function to return the direct error code from btrfs_search_slot. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:51 +01:00
Nikolay Borisov	89cbf5f6b6	btrfs: Don't opencode btrfs_find_name_in_backref in backref_in_log Direct replacement, though note that the inside of the loop in btrfs_find_name_in_backref is organized in a slightly different way but is equvalent. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:51 +01:00
Qu Wenruo	3296bf5624	btrfs: transaction: Cleanup unused TRANS_STATE_BLOCKED The state was introduced in commit `4a9d8bdee3` ("Btrfs: make the state of the transaction more readable"), then in commit `302167c50b` ("btrfs: don't end the transaction for delayed refs in throttle") the state is completely removed. So we can just clean up the state since it's only compared but never set. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:50 +01:00
Qu Wenruo	61c047b541	btrfs: transaction: describe transaction states and transitions Add an overview of the basic btrfs transaction transitions, including the following states: - No transaction states - Transaction N [[TRANS_STATE_RUNNING]] - Transaction N [[TRANS_STATE_COMMIT_START]] - Transaction N [[TRANS_STATE_COMMIT_DOING]] - Transaction N [[TRANS_STATE_UNBLOCKED]] - Transaction N [[TRANS_STATE_COMPLETED]] For each state, the comment will include: - Basic explaination about current state - How to go next stage - What will happen if we call various start_transaction() functions - Relationship to transaction N+1 This doesn't provide tech details, but serves as a cheat sheet for reader to get into the code a little easier. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:50 +01:00
David Sterba	c1499166d1	btrfs: use has_single_bit_set for clarity Replace is_power_of_2 with the helper that is self-documenting and remove the open coded call in alloc_profile_is_valid. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:50 +01:00
David Sterba	79c8264e44	btrfs: add 64bit safe helper for power of two checks As is_power_of_two takes unsigned long, it's not safe on 32bit architectures, but we could pass any u64 value in seveal places. Add a separate helper and also an alias that better expresses the purpose for which the helper is used. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:50 +01:00
Anand Jain	e62869be1e	btrfs: balance: use term redundancy instead of integrity in message When balance reduces the number of copies of metadata, it reduces the redundancy, use the term redundancy instead of integrity. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:50 +01:00
David Sterba	1f95ec012c	btrfs: move btrfs_unlock_up_safe to other locking functions The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:49 +01:00
David Sterba	ed2b1d36a9	btrfs: move btrfs_set_path_blocking to other locking functions The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:49 +01:00
David Sterba	31f6e769ce	btrfs: make btrfs_assert_tree_locked static inline The function btrfs_assert_tree_locked is used outside of the locking code so it is exported, however we can make it static inine as it's fairly trivial. This is the only locking assertion used in release builds, inlining improves the text size by 174 bytes and reduces stack consumption in the callers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:49 +01:00
David Sterba	d6156218be	btrfs: make locking assertion helpers static inline I've noticed that none of the btrfs_assert_lock debugging helpers is inlined, despite they're short and mostly a value update. Making them inline shaves 67 from the text size, reduces stack consumption and perhaps also slightly improves the performance due to avoiding unnecessary calls. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:49 +01:00
Omar Sandoval	c9eb55db84	btrfs: get rid of pointless wtag variable in async-thread.c Commit `ac0c7cf8be` ("btrfs: fix crash when tracepoint arguments are freed by wq callbacks") added a void pointer, wtag, which is passed into trace_btrfs_all_work_done() instead of the freed work item. This is silly for a few reasons: 1. The freed work item still has the same address. 2. work is still in scope after it's freed, so assigning wtag doesn't stop anyone from using it. 3. The tracepoint has always taken a void * argument, so assigning wtag doesn't actually make things any more type-safe. (Note that the original bug in commit `bc074524e1` ("btrfs: prefix fsid to all trace events") was that the void * was implicitly casted when it was passed to btrfs_work_owner() in the trace point itself). Instead, let's add some clearer warnings as comments. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:49 +01:00
Omar Sandoval	a0cac0ec96	btrfs: get rid of unique workqueue helper functions Commit `9e0af23764` ("Btrfs: fix task hang under heavy compressed write") worked around the issue that a recycled work item could get a false dependency on the original work item due to how the workqueue code guarantees non-reentrancy. It did so by giving different work functions to different types of work. However, the fixes in the previous few patches are more complete, as they prevent a work item from being recycled at all (except for a tiny window that the kernel workqueue code handles for us). This obsoletes the previous fix, so we don't need the unique helpers for correctness. The only other reason to keep them would be so they show up in stack traces, but they always seem to be optimized to a tail call, so they don't show up anyways. So, let's just get rid of the extra indirection. While we're here, rename normal_work_helper() to the more informative btrfs_work_helper(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:48 +01:00
Omar Sandoval	57d4f0b863	btrfs: don't prematurely free work in scrub_missing_raid56_worker() Currently, scrub_missing_raid56_worker() puts and potentially frees sblock (which embeds the work item) and then submits a bio through scrub_wr_submit(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". Fix it by dropping the reference after we submit the bio. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:48 +01:00
Omar Sandoval	e732fe95e4	btrfs: don't prematurely free work in reada_start_machine_worker() Currently, reada_start_machine_worker() frees the reada_machine_work and then calls __reada_start_machine() to do readahead. This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". There _might_ already be a deadlock here: reada_start_machine_worker() can depend on itself through stacked filesystems (__read_start_machine() -> reada_start_machine_dev() -> reada_tree_block_flagged() -> read_extent_buffer_pages() -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() -> submit_bio() onto a loop device can trigger readahead on the lower filesystem). Either way, let's fix it by freeing the work at the end. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:48 +01:00
Omar Sandoval	9be490f1e1	btrfs: don't prematurely free work in end_workqueue_fn() Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds the work item) and then calls bio_endio(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". In particular, the endio call may depend on other work items. For example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() -> __btrfs_correct_data_nocsum() -> dio_read_error() -> submit_dio_repair_bio(), which submits a bio that is also completed through a end_workqueue_fn() work item. However, __btrfs_correct_data_nocsum() waits for the newly submitted bio to complete, thus it depends on another work item. This example currently usually works because we use different workqueue helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR. However, it may deadlock with stacked filesystems and is fragile overall. The proper fix is to free the work item at the very end of the work function, so let's do that. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:48 +01:00
Omar Sandoval	c495dcd6fb	btrfs: don't prematurely free work in run_ordered_work() We hit the following very strange deadlock on a system with Btrfs on a loop device backed by another Btrfs filesystem: 1. The top (loop device) filesystem queues an async_cow work item from cow_file_range_async(). We'll call this work X. 2. Worker thread A starts work X (normal_work_helper()). 3. Worker thread A executes the ordered work for the top filesystem (run_ordered_work()). 4. Worker thread A finishes the ordered work for work X and frees X (work->ordered_free()). 5. Worker thread A executes another ordered work and gets blocked on I/O to the bottom filesystem (still in run_ordered_work()). 6. Meanwhile, the bottom filesystem allocates and queues an async_cow work item which happens to be the recently-freed X. 7. The workqueue code sees that X is already being executed by worker thread A, so it schedules X to be executed _after_ worker thread A finishes (see the find_worker_executing_work() call in process_one_work()). Now, the top filesystem is waiting for I/O on the bottom filesystem, but the bottom filesystem is waiting for the top filesystem to finish, so we deadlock. This happens because we are breaking the workqueue assumption that a work item cannot be recycled while it still depends on other work. Fix it by waiting to free the work item until we are done with all of the related ordered work. P.S.: One might ask why the workqueue code doesn't try to detect a recycled work item. It actually does try by checking whether the work item has the same work function (find_worker_executing_work()), but in our case the function is the same. This is the only key that the workqueue code has available to compare, short of adding an additional, layer-violating "custom key". Considering that we're the only ones that have ever hit this, we should just play by the rules. Unfortunately, we haven't been able to create a minimal reproducer other than our full container setup using a compress-force=zstd filesystem on top of another compress-force=zstd filesystem. Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:48 +01:00
Omar Sandoval	cdc6f1668e	btrfs: get rid of unnecessary memset() of work item Commit `fc97fab0ea` ("btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue.") converted qgroup_rescan_work to be initialized with btrfs_init_work(), but it left behind an unnecessary memset(). Get rid of the memset(). Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:47 +01:00
Josef Bacik	b3f167aa6c	btrfs: move the failrec tree stuff into extent-io-tree.h This needs to be cleaned up in the future, but for now it belongs to the extent-io-tree stuff since it uses the internal tree search code. Needed to export get_state_failrec and set_state_failrec as well since we're not going to move the actual IO part of the failrec stuff out at this point. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:47 +01:00
Josef Bacik	083e75e7e6	btrfs: export find_delalloc_range This utilizes internal stuff to the extent_io_tree, so we need to export it before we move it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:47 +01:00
Josef Bacik	9c7d3a5483	btrfs: move extent_io_tree defs to their own header extent_io.c/h are huge, encompassing a bunch of different things. The extent_io_tree code can live on its own, so separate this out. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:47 +01:00
Josef Bacik	6f0d04f8e7	btrfs: separate out the extent io init function We are moving extent_io_tree into it's on file, so separate out the extent_state init stuff from extent_io_tree_init(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:47 +01:00
Josef Bacik	33ca832fef	btrfs: separate out the extent leak code We check both extent buffer and extent state leaks in the same function, separate these two functions out so we can move them around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:46 +01:00
Qu Wenruo	34ffafdba1	btrfs: ctree: Remove stray comment of setting up path lock The following comment shows up in btrfs_search_slot() with out much sense: /* * setup the path here so we can release it under lock * contention with the cow code / if (cow) { / code touching path->lock[] is far away from here / } This comment hasn't been cleaned up after the relevant code has been removed. The original code is introduced in commit `65b51a009e` ("btrfs_search_slot: reduce lock contention by cowing in two stages"): + + / + * setup the path here so we can release it under lock + * contention with the cow code + */ + p->nodes[level] = b; + if (!p->skip_locking) + p->locks[level] = 1; + But in current code, we have different timing for modifying path lock, so just remove the comment. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:46 +01:00
Qu Wenruo	abe9339d69	btrfs: ctree: Reduce one indent level for btrfs_search_old_slot() Similar to btrfs_search_slot() done in previous patch, make a shortcut for the level 0 case and allow to reduce indentation for the remaining case. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:46 +01:00
Qu Wenruo	f624d97608	btrfs: ctree: Reduce one indent level for btrfs_search_slot() In btrfs_search_slot(), we something like: if (level != 0) { /* Do search inside tree nodes/ } else { / Do search inside tree leaves / goto done; } This caused extra indent for tree node search code. Change it to something like: if (level == 0) { / Do search inside tree leaves / goto done' } / Do search inside tree nodes */ So we have more space to maneuver our code, this is especially useful as the tree nodes search code is more complex than the leaves search code. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:46 +01:00
Qu Wenruo	71bf92a9b8	btrfs: tree-checker: Add check for INODE_REF For INODE_REF we will check: - Objectid (ino) against previous key To detect missing INODE_ITEM. - No overflow/padding in the data payload Much like DIR_ITEM, but with less members to check. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:46 +01:00
Qu Wenruo	c18679ebd8	btrfs: tree-checker: Try to detect missing INODE_ITEM For the following items, key->objectid is inode number: - DIR_ITEM - DIR_INDEX - XATTR_ITEM - EXTENT_DATA - INODE_REF So in the subvolume tree, such items must have its previous item share the same objectid, e.g.: (257 INODE_ITEM 0) (257 DIR_INDEX xxx) (257 DIR_ITEM xxx) (258 INODE_ITEM 0) (258 INODE_REF 0) (258 XATTR_ITEM 0) (258 EXTENT_DATA 0) But if we have the following sequence, then there is definitely something wrong, normally some INODE_ITEM is missing, like: (257 INODE_ITEM 0) (257 DIR_INDEX xxx) (257 DIR_ITEM xxx) (258 XATTR_ITEM 0) <<< objecitd suddenly changed to 258 (258 EXTENT_DATA 0) So just by checking the previous key for above inode based key types, we can detect a missing inode item. For INODE_REF key type, the check will be added along with INODE_REF checker. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:46 +01:00
Filipe Manana	b9fae2ebee	Btrfs: make btrfs_wait_extents() static It's not used ouside of transaction.c Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:45 +01:00
Nikolay Borisov	35b814f3c5	btrfs: Add assert to catch nested transaction commit A recent patch to btrfs showed that there was at least 1 case where a nested transaction was committed. Nested transaction in this case means a code which has a transaction handle calls some function which in turn obtains a copy of the same transaction handle. In such cases the correct thing to do is for the lower callee to call btrfs_end_transaction which contains appropriate checks so as to not commit the transaction which will result in stale trans handler for the caller. To catch such cases add an assert in btrfs_commit_transaction ensuring btrfs_trans_handle::use_count is always 1. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:45 +01:00
Goldwyn Rodrigues	9cf35f6735	btrfs: simplify inode locking for RWF_NOWAIT This is similar to `942491c9e6` ("xfs: fix AIM7 regression"). Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. This causes extra contention on the lock and can be measured eg. by AIM7 benchmark. So change our read/write methods to just do the trylock for the RWF_NOWAIT case. Fixes: `edf064e7c6` ("btrfs: nowait aio support") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 12:46:45 +01:00
Linus Torvalds	afd7a71872	for-5.4-rc7-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3MUkoACgkQxWXV+ddt WDvAEQ//fEHZ51NbIMwJNltqF4mr6Oao0M5u0wejEgiXzmR9E1IuHUgVK+KDQmSu wZl/y+RTlQC0TiURyStaVFBreXEqiuB79my9u4iDeNv/4UJQB42qpmYB4EuviDgB mxb9bFpWTLkO6Oc+vMrGF3BOmVsQKlq2nOua25g8VFtApQ6uiEfbwBOslCcC8kQB ZpNBl6x74xz/VWNWZnRStBfwYjRitKNDVU6dyIyRuLj8cktqfGBxGtx7/w0wDiZT kPR1bNtdpy3Ndke6H/0G6plRWi9kENqcN43hvrz54IKh2l+Jd2/as51j4Qq2tJU9 KaAnJzRaSePxc2m0SqtgZTvc2BYSOg7dqaCyHxBB0CUBdTdJdz2TVZ2KM9MiLns4 1haHBLo4l8o8zeYZpW05ac6OXKY4f8qsjWPEGshn4FDbq0TrHQzYxAF3c0X3hPag SnuvilgYUuYal+n87qinePg/ZmVrrBXPRycpQnn7FxqezJbf/2WUEojUVQnreU05 mdp8mulxQxyFhgEvO7K1uDtlP8bqW69IO9M//6IWzGNKTDK2SRI08ULplghqgyna 8SG0+y9w26r8UIWDhuvPbdfUMSG3kEH8yLFK84AFDMVJJxOnfznE3sC8sGOiP5q9 OUkl8l7bhDkyAdWZY57gGUobebdPfnLxRV9A+LZQ2El1kSOEK18= =xzXs -----END PGP SIGNATURE----- Merge tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fix for an older bug that has started to show up during testing (because of an updated test for rename exchange). It's an in-memory corruption caused by local variable leaking out of the function scope" * tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix log context list corruption after rename exchange operation	2019-11-13 12:06:10 -08:00
Filipe Manana	e6c617102c	Btrfs: fix log context list corruption after rename exchange operation During rename exchange we might have successfully log the new name in the source root's log tree, in which case we leave our log context (allocated on stack) in the root's list of log contextes. However we might fail to log the new name in the destination root, in which case we fallback to a transaction commit later and never sync the log of the source root, which causes the source root log context to remain in the list of log contextes. This later causes invalid memory accesses because the context was allocated on stack and after rename exchange finishes the stack gets reused and overwritten for other purposes. The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can detect this and report something like the following: [ 691.489929] ------------[ cut here ]------------ [ 691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38). [ 691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0 (...) [ 691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1 [ 691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [ 691.490003] RIP: 0010:__list_add_valid+0x95/0xe0 (...) [ 691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282 [ 691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000 [ 691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0 [ 691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114 [ 691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff [ 691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b [ 691.490019] FS: 00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000 [ 691.490021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0 [ 691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 691.490029] Call Trace: [ 691.490058] btrfs_log_inode_parent+0x667/0x2730 [btrfs] [ 691.490083] ? join_transaction+0x24a/0xce0 [btrfs] [ 691.490107] ? btrfs_end_log_trans+0x80/0x80 [btrfs] [ 691.490111] ? dget_parent+0xb8/0x460 [ 691.490116] ? lock_downgrade+0x6b0/0x6b0 [ 691.490121] ? rwlock_bug.part.0+0x90/0x90 [ 691.490127] ? do_raw_spin_unlock+0x142/0x220 [ 691.490151] btrfs_log_dentry_safe+0x65/0x90 [btrfs] [ 691.490172] btrfs_sync_file+0x9f1/0xc00 [btrfs] [ 691.490195] ? btrfs_file_write_iter+0x1800/0x1800 [btrfs] [ 691.490198] ? rcu_read_lock_any_held.part.11+0x20/0x20 [ 691.490204] ? __do_sys_newstat+0x88/0xd0 [ 691.490207] ? cp_new_stat+0x5d0/0x5d0 [ 691.490218] ? do_fsync+0x38/0x60 [ 691.490220] do_fsync+0x38/0x60 [ 691.490224] __x64_sys_fdatasync+0x32/0x40 [ 691.490228] do_syscall_64+0x9f/0x540 [ 691.490233] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 691.490235] RIP: 0033:0x7f4b253ad5f0 (...) [ 691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b [ 691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0 [ 691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003 [ 691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c [ 691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4 [ 691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0 This started happening recently when running some test cases from fstests like btrfs/004 for example, because support for rename exchange was added last week to fsstress from fstests. So fix this by deleting the log context for the source root from the list if we have logged the new name in the source root. Reported-by: Su Yue <Damenly_Su@gmx.com> Fixes: `d4682ba03e` ("Btrfs: sync log after logging new name") CC: stable@vger.kernel.org # 4.19+ Tested-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-11 19:46:02 +01:00
Linus Torvalds	00aff68362	for-5.4-rc6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3GwvEACgkQxWXV+ddt WDvsfQ//UQ/TND1roMW3EmN+PLTRTUQS75Hb737r5Q66j0j94gFnBxB03R+tU8lP bH5XVpateb1wDmmdscqVQ1WM9O82bQdDNiYeLDr86+kzpLgy61rZHswZfNlDtl5M wDwyaxsrd7HndDeUZnIuaakYI9MXz9WIaNXkt0o8hSHctt0N18y23DSBFTVSh/4q T3cn4odkoBKtQ4mIn2OSmQvkl69nWRBVpjPJZIvsNszKjOo9aZTuGHrOWUV5RPiE Ho9UBkr+IjEDo8OH88vXAsHeYFIoYhEeUltjLHyF6Agwk1/Ajwp5sxXSubbfHUMQ l1YdmrTZf+l1Dxdj0sCdyK1npcgGI5IuZmIICpNUEAny9AbTtpSE3GNwtnIHAEAr cpki+1Z3lfaVSwNMYUz9Esbsb72C+f08WJHGHBMaOhjrBIwQUeWeYzTx6N7uDwNg GjaDRxjqFV2o7373isQaz7WOTatUMNtL1xvhkeceONI9NBXQRjW5rq5zECmr1ix7 lSTKDzn7yAd6eGBW0T6iYdRl4Bta/zxPiDEF5KDNPugvv23yx17LAOdS4qAiHzbx oglra/kz9D4xmKVqH7hpFaaHrzL38G4mz6aMdwBu0M7dkn9AwsXiXWSDb0n7jnK0 o96gkUBZjUSFBseUFD2s34MikFtrDynEJzPq96JHuWLguaByMcc= =CZxY -----END PGP SIGNATURE----- Merge tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few regressions and fixes for stable. Regressions: - fix a race leading to metadata space leak after task received a signal - un-deprecate 2 ioctls, marked as deprecated by mistake Fixes: - fix limit check for number of devices during chunk allocation - fix a race due to double evaluation of i_size_read inside max() macro, can cause a crash - remove wrong device id check in tree-checker" * tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range Btrfs: fix race leading to metadata space leak after task received signal btrfs: tree-checker: Fix wrong check on max devid btrfs: Consider system chunk array size for new SYSTEM chunks	2019-11-09 08:51:37 -08:00
David Sterba	a5009d3a31	btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as deprecated and scheduled for removal but we actualy do use them for 'btrfs subvolume delete -C/-c'. The deprecated thing in `ebc87351e5` should have been just the async flag for subvolume creation. The deprecation has been added in this development cycle, remove it until it's time. Fixes: `ebc87351e5` ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag") Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-04 21:42:01 +01:00
Josef Bacik	d98da49977	btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range We hit a regression while rolling out 5.2 internally where we were hitting the following panic kernel BUG at mm/page-writeback.c:2659! RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0 Call Trace: __process_pages_contig+0x25a/0x350 ? extent_clear_unlock_delalloc+0x43/0x70 submit_compressed_extents+0x359/0x4d0 normal_work_helper+0x15a/0x330 process_one_work+0x1f5/0x3f0 worker_thread+0x2d/0x3d0 ? rescuer_thread+0x340/0x340 kthread+0x111/0x130 ? kthread_create_on_node+0x60/0x60 ret_from_fork+0x1f/0x30 This is happening because the page is not locked when doing clear_page_dirty_for_io. Looking at the core dump it was because our async_extent had a ram_size of 24576 but our async_chunk range only spanned 20480, so we had a whole extra page in our ram_size for our async_extent. This happened because we try not to compress pages outside of our i_size, however a cleanup patch changed us to do actual_end = min_t(u64, i_size_read(inode), end + 1); which is problematic because i_size_read() can evaluate to different values in between checking and assigning. So either an expanding truncate or a fallocate could increase our i_size while we're doing writeout and actual_end would end up being past the range we have locked. I confirmed this was what was happening by installing a debug kernel that had actual_end = min_t(u64, i_size_read(inode), end + 1); if (actual_end > end + 1) { printk(KERN_ERR "KABOOM\n"); actual_end = end + 1; } and installing it onto 500 boxes of the tier that had been seeing the problem regularly. Last night I got my debug message and no panic, confirming what I expected. [ dsterba: the assembly confirms a tiny race window: mov 0x20(%rsp),%rax cmp %rax,0x48(%r15) # read movl $0x0,0x18(%rsp) mov %rax,%r12 mov %r14,%rax cmovbe 0x48(%r15),%r12 # eval Where r15 is inode and 0x48 is offset of i_size. The original fix was to revert `62b3762271` that would do an intermediate assignment and this would also avoid the doulble evaluation but is not future-proof, should the compiler merge the stores and call i_size_read anyway. There's a patch adding READ_ONCE to i_size_read but that's not being applied at the moment and we need to fix the bug. Instead, emulate READ_ONCE by two barrier()s that's what effectively happens. The assembly confirms single evaluation: mov 0x48(%rbp),%rax # read once mov 0x20(%rsp),%rcx mov $0x20,%edx cmp %rax,%rcx cmovbe %rcx,%rax mov %rax,(%rsp) mov %rax,%rcx mov %r14,%rax Where 0x48(%rbp) is inode->i_size stored to %eax. ] Fixes: `62b3762271` ("btrfs: Remove isize local variable in compress_file_range") CC: stable@vger.kernel.org # v5.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ changelog updated ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-04 21:41:49 +01:00
Filipe Manana	0cab7acc4a	Btrfs: fix race leading to metadata space leak after task received signal When a task that is allocating metadata needs to wait for the async reclaim job to process its ticket and gets a signal (because it was killed for example) before doing the wait, the task ends up erroring out but with space reserved for its ticket, which never gets released, resulting in a metadata space leak (more specifically a leak in the bytes_may_use counter of the metadata space_info object). Here's the sequence of steps leading to the space leak: 1) A task tries to create a file for example, so it ends up trying to start a transaction at btrfs_create(); 2) The filesystem is currently in a state where there is not enough metadata free space to satisfy the transaction's needs. So at space-info.c:__reserve_metadata_bytes() we create a ticket and add it to the list of tickets of the space info object. Also, because the metadata async reclaim job is not running, we queue a job ro run metadata reclaim; 3) In the meanwhile the task receives a signal (like SIGTERM from a kill command for example); 4) After queing the async reclaim job, at __reserve_metadata_bytes(), we unlock the metadata space info and call handle_reserve_ticket(); 5) That last function calls wait_reserve_ticket(), which acquires the lock from the metadata space info. Then in the first iteration of its while loop, it calls prepare_to_wait_event(), which returns -ERESTARTSYS because the task has a pending signal. As a result, we set the error field of the ticket to -EINTR and exit the while loop without deleting the ticket from the list of tickets (in the space info object). After exiting the loop we unlock the space info; 6) The async reclaim job is able to release enough metadata, acquires the metadata space info's lock and then reserves space for the ticket, since the ticket is still in the list of (non-priority) tickets. The space reservation happens at btrfs_try_granting_tickets(), called from maybe_fail_all_tickets(). This increments the bytes_may_use counter from the metadata space info object, sets the ticket's bytes field to zero (meaning success, that space was reserved) and removes it from the list of tickets; 7) wait_reserve_ticket() returns, with the error field of the ticket set to -EINTR. Then handle_reserve_ticket() just propagates that error to the caller. Because an error was returned, the caller does not release the reserved space, since the expectation is that any error means no space was reserved. Fix this by removing the ticket from the list, while holding the space info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns an error. Also add some comments and an assertion to guarantee we never end up with a ticket that has an error set and a bytes counter field set to zero, to more easily detect regressions in the future. This issue could be triggered sporadically by some test cases from fstests such as generic/269 for example, which tries to fill a filesystem and then kills fsstress processes running in the background. When this issue happens, we get a warning in syslog/dmesg when unmounting the filesystem, like the following: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs] (...) CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs] (...) RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286 RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8 RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001 R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508 R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100 FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x1ad/0x390 [btrfs] generic_shutdown_super+0x6c/0x110 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0xa0 [btrfs] deactivate_locked_super+0x3a/0x70 cleanup_mnt+0xb4/0x160 task_work_run+0x7e/0xc0 exit_to_usermode_loop+0xfa/0x100 do_syscall_64+0x1cb/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f4274d2cb37 (...) RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240 RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015 R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64 R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0 softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace bcf4b235461b26f6 ]--- BTRFS info (device sdb): space_info 4 has 19116032 free, is full BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536 BTRFS info (device sdb): global_block_rsv: size 0 reserved 0 BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0 BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0 BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0 BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0 Fixes: `374bf9c5cd` ("btrfs: unify error handling for ticket flushing") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-25 19:11:34 +02:00
Qu Wenruo	8bb177d18f	btrfs: tree-checker: Fix wrong check on max devid [BUG] The following script will cause false alert on devid check. #!/bin/bash dev1=/dev/test/test dev2=/dev/test/scratch1 mnt=/mnt/btrfs umount $dev1 &> /dev/null umount $dev2 &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev1 mount $dev1 $mnt _fail() { echo "!!! FAILED !!!" exit 1 } for ((i = 0; i < 4096; i++)); do btrfs dev add -f $dev2 $mnt \|\| _fail btrfs dev del $dev1 $mnt \|\| _fail dev_tmp=$dev1 dev1=$dev2 dev2=$dev_tmp done [CAUSE] Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as upper limit for devid. But we can have devid holes just like above script. So the check for devid is incorrect and could cause false alert. [FIX] Just remove the whole devid check. We don't have any hard requirement for devid assignment. Furthermore, even devid could get corrupted by a bitflip, we still have dev extents verification at mount time, so corrupted data won't sneak in. This fixes fstests btrfs/194. Reported-by: Anand Jain <anand.jain@oracle.com> Fixes: `ab4ba2e133` ("btrfs: tree-checker: Verify dev item") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-25 19:11:34 +02:00
Qu Wenruo	c17add7a1c	btrfs: Consider system chunk array size for new SYSTEM chunks For SYSTEM chunks, despite the regular chunk item size limit, there is another limit due to system chunk array size. The extra limit was removed in a refactoring, so add it back. Fixes: `e3ecdb3fde` ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk") CC: stable@vger.kernel.org # 5.3+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-25 19:11:34 +02:00
Linus Torvalds	54955e3bfd	for-5.4-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2vBPgACgkQxWXV+ddt WDvaGQ/+JbV05ML7QjufNFKjzojNPg8dOwvLqF58askMEAqzyqOrzu1YsIOjchqv eZPP+wxh6j6i8LFnbaMYRhSVMlgpK9XOR3tNStp73QlSox6/6VKu7XR4wFXAoip3 l4XI8UqO5XqDG55UTtjKyo2VNq8vgq1gRUWD6hPtfzDP6WYj4JZuXOd+dT6PbP32 VHd91RoB8Z8qmypLF9Sju6IBWNhKl91TjlRqVdfLywaK8azqaxE1UttPf5DVbE6+ MlTuXhuxkNc4ddzj//oJ1s3ZP/nXtFmIZ75+Sd6P5DfqRNeIfjkKqXvffsjqoYVI 1Wv9sDiezxRB1RZjfInhtqvdvsqcsrXrM7x6BRVqI7IZDUaH5em8IoozQT62ze/4 MJBrRIbVodx2I7EVDMNGkx2TaIDAfnW4Z3UC+YSHLoy6jht1+SA5KLQDR1G/4NeR dWht5wOXwhp8P3DoaczTUpk0DLtAtygj04fH8CG277EILLCpuWwW8iRkPttkrIM2 HrtRKrKFJNyEpq9vHSFVvpQJkzgzqBFs1UnqnEuYh6qNgWCrS4PJDu9geQ8aPGr2 pA5aEqg5b+jjWzIYDP93PF0u7kF4mFVAozn6xMf95FKmM1OupYYQg5BRe7n/DfDu yh3Ms0Mmd9+snpNJ9lDr40cobHC5CDvfvO2SRERyBeXxARainN0= =Ft+G -----END PGP SIGNATURE----- Merge tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fixes of error handling cleanup of metadata accounting with qgroups enabled - fix swapped values for qgroup tracepoints - fix race when handling full sync flag - don't start unused worker thread, functionality removed already * tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: check for the full sync flag while holding the inode lock during fsync Btrfs: fix qgroup double free after failure to reserve metadata for delalloc btrfs: tracepoints: Fix bad entry members of qgroup events btrfs: tracepoints: Fix wrong parameter order for qgroup events btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() btrfs: don't needlessly create extent-refs kernel thread btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group() Btrfs: add missing extents release on file extent cluster relocation error	2019-10-23 06:14:29 -04:00
Filipe Manana	ba0b084ac3	Btrfs: check for the full sync flag while holding the inode lock during fsync We were checking for the full fsync flag in the inode before locking the inode, which is racy, since at that that time it might not be set but after we acquire the inode lock some other task set it. One case where this can happen is on a system low on memory and some concurrent task failed to allocate an extent map and therefore set the full sync flag on the inode, to force the next fsync to work in full mode. A consequence of missing the full fsync flag set is hitting the problems fixed by commit `0c713cbab6` ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges"), BUG_ON() when dropping extents from a log tree, hitting assertion failures at tree-log.c:copy_items() or all sorts of weird inconsistencies after replaying a log due to file extents items representing ranges that overlap. So just move the check such that it's done after locking the inode and before starting writeback again. Fixes: `0c713cbab6` ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-17 20:36:02 +02:00
Filipe Manana	c7967fc149	Btrfs: fix qgroup double free after failure to reserve metadata for delalloc If we fail to reserve metadata for delalloc operations we end up releasing the previously reserved qgroup amount twice, once explicitly under the 'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a value of 'true' for its 'qgroup_free' argument, which results in btrfs_qgroup_free_meta_prealloc() being called again, so we end up having a double free. Also if we fail to reserve the necessary qgroup amount, we jump to the label 'out_fail', which calls btrfs_inode_rsv_release() and that in turns calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to reserve any qgroup amount. So we freed some amount we never reserved. So fix this by removing the call to btrfs_inode_rsv_release() in the failure path, since it's not necessary at all as we haven't changed the inode's block reserve in any way at this point. Fixes: `c8eaeac7b7` ("btrfs: reserve delalloc metadata differently") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-17 20:13:44 +02:00
Qu Wenruo	fd2b007eae	btrfs: tracepoints: Fix wrong parameter order for qgroup events [BUG] For btrfs:qgroup_meta_reserve event, the trace event can output garbage: qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2 The diff should always be alinged to sector size (4k), so there is definitely something wrong. [CAUSE] For the wrong @diff, it's caused by wrong parameter order. The correct parameters are: struct btrfs_root, s64 diff, int type. However the parameters used are: struct btrfs_root, int type, s64 diff. Fixes: `4ee0d8832c` ("btrfs: qgroup: Update trace events for metadata reservation") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-17 14:09:31 +02:00
Qu Wenruo	8702ba9396	btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() [Background] Btrfs qgroup uses two types of reserved space for METADATA space, PERTRANS and PREALLOC. PERTRANS is metadata space reserved for each transaction started by btrfs_start_transaction(). While PREALLOC is for delalloc, where we reserve space before joining a transaction, and finally it will be converted to PERTRANS after the writeback is done. [Inconsistency] However there is inconsistency in how we handle PREALLOC metadata space. The most obvious one is: In btrfs_buffered_write(): btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true); We always free qgroup PREALLOC meta space. While in btrfs_truncate_block(): btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0)); We only free qgroup PREALLOC meta space when something went wrong. [The Correct Behavior] The correct behavior should be the one in btrfs_buffered_write(), we should always free PREALLOC metadata space. The reason is, the btrfs_delalloc_* mechanism works by: - Reserve metadata first, even it's not necessary In btrfs_delalloc_reserve_metadata() - Free the unused metadata space Normally in: btrfs_delalloc_release_extents() \|- btrfs_inode_rsv_release() Here we do calculation on whether we should release or not. E.g. for 64K buffered write, the metadata rsv works like: /* The first page / reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=0 total: num_bytes=calc_inode_reservations() / The first page caused one outstanding extent, thus needs metadata rsv / / The 2nd page / reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed / The 2nd page doesn't cause new outstanding extent, needs no new meta rsv, so we free what we have reserved / / The 3rd~16th pages */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed (still space for one outstanding extent) This means, if btrfs_delalloc_release_extents() determines to free some space, then those space should be freed NOW. So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other than btrfs_qgroup_convert_reserved_meta(). The good news is: - The callers are not that hot The hottest caller is in btrfs_buffered_write(), which is already fixed by commit `336a8bb8e3` ("btrfs: Fix wrong btrfs_delalloc_release_extents parameter"). Thus it's not that easy to cause false EDQUOT. - The trans commit in advance for qgroup would hide the bug Since commit `f5fef45936` ("btrfs: qgroup: Make qgroup async transaction commit more aggressive"), when btrfs qgroup metadata free space is slow, it will try to commit transaction and free the wrongly converted PERTRANS space, so it's not that easy to hit such bug. [FIX] So to fix the problem, remove the @qgroup_free parameter for btrfs_delalloc_release_extents(), and always pass true to btrfs_inode_rsv_release(). Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: `43b18595d6` ("btrfs: qgroup: Use separate meta reservation type for delalloc") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-15 18:50:07 +02:00
David Sterba	80ed4548d0	btrfs: don't needlessly create extent-refs kernel thread The patch `32b593bfcb` ("Btrfs: remove no longer used function to run delayed refs asynchronously") removed the async delayed refs but the thread has been created, without any use. Remove it to avoid resource consumption. Fixes: `32b593bfcb` ("Btrfs: remove no longer used function to run delayed refs asynchronously") CC: stable@vger.kernel.org # 5.2+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-15 15:43:29 +02:00
Qu Wenruo	4b654acdae	btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group() In btrfs_read_block_groups(), if we have an invalid block group which has mixed type (DATA\|METADATA) while the fs doesn't have MIXED_GROUPS feature, we error out without freeing the block group cache. This patch will add the missing btrfs_put_block_group() to prevent memory leak. Note for stable backports: the file to patch in versions <= 5.3 is fs/btrfs/extent-tree.c Fixes: `49303381f1` ("Btrfs: bail out if block group has different mixed flag") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-10-11 21:27:51 +02:00

1 2 3 4 5 ...

8458 Commits