kernel_optimize_test/fs
Josef Bacik b917f9b946 btrfs: do not take the uuid_mutex in btrfs_rm_device
[ Upstream commit 8ef9dc0f14ba6124c62547a4fdc59b163d8b864e ]

We got the following lockdep splat while running fstests (specifically
btrfs/003 and btrfs/020 in a row) with the new rc.  This was uncovered
by 87579e9b7d8d ("loop: use worker per cgroup instead of kworker") which
converted loop to using workqueues, which comes with lockdep
annotations that don't exist with kworkers.  The lockdep splat is as
follows:

  WARNING: possible circular locking dependency detected
  5.14.0-rc2-custom+ #34 Not tainted
  ------------------------------------------------------
  losetup/156417 is trying to acquire lock:
  ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600

  but task is already holding lock:
  ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #5 (&lo->lo_mutex){+.+.}-{3:3}:
	 __mutex_lock+0xba/0x7c0
	 lo_open+0x28/0x60 [loop]
	 blkdev_get_whole+0x28/0xf0
	 blkdev_get_by_dev.part.0+0x168/0x3c0
	 blkdev_open+0xd2/0xe0
	 do_dentry_open+0x163/0x3a0
	 path_openat+0x74d/0xa40
	 do_filp_open+0x9c/0x140
	 do_sys_openat2+0xb1/0x170
	 __x64_sys_openat+0x54/0x90
	 do_syscall_64+0x3b/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

  -> #4 (&disk->open_mutex){+.+.}-{3:3}:
	 __mutex_lock+0xba/0x7c0
	 blkdev_get_by_dev.part.0+0xd1/0x3c0
	 blkdev_get_by_path+0xc0/0xd0
	 btrfs_scan_one_device+0x52/0x1f0 [btrfs]
	 btrfs_control_ioctl+0xac/0x170 [btrfs]
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x3b/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

  -> #3 (uuid_mutex){+.+.}-{3:3}:
	 __mutex_lock+0xba/0x7c0
	 btrfs_rm_device+0x48/0x6a0 [btrfs]
	 btrfs_ioctl+0x2d1c/0x3110 [btrfs]
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x3b/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

  -> #2 (sb_writers#11){.+.+}-{0:0}:
	 lo_write_bvec+0x112/0x290 [loop]
	 loop_process_work+0x25f/0xcb0 [loop]
	 process_one_work+0x28f/0x5d0
	 worker_thread+0x55/0x3c0
	 kthread+0x140/0x170
	 ret_from_fork+0x22/0x30

  -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
	 process_one_work+0x266/0x5d0
	 worker_thread+0x55/0x3c0
	 kthread+0x140/0x170
	 ret_from_fork+0x22/0x30

  -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
	 __lock_acquire+0x1130/0x1dc0
	 lock_acquire+0xf5/0x320
	 flush_workqueue+0xae/0x600
	 drain_workqueue+0xa0/0x110
	 destroy_workqueue+0x36/0x250
	 __loop_clr_fd+0x9a/0x650 [loop]
	 lo_ioctl+0x29d/0x780 [loop]
	 block_ioctl+0x3f/0x50
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x3b/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

  other info that might help us debug this:
  Chain exists of:
    (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
   Possible unsafe locking scenario:
	 CPU0                    CPU1
	 ----                    ----
    lock(&lo->lo_mutex);
				 lock(&disk->open_mutex);
				 lock(&lo->lo_mutex);
    lock((wq_completion)loop0);

   *** DEADLOCK ***
  1 lock held by losetup/156417:
   #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]

  stack backtrace:
  CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   dump_stack_lvl+0x57/0x72
   check_noncircular+0x10a/0x120
   __lock_acquire+0x1130/0x1dc0
   lock_acquire+0xf5/0x320
   ? flush_workqueue+0x84/0x600
   flush_workqueue+0xae/0x600
   ? flush_workqueue+0x84/0x600
   drain_workqueue+0xa0/0x110
   destroy_workqueue+0x36/0x250
   __loop_clr_fd+0x9a/0x650 [loop]
   lo_ioctl+0x29d/0x780 [loop]
   ? __lock_acquire+0x3a0/0x1dc0
   ? update_dl_rq_load_avg+0x152/0x360
   ? lock_is_held_type+0xa5/0x120
   ? find_held_lock.constprop.0+0x2b/0x80
   block_ioctl+0x3f/0x50
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x3b/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x7f645884de6b

Usually the uuid_mutex exists to protect the fs_devices that map
together all of the devices that match a specific uuid.  In rm_device
we're messing with the uuid of a device, so it makes sense to protect
that here.

However in doing that it pulls in a whole host of lockdep dependencies,
as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus
we end up with the dependency chain under the uuid_mutex being added
under the normal sb write dependency chain, which causes problems with
loop devices.

We don't need the uuid mutex here however.  If we call
btrfs_scan_one_device() before we scratch the super block we will find
the fs_devices and not find the device itself and return EBUSY because
the fs_devices is open.  If we call it after the scratch happens it will
not appear to be a valid btrfs file system.

We do not need to worry about other fs_devices modifying operations here
because we're protected by the exclusive operations locking.

So drop the uuid_mutex here in order to fix the lockdep splat.

A more detailed explanation from the discussion:

We are worried about rm and scan racing with each other, before this
change we'll zero the device out under the UUID mutex so when scan does
run it'll make sure that it can go through the whole device scan thing
without rm messing with us.

We aren't worried if the scratch happens first, because the result is we
don't think this is a btrfs device and we bail out.

The only case we are concerned with is we scratch _after_ scan is able
to read the superblock and gets a seemingly valid super block, so lets
consider this case.

Scan will call device_list_add() with the device we're removing.  We'll
call find_fsid_with_metadata_uuid() and get our fs_devices for this
UUID.  At this point we lock the fs_devices->device_list_mutex.  This is
what protects us in this case, but we have two cases here.

1. We aren't to the device removal part of the RM.  We found our device,
   and device name matches our path, we go down and we set total_devices
   to our super number of devices, which doesn't affect anything because
   we haven't done the remove yet.

2. We are past the device removal part, which is protected by the
   device_list_mutex.  Scan doesn't find the device, it goes down and
   does the

   if (fs_devices->opened)
	   return -EBUSY;

   check and we bail out.

Nothing about this situation is ideal, but the lockdep splat is real,
and the fix is safe, tho admittedly a bit scary looking.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy more from the discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18 14:04:00 +01:00
..
9p new helper: inode_wrong_type() 2021-09-08 08:49:01 +02:00
adfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
affs fs/affs: release old buffer head on error path 2021-03-04 11:38:37 +01:00
afs afs: Fix updating of i_blocks on file/dir extension 2021-09-30 10:11:01 +02:00
autofs autofs: harden ioctl table 2020-10-16 11:11:22 -07:00
befs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
bfs bfs: don't use WARNING: string when it's just info. 2021-01-06 14:56:52 +01:00
btrfs btrfs: do not take the uuid_mutex in btrfs_rm_device 2021-11-18 14:04:00 +01:00
cachefiles fs/cachefiles: Remove wait_bit_key layout dependency 2021-03-30 14:32:07 +02:00
ceph ceph: fix handling of "meta" errors 2021-10-27 09:56:51 +02:00
cifs smb3: correct smb3 ACL security descriptor 2021-10-09 14:40:57 +02:00
coda
configfs configfs: fix memleak in configfs_release_bin_file 2021-07-14 16:56:48 +02:00
cramfs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
crypto fscrypt: allow 256-bit master keys with AES-256-XTS 2021-11-18 14:03:54 +01:00
debugfs debugfs: debugfs_create_file_size(): use IS_ERR to check for error 2021-10-06 15:55:59 +02:00
devpts
dlm fs: dlm: fix memory leak when fenced 2021-07-14 16:55:59 +02:00
ecryptfs Revert "ecryptfs: replace BUG_ON with error handling code" 2021-05-26 12:06:55 +02:00
efivarfs efivarfs: revert "fix memory leak in efivarfs_create()" 2020-11-25 16:55:02 +01:00
efs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
erofs erofs: fix error return code in erofs_read_superblock() 2021-07-14 16:56:53 +02:00
exfat exfat: fix incorrect loading of i_blocks for large files 2021-11-18 14:03:37 +01:00
exportfs
ext2 ext2: fix sleeping in atomic bugs on error 2021-10-09 14:40:56 +02:00
ext4 ext4: refresh the ext4_ext_path struct after dropping i_data_sem. 2021-11-18 14:03:40 +01:00
f2fs f2fs: should put a page beyond EOF when preparing a write 2021-09-18 13:40:16 +02:00
fat [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
freevxfs
fscache fscache: Fix cookie key hashing 2021-09-18 13:40:15 +02:00
fuse fuse: fix page stealing 2021-11-18 14:03:40 +01:00
gfs2 gfs2: Fix glock_hash_walk bugs 2021-11-18 14:04:00 +01:00
hfs hfs: add lock nesting notation to hfs_find_init 2021-07-31 08:16:12 +02:00
hfsplus hfsplus: prevent corruption in shrinking truncate 2021-05-19 10:13:10 +02:00
hostfs hostfs: fix memory handling in follow_link() 2021-04-14 08:42:06 +02:00
hpfs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
hugetlbfs hugetlbfs: fix mount mode command line processing 2021-07-28 14:35:46 +02:00
iomap treewide: Change list_sort to use const pointers 2021-09-30 10:11:04 +02:00
isofs isofs: Fix out of bound access for corrupted isofs image 2021-11-12 14:58:33 +01:00
jbd2 ext4: fix debug format string warning 2021-05-19 10:13:19 +02:00
jffs2 jffs2: check the validity of dstlen in jffs2_zlib_compress() 2021-05-11 14:47:36 +02:00
jfs fs/jfs: Fix missing error code in lmLogInit() 2021-07-20 16:05:40 +02:00
kernfs kernfs: wire up ->splice_read and ->splice_write 2021-01-27 11:55:29 +01:00
lockd lockd: lockd server-side shouldn't set fl_ops 2021-09-18 13:40:30 +02:00
minix [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
nfs NFSv4/pnfs: The layout barrier indicate a minimal value for the seqid 2021-09-18 13:40:13 +02:00
nfs_common nfs_common: need lock during iterate through the list 2020-12-30 11:53:45 +01:00
nfsd NFSD: Keep existing listeners on portlist error 2021-10-27 09:56:47 +02:00
nilfs2 nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group 2021-09-26 14:09:01 +02:00
nls
notify fanotify: limit number of event merge attempts 2021-09-18 13:40:38 +02:00
ntfs ntfs: fix validity check for file name attribute 2021-07-14 16:55:38 +02:00
ocfs2 ocfs2: fix data corruption on truncate 2021-11-18 14:03:37 +01:00
omfs fs: omfs: use kmemdup() rather than kmalloc+memcpy 2020-09-22 23:39:45 -04:00
openpromfs
orangefs orangefs: fix orangefs df output. 2021-07-20 16:05:48 +02:00
overlayfs ovl: fix use after free in struct ovl_aio_req 2021-11-18 14:03:49 +01:00
proc fs/proc/uptime.c: Fix idle time reporting in /proc/uptime 2021-11-18 14:03:56 +01:00
pstore mark pstore-blk as broken 2021-07-14 16:56:12 +02:00
qnx4 qnx4: work around gcc false positive warning bug 2021-09-30 10:11:08 +02:00
qnx6 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
quota quota: correct error number in free_dqentry() 2021-11-18 14:03:51 +01:00
ramfs ramfs: fix nommu mmap with gaps in the page cache 2020-10-16 11:11:22 -07:00
reiserfs reiserfs: check directory items on read from disk 2021-08-12 13:22:19 +02:00
romfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
squashfs squashfs: fix divide error in calculate_skip() 2021-05-19 10:13:10 +02:00
sysfs sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output 2020-10-02 12:02:30 +02:00
sysv [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
tracefs tracefs: Have tracefs directories not set OTH permission bits by default 2021-11-18 14:03:56 +01:00
ubifs treewide: Change list_sort to use const pointers 2021-09-30 10:11:04 +02:00
udf udf_get_extendedattr() had no boundary checks. 2021-09-15 09:50:27 +02:00
ufs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
unicode
vboxsf vboxfs: fix broken legacy mount signature checking 2021-10-17 10:43:33 +02:00
verity fs-verity: fix signed integer overflow with i_size near S64_MAX 2021-10-06 15:55:46 +02:00
xfs treewide: Change list_sort to use const pointers 2021-09-30 10:11:04 +02:00
zonefs zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone() 2021-03-25 09:04:05 +01:00
aio.c vfs: separate __sb_start_write into blocking and non-blocking helpers 2020-11-10 16:53:07 -08:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot 2020-10-16 11:11:21 -07:00
binfmt_elf.c elf: don't use MAP_FIXED_NOREPLACE for elf interpreter mappings 2021-10-06 15:55:59 +02:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c binfmt_misc: fix possible deadlock in bm_register_write 2021-03-17 17:06:35 +01:00
binfmt_script.c
block_dev.c block: fix a race between del_gendisk and BLKRRPART 2021-06-03 09:00:45 +02:00
buffer.c mm, memcg: rework remote charging API to support nesting 2020-10-18 09:27:09 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c coredump: fix memleak in dump_vma_snapshot() 2021-09-26 14:08:56 +02:00
d_path.c fs: fix NULL dereference due to data race in prepend_path() 2020-10-14 14:54:45 -07:00
dax.c dax: fix ENOMEM handling in grab_mapping_entry() 2021-07-14 16:56:13 +02:00
dcache.c
dcookies.c
direct-io.c fs: direct-io: fix missing sdio->boundary 2021-04-14 08:41:58 +02:00
drop_caches.c
eventfd.c
eventpoll.c fs/epoll: restore waking from ep_done_scan() 2021-05-11 14:47:12 +02:00
exec.c Revert "Add a reference to ucounts for each cred" 2021-09-08 08:49:00 +02:00
fcntl.c fcntl: fix potential deadlock for &fasync_struct.fa_lock 2021-09-15 09:50:27 +02:00
fhandle.c
file_table.c task_work: cleanup notification modes 2020-10-17 15:05:30 -06:00
file.c kernel/io_uring: cancel io_uring before task works 2021-01-30 13:55:18 +01:00
filesystems.c
fs_context.c
fs_parser.c fs_parse: mark fs_param_bad_value() as static 2020-10-13 18:38:27 -07:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c writeback: fix obtain a reference to a freeing memcg css 2021-07-14 16:56:31 +02:00
fsopen.c
init.c
inode.c fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode() 2020-12-30 11:53:49 +01:00
internal.h cgroup1: fix leaked context root causing sporadic NULL deref in LTP 2021-07-31 08:16:11 +02:00
io_uring.c Revert "io_uring: reinforce cancel on flush during exit" 2021-11-06 14:10:08 +01:00
io-wq.c io-wq: fix wakeup race when adding new work 2021-09-18 13:40:06 +02:00
io-wq.h io_uring: always batch cancel in *cancel_files() 2021-02-13 13:54:56 +01:00
ioctl.c
Kconfig tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha 2021-02-17 11:02:21 +01:00
Kconfig.binfmt
kernel_read_file.c vfs: check fd has read access in kernel_read_file_from_fd() 2021-10-27 09:56:51 +02:00
libfs.c libfs: fix error cast of negative value in simple_attr_write() 2020-11-22 10:48:22 -08:00
locks.c Revert "nfsd4: a client's own opens needn't prevent delegations" 2021-03-20 10:43:44 +01:00
Makefile Refactored code for 5.10: 2020-10-23 11:33:41 -07:00
mbcache.c
mount.h
mpage.c
namei.c LOOKUP_MOUNTPOINT: we are cleaning "jumped" flag too late 2021-04-14 08:41:58 +02:00
namespace.c fs: warn about impending deprecation of mandatory locks 2021-08-26 08:35:57 -04:00
no-block.c
nsfs.c
open.c open: don't silently ignore unknown O-flags in openat2() 2021-07-14 16:55:59 +02:00
pipe.c pipe: do FASYNC notifications for every pipe IO, not just state changes 2021-09-03 10:09:28 +02:00
pnode.c
pnode.h mount: fix mounting of detached mounts onto targets that reside on shared mounts 2021-03-17 17:06:13 +01:00
posix_acl.c
proc_namespace.c proc mountinfo: make splice available again 2020-12-30 11:54:02 +01:00
read_write.c Refactored code for 5.10: 2020-10-23 11:33:41 -07:00
readdir.c readdir: make sure to verify directory entry for legacy interfaces too 2021-04-21 13:00:54 +02:00
remap_range.c vfs: move the remap range helpers to remap_range.c 2020-10-15 09:48:49 -07:00
select.c kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data() 2021-03-25 09:04:16 +01:00
seq_file.c seq_file: disallow extremely large seq buffer allocations 2021-07-20 16:05:59 +02:00
signalfd.c
splice.c io_uring-5.10-2020-10-24 2020-10-24 12:40:18 -07:00
stack.c
stat.c fs: fix reporting supported extra file attributes for statx() 2021-05-11 14:47:33 +02:00
statfs.c
super.c vfs: move __sb_{start,end}_write* to fs.h 2020-11-10 16:53:11 -08:00
sync.c
timerfd.c
userfaultfd.c userfaultfd: fix a race between writeprotect and exit_mmap() 2021-10-27 09:56:51 +02:00
utimes.c
xattr.c fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr 2020-10-13 18:38:27 -07:00