kernel_optimize_test

Dave Chinner e5f9d4e0f8 xfs: logging the on disk inode LSN can make it go backwards

commit 32baa63d82ee3f5ab3bd51bae6bf7d1c15aed8c7 upstream.

When we log an inode, we format the "log inode" core and set an LSN in that inode core. We do that via xfs_inode_item_format_core(), which calls:

    xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

to format the log inode. It writes the LSN from the inode item into the log inode, and if recovery decides the inode item needs to be replayed, it recovers the log inode LSN field and writes it into the on disk inode LSN field.

Now this might seem like a reasonable thing to do, but it is wrong on multiple levels.

Firstly, if the item is not yet in the AIL, item->li_lsn is zero. i.e. the first time the inode is logged and formatted, the LSN we write into the log inode will be zero. If we only log it once, recovery will run and can write this zero LSN into the inode.

This means that the next time the inode is logged and log recovery runs, it will always replay changes to the inode regardless of whether the inode is newer on disk than the version in the log and that violates the entire purpose of recording the LSN in the inode at writeback time (i.e. to stop it going backwards in time on disk during recovery).

Secondly, if we commit the CIL to the journal so the inode item moves to the AIL, and then relog the inode, the LSN that gets stamped into the log inode will be the LSN of the inode's current location in the AIL, not its age on disk. And it's not the LSN that will be associated with the current change. That means when log recovery replays this inode item, the LSN that ends up on disk is the LSN for the previous changes in the log, not the current changes being replayed. IOWs, after recovery the LSN on disk is not in sync with the LSN of the modifications that were replayed into the inode. This, again, violates the recovery ordering semantics that on-disk writeback LSNs provide.

Hence the inode LSN in the log dinode is -always- invalid.

Thirdly, recovery actually has the LSN of the log transaction it is replaying right at hand - it uses it to determine if it should replay the inode by comparing it to the on-disk inode's LSN. But it doesn't use that LSN to stamp the LSN into the inode which will be written back when the transaction is fully replayed. It uses the one in the log dinode, which we know is always going to be incorrect.

Looking back at the change history, the inode logging was broken by commit `93f958f9c4` ("xfs: cull unnecessary icdinode fields") way back in 2016 by a stupid idiot who thought he knew how this code worked. i.e. me. That commit replaced an in memory di_lsn field that was updated only at inode writeback time from the inode item.li_lsn value - and hence always contained the same LSN that appeared in the on-disk inode - with a read of the inode item LSN at inode format time. Clearly these are not the same thing.

Before `93f958f9c4`, the log recovery behaviour was irrelevant, because the LSN in the log inode always matched the on-disk LSN at the time the inode was logged, hence recovery of the transaction would never make the on-disk LSN in the inode go backwards or get out of sync.

A symptom of the problem is this, caught from a failure of generic/482. Before log recovery, the inode has been allocated but never used:

    xfs_db> inode 393388
    xfs_db> p
    core.magic = 0x494e
    core.mode = 0
    ....
    v3.crc = 0x99126961 (correct)
    v3.change_count = 0
    v3.lsn = 0
    v3.flags2 = 0
    v3.cowextsize = 0
    v3.crtime.sec = Thu Jan 1 10:00:00 1970
    v3.crtime.nsec = 0

After log recovery:

    xfs_db> p
    core.magic = 0x494e
    core.mode = 020444
    ....
    v3.crc = 0x23e68f23 (correct)
    v3.change_count = 2
    v3.lsn = 0
    v3.flags2 = 0
    v3.cowextsize = 0
    v3.crtime.sec = Thu Jul 22 17:03:03 2021
    v3.crtime.nsec = 751000000
    ...

You can see that the LSN of the on-disk inode is 0, even though it clearly has been written to disk. I point out this inode, because the generic/482 failure occurred because several adjacent inodes in this specific inode cluster were not replayed correctly and still appeared to be zero on disk when all the other metadata (inobt, finobt, directories, etc) indicated they should be allocated and written back.

The fix for this is two-fold. The first is that we need to either revert the LSN changes in `93f958f9c4` or stop logging the inode LSN altogether. If we do the former, log recovery does not need to change but we add 8 bytes of memory per inode to store what is largely a write-only inode field. If we do the latter, log recovery needs to stamp the on-disk inode in the same manner that inode writeback does.

I prefer the latter, because we shouldn't really be trying to log and replay changes to the on disk LSN as the on-disk value is the canonical source of the on-disk version of the inode. It also matches the way we recover buffer items - we create a buf_log_item that carries the current recovery transaction LSN that gets stamped into the buffer by the write verifier when it gets written back when the transaction is fully recovered.

However, this might break log recovery on older kernels even more, so I'm going to simply ignore the logged value in recovery and stamp the on-disk inode with the LSN of the transaction being recovered that will trigger writeback on transaction recovery completion. This will ensure that the on-disk inode LSN always reflects the LSN of the last change that was written to disk, regardless of whether it comes from log recovery or runtime writeback.

Fixes: `93f958f9c4` ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2022-08-03 12:00:51 +02:00
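
The recovery rule described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: the types `disk_inode` and `log_inode`, the helpers `should_replay()`, `replay_with_logged_lsn()` and `replay_with_trans_lsn()`, and the exact skip condition are hypothetical simplifications. It contrasts copying the logged LSN onto the disk inode with stamping the LSN of the transaction being recovered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
typedef int64_t lsn_t;

struct disk_inode {
	lsn_t    lsn;          /* LSN of the last change written to disk */
	uint32_t change_count; /* stands in for the rest of the inode */
};

struct log_inode {
	lsn_t    logged_lsn;   /* LSN captured at format time (unreliable) */
	uint32_t change_count;
};

/* Assumed rule: skip replay only if the disk inode has a valid, newer LSN. */
bool should_replay(const struct disk_inode *disk, lsn_t trans_lsn)
{
	return !(disk->lsn != 0 && disk->lsn > trans_lsn);
}

/* Behaviour the commit message calls broken: copy the logged LSN to disk. */
bool replay_with_logged_lsn(struct disk_inode *disk,
			    const struct log_inode *log, lsn_t trans_lsn)
{
	if (!should_replay(disk, trans_lsn))
		return false;
	disk->change_count = log->change_count;
	disk->lsn = log->logged_lsn;
	return true;
}

/* Behaviour the commit message proposes: stamp the transaction's LSN. */
bool replay_with_trans_lsn(struct disk_inode *disk,
			   const struct log_inode *log, lsn_t trans_lsn)
{
	if (!should_replay(disk, trans_lsn))
		return false;
	disk->change_count = log->change_count;
	disk->lsn = trans_lsn;
	return true;
}

/* Replays a newer change (LSN 100), then offers an older one (LSN 50). */
void demo_lsn_policies(void)
{
	struct log_inode newer = { 0, 2 }; /* logged before entering the AIL */
	struct log_inode older = { 0, 1 };
	struct disk_inode a = { 0, 0 }, b = { 0, 0 };

	replay_with_logged_lsn(&a, &newer, 100);
	replay_with_logged_lsn(&a, &older, 50);
	assert(a.lsn == 0 && a.change_count == 1); /* went backwards */

	replay_with_trans_lsn(&b, &newer, 100);
	replay_with_trans_lsn(&b, &older, 50);
	assert(b.lsn == 100 && b.change_count == 2); /* older change skipped */
}
```

Calling `demo_lsn_policies()` from a `main()` exercises both policies. In this model the zero logged LSN leaves the disk inode unprotected, so the older change overwrites the newer one, while the stamped transaction LSN causes the older change to be skipped.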
9p	9p: missing chunk of "fs/9p: Don't update file type when updating file attributes"	2022-06-22 14:13:12 +02:00
adfs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
affs	fs/affs: release old buffer head on error path	2021-03-04 11:38:37 +01:00
afs	afs: Fix dynamic root getattr	2022-06-29 08:59:49 +02:00
autofs	autofs: harden ioctl table	2020-10-16 11:11:22 -07:00
befs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
bfs	bfs: don't use WARNING: string when it's just info.	2021-01-06 14:56:52 +01:00
btrfs	btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents	2022-07-21 21:20:01 +02:00
cachefiles	fs/cachefiles: Remove wait_bit_key layout dependency	2021-03-30 14:32:07 +02:00
ceph	ceph: allow ceph.dir.rctime xattr to be updatable	2022-06-14 18:32:44 +02:00
cifs	cifs: fix reconnect on smb3 mount types	2022-06-14 18:32:45 +02:00
coda
configfs	configfs: fix a race in configfs_{,un}register_subsystem()	2022-03-02 11:42:52 +01:00
cramfs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
crypto	fscrypt: allow 256-bit master keys with AES-256-XTS	2021-11-18 14:03:54 +01:00
debugfs	debugfs: lockdown: Allow reading debugfs files that are not world readable	2022-01-27 10:54:02 +01:00
devpts	fsnotify: fix fsnotify hooks in pseudo filesystems	2022-02-01 17:25:39 +01:00
dlm	dlm: fix pending remove if msg allocation fails	2022-07-29 17:19:24 +02:00
ecryptfs	Revert "ecryptfs: replace BUG_ON with error handling code"	2021-05-26 12:06:55 +02:00
efivarfs	efivarfs: revert "fix memory leak in efivarfs_create()"	2020-11-25 16:55:02 +01:00
efs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
erofs	erofs: fix deadlock when shrink erofs slab	2021-12-01 09:19:05 +01:00
exfat	exfat: check if cluster num is valid	2022-06-06 08:42:42 +02:00
exportfs
ext2	ext2: correct max file size computing	2022-04-08 14:40:18 +02:00
ext4	ext4: fix race condition between ext4_write and ext4_convert_inline_data	2022-07-21 21:20:02 +02:00
f2fs	f2fs: attach inline_data after setting compression	2022-06-29 08:59:51 +02:00
fat	fat: add ratelimit to fat*_ent_bread()	2022-06-09 10:20:58 +02:00
freevxfs
fscache	fscache: Fix cookie key hashing	2021-09-18 13:40:15 +02:00
fuse	fuse: fix pipe buffer lifetime for direct_io	2022-03-16 14:16:01 +01:00
gfs2	gfs2: use i_lock spin_lock for inode qadata	2022-06-09 10:20:57 +02:00
hfs	hfs: add lock nesting notation to hfs_find_init	2021-07-31 08:16:12 +02:00
hfsplus	hfsplus: prevent corruption in shrinking truncate	2021-05-19 10:13:10 +02:00
hostfs	hostfs: fix memory handling in follow_link()	2021-04-14 08:42:06 +02:00
hpfs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
hugetlbfs	mm, hugetlb: allow for "high" userspace addresses	2022-04-27 13:53:54 +02:00
iomap	xfs: use current->journal_info for detecting transaction recursion	2022-07-07 17:52:19 +02:00
isofs	isofs: Fix out of bound access for corrupted isofs image	2021-11-12 14:58:33 +01:00
jbd2	jbd2: fix a potential race while discarding reserved buffers after an abort	2022-04-27 13:53:57 +02:00
jffs2	jffs2: fix memory leak in jffs2_do_fill_super	2022-06-14 18:32:35 +02:00
jfs	fs: jfs: fix possible NULL pointer dereference in dbFree()	2022-06-09 10:20:57 +02:00
kernfs	kernfs: Separate kernfs_pr_cont_buf and rename_lock.	2022-06-14 18:32:43 +02:00
lockd	lockd: lockd server-side shouldn't set fl_ops	2021-09-18 13:40:30 +02:00
minix	minix: fix bug when opening a file with O_DIRECT	2022-04-13 21:01:01 +02:00
nfs	pNFS: Avoid a live lock condition in pnfs_update_layout()	2022-06-22 14:13:16 +02:00
nfs_common	nfs_common: need lock during iterate through the list	2020-12-30 11:53:45 +01:00
nfsd	NFSD: restore EINVAL error translation in nfsd_commit()	2022-07-07 17:52:17 +02:00
nilfs2	nilfs2: fix incorrect masking of permission flags for symlinks	2022-07-21 21:20:01 +02:00
nls
notify	fsnotify: fix wrong lockdep annotations	2022-06-09 10:21:03 +02:00
ntfs	ntfs: fix use-after-free in ntfs_ucsncmp()	2022-08-03 12:00:43 +02:00
ocfs2	Revert "ocfs2: mount shared volume without ha stack"	2022-08-03 12:00:43 +02:00
omfs	fs: omfs: use kmemdup() rather than kmalloc+memcpy	2020-09-22 23:39:45 -04:00
openpromfs
orangefs	orangefs: Fix the size of a memory allocation in orangefs_bufmap_alloc()	2022-01-20 09:17:50 +01:00
overlayfs	ovl: fix warning in ovl_create_real()	2021-12-22 09:30:58 +01:00
proc	proc: fix dentry/inode overinstantiating under /proc/${pid}/net	2022-06-09 10:21:17 +02:00
pstore	pstore: Don't use semaphores in always-atomic-context code	2022-04-08 14:39:56 +02:00
qnx4	qnx4: work around gcc false positive warning bug	2021-09-30 10:11:08 +02:00
qnx6	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
quota	quota: Prevent memory allocation recursion while holding dq_lock	2022-06-22 14:13:14 +02:00
ramfs	ramfs: fix nommu mmap with gaps in the page cache	2020-10-16 11:11:22 -07:00
reiserfs	reiserfs: check directory items on read from disk	2021-08-12 13:22:19 +02:00
romfs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
squashfs	squashfs: fix divide error in calculate_skip()	2021-05-19 10:13:10 +02:00
sysfs	sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output	2020-10-02 12:02:30 +02:00
sysv	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
tracefs	tracefs: Set the group ownership in apply_options() not parse_options()	2022-03-02 11:42:54 +01:00
ubifs	ubifs: Rectify space amount budget for mkdir/tmpfile operations	2022-04-13 21:00:53 +02:00
udf	udf: Fix NULL ptr deref when converting from inline format	2022-02-01 17:25:39 +01:00
ufs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
unicode	unicode: Add utf8_casefold_hash	2020-09-10 14:03:31 -07:00
vboxsf	vboxfs: fix broken legacy mount signature checking	2021-10-17 10:43:33 +02:00
verity	fs-verity: fix signed integer overflow with i_size near S64_MAX	2021-10-06 15:55:46 +02:00
xfs	xfs: logging the on disk inode LSN can make it go backwards	2022-08-03 12:00:51 +02:00
zonefs	zonefs: fix zonefs_iomap_begin() for reads	2022-06-25 15:16:08 +02:00
aio.c	aio: fix use-after-free due to missing POLLFREE handling	2021-12-14 11:32:40 +01:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c	coredump: Snapshot the vmas in do_coredump	2022-04-08 14:40:44 +02:00
binfmt_elf.c	coredump: Use the vma snapshot in fill_files_note	2022-04-08 14:40:45 +02:00
binfmt_em86.c
binfmt_flat.c	binfmt_flat: do not stop relocating GOT entries prematurely on riscv	2022-06-09 10:20:47 +02:00
binfmt_misc.c	binfmt_misc: fix possible deadlock in bm_register_write	2021-03-17 17:06:35 +01:00
binfmt_script.c
block_dev.c	block: fix a race between del_gendisk and BLKRRPART	2021-06-03 09:00:45 +02:00
buffer.c	mm, memcg: rework remote charging API to support nesting	2020-10-18 09:27:09 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c	coredump: Use the vma snapshot in fill_files_note	2022-04-08 14:40:45 +02:00
d_path.c	fs: fix NULL dereference due to data race in prepend_path()	2020-10-14 14:54:45 -07:00
dax.c	dax: fix cache flush on PMD-mapped pages	2022-06-09 10:21:16 +02:00
dcache.c
dcookies.c
direct-io.c	fs: direct-io: fix missing sdio->boundary	2021-04-14 08:41:58 +02:00
drop_caches.c
eventfd.c
eventpoll.c	fs/epoll: restore waking from ep_done_scan()	2021-05-11 14:47:12 +02:00
exec.c	fix race between exit_itimers() and /proc/pid/timers	2022-07-21 21:19:59 +02:00
fcntl.c	fcntl: fix potential deadlock for &fasync_struct.fa_lock	2021-09-15 09:50:27 +02:00
fhandle.c
file_table.c	SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()	2022-05-18 10:23:48 +02:00
file.c	fs: fix fd table size alignment properly	2022-04-08 14:40:30 +02:00
filesystems.c
fs_context.c	memcg: charge fs_context and legacy_fs_context	2022-02-08 18:30:36 +01:00
fs_parser.c	fs_parse: mark fs_param_bad_value() as static	2020-10-13 18:38:27 -07:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c	fs-writeback: writeback_sb_inodes：Recalculate 'wrote' according skipped pages	2022-06-09 10:21:22 +02:00
fsopen.c
init.c
inode.c	fs: export an inode_update_time helper	2021-11-26 10:39:22 +01:00
internal.h	cgroup1: fix leaked context root causing sporadic NULL deref in LTP	2021-07-31 08:16:11 +02:00
io_uring.c	io_uring: Use original task for req identity in io_identity_cow()	2022-07-29 17:19:07 +02:00
io-wq.c	io-wq: fix wakeup race when adding new work	2021-09-18 13:40:06 +02:00
io-wq.h	io_uring: always batch cancel in *cancel_files()	2021-02-13 13:54:56 +01:00
ioctl.c	fs: fix an infinite loop in iomap_fiemap	2022-05-25 09:17:54 +02:00
Kconfig	tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha	2021-02-17 11:02:21 +01:00
Kconfig.binfmt
kernel_read_file.c	vfs: check fd has read access in kernel_read_file_from_fd()	2021-10-27 09:56:51 +02:00
libfs.c	libfs: fix error cast of negative value in simple_attr_write()	2020-11-22 10:48:22 -08:00
locks.c	Revert "nfsd4: a client's own opens needn't prevent delegations"	2021-03-20 10:43:44 +01:00
Makefile	Refactored code for 5.10:	2020-10-23 11:33:41 -07:00
mbcache.c
mount.h
mpage.c
namei.c	fsnotify: invalidate dcache before IN_DELETE event	2022-02-01 17:25:48 +01:00
namespace.c	fs: warn about impending deprecation of mandatory locks	2021-08-26 08:35:57 -04:00
no-block.c
nsfs.c
open.c	open: don't silently ignore unknown O-flags in openat2()	2021-07-14 16:55:59 +02:00
pipe.c	pipe: Fix missing lock in pipe_resize_ring()	2022-06-06 08:42:41 +02:00
pnode.c
pnode.h	mount: fix mounting of detached mounts onto targets that reside on shared mounts	2021-03-17 17:06:13 +01:00
posix_acl.c
proc_namespace.c	proc mountinfo: make splice available again	2020-12-30 11:54:02 +01:00
read_write.c	Refactored code for 5.10:	2020-10-23 11:33:41 -07:00
readdir.c	readdir: make sure to verify directory entry for legacy interfaces too	2021-04-21 13:00:54 +02:00
remap_range.c	fs/remap: constrain dedupe of EOF blocks	2022-07-21 21:20:01 +02:00
select.c	select: Fix indefinitely sleeping task in poll_schedule_timeout()	2022-01-29 10:26:11 +01:00
seq_file.c	seq_file: disallow extremely large seq buffer allocations	2021-07-20 16:05:59 +02:00
signalfd.c	signalfd: use wake_up_pollfree()	2021-12-14 11:32:40 +01:00
splice.c	io_uring-5.10-2020-10-24	2020-10-24 12:40:18 -07:00
stack.c
stat.c	stat: fix inconsistency between struct stat and struct compat_stat	2022-04-27 13:53:54 +02:00
statfs.c
super.c	vfs: make freeze_super abort when sync_filesystem returns error	2022-02-23 12:00:59 +01:00
sync.c
timerfd.c
userfaultfd.c	userfaultfd: fix a race between writeprotect and exit_mmap()	2021-10-27 09:56:51 +02:00
utimes.c
xattr.c	fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr	2020-10-13 18:38:27 -07:00