kernel_optimize_test/fs
Dave Chinner e5f9d4e0f8 xfs: logging the on disk inode LSN can make it go backwards
commit 32baa63d82ee3f5ab3bd51bae6bf7d1c15aed8c7 upstream.

When we log an inode, we format the "log inode" core and set an LSN
in that inode core. We do that via xfs_inode_item_format_core(),
which calls:

	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

to format the log inode. It writes the LSN from the inode item into
the log inode, and if recovery decides the inode item needs to be
replayed, it recovers the log inode LSN field and writes it into the
on disk inode LSN field.

Now this might seem like a reasonable thing to do, but it is wrong
on multiple levels. Firstly, if the item is not yet in the AIL,
item->li_lsn is zero. i.e. the first time the inode it is logged and
formatted, the LSN we write into the log inode will be zero. If we
only log it once, recovery will run and can write this zero LSN into
the inode.

This means that the next time the inode is logged and log recovery
runs, it will *always* replay changes to the inode regardless of
whether the inode is newer on disk than the version in the log and
that violates the entire purpose of recording the LSN in the inode
at writeback time (i.e. to stop it going backwards in time on disk
during recovery).

Secondly, if we commit the CIL to the journal so the inode item
moves to the AIL, and then relog the inode, the LSN that gets
stamped into the log inode will be the LSN of the inode's current
location in the AIL, not it's age on disk. And it's not the LSN that
will be associated with the current change. That means when log
recovery replays this inode item, the LSN that ends up on disk is
the LSN for the previous changes in the log, not the current
changes being replayed. IOWs, after recovery the LSN on disk is not
in sync with the LSN of the modifications that were replayed into
the inode. This, again, violates the recovery ordering semantics
that on-disk writeback LSNs provide.

Hence the inode LSN in the log dinode is -always- invalid.

Thirdly, recovery actually has the LSN of the log transaction it is
replaying right at hand - it uses it to determine if it should
replay the inode by comparing it to the on-disk inode's LSN. But it
doesn't use that LSN to stamp the LSN into the inode which will be
written back when the transaction is fully replayed. It uses the one
in the log dinode, which we know is always going to be incorrect.

Looking back at the change history, the inode logging was broken by
commit 93f958f9c4 ("xfs: cull unnecessary icdinode fields") way
back in 2016 by a stupid idiot who thought he knew how this code
worked. i.e. me. That commit replaced an in memory di_lsn field that
was updated only at inode writeback time from the inode item.li_lsn
value - and hence always contained the same LSN that appeared in the
on-disk inode - with a read of the inode item LSN at inode format
time. CLearly these are not the same thing.

Before 93f958f9c4, the log recovery behaviour was irrelevant,
because the LSN in the log inode always matched the on-disk LSN at
the time the inode was logged, hence recovery of the transaction
would never make the on-disk LSN in the inode go backwards or get
out of sync.

A symptom of the problem is this, caught from a failure of
generic/482. Before log recovery, the inode has been allocated but
never used:

xfs_db> inode 393388
xfs_db> p
core.magic = 0x494e
core.mode = 0
....
v3.crc = 0x99126961 (correct)
v3.change_count = 0
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jan  1 10:00:00 1970
v3.crtime.nsec = 0

After log recovery:

xfs_db> p
core.magic = 0x494e
core.mode = 020444
....
v3.crc = 0x23e68f23 (correct)
v3.change_count = 2
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jul 22 17:03:03 2021
v3.crtime.nsec = 751000000
...

You can see that the LSN of the on-disk inode is 0, even though it
clearly has been written to disk. I point out this inode, because
the generic/482 failure occurred because several adjacent inodes in
this specific inode cluster were not replayed correctly and still
appeared to be zero on disk when all the other metadata (inobt,
finobt, directories, etc) indicated they should be allocated and
written back.

The fix for this is two-fold. The first is that we need to either
revert the LSN changes in 93f958f9c4 or stop logging the inode LSN
altogether. If we do the former, log recovery does not need to
change but we add 8 bytes of memory per inode to store what is
largely a write-only inode field. If we do the latter, log recovery
needs to stamp the on-disk inode in the same manner that inode
writeback does.

I prefer the latter, because we shouldn't really be trying to log
and replay changes to the on disk LSN as the on-disk value is the
canonical source of the on-disk version of the inode. It also
matches the way we recover buffer items - we create a buf_log_item
that carries the current recovery transaction LSN that gets stamped
into the buffer by the write verifier when it gets written back
when the transaction is fully recovered.

However, this might break log recovery on older kernels even more,
so I'm going to simply ignore the logged value in recovery and stamp
the on-disk inode with the LSN of the transaction being recovered
that will trigger writeback on transaction recovery completion. This
will ensure that the on-disk inode LSN always reflects the LSN of
the last change that was written to disk, regardless of whether it
comes from log recovery or runtime writeback.

Fixes: 93f958f9c4 ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-03 12:00:51 +02:00
..
9p 9p: missing chunk of "fs/9p: Don't update file type when updating file attributes" 2022-06-22 14:13:12 +02:00
adfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
affs fs/affs: release old buffer head on error path 2021-03-04 11:38:37 +01:00
afs afs: Fix dynamic root getattr 2022-06-29 08:59:49 +02:00
autofs autofs: harden ioctl table 2020-10-16 11:11:22 -07:00
befs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
bfs bfs: don't use WARNING: string when it's just info. 2021-01-06 14:56:52 +01:00
btrfs btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents 2022-07-21 21:20:01 +02:00
cachefiles fs/cachefiles: Remove wait_bit_key layout dependency 2021-03-30 14:32:07 +02:00
ceph ceph: allow ceph.dir.rctime xattr to be updatable 2022-06-14 18:32:44 +02:00
cifs cifs: fix reconnect on smb3 mount types 2022-06-14 18:32:45 +02:00
coda
configfs configfs: fix a race in configfs_{,un}register_subsystem() 2022-03-02 11:42:52 +01:00
cramfs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
crypto fscrypt: allow 256-bit master keys with AES-256-XTS 2021-11-18 14:03:54 +01:00
debugfs debugfs: lockdown: Allow reading debugfs files that are not world readable 2022-01-27 10:54:02 +01:00
devpts fsnotify: fix fsnotify hooks in pseudo filesystems 2022-02-01 17:25:39 +01:00
dlm dlm: fix pending remove if msg allocation fails 2022-07-29 17:19:24 +02:00
ecryptfs Revert "ecryptfs: replace BUG_ON with error handling code" 2021-05-26 12:06:55 +02:00
efivarfs efivarfs: revert "fix memory leak in efivarfs_create()" 2020-11-25 16:55:02 +01:00
efs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
erofs erofs: fix deadlock when shrink erofs slab 2021-12-01 09:19:05 +01:00
exfat exfat: check if cluster num is valid 2022-06-06 08:42:42 +02:00
exportfs
ext2 ext2: correct max file size computing 2022-04-08 14:40:18 +02:00
ext4 ext4: fix race condition between ext4_write and ext4_convert_inline_data 2022-07-21 21:20:02 +02:00
f2fs f2fs: attach inline_data after setting compression 2022-06-29 08:59:51 +02:00
fat fat: add ratelimit to fat*_ent_bread() 2022-06-09 10:20:58 +02:00
freevxfs
fscache fscache: Fix cookie key hashing 2021-09-18 13:40:15 +02:00
fuse fuse: fix pipe buffer lifetime for direct_io 2022-03-16 14:16:01 +01:00
gfs2 gfs2: use i_lock spin_lock for inode qadata 2022-06-09 10:20:57 +02:00
hfs hfs: add lock nesting notation to hfs_find_init 2021-07-31 08:16:12 +02:00
hfsplus hfsplus: prevent corruption in shrinking truncate 2021-05-19 10:13:10 +02:00
hostfs hostfs: fix memory handling in follow_link() 2021-04-14 08:42:06 +02:00
hpfs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
hugetlbfs mm, hugetlb: allow for "high" userspace addresses 2022-04-27 13:53:54 +02:00
iomap xfs: use current->journal_info for detecting transaction recursion 2022-07-07 17:52:19 +02:00
isofs isofs: Fix out of bound access for corrupted isofs image 2021-11-12 14:58:33 +01:00
jbd2 jbd2: fix a potential race while discarding reserved buffers after an abort 2022-04-27 13:53:57 +02:00
jffs2 jffs2: fix memory leak in jffs2_do_fill_super 2022-06-14 18:32:35 +02:00
jfs fs: jfs: fix possible NULL pointer dereference in dbFree() 2022-06-09 10:20:57 +02:00
kernfs kernfs: Separate kernfs_pr_cont_buf and rename_lock. 2022-06-14 18:32:43 +02:00
lockd lockd: lockd server-side shouldn't set fl_ops 2021-09-18 13:40:30 +02:00
minix minix: fix bug when opening a file with O_DIRECT 2022-04-13 21:01:01 +02:00
nfs pNFS: Avoid a live lock condition in pnfs_update_layout() 2022-06-22 14:13:16 +02:00
nfs_common nfs_common: need lock during iterate through the list 2020-12-30 11:53:45 +01:00
nfsd NFSD: restore EINVAL error translation in nfsd_commit() 2022-07-07 17:52:17 +02:00
nilfs2 nilfs2: fix incorrect masking of permission flags for symlinks 2022-07-21 21:20:01 +02:00
nls
notify fsnotify: fix wrong lockdep annotations 2022-06-09 10:21:03 +02:00
ntfs ntfs: fix use-after-free in ntfs_ucsncmp() 2022-08-03 12:00:43 +02:00
ocfs2 Revert "ocfs2: mount shared volume without ha stack" 2022-08-03 12:00:43 +02:00
omfs fs: omfs: use kmemdup() rather than kmalloc+memcpy 2020-09-22 23:39:45 -04:00
openpromfs
orangefs orangefs: Fix the size of a memory allocation in orangefs_bufmap_alloc() 2022-01-20 09:17:50 +01:00
overlayfs ovl: fix warning in ovl_create_real() 2021-12-22 09:30:58 +01:00
proc proc: fix dentry/inode overinstantiating under /proc/${pid}/net 2022-06-09 10:21:17 +02:00
pstore pstore: Don't use semaphores in always-atomic-context code 2022-04-08 14:39:56 +02:00
qnx4 qnx4: work around gcc false positive warning bug 2021-09-30 10:11:08 +02:00
qnx6 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
quota quota: Prevent memory allocation recursion while holding dq_lock 2022-06-22 14:13:14 +02:00
ramfs ramfs: fix nommu mmap with gaps in the page cache 2020-10-16 11:11:22 -07:00
reiserfs reiserfs: check directory items on read from disk 2021-08-12 13:22:19 +02:00
romfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
squashfs squashfs: fix divide error in calculate_skip() 2021-05-19 10:13:10 +02:00
sysfs sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output 2020-10-02 12:02:30 +02:00
sysv [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
tracefs tracefs: Set the group ownership in apply_options() not parse_options() 2022-03-02 11:42:54 +01:00
ubifs ubifs: Rectify space amount budget for mkdir/tmpfile operations 2022-04-13 21:00:53 +02:00
udf udf: Fix NULL ptr deref when converting from inline format 2022-02-01 17:25:39 +01:00
ufs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
unicode unicode: Add utf8_casefold_hash 2020-09-10 14:03:31 -07:00
vboxsf vboxfs: fix broken legacy mount signature checking 2021-10-17 10:43:33 +02:00
verity fs-verity: fix signed integer overflow with i_size near S64_MAX 2021-10-06 15:55:46 +02:00
xfs xfs: logging the on disk inode LSN can make it go backwards 2022-08-03 12:00:51 +02:00
zonefs zonefs: fix zonefs_iomap_begin() for reads 2022-06-25 15:16:08 +02:00
aio.c aio: fix use-after-free due to missing POLLFREE handling 2021-12-14 11:32:40 +01:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c coredump: Snapshot the vmas in do_coredump 2022-04-08 14:40:44 +02:00
binfmt_elf.c coredump: Use the vma snapshot in fill_files_note 2022-04-08 14:40:45 +02:00
binfmt_em86.c
binfmt_flat.c binfmt_flat: do not stop relocating GOT entries prematurely on riscv 2022-06-09 10:20:47 +02:00
binfmt_misc.c binfmt_misc: fix possible deadlock in bm_register_write 2021-03-17 17:06:35 +01:00
binfmt_script.c
block_dev.c block: fix a race between del_gendisk and BLKRRPART 2021-06-03 09:00:45 +02:00
buffer.c mm, memcg: rework remote charging API to support nesting 2020-10-18 09:27:09 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c coredump: Use the vma snapshot in fill_files_note 2022-04-08 14:40:45 +02:00
d_path.c fs: fix NULL dereference due to data race in prepend_path() 2020-10-14 14:54:45 -07:00
dax.c dax: fix cache flush on PMD-mapped pages 2022-06-09 10:21:16 +02:00
dcache.c
dcookies.c
direct-io.c fs: direct-io: fix missing sdio->boundary 2021-04-14 08:41:58 +02:00
drop_caches.c
eventfd.c
eventpoll.c fs/epoll: restore waking from ep_done_scan() 2021-05-11 14:47:12 +02:00
exec.c fix race between exit_itimers() and /proc/pid/timers 2022-07-21 21:19:59 +02:00
fcntl.c fcntl: fix potential deadlock for &fasync_struct.fa_lock 2021-09-15 09:50:27 +02:00
fhandle.c
file_table.c SUNRPC: Ensure we flush any closed sockets before xs_xprt_free() 2022-05-18 10:23:48 +02:00
file.c fs: fix fd table size alignment properly 2022-04-08 14:40:30 +02:00
filesystems.c
fs_context.c memcg: charge fs_context and legacy_fs_context 2022-02-08 18:30:36 +01:00
fs_parser.c fs_parse: mark fs_param_bad_value() as static 2020-10-13 18:38:27 -07:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pages 2022-06-09 10:21:22 +02:00
fsopen.c
init.c
inode.c fs: export an inode_update_time helper 2021-11-26 10:39:22 +01:00
internal.h cgroup1: fix leaked context root causing sporadic NULL deref in LTP 2021-07-31 08:16:11 +02:00
io_uring.c io_uring: Use original task for req identity in io_identity_cow() 2022-07-29 17:19:07 +02:00
io-wq.c io-wq: fix wakeup race when adding new work 2021-09-18 13:40:06 +02:00
io-wq.h io_uring: always batch cancel in *cancel_files() 2021-02-13 13:54:56 +01:00
ioctl.c fs: fix an infinite loop in iomap_fiemap 2022-05-25 09:17:54 +02:00
Kconfig tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha 2021-02-17 11:02:21 +01:00
Kconfig.binfmt
kernel_read_file.c vfs: check fd has read access in kernel_read_file_from_fd() 2021-10-27 09:56:51 +02:00
libfs.c libfs: fix error cast of negative value in simple_attr_write() 2020-11-22 10:48:22 -08:00
locks.c Revert "nfsd4: a client's own opens needn't prevent delegations" 2021-03-20 10:43:44 +01:00
Makefile Refactored code for 5.10: 2020-10-23 11:33:41 -07:00
mbcache.c
mount.h
mpage.c
namei.c fsnotify: invalidate dcache before IN_DELETE event 2022-02-01 17:25:48 +01:00
namespace.c fs: warn about impending deprecation of mandatory locks 2021-08-26 08:35:57 -04:00
no-block.c
nsfs.c
open.c open: don't silently ignore unknown O-flags in openat2() 2021-07-14 16:55:59 +02:00
pipe.c pipe: Fix missing lock in pipe_resize_ring() 2022-06-06 08:42:41 +02:00
pnode.c
pnode.h mount: fix mounting of detached mounts onto targets that reside on shared mounts 2021-03-17 17:06:13 +01:00
posix_acl.c
proc_namespace.c proc mountinfo: make splice available again 2020-12-30 11:54:02 +01:00
read_write.c Refactored code for 5.10: 2020-10-23 11:33:41 -07:00
readdir.c readdir: make sure to verify directory entry for legacy interfaces too 2021-04-21 13:00:54 +02:00
remap_range.c fs/remap: constrain dedupe of EOF blocks 2022-07-21 21:20:01 +02:00
select.c select: Fix indefinitely sleeping task in poll_schedule_timeout() 2022-01-29 10:26:11 +01:00
seq_file.c seq_file: disallow extremely large seq buffer allocations 2021-07-20 16:05:59 +02:00
signalfd.c signalfd: use wake_up_pollfree() 2021-12-14 11:32:40 +01:00
splice.c io_uring-5.10-2020-10-24 2020-10-24 12:40:18 -07:00
stack.c
stat.c stat: fix inconsistency between struct stat and struct compat_stat 2022-04-27 13:53:54 +02:00
statfs.c
super.c vfs: make freeze_super abort when sync_filesystem returns error 2022-02-23 12:00:59 +01:00
sync.c
timerfd.c
userfaultfd.c userfaultfd: fix a race between writeprotect and exit_mmap() 2021-10-27 09:56:51 +02:00
utimes.c
xattr.c fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr 2020-10-13 18:38:27 -07:00