linux-kernelorg-stable/fs
Qu Wenruo 2dd4679961 btrfs: fix corruption reading compressed range when block size is smaller than page size
[ Upstream commit 9786531399 ]

[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:

        mkfs.btrfs -f -s 4k $dev > /dev/null
        mount -o compress $dev $mnt
        xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
        echo "correct result:"
        od -Ad -t x1 $mnt/base
        xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
                  -c "reflink $mnt/base 0 32k 32k" \
                  -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
        echo "incorrect result:"
        od -Ad -t x1 $mnt/new
        umount $mnt

This shows the following result:

correct result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
incorrect result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536

Notice the zeros in the range [32K, 60K), which are incorrect.
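
Rather than reading the od dump by eye, the bad range can be located programmatically. The following is a minimal Python sketch and not part of the patch; the helper name find_bad_ranges() and the 4K granularity are assumptions chosen to match the reproducer above.

```python
BLOCK_SIZE = 4096  # assumed: matches the 4K btrfs block size in the reproducer


def find_bad_ranges(data: bytes, expected: int = 0xFF, block: int = BLOCK_SIZE):
    """Return merged [start, end) byte ranges whose blocks are not all `expected`."""
    bad = []
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        if chunk != bytes([expected]) * len(chunk):
            end = start + len(chunk)
            if bad and bad[-1][1] == start:
                # Extend the previous range when the bad blocks are adjacent.
                bad[-1] = (bad[-1][0], end)
            else:
                bad.append((start, end))
    return bad


# Synthetic buffer with the same layout as the "incorrect result" dump:
# 32K of 0xff, 28K of zeros, then 4K of 0xff.
sample = b"\xff" * 32768 + b"\x00" * 28672 + b"\xff" * 4096
print(find_bad_ranges(sample))  # prints [(32768, 61440)]
```

On a real mount the buffer would come from reading $mnt/new instead of the synthetic sample; an empty list means every block holds the expected pattern.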

[CAUSE]
With extra trace printk output, the following events show up during od
(some unrelated info, like CPU and context, is removed):

 od-3457   btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000

The "r/i" field indicates the root and inode numbers. In our case the file
"new" uses ino 258 from the fs tree (root 5).

Notice here that the @prev_em_start pointer is NULL. This means
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().

 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768

The 32K of blocks above will be read from the first half of the
compressed data extent.

 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768

Note that there is no btrfs_submit_compressed_read() call here, which is
incorrect.
Although the extent maps at 0 and 32K both point to the same compressed
data, their offsets are different, so they cannot be merged into the same
read.

So the compressed data read merge check is doing something wrong.

 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
 od-3457   btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
 od-3457   btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440

The function btrfs_submit_compressed_read() is only called at the end of
the folio read. The compressed bio will only have an extent map of range
[0, 32K), but the original bio passed in is for the whole 64K folio.

This will cause the decompression part to only fill the first 32K,
leaving the rest untouched (that is, filled with zeros).

This incorrect compressed read merge leads to the above data corruption.
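
The effect of the wrongly merged read can be modelled outside the kernel. The sketch below is a toy Python model, not btrfs code: fill_from_compressed() is a hypothetical stand-in for the decompression step, assumed to fill only the part of the bio that the attached extent map covers.

```python
K = 1024


def fill_from_compressed(folio: bytearray, bio_start: int, bio_len: int,
                         em_start: int, em_len: int, fill: int = 0xFF) -> None:
    """Hypothetical stand-in for decompressing into the original bio.

    Only the overlap between the bio and the extent map is filled; the rest
    of the bio range is left as it was (zeros in a fresh folio).
    """
    start = max(bio_start, em_start)
    end = min(bio_start + bio_len, em_start + em_len)
    if end > start:
        folio[start:end] = bytes([fill]) * (end - start)


# Fresh 64K folio; [60K, 64K) is already uptodate from the 4K pwrite.
folio = bytearray(64 * K)
folio[60 * K:64 * K] = b"\xff" * (4 * K)

# Wrongly merged read: the bio covers [0, 60K), the extent map only [0, 32K).
fill_from_compressed(folio, bio_start=0, bio_len=60 * K,
                     em_start=0, em_len=32 * K)

# First byte, first byte past the extent map, last byte of the bio,
# and first byte of the already uptodate block.
print(folio[0], folio[32 * K], folio[60 * K - 1], folio[60 * K])
```

The model reproduces the layout of the "incorrect result" dump: 0xff in [0, 32K), zeros in [32K, 60K), and 0xff again in [60K, 64K).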

Similar problems have happened in the past: commit 808f80b467
("Btrfs: update fix for read corruption of compressed and shared
extents") does pretty much the same fix for readahead.

But that was back in 2015, when btrfs only supported the bs (block size)
== ps (page size) case.
This meant btrfs_do_readpage() only needed to handle a folio which
contained exactly one block.

Only btrfs_readahead() could lead to a read covering multiple blocks.
Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.

In the v5.15 kernel, btrfs introduced bs < ps support. This breaks the
above assumption that a folio can only contain one block.

Now btrfs_read_folio() can also read multiple blocks in one go.
But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.

In theory, this can also happen for btrfs with large folios, but since
large folio support is still experimental, we don't need to bother with
it, so only bs < ps support is affected for now.

[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.
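
The idea can be illustrated with a toy Python model of the per-block read loop. This is a sketch under stated assumptions, not the kernel implementation: read_folio_bios() and its parameters are hypothetical, only the name last_em_start is taken from the text above, and None is an assumed sentinel for "no previous extent map".

```python
BLOCK = 4096


def read_folio_bios(extent_maps, folio_len, uptodate_from, track_last_em_start):
    """Toy model of a per-block folio read; returns submitted (start, len) reads.

    extent_maps: list of (start, length) file ranges backed by compressed data.
    uptodate_from: file offset from which blocks are already uptodate (skipped).
    track_last_em_start: if True, remember the previous extent map start and
    force submission of the pending read when it changes (models the fix).
    """
    submitted = []
    pending_start = None   # start of the read being built, if any
    pending_len = 0
    last_em_start = None   # assumed sentinel for "no previous extent map"

    for cur in range(0, folio_len, BLOCK):
        if cur >= uptodate_from:
            continue  # corresponds to "skip uptodate" in the trace
        em_start = next(s for s, length in extent_maps if s <= cur < s + length)
        changed = last_em_start is not None and last_em_start != em_start
        if track_last_em_start and changed and pending_start is not None:
            # Different extent map: submit what was gathered so far.
            submitted.append((pending_start, pending_len))
            pending_start, pending_len = None, 0
        last_em_start = em_start
        if pending_start is None:
            pending_start = cur
        pending_len += BLOCK

    if pending_start is not None:
        submitted.append((pending_start, pending_len))
    return submitted


ems = [(0, 32768), (32768, 32768)]
buggy = read_folio_bios(ems, 65536, uptodate_from=61440, track_last_em_start=False)
fixed = read_folio_bios(ems, 65536, uptodate_from=61440, track_last_em_start=True)
print(buggy)  # prints [(0, 61440)]
print(fixed)  # prints [(0, 32768), (32768, 28672)]
```

Without tracking, the model issues one read at file offset 0 with length 61440, the same values as in the btrfs_submit_compressed_read trace line. With tracking, the read is split at the extent map boundary, so each compressed read covers only its own extent map.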

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[ Adjust context ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-19 16:35:47 +02:00
9p 9p: Add a migrate_folio method 2025-06-19 15:32:36 +02:00
adfs
affs
afs
autofs
bcachefs add a string-to-qstr constructor 2025-07-10 16:05:08 +02:00
befs
bfs
btrfs btrfs: fix corruption reading compressed range when block size is smaller than page size 2025-09-19 16:35:47 +02:00
cachefiles cachefiles: Fix the incorrect return value in __cachefiles_write() 2025-07-24 08:56:30 +02:00
ceph ceph: fix race condition where r_parent becomes stale before sending message 2025-09-19 16:35:47 +02:00
coda
configfs configfs: Do not override creating attribute file failure in populate_attrs() 2025-06-27 11:11:12 +01:00
cramfs
crypto fscrypt: Don't use problematic non-inline crypto engines 2025-08-20 18:30:15 +02:00
debugfs debugfs: fix mount options not being applied 2025-08-28 16:31:08 +02:00
devpts
dlm dlm: make tcp still work in multi-link env 2025-05-29 11:02:14 +02:00
ecryptfs
efivarfs efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare 2025-09-04 15:31:52 +02:00
efs
erofs erofs: fix atomic context detection when !CONFIG_DEBUG_LOCK_ALLOC 2025-09-04 15:31:44 +02:00
exfat exfat: add cluster chain loop check for dir 2025-08-20 18:30:47 +02:00
exportfs
ext2 ext2: Handle fiemap on empty files to prevent EINVAL 2025-08-20 18:30:21 +02:00
ext4 ext4: introduce linear search for dentries 2025-09-19 16:35:42 +02:00
f2fs f2fs: fix to avoid out-of-boundary access in dnode page 2025-08-28 16:30:59 +02:00
fat
freevxfs
fuse fuse: prevent overflow in copy_file_range return value 2025-09-19 16:35:46 +02:00
gfs2 gfs2: Set .migrate_folio in gfs2_{rgrp,meta}_aops 2025-08-20 18:30:20 +02:00
hfs hfs: fix not erasing deleted b-tree node issue 2025-08-20 18:30:20 +02:00
hfsplus hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file() 2025-08-20 18:30:19 +02:00
hostfs
hpfs
hugetlbfs
iomap iomap: skip unnecessary ifs_block_is_uptodate check 2025-05-02 07:59:27 +02:00
isofs isofs: Verify inode mode when loading from disk 2025-07-24 08:56:25 +02:00
jbd2 jbd2: prevent softlockup in jbd2_log_do_checkpoint() 2025-08-28 16:30:59 +02:00
jffs2 jffs2: check jffs2_prealloc_raw_node_refs() result in few other places 2025-06-27 11:11:37 +01:00
jfs jfs: upper bound check of tree index in dbAllocAG 2025-08-20 18:30:42 +02:00
kernfs kernfs: Fix UAF in polling when open file is released 2025-09-19 16:35:47 +02:00
lockd
minix
netfs netfs: Fix unbuffered write error handling 2025-08-28 16:31:04 +02:00
nfs NFSv4/flexfiles: Fix layout merge mirror check. 2025-09-19 16:35:44 +02:00
nfs_common
nfsd NFSD: detect mismatch of file handle and delegation stateid in OPEN op 2025-08-20 18:30:14 +02:00
nilfs2 nilfs2: reject invalid file types when reading inodes 2025-08-01 09:48:43 +01:00
nls
notify fanotify: sanitize handle_type values when reporting fid 2025-08-15 12:13:51 +02:00
ntfs3 fs/ntfs3: correctly create symlink for relative path 2025-08-20 18:30:21 +02:00
ocfs2 ocfs2: fix recursive semaphore deadlock in fiemap call 2025-09-19 16:35:45 +02:00
omfs
openpromfs
orangefs fs/orangefs: use snprintf() instead of sprintf() 2025-08-20 18:30:41 +02:00
overlayfs ovl: use I_MUTEX_PARENT when locking parent in ovl_create_temp() 2025-08-28 16:31:10 +02:00
proc proc: fix type confusion in pde_set_flags() 2025-09-19 16:35:45 +02:00
pstore pstore: Change kmsg_bytes storage size to u32 2025-05-29 11:02:58 +02:00
qnx4
qnx6
quota
ramfs
reiserfs
romfs
smb cifs: prevent NULL pointer dereference in UTF16 conversion 2025-09-09 18:58:18 +02:00
squashfs squashfs: fix memory leak in squashfs_fill_super 2025-08-28 16:31:05 +02:00
sysfs
sysv
tests
tracefs tracefs: Add d_delete to remove negative dentries 2025-08-20 18:30:21 +02:00
ubifs
udf udf: Verify partition map count 2025-08-20 18:30:20 +02:00
ufs
unicode
vboxsf
verity
xfs xfs: do not propagate ENODATA disk errors into xattr code 2025-09-04 15:31:54 +02:00
zonefs
Kconfig nfs: add missing selections of CONFIG_CRC32 2025-04-25 10:47:50 +02:00
Kconfig.binfmt
Makefile
aio.c
anon_inodes.c fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass 2025-07-10 16:05:09 +02:00
attr.c
backing-file.c
bad_inode.c
binfmt_elf.c binfmt_elf: Move brk for static PIE even if ASLR disabled 2025-05-22 14:29:35 +02:00
binfmt_elf_fdpic.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
bpf_fs_kfuncs.c
buffer.c fs/buffer: fix use-after-free when call bh_read() helper 2025-08-28 16:31:08 +02:00
char_dev.c
compat_binfmt_elf.c
coredump.c coredump: hand a pidfd to the usermode coredump helper 2025-06-04 14:43:52 +02:00
d_path.c
dax.c
dcache.c
direct-io.c
drop_caches.c
eventfd.c
eventpoll.c eventpoll: Fix semi-unbounded recursion 2025-08-20 18:30:15 +02:00
exec.c
fcntl.c
fhandle.c fhandle: use more consistent rules for decoding file handle from userns 2025-09-19 16:35:41 +02:00
file.c alloc_fdtable(): change calling conventions. 2025-08-28 16:31:16 +02:00
file_table.c add a string-to-qstr constructor 2025-07-10 16:05:08 +02:00
filesystems.c fs/filesystems: Fix potential unsigned integer underflow in fs_name() 2025-06-19 15:32:32 +02:00
fs-writeback.c fs: writeback: fix use-after-free in __mark_inode_dirty() 2025-09-09 18:58:03 +02:00
fs_context.c
fs_parser.c
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c
internal.h
ioctl.c
kernel_read_file.c
libfs.c better lockdep annotations for simple_recursive_removal() 2025-08-20 18:30:20 +02:00
locks.c
mbcache.c
mnt_idmapping.c
mount.h
mpage.c
namei.c
namespace.c fs/fhandle.c: fix a race in call of has_locked_children() 2025-09-09 18:58:19 +02:00
nsfs.c
open.c
pidfs.c pidfs: raise SB_I_NODEV and SB_I_NOEXEC 2025-08-20 18:30:21 +02:00
pipe.c
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c
readdir.c
remap_range.c
select.c
seq_file.c
signalfd.c
splice.c netfs: Fix unbuffered write error handling 2025-08-28 16:31:04 +02:00
stack.c
stat.c
statfs.c
super.c
sync.c
sysctls.c
timerfd.c
userfaultfd.c mm/userfaultfd: fix uninitialized output field for -EAGAIN race 2025-05-18 08:24:52 +02:00
utimes.c
xattr.c fs/xattr.c: fix simple_xattr_list() 2025-06-27 11:11:36 +01:00