Commit Graph

264 Commits

Author SHA1 Message Date
Rafael Aquini 6cb5f8a499 mm: khugepaged: fix call hpage_collapse_scan_file() for anonymous vma
JIRA: https://issues.redhat.com/browse/RHEL-84184

This patch is a backport of the following upstream commit:
commit f1897f2f08b28ae59476d8b73374b08f856973af
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Sat Jan 11 11:45:11 2025 +0800

    mm: khugepaged: fix call hpage_collapse_scan_file() for anonymous vma

    syzkaller reported such a BUG_ON():

     ------------[ cut here ]------------
     kernel BUG at mm/khugepaged.c:1835!
     Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
     ...
     CPU: 6 UID: 0 PID: 8009 Comm: syz.15.106 Kdump: loaded Tainted: G        W          6.13.0-rc6 #22
     Tainted: [W]=WARN
     Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
     pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
     pc : collapse_file+0xa44/0x1400
     lr : collapse_file+0x88/0x1400
     sp : ffff80008afe3a60
     ...
     Call trace:
      collapse_file+0xa44/0x1400 (P)
      hpage_collapse_scan_file+0x278/0x400
      madvise_collapse+0x1bc/0x678
      madvise_vma_behavior+0x32c/0x448
      madvise_walk_vmas.constprop.0+0xbc/0x140
      do_madvise.part.0+0xdc/0x2c8
      __arm64_sys_madvise+0x68/0x88
      invoke_syscall+0x50/0x120
      el0_svc_common.constprop.0+0xc8/0xf0
      do_el0_svc+0x24/0x38
      el0_svc+0x34/0x128
      el0t_64_sync_handler+0xc8/0xd0
      el0t_64_sync+0x190/0x198

    This indicates that the pgoff is unaligned.  After analysis, I confirmed
    that the vma is mapped to /dev/zero.  Such a vma certainly has a vm_file,
    but it is set to anonymous by mmap_zero().  So even if it is mmapped
    2M-unaligned, it can pass the check in thp_vma_allowable_order() because
    it is an anonymous mmap, but it is then collapsed as a file mmap.

    The problem seems to have existed for a long time, but in practice, since
    we have the khugepaged_max_ptes_none check first, we would skip collapsing
    it as it is /dev/zero and so has no present page.  But commit d8ea7cc8547c
    limited that check to khugepaged only, so the BUG_ON() can now be
    triggered by madvise_collapse().

    Add a vma_is_anonymous() check so that such a vma is processed by
    hpage_collapse_scan_pmd() instead.

    Link: https://lkml.kernel.org/r/20250111034511.2223353-1-liushixin2@huawei.com
    Fixes: d8ea7cc8547c ("mm/khugepaged: add flag to predicate khugepaged-only behavior")
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Reviewed-by: Yang Shi <yang@os.amperecomputing.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Chengming Zhou <chengming.zhou@linux.dev>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mattew Wilcox <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:57 -04:00
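
A rough, hedged sketch of the triggering sequence described above: a MAP_PRIVATE mapping of /dev/zero (made anonymous despite carrying a vm_file) followed by madvise(MADV_COLLAPSE). The exact size, offset and error handling here are assumptions, not taken from the syzkaller reproducer, and MADV_COLLAPSE needs recent kernel headers.

    #include <fcntl.h>
    #include <sys/mman.h>

    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25        /* value from <linux/mman.h> on recent kernels */
    #endif

    int main(void)
    {
            /* MAP_PRIVATE /dev/zero is set anonymous by mmap_zero() yet keeps a
             * vm_file; an unaligned pgoff is what the BUG_ON complained about
             * before the fix.  Sizes/offsets here are illustrative guesses. */
            int fd = open("/dev/zero", O_RDWR);
            char *p = mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE, fd, 4096);

            if (fd < 0 || p == MAP_FAILED)
                    return 1;
            /* Before the fix this could go down the file-collapse path. */
            madvise(p, 4UL << 20, MADV_COLLAPSE);
            return 0;
    }
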
Rafael Aquini 0415ecb7aa mm, madvise: fix potential workingset node list_lru leaks
JIRA: https://issues.redhat.com/browse/RHEL-84184

This patch is a backport of the following upstream commit:
commit 62e72d2cf702a5e2fb53d9c46ed900d9384e4a06
Author: Kairui Song <kasong@tencent.com>
Date:   Sun Dec 22 20:29:36 2024 +0800

    mm, madvise: fix potential workingset node list_lru leaks

    Since commit 5abc1e37afa0 ("mm: list_lru: allocate list_lru_one only when
    needed"), all list_lru users need to allocate the items using the new
    infrastructure that provides list_lru info for slab allocation, ensuring
    that the corresponding memcg list_lru is allocated before use.

    For workingset shadow nodes (which are xa_node), users are converted to
    use the new infrastructure by commit 9bbdc0f32409 ("xarray: use
    kmem_cache_alloc_lru to allocate xa_node").  The xas->xa_lru will be set
    correctly for filemap users.  However, there is a missing case: xa_node
    allocations caused by madvise(..., MADV_COLLAPSE).

    madvise(..., MADV_COLLAPSE) will also read in the absent parts of file
    map, and there will be xa_nodes allocated for the caller's memcg (assuming
    it's not rootcg).  However, these allocations won't trigger memcg list_lru
    allocation because the proper xas info was not set.

    If nothing else has allocated other xa_nodes for that memcg to trigger
    list_lru creation, and memory pressure starts to evict file pages,
    workingset_update_node will try to add these xa_nodes to their
    corresponding memcg list_lru, and it does not exist (NULL).  So they will
    be added to rootcg's list_lru instead.

    This shouldn't be a significant issue in practice, but it is unexpected
    behavior: these xa_nodes will not be reclaimed effectively, and it may
    lead to incorrect counting of the list_lru->nr_items counter.

    This problem wasn't exposed until the recent commit 28e98022b31ef
    ("mm/list_lru: simplify reparenting and initial allocation") added a
    sanity check: only a dying memcg may have a NULL list_lru when
    list_lru_{add,del} is called.  This problem triggered that WARNING.

    So make madvise(..., MADV_COLLAPSE) also call xas_set_lru() to pass the
    list_lru into which we may later want to insert the xa_node.  Also move
    mapping_set_update to mm/internal.h and turn it into a macro, to avoid
    including extra headers in mm/internal.h.

    Link: https://lkml.kernel.org/r/20241222122936.67501-1-ryncsn@gmail.com
    Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node")
    Reported-by: syzbot+38a0cbd267eff2d286ff@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/lkml/675d01e9.050a0220.37aaf.00be.GAE@google.com/
    Signed-off-by: Kairui Song <kasong@tencent.com>
    Cc: Chengming Zhou <chengming.zhou@linux.dev>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Sasha Levin <sashal@kernel.org>
    Cc: Shakeel Butt <shakeel.butt@linux.dev>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-04-18 08:39:56 -04:00
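
A hedged, kernel-side fragment of the idea described above (not the verbatim fix): before the MADV_COLLAPSE path loads absent page-cache entries, point the xa_state at the same list_lru the regular filemap paths use, so any xa_node allocated on its behalf is charged to the right memcg list_lru.

        XA_STATE(xas, &mapping->i_pages, start);

        /* mapping_set_update() (now a macro in mm/internal.h) calls
         * xas_set_lru() with the workingset shadow-node list_lru, matching
         * what the filemap insertion paths already do. */
        mapping_set_update(&xas, mapping);
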
Rafael Aquini 1cbfdeee67 mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 37f0b47c5143c2957909ced44fc09ffb118c99f7
Author: Yang Shi <yang@os.amperecomputing.com>
Date:   Fri Oct 11 18:17:02 2024 -0700

    mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point

    The "addr" and "is_shmem" arguments have different order in TP_PROTO and
    TP_ARGS.  This resulted in the incorrect trace result:

    text-hugepage-644429 [276] 392092.878683: mm_khugepaged_collapse_file:
    mm=0xffff20025d52c440, hpage_pfn=0x200678c00, index=512, addr=1, is_shmem=0,
    filename=text-hugepage, nr=512, result=failed

    The value of "addr" is wrong because it was treated as a bool value, the
    type of is_shmem.

    Fix the order in TP_PROTO to keep "addr" before "is_shmem", since the
    original patch review suggested this order to achieve the best packing.

    Also use "lx" for "addr" instead of "ld" in TP_printk, because addresses
    are typically shown in hex.

    After the fix, the trace result looks correct:

    text-hugepage-7291  [004]   128.627251: mm_khugepaged_collapse_file:
    mm=0xffff0001328f9500, hpage_pfn=0x20016ea00, index=512, addr=0x400000,
    is_shmem=0, filename=text-hugepage, nr=512, result=failed

    Link: https://lkml.kernel.org/r/20241012011702.1084846-1-yang@os.amperecomputing.com
    Fixes: 4c9473e87e75 ("mm/khugepaged: add tracepoint to collapse_file()")
    Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
    Cc: Gautam Menghani <gautammenghani201@gmail.com>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Cc: <stable@vger.kernel.org>    [6.2+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:45 -05:00
Rafael Aquini efa86c2ce1 khugepaged: use a folio throughout hpage_collapse_scan_file()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 43849758fdc976a6d6108ed6dfccdb136fdeec39
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Apr 3 18:18:36 2024 +0100

    khugepaged: use a folio throughout hpage_collapse_scan_file()

    Replace the use of pages with folios.  Saves a few calls to
    compound_head() and removes some uses of obsolete functions.

    Link: https://lkml.kernel.org/r/20240403171838.1445826-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:02 -05:00
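
A hedged, generic illustration of the page-to-folio pattern this series of conversions applies (not an actual hunk from the commit): resolve the head page once, then operate on the folio, so each helper no longer performs a hidden compound_head() call.

        struct folio *folio = page_folio(page);    /* one explicit head lookup */

        if (folio_trylock(folio)) {
                folio_get(folio);                   /* was get_page(page) */
                folio_unlock(folio);
                folio_put(folio);                   /* was put_page(page) */
        }
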
Rafael Aquini 686c3d52b5 khugepaged: use a folio throughout collapse_file()
JIRA: https://issues.redhat.com/browse/RHEL-27745
Conflicts:
  * difference on the 3rd hunk due to RHEL backport commits 550b92cdb3
    ("khugepaged: call shmem_get_folio()") and 55c4fe91b6 ("khugepage:
    replace try_to_release_page() with filemap_release_folio()") being
    introduced out-of-order with respect to each other, causing the former
    commit to introduce the leftover struct folio declaration that is being
    removed now.

This patch is a backport of the following upstream commit:
commit 8d1e24c0b82d9730d05ee85eb7f4195df8cdf6a6
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Apr 3 18:18:35 2024 +0100

    khugepaged: use a folio throughout collapse_file()

    Pull folios from the page cache instead of pages.  Half of this work had
    been done already, but we were still operating on pages for a large chunk
    of this function.  There is no attempt in this patch to handle large
    folios that are smaller than a THP; that will have to wait for a future
    patch.

    [willy@infradead.org: the unlikely() is embedded in IS_ERR()]
      Link: https://lkml.kernel.org/r/ZhIWX8K0E2tSyMSr@casper.infradead.org
    Link: https://lkml.kernel.org/r/20240403171838.1445826-7-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:01 -05:00
Rafael Aquini 03e5873871 khugepaged: pass a folio to __collapse_huge_page_copy()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 8eca68e2cfdf863e98dc3c2cc8b2be9cac46b9d6
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Apr 3 18:18:33 2024 +0100

    khugepaged: pass a folio to __collapse_huge_page_copy()

    Simplify the body of __collapse_huge_page_copy() while I'm looking at
    it.

    Link: https://lkml.kernel.org/r/20240403171838.1445826-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:25:00 -05:00
Rafael Aquini a6912284d2 khugepaged: remove hpage from collapse_huge_page()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 0234779276e56fb17677f3cf64d7cd501f8abe69
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Apr 3 18:18:32 2024 +0100

    khugepaged: remove hpage from collapse_huge_page()

    Work purely in terms of the folio.  Removes a call to compound_head()
    in put_page().

    Link: https://lkml.kernel.org/r/20240403171838.1445826-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:59 -05:00
Rafael Aquini 8aac432c67 khugepaged: remove hpage from collapse_file()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 610ff817b981921213ae51e5c5f38c76c6f0405e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Apr 3 18:18:34 2024 +0100

    khugepaged: remove hpage from collapse_file()

    Use new_folio throughout where we had been using hpage.

    Link: https://lkml.kernel.org/r/20240403171838.1445826-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:58 -05:00
Rafael Aquini f86bf1657d khugepaged: convert alloc_charge_hpage to alloc_charge_folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit d5ab50b9412c0bba750eef5a34fd2937de1aee55
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Apr 3 18:18:31 2024 +0100

    khugepaged: convert alloc_charge_hpage to alloc_charge_folio

    Both callers want to deal with a folio, so return a folio from this
    function.

    Link: https://lkml.kernel.org/r/20240403171838.1445826-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:58 -05:00
Rafael Aquini 5e794f5b8a khugepaged: inline hpage_collapse_alloc_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 4746f5ce0fa52e21b5fe432970fe9516d1a45ebc
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Apr 3 18:18:30 2024 +0100

    khugepaged: inline hpage_collapse_alloc_folio()

    Patch series "khugepaged folio conversions".

    We've been kind of hacking piecemeal at converting khugepaged to use
    folios instead of compound pages, and so this patchset is a little larger
    than it should be as I undo some of our wrong moves in the past.  In
    particular, collapse_file() now consistently uses 'new_folio' for the
    freshly allocated folio and 'folio' for the one that's currently in use.

    This patch (of 7):

    This function has one caller, and the combined function is simpler to
    read, reason about and modify.

    Link: https://lkml.kernel.org/r/20240403171838.1445826-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20240403171838.1445826-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:57 -05:00
Rafael Aquini d5b3eb5cd7 mm/khugepaged: use a folio more in collapse_file()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit b54d60b18e850561e8bdb4264ae740676c3b7658
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Dec 28 08:57:47 2023 +0000

    mm/khugepaged: use a folio more in collapse_file()

    This function is not yet fully converted to the folio API, but this
    removes a few uses of old APIs.

    Link: https://lkml.kernel.org/r/20231228085748.1083901-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:13 -05:00
Rafael Aquini 5156eb66c9 mm: convert collapse_huge_page() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 5432726848bb27a01badcbc93b596f39ee6c5ffb
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Dec 11 16:22:13 2023 +0000

    mm: convert collapse_huge_page() to use a folio

    Replace three calls to compound_head() with one.

    Link: https://lkml.kernel.org/r/20231211162214.2146080-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:54 -05:00
Rafael Aquini d7d6608d6b mm/khugepaged: convert collapse_pte_mapped_thp() to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 98b32d296d95d7aa0516c36b72406277412268cd
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Fri Oct 20 11:33:31 2023 -0700

    mm/khugepaged: convert collapse_pte_mapped_thp() to use folios

    This removes 2 calls to compound_head() and helps convert khugepaged to
    use folios throughout.

    Previously, if the address passed to collapse_pte_mapped_thp()
    corresponded to a tail page, the scan would fail immediately. Using
    filemap_lock_folio() we get the corresponding folio back and try to
    operate on the folio instead.

    Link: https://lkml.kernel.org/r/20231020183331.10770-6-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:06 -05:00
Rafael Aquini 1e246b9a35 mm/khugepaged: convert alloc_charge_hpage() to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit b455f39d228935f88eebcd1f7c1a6981093c6a3b
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Fri Oct 20 11:33:30 2023 -0700

    mm/khugepaged: convert alloc_charge_hpage() to use folios

    Also remove count_memcg_page_event now that its last caller no longer uses
    it, and rename hpage_collapse_alloc_page() to hpage_collapse_alloc_folio().

    This removes 1 call to compound_head() and helps convert khugepaged to
    use folios throughout.

    Link: https://lkml.kernel.org/r/20231020183331.10770-5-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:05 -05:00
Rafael Aquini e77cde681f mm/khugepaged: convert is_refcount_suitable() to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit dbf85c21e4aff90912b5d7755d2b25611f9191e9
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Fri Oct 20 11:33:29 2023 -0700

    mm/khugepaged: convert is_refcount_suitable() to use folios

    Both callers of is_refcount_suitable() have been converted to use
    folios, so convert it to take in a folio. Both callers only operate on
    head pages of folios so mapcount/refcount conversions here are trivial.

    Removes 3 calls to compound_head(), and removes 315 bytes of kernel text.

    Link: https://lkml.kernel.org/r/20231020183331.10770-4-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:04 -05:00
Rafael Aquini 80c0a19bff mm/khugepaged: convert hpage_collapse_scan_pmd() to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 5c07ebb372d66423e508ecfb8e00324f8797f072
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Fri Oct 20 11:33:28 2023 -0700

    mm/khugepaged: convert hpage_collapse_scan_pmd() to use folios

    Replaces 5 calls to compound_head(), and removes 1385 bytes of kernel
    text.

    Link: https://lkml.kernel.org/r/20231020183331.10770-3-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:03 -05:00
Rafael Aquini 3668660749 mm/khugepaged: convert __collapse_huge_page_isolate() to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 8dd1e896735f6e5abf66525dfd39bbd7b8c0c6d6
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Fri Oct 20 11:33:27 2023 -0700

    mm/khugepaged: convert __collapse_huge_page_isolate() to use folios

    Patch series "Some khugepaged folio conversions", v3.

    This patchset converts a number of functions to use folios.  This cleans
    up some khugepaged code and removes a large number of hidden
    compound_head() calls.

    This patch (of 5):

    Replaces 11 calls to compound_head() with 1, and removes 1348 bytes of
    kernel text.

    Link: https://lkml.kernel.org/r/20231020183331.10770-1-vishal.moola@gmail.com
    Link: https://lkml.kernel.org/r/20231020183331.10770-2-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:03 -05:00
Rafael Aquini 17adb85d51 mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit a98460494b16db9c377e55bc13e5407a0eb79fe8
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Aug 21 12:51:20 2023 -0700

    mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

    Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
    shmem mapping can add valid PTEs to a page table which
    collapse_pte_mapped_thp() thought it had emptied: the page lock on the
    huge page is enough to protect against WP faults (which find the PTE has
    been cleared), but not enough to protect against userfaultfd.  "BUG: Bad
    rss-counter state" followed.

    retract_page_tables() protects against this by checking !vma->anon_vma;
    but we know that MADV_COLLAPSE needs to be able to work on private shmem
    mappings, even those with an anon_vma prepared for another part of the
    mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
    mappings which are userfaultfd_armed().  Whether it needs to work on
    private shmem mappings which are userfaultfd_armed(), I'm not so sure: but
    assume that it does.

    Just for this case, take the pmd_lock() two steps earlier: not because it
    gives any protection against this case itself, but because ptlock nests
    inside it, and it's the dropping of ptlock which let the bug in.  In other
    cases, continue to minimize the pmd_lock() hold time.

    Link: https://lkml.kernel.org/r/4d31abf5-56c0-9f3d-d12f-c9317936691@google.com
    Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reported-by: Jann Horn <jannh@google.com>
    Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:08 -04:00
Rafael Aquini 6e53c42dda mm: convert prep_transhuge_page() to folio_prep_large_rmappable()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit da6e7bf3a0315025e4199d599bd31763f0df3b4a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 16 16:11:53 2023 +0100

    mm: convert prep_transhuge_page() to folio_prep_large_rmappable()

    Match folio_undo_large_rmappable(), and move the casting from page to
    folio into the callers (which they were largely doing anyway).

    Link: https://lkml.kernel.org/r/20230816151201.3655946-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Yanteng Si <siyanteng@loongson.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:48 -04:00
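
A hedged before/after illustration of the caller-side change described above (generic call site, not a hunk from the commit):

        prep_transhuge_page(page);                     /* before: cast done inside the helper */
        folio_prep_large_rmappable(page_folio(page));  /* after: caller supplies the folio    */
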
Rafael Aquini 63fb2ece73 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit d50791c2bee9ed97b1dd81db9bbb11caddcdfb0d
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Jul 11 21:43:36 2023 -0700

    mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()

    Now that retract_page_tables() can retract page tables reliably, without
    depending on trylocks, delete all the apparatus for khugepaged to try
    again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
    per-mm memory which was set aside for that in the khugepaged_mm_slot.

    But one part of that is worth keeping: when hpage_collapse_scan_file()
    found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot to
    be tried for retraction later - catching, for example, page tables where a
    reversible mprotect() of a portion had required splitting the pmd, but now
    it can be recollapsed.  Call collapse_pte_mapped_thp() directly in this
    case (why was it deferred before?  I assume an issue with needing
    mmap_lock for write, but now it's only needed for read).

    [hughd@google.com: fix mmap_locked handling]
      Link: https://lkml.kernel.org/r/bfc6cab2-497f-32bf-dd5-98dc1987e4a9@google.com
    Link: https://lkml.kernel.org/r/a5dce57-6dfa-5559-4698-e817eb2f993@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:37 -04:00
Rafael Aquini aad6fa6001 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 1043173eb5eb351a1dba11cca12705075fe74a9e
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Jul 11 21:42:19 2023 -0700

    mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

    Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().  It
    does need mmap_read_lock(), but it does not need mmap_write_lock(), nor
    vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing paths are
    relying on pte_offset_map_lock() and pmd_lock(), so use those.

    Follow the pattern in retract_page_tables(); and using pte_free_defer()
    removes most of the need for tlb_remove_table_sync_one() here; but call
    pmdp_get_lockless_sync() to use it in the PAE case.

    First check the VMA, in case page tables are being torn down: from JannH.
    Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
    acquired and the page looks suitable: from then on its state is stable.

    However, collapse_pte_mapped_thp() was doing something others don't:
    freeing a page table still containing "valid" entries.  i_mmap lock did
    stop a racing truncate from double-freeing those pages, but we prefer
    collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB flush
    can wait until the pmdp_collapse_flush() which follows, but the
    mmu_notifier_invalidate_range_start() has to be done earlier.

    Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
    for khugepaged to keep on repeatedly invalidating a range which is then
    found unsuitable e.g.  contains COWs.  "step 2", which does the clearing,
    must then be more careful (after dropping ptl to do mmu_notifier), with
    abort prepared to correct the accounting like "step 3".  But with those
    entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
    safe by the huge page lock, which stops new PTEs from being faulted in.

    [hughd@google.com: don't set mmap_locked = true in madvise_collapse()]
      Link: https://lkml.kernel.org/r/d3d9ff14-ef8-8f84-e160-bfa1f5794275@google.com
    [hughd@google.com: use ptep_clear() instead of pte_clear()]
      Link: https://lkml.kernel.org/r/e0197433-8a47-6a65-534d-eda26eeb78b0@google.com
    Link: https://lkml.kernel.org/r/b53be6a4-7715-51f9-aad-f1347dcb7c4@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:36 -04:00
Rafael Aquini 14ee903f19 mm/khugepaged: retract_page_tables() without mmap or vma lock
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 1d65b771bc08cd054cf6d3766a72e113dc46d62f
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Jul 11 21:41:04 2023 -0700

    mm/khugepaged: retract_page_tables() without mmap or vma lock

    Simplify shmem and file THP collapse's retract_page_tables(), and relax
    its locking: to improve its success rate and to lessen impact on others.

    Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
    target_mm, leave that part of the work to madvise_collapse() calling
    collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s result
    code to arrange for that.  That spares retract_page_tables() four
    arguments; and since it will be successful in retracting all of the page
    tables expected of it, no need to track and return a result code itself.

    It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
    but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
    allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
    THPs.  retract_page_tables() just needs to use those same spinlocks to
    exclude it briefly, while transitioning pmd from page table to none: so
    restore its use of pmd_lock() inside of which pte lock is nested.

    Users of pte_offset_map_lock() etc all now allow for them to fail: so
    retract_page_tables() now has no use for mmap_write_trylock() or
    vma_try_start_write().  In common with rmap and page_vma_mapped_walk(), it
    does not even need the mmap_read_lock().

    But those users do expect the page table to remain a good page table,
    until they unlock and rcu_read_unlock(): so the page table cannot be freed
    immediately, but rather by the recently added pte_free_defer().

    Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
    when PAE, and pmdp_collapse_flush() did not already do so: to make sure
    that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
    cannot pick up a pmd entry with mismatched pmd_low and pmd_high.

    retract_page_tables() can be enhanced to replace_page_tables(), which
    inserts the final huge pmd without mmap lock: going through an invalid
    state instead of pmd_none() followed by fault.  But that enhancement does
    raise some more questions: leave it until a later release.

    Link: https://lkml.kernel.org/r/f88970d9-d347-9762-ae6d-da978e8a4df@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:35 -04:00
Rafael Aquini ec84ab01c5 ksm: add ksm zero pages for each process
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6080d19f07043ade61094d0f58b14c05e1694a39
Author: xu xin <xu.xin16@zte.com.cn>
Date:   Tue Jun 13 11:09:38 2023 +0800

    ksm: add ksm zero pages for each process

    As the number of KSM zero pages is not included in the per-process
    ksm_merging_pages when use_zero_pages is enabled, it is unclear how many
    pages are actually merged by KSM.  To let users accurately estimate their
    memory demands when unsharing KSM zero pages, it is necessary to show KSM
    zero pages per process.  In addition, it helps users to know the actual
    KSM profit, because KSM-placed zero pages also benefit from KSM.

    Since unsharing zero pages placed by KSM can now be done accurately,
    tracking the merging and unmerging of empty pages is no longer difficult.

    Since we already have /proc/<pid>/ksm_stat, just add the 'ksm_zero_pages'
    information to it.

    Link: https://lkml.kernel.org/r/20230613030938.185993-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:21 -04:00
Rafael Aquini 993ca53ef9 ksm: count all zero pages placed by KSM
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit e2942062e01df85b4692460fe5b48ab0c90fdb95
Author: xu xin <xu.xin16@zte.com.cn>
Date:   Tue Jun 13 11:09:34 2023 +0800

    ksm: count all zero pages placed by KSM

    As pages_sharing and pages_shared don't include the number of zero pages
    merged by KSM, we cannot know how many pages are zero pages placed by KSM
    when use_zero_pages is enabled, which means KSM is not transparent about
    all of the pages it actually merges.  In the early days of use_zero_pages,
    zero pages could not be unshared by means such as MADV_UNMERGEABLE, so it
    was hard to count how many times one of those zero pages was later
    unmerged.

    But now, accurate unsharing of KSM-placed zero pages has been achieved, so
    we can easily count both how many times a page full of zeroes was merged
    with the zero page and how many times one of those pages was later
    unmerged.  This helps to estimate memory demands when each and every
    shared page could get unshared.

    So add ksm_zero_pages under /sys/kernel/mm/ksm/ to show the number of all
    zero pages placed by KSM, and update the Documentation accordingly.

    Link: https://lkml.kernel.org/r/20230613030934.185944-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
    Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:20 -04:00
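
A hedged usage sketch for the two interfaces the KSM commits above describe; the paths come from the commit messages, the surrounding boilerplate is an assumption, and it requires a kernel carrying these changes with use_zero_pages enabled.

    #include <stdio.h>

    static void dump(const char *path)
    {
            char line[256];
            FILE *f = fopen(path, "r");

            if (!f)
                    return;
            printf("== %s\n", path);
            while (fgets(line, sizeof(line), f))
                    fputs(line, stdout);
            fclose(f);
    }

    int main(void)
    {
            dump("/sys/kernel/mm/ksm/ksm_zero_pages");  /* all zero pages placed by KSM */
            dump("/proc/self/ksm_stat");                /* per-process stats, including ksm_zero_pages */
            return 0;
    }
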
Rafael Aquini 25e4aa840e mm: remove references to pagevec
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 1fec6890bf2247ecc93f5491c2d3f33c333d5c6e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jun 21 17:45:56 2023 +0100

    mm: remove references to pagevec

    Most of these should just refer to the LRU cache rather than the data
    structure used to implement the LRU cache.

    Link: https://lkml.kernel.org/r/20230621164557.3510324-13-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:32 -04:00
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unecessary given the proactive work done
      on the backport of upstream  commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs  as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor contex differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
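
A hedged, generic fragment showing the conversion this commit performs (not an actual hunk from it):

        pte_t pte;

        pte = *ptep;            /* before: plain C dereference of the pte entry   */
        pte = ptep_get(ptep);   /* after: READ_ONCE() by default, arch-overridable */
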
Rafael Aquini b8b6bc7070 mm/khugepaged: use DEFINE_READ_MOSTLY_HASHTABLE macro
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit e1ad3e66676479d6a0af6be953767f865c902111
Author: Nick Desaulniers <ndesaulniers@google.com>
Date:   Fri Jun 9 16:44:45 2023 -0700

    mm/khugepaged: use DEFINE_READ_MOSTLY_HASHTABLE macro

    These are equivalent, but DEFINE_READ_MOSTLY_HASHTABLE exists to define
    a hashtable in the .data..read_mostly section.

    Link: https://lkml.kernel.org/r/20230609-khugepage-v1-1-dad4e8382298@google.com
    Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:41 -04:00
Rafael Aquini 5c773d7e1d mm/pgtable: delete pmd_trans_unstable() and friends
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit feda5c393a6c843c7bf1fc49e1381e2d3822b564
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:50:37 2023 -0700

    mm/pgtable: delete pmd_trans_unstable() and friends

    Delete pmd_trans_unstable, pmd_none_or_trans_huge_or_clear_bad() and
    pmd_devmap_trans_unstable(), all now unused.

    With mixed feelings, delete all the comments on pmd_trans_unstable().
    That was very good documentation of a subtle state, and this series does
    not even eliminate that state: but rather, normalizes and extends it,
    asking pte_offset_map[_lock]() callers to anticipate failure, without
    regard for whether mmap_read_lock() or mmap_write_lock() is held.

    Retain pud_trans_unstable(), which has one use in __handle_mm_fault(), but
    delete its equivalent pud_none_or_trans_huge_or_dev_or_clear_bad().  While
    there, move the default arch_needs_pgtable_deposit() definition up near
    where pgtable_trans_huge_deposit() and withdraw() are declared.

    Link: https://lkml.kernel.org/r/5abdab3-3136-b42e-274d-9c6281bfb79@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:38 -04:00
Rafael Aquini 0e2c45c65c mm: khugepaged: avoid pointless allocation for "struct mm_slot"
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 16618670276a77480e274117992cec5e42ce66a9
Author: Xin Hao <xhao@linux.alibaba.com>
Date:   Wed May 31 17:58:17 2023 +0800

    mm: khugepaged: avoid pointless allocation for "struct mm_slot"

    In __khugepaged_enter(), if the MMF_VM_HUGEPAGE bit in "mm->flags" is
    already set, the "mm_slot" is released and the function returns, so we can
    call mm_slot_alloc() after test_and_set_bit().

    Link: https://lkml.kernel.org/r/20230531095817.11012-1-xhao@linux.alibaba.com
    Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
    Reviewed-by: Andrew Morton <akpm@linux-foudation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:17 -04:00
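
A hedged, simplified sketch of the reordering described above (not the actual __khugepaged_enter() body; names follow the commit message):

        if (test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))
                return 0;                       /* already registered: nothing to allocate */

        mm_slot = mm_slot_alloc(mm_slot_cache); /* allocate only after the bit test */
        if (!mm_slot)
                return -ENOMEM;
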
Lucas Zampieri 2424e8e040 Merge: mm: follow up work for the MM v6.4 update and disable CONFIG_PER_VMA_LOCK until it is fixed
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4749

JIRA: https://issues.redhat.com/browse/RHEL-48221  
  
It was identified that our process for bringing in code-base updates
has been unwittingly missing some of the peripheral commits that do
not directly touch the core code under the mm/ directory.
While most of these identified peripheral commits are simple and
basic clean-ups, some are relevant changesets that might end up
causing real (and subtle) issues for RHEL deployments if they
remain missing.

The intent of this patchset is to close the aforementioned gap
by bringing in the missing peripheral commits from v5.14 up to
v6.4, which is the level at which we're parking our codebase for
RHEL-9.5.

A secondary intent of this patchset is to bring in upstream's
v6.5 commit that disables the PER_VMA_LOCK feature, which was
recently introduced (to RHEL-9.5) but was marked BROKEN upstream
circa release v6.5, in order to avoid the reported issues with
memory corruptions in the upstream builds.
  
Signed-off-by: Rafael Aquini <aquini@redhat.com>

Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-06 14:21:52 +00:00
Carlos Maiolino 11fa035b3a shmem: fix quota lock nesting in huge hole handling
JIRA: https://issues.redhat.com/browse/RHEL-7768
Tested: with xfstests

i_pages lock nests inside i_lock, but shmem_charge() and shmem_uncharge()
were being called from THP splitting or collapsing while i_pages lock was
held, and now go on to call dquot_alloc_block_nodirty() which takes
i_lock to update i_blocks.

We may well want to take i_lock out of this path later, in the non-quota
case even if it's left in the quota case (or perhaps use i_lock instead
of shmem's info->lock throughout); but don't get into that at this time.

Move the shmem_charge() and shmem_uncharge() calls out from under i_pages
lock, accounting the full batch of holes in a single call.

Still pass the pages argument to shmem_uncharge(), but it happens now to
be unused: shmem_recalc_inode() is designed to account for clean pages
freed behind shmem's back, so it gets the accounting right by itself;
then the later call to shmem_inode_unacct_blocks() led to imbalance
(that WARN_ON(inode->i_blocks) in shmem_evict_inode()).

Reported-by: syzbot+38ca19393fb3344f57e6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/0000000000008e62f40600bfe080@google.com/
Reported-by: syzbot+440ff8cca06ee7a1d4db@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/00000000000076a7840600bfb6e8@google.com/
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Message-Id: <20230725144510.253763-8-cem@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
(cherry picked from commit 509f006932de7556d48eaa7afcd02dcf1ca9a3e9)
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2024-07-17 07:49:46 +02:00
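
The shape of the fix is to count holes under the xarray lock and do the quota
accounting once after it is dropped; a rough sketch, not the verbatim shmem
code, using the existing xa_lock_irq()/shmem_uncharge() interfaces:

    long nr_holes = 0;

    xa_lock_irq(&mapping->i_pages);
    /* ... walk the range being split/collapsed; previously each hole did
     * shmem_uncharge(inode, 1) right here, which now recurses into
     * dquot_alloc_block_nodirty() and takes i_lock under the xarray lock ... */
    /* nr_holes++ for each hole found */
    xa_unlock_irq(&mapping->i_pages);

    if (nr_holes)
        shmem_uncharge(inode, nr_holes);    /* single batched call, lock dropped */
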
Rafael Aquini b11a709106 mm/khugepaged: alloc_charge_hpage() take care of mem charge errors
JIRA: https://issues.redhat.com/browse/RHEL-48221

This patch is a backport of the following upstream commit:
commit 94c02ad7ff12b988bd7ccf522f23e0b1f68659e0
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Feb 22 14:52:47 2023 -0500

    mm/khugepaged: alloc_charge_hpage() take care of mem charge errors

    If the memory charge fails, instead of returning the hpage along with an
    error, let the function clean up the folio properly, which is what a
    function should normally do in this case - either return successfully, or
    return an error with no side effects from a partial run.

    This will also avoid the caller calling mem_cgroup_uncharge()
    unnecessarily with either anon or shmem path (even if it's safe to do so).

    Link: https://lkml.kernel.org/r/20230222195247.791227-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: David Stevens <stevensd@chromium.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:30:01 -04:00
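
A condensed sketch of the resulting pattern (simplified from alloc_charge_hpage();
the hpage_collapse_alloc_page() helper and the SCAN_* result codes are the ones
used by khugepaged around this series, but treat the fragment as illustrative):

    static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
                                  struct collapse_control *cc)
    {
        gfp_t gfp = GFP_TRANSHUGE;      /* flag selection simplified */

        if (!hpage_collapse_alloc_page(hpage, gfp, cc->node, &cc->alloc_nmask))
            return SCAN_ALLOC_HUGE_PAGE_FAIL;

        if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp))) {
            /* Clean up here rather than handing a half-initialized hpage
             * back to the caller together with an error. */
            put_page(*hpage);
            *hpage = NULL;
            return SCAN_CGROUP_CHARGE_FAIL;
        }
        return SCAN_SUCCEED;
    }
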
Waiman Long 6d0328a7cf Revert "Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8""
JIRA: https://issues.redhat.com/browse/RHEL-36683
Upstream Status: RHEL only

This reverts commit 08637d76a2 which is a
revert of "Merge: cgroup: Backport upstream cgroup commits up to v6.8"

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-18 21:38:20 -04:00
Lucas Zampieri 08637d76a2 Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8"
This reverts merge request !4128
2024-05-16 15:26:41 +00:00
Lucas Zampieri 1ce55b7cbb Merge: cgroup: Backport upstream cgroup commits up to v6.8
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4128

JIRA: https://issues.redhat.com/browse/RHEL-34600    

This MR backports upstream cgroup commits up to v6.8 with related fixes,
if applicable. It also pulls in a number of scheduler and PSI related
commits due to their interaction with cgroup.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-16 13:28:22 +00:00
Nico Pache 0b96934f06 mm/khugepaged: fix regression in collapse_file()
commit e8c716bc6812202ccf4ce0f0bad3428b794fb39c
Author: Hugh Dickins <hughd@google.com>
Date:   Wed Jun 28 21:31:35 2023 -0700

    mm/khugepaged: fix regression in collapse_file()

    There is no xas_pause(&xas) in collapse_file()'s main loop, at the points
    where it does xas_unlock_irq(&xas) and then continues.

    That would explain why, once two weeks ago and twice yesterday, I have
    hit the VM_BUG_ON_PAGE(page != xas_load(&xas), page) since "mm/khugepaged:
    fix iteration in collapse_file" removed the xas_set(&xas, index) just
    before it: xas.xa_node could be left pointing to a stale node, if there
    was concurrent activity on the file which transformed its xarray.

    I tried inserting xas_pause()s, but then even bootup crashed on that
    VM_BUG_ON_PAGE(): there appears to be a subtle "nextness" implicit in
    xas_pause().

    xas_next() and xas_pause() are good for use in simple loops, but not in
    this one: xas_set() worked well until now, so use xas_set(&xas, index)
    explicitly at the head of the loop; and change that VM_BUG_ON_PAGE() not
    to need its own xas_set(), and not to interfere with the xa_state (which
    would probably stop the crashes from xas_pause(), but I trust that less).

    The user-visible effects of this bug (if VM_BUG_ONs are configured out)
    would be data loss and data leak - potentially - though in practice I
    expect it is more likely that a subsequent check (e.g. on mapping or on
    nr_none) would notice an inconsistency, and just abandon the collapse.

    Link: https://lore.kernel.org/linux-mm/f18e4b64-3f88-a8ab-56cc-d1f5f9c58d4@google.com/
    Fixes: c8a8f3b4a95a ("mm/khugepaged: fix iteration in collapse_file")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: stable@kernel.org
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: David Stevens <stevensd@chromium.org>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:26 -06:00
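
The resulting loop re-anchors the xarray cursor on every pass instead of
trusting xas_next() across windows where the lock was dropped; an illustrative
skeleton (not the full collapse_file() body):

    XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
    pgoff_t index;
    struct page *page;

    for (index = start; index < end; index++) {
        xas_set(&xas, index);           /* explicit re-anchor every iteration */
        page = xas_load(&xas);

        /* The body may xas_unlock_irq(&xas), sleep, re-lock and "continue".
         * Because xas_set() runs at the top of the loop, a stale xas.xa_node
         * left behind by concurrent xarray reshaping is never walked from,
         * which is what the VM_BUG_ON_PAGE(page != xas_load(&xas), page)
         * was tripping over. */
    }
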
Chris von Recklinghausen 2f0ee0c1c4 mm/khugepaged: fix iteration in collapse_file
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit c8a8f3b4a95ace7683b615ad9c9aa0eac59013ae
Author: David Stevens <stevensd@chromium.org>
Date:   Wed Jun 7 14:31:35 2023 +0900

    mm/khugepaged: fix iteration in collapse_file

    Remove an unnecessary call to xas_set(index) when iterating over the
    target range in collapse_file.  The extra call to xas_set reset the xas
    cursor to the top of the tree, causing the xas_next call on the next
    iteration to walk the tree to index instead of advancing to index+1.  This
    returned the same page again, which would cause collapse_file to fail
    because the page is already locked.

    This bug was hidden when CONFIG_DEBUG_VM was set.  When that config was
    used, the xas_load in a subsequent VM_BUG_ON assert would walk xas from
    the top of the tree to index, causing the xas_next call on the next loop
    iteration to advance the cursor as expected.

    Link: https://lkml.kernel.org/r/20230607053135.2087354-1-stevensd@google.com
    Fixes: a2e17cc2efc7 ("mm/khugepaged: maintain page cache uptodate flag")
    Signed-off-by: David Stevens <stevensd@chromium.org>
    Reviewed-by: Peter Xu <peterx@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: Kirill A . Shutemov <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:18 -04:00
Chris von Recklinghausen d92d331a28 mm/khugepaged: fix conflicting mods to collapse_file()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 0175ab610c2df7c21d93e4bd63b4e67cfa86737c
Author: Hugh Dickins <hughd@google.com>
Date:   Sat Apr 22 21:47:20 2023 -0700

    mm/khugepaged: fix conflicting mods to collapse_file()

    Inserting Ivan Orlov's syzbot fix commit 2ce0bdfebc74
    ("mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file()")
    ahead of Jiaqi Yan's and David Stevens's commits
    12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
    cae106dd67b9 ("mm/khugepaged: refactor collapse_file control flow")
    ac492b9c70ca ("mm/khugepaged: skip shmem with userfaultfd")
    (all of which restructure collapse_file()) did not work out well.

    xfstests generic/086 on huge tmpfs (with accelerated khugepaged) freezes
    (if not on the first attempt, then the 2nd or 3rd) in find_lock_entries()
    while doing drop_caches: the file's xarray seems to have been corrupted,
    with find_get_entry() returning nonsense which makes no progress.

    Bisection led to ac492b9c70ca; and diff against earlier working linux-next
    suggested that it's probably down to an errant xas_store(), which does not
    belong with the later changes (and nor does the positioning of warnings).
    The later changes look as if they fix the syzbot issue independently.

    Remove most of what's left of 2ce0bdfebc74: just leave one WARN_ON_ONCE
    (xas_error) after the final xas_store() of the multi-index entry.

    Link: https://lkml.kernel.org/r/b6c881-c352-bb91-85a8-febeb09dfd71@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: David Stevens <stevensd@chromium.org>
    Cc: Ivan Orlov <ivan.orlov0322@gmail.com>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:06 -04:00
Chris von Recklinghausen 851c0bcf13 mm/khugepaged: maintain page cache uptodate flag
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit a2e17cc2efc72792c0d13d669d824fe9ab7155a1
Author: David Stevens <stevensd@chromium.org>
Date:   Tue Apr 4 21:01:17 2023 +0900

    mm/khugepaged: maintain page cache uptodate flag

    Make sure that collapse_file doesn't interfere with checking the uptodate
    flag in the page cache by only inserting hpage into the page cache after
    it has been updated and marked uptodate.  This is achieved by simply not
    replacing present pages with hpage when iterating over the target range.

    The present pages are already locked, so replacing them with the locked
    hpage before the collapse is finalized is unnecessary.  However, it is
    necessary to stop freezing the present pages after validating them, since
    leaving long-term frozen pages in the page cache can lead to deadlocks.
    Simply checking the reference count is sufficient to ensure that there
    are no long-term references hanging around that the collapse would
    break.  Similar to hpage, there is no reason that the present pages
    actually need to be frozen in addition to being locked.

    This fixes a race where folio_seek_hole_data would mistake hpage for a
    fallocated but unwritten page.  This race is visible to userspace via data
    temporarily disappearing from SEEK_DATA/SEEK_HOLE.  This also fixes a
    similar race where pages could temporarily disappear from mincore.

    Link: https://lkml.kernel.org/r/20230404120117.2562166-5-stevensd@google.com
    Fixes: f3f0e1d215 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: David Stevens <stevensd@chromium.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:54 -04:00
Chris von Recklinghausen 417f1a6df6 mm/khugepaged: skip shmem with userfaultfd
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit ac492b9c70cac4d887e9dce4410b1d521851e142
Author: David Stevens <stevensd@chromium.org>
Date:   Tue Apr 4 21:01:16 2023 +0900

    mm/khugepaged: skip shmem with userfaultfd

    Make sure that collapse_file respects any userfaultfds registered with
    MODE_MISSING.  If userspace has any such userfaultfds registered, then for
    any page which it knows to be missing, it may expect a
    UFFD_EVENT_PAGEFAULT.  This means collapse_file needs to be careful when
    collapsing a shmem range would result in replacing an empty page with a
    THP, to avoid breaking userfaultfd.

    Synchronization when checking for userfaultfds in collapse_file is tricky
    because the mmap locks can't be used to prevent races with the
    registration of new userfaultfds.  Instead, we provide synchronization by
    ensuring that userspace cannot observe the fact that pages are missing
    before we check for userfaultfds.  Although this allows registration of a
    userfaultfd to race with collapse_file, it ensures that userspace cannot
    observe any pages transition from missing to present after such a race
    occurs.  This makes such a race indistinguishable from the collapse
    occurring immediately before the userfaultfd registration.

    The first step to provide this synchronization is to stop filling gaps
    during the loop iterating over the target range, since the page cache lock
    can be dropped during that loop.  The second step is to fill the gaps with
    XA_RETRY_ENTRY after the page cache lock is acquired the final time, to
    avoid races with accesses to the page cache that only take the RCU read
    lock.

    The fact that we don't fill holes during the initial iteration means that
    collapse_file now has to handle faults occurring during the collapse.
    This is done by re-validating the number of missing pages after acquiring
    the page cache lock for the final time.

    This fix is targeted at khugepaged, but the change also applies to
    MADV_COLLAPSE.  MADV_COLLAPSE on a range with a userfaultfd will now
    return EBUSY if there are any missing pages (instead of succeeding on
    shmem and returning EINVAL on anonymous memory).  There is also now a
    window during MADV_COLLAPSE where a fault on a missing page will cause the
    syscall to fail with EAGAIN.

    The fact that intermediate page cache state can no longer be observed
    before the rollback of a failed collapse is also technically a
    userspace-visible change (via at least SEEK_DATA and SEEK_END), but it is
    exceedingly unlikely that anything relies on being able to observe that
    transient state.

    Link: https://lkml.kernel.org/r/20230404120117.2562166-4-stevensd@google.com
    Signed-off-by: David Stevens <stevensd@chromium.org>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:54 -04:00
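
Conceptually the new check runs once, after the page cache lock has been taken
for the final time and the number of missing pages has been re-validated; a
simplified sketch using the existing userfaultfd_missing() and i_mmap helpers
(the result code and control flow are illustrative):

    if (nr_none) {
        struct vm_area_struct *vma;

        i_mmap_lock_read(mapping);
        vma_interval_tree_foreach(vma, &mapping->i_mmap, start, end - 1) {
            /* A MODE_MISSING userfaultfd expects a fault event for these
             * holes; filling them with a THP would silently swallow it. */
            if (userfaultfd_missing(vma)) {
                result = SCAN_EXCEED_NONE_PTE;
                break;
            }
        }
        i_mmap_unlock_read(mapping);
    }
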
Chris von Recklinghausen b4ca48b370 mm/khugepaged: refactor collapse_file control flow
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit cae106dd67b99a65d117a9f6c977a86b120dad61
Author: David Stevens <stevensd@chromium.org>
Date:   Tue Apr 4 21:01:15 2023 +0900

    mm/khugepaged: refactor collapse_file control flow

    Add a rollback label to deal with failure, instead of continuously
    checking for RESULT_SUCCESS, to make it easier to add more failure cases.
    The refactoring also allows the collapse_file tracepoint to include hpage
    on success (instead of NULL).

    Link: https://lkml.kernel.org/r/20230404120117.2562166-3-stevensd@google.com
    Signed-off-by: David Stevens <stevensd@chromium.org>
    Acked-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:53 -04:00
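
The control-flow change is the familiar goto-unwind pattern; schematically
(helper names below are hypothetical stand-ins for the scan, copy and unwind
phases of collapse_file()):

    result = scan_and_lock_target_range();
    if (result != SCAN_SUCCEED)
        goto rollback;

    result = copy_pages_into_hpage();
    if (result != SCAN_SUCCEED)
        goto rollback;

    /* single success path: install hpage; the tracepoint can now report
     * the real hpage instead of NULL */
    return SCAN_SUCCEED;

rollback:
    unwind_partial_collapse();          /* one place to extend for new failures */
    return result;
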
Chris von Recklinghausen ee390e7f13 mm/khugepaged: drain lru after swapping in shmem
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit efa3d814fad151ed5209539ecc1fc2880f7680b2
Author: David Stevens <stevensd@chromium.org>
Date:   Tue Apr 4 21:01:14 2023 +0900

    mm/khugepaged: drain lru after swapping in shmem

    Patch series "mm/khugepaged: fixes for khugepaged+shmem", v6.

    This series reworks collapse_file so that the intermediate state of the
    collapse does not leak out of collapse_file. Although this makes
    collapse_file a bit more complicated, it means that the rest of the
    kernel doesn't have to deal with the unusual state. This directly fixes
    races with both lseek and mincore.

    This series also fixes the fact that khugepaged completely breaks
    userfaultfd+shmem. The rework of collapse_file provides a convenient
    place to check for registered userfaultfds without making the shmem
    userfaultfd implementation care about khugepaged.

    Finally, this series adds a lru_add_drain after swapping in shmem pages,
    which makes the subsequent folio_isolate_lru significantly more likely to
    succeed.

    This patch (of 4):

    Call lru_add_drain after swapping in shmem pages so that isolate_lru_page
    is more likely to succeed.

    Link: https://lkml.kernel.org/r/20230404120117.2562166-1-stevensd@google.com
    Link: https://lkml.kernel.org/r/20230404120117.2562166-2-stevensd@google.com
    Signed-off-by: David Stevens <stevensd@chromium.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:53 -04:00
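
The change itself is small but its placement is the point: drain the per-CPU
folio batches after the shmem swap-in so the pages are actually on an LRU list
when isolation is attempted. A hedged sketch of the ordering:

    /* ... fault shmem pages back in from swap for the target range ... */

    /* Freshly swapped-in folios may still sit in the per-CPU folio batches
     * and are not yet on an LRU list, so folio_isolate_lru() would fail. */
    lru_add_drain();

    if (!folio_isolate_lru(folio)) {
        result = SCAN_DEL_PAGE_LRU;     /* illustrative failure handling */
        goto out;
    }
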
Chris von Recklinghausen aea06a4977 mm/khugepaged: recover from poisoned file-backed memory
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 12904d953364e3bd21789a45137bf90df7cc78ee
Author: Jiaqi Yan <jiaqiyan@google.com>
Date:   Wed Mar 29 08:11:21 2023 -0700

    mm/khugepaged: recover from poisoned file-backed memory

    Make collapse_file roll back when copying pages failed. More concretely:
    - extract copying operations into a separate loop
    - postpone the updates for nr_none until both scanning and copying
      succeeded
    - postpone joining small xarray entries until both scanning and copying
      succeeded
    - postpone the update operations to NR_XXX_THPS until both scanning and
      copying succeeded
    - for non-SHMEM file, roll back filemap_nr_thps_inc if scan succeeded but
      copying failed

    Tested manually:
    0. Enable khugepaged on system under test. Mount tmpfs at /mnt/ramdisk.
    1. Start a two-thread application. Each thread allocates a chunk of
       non-huge memory buffer from /mnt/ramdisk.
    2. Pick 4 random buffer address (2 in each thread) and inject
       uncorrectable memory errors at physical addresses.
    3. Signal both threads to make their memory buffer collapsible, i.e.
       calling madvise(MADV_HUGEPAGE).
    4. Wait and then check kernel log: khugepaged is able to recover from
       poisoned pages by skipping them.
    5. Signal both threads to inspect their buffer contents and make sure no
       data corruption.

    Link: https://lkml.kernel.org/r/20230329151121.949896-4-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: David Stevens <stevensd@chromium.org>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Tong Tiangen <tongtiangen@huawei.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:53 -04:00
Chris von Recklinghausen 115bcfc4b0 mm/khugepaged: recover from poisoned anonymous memory
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 98c76c9f1ef7599b39bfd4bd99b8a760d4a8cd3b
Author: Jiaqi Yan <jiaqiyan@google.com>
Date:   Wed Mar 29 08:11:19 2023 -0700

    mm/khugepaged: recover from poisoned anonymous memory

    Problem
    =======
    Memory DIMMs are subject to multi-bit flips, i.e.  memory errors.  As
    memory size and density increase, the chances of and number of memory
    errors increase.  The increasing size and density of server RAM in the
    data center and cloud have shown increased uncorrectable memory errors.
    There are already mechanisms in the kernel to recover from uncorrectable
    memory errors.  This series of patches provides the recovery mechanism for
    the particular kernel agent khugepaged when it collapses memory pages.

    Impact
    ======
    The main reason we chose to make khugepaged collapsing tolerant of memory
    failures was its high possibility of accessing poisoned memory while
    performing functionally optional compaction actions.  Standard
    applications typically don't have strict requirements on the size of their
    pages.  So they are given 4K pages by the kernel.  The kernel is able to
    improve application performance by either

      1) giving applications 2M pages to begin with, or
      2) collapsing 4K pages into 2M pages when possible.

    This collapsing operation is done by khugepaged, a kernel agent that is
    constantly scanning memory.  When collapsing 4K pages into a 2M page, it
    must copy the data from the 4K pages into a physically contiguous 2M page.
    Therefore, as long as there exists one poisoned cache line in collapsible
    4K pages, khugepaged will eventually access it.  The current impact to
    users is a machine check exception triggered kernel panic.  However,
    khugepaged’s compaction operations are not functionally required kernel
    actions.  Therefore making khugepaged tolerant to poisoned memory will
    greatly improve user experience.

    This patch series is for cases where khugepaged is the first guy that
    detects the memory errors on the poisoned pages.  IOW, the pages are not
    known to have memory errors when khugepaged collapsing gets to them.  In
    our observation, this happens frequently when the huge page ratio of the
    system is relatively low, which is fairly common in virtual machines
    running on cloud.

    Solution
    ========
    As stated before, it is less desirable to crash the system only because
    khugepaged accesses poisoned pages while it is collapsing 4K pages.  The
    high level idea of this patch series is to skip the group of pages
    (usually 512 4K-size pages) once khugepaged finds one of them is poisoned,
    as these pages have become ineligible to be collapsed.

    We are also careful to unwind operations khugepaged has performed before
    it detects memory failures.  For example, before copying and collapsing a
    group of anonymous pages into a huge page, the source pages will be
    isolated and their page table is unlinked from their PMD.  These
    operations need to be undone in order to ensure these pages are not
    changed/lost from the perspective of other threads (both user and kernel
    space).  As for file backed memory pages, there already exists a rollback
    case.  This patch just extends it so that khugepaged also correctly rolls
    back when it fails to copy poisoned 4K pages.

    This patch (of 3):

    Make __collapse_huge_page_copy return whether copying anonymous pages
    succeeded, and make collapse_huge_page handle the return status.

    Break existing PTE scan loop into two for-loops.  The first loop copies
    source pages into target huge page, and can fail gracefully when running
    into memory errors in source pages.  If copying all pages succeeds, the
    second loop releases and clears up these normal pages.  Otherwise, the
    second loop rolls back the page table and page states by:

    - re-establishing the original PTEs-to-PMD connection.
    - releasing source pages back to their LRU list.

    Tested manually:
    0. Enable khugepaged on system under test.
    1. Start a two-thread application. Each thread allocates a chunk of
       non-huge anonymous memory buffer.
    2. Pick 4 random buffer locations (2 in each thread) and inject
       uncorrectable memory errors at corresponding physical addresses.
    3. Signal both threads to make their memory buffer collapsible, i.e.
       calling madvise(MADV_HUGEPAGE).
    4. Wait and check kernel log: khugepaged is able to recover from poisoned
       pages and skips collapsing them.
    5. Signal both threads to inspect their buffer contents and make sure no
       data corruption.

    Link: https://lkml.kernel.org/r/20230329151121.949896-1-jiaqiyan@google.com
    Link: https://lkml.kernel.org/r/20230329151121.949896-2-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
    Cc: David Stevens <stevensd@chromium.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Tong Tiangen <tongtiangen@huawei.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:52 -04:00
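
The two-loop structure can be condensed into the following illustration (not
the full __collapse_huge_page_copy(); copy_mc_user_highpage() is the
machine-check-safe copy helper that reports failure instead of panicking, and
the loop indexing is simplified):

    result = SCAN_SUCCEED;

    /* Loop 1: copy only; a poisoned source cache line fails the copy
     * gracefully instead of raising a fatal machine check. */
    for (i = 0, addr = start; addr < end; i++, addr += PAGE_SIZE) {
        struct page *src = pte_page(ptep_get(pte + i));

        if (copy_mc_user_highpage(hpage + i, src, addr, vma)) {
            result = SCAN_COPY_MC;
            break;
        }
    }

    if (result == SCAN_SUCCEED) {
        /* Loop 2a: release and clear the now-copied source pages. */
    } else {
        /* Loop 2b: roll back - re-establish the original PTEs under the
         * PMD and return the isolated source pages to their LRU lists. */
    }
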
Chris von Recklinghausen 1beed4f93b mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 2ce0bdfebc74f6cbd4e97a4e767d505a81c38cf2
Author: Ivan Orlov <ivan.orlov0322@gmail.com>
Date:   Wed Mar 29 18:53:30 2023 +0400

    mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file()

    Syzkaller reported the following issue:

    kernel BUG at mm/khugepaged.c:1823!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 5097 Comm: syz-executor220 Not tainted 6.2.0-syzkaller-13154-g857f1268a591 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/16/2023
    RIP: 0010:collapse_file mm/khugepaged.c:1823 [inline]
    RIP: 0010:hpage_collapse_scan_file+0x67c8/0x7580 mm/khugepaged.c:2233
    Code: 00 00 89 de e8 c9 66 a3 ff 31 ff 89 de e8 c0 66 a3 ff 45 84 f6 0f 85 28 0d 00 00 e8 22 64 a3 ff e9 dc f7 ff ff e8 18 64 a3 ff <0f> 0b f3 0f 1e fa e8 0d 64 a3 ff e9 93 f6 ff ff f3 0f 1e fa 4c 89
    RSP: 0018:ffffc90003dff4e0 EFLAGS: 00010093
    RAX: ffffffff81e95988 RBX: 00000000000001c1 RCX: ffff8880205b3a80
    RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 00000000000001c1
    RBP: ffffc90003dff830 R08: ffffffff81e90e67 R09: fffffbfff1a433c3
    R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000000
    R13: ffffc90003dff6c0 R14: 00000000000001c0 R15: 0000000000000000
    FS:  00007fdbae5ee700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fdbae6901e0 CR3: 000000007b2dd000 CR4: 00000000003506e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     madvise_collapse+0x721/0xf50 mm/khugepaged.c:2693
     madvise_vma_behavior mm/madvise.c:1086 [inline]
     madvise_walk_vmas mm/madvise.c:1260 [inline]
     do_madvise+0x9e5/0x4680 mm/madvise.c:1439
     __do_sys_madvise mm/madvise.c:1452 [inline]
     __se_sys_madvise mm/madvise.c:1450 [inline]
     __x64_sys_madvise+0xa5/0xb0 mm/madvise.c:1450
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd

    The xas_store() call during page cache scanning can potentially translate
    'xas' into the error state (with the reproducer provided by the syzkaller
    the error code is -ENOMEM).  However, there are no further checks after
    the 'xas_store', and the next call of 'xas_next' at the start of the
    scanning cycle doesn't increase the xa_index, and the issue occurs.

    This patch adds xarray state error checking after the xas_store() and a
    corresponding result error code.

    Tested via syzbot.

    [akpm@linux-foundation.org: update include/trace/events/huge_memory.h's SCAN_STATUS]
    Link: https://lkml.kernel.org/r/20230329145330.23191-1-ivan.orlov0322@gmail.com
    Link: https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc
    Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
    Reported-by: syzbot+9578faa5475acb35fa50@syzkaller.appspotmail.com
    Tested-by: Zach O'Keefe <zokeefe@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Himadri Pandya <himadrispandya@gmail.com>
    Cc: Ivan Orlov <ivan.orlov0322@gmail.com>
    Cc: Shuah Khan <skhan@linuxfoundation.org>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:49 -04:00
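
The added check is small; the point is that xas_store() can leave the XA_STATE
carrying an error (here -ENOMEM) that must be inspected before the next loop
pass, because a subsequent xas_next() on an errored state does not advance
xa_index. A sketch of the shape of the fix (SCAN_STORE_FAILED is the result
code the patch introduces; the unwind label is illustrative):

    xas_store(&xas, hpage);
    if (xas_error(&xas)) {
        /* e.g. -ENOMEM allocating an internal xarray node */
        result = SCAN_STORE_FAILED;
        goto xa_locked;
    }
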
Chris von Recklinghausen 9a41b45f56 mm/khugepaged: write-lock VMA while collapsing a huge page
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 55fd6fccad3172c0feaaa817f0a1283629ff183e
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:14 2023 -0800

    mm/khugepaged: write-lock VMA while collapsing a huge page

    Protect VMA from concurrent page fault handler while collapsing a huge
    page.  Page fault handler needs a stable PMD to use PTL and relies on
    per-VMA lock to prevent concurrent PMD changes.  pmdp_collapse_flush(),
    set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
    not be detected by a page fault handler without proper locking.

    Before this patch, page tables can be walked under any one of the
    mmap_lock, the mapping lock, and the anon_vma lock; so when khugepaged
    unlinks and frees page tables, it must ensure that all of those either are
    locked or don't exist.  This patch adds a fourth lock under which page
    tables can be traversed, and so khugepaged must also lock out that one.

    [surenb@google.com: vm_lock/i_mmap_rwsem inversion in retract_page_tables]
      Link: https://lkml.kernel.org/r/20230303213250.3555716-1-surenb@google.com
    [surenb@google.com: build fix]
      Link: https://lkml.kernel.org/r/CAJuCfpFjWhtzRE1X=J+_JjgJzNKhq-=JT8yTBSTHthwp0pqWZw@mail.gmail.com
    Link: https://lkml.kernel.org/r/20230227173632.3292573-16-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:40 -04:00
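
With per-VMA locking in the picture, the rule becomes: write-lock the VMA
before any of the PMD-modifying steps, so a page-fault handler holding only
the per-VMA read lock is excluded for the whole collapse. A rough ordering
sketch (assuming the vma_start_write() API from the per-VMA lock work):

    /* collapse path; mmap_lock is already held for write */
    vma_start_write(vma);                   /* fence off per-VMA-lock faults */

    _pmd = pmdp_collapse_flush(vma, address, pmd);  /* PMD changes now safe */
    /* ... copy the pages, install the huge PMD, free the old page table ... */
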
Chris von Recklinghausen 270626da6c mm/khugepaged: cleanup memcg uncharge for failure path
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 7cb1d7ef667716a9ff4e692e7ba1c3817d872222
Author: Peter Xu <peterx@redhat.com>
Date:   Fri Mar 3 10:12:18 2023 -0500

    mm/khugepaged: cleanup memcg uncharge for failure path

    Explicit memcg uncharging is not needed when the memcg accounting has the
    same lifespan of the page/folio.  That becomes the case for khugepaged
    after Yang & Zach's recent rework so the hpage will be allocated for each
    collapse rather than being cached.

    Cleanup the explicit memcg uncharge in khugepaged failure path and leave
    that for put_page().

    Link: https://lkml.kernel.org/r/20230303151218.311015-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Suggested-by: Zach O'Keefe <zokeefe@google.com>
    Reviewed-by: Zach O'Keefe <zokeefe@google.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: David Stevens <stevensd@chromium.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:08 -04:00
Aristeu Rozanski 4c96f5154f mm: change to return bool for isolate_lru_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f7f9c00dfafffd7a5a1a5685e2d874c64913e2ed
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:35 2023 +0800

    mm: change to return bool for isolate_lru_page()

    The isolate_lru_page() can only return 0 or -EBUSY, and most users did not
    care about the negative error of isolate_lru_page(), except one user in
    add_page_for_migration().  So we can convert the isolate_lru_page() to
    return a boolean value, which can help to make the code more clear when
    checking the return value of isolate_lru_page().

    Also convert all users' logic of checking the isolation state.

    No functional changes intended.

    Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski d1230addeb mm: change to return bool for folio_isolate_lru()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit be2d57563822b7e00b2b16d9354637c4b6d6d5cc
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:34 2023 +0800

    mm: change to return bool for folio_isolate_lru()

    Patch series "Change the return value for page isolation functions", v3.

    Currently the page isolation functions do not return a boolean to
    indicate success; instead they return a negative error when they fail
    to isolate a page. So the code below, used in most places, looks like
    a boolean success/failure check, which can confuse people about
    whether the isolation succeeded.

    if (folio_isolate_lru(folio))
            continue;

    Moreover the page isolation functions only return 0 or -EBUSY, and
    most users did not care about the negative error except for few users,
    thus we can convert all page isolation functions to return a boolean
    value, which can remove the confusion to make code more clear.

    No functional changes intended in this patch series.

    This patch (of 4):

    Currently folio_isolate_lru() does not return a boolean value to indicate
    isolation success or not, yet the code below, which checks the return
    value, can make people think that it is a boolean success/failure thing,
    which makes it easy to make mistakes (see the fix patch[1]).

    if (folio_isolate_lru(folio))
            continue;

    Thus it's better to explicitly check the negative error value returned by
    folio_isolate_lru(), which makes the code more clear per Linus's
    suggestion[2].  Moreover Matthew suggested we can convert the isolation
    functions to return a boolean[3], since most users did not care about the
    negative error value, and can also remove the confusing of checking return
    value.

    So this patch converts folio_isolate_lru() to return a boolean value:
    'true' indicates the folio isolation was successful, and 'false' means
    the isolation failed.  Meanwhile, all users' logic for checking the
    isolation state is changed accordingly.

    No functional changes intended.

    [1] https://lore.kernel.org/all/20230131063206.28820-1-Kuan-Ying.Lee@mediatek.com/T/#u
    [2] https://lore.kernel.org/all/CAHk-=wiBrY+O-4=2mrbVyxR+hOqfdJ=Do6xoucfJ9_5az01L4Q@mail.gmail.com/
    [3] https://lore.kernel.org/all/Y+sTFqwMNAjDvxw3@casper.infradead.org/

    Link: https://lkml.kernel.org/r/cover.1676424378.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/8a4e3679ed4196168efadf7ea36c038f2f7d5aa9.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>

    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
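
For callers, the conversion flips the sense of the test: a truthy return now
means the isolation succeeded rather than that it failed with -EBUSY. A
before/after sketch of the caller-side pattern discussed above:

    /* Before the series: returns 0 on success, -EBUSY on failure, so this
     * "continue" correctly skips failures - but it reads like a boolean. */
    if (folio_isolate_lru(folio))
        continue;

    /* After the series: returns true on success, false on failure, so the
     * same intent is spelled out unambiguously. */
    if (!folio_isolate_lru(folio))
        continue;
    list_add_tail(&folio->lru, &pagelist);  /* illustrative use of the folio */
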
Aristeu Rozanski e40329347a mm/khugepaged: fix invalid page access in release_pte_pages()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f528260b1a7d52140dfeb58857e13fc98ac193ef
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Feb 13 13:43:24 2023 -0800

    mm/khugepaged: fix invalid page access in release_pte_pages()

    release_pte_pages() converts from a pfn to a folio by using pfn_folio().
    If the pte is not mapped, pfn_folio() will result in undefined behavior
    which ends up causing a kernel panic[1].

    Only call pfn_folio() once we have validated that the pte is both valid
    and mapped to fix the issue.

    [1] https://lore.kernel.org/linux-mm/ff300770-afe9-908d-23ed-d23e0796e899@samsung.com/

    Link: https://lkml.kernel.org/r/20230213214324.34215-1-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Fixes: 9bdfeea46f49 ("mm/khugepaged: convert release_pte_pages() to use folios")
    Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Debugged-by: Alexandre Ghiti <alex@ghiti.fr>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:20 -04:00
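
The fix is purely an ordering one: translate the PTE to a folio only once the
PTE is known to be present. A minimal sketch of the corrected loop body
(simplified; the real release_pte_pages() also handles the compound-page and
shared-zero-page cases):

    while (--_pte >= pte) {
        pte_t pteval = ptep_get(_pte);

        /* Skip empty and non-present entries *before* pfn_folio():
         * calling pfn_folio() on an unmapped PTE is undefined behaviour. */
        if (pte_none(pteval) || !pte_present(pteval))
            continue;

        folio = pfn_folio(pte_pfn(pteval));
        /* ... unlock the folio, put it back on its LRU list and drop the
         * extra reference taken when it was isolated ... */
    }
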