Commit Graph

258 Commits

Author SHA1 Message Date
Rafael Aquini 4a2ebacbbc mm/ksm: fix ksm_zero_pages accounting
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit c2dc78b86e0821ecf9a9d0c35dba2618279a5bb6
Author: Chengming Zhou <chengming.zhou@linux.dev>
Date:   Tue May 28 13:15:22 2024 +0800

    mm/ksm: fix ksm_zero_pages accounting

    We normally increment ksm_zero_pages in ksmd when a page is merged with
    the zero page, but the decrement is done from the page-table side, where
    there is no access protection for ksm_zero_pages.

    So in rare cases we can read an anomalous value of ksm_zero_pages, such
    as -1, which is very confusing to users.

    Fix it by changing the counter to atomic_long_t, and do the same for
    mm->ksm_zero_pages.

    Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-2-34bb358fdc13@linux.dev
    Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM")
    Fixes: 6080d19f0704 ("ksm: add ksm zero pages for each process")
    Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Cc: Stefan Roesch <shr@devkernel.io>
    Cc: xu xin <xu.xin16@zte.com.cn>
    Cc: Yang Yang <yang.yang29@zte.com.cn>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:31 -04:00
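A minimal sketch of the counter pattern this fix describes (illustrative, not the upstream hunks; the helper names are assumptions, and it assumes the per-mm field is atomic_long_t as well): the increment in ksmd and the decrement on the page-table side share no lock, so a plain long can be observed at a bogus value such as -1, while atomic operations keep both counters coherent.

    #include <linux/atomic.h>
    #include <linux/mm_types.h>

    static atomic_long_t ksm_zero_pages = ATOMIC_LONG_INIT(0);

    /* called from ksmd when a page is merged with the shared zeropage */
    static inline void ksm_account_zero_page(struct mm_struct *mm)
    {
            atomic_long_inc(&ksm_zero_pages);
            atomic_long_inc(&mm->ksm_zero_pages);  /* assumes per-mm field is atomic_long_t */
    }

    /* called from the page-table side when such a mapping is torn down */
    static inline void ksm_unaccount_zero_page(struct mm_struct *mm)
    {
            atomic_long_dec(&ksm_zero_pages);
            atomic_long_dec(&mm->ksm_zero_pages);
    }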
Rafael Aquini f35943a830 mm/ksm: fix ksm_pages_scanned accounting
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 730cdc2c72c6905a2eda2fccbbf67dcef1206590
Author: Chengming Zhou <chengming.zhou@linux.dev>
Date:   Tue May 28 13:15:21 2024 +0800

    mm/ksm: fix ksm_pages_scanned accounting

    Patch series "mm/ksm: fix some accounting problems", v3.

    We encountered some abnormal ksm_pages_scanned and ksm_zero_pages during
    some random tests.

    1. ksm_pages_scanned stays unchanged even though ksmd scanning makes progress.
    2. ksm_zero_pages may be -1 in some rare cases.

    This patch (of 2):

    During testing, I found that ksm_pages_scanned is unchanged although
    scan_get_next_rmap_item() did return a valid, non-NULL rmap_item.

    The reason is that scan_get_next_rmap_item() returns NULL after a full
    scan, so ksm_do_scan() just returns without accounting for
    ksm_pages_scanned.

    Fix it by putting the ksm_pages_scanned accounting inside that loop, which
    also makes the accounting more timely if that loop runs for a long time.

    Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-0-34bb358fdc13@linux.dev
    Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-1-34bb358fdc13@linux.dev
    Fixes: b348b5fe2b5f ("mm/ksm: add pages scanned metric")
    Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: xu xin <xu.xin16@zte.com.cn>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Cc: Stefan Roesch <shr@devkernel.io>
    Cc: Yang Yang <yang.yang29@zte.com.cn>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:29 -04:00
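A hedged sketch of the fix described above, in the style of mm/ksm.c (roughly, not the literal diff): bump the counter inside the scan loop rather than after it, so the early return at the end of a full scan no longer skips the accounting and long-running loops are accounted more timely.

    static void ksm_do_scan(unsigned int scan_npages)
    {
            struct ksm_rmap_item *rmap_item;
            struct page *page;

            while (scan_npages-- && likely(!freezing(current))) {
                    cond_resched();
                    rmap_item = scan_get_next_rmap_item(&page);
                    if (!rmap_item)
                            return;  /* full scan done; previously this skipped accounting */
                    cmp_and_merge_page(page, rmap_item);
                    put_page(page);
                    ksm_pages_scanned++;  /* account per iteration, inside the loop */
            }
    }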
Rafael Aquini d56b8e98a8 mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * mm/memory-failure.c: minor context conflict due to out-of-order backport
      of commit fa422b353d21 ("mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind")

This patch is a backport of the following upstream commit:
commit d256d1cd8da1cbc4615de69df71c87ce623fec2f
Author: Tong Tiangen <tongtiangen@huawei.com>
Date:   Mon Aug 28 10:25:27 2023 +0800

    mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()

    We found a softlockup issue in our test, analyzed the logs, and found that
    the relevant CPU call traces are as follows:

    CPU0:
      _do_fork
        -> copy_process()
          -> write_lock_irq(&tasklist_lock)  //Disable irq,waiting for
                                             //tasklist_lock

    CPU1:
      wp_page_copy()
        ->pte_offset_map_lock()
          -> spin_lock(&page->ptl);        //Hold page->ptl
        -> ptep_clear_flush()
          -> flush_tlb_others() ...
            -> smp_call_function_many()
              -> arch_send_call_function_ipi_mask()
                -> csd_lock_wait()         //Waiting for other CPUs respond
                                           //IPI

    CPU2:
      collect_procs_anon()
        -> read_lock(&tasklist_lock)       //Hold tasklist_lock
          ->for_each_process(tsk)
            -> page_mapped_in_vma()
              -> page_vma_mapped_walk()
                -> map_pte()
                  ->spin_lock(&page->ptl)  //Waiting for page->ptl

    We can see that CPU1 is waiting for CPU0 to respond to the IPI, CPU0 is
    waiting for CPU2 to unlock tasklist_lock, and CPU2 is waiting for CPU1 to
    unlock page->ptl.  As a result, a softlockup is triggered.

    For collect_procs_anon(), what we're doing is task list iteration.  During
    the iteration, with the help of call_rcu(), the task_struct object is freed
    only after one or more grace periods elapse.  The logic is as follows:

    release_task()
      -> __exit_signal()
        -> __unhash_process()
          -> list_del_rcu()

      -> put_task_struct_rcu_user()
        -> call_rcu(&task->rcu, delayed_put_task_struct)

    delayed_put_task_struct()
      -> put_task_struct()
      -> if (refcount_sub_and_test())
            __put_task_struct()
              -> free_task()

    Therefore, under the protection of the RCU lock, we can safely use
    get_task_struct() to take a reference on the task_struct during the
    iteration.

    By removing the use of tasklist_lock in the task list iteration, we can
    break the softlockup chain above.

    The same logic can also be applied to:
     - collect_procs_file()
     - collect_procs_fsdax()
     - collect_procs_ksm()

    Link: https://lkml.kernel.org/r/20230828022527.241693-1-tongtiangen@huawei.com
    Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:12 -04:00
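A hedged sketch of the locking pattern described above (the filter and list helpers marked below are hypothetical, not the upstream functions): iterate the task list under rcu_read_lock() and pin each matching task with get_task_struct(), instead of holding tasklist_lock across the whole walk.

    static void collect_procs_sketch(struct page *page, struct list_head *to_kill)
    {
            struct task_struct *tsk;

            rcu_read_lock();
            for_each_process(tsk) {
                    if (!task_maps_page(tsk, page))        /* hypothetical filter */
                            continue;
                    get_task_struct(tsk);                  /* safe: task_struct is freed via call_rcu() */
                    add_task_to_kill_list(tsk, to_kill);   /* hypothetical helper; reference dropped later */
            }
            rcu_read_unlock();
    }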
Rafael Aquini 88901793db mm/ksm: add pages scanned metric
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit b348b5fe2b5f14ac8bb64fe271d7a027db8cc674
Author: Stefan Roesch <shr@devkernel.io>
Date:   Fri Aug 11 12:36:55 2023 -0700

    mm/ksm: add pages scanned metric

    ksm currently maintains several statistics, which let you determine how
    successful KSM is at sharing pages.  However it does not contain a metric
    to determine how much work it does.

    This commit adds the pages scanned metric.  This allows the administrator
    to determine how many pages have been scanned over a period of time.

    Link: https://lkml.kernel.org/r/20230811193655.2518943-1-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:40 -04:00
Rafael Aquini 0d66ddd7f2 ksm: consider KSM-placed zeropages when calculating KSM profit
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 1a8e84305783bddbae708f28178c6d0aa6321913
Author: xu xin <xu.xin16@zte.com.cn>
Date:   Tue Jun 13 11:09:42 2023 +0800

    ksm: consider KSM-placed zeropages when calculating KSM profit

    When use_zero_pages is enabled, the calculation of KSM profit is not
    correct because KSM zero pages are not counted in.  So update the
    calculation of KSM profit, including the documentation.

    Link: https://lkml.kernel.org/r/20230613030942.186041-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Xiaokai Ran <ran.xiaokai@zte.com.cn>
    Cc: Yang Yang <yang.yang29@zte.com.cn>
    Cc: Jiang Xuexin <jiang.xuexin@zte.com.cn>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:22 -04:00
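For reference, a small illustration of the updated profit estimate this commit describes, with KSM-placed zero pages counted in. Variable names mirror the sysfs counters and the documented approximation as I read it; this is not kernel code.

    /* general_profit =~ (pages_sharing + ksm_zero_pages) * PAGE_SIZE
     *                   - all_rmap_items * sizeof(rmap_item)            */
    static long ksm_general_profit(long pages_sharing, long ksm_zero_pages,
                                   long all_rmap_items, long page_size,
                                   long rmap_item_size)
    {
            return (pages_sharing + ksm_zero_pages) * page_size -
                   all_rmap_items * rmap_item_size;
    }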
Rafael Aquini ec84ab01c5 ksm: add ksm zero pages for each process
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6080d19f07043ade61094d0f58b14c05e1694a39
Author: xu xin <xu.xin16@zte.com.cn>
Date:   Tue Jun 13 11:09:38 2023 +0800

    ksm: add ksm zero pages for each process

    As the number of ksm zero pages is not included in a process's
    ksm_merging_pages when use_zero_pages is enabled, it's unclear how many
    pages are actually merged by KSM.  To let users accurately estimate their
    memory demands when unsharing KSM zero pages, it's necessary to show KSM
    zero pages per process.  In addition, it helps users know the actual KSM
    profit, because KSM-placed zero pages are also a benefit of KSM.

    Since accurate unsharing of zero pages placed by KSM has been achieved,
    tracking the merging and unmerging of empty pages is no longer difficult.

    Since we already have /proc/<pid>/ksm_stat, just add the information of
    'ksm_zero_pages' in it.

    Link: https://lkml.kernel.org/r/20230613030938.185993-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:21 -04:00
Rafael Aquini 993ca53ef9 ksm: count all zero pages placed by KSM
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit e2942062e01df85b4692460fe5b48ab0c90fdb95
Author: xu xin <xu.xin16@zte.com.cn>
Date:   Tue Jun 13 11:09:34 2023 +0800

    ksm: count all zero pages placed by KSM

    As pages_sharing and pages_shared don't include the number of zero pages
    merged by KSM, we cannot know how many of the pages placed by KSM are zero
    pages when use_zero_pages is enabled, which means KSM is not transparent
    about all the pages it actually merges.  In the early days of
    use_zero_pages, zero pages could not be unshared by means such as
    MADV_UNMERGEABLE, so it was hard to count how many times one of those
    zero pages was later unmerged.

    But now, accurate unsharing of KSM-placed zero pages has been achieved, so
    we can easily count both how many times a page full of zeroes was merged
    with the zero page and how many times one of those pages was then
    unmerged.  This helps estimate memory demands when each and every shared
    page could get unshared.

    So we add ksm_zero_pages under /sys/kernel/mm/ksm/ to show the number
    of all zero pages placed by KSM. Meanwhile, we update the Documentation.

    Link: https://lkml.kernel.org/r/20230613030934.185944-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
    Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:20 -04:00
Rafael Aquini eb91b58bf5 ksm: support unsharing KSM-placed zero pages
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 79271476b3362a9e69adae949a520647f8af3559
Author: xu xin <xu.xin16@zte.com.cn>
Date:   Tue Jun 13 11:09:28 2023 +0800

    ksm: support unsharing KSM-placed zero pages

    Patch series "ksm: support tracking KSM-placed zero-pages", v10.

    The core idea of this patch set is to enable users to perceive the number
    of pages merged by KSM, regardless of whether the use_zero_pages switch
    has been turned on, so that users can know how much of the free-memory
    increase is really due to their madvise(MERGEABLE) actions.  But the
    problem is that, when use_zero_pages is enabled, all empty pages are
    merged with the kernel zero page instead of with each other (as they are
    when use_zero_pages is disabled), and these zero pages are then no longer
    tracked by KSM.

    The motivation for doing this can be seen at:
    https://lore.kernel.org/lkml/202302100915227721315@zte.com.cn/

    In short, we hope to implement support for tracking KSM-placed zero pages
    without affecting the use_zero_pages feature, so that application
    developers can also benefit from knowing the actual KSM profit (including
    KSM-placed zero pages) and eventually optimize their applications when
    /sys/kernel/mm/ksm/use_zero_pages is enabled.

    This patch (of 5):

    When use_zero_pages of ksm is enabled, madvise(addr, len,
    MADV_UNMERGEABLE) and other ways (like write 2 to /sys/kernel/mm/ksm/run)
    to trigger unsharing will *not* actually unshare the shared zeropage as
    placed by KSM (which is against the MADV_UNMERGEABLE documentation).  As
    these KSM-placed zero pages are out of the control of KSM, the related
    counts of ksm pages don't expose how many zero pages are placed by KSM
    (these special zero pages are different from those initially mapped zero
    pages, because the zero pages mapped to MADV_UNMERGEABLE areas are
    expected to be a complete and unshared page).

    To avoid blindly unsharing all shared zero pages in applicable VMAs, the
    patch uses pte_mkdirty() (architecture-dependent) to mark KSM-placed zero
    pages.  Thus, MADV_UNMERGEABLE will only unshare those KSM-placed zero
    pages.

    In addition, we'll reuse this mechanism to reliably identify KSM-placed
    zero pages and properly account for them (e.g., when calculating the KSM
    profit that includes zero pages) in the later patches.

    The patch will not degrade the performance of use_zero_pages, as it
    doesn't change the way empty pages are merged under use_zero_pages.

    Link: https://lkml.kernel.org/r/202306131104554703428@zte.com.cn
    Link: https://lkml.kernel.org/r/20230613030928.185882-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
    Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:17:19 -04:00
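A sketch of the marking trick described above, as I read the change (illustrative): KSM installs the shared zeropage with the dirty bit set, and the unmerge path then treats "maps the zeropage and is dirty" as "placed by KSM", leaving other zeropage mappings alone.

    /* true only for zeropage PTEs that KSM itself installed (marked dirty) */
    static inline bool is_ksm_zero_pte(pte_t pte)
    {
            return is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte);
    }

    /* when merging with the zeropage, KSM builds the new PTE roughly like:
     *   newpte = pte_mkdirty(pte_mkspecial(pfn_pte(zero_pfn, vma->vm_page_prot)));
     */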
Rafael Aquini 25e4aa840e mm: remove references to pagevec
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 1fec6890bf2247ecc93f5491c2d3f33c333d5c6e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jun 21 17:45:56 2023 +0100

    mm: remove references to pagevec

    Most of these should just refer to the LRU cache rather than the data
    structure used to implement the LRU cache.

    Link: https://lkml.kernel.org/r/20230621164557.3510324-13-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:32 -04:00
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs  as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
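A short before/after fragment illustrating the conversion described above (illustrative only):

    pte_t pte;

    /* before: a plain C dereference, which the compiler may tear or re-load */
    pte = *ptep;

    /* after: the helper defaults to READ_ONCE(), and an architecture may
     * override it to hide or synthesize bits before core code sees the pte */
    pte = ptep_get(ptep);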
Nico Pache 0b91dbac20 mm: enable page walking API to lock vmas during the walk
commit 49b0638502da097c15d46cd4e871dbaa022caf7c
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Aug 4 08:27:19 2023 -0700

    mm: enable page walking API to lock vmas during the walk

    walk_page_range() and friends often operate under write-locked mmap_lock.
    With introduction of vma locks, the vmas have to be locked as well during
    such walks to prevent concurrent page faults in these areas.  Add an
    additional member to mm_walk_ops to indicate locking requirements for the
    walk.

    The change ensures that page walks which prevent concurrent page faults
    by write-locking mmap_lock, operate correctly after introduction of
    per-vma locks.  With per-vma locks page faults can be handled under vma
    lock without taking mmap_lock at all, so write locking mmap_lock would
    not stop them.  The change ensures vmas are properly locked during such
    walks.

    A sample issue this solves is do_mbind() performing queue_pages_range()
    to queue pages for migration.  Without this change a concurrent page
    can be faulted into the area and be left out of migration.

    Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
    Suggested-by: Jann Horn <jannh@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:27 -06:00
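A hedged sketch of how a walker declares its locking requirement after this change; the walk_lock field and the PGWALK_WRLOCK value follow my reading of the commit and should be treated as assumptions, and the callback name is made up for the example.

    static int example_pte_entry(pte_t *pte, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk)
    {
            /* operate on the PTE; the VMA is already locked per .walk_lock */
            return 0;
    }

    static const struct mm_walk_ops example_walk_ops = {
            .pte_entry = example_pte_entry,
            .walk_lock = PGWALK_WRLOCK,  /* write-lock each VMA during the walk */
    };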
Nico Pache 94afa740b8 mm/swapfile: fix wrong swap entry type for hwpoisoned swapcache page
commit f985fc322063c73916a0d5b6b3fcc6db2ba5792c
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Jul 27 19:56:40 2023 +0800

    mm/swapfile: fix wrong swap entry type for hwpoisoned swapcache page

    Patch series "A few fixup patches for mm", v2.

    This series contains a few fixup patches to fix potential unexpected
    return value, fix wrong swap entry type for hwpoisoned swapcache page and
    so on.  More details can be found in the respective changelogs.

    This patch (of 3):

    Hwpoisoned dirty swap cache page is kept in the swap cache and there's
    simple interception code in do_swap_page() to catch it.  But when trying
    to swapoff, unuse_pte() will wrongly install a general sense of "future
    accesses are invalid" swap entry for hwpoisoned swap cache page due to
    unaware of such type of page.  The user will receive SIGBUS signal without
    expected BUS_MCEERR_AR payload.  BTW, typo 'hwposioned' is fixed.

    Link: https://lkml.kernel.org/r/20230727115643.639741-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20230727115643.639741-2-linmiaohe@huawei.com
    Fixes: 6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:24 -06:00
Chris von Recklinghausen 90e13cba9a mm/ksm: move disabling KSM from s390/gmap code to KSM code
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 2c281f54f556e1f3266c8cb104adf9eea7a7b742
Author: David Hildenbrand <david@redhat.com>
Date:   Sat Apr 22 23:01:56 2023 +0200

    mm/ksm: move disabling KSM from s390/gmap code to KSM code

    Let's factor out actual disabling of KSM.  The existing "mm->def_flags &=
    ~VM_MERGEABLE;" was essentially a NOP and can be dropped, because
    def_flags should never include VM_MERGEABLE.  Note that we don't currently
    prevent re-enabling KSM.

    This should now be faster in case KSM was never enabled, because we only
    conditionally iterate all VMAs.  Further, it certainly looks cleaner.

    Link: https://lkml.kernel.org/r/20230422210156.33630-1-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Janosch Frank <frankja@linux.ibm.com>
    Acked-by: Stefan Roesch <shr@devkernel.io>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:08 -04:00
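A sketch of the factored-out disable helper as described above; this is my reading of the change, and the two unmerge helpers named below should be treated as illustrative. The point is that the VMAs are only iterated when KSM was actually enabled for this mm, so the common case stays cheap.

    int ksm_disable(struct mm_struct *mm)
    {
            mmap_assert_write_locked(mm);

            if (!test_bit(MMF_VM_MERGEABLE, &mm->flags) &&
                !test_bit(MMF_VM_MERGE_ANY, &mm->flags))
                    return 0;                          /* KSM never enabled: nothing to walk */
            if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
                    return ksm_disable_merge_any(mm);  /* clears MMF_VM_MERGE_ANY */
            return ksm_del_vmas(mm);                   /* unmerge and clear VM_MERGEABLE */
    }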
Chris von Recklinghausen 3f9448e50b mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 24139c07f413ef4b555482c758343d71392a19bc
Author: David Hildenbrand <david@redhat.com>
Date:   Sat Apr 22 22:54:18 2023 +0200

    mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0

    Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup
    disabling KSM", v2.

    (1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE
    does, (2) add a selftest for it and (3) factor out disabling of KSM from
    s390/gmap code.

    This patch (of 3):

    Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear
    the VM_MERGEABLE flag from all VMAs -- just like KSM would.  Of course,
    only do that if we previously set PR_SET_MEMORY_MERGE=1.

    Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Stefan Roesch <shr@devkernel.io>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Janosch Frank <frankja@linux.ibm.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:08 -04:00
Chris von Recklinghausen f4f56699af mm: add new KSM process and sysfs knobs
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit d21077fbc2fc987c2e593c34dc3b4d84e546dc9f
Author: Stefan Roesch <shr@devkernel.io>
Date:   Mon Apr 17 22:13:41 2023 -0700

    mm: add new KSM process and sysfs knobs

    This adds the general_profit KSM sysfs knob and the process profit metric
    knobs to ksm_stat.

    1) expose general_profit metric

       The documentation mentions a general profit metric, however this
       metric is not calculated.  In addition the formula depends on the size
       of internal structures, which makes it more difficult for an
       administrator to make the calculation.  Adding the metric for a better
       user experience.

    2) document general_profit sysfs knob

    3) calculate ksm process profit metric

       The ksm documentation mentions the process profit metric and how to
       calculate it.  This adds the calculation of the metric.

    4) mm: expose ksm process profit metric in ksm_stat

       This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
       The documentation mentions the formula for the ksm process profit
       metric, however it does not calculate it.  In addition the formula
       depends on the size of internal structures.  So it makes sense to
       expose it.

    5) document new procfs ksm knobs

    Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:04 -04:00
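A small userspace sketch of how the new knobs could be read; the paths follow the KSM documentation referenced by this series, and error handling is trimmed, so treat it as illustrative.

    #include <stdio.h>

    int main(void)
    {
            char line[256];
            FILE *f;

            /* system-wide profit estimate exposed by this series */
            f = fopen("/sys/kernel/mm/ksm/general_profit", "r");
            if (f && fgets(line, sizeof(line), f))
                    printf("general_profit: %s", line);
            if (f)
                    fclose(f);

            /* per-process counters, including the process profit metric */
            f = fopen("/proc/self/ksm_stat", "r");
            while (f && fgets(line, sizeof(line), f))
                    fputs(line, stdout);
            if (f)
                    fclose(f);
            return 0;
    }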
Chris von Recklinghausen 3c00c5a05f mm: add new api to enable ksm per process
Conflicts: include/uapi/linux/prctl.h - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit d7597f59d1d33e9efbffa7060deb9ee5bd119e62
Author: Stefan Roesch <shr@devkernel.io>
Date:   Mon Apr 17 22:13:40 2023 -0700

    mm: add new api to enable ksm per process

    Patch series "mm: process/cgroup ksm support", v9.

    So far KSM can only be enabled by calling madvise for memory regions.  To
    be able to use KSM for more workloads, KSM needs to have the ability to be
    enabled / disabled at the process / cgroup level.

    Use case 1:
      The madvise call is not available in the programming language.  An
      example for this are programs with forked workloads using a garbage
      collected language without pointers.  In such a language madvise cannot
      be made available.

      In addition the addresses of objects get moved around as they are
      garbage collected.  KSM sharing needs to be enabled "from the outside"
      for these type of workloads.

    Use case 2:
      The same interpreter can also be used for workloads where KSM brings
      no benefit or even has overhead.  We'd like to be able to enable KSM on
      a workload by workload basis.

    Use case 3:
      With the madvise call sharing opportunities are only enabled for the
      current process: it is a workload-local decision.  A considerable number
      of sharing opportunities may exist across multiple workloads or jobs (if
      they are part of the same security domain).  Only a higher-level entity
      like a job scheduler or container can know for certain if it's running
      one or more instances of a job.  That job scheduler however doesn't have
      the necessary internal workload knowledge to make targeted madvise
      calls.

    Security concerns:

      In previous discussions security concerns have been brought up.  The
      problem is that an individual workload does not have the knowledge about
      what else is running on a machine.  Therefore it has to be very
      conservative in what memory areas can be shared or not.  However, if the
      system is dedicated to running multiple jobs within the same security
      domain, it's the job scheduler that has the knowledge that sharing can be
      safely enabled and is even desirable.

    Performance:

      Experiments with using UKSM have shown a capacity increase of around 20%.

      Here are the metrics from an instagram workload (taken from a machine
      with 64GB main memory):

       full_scans: 445
       general_profit: 20158298048
       max_page_sharing: 256
       merge_across_nodes: 1
       pages_shared: 129547
       pages_sharing: 5119146
       pages_to_scan: 4000
       pages_unshared: 1760924
       pages_volatile: 10761341
       run: 1
       sleep_millisecs: 20
       stable_node_chains: 167
       stable_node_chains_prune_millisecs: 2000
       stable_node_dups: 2751
       use_zero_pages: 0
       zero_pages_sharing: 0

    After the service is running for 30 minutes to an hour, 4 to 5 million
    shared pages are common for this workload when using KSM.

    Detailed changes:

    1. New options for prctl system command
       This patch series adds two new options to the prctl system call.
       The first one allows to enable KSM at the process level and the second
       one to query the setting.

    The setting will be inherited by child processes.

    With the above setting, KSM can be enabled for the seed process of a cgroup
    and all processes in the cgroup will inherit the setting.

    2. Changes to KSM processing
       When KSM is enabled at the process level, the KSM code will iterate
       over all the VMA's and enable KSM for the eligible VMA's.

       When forking a process that has KSM enabled, the setting will be
       inherited by the new child process.

    3. Add general_profit metric
       The general_profit metric of KSM is specified in the documentation,
       but not calculated.  This adds the general profit metric to
       /sys/kernel/debug/mm/ksm.

    4. Add more metrics to ksm_stat
       This adds the process profit metric to /proc/<pid>/ksm_stat.

    5. Add more tests to ksm_tests and ksm_functional_tests
       This adds an option to specify the merge type to the ksm_tests.
       This allows to test madvise and prctl KSM.

       It also adds a two new tests to ksm_functional_tests: one to test
       the new prctl options and the other one is a fork test to verify that
       the KSM process setting is inherited by client processes.

    This patch (of 3):

    So far KSM can only be enabled by calling madvise for memory regions.  To
    be able to use KSM for more workloads, KSM needs to have the ability to be
    enabled / disabled at the process / cgroup level.

    1. New options for prctl system command

       This patch series adds two new options to the prctl system call.
       The first one allows to enable KSM at the process level and the second
       one to query the setting.

       The setting will be inherited by child processes.

       With the above setting, KSM can be enabled for the seed process of a
       cgroup and all processes in the cgroup will inherit the setting.

    2. Changes to KSM processing

       When KSM is enabled at the process level, the KSM code will iterate
       over all the VMA's and enable KSM for the eligible VMA's.

       When forking a process that has KSM enabled, the setting will be
       inherited by the new child process.

      1) Introduce new MMF_VM_MERGE_ANY flag

         This introduces the new flag MMF_VM_MERGE_ANY flag.  When this flag
         is set, kernel samepage merging (ksm) gets enabled for all vma's of a
         process.

      2) Setting VM_MERGEABLE on VMA creation

         When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
         VM_MERGEABLE flag will be set for this VMA.

      3) support disabling of ksm for a process

         This adds the ability to disable ksm for a process if ksm has been
         enabled for the process with prctl.

      4) add new prctl option to get and set ksm for a process

         This adds two new options to the prctl system call
         - enable ksm for all vmas of a process (if the vmas support it).
         - query if ksm has been enabled for a process.

    3. Disabling MMF_VM_MERGE_ANY for storage keys in s390

       In the s390 architecture when storage keys are used, the
       MMF_VM_MERGE_ANY will be disabled.

    Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
    Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Bagas Sanjaya <bagasdotme@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:04 -04:00
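A userspace sketch of the new prctl interface described above; PR_SET_MEMORY_MERGE and PR_GET_MEMORY_MERGE are the options this series adds, and the fallback numeric values below are assumptions for headers that predate the change.

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MEMORY_MERGE
    #define PR_SET_MEMORY_MERGE 67
    #define PR_GET_MEMORY_MERGE 68
    #endif

    int main(void)
    {
            /* enable KSM for all current and future eligible VMAs of this process */
            if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
                    perror("PR_SET_MEMORY_MERGE");

            /* query the setting; it is inherited by children across fork() */
            printf("ksm enabled: %d\n", prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
            return 0;
    }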
Chris von Recklinghausen a3d0d37911 mm: ksm: support hwpoison for ksm page
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4248d0083ec5817eebfb916c54950d100b3468ee
Author: Longlong Xia <xialonglong1@huawei.com>
Date:   Fri Apr 14 10:17:41 2023 +0800

    mm: ksm: support hwpoison for ksm page

    hwpoison_user_mappings() is updated to support ksm pages, and
    collect_procs_ksm() is added to collect processes when the error hits a
    ksm page.
    The difference from collect_procs_anon() is that it also needs to traverse
    the rmap-item list on the stable node of the ksm page.  At the same time,
    add_to_kill_ksm() is added to handle ksm pages.  And
    task_in_to_kill_list() is added to avoid duplicate addition of tsk to the
    to_kill list.  This is because when scanning the list, if the pages that
    make up the ksm page all come from the same process, they may be added
    repeatedly.

    Link: https://lkml.kernel.org/r/20230414021741.2597273-3-xialonglong1@huawei.com
    Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
    Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:01 -04:00
Chris von Recklinghausen fff0e5ecba mm: add tracepoints to ksm
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 739100c88f49a67c6487bb2d826b0b5a2ddc80e2
Author: Stefan Roesch <shr@devkernel.io>
Date:   Fri Feb 10 13:46:45 2023 -0800

    mm: add tracepoints to ksm

    This adds the following tracepoints to ksm:
    - start / stop scan
    - ksm enter / exit
    - merge a page
    - merge a page with ksm
    - remove a page
    - remove a rmap item

    This patch has been split off from the RFC patch series "mm:
    process/cgroup ksm support".

    Link: https://lkml.kernel.org/r/20230210214645.2720847-1-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:04 -04:00
Aristeu Rozanski 18093345ab mm/ksm: fix race with VMA iteration and mm_struct teardown
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 6db504ce55bdbc575723938fc480713c9183f6a2
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Wed Mar 8 17:03:10 2023 -0500

    mm/ksm: fix race with VMA iteration and mm_struct teardown

    exit_mmap() will tear down the VMAs and maple tree with the mmap_lock held
    in write mode.  Ensure that the maple tree is still valid by checking
    ksm_test_exit() after taking the mmap_lock in read mode, but before the
    for_each_vma() iterator dereferences a destroyed maple tree.

    Since the maple tree is destroyed, the flags telling lockdep to check an
    external lock has been cleared.  Skip the for_each_vma() iterator to avoid
    dereferencing a maple tree without the external lock flag, which would
    create a lockdep warning.

    Link: https://lkml.kernel.org/r/20230308220310.3119196-1-Liam.Howlett@oracle.com
    Fixes: a5f18ba07276 ("mm/ksm: use vma iterators instead of vma linked list")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reported-by: Pengfei Xu <pengfei.xu@intel.com>
      Link: https://lore.kernel.org/lkml/ZAdUUhSbaa6fHS36@xpf.sh.intel.com/
    Reported-by: syzbot+2ee18845e89ae76342c5@syzkaller.appspotmail.com
      Link: https://syzkaller.appspot.com/bug?id=64a3e95957cd3deab99df7cd7b5a9475af92c93e
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: <heng.su@intel.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:26 -04:00
Aristeu Rozanski 5455c3da6d mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7d4a8be0c4b2b7ffb367929d2b352651f083806b
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jan 10 13:57:22 2023 +1100

    mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export

    mmu_notifier_range_update_to_read_only() was originally introduced in
    commit c6d23413f8 ("mm/mmu_notifier:
    mmu_notifier_range_update_to_read_only() helper") as an optimisation for
    device drivers that know a range has only been mapped read-only.  However
    there are no users of this feature so remove it.  As it is the only user
    of the struct mmu_notifier_range.vma field remove that also.

    Link: https://lkml.kernel.org/r/20230110025722.600912-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
Audra Mitchell 4efa595c94 mm: hwpoison: support recovery from ksm_might_need_to_copy()
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    minor context conflict due to out of order backport related to the v6.1 update
    e6131c89a5 ("mm/swapoff: allow pte_offset_map[_lock]() to fail")

This patch is a backport of the following upstream commit:
commit 6b970599e807ea95c653926d41b095a92fd381e2
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Dec 9 15:28:01 2022 +0800

    mm: hwpoison: support recovery from ksm_might_need_to_copy()

    When the kernel copies a page from ksm_might_need_to_copy(), but runs into
    an uncorrectable error, it will crash since poisoned page is consumed by
    kernel, this is similar to the issue recently fixed by Copy-on-write
    poison recovery.

    When an error is detected during the page copy, return VM_FAULT_HWPOISON
    in do_swap_page(), and install a hwpoison entry in unuse_pte() when
    swapoff, which help us to avoid system crash.  Note, memory failure on a
    KSM page will be skipped, but still call memory_failure_queue() to be
    consistent with general memory failure process, and we could support KSM
    page recovery in the feature.

    [wangkefeng.wang@huawei.com: enhance unuse_pte(), fix issue found by lkp]
      Link: https://lkml.kernel.org/r/20221213120523.141588-1-wangkefeng.wang@huawei.com
    [wangkefeng.wang@huawei.com: update changelog, alter ksm_might_need_to_copy(), restore unlikely() in unuse_pte()]
      Link: https://lkml.kernel.org/r/20230201074433.96641-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20221209072801.193221-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:03 -04:00
Audra Mitchell bb4ac080a2 mm/ksm: convert break_ksm() to use walk_page_range_vma()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit d7c0e68dab98f0f5a2af501eaefeb90cc855fc80
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Oct 21 12:11:40 2022 +0200

    mm/ksm: convert break_ksm() to use walk_page_range_vma()

    FOLL_MIGRATION exists only for the purpose of break_ksm(), and actually,
    there is not even the need to wait for the migration to finish, we only
    want to know if we're dealing with a KSM page.

    Using follow_page() just to identify a KSM page overcomplicates GUP code.
    Let's use walk_page_range_vma() instead, because we don't actually care
    about the page itself, we only need to know a single property -- no need
    to even grab a reference.

    So, get rid of follow_page() usage such that we can get rid of
    FOLL_MIGRATION now and eventually be able to get rid of follow_page() in
    the future.

    In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
    performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
    a performance degradation of ~2% (old: ~5010 MiB/s, new: ~4900 MiB/s).  I
    don't think we particularly care for now.

    Interestingly, the benchmark reduction is due to the single callback.
    Adding a second callback (e.g., pud_entry()) reduces the benchmark by
    another 100-200 MiB/s.

    Link: https://lkml.kernel.org/r/20221021101141.84170-9-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:01 -04:00
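A hedged sketch of the approach described above: walk only the single page of interest in one VMA and report, through the walk's private pointer, whether a KSM page is mapped there, so no page reference and no FOLL_MIGRATION are needed. The callback body and helper names are illustrative.

    static int ksm_probe_pmd_entry(pmd_t *pmd, unsigned long addr,
                                   unsigned long next, struct mm_walk *walk)
    {
            /* map and inspect the PTE at addr; when it references a KSM page,
             * set *(bool *)walk->private = true */
            return 0;
    }

    static const struct mm_walk_ops ksm_probe_ops = {
            .pmd_entry = ksm_probe_pmd_entry,
    };

    static bool addr_maps_ksm_page(struct vm_area_struct *vma, unsigned long addr)
    {
            bool found = false;

            walk_page_range_vma(vma, addr, addr + PAGE_SIZE, &ksm_probe_ops, &found);
            return found;
    }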
Audra Mitchell c9d5756843 memory: move hotplug memory notifier priority to same file for easy sorting
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 1eeaa4fd39b0b1b3e986f8eab6978e69b01e3c5e
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Fri Sep 23 11:33:47 2022 +0800

    memory: move hotplug memory notifier priority to same file for easy sorting

    The priority of hotplug memory callback is defined in a different file.
    And there are some callers using numbers directly.  Collect them together
    into include/linux/memory.h for easy reading.  This allows us to sort
    their priorities more intuitively without additional comments.

    Link: https://lkml.kernel.org/r/20220923033347.3935160-9-liushixin2@huawei.com
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Waiman Long <longman@redhat.com>
    Cc: zefan li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
Chris von Recklinghausen e26f18fd82 ksm: add the ksm prefix to the names of the ksm private structures
Conflicts: mm/ksm.c -
	We already have
	23f746e412b4 ("ksm: convert ksm_mm_slot.mm_list to ksm_mm_slot.mm_node")
	79b099415637 ("ksm: convert ksm_mm_slot.link to ksm_mm_slot.hash")
	58730ab6c7ca ("ksm: convert to use common struct mm_slot")
	so struct ksm_mm_slot is reduced to slot and rmap_list and
	alloc_mm_slot, free_mm_slot, get_mm_slot, and insert_to_mm_slots_hash
	are not added

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 21fbd59136e0773e0b920371860d9b6757cdb250
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Wed Aug 31 11:19:48 2022 +0800

    ksm: add the ksm prefix to the names of the ksm private structures

    In order to prevent the name of the private structure of ksm from being
    the same as the name of the common structure used in subsequent patches,
    prefix their names with ksm in advance.

    Link: https://lkml.kernel.org/r/20220831031951.43152-5-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:59 -04:00
Chris von Recklinghausen 57fcb30cb2 ksm: count allocated ksm rmap_items for each process
Conflicts: mm/ksm.c -
        We already have
	58730ab6c7ca ("ksm: convert to use common struct mm_slot")
	so access mm as mm_slot->slot.mm

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit cb4df4cae4f2bd8cf7a32eff81178fce31600f7c
Author: xu xin <cgel.zte@gmail.com>
Date:   Tue Aug 30 14:38:38 2022 +0000

    ksm: count allocated ksm rmap_items for each process

    Patch series "ksm: count allocated rmap_items and update documentation",
    v5.

    KSM can save memory by merging identical pages, but also can consume
    additional memory, because it needs to generate rmap_items to save each
    scanned page's brief rmap information.

    To help determine how much benefit the ksm-policy (like madvise) they are
    using brings, we add a new interface, /proc/<pid>/ksm_stat, for each
    process.  The value "ksm_rmap_items" in it indicates the total number of
    ksm rmap_items allocated for this process.

    The detailed description can be seen in the following patches' commit
    message.

    This patch (of 2):

    KSM can save memory by merging identical pages, but also can consume
    additional memory, because it needs to generate rmap_items to save each
    scanned page's brief rmap information.  Some of these pages may be merged,
    but some may never be merged even after being checked several times, which
    amounts to unprofitable memory consumption.

    The information about whether KSM save memory or consume memory in
    system-wide range can be determined by the comprehensive calculation of
    pages_sharing, pages_shared, pages_unshared and pages_volatile.  A simple
    approximate calculation:

            profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
                     sizeof(rmap_item);

    where all_rmap_items equals to the sum of pages_sharing, pages_shared,
    pages_unshared and pages_volatile.

    But we cannot calculate this kind of ksm profit for a single process,
    because the number of ksm rmap_items of a process is not available.  If
    this kind of information could be obtained, it would help users know how
    much benefit the ksm-policy (like madvise) they are using brings, and then
    optimize their application code.  For example, if an application madvises
    1000 pages as MERGEABLE while only a few pages are really merged, then
    it's not cost-efficient.

    So we add a new interface /proc/<pid>/ksm_stat for each process, in which
    only the value of ksm_rmap_items is shown for now; more values can be
    added in the future.

    So similarly, we can calculate the ksm profit approximately for a single
    process by:

            profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items *
                     sizeof(rmap_item);

    where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and
    ksm_rmap_items is shown in /proc/<pid>/ksm_stat.

    Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn
    Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Bagas Sanjaya <bagasdotme@gmail.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Izik Eidus <izik.eidus@ravellosystems.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:59 -04:00
Chris von Recklinghausen 748cab4803 mm/ksm: use vma iterators instead of vma linked list
Conflicts: mm/ksm.c -
	We already have
	21fbd59136e0 ("ksm: add the ksm prefix to the names of the ksm private structures")
	so keep type of rmap_item as struct ksm_rmap_item
	We already have
	58730ab6c7ca ("ksm: convert to use common struct mm_slot")
	so access mm as mm_slot->slot.mm

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit a5f18ba0727656bd1fe3bcdb0d563f81790f9a04
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Sep 6 19:49:01 2022 +0000

    mm/ksm: use vma iterators instead of vma linked list

    Remove the use of the linked list for eventual removal.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-54-Liam.Howlett@oracle.com
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:53 -04:00
Nico Pache ec1f9cb5af mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHARE
commit 6cce3314b928b2db7d5f48171e18314226551c3f
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Oct 21 12:11:37 2022 +0200

    mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHARE

    Let's stop breaking COW via a fake write fault and let's use
    FAULT_FLAG_UNSHARE instead.  This avoids any wrong side effects of the
    fake write fault, such as mapping the PTE writable and marking the pte
    dirty/softdirty.

    Consequently, we will no longer trigger a fake write fault and break COW
    without any such side-effects.

    Also, this fixes KSM interaction with userfaultfd-wp: when we have a KSM
    page that's write-protected by userfaultfd, break_ksm()->handle_mm_fault()
    will fail with VM_FAULT_SIGBUS and will simply return in break_ksm() with
    0 instead of actually breaking COW.

    For now, the KSM unmerge tests can trigger that:
        $ sudo ./ksm_functional_tests
        TAP version 13
        1..3
        # [RUN] test_unmerge
        ok 1 Pages were unmerged
        # [RUN] test_unmerge_discarded
        ok 2 Pages were unmerged
        # [RUN] test_unmerge_uffd_wp
        not ok 3 Pages were unmerged
        Bail out! 1 out of 3 tests failed
        # Planned tests != run tests (2 != 3)
        # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0

    The warning in dmesg also indicates this wrong handling:
        [  230.096368] FAULT_FLAG_ALLOW_RETRY missing 881
        [  230.100822] CPU: 1 PID: 1643 Comm: ksm-uffd-wp [...]
        [  230.110124] Hardware name: [...]
        [  230.117775] Call Trace:
        [  230.120227]  <TASK>
        [  230.122334]  dump_stack_lvl+0x44/0x5c
        [  230.126010]  handle_userfault.cold+0x14/0x19
        [  230.130281]  ? tlb_finish_mmu+0x65/0x170
        [  230.134207]  ? uffd_wp_range+0x65/0xa0
        [  230.137959]  ? _raw_spin_unlock+0x15/0x30
        [  230.141972]  ? do_wp_page+0x50/0x590
        [  230.145551]  __handle_mm_fault+0x9f5/0xf50
        [  230.149652]  ? mmput+0x1f/0x40
        [  230.152712]  handle_mm_fault+0xb9/0x2a0
        [  230.156550]  break_ksm+0x141/0x180
        [  230.159964]  unmerge_ksm_pages+0x60/0x90
        [  230.163890]  ksm_madvise+0x3c/0xb0
        [  230.167295]  do_madvise.part.0+0x10c/0xeb0
        [  230.171396]  ? do_syscall_64+0x67/0x80
        [  230.175157]  __x64_sys_madvise+0x5a/0x70
        [  230.179082]  do_syscall_64+0x58/0x80
        [  230.182661]  ? do_syscall_64+0x67/0x80
        [  230.186413]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

    This is primarily a fix for KSM+userfaultfd-wp, however, the fake write
    fault was always questionable.  As this fix is not easy to backport and
    it's not very critical, let's not cc stable.

    Link: https://lkml.kernel.org/r/20221021101141.84170-6-david@redhat.com
    Fixes: 529b930b87 ("userfaultfd: wp: hook userfault handler to write protection fault")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5617
Signed-off-by: Nico Pache <npache@redhat.com>
2024-01-19 10:11:00 -07:00
Nico Pache 499315b820 mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITE
commit 58f595c6659198e1ad0ed431a408ddd79b21e579
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Oct 21 12:11:34 2022 +0200

    mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITE

    Now that GUP no longer requires VM_FAULT_WRITE, break_ksm() is the sole
    remaining user of VM_FAULT_WRITE.  As we also want to stop triggering a
    fake write fault and instead use FAULT_FLAG_UNSHARE -- similar to
    GUP-triggered unsharing when taking a R/O pin on a shared anonymous page
    (including KSM pages), let's stop relying on VM_FAULT_WRITE.

    Let's rework break_ksm() to not rely on the return value of
    handle_mm_fault() anymore to figure out whether COW-breaking was
    successful.  Simply perform another follow_page() lookup to verify the
    result.

    While this makes break_ksm() slightly less efficient, we can simplify
    handle_mm_fault() a little and easily switch to FAULT_FLAG_UNSHARE without
    introducing similar KSM-specific behavior for FAULT_FLAG_UNSHARE.

    In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
    performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
    a performance degradation of ~4% -- 5% (old: ~5250 MiB/s, new: ~5010
    MiB/s).

    I don't think that we particularly care about that performance drop when
    unmerging.  If it ever turns out to be an actual performance issue, we can
    think about a better alternative for FAULT_FLAG_UNSHARE -- let's just keep
    it simple for now.
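
    Roughly, the reworked loop looks like the sketch below (simplified: the
    extra FOLL_/FAULT_FLAG_ bits and the error handling of the real function
    are omitted):

        struct page *page;
        vm_fault_t ret = 0;

        do {
                bool ksm_page = false;

                cond_resched();
                page = follow_page(vma, addr, FOLL_GET);
                if (IS_ERR_OR_NULL(page))
                        break;
                if (PageKsm(page))
                        ksm_page = true;
                put_page(page);

                if (!ksm_page)
                        return 0;       /* verified: no KSM page mapped here */
                ret = handle_mm_fault(vma, addr, FAULT_FLAG_UNSHARE, NULL);
        } while (!(ret & (VM_FAULT_SIGBUS | VM_FAULT_OOM)));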

    Link: https://lkml.kernel.org/r/20221021101141.84170-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5617
Signed-off-by: Nico Pache <npache@redhat.com>
2024-01-19 10:11:00 -07:00
Chris von Recklinghausen 89634cbdb8 mm/various: give up if pte_offset_map[_lock]() fails
Conflicts:
	mm/gup.c - We don't have
		f7355e99d9f7 ("mm/gup: remove FOLL_MIGRATION")
		so don't remove retry label.
	mm/ksm.c - We don't have
		d7c0e68dab98 ("mm/ksm: convert break_ksm() to use walk_page_range_vma()")
		so don't add definition of break_ksm_pmd_entry or break_ksm_ops

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 04dee9e85cf50a2f24738e456d66b88de109b806
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:29:22 2023 -0700

    mm/various: give up if pte_offset_map[_lock]() fails

    Following the examples of nearby code, various functions can just give up
    if pte_offset_map() or pte_offset_map_lock() fails.  And there's no need
    for a preliminary pmd_trans_unstable() or other such check, since such
    cases are now safely handled inside.
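
    The resulting pattern at most of the converted call sites is roughly the
    following (sketch; the concrete bail-out path depends on the caller):

        pte_t *pte;
        spinlock_t *ptl;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return 0;       /* no page table here, or it is changing */
        /* ... operate on the page table entries under ptl ... */
        pte_unmap_unlock(pte, ptl);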

    Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:16 -04:00
Chris von Recklinghausen 176bb35f89 mm: use pmdp_get_lockless() without surplus barrier()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 26e1a0c3277d7f43856ec424902423be212cc178
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:06:53 2023 -0700

    mm: use pmdp_get_lockless() without surplus barrier()

    Patch series "mm: allow pte_offset_map[_lock]() to fail", v2.

    What is it all about?  Some mmap_lock avoidance i.e.  latency reduction.
    Initially just for the case of collapsing shmem or file pages to THPs; but
    likely to be relied upon later in other contexts e.g.  freeing of empty
    page tables (but that's not work I'm doing).  mmap_write_lock avoidance
    when collapsing to anon THPs?  Perhaps, but again that's not work I've
    done: a quick attempt was not as easy as the shmem/file case.

    I would much prefer not to have to make these small but wide-ranging
    changes for such a niche case; but failed to find another way, and have
    heard that shmem MADV_COLLAPSE's usefulness is being limited by that
    mmap_write_lock it currently requires.

    These changes (though of course not these exact patches) have been in
    Google's data centre kernel for three years now: we do rely upon them.

    What is this preparatory series about?

    The current mmap locking will not be enough to guard against that tricky
    transition between pmd entry pointing to page table, and empty pmd entry,
    and pmd entry pointing to huge page: pte_offset_map() will have to
    validate the pmd entry for itself, returning NULL if no page table is
    there.  What to do about that varies: sometimes nearby error handling
    indicates just to skip it; but in many cases an ACTION_AGAIN or "goto
    again" is appropriate (and if that risks an infinite loop, then there must
    have been an oops, or pfn 0 mistaken for page table, before).

    Given the likely extension to freeing empty page tables, I have not
    limited this set of changes to a THP config; and it has been easier, and
    sets a better example, if each site is given appropriate handling: even
    where deeper study might prove that failure could only happen if the pmd
    table were corrupted.

    Several of the patches are, or include, cleanup on the way; and by the
    end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
    pte_offset_map_lock() then handle those original races and more.  Most
    uses of pte_lockptr() are deprecated, with pte_offset_map_nolock() taking
    its place.

    This patch (of 32):

    Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
    reliable result with PAE (or READ_ONCE as before without PAE); and remove
    the unnecessary extra barrier()s which got left behind in its callers.

    HOWEVER: Note the small print in linux/pgtable.h, where it was designed
    specifically for fast GUP, and depends on interrupts being disabled for
    its full guarantee: most callers which have been added (here and before)
    do NOT have interrupts disabled, so there is still some need for caution.
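
    A converted reader therefore ends up looking roughly like this sketch
    (the old code had a READ_ONCE(*pmd) followed by a now-redundant
    barrier()):

        pmd_t pmde = pmdp_get_lockless(pmd);    /* no extra barrier() needed */

        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                return;                         /* no pte table to walk */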

    Link: https://lkml.kernel.org/r/f35279a9-9ac0-de22-d245-591afbfb4dc@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:11 -04:00
Chris von Recklinghausen 8db015807a ksm: use a folio in replace_page()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b4e6f66e45b43aed0903731b6c0700573f88282a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:41 2022 +0100

    ksm: use a folio in replace_page()

    Replace three calls to compound_head() with one.

    Link: https://lkml.kernel.org/r/20220902194653.1739778-46-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:02 -04:00
Chris von Recklinghausen 6a8a4c650d ksm: convert to use common struct mm_slot
Conflicts: mm/ksm.c - We don't have
	a5f18ba07276 ("mm/ksm: use vma iterators instead of vma linked list")
	since it uses the Maple Tree VMA Iterator (an explicit non-goal of
	this series) so don't use a VMA iterator.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 58730ab6c7cab4e8525b7492ac369ccbfff5093a
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Wed Aug 31 11:19:51 2022 +0800

    ksm: convert to use common struct mm_slot

    Convert to use common struct mm_slot, no functional change.

    Link: https://lkml.kernel.org/r/20220831031951.43152-8-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:50 -04:00
Chris von Recklinghausen c8338d3acb ksm: convert ksm_mm_slot.link to ksm_mm_slot.hash
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 79b09941563737fad52a6b5ce9b9f0e1abf01bec
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Wed Aug 31 11:19:50 2022 +0800

    ksm: convert ksm_mm_slot.link to ksm_mm_slot.hash

    In order to use common struct mm_slot, convert ksm_mm_slot.link to
    ksm_mm_slot.hash in advance, no functional change.

    Link: https://lkml.kernel.org/r/20220831031951.43152-7-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:50 -04:00
Chris von Recklinghausen 72c561173b ksm: convert ksm_mm_slot.mm_list to ksm_mm_slot.mm_node
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 23f746e412b405fbd6fb9652c0f7c33818713c43
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Wed Aug 31 11:19:49 2022 +0800

    ksm: convert ksm_mm_slot.mm_list to ksm_mm_slot.mm_node

    In order to use common struct mm_slot, convert ksm_mm_slot.mm_list to
    ksm_mm_slot.mm_node in advance, no functional change.

    Link: https://lkml.kernel.org/r/20220831031951.43152-6-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:49 -04:00
Chris von Recklinghausen 45f57dbc7c ksm: add the ksm prefix to the names of the ksm private structures
Conflicts: mm/ksm.c - We don't have
	a5f18ba07276 ("mm/ksm: use vma iterators instead of vma linked list")
	since it uses the Maple Tree VMA Iterator (an explicit non-goal of
	this series), so don't add declaration for vmi.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 21fbd59136e0773e0b920371860d9b6757cdb250
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Wed Aug 31 11:19:48 2022 +0800

    ksm: add the ksm prefix to the names of the ksm private structures

    In order to prevent the name of the private structure of ksm from being
    the same as the name of the common structure used in subsequent patches,
    prefix their names with ksm in advance.

    Link: https://lkml.kernel.org/r/20220831031951.43152-5-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:49 -04:00
Chris von Recklinghausen b2cb33b2e5 mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 50722804423680488b8063f6cc9a451333bf6f9b
Author: Zach O'Keefe <zokeefe@google.com>
Date:   Wed Jul 6 16:59:26 2022 -0700

    mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage

    When scanning an anon pmd to see if it's eligible for collapse, return
    SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
    SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
    file-collapse path, since the latter might identify pte-mapped compound
    pages.  This is required by MADV_COLLAPSE which necessarily needs to know
    what hugepage-aligned/sized regions are already pmd-mapped.

    In order to determine if a pmd already maps a hugepage, refactor
    mm_find_pmd():

    Return mm_find_pmd() to its pre-commit f72e7dcdd2 ("mm: let mm_find_pmd
    fix buggy race with THP fault") behavior.  ksm was the only caller that
    explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
    there (pmd_present() and pmd_trans_huge() checks).

    Undo revert change in commit f72e7dcdd2 ("mm: let mm_find_pmd fix buggy
    race with THP fault") that open-coded split_huge_pmd_address() pmd lookup
    and use mm_find_pmd() instead.
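
    The open-coded check in mm/ksm.c therefore ends up roughly as follows
    (sketch of the pattern, not the literal hunk):

        pmd_t *pmd, pmde;

        pmd = mm_find_pmd(mm, addr);
        if (!pmd)
                goto out;
        /* ksm wants a pmd mapping a pte table, not a huge page */
        pmde = READ_ONCE(*pmd);
        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                goto out;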

    Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.com
    Signed-off-by: Zach O'Keefe <zokeefe@google.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Alex Shi <alex.shi@linux.alibaba.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Chris Kennelly <ckennelly@google.com>
    Cc: Chris Zankel <chris@zankel.net>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Max Filippov <jcmvbkbc@gmail.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Pavel Begunkov <asml.silence@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:22 -04:00
Nico Pache b6f3b56998 mm: fix the handling Non-LRU pages returned by follow_page
commit f7091ed64ec8311b0c35865875f8c3e04e5ea532
Author: Haiyue Wang <haiyue.wang@intel.com>
Date:   Tue Aug 23 21:58:41 2022 +0800

    mm: fix the handling Non-LRU pages returned by follow_page

    The handling of Non-LRU pages returned by follow_page() jumps over them
    directly without calling put_page() to drop the reference count, even
    though the 'FOLL_GET' flag for follow_page() caused get_page() to be
    called.  Fix the zone device page check by handling the page reference
    count correctly before returning.

    And as David reviewed, "device pages are never PageKsm pages".  Drop this
    zone device page check for break_ksm().

    Since a zone device page can't be a transparent huge page, drop the
    redundant zone device page check from split_huge_pages_pid().  (by
    Miaohe)
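
    The corrected pattern in the remaining follow_page(FOLL_GET) callers is
    roughly (sketch):

        struct page *page;

        page = follow_page(vma, addr, FOLL_GET);
        if (IS_ERR_OR_NULL(page))
                continue;
        if (is_zone_device_page(page)) {
                put_page(page);         /* drop the FOLL_GET reference */
                continue;               /* then skip the non-LRU page */
        }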

    Link: https://lkml.kernel.org/r/20220823135841.934465-3-haiyue.wang@intel.com
    Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
    Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:00 -06:00
Nico Pache 278b8e0245 mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast
commit 088b8aa537c2c767765f1c19b555f21ffe555786
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Sep 1 10:35:59 2022 +0200

    mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast

    commit 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with
    PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
    cleared during temporary unmapping of a page, that the PTE is
    cleared/invalidated and that the TLB is flushed.

    What we want to achieve in all cases is that we cannot end up with a pin on
    an anonymous page that may be shared, because such pins would be
    unreliable and could result in memory corruptions when the mapped page
    and the pin go out of sync due to a write fault.

    That TLB flush handling was inspired by an outdated comment in
    mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
    the past to synchronize with GUP-fast. However, ever since general RCU GUP
    fast was introduced in commit 2667f50e8b ("mm: introduce a general RCU
    get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
    concurrent GUP-fast in all cases -- it only handles traditional IPI-based
    GUP-fast correctly.

    Peter Xu (thankfully) questioned whether that TLB flush is really
    required. On architectures that send an IPI broadcast on TLB flush,
    it works as expected. To synchronize with RCU GUP-fast properly, we're
    conceptually fine, however, we have to enforce a certain memory order and
    are missing memory barriers.

    Let's document that, avoid the TLB flush where possible and use proper
    explicit memory barriers where required. We shouldn't really care about the
    additional memory barriers here, as we're not on extremely hot paths --
    and we're getting rid of some TLB flushes.

    We use a smp_mb() pair for handling concurrent pinning and a
    smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
    PTE changes but permanent PageAnonExclusive changes.

    One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
    convert an exclusive anonymous page to a KSM page, and that page is already
    mapped write-protected (-> no PTE change) would be:

            Thread 0 (KSM)                  Thread 1 (GUP-fast)

                                            (B1) Read the PTE
                                            # (B2) skipped without FOLL_WRITE
            (A1) Clear PTE
            smp_mb()
            (A2) Check pinned
                                            (B3) Pin the mapped page
                                            smp_mb()
            (A3) Clear PageAnonExclusive
            smp_wmb()
            (A4) Restore PTE
                                            (B4) Check if the PTE changed
                                            smp_rmb()
                                            (B5) Check PageAnonExclusive

    Thread 1 will properly detect that PageAnonExclusive was cleared and
    back off.

    Note that we don't need a memory barrier between checking if the page is
    pinned and clearing PageAnonExclusive, because stores are not
    speculated.

    The possible issues due to reordering are of theoretical nature so far
    and attempts to reproduce the race failed.

    Especially the "no PTE change" case isn't the common case, because we'd
    need an exclusive anonymous page that's mapped R/O and the PTE is clean
    in KSM code -- and using KSM with page pinning isn't extremely common.
    Further, the clear+TLB flush we used for now implies a memory barrier.
    So the problematic missing part should be the missing memory barrier
    after pinning but before checking if the PTE changed.

    Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
    Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Andrea Parri <parri.andrea@gmail.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: "Paul E. McKenney" <paulmck@kernel.org>
    Cc: Christoph von Recklinghausen <crecklin@redhat.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:10:59 -06:00
Chris von Recklinghausen ad6d7b5ea6 mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6a56ccbcf6c69538b152644107a1d7383c876ca7
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Nov 8 18:46:50 2022 +0100

    mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite

    commit b191f9b106 ("mm: numa: preserve PTE write permissions across a
    NUMA hinting fault") added remembering write permissions using ordinary
    pte_write() for PROT_NONE mapped pages to avoid write faults when
    remapping the page !PROT_NONE on NUMA hinting faults.

    That commit noted:

        The patch looks hacky but the alternatives looked worse. The tidest was
        to rewalk the page tables after a hinting fault but it was more complex
        than this approach and the performance was worse. It's not generally
        safe to just mark the page writable during the fault if it's a write
        fault as it may have been read-only for COW so that approach was
        discarded.

    Later, commit 288bc54949 ("mm/autonuma: let architecture override how
    the write bit should be stashed in a protnone pte.") introduced a family
    of savedwrite PTE functions that didn't necessarily improve the whole
    situation.

    One confusing thing is that nowadays, if a page is pte_protnone()
    and pte_savedwrite() then also pte_write() is true. Another source of
    confusion is that there is only a single pte_mk_savedwrite() call in the
    kernel. All other write-protection code seems to silently rely on
    pte_wrprotect().

    Ever since PageAnonExclusive was introduced and we started using it in
    mprotect context via commit 64fe24a3e05e ("mm/mprotect: try avoiding write
    faults for exclusive anonymous pages when changing protection"), we do
    have machinery in place to avoid write faults when changing protection,
    which is exactly what we want to do here.

    Let's similarly do what ordinary mprotect() does nowadays when upgrading
    write permissions and reuse can_change_pte_writable() and
    can_change_pmd_writable() to detect if we can upgrade PTE permissions to be
    writable.

    For anonymous pages there should be absolutely no change: if an
    anonymous page is not exclusive, it could not have been mapped writable --
    because only exclusive anonymous pages can be mapped writable.

    However, there *might* be a change for writable shared mappings that
    require writenotify: if they are not dirty, we cannot map them writable.
    While it might not matter in practice, we'd need a different way to
    identify whether writenotify is actually required -- and ordinary mprotect
    would benefit from that as well.

    Note that we don't optimize for the actual migration case:
    (1) When migration succeeds the new PTE will not be writable because the
        source PTE was not writable (protnone); in the future we
        might just optimize that case similarly by reusing
        can_change_pte_writable()/can_change_pmd_writable() when removing
        migration PTEs.
    (2) When migration fails, we'd have to recalculate the "writable" flag
        because we temporarily dropped the PT lock; for now keep it simple and
        set "writable=false".

    We'll remove all savedwrite leftovers next.

    Link: https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:34 -04:00
Chris von Recklinghausen bac96ed149 mm/folio-compat: Remove migration compatibility functions
Bugzilla: https://bugzilla.redhat.com/2160210

commit 9800562f2ab41656b0bdc2a41c77ab3f6dfdd6fc
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 13:29:10 2022 -0400

    mm/folio-compat: Remove migration compatibility functions

    migrate_page_move_mapping(), migrate_page_copy() and migrate_page_states()
    are all now unused after converting all the filesystems from
    aops->migratepage() to aops->migrate_folio().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:30 -04:00
Chris von Recklinghausen 267a7a9b62 docs: rename Documentation/vm to Documentation/mm
Conflicts: drop changes to arch/loongarch/Kconfig - unsupported config

Bugzilla: https://bugzilla.redhat.com/2160210

commit ee65728e103bb7dd99d8604bf6c7aa89c7d7e446
Author: Mike Rapoport <rppt@kernel.org>
Date:   Mon Jun 27 09:00:26 2022 +0300

    docs: rename Documentation/vm to Documentation/mm

    so it will be consistent with the code's mm directory and with
    Documentation/admin-guide/mm, and won't be confused with virtual
    machines.

    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Tested-by: Ira Weiny <ira.weiny@intel.com>
    Acked-by: Jonathan Corbet <corbet@lwn.net>
    Acked-by: Wu XiangCheng <bobwxc@email.cn>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Chris von Recklinghausen 54f0f4b125 ksm: fix typo in comment
Bugzilla: https://bugzilla.redhat.com/2160210

commit 3413b2c872c3fe5cf3b11d4e73de55098c81c7a3
Author: Julia Lawall <Julia.Lawall@inria.fr>
Date:   Sat May 21 13:11:44 2022 +0200

    ksm: fix typo in comment

    Spelling mistake (triple letters) in comment.  Detected with the help of
    Coccinelle.

    Link: https://lkml.kernel.org/r/20220521111145.81697-94-Julia.Lawall@inria.fr
    Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen 7b1db0833d mm: don't be stuck to rmap lock on reclaim path
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6d4675e601357834dadd2ba1d803f6484596015c
Author: Minchan Kim <minchan@kernel.org>
Date:   Thu May 19 14:08:54 2022 -0700

    mm: don't be stuck to rmap lock on reclaim path

    The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) could be
    contended under memory pressure if processes keep working on their vmas
    (e.g., fork, mmap, munmap).  That makes the reclaim path get stuck.  In
    our real workload traces, we see kswapd waiting on the lock for 300ms+
    (worst case, a second), which pushes other processes into direct reclaim,
    where they were also stuck on the lock.

    This patch makes the lru aging path use try_lock mode, like
    shrink_page_list, so the reclaim context will keep working on the next
    lru pages without being stuck.  If it finds the rmap lock contended, it
    rotates the page back to the head of the lru in both the active and
    inactive lrus for consistent behavior, which is a basic starting point
    rather than adding more heuristics.

    Since this patch introduces a new "contended" field as an out-param
    along with the try_lock in-param in rmap_walk_control, the struct is no
    longer immutable when try_lock is set, so remove the const keywords on
    rmap-related functions.  Since rmap walking is already an expensive
    operation, I doubt the const brought a sizable benefit (and we didn't
    have it until 5.17).

    In a heavy app workload in Android, trace shows following statistics.  It
    almost removes rmap lock contention from reclaim path.

    Martin Liu reported:

    Before:

       max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
             1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
              601            0             601   145.544681        28817    198  rmap_walk_file

    After:

       max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
              NaN          NaN              NaN          NaN          NaN    0.0             NaN
                0            0                0     0.127645            1     12  rmap_walk_file

    [minchan@kernel.org: add comment, per Matthew]
      Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
    Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: John Dias <joaodias@google.com>
    Cc: Tim Murray <timmurray@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Martin Liu <liumartin@google.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen 27e79b34cb ksm: count ksm merging pages for each process
Bugzilla: https://bugzilla.redhat.com/2160210

commit 7609385337a4feb6236e42dcd0df2185683ce839
Author: xu xin <xu.xin16@zte.com.cn>
Date:   Thu Apr 28 23:16:16 2022 -0700

    ksm: count ksm merging pages for each process

    Some applications or containers want to use KSM by calling madvise() to
    advise areas of address space to be MERGEABLE.  But they may not know
    which applications are more likely to cause real merges in the
    deployment.  With this patch applied, they can see each process's number
    of merged pages and then optimize their app code.

    As current KSM only counts the number of KSM merging pages (e.g.
    ksm_pages_sharing and ksm_pages_shared) for the whole system, we cannot
    see more fine-grained KSM merging; for upper-level application
    optimization, the merging area cannot easily be chosen according to the
    KSM page merging probability of each process.  Therefore, it is necessary
    to add extra statistics so that upper-level users can know the detailed
    KSM merging information of each process.

    We add a new proc file named as ksm_merging_pages under /proc/<pid>/ to
    indicate the involved ksm merging pages of this process.
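
    A minimal usage sketch (it assumes KSM is enabled, e.g. via
    /sys/kernel/mm/ksm/run, that ksmd gets a chance to scan the region, and
    error handling is omitted):

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 1000 * sysconf(_SC_PAGESIZE);
                char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                char line[64];
                FILE *f;

                memset(buf, 0x5a, len);             /* identical page contents */
                madvise(buf, len, MADV_MERGEABLE);  /* hand the range to ksmd */
                sleep(10);                          /* give ksmd time to merge */

                f = fopen("/proc/self/ksm_merging_pages", "r");
                if (f && fgets(line, sizeof(line), f))
                        printf("ksm_merging_pages: %s", line);
                if (f)
                        fclose(f);
                return 0;
        }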

    [akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s]
    Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cn
    Signed-off-by: xu xin <xu.xin16@zte.com.cn>
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Reported-by: Zeal Robot <zealci@zte.com.cn>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Ohhoon Kwon <ohoono.kwon@samsung.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Yang Yang <yang.yang29@zte.com.cn>
    Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Cc: Zeal Robot <zealci@zte.com.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:55 -04:00
Chris von Recklinghausen 30e9a2455a mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6c287605fd56466e645693eff3ae7c08fba56e0a
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:44 2022 -0700

    mm: remember exclusively mapped anonymous pages with PG_anon_exclusive

    Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
    exclusive, and use that information to make GUP pins reliable and stay
    consistent with the page mapped into the page table even if the page table
    entry gets write-protected.

    With that information at hand, we can extend our COW logic to always reuse
    anonymous pages that are exclusive.  For anonymous pages that might be
    shared, the existing logic applies.

    As already documented, PG_anon_exclusive is usually only expressive in
    combination with a page table entry.  Especially PTE vs.  PMD-mapped
    anonymous pages require more thought, some examples: due to mremap() we
    can easily have a single compound page PTE-mapped into multiple page
    tables exclusively in a single process -- multiple page table locks apply.
    Further, due to MADV_WIPEONFORK we might not necessarily write-protect
    all PTEs, and only some subpages might be pinned.  Long story short: once
    PTE-mapped, we have to track information about exclusivity per sub-page,
    but until then, we can just track it for the compound page in the head
    page without having to update a whole bunch of subpages all of the time
    for a simple PMD mapping of a THP.

    For simplicity, this commit mostly talks about "anonymous pages", while
    it's for THP actually "the part of an anonymous folio referenced via a
    page table entry".

    To not spill PG_anon_exclusive code all over the mm code-base, we let the
    anon rmap code handle all PG_anon_exclusive logic it can easily handle.

    If a writable, present page table entry points at an anonymous (sub)page,
    that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliably
    pin (FOLL_PIN) on an anonymous page references via a present page table
    entry, it must only pin if PG_anon_exclusive is set for the mapped
    (sub)page.

    This commit doesn't adjust GUP, so this is only implicitly handled for
    FOLL_WRITE, follow-up commits will teach GUP to also respect it for
    FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
    reliable.

    Whenever an anonymous page is to be shared (fork(), KSM), or when
    temporarily unmapping an anonymous page (swap, migration), the relevant
    PG_anon_exclusive bit has to be cleared to mark the anonymous page
    possibly shared.  Clearing will fail if there are GUP pins on the page:

    * For fork(), this means having to copy the page and not being able to
      share it.  fork() protects against concurrent GUP using the PT lock and
      the src_mm->write_protect_seq.

    * For KSM, this means sharing will fail.  For swap this means, unmapping
      will fail, For migration this means, migration will fail early.  All
      three cases protect against concurrent GUP using the PT lock and a
      proper clear/invalidate+flush of the relevant page table entry.

    This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
    pinned page gets mapped R/O and the successive write fault ends up
    replacing the page instead of reusing it.  It improves the situation for
    O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
    fork() is *not* involved, however swapout and fork() are still
    problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
    users will fix the issue for them.

    I. Details about basic handling

    I.1. Fresh anonymous pages

    page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
    given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
    the mechanism fresh anonymous pages come into life (besides migration code
    where we copy the page->mapping), all fresh anonymous pages will start out
    as exclusive.

    I.2. COW reuse handling of anonymous pages

    When a COW handler stumbles over a (sub)page that's marked exclusive, it
    simply reuses it.  Otherwise, the handler tries harder under page lock to
    detect if the (sub)page is exclusive and can be reused.  If exclusive,
    page_move_anon_rmap() will mark the given (sub)page exclusive.

    Note that hugetlb code does not yet check for PageAnonExclusive(), as it
    still uses the old COW logic that is prone to the COW security issue
    because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
    pages are a scarce resource.

    I.3. Migration handling

    try_to_migrate() has to try marking an exclusive anonymous page shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  migrate_vma_collect_pmd() and
    __split_huge_pmd_locked() are handled similarly.

    Writable migration entries implicitly point at shared anonymous pages.
    For readable migration entries that information is stored via a new
    "readable-exclusive" migration entry, specific to anonymous pages.

    When restoring a migration entry in remove_migration_pte(), information
    about exclusivity is detected via the migration entry type, and
    RMAP_EXCLUSIVE is set accordingly for
    page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.

    I.4. Swapout handling

    try_to_unmap() has to try marking the mapped page possibly shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  For now, information about exclusivity is lost.  In
    the future, we might want to remember that information in the swap entry
    in some cases, however, it requires more thought, care, and a way to store
    that information in swap entries.

    I.5. Swapin handling

    do_swap_page() will never stumble over exclusive anonymous pages in the
    swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
    to detect manually if an anonymous page is exclusive and has to set
    RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.

    I.6. THP handling

    __split_huge_pmd_locked() has to move the information about exclusivity
    from the PMD to the PTEs.

    a) In case we have a readable-exclusive PMD migration entry, simply
       insert readable-exclusive PTE migration entries.

    b) In case we have a present PMD entry and we don't want to freeze
       ("convert to migration entries"), simply forward PG_anon_exclusive to
       all sub-pages, no need to temporarily clear the bit.

    c) In case we have a present PMD entry and want to freeze, handle it
       similar to try_to_migrate(): try marking the page shared first.  In
       case we fail, we ignore the "freeze" instruction and simply split
       ordinarily.  try_to_migrate() will properly fail because the THP is
       still mapped via PTEs.

    When splitting a compound anonymous folio (THP), the information about
    exclusivity is implicitly handled via the migration entries: no need to
    replicate PG_anon_exclusive manually.

    I.7. fork() handling

    fork() handling is relatively easy, because PG_anon_exclusive is only
    expressive for some page table entry types.

    a) Present anonymous pages

    page_try_dup_anon_rmap() will mark the given subpage shared -- which will
    fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
    PMD to handle it on the PTE level).

    Note that device exclusive entries are just a pointer at a PageAnon()
    page.  fork() will first convert a device exclusive entry to a present
    page table and handle it just like present anonymous pages.

    b) Device private entry

    Device private entries point at PageAnon() pages that cannot be mapped
    directly and, therefore, cannot get pinned.

    page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
    fail because they cannot get pinned.

    c) HW poison entries

    PG_anon_exclusive will remain untouched and is stale -- the page table
    entry is just a placeholder after all.

    d) Migration entries

    Writable and readable-exclusive entries are converted to readable entries:
    possibly shared.

    I.8. mprotect() handling

    mprotect() only has to properly handle the new readable-exclusive
    migration entry:

    When write-protecting a migration entry that points at an anonymous page,
    remember the information about exclusivity via the "readable-exclusive"
    migration entry type.

    II. Migration and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a migration entry, we have to mark the page possibly
    shared and synchronize against GUP-fast by a proper clear/invalidate+flush
    to make the following scenario impossible:

    1. try_to_migrate() places a migration entry after checking for GUP pins
       and marks the page possibly shared.

    2. GUP-fast pins the page due to lack of synchronization

    3. fork() converts the "writable/readable-exclusive" migration entry into a
       readable migration entry

    4. Migration fails due to the GUP pin (failing to freeze the refcount)

    5. Migration entries are restored. PG_anon_exclusive is lost

    -> We have a pinned page that is not marked exclusive anymore.

    Note that we move information about exclusivity from the page to the
    migration entry as it otherwise highly overcomplicates fork() and
    PTE-mapping a THP.

    III. Swapout and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a swap entry, we have to mark the page possibly shared
    and synchronize against GUP-fast by a proper clear/invalidate+flush to
    make the following scenario impossible:

    1. try_to_unmap() places a swap entry after checking for GUP pins and
       clears exclusivity information on the page.

    2. GUP-fast pins the page due to lack of synchronization.

    -> We have a pinned page that is not marked exclusive anymore.

    If we'd ever store information about exclusivity in the swap entry,
    similar to migration handling, the same considerations as in II would
    apply.  This is future work.

    Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen ab8c3870a8 mm/rmap: remove do_page_add_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit f1e2db12e45baaa2d366f87c885968096c2ff5aa
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: remove do_page_add_anon_rmap()

    ... and instead convert page_add_anon_rmap() to accept flags.

    Passing flags instead of bools is usually nicer either way, and we want to
    more often also pass RMAP_EXCLUSIVE in follow up patches when detecting
    that an anonymous page is exclusive: for example, when restoring an
    anonymous page from a writable migration entry.

    This is a preparation for marking an anonymous page inside
    page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.

    Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen 6697b528b0 mm: handling Non-LRU pages returned by vm_normal_pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3218f8712d6bba1812efd5e0d66c1e15134f2a91
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:11 2022 -0500

    mm: handling Non-LRU pages returned by vm_normal_pages

    With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
    device-managed anonymous pages that are not LRU pages.  Although they
    behave like normal pages for the purposes of CPU page mapping and COW,
    they do not support LRU lists, NUMA migration or THP.

    Callers to follow_page() currently don't expect ZONE_DEVICE pages,
    however, with DEVICE_COHERENT we might now return ZONE_DEVICE.  Check for
    ZONE_DEVICE pages in applicable users of follow_page() as well.

    Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>       [v2]
    Reviewed-by: Alistair Popple <apopple@nvidia.com>       [v6]
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
Chris von Recklinghausen acf06a0e07 mm/ksm: use helper macro __ATTR_RW
Bugzilla: https://bugzilla.redhat.com/2120352

commit 1bad2e5ca00b4c35cd2d62e380ba3aa7ec05b778
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:46:35 2022 -0700

    mm/ksm: use helper macro __ATTR_RW

    Use helper macro __ATTR_RW to define KSM_ATTR to make code more clear.
    Minor readability improvement.
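
    The change amounts to roughly the following (__ATTR_RW() expands to the
    same 0644/show/store triple):

        /* before */
        #define KSM_ATTR(_name) \
                static struct kobj_attribute _name##_attr = \
                        __ATTR(_name, 0644, _name##_show, _name##_store)

        /* after */
        #define KSM_ATTR(_name) \
                static struct kobj_attribute _name##_attr = __ATTR_RW(_name)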

    Link: https://lkml.kernel.org/r/20220221115809.26381-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
Chris von Recklinghausen f3eba86ced mm/vmstat: add event for ksm swapping in copy
Bugzilla: https://bugzilla.redhat.com/2120352

commit 4d45c3aff5ebf80d329eba0f90544d20224f612d
Author: Yang Yang <yang.yang29@zte.com.cn>
Date:   Tue Mar 22 14:46:33 2022 -0700

    mm/vmstat: add event for ksm swapping in copy

    When a fault brings in from swap what used to be a KSM page, and that
    page had already been swapped in before, the system has to make a copy
    and leave remerging the pages to a later pass of ksmd.

    That is not good for performance, so we would like to reduce this kind
    of copy.  There are some ways to reduce it, for example lowering
    swappiness or calling madvise(MADV_MERGEABLE) on the range.  So add
    this event to support that tuning, just like the patch "mm, THP, swap:
    add THP swapping out fallback counting".

    Link: https://lkml.kernel.org/r/20220113023839.758845-1-yang.yang29@zte.com.cn
    Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
    Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Saravanan D <saravanand@fb.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
Chris von Recklinghausen 70ca32d5ce mm: ksm: fix use-after-free kasan report in ksm_might_need_to_copy
Bugzilla: https://bugzilla.redhat.com/2120352

commit e1c63e110f977205ab9dfb38989c54e6e7b52a7b
Author: Nanyong Sun <sunnanyong@huawei.com>
Date:   Fri Jan 14 14:08:59 2022 -0800

    mm: ksm: fix use-after-free kasan report in ksm_might_need_to_copy

    When under the stress of swapping in/out with KSM enabled, there is a
    low probability that kasan reports a use-after-free BUG in
    ksm_might_need_to_copy() during swap-in.  The freed object is the
    anon_vma obtained from page_anon_vma(page).

    It happens because a swapcache page associated with one anon_vma is now
    needed for another anon_vma, but the page's original vma was unmapped
    and its anon_vma was freed.  In this case the if condition below always
    returns false and a new page is allocated for the copy.  The swap-in
    process then uses the new page and continues to run correctly, so this
    is actually harmless.

          } else if (anon_vma->root == vma->anon_vma->root &&
                     page->index == linear_page_index(vma, address)) {

    This patch swaps the order of the two conditions above to avoid the
    kasan warning: evaluate "page->index == linear_page_index(vma,
    address)" first, which usually returns false and so skips the read of
    anon_vma->root that may trigger the kasan use-after-free warning shown
    below (a sketch of the reordered check follows the report):

        ==================================================================
        BUG: KASAN: use-after-free in ksm_might_need_to_copy+0x12e/0x5b0
        Read of size 8 at addr ffff88be9977dbd0 by task khugepaged/694

         CPU: 8 PID: 694 Comm: khugepaged Kdump: loaded Tainted: G OE - 4.18.0.x86_64
         Hardware name: 1288H V5/BC11SPSC0, BIOS 7.93 01/14/2021
        Call Trace:
         dump_stack+0xf1/0x19b
         print_address_description+0x70/0x360
         kasan_report+0x1b2/0x330
         ksm_might_need_to_copy+0x12e/0x5b0
         do_swap_page+0x452/0xe70
         __collapse_huge_page_swapin+0x24b/0x720
         khugepaged_scan_pmd+0xcae/0x1ff0
         khugepaged+0x8ee/0xd70
         kthread+0x1a2/0x1d0
         ret_from_fork+0x1f/0x40

        Allocated by task 2306153:
         kasan_kmalloc+0xa0/0xd0
         kmem_cache_alloc+0xc0/0x1c0
         anon_vma_clone+0xf7/0x380
         anon_vma_fork+0xc0/0x390
         copy_process+0x447b/0x4810
         _do_fork+0x118/0x620
         do_syscall_64+0x112/0x360
         entry_SYSCALL_64_after_hwframe+0x65/0xca

        Freed by task 2306242:
         __kasan_slab_free+0x130/0x180
         kmem_cache_free+0x78/0x1d0
         unlink_anon_vmas+0x19c/0x4a0
         free_pgtables+0x137/0x1b0
         exit_mmap+0x133/0x320
         mmput+0x15e/0x390
         do_exit+0x8c5/0x1210
         do_group_exit+0xb5/0x1b0
         __x64_sys_exit_group+0x21/0x30
         do_syscall_64+0x112/0x360
         entry_SYSCALL_64_after_hwframe+0x65/0xca

        The buggy address belongs to the object at ffff88be9977dba0
         which belongs to the cache anon_vma_chain of size 64
        The buggy address is located 48 bytes inside of
         64-byte region [ffff88be9977dba0, ffff88be9977dbe0)
        The buggy address belongs to the page:
        page:ffffea00fa65df40 count:1 mapcount:0 mapping:ffff888107717800 index:0x0
        flags: 0x17ffffc0000100(slab)
        ==================================================================
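
    A sketch of the reordered check described above; only the operand order
    of the && changes, so the usually-false page->index comparison
    short-circuits the read of anon_vma->root:

          } else if (page->index == linear_page_index(vma, address) &&
                     anon_vma->root == vma->anon_vma->root) {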

    Link: https://lkml.kernel.org/r/20211202102940.1069634-1-sunnanyong@huawei.com
    Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:40 -04:00