Commit Graph

1122710 Commits

Audra Mitchell c9d5756843 memory: move hotplug memory notifier priority to same file for easy sorting
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 1eeaa4fd39b0b1b3e986f8eab6978e69b01e3c5e
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Fri Sep 23 11:33:47 2022 +0800

    memory: move hotplug memory notifier priority to same file for easy sorting

    The priorities of the hotplug memory callbacks are defined in different
    files, and some callers use raw numbers directly.  Collect them together
    in include/linux/memory.h for easy reading.  This allows us to sort
    their priorities more intuitively without additional comments.

    Link: https://lkml.kernel.org/r/20220923033347.3935160-9-liushixin2@huawei.com
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Waiman Long <longman@redhat.com>
    Cc: zefan li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
Audra Mitchell d896df619e memory: remove unused register_hotmemory_notifier()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit eafd296e0cc0cc03b4ae01c2b3b07273514d757c
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Fri Sep 23 11:33:46 2022 +0800

    memory: remove unused register_hotmemory_notifier()

    Remove unused register_hotmemory_notifier().

    Link: https://lkml.kernel.org/r/20220923033347.3935160-8-liushixin2@huawei.com
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Waiman Long <longman@redhat.com>
    Cc: zefan li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
Audra Mitchell c5a1784372 mm/mm_init.c: use hotplug_memory_notifier() directly
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit d46722ef1c090541d56f706f3a90f3f2e84cdf0c
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Fri Sep 23 11:33:44 2022 +0800

    mm/mm_init.c: use hotplug_memory_notifier() directly

    Commit 76ae847497bc52 ("Documentation: raise minimum supported version of
    GCC to 5.1") updated the minimum gcc version to 5.1.  So the problem
    mentioned in f02c696800 ("include/linux/memory.h: implement
    register_hotmemory_notifier()") no longer exists.  So we can now switch to
    using hotplug_memory_notifier() directly rather than
    register_hotmemory_notifier().

    Link: https://lkml.kernel.org/r/20220923033347.3935160-6-liushixin2@huawei.com
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Waiman Long <longman@redhat.com>
    Cc: zefan li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
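
As background for the conversions in this series, here is a minimal sketch of registering a memory hotplug notifier with hotplug_memory_notifier() directly; the callback name and the literal priority are illustrative (the first commit in the series collects named priority constants in include/linux/memory.h), not taken from the patches:

    #include <linux/memory.h>
    #include <linux/notifier.h>

    /* Hypothetical subsystem callback reacting to memory hot(un)plug events. */
    static int example_mem_callback(struct notifier_block *nb,
                                    unsigned long action, void *arg)
    {
            switch (action) {
            case MEM_ONLINE:
                    /* grow per-node bookkeeping here */
                    break;
            case MEM_OFFLINE:
                    /* and shrink it again here */
                    break;
            }
            return NOTIFY_OK;
    }

    static int __init example_subsys_init(void)
    {
            /* Registers the notifier_block for us; no register_hotmemory_notifier() wrapper. */
            hotplug_memory_notifier(example_mem_callback, 0 /* priority */);
            return 0;
    }
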
Audra Mitchell 74d4b8f72a mm/mmap: use hotplug_memory_notifier() directly
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit cddb8d09ff1e477de8236a061a5017b21bab3c14
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Fri Sep 23 11:33:43 2022 +0800

    mm/mmap: use hotplug_memory_notifier() directly

    Commit 76ae847497bc52 ("Documentation: raise minimum supported version of
    GCC to 5.1") updated the minimum gcc version to 5.1.  So the problem
    mentioned in f02c696800 ("include/linux/memory.h: implement
    register_hotmemory_notifier()") no longer exists.  So we can now switch to
    using hotplug_memory_notifier() directly rather than
    register_hotmemory_notifier().

    Link: https://lkml.kernel.org/r/20220923033347.3935160-5-liushixin2@huawei.com
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Waiman Long <longman@redhat.com>
    Cc: zefan li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
Audra Mitchell a26618361c mm/slub.c: use hotplug_memory_notifier() directly
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 946d5f9c9dcdbaedcd664fad08cea7910139d10f
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Fri Sep 23 11:33:42 2022 +0800

    mm/slub.c: use hotplug_memory_notifier() directly

    Commit 76ae847497bc52 ("Documentation: raise minimum supported version of
    GCC to 5.1") updated the minimum gcc version to 5.1.  So the problem
    mentioned in f02c696800 ("include/linux/memory.h: implement
    register_hotmemory_notifier()") no longer exists.  So we can now switch to
    using hotplug_memory_notifier() directly rather than
    register_hotmemory_notifier().

    Link: https://lkml.kernel.org/r/20220923033347.3935160-4-liushixin2@huawei.com
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Waiman Long <longman@redhat.com>
    Cc: zefan li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
Audra Mitchell d4f83c01a7 fs/proc/kcore.c: use hotplug_memory_notifier() directly
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 5d89c224328bce791d051bf60aa92d90bae93c01
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Fri Sep 23 11:33:41 2022 +0800

    fs/proc/kcore.c: use hotplug_memory_notifier() directly

    Commit 76ae847497bc52 ("Documentation: raise minimum supported version of
    GCC to 5.1") updated the minimum gcc version to 5.1.  So the problem
    mentioned in f02c696800 ("include/linux/memory.h: implement
    register_hotmemory_notifier()") no longer exists.  So we can now switch to
    using hotplug_memory_notifier() directly rather than
    register_hotmemory_notifier().

    Link: https://lkml.kernel.org/r/20220923033347.3935160-3-liushixin2@huawei.com
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Waiman Long <longman@redhat.com>
    Cc: zefan li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
Audra Mitchell 17d6db099b kasan: migrate workqueue_uaf test to kunit
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit b2c5bd4c69ce28500ed2176d11002a4e9b30da36
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Tue Sep 27 19:09:11 2022 +0200

    kasan: migrate workqueue_uaf test to kunit

    Migrate the workqueue_uaf test to the KUnit framework.

    Initially, this test was intended to check that Generic KASAN prints
    auxiliary stack traces for workqueues.  Nevertheless, the test is enabled
    for all modes to make sure that KASAN reports bad accesses in the tested
    scenario.

    The presence of auxiliary stack traces for the Generic mode needs to be
    inspected manually.

    Link: https://lkml.kernel.org/r/1d81b6cc2a58985126283d1e0de8e663716dd930.1664298455.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
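
For illustration, a migrated test of this kind has roughly the following shape. This is a sketch based on the commit description, not the verbatim upstream file; KUNIT_EXPECT_KASAN_FAIL is a helper defined inside the KASAN KUnit test suite itself, and the function names here are illustrative:

    static void example_workqueue_work(struct work_struct *work)
    {
            kfree(work);
    }

    static void example_workqueue_uaf(struct kunit *test)
    {
            struct workqueue_struct *workqueue;
            struct work_struct *work;

            workqueue = create_workqueue("kasan_test_workqueue");
            KUNIT_ASSERT_NOT_ERR_OR_NULL(test, workqueue);

            work = kmalloc(sizeof(struct work_struct), GFP_KERNEL);
            KUNIT_ASSERT_NOT_ERR_OR_NULL(test, work);

            INIT_WORK(work, example_workqueue_work);
            queue_work(workqueue, work);
            destroy_workqueue(workqueue);   /* flushes the work, which frees it */

            /* The read below is a use-after-free; KASAN must report it. */
            KUNIT_EXPECT_KASAN_FAIL(test,
                    ((volatile struct work_struct *)work)->data);
    }
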
Audra Mitchell 151840a3d3 kasan: migrate kasan_rcu_uaf test to kunit
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 8516e837cab0b2c740b90603b66039aa7dcecda4
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Tue Sep 27 19:09:10 2022 +0200

    kasan: migrate kasan_rcu_uaf test to kunit

    Migrate the kasan_rcu_uaf test to the KUnit framework.

    Changes to the implementation of the test:

    - Call rcu_barrier() after call_rcu() to make sure that the RCU callbacks get
      triggered before the test is over.

    - Cast pointer passed to rcu_dereference_protected as __rcu to get rid of
      the Sparse warning.

    - Check that KASAN prints a report via KUNIT_EXPECT_KASAN_FAIL.

    Initially, this test was intended to check that Generic KASAN prints
    auxiliary stack traces for RCU objects. Nevertheless, the test is enabled
    for all modes to make sure that KASAN reports bad accesses in RCU callbacks.

    The presence of auxiliary stack traces for the Generic mode needs to be
    inspected manually.

    Link: https://lkml.kernel.org/r/897ee08d6cd0ba7e8a4fbfd9d8502823a2f922e6.1664298455.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
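
Schematically, the reworked RCU test follows the flow described in the bullet points above; the struct and function names in this sketch are illustrative, not the upstream file verbatim:

    static struct example_rcu_info {
            int i;
            struct rcu_head rcu;
    } *example_rcu_ptr;

    static void example_rcu_reclaim(struct rcu_head *rp)
    {
            struct example_rcu_info *fp =
                    container_of(rp, struct example_rcu_info, rcu);

            kfree(fp);
            /* the deliberate use-after-free read happens here in the real test */
    }

    static void example_rcu_uaf(struct kunit *test)
    {
            struct example_rcu_info *ptr;

            ptr = kmalloc(sizeof(*ptr), GFP_KERNEL);
            KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);

            /* Cast via __rcu to keep Sparse quiet, as noted in the commit. */
            example_rcu_ptr = rcu_dereference_protected(
                            (struct example_rcu_info __rcu *)ptr, true);

            /* rcu_barrier() guarantees the callback has run before the test ends. */
            KUNIT_EXPECT_KASAN_FAIL(test,
                    call_rcu(&example_rcu_ptr->rcu, example_rcu_reclaim);
                    rcu_barrier());
    }
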
Audra Mitchell 57b16e0916 kasan: switch kunit tests to console tracepoints
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Minor context conflict due to out of order backport:
    c9c5178853 ("kasan, arm64: rename tagging-related routines")

This patch is a backport of the following upstream commit:
commit 7ce0ea19d50e4e97a8da69f616ffa8afbb532a93
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Tue Sep 27 19:09:09 2022 +0200

    kasan: switch kunit tests to console tracepoints

    Switch KUnit-compatible KASAN tests from using per-task KUnit resources to
    console tracepoints.

    This allows for two things:

    1. Migrating tests that trigger a KASAN report in the context of a task
       other than current to KUnit framework.
       This is implemented in the patches that follow.

    2. Parsing and matching the contents of KASAN reports.
       This is not yet implemented.

    Link: https://lkml.kernel.org/r/9345acdd11e953b207b0ed4724ff780e63afeb36.1664298455.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell 4b0380a0d6 tmpfs: ensure O_LARGEFILE with generic_file_open()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit a5454f95246aa1d3527ef5e128cd3a10bc8371de
Author: Thomas Weißschuh <thomas.weissschuh@amadeus.com>
Date:   Wed Sep 28 12:45:35 2022 +0200

    tmpfs: ensure O_LARGEFILE with generic_file_open()

    Without this check, open() will open large files on tmpfs even though
    O_LARGEFILE was not specified.  This is inconsistent with other
    filesystems.  Also it will later result in EOVERFLOW on stat() or EFBIG on
    write().

    Link: https://lore.kernel.org/lkml/76bedae6-22ea-4abc-8c06-b424ceb39217@t-8ch.de/
    Link: https://lkml.kernel.org/r/20220928104535.61186-1-linux@weissschuh.net
    Signed-off-by: Thomas Weißschuh <thomas.weissschuh@amadeus.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
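
The fix is small: hook generic_file_open() into the shmem open path so the standard largefile check runs on open(). Roughly, and with the file_operations shown only in part:

    /* generic_file_open() (fs/open.c) rejects large files opened without
     * O_LARGEFILE, approximately: */
    int generic_file_open(struct inode *inode, struct file *filp)
    {
            if (!(filp->f_flags & O_LARGEFILE) && i_size_read(inode) > MAX_NON_LFS)
                    return -EOVERFLOW;
            return 0;
    }

    /* tmpfs wires it up by setting .open (other fields omitted here): */
    static const struct file_operations shmem_file_operations = {
            .open = generic_file_open,
            /* .mmap, .read_iter, .write_iter, ... unchanged */
    };
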
Audra Mitchell 1578668f61 mm: memcontrol: use mem_cgroup_is_root() helper
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 7848ed6284ec4791eba22026e28edb2062790a3d
Author: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Date:   Fri Sep 30 19:14:33 2022 +0530

    mm: memcontrol: use mem_cgroup_is_root() helper

    Replace the checks for memcg is root memcg, with mem_cgroup_is_root()
    helper.

    Link: https://lkml.kernel.org/r/20220930134433.338103-1-kamalesh.babulal@oracle.com
    Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tom Hromatka <tom.hromatka@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
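
The conversion is mechanical; a sketch of the before/after shape, with the surrounding condition being illustrative:

    /* Before: open-coded comparison against the global root memcg */
    if (memcg == root_mem_cgroup)
            return;

    /* After: the helper reads the same but documents the intent */
    if (mem_cgroup_is_root(memcg))
            return;
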
Audra Mitchell d282f371b7 mm/mincore.c: use vma_lookup() instead of find_vma()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 97955f6941f0e7dea64dea22711382daf1db2f76
Author: Deming Wang <wangdeming@inspur.com>
Date:   Thu Oct 6 23:03:45 2022 -0400

    mm/mincore.c: use vma_lookup() instead of find_vma()

    Using vma_lookup() verifies that the start address is contained in the found
    vma.  This makes the code easier to read.

    Link: https://lkml.kernel.org/r/20221007030345.5029-1-wangdeming@inspur.com
    Signed-off-by: Deming Wang <wangdeming@inspur.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
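
A sketch of the difference (the error value and surrounding code are illustrative): find_vma() returns the first VMA ending above the address, so callers must also check containment, whereas vma_lookup() only returns a VMA that actually contains the address:

    /* Before */
    vma = find_vma(current->mm, addr);
    if (!vma || addr < vma->vm_start)
            return -ENOMEM;

    /* After */
    vma = vma_lookup(current->mm, addr);
    if (!vma)
            return -ENOMEM;
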
Audra Mitchell fb208bc6ad filemap: find_get_entries() now updates start offset
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 9fb6beea79c6e7c959adf4fb7b94cf9a6028b941
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Oct 17 09:18:00 2022 -0700

    filemap: find_get_entries() now updates start offset

    Initially, find_get_entries() was being passed in the start offset as a
    value.  That left the calculation of the offset to the callers.  This led
    to complexity in the callers trying to keep track of the index.

    Now find_get_entries() takes in a pointer to the start offset and updates
    the value to be directly after the last entry found.  If no entry is
    found, the offset is not changed.  This gets rid of multiple hacky
    calculations that kept track of the start offset.

    Link: https://lkml.kernel.org/r/20221017161800.2003-3-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell 765c2fd97b filemap: find_lock_entries() now updates start offset
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Context conflict due to out of order backport:
    9efa394ef3 ("tmpfs: fix data loss from failed fallocate")

This patch is a backport of the following upstream commit:
commit 3392ca121872dd8c33015c7703d4981c78819be3
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Oct 17 09:17:59 2022 -0700

    filemap: find_lock_entries() now updates start offset

    Patch series "Rework find_get_entries() and find_lock_entries()", v3.

    Originally the callers of find_get_entries() and find_lock_entries() were
    keeping track of the start index themselves as they traverse the search
    range.

    This resulted in hacky code such as in shmem_undo_range():

                            index = folio->index + folio_nr_pages(folio) - 1;

    where the - 1 is only present to stay in the right spot after incrementing
    index later.  This sort of calculation was also being done on every folio
    despite not even using index later within that function.

    These patches change find_get_entries() and find_lock_entries() to
    calculate the new index instead of leaving it to the callers so we can
    avoid all these complications.

    This patch (of 2):

    Initially, find_lock_entries() was being passed in the start offset as a
    value.  That left the calculation of the offset to the callers.  This led
    to complexity in the callers trying to keep track of the index.

    Now find_lock_entries() takes in a pointer to the start offset and updates
    the value to be directly after the last entry found.  If no entry is
    found, the offset is not changed.  This gets rid of multiple hacky
    calculations that kept track of the start offset.

    Link: https://lkml.kernel.org/r/20221017161800.2003-1-vishal.moola@gmail.com
    Link: https://lkml.kernel.org/r/20221017161800.2003-2-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
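
Together with the find_get_entries() change earlier in this series, callers now pass a pointer to the index and let the helper advance it. A schematic caller loop under the new convention (batch handling simplified, variable setup illustrative):

    pgoff_t index = start;
    pgoff_t indices[PAGEVEC_SIZE];
    struct folio_batch fbatch;

    folio_batch_init(&fbatch);
    while (find_get_entries(mapping, &index, end, &fbatch, indices)) {
            /* process the batch; no manual
             * index = folio->index + folio_nr_pages(folio) - 1 bookkeeping */
            folio_batch_release(&fbatch);
    }
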
Audra Mitchell cb3a49da8a mm/rmap: fix comment in anon_vma_clone()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit d8e454eb44473b2270e2675fb44a9d79dee36097
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Fri Oct 14 09:39:31 2022 +0800

    mm/rmap: fix comment in anon_vma_clone()

    Commit 2555283eb40d ("mm/rmap: Fix anon_vma->degree ambiguity leading to
    double-reuse") use num_children and num_active_vmas to replace the origin
    degree to fix anon_vma UAF problem.  Update the comment in anon_vma_clone
    to fit this change.

    Link: https://lkml.kernel.org/r/20221014013931.1565969-1-mawupeng1@huawei.com
    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell c1d6c78a3a mm/percpu: remove unused PERCPU_DYNAMIC_EARLY_SLOTS
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit d667c94962c1c81ef587ac91dc5c01a1cfe339c7
Author: Baoquan He <bhe@redhat.com>
Date:   Mon Oct 24 16:14:34 2022 +0800

    mm/percpu: remove unused PERCPU_DYNAMIC_EARLY_SLOTS

    Since commit 40064aeca3 ("percpu: replace area map allocator with
    bitmap"), there's no place to use PERCPU_DYNAMIC_EARLY_SLOTS. So
    clean it up.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell a23585f50b mm/percpu.c: remove the lcm code since block size is fixed at page size
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 3289e0533e70aafa9fb6d128fd4452db1b8befe8
Author: Baoquan He <bhe@redhat.com>
Date:   Mon Oct 24 16:14:33 2022 +0800

    mm/percpu.c: remove the lcm code since block size is fixed at page size

    Since commit b239f7daf5 ("percpu: set PCPU_BITMAP_BLOCK_SIZE to
    PAGE_SIZE"), PCPU_BITMAP_BLOCK_SIZE has been fixed at page size.  So the
    lcm code in pcpu_alloc_first_chunk() no longer makes sense; clean it up.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell a4b0f4aadc mm/percpu: replace the goto with break
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 83d261fc9e5fb03e8c32e365ca4ee53952611a2b
Author: Baoquan He <bhe@redhat.com>
Date:   Mon Oct 24 16:14:32 2022 +0800

    mm/percpu: replace the goto with break

    In pcpu_reclaim_populated(), the goto is unnecessary since the label
    'end_chunk' is near the end of the for loop; use break instead.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell 443bfa2d5b mm/percpu: add comment to state the empty populated pages accounting
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 73046f8d31701c379f6db899cb09ba70a3285143
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Oct 25 11:45:16 2022 +0800

    mm/percpu: add comment to state the empty populated pages accounting

    When allocating an area from a chunk, pcpu_block_update_hint_alloc()
    is called to update chunk metadata, including chunk's and global
    nr_empty_pop_pages. However, if the allocation is not atomic, some
    blocks may not be populated with pages yet, while we still subtract
    the number here. The number of pages will be added back with
    pcpu_chunk_populated() when populating pages.

    Add a code comment to make that more understandable.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell a432c1d810 mm/percpu: Update the code comment when creating new chunk
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit e04cb6976340d5ebf2b28ad91bf6a13a285aa566
Author: Baoquan He <bhe@redhat.com>
Date:   Mon Oct 24 16:14:30 2022 +0800

    mm/percpu: Update the code comment when creating new chunk

    The code that takes the pcpu_alloc_mutex lock has been moved to the
    beginning of pcpu_alloc() for non-atomic allocations, so the code comment
    above the pcpu_create_chunk() call site needs to be updated.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
Audra Mitchell 91e0cae202 mm/percpu: use list_first_entry_or_null in pcpu_reclaim_populated()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit c1f6688d35d47ca11200789b000b3b20f5ecdbd9
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Oct 25 11:11:45 2022 +0800

    mm/percpu: use list_first_entry_or_null in pcpu_reclaim_populated()

    Replace the list_empty()/list_first_entry() pair with
    list_first_entry_or_null() to simplify the code.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Acked-by: Dennis Zhou <dennis@kernel.org>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
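
The simplification pattern, shown on a generic list; the struct and variable names are illustrative:

    struct example_item {
            struct list_head list;
    };
    struct example_item *item;

    /* Before: an emptiness check plus a separate first-entry lookup */
    if (list_empty(head))
            return;
    item = list_first_entry(head, struct example_item, list);

    /* After: one call that returns NULL when the list is empty */
    item = list_first_entry_or_null(head, struct example_item, list);
    if (!item)
            return;
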
Audra Mitchell 10f60902d9 mm/percpu: remove unused pcpu_map_extend_chunks
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 5a7d596a05dddd09c44ae462f881491cf87ed120
Author: Baoquan He <bhe@redhat.com>
Date:   Mon Oct 24 16:14:28 2022 +0800

    mm/percpu: remove unused pcpu_map_extend_chunks

    Since commit 40064aeca3 ("percpu: replace area map allocator with
    bitmap"), it is unneeded.

    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
Audra Mitchell 32223d8003 mm/slub: perform free consistency checks before call_rcu
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit bc29d5bd2ba977716e57572030290d6547ff3f6d
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Fri Aug 26 11:09:11 2022 +0200

    mm/slub: perform free consistency checks before call_rcu

    For SLAB_TYPESAFE_BY_RCU caches we use call_rcu to perform empty slab
    freeing. The rcu callback rcu_free_slab() calls __free_slab() that
    currently includes checking the slab consistency for caches with
    SLAB_CONSISTENCY_CHECKS flags. This check needs the slab->objects field
    to be intact.

    Because in the next patch we want to allow rcu_head in struct slab to
    become larger in debug configurations and thus potentially overwrite
    more fields through a union than slab_list, we want to limit the fields
    used in rcu_free_slab().  Thus move the consistency checks to
    free_slab() before call_rcu(). This can be done safely even for
    SLAB_TYPESAFE_BY_RCU caches where accesses to the objects can still
    occur after freeing them.

    As a result, only the slab->slab_cache field has to be physically
    separate from rcu_head for the freeing callback to work. We also save
    some cycles in the rcu callback for caches with consistency checks
    enabled.

    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
Audra Mitchell 65dfaa7487 mm/slab: Annotate kmem_cache_node->list_lock as raw
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit b539ce9f1a31c442098c3f351cb4d03ba27c2720
Author: Jiri Kosina <jkosina@suse.cz>
Date:   Fri Oct 21 21:18:12 2022 +0200

    mm/slab: Annotate kmem_cache_node->list_lock as raw

    The list_lock can be taken in hardirq context when do_drain() is being
    called via IPI on all cores, and therefore lockdep complains about it,
    because it can't be preempted on PREEMPT_RT.

    That's not a real issue, as SLAB can't be built on PREEMPT_RT anyway, but
    we still want to get rid of the warning on non-PREEMPT_RT builds.

    Annotate it therefore as a raw lock in order to get rid of the lockdep
    warning below.

             =============================
             [ BUG: Invalid wait context ]
             6.1.0-rc1-00134-ge35184f32151 #4 Not tainted
             -----------------------------
             swapper/3/0 is trying to lock:
             ffff8bc88086dc18 (&parent->list_lock){..-.}-{3:3}, at: do_drain+0x57/0xb0
             other info that might help us debug this:
             context-{2:2}
             no locks held by swapper/3/0.
             stack backtrace:
             CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.1.0-rc1-00134-ge35184f32151 #4
             Hardware name: LENOVO 20K5S22R00/20K5S22R00, BIOS R0IET38W (1.16 ) 05/31/2017
             Call Trace:
              <IRQ>
              dump_stack_lvl+0x6b/0x9d
              __lock_acquire+0x1519/0x1730
              ? build_sched_domains+0x4bd/0x1590
              ? __lock_acquire+0xad2/0x1730
              lock_acquire+0x294/0x340
              ? do_drain+0x57/0xb0
              ? sched_clock_tick+0x41/0x60
              _raw_spin_lock+0x2c/0x40
              ? do_drain+0x57/0xb0
              do_drain+0x57/0xb0
              __flush_smp_call_function_queue+0x138/0x220
              __sysvec_call_function+0x4f/0x210
              sysvec_call_function+0x4b/0x90
              </IRQ>
              <TASK>
              asm_sysvec_call_function+0x16/0x20
             RIP: 0010:mwait_idle+0x5e/0x80
             Code: 31 d2 65 48 8b 04 25 80 ed 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 0b 78 46 00 31 c0 48 89 c1 fb 0f 01 c9 <eb> 06 fb 0f 1f 44 00 00 65 48 8b 04 25 80 ed 01 00 f0 80 60 02 df
             RSP: 0000:ffffa90940217ee0 EFLAGS: 00000246
             RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
             RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9bb9f93a
             RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000001
             R10: ffffa90940217ea8 R11: 0000000000000000 R12: ffffffffffffffff
             R13: 0000000000000000 R14: ffff8bc88127c500 R15: 0000000000000000
              ? default_idle_call+0x1a/0xa0
              default_idle_call+0x4b/0xa0
              do_idle+0x1f1/0x2c0
              ? _raw_spin_unlock_irqrestore+0x56/0x70
              cpu_startup_entry+0x19/0x20
              start_secondary+0x122/0x150
              secondary_startup_64_no_verify+0xce/0xdb
              </TASK>

    Signed-off-by: Jiri Kosina <jkosina@suse.cz>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
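
Schematically, the annotation swaps the lock type and the corresponding lock/unlock calls; the structure name below is illustrative, while the real change targets kmem_cache_node->list_lock:

    struct example_cache_node {
            raw_spinlock_t list_lock;       /* was: spinlock_t list_lock; */
    };

    static void example_drain(struct example_cache_node *n)
    {
            /* A raw spinlock may legitimately be taken in hardirq context
             * (do_drain() runs via IPI), so lockdep no longer reports an
             * invalid wait context. */
            raw_spin_lock(&n->list_lock);   /* was: spin_lock(...) */
            raw_spin_unlock(&n->list_lock); /* was: spin_unlock(...) */
    }
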
Audra Mitchell 7880c06622 mm: slub: make slab_sysfs_init() a late_initcall
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 1a5ad30b89b4e9fa64f75b941a324396738b7616
Author: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Date:   Fri Sep 30 12:27:12 2022 +0200

    mm: slub: make slab_sysfs_init() a late_initcall

    Currently, slab_sysfs_init() is an __initcall aka device_initcall. It
    is rather time-consuming; on my board it takes around 11ms. That's
    about 1% of the time budget I have from U-Boot letting go until Linux
    must assume responsibility for keeping the external watchdog happy.

    There's no particular reason this would need to run at device_initcall
    time, so instead make it a late_initcall to allow vital functionality
    to get started a bit sooner.

    This actually ends up winning more than just those 11ms, because the
    slab caches that get created during other device_initcalls (and before
    my watchdog device gets probed) now don't end up doing the somewhat
    expensive sysfs_slab_add() themselves. Some example lines (with
    initcall_debug set) before/after:

    initcall ext4_init_fs+0x0/0x1ac returned 0 after 1386 usecs
    initcall journal_init+0x0/0x138 returned 0 after 517 usecs
    initcall init_fat_fs+0x0/0x68 returned 0 after 294 usecs

    initcall ext4_init_fs+0x0/0x1ac returned 0 after 240 usecs
    initcall journal_init+0x0/0x138 returned 0 after 32 usecs
    initcall init_fat_fs+0x0/0x68 returned 0 after 18 usecs

    Altogether, this means I now get to pet the watchdog around 17ms
    sooner. [Of course, the time the other initcalls save is instead spent
    in slab_sysfs_init(), which goes from 11ms to 16ms, so there's no
    overall change in boot time.]

    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
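
The mechanical part of the change is only the initcall level; a sketch:

    /* Before: registered with the bulk of the device initcalls */
    __initcall(slab_sysfs_init);

    /* After: deferred, so earlier initcalls (and the kmem_cache creation they
     * perform) no longer pay for sysfs_slab_add() along the way */
    late_initcall(slab_sysfs_init);
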
Audra Mitchell 6860172059 mm: slub: remove dead and buggy code from sysfs_slab_add()
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 979857ea2deae05454d257f119bedfe84a2c74d9
Author: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Date:   Fri Sep 30 10:47:42 2022 +0200

    mm: slub: remove dead and buggy code from sysfs_slab_add()

    The function sysfs_slab_add() has two callers:

    One is slab_sysfs_init(), which first initializes slab_kset, and only
    when that succeeds sets slab_state to FULL, and then proceeds to call
    sysfs_slab_add() for all previously created slabs.

    The other is __kmem_cache_create(), but only after a

            if (slab_state <= UP)
                    return 0;

    check.

    So in other words, sysfs_slab_add() is never called without
    slab_kset (aka the return value of cache_kset()) being non-NULL.

    And this is just as well, because if we ever did take this path and
    called kobject_init(&s->kobj), and then later when called again from
    slab_sysfs_init() would end up calling kobject_init_and_add(), we
    would hit

            if (kobj->state_initialized) {
                    /* do not error out as sometimes we can recover */
                    pr_err("kobject (%p): tried to init an initialized object, something is seriously wrong.\n",
                           kobj);
                    dump_stack();
            }

    in kobject.c.

    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:49 -04:00
Chris von Recklinghausen 2aef5ca72a Add CONFIG_PER_VMA_LOCK_STATS to RHEL configs collection
Upstream Status: RHEL-only
JIRA: https://issues.redhat.com/browse/RHEL-27736

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:26 -04:00
Chris von Recklinghausen 41c8c0ebba mmap: fix do_brk_flags() modifying obviously incorrect VMAs
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 6c28ca6485ddd7c5da171e479e3ebfbe661efc4d
Author: Liam Howlett <liam.howlett@oracle.com>
Date:   Mon Dec 5 19:23:17 2022 +0000

    mmap: fix do_brk_flags() modifying obviously incorrect VMAs

    Add more sanity checks to the VMA that do_brk_flags() will expand.  Ensure
    the VMA matches basic merge requirements within the function before
    calling can_vma_merge_after().

    Drop the duplicate checks from vm_brk_flags() since they will be enforced
    later.

    The old code would expand file VMAs on brk(), which is functionally
    wrong and also dangerous in terms of locking because the brk() path
    isn't designed for file VMAs and therefore doesn't lock the file
    mapping.  Checking can_vma_merge_after() ensures that new anonymous
    VMAs can't be merged into file VMAs.

    See https://lore.kernel.org/linux-mm/CAG48ez1tJZTOjS_FjRZhvtDA-STFmdw8PEizPDwMGFd_ui0Nrw@mail.gmail.com/

    Link: https://lkml.kernel.org/r/20221205192304.1957418-1-Liam.Howlett@oracle.com
    Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Suggested-by: Jann Horn <jannh@google.com>
    Cc: Jason A. Donenfeld <Jason@zx2c4.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:07 -04:00
Chris von Recklinghausen 2e800149c5 mm: do not BUG_ON missing brk mapping, because userspace can unmap it
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit f5ad5083404bb56c9de777dccb68c6672ef6487e
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Fri Dec 2 17:27:24 2022 +0100

    mm: do not BUG_ON missing brk mapping, because userspace can unmap it

    The following program will trigger the BUG_ON that this patch removes,
    because the user can munmap() mm->brk:

      #include <sys/syscall.h>
      #include <sys/mman.h>
      #include <assert.h>
      #include <unistd.h>

      static void *brk_now(void)
      {
        return (void *)syscall(SYS_brk, 0);
      }

      static void brk_set(void *b)
      {
        assert(syscall(SYS_brk, b) != -1);
      }

      int main(int argc, char *argv[])
      {
        void *b = brk_now();
        brk_set(b + 4096);
        assert(munmap(b - 4096, 4096 * 2) == 0);
        brk_set(b);
        return 0;
      }

    Compile that with musl, since glibc actually uses brk(), and then
    execute it, and it'll hit this splat:

      kernel BUG at mm/mmap.c:229!
      invalid opcode: 0000 [#1] PREEMPT SMP
      CPU: 12 PID: 1379 Comm: a.out Tainted: G S   U             6.1.0-rc7+ #419
      RIP: 0010:__do_sys_brk+0x2fc/0x340
      Code: 00 00 4c 89 ef e8 04 d3 fe ff eb 9a be 01 00 00 00 4c 89 ff e8 35 e0 fe ff e9 6e ff ff ff 4d 89 a7 20>
      RSP: 0018:ffff888140bc7eb0 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 00000000007e7000 RCX: ffff8881020fe000
      RDX: ffff8881020fe001 RSI: ffff8881955c9b00 RDI: ffff8881955c9b08
      RBP: 0000000000000000 R08: ffff8881955c9b00 R09: 00007ffc77844000
      R10: 0000000000000000 R11: 0000000000000001 R12: 00000000007e8000
      R13: 00000000007e8000 R14: 00000000007e7000 R15: ffff8881020fe000
      FS:  0000000000604298(0000) GS:ffff88901f700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000603fe0 CR3: 000000015ba9a005 CR4: 0000000000770ee0
      PKRU: 55555554
      Call Trace:
       <TASK>
       do_syscall_64+0x2b/0x50
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x400678
      Code: 10 4c 8d 41 08 4c 89 44 24 10 4c 8b 01 8b 4c 24 08 83 f9 2f 77 0a 4c 8d 4c 24 20 4c 01 c9 eb 05 48 8b>
      RSP: 002b:00007ffc77863890 EFLAGS: 00000212 ORIG_RAX: 000000000000000c
      RAX: ffffffffffffffda RBX: 000000000040031b RCX: 0000000000400678
      RDX: 00000000004006a1 RSI: 00000000007e6000 RDI: 00000000007e7000
      RBP: 00007ffc77863900 R08: 0000000000000000 R09: 00000000007e6000
      R10: 00007ffc77863930 R11: 0000000000000212 R12: 00007ffc77863978
      R13: 00007ffc77863988 R14: 0000000000000000 R15: 0000000000000000
       </TASK>

    Instead, just return the old brk value if the original mapping has been
    removed.

    [akpm@linux-foundation.org: fix changelog, per Liam]
    Link: https://lkml.kernel.org/r/20221202162724.2009-1-Jason@zx2c4.com
    Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Jann Horn <jannh@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:07 -04:00
Chris von Recklinghausen 131abb759c mm/page_alloc: leave IRQs enabled for per-cpu page allocations
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 5749077415994eb02d660b2559b9d8278521e73d
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 18 10:17:14 2022 +0000

    mm/page_alloc: leave IRQs enabled for per-cpu page allocations

    The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
    allocating from the PCP must not re-enter the allocator from IRQ context.
    In each instance where IRQ-reentrancy is possible, the lock is acquired
    using pcp_spin_trylock_irqsave() even though IRQs are disabled and
    re-entrancy is impossible.

    Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the
    common case at the cost of some IRQ allocations taking a slower path.  If
    the PCP lists need to be refilled, the zone lock still needs to disable
    IRQs but that will only happen on PCP refill and drain.  If an IRQ is
    raised when a PCP allocation is in progress, the trylock will fail and
    fallback to using the buddy lists directly.  Note that this may not be a
    universal win if an interrupt-intensive workload also allocates heavily
    from interrupt context and contends heavily on the zone->lock as a result.

    [mgorman@techsingularity.net: migratetype might be wrong if a PCP was locked]
      Link: https://lkml.kernel.org/r/20221122131229.5263-2-mgorman@techsingularity.net
    [yuzhao@google.com: reported lockdep issue on IO completion from softirq]
    [hughd@google.com: fix list corruption, lock improvements, micro-optimsations]
    Link: https://lkml.kernel.org/r/20221118101714.19590-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:07 -04:00
Chris von Recklinghausen f659851e8b mm/page_alloc: always remove pages from temporary list
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit c3e58a70425ac6ddaae1529c8146e88b4f7252bb
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 18 10:17:13 2022 +0000

    mm/page_alloc: always remove pages from temporary list

    Patch series "Leave IRQs enabled for per-cpu page allocations", v3.

    This patch (of 2):

    free_unref_page_list() has neglected to remove pages properly from the
    list of pages to free since forever.  It works by coincidence because
    list_add happened to do the right thing adding the pages to just the PCP
    lists.  However, a later patch added pages to either the PCP list or the
    zone list but only properly deleted the page from the list in one path
    leading to list corruption and a subsequent failure.  As a preparation
    patch, always delete the pages from one list properly before adding to
    another.  On its own, this fixes nothing although it adds a fractional
    amount of overhead but is critical to the next patch.

    Link: https://lkml.kernel.org/r/20221118101714.19590-1-mgorman@techsingularity.net
    Link: https://lkml.kernel.org/r/20221118101714.19590-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reported-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:06 -04:00
Chris von Recklinghausen 7096ad3b1e mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 8d6a0ac09a16c026e1e2a03a61e12e95c48a25a6
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:47 2022 +0100

    mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping

    Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a
    COW (i.e., private writable) mapping and adjust the documentation
    accordingly.

    FAULT_FLAG_UNSHARE will now also break COW when encountering the shared
    zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by
    properly replacing the mapped page/pfn by a private copy (an exclusive
    anonymous page).

    Note that only do_wp_page() needs care: hugetlb_wp() already handles
    FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it
    correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE
    such that we can handle FAULT_FLAG_UNSHARE on the PTE level.

    This change is a requirement for reliable long-term R/O pinning in
    COW mappings.

    Link: https://lkml.kernel.org/r/20221116102659.70287-9-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:06 -04:00
Chris von Recklinghausen d43c5e6f48 mm: rework handling in do_wp_page() based on private vs. shared mappings
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit b9086fde6d44e8a95dc95b822bd87386129b832d
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Nov 16 11:26:45 2022 +0100

    mm: rework handling in do_wp_page() based on private vs. shared mappings

    We want to extend FAULT_FLAG_UNSHARE support to anything mapped into a
    COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages.
    Let's prepare for that by handling shared mappings first such that we can
    handle private mappings last.

    While at it, use folio-based functions instead of page-based functions
    where we touch the code either way.

    Link: https://lkml.kernel.org/r/20221116102659.70287-7-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:06 -04:00
Chris von Recklinghausen 80a5bb5a00 hugetlb: remove duplicate mmu notifications
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 369258ce41c6d7663a7b6d509356fecad577378d
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Nov 14 15:55:07 2022 -0800

    hugetlb: remove duplicate mmu notifications

    The common hugetlb unmap routine __unmap_hugepage_range performs mmu
    notification calls.  However, in the case where __unmap_hugepage_range is
    called via __unmap_hugepage_range_final, mmu notification calls are
    performed earlier in other calling routines.

    Remove mmu notification calls from __unmap_hugepage_range.  Add
    notification calls to the only other caller: unmap_hugepage_range.
    unmap_hugepage_range is called for truncation and hole punch, so change
    notification type from UNMAP to CLEAR as this is more appropriate.

    Link: https://lkml.kernel.org/r/20221114235507.294320-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Peter Xu <peterx@redhat.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:05 -04:00
Chris von Recklinghausen 32b894d0b2 mm: teach release_pages() to take an array of encoded page pointers too
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 449c796768c9a1c738d1fa8671fb01663380b8a7
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Nov 9 12:30:49 2022 -0800

    mm: teach release_pages() to take an array of encoded page pointers too

    release_pages() already could take either an array of page pointers, or an
    array of folio pointers.  Expand it to also accept an array of encoded
    page pointers, which is what both the existing mlock() use and the
    upcoming mmu_gather use of encoded page pointers wants.

    Note that release_pages() won't actually use, or react to, any extra
    encoded bits.  Instead, this is very much a case of "I have walked the
    array of encoded pages and done everything the extra bits tell me to do,
    now release it all".

    Also, while the "either page or folio pointers" dual use was handled with
    a cast of the pointer in "release_folios()", this takes a slightly
    different approach and uses the "transparent union" attribute to describe
    the set of arguments to the function:

      https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html

    which has been supported by gcc forever, but the kernel hasn't used
    before.

    That allows us to avoid using various wrappers with casts, and just use
    the same function regardless of use.

    Link: https://lkml.kernel.org/r/20221109203051.1835763-2-torvalds@linux-foundation.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:05 -04:00
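
The transparent-union trick referenced above lets one function accept several pointer types without casts at the call sites. A small sketch of the general mechanism, with names that are illustrative rather than the kernel's exact definitions:

    struct page;
    struct folio;
    struct encoded_page;

    typedef union {
            struct page **pages;
            struct folio **folios;
            struct encoded_page **encoded_pages;
    } example_release_arg __attribute__((__transparent_union__));

    /* Callers may pass any of the three pointer types directly; the compiler
     * selects the matching union member transparently. */
    void example_release_pages(example_release_arg arg, int nr);
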
Chris von Recklinghausen d10ef99ada mm: introduce 'encoded' page pointers with embedded extra bits
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 70fb4fdff5826a48886152fd5c5db04eb6c59a40
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Nov 9 12:30:48 2022 -0800

    mm: introduce 'encoded' page pointers with embedded extra bits

    We already have this notion in parts of the MM code (see the mlock code
    with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
    case, and I refuse to do the same thing we've done before where we just
    put bits in the raw pointer and say it's still a normal pointer.

    So this introduces a 'struct encoded_page' pointer that cannot be used for
    anything else than to encode a real page pointer and a couple of extra
    bits in the low bits.  That way the compiler can trivially track the state
    of the pointer and you just explicitly encode and decode the extra bits.

    Note that this makes the alignment of 'struct page' explicit even for the
    case where CONFIG_HAVE_ALIGNED_STRUCT_PAGE is not set.  That is entirely
    redundant in almost all cases, since the page structure already contains
    several word-sized entries.

    However, on m68k, the alignment of even 32-bit data is just 16 bits, and
    as such in theory the alignment of 'struct page' could be too.  So let's
    just make it very very explicit that the alignment needs to be at least 32
    bits, giving us a guarantee of two unused low bits in the pointer.

    Now, in practice, our page struct array is aligned much more than that
    anyway, even on m68k, and our existing code in mm/mlock.c obviously
    already depended on that.  But since the whole point of this change is to
    be careful about the type system when hiding extra bits in the pointer,
    let's also be explicit about the assumptions we make.

    NOTE!  This is being very careful in another way too: it has a build-time
    assertion that the 'flags' added to the page pointer actually fit in the
    two bits.  That means that this helper must be inlined, and can only be
    used in contexts where the compiler can statically determine that the
    value fits in the available bits.

    [akpm@linux-foundation.org: kerneldoc on a forward-declared struct confuses htmldocs]
    Link: https://lore.kernel.org/all/Y2tKixpO4RO6DgW5@tuxmaker.boeblingen.de.ibm.com/
    Link: https://lkml.kernel.org/r/20221109203051.1835763-1-torvalds@linux-foundation.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com> [s390]
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:05 -04:00
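
The low-bit encoding itself can be sketched generically as follows; the names and the two-bit budget mirror the commit's description, but this is not the kernel's mm_types.h code verbatim:

    struct page;
    struct example_encoded_page;        /* opaque type for encoded pointers */

    #define EXAMPLE_ENCODE_MASK 3ul     /* the two guaranteed-free low bits */

    static inline struct example_encoded_page *
    example_encode_page(struct page *page, unsigned long flags)
    {
            /* flags must fit in the two free bits of the aligned pointer */
            return (struct example_encoded_page *)(flags | (unsigned long)page);
    }

    static inline unsigned long
    example_encoded_flags(struct example_encoded_page *ep)
    {
            return (unsigned long)ep & EXAMPLE_ENCODE_MASK;
    }

    static inline struct page *
    example_encoded_ptr(struct example_encoded_page *ep)
    {
            return (struct page *)((unsigned long)ep & ~EXAMPLE_ENCODE_MASK);
    }
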
Chris von Recklinghausen c212810ffc mm/hugetlb_vmemmap: remap head page to newly allocated page
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 11aad2631bf74b3c811dee76154702aab855a323
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Mon Nov 7 15:39:22 2022 +0000

    mm/hugetlb_vmemmap: remap head page to newly allocated page

    Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
    back to the page allocator is as follows: for a 2M hugetlb page it will reuse
    the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a
    1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially,
    that means that it breaks the first 4K of a potentially contiguous chunk of
    memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For
    this reason the memory that is freed back to the page allocator cannot be used
    for hugetlb to allocate huge pages of the same size, but rather only of a
    smaller huge page size:

    Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node
    having 64G):

    * Before allocation:
    Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
    ...
    Node    0, zone   Normal, type      Movable    340    100     32     15      1      2      0      0      0      1  15558

    $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
     31987

    * After:

    Node    0, zone   Normal, type      Movable  30893  32006  31515      7      0      0      0      0      0      0      0

    Notice how the memory freed back is put back into 4K / 8K / 16K page
    pools. And it allocates a total of 31987 pages (63974M).

    To fix this behaviour, rather than remapping the second vmemmap page (thus
    breaking the contiguous block of memory backing the struct pages)
    repopulate the first vmemmap page with a new one. We allocate and copy
    from the currently mapped vmemmap page, and then remap it later on.
    The same algorithm works if there's a pre initialized walk::reuse_page
    and the head page doesn't need to be skipped and instead we remap it
    when the @addr being changed is the @reuse_addr.

    The new head page is allocated in vmemmap_remap_free() given that on
    restore there's no need for functional change. Note that, because right
    now one hugepage is remapped at a time, thus only one free 4K page at a
    time is needed to remap the head page. Should it fail to allocate said
    new page, it reuses the one that's already mapped just like before. As a
    result, for every 64G of contiguous hugepages it can give back 1G more
    of contiguous memory per 64G, while needing in total 128M new 4K pages
    (for 2M hugetlb) or 256k (for 1G hugetlb).

    After the changes, try to assign a 64G node to hugetlb (on a 128G 2node
    guest, each node with 64G):

    * Before allocation
    Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
    ...
    Node    0, zone   Normal, type      Movable      1      1      1      0      0      1      0      0      1      1  15564

    $ echo 32768  > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    32394

    * After:

    Node    0, zone   Normal, type      Movable      0     50     97    108     96     81     70     46     18      0      0

    In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M out
    of the 32394 (64788M) allocated. So the memory freed back is indeed being
    used back in hugetlb and there's no massive order-0..order-2 pages
    accumulated unused.

    [joao.m.martins@oracle.com: v3]
      Link: https://lkml.kernel.org/r/20221109200623.96867-1-joao.m.martins@oracle.com
    [joao.m.martins@oracle.com: add smp_wmb() to ensure page contents are visible prior to PTE write]
      Link: https://lkml.kernel.org/r/20221110121214.6297-1-joao.m.martins@oracle.com
    Link: https://lkml.kernel.org/r/20221107153922.77094-1-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:05 -04:00
Chris von Recklinghausen 65e2538817 mm: mmap: fix documentation for vma_mas_szero
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 4a42344081ff7fbb890c0741e11d22cd7f658894
Author: Ian Cowan <ian@linux.cowan.aero>
Date:   Sun Nov 13 19:33:49 2022 -0500

    mm: mmap: fix documentation for vma_mas_szero

    When the struct_mm input, mm, was changed to a struct ma_state, mas, the
    documentation for the function was never updated.  This updates that
    documentation reference.
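
    For illustration, the kind of kernel-doc correction involved (wording
    paraphrased, not the exact in-tree comment):

      /**
       * vma_mas_szero() - Set a given range within the maple tree to NULL.
       * @mas:   The maple state to use (previously documented as a struct
       *         mm_struct, @mm).
       * @start: Start address of the range.
       * @end:   End address of the range.
       */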

    Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aero
    Signed-off-by: Ian Cowan <ian@linux.cowan.aero>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:04 -04:00
Chris von Recklinghausen 28ad3f239c mm/mmap: fix memory leak in mmap_region()
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit cc674ab3c0188002917c8a2c28e4424131f1fd7e
Author: Li Zetao <lizetao1@huawei.com>
Date:   Fri Oct 28 15:37:17 2022 +0800

    mm/mmap: fix memory leak in mmap_region()

    There is a memory leak reported by kmemleak:

      unreferenced object 0xffff88817231ce40 (size 224):
        comm "mount.cifs", pid 19308, jiffies 4295917571 (age 405.880s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          60 c0 b2 00 81 88 ff ff 98 83 01 42 81 88 ff ff  `..........B....
        backtrace:
          [<ffffffff81936171>] __alloc_file+0x21/0x250
          [<ffffffff81937051>] alloc_empty_file+0x41/0xf0
          [<ffffffff81937159>] alloc_file+0x59/0x710
          [<ffffffff81937964>] alloc_file_pseudo+0x154/0x210
          [<ffffffff81741dbf>] __shmem_file_setup+0xff/0x2a0
          [<ffffffff817502cd>] shmem_zero_setup+0x8d/0x160
          [<ffffffff817cc1d5>] mmap_region+0x1075/0x19d0
          [<ffffffff817cd257>] do_mmap+0x727/0x1110
          [<ffffffff817518b2>] vm_mmap_pgoff+0x112/0x1e0
          [<ffffffff83adf955>] do_syscall_64+0x35/0x80
          [<ffffffff83c0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

    The root cause was traced to an error handling path in mmap_region() when
    arch_validate_flags() or mas_preallocate() fails.  In the shared anonymous
    mapping case, the vma will be set up and mapped with a new shared anonymous
    file via shmem_zero_setup().  So in this case, the file resource needs to
    be released.

    Fix it by calling fput(vma->vm_file) and unmap_region() when
    arch_validate_flags() or mas_preallocate() returns an error in the shared
    anonymous mapping case.
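
    A condensed sketch of the resulting error handling (labels and the
    unmap_region() arguments are abbreviated relative to the real
    mm/mmap.c):

      if (mas_preallocate(&mas, GFP_KERNEL)) {
              error = -ENOMEM;
              goto unmap_and_free_vma;
      }
      /* ... */

      unmap_and_free_vma:
      /*
       * A MAP_SHARED anonymous mapping had a shmem file attached by
       * shmem_zero_setup(); drop that reference and tear down the
       * partially built mapping before freeing the vma.
       */
      fput(vma->vm_file);
      vma->vm_file = NULL;
      unmap_region(mm, /* ... */ vma->vm_start, vma->vm_end);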

    Link: https://lkml.kernel.org/r/20221028073717.1179380-1-lizetao1@huawei.com
    Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
    Fixes: c462ac288f ("mm: Introduce arch_validate_flags()")
    Signed-off-by: Li Zetao <lizetao1@huawei.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:04 -04:00
Chris von Recklinghausen 91eceb4cd1 fs/userfaultfd: Fix maple tree iterator in userfaultfd_unregister()
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 59f2f4b8a757412fce372f6d0767bdb55da127a8
Author: Liam Howlett <liam.howlett@oracle.com>
Date:   Mon Nov 7 20:11:42 2022 +0000

    fs/userfaultfd: Fix maple tree iterator in userfaultfd_unregister()

    When iterating the VMAs, the maple state needs to be invalidated if the
    tree is modified by a split or merge to ensure the maple tree node
    contained in the maple state is still valid.  These invalidations were
    missed, so add them to the paths which alter the tree.
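
    A minimal sketch of the idea, using the generic mas_pause() helper (the
    actual hunks in fs/userfaultfd.c carry more arguments and may re-seed
    the state differently):

      mas_for_each(&mas, vma, end) {
              /* ... compute new_flags, decide whether to merge or split ... */
              prev = vma_merge(mm, prev, start, vma_end, new_flags,
                               /* remaining arguments elided */);
              if (prev) {
                      /* The merge modified the tree; invalidate the maple
                       * state before the walk continues. */
                      mas_pause(&mas);
                      vma = prev;
              }
      }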

    Reported-by: syzbot+0d2014e4da2ccced5b41@syzkaller.appspotmail.com
    Fixes: 69dbe6daf104 (userfaultfd: use maple tree iterator to iterate VMAs)
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:04 -04:00
Chris von Recklinghausen 618af61e58 drm/i915/userptr: restore probe_range behaviour
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 6f7de35b50860c345babf8ed0aa0d75f9315eee4
Author: Matthew Auld <matthew.auld@intel.com>
Date:   Fri Oct 28 14:06:35 2022 +0100

    drm/i915/userptr: restore probe_range behaviour

    The conversion looks harmless; however, the addr value is updated inside
    the loop with the previous vm_end, which then incorrectly leads to
    for_each_vma_range() iterating over stuff outside the range we care
    about.  Fix this by storing the end value separately.  Also fix the case
    where the range doesn't intersect with any vma, or where the vma itself
    doesn't extend over the entire range, which must mean we have a hole at
    the end.  Both should result in an error, as per the previous behaviour.
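
    Roughly, the corrected walk looks like the following (condensed; the
    real i915 code applies additional per-vma checks inside the loop):

      const unsigned long end = addr + len;   /* keep the range end fixed */
      unsigned long covered = addr;
      struct vm_area_struct *vma;

      for_each_vma_range(vmi, vma, end) {
              if (vma->vm_start > covered)
                      return -EFAULT;         /* hole inside the range */
              covered = vma->vm_end;
      }
      if (covered < end)
              return -EFAULT;                 /* no vma, or hole at the end */
      return 0;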

    v2: Fix the cases where the range is empty, or if there's a hole at
    the end of the range

    Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/7247
    Testcase: igt@gem_userptr_blits@probe
    Fixes: f683b9d61319 ("i915: use the VMA iterator")
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Signed-off-by: Matthew Auld <matthew.auld@intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20221028130635.465839-1-matthew.auld@intel.com

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:03 -04:00
Chris von Recklinghausen f67b71bff9 mmap: fix remap_file_pages() regression
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 1db43d3f3733351849ddca4b573c037c7821bfd8
Author: Liam Howlett <liam.howlett@oracle.com>
Date:   Tue Oct 25 16:12:49 2022 +0000

    mmap: fix remap_file_pages() regression

    When using the VMA iterator, the final iteration will set the variable
    'next' to NULL, which causes the function to fail out.  Restoring the
    break in the loop, so that the VMA iterator is exited early without
    'next' being cleared to NULL, fixes the issue.
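
    Schematically (not the literal mm/mmap.c hunk):

      for_each_vma_range(vmi, next, start + size) {
              /* ... validate that @next may be part of the remap ... */
              if (start + size <= next->vm_end)
                      break;  /* stop before the iterator NULLs @next */
      }
      if (!next)
              goto out;       /* range not fully covered by VMAs */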

    Link: https://lore.kernel.org/lkml/29344.1666681759@jrobl/
    Link: https://lkml.kernel.org/r/20221025161222.2634030-1-Liam.Howlett@oracle.com
    Fixes: 763ecb035029 (mm: remove the vma linked list)
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reported-by: "J. R. Okajima" <hooanon05g@gmail.com>
    Tested-by: "J. R. Okajima" <hooanon05g@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:03 -04:00
Chris von Recklinghausen eca495248c mm: /proc/pid/smaps_rollup: fix maple tree search
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 08ac85521cb2e26f25b885492180815ce8eaf4b7
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Oct 18 20:18:38 2022 -0700

    mm: /proc/pid/smaps_rollup: fix maple tree search

    /proc/pid/smaps_rollup showed 0 kB for everything: now find first vma.
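
    One way to read the one-line description above: make sure the rollup
    walk actually starts at the first vma in the tree (sketch only; names
    besides MA_STATE()/mas_find() are illustrative):

      MA_STATE(mas, &mm->mm_mt, 0, 0);
      struct vm_area_struct *vma;

      /* Start from the first vma instead of an empty/stale state that
       * makes every counter read as 0 kB. */
      vma = mas_find(&mas, ULONG_MAX);
      if (!vma)
              goto empty_set;         /* label illustrative */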

    Link: https://lkml.kernel.org/r/3011bee7-182-97a2-1083-d5f5b688e54b@google.com
    Fixes: c4c84f06285e ("fs/proc/task_mmu: stop using linked list and highest_vm_end")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:03 -04:00
Chris von Recklinghausen 2cc45e24fc mm/mmap: fix MAP_FIXED address return on VMA merge
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit a57b70519d1f7c53be98478623652738e5ac70d5
Author: Liam Howlett <liam.howlett@oracle.com>
Date:   Tue Oct 18 19:17:12 2022 +0000

    mm/mmap: fix MAP_FIXED address return on VMA merge

    mmap should return the start address of the newly mapped area when
    successful.  On a successful merge of a VMA, the return address was
    changed and was thus violating that expectation from userspace.

    This is a restoration of functionality provided by 309d08d9b3
    (mm/mmap.c: fix mmap return value when vma is merged after call_mmap()).
    For completeness of fixing MAP_FIXED, implement the comments from the
    previous discussion to never update the address, and fail if the address
    changes.  The error is left as a WARN_ON() to avoid crashing the kernel.
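
    In outline (abridged; the error value and label follow the surrounding
    mmap_region() code and are illustrative here):

      merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vm_flags,
                        /* remaining arguments elided */);
      if (merge)
              vma = merge;
      /*
       * mmap() must return the address the caller mapped; if merging moved
       * the start, warn and fail rather than return a different address.
       */
      if (WARN_ON(addr != vma->vm_start)) {
              error = -EINVAL;
              goto close_and_free_vma;
      }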

    Link: https://lkml.kernel.org/r/20221018191613.4133459-1-Liam.Howlett@oracle.com
    Link: https://lore.kernel.org/all/Y06yk66SKxlrwwfb@lakrids/
    Link: https://lore.kernel.org/all/20201203085350.22624-1-liuzixian4@huawei.com/
    Fixes: 4dd1b84140c1 ("mm/mmap: use advanced maple tree API for mmap_region()")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reported-by: Mark Rutland <mark.rutland@arm.com>
    Cc: Liu Zixian <liuzixian4@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:02 -04:00
Chris von Recklinghausen ee3e4872dd mm/mmap.c: __vma_adjust(): suppress uninitialized var warning
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 1cd916d0340d0f45b151599c24ec40b5b2fd8e4a
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Tue Oct 18 13:57:37 2022 -0700

    mm/mmap.c: __vma_adjust(): suppress uninitialized var warning

    The code is OK, but it fools gcc.

    mm/mmap.c:802 __vma_adjust() error: uninitialized symbol 'next_next'.
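
    The usual way to quiet such a false positive, and broadly what the patch
    amounts to, is an explicit initialization:

      struct vm_area_struct *next_next = NULL;  /* only set on some paths */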

    Fixes: 524e00b36e8c5 ("mm: remove rb tree.")
    Reported-by: kernel test robot <lkp@intel.com>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:02 -04:00
Chris von Recklinghausen 56aa2a93bd mm/mmap: undo ->mmap() when mas_preallocate() fails
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 5789151e48acc3fd34d2109bf2021dc4df5e33e9
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Mon Oct 17 19:49:45 2022 -0700

    mm/mmap: undo ->mmap() when mas_preallocate() fails

    A memory leak in hugetlb_reserve_pages was reported in [1].  The root
    cause was traced to an error path in mmap_region when mas_preallocate()
    fails.  In this case, the vma is freed after a successful call to
    filesystem specific mmap.  The hugetlbfs mmap routine may allocate data
    structures pointed to by vm_private_data.  These need to be cleaned up by
    the hugetlb vm_ops->close() routine.

    The same issue was addressed by commit deb0f6562884 ("mm/mmap: undo
    ->mmap() when arch_validate_flags() fails") for the arch_validate_flags()
    test.  Go to the same close_and_free_vma label if mas_preallocate() fails.
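
    In outline (abridged from mmap_region(); close_and_free_vma is the label
    introduced by the commit referenced above, which runs vm_ops->close()):

      if (mas_preallocate(&mas, GFP_KERNEL)) {
              error = -ENOMEM;
              if (file)
                      goto close_and_free_vma;        /* undo ->mmap() too */
              else
                      goto free_vma;
      }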

    [1] https://lore.kernel.org/linux-mm/CAKXUXMxf7OiCwbxib7MwfR4M1b5+b3cNTU7n5NV9Zm4967=FPQ@mail.gmail.com/

    Link: https://lkml.kernel.org/r/20221018024945.415036-1-mike.kravetz@oracle.com
    Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Andrii Nakryiko <andrii@kernel.org>
    Cc: Carlos Llamas <cmllamas@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:02 -04:00
Chris von Recklinghausen 87ff2bba80 mm/mempolicy: fix mbind_range() arguments to vma_merge()
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 7329e3ebe3594b425955ab591ecea335e85842c2
Author: Liam Howlett <liam.howlett@oracle.com>
Date:   Sat Oct 15 02:12:33 2022 +0000

    mm/mempolicy: fix mbind_range() arguments to vma_merge()

    Fuzzing produced an invalid argument to vma_merge() which was caught by
    the newly added verification of the number of VMAs being removed on
    process exit.  Analyzing the failure eventually resulted in finding an
    issue with the search of a VMA that started at address 0, which caused an
    underflow and thus the loss of many VMAs being tracked in the tree.  Fix
    the underflow by changing the search of the maple tree to use the start
    address directly.
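
    Schematically, the underflow and the fix (condensed; mbind_range() walks
    the tree with more context than shown):

      /* Buggy: looking up the previous vma by searching from start - 1
       * wraps to ULONG_MAX when the first vma starts at address 0. */
      mas_set(&mas, start - 1);
      prev = mas_find(&mas, ULONG_MAX);

      /* Fixed idea: seed the search with the start address itself. */
      mas_set(&mas, start);
      vma = mas_find(&mas, ULONG_MAX);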

    Link: https://lkml.kernel.org/r/20221015021135.2816178-1-Liam.Howlett@oracle.com
    Fixes: 66850be55e8e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reported-by: kernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/r/202210052318.5ad10912-oliver.sang@intel.com
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:01 -04:00
Chris von Recklinghausen c9c38b2760 mm/mmap: undo ->mmap() when arch_validate_flags() fails
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit deb0f6562884b5b4beb883d73e66a7d3a1b96d99
Author: Carlos Llamas <cmllamas@google.com>
Date:   Fri Sep 30 00:38:43 2022 +0000

    mm/mmap: undo ->mmap() when arch_validate_flags() fails

    Commit c462ac288f ("mm: Introduce arch_validate_flags()") added a late
    check in mmap_region() to let architectures validate vm_flags.  The check
    needs to happen after calling ->mmap() as the flags can potentially be
    modified during this callback.

    If the arch_validate_flags() check fails we unmap and free the vma.  However,
    the error path fails to undo the ->mmap() call that previously succeeded
    and, depending on the specific ->mmap() implementation, this translates to
    reference increments, memory allocations and other operations that will
    not be cleaned up.

    There are several places (mainly device drivers) where this is an issue.
    However, one specific example is bpf_map_mmap() which keeps count of the
    mappings in map->writecnt.  The count is incremented on ->mmap() and then
    decremented on vm_ops->close().  When arch_validate_flags() fails this
    count is off since bpf_map_mmap_close() is never called.

    One can reproduce this issue on arm64 devices with MTE support.  Here the
    vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set
    previously.  From userspace it is then enough to pass the PROT_MTE flag to
    the mmap() syscall to trigger the arch_validate_flags() failure.

    The following program reproduces this issue:

      #include <stdio.h>
      #include <unistd.h>
      #include <linux/unistd.h>
      #include <linux/bpf.h>
      #include <sys/mman.h>

      int main(void)
      {
            union bpf_attr attr = {
                    .map_type = BPF_MAP_TYPE_ARRAY,
                    .key_size = sizeof(int),
                    .value_size = sizeof(long long),
                    .max_entries = 256,
                    .map_flags = BPF_F_MMAPABLE,
            };
            int fd;

            fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
            mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0);

            return 0;
      }

    By manually adding some log statements to the vm_ops callbacks we can
    confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon
    ->release():

    With PROT_MTE flag:
      root@debian:~# ./bpf-test
      [  111.263874] bpf_map_write_active_inc: map=9 writecnt=1
      [  111.288763] bpf_map_release: map=9 writecnt=1

    Without PROT_MTE flag:
      root@debian:~# ./bpf-test
      [  157.816912] bpf_map_write_active_inc: map=10 writecnt=1
      [  157.830442] bpf_map_write_active_dec: map=10 writecnt=0
      [  157.832396] bpf_map_release: map=10 writecnt=0

    This patch fixes the above issue by calling vm_ops->close() when the
    arch_validate_flags() check fails, after this we can proceed to unmap and
    free the vma on the error path.
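
    The essence of the fix, condensed (the real patch routes this through a
    new close_and_free_vma label rather than calling ->close() inline):

      if (!arch_validate_flags(vma->vm_flags)) {
              error = -EINVAL;
              /* Undo what ->mmap() set up (e.g. bpf's writecnt) first. */
              if (vma->vm_ops && vma->vm_ops->close)
                      vma->vm_ops->close(vma);
              goto unmap_and_free_vma;
      }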

    Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com
    Fixes: c462ac288f ("mm: Introduce arch_validate_flags()")
    Signed-off-by: Carlos Llamas <cmllamas@google.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
    Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: <stable@vger.kernel.org>    [5.10+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:01 -04:00
Chris von Recklinghausen 99fcc27d26 mm/mmap: preallocate maple nodes for brk vma expansion
Conflicts: mm/mmap.c - We already have
	54a611b60590 ("Maple Tree: add new data structure")
	so mas_preallocate doesn't have a vma argument

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 28c5609fb236807910ca347ad3e26c4567998526
Author: Liam Howlett <liam.howlett@oracle.com>
Date:   Tue Oct 11 16:08:37 2022 +0000

    mm/mmap: preallocate maple nodes for brk vma expansion

    If the brk VMA is the last vma in a maple node and meets the rare criteria
    that it can be expanded, then preallocation is necessary to avoid a
    potential fs_reclaim circular lock issue on low resources.

    At the same time use the actual vma start address (unaligned) when calling
    vma_adjust_trans_huge().
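
    Schematically (abridged from do_brk_flags(); the error label is
    illustrative):

      /* Expanding the existing brk vma modifies the maple tree, so reserve
       * nodes up front rather than allocating during the write and risking
       * a recursion into fs_reclaim under low memory. */
      if (mas_preallocate(&mas, GFP_KERNEL))
              goto unacct_fail;

      /* Use the vma's real (unaligned) start address here. */
      vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);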

    Link: https://lkml.kernel.org/r/20221011160624.1253454-1-Liam.Howlett@oracle.com
    Fixes: 2e7ce7d354f2 (mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap())
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Reported-by: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:01 -04:00
Chris von Recklinghausen ee5e4721d2 mm: more vma cache removal
Conflicts: include/linux/sched.h - The backport of
	f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
	detected the lack of this patch as a conflict.

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 7be1c1a3c7b13fb259bb5159662a7b83622013b8
Author: Alexey Dobriyan <adobriyan@gmail.com>
Date:   Tue Oct 11 20:55:31 2022 +0300

    mm: more vma cache removal

    Link: https://lkml.kernel.org/r/Y0WuE3Riv4iy5Jx8@localhost.localdomain
    Fixes: 7964cf8caa4d ("mm: remove vmacache")
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Acked-by: Liam Howlett <liam.howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:20:00 -04:00