Commit Graph

332 Commits

Chris von Recklinghausen 6573caddd5 mm/gup: migrate device coherent pages when pinning instead of failing
Bugzilla: https://bugzilla.redhat.com/2160210

commit b05a79d4377f6dcc30683008ffd1c531ea965393
Author: Alistair Popple <apopple@nvidia.com>
Date:   Fri Jul 15 10:05:13 2022 -0500

    mm/gup: migrate device coherent pages when pinning instead of failing

    Currently any attempts to pin a device coherent page will fail.  This is
    because device coherent pages need to be managed by a device driver, and
    pinning them would prevent a driver from migrating them off the device.

    However, this is no reason to fail pinning of these pages.  They are
    coherent and accessible from the CPU, so they can be migrated just like
    pinned ZONE_MOVABLE pages.  So instead of failing all attempts to pin
    them, first try migrating them out of ZONE_DEVICE.
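
    As an illustration of the new behaviour (an editorial sketch; the helper
    and its arguments are hypothetical, not taken from the patch), a driver
    that long-term pins a user buffer backed by device coherent pages now
    gets migrated system-memory pages back instead of an error:

        #include <linux/mm.h>

        /* Hypothetical helper: long-term pin an npages-long user buffer. */
        static int pin_user_buffer(unsigned long user_addr, int npages,
                                   struct page **pages)
        {
                /*
                 * If any covered page is MEMORY_DEVICE_COHERENT, GUP now
                 * migrates it out of ZONE_DEVICE and pins the migrated
                 * system-memory copy instead of failing the call.
                 */
                return pin_user_pages_fast(user_addr, npages,
                                           FOLL_LONGTERM | FOLL_WRITE, pages);
        }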

    [hch@lst.de: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c]
    Link: https://lkml.kernel.org/r/20220715150521.18165-7-alex.sierra@amd.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:23 -04:00
Chris von Recklinghausen 94b8e0ebc1 mm: rename is_pinnable_page() to is_longterm_pinnable_page()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6077c943beee407168f72ece745b0aeaef6b896f
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:08 2022 -0500

    mm: rename is_pinnable_page() to is_longterm_pinnable_page()

    Patch series "Add MEMORY_DEVICE_COHERENT for coherent device memory
    mapping", v9.

    This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
    owned by a device that can be mapped into CPU page tables like
    MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.

    This patch series is mostly self-contained except for a few places where
    it needs to update other subsystems to handle the new memory type.

    System stability and performance are not affected according to our ongoing
    testing, including xfstests.

    How it works: The system BIOS advertises the GPU device memory (aka VRAM)
    as SPM (special purpose memory) in the UEFI system address map.

    The amdgpu driver registers the memory with devmap as
    MEMORY_DEVICE_COHERENT using devm_memremap_pages.  The initial user for
    this hardware page migration capability is the Frontier supercomputer
    project.  This functionality is not AMD-specific.  We expect other GPU
    vendors to find this functionality useful, and possibly other hardware
    types in the future.

    Our test nodes in the lab are similar to the Frontier configuration, with
    0.5 TB of system memory plus 256 GB of device memory split across 4 GPUs,
    all in a single coherent address space.  Page migration is expected to
    improve application efficiency significantly.  We will report empirical
    results as they become available.

    Coherent device type pages at gup are now migrated back to system memory
    if they are being pinned long-term (FOLL_LONGTERM).  The reason is that
    long-term pinning would interfere with the device memory manager owning
    the device-coherent pages (e.g. evictions in TTM).  This series
    incorporates Alistair Popple's patches to do this migration from
    pin_user_pages() calls.  hmm_gup_test has been added to hmm-test to
    exercise the different get_user_pages() call variants.

    This series includes handling of device-managed anonymous pages returned
    by vm_normal_pages.  Although they behave like normal pages for purposes
    of mapping in CPU page tables and for COW, they do not support LRU lists,
    NUMA migration or THP.

    We also introduced a FOLL_LRU flag that adds the same behaviour to
    follow_page and related APIs, to allow callers to specify that they expect
    to put pages on an LRU list.

    This patch (of 14):

    is_pinnable_page() and folio_is_pinnable() are renamed to
    is_longterm_pinnable_page() and folio_is_longterm_pinnable() respectively.
    These functions are used in the FOLL_LONGTERM flag context.
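
    The rename, shown as prototypes (a sketch; the function bodies are not
    changed by this patch):

        /* Before */
        bool is_pinnable_page(struct page *page);
        bool folio_is_pinnable(struct folio *folio);

        /* After: "longterm" reflects that these back FOLL_LONGTERM checks */
        bool is_longterm_pinnable_page(struct page *page);
        bool folio_is_longterm_pinnable(struct folio *folio);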

    Link: https://lkml.kernel.org/r/20220715150521.18165-1-alex.sierra@amd.com
    Link: https://lkml.kernel.org/r/20220715150521.18165-2-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:22 -04:00
Chris von Recklinghausen 975e215a96 mm/gup: fix comments to pin_user_pages_*()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 0768c8de1b74b0f177d5f16c00b1459e92837d26
Author: Yury Norov <yury.norov@gmail.com>
Date:   Mon May 9 18:20:47 2022 -0700

    mm/gup: fix comments to pin_user_pages_*()

    pin_user_pages API forces FOLL_PIN in gup_flags, which means that the API
    requires struct page **pages to be provided (not NULL).  However, the
    comment to pin_user_pages() clearly allows for passing in a NULL @pages
    argument.

    Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to
    enforce the API.
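
    The shape of the added guard, sketched for one of the entry points (the
    exact error value returned is an assumption here, not quoted from the
    patch):

        int pin_user_pages_fast(unsigned long start, int nr_pages,
                                unsigned int gup_flags, struct page **pages)
        {
                /* FOLL_PIN is implied, so a pages array must be supplied. */
                if (WARN_ON_ONCE(!pages))
                        return -EINVAL;   /* assumed errno for this sketch */

                /* ... existing fast-GUP path continues unchanged ... */
        }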

    It has been independently spotted by Minchan Kim and confirmed with
    John Hubbard:

    https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/

    Link: https://lkml.kernel.org/r/20220422015839.1274328-1-yury.norov@gmail.com
    Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:00 -04:00
Chris von Recklinghausen 19aa82ea5c mm: Add fault_in_subpage_writeable() to probe at sub-page granularity
Bugzilla: https://bugzilla.redhat.com/2160210

commit da32b5817253697671af961715517bfbb308a592
Author: Catalin Marinas <catalin.marinas@arm.com>
Date:   Sat Apr 23 11:07:49 2022 +0100

    mm: Add fault_in_subpage_writeable() to probe at sub-page granularity

    On hardware with features like arm64 MTE or SPARC ADI, an access fault
    can be triggered at sub-page granularity. Depending on how the
    fault_in_writeable() function is used, the caller can get into a
    live-lock by continuously retrying the fault-in on an address different
    from the one where the uaccess failed.

    In the majority of cases progress is ensured by the following
    conditions:

    1. copy_to_user_nofault() guarantees at least one byte access if the
       user address is not faulting.

    2. The fault_in_writeable() loop is resumed from the first address that
       could not be accessed by copy_to_user_nofault().

    If the loop iteration is restarted from an earlier (initial) point, the
    loop is repeated with the same conditions and it would live-lock.

    Introduce an arch-specific probe_subpage_writeable() and call it from
    the newly added fault_in_subpage_writeable() function. The arch code
    with sub-page faults will have to implement the specific probing
    functionality.
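
    A sketch of the caller-side pattern this enables (illustrative; uaddr and
    size are placeholders and the surrounding retry loop is elided):

        /*
         * The previous uaccess attempt faulted.  Probe at sub-page
         * granularity: as with fault_in_writeable(), the return value is
         * the number of bytes that could not be faulted in, so a non-zero
         * result (e.g. an MTE or ADI tag fault) means no forward progress
         * is possible and we must bail out instead of retrying forever.
         */
        if (fault_in_subpage_writeable(uaddr, size))
                return -EFAULT;

        /* otherwise retry the uaccess that previously failed */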

    Note that no other fault_in_subpage_*() functions are added since they
    have no callers currently susceptible to a live-lock.

    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Link: https://lore.kernel.org/r/20220423100751.1870771-2-catalin.marinas@arm.com
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:51 -04:00
Chris von Recklinghausen c48611f141 mm: avoid unnecessary page fault retires on shared memory types
Bugzilla: https://bugzilla.redhat.com/2160210

commit d92725256b4f22d084b813b37ddc394da79aacab
Author: Peter Xu <peterx@redhat.com>
Date:   Mon May 30 14:34:50 2022 -0400

    mm: avoid unnecessary page fault retires on shared memory types

    I observed that for shared file-backed page faults, we're very likely to
    retry one more time for the first write fault when no page exists yet.
    That's because we need to release the mmap lock for dirty rate limiting
    with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

    Then after that throttling we return VM_FAULT_RETRY.

    We did that probably because VM_FAULT_RETRY is the only way we can return
    to the fault handler at that time telling it we've released the mmap lock.

    However that's not ideal, because it's very likely the fault does not need
    to be retried at all: the pgtable was already installed before the
    throttling, so the follow-up fault (including taking the mmap read lock,
    walking the pgtable, etc.) is in most cases unnecessary.

    It not only slows down page faults for shared file-backed mappings, but
    also adds more mmap lock contention, which is in most cases not needed at
    all.

    To observe this, one could try to write to some shmem page and look at
    "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
    shmem write simply because we retried, and vm event "pgfault" will capture
    that.

    To make this more efficient, add a new VM_FAULT_COMPLETED return code to
    show that we've completed the whole fault and released the lock.  It's
    also a hint that we very likely won't need another fault immediately on
    this page, because we've just completed it.
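
    The per-architecture fault handler pattern this enables looks roughly
    like the following (a sketch of the generic shape, not a quoted hunk):

        fault = handle_mm_fault(vma, address, flags, regs);

        /*
         * The fault was fully handled and the mmap lock was already
         * released by the core code: nothing left to do, no retry needed.
         */
        if (fault & VM_FAULT_COMPLETED)
                return;

        if (fault & VM_FAULT_RETRY) {
                flags |= FAULT_FLAG_TRIED;
                goto retry;
        }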

    This patch provides a ~12% perf boost on my aarch64 test VM with a simple
    program sequentially dirtying a 400MB mmap()ed shmem file; these are the
    times it needs:

      Before: 650.980 ms (+-1.94%)
      After:  569.396 ms (+-1.38%)

    I believe it could help more than that.

    We need some special care for GUP and the s390 pgfault handler (for gmap
    code before returning from pgfault); the rest of the changes in the page
    fault handlers should be relatively straightforward.

    Another thing to mention is that mm_account_fault() does take this new
    fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

    I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
    not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
    them as-is.

    Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Vineet Gupta <vgupta@kernel.org>
    Acked-by: Guo Ren <guoren@kernel.org>
    Acked-by: Max Filippov <jcmvbkbc@gmail.com>
    Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>    [arm part]
    Acked-by: Heiko Carstens <hca@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Stafford Horne <shorne@gmail.com>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Johannes Berg <johannes@sipsolutions.net>
    Cc: Brian Cain <bcain@quicinc.com>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Janosch Frank <frankja@linux.ibm.com>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Jonas Bonn <jonas@southpole.se>
    Cc: Will Deacon <will@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Michal Simek <monstr@monstr.eu>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Chris Zankel <chris@zankel.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dinh Nguyen <dinguyen@kernel.org>
    Cc: Rich Felker <dalias@libc.org>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Yoshinori Sato <ysato@users.osdn.me>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:32 -04:00
Nico Pache 0cba48960b mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page
commit fac35ba763ed07ba93154c95ffc0c4a55023707f
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Thu Sep 1 18:41:31 2022 +0800

    mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page

    Some architectures (like ARM64) support CONT-PTE/PMD size hugetlb, which
    means they support not only PMD/PUD size hugetlb pages (2M and 1G), but
    also CONT-PTE/PMD sizes (64K and 32M) if a 4K page size is specified.

    So when looking up a CONT-PTE size hugetlb page by follow_page(), it will
    use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size
    hugetlb in follow_page_pte().  However this pte entry lock is incorrect
    for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get
    the correct lock, which is mm->page_table_lock.

    That means the pte entry of the CONT-PTE size hugetlb is unstable under
    the pte lock taken in follow_page_pte(): the pte entry can still be
    migrated or poisoned concurrently, which can cause potential race issues
    even though the lookup is nominally under the 'pte lock'.

    For example, suppose thread A is trying to look up a CONT-PTE size hugetlb
    page via the move_pages() syscall under the lock, while another thread B
    migrates the CONT-PTE hugetlb page at the same time.  This causes thread A
    to get an incorrect page; if thread A then also does page migration, a
    data inconsistency error occurs.

    Moreover we have the same issue for CONT-PMD size hugetlb in
    follow_huge_pmd().

    To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte()
    to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to
    get the correct pte entry lock to make the pte entry stable.
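
    The core of the renamed helper, sketched (the function name follows the
    commit message; migration-entry handling and retries are elided):

        struct page *follow_huge_pmd_pte(struct vm_area_struct *vma,
                                         unsigned long address, int flags)
        {
                struct hstate *h = hstate_vma(vma);
                struct mm_struct *mm = vma->vm_mm;
                struct page *page = NULL;
                spinlock_t *ptl;
                pte_t *ptep, pte;

                ptep = huge_pte_offset(mm, address, huge_page_size(h));
                if (!ptep)
                        return NULL;

                /*
                 * huge_pte_lock() picks the lock hugetlb actually uses for
                 * this size (mm->page_table_lock for CONT-PTE/PMD), so the
                 * entry cannot be migrated or poisoned underneath us.
                 */
                ptl = huge_pte_lock(h, mm, ptep);
                pte = huge_ptep_get(ptep);
                if (pte_present(pte))
                        page = pte_page(pte) +
                               ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
                /* ... migration-entry handling elided ... */
                spin_unlock(ptl);
                return page;
        }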

    Mike said:

    Support for CONT_PMD/_PTE was added with bb9dd3df8e ("arm64: hugetlb:
    refactor find_num_contig()"), in the patch series "Support for contiguous
    pte hugepages", v4.  However, I do not believe these code paths were
    executed until migration support was added with 5480280d3f ("arm64/mm:
    enable HugeTLB migration for contiguous bit HugeTLB pages"), so I would
    go with 5480280d3f for the Fixes: tag.

    Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.1662017562.git.baolin.wang@linux.alibaba.com
    Fixes: 5480280d3f ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages")
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:44 -07:00
Nico Pache 737333f123 mm: gup: fix the fast GUP race against THP collapse
commit 70cbc3cc78a997d8247b50389d37c4e1736019da
Author: Yang Shi <shy828301@gmail.com>
Date:   Wed Sep 7 11:01:43 2022 -0700

    mm: gup: fix the fast GUP race against THP collapse

    Since general RCU GUP fast was introduced in commit 2667f50e8b ("mm:
    introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
    sufficient to handle concurrent GUP-fast in all cases; it only handles
    traditional IPI-based GUP-fast correctly.  On architectures that send an
    IPI broadcast on TLB flush, it works as expected.  But on the
    architectures that do not use IPI to broadcast TLB flush, it may have the
    below race:

       CPU A                                          CPU B
    THP collapse                                     fast GUP
                                                  gup_pmd_range() <-- see valid pmd
                                                      gup_pte_range() <-- work on pte
    pmdp_collapse_flush() <-- clear pmd and flush
    __collapse_huge_page_isolate()
        check page pinned <-- before GUP bump refcount
                                                          pin the page
                                                          check PTE <-- no change
    __collapse_huge_page_copy()
        copy data to huge page
        ptep_clear()
    install huge pmd for the huge page
                                                          return the stale page
    discard the stale page

    The race can be fixed by checking whether PMD is changed or not after
    taking the page pin in fast GUP, just like what it does for PTE.  If the
    PMD is changed it means there may be parallel THP collapse, so GUP should
    back off.
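
    Sketched in gup_pte_range() terms (an approximation of the hunk; the PTE
    recheck already existed, the PMD recheck is the new part):

        /*
         * After taking the pin, recheck both the PMD and the PTE.  If the
         * PMD changed, a THP collapse may be racing with us on an
         * architecture that does not broadcast TLB flushes by IPI, so
         * back off and release the pin.
         */
        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
            unlikely(pte_val(pte) != pte_val(*ptep))) {
                gup_put_folio(folio, 1, flags);
                goto pte_unmap;
        }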

    Also update the stale comment about serializing against fast GUP in
    khugepaged.

    Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com
    Fixes: 2667f50e8b ("mm: introduce a general RCU get_user_pages_fast()")
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Peter Xu <peterx@redhat.com>
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:42 -07:00
Nico Pache 015651ad44 mm: fix missing wake-up event for FSDAX pages
commit f4f451a16dd1f478fdb966bcbb612c1e4ce6b962
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Jul 5 20:35:32 2022 +0800

    mm: fix missing wake-up event for FSDAX pages

    FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
    1, then the page is freed.  FSDAX pages can be pinned through GUP and are
    then unpinned via unpin_user_page(), which uses a folio variant to put
    the page.  However, the folio variants did not consider this special
    case, so a wake-up event is missed (e.g. for the user of
    __fuse_dax_break_layouts()).  This results in a task being permanently
    stuck in TASK_INTERRUPTIBLE state.

    Since FSDAX pages can only be obtained by GUP users, fix GUP instead of
    folio_put() to lower the overhead.
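
    The shape of the fix in the unpin path, condensed (a sketch: the
    refs-aware devmap helper reflects a reading of the patch and its exact
    name may differ):

        /* Tail of gup_put_folio(); FOLL_PIN accounting above is elided. */
        if (!put_devmap_managed_page_refs(&folio->page, refs))
                folio_put_refs(folio, refs);
        /*
         * For an FSDAX page the devmap path sees the refcount drop back to
         * 1 and wakes the waiter (e.g. __fuse_dax_break_layouts()), which
         * a plain folio_put_refs() would not do.
         */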

    Link: https://lkml.kernel.org/r/20220705123532.283-1-songmuchun@bytedance.com
    Fixes: d8ddc099c6b3 ("mm/gup: Add gup_put_folio()")
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: William Kucharski <william.kucharski@oracle.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:40 -07:00
Nico Pache aa28eb0c17 mm/migration: return errno when isolate_huge_page failed
commit 7ce82f4c3f3ead13a9d9498768e3b1a79975c4d8
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon May 30 19:30:15 2022 +0800

    mm/migration: return errno when isolate_huge_page failed

    We might fail to isolate a huge page because, e.g., the page is under
    migration, which clears HPageMigratable.  We should return an errno in
    this case rather than always returning 1, which could confuse the user,
    i.e. the caller might think all of the memory was migrated while the
    hugetlb page was left behind.  Make the prototype of isolate_huge_page
    consistent with isolate_lru_page, as suggested by Huang Ying, and rename
    isolate_huge_page to isolate_hugetlb, as suggested by Muchun, to improve
    readability.
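
    The interface change, sketched as prototypes (the errno-style return
    follows the isolate_lru_page() convention referred to above):

        /* Before: boolean success/failure. */
        bool isolate_huge_page(struct page *page, struct list_head *list);

        /* After: 0 on success, -EBUSY if the page could not be isolated. */
        int isolate_hugetlb(struct page *page, struct list_head *list);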

    Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
    Fixes: e8db67eb0d ("mm: migrate: move_pages() supports thp migration")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Suggested-by: Huang Ying <ying.huang@intel.com>
    Reported-by: kernel test robot <lkp@intel.com> (build error)
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
Chris von Recklinghausen eed1e135d7 mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning
Bugzilla: https://bugzilla.redhat.com/2120352

commit b6a2619c60b41a929bbb9c09f193d690d707b1af
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:45 2022 -0700

    mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning

    Let's verify when (un)pinning anonymous pages that we always deal with
    exclusive anonymous pages, which guarantees that we'll have a reliable
    PIN, meaning that we cannot end up with the GUP pin being inconsistent
    with the page mapped into the page tables due to a COW triggered by a
    write fault.

    When pinning pages, after conditionally triggering GUP unsharing of
    possibly shared anonymous pages, we should always only see exclusive
    anonymous pages.  Note that anonymous pages that are mapped writable must
    be marked exclusive, otherwise we'd have a BUG.

    When pinning during ordinary GUP, simply add a check after our conditional
    GUP-triggered unsharing checks.  As we know exactly how the page is
    mapped, we know exactly in which page we have to check for
    PageAnonExclusive().

    When pinning via GUP-fast we have to be careful, because we can race with
    fork(): verify only after we made sure via the seqcount that we didn't
    race with concurrent fork() that we didn't end up pinning a possibly
    shared anonymous page.

    Similarly, when unpinning, verify that the pages are still marked as
    exclusive: otherwise something turned the pages possibly shared, which can
    result in random memory corruptions, which we really want to catch.

    With only the pinned pages at hand and not the actual page table entries
    we have to be a bit careful: hugetlb pages are always mapped via a single
    logical page table entry referencing the head page and PG_anon_exclusive
    of the head page applies.  Anon THP are a bit more complicated, because we
    might have obtained the page reference either via a PMD or a PTE --
    depending on the mapping type we either have to check PageAnonExclusive of
    the head page (PMD-mapped THP) or the tail page (PTE-mapped THP) applies:
    as we don't know and to make our life easier, check that either is set.

    Take care to not verify in case we're unpinning during GUP-fast because we
    detected concurrent fork(): we might stumble over an anonymous page that
    is now shared.
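
    A condensed sketch of the debug check described above (helper name and
    details are approximate reconstructions, not a quoted hunk):

        static void sanity_check_pinned_pages(struct page **pages,
                                              unsigned long npages)
        {
                if (!IS_ENABLED(CONFIG_DEBUG_VM))
                        return;

                for (; npages; npages--, pages++) {
                        struct page *page = *pages;
                        struct folio *folio = page_folio(page);

                        if (!folio_test_anon(folio))
                                continue;
                        if (!folio_test_large(folio) ||
                            folio_test_hugetlb(folio))
                                VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page),
                                               page);
                        else
                                /* PTE- or PMD-mapped THP: either may apply. */
                                VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) &&
                                               !PageAnonExclusive(page), page);
                }
        }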

    Link: https://lkml.kernel.org/r/20220428083441.37290-18-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 5160dd7755 mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page
Bugzilla: https://bugzilla.redhat.com/2120352

commit a7f226604170acd6b142b76472c1a49c12ebb83d
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:45 2022 -0700

    mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page

    Whenever GUP currently ends up taking a R/O pin on an anonymous page that
    might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
    on the page table entry will end up replacing the mapped anonymous page
    due to COW, resulting in the GUP pin no longer being consistent with the
    page actually mapped into the page table.

    The possible ways to deal with this situation are:
     (1) Ignore and pin -- what we do right now.
     (2) Fail to pin -- which would be rather surprising to callers and
         could break user space.
     (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
         pins.

    Let's implement 3) because it provides the clearest semantics and allows
    for checking in unpin_user_pages() and friends for possible BUGs: when
    trying to unpin a page that's no longer exclusive, clearly something went
    very wrong and might result in memory corruptions that might be hard to
    debug.  So we better have a nice way to spot such issues.

    This change implies that whenever user space *wrote* to a private mapping
    (IOW, we have an anonymous page mapped), that GUP pins will always remain
    consistent: reliable R/O GUP pins of anonymous pages.
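
    Sketched at the follow_page_pte() level (an approximation; the helper
    name and the -EMLINK error value are recollections of the series and
    should be treated as assumptions):

        /*
         * Taking a FOLL_PIN on an anon page that is mapped R/O and not
         * marked exclusive?  Fail the lookup so the caller retries the
         * fault with FAULT_FLAG_UNSHARE, which replaces the possibly
         * shared page with an exclusive one we can pin reliably.
         */
        if (!pte_write(pte) && gup_must_unshare(flags, page)) {
                page = ERR_PTR(-EMLINK);
                goto out;
        }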

    As a side note, this commit fixes the COW security issue for hugetlb with
    FOLL_PIN as documented in:
      https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
    The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
    instead of FOLL_PIN.

    Note that follow_huge_pmd() doesn't apply because we cannot end up in
    there with FOLL_PIN.

    This commit is heavily based on prototype patches by Andrea.

    Link: https://lkml.kernel.org/r/20220428083441.37290-17-david@redhat.com
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 35ed883ed5 mm/gup: disallow follow_page(FOLL_PIN)
Bugzilla: https://bugzilla.redhat.com/2120352

commit 8909691b6c5a84b67573b23ee8bb917b005628f0
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:44 2022 -0700

    mm/gup: disallow follow_page(FOLL_PIN)

    We want to change the way we handle R/O pins on anonymous pages that might
    be shared: if we detect a possibly shared anonymous page -- mapped R/O and
    !PageAnonExclusive() -- we want to trigger unsharing via a page fault,
    resulting in an exclusive anonymous page that can be pinned reliably
    without getting replaced via COW on the next write fault.

    However, the required page fault will be problematic for follow_page(): in
    contrast to ordinary GUP, follow_page() doesn't trigger faults internally.
    So we would have to end up failing a R/O pin via follow_page(), although
    there is something mapped R/O into the page table, which might be rather
    surprising.

    We don't seem to have follow_page(FOLL_PIN) users, and it's a purely
    internal MM function.  Let's just make our life easier and the semantics
    of follow_page() clearer by just disallowing FOLL_PIN for follow_page()
    completely.
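
    The guard itself is tiny; sketched at the top of follow_page():

        struct page *follow_page(struct vm_area_struct *vma,
                                 unsigned long address,
                                 unsigned int foll_flags)
        {
                /* follow_page() never faults, so it cannot unshare. */
                if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
                        return NULL;

                /* ... rest of follow_page() unchanged ... */
        }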

    Link: https://lkml.kernel.org/r/20220428083441.37290-15-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen 724a2258f0 mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5535be3099717646781ce1540cf725965d680e7b
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Aug 9 22:56:40 2022 +0200

    mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW

    Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
    that FOLL_FORCE can be possibly dangerous, especially if there are races
    that can be exploited by user space.

    Right now, it would be sufficient to have some code that sets a PTE of a
    R/O-mapped shared page dirty, in order for it to erroneously become
    writable by FOLL_FORCE.  The implications of setting a write-protected PTE
    dirty might not be immediately obvious to everyone.

    And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set
    pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
    a shmem page R/O while marking the pte dirty.  This can be used by
    unprivileged user space to modify tmpfs/shmem file content even if the
    user does not have write permissions to the file, and to bypass memfd
    write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).

    To fix such security issues for good, the insight is that we really only
    need that fancy retry logic (FOLL_COW) for COW mappings that are not
    writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
    we have an exclusive anonymous page mapped.  If we have something else
    mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
    we have to trigger a write fault to break COW.  If we don't find an
    exclusive anonymous page when we retry, we have to trigger COW breaking
    once again because something intervened.

    Let's move away from this mandatory-retry + dirty handling and rely on our
    PageAnonExclusive() flag for making a similar decision, to use the same
    COW logic as in other kernel parts here as well.  In case we stumble over
    a PTE in a COW mapping that does not map an exclusive anonymous page, COW
    was not properly broken and we have to trigger a fake write-fault to break
    COW.

    Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e
    ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
    when changing protection") and commit 76aefad628aa ("mm/mprotect: fix
    soft-dirty check in can_change_pte_writable()"), take care of softdirty
    and uffd-wp manually.
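
    The resulting FOLL_FORCE decision, condensed into one helper (a sketch
    that follows the description above; the exact ordering and wording of
    the checks may differ from the patch):

        static inline bool can_follow_write_pte(pte_t pte, struct page *page,
                                                struct vm_area_struct *vma,
                                                unsigned int flags)
        {
                if (pte_write(pte))
                        return true;
                if (!(flags & FOLL_FORCE))
                        return false;

                /* FOLL_FORCE COW override only applies to COW mappings ... */
                if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
                        return false;
                if (!(vma->vm_flags & VM_MAYWRITE) ||
                    (vma->vm_flags & VM_WRITE))
                        return false;

                /* ... and only once COW was properly broken. */
                if (!page || !PageAnon(page) || !PageAnonExclusive(page))
                        return false;

                /* A write fault is still required for softdirty / uffd-wp. */
                if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
                        return false;
                return !userfaultfd_pte_wp(vma, pte);
        }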

    For example, a write() via /proc/self/mem to a uffd-wp-protected range has
    to fail instead of silently granting write access and bypassing the
    userspace fault handler.  Note that FOLL_FORCE is not only used for debug
    access, but also triggered by applications without debug intentions, for
    example, when pinning pages via RDMA.

    This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
    affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.

    Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
    let's just get rid of it.

    Thanks to Nadav Amit for pointing out that the pte_dirty() check in
    FOLL_FORCE code is problematic and might be exploitable.

    Note 1: We don't check for the PTE being dirty because it doesn't matter
            for making a "was COWed" decision anymore, and whoever modifies the
            page has to set the page dirty either way.

    Note 2: Kernels before extended uffd-wp support and before
            PageAnonExclusive (< 5.19) can simply revert the problematic
            commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
            v5.19 requires minor adjustments due to lack of
            vma_soft_dirty_enabled().

    Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
    Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: David Laight <David.Laight@ACULAB.COM>
    Cc: <stable@vger.kernel.org>    [5.16]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen 78938e28df mm/gup: remove unused get_user_pages_locked()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 73fd16d8080f7b1537ba7aa29917f64d6fffa664
Author: John Hubbard <jhubbard@nvidia.com>
Date:   Tue Mar 22 14:39:50 2022 -0700

    mm/gup: remove unused get_user_pages_locked()

    Now that the last caller of get_user_pages_locked() is gone, remove it.

    Link: https://lkml.kernel.org/r/20220204020010.68930-6-jhubbard@nvidia.com
    Signed-off-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:50 -04:00
Chris von Recklinghausen 3ce2dcc083 mm/gup: remove unused pin_user_pages_locked()
Bugzilla: https://bugzilla.redhat.com/2120352

commit ad6c441266dcd50be080a47e1178a1b15369923c
Author: John Hubbard <jhubbard@nvidia.com>
Date:   Tue Mar 22 14:39:43 2022 -0700

    mm/gup: remove unused pin_user_pages_locked()

    This routine was used for a short while, but then the calling code was
    refactored and the only caller was removed.

    Link: https://lkml.kernel.org/r/20220204020010.68930-4-jhubbard@nvidia.com
    Signed-off-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:49 -04:00
Chris von Recklinghausen 0070af694f mm/gup: follow_pfn_pte(): -EEXIST cleanup
Bugzilla: https://bugzilla.redhat.com/2120352

commit 65462462ffb28fddf13d46c628c4fc55878ab397
Author: John Hubbard <jhubbard@nvidia.com>
Date:   Tue Mar 22 14:39:40 2022 -0700

    mm/gup: follow_pfn_pte(): -EEXIST cleanup

    Remove a quirky special case from follow_pfn_pte(), and adjust its
    callers to match.  Caller changes include:

    __get_user_pages(): Regardless of any FOLL_* flags, get_user_pages() and
    its variants should handle PFN-only entries by stopping early, if the
    caller expected **pages to be filled in.  This makes for a more reliable
    API, as compared to the previous approach of skipping over such entries
    (and thus leaving them silently unwritten).
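
    The __get_user_pages() side of this, sketched (not a quoted hunk; the
    comment paraphrases the rationale above):

        page = follow_page_mask(vma, start, foll_flags, &ctx);
        if (PTR_ERR(page) == -EEXIST) {
                /*
                 * A proper page table entry exists, but there is no struct
                 * page to return.  If the caller wants the pages array
                 * filled in, stop early instead of silently skipping the
                 * entry.
                 */
                if (pages) {
                        ret = PTR_ERR(page);
                        goto out;
                }
                goto next_page;
        }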

    move_pages(): squash the -EEXIST error return from follow_page() into
    -EFAULT, because -EFAULT is listed in the man page, whereas -EEXIST is
    not.

    Link: https://lkml.kernel.org/r/20220204020010.68930-3-jhubbard@nvidia.com
    Signed-off-by: John Hubbard <jhubbard@nvidia.com>
    Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:49 -04:00
Chris von Recklinghausen 70f0d8139c mm/gup.c: stricter check on THP migration entry during follow_pmd_mask
Bugzilla: https://bugzilla.redhat.com/2120352

commit 28b0ee3fb35047bd2bac57cc5a051b26bbd9b194
Author: Li Xinhai <lixinhai.lxh@gmail.com>
Date:   Fri Jan 14 14:05:16 2022 -0800

    mm/gup.c: stricter check on THP migration entry during follow_pmd_mask

    When the BUG_ON checks for a THP migration entry, the existing code only
    checks the thp_migration_supported() case, but not the
    !thp_migration_supported() case.  If !thp_migration_supported() and
    !pmd_present(), the original code may loop forever in theory.  To make
    the BUG_ON check consistent, we need to catch both cases.

    Move the BUG_ON check one step earlier, because if the bug happens we
    should know about it instead of depending on FOLL_MIGRATION being used
    by the caller.

    Because pmdval instead of *pmd is read by the is_pmd_migration_entry()
    check, the existing code doesn't help to avoid useless locking within
    pmd_migration_entry_wait(), so remove that check.
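
    The reordered check, sketched inside follow_pmd_mask() (an approximation
    of the hunk described above):

        pmd_t pmdval = READ_ONCE(*pmd);

        if (!pmd_present(pmdval)) {
                /*
                 * A non-present PMD here must be a THP migration entry,
                 * which also requires thp_migration_supported(); catch
                 * both cases before deciding how to proceed.
                 */
                VM_BUG_ON(!thp_migration_supported() ||
                          !is_pmd_migration_entry(pmdval));
                if (likely(!(flags & FOLL_MIGRATION)))
                        return no_page_table(vma, flags);
                pmd_migration_entry_wait(mm, pmd);
                /* reload the entry and retry */
        }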

    Link: https://lkml.kernel.org/r/20211217062559.737063-1-lixinhai.lxh@gmail.com
    Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
Chris von Recklinghausen 37537d266c gup: avoid multiple user access locking/unlocking in fault_in_{read/write}able
Bugzilla: https://bugzilla.redhat.com/2120352

commit 677b2a8c1f25db5b09c1ef5bf72faa39ea81d9cf
Author: Christophe Leroy <christophe.leroy@csgroup.eu>
Date:   Fri Jan 14 14:05:13 2022 -0800

    gup: avoid multiple user access locking/unlocking in fault_in_{read/write}able

    fault_in_readable() and fault_in_writeable() perform __get_user() and
    __put_user() in a loop, implying multiple user access locking/unlocking.

    To avoid that, use user access blocks.
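
    Roughly, the rewritten fault_in_writeable() then looks like this (a
    sketch reconstructed from the description; the page-stepping details are
    approximate):

        size_t fault_in_writeable(char __user *uaddr, size_t size)
        {
                char __user *start = uaddr, *end;

                if (unlikely(size == 0))
                        return 0;
                /* One user-access begin/end pair for the whole range. */
                if (!user_write_access_begin(uaddr, size))
                        return size;

                if (!PAGE_ALIGNED(uaddr)) {
                        unsafe_put_user(0, uaddr, out);
                        uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
                }
                end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
                if (unlikely(end < start))
                        end = NULL;
                while (uaddr != end) {
                        unsafe_put_user(0, uaddr, out);
                        uaddr += PAGE_SIZE;
                }
        out:
                user_write_access_end();
                if (size > uaddr - start)
                        return size - (uaddr - start);
                return 0;
        }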

    Link: https://lkml.kernel.org/r/720dcf79314acca1a78fae56d478cc851952149d.1637084492.git.christophe.leroy@csgroup.eu
    Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
    Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
Chris von Recklinghausen 5ea8323cdb Revert "mm: gup: gup_must_unshare()"
Conflicts: mm/gup.c - The backports of
	78d9d6ced31a ("mm/gup: Remove hpage_pincount_add()")
	6315d8a23ce3 ("mm/gup: Remove hpage_pincount_sub()")
	and
	d8ddc099c6b3 ("mm/gup: Add gup_put_folio()")
	were added after 97bfb74f01. d8ddc099c6b3 removed the definition of
	put_page_refs. Don't add it, hpage_pincount_add or
	hpage_pincount_sub back in.

Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

This reverts commit 97bfb74f01.
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:01 -04:00
Chris von Recklinghausen 5c12106b51 Revert "mm: gup: FOLL_UNSHARE"
Conflicts: mm/gup.c - The VM_BUG_ON_PAGE line was deleted by the backport of
	b0496fe4effd ("mm/gup: Convert gup_pte_range() to use a folio")
	(and this removal is also done in the upstream version of the patch).
	Don't add it back in.
	The backport of b0496fe4effd also changed the put_compound_head call
	to a gup_put_folio call, which causes the second merge conflict. Just
	remove it.

Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

This reverts commit 4315efd376.
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:01 -04:00
Chris von Recklinghausen 98e81ebff3 Revert "mm: gup: FOLL_NOUNSHARE: optimize follow_page"
Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

This reverts commit 0de0218375.
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:01 -04:00
Chris von Recklinghausen 9e6a94d9f5 Revert "mm: hugetlbfs: gup: gup_must_unshare(): enable hugetlbfs"
Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

This reverts commit 72c8a13ba5.
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:00 -04:00
Chris von Recklinghausen fd81b53da4 Revert "mm: gup: gup_must_unshare() use can_read_pin_swap_page()"
Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

This reverts commit 32fd7f268e.
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:00 -04:00
Aristeu Rozanski 130c98911e mm/munlock: add lru_add_drain() to fix memcg_stat_test
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit ece369c7e1044a333dc002d3c3c228b8853fc5f7
Author: Hugh Dickins <hughd@google.com>
Date:   Fri Apr 1 11:28:27 2022 -0700

    mm/munlock: add lru_add_drain() to fix memcg_stat_test

    Mike reports that LTP memcg_stat_test usually leads to

      memcg_stat_test 3 TINFO: Test unevictable with MAP_LOCKED
      memcg_stat_test 3 TINFO: Running memcg_process --mmap-lock1 -s 135168
      memcg_stat_test 3 TINFO: Warming up pid: 3460
      memcg_stat_test 3 TINFO: Process is still here after warm up: 3460
      memcg_stat_test 3 TFAIL: unevictable is 122880, 135168 expected

    but may also lead to

      memcg_stat_test 4 TINFO: Test unevictable with mlock
      memcg_stat_test 4 TINFO: Running memcg_process --mmap-lock2 -s 135168
      memcg_stat_test 4 TINFO: Warming up pid: 4271
      memcg_stat_test 4 TINFO: Process is still here after warm up: 4271
      memcg_stat_test 4 TFAIL: unevictable is 122880, 135168 expected

    or both.  A wee bit flaky.

    follow_page_pte() used to have an lru_add_drain() for each page mlocked,
    and the test came to rely on accurate stats.  The pagevec to be drained
    is different now, but still covered by lru_add_drain(); and, never mind
    the test, I believe it's in everyone's interest that a bulk faulting
    interface like populate_vma_page_range() or faultin_vma_page_range()
    should drain its local pagevecs at the end, to save others sometimes
    needing the much more expensive lru_add_drain_all().
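
    Per that reasoning, the change amounts to draining at the end of the
    bulk-faulting helpers; roughly (placement per the commit message, the
    rest of the function is simplified):

        long populate_vma_page_range(struct vm_area_struct *vma,
                                     unsigned long start, unsigned long end,
                                     int *locked)
        {
                int gup_flags = FOLL_TOUCH;     /* real flag setup elided */
                long ret;

                ret = __get_user_pages(vma->vm_mm, start,
                                       (end - start) / PAGE_SIZE, gup_flags,
                                       NULL, NULL, locked);
                /*
                 * Flush this CPU's local pagevecs so the mlocked pages reach
                 * the unevictable LRU (and the memcg stats) without someone
                 * else having to pay for lru_add_drain_all().
                 */
                lru_add_drain();
                return ret;
        }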

    This does not absolutely guarantee exact stats - the mlocking task can
    be migrated between CPUs as it proceeds - but it's good enough and the
    tests pass.

    Link: https://lkml.kernel.org/r/47f6d39c-a075-50cb-1cfb-26dd957a48af@google.com
    Fixes: b67bf49ce7aa ("mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reported-by: Mike Galbraith <efault@gmx.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:21 -04:00
Aristeu Rozanski 7fd8bc1d8e mm/gup: Convert check_and_migrate_movable_pages() to use a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 1b7f7e58decccb52d6bc454413e3298f1ab3a9c6
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Feb 17 12:46:35 2022 -0500

    mm/gup: Convert check_and_migrate_movable_pages() to use a folio

    Switch from head pages to folios.  This removes an assumption that
    THPs are the only way to have a high-order page.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:16 -04:00
Aristeu Rozanski 91b1881d5c mm/gup: Turn compound_range_next() into gup_folio_range_next()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 659508f9c936aa6e3aaf6e9cf6a4a8836b8f8355
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Dec 23 10:20:12 2021 -0500

    mm/gup: Turn compound_range_next() into gup_folio_range_next()

    Convert the only caller to work on folios instead of pages.
    This removes the last caller of put_compound_head(), so delete it.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:16 -04:00
Aristeu Rozanski 4ee5b8fe10 mm/gup: Turn compound_next() into gup_folio_next()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 12521c7606b2037f8ac2a2fab19e955444a549cf
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Dec 22 23:43:16 2021 -0500

    mm/gup: Turn compound_next() into gup_folio_next()

    Convert both callers to work on folios instead of pages.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:16 -04:00
Aristeu Rozanski a4595d0a92 mm/gup: Convert gup_huge_pgd() to use a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 2d7919a29275dbb9bc3b6e6b4ea015a1eefc463f
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Dec 22 22:30:29 2021 -0500

    mm/gup: Convert gup_huge_pgd() to use a folio

    Use the new folio-based APIs.  This was the last user of
    try_grab_compound_head(), so remove it.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
Aristeu Rozanski 2095d9641d mm/gup: Convert gup_huge_pud() to use a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 83afb52e47d5e31c7d58c07a6d31c43b5ef421a0
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Dec 22 18:07:47 2021 -0500

    mm/gup: Convert gup_huge_pud() to use a folio

    Use the new folio-based APIs.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
Aristeu Rozanski 9b5f3de200 mm/gup: Convert gup_huge_pmd() to use a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: extra conversion due to 4315efd376

commit 667ed1f7bb3b1c1ec2512e64cec04a07df7c5068
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Dec 22 16:57:23 2021 -0500

    mm/gup: Convert gup_huge_pmd() to use a folio

    Use the new folio-based APIs.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
Aristeu Rozanski ff63496672 mm/gup: Convert gup_hugepte() to use a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 09a1626effb89dddcde10c10f5e3c5e6f8b94136
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Dec 22 16:38:30 2021 -0500

    mm/gup: Convert gup_hugepte() to use a folio

    There should be little to no effect from this patch; just removing
    uses of some old APIs.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
Aristeu Rozanski d459884ef2 mm/gup: Convert gup_pte_range() to use a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: extra conversion due to 4315efd376

commit b0496fe4effd83ef76c7440befb184f922b3ffbb
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Dec 10 15:54:11 2021 -0500

    mm/gup: Convert gup_pte_range() to use a folio

    We still call try_grab_folio() once per PTE; a future patch could
    optimise to just adjust the reference count for each page within
    the folio.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
Aristeu Rozanski 50832fe6a4 mm/hugetlb: Use try_grab_folio() instead of try_grab_compound_head()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 822951d84684d7a0c4f45e7231c960e7fe786d8f
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Jan 8 00:15:04 2022 -0500

    mm/hugetlb: Use try_grab_folio() instead of try_grab_compound_head()

    follow_hugetlb_page() only cares about success or failure, so it doesn't
    need to know the type of the returned pointer, only whether it's NULL
    or not.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
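
To illustrate why the switch is safe in follow_hugetlb_page(): the caller only tests the result for NULL, so the type behind the pointer does not matter. A schematic fragment (the variable names and the error path are illustrative, not the actual hugetlb code):

    if (!try_grab_folio(page, refs, flags)) {
            /* grabbing failed; give up on this range (details elided) */
            err = -ENOMEM;
            break;
    }
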
Aristeu Rozanski ff87c1f9ac mm/gup: Add gup_put_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context differences due to missing df06b37ffe

commit d8ddc099c6b3dde887f9484da9a6677709d68b61
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Dec 10 15:39:04 2021 -0500

    mm/gup: Add gup_put_folio()

    Convert put_compound_head() to gup_put_folio() and hpage_pincount_sub()
    to folio_pincount_sub().  This removes the last call to put_page_refs(),
    so delete it.  Add a temporary put_compound_head() wrapper which will
    be deleted by the end of this series.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
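
The approximate shape of the new helper, as far as the entry above describes it (pin statistics and other details are elided, and the exact body may differ from the in-tree code):

    static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
    {
            if (flags & FOLL_PIN) {
                    /* large folios carry an explicit pincount ... */
                    if (folio_test_large(folio))
                            atomic_sub(refs, folio_pincount_ptr(folio));
                    /* ... small folios fold the pin into the refcount */
                    else
                            refs *= GUP_PIN_COUNTING_BIAS;
            }
            folio_put_refs(folio, refs);
    }
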
Aristeu Rozanski d13e1c8f86 mm/gup: Convert try_grab_page() to use a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 5fec0719908bdabdf9d017b0f488d18019ed00f7
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Feb 4 10:32:01 2022 -0500

    mm/gup: Convert try_grab_page() to use a folio

    Hoist the folio conversion and the folio_ref_count() check to the
    top of the function instead of using the one buried in try_get_page().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
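
Schematically, the hoisting described above moves the folio conversion and the refcount sanity check to the first lines of the function instead of leaving them buried inside try_get_page() (simplified sketch; the FOLL_PIN accounting is elided):

    bool try_grab_page(struct page *page, unsigned int flags)
    {
            struct folio *folio = page_folio(page);

            if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
                    return false;

            if (flags & FOLL_GET)
                    folio_ref_inc(folio);
            else if (flags & FOLL_PIN) {
                    /* pin accounting elided in this sketch */
            }
            return true;
    }
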
Aristeu Rozanski 63cdd4209a mm/gup: Add try_get_folio() and try_grab_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context differences due to missing 27674ef6c73f

commit ece1ed7bfa1208b527b3dc90bb45c55e0d139a88
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Feb 4 10:27:40 2022 -0500

    mm/gup: Add try_get_folio() and try_grab_folio()

    Convert try_get_compound_head() into try_get_folio() and convert
    try_grab_compound_head() into try_grab_folio().  Add a temporary
    try_grab_compound_head() wrapper around try_grab_folio() to let us
    convert callers individually.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
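
The temporary wrapper mentioned above is the interesting piece of the transition: it lets unconverted callers keep the old page-based interface while the series switches them over one at a time. Roughly (the exact wrapper may differ):

    /* transitional helper, deleted again at the end of the series;
     * relies on struct page being the first member of struct folio,
     * so a NULL folio comes back to the caller as a NULL page */
    static inline struct page *try_grab_compound_head(struct page *page,
                                                      int refs, unsigned int flags)
    {
            return &try_grab_folio(page, refs, flags)->page;
    }
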
Aristeu Rozanski 02c4025a8d mm: Make compound_pincount always available
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: Note that we have the RHEL-only commit 44740bc20b applied, but that shouldn't be a problem space-wise since we no longer ship 32-bit kernels and we're well under the 40-byte limit

commit 5232c63f46fdd779303527ec36c518cc1e9c6b4e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Jan 6 16:46:43 2022 -0500

    mm: Make compound_pincount always available

    Move compound_pincount from the third page to the second page, which
    means it's available for all compound pages.  That lets us delete
    hpage_pincount_available().

    On 32-bit systems, there isn't enough space for both compound_pincount
    and compound_nr in the second page (it would collide with page->private,
    which is in use for pages in the swap cache), so revert the optimisation
    of storing both compound_order and compound_nr on 32-bit systems.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
Aristeu Rozanski 8030eea148 mm/gup: Remove hpage_pincount_sub()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context differences due to missing df06b37ffe

commit 6315d8a23ce308433cf615e435ca2ee2aee7d11c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 7 14:19:39 2022 -0500

    mm/gup: Remove hpage_pincount_sub()

    Move the assertion (and correct it to be a cheaper variant),
    and inline the atomic_sub() operation.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
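
In sketch form, the change at the call site replaces the helper with a direct atomic_sub() plus the (now cheaper) assertion. The names below follow the pre-folio code this commit touches and are approximate, not the exact in-tree lines:

    /* before: */
    hpage_pincount_sub(page, refs);

    /* after (roughly): assertion moved to the caller, plain atomic_sub() */
    VM_BUG_ON_PAGE(page != compound_head(page), page);
    atomic_sub(refs, compound_pincount_ptr(page));
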
Aristeu Rozanski 817e0cfb85 mm/gup: Remove hpage_pincount_add()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context differences due to missing df06b37ffe

commit 78d9d6ced31ad2f242e44bd24b774fd99c2d663d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 7 14:15:11 2022 -0500

    mm/gup: Remove hpage_pincount_add()

    It's clearer to call atomic_add() in the callers; the assertions clearly
    can't fire there because they're part of the condition for calling
    atomic_add().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
Aristeu Rozanski ba3e0326ed mm/gup: Handle page split race more efficiently
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 59409373f60a0a493fe2a1b85dc8c6299c4fef37
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 7 14:04:55 2022 -0500

    mm/gup: Handle page split race more efficiently

    If we hit the page split race, the current code returns NULL which will
    presumably trigger a retry under the mmap_lock.  This isn't necessary;
    we can just retry the compound_head() lookup.  This is a very minor
    optimisation of an unlikely path, but conceptually it matches (eg)
    the page cache RCU-protected lookup.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
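
The retry described above, in simplified form (the helper names are taken from the gup.c of that era; the exact in-tree code may differ):

    struct page *head;

    retry:
    head = compound_head(page);
    if (WARN_ON_ONCE(page_ref_count(head) < 0))
            return NULL;
    if (unlikely(!page_cache_add_speculative(head, refs)))
            return NULL;
    /* the compound page was split under us: retry instead of failing */
    if (unlikely(compound_head(page) != head)) {
            put_page_refs(head, refs);
            goto retry;
    }
    return head;
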
Aristeu Rozanski 9be3e96759 mm/gup: Remove an assumption of a contiguous memmap
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 4c65422901154766e5cee17875ed680366a4a141
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 7 13:45:25 2022 -0500

    mm/gup: Remove an assumption of a contiguous memmap

    This assumption needs the inverse of nth_page(), which is temporarily
    named page_nth() until it's renamed later in this series.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
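
The temporary helper is, roughly, the inverse mapping of nth_page(): given a head page and one of its tail pages, return the tail's index within the compound page. A sketch of the definition (the real macro lives in the headers and may be conditioned slightly differently):

    #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
    #define page_nth(head, tail)   (page_to_pfn(tail) - page_to_pfn(head))
    #else
    #define page_nth(head, tail)   ((tail) - (head))
    #endif
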
Aristeu Rozanski 885b7047a5 mm/gup: Fix some contiguous memmap assumptions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit c228afb11ac6938532703ac712992524497aff29
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 7 13:25:55 2022 -0500

    mm/gup: Fix some contiguous memmap assumptions

    Several functions in gup.c assume that a compound page has virtually
    contiguous page structs.  This isn't true for SPARSEMEM configs unless
    SPARSEMEM_VMEMMAP is also set.  Fix them by using nth_page() instead of
    plain pointer arithmetic.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
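
The class of bug being fixed, in sketch form: with classic SPARSEMEM (no VMEMMAP) the struct pages of one compound page can sit in different memmap sections, so indexing has to go through PFNs via nth_page() rather than plain pointer arithmetic.

    /* broken when struct pages are not virtually contiguous: */
    page = head + i;

    /* correct on every memory model: */
    page = nth_page(head, i);
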
Aristeu Rozanski d808a5f982 mm/gup: Change the calling convention for compound_next()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 28297dbcad7ed3d7bac373eef121339cb0cac326
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Jan 9 21:03:47 2022 -0500

    mm/gup: Change the calling convention for compound_next()

    Return the head page instead of storing it to a passed parameter.
    Reorder the arguments to match the calling function's arguments.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
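
Schematically, the new calling convention turns the out-parameter style into an ordinary "iterator" that returns the head page. The prototypes below are illustrative, not the exact in-tree ones:

    /* before: */
    compound_next(i, npages, pages, &head, &ntails);

    /* after: */
    head = compound_next(pages, npages, i, &ntails);

Returning the head lets the callers loop over compound heads without juggling an extra output pointer.
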
Aristeu Rozanski dab58f8e90 mm/gup: Optimise compound_range_next()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 0b046e12ae5d6d286415a2e805fcfdd724b7add1
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Jan 9 16:21:23 2022 -0500

    mm/gup: Optimise compound_range_next()

    By definition, a compound page has an order >= 1, so the second half
    of the test was redundant.  Also, this cannot be a tail page since
    it's the result of calling compound_head(), so use PageHead() instead
    of PageCompound().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
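
The simplification above amounts to the following, with the condition shapes approximated and an illustrative body:

    /* before: the second test can never change the result */
    if (PageCompound(head) && compound_order(head) >= 1)
            nr = compound_nr(head);

    /* after: head came from compound_head(), so it cannot be a tail page */
    if (PageHead(head))
            nr = compound_nr(head);
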
Aristeu Rozanski 153cd867d2 mm/gup: Change the calling convention for compound_range_next()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 8f39f5fcb7963f0a01b8077c92e627af279de65e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Jan 9 16:05:11 2022 -0500

    mm/gup: Change the calling convention for compound_range_next()

    Return the head page instead of storing it to a passed parameter.
    Pass the start page directly instead of passing a pointer to it.
    Reorder the arguments to match the calling function's arguments.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
Aristeu Rozanski 5c3946b7f8 mm/gup: Remove for_each_compound_head()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit e76027488640802633c646210781b63221c2fdd2
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Jan 8 20:23:46 2022 -0500

    mm/gup: Remove for_each_compound_head()

    This macro doesn't simplify the users; it's easier to just call
    compound_next() inside a standard loop.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
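
What the removal looks like at a call site, schematically; unpin_step() is an invented placeholder for whatever work the loop body does, and the macro's argument order is approximate:

    /* before: */
    for_each_compound_head(i, pages, npages, head, ntails)
            unpin_step(head, ntails);

    /* after: an ordinary loop, with the iteration state in plain sight */
    for (i = 0; i < npages; i += ntails) {
            head = compound_next(pages, npages, i, &ntails);
            unpin_step(head, ntails);
    }
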
Aristeu Rozanski 9ad988c873 mm/gup: Remove for_each_compound_range()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit a5f100db6855dbfe2709887b7348ce727e990fb6
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Jan 8 20:23:46 2022 -0500

    mm/gup: Remove for_each_compound_range()

    This macro doesn't simplify the users; it's easier to just call
    compound_range_next() inside the loop.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
Aristeu Rozanski b12cd317c7 mm/gup: Increment the page refcount before the pincount
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 8ea2979c1444cd455ddbe7f976de79cc09fdc38d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Feb 4 09:24:26 2022 -0500

    mm/gup: Increment the page refcount before the pincount

    We should always increase the refcount before doing anything else to
    the page so that other page users see the elevated refcount first.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:14 -04:00
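
The ordering rule the commit states, reduced to two lines (simplified from the large-folio path of the grab helper; not the complete function):

    folio_ref_add(folio, refs);                  /* 1: refcount, visible first */
    atomic_add(refs, folio_pincount_ptr(folio)); /* 2: only then the pincount */

Any concurrent code that samples the refcount therefore never observes pin accounting on a folio whose refcount has not yet been raised.
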
Aristeu Rozanski c5c0c2debb mm: refactor check_and_migrate_movable_pages
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit f9f38f78c5d5eef3717b48d84263b4b46ee0110a
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Feb 16 15:31:37 2022 +1100

    mm: refactor check_and_migrate_movable_pages

    Remove up to two levels of indentation by using continue statements
    and move variables to local scope where possible.

    Link: https://lkml.kernel.org/r/20220210072828.2930359-11-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Chaitanya Kulkarni <kch@nvidia.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:13 -04:00
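
The refactoring pattern, in sketch form; is_movable(), isolate() and record_failure() are invented placeholders standing in for the real checks, not functions from the kernel:

    /* before: the real work sits two levels deep */
    for (i = 0; i < nr_pages; i++) {
            if (is_movable(pages[i])) {
                    if (!isolate(pages[i]))
                            record_failure(pages[i]);
            }
    }

    /* after: uninteresting cases bail out early with continue */
    for (i = 0; i < nr_pages; i++) {
            if (!is_movable(pages[i]))
                    continue;
            if (isolate(pages[i]))
                    continue;
            record_failure(pages[i]);
    }
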
Aristeu Rozanski c229e0f271 mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit b67bf49ce7aae72f63739abee6ac25f64bf20081
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:21:52 2022 -0800

    mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE

    If counting page mlocks, we must not double-count: follow_page_pte() can
    tell if a page has already been Mlocked or not, but cannot tell if a pte
    has already been counted or not: that will have to be done when the pte
    is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
    for new anon pages, but there's no such tracking yet for others).

    Delete all the FOLL_MLOCK code - faulting in the missing pages will do
    all that is necessary, without special mlock_vma_page() calls from here.

    But then FOLL_POPULATE turns out to serve no purpose - it was there so
    that its absence would tell faultin_page() not to faultin page when
    setting up VM_LOCKONFAULT areas; but if there's no special work needed
    here for mlock, then there's no work at all here for VM_LOCKONFAULT.

    Have I got that right?  I've not looked into the history, but see that
    FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
    purpose before?  Ah, yes, it was used to skip the old stack guard page.

    And is it intentional that COW is not broken on existing pages when
    setting up a VM_LOCKONFAULT area?  I can see that being argued either
    way, and have no reason to disagree with current behaviour.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00