From 7127694666ce60d8849ce98eef10983a7d9b7398 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Sun, 4 May 2025 17:55:46 -0400
Subject: [PATCH 01/11] selftests: drivers/dma-buf: Fix implicit declaration
 warns

JIRA: https://issues.redhat.com/browse/RHEL-89519

commit 2f9602870886af74d97bac23ee6db5f5466d0a49
Author: Shuah Khan
Date: Fri, 17 Sep 2021 17:58:13 -0600

selftests: drivers/dma-buf: Fix implicit declaration warns

udmabuf has the following implicit declaration warns:

udmabuf.c:30:10: warning: implicit declaration of function 'open';
udmabuf.c:42:8: warning: implicit declaration of function 'fcntl'

These are caused by including just linux/fcntl.h instead of fcntl.h.
Fix it to include fcntl.h, which in turn brings in linux/fcntl.h.

In addition, define __EXPORTED_HEADERS__ to bring in the F_ADD_SEALS and
F_SEAL_SHRINK defines, and fix the following errors that show up when
just fcntl.h is included:

udmabuf.c:45:21: error: 'F_ADD_SEALS' undeclared
   45 |         ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
      |                            ^~~~~~~~~~~
udmabuf.c:45:34: error: 'F_SEAL_SHRINK' undeclared
   45 |         ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
      |                                         ^~~~~~~~~~~~~

Signed-off-by: Shuah Khan

Signed-off-by: Waiman Long
---
 tools/testing/selftests/drivers/dma-buf/udmabuf.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf.c b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
index 4de902ea14d8..de1c4e6de0b2 100644
--- a/tools/testing/selftests/drivers/dma-buf/udmabuf.c
+++ b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
@@ -1,10 +1,13 @@
 // SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <string.h>
 #include <errno.h>
-#include <linux/fcntl.h>
+#include <fcntl.h>
 #include <malloc.h>
 
 #include <sys/ioctl.h>

From 6980892b561c7efd4cdaf5728fdd86ed7e13188d Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Sun, 4 May 2025 17:55:47 -0400
Subject: [PATCH 02/11] selftests: drivers/dma-buf: Improve message in
 selftest summary

JIRA: https://issues.redhat.com/browse/RHEL-89519

commit dbeb232726871352fc3e688ff5b02897f8cb0dc7
Author: Soumya Negi
Date: Fri, 1 Jul 2022 05:50:52 -0700

selftests: drivers/dma-buf: Improve message in selftest summary

Selftest udmabuf for the dma-buf driver is skipped when the device file
(e.g. /dev/udmabuf) for the DMA buffer cannot be opened, i.e. no DMA
buffer has been allocated. This patch adds clarity to the SKIP message.
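For context, a minimal standalone sketch of the sealing setup these two selftest fixes revolve around. It is illustrative only and not part of either patch; it assumes a glibc new enough (2.27+) to declare memfd_create() in <sys/mman.h>, and it carries the same defines the selftest uses so that the seal constants resolve even against older header sets:

/* seal-demo.c: hedged example, not part of the series.
 * Build: cc -o seal-demo seal-demo.c
 */
#define _GNU_SOURCE
#define __EXPORTED_HEADERS__

#include <fcntl.h>      /* open(), fcntl(), F_ADD_SEALS, F_SEAL_SHRINK */
#include <stdio.h>
#include <sys/mman.h>   /* memfd_create(), MFD_ALLOW_SEALING */

int main(void)
{
	int memfd = memfd_create("seal-demo", MFD_ALLOW_SEALING);

	if (memfd < 0) {
		perror("memfd_create");
		return 1;
	}
	/* Without fcntl.h (plus the defines above on older header sets),
	 * this is exactly where the quoted 'F_ADD_SEALS' undeclared
	 * errors appear at build time. */
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0) {
		perror("fcntl(F_ADD_SEALS)");
		return 1;
	}
	printf("memfd sealed against shrinking\n");
	return 0;
}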
Signed-off-by: Soumya Negi
Signed-off-by: Shuah Khan

Signed-off-by: Waiman Long
---
 tools/testing/selftests/drivers/dma-buf/udmabuf.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf.c b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
index de1c4e6de0b2..c812080e304e 100644
--- a/tools/testing/selftests/drivers/dma-buf/udmabuf.c
+++ b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
@@ -32,7 +32,8 @@ int main(int argc, char *argv[])
 
 	devfd = open("/dev/udmabuf", O_RDWR);
 	if (devfd < 0) {
-		printf("%s: [skip,no-udmabuf]\n", TEST_PREFIX);
+		printf("%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n",
+		       TEST_PREFIX);
 		exit(77);
 	}

From 1bb55119fcda3d3e1df00a448cdb1319a9688298 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Sun, 4 May 2025 19:20:11 -0400
Subject: [PATCH 03/11] mm: record the migration reason for struct
 migration_target_control

JIRA: https://issues.redhat.com/browse/RHEL-89519

Conflicts:
1) A context diff in the mm/page_alloc.c hunk due to missing upstream
   commit c8b360031218 ("mm: add alloc_contig_migrate_range allocation
   statistics").
2) A context diff in the mm/gup.c hunk due to the presence of a later
   upstream commit 53ba78de064b ("mm/gup: introduce
   check_and_migrate_movable_folios()").

commit e42dfe4e0a51b476dcc6f1461c51fdb1b76573aa
Author: Baolin Wang
Date: Wed, 6 Mar 2024 18:13:26 +0800

mm: record the migration reason for struct migration_target_control

Patch series "make the hugetlb migration strategy consistent", v2.

As discussed in previous thread [1], there is an inconsistency when
handling hugetlb migration. When handling the migration of freed
hugetlb, it prevents fallback to other NUMA nodes in
alloc_and_dissolve_hugetlb_folio(). However, when dealing with in-use
hugetlb, it allows fallback to other NUMA nodes in
alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb
pool and might result in unexpected failures when node-bound workloads
don't get what is assumed to be available.

This patchset tries to make the hugetlb migration strategy clearer and
more consistent. Please find details in each patch.

[1] https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/

This patch (of 2):

To support different hugetlb allocation strategies during hugetlb
migration based on various migration reasons, record the migration
reason in the migration_target_control structure as a preparation.
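As an editorial aid before the hunks: the .reason values they record come from enum migrate_reason. Below is a hedged reference sketch of that enum, mirroring include/linux/migrate_mode.h as of this series (the target tree is authoritative):

/* Reference sketch of enum migrate_reason; see
 * include/linux/migrate_mode.h for the authoritative definition.
 */
enum migrate_reason {
	MR_COMPACTION,		/* memory compaction */
	MR_MEMORY_FAILURE,	/* soft offline (mm/memory-failure.c below) */
	MR_MEMORY_HOTPLUG,	/* range offlining (mm/memory_hotplug.c) */
	MR_SYSCALL,		/* migrate_pages()/move_pages() syscalls */
	MR_MEMPOLICY_MBIND,	/* mbind() */
	MR_NUMA_MISPLACED,	/* NUMA balancing */
	MR_CONTIG_RANGE,	/* alloc_contig_range() (mm/page_alloc.c) */
	MR_LONGTERM_PIN,	/* longterm GUP pins (mm/gup.c) */
	MR_DEMOTION,		/* reclaim-driven demotion (mm/vmscan.c) */
	MR_TYPES
};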
Link: https://lkml.kernel.org/r/cover.1709719720.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/7b95d4981e07211f57139fc5b1f7ce91b920cee4.1709719720.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Oscar Salvador Cc: David Hildenbrand Cc: Miaohe Lin Cc: Michal Hocko Cc: Muchun Song Cc: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Waiman Long --- mm/gup.c | 1 + mm/internal.h | 1 + mm/memory-failure.c | 1 + mm/memory_hotplug.c | 1 + mm/mempolicy.c | 1 + mm/migrate.c | 1 + mm/page_alloc.c | 1 + mm/vmscan.c | 3 ++- 8 files changed, 9 insertions(+), 1 deletion(-) diff --git a/mm/gup.c b/mm/gup.c index 0bd7ecdd1e4a..972ab9bffa8d 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2162,6 +2162,7 @@ migrate_longterm_unpinnable_folios(struct list_head *movable_folio_list, struct migration_target_control mtc = { .nid = NUMA_NO_NODE, .gfp_mask = GFP_USER | __GFP_NOWARN, + .reason = MR_LONGTERM_PIN, }; if (migrate_pages(movable_folio_list, alloc_migration_target, diff --git a/mm/internal.h b/mm/internal.h index fbd4362cc046..4ed7ab04a5c1 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -961,6 +961,7 @@ struct migration_target_control { int nid; /* preferred node id */ nodemask_t *nmask; gfp_t gfp_mask; + enum migrate_reason reason; }; /* diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 9a8022ac5304..49518cb6dace 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -2691,6 +2691,7 @@ static int soft_offline_in_use_page(struct page *page) struct migration_target_control mtc = { .nid = NUMA_NO_NODE, .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + .reason = MR_MEMORY_FAILURE, }; if (!huge && folio_test_large(folio)) { diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index cfd99166904b..f9f43e055ea3 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1831,6 +1831,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) struct migration_target_control mtc = { .nmask = &nmask, .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + .reason = MR_MEMORY_HOTPLUG, }; int ret; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d588d691361c..5c5f163f12e8 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1015,6 +1015,7 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest, struct migration_target_control mtc = { .nid = dest, .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, + .reason = MR_SYSCALL, }; nodes_clear(nmask); diff --git a/mm/migrate.c b/mm/migrate.c index 969e0727a93f..cc865721d79c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2058,6 +2058,7 @@ static int do_move_pages_to_node(struct mm_struct *mm, struct migration_target_control mtc = { .nid = node, .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, + .reason = MR_SYSCALL, }; err = migrate_pages(pagelist, alloc_migration_target, NULL, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index aa2cbab0e18e..6fed3fa11edc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6248,6 +6248,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc, struct migration_target_control mtc = { .nid = zone_to_nid(cc->zone), .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + .reason = MR_CONTIG_RANGE, }; lru_cache_disable(); diff --git a/mm/vmscan.c b/mm/vmscan.c index c5135c2e0495..a2147d6654f4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -969,7 +969,8 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | 
__GFP_NOMEMALLOC | GFP_NOWAIT,
 		.nid = target_nid,
-		.nmask = &allowed_mask
+		.nmask = &allowed_mask,
+		.reason = MR_DEMOTION,
 	};
 
 	if (list_empty(demote_folios))

From edcc05940a6b5125c317d16c810e656f52ea1203 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Tue, 6 May 2025 12:00:30 -0400
Subject: [PATCH 04/11] mm: hugetlb: make the hugetlb migration strategy
 consistent

JIRA: https://issues.redhat.com/browse/RHEL-89519

Conflicts:
1) A merge conflict in the include/linux/hugetlb.h hunk due to missing
   upstream commit 72e315f7a750 ("mempolicy: mmap_lock is not needed
   while migrating folios") which removes the alloc_hugetlb_folio_vma()
   function declaration.
2) A merge conflict in the mm/mempolicy.c hunk due to missing upstream
   commit 72e315f7a750 ("mempolicy: mmap_lock is not needed while
   migrating folios"). As the fallback policy differs between
   alloc_hugetlb_folio_vma() and its caller new_folio(), inline
   alloc_hugetlb_folio_vma() into new_folio() and apply the required
   change.

commit 42d0c3fbb5811fbfb663d8ede1d7ffba02e7ae18
Author: Baolin Wang
Date: Wed, 6 Mar 2024 18:13:27 +0800

mm: hugetlb: make the hugetlb migration strategy consistent

As discussed in previous thread [1], there is an inconsistency when
handling hugetlb migration. When handling the migration of freed
hugetlb, it prevents fallback to other NUMA nodes in
alloc_and_dissolve_hugetlb_folio(). However, when dealing with in-use
hugetlb, it allows fallback to other NUMA nodes in
alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb
pool and might result in unexpected failures when node-bound workloads
don't get what is assumed to be available.

To make the hugetlb migration strategy clearer, list all the scenarios
of hugetlb migration and analyze whether allocation fallback is
permitted:

1) Memory offline: will call dissolve_free_huge_pages() to free the
   freed hugetlb, and call do_migrate_range() to migrate the in-use
   hugetlb. Both can break the per-node hugetlb pool, but as this is an
   explicit offlining operation there is no better choice, so hugetlb
   allocation fallback should be allowed.

2) Memory failure: same as memory offline. Falling back to a different
   node might be the only option to handle it; otherwise the impact of
   the poisoned memory can be amplified.

3) Longterm pinning: will call migrate_longterm_unpinnable_pages() to
   migrate in-use and not-longterm-pinnable hugetlb, which can break the
   per-node pool. But longterm pinning should fail if allocation on the
   current node is not possible, to avoid breaking the per-node pool.

4) Syscalls (mbind, migrate_pages, move_pages): these are explicit user
   operations to move pages to other nodes, so fallback to other nodes
   should not be prohibited.

5) alloc_contig_range: used by CMA allocation and virtio-mem
   fake-offline to allocate a given range of pages. Freed hugetlb
   migration is already not allowed to fall back; to keep consistency,
   in-use hugetlb migration should not be allowed to fall back either.

6) alloc_contig_pages: used by kfence, pgtable_debug etc. The strategy
   should be consistent with that of alloc_contig_range().

Based on the analysis of the various scenarios above, introduce a new
helper to determine whether fallback is permitted according to the
migration reason.
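As an editorial summary, the resulting wiring can be condensed as follows; this restates the hunks in patches 03 and 04 rather than adding code, and the identifiers are the patch's own:

/* Caller side (alloc_migration_target() in mm/migrate.c): derive the
 * fallback permission from the recorded migration reason. */
folio = alloc_hugetlb_folio_nodemask(h, nid, mtc->nmask, gfp_mask,
				     htlb_allow_alloc_fallback(mtc->reason));

/* Allocator side (mm/hugetlb.c): once the free-pool lookup misses,
 * a disallowed fallback pins the allocation to the preferred node so
 * no other node's per-node pool can be broken. */
if (!allow_alloc_fallback)
	gfp_mask |= __GFP_THISNODE;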
[1] https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/

Link: https://lkml.kernel.org/r/3519fcd41522817307a05b40fb551e2e17e68101.1709719720.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang
Cc: David Hildenbrand
Cc: Miaohe Lin
Cc: Michal Hocko
Cc: Muchun Song
Cc: Naoya Horiguchi
Cc: Oscar Salvador
Signed-off-by: Andrew Morton

Signed-off-by: Waiman Long
---
 include/linux/hugetlb.h | 35 +++++++++++++++++++++++++++++++++--
 mm/hugetlb.c            | 14 ++++++++++++--
 mm/mempolicy.c          | 16 ++++++++++++++--
 mm/migrate.c            |  3 ++-
 4 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0c8ec88ee82d..13ecfeabb379 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -746,7 +746,8 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
-				nodemask_t *nmask, gfp_t gfp_mask);
+				nodemask_t *nmask, gfp_t gfp_mask,
+				bool allow_alloc_fallback);
 struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
 struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
@@ -974,6 +975,30 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return modified_mask;
 }
 
+static inline bool htlb_allow_alloc_fallback(int reason)
+{
+	bool allowed_fallback = false;
+
+	/*
+	 * Note: the memory offline, memory failure and migration syscalls will
+	 * be allowed to fallback to other nodes due to lack of a better choice,
+	 * which might break the per-node hugetlb pool. While other cases will
+	 * set the __GFP_THISNODE to avoid breaking the per-node hugetlb pool.
+	 */
+	switch (reason) {
+	case MR_MEMORY_HOTPLUG:
+	case MR_MEMORY_FAILURE:
+	case MR_SYSCALL:
+	case MR_MEMPOLICY_MBIND:
+		allowed_fallback = true;
+		break;
+	default:
+		break;
+	}
+
+	return allowed_fallback;
+}
+
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 					   struct mm_struct *mm, pte_t *pte)
 {
@@ -1076,7 +1101,8 @@ alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
 
 static inline struct folio *
 alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
-			nodemask_t *nmask, gfp_t gfp_mask)
+			nodemask_t *nmask, gfp_t gfp_mask,
+			bool allow_alloc_fallback)
 {
 	return NULL;
 }
@@ -1199,6 +1225,11 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
 	return 0;
 }
 
+static inline bool htlb_allow_alloc_fallback(int reason)
+{
+	return false;
+}
+
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 					   struct mm_struct *mm, pte_t *pte)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 92802a512733..931e8fae49e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2491,7 +2491,7 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
 
 /* folio migration callback function */
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
-		nodemask_t *nmask, gfp_t gfp_mask)
+		nodemask_t *nmask, gfp_t gfp_mask, bool allow_alloc_fallback)
 {
 	spin_lock_irq(&hugetlb_lock);
 	if (available_huge_pages(h)) {
@@ -2506,6 +2506,10 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
 	}
 	spin_unlock_irq(&hugetlb_lock);
 
+	/* We cannot fallback to other nodes, as we could break the per-node pool.
*/ + if (!allow_alloc_fallback) + gfp_mask |= __GFP_THISNODE; + return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask); } @@ -2521,7 +2525,13 @@ struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *v gfp_mask = htlb_alloc_mask(h); node = huge_node(vma, address, gfp_mask, &mpol, &nodemask); - folio = alloc_hugetlb_folio_nodemask(h, node, nodemask, gfp_mask); + /* + * This is used to allocate a temporary hugetlb to hold the copied + * content, which will then be copied again to the final hugetlb + * consuming a reservation. Set the alloc_fallback to false to indicate + * that breaking the per-node hugetlb pool is not allowed in this case. + */ + folio = alloc_hugetlb_folio_nodemask(h, node, nodemask, gfp_mask, false); mpol_cond_put(mpol); return folio; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 5c5f163f12e8..d54e77e52561 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1183,8 +1183,20 @@ static struct folio *new_folio(struct folio *src, unsigned long start) return NULL; if (folio_test_hugetlb(src)) { - return alloc_hugetlb_folio_vma(folio_hstate(src), - vma, address); + struct mempolicy *mpol; + nodemask_t *nodemask; + struct folio *folio; + struct hstate *h; + gfp_t gfp; + int nid; + + h = folio_hstate(src); + gfp = htlb_alloc_mask(h); + nid = huge_node(vma, address, gfp, &mpol, &nodemask); + folio = alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp, + htlb_allow_alloc_fallback(MR_MEMPOLICY_MBIND)); + mpol_cond_put(mpol); + return folio; } if (folio_test_large(src)) diff --git a/mm/migrate.c b/mm/migrate.c index cc865721d79c..7b32d9daa141 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2019,7 +2019,8 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private) gfp_mask = htlb_modify_alloc_mask(h, gfp_mask); return alloc_hugetlb_folio_nodemask(h, nid, - mtc->nmask, gfp_mask); + mtc->nmask, gfp_mask, + htlb_allow_alloc_fallback(mtc->reason)); } if (folio_test_large(src)) { From 5bc77e408ae5e0f3423874becc6d5dcb928c1d17 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 6 May 2025 12:04:01 -0400 Subject: [PATCH 05/11] docs: hugetlbpage.rst: add hugetlb migration description JIRA: https://issues.redhat.com/browse/RHEL-89519 commit 353dc187840100fabeb946bb9573bc5ca5e04fcb Author: Baolin Wang Date: Wed, 6 Mar 2024 18:13:28 +0800 docs: hugetlbpage.rst: add hugetlb migration description Add some description of the hugetlb migration strategy. Link: https://lkml.kernel.org/r/63fb16e7a4ebc5cb69ce655af86e29b2d8e9ba34.1709719720.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Oscar Salvador Cc: David Hildenbrand Cc: Miaohe Lin Cc: Michal Hocko Cc: Muchun Song Cc: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Waiman Long --- Documentation/admin-guide/mm/hugetlbpage.rst | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index fbdea8a45c96..28880f409b8e 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -378,6 +378,13 @@ Note that the number of overcommit and reserve pages remain global quantities, as we don't know until fault time, when the faulting task's mempolicy is applied, from which node the huge page allocation will be attempted. 
+The hugetlb may be migrated between the per-node hugepages pools in these
+scenarios: memory offline, memory failure, longterm pinning, syscalls
+(mbind, migrate_pages and move_pages), alloc_contig_range() and
+alloc_contig_pages(). Currently only memory offline, memory failure and
+syscalls allow falling back to another node when the current one cannot
+allocate during hugetlb migration, so these three cases can break per-node pools.
+
 .. _using_huge_pages:
 
 Using Huge Pages

From 060ffc289def2e71fbd0150cc1a6f59349107b03 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Tue, 6 May 2025 12:04:01 -0400
Subject: [PATCH 06/11] udmabuf: pin the pages using memfd_pin_folios() API

JIRA: https://issues.redhat.com/browse/RHEL-89519

Conflicts: Several merge conflicts due to the presence of the following
later upstream commits:
- 1c0844c6184e ("udmabuf: change folios array from kmalloc to kvmalloc")
- f49856f525ac ("udmabuf: fix memory leak on last export_udmabuf() error path")
- 0a16e24e34f2 ("udmabuf: also check for F_SEAL_FUTURE_WRITE")
The skipped hunks in 1c0844c6184e are reapplied.

commit c6a3194c05e7e6fd0e8fbfb1720084ae2503c4ac
Author: Vivek Kasireddy
Date: Sun, 23 Jun 2024 23:36:16 -0700

udmabuf: pin the pages using memfd_pin_folios() API

Using memfd_pin_folios() will ensure that the pages are pinned
correctly using FOLL_PIN. And, this also ensures that we don't
accidentally break features such as memory hotunplug as it would not
allow pinning pages in the movable zone.

Using this new API also simplifies the code as we no longer have to
deal with extracting individual pages from their mappings or handle
shmem and hugetlb cases separately.

Link: https://lkml.kernel.org/r/20240624063952.1572359-9-vivek.kasireddy@intel.com
Signed-off-by: Vivek Kasireddy
Acked-by: Dave Airlie
Acked-by: Gerd Hoffmann
Cc: David Hildenbrand
Cc: Matthew Wilcox
Cc: Daniel Vetter
Cc: Hugh Dickins
Cc: Peter Xu
Cc: Jason Gunthorpe
Cc: Dongwon Kim
Cc: Junxiao Chang
Cc: Arnd Bergmann
Cc: Christoph Hellwig
Cc: Christoph Hellwig
Cc: Mike Kravetz
Cc: Oscar Salvador
Cc: Shuah Khan
Signed-off-by: Andrew Morton

Signed-off-by: Waiman Long
---
 drivers/dma-buf/udmabuf.c | 155 ++++++++++++++++++++------------------
 1 file changed, 80 insertions(+), 75 deletions(-)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index eb1cc0dc11a7..3384c7dfd718 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -30,6 +30,12 @@ struct udmabuf {
 	struct sg_table *sg;
 	struct miscdevice *device;
 	pgoff_t *offsets;
+	struct list_head unpin_list;
+};
+
+struct udmabuf_folio {
+	struct folio *folio;
+	struct list_head list;
 };
 
 static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
@@ -161,17 +167,43 @@ static void unmap_udmabuf(struct dma_buf_attachment *at,
 	return put_sg_table(at->dev, sg, direction);
 }
 
+static void unpin_all_folios(struct list_head *unpin_list)
+{
+	struct udmabuf_folio *ubuf_folio;
+
+	while (!list_empty(unpin_list)) {
+		ubuf_folio = list_first_entry(unpin_list,
+					      struct udmabuf_folio, list);
+		unpin_folio(ubuf_folio->folio);
+
+		list_del(&ubuf_folio->list);
+		kfree(ubuf_folio);
+	}
+}
+
+static int add_to_unpin_list(struct list_head *unpin_list,
+			     struct folio *folio)
+{
+	struct udmabuf_folio *ubuf_folio;
+
+	ubuf_folio = kzalloc(sizeof(*ubuf_folio), GFP_KERNEL);
+	if (!ubuf_folio)
+		return -ENOMEM;
+
+	ubuf_folio->folio = folio;
+	list_add_tail(&ubuf_folio->list, unpin_list);
+	return 0;
+}
+
 static void release_udmabuf(struct dma_buf *buf)
 {
struct udmabuf *ubuf = buf->priv; struct device *dev = ubuf->device->this_device; - pgoff_t pg; if (ubuf->sg) put_sg_table(dev, ubuf->sg, DMA_BIDIRECTIONAL); - for (pg = 0; pg < ubuf->pagecount; pg++) - folio_put(ubuf->folios[pg]); + unpin_all_folios(&ubuf->unpin_list); kvfree(ubuf->offsets); kvfree(ubuf->folios); kfree(ubuf); @@ -226,64 +258,6 @@ static const struct dma_buf_ops udmabuf_ops = { #define SEALS_WANTED (F_SEAL_SHRINK) #define SEALS_DENIED (F_SEAL_WRITE|F_SEAL_FUTURE_WRITE) -static int handle_hugetlb_pages(struct udmabuf *ubuf, struct file *memfd, - pgoff_t offset, pgoff_t pgcnt, - pgoff_t *pgbuf) -{ - struct hstate *hpstate = hstate_file(memfd); - pgoff_t mapidx = offset >> huge_page_shift(hpstate); - pgoff_t subpgoff = (offset & ~huge_page_mask(hpstate)) >> PAGE_SHIFT; - pgoff_t maxsubpgs = huge_page_size(hpstate) >> PAGE_SHIFT; - struct folio *folio = NULL; - pgoff_t pgidx; - - mapidx <<= huge_page_order(hpstate); - for (pgidx = 0; pgidx < pgcnt; pgidx++) { - if (!folio) { - folio = __filemap_get_folio(memfd->f_mapping, - mapidx, - FGP_ACCESSED, 0); - if (IS_ERR(folio)) - return PTR_ERR(folio); - } - - folio_get(folio); - ubuf->folios[*pgbuf] = folio; - ubuf->offsets[*pgbuf] = subpgoff << PAGE_SHIFT; - (*pgbuf)++; - if (++subpgoff == maxsubpgs) { - folio_put(folio); - folio = NULL; - subpgoff = 0; - mapidx += pages_per_huge_page(hpstate); - } - } - - if (folio) - folio_put(folio); - - return 0; -} - -static int handle_shmem_pages(struct udmabuf *ubuf, struct file *memfd, - pgoff_t offset, pgoff_t pgcnt, - pgoff_t *pgbuf) -{ - pgoff_t pgidx, pgoff = offset >> PAGE_SHIFT; - struct folio *folio = NULL; - - for (pgidx = 0; pgidx < pgcnt; pgidx++) { - folio = shmem_read_folio(memfd->f_mapping, pgoff + pgidx); - if (IS_ERR(folio)) - return PTR_ERR(folio); - - ubuf->folios[*pgbuf] = folio; - (*pgbuf)++; - } - - return 0; -} - static int check_memfd_seals(struct file *memfd) { int seals; @@ -323,17 +297,20 @@ static long udmabuf_create(struct miscdevice *device, struct udmabuf_create_list *head, struct udmabuf_create_item *list) { - pgoff_t pgcnt, pgbuf = 0, pglimit; + pgoff_t pgoff, pgcnt, pglimit, pgbuf = 0; + long nr_folios, ret = -EINVAL; struct file *memfd = NULL; + struct folio **folios; struct udmabuf *ubuf; struct dma_buf *dmabuf; - int ret = -EINVAL; - u32 i, flags; + u32 i, j, k, flags; + loff_t end; ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL); if (!ubuf) return -ENOMEM; + INIT_LIST_HEAD(&ubuf->unpin_list); pglimit = (size_limit_mb * 1024 * 1024) >> PAGE_SHIFT; for (i = 0; i < head->count; i++) { if (!IS_ALIGNED(list[i].offset, PAGE_SIZE)) @@ -369,17 +346,46 @@ static long udmabuf_create(struct miscdevice *device, goto err; pgcnt = list[i].size >> PAGE_SHIFT; - if (is_file_hugepages(memfd)) - ret = handle_hugetlb_pages(ubuf, memfd, - list[i].offset, - pgcnt, &pgbuf); - else - ret = handle_shmem_pages(ubuf, memfd, - list[i].offset, - pgcnt, &pgbuf); - if (ret < 0) + folios = kvmalloc_array(pgcnt, sizeof(*folios), GFP_KERNEL); + if (!folios) { + ret = -ENOMEM; goto err; + } + end = list[i].offset + (pgcnt << PAGE_SHIFT) - 1; + ret = memfd_pin_folios(memfd, list[i].offset, end, + folios, pgcnt, &pgoff); + if (ret <= 0) { + kvfree(folios); + if (!ret) + ret = -EINVAL; + goto err; + } + + nr_folios = ret; + pgoff >>= PAGE_SHIFT; + for (j = 0, k = 0; j < pgcnt; j++) { + ubuf->folios[pgbuf] = folios[k]; + ubuf->offsets[pgbuf] = pgoff << PAGE_SHIFT; + + if (j == 0 || ubuf->folios[pgbuf-1] != folios[k]) { + ret = add_to_unpin_list(&ubuf->unpin_list, + folios[k]); + if (ret < 0) { + 
kvfree(folios);
+					goto err;
+				}
+			}
+
+			pgbuf++;
+			if (++pgoff == folio_nr_pages(folios[k])) {
+				pgoff = 0;
+				if (++k == nr_folios)
+					break;
+			}
+		}
+
+		kvfree(folios);
 		fput(memfd);
 		memfd = NULL;
 	}
@@ -403,10 +409,9 @@ static long udmabuf_create(struct miscdevice *device,
 	return ret;
 
 err:
-	while (pgbuf > 0)
-		folio_put(ubuf->folios[--pgbuf]);
 	if (memfd)
 		fput(memfd);
+	unpin_all_folios(&ubuf->unpin_list);
 	kvfree(ubuf->offsets);
 	kvfree(ubuf->folios);
 	kfree(ubuf);

From 8dd580c15c898b1660a2934336cdfb1e39d3bbdd Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Tue, 6 May 2025 12:04:01 -0400
Subject: [PATCH 07/11] selftests/udmabuf: add tests to verify data after page
 migration

JIRA: https://issues.redhat.com/browse/RHEL-89519

commit 8d42e2a91dcf86b34461cd7f709797805afa9f43
Author: Vivek Kasireddy
Date: Sun, 23 Jun 2024 23:36:17 -0700

selftests/udmabuf: add tests to verify data after page migration

Since the memfd pages associated with a udmabuf may be migrated as part
of udmabuf create, we need to verify the data coherency after successful
migration. The new tests added in this patch try to do just that using
4k sized pages and also 2 MB sized huge pages for the memfd.

Successful completion of the tests would mean that there is no
disconnect between the memfd pages and the ones associated with a
udmabuf. And, these tests can also be augmented in the future to test
newer udmabuf features (such as handling memfd hole punch).

The idea for these tests comes from a patch by Mike Kravetz here:
https://lists.freedesktop.org/archives/dri-devel/2023-June/410623.html

v1->v2: (suggestions from Shuah)
- Use ksft_* functions to print and capture results of tests
- Use appropriate KSFT_* status codes for exit()
- Add Mike Kravetz's suggested-by tag

Link: https://lkml.kernel.org/r/20240624063952.1572359-10-vivek.kasireddy@intel.com
Signed-off-by: Vivek Kasireddy
Suggested-by: Mike Kravetz
Acked-by: Dave Airlie
Acked-by: Gerd Hoffmann
Cc: Shuah Khan
Cc: David Hildenbrand
Cc: Daniel Vetter
Cc: Hugh Dickins
Cc: Peter Xu
Cc: Jason Gunthorpe
Cc: Dongwon Kim
Cc: Junxiao Chang
Cc: Arnd Bergmann
Cc: Christoph Hellwig
Cc: Christoph Hellwig
Cc: Matthew Wilcox (Oracle)
Cc: Oscar Salvador
Signed-off-by: Andrew Morton

Signed-off-by: Waiman Long
---
 .../selftests/drivers/dma-buf/udmabuf.c       | 214 +++++++++++++++---
 1 file changed, 183 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf.c b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
index c812080e304e..6062723a172e 100644
--- a/tools/testing/selftests/drivers/dma-buf/udmabuf.c
+++ b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
@@ -9,52 +9,162 @@
 #include <errno.h>
 #include <fcntl.h>
 #include <malloc.h>
+#include <stdbool.h>
 
 #include <sys/ioctl.h>
 #include <sys/syscall.h>
+#include <sys/mman.h>
 #include <linux/memfd.h>
 #include <linux/udmabuf.h>
+#include "../../kselftest.h"
 
 #define TEST_PREFIX	"drivers/dma-buf/udmabuf"
 #define NUM_PAGES       4
+#define NUM_ENTRIES     4
+#define MEMFD_SIZE      1024 /* in pages */
 
-static int memfd_create(const char *name, unsigned int flags)
+static unsigned int page_size;
+
+static int create_memfd_with_seals(off64_t size, bool hpage)
 {
-	return syscall(__NR_memfd_create, name, flags);
+	int memfd, ret;
+	unsigned int flags = MFD_ALLOW_SEALING;
+
+	if (hpage)
+		flags |= MFD_HUGETLB;
+
+	memfd = memfd_create("udmabuf-test", flags);
+	if (memfd < 0) {
+		ksft_print_msg("%s: [skip,no-memfd]\n", TEST_PREFIX);
+		exit(KSFT_SKIP);
+	}
+
+	ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
+	if (ret < 0) {
+		ksft_print_msg("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
+		exit(KSFT_SKIP);
+	}
+
+	ret = ftruncate(memfd, size);
+	if (ret
== -1) { + ksft_print_msg("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); + exit(KSFT_FAIL); + } + + return memfd; +} + +static int create_udmabuf_list(int devfd, int memfd, off64_t memfd_size) +{ + struct udmabuf_create_list *list; + int ubuf_fd, i; + + list = malloc(sizeof(struct udmabuf_create_list) + + sizeof(struct udmabuf_create_item) * NUM_ENTRIES); + if (!list) { + ksft_print_msg("%s: [FAIL, udmabuf-malloc]\n", TEST_PREFIX); + exit(KSFT_FAIL); + } + + for (i = 0; i < NUM_ENTRIES; i++) { + list->list[i].memfd = memfd; + list->list[i].offset = i * (memfd_size / NUM_ENTRIES); + list->list[i].size = getpagesize() * NUM_PAGES; + } + + list->count = NUM_ENTRIES; + list->flags = UDMABUF_FLAGS_CLOEXEC; + ubuf_fd = ioctl(devfd, UDMABUF_CREATE_LIST, list); + free(list); + if (ubuf_fd < 0) { + ksft_print_msg("%s: [FAIL, udmabuf-create]\n", TEST_PREFIX); + exit(KSFT_FAIL); + } + + return ubuf_fd; +} + +static void write_to_memfd(void *addr, off64_t size, char chr) +{ + int i; + + for (i = 0; i < size / page_size; i++) { + *((char *)addr + (i * page_size)) = chr; + } +} + +static void *mmap_fd(int fd, off64_t size) +{ + void *addr; + + addr = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + if (addr == MAP_FAILED) { + ksft_print_msg("%s: ubuf_fd mmap fail\n", TEST_PREFIX); + exit(KSFT_FAIL); + } + + return addr; +} + +static int compare_chunks(void *addr1, void *addr2, off64_t memfd_size) +{ + off64_t off; + int i = 0, j, k = 0, ret = 0; + char char1, char2; + + while (i < NUM_ENTRIES) { + off = i * (memfd_size / NUM_ENTRIES); + for (j = 0; j < NUM_PAGES; j++, k++) { + char1 = *((char *)addr1 + off + (j * getpagesize())); + char2 = *((char *)addr2 + (k * getpagesize())); + if (char1 != char2) { + ret = -1; + goto err; + } + } + i++; + } +err: + munmap(addr1, memfd_size); + munmap(addr2, NUM_ENTRIES * NUM_PAGES * getpagesize()); + return ret; } int main(int argc, char *argv[]) { struct udmabuf_create create; int devfd, memfd, buf, ret; - off_t size; - void *mem; + off64_t size; + void *addr1, *addr2; + + ksft_print_header(); + ksft_set_plan(6); devfd = open("/dev/udmabuf", O_RDWR); if (devfd < 0) { - printf("%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n", - TEST_PREFIX); - exit(77); + ksft_print_msg( + "%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n", + TEST_PREFIX); + exit(KSFT_SKIP); } memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING); if (memfd < 0) { - printf("%s: [skip,no-memfd]\n", TEST_PREFIX); - exit(77); + ksft_print_msg("%s: [skip,no-memfd]\n", TEST_PREFIX); + exit(KSFT_SKIP); } ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK); if (ret < 0) { - printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX); - exit(77); + ksft_print_msg("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX); + exit(KSFT_SKIP); } - size = getpagesize() * NUM_PAGES; ret = ftruncate(memfd, size); if (ret == -1) { - printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); - exit(1); + ksft_print_msg("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); + exit(KSFT_FAIL); } memset(&create, 0, sizeof(create)); @@ -64,44 +174,86 @@ int main(int argc, char *argv[]) create.offset = getpagesize()/2; create.size = getpagesize(); buf = ioctl(devfd, UDMABUF_CREATE, &create); - if (buf >= 0) { - printf("%s: [FAIL,test-1]\n", TEST_PREFIX); - exit(1); - } + if (buf >= 0) + ksft_test_result_fail("%s: [FAIL,test-1]\n", TEST_PREFIX); + else + ksft_test_result_pass("%s: [PASS,test-1]\n", TEST_PREFIX); /* should fail (size not multiple of page) */ create.memfd = memfd; create.offset = 0; create.size = 
getpagesize()/2;
 	buf = ioctl(devfd, UDMABUF_CREATE, &create);
-	if (buf >= 0) {
-		printf("%s: [FAIL,test-2]\n", TEST_PREFIX);
-		exit(1);
-	}
+	if (buf >= 0)
+		ksft_test_result_fail("%s: [FAIL,test-2]\n", TEST_PREFIX);
+	else
+		ksft_test_result_pass("%s: [PASS,test-2]\n", TEST_PREFIX);
 
 	/* should fail (not memfd) */
 	create.memfd  = 0; /* stdin */
 	create.offset = 0;
 	create.size   = size;
 	buf = ioctl(devfd, UDMABUF_CREATE, &create);
-	if (buf >= 0) {
-		printf("%s: [FAIL,test-3]\n", TEST_PREFIX);
-		exit(1);
-	}
+	if (buf >= 0)
+		ksft_test_result_fail("%s: [FAIL,test-3]\n", TEST_PREFIX);
+	else
+		ksft_test_result_pass("%s: [PASS,test-3]\n", TEST_PREFIX);
 
 	/* should work */
+	page_size = getpagesize();
+	addr1 = mmap_fd(memfd, size);
+	write_to_memfd(addr1, size, 'a');
 	create.memfd  = memfd;
 	create.offset = 0;
 	create.size   = size;
 	buf = ioctl(devfd, UDMABUF_CREATE, &create);
-	if (buf < 0) {
-		printf("%s: [FAIL,test-4]\n", TEST_PREFIX);
-		exit(1);
-	}
+	if (buf < 0)
+		ksft_test_result_fail("%s: [FAIL,test-4]\n", TEST_PREFIX);
+	else
+		ksft_test_result_pass("%s: [PASS,test-4]\n", TEST_PREFIX);
+
+	munmap(addr1, size);
+	close(buf);
+	close(memfd);
+
+	/* should work (migration of 4k size pages) */
+	size = MEMFD_SIZE * page_size;
+	memfd = create_memfd_with_seals(size, false);
+	addr1 = mmap_fd(memfd, size);
+	write_to_memfd(addr1, size, 'a');
+	buf = create_udmabuf_list(devfd, memfd, size);
+	addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
+	write_to_memfd(addr1, size, 'b');
+	ret = compare_chunks(addr1, addr2, size);
+	if (ret < 0)
+		ksft_test_result_fail("%s: [FAIL,test-5]\n", TEST_PREFIX);
+	else
+		ksft_test_result_pass("%s: [PASS,test-5]\n", TEST_PREFIX);
+
+	close(buf);
+	close(memfd);
+
+	/* should work (migration of 2MB size huge pages) */
+	page_size = getpagesize() * 512; /* 2 MB */
+	size = MEMFD_SIZE * page_size;
+	memfd = create_memfd_with_seals(size, true);
+	addr1 = mmap_fd(memfd, size);
+	write_to_memfd(addr1, size, 'a');
+	buf = create_udmabuf_list(devfd, memfd, size);
+	addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
+	write_to_memfd(addr1, size, 'b');
+	ret = compare_chunks(addr1, addr2, size);
+	if (ret < 0)
+		ksft_test_result_fail("%s: [FAIL,test-6]\n", TEST_PREFIX);
+	else
+		ksft_test_result_pass("%s: [PASS,test-6]\n", TEST_PREFIX);
 
-	fprintf(stderr, "%s: ok\n", TEST_PREFIX);
 	close(buf);
 	close(memfd);
 	close(devfd);
+
+	ksft_print_msg("%s: ok\n", TEST_PREFIX);
+	ksft_print_cnts();
+
 	return 0;
 }

From 100c8a5637418769966e7a3ca5787a9bb2c20429 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Tue, 6 May 2025 12:04:02 -0400
Subject: [PATCH 08/11] udmabuf: pre-fault when first page fault

JIRA: https://issues.redhat.com/browse/RHEL-89519

commit f0bbcc258e81288212c2092c587ae06428196598
Author: Huan Yang
Date: Wed, 18 Sep 2024 10:52:24 +0800

udmabuf: pre-fault when first page fault

The current udmabuf mmap only fills the physical memory into the
corresponding virtual address when the user actually accesses that
address. However, the udmabuf has already obtained and pinned the folios
upon completion of the creation. This means that the physical memory has
already been acquired, rather than being accessed dynamically, so the
page fault has lost its purpose as demand paging.

Because each page fault requires trapping into kernel mode to fill in
the mapping when the corresponding virtual address is accessed, this
represents a considerable overhead when creating a large udmabuf.
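To make the claim concrete, here is a hedged userspace sketch of the access pattern the patch speeds up; it is illustrative only, with addr and size assumed to come from mmap'ing a udmabuf as in the selftest above:

/* Touch every page of a large udmabuf mapping. Before this patch,
 * each iteration takes its own page fault; after it, the first fault
 * pre-populates the rest of the VMA, so the loop runs without
 * further traps into the kernel.
 */
static void touch_all_pages(volatile char *addr, size_t size,
			    size_t page_size)
{
	size_t off;

	for (off = 0; off < size; off += page_size)
		(void)addr[off];	/* a read faults the page in */
}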
This patch fills the pfns into the page table up front, pre-faulting the
rest of the vma when the first page fault occurs.

Note that if anything goes wrong, we do not return an error during this
pre-fault step; an error is only returned if the failure occurs when the
address is actually accessed.

Suggested-by: Vivek Kasireddy
Signed-off-by: Huan Yang
Acked-by: Vivek Kasireddy
Signed-off-by: Vivek Kasireddy
Link: https://patchwork.freedesktop.org/patch/msgid/20240918025238.2957823-2-link@vivo.com
Signed-off-by: Waiman Long
---
 drivers/dma-buf/udmabuf.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 3384c7dfd718..eea807c3e219 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -43,7 +43,8 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct udmabuf *ubuf = vma->vm_private_data;
 	pgoff_t pgoff = vmf->pgoff;
-	unsigned long pfn;
+	unsigned long addr, pfn;
+	vm_fault_t ret;
 
 	if (pgoff >= ubuf->pagecount)
 		return VM_FAULT_SIGBUS;
@@ -51,7 +52,35 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
 	pfn = folio_pfn(ubuf->folios[pgoff]);
 	pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
 
-	return vmf_insert_pfn(vma, vmf->address, pfn);
+	ret = vmf_insert_pfn(vma, vmf->address, pfn);
+	if (ret & VM_FAULT_ERROR)
+		return ret;
+
+	/* pre fault */
+	pgoff = vma->vm_pgoff;
+	addr = vma->vm_start;
+
+	for (; addr < vma->vm_end; pgoff++, addr += PAGE_SIZE) {
+		if (addr == vmf->address)
+			continue;
+
+		if (WARN_ON(pgoff >= ubuf->pagecount))
+			break;
+
+		pfn = folio_pfn(ubuf->folios[pgoff]);
+		pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
+
+		/**
+		 * If the below vmf_insert_pfn() fails, we do not return an
+		 * error here during this pre-fault step. However, an error
+		 * will be returned if the failure occurs when the addr is
+		 * truly accessed.
+		 */
+		if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR)
+			break;
+	}
+
+	return ret;
 }
 
 static const struct vm_operations_struct udmabuf_vm_ops = {

From 3c90d8bbaccd31bc651b378e23cbfdc89abd766b Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Tue, 6 May 2025 12:04:02 -0400
Subject: [PATCH 09/11] udmabuf: udmabuf_create pin folio codestyle cleanup

JIRA: https://issues.redhat.com/browse/RHEL-89519

Conflicts: A merge conflict in a udmabuf_create() hunk due to the
presence of a later upstream commit f49856f525ac ("udmabuf: fix memory
leak on last export_udmabuf() error path").

commit 164fd9efd46531fddfaa933d394569259896642b
Author: Huan Yang
Date: Wed, 18 Sep 2024 10:52:27 +0800

udmabuf: udmabuf_create pin folio codestyle cleanup

This patch aims to simplify the memfd folio pinning during udmabuf
creation. No functional changes.

It creates a udmabuf_pin_folios() function which does the memfd folio
pinning and then records each pinned folio and offset.

The pinned-folio bookkeeping itself is also simplified: iterate over
each pinned folio and record every offset within it, which is more
readable than iterating by pgcnt.
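For orientation before the hunks below, the pinning API that both this cleanup and patch 06 revolve around has roughly this shape. This is an editorial reference sketch of the memfd_pin_folios() declaration from mm/gup.c; verify the kerneldoc in the target tree before relying on the details:

/* Pins the folios backing the byte range [start, end] of a memfd.
 * Stores up to max_folios folio pointers in @folios and returns how
 * many were pinned (negative errno on failure). @offset receives the
 * byte offset of @start within the first pinned folio, which is why
 * the callers in this series either shift it by PAGE_SHIFT or use it
 * directly as a subpage offset.
 */
long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
		      struct folio **folios, unsigned int max_folios,
		      pgoff_t *offset);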
Suggested-by: Vivek Kasireddy Signed-off-by: Huan Yang Acked-by: Vivek Kasireddy Signed-off-by: Vivek Kasireddy Link: https://patchwork.freedesktop.org/patch/msgid/20240918025238.2957823-5-link@vivo.com Signed-off-by: Waiman Long --- drivers/dma-buf/udmabuf.c | 143 +++++++++++++++++++++----------------- 1 file changed, 79 insertions(+), 64 deletions(-) diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c index eea807c3e219..caa6238d5ba8 100644 --- a/drivers/dma-buf/udmabuf.c +++ b/drivers/dma-buf/udmabuf.c @@ -291,9 +291,6 @@ static int check_memfd_seals(struct file *memfd) { int seals; - if (!memfd) - return -EBADFD; - if (!shmem_file(memfd) && !is_file_hugepages(memfd)) return -EBADFD; @@ -322,18 +319,69 @@ static struct dma_buf *export_udmabuf(struct udmabuf *ubuf, return dma_buf_export(&exp_info); } +static long udmabuf_pin_folios(struct udmabuf *ubuf, struct file *memfd, + loff_t start, loff_t size) +{ + pgoff_t pgoff, pgcnt, upgcnt = ubuf->pagecount; + struct folio **folios = NULL; + u32 cur_folio, cur_pgcnt; + long nr_folios; + long ret = 0; + loff_t end; + + pgcnt = size >> PAGE_SHIFT; + folios = kvmalloc_array(pgcnt, sizeof(*folios), GFP_KERNEL); + if (!folios) + return -ENOMEM; + + end = start + (pgcnt << PAGE_SHIFT) - 1; + nr_folios = memfd_pin_folios(memfd, start, end, folios, pgcnt, &pgoff); + if (nr_folios <= 0) { + ret = nr_folios ? nr_folios : -EINVAL; + goto end; + } + + cur_pgcnt = 0; + for (cur_folio = 0; cur_folio < nr_folios; ++cur_folio) { + pgoff_t subpgoff = pgoff; + size_t fsize = folio_size(folios[cur_folio]); + + ret = add_to_unpin_list(&ubuf->unpin_list, folios[cur_folio]); + if (ret < 0) + goto end; + + for (; subpgoff < fsize; subpgoff += PAGE_SIZE) { + ubuf->folios[upgcnt] = folios[cur_folio]; + ubuf->offsets[upgcnt] = subpgoff; + ++upgcnt; + + if (++cur_pgcnt >= pgcnt) + goto end; + } + + /** + * In a given range, only the first subpage of the first folio + * has an offset, that is returned by memfd_pin_folios(). + * The first subpages of other folios (in the range) have an + * offset of 0. 
+	 */
+		pgoff = 0;
+	}
+end:
+	ubuf->pagecount = upgcnt;
+	kvfree(folios);
+	return ret;
+}
+
 static long udmabuf_create(struct miscdevice *device,
 			   struct udmabuf_create_list *head,
 			   struct udmabuf_create_item *list)
 {
-	pgoff_t pgoff, pgcnt, pglimit, pgbuf = 0;
-	long nr_folios, ret = -EINVAL;
-	struct file *memfd = NULL;
-	struct folio **folios;
+	pgoff_t pgcnt = 0, pglimit;
 	struct udmabuf *ubuf;
 	struct dma_buf *dmabuf;
-	u32 i, j, k, flags;
-	loff_t end;
+	long ret = -EINVAL;
+	u32 i, flags;
 
 	ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL);
 	if (!ubuf)
@@ -342,81 +390,50 @@ static long udmabuf_create(struct miscdevice *device,
 	INIT_LIST_HEAD(&ubuf->unpin_list);
 	pglimit = (size_limit_mb * 1024 * 1024) >> PAGE_SHIFT;
 	for (i = 0; i < head->count; i++) {
-		if (!IS_ALIGNED(list[i].offset, PAGE_SIZE))
+		if (!PAGE_ALIGNED(list[i].offset))
 			goto err;
-		if (!IS_ALIGNED(list[i].size, PAGE_SIZE))
+		if (!PAGE_ALIGNED(list[i].size))
 			goto err;
-		ubuf->pagecount += list[i].size >> PAGE_SHIFT;
-		if (ubuf->pagecount > pglimit)
+
+		pgcnt += list[i].size >> PAGE_SHIFT;
+		if (pgcnt > pglimit)
 			goto err;
 	}
 
-	if (!ubuf->pagecount)
+	if (!pgcnt)
 		goto err;
 
-	ubuf->folios = kvmalloc_array(ubuf->pagecount, sizeof(*ubuf->folios),
-				      GFP_KERNEL);
+	ubuf->folios = kvmalloc_array(pgcnt, sizeof(*ubuf->folios), GFP_KERNEL);
 	if (!ubuf->folios) {
 		ret = -ENOMEM;
 		goto err;
 	}
-	ubuf->offsets = kvcalloc(ubuf->pagecount, sizeof(*ubuf->offsets),
-				 GFP_KERNEL);
+
+	ubuf->offsets = kvcalloc(pgcnt, sizeof(*ubuf->offsets), GFP_KERNEL);
 	if (!ubuf->offsets) {
 		ret = -ENOMEM;
 		goto err;
 	}
 
-	pgbuf = 0;
 	for (i = 0; i < head->count; i++) {
-		memfd = fget(list[i].memfd);
+		struct file *memfd = fget(list[i].memfd);
+
+		if (!memfd) {
+			ret = -EBADFD;
+			goto err;
+		}
+
 		ret = check_memfd_seals(memfd);
-		if (ret < 0)
-			goto err;
-
-		pgcnt = list[i].size >> PAGE_SHIFT;
-		folios = kvmalloc_array(pgcnt, sizeof(*folios), GFP_KERNEL);
-		if (!folios) {
-			ret = -ENOMEM;
+		if (ret < 0) {
+			fput(memfd);
 			goto err;
 		}
 
-		end = list[i].offset + (pgcnt << PAGE_SHIFT) - 1;
-		ret = memfd_pin_folios(memfd, list[i].offset, end,
-				       folios, pgcnt, &pgoff);
-		if (ret <= 0) {
-			kvfree(folios);
-			if (!ret)
-				ret = -EINVAL;
-			goto err;
-		}
-
-		nr_folios = ret;
-		pgoff >>= PAGE_SHIFT;
-		for (j = 0, k = 0; j < pgcnt; j++) {
-			ubuf->folios[pgbuf] = folios[k];
-			ubuf->offsets[pgbuf] = pgoff << PAGE_SHIFT;
-
-			if (j == 0 || ubuf->folios[pgbuf-1] != folios[k]) {
-				ret = add_to_unpin_list(&ubuf->unpin_list,
-							folios[k]);
-				if (ret < 0) {
-					kvfree(folios);
-					goto err;
-				}
-			}
-
-			pgbuf++;
-			if (++pgoff == folio_nr_pages(folios[k])) {
-				pgoff = 0;
-				if (++k == nr_folios)
-					break;
-			}
-		}
-
-		kvfree(folios);
+		ret = udmabuf_pin_folios(ubuf, memfd, list[i].offset,
+					 list[i].size);
 		fput(memfd);
-		memfd = NULL;
+		if (ret)
+			goto err;
 	}
 
 	flags = head->flags & UDMABUF_FLAGS_CLOEXEC ?
O_CLOEXEC : 0;
@@ -438,8 +455,6 @@ static long udmabuf_create(struct miscdevice *device,
 	return ret;
 
 err:
-	if (memfd)
-		fput(memfd);
 	unpin_all_folios(&ubuf->unpin_list);
 	kvfree(ubuf->offsets);
 	kvfree(ubuf->folios);

From 66665c8d25b3a5743e772a0c1db3614d479d1c1d Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Tue, 6 May 2025 12:04:03 -0400
Subject: [PATCH 10/11] udmabuf: fix racy memfd sealing check

JIRA: https://issues.redhat.com/browse/RHEL-89519

Conflicts: A merge conflict due to missing upstream commit c87a1268e9c5
("udmabuf: reuse folio array when pin folios")

commit 9cb189a882738c1d28b349d4e7c6a1ef9b3d8f87
Author: Jann Horn
Date: Wed, 4 Dec 2024 17:26:19 +0100

udmabuf: fix racy memfd sealing check

The current check_memfd_seals() is racy: since we first do
check_memfd_seals() and then udmabuf_pin_folios() without holding any
relevant lock across both, F_SEAL_WRITE can be set in between. This is
problematic because we can end up holding pins to pages in a
write-sealed memfd.

Fix it using the inode lock; that's probably the easiest way. In the
future, we might want to consider moving this logic into memfd,
especially if anyone else wants to use memfd_pin_folios().

Reported-by: Julian Orth
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219106
Closes: https://lore.kernel.org/r/CAG48ez0w8HrFEZtJkfmkVKFDhE5aP7nz=obrimeTgpD+StkV9w@mail.gmail.com
Fixes: fbb0de795078 ("Add udmabuf misc device")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn
Acked-by: Joel Fernandes (Google)
Acked-by: Vivek Kasireddy
Signed-off-by: Vivek Kasireddy
Link: https://patchwork.freedesktop.org/patch/msgid/20241204-udmabuf-fixes-v2-1-23887289de1c@google.com
Signed-off-by: Waiman Long
---
 drivers/dma-buf/udmabuf.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index caa6238d5ba8..ce7b08f4723f 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -423,14 +423,19 @@ static long udmabuf_create(struct miscdevice *device,
 			goto err;
 		}
 
+		/*
+		 * Take the inode lock to protect against concurrent
+		 * memfd_add_seals(), which takes this lock in write mode.
+		 */
+		inode_lock_shared(file_inode(memfd));
 		ret = check_memfd_seals(memfd);
-		if (ret < 0) {
-			fput(memfd);
-			goto err;
-		}
+		if (ret)
+			goto out_unlock;
 
 		ret = udmabuf_pin_folios(ubuf, memfd, list[i].offset,
 					 list[i].size);
+out_unlock:
+		inode_unlock_shared(file_inode(memfd));
 		fput(memfd);
 		if (ret)
 			goto err;

From 400dd9d9c69e2d53364680e9d533f03656e409f7 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Tue, 6 May 2025 12:04:03 -0400
Subject: [PATCH 11/11] mm: gup: fix infinite loop within
 __get_longterm_locked

JIRA: https://issues.redhat.com/browse/RHEL-89519

commit 1aaf8c122918aa8897605a9aa1e8ed6600d6f930
Author: Zhaoyang Huang
Date: Tue, 21 Jan 2025 10:01:59 +0800

mm: gup: fix infinite loop within __get_longterm_locked

We can run into an infinite loop in __get_longterm_locked() when
collect_longterm_unpinnable_folios() finds only folios that are isolated
from the LRU or were never added to the LRU. This can happen when all
folios to be pinned are never added to the LRU, for example when
vm_ops->fault allocated pages using cma_alloc() and never added them to
the LRU.

Fix it by simply taking a look at the list in the single caller, to see
if anything was added.
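Before the diff, the fixed control flow may be easier to read in isolation (an editorial restatement of the hunks below, not new code):

LIST_HEAD(movable_folio_list);

collect_longterm_unpinnable_folios(&movable_folio_list, pofs);
/* The old code keyed off a "collected" count, which could be nonzero
 * even when nothing was actually added to the list (unpinnable folios
 * that were never on the LRU, so isolation adds nothing). Migration
 * then made no progress and the caller retried forever. The list
 * itself cannot mis-report emptiness. */
if (list_empty(&movable_folio_list))
	return 0;

return migrate_longterm_unpinnable_folios(&movable_folio_list, pofs);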
[zhaoyang.huang@unisoc.com: move definition of local] Link: https://lkml.kernel.org/r/20250122012604.3654667-1-zhaoyang.huang@unisoc.com Link: https://lkml.kernel.org/r/20250121020159.3636477-1-zhaoyang.huang@unisoc.com Fixes: 67e139b02d99 ("mm/gup.c: refactor check_and_migrate_movable_pages()") Signed-off-by: Zhaoyang Huang Reviewed-by: John Hubbard Reviewed-by: David Hildenbrand Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Cc: Aijun Sun Cc: Alistair Popple Cc: Signed-off-by: Andrew Morton Signed-off-by: Waiman Long --- mm/gup.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 972ab9bffa8d..c857e952e051 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2069,13 +2069,13 @@ static void pofs_unpin(struct pages_or_folios *pofs) /* * Returns the number of collected folios. Return value is always >= 0. */ -static unsigned long collect_longterm_unpinnable_folios( +static void collect_longterm_unpinnable_folios( struct list_head *movable_folio_list, struct pages_or_folios *pofs) { - unsigned long i, collected = 0; struct folio *prev_folio = NULL; bool drain_allow = true; + unsigned long i; for (i = 0; i < pofs->nr_entries; i++) { struct folio *folio = pofs_get_folio(pofs, i); @@ -2087,8 +2087,6 @@ static unsigned long collect_longterm_unpinnable_folios( if (folio_is_longterm_pinnable(folio)) continue; - collected++; - if (folio_is_device_coherent(folio)) continue; @@ -2110,8 +2108,6 @@ static unsigned long collect_longterm_unpinnable_folios( NR_ISOLATED_ANON + folio_is_file_lru(folio), folio_nr_pages(folio)); } - - return collected; } /* @@ -2188,11 +2184,9 @@ static long check_and_migrate_movable_pages_or_folios(struct pages_or_folios *pofs) { LIST_HEAD(movable_folio_list); - unsigned long collected; - collected = collect_longterm_unpinnable_folios(&movable_folio_list, - pofs); - if (!collected) + collect_longterm_unpinnable_folios(&movable_folio_list, pofs); + if (list_empty(&movable_folio_list)) return 0; return migrate_longterm_unpinnable_folios(&movable_folio_list, pofs);