linux-kernelorg-stable/kernel
Yujie Liu 66f3fc7411 sched/numa: Fix the vma scan starving issue
[ Upstream commit f22cde4371 ]

Problem statement:
Since commit fc137c0dda ("sched/numa: enhance vma scanning logic"), the
Numa vma scan overhead has been reduced a lot.  Meanwhile, the reducing of
the vma scan might create less Numa page fault information.  The
insufficient information makes it harder for the Numa balancer to make
decision.  Later, commit b7a5b537c5 ("sched/numa: Complete scanning of
partial VMAs regardless of PID activity") and commit 84db47ca71
("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found to
bring back part of the performance.

Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a
long duration of remote Numa node read was observed by PMU events: A few
cores having ~500MB/s remote memory access for ~20 seconds.  It causes
high core-to-core variance and performance penalty.  After the
investigation, it is found that many vmas are skipped due to the active
PID check.  According to the trace events, in most cases,
vma_is_accessed() returns false because the history access info stored in
pids_active array has been cleared.

Proposal:
The main idea is to adjust vma_is_accessed() to let it return true easier.
Thus compare the diff between mm->numa_scan_seq and
vma->numab_state->prev_scan_seq.  If the diff has exceeded the threshold,
scan the vma.

This patch especially helps the cases where there are small number of
threads, like the process-based SPECcpu.  Without this patch, if the
SPECcpu process access the vma at the beginning, then sleeps for a long
time, the pid_active array will be cleared.  A a result, if this process
is woken up again, it never has a chance to set prot_none anymore.
Because only the first 2 times of access is granted for vma scan:
(current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2 to be
worse, no other threads within the task can help set the prot_none.  This
causes information lost.

Raghavendra helped test current patch and got the positive result
on the AMD platform:

autonumabench NUMA01
                            base                  patched
Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*

Duration User      380345.36   368252.04
Duration System      1358.89     1156.23
Duration Elapsed     2277.45     2213.25

autonumabench NUMA02

Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*

Duration User        1513.23     1575.48
Duration System         8.33        8.13
Duration Elapsed       28.59       29.71

kernbench

Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*

Duration User       68816.41    67615.74
Duration System     21873.94    22848.08
Duration Elapsed      506.66      504.55

Intel 256 CPUs/2 Sockets:
autonuma benchmark also shows improvements:

                                               v6.10-rc5              v6.10-rc5
                                                                         +patch
Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*

                   v6.10-rc5   v6.10-rc5
                                  +patch
Duration User     1064010.16  1075416.23
Duration System      3307.64     3104.66
Duration Elapsed     4537.54     4604.73

The SPECcpu remote node access issue disappears with the patch applied.

Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com
Fixes: fc137c0dda ("sched/numa: enhance vma scanning logic")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-04 16:29:22 +02:00
..
bpf bpf: correctly handle malformed BPF_CORE_TYPE_ID_LOCAL relos 2024-10-04 16:29:19 +02:00
cgroup cgroup: Protect css->cgroup write under css_set_lock 2024-09-12 11:11:35 +02:00
configs Kbuild updates for v6.6 2023-09-05 11:01:47 -07:00
debug kdb: Use the passed prompt in kdb_position_cursor() 2024-08-03 08:54:34 +02:00
dma dma-mapping: benchmark: Don't starve others when doing the test 2024-09-12 11:11:37 +02:00
entry entry: Respect changes to system call number by trace_sys_enter() 2024-04-03 15:28:50 +02:00
events perf/aux: Fix AUX buffer serialization 2024-09-12 11:11:42 +02:00
futex futex: Don't include process MM in futex key on no-MMU 2023-11-20 11:58:53 +01:00
gcov gcov: add support for GCC 14 2024-06-27 13:49:13 +02:00
irq genirq/cpuhotplug: Retry with cpu_online_mask when migration fails 2024-08-19 06:04:24 +02:00
kcsan
livepatch livepatch: Fix missing newline character in klp_resolve_symbols() 2023-11-20 11:59:25 +01:00
locking rtmutex: Drop rt_mutex::wait_lock before scheduling 2024-09-12 11:11:25 +02:00
module module: make waiting for a concurrent module loader interruptible 2024-08-14 13:58:53 +02:00
power PM: s2idle: Make sure CPUs will wakeup directly on resume 2024-04-17 11:19:26 +02:00
printk printk: For @suppress_panic_printk check for other CPU in panic 2024-04-13 13:07:29 +02:00
rcu rcu/nocb: Fix RT throttling hrtimer armed from offline CPU 2024-10-04 16:29:07 +02:00
sched sched/numa: Fix the vma scan starving issue 2024-10-04 16:29:22 +02:00
time hrtimer: Prevent queuing of hrtimer without a function callback 2024-08-29 17:33:41 +02:00
trace tracing/osnoise: Fix build when timerlat is not enabled 2024-09-18 19:24:09 +02:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.kexec kexec: select CRYPTO from KEXEC_FILE instead of depending on it 2024-01-05 15:19:41 +01:00
Kconfig.locks
Kconfig.preempt
Makefile kernel/numa.c: Move logging out of numa.h 2024-06-12 11:11:50 +02:00
acct.c
async.c async: Introduce async_schedule_dev_nocall() 2024-01-31 16:18:49 -08:00
audit.c audit: Send netlink ACK before setting connection in auditd_set 2024-02-05 20:14:14 +00:00
audit.h
audit_fsnotify.c
audit_tree.c
audit_watch.c audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare() 2023-11-28 17:19:56 +00:00
auditfilter.c ima: Avoid blocking in RCU read-side critical section 2024-07-11 12:49:18 +02:00
auditsc.c audit,io_uring: io_uring openat triggers audit reference count underflow 2023-10-13 18:34:46 +02:00
backtracetest.c
bounds.c bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUS 2024-05-02 16:32:50 +02:00
capability.c
cfi.c
compat.c
configs.c
context_tracking.c
cpu.c cpu/SMT: Enable SMT only if a core is online 2024-08-29 17:33:30 +02:00
cpu_pm.c
crash_core.c mm: turn folio_test_hugetlb into a PageType 2024-05-02 16:32:47 +02:00
crash_dump.c
cred.c cred: get rid of CONFIG_DEBUG_CREDENTIALS 2023-12-20 17:01:51 +01:00
delayacct.c
dma.c
exec_domain.c
exit.c mm: optimize the redundant loop of mm_update_owner_next() 2024-07-11 12:49:15 +02:00
extable.c
fail_function.c
fork.c Revert "fork: defer linking file vma until vma is fully initialized" 2024-06-21 14:38:47 +02:00
freezer.c
gen_kheaders.sh kheaders: explicitly define file modes for archived headers 2024-06-21 14:38:40 +02:00
groups.c
hung_task.c
iomem.c
irq_work.c
jump_label.c jump_label: Fix the fix, brown paper bags galore 2024-08-14 13:58:38 +02:00
kallsyms.c
kallsyms_internal.h
kallsyms_selftest.c
kallsyms_selftest.h
kcmp.c
kcov.c kcov: properly check for softirq context 2024-08-14 13:58:57 +02:00
kexec.c kernel: kexec: copy user-array safely 2023-11-28 17:19:40 +00:00
kexec_core.c kexec: do syscore_shutdown() in kernel_kexec 2024-01-31 16:18:56 -08:00
kexec_elf.c
kexec_file.c kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y 2024-09-12 11:11:27 +02:00
kexec_internal.h
kheaders.c
kprobes.c kprobes: Fix to check symbol prefixes correctly 2024-08-14 13:58:51 +02:00
ksyms_common.c
ksysfs.c
kthread.c kthread: fix task state in kthread worker if being frozen 2024-10-04 16:29:20 +02:00
latencytop.c
module_signature.c
notifier.c
nsproxy.c
numa.c kernel/numa.c: Move logging out of numa.h 2024-06-12 11:11:50 +02:00
padata.c padata: Honor the caller's alignment in case of chunk_size 0 2024-10-04 16:28:52 +02:00
panic.c panic: Flush kernel log buffer at the end 2024-04-13 13:07:29 +02:00
params.c
pid.c pidfd: prevent a kernel-doc warning 2023-09-19 13:21:33 -07:00
pid_namespace.c zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING 2024-06-21 14:38:50 +02:00
pid_sysctl.h
profile.c profiling: remove profile=sleep support 2024-08-14 13:58:47 +02:00
ptrace.c
range.c
reboot.c kernel/reboot: emergency_restart: Set correct system_state 2023-11-28 17:20:04 +00:00
regset.c
relay.c
resource.c x86/kaslr: Expose and use the end of the physical memory address space 2024-09-12 11:11:25 +02:00
resource_kunit.c
rseq.c
scftorture.c
scs.c
seccomp.c
signal.c kernel: rerun task_work while freezing in get_signal() 2024-08-03 08:54:13 +02:00
smp.c smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu() 2024-09-12 11:11:37 +02:00
smpboot.c kthread: add kthread_stop_put 2024-06-12 11:12:52 +02:00
smpboot.h
softirq.c softirq: Fix suspicious RCU usage in __do_softirq() 2024-06-12 11:11:27 +02:00
stackleak.c
stacktrace.c
static_call.c
static_call_inline.c
stop_machine.c
sys.c prctl: generalize PR_SET_MDWE support check to be per-arch 2024-04-03 15:28:54 +02:00
sys_ni.c syscalls: fix compat_sys_io_pgetevents_time64 usage 2024-07-05 09:34:04 +02:00
sysctl-test.c
sysctl.c
task_work.c task_work: Introduce task_work_cancel() again 2024-08-03 08:54:16 +02:00
taskstats.c
torture.c rcutorture: Fix stuttering races and other issues 2023-11-28 17:20:08 +00:00
tracepoint.c
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c
up.c
user-return-notifier.c
user.c
user_namespace.c
usermode_driver.c
utsname.c
utsname_sysctl.c
vhost_task.c vhost_task: Handle SIGKILL by flushing work and exiting 2024-07-11 12:49:10 +02:00
watch_queue.c kernel: watch_queue: copy user-array safely 2023-11-28 17:19:40 +00:00
watchdog.c watchdog: move softlockup_panic back to early_param 2023-11-28 17:19:57 +00:00
watchdog_buddy.c
watchdog_perf.c watchdog/perf: properly initialize the turbo mode timestamp and rearm counter 2024-08-03 08:54:29 +02:00
workqueue.c workqueue: Improve scalability of workqueue watchdog touch 2024-09-12 11:11:42 +02:00
workqueue_internal.h