linux-kernelorg-stable/kernel
Nicholas Piggin 2655421ae6 lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
On big systems, the mm refcount can become highly contented when doing a
lot of context switching with threaded applications.  user<->idle switch
is one of the important cases.  Abandoning lazy tlb entirely slows this
switching down quite a bit in the common uncontended case, so that is not
viable.

Implement a scheme where lazy tlb mm references do not contribute to the
refcount, instead they get explicitly removed when the refcount reaches
zero.

The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
switch away from this mm to init_mm if it was being used as the lazy tlb
mm.  Enabling the shoot lazies option therefore requires that the arch
ensures that mm_cpumask contains all CPUs that could possibly be using mm.
A DEBUG_VM option IPIs every CPU in the system after this to ensure there
are no references remaining before the mm is freed.

Shootdown IPIs cost could be an issue, but they have not been observed to
be a serious problem with this scheme, because short-lived processes tend
not to migrate CPUs much, therefore they don't get much chance to leave
lazy tlb mm references on remote CPUs.  There are a lot of options to
reduce them if necessary, described in comments.

The near-worst-case can be benchmarked with will-it-scale:

  context_switch1_threads -t $(($(nproc) / 2))

This will create nproc threads (nproc / 2 switching pairs) all sharing the
same mm that spread over all CPUs so each CPU does thread->idle->thread
switching.

[ Rik came up with basically the same idea a few years ago, so credit
  to him for that. ]

Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 16:20:08 -07:00
..
bpf bpf: Adjust insufficient default bpf_jit_limit 2023-03-21 12:43:05 -07:00
cgroup Networking changes for 6.3. 2023-02-21 18:24:12 -08:00
configs
debug
dma swiotlb: mark swiotlb_memblock_alloc() as __init 2023-02-22 06:44:48 -08:00
entry entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up 2023-03-21 15:13:15 +01:00
events perf: Fix check before add_event_to_groups() in perf_group_detach() 2023-03-15 21:49:47 +01:00
futex
gcov
irq A set of updates for the interrupt susbsystem: 2023-03-05 11:19:16 -08:00
kcsan kcsan: avoid passing -g for test 2023-03-23 17:18:35 -07:00
livepatch Livepatching changes for 6.3 2023-02-23 14:00:10 -08:00
locking RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
module modules-6.3-rc1 2023-02-23 14:05:08 -08:00
power Merge branches 'powercap', 'pm-domains', 'pm-em' and 'pm-opp' 2023-02-15 20:06:26 +01:00
printk printk changes for 6.3 2023-02-23 13:49:45 -08:00
rcu Merge branch 'stall.2023.01.09a' into HEAD 2023-02-02 16:40:07 -08:00
sched lazy tlb: introduce lazy tlb mm refcount helper functions 2023-03-28 16:20:08 -07:00
time Updates for timekeeping, timers and clockevent/source drivers: 2023-02-21 09:45:13 -08:00
trace Tracing fixes for 6.3: 2023-03-19 10:46:02 -07:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
Makefile
acct.c
async.c
audit.c
audit.h
audit_fsnotify.c
audit_tree.c
audit_watch.c
auditfilter.c
auditsc.c capability: just use a 'u64' instead of a 'u32[2]' array 2023-03-01 10:01:22 -08:00
backtracetest.c
bounds.c
capability.c capability: just use a 'u64' instead of a 'u32[2]' array 2023-03-01 10:01:22 -08:00
cfi.c
compat.c sched_getaffinity: don't assume 'cpumask_size()' is fully initialized 2023-03-14 19:32:38 -07:00
configs.c
context_tracking.c context_tracking: Fix noinstr vs KASAN 2023-01-13 11:48:18 +01:00
cpu.c lazy tlb: introduce lazy tlb mm refcount helper functions 2023-03-28 16:20:08 -07:00
cpu_pm.c cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}() 2023-01-13 11:48:15 +01:00
crash_core.c mm: remove 'First tail page' members from struct page 2023-02-02 22:32:59 -08:00
crash_dump.c
cred.c
delayacct.c
dma.c
exec_domain.c
exit.c lazy tlb: introduce lazy tlb mm refcount helper functions 2023-03-28 16:20:08 -07:00
extable.c
fail_function.c kernel/fail_function: fix memory leak with using debugfs_lookup() 2023-02-08 13:36:22 +01:00
fork.c lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme 2023-03-28 16:20:08 -07:00
freezer.c
gen_kheaders.sh kheaders: use standard naming for the temporary directory 2023-01-22 23:43:34 +09:00
groups.c
hung_task.c hung_task: print message when hung_task_warnings gets down to zero. 2023-02-09 17:03:20 -08:00
iomem.c
irq_work.c
jump_label.c
kallsyms.c
kallsyms_internal.h
kallsyms_selftest.c kallsyms: Fix scheduling with interrupts disabled in self-test 2023-01-13 15:09:08 -08:00
kallsyms_selftest.h
kcmp.c
kcov.c mm: replace vma->vm_flags direct modifications with modifier calls 2023-02-09 16:51:39 -08:00
kexec.c kexec: introduce sysctl parameters kexec_load_limit_* 2023-02-02 22:50:05 -08:00
kexec_core.c There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
kexec_elf.c
kexec_file.c kexec: introduce sysctl parameters kexec_load_limit_* 2023-02-02 22:50:05 -08:00
kexec_internal.h
kheaders.c
kmod.c
kprobes.c x86/kprobes: Fix arch_check_optimized_kprobe check within optimized_kprobe range 2023-02-21 08:49:16 +09:00
ksysfs.c kernels/ksysfs.c: export kernel address bits 2023-01-20 14:30:45 +01:00
kthread.c lazy tlb: introduce lazy tlb mm refcount helper functions 2023-03-28 16:20:08 -07:00
latencytop.c
module_signature.c
notifier.c kernel/notifier: Remove CONFIG_SRCU 2023-02-02 16:26:06 -08:00
nsproxy.c
padata.c
panic.c panic: fix the panic_print NMI backtrace setting 2023-03-02 21:54:23 -08:00
params.c kernel/params.c: Use kstrtobool() instead of strtobool() 2023-01-25 14:07:21 -08:00
pid.c
pid_namespace.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
pid_sysctl.h mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC 2023-01-18 17:12:37 -08:00
profile.c
ptrace.c
range.c
reboot.c
regset.c
relay.c mm: replace vma->vm_flags direct modifications with modifier calls 2023-02-09 16:51:39 -08:00
resource.c dax/kmem: Fix leak of memory-hotplug resources 2023-02-17 14:58:01 -08:00
resource_kunit.c
rseq.c
scftorture.c
scs.c
seccomp.c seccomp: fix kernel-doc function name warning 2023-01-13 17:01:06 -08:00
signal.c
smp.c
smpboot.c
smpboot.h
softirq.c
stackleak.c
stacktrace.c
static_call.c
static_call_inline.c
stop_machine.c
sys.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
sys_ni.c
sysctl-test.c
sysctl.c sysctl: fix proc_dobool() usability 2023-02-21 13:34:07 -08:00
task_work.c
taskstats.c
torture.c torture: Fix hang during kthread shutdown phase 2023-01-05 12:10:35 -08:00
tracepoint.c tracepoint: Allow livepatch module add trace event 2023-02-18 14:34:36 -05:00
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c umh: simplify the capability pointer logic 2023-03-03 16:18:19 -08:00
up.c
user-return-notifier.c
user.c
user_namespace.c userns: fix a struct's kernel-doc notation 2023-02-02 22:50:04 -08:00
usermode_driver.c
utsname.c
utsname_sysctl.c
watch_queue.c watch_queue: fix IOC_WATCH_QUEUE_SET_SIZE alloc error paths 2023-03-08 11:44:45 +01:00
watchdog.c
watchdog_hld.c
workqueue.c workqueue: Fold rebind_worker() within rebind_workers() 2023-01-13 07:50:40 -10:00
workqueue_internal.h