glibc/sysdeps
Noah Goldstein af992e7abd x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
The current `non_temporal_threshold` is set to roughly
`3/4 * sizeof_L3 / ncores_per_socket`. This patch updates that value
to roughly `sizeof_L3 / 4`.

The original value (specifically dividing by `ncores_per_socket`) was
chosen to limit the amount of other threads' data a `memcpy`/`memset`
could evict.

Dividing by `ncores_per_socket`, however, leads to exceedingly low
non-temporal thresholds and to using non-temporal stores in cases
where REP MOVSB is multiple times faster.

Furthermore, non-temporal stores are written directly to main memory,
so using them at a size much smaller than L3 can place soon-to-be
accessed data much further away than it otherwise could be. In
addition, modern machines are able to detect streaming patterns
(especially if REP MOVSB is used) and provide LRU hints to the memory
subsystem. This in effect caps the total amount of eviction at
1/cache_associativity, far below meaningfully thrashing the entire
cache.

As best I can tell, the benchmarks that led to this small threshold
were done comparing non-temporal stores against standard cacheable
stores. A better comparison (linked below) is against REP MOVSB which,
on the measured systems, is nearly 2x faster than non-temporal stores
at the low end of the previous threshold, and within 10% for copies
over 100MB (well past even the current threshold). In cases with a
low number of threads competing for bandwidth, REP MOVSB is ~2x faster
up to `sizeof_L3`.

The divisor of `4` is a somewhat arbitrary value. From benchmarks it
seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs
such as Broadwell prefer something closer to `8`. This patch is meant
to be followed up by another one to make the divisor cpu-specific, but
in the meantime (and for easier backporting), this patch settles on
`4` as a middle-ground.
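
To make the change in magnitude concrete, the two heuristics described
above can be sketched as follows. This is only an illustration of the
arithmetic in this message, not the actual glibc code: the function
names and parameters are hypothetical, and any clamping or tunables
the real implementation applies are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a glibc function): the previous heuristic,
   roughly 3/4 of each core's share of the L3 cache.  */
static size_t
old_non_temporal_threshold (size_t sizeof_l3, size_t ncores_per_socket)
{
  assert (ncores_per_socket != 0);
  return sizeof_l3 / ncores_per_socket * 3 / 4;
}

/* Hypothetical helper (not a glibc function): the new heuristic, the
   whole L3 size over a fixed divisor.  This patch settles on 4; the
   message suggests 2 or 8 may suit some CPUs better.  */
static size_t
new_non_temporal_threshold (size_t sizeof_l3, size_t divisor)
{
  assert (divisor != 0);
  return sizeof_l3 / divisor;
}
```

For an assumed 32 MiB L3 shared by 16 cores, the old formula gives
1.5 MiB, while the new one gives 8 MiB with a divisor of `4`.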

Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable
stores were done using:
https://github.com/goldsteinn/memcpy-nt-benchmarks

Spreadsheet results (also available as a PDF in the GitHub repository):
https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2023-06-12 11:33:39 -05:00
aarch64 Fix a few more typos I missed in previous round -- BZ 25337 2023-06-02 23:46:32 +00:00
alpha Fix a few more typos I missed in previous round -- BZ 25337 2023-06-02 23:46:32 +00:00
arc Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
arm Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
csky Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
generic Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
gnu Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
hppa Fix a few more typos I missed in previous round -- BZ 25337 2023-06-02 23:46:32 +00:00
htl Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
hurd hurd: Fix using interposable hurd_thread_self 2023-05-19 20:45:51 +02:00
i386 Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
ia64 Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
ieee754 Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
loongarch Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
m68k Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
mach hurd: Fix x86_64 sigreturn restoring bogus reply_port 2023-06-04 19:05:51 +02:00
microblaze Update copyright dates with scripts/update-copyrights 2023-01-06 21:14:39 +00:00
mips Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
nios2 Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
nptl Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
or1k Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
posix posix: Add error message for EAI_OVERFLOW 2023-05-29 15:30:14 +02:00
powerpc Regenerate configure fragment -- BZ 25337. 2023-05-23 16:21:29 +00:00
pthread pthreads: Use _exit to terminate the tst-stdio1 test 2023-06-06 11:39:06 +02:00
riscv Revert "riscv: Resolve symbols directly for symbols with STO_RISCV_VARIANT_CC." 2023-05-07 14:16:03 +02:00
s390 Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
sh Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
sparc Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
unix linux: Fail as unsupported if personality call is filtered 2023-06-05 12:51:48 -03:00
wordsize-32 Update copyright dates with scripts/update-copyrights 2023-01-06 21:14:39 +00:00
wordsize-64 hurd: Fix tst-writev test 2023-05-01 13:01:30 +02:00
x86 x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4` 2023-06-12 11:33:39 -05:00
x86_64 x86-64: Use YMM registers in memcmpeq-evex.S 2023-06-01 09:21:14 -07:00