Commit Graph

92 Commits

Author SHA1 Message Date
Wilco Dijkstra b990b0aee2 AArch64: Cleanup SVE config and defines
Now we finally support modern GCC and binutils, it's time for a cleanup.
Remove HAVE_AARCH64_SVE_ASM define and conditional compilation.  Remove SVE
configure checks for SVE, ACLE and variant-PCS support.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-05-20 10:33:55 +00:00
Wilco Dijkstra 2c421fc430 AArch64: Cleanup PAC and BTI
Now we finally support modern GCC and binutils, it's time for a cleanup.
Use PAC and BTI instructions unconditionally and use proper assembler syntax.
Remove the PR target/94791 strip_pac workarounds for buggy GCCs.  Remove the
PAC/BTI configure checks - always emit GNU property notes on assembly files.
Change cfi_window_save to the correct cfi_negate_ra_state unwind directive.

Reviewed-by: Matthieu Longo <matthieu.longo@arm.com>
2025-05-19 15:35:32 +00:00
Andrew Pinski ceeffd970c aarch64: Add back non-temporal load/stores from oryon-1's memset
I misunderstood the recommendation from the hardware team about non-temporal
load/stores. It is still recommended to use them in memset for large sizes. It
was not recommended for their use with device memory and memset is already
not valid to be used with device memory.

This reverts commit e6590f0c86.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-04-15 12:07:07 -07:00
Andrew Pinski 0e1aa5db73 aarch64: Add back non-temporal load/stores from oryon-1's memcpy
I misunderstood the recommendation from the hardware team about non-temporal
load/stores. It is still recommended to use them in memcpy for large sizes. It
was not recommended for their use with device memory and memcpy is already
not valid to be use with device memory.

This reverts commit eb5eeb4740.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-04-15 12:06:59 -07:00
Wilco Dijkstra 0f044be1da AArch64: Use prefer_sve_ifuncs for SVE memset
Use prefer_sve_ifuncs for SVE memset just like memcpy.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-02-27 16:51:57 +00:00
Wilco Dijkstra ce2f26a22e AArch64: Remove PTR_ARG/SIZE_ARG defines
This series removes various ILP32 defines that are now
no longer needed.

Remove PTR_ARG/SIZE_ARG.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 14:15:15 +00:00
Wilco Dijkstra 163b1bbb76 AArch64: Add SVE memset
Add SVE memset based on the generic memset with predicated load for sizes < 16.
Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned
stores for the last 64 bytes.  Performance of random memset benchmark improves
by ~2% on Neoverse V1.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-02-20 15:31:50 +00:00
Paul Eggert 2642002380 Update copyright dates with scripts/update-copyrights 2025-01-01 11:22:09 -08:00
Andrew Pinski e6590f0c86 aarch64: Remove non-temporal load/stores from oryon-1's memset
The hardware architects have a new recommendation not to use
non-temporal load/stores for memset. This patch removes this path.
I found there was no difference in the memset speed with/without
non-temporal load/stores either.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-21 11:32:23 -03:00
Andrew Pinski eb5eeb4740 aarch64: Remove non-temporal load/stores from oryon-1's memcpy
The hardware architects have a new recommendation not to use
non-temporal load/stores for memcpy. This patch removes this path.
I found there was no difference in the memcpy speed with/without
non-temporal load/stores either.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-21 11:32:17 -03:00
Andrew Pinski e162ab2bf1 AArch64: Remove thunderx{,2} memcpy
ThunderX1 and ThunderX2 have been retired for a few years now.
So let's remove the thunderx{,2} specific versions of memcpy.
The performance gain or them was for medium and large sizes
while the generic (aarch64) memcpy will handle just slightly worse.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-11-20 11:23:53 +00:00
Wilco Dijkstra 8ecb477ea1 AArch64: Remove memset-reg.h
Remove memset-reg.h by moving register definitions into the memset
implementations.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-09-10 14:18:03 +01:00
Andrew Pinski 2f1f7a5f8a
Aarch64: Add new memset for Qualcomm's oryon-1 core
Qualcom's new core, oryon-1, has a different characteristics for
memset than the current versions of memset. For non-zero, larger
sizes, using GPRs rather than the SIMD stores is ~30% faster.
For even larger sizes, using the nontemporal stores is needed
not to polute the L1/L2 caches.

For zero values, using `dc zva` should be used. Since we
know the size will always be 64 bytes, we don't need to figure
out the size there.

I started with the emag memset and added back the `dc zva` code.

Changes since v1:
* v3: Fix comment formating

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-30 13:47:17 +02:00
Andrew Pinski 4dc83cac78
Aarch64: Add memcpy for qualcomm's oryon-1 core
Qualcomm's new core (oryon-1) has a different performance characteristic
than other cores. For memcpy, it is faster to use the GPRs to
do the copy for large sizes (2x faster). For even larger sizes,
it is better to use the nontemporal load/store instructions so
we don't pollute the L1/L2 caches.

For smaller sizes, the characteristic are very similar to
other cores.
I used the thunderx memcpy as a starting point and expanded from there.

Changes since v1:
* v2: Fix ordering in Makefile.
* v3: Fix comment grammar about the ldnp/stnp instructions.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-30 13:46:33 +02:00
Adhemerval Zanella ef9596352b aarch64: Remove duplicate memchr/strlen in libc.a (BZ 31777)
The generic version provides weak definitions of memchr/strlen,
which are already provided by the ifunc resolvers.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-05-23 09:36:08 -03:00
Adhemerval Zanella bcae44ea85 elf: Only process multiple tunable once (BZ 31686)
The 680c597e9c commit made loader reject ill-formatted strings by
first tracking all set tunables and then applying them. However, it does
not take into consideration if the same tunable is set multiple times,
where parse_tunables_string appends the found tunable without checking
if it was already in the list. It leads to a stack-based buffer overflow
if the tunable is specified more than the total number of tunables.  For
instance:

  GLIBC_TUNABLES=glibc.malloc.check=2:... (repeat over the number of
  total support for different tunable).

Instead, use the index of the tunable list to get the expected tunable
entry.  Since now the initial list is zero-initialized, the compiler
might emit an extra memset and this requires some minor adjustment
on some ports.

Checked on x86_64-linux-gnu and aarch64-linux-gnu.

Reported-by: Yuto Maeda <maeda@cyberdefense.jp>
Reported-by: Yutaro Shimizu <shimizu@cyberdefense.jp>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2024-05-07 12:16:36 -03:00
Wilco Dijkstra 2e94e2f5d2 AArch64: Check kernel version for SVE ifuncs
Old Linux kernels disable SVE after every system call.  Calling the
SVE-optimized memcpy afterwards will then cause a trap to reenable SVE.
As a result, applications with a high use of syscalls may run slower with
the SVE memcpy.  This is true for kernels between 4.15.0 and before 6.2.0,
except for 5.14.0 which was patched.  Avoid this by checking the kernel
version and selecting the SVE ifunc on modern kernels.

Parse the kernel version reported by uname() into a 24-bit kernel.major.minor
value without calling any library functions.  If uname() is not supported or
if the version format is not recognized, assume the kernel is modern.

Tested-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-03-21 16:50:51 +00:00
Paul Eggert dff8da6b3e Update copyright dates with scripts/update-copyrights 2024-01-01 10:53:40 -08:00
Szabolcs Nagy 8e755f5bc8 aarch64: fix tested ifunc variants
Don't test a64fx string functions when BTI is enabled since they are
not BTI compatible.
2023-12-04 14:41:26 +00:00
Wilco Dijkstra 2f5524cc53 AArch64: Remove Falkor memcpy
The latest implementations of memcpy are actually faster than the Falkor
implementations [1], so remove the falkor/phecda ifuncs for memcpy and
the now unused IS_FALKOR/IS_PHECDA defines.

[1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2023-11-13 16:52:50 +00:00
Wilco Dijkstra 3d7090f14b AArch64: Add memset_zva64
Add a specialized memset for the common ZVA size of 64 to avoid the
overhead of reading the ZVA size.  Since the code is identical to
__memset_falkor, remove the latter.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2023-11-13 16:50:44 +00:00
Wilco Dijkstra 9627ab99b5 AArch64: Cleanup emag memset
Cleanup emag memset - merge the memset_base64.S file, remove
the unused ZVA code (since it is disabled on emag).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2023-11-13 16:45:47 +00:00
Wilco Dijkstra 9fd3409842 AArch64: Cleanup ifuncs
Cleanup ifuncs.  Remove uses of libc_hidden_builtin_def, use ENTRY rather than
ENTRY_ALIGN, remove unnecessary defines and conditional compilation.  Rename
strlen_mte to strlen_generic.  Remove rtld-memset.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-11-01 13:41:59 +00:00
Wilco Dijkstra 2bd0017988 AArch64: Add support for MOPS memcpy/memmove/memset
Add support for MOPS in cpu_features and INIT_ARCH.  Add ifuncs using MOPS for
memcpy, memmove and memset (use .inst for now so it works with all binutils
versions without needing complex configure and conditional compilation).

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-10-24 13:37:48 +01:00
Paul Pluzhnikov 65cc53fe7c Fix misspellings in sysdeps/ -- BZ 25337 2023-05-30 23:02:29 +00:00
Wilco Dijkstra d2d3f3720c AArch64: Improve SVE memcpy and memmove
Improve SVE memcpy by copying 2 vectors if the size is small enough.
This improves performance of random memcpy by ~9% on Neoverse V1, and
33-64 byte copies are ~16% faster.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-02-06 16:15:34 +00:00
Wilco Dijkstra 1bbb1a2022 AArch64: Improve strlen_asimd
Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment.
Performance improves slightly as a result.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 15:09:18 +00:00
Joseph Myers 6d7e8eda9b Update copyright dates with scripts/update-copyrights 2023-01-06 21:14:39 +00:00
Wilco Dijkstra e6f3fe362f aarch64: Use memcpy_simd as the default memcpy
Since __memcpy_simd is the fastest memcpy on almost all cores, replace
the generic memcpy with it.  If SVE is available, a SVE memcpy will be
used by default (including for Neoverse N2).
2022-10-26 14:16:50 +01:00
Wilco Dijkstra a8e72913fe aarch64: Cleanup memset ifunc
Cleanup memset ifunc selectors. The A64FX memset relies on a ZVA size of
256, so add an explicit check.
2022-10-26 14:12:55 +01:00
Adhemerval Zanella 5355f9ca7b elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support
Besides the option being gcc specific, this approach is still fragile
and not future proof since we do not know if this will be the only
optimization option gcc will add that transforms loops to memset
(or any libcall).

This patch adds a new header, dl-symbol-redir-ifunc.h, that can b
used to redirect the compiler generated libcalls to port the generic
memset implementation if required.

Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-10-10 10:32:28 -03:00
Wilco Dijkstra fdaf78656f Add bounds check to __libc_ifunc_impl_list
Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC
redundant and fixes several targets that will write outside the array.
To avoid unnecessary large diffs, pass the maximum in the argument 'i' to
IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing
ones can be updated if desired.

Passes buildmanyglibc.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2022-06-10 17:13:29 +01:00
Wilco Dijkstra eea282d9c6 AArch64: Sort makefile entries
Sort makefile entries to reduce conflicts.
2022-06-07 16:58:15 +01:00
Wilco Dijkstra 9f298bfe1f AArch64: Add SVE memcpy
Add an initial SVE memcpy implementation.  Copies up to 32 bytes use SVE
vectors which improves the random memcpy benchmark significantly.
Cleanup the memcpy and memmove ifunc selectors.
2022-06-07 16:58:03 +01:00
Wilco Dijkstra e5fa62b8db AArch64: Check for SVE in ifuncs [BZ #28744]
Add a check for SVE in the A64FX ifuncs for memcpy, memset and memmove.
This fixes BZ #28744.
2022-01-06 14:36:28 +00:00
Paul Eggert 581c785bf3 Update copyright dates with scripts/update-copyrights
I used these shell commands:

../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.

I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah.  I don't
know why I run into these diagnostics whereas others evidently do not.

remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
2022-01-01 11:40:24 -08:00
Wilco Dijkstra b31bd11454 AArch64: Improve A64FX memcpy
v2 is a complete rewrite of the A64FX memcpy. Performance is improved
by streamlining the code, aligning all large copies and using a single
unrolled loop for all sizes. The code size for memcpy and memmove goes
down from 1796 bytes to 868 bytes. Performance is better in all cases:
bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33%
faster for large sizes, bench-memcpy-walk is 25% faster for small sizes
and 20% for the largest sizes. The geomean of all tests in bench-memcpy
is 5.1% faster, and total time is reduced by 4%.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-12-02 18:36:03 +00:00
Naohiro Tamura 381b29616a aarch64: Disable A64FX memcpy/memmove BTI unconditionally
This patch disables A64FX memcpy/memmove BTI instruction insertion
unconditionally such as A64FX memset patch [1] for performance.

[1] commit 07b427296b

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-09-24 13:26:59 +01:00
Naohiro Tamura 1d9f99ce1b AArch64: Update A64FX memset not to degrade at 16KB
This patch updates unroll8 code so as not to degrade at the peak
performance 16KB for both FX1000 and FX700.

Inserted 2 instructions at the beginning of the unroll8 loop,
cmp and branch, are a workaround that is found heuristically.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2021-09-06 10:23:24 +01:00
Szabolcs Nagy f873adf3df Revert "AArch64: Update A64FX memset not to degrade at 16KB"
Because of wrong commit author. Will recommit it with right author.

This reverts commit 23777232c2.
2021-09-06 10:23:25 +01:00
Naohiro Tamura via Libc-alpha 23777232c2 AArch64: Update A64FX memset not to degrade at 16KB
This patch updates unroll8 code so as not to degrade at the peak
performance 16KB for both FX1000 and FX700.

Inserted 2 instructions at the beginning of the unroll8 loop,
cmp and branch, are a workaround that is found heuristically.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2021-09-03 15:59:46 +01:00
Wilco Dijkstra a5db6a5cae [5/5] AArch64: Improve A64FX memset medium loops
Simplify the code for memsets smaller than L1. Improve the unroll8 and
L1_prefetch loops.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:46:20 +01:00
Wilco Dijkstra e69d9981f8 [4/5] AArch64: Improve A64FX memset by removing unroll32
Remove unroll32 code since it doesn't improve performance.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:44:27 +01:00
Wilco Dijkstra 186092c6ba [3/5] AArch64: Improve A64FX memset for remaining bytes
Simplify handling of remaining bytes. Avoid lots of taken branches and complex
whilelo computations, instead unconditionally write vectors from the end.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:42:07 +01:00
Wilco Dijkstra 9bc2ed8f46 [2/5] AArch64: Improve A64FX memset for large sizes
Improve performance of large memsets. Simplify alignment code. For zero memset
use DC ZVA, which almost doubles performance. For non-zero memsets use the
unroll8 loop which is about 10% faster.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:39:37 +01:00
Wilco Dijkstra 07b427296b [1/5] AArch64: Improve A64FX memset for small sizes
Improve performance of small memsets by reducing instruction counts and
improving code alignment. Bench-memset shows 35-45% performance gain for
small sizes.

Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 13:30:27 +01:00
Naohiro Tamura 4f26956d5b aarch64: Added optimized memset for A64FX
This patch optimizes the performance of memset for A64FX [1] which
implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache
per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill and prefetch.

SVE assembler code for memset is implemented as Vector Length Agnostic
code so theoretically it can be run on any SOC which supports ARMv8-A
SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-05-27 09:47:53 +01:00
Naohiro Tamura fa527f345c aarch64: Added optimized memcpy and memmove for A64FX
This patch optimizes the performance of memcpy/memmove for A64FX [1]
which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
cache per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and software pipelining.

SVE assembler code for memcpy/memmove is implemented as Vector Length
Agnostic code so theoretically it can be run on any SOC which supports
ARMv8-A SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-05-27 09:47:53 +01:00
Szabolcs Nagy 04c6a8073d aarch64: Fix the list of tested IFUNC variants [BZ #26818]
Some IFUNC variants are not compatible with BTI and MTE so don't
set them as usable for testing and benchmarking on a BTI or MTE
enabled system.

As far as IFUNC selectors are concerned a system is BTI enabled if
the cpu supports it and glibc was built with BTI branch protection.

Most IFUNC variants are BTI compatible, but thunderx2 memcpy and
memmove use a jump table with indirect jump, without a BTI j.

Fixes bug 26818.
2021-01-25 16:15:54 +00:00
Szabolcs Nagy c3c4a25e65 aarch64: Move and update the definition of MTE_ENABLED
The hwcap value is now in linux 5.10 and in glibc bits/hwcap.h, so use
that definition.

Move the definition to init-arch.h so all ifunc selectors can use it
and expose an "mte" shorthand for mte enabled runtime.

For now we allow user code to enable tag checks and use PROT_MTE
mappings without libc involvment, this is not guaranteed ABI, but
can be useful for testing and debugging with MTE.
2021-01-25 15:35:43 +00:00