Commit Graph

616 Commits

Author SHA1 Message Date
Dylan Fleming fd1d642ef8 AArch64: Remove WANT_SIMD_EXCEPT from aarch64 AdvSIMD math routines
Remove legacy code for supporting an old Arm Optimised Routines
deprecated feature for throwing SIMD Exceptions.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-11-18 15:51:15 +00:00
Pierre Blanchard bb6519de1e AArch64: Fix and improve SVE pow(f) special cases
powf:

Update scalar special case function to best use new interface.

pow:

Make specialcase NOINLINE to prevent str/ldr leaking in fast path.
Remove dependency on sv_call2, as the new callback impl is not a
performance gain.
Replace with vectorised specialcase since structure of scalar
routine is fairly simple.

Throughput gain of about 5-10% on V1 for large values and 25% for subnormal `x`.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-18 15:51:15 +00:00
Pierre Blanchard e889160273 AArch64: fix SVE tanpi(f) [BZ #33642]
Fixed svld1rq using incorrect predicates (BZ #33642).
Next to no performance variations (tested on V1).

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-18 15:51:15 +00:00
Wilco Dijkstra 989e538224 math: Remove float_t and double_t [BZ #33563]
Remove uses of float_t and double_t. This is not useful on modern machines,
and does not help given GCC defaults to -fexcess-precision=fast.
One use of double_t remains to allow forcing the precision to double
on targets where FLT_EVAL_METHOD=2. This fixes BZ #33563 on
i486-pc-linux-gnu.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-11-12 19:33:23 +00:00
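
Note on 989e538224: the commit body ties float_t and double_t to FLT_EVAL_METHOD. As a minimal sketch (the sketch_* helper names are hypothetical and not part of glibc), the following C shows how the ISO C rules map FLT_EVAL_METHOD to the width of float_t, which is why the types only differ from float and double on targets such as x87 where FLT_EVAL_METHOD is 2.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: size of float_t as provided by <math.h>.  */
static size_t
sketch_float_t_size (void)
{
  /* ISO C requires float_t to be at least as wide as float.  */
  assert (sizeof (float_t) >= sizeof (float));
  return sizeof (float_t);
}

/* Hypothetical helper: size that ISO C prescribes for float_t given
   FLT_EVAL_METHOD, or 0 when the value is implementation-defined.  */
static size_t
sketch_expected_float_t_size (void)
{
  switch (FLT_EVAL_METHOD)
    {
    case 0:
      return sizeof (float);        /* Evaluate in the declared type.  */
    case 1:
      return sizeof (double);       /* float is evaluated as double.  */
    case 2:
      return sizeof (long double);  /* Everything as long double.  */
    default:
      return 0;                     /* Implementation-defined.  */
    }
}
```

On a target where FLT_EVAL_METHOD is 0, both helpers return sizeof (float), so float_t adds nothing over plain float there.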
Yury Khrustalev a9c426bcca aarch64: fix includes in SME tests
Use the correct include for the SIGCHLD macro: signal.h

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-12 13:45:52 +00:00
Florian Weimer 259adb087d aarch64: Remove $(aarch64-bti) check
The variable was removed in commit 2c421fc430
("AArch64: Cleanup PAC and BTI"), so this Makefile fragment is
always excluded.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-11-07 14:12:01 +01:00
Joe Ramsay e45af510bc AArch64: Fix instability in AdvSIMD sinh
Previously presence of special-cases in one lane could affect the
results in other lanes due to unconditional scalar fallback. The old
WANT_SIMD_EXCEPT option (which has never been enabled in libmvec) has
been removed from AOR, making it easier to spot and fix
this. No measured change in performance. This patch applies cleanly as
far back as 2.41, however there are conflicts with 2.40 where sinh was
first introduced.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-06 18:30:47 +00:00
Joe Ramsay 6c22823da5 AArch64: Fix instability in AdvSIMD tan
Previously presence of special-cases in one lane could affect the
results in other lanes due to unconditional scalar fallback. The old
WANT_SIMD_EXCEPT option (which has never been enabled in libmvec) has
been removed from AOR, making it easier to spot and fix this. 4%
improvement in throughput with GCC 14 on Neoverse V1. This bug is
present as far back as 2.39 (where tan was first introduced).

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-06 18:30:47 +00:00
Joe Ramsay 5b82fb1882 AArch64: Optimise SVE scalar callbacks
Instead of using SVE instructions to marshall special results into the
correct lane, just write the entire vector (and the predicate) to
memory, then use cheaper scalar operations.

Geomean speedup of 16% in special intervals on Neoverse with GCC 14.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-06 15:45:37 +00:00
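
Note on 5b82fb1882: the idea of writing the vector and the predicate to memory and then fixing up lanes with scalar code can be sketched in portable C. This is only an emulation under stated assumptions: it uses plain arrays in place of SVE registers, a hypothetical fixed lane count, and hypothetical sketch_* names. It is not the libmvec implementation.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_LANES 4  /* Hypothetical fixed vector length.  */

typedef double (*sketch_scalar_fn) (double);

/* results[] and special[] stand in for the vector result and the
   predicate after both have been stored to memory.  Only lanes flagged
   in special[] are recomputed with the scalar routine; every other lane
   keeps the value produced by the vector fast path.  */
static void
sketch_scalar_callback (sketch_scalar_fn fn,
                        const double x[SKETCH_LANES],
                        double results[SKETCH_LANES],
                        const unsigned char special[SKETCH_LANES])
{
  assert (fn != NULL);
  for (size_t i = 0; i < SKETCH_LANES; i++)
    if (special[i])
      results[i] = fn (x[i]);
}

/* Hypothetical scalar fallback used for illustration: negative inputs
   map to zero, other inputs are returned unchanged.  */
static double
sketch_clamp_negative (double v)
{
  return v < 0.0 ? 0.0 : v;
}
```

Because unflagged lanes are never touched, a special case in one lane cannot change the result in another lane.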
Wilco Dijkstra 324c088a18 nptl: Remove ATOMIC_EXCHANGE_USES_CAS usage
The only usage was for pthread_spin_lock, introduced by 12d2dd7060,
as a way to optimize the code for certain architectures. Now that atomic
builtins are used by default, let the compiler use the best code sequence
for the atomic exchange.

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
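
Note on 324c088a18: a minimal sketch of a spin lock built directly on the compiler's atomic exchange builtin, leaving the choice of instruction sequence to the compiler. The sketch_* names are hypothetical, and this is not the glibc pthread_spin_lock source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int sketch_spinlock_t;  /* 0 = unlocked, 1 = locked.  */

static void
sketch_spin_lock (sketch_spinlock_t *lock)
{
  assert (lock != NULL);
  /* Take the lock with an atomic exchange; the compiler picks the best
     sequence for the target (a native swap, an LL/SC loop or a CAS).  */
  while (__atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE) != 0)
    /* Lock is held: wait on relaxed loads before retrying the exchange.  */
    while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
      ;
}

static bool
sketch_spin_trylock (sketch_spinlock_t *lock)
{
  /* Succeeds only if the previous value was "unlocked".  */
  return __atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE) == 0;
}

static void
sketch_spin_unlock (sketch_spinlock_t *lock)
{
  __atomic_store_n (lock, 0, __ATOMIC_RELEASE);
}
```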
Wilco Dijkstra 53807741fb Define __HAVE_64B_ATOMICS from compiler support
Now that atomic builtins are used by default, we can rely on the
compiler to define when to use 64-bit atomic operations.

It allows the use of 64-bit atomic operations on some 32-bit ABIs where
they were not previously enabled due to missing pre-processor handling:
hppa, mips64n32, s390, and sparcv9.

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
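
Note on 53807741fb: one possible way to derive such a flag from the compiler is shown below. It assumes the GCC/Clang predefined macro __GCC_ATOMIC_LLONG_LOCK_FREE (value 2 means always lock-free). Whether glibc uses this exact test is an assumption, and the SKETCH_/sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a compiler-derived __HAVE_64B_ATOMICS.  */
#if defined __GCC_ATOMIC_LLONG_LOCK_FREE && __GCC_ATOMIC_LLONG_LOCK_FREE == 2
# define SKETCH_HAVE_64B_ATOMICS 1
#else
# define SKETCH_HAVE_64B_ATOMICS 0
#endif

/* Return the current counter value and advance it by one.  */
static uint64_t
sketch_next_id (uint64_t *counter)
{
  assert (counter != NULL);
#if SKETCH_HAVE_64B_ATOMICS
  /* The target has lock-free 64-bit atomics, so use them directly.  */
  return __atomic_fetch_add (counter, 1, __ATOMIC_RELAXED);
#else
  /* Illustration only: this fallback is NOT atomic and is only safe
     when a single thread touches the counter.  */
  uint64_t old = *counter;
  *counter = old + 1;
  return old;
#endif
}
```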
Adhemerval Zanella 70ee250fb8 atomic: Consolidate atomic_full_barrier implementation
All ABIs except sparcv9 and s390 define it as __sync_synchronize,
which can be mapped to __atomic_thread_fence (__ATOMIC_SEQ_CST).

For Sparc, it uses a stricter #StoreStore|#LoadStore|#StoreLoad|#LoadLoad
instead of the #StoreLoad generated by __sync_synchronize.

For s390x, it defaults to a memory barrier where __sync_synchronize
emits a 'bcr 15,0' (which the manual describes as pipeline synchronization).

The barrier is used only in one place (pthread_mutex_setprioceiling),
and using a stricter barrier for s390 is ok performance-wise.

Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
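
Note on 70ee250fb8: the mapping named in the commit body, a full barrier expressed as __atomic_thread_fence (__ATOMIC_SEQ_CST), can be sketched as follows. The macro and function names are hypothetical, and the publish/consume pair is only an illustration of where a full barrier sits, not code from glibc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; mirrors the mapping described in the commit.  */
#define sketch_atomic_full_barrier() __atomic_thread_fence (__ATOMIC_SEQ_CST)

static int sketch_data;
static int sketch_flag;

/* Writer: make the data visible before raising the flag.  */
static void
sketch_publish (int value)
{
  __atomic_store_n (&sketch_data, value, __ATOMIC_RELAXED);
  sketch_atomic_full_barrier ();
  __atomic_store_n (&sketch_flag, 1, __ATOMIC_RELAXED);
}

/* Reader: returns 1 and stores the data in *out once the flag is set,
   otherwise returns 0 and leaves *out untouched.  */
static int
sketch_try_consume (int *out)
{
  assert (out != NULL);
  if (__atomic_load_n (&sketch_flag, __ATOMIC_RELAXED) == 0)
    return 0;
  sketch_atomic_full_barrier ();
  *out = __atomic_load_n (&sketch_data, __ATOMIC_RELAXED);
  return 1;
}
```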
Adhemerval Zanella b299332fb4 aarch64: Remove unused atomic macros
These are already provided by the generic include/atomic.h.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella 8711c29bb7 aarch64: Fix tst-ifunc-arg-4 on clang-18
It issues:

../sysdeps/aarch64/tst-ifunc-arg-4.c:39:1: error: unused function 'resolver' [-Werror,-Wunused-function]
   39 | resolver (uint64_t arg0, const uint64_t arg1[])
      | ^~~~~~~~
1 error generated.

clang-19 and onwards do not trigger the warning.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-10-29 12:54:10 -03:00
Adhemerval Zanella 970364dac0 Annotate switch fall-through
Clang warns about missing fall-through annotations by default, and it
does not support all the comment-style annotations that GCC does.  Use
the C23 [[fallthrough]] attribute instead.

Reviewed-by: Collin Funk <collin.funk1@gmail.com>
2025-10-29 12:54:01 -03:00
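
Note on 970364dac0: a minimal sketch of an attribute-based fall-through annotation. The SKETCH_FALLTHROUGH wrapper macro is hypothetical and exists only so the example also builds in pre-C23 modes; in C23 mode it expands to the [[fallthrough]] attribute named in the commit.

```c
#include <assert.h>

/* Hypothetical wrapper: C23 attribute when available, GNU attribute
   otherwise, and a no-op on compilers that support neither.  */
#if defined __STDC_VERSION__ && __STDC_VERSION__ >= 202311L
# define SKETCH_FALLTHROUGH [[fallthrough]]
#elif defined __GNUC__
# define SKETCH_FALLTHROUGH __attribute__ ((fallthrough))
#else
# define SKETCH_FALLTHROUGH ((void) 0)
#endif

/* Count how many steps run when starting at step START (0 to 3).
   Each case deliberately falls through into the next one.  */
static int
sketch_steps_from (int start)
{
  int steps = 0;
  assert (start >= 0 && start <= 3);
  switch (start)
    {
    case 3:
      steps++;
      SKETCH_FALLTHROUGH;
    case 2:
      steps++;
      SKETCH_FALLTHROUGH;
    case 1:
      steps++;
      break;
    default:
      break;
    }
  return steps;
}
```

Unlike a /* fall through */ comment, the attribute is part of the language, so it does not depend on a compiler recognising a particular comment spelling.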
Wilco Dijkstra 0375e6e233 AArch64: Use math-use-builtins for roundeven(f)/lrint(f)/lround(f)
Remove target implementations of roundeven(f)/lrint(f)/lround(f) and
use the math-use-builtins mechanism instead.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-10-17 17:03:54 +00:00
Yury Khrustalev ecb0fc2f0f aarch64: tests for SME
This commit adds tests for the following use cases relevant to handling of
the SME state:

 - fork() and vfork()
 - clone() and clone3()
 - signal handler

While most cases are trivial, the case of clone3() is more complicated since
the clone3() symbol is not public in Glibc.

To avoid having to check all possible ways clone3() may be called via other
public functions (e.g. vfork() or pthread_create()), we put together a test
that links directly with clone3.o. All the existing functions that have calls
to clone3() may not actually use it, in which case the outcome of such tests
would be unexpected. Having a direct call to the clone3() symbol in the test
allows us to check precisely what we need to test: that the __arm_za_disable()
function is indeed called and has the desired effect.

Linking to clone3.o also requires linking to __arm_za_disable.o, which in
turn requires the _dl_hwcap2 hidden symbol, which we have to provide in the
test and initialise before use.

Co-authored-by: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-10-14 09:42:46 +01:00
Yury Khrustalev b4b713bd89 aarch64: define macro for calling __libc_arm_za_disable
A common sequence of instructions is used in several places
in assembly files, so define it in one place as an assembly
macro.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-10-14 09:42:46 +01:00
Yury Khrustalev 7a47a51e8d misc: Fix several typos
2025-10-10 14:52:40 +01:00
Luna Lamb 653e6c4fff AArch64: Implement AdvSIMD and SVE log10p1(f) routines
Vector variants of the new C23 log10p1 routines.

Note: Benchmark inputs for log10p1(f) are identical to log1p(f)

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-09-27 12:45:59 +00:00
Luna Lamb db42732474 AArch64: Implement AdvSIMD and SVE log2p1(f) routines
Vector variants of the new C23 log2p1 routines.

Note: Benchmark inputs for log2p1(f) are identical to log1p(f).

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-09-27 12:44:09 +00:00
Wilco Dijkstra aebaeb2c33 AArch64: Update math-vector-fortran.h
Update math-vector-fortran.h with the latest set of math functions
and sort by name.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-09-19 12:57:47 +00:00
Adhemerval Zanella 63ba1a1509 math: Add fetestexcept internal alias
This avoids linknamespace issues on old standards.  It is required
if the fallback fma implementation is used, if/when it is also
used internally by other implementations.
Reviewed-by: DJ Delorie <dj@redhat.com>
2025-09-11 14:46:07 -03:00
Adhemerval Zanella 2eb8836de7 math: Add feclearexcept internal alias
This avoids linknamespace issues on old standards.  It is required
if the fallback fma implementation is used, if/when it is also
used internally by other implementations.
Reviewed-by: DJ Delorie <dj@redhat.com>
2025-09-11 14:46:07 -03:00
remph e20ca759af AArch64: add optimised strspn/strcspn
Requires Neon (aka Advanced SIMD).  Looks up 16 characters at a time,
for a 2-3x performance improvement, and a ~30% speedup on the strtok &
strsep benchtests, as tested on Cortex-A{53,72}.

Signed-off-by: remph <lhr@disroot.org>

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-09-10 16:12:23 +00:00
Hasaan Khan 8ced7815fb AArch64: Implement exp2m1 and exp10m1 routines
Vector variants of the new C23 exp2m1 & exp10m1 routines.

Note: Benchmark inputs for exp2m1 & exp10m1 are identical to exp2 & exp10
respectively, this also includes the floating point variations.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-09-02 16:50:24 +00:00
Pierre Blanchard aac077645a AArch64: Fix SVE powf routine [BZ #33299]
Fix a bug in predicate logic introduced in last change.
A slight performance improvement from relying on all true
predicates during conversion from single to double.
This fixes BZ #33299.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-08-20 17:45:21 +00:00
Adhemerval Zanella 20528165bd Disable SFrame support by default
Also add extra checks so that it is only enabled with binutils 2.45 and
when the architecture explicitly enables it.  When SFrame is disabled, all
the related code is also not enabled for backtrace() and _dl_find_object(),
so SFrame backtracking is not used even if the binary has the SFrame segment.

This patch also adds some other related fixes:

  * Fixed an issue with AC_CHECK_PROG_VER, where the READELF_SFRAME
    usage prevented specifying a different readelf through READELF
    environment variable at configure time.

  * Add an extra arch-specific internal definition,
    libc_cv_support_sframe, to disable --enable-sframe on architectures
    that have binutils but not glibc support (s390x).

  * Renamed the tests without the .sframe segment and move the
    tst-backtrace1 from pthread to debug.

  * Use the built compiler strip to remove the .sframe segment,
    instead of the system one (which might not support SFrame).

Checked on x86_64-linux-gnu and aarch64-linux-gnu.

Reviewed-by: Sam James <sam@gentoo.org>
2025-07-24 15:51:58 -03:00
H.J. Lu 848f0e46f0 i386: Update ___tls_get_addr to preserve vector registers
Compiler generates the following instruction sequence for dynamic TLS
access:

	leal	tls_var@tlsgd(,%ebx,1), %eax
	call	___tls_get_addr@PLT

CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS, AX, CX, and DX, are unchanged after CALL.  But
___tls_get_addr is a normal function which doesn't preserve any vector
registers.

1. Rename the generic __tls_get_addr function to ___tls_get_addr_internal.
2. Change ___tls_get_addr to a wrapper function with implementations for
FNSAVE, FXSAVE, XSAVE and XSAVEC to save and restore all vector registers.
3. dl-tlsdesc-dynamic.h has:

_dl_tlsdesc_dynamic:
	/* Like all TLS resolvers, preserve call-clobbered registers.
	   We need two scratch regs anyway.  */
	subl	$32, %esp
	cfi_adjust_cfa_offset (32)

It is wrong to use

	movl	%ebx, -28(%esp)
	movl	%esp, %ebx
	cfi_def_cfa_register(%ebx)
	...
	mov	%ebx, %esp
	cfi_def_cfa_register(%esp)
	movl	-28(%esp), %ebx

to preserve EBX on stack.  Fix it with:

	movl	%ebx, 28(%esp)
	movl	%esp, %ebx
	cfi_def_cfa_register(%ebx)
	...
	mov	%ebx, %esp
	cfi_def_cfa_register(%esp)
	movl	28(%esp), %ebx

4. Update _dl_tlsdesc_dynamic to call ___tls_get_addr_internal directly.
5. Add have-test-mtls-traditional to compile tst-tls23-mod.c with
traditional TLS variant to verify the fix.
6. Define DL_RUNTIME_RESOLVE_REALIGN_STACK in sysdeps/x86/sysdep.h.

This fixes BZ #32996.

Co-Authored-By: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-06-19 04:30:31 +08:00
Luna Lamb 6849c5b791 AArch64: Improve codegen SVE log1p helper
Improve codegen by packing coefficients.
4% and 2% improvement in throughput microbenchmark on Neoverse V1, for acosh
and atanh respectively.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-06-18 17:28:51 +00:00
Dylan Fleming dee22d2a81 AArch64: Optimise SVE FP64 Hyperbolics
Rework SVE FP64 hyperbolics to use the SVE FEXPA
instruction.

Also update the special case handling for large
inputs to be entirely vectorised.

Performance improvements on Neoverse V1:

cosh_sve: 19% for |x| < 709, 5x otherwise
sinh_sve: 24% for |x| < 709, 5.9x otherwise
tanh_sve: 12% for |x| < 19,  9x otherwise

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-06-18 17:28:51 +00:00
Dylan Fleming 1e3d1ddf97 AArch64: Optimize SVE exp functions
Improve performance of SVE exps by making better use
of the SVE FEXPA instruction.

Performance improvement on Neoverse V1:
exp2_sve:   21%
exp2f_sve:  24%
exp10f_sve: 23%
expm1_sve:  25%

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-06-18 17:28:51 +00:00
Yury Khrustalev c0f0db2d59 aarch64: simplify calls to __libc_arm_za_disable in assembly
There is no functional change in this patch.

We remove stores and loads to stack, return address signing, and redundant
CFI directives before and after call to __libc_arm_za_disable().

The __libc_arm_za_disable implementation follows a special calling convention
that allows us to avoid most of the operations that would be necessary for a
call to a normal function (see [1] for details).

First, we rely on __libc_arm_za_disable() not clobbering certain registers,
and we put the return address into one of these registers. Now we don't need
to store it on the stack, so we don't need to sign the return address using PAC.

Second, as a result of the above, we don't need to update the CFI offset.

This patch provides a small optimisation by avoiding an unnecessary store and
load on the stack, and it also simplifies the assembly code and CFI directives.

[1]: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-06-18 09:42:33 +01:00
Yury Khrustalev eeedfc2f74 aarch64: GCS: use internal struct in __alloc_gcs
No functional change here, just a small refactoring to simplify
using __alloc_gcs() for allocating shadow stacks.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-06-18 09:37:13 +01:00
Yury Khrustalev b15ed85c86 aarch64: fix typo in sysdeps/aarch64/Makefile
2025-06-10 10:48:07 +01:00
Wilco Dijkstra 09795c5612 AArch64: Fix build error with GCC 12.1/12.2
Early versions of GCC 12 didn't support -mtune=neoverse-v2, so use
-mtune=neoverse-v1 instead.

Reported-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-06-06 13:22:27 +00:00
Yury Khrustalev fcd6a8b5c5 aarch64: add __ifunc_hwcap function to be used in ifunc resolvers
Add a new helper function __ifunc_hwcap() as a portable way to
access HWCAP elements via the parameter(s) passed to an ifunc
resolver checking the _IFUNC_ARG_HWCAP bit in the first parameter
and size of the buffer in the second parameter.

Note that 0 is returned when the requested element is not available
or does not correspond to a valid AT_HWCAP{,2,...} value.

Also add relevant tests.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-06-05 14:38:51 +01:00
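
Note on fcd6a8b5c5: the lookup rules described in the commit body can be sketched in portable C. Everything here is an assumption made for illustration: the names are hypothetical, the flag is assumed to be bit 62 of the first resolver argument, and the second argument is assumed to be an array whose element 0 holds the buffer size in bytes, followed by the hwcap values. The real helper and its definitions live in glibc's AArch64 headers and may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag bit and element count; hypothetical names.  */
#define SKETCH_IFUNC_ARG_HWCAP (UINT64_C (1) << 62)
#define SKETCH_IFUNC_HWCAP_MAX 4

/* Return hwcap element N (1 = AT_HWCAP, 2 = AT_HWCAP2, ...), or 0 when
   the element is not available or N is not a valid index.  */
static uint64_t
sketch_ifunc_hwcap (unsigned int n, uint64_t arg0, const uint64_t arg1[])
{
  if (n < 1 || n > SKETCH_IFUNC_HWCAP_MAX)
    return 0;

  /* Without the flag only the first argument carries AT_HWCAP and the
     second argument must not be dereferenced.  */
  if ((arg0 & SKETCH_IFUNC_ARG_HWCAP) == 0)
    return n == 1 ? arg0 : 0;

  assert (arg1 != 0);
  /* arg1[0] is assumed to hold the size of the buffer in bytes, so the
     element is only read when it lies inside the buffer.  */
  uint64_t size = arg1[0];
  return (uint64_t) n * sizeof (uint64_t) < size ? arg1[n] : 0;
}
```

A resolver written this way keeps working when the loader passes fewer elements than the resolver knows about, because an out-of-range request yields 0 rather than a read past the buffer.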
Yury Khrustalev ea14d04e9a aarch64: add support for hwcap3,4
Add basic support for hwcap3 and hwcap4 in dynamic loader and
ifunc resolvers.

Describe the new backward-compatible prototype for GNU indirect
function resolvers that use a pointer to a uint64_t array
instead of a pointer to the __ifunc_arg_t struct.

This patch also adds macro _IFUNC_HWCAP_MAX to specify current
number of hwcap elements.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-06-05 14:38:03 +01:00
Wilco Dijkstra aa18367c11 AArch64: Improve enabling of SVE for libmvec
When using a -mcpu option in CFLAGS, GCC can report errors when building libmvec.
Fix this by overriding both -mcpu and -march with a generic variant with SVE added.
Also use a tune for a modern SVE core.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-05-29 16:58:49 +00:00
Luna Lamb da196e6134 AArch64: Improve codegen in SVE log1p
Improves memory access, reformat evaluation scheme to pack coefficients.
5% improvement in throughput microbenchmark on Neoverse V1.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-05-29 15:25:35 +00:00
Wilco Dijkstra 2071666d03 AArch64: Fix typo in math-vector.h
Fix typo atanpi2->atan2pi in math-vector.h.
2025-05-20 13:44:16 +00:00
Wilco Dijkstra b990b0aee2 AArch64: Cleanup SVE config and defines
Now we finally support modern GCC and binutils, it's time for a cleanup.
Remove HAVE_AARCH64_SVE_ASM define and conditional compilation.  Remove the
configure checks for SVE, ACLE and variant-PCS support.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-05-20 10:33:55 +00:00
Wilco Dijkstra 2c421fc430 AArch64: Cleanup PAC and BTI
Now we finally support modern GCC and binutils, it's time for a cleanup.
Use PAC and BTI instructions unconditionally and use proper assembler syntax.
Remove the PR target/94791 strip_pac workarounds for buggy GCCs.  Remove the
PAC/BTI configure checks - always emit GNU property notes on assembly files.
Change cfi_window_save to the correct cfi_negate_ra_state unwind directive.

Reviewed-by: Matthieu Longo <matthieu.longo@arm.com>
2025-05-19 15:35:32 +00:00
Dylan Fleming 96abd59bf2 AArch64: Implement AdvSIMD and SVE atan2pi/f
Implement double and single precision variants of the C23 routine atan2pi
for both AdvSIMD and SVE.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-05-19 15:35:25 +00:00
Dylan Fleming edf6202815 AArch64: Implement AdvSIMD and SVE atanpi/f
Implement double and single precision variants of the C23 routine atanpi
for both AdvSIMD and SVE.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-05-19 15:34:40 +00:00
Dylan Fleming 0ef2cf44e7 AArch64: Implement AdvSIMD and SVE asinpi/f
Implement double and single precision variants of the C23 routine asinpi
for both AdvSIMD and SVE.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-05-19 15:33:50 +00:00
Dylan Fleming 993997ca1b AArch64: Implement AdvSIMD and SVE acospi/f
Implement double and single precision variants of the C23 routine acospi
for both AdvSIMD and SVE.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-05-19 15:31:59 +00:00
Dylan Fleming 1e84509e00 AArch64: Optimize inverse trig functions
Improve performance of Inverse trig functions by altering how coefficients are
loaded.

Performance improvement on Neoverse V1:
SVE     acos   14%
AdvSIMD acos   6%

AdvSIMD asin   6%
SVE     asin   5%
AdvSIMD asinf  2%

AdvSIMD atanf  22%
SVE     atanf  20%
SVE     atan   11%
AdvSIMD atan   5%

SVE     atan2  7%
SVE     atan2f 4%
AdvSIMD atan2f 3%
AdvSIMD atan2  2%

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-05-19 14:54:32 +00:00
Yury Khrustalev 251f932624 aarch64: update tests for SME
Add a test that checks that ZA state is disabled after setjmp and sigsetjmp.
Update the existing SME test that uses setjmp.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-05-15 14:23:35 +01:00
Yury Khrustalev a7f6fd976c aarch64: Disable ZA state of SME in setjmp and sigsetjmp
Due to the nature of the ZA state, setjmp() should clear it in the
same manner as it is already done by longjmp.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-05-15 14:23:03 +01:00