Now we finally support modern GCC and binutils, it's time for a cleanup.
Remove HAVE_AARCH64_SVE_ASM define and conditional compilation. Remove SVE
configure checks for SVE, ACLE and variant-PCS support.
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
Now we finally support modern GCC and binutils, it's time for a cleanup.
Use PAC and BTI instructions unconditionally and use proper assembler syntax.
Remove the PR target/94791 strip_pac workarounds for buggy GCCs. Remove the
PAC/BTI configure checks - always emit GNU property notes on assembly files.
Change cfi_window_save to the correct cfi_negate_ra_state unwind directive.
Reviewed-by: Matthieu Longo <matthieu.longo@arm.com>
Implement double and single precision variants of the C23 routine atan2pi
for both AdvSIMD and SVE.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Implement double and single precision variants of the C23 routine atanpi
for both AdvSIMD and SVE.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Implement double and single precision variants of the C23 routine asinpi
for both AdvSIMD and SVE.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Implement double and single precision variants of the C23 routine acospi
for both AdvSIMD and SVE.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Improve performance of Inverse trig functions by altering how coefficients are
loaded.
Performance improvement on Neoverse V1:
SVE acos 14%
AdvSIMD acos 6%
AdvSIMD asin 6%
SVE asin 5%
AdvSIMD asinf 2%
AdvSIMD atanf 22%
SVE atanf 20%
SVE atan 11%
AdvSIMD atan 5%
SVE atan2 7%
SVE atan2f 4%
AdvSIMD atan2f 3%
AdvSIMD atan2 2%
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Add test that checks that ZA state is disabled after setjmp and sigsetjmp
Update existing SME test that uses setjmp
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Due to the nature of the ZA state, setjmp() should clear it in the
same manner as it is already done by longjmp.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Previously, longjmp() on aarch64 was using CFI directives around the
call to __libc_arm_za_disable() after CFA was redefined at the start
of longjmp(). This may result in unwinding issues. Move the call and
surrounding CFI directives to the beginning of longjmp().
Suggested-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
I misunderstood the recommendation from the hardware team about non-temporal
load/stores. It is still recommended to use them in memset for large sizes. It
was not recommended for their use with device memory and memset is already
not valid to be used with device memory.
This reverts commit e6590f0c86.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
I misunderstood the recommendation from the hardware team about non-temporal
load/stores. It is still recommended to use them in memcpy for large sizes. It
was not recommended for their use with device memory and memcpy is already
not valid to be use with device memory.
This reverts commit eb5eeb4740.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
When libgcc is built with pac-ret, it requires to autenticate the
unwinding frame based on CFI information. The _dl_tlsdesc_dynamic
uses a custom calling convention, where it is responsible to save
and restore all registers it might use (even volatile).
The pac-ret support added by 1be3d6eb82
was added only on the slow-path, but the fast path also adds DWARF
Register Rule Instruction (cfi_adjust_cfa_offset) since it requires
to save/restore some auxiliary register. It seems that this is not
fully supported neither by libgcc nor AArch64 ABI [1].
Instead, move paciasp/autiasp to function prologue/epilogue to be
used on both fast and slow paths.
I also corrected the _dl_tlsdesc_dynamic comment description, it was
copied from i386 implementation without any adjustment.
Checked on aarch64-linux-gnu with a toolchain built with
--enable-standard-branch-protection on a system with pac-ret
support.
[1] https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst#id1
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
Polynomial order was unnecessarily high, unlocking multiple
optimizations.
Max error for new SVE expf is 0.88 +0.5ULP.
Max error for new SVE coshf is 2.56 +0.5ULP.
Performance improvement on Neoverse V1: expf (30%), coshf (26%).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The current approach tracks math maximum supported errors by explicitly
setting them per function and architecture. On newer implementations or
new compiler versions, the file is updated with newer values if it
shows higher results. The idea is to track the maximum known error, to
update the manual with the obtained values.
The constant libm-test-ulps shows little value, where it is usually a
mechanical change done by the maintainer, for past releases it is
usually ignored whether the ulp change resulted from a compiler
regression, and the math tests already have a maximum ulp error that
triggers a regression.
It was shown by a recent update after the new acosf [1] implementation
that is correctly rounded, where the libm-test-ulps was indeed from a
compiler issue.
This patch removes all arch-specific libm-test-ulps, adds system generic
libm-test-ulps where applicable, and changes its semantics. The generic
files now track specific implementation constraints, like if it is
expected to be correctly rounded, or if the system-specific has
different error expectations.
Now multiple libm-test-ulps can be defined, and system-specific
overrides generic implementation. This is for the case where
arch-specific implementation might show worse precision than generic
implementation, for instance, the cbrtf on i686.
Regressions are only reported if the implementation shows larger errors
than 9 ulps (13 for IBM long double) unless it is overridden by
libm-test-ulps and the maximum error is not printed at the end of tests.
The regen-ulps rule is also removed since it does not make sense to
update the libm-test-ulps automatically.
The manual error table is also removed, Paul Zimmermann and others have
been tracking libm precision with a more comprehensive analysis for some
releases; so link to his work instead.
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=9cc9f8e11e8fb8f54f1e84d9f024917634a78201
This series removes various ILP32 defines that are now
no longer needed.
Remove PTR_ARG/SIZE_ARG.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add SVE memset based on the generic memset with predicated load for sizes < 16.
Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned
stores for the last 64 bytes. Performance of random memset benchmark improves
by ~2% on Neoverse V1.
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reduce number of MOV/MOVPRFXs and use unpredicated FMUL.
Replace MUL with LSL. Speedup on Neoverse V1: 6%.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Use unpredicted muls, and improve memory access.
7%, 3% and 1% improvement in throughput microbenchmark on Neoverse V1,
for exp, exp2 and cosh respectively.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Use unpredicated muls, use lanewise mla's and improve memory access.
1% regression in throughput microbenchmark on Neoverse V1.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
- Add GCS marking to some of the tests when target supports GCS
- Fix tst-ro-dynamic-mod.map linker script to avoid removing
GNU properties
- Add header with macros for GNU properties
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Allocate GCS based on the stack size, this can be used for coroutines
(makecontext) and thread creation (if the kernel allows user allocated
GCS).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
check_gcs is called for each dependency of a DSO, but the GNU property
of the ld.so is not processed so ldso->l_mach.gcs may not be correct.
Just assume ld.so is GCS compatible independently of the ELF marking.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
- Handle GCS marking
- Use l_searchlist.r_list for gcs (allows using the
same function for static exe)
Co-authored-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Use the dynamic linker start code to enable GCS in the dynamic linked
case after _dl_start returns and before _dl_start_user which marks
the point after which user code may run.
Like in the static linked case this ensures that GCS is enabled on a
top level stack frame.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This tunable controls Guarded Control Stack (GCS) for the process.
0 = disabled: do not enable GCS
1 = enforced: check markings and fail if any binary is not marked
2 = optional: check markings but keep GCS off if a binary is unmarked
3 = override: enable GCS, markings are ignored
By default it is 0, so GCS is disabled, value 1 will enable GCS.
The status is stored into GL(dl_aarch64_gcs) early and only applied
later, since enabling GCS is tricky: it must happen on a top level
stack frame. Using GL instead of GLRO because it may need updates
depending on loaded libraries that happen after readonly protection
is applied, however library marking based GCS setting is not yet
implemented.
Describe new tunable in the manual.
Co-authored-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This implementations ensures that longjmp across different stacks
works: it scans for GCS cap token and switches GCS if necessary
then the target GCSPR is restored with a GCSPOPM loop once the
current GCSPR is on the same GCS.
This makes longjmp linear time in the number of jumped over stack
frames when GCS is enabled.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The target specific internal __longjmp is called with a __jmp_buf
argument which has its size exposed in the ABI. On aarch64 this has
no space left, so GCSPR cannot be restored in longjmp in the usual
way, which is needed for the Guarded Control Stack (GCS) extension.
setjmp is implemented via __sigsetjmp which has a jmp_buf argument
however it is also called with __pthread_unwind_buf_t argument cast
to jmp_buf (in cancellation cleanup code built with -fno-exception).
The two types, jmp_buf and __pthread_unwind_buf_t, have common bits
beyond the __jmp_buf field and there is unused space there which we
can use for saving GCSPR.
For this to work some bits of those two generic types have to be
reserved for target specific use and the generic code in glibc has
to ensure that __longjmp is always called with a __jmp_buf that is
embedded into one of those two types. Morally __longjmp should be
changed to take jmp_buf as argument, but that is an intrusive change
across targets.
Note: longjmp is never called with __pthread_unwind_buf_t from user
code, only the internal __libc_longjmp is called with that type and
thus the two types could have separate longjmp implementations on a
target. We don't rely on this now (but might in the future given that
cancellation unwind does not need to restore GCSPR).
Given the above this patch finds an unused slot for GCSPR. This
placement is not exposed in the ABI so it may change in the future.
This is also very target ABI specific so the generic types cannot
be easily changed to clearly mark the reserved fields.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The Guarded Control Stack instructions can be present even if the
hardware does not support the extension (runtime checked feature),
so the asm code should be backward compatible with old assemblers.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
clang issues:
error: value size does not match register size specified by the
constraint and modifier [-Werror,-Wasm-operand-widths]
while tryng to use 32 bit variables with 'mrs' to get/set the
fpsr, dczid_el0, and ctr.