Commit Graph

616 Commits

Author SHA1 Message Date
Yury Khrustalev 691edbdf77 aarch64: fix unwinding in longjmp
Previously, longjmp() on aarch64 was using CFI directives around the
call to __libc_arm_za_disable() after CFA was redefined at the start
of longjmp(). This may result in unwinding issues. Move the call and
surrounding CFI directives to the beginning of longjmp().

Suggested-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
2025-05-13 13:00:57 +01:00
Andrew Pinski ceeffd970c aarch64: Add back non-temporal load/stores from oryon-1's memset
I misunderstood the recommendation from the hardware team about non-temporal
load/stores. It is still recommended to use them in memset for large sizes. It
was not recommended for their use with device memory and memset is already
not valid to be used with device memory.

This reverts commit e6590f0c86.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-04-15 12:07:07 -07:00
Andrew Pinski 0e1aa5db73 aarch64: Add back non-temporal load/stores from oryon-1's memcpy
I misunderstood the recommendation from the hardware team about non-temporal
load/stores. It is still recommended to use them in memcpy for large sizes. It
was not recommended for their use with device memory and memcpy is already
not valid to be use with device memory.

This reverts commit eb5eeb4740.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-04-15 12:06:59 -07:00
Adhemerval Zanella 4352e2cc93 aarch64: Fix _dl_tlsdesc_dynamic unwind for pac-ret (BZ 32612)
When libgcc is built with pac-ret, it requires to autenticate the
unwinding frame based on CFI information.  The _dl_tlsdesc_dynamic
uses a custom calling convention, where it is responsible to save
and restore all registers it might use (even volatile).

The pac-ret support added by 1be3d6eb82
was added only on the slow-path, but the fast path also adds DWARF
Register Rule Instruction (cfi_adjust_cfa_offset) since it requires
to save/restore some auxiliary register.  It seems that this is not
fully supported neither by libgcc nor AArch64 ABI [1].

Instead, move paciasp/autiasp to function prologue/epilogue to be
used on both fast and slow paths.

I also corrected the _dl_tlsdesc_dynamic comment description, it was
copied from i386 implementation without any adjustment.

Checked on aarch64-linux-gnu with a toolchain built with
--enable-standard-branch-protection on a system with pac-ret
support.

[1]  https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst#id1

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-03-31 10:08:06 -03:00
Pierre Blanchard cf56eb28fa AArch64: Optimize algorithm in users of SVE expf helper
Polynomial order was unnecessarily high, unlocking multiple
optimizations.
Max error for new SVE expf is 0.88 +0.5ULP.
Max error for new SVE coshf is 2.56 +0.5ULP.
Performance improvement on Neoverse V1: expf (30%), coshf (26%).

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-03-18 17:15:18 +00:00
Adhemerval Zanella 3e8814903c math: Refactor how to use libm-test-ulps
The current approach tracks math maximum supported errors by explicitly
setting them per function and architecture. On newer implementations or
new compiler versions, the file is updated with newer values if it
shows higher results. The idea is to track the maximum known error, to
update the manual with the obtained values.

The constant libm-test-ulps shows little value, where it is usually a
mechanical change done by the maintainer, for past releases it is
usually ignored whether the ulp change resulted from a compiler
regression, and the math tests already have a maximum ulp error that
triggers a regression.

It was shown by a recent update after the new acosf [1] implementation
that is correctly rounded, where the libm-test-ulps was indeed from a
compiler issue.

This patch removes all arch-specific libm-test-ulps, adds system generic
libm-test-ulps where applicable, and changes its semantics. The generic
files now track specific implementation constraints, like if it is
expected to be correctly rounded, or if the system-specific has
different error expectations.

Now multiple libm-test-ulps can be defined, and system-specific
overrides generic implementation.  This is for the case where
arch-specific implementation might show worse precision than generic
implementation, for instance, the cbrtf on i686.

Regressions are only reported if the implementation shows larger errors
than 9 ulps (13 for IBM long double) unless it is overridden by
libm-test-ulps and the maximum error is not printed at the end of tests.
The regen-ulps rule is also removed since it does not make sense to
update the libm-test-ulps automatically.

The manual error table is also removed, Paul Zimmermann and others have
been tracking libm precision with a more comprehensive analysis for some
releases; so link to his work instead.

[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=9cc9f8e11e8fb8f54f1e84d9f024917634a78201
2025-03-12 13:40:07 -03:00
Wilco Dijkstra 0f044be1da AArch64: Use prefer_sve_ifuncs for SVE memset
Use prefer_sve_ifuncs for SVE memset just like memcpy.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-02-27 16:51:57 +00:00
Wilco Dijkstra 935563754b AArch64: Remove LP64 and ILP32 ifdefs
Remove LP64 and ILP32 ifdefs.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 14:20:29 +00:00
Wilco Dijkstra 4c11379106 AArch64: Simplify lrint
Simplify lrint.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 14:20:03 +00:00
Wilco Dijkstra 0a021727bc AArch64: Remove AARCH64_R macro
Remove AArch64_R relocation macro.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 14:19:19 +00:00
Wilco Dijkstra eb7ac024d9 AArch64: Cleanup pointer mangling
Cleanup pointer mangling.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 14:17:57 +00:00
Wilco Dijkstra 19860fd42e AArch64: Remove PTR_REG defines
Remove PTR_REG defines.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 14:16:55 +00:00
Wilco Dijkstra ce2f26a22e AArch64: Remove PTR_ARG/SIZE_ARG defines
This series removes various ILP32 defines that are now
no longer needed.

Remove PTR_ARG/SIZE_ARG.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 14:15:15 +00:00
Wilco Dijkstra 163b1bbb76 AArch64: Add SVE memset
Add SVE memset based on the generic memset with predicated load for sizes < 16.
Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned
stores for the last 64 bytes.  Performance of random memset benchmark improves
by ~2% on Neoverse V1.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-02-20 15:31:50 +00:00
Yat Long Poon 95e807209b AArch64: Improve codegen for SVE powf
Improve memory access with indexed/unpredicated instructions.
Eliminate register spills.  Speedup on Neoverse V1: 3%.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-02-13 18:16:54 +00:00
Yat Long Poon 0b195651db AArch64: Improve codegen for SVE pow
Move constants to struct.  Improve memory access with indexed/unpredicated
instructions.  Eliminate register spills.  Speedup on Neoverse V1: 24%.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-02-13 18:16:54 +00:00
Yat Long Poon f5ff34cb3c AArch64: Improve codegen for SVE erfcf
Reduce number of MOV/MOVPRFXs and use unpredicated FMUL.
Replace MUL with LSL.  Speedup on Neoverse V1: 6%.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-02-13 18:16:54 +00:00
Luna Lamb c0ff447edf Aarch64: Improve codegen in SVE exp and users, and update expf_inline
Use unpredicted muls, and improve memory access.
7%, 3% and 1% improvement in throughput microbenchmark on Neoverse V1,
for exp, exp2 and cosh respectively.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-02-13 18:16:54 +00:00
Luna Lamb 8f0e7fe61e Aarch64: Improve codegen in SVE asinh
Use unpredicated muls, use lanewise mla's and improve memory access.
1% regression in throughput microbenchmark on Neoverse V1.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-02-13 18:16:54 +00:00
Adhemerval Zanella 8f170dc819 math: Use tanpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic tanpif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                      master        patched   improvement
x86_64                      85.1683        47.7990        43.88%
x86_64v2                    76.8219        41.4679        46.02%
x86_64v3                    73.7775        37.7734        48.80%
aarch64 (Neoverse)          35.4514        18.0742        49.02%
power8                      22.7604        10.1054        55.60%
power10                     22.1358         9.9553        55.03%

reciprocal-throughput        master        patched   improvement
x86_64                      41.0174        19.4718        52.53%
x86_64v2                    34.8565        11.3761        67.36%
x86_64v3                    34.0325         9.6989        71.50%
aarch64 (Neoverse)          25.4349         9.2017        63.82%
power8                      13.8626         3.8486        72.24%
power10                     11.7933         3.6420        69.12%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella de2fca9fe2 math: Use sinpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic sinpif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                      master        patched   improvement
x86_64                      47.5710        38.4455        19.18%
x86_64v2                    46.8828        40.7563        13.07%
x86_64v3                    44.0034        34.1497        22.39%
aarch64 (Neoverse)          19.2493        14.1968        26.25%
power8                      23.5312        16.3854        30.37%
power10                     22.6485        10.2888        54.57%

reciprocal-throughput        master        patched   improvement
x86_64                      21.8858        11.6717        46.67%
x86_64v2                    22.0620        11.9853        45.67%
x86_64v3                    21.5653        11.3291        47.47%
aarch64 (Neoverse)          13.0615         6.5499        49.85%
power8                      16.2030         6.9580        57.06%
power10                     12.8911         4.2858        66.75%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella be85208b9f math: Use cospif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic cospif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                    master        patched   improvement
x86_64                    47.4679        38.4157        19.07%
x86_64v2                  46.9686        38.3329        18.39%
x86_64v3                  43.8929        31.8510        27.43%
aarch64 (Neoverse)        18.8867        13.2089        30.06%
power8                    22.9435         7.8023        65.99%
power10                   15.4472        7.77505        49.67%

reciprocal-throughput      master        patched   improvement
x86_64                    20.9518        11.4991        45.12%
x86_64v2                  19.8699        10.5921        46.69%
x86_64v3                  19.3475         9.3998        51.42%
aarch64 (Neoverse)        12.5767         6.2158        50.58%
power8                    15.0566         3.2654        78.31%
power10                    9.2866         3.1147        66.46%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella 95a01ea955 math: Use atanpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic atanpif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                     master        patched   improvement
x86_64                     66.3296        52.7558        20.46%
x86_64v2                   66.0429        51.4007        22.17%
x86_64v3                   60.6294        48.7876        19.53%
aarch64 (Neoverse)         24.3163        20.9110        14.00%
power8                     16.5766        13.3620        19.39%
power10                    16.5115        13.4072        18.80%

reciprocal-throughput       master        patched   improvement
x86_64                     30.8599        16.0866        47.87%
x86_64v2                   29.2286        15.4688        47.08%
x86_64v3                   23.0960        12.8510        44.36%
aarch64 (Neoverse)         15.4619        10.6752        30.96%
power8                      7.9200         5.2483        33.73%
power10                     6.8539         4.6262        32.50%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella 1cd9ccd8c0 math: Use atan2pif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic atan2pif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master        patched   improvement
x86_64                 79.4006        70.8726        10.74%
x86_64v2               77.5136        69.1424        10.80%
x86_64v3               71.8050        68.1637         5.07%
aarch64 (Neoverse)     27.8363        24.7700        11.02%
power8                 39.3893        17.2929        56.10%
power10                19.7200        16.8187        14.71%

reciprocal-throughput   master        patched   improvement
x86_64                 38.3457        30.9471        19.29%
x86_64v2               37.4023        30.3112        18.96%
x86_64v3               33.0713        24.4891        25.95%
aarch64 (Neoverse)     19.3683        15.3259        20.87%
power8                 19.5507        8.27165        57.69%
power10                9.05331        7.63775        15.64%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella ae679a0aca math: Use asinpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic asinpif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master        patched   improvement
x86_64                 46.4996        41.6126        10.51%
x86_64v2               46.7551        38.8235        16.96%
x86_64v3               42.6235        33.7603        20.79%
aarch64 (Neoverse)     17.4161        14.3604        17.55%
power8                 10.7347         9.0193        15.98%
power10                10.6420         9.0362        15.09%

reciprocal-throughput   master        patched   improvement
x86_64                 24.7208        16.5544        33.03%
x86_64v2               24.2177        14.8938        38.50%
x86_64v3               20.5617        10.5452        48.71%
aarch64 (Neoverse)     13.4827        7.17613        46.78%
power8                 6.46134        3.56089        44.89%
power10                5.79007        3.49544        39.63%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella edb2a8f0ae math: Use acospif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic acospif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                  master        patched   improvement
x86_64                  54.8281        42.9070        21.74%
x86_64v2                54.1717        42.7497        21.08%
x86_64v3                49.3552        34.1512        30.81%
aarch64 (Neoverse)      17.9395        14.3733        19.88%
power8                  20.3110         8.8609        56.37%
power10                 11.3113        8.84067        21.84%

reciprocal-throughput    master        patched   improvement
x86_64                  21.2301        14.4803        31.79%
x86_64v2                20.6858        13.9506        32.56%
x86_64v3                16.1944        11.3377        29.99%
aarch64 (Neoverse)      11.4474        7.13282        37.69%
power8                  10.6916        3.57547        66.56%
power10                 4.64269        3.54145        23.72%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Yury Khrustalev d3f2b71ef1 aarch64: Fix tests not compatible with targets supporting GCS
- Add GCS marking to some of the tests when target supports GCS
 - Fix tst-ro-dynamic-mod.map linker script to avoid removing
   GNU properties
 - Add header with macros for GNU properties

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 09:36:19 +00:00
Szabolcs Nagy 3d8da0d91b aarch64: Add GCS user-space allocation logic
Allocate GCS based on the stack size, this can be used for coroutines
(makecontext) and thread creation (if the kernel allows user allocated
GCS).

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 09:36:19 +00:00
Szabolcs Nagy 29476485f9 aarch64: Ignore GCS property of ld.so
check_gcs is called for each dependency of a DSO, but the GNU property
of the ld.so is not processed so ldso->l_mach.gcs may not be correct.
Just assume ld.so is GCS compatible independently of the ELF marking.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 09:36:19 +00:00
Szabolcs Nagy 4d56a5bbd6 aarch64: Handle GCS marking
- Handle GCS marking
 - Use l_searchlist.r_list for gcs (allows using the
   same function for static exe)

Co-authored-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-01-20 09:35:56 +00:00
Szabolcs Nagy 8d516b6f85 aarch64: Use l_searchlist.r_list for bti
Allows using the same function for static exe.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 09:31:47 +00:00
Szabolcs Nagy 76b79f7241 aarch64: Mark objects with GCS property note
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-20 09:31:47 +00:00
Szabolcs Nagy 01f52b11de aarch64: Enable GCS in dynamic linked exe
Use the dynamic linker start code to enable GCS in the dynamic linked
case after _dl_start returns and before _dl_start_user which marks
the point after which user code may run.

Like in the static linked case this ensures that GCS is enabled on a
top level stack frame.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-01-20 09:31:47 +00:00
Szabolcs Nagy 9ad3d9267d aarch64: Add glibc.cpu.aarch64_gcs tunable
This tunable controls Guarded Control Stack (GCS) for the process.

0 = disabled: do not enable GCS
1 = enforced: check markings and fail if any binary is not marked
2 = optional: check markings but keep GCS off if a binary is unmarked
3 = override: enable GCS, markings are ignored

By default it is 0, so GCS is disabled, value 1 will enable GCS.

The status is stored into GL(dl_aarch64_gcs) early and only applied
later, since enabling GCS is tricky: it must happen on a top level
stack frame. Using GL instead of GLRO because it may need updates
depending on loaded libraries that happen after readonly protection
is applied, however library marking based GCS setting is not yet
implemented.

Describe new tunable in the manual.

Co-authored-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-01-20 09:31:33 +00:00
Szabolcs Nagy 7d22054db7 aarch64: Mark swapcontext with indirect_return
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-20 09:22:41 +00:00
Szabolcs Nagy 5ff5e7836e aarch64: Add GCS support to longjmp
This implementations ensures that longjmp across different stacks
works: it scans for GCS cap token and switches GCS if necessary
then the target GCSPR is restored with a GCSPOPM loop once the
current GCSPR is on the same GCS.

This makes longjmp linear time in the number of jumped over stack
frames when GCS is enabled.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 09:22:41 +00:00
Szabolcs Nagy 13cbbb0cb2 aarch64: Define jmp_buf offset for GCS
The target specific internal __longjmp is called with a __jmp_buf
argument which has its size exposed in the ABI. On aarch64 this has
no space left, so GCSPR cannot be restored in longjmp in the usual
way, which is needed for the Guarded Control Stack (GCS) extension.

setjmp is implemented via __sigsetjmp which has a jmp_buf argument
however it is also called with __pthread_unwind_buf_t argument cast
to jmp_buf (in cancellation cleanup code built with -fno-exception).
The two types, jmp_buf and __pthread_unwind_buf_t, have common bits
beyond the __jmp_buf field and there is unused space there which we
can use for saving GCSPR.

For this to work some bits of those two generic types have to be
reserved for target specific use and the generic code in glibc has
to ensure that __longjmp is always called with a __jmp_buf that is
embedded into one of those two types. Morally __longjmp should be
changed to take jmp_buf as argument, but that is an intrusive change
across targets.

Note: longjmp is never called with __pthread_unwind_buf_t from user
code, only the internal __libc_longjmp is called with that type and
thus the two types could have separate longjmp implementations on a
target. We don't rely on this now (but might in the future given that
cancellation unwind does not need to restore GCSPR).

Given the above this patch finds an unused slot for GCSPR. This
placement is not exposed in the ABI so it may change in the future.
This is also very target ABI specific so the generic types cannot
be easily changed to clearly mark the reserved fields.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 09:22:41 +00:00
Szabolcs Nagy 58771b8a59 aarch64: Add asm helpers for GCS
The Guarded Control Stack instructions can be present even if the
hardware does not support the extension (runtime checked feature),
so the asm code should be backward compatible with old assemblers.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 09:22:41 +00:00
Adhemerval Zanella 6c575d835e aarch64: Use 64-bit variable to access the special registers
clang issues:

  error: value size does not match register size specified by the
  constraint and modifier [-Werror,-Wasm-operand-widths]

while tryng to use 32 bit variables with 'mrs' to get/set the
fpsr, dczid_el0, and ctr.
2025-01-13 10:17:38 -03:00
Adhemerval Zanella 9cc9f8e11e math: Fix acosf when building with gcc <= 11
GCC <= 11 wrongly assumes the rounding is to nearest and performs a
constant folding where it should evaluate since the result is not
exact [1].

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57245
2025-01-09 12:53:58 -03:00
Luna Lamb f86b4cf875 AArch64: Improve codegen in SVE expm1f and users
Use unpredicated muls, use absolute compare and improve memory access.
Expm1f, sinhf and tanhf show 7%, 5% and 1% improvement in throughput
microbenchmark on Neoverse V1.
2025-01-03 21:42:51 +00:00
Joe Ramsay 080998f6e7 AArch64: Add vector tanpi routines
Vector variant of the new C23 tanpi. New tests pass on AArch64.
2025-01-03 21:39:56 +00:00
Joe Ramsay 40c3a06293 AArch64: Add vector cospi routines
Vector variant of the new C23 cospi. New tests pass on AArch64.
2025-01-03 21:39:56 +00:00
Joe Ramsay 6050b45716 AArch64: Add vector sinpi to libmvec
Vector variant of the new C23 sinpi. New tests pass on AArch64.
2025-01-03 21:39:56 +00:00
Yat Long Poon 91c1fadba3 AArch64: Improve codegen for SVE log1pf users
Reduce memory access by using lanewise MLA and reduce number of MOVPRFXs.
Move log1pf implementation to inline helper function.
Speedup on Neoverse V1 for log1pf (10%), acoshf (-1%), atanhf (2%), asinhf (2%).
2025-01-03 21:39:56 +00:00
Yat Long Poon 32d193a372 AArch64: Improve codegen for SVE logs
Reduce memory access by using lanewise MLA and moving constants to struct
and reduce number of MOVPRFXs.
Update maximum ULP error for double log_sve from 1 to 2.
Speedup on Neoverse V1 for log (3%), log2 (5%), and log10 (4%).
2025-01-03 21:39:56 +00:00
Luna Lamb aa6609feb2 AArch64: Improve codegen in SVE tans
Improves memory access.
Tan: MOVPRFX 7 -> 2, LD1RD 12 -> 5, move MOV away from return.
Tanf: MOV 2 -> 1, MOVPRFX 6 -> 3, LD1RW 5 -> 4, move mov away from return.
2025-01-03 21:39:56 +00:00
Luna Lamb 140b985e5a AArch64: Improve codegen in AdvSIMD asinh
Improves memory access and removes spills.
Load the polynomial evaluation coefficients into 2 vectors and use lanewise
MLAs.  Reduces MOVs 6->3 , LDR 11->5, STR/STP 2->0, ADRP 3->2.
2025-01-03 21:39:56 +00:00
Wilco Dijkstra 0ab62fa4f6 AArch64: Update libm-test-ulps
Update ulps for (a)cospi, (a)sinpi, (a)tanpi, atan2pi.
2025-01-02 17:53:07 +00:00
Florian Weimer ceae7e2770 elf: Introduce generic <dl-tls.h>
On arc, the definition of TLS_DTV_UNALLOCATED now comes from
<dl-dtv.h>.

For x86-64 x32, a separate version is needed because unsigned long int
is 32 bits on this target.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-01-02 13:45:27 +01:00
Paul Eggert 2642002380 Update copyright dates with scripts/update-copyrights 2025-01-01 11:22:09 -08:00
Florian Weimer 6a99b4172a aarch64: Regenerate ulps
Results from running on Neoverse-V2, built with GCC 11.5.
2024-12-20 07:12:30 +01:00
Adhemerval Zanella 0e0be3ed80 math: Use tanhf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic tanhf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      51.5273        41.0951        20.25%
x86_64v2                    47.7021        39.1526        17.92%
x86_64v3                    45.0373        34.2737        23.90%
i686                       133.9970        83.8596        37.42%
aarch64 (Neoverse)          21.5439        14.7961        31.32%
power10                     13.3301         8.4406        36.68%

reciprocal-throughput        master        patched   improvement
x86_64                      24.9493        12.8547        48.48%
x86_64v2                    20.7051        12.7761        38.29%
x86_64v3                    19.2492        11.0851        42.41%
i686                        78.6498        29.8211        62.08%
aarch64 (Neoverse)          11.6026        7.11487        38.68%
power10                      6.3328         2.8746        54.61%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 1751c0519a math: Use sinhf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic sinhf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      52.6819        49.1489         6.71%
x86_64v2                    49.1162        42.9447        12.57%
x86_64v3                    46.9732        39.9157        15.02%
i686                       141.1470       129.6410         8.15%
aarch64 (Neoverse)          20.8539        17.1288        17.86%
power10                     14.5258        9.1906         36.73%

reciprocal-throughput        master        patched   improvement
x86_64                      27.5553        23.9395        13.12%
x86_64v2                    21.6423        20.3219         6.10%
x86_64v3                    21.4842        16.0224        25.42%
i686                        87.9709        86.1626         2.06%
aarch64 (Neoverse)          15.1919        12.2744        19.20%
power10                      7.2188         5.2611        27.12%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 9583836785 math: Use coshf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode),
although it should worse performance than current one.  The current
implementation performance comes mainly from the internal usage of
the optimize expf implementation, and shows a maximum ULPs of 2 for
FE_TONEAREST and 3 for other rounding modes.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      40.6995        49.0737       -20.58%
x86_64v2                    40.5841        44.3604        -9.30%
x86_64v3                    39.3879        39.7502        -0.92%
i686                       112.3380       129.8570       -15.59%
aarch64 (Neoverse)          18.6914        17.0946         8.54%
power10                     11.1343        9.3245         16.25%

reciprocal-throughput        master        patched   improvement
x86_64                      18.6471        24.1077       -29.28%
x86_64v2                    17.7501        20.2946       -14.34%
x86_64v3                    17.8262        17.1877         3.58%
i686                        64.1454        86.5645       -34.95%
aarch64 (Neoverse)          9.77226        12.2314       -25.16%
power10                      4.0200        5.3316        -32.63%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 7cfd8b5698 math: Use atanhf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic atanhf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      59.4930        45.8568        22.92%
x86_64v2                    59.5705        45.5804        23.48%
x86_64v3                    53.1838        37.7155        29.08%
i686                        169.354       133.5940        21.12%
aarch64 (Neoverse)          26.0781        16.9829        34.88%
power10                     15.6591        10.7623        31.27%

reciprocal-throughput        master        patched   improvement
x86_64                      23.5903        18.5766        21.25%
x86_64v2                    22.6489        18.2683        19.34%
x86_64v3                    19.0401        13.9474        26.75%
i686                        97.6034       107.3260        -9.96%
aarch64 (Neoverse)          15.3664        9.57846        37.67%
power10                      6.8877        4.6242         32.86%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 6f9bacf36b math: Use atan2f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic atan2f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      68.1175        69.2014        -1.59%
x86_64v2                    66.9884        66.0081         1.46%
x86_64v3                    57.7034        61.6407        -6.82%
i686                       189.8690        152.7560       19.55%
aarch64 (Neoverse)          32.6151        24.5382        24.76%
power10                     21.7282        17.1896        20.89%

reciprocal-throughput        master        patched   improvement
x86_64                      34.5202        31.6155         8.41%
x86_64v2                    32.6379        30.3372         7.05%
x86_64v3                    34.3677        23.6455        31.20%
i686                       157.7290        75.8308        51.92%
aarch64 (Neoverse)          27.7788        16.2671        41.44%
power10                     15.5715         8.1588        47.60%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella a357d6273f math: Use atanf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic atanf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      56.8265        53.6842         5.53%
x86_64v2                    54.8177        53.6842         2.07%
x86_64v3                    46.2915        48.7034        -5.21%
i686                       158.3760        108.9560       31.20%
aarch64 (Neoverse)           21.687        20.5893         5.06%
power10                     13.1903        13.5012        -2.36%

reciprocal-throughput        master        patched   improvement
x86_64                      16.6787        16.7601        -0.49%
x86_64v2                    16.6983        16.7601        -0.37%
x86_64v3                    16.2268        12.1391        25.19%
i686                       138.6840        36.0640        74.00%
aarch64 (Neoverse)          11.8012        10.3565        12.24%
power10                      5.3212         4.2894        19.39%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella ed608a40e2 math: Use asinhf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic asinhf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      64.5128        56.9717        11.69%
x86_64v2                    63.3065        57.2666         9.54%
x86_64v3                    62.8719        51.4170        18.22%
i686                       189.1630        137.635        27.24%
aarch64 (Neoverse)          25.3551        20.5757        18.85%
power10                     17.9712        13.3302        25.82%

reciprocal-throughput        master        patched   improvement
x86_64                      20.0844        15.4731        22.96%
x86_64v2                    19.2919        15.4000        20.17%
x86_64v3                    18.7226        11.9009        36.44%
i686                       103.7670        80.2681        22.65%
aarch64 (Neoverse)          12.5005        8.68969        30.49%
power10                      7.2220        5.03617        30.27%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>:
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 5fb4b566ef math: Use asinf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic asinf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      42.8237        35.2460        17.70%
x86_64v2                    43.3711        35.9406        17.13%
x86_64v3                    35.0335        30.5744        12.73%
i686                       213.8780        104.4710       51.15%
aarch64 (Neoverse)          17.2937        13.6025        21.34%
power10                     12.0227        7.4241         38.25%

reciprocal-throughput        master        patched   improvement
x86_64                      13.6770        15.5231       -13.50%
x86_64v2                    13.8722        16.0446       -15.66%
x86_64v3                    13.6211        13.2753         2.54%
i686                       186.7670        45.4388        75.67%
aarch64 (Neoverse)          9.96089        9.39285         5.70%
power10                      4.9862        3.7819         24.15%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 673e6fe110 math: Use acoshf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic acoshf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      61.2471        58.7742         4.04%
x86_64-v2                   62.6519        59.0523         5.75%
x86_64-v3                   58.7408        50.1393        14.64%
aarch64                     24.8580        21.3317        14.19%
power10                     17.0469        13.1345        22.95%

reciprocal-throughput        master        patched   improvement
x86_64                      16.1618        15.1864         6.04%
x86_64-v2                   15.7729        14.7563         6.45%
x86_64-v3                   14.1669        11.9568        15.60%
aarch64                      10.911        9.5486         12.49%
power10                     6.38196        5.06734        20.60%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 66fa7ad437 math: Use acosf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic acosf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      52.5098        36.6312        30.24%
x86_64v2                    53.0217        37.3091        29.63%
x86_64v3                    42.8501        32.3977        24.39%
i686                       207.3960       109.4000        47.25%
aarch64                     21.3694        13.7871        35.48%
power10                     14.5542         7.2891        49.92%

reciprocal-throughput        master        patched   improvement
x86_64                      14.1487        15.9508       -12.74%
x86_64v2                    14.3293        16.1899       -12.98%
x86_64v3                    13.6563        12.6161         7.62%
i686                       158.4060        45.7354        71.13%
aarch64                     12.5515        9.19233        26.76%
power10                      5.7868         3.3487        42.13%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Joana Cruz cff9648d0b AArch64: Improve codegen of AdvSIMD expf family
Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs.
Also use intrinsics instead of native operations.
expf: 3% improvement in throughput microbenchmark on Neoverse V1, exp2f: 5%,
exp10f: 13%, coshf: 14%.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-12-17 15:28:22 +00:00
Joana Cruz 6914774b9d AArch64: Improve codegen of AdvSIMD atan(2)(f)
Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs.
8% improvement in throughput microbenchmark on Neoverse V1.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-12-17 15:28:22 +00:00
Joana Cruz d6e034f5b2 AArch64: Improve codegen of AdvSIMD logf function family
Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs.
8% improvement in throughput microbenchmark on Neoverse V1 for log2 and log,
and 2% for log10.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-12-17 15:25:58 +00:00
Wilco Dijkstra ca7d48a80f AArch64: Update libm-test-ulps
Update ulps for acospi, asinpi, atanpi, atan2pi.
2024-12-13 17:14:58 +00:00
Pierre Blanchard 13a7ef5999 AArch64: Improve codegen in users of ADVSIMD expm1 helper
Add inline helper for expm1 and rearrange operations so MOV
is not necessary in reduction or around the special-case handler.
Reduce memory access by using more indexed MLAs in polynomial.
Speedup on Neoverse V1 for expm1 (19%), sinh (8.5%), and tanh (7.5%).
2024-12-09 16:20:34 +00:00
Pierre Blanchard ca0c0d0f26 AArch64: Improve codegen in users of ADVSIMD log1p helper
Add inline helper for log1p and rearrange operations so MOV
is not necessary in reduction or around the special-case handler.
Reduce memory access by using more indexed MLAs in polynomial.
Speedup on Neoverse V1 for log1p (3.5%), acosh (7.5%) and atanh (10%).
2024-12-09 16:20:34 +00:00
Pierre Blanchard 8eb5ad2ebc AArch64: Improve codegen in AdvSIMD logs
Remove spurious ADRP and a few MOVs.
Reduce memory access by using more indexed MLAs in polynomial.
Align notation so that algorithms are easier to compare.
Speedup on Neoverse V1 for log10 (8%), log (8.5%), and log2 (10%).
Update error threshold in AdvSIMD log (now matches SVE log).
2024-12-09 16:20:34 +00:00
Pierre Blanchard 569cfaaf49 AArch64: Improve codegen in AdvSIMD pow
Remove spurious ADRP. Improve memory access by shuffling constants and
using more indexed MLAs.

A few more optimisation with no impact on accuracy
- force fmas contraction
- switch from shift-aided rint to rint instruction

Between 1 and 5% throughput improvement on Neoverse
V1 depending on benchmark.
2024-12-09 16:20:34 +00:00
Andreas K. Hüttel 80d1e63e90
math: Add tanpi aarch64 ulps
Linux dola 5.15.169-gentoo-dist #1 SMP Wed Oct 23 06:25:30 -00 2024 aarch64 GNU/Linux
Vendor ID:                ARM
  Model name:             Neoverse-N1

gcc (Gentoo Hardened 13.3.1_p20241025 p1) 13.3.1 20241024

Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
2024-12-08 18:25:05 +01:00
Wilco Dijkstra fa16523c48 AArch64: Update libm-test-ulps
Add sinpi/cospi.
2024-12-05 16:19:37 +00:00
Adhemerval Zanella 17a43505b3 elf: Consolidate stackinfo.h
And use sane default the generic implementation.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-12-02 17:14:58 +00:00
Wilco Dijkstra a08d9a52f9 AArch64: Remove zva_128 from memset
Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]).  Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit
e162ab2bf1, remove this special
case too.

[1] https://sourceware.org/pipermail/libc-alpha/2024-November/161626.html

Reviewed-by: Andrew Pinski <quic_apinski@quicinc.com>
2024-11-29 13:27:13 +00:00
Adhemerval Zanella bccb0648ea math: Use tanf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic tanf.

The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, and to use a generic
128 bit routine for ABIs that do not support it natively.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (neoverse1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       82.3961       54.8052       33.49%
x86_64v2                     82.3415       54.8052       33.44%
x86_64v3                     69.3661       50.4864       27.22%
i686                         219.271       45.5396       79.23%
aarch64                      29.2127       19.1951       34.29%
power10                      19.5060       16.2760       16.56%

reciprocal-throughput         master       patched  improvement
x86_64                       28.3976       19.7334       30.51%
x86_64v2                     28.4568       19.7334       30.65%
x86_64v3                     21.1815       16.1811       23.61%
i686                         105.016       15.1426       85.58%
aarch64                      18.1573       10.7681       40.70%
power10                       8.7207        8.7097        0.13%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella d846f4c12d math: Use lgammaf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic lgammaf.

The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, to use math_narrow_eval
on overflow usage, and to adapt to make it reentrant.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       86.5609       70.3278       18.75%
x86_64v2                     78.3030       69.9709       10.64%
x86_64v3                     74.7470       59.8457       19.94%
i686                         387.355       229.761       40.68%
aarch64                      40.8341       33.7563       17.33%
power10                      26.5520       16.1672       39.11%
powerpc                      28.3145       17.0625       39.74%

reciprocal-throughput         master       patched  improvement
x86_64                       68.0461       48.3098       29.00%
x86_64v2                     55.3256       47.2476       14.60%
x86_64v3                     52.3015       38.9028       25.62%
i686                         340.848       195.707       42.58%
aarch64                      36.8000       30.5234       17.06%
power10                      20.4043       12.6268       38.12%
powerpc                      22.6588       13.8866       38.71%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella 994fec2397 math: Use erff from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic erff.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       85.7363       45.1372       47.35%
x86_64v2                     86.6337       38.5816       55.47%
x86_64v3                     71.3810       34.0843       52.25%
i686                         190.143       97.5014       48.72%
aarch64                      34.9091       14.9320       57.23%
power10                      38.6160        8.5188       77.94%
powerpc                      39.7446       8.45781       78.72%

reciprocal-throughput         master       patched  improvement
x86_64                       35.1739       14.7603       58.04%
x86_64v2                     34.5976       11.2283       67.55%
x86_64v3                     27.3260        9.8550       63.94%
i686                         91.0282       30.8840       66.07%
aarch64                      22.5831        6.9615       69.17%
power10                      18.0386        3.0918       82.86%
powerpc                      20.7277       3.63396       82.47%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella c5d241f06b math: Use cbrtf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic cbrtf.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master        patched       improvement
x86_64                       68.6348        36.8908            46.25%
x86_64v2                     67.3418        36.6968            45.51%
x86_64v3                     63.4981        32.7859            48.37%
aarch64                      29.3172        12.1496            58.56%
power10                      18.0845         8.8893            50.85%
powerpc                      18.0859        8.79527            51.37%

reciprocal-throughput         master        patched       improvement
x86_64                       36.4369        13.3565            63.34%
x86_64v2                     37.3611        13.1149            64.90%
x86_64v3                     31.6024        11.2102            64.53%
aarch64                      18.6866        7.3474             60.68%
power10                       9.4758        3.6329             61.66%
powerpc                      9.58896        3.90439            59.28%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-11-22 10:01:03 -03:00
Andrew Pinski e6590f0c86 aarch64: Remove non-temporal load/stores from oryon-1's memset
The hardware architects have a new recommendation not to use
non-temporal load/stores for memset. This patch removes this path.
I found there was no difference in the memset speed with/without
non-temporal load/stores either.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-21 11:32:23 -03:00
Andrew Pinski eb5eeb4740 aarch64: Remove non-temporal load/stores from oryon-1's memcpy
The hardware architects have a new recommendation not to use
non-temporal load/stores for memcpy. This patch removes this path.
I found there was no difference in the memcpy speed with/without
non-temporal load/stores either.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-21 11:32:17 -03:00
Andrew Pinski e162ab2bf1 AArch64: Remove thunderx{,2} memcpy
ThunderX1 and ThunderX2 have been retired for a few years now.
So let's remove the thunderx{,2} specific versions of memcpy.
The performance gain or them was for medium and large sizes
while the generic (aarch64) memcpy will handle just slightly worse.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-11-20 11:23:53 +00:00
Joe Ramsay 2d82d781a5 AArch64: Remove SVE erf and erfc tables
By using a combination of mask-and-add instead of the shift-based
index calculation the routines can share the same table as other
variants with no performance degradation.

The tables change name because of other changes in downstream AOR.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-11-01 16:10:41 +00:00
Adhemerval Zanella f338c7c5f5 math: Use log10p1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic log10p1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      68.5251        32.2627        52.92%
x86_64v2                    68.8912        32.7887        52.41%
x86_64v3                    59.3427        27.0521        54.41%
i686                        162.026        103.383        36.19%
aarch64                     26.8513        14.5695        45.74%
power10                     12.7426         8.4929        33.35%
powerpc                     16.6768        9.29135        44.29%

reciprocal-throughput        master        patched   improvement
x86_64                      26.0969        12.4023        52.48%
x86_64v2                    25.0045        11.0748        55.71%
x86_64v3                    20.5610        10.2995        49.91%
i686                        89.8842        78.5211        12.64%
aarch64                     17.1200         9.4832        44.61%
power10                      6.7814         6.4258         5.24%
powerpc                      15.769         7.6825        51.28%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:40 -03:00
Adhemerval Zanella 8ae9e51376 math: Use log1pf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic log1pf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      71.8142        38.9668        45.74%
x86_64v2                    71.9094        39.1321        45.58%
x86_64v3                    60.1000        32.4016        46.09%
i686                        147.105        104.258        29.13%
aarch64                     26.4439        14.0050        47.04%
power10                     19.4874         9.4146        51.69%
powerpc                     17.6145        8.00736        54.54%

reciprocal-throughput        master        patched   improvement
x86_64                      19.7604        12.7254        35.60%
x86_64v2                    19.0039        11.9455        37.14%
x86_64v3                    16.8559        11.9317        29.21%
i686                        82.3426        73.9718        10.17%
aarch64                     14.4665         7.9614        44.97%
power10                     11.9974         8.4117        29.89%
powerpc                     7.15222         6.0914        14.83%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00
Adhemerval Zanella c369580814 math: Use log2p1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic log2p1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      70.1462        47.0090        32.98%
x86_64v2                    70.2513        47.6160        32.22%
x86_64v3                    60.4840        39.9443        33.96%
i686                        164.068        122.909        25.09%
aarch64                     25.9169        16.9207        34.71%
power10                     18.1261        9.8592         45.61%
powerpc                     17.2683        9.38665        45.64%

reciprocal-throughput        master        patched   improvement
x86_64                      26.2240        16.4082        37.43%
x86_64v2                    25.0911        15.7480        37.24%
x86_64v3                    20.9371        11.7264        43.99%
i686                        90.4209        95.3073        -5.40%
aarch64                     16.8537        8.9561         46.86%
power10                     12.9401        6.5555         49.34%
powerpc                     9.01763        7.54745        16.30%

The performance decrease for i686 is mostly due the use of x87 fpu,
when building with '-msse2 -mfpmath=sse:

                             master        patched   improvement
latency                     164.068        102.982        37.23%
reciprocal-throughput       89.1968        82.5117         7.49%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00
Adhemerval Zanella bbd578b38d math: Use expm1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic expm1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      96.7402        36.4026        62.37%
x86_64v2                    97.5391        33.4625        65.69%
x86_64v3                    82.1778        30.8668        62.44%
i686                         120.58        94.8302        21.35%
aarch64                     32.3558        12.8881        60.17%
power10                     23.5087        9.8574         58.07%
powerpc                     23.4776        9.06325        61.40%

reciprocal-throughput        master        patched   improvement
x86_64                      27.8224        15.9255        42.76%
x86_64v2                    27.8364        9.6438         65.36%
x86_64v3                    20.3227        9.6146         52.69%
i686                        63.5629        59.4718         6.44%
aarch64                     17.4838        7.1082         59.34%
power10                     12.4644        8.7829         29.54%
powerpc                     14.2152        5.94765        58.16%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:35 -03:00
Adhemerval Zanella 5c22fd25c1 math: Use exp2m1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp2m1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).  The
only change is to handle FLT_MAX_EXP for FE_DOWNWARD or FE_TOWARDZERO.

The benchmark inputs are based on exp2f ones.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      40.6042        48.7104       -19.96%
x86_64v2                    40.7506        35.9032        11.90%
x86_64v3                    35.2301        31.7956        9.75%
i686                        102.094        94.6657        7.28%
aarch64                     18.2704        15.1387        17.14%
power10                     11.9444         8.2402        31.01%

reciprocal-throughput        master        patched   improvement
x86_64                      20.8683        16.1428        22.64%
x86_64v2                    19.5076        10.4474        46.44%
x86_64v3                    19.2106        10.4014        45.86%
i686                        56.4054        59.3004        -5.13%
aarch64                     12.0781         7.3953        38.77%
power10                      6.5306         5.9388         9.06%

The generic implementation calls __ieee754_exp2f and x86_64 provides
an optimized ifunc version (built with -mfma -mavx2, not correctly
rounded).  This explains the performance difference for x86_64.

Same for i686, where the ABI provides an optimized __ieee754_exp2f
version built with '-msse2 -mfpmath=sse'.  When built wth same
flags, the new algorithm shows a better performance:

                            master        patched    improvement
latency                    102.094        91.2823         10.59%
reciprocal-throughput      56.4054        52.7984          6.39%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:35 -03:00
Adhemerval Zanella 5fa89852fa math: Use exp10m1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp10m1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).  I mostly
fixed some small issues in corner cases (sNaN handling, -INFINITY,
a specific overflow check).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      45.4690        49.5845        -9.05%
x86_64v2                    46.1604        36.2665        21.43%
x86_64v3                    37.8442        31.0359        17.99%
i686                        121.367        93.0079        23.37%
aarch64                     21.1126        15.0165        28.87%
power10                     12.7426        8.4929         33.35%

reciprocal-throughput        master        patched   improvement
x86_64                      19.6005        17.4005        11.22%
x86_64v2                    19.6008        11.1977        42.87%
x86_64v3                    17.5427        10.2898        41.34%
i686                        59.4215        60.9675        -2.60%
aarch64                     13.9814        7.9173         43.37%
power10                      6.7814        6.4258          5.24%

The generic implementation calls __ieee754_exp10f which has an
optimized version, although it is not correctly rounded, which is
the main culprit of the the latency difference for x86_64 and
throughp for i686.

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:26 -03:00
Joe Ramsay 1cf29fbc5b AArch64: Small optimisation in AdvSIMD erf and erfc
In both routines, reduce register pressure such that GCC 14 emits no
spills for erf and fewer spills for erfc.  Also use more efficient
comparison for the special-case in erf.

Benchtests show erf improves by 6.4%, erfc by 1.0%.
2024-10-28 15:01:37 +00:00
Paul Zimmermann 392b3f0971 replace tgammaf by the CORE-MATH implementation
The CORE-MATH implementation is correctly rounded (for any rounding mode).
This can be checked by exhaustive tests in a few minutes since there are
less than 2^32 values to check against for example GNU MPFR.
This patch also adds some bench values for tgammaf.

Tested on x86_64 and x86 (cfarm26).

With the initial GNU libc code it gave on an Intel(R) Core(TM) i7-8700:

      "tgammaf": {
       "": {
        "duration": 3.50188e+09,
        "iterations": 2e+07,
        "max": 602.891,
        "min": 65.1415,
        "mean": 175.094
       }
      }

With the new code:

      "tgammaf": {
       "": {
        "duration": 3.30825e+09,
        "iterations": 5e+07,
        "max": 211.592,
        "min": 32.0325,
        "mean": 66.1649
       }
      }

With the initial GNU libc code it gave on cfarm26 (i686):

  "tgammaf": {
   "": {
    "duration": 3.70505e+09,
    "iterations": 6e+06,
    "max": 2420.23,
    "min": 243.154,
    "mean": 617.509
   }
  }

With the new code:

  "tgammaf": {
   "": {
    "duration": 3.24497e+09,
    "iterations": 1.8e+07,
    "max": 1238.15,
    "min": 101.155,
    "mean": 180.276
   }
  }

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>

Changes in v2:
    - include <math.h> (fix the linknamespace failures)
    - restored original benchtests/strcoll-inputs/filelist#en_US.UTF-8 file
    - restored original wrapper code (math/w_tgammaf_compat.c),
      except for the dealing with the sign
    - removed the tgammaf/float entries in all libm-test-ulps files
    - address other comments from Joseph Myers
      (https://sourceware.org/pipermail/libc-alpha/2024-July/158736.html)

Changes in v3:
    - pass NULL argument for signgam from w_tgammaf_compat.c
    - use of math_narrow_eval
    - added more comments

Changes in v4:
    - initialize local_signgam to 0 in math/w_tgamma_template.c
    - replace sysdeps/ieee754/dbl-64/gamma_productf.c by dummy file

Changes in v5:
    - do not mention local_signgam any more in math/w_tgammaf_compat.c
    - initialize local_signgam to 1 instead of 0 in w_tgamma_template.c
      and added comment

Changes in v6:
    - pass NULL as 2nd argument of __ieee754_gammaf_r in
      w_tgammaf_compat.c, and check for NULL in e_gammaf_r.c

Changes in v7:
    - added Signed-off-by line for Alexei Sibidanov (author of the code)

Changes in v8:
    - added Signed-off-by line for Paul Zimmermann (submitted of the patch)

Changes in v9:
    - address comments from review by Adhemerval Zanella
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-10-11 11:12:32 +02:00
Joe Ramsay 16a59571e4 AArch64: Simplify rounding-multiply pattern in several AdvSIMD routines
This operation can be simplified to use simpler multiply-round-convert
sequence, which uses fewer instructions and constants.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:08 +01:00
Joe Ramsay 7900ac490d AArch64: Improve codegen in users of ADVSIMD expm1f helper
Rearrange operations so MOV is not necessary in reduction or around
the special-case handler.  Reduce memory access by using more indexed
MLAs in polynomial.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay 5bc100bd4b AArch64: Improve codegen in users of AdvSIMD log1pf helper
log1pf is quite register-intensive - use fewer registers for the
polynomial, and make various changes to shorten dependency chains in
parent routines.  There is now no spilling with GCC 14.  Accuracy moves
around a little - comments adjusted accordingly but does not require
regen-ulps.

Use the helper in log1pf as well, instead of having separate
implementations.  The more accurate polynomial means special-casing can
be simplified, and the shorter dependency chain avoids the usual dance
around v0, which is otherwise difficult.

There is a small duplication of vectors containing 1.0f (or 0x3f800000) -
GCC is not currently able to efficiently handle values which fit in FMOV
but not MOVI, and are reinterpreted to integer.  There may be potential
for more optimisation if this is fixed.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay a15b1394b5 AArch64: Improve codegen in SVE F32 logs
Reduce MOVPRFXs by using unpredicated (non-destructive) instructions
where possible.  Similar to the recent change to AdvSIMD F32 logs,
adjust special-case arguments and bounds to allow for more optimal
register usage.  For all 3 routines one MOVPRFX remains in the
reduction, which cannot be avoided as immediate AND and ASR are both
destructive.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay 7b8c134b54 AArch64: Improve codegen in SVE expf & related routines
Reduce MOV and MOVPRFX by improving special-case handling.  Use inline
helper to duplicate the entire computation between the special- and
non-special case branches, removing the contention for z0 between x
and the return value.

Also rearrange some MLAs and MLSs - by making the multiplicand the
destination we can avoid a MOVPRFX in several cases.  Also change which
constants go in the vector used for lanewise ops - the last lane is no
longer wasted.

Spotted that shift was incorrect in exp2f and exp10f, w.r.t. to the
comment that explains it.  Fixed - worst-case ULP for exp2f moves
around but it doesn't change significantly for either routine.

Worst-case error for coshf increases due to passing x to exp rather
than abs(x) - updated the comment, but does not require regen-ulps.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-23 15:44:07 +01:00
Joe Ramsay 751a5502be AArch64: Add vector logp1 alias for log1p
This enables vectorisation of C23 logp1, which is an alias for log1p.
There are no new tests or ulp entries because the new symbols are simply
aliases.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-19 17:53:34 +01:00
Wilco Dijkstra 8ecb477ea1 AArch64: Remove memset-reg.h
Remove memset-reg.h by moving register definitions into the memset
implementations.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-09-10 14:18:03 +01:00
Wilco Dijkstra cec3aef324 AArch64: Optimize memset
Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes.  Remove unnecessary code for ZVA sizes
other than 64 and 128.  Performance of random memset benchmark improves by 24%
on Neoverse N1.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-09-09 15:30:00 +01:00
Joe Ramsay 8b09af572b aarch64: Avoid redundant MOVs in AdvSIMD F32 logs
Since the last operation is destructive, the first argument to the FMA
also has to be the first argument to the special-case in order to
avoid unnecessary MOVs. Reorder arguments and adjust special-case
bounds to facilitate this.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2024-09-09 13:03:49 +01:00
Adhemerval Zanella e2f88d8524 aarch64: Regenerate ULPs
From new tests added by 0797283910.
2024-08-07 11:02:03 -03:00