Add the C23 memalignment function (query the alignment of a pointer)
to glibc.
Given how simple this operation is, it would make sense for compilers
to inline calls to this function, but I'm treating that as a compiler
matter (compilers should add it as a built-in function) rather than
adding an inline version to glibc headers (although such an inline
version would be reasonable as well). I've filed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122117 for this feature
in GCC.
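For illustration, a minimal sketch of the operation (glibc's actual
implementation may differ): the alignment of a pointer is the value of
the lowest set bit of its address, and 0 for a null pointer.

  #include <stddef.h>
  #include <stdint.h>

  size_t
  memalignment_sketch (const void *p)
  {
    uintptr_t u = (uintptr_t) p;
    return u & -u;   /* lowest set bit; 0 when p is null */
  }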
Tested for x86_64 and x86.
Also remove some unused entries from the fallback table.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The fma is required only for x == -0x1.da285cp-5 in FE_TONEAREST
to provide correctly rounded results.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The fma is required only for x == +/-0x1.6371e8p-4f in FE_TOWARDZERO
to provide correctly rounded results.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The fma is not required to provide correctly rounded results, but it
helps on !__FP_FAST_FMA ISAs.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
The fma is required only for inputs less than 0x1.0fd288p-127. Also
only add the extra check for !__FP_FAST_FMA targets.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
The fma is not strictly required to provide correctly rounded results,
but it helps on !__FP_FAST_FMA ABIs.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
This commit adds tests for the following use cases relevant to handling of
the SME state:
- fork() and vfork()
- clone() and clone3()
- signal handler
While most cases are trivial, the case of clone3() is more complicated since
the clone3() symbol is not public in Glibc.
To avoid having to check all possible ways clone3() may be called via other
public functions (e.g. vfork() or pthread_create()), we put together a test
that links directly with clone3.o. The existing functions that call
clone3() may not actually use it, in which case the outcome of such tests
would be unpredictable. Having a direct call to the clone3() symbol in the
test allows us to check precisely what we need to test: that the __arm_za_disable()
function is indeed called and has the desired effect.
Linking to clone3.o also requires linking to __arm_za_disable.o, which in
turn requires the _dl_hwcap2 hidden symbol; the test must provide this
symbol and initialise it before use.
Co-authored-by: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This change adds a call to the __arm_za_disable() function immediately
before the SVC instruction inside clone() and clone3() wrappers. It also
adds a macro for inline clone() used in fork() and adds the same call to
the vfork implementation. This sets the ZA state of SME to "off" on return
from these functions (for both the child and the parent).
The __arm_za_disable() function is described in [1] (8.1.3). Note that
the internal Glibc name for this function is __libc_arm_za_disable().
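In the wrappers, the change has roughly this shape (a hedged sketch;
register usage, ordering, and the syscall number are illustrative, not
glibc's exact code):

  // Drop the ZA state right before entering the kernel.
  bl      __libc_arm_za_disable   // ZA state becomes "off"
  mov     x8, #220                // __NR_clone on aarch64
  svc     #0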
When this change was originally proposed [2,3], it generated a long
discussion where several questions and concerns were raised. Here we
will address these concerns and explain why this change is useful and,
in fact, necessary.
In a nutshell, a C library that conforms to the AAPCS64 spec [1] (for this
change, mainly chapters 6.2 and 6.6) should have a call to the
__arm_za_disable() function in its clone() and clone3() wrappers. The following
explains in detail why this is the case.
When we consider using the __arm_za_disable() function inside the clone()
and clone3() libc wrappers, we talk about the C library subroutines clone()
and clone3() rather than the syscalls with similar names. In the current
version of Glibc, clone() is public and clone3() is private, but it being
private is not pertinent to this discussion.
We will begin by stating that this change is NOT a bug fix for something
in the kernel. The requirement to call __arm_za_disable() does NOT come from
the kernel. It also is NOT needed to satisfy a contract between the kernel
and userspace. This is why it is not for the kernel documentation to describe
this requirement. The requirement is instead needed to satisfy a pure
userspace scheme outlined in [1] and to make sure that software using Glibc
(or any other C library that handles SME states correctly (see below))
conforms to [1] without unnecessarily having to become SME-aware and thus
losing portability.
To recap (see [1] (6.2)), the SME extension defines SME state as part of
the processor state. Part of this SME state is the ZA state, which is
needed to manage the ZA storage register in the context of the ZA lazy
saving scheme [1] (6.6). This scheme exists because it would be challenging
to handle the ZA storage of SME in either a callee-saved or a caller-saved
manner.
There are 3 kinds of ZA state, defined in terms of the PSTATE.ZA bit and
the TPIDR2_EL0 register (see [1] (6.6.3)):
- "off":     PSTATE.ZA == 0
- "active":  PSTATE.ZA == 1 && TPIDR2_EL0 == null
- "dormant": PSTATE.ZA == 1 && TPIDR2_EL0 != null
As [1] (6.7.2) outlines, every subroutine has exactly one SME-interface
depending on the permitted ZA-states on entry and on normal return from
a call to this subroutine. Callers of a subroutine must know and respect
the ZA-interface of the subroutines they are using. Using a subroutine
in a way that is not permitted by its ZA-interface is undefined behaviour.
In particular, clone() and clone3() (the C library functions) have a
private-ZA interface. This means that the permitted ZA-states on entry
are "off" and "dormant", and that the permitted states on return are "off"
and "dormant" (the latter if and only if the state was "dormant" on entry).
This means that both functions in question should correctly handle both
"off" and "dormant" ZA-states on entry. The conforming states on return
are "off" and "dormant" (the latter only if the inbound state was already
"dormant").
This change ensures that the ZA-state on return is always "off". Note
that, in the context of clone() and clone3(), "on return" means the point
where execution resumes at a certain address after transferring from clone()
or clone3(). For the caller (we may refer to it as the "parent") this is
the return address in the link register to which the RET instruction jumps.
For the "child", this is the branch target address.
So, the "off" state on return is permitted and conformant. Why can't we
retain the "dormant" state? In theory, we can, but we shouldn't, here is
why.
Every subroutine with a private-ZA interface, including clone() and
clone3(), must comply with the lazy saving scheme [1] (6.7.2). This puts
additional responsibility on a subroutine if the ZA-state on return is
"dormant", because this state has special meaning. The "caller" (that is,
the place in code where execution is transferred to, so this includes both
"parent" and "child") may check the ZA-state and use it as per the spec of
the "dormant" state outlined in [1] (6.6.6 and 6.6.7).
Conforming to this would require more code inside clone() and clone3(),
which is hardly desirable.
For the return to the "parent" this could be achieved in theory, but given
that neither clone() nor clone3() is supposed to be used in the middle of
an SME operation, it wouldn't be useful. For the "return" to the "child" this
would be particularly difficult to achieve given the complexity of these
functions and their interfaces. Most importantly, it would be illegal
and somewhat meaningless to allow a "child" to start execution in the
"dormant" ZA-state because the very essence of the "dormant" state implies
that there is a place to return and that there is some outer context that
we are allowed to interact with.
To sum up, calling __arm_za_disable() to ensure the "off" ZA-state when
execution resumes after a call to clone() or clone3() is correct and also
the simplest way to conform to [1].
Can there be situations when we can avoid calling __arm_za_disable()?
Calling __arm_za_disable() implies a certain (sufficiently small) overhead,
so one might reasonably want to avoid calling this function when we can
afford not to. The most trivial cases (e.g. when the calling thread doesn't
have access to SME or to the TPIDR2_EL0 register) are already handled by
this function (see [1] (8.1.3 and 8.1.2)). Handling other possible cases
would require making the code inside clone() and clone3() more complicated,
which would defeat the point of such an optimisation.
Why can't the kernel do this instead?
The handling of SME state by the kernel is described in [4]. In short,
the kernel must not impose a specific ZA-interface onto a userspace
function. Interaction with the kernel happens (among other things) via
system calls. In Glibc, many of the system calls (notably including
SYS_clone and SYS_clone3) are used via wrappers; the kernel has no control
over these wrappers and cannot dictate how they should behave, because
that is simply outside of the kernel's remit.
However, in certain cases, the kernel may ensure that a "child" doesn't
start in an incorrect state. This is what is done by the recent change
included in the 6.16 kernel [5]. That alone is not enough to ensure that
code using the clone() and clone3() functions conforms to [1] when it runs
on a system that provides SME, hence this change.
[1]: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst
[2]: https://inbox.sourceware.org/libc-alpha/20250522114828.2291047-1-yury.khrustalev@arm.com
[3]: https://inbox.sourceware.org/libc-alpha/20250609121407.3316070-1-yury.khrustalev@arm.com
[4]: https://www.kernel.org/doc/html/v6.16/arch/arm64/sme.html
[5]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cde5c32db55740659fca6d56c09b88800d88fd29
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
A common sequence of instructions is used in several places
in assembly files, so define it in one place as an assembly
macro.
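For instance, with the GNU assembler the mechanism looks like this (macro
name and body are illustrative, not the sequence factored out here):

  .macro  SAVE_FP_LR offset
  stp     x29, x30, [sp, #\offset]
  .endm

  // Expanded at each use site:
  SAVE_FP_LR 16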
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
strerror, strsignal, and their variants should return unique strings for
each known (and, depending on the function, unknown) error/signal. Add
tests to verify this for strerror, strerror_r (GNU and XSI compliant
variants), and strerror_l (for the C locale), strerrordesc_np,
strsignal, sigabbrev_np, and sigdescr_np.
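The core of such a test can be sketched as a pairwise comparison
(illustrative, not the actual test code):

  #include <signal.h>
  #include <stdio.h>
  #include <string.h>

  /* Return 1 if any two known signals share a description.  */
  static int
  has_duplicate_strsignal (void)
  {
    char buf[NSIG][128];
    for (int i = 1; i < NSIG; i++)
      {
        const char *s = strsignal (i);
        snprintf (buf[i], sizeof buf[i], "%s", s == NULL ? "" : s);
      }
    for (int i = 1; i < NSIG; i++)
      for (int j = i + 1; j < NSIG; j++)
        if (strcmp (buf[i], buf[j]) == 0)
          return 1;
    return 0;
  }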
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Use __seg_gs named address space qualifiers in PTR_MANGLE() and
PTR_DEMANGLE() macros to access the pointer_guard field in the TCB.
This change allows the compiler to eliminate redundant reads of
the variable, reducing the number of reads from 105 to 94 and
decreasing the text size of the library by 280 bytes.
While at it, fix a few trivial whitespace issues as well.
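The effect can be sketched as follows (simplified: the real macro also
rotates the mangled value):

  /* The guard becomes an ordinary lvalue in the __seg_gs address
     space, so the compiler may CSE repeated reads of it.  */
  #define POINTER_GUARD \
    (((tcbhead_t __seg_gs *) 0)->pointer_guard)
  #define PTR_MANGLE_SKETCH(var) \
    ((var) = (uintptr_t) (var) ^ POINTER_GUARD)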
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Use __seg_fs named address space qualifiers in PTR_MANGLE() and
PTR_DEMANGLE() macros to access the pointer_guard field in the TCB.
This change allows the compiler to eliminate redundant reads of
the variable, reducing the number of reads from 98 to 89 and
decreasing the text size of the library by 512 bytes.
While at it, fix a few trivial whitespace issues as well.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Update RSEQ access macros to use `(struct rseq_area) {}.member`
in _Static_assert and __typeof expressions, instead of
RSEQ_SELF()->member. This adopts the typeof_member style, avoiding
reliance on RSEQ_SELF for compile-time expressions.
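For example (cpu_id stands in here for an arbitrary member of
struct rseq_area):

  /* A compound literal yields the member's type in a constant
     expression, with no runtime object involved.  */
  _Static_assert (sizeof ((struct rseq_area) {}.cpu_id) == 4,
                  "cpu_id must be 32 bits");
  __typeof ((struct rseq_area) {}.cpu_id) saved_cpu_id;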
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Replace manual cast with a direct
`(struct rseq_area __seg_gs *)__rseq_offset` dereference to access
`member`. This avoids redundant `offsetof(struct rseq_area, member)`
and improves readability while preserving semantics.
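The resulting access pattern, sketched with an illustrative member name:

  /* __rseq_offset is the offset of the rseq area from the thread
     pointer, so the cast plus __seg_gs addresses it directly.  */
  #define RSEQ_GETMEM_SKETCH(member) \
    (((struct rseq_area __seg_gs *) __rseq_offset)->member)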
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Replace manual casts with direct `(__tcbhead_t __seg_gs *)0`
dereferences for `stack_guard` and `pointer_guard`. This makes
the macros more straightforward and removes the dependency on
<stdint.h>.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Replace manual cast with a direct `(__typeof(*descr) __seg_gs *)0`
dereference to access `member`. This avoids redundant
`offsetof(struct pthread, member)` and improves readability while
preserving semantics.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This patch addresses the actual cause of CVE-2025-5745.
The vector non-volatile registers are no longer used for the 32-byte
load and comparison operation.
Additionally, the assembler workaround used earlier for the lxvp
instruction is replaced with the actual instruction.
Signed-off-by: Sachin Monga <smonga@linux.ibm.com>
Co-authored-by: Paul Murphy <paumurph@redhat.com>
This patch addresses the actual cause of CVE-2025-5702.
The vector non-volatile registers are no longer used for the 32-byte
load and comparison operation.
Additionally, the assembler workaround used earlier for the lxvp
instruction is replaced with the actual instruction.
Signed-off-by: Sachin Monga <smonga@linux.ibm.com>
Co-authored-by: Paul Murphy <paumurph@redhat.com>
It is enabled through math-use-builtins-fma.h if glibc is built
for VFPv4 (__ARM_FEATURE_FMA predefined by GCC), or through an IFUNC
(testing HWCAP_ARM_VFPv4) otherwise.
Checked on arm-linux-gnueabihf.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
With the same micro-optimizations done for the double variant:
* Combine the |y| zero check.
* Rework the check to adjust result and call fmod.
* Remove one check after fmod.
* Remove float-int-float roundtrip on return.
Also use math_config.h macros and indent the code. The resulting
strategy differs in enough places that I think it requires a different
copyright.
I see the following performance improvements using remainder benchtests
(using reciprocal-throughput metric):
Architecture | Input           | master   | patch    | Improvement
-------------|-----------------|----------|----------|------------
x86_64       | subnormals      | 20.4176  | 19.6144  | 3.93%
x86_64       | normal          | 54.0939  | 52.2343  | 3.44%
x86_64       | close-exponent  | 23.9120  | 22.3768  | 6.42%
aarch64      | subnormals      | 9.2423   | 8.3825   | 9.30%
aarch64      | normal          | 30.5393  | 29.244   | 4.24%
aarch64      | close-exponent  | 15.5405  | 13.9256  | 10.39%
The aarch64 machine was a Neoverse-N1 (gcc 15.1.1), while the x86_64
machine was an AMD Ryzen 9 5900X (gcc 15.2.1).
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The commit 34b9f8bc17 provides an optimized fmod implementation; use
the same strategy as for remainderf and implement the double variant
on top of fmod.
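The underlying idea, as a hedged sketch (it ignores the ties-to-even
case at |r| == |y|/2, possible overflow of r + r near the top of the
range, and special inputs, all of which the real code must handle):

  #include <math.h>

  static double
  remainder_sketch (double x, double y)
  {
    double ay = fabs (y);
    double r = fmod (fabs (x), ay);   /* r in [0, |y|) */
    if (r + r > ay)
      r -= ay;                        /* fold into [-|y|/2, |y|/2] */
    return signbit (x) ? -r : r;
  }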
I see the following performance improvements using remainder benchtests
(using reciprocal-throughput metric):
Architecture | Input           | master   | patch    | Improvement
-------------|-----------------|----------|----------|------------
x86_64       | subnormals      | 76.1345  | 21.5334  | 71.72%
x86_64       | normal          | 553.2670 | 426.5670 | 22.90%
x86_64       | close-exponent  | 30.5111  | 22.6893  | 25.64%
aarch64      | subnormals      | 26.0734  | 8.4876   | 67.45%
aarch64      | normal          | 205.2590 | 200.082  | 2.52%
aarch64      | close-exponent  | 13.8481  | 13.6663  | 1.31%
The aarch64 machine was a Neoverse-N1 (gcc 15.1.1), while the x86_64
machine was an AMD Ryzen 9 5900X (gcc 15.2.1).
This implementation also fixes the math/test-double-remainder issues
on alpha.
Tested on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
When compiling on x86_64 with -Wshift-overflow=2 you can see the
following warning:
../sysdeps/ieee754/flt-32/math_config.h: In function ‘is_inf’:
../sysdeps/ieee754/flt-32/math_config.h:184:37: warning: result of ‘2139095040 << 1’ requires 33 bits to represent, but ‘int’ only has 32 bits [-Wshift-overflow=]
184 | return (x << 1) == (EXPONENT_MASK << 1);
| ^~
This patch adjusts the definitions to use UINT32_C. This matches the
definitions in sysdeps/ieee754/dbl-64/math_config.h which use UINT64_C
for these definitions.
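The shape of the fix (the value is the flt-32 exponent mask):

  #include <stdint.h>
  /* Was: #define EXPONENT_MASK 0x7f800000
     (EXPONENT_MASK << 1 overflows a 32-bit int).  */
  #define EXPONENT_MASK UINT32_C (0x7f800000)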
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
C23 adds once_flag, ONCE_FLAG_INIT and call_once to stdlib.h (in C11
they were only in threads.h, in C23 they are in both headers; this
change came from N2840). Implement this change, with a
bits/types/once_flag.h header for the common type and initializer
definitions.
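With this change, a strictly conforming C23 program can use call_once
without including threads.h:

  #include <stdlib.h>

  static once_flag flag = ONCE_FLAG_INIT;

  static void
  do_init (void)
  {
    /* One-time initialisation.  */
  }

  void
  use_resource (void)
  {
    call_once (&flag, do_init);
  }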
Note that there's an omnibus bug (bug 33001) that covers more than
just these missing definitions.
This doesn't seem a significant enough feature to be worth mentioning
in NEWS.
ISO C is not concerned with whether functions are in libc or
libpthread, but POSIX links this to what header they are declared in,
so functions declared in stdlib.h are supposed to be in libc.
However, the current edition of POSIX is based on C17; hopefully Hurd
glibc will have completed the merge of libpthread into libc (in
particular, moving call_once) well before a future edition of POSIX
based on C23 (or a later version of ISO C) is released.
Tested for x86_64 and x86.
Add the C23 memset_explicit function to glibc. Everything here is
closely based on the approach taken for explicit_bzero. This includes
the bits that relate to internal uses of explicit_bzero within glibc
(although we don't currently have any such internal uses of
memset_explicit), and also includes the nonnull attribute (when we
move to nonnull_if_nonzero for various functions following C2y, this
function should be included in that change).
The function is declared both for __USE_MISC and for __GLIBC_USE (ISOC23)
(so, by default, not just for compilers defaulting to C23 mode).
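A usage sketch: unlike a plain memset, this call must not be optimised
away even though the buffer is dead afterwards.

  #include <string.h>

  void
  handle_secret (void)
  {
    char key[32];
    /* ... derive and use key ... */
    memset_explicit (key, 0, sizeof key);
  }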
Tested for x86_64 and x86.
Instead of returning a negative value, the fixed FUSE copy_file_range
silently truncates the size to UINT_MAX & PAGE_MASK [1]. Allow that value
to be returned as well.
[1] 1e08938c36
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Vector variants of the new C23 log10p1 routines.
Note: Benchmark inputs for log10p1(f) are identical to log1p(f).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Vector variants of the new C23 log2p1 routines.
Note: Benchmark inputs for log2p1(f) are identical to log1p(f).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
fpu_control.h is an installed header, so a wider range of compiler
versions (including ones older than GCC 9) is relevant for it than for
building glibc.
Fixes commit 3014dec3ad
('x86: Remove obsolete "*&" GCC asm memory operand workaround')
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
In the !defined __PIC__ case, we cannot guarantee that the delay slot
is properly filled at the final `j` instruction without reordering
active.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Testing strcmp on MIPS hardware shows that strcmp.S performs worse
than the combination of using the generic strcmp.c implementation
alongside -funroll-loops.
Suggested-by: Joseph Myers <josmyers@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
It now calls __libc_assert, which contains similar logic. The assert
call only requires memory allocation for the message translation, so
test-assert2.c is adapted to handle it.
It also removes the fxprintf from assert/assert_perror, although this
is not 100% backwards compatible (the message is written only if there
is a file descriptor associated with stderr). It now writes bytes
directly without going through the wide stream state.
Checked on aarch64-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Legacy encodings of SSE instructions incur AVX-SSE domain transition
penalties on some Intel microarchitectures (e.g. Haswell, Broadwell).
Using the VEX forms avoids these penalties and keeps all instructions
in the VEX decode domain. Use the "%v" sequence to emit the "v" prefix
for opcodes when compiling with -mavx.
No functional changes intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
GCC now accepts plain variable names as valid lvalues for "m"
constraints, automatically spilling locals to memory if necessary.
The long-standing "*&" pattern was originally used as a defensive
workaround for older compiler versions that rejected operands
such as:
asm ("incl %0" : "+m"(x));
with errors like "memory input is not directly addressable".
Modern compilers (GCC >= 9) reliably generate correct code
without the workaround, and the resulting assembly is identical.
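Side by side:

  int x = 0;
  asm ("incl %0" : "+m" (*&x));   /* old defensive workaround */
  asm ("incl %0" : "+m" (x));     /* equivalent on GCC >= 9 */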
No functional changes intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Check for VM limit RPCs
* config.h.in: Add #undef for HAVE_MACH_VM_GET_SIZE_LIMIT and
HAVE_MACH_VM_SET_SIZE_LIMIT.
* sysdeps/mach/configure.ac: Use mach_RPC_CHECK to check for the
vm_set_size_limit and vm_get_size_limit RPCs in gnumach.defs.
* sysdeps/mach/configure: Regenerate.
Use vm_get_size_limit to initialize RLIMIT_AS
* hurd/hurdrlimit.c (init_rlimit): Use vm_get_size_limit to initialize
the RLIMIT_AS entry of the _hurd_rlimits array.
Notify the kernel of the new VM size limits
* sysdeps/mach/hurd/setrlimit.c: Use the vm_set_size_limit RPC,
if available, to notify the kernel of the new limits. Retry RPC
calls if they were interrupted by a signal.
Message-ID: <03fb90a795b354a366ee73f56f73e6ad22a86cda.1755220108.git.dnietoc@gmail.com>
On stack overflow, we typically may not actually have room on the stack
to trampoline back from the signal handler. We have to detect this before
locking the ss; otherwise the signal thread will be stuck taking the
ss lock while trying to post SIGSEGV.
On i686, after GCC 16 commit:
commit 07d8de9174c421d719649639a1452b8b9f2eee32
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Jul 2 08:58:23 2025 +0800
x86-64: Add --enable-x86-64-mfentry
which warns about ‘-pg’ without ‘-mfentry’, GCC 16 fails to compile .op
files and gmon tests with the following error when glibc is configured
with --disable-default-pie:
cc1: error: ‘-pg’ without ‘-mfentry’ may be unreliable with shrink wrapping [-Werror]
Compile .op files and gmon tests with -mfentry if it is supported by
CC/TEST_CC and glibc is configured with --disable-default-pie. This
fixes BZ #33376.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Joseph Myers <josmyers@redhat.com>
Use the __seg_gs named address space qualifier to cast access to the
gscope_flag in the TCB as a %gs: prefixed address. This enables the
use of the "m" operand constraint, which informs the compiler about
memory access in the inline assembly.
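A hedged sketch of the pattern (the tcbhead_t layout follows glibc's x86
TLS headers; treat the names as assumptions):

  /* The flag is now a plain memory operand; no hand-written %gs:
     prefix is needed in the asm template.  */
  #define GSCOPE_FLAG (((tcbhead_t __seg_gs *) 0)->gscope_flag)

  static inline int
  gscope_reset_sketch (void)
  {
    int result = 0;
    asm volatile ("xchgl %0, %1"
                  : "+r" (result), "+m" (GSCOPE_FLAG));
    return result;
  }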
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Carlos O'Donell <carlos@redhat.com>