:::{note} AI Translation Notice

This document was automatically translated by hunyuan-turbos-latest model, for reference only.

  • Source document: kernel/syscall/sys_capget_capset.md

  • Translation time: 2025-09-25 09:18:48

  • Translation model: hunyuan-turbos-latest

Please report issues via Community Channel

:::

Design Documentation for sys_capget / sys_capset

This document briefly describes the design and key implementation points of sys_capget and sys_capset in DragonOS, covering version negotiation, user-space data structures, capability bitset rules, and call flows.

Source Code:

  • kernel/src/process/syscall/sys_cap_get_set.rs
  • kernel/src/process/cred.rs

Overview

  • DragonOS aligns with Linux's capability interface: user space can read or set a process's capability sets via capget/capset.
  • Capability sets include:
    • cap_effective (pE): The capabilities currently in effect for the process
    • cap_permitted (pP): The upper limit of capabilities granted to the process
    • cap_inheritable (pI): Capabilities that can be inherited by child processes
    • cap_bset: Bounding set, limiting the upper bound of obtainable capabilities (used only for rule constraints, not directly read/written in this interface)
    • cap_ambient: Ambient set (not modified by capset)
  • Capability bit width: DragonOS uses 64-bit storage but currently only supports the lower 41 bits (CAP_FULL_SET = (1<<41)-1), with higher bits truncated.

User-Space Data Structures and Versions

Aligned with Linux's user-space structures:

#include <stdint.h>

// header: cap_user_header_t
struct CapUserHeader {
    uint32_t version; // version number
    int32_t  pid;     // target process: 0 = current process, otherwise the given pid
};

// data: element of the cap_user_data_t array
struct CapUserData {
    uint32_t effective;
    uint32_t permitted;
    uint32_t inheritable;
};

  • Version constants:
    • _LINUX_CAPABILITY_VERSION_1 = 0x19980330
    • _LINUX_CAPABILITY_VERSION_2 = 0x20071026 (deprecated)
    • _LINUX_CAPABILITY_VERSION_3 = 0x20080522
  • Kernel-supported version in DragonOS: _KERNEL_CAPABILITY_VERSION = v3
  • Number of u32 groups copied per version:
    • v1: 1 group (lower 32 bits only)
    • v2/v3: 2 groups (lower 32 bits + upper 32 bits)
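
The version-to-group mapping above can be sketched as a small lookup. This is a minimal illustration, not the kernel's code: the constant values are the ones listed above, while the helper name `groups_for_version` is hypothetical.

```rust
/// Version magic numbers, as listed in the bullets above.
pub const LINUX_CAPABILITY_VERSION_1: u32 = 0x1998_0330;
pub const LINUX_CAPABILITY_VERSION_2: u32 = 0x2007_1026; // deprecated
pub const LINUX_CAPABILITY_VERSION_3: u32 = 0x2008_0522;

/// Hypothetical helper: how many CapUserData elements are copied
/// for a given header version. Unknown versions yield `None`.
pub fn groups_for_version(version: u32) -> Option<usize> {
    match version {
        // v1 carries the lower 32 bits only.
        LINUX_CAPABILITY_VERSION_1 => Some(1),
        // v2 and v3 carry the lower and the upper 32 bits.
        LINUX_CAPABILITY_VERSION_2 | LINUX_CAPABILITY_VERSION_3 => Some(2),
        _ => None,
    }
}
```

Returning `Option` keeps the "unknown version" case explicit, so the caller decides whether to probe (capget) or fail (capset).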

Aggregation/Splitting Rules:

  • capset: Aggregates CapUserData[0..tocopy) from user input into one u64 per set, then masks off every bit above the lower 41.
  • capget: Splits each u64 set into u32 groups and writes back the number of groups matching the requested version (v1: 1 group; v2/v3: 2 groups). When data==NULL, nothing is written and the call returns 0.
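
As a rough sketch of these two directions, assuming a Rust mirror of `CapUserData` and hypothetical helper names `aggregate` and `split` (neither is taken from the DragonOS source):

```rust
/// Rust mirror of the user-space data element (layout assumed from the C struct above).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct CapUserData {
    pub effective: u32,
    pub permitted: u32,
    pub inheritable: u32,
}

/// Lower 41 bits set, matching CAP_FULL_SET = (1<<41)-1.
pub const CAP_FULL_SET: u64 = (1u64 << 41) - 1;

/// capset direction: fold up to two u32 groups into (effective, permitted, inheritable).
pub fn aggregate(data: &[CapUserData]) -> (u64, u64, u64) {
    let (mut e, mut p, mut i) = (0u64, 0u64, 0u64);
    for (idx, d) in data.iter().take(2).enumerate() {
        let shift = 32 * (idx as u32); // group 0 -> bits 0..31, group 1 -> bits 32..63
        e |= (d.effective as u64) << shift;
        p |= (d.permitted as u64) << shift;
        i |= (d.inheritable as u64) << shift;
    }
    // Bits above the supported range are dropped.
    (e & CAP_FULL_SET, p & CAP_FULL_SET, i & CAP_FULL_SET)
}

/// capget direction: split the three sets into `tocopy` u32 groups (at most 2).
pub fn split(e: u64, p: u64, i: u64, tocopy: usize) -> Vec<CapUserData> {
    (0..tocopy.min(2))
        .map(|idx| {
            let shift = 32 * (idx as u32);
            CapUserData {
                effective: (e >> shift) as u32,
                permitted: (p >> shift) as u32,
                inheritable: (i >> shift) as u32,
            }
        })
        .collect()
}
```

With a v1 request, `split(..., 1)` returns only the low group, so capabilities at bit 32 and above are simply not visible to the caller.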

Version Negotiation and Probe Behavior

  • capget:
    • If version is unknown: Writes back header.version as the kernel-supported version (v3) and returns:
      • If data==NULL: Returns 0 (for probing)
      • If data!=NULL: Returns EINVAL
    • If version is valid: Writes back the number of u32 groups corresponding to the requested version (v1: 1 group; v2/v3: 2 groups) and returns 0; when data==NULL, nothing is copied and the call still returns 0.
  • capset:
    • If version is unknown: Returns EINVAL immediately. capset does not act as a probe, which is more consistent with Linux.
    • data must not be NULL (a NULL data pointer returns EFAULT).
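
The negotiation rules above might be modeled as follows. This is a sketch under stated assumptions: the enum `CapgetStep` and the function names are illustrative, the errno numbers are the usual Linux values, and user-memory access is reduced to a `&mut u32` and a boolean.

```rust
// Errno values as defined on Linux.
pub const EFAULT: i32 = 14;
pub const EINVAL: i32 = 22;

pub const VERSION_1: u32 = 0x1998_0330;
pub const VERSION_2: u32 = 0x2007_1026;
pub const VERSION_3: u32 = 0x2008_0522;

/// Groups to copy for a known version; `None` if the version is unknown.
fn groups(version: u32) -> Option<usize> {
    match version {
        VERSION_1 => Some(1),
        VERSION_2 | VERSION_3 => Some(2),
        _ => None,
    }
}

/// Outcome of the capget version step (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub enum CapgetStep {
    /// Known version: continue with this many groups.
    Proceed(usize),
    /// Unknown version: the syscall returns right away with this result.
    Return(Result<(), i32>),
}

/// capget: unknown versions are answered by writing v3 back into the header.
pub fn capget_negotiate(header_version: &mut u32, data_is_null: bool) -> CapgetStep {
    match groups(*header_version) {
        Some(n) => CapgetStep::Proceed(n),
        None => {
            *header_version = VERSION_3; // tell user space what the kernel supports
            if data_is_null {
                CapgetStep::Return(Ok(())) // pure probe
            } else {
                CapgetStep::Return(Err(EINVAL))
            }
        }
    }
}

/// capset: no probing; the version is checked first, then the data pointer.
pub fn capset_negotiate(header_version: u32, data_is_null: bool) -> Result<usize, i32> {
    let n = groups(header_version).ok_or(EINVAL)?;
    if data_is_null {
        return Err(EFAULT);
    }
    Ok(n)
}
```

A caller that does not know which version the kernel speaks can therefore issue capget with an arbitrary version and data==NULL, then read the version back from the header.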

Target Process Selection and pid Semantics

  • capget:
    • pid < 0: EINVAL
    • pid == 0: Uses the current process
    • pid != 0: Looks up the target process (returns ESRCH if not found)
  • capset:
    • pid < 0: EPERM (negative pid targets not allowed)
    • pid == 0 or pid == current process pid: Allowed
    • pid != current process pid: EPERM (only self-modification allowed)
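
The pid rules above can be summarized in two small functions. This is a hedged sketch: `Target`, `capget_target`, `capset_target`, and the `task_exists` callback are hypothetical stand-ins for the kernel's real process lookup, and the errno numbers are the usual Linux values.

```rust
// Errno values as defined on Linux.
pub const EPERM: i32 = 1;
pub const ESRCH: i32 = 3;
pub const EINVAL: i32 = 22;

/// Which process's credentials capget should read (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub enum Target {
    Current,
    Other(i32),
}

/// capget: any existing process may be inspected.
/// `task_exists` stands in for the kernel's process lookup.
pub fn capget_target(pid: i32, task_exists: impl Fn(i32) -> bool) -> Result<Target, i32> {
    if pid < 0 {
        return Err(EINVAL);
    }
    if pid == 0 {
        return Ok(Target::Current);
    }
    if task_exists(pid) {
        Ok(Target::Other(pid))
    } else {
        Err(ESRCH)
    }
}

/// capset: only the calling process may be modified.
pub fn capset_target(pid: i32, current_pid: i32) -> Result<(), i32> {
    if pid < 0 {
        return Err(EPERM); // negative pid targets are not allowed
    }
    if pid == 0 || pid == current_pid {
        Ok(())
    } else {
        Err(EPERM) // cross-process modification is rejected
    }
}
```

Note the asymmetry in error codes: a negative pid is EINVAL for capget but EPERM for capset.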

Capability Set Rules (capset)

Let:

  • pE_old = old effective
  • pP_old = old permitted
  • pI_old = old inheritable
  • bset = bounding set
  • pE_new, pP_new, pI_new derived from user data (already truncated to 41-bit mask)

Constraints:

  1. pE_new ⊆ pP_new
    If any bit in pE_new is not in pP_new: EPERM

  2. pP_new ⊆ pP_old (not allowed to elevate permitted)
    If pP_new contains any bits not in pP_old: EPERM

  3. pI_new limitation (aligned with Linux's CAP_SETPCAP and bset constraints)

    • If the current process has CAP_SETPCAP_BIT (in the pE_old effective set): pI_new ⊆ (pI_old ∪ pP_old) ∩ bset
      If exceeded: EPERM
    • If not: pI_new ⊆ (pI_old ∪ pP_old) and pI_new ⊆ (pI_old ∪ bset)
      If either bound is exceeded: EPERM
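
The three constraints can be expressed as bitmask subset tests. The sketch below is an illustration only: it reads the pI_old/pP_old and pI_old/bset pairs in rule 3 as set unions, takes CAP_SETPCAP as capability number 8 (its Linux value), and uses hypothetical names (`CapSets`, `check_capset`) rather than the types in cred.rs.

```rust
/// Errno value as defined on Linux.
pub const EPERM: i32 = 1;
/// CAP_SETPCAP is capability number 8 on Linux.
pub const CAP_SETPCAP: u64 = 1 << 8;

/// Old capability state of the calling process (illustrative type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapSets {
    pub effective: u64,
    pub permitted: u64,
    pub inheritable: u64,
    pub bset: u64,
}

/// True when every bit of `a` is also set in `b` (a ⊆ b).
fn is_subset(a: u64, b: u64) -> bool {
    (a & !b) == 0
}

/// Checks the three capset rules; the new sets are assumed to be masked already.
pub fn check_capset(old: &CapSets, e_new: u64, p_new: u64, i_new: u64) -> Result<(), i32> {
    // Rule 1: effective must stay within the new permitted set.
    if !is_subset(e_new, p_new) {
        return Err(EPERM);
    }
    // Rule 2: permitted may only shrink.
    if !is_subset(p_new, old.permitted) {
        return Err(EPERM);
    }
    // Rule 3: inheritable, depending on CAP_SETPCAP in the old effective set.
    let base = old.inheritable | old.permitted;
    let ok = if (old.effective & CAP_SETPCAP) != 0 {
        is_subset(i_new, base & old.bset)
    } else {
        is_subset(i_new, base) && is_subset(i_new, old.inheritable | old.bset)
    };
    if ok {
        Ok(())
    } else {
        Err(EPERM)
    }
}
```

Because each rule is a pure function of the old and new masks, the check can run before any credential is cloned, so a rejected request leaves the process state untouched.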

Note:

  • Ambient capabilities are not modified by capset and remain unchanged.
  • capset applies the update by cloning the old cred, updating pE/pP/pI on the clone, and then atomically replacing the cred in the PCB (pcb.set_cred).

Flowchart

Main flow of capget:

[Read header(version, pid)]
        |
  [Version valid?]
      /     \
    No       Yes
    |         |
[Write back header.version=v3]   [pid selection]
        |                        |-- pid<0 -> EINVAL
   [data==NULL?]                 |-- pid==0 -> current process cred
      /     \                    |-- pid!=0 -> look up target task
    Yes      No                  |              |- not found -> ESRCH
    |         |                  |              |- found -> target cred
 return 0   EINVAL               |
                                 [Split e/p/i into low/high 32 bits]
                                 [data==NULL?]
                                   /       \
                                 Yes        No
                                  |          |
                              return 0     write back to user buffer, return 0

Main flow of capset:

[Read header(version, pid)]
        |
  [Version valid?]
      /     \
    No       Yes
    |         |
  EINVAL   [data==NULL?]
              /      \
            Yes       No
             |         |
           EFAULT    [pid check]
                      |- pid<0 -> EPERM
                      |- pid!=self -> EPERM
                      |- pid==0 or pid==self -> [Read user data and aggregate pE/pP/pI]
                                      [Rule 1: pE_new ⊆ pP_new?]  No -> EPERM
                                      [Rule 2: pP_new ⊆ pP_old?]  No -> EPERM
                                      [Rule 3: pI_new within CAP_SETPCAP/bset limits?]  No -> EPERM
                                      [Clone cred, update pE/pP/pI]
                                      [pcb.set_cred atomic replacement]
                                      return 0

Capability Bit Width and Masks

Apply masks to e/p/i during aggregation:

  • mask = CAPFlags::CAP_FULL_SET.bits() = (1<<41)-1
  • Higher bits are truncated to ensure cross-version compatibility and consistency with the current implementation.
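
A minimal sketch of the masking step, using a plain `u64` constant in place of `CAPFlags::CAP_FULL_SET.bits()`; the helper name `mask_caps` is hypothetical.

```rust
/// Lower 41 bits set: capability numbers 0 through 40.
pub const CAP_FULL_SET: u64 = (1u64 << 41) - 1;

/// Drops every bit above the supported range.
pub fn mask_caps(raw: u64) -> u64 {
    raw & CAP_FULL_SET
}
```

For instance, `mask_caps(u64::MAX)` keeps exactly 41 set bits, and a value with only bit 41 set masks to 0.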

Design Trade-offs and Alignment

  • capget supports "probe" semantics for unknown versions: writes back the supported version and returns 0 when data==NULL.
  • capset does not take on probing: unknown versions directly return EINVAL, more closely aligned with Linux behavior.
  • pid constraints are stricter: capset only allows modification of the current process to avoid cross-process permission modifications.
  • Rules follow the Linux capability model: permitted cannot be elevated; effective must stay within permitted; inheritable is constrained by CAP_SETPCAP and bset.

Future Work

  • Add more interfaces for ambient capabilities and the bounding set (currently capset does not modify ambient).
  • Introduce more complete capability bit definitions and permission check interfaces.
  • Extend documentation and test cases to cover more boundary conditions (such as the impact of user namespaces).