Commit a2900f5

Merge patch series "pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL"
Christian Brauner <brauner@kernel.org> says:

Add three new clone3() flags for pidfd-based process lifecycle
management.

=== CLONE_AUTOREAP ===

CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to the
existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for SIGCHLD
which applies to all children of a given parent.

Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children
while still being able to wait() on others.

CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and causes
exit_notify() to transition the task directly to EXIT_DEAD. Since the
flag lives on the child it survives reparenting: if the original parent
exits and the child is reparented to a subreaper or init the child
still auto-reaps when it eventually exits. This is cleaner than forcing
the subreaper to get SIGCHLD and then reaping it. If the parent doesn't
care the subreaper won't care. If there's a subreaper that would care
it would be easy enough to add a prctl() that either just turns back on
SIGCHLD and turns off auto-reaping or a prctl() that just notifies the
subreaper whenever a child is reparented to it.

CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because
autoreap is a process-level property, and CLONE_PARENT because an
autoreap child reparented via CLONE_PARENT could become an invisible
zombie under a parent that never calls wait().

The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.

=== CLONE_NNP ===

CLONE_NNP sets no_new_privs on the child at clone time. Unlike
prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself, CLONE_NNP
allows the parent to impose no_new_privs on the child at creation
without affecting the parent's own privileges. CLONE_THREAD is rejected
because threads share credentials. CLONE_NNP is useful on its own for
any spawn-and-sandbox pattern but was specifically introduced to enable
unprivileged usage of CLONE_PIDFD_AUTOKILL.

=== CLONE_PIDFD_AUTOKILL ===

This flag ties a child's lifetime to the pidfd returned from clone3().
When the last reference to the struct file created by clone3() is
closed the kernel sends SIGKILL to the child. A pidfd obtained via
pidfd_open() for the same process does not keep the child alive and
does not trigger autokill - only the specific struct file from clone3()
has this property.

This is useful for container runtimes, service managers, and sandboxed
subprocess execution - any scenario where the child must die if the
parent crashes or abandons the pidfd, or where the parent just wants a
throwaway helper process.

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP. It
requires CLONE_PIDFD because the whole point is tying the child's
lifetime to the pidfd. It requires CLONE_AUTOREAP because a killed
child with no one to reap it would become a zombie - the primary use
case is the parent crashing or abandoning the pidfd so no one is around
to call waitpid(). CLONE_THREAD is rejected because autokill targets a
process not a thread.

If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an
unprivileged user may spawn a process that is autokilled. The child
cannot escalate privileges via setuid/setgid exec after being spawned.
If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the caller must
have CAP_SYS_ADMIN in its user namespace.

* patches from https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org:
  selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests
  selftests/pidfd: add CLONE_NNP tests
  selftests/pidfd: add CLONE_AUTOREAP tests
  pidfd: add CLONE_PIDFD_AUTOKILL
  clone: add CLONE_NNP
  clone: add CLONE_AUTOREAP

Link: https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
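As a rough illustration of the unprivileged CLONE_PIDFD_AUTOKILL pattern described in the message, the sketch below fills in clone3() arguments and issues the raw syscall. It is not taken from the series: struct clone_args_sketch, fill_autokill_args() and spawn_autokill_child() are hypothetical names, the struct is a hand-written mirror of the first 64 bytes of struct clone_args, the flag fallbacks are copied from the uapi hunk in this commit, and the syscall number 435 is an assumption for architectures using the generic table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for headers that predate this series (values from the uapi hunk). */
#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000
#endif
#ifndef CLONE_AUTOREAP
#define CLONE_AUTOREAP (1ULL << 34)
#endif
#ifndef CLONE_NNP
#define CLONE_NNP (1ULL << 35)
#endif
#ifndef CLONE_PIDFD_AUTOKILL
#define CLONE_PIDFD_AUTOKILL (1ULL << 36)
#endif
#ifndef SYS_clone3
#define SYS_clone3 435 /* assumption: generic syscall table */
#endif

/* Hypothetical mirror of the original 64-byte layout of struct clone_args. */
struct clone_args_sketch {
	uint64_t flags, pidfd, child_tid, parent_tid;
	uint64_t exit_signal, stack, stack_size, tls;
};
static_assert(sizeof(struct clone_args_sketch) == 64, "unexpected layout");

/* Request an autoreaped, no_new_privs child whose life is tied to the pidfd. */
void fill_autokill_args(struct clone_args_sketch *args, int *pidfd_out)
{
	memset(args, 0, sizeof(*args));
	args->flags = CLONE_PIDFD | CLONE_AUTOREAP | CLONE_NNP |
		      CLONE_PIDFD_AUTOKILL;
	args->pidfd = (uint64_t)(uintptr_t)pidfd_out; /* kernel stores the fd here */
	args->exit_signal = 0; /* CLONE_AUTOREAP rejects a non-zero exit signal */
}

/* Returns the child pid in the parent, 0 in the child, -1 with errno on error. */
long spawn_autokill_child(int *pidfd_out)
{
	struct clone_args_sketch args;

	fill_autokill_args(&args, pidfd_out);
	return syscall(SYS_clone3, &args, sizeof(args));
}
```

On a kernel without this series the unknown flags should make clone3() fail with EINVAL, so callers would need a fallback path. On a kernel with it, closing the last reference to the returned pidfd is what sends SIGKILL to the child.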
2 parents 6de23f8 + ec26879 commit a2900f5

10 files changed

Lines changed: 996 additions & 13 deletions

fs/pidfs.c

Lines changed: 32 additions & 6 deletions
@@ -8,6 +8,8 @@
 #include <linux/mount.h>
 #include <linux/pid.h>
 #include <linux/pidfs.h>
+#include <linux/sched/signal.h>
+#include <linux/signal.h>
 #include <linux/pid_namespace.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
@@ -637,7 +639,28 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	return open_namespace(ns_common);
 }

+static int pidfs_file_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = inode->i_private;
+	struct task_struct *task;
+
+	if (!(file->f_flags & PIDFD_AUTOKILL))
+		return 0;
+
+	guard(rcu)();
+	task = pid_task(pid, PIDTYPE_TGID);
+	if (!task)
+		return 0;
+
+	/* Not available for kthreads or user workers for now. */
+	if (WARN_ON_ONCE(task->flags & (PF_KTHREAD | PF_USER_WORKER)))
+		return 0;
+	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_TGID);
+	return 0;
+}
+
 static const struct file_operations pidfs_file_operations = {
+	.release	= pidfs_file_release,
 	.poll		= pidfd_poll,
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= pidfd_show_fdinfo,
@@ -1093,11 +1116,11 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	int ret;

 	/*
-	 * Ensure that PIDFD_STALE can be passed as a flag without
-	 * overloading other uapi pidfd flags.
+	 * Ensure that internal pidfd flags don't overlap with each
+	 * other or with uapi pidfd flags.
 	 */
-	BUILD_BUG_ON(PIDFD_STALE == PIDFD_THREAD);
-	BUILD_BUG_ON(PIDFD_STALE == PIDFD_NONBLOCK);
+	BUILD_BUG_ON(hweight32(PIDFD_THREAD | PIDFD_NONBLOCK |
+			       PIDFD_STALE | PIDFD_AUTOKILL) != 4);

 	ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
 	if (ret < 0)
@@ -1108,9 +1131,12 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 	flags &= ~PIDFD_STALE;
 	flags |= O_RDWR;
 	pidfd_file = dentry_open(&path, flags, current_cred());
-	/* Raise PIDFD_THREAD explicitly as do_dentry_open() strips it. */
+	/*
+	 * Raise PIDFD_THREAD and PIDFD_AUTOKILL explicitly as
+	 * do_dentry_open() strips O_EXCL and O_TRUNC.
+	 */
 	if (!IS_ERR(pidfd_file))
-		pidfd_file->f_flags |= (flags & PIDFD_THREAD);
+		pidfd_file->f_flags |= (flags & (PIDFD_THREAD | PIDFD_AUTOKILL));

 	return pidfd_file;
 }
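The hweight32() check in pidfs_alloc_file() can be mimicked in userspace to see why a single population count replaces the pairwise comparisons. This is a sketch under assumptions: pidfd_flags_disjoint() is a hypothetical helper, __builtin_popcount() (a GCC/Clang builtin) stands in for the kernel's hweight32(), and the PIDFD_THREAD = O_EXCL and PIDFD_NONBLOCK = O_NONBLOCK mappings come from the existing uapi header rather than from this diff.

```c
#include <assert.h>
#include <fcntl.h>

/* Userspace stand-ins for the four pidfd flags checked by the kernel. */
#define SKETCH_PIDFD_THREAD	O_EXCL		/* existing uapi mapping */
#define SKETCH_PIDFD_NONBLOCK	O_NONBLOCK	/* existing uapi mapping */
#define SKETCH_PIDFD_STALE	0x00001000	/* CLONE_PIDFD */
#define SKETCH_PIDFD_AUTOKILL	O_TRUNC		/* added by this commit */

/*
 * If every mask is a single bit, the OR of four masks has exactly four
 * set bits only when no two of them share a bit. The check relies on
 * that single-bit property; it is not a general overlap test.
 */
int pidfd_flags_disjoint(unsigned a, unsigned b, unsigned c, unsigned d)
{
	return __builtin_popcount(a | b | c | d) == 4;
}

/* Compile-time variant, analogous in spirit to the BUILD_BUG_ON above. */
static_assert(__builtin_popcount(SKETCH_PIDFD_THREAD | SKETCH_PIDFD_NONBLOCK |
				 SKETCH_PIDFD_STALE | SKETCH_PIDFD_AUTOKILL) == 4,
	      "pidfd flag bits overlap");
```

One count over the combined mask scales to further internal flags without adding a new comparison for every pair.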

include/linux/sched/signal.h

Lines changed: 1 addition & 0 deletions
@@ -132,6 +132,7 @@ struct signal_struct {
 	 */
 	unsigned int		is_child_subreaper:1;
 	unsigned int		has_child_subreaper:1;
+	unsigned int		autoreap:1;

 #ifdef CONFIG_POSIX_TIMERS

include/uapi/linux/pidfd.h

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@
 #ifdef __KERNEL__
 #include <linux/sched.h>
 #define PIDFD_STALE CLONE_PIDFD
+#define PIDFD_AUTOKILL O_TRUNC
 #endif

 /* Flags for pidfd_send_signal(). */

include/uapi/linux/sched.h

Lines changed: 5 additions & 2 deletions
@@ -34,8 +34,11 @@
 #define CLONE_IO		0x80000000	/* Clone io context */

 /* Flags for the clone3() syscall. */
-#define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
-#define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_CLEAR_SIGHAND	(1ULL << 32)	/* Clear any signal handler and reset to SIG_DFL. */
+#define CLONE_INTO_CGROUP	(1ULL << 33)	/* Clone into a specific cgroup given the right permissions. */
+#define CLONE_AUTOREAP		(1ULL << 34)	/* Auto-reap child on exit. */
+#define CLONE_NNP		(1ULL << 35)	/* Set no_new_privs on child. */
+#define CLONE_PIDFD_AUTOKILL	(1ULL << 36)	/* Kill child when clone pidfd closes. */

 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
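The switch from hex literals to shifts in this hunk does not change any value, and the three new flags continue the run of bits above bit 31 that the header groups under clone3(). The sketch below checks both points in userspace; the SKETCH_ prefixed names and uses_clone3_only_bits() are hypothetical and only mirror the definitions shown in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirrors of the definitions in the hunk above. */
#define SKETCH_CLONE_CLEAR_SIGHAND	(1ULL << 32)
#define SKETCH_CLONE_INTO_CGROUP	(1ULL << 33)
#define SKETCH_CLONE_AUTOREAP		(1ULL << 34)
#define SKETCH_CLONE_NNP		(1ULL << 35)
#define SKETCH_CLONE_PIDFD_AUTOKILL	(1ULL << 36)

/* The old hex spellings and the new shift spellings are the same numbers. */
static_assert(SKETCH_CLONE_CLEAR_SIGHAND == 0x100000000ULL, "value changed");
static_assert(SKETCH_CLONE_INTO_CGROUP == 0x200000000ULL, "value changed");

/* Returns 1 if any requested flag lies above bit 31 (a 64-bit flags word is needed). */
int uses_clone3_only_bits(uint64_t flags)
{
	return (flags >> 32) != 0;
}
```

A spawn wrapper could use such a test to decide whether a request can be expressed with a 32-bit flags word at all.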

kernel/fork.c

Lines changed: 49 additions & 3 deletions
@@ -2028,6 +2028,41 @@ __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}

+	if (clone_flags & CLONE_AUTOREAP) {
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+		if (clone_flags & CLONE_PARENT)
+			return ERR_PTR(-EINVAL);
+		if (args->exit_signal)
+			return ERR_PTR(-EINVAL);
+	}
+
+	if ((clone_flags & CLONE_PARENT) && current->signal->autoreap)
+		return ERR_PTR(-EINVAL);
+
+	if (clone_flags & CLONE_NNP) {
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+	}
+
+	if (clone_flags & CLONE_PIDFD_AUTOKILL) {
+		if (!(clone_flags & CLONE_PIDFD))
+			return ERR_PTR(-EINVAL);
+		if (!(clone_flags & CLONE_AUTOREAP))
+			return ERR_PTR(-EINVAL);
+		if (clone_flags & CLONE_THREAD)
+			return ERR_PTR(-EINVAL);
+		/*
+		 * Without CLONE_NNP the child could escalate privileges
+		 * after being spawned, so require CAP_SYS_ADMIN.
+		 * With CLONE_NNP the child can't gain new privileges,
+		 * so allow unprivileged usage.
+		 */
+		if (!(clone_flags & CLONE_NNP) &&
+		    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+			return ERR_PTR(-EPERM);
+	}
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens. Collect up signals sent to multiple
@@ -2250,13 +2285,18 @@ __latent_entropy struct task_struct *copy_process(
 	 * if the fd table isn't shared).
 	 */
 	if (clone_flags & CLONE_PIDFD) {
-		int flags = (clone_flags & CLONE_THREAD) ? PIDFD_THREAD : 0;
+		unsigned flags = PIDFD_STALE;
+
+		if (clone_flags & CLONE_THREAD)
+			flags |= PIDFD_THREAD;
+		if (clone_flags & CLONE_PIDFD_AUTOKILL)
+			flags |= PIDFD_AUTOKILL;

 		/*
 		 * Note that no task has been attached to @pid yet indicate
 		 * that via CLONE_PIDFD.
 		 */
-		retval = pidfd_prepare(pid, flags | PIDFD_STALE, &pidfile);
+		retval = pidfd_prepare(pid, flags, &pidfile);
 		if (retval < 0)
 			goto bad_fork_free_pid;
 		pidfd = retval;
@@ -2412,6 +2452,9 @@ __latent_entropy struct task_struct *copy_process(
 	 */
 	copy_seccomp(p);

+	if (clone_flags & CLONE_NNP)
+		task_set_no_new_privs(p);
+
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
 		ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
@@ -2435,6 +2478,8 @@ __latent_entropy struct task_struct *copy_process(
 			 */
 			p->signal->has_child_subreaper = p->real_parent->signal->has_child_subreaper ||
 							 p->real_parent->signal->is_child_subreaper;
+			if (clone_flags & CLONE_AUTOREAP)
+				p->signal->autoreap = 1;
 			list_add_tail(&p->sibling, &p->real_parent->children);
 			list_add_tail_rcu(&p->tasks, &init_task.tasks);
 			attach_pid(p, PIDTYPE_TGID);
@@ -2897,7 +2942,8 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
-	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP))
+	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
+	      CLONE_AUTOREAP | CLONE_NNP | CLONE_PIDFD_AUTOKILL))
 		return false;

 	/*
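The combination rules that copy_process() enforces in the first hunk above can be restated as a small userspace model, which may help when deciding which flag sets are worth attempting. This is only a sketch: check_lifecycle_flags() and the SK_ prefixed constants are hypothetical, the two boolean parameters stand in for current->signal->autoreap and the ns_capable() check, and none of the other validation done by copy_process() is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of the uapi flag values. */
#define SK_CLONE_PIDFD		0x00001000ULL
#define SK_CLONE_PARENT		0x00008000ULL
#define SK_CLONE_THREAD		0x00010000ULL
#define SK_CLONE_AUTOREAP	(1ULL << 34)
#define SK_CLONE_NNP		(1ULL << 35)
#define SK_CLONE_PIDFD_AUTOKILL	(1ULL << 36)

static_assert((SK_CLONE_AUTOREAP & 0xffffffffULL) == 0, "expected a high bit");

/* Returns 0 if the new-flag checks would pass, else -EINVAL or -EPERM. */
int check_lifecycle_flags(uint64_t flags, int exit_signal,
			  bool parent_is_autoreap, bool has_cap_sys_admin)
{
	if (flags & SK_CLONE_AUTOREAP) {
		if (flags & (SK_CLONE_THREAD | SK_CLONE_PARENT))
			return -EINVAL;
		if (exit_signal)
			return -EINVAL;
	}
	/* An autoreap process may not hand children to its own parent. */
	if ((flags & SK_CLONE_PARENT) && parent_is_autoreap)
		return -EINVAL;
	if ((flags & SK_CLONE_NNP) && (flags & SK_CLONE_THREAD))
		return -EINVAL;
	if (flags & SK_CLONE_PIDFD_AUTOKILL) {
		if (!(flags & SK_CLONE_PIDFD) || !(flags & SK_CLONE_AUTOREAP))
			return -EINVAL;
		if (flags & SK_CLONE_THREAD)
			return -EINVAL;
		/* Without CLONE_NNP the caller needs CAP_SYS_ADMIN. */
		if (!(flags & SK_CLONE_NNP) && !has_cap_sys_admin)
			return -EPERM;
	}
	return 0;
}
```

For example, an unprivileged caller asking for CLONE_PIDFD | CLONE_AUTOREAP | CLONE_PIDFD_AUTOKILL gets -EPERM from this model, and adding CLONE_NNP makes the same request pass.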

kernel/ptrace.c

Lines changed: 2 additions & 1 deletion
@@ -549,7 +549,8 @@ static bool __ptrace_detach(struct task_struct *tracer, struct task_struct *p)
 	if (!dead && thread_group_empty(p)) {
 		if (!same_thread_group(p->real_parent, tracer))
 			dead = do_notify_parent(p, p->exit_signal);
-		else if (ignoring_children(tracer->sighand)) {
+		else if (ignoring_children(tracer->sighand) ||
+			 p->signal->autoreap) {
 			__wake_up_parent(p, tracer);
 			dead = true;
 		}

kernel/signal.c

Lines changed: 4 additions & 0 deletions
@@ -2251,6 +2251,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 		if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
 			sig = 0;
 	}
+	if (!tsk->ptrace && tsk->signal->autoreap) {
+		autoreap = true;
+		sig = 0;
+	}
 	/*
 	 * Send with __send_signal as si_pid and si_uid are in the
 	 * parent's namespaces.
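The effect of the new branch in do_notify_parent() can be pictured with a simplified decision model. It is a sketch, not the kernel logic: struct notify_inputs, struct notify_result and model_notify_parent() are hypothetical, and the first condition reflects an assumption about the pre-existing SIGCHLD handling that is only partly visible in the context lines of this hunk.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

struct notify_inputs {
	bool ptraced;			/* models tsk->ptrace != 0 */
	bool child_autoreap;		/* models tsk->signal->autoreap */
	bool parent_ignores_sigchld;	/* parent has SIG_IGN for SIGCHLD */
	bool parent_nocldwait;		/* parent set SA_NOCLDWAIT */
};

struct notify_result {
	bool autoreap;	/* task goes straight to EXIT_DEAD */
	int sig;	/* signal sent to the parent, 0 for none */
};

struct notify_result model_notify_parent(struct notify_inputs in, int sig)
{
	struct notify_result r = { false, sig };

	/* Assumed pre-existing, parent-scoped rule (not part of this diff). */
	if (!in.ptraced && sig == SIGCHLD &&
	    (in.parent_ignores_sigchld || in.parent_nocldwait)) {
		r.autoreap = true;
		if (in.parent_ignores_sigchld)
			r.sig = 0;
	}
	/* New child-scoped rule added by this hunk. */
	if (!in.ptraced && in.child_autoreap) {
		r.autoreap = true;
		r.sig = 0;
	}
	return r;
}
```

In this model a traced child is never auto-reaped, which matches the !tsk->ptrace guard in the hunk.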

tools/testing/selftests/pidfd/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -12,3 +12,4 @@ pidfd_info_test
 pidfd_exec_helper
 pidfd_xattr_test
 pidfd_setattr_test
+pidfd_autoreap_test

tools/testing/selftests/pidfd/Makefile

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ CFLAGS += -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES) -pthread -Wall
 TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
 	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
 	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test \
-	pidfd_xattr_test pidfd_setattr_test
+	pidfd_xattr_test pidfd_setattr_test pidfd_autoreap_test

 TEST_GEN_PROGS_EXTENDED := pidfd_exec_helper