diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index dcb0c4473859..67f2e632dd4e 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -983,23 +983,6 @@ All cgroup core files are prefixed with "cgroup." killing cgroups is a process directed operation, i.e. it affects the whole thread-group. - cgroup.pressure - A read-write single value file that allowed values are "0" and "1". - The default is "1". - - Writing "0" to the file will disable the cgroup PSI accounting. - Writing "1" to the file will re-enable the cgroup PSI accounting. - - This control attribute is not hierarchical, so disable or enable PSI - accounting in a cgroup does not affect PSI accounting in descendants - and doesn't need pass enablement via ancestors from root. - - The reason this control attribute exists is that PSI accounts stalls for - each cgroup separately and aggregates it at each level of the hierarchy. - This may cause non-negligible overhead for some workloads when under - deep level of the hierarchy, in which case this control attribute can - be used to disable PSI accounting in the non-leaf cgroups. - irq.pressure A read-write nested-keyed file. diff --git a/Documentation/power/freezing-of-tasks.rst b/Documentation/power/freezing-of-tasks.rst index df9755bfbd94..53b6a56c4635 100644 --- a/Documentation/power/freezing-of-tasks.rst +++ b/Documentation/power/freezing-of-tasks.rst @@ -14,28 +14,27 @@ architectures). II. How does it work? ===================== -There is one per-task flag (PF_NOFREEZE) and three per-task states -(TASK_FROZEN, TASK_FREEZABLE and __TASK_FREEZABLE_UNSAFE) used for that. -The tasks that have PF_NOFREEZE unset (all user space tasks and some kernel -threads) are regarded as 'freezable' and treated in a special way before the -system enters a sleep state as well as before a hibernation image is created -(hibernation is directly covered by what follows, but the description applies -to system-wide suspend too). +There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN +and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have +PF_NOFREEZE unset (all user space processes and some kernel threads) are +regarded as 'freezable' and treated in a special way before the system enters a +suspend state as well as before a hibernation image is created (in what follows +we only consider hibernation, but the description also applies to suspend). Namely, as the first step of the hibernation procedure the function freeze_processes() (defined in kernel/power/process.c) is called. A system-wide -static key freezer_active (as opposed to a per-task flag or state) is used to -indicate whether the system is to undergo a freezing operation. And -freeze_processes() sets this static key. After this, it executes -try_to_freeze_tasks() that sends a fake signal to all user space processes, and -wakes up all the kernel threads. All freezable tasks must react to that by -calling try_to_freeze(), which results in a call to __refrigerator() (defined -in kernel/freezer.c), which changes the task's state to TASK_FROZEN, and makes -it loop until it is woken by an explicit TASK_FROZEN wakeup. Then, that task -is regarded as 'frozen' and so the set of functions handling this mechanism is -referred to as 'the freezer' (these functions are defined in -kernel/power/process.c, kernel/freezer.c & include/linux/freezer.h). User space -tasks are generally frozen before kernel threads. 
+variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate +whether the system is to undergo a freezing operation. And freeze_processes() +sets this variable. After this, it executes try_to_freeze_tasks() that sends a +fake signal to all user space processes, and wakes up all the kernel threads. +All freezable tasks must react to that by calling try_to_freeze(), which +results in a call to __refrigerator() (defined in kernel/freezer.c), which sets +the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes +it loop until PF_FROZEN is cleared for it. Then, we say that the task is +'frozen' and therefore the set of functions handling this mechanism is referred +to as 'the freezer' (these functions are defined in kernel/power/process.c, +kernel/freezer.c & include/linux/freezer.h). User space processes are generally +frozen before kernel threads. __refrigerator() must not be called directly. Instead, use the try_to_freeze() function (defined in include/linux/freezer.h), that checks @@ -44,40 +43,31 @@ if the task is to be frozen and makes the task enter __refrigerator(). For user space processes try_to_freeze() is called automatically from the signal-handling code, but the freezable kernel threads need to call it explicitly in suitable places or use the wait_event_freezable() or -wait_event_freezable_timeout() macros (defined in include/linux/wait.h) -that put the task to sleep (TASK_INTERRUPTIBLE) or freeze it (TASK_FROZEN) if -freezer_active is set. The main loop of a freezable kernel thread may look +wait_event_freezable_timeout() macros (defined in include/linux/freezer.h) +that combine interruptible sleep with checking if the task is to be frozen and +calling try_to_freeze(). The main loop of a freezable kernel thread may look like the following one:: set_freezable(); + do { + hub_events(); + wait_event_freezable(khubd_wait, + !list_empty(&hub_event_list) || + kthread_should_stop()); + } while (!kthread_should_stop() || !list_empty(&hub_event_list)); - while (true) { - struct task_struct *tsk = NULL; +(from drivers/usb/core/hub.c::hub_thread()). - wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL); - spin_lock_irq(&oom_reaper_lock); - if (oom_reaper_list != NULL) { - tsk = oom_reaper_list; - oom_reaper_list = tsk->oom_reaper_list; - } - spin_unlock_irq(&oom_reaper_lock); - - if (tsk) - oom_reap_task(tsk); - } - -(from mm/oom_kill.c::oom_reaper()). - -If a freezable kernel thread is not put to the frozen state after the freezer -has initiated a freezing operation, the freezing of tasks will fail and the -entire system-wide transition will be cancelled. For this reason, freezable -kernel threads must call try_to_freeze() somewhere or use one of the +If a freezable kernel thread fails to call try_to_freeze() after the freezer has +initiated a freezing operation, the freezing of tasks will fail and the entire +hibernation operation will be cancelled. For this reason, freezable kernel +threads must call try_to_freeze() somewhere or use one of the wait_event_freezable() and wait_event_freezable_timeout() macros. After the system memory state has been restored from a hibernation image and devices have been reinitialized, the function thaw_processes() is called in -order to wake up each frozen task. Then, the tasks that have been frozen leave -__refrigerator() and continue running. +order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that +have been frozen leave __refrigerator() and continue running. 
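For the explicit-call variant mentioned above (a freezable kernel thread that
calls try_to_freeze() itself in a suitable place rather than relying on
wait_event_freezable() alone), a minimal sketch could look like the following;
my_event_wq, my_event_pending and my_thread_fn are illustrative names only and
are not part of this patch::

	#include <linux/freezer.h>
	#include <linux/kthread.h>
	#include <linux/wait.h>

	static DECLARE_WAIT_QUEUE_HEAD(my_event_wq);
	static bool my_event_pending;

	static int my_thread_fn(void *data)
	{
		/* Clear PF_NOFREEZE so the freezer starts tracking this thread. */
		set_freezable();

		while (!kthread_should_stop()) {
			/*
			 * wait_event_freezable() sleeps interruptibly and calls
			 * try_to_freeze() for us if a freezing operation starts.
			 */
			wait_event_freezable(my_event_wq,
					     my_event_pending ||
					     kthread_should_stop());

			/* A long-running work loop should also freeze explicitly. */
			try_to_freeze();

			if (my_event_pending) {
				my_event_pending = false;
				/* ... handle the event ... */
			}
		}
		return 0;
	}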
Rationale behind the functions dealing with freezing and thawing of tasks @@ -106,8 +96,7 @@ III. Which kernel threads are freezable? Kernel threads are not freezable by default. However, a kernel thread may clear PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE directly is not allowed). From this point it is regarded as freezable -and must call try_to_freeze() or variants of wait_event_freezable() in a -suitable place. +and must call try_to_freeze() in a suitable place. IV. Why do we do that? ====================== diff --git a/drivers/android/binder.c b/drivers/android/binder.c index 4ef4e2dc47cb..ccd0eee30b77 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -3714,9 +3714,10 @@ static int binder_wait_for_work(struct binder_thread *thread, struct binder_proc *proc = thread->proc; int ret = 0; + freezer_do_not_count(); binder_inner_proc_lock(proc); for (;;) { - prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE); + prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE); if (binder_has_work_ilocked(thread, do_proc_work)) break; if (do_proc_work) @@ -3733,6 +3734,7 @@ static int binder_wait_for_work(struct binder_thread *thread, } finish_wait(&thread->wait, &wait); binder_inner_proc_unlock(proc); + freezer_count(); return ret; } diff --git a/drivers/media/pci/pt3/pt3.c b/drivers/media/pci/pt3/pt3.c index cc3cfbc0e33c..c0bc86793355 100644 --- a/drivers/media/pci/pt3/pt3.c +++ b/drivers/media/pci/pt3/pt3.c @@ -445,8 +445,8 @@ static int pt3_fetch_thread(void *data) pt3_proc_dma(adap); delay = ktime_set(0, PT3_FETCH_DELAY * NSEC_PER_MSEC); - set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE); - schedule_hrtimeout_range(&delay, + set_current_state(TASK_UNINTERRUPTIBLE); + freezable_schedule_hrtimeout_range(&delay, PT3_FETCH_DELAY_DELTA * NSEC_PER_MSEC, HRTIMER_MODE_REL); } diff --git a/fs/coredump.c b/fs/coredump.c index 5742b3e457a2..2516de8f4361 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -403,8 +403,9 @@ static int coredump_wait(int exit_code, struct core_state *core_state) if (core_waiters > 0) { struct core_thread *ptr; - wait_for_completion_state(&core_state->startup, - TASK_UNINTERRUPTIBLE|TASK_FREEZABLE); + freezer_do_not_count(); + wait_for_completion(&core_state->startup); + freezer_count(); /* * Wait for all the threads to become inactive, so that * all the thread context (extended register state, like diff --git a/fs/nfs/file.c b/fs/nfs/file.c index b01e1ea304bf..5dcdca1dc075 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -578,8 +578,7 @@ static vm_fault_t nfs_vm_page_mkwrite(struct vm_fault *vmf) } wait_on_bit_action(&NFS_I(inode)->flags, NFS_INO_INVALIDATING, - nfs_wait_bit_killable, - TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); + nfs_wait_bit_killable, TASK_KILLABLE); folio_lock(folio); mapping = folio_file_mapping(folio); diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 76ec0a464db7..81a5000c4d60 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -72,13 +72,18 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fattr) return nfs_fileid_to_ino_t(fattr->fileid); } -int nfs_wait_bit_killable(struct wait_bit_key *key, int mode) +static int nfs_wait_killable(int mode) { - schedule(); + freezable_schedule_unsafe(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0; } + +int nfs_wait_bit_killable(struct wait_bit_key *key, int mode) +{ + return nfs_wait_killable(mode); +} EXPORT_SYMBOL_GPL(nfs_wait_bit_killable); /** @@ -1338,8 +1343,7 @@ int nfs_clear_invalid_mapping(struct address_space 
*mapping) */ for (;;) { ret = wait_on_bit_action(bitlock, NFS_INO_INVALIDATING, - nfs_wait_bit_killable, - TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); + nfs_wait_bit_killable, TASK_KILLABLE); if (ret) goto out; spin_lock(&inode->i_lock); diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c index 2e7579626cf0..1597eef40d54 100644 --- a/fs/nfs/nfs3proc.c +++ b/fs/nfs/nfs3proc.c @@ -36,8 +36,7 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt, struct rpc_message *msg, int flags) res = rpc_call_sync(clnt, msg, flags); if (res != -EJUKEBOX) break; - __set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); - schedule_timeout(NFS_JUKEBOX_RETRY_TIME); + freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME); res = -ERESTARTSYS; } while (!fatal_signal_pending(current)); return res; diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index fb45339b8e70..59e87a08a0ca 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -421,8 +421,8 @@ static int nfs4_delay_killable(long *timeout) { might_sleep(); - __set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); - schedule_timeout(nfs4_update_delay(timeout)); + freezable_schedule_timeout_killable_unsafe( + nfs4_update_delay(timeout)); if (!__fatal_signal_pending(current)) return 0; return -EINTR; @@ -432,8 +432,7 @@ static int nfs4_delay_interruptible(long *timeout) { might_sleep(); - __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE_UNSAFE); - schedule_timeout(nfs4_update_delay(timeout)); + freezable_schedule_timeout_interruptible_unsafe(nfs4_update_delay(timeout)); if (!signal_pending(current)) return 0; return __fatal_signal_pending(current) ? -EINTR :-ERESTARTSYS; @@ -7428,8 +7427,7 @@ nfs4_retry_setlk_simple(struct nfs4_state *state, int cmd, status = nfs4_proc_setlk(state, cmd, request); if ((status != -EAGAIN) || IS_SETLK(cmd)) break; - __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); - schedule_timeout(timeout); + freezable_schedule_timeout_interruptible(timeout); timeout *= 2; timeout = min_t(unsigned long, NFS4_LOCK_MAXTIMEOUT, timeout); status = -ERESTARTSYS; @@ -7497,8 +7495,10 @@ nfs4_retry_setlk(struct nfs4_state *state, int cmd, struct file_lock *request) break; status = -ERESTARTSYS; - wait_woken(&waiter.wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE, + freezer_do_not_count(); + wait_woken(&waiter.wait, TASK_INTERRUPTIBLE, NFS4_LOCK_MAXTIMEOUT); + freezer_count(); } while (!signalled()); remove_wait_queue(q, &waiter.wait); diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c index e079987af4a3..da0c932ddb4c 100644 --- a/fs/nfs/nfs4state.c +++ b/fs/nfs/nfs4state.c @@ -1317,8 +1317,7 @@ int nfs4_wait_clnt_recover(struct nfs_client *clp) refcount_inc(&clp->cl_count); res = wait_on_bit_action(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING, - nfs_wait_bit_killable, - TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); + nfs_wait_bit_killable, TASK_KILLABLE); if (res) goto out; if (clp->cl_cons_state < 0) diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index 84343aefbbd6..86a9d8c29641 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -1911,7 +1911,7 @@ static int pnfs_prepare_to_retry_layoutget(struct pnfs_layout_hdr *lo) pnfs_layoutcommit_inode(lo->plh_inode, false); return wait_on_bit_action(&lo->plh_flags, NFS_LAYOUT_RETURN, nfs_wait_bit_killable, - TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); + TASK_KILLABLE); } static void nfs_layoutget_begin(struct pnfs_layout_hdr *lo) @@ -3210,7 +3210,7 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync) status = wait_on_bit_lock_action(&nfsi->flags, NFS_INO_LAYOUTCOMMITTING, nfs_wait_bit_killable, - 
TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); + TASK_KILLABLE); if (status) goto out; } diff --git a/fs/smb/client/inode.c b/fs/smb/client/inode.c index 9ae9b3bac949..e90971c750af 100644 --- a/fs/smb/client/inode.c +++ b/fs/smb/client/inode.c @@ -2459,7 +2459,7 @@ cifs_invalidate_mapping(struct inode *inode) static int cifs_wait_bit_killable(struct wait_bit_key *key, int mode) { - schedule(); + freezable_schedule_unsafe(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0; @@ -2477,7 +2477,7 @@ cifs_revalidate_mapping(struct inode *inode) return 0; rc = wait_on_bit_lock_action(flags, CIFS_INO_LOCK, cifs_wait_bit_killable, - TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); + TASK_KILLABLE); if (rc) return rc; diff --git a/fs/smb/client/transport.c b/fs/smb/client/transport.c index a6680f846d3a..1640619c49bf 100644 --- a/fs/smb/client/transport.c +++ b/fs/smb/client/transport.c @@ -764,10 +764,9 @@ wait_for_response(struct TCP_Server_Info *server, struct mid_q_entry *midQ) { int error; - error = wait_event_state(server->response_q, - midQ->mid_state != MID_REQUEST_SUBMITTED && - midQ->mid_state != MID_RESPONSE_RECEIVED, - (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE)); + error = wait_event_freezekillable_unsafe(server->response_q, + midQ->mid_state != MID_REQUEST_SUBMITTED && + midQ->mid_state != MID_RESPONSE_RECEIVED); if (error < 0) return -ERESTARTSYS; diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 1098452e7f95..e7066106182b 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -604,9 +604,9 @@ xfsaild( while (1) { if (tout && tout <= 20) - set_current_state(TASK_KILLABLE|TASK_FREEZABLE); + set_current_state(TASK_KILLABLE); else - set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); + set_current_state(TASK_INTERRUPTIBLE); /* * Check kthread_should_stop() after we set the task state to @@ -655,14 +655,14 @@ xfsaild( ailp->ail_target == ailp->ail_target_prev && list_empty(&ailp->ail_buf_list)) { spin_unlock(&ailp->ail_lock); - schedule(); + freezable_schedule(); tout = 0; continue; } spin_unlock(&ailp->ail_lock); if (tout) - schedule_timeout(msecs_to_jiffies(tout)); + freezable_schedule_timeout(msecs_to_jiffies(tout)); __set_current_state(TASK_RUNNING); diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 98d2f964a925..08bac31196c8 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -343,20 +343,6 @@ struct cgroup_rstat_cpu { */ struct cgroup_base_stat last_bstat; - /* - * This field is used to record the cumulative per-cpu time of - * the cgroup and its descendants. Currently it can be read via - * eBPF/drgn etc, and we are still trying to determine how to - * expose it in the cgroupfs interface. - */ - struct cgroup_base_stat subtree_bstat; - - /* - * Snapshots at the last reading. These are used to calculate the - * deltas to propagate to the per-cpu subtree_bstat. - */ - struct cgroup_base_stat last_subtree_bstat; - /* * Child cgroups with stat updates on this cpu since the last read * are linked on the parent's ->updated_children through @@ -444,9 +430,6 @@ struct cgroup { struct cgroup_file procs_file; /* handle for "cgroup.procs" */ struct cgroup_file events_file; /* handle for "cgroup.events" */ - /* handles for "{cpu,memory,io,irq}.pressure" */ - struct cgroup_file psi_files[NR_PSI_RESOURCES]; - /* * The bitmask of subsystems enabled on the child cgroups. * ->subtree_control is the one configured through @@ -559,10 +542,6 @@ struct cgroup_root { /* Unique id for this hierarchy. 
*/ int hierarchy_id; - /* A list running through the active hierarchies */ - struct list_head root_list; - struct rcu_head rcu; /* Must be near the top */ - /* * The root cgroup. The containing cgroup_root will be destroyed on its * release. cgrp->ancestors[0] will be used overflowing into the @@ -576,6 +555,9 @@ struct cgroup_root { /* Number of cgroups in the hierarchy, used only for /proc/cgroups */ atomic_t nr_cgrps; + /* A list running through the active hierarchies */ + struct list_head root_list; + /* Hierarchy-specific flags */ unsigned int flags; diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index dfcc93197383..acc1c93a824a 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -69,7 +69,6 @@ struct css_task_iter { extern struct file_system_type cgroup_fs_type; extern struct cgroup_root cgrp_dfl_root; extern struct css_set init_css_set; -extern spinlock_t css_set_lock; #define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys; #include @@ -387,6 +386,7 @@ static inline void cgroup_unlock(void) * as locks used during the cgroup_subsys::attach() methods. */ #ifdef CONFIG_PROVE_RCU +extern spinlock_t css_set_lock; #define task_css_set_check(task, __c) \ rcu_dereference_check((task)->cgroups, \ rcu_read_lock_sched_held() || \ @@ -859,6 +859,4 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {} #endif /* CONFIG_CGROUP_BPF */ -struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id); - #endif /* _LINUX_CGROUP_H */ diff --git a/include/linux/freezer.h b/include/linux/freezer.h index b303472255be..0621c5f86c39 100644 --- a/include/linux/freezer.h +++ b/include/linux/freezer.h @@ -8,11 +8,9 @@ #include #include #include -#include #ifdef CONFIG_FREEZER -DECLARE_STATIC_KEY_FALSE(freezer_active); - +extern atomic_t system_freezing_cnt; /* nr of freezing conds in effect */ extern bool pm_freezing; /* PM freezing in effect */ extern bool pm_nosig_freezing; /* PM nosig freezing in effect */ @@ -24,7 +22,10 @@ extern unsigned int freeze_timeout_msecs; /* * Check if a process has been frozen */ -extern bool frozen(struct task_struct *p); +static inline bool frozen(struct task_struct *p) +{ + return p->flags & PF_FROZEN; +} extern bool freezing_slow_path(struct task_struct *p); @@ -33,10 +34,9 @@ extern bool freezing_slow_path(struct task_struct *p); */ static inline bool freezing(struct task_struct *p) { - if (static_branch_unlikely(&freezer_active)) - return freezing_slow_path(p); - - return false; + if (likely(!atomic_read(&system_freezing_cnt))) + return false; + return freezing_slow_path(p); } /* Takes and releases task alloc lock using task_lock() */ @@ -48,14 +48,23 @@ extern int freeze_kernel_threads(void); extern void thaw_processes(void); extern void thaw_kernel_threads(void); -static inline bool try_to_freeze(void) +/* + * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION + * If try_to_freeze causes a lockdep warning it means the caller may deadlock + */ +static inline bool try_to_freeze_unsafe(void) { might_sleep(); if (likely(!freezing(current))) return false; + return __refrigerator(false); +} + +static inline bool try_to_freeze(void) +{ if (!(current->flags & PF_NOFREEZE)) debug_check_no_locks_held(); - return __refrigerator(false); + return try_to_freeze_unsafe(); } extern bool freeze_task(struct task_struct *p); @@ -70,6 +79,195 @@ static inline bool cgroup_freezing(struct task_struct *task) } #endif /* !CONFIG_CGROUP_FREEZER */ +/* + * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it + * calls 
wait_for_completion(&vfork) and reset right after it returns from this + * function. Next, the parent should call try_to_freeze() to freeze itself + * appropriately in case the child has exited before the freezing of tasks is + * complete. However, we don't want kernel threads to be frozen in unexpected + * places, so we allow them to block freeze_processes() instead or to set + * PF_NOFREEZE if needed. Fortunately, in the ____call_usermodehelper() case the + * parent won't really block freeze_processes(), since ____call_usermodehelper() + * (the child) does a little before exec/exit and it can't be frozen before + * waking up the parent. + */ + + +/** + * freezer_do_not_count - tell freezer to ignore %current + * + * Tell freezers to ignore the current task when determining whether the + * target frozen state is reached. IOW, the current task will be + * considered frozen enough by freezers. + * + * The caller shouldn't do anything which isn't allowed for a frozen task + * until freezer_cont() is called. Usually, freezer[_do_not]_count() pair + * wrap a scheduling operation and nothing much else. + */ +static inline void freezer_do_not_count(void) +{ + current->flags |= PF_FREEZER_SKIP; +} + +/** + * freezer_count - tell freezer to stop ignoring %current + * + * Undo freezer_do_not_count(). It tells freezers that %current should be + * considered again and tries to freeze if freezing condition is already in + * effect. + */ +static inline void freezer_count(void) +{ + current->flags &= ~PF_FREEZER_SKIP; + /* + * If freezing is in progress, the following paired with smp_mb() + * in freezer_should_skip() ensures that either we see %true + * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP. + */ + smp_mb(); + try_to_freeze(); +} + +/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ +static inline void freezer_count_unsafe(void) +{ + current->flags &= ~PF_FREEZER_SKIP; + smp_mb(); + try_to_freeze_unsafe(); +} + +/** + * freezer_should_skip - whether to skip a task when determining frozen + * state is reached + * @p: task in quesion + * + * This function is used by freezers after establishing %true freezing() to + * test whether a task should be skipped when determining the target frozen + * state is reached. IOW, if this function returns %true, @p is considered + * frozen enough. + */ +static inline bool freezer_should_skip(struct task_struct *p) +{ + /* + * The following smp_mb() paired with the one in freezer_count() + * ensures that either freezer_count() sees %true freezing() or we + * see cleared %PF_FREEZER_SKIP and return %false. This makes it + * impossible for a task to slip frozen state testing after + * clearing %PF_FREEZER_SKIP. + */ + smp_mb(); + return p->flags & PF_FREEZER_SKIP; +} + +/* + * These functions are intended to be used whenever you want allow a sleeping + * task to be frozen. Note that neither return any clear indication of + * whether a freeze event happened while in this function. + */ + +/* Like schedule(), but should not block the freezer. */ +static inline void freezable_schedule(void) +{ + freezer_do_not_count(); + schedule(); + freezer_count(); +} + +/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ +static inline void freezable_schedule_unsafe(void) +{ + freezer_do_not_count(); + schedule(); + freezer_count_unsafe(); +} + +/* + * Like schedule_timeout(), but should not block the freezer. Do not + * call this with locks held. 
+ */ +static inline long freezable_schedule_timeout(long timeout) +{ + long __retval; + freezer_do_not_count(); + __retval = schedule_timeout(timeout); + freezer_count(); + return __retval; +} + +/* + * Like schedule_timeout_interruptible(), but should not block the freezer. Do not + * call this with locks held. + */ +static inline long freezable_schedule_timeout_interruptible(long timeout) +{ + long __retval; + freezer_do_not_count(); + __retval = schedule_timeout_interruptible(timeout); + freezer_count(); + return __retval; +} + +/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ +static inline long freezable_schedule_timeout_interruptible_unsafe(long timeout) +{ + long __retval; + + freezer_do_not_count(); + __retval = schedule_timeout_interruptible(timeout); + freezer_count_unsafe(); + return __retval; +} + +/* Like schedule_timeout_killable(), but should not block the freezer. */ +static inline long freezable_schedule_timeout_killable(long timeout) +{ + long __retval; + freezer_do_not_count(); + __retval = schedule_timeout_killable(timeout); + freezer_count(); + return __retval; +} + +/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ +static inline long freezable_schedule_timeout_killable_unsafe(long timeout) +{ + long __retval; + freezer_do_not_count(); + __retval = schedule_timeout_killable(timeout); + freezer_count_unsafe(); + return __retval; +} + +/* + * Like schedule_hrtimeout_range(), but should not block the freezer. Do not + * call this with locks held. + */ +static inline int freezable_schedule_hrtimeout_range(ktime_t *expires, + u64 delta, const enum hrtimer_mode mode) +{ + int __retval; + freezer_do_not_count(); + __retval = schedule_hrtimeout_range(expires, delta, mode); + freezer_count(); + return __retval; +} + +/* + * Freezer-friendly wrappers around wait_event_interruptible(), + * wait_event_killable() and wait_event_interruptible_timeout(), originally + * defined in + */ + +/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ +#define wait_event_freezekillable_unsafe(wq, condition) \ +({ \ + int __retval; \ + freezer_do_not_count(); \ + __retval = wait_event_killable(wq, (condition)); \ + freezer_count_unsafe(); \ + __retval; \ +}) + #else /* !CONFIG_FREEZER */ static inline bool frozen(struct task_struct *p) { return false; } static inline bool freezing(struct task_struct *p) { return false; } @@ -83,8 +281,35 @@ static inline void thaw_kernel_threads(void) {} static inline bool try_to_freeze(void) { return false; } +static inline void freezer_do_not_count(void) {} +static inline void freezer_count(void) {} +static inline int freezer_should_skip(struct task_struct *p) { return 0; } static inline void set_freezable(void) {} +#define freezable_schedule() schedule() + +#define freezable_schedule_unsafe() schedule() + +#define freezable_schedule_timeout(timeout) schedule_timeout(timeout) + +#define freezable_schedule_timeout_interruptible(timeout) \ + schedule_timeout_interruptible(timeout) + +#define freezable_schedule_timeout_interruptible_unsafe(timeout) \ + schedule_timeout_interruptible(timeout) + +#define freezable_schedule_timeout_killable(timeout) \ + schedule_timeout_killable(timeout) + +#define freezable_schedule_timeout_killable_unsafe(timeout) \ + schedule_timeout_killable(timeout) + +#define freezable_schedule_hrtimeout_range(expires, delta, mode) \ + schedule_hrtimeout_range(expires, delta, mode) + +#define wait_event_freezekillable_unsafe(wq, condition) \ + wait_event_killable(wq, condition) + #endif /* !CONFIG_FREEZER */ #endif /* FREEZER_H_INCLUDED */ diff 
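The freezer_do_not_count()/freezer_count() pair restored above is the building
block behind all of the freezable_schedule*() wrappers, and it is the same
pattern this patch open-codes in binder_wait_for_work() and coredump_wait().
A minimal sketch of that idea around a plain completion; my_done and
my_wait_for_done are illustrative names, not code added by this patch::

	#include <linux/completion.h>
	#include <linux/freezer.h>

	static DECLARE_COMPLETION(my_done);

	static void my_wait_for_done(void)
	{
		/*
		 * Mark this task as "frozen enough" (PF_FREEZER_SKIP) so an
		 * arbitrarily long sleep here cannot make freeze_processes()
		 * fail.
		 */
		freezer_do_not_count();
		wait_for_completion(&my_done);
		/*
		 * Drop PF_FREEZER_SKIP again and enter the refrigerator right
		 * away if a freezing operation began while we were sleeping.
		 */
		freezer_count();
	}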
--git a/include/linux/kernfs.h b/include/linux/kernfs.h index 4db5b4480e88..4bacb776176b 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -552,10 +552,6 @@ static inline int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr) { return -ENOSYS; } -static inline __poll_t kernfs_generic_poll(struct kernfs_open_file *of, - struct poll_table_struct *pt) -{ return -ENOSYS; } - static inline void kernfs_notify(struct kernfs_node *kn) { } static inline int kernfs_xattr_get(struct kernfs_node *kn, const char *name, diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index e799b1f8d05b..c238207d1615 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -31,18 +31,17 @@ struct misc_cg; * struct misc_res: Per cgroup per misc type resource * @max: Maximum limit on the resource. * @usage: Current usage of the resource. - * @events: Number of times, the resource limit exceeded. + * @failed: True if charged failed for the resource in a cgroup. */ struct misc_res { - u64 max; - atomic64_t usage; - atomic64_t events; + unsigned long max; + atomic_long_t usage; + atomic_long_t events; }; /** * struct misc_cg - Miscellaneous controller's cgroup structure. * @css: cgroup subsys state object. - * @events_file: Handle for the misc resources events file. * @res: Array of misc resources usage in the cgroup. */ struct misc_cg { @@ -54,10 +53,12 @@ struct misc_cg { struct misc_res res[MISC_CG_RES_TYPES]; }; -u64 misc_cg_res_total_usage(enum misc_res_type type); -int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); -int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount); -void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount); +unsigned long misc_cg_res_total_usage(enum misc_res_type type); +int misc_cg_set_capacity(enum misc_res_type type, unsigned long capacity); +int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, + unsigned long amount); +void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, + unsigned long amount); /** * css_misc() - Get misc cgroup from the css. 
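With the prototypes above back on unsigned long, a consumer of the misc
controller charges and releases units roughly as sketched below. This is an
illustrative fragment only: MISC_CG_RES_SEV is used purely as an example
resource type, my_cg/my_charge_one_unit/my_uncharge_one_unit are hypothetical
names, and it assumes CONFIG_CGROUP_MISC plus a capacity registered earlier
through misc_cg_set_capacity()::

	#include <linux/misc_cgroup.h>

	static struct misc_cg *my_cg;

	static int my_charge_one_unit(void)
	{
		int ret;

		my_cg = get_current_misc_cg();
		ret = misc_cg_try_charge(MISC_CG_RES_SEV, my_cg, 1);
		if (ret) {
			/* Limit or capacity exceeded (-EBUSY), or invalid type. */
			put_misc_cg(my_cg);
			my_cg = NULL;
		}
		return ret;
	}

	static void my_uncharge_one_unit(void)
	{
		if (!my_cg)
			return;
		misc_cg_uncharge(MISC_CG_RES_SEV, my_cg, 1);
		put_misc_cg(my_cg);
		my_cg = NULL;
	}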
@@ -98,26 +99,27 @@ static inline void put_misc_cg(struct misc_cg *cg) #else /* !CONFIG_CGROUP_MISC */ -static inline u64 misc_cg_res_total_usage(enum misc_res_type type) +static inline unsigned long misc_cg_res_total_usage(enum misc_res_type type) { return 0; } -static inline int misc_cg_set_capacity(enum misc_res_type type, u64 capacity) +static inline int misc_cg_set_capacity(enum misc_res_type type, + unsigned long capacity) { return 0; } static inline int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, - u64 amount) + unsigned long amount) { return 0; } static inline void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, - u64 amount) + unsigned long amount) { } diff --git a/include/linux/psi.h b/include/linux/psi.h index 6ba577dc6a8e..39949e9842ac 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -22,9 +22,8 @@ void psi_memstall_enter(unsigned long *flags); void psi_memstall_leave(unsigned long *flags); int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res); -struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf, - enum psi_res res, struct file *file, - struct kernfs_open_file *of); +struct psi_trigger *psi_trigger_create(struct psi_group *group, + char *buf, enum psi_res res, struct file *file); void psi_trigger_destroy(struct psi_trigger *t); __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file, @@ -39,7 +38,6 @@ static inline struct psi_group *cgroup_psi(struct cgroup *cgrp) int psi_cgroup_alloc(struct cgroup *cgrp); void psi_cgroup_free(struct cgroup *cgrp); void cgroup_move_task(struct task_struct *p, struct css_set *to); -void psi_cgroup_restart(struct psi_group *group); #endif #else /* CONFIG_PSI */ @@ -61,7 +59,6 @@ static inline void cgroup_move_task(struct task_struct *p, struct css_set *to) { rcu_assign_pointer(p->cgroups, to); } -static inline void psi_cgroup_restart(struct psi_group *group) {} #endif #endif /* CONFIG_PSI */ diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index f822a459d5a4..5550f3ac336b 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -136,9 +136,6 @@ struct psi_trigger { /* Wait queue for polling */ wait_queue_head_t event_wait; - /* Kernfs file for cgroup triggers */ - struct kernfs_open_file *of; - /* Pending event flag */ int event; @@ -160,7 +157,6 @@ struct psi_trigger { struct psi_group { struct psi_group *parent; - bool enabled; /* Protects data used by the aggregator */ struct mutex avgs_lock; @@ -208,8 +204,6 @@ struct psi_group { #else /* CONFIG_PSI */ -#define NR_PSI_RESOURCES 0 - struct psi_group { }; #endif /* CONFIG_PSI */ diff --git a/include/linux/sched.h b/include/linux/sched.h index f89cea64b319..c6f24a97a577 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -102,19 +102,12 @@ struct task_struct; #define TASK_WAKING 0x00000200 #define TASK_NOLOAD 0x00000400 #define TASK_NEW 0x00000800 +/* RT specific auxilliary flag to mark RT lock waiters */ #define TASK_RTLOCK_WAIT 0x00001000 -#define TASK_FREEZABLE 0x00002000 -#define __TASK_FREEZABLE_UNSAFE (0x00004000 * IS_ENABLED(CONFIG_LOCKDEP)) -#define TASK_FROZEN 0x00008000 -#define TASK_STATE_MAX 0x00010000 +#define TASK_STATE_MAX 0x00002000 #define TASK_ANY (TASK_STATE_MAX-1) -/* - * DO NOT ADD ANY NEW USERS ! 
- */ -#define TASK_FREEZABLE_UNSAFE (TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE) - /* Convenience macros for the sake of set_current_state: */ #define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE) #define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED) @@ -1754,6 +1747,7 @@ extern struct pid *cad_pid; #define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */ #define PF_USED_ASYNC 0x00004000 /* Used async_schedule*(), used by module init */ #define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */ +#define PF_FROZEN 0x00010000 /* Frozen for system suspend */ #define PF_KSWAPD 0x00020000 /* I am kswapd */ #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ @@ -1764,6 +1758,7 @@ extern struct pid *cad_pid; #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ #define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */ +#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */ /* diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h index 8ada7dc802d3..370653c15b43 100644 --- a/include/linux/sunrpc/sched.h +++ b/include/linux/sunrpc/sched.h @@ -257,7 +257,7 @@ int rpc_malloc(struct rpc_task *); void rpc_free(struct rpc_task *); int rpciod_up(void); void rpciod_down(void); -int rpc_wait_for_completion_task(struct rpc_task *task); +int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *); #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) struct net; void rpc_show_tasks(struct net *); @@ -269,6 +269,11 @@ extern struct workqueue_struct *xprtiod_workqueue; void rpc_prepare_task(struct rpc_task *task); gfp_t rpc_task_gfp_mask(void); +static inline int rpc_wait_for_completion_task(struct rpc_task *task) +{ + return __rpc_wait_for_completion_task(task, NULL); +} + #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) || IS_ENABLED(CONFIG_TRACEPOINTS) static inline const char * rpc_qname(const struct rpc_wait_queue *q) { diff --git a/include/linux/wait.h b/include/linux/wait.h index ffc5f6e91a9d..15ea0a00bcba 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -278,7 +278,7 @@ static inline void wake_up_pollfree(struct wait_queue_head *wq_head) #define ___wait_is_interruptible(state) \ (!__builtin_constant_p(state) || \ - (state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL))) + state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE) \ extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags); @@ -358,8 +358,8 @@ do { \ } while (0) #define __wait_event_freezable(wq_head, condition) \ - ___wait_event(wq_head, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), \ - 0, 0, schedule()) + ___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0, \ + freezable_schedule()) /** * wait_event_freezable - sleep (or freeze) until a condition gets true @@ -417,8 +417,8 @@ do { \ #define __wait_event_freezable_timeout(wq_head, condition, timeout) \ ___wait_event(wq_head, ___wait_cond_timeout(condition), \ - (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 0, timeout, \ - __ret = schedule_timeout(__ret)) + TASK_INTERRUPTIBLE, 0, timeout, \ + __ret = freezable_schedule_timeout(__ret)) /* * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to 
avoid @@ -639,8 +639,8 @@ do { \ #define __wait_event_freezable_exclusive(wq, condition) \ - ___wait_event(wq, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 1, 0,\ - schedule()) + ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0, \ + freezable_schedule()) #define wait_event_freezable_exclusive(wq, condition) \ ({ \ diff --git a/init/Kconfig b/init/Kconfig index 74660294886b..c1a4b401c37d 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -619,7 +619,6 @@ config TASK_IO_ACCOUNTING config PSI bool "Pressure stall information tracking" - select KERNFS help Collect metrics that indicate how overcommitted the CPU, memory, and IO capacity are in the system. diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 520b90dd97ec..c56071f150f2 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -164,13 +164,13 @@ struct cgroup_mgctx { #define DEFINE_CGROUP_MGCTX(name) \ struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) +extern spinlock_t css_set_lock; extern struct cgroup_subsys *cgroup_subsys[]; extern struct list_head cgroup_roots; /* iterate across the hierarchies */ #define for_each_root(root) \ - list_for_each_entry_rcu((root), &cgroup_roots, root_list, \ - lockdep_is_held(&cgroup_mutex)) + list_for_each_entry((root), &cgroup_roots, root_list) /** * for_each_subsys - iterate all enabled cgroup subsystems diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 04d11a7dd95f..83044312bc41 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -360,9 +360,10 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type, } css_task_iter_end(&it); length = n; - /* now sort & strip out duplicates (tgids or recycled thread PIDs) */ + /* now sort & (if procs) strip out duplicates */ sort(array, length, sizeof(pid_t), cmppid, NULL); - length = pidlist_uniq(array, length); + if (type == CGROUP_FILE_PROCS) + length = pidlist_uniq(array, length); l = cgroup_pidlist_find_create(cgrp, type); if (!l) { @@ -430,7 +431,7 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos) if (l->list[mid] == pid) { index = mid; break; - } else if (l->list[mid] < pid) + } else if (l->list[mid] <= pid) index = mid + 1; else end = mid; @@ -1262,40 +1263,6 @@ int cgroup1_get_tree(struct fs_context *fc) return ret; } -/** - * task_get_cgroup1 - Acquires the associated cgroup of a task within a - * specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its - * hierarchy ID. - * @tsk: The target task - * @hierarchy_id: The ID of a cgroup1 hierarchy - * - * On success, the cgroup is returned. On failure, ERR_PTR is returned. - * We limit it to cgroup1 only. 
- */ -struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id) -{ - struct cgroup *cgrp = ERR_PTR(-ENOENT); - struct cgroup_root *root; - unsigned long flags; - - rcu_read_lock(); - for_each_root(root) { - /* cgroup1 only*/ - if (root == &cgrp_dfl_root) - continue; - if (root->hierarchy_id != hierarchy_id) - continue; - spin_lock_irqsave(&css_set_lock, flags); - cgrp = task_cgroup_from_root(tsk, root); - if (!cgrp || !cgroup_tryget(cgrp)) - cgrp = ERR_PTR(-ENOENT); - spin_unlock_irqrestore(&css_set_lock, flags); - break; - } - rcu_read_unlock(); - return cgrp; -} - static int __init cgroup1_wq_init(void) { /* diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index a84765c6ba45..c1f5f779be2f 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -494,6 +494,28 @@ static struct cgroup_subsys_state *cgroup_css(struct cgroup *cgrp, return &cgrp->self; } +/** + * cgroup_tryget_css - try to get a cgroup's css for the specified subsystem + * @cgrp: the cgroup of interest + * @ss: the subsystem of interest + * + * Find and get @cgrp's css associated with @ss. If the css doesn't exist + * or is offline, %NULL is returned. + */ +static struct cgroup_subsys_state *cgroup_tryget_css(struct cgroup *cgrp, + struct cgroup_subsys *ss) +{ + struct cgroup_subsys_state *css; + + rcu_read_lock(); + css = cgroup_css(cgrp, ss); + if (css && !css_tryget_online(css)) + css = NULL; + rcu_read_unlock(); + + return css; +} + /** * cgroup_e_css_by_mask - obtain a cgroup's effective css for the specified ss * @cgrp: the cgroup of interest @@ -659,7 +681,7 @@ EXPORT_SYMBOL_GPL(of_css); * @ssid: the index of the subsystem, CGROUP_SUBSYS_COUNT after reaching the end * @cgrp: the target cgroup to iterate css's of * - * Should be called under cgroup_mutex. + * Should be called under cgroup_[tree_]mutex. */ #define for_each_css(css, ssid, cgrp) \ for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT; (ssid)++) \ @@ -909,7 +931,7 @@ static void css_set_move_task(struct task_struct *task, #define CSS_SET_HASH_BITS 7 static DEFINE_HASHTABLE(css_set_table, CSS_SET_HASH_BITS); -static unsigned long css_set_hash(struct cgroup_subsys_state **css) +static unsigned long css_set_hash(struct cgroup_subsys_state *css[]) { unsigned long key = 0UL; struct cgroup_subsys *ss; @@ -1050,7 +1072,7 @@ static bool compare_css_sets(struct css_set *cset, */ static struct css_set *find_existing_css_set(struct css_set *old_cset, struct cgroup *cgrp, - struct cgroup_subsys_state **template) + struct cgroup_subsys_state *template[]) { struct cgroup_root *root = cgrp->root; struct cgroup_subsys *ss; @@ -1315,7 +1337,7 @@ static void cgroup_exit_root_id(struct cgroup_root *root) void cgroup_free_root(struct cgroup_root *root) { - kfree_rcu(root, rcu); + kfree(root); } static void cgroup_destroy_root(struct cgroup_root *root) @@ -1347,9 +1369,10 @@ static void cgroup_destroy_root(struct cgroup_root *root) spin_unlock_irq(&css_set_lock); - WARN_ON_ONCE(list_empty(&root->root_list)); - list_del_rcu(&root->root_list); - cgroup_root_count--; + if (!list_empty(&root->root_list)) { + list_del(&root->root_list); + cgroup_root_count--; + } if (!have_favordynmods) cgroup_favor_dynmods(root, false); @@ -1389,15 +1412,7 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset, } } - /* - * If cgroup_mutex is not held, the cgrp_cset_link will be freed - * before we remove the cgroup root from the root_list. 
Consequently, - * when accessing a cgroup root, the cset_link may have already been - * freed, resulting in a NULL res_cgroup. However, by holding the - * cgroup_mutex, we ensure that res_cgroup can't be NULL. - * If we don't hold cgroup_mutex in the caller, we must do the NULL - * check. - */ + BUG_ON(!res_cgroup); return res_cgroup; } @@ -1420,11 +1435,6 @@ current_cgns_cgroup_from_root(struct cgroup_root *root) rcu_read_unlock(); - /* - * The namespace_sem is held by current, so the root cgroup can't - * be umounted. Therefore, we can ensure that the res is non-NULL. - */ - WARN_ON_ONCE(!res); return res; } @@ -1461,6 +1471,7 @@ static struct cgroup *current_cgns_cgroup_dfl(void) static struct cgroup *cset_cgroup_from_root(struct css_set *cset, struct cgroup_root *root) { + lockdep_assert_held(&cgroup_mutex); lockdep_assert_held(&css_set_lock); return __cset_cgroup_from_root(cset, root); @@ -1468,9 +1479,7 @@ static struct cgroup *cset_cgroup_from_root(struct css_set *cset, /* * Return the cgroup for "task" from the given hierarchy. Must be - * called with css_set_lock held to prevent task's groups from being modified. - * Must be called with either cgroup_mutex or rcu read lock to prevent the - * cgroup root from being destroyed. + * called with cgroup_mutex and css_set_lock held. */ struct cgroup *task_cgroup_from_root(struct task_struct *task, struct cgroup_root *root) @@ -1731,27 +1740,25 @@ static int css_populate_dir(struct cgroup_subsys_state *css) struct cftype *cfts, *failed_cfts; int ret; - if (css->flags & CSS_VISIBLE) + if ((css->flags & CSS_VISIBLE) || !cgrp->kn) return 0; if (!css->ss) { if (cgroup_on_dfl(cgrp)) { - ret = cgroup_addrm_files(css, cgrp, + ret = cgroup_addrm_files(&cgrp->self, cgrp, cgroup_base_files, true); if (ret < 0) return ret; if (cgroup_psi_enabled()) { - ret = cgroup_addrm_files(css, cgrp, + ret = cgroup_addrm_files(&cgrp->self, cgrp, cgroup_psi_files, true); if (ret < 0) return ret; } } else { - ret = cgroup_addrm_files(css, cgrp, - cgroup1_base_files, true); - if (ret < 0) - return ret; + cgroup_addrm_files(css, cgrp, + cgroup1_base_files, true); } } else { list_for_each_entry(cfts, &css->ss->cfts, node) { @@ -2033,7 +2040,7 @@ void init_cgroup_root(struct cgroup_fs_context *ctx) struct cgroup_root *root = ctx->root; struct cgroup *cgrp = &root->cgrp; - INIT_LIST_HEAD_RCU(&root->root_list); + INIT_LIST_HEAD(&root->root_list); atomic_set(&root->nr_cgrps, 1); cgrp->root = root; init_cgroup_housekeeping(cgrp); @@ -2116,7 +2123,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask) * care of subsystems' refcounts, which are explicitly dropped in * the failure exit path. */ - list_add_rcu(&root->root_list, &cgroup_roots); + list_add(&root->root_list, &cgroup_roots); cgroup_root_count++; /* @@ -2496,7 +2503,7 @@ struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset, /* * This function may be called both before and - * after cgroup_migrate_execute(). The two cases + * after cgroup_taskset_migrate(). The two cases * can be distinguished by looking at whether @cset * has its ->mg_dst_cset set. */ @@ -2641,6 +2648,10 @@ int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp) if (!cgroup_is_valid_domain(dst_cgrp->dom_cgrp)) return -EOPNOTSUPP; + /* mixables don't care */ + if (cgroup_is_mixable(dst_cgrp)) + return 0; + /* * If @dst_cgrp is already or can become a thread root or is * threaded, it doesn't matter. 
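Since this part of the patch drops the RCU protection of the root list and the
NULL checks, callers are back to the stricter locking rule spelled out above
for task_cgroup_from_root(): hold both cgroup_mutex and css_set_lock, exactly
as proc_cgroup_show() does further down in this diff. A minimal sketch of that
caller-side discipline for code inside kernel/cgroup/; my_walk_task_roots is an
illustrative name, not code added by this patch::

	#include "cgroup-internal.h"

	static void my_walk_task_roots(struct task_struct *tsk)
	{
		struct cgroup_root *root;

		cgroup_lock();			/* pins the root list */
		spin_lock_irq(&css_set_lock);	/* pins tsk's css_set links */

		for_each_root(root) {
			struct cgroup *cgrp = task_cgroup_from_root(tsk, root);

			/* cgrp stays valid while both locks are held. */
			(void)cgrp;
		}

		spin_unlock_irq(&css_set_lock);
		cgroup_unlock();
	}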
@@ -3651,32 +3662,9 @@ static int cgroup_stat_show(struct seq_file *seq, void *v) return 0; } -#ifdef CONFIG_CGROUP_SCHED -/** - * cgroup_tryget_css - try to get a cgroup's css for the specified subsystem - * @cgrp: the cgroup of interest - * @ss: the subsystem of interest - * - * Find and get @cgrp's css associated with @ss. If the css doesn't exist - * or is offline, %NULL is returned. - */ -static struct cgroup_subsys_state *cgroup_tryget_css(struct cgroup *cgrp, - struct cgroup_subsys *ss) +static int __maybe_unused cgroup_extra_stat_show(struct seq_file *seq, + struct cgroup *cgrp, int ssid) { - struct cgroup_subsys_state *css; - - rcu_read_lock(); - css = cgroup_css(cgrp, ss); - if (css && !css_tryget_online(css)) - css = NULL; - rcu_read_unlock(); - - return css; -} - -static int cgroup_extra_stat_show(struct seq_file *seq, int ssid) -{ - struct cgroup *cgrp = seq_css(seq)->cgroup; struct cgroup_subsys *ss = cgroup_subsys[ssid]; struct cgroup_subsys_state *css; int ret; @@ -3693,44 +3681,14 @@ static int cgroup_extra_stat_show(struct seq_file *seq, int ssid) return ret; } -static int cgroup_local_stat_show(struct seq_file *seq, - struct cgroup *cgrp, int ssid) -{ - struct cgroup_subsys *ss = cgroup_subsys[ssid]; - struct cgroup_subsys_state *css; - int ret; - - if (!ss->css_local_stat_show) - return 0; - - css = cgroup_tryget_css(cgrp, ss); - if (!css) - return 0; - - ret = ss->css_local_stat_show(seq, css); - css_put(css); - return ret; -} -#endif - static int cpu_stat_show(struct seq_file *seq, void *v) -{ - int ret = 0; - - cgroup_base_stat_cputime_show(seq); -#ifdef CONFIG_CGROUP_SCHED - ret = cgroup_extra_stat_show(seq, cpu_cgrp_id); -#endif - return ret; -} - -static int cpu_local_stat_show(struct seq_file *seq, void *v) { struct cgroup __maybe_unused *cgrp = seq_css(seq)->cgroup; int ret = 0; + cgroup_base_stat_cputime_show(seq); #ifdef CONFIG_CGROUP_SCHED - ret = cgroup_local_stat_show(seq, cgrp, cpu_cgrp_id); + ret = cgroup_extra_stat_show(seq, cgrp, cpu_cgrp_id); #endif return ret; } @@ -3788,8 +3746,8 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v) return psi_show(seq, psi, PSI_CPU); } -static ssize_t pressure_write(struct kernfs_open_file *of, char *buf, - size_t nbytes, enum psi_res res) +static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, enum psi_res res) { struct cgroup_file_ctx *ctx = of->priv; struct psi_trigger *new; @@ -3810,7 +3768,7 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf, } psi = cgroup_psi(cgrp); - new = psi_trigger_create(psi, buf, res, of->file, of); + new = psi_trigger_create(psi, buf, res, of->file); if (IS_ERR(new)) { cgroup_put(cgrp); return PTR_ERR(new); @@ -3826,21 +3784,21 @@ static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { - return pressure_write(of, buf, nbytes, PSI_IO); + return cgroup_pressure_write(of, buf, nbytes, PSI_IO); } static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { - return pressure_write(of, buf, nbytes, PSI_MEM); + return cgroup_pressure_write(of, buf, nbytes, PSI_MEM); } static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { - return pressure_write(of, buf, nbytes, PSI_CPU); + return cgroup_pressure_write(of, buf, nbytes, PSI_CPU); } #ifdef CONFIG_IRQ_TIME_ACCOUNTING @@ -3856,58 +3814,10 @@ static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of, char 
*buf, size_t nbytes, loff_t off) { - return pressure_write(of, buf, nbytes, PSI_IRQ); + return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ); } #endif -static int cgroup_pressure_show(struct seq_file *seq, void *v) -{ - struct cgroup *cgrp = seq_css(seq)->cgroup; - struct psi_group *psi = cgroup_psi(cgrp); - - seq_printf(seq, "%d\n", psi->enabled); - - return 0; -} - -static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, - loff_t off) -{ - ssize_t ret; - int enable; - struct cgroup *cgrp; - struct psi_group *psi; - - ret = kstrtoint(strstrip(buf), 0, &enable); - if (ret) - return ret; - - if (enable < 0 || enable > 1) - return -ERANGE; - - cgrp = cgroup_kn_lock_live(of->kn, false); - if (!cgrp) - return -ENOENT; - - psi = cgroup_psi(cgrp); - if (psi->enabled != enable) { - int i; - - /* show or hide {cpu,memory,io,irq}.pressure files */ - for (i = 0; i < NR_PSI_RESOURCES; i++) - cgroup_file_show(&cgrp->psi_files[i], enable); - - psi->enabled = enable; - if (enable) - psi_cgroup_restart(psi); - } - - cgroup_kn_unlock(of->kn); - - return nbytes; -} - static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, poll_table *pt) { @@ -4200,6 +4110,20 @@ static struct kernfs_ops cgroup_kf_ops = { .seq_show = cgroup_seqfile_show, }; +/* set uid and gid of cgroup dirs and files to that of the creator */ +static int cgroup_kn_set_ugid(struct kernfs_node *kn) +{ + struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID, + .ia_uid = current_fsuid(), + .ia_gid = current_fsgid(), }; + + if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) && + gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID)) + return 0; + + return kernfs_setattr(kn, &iattr); +} + static void cgroup_file_notify_timer(struct timer_list *timer) { cgroup_file_notify(container_of(timer, struct cgroup_file, @@ -4212,18 +4136,25 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp, char name[CGROUP_FILE_NAME_MAX]; struct kernfs_node *kn; struct lock_class_key *key = NULL; + int ret; #ifdef CONFIG_DEBUG_LOCK_ALLOC key = &cft->lockdep_key; #endif kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name), cgroup_file_mode(cft), - current_fsuid(), current_fsgid(), + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, cft->kf_ops, cft, NULL, key); if (IS_ERR(kn)) return PTR_ERR(kn); + ret = cgroup_kn_set_ugid(kn); + if (ret) { + kernfs_remove(kn); + return ret; + } + if (cft->file_offset) { struct cgroup_file *cfile = (void *)css + cft->file_offset; @@ -4371,13 +4302,14 @@ static int cgroup_init_cftypes(struct cgroup_subsys *ss, struct cftype *cfts) return ret; } -static void cgroup_rm_cftypes_locked(struct cftype *cfts) +static int cgroup_rm_cftypes_locked(struct cftype *cfts) { lockdep_assert_held(&cgroup_mutex); list_del(&cfts->node); cgroup_apply_cftypes(cfts, false); cgroup_exit_cftypes(cfts); + return 0; } /** @@ -4393,6 +4325,8 @@ static void cgroup_rm_cftypes_locked(struct cftype *cfts) */ int cgroup_rm_cftypes(struct cftype *cfts) { + int ret; + if (!cfts || cfts[0].name[0] == '\0') return 0; @@ -4400,9 +4334,9 @@ int cgroup_rm_cftypes(struct cftype *cfts) return -ENOENT; cgroup_lock(); - cgroup_rm_cftypes_locked(cfts); + ret = cgroup_rm_cftypes_locked(cfts); cgroup_unlock(); - return 0; + return ret; } /** @@ -5300,7 +5234,6 @@ static struct cftype cgroup_psi_files[] = { #ifdef CONFIG_PSI { .name = "io.pressure", - .file_offset = offsetof(struct cgroup, psi_files[PSI_IO]), .seq_show = cgroup_io_pressure_show, .write = cgroup_io_pressure_write, .poll = cgroup_pressure_poll, @@ -5308,7 
+5241,6 @@ static struct cftype cgroup_psi_files[] = { }, { .name = "memory.pressure", - .file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]), .seq_show = cgroup_memory_pressure_show, .write = cgroup_memory_pressure_write, .poll = cgroup_pressure_poll, @@ -5316,7 +5248,6 @@ static struct cftype cgroup_psi_files[] = { }, { .name = "cpu.pressure", - .file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]), .seq_show = cgroup_cpu_pressure_show, .write = cgroup_cpu_pressure_write, .poll = cgroup_pressure_poll, @@ -5325,18 +5256,12 @@ static struct cftype cgroup_psi_files[] = { #ifdef CONFIG_IRQ_TIME_ACCOUNTING { .name = "irq.pressure", - .file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]), .seq_show = cgroup_irq_pressure_show, .write = cgroup_irq_pressure_write, .poll = cgroup_pressure_poll, .release = cgroup_pressure_release, }, #endif - { - .name = "cgroup.pressure", - .seq_show = cgroup_pressure_show, - .write = cgroup_pressure_write, - }, #endif /* CONFIG_PSI */ { } /* terminate */ }; @@ -5357,7 +5282,7 @@ static struct cftype cgroup_psi_files[] = { * RCU callback. * * 4. After the grace period, the css can be freed. Implemented in - * css_free_rwork_fn(). + * css_free_work_fn(). * * It is actually hairier because both step 2 and 4 require process context * and thus involve punting to css->destroy_work adding two additional @@ -5601,7 +5526,8 @@ err_free_css: /* * The returned cgroup is fully initialized including its control mask, but - * it doesn't have the control mask applied. + * it isn't associated with its kernfs_node and doesn't have the control + * mask applied. */ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name, umode_t mode) @@ -5626,9 +5552,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name, goto out_cancel_ref; /* create the directory */ - kn = kernfs_create_dir_ns(parent->kn, name, mode, - current_fsuid(), current_fsgid(), - cgrp, NULL); + kn = kernfs_create_dir(parent->kn, name, mode, cgrp); if (IS_ERR(kn)) { ret = PTR_ERR(kn); goto out_stat_exit; @@ -5773,6 +5697,10 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode) */ kernfs_get(cgrp->kn); + ret = cgroup_kn_set_ugid(cgrp->kn); + if (ret) + goto out_destroy; + ret = css_populate_dir(&cgrp->self); if (ret) goto out_destroy; @@ -5925,7 +5853,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) /* * Mark @cgrp and the associated csets dead. The former prevents * further task migration and child creation by disabling - * cgroup_kn_lock_live(). The latter makes the csets ignored by + * cgroup_lock_live_group(). The latter makes the csets ignored by * the migration path. 
*/ cgrp->self.flags &= ~CSS_ONLINE; @@ -5947,7 +5875,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) parent->nr_threaded_children--; spin_lock_irq(&css_set_lock); - for (tcgrp = parent; tcgrp; tcgrp = cgroup_parent(tcgrp)) { + for (tcgrp = cgroup_parent(cgrp); tcgrp; tcgrp = cgroup_parent(tcgrp)) { tcgrp->nr_descendants--; tcgrp->nr_dying_descendants++; /* @@ -6140,8 +6068,8 @@ int __init cgroup_init(void) continue; if (cgroup1_ssid_disabled(ssid)) - pr_info("Disabling %s control group subsystem in v1 mounts\n", - ss->legacy_name); + printk(KERN_INFO "Disabling %s control group subsystem in v1 mounts\n", + ss->name); cgrp_dfl_root.subsys_mask |= 1 << ss->id; @@ -6273,7 +6201,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, if (!buf) goto out; - rcu_read_lock(); + cgroup_lock(); spin_lock_irq(&css_set_lock); for_each_root(root) { @@ -6284,11 +6212,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, if (root == &cgrp_dfl_root && !READ_ONCE(cgrp_dfl_visible)) continue; - cgrp = task_cgroup_from_root(tsk, root); - /* The root has already been unmounted. */ - if (!cgrp) - continue; - seq_printf(m, "%d:", root->hierarchy_id); if (root != &cgrp_dfl_root) for_each_subsys(ss, ssid) @@ -6299,6 +6222,9 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, seq_printf(m, "%sname=%s", count ? "," : "", root->name); seq_putc(m, ':'); + + cgrp = task_cgroup_from_root(tsk, root); + /* * On traditional hierarchies, all zombie tasks show up as * belonging to the root cgroup. On the default hierarchy, @@ -6330,7 +6256,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, retval = 0; out_unlock: spin_unlock_irq(&css_set_lock); - rcu_read_unlock(); + cgroup_unlock(); kfree(buf); out: return retval; diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 5ab59ee794ef..68685e6b7459 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2562,7 +2562,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, update_partition_sd_lb(cs, old_prs); out_free: free_cpumasks(NULL, &tmp); - return retval; + return 0; } /** @@ -2598,6 +2598,9 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs, if (cpumask_equal(cs->exclusive_cpus, trialcs->exclusive_cpus)) return 0; + if (alloc_cpumasks(NULL, &tmp)) + return -ENOMEM; + if (*buf) compute_effective_exclusive_cpumask(trialcs, NULL); @@ -2612,9 +2615,6 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs, if (retval) return retval; - if (alloc_cpumasks(NULL, &tmp)) - return -ENOMEM; - if (old_prs) { if (cpumask_empty(trialcs->effective_xcpus)) { invalidate = true; @@ -4379,16 +4379,17 @@ hotplug_update_tasks_legacy(struct cpuset *cs, is_empty = cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed); + mutex_unlock(&cpuset_mutex); + /* * Move tasks to the nearest ancestor with execution resources, * This is full cgroup operation which will also call back into * cpuset. Should be done outside any lock. */ - if (is_empty) { - mutex_unlock(&cpuset_mutex); + if (is_empty) remove_tasks_in_empty_cpuset(cs); - mutex_lock(&cpuset_mutex); - } + + mutex_lock(&cpuset_mutex); } static void @@ -4558,7 +4559,6 @@ unlock: /** * cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset - * @work: unused * * This function is called after either CPU or memory configuration has * changed and updates cpuset accordingly. 
The top_cpuset is always @@ -4941,7 +4941,6 @@ bool cpuset_node_allowed(int node, gfp_t gfp_mask) /** * cpuset_spread_node() - On which node to begin search for a page - * @rotor: round robin rotor * * If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for * tasks in a cpuset with is_spread_page or is_spread_slab set), diff --git a/kernel/cgroup/legacy_freezer.c b/kernel/cgroup/legacy_freezer.c index 66d1708042a7..08236798d173 100644 --- a/kernel/cgroup/legacy_freezer.c +++ b/kernel/cgroup/legacy_freezer.c @@ -22,7 +22,6 @@ #include #include #include -#include /* * A cgroup is freezing if any FREEZING flags are set. FREEZING_SELF is @@ -66,15 +65,9 @@ static struct freezer *parent_freezer(struct freezer *freezer) bool cgroup_freezing(struct task_struct *task) { bool ret; - unsigned int state; rcu_read_lock(); - /* Check if the cgroup is still FREEZING, but not FROZEN. The extra - * !FROZEN check is required, because the FREEZING bit is not cleared - * when the state FROZEN is reached. - */ - state = task_freezer(task)->state; - ret = (state & CGROUP_FREEZING) && !(state & CGROUP_FROZEN); + ret = task_freezer(task)->state & CGROUP_FREEZING; rcu_read_unlock(); return ret; @@ -114,18 +107,16 @@ static int freezer_css_online(struct cgroup_subsys_state *css) struct freezer *freezer = css_freezer(css); struct freezer *parent = parent_freezer(freezer); - cpus_read_lock(); mutex_lock(&freezer_mutex); freezer->state |= CGROUP_FREEZER_ONLINE; if (parent && (parent->state & CGROUP_FREEZING)) { freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN; - static_branch_inc_cpuslocked(&freezer_active); + atomic_inc(&system_freezing_cnt); } mutex_unlock(&freezer_mutex); - cpus_read_unlock(); return 0; } @@ -140,16 +131,14 @@ static void freezer_css_offline(struct cgroup_subsys_state *css) { struct freezer *freezer = css_freezer(css); - cpus_read_lock(); mutex_lock(&freezer_mutex); if (freezer->state & CGROUP_FREEZING) - static_branch_dec_cpuslocked(&freezer_active); + atomic_dec(&system_freezing_cnt); freezer->state = 0; mutex_unlock(&freezer_mutex); - cpus_read_unlock(); } static void freezer_css_free(struct cgroup_subsys_state *css) @@ -190,7 +179,6 @@ static void freezer_attach(struct cgroup_taskset *tset) __thaw_task(task); } else { freeze_task(task); - /* clear FROZEN and propagate upwards */ while (freezer && (freezer->state & CGROUP_FROZEN)) { freezer->state &= ~CGROUP_FROZEN; @@ -283,8 +271,16 @@ static void update_if_frozen(struct cgroup_subsys_state *css) css_task_iter_start(css, 0, &it); while ((task = css_task_iter_next(&it))) { - if (freezing(task) && !frozen(task)) - goto out_iter_end; + if (freezing(task)) { + /* + * freezer_should_skip() indicates that the task + * should be skipped when determining freezing + * completion. Consider it frozen in addition to + * the usual frozen condition. 
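
The cgroup_freezing() hunk above is the heart of the legacy-freezer change: with the extra test removed, the check no longer excludes cgroups that already reached CGROUP_FROZEN (the FREEZING bits are not cleared in that state, per the comment being deleted). A side-by-side sketch of the two predicates, using the state bits from kernel/cgroup/legacy_freezer.c; the helper names are made up::

    /* stricter variant: still freezing, FROZEN not reached yet */
    static inline bool demo_freezing_not_frozen(unsigned int state)
    {
            return (state & CGROUP_FREEZING) && !(state & CGROUP_FROZEN);
    }

    /* looser variant: any FREEZING bit set, even after FROZEN is reached */
    static inline bool demo_freezing_or_frozen(unsigned int state)
    {
            return state & CGROUP_FREEZING;
    }
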
+ */ + if (!frozen(task) && !freezer_should_skip(task)) + goto out_iter_end; + } } freezer->state |= CGROUP_FROZEN; @@ -361,7 +357,7 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze, if (freeze) { if (!(freezer->state & CGROUP_FREEZING)) - static_branch_inc_cpuslocked(&freezer_active); + atomic_inc(&system_freezing_cnt); freezer->state |= state; freeze_cgroup(freezer); } else { @@ -370,9 +366,9 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze, freezer->state &= ~state; if (!(freezer->state & CGROUP_FREEZING)) { - freezer->state &= ~CGROUP_FROZEN; if (was_freezing) - static_branch_dec_cpuslocked(&freezer_active); + atomic_dec(&system_freezing_cnt); + freezer->state &= ~CGROUP_FROZEN; unfreeze_cgroup(freezer); } } @@ -390,7 +386,6 @@ static void freezer_change_state(struct freezer *freezer, bool freeze) { struct cgroup_subsys_state *pos; - cpus_read_lock(); /* * Update all its descendants in pre-order traversal. Each * descendant will try to inherit its parent's FREEZING state as @@ -419,7 +414,6 @@ static void freezer_change_state(struct freezer *freezer, bool freeze) } rcu_read_unlock(); mutex_unlock(&freezer_mutex); - cpus_read_unlock(); } static ssize_t freezer_write(struct kernfs_open_file *of, diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 79a3717a5803..ae2f4dd47508 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -14,7 +14,7 @@ #include #define MAX_STR "max" -#define MAX_NUM U64_MAX +#define MAX_NUM ULONG_MAX /* Miscellaneous res name, keep it in sync with enum misc_res_type */ static const char *const misc_res_name[] = { @@ -37,7 +37,7 @@ static struct misc_cg root_cg; * more than the actual capacity. We are using Limits resource distribution * model of cgroup for miscellaneous controller. */ -static u64 misc_res_capacity[MISC_CG_RES_TYPES]; +static unsigned long misc_res_capacity[MISC_CG_RES_TYPES]; /** * parent_misc() - Get the parent of the passed misc cgroup. @@ -74,10 +74,10 @@ static inline bool valid_type(enum misc_res_type type) * Context: Any context. * Return: Current total usage of the resource. */ -u64 misc_cg_res_total_usage(enum misc_res_type type) +unsigned long misc_cg_res_total_usage(enum misc_res_type type) { if (valid_type(type)) - return atomic64_read(&root_cg.res[type].usage); + return atomic_long_read(&root_cg.res[type].usage); return 0; } @@ -95,7 +95,7 @@ EXPORT_SYMBOL_GPL(misc_cg_res_total_usage); * * %0 - Successfully registered the capacity. * * %-EINVAL - If @type is invalid. */ -int misc_cg_set_capacity(enum misc_res_type type, u64 capacity) +int misc_cg_set_capacity(enum misc_res_type type, unsigned long capacity) { if (!valid_type(type)) return -EINVAL; @@ -114,9 +114,9 @@ EXPORT_SYMBOL_GPL(misc_cg_set_capacity); * Context: Any context. */ static void misc_cg_cancel_charge(enum misc_res_type type, struct misc_cg *cg, - u64 amount) + unsigned long amount) { - WARN_ONCE(atomic64_add_negative(-amount, &cg->res[type].usage), + WARN_ONCE(atomic_long_add_negative(-amount, &cg->res[type].usage), "misc cgroup resource %s became less than 0", misc_res_name[type]); } @@ -137,12 +137,13 @@ static void misc_cg_cancel_charge(enum misc_res_type type, struct misc_cg *cg, * * -EBUSY - If max limit will be crossed or total usage will be more than the * capacity. 
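
Throughout these hunks the global "a freeze is in progress" indicator switches from a counting static key back to an atomic counter; the cpus_read_lock()/cpus_read_unlock() pairs go away with it because the *_cpuslocked static-branch helpers expect the caller to already hold that lock. A minimal sketch of the two styles (demo_* names are illustrative)::

    #include <linux/atomic.h>
    #include <linux/jump_label.h>

    /* (a) counting static key: the disabled case is a patched-out branch */
    DEFINE_STATIC_KEY_FALSE(demo_freezer_active);

    static inline bool demo_freezing_fast(void)
    {
            return static_branch_unlikely(&demo_freezer_active);
    }

    /* (b) atomic counter: always a memory load, but no text patching needed */
    static atomic_t demo_freezing_cnt = ATOMIC_INIT(0);

    static inline bool demo_freezing_counted(void)
    {
            return atomic_read(&demo_freezing_cnt) > 0;
    }
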
*/ -int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) +int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, + unsigned long amount) { struct misc_cg *i, *j; int ret; struct misc_res *res; - u64 new_usage; + int new_usage; if (!(valid_type(type) && cg && READ_ONCE(misc_res_capacity[type]))) return -EINVAL; @@ -153,7 +154,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) for (i = cg; i; i = parent_misc(i)) { res = &i->res[type]; - new_usage = atomic64_add_return(amount, &res->usage); + new_usage = atomic_long_add_return(amount, &res->usage); if (new_usage > READ_ONCE(res->max) || new_usage > READ_ONCE(misc_res_capacity[type])) { ret = -EBUSY; @@ -164,7 +165,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) err_charge: for (j = i; j; j = parent_misc(j)) { - atomic64_inc(&j->res[type].events); + atomic_long_inc(&j->res[type].events); cgroup_file_notify(&j->events_file); } @@ -183,7 +184,8 @@ EXPORT_SYMBOL_GPL(misc_cg_try_charge); * * Context: Any context. */ -void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount) +void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, + unsigned long amount) { struct misc_cg *i; @@ -207,7 +209,7 @@ static int misc_cg_max_show(struct seq_file *sf, void *v) { int i; struct misc_cg *cg = css_misc(seq_css(sf)); - u64 max; + unsigned long max; for (i = 0; i < MISC_CG_RES_TYPES; i++) { if (READ_ONCE(misc_res_capacity[i])) { @@ -215,7 +217,7 @@ static int misc_cg_max_show(struct seq_file *sf, void *v) if (max == MAX_NUM) seq_printf(sf, "%s max\n", misc_res_name[i]); else - seq_printf(sf, "%s %llu\n", misc_res_name[i], + seq_printf(sf, "%s %lu\n", misc_res_name[i], max); } } @@ -239,13 +241,13 @@ static int misc_cg_max_show(struct seq_file *sf, void *v) * Return: * * >= 0 - Number of bytes processed in the input. * * -EINVAL - If buf is not valid. - * * -ERANGE - If number is bigger than the u64 capacity. + * * -ERANGE - If number is bigger than the unsigned long capacity. 
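
misc_cg_try_charge() above charges every ancestor and unwinds on failure. The same pattern, reduced to its skeleton with placeholder types (struct demo_node and its fields are not kernel API)::

    #include <linux/atomic.h>
    #include <linux/errno.h>

    struct demo_node {
            struct demo_node *parent;
            atomic_long_t usage;
            long max;
    };

    static int demo_charge_hierarchy(struct demo_node *leaf, long amount)
    {
            struct demo_node *i, *j;

            for (i = leaf; i; i = i->parent) {
                    if (atomic_long_add_return(amount, &i->usage) > i->max)
                            goto undo;
            }
            return 0;

    undo:
            /* roll back every level charged so far, including the one that failed */
            for (j = leaf; j != i->parent; j = j->parent)
                    atomic_long_sub(amount, &j->usage);
            return -EBUSY;
    }
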
*/ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct misc_cg *cg; - u64 max; + unsigned long max; int ret = 0, i; enum misc_res_type type = MISC_CG_RES_TYPES; char *token; @@ -269,7 +271,7 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf, if (!strcmp(MAX_STR, buf)) { max = MAX_NUM; } else { - ret = kstrtou64(buf, 0, &max); + ret = kstrtoul(buf, 0, &max); if (ret) return ret; } @@ -295,13 +297,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf, static int misc_cg_current_show(struct seq_file *sf, void *v) { int i; - u64 usage; + unsigned long usage; struct misc_cg *cg = css_misc(seq_css(sf)); for (i = 0; i < MISC_CG_RES_TYPES; i++) { - usage = atomic64_read(&cg->res[i].usage); + usage = atomic_long_read(&cg->res[i].usage); if (READ_ONCE(misc_res_capacity[i]) || usage) - seq_printf(sf, "%s %llu\n", misc_res_name[i], usage); + seq_printf(sf, "%s %lu\n", misc_res_name[i], usage); } return 0; @@ -320,12 +322,12 @@ static int misc_cg_current_show(struct seq_file *sf, void *v) static int misc_cg_capacity_show(struct seq_file *sf, void *v) { int i; - u64 cap; + unsigned long cap; for (i = 0; i < MISC_CG_RES_TYPES; i++) { cap = READ_ONCE(misc_res_capacity[i]); if (cap) - seq_printf(sf, "%s %llu\n", misc_res_name[i], cap); + seq_printf(sf, "%s %lu\n", misc_res_name[i], cap); } return 0; @@ -334,13 +336,12 @@ static int misc_cg_capacity_show(struct seq_file *sf, void *v) static int misc_events_show(struct seq_file *sf, void *v) { struct misc_cg *cg = css_misc(seq_css(sf)); - u64 events; - int i; + unsigned long events, i; for (i = 0; i < MISC_CG_RES_TYPES; i++) { - events = atomic64_read(&cg->res[i].events); + events = atomic_long_read(&cg->res[i].events); if (READ_ONCE(misc_res_capacity[i]) || events) - seq_printf(sf, "%s.max %llu\n", misc_res_name[i], events); + seq_printf(sf, "%s.max %lu\n", misc_res_name[i], events); } return 0; } @@ -396,7 +397,7 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css) for (i = 0; i < MISC_CG_RES_TYPES; i++) { WRITE_ONCE(cg->res[i].max, MAX_NUM); - atomic64_set(&cg->res[i].usage, 0); + atomic_long_set(&cg->res[i].usage, 0); } return &cg->css; diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c index 144a464e45c6..0d5c29879a50 100644 --- a/kernel/cgroup/namespace.c +++ b/kernel/cgroup/namespace.c @@ -149,3 +149,9 @@ const struct proc_ns_operations cgroupns_operations = { .install = cgroupns_install, .owner = cgroupns_owner, }; + +static __init int cgroup_namespaces_init(void) +{ + return 0; +} +subsys_initcall(cgroup_namespaces_init); diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index 8bb980a6efc0..ab9bb36492dd 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -399,7 +399,6 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu) { struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); struct cgroup *parent = cgroup_parent(cgrp); - struct cgroup_rstat_cpu *prstatc; struct cgroup_base_stat delta; unsigned seq; @@ -413,24 +412,17 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu) delta = rstatc->bstat; } while (__u64_stats_fetch_retry(&rstatc->bsync, seq)); - /* propagate per-cpu delta to cgroup and per-cpu global statistics */ + /* propagate percpu delta to global */ cgroup_base_stat_sub(&delta, &rstatc->last_bstat); cgroup_base_stat_add(&cgrp->bstat, &delta); cgroup_base_stat_add(&rstatc->last_bstat, &delta); - cgroup_base_stat_add(&rstatc->subtree_bstat, &delta); - /* 
propagate cgroup and per-cpu global delta to parent (unless that's root) */ + /* propagate global delta to parent (unless that's root) */ if (cgroup_parent(parent)) { delta = cgrp->bstat; cgroup_base_stat_sub(&delta, &cgrp->last_bstat); cgroup_base_stat_add(&parent->bstat, &delta); cgroup_base_stat_add(&cgrp->last_bstat, &delta); - - delta = rstatc->subtree_bstat; - prstatc = cgroup_rstat_cpu(parent, cpu); - cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat); - cgroup_base_stat_add(&prstatc->subtree_bstat, &delta); - cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta); } } diff --git a/kernel/exit.c b/kernel/exit.c index dd0d85418186..4b998adb36ac 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -405,10 +405,10 @@ static void coredump_task_exit(struct task_struct *tsk) complete(&core_state->startup); for (;;) { - set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE); + set_current_state(TASK_UNINTERRUPTIBLE); if (!self.task) /* see coredump_finish() */ break; - schedule(); + freezable_schedule(); } __set_current_state(TASK_RUNNING); } diff --git a/kernel/fork.c b/kernel/fork.c index 288495b0d4ba..3bf3bbf41fde 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1527,12 +1527,13 @@ static void complete_vfork_done(struct task_struct *tsk) static int wait_for_vfork_done(struct task_struct *child, struct completion *vfork) { - unsigned int state = TASK_UNINTERRUPTIBLE|TASK_KILLABLE|TASK_FREEZABLE; int killed; + freezer_do_not_count(); cgroup_enter_frozen(); - killed = wait_for_completion_state(vfork, state); + killed = wait_for_completion_killable(vfork); cgroup_leave_frozen(false); + freezer_count(); if (killed) { task_lock(child); diff --git a/kernel/freezer.c b/kernel/freezer.c index 4fad0e6fca64..45ab36ffd0e7 100644 --- a/kernel/freezer.c +++ b/kernel/freezer.c @@ -13,11 +13,10 @@ #include /* total number of freezing conditions in effect */ -DEFINE_STATIC_KEY_FALSE(freezer_active); -EXPORT_SYMBOL(freezer_active); +atomic_t system_freezing_cnt = ATOMIC_INIT(0); +EXPORT_SYMBOL(system_freezing_cnt); -/* - * indicate whether PM freezing is in effect, protected by +/* indicate whether PM freezing is in effect, protected by * system_transition_mutex */ bool pm_freezing; @@ -30,7 +29,7 @@ static DEFINE_SPINLOCK(freezer_lock); * freezing_slow_path - slow path for testing whether a task needs to be frozen * @p: task to be tested * - * This function is called by freezing() if freezer_active isn't zero + * This function is called by freezing() if system_freezing_cnt isn't zero * and tests whether @p needs to enter and stay in frozen state. Can be * called under any context. The freezers are responsible for ensuring the * target tasks see the updated state. @@ -53,40 +52,41 @@ bool freezing_slow_path(struct task_struct *p) } EXPORT_SYMBOL(freezing_slow_path); -bool frozen(struct task_struct *p) -{ - return READ_ONCE(p->__state) & TASK_FROZEN; -} - /* Refrigerator is place where frozen processes are stored :-). */ bool __refrigerator(bool check_kthr_stop) { - unsigned int state = get_current_state(); + /* Hmm, should we be allowed to suspend when there are realtime + processes around? 
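
The coredump_task_exit() hunk above shows the pattern repeated across this patch: one scheme marks the sleep freezable through the task state (TASK_FREEZABLE) and uses plain schedule(), the other keeps a plain state and routes the sleep through freezable_schedule(). A reduced sketch of both, assuming a caller-provided condition flag; which variant compiles depends on the kernel version::

    #include <linux/freezer.h>
    #include <linux/sched.h>

    /* (a) older API: plain state, freezer-aware schedule helper */
    static void demo_wait_old(bool *cond)
    {
            for (;;) {
                    set_current_state(TASK_UNINTERRUPTIBLE);
                    if (READ_ONCE(*cond))
                            break;
                    freezable_schedule();
            }
            __set_current_state(TASK_RUNNING);
    }

    /* (b) newer API: freezability encoded in the task state itself */
    static void demo_wait_new(bool *cond)
    {
            for (;;) {
                    set_current_state(TASK_UNINTERRUPTIBLE | TASK_FREEZABLE);
                    if (READ_ONCE(*cond))
                            break;
                    schedule();
            }
            __set_current_state(TASK_RUNNING);
    }
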
*/ bool was_frozen = false; + unsigned int save = get_current_state(); pr_debug("%s entered refrigerator\n", current->comm); - WARN_ON_ONCE(state && !(state & TASK_NORMAL)); - for (;;) { - bool freeze; - - set_current_state(TASK_FROZEN); + set_current_state(TASK_UNINTERRUPTIBLE); spin_lock_irq(&freezer_lock); - freeze = freezing(current) && !(check_kthr_stop && kthread_should_stop()); + current->flags |= PF_FROZEN; + if (!freezing(current) || + (check_kthr_stop && kthread_should_stop())) + current->flags &= ~PF_FROZEN; spin_unlock_irq(&freezer_lock); - if (!freeze) + if (!(current->flags & PF_FROZEN)) break; - was_frozen = true; schedule(); } - __set_current_state(TASK_RUNNING); pr_debug("%s left refrigerator\n", current->comm); + /* + * Restore saved task state before returning. The mb'd version + * needs to be used; otherwise, it might silently break + * synchronization which depends on ordered task state change. + */ + set_current_state(save); + return was_frozen; } EXPORT_SYMBOL(__refrigerator); @@ -101,44 +101,6 @@ static void fake_signal_wake_up(struct task_struct *p) } } -static int __set_task_frozen(struct task_struct *p, void *arg) -{ - unsigned int state = READ_ONCE(p->__state); - - if (p->on_rq) - return 0; - - if (p != current && task_curr(p)) - return 0; - - if (!(state & (TASK_FREEZABLE | __TASK_STOPPED | __TASK_TRACED))) - return 0; - - /* - * Only TASK_NORMAL can be augmented with TASK_FREEZABLE, since they - * can suffer spurious wakeups. - */ - if (state & TASK_FREEZABLE) - WARN_ON_ONCE(!(state & TASK_NORMAL)); - -#ifdef CONFIG_LOCKDEP - /* - * It's dangerous to freeze with locks held; there be dragons there. - */ - if (!(state & __TASK_FREEZABLE_UNSAFE)) - WARN_ON_ONCE(debug_locks && p->lockdep_depth); -#endif - - WRITE_ONCE(p->__state, TASK_FROZEN); - return TASK_FROZEN; -} - -static bool __freeze_task(struct task_struct *p) -{ - /* TASK_FREEZABLE|TASK_STOPPED|TASK_TRACED -> TASK_FROZEN */ - return task_call_func(p, __set_task_frozen, NULL); -} - /** * freeze_task - send a freeze request to given task * @p: task to send the request to @@ -154,8 +116,20 @@ bool freeze_task(struct task_struct *p) { unsigned long flags; + /* + * This check can race with freezer_do_not_count, but worst case that + * will result in an extra wakeup being sent to the task. It does not + * race with freezer_count(), the barriers in freezer_count() and + * freezer_should_skip() ensure that either freezer_count() sees + * freezing == true in try_to_freeze() and freezes, or + * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task + * normally. + */ + if (freezer_should_skip(p)) + return false; + spin_lock_irqsave(&freezer_lock, flags); - if (!freezing(p) || frozen(p) || __freeze_task(p)) { + if (!freezing(p) || frozen(p)) { spin_unlock_irqrestore(&freezer_lock, flags); return false; } @@ -163,52 +137,19 @@ bool freeze_task(struct task_struct *p) if (!(p->flags & PF_KTHREAD)) fake_signal_wake_up(p); else - wake_up_state(p, TASK_NORMAL); + wake_up_state(p, TASK_INTERRUPTIBLE); spin_unlock_irqrestore(&freezer_lock, flags); return true; } -/* - * The special task states (TASK_STOPPED, TASK_TRACED) keep their canonical - * state in p->jobctl. If either of them got a wakeup that was missed because - * TASK_FROZEN, then their canonical state reflects that and the below will - * refuse to restore the special state and instead issue the wakeup. 
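
With PF_FROZEN and system_freezing_cnt restored above, the corresponding fast-path helpers look roughly like the following simplified sketch (the demo_ prefix is used because the real definitions live in include/linux/freezer.h)::

    static inline bool demo_frozen(struct task_struct *p)
    {
            return p->flags & PF_FROZEN;
    }

    static inline bool demo_freezing(struct task_struct *p)
    {
            /* cheap global check first, per-task conditions in the slow path */
            if (likely(!atomic_read(&system_freezing_cnt)))
                    return false;
            return freezing_slow_path(p);
    }
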
- */ -static int __set_task_special(struct task_struct *p, void *arg) -{ - unsigned int state = 0; - - if (p->jobctl & JOBCTL_TRACED) - state = TASK_TRACED; - - else if (p->jobctl & JOBCTL_STOPPED) - state = TASK_STOPPED; - - if (state) - WRITE_ONCE(p->__state, state); - - return state; -} - void __thaw_task(struct task_struct *p) { - unsigned long flags, flags2; + unsigned long flags; spin_lock_irqsave(&freezer_lock, flags); - if (WARN_ON_ONCE(freezing(p))) - goto unlock; - - if (lock_task_sighand(p, &flags2)) { - /* TASK_FROZEN -> TASK_{STOPPED,TRACED} */ - bool ret = task_call_func(p, __set_task_special, NULL); - unlock_task_sighand(p, &flags2); - if (ret) - goto unlock; - } - - wake_up_state(p, TASK_FROZEN); -unlock: + if (frozen(p)) + wake_up_process(p); spin_unlock_irqrestore(&freezer_lock, flags); } diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c index 61b112897a84..f2cf62591a9d 100644 --- a/kernel/futex/waitwake.c +++ b/kernel/futex/waitwake.c @@ -348,7 +348,7 @@ void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q, * futex_queue() calls spin_unlock() upon completion, both serializing * access to the hash list and forcing another memory barrier. */ - set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); + set_current_state(TASK_INTERRUPTIBLE); futex_queue(q, hb); /* Arm the timer */ @@ -366,7 +366,7 @@ void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q, * is no timeout, or if it has yet to expire. */ if (!timeout || timeout->task) - schedule(); + freezable_schedule(); } __set_current_state(TASK_RUNNING); } @@ -444,7 +444,7 @@ retry: return ret; } - set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); + set_current_state(TASK_INTERRUPTIBLE); for (i = 0; i < count; i++) { u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr; @@ -518,7 +518,7 @@ static void futex_sleep_multiple(struct futex_vector *vs, unsigned int count, return; } - schedule(); + freezable_schedule(); } /** diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 72aafd127d05..52501e5f7655 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -95,8 +95,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) * Ensure the task is not frozen. * Also, skip vfork and any other user process that freezer should skip. 
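
Several checks in this area key off how the sleep states are composed: TASK_KILLABLE is TASK_UNINTERRUPTIBLE plus TASK_WAKEKILL, and TASK_IDLE is TASK_UNINTERRUPTIBLE plus TASK_NOLOAD, which is what lets the hung-task and state-filter code single out plain uninterruptible sleeps. Illustrative helper only::

    #include <linux/sched.h>

    /* true only for a plain D-state sleep, not killable and not idle */
    static inline bool demo_plain_uninterruptible(unsigned int state)
    {
            return (state & TASK_UNINTERRUPTIBLE) &&
                   !(state & TASK_WAKEKILL) &&
                   !(state & TASK_NOLOAD);
    }
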
*/ - if (unlikely(READ_ONCE(t->__state) & TASK_FROZEN)) - return; + if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP))) + return; /* * When a freshly created task is scheduled once, changes its state to @@ -191,8 +191,6 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) hung_task_show_lock = false; rcu_read_lock(); for_each_process_thread(g, t) { - unsigned int state; - if (!max_count--) goto unlock; if (time_after(jiffies, last_break + HUNG_TASK_LOCK_BREAK)) { @@ -200,14 +198,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) goto unlock; last_break = jiffies; } - /* - * skip the TASK_KILLABLE tasks -- these can be killed - * skip the TASK_IDLE tasks -- those are genuinely idle - */ - state = READ_ONCE(t->__state); - if ((state & TASK_UNINTERRUPTIBLE) && - !(state & TASK_WAKEKILL) && - !(state & TASK_NOLOAD)) + /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ + if (READ_ONCE(t->__state) == TASK_UNINTERRUPTIBLE) check_hung_task(t, timeout); } unlock: diff --git a/kernel/power/main.c b/kernel/power/main.c index d7dc8b99644a..3821e5d5eade 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -25,7 +25,7 @@ unsigned int lock_system_sleep(void) { unsigned int flags = current->flags; - current->flags |= PF_NOFREEZE; + current->flags |= PF_FREEZER_SKIP; mutex_lock(&system_transition_mutex); return flags; } @@ -33,8 +33,24 @@ EXPORT_SYMBOL_GPL(lock_system_sleep); void unlock_system_sleep(unsigned int flags) { - if (!(flags & PF_NOFREEZE)) - current->flags &= ~PF_NOFREEZE; + /* + * Don't use freezer_count() because we don't want the call to + * try_to_freeze() here. + * + * Reason: + * Fundamentally, we just don't need it, because freezing condition + * doesn't come into effect until we release the + * system_transition_mutex lock, since the freezer always works with + * system_transition_mutex held. + * + * More importantly, in the case of hibernation, + * unlock_system_sleep() gets called in snapshot_read() and + * snapshot_write() when the freezing condition is still in effect. + * Which means, if we use try_to_freeze() here, it would make them + * enter the refrigerator, thus causing hibernation to lockup. + */ + if (!(flags & PF_FREEZER_SKIP)) + current->flags &= ~PF_FREEZER_SKIP; mutex_unlock(&system_transition_mutex); } EXPORT_SYMBOL_GPL(unlock_system_sleep); diff --git a/kernel/power/process.c b/kernel/power/process.c index 190b9f0bb6f1..11b570fcf049 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -53,7 +53,8 @@ static int try_to_freeze_tasks(bool user_only) if (p == current || !freeze_task(p)) continue; - todo++; + if (!freezer_should_skip(p)) + todo++; } read_unlock(&tasklist_lock); @@ -98,7 +99,8 @@ static int try_to_freeze_tasks(bool user_only) if (!wakeup || pm_debug_messages_on) { read_lock(&tasklist_lock); for_each_process_thread(g, p) { - if (p != current && freezing(p) && !frozen(p)) + if (p != current && !freezer_should_skip(p) + && freezing(p) && !frozen(p)) sched_show_task(p); } read_unlock(&tasklist_lock); @@ -130,7 +132,7 @@ int freeze_processes(void) current->flags |= PF_SUSPEND_TASK; if (!pm_freezing) - static_branch_inc(&freezer_active); + atomic_inc(&system_freezing_cnt); pm_wakeup_clear(0); pr_info("Freezing user space processes ... 
"); @@ -191,7 +193,7 @@ void thaw_processes(void) trace_suspend_resume(TPS("thaw_processes"), 0, true); if (pm_freezing) - static_branch_dec(&freezer_active); + atomic_dec(&system_freezing_cnt); pm_freezing = false; pm_nosig_freezing = false; diff --git a/kernel/ptrace.c b/kernel/ptrace.c index c295ec0d6bbc..09fbca50223d 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -269,7 +269,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) read_unlock(&tasklist_lock); if (!ret && !ignore_state && - WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED|TASK_FROZEN))) + WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED))) ret = -ESRCH; return ret; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6b77468992da..df129bb4d025 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4309,7 +4309,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in * __schedule(). See the comment for smp_mb__after_spinlock(). * - * A similar smp_rmb() lives in __task_needs_rq_lock(). + * A similar smb_rmb() lives in try_invoke_on_locked_down_task(). */ smp_rmb(); if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags)) @@ -4407,40 +4407,6 @@ out: return success; } -static bool __task_needs_rq_lock(struct task_struct *p) -{ - unsigned int state = READ_ONCE(p->__state); - - /* - * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when - * the task is blocked. Make sure to check @state since ttwu() can drop - * locks at the end, see ttwu_queue_wakelist(). - */ - if (state == TASK_RUNNING || state == TASK_WAKING) - return true; - - /* - * Ensure we load p->on_rq after p->__state, otherwise it would be - * possible to, falsely, observe p->on_rq == 0. - * - * See try_to_wake_up() for a longer comment. - */ - smp_rmb(); - if (p->on_rq) - return true; - -#ifdef CONFIG_SMP - /* - * Ensure the task has finished __schedule() and will not be referenced - * anymore. Again, see try_to_wake_up() for a longer comment. - */ - smp_rmb(); - smp_cond_load_acquire(&p->on_cpu, !VAL); -#endif - - return false; -} - /** * task_call_func - Invoke a function on task in fixed state * @p: Process for which the function is to be invoked, can be @current. @@ -4458,12 +4424,28 @@ static bool __task_needs_rq_lock(struct task_struct *p) int task_call_func(struct task_struct *p, task_call_f func, void *arg) { struct rq *rq = NULL; + unsigned int state; struct rq_flags rf; int ret; raw_spin_lock_irqsave(&p->pi_lock, rf.flags); - if (__task_needs_rq_lock(p)) + state = READ_ONCE(p->__state); + + /* + * Ensure we load p->on_rq after p->__state, otherwise it would be + * possible to, falsely, observe p->on_rq == 0. + * + * See try_to_wake_up() for a longer comment. + */ + smp_rmb(); + + /* + * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when + * the task is blocked. Make sure to check @state since ttwu() can drop + * locks at the end, see ttwu_queue_wakelist(). 
+ */ + if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq) rq = __task_rq_lock(p, &rf); /* @@ -6672,7 +6654,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) prev->sched_contributes_to_load = (prev_state & TASK_UNINTERRUPTIBLE) && !(prev_state & TASK_NOLOAD) && - !(prev_state & TASK_FROZEN); + !(prev->flags & PF_FROZEN); if (prev->sched_contributes_to_load) rq->nr_uninterruptible++; @@ -9288,7 +9270,7 @@ state_filter_match(unsigned long state_filter, struct task_struct *p) * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows * TASK_KILLABLE). */ - if (state_filter == TASK_UNINTERRUPTIBLE && (state & TASK_NOLOAD)) + if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE) return false; return true; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 37a0eada47d4..82dca39cb91c 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -159,6 +159,7 @@ __setup("psi=", setup_psi); #define EXP_300s 2034 /* 1/exp(2s/300s) */ /* PSI trigger definitions */ +#define WINDOW_MIN_US 500000 /* Min window size is 500ms */ #define WINDOW_MAX_US 10000000 /* Max window size is 10s */ #define UPDATES_PER_WINDOW 10 /* 10 updates per window */ @@ -179,7 +180,6 @@ static void group_init(struct psi_group *group) { int cpu; - group->enabled = true; for_each_possible_cpu(cpu) seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq); group->avg_last_update = sched_clock(); @@ -483,12 +483,8 @@ static void update_triggers(struct psi_group *group, u64 now, continue; /* Generate an event */ - if (cmpxchg(&t->event, 0, 1) == 0) { - if (t->of) - kernfs_notify(t->of->kn); - else - wake_up_interruptible(&t->event_wait); - } + if (cmpxchg(&t->event, 0, 1) == 0) + wake_up_interruptible(&t->event_wait); t->last_event_time = now; /* Reset threshold breach flag once event got generated */ t->pending_event = false; @@ -775,16 +771,17 @@ static void psi_group_change(struct psi_group *group, int cpu, groupc = per_cpu_ptr(group->pcpu, cpu); /* - * First we update the task counts according to the state - * change requested through the @clear and @set bits. - * - * Then if the cgroup PSI stats accounting enabled, we - * assess the aggregate resource states this CPU's tasks - * have been in since the last change, and account any + * First we assess the aggregate resource states this CPU's + * tasks have been in since the last change, and account any * SOME and FULL time these may have resulted in. + * + * Then we update the task counts according to the state + * change requested through the @clear and @set bits. */ write_seqcount_begin(&groupc->seq); + record_times(groupc, now); + /* * Start with TSK_ONCPU, which doesn't have a corresponding * task count - it's just a boolean flag directly encoded in @@ -823,23 +820,6 @@ static void psi_group_change(struct psi_group *group, int cpu, if (set & (1 << t)) groupc->tasks[t]++; - if (!group->enabled) { - /* - * On the first group change after disabling PSI, conclude - * the current state and flush its time. This is unlikely - * to matter to the user, but aggregation (get_recent_times) - * may have already incorporated the live state into times_prev; - * avoid a delta sample underflow when PSI is later re-enabled. 
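
task_call_func(), whose rq-lock decision is inlined again above, invokes a callback on a task whose state is pinned by p->pi_lock (and the rq lock when needed), so the callback must stay short and must not sleep. A hypothetical caller might look like this::

    #include <linux/sched.h>

    static int demo_read_state(struct task_struct *p, void *arg)
    {
            *(unsigned int *)arg = READ_ONCE(p->__state);
            return 0;
    }

    static unsigned int demo_task_state(struct task_struct *p)
    {
            unsigned int state = 0;

            task_call_func(p, demo_read_state, &state);
            return state;
    }
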
- */ - if (unlikely(groupc->state_mask & (1 << PSI_NONIDLE))) - record_times(groupc, now); - - groupc->state_mask = state_mask; - - write_seqcount_end(&groupc->seq); - return; - } - for (s = 0; s < NR_PSI_STATES; s++) { if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU)) state_mask |= (1 << s); @@ -856,8 +836,6 @@ static void psi_group_change(struct psi_group *group, int cpu, if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall)) state_mask |= (1 << PSI_MEM_FULL); - record_times(groupc, now); - groupc->state_mask = state_mask; write_seqcount_end(&groupc->seq); @@ -1007,9 +985,6 @@ void psi_account_irqtime(struct task_struct *task, u32 delta) group = task_psi_group(task); do { - if (!group->enabled) - continue; - groupc = per_cpu_ptr(group->pcpu, cpu); write_seqcount_begin(&groupc->seq); @@ -1185,40 +1160,6 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) task_rq_unlock(rq, task, &rf); } - -void psi_cgroup_restart(struct psi_group *group) -{ - int cpu; - - /* - * After we disable psi_group->enabled, we don't actually - * stop percpu tasks accounting in each psi_group_cpu, - * instead only stop test_state() loop, record_times() - * and averaging worker, see psi_group_change() for details. - * - * When disable cgroup PSI, this function has nothing to sync - * since cgroup pressure files are hidden and percpu psi_group_cpu - * would see !psi_group->enabled and only do task accounting. - * - * When re-enable cgroup PSI, this function use psi_group_change() - * to get correct state mask from test_state() loop on tasks[], - * and restart groupc->state_start from now, use .clear = .set = 0 - * here since no task status really changed. - */ - if (!group->enabled) - return; - - for_each_possible_cpu(cpu) { - struct rq *rq = cpu_rq(cpu); - struct rq_flags rf; - u64 now; - - rq_lock_irq(rq, &rf); - now = cpu_clock(cpu); - psi_group_change(group, cpu, 0, 0, now, true); - rq_unlock_irq(rq, &rf); - } -} #endif /* CONFIG_CGROUPS */ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) @@ -1296,9 +1237,8 @@ static int psi_cpu_open(struct inode *inode, struct file *file) return single_open(file, psi_cpu_show, NULL); } -struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf, - enum psi_res res, struct file *file, - struct kernfs_open_file *of) +struct psi_trigger *psi_trigger_create(struct psi_group *group, + char *buf, enum psi_res res, struct file *file) { struct psi_trigger *t; enum psi_states state; @@ -1330,7 +1270,8 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf, if (state >= PSI_NONIDLE) return ERR_PTR(-EINVAL); - if (window_us == 0 || window_us > WINDOW_MAX_US) + if (window_us < WINDOW_MIN_US || + window_us > WINDOW_MAX_US) return ERR_PTR(-EINVAL); /* @@ -1356,9 +1297,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf, t->event = 0; t->last_event_time = 0; - t->of = of; - if (!of) - init_waitqueue_head(&t->event_wait); + init_waitqueue_head(&t->event_wait); t->pending_event = false; t->aggregator = privileged ? PSI_POLL : PSI_AVGS; @@ -1415,10 +1354,7 @@ void psi_trigger_destroy(struct psi_trigger *t) * being accessed later. Can happen if cgroup is deleted from under a * polling process. 
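
The trigger code above implements the user-visible PSI monitor interface: a trigger is written as "some|full <stall usecs> <window usecs>" (with the 500ms minimum window restored here) and the file is then polled for POLLPRI events. A small stand-alone monitor on the system-wide file, closely following the documented usage; required privileges can vary by kernel::

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const char trig[] = "some 150000 1000000"; /* 150ms stall per 1s window */
            struct pollfd fds;
            int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);

            if (fd < 0)
                    return 1;
            if (write(fd, trig, strlen(trig) + 1) < 0) {
                    close(fd);
                    return 1;
            }
            fds.fd = fd;
            fds.events = POLLPRI;
            while (poll(&fds, 1, -1) >= 0) {
                    if (fds.revents & POLLERR)
                            break;  /* monitored object went away */
                    if (fds.revents & POLLPRI)
                            printf("memory pressure threshold crossed\n");
            }
            close(fd);
            return 0;
    }
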
*/ - if (t->of) - kernfs_notify(t->of->kn); - else - wake_up_interruptible(&t->event_wait); + wake_up_pollfree(&t->event_wait); if (t->aggregator == PSI_AVGS) { mutex_lock(&group->avgs_lock); @@ -1490,10 +1426,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr, if (!t) return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; - if (t->of) - kernfs_generic_poll(t->of, wait); - else - poll_wait(file, &t->event_wait, wait); + poll_wait(file, &t->event_wait, wait); if (cmpxchg(&t->event, 1, 0) == 1) ret |= EPOLLPRI; @@ -1532,7 +1465,7 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf, return -EBUSY; } - new = psi_trigger_create(&psi_system, buf, res, file, NULL); + new = psi_trigger_create(&psi_system, buf, res, file); if (IS_ERR(new)) { mutex_unlock(&seq->lock); return PTR_ERR(new); diff --git a/kernel/signal.c b/kernel/signal.c index 05fb5e178600..e66b2ad2a345 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2294,7 +2294,7 @@ static int ptrace_stop(int exit_code, int why, unsigned long message, cgroup_enter_frozen(); if (!IS_ENABLED(CONFIG_PREEMPT_RT)) preempt_enable_no_resched(); - schedule(); + freezable_schedule(); cgroup_leave_frozen(true); /* @@ -2463,7 +2463,7 @@ static bool do_signal_stop(int signr) /* Now we don't run again until woken by SIGCONT or SIGKILL */ cgroup_enter_frozen(); - schedule(); + freezable_schedule(); return true; } else { /* @@ -2538,11 +2538,11 @@ static void do_freezer_trap(void) * immediately (if there is a non-fatal signal pending), and * put the task into sleep. */ - __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); + __set_current_state(TASK_INTERRUPTIBLE); clear_thread_flag(TIF_SIGPENDING); spin_unlock_irq(¤t->sighand->siglock); cgroup_enter_frozen(); - schedule(); + freezable_schedule(); } static int ptrace_signal(int signr, kernel_siginfo_t *info, enum pid_type type) @@ -3587,9 +3587,9 @@ static int do_sigtimedwait(const sigset_t *which, kernel_siginfo_t *info, recalc_sigpending(); spin_unlock_irq(&tsk->sighand->siglock); - __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); - ret = schedule_hrtimeout_range(to, tsk->timer_slack_ns, - HRTIMER_MODE_REL); + __set_current_state(TASK_INTERRUPTIBLE); + ret = freezable_schedule_hrtimeout_range(to, tsk->timer_slack_ns, + HRTIMER_MODE_REL); spin_lock_irq(&tsk->sighand->siglock); __set_task_blocked(tsk, &tsk->real_blocked); sigemptyset(&tsk->real_blocked); diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index da1ea24a4397..101e40bf4dc1 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -2038,11 +2038,11 @@ static int __sched do_nanosleep(struct hrtimer_sleeper *t) struct restart_block *restart; do { - set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); + set_current_state(TASK_INTERRUPTIBLE); hrtimer_sleeper_start_expires(t, t->mode); if (likely(t->task)) - schedule(); + freezable_schedule(); hrtimer_cancel(&t->timer); t->mode = HRTIMER_MODE_ABS; diff --git a/kernel/umh.c b/kernel/umh.c index fbf872c624cb..8945eaf4c671 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -404,7 +404,6 @@ EXPORT_SYMBOL(call_usermodehelper_setup); */ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait) { - unsigned int state = TASK_UNINTERRUPTIBLE; DECLARE_COMPLETION_ONSTACK(done); int retval = 0; @@ -439,26 +438,23 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait) goto unlock; if (wait & UMH_FREEZABLE) - state |= TASK_FREEZABLE; + freezer_do_not_count(); if (wait & UMH_KILLABLE) { - retval = wait_for_completion_state(&done, state | 
TASK_KILLABLE); + retval = wait_for_completion_killable(&done); if (!retval) goto wait_done; /* umh_complete() will see NULL and free sub_info */ if (xchg(&sub_info->complete, NULL)) goto unlock; - - /* - * fallthrough; in case of -ERESTARTSYS now do uninterruptible - * wait_for_completion_state(). Since umh_complete() shall call - * complete() in a moment if xchg() above returned NULL, this - * uninterruptible wait_for_completion_state() will not block - * SIGKILL'ed processes for long. - */ + /* fallthrough, umh_complete() was already called */ } - wait_for_completion_state(&done, state); + + wait_for_completion(&done); + + if (wait & UMH_FREEZABLE) + freezer_count(); wait_done: retval = sub_info->retval; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 294ca9560ec6..86fae8cd25f2 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -826,8 +826,8 @@ static void khugepaged_alloc_sleep(void) DEFINE_WAIT(wait); add_wait_queue(&khugepaged_wait, &wait); - __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); - schedule_timeout(msecs_to_jiffies(khugepaged_alloc_sleep_millisecs)); + freezable_schedule_timeout_interruptible( + msecs_to_jiffies(khugepaged_alloc_sleep_millisecs)); remove_wait_queue(&khugepaged_wait, &wait); } diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c index 6debf4fd42d4..82a0b246f407 100644 --- a/net/sunrpc/sched.c +++ b/net/sunrpc/sched.c @@ -276,7 +276,7 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue); static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode) { - schedule(); + freezable_schedule_unsafe(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0; @@ -340,12 +340,14 @@ static int rpc_complete_task(struct rpc_task *task) * to enforce taking of the wq->lock and hence avoid races with * rpc_complete_task(). */ -int rpc_wait_for_completion_task(struct rpc_task *task) +int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *action) { + if (action == NULL) + action = rpc_wait_bit_killable; return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE, - rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); + action, TASK_KILLABLE); } -EXPORT_SYMBOL_GPL(rpc_wait_for_completion_task); +EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task); /* * Make an RPC task runnable. 
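
The call_usermodehelper_exec() hunk above shows the same two idioms for a killable, freezable completion wait, reduced here to helpers (demo_* names are illustrative, and which variant is available depends on the kernel version)::

    #include <linux/completion.h>
    #include <linux/freezer.h>
    #include <linux/sched.h>

    /* (a) older freezer API: exclude this sleeper from the freezer's accounting */
    static int demo_wait_done_old(struct completion *done)
    {
            int ret;

            freezer_do_not_count();
            ret = wait_for_completion_killable(done);
            freezer_count();
            return ret;
    }

    /* (b) newer state-based API: mark the sleep itself freezable */
    static int demo_wait_done_new(struct completion *done)
    {
            return wait_for_completion_state(done, TASK_KILLABLE | TASK_FREEZABLE);
    }
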
@@ -984,7 +986,7 @@ static void __rpc_execute(struct rpc_task *task)
 		trace_rpc_task_sync_sleep(task, task->tk_action);
 		status = out_of_line_wait_on_bit(&task->tk_runstate,
 				RPC_TASK_QUEUED, rpc_wait_bit_killable,
-				TASK_KILLABLE|TASK_FREEZABLE);
+				TASK_KILLABLE);
 		if (status < 0) {
 			/*
 			 * When a sync task receives a signal, it exits with
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index dc20a35cc7a6..ffa3c59ebbd9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2448,14 +2448,13 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 				    struct sk_buff *last, unsigned int last_len,
 				    bool freezable)
 {
-	unsigned int state = TASK_INTERRUPTIBLE | freezable * TASK_FREEZABLE;
 	struct sk_buff *tail;
 	DEFINE_WAIT(wait);
 
 	unix_state_lock(sk);
 
 	for (;;) {
-		prepare_to_wait(sk_sleep(sk), &wait, state);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		if (tail != last ||
@@ -2468,7 +2467,10 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		timeo = schedule_timeout(timeo);
+		if (freezable)
+			timeo = freezable_schedule_timeout(timeo);
+		else
+			timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))