
Commit 70f54f6

arighi and htejun authored and committed
sched_ext: Document task ownership state machine

The task ownership state machine in sched_ext is quite hard to follow from the code alone. The interaction of ownership states, memory-ordering rules and cross-CPU "lock dancing" makes the overall model subtle.

Extend the documentation next to scx_ops_state to provide a more structured and self-contained description of the state transitions and their synchronization rules. The new reference should make the code easier to reason about and maintain, and can help future contributors understand the overall task-ownership workflow.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1 parent 0927780 commit 70f54f6

1 file changed

Lines changed: 98 additions & 16 deletions

kernel/sched/ext_internal.h
@@ -1035,26 +1035,108 @@ static const char *scx_enable_state_str[] = {
 };
 
 /*
- * sched_ext_entity->ops_state
+ * Task Ownership State Machine (sched_ext_entity->ops_state)
  *
- * Used to track the task ownership between the SCX core and the BPF scheduler.
- * State transitions look as follows:
+ * The sched_ext core uses this state machine to track task ownership
+ * between the SCX core and the BPF scheduler. This allows the BPF
+ * scheduler to dispatch tasks without strict ordering requirements, while
+ * the SCX core safely rejects invalid dispatches.
  *
- * NONE -> QUEUEING -> QUEUED -> DISPATCHING
- *   ^              |                 |
- *   |              v                 v
- *   \-------------------------------/
+ * State Transitions
  *
- * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
- * sites for explanations on the conditions being waited upon and why they are
- * safe. Transitions out of them into NONE or QUEUED must store_release and the
- * waiters should load_acquire.
+ *   .------------> NONE (owned by SCX core)
+ *   |               |  ^
+ *   |       enqueue |  | direct dispatch
+ *   |               v  |
+ *   |           QUEUEING -------'
+ *   |               |
+ *   |       enqueue |
+ *   |     completes |
+ *   |               v
+ *   |            QUEUED (owned by BPF scheduler)
+ *   |               |
+ *   |      dispatch |
+ *   |               |
+ *   |               v
+ *   |          DISPATCHING
+ *   |               |
+ *   |      dispatch |
+ *   |     completes |
+ *   `---------------'
  *
- * Tracking scx_ops_state enables sched_ext core to reliably determine whether
- * any given task can be dispatched by the BPF scheduler at all times and thus
- * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler
- * to try to dispatch any task anytime regardless of its state as the SCX core
- * can safely reject invalid dispatches.
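The transition diagram above can also be read as a small table of legal arcs. Below is a minimal user-space sketch of that table; the enum values mirror the documented states, but `opss_valid_transition` is an illustrative helper (not kernel code), and dequeue paths that fall outside the diagram are deliberately omitted:

```c
#include <stdbool.h>

/* Ownership states as drawn in the diagram above. */
enum opss { OPSS_NONE, OPSS_QUEUEING, OPSS_QUEUED, OPSS_DISPATCHING };

/* Returns true iff @from -> @to is an arc in the diagram. */
static bool opss_valid_transition(enum opss from, enum opss to)
{
	switch (from) {
	case OPSS_NONE:		/* enqueue starts the handoff */
		return to == OPSS_QUEUEING;
	case OPSS_QUEUEING:	/* enqueue completes, or direct dispatch */
		return to == OPSS_QUEUED || to == OPSS_NONE;
	case OPSS_QUEUED:	/* BPF scheduler dispatches the task */
		return to == OPSS_DISPATCHING;
	case OPSS_DISPATCHING:	/* dispatch completes, back to the core */
		return to == OPSS_NONE;
	}
	return false;
}
```

This is only a model of the picture; the kernel encodes the state in an atomic word together with a sequence number, as the state descriptions below explain.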
+ * State Descriptions
+ *
+ * - %SCX_OPSS_NONE:
+ *   Task is owned by the SCX core. It's either on a run queue, running,
+ *   or being manipulated by the core scheduler. The BPF scheduler has no
+ *   claim on this task.
+ *
+ * - %SCX_OPSS_QUEUEING:
+ *   Transitional state while transferring a task from the SCX core to
+ *   the BPF scheduler. The task's rq lock is held during this state.
+ *   Since QUEUEING is both entered and exited under the rq lock, dequeue
+ *   can never observe this state (it would be a BUG). When finishing a
+ *   dispatch, if the task is still in %SCX_OPSS_QUEUEING the completion
+ *   path busy-waits for it to leave this state (via wait_ops_state())
+ *   before retrying.
+ *
+ * - %SCX_OPSS_QUEUED:
+ *   Task is owned by the BPF scheduler. It's on a DSQ (dispatch queue)
+ *   and the BPF scheduler is responsible for dispatching it. A QSEQ
+ *   (queue sequence number) is embedded in this state to detect
+ *   dispatch/dequeue races: if a task is dequeued and re-enqueued, the
+ *   QSEQ changes and any in-flight dispatch operations targeting the old
+ *   QSEQ are safely ignored.
+ *
+ * - %SCX_OPSS_DISPATCHING:
+ *   Transitional state while transferring a task from the BPF scheduler
+ *   back to the SCX core. This state indicates the BPF scheduler has
+ *   selected the task for execution. When dequeue needs to take the task
+ *   off a DSQ and it is still in %SCX_OPSS_DISPATCHING, the dequeue path
+ *   busy-waits for it to leave this state (via wait_ops_state()) before
+ *   proceeding. Exits to %SCX_OPSS_NONE when dispatch completes.
+ *
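The QSEQ mechanism described for %SCX_OPSS_QUEUED can be illustrated with a simple bit-packing scheme: keep the state in the low bits of one word and the sequence number in the rest. The exact layout below (2 state bits) and all helper names are assumptions for illustration only; see kernel/sched/ext.c for the real encoding:

```c
#include <stdbool.h>

enum opss { OPSS_NONE, OPSS_QUEUEING, OPSS_QUEUED, OPSS_DISPATCHING };

/* Hypothetical layout: low 2 bits = state, upper bits = QSEQ. */
#define OPSS_STATE_BITS	2UL
#define OPSS_STATE_MASK	((1UL << OPSS_STATE_BITS) - 1)

static unsigned long opss_pack(enum opss state, unsigned long qseq)
{
	return (qseq << OPSS_STATE_BITS) | (unsigned long)state;
}

static enum opss opss_state(unsigned long v)
{
	return (enum opss)(v & OPSS_STATE_MASK);
}

static unsigned long opss_qseq(unsigned long v)
{
	return v >> OPSS_STATE_BITS;
}

/*
 * A dispatch carrying a stale snapshot is ignored: if the task was
 * dequeued and re-enqueued in the meantime, the current QSEQ no longer
 * matches the one the dispatcher saw when it picked the task.
 */
static bool dispatch_is_stale(unsigned long cur, unsigned long snap)
{
	return opss_state(cur) != OPSS_QUEUED ||
	       opss_qseq(cur) != opss_qseq(snap);
}
```

Because the state and QSEQ live in one word, a single atomic compare-and-exchange can both claim the task and verify the sequence number at once.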
+ * Memory Ordering
+ *
+ * Transitions out of %SCX_OPSS_QUEUEING and %SCX_OPSS_DISPATCHING into
+ * %SCX_OPSS_NONE or %SCX_OPSS_QUEUED must use atomic_long_set_release()
+ * and waiters must use atomic_long_read_acquire(). This ensures proper
+ * synchronization between concurrent operations.
+ *
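The release/acquire pairing above can be modeled in user space with C11 atomics: the owner publishes its updates with a release store on ops_state, and the waiter spins with acquire loads, so everything written before the release is visible after the matching acquire. The struct and function names below are illustrative; the kernel uses atomic_long_set_release() and atomic_long_read_acquire():

```c
#include <stdatomic.h>

enum opss { OPSS_NONE, OPSS_QUEUEING, OPSS_QUEUED, OPSS_DISPATCHING };

struct task_sketch {
	_Atomic long ops_state;
	long payload;		/* data protected by the state handoff */
};

/* Owner side: publish @payload, then leave the transitional state. */
static void finish_queueing(struct task_sketch *t, long payload)
{
	t->payload = payload;
	atomic_store_explicit(&t->ops_state, OPSS_QUEUED,
			      memory_order_release);
}

/* Waiter side: spin until the transitional state is left. */
static long wait_ops_state_sketch(struct task_sketch *t)
{
	while (atomic_load_explicit(&t->ops_state, memory_order_acquire) ==
	       OPSS_QUEUEING)
		;	/* the kernel's wait_ops_state() also relaxes the CPU */
	return t->payload;	/* safe: ordered after the release store */
}
```

A relaxed store/load pair here would let the waiter observe the state change without the payload, which is exactly the bug the rule above rules out.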
+ * Cross-CPU Task Migration
+ *
+ * When moving a task in the %SCX_OPSS_DISPATCHING state, we can't simply
+ * grab the target CPU's rq lock because a concurrent dequeue might be
+ * waiting on %SCX_OPSS_DISPATCHING while holding the source rq lock
+ * (deadlock).
+ *
+ * The sched_ext core uses a "lock dancing" protocol coordinated by
+ * p->scx.holding_cpu. When moving a task to a different rq:
+ *
+ * 1. Verify the task can be moved (CPU affinity, migration_disabled, etc.)
+ * 2. Set p->scx.holding_cpu to the current CPU
+ * 3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
+ *    is set, so clearing DISPATCHING first prevents the circular wait
+ *    (safe to lock the rq we need)
+ * 4. Unlock the current CPU's rq
+ * 5. Lock src_rq (where the task currently lives)
+ * 6. Verify p->scx.holding_cpu == current CPU; if not, dequeue won the
+ *    race (dequeue clears holding_cpu to -1 when it takes the task) and
+ *    the migration is aborted
+ * 7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
+ *    into dst_rq's local DSQ (no lock swap needed)
+ * 8. Otherwise: call move_remote_task_to_local_dsq(), which releases
+ *    src_rq, locks dst_rq, and performs the deactivate/activate
+ *    migration cycle (dst_rq is held on return)
+ * 9. Unlock dst_rq and re-lock the current CPU's rq to restore
+ *    the lock state expected by the caller
+ *
+ * If any verification fails, abort the migration.
+ *
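The core of the lock dance (steps 2 through 6 above) can be sketched single-threaded with the rq locks modeled as plain booleans, just to make the handoff order visible. All names here (rq_sketch, task_sketch, lock_dance) are illustrative rather than the kernel's, and the concurrent-dequeue race only shows up as the holding_cpu re-check of step 6:

```c
#include <stdbool.h>

struct rq_sketch { bool locked; };

struct task_sketch {
	int holding_cpu;	/* set to -1 once dequeue claims the task */
};

/* Steps 2-6: drop our rq, take src_rq, then re-check task ownership. */
static bool lock_dance(struct task_sketch *p, int this_cpu,
		       struct rq_sketch *this_rq, struct rq_sketch *src_rq)
{
	p->holding_cpu = this_cpu;	/* step 2: stake our claim */
	/* step 3: the real code clears DISPATCHING here (release store) */
	this_rq->locked = false;	/* step 4: unlock current CPU's rq */
	src_rq->locked = true;		/* step 5: lock src_rq */
	if (p->holding_cpu != this_cpu) { /* step 6: dequeue won the race */
		src_rq->locked = false;
		return false;		/* migration aborted */
	}
	return true;	/* src_rq held; caller continues with steps 7-9 */
}
```

The point of the ordering is that no CPU ever waits on DISPATCHING while we hold a lock the waiter already owns: DISPATCHING is cleared before any new lock is taken, so the circular wait cannot form.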
+ * This state tracking allows the BPF scheduler to try to dispatch any task
+ * at any time regardless of its state. The SCX core can safely
+ * reject/ignore invalid dispatches, simplifying the BPF scheduler
+ * implementation.
  */
 enum scx_ops_state {
 	SCX_OPSS_NONE,		/* owned by the SCX core */
