@@ -1035,26 +1035,108 @@ static const char *scx_enable_state_str[] = {
 };
 
 /*
- * sched_ext_entity->ops_state
+ * Task Ownership State Machine (sched_ext_entity->ops_state)
  *
- * Used to track the task ownership between the SCX core and the BPF scheduler.
- * State transitions look as follows:
+ * The sched_ext core uses this state machine to track task ownership
+ * between the SCX core and the BPF scheduler. This allows the BPF
+ * scheduler to dispatch tasks without strict ordering requirements, while
+ * the SCX core safely rejects invalid dispatches.
  *
- * NONE -> QUEUEING -> QUEUED -> DISPATCHING
- *  ^         |                      |
- *  |         v                      v
- *  \--------------------------------/
+ * State Transitions
  *
- * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
- * sites for explanations on the conditions being waited upon and why they are
- * safe. Transitions out of them into NONE or QUEUED must store_release and the
- * waiters should load_acquire.
+ *   .------------> NONE (owned by SCX core)
+ *   |               |      ^
+ *   |       enqueue |      | direct dispatch
+ *   |               v      |
+ *   |           QUEUEING --'
+ *   |               |
+ *   |       enqueue |
+ *   |     completes |
+ *   |               v
+ *   |             QUEUED (owned by BPF scheduler)
+ *   |               |
+ *   |      dispatch |
+ *   |               |
+ *   |               v
+ *   |          DISPATCHING
+ *   |               |
+ *   |      dispatch |
+ *   |     completes |
+ *   `---------------'
  *
- * Tracking scx_ops_state enables sched_ext core to reliably determine whether
- * any given task can be dispatched by the BPF scheduler at all times and thus
- * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler
- * to try to dispatch any task anytime regardless of its state as the SCX core
- * can safely reject invalid dispatches.
+ * State Descriptions
+ *
+ * - %SCX_OPSS_NONE:
+ *   Task is owned by the SCX core. It's either on a run queue, running,
+ *   or being manipulated by the core scheduler. The BPF scheduler has no
+ *   claim on this task.
+ *
+ * - %SCX_OPSS_QUEUEING:
+ *   Transitional state while transferring a task from the SCX core to
+ *   the BPF scheduler. The task's rq lock is held during this state.
+ *   Since QUEUEING is both entered and exited under the rq lock, dequeue
+ *   can never observe this state (it would be a BUG). When finishing a
+ *   dispatch, if the task is still in %SCX_OPSS_QUEUEING the completion
+ *   path busy-waits for it to leave this state (via wait_ops_state())
+ *   before retrying.
+ *
+ * - %SCX_OPSS_QUEUED:
+ *   Task is owned by the BPF scheduler. It's on a DSQ (dispatch queue)
+ *   and the BPF scheduler is responsible for dispatching it. A QSEQ
+ *   (queue sequence number) is embedded in this state to detect
+ *   dispatch/dequeue races: if a task is dequeued and re-enqueued, the
+ *   QSEQ changes and any in-flight dispatch operations targeting the old
+ *   QSEQ are safely ignored.
+ *
+ * - %SCX_OPSS_DISPATCHING:
+ *   Transitional state while transferring a task from the BPF scheduler
+ *   back to the SCX core. This state indicates the BPF scheduler has
+ *   selected the task for execution. When dequeue needs to take the task
+ *   off a DSQ and it is still in %SCX_OPSS_DISPATCHING, the dequeue path
+ *   busy-waits for it to leave this state (via wait_ops_state()) before
+ *   proceeding. Exits to %SCX_OPSS_NONE when dispatch completes.
+ *
+ * Memory Ordering
+ *
+ * Transitions out of %SCX_OPSS_QUEUEING and %SCX_OPSS_DISPATCHING into
+ * %SCX_OPSS_NONE or %SCX_OPSS_QUEUED must use atomic_long_set_release()
+ * and waiters must use atomic_long_read_acquire(). This guarantees that
+ * all stores made before the release transition are visible to any
+ * waiter that observes the new state.
+ *
+ * Cross-CPU Task Migration
+ *
+ * When moving a task in the %SCX_OPSS_DISPATCHING state, we can't simply
+ * grab the target CPU's rq lock because a concurrent dequeue might be
+ * waiting on %SCX_OPSS_DISPATCHING while holding the source rq lock
+ * (deadlock).
+ *
+ * The sched_ext core uses a "lock dancing" protocol coordinated by
+ * p->scx.holding_cpu. When moving a task to a different rq:
+ *
+ * 1. Verify the task can be moved (CPU affinity, migration_disabled, etc.)
+ * 2. Set p->scx.holding_cpu to the current CPU
+ * 3. Set the task's state to %SCX_OPSS_NONE; dequeue waits only while
+ *    DISPATCHING is set, so clearing DISPATCHING first breaks the
+ *    circular wait (it is now safe to lock the rq we need)
+ * 4. Unlock the current CPU's rq
+ * 5. Lock src_rq (where the task currently lives)
+ * 6. Verify p->scx.holding_cpu == current CPU; if not, dequeue won the
+ *    race (dequeue clears holding_cpu to -1 when it takes the task) and
+ *    the migration is aborted
+ * 7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
+ *    into dst_rq's local DSQ (no lock swap needed)
+ * 8. Otherwise: call move_remote_task_to_local_dsq(), which releases
+ *    src_rq, locks dst_rq, and performs the deactivate/activate
+ *    migration cycle (dst_rq is held on return)
+ * 9. Unlock dst_rq and re-lock the current CPU's rq to restore
+ *    the lock state expected by the caller
+ *
+ * If any verification fails, abort the migration.
+ *
+ * This state tracking allows the BPF scheduler to try to dispatch any task
+ * at any time regardless of its state. The SCX core can safely
+ * reject/ignore invalid dispatches, simplifying the BPF scheduler
+ * implementation.
  */
 enum scx_ops_state {
 	SCX_OPSS_NONE,			/* owned by the SCX core */