Labels: kind/bug, area/scheduling
Problem
In the preempt action (pkg/scheduler/actions/preempt/preempt.go), when multiple queues are present, the "Preemption between Task within Job" (intra-job) section iterates over the shared underRequest slice and unconditionally overwrites preemptorTasks[job.UID] with a fresh empty PriorityQueue:
preemptorTasks[job.UID] = util.NewPriorityQueue(ssn.TaskOrderFn)
https://github.com/volcano-sh/volcano/blob/v1.14.1/pkg/scheduler/actions/preempt/preempt.go#L229
https://github.com/volcano-sh/volcano/blob/v1.14.1/pkg/scheduler/actions/preempt/preempt.go#L231
This underRequest slice contains starving jobs from all queues, not just the current queue being processed. When a queue with no relevant preemptors (e.g., q1) is processed first, the intra-job section still iterates starving jobs from other queues (e.g., pg3 in q2) and overwrites their already-populated preemptorTasks entries with empty queues. When that queue (q2) is processed later, between-jobs preemption sees preemptorTasks[job.UID].Empty() == true and skips valid preemption.
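The clobbering can be replayed in isolation. Below is a minimal, self-contained sketch (not Volcano code: job UIDs are plain strings and the per-job preemptor queue is a slice standing in for util.PriorityQueue) that runs the two phases in both queue orders:

```go
package main

import "fmt"

// simulate replays the preempt action's two phases for the given queue order
// and returns the jobs whose between-jobs preemption was skipped.
func simulate(queueOrder []string) (skipped []string) {
	jobsByQueue := map[string][]string{"q1": {}, "q2": {"pg3"}}
	underRequest := []string{"pg3"} // starving jobs from ALL queues
	// Discovery phase: pg3's pending preemptor q2-preemptor1 is recorded.
	preemptorTasks := map[string][]string{"pg3": {"q2-preemptor1"}}

	for _, q := range queueOrder {
		// Between-jobs preemption: pops from preemptorTasks for jobs in q.
		for _, job := range jobsByQueue[q] {
			if len(preemptorTasks[job]) == 0 {
				skipped = append(skipped, job)
				continue
			}
			preemptorTasks[job] = preemptorTasks[job][1:] // consume the preemptor
		}
		// Intra-job preemption (the buggy part): iterates ALL underRequest jobs
		// and unconditionally resets their entries, even jobs of other queues.
		for _, job := range underRequest {
			preemptorTasks[job] = nil // mirrors util.NewPriorityQueue(ssn.TaskOrderFn)
		}
	}
	return skipped
}

func main() {
	fmt.Println("q1 first:", simulate([]string{"q1", "q2"})) // q1 first: [pg3]
	fmt.Println("q2 first:", simulate([]string{"q2", "q1"})) // q2 first: []
}
```

With q1 first, the intra-job reset runs before q2's between-jobs phase and pg3's preemptor is lost; with q2 first, the preemptor is consumed before the reset can do any harm.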
Root Cause
The bug was introduced in PR volcano-sh/volcano/pull/1453 (Fix: preemption between tasks within a job) as a solution for volcano-sh/volcano/issues/1451. In this fix the intra-job section shares the preemptorTasks map with the between-jobs section but overwrites entries unconditionally without scoping to the current queue.
With the original map[api.QueueID]*api.QueueInfo for queue storage (non-deterministic iteration order), the bug manifests as a missed preemption and, possibly, allocation failures in the next scheduling cycle, depending on which queue is iterated first.
https://github.com/volcano-sh/volcano/blob/v1.14.1/pkg/scheduler/actions/preempt/preempt.go#L113
https://github.com/volcano-sh/volcano/blob/v1.14.1/pkg/scheduler/actions/preempt/preempt.go#L161
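One plausible direction for a fix, sketched here with simplified stand-in types (the job struct and its fields are illustrative assumptions, not Volcano's actual API, and this is not the upstream patch), is to scope the intra-job reset to jobs of the queue currently being processed:

```go
package main

import "fmt"

// job is a minimal stand-in for the scheduler's job info; the fields are
// illustrative assumptions, not Volcano's actual API.
type job struct {
	UID   string
	Queue string
}

// resetScoped mirrors the intra-job section's reset, but skips jobs that do
// not belong to the queue currently being processed.
func resetScoped(currentQueue string, underRequest []job, preemptorTasks map[string][]string) {
	for _, j := range underRequest {
		if j.Queue != currentQueue {
			continue // leave other queues' preemptorTasks entries intact
		}
		preemptorTasks[j.UID] = nil // stands in for util.NewPriorityQueue(ssn.TaskOrderFn)
	}
}

func main() {
	underRequest := []job{{UID: "pg3", Queue: "q2"}}
	preemptorTasks := map[string][]string{"pg3": {"q2-preemptor1"}}

	resetScoped("q1", underRequest, preemptorTasks) // q1's pass: pg3 untouched
	fmt.Println(preemptorTasks["pg3"])              // [q2-preemptor1]
}
```

With this guard, q1's intra-job pass leaves pg3's populated entry alone, so q2's between-jobs phase still sees the pending preemptor.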
This was not discovered by me but by Osykov in #4613, while he was trying to make preempt honor queue order. Unfortunately, his solution was flawed, since it did not actually honor the queue order he set out to introduce.
Concrete reproduction scenario (minimal)
Single node n1 with 2 CPU / 2Gi.
q1 contains pg1 with one running task (q1-runner1, 1 cpu, 1 Gi requested).
pg1 is not starving and has no pending preemptor task.
q2 contains two jobs:
pg2: low-priority running victim (q2-preemptee1, 1 cpu, 1 Gi requested)
pg3: high-priority starving job with pending preemptor (q2-preemptor1, 1 cpu, 1 Gi requested)
Expected behavior:
q2-preemptor1 should preempt q2-preemptee1 (1 eviction), allowing pg3 to make progress.
Observed buggy behavior (intermittent):
- No eviction occurs, and the starving pg3 preemptor is effectively lost for between-job preemption.
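For context, the scenario's arithmetic makes eviction mandatory. A tiny check (CPU only, with values taken from the setup above):

```go
package main

import "fmt"

// needsEviction reports whether a pending request can only be scheduled by
// evicting a running task, given the node's CPU capacity and the running
// tasks' CPU requests. Values come from the reproduction scenario above.
func needsEviction(nodeCPU int, runningCPU []int, requestCPU int) bool {
	used := 0
	for _, c := range runningCPU {
		used += c
	}
	return nodeCPU-used < requestCPU
}

func main() {
	// n1 has 2 CPU; q1-runner1 and q2-preemptee1 hold 1 CPU each,
	// and q2-preemptor1 asks for 1 CPU.
	fmt.Println(needsEviction(2, []int{1, 1}, 1)) // true
}
```

The node has no idle CPU, so the only way pg3 can make progress is by evicting q2-preemptee1.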
Failure Mechanism
The failure requires one specific queue order, which is why it is flaky.
When q1 is visited before q2 (possible because queues is a Go map):
- During job discovery, preemptorTasks[pg3] is correctly populated with the pending preemptor task q2-preemptor1.
- Between-jobs preemption for q1 finds no preemptors (q1 has none) and exits that phase for q1.
- The intra-job loop then runs for all underRequest jobs (shared across queues), including pg3 from q2.
- The buggy line overwrites preemptorTasks[pg3] with a new empty queue: preemptorTasks[job.UID] = util.NewPriorityQueue(ssn.TaskOrderFn).
- By the time the scheduler reaches between-jobs preemption for q2, the original preemptor state for pg3 has been replaced/drained. preemptorTasks[pg3] is empty, so valid preemption is skipped and no victim is evicted.
When q2 is visited first, preemption often succeeds before this overwrite path can invalidate that iteration, so the test passes.
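The nondeterminism itself is inherent to Go maps, whose iteration order is intentionally randomized. A quick standalone check (unrelated to Volcano's code) shows both visit orders occur:

```go
package main

import "fmt"

// sampleFirstKeys iterates a two-key map n times and records which key came
// first on each pass. Go randomizes map iteration order, so with enough
// passes both keys should be observed in first position.
func sampleFirstKeys(n int) map[string]bool {
	queues := map[string]struct{}{"q1": {}, "q2": {}}
	seen := map[string]bool{}
	for i := 0; i < n; i++ {
		for q := range queues {
			seen[q] = true
			break // only record the first key of this pass
		}
	}
	return seen
}

func main() {
	fmt.Println(len(sampleFirstKeys(200))) // almost certainly 2: both orders seen
}
```

This is why the unit test is flaky rather than consistently failing: each scheduling cycle gets a fresh, random queue visit order.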
Key log lines from a failing run:
preempt.go:174] No preemptors in Queue <q1>, break.
preempt.go:191] No preemptor task in job <c1/pg3>.
preempt_test.go:... failed to get Evict request in case ...
Environment
- Volcano version: master (HEAD)
- Kubernetes version: N/A (unit test)
- Go version: 1.24+
Related