preemptorTasks overwrite in multi-queue intra-job preemption causes flaky missed preemption #5140

@hajnalmt

Description

Labels: kind/bug, area/scheduling

Problem

In the preempt action (pkg/scheduler/actions/preempt/preempt.go), when multiple queues are present, the "Preemption between Task within Job" (intra-job) section iterates over the shared underRequest slice and unconditionally overwrites preemptorTasks[job.UID] with a fresh empty PriorityQueue:

preemptorTasks[job.UID] = util.NewPriorityQueue(ssn.TaskOrderFn)

This underRequest slice contains starving jobs from all queues, not just the current queue being processed. When a queue with no relevant preemptors (e.g., q1) is processed first, the intra-job section still iterates starving jobs from other queues (e.g., pg3 in q2) and overwrites their already-populated preemptorTasks entries with empty queues. When that queue (q2) is processed later, between-jobs preemption sees preemptorTasks[job.UID].Empty() == true and skips valid preemption.
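The pattern above can be boiled down to a minimal, self-contained sketch (hypothetical simplified types, not the actual Volcano code; a plain slice stands in for `util.PriorityQueue`):

```go
package main

import "fmt"

type jobUID string

// resetAllUnderRequest mimics the buggy intra-job loop: it resets the
// preemptor entry for EVERY starving job in underRequest, regardless of
// which queue is currently being processed. The reset line is the
// analogue of: preemptorTasks[job.UID] = util.NewPriorityQueue(ssn.TaskOrderFn)
func resetAllUnderRequest(preemptorTasks map[jobUID][]string, underRequest []jobUID) {
	for _, uid := range underRequest {
		preemptorTasks[uid] = nil // fresh empty "queue"
	}
}

func main() {
	// Populated during job discovery: pg3 (in q2) has a pending preemptor.
	preemptorTasks := map[jobUID][]string{"pg3": {"q2-preemptor1"}}

	// underRequest spans all queues; pg3 belongs to q2 but is still
	// iterated while q1 is being processed.
	resetAllUnderRequest(preemptorTasks, []jobUID{"pg3"})

	fmt.Println(len(preemptorTasks["pg3"])) // 0: pg3's preemptor is lost
}
```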

Root Cause

The bug was introduced in PR volcano-sh/volcano/pull/1453 (Fix: preemption between tasks within a job) as a solution for volcano-sh/volcano/issues/1451. In this fix the intra-job section shares the preemptorTasks map with the between-jobs section but overwrites entries unconditionally without scoping to the current queue.
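One possible direction, sketched below with hypothetical simplified types (this is not the actual Volcano patch): guard the reset so it only touches jobs belonging to the queue currently being processed, leaving other queues' preemptor state intact.

```go
package main

import "fmt"

type jobUID string

// resetForQueue resets preemptor state only for jobs in the queue that is
// currently being iterated (hypothetical guard, not the real fix).
func resetForQueue(preemptorTasks map[jobUID][]string, underRequest []jobUID,
	jobQueue map[jobUID]string, current string) {
	for _, uid := range underRequest {
		if jobQueue[uid] != current {
			continue // skip starving jobs from other queues
		}
		preemptorTasks[uid] = nil
	}
}

func main() {
	preemptorTasks := map[jobUID][]string{"pg3": {"q2-preemptor1"}}
	jobQueue := map[jobUID]string{"pg3": "q2"}

	// Processing q1: pg3 belongs to q2, so its entry is untouched.
	resetForQueue(preemptorTasks, []jobUID{"pg3"}, jobQueue, "q1")

	fmt.Println(len(preemptorTasks["pg3"])) // 1: q2's preemptor survives
}
```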

Because queues are stored in the original map[api.QueueID]*api.QueueInfo (non-deterministic iteration order), the bug manifests as a missed preemption, or possibly as allocation failures in the next scheduling cycle, depending on which queue is iterated first.

This was not discovered by me but by Osykov in #4613, while he was trying to make the preempt action honor queue order. Unfortunately his solution fell short, since it did not actually honor the queue order he set out to introduce.

Concrete reproduction scenario (minimal)

Single node n1 with 2 CPU / 2Gi.

  • q1 contains pg1 with one running task (q1-runner1, 1 cpu, 1 Gi requested).
    • pg1 is not starving and has no pending preemptor task.
  • q2 contains two jobs:
    • pg2: low-priority running victim (q2-preemptee1, 1 cpu, 1 Gi requested)
    • pg3: high-priority starving job with pending preemptor (q2-preemptor1, 1 cpu, 1 Gi requested)

Expected behavior:

  • q2-preemptor1 should preempt q2-preemptee1 (1 eviction), allowing pg3 to make progress.

Observed buggy behavior (intermittent):

  • No eviction occurs, and the starving pg3 preemptor is effectively lost for between-job preemption.

Failure Mechanism

The failure requires one specific queue order, which is why it is flaky.

When q1 is visited before q2 (possible because queues is a Go map):

  1. During job discovery, preemptorTasks[pg3] is correctly populated with
    pending preemptor task q2-preemptor1.
  2. Between-jobs preemption for q1 finds no preemptors (q1 has none) and
    exits that phase for q1.
  3. The intra-job loop then runs for all underRequest jobs (shared across
    queues), including pg3 from q2.
  4. Buggy line overwrites preemptorTasks[pg3] with a new empty queue:
    preemptorTasks[job.UID] = util.NewPriorityQueue(ssn.TaskOrderFn).
  5. By the time scheduler reaches between-jobs preemption for q2, the original
    preemptor state for pg3 has been replaced/drained.
  6. preemptorTasks[pg3] is empty, so valid preemption is skipped and no victim
    is evicted.

When q2 is visited first, between-jobs preemption for pg3 runs before the overwrite can take effect, so preemption succeeds and the test passes.
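The order dependence can be demonstrated with a toy simulation of the queue loop (hypothetical simplification of the real preempt action; queue names and job UIDs taken from the scenario above):

```go
package main

import "fmt"

type jobUID string

// simulate walks queues in the given order and reports whether pg3's
// between-jobs preemption fires before the buggy intra-job reset wipes
// its preemptor state.
func simulate(order []string) bool {
	preemptorTasks := map[jobUID][]string{"pg3": {"q2-preemptor1"}}
	underRequest := []jobUID{"pg3"} // starving jobs across ALL queues
	preempted := false

	for _, q := range order {
		// Between-jobs phase: only q2 has a starving job with a preemptor.
		if q == "q2" && len(preemptorTasks["pg3"]) > 0 {
			preempted = true // q2-preemptor1 evicts q2-preemptee1
		}
		// Intra-job phase: unconditional reset over ALL starving jobs,
		// regardless of the queue currently being processed.
		for _, uid := range underRequest {
			preemptorTasks[uid] = nil
		}
	}
	return preempted
}

func main() {
	fmt.Println(simulate([]string{"q2", "q1"})) // true: preemption happens
	fmt.Println(simulate([]string{"q1", "q2"})) // false: pg3's state was wiped
}
```

Since the real queue collection is a Go map with unspecified iteration order, both orders occur across runs, which is exactly the observed flakiness.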

Key log lines from a failing run:

preempt.go:174] No preemptors in Queue <q1>, break.
preempt.go:191] No preemptor task in job <c1/pg3>.
preempt_test.go:... failed to get Evict request in case ...

Environment

  • Volcano version: master (HEAD)
  • Kubernetes version: N/A (unit test)
  • Go version: 1.24+
