preemptorTasks overwrite in multi-queue intra-job preemption causes flaky missed preemption #5140

@hajnalmt

Description

Labels: kind/bug, area/scheduling

Problem

In the preempt action (pkg/scheduler/actions/preempt/preempt.go), when multiple queues are present, the "Preemption between Task within Job" (intra-job) section iterates over the shared underRequest slice and unconditionally overwrites preemptorTasks[job.UID] with a fresh empty PriorityQueue:

preemptorTasks[job.UID] = util.NewPriorityQueue(ssn.TaskOrderFn)

This underRequest slice contains starving jobs from all queues, not just the current queue being processed. When a queue with no relevant preemptors (e.g., q1) is processed first, the intra-job section still iterates starving jobs from other queues (e.g., pg3 in q2) and overwrites their already-populated preemptorTasks entries with empty queues. When that queue (q2) is processed later, between-jobs preemption sees preemptorTasks[job.UID].Empty() == true and skips valid preemption.
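The pattern above can be boiled down to a minimal, self-contained sketch (hypothetical simplified types, not the actual Volcano code; a plain slice stands in for `util.PriorityQueue`):

```go
package main

import "fmt"

type jobUID string

// resetAllUnderRequest mimics the buggy intra-job loop: it resets the
// preemptor entry for EVERY starving job in underRequest, regardless of
// which queue is currently being processed. The reset line is the
// analogue of: preemptorTasks[job.UID] = util.NewPriorityQueue(ssn.TaskOrderFn)
func resetAllUnderRequest(preemptorTasks map[jobUID][]string, underRequest []jobUID) {
	for _, uid := range underRequest {
		preemptorTasks[uid] = nil // fresh empty "queue"
	}
}

func main() {
	// Populated during job discovery: pg3 (in q2) has a pending preemptor.
	preemptorTasks := map[jobUID][]string{"pg3": {"q2-preemptor1"}}

	// underRequest spans all queues; pg3 belongs to q2 but is still
	// iterated while q1 is being processed.
	resetAllUnderRequest(preemptorTasks, []jobUID{"pg3"})

	fmt.Println(len(preemptorTasks["pg3"])) // 0: pg3's preemptor is lost
}
```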

Root Cause

The bug was introduced in PR volcano-sh/volcano/pull/1453 (Fix: preemption between tasks within a job) as a solution for volcano-sh/volcano/issues/1451. In this fix the intra-job section shares the preemptorTasks map with the between-jobs section but overwrites entries unconditionally without scoping to the current queue.
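One possible direction, sketched below with hypothetical simplified types (this is not the actual Volcano patch): guard the reset so it only touches jobs belonging to the queue currently being processed, leaving other queues' preemptor state intact.

```go
package main

import "fmt"

type jobUID string

// resetForQueue resets preemptor state only for jobs in the queue that is
// currently being iterated (hypothetical guard, not the real fix).
func resetForQueue(preemptorTasks map[jobUID][]string, underRequest []jobUID,
	jobQueue map[jobUID]string, current string) {
	for _, uid := range underRequest {
		if jobQueue[uid] != current {
			continue // skip starving jobs from other queues
		}
		preemptorTasks[uid] = nil
	}
}

func main() {
	preemptorTasks := map[jobUID][]string{"pg3": {"q2-preemptor1"}}
	jobQueue := map[jobUID]string{"pg3": "q2"}

	// Processing q1: pg3 belongs to q2, so its entry is untouched.
	resetForQueue(preemptorTasks, []jobUID{"pg3"}, jobQueue, "q1")

	fmt.Println(len(preemptorTasks["pg3"])) // 1: q2's preemptor survives
}
```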

Because queues are stored in the original map[api.QueueID]*api.QueueInfo (non-deterministic iteration order), the bug manifests as a missed preemption, or possibly as allocation failures in the next scheduling cycle, depending on which queue is iterated first.

This was not discovered by me but by Osykov in #4613, while he was trying to make the preempt action honor queue order. Unfortunately his solution fell short, since it did not actually honor the queue order he set out to introduce.

Concrete reproduction scenario (minimal)

Single node n1 with 2 CPU / 2Gi.

  • q1 contains pg1 with one running task (q1-runner1, 1 cpu, 1 Gi requested).
    • pg1 is not starving and has no pending preemptor task.
  • q2 contains two jobs:
    • pg2: low-priority running victim (q2-preemptee1, 1 cpu, 1 Gi requested)
    • pg3: high-priority starving job with pending preemptor (q2-preemptor1, 1 cpu, 1 Gi requested)

Expected behavior:

  • q2-preemptor1 should preempt q2-preemptee1 (1 eviction), allowing pg3 to make progress.

Observed buggy behavior (intermittent):

  • No eviction occurs, and the starving pg3 preemptor is effectively lost for between-job preemption.

Failure Mechanism

The failure requires one specific queue order, which is why it is flaky.

When q1 is visited before q2 (possible because queues is a Go map):

  1. During job discovery, preemptorTasks[pg3] is correctly populated with
    pending preemptor task q2-preemptor1.
  2. Between-jobs preemption for q1 finds no preemptors (q1 has none) and
    exits that phase for q1.
  3. The intra-job loop then runs for all underRequest jobs (shared across
    queues), including pg3 from q2.
  4. Buggy line overwrites preemptorTasks[pg3] with a new empty queue:
    preemptorTasks[job.UID] = util.NewPriorityQueue(ssn.TaskOrderFn).
  5. By the time scheduler reaches between-jobs preemption for q2, the original
    preemptor state for pg3 has been replaced/drained.
  6. preemptorTasks[pg3] is empty, so valid preemption is skipped and no victim
    is evicted.

When q2 is visited first, between-jobs preemption for pg3 runs before the overwrite can take effect, so preemption succeeds and the test passes.
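The order dependence can be demonstrated with a toy simulation of the queue loop (hypothetical simplification of the real preempt action; queue names and job UIDs taken from the scenario above):

```go
package main

import "fmt"

type jobUID string

// simulate walks queues in the given order and reports whether pg3's
// between-jobs preemption fires before the buggy intra-job reset wipes
// its preemptor state.
func simulate(order []string) bool {
	preemptorTasks := map[jobUID][]string{"pg3": {"q2-preemptor1"}}
	underRequest := []jobUID{"pg3"} // starving jobs across ALL queues
	preempted := false

	for _, q := range order {
		// Between-jobs phase: only q2 has a starving job with a preemptor.
		if q == "q2" && len(preemptorTasks["pg3"]) > 0 {
			preempted = true // q2-preemptor1 evicts q2-preemptee1
		}
		// Intra-job phase: unconditional reset over ALL starving jobs,
		// regardless of the queue currently being processed.
		for _, uid := range underRequest {
			preemptorTasks[uid] = nil
		}
	}
	return preempted
}

func main() {
	fmt.Println(simulate([]string{"q2", "q1"})) // true: preemption happens
	fmt.Println(simulate([]string{"q1", "q2"})) // false: pg3's state was wiped
}
```

Since the real queue collection is a Go map with unspecified iteration order, both orders occur across runs, which is exactly the observed flakiness.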

Key log lines from a failing run:

preempt.go:174] No preemptors in Queue <q1>, break.
preempt.go:191] No preemptor task in job <c1/pg3>.
preempt_test.go:... failed to get Evict request in case ...

Environment

  • Volcano version: master (HEAD)
  • Kubernetes version: N/A (unit test)
  • Go version: 1.24+
