[Feature]: Persist MIG mode and SR-IOV vGPU VFs across reboot on the sandbox/vGPU path (operator-driven GPU reset) #2600

### Motivation

On the sandbox / vGPU workload path, MIG mode changes and SR-IOV vGPU VFs do not survive a node reboot, and the operator does not re-establish this state on boot. After a reboot of an NVSwitch node configured for MIG-backed vGPU:

- MIG mode / MIG instances are not present until re-applied;
- SR-IOV VFs (created via `sriov-manage -e`) are gone — VFs are runtime state and are never persisted by SR-IOV;
- the vGPU devices created on those VFs are gone.

The operator's MIG manager can enable MIG mode, but committing a MIG-mode change on these GPUs requires a GPU reset. The only reset mechanism the Kubernetes mig-manager has is `WITH_REBOOT` ("reboot the node if changing the MIG mode fails for any reason"); it never performs a targeted `nvidia-smi --gpu-reset`. VF creation is not performed by the operator at all: `sriov-manage` appears nowhere in `gpu-operator`. So after a reboot on the vGPU path, MIG, VFs, and vGPUs have to be re-established out-of-band by host units that enable MIG, run `--gpu-reset`, run `sriov-manage -e` per PF, and then release the operand readiness gate.
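
For illustration, the out-of-band sequence described above can be sketched as a dry-run script that only prints the commands it would run. This is a rough sketch, not the actual host units: the function name `plan_recovery`, the default `sriov-manage` path, and the PCI addresses are assumptions, and the readiness-gate release is left as a placeholder because its mechanism is site-specific.

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery commands instead of executing them.
# ASSUMPTION: sriov-manage lives at this path (typical for the vGPU host driver).
SRIOV_MANAGE="${SRIOV_MANAGE:-/usr/lib/nvidia/sriov-manage}"

# plan_recovery PF...  (hypothetical helper; arguments are GPU PF PCI addresses)
plan_recovery() {
    # 1. Request MIG mode on every PF.
    for pf in "$@"; do
        echo "nvidia-smi -i $pf -mig 1"
    done
    # 2. Commit the pending MIG-mode change with a targeted GPU reset.
    for pf in "$@"; do
        echo "nvidia-smi --gpu-reset -i $pf"
    done
    # 3. Re-create the SR-IOV VFs, which do not survive a reboot.
    for pf in "$@"; do
        echo "$SRIOV_MANAGE -e $pf"
    done
    # 4. Placeholder: releasing the operand readiness gate is site-specific.
    echo "# release the operand readiness gate (not shown)"
}
```

Calling `plan_recovery 0000:1b:00.0 0000:43:00.0` (example addresses) prints one MIG-enable, one reset, and one `sriov-manage -e` line per PF, in that order. Whether the reset has to cover all GPUs on an NVSwitch node together is not addressed here.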

By contrast, the **systemd** deployment of `mig-parted` does handle reboot persistence: `nvidia-mig-manager.service` persists the selected config across reboot (`persist_config_across_reboot`) and orders itself via `nvidia-gpu-reset.target` (`After=nvidia-fabricmanager.service`, `Before=nvidia-gpu-reset.target`) so MIG reconfiguration, GPU reset, and driver-service ordering are coordinated at boot. The Kubernetes / operator path has no equivalent.

### Proposal / discussion

1. Give the operator a way to commit MIG-mode changes via a targeted GPU reset instead of only a full node reboot (`WITH_REBOOT`). This likely needs coordination with `nvidia-persistenced` (see #118), which holds a handle that blocks `--gpu-reset`.
2. On the vGPU / sandbox path, have the operator own re-establishing SR-IOV VFs on boot (the equivalent of `sriov-manage -e` per PF) and re-applying the MIG + vGPU device configuration, so a reboot recovers hands-off without host units. This is the smallest useful seam and could be a first change on its own (VF re-enablement in the vGPU device-manager operand before it applies the vGPU config).
3. Port the systemd deployment's boot-ordering guarantees (config persistence + `nvidia-gpu-reset.target` ordering) to the operator's operands, or document the boot units as a supported companion for the Kubernetes path.
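
As a starting point for item 2, the check for which PFs need their VFs re-enabled could look roughly like the sketch below. It relies only on the standard PCI sysfs attributes `vendor`, `sriov_totalvfs`, and `sriov_numvfs`. The function names are hypothetical, and treating every NVIDIA device (vendor `0x10de`) that exposes `sriov_totalvfs` as a vGPU-capable PF is an assumption that may need tightening.

```shell
#!/bin/sh
# list_nvidia_pfs [SYSFS_DIR]  (hypothetical helper)
# Prints the PCI address of every NVIDIA device that advertises SR-IOV.
list_nvidia_pfs() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        [ -r "$dev/vendor" ] || continue
        # 0x10de is the NVIDIA PCI vendor ID.
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
        # Only SR-IOV capable physical functions expose sriov_totalvfs.
        [ -e "$dev/sriov_totalvfs" ] || continue
        basename "$dev"
    done
}

# pfs_needing_vfs [SYSFS_DIR]  (hypothetical helper)
# Prints the PFs that currently have no VFs enabled, e.g. after a reboot.
pfs_needing_vfs() {
    root="${1:-/sys/bus/pci/devices}"
    for pf in $(list_nvidia_pfs "$root"); do
        if [ "$(cat "$root/$pf/sriov_numvfs" 2>/dev/null)" = "0" ]; then
            echo "$pf"
        fi
    done
}
```

An operand could run the equivalent of `sriov-manage -e` for each address printed by `pfs_needing_vfs` before applying the vGPU config. The optional directory argument exists so the logic can be exercised against a fake sysfs tree.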

### Related

- #403 — device-plugin validator failing after reboot with MIG enabled.
- #118 — `nvidia-persistenced` keeps a handle on the GPU and blocks `--gpu-reset`.

This is the reboot-recovery glue the operator lacks on the vGPU path today.

### Environment

NVSwitch HGX 8-GPU node, MIG-backed and whole-card vGPU, host-installed driver, GPU Operator sandbox path (`gpu.workload.config=vm-vgpu`).

