[Feature]: Persist MIG mode and SR-IOV vGPU VFs across reboot on the sandbox/vGPU path (operator-driven GPU reset) #2600

Description

@lexfrei

Motivation

On the sandbox / vGPU workload path, MIG mode changes and SR-IOV vGPU VFs do not survive a node reboot, and the operator does not re-establish this state on boot. After a reboot of an NVSwitch node configured for MIG-backed vGPU:

  • MIG mode / MIG instances are not present until re-applied;
  • SR-IOV VFs (created via sriov-manage -e) are gone — VFs are runtime state and are never persisted by SR-IOV;
  • the vGPU devices created on those VFs are gone.
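
As a rough illustration of how the missing VFs could be detected after boot, the sketch below reads the kernel's standard SR-IOV sysfs attributes (`sriov_numvfs`, `sriov_totalvfs`) for a physical function. The function names and the `sysfs_root` parameter are hypothetical and exist only for this example; this is not code from gpu-operator.

```python
from pathlib import Path
from typing import Tuple


def vf_state(pf_bdf: str, sysfs_root: str = "/sys") -> Tuple[int, int]:
    """Return (enabled VFs, maximum VFs) for one PCI physical function.

    pf_bdf is the PF's PCI address, e.g. "0000:1b:00.0".
    sysfs_root is a hypothetical knob so the function can be tested
    against a fake directory tree.
    """
    dev = Path(sysfs_root) / "bus" / "pci" / "devices" / pf_bdf
    numvfs = int((dev / "sriov_numvfs").read_text().strip())
    totalvfs = int((dev / "sriov_totalvfs").read_text().strip())
    return numvfs, totalvfs


def needs_vf_enable(pf_bdf: str, sysfs_root: str = "/sys") -> bool:
    """True when the PF supports SR-IOV but currently has no VFs enabled."""
    numvfs, totalvfs = vf_state(pf_bdf, sysfs_root)
    return totalvfs > 0 and numvfs == 0
```

On a freshly rebooted node, `needs_vf_enable` would be expected to return `True` for each GPU PF until something re-runs the VF enablement step.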

The operator's MIG manager can enable MIG mode, but committing a MIG-mode change on these GPUs requires a GPU reset. The only reset mechanism the Kubernetes mig-manager has is WITH_REBOOT ("reboot the node if changing the MIG mode fails for any reason"); it never performs a targeted nvidia-smi --gpu-reset. VF creation is not performed by the operator at all: sriov-manage appears nowhere in gpu-operator. So on the vGPU path, after a reboot, MIG + VFs + vGPUs have to be re-established out-of-band, using host units that enable MIG, run --gpu-reset, run sriov-manage -e per PF, and then release the operand readiness gate.
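
For concreteness, the out-of-band sequence described above might look roughly like the following sketch, which only builds (and by default just prints) the ordered commands. The `sriov-manage` path, the function names, and the choice to reset each GPU individually are assumptions made for illustration; whether a per-GPU reset is allowed depends on the platform, and the readiness-gate release is site-specific and deliberately left out.

```python
import shlex
import subprocess
from typing import List, Sequence

# Assumed install location of sriov-manage; adjust for the actual host.
SRIOV_MANAGE = "/usr/lib/nvidia/sriov-manage"


def recovery_plan(gpu_indices: Sequence[int],
                  pf_addresses: Sequence[str]) -> List[List[str]]:
    """Build the ordered command list: enable MIG, reset GPUs, enable VFs."""
    plan: List[List[str]] = []
    # 1. Request MIG mode on every GPU (skip for whole-card vGPU).
    for idx in gpu_indices:
        plan.append(["nvidia-smi", "-i", str(idx), "-mig", "1"])
    # 2. Reset the GPUs so the pending MIG-mode change is committed.
    for idx in gpu_indices:
        plan.append(["nvidia-smi", "--gpu-reset", "-i", str(idx)])
    # 3. Re-create the SR-IOV VFs on each physical function.
    for bdf in pf_addresses:
        plan.append([SRIOV_MANAGE, "-e", bdf])
    # 4. Releasing the operand readiness gate would follow here (not shown).
    return plan


def run(plan: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run for one GPU and one PF (hypothetical PCI address).
    run(recovery_plan([0], ["0000:1b:00.0"]))
```

Keeping the plan as data separate from execution makes the ordering easy to review and to dry-run before anything touches the GPUs.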

By contrast, the systemd deployment of mig-parted does handle reboot persistence: nvidia-mig-manager.service persists the selected config across reboot (persist_config_across_reboot) and orders itself via nvidia-gpu-reset.target (After=nvidia-fabricmanager.service, Before=nvidia-gpu-reset.target) so MIG reconfiguration, GPU reset, and driver-service ordering are coordinated at boot. The Kubernetes / operator path has no equivalent.

Proposal / discussion

  1. Give the operator a way to commit MIG-mode changes via a targeted GPU reset instead of only a full node reboot (WITH_REBOOT). This likely needs coordination with nvidia-persistenced, which holds a handle on the GPU that blocks --gpu-reset (see "nvidia-persistenced keeps a handle on the GPU and blocks --gpu-reset", #118).
  2. On the vGPU / sandbox path, have the operator own re-establishing SR-IOV VFs on boot (the equivalent of sriov-manage -e per PF) and re-applying the MIG + vGPU device configuration, so a reboot recovers hands-off without host units. This is the smallest useful seam and could be a first change on its own (VF re-enablement in the vGPU device-manager operand before it applies the vGPU config).
  3. Port the systemd deployment's boot-ordering guarantees (config persistence + nvidia-gpu-reset.target ordering) to the operator's operands, or document the boot units as a supported companion for the Kubernetes path.

Related

This is the reboot-recovery glue the operator lacks on the vGPU path today.

Environment

NVSwitch HGX 8-GPU node, MIG-backed and whole-card vGPU, host-installed driver, GPU Operator sandbox path (gpu.workload.config=vm-vgpu).
