
feat(validator): re-enable SR-IOV vGPU VFs before waiting for them #2601

Open
lexfrei wants to merge 1 commit into NVIDIA:main from lexfrei:feat/validator-enable-sriov-vfs

@lexfrei lexfrei commented Jul 4, 2026

Contributor

Description

On the vGPU (sandbox) workload path, SR-IOV Virtual Functions (VFs) are runtime state that does not survive a node reboot. The vGPU Manager validator already waits for VFs to appear (waitForVFs) before the vGPU device-manager applies its config, but nothing re-creates them. After a node reboot the VFs are gone, so validation blocks until they are re-established out-of-band (e.g. by host systemd units running sriov-manage -e). The operator does not recover from this on its own.

This adds the smallest useful part of the reboot-recovery seam discussed in #2600: the operator re-enables the SR-IOV VFs itself. Right before the existing wait, VGPUManager.validate() runs sriov-manage -e ALL inside the resolved driver root (the driver container root, or the host root when the vGPU Manager driver is pre-installed on the host).
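For reviewers who want a concrete picture of the invocation, here is a minimal sketch of how such a command could be assembled. It is not the code in this PR: `enableVFsCommand` is a hypothetical helper, the example root path is only illustrative, and it assumes `chroot` and `sriov-manage` resolve via `PATH` inside their respective roots. The command is only built, never executed, since running it needs root and GPU hardware.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// enableVFsCommand builds (but does not run) a chroot invocation of
// "sriov-manage -e ALL" inside the given driver root.
// Hypothetical helper for illustration; the real validator may resolve
// the binary and the root differently.
func enableVFsCommand(driverRoot string) *exec.Cmd {
	return exec.Command("chroot", driverRoot, "sriov-manage", "-e", "ALL")
}

func main() {
	// Example root only; the validator resolves the actual driver root
	// (driver container or host) at runtime.
	cmd := enableVFsCommand("/run/nvidia/driver")
	fmt.Println(strings.Join(cmd.Args, " "))
}
```

Keeping command construction separate from execution makes the argument list easy to unit-test without privileges.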

The call is:

  • gated to the vm-vgpu workload path — the function already early-returns for other workloads;
  • idempotent — skipped when every SR-IOV-capable GPU already has its full VF count, which is the normal steady state, so it is a no-op except right after a reboot when no VMs are running yet;
  • best-effort — on failure it logs and falls through to waitForVFs, preserving the current behavior for setups that create VFs out-of-band.

The PF-counting loop is extracted into a shared countVFs helper so the new idempotency guard and waitForVFs agree on the accounting.

Scope is intentionally limited to VF re-enablement. GPU reset, MIG-mode commit, and nvidia-persistenced coordination (the rest of #2600) are out of scope here.

Refs #2600.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

  • go vet and golangci-lint run on the changed package, cross-compiled GOOS=linux (the validator is a Linux-only binary): clean, 0 issues. No asset or go.mod/go.sum changes.
  • New unit test TestCountVFs covers the shared VF-accounting helper across the boundary cases (no SR-IOV GPUs, VFs missing after reboot, fully enabled, partially enabled across multiple PFs, and VFs not miscounted as PFs); it passes in a Linux container.
  • Not yet exercised on GPU hardware; the runtime sriov-manage/chroot path still needs to be verified through CI and cluster testing.

On the vGPU (sandbox) workload path, SR-IOV Virtual Functions are
runtime state that does not survive a node reboot. The validator already
waits for the VFs to appear before letting the vGPU device-manager apply
its config, but nothing re-creates them, so after a reboot validation
blocks until the VFs are re-established out-of-band.

Re-enable the VFs from the vGPU Manager validator via 'sriov-manage -e
ALL' inside the resolved driver root (driver container or host) right
before the existing wait. The call is idempotent (skipped when every
SR-IOV-capable GPU already has its full VF count, so it never disturbs
VFs already assigned to running VMs) and best-effort (on failure the
existing wait still runs, preserving prior behavior on setups where the
VFs are created out-of-band).

Scope is limited to VF re-enablement; no GPU reset and no MIG
reconfiguration are performed. Refs NVIDIA#2600.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

copy-pr-bot Bot commented Jul 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lexfrei lexfrei marked this pull request as ready for review July 4, 2026 11:57