feat(validator): re-enable SR-IOV vGPU VFs before waiting for them #2601
Open
lexfrei wants to merge 1 commit into
On the vGPU (sandbox) workload path, SR-IOV Virtual Functions are runtime state that does not survive a node reboot. The validator already waits for the VFs to appear before letting the vGPU device-manager apply its config, but nothing re-creates them, so after a reboot validation blocks until the VFs are re-established out-of-band.

Re-enable the VFs from the vGPU Manager validator via 'sriov-manage -e ALL' inside the resolved driver root (driver container or host) right before the existing wait. The call is idempotent (skipped when every SR-IOV-capable GPU already has its full VF count, so it never disturbs VFs already assigned to running VMs) and best-effort (on failure the existing wait still runs, preserving prior behavior on setups where the VFs are created out-of-band).

Scope is limited to VF re-enablement; no GPU reset and no MIG reconfiguration are performed.

Refs NVIDIA#2600.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Description
On the vGPU (sandbox) workload path, SR-IOV Virtual Functions (VFs) are runtime state that does not survive a node reboot. The vGPU Manager validator already waits for VFs to appear (`waitForVFs`) before the vGPU device-manager applies its config, but nothing re-creates them. After a node reboot the VFs are gone, so validation blocks until they are re-established out-of-band (e.g. host systemd units running `sriov-manage -e`); the operator does not recover this on its own.

This adds the smallest useful part of the reboot-recovery seam discussed in #2600: the operator re-enables the SR-IOV VFs itself. Right before the existing wait, `VGPUManager.validate()` runs `sriov-manage -e ALL` inside the resolved driver root (the driver container root, or the host root when the vGPU Manager driver is pre-installed on the host).

The call is:
- limited to the `vm-vgpu` workload path; the function already early-returns for other workloads;
- idempotent: it is skipped when every SR-IOV-capable GPU already has its full VF count, so it never disturbs VFs already assigned to running VMs;
- best-effort: on failure the validator still proceeds to `waitForVFs`, preserving the current behavior for setups that create VFs out-of-band.

The PF-counting loop is extracted into a shared `countVFs` helper so the new idempotency guard and `waitForVFs` agree on the accounting.

Scope is intentionally limited to VF re-enablement. GPU reset, MIG-mode commit, and `nvidia-persistenced` coordination (the rest of #2600) are out of scope here.

Refs #2600.
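The flow described above (count, guard, best-effort enable, then wait) can be sketched roughly as follows. This is not the code from the PR: the `gpu` struct, `needsEnable`, `ensureVFs`, `enableCmd`, the `sriovManage` path, and the `/run/nvidia/driver` root used in `main` are all assumptions made for illustration, and the real `countVFs` may have a different signature.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// sriovManage is an assumed location of the script inside the driver root.
const sriovManage = "/usr/lib/nvidia/sriov-manage"

// errEnableFailed is a stand-in error used by the demo in main.
var errEnableFailed = errors.New("sriov-manage failed")

// gpu is a hypothetical, simplified view of one NVIDIA PCI function.
type gpu struct {
	Address  string
	IsVF     bool // true for a Virtual Function, false for a Physical Function
	TotalVFs int  // maximum VFs the PF supports (0 if not SR-IOV capable)
	NumVFs   int  // VFs currently enabled on the PF
}

// countVFs returns the number of SR-IOV-capable PFs, the VFs they support in
// total, and the VFs currently enabled. VFs and non-SR-IOV devices are skipped.
func countVFs(devs []gpu) (pfs, total, enabled int) {
	for _, d := range devs {
		if d.IsVF || d.TotalVFs == 0 {
			continue
		}
		pfs++
		total += d.TotalVFs
		enabled += d.NumVFs
	}
	return pfs, total, enabled
}

// needsEnable is the idempotency guard. Because a PF cannot have more VFs
// enabled than it supports, enabled == total means every PF is fully enabled.
func needsEnable(devs []gpu) bool {
	pfs, total, enabled := countVFs(devs)
	return pfs > 0 && enabled < total
}

// enableCmd builds (but does not run) the command that would re-enable VFs
// inside the given driver root.
func enableCmd(driverRoot string) *exec.Cmd {
	return exec.Command("chroot", driverRoot, sriovManage, "-e", "ALL")
}

// ensureVFs runs enable only when the guard says VFs are missing, logs and
// ignores an enable failure, and always falls through to wait.
func ensureVFs(devs []gpu, enable, wait func() error) error {
	if needsEnable(devs) {
		if err := enable(); err != nil {
			fmt.Printf("warning: enabling VFs failed, falling back to wait: %v\n", err)
		}
	}
	return wait()
}

func main() {
	// One PF whose VFs were lost on reboot.
	devs := []gpu{{Address: "0000:3b:00.0", TotalVFs: 16}}
	fmt.Println("would run:", enableCmd("/run/nvidia/driver").Args)

	err := ensureVFs(devs,
		func() error { return errEnableFailed }, // simulate a failed enable
		func() error { return nil },             // simulate a successful wait
	)
	fmt.Println("wait result:", err)
}
```

Passing `enable` and `wait` as functions keeps the guard and the fallback testable without SR-IOV hardware; in a real validator `enable` would run the command returned by `enableCmd`.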
Checklist
- `make lint`
- `make validate-generated-assets`
- `make validate-modules`

Testing
- `go vet` and `golangci-lint run` on the changed package, cross-compiled with `GOOS=linux` (the validator is a Linux-only binary): clean, 0 issues. No asset or `go.mod`/`go.sum` changes.
- `TestCountVFs` covers the shared VF-accounting helper across the boundary cases (no SR-IOV GPUs, VFs missing after reboot, fully enabled, partially enabled across multiple PFs, and VFs not miscounted as PFs); it passes in a Linux container.
- The `sriov-manage`/chroot path relies on CI and cluster testing.
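The boundary cases listed for the VF-accounting helper lend themselves to a table-driven check. The sketch below is only an illustration of that shape, not the PR's `TestCountVFs`: the `gpu` struct, the `countVFs` signature, and the `vfCase` table are assumptions, and the VF counts in the table are made-up values.

```go
package main

import "fmt"

// gpu is a hypothetical, simplified view of one NVIDIA PCI function.
type gpu struct {
	IsVF     bool // true for a Virtual Function
	TotalVFs int  // maximum VFs the PF supports (0 if not SR-IOV capable)
	NumVFs   int  // VFs currently enabled on the PF
}

// countVFs returns the number of SR-IOV-capable PFs, the VFs they support in
// total, and the VFs currently enabled. VFs and non-SR-IOV devices are skipped.
func countVFs(devs []gpu) (pfs, total, enabled int) {
	for _, d := range devs {
		if d.IsVF || d.TotalVFs == 0 {
			continue
		}
		pfs++
		total += d.TotalVFs
		enabled += d.NumVFs
	}
	return pfs, total, enabled
}

// vfCase is one row of the table: an input inventory and the expected counts.
type vfCase struct {
	name                string
	devs                []gpu
	pfs, total, enabled int
}

// cases mirrors the boundary cases named in the testing notes.
func cases() []vfCase {
	return []vfCase{
		{"no SR-IOV GPUs", []gpu{{TotalVFs: 0}}, 0, 0, 0},
		{"VFs missing after reboot", []gpu{{TotalVFs: 16}}, 1, 16, 0},
		{"fully enabled", []gpu{{TotalVFs: 16, NumVFs: 16}}, 1, 16, 16},
		{"partially enabled across PFs",
			[]gpu{{TotalVFs: 16, NumVFs: 16}, {TotalVFs: 16, NumVFs: 4}}, 2, 32, 20},
		{"VFs not counted as PFs",
			[]gpu{{TotalVFs: 2, NumVFs: 2}, {IsVF: true}, {IsVF: true}}, 1, 2, 2},
	}
}

// failures runs every case and returns a description of each mismatch.
func failures() []string {
	var out []string
	for _, c := range cases() {
		pfs, total, enabled := countVFs(c.devs)
		if pfs != c.pfs || total != c.total || enabled != c.enabled {
			out = append(out, fmt.Sprintf("%s: got (%d, %d, %d), want (%d, %d, %d)",
				c.name, pfs, total, enabled, c.pfs, c.total, c.enabled))
		}
	}
	return out
}

func main() {
	for _, f := range failures() {
		fmt.Println("FAIL:", f)
	}
	fmt.Println("cases run:", len(cases()))
}
```

In a real test file the same table would typically be iterated with `t.Run(c.name, ...)` from the standard `testing` package so each case reports separately.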