
fix(vgpu-device-manager): wait for host-installed vGPU Manager readiness #2599

Open: lexfrei wants to merge 1 commit into NVIDIA:main from lexfrei:fix/vgpu-manager-ready-host-driver


lexfrei commented Jul 4, 2026

Contributor

Description

The vgpu-manager-validation init container of the nvidia-vgpu-device-manager DaemonSet waited only for the vgpu-manager-ready status file, which the validator writes when the vGPU Manager is deployed as a container. When the vGPU Manager driver is pre-installed on the host (driver.enabled=false), the validator writes host-vgpu-manager-ready instead; it also deletes both files at the start of each run. As a result, the init container blocked indefinitely on "waiting for NVIDIA vGPU Manager to be setup" and the vGPU Device Manager never started.

host-vgpu-manager-ready is written by the validator but read nowhere else in the tree, so the host-installed-driver path never had a working readiness gate. There was already a TODO next to the gate acknowledging this.

This PR makes the init container wait for either status file, so the operand starts in both the container-managed and host-installed driver modes. It also resolves that TODO.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Added TestVGPUDeviceManagerReadinessGate (controllers package), which decodes the DaemonSet asset and asserts that the init container's readiness gate checks both vgpu-manager-ready and host-vgpu-manager-ready, combined with || (OR), so the gate passes when either file is present.

Verified red/green: the test fails on the previous single-file gate and on an accidental &&, and passes with the fix. go test ./controllers/, go vet ./controllers/, and golangci-lint run on the changed package all pass.

This changes only a static asset and a unit test, so api/, config/, bundle/, deployments/, and the module files are untouched.

The vgpu-manager-validation init container waited only for the
vgpu-manager-ready status file, which the validator writes when the
vGPU Manager is deployed as a container. When the vGPU Manager driver
is pre-installed on the host (driver.enabled=false), the validator
writes host-vgpu-manager-ready instead, so the init container blocked
indefinitely on "waiting for NVIDIA vGPU Manager to be setup" and the
vGPU Device Manager never started.

Wait for either status file so the operand starts in both the
container-managed and host-installed driver modes, resolving the
existing TODO in the daemonset asset.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

copy-pr-bot Bot commented Jul 4, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
