vGPU Device Manager: no SR-IOV / vendor-specific VFIO vGPU creation on Ada Lovelace / Hopper (mdev-only)

### Summary

The vGPU Device Manager (`nvidia-vgpu-device-manager`) supports only the legacy mediated-device (mdev) framework. On Ada Lovelace and newer GPUs (L40/L40S, RTX Ada, H100, H200, Blackwell) running the vGPU release 17/18+ host driver, the driver no longer exposes mdev. Instead, a vGPU profile is assigned to each SR-IOV Virtual Function (VF) through the vendor-specific VFIO sysfs interface (`/sys/bus/pci/devices/<VF>/nvidia/current_vgpu_type`). As a result, the GPU Operator has no way to create SR-IOV vGPU devices on these GPUs.

### Environment

- GPU: NVIDIA H200 SXM 141GB (`10de:2335`)
- Host vGPU driver: `580.159.01` (NVIDIA vGPU release 18, `-vgpu-kvm`)
- Kernel: 6.8
- GPU Operator in sandbox mode: `sandboxWorkloads.enabled=true`, node label `nvidia.com/gpu.workload.config=vm-vgpu`
- `vgpu-device-manager` image `v0.4.2`

### Problem

On this GPU `/sys/bus/mdev/` does not exist, by design: the driver uses the vendor-specific VFIO framework, and `nvidia-smi -q` reports `Host VGPU Mode : SR-IOV`. vGPUs are configured per VF. After `/usr/lib/nvidia/sriov-manage -e <BDF>`, each VF exposes `/sys/bus/pci/devices/<VF>/nvidia/{creatable_vgpu_types, current_vgpu_type, gpu_instance_id, placement_id, ...}`, and a vGPU is created by writing a type ID to `current_vgpu_type`. Because there is no mdev bus for the device manager to walk, it cannot enumerate or create vGPU devices. This matches the failure reported in #591:

```
error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory
```

With a host-installed driver, the `vgpu-device-manager` init container instead blocks indefinitely on `waiting for NVIDIA vGPU Manager to be setup`. Either way, the operator offers no path to create SR-IOV vGPUs on Ada/Hopper.
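To make the distinction concrete, here is a minimal sketch of how a node-level component might tell the two models apart before deciding which code path to take. The function name `detect_vgpu_model` and its return strings are hypothetical; the only sysfs paths it relies on are the ones quoted in this issue. It assumes the mdev bus directory is present only when the mdev model is in use, which may not hold on hosts where another driver registers mdev.

```python
from pathlib import Path


def detect_vgpu_model(sysfs_root: str = "/sys") -> str:
    """Return 'mdev', 'vendor-vfio', or 'none' (hypothetical labels)."""
    root = Path(sysfs_root)

    # Legacy model: an mdev bus exists for the device manager to walk.
    if (root / "bus/mdev/devices").is_dir():
        return "mdev"

    # Per-VF model: at least one PCI device exposes nvidia/current_vgpu_type.
    pci = root / "bus/pci/devices"
    if pci.is_dir():
        for dev in sorted(pci.iterdir()):
            if (dev / "nvidia" / "current_vgpu_type").is_file():
                return "vendor-vfio"

    return "none"
```

On the H200 host described above, a check like this would be expected to take the second branch once `sriov-manage -e` has enabled the VFs, since `/sys/bus/mdev/` is absent.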

### The rest of the stack is ready — only operator-side creation is missing

- **KubeVirt consumes these VFs.** Its PCI device plugin recognizes a VF bound to the `nvidia` driver once `current_vgpu_type != 0` and advertises it. This landed in kubevirt/kubevirt#16890 (merged 2026-04-14) and is available in KubeVirt v1.9 (`v1.9.0-beta.0`): the added file `pkg/virt-handler/device-manager/nvidia.go` is present in v1.9.0-beta.0 but not in the v1.8.x releases or `release-1.8`. The design discussion in kubevirt/kubevirt#17642 explicitly scopes vGPU *profile assignment* (`current_vgpu_type`) as a node-level concern for `gpu-operator` or a custom DaemonSet, which is exactly what this issue asks the operator to do.
- **Manual creation works:** `sriov-manage -e` + `echo <type-id> > .../nvidia/current_vgpu_type` yields a working, KubeVirt-consumable vGPU (validated on H200 with a whole-card profile).
- Rebinding VFs to `vfio-pci` is not a workaround — unbinding the VF resets `current_vgpu_type` to 0.
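The manual steps above can be sketched in Python as follows. This is an illustration only, not the driver's or the operator's API: the helper names are hypothetical, it must run as root after `sriov-manage -e` has enabled the VFs, and it assumes that each data line of `creatable_vgpu_types` begins with the numeric type ID followed by a colon. The `virtfnN` symlinks are the standard Linux SR-IOV sysfs entries under the Physical Function.

```python
from pathlib import Path


def list_vfs(pf_dir: str) -> list[Path]:
    """Resolve the virtfnN symlinks the kernel creates under an SR-IOV PF."""
    links = sorted(
        Path(pf_dir).glob("virtfn*"),
        key=lambda p: int(p.name[len("virtfn"):]),  # numeric, not lexical, order
    )
    return [p.resolve() for p in links]


def creatable_type_ids(vf_dir: Path) -> list[int]:
    """Collect the leading integer of each line (file format is an assumption)."""
    ids = []
    text = (vf_dir / "nvidia" / "creatable_vgpu_types").read_text()
    for line in text.splitlines():
        first = line.split(":")[0].strip()
        if first.isdigit():  # skips header and blank lines
            ids.append(int(first))
    return ids


def create_vgpu(vf_dir: Path, type_id: int) -> None:
    """Equivalent of `echo <type-id> > <VF>/nvidia/current_vgpu_type`."""
    if type_id not in creatable_type_ids(vf_dir):
        raise ValueError(f"type {type_id} is not creatable on {vf_dir.name}")
    (vf_dir / "nvidia" / "current_vgpu_type").write_text(f"{type_id}\n")
```

Checking `creatable_vgpu_types` first is a design choice of this sketch: it turns an invalid request into a clear error instead of relying on whatever the sysfs write returns.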

### Request

Add support in `vgpu-device-manager` (or a dedicated component) for the vendor-specific VFIO / SR-IOV per-VF model (`current_vgpu_type`), so vGPU devices are declaratively created on Ada Lovelace / Hopper / Blackwell, analogous to what already works for mdev GPUs today. The downstream discovery/advertisement side is already handled by KubeVirt.
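As a rough illustration of what "declaratively created" could mean for the per-VF model, the sketch below compares a desired mapping of VF address to vGPU type ID against the current sysfs state and only writes to VFs that are still unconfigured. Everything here is hypothetical: the function names, the shape of the desired-state mapping, and the assumption that `current_vgpu_type` reads back as a plain integer. It deliberately leaves VFs that already carry a different non-zero type untouched and reports them as conflicts, because how to safely change or remove an existing vGPU is outside what this issue describes.

```python
from pathlib import Path


def plan_vgpu_changes(desired: dict[str, int],
                      pci_root: str = "/sys/bus/pci/devices"):
    """Return (to_create, conflicts) for a desired {VF BDF: type id} mapping."""
    to_create, conflicts = [], []
    for bdf, want in sorted(desired.items()):
        path = Path(pci_root) / bdf / "nvidia" / "current_vgpu_type"
        current = int(path.read_text().strip())  # assumed to be a bare integer
        if current == want:
            continue  # already in the desired state
        if current == 0:
            to_create.append((bdf, want))  # unconfigured VF: safe to assign
        else:
            conflicts.append((bdf, current, want))  # leave for an operator decision
    return to_create, conflicts


def apply_plan(to_create, pci_root: str = "/sys/bus/pci/devices") -> None:
    """Write each planned type id to the VF's current_vgpu_type."""
    for bdf, want in to_create:
        path = Path(pci_root) / bdf / "nvidia" / "current_vgpu_type"
        path.write_text(f"{want}\n")
```

Separating planning from applying keeps the reconcile step idempotent and makes it easy to log or dry-run the intended changes before touching sysfs.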

### References

- #591 — mdev-era predecessor (2023, closed as stale); the underlying gap persists for current SR-IOV vGPU hardware.
- kubevirt/kubevirt#17642, kubevirt/kubevirt#16890 — KubeVirt-side support for the vendor-specific VFIO framework (implemented).
- Same gap tracked elsewhere: OpenNebula/one#6841, harvester/harvester#6487, cloud-hypervisor/cloud-hypervisor#7572.

