[WIP]OSDOCS-20033: Kueue 1.4 and DRA by StephenJamesSmith · Pull Request #113996 · openshift/openshift-docs

StephenJamesSmith · 2026-06-23T20:53:44Z

OCPSTRAT-2380 [DS/RN] Kueue 1.4 and DRA

Version: 4.21+

Jira: https://redhat.atlassian.net/browse/OSDOCS-20033

Previews:
https://113996--ocpdocs-pr.netlify.app/openshift-enterprise/latest/ai_workloads/kueue/kueue-dra-integrating-dynamic-resource-allocation.html

https://113996--ocpdocs-pr.netlify.app/openshift-enterprise/latest/ai_workloads/kueue/release-notes.html#release-notes-1.4_release-notes

Dev: @kannon92 @PannagaRao @sohankunkerkar
QE @MaysaMacedo @anahas-redhat

openshift-ci-robot · 2026-06-23T20:53:48Z

@StephenJamesSmith: This pull request references OSDOCS-20033 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target only the "5.0.0" version, but multiple target versions were set.

Details

In response to this:

DS/RN: [DS/RN] Kueue 1.4 and DRA

Version: 4.21+

Jira: https://redhat.atlassian.net/browse/OSDOCS-20033

Previews:

Dev: @kannon92 @PannagaRao
QE @MaysaMacedo @anahas-redhat

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

ocpdocs-previewbot · 2026-06-23T21:05:36Z

🤖 Tue Jun 30 19:46:17 - Prow CI generated the docs preview:

https://113996--ocpdocs-pr.netlify.app/openshift-enterprise/latest/ai_workloads/kueue/kueue-dra-integrating-dynamic-resource-allocation.html
https://113996--ocpdocs-pr.netlify.app/openshift-enterprise/latest/ai_workloads/kueue/release-notes.html

ocpdocs-vale-bot · 2026-06-23T21:07:25Z

+
+* Validation using dra-example-driver and nvidia-dra-driver.
+
+.Prerequisites


🤖 [error] AsciiDocDITA.BlockTitle: Block titles can only be assigned to examples, figures, and tables in DITA.

sohankunkerkar

Did a first pass; good start on the structure. A few things need to be adapted for OCP though; I left inline comments on each. Also, on OCP, the DRA config (feature gates, deviceClassMappings) goes through the Kueue CR rather than raw Configuration YAML. @PannagaRao, could you share the HackMD docs we created for Alice on how that works so we can update the examples to reflect what users actually do?

sohankunkerkar · 2026-06-24T13:05:04Z

+// * ai_workloads/kueue/kueue-dra-integrating-dynamic-resource-allocation.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="kueue-dra-partionable-devices_{context}"]


Suggested change

[id="kueue-dra-partionable-devices_{context}"]

[id="kueue-dra-partitionable-devices_{context}"]

ugh. sorry. good catch. fixed.

sohankunkerkar · 2026-06-24T13:11:12Z

+
+* Verification of partition capacity reclaim after workload completion.
+
+* Validation using dra-example-driver and nvidia-dra-driver.


dra-example-driver is an upstream test fixture, not something OCP users would install. OpenShift docs should reference the supported NVIDIA DRA driver for OCP, not upstream test tools. Drop dra-example-driver entirely.

sohankunkerkar · 2026-06-24T13:11:58Z

+* Validation using dra-example-driver and nvidia-dra-driver.
+
+.Prerequisites
+* Kueue is installed.


Suggested change

* Kueue is installed.

* {kueue-name} is installed.

sohankunkerkar · 2026-06-24T13:19:24Z

+
+.Prerequisites
+* Kueue is installed.
+* A Kubernetes cluster running version 1.34 or later.


It should be {product-title} 4.21 or later.

s/A Kubernetes cluster running version 1.34 or later. / {product-title} running version 4.21 or later.

sohankunkerkar · 2026-06-24T13:21:11Z

+.Prerequisites
+* Kueue is installed.
+* A Kubernetes cluster running version 1.34 or later.
+* A DRA driver installed in the cluster, for example, `dra-example-driver`` for testing, or a vendor driver such as NVIDIA `k8s-dra-driver-gpu` for production.


Drop dra-example-driver completely.

Replaced with "A DRA driver installed in the cluster, for example, nvidia-dra-driver or k8s-dra-driver-gpu."

sohankunkerkar · 2026-06-24T14:23:45Z

+metadata:
+  name: gpu.example.com
+spec:
+  extendedResourceName: example.com/gpu


Suggested change

extendedResourceName: example.com/gpu

extendedResourceName: nvidia.com/gpu

sohankunkerkar · 2026-06-24T14:41:46Z

+
+[source,terminal]
+----
+$ oc apply -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-queues.yaml


I think you might need to change this as per OCP docs. We can't use the upstream example as-is.

I think the oc apply command itself is fine; it's the upstream manifest URL that needs to change. Instead of pointing to kueue.sigs.k8s.io, inline the YAML directly in the doc and have the user create a local file, something like:

. Create a file called `cluster-queue.yaml` with the following content:
+
[source,yaml]
----
<the YAML here>
----

. Run the following command to apply the configuration:
+
[source,terminal]
----
$ oc apply -f cluster-queue.yaml
----

That way users don't need external network access and the YAML stays under our control if anything changes upstream.

Made these changes.

sohankunkerkar · 2026-06-24T14:43:03Z

+= Configuring the partionable devices
+
+[role="_abstract"]
+Use this procedure when your cluster has partitionable devices and you want quota to reflect actual device capacity rather than device count. This requires Kubernetes 1.35+ with the `DRAPartitionableDevices` feature gate enabled and a DRA driver that publishes `consumesCounters` in `ResourceSlice` objects.


{product-title} 4.22 or later. PD is beta in k8s 1.36 (OCP 4.22), not 1.35

s / "This requires Kubernetes 1.35+ with the DRAPartitionableDevices feature gate enabled and a DRA driver that publishes consumesCounters in ResourceSlice objects." / "This requires {product-title} 4.22 or later."

We can rework this if we feel that mentioning "Kubernetes 1.36 with the DRAPartitionableDevices feature gate enabled and a DRA driver that publishes consumesCounters in ResourceSlice objects" is necessary.

sohankunkerkar · 2026-06-24T14:44:52Z

+
+[source,terminal]
+----
+$ oc apply -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-counter-queues.yaml


I would stick with inline YAML instead of upstream URL

@sohankunkerkar Do you mean delete the oc command and use the following yaml?

#113996 (comment)

sohankunkerkar · 2026-06-24T14:52:30Z

+
+.Procedure
+
+. Add a `deviceClassMappings`` entry to the {kueue-name} configuration that maps each `DeviceClass` to a logical resource name for quota, as shown in the following example:


Suggested change

. Add a `deviceClassMappings`` entry to the {kueue-name} configuration that maps each `DeviceClass` to a logical resource name for quota, as shown in the following example:

. Add a `deviceClassMappings` entry to the {kueue-name} configuration that maps each `DeviceClass` to a logical resource name for quota, as shown in the following example:

StephenJamesSmith · 2026-06-25T16:56:21Z

@sohankunkerkar @anahas-redhat Please review latest changes and lgtm if ready.

anahas-redhat

@StephenJamesSmith below you can find my comments.

anahas-redhat · 2026-06-25T13:59:07Z

@@ -0,0 +1,154 @@
+// Module included in the following assemblies:


I guess we have two documents with similar names:

kueue-dra-partionable-devices.adoc

kueue-dra-partitionable-devices.adoc

I'm assuming "kueue-dra-partionable-devices.adoc" is the one to be excluded, so my comments will be on the second .adoc. Please let me know otherwise.

First document has been deleted and not included in the build.

anahas-redhat · 2026-06-25T14:04:53Z

+
+include::modules/kueue-dra-resourceclaimtemplates.adoc[leveloffset=+2]
+
+// include::modules/kueue-dra-deviceclasses.adoc[leveloffset=+2]


Should this line be commented out?

Yes. it's a file that wasn't needed. I'm removing it from the Assembly file.

anahas-redhat · 2026-06-25T14:43:05Z

+
+.Prerequisites
+* {kueue-name} is installed.
+* {product-title} running version 4.21 or later.


Partitionable Devices requires OCP 4.22+ (K8s 1.35 with DRAPartitionableDevices gate). The PD module itself says "4.22 or later" — contradicting this prerequisite. The prerequisite should either list both versions or note the PD exception.

Added a Note: "To use partitionable devices, you need {product-title} 4.22 or later. "

anahas-redhat · 2026-06-25T14:44:42Z

+
+.Procedure
+
+. Enable the feature gates by installing or reconfiguring {kueue-name} with both feature gates enabled, as shown in the following example:


I think the users won't need to enable any feature gate by themselves. Good to double-check that info with @sohankunkerkar

That’s correct. We explicitly enable them in the operator ConfigMap, so no additional action is required from the user side.

@anahas-redhat Should this step be removed?

Removed the step.

anahas-redhat · 2026-06-25T14:57:25Z

+
+DRA is a Kubernetes framework that manages specialized hardware resources such as GPUs with fine-grained control. Unlike traditional resource requests, DRA allows dynamic prioritization—allocating GPUs to high-priority AI training workloads during business hours, then reallocating them to cost-optimized batch jobs overnight.
+
+You can validate partitionable devices support in {kueue-name} Dynamic Resource Allocation (DRA) integration, covering partition-aware quota, admission, and scheduling. Partitionable devices, such as NVIDIA MIG, allow graphics processing units (GPUs) to be dynamically subdivided into smaller allocations. {kueue-name} must correctly handle quota accounting for these mutually exclusive partition configurations.


It's good that you've mentioned Partitionable Devices here. Can you also mention something about Structured Parameters and Extended Resources? They are also "features" or "resources" provided by DRA.

I added some text for these.

anahas-redhat · 2026-06-25T18:43:03Z

+
+[source,yaml]
+----
+apiVersion: config.kueue.x-k8s.io/v1beta2


On OCP with the Kueue Operator, the user cannot create or apply a Configuration object directly. The operator owns it — it generates the Configuration from the Kueue CR and writes it into a ConfigMap that the controller reads. If a user tries to oc apply that YAML, there's no CRD for config.kueue.x-k8s.io/v1beta2 Configuration — it's not a Kubernetes resource you create, it's an embedded config format.

A possible way to set deviceClassMappings on OCP is through the Kueue CR:

oc patch kueue cluster --type=merge -p '{
  "spec": {
    "config": {
      "resources": {
        "deviceClassMappings": [
          {"name": "nvidia.com/gpu", "deviceClassNames": ["gpu.nvidia.com"]}
        ]
      }
    }
  }
}'

@sohankunkerkar ?

// Module included in the following assemblies:
//
// * ai_workloads/kueue/kueue-dra-integrating-dynamic-resource-allocation.adoc

:_mod-docs-content-type: PROCEDURE
[id="kueue-dra-partitionable-devices_{context}"]
= Configuring partitionable devices

[role="_abstract"]
You can configure {kueue-name} to manage quota for partitionable devices based on actual device capacity rather than device count. Partitionable devices, such as NVIDIA Multi-Instance GPU (MIG) capable GPUs like the A100 or H100, allow a single GPU to be dynamically subdivided into smaller partitions. When counter-based quota is configured, {kueue-name} charges quota in capacity units such as GPU memory rather than counting whole devices. For example, a `1g.5gb` MIG partition on an A100-40GB charges `4864Mi` of GPU memory quota, while a whole GPU charges `40320Mi`.

.Prerequisites

* You have cluster administrator permissions.
* You have installed {kueue-name} by using the {kueue-op}.
* You have created a `Kueue` CR.
* Your cluster is running {product-title} 4.22 or later.
* A DRA driver that publishes `consumesCounters` in `ResourceSlice` objects is installed, for example, `nvidia-dra-driver`.
* MIG is enabled on the GPU hardware.
* You have enabled the `DRAPartitionableDevices` Kubernetes feature gate by adding the `CustomNoUpgrade` feature set to the `FeatureGate` CR named `cluster`, as shown in the following example:
+
[source,yaml]
----
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - DRAPartitionableDevices
----
+
[WARNING]
====
Enabling the `CustomNoUpgrade` feature set on your cluster cannot be undone and prevents minor version updates. This feature set is not supported on production clusters. For information about enabling feature gates, see "Enabling features using feature gates".
====

.Procedure

. Verify that your DRA driver publishes counter data by running the following command:
+
[source,terminal]
----
$ oc get resourceslices -o jsonpath='{range .items[*]}{.spec.driver}{"\t"}{range .spec.devices[*]}{.name}: {.consumesCounters}{"\n"}{end}{end}'
----
+
.Example output
[source,terminal]
----
gpu.nvidia.com gpu-0: [{"counterSet":"shared","counters":{"memory":{"value":"40Gi"}}}]
----
+
If the output does not show `consumesCounters` data, verify that your DRA driver version supports partitionable devices and that MIG is enabled on the GPU hardware.

. Configure counter-based quota by adding a `deviceClassMappings` entry with a `sources` section to the `config.resources` section of the {kueue-name} CR, as shown in the following example:
+
[source,yaml]
----
apiVersion: kueue.openshift.io/v1
kind: Kueue
metadata:
  name: cluster
  namespace: openshift-kueue-operator
spec:
  config:
    resources:
      deviceClassMappings:
      - name: gpu.memory # <1>
        deviceClassNames: # <2>
        - gpu.nvidia.com
        - mig.nvidia.com
        sources: # <3>
        - type: Counter
          counter:
            name: memory # <4>
            driver: gpu.nvidia.com
            deviceSelector: # <5>
              type: CEL
              cel:
                expression: "device.driver == 'gpu.nvidia.com'"
# ...
----
<1> The logical resource name used in `ClusterQueue` quotas. When counter-based sources are configured, quota is charged in capacity units rather than device count.
<2> The `DeviceClass` names that map to this resource. Include both the whole-GPU class (`gpu.nvidia.com`) and the MIG class (`mig.nvidia.com`).
<3> Defines how {kueue-name} computes the quota charge.
<4> The counter name must match a counter key published by the DRA driver in `ResourceSlice` devices.
<5> Scopes which devices are eligible for counter-based quota accounting.
+
[NOTE]
====
The {kueue-name} operator automatically enables the required {kueue-name} feature gates when it detects the `DRAPartitionableDevices` Kubernetes feature gate and `sources` are configured in `deviceClassMappings`. No manual {kueue-name} feature gate configuration is required.
====

. Create a `ClusterQueue` with counter-based quota. Set the quota in capacity units rather than device count. Create a file called `pd-queues.yaml` with the following content:
+
.Example quota configuration for partitionable devices
[source,yaml]
----
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "gpu.memory"] # <1>
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 40
      - name: "memory"
        nominalQuota: 200Gi
      - name: "gpu.memory" # <2>
        nominalQuota: 800Gi
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: "team-a"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
----
<1> The `gpu.memory` entry must match the `name` value in `deviceClassMappings`.
<2> Sets the total GPU memory quota. For example, `800Gi` accommodates twenty A100-40GB GPUs or equivalent MIG partitions.
+
[NOTE]
====
When `ClusterQueue` objects share a cohort, ensure all queues use the same unit scale for counter resources. {kueue-name} does not validate unit consistency across `ClusterQueue` objects.
====

. Apply the quota configuration by running the following command:
+
[source,terminal]
----
$ oc apply -f pd-queues.yaml
----

. Create a workload that requests a MIG partition. Create a file called `pd-job.yaml` with the following content:
+
.Example workload requesting a MIG partition
[source,yaml]
----
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: team-a
  name: gpu-partition
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: mig.nvidia.com # <1>
          count: 1
          selectors:
          - cel:
              expression: "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'" # <2>
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: pd-test-job-
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: user-queue # <3>
spec:
  template:
    spec:
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
        resources:
          claims:
          - name: gpu
          requests:
            cpu: "1"
            memory: "200Mi"
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: gpu-partition # <4>
      restartPolicy: Never
----
<1> References the MIG `DeviceClass`.
<2> Selects a specific MIG partition profile. Available profiles depend on the GPU model, for example, `1g.5gb`, `2g.10gb`, `3g.20gb`, or `7g.40gb` for the A100-40GB.
<3> Identifies the local queue to submit the job to.
<4> References the `ResourceClaimTemplate` defined above. The `ResourceClaimTemplate` must exist in the same namespace as the job.

. Create the workload by running the following command:
+
[source,terminal]
----
$ oc create -f pd-job.yaml
----

.Verification

. Verify that the workload is admitted and that quota was charged in capacity units by running the following command:
+
[source,terminal]
----
$ oc -n team-a get workloads -o jsonpath='{range .items[*]}{.metadata.name}: {.status.admission.podSetAssignments[0].resourceUsage}{"\n"}{end}'
----
+
.Example output
[source,terminal]
----
job-pd-test-job-xxxxx: {"cpu":"1","gpu.memory":"4864Mi","memory":"200Mi"}
----
+
The `gpu.memory` value reflects the actual memory capacity of the requested MIG partition rather than a device count of `1`.

. If the workload is not admitted, verify the following:
+
* The `DRAPartitionableDevices` Kubernetes feature gate is enabled on the cluster.
* The `deviceClassMappings` `name` value matches the resource name in `coveredResources`.
* The `counter.name` in `sources` matches a counter key in the `ResourceSlice` objects.
* The `ClusterQueue` has sufficient GPU memory quota for the requested partition size.
* MIG is enabled on the GPU hardware.

Replaced the procedure with the above.
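
For reviewers checking the numbers in the module, here is a rough Python sketch of the counter-based arithmetic. It uses only the figures quoted above (`800Gi` of `gpu.memory` quota, `4864Mi` per `1g.5gb` partition, `40320Mi` per whole GPU). This is not how Kueue computes admission: the helper names `to_mi` and `max_admitted` are made up for illustration, and the sketch ignores cohorts, borrowing, and the CPU and memory quotas.

```python
# Illustrative only: hypothetical helpers, not part of Kueue or any Kubernetes library.
UNITS_IN_MI = {"Mi": 1, "Gi": 1024}


def to_mi(quantity: str) -> int:
    """Convert a quantity such as '800Gi' or '4864Mi' to an integer number of Mi."""
    for suffix, factor in UNITS_IN_MI.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity!r}")


def max_admitted(nominal_quota: str, charge_per_workload: str) -> int:
    """Return how many identical workloads fit in the quota, counting only gpu.memory."""
    return to_mi(nominal_quota) // to_mi(charge_per_workload)


if __name__ == "__main__":
    # Figures taken from the module text above.
    print("1g.5gb partitions:", max_admitted("800Gi", "4864Mi"))
    print("whole GPUs:", max_admitted("800Gi", "40320Mi"))
```

The whole-GPU case comes out to twenty, which agrees with the callout on the `gpu.memory` quota.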

anahas-redhat · 2026-06-25T18:45:53Z

+
+[source,terminal]
+----
+$ oc apply -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-queues.yaml


Instead of pointing to an external link, can we follow the same as https://github.com/openshift/openshift-docs/pull/113996/changes#diff-50e686ab942a05f0afecd2feb3d4f0e5f49175c442dbe9a013e83a970da6d9fcR49? (where cluster-queue.yaml file was created).

anahas-redhat · 2026-06-25T18:47:35Z

+
+[source,terminal]
+----
+$ oc apply -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-queues.yaml


Instead of pointing to an external link, can we follow the same as https://github.com/openshift/openshift-docs/pull/113996/changes#diff-50e686ab942a05f0afecd2feb3d4f0e5f49175c442dbe9a013e83a970da6d9fcR49? (where cluster-queue.yaml file was created).

anahas-redhat · 2026-06-25T18:48:51Z

+spec:
+  clusterQueue: "cluster-queue"
+----
+


The procedure configures infrastructure (deviceClassMappings, ClusterQueue, LocalQueue) but never shows a workload YAML. A user finishes this procedure and doesn't know what a Job with a ResourceClaimTemplate looks like. The PD module includes a workload example — the RCT module could include it too.

@sohankunkerkar ?

What clarification do you need, @StephenJamesSmith?

@anahas-redhat I met with Sohan last week and he said he would go through your comments that I had questions about (you were not able to attend the meeting). I put these comments (@sohankunkerkar ?) to indicate the comments he should address.

@StephenJamesSmith feel free to reschedule if you need clarification.

anahas-redhat · 2026-06-25T18:53:18Z

+[role="_abstract"]
+Dynamic Resource Allocation (DRA) structured parameters is a Kubernetes feature that enables declarative management of specialized hardware such as GPUs, FPGAs, and network adapters. In the context of {kueue-name}, it provides quota management for workloads that use these devices.
+
+{kueue-name} provides two approaches for managing the DRA device quota:


The Structured Parameters concept module introduces two DRA paths but never explains what structured parameters actually means, why a user would choose the ResourceClaimTemplate path over the Extended Resources path, or what capabilities each path provides. Without this guidance, users have no basis for choosing between the two approaches.

@sohankunkerkar ?

Is there any question here @StephenJamesSmith ?

Here is a suggestion for how this text could read (feel free to change it):

[role="_abstract"]
Dynamic Resource Allocation (DRA) is a Kubernetes framework that provides structured discovery and allocation of specialized hardware such as GPUs. DRA drivers publish device information through ResourceSlice objects, and administrators group devices into named categories using DeviceClass objects.

Without {kueue-name} DRA integration, GPU requests made through DRA are invisible to quota management. {kueue-name} cannot account for these requests when admitting workloads, which can result in teams exceeding their GPU allocation.

{kueue-name} provides two approaches for managing DRA device quota:

ResourceClaimTemplate:: The default approach. Workloads explicitly reference a
ResourceClaimTemplate that defines device requirements. Administrators configure
deviceClassMappings in the Kueue CR to map each DeviceClass to a logical resource
name for quota tracking. Use this approach when workloads need fine-grained control over
device selection, such as targeting a specific GPU model or architecture using CEL selectors.

Extended resources:: A simplified alternative that allows workloads to use standard
Kubernetes resources.requests syntax, for example, nvidia.com/gpu: "1", instead of explicitly creating DRA objects. When a DeviceClass includes the
spec.extendedResourceName field, the Kubernetes scheduler automatically generates ResourceClaim objects. {kueue-name} detects this and charges quota only once, preventing double counting. Use this approach when you want the simplest possible user experience and backward compatibility with existing workload YAML.

For clusters with partitionable devices such as NVIDIA Multi-Instance GPU (MIG), {kueue-name} can also charge quota in capacity units, such as GPU memory, rather than device count.
Partitionable devices use ResourceClaimTemplates with CEL selectors to target specific partition profiles, and require administrators to configure counter-based sources in deviceClassMappings. This capability requires {product-title} 4.22 or later.

Replaced with the above text.
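
As a reading aid for the ResourceClaimTemplate path, the following Python sketch shows the idea of mapping each `DeviceClass` to a logical resource name and counting devices against it. It is a simplified model, not Kueue code: `build_index` and `quota_usage` are hypothetical names, the mapping values are borrowed from the `oc patch` example earlier in this thread, and raising an error for an unmapped `DeviceClass` is an assumption about behavior rather than something this thread confirms.

```python
# Illustrative only: a toy model of deviceClassMappings, not the Kueue implementation.

def build_index(mappings: list) -> dict:
    """Map every DeviceClass name to the logical resource name used for quota."""
    index = {}
    for mapping in mappings:
        for device_class in mapping["deviceClassNames"]:
            index[device_class] = mapping["name"]
    return index


def quota_usage(device_requests: list, index: dict) -> dict:
    """Sum requested device counts per logical resource name."""
    usage = {}
    for request in device_requests:
        device_class = request["deviceClassName"]
        if device_class not in index:
            # Assumption: an unmapped DeviceClass cannot be charged against quota.
            raise KeyError(f"no deviceClassMappings entry for {device_class}")
        resource = index[device_class]
        usage[resource] = usage.get(resource, 0) + request.get("count", 1)
    return usage


if __name__ == "__main__":
    mappings = [{"name": "nvidia.com/gpu", "deviceClassNames": ["gpu.nvidia.com"]}]
    requests = [
        {"deviceClassName": "gpu.nvidia.com", "count": 2},
        {"deviceClassName": "gpu.nvidia.com"},
    ]
    print(quota_usage(requests, build_index(mappings)))
```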

sohankunkerkar · 2026-06-29T03:44:18Z

+toc::[]
+
+[role="_abstract"]
+{kueue-name} Dynamic Resource Allocation (DRA) integration enables advanced management of specialized hardware resources like GPUs, FPGAs, and other accelerators within Kubernetes workload queuing. This integration allows for the reading and publishing of ResourceSlices, counter-based quota computation, and specific admission behaviors. 


Suggested change

{kueue-name} Dynamic Resource Allocation (DRA) integration enables advanced management of specialized hardware resources like GPUs, FPGAs, and other accelerators within Kubernetes workload queuing. This integration allows for the reading and publishing of ResourceSlices, counter-based quota computation, and specific admission behaviors.

You can configure {kueue-name} to manage quota for workloads that use Dynamic Resource Allocation (DRA) to request GPUs. When DRA quota management is configured, {kueue-name} counts DRA device requests toward quota in the same way that it counts traditional resources such as CPU and memory.

sohankunkerkar · 2026-06-29T03:44:50Z

+[role="_abstract"]
+{kueue-name} Dynamic Resource Allocation (DRA) integration enables advanced management of specialized hardware resources like GPUs, FPGAs, and other accelerators within Kubernetes workload queuing. This integration allows for the reading and publishing of ResourceSlices, counter-based quota computation, and specific admission behaviors. 
+
+DRA is a Kubernetes framework that manages specialized hardware resources such as GPUs with fine-grained control. Unlike traditional resource requests, DRA allows dynamic prioritization—allocating GPUs to high-priority AI training workloads during business hours, then reallocating them to cost-optimized batch jobs overnight.


Suggested change

DRA is a Kubernetes framework that manages specialized hardware resources such as GPUs with fine-grained control. Unlike traditional resource requests, DRA allows dynamic prioritization—allocating GPUs to high-priority AI training workloads during business hours, then reallocating them to cost-optimized batch jobs overnight.

If DRA device quota is not configured, {kueue-name} does not account for GPU requests when admitting workloads, which can result in teams exceeding their GPU allocation.

sohankunkerkar · 2026-06-29T03:45:43Z

+DRA is a Kubernetes framework that manages specialized hardware resources such as GPUs with fine-grained control. Unlike traditional resource requests, DRA allows dynamic prioritization—allocating GPUs to high-priority AI training workloads during business hours, then reallocating them to cost-optimized batch jobs overnight.
+
+You can validate partitionable devices support in {kueue-name} Dynamic Resource Allocation (DRA) integration, covering partition-aware quota, admission, and scheduling. Partitionable devices, such as NVIDIA MIG, allow graphics processing units (GPUs) to be dynamically subdivided into smaller allocations. {kueue-name} must correctly handle quota accounting for these mutually exclusive partition configurations.
+


L15-L25, not required.

@sohankunkerkar Deleting lines 15-25 would leave the 2 bulleted items on lines 26-28 hanging. Should those be deleted too?

Are you talking about the first two points from the Prerequisites?

No, the last two bullets, shown here:

Rewrote this topic as per Sohan's input.

sohankunkerkar · 2026-06-29T03:47:56Z

+.Prerequisites
+* {kueue-name} is installed.
+* {product-title} running version 4.21 or later.
+* A DRA driver installed in the cluster, for example, `nvidia-dra-driver` or `k8s-dra-driver-gpu`.


Suggested change

* A DRA driver installed in the cluster, for example, `nvidia-dra-driver` or `k8s-dra-driver-gpu`.

.Prerequisites

* You have installed {kueue-name} by using the {kueue-op}.
* You have created a `Kueue` custom resource (CR).
* Your cluster is running {product-title} 4.21 or later.
* A DRA driver is installed in the cluster, for example, `nvidia-dra-driver`. You can verify that the DRA driver is publishing device information by running the following command:
+
[source,terminal]
----
$ oc get resourceslices
----
+
If the command returns one or more `ResourceSlice` objects, the DRA driver is running.

* At least one `DeviceClass` object exists in the cluster. You can verify this by running the following command:
+
[source,terminal]
----
$ oc get deviceclass
----

Added the above changes.

sohankunkerkar · 2026-06-29T03:53:32Z

+= Configuring the extended resources path
+
+[role="_abstract"]
+You need to create an extended resources path that users submit workloads using the standard `resources.requests` syntax, for example, `nvidia.com/gpu: 1`, and a `DeviceClass` with `spec.extendedResourceName` that exists in the cluster.


Suggested change

You need to create an extended resources path that users submit workloads using the standard `resources.requests` syntax, for example, `nvidia.com/gpu: 1`, and a `DeviceClass` with `spec.extendedResourceName` that exists in the cluster.

You can configure {kueue-name} to manage quota for workloads that request GPUs by using the standard `resources.requests` syntax, for example, `nvidia.com/gpu: "1"`. When a `DeviceClass` includes the `spec.extendedResourceName` field, the Kubernetes scheduler automatically generates `ResourceClaim` objects. This path does not require `deviceClassMappings` configuration because {kueue-name} auto-discovers the mapping by indexing `DeviceClass` objects.

You can also add this:

[NOTE]
====
The {kueue-name} operator automatically enables the required {kueue-name} feature gates when it detects the `DRAExtendedResource` Kubernetes feature gate on the cluster. No manual {kueue-name} feature gate configuration is required.
====

Here's my idea for this doc:

// Module included in the following assemblies:
//
// * ai_workloads/kueue/kueue-dra-integrating-dynamic-resource-allocation.adoc

:_mod-docs-content-type: PROCEDURE
[id="kueue-dra-extended-resources_{context}"]
= Configuring the extended resources path

[role="_abstract"]
You can configure {kueue-name} to manage quota for workloads that request GPUs by using the standard `resources.requests` syntax, for example, `nvidia.com/gpu: "1"`. When a `DeviceClass` includes the `spec.extendedResourceName` field, the Kubernetes scheduler automatically generates `ResourceClaim` objects. This path does not require `deviceClassMappings` configuration because {kueue-name} auto-discovers the mapping by indexing `DeviceClass` objects.

[NOTE]
====
The {kueue-name} operator automatically enables the required {kueue-name} feature gates when it detects the `DRAExtendedResource` Kubernetes feature gate on the cluster. No manual {kueue-name} feature gate configuration is required.
====

.Prerequisites

* You have cluster administrator permissions.
* You have installed {kueue-name} by using the {kueue-op}.
* You have created a `Kueue` CR.
* A DRA driver is installed and has published `ResourceSlice` objects.
* You have enabled the `DRAExtendedResource` Kubernetes feature gate by adding the `CustomNoUpgrade` feature set to the `FeatureGate` CR named `cluster`, as shown in the following example:
+
[source,yaml]
----
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - DRAExtendedResource
----
+
[WARNING]
====
Enabling the `CustomNoUpgrade` feature set on your cluster cannot be undone and prevents minor version updates. This feature set is not supported on production clusters. For information about enabling feature gates, see "Enabling features using feature gates".
====

.Procedure

. Verify that the `DeviceClass` has `spec.extendedResourceName` set by running the following command:
+
[source,terminal]
----
$ oc get deviceclass gpu.nvidia.com -o jsonpath='{.spec.extendedResourceName}'
----
+
.Example output
[source,terminal]
----
nvidia.com/gpu
----
+
If the command does not return a value, add the `extendedResourceName` field by running the following command:
+
[source,terminal]
----
$ oc patch deviceclass gpu.nvidia.com --type=merge -p '{"spec":{"extendedResourceName":"nvidia.com/gpu"}}'
----

. Create a `ClusterQueue` that includes the GPU resource in `coveredResources`. Create a file called `er-queues.yaml` with the following content:
+
.Example quota configuration for extended resources
[source,yaml]
----
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 40
      - name: "memory"
        nominalQuota: 200Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: "team-a"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
----

. Apply the quota configuration by running the following command:
+
[source,terminal]
----
$ oc apply -f er-queues.yaml
----

. Create a workload that uses the standard resource request syntax. Create a file called `er-job.yaml` with the following content:
+
.Example workload using extended resources
[source,yaml]
----
apiVersion: batch/v1
kind: Job
metadata:
  generateName: er-test-job-
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: user-queue # <1>
spec:
  template:
    spec:
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
        resources:
          requests:
            cpu: "1"
            memory: "200Mi"
            nvidia.com/gpu: "1" # <2>
      restartPolicy: Never
----
<1> Identifies the local queue to submit the job to.
<2> Requests a GPU by using the standard extended resource syntax. No `ResourceClaimTemplate` or `resourceClaims` section is needed. The `DeviceClass` with `spec.extendedResourceName` causes the Kubernetes scheduler to generate a `ResourceClaim` automatically.

. Create the workload by running the following command:
+
[source,terminal]
----
$ oc create -f er-job.yaml
----

.Verification

. Verify that a workload has been created and admitted by running the following command:
+
[source,terminal]
----
$ oc -n team-a get workloads
----
+
.Example output
[source,terminal]
----
NAME                         QUEUE        RESERVED IN     ADMITTED   AGE
job-er-test-job-4m2x-d3f4g   user-queue   cluster-queue   True       10s
----

. Verify that a `ResourceClaim` was automatically created by running the following command:
+
[source,terminal]
----
$ oc -n team-a get resourceclaims
----
+
The Kubernetes scheduler creates a `ResourceClaim` for each pod that requests an extended resource backed by a `DeviceClass`.
+
If the workload is not admitted, verify the following:
+
* The `DRAExtendedResource` Kubernetes feature gate is enabled on the cluster.
* The `DeviceClass` has `spec.extendedResourceName` set.
* The `ClusterQueue` includes the extended resource name in `coveredResources`.
* The `ClusterQueue` has sufficient quota available.

You can change this as per openshift docs standards.

Replaced with the above procedure.

ocpdocs-vale-bot · 2026-06-29T13:40:39Z

+* You have created a `Kueue` custom resource (CR).
+* Your cluster is running {product-title} 4.21 or later.
+
+====


🤖 [error] AsciiDocDITA.ExampleBlock: Example blocks can not be inside of other blocks in DITA.

anahas-redhat · 2026-06-29T20:20:46Z

@StephenJamesSmith @sohankunkerkar the material we discussed is linked below.

Structured Parameters: OCPKUEUE-574 - https://redhat.atlassian.net/browse/OCPKUEUE-574
- Check attachments → test-plan-dra-kueue.md
- Focus on: Prerequisites + TC-01: Basic Workload Lifecycle (submit → queue → admit → schedule → run → complete)
- High-level flow: Check ResourceSlices → check DeviceClass → configure deviceClassMappings with DeviceClass name in Kueue operand → create ClusterQueue/LocalQueue → create ResourceClaimTemplate → submit Job with resourceClaims → verify pod is Running → job completes
Extended Resources: OCPKUEUE-576 - https://redhat.atlassian.net/browse/OCPKUEUE-576
- Check attachments → test-plan-dra-extended-resources.md
- Focus on: Prerequisites + TS1: Workload Types — Pod, Deployment & StatefulSet with Extended Resources
- Note: Sohan already provided an example using Jobs — you can use it too
- High-level flow: Enable DRAExtendedResource feature gate → wait MCP rollout → patch DeviceClass with extendedResourceName → create ClusterQueue/LocalQueue with extended resource quota → submit Job with resources.requests → verify pod is Running → job completes
Partitionable Devices: OCPKUEUE-575 - https://redhat.atlassian.net/browse/OCPKUEUE-575
- Check attachments → test-plan-dra-pd-basic.md
- Focus on: Prerequisites + TC-01: Single Partition Counter Charge (Job)
- High-level flow: Check ResourceSlices expose partitions → check DeviceClass → configure deviceClassMappings with DeviceClass name in Kueue operand → create ClusterQueue/LocalQueue → create ResourceClaimTemplate with partition selector → submit Job with resourceClaims → verify pod is Running → job completes
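All three flows above start with the same two checks (ResourceSlices, then DeviceClass). As a rough sketch only, those checks could be wrapped as follows. The function name `check_dra_prereqs` is hypothetical, and `gpu.nvidia.com` is just the DeviceClass name used elsewhere in this thread; substitute whatever class your DRA driver publishes.

```shell
# Hypothetical helper: inspect what the DRA driver has published before
# configuring quota. Requires a logged-in `oc` session with cluster access.
check_dra_prereqs() {
  device_class="${1:-gpu.nvidia.com}"  # assumed DeviceClass name

  # List the ResourceSlice objects published by the DRA driver.
  oc get resourceslices

  # Show the DeviceClass, including spec.extendedResourceName if it is set.
  oc get deviceclass "$device_class" -o yaml
}
```

For example, running `check_dra_prereqs gpu.nvidia.com` on a cluster with the driver installed should list at least one ResourceSlice before you continue with the rest of the flow.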

Important note: I am not sure whether this will be a separate doc or just a note, but versions 4.18–4.20 are not supported.

anahas-redhat · 2026-06-30T14:46:22Z

+
+* xref:../../nodes/pods/nodes-pods-allocate-dra.adoc#nodes-pods-allocate-dra[Allocating GPUs to pods by using DRA]
+
+


The levels below seem to be not quite correct. I guess we should have something like this:

= Integrating Dynamic Resource Allocation
== DRA quota management overview
=== Configuring ResourceClaimTemplates
=== Configuring Extended Resources
=== Configuring Partitionable Devices

This way the concept module introduces all three, and the procedures sit underneath it.

updated the levels.

anahas-redhat · 2026-06-30T14:52:38Z

+
+:_mod-docs-content-type: CONCEPT
+[id="kueue-dra-structured-parameters_{context}"]
+= Structured parameters


Suggestion: change from Structured Parameters to DRA quota management.

Why? "Structured parameters" is not an umbrella term — it's a specific Kubernetes DRA implementation (KEP #4381) that replaced "classic DRA" (KEP #3063, withdrawn in K8s 1.32). It means the scheduler can understand device attributes directly via ResourceSlices and DeviceClasses, rather than depending on opaque third-party drivers.

In that sense, all of DRA as it exists today IS structured parameters.

Changed the title.

anahas-redhat · 2026-06-30T14:55:23Z

+[role="_abstract"]
+Dynamic Resource Allocation (DRA) structured parameters is a Kubernetes feature that enables declarative management of specialized hardware such as GPUs, FPGAs, and network adapters. In the context of {kueue-name}, it provides quota management for workloads that use these devices.
+
+{kueue-name} provides two approaches for managing the DRA device quota:


Suggestion of how this text can be (feel free to change):

[role="_abstract"]
Dynamic Resource Allocation (DRA) is a Kubernetes framework that provides structured discovery and allocation of specialized hardware such as GPUs. DRA drivers publish device information through ResourceSlice objects, and administrators group devices into named categories using DeviceClass objects.

Without {kueue-name} DRA integration, GPU requests made through DRA are invisible to quota management. {kueue-name} cannot account for these requests when admitting workloads, which can result in teams exceeding their GPU allocation.

{kueue-name} provides two approaches for managing DRA device quota:

ResourceClaimTemplate:: The default approach. Workloads explicitly reference a
ResourceClaimTemplate that defines device requirements. Administrators configure
deviceClassMappings in the Kueue CR to map each DeviceClass to a logical resource
name for quota tracking. Use this approach when workloads need fine-grained control over
device selection, such as targeting a specific GPU model or architecture using CEL selectors.

Extended resources:: A simplified alternative that allows workloads to use standard
Kubernetes resources.requests syntax, for example, nvidia.com/gpu: "1", instead of explicitly creating DRA objects. When a DeviceClass includes the
spec.extendedResourceName field, the Kubernetes scheduler automatically generates ResourceClaim objects. {kueue-name} detects this and charges quota only once, preventing double counting. Use this approach when you want the simplest possible user experience and backward compatibility with existing workload YAML.

For clusters with partitionable devices such as NVIDIA Multi-Instance GPU (MIG), {kueue-name} can also charge quota in capacity units, such as GPU memory, rather than device count.
Partitionable devices use ResourceClaimTemplates with CEL selectors to target specific partition profiles, and require administrators to configure counter-based sources in deviceClassMappings. This capability requires {product-title} 4.22 or later.

ocpdocs-vale-bot · 2026-06-30T18:18:29Z

  - Applies immediate admission penalties to prevent resource monopolization

-For more information, see xref:../../ai_workloads/kueue/admission-fair-sharing.adoc#admission-fair-sharing[Admission fair sharing].
+For more information, see xref:../../ai_workloads/kueue/admission-fair-sharing.adoc#admission-fair-sharing[Admission fair sharing].


🤖 [error] OpenShiftAsciiDoc.NoXrefInModules: Do not include xrefs in modules, only assemblies (exception: release notes modules).

ocpdocs-vale-bot · 2026-06-30T18:18:31Z

+Dynamic Resource Allocation (DRA) quota management for GPUs (Technology Preview)::
+{kueue-name} now supports quota management for workloads that request GPUs through Dynamic Resource Allocation (DRA). When configured, {kueue-name} tracks DRA device requests toward quota alongside traditional resources such as CPU and memory, preventing teams from exceeding their allocated GPU resources.
+
+For more information, see xref:../../ai_workloads/kueue/kueue-dra-integrating-dynamic-resource-allocation.adoc#ueue-dra-integrating-dynamic-resource-allocation[Integrating Dynamic Resource Allocation].


🤖 [error] OpenShiftAsciiDoc.NoXrefInModules: Do not include xrefs in modules, only assemblies (exception: release notes modules).

anahas-redhat

@StephenJamesSmith thanks for the latest round of updates — the ER and PD modules look much better after Sohan's rewrites. A few items are still outstanding from earlier review rounds:

`modules/kueue-dra-resourceclaimtemplates.adoc`

1. Line 18: Still uses Configuration object instead of Kueue CR (earlier comment)

The YAML shows:

apiVersion: config.kueue.x-k8s.io/v1beta2
kind: Configuration

This is not a CRD on OCP — users cannot oc apply it. The operator owns the Configuration and generates it from the Kueue CR. Replace step 1 with an oc patch against the Kueue CR:

$ oc patch kueue cluster -n openshift-kueue-operator --type=merge -p '{
  "spec": {
    "config": {
      "resources": {
        "deviceClassMappings": [{
          "name": "nvidia.com/gpu",
          "deviceClassNames": ["gpu.nvidia.com"]
        }]
      }
    }
  }
}'

2. Line 46: Still points to upstream URL (earlier comment)

$ oc apply -f https://kueue.sigs.k8s.io/examples/dra/sample-dra-queues.yaml

This should be inlined as a local file (e.g., "Create a file called rct-queues.yaml with the following content:"), same pattern already used in the ER module (er-queues.yaml at kueue-dra-extended-resources.adoc:75) and PD module (pd-queues.yaml at kueue-dra-partitionable-devices.adoc:113).
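As a starting point for item 2, the inlined file could be produced with something like the sketch below. It is adapted from the `er-queues.yaml` content already in this PR, with the namespace changed to `default` to match the RCT module. The quota values are placeholders carried over from that example, and `nvidia.com/gpu` is assumed to be the logical resource name set in `deviceClassMappings`.

```shell
# Write a local quota configuration for the ResourceClaimTemplate path.
# Adapted from er-queues.yaml in this PR; quota values are placeholders.
cat > rct-queues.yaml <<'EOF'
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 40
      - name: "memory"
        nominalQuota: 200Gi
      - name: "nvidia.com/gpu"  # must match the deviceClassMappings name
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
EOF
```

The procedure step would then apply the file with `oc apply -f rct-queues.yaml`, the same pattern the ER module uses.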

3. After line 83: Missing workload example and verification (earlier comment)

The procedure ends after ClusterQueue/LocalQueue creation but never shows a Job + ResourceClaimTemplate example. The ER module has er-job.yaml (kueue-dra-extended-resources.adoc:108) and the PD module has pd-job.yaml (kueue-dra-partitionable-devices.adoc:155) — the RCT module needs the same. Here's an example that matches the namespace (default) and queue names (user-queue, cluster-queue) already established in step 2:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: my-gpu
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com # <1>
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: rct-test-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue # <2>
spec:
  template:
    spec:
      restartPolicy: Never
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: my-gpu # <3>
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/agnhost:2.53
        args: ["pause"]
        resources:
          claims:
          - name: gpu # <4>
          requests:
            cpu: "1"
            memory: "200Mi"

<1> References the DeviceClass configured in deviceClassMappings.
<2> Identifies the local queue to submit the job to.
<3> References the ResourceClaimTemplate defined above. The template must exist in the same namespace as the job.
<4> Attaches the resource claim to this container.

Add a verification section matching the ER (kueue-dra-extended-resources.adoc:143) and PD (kueue-dra-partitionable-devices.adoc:209) modules:

.Verification

. Verify that the workload has been created and admitted:
+
$ oc -n default get workloads

. Verify that a ResourceClaim was created from the template:
+
$ oc -n default get resourceclaims
+
If the workload is not admitted, verify the following:
+
* The deviceClassMappings in the Kueue CR maps the DeviceClass name to the resource name in coveredResources.
* The ClusterQueue has sufficient quota available.
* The ResourceClaimTemplate exists in the same namespace as the job.

`modules/kueue-dra-extended-resources.adoc`

Line 131: Broken callout marker

    kueue.x-k8s.io/queue-name: user-queue #

The # has no callout number. It should be # <1>. The annotation on line 140 (# <2>) can then stay as # <2>; the numbering is currently off only because <1> is missing.

`modules/kueue-release-notes-1.4.adoc`

Line 28: Broken xref — missing leading k

xref:...#ueue-dra-integrating-dynamic-resource-allocation[...]

Should be #kueue-dra-integrating-dynamic-resource-allocation.

PannagaRao · 2026-06-30T18:50:10Z

+====
+Enabling the `CustomNoUpgrade` feature set on your cluster cannot be undone and prevents minor version updates. This feature set is not supported on production clusters. 
+====
+


It might be worth adding a step after the FeatureGate CR example for users to wait for the MCP worker rollout to complete before proceeding. Alice has also mentioned this in her comment as part of the flow. Same applies for Partitionable Devices flow as well.
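A possible shape for that wait step is sketched below. The function name and the 30-minute default timeout are assumptions, not tested values. Note also that immediately after the `FeatureGate` change the pool might still report `Updated=True` until the rollout begins, so checking `oc get mcp` first to confirm that the pool has started updating is probably wise.

```shell
# Hypothetical helper: block until the worker MachineConfigPool reports
# Updated=True. The 30m default timeout is an assumption, not a tested value.
wait_for_worker_mcp() {
  timeout="${1:-30m}"
  oc wait machineconfigpool/worker --for=condition=Updated=True --timeout="$timeout"
}
```

Calling `wait_for_worker_mcp 45m` would wait up to 45 minutes; the same helper could be reused in the Partitionable Devices flow.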

ocpdocs-vale-bot · 2026-06-30T18:51:01Z

  - Applies immediate admission penalties to prevent resource monopolization

-For more information, see xref:../../ai_workloads/kueue/admission-fair-sharing.adoc#admission-fair-sharing[Admission fair sharing].
+For more information, see xref:../../ai_workloads/kueue/admission-fair-sharing.adoc#admission-fair-sharing[Admission fair sharing].


🤖 [error] OpenShiftAsciiDoc.NoXrefInModules: Do not include xrefs in modules, only assemblies (exception: release notes modules).

ocpdocs-vale-bot · 2026-06-30T18:51:03Z

+Dynamic Resource Allocation (DRA) quota management for GPUs (Technology Preview)::
+{kueue-name} now supports quota management for workloads that request GPUs through Dynamic Resource Allocation (DRA). When configured, {kueue-name} tracks DRA device requests toward quota alongside traditional resources such as CPU and memory, preventing teams from exceeding their allocated GPU resources.
+
+For more information, see xref:../../ai_workloads/kueue/kueue-dra-integrating-dynamic-resource-allocation.adoc#kueue-dra-integrating-dynamic-resource-allocation[Integrating Dynamic Resource Allocation].


🤖 [error] OpenShiftAsciiDoc.NoXrefInModules: Do not include xrefs in modules, only assemblies (exception: release notes modules).

PannagaRao · 2026-06-30T19:05:15Z

+* The `deviceClassMappings` `name` value matches the resource name in `coveredResources`.
+* The `counter.name` in `sources` matches a counter key in the `ResourceSlice` objects.
+* The `ClusterQueue` has sufficient GPU memory quota for the requested partition size.
+* MIG is enabled on the GPU hardware.


@sohankunkerkar Do you think we need to call out the alpha limitation that extended resources and counter sources cannot be used together on the same DeviceClass in this doc?

Yeah, good call!

[NOTE]
====
Extended resources and counter-based sources cannot be used together on the same `DeviceClass`. If a workload uses the extended resource syntax (for example, `nvidia.com/gpu: "1"`) and the `DeviceClass` mapping has counter sources configured, the workload is marked inadmissible. For more details, see link:https://github.com/kubernetes-sigs/kueue/blob/main/keps/2941-DRA/README.md#path-interactions[Path Interactions] in the upstream Kueue documentation.
====

@StephenJamesSmith Can we add this block in kueue-dra-partitionable-devices.adoc ? I'll leave the placement to you.

@PannagaRao Where can I find the limitations? I found this -

In the alpha phase of Kubernetes Dynamic Resource Allocation (DRA), combining extended resources and counter sources on the same DeviceClass is not supported. You must define separate device classes if you intend to track countable extended resources (like specific GPUs) alongside capacity or counter-based attributes within your cluster.

DRA models devices by using attributes rather than only counting quantities. Extended resources track hard limits, that is, the total integer counts of hardware. Counter sources manage granular, sliceable, or dynamic capacity constraints. Mixing these two allocation models within a single DeviceClass causes scheduling conflicts, because the kube-scheduler attempts to reconcile discrete claims against dynamic capacity simultaneously.

You can mitigate these limitations by separating the classes: Create one DeviceClass for your countable extended resources and a completely separate DeviceClass for counter capacities.

https://github.com/kubernetes-sigs/kueue/blob/main/keps/2941-DRA/README.md#path-interactions

I have added this link in the NOTE suggestion

Two issues with this link: 1) I can only put links in the ASSEMBLY, which may not be a problem because that's probably the best place to put the NOTE. 2) We have restrictions about xrefs to external repos. I can try it, but it may get rejected in Merge review.

ocpdocs-vale-bot · 2026-06-30T19:48:07Z

  - Applies immediate admission penalties to prevent resource monopolization

-For more information, see xref:../../ai_workloads/kueue/admission-fair-sharing.adoc#admission-fair-sharing[Admission fair sharing].
+For more information, see xref:../../ai_workloads/kueue/admission-fair-sharing.adoc#admission-fair-sharing[Admission fair sharing].


🤖 [error] OpenShiftAsciiDoc.NoXrefInModules: Do not include xrefs in modules, only assemblies (exception: release notes modules).

ocpdocs-vale-bot · 2026-06-30T19:48:09Z

+Dynamic Resource Allocation (DRA) quota management for GPUs (Technology Preview)::
+{kueue-name} now supports quota management for workloads that request GPUs through Dynamic Resource Allocation (DRA). When configured, {kueue-name} tracks DRA device requests toward quota alongside traditional resources such as CPU and memory, preventing teams from exceeding their allocated GPU resources.
+
+For more information, see xref:../../ai_workloads/kueue/kueue-dra-integrating-dynamic-resource-allocation.adoc#kueue-dra-integrating-dynamic-resource-allocation[Integrating Dynamic Resource Allocation].


🤖 [error] OpenShiftAsciiDoc.NoXrefInModules: Do not include xrefs in modules, only assemblies (exception: release notes modules).

openshift-ci · 2026-06-30T19:48:23Z

@StephenJamesSmith: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

StephenJamesSmith · 2026-07-01T10:34:55Z

@PannagaRao @sohankunkerkar FYI: I've added the release note to this PR. Please review. Thx!

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 23, 2026

openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 23, 2026

ocpdocs-vale-bot reviewed Jun 23, 2026

StephenJamesSmith force-pushed the OSDOCS-20033 branch from 10075f1 to 04ba526 Compare June 24, 2026 12:43

sohankunkerkar reviewed Jun 24, 2026

StephenJamesSmith force-pushed the OSDOCS-20033 branch from 04ba526 to 9e450d7 Compare June 24, 2026 19:58

openshift-ci Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 24, 2026

StephenJamesSmith force-pushed the OSDOCS-20033 branch 3 times, most recently from 717dfe5 to c266442 Compare June 24, 2026 21:11

anahas-redhat reviewed Jun 25, 2026

sohankunkerkar reviewed Jun 29, 2026

openshift-ci Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 29, 2026

ocpdocs-vale-bot reviewed Jun 29, 2026

anahas-redhat reviewed Jun 30, 2026

StephenJamesSmith force-pushed the OSDOCS-20033 branch from cdc7c97 to 1f42c40 Compare June 30, 2026 17:21

openshift-ci Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 30, 2026

StephenJamesSmith force-pushed the OSDOCS-20033 branch from 1f42c40 to 9a3739c Compare June 30, 2026 18:08

ocpdocs-vale-bot reviewed Jun 30, 2026

anahas-redhat reviewed Jun 30, 2026

StephenJamesSmith force-pushed the OSDOCS-20033 branch from 9a3739c to 0f7218d Compare June 30, 2026 18:33

PannagaRao reviewed Jun 30, 2026

ocpdocs-vale-bot reviewed Jun 30, 2026

PannagaRao reviewed Jun 30, 2026

OSDOCS-20033: Kueue 1.4 and DRA

cdc2e6d

StephenJamesSmith force-pushed the OSDOCS-20033 branch from 0f7218d to cdc2e6d Compare June 30, 2026 19:37

ocpdocs-vale-bot reviewed Jun 30, 2026


		* Validation using dra-example-driver and nvidia-dra-driver.

		.Prerequisites

	[id="kueue-dra-partionable-devices_{context}"]
	[id="kueue-dra-partitionable-devices_{context}"]


		* Verification of partition capacity reclaim after workload completion.

		* Validation using dra-example-driver and nvidia-dra-driver.

	extendedResourceName: example.com/gpu
	extendedResourceName: nvidia.com/gpu


		.Procedure

		. Add a `deviceClassMappings`` entry to the {kueue-name} configuration that maps each `DeviceClass` to a logical resource name for quota, as shown in the following example:

	. Add a `deviceClassMappings`` entry to the {kueue-name} configuration that maps each `DeviceClass` to a logical resource name for quota, as shown in the following example:
	. Add a `deviceClassMappings` entry to the {kueue-name} configuration that maps each `DeviceClass` to a logical resource name for quota, as shown in the following example:

		@@ -0,0 +1,154 @@
		// Module included in the following assemblies:


		include::modules/kueue-dra-resourceclaimtemplates.adoc[leveloffset=+2]

		// include::modules/kueue-dra-deviceclasses.adoc[leveloffset=+2]


		.Procedure

		. Enable the feature gates by installing or reconfiguring {kueue-name} with both feature gates enabled, as shown in the following example:


		DRA is a Kubernetes framework that manages specialized hardware resources such as GPUs with fine-grained control. Unlike traditional resource requests, DRA allows dynamic prioritization—allocating GPUs to high-priority AI training workloads during business hours, then reallocating them to cost-optimized batch jobs overnight.

		You can validate partitionable devices support in {kueue-name} Dynamic Resource Allocation (DRA) integration, covering partition-aware quota, admission, and scheduling. Partitionable devices, such as NVIDIA MIG, allow graphics processing units (GPUs) to be dynamically subdivided into smaller allocations. {kueue-name} must correctly handle quota accounting for these mutually exclusive partition configurations.

	{kueue-name} Dynamic Resource Allocation (DRA) integration enables advanced management of specialized hardware resources like GPUs, FPGAs, and other accelerators within Kubernetes workload queuing. This integration allows for the reading and publishing of ResourceSlices, counter-based quota computation, and specific admission behaviors.
	You can configure {kueue-name} to manage quota for workloads that use Dynamic Resource Allocation (DRA) to request GPUs. When DRA quota management is configured, {kueue-name} counts DRA device requests toward quota in the same way that it counts traditional resources such as CPU and memory.

	DRA is a Kubernetes framework that manages specialized hardware resources such as GPUs with fine-grained control. Unlike traditional resource requests, DRA allows dynamic prioritization—allocating GPUs to high-priority AI training workloads during business hours, then reallocating them to cost-optimized batch jobs overnight.
	If DRA device quota is not configured, {kueue-name} does not account for GPU requests when admitting workloads, which can result in teams exceeding their GPU allocation.


		* xref:../../nodes/pods/nodes-pods-allocate-dra.adoc#nodes-pods-allocate-dra[Allocating GPUs to pods by using DRA]
