# KEDA Kaito Scaler

A dedicated KEDA external scaler that automatically scales GPU inference workloads in Kaito, eliminating the need for external dependencies such as Prometheus.

The KEDA Kaito Scaler provides intelligent autoscaling for vLLM inference workloads by collecting metrics directly from inference pods. It offers a simpler, user-friendly alternative to Prometheus-based scaling solutions while retaining the same scaling capabilities.
## Features

- **Zero Dependencies**: No Prometheus stack required - directly scrapes metrics from inference pods
- **Simple Configuration**: Minimal YAML configuration with intelligent defaults
- **GPU-Optimized**: Conservative scaling policies designed for expensive GPU resources
- **Secure by Default**: Built-in TLS authentication between components
- **Smart Fallback**: Intelligent handling of missing metrics to prevent scaling flapping
- **Minimal Maintenance**: Self-managing certificates and authentication
## Prerequisites

Autoscaling of KAITO GPU inference workloads relies on the InferenceSet custom resource, so the InferenceSet controller must be enabled when installing KAITO. The InferenceSet feature was introduced in KAITO v0.8.0 as an alpha feature.
```shell
export CLUSTER_NAME=kaito

helm repo add kaito https://kaito-project.github.io/kaito/charts/kaito
helm repo update

helm upgrade --install kaito-workspace kaito/workspace \
  --namespace kaito-workspace \
  --create-namespace \
  --set clusterName="$CLUSTER_NAME" \
  --set featureGates.enableInferenceSetController=true \
  --wait
```

The following example demonstrates how to install KEDA using its Helm chart. For other installation methods, refer to the KEDA installation guide.
```shell
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
```

Autoscaling of KAITO GPU inference workloads requires KEDA Kaito Scaler version v0.3.3 or higher.
```shell
helm repo add keda-kaito-scaler https://kaito-project.github.io/keda-kaito-scaler/charts/kaito-project
helm repo update
```

Check available versions:

```shell
helm search repo -l keda-kaito-scaler
```

```
NAME                                  CHART VERSION   APP VERSION   DESCRIPTION
keda-kaito-scaler/keda-kaito-scaler   0.3.3           v0.3.3        A Helm chart for Kaito keda-kaito-scaler compon...
keda-kaito-scaler/keda-kaito-scaler   0.3.0           v0.3.0        A Helm chart for Kaito keda-kaito-scaler compon...
keda-kaito-scaler/keda-kaito-scaler   0.2.0           v0.2.0        A Helm chart for Kaito keda-kaito-scaler compon...
keda-kaito-scaler/keda-kaito-scaler   0.0.1           v0.0.1        A Helm chart for Kaito keda-kaito-scaler compon...
```

Install the scaler:

```shell
helm upgrade --install keda-kaito-scaler -n kaito-workspace keda-kaito-scaler/keda-kaito-scaler --create-namespace
```

The following example creates an InferenceSet for the phi-4-mini model, using annotations with the prefix `scaledobject.kaito.sh/` to supply parameter inputs for the KEDA Kaito Scaler:

- `scaledobject.kaito.sh/auto-provision` - required; specifies whether the KEDA Kaito Scaler automatically provisions a ScaledObject based on the InferenceSet object.
- `scaledobject.kaito.sh/metricName` - optional; specifies the metric name collected from the vLLM pod, used for monitoring and triggering the scaling operation. Defaults to `vllm:num_requests_waiting`.
- `scaledobject.kaito.sh/threshold` - required; specifies the threshold for the monitored metric that triggers the scaling operation.
- `scaledobject.kaito.sh/min-replicas` - optional; specifies the minimum number of replicas for the ScaledObject. If not set or less than 1, it is set to 1.
- `scaledobject.kaito.sh/max-replicas` - optional; specifies the maximum number of replicas for the ScaledObject. When this annotation is not set, the value is computed from `spec.nodeCountLimit` when available. This annotation takes precedence when present: if set, it must have a value greater than 1, otherwise the controller will not auto-provision a ScaledObject even when `spec.nodeCountLimit` is set. If the annotation is absent, auto-provisioning requires `spec.nodeCountLimit` to be set. When set, `max-replicas` must be greater than or equal to `min-replicas`; otherwise the controller skips reconciling the InferenceSet and emits an `InvalidReplicaRange` warning event.
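The replica-bound rules above can be sketched roughly as follows. This is a hypothetical Python model for illustration only; the real controller is written in Go and may differ in detail.

```python
# Hypothetical sketch of the replica-bound rules described above.
# Not the controller's actual implementation.
def resolve_replica_bounds(annotations, node_count_limit):
    """Return (min_replicas, max_replicas), or None if auto-provisioning is skipped."""
    min_r = int(annotations.get("scaledobject.kaito.sh/min-replicas", 1))
    if min_r < 1:
        min_r = 1  # values below 1 are clamped to 1

    max_ann = annotations.get("scaledobject.kaito.sh/max-replicas")
    if max_ann is not None:
        max_r = int(max_ann)
        if max_r <= 1:
            return None  # annotation takes precedence and must be > 1
    elif node_count_limit is not None:
        max_r = node_count_limit  # fall back to spec.nodeCountLimit
    else:
        return None  # neither source is set: no auto-provisioning

    if max_r < min_r:
        return None  # InvalidReplicaRange: reconcile is skipped
    return (min_r, max_r)
```

For example, an InferenceSet with no replica annotations but `spec.nodeCountLimit: 5` resolves to bounds of (1, 5), matching the example below.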
```shell
cat <<EOF | kubectl apply -f -
apiVersion: kaito.sh/v1alpha1
kind: InferenceSet
metadata:
  annotations:
    scaledobject.kaito.sh/auto-provision: "true"
    scaledobject.kaito.sh/metricName: "vllm:num_requests_waiting"
    scaledobject.kaito.sh/threshold: "10"
  name: phi-4
  namespace: default
spec:
  labelSelector:
    matchLabels:
      apps: phi-4
  replicas: 1
  nodeCountLimit: 5
  template:
    inference:
      preset:
        accessMode: public
        name: phi-4-mini-instruct
    resource:
      instanceType: Standard_NC24ads_A100_v4
EOF
```

Within a few seconds, the KEDA Kaito Scaler automatically creates the `ScaledObject` and `HPA` objects. After a few minutes, once the inference pod is running, the scaler begins scraping metric values from the inference pod, and the status of the `ScaledObject` and `HPA` objects is marked as ready.
```shell
# kubectl get scaledobject
NAME    SCALETARGETKIND                  SCALETARGETNAME   MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED   TRIGGERS   AUTHENTICATIONS           AGE
phi-4   kaito.sh/v1alpha1.InferenceSet   phi-4             1     5     True    True     False      False    external   keda-kaito-scaler-creds   10m

# kubectl get hpa
NAME             REFERENCE            TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-phi-4   InferenceSet/phi-4   0/10 (avg)   1         5         1          11m
```

That's it! Your KAITO workloads will now automatically scale based on the number of waiting inference requests (`vllm:num_requests_waiting`).
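If you want to inspect the raw metric the scaler scrapes, you can port-forward to an inference pod and fetch its `/metrics` endpoint. Extracting a value from the Prometheus text format looks roughly like this; a minimal illustration, not the scaler's actual code:

```python
# Minimal illustration of reading a gauge from Prometheus-format text,
# similar in spirit to what the scaler does when scraping /metrics.
# This is NOT the scaler's actual implementation.
def read_metric(metrics_text, name):
    """Return the summed value of all samples for the given metric name, or None."""
    total, found = 0.0, False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        # sample lines look like: metric_name{labels} value
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
            found = True
    return total if found else None

sample = """# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="phi-4-mini-instruct"} 12.0
"""
print(read_metric(sample, "vllm:num_requests_waiting"))  # 12.0
```

With the threshold annotation set to `"10"`, a sustained waiting-queue value like the `12.0` above is what drives the HPA to add replicas.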
## Release Process

Releases are driven by two manual GitHub Actions workflows. A release produces a multi-arch container image (`ghcr.io/kaito-project/keda-kaito-scaler:<X.Y.Z>`), a Helm chart (https://kaito-project.github.io/keda-kaito-scaler/charts/kaito-project), and a GitHub Release with binaries and changelog.
To publish vX.Y.Z:

1. Open a PR against `main` that bumps the version in:
   - `charts/keda-kaito-scaler/Chart.yaml` → `version` and `appVersion`
   - `charts/keda-kaito-scaler/values.yaml` → `image.tag`
   - `Makefile` → `VERSION ?=` (optional, keeps the local-build default aligned)
2. After the PR is merged, run Actions → "Publish Keda-Kaito-Scaler image(manually)" with `release_version=vX.Y.Z`. This creates the Git tag, pushes the image, and auto-publishes the Helm chart to `gh-pages`.
3. Run Actions → "Create release(manually)" with the same `release_version`. This runs GoReleaser against the tag and publishes the GitHub Release.

Notes:

- Use the same `vX.Y.Z` value for both workflows. Git tags and Release names are prefixed with `v`; image tags are not (`0.3.0`).
- Step 2 must finish before Step 3 (Step 3 checks out the tag created by Step 2).
- The image workflow runs in the `preset-env` environment and may require approval.
- Release branches are not needed for normal releases. Only cut a `release-vX.Y` branch (e.g. `release-v0.3`) when `main` has moved on to the next minor and you still need to ship patch releases for the older line; then run the publish workflows against that branch.
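A quick sanity check of the version-bump conventions in step 1 can be sketched as below. This is a hypothetical helper, not part of the repository; it only encodes the prefix rules stated above (chart version and image tag without `v`, `appVersion` with `v`).

```python
# Hypothetical release-bump sanity check; NOT part of this repo.
# Encodes the conventions above: chart version and image tag drop the
# "v" prefix, while appVersion (like the Git tag) keeps it.
def check_bump(release_version, chart_version, app_version, image_tag):
    """Return a list of problems; an empty list means the bump is consistent."""
    if not release_version.startswith("v"):
        return ["release_version must look like vX.Y.Z"]
    bare = release_version[1:]  # e.g. "v0.3.3" -> "0.3.3"
    problems = []
    if chart_version != bare:
        problems.append("Chart.yaml version should be " + bare)
    if app_version != release_version:
        problems.append("Chart.yaml appVersion should be " + release_version)
    if image_tag != bare:
        problems.append("values.yaml image.tag should be " + bare)
    return problems

print(check_bump("v0.3.3", "0.3.3", "v0.3.3", "0.3.3"))  # []
```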
## License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
