KEDA Kaito Scaler

A dedicated KEDA external scaler designed to automatically scale GPU inference workloads in Kaito, eliminating the need for external dependencies such as Prometheus.

Overview

The KEDA Kaito Scaler provides intelligent autoscaling for vLLM inference workloads by directly collecting metrics from inference pods. It offers a simplified, user-friendly alternative to complex Prometheus-based scaling solutions while maintaining the same powerful scaling capabilities.

Key Features

  • πŸš€ Zero Dependencies: No Prometheus stack required - directly scrapes metrics from inference pods
  • ⚑ Simple Configuration: Minimal YAML configuration with intelligent defaults
  • 🎯 GPU-Optimized: Conservative scaling policies designed for expensive GPU resources
  • πŸ”’ Secure by Default: Built-in TLS authentication between components
  • πŸ“Š Smart Fallback: Intelligent handling of missing metrics to prevent scaling flapping
  • πŸ”§ Minimal Maintenance: Self-managing certificates and authentication

Architecture

[Architecture diagram: keda-kaito-scaler-arch]

Prerequisites

Enable InferenceSet Controller during KAITO install

To enable autoscaling of KAITO GPU inference workloads, KAITO's InferenceSet custom resource must be used, and the InferenceSet controller must be enabled during the KAITO installation. The InferenceSet feature was introduced in KAITO v0.8.0 as an alpha feature.

export CLUSTER_NAME=kaito

helm repo add kaito https://kaito-project.github.io/kaito/charts/kaito
helm repo update
helm upgrade --install kaito-workspace kaito/workspace \
  --namespace kaito-workspace \
  --create-namespace \
  --set clusterName="$CLUSTER_NAME" \
  --set featureGates.enableInferenceSetController=true \
  --wait
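After the install completes, it is worth confirming that the feature gate actually took effect. The commands below are a sketch; the CRD name is an assumption based on the kaito.sh API group and InferenceSet kind, so adjust if your cluster reports a different name:

```shell
# Confirm the InferenceSet CRD was registered (CRD name is an assumption
# derived from the kaito.sh API group; verify with `kubectl get crd`).
kubectl get crd inferencesets.kaito.sh

# Confirm the workspace controller pods are running in the install namespace.
kubectl get pods -n kaito-workspace
```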

Install KEDA

The following example demonstrates how to install KEDA using its Helm chart. For other installation methods, refer to the KEDA installation documentation.

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
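As a quick sanity check after installation, verify that the KEDA components came up in the target namespace:

```shell
# The KEDA operator and metrics API server pods should be Running.
kubectl get pods -n keda
```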

Quick Start

Deploy KEDA Kaito Scaler

Autoscaling of KAITO GPU inference workloads requires KEDA Kaito Scaler v0.3.3 or later.

helm repo add keda-kaito-scaler https://kaito-project.github.io/keda-kaito-scaler/charts/kaito-project
helm repo update

Check available versions:

helm search repo -l keda-kaito-scaler
NAME                                    CHART VERSION   APP VERSION   DESCRIPTION
keda-kaito-scaler/keda-kaito-scaler     0.3.3           v0.3.3        A Helm chart for Kaito keda-kaito-scaler compon...
keda-kaito-scaler/keda-kaito-scaler     0.3.0           v0.3.0        A Helm chart for Kaito keda-kaito-scaler compon...
keda-kaito-scaler/keda-kaito-scaler     0.2.0           v0.2.0        A Helm chart for Kaito keda-kaito-scaler compon...
keda-kaito-scaler/keda-kaito-scaler     0.0.1           v0.0.1        A Helm chart for Kaito keda-kaito-scaler compon...

Install the scaler:

helm upgrade --install keda-kaito-scaler -n kaito-workspace keda-kaito-scaler/keda-kaito-scaler --create-namespace

Create a Kaito InferenceSet for running inference workloads

  • The following example creates an InferenceSet for the phi-4-mini model, using annotations with the prefix scaledobject.kaito.sh/ to supply parameter inputs for the KEDA Kaito Scaler:
    • scaledobject.kaito.sh/auto-provision
      • required, specifies whether KEDA Kaito Scaler will automatically provision a ScaledObject based on the InferenceSet object
    • scaledobject.kaito.sh/metricName
      • optional, specifies the metric name collected from the vLLM pod, which is used for monitoring and triggering the scaling operation, default is vllm:num_requests_waiting
    • scaledobject.kaito.sh/threshold
      • required, specifies the threshold for the monitored metric that triggers the scaling operation
    • scaledobject.kaito.sh/min-replicas
      • optional, specifies the minimum number of replicas for the ScaledObject. If not set or less than 1, it will be set to 1.
    • scaledobject.kaito.sh/max-replicas
      • optional, specifies the maximum number of replicas for the ScaledObject. When this annotation is not set, the value is computed from spec.nodeCountLimit when available.
      • this annotation takes precedence when present: if set, it must have a value greater than 1, otherwise the controller will not auto-provision a ScaledObject even when spec.nodeCountLimit is set. If the annotation is absent, auto-provisioning requires spec.nodeCountLimit to be set.
      • when set, max-replicas must be greater than or equal to min-replicas; otherwise the controller will skip reconciling the InferenceSet and emit an InvalidReplicaRange warning event.
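The auto-provisioning rules for max-replicas described above can be summarized as follows. This is an illustrative shell sketch, not the controller's actual implementation:

```shell
# Illustrative sketch of the max-replicas resolution rules; this is NOT
# the controller's actual code. Empty string means "not set".
resolve_max_replicas() {
  annotation="$1"        # value of scaledobject.kaito.sh/max-replicas, or ""
  node_count_limit="$2"  # spec.nodeCountLimit, or ""
  if [ -n "$annotation" ]; then
    # The annotation takes precedence, but must be greater than 1;
    # otherwise no ScaledObject is auto-provisioned.
    if [ "$annotation" -gt 1 ]; then echo "$annotation"; else echo "skip"; fi
  elif [ -n "$node_count_limit" ]; then
    # Annotation absent: fall back to spec.nodeCountLimit.
    echo "$node_count_limit"
  else
    # Neither is set: auto-provisioning is skipped.
    echo "skip"
  fi
}

resolve_max_replicas "" 5   # annotation absent, nodeCountLimit=5 -> 5
```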
cat <<EOF | kubectl apply -f -
apiVersion: kaito.sh/v1alpha1
kind: InferenceSet
metadata:
  annotations:
    scaledobject.kaito.sh/auto-provision: "true"
    scaledobject.kaito.sh/metricName: "vllm:num_requests_waiting"
    scaledobject.kaito.sh/threshold: "10"
  name: phi-4
  namespace: default
spec:
  labelSelector:
    matchLabels:
      apps: phi-4
  replicas: 1
  nodeCountLimit: 5
  template:
    inference:
      preset:
        accessMode: public
        name: phi-4-mini-instruct
    resource:
      instanceType: Standard_NC24ads_A100_v4
EOF
  • Within a few seconds, the KEDA Kaito Scaler automatically creates the ScaledObject and HPA objects. After a few minutes, once the inference pod is running, the scaler begins scraping metric values from it, and the ScaledObject and HPA are marked as ready.
# kubectl get scaledobject
NAME    SCALETARGETKIND                  SCALETARGETNAME   MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED   TRIGGERS   AUTHENTICATIONS           AGE
phi-4   kaito.sh/v1alpha1.InferenceSet   phi-4             1     5     True    True     False      False    external   keda-kaito-scaler-creds   10m

# kubectl get hpa
NAME                    REFERENCE                   TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-phi-4          InferenceSet/phi-4          0/10 (avg)   1         5         1          11m

That's it! Your KAITO workloads will now automatically scale based on the number of waiting inference requests (vllm:num_requests_waiting).
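To watch a scale-out happen, you can drive load against the inference endpoint and observe the HPA. The snippet below is an untested sketch: the service name (phi-4), local port mapping, and OpenAI-compatible /v1/completions endpoint are assumptions, so check `kubectl get svc` for the actual service in your cluster first:

```shell
# Sketch only: service name and port are assumptions; adjust to your cluster.
kubectl port-forward svc/phi-4 8000:80 &

# Fire a burst of concurrent completion requests so that
# vllm:num_requests_waiting rises above the configured threshold (10).
for i in $(seq 1 50); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "phi-4-mini-instruct", "prompt": "Hello", "max_tokens": 256}' >/dev/null &
done

# Watch the HPA scale the InferenceSet up as the queue grows.
kubectl get hpa keda-hpa-phi-4 -w
```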

Release Process

Releases are driven by two manual GitHub Actions workflows. A release produces a multi-arch container image (ghcr.io/kaito-project/keda-kaito-scaler:<X.Y.Z>), a Helm chart (https://kaito-project.github.io/keda-kaito-scaler/charts/kaito-project), and a GitHub Release with binaries and changelog.

To publish vX.Y.Z:

  1. Open a PR against main that bumps the version in:

  2. After the PR is merged, run Actions β†’ "Publish Keda-Kaito-Scaler image(manually)" with release_version=vX.Y.Z. This creates the Git tag, pushes the image, and auto-publishes the Helm chart to gh-pages.

  3. Run Actions β†’ "Create release(manually)" with the same release_version. This runs GoReleaser against the tag and publishes the GitHub Release.

Notes:

  • Use the same vX.Y.Z value for both workflows. Git tags / Release names are prefixed with v; image tags are not (0.3.0).
  • Step 2 must finish before Step 3 (Step 3 checks out the tag created by Step 2).
  • The image workflow runs in the preset-env environment and may require approval.
  • Release branches are not needed for normal releases. Only cut a release-vX.Y branch (e.g. release-v0.3) when main has moved on to the next minor and you still need to ship patch releases for the older line; then run the publish workflows against that branch.
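If you prefer the command line to the Actions UI, the two workflows can also be dispatched with the GitHub CLI. The workflow display names below are taken from the steps above; treat the invocation as a sketch and confirm the names with `gh workflow list` first:

```shell
# Step 2: create the tag, push the multi-arch image, publish the Helm chart.
gh workflow run "Publish Keda-Kaito-Scaler image(manually)" \
  --repo kaito-project/keda-kaito-scaler \
  -f release_version=v0.3.3

# Wait for step 2 to finish, then step 3 (it checks out the tag from step 2).
gh workflow run "Create release(manually)" \
  --repo kaito-project/keda-kaito-scaler \
  -f release_version=v0.3.3
```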

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Related Projects

  • Kaito - Kubernetes AI Toolchain Operator
  • KEDA - Kubernetes Event-driven Autoscaling
  • vLLM - Fast and easy-to-use library for LLM inference
