feat: WorkerPool Auto scaling #219

Open
Omer Yahud (omeryahud) wants to merge 7 commits into agent-substrate:main from omeryahud:feat/workerpool-autoscaling

@omeryahud Omer Yahud (omeryahud) commented Jun 10, 2026

Fixes #198

Notes

  • Follow-up: observe() filters workers to the pool client-side (ListWorkers has no server-side filter), so each reconcile lists every worker in the cluster. That is acceptable at the current scale, but the listing should be scoped server-side as pools grow.

  • Tests pass

  • Appropriate changes to documentation are included in the PR

Add minReady, targetBuffer, and maxReplicas as optional declarative
inputs for the WorkerPool autoscaler, plus a CEL rule rejecting
minReady > maxReplicas. Regenerate deepcopy and CRD.

Part of agent-substrate#198 (Phase 1).
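
As an illustration of the constraint described above, here is a minimal Go sketch of the three optional inputs and the minReady/maxReplicas check. The `AutoscalingSpec` type, the pointer fields, and the `Validate` method are hypothetical stand-ins; in the PR the rule is enforced by a CEL validation on the CRD, not by Go code like this.

```go
package main

import (
	"errors"
	"fmt"
)

// AutoscalingSpec is a hypothetical stand-in for the new optional WorkerPool
// fields. Pointers model "unset"; the real API type may look different.
type AutoscalingSpec struct {
	MinReady     *int32
	TargetBuffer *int32
	MaxReplicas  *int32
}

// Validate mirrors the intent of the CEL rule in plain Go: it rejects
// minReady > maxReplicas, but only when both fields are set.
func (s AutoscalingSpec) Validate() error {
	if s.MinReady != nil && s.MaxReplicas != nil && *s.MinReady > *s.MaxReplicas {
		return errors.New("minReady must not exceed maxReplicas")
	}
	return nil
}

// ptr is a small helper for building optional int32 values.
func ptr(v int32) *int32 { return &v }

func main() {
	ok := AutoscalingSpec{MinReady: ptr(2), TargetBuffer: ptr(1), MaxReplicas: ptr(10)}
	bad := AutoscalingSpec{MinReady: ptr(5), MaxReplicas: ptr(3)}
	fmt.Println("ok spec:", ok.Validate())
	fmt.Println("bad spec:", bad.Validate())
}
```

Treating unset fields as "no constraint" is an assumption of this sketch; the actual rule may handle missing values differently.
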

Add internal/autoscaler with Step(): given a pool's bounds, a live
occupancy observation, loop config and the clock, it returns the next
desired replica count. Scale-up is immediate; scale-down is gated by a
stabilization window. No Kubernetes/gRPC deps; covered by 16 unit tests.

Part of agent-substrate#198 (Phase 2a).

Add WorkerPoolAutoscaler in internal/controllers: the single writer of
spec.replicas for autoscaled pools. It reads each pool's bounds, measures
occupancy via ateapi ListWorkers, runs the decision policy, and patches
replicas. WorkerPoolReconciler still owns the Deployment and status, so
the two controllers write disjoint fields. Re-evaluation is poll-driven
(RequeueAfter) as the down-path and safety net; the reactive up-path is
added separately. Wired into atecontroller, reusing its ateapi client.

Part of agent-substrate#198 (Phase 2b).
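
The occupancy measurement is filtered client-side, as the notes at the top mention. A minimal sketch of that filtering step is below; the `Worker` and `Occupancy` types and their fields are hypothetical and do not reflect the real ateapi `ListWorkers` response.

```go
package main

import "fmt"

// Worker is a hypothetical, simplified view of one entry in a cluster-wide
// worker listing. The real ateapi message is not shown in this PR page.
type Worker struct {
	Name string
	Pool string
	Busy bool
}

// Occupancy is the per-pool summary fed to the decision policy.
type Occupancy struct {
	Total int
	Busy  int
}

// observe scans the full cluster listing and keeps only workers that belong
// to the given pool. The cost is linear in the number of cluster workers on
// every reconcile, which is why server-side scoping is the natural follow-up.
func observe(all []Worker, pool string) Occupancy {
	var o Occupancy
	for _, w := range all {
		if w.Pool != pool {
			continue
		}
		o.Total++
		if w.Busy {
			o.Busy++
		}
	}
	return o
}

func main() {
	workers := []Worker{
		{Name: "w1", Pool: "a", Busy: true},
		{Name: "w2", Pool: "a", Busy: false},
		{Name: "w3", Pool: "b", Busy: true},
	}
	fmt.Printf("%+v\n", observe(workers, "a"))
}
```
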

Add a WatchCapacityPressure server-streaming RPC and an in-process
pub/sub hub. When AssignWorkerStep finds no free worker for a pool it
publishes a pool-scoped CapacityPressureEvent at the request edge;
subscribers stream them. Publish is non-blocking (lossy on a full
subscriber buffer) so it never slows the resume path. Producer side of
the reactive scale-up path; the autoscaler subscribes separately.

Part of agent-substrate#198 (Phase 3a).
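
A non-blocking, lossy publish of the kind described above can be sketched with a `select` that has a `default` branch. The `Hub` and `Event` types and their methods here are hypothetical; the PR's hub and `CapacityPressureEvent` may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical stand-in for a pool-scoped pressure event.
type Event struct{ Pool string }

// Hub is a minimal in-process pub/sub fan-out.
type Hub struct {
	mu   sync.Mutex
	subs map[chan Event]struct{}
}

func NewHub() *Hub { return &Hub{subs: make(map[chan Event]struct{})} }

// Subscribe registers a buffered subscriber and returns its channel plus a
// cancel function that unregisters it and closes the channel.
func (h *Hub) Subscribe(buf int) (<-chan Event, func()) {
	ch := make(chan Event, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	cancel := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[ch]; ok {
			delete(h.subs, ch)
			close(ch)
		}
	}
	return ch, cancel
}

// Publish never blocks: if a subscriber's buffer is full the event is dropped
// for that subscriber. It returns the number of drops.
func (h *Hub) Publish(e Event) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- e:
		default:
			dropped++
		}
	}
	return dropped
}

func main() {
	h := NewHub()
	ch, cancel := h.Subscribe(1)
	defer cancel()
	fmt.Println("dropped:", h.Publish(Event{Pool: "a"}))
	fmt.Println("dropped:", h.Publish(Event{Pool: "a"})) // buffer is full now
	fmt.Println("received:", (<-ch).Pool)
}
```

Dropping on a full buffer is tolerable in this design because the periodic requeue still re-evaluates every pool; a lost event only delays a scale-up until the next poll.
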

Subscribe the WorkerPool autoscaler to ateapi's WatchCapacityPressure
stream and turn each event into an immediate reconcile of that pool via
a controller-runtime source.Channel, so an empty pool scales up at the
miss instead of at the next poll. The periodic requeue remains the
down-path and safety net. The stream watcher runs as a manager runnable
and reconnects on error without crashing the manager.

Part of agent-substrate#198 (Phase 3b).

Add demos/autoscaling: an autoscaled WorkerPool manifest (minReady,
targetBuffer, maxReplicas), a scenario driver (demo.sh) that exercises
reactive scale-up and hysteretic scale-down via kubectl-ate, and a
README walkthrough.

Part of agent-substrate#198.

google-cla Bot commented Jun 10, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@omeryahud Omer Yahud (omeryahud) marked this pull request as draft June 10, 2026 21:27
Signed-off-by: Omer Yahud <oyahud@nvidia.com>
Comment thread cmd/atecontroller/main.go
if err = (&controllers.WorkerPoolAutoscaler{
	Client:    mgr.GetClient(),
	AteClient: ateapiClient,
	Config: autoscaler.Config{

Author

Maybe this should be configurable per worker pool?

@omeryahud Omer Yahud (omeryahud) marked this pull request as ready for review June 11, 2026 17:04

Development

Successfully merging this pull request may close these issues.

feat: WorkerPool autoscaling — demand-reactive capacity for warm worker pools

1 participant