feat: WorkerPool Auto scaling by omeryahud · Pull Request #219 · agent-substrate/substrate

Omer Yahud (omeryahud) · 2026-06-10T21:22:00Z

Fixes #198

Notes

Follow-up: observe() filters workers to the pool client-side (ListWorkers has no server-side filter), so each reconcile lists all cluster workers — fine now, but should be scoped server-side as pools scale.
Tests pass
Appropriate changes to documentation are included in the PR

Add minReady, targetBuffer, and maxReplicas as optional declarative inputs for the WorkerPool autoscaler, plus a CEL rule rejecting minReady > maxReplicas. Regenerate deepcopy and CRD. Part of agent-substrate#198 (Phase 1).
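For readers skimming Phase 1, a minimal sketch of what the bounds and the validation rule could look like. The struct name `AutoscalingBounds`, the `*int32` field types, the exact CEL expression, and the `Validate` helper are assumptions for illustration and are not taken from the PR; only the three field names and the `minReady > maxReplicas` rejection come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// AutoscalingBounds is a hypothetical stand-in for the fields added to
// WorkerPoolSpec. Pointers model "optional"; the real types may differ.
// The marker below shows one way such a CEL rule can be declared (assumed form).
//
// +kubebuilder:validation:XValidation:rule="!has(self.minReady) || !has(self.maxReplicas) || self.minReady <= self.maxReplicas",message="minReady must not exceed maxReplicas"
type AutoscalingBounds struct {
	MinReady     *int32 `json:"minReady,omitempty"`
	TargetBuffer *int32 `json:"targetBuffer,omitempty"`
	MaxReplicas  *int32 `json:"maxReplicas,omitempty"`
}

// ErrMinReadyExceedsMax is returned when both bounds are set and conflict.
var ErrMinReadyExceedsMax = errors.New("minReady must not exceed maxReplicas")

// Validate restates the CEL rule in Go: the check only applies when both
// minReady and maxReplicas are present.
func (b AutoscalingBounds) Validate() error {
	if b.MinReady != nil && b.MaxReplicas != nil && *b.MinReady > *b.MaxReplicas {
		return ErrMinReadyExceedsMax
	}
	return nil
}

// ptr is a small helper for building optional int32 values.
func ptr(v int32) *int32 { return &v }

func main() {
	ok := AutoscalingBounds{MinReady: ptr(2), TargetBuffer: ptr(1), MaxReplicas: ptr(10)}
	bad := AutoscalingBounds{MinReady: ptr(11), MaxReplicas: ptr(10)}
	fmt.Println("ok:", ok.Validate())
	fmt.Println("bad:", bad.Validate())
}
```

Leaving all three fields unset passes the check in this sketch, which matches the "optional" wording in the description.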

Add internal/autoscaler with Step(): given a pool's bounds, a live occupancy observation, loop config and the clock, it returns the next desired replica count. Scale-up is immediate; scale-down is gated by a stabilization window. No Kubernetes/gRPC deps; covered by 16 unit tests. Part of agent-substrate#198 (Phase 2a).

Add WorkerPoolAutoscaler in internal/controllers: the single writer of spec.replicas for autoscaled pools. It reads each pool's bounds, measures occupancy via ateapi ListWorkers, runs the decision policy, and patches replicas. WorkerPoolReconciler still owns the Deployment and status, so the two controllers write disjoint fields. Re-evaluation is poll-driven (RequeueAfter) as the down-path and safety net; the reactive up-path is added separately. Wired into atecontroller, reusing its ateapi client. Part of agent-substrate#198 (Phase 2b).
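The occupancy measurement described here, including the client-side filtering mentioned in the Notes, could look roughly like the sketch below. The `Worker` fields, the `Occupancy` type, and the `observe` signature are hypothetical; the real ateapi `ListWorkers` response shape is not shown in this PR description.

```go
package main

import "fmt"

// Worker is an assumed, simplified view of one entry in a ListWorkers response.
type Worker struct {
	Name string
	Pool string
	Busy bool
}

// Occupancy is the per-pool summary handed to the decision policy.
type Occupancy struct{ Total, Busy int }

// observe filters the cluster-wide worker list down to one pool on the client
// side and counts how many of that pool's workers are occupied.
func observe(all []Worker, pool string) Occupancy {
	var o Occupancy
	for _, w := range all {
		if w.Pool != pool {
			continue // worker belongs to another pool
		}
		o.Total++
		if w.Busy {
			o.Busy++
		}
	}
	return o
}

func main() {
	all := []Worker{
		{Name: "w1", Pool: "pool-a", Busy: true},
		{Name: "w2", Pool: "pool-a", Busy: false},
		{Name: "w3", Pool: "pool-b", Busy: true},
	}
	fmt.Printf("%+v\n", observe(all, "pool-a"))
}
```

The loop touches every worker in the cluster for every pool reconcile, which is the cost the follow-up note proposes to remove with a server-side filter.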

Add a WatchCapacityPressure server-streaming RPC and an in-process pub/sub hub. When AssignWorkerStep finds no free worker for a pool it publishes a pool-scoped CapacityPressureEvent at the request edge; subscribers stream them. Publish is non-blocking (lossy on a full subscriber buffer) so it never slows the resume path. Producer side of the reactive scale-up path; the autoscaler subscribes separately. Part of agent-substrate#198 (Phase 3a).
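A non-blocking, lossy publish of this kind is commonly written with a `select` that has a `default` branch. The sketch below illustrates that pattern only; the `Hub` API, the event's single `Pool` field, and the returned drop count are assumptions rather than the PR's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// CapacityPressureEvent is reduced here to a pool name (assumed shape).
type CapacityPressureEvent struct{ Pool string }

// Hub is a hypothetical in-process fan-out to buffered subscriber channels.
type Hub struct {
	mu   sync.Mutex
	subs map[chan CapacityPressureEvent]struct{}
}

func NewHub() *Hub {
	return &Hub{subs: make(map[chan CapacityPressureEvent]struct{})}
}

// Subscribe registers a buffered channel and returns it with a cancel func.
func (h *Hub) Subscribe(buf int) (<-chan CapacityPressureEvent, func()) {
	ch := make(chan CapacityPressureEvent, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	cancel := func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[ch]; ok {
			delete(h.subs, ch)
			close(ch) // safe: Publish only sends while holding the same lock
		}
	}
	return ch, cancel
}

// Publish never blocks: a subscriber whose buffer is full misses the event.
// It returns how many subscribers were skipped.
func (h *Hub) Publish(ev CapacityPressureEvent) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- ev:
		default:
			dropped++
		}
	}
	return dropped
}

func main() {
	h := NewHub()
	ch, cancel := h.Subscribe(1)
	defer cancel()

	first := h.Publish(CapacityPressureEvent{Pool: "pool-a"})
	second := h.Publish(CapacityPressureEvent{Pool: "pool-a"}) // buffer already full
	fmt.Println("dropped:", first, second)
	fmt.Println("received:", (<-ch).Pool)
}
```

Dropping is tolerable in this design because a missed event only delays scale-up until the next poll; blocking the publisher would instead delay the request that triggered the event.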

Subscribe the WorkerPool autoscaler to ateapi's WatchCapacityPressure stream and turn each event into an immediate reconcile of that pool via a controller-runtime source.Channel, so an empty pool scales up at the miss instead of at the next poll. The periodic requeue remains the down-path and safety net. The stream watcher runs as a manager runnable and reconnects on error without crashing the manager. Part of agent-substrate#198 (Phase 3b).

Add demos/autoscaling: an autoscaled WorkerPool manifest (minReady, targetBuffer, maxReplicas), a scenario driver (demo.sh) that exercises reactive scale-up and hysteretic scale-down via kubectl-ate, and a README walkthrough. Part of agent-substrate#198.

google-cla · 2026-06-10T21:22:11Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Signed-off-by: Omer Yahud <oyahud@nvidia.com>

Omer Yahud (omeryahud) · 2026-06-11T12:33:14Z

+	if err = (&controllers.WorkerPoolAutoscaler{
+		Client:    mgr.GetClient(),
+		AteClient: ateapiClient,
+		Config: autoscaler.Config{


Maybe this should be configurable per worker pool?

Omer Yahud (omeryahud) added 6 commits June 11, 2026 00:20

feat(api): add WorkerPool autoscaling bounds to WorkerPoolSpec

03fb8ee

Add minReady, targetBuffer, and maxReplicas as optional declarative inputs for the WorkerPool autoscaler, plus a CEL rule rejecting minReady > maxReplicas. Regenerate deepcopy and CRD. Part of agent-substrate#198 (Phase 1).

Omer Yahud (omeryahud) marked this pull request as draft June 10, 2026 21:27

Julian Gutierrez Oschmann (juli4n) self-requested a review June 10, 2026 22:29

fix demo doc

62df3d2

Signed-off-by: Omer Yahud <oyahud@nvidia.com>

Omer Yahud (omeryahud) marked this pull request as ready for review June 11, 2026 17:04

Omer Yahud (omeryahud) wants to merge 7 commits into agent-substrate:main from omeryahud:feat/workerpool-autoscaling
