feat: WorkerPool Auto scaling #219
Open
Omer Yahud (omeryahud) wants to merge 7 commits into
Add minReady, targetBuffer, and maxReplicas as optional declarative inputs for the WorkerPool autoscaler, plus a CEL rule rejecting minReady > maxReplicas. Regenerate deepcopy and CRD. Part of agent-substrate#198 (Phase 1).
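As a rough illustration of the commit above, the three optional inputs and the CEL rule might be declared with kubebuilder markers along these lines. The struct name `WorkerPoolAutoscaling`, the pointer types, the `has()` guards in the rule, and the `Validate` helper are all assumptions made for this sketch; the real field layout in the PR may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkerPoolAutoscaling is a hypothetical grouping of the three optional
// autoscaler inputs. The XValidation marker is the kubebuilder form of a CEL
// rule; the has() guards are an assumption, added because both fields are
// optional and may be absent.
//
// +kubebuilder:validation:XValidation:rule="!has(self.minReady) || !has(self.maxReplicas) || self.minReady <= self.maxReplicas",message="minReady must not exceed maxReplicas"
type WorkerPoolAutoscaling struct {
	// +optional
	MinReady *int32 `json:"minReady,omitempty"`
	// +optional
	TargetBuffer *int32 `json:"targetBuffer,omitempty"`
	// +optional
	MaxReplicas *int32 `json:"maxReplicas,omitempty"`
}

var errMinAboveMax = errors.New("minReady must not exceed maxReplicas")

// Validate mirrors the CEL rule in plain Go so the intent can be unit tested
// without an API server. It is illustrative only and not part of the PR.
func (a WorkerPoolAutoscaling) Validate() error {
	if a.MinReady != nil && a.MaxReplicas != nil && *a.MinReady > *a.MaxReplicas {
		return errMinAboveMax
	}
	return nil
}

// ptr is a small helper for building optional int32 fields.
func ptr(v int32) *int32 { return &v }

func main() {
	ok := WorkerPoolAutoscaling{MinReady: ptr(2), MaxReplicas: ptr(10)}
	bad := WorkerPoolAutoscaling{MinReady: ptr(5), MaxReplicas: ptr(3)}
	fmt.Println("ok:", ok.Validate())
	fmt.Println("bad:", bad.Validate())
}
```

With both fields optional, a pool that sets only one of them passes the rule, which is why the guard checks presence before comparing.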
Add internal/autoscaler with Step(): given a pool's bounds, a live occupancy observation, loop config and the clock, it returns the next desired replica count. Scale-up is immediate; scale-down is gated by a stabilization window. No Kubernetes/gRPC deps; covered by 16 unit tests. Part of agent-substrate#198 (Phase 2a).
Add WorkerPoolAutoscaler in internal/controllers: the single writer of spec.replicas for autoscaled pools. It reads each pool's bounds, measures occupancy via ateapi ListWorkers, runs the decision policy, and patches replicas. WorkerPoolReconciler still owns the Deployment and status, so the two controllers write disjoint fields. Re-evaluation is poll-driven (RequeueAfter) as the down-path and safety net; the reactive up-path is added separately. Wired into atecontroller, reusing its ateapi client. Part of agent-substrate#198 (Phase 2b).
Add a WatchCapacityPressure server-streaming RPC and an in-process pub/sub hub. When AssignWorkerStep finds no free worker for a pool it publishes a pool-scoped CapacityPressureEvent at the request edge; subscribers stream them. Publish is non-blocking (lossy on a full subscriber buffer) so it never slows the resume path. Producer side of the reactive scale-up path; the autoscaler subscribes separately. Part of agent-substrate#198 (Phase 3a).
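A minimal sketch of the non-blocking, lossy publish described above, using only the standard library. The `Hub` type, its method names, and the `Pool` field are hypothetical; the PR's hub and the generated `CapacityPressureEvent` message may look different.

```go
package main

import (
	"fmt"
	"sync"
)

// CapacityPressureEvent is a stand-in for the pool-scoped event; the real
// message is generated from the proto and may carry more fields.
type CapacityPressureEvent struct{ Pool string }

// Hub is a hypothetical in-process pub/sub hub.
type Hub struct {
	mu   sync.RWMutex
	subs map[int]chan CapacityPressureEvent
	next int
}

func NewHub() *Hub { return &Hub{subs: make(map[int]chan CapacityPressureEvent)} }

// Subscribe registers a buffered subscriber and returns its channel plus a
// cancel function that unregisters it and closes the channel.
func (h *Hub) Subscribe(buf int) (<-chan CapacityPressureEvent, func()) {
	ch := make(chan CapacityPressureEvent, buf)
	h.mu.Lock()
	id := h.next
	h.next++
	h.subs[id] = ch
	h.mu.Unlock()
	return ch, func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		if _, ok := h.subs[id]; ok {
			delete(h.subs, id)
			close(ch)
		}
	}
}

// Publish never blocks: if a subscriber's buffer is full the event is dropped
// for that subscriber. It returns the number of drops.
func (h *Hub) Publish(ev CapacityPressureEvent) (dropped int) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for _, ch := range h.subs {
		select {
		case ch <- ev:
		default:
			dropped++ // slow subscriber: drop instead of stalling the caller
		}
	}
	return dropped
}

func main() {
	h := NewHub()
	ch, cancel := h.Subscribe(1)
	defer cancel()
	fmt.Println("dropped:", h.Publish(CapacityPressureEvent{Pool: "pool-a"}))
	fmt.Println("dropped:", h.Publish(CapacityPressureEvent{Pool: "pool-a"}))
	fmt.Println("received:", (<-ch).Pool)
}
```

Dropping on a full buffer is acceptable in this design because the periodic requeue still acts as a safety net: a lost event delays scale-up until the next poll rather than preventing it.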
Subscribe the WorkerPool autoscaler to ateapi's WatchCapacityPressure stream and turn each event into an immediate reconcile of that pool via a controller-runtime source.Channel, so an empty pool scales up at the miss instead of at the next poll. The periodic requeue remains the down-path and safety net. The stream watcher runs as a manager runnable and reconnects on error without crashing the manager. Part of agent-substrate#198 (Phase 3b).
Add demos/autoscaling: an autoscaled WorkerPool manifest (minReady, targetBuffer, maxReplicas), a scenario driver (demo.sh) that exercises reactive scale-up and hysteretic scale-down via kubectl-ate, and a README walkthrough. Part of agent-substrate#198.
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Signed-off-by: Omer Yahud <oyahud@nvidia.com>
if err = (&controllers.WorkerPoolAutoscaler{
	Client: mgr.GetClient(),
	AteClient: ateapiClient,
	Config: autoscaler.Config{
Maybe this should be configurable per worker pool?
Fixes #198
Notes
Follow-up:
observe() filters workers to the pool client-side (ListWorkers has no server-side filter), so each reconcile lists all cluster workers. This is fine for now, but it should be scoped server-side as pools scale.
Tests pass
Appropriate changes to documentation are included in the PR