feat: Add request parking to the atenet router #221
Open
Omer Yahud (omeryahud) wants to merge 5 commits into
When a request targets a suspended actor, the router resumes it via the
control plane before routing. A momentarily saturated worker pool makes
ResumeActor return FailedPrecondition ("no free workers available"), which
the router previously turned straight into a 503. In an oversubscribed
system that shortage usually clears within milliseconds as another actor
suspends and frees its worker, so failing fast was wasteful.
Park such requests instead: retry the resume on FailedPrecondition (in
addition to the existing Aborted conflict) until the actor becomes routable
or a bounded wait elapses, capped by a fixed-capacity admission lot that
sheds excess load. On budget expiry the underlying capacity error is
surfaced so the HTTP boundary maps it faithfully. singleflight still
collapses concurrent waiters for the same actor into one resume RPC.
Parking can be disabled to preserve the legacy fail-fast behavior.
- parking.go: parkingLot (bounded, non-blocking admission gate) + config
- resumer.go: retryable predicate + configurable park budget
- extproc.go: admit each request to the lot around the resume call
- metrics.go: parking.active / wait.duration / rejected instruments
- errors.go: parkingFullErr (503 "router at capacity")
- router.go: --parking-enabled / --parking-max-wait / --parking-max-parked
- status.go + dashboard.html: Request Parking status card
- docs/request-parking.md: feature documentation
Replace the bare-string park-outcome constants with a parkOutcome string type and rename the classifier to parkOutcomeFor, so the parking.wait.duration outcome label is type-checked rather than an arbitrary string.
wait.Backoff zeroes its Steps once the per-attempt delay reaches Cap, so the resume retry loop gave up after ~7 steps (~5s) regardless of --parking-max-wait. Drop the Cap and use a gentle backoff (500ms x1.1, no Cap) so the budget context alone bounds the wait; the slow growth keeps the inter-retry gap small (~3.5s by 30s) on its own. Add a regression test asserting the backoff sets no Cap.
The ext_proc filter hard-coded MessageTimeout=5s, so Envoy abandoned a parked request (HTTP 500) long before the router's park budget elapsed. Make it configurable via SetExtProcMessageTimeout and set it to --parking-max-wait + margin when parking is enabled, so Envoy holds the request open until the router itself resolves or sheds it.
Add a demo with an oversubscribed WorkerPool (2 workers, several actors) that exercises the router parking path: requests to a saturated pool park and retry instead of failing fast, and are served once capacity frees up. Includes load.sh and --deploy-demo-parking / --delete-demo-parking wiring in hack/install-ate.sh.
When a request targets a suspended actor, the router resumes it via the control plane before routing. A momentarily saturated worker pool makes ResumeActor return FailedPrecondition ("no free workers available"), which the router previously turned straight into a 503. In an oversubscribed system that shortage might pass quickly as another actor suspends and frees its worker, so failing fast was wasteful.
Park such requests instead: retry the resume on FailedPrecondition until the actor becomes routable or a bounded wait elapses, capped by a fixed-capacity admission lot that sheds excess load. On budget expiry the underlying capacity error is surfaced so the HTTP boundary maps it faithfully. singleflight still collapses concurrent waiters for the same actor into one resume RPC. Parking can be disabled to preserve the legacy fail-fast behavior.
Fixes #27