feat: Add request parking to the atenet router #221

Open
Omer Yahud (omeryahud) wants to merge 5 commits into agent-substrate:main from omeryahud:worktree-request-parking

@omeryahud Omer Yahud (omeryahud) commented Jun 11, 2026

When a request targets a suspended actor, the router resumes it via the control plane before routing. A momentarily saturated worker pool makes ResumeActor return FailedPrecondition ("no free workers available"), which the router previously turned straight into a 503. In an oversubscribed system that shortage might pass quickly as another actor suspends and frees its worker, so failing fast was wasteful.

Park such requests instead: retry the resume on FailedPrecondition until the actor becomes routable or a bounded wait elapses, capped by a fixed-capacity admission lot that sheds excess load. On budget expiry the underlying capacity error is surfaced so the HTTP boundary maps it faithfully. singleflight still collapses concurrent waiters for the same actor into one resume RPC. Parking can be disabled to preserve the legacy fail-fast behavior.
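
For readers skimming the description, the "fixed-capacity admission lot" can be pictured roughly as below. This is a minimal sketch and not the code in parking.go: the buffered-channel design, the `admit`/`active` method names, and the `errParkingFull` sentinel (standing in for `parkingFullErr`) are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errParkingFull is a hypothetical stand-in for the PR's parkingFullErr
// (503 "router at capacity").
var errParkingFull = errors.New("router at capacity")

// parkingLot is a bounded, non-blocking admission gate: a request either
// gets a slot immediately or is shed. It never queues for a slot.
type parkingLot struct {
	slots chan struct{}
}

func newParkingLot(maxParked int) *parkingLot {
	return &parkingLot{slots: make(chan struct{}, maxParked)}
}

// admit claims a slot without blocking. On success it returns a release
// function that frees the slot; calling release more than once is safe.
func (l *parkingLot) admit() (release func(), err error) {
	select {
	case l.slots <- struct{}{}:
		var once sync.Once
		return func() { once.Do(func() { <-l.slots }) }, nil
	default:
		return nil, errParkingFull
	}
}

// active reports how many requests currently hold a slot.
func (l *parkingLot) active() int { return len(l.slots) }

func main() {
	lot := newParkingLot(2)
	release, _ := lot.admit()
	_, _ = lot.admit()

	// The lot is full, so a third request is shed instead of waiting.
	_, err := lot.admit()
	fmt.Println("third admit:", err, "active:", lot.active())

	// Releasing a slot makes room for the next request.
	release()
	_, err = lot.admit()
	fmt.Println("after release:", err, "active:", lot.active())
}
```

The non-blocking `select` is what makes the lot shed load rather than build a second queue in front of the resume call.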

Fixes #27

  • Tests pass
  • Appropriate changes to documentation are included in the PR

@google-cla

google-cla Bot commented Jun 11, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@bowei Bowei Du (bowei) self-assigned this Jun 12, 2026
@omeryahud Omer Yahud (omeryahud) marked this pull request as ready for review June 15, 2026 08:24
@omeryahud Omer Yahud (omeryahud) marked this pull request as draft June 15, 2026 08:24
When a request targets a suspended actor, the router resumes it via the
control plane before routing. A momentarily saturated worker pool makes
ResumeActor return FailedPrecondition ("no free workers available"), which
the router previously turned straight into a 503. In an oversubscribed
system that shortage usually clears within milliseconds as another actor
suspends and frees its worker, so failing fast was wasteful.

Park such requests instead: retry the resume on FailedPrecondition (in
addition to the existing Aborted conflict) until the actor becomes routable
or a bounded wait elapses, capped by a fixed-capacity admission lot that
sheds excess load. On budget expiry the underlying capacity error is
surfaced so the HTTP boundary maps it faithfully. singleflight still
collapses concurrent waiters for the same actor into one resume RPC.
Parking can be disabled to preserve the legacy fail-fast behavior.

- parking.go: parkingLot (bounded, non-blocking admission gate) + config
- resumer.go: retryable predicate + configurable park budget
- extproc.go: admit each request to the lot around the resume call
- metrics.go: parking.active / wait.duration / rejected instruments
- errors.go: parkingFullErr (503 "router at capacity")
- router.go: --parking-enabled / --parking-max-wait / --parking-max-parked
- status.go + dashboard.html: Request Parking status card
- docs/request-parking.md: feature documentation
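
The retry-until-budget behavior described in this commit could look roughly like the sketch below. It uses only the standard library, so `errNoCapacity` stands in for the gRPC FailedPrecondition ("no free workers available") status, and `resumeWithBudget`, `retryable`, `simulate`, and the fixed retry interval are hypothetical names and simplifications, not the contents of resumer.go.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNoCapacity is a stand-in for the control plane's FailedPrecondition
// ("no free workers available") response.
var errNoCapacity = errors.New("no free workers available")

// retryable reports whether err is a transient capacity error worth parking on.
func retryable(err error) bool { return errors.Is(err, errNoCapacity) }

// resumeWithBudget retries resume until it succeeds, fails with a
// non-retryable error, or the park budget elapses. On budget expiry it
// returns the last capacity error rather than a context error, so the
// caller can map it to the right HTTP status.
func resumeWithBudget(ctx context.Context, budget, interval time.Duration,
	resume func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()
	for {
		err := resume(ctx)
		if err == nil || !retryable(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return err // surface the underlying capacity error
		case <-time.After(interval):
		}
	}
}

// simulate runs resumeWithBudget against a fake control plane that reports
// no capacity for the first `failures` attempts and then succeeds.
func simulate(failures, budgetMillis int) (attempts int, err error) {
	resume := func(context.Context) error {
		attempts++
		if attempts <= failures {
			return errNoCapacity
		}
		return nil
	}
	budget := time.Duration(budgetMillis) * time.Millisecond
	err = resumeWithBudget(context.Background(), budget, 5*time.Millisecond, resume)
	return attempts, err
}

func main() {
	attempts, err := simulate(2, 1000)
	fmt.Println("capacity frees up:", attempts, "attempts, err =", err)

	_, err = simulate(1_000_000, 30)
	fmt.Println("budget expires: err =", err)
}
```

Returning the last capacity error on expiry, instead of `context.DeadlineExceeded`, mirrors the commit's point that the HTTP boundary should see the real cause.
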
Replace the bare-string park-outcome constants with a parkOutcome string type and rename the classifier to parkOutcomeFor, so the parking.wait.duration outcome label is type-checked rather than an arbitrary string.
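
As an illustration of the typed-label idea in this commit, a sketch might look like the following. Only the names `parkOutcome` and `parkOutcomeFor` come from the commit message; the outcome values, the sentinel errors, the function signature, and `recordWait` are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// parkOutcome is a distinct string type, so a plain string variable cannot
// be passed where an outcome label is expected.
type parkOutcome string

// Hypothetical outcome values; the real set in the PR may differ.
const (
	outcomeResumed  parkOutcome = "resumed"
	outcomeTimedOut parkOutcome = "timed_out"
	outcomeRejected parkOutcome = "rejected"
	outcomeError    parkOutcome = "error"
)

// Stand-in sentinel errors for illustration only.
var (
	errNoCapacity  = errors.New("no free workers available")
	errParkingFull = errors.New("router at capacity")
)

// parkOutcomeFor classifies the result of a parked resume into a label.
func parkOutcomeFor(err error) parkOutcome {
	switch {
	case err == nil:
		return outcomeResumed
	case errors.Is(err, errParkingFull):
		return outcomeRejected
	case errors.Is(err, errNoCapacity):
		return outcomeTimedOut
	default:
		return outcomeError
	}
}

// recordWait shows a call site that only accepts the typed label.
func recordWait(outcome parkOutcome) string {
	return "outcome=" + string(outcome)
}

func main() {
	fmt.Println(recordWait(parkOutcomeFor(nil)))
	fmt.Println(recordWait(parkOutcomeFor(errNoCapacity)))
	fmt.Println(recordWait(parkOutcomeFor(errParkingFull)))
}
```

Note that Go still converts untyped string constants implicitly, so the type guards against passing arbitrary `string` variables, not against every literal.
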
wait.Backoff zeroes its Steps once the per-attempt delay reaches Cap, so the resume retry loop gave up after ~7 steps (~5s) regardless of --parking-max-wait. Drop the Cap and use a gentle backoff (500ms x1.1, no Cap) so the budget context alone bounds the wait; the slow growth keeps the inter-retry gap small (~3.5s by 30s) on its own. Add a regression test asserting the backoff sets no Cap.
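
The arithmetic behind the "500ms x1.1, no Cap" choice can be checked with a small standalone sketch. It does not use wait.Backoff; `schedule` is a hypothetical helper that lists the gaps of an uncapped geometric backoff without jitter, assuming a 30s budget as in the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// schedule returns the successive inter-retry gaps of an uncapped geometric
// backoff, stopping before the cumulative wait would exceed budget.
func schedule(initial time.Duration, factor float64, budget time.Duration) []time.Duration {
	var gaps []time.Duration
	elapsed := time.Duration(0)
	gap := initial
	for elapsed+gap <= budget {
		gaps = append(gaps, gap)
		elapsed += gap
		gap = time.Duration(float64(gap) * factor)
	}
	return gaps
}

func main() {
	gaps := schedule(500*time.Millisecond, 1.1, 30*time.Second)
	var total time.Duration
	for _, g := range gaps {
		total += g
	}
	fmt.Println("retries within budget:", len(gaps))
	fmt.Println("last gap:", gaps[len(gaps)-1].Round(10*time.Millisecond))
	fmt.Println("total wait:", total.Round(10*time.Millisecond))
}
```

Under these assumptions the gap stays in the low single-digit seconds across the whole 30s budget, which is consistent with the commit's claim that the budget context, not the backoff, should bound the wait.
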
The ext_proc filter hard-coded MessageTimeout=5s, so Envoy abandoned a parked request (HTTP 500) long before the router's park budget elapsed. Make it configurable via SetExtProcMessageTimeout and set it to --parking-max-wait + margin when parking is enabled, so Envoy holds the request open until the router itself resolves or sheds it.
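
The relationship between the two timeouts can be stated as a tiny helper. This is a sketch of the rule in the commit message only: `extProcMessageTimeout` and the 5s margin are hypothetical, and the real value is applied through SetExtProcMessageTimeout in the PR.

```go
package main

import (
	"fmt"
	"time"
)

// defaultMessageTimeout is the previously hard-coded ext_proc MessageTimeout.
const defaultMessageTimeout = 5 * time.Second

// extProcMessageTimeout picks how long Envoy should wait for the router.
// With parking enabled it must outlast the park budget, so the router
// (not Envoy) decides when a parked request is resolved or shed.
func extProcMessageTimeout(parkingEnabled bool, maxWait, margin time.Duration) time.Duration {
	if !parkingEnabled {
		return defaultMessageTimeout
	}
	return maxWait + margin
}

func main() {
	// The 5s margin here is an assumed value for illustration.
	fmt.Println(extProcMessageTimeout(false, 30*time.Second, 5*time.Second))
	fmt.Println(extProcMessageTimeout(true, 30*time.Second, 5*time.Second))
}
```
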
Add a demo with an oversubscribed WorkerPool (2 workers, several actors) that exercises the router parking path: requests to a saturated pool park and retry instead of failing fast, and are served once capacity frees up. Includes load.sh and --deploy-demo-parking / --delete-demo-parking wiring in hack/install-ate.sh.

Development

Successfully merging this pull request may close these issues.

Router needs to park requests and wait for capacity