Skip to content

fix(ipc): retry proc acquisition when spawns fail#1669

Open
rosetta-livekit-bot[bot] wants to merge 2 commits into
mainfrom
ancients-dullard-precise
Open

fix(ipc): retry proc acquisition when spawns fail#1669
rosetta-livekit-bot[bot] wants to merge 2 commits into
mainfrom
ancients-dullard-precise

Conversation

@rosetta-livekit-bot
Copy link
Copy Markdown
Contributor

@rosetta-livekit-bot rosetta-livekit-bot Bot commented Jun 1, 2026

Summary

Fixes #5868. When every worker process fails to initialize, launch_job previously hung forever on _warmed_proc_queue.get() because nothing was ever put on the queue, and the 3-attempt retry loop was unreachable.

Split the responsibility: _acquire_proc now owns the wait-for-a-warmed-process loop. It races queue.get() against every in-flight spawn task and only retries once all in-flight spawns settle without producing a proc — so peer spawns still in flight don't burn a retry attempt. After MAX_ACQUIRE_ATTEMPTS such cycles, it raises a RuntimeError. launch_job keeps its own 3-attempt budget for post-acquire launch failures (the original retry semantics, untouched).

Alternative considered

#5871 uses None sentinels on _warmed_proc_queue to unblock waiters when a spawn fails. Two issues with that approach:

  1. It wakes too eagerly — a sentinel is pushed on each individual spawn failure, even though peer spawns may still succeed, burning a retry attempt.
  2. The retry doesn't trigger a fresh spawn — on loop-back the queue isn't empty (it has more None sentinels), so launch_job consumes another sentinel without ever requesting a new process. MAX_ATTEMPTS=3 becomes "fail 3 sentinels and give up".

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Jun 1, 2026

🦋 Changeset detected

Latest commit: e0aa0ae

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 33 packages
Name Type
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-assemblyai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-cerebras Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-fishaudio Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-hume Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-liveavatar Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-minimax Patch
@livekit/agents-plugin-mistral Patch
@livekit/agents-plugin-mistralai Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-perplexity Patch
@livekit/agents-plugin-phonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-runway Patch
@livekit/agents-plugin-sarvam Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugin-tavus Patch
@livekit/agents-plugins-test Patch
@livekit/agents-plugin-trugen Patch
@livekit/agents-plugin-xai Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

@longcw
Copy link
Copy Markdown
Contributor

longcw commented Jun 5, 2026

Review

TL;DR: the bug is real in JS, so this is worth taking — but a few things to confirm/fix before merging.

The bug exists in JS

launchJob blocks on warmedProcQueue.get() (proc_pool.ts:60), and the queue only ever receives an entry on the success path of procWatchTask (proc_pool.ts:110). If every spawned process throws in proc.initialize() (broken prewarm, missing env var, import error in the agent module), nothing is enqueued and launchJob hangs forever — the job is never accepted, failed, or reported. So converting that into a surfaced error has genuine value. (Note: only affects the procMutex/prewarm path; the numIdleProcesses === 0 branch already propagates init failures inline.)

Two behavioral things to confirm

  1. Symptom vs. root cause. This makes the waiting job give up, but doesn't touch the run() loop (proc_pool.ts:148-164), which keeps respawning crashing child processes forever (CPU spin + log spam) after launchJob throws. Matches the Python fix's scope, just worth being explicit that the crash loop is unchanged.
  2. Transient-failure resilience regression. Indefinite waiting currently lets a job self-heal if an early spawn fails transiently but a later one succeeds. With this PR, 3 failed acquisition cycles reject the job even if the next spawn would have succeeded. Fine for deterministic failures, a robustness regression for flaky ones. Python made the same MAX_ATTEMPTS=3 tradeoff — just flagging it's a conscious product decision.

Code-quality concerns

  • This is a reimplementation, not a line-by-line port. Python's _acquire_proc actively spawns when the queue is empty; the JS version only waits and relies on the background run() loop to spawn. Reasonable adaptation, but the "3 attempts" semantics differ from Python — please validate behavior on its own terms rather than assuming parity.
  • setTimeout(resolve, 0) in acquireWarmedProcEntry is fragile. It bridges the race where launchJob runs before run() has pushed a task. A single macrotask tick is a timing assumption — under load this can return undefined early and burn an attempt. Worth justifying or replacing with something deterministic.
  • Subtle proc/slot leak window. queueGetwarmedProcQueue.get() shift()s the item synchronously right before resolving (utils.ts:107). If spawnsDone wins the Promise.race in the same turn a put lands, an entry could be shifted out but returned as undefined without ever calling entry.unlock() — leaking the proc and its MultiMutex slot. Narrow window, but real; the original launchJob had no such window.
  • No unhandled-rejection issue — Promise.race keeps a reaction on the losing queueGet, so the post-abort() AbortError is consumed. That part's fine.

Tests

The added test covers the headline case (all spawns fail → throws). But it manually pushes into pool.tasks and bypasses run()/start(), so it never exercises the setTimeout(0) bridge or the concurrent respawn path — the two riskiest pieces are untested.

@rosetta-livekit-bot
Copy link
Copy Markdown
Contributor Author

Addressed the review concerns in e0aa0ae:

  • Removed the setTimeout(0) bridge by tracking spawn attempts as soon as lock acquisition starts.
  • Avoided racing a consuming queue get against spawn completion by adding a non-consuming Queue.waitForItem().
  • Split spawn-attempt tracking from long-lived proc watch tasks, matching the Python spawn/monitor distinction more closely.
  • Updated the regression test to exercise start()/run() instead of manually pushing pool.tasks.

Verification:

  • pnpm vitest run agents/src/ipc/proc_pool.test.ts
  • pnpm --filter @livekit/agents build:types
  • pnpm --filter @livekit/agents throws:check
  • pnpm --filter @livekit/agents lint (existing warnings only)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant