fix(ipc): retry proc acquisition when spawns fail by rosetta-livekit-bot[bot] · Pull Request #1669 · livekit/agents-js

rosetta-livekit-bot · 2026-06-01T07:02:22Z

Summary

Fixes #5868. When every worker process fails to initialize, launch_job previously hung forever on _warmed_proc_queue.get() because nothing was ever put on the queue, and the 3-attempt retry loop was unreachable.

Split the responsibility: _acquire_proc now owns the wait-for-a-warmed-process loop. It races queue.get() against every in-flight spawn task and only retries once all in-flight spawns settle without producing a proc — so peer spawns still in flight don't burn a retry attempt. After MAX_ACQUIRE_ATTEMPTS such cycles, it raises a RuntimeError. launch_job keeps its own 3-attempt budget for post-acquire launch failures (the original retry semantics, untouched).

Alternative considered

#5871 uses None sentinels on _warmed_proc_queue to unblock waiters when a spawn fails. Two issues with that approach:

It wakes too eagerly — a sentinel is pushed on each individual spawn failure, even though peer spawns may still succeed, burning a retry attempt.
The retry doesn't trigger a fresh spawn — on loop-back the queue isn't empty (it has more None sentinels), so launch_job consumes another sentinel without ever requesting a new process. MAX_ATTEMPTS=3 becomes "fail 3 sentinels and give up".

changeset-bot · 2026-06-01T07:02:26Z

🦋 Changeset detected

Latest commit: e0aa0ae

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 33 packages

Name	Type
@livekit/agents	Patch
@livekit/agents-plugin-anam	Patch
@livekit/agents-plugin-assemblyai	Patch
@livekit/agents-plugin-baseten	Patch
@livekit/agents-plugin-bey	Patch
@livekit/agents-plugin-cartesia	Patch
@livekit/agents-plugin-cerebras	Patch
@livekit/agents-plugin-deepgram	Patch
@livekit/agents-plugin-elevenlabs	Patch
@livekit/agents-plugin-fishaudio	Patch
@livekit/agents-plugin-google	Patch
@livekit/agents-plugin-hedra	Patch
@livekit/agents-plugin-hume	Patch
@livekit/agents-plugin-inworld	Patch
@livekit/agents-plugin-lemonslice	Patch
@livekit/agents-plugin-liveavatar	Patch
@livekit/agents-plugin-livekit	Patch
@livekit/agents-plugin-minimax	Patch
@livekit/agents-plugin-mistral	Patch
@livekit/agents-plugin-mistralai	Patch
@livekit/agents-plugin-neuphonic	Patch
@livekit/agents-plugin-openai	Patch
@livekit/agents-plugin-perplexity	Patch
@livekit/agents-plugin-phonic	Patch
@livekit/agents-plugin-resemble	Patch
@livekit/agents-plugin-rime	Patch
@livekit/agents-plugin-runway	Patch
@livekit/agents-plugin-sarvam	Patch
@livekit/agents-plugin-silero	Patch
@livekit/agents-plugin-tavus	Patch
@livekit/agents-plugins-test	Patch
@livekit/agents-plugin-trugen	Patch
@livekit/agents-plugin-xai	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

longcw · 2026-06-05T09:36:08Z

Review

TL;DR: the bug is real in JS, so this is worth taking — but a few things to confirm/fix before merging.

The bug exists in JS

launchJob blocks on warmedProcQueue.get() (proc_pool.ts:60), and the queue only ever receives an entry on the success path of procWatchTask (proc_pool.ts:110). If every spawned process throws in proc.initialize() (broken prewarm, missing env var, import error in the agent module), nothing is enqueued and launchJob hangs forever — the job is never accepted, failed, or reported. So converting that into a surfaced error has genuine value. (Note: only affects the procMutex/prewarm path; the numIdleProcesses === 0 branch already propagates init failures inline.)

Two behavioral things to confirm

Symptom vs. root cause. This makes the waiting job give up, but doesn't touch the run() loop (proc_pool.ts:148-164), which keeps respawning crashing child processes forever (CPU spin + log spam) after launchJob throws. Matches the Python fix's scope, just worth being explicit that the crash loop is unchanged.
Transient-failure resilience regression. Indefinite waiting currently lets a job self-heal if an early spawn fails transiently but a later one succeeds. With this PR, 3 failed acquisition cycles reject the job even if the next spawn would have succeeded. Fine for deterministic failures, a robustness regression for flaky ones. Python made the same MAX_ATTEMPTS=3 tradeoff — just flagging it's a conscious product decision.

Code-quality concerns

This is a reimplementation, not a line-by-line port. Python's _acquire_proc actively spawns when the queue is empty; the JS version only waits and relies on the background run() loop to spawn. Reasonable adaptation, but the "3 attempts" semantics differ from Python — please validate behavior on its own terms rather than assuming parity.
setTimeout(resolve, 0) in acquireWarmedProcEntry is fragile. It bridges the race where launchJob runs before run() has pushed a task. A single macrotask tick is a timing assumption — under load this can return undefined early and burn an attempt. Worth justifying or replacing with something deterministic.
Subtle proc/slot leak window. queueGet → warmedProcQueue.get() shift()s the item synchronously right before resolving (utils.ts:107). If spawnsDone wins the Promise.race in the same turn a put lands, an entry could be shifted out but returned as undefined without ever calling entry.unlock() — leaking the proc and its MultiMutex slot. Narrow window, but real; the original launchJob had no such window.
No unhandled-rejection issue — Promise.race keeps a reaction on the losing queueGet, so the post-abort() AbortError is consumed. That part's fine.

Tests

The added test covers the headline case (all spawns fail → throws). But it manually pushes into pool.tasks and bypasses run()/start(), so it never exercises the setTimeout(0) bridge or the concurrent respawn path — the two riskiest pieces are untested.

rosetta-livekit-bot · 2026-06-05T09:59:34Z

Addressed the review concerns in e0aa0ae:

Removed the setTimeout(0) bridge by tracking spawn attempts as soon as lock acquisition starts.
Avoided racing a consuming queue get against spawn completion by adding a non-consuming Queue.waitForItem().
Split spawn-attempt tracking from long-lived proc watch tasks, matching the Python spawn/monitor distinction more closely.
Updated the regression test to exercise start()/run() instead of manually pushing pool.tasks.

Verification:

pnpm vitest run agents/src/ipc/proc_pool.test.ts
pnpm --filter @livekit/agents build:types
pnpm --filter @livekit/agents throws:check
pnpm --filter @livekit/agents lint (existing warnings only)

fix(ipc): retry proc acquisition when spawns fail

93866f2

devin-ai-integration Bot reviewed Jun 1, 2026

View reviewed changes

fix(ipc): avoid proc acquisition race

e0aa0ae

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ipc): retry proc acquisition when spawns fail#1669

fix(ipc): retry proc acquisition when spawns fail#1669
rosetta-livekit-bot[bot] wants to merge 2 commits into
mainfrom
ancients-dullard-precise

rosetta-livekit-bot Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

changeset-bot Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

longcw commented Jun 5, 2026

Uh oh!

rosetta-livekit-bot Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rosetta-livekit-bot Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Alternative considered

Uh oh!

changeset-bot Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

longcw commented Jun 5, 2026

Review

The bug exists in JS

Two behavioral things to confirm

Code-quality concerns

Tests

Uh oh!

rosetta-livekit-bot Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rosetta-livekit-bot Bot commented Jun 1, 2026 •

edited

Loading

changeset-bot Bot commented Jun 1, 2026 •

edited

Loading