fix(ipc): retry proc acquisition when spawns fail #1669
rosetta-livekit-bot[bot] wants to merge 2 commits into
🦋 Changeset detected
Latest commit: e0aa0ae
The changes in this PR will be included in the next version bump. This PR includes changesets to release 33 packages.
Review
TL;DR: the bug is real in JS, so this is worth taking, but a few things to confirm/fix before merging.
The bug exists in JS
Two behavioral things to confirm
Code-quality concerns
Tests
The added test covers the headline case (all spawns fail → throws). But it manually pushes into
Addressed the review concerns in e0aa0ae:
Verification:
Summary
Fixes #5868. When every worker process fails to initialize, `launch_job` previously hung forever on `_warmed_proc_queue.get()` because nothing was ever put on the queue, and the 3-attempt retry loop was unreachable.

Split the responsibility:

- `_acquire_proc` now owns the wait-for-a-warmed-process loop. It races `queue.get()` against every in-flight spawn task and only retries once all in-flight spawns settle without producing a proc, so peer spawns still in flight don't burn a retry attempt. After `MAX_ACQUIRE_ATTEMPTS` such cycles, it raises a `RuntimeError`.
- `launch_job` keeps its own 3-attempt budget for post-acquire launch failures (the original retry semantics, untouched).

Alternative considered
#5871 uses `None` sentinels on `_warmed_proc_queue` to unblock waiters when a spawn fails. Two issues with that approach:

- […] `None` sentinels), so `launch_job` consumes another sentinel without ever requesting a new process.
- `MAX_ATTEMPTS=3` becomes "fail 3 sentinels and give up".
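
To make the `_acquire_proc` behavior from the Summary concrete, here is a minimal asyncio sketch of racing `queue.get()` against in-flight spawn tasks. It is not the code from this PR: the `ProcPool` class, `spawn_fn`, `_start_spawn`, and `_in_flight` are hypothetical names introduced for illustration, and the value `3` for `MAX_ACQUIRE_ATTEMPTS` is an assumption (the summary names the constant but does not state its value).

```python
import asyncio

MAX_ACQUIRE_ATTEMPTS = 3  # assumed value, not stated in the PR summary


class ProcPool:
    """Hypothetical stand-in for the real pool; only acquisition is sketched."""

    def __init__(self, spawn_fn):
        self._spawn_fn = spawn_fn  # coroutine function: returns a proc or raises
        self._warmed_proc_queue = asyncio.Queue()
        self._spawn_tasks = set()

    def _start_spawn(self):
        async def _run():
            try:
                proc = await self._spawn_fn()
            except Exception:
                return  # a failed spawn simply produces no proc
            self._warmed_proc_queue.put_nowait(proc)

        task = asyncio.ensure_future(_run())
        self._spawn_tasks.add(task)
        task.add_done_callback(self._spawn_tasks.discard)

    def _in_flight(self):
        return {t for t in self._spawn_tasks if not t.done()}

    async def _acquire_proc(self):
        for _ in range(MAX_ACQUIRE_ATTEMPTS):
            # Only request a new process when nothing is queued or in flight.
            if self._warmed_proc_queue.empty() and not self._in_flight():
                self._start_spawn()
            get_task = asyncio.ensure_future(self._warmed_proc_queue.get())
            try:
                while not get_task.done():
                    in_flight = self._in_flight()
                    if not in_flight:
                        break  # every spawn settled without producing a proc
                    # Race the queue against all spawns still in flight.
                    await asyncio.wait(
                        {get_task, *in_flight},
                        return_when=asyncio.FIRST_COMPLETED,
                    )
                if not get_task.done():
                    # Yield once so a proc queued by the last spawn can arrive.
                    await asyncio.sleep(0)
                if get_task.done():
                    return get_task.result()
            finally:
                if not get_task.done():
                    get_task.cancel()
                    await asyncio.gather(get_task, return_exceptions=True)
        raise RuntimeError(
            f"no process became available after {MAX_ACQUIRE_ATTEMPTS} attempts"
        )
```

In this sketch a failed spawn only wakes the waiter; an attempt is consumed only when `_in_flight()` is empty, so a slow peer spawn that eventually succeeds is still picked up within the same attempt.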