fix(decopilot): unblock DBOS recovery — drop run-owner claim, make dispatch step retriable#3917
Merged
…ndant run-owner claim

A pod dying mid-run (rollout or crash) left decopilot runs stranded: the thread-gate DBOS workflow stayed PENDING (version-gated SDK recovery never re-homes it across a rollout), and a separate mesh-level run-owner claim (run_owner_pod CAS) blocked any re-dispatch with RunClaimError. The claim was a second single-execution guard, redundant with the thread-gate queue's concurrency=1 per threadId, and the only thing DBOS couldn't see or reconcile.

- Remove the run_owner_pod claim: RUN_STARTED is now a plain status write, no CAS; delete RunClaimError, claimRunStart, and orphanRunsByPod. The DBOS thread-gate queue is the sole single-execution guarantee.
- Make the dispatchRunAndWait step retriable (was retriesAllowed:false) so a re-enqueued run re-executes. Safe now: graceful shutdown aborts the run and cancels the daemon (down-channel cancel frame) before re-enqueue, and the daemon fences stale epochs by fenceToken.
- On graceful shutdown, requeueInflightThreadGateWorkflows() flips this executor's PENDING thread-gate workflows to ENQUEUED and clears executor_id + application_version, so any live executor (including a new rollout version) re-dequeues them: the queue dequeue allows application_version IS NULL, which the SDK's PENDING-recovery path does not. Runs after DBOS.shutdown() leaves them PENDING, before the pg pool closes.
- stopAll() no longer writes the DB (no claim to release); it aborts controllers (cancelling daemons) and clears in-memory state.

Tests: removed the obsolete claim-CAS / orphanRunsByPod cases; run-registry unit tests pass. The run_owner_pod column is left in place (now always null) for a follow-up drop migration.
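To make the stranding concrete, here is a minimal in-memory sketch of how a compare-and-set owner claim can block re-dispatch after the owning pod dies, and how a plain status write does not. It is an illustration under assumptions, not the mesh code: the `RunRow` shape, the `markRunStarted` helper, and the pod names are hypothetical, and the real `claimRunStart` ran against the database rather than an object.

```typescript
// Hypothetical in-memory stand-in for a run row; not the real mesh schema.
type RunRow = { status: "queued" | "in_progress"; runOwnerPod: string | null };

class RunClaimError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "RunClaimError";
  }
}

// Old behaviour (sketch): compare-and-set on the owner column.
// The claim succeeds only if nobody owns the run or the caller already does.
function claimRunStart(row: RunRow, podId: string): void {
  if (row.runOwnerPod !== null && row.runOwnerPod !== podId) {
    throw new RunClaimError("already running on another pod");
  }
  row.runOwnerPod = podId;
  row.status = "in_progress";
}

// New behaviour (sketch): a plain status write with no ownership check.
// Single execution is left to the queue's concurrency=1 per threadId.
function markRunStarted(row: RunRow): void {
  row.status = "in_progress";
}

// pod-a claims the run, then dies without releasing the claim.
const stranded: RunRow = { status: "queued", runOwnerPod: null };
claimRunStart(stranded, "pod-a");

// pod-b picks the workflow up, but the stale claim rejects it.
let blocked = false;
try {
  claimRunStart(stranded, "pod-b");
} catch (err) {
  blocked = err instanceof Error && err.name === "RunClaimError";
}

// Without the claim, a second start of the same run simply succeeds.
const recovered: RunRow = { status: "queued", runOwnerPod: null };
markRunStarted(recovered);
markRunStarted(recovered);
```

The sketch only shows why the claim was the blocking layer; it says nothing about how the queue itself enforces one run per thread.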
…ion=NULL)

Nulling application_version would let a different-version executor re-dequeue and resume the workflow by replaying its step journal against changed code. Keep the version so only a matching-version executor picks it up.
The graceful-shutdown re-enqueue was redundant with DBOS and didn't even work: DBOS.shutdown() drains in-flight workflows (verified: a sleeping step is run to SUCCESS during shutdown), so there are no PENDING workflows for the hook to re-enqueue. Recovery of an interrupted run is DBOS's responsibility.

This PR now does only what's needed to UNBLOCK DBOS's own recovery:

- remove the redundant run_owner_pod mesh claim (was the source of RunClaimError)
- make the dispatchRunAndWait step retriable (was retriesAllowed:false, which made a recovered workflow error instead of re-running)

Removed requeueInflightThreadGateWorkflows + its shutdown wiring + export.
…p runs

retriesAllowed must be conditional: a user-desktop run dispatches to a laptop daemon that keeps running after pod death, so a DBOS replay would race a second concurrent dispatch against the same workdir (the original v1 corruption hazard). The graceful abort that would stop the daemon doesn't run on a hard crash. Keep user-desktop non-retriable; hosted/in-process runs (no external daemon) stay retriable so DBOS recovers them.
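The conditional described in this commit could look roughly like the sketch below. It is a guess at the shape, not the code from the PR: the `RunTarget` union, `isRetriable`, `dispatchStepOptions`, and the `StepOptions` interface are hypothetical names, and only the `retriesAllowed` flag, the `dispatchRunAndWait` step name, and the user-desktop versus hosted/in-process distinction come from the commit message.

```typescript
// Hypothetical run targets; the labels follow the commit message wording.
type RunTarget = "user-desktop" | "hosted" | "in-process";

// Assumed minimal option shape carrying the `retriesAllowed` flag.
interface StepOptions {
  name: string;
  retriesAllowed: boolean;
}

// A user-desktop run talks to a laptop daemon that outlives the pod,
// so replaying the step could start a second dispatch in the same workdir.
function isRetriable(target: RunTarget): boolean {
  return target !== "user-desktop";
}

// Build the options for the dispatch step from the run target.
function dispatchStepOptions(target: RunTarget): StepOptions {
  return {
    name: "dispatchRunAndWait",
    retriesAllowed: isRetriable(target),
  };
}
```

Deriving the flag from the target in one small function keeps the safety rule in a single place, which makes it easier to revisit if user-desktop runs later gain a crash-safe way to stop the daemon.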
Updated the condition for the `retriable` variable to improve readability by breaking it into multiple lines. This change enhances code maintainability without altering functionality.
decocms Bot pushed a commit that referenced this pull request on Jun 15, 2026
PR: #3917 fix(decopilot): unblock DBOS recovery — drop run-owner claim, make dispatch step retriable Bump type: patch - decocms (apps/mesh/package.json): 3.18.9 -> 3.18.10
tlgimenes added a commit that referenced this pull request on Jun 15, 2026
origin/main #3917 removed the getPodId import + POD_ID const from app.ts; the clean auto-merge applied that removal while keeping the projector/heartbeat getPodId() call sites, leaving the name undeclared (TS2304 in the CI test-merge). Re-add the module-level import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem

A pod dying mid-run left decopilot runs stranded and threw `RunClaimError: already running on another pod` on retry. Two things in mesh were blocking DBOS's own workflow recovery:

- `run_owner_pod` mesh claim: a second single-execution guard (CAS in `claimRunStart`) layered on top of the DBOS thread-gate queue's `concurrency=1` per `threadId`. Redundant, and the one layer DBOS can't see or reconcile, so a dead pod's stale `run_owner_pod` blocked any re-dispatch.
- `retriesAllowed: false` on the `dispatchRunAndWait` step: a workflow DBOS did recover would refuse to re-run the step and just error.

Fix: only what's needed to let DBOS recover runs
- Removed the `run_owner_pod` claim. `RUN_STARTED` is a plain `status='in_progress'` write; deleted `RunClaimError`, `claimRunStart`, `orphanRunsByPod`. The DBOS thread-gate queue is the sole single-execution guarantee.
- Made the `dispatchRunAndWait` step retriable (was `retriesAllowed: false`) so DBOS re-runs the run on another executor. Safe because `stopAll()` aborts the in-flight run and cancels the daemon (down-channel `cancel` frame) on shutdown, and the daemon fences stale epochs by `fenceToken`.
- `stopAll()` no longer writes the DB: it aborts controllers and clears in-memory state.

No custom recovery logic. Recovery is DBOS's job; this PR just stops blocking it.
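The fencing mentioned above can be pictured with the usual fencing-token pattern, sketched below. This is an assumption about how such a fence generally works, not the daemon's actual implementation: the `Frame` shape, the `DaemonFence` class, and the idea that `fenceToken` is a monotonically increasing integer are all hypothetical.

```typescript
// Hypothetical frame shape; only `fenceToken` and the `cancel` frame
// are named in the PR description.
type Frame = { kind: "dispatch" | "cancel"; fenceToken: number };

// Assumes tokens increase with every new dispatch epoch.
class DaemonFence {
  private highestSeen = 0;

  // Accept a frame only if its token is at least as new as any seen so far.
  accept(frame: Frame): boolean {
    if (frame.fenceToken < this.highestSeen) {
      return false; // stale epoch: an older dispatcher is still talking
    }
    this.highestSeen = frame.fenceToken;
    return true;
  }
}

const fence = new DaemonFence();
const first = fence.accept({ kind: "dispatch", fenceToken: 1 });
const retried = fence.accept({ kind: "dispatch", fenceToken: 2 });
const stale = fence.accept({ kind: "dispatch", fenceToken: 1 });
const cancelCurrent = fence.accept({ kind: "cancel", fenceToken: 2 });
```

Under these assumptions, a re-run dispatched with a newer token wins, and any late frame from the dead pod's epoch is dropped instead of racing the retry.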
Why not a custom re-enqueue on shutdown (removed)

An earlier revision re-enqueued workflows in the shutdown hook. Removed: it was redundant with DBOS and it didn't work. An empirical probe (real DBOS + Postgres) showed that `DBOS.shutdown()` drains in-flight workflows (a sleeping step ran to SUCCESS during shutdown), so there are no PENDING workflows for a hook to re-enqueue. The probe also confirmed the underlying primitive is sound: a workflow orphaned by a crashed executor, once visible to a live executor, re-runs its retriable step and completes, which is what DBOS recovery drives.

Testing
`bun run check` clean; `run-registry` unit tests 19/19. Obsolete claim/orphan test cases removed.

Follow-up

- Drop the `run_owner_pod` column (migration).