fix(decopilot): unblock DBOS recovery — drop run-owner claim, make dispatch step retriable by pedrofrxncx · Pull Request #3917 · decocms/studio

pedrofrxncx · 2026-06-15T02:23:42Z

Problem

A pod dying mid-run left decopilot runs stranded, and any retry failed with RunClaimError: already running on another pod. Two things in mesh were blocking DBOS's own workflow recovery:

- run_owner_pod mesh claim: a second single-execution guard (CAS in claimRunStart) layered on top of the DBOS thread-gate queue's concurrency=1 per threadId. Redundant, and the one layer DBOS can't see/reconcile, so a dead pod's stale run_owner_pod blocked any re-dispatch.
- retriesAllowed: false on the dispatchRunAndWait step, so a workflow that DBOS did recover would refuse to re-run the step and just error.
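
To make the first blocker concrete, here is a minimal in-memory sketch of how a compare-and-set owner claim strands a run once its owner dies. The names RunClaimError and claimRunStart come from this PR; the RunRow shape, the newRow, markRunStarted, and isBlocked helpers, and all function bodies are assumptions for illustration, not the mesh implementation.

```typescript
// Hypothetical in-memory stand-in for a run row; the real schema is not shown in this PR.
type RunRow = {
  threadId: string;
  status: "idle" | "in_progress";
  runOwnerPod: string | null;
};

class RunClaimError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "RunClaimError";
  }
}

function newRow(threadId: string): RunRow {
  return { threadId, status: "idle", runOwnerPod: null };
}

// Assumed shape of the removed claim: compare-and-set on the owner field.
// It succeeds only when nobody (or this same pod) already owns the run.
function claimRunStart(row: RunRow, podId: string): void {
  if (row.runOwnerPod !== null && row.runOwnerPod !== podId) {
    throw new RunClaimError("already running on another pod");
  }
  row.runOwnerPod = podId;
  row.status = "in_progress";
}

// After this PR: RUN_STARTED is a plain status write with no ownership check.
function markRunStarted(row: RunRow): void {
  row.status = "in_progress";
}

// Returns true when the claim is rejected because another pod owns the run.
function isBlocked(row: RunRow, podId: string): boolean {
  try {
    claimRunStart(row, podId);
    return false;
  } catch (e) {
    return e instanceof Error && e.name === "RunClaimError";
  }
}
```

In this model, if pod-a claims a row and then dies without releasing it, every later claim from another pod is rejected, whereas markRunStarted always succeeds and leaves single execution to the queue.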

Fix — only what's needed to let DBOS recover runs

- Remove the run_owner_pod claim. RUN_STARTED is a plain status='in_progress' write; RunClaimError, claimRunStart, and orphanRunsByPod are deleted. The DBOS thread-gate queue is the sole single-execution guarantee.
- Make the dispatch step retriable (drop retriesAllowed: false) so DBOS re-runs the run on another executor. Safe because stopAll() aborts the in-flight run and cancels the daemon (down-channel cancel frame) on shutdown, and the daemon fences stale epochs by fenceToken.
- stopAll() no longer writes the DB: it only aborts controllers and clears in-memory state.
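
The safety argument for retrying rests on two mechanisms: shutdown aborts in-flight runs without touching the database, and the daemon rejects stale epochs. A rough sketch of both follows. Only stopAll and fenceToken are names from this PR; the RunRegistry and DaemonFence classes, their method shapes, and the assumption that fenceToken is a monotonically increasing number are hypothetical.

```typescript
// Hypothetical registry: one AbortController per in-flight run, keyed by threadId.
class RunRegistry {
  private controllers = new Map<string, AbortController>();

  // Register a run and hand back the signal its dispatch should observe.
  start(threadId: string): AbortSignal {
    const controller = new AbortController();
    this.controllers.set(threadId, controller);
    return controller.signal;
  }

  // Abort every in-flight run and drop the in-memory state. No database write.
  // Returns how many runs were aborted.
  stopAll(): number {
    const count = this.controllers.size;
    for (const controller of this.controllers.values()) {
      controller.abort();
    }
    this.controllers.clear();
    return count;
  }
}

// Hypothetical daemon-side fence. Assumes fenceToken increases with each epoch,
// so any token lower than the highest one seen belongs to a stale dispatcher.
class DaemonFence {
  private highestSeen = 0;

  accept(fenceToken: number): boolean {
    if (fenceToken < this.highestSeen) {
      return false; // stale epoch: reject
    }
    this.highestSeen = fenceToken;
    return true;
  }
}
```

Under these assumptions, a dispatcher that survived an abort and tries to act with an old token is turned away by the daemon, which is what makes a second dispatch from another executor tolerable.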

No custom recovery logic. Recovery is DBOS's job; this PR just stops blocking it.

Why not a custom re-enqueue on shutdown (removed)

An earlier revision re-enqueued workflows in the shutdown hook. Removed — redundant with DBOS and it didn't work: an empirical probe (real DBOS + Postgres) showed DBOS.shutdown() drains in-flight workflows (a sleeping step ran to SUCCESS during shutdown), so there are no PENDING workflows for a hook to re-enqueue. The probe also confirmed the underlying primitive is sound: a workflow orphaned by a crashed executor, once visible to a live executor, re-runs its retriable step and completes — which is what DBOS recovery drives.

Testing

- bun run check clean; run-registry unit tests 19/19. Obsolete claim/orphan test cases removed.
- Validate in staging: kill a pod mid-run and confirm DBOS recovers it on another executor. (DBOS startup recovery is per-executor + version-scoped, and executor IDs are random in Conductor mode, so cross-pod recovery leans on the DBOS Conductor. If that proves unreliable in prod, it is a DBOS/Conductor config issue, not something to fix with custom mesh logic.)

Follow-up

Drop the now-vestigial run_owner_pod column (migration).

…ndant run-owner claim

A pod dying mid-run (rollout or crash) left decopilot runs stranded: the thread-gate DBOS workflow stayed PENDING (version-gated SDK recovery never re-homes it across a rollout), and a separate mesh-level run-owner claim (run_owner_pod CAS) blocked any re-dispatch with RunClaimError. The claim was a second single-execution guard redundant with the thread-gate queue's concurrency=1 per threadId, and the only thing DBOS couldn't see or reconcile.

- Remove the run_owner_pod claim: RUN_STARTED is now a plain status write, no CAS; delete RunClaimError, claimRunStart, and orphanRunsByPod. The DBOS thread-gate queue is the sole single-execution guarantee.
- Make the dispatchRunAndWait step retriable (was retriesAllowed:false) so a re-enqueued run re-executes. Safe now: graceful shutdown aborts the run and cancels the daemon (down-channel cancel frame) before re-enqueue, and the daemon fences stale epochs by fenceToken.
- On graceful shutdown, requeueInflightThreadGateWorkflows() flips this executor's PENDING thread-gate workflows to ENQUEUED and clears executor_id + application_version, so any live executor (including a new rollout version) re-dequeues them. The queue dequeue allows application_version IS NULL, which the SDK's PENDING-recovery path does not. Runs after DBOS.shutdown() leaves them PENDING, before the pg pool closes.
- stopAll() no longer writes the DB (no claim to release); it aborts controllers (cancelling daemons) and clears in-memory state.

Tests: removed the obsolete claim-CAS / orphanRunsByPod cases; run-registry unit tests pass.

The run_owner_pod column is left in place (now always null) for a follow-up drop migration.

…ion=NULL)

Nulling application_version would let a different-version executor re-dequeue and resume the workflow by replaying its step journal against changed code. Keep the version so only a matching-version executor picks it up.
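
The version concern can be illustrated with a small predicate. This models only the dequeue rule as the commit messages in this PR describe it (a queued workflow is eligible when its application_version is NULL or matches the executor's version). The QueuedWorkflow type and both function names are hypothetical, and this is not DBOS source code.

```typescript
// Hypothetical view of a queued workflow row; only the version field matters here.
type QueuedWorkflow = {
  workflowId: string;
  applicationVersion: string | null;
};

// Dequeue rule as described in this PR: NULL version or an exact match is eligible.
function canDequeue(wf: QueuedWorkflow, executorVersion: string): boolean {
  return wf.applicationVersion === null || wf.applicationVersion === executorVersion;
}

// The rejected variant: clearing the version makes every executor eligible,
// including one running changed code against the old step journal.
function clearVersion(wf: QueuedWorkflow): QueuedWorkflow {
  return { ...wf, applicationVersion: null };
}
```

Keeping the version means canDequeue stays false for a mismatched executor, which is the behavior this commit restores.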

The graceful-shutdown re-enqueue was redundant with DBOS and didn't even work: DBOS.shutdown() drains in-flight workflows (verified: a sleeping step is run to SUCCESS during shutdown), so there are no PENDING workflows for the hook to re-enqueue. Recovery of an interrupted run is DBOS's responsibility. This PR now does only what's needed to UNBLOCK DBOS's own recovery:

- remove the redundant run_owner_pod mesh claim (was the source of RunClaimError)
- make the dispatchRunAndWait step retriable (was retriesAllowed:false, which made a recovered workflow error instead of re-running)

Removed requeueInflightThreadGateWorkflows + its shutdown wiring + export.

…p runs

retriesAllowed must be conditional: a user-desktop run dispatches to a laptop daemon that keeps running after pod death, so a DBOS replay would race a second concurrent dispatch against the same workdir (the original v1 corruption hazard). The graceful abort that would stop the daemon doesn't run on a hard crash. Keep user-desktop non-retriable; hosted/in-process runs (no external daemon) stay retriable so DBOS recovers them.
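
A minimal sketch of that condition follows. The RunTarget union and both function names are hypothetical; only retriesAllowed, dispatchRunAndWait, and the user-desktop versus hosted/in-process distinction come from this PR, and how the resulting config is wired into the DBOS step is not shown here.

```typescript
// Hypothetical classification of where a run executes.
type RunTarget = "user-desktop" | "hosted" | "in-process";

// A user-desktop daemon outlives the pod, so replaying its dispatch could start
// a second concurrent run in the same workdir. Everything else is safe to retry.
function isDispatchRetriable(target: RunTarget): boolean {
  return target !== "user-desktop";
}

// Assumed shape of the per-run step options derived from the target.
function dispatchStepConfig(target: RunTarget): { name: string; retriesAllowed: boolean } {
  return {
    name: "dispatchRunAndWait",
    retriesAllowed: isDispatchRetriable(target),
  };
}
```

Deriving the flag from the target keeps the hazard in one place: a new run target would have to be classified explicitly before it could opt in to replay.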

Updated the condition for the `retriable` variable to improve readability by breaking it into multiple lines. This change enhances code maintainability without altering functionality.

PR: #3917 fix(decopilot): unblock DBOS recovery — drop run-owner claim, make dispatch step retriable
Bump type: patch
- decocms (apps/mesh/package.json): 3.18.9 -> 3.18.10

origin/main #3917 removed the getPodId import + POD_ID const from app.ts; the clean auto-merge applied that removal while keeping the projector/heartbeat getPodId() call sites, leaving the name undeclared (TS2304 in the CI test-merge). Re-add the module-level import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

pedrofrxncx enabled auto-merge (squash) June 15, 2026 02:26

pedrofrxncx disabled auto-merge June 15, 2026 02:28

pedrofrxncx added 2 commits June 14, 2026 23:30

pedrofrxncx changed the title ~~fix(decopilot): recover in-flight runs via DBOS re-enqueue; drop redundant run-owner claim~~ fix(decopilot): unblock DBOS recovery — drop run-owner claim, make dispatch step retriable Jun 15, 2026

pedrofrxncx added 3 commits June 14, 2026 23:52

fix: restore try{ block opener (prior commit dropped it)

119a747

fix(thread-gate): format retriable condition for clarity

79d179a

pedrofrxncx enabled auto-merge (squash) June 15, 2026 02:57

pedrofrxncx merged commit e9b3a46 into main Jun 15, 2026
15 checks passed

pedrofrxncx deleted the fix/decopilot-recovery-reenqueue branch June 15, 2026 03:02

pedrofrxncx merged 6 commits into main from fix/decopilot-recovery-reenqueue
