fix(decopilot): unblock DBOS recovery — drop run-owner claim, make dispatch step retriable#3917

Merged: pedrofrxncx merged 6 commits into main from fix/decopilot-recovery-reenqueue on Jun 15, 2026
@pedrofrxncx pedrofrxncx commented Jun 15, 2026

Collaborator

Problem

A pod dying mid-run left decopilot runs stranded and threw RunClaimError: already running on another pod on retry. Two things in mesh were blocking DBOS's own workflow recovery:

  1. run_owner_pod mesh claim — a second single-execution guard (CAS in claimRunStart) layered on top of the DBOS thread-gate queue's concurrency=1 per threadId. Redundant, and the one layer DBOS can't see/reconcile, so a dead pod's stale run_owner_pod blocked any re-dispatch.
  2. retriesAllowed: false on the dispatchRunAndWait step — so a workflow DBOS did recover would refuse to re-run the step and just error.
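
To make the first blocker concrete, here is a minimal in-memory sketch of a compare-and-set ownership claim. `RunRowSketch`, `claimRunStartSketch`, and the pod IDs are hypothetical stand-ins; the real `claimRunStart` is not shown in this PR, so treat this as an illustration of the failure mode rather than the removed code.

```typescript
// Hypothetical model of a run row; only the fields relevant to the claim.
interface RunRowSketch {
  status: "idle" | "in_progress";
  runOwnerPod: string | null;
}

// Compare-and-set: the claim succeeds only if no other pod owns the run.
function claimRunStartSketch(row: RunRowSketch, podId: string): void {
  if (row.runOwnerPod !== null && row.runOwnerPod !== podId) {
    throw new Error("RunClaimError: already running on another pod");
  }
  row.runOwnerPod = podId;
  row.status = "in_progress";
}

const row: RunRowSketch = { status: "idle", runOwnerPod: null };

// pod-a claims the run, then dies without ever releasing the claim.
claimRunStartSketch(row, "pod-a");

// A recovered workflow on pod-b tries to re-dispatch the same run.
let blockedByStaleClaim = false;
try {
  claimRunStartSketch(row, "pod-b");
} catch (e) {
  blockedByStaleClaim = e instanceof Error;
}
```

Nothing in this model ever clears `runOwnerPod` for a dead pod, which is why the stale claim outlives the pod that wrote it.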

Fix — only what's needed to let DBOS recover runs

  • Remove the run_owner_pod claim. RUN_STARTED is a plain status='in_progress' write; deleted RunClaimError, claimRunStart, orphanRunsByPod. The DBOS thread-gate queue is the sole single-execution guarantee.
  • Make the dispatch step retriable (drop retriesAllowed: false) so DBOS re-runs the run on another executor. Safe because stopAll() aborts the in-flight run and cancels the daemon (down-channel cancel frame) on shutdown, and the daemon fences stale epochs by fenceToken.
  • stopAll() no longer writes to the DB; it only aborts controllers and clears in-memory state.
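
The safety argument for retrying rests on the daemon fencing stale epochs by fenceToken. The sketch below shows one way such fencing could work, assuming the token is a monotonically increasing integer; that assumption, along with the `FencedDaemonSketch` class and its `handle` method, is hypothetical and not taken from the mesh code.

```typescript
// Hypothetical daemon-side guard: frames from an older dispatch epoch are ignored.
class FencedDaemonSketch {
  private highestToken = 0;
  readonly applied: string[] = [];

  // Returns true if the frame was accepted, false if it came from a stale epoch.
  handle(fenceToken: number, action: string): boolean {
    if (fenceToken < this.highestToken) {
      return false; // stale epoch: a newer dispatch has already taken over
    }
    this.highestToken = fenceToken;
    this.applied.push(action);
    return true;
  }
}

const daemon = new FencedDaemonSketch();
const firstAccepted = daemon.handle(1, "dispatch from original executor");
const retryAccepted = daemon.handle(2, "re-dispatch after recovery");
const staleAccepted = daemon.handle(1, "late frame from original executor");
```

Under this model a retried dispatch with a higher token supersedes the original, so a late frame from the dead executor's epoch cannot act on the run.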

No custom recovery logic. Recovery is DBOS's job; this PR just stops blocking it.

Why not a custom re-enqueue on shutdown (removed)

An earlier revision re-enqueued workflows in the shutdown hook. Removed — redundant with DBOS and it didn't work: an empirical probe (real DBOS + Postgres) showed DBOS.shutdown() drains in-flight workflows (a sleeping step ran to SUCCESS during shutdown), so there are no PENDING workflows for a hook to re-enqueue. The probe also confirmed the underlying primitive is sound: a workflow orphaned by a crashed executor, once visible to a live executor, re-runs its retriable step and completes — which is what DBOS recovery drives.

Testing

  • bun run check clean; run-registry unit tests 19/19. Obsolete claim/orphan test cases removed.
  • Validate in staging: kill a pod mid-run and confirm DBOS recovers it on another executor. (DBOS startup recovery is per-executor + version-scoped, and executor IDs are random in Conductor mode, so cross-pod recovery leans on the DBOS Conductor — if unreliable in prod that's a DBOS/Conductor config issue, not something to fix with custom mesh logic.)

Follow-up

  • Drop the now-vestigial run_owner_pod column (migration).

…ndant run-owner claim

A pod dying mid-run (rollout or crash) left decopilot runs stranded: the
thread-gate DBOS workflow stayed PENDING (version-gated SDK recovery never
re-homes it across a rollout), and a separate mesh-level run-owner claim
(run_owner_pod CAS) blocked any re-dispatch with RunClaimError. The claim was a
second single-execution guard redundant with the thread-gate queue's
concurrency=1 per threadId — and the only thing DBOS couldn't see or reconcile.

- Remove the run_owner_pod claim: RUN_STARTED is now a plain status write, no
  CAS; delete RunClaimError, claimRunStart, and orphanRunsByPod. The DBOS
  thread-gate queue is the sole single-execution guarantee.
- Make the dispatchRunAndWait step retriable (was retriesAllowed:false) so a
  re-enqueued run re-executes. Safe now: graceful shutdown aborts the run and
  cancels the daemon (down-channel cancel frame) before re-enqueue, and the
  daemon fences stale epochs by fenceToken.
- On graceful shutdown, requeueInflightThreadGateWorkflows() flips this
  executor's PENDING thread-gate workflows to ENQUEUED and clears executor_id +
  application_version, so any live executor (including a new rollout version)
  re-dequeues them — the queue dequeue allows application_version IS NULL, which
  the SDK's PENDING-recovery path does not. Runs after DBOS.shutdown() leaves
  them PENDING, before the pg pool closes.
- stopAll() no longer writes the DB (no claim to release); it aborts controllers
  (cancelling daemons) and clears in-memory state.
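
As a rough illustration of the last point, a `stopAll()` that only aborts controllers and clears in-memory state could look like the sketch below. `RunRegistrySketch` and its methods are hypothetical; the real run-registry is not shown in this PR.

```typescript
// Hypothetical in-memory registry of in-flight runs, keyed by thread ID.
class RunRegistrySketch {
  private controllers = new Map<string, AbortController>();

  // Register a run and hand back the signal its work should observe.
  start(threadId: string): AbortSignal {
    const controller = new AbortController();
    this.controllers.set(threadId, controller);
    return controller.signal;
  }

  // Abort every in-flight run and forget it. No database write happens here.
  stopAll(): number {
    const count = this.controllers.size;
    for (const controller of this.controllers.values()) {
      controller.abort();
    }
    this.controllers.clear();
    return count;
  }

  get size(): number {
    return this.controllers.size;
  }
}

const registry = new RunRegistrySketch();
const signalA = registry.start("thread-a");
const signalB = registry.start("thread-b");
const stoppedCount = registry.stopAll();
```

Anything listening on the returned signals (for example, code that sends the cancel frame to a daemon) sees the abort, while persisted run status is left for DBOS recovery to reconcile.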

Tests: removed the obsolete claim-CAS / orphanRunsByPod cases; run-registry unit
tests pass. The run_owner_pod column is left in place (now always null) for a
follow-up drop migration.
@pedrofrxncx pedrofrxncx enabled auto-merge (squash) June 15, 2026 02:26
@pedrofrxncx pedrofrxncx disabled auto-merge June 15, 2026 02:28
…ion=NULL)

Nulling application_version would let a different-version executor re-dequeue and
resume the workflow by replaying its step journal against changed code. Keep the
version so only a matching-version executor picks it up.
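
The version gate described here can be sketched as a simple predicate. The rule (dequeue when the workflow's version matches the executor's, or is NULL) is taken from the earlier commit message in this PR, not checked against the DBOS SDK; `QueuedWorkflowSketch` and `canDequeueSketch` are hypothetical names.

```typescript
// Hypothetical view of a queued workflow row.
interface QueuedWorkflowSketch {
  workflowId: string;
  applicationVersion: string | null;
}

// Assumed dequeue rule: same version, or no version recorded at all.
function canDequeueSketch(wf: QueuedWorkflowSketch, executorVersion: string): boolean {
  return wf.applicationVersion === null || wf.applicationVersion === executorVersion;
}

const pinned: QueuedWorkflowSketch = { workflowId: "wf-1", applicationVersion: "v1" };
const unpinned: QueuedWorkflowSketch = { workflowId: "wf-1", applicationVersion: null };

const pinnedOnSameVersion = canDequeueSketch(pinned, "v1");
const pinnedOnNewVersion = canDequeueSketch(pinned, "v2");
const unpinnedOnNewVersion = canDequeueSketch(unpinned, "v2");
```

With the version kept, only a matching executor picks the workflow up; with it nulled, a newer executor would also qualify and would replay the step journal against changed code.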

The graceful-shutdown re-enqueue was redundant with DBOS and didn't even work:
DBOS.shutdown() drains in-flight workflows (verified — a sleeping step is run to
SUCCESS during shutdown), so there are no PENDING workflows for the hook to
re-enqueue. Recovery of an interrupted run is DBOS's responsibility.

This PR now does only what's needed to UNBLOCK DBOS's own recovery:
- remove the redundant run_owner_pod mesh claim (was the source of RunClaimError)
- make the dispatchRunAndWait step retriable (was retriesAllowed:false, which
  made a recovered workflow error instead of re-running)

Removed requeueInflightThreadGateWorkflows + its shutdown wiring + export.
@pedrofrxncx pedrofrxncx changed the title fix(decopilot): recover in-flight runs via DBOS re-enqueue; drop redundant run-owner claim fix(decopilot): unblock DBOS recovery — drop run-owner claim, make dispatch step retriable Jun 15, 2026
…p runs

retriesAllowed must be conditional: a user-desktop run dispatches to a laptop
daemon that keeps running after pod death, so a DBOS replay would race a second
concurrent dispatch against the same workdir (the original v1 corruption hazard).
The graceful abort that would stop the daemon doesn't run on a hard crash. Keep
user-desktop non-retriable; hosted/in-process runs (no external daemon) stay
retriable so DBOS recovers them.
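
A minimal sketch of that condition follows. The `RunTargetSketch` type, the helper names, and the shape of the options object are assumptions for illustration; only the `user-desktop` target, the `dispatchRunAndWait` step name, and the `retriesAllowed` flag come from this PR.

```typescript
// Hypothetical run targets; the PR names "user-desktop" and hosted/in-process runs.
type RunTargetSketch = "user-desktop" | "hosted" | "in-process";

// Whether the dispatch step may be re-run when a workflow is recovered.
function isDispatchRetriableSketch(target: RunTargetSketch): boolean {
  // A user-desktop daemon outlives the pod, so a replay could start a second
  // concurrent dispatch against the same workdir.
  return target !== "user-desktop";
}

// Assumed shape of the step options passed when registering the step.
function dispatchStepOptionsSketch(target: RunTargetSketch): {
  name: string;
  retriesAllowed: boolean;
} {
  return {
    name: "dispatchRunAndWait",
    retriesAllowed: isDispatchRetriableSketch(target),
  };
}

const desktopOptions = dispatchStepOptionsSketch("user-desktop");
const hostedOptions = dispatchStepOptionsSketch("hosted");
```

Keeping the decision in one small predicate makes the hazard explicit at the call site: runs with an external daemon opt out of replay, everything else stays recoverable.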

Split the `retriable` condition across multiple lines for readability. No functional change.
@pedrofrxncx pedrofrxncx enabled auto-merge (squash) June 15, 2026 02:57
@pedrofrxncx pedrofrxncx merged commit e9b3a46 into main Jun 15, 2026
15 checks passed
@pedrofrxncx pedrofrxncx deleted the fix/decopilot-recovery-reenqueue branch June 15, 2026 03:02
decocms Bot pushed a commit that referenced this pull request Jun 15, 2026
PR: #3917 fix(decopilot): unblock DBOS recovery — drop run-owner claim, make dispatch step retriable
Bump type: patch

- decocms (apps/mesh/package.json): 3.18.9 -> 3.18.10
tlgimenes added a commit that referenced this pull request Jun 15, 2026
origin/main #3917 removed the getPodId import + POD_ID const from app.ts;
the clean auto-merge applied that removal while keeping the projector/
heartbeat getPodId() call sites, leaving the name undeclared (TS2304 in
the CI test-merge). Re-add the module-level import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>