Summary
In production, we hit a webhook-triggered workflow execution that remained stuck in `workflow_execution_logs.status = 'running'` while the related async job stayed in `async_jobs.status = 'processing'`. The workflow had clearly started and some internal requests were executed, but the run never reached a terminal state, and there was not enough persisted evidence to determine where it got stuck.
What we observed
- Webhook request returned `200` successfully.
- Related `workflow_execution_logs` row remained:
  - `status = 'running'`
  - `ended_at = NULL`
  - no final trace/failure envelope persisted
- Related `async_jobs` row remained:
  - `status = 'processing'`
  - no `completed_at`
  - no error persisted
- There was no paused execution row, so this was not a wait/human-in-the-loop pause case.
- Render logs showed the ingress webhook request plus multiple internal Bun requests (function/tool/memory/custom-tool calls), so the execution had definitely progressed beyond ingress.
- But request logs alone were not enough to reconstruct the last successful block or failure point.
Likely root causes
Based on code-path analysis, there seem to be two main races:
- Detached local async execution after enqueue
  - For non-Trigger.dev backends, jobs are enqueued and then still executed in-process via a detached async path.
  - If the process is interrupted/recycled, jobs can be left in `processing`.
- Detached execution-log finalization
  - Execution completion/error logging appears to be finalized in a fire-and-forget path.
  - That allows `workflow_execution_logs` to remain `running` if the process dies before final persistence.
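The second race can be made concrete with a minimal sketch (names are hypothetical and persistence is reduced to an in-memory record): whether the terminal write is awaited decides whether a process recycle can strand the row in `running`.

```typescript
type ExecStatus = "running" | "completed" | "failed";

// Stand-in for the workflow_execution_logs row; a real implementation writes to the DB.
const log: { status: ExecStatus; endedAt: Date | null } = { status: "running", endedAt: null };

async function persistFinalState(): Promise<void> {
  // stands in for the DB write that records the terminal state
  log.status = "completed";
  log.endedAt = new Date();
}

// Fire-and-forget: the caller returns before persistence is durable.
// If the process is recycled in that window, the row stays 'running' forever.
function finalizeDetached(): void {
  void persistFinalState(); // not awaited -> races with process shutdown
}

// Awaited: the caller cannot report success until the terminal state is persisted.
async function finalizeAwaited(): Promise<void> {
  await persistFinalState();
}
```

The same split applies to the job row: marking `async_jobs` complete before the log write is durable reproduces the observed mismatch.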
Why this is painful operationally
When this happens, operators cannot reliably answer:
- which webhook request maps to which job and execution
- whether the execution never really started vs started and stalled
- which block started last
- which block completed last
- whether execution completed but log finalization failed
Suggestions
Immediate reliability fixes
- Replace raw detached local execution in async routes with a route-safe post-response mechanism.
- Await execution finalization/log completion before marking jobs complete.
- Apply the same reliability fix consistently across webhook/workflow/schedule async entry points.
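One route-safe shape for the first fix, sketched under the assumption that nothing currently tracks in-flight background work (the registry and function names are invented for illustration): background tasks are registered so a graceful-shutdown hook can drain them before the process exits.

```typescript
// Registry of in-flight post-response work.
const inFlight = new Set<Promise<unknown>>();

// Run a task after the response, but keep a handle so shutdown can wait for it.
function runAfterResponse<T>(task: () => Promise<T>): Promise<T> {
  const p = task().finally(() => inFlight.delete(p));
  inFlight.add(p);
  return p;
}

// Called from a SIGTERM/shutdown handler: wait for all tracked work to settle.
async function drain(): Promise<void> {
  await Promise.allSettled([...inFlight]);
}
```

With this in place, the webhook/workflow/schedule entry points would all route their post-response execution through `runAfterResponse` instead of a raw detached promise.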
Forensic observability fixes
- Persist additive execution progress markers in `workflow_execution_logs.execution_data`, especially:
  - `lastStartedBlock`
  - `lastCompletedBlock`
- Preserve start-time execution metadata instead of overwriting it on completion.
- Persist a structured completion/fallback status so `log-completion-failed` is detectable without string matching.
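A possible shape for the additive markers, assuming `execution_data` is a JSON object (the field layout here is a suggestion, not the current schema). The key property is that markers are merged in, so start-time metadata is never overwritten:

```typescript
// Hypothetical shape for progress markers inside execution_data.
interface ExecutionData {
  lastStartedBlock?: { id: string; startedAt: string };
  lastCompletedBlock?: { id: string; endedAt: string };
  [key: string]: unknown; // start-time metadata and anything else already stored
}

function markBlockStarted(data: ExecutionData, blockId: string): ExecutionData {
  // additive merge: unrelated keys (e.g. start-time metadata) are preserved
  return { ...data, lastStartedBlock: { id: blockId, startedAt: new Date().toISOString() } };
}

function markBlockCompleted(data: ExecutionData, blockId: string): ExecutionData {
  return { ...data, lastCompletedBlock: { id: blockId, endedAt: new Date().toISOString() } };
}
```

Each marker write would be persisted as the block starts/finishes, so even an abandoned run leaves a usable trail.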
Correlation-chain fixes
- Carry a stable chain across ingress → queue → execution log:
  - `ingressRequestId`
  - `jobId`
  - `executionId`
- Expose `jobId -> executionId` in job-status reads.
- Make it possible to traverse: `requestId -> jobId -> executionId -> execution log`
Cleanup / operator improvements
- Reconcile stale async jobs and stale execution logs together instead of treating them as unrelated cleanup domains.
- Distinguish cleanup classifications such as:
- never started
- started but no block started
- stalled in block
- stalled after block
- log completion failed
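Given the progress markers proposed above, the classification could be a pure function over the stale row (a sketch with an assumed input shape, not the real schema):

```typescript
type StaleClass =
  | "never-started"
  | "started-no-block"
  | "stalled-in-block"
  | "stalled-after-block"
  | "log-completion-failed";

function classifyStale(row: {
  startedAt: Date | null;
  lastStartedBlock: string | null;
  lastCompletedBlock: string | null;
  executionFinished: boolean; // evidence the run finished but the log write did not
}): StaleClass {
  if (!row.startedAt) return "never-started";
  if (row.executionFinished) return "log-completion-failed";
  if (!row.lastStartedBlock) return "started-no-block";
  if (row.lastStartedBlock !== row.lastCompletedBlock) return "stalled-in-block";
  return "stalled-after-block";
}
```

Running this during reconciliation would let cleanup report *why* each row went stale, not just that it did.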
Longer-term architecture
- Consider moving Redis/database backends toward a true worker/claim model rather than request-owned execution after enqueue.
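The core of a claim model is that workers take a lease on a job rather than the enqueueing request owning it, so a dead worker's lease simply expires and the job becomes reclaimable. A sketch over an in-memory job list (a real backend would claim atomically, e.g. a conditional `UPDATE ... RETURNING` in Postgres):

```typescript
interface Job {
  id: string;
  status: "pending" | "processing" | "done";
  claimedBy?: string;
  claimExpiresAt?: number; // epoch ms; lease expiry
}

// Claim the next pending job, or reclaim one whose worker's lease has expired.
function claimNext(jobs: Job[], workerId: string, now: number, leaseMs: number): Job | undefined {
  const job = jobs.find(
    (j) => j.status === "pending" ||
           (j.status === "processing" && (j.claimExpiresAt ?? 0) < now)
  );
  if (!job) return undefined;
  job.status = "processing";
  job.claimedBy = workerId;
  job.claimExpiresAt = now + leaseMs;
  return job;
}
```

Under this model a process recycle no longer strands jobs in `processing`: the lease expires and another worker picks the job up.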
Why I’m filing this
This issue is not just about a single stuck run; it exposed a broader gap where:
- async execution can be abandoned
- terminal log state can be lost
- postmortem correlation is weak
If useful, I can also open a follow-up issue/PR proposal with a phased implementation plan for:
- reliability fixes,
- forensic observability,
- correlation chain,
- stale cleanup classification,
- longer-term worker model.