Summary
In production, we hit a webhook-triggered workflow execution that remained stuck in `workflow_execution_logs.status = 'running'` while the related async job stayed in `async_jobs.status = 'processing'`. The workflow had clearly started and some internal requests were executed, but the run never reached a terminal state, and there was not enough persisted evidence to determine where it got stuck.
What we observed
- Webhook request returned `200` successfully.
- Related `workflow_execution_logs` row remained:
  - `status = 'running'`
  - `ended_at = NULL`
  - no final trace/failure envelope persisted
- Related `async_jobs` row remained:
  - `status = 'processing'`
  - no `completed_at`
  - no error persisted
- There was no paused execution row, so this was not a wait/human-in-the-loop pause case.
- Render logs showed the ingress webhook request plus multiple internal Bun requests (function/tool/memory/custom-tool calls), so the execution had definitely progressed beyond ingress.
- But request logs alone were not enough to reconstruct the last successful block or failure point.
Likely root causes
Based on code-path analysis, there seem to be two main races:
- Detached local async execution after enqueue
  - For non-Trigger.dev backends, jobs are enqueued and then still executed in-process via a detached async path.
  - If the process is interrupted/recycled, jobs can be left in `processing`.
- Detached execution-log finalization
  - Execution completion/error logging appears to be finalized in a fire-and-forget path.
  - That allows `workflow_execution_logs` to remain `running` if the process dies before final persistence.
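The second race can be made concrete with a minimal sketch (names are hypothetical and persistence is reduced to an in-memory record): whether the terminal write is awaited decides whether a process recycle can strand the row in `running`.

```typescript
type ExecStatus = "running" | "completed" | "failed";

// Stand-in for the workflow_execution_logs row; a real implementation writes to the DB.
const log: { status: ExecStatus; endedAt: Date | null } = { status: "running", endedAt: null };

async function persistFinalState(): Promise<void> {
  // stands in for the DB write that records the terminal state
  log.status = "completed";
  log.endedAt = new Date();
}

// Fire-and-forget: the caller returns before persistence is durable.
// If the process is recycled in that window, the row stays 'running' forever.
function finalizeDetached(): void {
  void persistFinalState(); // not awaited -> races with process shutdown
}

// Awaited: the caller cannot report success until the terminal state is persisted.
async function finalizeAwaited(): Promise<void> {
  await persistFinalState();
}
```

The same split applies to the job row: marking `async_jobs` complete before the log write is durable reproduces the observed mismatch.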
Why this is painful operationally
When this happens, operators cannot reliably answer:
- which webhook request maps to which job and execution
- whether the execution never really started vs started and stalled
- which block started last
- which block completed last
- whether execution completed but log finalization failed
Suggestions
Immediate reliability fixes
- Replace raw detached local execution in async routes with a route-safe post-response mechanism.
- Await execution finalization/log completion before marking jobs complete.
- Apply the same reliability fix consistently across webhook/workflow/schedule async entry points.
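One route-safe shape for the first fix, sketched under the assumption that nothing currently tracks in-flight background work (the registry and function names are invented for illustration): background tasks are registered so a graceful-shutdown hook can drain them before the process exits.

```typescript
// Registry of in-flight post-response work.
const inFlight = new Set<Promise<unknown>>();

// Run a task after the response, but keep a handle so shutdown can wait for it.
function runAfterResponse<T>(task: () => Promise<T>): Promise<T> {
  const p = task().finally(() => inFlight.delete(p));
  inFlight.add(p);
  return p;
}

// Called from a SIGTERM/shutdown handler: wait for all tracked work to settle.
async function drain(): Promise<void> {
  await Promise.allSettled([...inFlight]);
}
```

With this in place, the webhook/workflow/schedule entry points would all route their post-response execution through `runAfterResponse` instead of a raw detached promise.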
Forensic observability fixes
- Persist additive execution progress markers in `workflow_execution_logs.execution_data`, especially:
  - `lastStartedBlock`
  - `lastCompletedBlock`
- Preserve start-time execution metadata instead of overwriting it on completion.
- Persist a structured completion/fallback status so `log-completion-failed` is detectable without string matching.
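A possible shape for the additive markers, assuming `execution_data` is a JSON object (the field layout here is a suggestion, not the current schema). The key property is that markers are merged in, so start-time metadata is never overwritten:

```typescript
// Hypothetical shape for progress markers inside execution_data.
interface ExecutionData {
  lastStartedBlock?: { id: string; startedAt: string };
  lastCompletedBlock?: { id: string; endedAt: string };
  [key: string]: unknown; // start-time metadata and anything else already stored
}

function markBlockStarted(data: ExecutionData, blockId: string): ExecutionData {
  // additive merge: unrelated keys (e.g. start-time metadata) are preserved
  return { ...data, lastStartedBlock: { id: blockId, startedAt: new Date().toISOString() } };
}

function markBlockCompleted(data: ExecutionData, blockId: string): ExecutionData {
  return { ...data, lastCompletedBlock: { id: blockId, endedAt: new Date().toISOString() } };
}
```

Each marker write would be persisted as the block starts/finishes, so even an abandoned run leaves a usable trail.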
Correlation-chain fixes
- Carry a stable chain across ingress → queue → execution log:
  - `ingressRequestId`
  - `jobId`
  - `executionId`
- Expose `jobId -> executionId` in job-status reads.
- Make it possible to traverse: `requestId -> jobId -> executionId -> execution log`
Cleanup / operator improvements
- Reconcile stale async jobs and stale execution logs together instead of treating them as unrelated cleanup domains.
- Distinguish cleanup classifications such as:
- never started
- started but no block started
- stalled in block
- stalled after block
- log completion failed
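Given the progress markers proposed above, the classification could be a pure function over the stale row (a sketch with an assumed input shape, not the real schema):

```typescript
type StaleClass =
  | "never-started"
  | "started-no-block"
  | "stalled-in-block"
  | "stalled-after-block"
  | "log-completion-failed";

function classifyStale(row: {
  startedAt: Date | null;
  lastStartedBlock: string | null;
  lastCompletedBlock: string | null;
  executionFinished: boolean; // evidence the run finished but the log write did not
}): StaleClass {
  if (!row.startedAt) return "never-started";
  if (row.executionFinished) return "log-completion-failed";
  if (!row.lastStartedBlock) return "started-no-block";
  if (row.lastStartedBlock !== row.lastCompletedBlock) return "stalled-in-block";
  return "stalled-after-block";
}
```

Running this during reconciliation would let cleanup report *why* each row went stale, not just that it did.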
Longer-term architecture
- Consider moving Redis/database backends toward a true worker/claim model rather than request-owned execution after enqueue.
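The core of a claim model is that workers take a lease on a job rather than the enqueueing request owning it, so a dead worker's lease simply expires and the job becomes reclaimable. A sketch over an in-memory job list (a real backend would claim atomically, e.g. a conditional `UPDATE ... RETURNING` in Postgres):

```typescript
interface Job {
  id: string;
  status: "pending" | "processing" | "done";
  claimedBy?: string;
  claimExpiresAt?: number; // epoch ms; lease expiry
}

// Claim the next pending job, or reclaim one whose worker's lease has expired.
function claimNext(jobs: Job[], workerId: string, now: number, leaseMs: number): Job | undefined {
  const job = jobs.find(
    (j) => j.status === "pending" ||
           (j.status === "processing" && (j.claimExpiresAt ?? 0) < now)
  );
  if (!job) return undefined;
  job.status = "processing";
  job.claimedBy = workerId;
  job.claimExpiresAt = now + leaseMs;
  return job;
}
```

Under this model a process recycle no longer strands jobs in `processing`: the lease expires and another worker picks the job up.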
Why I’m filing this
This issue is not just about a single stuck run; it exposed a broader gap where:
- async execution can be abandoned
- terminal log state can be lost
- postmortem correlation is weak
If useful, I can also open a follow-up issue/PR proposal with a phased implementation plan for:
- reliability fixes,
- forensic observability,
- correlation chain,
- stale cleanup classification,
- longer-term worker model.