[DO NOT MERGE] diagnostic: log replay event log + fail run on replay divergence #2229
TooTallNate wants to merge 7 commits into
Replay divergence ("Replay could not consume event ...") and corrupted
event log errors are a function of which events are present in the log
and in what order. To truly understand why these divergences happen in
production (e.g. CodeRabbit), log the full event log — event identities
and ordering only — immediately before each replay's runWorkflow call.
This lets us reconstruct, after the fact, exactly what the runtime saw
when a divergence/corruption was raised, including whether the event log
handed back by the world was itself incomplete or mis-ordered.
Payloads (eventData) are intentionally omitted: they can be large,
encrypted, or contain user data, and are not needed to diagnose a
branching/ordering divergence.
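As a rough illustration of the after-the-fact reconstruction described above, a captured `eventLog` array could be scanned for the two simplest symptoms of a mis-ordered or corrupted log. This is only a sketch: `findOrderingAnomalies` and the `Anomaly` shape are hypothetical names, the entry fields mirror the ones this PR logs, and treating a backwards `createdAt` as suspicious is an assumption rather than a documented invariant of the runtime.

```typescript
// One entry of the logged `eventLog` array (fields as logged by this PR).
interface LoggedEvent {
  index: number;
  eventId: string;
  eventType: string;
  correlationId?: string;
  createdAt: string; // assumed to be an ISO timestamp string
}

// Hypothetical result shape, used only for this illustration.
interface Anomaly {
  index: number;
  kind: 'duplicate-event-id' | 'created-at-regression';
  detail: string;
}

// Hypothetical helper: flags repeated event ids and timestamps that go backwards.
function findOrderingAnomalies(eventLog: LoggedEvent[]): Anomaly[] {
  const anomalies: Anomaly[] = [];
  const firstSeenAt = new Map<string, number>();
  let latestTime = Number.NEGATIVE_INFINITY;

  for (const event of eventLog) {
    const earlier = firstSeenAt.get(event.eventId);
    if (earlier !== undefined) {
      anomalies.push({
        index: event.index,
        kind: 'duplicate-event-id',
        detail: `${event.eventId} already appeared at index ${earlier}`,
      });
    } else {
      firstSeenAt.set(event.eventId, event.index);
    }

    const time = Date.parse(event.createdAt);
    if (!Number.isNaN(time)) {
      if (time < latestTime) {
        anomalies.push({
          index: event.index,
          kind: 'created-at-regression',
          detail: `${event.eventType} is older than an event that precedes it`,
        });
      }
      latestTime = Math.max(latestTime, time);
    }
  }
  return anomalies;
}
```

An empty result does not prove the log was healthy. It only rules out these two patterns, so the raw log line remains the primary evidence.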
🦋 Changeset detected
Latest commit: f255aec
The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages.
🧪 E2E Test Results
❌ Some tests failed
❌ Failed Tests
▲ Vercel Production (901 failed)
astro (81 failed):
example (81 failed):
express (81 failed):
fastify (81 failed):
hono (81 failed):
nextjs-turbopack (86 failed):
nextjs-webpack (86 failed):
nitro (81 failed):
nuxt (81 failed):
sveltekit (81 failed):
vite (81 failed):
🌍 Community Worlds (91 failed)
mongodb (13 failed):
redis (9 failed):
turso-dev (1 failed):
turso (68 failed):
Details by Category
❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details.
Pull request overview
Adds a diagnostic log emitted immediately before each workflow replay to help investigate REPLAY_DIVERGENCE / corrupted-event-log scenarios by capturing the event log’s identities and ordering (without payloads).
Changes:
- Emit a structured “workflow replay event log” entry (event identities + ordering metadata) before `workflow.replay`/`runWorkflow`.
- Add a patch changeset for `@workflow/core`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/core/src/runtime.ts | Logs replay-time event log metadata just before workflow replay execution. |
| .changeset/log-replay-event-log.md | Declares a patch bump for adding replay event-log diagnostics. |
    runtimeLogger.info('Workflow replay event log', {
      workflowRunId: runId,
      eventCount: events.length,
      eventsCursor,
      eventLog: events.map((event, index) => ({
        index,
        eventId: event.eventId,
        eventType: event.eventType,
        correlationId: event.correlationId,
        createdAt:
          event.createdAt instanceof Date
            ? event.createdAt.toISOString()
            : event.createdAt,
      })),
    });
Event Log Race Repro
✅ No event-log regressions in the latest repro job. 2000 non-gating infra non-completions (2000 HARNESS_ERROR) are reported but do not fail the job.
Event-Log Regressions
None (no gating outcomes in the latest run). ✅
Infra (non-gating)
2000 harness-side non-completions that do not fail the job:
Let's make this opt-in via a DEBUG variable if we want to ship it, unless the idea is to force users into it?
Separately, though non-blocking: there may be parts of the code that extend/refresh the event log with cursor queries. I see you already put it after the potential cursor load following a wait_created, but to cover everything, we should log before and after that.
Maybe easiest to just put the log in `getWorkflowRunEvents`, since it should have all the information.
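For illustration, the opt-in suggested above might look roughly like the following. Everything here is an assumption made for the sketch: the `workflow:replay` namespace, the helper names, and the simplified comma-separated matching of `DEBUG` are not taken from this PR or from any existing package.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical check: true when DEBUG lists the namespace, a prefix wildcard, or "*".
function isReplayLogEnabled(env: Env, namespace = 'workflow:replay'): boolean {
  const raw = env.DEBUG ?? '';
  return raw
    .split(',')
    .map((pattern) => pattern.trim())
    .some(
      (pattern) =>
        pattern === '*' ||
        pattern === namespace ||
        (pattern.endsWith('*') && namespace.startsWith(pattern.slice(0, -1)))
    );
}

// Hypothetical wrapper: the payload is built lazily, so a disabled log
// never pays for mapping the whole event log.
function logReplayEventLog(
  env: Env,
  log: (message: string, data: unknown) => void,
  buildPayload: () => unknown
): boolean {
  if (!isReplayLogEnabled(env)) return false;
  log('Workflow replay event log', buildPayload());
  return true;
}
```

Taking the environment as a parameter keeps the check easy to test. A real call site would presumably pass `process.env`.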
…(diagnostic)
We are not observing any failed runs in CI even after many retries, which means the REPLAY_DIVERGENCE recovery-replay redrive (#2208/#2212) is masking real divergences: a subsequent redrive happens to read a consistent snapshot and succeeds, so we lose the signal needed to understand why the log diverged.
Remove the recovery-replay redrive so the first ReplayDivergenceError fails the run as a terminal CORRUPTED_EVENT_LOG. Combined with the per-replay event log logging, this makes every divergence immediately visible and lets us capture the exact diverging event log for investigation.
This is intentionally aggressive for diagnosis; the recovery budget (REPLAY_DIVERGENCE_MAX_RETRIES) is left defined and can be reinstated once the root cause is understood.
Updates the two runtime tests that asserted the redrive behavior to assert immediate failure.
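The masking effect described in this commit message can be shown with a small model. This is a sketch under stated assumptions, not the code in `runtime.ts`: `runWithDivergencePolicy`, the `Outcome` type, and the local `ReplayDivergenceError` class are hypothetical stand-ins that only borrow names from the message above.

```typescript
// Local stand-in for the runtime's divergence error (hypothetical).
class ReplayDivergenceError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'ReplayDivergenceError';
  }
}

type Outcome =
  | { status: 'completed'; attempts: number }
  | { status: 'failed'; errorCode: 'CORRUPTED_EVENT_LOG'; attempts: number };

// Hypothetical policy: retry a diverging replay up to maxRetries times.
// maxRetries = 0 models the fail-fast behavior of this branch.
function runWithDivergencePolicy(replay: () => void, maxRetries: number): Outcome {
  let attempts = 0;
  for (;;) {
    attempts++;
    try {
      replay();
      return { status: 'completed', attempts };
    } catch (err) {
      const isDivergence = err instanceof Error && err.name === 'ReplayDivergenceError';
      if (!isDivergence) throw err; // unrelated errors are not handled here
      if (attempts > maxRetries) {
        return { status: 'failed', errorCode: 'CORRUPTED_EVENT_LOG', attempts };
      }
    }
  }
}
```

With a replay that diverges once and then reads a consistent snapshot, a budget of 1 reports a completed run and the divergence is never surfaced, while a budget of 0 turns the same divergence into a visible terminal failure.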
…alidation
Manually ports the client-side optimistic-concurrency fence from peter/sdk-event-write-cas (#2113) onto our stable-based diagnostics branch (which has #2171 + the replay event-log logging + fail-fast). Lets us validate the #447 server fence end-to-end against our reproduction.
- fenced-write.ts: fencedEventCreate helper (bail on fence conflict, no retry)
- world/world-vercel: thread lastKnownEventId fence param through events.create
- world-vercel/utils: map HTTP 412 -> "fence conflict" EntityConflictError
- classify-error: classify EntityConflict/RunExpired/TooEarly/Throttle as RUNTIME_ERROR
- suspension-handler: fence the 4 branch-decision writes (hook_created, hook_disposed, step_created, wait_created); creates now sequential so each fenced write chains off the prior event id (single-tip CAS); only queue steps whose step_created this handler actually wrote
- runtime: fence run_completed against the load-time tail; pass suspension fence
NOT FOR MERGE: throwaway validation build. run_failed left unfenced so the fail-fast diagnostic still surfaces any divergence the fence fails to prevent.
NOT FOR MERGE. Validation wiring on the diagnostics branch:
- WORKFLOW_SERVER_URL_OVERRIDE pinned to the #447 OCC-fence server preview (workflow-server-qccj339st.vercel.sh, commit e6722b2)
- forward WORKFLOW_VERCEL_PROTECTION_BYPASS as x-vercel-protection-bypass so the repro app can reach the protected preview
Pairs the ported SDK fence with the server fence end to end. Contract verified: SDK sends lastKnownEventId in the create body; server returns 412 on conflict which the SDK maps to a fence-conflict EntityConflictError.
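The single-tip CAS idea behind the fence can be modeled in a few lines. This is an in-memory sketch only: `InMemoryEventLog`, `FenceConflictError`, and `writeFencedChain` are hypothetical names, and the real contract (a `lastKnownEventId` field in the create body, HTTP 412 on conflict) is stood in for by a thrown error.

```typescript
// Hypothetical stand-in for the error the SDK maps an HTTP 412 to.
class FenceConflictError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'FenceConflictError';
  }
}

// Hypothetical in-memory model of a server-side event log with a fence check.
class InMemoryEventLog {
  private events: { eventId: string; eventType: string }[] = [];
  private nextId = 1;

  tailId(): string | undefined {
    const last = this.events[this.events.length - 1];
    return last ? last.eventId : undefined;
  }

  // Appends only if the caller's fence still matches the current tail.
  create(eventType: string, lastKnownEventId: string | undefined): string {
    if (this.tailId() !== lastKnownEventId) {
      throw new FenceConflictError(`fence conflict: tail is no longer ${lastKnownEventId}`);
    }
    const eventId = `evt_${this.nextId++}`;
    this.events.push({ eventId, eventType });
    return eventId;
  }

  types(): string[] {
    return this.events.map((event) => event.eventType);
  }
}

// Sequential writes: each create is fenced on the id returned by the previous
// one. A conflict propagates to the caller, with no retry.
function writeFencedChain(
  log: InMemoryEventLog,
  eventTypes: string[],
  fence: string | undefined
): string | undefined {
  let tip = fence;
  for (const eventType of eventTypes) {
    tip = log.create(eventType, tip);
  }
  return tip;
}
```

Two handlers that loaded the same tail cannot both extend the log in this model: the first chain succeeds, and the second fails on its first write without appending anything.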
Warning
DO NOT MERGE. This is a throwaway diagnostic / debugging branch used to investigate replay divergence in production. It intentionally disables the `REPLAY_DIVERGENCE` recovery redrive so divergences fail fast and become visible, and adds verbose per-replay event-log logging. Not intended for release.
Purpose
Make replay divergence (`REPLAY_DIVERGENCE` / corrupted-event-log) diagnosable. Build a tarball from this branch and run it to capture what the runtime actually sees when a divergence occurs.
1. Log the event log on every replay
Logs the full event log (identities + ordering only, no payloads) immediately before every replay (`runtime.ts`, before `runWorkflow`). One `info` line per replay: `workflowRunId`, `eventCount`, `eventsCursor`, `eventLog: [{ index, eventId, eventType, correlationId, createdAt }]`. Payloads omitted (large/encrypted/PII; not needed for ordering divergence).
2. Fail the run on the first replay divergence (no redrive)
We weren't seeing any failed runs in CI after many retries: the recovery redrive (#2208/#2212) was masking divergences (a later redrive reads a consistent snapshot and succeeds). This makes the first `ReplayDivergenceError` fail the run as a terminal `CORRUPTED_EVENT_LOG` so every divergence is visible and the diverging event log is captured.
The `REPLAY_DIVERGENCE_MAX_RETRIES` constant is left defined so the redrive is trivial to restore.
Status
Diagnostic only. Tests/typecheck updated to keep the branch green for building tarballs, but this branch should be closed, not merged, once the investigation concludes.