
fix(core): harden wait_completed.resumeAt validation (defensive; root cause fixed by #2171) #2177

Open
TooTallNate wants to merge 17 commits into stable from nate/fix-reused-sleep-resumeat-replay


Conversation

@TooTallNate
Member

Summary

Fixes a second, independent source of CORRUPTED_EVENT_LOG on replay: a wait_completed whose resumeAt is validated against a non-deterministic, wall-clock-derived value, producing a false corruption error on a perfectly consistent event log. This is the residual wait_completed.resumeAt shape that survived the hook-vs-sleep fix (#2171) in stress testing.

Root cause (confirmed via production instrumentation)

sleep(<ms|string>) computes resumeAt = Date.now() + duration (parseDurationToDate). The original run records that absolute timestamp into both wait_created and wait_completed. During replay the VM clock advances to each event's createdAt, so a freshly-created sleep recomputes a different absolute resumeAt.

Normally harmless: the wait_created consumer overwrites the queue item's resumeAt with the recorded (authoritative) value before wait_completed is validated. The bug fires when a wait_completed is consumed by a sleep consumer that never applied a wait_created (hasCreatedEvent=false) — the queue item still holds the freshly-recomputed value, and the comparison fails even though the log is internally consistent.

I instrumented the SDK and captured this in production stress runs. Every failing sample showed hasCreatedEvent=false, with ~18–42s deltas between the recomputed and recorded resumeAt, e.g.:

hasCreatedEvent=false queueItemResumeAt=1780153339381 (recomputed)
eventMs=1780153320646 (recorded)  delta=-18735ms

The recorded resumeAt is the source of truth; the consumer's recomputed value is not a valid basis for a corruption assertion.
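
The drift can be reproduced in isolation. The following is a simplified stand-in, not the SDK's code: `resumeAtFor` is a hypothetical helper that mimics `Date.now() + duration` with an injectable clock, the 5-second duration is assumed, and the two timestamps are taken from the sample above.

```typescript
// Hypothetical stand-in for the duration branch of sleep(): resumeAt is
// derived from whatever "now" the clock reports at call time.
function resumeAtFor(durationMs: number, now: number): number {
  return now + durationMs;
}

const sleepDurationMs = 5_000; // assumed duration, for illustration only

// Original run: wall clock at the moment sleep() was first called.
const originalNow = 1780153320646 - sleepDurationMs;
const recordedResumeAt = resumeAtFor(sleepDurationMs, originalNow);

// Replay: the VM clock has been advanced to a later event's createdAt.
const replayNow = 1780153339381 - sleepDurationMs;
const recomputedResumeAt = resumeAtFor(sleepDurationMs, replayNow);

// Same log, same duration, different absolute resumeAt.
const deltaMs = recordedResumeAt - recomputedResumeAt;
console.log({ recordedResumeAt, recomputedResumeAt, deltaMs });
```

An equality check between the two values fails here even though nothing in the log is inconsistent; only the clock moved.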

Fix

Only validate resumeAt when an authoritative recorded value is available (hasCreatedEvent=true). When it is not, the correlationId match already establishes the wait's identity, so skip the check rather than fail a consistent log. Validation is extracted into detectResumeAtMismatch, which also lowers the consumer callback's pre-existing cognitive-complexity warning (33 → 21).
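
A minimal sketch of the gate described above, under assumptions: the helper name `detectResumeAtMismatch` and the `hasCreatedEvent` flag come from this PR, but the signature, the `WaitQueueItem` shape, and the returned messages are hypothetical and may differ from the real code in `sleep.ts`.

```typescript
// Hypothetical queue-item shape; only hasCreatedEvent and resumeAt are named in the PR.
interface WaitQueueItem {
  correlationId: string;
  resumeAt: Date;
  hasCreatedEvent: boolean;
}

// Returns a description of the mismatch, or null when the event is accepted.
function detectResumeAtMismatch(item: WaitQueueItem, eventResumeAt: Date): string | null {
  const eventMs = eventResumeAt.getTime();
  // A malformed timestamp is rejected regardless of the gate below.
  if (!isFinite(eventMs)) return "invalid resumeAt";
  // Without an applied wait_created, item.resumeAt is only a recomputed guess.
  if (!item.hasCreatedEvent) return null;
  const expectedMs = item.resumeAt.getTime();
  return eventMs === expectedMs ? null : `resumeAt ${eventMs} but expects ${expectedMs}`;
}

const recordedEvent = new Date(1780153320646);
const unconfirmedItem: WaitQueueItem = {
  correlationId: "wait_1", // made-up id
  resumeAt: new Date(1780153339381),
  hasCreatedEvent: false,
};

// No wait_created applied: the comparison is skipped.
const skippedResult = detectResumeAtMismatch(unconfirmedItem, recordedEvent);
// wait_created applied: a differing value is still reported.
const reportedResult = detectResumeAtMismatch({ ...unconfirmedItem, hasCreatedEvent: true }, recordedEvent);
console.log({ skippedResult, reportedResult });
```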

Tests

  • New regression test in sleep.test.ts advances the replay clock (updateTimestamp) and asserts a consistent wait_completed with hasCreatedEvent=false no longer raises CorruptedEventLogError. Fails before the fix (reproduces the exact production error), passes after.
  • Existing resumeAt-mismatch / invalid-resumeAt tests (which have hasCreatedEvent=true) still correctly fire.
  • Full @workflow/core suite: 635/635, typecheck clean.

Scope

Pre-existing bug on stable. Independent of the hook-vs-sleep race fix (#2171) and of the server-side wait_created atomicity work (workflow-server #462) — those address different failure shapes. Stress data showed #2171 removes the step-consumer-mismatch shape; this removes the wait_completed.resumeAt shape.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…rded value

A reused/duration sleep races a `wait_completed` replay against a
non-deterministic, wall-clock-derived expected value, producing a false
CorruptedEventLogError on a perfectly consistent event log.

`sleep(<ms|string>)` computes its resumeAt as `Date.now() + duration` (see
parseDurationToDate). The original run records that absolute timestamp into
both wait_created and wait_completed. During replay the VM clock advances to
each event's createdAt, so a freshly-created sleep recomputes a *different*
absolute resumeAt. Normally harmless: the wait_created consumer overwrites the
queue item's resumeAt with the recorded (authoritative) value before
wait_completed is validated.

The bug: when a wait_completed is consumed by a sleep consumer that never
applied a wait_created (hasCreatedEvent=false), the queue item still holds the
freshly-recomputed value, and the resumeAt comparison fails — even though the
event log is internally consistent and the recorded resumeAt is the source of
truth. Captured in production stress runs: hasCreatedEvent=false with ~18-42s
deltas between the recomputed and recorded resumeAt.

Fix: only validate resumeAt when an authoritative recorded value is available
(hasCreatedEvent=true). When it is not, the correlationId match already
establishes the wait's identity, so skip the check rather than fail. Extracted
the validation into `detectResumeAtMismatch`, which also lowers the consumer
callback's cognitive-complexity warning (33 -> 21).

Adds a regression test that advances the replay clock (via updateTimestamp)
and asserts a consistent wait_completed with hasCreatedEvent=false no longer
raises CorruptedEventLogError. Pre-existing stable bug; independent of the
hook-vs-sleep race fix (#2171) and of the server-side work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@TooTallNate TooTallNate requested a review from a team as a code owner May 30, 2026 15:30
Copilot AI review requested due to automatic review settings May 30, 2026 15:30
@changeset-bot

changeset-bot Bot commented May 30, 2026

🦋 Changeset detected

Latest commit: 671a3ea

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 16 packages
Name Type
@workflow/core Patch
@workflow/builders Patch
@workflow/cli Patch
@workflow/next Patch
@workflow/nitro Patch
@workflow/vitest Patch
@workflow/web-shared Patch
@workflow/web Patch
workflow Patch
@workflow/world-testing Patch
@workflow/astro Patch
@workflow/nest Patch
@workflow/rollup Patch
@workflow/sveltekit Patch
@workflow/vite Patch
@workflow/nuxt Patch


@vercel
Contributor

vercel Bot commented May 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| example-nextjs-workflow-turbopack | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| example-nextjs-workflow-webpack | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| example-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workbench-astro-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workbench-express-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workbench-fastify-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workbench-hono-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workbench-nitro-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workbench-nuxt-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workbench-sveltekit-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workbench-tanstack-start-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workbench-vite-workflow | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workflow-docs | Ready | Preview, Comment, Open in v0 | Jun 1, 2026 10:05pm |
| workflow-swc-playground | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workflow-tarballs | Ready | Preview, Comment | Jun 1, 2026 10:05pm |
| workflow-web | Ready | Preview, Comment | Jun 1, 2026 10:05pm |

@github-actions
Contributor

github-actions Bot commented May 30, 2026

🧪 E2E Test Results

Some tests failed

Summary

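| Category | Passed | Failed | Skipped | Total |
| --- | --- | --- | --- | --- |
| ❌ ▲ Vercel Production | 922 | 1 | 67 | 990 |
| ✅ 💻 Local Development | 994 | 0 | 86 | 1080 |
| ✅ 📦 Local Production | 994 | 0 | 86 | 1080 |
| ✅ 🐘 Local Postgres | 994 | 0 | 86 | 1080 |
| ✅ 🪟 Windows | 90 | 0 | 0 | 90 |
| ❌ 🌍 Community Worlds | 136 | 92 | 0 | 228 |
| ✅ 📋 Other | 504 | 0 | 36 | 540 |
| Total | 4634 | 93 | 361 | 5088 |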

❌ Failed Tests

▲ Vercel Production (1 failed)

vite (1 failed):

🌍 Community Worlds (92 failed)

mongodb (14 failed):

  • hookWorkflow is not resumable via public webhook endpoint | wrun_01KT2KDWR4CJPCV4KF93JHGCG6
  • webhookWorkflow | wrun_01KT2KE1245NRQWPBV1AT1Z8CW
  • sleepingWorkflow | wrun_01KT2KFHF3TAK9BCQGVVBMH2KN
  • outputStreamWorkflow no startIndex (reads all chunks)
  • outputStreamWorkflow negative startIndex (reads from end)
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
  • outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
  • outputStreamInsideStepWorkflow - getWritable() called inside step functions | wrun_01KT2KJFXQ9AF2Y076M0EYNW7X
  • writableForwardedFromWorkflowWorkflow | wrun_01KT2KJX8NFAHFABXH2TTK02DD
  • writableForwardedFromStepWorkflow | wrun_01KT2KK2QDGCPFCD1BHXXJ0GYR
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KT2KPQYFBNDX1GHX631MPXPB
  • pages router sleepingWorkflow via pages router
  • resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KT2KWARMSYH1VWQ6TF28GWAV

redis (9 failed):

  • hookWorkflow is not resumable via public webhook endpoint | wrun_01KT2KDWR4CJPCV4KF93JHGCG6
  • sleepingWorkflow | wrun_01KT2KFHF3TAK9BCQGVVBMH2KN
  • outputStreamWorkflow negative startIndex (reads from end)
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
  • outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KT2KPQYFBNDX1GHX631MPXPB
  • pages router sleepingWorkflow via pages router
  • resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KT2KWARMSYH1VWQ6TF28GWAV

turso-dev (1 failed):

  • dev e2e should rebuild on imported step dependency change

turso (68 failed):

  • addTenWorkflow | wrun_01KT2KCS4M5XK193AB6VN0F3BN
  • addTenWorkflow | wrun_01KT2KCS4M5XK193AB6VN0F3BN
  • wellKnownAgentWorkflow (.well-known/agent) | wrun_01KT2KE4MF3Y78S5QNQTZ8Y20M
  • should work with react rendering in step
  • promiseAllWorkflow | wrun_01KT2KD0G60VB2N7TTTEYSYT1N
  • promiseRaceWorkflow | wrun_01KT2KD6W8F158K74NPC5HBSQK
  • promiseAnyWorkflow | wrun_01KT2KD8THC6MBHSVRK63XVNKY
  • importedStepOnlyWorkflow | wrun_01KT2KEQPT5ZE0ZEGFE7DA1GSE
  • readableStreamWorkflow | wrun_01KT2KDAPVZEARDZ5CC857R7PX
  • hookWorkflow | wrun_01KT2KDPYJ51TANZ87FXBEYR8C
  • hookWorkflow is not resumable via public webhook endpoint | wrun_01KT2KDWR4CJPCV4KF93JHGCG6
  • webhookWorkflow | wrun_01KT2KE1245NRQWPBV1AT1Z8CW
  • sleepingWorkflow | wrun_01KT2KFHF3TAK9BCQGVVBMH2KN
  • parallelSleepWorkflow | wrun_01KT2KG0BQX1KX9F7D8AXQ43QK
  • nullByteWorkflow | wrun_01KT2KG4EWHYFE7QR71HQAJQHZ
  • workflowAndStepMetadataWorkflow | wrun_01KT2KG6E4Q8KYGAW7EYF1Z5PB
  • outputStreamWorkflow no startIndex (reads all chunks)
  • outputStreamWorkflow positive startIndex (skips first chunk)
  • outputStreamWorkflow negative startIndex (reads from end)
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
  • outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
  • outputStreamInsideStepWorkflow - getWritable() called inside step functions | wrun_01KT2KJFXQ9AF2Y076M0EYNW7X
  • writableForwardedFromWorkflowWorkflow | wrun_01KT2KJX8NFAHFABXH2TTK02DD
  • writableForwardedFromStepWorkflow | wrun_01KT2KK2QDGCPFCD1BHXXJ0GYR
  • fetchWorkflow | wrun_01KT2KK61105R59RS7J6C4TMCC
  • promiseRaceStressTestWorkflow | wrun_01KT2KK94CT1PQEDHVN9EG6XFT
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation workflow errors cross-file imports preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior regular Error retries until success
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior RetryableError respects custom retryAfter delay
  • error handling retry behavior maxRetries=0 disables retries
  • error handling catchability FatalError can be caught and detected with FatalError.is()
  • error handling not registered WorkflowNotRegisteredError fails the run when workflow does not exist
  • error handling not registered StepNotRegisteredError fails the step but workflow can catch it
  • error handling not registered StepNotRegisteredError fails the run when not caught in workflow
  • hookCleanupTestWorkflow - hook token reuse after workflow completion | wrun_01KT2KPD9A3J85BPKK93SW6028
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KT2KPQYFBNDX1GHX631MPXPB
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running | wrun_01KT2KQ4ZTYXAB0FZ3G5V4FJSE
  • stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars) | wrun_01KT2KQM39T6NE24VSJSPGJ1BV
  • stepFunctionWithClosureWorkflow - step function with closure variables passed as argument | wrun_01KT2KQXFKFKC86FXH71BSEB3N
  • closureVariableWorkflow - nested step functions with closure variables | wrun_01KT2KR23MQ0MCVPWZRHYV9X1V
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step | wrun_01KT2KR40C0XSFZCKNRA9BES1A
  • health check (queue-based) - workflow and step endpoints respond to health check messages
  • health check (CLI) - workflow health command reports healthy endpoints
  • pathsAliasWorkflow - TypeScript path aliases resolve correctly | wrun_01KT2KRHXX954WKGQEFSBXY854
  • Calculator.calculate - static workflow method using static step methods from another class | wrun_01KT2KRPTHBM5853GKVXWA1272
  • AllInOneService.processNumber - static workflow method using sibling static step methods | wrun_01KT2KRWX43JHCP82FAZ5NCSW4
  • ChainableService.processWithThis - static step methods using this to reference the class | wrun_01KT2KS2VZS54D3XX5FYNCF8HN
  • thisSerializationWorkflow - step function invoked with .call() and .apply() | wrun_01KT2KS919QGAZAPVN894TPD5Z
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE | wrun_01KT2KSG0J449SDCQN4TRRTVJQ
  • instanceMethodStepWorkflow - instance methods with "use step" directive | wrun_01KT2KSPZ1574P6861D4GVX6JB
  • crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context | wrun_01KT2KT39HDD75T7ZE6RPVZKC5
  • stepFunctionAsStartArgWorkflow - step function reference passed as start() argument | wrun_01KT2KTBH88BN1RE0A512BMMNQ
  • cancelRun - cancelling a running workflow | wrun_01KT2KTJFC29XD67PD4GHQ3VQX
  • cancelRun via CLI - cancelling a running workflow | wrun_01KT2KTV2HKFKK36N528CYNQ2X
  • pages router addTenWorkflow via pages router
  • pages router promiseAllWorkflow via pages router
  • pages router sleepingWorkflow via pages router
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep | wrun_01KT2KV5Z6FWM79SHB3Y3PQEBM
  • sleepInLoopWorkflow - sleep inside loop with steps actually delays each iteration | wrun_01KT2KVMF6RM60YJGZBAFY1HNG
  • sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control) | wrun_01KT2KW12WZ52V1G011E6AH59C
  • importMetaUrlWorkflow - import.meta.url is available in step bundles | wrun_01KT2KW70B6VRBXH9M4Q65JD72
  • metadataFromHelperWorkflow - getWorkflowMetadata/getStepMetadata work from module-level helper (#1577) | wrun_01KT2KW8XPTJY23Z0BGBD2ATX6
  • resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KT2KWARMSYH1VWQ6TF28GWAV

Details by Category

❌ ▲ Vercel Production
App Passed Failed Skipped
✅ astro 83 0 7
✅ example 83 0 7
✅ express 83 0 7
✅ fastify 83 0 7
✅ hono 83 0 7
✅ nextjs-turbopack 88 0 2
✅ nextjs-webpack 88 0 2
✅ nitro 83 0 7
✅ nuxt 83 0 7
✅ sveltekit 83 0 7
❌ vite 82 1 7
✅ 💻 Local Development
App Passed Failed Skipped
✅ astro-stable 84 0 6
✅ express-stable 84 0 6
✅ fastify-stable 84 0 6
✅ hono-stable 84 0 6
✅ nextjs-turbopack-canary 71 0 19
✅ nextjs-turbopack-stable 90 0 0
✅ nextjs-webpack-canary 71 0 19
✅ nextjs-webpack-stable 90 0 0
✅ nitro-stable 84 0 6
✅ nuxt-stable 84 0 6
✅ sveltekit-stable 84 0 6
✅ vite-stable 84 0 6
✅ 📦 Local Production
App Passed Failed Skipped
✅ astro-stable 84 0 6
✅ express-stable 84 0 6
✅ fastify-stable 84 0 6
✅ hono-stable 84 0 6
✅ nextjs-turbopack-canary 71 0 19
✅ nextjs-turbopack-stable 90 0 0
✅ nextjs-webpack-canary 71 0 19
✅ nextjs-webpack-stable 90 0 0
✅ nitro-stable 84 0 6
✅ nuxt-stable 84 0 6
✅ sveltekit-stable 84 0 6
✅ vite-stable 84 0 6
✅ 🐘 Local Postgres
App Passed Failed Skipped
✅ astro-stable 84 0 6
✅ express-stable 84 0 6
✅ fastify-stable 84 0 6
✅ hono-stable 84 0 6
✅ nextjs-turbopack-canary 71 0 19
✅ nextjs-turbopack-stable 90 0 0
✅ nextjs-webpack-canary 71 0 19
✅ nextjs-webpack-stable 90 0 0
✅ nitro-stable 84 0 6
✅ nuxt-stable 84 0 6
✅ sveltekit-stable 84 0 6
✅ vite-stable 84 0 6
✅ 🪟 Windows
App Passed Failed Skipped
✅ nextjs-turbopack 90 0 0
❌ 🌍 Community Worlds
App Passed Failed Skipped
✅ mongodb-dev 5 0 0
❌ mongodb 57 14 0
✅ redis-dev 5 0 0
❌ redis 62 9 0
❌ turso-dev 4 1 0
❌ turso 3 68 0
✅ 📋 Other
App Passed Failed Skipped
✅ e2e-local-dev-nest-stable 84 0 6
✅ e2e-local-dev-tanstack-start-stable 84 0 6
✅ e2e-local-postgres-nest-stable 84 0 6
✅ e2e-local-postgres-tanstack-start-stable 84 0 6
✅ e2e-local-prod-nest-stable 84 0 6
✅ e2e-local-prod-tanstack-start-stable 84 0 6

📋 View full workflow run


Some E2E test jobs failed:

  • Vercel Prod: failure
  • Local Dev: success
  • Local Prod: success
  • Local Postgres: success
  • Windows: success

Check the workflow run for details.

⚠️ Community world tests failed (non-blocking):

  • Community Worlds: failure

Check the workflow run for details.

Contributor

Copilot AI left a comment

Pull request overview

Fixes a false-positive CorruptedEventLogError raised on replay when a sleep consumer processes a wait_completed event without having first applied the corresponding wait_created. In that case the queue item's resumeAt still reflects a freshly recomputed (wall-clock-dependent) value from parseDurationToDate, so comparing it against the recorded resumeAt produces a spurious mismatch on a consistent event log.

Changes:

  • Extracted resumeAt validation into detectResumeAtMismatch() helper.
  • Skip the resumeAt comparison unless queueItem.hasCreatedEvent is true (authoritative recorded value present).
  • Added regression test simulating replay clock drift with wait_completed and no prior wait_created.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| packages/core/src/workflow/sleep.ts | Extract resumeAt mismatch detection; only validate when authoritative recorded value (hasCreatedEvent) is available. |
| packages/core/src/workflow/sleep.test.ts | Add regression test for replay-clock-advanced wait_completed without wait_created; expose updateTimestamp from setup helper. |


Contributor

@pranaygp left a comment

Please add a patch changeset for @workflow/core before merging. This PR changes a published package on stable; without a changeset this bug fix will not produce a package release.

Comment thread packages/core/src/workflow/sleep.ts Outdated
Member

@VaguelySerious left a comment

AI review: blocking issues found

Comment thread packages/core/src/workflow/sleep.ts Outdated
Comment thread packages/core/src/workflow/sleep.test.ts Outdated
@VaguelySerious
Member

AI Review: Blocking

Missing changeset. This PR fixes a real bug in a published package (@workflow/core) but ships no changeset, so it won't produce a version bump or changelog entry — the fix wouldn't actually reach users on release. Per the repo's contribution rule every workflow PR needs one (pnpm changeset, a patch for @workflow/core here). There's no automated changeset gate on this PR, which is why it slipped past CI.

This is the only blocker — the code change itself is regression-free and low-risk (see the inline note on sleep.ts): it strictly weakens the resumeAt validation, adding no new error path, and legitimate corruption detection (hasCreatedEvent=true) is preserved.


AI Review: Note — on the red CI checks, none of which are caused by this change:

  • Unit Tests (windows-latest) — failed at @workflow/swc-plugin#build (the Rust/wasm crate build), not at any test. This job is itself flaky on stable (mixed success/failure in recent runs); unrelated to a pure-TypeScript @workflow/core change.
  • E2E Community World (Redis / MongoDB) — the failing cases are error-handling workflows (errorWorkflowNested, errorWorkflowCrossFile, errorStepBasic) and the jobs ended in cancellation/timeout. These adapters are cancelled/incomplete on stable too. A change that only removes a wait_completed.resumeAt error path cannot cause unrelated error-workflow tests to start failing.
  • E2E Required Check — red only because it aggregates the Windows (UNIT_STATUS/WINDOWS_STATUS=failure) and community (COMMUNITY_STATUS=cancelled) jobs above.

Local validation on 80545bb: full @workflow/core unit suite 635/635 (stable across 5 repeats), typecheck clean. The new regression test reproduces the exact production error on stable (resumeAt "2025-07-25..." but expects "2026-05-31...") and passes on this branch. I also confirmed, via an ad-hoc test, that a consistent log with wait_created does not false-positive across a 30s replay-clock advance, and that a genuine resumeAt mismatch still fires regardless of clock state.

Recommend adding the changeset; once that's in, this is good to merge.

VaguelySerious and others added 2 commits May 31, 2026 10:56
Add the missing patch changeset for the `@workflow/core` wait_completed
resumeAt replay fix so it gets a version bump and changelog entry.

Also remove the fixed 250ms grace timer from the new regression test: it
now races the error-vs-resolve outcomes directly, so a regression surfaces
deterministically (error branch, or a hang caught by the test timeout)
rather than via a flaky race against a wall-clock guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
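
The "race the error-vs-resolve outcomes directly" idea from the commit above can be sketched with `Promise.race`. This is an illustration only: `raceOutcomes` and the `Outcome` type are hypothetical names, and the real test wires these promises to the sleep under test and the workflow's error hook.

```typescript
type Outcome = { kind: "resolved" } | { kind: "error"; error: unknown };

// Settle on whichever happens first: the sleep resolving or an error being reported.
// No wall-clock guard is involved; if neither settles, the test runner's own
// timeout is what flags the hang.
function raceOutcomes(sleepPromise: Promise<void>, errorPromise: Promise<unknown>): Promise<Outcome> {
  return Promise.race([
    sleepPromise.then((): Outcome => ({ kind: "resolved" })),
    errorPromise.then((error): Outcome => ({ kind: "error", error })),
  ]);
}

// Example: a sleep that resolves while the error side never fires.
const neverFires = new Promise<unknown>(() => {});
raceOutcomes(Promise.resolve(), neverFires).then((outcome) => {
  console.log(outcome.kind);
});
```

Compared with a fixed grace timer, the result does not depend on how fast the machine is: a regression shows up as the `error` branch winning, not as a timer expiring first.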
Member

@VaguelySerious left a comment

LGTM, applied the AI nit about the test case directly

VaguelySerious and others added 6 commits June 1, 2026 11:17
(cherry picked from commit ae3c833)
(cherry picked from commit c1d7bab)
…ist + infra breakdown

Rework the PR-comment renderer so a human can immediately see what gates the
job and inspect every failing run:

- 🚨 Event-Log Regressions table lists *every* gating run in full (never
  truncated), each with its duration, a synthesised detail line, and a direct
  dashboard link. Stuck runs render "no terminal state after <ms>".
- Infra (non-gating) section groups harness noise by error code with a
  plain-language explanation and example run links, instead of flooding one
  table with thousands of rows.
- Headline names the regression count and digests the infra noise
  (e.g. "904 HOOK_RESUME_FAILED, 61 NO_WAKE_BRANCH").

Adds unit coverage for the breakdown, message synthesis and the
never-truncate-regressions guarantee.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
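
As a rough illustration of the headline digest described in the commit above, infra failures can be counted per error code and rendered most-frequent first. `InfraFailure` and `digestInfra` are hypothetical names, not taken from `render-event-log-race-repro-results.js`.

```typescript
// Hypothetical record for one non-gating (infra) failure.
interface InfraFailure {
  code: string;
  runId: string;
}

// Count failures per error code and render them as "<count> <code>", most frequent first.
function digestInfra(failures: InfraFailure[]): string {
  const counts: Record<string, number> = {};
  for (const failure of failures) {
    counts[failure.code] = (counts[failure.code] ?? 0) + 1;
  }
  return Object.keys(counts)
    .sort((a, b) => counts[b] - counts[a])
    .map((code) => `${counts[code]} ${code}`)
    .join(", ");
}

const sampleFailures: InfraFailure[] = [
  { code: "HOOK_RESUME_FAILED", runId: "run_a" }, // made-up run ids
  { code: "NO_WAKE_BRANCH", runId: "run_b" },
  { code: "HOOK_RESUME_FAILED", runId: "run_c" },
];
console.log(digestInfra(sampleFailures));
```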
On run-poll timeout, fetch the run's event log and record the latest event
(type, step name, elapsed) as the stuck run's errorMessage. The summary's
regression table then shows "stuck after N events; latest step_started (foo)
at +12.3s" with a dashboard link, instead of only a duration — so a human can
see where the run wedged without opening every link. Best-effort; falls back
to the duration-only note if the event fetch fails.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
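
A sketch of how the stuck-run detail line described above might be synthesised. The `RunEvent` shape and the `describeStuckRun` name are assumptions; the two output formats follow the examples quoted in the commit messages.

```typescript
// Hypothetical, minimal view of an event-log entry.
interface RunEvent {
  eventType: string;
  stepName?: string;
  createdAt: number; // epoch milliseconds
}

// events is null when the best-effort event fetch failed.
function describeStuckRun(events: RunEvent[] | null, runStartedAt: number, waitedMs: number): string {
  if (events === null || events.length === 0) {
    return `no terminal state after ${waitedMs}ms`; // duration-only fallback
  }
  const latest = events[events.length - 1];
  const step = latest.stepName ? ` (${latest.stepName})` : "";
  const elapsedSeconds = ((latest.createdAt - runStartedAt) / 1000).toFixed(1);
  return `stuck after ${events.length} events; latest ${latest.eventType}${step} at +${elapsedSeconds}s`;
}

const startedAt = 1_000_000;
const sampleEvents: RunEvent[] = [
  { eventType: "run_created", createdAt: startedAt },
  { eventType: "step_created", stepName: "foo", createdAt: startedAt + 200 },
  { eventType: "step_started", stepName: "foo", createdAt: startedAt + 12_300 },
];
console.log(describeStuckRun(sampleEvents, startedAt, 150_000));
```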
…ep-resumeat-replay

# Conflicts:
#	.github/scripts/render-event-log-race-repro-results.js
#	.github/scripts/render-event-log-race-repro-results.test.js
A run flagged at the 150s poll budget can simply be slow on a loaded preview
deployment — observed wrun_…EFDZ9 completed shortly after the harness gave up
and was wrongly gated as `stuck`.

Add a generous post-budget grace window: a run that reaches a terminal state
during grace is classified by its real outcome (completed → non-gating
`SLOW_COMPLETION` infra, surfaced for visibility; failed → its error class).
Only a run still non-terminal after budget + grace is a genuine wedge (gating
`stuck`). Renderer gains notes for SLOW_COMPLETION/CANCELLED and singular/plural
agreement fixes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
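
The grace-window rule above reduces to a small decision on the run's status once budget plus grace has elapsed. A minimal sketch, assuming a three-state status and a hypothetical `classifyAfterGrace` helper; the `other` fallback for an unknown error code is also an assumption.

```typescript
type PolledStatus = "running" | "completed" | "failed";

// Classify a run that exceeded the poll budget, using its status after the grace window.
function classifyAfterGrace(status: PolledStatus, errorCode?: string): string {
  if (status === "completed") return "SLOW_COMPLETION"; // non-gating, surfaced for visibility
  if (status === "failed") return errorCode ?? "other"; // classified by its real error
  return "stuck"; // still non-terminal: a genuine wedge, gating
}

console.log(classifyAfterGrace("completed"));
console.log(classifyAfterGrace("failed", "CORRUPTED_EVENT_LOG"));
console.log(classifyAfterGrace("running"));
```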
…e they occur

Investigating HARNESS_ERRORs on a repro run: a `fetch failed` and a `Hook not
found`. Both came from harness-side network calls to the deployment, not the
SDK. A single dropped connection should never abort tracking an otherwise
healthy run.

- Add `withRetry` (linear backoff, transient-network detection) and apply it to
  the harness network calls: getWorkflowMetadata, start, resumeHook, and the
  run-status poll. On final failure the error is prefixed with the call site
  (e.g. "start: fetch failed", "poll runs.get: fetch failed"), so the infra
  breakdown says *where* it happened.
- pollTerminalRun no longer aborts on a flaky GET: a transient error just
  retries/continues until the deadline.
- waitForHook labels its surfaced error ("waitForHook: Hook not found") so the
  hook-propagation timeout is identifiable in the summary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
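
A minimal sketch of a `withRetry` helper along the lines described above. Only the name, the linear backoff, and the call-site prefix come from the commit message; the signature, the default attempt count, and the `isTransient` heuristic are assumptions.

```typescript
// Assumed heuristic for transient network errors; the real detection may differ.
function isTransient(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return /fetch failed|ECONNRESET|ETIMEDOUT/i.test(message);
}

async function withRetry<T>(
  label: string,
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Stop on non-transient errors or when attempts are exhausted.
      if (!isTransient(error) || attempt === attempts) break;
      // Linear backoff: the delay grows by baseDelayMs with each attempt.
      await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
  // Prefix the call site so the summary shows where the failure happened.
  const message = lastError instanceof Error ? lastError.message : String(lastError);
  throw new Error(`${label}: ${message}`);
}
```

A call such as `withRetry("start", () => startRun())`, where `startRun` is a placeholder for the harness call, would surface a final failure as `start: fetch failed`.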
VaguelySerious and others added 2 commits June 1, 2026 03:29
…issing field

A failed WorkflowRun exposes its reason as `error: { code, message }` and has no
top-level `errorCode`, so the poller's `classifyFailure(runData.errorCode)` was
always passing `undefined` — collapsing every polled failure to an
uncategorised, detail-less `other`. Read `runData.error.code`/`.message` so
USER_ERROR/RUNTIME_ERROR/CORRUPTED_EVENT_LOG are classified correctly and the
regression row shows why the run failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit d2a59d4)
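
The field mix-up fixed in the commit above can be illustrated as follows. The `WorkflowRunData` shape is reduced to the `error: { code, message }` field the commit mentions, `classifyPolledRun` is a hypothetical wrapper, and the list of recognised codes is limited to the three named in the message.

```typescript
// Reduced view of a polled run: the failure reason lives under `error`,
// and there is no top-level `errorCode` field.
interface WorkflowRunData {
  status: string;
  error?: { code?: string; message?: string };
}

const RECOGNISED_CODES = ["USER_ERROR", "RUNTIME_ERROR", "CORRUPTED_EVENT_LOG"];

function classifyFailure(code: string | undefined): string {
  return code !== undefined && RECOGNISED_CODES.indexOf(code) !== -1 ? code : "other";
}

function classifyPolledRun(runData: WorkflowRunData): { kind: string; detail: string } {
  // Reading a non-existent top-level errorCode would always pass undefined here.
  return {
    kind: classifyFailure(runData.error?.code),
    detail: runData.error?.message ?? "",
  };
}

const failedRun: WorkflowRunData = {
  status: "failed",
  error: { code: "CORRUPTED_EVENT_LOG", message: "resumeAt mismatch" }, // made-up message
};
console.log(classifyPolledRun(failedRun));
```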
TooTallNate and others added 3 commits June 1, 2026 11:29
…leep(Date)

Addresses review feedback on the resumeAt-skip guard: the no-`wait_created`
skip was too broad. It correctly avoids a false `CorruptedEventLogError` for
duration-based `sleep(<ms|string>)` (whose resumeAt is `Date.now() + duration`
and therefore varies across replays), but it also skipped validation for an
absolute `sleep(Date)`, whose resumeAt is recomputed identically every replay
and so remains an authoritative value worth checking even without a recorded
`wait_created`.

Track `resumeAtIsDeterministic` on the wait queue item (true when the sleep was
given a Date / date-like), and only skip the equality check when resumeAt is
non-deterministic AND no `wait_created` was applied. A genuine
absolute-Date mismatch now still raises.

Adds a regression test (mismatched `sleep(Date)` without `wait_created` →
CorruptedEventLogError). The malformed/Invalid-Date case was already handled
unconditionally before the gate and is already covered.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
An A/B stress reproduction (instrumented build, 300 runs each) confirmed the
no-`wait_created` state this guard handles is a downstream symptom of the
hook-vs-sleep race fixed in #2171: every captured case was a `wait_completed`
whose correlationId has NO matching `wait_created` in the log (a divergent
replay shifted the deterministic ULID sequence). It reproduced readily with
#2171 reverted (5 hits / 300) and never with #2171 present (0 / 300).

Reword the changeset to one sentence describing the validation hardening, and
document the confirmed root cause inline. No behavior change.
@TooTallNate
Member Author

Sit-rep: root cause confirmed — this is defensive hardening, not an independent bug fix

We dug into why a wait_completed would ever be validated without a recorded wait_created (hasCreatedEvent=false). An instrumented A/B stress reproduction settled it.

Instrumentation

Built a debug SDK that force-fails on any hasCreatedEvent=false at wait_completed, capturing the consumer's subscribe-time event index, completed-at index, and — critically — the wait_created log index for that correlationId.

A/B result (same harness, 300 runs each, identical params)

| Build | CORRUPTED_EVENT_LOG | hasCreatedEvent=false captured |
| --- | --- | --- |
| stable + this PR (incl. #2171) | 0 | 0 / 300 |
| #2171 reverted (+ this PR) | 241 | 5 / 300 |

Every captured hasCreatedEvent=false case showed waitCreatedLogIndex = -1 — i.e. the wait_completed has no matching wait_created anywhere in the event log for its correlationId.

Root cause

That state is a divergent-replay artifact of the hook-vs-sleep race fixed in #2171: when the race resolved non-deterministically, the workflow's branch decisions diverged on replay, shifting the deterministic ULID sequence — so a sleep got a correlationId whose wait_created isn't in the committed log. The consumer then validated a wait_completed for a wait it never saw created, comparing against a freshly-recomputed wall-clock resumeAt → spurious CorruptedEventLogError.

With #2171 (now merged) the race is deterministic and this no longer occurs — confirmed by the 0/300 above.

What this PR is now

Defensive hardening of the resumeAt validation path, reframed accordingly:

  • Skip the equality check only when there's no authoritative recorded value (no wait_created AND a non-deterministic duration-based sleep).
  • Still validate absolute sleep(Date) (deterministic resumeAt) even without wait_created.
  • Still reject malformed/non-finite resumeAt unconditionally.
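
The three rules above can be written as a single decision function. This is a sketch under assumptions: `hasCreatedEvent` and `resumeAtIsDeterministic` are the flags named in this PR, while `decideResumeAtCheck`, `GateInput`, and the decision labels are hypothetical.

```typescript
type ResumeAtDecision = "reject-invalid" | "skip" | "compare";

interface GateInput {
  hasCreatedEvent: boolean; // a wait_created was applied to the queue item
  resumeAtIsDeterministic: boolean; // the sleep was given a Date / date-like
}

function decideResumeAtCheck(item: GateInput, eventResumeAtMs: number): ResumeAtDecision {
  // Malformed or non-finite values are rejected before any gating.
  if (!isFinite(eventResumeAtMs)) return "reject-invalid";
  // No recorded value and a wall-clock-derived expectation: nothing authoritative to compare.
  if (!item.hasCreatedEvent && !item.resumeAtIsDeterministic) return "skip";
  // Either wait_created was applied or the value is an absolute sleep(Date).
  return "compare";
}

const durationSleepWithoutCreate: GateInput = { hasCreatedEvent: false, resumeAtIsDeterministic: false };
const dateSleepWithoutCreate: GateInput = { hasCreatedEvent: false, resumeAtIsDeterministic: true };
console.log(decideResumeAtCheck(durationSleepWithoutCreate, 1780153320646));
console.log(decideResumeAtCheck(dateSleepWithoutCreate, 1780153320646));
console.log(decideResumeAtCheck(dateSleepWithoutCreate, NaN));
```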

Changeset reworded to one sentence; root cause documented inline. No behavior change since the last review pass. Given the root cause is fixed upstream by #2171, this is belt-and-suspenders — happy to keep it as hardening or close it if folks would rather not carry it. cc @pranaygp @VaguelySerious


Labels

event-log-race-repro Run the event log race reproduction job


4 participants