[e2e] event-log-race-repro: actionable summary, slow≠stuck, fetch retries #2195
Merged
…ist + infra breakdown
Rework the PR-comment renderer so a human can immediately see what gates the job and inspect every failing run:
- 🚨 Event-Log Regressions table lists *every* gating run in full (never truncated), each with its duration, a synthesised detail line, and a direct dashboard link. Stuck runs render "no terminal state after <ms>".
- Infra (non-gating) section groups harness noise by error code with a plain-language explanation and example run links, instead of flooding one table with thousands of rows.
- Headline names the regression count and digests the infra noise (e.g. "904 HOOK_RESUME_FAILED, 61 NO_WAKE_BRANCH").
Adds unit coverage for the breakdown, message synthesis and the never-truncate-regressions guarantee.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 8f41186)
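A minimal sketch of the grouping idea described in this commit, not the renderer's actual code. The `RunResult` and `InfraGroup` shapes, the `groupInfra` and `headline` names, and the `maxExamples` parameter are all hypothetical; only the error codes and the digest format come from the commit message.

```typescript
// Hypothetical shape of one harness result; the real renderer's types are not shown in this PR.
interface RunResult {
  runId: string;
  gating: boolean; // true = event-log regression, false = infra noise
  errorCode: string; // e.g. "HOOK_RESUME_FAILED"
}

interface InfraGroup {
  code: string;
  count: number;
  examples: string[]; // a few run ids to link, not every row
}

// Group non-gating runs by error code, keeping at most `maxExamples` run ids per group.
function groupInfra(results: RunResult[], maxExamples = 3): InfraGroup[] {
  const groups = new Map<string, InfraGroup>();
  for (const r of results) {
    if (r.gating) continue; // regressions are listed in full elsewhere, never grouped
    let g = groups.get(r.errorCode);
    if (!g) {
      g = { code: r.errorCode, count: 0, examples: [] };
      groups.set(r.errorCode, g);
    }
    g.count += 1;
    if (g.examples.length < maxExamples) g.examples.push(r.runId);
  }
  // Largest groups first so the digest leads with the dominant noise.
  return Array.from(groups.values()).sort((a, b) => b.count - a.count);
}

// Headline: regression count plus a digest such as "904 HOOK_RESUME_FAILED, 61 NO_WAKE_BRANCH".
function headline(results: RunResult[]): string {
  const regressions = results.filter((r) => r.gating).length;
  const digest = groupInfra(results)
    .map((g) => `${g.count} ${g.code}`)
    .join(", ");
  return `${regressions} regression(s); infra: ${digest || "none"}`;
}
```

Capping the example links per group is what keeps the infra section short even when one error code accounts for hundreds of runs.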
On run-poll timeout, fetch the run's event log and record the latest event (type, step name, elapsed) as the stuck run's errorMessage. The summary's regression table then shows "stuck after N events; latest step_started (foo) at +12.3s" with a dashboard link, instead of only a duration, so a human can see where the run wedged without opening every link. Best-effort; falls back to the duration-only note if the event fetch fails.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit dee4370)
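The message synthesis can be sketched roughly as follows. This is an illustration under assumptions: the `RunEvent` shape and the `describeStuck` name are hypothetical, and passing `null` stands in for a failed event fetch. The two output formats are the ones quoted in the commit messages.

```typescript
// Hypothetical event shape; the harness's real event type is not shown in this PR.
interface RunEvent {
  type: string; // e.g. "step_started"
  stepName?: string;
  elapsedMs: number; // time since run start
}

// Build the detail line for a stuck run. `events` is null when the
// best-effort event fetch failed, in which case only the duration is reported.
function describeStuck(events: RunEvent[] | null, durationMs: number): string {
  if (!events || events.length === 0) {
    return `no terminal state after ${durationMs}ms`;
  }
  const latest = events[events.length - 1];
  const step = latest.stepName ? ` (${latest.stepName})` : "";
  const at = (latest.elapsedMs / 1000).toFixed(1);
  const noun = events.length === 1 ? "event" : "events";
  return `stuck after ${events.length} ${noun}; latest ${latest.type}${step} at +${at}s`;
}
```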
A run flagged at the 150s poll budget can simply be slow on a loaded preview deployment: wrun_…EFDZ9 was observed to complete shortly after the harness gave up, and was wrongly gated as `stuck`. Add a generous post-budget grace window: a run that reaches a terminal state during grace is classified by its real outcome (completed → non-gating `SLOW_COMPLETION` infra, surfaced for visibility; failed → its error class). Only a run still non-terminal after budget + grace is a genuine wedge (gating `stuck`). Renderer gains notes for SLOW_COMPLETION/CANCELLED and singular/plural agreement fixes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 31d5b99)
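One way to picture the budget-plus-grace rule, as a sketch rather than the harness's implementation. The names `pollWithGrace`, `classifyLateRun`, `Clock` and the default `UNKNOWN_FAILURE` label are hypothetical; the injectable clock exists only so the logic can be exercised without real waiting.

```typescript
type RunStatus = "running" | "completed" | "failed";

// Injectable clock (an assumption for testability, not part of the PR).
interface Clock {
  now(): number;
  sleep(ms: number): Promise<void>;
}

const realClock: Clock = {
  now: () => Date.now(),
  sleep: (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
};

// Poll until the run is terminal or budget + grace has elapsed.
// `late` is true when the result arrived after the normal poll budget.
async function pollWithGrace(
  getStatus: () => Promise<RunStatus>,
  budgetMs: number,
  graceMs: number,
  intervalMs: number,
  clock: Clock = realClock,
): Promise<{ status: RunStatus; late: boolean }> {
  const start = clock.now();
  for (;;) {
    const status = await getStatus();
    const elapsed = clock.now() - start;
    if (status !== "running") return { status, late: elapsed > budgetMs };
    if (elapsed >= budgetMs + graceMs) return { status, late: true };
    await clock.sleep(intervalMs);
  }
}

// Label a run that exceeded the poll budget by its real outcome.
function classifyLateRun(status: RunStatus, errorClass = "UNKNOWN_FAILURE"): string {
  switch (status) {
    case "completed":
      return "SLOW_COMPLETION"; // non-gating, surfaced for visibility
    case "failed":
      return errorClass; // classified like any other failure
    case "running":
      return "stuck"; // still non-terminal after budget + grace: gating
  }
}
```

With a 150s budget, a run that completes at 160s comes back as `{ status: "completed", late: true }` and is labelled `SLOW_COMPLETION`; only a run still `running` once the grace window has also expired is labelled `stuck`.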
…e they occur
Investigating HARNESS_ERRORs on a repro run: a `fetch failed` and a `Hook not
found`. Both came from harness-side network calls to the deployment, not the
SDK. A single dropped connection should never abort tracking an otherwise
healthy run.
- Add `withRetry` (linear backoff, transient-network detection) and apply it to
the harness network calls: getWorkflowMetadata, start, resumeHook, and the
run-status poll. On final failure the error is prefixed with the call site
(e.g. "start: fetch failed", "poll runs.get: fetch failed"), so the infra
breakdown says *where* it happened.
- pollTerminalRun no longer aborts on a flaky GET: a transient error just
retries/continues until the deadline.
- waitForHook labels its surfaced error ("waitForHook: Hook not found") so the
hook-propagation timeout is identifiable in the summary.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit a9b68c0)
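For readers who want a concrete picture, a retry wrapper in this spirit might look like the sketch below. It is an assumption-laden illustration, not the code from this commit: the `RetryOptions` shape, the default attempt count and delay, and the message patterns in `isTransientNetworkError` are all hypothetical. Only the linear backoff, the transient-error detection, and the call-site prefix on final failure come from the description above.

```typescript
interface RetryOptions {
  label: string; // call site, used as the error prefix (e.g. "start")
  attempts?: number; // total tries, hypothetical default 3
  baseDelayMs?: number; // linear backoff step, hypothetical default 250
}

// Heuristic transient-network check; the pattern list is an assumption.
function isTransientNetworkError(err: unknown): boolean {
  const msg = err instanceof Error ? err.message : String(err);
  return /fetch failed|ECONNRESET|ETIMEDOUT|socket hang up/i.test(msg);
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const attempts = opts.attempts ?? 3;
  const baseDelayMs = opts.baseDelayMs ?? 250;
  let lastErr: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Give up immediately on non-transient errors or when out of attempts.
      if (!isTransientNetworkError(err) || attempt === attempts) break;
      // Linear backoff: 1x, 2x, 3x ... the base delay.
      await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
  const msg = lastErr instanceof Error ? lastErr.message : String(lastErr);
  // Prefix with the call site so the infra breakdown shows where it failed.
  throw new Error(`${opts.label}: ${msg}`);
}
```

A call such as `withRetry(() => fetch(url), { label: "start" })` would then surface `start: fetch failed` after the last attempt. In this sketch non-transient errors are also prefixed, which is a design choice of the example rather than something the commit states.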
🦋 Changeset detected
Latest commit: b726540
The changes in this PR will be included in the next version bump.
This PR includes changesets to release 0 packages.
🧪 E2E Test Results
❌ Some tests failed
❌ Failed Tests
🌍 Community Worlds (88 failed)
- mongodb (12 failed)
- redis (9 failed)
- turso-dev (1 failed)
- turso (66 failed)
Details by Category
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details.
Standalone CI-only changes for the `event-log-race-repro` job, extracted so they can be merged independently of any core fix. All on top of the classification work already on `stable` (#2194).

Actionable result summary

Slow ≠ stuck
A run flagged at the poll budget can simply be slow on a loaded preview deployment (observed: a `stuck` run that actually completed shortly after). Added a generous post-budget grace window: a run that reaches a terminal state during grace is classified by its real outcome: `completed` → non-gating `SLOW_COMPLETION`, `failed` → its error class. Only a run still non-terminal after budget + grace is a genuine wedge (gating `stuck`). Stuck runs also record where they wedged (latest event/step).

Retry transient fetch failures
Investigating two `HARNESS_ERROR`s (`fetch failed`, `Hook not found`): both came from harness-side network calls to the deployment, not the SDK. Added `withRetry` (linear backoff, transient-network detection) around the harness network calls (`getWorkflowMetadata`, `start`, `resumeHook`, run-status poll); the poll no longer aborts on a flaky GET. On final failure the error is prefixed with the call site (e.g. `start: fetch failed`) so the infra breakdown says where it happened.

Renderer is unit-tested (`node:test`, run by the CI Scripts Tests job).

🤖 Generated with Claude Code