Skip to content

fix: complete transient-failure recovery (tier-column reset + review/qa-stranded recovery)#103

Merged
tzone85 merged 1 commit into
mainfrom
vxd-transient-recovery
Jun 24, 2026
Merged

fix: complete transient-failure recovery (tier-column reset + review/qa-stranded recovery)#103
tzone85 merged 1 commit into
mainfrom
vxd-transient-recovery

Conversation

@tzone85

@tzone85 tzone85 commented Jun 24, 2026

Copy link
Copy Markdown
Owner

What

Completes the transient-failure recovery story (Bug #4) so a build that paused because stories exhausted their escalation tiers for a transient reason (Claude session-limit / 429 storm, since-fixed base breakage) can be cleanly retried instead of instantly re-pausing at the top tier.

  • STORY_RESET zeroes the cached escalation_tier column (not just status). The event-sourced CurrentTier was already reset-aware, but the dispatcher reads the denormalized column and warned story … at tier 3 reached routeStory, expected monitor interception. Now both agree.
  • Resume recovers review/qa-stranded stories, not just in_progress orphans. When an agent finishes (STORY_COMPLETEDreview) but the monitor is killed before review→QA→merge (e.g. a session-limit pause), the story was stranded forever, blocking every dependent story.

Pairs with the vxd retry command + reset-aware CurrentTier already on main.

Test plan

  • go test ./... -count=1 green; go vet clean.
  • internal/cli/resume_orphan_recovery_test.go covers the status predicate.
  • Validated live: recovered pulsereview's escalation-pinned s-006 (tier 3, code complete) — retried to tier 0 and re-dispatched onto clean main.

🤖 Generated with Claude Code

…recover review/qa-stranded stories)

- STORY_RESET now zeroes the cached escalation_tier column (not just status),
  so the dispatcher routes a retried story fresh instead of warning 'tier 3
  reached routeStory'. The event-sourced CurrentTier was already reset-aware.
- resume recovers stories stranded in 'review'/'qa' (agent finished, monitor
  killed before review->QA->merge) — previously only 'in_progress' orphans were
  recovered, so a session-limit pause could strand a story forever, blocking
  every dependent story.

Together with vxd retry + reset-aware CurrentTier, an interrupted/429-stormed
build resumes cleanly instead of re-pausing at the top tier.
@tzone85 tzone85 merged commit 987092c into main Jun 24, 2026
5 checks passed
@tzone85 tzone85 deleted the vxd-transient-recovery branch June 24, 2026 17:27
tzone85 added a commit that referenced this pull request Jun 25, 2026
…ergeable file never thrashes a story (#106)

Root cause of clipforge/pulsereview hanging for hours: when the LLM returned
commentary instead of merged content for package.json/vitest.config.ts, the
resolver aborted the whole rebase, and the story thrashed through every
escalation tier indefinitely ('tech lead returned commentary, not file content').

Two fixes in internal/engine/conflict_resolver.go (+ internal/git/conflict.go):
- Structural JSON merge for package.json/tsconfig*.json/jsconfig.json/etc:
  pulls both index sides (git.ConflictSides :2:/:3:) and deep-unions the objects
  (deepMergeJSON) so BOTH sides' deps/scripts/compilerOptions survive — fully
  deterministic, no LLM. Lock files excluded; invalid JSON falls through to LLM.
- Deterministic --theirs fallback (git.CheckoutTheirs) when the LLM genuinely
  cannot merge any file — surfaced as the new errUnmergeable sentinel (commentary
  or leftover markers). The story-branch version is kept and the rebase continues;
  the pre-merge QA gate + integration build validate it. API/transport errors
  (fatal, capacity, transient) are NOT errUnmergeable, so they still abort/pause
  for a clean retry/resume — never silently take a side under an outage.

Also restores the review/qa orphan-recovery regression test (#103 shipped without
one): resume_orphan_recovery_test.go.

Tests: conflict_json_merge_test.go (deep-union, theirs-wins, invalid-JSON),
conflict_sides_test.go (rebase ours=base/theirs=story semantics, CheckoutTheirs),
conflict_fallback_test.go (commentary→fallback succeeds; generic LLM error still
aborts), resume_orphan_recovery_test.go. go build/vet/lint(0)/test all green.

Co-authored-by: Thando Mini <tzone85@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant