fix: complete transient-failure recovery (tier-column reset + review/qa-stranded recovery)#103
Merged
Merged
Conversation
…recover review/qa-stranded stories) - STORY_RESET now zeroes the cached escalation_tier column (not just status), so the dispatcher routes a retried story fresh instead of warning 'tier 3 reached routeStory'. The event-sourced CurrentTier was already reset-aware. - resume recovers stories stranded in 'review'/'qa' (agent finished, monitor killed before review->QA->merge) — previously only 'in_progress' orphans were recovered, so a session-limit pause could strand a story forever, blocking every dependent story. Together with vxd retry + reset-aware CurrentTier, an interrupted/429-stormed build resumes cleanly instead of re-pausing at the top tier.
5 tasks
tzone85
added a commit
that referenced
this pull request
Jun 25, 2026
…ergeable file never thrashes a story (#106) Root cause of clipforge/pulsereview hanging for hours: when the LLM returned commentary instead of merged content for package.json/vitest.config.ts, the resolver aborted the whole rebase, and the story thrashed through every escalation tier indefinitely ('tech lead returned commentary, not file content'). Two fixes in internal/engine/conflict_resolver.go (+ internal/git/conflict.go): - Structural JSON merge for package.json/tsconfig*.json/jsconfig.json/etc: pulls both index sides (git.ConflictSides :2:/:3:) and deep-unions the objects (deepMergeJSON) so BOTH sides' deps/scripts/compilerOptions survive — fully deterministic, no LLM. Lock files excluded; invalid JSON falls through to LLM. - Deterministic --theirs fallback (git.CheckoutTheirs) when the LLM genuinely cannot merge any file — surfaced as the new errUnmergeable sentinel (commentary or leftover markers). The story-branch version is kept and the rebase continues; the pre-merge QA gate + integration build validate it. API/transport errors (fatal, capacity, transient) are NOT errUnmergeable, so they still abort/pause for a clean retry/resume — never silently take a side under an outage. Also restores the review/qa orphan-recovery regression test (#103 shipped without one): resume_orphan_recovery_test.go. Tests: conflict_json_merge_test.go (deep-union, theirs-wins, invalid-JSON), conflict_sides_test.go (rebase ours=base/theirs=story semantics, CheckoutTheirs), conflict_fallback_test.go (commentary→fallback succeeds; generic LLM error still aborts), resume_orphan_recovery_test.go. go build/vet/lint(0)/test all green. Co-authored-by: Thando Mini <tzone85@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Completes the transient-failure recovery story (Bug #4) so a build that paused because stories exhausted their escalation tiers for a transient reason (Claude session-limit / 429 storm, since-fixed base breakage) can be cleanly retried instead of instantly re-pausing at the top tier.
STORY_RESETzeroes the cachedescalation_tiercolumn (not just status). The event-sourcedCurrentTierwas already reset-aware, but the dispatcher reads the denormalized column and warnedstory … at tier 3 reached routeStory, expected monitor interception. Now both agree.review/qa-stranded stories, not justin_progressorphans. When an agent finishes (STORY_COMPLETED→review) but the monitor is killed before review→QA→merge (e.g. a session-limit pause), the story was stranded forever, blocking every dependent story.Pairs with the
vxd retrycommand + reset-awareCurrentTieralready on main.Test plan
go test ./... -count=1green;go vetclean.internal/cli/resume_orphan_recovery_test.gocovers the status predicate.🤖 Generated with Claude Code