feat: restore checkpoint after any AI response turn #2356
Draft
Basit-Balogun10 wants to merge 11 commits into
Force-pushed from 7dc1907 to 021fe43.
- Add checkpoint tRPC router with restore procedure that reverts git state, truncates session JSONL to the restore point, and deletes orphaned checkpoint refs for abandoned future turns
- Track lastCheckpointId per turn in buildConversationItems so each completed agent turn knows its git ref
- Show per-turn restore button in AgentMessage (disabled with tooltip when no checkpoint exists for that turn)
- Add CheckpointTimelineModal (mod+shift+h): command-palette-style list of all checkpoints in the session, newest first, with user message snippet and relative timestamp; shortcut is user-remappable via keybindings store
- Add RestoreCheckpointDialog with confirmation warning before reverting
- Add useRestoreCheckpoint hook to wire restore flow end-to-end
- Register checkpoint-timeline as a configurable shortcut

Closes PostHog#2328
The cloud path (agent-server.ts) guards checkpoint capture on posthogAPI being configured, so local tasks never emit _posthog/git_checkpoint. Hook into extNotification in the local AgentService: on TURN_COMPLETE, run CaptureCheckpointSaga, then emit a synthetic _posthog/git_checkpoint ACP message to the renderer and append it to the session JSONL so it survives reload. The renderer's buildConversationItems already handles the notification correctly — it just wasn't arriving. Add console logs in buildConversationItems and structured logs in service.ts for visibility during debugging.
Force-pushed from 158a49c to 1af21bd.
extNotification is not reliably called by the ACP SDK for _posthog/ notifications in the local path. The raw stream tap (onAcpMessage) is guaranteed to fire for every ndjson frame — move the TURN_COMPLETE checkpoint hook there instead.
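A minimal sketch of the raw-stream tap described above, assuming each ndjson frame is one JSON object per line. The AcpFrame shape, the TURN_COMPLETE_METHOD string, and the captureCheckpoint/emit callbacks are hypothetical stand-ins (the real onAcpMessage signature and CaptureCheckpointSaga are not shown in this PR text); only the _posthog/git_checkpoint method name comes from the commit messages. Capture is synchronous here for brevity; the real capture is presumably async.

```typescript
// Hypothetical frame shape: one JSON object per ndjson line.
interface AcpFrame {
  method?: string;
  params?: Record<string, unknown>;
}

// Assumed method name for the turn-complete notification; the real value may differ.
const TURN_COMPLETE_METHOD = "_posthog/turn_complete";

// Parse one raw line; return null for anything that is not a JSON object.
function parseFrame(line: string): AcpFrame | null {
  try {
    const parsed: unknown = JSON.parse(line);
    return typeof parsed === "object" && parsed !== null ? (parsed as AcpFrame) : null;
  } catch {
    return null;
  }
}

// Inspect every raw frame; on turn completion, capture a checkpoint and
// emit a synthetic _posthog/git_checkpoint notification.
function tapRawFrame(
  line: string,
  captureCheckpoint: () => string, // stand-in for CaptureCheckpointSaga; returns a checkpoint id
  emit: (frame: AcpFrame) => void, // stand-in for "send to renderer and append to the session log"
): void {
  const frame = parseFrame(line);
  if (frame === null || frame.method !== TURN_COMPLETE_METHOD) return;
  const checkpointId = captureCheckpoint();
  emit({ method: "_posthog/git_checkpoint", params: { checkpointId } });
}
```

Because the tap sees every frame, the checkpoint hook no longer depends on whether the SDK chooses to dispatch a given notification to extNotification.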
Force-pushed from 1af21bd to 0a73dc5.
Suppress codex-acp loadSession replay at the SDK layer so resumed sessions don't re-persist history to logs.ndjson/S3, and truncate the on-disk rollout + local logs to the checkpoint boundary so the agent remembers only up to the restored point.
- Add suppressReplay gate on CodexSessionState; codex-client drops session/update events while a loadSession replay is in flight
- Toggle the flag around loadSession / resumeSession / refreshSession
- Add truncateCodexRollout with writeFileWithRetry to survive the Windows file-lock race after cancelSession
- Truncate local logs at the prompt boundary; return restoredSessionId from the restore mutation and reconnect with it
- Remove the redundant renderer suppressReplayEvents flag
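The suppressReplay gate can be pictured roughly as follows. This is a sketch under stated assumptions, not the PR's code: ReplayState, ReplayEvent, forwardEvent, and withReplaySuppressed are hypothetical names, and the load callback is synchronous to keep the example short (an async loadSession would be awaited inside the same try block).

```typescript
// Hypothetical minimal state; the PR's CodexSessionState carries more fields.
interface ReplayState {
  suppressReplay: boolean;
}

interface ReplayEvent {
  method: string;
  payload?: unknown;
}

// Forward an event upstream unless it is a session/update arriving while a
// replay is in flight. Returns true when the event was forwarded.
function forwardEvent(
  state: ReplayState,
  event: ReplayEvent,
  upstream: (e: ReplayEvent) => void,
): boolean {
  if (state.suppressReplay && event.method === "session/update") return false;
  upstream(event);
  return true;
}

// Set the flag for the duration of a load and always clear it afterwards,
// including when the load throws.
function withReplaySuppressed<T>(state: ReplayState, load: () => T): T {
  state.suppressReplay = true;
  try {
    return load();
  } finally {
    state.suppressReplay = false;
  }
}
```

Clearing the flag in finally matters: if it stayed set after a failed load, live session/update events would be dropped for the rest of the session.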
…acing)
- Reject overlapping restores for the same session with an in-flight lock in the checkpoint router (prevents two restores racing to truncate logs/rollout)
- Surface truncation failures: restore mutation returns truncationFailed and the renderer warns the user that agent memory may extend past the checkpoint
- truncateLocalLogsAtPromptBoundary now returns whether the boundary was found, so a missing anchor is reported instead of silently leaving stale turns
- Block checkpoint restore for cloud sessions with a clear message instead of failing silently after the destructive revert
- Show a Current badge (no Restore button) on the most-recent checkpoint and disable timeline restore buttons while a restore is in progress
- Note in the confirm dialog when restoring will stop an in-progress response
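One way to express the in-flight lock from the first bullet, as a sketch only. The names restoresInFlight, tryAcquireRestoreLock, releaseRestoreLock, and restoreExclusively are hypothetical; the actual router may structure this differently.

```typescript
// Session ids that currently have a restore in flight.
const restoresInFlight = new Set<string>();

// Returns false when a restore is already running for this session.
function tryAcquireRestoreLock(sessionId: string): boolean {
  if (restoresInFlight.has(sessionId)) return false;
  restoresInFlight.add(sessionId);
  return true;
}

function releaseRestoreLock(sessionId: string): void {
  restoresInFlight.delete(sessionId);
}

// Run a restore under the lock; the lock is released on success and on failure.
async function restoreExclusively<T>(
  sessionId: string,
  restore: () => Promise<T>,
): Promise<T> {
  if (!tryAcquireRestoreLock(sessionId)) {
    throw new Error(`A restore is already in progress for session ${sessionId}`);
  }
  try {
    return await restore();
  } finally {
    releaseRestoreLock(sessionId);
  }
}
```

The check and the add happen in the same synchronous step, so two calls issued back to back cannot both pass the check. The lock is per session id, so restores for different sessions do not block each other.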
…cy, parity)
Adds automated coverage for the checkpoint-restore fix and hardening:
- rollout.test.ts: truncateCodexRollout trims to the first/middle turn, leaves the file intact when no complete turns exist, and reports not-found
- rollout.retry.test.ts: writeFileWithRetry retries on EPERM/EBUSY/EACCES and succeeds once the Windows file lock releases, fails fast on other errors
- codex-client suppressReplay: replayed session/update events are dropped (no upstream forward, no re-fired structured-output) until the flag clears
- codex-agent: suppressReplay is set during loadSession and cleared after, including when loadSession throws
- local-logs: truncateLocalLogsAtPromptBoundary returns found/not-found and trims to the prompt boundary
- checkpoint router: concurrent restores for the same session are rejected and the lock is released after success and after a failed revert
- jsonl-hydration: Claude force-refetches the truncated S3 log on restore and rewrites only the surviving turns (memory parity, no replay)
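The retry behavior exercised by rollout.retry.test.ts could look something like the sketch below. It is an illustration, not the PR's writeFileWithRetry: the function names, the attempt count, and the delay are assumptions, and the write itself is injected as a callback so the example does not depend on a particular fs call.

```typescript
// Error codes treated as transient, for example a file lock that has not yet released.
const RETRYABLE_CODES = new Set(["EPERM", "EBUSY", "EACCES"]);

// True when the thrown value carries one of the transient error codes.
function isRetryableFsError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return typeof code === "string" && RETRYABLE_CODES.has(code);
}

// Hypothetical helper: retry an operation while it fails with a transient code.
// attempts and delayMs are illustrative defaults, not values from the PR.
async function retryOnFileLock<T>(
  op: () => Promise<T>,
  attempts = 5,
  delayMs = 50,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      // Fail fast on non-transient errors or once attempts are exhausted.
      if (!isRetryableFsError(err) || attempt >= attempts) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A caller would wrap the actual write, for example `retryOnFileLock(() => writeFile(path, data))` with writeFile from node:fs/promises.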
…tton
- Add enableOnFormTags/enableOnContentEditable to the checkpoint-timeline hotkey so Ctrl/Cmd+Shift+H opens the timeline even when the message editor has focus (previously suppressed by react-hotkeys-hook, the common case)
- Render the per-turn restore button always, disabled with an explanatory tooltip when the turn has no checkpoint (or a cloud-task tooltip for cloud sessions) instead of hiding it
The timeline modal (Ctrl/Cmd+Shift+H) is dropped — the per-turn restore button on each user message is the restore entry point. Removes the component, its keyboard shortcut wiring (SHORTCUTS / KEYBOARD_SHORTCUTS / CONFIGURABLE_SHORTCUT_IDS / DEFAULT_KEYBINDINGS), and the ConversationView state, hotkey, and render. Per-turn restore, the confirm dialog, and the underlying restore/truncation flow are unchanged.
…estart
captureLocalCheckpoint stored each checkpoint only in the in-memory
sessionCheckpoints map, the agent JSONL, and S3. On a cold start the
in-memory map is gone and fetchSessionLogs reads the local logs.ndjson
cache first (before S3), so the checkpoint notifications were missing and
every restore icon showed disabled ("No checkpoint was captured for this turn").
Append the checkpoint notification to logs.ndjson at capture time (matching
the SessionLogWriter tap's line-by-line append), so cold loads after a
restart find it. S3 sync is unchanged.
Also adds buildConversationItems tests for checkpoint-to-turn association and
updates three user_message expectations for the turnContext field.
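The checkpoint-to-turn association those tests cover can be sketched as below. The TimelineEvent and Turn shapes and the buildTurns name are hypothetical simplifications; the real buildConversationItems handles many more event types.

```typescript
// Hypothetical, simplified event shapes; real session events carry more fields.
type TimelineEvent =
  | { kind: "user_message"; text: string }
  | { kind: "agent_message"; text: string }
  | { kind: "notification"; method: string; checkpointId?: string };

interface Turn {
  userMessage: string;
  agentMessages: string[];
  lastCheckpointId: string | null;
}

// Group events into turns. Each user message starts a new turn with no
// checkpoint; a _posthog/git_checkpoint notification sets it for that turn.
function buildTurns(events: TimelineEvent[]): Turn[] {
  const turns: Turn[] = [];
  let current: Turn | null = null;
  for (const event of events) {
    if (event.kind === "user_message") {
      current = { userMessage: event.text, agentMessages: [], lastCheckpointId: null };
      turns.push(current);
    } else if (current === null) {
      continue; // ignore events that arrive before the first user message
    } else if (event.kind === "agent_message") {
      current.agentMessages.push(event.text);
    } else if (event.method === "_posthog/git_checkpoint" && event.checkpointId !== undefined) {
      current.lastCheckpointId = event.checkpointId;
    }
  }
  return turns;
}
```

If the checkpoint notification is missing from the loaded log, as in the cold-start case described above, lastCheckpointId stays null and the turn's restore icon is disabled.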
- Restore trims logs.ndjson at the restored turn's prompt-response line, which dropped that turn's git_checkpoint notification (appended after the response on TURN_COMPLETE) and left it with a disabled restore icon after an app restart. truncateLocalLogsAtPromptBoundary now accepts preserveTrailingEntries and the checkpoint router re-adds the restored checkpoint atomically, so the restored turn stays restorable.
- RestoreCheckpointDialog: drop the amber title/warning icons and the amber callout background, remove the icon on the red Restore button, and equalize the Cancel/Restore buttons.
Adds local-logs tests for preserved trailing entries.
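The truncate-then-preserve behavior can be illustrated on an in-memory list of log lines. This is a sketch, not the PR's truncateLocalLogsAtPromptBoundary: the isBoundary predicate is a hypothetical stand-in for however the real code locates the prompt-response line, and the real function operates on the logs.ndjson file.

```typescript
interface TruncateResult {
  found: boolean;
  lines: string[];
}

// Keep everything up to and including the first line matching isBoundary,
// then re-append preserveTrailingEntries. When no boundary is found, the
// lines are returned unchanged and found is false.
function truncateAtBoundary(
  lines: string[],
  isBoundary: (line: string) => boolean,
  preserveTrailingEntries: string[] = [],
): TruncateResult {
  const index = lines.findIndex(isBoundary);
  if (index === -1) return { found: false, lines: [...lines] };
  return {
    found: true,
    lines: [...lines.slice(0, index + 1), ...preserveTrailingEntries],
  };
}
```

Building the truncated log and the preserved entries in one result lets the caller write both in a single file write, which is one way to get the atomic re-add the commit describes.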
…anup
The local-cache fix kept the restored turn's checkpoint in logs.ndjson, but S3 still had it trimmed. If the old-session cleanup later deletes the local cache, the cold load falls back to S3 and the restored turn would lose its icon again. After the S3 truncate, re-append the restored checkpoint to S3 (gated on the truncate actually having removed it, to avoid duplicates), reusing the entry built for the local preserve.
Problem
After an AI turn completes, there's no way to roll back to the git state captured at the end of that turn. If the agent goes down a wrong path you have to manually reset the branch or throw away work — there's no checkpoint-based undo.
Tracked in #724 / #2328.
Changes
Core restore flow
- checkpoint tRPC router (routers/checkpoint.ts) with a restore procedure that:
  - Runs RevertCheckpointSaga to reset the worktree to the saved git state
  - Truncates the session .jsonl to the restore point so the session replays correctly
- getSessionInfo(taskRunId) added to AgentService to expose { sessionId, repoPath } without leaking internal types
Per-turn restore button
- buildConversationItems tracks lastCheckpointId per turn: set when a _posthog/git_checkpoint notification is seen, cleared on each new turn start
Restore confirmation
- RestoreCheckpointDialog: confirmation dialog with an amber warning before the restore runs, so accidental clicks don't immediately drop work
- useRestoreCheckpoint hook wires the full flow: opens the confirmation dialog → calls the tRPC mutation → truncates the in-memory events via sessionStoreSetters.truncateEventsToCheckpoint → shows a success/error toast
How did you test this?
- pnpm typecheck passes across all packages (exit 0); biome clean via the pre-commit hook
- … getSessionJsonlPath path-separator assertions (they assert forward-slash paths; path.join emits backslashes on Windows) which pass on Linux CI and are unrelated to this change
Scope: local-only (cloud is intentionally gated)
Checkpoint restore is available for local tasks only. This is by design, not an oversight.
Why checkpoints only exist for local turns. Checkpoints are git commits captured against the local worktree by the local agent after each turn (captureLocalCheckpoint → CaptureCheckpointSaga). A cloud task runs on remote infrastructure with no local agent and no local turn-completion hook, so it emits no _posthog/git_checkpoint events. The per-turn restore button is therefore rendered disabled with a "not available for cloud tasks" tooltip for cloud sessions (and disabled with a "no checkpoint for this turn" tooltip for any local turn without one).
Why restore cannot run for a cloud session. Restore performs three local-machine operations: (1) RevertCheckpointSaga rewinds files in the local worktree, (2) cancelSession SIGTERMs and reconnects the local agent subprocess, (3) truncateCodexRollout rewrites the on-disk rollout. None apply to a remote run whose agent and worktree live server-side. The S3 truncate_log call is the only cloud-capable piece, and on its own it would truncate the displayed history while the remote agent kept full memory and files stayed put, which is worse than a no-op. useRestoreCheckpoint therefore blocks isCloud sessions with a clear message before any destructive revert.
Handoff behavior. For a task that spans local and cloud (either direction), only turns that executed locally have checkpoints. Turns that ran in the cloud show no restore icon; local turns remain restorable. If a task with local checkpoints is currently running in the cloud, the click-time guard prevents a partial restore.
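The gating rules above reduce to a small decision function. The sketch below is illustrative only: restoreAvailability and its return shape are hypothetical, and the cloud tooltip wording is an assumption rather than the string shipped in the PR.

```typescript
interface RestoreAvailability {
  enabled: boolean;
  tooltip: string | null;
}

// Decide whether the per-turn restore button is usable. The cloud check comes
// first so a cloud session is blocked even when a turn has a local checkpoint.
function restoreAvailability(
  isCloud: boolean,
  checkpointId: string | null,
): RestoreAvailability {
  if (isCloud) {
    // Assumed wording for the cloud-task tooltip.
    return { enabled: false, tooltip: "Checkpoint restore is not available for cloud tasks" };
  }
  if (checkpointId === null) {
    return { enabled: false, tooltip: "No checkpoint was captured for this turn" };
  }
  return { enabled: true, tooltip: null };
}
```

Checking isCloud before the checkpoint id mirrors the handoff case: a task with local checkpoints that is currently running in the cloud stays blocked.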
What full cloud support would take (future, out of scope). A backend feature, not a client flag: (1) server-side per-turn checkpoint capture that writes the checkpoint_id → prompt_id mapping into the run log so the client timeline can render restore points, (2) a server-side restore endpoint that reverts the remote worktree + truncates the remote agent memory + resumes atomically, (3) client rewiring to call that endpoint for cloud sessions, and (4) a product decision for runs with external side effects already pushed (commits/PRs). This touches the cloud runner + API and is tracked separately.
Hardening + tests added since initial review
- suppressReplay gate drops the rollout re-stream during loadSession so resumed sessions do not re-persist history to logs.ndjson/S3.
- The restore mutation returns truncationFailed; the renderer warns when agent memory may extend past the checkpoint instead of failing silently.
- Tests cover suppressReplay (client + agent wiring), local-log prompt-boundary truncation, checkpoint-router concurrency lock, and Claude JSONL hydration parity (force-refetch, no replay).