
feat: restore checkpoint after any AI response turn #2356

Draft
Basit-Balogun10 wants to merge 11 commits into PostHog:main from Basit-Balogun10:claude/gracious-hawking-38aade

Basit-Balogun10 (Contributor) commented May 25, 2026

Note: This PR is stacked on #2321. The diff shows +2k lines because GitHub uses main as the base; cross-fork branches can't be set as PR bases (I tried tracking the stack with Graphite too, same limitation). The actual changes in this PR are a few commits on top of #2321. Please review with that context, or wait for #2321 to be merged first.

Problem

After an AI turn completes, there's no way to roll back to the git state captured at the end of that turn. If the agent goes down a wrong path you have to manually reset the branch or throw away work — there's no checkpoint-based undo.

Tracked in #724 / #2328.

Changes

Core restore flow

  • New checkpoint tRPC router (routers/checkpoint.ts) with a restore procedure that:
    1. Runs RevertCheckpointSaga to reset the worktree to the saved git state
    2. Truncates the session .jsonl to the restore point so the session replays correctly
    3. Collects any orphaned checkpoint refs from the discarded lines and deletes them — no accumulation of stale refs from abandoned future turns
  • getSessionInfo(taskRunId) added to AgentService to expose { sessionId, repoPath } without leaking internal types
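
As a rough illustration of steps 2 and 3 above, the sketch below truncates a list of raw JSONL lines at the restored checkpoint and collects the checkpoint IDs that only appeared in the discarded tail. The line shape (`method`, `params.checkpointId`), the helper names, and the choice to keep the restored checkpoint's own line are assumptions made for the example, not the schema or code used in routers/checkpoint.ts.

```typescript
// Hypothetical shape of one parsed session line; the real schema is not shown in this PR.
interface SessionLine {
  method?: string;
  params?: { checkpointId?: string };
}

const CHECKPOINT_METHOD = "_posthog/git_checkpoint";

interface TruncateResult {
  kept: string[]; // raw lines up to and including the restored checkpoint
  orphanedCheckpointIds: string[]; // checkpoints found only in the discarded lines
  found: boolean; // false when the restore point is not in the file
}

// Returns the checkpoint ID carried by a line, or undefined for any other line.
function checkpointIdOf(rawLine: string): string | undefined {
  try {
    const parsed = JSON.parse(rawLine) as SessionLine;
    return parsed.method === CHECKPOINT_METHOD ? parsed.params?.checkpointId : undefined;
  } catch {
    return undefined; // tolerate partial or non-JSON lines
  }
}

function truncateAtCheckpoint(rawLines: string[], restoreId: string): TruncateResult {
  const cut = rawLines.findIndex((line) => checkpointIdOf(line) === restoreId);
  if (cut === -1) {
    // Leave the file untouched when the restore point cannot be located.
    return { kept: rawLines, orphanedCheckpointIds: [], found: false };
  }
  const kept = rawLines.slice(0, cut + 1);
  const orphanedCheckpointIds = rawLines
    .slice(cut + 1)
    .map(checkpointIdOf)
    .filter((id): id is string => id !== undefined);
  return { kept, orphanedCheckpointIds, found: true };
}
```

The caller would then write `kept` back to disk and delete the git refs for `orphanedCheckpointIds`; both of those side effects are left out here so the function stays pure and easy to test.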

Per-turn restore button

  • buildConversationItems tracks lastCheckpointId per turn — set when a _posthog/git_checkpoint notification is seen, cleared on each new turn start
  • Each user message shows a restore button (visible on hover): enabled when that turn has a checkpoint, disabled with an explanatory tooltip when it doesn't, and disabled with a "not available for cloud tasks" tooltip for cloud sessions
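
The per-turn tracking can be pictured with a small reducer over session events. This is a simplified sketch: the `SessionEvent` and `Turn` types and the `associateCheckpoints` name are hypothetical stand-ins, and the real buildConversationItems handles many more event kinds.

```typescript
// Hypothetical, simplified event shapes; the real ACP events carry more fields.
type SessionEvent =
  | { kind: "user_message"; text: string }
  | { kind: "notification"; method: string; checkpointId?: string };

interface Turn {
  userMessage: string;
  checkpointId: string | null; // null means the restore button is disabled
}

function associateCheckpoints(events: SessionEvent[]): Turn[] {
  const turns: Turn[] = [];
  let current: Turn | null = null;
  for (const event of events) {
    if (event.kind === "user_message") {
      // A new turn starts with no checkpoint, which clears the previous value.
      current = { userMessage: event.text, checkpointId: null };
      turns.push(current);
    } else if (
      event.method === "_posthog/git_checkpoint" &&
      current !== null &&
      event.checkpointId !== undefined
    ) {
      // The most recent checkpoint notification seen in a turn wins.
      current.checkpointId = event.checkpointId;
    }
  }
  return turns;
}
```

Checkpoint notifications that arrive before any user message are ignored in this sketch, since there is no turn to attach them to.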

Restore confirmation

  • RestoreCheckpointDialog — confirmation dialog with an amber warning before the restore runs, so accidental clicks don't immediately drop work
  • useRestoreCheckpoint hook wires the full flow: opens the confirmation dialog → calls the tRPC mutation → truncates the in-memory events via sessionStoreSetters.truncateEventsToCheckpoint → shows a success/error toast

How did you test this?

  • Full pnpm typecheck passes across all packages (exit 0); biome clean via the pre-commit hook
  • Automated tests (vitest) for the restore path — see the "Hardening + tests" section below for the breakdown. All green except 3 pre-existing, Windows-only getSessionJsonlPath path-separator assertions (they assert forward-slash paths; path.join emits backslashes on Windows) which pass on Linux CI and are unrelated to this change
  • Manual end-to-end (local Codex session): restored to a prior turn, reloaded the page — no duplicated history — and confirmed the resumed agent's memory stops at the checkpoint (asked it to recall later turns; it had none)
  • Verified the cloud guard: restore is blocked with a message for cloud sessions, and the per-turn button / timeline reflect checkpoint availability

Scope: local-only (cloud is intentionally gated)

Checkpoint restore is available for local tasks only. This is by design, not an oversight.

Why checkpoints only exist for local turns. Checkpoints are git commits captured against the local worktree by the local agent after each turn (captureLocalCheckpoint → CaptureCheckpointSaga). A cloud task runs on remote infrastructure with no local agent and no local turn-completion hook, so it emits no _posthog/git_checkpoint events. The per-turn restore button is therefore rendered disabled with a "not available for cloud tasks" tooltip for cloud sessions (and disabled with a "no checkpoint for this turn" tooltip for any local turn without one).

Why restore cannot run for a cloud session. Restore performs three local-machine operations: (1) RevertCheckpointSaga rewinds files in the local worktree, (2) cancelSession SIGTERMs and reconnects the local agent subprocess, (3) truncateCodexRollout rewrites the on-disk rollout. None apply to a remote run whose agent and worktree live server-side. The S3 truncate_log call is the only cloud-capable piece, and on its own it would truncate the displayed history while the remote agent kept full memory and files stayed put — worse than a no-op. useRestoreCheckpoint therefore blocks isCloud sessions with a clear message before any destructive revert.

Handoff behavior. For a task that spans local and cloud (either direction), only turns that executed locally have checkpoints. Turns that ran in the cloud show no restore icon; local turns remain restorable. If a task with local checkpoints is currently running in the cloud, the click-time guard prevents a partial restore.
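
The button states and the click-time guard described above reduce to one decision: cloud sessions are rejected first, then turns without a checkpoint. A minimal sketch, assuming a hypothetical `restoreAvailability` helper (the hook's actual internals are not shown in this description):

```typescript
// Hypothetical result type; a disabled state always carries its tooltip text.
type RestoreAvailability =
  | { enabled: true; checkpointId: string }
  | { enabled: false; tooltip: string };

function restoreAvailability(
  isCloud: boolean,
  checkpointId: string | null,
): RestoreAvailability {
  // Cloud is checked first so a task that has local checkpoints but is
  // currently running in the cloud can never start a partial restore.
  if (isCloud) {
    return { enabled: false, tooltip: "Not available for cloud tasks" };
  }
  if (checkpointId === null) {
    return { enabled: false, tooltip: "No checkpoint was captured for this turn" };
  }
  return { enabled: true, checkpointId };
}
```

Using the same function for rendering and for the click handler would keep the tooltip and the guard from drifting apart, though whether the PR shares the logic this way is not stated.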

What full cloud support would take (future, out of scope). A backend feature, not a client flag: (1) server-side per-turn checkpoint capture that writes the checkpoint_id → prompt_id mapping into the run log so the client timeline can render restore points, (2) a server-side restore endpoint that reverts the remote worktree + truncates the remote agent memory + resumes atomically, (3) client rewiring to call that endpoint for cloud sessions, and (4) a product decision for runs with external side effects already pushed (commits/PRs). This touches the cloud runner + API and is tracked separately.

Hardening + tests added since initial review

  • Duplication fix (Codex): suppressReplay gate drops the rollout re-stream during loadSession so resumed sessions do not re-persist history to logs.ndjson/S3.
  • Memory truncation: on-disk codex rollout + local logs are trimmed to the checkpoint boundary; a Windows file-lock race on the rollout write is handled with bounded retry.
  • Concurrency: overlapping restores for the same session are rejected with an in-flight lock (prevents two restores racing to truncate at different offsets).
  • Failure surfacing: the restore mutation returns truncationFailed; the renderer warns when agent memory may extend past the checkpoint instead of failing silently.
  • Edge cases: "Current" badge on the most-recent checkpoint (no-op restore), a note when restoring will stop an in-progress response, and the cloud-session rejection message above.
  • Automated tests: rollout truncation + lock retry, codex suppressReplay (client + agent wiring), local-log prompt-boundary truncation, checkpoint-router concurrency lock, and Claude JSONL hydration parity (force-refetch, no replay).

Basit-Balogun10 force-pushed the claude/gracious-hawking-38aade branch from 7dc1907 to 021fe43 on May 26, 2026 15:35
- Add checkpoint tRPC router with restore procedure that reverts git state,
  truncates session JSONL to the restore point, and deletes orphaned
  checkpoint refs for abandoned future turns
- Track lastCheckpointId per turn in buildConversationItems so each
  completed agent turn knows its git ref
- Show per-turn restore button in AgentMessage (disabled with tooltip when
  no checkpoint exists for that turn)
- Add CheckpointTimelineModal (mod+shift+h) — command-palette-style list of
  all checkpoints in the session, newest first, with user message snippet
  and relative timestamp; shortcut is user-remappable via keybindings store
- Add RestoreCheckpointDialog with confirmation warning before reverting
- Add useRestoreCheckpoint hook to wire restore flow end-to-end
- Register checkpoint-timeline as a configurable shortcut

Closes PostHog#2328
The cloud path (agent-server.ts) guards checkpoint capture on posthogAPI
being configured, so local tasks never emit _posthog/git_checkpoint.

Hook into extNotification in the local AgentService: on TURN_COMPLETE, run
CaptureCheckpointSaga, then emit a synthetic _posthog/git_checkpoint ACP
message to the renderer and append it to the session JSONL so it survives
reload. The renderer's buildConversationItems already handles the
notification correctly — it just wasn't arriving.

Add console logs in buildConversationItems and structured logs in service.ts
for visibility during debugging.
Basit-Balogun10 force-pushed the claude/gracious-hawking-38aade branch 2 times, most recently from 158a49c to 1af21bd on June 2, 2026 11:09
extNotification is not reliably called by the ACP SDK for _posthog/
notifications in the local path. The raw stream tap (onAcpMessage)
is guaranteed to fire for every ndjson frame — move the TURN_COMPLETE
checkpoint hook there instead.
Basit-Balogun10 force-pushed the claude/gracious-hawking-38aade branch from 1af21bd to 0a73dc5 on June 2, 2026 15:07
Suppress codex-acp loadSession replay at the SDK layer so resumed
sessions don't re-persist history to logs.ndjson/S3, and truncate the
on-disk rollout + local logs to the checkpoint boundary so the agent
remembers only up to the restored point.

- Add suppressReplay gate on CodexSessionState; codex-client drops
  session/update events while a loadSession replay is in flight
- Toggle the flag around loadSession / resumeSession / refreshSession
- Add truncateCodexRollout with writeFileWithRetry to survive the
  Windows file-lock race after cancelSession
- Truncate local logs at the prompt boundary; return restoredSessionId
  from the restore mutation and reconnect with it
- Remove the redundant renderer suppressReplayEvents flag
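
A minimal sketch of the suppressReplay gate from the list above, assuming a hypothetical `ReplayGate` class in place of the real codex-client and CodexSessionState wiring. The flag is cleared in a `finally` block so a throwing load cannot leave the session permanently muted.

```typescript
// Hypothetical minimal session state; only the flag named in this commit is modeled.
interface CodexSessionState {
  suppressReplay: boolean;
}

type SessionUpdate = { sessionId: string; payload: unknown };

class ReplayGate {
  readonly state: CodexSessionState = { suppressReplay: false };
  private readonly forward: (update: SessionUpdate) => void;

  constructor(forward: (update: SessionUpdate) => void) {
    this.forward = forward;
  }

  // Returns true when the update was forwarded, false when it was dropped.
  onSessionUpdate(update: SessionUpdate): boolean {
    if (this.state.suppressReplay) return false;
    this.forward(update);
    return true;
  }

  // Wraps a load/resume/refresh call so replayed updates are dropped while it runs.
  async withReplaySuppressed<T>(load: () => Promise<T>): Promise<T> {
    this.state.suppressReplay = true;
    try {
      return await load();
    } finally {
      this.state.suppressReplay = false;
    }
  }
}
```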
…acing)

- Reject overlapping restores for the same session with an in-flight lock in
  the checkpoint router (prevents two restores racing to truncate logs/rollout)
- Surface truncation failures: restore mutation returns truncationFailed and
  the renderer warns the user that agent memory may extend past the checkpoint
- truncateLocalLogsAtPromptBoundary now returns whether the boundary was found,
  so a missing anchor is reported instead of silently leaving stale turns
- Block checkpoint restore for cloud sessions with a clear message instead of
  failing silently after the destructive revert
- Show a Current badge (no Restore button) on the most-recent checkpoint and
  disable timeline restore buttons while a restore is in progress
- Note in the confirm dialog when restoring will stop an in-progress response
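
The in-flight lock from the first bullet above could look roughly like the following. The `inFlightRestores` set and `withRestoreLock` wrapper are illustrative names, and the error message is invented for the example; only the behavior (reject overlap, release on success and on failure) comes from this PR.

```typescript
// Session IDs that currently have a restore running (process-local, in memory).
const inFlightRestores = new Set<string>();

async function withRestoreLock<T>(
  sessionId: string,
  restore: () => Promise<T>,
): Promise<T> {
  if (inFlightRestores.has(sessionId)) {
    // Reject rather than queue: a second restore would truncate at a different offset.
    throw new Error(`A restore is already in progress for session ${sessionId}`);
  }
  inFlightRestores.add(sessionId);
  try {
    return await restore();
  } finally {
    // Released on success and on failure, so a failed revert does not wedge the session.
    inFlightRestores.delete(sessionId);
  }
}
```

Rejecting instead of queueing keeps the semantics simple: the user sees an error and can retry once the first restore has settled.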
…cy, parity)

Adds automated coverage for the checkpoint-restore fix and hardening:

- rollout.test.ts: truncateCodexRollout trims to the first/middle turn, leaves
  the file intact when no complete turns exist, and reports not-found
- rollout.retry.test.ts: writeFileWithRetry retries on EPERM/EBUSY/EACCES and
  succeeds once the Windows file lock releases, fails fast on other errors
- codex-client suppressReplay: replayed session/update events are dropped (no
  upstream forward, no re-fired structured-output) until the flag clears
- codex-agent: suppressReplay is set during loadSession and cleared after,
  including when loadSession throws
- local-logs: truncateLocalLogsAtPromptBoundary returns found/not-found and
  trims to the prompt boundary
- checkpoint router: concurrent restores for the same session are rejected and
  the lock is released after success and after a failed revert
- jsonl-hydration: Claude force-refetches the truncated S3 log on restore and
  rewrites only the surviving turns (memory parity, no replay)
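
For readers unfamiliar with the Windows file-lock race, the retry behavior covered by rollout.retry.test.ts can be sketched as below. The `writeWithRetry` name, the injected `write` callback, and the attempt and delay defaults are assumptions for the example; the real writeFileWithRetry signature is not shown in this PR text.

```typescript
// Error codes treated as a transient file lock (the three named in the test list above).
const RETRYABLE_CODES = new Set(["EPERM", "EBUSY", "EACCES"]);

function isRetryable(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return typeof code === "string" && RETRYABLE_CODES.has(code);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `write` until it succeeds, the error is not lock-related,
// or maxAttempts is reached; the delay grows linearly per attempt.
async function writeWithRetry(
  write: () => Promise<void>,
  maxAttempts = 5,
  delayMs = 50,
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      await write();
      return;
    } catch (err) {
      if (!isRetryable(err) || attempt >= maxAttempts) throw err;
      await sleep(delayMs * attempt);
    }
  }
}
```

In a Node context the callback might be something like `() => fs.promises.writeFile(rolloutPath, contents)`, where `rolloutPath` and `contents` are placeholders.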
…tton

- Add enableOnFormTags/enableOnContentEditable to the checkpoint-timeline
  hotkey so Ctrl/Cmd+Shift+H opens the timeline even when the message editor
  has focus (previously suppressed by react-hotkeys-hook, the common case)
- Render the per-turn restore button always, disabled with an explanatory
  tooltip when the turn has no checkpoint (or a cloud-task tooltip for cloud
  sessions) instead of hiding it
The timeline modal (Ctrl/Cmd+Shift+H) is dropped — the per-turn restore
button on each user message is the restore entry point. Removes the
component, its keyboard shortcut wiring (SHORTCUTS / KEYBOARD_SHORTCUTS /
CONFIGURABLE_SHORTCUT_IDS / DEFAULT_KEYBINDINGS), and the ConversationView
state, hotkey, and render. Per-turn restore, the confirm dialog, and the
underlying restore/truncation flow are unchanged.
…estart

captureLocalCheckpoint stored each checkpoint only in the in-memory
sessionCheckpoints map, the agent JSONL, and S3. On a cold start the
in-memory map is gone and fetchSessionLogs reads the local logs.ndjson
cache first (before S3), so the checkpoint notifications were missing and
every restore icon showed disabled ("No checkpoint was captured for this turn").

Append the checkpoint notification to logs.ndjson at capture time (matching
the SessionLogWriter tap's line-by-line append), so cold loads after a
restart find it. S3 sync is unchanged.

Also adds buildConversationItems tests for checkpoint-to-turn association and
updates three user_message expectations for the turnContext field.
- Restore trims logs.ndjson at the restored turn's prompt-response line, which
  dropped that turn's git_checkpoint notification (appended after the response on
  TURN_COMPLETE) and left it with a disabled restore icon after an app restart.
  truncateLocalLogsAtPromptBoundary now accepts preserveTrailingEntries and the
  checkpoint router re-adds the restored checkpoint atomically, so the restored
  turn stays restorable.
- RestoreCheckpointDialog: drop the amber title/warning icons and the amber
  callout background, remove the icon on the red Restore button, and equalize the
  Cancel/Restore buttons.

Adds local-logs tests for preserved trailing entries.
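
To make the preserve step concrete, here is a sketch of boundary truncation that re-adds trailing entries in the same operation. The function name, the `isBoundary` predicate, and the in-memory line array are assumptions for the example; the real truncateLocalLogsAtPromptBoundary operates on logs.ndjson and its exact signature is not shown here.

```typescript
interface TruncateOutcome {
  lines: string[];
  found: boolean; // false when the prompt boundary is missing
}

// Keeps everything up to and including the first boundary line, then appends
// the entries that must survive the cut (for example the restored turn's checkpoint).
function truncateAtBoundary(
  lines: string[],
  isBoundary: (line: string) => boolean,
  preserveTrailingEntries: string[] = [],
): TruncateOutcome {
  const boundary = lines.findIndex(isBoundary);
  if (boundary === -1) {
    // Report the missing anchor instead of silently leaving stale turns.
    return { lines, found: false };
  }
  return {
    lines: [...lines.slice(0, boundary + 1), ...preserveTrailingEntries],
    found: true,
  };
}
```

Building the final line list in one step, rather than truncating and appending separately, avoids a window in which the log exists without the restored checkpoint.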
…anup

The local-cache fix kept the restored turn's checkpoint in logs.ndjson, but S3
still had it trimmed. If the old-session cleanup later deletes the local cache,
the cold load falls back to S3 and the restored turn would lose its icon again.
After the S3 truncate, re-append the restored checkpoint to S3 (gated on the
truncate actually having removed it, to avoid duplicates), reusing the entry
built for the local preserve.