feat(sandbox): workspace snapshot persistence + web app builder tile #3397
rafavalls wants to merge 15 commits into
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="SANDBOX_PERSISTENCE.md">
<violation number="1" location="SANDBOX_PERSISTENCE.md:121">
P2: Create endpoint returns 204 for empty workdir, but snapshot-saver unconditionally pipes the response body to store.put(). A 204 has no body — results in a malformed snapshot entry.</violation>
</file>
P2: Create endpoint returns 204 for empty workdir, but snapshot-saver unconditionally pipes the response body to store.put(). A 204 has no body — results in a malformed snapshot entry.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At SANDBOX_PERSISTENCE.md, line 121:
<comment>Create endpoint returns 204 for empty workdir, but snapshot-saver unconditionally pipes the response body to store.put(). A 204 has no body — results in a malformed snapshot entry.</comment>
<file context>
@@ -0,0 +1,315 @@
+ - cd <repoDir>
+ - spawn `tar -cf - --exclude=./tmp .`
+ - pipe tar's stdout into the HTTP response body
+ - returns 204 if repoDir is empty (nothing to snapshot yet)
+
+POST /_decopilot_vm/snapshot/restore
</file context>
4 issues found across 8 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/mesh/src/sandbox/sandbox-store/local-fs-store.ts">
<violation number="1" location="apps/mesh/src/sandbox/sandbox-store/local-fs-store.ts:44">
P2: Temp filename needs stronger uniqueness. Same-ms concurrent puts can collide. Add random suffix.</violation>
</file>
<file name="packages/sandbox/daemon/entry.ts">
<violation number="1" location="packages/sandbox/daemon/entry.ts:301">
P2: Snapshot root wrong. It points at `repoDir`, not full workdir. Files outside `/app/repo` never save/restore, and `--exclude=./tmp` no longer matches intended path. Wire snapshot handlers to workspace root.</violation>
</file>
<file name="packages/sandbox/daemon/routes/snapshot.ts">
<violation number="1" location="packages/sandbox/daemon/routes/snapshot.ts:119">
P1: Restore untars straight into live repoDir. Tar can fail after writing some files, so repoDir can end up half-restored. Extract into a temp dir and swap only on success (or clean repoDir on failure).</violation>
<violation number="2" location="packages/sandbox/daemon/routes/snapshot.ts:148">
P1: Exit-code check alone too weak for tar integrity. Truncated archive can still exit 0 and look successful, but restore misses files. Add integrity verification (for example checksum/size metadata) before returning ok.</violation>
</file>
P1: Exit-code check alone too weak for tar integrity. Truncated archive can still exit 0 and look successful, but restore misses files. Add integrity verification (for example checksum/size metadata) before returning ok.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/sandbox/daemon/routes/snapshot.ts, line 148:
<comment>Exit-code check alone too weak for tar integrity. Truncated archive can still exit 0 and look successful, but restore misses files. Add integrity verification (for example checksum/size metadata) before returning ok.</comment>
<file context>
@@ -0,0 +1,159 @@
+ child.on("exit", (code, signal) => resolve({ code, signal }));
+ });
+
+ if (exit.code !== 0) {
+ return jsonResponse(
+ {
</file context>
4 issues found across 15 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/mesh/src/sandbox/snapshot-saver.ts">
<violation number="1" location="apps/mesh/src/sandbox/snapshot-saver.ts:112">
P2: Record save watermark at snapshot start, not upload finish. Finish time can hide activity that happened during save.</violation>
<violation number="2" location="apps/mesh/src/sandbox/snapshot-saver.ts:128">
P2: Add re-entry guard around tick. setInterval can overlap async runs. Duplicate saves and lastSavedAt races can happen.</violation>
</file>
<file name="apps/mesh/src/sandbox/sandbox-store/index.ts">
<violation number="1" location="apps/mesh/src/sandbox/sandbox-store/index.ts:36">
P2: Boolean env parse too strict. `"true"` ignored, path-style may stay off. Accept `"true"` and `"1"`.</violation>
</file>
<file name="apps/mesh/src/tools/vm/start.ts">
<violation number="1" location="apps/mesh/src/tools/vm/start.ts:320">
P1: Best-effort restore not actually best-effort. Catch restore errors in onDaemonReady so VM_START can continue to fresh clone.</violation>
</file>
1 issue found across 9 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/mesh-sdk/src/types/virtual-mcp.ts">
<violation number="1" location="packages/mesh-sdk/src/types/virtual-mcp.ts:172">
P1: Restrict cloneUrl protocol to https. `.url()` alone accepts insecure schemes like `http`, so token-in-URL clones can leak credentials. Add protocol check.</violation>
</file>
2 issues found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="SANDBOX_PERSISTENCE.md">
<violation number="1" location="SANDBOX_PERSISTENCE.md:121">
P2: Create endpoint returns 204 for empty workdir, but snapshot-saver unconditionally pipes the response body to store.put(). A 204 has no body — results in a malformed snapshot entry.</violation>
</file>
<file name="packages/sandbox/daemon/routes/snapshot.ts">
<violation number="1" location="packages/sandbox/daemon/routes/snapshot.ts:148">
P1: Exit-code check alone too weak for tar integrity. Truncated archive can still exit 0 and look successful, but restore misses files. Add integrity verification (for example checksum/size metadata) before returning ok.</violation>
</file>
Proposal for adding bundle-based persistence so vibecoded sandboxes (cloneUrl-backed vMCPs from #3361) survive idle/restart without GitHub. Covers a small VibecodeStore abstraction (local FS + S3 IRSA), daemon snapshot routes that produce/consume a `git bundle`, a mesh-side idle poller, and a restore pre-step in VM_START. The restore path feeds the bundle file path as cloneUrl so the existing setup orchestrator handles clone/install/start unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three substantive changes:
1. Universal, not vibecode-only. Per dev review: the sandbox IS the user's computer, so every sandbox saves its state, GitHub-backed or not. Git push to GitHub remains a user-initiated "publish", orthogonal to the auto-save of machine state.
2. Format switched from `git bundle` to plain `.tar` (no gzip). Stream-only, zero CPU on compress/decompress. Including `node_modules` means restore is just untar + boot, with no install on cold-start. Big perceived UX win.
3. Acknowledged Firecracker as the long-term direction (whole-VM snapshots, sub-second restore). The SandboxStore interface in this plan is forward-compatible: a firecracker backend would swap the producer/consumer but the abstraction holds.
File renamed VIBECODE_PERSISTENCE.md → SANDBOX_PERSISTENCE.md to match the broader scope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 1 of the sandbox persistence plan (SANDBOX_PERSISTENCE.md):
- `SandboxStore` interface: put/get/head/delete keyed by `<orgId>/<vmcpId>/<branch>.tar`
- `snapshotKey()` helper that sanitizes each component, preserves `/` inside branches as legitimate prefix hierarchy, and neutralizes `..` before it reaches the filesystem
- `LocalFsStore`: writes under `<baseDir>/<key>` with atomic temp+rename semantics. Belt-and-suspenders path-traversal defense via `path.relative(baseDir, resolved)` on every call
- `pickStoreFromEnv()` shim: returns LocalFsStore today; S3Store slots in later steps without changing callers
No wiring yet: the picker isn't called from anywhere, so this is pure utility that lands without touching prod behavior. 22 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
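A minimal sketch of the two defenses this commit describes: key sanitization and a `path.relative` containment check. The sanitization rules, the `sanitizeComponent` and `resolveInside` names, and the allowed character set are assumptions for illustration; the commit message does not show the actual implementation.

```typescript
import * as path from "node:path";

// Hypothetical sanitizer: the real rules used by `snapshotKey()` are not shown in the commit.
function sanitizeComponent(part: string): string {
  // Replace anything outside a conservative allowlist, then neutralize dot-only segments.
  const cleaned = part.replace(/[^A-Za-z0-9._-]/g, "_");
  return /^\.+$/.test(cleaned) ? "_" : cleaned;
}

// Builds `<orgId>/<vmcpId>/<branch>.tar`, keeping "/" inside the branch as prefix hierarchy.
function snapshotKey(orgId: string, vmcpId: string, branch: string): string {
  const branchParts = branch
    .split("/")
    .filter((p) => p.length > 0)
    .map(sanitizeComponent);
  return [sanitizeComponent(orgId), sanitizeComponent(vmcpId), ...branchParts].join("/") + ".tar";
}

// Second line of defense: reject any key that resolves outside baseDir.
function resolveInside(baseDir: string, key: string): string {
  const resolved = path.resolve(baseDir, key);
  const rel = path.relative(baseDir, resolved);
  const escapes = rel === "" || rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error(`key escapes base directory: ${key}`);
  }
  return resolved;
}
```

The containment check is deliberately independent of the sanitizer, so a bug in one does not defeat the other.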
Step 2 of the sandbox persistence plan. Two POST routes under /_decopilot_vm/snapshot/* mounted in entry.ts's vmRouteH dispatcher:
- /snapshot/create: streams `tar -cf - --exclude=./tmp .` from repoDir. Spawns tar(1) directly so multi-hundred-MB archives never sit in Node memory; stdout is piped through the HTTP response body. Returns 404 when repoDir doesn't exist yet (daemon unconfigured).
- /snapshot/restore: consumes raw tar bytes from the request body via `tar -xf - -C <repoDir>`. Ensures repoDir exists, surfaces tar's stderr in the 500 response on malformed input. Caller (mesh) re-clones from source if restore fails.
Both routes sit behind the existing daemonToken bearer-auth gate. No compression at the tar layer: the workspace already contains compressed assets, and CPU costs more than storage.
7 daemon tests pass (round-trip a populated workdir, exclude ./tmp, restore creates target dir, 400 on missing body, 500 on malformed tar).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
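The streaming approach in /snapshot/create could look roughly like the sketch below: spawn tar(1) and hand its stdout to the caller as a stream, so the archive is never buffered in memory. The `tarCreateArgs` and `streamWorkdirTar` names are hypothetical and the HTTP wiring is omitted; only the tar command line comes from the commit message.

```typescript
import { spawn } from "node:child_process";
import type { Readable } from "node:stream";

// Builds the argv for `tar -cf - --exclude=<...> .` (archive to stdout, current dir as root).
function tarCreateArgs(excludes: string[]): string[] {
  return ["-cf", "-", ...excludes.map((e) => `--exclude=${e}`), "."];
}

// Hypothetical helper: returns tar's stdout stream plus a promise for its exit code.
function streamWorkdirTar(workDir: string): { stdout: Readable; done: Promise<number> } {
  const child = spawn("tar", tarCreateArgs(["./tmp"]), {
    cwd: workDir,
    stdio: ["ignore", "pipe", "pipe"],
  });
  const done = new Promise<number>((resolve) => {
    child.on("error", () => resolve(1)); // e.g. tar binary missing
    child.on("close", (code) => resolve(code ?? 1));
  });
  return { stdout: child.stdout, done };
}
```

A route handler would pipe `stdout` into the response body and inspect `done` afterwards to detect a failed archive.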
Step 3 of the sandbox persistence plan.
Adds an `onDaemonReady` hook to `EnsureOptions` that fires after the daemon's HTTP server is healthy but before the runner POSTs its initial config (which triggers the orchestrator's clone+install+start). All three postConfig runners (host, docker, agent-sandbox) call the hook in identical fashion. Freestyle has no postConfig flow and ignores it naturally.
VM_START's `provisionSandbox` passes a hook that pulls the workspace tar from the SandboxStore (LocalFsStore today, S3Store later) and POSTs it to /_decopilot_vm/snapshot/restore. Best-effort: restore failures log and fall through to the orchestrator's fresh-clone path, since the restored `.git` would have short-circuited the clone via the existing `hasGitRepo()` gate. Applies to both VM_START and ensureVmForBranch (both call provisionSandbox), so every sandbox, GitHub-backed or cloneUrl-only, gets restoration without further gating.
Tests: existing 206 pass; added one ordering test in docker runner asserting hook fires after /health and before /_decopilot_vm/config with the matching daemon token.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 4 of the sandbox persistence plan. `apps/mesh/src/sandbox/snapshot-saver.ts` adds a small in-memory registry of active sandboxes (handle → orgId/vmcpId/branch).
Two entry points:
- `startSnapshotSaver()` boots a 30s interval that, for each tracked sandbox, probes the daemon's `/_decopilot_vm/idle` endpoint and, when `idleMs > 60s` AND the user did work since the last save, POSTs `/_decopilot_vm/snapshot/create` and pipes the response stream into `SandboxStore.put(...)`.
- `saveAllSnapshotsOnShutdown()` runs in mesh's gracefulShutdown after ingress close but before connection drain, saving every tracked sandbox unconditionally (skipping the idle gate).
Tracking hooks:
- VM_START registers via `trackSandbox` after successful provisioning.
- VM_DELETE unregisters before runner teardown so the saver doesn't proxy to a dead pod.
Mesh restart loses the in-memory registry: outstanding sandboxes won't be auto-saved until their next VM_START, but the daemon's own SIGTERM git-add+commit path still preserves GitHub-backed work on pod recycle.
6 saver tests pass (idle threshold, no-resave-when-unchanged, shutdown unconditional save, error paths). Full mesh+sandbox test suite stays green at 212 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
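The idle gate, the re-entry guard, and the start-of-save watermark raised in review can be sketched as below. All names here (`shouldSave`, `makeGuardedTick`, `saveWithWatermark`) are hypothetical, and deriving "last activity" as `now - idleMs` is an assumption about how the saver decides that work happened since the last save.

```typescript
const IDLE_THRESHOLD_MS = 60_000;

// Save only when the sandbox has been idle long enough AND activity happened after the last save.
function shouldSave(idleMs: number, nowMs: number, lastSavedAt: number): boolean {
  const lastActivityAt = nowMs - idleMs; // assumption: derived from the daemon's idle probe
  return idleMs > IDLE_THRESHOLD_MS && lastActivityAt > lastSavedAt;
}

// Wraps an async tick so a slow run is skipped over rather than overlapped by setInterval.
function makeGuardedTick(work: () => Promise<void>): () => Promise<boolean> {
  let running = false;
  return async () => {
    if (running) return false; // previous tick still in flight
    running = true;
    try {
      await work();
      return true;
    } finally {
      running = false;
    }
  };
}

// Records the watermark at snapshot start, so activity during the upload triggers a later save.
async function saveWithWatermark(
  state: { lastSavedAt: number },
  upload: () => Promise<void>,
  now: () => number = Date.now,
): Promise<void> {
  const startedAt = now();
  await upload();
  state.lastSavedAt = startedAt;
}
```

With this shape, `setInterval(makeGuardedTick(tick), 30_000)` cannot start a second save while the first is still uploading.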
Step 5 of the sandbox persistence plan. Wraps `@aws-sdk/client-s3` (already a mesh dep for the org-file storage service) so we don't hand-roll SigV4. The SDK's default credential provider chain handles both IRSA in EKS (via AWS_WEB_IDENTITY_TOKEN_FILE + AWS_ROLE_ARN) and local dev with explicit AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, so no extra refresh loop is needed.
Picker now reads its own env vars instead of taking them as args, which shrinks the two call sites (VM_START + snapshot-saver):
  SANDBOX_SNAPSHOTS_BUCKET (selects S3 vs local FS)
  SANDBOX_SNAPSHOTS_REGION (default us-east-1)
  SANDBOX_SNAPSHOTS_PREFIX (default "sandbox-snapshots")
  SANDBOX_SNAPSHOTS_ENDPOINT (localstack / R2 / MinIO override)
  SANDBOX_SNAPSHOTS_FORCE_PATH_STYLE (compat with non-AWS S3 backends)
Test only covers key prefixing/construction; the actual put/get round-trip runs against staging S3 in step 6's smoke test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
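A sketch of how a picker might turn these env vars into a store configuration, including the boolean parsing that the review asked to accept both `"true"` and `"1"`. The `StoreConfig` type and the `readStoreConfig` and `parseBoolEnv` names are hypothetical; only the variable names and defaults come from the commit message.

```typescript
type Env = Record<string, string | undefined>;

type StoreConfig =
  | { kind: "local-fs" }
  | {
      kind: "s3";
      bucket: string;
      region: string;
      prefix: string;
      endpoint: string | undefined;
      forcePathStyle: boolean;
    };

// Accepts exactly "true" and "1"; anything else (including unset) is false.
function parseBoolEnv(value: string | undefined): boolean {
  return value === "true" || value === "1";
}

// Bucket presence selects S3; otherwise fall back to the local filesystem store.
function readStoreConfig(env: Env): StoreConfig {
  const bucket = env["SANDBOX_SNAPSHOTS_BUCKET"];
  if (!bucket) return { kind: "local-fs" };
  return {
    kind: "s3",
    bucket,
    region: env["SANDBOX_SNAPSHOTS_REGION"] ?? "us-east-1",
    prefix: env["SANDBOX_SNAPSHOTS_PREFIX"] ?? "sandbox-snapshots",
    endpoint: env["SANDBOX_SNAPSHOTS_ENDPOINT"],
    forcePathStyle: parseBoolEnv(env["SANDBOX_SNAPSHOTS_FORCE_PATH_STYLE"]),
  };
}
```

Taking the environment as a parameter, rather than reading `process.env` inside the function, keeps the picker testable without mutating global state.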
Step 6 of the sandbox persistence plan.
Adds a documented `sandboxSnapshots:` block to values.yaml and threads
SANDBOX_SNAPSHOTS_{BUCKET,REGION,PREFIX} into the deployment env when
enabled. Fails the template if `enabled=true` but `bucket` is empty so
misconfiguration surfaces at helm install, not at first VM_START.
Auth: reuses the existing IRSA ServiceAccount that s3Sync.roleArn
already annotates — one role, both permissions. Documented in the
values comments. Operators grant the role:
s3:PutObject, s3:GetObject, s3:HeadObject, s3:DeleteObject
on arn:aws:s3:::<bucket>/<prefix>/*
Bucket settings (versioning, lifecycle expiry, SSE) are documented as
recommendations — not enforced by the chart so operators can manage
their own IaC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `cloneUrl: string | null` to the vMCP metadata schema and threads
it through VM_START so a sandbox can boot from a plain git URL without
requiring a connected GitHub App. Public github.com URLs work out of
the box; private repos require credentials embedded in the URL (those
get stripped from the display name before logging).
Generalizes the "repo-backed agent" predicate from "has githubRepo" to
"has githubRepo OR cloneUrl" across:
* `tools/thread/create.ts` — branch resolution kicks in for cloneUrl
too, so threads stick to a stable branch instead of falling to
`thread:<taskId>` synthetic keying.
* `api/routes/decopilot/dispatch-run.ts` — per-branch VM keying
applies, so the preview the user sees points at the same sandbox
the agent is editing.
* `web/lib/github-repo.ts` — new `hasPreviewableRepo()` predicate;
`preview-tab.tsx`, `use-main-panel-tabs.ts`, and the agent shell
layout all use it to gate Preview-tab visibility. The "git" tab
stays GitHub-only (PR creation requires OAuth).
Snapshot persistence (already on this branch) layers cleanly: a
cloneUrl-backed sandbox saves its workdir to S3/LocalFs the same way a
GitHub-backed one does.
441 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
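To illustrate the credential handling described in this commit, here is a small sketch that strips embedded credentials before a clone URL is logged, together with the https-only check that review later requested. It uses the standard WHATWG `URL` class; the function names are hypothetical and the project's actual schema validation is not shown here.

```typescript
// Returns true only for well-formed https:// URLs.
function isHttpsCloneUrl(raw: string): boolean {
  try {
    return new URL(raw).protocol === "https:";
  } catch {
    return false; // not a parseable URL
  }
}

// Removes any user:password@ portion so tokens never reach logs or display names.
function displayCloneUrl(raw: string): string {
  const url = new URL(raw);
  url.username = "";
  url.password = "";
  return url.toString();
}
```

For example, `displayCloneUrl("https://user:token@github.com/acme/repo.git")` yields `https://github.com/acme/repo.git`.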
Adds a "Web app" tile at the front of the home page tile row that:
1. Creates a vMCP with `cloneUrl: github.com/decocms/webapp-template` (no GitHub OAuth needed thanks to the cloneUrl plumbing on this branch).
2. Eagerly fires VM_START in the background so clone+install starts running during navigation; by the time the user lands on the agent the preview is already booting.
3. Pre-sets layout metadata so the agent opens with the Preview tab as the main view and the chat panel docked on the side.
4. Pins the new vMCP to the sidebar.
The bundled instructions tell the agent it's vibecoding the deco webapp template, not to restart the dev server, and not to mention URLs/ports (the user has the live preview right there).
Pairs with the snapshot persistence already on this branch: a user can close the tab, come back later, and pick up where they left off, because the sandbox restores from S3 (or local FS in dev) on the next VM_START.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…awnp
The clone failure surfaced as "[orchestrator] clone failed: posix_spawnp failed." on macOS: node-pty's forkpty can fail deterministically on bun + libuv before the wrapped command (git ls-remote, git clone, etc.) ever runs.
Setup commands don't need a TTY, so swap them to `node:child_process.spawn` instead. We use Node's spawn (not Bun.spawn) because we need `uid`/`gid` for the privilege drop path; Bun.spawn doesn't expose those. stdout and stderr both pipe into the existing onChunk stream so callers see the same log shape.
Also adds `posix_spawnp failed` / `Resource temporarily unavailable` to the clone's TRANSIENT_ERRORS list and wraps `runStep` in a try/catch so spawn-level throws are treated as transient and retried, instead of hard-failing the orchestrator's clone step.
Ports the relevant subset of #3361's stability fixes, the minimum needed to make the Web app tile usable on macOS dev machines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
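The transient-error handling described above might be sketched as follows. The list contains only the two entries named in this commit message (the real TRANSIENT_ERRORS list presumably has more), and `StepResult`, `isTransient`, and `runWithRetry` are hypothetical names with an assumed retry count.

```typescript
// Assumed subset: only these two entries are named in the commit message.
const TRANSIENT_ERRORS: readonly string[] = [
  "posix_spawnp failed",
  "Resource temporarily unavailable",
];

interface StepResult {
  code: number;
  output: string;
}

function isTransient(output: string): boolean {
  return TRANSIENT_ERRORS.some((needle) => output.includes(needle));
}

// Hypothetical retry wrapper: re-runs a step while it fails with a transient error.
async function runWithRetry(step: () => Promise<StepResult>, maxAttempts = 3): Promise<StepResult> {
  let last: StepResult = { code: 1, output: "" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      last = await step();
    } catch (err) {
      // Spawn-level throws are treated as transient and retried.
      last = { code: 1, output: String(err) };
      continue;
    }
    if (last.code === 0 || !isTransient(last.output)) return last;
  }
  return last;
}
```

Non-transient failures return immediately, so a genuinely broken clone URL is not retried.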
The setup-step fix landed; the dev-server start path was still hitting `posix_spawnp failed.` because task-manager uses node-pty's spawnPty (which DOES benefit from a TTY for colored output and progress bars). Instead of dropping that benefit globally, catch the spawn error and fall back to non-PTY `child_process.spawn` for that one task: the user loses colors but the dev server boots.
Also fixes a latent env bug in startPipe: child_process.spawn *replaces* env when given an `env` option (the PTY path merges). task.spec.env carries overrides like HOST/PORT only, and without PATH the child can't find `bun`, `node`, `npm`. Merge process.env in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
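The env bug is easy to reproduce: passing `env` to `child_process.spawn` replaces the child's environment rather than extending it. A minimal sketch of the merge, using a hypothetical `buildChildEnv` helper:

```typescript
type Env = Record<string, string | undefined>;

// Overrides win, but everything else (PATH, HOME, ...) is inherited from the base environment.
function buildChildEnv(overrides: Record<string, string>, base: Env = process.env): Env {
  return { ...base, ...overrides };
}
```

A call such as `spawn("bun", ["run", "dev"], { env: buildChildEnv({ PORT: "3000" }) })` then still finds `bun` on PATH; the command and port here are illustrative only.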
… correctness
- snapshot restore: extract to temp dir + atomic rename-swap so a failed or truncated restore never leaves workDir half-written
- snapshot root: wire handlers to appRoot (full workspace) instead of repoDir so node_modules outside repo/ are captured
- vm/start: wrap onDaemonReady restore in try/catch so errors fall through to fresh clone instead of aborting VM_START
- spawn-step: resolve(1) on spawn error (restores Promise<number> contract); fix signal exit code via os.constants.signals
- clone: remove dead try/catch now that spawnSetupStep always resolves
- cloneUrl schema: restrict to https:// protocol
- local-fs-store: use randomUUID() for temp file names (no pid+ms collision)
- snapshot-saver: record watermark before upload; add tickRunning guard to prevent setInterval overlap
- sandbox-store: accept "true" and "1" for FORCE_PATH_STYLE env var
- misc: trim AI-generated multi-paragraph docblocks to single-line comments
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
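For the spawn-step item, a sketch of a step runner that always resolves to a number and maps a terminating signal to an exit code through `os.constants.signals`. The `128 + signal number` mapping is the common shell convention and is an assumption here, as are the `exitCodeFrom` and `runStep` names; the commit does not state which mapping the code uses.

```typescript
import { spawn } from "node:child_process";
import * as os from "node:os";

// Signal name -> number, e.g. "SIGTERM" -> 15.
const SIGNAL_NUMBERS = new Map<string, number>(Object.entries(os.constants.signals));

// Assumed convention: 128 + signal number when the child was killed by a signal.
function exitCodeFrom(code: number | null, signal: string | null): number {
  if (code !== null) return code;
  if (signal !== null) return 128 + (SIGNAL_NUMBERS.get(signal) ?? 0);
  return 1;
}

// Never rejects: spawn errors are reported through onChunk and resolve as exit code 1.
function runStep(cmd: string, args: string[], onChunk: (text: string) => void): Promise<number> {
  return new Promise<number>((resolve) => {
    const child = spawn(cmd, args, { stdio: ["ignore", "pipe", "pipe"] });
    child.stdout.on("data", (d: Buffer) => onChunk(d.toString()));
    child.stderr.on("data", (d: Buffer) => onChunk(d.toString()));
    child.on("error", (err) => {
      onChunk(`spawn error: ${err.message}\n`);
      resolve(1);
    });
    child.on("close", (code, signal) => resolve(exitCodeFrom(code, signal)));
  });
}
```

Because the promise only ever resolves, callers need no try/catch around the step, which is why the commit can remove the dead one in the clone path.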
f8d1bb9 to 1cbda81
- mention-slash.tsx: suppress ban-ref-current-assignment for render-time ref write (known pattern, TODO refactor)
- sandbox-store/index.ts: remove unused S3Store re-export flagged by knip
- snapshot.test.ts: rename repoDir→workDir in all handler instantiations to match SnapshotDeps interface rename
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In CI, the ephemeral port allocated by ephemeralPort() can be grabbed
by another process before the setTimeout's listenOn call fires. The
original `void listenOn(...)` left the rejection unhandled, which Bun
counts as a test error. Chain .catch(() => {}) so the rejection is
swallowed — waitForPort still detects the port as in-use and resolves.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What
Two features shipped together, plus two supporting changes (items 3 and 4):

1. Sandbox snapshot persistence
Every running sandbox now auto-saves its full workspace state (a tar of the workdir, including `.git` and `node_modules`) to a blob store. On cold start, `VM_START` restores from the snapshot before cloning, so state survives pod evictions with no `npm install` penalty.

Store adapters:
- `LocalFsStore`: for local dev / host runner (writes to `<DATA_DIR>/sandbox-snapshots/`)
- `S3Store`: for prod (bucket + IRSA, reads `SANDBOX_SNAPSHOTS_BUCKET` / region / prefix / endpoint env vars)

Daemon routes (`packages/sandbox/daemon/routes/snapshot.ts`):
- `GET /_decopilot_vm/snapshot/create`: streams a tar of the workdir, excluding `./tmp`
- `POST /_decopilot_vm/snapshot/restore`: atomic restore that extracts to a temp dir, then rename-swaps on success so a failed restore never leaves a partial workdir

Idle poller (`apps/mesh/src/sandbox/snapshot-saver.ts`):
- `setInterval`; saves only when activity has occurred since the last save (watermark-based)

VM start (`apps/mesh/src/tools/vm/start.ts`):
- `/snapshot/restore`

Helm (`deploy/helm/studio/`):
- `sandboxSnapshots` values block wires all env vars into the mesh deployment

2. Web app builder tile
One-click "build a web app" entry point on the home screen (`feat(home): Web app tile`); launches a vibecode session from a deco template.

3. Plain cloneUrl support
Agents can now specify a `cloneUrl` (plain HTTPS, no GitHub OAuth) in addition to `githubRepo`. Both harnesses (`decopilot` and `claude-code`) and `VM_START` handle both source types.

4. Daemon spawn reliability fixes
- `spawnSetupStep` now uses `child_process.spawn` (not node-pty / Bun.spawn): avoids `forkpty` failures on macOS and supports uid/gid drops
- `TRANSIENT_ERRORS` expanded to catch `posix_spawnp failed`, `posix_spawn failed`, `Resource temporarily unavailable`
- `spawnSetupStep` always resolves (never rejects): spawn errors surface as exit code 1 + `onChunk` message, preserving the promise contract at all call sites

Test plan
- `bun test packages/sandbox/daemon/routes/snapshot.test.ts`: create/restore round-trip, tmp exclusion, malformed tar, missing body
- `bun test apps/mesh/src/sandbox/snapshot-saver.test.ts`: idle detection, re-entry guard, watermark timing
- … `npm install` on cold start): runnable locally with host runner + `LocalFsStore`
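
The restore strategy described under the daemon routes (extract to a temp dir, swap only on success) can be sketched as below. This is an illustration under stated assumptions, not the daemon's code: `restoreAtomically` is a hypothetical name, the extraction step is passed in as a synchronous callback for brevity (the real route streams into tar), and the swap uses two renames, which leaves a brief window in which workDir does not exist.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Extracts into a sibling staging dir, then swaps it into place only if extraction succeeded.
function restoreAtomically(workDir: string, extract: (stagingDir: string) => void): void {
  const parent = path.dirname(workDir);
  fs.mkdirSync(parent, { recursive: true });
  // Sibling of workDir, so the final rename stays on the same filesystem.
  const staging = fs.mkdtempSync(path.join(parent, ".restore-"));
  try {
    extract(staging); // may throw; workDir is untouched so far
  } catch (err) {
    fs.rmSync(staging, { recursive: true, force: true });
    throw err;
  }
  const old = `${workDir}.old-${process.pid}`;
  const hadOld = fs.existsSync(workDir);
  if (hadOld) fs.renameSync(workDir, old);
  fs.renameSync(staging, workDir);
  if (hadOld) fs.rmSync(old, { recursive: true, force: true });
}
```

A failed or truncated extraction therefore leaves the previous workdir contents exactly as they were.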