feat(images): paste clipboard images into a message, shown inline as [image-N] by mike-diff · Pull Request #29 · mike-diff/sesh

mike-diff · 2026-06-16T23:19:59Z

Paste clipboard images into a message with Ctrl+V (Alt+V fallback on Windows Terminal), shown inline as [image-N] and sent to vision-capable models. Stdlib only, zero new dependencies.

What it does

Ctrl+V reads the clipboard image (shell-out, the read twin of /copy), downscales to a 1568px edge, and inserts an atomic [image-N] token; paste several and they renumber. Honest capture note (format, size, ~tokens); a failed read says why.
Both wire adapters serialize images (Anthropic image, OpenAI image_url); text-only turns are byte-identical to before.
Images persist out-of-line: the Turn keeps a hash (Data is json:"-"), bytes live content-addressed under ~/.sesh/blobs, rehydrated before the call and carried by reference across resume and handoff. Session JSON stays lean.
Vision gating: a name heuristic plus a per-profile vision dial; a non-vision model is blocked with guidance, never silently dropped.
no_tools profile dial so tools-less local vision models can be used for image Q&A.

Verification

build / gofmt / vet / tests / -race green; go.mod unchanged (zero deps). Tests cover both wire shapes plus the text-only regression, downscale, blob round-trip, token renumbering/compose, vision gating, no-tools omission, and resume/handoff. Smoke-tested in a pty against a real cloud API and a local vision model (qwen2.5vl correctly described a pasted image via no_tools).

Caveats

The macOS osascript read and Alt+V want a manual smoke; Linux wl-paste/xclip is the tested path.

Foundation for clipboard image paste (Phase 0 of the spec): the plumbing that lets a vision model receive an image, with no user-facing UI yet. - agent.Image{Hash,MediaType,Width,Height,Data} on Turn; Data is json:"-" so history stays lean on disk (bytes live out of line, repopulated before the call that needs them). - Anthropic and OpenAI adapters emit a content-block/parts array only when a user turn carries images; text-only turns serialize exactly as before. - harness/blob.go: content-addressed sidecar under ~/.sesh/blobs, atomic write, dedupe by sha256. - harness/image.go: stdlib-only decode + downscale to a 1568px longest edge (free token savings); undecodable formats pass through. Token estimate for display. Zero third-party dependencies. Tests assert the real serialized wire shape, the text-only regression, downscale cap, blob round-trip/dedupe.

Phase 1 of clipboard image paste: the user-facing capture. - Ctrl+V reads an image off the clipboard (shell-out twin of /copy's write path: wl-paste/xclip on Linux, osascript on macOS, PowerShell on Windows), downscales and stores it, and inserts an atomic [image-N] token. Multiple pastes renumber by appearance; the token's rune is its stable byte index. - Honest feedback: a capture note with format, dimensions, and ~token cost; a failed read says why (e.g. which tool to install), never silently nothing. - Vision gating: a model-name heuristic plus a per-profile "vision" dial override; an unknown model is treated as text-only and the paste is blocked with guidance rather than dropped. Default posture, blocks-with-guidance. - composeMessage threads the ordered images onto the user turn at submit; the [image-N] labels stay in the text so a multi-image prompt can refer to them. Driven steers carry text only, by design. Zero third-party deps. Tests pin renumbering, ordered compose with no raw-rune leak, takeImages draining, and the vision heuristic + dial. The macOS osascript read path is unverified in this Linux environment; needs a smoke test.

Phase 2 of clipboard image paste: continuity. - rehydrateImages loads each image's bytes back from the blob store before a resumed or handed-off turn is sent (Data is json:"-", so it is dropped on save). Called before the two worker entry points that run on resumed history (interactive runTurn and print-mode first turn); drive iterations inherit the in-memory bytes. Idempotent: a live capture that already holds Data is skipped. - A referenced blob that has gone missing drops its image from the turn with a note, rather than sending zero bytes to the model. - approxTokens counts an image's patch-grid cost so the verbatim-tail budget reflects what an image turn will actually send. - renderTranscript notes a user turn's images (count, media type, dimensions) byte-free, so the handoff brief writer knows an image existed. Session JSON carries only the hash and metadata; the bytes live out of line. Tests pin no-base64-in-JSON, resume/handoff rehydration from the shared store, missing-blob drop, and the token accounting. Zero third-party deps.

Phase 3 of clipboard image paste: polish. - Alt+V routes through the escape handler to the same capture pipeline, for terminals (Windows Terminal) that swallow Ctrl+V for their own paste. - An empty bracketed paste tries a quiet image capture, recovering the macOS habit of Cmd+V: it acts only when an image is actually on the clipboard, so an ordinary empty paste stays silent. - gcBlobs sweeps orphaned image blobs at startup (background, best-effort): it deletes only blobs referenced by no session (sealed sessions scanned too) and older than an hour, so a blob another instance just pasted is never raced. - Docs: help.go input-keys block and the providers.json "vision" dial; a README line for the paste gesture. Alt+V and the Cmd+V empty-paste path are terminal-delivery behaviors that need a manual smoke test (no TTY in this environment). Zero third-party deps.

Two follow-ups from exercising the feature against real vision models. - modelSupportsVision now recognizes "vl", "vision", "moondream", and "minicpm-v" names. A real round-trip showed llama3.2-vision and qwen2.5vl were wrongly classified text-only (the old check wanted a hyphen in "-vl" and had no "vision"), so the paste was blocked on genuinely vision-capable models. - decodeAndDownscale keeps the short edge at >=1px: a pathological >1568:1 aspect ratio previously rounded it to 0 and re-encoded an empty image. Tests pin the new heuristic names and the 1px-floor edge case.

Two follow-ups for the image-paste feature. - no_tools: a per-profile dial ("no_tools": true) that sends no tool definitions to that model. Tools-less models (e.g. a local vision model used only to read a screenshot) reject any tools array, so without this they cannot be used at all; with it the turn runs as plain conversation. Gated once where the toolset is built, so it covers interactive, drive, and print paths. - decodeAndDownscale now rejects bytes it can neither decode nor identify with a clear "unsupported image format" message, instead of storing and sending an application/octet-stream blob the API would reject anyway. Known-but-undecodable formats (WEBP) still pass through for the API to handle. Tests: the request omits the tools field when none are passed; unidentifiable bytes are rejected. Docs: help.go documents the no_tools dial.

mike-diff added 7 commits June 16, 2026 15:39

chore: gitignore .context/ (ephemeral agent planning artifacts)

578c3b0

mike-diff force-pushed the clipboard-image-paste branch from 5f9e396 to 0ed5e9d Compare June 17, 2026 00:16

mike-diff merged commit bff128c into main Jun 17, 2026
2 checks passed

mike-diff deleted the clipboard-image-paste branch June 17, 2026 00:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(images): paste clipboard images into a message, shown inline as [image-N]#29

feat(images): paste clipboard images into a message, shown inline as [image-N]#29
mike-diff merged 7 commits into
mainfrom
clipboard-image-paste

mike-diff commented Jun 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mike-diff commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What it does

Verification

Caveats

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mike-diff commented Jun 16, 2026 •

edited

Loading