Skip to content

feat(images): paste clipboard images into a message, shown inline as [image-N]#29

Merged
mike-diff merged 7 commits into
mainfrom
clipboard-image-paste
Jun 17, 2026
Merged

feat(images): paste clipboard images into a message, shown inline as [image-N]#29
mike-diff merged 7 commits into
mainfrom
clipboard-image-paste

Conversation

@mike-diff

@mike-diff mike-diff commented Jun 16, 2026

Copy link
Copy Markdown
Owner

Paste clipboard images into a message with Ctrl+V (Alt+V fallback on Windows Terminal), shown inline as [image-N] and sent to vision-capable models. Stdlib only, zero new dependencies.

What it does

  • Ctrl+V reads the clipboard image (shell-out, the read twin of /copy), downscales to a 1568px edge, and inserts an atomic [image-N] token; paste several and they renumber. Honest capture note (format, size, ~tokens); a failed read says why.
  • Both wire adapters serialize images (Anthropic image, OpenAI image_url); text-only turns are byte-identical to before.
  • Images persist out-of-line: the Turn keeps a hash (Data is json:"-"), bytes live content-addressed under ~/.sesh/blobs, rehydrated before the call and carried by reference across resume and handoff. Session JSON stays lean.
  • Vision gating: a name heuristic plus a per-profile vision dial; a non-vision model is blocked with guidance, never silently dropped.
  • no_tools profile dial so tools-less local vision models can be used for image Q&A.

Verification

build / gofmt / vet / tests / -race green; go.mod unchanged (zero deps). Tests cover both wire shapes plus the text-only regression, downscale, blob round-trip, token renumbering/compose, vision gating, no-tools omission, and resume/handoff. Smoke-tested in a pty against a real cloud API and a local vision model (qwen2.5vl correctly described a pasted image via no_tools).

Caveats

The macOS osascript read and Alt+V want a manual smoke; Linux wl-paste/xclip is the tested path.

Foundation for clipboard image paste (Phase 0 of the spec): the plumbing
that lets a vision model receive an image, with no user-facing UI yet.

- agent.Image{Hash,MediaType,Width,Height,Data} on Turn; Data is json:"-"
  so history stays lean on disk (bytes live out of line, repopulated before
  the call that needs them).
- Anthropic and OpenAI adapters emit a content-block/parts array only when a
  user turn carries images; text-only turns serialize exactly as before.
- harness/blob.go: content-addressed sidecar under ~/.sesh/blobs, atomic
  write, dedupe by sha256.
- harness/image.go: stdlib-only decode + downscale to a 1568px longest edge
  (free token savings); undecodable formats pass through. Token estimate for
  display.

Zero third-party dependencies. Tests assert the real serialized wire shape,
the text-only regression, downscale cap, blob round-trip/dedupe.
Phase 1 of clipboard image paste: the user-facing capture.

- Ctrl+V reads an image off the clipboard (shell-out twin of /copy's write
  path: wl-paste/xclip on Linux, osascript on macOS, PowerShell on Windows),
  downscales and stores it, and inserts an atomic [image-N] token. Multiple
  pastes renumber by appearance; the token's rune is its stable byte index.
- Honest feedback: a capture note with format, dimensions, and ~token cost;
  a failed read says why (e.g. which tool to install), never silently nothing.
- Vision gating: a model-name heuristic plus a per-profile "vision" dial
  override; an unknown model is treated as text-only and the paste is blocked
  with guidance rather than dropped. Default posture, blocks-with-guidance.
- composeMessage threads the ordered images onto the user turn at submit; the
  [image-N] labels stay in the text so a multi-image prompt can refer to them.
  Driven steers carry text only, by design.

Zero third-party deps. Tests pin renumbering, ordered compose with no raw-rune
leak, takeImages draining, and the vision heuristic + dial. The macOS
osascript read path is unverified in this Linux environment; needs a smoke test.
Phase 2 of clipboard image paste: continuity.

- rehydrateImages loads each image's bytes back from the blob store before a
  resumed or handed-off turn is sent (Data is json:"-", so it is dropped on
  save). Called before the two worker entry points that run on resumed history
  (interactive runTurn and print-mode first turn); drive iterations inherit the
  in-memory bytes. Idempotent: a live capture that already holds Data is skipped.
- A referenced blob that has gone missing drops its image from the turn with a
  note, rather than sending zero bytes to the model.
- approxTokens counts an image's patch-grid cost so the verbatim-tail budget
  reflects what an image turn will actually send.
- renderTranscript notes a user turn's images (count, media type, dimensions)
  byte-free, so the handoff brief writer knows an image existed.

Session JSON carries only the hash and metadata; the bytes live out of line.
Tests pin no-base64-in-JSON, resume/handoff rehydration from the shared store,
missing-blob drop, and the token accounting. Zero third-party deps.
Phase 3 of clipboard image paste: polish.

- Alt+V routes through the escape handler to the same capture pipeline, for
  terminals (Windows Terminal) that swallow Ctrl+V for their own paste.
- An empty bracketed paste tries a quiet image capture, recovering the macOS
  habit of Cmd+V: it acts only when an image is actually on the clipboard, so an
  ordinary empty paste stays silent.
- gcBlobs sweeps orphaned image blobs at startup (background, best-effort):
  it deletes only blobs referenced by no session (sealed sessions scanned too)
  and older than an hour, so a blob another instance just pasted is never raced.
- Docs: help.go input-keys block and the providers.json "vision" dial; a README
  line for the paste gesture.

Alt+V and the Cmd+V empty-paste path are terminal-delivery behaviors that need a
manual smoke test (no TTY in this environment). Zero third-party deps.
Two follow-ups from exercising the feature against real vision models.

- modelSupportsVision now recognizes "vl", "vision", "moondream", and
  "minicpm-v" names. A real round-trip showed llama3.2-vision and qwen2.5vl were
  wrongly classified text-only (the old check wanted a hyphen in "-vl" and had no
  "vision"), so the paste was blocked on genuinely vision-capable models.
- decodeAndDownscale keeps the short edge at >=1px: a pathological >1568:1 aspect
  ratio previously rounded it to 0 and re-encoded an empty image.

Tests pin the new heuristic names and the 1px-floor edge case.
Two follow-ups for the image-paste feature.

- no_tools: a per-profile dial ("no_tools": true) that sends no tool definitions
  to that model. Tools-less models (e.g. a local vision model used only to read a
  screenshot) reject any tools array, so without this they cannot be used at all;
  with it the turn runs as plain conversation. Gated once where the toolset is
  built, so it covers interactive, drive, and print paths.
- decodeAndDownscale now rejects bytes it can neither decode nor identify with a
  clear "unsupported image format" message, instead of storing and sending an
  application/octet-stream blob the API would reject anyway. Known-but-undecodable
  formats (WEBP) still pass through for the API to handle.

Tests: the request omits the tools field when none are passed; unidentifiable
bytes are rejected. Docs: help.go documents the no_tools dial.
@mike-diff mike-diff force-pushed the clipboard-image-paste branch from 5f9e396 to 0ed5e9d Compare June 17, 2026 00:16
@mike-diff mike-diff merged commit bff128c into main Jun 17, 2026
2 checks passed
@mike-diff mike-diff deleted the clipboard-image-paste branch June 17, 2026 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant