feat(images): paste clipboard images into a message, shown inline as [image-N]#29
Merged
Conversation
Foundation for clipboard image paste (Phase 0 of the spec): the plumbing
that lets a vision model receive an image, with no user-facing UI yet.
- agent.Image{Hash,MediaType,Width,Height,Data} on Turn; Data is json:"-"
so history stays lean on disk (bytes live out of line, repopulated before
the call that needs them).
- Anthropic and OpenAI adapters emit a content-block/parts array only when a
user turn carries images; text-only turns serialize exactly as before.
- harness/blob.go: content-addressed sidecar under ~/.sesh/blobs, atomic
write, dedupe by sha256.
- harness/image.go: stdlib-only decode + downscale to a 1568px longest edge
(free token savings); undecodable formats pass through. Token estimate for
display.
Zero third-party dependencies. Tests assert the real serialized wire shape,
the text-only regression, downscale cap, blob round-trip/dedupe.
Phase 1 of clipboard image paste: the user-facing capture. - Ctrl+V reads an image off the clipboard (shell-out twin of /copy's write path: wl-paste/xclip on Linux, osascript on macOS, PowerShell on Windows), downscales and stores it, and inserts an atomic [image-N] token. Multiple pastes renumber by appearance; the token's rune is its stable byte index. - Honest feedback: a capture note with format, dimensions, and ~token cost; a failed read says why (e.g. which tool to install), never silently nothing. - Vision gating: a model-name heuristic plus a per-profile "vision" dial override; an unknown model is treated as text-only and the paste is blocked with guidance rather than dropped. Default posture, blocks-with-guidance. - composeMessage threads the ordered images onto the user turn at submit; the [image-N] labels stay in the text so a multi-image prompt can refer to them. Driven steers carry text only, by design. Zero third-party deps. Tests pin renumbering, ordered compose with no raw-rune leak, takeImages draining, and the vision heuristic + dial. The macOS osascript read path is unverified in this Linux environment; needs a smoke test.
Phase 2 of clipboard image paste: continuity. - rehydrateImages loads each image's bytes back from the blob store before a resumed or handed-off turn is sent (Data is json:"-", so it is dropped on save). Called before the two worker entry points that run on resumed history (interactive runTurn and print-mode first turn); drive iterations inherit the in-memory bytes. Idempotent: a live capture that already holds Data is skipped. - A referenced blob that has gone missing drops its image from the turn with a note, rather than sending zero bytes to the model. - approxTokens counts an image's patch-grid cost so the verbatim-tail budget reflects what an image turn will actually send. - renderTranscript notes a user turn's images (count, media type, dimensions) byte-free, so the handoff brief writer knows an image existed. Session JSON carries only the hash and metadata; the bytes live out of line. Tests pin no-base64-in-JSON, resume/handoff rehydration from the shared store, missing-blob drop, and the token accounting. Zero third-party deps.
Phase 3 of clipboard image paste: polish. - Alt+V routes through the escape handler to the same capture pipeline, for terminals (Windows Terminal) that swallow Ctrl+V for their own paste. - An empty bracketed paste tries a quiet image capture, recovering the macOS habit of Cmd+V: it acts only when an image is actually on the clipboard, so an ordinary empty paste stays silent. - gcBlobs sweeps orphaned image blobs at startup (background, best-effort): it deletes only blobs referenced by no session (sealed sessions scanned too) and older than an hour, so a blob another instance just pasted is never raced. - Docs: help.go input-keys block and the providers.json "vision" dial; a README line for the paste gesture. Alt+V and the Cmd+V empty-paste path are terminal-delivery behaviors that need a manual smoke test (no TTY in this environment). Zero third-party deps.
Two follow-ups from exercising the feature against real vision models. - modelSupportsVision now recognizes "vl", "vision", "moondream", and "minicpm-v" names. A real round-trip showed llama3.2-vision and qwen2.5vl were wrongly classified text-only (the old check wanted a hyphen in "-vl" and had no "vision"), so the paste was blocked on genuinely vision-capable models. - decodeAndDownscale keeps the short edge at >=1px: a pathological >1568:1 aspect ratio previously rounded it to 0 and re-encoded an empty image. Tests pin the new heuristic names and the 1px-floor edge case.
Two follow-ups for the image-paste feature.
- no_tools: a per-profile dial ("no_tools": true) that sends no tool definitions
to that model. Tools-less models (e.g. a local vision model used only to read a
screenshot) reject any tools array, so without this they cannot be used at all;
with it the turn runs as plain conversation. Gated once where the toolset is
built, so it covers interactive, drive, and print paths.
- decodeAndDownscale now rejects bytes it can neither decode nor identify with a
clear "unsupported image format" message, instead of storing and sending an
application/octet-stream blob the API would reject anyway. Known-but-undecodable
formats (WEBP) still pass through for the API to handle.
Tests: the request omits the tools field when none are passed; unidentifiable
bytes are rejected. Docs: help.go documents the no_tools dial.
5f9e396 to
0ed5e9d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Paste clipboard images into a message with Ctrl+V (Alt+V fallback on Windows Terminal), shown inline as
[image-N]and sent to vision-capable models. Stdlib only, zero new dependencies.What it does
/copy), downscales to a 1568px edge, and inserts an atomic[image-N]token; paste several and they renumber. Honest capture note (format, size, ~tokens); a failed read says why.image, OpenAIimage_url); text-only turns are byte-identical to before.Dataisjson:"-"), bytes live content-addressed under~/.sesh/blobs, rehydrated before the call and carried by reference across resume and handoff. Session JSON stays lean.visiondial; a non-vision model is blocked with guidance, never silently dropped.no_toolsprofile dial so tools-less local vision models can be used for image Q&A.Verification
build / gofmt / vet / tests /
-racegreen;go.modunchanged (zero deps). Tests cover both wire shapes plus the text-only regression, downscale, blob round-trip, token renumbering/compose, vision gating, no-tools omission, and resume/handoff. Smoke-tested in a pty against a real cloud API and a local vision model (qwen2.5vlcorrectly described a pasted image viano_tools).Caveats
The macOS
osascriptread and Alt+V want a manual smoke; Linuxwl-paste/xclipis the tested path.