ai-bot: the assistant can read a skill directly from the realm server by jurgenwerk · Pull Request #5344 · cardstack/boxel

jurgenwerk · 2026-06-26T09:50:50Z

Here is a video demonstrating:

Informing the AI about a set of available skills and where to get them if needed. In next work, we will create curated skill index files and attach it at the beginning of the system prompt. This is so that AI will be aware of the available skills but won't load them until it actually needs them
The bot pulling the skills on demand from the realm server directly (but still visible as command results in the chat UI). The reasons we want to fetch from the server directly: no Matrix uploads at enable time, no dependency on the user's host tab being open, content is always fresh
AI then receives the skills on its own demand (in a "pull" mechanism, as opposed to our current "push" mechanism where we inject all the attached skills upfront in the prompt)

Screen.Recording.2026-06-30.at.23.01.39.mov

More details:

What this does

The assistant can now grab a file from your realm by itself, while it's answering — usually a skill's instructions, but really any file. Before, every enabled skill's full text was stuffed into the prompt up front. Now the bot reads only what it needs, when it needs it.

How it works (the simple version)

Mid-answer, the bot thinks "I need that file" and calls a tool: readRealmFile(realm, url).
The bot fetches the file itself, using a short-lived, read-only pass made out in your name — so it can only read what you already could. No browser, no round-trip; it happens in the same turn.
It reads the file's contents into its own working memory, keeps going, and answers you.
You see a pill in the chat — "Read file: demo/SKILL.md" — that spins while loading and turns to ✓ when done (or red with a reason if it failed). That pill is just a record; the file's text goes to the model, not into the chat.

Why `readRealmFile` and not `loadSkill`

There's nothing skill-specific about it — it just reads a file. So it gets an honest, general name. (It's distinct from the existing host readFile, which runs in your browser; this one runs server-side with a delegated token.)

Mixed tool calls

Bot-run tools (like this) finish this turn; host commands finish next turn. Those two timings don't mix. So if the model asks for a read and a host command in the same round, the read is explicitly rejected with a clear message ("readRealmFile cannot be mixed with host-dispatched commands in the same round") instead of being silently dropped — and the host command still runs.

Safety

Only in single-human rooms, so "you" is unambiguous.
Read-only and scoped to you — can't read what you can't, can't write.
Always live — the current file, never a stale snapshot.
A missing file / no access / bad URL comes back as an error the model can read (and a red pill), never a crash.
Inert unless the delegation secret is configured (already set in our deployed envs).

Also in here (unrelated)

A one-line CSS fix removing a stray border line that appeared above the AI action bar during diff-apply.

Tests

Unit tests for the fetch/executor (delegated-token retry, 404 / disabled / forbidden), the tool-call classifier (bot vs host, mixed), and the read loop; verified live with single and multiple reads, rendered in order.

First slice of the pull-model loadSkill tool (CS-11554): the bot-side core that fetches a skill's instructions on demand. - lib/load-skill.ts: the `loadSkill` tool schema the model will be offered, plus `executeLoadSkill` — mints a delegated, user-scoped realm token (CS-11553) and GETs skills/<name>/SKILL.md (or references/<path>) as raw source. Read-only and scoped to the requesting human, so the bot can't read anything they couldn't. Never throws; returns an ok/error result the caller hands back to the model. - tests/load-skill-test.ts: URL building, token minting, success, 404, and the disabled / forbidden delegation paths. Not yet wired into the response loop (advertise the tool + intercept the call + feed the result back to the model) — that lands next in this ticket; see the PR description. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Describe what the loadSkill module does as timeless fact rather than citing issue identifiers, per the evergreen-comments convention. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make loadSkill a real, bot-executed tool instead of just a building block: - getResponse offers loadSkill whenever delegated realm sessions are configured (gated on the manager being enabled), and accepts a messages override so a turn can be re-run with tool results appended. - response-state drops loadSkill from the emitted command requests, so it is never handed to the host to execute (the bot runs it itself). - main.ts runs the generation as a bounded loop: when a round's only tool calls are loadSkill, fetch the skills and generate again with the results in context, up to LOAD_SKILL_MAX_ROUNDS; cost is summed across rounds and the response is finalized once. - load-skill-loop.ts holds the pure decision (buildLoadSkillFollowup) with tests; the whole path is inert unless AI_BOT_DELEGATION_SECRET is set, so behavior is unchanged where delegation is off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ol-in-ai-bot-body-references

jurgenwerk

[Claude Code 🤖] Self-review — bugs/risks I can substantiate, worst first.

1. (High, security) No single-human-room guard before minting delegation

main.ts runs the loadSkill loop whenever delegatedRealmSessions.enabled, with onBehalfOf: senderMatrixUserId (main.ts:597). In a room with more than one human, the bot mints a read token for whoever sent the message and reads their realms — the confused-deputy case the security design calls out. It's gated by the secret and the guard is a known follow-up, but this must land before the secret is enabled anywhere. Right now nothing stops it.

2. (Medium, correctness) A rejected delegated token is never invalidated

executeLoadSkill (lib/load-skill.ts) returns an error on a non-2xx file response but never calls delegatedRealmSessions.invalidate(...). Tokens are cached for ~30 min; if the user's read is revoked inside that window, the cached token gets a 401/403 and every loadSkill for that (user, realm) keeps failing for up to 30 minutes. Should invalidate on 401/403 and retry once. Note the deps are Pick<…, 'getToken'>, so the fix also has to widen them to include invalidate.

3. (Medium) loadSkill calls can be silently dropped — no result, no error

response-state drops loadSkill from the emitted command requests, and the bot only executes it when buildLoadSkillFollowup returns a followup. But it returns [] when the model mixes loadSkill with a host command, and the loop stops at LOAD_SKILL_MAX_ROUNDS. In those cases the loadSkill call is neither executed nor handed to the host — the model "called a tool" and gets nothing back, so the turn can end with empty/confused content. At minimum it should feed back an error tool result (or stop advertising the tool once at the cap).

4. (Medium, UX — the untested path) Multi-round rendering into one Matrix message

The responder is reused across rounds and ResponseState.updateContent replaces latestContent. Round 1 (the loadSkill round) and round 2 (the answer) stream into the same message, so any visible round-1 content is overwritten and the message can shrink/reset mid-stream. This is the multi-round behavior I haven't exercised live (the single-round happy path was verified). Needs a real >1-round chat before trusting it.

5. (Low) `executeLoadSkill` follows redirects

Unlike the manual script (redirect: 'manual'), the executor uses default fetch, so a stray 30x is followed silently. A file GET shouldn't redirect, but worth pinning to surface surprises.

6. (Low) No size cap on fetched content

executeLoadSkill returns await response.text() unbounded; a large references/ file is injected wholesale into the next prompt. Fine for SKILL.md, risky for arbitrary reference files.

7. (Low) `generationId` reflects only the last round

Cost is summed across rounds (good), but generationId used for credit tracking is the last round's id only — minor attribution skew across a multi-round turn.

Nothing here is a crash on the normal (delegation-off) path — that stays byte-identical. #1 gates enabling the feature; #2/#3 are the substantive correctness fixes; #4 is the thing to verify live.

- Single-human-room guard: count non-bot members and only offer/run loadSkill when exactly one human is present, so the bot never reads a realm on an ambiguous "user"'s behalf in a shared room. The tool is no longer advertised there either (getResponse takes an explicit offerLoadSkill flag). - Invalidate-and-retry: a cached delegated token rejected with 401/403 (access revoked inside its staleness window) is dropped and re-minted once, instead of failing for the rest of the token's TTL. - Stop following redirects on the skill fetch (redirect: 'manual') and cap fetched content length so a large references file can't blow up the next round's prompt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jurgenwerk · 2026-06-29T08:44:37Z

[Claude Code 🤖] Fixes pushed for the review (numbering matches the review items above; written (n) so GitHub doesn't auto-link them as PRs — sorry about the earlier mangled links):

(1) single-human-room guard — counts non-bot members; loadSkill is neither offered nor executed unless exactly one human is in the room, so the bot never reads a realm on an ambiguous user's behalf. (getResponse now takes an explicit offerLoadSkill flag.)
(2) token invalidation — a cached token rejected with 401/403 is invalidated and re-minted once before failing, instead of being wedged for the rest of its TTL.
(5) redirects — the skill fetch uses redirect: 'manual'; (6) size cap — fetched content is bounded so a large references file can't blow up the next round's prompt.

New tests cover the retry and the cap; types/lint/responder suite green.

Still open (intentionally not patched here):

(3) — when the model mixes loadSkill with a host command, or exceeds the round cap, the loadSkill call is dropped rather than answered. Closing it cleanly needs a host-vs-bot tool-precedence decision, so I left it as a known limitation.
(4) — multi-round rendering into one Matrix message still needs a real >1-round chat to verify; the single-round path is confirmed.

- Remove the fetched-content length cap and the loadSkill round cap: the model loads what it asks for and loops until it stops, no arbitrary limits. - Rename the DelegatedRealmSession* symbols to DelegatedUserRealmSession* (and the manager field to delegatedUserRealmSessions) to say plainly that the session is delegated for a specific user on a realm. Code-only; the wire header values are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-29T10:07:22Z

Preview deployments

Host staging preview

Host production preview

Host Test Results

154 tests 154 ✅ 5m 41s ⏱️
1 suites 0 💤
1 files 0 ❌

Results for commit faf693f.

Realm Server Test Results

1 files ±0 1 suites ±0 10m 56s ⏱️ +40s
1 686 tests +7 1 686 ✅ +7 0 💤 ±0 0 ❌ ±0
1 765 runs +7 1 765 ✅ +7 0 💤 ±0 0 ❌ ±0

Results for commit faf693f. ± Comparison against earlier commit 417a669.

Discovery hands the bot full skill URLs (entry points are refs by realm URL), so the tool now takes the file `url` directly instead of a skill `name` + an optional references `path` it had to construct. `realm` stays — the delegated token is realm-scoped — and the url must be inside it (checked up front). Drops the skillFileUrl path-construction and its layout assumption. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

jurgenwerk · 2026-06-29T13:58:02Z

idea by @lukemelia: when AI requests a tool call, the ai bot can make a command request with a special property named something like handledByServer and then the host can know it doesn't have to execute it (because the ai bot already loaded the skill directly from the server). This is so that we can show a visual indicator in the chat history telling the user the necessary skills got loaded (will also be helpful to debug in case of problems)

lukemelia · 2026-06-29T14:11:06Z

idea by @lukemelia: when AI requests a tool call, the ai bot can make a command request with a special property named something like handledByServer and then the host can know it doesn't have to execute it (because the ai bot already loaded the skill directly from the server).

Or, even better might be executedBy: "ai-bot". A little more information dense, and could possibly help us when we scale to multi-user or multi-bot rooms.

lukemelia

Let's re-implement with the approach we discussed before merging.

lukemelia · 2026-06-29T14:49:49Z

One other question to consider: there is nothing specific to skills in the loadSkill tool AFAIK. Should it just be a readFile tool?

The bot-executed tool that pulls a file from a realm at tool-call time had no skill-specific logic, so rename it loadSkill -> readRealmFile: a generic, delegated, read-only realm file read. Keeps {realm, url}, the user-scoped token, and the in-process load-then-continue loop. Surface each read in the timeline as an executedBy: 'ai-bot' command request: the host records it but never executes it (skipped across validate/run/ auto-exec/stuck). The bot repurposes the in-flight "thinking" message into a loading marker and rotates to a fresh event for the answer, so the marker renders above the answer and animates applying -> applied (or invalid + reason on a failed read). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jurgenwerk · 2026-06-30T07:48:33Z

(codex) Review note: I think the current readRealmFile path is carrying too much special-case control flow.

The main issue I see is that mixed tool rounds are still lossy. readRealmFileOnlyToolCalls() only handles the round when every tool call is readRealmFile, while ResponseState also strips readRealmFile out of the host-visible command requests. So if the model emits readRealmFile plus a normal host command in the same round, the read is neither executed nor surfaced to the host.

I think the simpler shape is a single post-completion tool-call pass:

if ai-bot has a local executor for a tool, run it and append a tool result
otherwise emit a normal host command request
keep executedBy: "ai-bot" only as timeline/debug metadata, not as a separate control path

That would remove most of the branching between read-realm-file-loop.ts and main.ts, and it seems closer to the re-implementation direction already being discussed here.

…opping A round mixing readRealmFile with a host-dispatched command used to drop the read silently: the all-or-nothing gate ran nothing, and ResponseState strips readRealmFile from host-visible command requests, so it was neither executed nor surfaced. Replace the gate with classifyToolCalls() -> { botToolCalls, hostToolCalls } and branch explicitly in the generation loop: - bot-only: run them in-process and loop (marker applying -> done, then answer) - host-only: the normal host-command path - mixed: don't loop; emit a visible failed marker for the reads ("readRealmFile cannot be mixed with host-dispatched commands in the same round") and let the host commands proceed Bot tools are same-turn, host tools next-turn; keeping those time models separate (rather than a fake unified dispatch) avoids dragging history reconstruction into this change. executedBy: 'ai-bot' stays as timeline/debug metadata, not a control path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The action bar appears above the composer while a diff or command is applying; its top border drew a stray horizontal line above the bar. Remove just the top edge — the side and bottom border still tuck it into the composer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Spell out the reasoning in the doc comment: reusing the current (Thinking) event keeps the marker in this turn's earliest timeline slot, so it renders before the answer that streams into the rotated fresh event. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A readRealmFile marker is an executedBy command the host never runs, but buildMessageCommand still resolved it through the skill list — an `await store.get` per enabled skill — which left the marker pill blank for ~a second while those reads ran. Short-circuit executedBy commands before the skill lookup and build them directly (no codeRef, no approval, status straight from the result event), so the pill paints immediately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ol-in-ai-bot-body-references

Per review: the host skip checks used `if (command.executedBy)` (truthiness), which would wrongly skip a command whose executedBy names a different actor — e.g. 'host', a natural future value for host-run commands. Compare against AI_BOT_EXECUTOR at all six sites so the host only skips what ai-bot ran itself, and a 'host' (or any other) executor is evaluated normally. Adds a test that a non-ai-bot executor isn't short-circuited. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Standardize the nomenclature for the timeline element that shows a bot-executed tool call (readRealmFile) and its result. Drops the informal "pill" and the ad-hoc "marker" in favor of "command-result indicator" throughout: methods beginCommandResultIndicator / sendCommandResultIndicator, the indicatorEventId locals, comments, and test names. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jurgenwerk · 2026-06-30T11:43:13Z

@lukemelia the implementation should now be more in line with what we discussed - I also attached a video in the PR description

jurgenwerk · 2026-06-30T14:30:53Z

@lukemelia regarding the 2 different flows, it's because of an ordering problem

The ordering problem was where bot does the read and the answer in one turn, and that turn has a single "Thinking…" bubble that the answer streams into. When the bot also tried to show a "Read file…" command request/result for the read, that landed below the answer, which looked weird and out of place

Host commands get correct ordering for free: the bot ends its turn, the host runs the command and posts the command result, and that starts a new turn = a new message. So the request, result, and continuation naturally land in order.

On the other hand, readRealmFile stays in one turn: the bot runs it itself and keeps writing the answer in the same turn. So we replace the "Thinking…" message with the command request(s), post the command result(s), and push the answer into a new message after manually arranging the order

lukemelia · 2026-06-30T14:43:00Z

The ordering problem was where bot does the read and the answer in one turn

I'm going to try creating a patch to this PR we can review that uses a multi-turn approach, keeping host-executed commands and aibot-executed commands more similar to each other and how they are treated my ai-bot's main processing engine. I think it could simpilify things a lot.

…and path Re-architect readRealmFile so the bot no longer runs reads in an inline, same-turn loop. Instead it surfaces each read as an `executedBy: 'ai-bot'` command request, fetches the file, uploads it to the Matrix media repo, and posts a command-result event carrying the file as `data.attachedFiles`. The existing host-command result path then reconstructs the content on the next turn (via the same attachment-download the host `readFile` uses), and the bot's own result event re-triggers generation. Because reads now resolve next-turn like host commands, the two timing models collapse into one: an answer may freely mix reads and host commands, so the mixed-round rejection, the inline `for(;;)` loop, the "thinking"-event indicator rotation, and the in-memory followup splicing are all deleted. Uploads dedupe by content hash, so the same skill read repeatedly is stored in Matrix once; keying on the hash keeps it version-correct (changed content re-uploads). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jurgenwerk · 2026-06-30T16:20:35Z

I noticed another a real problem in this implementation - the skill content is not persisted, and it's only available for the current turn.

The bot fulfills a readRealmFile read by posting a command-result event, which is meant to re-trigger the bot for the continuation turn. Two things prevented that continuation from running: - Fulfillment happened while the room lock was still held. The re-trigger has to acquire that lock, so it was dropped. Fulfill after the lock is released instead. - The re-trigger runs on the result event's local echo, before the homeserver has indexed it into /messages. getRoomEvents is a server fetch, so the continuation's history missed the just-posted result and the read looked unresolved (shouldRespond=false). Splice the in-hand result event into the history when the fetch didn't include it. Host command results arrive via sync (already server-side), so this only affects the bot's own results. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-attachment ai-bot: fulfill readRealmFile via Matrix attachments on the host-command path

jurgenwerk · 2026-06-30T17:01:31Z

Working now:

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: faf693f9fe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-30T17:07:08Z

+  await publish(deps, {
+    msgtype: APP_BOXEL_COMMAND_RESULT_WITH_NO_OUTPUT_MSGTYPE,
+    commandRequestId,
+    failureReason: error,


Surface readRealmFile failures to the model

When a realm read fails (for example 404, forbidden, or malformed arguments), this stores the useful reason only in failureReason on a no-output command result. The prompt reconstruction path for ordinary tool results ignores failureReason and emits only Tool call invalid. (packages/runtime-common/ai/prompt.ts), so the continuation model cannot see why readRealmFile failed or choose a different URL/access path. Please include the error in the tool-result content for this command, or teach prompt reconstruction to surface failureReason for these results.

Useful? React with 👍 / 👎.

lukemelia

Let's re-implement with the approach we discussed before merging.

jurgenwerk and others added 4 commits June 26, 2026 11:50

Drop tracker IDs from loadSkill comments

89558f1

Describe what the loadSkill module does as timeless fact rather than citing issue identifiers, per the evergreen-comments convention. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into cs-11554-loadskill-to…

dbc5b00

…ol-in-ai-bot-body-references

jurgenwerk commented Jun 29, 2026

View reviewed changes

jurgenwerk requested a review from Copilot June 29, 2026 11:21

Copilot AI reviewed Jun 29, 2026

jurgenwerk requested a review from a team June 29, 2026 11:32

jurgenwerk marked this pull request as ready for review June 29, 2026 11:32

jurgenwerk requested a review from lukemelia June 29, 2026 11:32

habdelra approved these changes Jun 29, 2026

View reviewed changes

lukemelia requested changes Jun 29, 2026

View reviewed changes

jurgenwerk and others added 2 commits June 30, 2026 10:03

jurgenwerk changed the title ~~ai-bot: loadSkill — pull a skill's instructions on demand~~ ai-bot: the assistant can read a realm file on its own, mid-answer (readRealmFile) Jun 30, 2026

jurgenwerk and others added 2 commits June 30, 2026 10:56

jurgenwerk changed the title ~~ai-bot: the assistant can read a realm file on its own, mid-answer (readRealmFile)~~ ai-bot: the assistant can read a skill directly from the realm server Jun 30, 2026

Merge remote-tracking branch 'origin/main' into cs-11554-loadskill-to…

65dad57

…ol-in-ai-bot-body-references