
Auto-generate llms.txt from sidebars + frontmatter #1203

Open
dkijania wants to merge 1 commit into main from dkijania/auto-generate-llms-index

Conversation

@dkijania (Member)

Summary

`static/llms.txt` has been hand-maintained and drifting since May 5. PR #1192 restructured the docs (exchange-operators → node-operators, new Rosetta layout, signer page consolidation), but `llms.txt` still reflects the pre-#1192 hierarchy. AI agents that fetch `llms.txt` as a discovery layer (per llmstxt.org) are seeing a stale view.

This was directly visible in the AI benchmark from #1202: the `llms` source scored 84.3%, with several specific misses (mempool size, transaction confirmation count, default GraphQL port), all facts that do exist in the docs but didn't make it into the hand-curated index.

What this changes

  • New `scripts/generate-llms-index.mjs` produces `static/llms.txt` from the same source of truth the docs site uses (core loop sketched below):
    • Hierarchy ← `sidebars.js`
    • Title ← `frontmatter.title`
    • Description ← `frontmatter.description`
  • Wired into `build`, `generate-llms-*`, and `check-llms-txt` npm scripts so neither file can drift again
  • The CI gate now checks both `llms.txt` and `llms-full.txt`
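For reference, a minimal sketch of what that core loop can look like. This is illustrative only: the helper names, the `docs/<id>.mdx` path mapping, and the gray-matter dependency are assumptions, not lifted from the actual script.

```js
// Sketch of the core loop in scripts/generate-llms-index.mjs.
// Assumptions (not verified against the actual script): doc IDs map 1:1 to
// docs/<id>.mdx, and gray-matter is available for frontmatter parsing.
import fs from "node:fs";
import matter from "gray-matter";

import sidebars from "../sidebars.js";

// Flatten a Docusaurus sidebar item into an ordered list of doc IDs.
function collectDocIds(item) {
  if (typeof item === "string") return [item];
  if (item.type === "doc") return [item.id];
  if (item.type === "category") return (item.items ?? []).flatMap(collectDocIds);
  return []; // link/html items carry no doc page
}

const entries = [];
const skipped = [];

for (const items of Object.values(sidebars)) {
  for (const top of items) {
    // Top-level sidebar categories become the audience-grouped sections.
    const section = top.type === "category" ? top.label : "Misc";
    for (const id of collectDocIds(top)) {
      const { data } = matter(fs.readFileSync(`docs/${id}.mdx`, "utf8"));
      if (!data.description) {
        skipped.push(id); // quality gate: warn instead of indexing
        continue;
      }
      entries.push({ section, id, title: data.title ?? id, description: data.description });
    }
  }
}

console.warn(`Skipped ${skipped.length} pages with no description frontmatter:`);
for (const id of skipped) console.warn(`  - ${id}`);
```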

Generator design

| Decision | Reason |
| --- | --- |
| Skip pages with no `description` frontmatter (219 today) | Quality gate — mostly o1js-reference auto-generated subpages that don't belong in a discovery-layer index. Generator prints them as warnings so authors can fill in what's worth surfacing |
| Audience-grouped top-level sections | Mirrors how AI agents triage queries (zkApp dev vs node operator vs exchange) |
| Skip "Participate" top-level entry | Community / process content, not actionable for AI agents |
| Append "Operator-facing facts" section | Hard-coded callouts for the high-signal exchange-FAQ specifics (mempool 3000, account creation fee 1 MINA, 15-block confirmation, GraphQL port 3085) that AI agents miss because they're buried inside FAQ anchor sections |

Output stats

  • 123 pages indexed across 8 sections
  • 24 KB — slightly above the typical 5-10 KB llms.txt budget. A follow-up could trim deep tutorial subpages (Berkeley archive migration has ~12 entries that arguably belong in llms-full.txt only)
  • 219 pages skipped, listed as warnings during build

Why this matters for AI discoverability (#1195)

`llms.txt` is the first thing AI agents fetch when they want to answer a Mina question. A stale or sparse `llms.txt` means agents miss real docs and fall back to training data — which is fine for evergreen facts but breaks down for anything operator-facing or recently changed. Auto-generating it removes the staleness class of failure entirely.

Test plan

  • `npm run generate-llms-index` produces a clean diff
  • `npm run check-llms-txt` passes locally
  • Spot-check that 5 random URLs in the new `llms.txt` resolve with HTTP 200 on docs.minaprotocol.com
  • After merge: rerun the AI benchmark (`gh workflow run benchmark-llms-docs.yml --repo MinaProtocol/docs2`) and confirm the `llms` source score moves on the affected questions (f2 confirmation, f4 mempool, f9 port, all missing from the old llms.txt)
  • Confirm CI's existing `check-llms-txt` gate fails as expected when a doc's frontmatter changes without regeneration
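The drift gate itself can be as simple as regenerating into a scratch file and comparing byte-for-byte with the committed copy. A hedged sketch follows; the `--out` flag is an assumption, and the real `check-llms-txt` script may work differently.

```js
// Hedged sketch of a check-llms-txt style gate: regenerate, then fail
// if the committed static/llms.txt differs from the fresh output.
import { execSync } from "node:child_process";
import fs from "node:fs";
import os from "node:os";
import path from "node:path";

const fresh = path.join(os.tmpdir(), "llms.txt");
// The --out flag is illustrative; the real generator may write in place.
execSync(`node scripts/generate-llms-index.mjs --out ${fresh}`, { stdio: "inherit" });

if (fs.readFileSync("static/llms.txt", "utf8") !== fs.readFileSync(fresh, "utf8")) {
  console.error("static/llms.txt is stale; run `npm run generate-llms-index` and commit the result");
  process.exit(1);
}
```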

🤖 Generated with Claude Code

Commit message:

The hand-maintained static/llms.txt has been drifting since May 5: it
predates PR #1192's exchange/node-operator restructure, references
moved URLs, and is missing operator-facing facts that integrators
search for. AI agents that fetch llms.txt as a discovery layer are
seeing a stale view of the docs.

Add scripts/generate-llms-index.mjs that produces llms.txt the same
way scripts/generate-llms-txt.mjs produces llms-full.txt — auto-built
from canonical sources on every build. Wire it into the build, the
generate-llms-* npm scripts, and the check-llms-txt CI gate so neither
file can drift again.

The generator:

- Walks sidebars.js for hierarchy (canonical TOC source of truth)
- Reads each .mdx's frontmatter for title and description
- Skips pages with no `description`, prints a warning listing them so
  authors can fill in what's missing (219 pages today, mostly the
  auto-generated o1js-reference subpages — those don't belong in a
  discovery-layer index anyway)
- Groups output by top-level sidebar category (audience-grouped:
  Network Upgrades / zkApp Developers / Mina Protocol / Node Operators
  / Exchange Operators / Developer Tools / Mina Security)
- Skips top-level "Participate" since it's community / process content
  not actionable for AI agents
- Appends an "Operator-facing facts" section that surfaces the
  high-signal exchange-FAQ specifics (mempool 3000, account creation
  fee 1 MINA, 15-block confirmation, GraphQL port 3085) which are
  buried inside FAQ pages that the model otherwise misses

Output is 24 KB / 123 pages / 8 sections — slightly above the
recommended llms.txt budget but workable. A follow-up could trim
deep tutorial subpages (Berkeley archive migration walkthrough has
~12 entries that probably belong in llms-full.txt only).

The new check-llms-txt now gates both files: any drift in either
file fails CI, mirroring the existing gate for llms-full.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs2 | Ready | Preview, Comment | May 10, 2026 10:40am |
