Route every coding task to the cheapest capable, policy-allowed lane — local-first, content-free, and honest about what it saves you.
TokenMaxed is a router for coding agents. You already pay flat-rate for tools like Claude Max and a ChatGPT/Codex subscription, and you may have a capable model running locally. TokenMaxed spends that already-paid, flat-rate capacity first, falls back to metered APIs only when it has to, and shows you — in real dollars — what you actually spent and how much metered API cost you avoided (with the all-frontier comparison kept as a clearly-labeled hypothetical, not the headline).
It is local-first: the routing brain, your prompts, and your code stay on your machine. Any hosted feature added later transmits only content-free metadata you explicitly opt into.
New here? Start with Claude Code → Use in Claude Code — install the plugin, run
/tokenmaxed:setup, done.
- Subsidy capture. Subscriptions are flat-rate; their marginal cost is ~$0 until you hit caps. TokenMaxed defaults to that capacity before burning metered API dollars.
- Data minimization (the moat). Trusted lanes (Claude, Codex, local) can see your repo and tools. Untrusted lanes receive only a scrubbed, bounded, no-tool sub-request — never your repo, tokens, or paths.
- Honest accounting. The headline is the finance-grade number — what you actually spent and the metered dollars avoided; the all-frontier baseline (every task on the top model) is shown too but clearly labeled a hypothetical, never the headline. We never claim caps don't exist.
v0 — early but usable. The portable routing brain (@tokenmaxed/core), the
data-minimization/policy gate + manager review, the tokenmaxed CLI, and the
Claude Code plugin are in place; broadening lane coverage and other host
adapters come next. APIs may still change. Built in small, reviewed commits in
the open.
A portable core with thin adapters around it:
packages/
  core/    # the routing brain — pure, host-agnostic, no I/O, no network
           #   route    · decide the cheapest capable lane (pure function)
           #   registry · load locally-configured lanes
           #   price    · canonical savings math
           #   ledger   · append-only, content-free local event log
  mcp/     # the MCP server exposing core to hosts (thin bridge)
  plugin/  # the Claude Code adapter: bundled server, /tokenmaxed:* skills, hooks
Privacy invariant (absolute): No prompt or code content ever leaves your machine to a TokenMaxed-hosted backend. Downstream model lanes receive only minimized, policy-gated payloads. The local event log is content-free by construction (integers, enums, model ids — never text), which is also what lets an optional web dashboard be added later as a pure forwarder, with zero schema changes and nothing new leaving the machine.
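To make "content-free by construction" concrete, here is a minimal sketch of what such a log record could look like. The `LedgerEvent` shape, its field names, and the id pattern are illustrative assumptions, not the actual @tokenmaxed/core schema; the point is only that a record built from integers, enum-like literals, and short ids has nowhere to put prompt or code text.

```typescript
// Hypothetical event shape (not the real schema): every field is an integer,
// an enum-like literal, or a short id. No field can carry prompt or code text.
type Verdict = "pass" | "needs-rework" | "fail";

interface LedgerEvent {
  ts: number;        // epoch milliseconds
  laneId: string;    // e.g. "codex-cli"
  modelId: string;   // e.g. "gpt-5.5"
  category: string;  // task category id, e.g. "bugfix"
  tokensIn: number;
  tokensOut: number;
  verdict: Verdict;
}

// Ids are limited to a short identifier-like alphabet, so a string field
// cannot smuggle prose, paths, or source code into the log.
const ID_PATTERN = /^[a-z0-9][a-z0-9._@:-]{0,63}$/i;
const VERDICTS: readonly string[] = ["pass", "needs-rework", "fail"];

function isContentFree(e: LedgerEvent): boolean {
  const ints = [e.ts, e.tokensIn, e.tokensOut];
  const ids = [e.laneId, e.modelId, e.category];
  return (
    ints.every((n) => Number.isInteger(n) && n >= 0) &&
    ids.every((s) => ID_PATTERN.test(s)) &&
    VERDICTS.includes(e.verdict)
  );
}

// Serialize one event as a single JSONL line, refusing anything that fails the check.
function toLedgerLine(e: LedgerEvent): string {
  if (!isContentFree(e)) throw new Error("refusing to log a non-content-free event");
  return JSON.stringify(e) + "\n";
}
```

Because every field is already metadata, a forwarder could ship these lines as they are, which matches the "zero schema changes" claim above.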
- Node.js >= 22.18 (the test suite runs TypeScript directly via Node's
built-in type stripping, which is enabled by default from 22.18 — no extra
test runner). A tsc build emits plain JavaScript for publishing/consumption.
👉 First time here? This is where to start. Using TokenMaxed in Claude Code is
three steps: install the plugin, run /tokenmaxed:setup, then code as usual.
(Requires Node.js ≥ 22.18.)
git clone https://github.com/TolyK/TokenMaxed.git && cd TokenMaxed
npm install
npm run build:plugin # bundle the self-contained plugin server
claude --plugin-dir packages/plugin # load it into Claude Code for this session
A marketplace install (claude plugin install tokenmaxed@…) ships with the first published release. Until then use --plugin-dir (add it to your Claude Code settings to load the plugin every session).
Inside Claude Code, run:
/tokenmaxed:setup
It creates your config at ~/.tokenmaxed/lanes.yaml and policy.yaml from
starter templates (it never overwrites an existing file), validates it, and
prints what's enabled and what to do next. That's the whole required setup.
Just code as usual — there's no separate command to "run" TokenMaxed. When a
step is a bounded, self-contained subtask (boilerplate, codegen, docs, a
mechanical refactor, an isolated bugfix), Claude offloads it to the cheapest
capable lane on its own, guided by the bundled route skill — it's Claude's
judgment call, not a background daemon, so you can also nudge it ("offload this
to a cheaper lane") or drive everything by hand:
| Command | What it does |
|---|---|
| `/tokenmaxed:setup` | create/validate config and show status |
| `/tokenmaxed:summary` | at-a-glance: 24h/7d/lifetime usage + metered $ avoided, your lanes + the active reviewer |
| `/tokenmaxed:savings [7d]` | savings from the local ledger |
| `/tokenmaxed:tokens [by lane]` | token usage (by model or lane) |
| `/tokenmaxed:why <category>` | preview which lane would handle a category — nothing runs |
| `/tokenmaxed:review` | manager review of your current working-tree changes |
| `/tokenmaxed:status` · `/tokenmaxed:on` · `/tokenmaxed:off` | show / enable / disable routing for this project |
| `/tokenmaxed:prefer <lane>` · `/tokenmaxed:prefer off` | temporarily favor one configured lane (any vendor, CLI or API) over normal routing — e.g. to push a sprint's work to a cheaper subscription while another's credits run low; clears with off. Honored only when that lane is eligible, available, and capable for the task (else it falls back to normal routing). Persisted per project; no relaunch. |
| `/tokenmaxed:yolo` · `/tokenmaxed:yolo off` | turn YOLO mode on / off for this project (the --dangerously-skip-permissions analogue); see YOLO mode under optional features below. |
Once the plugin is loaded and /tokenmaxed:setup has run, you're done — just
code. The three steps above are the whole required path. Offload is
agent-driven: Claude invokes the bundled route skill to hand a suitable
subtask to the cheapest capable lane via the router_delegate tool — the trusted
subscription CLI lanes enabled by default (Codex, the cheaper-Claude lane) work
with no flags (a local Ollama is an opt-in template — flip blocked→full
once a server is running). The plugin's hooks don't route on their own; they only gate delegation
when routing is off (a deterministic backstop) and run the turn-end review-iterate
loop (ON by default when a reviewer lane exists; opt out with
TOKENMAXED_REVIEW_ON_STOP=false). The env flags below switch on the other
optional features.
A worker lane's one weakness is that it's blind to your repo — it can
hallucinate plausible-but-wrong facts it can't see (a model price, an enum value,
a test-fixture idiom). To prevent that, router_delegate takes an optional
files list of repo-relative paths: they're read verbatim (server-side,
path-confined to the project), then scrubbed + size-bounded + policy-gated by the
minimizer, so the lane copies real values instead of inventing them. Private-repo
files only reach a reader-trust lane with reader egress enabled; otherwise
they're dropped and the reply says which and why. Still review offloaded output —
visibility fixes facts, not logic.
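The "path-confined to the project" step can be pictured with a small check like the one below. This is a sketch under assumptions, not the plugin's actual code: `confineToProject` is a hypothetical helper, it covers only the path check (scrubbing, size bounds, and the policy gate are not shown), and a real implementation would also need to resolve symlinks before trusting the result.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve a repo-relative path against the project root.
// Returns the absolute path, or null when the request is absolute, names the
// root itself, or climbs out of the project with "..".
function confineToProject(projectRoot: string, requested: string): string | null {
  if (path.isAbsolute(requested)) return null;
  const root = path.resolve(projectRoot);
  const target = path.resolve(root, requested);
  const rel = path.relative(root, target);
  const escapes =
    rel === "" ||                        // the root directory, not a file inside it
    rel === ".." ||
    rel.startsWith(".." + path.sep) ||   // climbs above the root
    path.isAbsolute(rel);                // e.g. a different drive on Windows
  return escapes ? null : target;
}
```

A caller would drop any entry that comes back as null and report it in the reply, in the same spirit as the "which and why" behavior described above.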
Tandem routing (worker-first, full-access lane steps in for repo-tight work).
Some subtasks genuinely need live repo/tool/shell access — running the test suite,
coordinated multi-file edits, broad cross-file reasoning — and no amount of
attached files makes a blind worker able to do them. Two mechanisms keep those
on a capable lane without giving up worker offload everywhere else:
- access_need on router_delegate (worker-ok | repo-tight | auto, default auto). repo-tight routes straight to a lane that can actually act on the repo — the native host or an agentic CLI lane (a full answer-only or remote-API lane is excluded; it would only get prompt + attachments) — and when none is available the task simply runs on the host. This is an access axis, orthogonal to the repo_class/sensitivity data-egress policy below.
- Give-back. Left on auto, every untagged subtask still tries a worker; a worker that finds it can't finish without context it was never given replies with a sentinel, and TokenMaxed hands the task back to the host (recorded honestly as a fallback — real spend counted, no savings claimed, and it never skews the learned capability scores). So the cheap path is tried first and the host steps in only when a worker actually hits a wall.
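The two mechanisms above can be sketched as a pair of small pure functions. Everything named here is an assumption made for illustration: the `Lane` fields, the sentinel string, and the choice to treat worker-ok and auto identically are not taken from the real router, and the sketch ignores cost and capability ranking entirely.

```typescript
type AccessNeed = "worker-ok" | "repo-tight" | "auto";

interface Lane {
  id: string;
  canActOnRepo: boolean; // hypothetical flag: an agentic CLI lane with repo/tool access
  available: boolean;
}

// Hypothetical marker a worker returns when it lacks context it was never given.
const NEEDS_CONTEXT_SENTINEL = "NEEDS_REPO_CONTEXT";

// Choose where a subtask starts. Returns a lane id, or "host" for the native host.
function pickStart(need: AccessNeed, lanes: Lane[]): string {
  const usable = lanes.filter((l) => l.available);
  if (need === "repo-tight") {
    // Only lanes that can act on the repo qualify; otherwise the host runs it.
    return usable.find((l) => l.canActOnRepo)?.id ?? "host";
  }
  // "worker-ok" and "auto" both try a worker first in this sketch.
  return usable.find((l) => !l.canActOnRepo)?.id ?? "host";
}

interface Outcome {
  ranOn: string;
  fallback: boolean;       // true when the host had to take over
  savingsClaimed: boolean; // a give-back never claims savings
}

// Apply the give-back rule to a worker's reply.
function settle(laneId: string, reply: string): Outcome {
  const gaveBack = laneId !== "host" && reply.trim() === NEEDS_CONTEXT_SENTINEL;
  if (gaveBack) return { ranOn: "host", fallback: true, savingsClaimed: false };
  return { ranOn: laneId, fallback: false, savingsClaimed: laneId !== "host" };
}
```

Keeping the give-back as an explicit outcome, rather than a silent retry, is what lets the ledger record it as a fallback with no savings attached.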
The optional features are opt-in environment flags you set when you launch
Claude Code. In the shell they go before claude (they're environment
variables, not CLI arguments):
# Plain — trusted subscription CLI offload (Codex, cheaper-Claude) works out of the box:
claude --plugin-dir packages/plugin
# Common "turn the safe extras on" launch: open the safety gate (needs gitleaks)
# so API/BYOK worker lanes and full API lanes can run. (The review-iterate loop is
# already ON by default when a reviewer lane exists — no flag needed; opt out with
# TOKENMAXED_REVIEW_ON_STOP=false. Reader lanes need MORE than the gate —
# TOKENMAXED_READER_EGRESS, per-lane attestation, a policy allow rule; see below.)
TOKENMAXED_GATE_READY=true \
claude --plugin-dir packages/plugin
# Same, but also skip Claude Code's per-tool permission prompts (you trust the
# offloads to run unattended):
TOKENMAXED_GATE_READY=true \
claude --dangerously-skip-permissions --plugin-dir packages/plugin
⚠️ claude --dangerously-skip-permissions TOKENMAXED_GATE_READY=true does not work: anything after claude is passed to Claude Code (here it'd be read as an opening prompt), not exported to the environment. Env assignments must precede the command, as shown above. (To persist them instead, export the vars in your shell profile or set them in your Claude Code env settings.)
Each flag is described under Configure & extend below; combine whichever you want on one launch line. Note: a full CLI reviewer (e.g. Codex) needs no safety gate to run the (default-on) turn-end review; the gate is needed only for API/BYOK egress.
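As a reading aid, the flag defaults stated in this README can be summarized in one function. This is a sketch, not the plugin's parser: `readFlags`, the `Flags` shape, and the rule that only the literal string "true" (case-insensitive) switches a flag on are assumptions. The defaults themselves (review on stop ON, max rounds 5, everything else OFF) are the ones this document states.

```typescript
interface Flags {
  disabled: boolean;       // TOKENMAXED_DISABLE: kill-switch for the whole router
  gateReady: boolean;      // TOKENMAXED_GATE_READY: allow API/BYOK egress
  readerEgress: boolean;   // TOKENMAXED_READER_EGRESS
  escalate: boolean;       // TOKENMAXED_ESCALATE
  reviewOnStop: boolean;   // TOKENMAXED_REVIEW_ON_STOP: on unless set to "false"
  reviewMaxRounds: number; // TOKENMAXED_REVIEW_MAX_ROUNDS: default 5
}

type Env = Record<string, string | undefined>;

// Assumption: a flag counts as on only when its value is "true".
const isTrue = (v: string | undefined): boolean => v?.trim().toLowerCase() === "true";

function readFlags(env: Env): Flags {
  const rounds = Number.parseInt(env["TOKENMAXED_REVIEW_MAX_ROUNDS"] ?? "", 10);
  return {
    disabled: isTrue(env["TOKENMAXED_DISABLE"]),
    gateReady: isTrue(env["TOKENMAXED_GATE_READY"]),
    readerEgress: isTrue(env["TOKENMAXED_READER_EGRESS"]),
    escalate: isTrue(env["TOKENMAXED_ESCALATE"]),
    // Default-on: only an explicit "false" opts out.
    reviewOnStop: env["TOKENMAXED_REVIEW_ON_STOP"]?.trim().toLowerCase() !== "false",
    // Fall back to the documented default when the value is missing or invalid.
    reviewMaxRounds: Number.isInteger(rounds) && rounds > 0 ? rounds : 5,
  };
}
```

In a Node process this would be called as `readFlags(process.env)`. Passing the environment in as an argument keeps the function pure and easy to test.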
Typing the env flags and --plugin-dir every launch gets old. Two pieces make
it a one-word command:
1. Persist the env flags in your Claude Code settings (~/.claude/settings.json)
so they're always on and never typed. This file is strict JSON — no comments —
so add just the env keys:
{
  "env": {
    "TOKENMAXED_GATE_READY": "true",
    "TOKENMAXED_ESCALATE": "true"
  }
}
(TOKENMAXED_GATE_READY opens the safety gate so API/BYOK egress is allowed — for
both worker/reader lanes and full API lanes; TOKENMAXED_ESCALATE reworks/escalates
offloads that fail review instead of shipping them unreviewed. The turn-end
review-iterate loop is on by default — no flag — so it isn't listed here; set
TOKENMAXED_REVIEW_ON_STOP=false to opt out, or TOKENMAXED_REVIEW_MAX_ROUNDS to
change the rework-round bound.)
2. Alias the launch. Pick one of these — the same word can't be both a standalone command and an appended argument (in zsh the later definition wins):
# (a) STANDALONE — type `tmax` alone, from any folder (recommended):
alias tmax='claude --dangerously-skip-permissions --plugin-dir /ABS/PATH/TO/packages/plugin'
# → tmax
# (b) APPENDED — a zsh GLOBAL alias you tack onto `claude` (`alias -g`, zsh-only):
alias -g tmax='--plugin-dir /ABS/PATH/TO/packages/plugin'
# → claude tmax
# → claude --dangerously-skip-permissions tmax
Use an absolute --plugin-dir path so it works from any directory. With the
flags in settings (step 1), every form above launches fully configured — gate
open, the turn-end review-iterate loop on (its default), and offloads escalated
rather than shipped unreviewed.
(TOKENMAXED_GATE_READY in global settings is inert until the plugin is loaded;
once loaded it affects /tokenmaxed:why/setup and gates all API/BYOK egress
— worker, reader, and full API lanes. Drop it from settings and prepend it
per-launch if you'd rather opt into the gate explicitly each time.)
- Where config lives. The plugin reads user-owned ~/.tokenmaxed/lanes.yaml and policy.yaml — not the repo's config/, so a cloned repo can never introduce an executable lane. Edit ~/.tokenmaxed/lanes.yaml to add or trust lanes: provider CLIs (Codex, Gemini, …), a local Ollama, the cheaper-Claude lane, or a BYOK worker. (The tokenmaxed CLI instead uses in-repo config/.)
- BYOK API keys. A BYOK api lane names an authHandle; put its key in env var TOKENMAXED_KEY_<authHandle> (e.g. TOKENMAXED_KEY_OPENAI). Keys are never stored by TokenMaxed.
- Optional features, off by default unless noted (trusted CLI/local offloads work without any of them):
  - Untrusted worker lanes — install gitleaks and start Claude Code with TOKENMAXED_GATE_READY=true.
  - Review-iterate loop (ON by default when a reviewer lane exists) — at the end of a turn a trusted manager reviews all your changed code (tracked and untracked/new files; a very large/numerous untracked set is bounded for speed and any omission is flagged in the diff so a pass never reads as complete coverage); on a non-pass verdict Claude is told to rework and the change is re-reviewed, iterating until the reviewer passes. It is deterministic (a Stop hook — Claude can't forget to run it) and protected so the review actually happens without holding you up: a reviewer error/timeout doesn't pass silently — the review re-fires (retries) on the next turn until it succeeds, and the whole loop (reworks and retries) is bounded by TOKENMAXED_REVIEW_MAX_ROUNDS (default 5); only after that does it yield with the outstanding notes so even a persistent failure can't trap you. Opt out of reviewing entirely with TOKENMAXED_REVIEW_ON_STOP=false (or simply configure no reviewer lane — then nothing is reviewed). A full CLI reviewer (e.g. Codex) needs no safety gate; an API/BYOK reviewer needs TOKENMAXED_GATE_READY=true.
  - Quality escalation — when an offloaded result fails its manager review, retry it on a more capable lane (and ultimately give the task back to Claude rather than ship something that failed review); enable with TOKENMAXED_ESCALATE=true. The router_delegate outcome reports what happened ("accepted after rework", "accepted after escalation", a give-back when a reviewed result still failed, or — when no eligible manager is available — the result delivered unreviewed), and the per-offload escalation rate shows up in /tokenmaxed:savings.
  - Learned capability — let observed manager-review outcomes adjust routing over time; enable with TOKENMAXED_LEARN_CAPABILITY=true. Each lane's hand-assigned per-category capability score is treated as a prior; the recent pass/needs-rework/fail rate for that lane×category (recency-decayed, ~30-day half-life) shrinks it toward what's actually observed. A cheap lane that keeps passing earns more traffic; a once-best lane that starts failing loses it — and /tokenmaxed:why shows (learned: declared 0.70, n=12) when evidence moved a score. It moves slowly: the declared prior dominates until evidence accumulates (one or two reviews barely shift a score), an explicit capability: 0 opt-out is never resurrected, and the config file is never modified (the adjustment is computed in memory from the ledger). Caveat: review success is an empirical signal, not a true model-quality measure (it's confounded by task difficulty and reviewer strictness), and lanes that win routing accrue more samples — so this is a useful heuristic, not unbiased benchmarking.
  - Reader lanes (middle trust tier) — a vendor you trust with your code but not your secrets/shell. A reader lane receives bounded, secret-scanned repo-read context (no secrets, no shell, no tools, answer-only) so repo-aware work can offload without marking the vendor fully trusted. This deliberately sends (possibly private) repo code to that vendor — secret egress is fail-closed and scanner-gated, not proven impossible, and the vendor's terms govern code once it's in the prompt — so it is high-friction: selectable only with all of TOKENMAXED_GATE_READY=true (the safety gate, needs gitleaks) + TOKENMAXED_READER_EGRESS=true (global), repo_read_attestation: true on that lane, an API/BYOK lane (reader execution is API-only), and a policy allow rule for the repo. Results are flagged reader-derived and must not be re-delegated to a worker. First-class vendor lanes (Gemini, Kimi, GLM, MiniMax) ship as safe blocked templates in lanes.example.yaml — you opt each one up deliberately. Note the executor constraint: CLI lanes can only be full (or blocked); worker/reader are API/BYOK-only (the certified executors are HTTP). So a CLI vendor is full-or-nothing, while an API vendor can be worker/reader/full.
  - Tiered routing (start cheap, step up) — enable with TOKENMAXED_TIERED=true. Instead of maximizing capability, routing picks the cheapest lane whose effective capability clears a floor (TOKENMAXED_TIER_FLOOR, default 0.6; per-category overridable), so within a family (Haiku → Sonnet → Opus; minimax small → large) it starts on the smallest model that's good enough and steps up to a stronger same-family lane on a review failure (the escalation gate). Ties among floor-clearers break by cap-health, then real price (from the price table, so two subscription tiers are still distinguishable), then the lowest-capable lane that clears. A capability: 0 lane is never selected; if nothing clears the floor, routing falls back to maximize so it never fails. /tokenmaxed:why shows the tiered pick. Family grouping uses each lane's model_family.
  - ⚠️ YOLO mode (the --dangerously-skip-permissions analogue) — turns off the routing permission gates so EVERY configured worker/reader lane is selectable. Launch with TOKENMAXED_YOLO=true (session default), or flip it per project at runtime with /tokenmaxed:yolo (on) / /tokenmaxed:yolo off (the per-project setting overrides the env default and is persisted). When on, the router forces TOKENMAXED_GATE_READY and TOKENMAXED_READER_EGRESS open and waives the per-lane repo_read_attestation, the reader hard cap, and a force-trusted policy verdict — so a task routes to a worker/reader even on a private/sensitive/unknown repo. This means (possibly private) repo code may be sent to any configured vendor lane. Like --dangerously-skip-permissions still honoring deny rules, YOLO does not override two deliberate kill-switches — an explicit policy block rule and disabledLaneIds — and it does not disable the orthogonal protections that aren't routing permissions: the secret scanner still gates every payload (fail-closed), the lane/policy config is still read only from the user-owned ~/.tokenmaxed (the RCE guard), and executor certification still applies. /tokenmaxed:status and /tokenmaxed:why show a loud warning while it's on. Only enable it on code you are comfortable sending to every lane you've configured.
- Set TOKENMAXED_DISABLE=true to turn the whole router off (kill-switch) regardless of the flags above (including YOLO mode).
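The tiered-routing rule in the list above (cheapest lane that clears the floor, documented tie-breaks, fall back to maximize) can be sketched as follows. The `TierLane` fields are hypothetical stand-ins: `marginalCost` for the cost the router compares first, `capHealth` for subscription headroom, and `listPrice` for the price-table figure. The real router's inputs and weighting may differ.

```typescript
interface TierLane {
  id: string;
  family: string;       // corresponds to model_family
  capability: number;   // effective capability for the category, in [0, 1]
  marginalCost: number; // hypothetical: 0 for flat-rate lanes, metered cost otherwise
  capHealth: number;    // hypothetical: higher means more headroom before caps
  listPrice: number;    // hypothetical: real price from the price table
}

// Pick the cheapest lane whose capability clears the floor.
// Tie-breaks follow the documented order: cap-health, then real price,
// then the lowest-capable lane that still clears.
function pickTiered(lanes: TierLane[], floor = 0.6): TierLane | undefined {
  const eligible = lanes.filter((l) => l.capability > 0); // capability 0 is an opt-out
  const clearing = eligible.filter((l) => l.capability >= floor);
  if (clearing.length === 0) {
    // Nothing clears the floor: fall back to maximizing capability.
    return [...eligible].sort((a, b) => b.capability - a.capability)[0];
  }
  return [...clearing].sort(
    (a, b) =>
      a.marginalCost - b.marginalCost ||
      b.capHealth - a.capHealth ||
      a.listPrice - b.listPrice ||
      a.capability - b.capability,
  )[0];
}

// After a review failure, step up to the next stronger lane in the same family.
function stepUp(current: TierLane, lanes: TierLane[]): TierLane | undefined {
  return lanes
    .filter((l) => l.family === current.family && l.capability > current.capability)
    .sort((a, b) => a.capability - b.capability)[0];
}
```

With three same-family lanes that all clear the floor and tie on cost and cap-health, the sketch starts on the one with the lowest list price and only moves up when `stepUp` is invoked.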
A lane's model is a pinned string, so it can drift stale as a vendor's family
advances (e.g. minimax-m2 while the family reaches minimax-m3). TokenMaxed makes
that visible and gives you the choice:
- Track the latest — set model: <family>@latest on an api lane (e.g. minimax@latest). It resolves to the newest model TokenMaxed can price in that family, from the price table. Resolution is pure and adds no network call on the routing path; the shipped api vendor templates default to @latest.
- Pin deliberately — keep a concrete id (e.g. minimax-m2) to stay on a specific version; your choice is respected, never overridden. A pinned model that's priced in the table is still checked for staleness automatically (its family comes from the table metadata); for an unpriced or unknown-family pin, add model_family: <family> to the lane to enable the check.
- Staleness is shown — /tokenmaxed:status makes a one-time provider /models call (sends only the API key — no repo/task content), caches the result, and warns when a newer same-family model exists. The session-start summary then shows that warning from cache only (no per-launch egress). If the newer model isn't in your price table yet, the warning says so (a "pricing gap" — add its price to route it).
- Price-table metadata — config/prices.seed.json (schema_version 2) carries an optional family + released per model so @latest and staleness can order a family newest-first. CLI lanes (Codex/Gemini/Kimi) pin a concrete model the provider runtime selects — there's no model endpoint to auto-verify, so @latest is api-only.
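A minimal sketch of how `@latest` resolution and the staleness check could work as pure functions over price-table metadata. The `PriceEntry` shape is an assumption (the source only says each model may carry `family` and `released`), the function names are hypothetical, and the "acme" family and its dates are made up for the example.

```typescript
interface PriceEntry {
  model: string;
  family?: string;
  released?: string; // ISO date such as "2025-01-31", so string order is date order
}

// Sort comparator: newest release first; entries without a date sort last.
function newestFirst(a: PriceEntry, b: PriceEntry): number {
  const ra = a.released ?? "";
  const rb = b.released ?? "";
  return ra < rb ? 1 : ra > rb ? -1 : 0;
}

// Resolve "<family>@latest" to the newest priced model; return a pinned id unchanged.
function resolveModel(spec: string, table: PriceEntry[]): string {
  const suffix = "@latest";
  if (!spec.endsWith(suffix)) return spec; // a deliberate pin is respected
  const family = spec.slice(0, -suffix.length);
  const newest = table.filter((p) => p.family === family).sort(newestFirst)[0];
  if (newest === undefined) throw new Error(`no priced model in family "${family}"`);
  return newest.model;
}

// Return the id of a newer same-family model, or undefined when the pin is
// current or its family/release date is unknown.
function newerThan(pinned: string, known: PriceEntry[]): string | undefined {
  const self = known.find((p) => p.model === pinned);
  if (self === undefined || self.family === undefined || self.released === undefined) {
    return undefined;
  }
  const family = self.family;
  const newest = known.filter((p) => p.family === family).sort(newestFirst)[0];
  return newest !== undefined && newestFirst(newest, self) < 0 ? newest.model : undefined;
}
```

Both functions read only an in-memory table, which fits the claim that resolution adds no network call on the routing path. The one-time provider lookup would feed `known` from a cache.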
Prefer the command line or your own integration? The steps below cover the
tokenmaxed CLI (savings/token reports) and driving the routing brain (@tokenmaxed/core) directly. Using Claude Code? See Use in Claude Code above — that's the fastest path.
git clone https://github.com/TolyK/TokenMaxed.git
cd TokenMaxed
npm install
npm run build # compile @tokenmaxed/core so it can be imported by name
A lane is a way to run a task — a subscription CLI, a local model, or (later) a metered API. Copy the example and edit it for your machine:
mkdir -p config
cp config/lanes.example.yaml config/lanes.yaml
Each lane declares its kind, model, trust_mode, costBasis, provenance,
and optional per-category capability scores in [0, 1]. See
config/lanes.example.yaml for the full,
commented schema. Trusted subscription CLI lanes (Codex, the cheaper-Claude lane)
are enabled by default; a local Ollama and untrusted worker (BYOK API) lanes ship as
opt-in blocked templates — Ollama needs only a flip to full (plus a running
local server), while worker lanes also require opening the safety gate
(TOKENMAXED_GATE_READY=true with a secret scanner installed). That ordering is
enforced in code by the minimization/policy gate.
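Two of the constraints this README states about lanes, capability scores in [0, 1] and worker/reader trust being API-only, are easy to picture as a validation pass. The sketch below is illustrative only: `LaneConfig` is a cut-down, hypothetical shape and the `kind` values are guesses. The commented schema in config/lanes.example.yaml remains the authority.

```typescript
type Kind = "cli" | "api" | "local"; // hypothetical kind values
type TrustMode = "full" | "reader" | "worker" | "blocked";

interface LaneConfig {
  id: string;
  kind: Kind;
  model: string;
  trust_mode: TrustMode;
  capability?: Record<string, number>; // per-category scores in [0, 1]
}

// Return a list of human-readable problems; an empty list means the lane passes
// the two checks covered here.
function validateLane(lane: LaneConfig): string[] {
  const errors: string[] = [];
  for (const [category, score] of Object.entries(lane.capability ?? {})) {
    if (!Number.isFinite(score) || score < 0 || score > 1) {
      errors.push(`${lane.id}: capability.${category} must be within [0, 1]`);
    }
  }
  // Executor constraint: only API/BYOK lanes may run as worker or reader.
  const restricted = lane.trust_mode === "worker" || lane.trust_mode === "reader";
  if (restricted && lane.kind !== "api") {
    errors.push(`${lane.id}: trust_mode ${lane.trust_mode} requires an api lane`);
  }
  return errors;
}
```

Returning every problem at once, instead of throwing on the first, tends to make config errors quicker to fix. Whether the real loader does this is not stated here.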
import { routeDecide } from '@tokenmaxed/core';
import { loadLaneConfig } from '@tokenmaxed/core/node'; // file I/O lives in the Node adapter
// Load and validate your lanes (throws a clear error on a bad config).
const registry = loadLaneConfig('config/lanes.yaml');
// Decide which lane should handle a task of a given category.
const decision = routeDecide(
  { category: 'bugfix' },
  { lanes: registry.candidateLanes('bugfix') },
  {}, // policy — empty in v0
);
console.log(`${decision.laneId} — ${decision.reason}`);
// codex-cli — Selected codex-cli (gpt-5.5) for bugfix: capability 0.92 at subscription cost.
routeDecide is pure and deterministic: the same inputs always pick the same
lane, and decision.scores shows how every candidate ranked (useful for a
future why command).
After npm run build, the tokenmaxed command reads your local, content-free
ledger (~/.tokenmaxed/ledger.jsonl by default) and reports on it:
npx tokenmaxed savings # actual spend + metered $ avoided (headline) + baseline context + tokens
npx tokenmaxed tokens --by lane # full per-lane token breakdown (--by model is the default)
npx tokenmaxed outcomes # manager-review verdicts (pass/needs-rework/fail) + success rate per lane
npx tokenmaxed lanes # your configured lanes: trust mode, autonomy, roles, manager eligibility
npx tokenmaxed savings --period 7d # any command takes --period all|Nd|Nh
npx tokenmaxed help # full usage
TokenMaxed — savings (all time)
Actual API spend $0.00 — saved $139.50 (100.0% of the frontier-equivalent cost)
Baseline context: $139.50 avoided vs an all-frontier baseline (100.0%) — a hypothetical ceiling, not cash you'd otherwise have paid
Lanes: claude-native ×1, codex-cli ×1, ollama-llama3 ×1
Sensitive sends blocked: 0
Tokens (usage, not $): 2,800,000 in / 1,300,000 out / 4,100,000 total
claude-opus-4-7 2,000,000 / 1,000,000 / 3,000,000 (73.2%) reported
...
→ full breakdown: tokenmaxed tokens
The headline is the honest, finance-grade figure — what you actually spent and
the metered dollars avoided — while the all-frontier baseline (every task on the
top model) is demoted to a clearly-labeled hypothetical, never the headline.
The token block is explicitly a usage count (not dollars), with estimated figures
marked. The ledger fills as you route tasks (via the Claude Code plugin below, or
your own @tokenmaxed/core integration); until then the report says "No tasks
recorded yet", while tokenmaxed lanes works immediately off your
config/lanes.yaml.
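The distinction between the headline and the baseline can be worked through with a small calculation. This is a sketch under loudly stated assumptions, not the canonical savings math in @tokenmaxed/core: it assumes "metered dollars avoided" means what flat-rate usage would have cost at that same model's metered rate, it assumes prices are quoted in USD per million tokens, and the model names and prices are invented for the example.

```typescript
interface Usage {
  modelId: string;
  tokensIn: number;
  tokensOut: number;
  metered: boolean; // true when the task was actually billed per token
}

interface Price {
  inPerMTok: number;  // USD per million input tokens
  outPerMTok: number; // USD per million output tokens
}

function costOf(u: Usage, p: Price): number {
  return (u.tokensIn * p.inPerMTok + u.tokensOut * p.outPerMTok) / 1_000_000;
}

interface Summary {
  actualSpend: number;     // headline: metered dollars really spent
  meteredAvoided: number;  // headline: flat-rate usage priced at its own metered rate
  baselineAvoided: number; // hypothetical: versus running everything on the frontier model
}

function summarize(events: Usage[], prices: Record<string, Price>, frontierModel: string): Summary {
  const top = prices[frontierModel];
  if (!top) throw new Error(`no price for frontier model ${frontierModel}`);
  let actual = 0;
  let atOwnRate = 0;
  let atFrontier = 0;
  for (const e of events) {
    const own = prices[e.modelId];
    if (!own) throw new Error(`no price for ${e.modelId}`);
    const c = costOf(e, own);
    if (e.metered) actual += c;
    atOwnRate += c;
    atFrontier += costOf(e, top);
  }
  return {
    actualSpend: actual,
    meteredAvoided: atOwnRate - actual,
    baselineAvoided: atFrontier - actual,
  };
}
```

With one flat-rate task and one metered task on the same cheap model, `meteredAvoided` stays modest while `baselineAvoided` is much larger, which is why the README treats the second figure as context and not the headline.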
| Surface | Status | How |
|---|---|---|
| CLI (tokenmaxed) | available | the commands above, after npm run build |
| Claude Code plugin | available | claude --plugin-dir packages/plugin, then /tokenmaxed:setup |
| Other hosts (Codex, Gemini, Cursor, Kimi Code, Pi, …) | planned | same core, thin per-host adapters |
Setup is intentionally minimal: in Claude Code run /tokenmaxed:setup; for the
CLI, copy config/lanes.example.yaml and edit it.
npm test # run the test suite (TypeScript, no build needed)
npm run typecheck
npm run build # emit JavaScript to packages/*/dist
Contributions are welcome. Please read CONTRIBUTING.md and our Code of Conduct. For anything security- or privacy-sensitive, see SECURITY.md.
Two rules are non-negotiable and enforced in CI as they land:
- No content → network. Nothing derived from prompts or code may reach a network client.
- Honest savings. Every savings figure carries its assumptions.
MIT © TokenMaxed contributors