Fix AI docs benchmark: fail loudly, cache prompts, add judge model #1202
Open
dkijania wants to merge 3 commits into
Conversation
The scheduled Benchmark-LLMs-Docs workflow has been silently scoring 0% on every run since at least early April: the `ANTHROPIC_API_KEY` secret is out of credits, every API call returns 400, and the script catches each per-question error, records score=0, and exits 0. CI then reports the job "successful" with a 0/30 results file.

Changes:

1. Track per-question API errors. If more than 50% of questions error, exit 2 with a clear FAILED message that names the likely causes (credits, model availability). The credit-exhaustion case now turns the workflow red instead of green.
2. Wrap the docs system prompt in a cache-controlled content block. The same ~50k-token system prompt is re-sent for all 30 questions per source × 3 sources, so prompt caching cuts repeated input costs roughly 10x. Token usage (input / cache_read / cache_write / output) is now reported and saved in the results JSON.
3. Add `--judge-model` (default `claude-opus-4-7`), separate from the answering `--model` (default `claude-sonnet-4-6-20250514`). The same model judging itself biases scores upward; a stronger separate judge gives more honest grades on open-ended categories.
4. Surface truncation explicitly. When the docs corpus exceeds the 180k-char system budget, print a warning and record `truncated_context: true` in the results metadata, so "full" mode regressions stop being silent if the docs grow past the limit.
5. Reword f1 from "minimum recommended fee" to "current average fee": the 0.001 MINA value the question expects is described in the FAQ as the average, not a minimum (no minimum is documented).
6. Workflow inputs gain `judge_model` and `matrix.fail-fast: false`, so one source's failure no longer cancels sibling jobs.

The exhausted `ANTHROPIC_API_KEY` itself is a separate operational problem the secret owner needs to top up, but with these changes the next run will at least surface that loudly instead of pretending to succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
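The fail-loudly check in change 1 can be sketched as follows. This is a minimal illustration, not the script's actual code: the names `QuestionResult` and `checkFailureRate` are assumptions.

```typescript
// Sketch of the fail-loudly check: after all questions run, return
// exit code 2 when more than half of them hit API errors.
interface QuestionResult {
  id: string;
  score: number;
  error?: string; // set when the API call for this question failed
}

function checkFailureRate(results: QuestionResult[]): number {
  const errored = results.filter(r => r.error !== undefined).length;
  if (errored * 2 > results.length) {
    console.error(
      `FAILED: ${errored}/${results.length} questions hit API errors. ` +
      "Likely causes: exhausted credits or an unavailable model ID."
    );
    return 2; // caller passes this to process.exit
  }
  return 0;
}
```

The strict `>` keeps an exactly-50% run green, matching the "more than 50%" wording above; the credit-exhaustion case (all 30 questions erroring) always trips it.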
The first run of the patched workflow surfaced two issues that the new failure detection cleanly exposed:

1. The default model ID `claude-sonnet-4-6-20250514` is invalid (404 `not_found_error` from the API). The actual published ID is just `claude-sonnet-4-6`; the dated suffix isn't a real model.
2. The 180_000-char system-prompt budget was 9% of llms-full.txt (which is 2_046_837 chars). "full" mode comparisons were running against a tiny prefix of the corpus, not the full text. Sonnet 4.6 has a 200k-token context window; bumping to 750_000 chars (~190k tokens, leaving headroom for question + response) lets "full" mode actually represent ~37% of the corpus and matches the model's real capability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first full-matrix run revealed two interacting issues with Tier 1 rate limits (30k input tokens/min for Sonnet 4.6):

1. The 750k-char system budget produced ~190k-token cache writes on the first call of the "full" source: an instant 429.
2. The matrix ran all three sources in parallel, so even the small "llms" and "none" jobs got caught in the rate-limit window.

Changes:

- `maxSystemChars`: 750_000 → 100_000 (~25k tokens), leaving headroom under the 30k/min cap. Tier 2+ accounts can bump this back up.
- The workflow matrix gains `max-parallel: 1`, so sources run sequentially instead of competing for the same rate-limit bucket.
- `callAnthropic` now retries on 429 / 529, honoring Retry-After when present, with exponential backoff otherwise (5s, 10s, 20s). Caps at 4 attempts before failing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
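The retry policy in the last bullet can be sketched like this. The helper names and the bare `fetch`-style wrapper are illustrative; the actual `callAnthropic` signature may differ.

```typescript
// Honor Retry-After when the server sends it; otherwise back off
// exponentially (5s, 10s, 20s). Give up after 4 total attempts.
function retryDelayMs(attempt: number, retryAfter: string | null): number {
  if (retryAfter !== null) {
    const secs = Number(retryAfter);
    if (!Number.isNaN(secs)) return secs * 1000; // server-specified wait
  }
  return 5_000 * 2 ** (attempt - 1); // attempt 1 → 5s, 2 → 10s, 3 → 20s
}

async function callWithRetry(doCall: () => Promise<Response>): Promise<Response> {
  for (let attempt = 1; attempt <= 4; attempt++) {
    const res = await doCall();
    if (res.status !== 429 && res.status !== 529) return res; // success or non-retryable
    if (attempt === 4) break; // out of attempts
    const ms = retryDelayMs(attempt, res.headers.get("retry-after"));
    await new Promise(resolve => setTimeout(resolve, ms));
  }
  throw new Error("Anthropic API still rate-limited after 4 attempts");
}
```

Retrying only 429 (rate limit) and 529 (overloaded) matters here: the 400 credit-balance error from the original failure mode is not retryable and should still fail fast.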
This was referenced May 10, 2026
Summary
The scheduled `Benchmark-LLMs-Docs` workflow has been silently producing 0% on every run since at least early April. The `ANTHROPIC_API_KEY` secret is out of credits, every API call returns 400, and the script swallows each per-question error, records score=0, and exits 0. CI shows the job green with a clean 0/30 results file. Nobody noticed because the workflow appeared healthy.
Run logs from May 4 confirm:
```
ERROR: Anthropic API error 400: "Your credit balance is too low..."
OVERALL: 0.00/30 (0.0%)
```
…and exit code 0. The 28-41-second run times across the last 5 scheduled runs (real benchmarks take 5-10 min) were the only outward sign.
What this PR fixes
The exhausted API key is a separate operational problem the secret owner has to top up. This PR ensures the next time the key runs dry, the workflow turns red instead of pretending to succeed.
Files changed
Test plan
🤖 Generated with Claude Code