A lightweight LLM evaluation harness built for Rubrics RL research — fast, repeatable scoring of the many checkpoints an RL run produces. It wraps a single vLLM-compatible inference path with task runners for IF-EVAL, IFBench, WritingBench, HealthBench, Arena-Hard, and AlpacaEval.
Why it's fast. Evaluation is driven through vLLM's continuous batching: the
runner fires --num-threads concurrent requests at one server so the GPU stays
saturated end-to-end (a single H100 holds ~700 W throughout the run). On one
H100 a 4B checkpoint clears every task in roughly 15 minutes;
examples/shard_parallel_eval.sh shards a
single model across N GPUs to go faster still.
```bash
git clone https://github.com/THUAIS-Lab/eval_framework.git
cd eval_framework
uv venv && source .venv/bin/activate
uv pip install -e .
uv pip install vllm --torch-backend=auto
```

Or install the released package from PyPI (no clone needed):

```bash
pip install llm-eval-framework
# llm-eval-framework does NOT pull in vLLM (vLLM needs a torch backend matched
# to your CUDA). Install it separately depending on how you run the model:
#   • server mode: run `vllm serve` in a separate env/terminal (see Quick Start)
#   • --local mode: vLLM is imported in-process, so it must live in THIS env:
uv pip install vllm --torch-backend=auto   # or: pip install vllm
```

After installation the `eval-framework` command is available in the venv.
Most benchmark data ships with the package — Arena-Hard v2.0 questions/baselines
and the AlpacaEval GPT-4 baseline are bundled under tasks/arena_hard/data/ and
tasks/alpaca_eval/data/. IFBench is the one exception: its verifier source
is not redistributed here, so clone it only if you plan to run that task:
```bash
git clone https://github.com/allenai/IFBench .external/IFBench   # only for --tasks ifbench
```

Alternatively, point `--ifbench-dir` at an existing checkout. Every other task runs without this step.
The runner needs a model to call. There are two ways to provide one.
Start a vLLM server in one terminal:
```bash
vllm serve Qwen/Qwen3-4B \
    --served-model-name Qwen3-4B \
    --gpu-memory-utilization 0.95
```

Then point the runner at it. Begin with a 4-example smoke test to confirm the setup is wired up correctly (~30 s) before launching a full run:

```bash
eval-framework \
    --tasks ifeval \
    --model Qwen3-4B \
    --base-url http://localhost:8000/v1 \
    --max-examples 4 \
    --output-dir outputs/smoke
```

Drop `--max-examples` for the full task. The `--model` value must match the server's `--served-model-name`.
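A name mismatch between `--model` and `--served-model-name` can be caught before a run by listing what the server actually serves. The sketch below assumes the server exposes the OpenAI-compatible `/v1/models` endpoint; `served_model_names` and `fetch_models` are hypothetical helper names, not part of the framework.

```python
import json
from typing import List
from urllib.request import urlopen


def served_model_names(models_payload: dict) -> List[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_models(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/models and decode the JSON body (needs a live server)."""
    with urlopen(f"{base_url.rstrip('/')}/models", timeout=timeout) as resp:
        return json.load(resp)


# Offline example with a payload shaped like the server's reply.
sample_payload = {"object": "list", "data": [{"id": "Qwen3-4B", "object": "model"}]}
names = served_model_names(sample_payload)
print("Qwen3-4B" in names)
```

Against a running server, `served_model_names(fetch_models("http://localhost:8000/v1"))` would return the names that `--model` is allowed to take.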
--local loads vLLM in-process, so no separate server is needed — handy for a
quick one-off:
```bash
eval-framework \
    --tasks ifeval \
    --model Qwen/Qwen3-4B \
    --local \
    --max-examples 4 \
    --output-dir outputs/smoke
```

Use `--local` for convenience; use server mode for throughput. In server
mode the runner drives vLLM with `--num-threads` concurrent requests, which
vLLM batches continuously to keep the GPU fully utilised. This is what gets a
single card to ~700 W and a 4B checkpoint through all tasks in ~15 minutes.
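The client side of this pattern can be sketched as a thread pool that keeps a fixed number of requests in flight. This is a minimal illustration, not the framework's runner: `run_concurrent` and `fake_send` are hypothetical names, and the default of 64 threads is an arbitrary placeholder rather than the tool's default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_concurrent(prompts: Iterable[str],
                   send: Callable[[str], str],
                   num_threads: int = 64) -> List[str]:
    """Send every prompt through `send`, with up to `num_threads` in flight.

    Results are returned in the same order as `prompts`.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    """Stand-in for an HTTP call to the server's chat completions endpoint."""
    return prompt.upper()


results = run_concurrent(["a", "b", "c"], fake_send, num_threads=2)
print(results)
```

Because each worker blocks on its own request, the server always has several requests queued and can merge them into one running batch; the client needs no batching logic of its own.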
To sweep a whole RL run across many checkpoints and GPUs, use the ready-made
scripts in examples/ — see Multi-GPU Batch Evaluation below.
| Task | Judge needed? | Key flags |
|---|---|---|
| `ifeval` | No (rule-based) | `--ifeval-input` |
| `ifbench` | No (rule-based) | `--ifbench-dir`, `--ifbench-input` |
| `writingbench` | Yes | `--writingbench-query`, `--writingbench-write-excel` |
| `healthbench` | Yes | `--healthbench-data` |
| `arena-hard` | Yes | `--arena-hard-dir`, `--arena-hard-benchmark` |
| `alpaca-eval` | Yes | `--alpaca-eval-reference`, `--alpaca-eval-hf-dataset` |

- `--inference-only`: generate responses, skip judging. Judge later with `--judge-only`.
- `--judge-only`: score existing responses. Only supports writingbench / healthbench / arena-hard / alpaca-eval (ifeval and ifbench are rule-based and score during inference).
For RL experiments you typically need to evaluate many checkpoints across all benchmarks. We provide ready-to-use scripts in examples/:
| Script | Use case |
|---|---|
| `examples/shard_parallel_eval.sh` | Evaluate ONE model on all benchmarks: shards data across N GPUs for max throughput |
| `examples/batch_eval.sh` | Evaluate one training run: auto-detects checkpoints, schedules across N GPUs in rounds, judges, plots |
Usage:
```bash
# 1. Copy and edit the CONFIG section at the top of the script
cp examples/batch_eval.sh my_eval.sh
vim my_eval.sh   # edit CKPT_DIR, OUT_DIR, STEPS, etc.
# 2. Run
bash my_eval.sh
```

What the scripts handle automatically:

- Multi-round scheduling: if you have more checkpoints than GPUs, the script runs them in rounds and cleans up vLLM between rounds
- vLLM lifecycle: starts servers, waits for health checks, kills process groups after eval
- Judge batching: runs judge jobs in small batches to respect API rate limits (configurable `JUDGE_BATCH_SIZE`)
- Phase control: set `RUN_INFERENCE=0` / `RUN_JUDGE=0` / `RUN_PLOT=0` to skip phases (e.g. re-run judge only after fixing an issue)
- Logging: all vLLM and eval logs go to `LOG_DIR` for debugging; judge stderr (tqdm) is tee'd to terminal
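The multi-round scheduling idea can be illustrated with a few lines of Python. This is a sketch of the general technique (chunk the checkpoint list into groups of at most N, one checkpoint per GPU), not a transcription of what `batch_eval.sh` does internally; `plan_rounds` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def plan_rounds(steps: Sequence[int], n_gpus: int) -> List[List[Tuple[int, int]]]:
    """Split checkpoints into rounds of at most `n_gpus` (gpu_index, step) pairs."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    rounds = []
    for start in range(0, len(steps), n_gpus):
        chunk = steps[start:start + n_gpus]
        # Within a round, checkpoint i is pinned to GPU i.
        rounds.append([(gpu, step) for gpu, step in enumerate(chunk)])
    return rounds


rounds = plan_rounds([120, 240, 360, 480, 600], n_gpus=2)
for i, assignments in enumerate(rounds):
    print(f"round {i}: {assignments}")
```

With five checkpoints and two GPUs this yields three rounds, the last one using a single GPU. Servers from one round would be shut down before the next round starts.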
- Do NOT set `--max-model-len` unless you know exactly what you're doing. Let the model use its native context length (e.g. 32768 for Qwen3-4B). Setting it too low causes `VLLMValidationError` on long prompts.
- `--gpu-memory-utilization 0.95` is safe for H100s and maximizes KV cache.
- Increase `--num-threads` when GPU utilization is low and the serving backend has available capacity.
- Kill process groups, not just PIDs: `kill -- -${pid}` ensures all vLLM child processes are cleaned up. Follow with `pkill -f "vllm serve"` between rounds.
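For anyone driving the servers from Python rather than bash, the process-group tip translates roughly as follows. This is a POSIX-only sketch under stated assumptions: `start_in_own_group` and `kill_group` are hypothetical helpers, and a sleeping Python child stands in for `vllm serve`.

```python
import os
import signal
import subprocess
import sys


def start_in_own_group(cmd):
    """Launch `cmd` as the leader of a new session, so its pgid equals its pid."""
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc, sig=signal.SIGTERM, timeout=10.0):
    """Signal the whole process group, then reap the leader and return its code."""
    try:
        os.killpg(proc.pid, sig)  # equivalent of `kill -- -${pid}` in the shell
    except ProcessLookupError:
        pass  # the group has already exited
    return proc.wait(timeout=timeout)


# Placeholder child; a real script would launch the inference server here.
proc = start_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = kill_group(proc)
print(exit_code)
```

Signalling the group rather than the single PID matters because a server that spawns worker processes would otherwise leave them running and holding GPU memory.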
```
outputs/
├── step_120/
│   ├── run_0/                       # one subdir per sample (mean@N evaluation)
│   │   ├── ifeval/                  # summary.json, responses.jsonl
│   │   ├── ifbench/                 # summary.json, responses.jsonl, eval_results_*.jsonl
│   │   ├── writingbench/            # responses.jsonl, scores.jsonl, summary.json
│   │   ├── healthbench/             # responses.jsonl, scores.jsonl, summary.json
│   │   ├── arena-hard/              # model_answer/, model_judgment/, summary.json
│   │   └── alpaca-eval/             # model_answer/, model_judgment/, summary.json
│   ├── run_1/ ...                   # up to run_{N-1}
│   ├── ifeval/summary_agg.json      # aggregated mean / std / sem / per_run
│   ├── healthbench/summary_agg.json
│   └── ...
├── step_240/
│   └── ...
└── plots/
    ├── ifeval.png
    ├── ifbench.png
    ├── healthbench.png
    ├── writingbench.png
    ├── arena-hard.png
    ├── alpaca-eval.png
    └── all_tasks.png
```
run_k/ holds the k-th sample's raw artifacts; summary_agg.json at the step
root is what plotting consumes. With N=1 everything still works but error bars
collapse to zero width.
batch_eval.sh runs each checkpoint N times per task and then aggregates.
Because the same live vLLM server handles all N samples, prefix caching
amortises prefill — wall time is roughly decode(N)×, not N× cold starts.
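The aggregation step reduces N per-run scores to the mean / std / sem / per_run fields stored in `summary_agg.json`. The sketch below shows one plausible way to compute them; it assumes the sample standard deviation (N-1 denominator), which may differ from what `aggregate_runs.py` uses, and `aggregate_scores` and `ci95_halfwidth` are hypothetical names.

```python
import math
import statistics
from typing import Dict, List


def aggregate_scores(per_run: List[float]) -> Dict[str, object]:
    """Summarise one task's per-run scores as mean, std, sem and the raw list."""
    n = len(per_run)
    if n == 0:
        raise ValueError("need at least one run")
    mean = statistics.fmean(per_run)
    # Assumption: sample std (N-1). A single run has no spread, so use 0.0.
    std = statistics.stdev(per_run) if n > 1 else 0.0
    sem = std / math.sqrt(n)
    return {"mean": mean, "std": std, "sem": sem, "per_run": list(per_run)}


def ci95_halfwidth(sem: float) -> float:
    """Half-width of a normal-approximation 95% interval: 1.96 * SEM."""
    return 1.96 * sem


agg = aggregate_scores([0.70, 0.72, 0.74, 0.76])
single = aggregate_scores([0.71])
print(agg["mean"], agg["sem"], ci95_halfwidth(agg["sem"]))
```

With a single run both `std` and `sem` are zero, which matches the note that error bars collapse to zero width when N=1.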
Per-task defaults (override with env vars):
| Task | Default N | Why |
|---|---|---|
| `ifeval` / `ifbench` | 8 | Rule-based scoring, cost is only GPU decode |
| `healthbench` | 8 | Rubric-based, judge cost 8× but gives honest error bars |
| `writingbench` | 4 | Large rubric per prompt; 4 samples is usually enough |
| `arena-hard` / `alpaca-eval` | 1 | These already report internal bootstrap CI; extra sampling rarely helps |

Override any of them:

```bash
N_SAMPLES_HEALTHBENCH=4 N_SAMPLES_WRITINGBENCH=1 bash examples/batch_eval.sh
```

Set them all to 1 to reproduce the original single-run behavior.
After inference + judge + aggregate, combine eval results from any set of
checkpoints into training curves. Steps that carry a summary_agg.json get
error bars automatically; those without fall back to a plain line.
```bash
python tools/plot_training_curves.py \
    --runs "run_a=outputs/run_a" \
    --runs "run_b=outputs/run_b" \
    --name-pattern "run_a=step_{step}" \
    --name-pattern "run_b=step_{step}" \
    --steps "120,240,360,480,600" \
    --tasks "ifeval,ifbench,healthbench,writingbench,arena-hard,alpaca-eval" \
    --plot-dir outputs/plots \
    --show-errorbar ci95   # ci95 (1.96·SEM) | sem | std | none
```

`batch_eval.sh` runs `aggregate_runs.py` during its plotting phase. To aggregate
manually:
```bash
python tools/aggregate_runs.py \
    --out-dir outputs/run_a \
    --steps 120,240,360,480,600 \
    --tasks ifeval,ifbench,healthbench,writingbench,arena-hard,alpaca-eval \
    --n-samples ifeval=8,ifbench=8,healthbench=8,writingbench=4,arena-hard=1,alpaca-eval=1
```

Compare scores from different judge models:
```bash
python tools/judge_compare.py \
    --judges flash=outputs/qwen3-4B \
    --judges plus=outputs/qwen3-4B-judge-qwen-plus \
    --out outputs/judge_compare.json
```

When running many judge jobs in parallel (e.g. 5 background `eval-framework` processes), all
remote API requests share a file-lock-based global throttle to prevent 429 rate-limit errors.
| Env var | Default | Description |
|---|---|---|
| `MIN_INTERVAL_S` | `0.005` (≈200 QPS) | Minimum interval between consecutive API requests across all threads/processes |
| `EVAL_THROTTLE_STATE_PATH` | `/tmp/eval_framework_global_throttle.state` | Shared state file path; processes using the same path share one throttle |

```bash
export MIN_INTERVAL_S=0.01   # ~100 QPS global cap
export EVAL_THROTTLE_STATE_PATH=/tmp/eval_framework_global_throttle.state
```

Set `MIN_INTERVAL_S=0` to disable throttling entirely.
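To make the mechanism concrete, here is a minimal sketch of how a file-lock-based throttle of this kind can work. It is an illustration of the general technique, not the framework's own code: the `throttle` function and the state-file format (a single timestamp) are assumptions, and `fcntl.flock` makes it POSIX-only.

```python
import fcntl
import os
import tempfile
import time


def throttle(state_path: str, min_interval_s: float) -> None:
    """Block until at least `min_interval_s` has passed since the last caller."""
    if min_interval_s <= 0:
        return  # throttling disabled
    fd = os.open(state_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # exclusive across threads and processes
        raw = os.read(fd, 64).decode().strip()
        last = float(raw) if raw else 0.0  # empty file: nobody has called yet
        wait = last + min_interval_s - time.time()
        if wait > 0:
            time.sleep(wait)
        # Record this request's timestamp for the next caller.
        os.lseek(fd, 0, os.SEEK_SET)
        os.ftruncate(fd, 0)
        os.write(fd, repr(time.time()).encode())
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


# Demo on a throwaway state file: three calls, 50 ms apart at minimum.
state = os.path.join(tempfile.mkdtemp(), "throttle.state")
start = time.time()
for _ in range(3):
    throttle(state, 0.05)
elapsed = time.time() - start
print(f"{elapsed:.3f}s")
```

Because the sleep happens while the lock is held, callers queue up behind one another, and any process that opens the same state path shares the same rate cap. A caller would typically read the interval and path from `MIN_INTERVAL_S` and `EVAL_THROTTLE_STATE_PATH` via `os.environ`.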
- `--output-dir` controls where responses/scores/summaries go. With `--tasks`, output is written to `<output-dir>/<task>/`.
- If you set `--served-model-name` in `vllm serve`, pass that same name via `--model`.
- IFBench test data is bundled at `tasks/ifbench/data/IFBench_test.jsonl`. The AllenAI verifier source resolves from `.external/IFBench` unless you pass `--ifbench-dir`.
- Arena-Hard questions and baselines (`o3-mini-2025-01-31`, `gemini-2.0-flash-001` for v2.0) are bundled at `tasks/arena_hard/data/`. Falls back to `.external/arena-hard-auto` if present. Override with `--arena-hard-dir` to use a custom repo (e.g. a newer bench version).
- AlpacaEval reference outputs auto-download from HuggingFace. Override with `--alpaca-eval-reference`.
- IFBench also needs `emoji` + `syllapy` installed (included in `pyproject.toml` deps). `setuptools<81` is pinned because `syllapy` depends on `pkg_resources` which was removed in setuptools 82.
The framework code is released under Apache-2.0. Bundled benchmark assets remain under their original upstream licenses and citation requirements. Before redistributing modified benchmark data, check the upstream projects for the current license and attribution terms.