Public assets for the workshop "Judge the Judge: Building LLM Evaluators That Actually Work."
This repo is centered around the notebook in workshop/01_build_judge_with_gepa.ipynb.
It includes:
- the workshop notebook
- the processed airline policy dataset
- the core scripts used by the notebook
- curated experiment outputs and rubrics
- reproducible experiment entrypoints
- GEPA repo: https://github.com/gepa-ai/gepa
- GEPA paper: https://arxiv.org/abs/2507.19457
- Tau-bench repo: https://github.com/sierra-research/tau-bench
This repo uses uv.

```shell
uv sync
uv run jupyter lab
```

Then open: workshop/01_build_judge_with_gepa.ipynb
The notebook is the main workshop experience.
By default, it uses checked-in results for the baseline and optimized judge so it can run without API credentials.
That means you can:
- open the notebook
- inspect the dataset
- see the baseline failure mode
- load the optimized rubric
- compare seed vs optimized results
- inspect flipped examples and remaining failures
without rerunning expensive model calls.
The notebook walks through:
- problem framing: why naive LLM judges fail
- loading and inspecting the dataset
- baseline judge behavior
- GEPA optimization setup
- loading the optimized rubric
- before/after comparison
- failure analysis
- takeaways and practical recipe
```
.
├── workshop/
│   ├── 01_build_judge_with_gepa.ipynb
│   ├── core/
│   │   ├── annotate.py
│   │   ├── evaluate.py
│   │   ├── extract.py
│   │   ├── optimize.py
│   │   └── split.py
│   ├── data/
│   │   └── airline_policy_v0/
│   └── results/
│       ├── baseline/
│       ├── best/
│       └── slide_metrics/
├── experiments/
├── slides/
└── docs/
```
The workshop/ directory holds the main workshop assets.
Human-readable helper scripts used by the notebook and experiments:
- extract.py: build the dataset from raw Tau2-bench traces
- split.py: split by task to avoid leakage
- annotate.py: generate trace-level annotations
- evaluate.py: run a judge rubric on a dataset
- optimize.py: run GEPA prompt optimization for the judge rubric
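The split-by-task idea behind split.py can be sketched as follows. This is a minimal illustration, not the repo's actual implementation: the `task_id` field name, the split ratio, and the seeding are all assumptions.

```python
import random
from collections import defaultdict

def split_by_task(examples, val_fraction=0.3, seed=0):
    """Split examples so all traces for a given task land in the same split.

    `examples` is assumed to be a list of dicts with a "task_id" key;
    the real split.py may use different field names and logic.
    """
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task_id"]].append(ex)
    task_ids = sorted(by_task)
    random.Random(seed).shuffle(task_ids)
    n_val = max(1, int(len(task_ids) * val_fraction))
    val_tasks = task_ids[:n_val]
    train = [ex for t in task_ids[n_val:] for ex in by_task[t]]
    val = [ex for t in val_tasks for ex in by_task[t]]
    return train, val

# Toy data: five tasks with two traces each.
examples = [{"task_id": t, "trace": i} for i, t in enumerate("AABBCCDDEE")]
train, val = split_by_task(examples)
# No task appears in both splits, so the judge is never evaluated on
# traces from tasks its rubric was optimized against.
assert {e["task_id"] for e in train}.isdisjoint({e["task_id"] for e in val})
```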
Processed dataset used in the workshop.
Files included:
- train.json
- train_annotated.json
- val.json
- val_annotated.json
- full.json
- full_annotated.json
- policy.md
Curated outputs used by the notebook.
Important files:
- baseline/eval_results.json
- best/rubric_seed.txt
- best/rubric_optimized.txt
- best/eval_val.json
- best/optimization_summary.json
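The "flipped examples" the notebook inspects can be found by joining the baseline and optimized per-example results. A hedged sketch with toy data; the actual schema of the checked-in eval JSON files is assumed, not confirmed:

```python
def flipped_examples(baseline, optimized):
    """Return indices the baseline judged wrongly but the optimized judge fixed.

    Both inputs are assumed to be aligned lists of {"label", "pred"} dicts;
    the real eval_results.json / eval_val.json may use a different layout.
    """
    return [
        i
        for i, (b, o) in enumerate(zip(baseline, optimized))
        if b["pred"] != b["label"] and o["pred"] == o["label"]
    ]

baseline = [
    {"label": "non_compliant", "pred": "compliant"},  # rubber-stamped
    {"label": "compliant", "pred": "compliant"},
    {"label": "non_compliant", "pred": "compliant"},
]
optimized = [
    {"label": "non_compliant", "pred": "non_compliant"},  # fixed
    {"label": "compliant", "pred": "compliant"},
    {"label": "non_compliant", "pred": "compliant"},      # remaining failure
]
print(flipped_examples(baseline, optimized))  # → [0]
```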
Reproducible experiment scripts.
These are not required for the main notebook flow. They exist for people who want to rerun the ablations or the best experiment from the command line.
If you want to rerun live model calls, copy .env.example to .env and set the required API keys.
Then you can run, for example:
```shell
uv run python experiments/exp_001_baseline.py
```

Or the best experiment:

```shell
uv run python experiments/exp_007_grok_gemini.py
```

Live experiment runs write outputs to runs/ so they do not overwrite the curated, checked-in results used by the notebook.
The notebook is designed to work without API credentials because it loads saved results by default.
You only need credentials if you want to rerun:
- judge evaluations
- annotation generation
- GEPA optimization experiments
Depending on model choice, you may need keys and credits for providers such as OpenAI or OpenRouter.
The core workshop story is the shift from a naive judge that mostly rubber-stamps traces as compliant to an optimized judge that catches many more real violations.
Checked-in validation results show:
- baseline accuracy: 65.2%
- optimized accuracy: 69.6%
- baseline non-compliant recall: 14.0%
- optimized non-compliant recall: 55.8%
That non-compliant recall jump is the main point of the workshop.
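To make the metrics concrete: on an imbalanced dataset, accuracy and non-compliant recall can diverge sharply, which is exactly the rubber-stamping failure mode. A self-contained sketch with made-up numbers (not the workshop's data):

```python
def accuracy(labels, preds):
    """Fraction of examples where the judge agrees with the ground truth."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def recall(labels, preds, positive="non_compliant"):
    """Of the truly non-compliant traces, what fraction did the judge flag?"""
    tp = sum(1 for l, p in zip(labels, preds) if l == positive and p == positive)
    fn = sum(1 for l, p in zip(labels, preds) if l == positive and p != positive)
    return tp / (tp + fn)

# A judge that stamps everything "compliant" still looks fine on accuracy
# when most traces really are compliant, yet catches zero violations.
labels = ["compliant"] * 7 + ["non_compliant"] * 3
naive = ["compliant"] * 10
print(accuracy(labels, naive))  # 0.7
print(recall(labels, naive))    # 0.0
```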
The slide deck PDF lives in slides/.
MIT.