
Judge the Judge: Workshop Assets

Public assets for the workshop "Judge the Judge: Building LLM Evaluators That Actually Work."

This repo is centered around the notebook in workshop/01_build_judge_with_gepa.ipynb.

It includes:

  • the workshop notebook
  • the processed airline policy dataset
  • the core scripts used by the notebook
  • curated experiment outputs and rubrics
  • reproducible experiment entrypoints

Resources

  • GEPA repo: https://github.com/gepa-ai/gepa
  • GEPA paper: https://arxiv.org/abs/2507.19457
  • Tau-bench repo: https://github.com/sierra-research/tau-bench

Quickstart

1. Install dependencies

This repo uses uv.

```
uv sync
```

2. Start Jupyter

```
uv run jupyter lab
```

Then open:

workshop/01_build_judge_with_gepa.ipynb

Recommended Path

The notebook is the main workshop experience.

By default, it uses checked-in results for the baseline and optimized judge so it can run without API credentials.

That means you can:

  • open the notebook
  • inspect the dataset
  • see the baseline failure mode
  • load the optimized rubric
  • compare seed vs optimized results
  • inspect flipped examples and remaining failures

without rerunning expensive model calls.
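The seed-vs-optimized comparison boils down to finding examples where the two judges disagree. A minimal sketch of that step, runnable offline — the `find_flips` helper and the verdict format are illustrative assumptions, not the notebook's actual API or the schema of the checked-in result files:

```python
# Sketch: find examples whose verdict flipped between the baseline and
# the optimized judge. Verdicts are assumed (hypothetically) to be dicts
# mapping an example id to a "compliant" / "non_compliant" label.

def find_flips(baseline: dict, optimized: dict) -> list:
    """Return ids where the optimized judge disagrees with the baseline."""
    return [ex_id for ex_id, verdict in baseline.items()
            if ex_id in optimized and optimized[ex_id] != verdict]

# Toy data standing in for baseline/eval_results.json and best/eval_val.json.
baseline = {"t1": "compliant", "t2": "compliant", "t3": "non_compliant"}
optimized = {"t1": "compliant", "t2": "non_compliant", "t3": "non_compliant"}

print(find_flips(baseline, optimized))  # → ['t2']
```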

What The Notebook Covers

The notebook walks through:

  1. problem framing: why naive LLM judges fail
  2. loading and inspecting the dataset
  3. baseline judge behavior
  4. GEPA optimization setup
  5. loading the optimized rubric
  6. before/after comparison
  7. failure analysis
  8. takeaways and practical recipe

Repo Structure

```
.
├── workshop/
│   ├── 01_build_judge_with_gepa.ipynb
│   ├── core/
│   │   ├── annotate.py
│   │   ├── evaluate.py
│   │   ├── extract.py
│   │   ├── optimize.py
│   │   └── split.py
│   ├── data/
│   │   └── airline_policy_v0/
│   └── results/
│       ├── baseline/
│       ├── best/
│       └── slide_metrics/
├── experiments/
├── slides/
└── docs/
```

Core Folders

workshop/

The main workshop assets.

workshop/core/

Human-readable helper scripts used by the notebook and experiments:

  • extract.py: build the dataset from raw Tau2-bench traces
  • split.py: split by task to avoid leakage
  • annotate.py: generate trace-level annotations
  • evaluate.py: run a judge rubric on a dataset
  • optimize.py: run GEPA prompt optimization for the judge rubric
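Conceptually, evaluate.py applies one rubric prompt to every trace and collects verdicts. Here is a self-contained sketch of that loop with the LLM call stubbed out so it runs offline — the function names, the stub, and the verdict labels are assumptions for illustration; see the script itself for the real interface:

```python
# Sketch of the evaluate-a-rubric loop. The real evaluate.py calls an
# LLM judge; here the judge is a stub so the control flow is runnable.

def judge(rubric: str, trace: str) -> str:
    """Stub standing in for an LLM call: flag traces mentioning a refund."""
    return "non_compliant" if "refund" in trace.lower() else "compliant"

def evaluate(rubric: str, dataset: list) -> dict:
    """Run the judge on every example and tally how many were flagged."""
    verdicts = {ex["id"]: judge(rubric, ex["trace"]) for ex in dataset}
    n_flagged = sum(v == "non_compliant" for v in verdicts.values())
    return {"verdicts": verdicts, "flagged": n_flagged, "total": len(dataset)}

dataset = [
    {"id": "t1", "trace": "Agent issued a refund against policy."},
    {"id": "t2", "trace": "Agent rebooked the passenger correctly."},
]
result = evaluate("Check compliance with airline policy.", dataset)
print(result["flagged"], "/", result["total"])  # → 1 / 2
```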

workshop/data/airline_policy_v0/

Processed dataset used in the workshop.

Files included:

  • train.json
  • train_annotated.json
  • val.json
  • val_annotated.json
  • full.json
  • full_annotated.json
  • policy.md
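A quick way to inspect a split is to load the JSON and count labels. The field names below (`task_id`, `label`) are assumptions about the record schema, not confirmed — check train_annotated.json for the actual keys:

```python
import json
from collections import Counter

# In the notebook you would load a checked-in split, for example:
#   json.loads(Path("workshop/data/airline_policy_v0/train_annotated.json").read_text())
# A small inline sample stands in here; field names are hypothetical.
records = json.loads("""[
    {"task_id": "a1", "label": "compliant"},
    {"task_id": "a2", "label": "non_compliant"},
    {"task_id": "a3", "label": "compliant"}
]""")

counts = Counter(r["label"] for r in records)
print(counts)  # → Counter({'compliant': 2, 'non_compliant': 1})
```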

workshop/results/

Curated outputs used by the notebook.

Important files:

  • baseline/eval_results.json
  • best/rubric_seed.txt
  • best/rubric_optimized.txt
  • best/eval_val.json
  • best/optimization_summary.json

experiments/

Reproducible experiment scripts.

These are not required for the main notebook flow. They exist for people who want to rerun the ablations or the best experiment from the command line.

Running Experiments

If you want to rerun live model calls, copy .env.example to .env and set the required API keys.

Then you can run, for example:

```
uv run python experiments/exp_001_baseline.py
```

Or the best experiment:

```
uv run python experiments/exp_007_grok_gemini.py
```

Live experiment runs write outputs to runs/ so they do not overwrite the curated checked-in results used by the notebook.

Notes On Credentials

The notebook is designed to work without API credentials because it loads saved results by default.

You only need credentials if you want to rerun:

  • judge evaluations
  • annotation generation
  • GEPA optimization experiments

Depending on model choice, you may need keys and credits for providers such as OpenAI or OpenRouter.
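A hypothetical .env for illustration — the authoritative variable names are whatever .env.example lists; the two below are common provider keys, not confirmed for this repo:

```
OPENAI_API_KEY=...
OPENROUTER_API_KEY=...
```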

Main Result

The core workshop story is the shift from a naive judge that mostly rubber-stamps traces as compliant to an optimized judge that catches many more real violations.

Checked-in validation results show:

  • baseline accuracy: 65.2%
  • optimized accuracy: 69.6%
  • baseline non-compliant recall: 14.0%
  • optimized non-compliant recall: 55.8%

That non-compliant recall jump is the main point of the workshop.

Slides

The slide deck PDF lives in slides/.

License

MIT.
