Problem-Reductions Bug-Finding Benchmark

A benchmark that measures how efficiently AI models find bugs in reduction rules from the problem-reductions library (290+ rules).

The leaderboard is a static site (site/) published to GitHub Pages. To submit, open a GitHub pull request that adds submissions/<handle>/<model>.json. Once a maintainer approves the run, the submission is re-verified on the PR, and the result must pass before merge. See CONTRIBUTING.md for how to run and submit.

What this measures

A reduction rule maps problem A → problem B. A bug is a round-trip failure:

A  →(reduce)→  B  →(solve)→  s  →(extract)→  A'

The rule is correct on an instance a only if solving it directly agrees with solving it through the reduction, compared by value (optimization) or feasibility (decision):

solve(a)  ==  solve(reduce(a))
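As an illustration only, the round-trip check can be sketched on a toy pair of problems (Maximum Independent Set reduced to Minimum Vertex Cover). This is not a rule taken from the library and does not use pred; every function name below is hypothetical, and the brute-force solvers are only practical for tiny graphs.

```python
from itertools import combinations

def is_independent(edges, subset):
    """True if no edge has both endpoints in subset."""
    s = set(subset)
    return not any(u in s and v in s for u, v in edges)

def is_cover(edges, subset):
    """True if every edge has at least one endpoint in subset."""
    s = set(subset)
    return all(u in s or v in s for u, v in edges)

def solve_max_independent_set(n, edges):
    """Brute force: size of a largest independent set (problem A)."""
    for k in range(n, -1, -1):
        for subset in combinations(range(n), k):
            if is_independent(edges, subset):
                return k

def solve_min_vertex_cover(n, edges):
    """Brute force: one smallest vertex cover (problem B)."""
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            if is_cover(edges, subset):
                return set(subset)

def reduce_mis_to_vc(n, edges):
    """Toy reduction: the target instance is the same graph."""
    return n, edges

def extract_correct(n, cover):
    """The complement of a minimum cover is a maximum independent set."""
    return set(range(n)) - cover

def extract_buggy(n, cover):
    """Deliberately wrong extraction: returns the cover itself."""
    return set(cover)

def round_trip_agrees(n, edges, extract):
    """Compare solve(a) with the value obtained through the reduction."""
    direct = solve_max_independent_set(n, edges)
    target_n, target_edges = reduce_mis_to_vc(n, edges)
    cover = solve_min_vertex_cover(target_n, target_edges)
    candidate = extract(n, cover)
    via = len(candidate) if is_independent(edges, candidate) else None
    return direct == via

# Path graph 0 - 1 - 2 acts as the instance a.
path = [(0, 1), (1, 2)]
print(round_trip_agrees(3, path, extract_correct))
print(round_trip_agrees(3, path, extract_buggy))
```

On the path graph the correct extraction agrees with the direct solve, while the deliberately broken one does not, so that instance would serve as a counterexample for the broken variant.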

A mismatch is a bug. The AI finds these by constructing counterexample certificates. A certificate is a JSON object naming the source instance a and the rule; the backend re-derives the bundle and round-trips it with pred, so the AI's claim is never trusted directly. The mismatch is reported with a derived label (optimum_not_preserved, feasibility_not_preserved, or spurious_solution). An optional target_config witness can additionally expose extraction bugs on a specific target solution (unsound_extraction / suboptimal_extraction).

Primary metric: bugs found, the number of distinct rules with at least one confirmed bug on a pinned library commit. One rule = one bug, no matter how many counterexamples (or violation types) target it. This count is fully verifiable and cannot be inflated by resubmitting certificates.

Secondary metrics: bugs/Ktok and bugs/$, which measure token- and cost-efficiency. Their denominators (tokens and cost) are self-reported, so they break ties and serve as reference, not as the headline.
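A minimal sketch of how these three numbers could be computed from per-rule rows, assuming each row carries the rule, result, cost, and tokens_k keys shown in the runner example and that every bug_found row has already been confirmed by pred. The score function and the rule names are hypothetical, not the benchmark's actual scoring code.

```python
def score(rows):
    """Aggregate per-rule rows into the primary and secondary metrics."""
    # One rule = one bug: a set collapses repeated certificates for a rule.
    confirmed_rules = {row["rule"] for row in rows if row["result"] == "bug_found"}
    bugs_found = len(confirmed_rules)

    # Denominators are self-reported, so they include every row, not only hits.
    tokens_k = sum(row["tokens_k"] for row in rows)
    cost = sum(row["cost"] for row in rows)

    return {
        "bugs_found": bugs_found,
        "bugs_per_ktok": bugs_found / tokens_k if tokens_k else None,
        "bugs_per_usd": bugs_found / cost if cost else None,
    }

# Hypothetical rows: two certificates against rule_a still count as one bug.
rows = [
    {"rule": "rule_a", "result": "bug_found", "cost": 0.25, "tokens_k": 10.0},
    {"rule": "rule_a", "result": "bug_found", "cost": 0.25, "tokens_k": 5.0},
    {"rule": "rule_b", "result": "no_certificate", "cost": 0.5, "tokens_k": 5.0},
]
print(score(rows))
```

Collecting rule names into a set is what keeps resubmitted certificates from inflating the count.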

Provenance is intentionally not scored: on a fixed commit, a pred-confirmed certificate is a bug regardless of who or what produced it.

How to add a model

Implement the AgentRunner interface in benchmark/runner.py:

from benchmark.runner import AgentRunner

class MyRunner(AgentRunner):
    def run(self, ctx, model: str, rule_name: str, per_rule_budget: float) -> dict:
        # Run the model, return a certificate if a bug is found
        return {
            "rule": rule_name,
            "result": "bug_found",   # or "no_certificate" | "rejected" | "error:..."
            "cost": 0.05,            # USD spent
            "tokens_k": 12.3,        # tokens used (thousands)
            "certificate": {...},    # required when result == "bug_found"
        }

Then pass it to Scheduler in benchmark/scheduler.py. See MiniSweRunner for a full example.
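Before wiring a runner into the scheduler, it can help to check the shape of the dict that run returns. The helper below is a hypothetical sketch based only on the keys and result values shown in the example above; it is not part of the benchmark package, and the authoritative format is whatever benchmark/submission.schema.json specifies.

```python
# Result values taken from the comment in the runner example above.
ALLOWED_RESULTS = {"bug_found", "no_certificate", "rejected"}
# Assumption: every row carries these four keys, as in the example.
REQUIRED_KEYS = ("rule", "result", "cost", "tokens_k")

def validate_row(row):
    """Return a list of problems found in a runner result; empty means OK."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in row:
            problems.append(f"missing key: {key}")

    result = row.get("result", "")
    if result not in ALLOWED_RESULTS and not str(result).startswith("error:"):
        problems.append(f"unexpected result: {result!r}")

    # The example marks the certificate as required when a bug is claimed.
    if result == "bug_found" and not row.get("certificate"):
        problems.append("bug_found requires a certificate")
    return problems

# Hypothetical row with a bug claim but no certificate attached.
row = {"rule": "some_rule", "result": "bug_found", "cost": 0.05, "tokens_k": 12.3}
print(validate_row(row))
```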

A run is packaged as a submission.json (an envelope around the per-rule rows; see benchmark/submission.schema.json) and submitted as a GitHub pull request that adds it under submissions/. Once a maintainer approves the run, the PR re-verifies every certificate with pred and must pass before merge. See CONTRIBUTING.md.

How to run locally

Requirements:

pred binary in PATH (pinned commit aa2d1a1 of problem-reductions)
Python 3.12 with dependencies: pip install -r benchmark/requirements.txt
An API key for your model

# Run all unit tests (no API key needed) — this is what exercises the runner wiring
make test-unit

# Test the verifier against the fixtures (no API key)
make verify-calibration

# Configure your run, then validate it with one tiny real call before the full batch
cp submission.env.example submission.env   # fill in MODEL_NAME, key, PRICE_IN/OUT
make preflight

# Run the benchmark via Docker → ./out/submission.json (this does NOT upload it)
make run

Key make targets:

| Target | Description |
| --- | --- |
| `make test-unit` | All unit tests, no API key needed |
| `make verify-calibration` | Test verifier against the fixtures (accept + reject paths) |
| `make verify-judgment` | Pred-free sanity tests (docs, CI, trajectory) |
| `make preflight` | Validate `submission.env` with one tiny real call before a full run |
| `make run` | Run the benchmark via Docker → `out/submission.json` (does not upload) |
| `make score-local` | Score submissions with the zero-trust backend |

How to read the metrics

| Metric | Formula | When to use |
| --- | --- | --- |
| `bugs_found` | distinct rules with a confirmed bug | Primary ranking: fully verifiable, cannot be inflated |
| `bugs/Ktok` | bugs ÷ tokens(K) | Tiebreak / efficiency reference: self-reported denominator |
| `bugs/$` | bugs ÷ USD spent | Tiebreak / cost-efficiency reference: self-reported denominator |

Rank by bugs_found. Among models that find the same number of bugs, bugs/Ktok breaks the tie (use bugs/$ when optimizing for budget). The efficiency metrics divide by tokens/cost, which the submitter self-reports — treat them as informative, not authoritative.
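The ordering described above can be sketched as a two-part sort key. The rank function, the entry keys, and the model names are hypothetical; the leaderboard's own implementation may differ.

```python
def rank(entries, tiebreak="bugs_per_ktok"):
    """Order entries by bugs_found, then by the chosen efficiency metric."""
    # Negate both values so that larger numbers sort first.
    return sorted(entries, key=lambda e: (-e["bugs_found"], -e[tiebreak]))

# Hypothetical leaderboard entries.
entries = [
    {"model": "model-a", "bugs_found": 5, "bugs_per_ktok": 0.1, "bugs_per_usd": 4.0},
    {"model": "model-b", "bugs_found": 5, "bugs_per_ktok": 0.2, "bugs_per_usd": 2.0},
    {"model": "model-c", "bugs_found": 7, "bugs_per_ktok": 0.05, "bugs_per_usd": 1.0},
]

print([e["model"] for e in rank(entries)])
print([e["model"] for e in rank(entries, tiebreak="bugs_per_usd")])
```

In this example model-c leads on bugs_found regardless of efficiency, and the tie between model-a and model-b flips depending on whether tokens or dollars break it.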

Models that don't publish pricing can still compete on bugs/Ktok.
