📖 Read me first — benchmark overview & glossary

*Read this first — it explains, in plain language, what this project is and defines every term the other issues use. No prior context assumed.*

## What this project is

A **leaderboard that measures how good different AI models are at finding bugs** in a library of "problem reductions." Think of it like SWE-bench, but instead of fixing GitHub issues, each AI tries to find *counterexamples* that prove a piece of math/code is wrong.

The library under test is **[`CodingThrust/problem-reductions`](https://github.com/CodingThrust/problem-reductions)** — a Rust library with 240+ "reduction rules."

## The 60-second domain primer

- **Problem reduction.** Many hard problems can be *translated* into one another (e.g. "graph coloring" → "QUBO"). Each rule in the library implements one such translation with two halves:
  - `reduce_to(A)` — translate an instance of problem **A** (the *source*) into an instance of problem **B** (the *target*).
  - `extract_solution(s)` — translate a solution `s` of **B** back into a solution of **A**.
- **The round-trip correctness property** (what a "bug" violates). A rule is correct only if: *take a problem A → reduce to B → solve B → extract the solution back to A → the result is a valid (or, for optimization, an optimal) solution to A.* If that round trip ever produces an invalid or non-optimal answer, the rule has a **bug**.
- **Decision vs optimization problems.** *Decision* = yes/no ("is there a valid coloring?"). *Optimization* = best value ("what's the largest independent set?"). Bugs look slightly different in each (see the certificate below).
- **Witness / configuration.** A concrete candidate solution (e.g. `[1,0,1,0]` = which vertices are chosen). A "valid witness" satisfies the problem's constraints.
- **Brute-force solver.** An exhaustive solver that finds *all* exact solutions of a small instance — our source of ground truth.

## How the benchmark runs

1. Each AI model is given a clone of the library, a command-line tool (`pred`, below), and a **budget** (a spending cap).
2. It runs an **open-ended bug-hunting session**: it explores the rules freely, *picks its own targets*, and tries to construct a counterexample. There is **no fixed list of rules to test** — discovery is part of the skill being measured.
3. Whatever it claims, we **re-check every counterexample independently** — we never trust the AI's own word.
4. Score = **bugs per 1000 tokens** (`bugs/Ktok`, fair across models) and **bugs per dollar** (`bugs/$`, practical). Only **verified** and **novel** (not-already-reported) bugs count.

## The counterexample certificate (the bug artifact)

A found bug is recorded as a **self-contained JSON "certificate"** — because it's plain data, anyone can re-validate it later. It contains: the source instance, the target instance, the relevant solutions, and *which* correctness property is violated:
- **decision · unsound extraction:** a valid solution of B that the rule maps to an *invalid* solution of A.
- **decision · incomplete reduction:** A has a valid solution, but B has *none* (the reduction lost it).
- **optimization · suboptimal extraction:** B's optimal solution maps to a *non-optimal* A solution, and we show a strictly better one.

**Independent validation** re-runs three checks with `pred`: (1) the target really is `reduce_to(source)`; (2) the listed solutions are present; (3) re-solving confirms the claimed violation.

## The `pred` CLI

The official command-line tool shipped with the library. The whole benchmark talks to the library **only through `pred`** — we never modify the library's code; if `pred` is missing something we need, we file an issue upstream. Key verbs:

| verb | what it does |
|---|---|
| `pred create <PROBLEM> …` | build a problem instance (JSON) |
| `pred reduce <inst> --to <T>` | apply a reduction A → B |
| `pred solve <inst> --solver brute-force` | solve exactly (all solutions / optimum) |
| `pred extract <bundle> --config <s>` | map a target solution back to the source (`extract_solution`) |
| `pred evaluate <inst> --config <s>` | check a solution is valid and get its objective value |

## The agent harness

The AI runs inside **mini-SWE-agent** (a ~100-line open-source bash-loop agent by the SWE-agent authors). It's wrapped behind a small `AgentRunner` interface so a different agent (e.g. opencode) can be swapped in later. "Fair" means every model gets the *identical* environment, prompt, and budget.

## The roadmap (milestones)

- **M1 — MVP:** get one model working end-to-end and show *real* numbers on the leaderboard (issues #1–#3).
- **M2 — Fair multi-model + trustworthy checker:** run many models fairly; make the verifier tamper-proof; count each bug once (#4–#5).
- **M3 — Public website:** dual-metric leaderboard, per-bug detail pages, agent trajectories, auto-publish (#6–#7).
- **M4 — Hardening & release:** docs, reproducibility, CI calibration, observability (#8, tracking).

## Glossary (quick)

reduction rule · source/target problem · `reduce_to` / `extract_solution` · round-trip property · witness/configuration · decision vs optimization · brute-force solver · counterexample certificate · `pred` · mini-SWE-agent · bugs/Ktok · bugs/$ · novelty (already-reported vs new).

---
*Background doc — not a work item. Linked from every other issue so each one is readable on its own.*


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

📖 Read me first — benchmark overview & glossary #9

What this project is

The 60-second domain primer

How the benchmark runs

The counterexample certificate (the bug artifact)

The `pred` CLI

The agent harness

The roadmap (milestones)

Glossary (quick)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

verb	what it does
`pred create <PROBLEM> …`	build a problem instance (JSON)
`pred reduce <inst> --to <T>`	apply a reduction A → B
`pred solve <inst> --solver brute-force`	solve exactly (all solutions / optimum)
`pred extract <bundle> --config <s>`	map a target solution back to the source (`extract_solution`)
`pred evaluate <inst> --config <s>`	check a solution is valid and get its objective value

Uh oh!

📖 Read me first — benchmark overview & glossary #9

Description

What this project is

The 60-second domain primer

How the benchmark runs

The counterexample certificate (the bug artifact)

The pred CLI

The agent harness

The roadmap (milestones)

Glossary (quick)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

The `pred` CLI