Read this first β it explains, in plain language, what this project is and defines every term the other issues use. No prior context assumed.
What this project is
A leaderboard that measures how good different AI models are at finding bugs in a library of "problem reductions." Think of it like SWE-bench, but instead of fixing GitHub issues, each AI tries to find counterexamples that prove a piece of math/code is wrong.
The library under test is CodingThrust/problem-reductions β a Rust library with 240+ "reduction rules."
The 60-second domain primer
- Problem reduction. Many hard problems can be translated into one another (e.g. "graph coloring" β "QUBO"). Each rule in the library implements one such translation with two halves:
reduce_to(A) β translate an instance of problem A (the source) into an instance of problem B (the target).
extract_solution(s) β translate a solution s of B back into a solution of A.
- The round-trip correctness property (what a "bug" violates). A rule is correct only if: take a problem A β reduce to B β solve B β extract the solution back to A β the result is a valid (or, for optimization, an optimal) solution to A. If that round trip ever produces an invalid or non-optimal answer, the rule has a bug.
- Decision vs optimization problems. Decision = yes/no ("is there a valid coloring?"). Optimization = best value ("what's the largest independent set?"). Bugs look slightly different in each (see the certificate below).
- Witness / configuration. A concrete candidate solution (e.g.
[1,0,1,0] = which vertices are chosen). A "valid witness" satisfies the problem's constraints.
- Brute-force solver. An exhaustive solver that finds all exact solutions of a small instance β our source of ground truth.
How the benchmark runs
- Each AI model is given a clone of the library, a command-line tool (
pred, below), and a budget (a spending cap).
- It runs an open-ended bug-hunting session: it explores the rules freely, picks its own targets, and tries to construct a counterexample. There is no fixed list of rules to test β discovery is part of the skill being measured.
- Whatever it claims, we re-check every counterexample independently β we never trust the AI's own word.
- Score = bugs per 1000 tokens (
bugs/Ktok, fair across models) and bugs per dollar (bugs/$, practical). Only verified and novel (not-already-reported) bugs count.
The counterexample certificate (the bug artifact)
A found bug is recorded as a self-contained JSON "certificate" β because it's plain data, anyone can re-validate it later. It contains: the source instance, the target instance, the relevant solutions, and which correctness property is violated:
- decision Β· unsound extraction: a valid solution of B that the rule maps to an invalid solution of A.
- decision Β· incomplete reduction: A has a valid solution, but B has none (the reduction lost it).
- optimization Β· suboptimal extraction: B's optimal solution maps to a non-optimal A solution, and we show a strictly better one.
Independent validation re-runs three checks with pred: (1) the target really is reduce_to(source); (2) the listed solutions are present; (3) re-solving confirms the claimed violation.
The pred CLI
The official command-line tool shipped with the library. The whole benchmark talks to the library only through pred β we never modify the library's code; if pred is missing something we need, we file an issue upstream. Key verbs:
| verb |
what it does |
pred create <PROBLEM> β¦ |
build a problem instance (JSON) |
pred reduce <inst> --to <T> |
apply a reduction A β B |
pred solve <inst> --solver brute-force |
solve exactly (all solutions / optimum) |
pred extract <bundle> --config <s> |
map a target solution back to the source (extract_solution) |
pred evaluate <inst> --config <s> |
check a solution is valid and get its objective value |
The agent harness
The AI runs inside mini-SWE-agent (a ~100-line open-source bash-loop agent by the SWE-agent authors). It's wrapped behind a small AgentRunner interface so a different agent (e.g. opencode) can be swapped in later. "Fair" means every model gets the identical environment, prompt, and budget.
The roadmap (milestones)
Glossary (quick)
reduction rule Β· source/target problem Β· reduce_to / extract_solution Β· round-trip property Β· witness/configuration Β· decision vs optimization Β· brute-force solver Β· counterexample certificate Β· pred Β· mini-SWE-agent Β· bugs/Ktok Β· bugs/$ Β· novelty (already-reported vs new).
Background doc β not a work item. Linked from every other issue so each one is readable on its own.
Read this first β it explains, in plain language, what this project is and defines every term the other issues use. No prior context assumed.
What this project is
A leaderboard that measures how good different AI models are at finding bugs in a library of "problem reductions." Think of it like SWE-bench, but instead of fixing GitHub issues, each AI tries to find counterexamples that prove a piece of math/code is wrong.
The library under test is
CodingThrust/problem-reductionsβ a Rust library with 240+ "reduction rules."The 60-second domain primer
reduce_to(A)β translate an instance of problem A (the source) into an instance of problem B (the target).extract_solution(s)β translate a solutionsof B back into a solution of A.[1,0,1,0]= which vertices are chosen). A "valid witness" satisfies the problem's constraints.How the benchmark runs
pred, below), and a budget (a spending cap).bugs/Ktok, fair across models) and bugs per dollar (bugs/$, practical). Only verified and novel (not-already-reported) bugs count.The counterexample certificate (the bug artifact)
A found bug is recorded as a self-contained JSON "certificate" β because it's plain data, anyone can re-validate it later. It contains: the source instance, the target instance, the relevant solutions, and which correctness property is violated:
Independent validation re-runs three checks with
pred: (1) the target really isreduce_to(source); (2) the listed solutions are present; (3) re-solving confirms the claimed violation.The
predCLIThe official command-line tool shipped with the library. The whole benchmark talks to the library only through
predβ we never modify the library's code; ifpredis missing something we need, we file an issue upstream. Key verbs:pred create <PROBLEM> β¦pred reduce <inst> --to <T>pred solve <inst> --solver brute-forcepred extract <bundle> --config <s>extract_solution)pred evaluate <inst> --config <s>The agent harness
The AI runs inside mini-SWE-agent (a ~100-line open-source bash-loop agent by the SWE-agent authors). It's wrapped behind a small
AgentRunnerinterface so a different agent (e.g. opencode) can be swapped in later. "Fair" means every model gets the identical environment, prompt, and budget.The roadmap (milestones)
Glossary (quick)
reduction rule Β· source/target problem Β·
reduce_to/extract_solutionΒ· round-trip property Β· witness/configuration Β· decision vs optimization Β· brute-force solver Β· counterexample certificate Β·predΒ· mini-SWE-agent Β· bugs/Ktok Β· bugs/$ Β· novelty (already-reported vs new).Background doc β not a work item. Linked from every other issue so each one is readable on its own.