Skip to content

πŸ“– Read me first β€” benchmark overview & glossaryΒ #9

Description

@GiggleLiu

Read this first β€” it explains, in plain language, what this project is and defines every term the other issues use. No prior context assumed.

What this project is

A leaderboard that measures how good different AI models are at finding bugs in a library of "problem reductions." Think of it like SWE-bench, but instead of fixing GitHub issues, each AI tries to find counterexamples that prove a piece of math/code is wrong.

The library under test is CodingThrust/problem-reductions β€” a Rust library with 240+ "reduction rules."

The 60-second domain primer

  • Problem reduction. Many hard problems can be translated into one another (e.g. "graph coloring" β†’ "QUBO"). Each rule in the library implements one such translation with two halves:
    • reduce_to(A) β€” translate an instance of problem A (the source) into an instance of problem B (the target).
    • extract_solution(s) β€” translate a solution s of B back into a solution of A.
  • The round-trip correctness property (what a "bug" violates). A rule is correct only if: take a problem A β†’ reduce to B β†’ solve B β†’ extract the solution back to A β†’ the result is a valid (or, for optimization, an optimal) solution to A. If that round trip ever produces an invalid or non-optimal answer, the rule has a bug.
  • Decision vs optimization problems. Decision = yes/no ("is there a valid coloring?"). Optimization = best value ("what's the largest independent set?"). Bugs look slightly different in each (see the certificate below).
  • Witness / configuration. A concrete candidate solution (e.g. [1,0,1,0] = which vertices are chosen). A "valid witness" satisfies the problem's constraints.
  • Brute-force solver. An exhaustive solver that finds all exact solutions of a small instance β€” our source of ground truth.

How the benchmark runs

  1. Each AI model is given a clone of the library, a command-line tool (pred, below), and a budget (a spending cap).
  2. It runs an open-ended bug-hunting session: it explores the rules freely, picks its own targets, and tries to construct a counterexample. There is no fixed list of rules to test β€” discovery is part of the skill being measured.
  3. Whatever it claims, we re-check every counterexample independently β€” we never trust the AI's own word.
  4. Score = bugs per 1000 tokens (bugs/Ktok, fair across models) and bugs per dollar (bugs/$, practical). Only verified and novel (not-already-reported) bugs count.

The counterexample certificate (the bug artifact)

A found bug is recorded as a self-contained JSON "certificate" β€” because it's plain data, anyone can re-validate it later. It contains: the source instance, the target instance, the relevant solutions, and which correctness property is violated:

  • decision Β· unsound extraction: a valid solution of B that the rule maps to an invalid solution of A.
  • decision Β· incomplete reduction: A has a valid solution, but B has none (the reduction lost it).
  • optimization Β· suboptimal extraction: B's optimal solution maps to a non-optimal A solution, and we show a strictly better one.

Independent validation re-runs three checks with pred: (1) the target really is reduce_to(source); (2) the listed solutions are present; (3) re-solving confirms the claimed violation.

The pred CLI

The official command-line tool shipped with the library. The whole benchmark talks to the library only through pred β€” we never modify the library's code; if pred is missing something we need, we file an issue upstream. Key verbs:

verb what it does
pred create <PROBLEM> … build a problem instance (JSON)
pred reduce <inst> --to <T> apply a reduction A β†’ B
pred solve <inst> --solver brute-force solve exactly (all solutions / optimum)
pred extract <bundle> --config <s> map a target solution back to the source (extract_solution)
pred evaluate <inst> --config <s> check a solution is valid and get its objective value

The agent harness

The AI runs inside mini-SWE-agent (a ~100-line open-source bash-loop agent by the SWE-agent authors). It's wrapped behind a small AgentRunner interface so a different agent (e.g. opencode) can be swapped in later. "Fair" means every model gets the identical environment, prompt, and budget.

The roadmap (milestones)

Glossary (quick)

reduction rule Β· source/target problem Β· reduce_to / extract_solution Β· round-trip property Β· witness/configuration Β· decision vs optimization Β· brute-force solver Β· counterexample certificate Β· pred Β· mini-SWE-agent Β· bugs/Ktok Β· bugs/$ Β· novelty (already-reported vs new).


Background doc β€” not a work item. Linked from every other issue so each one is readable on its own.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions