You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
New here? Read #9 first — it explains the project and defines every term below.
Issue #2 already re-checks every counterexample by re-running pred. This issue makes that checker hard to fool and makes sure each distinct bug is counted only once. Two concrete risks: (a) two problem instances can be equal but written differently in JSON (different field order, a rounded number) — the checker must not be tricked into rejecting a real bug over formatting; (b) the same bug found twice, or a bug already reported elsewhere, shouldn't inflate a model's score. ("Novel" = not already reported; see #9.)
Objective
Make the checker robust (compare by meaning, tolerate rounding) and add a novelty filter so each distinct, not-already-known bug counts exactly once.
Interface (Input → Output)
Input: a counterexample certificate + the library clone + a ledger of already-known bugs.
Output: a scored verdict — whether it's a real bug (with the re-derived evidence) and whether it's novel or already known (and if known, which prior report it matches).
Compare objective values with a documented rounding tolerance.
Decide novelty by the pair (rule name, normalized source instance); replace the current crude "is the rule name a substring of an issue title" check.
Verification (how a reviewer confirms this is done)
Run make verify-judgment over a catalog of tricky certificates. It must get all of these right or exit with an error:
a certificate whose target JSON is reordered/reserialized but means the same thing → still accepted (not fooled by formatting);
a certificate with an objective value off only by rounding within tolerance → accepted;
a certificate that claims a bug but whose supposedly-invalid solution is actually valid → rejected;
a "bug" that's actually identical to the correct extracted solution (no real violation) → rejected;
a bug already in the known-bugs ledger → marked known (adds nothing to the score); a fresh one → novel; and a rule whose name merely contains another rule's name is not mistakenly treated as a duplicate.
Dependencies
Depends on #2 (the basic checker). Robust equality relates to the upstream canonical-JSON request from #1.
Background
Issue #2 already re-checks every counterexample by re-running
pred. This issue makes that checker hard to fool and makes sure each distinct bug is counted only once. Two concrete risks: (a) two problem instances can be equal but written differently in JSON (different field order, a rounded number) — the checker must not be tricked into rejecting a real bug over formatting; (b) the same bug found twice, or a bug already reported elsewhere, shouldn't inflate a model's score. ("Novel" = not already reported; see #9.)Objective
Make the checker robust (compare by meaning, tolerate rounding) and add a novelty filter so each distinct, not-already-known bug counts exactly once.
Interface (Input → Output)
Technical recommendations (suggestions)
pred reduce).Verification (how a reviewer confirms this is done)
Run
make verify-judgmentover a catalog of tricky certificates. It must get all of these right or exit with an error:Dependencies
Depends on #2 (the basic checker). Robust equality relates to the upstream canonical-JSON request from #1.
Out of scope
Website rendering (#6).