Skip to content

M2.2 — Harden the verifier (robust equality) and add the novelty filter #5

Description

@GiggleLiu

Background

New here? Read #9 first — it explains the project and defines every term below.

Issue #2 already re-checks every counterexample by re-running pred. This issue makes that checker hard to fool and makes sure each distinct bug is counted only once. Two concrete risks: (a) two problem instances can be equal but written differently in JSON (different field order, a rounded number) — the checker must not be tricked into rejecting a real bug over formatting; (b) the same bug found twice, or a bug already reported elsewhere, shouldn't inflate a model's score. ("Novel" = not already reported; see #9.)

Objective

Make the checker robust (compare by meaning, tolerate rounding) and add a novelty filter so each distinct, not-already-known bug counts exactly once.

Interface (Input → Output)

  • Input: a counterexample certificate + the library clone + a ledger of already-known bugs.
  • Output: a scored verdict — whether it's a real bug (with the re-derived evidence) and whether it's novel or already known (and if known, which prior report it matches).

Technical recommendations (suggestions)

  • Compare problem instances by meaning, not raw text — normalize the JSON before comparing (this is also why M1.1 — Build the pred environment & capability foundation (env + upstream-gap audit) #1 files an upstream request for canonical JSON from pred reduce).
  • Compare objective values with a documented rounding tolerance.
  • Decide novelty by the pair (rule name, normalized source instance); replace the current crude "is the rule name a substring of an issue title" check.

Verification (how a reviewer confirms this is done)

Run make verify-judgment over a catalog of tricky certificates. It must get all of these right or exit with an error:

  • a certificate whose target JSON is reordered/reserialized but means the same thing → still accepted (not fooled by formatting);
  • a certificate with an objective value off only by rounding within tolerance → accepted;
  • a certificate that claims a bug but whose supposedly-invalid solution is actually valid → rejected;
  • a "bug" that's actually identical to the correct extracted solution (no real violation) → rejected;
  • a bug already in the known-bugs ledger → marked known (adds nothing to the score); a fresh one → novel; and a rule whose name merely contains another rule's name is not mistakenly treated as a duplicate.

Dependencies

Depends on #2 (the basic checker). Robust equality relates to the upstream canonical-JSON request from #1.

Out of scope

Website rendering (#6).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions