M2.2 — Harden the verifier (robust equality) and add the novelty filter

## Background
> New here? Read **#9** first — it explains the project and defines every term below.

Issue #2 already re-checks every counterexample by re-running `pred`. This issue makes that checker **hard to fool** and makes sure each distinct bug is **counted only once**. Two concrete risks: (a) two problem instances can be *equal* but written differently in JSON (different field order, a rounded number) — the checker must not be tricked into rejecting a real bug over formatting; (b) the same bug found twice, or a bug already reported elsewhere, shouldn't inflate a model's score. ("Novel" = not already reported; see #9.)

## Objective
Make the checker robust (compare by meaning, tolerate rounding) and add a novelty filter so each distinct, not-already-known bug counts exactly once.

## Interface (Input → Output)
- **Input:** a counterexample certificate + the library clone + a ledger of already-known bugs.
- **Output:** a scored verdict — whether it's a real bug (with the re-derived evidence) and whether it's **novel** or already **known** (and if known, which prior report it matches).

## Technical recommendations (suggestions)
- Compare problem instances by *meaning*, not raw text — normalize the JSON before comparing (this is also why #1 files an upstream request for canonical JSON from `pred reduce`).
- Compare objective values with a documented rounding tolerance.
- Decide novelty by the pair (rule name, normalized source instance); replace the current crude "is the rule name a substring of an issue title" check.

## Verification (how a reviewer confirms this is done)
Run `make verify-judgment` over a catalog of tricky certificates. It must get all of these right or exit with an error:
- a certificate whose target JSON is reordered/reserialized but means the same thing → still **accepted** (not fooled by formatting);
- a certificate with an objective value off only by rounding within tolerance → **accepted**;
- a certificate that *claims* a bug but whose supposedly-invalid solution is actually valid → **rejected**;
- a "bug" that's actually identical to the correct extracted solution (no real violation) → **rejected**;
- a bug already in the known-bugs ledger → marked **known** (adds nothing to the score); a fresh one → **novel**; and a rule whose name merely *contains* another rule's name is **not** mistakenly treated as a duplicate.

## Dependencies
Depends on #2 (the basic checker). Robust equality relates to the upstream canonical-JSON request from #1.

## Out of scope
Website rendering (#6).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

M2.2 — Harden the verifier (robust equality) and add the novelty filter #5

Background

Objective

Interface (Input → Output)

Technical recommendations (suggestions)

Verification (how a reviewer confirms this is done)

Dependencies

Out of scope

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

M2.2 — Harden the verifier (robust equality) and add the novelty filter #5

Description

Background

Objective

Interface (Input → Output)

Technical recommendations (suggestions)

Verification (how a reviewer confirms this is done)

Dependencies

Out of scope

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions