You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Solve-allowed verification: one round-trip check that subsumes all three violation types
Complement to #15. Where #15 keeps certificates solver-free for large instances, this issue adds the opposite (and much simpler) mode: on small instances, allow pred solve, and reduction correctness collapses to a single equality. Builds on #14.
The observation
pred solve <bundle> already performs the whole round-trip — it brute-force-solves the target, then extracts back to source space. The outer evaluation/solution is the source-space answer; intermediate holds the target solution. Example (MIS 0-1,1-2 → MaximumClique):
A reduction is correct on instance A iff value(solve(A)) == value(solve(reduce(A))) — the true optimum of A equals the optimum recovered by solving the target and extracting. A bug is any small A where they disagree (with None/false ⇒ infeasible, numeric compared within FLOAT_TOLERANCE).
This single equality catches all three existing failure modes at once:
incomplete_reduction: A feasible, solve(bundle) infeasible → Max(k) vs Max(None).
unsound_extraction: extracted optimum infeasible for A → mismatch.
suboptimal_extraction: extracted value worse than true optimum → Max(1) vs Max(2). And it fixes the current soundness hole (verify.py:280) because solve(bundle) returns the verified target optimum, instead of trusting an arbitrary target_config.
Why the check is much easier
Minimal certificate. The AI only supplies a small source instance + target type: {rule, violation: "solve_mismatch", source, bundle}. No target_config, no claimed_source_solution, no brute_force_solution, no declared cost constants, no forward map.
Zero trust in AI configs. The verifier solves both sides itself; nothing the AI claims about solutions can spoof it.
One comparison.solve(source).evaluation vs solve(bundle).evaluation.
Cost / scope (why this is not a replacement for #15)
Bounded to brute-force-solvable instances. Gate: pred solve runs under the existing 30s subprocess timeout; on timeout the cert is rejected ("instance too large for solve-based verification"). Reduction bugs almost always reproduce on tiny instances (gadget bugs show on small graphs), so this is pragmatic.
Add solve_mismatch to the violation enum in results.schema.json (no new required fields — rule, violation, source, bundle suffices).
_run_pred: catch subprocess.TimeoutExpired → return a timeout rc so oversized instances are rejected cleanly (also hardens existing incomplete/suboptimal solve paths).
Implement _check_solve_mismatch(cert, source_file, bundle_file): two solves + evaluation comparison; accept on disagreement, reject on agreement, reject on timeout.
Route "solve_mismatch" in _verify_in.
Tests: a correct reduction (MIS→MaximumClique) under solve_mismatch must be rejected ("recovers the true optimum"); add a calibration fixture for it. (A true-positive needs an actually-buggy rule, same limitation noted for incomplete_reduction tests.)
Acceptance criteria
verify accepts a solve_mismatch cert iff solve(source) and solve(bundle) evaluations disagree.
MIS→MaximumClique (a correct reduction) under solve_mismatch is rejected with a reason mentioning the recovered optimum equals the true optimum.
An oversized source instance (solve exceeds the timeout) is rejected, not crashed.
Solve-allowed verification: one round-trip check that subsumes all three violation types
Complement to #15. Where #15 keeps certificates solver-free for large instances, this issue adds the opposite (and much simpler) mode: on small instances, allow
pred solve, and reduction correctness collapses to a single equality. Builds on #14.The observation
pred solve <bundle>already performs the whole round-trip — it brute-force-solves the target, then extracts back to source space. The outerevaluation/solutionis the source-space answer;intermediateholds the target solution. Example (MIS0-1,1-2→ MaximumClique):The rule
This single equality catches all three existing failure modes at once:
solve(bundle)infeasible →Max(k)vsMax(None).Max(1)vsMax(2). And it fixes the current soundness hole (verify.py:280) becausesolve(bundle)returns the verified target optimum, instead of trusting an arbitrarytarget_config.Why the check is much easier
{rule, violation: "solve_mismatch", source, bundle}. Notarget_config, noclaimed_source_solution, nobrute_force_solution, no declared cost constants, no forward map.solve(source).evaluationvssolve(bundle).evaluation.Cost / scope (why this is not a replacement for #15)
pred solveruns under the existing 30s subprocess timeout; on timeout the cert is rejected ("instance too large for solve-based verification"). Reduction bugs almost always reproduce on tiny instances (gadget bugs show on small graphs), so this is pragmatic.Proposed work (this repo)
solve_mismatchto theviolationenum inresults.schema.json(no new required fields —rule, violation, source, bundlesuffices)._run_pred: catchsubprocess.TimeoutExpired→ return a timeout rc so oversized instances are rejected cleanly (also hardens existingincomplete/suboptimalsolve paths)._check_solve_mismatch(cert, source_file, bundle_file): two solves + evaluation comparison; accept on disagreement, reject on agreement, reject on timeout."solve_mismatch"in_verify_in.solve_mismatchmust be rejected ("recovers the true optimum"); add a calibration fixture for it. (A true-positive needs an actually-buggy rule, same limitation noted forincomplete_reductiontests.)Acceptance criteria
verifyaccepts asolve_mismatchcert iffsolve(source)andsolve(bundle)evaluations disagree.solve_mismatchis rejected with a reason mentioning the recovered optimum equals the true optimum.--calibratepasses including the new fixture.Relates to #14, complements #15.