Background
New here? Read #9 first — it explains the project and defines every term below.
Issue #3 made a minimal page that shows real data. This issue grows it into the public product. Two ranking metrics matter: bugs per 1000 tokens (fair across models regardless of their pricing) and bugs per dollar (practical cost-efficiency). Visitors should be able to click a model to see its bugs, click a bug to see the full counterexample (source problem → target problem → solutions → verdict), and view the trajectory — the exact sequence of pred commands the AI ran to find that bug, for transparency. It's a static site (just files, no server).
Objective
Grow the minimal page into the public product: a dual-metric ranked table, per-model breakdown, per-bug detail pages, and a link to each bug's discovery trajectory.
Interface (Input → Output)
Technical recommendations (suggestions)
- Reuse the existing chart; add a toggle between the two metrics.
- The per-bug page shows the certificate (source → target → solutions → verdict) and links to its trajectory.
- The trajectory view reads the agent's saved log and shows the pred commands in order.
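As an illustration of the trajectory recommendation, here is a minimal sketch of reading a saved log and listing its pred commands in order. The log format is an assumption: this issue does not specify it, so the sketch supposes a JSON Lines file whose entries carry a `command` string. The function names `load_pred_commands` and `render_trajectory` are hypothetical. If the site renders in the browser, the same logic would live in its JavaScript instead.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Optional


def load_pred_commands(log_path: Path) -> Optional[list[str]]:
    """Return the pred commands from a trajectory log in order.

    Returns None when the log file does not exist, so the caller can
    show a fallback instead of failing.
    ASSUMPTION: the log is JSON Lines and each entry is an object with
    a "command" string; the real log format may differ.
    """
    if not log_path.is_file():
        return None
    commands: list[str] = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            command = entry.get("command", "")
            # Keep only invocations of the pred tool, preserving log order.
            if command == "pred" or command.startswith("pred "):
                commands.append(command)
    return commands


def render_trajectory(log_path: Path) -> str:
    """Render a numbered list of pred commands, or the fallback text."""
    commands = load_pred_commands(log_path)
    if commands is None:
        return "trajectory unavailable"
    return "\n".join(f"{i}. {cmd}" for i, cmd in enumerate(commands, start=1))
```

Returning `None` for a missing file, rather than raising, keeps the per-bug page renderable when its trajectory is absent.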
Verification (how a reviewer confirms this is done)
Use a small fixture with 2 models chosen so the two metrics rank them in opposite orders:
- Toggle the metric between "bugs / 1000 tokens" and "bugs / $" → the table reorders. (Proves both metrics are really computed, not just labels.)
- Click a bug → the source/target/solutions/verdict shown match the JSON file, and the trajectory link shows the pred commands ending in exactly that bug.
- A model that found 0 bugs still appears, listed with 0.
- A bug whose trajectory file is missing still renders ("trajectory unavailable") instead of breaking the page.
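The metric checks above could be backed by a fixture along these lines. This is a sketch under assumptions, not the project's results format: the `ModelResult` fields, the model names, and every number are invented for illustration. Two models are chosen so the metrics disagree, and a third with zero bugs is added only to cover the zero-bug check.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    # Hypothetical record shape; the real results format comes from #3.
    name: str
    bugs: int
    tokens: int
    cost_usd: float


def bugs_per_1000_tokens(r: ModelResult) -> float:
    return 1000 * r.bugs / r.tokens if r.tokens else 0.0


def bugs_per_dollar(r: ModelResult) -> float:
    return r.bugs / r.cost_usd if r.cost_usd else 0.0


# Keys mirror the two toggle labels.
METRICS = {
    "bugs / 1000 tokens": bugs_per_1000_tokens,
    "bugs / $": bugs_per_dollar,
}


def ranked(results: list[ModelResult], metric: str) -> list[str]:
    """Model names sorted best-first by the chosen metric."""
    return [r.name for r in sorted(results, key=METRICS[metric], reverse=True)]


# Invented fixture data: model-a is more token-efficient, model-b is cheaper.
FIXTURE = [
    ModelResult("model-a", bugs=4, tokens=100_000, cost_usd=8.0),
    ModelResult("model-b", bugs=3, tokens=150_000, cost_usd=1.5),
    ModelResult("model-c", bugs=0, tokens=50_000, cost_usd=1.0),
]

by_tokens = ranked(FIXTURE, "bugs / 1000 tokens")
by_dollar = ranked(FIXTURE, "bugs / $")

# The two metrics must order model-a and model-b oppositely.
assert by_tokens.index("model-a") < by_tokens.index("model-b")
assert by_dollar.index("model-b") < by_dollar.index("model-a")
# The zero-bug model is still listed under both metrics.
assert "model-c" in by_tokens and "model-c" in by_dollar
```

With these numbers, model-a scores 0.04 bugs per 1000 tokens against model-b's 0.02, while model-b scores 2.0 bugs per dollar against model-a's 0.5, so toggling the metric swaps their positions.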
Dependencies
Depends on #3 (results format + rendering) and #4 (multiple models' results).
Out of scope
A second agent track (opencode); the full offline-reproducibility archive (#8).