M3.1 — Build the public leaderboard: dual metrics + certificate drilldown + trajectory #6

Description

@GiggleLiu

Background

New here? Read #9 first — it explains the project and defines every term below.

Issue #3 made a minimal page that shows real data. This issue grows it into the public product. Two ranking metrics matter: bugs per 1000 tokens (fair across models regardless of their pricing) and bugs per dollar (practical cost-efficiency). Visitors should be able to click a model to see its bugs and click a bug to see the full counterexample (source problem → target problem → solutions → verdict). For transparency, each bug also links to its trajectory: the exact sequence of pred commands the AI ran to find that bug. The result is a static site (just files, no server).

Objective

Grow the minimal page into the public product: a dual-metric ranked table, per-model breakdown, per-bug detail pages, and a link to each bug's discovery trajectory.

Interface (Input → Output)

Technical recommendations (suggestions)

  • Reuse the existing chart; add a toggle between the two metrics.
  • The per-bug page shows the certificate (source → target → solutions → verdict) and links to its trajectory.
  • The trajectory view reads the agent's saved log and shows the pred commands in order.
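As a rough illustration of the trajectory recommendation, the sketch below reads a saved log and lists the pred commands in order. The log format is an assumption made only for this example: one JSON object per line with a `command` string field. The function names `load_pred_commands` and `render_trajectory` are hypothetical, and the real log layout may differ.

```python
import json
from pathlib import Path
from typing import Optional


def load_pred_commands(log_path: Path) -> Optional[list[str]]:
    """Return the pred commands in log order, or None if the log is missing.

    Assumed (hypothetical) format: JSON Lines, each entry {"command": "..."}.
    """
    if not log_path.is_file():
        return None
    commands = []
    for line in log_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than failing the page
        command = entry.get("command") if isinstance(entry, dict) else None
        # Keep only invocations of the pred CLI, preserving their order.
        if isinstance(command, str) and (command == "pred" or command.startswith("pred ")):
            commands.append(command)
    return commands


def render_trajectory(log_path: Path) -> str:
    """Render a numbered command list, or a fallback when the log is absent."""
    commands = load_pred_commands(log_path)
    if commands is None:
        return "trajectory unavailable"
    return "\n".join(f"{i}. {c}" for i, c in enumerate(commands, start=1))
```

Returning `None` for a missing file, rather than raising, lets the per-bug page fall back to a placeholder instead of breaking.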

Verification (how a reviewer confirms this is done)

Use a small fixture with 2 models chosen so the two metrics rank them in opposite orders:

  1. Toggle the metric between "bugs / 1000 tokens" and "bugs / $" → the table reorders. (Proves both metrics are really computed, not just labels.)
  2. Click a bug → the source/target/solutions/verdict shown match the JSON file, and the trajectory link shows the pred commands ending in exactly that bug.
  3. A model that found 0 bugs still appears, listed with 0.
  4. A bug whose trajectory file is missing still renders ("trajectory unavailable") instead of breaking the page.
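One possible shape for such a fixture, shown here as a minimal sketch: the `ModelResult` record, its field names, and the numbers are all hypothetical (the real results format comes from #3). A third, zero-bug model is included only to exercise check 3.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    # Hypothetical record; the real results format is defined in #3.
    name: str
    bugs: int
    tokens: int
    cost_usd: float


def bugs_per_1k_tokens(r: ModelResult) -> float:
    # Guard against zero usage so an idle model scores 0 instead of crashing.
    return 1000 * r.bugs / r.tokens if r.tokens else 0.0


def bugs_per_dollar(r: ModelResult) -> float:
    return r.bugs / r.cost_usd if r.cost_usd else 0.0


def rank(results, metric):
    # Highest score first; ties broken by name so the order is deterministic.
    return [r.name for r in sorted(results, key=lambda r: (-metric(r), r.name))]


FIXTURE = [
    ModelResult("model-a", bugs=10, tokens=100_000, cost_usd=10.0),  # 0.10 per 1k tokens, 1.0 per $
    ModelResult("model-b", bugs=6, tokens=120_000, cost_usd=2.0),    # 0.05 per 1k tokens, 3.0 per $
    ModelResult("model-c", bugs=0, tokens=50_000, cost_usd=1.0),     # found nothing, still listed
]

by_tokens = rank(FIXTURE, bugs_per_1k_tokens)  # model-a ahead of model-b
by_dollar = rank(FIXTURE, bugs_per_dollar)     # model-b ahead of model-a
```

Because model-a wins on tokens and model-b wins on dollars, a table that fails to reorder on toggle is immediately visible.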

Dependencies

Depends on #3 (results format + rendering) and #4 (multiple models' results).

Out of scope

A second agent track (opencode); the full offline-reproducibility archive (#8).
