Universal tool and model logging to leaderboard and Prisma #1

Open
RayirthDinesh wants to merge 33 commits into main from test

@RayirthDinesh RayirthDinesh commented May 15, 2026

Collaborator

This PR adds agent-facing instructions for the universal tool/model logging system. They tell every AI agent to write a .gym_attribution.json file (tool, model, and an optional version) in the problem folder before any submit, so that the CLI can attribute the run and the submission is never rejected for missing attribution.
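As a rough illustration of what an agent would do before submitting, here is a minimal sketch that writes the attribution file. The key names `tool`, `model`, and `version` follow the description above; the exact schema, the accepted values, and the helper name `write_attribution` are assumptions, and the temp-file-then-rename step is a design choice of this sketch rather than something the PR requires.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_attribution(problem_dir: str, tool: str, model: str,
                      version: Optional[str] = None) -> Path:
    """Write .gym_attribution.json into the problem folder (hypothetical helper)."""
    payload = {"tool": tool, "model": model}
    if version is not None:
        payload["version"] = version  # optional field

    target = Path(problem_dir) / ".gym_attribution.json"
    # Write to a temp file in the same folder, then rename it into place,
    # so a reader never sees a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=problem_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)
    os.replace(tmp_name, target)
    return target
```

An agent would call something like `write_attribution(".", "my-tool", "my-model")` from the problem folder before running submit; the values shown are placeholders.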

Companion PRs:

  • aicodinggym-cli: MLE-Bench_Logger — detection + reject gate.
  • AICodingGym (site): leaderboard-tools-models — persistence + leaderboard.

🤖 Generated with Claude Code

RayirthDinesh and others added 30 commits April 10, 2026 14:04
Replace the "Latest metric" stat with "Best accuracy" so the prominent
value is always the best (max) of all recorded metric cards, not the
most recently logged one. Applied in both the standalone dashboard.html
and supervisor.sh's ensure_dashboard HEREDOC so newly-seeded problem
folders match.
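The "Best accuracy" stat described in this commit amounts to taking the maximum over the recorded metric values instead of the last one. A small sketch, assuming the metrics arrive as a list of floats that may contain `None` for runs with no measurement (the function name is hypothetical, and the real dashboard computes this in its own page script):

```python
from typing import Iterable, Optional


def best_accuracy(metrics: Iterable[Optional[float]]) -> Optional[float]:
    """Return the highest recorded metric, or None if nothing was recorded."""
    values = [m for m in metrics if m is not None]
    return max(values) if values else None
```

With `[0.71, 0.84, 0.79]` the most recently logged value is 0.79, while the stat now shows 0.84.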
Adds MLEL_DASH_V2 marker to ensure_dashboard() heredoc and checks for
it on entry — if missing, deletes and regenerates so existing problems
with old dashboards get dot-tooltip, agent metadata, and cell breakdown
without manual intervention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
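The marker check in this commit lives in a shell function in supervisor.sh. The same logic is sketched here in Python for illustration only. The marker string comes from the commit message; the template content is a placeholder, and the file name `dashboard.html` inside the problem folder is assumed from the earlier commit.

```python
from pathlib import Path

MARKER = "MLEL_DASH_V2"  # marker string named in the commit message

# Placeholder template; the real dashboard HTML is generated by supervisor.sh.
TEMPLATE = f"<!-- {MARKER} -->\n<html><body>dashboard</body></html>\n"


def ensure_dashboard(problem_dir: str) -> bool:
    """Regenerate dashboard.html if it is missing or lacks the marker.

    Returns True when the file was (re)written, False when it was current.
    """
    path = Path(problem_dir) / "dashboard.html"
    if path.exists() and MARKER in path.read_text(encoding="utf-8"):
        return False  # dashboard already carries the marker
    path.unlink(missing_ok=True)  # delete the stale copy, if any
    path.write_text(TEMPLATE, encoding="utf-8")
    return True
```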
- AICodingGym brand theme (#F97316 orange, #FAF8F5 cream background)
- Dynamic Y-axis zoom: pads around actual accuracy range for readability
- Approach summary split into 3 sections: Preprocessing / Model / Training Strategy
- Model card shows name + hyperparameters table
- MLE-bench-only: header reads 'MLE Bench Logger' with Live badge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
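The dynamic Y-axis zoom mentioned in this commit can be pictured as padding the observed accuracy range rather than always plotting 0 to 1. A minimal sketch of that calculation, assuming accuracy values lie in [0, 1]; the padding fraction, the minimum padding, and the function name are illustrative choices, not values taken from the dashboard:

```python
from typing import Sequence, Tuple


def padded_y_range(values: Sequence[float], pad_fraction: float = 0.1,
                   min_pad: float = 0.01) -> Tuple[float, float]:
    """Return (y_min, y_max) padded around the data and clamped to [0, 1]."""
    if not values:
        raise ValueError("need at least one accuracy value")
    lo, hi = min(values), max(values)
    # Pad by a fraction of the spread, with a floor so that a single
    # point (zero spread) still gets a visible axis range.
    pad = max((hi - lo) * pad_fraction, min_pad)
    return max(0.0, lo - pad), min(1.0, hi + pad)
```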
When a prompt has cells:[] in solution_log.json, the watcher now reads
solution.ipynb alongside and injects the full cell/line breakdown automatically.
Works for all MLE-bench problems without requiring the AI to write cells manually.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
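One way to picture the injection step in this commit: when a log entry has an empty `cells` list, read the notebook JSON and build the breakdown from its code cells. The sketch below relies only on the standard .ipynb layout (`cells`, `cell_type`, `source`). The shape of the output records (`cell`, `lines`, `line`, `content`) and both function names are assumptions, since the watcher's actual format is not shown in this PR.

```python
import json
from pathlib import Path


def cells_from_notebook(notebook: dict) -> list:
    """Build a per-cell, per-line breakdown from parsed .ipynb JSON."""
    cells = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format allows source as a string or a list of strings.
        text = "".join(source) if isinstance(source, list) else source
        lines = [
            {"line": n, "content": line}
            for n, line in enumerate(text.splitlines(), start=1)
            if line.strip()  # keep non-blank lines only
        ]
        cells.append({"cell": index, "lines": lines})
    return cells


def inject_cells(entry: dict, notebook_path: str) -> dict:
    """Fill entry["cells"] from the notebook only when it is an empty list."""
    if entry.get("cells") == []:
        notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
        entry["cells"] = cells_from_notebook(notebook)
    return entry
```

Entries that already carry a non-empty `cells` list are left untouched, so a breakdown written by the agent is not overwritten.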
RayirthDinesh and others added 3 commits May 14, 2026 20:41
- accuracy must be a real measured float after notebook runs, never null
- cells must include every non-blank source line with content + ai_summary
- model must be non-null even for lookup/rule-based approaches
- CLAUDE.md adds explicit numbered hard-rules section for Claude Code agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
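The hard rules in this commit could be checked mechanically. The following is a sketch of such a check, not the project's validator. It assumes a log entry is a dict with `accuracy`, `model`, and `cells` keys, and that each cell holds a `lines` list whose items carry `content` and `ai_summary`. It does not verify that every non-blank notebook line is present, which would require comparing against the notebook itself.

```python
import math


def validate_entry(entry: dict) -> list:
    """Return a list of rule violations; an empty list means the entry passes."""
    errors = []

    accuracy = entry.get("accuracy")
    # Must be a real, finite float: rejects None, strings, bools, NaN, inf.
    if not isinstance(accuracy, float) or not math.isfinite(accuracy):
        errors.append("accuracy must be a measured float, never null")

    if not entry.get("model"):
        errors.append("model must be non-null")

    cells = entry.get("cells") or []
    if not cells:
        errors.append("cells must not be empty")
    for index, cell in enumerate(cells):
        for line in cell.get("lines", []):
            if not line.get("content") or not line.get("ai_summary"):
                errors.append(f"cell {index}: line needs content and ai_summary")
    return errors
```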
Atomic tmp->solution_log.json rename fires FileMovedEvent on Windows,
not modified/created. on_moved was missing so renames were silently dropped.
Refactored _check/_fire to share debounce logic across all three event types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
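To make the fix in this commit concrete, here is a sketch of a handler in which all three event types go through the same `_check`/`_fire` path. The method names `on_created`, `on_modified`, and `on_moved`, and the `src_path`/`dest_path` event attributes, match the watchdog library's handler interface. The class itself, its constructor parameters, and the debounce interval are assumptions; to stay self-contained it is written as a plain class, whereas real code would subclass `watchdog.events.FileSystemEventHandler`.

```python
import time


class SolutionLogHandler:
    """Duck-typed file event handler with one shared debounce path."""

    def __init__(self, callback, filename="solution_log.json",
                 debounce=0.5, clock=time.monotonic):
        self.callback = callback
        self.filename = filename
        self.debounce = debounce
        self.clock = clock
        self._last_fired = None

    def _check(self, path):
        # Normalise separators so Windows-style paths match too.
        name = str(path).replace("\\", "/").rsplit("/", 1)[-1]
        if name == self.filename:
            self._fire(path)

    def _fire(self, path):
        now = self.clock()
        if self._last_fired is not None and now - self._last_fired < self.debounce:
            return  # too soon after the previous fire; drop the duplicate
        self._last_fired = now
        self.callback(path)

    def on_created(self, event):
        self._check(event.src_path)

    def on_modified(self, event):
        self._check(event.src_path)

    def on_moved(self, event):
        # For a tmp -> solution_log.json rename, the file of interest
        # is the destination, not the source.
        self._check(event.dest_path)
```

The key detail is that `on_moved` inspects `dest_path`: in an atomic rename the source is the temporary file, so checking `src_path` would never match.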
Document the .gym_attribution.json self-report file every AI agent must
write before running any submit, so the CLI can attribute the tool and
model used. Note that submission is rejected without it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
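The reject gate itself is implemented in the companion aicodinggym-cli PR and is not shown here. As a hedged sketch of what such a check might look like on the reading side: the function name, the error messages, and the rule that `tool` and `model` must be non-empty strings are all assumptions.

```python
import json
from pathlib import Path


def load_attribution(problem_dir: str) -> dict:
    """Load and sanity-check .gym_attribution.json; raise ValueError if unusable."""
    path = Path(problem_dir) / ".gym_attribution.json"
    if not path.is_file():
        raise ValueError("missing .gym_attribution.json; write it before submit")
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"attribution file is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("attribution file must contain a JSON object")
    for key in ("tool", "model"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"attribution field '{key}' must be a non-empty string")
    return data  # "version" is optional and passed through unchanged
```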
@RayirthDinesh RayirthDinesh changed the title from "Improved structure" to "Universal tool and model logging to leaderboard and Prisma" Jun 1, 2026