Universal tool and model logging to leaderboard and Prisma #1
Open
RayirthDinesh wants to merge 33 commits into
Replace the "Latest metric" stat with "Best accuracy" so the prominent value is always the best (max) of all recorded metric cards, not the most recently logged one. Applied in both the standalone dashboard.html and supervisor.sh's ensure_dashboard HEREDOC so newly-seeded problem folders match.
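The difference between "latest" and "best" can be illustrated with a minimal Python sketch. The actual dashboard logic lives in dashboard.html, so the function name `best_accuracy` and the list-of-floats input are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def best_accuracy(values: Sequence[Optional[float]]) -> Optional[float]:
    """Return the highest recorded accuracy, ignoring missing entries."""
    recorded = [v for v in values if v is not None]
    return max(recorded) if recorded else None


# Hypothetical metric history: the most recent value is not the best one.
history = [0.71, 0.84, None, 0.79]
print(best_accuracy(history))  # 0.84
```

With this history, a "Latest metric" stat would show 0.79, while "Best accuracy" shows 0.84.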
Adds an MLEL_DASH_V2 marker to the ensure_dashboard() heredoc and checks for it on entry. If the marker is missing, the dashboard is deleted and regenerated, so existing problems with old dashboards get the dot tooltip, agent metadata, and cell breakdown without manual intervention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
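The marker check described in this commit is implemented in supervisor.sh. The following is only a rough Python sketch of the same idea; the `render` callback and the boolean return value are hypothetical and not taken from the PR.

```python
from pathlib import Path
from typing import Callable

MARKER = "MLEL_DASH_V2"  # marker string named in the commit message


def ensure_dashboard(path: Path, render: Callable[[], str]) -> bool:
    """Regenerate the dashboard when it is missing or lacks the marker.

    Returns True if a new file was written, False if the existing one was kept.
    `render` is a hypothetical callback that produces the current template.
    """
    if path.exists():
        if MARKER in path.read_text(encoding="utf-8"):
            return False  # already the current version, leave it alone
        path.unlink()  # stale dashboard without the marker
    path.write_text(render(), encoding="utf-8")
    return True
```

Embedding the marker in the generated file itself means the check needs no separate version file.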
- AICodingGym brand theme (#F97316 orange, #FAF8F5 cream background)
- Dynamic Y-axis zoom: pads around actual accuracy range for readability
- Approach summary split into 3 sections: Preprocessing / Model / Training Strategy
- Model card shows name + hyperparameters table
- MLE-bench-only: header reads 'MLE Bench Logger' with Live badge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
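The dynamic Y-axis zoom could work roughly as follows. This is a sketch under stated assumptions: accuracies are fractions in [0, 1], the input is non-empty, and the `pad_fraction` and `min_pad` parameters are hypothetical. The dashboard's real padding values are not given in the commit message.

```python
from typing import Sequence, Tuple


def padded_y_range(
    accuracies: Sequence[float],
    pad_fraction: float = 0.1,  # assumed: pad by 10% of the observed spread
    min_pad: float = 0.01,      # assumed: floor so a flat series still gets a visible band
) -> Tuple[float, float]:
    """Return (y_min, y_max) padded around the observed range, clamped to [0, 1]."""
    lo, hi = min(accuracies), max(accuracies)
    pad = max((hi - lo) * pad_fraction, min_pad)
    return max(0.0, lo - pad), min(1.0, hi + pad)
```

Zooming to the observed range keeps small differences between runs visible, which a fixed 0 to 1 axis would flatten.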
When a prompt has cells:[] in solution_log.json, the watcher now reads solution.ipynb alongside and injects the full cell/line breakdown automatically. Works for all MLE-bench problems without requiring the AI to write cells manually.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
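A minimal sketch of the injection step, assuming the standard Jupyter notebook JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The log schema used here (a top-level `prompts` list and the `cell`, `lines`, `line` keys) is hypothetical; only the `cells` and `content` fields are named in the commit messages.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def cells_from_notebook(nb_path: Path) -> List[Dict[str, Any]]:
    """Build a per-cell, per-line breakdown of the code cells in a notebook."""
    nb = json.loads(nb_path.read_text(encoding="utf-8"))
    cells = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Notebook JSON stores source either as one string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        lines = [
            {"line": n, "content": line}
            for n, line in enumerate(text.splitlines(), start=1)
            if line.strip()  # keep non-blank lines only
        ]
        cells.append({"cell": index, "lines": lines})
    return cells


def fill_empty_cells(log: Dict[str, Any], nb_path: Path) -> Dict[str, Any]:
    """Inject the notebook breakdown into every prompt whose cells list is empty."""
    for prompt in log.get("prompts", []):  # "prompts" key is an assumed schema
        if prompt.get("cells") == [] and nb_path.exists():
            prompt["cells"] = cells_from_notebook(nb_path)
    return log
```

Prompts that already carry a non-empty `cells` list are left untouched, so a breakdown written by the agent is never overwritten.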
- accuracy must be a real measured float after notebook runs, never null
- cells must include every non-blank source line with content + ai_summary
- model must be non-null even for lookup/rule-based approaches
- CLAUDE.md adds explicit numbered hard-rules section for Claude Code agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
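The first three rules above could be checked mechanically. The sketch below is illustrative only: the field names `accuracy`, `model`, `cells`, `content`, and `ai_summary` come from the commit message, while the nesting of lines under a `lines` key and the function `validate_entry` are assumptions.

```python
import math
from typing import Any, Dict, List


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations; an empty list means the entry passes."""
    errors = []

    accuracy = entry.get("accuracy")
    if not isinstance(accuracy, float) or not math.isfinite(accuracy):
        errors.append("accuracy must be a measured float, never null")

    if entry.get("model") is None:
        errors.append("model must be non-null, even for lookup/rule-based approaches")

    cells = entry.get("cells") or []
    if not cells:
        errors.append("cells must not be empty")
    for c, cell in enumerate(cells):
        for n, line in enumerate(cell.get("lines", [])):  # "lines" key is assumed
            content = (line.get("content") or "").strip()
            summary = (line.get("ai_summary") or "").strip()
            if not content or not summary:
                errors.append(f"cell {c} line {n}: needs content and ai_summary")
    return errors
```

This check cannot confirm that every non-blank source line is present. That would require comparing the entry against the notebook itself.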
An atomic tmp -> solution_log.json rename fires a FileMovedEvent on Windows, not a modified or created event. on_moved was missing, so renames were silently dropped. Refactored _check/_fire to share debounce logic across all three event types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
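A sketch of how three event callbacks can share one debounce path. To stay self-contained it does not import watchdog. In real use the class would presumably subclass `watchdog.events.FileSystemEventHandler`, whose move events expose `src_path` and `dest_path`. The class name, constructor parameters, and the 0.5 s debounce window are hypothetical.

```python
import time
from pathlib import PureWindowsPath
from typing import Callable, Dict


class LogWatcher:
    """Duck-typed stand-in for a watchdog event handler (assumption, see lead-in)."""

    def __init__(
        self,
        on_change: Callable[[str], None],
        target_name: str = "solution_log.json",
        debounce_s: float = 0.5,  # assumed window, not taken from the PR
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._on_change = on_change
        self._target = target_name
        self._debounce_s = debounce_s
        self._clock = clock
        self._last_fired: Dict[str, float] = {}

    def _check(self, path: str) -> None:
        # PureWindowsPath splits on both "/" and "\\", so either style works.
        if PureWindowsPath(path).name != self._target:
            return
        now = self._clock()
        last = self._last_fired.get(path)
        if last is not None and now - last < self._debounce_s:
            return  # same file fired too recently, drop the duplicate
        self._last_fired[path] = now
        self._fire(path)

    def _fire(self, path: str) -> None:
        self._on_change(path)

    def on_modified(self, event) -> None:
        self._check(event.src_path)

    def on_created(self, event) -> None:
        self._check(event.src_path)

    def on_moved(self, event) -> None:
        # For a rename, the file of interest is the destination.
        self._check(event.dest_path)
```

Routing every callback through `_check` means a rename followed immediately by a modify event on the same file triggers only one reload.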
Document the .gym_attribution.json self-report file every AI agent must write before running any submit, so the CLI can attribute the tool and model used. Note that submission is rejected without it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
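An agent might write the file along these lines. The PR description lists tool, model, and an optional version as the file's contents. The exact JSON key names, the `write_attribution` helper, and the temp-file-then-replace step are assumptions, not the documented format.

```python
import json
import os
from pathlib import Path
from typing import Optional


def write_attribution(
    problem_dir: Path, tool: str, model: str, version: Optional[str] = None
) -> Path:
    """Write .gym_attribution.json into the problem folder and return its path."""
    payload = {"tool": tool, "model": model}  # key names are assumed
    if version is not None:
        payload["version"] = version

    target = problem_dir / ".gym_attribution.json"
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    os.replace(tmp, target)  # replace in one step so readers never see a partial file
    return target


# Hypothetical usage, run before any submit:
# write_attribution(Path("."), tool="example-tool", model="example-model")
```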
Agent-facing instructions for the universal tool/model logging system. Tells every AI agent to write a .gym_attribution.json (tool, model, optional version) in the problem folder before any submit, so the CLI can attribute the run and submission is never rejected for missing attribution.

Companion PRs:
- MLE-Bench_Logger: detection + reject gate.
- leaderboard-tools-models: persistence + leaderboard.

🤖 Generated with Claude Code