Universal tool and model logging to leaderboard and Prisma by RayirthDinesh · Pull Request #1 · AICodingGym/gym-environment

RayirthDinesh · 2026-05-15T04:59:59Z

Agent-facing instructions for the universal tool/model logging system. Tells every AI agent to write a .gym_attribution.json (tool, model, optional version) in the problem folder before any submit, so the CLI can attribute the run and submission is never rejected for missing attribution.
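As a rough illustration of what an agent would do before submitting, the sketch below writes the attribution file into a problem folder. The JSON key names (`tool`, `model`, `version`) mirror the fields named in the description but their exact spelling is an assumption, and `write_attribution` and the placeholder values are hypothetical, not part of the CLI.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def write_attribution(problem_dir: Path, tool: str, model: str,
                      version: Optional[str] = None) -> Path:
    """Write .gym_attribution.json into the problem folder (hypothetical helper)."""
    # Key names are assumed from the PR description: tool, model, optional version.
    payload = {"tool": tool, "model": model}
    if version is not None:
        payload["version"] = version
    target = problem_dir / ".gym_attribution.json"
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Demo against a throwaway directory; a real agent would use the problem folder.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_attribution(Path(tmp), tool="example-tool", model="example-model")
        print(path.read_text(encoding="utf-8"))
```

Omitting `version` leaves the key out entirely, which matches the "optional version" wording above.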

Companion PRs:

aicodinggym-cli: MLE-Bench_Logger — detection + reject gate.
AICodingGym (site): leaderboard-tools-models — persistence + leaderboard.

🤖 Generated with Claude Code

Replace the "Latest metric" stat with "Best accuracy" so the prominent value is always the best (max) of all recorded metric cards, not the most recently logged one. Applied in both the standalone dashboard.html and supervisor.sh's ensure_dashboard HEREDOC so newly-seeded problem folders match.
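The difference between the two stats can be sketched in a few lines. The dashboard itself is HTML, so this Python version only illustrates the selection rule; `best_accuracy` and `latest_metric` are hypothetical names, and skipping missing values is an assumption.

```python
from typing import Iterable, List, Optional


def _recorded(metrics: Iterable[Optional[float]]) -> List[float]:
    """Drop missing values so only recorded metrics are compared."""
    return [m for m in metrics if m is not None]


def latest_metric(metrics: Iterable[Optional[float]]) -> Optional[float]:
    """Old behavior: the most recently logged value."""
    recorded = _recorded(metrics)
    return recorded[-1] if recorded else None


def best_accuracy(metrics: Iterable[Optional[float]]) -> Optional[float]:
    """New behavior: the maximum over all recorded values."""
    recorded = _recorded(metrics)
    return max(recorded) if recorded else None


if __name__ == "__main__":
    history = [0.71, 0.84, 0.79]  # illustrative values only
    print(latest_metric(history), best_accuracy(history))
```

With a run history that regresses after its peak, the two functions disagree, which is the case the commit is meant to fix.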

Adds an MLEL_DASH_V2 marker to the ensure_dashboard() heredoc and checks for it on entry. If the marker is missing, the dashboard is deleted and regenerated, so existing problems with old dashboards get the dot tooltip, agent metadata, and cell breakdown without manual intervention.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
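The marker check lives in supervisor.sh, so the following is only a Python rendering of the same idea. The per-problem file name `dashboard.html`, the HTML-comment form of the marker, and the placeholder template are assumptions for illustration.

```python
import tempfile
from pathlib import Path

MARKER = "MLEL_DASH_V2"
# Placeholder template; the real dashboard content is in the heredoc.
TEMPLATE = f"<!-- {MARKER} -->\n<html><body>dashboard placeholder</body></html>\n"


def ensure_dashboard(problem_dir: Path) -> bool:
    """Regenerate the dashboard unless it already carries the marker.

    Returns True when the file was (re)written, False when it was left alone.
    """
    dashboard = problem_dir / "dashboard.html"
    if dashboard.exists() and MARKER in dashboard.read_text(encoding="utf-8"):
        return False  # already the current version
    dashboard.unlink(missing_ok=True)  # requires Python 3.8+
    dashboard.write_text(TEMPLATE, encoding="utf-8")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(ensure_dashboard(Path(tmp)))  # first call writes the file
        print(ensure_dashboard(Path(tmp)))  # second call finds the marker
```

Keying the check on a marker string rather than a timestamp means any dashboard generated before the marker existed is treated as stale.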

- AICodingGym brand theme (#F97316 orange, #FAF8F5 cream background)
- Dynamic Y-axis zoom: pads around actual accuracy range for readability
- Approach summary split into 3 sections: Preprocessing / Model / Training Strategy
- Model card shows name + hyperparameters table
- MLE-bench-only: header reads 'MLE Bench Logger' with Live badge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
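The dynamic Y-axis zoom in the list above can be approximated as follows. This is a sketch under stated assumptions: accuracy is taken to lie in [0, 1], and the padding fraction and minimum padding are invented defaults, not values from the dashboard.

```python
from typing import Sequence, Tuple


def padded_y_range(values: Sequence[float], pad_fraction: float = 0.1,
                   min_pad: float = 0.01) -> Tuple[float, float]:
    """Return (y_min, y_max) padded around the observed accuracy range."""
    lo, hi = min(values), max(values)
    # Pad proportionally to the spread, with a floor so a flat series still
    # gets a visible band around it.
    pad = max((hi - lo) * pad_fraction, min_pad)
    # Clamp to [0, 1] on the assumption that accuracy is a fraction.
    return max(0.0, lo - pad), min(1.0, hi + pad)


if __name__ == "__main__":
    print(padded_y_range([0.80, 0.85, 0.90]))
```

Compared with a fixed 0 to 1 axis, this keeps small differences between runs visible.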

When a prompt has cells:[] in solution_log.json, the watcher now reads solution.ipynb alongside and injects the full cell/line breakdown automatically. Works for all MLE-bench problems without requiring the AI to write cells manually.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
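A minimal sketch of the injection step, assuming a prompt entry is a dict with a `cells` key. The notebook layout (top-level `cells`, each with `cell_type` and `source`) follows the standard .ipynb JSON format; the output field names `index`, `lines`, `line`, and `content` are hypothetical, as are both function names.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def notebook_cells(notebook_path: Path) -> List[Dict[str, Any]]:
    """Build a per-cell, per-line breakdown of the code cells in a notebook."""
    nb = json.loads(notebook_path.read_text(encoding="utf-8"))
    cells = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # .ipynb stores source either as one string or as a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        lines = [
            {"line": n, "content": line}
            for n, line in enumerate(text.splitlines(), start=1)
            if line.strip()  # skip blank lines
        ]
        cells.append({"index": index, "lines": lines})
    return cells


def inject_cells(prompt_entry: Dict[str, Any], notebook_path: Path) -> Dict[str, Any]:
    """Fill an empty cells list from the notebook; leave other entries untouched."""
    if prompt_entry.get("cells") == [] and notebook_path.exists():
        return {**prompt_entry, "cells": notebook_cells(notebook_path)}
    return prompt_entry


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        nb_path = Path(tmp) / "solution.ipynb"
        nb_path.write_text(json.dumps({"cells": [
            {"cell_type": "code", "source": ["import math\n", "\n", "x = 1\n"]},
        ]}), encoding="utf-8")
        print(inject_cells({"prompt": "demo", "cells": []}, nb_path))
```

Entries that already carry a non-empty `cells` list are returned unchanged, so anything the agent wrote by hand is not overwritten.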

- accuracy must be a real measured float after notebook runs, never null
- cells must include every non-blank source line with content + ai_summary
- model must be non-null even for lookup/rule-based approaches
- CLAUDE.md adds explicit numbered hard-rules section for Claude Code agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
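The first three rules above lend themselves to a mechanical check. The sketch below assumes a log entry is a dict with `accuracy`, `model`, and `cells` keys, where each cell holds a `lines` list of dicts with `content` and `ai_summary`; that shape and the `validate_entry` name are assumptions. It cannot verify that every source line is present, only that each listed line is complete.

```python
import math
from typing import Any, Dict, List


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations; an empty list means the entry passes."""
    errors = []
    accuracy = entry.get("accuracy")
    # Rule 1: accuracy must be a real float, never null (NaN is also rejected).
    if not isinstance(accuracy, float) or math.isnan(accuracy):
        errors.append("accuracy must be a measured float")
    # Rule 3: model must be non-null, even for lookup/rule-based approaches.
    if entry.get("model") is None:
        errors.append("model must be non-null")
    # Rule 2 (partial): every listed line needs both content and ai_summary.
    for cell_no, cell in enumerate(entry.get("cells", [])):
        for line_no, line in enumerate(cell.get("lines", []), start=1):
            if not line.get("content") or not line.get("ai_summary"):
                errors.append(f"cell {cell_no} line {line_no} is incomplete")
    return errors


if __name__ == "__main__":
    print(validate_entry({"accuracy": None, "model": None, "cells": []}))
```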

Atomic tmp->solution_log.json rename fires FileMovedEvent on Windows, not modified/created. on_moved was missing so renames were silently dropped. Refactored _check/_fire to share debounce logic across all three event types.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
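The shape of that refactor might look like the sketch below. It deliberately avoids importing a file-watching library: `FakeEvent` stands in for the library's event objects (assumed to expose `src_path` and, for moves, `dest_path`), and the class name, debounce interval, and injectable clock are assumptions. Only the `_check`/`_fire` names and the three handler names come from the commit message.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class FakeEvent:
    """Stand-in for a watcher event; real events are assumed to look similar."""
    src_path: str
    dest_path: str = ""


class LogChangeHandler:
    def __init__(self, target_name: str, callback: Callable[[str], None],
                 debounce_seconds: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._target_name = target_name
        self._callback = callback
        self._debounce = debounce_seconds
        self._clock = clock
        self._last_fired: Optional[float] = None

    def _check(self, path: str) -> None:
        # Only react to the file we care about, whichever event reported it.
        if path and Path(path).name == self._target_name:
            self._fire(path)

    def _fire(self, path: str) -> None:
        # Shared debounce: drop events that arrive too soon after the last one.
        now = self._clock()
        if self._last_fired is not None and now - self._last_fired < self._debounce:
            return
        self._last_fired = now
        self._callback(path)

    def on_modified(self, event: FakeEvent) -> None:
        self._check(event.src_path)

    def on_created(self, event: FakeEvent) -> None:
        self._check(event.src_path)

    def on_moved(self, event: FakeEvent) -> None:
        # A rename reports the final name in dest_path, not src_path.
        self._check(event.dest_path)
```

Routing all three handlers through `_check` is what lets an atomic rename trigger the same callback as an in-place write, without firing twice when both events arrive close together.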

Document the .gym_attribution.json self-report file every AI agent must write before running any submit, so the CLI can attribute the tool and model used. Note that submission is rejected without it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RayirthDinesh and others added 30 commits April 10, 2026 14:04

md file changes for mle bench

5dd4043

mle changes

af93ace

update md files

3b48566

html dashboard

b202c6d

watcher implemented

0ba9386

summarizer

fba6412

aicoding gym ui theme

6c4c28d

dashboard fix

abd9a79

metric clarity improvement

8e0503a

improved sumamarization

5b57c83

trajectory history added

d0e45a7

improve trajectory analysis

ea564fc

structured logging

4f248cd

improve loggin

f03cd23

dashboard changes

b3446e3

improve supervisor

570b00a

improve logging parameters

50afd7e

structure improvement

958273e

improved summarizer

d9b1f51

prompt display

8dd6be0

improve prompt logging

0a5b594

improve logging

bd2ef10

Dashboard has improved prompt logging

33e7ebc

added watch dog

cdbfc4e

fixed ui

aca4f1c

restructure logging format

f385e29

RayirthDinesh and others added 3 commits May 14, 2026 20:41

RayirthDinesh changed the title ~~Improved structure~~ Universal tool and model logging to leaderboard and Prisma Jun 1, 2026

RayirthDinesh wants to merge 33 commits into main from test
