Add dotted eval config CLI flags #1470
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 824fc7d.
    merged = normalize_sampling_config(
        merged, "CLI config", merge_sampling_with_existing=True
    )
    return normalize_env_config_sections(merged)
Legacy JSON args overlap failure
Medium Severity
Combining legacy -a / --env-args JSON with dotted --args.* flags for the same loader key makes merge_direct_eval_config call normalize_env_config_sections, which raises instead of merging. Dotted overrides cannot replace individual keys from the legacy JSON object.
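To make the reported behavior concrete, here is a minimal sketch of a strict merge that raises on overlapping keys rather than merging them. The function name `merge_strict` and the example keys are hypothetical; this is not the PR's `merge_direct_eval_config` or `normalize_env_config_sections`, only an illustration of the "raise instead of merge" contract the comment describes.

```python
from typing import Any


def merge_strict(legacy: dict[str, Any], dotted: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical strict merge: refuse keys that are set by both input styles."""
    overlap = sorted(set(legacy) & set(dotted))
    if overlap:
        # Mirrors the reported behavior: overlapping keys abort instead of merging.
        raise ValueError(f"keys set via both legacy JSON and dotted flags: {overlap}")
    return {**legacy, **dotted}


# Disjoint keys combine cleanly.
combined = merge_strict({"split": "train"}, {"num_examples": 5})

# The same key from both sources is rejected.
try:
    merge_strict({"split": "train"}, {"split": "test"})
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
```

Under this contract a dotted flag can add keys to a legacy JSON object but cannot replace one of them.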
we should not allow all the cursed configs, imo
     def main(argv: list[str] | None = None):
    -    args = parse_args(argv)
    +    args, config_args = parse_known_args(argv)
Eval skills not updated
Low Severity
This PR changes the prime eval run command contract and docs/evaluation.md, but skills/evaluate-environments/SKILL.md still documents only legacy JSON for v1 overrides and -x for extra kwargs, not the new dotted flags.
Triggered by project rule: BugBot Instructions
Approvability
Verdict: Needs human review
Unable to check for correctness in 824fc7d. This PR introduces new CLI parsing behavior with an unresolved medium-severity bug identified: combining legacy JSON args with new dotted flags for the same key raises an error instead of merging. This substantive issue requires human review.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 824fc7d6d3
    merged = normalize_sampling_config(
        merged, "CLI config", merge_sampling_with_existing=True
    )
    return normalize_env_config_sections(merged)
Allow mixed legacy and dotted env config flags
merge_direct_eval_config() always runs normalize_env_config_sections() on CLI input, which raises when both legacy --env-args (with config.taskset/config.harness) and new dotted --taskset.*/--harness.* are provided. That makes incremental migration fail with a hard ValueError (for example, -a '{"config":{"taskset":{"split":"train"}}}' --taskset.max-turns 4), even though this change advertises legacy/new flag compatibility; these inputs should merge with deterministic precedence rather than abort.
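For reference, the "merge with deterministic precedence" behavior this suggestion asks for could look roughly like the sketch below. `deep_merge` is a hypothetical helper, not code from the PR, and the `max_turns` key assumes that `--taskset.max-turns` maps to an underscore-separated name.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict: nested dicts merge recursively, override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Legacy JSON as passed to -a, and what a dotted --taskset.max-turns 4 might produce
# (the nesting under "config" is an assumption based on the example in this comment).
legacy = {"config": {"taskset": {"split": "train"}}}
dotted = {"config": {"taskset": {"max_turns": 4}}}
result = deep_merge(legacy, dotted)
```

The second argument wins on conflicts, so the caller fixes precedence by choosing the argument order.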
no, we shouldn't allow cursed setups
will be done more in-depth with #1473


Summary
Examples
Legacy JSON flags still work:
The new dotted flags provide the same config without JSON quoting:
V1 taskset/harness config can be passed directly:
Tests
Note
Medium Risk
Changes how eval CLI args are parsed and merged into run config; mistakes could misconfigure loads, sampling, or v1 env sections, though legacy JSON paths and new tests reduce regression risk.
Overview
Adds dotted CLI flags for direct prime eval run configuration, aligned with TOML sections: --args.*, --taskset.*, --harness.*, --sampling.*, and --extra-env-kwargs.*. The eval entrypoint now uses parse_known_args plus pydantic_config (DirectEvalConfig) to parse trailing config tokens, merge them with legacy JSON flags (-a, -x, -S), and normalize into env_args / sampling_args / v1 config sections. TOML eval configs reject leftover unknown args.

Legacy -a, -x, and -S JSON remain supported. Docs and synced AGENTS guides document the new flags and examples (including concurrency override via --extra-env-kwargs.concurrency). CLI tests cover precedence, v1 nesting, and backward compatibility.
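The split between standard flags and trailing dotted tokens described in this overview can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's code: the PR parses the leftover tokens with pydantic_config into DirectEvalConfig, whereas the hypothetical `parse_dotted` below is a hand-rolled stand-in, and the hyphen-to-underscore key mapping and the JSON-style value coercion are both assumed.

```python
from __future__ import annotations

import argparse
import json
from typing import Any


def parse_dotted(tokens: list[str]) -> dict[str, Any]:
    """Hypothetical stand-in: ["--sampling.temperature", "0.2"] -> {"sampling": {"temperature": 0.2}}."""
    config: dict[str, Any] = {}
    it = iter(tokens)
    for flag in it:
        if not flag.startswith("--") or "." not in flag:
            raise ValueError(f"unknown argument: {flag}")
        raw = next(it, None)
        if raw is None:
            raise ValueError(f"missing value for {flag}")
        try:
            value = json.loads(raw)  # numbers, booleans, null, JSON objects
        except json.JSONDecodeError:
            value = raw  # anything else stays a plain string
        *parents, leaf = flag[2:].replace("-", "_").split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


parser = argparse.ArgumentParser()
parser.add_argument("env_id")
parser.add_argument("--max-tokens", type=int)

# parse_known_args returns the tokens argparse does not recognise instead of erroring.
args, config_tokens = parser.parse_known_args(
    ["my-env", "--max-tokens", "64", "--sampling.temperature", "0.2", "--args.split", "train"]
)
config = parse_dotted(config_tokens)
```

Anything that is not a dotted flag raises here, which loosely corresponds to rejecting leftover unknown args.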
Note
Add dotted CLI flags for eval config sections in verifiers/scripts/eval.py
- Adds dotted flags (--args.<key>, --taskset.<key>, --harness.<key>, --extra-env-kwargs.<key>, --sampling.<key>) as an alternative to legacy JSON flags for configuring eval runs.
- Uses parse_known_args to separate standard flags from dotted config args, and merge_direct_eval_config to deep-merge dotted values with legacy JSON/flag-derived values with defined precedence.
- Parses dotted args via pydantic_config into a DirectEvalConfig struct, then normalizes them using normalize_sampling_config and normalize_env_config_sections.
- The --timeout flag takes precedence over timeout_seconds set via either dotted or legacy --extra-env-kwargs; dotted --sampling.* values take precedence over individual --max-tokens/--temperature flags.

Macroscope summarized 824fc7d.
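The two precedence rules stated in this summary can be written out as a small sketch. The function `resolve_precedence` and the `max_tokens` / `temperature` key names are assumptions for illustration; only the ordering (--timeout over timeout_seconds, dotted --sampling.* over the individual flags) comes from the summary.

```python
from __future__ import annotations

from typing import Any


def resolve_precedence(
    *,
    timeout: float | None,
    max_tokens: int | None,
    temperature: float | None,
    extra_env_kwargs: dict[str, Any],
    sampling: dict[str, Any],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Hypothetical helper applying the two precedence rules from the summary."""
    extra = dict(extra_env_kwargs)
    if timeout is not None:
        extra["timeout_seconds"] = timeout  # --timeout wins over timeout_seconds

    merged_sampling: dict[str, Any] = {}
    if max_tokens is not None:
        merged_sampling["max_tokens"] = max_tokens
    if temperature is not None:
        merged_sampling["temperature"] = temperature
    merged_sampling.update(sampling)  # dotted --sampling.* wins over individual flags
    return extra, merged_sampling


extra, sampling_args = resolve_precedence(
    timeout=30,
    max_tokens=64,
    temperature=0.7,
    extra_env_kwargs={"timeout_seconds": 10, "concurrency": 4},
    sampling={"temperature": 0.2},
)
```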