Skip to content

Add dotted eval config CLI flags#1470

Closed
xeophon wants to merge 1 commit into
mainfrom
cli-pydantic-config-backcompat
Closed

Add dotted eval config CLI flags#1470
xeophon wants to merge 1 commit into
mainfrom
cli-pydantic-config-backcompat

Conversation

@xeophon
Copy link
Copy Markdown
Member

@xeophon xeophon commented May 26, 2026

Summary

  • add dotted direct CLI config sections for args, taskset, harness, sampling, and extra env kwargs
  • keep legacy JSON flags for backwards compatibility
  • update eval/environment docs and generated environment AGENTS guides

Examples

Legacy JSON flags still work:

prime eval run my-env \
  -a '{"split": "test"}' \
  -S '{"top_p": 0.9}' \
  -x '{"concurrency": 256}'

The new dotted flags provide the same config without JSON quoting:

prime eval run my-env \
  --args.split test \
  --sampling.top-p 0.9 \
  --extra-env-kwargs.concurrency 256

V1 taskset/harness config can be passed directly:

prime eval run my-v1-env \
  --taskset.split eval \
  --harness.max-turns 20 \
  --taskset.scoring.exact-answer.weight 0.5

Tests

  • uv run pytest tests/test_eval_cli.py -q
  • uv run pytest tests/test_v1_config_extension.py -q -k 'load_environment_validates_typed_env_config_arg or load_environment_validates_env_config_subclass_sections or load_environment_uses_factory_annotations_for_child_config_types'
  • uv run ruff check --fix
  • uv run ruff format
  • UV_FROZEN=1 git commit hooks: ruff, semgrep, sync-agents
  • UV_FROZEN=1 git push hooks: ruff, semgrep, sync-agents, ty

Note

Medium Risk
Changes how eval CLI args are parsed and merged into run config; mistakes could misconfigure loads, sampling, or v1 env sections, though legacy JSON paths and new tests reduce regression risk.

Overview
Adds dotted CLI flags for direct prime eval run configuration, aligned with TOML sections: --args.*, --taskset.*, --harness.*, --sampling.*, and --extra-env-kwargs.*. The eval entrypoint now uses parse_known_args plus pydantic_config (DirectEvalConfig) to parse trailing config tokens, merge them with legacy JSON flags (-a, -x, -S), and normalize into env_args / sampling_args / v1 config sections. TOML eval configs reject leftover unknown args.

Legacy -a, -x, and -S JSON remain supported. Docs and synced AGENTS guides document the new flags and examples (including concurrency override via --extra-env-kwargs.concurrency). CLI tests cover precedence, v1 nesting, and backward compatibility.

Reviewed by Cursor Bugbot for commit 824fc7d. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Add dotted CLI flags for eval config sections in verifiers/scripts/eval.py

  • Introduces dotted CLI flags (--args.<key>, --taskset.<key>, --harness.<key>, --extra-env-kwargs.<key>, --sampling.<key>) as an alternative to legacy JSON flags for configuring eval runs.
  • Adds parse_known_args to separate standard flags from dotted config args, and merge_direct_eval_config to deep-merge dotted values with legacy JSON/flag-derived values with defined precedence.
  • Dotted flags are parsed via pydantic_config into a DirectEvalConfig struct, then normalized using normalize_sampling_config and normalize_env_config_sections.
  • Extra config args alongside a TOML config file now cause an immediate error exit.
  • Behavioral Change: --timeout flag takes precedence over timeout_seconds set via either dotted or legacy --extra-env-kwargs; dotted --sampling.* values take precedence over individual --max-tokens/--temperature flags.

Macroscope summarized 824fc7d.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 824fc7d. Configure here.

Comment thread verifiers/scripts/eval.py
merged = normalize_sampling_config(
merged, "CLI config", merge_sampling_with_existing=True
)
return normalize_env_config_sections(merged)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Legacy JSON args overlap failure

Medium Severity

Combining legacy -a / --env-args JSON with dotted --args.* flags for the same loader key makes merge_direct_eval_config call normalize_env_config_sections, which raises instead of merging. Dotted overrides cannot replace individual keys from the legacy JSON object.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 824fc7d. Configure here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not allow all the cursed configs, imo

Comment thread verifiers/scripts/eval.py

def main(argv: list[str] | None = None):
args = parse_args(argv)
args, config_args = parse_known_args(argv)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eval skills not updated

Low Severity

This PR changes the prime eval run command contract and docs/evaluation.md, but skills/evaluate-environments/SKILL.md still documents only legacy JSON for v1 overrides and -x for extra kwargs, not the new dotted flags.

Fix in Cursor Fix in Web

Triggered by project rule: BugBot Instructions

Reviewed by Cursor Bugbot for commit 824fc7d. Configure here.

@macroscopeapp
Copy link
Copy Markdown

macroscopeapp Bot commented May 26, 2026

Approvability

Verdict: Needs human review

Unable to check for correctness in 824fc7d. This PR introduces new CLI parsing behavior with an unresolved medium-severity bug identified: combining legacy JSON args with new dotted flags for the same key raises an error instead of merging. This substantive issue requires human review.

You can customize Macroscope's approvability policy. Learn more.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 824fc7d6d3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/scripts/eval.py
merged = normalize_sampling_config(
merged, "CLI config", merge_sampling_with_existing=True
)
return normalize_env_config_sections(merged)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow mixed legacy and dotted env config flags

merge_direct_eval_config() always runs normalize_env_config_sections() on CLI input, which raises when both legacy --env-args (with config.taskset/config.harness) and new dotted --taskset.*/--harness.* are provided. That makes incremental migration fail with a hard ValueError (for example, -a '{"config":{"taskset":{"split":"train"}}}' --taskset.max-turns 4), even though this change advertises legacy/new flag compatibility; these inputs should merge with deterministic precedence rather than abort.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we shouldn't allow cursed setups

@xeophon
Copy link
Copy Markdown
Member Author

xeophon commented May 28, 2026

will be done more in-depth with #1473

@xeophon xeophon closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant