Add dotted eval config CLI flags by xeophon · Pull Request #1470 · PrimeIntellect-ai/verifiers

xeophon · 2026-05-26T12:00:05Z

Summary

add dotted direct CLI config sections for args, taskset, harness, sampling, and extra env kwargs
keep legacy JSON flags for backwards compatibility
update eval/environment docs and generated environment AGENTS guides

Examples

Legacy JSON flags still work:

prime eval run my-env \
  -a '{"split": "test"}' \
  -S '{"top_p": 0.9}' \
  -x '{"concurrency": 256}'

The new dotted flags provide the same config without JSON quoting:

prime eval run my-env \
  --args.split test \
  --sampling.top-p 0.9 \
  --extra-env-kwargs.concurrency 256

V1 taskset/harness config can be passed directly:

prime eval run my-v1-env \
  --taskset.split eval \
  --harness.max-turns 20 \
  --taskset.scoring.exact-answer.weight 0.5

Tests

uv run pytest tests/test_eval_cli.py -q
uv run pytest tests/test_v1_config_extension.py -q -k 'load_environment_validates_typed_env_config_arg or load_environment_validates_env_config_subclass_sections or load_environment_uses_factory_annotations_for_child_config_types'
uv run ruff check --fix
uv run ruff format
UV_FROZEN=1 git commit hooks: ruff, semgrep, sync-agents
UV_FROZEN=1 git push hooks: ruff, semgrep, sync-agents, ty

Note

Medium Risk
Changes how eval CLI args are parsed and merged into run config; mistakes could misconfigure loads, sampling, or v1 env sections, though legacy JSON paths and new tests reduce regression risk.

Overview
Adds dotted CLI flags for direct prime eval run configuration, aligned with TOML sections: --args.*, --taskset.*, --harness.*, --sampling.*, and --extra-env-kwargs.*. The eval entrypoint now uses parse_known_args plus pydantic_config (DirectEvalConfig) to parse trailing config tokens, merge them with legacy JSON flags (-a, -x, -S), and normalize into env_args / sampling_args / v1 config sections. TOML eval configs reject leftover unknown args.

Legacy -a, -x, and -S JSON remain supported. Docs and synced AGENTS guides document the new flags and examples (including concurrency override via --extra-env-kwargs.concurrency). CLI tests cover precedence, v1 nesting, and backward compatibility.

^{Reviewed by Cursor Bugbot for commit 824fc7d. Bugbot is set up for automated code reviews on this repo. Configure here.}

Note

Add dotted CLI flags for eval config sections in `verifiers/scripts/eval.py`

Introduces dotted CLI flags (--args.<key>, --taskset.<key>, --harness.<key>, --extra-env-kwargs.<key>, --sampling.<key>) as an alternative to legacy JSON flags for configuring eval runs.
Adds parse_known_args to separate standard flags from dotted config args, and merge_direct_eval_config to deep-merge dotted values with legacy JSON/flag-derived values with defined precedence.
Dotted flags are parsed via pydantic_config into a DirectEvalConfig struct, then normalized using normalize_sampling_config and normalize_env_config_sections.
Extra config args alongside a TOML config file now cause an immediate error exit.
Behavioral Change: --timeout flag takes precedence over timeout_seconds set via either dotted or legacy --extra-env-kwargs; dotted --sampling.* values take precedence over individual --max-tokens/--temperature flags.

^{Macroscope summarized 824fc7d.}

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 824fc7d. Configure here.}

cursor · 2026-05-26T12:02:05Z

+    merged = normalize_sampling_config(
+        merged, "CLI config", merge_sampling_with_existing=True
+    )
+    return normalize_env_config_sections(merged)


Legacy JSON args overlap failure

Medium Severity

Combining legacy -a / --env-args JSON with dotted --args.* flags for the same loader key makes merge_direct_eval_config call normalize_env_config_sections, which raises instead of merging. Dotted overrides cannot replace individual keys from the legacy JSON object.

^{Reviewed by Cursor Bugbot for commit 824fc7d. Configure here.}

we should not allow all the cursed configs, imo

cursor · 2026-05-26T12:02:05Z

+
 def main(argv: list[str] | None = None):
-    args = parse_args(argv)
+    args, config_args = parse_known_args(argv)


Eval skills not updated

Low Severity

This PR changes the prime eval run command contract and docs/evaluation.md, but skills/evaluate-environments/SKILL.md still documents only legacy JSON for v1 overrides and -x for extra kwargs, not the new dotted flags.

^{Triggered by project rule: BugBot Instructions}

^{Reviewed by Cursor Bugbot for commit 824fc7d. Configure here.}

macroscopeapp · 2026-05-26T12:02:36Z

Approvability

Verdict: Needs human review

Unable to check for correctness in 824fc7d. This PR introduces new CLI parsing behavior with an unresolved medium-severity bug identified: combining legacy JSON args with new dotted flags for the same key raises an error instead of merging. This substantive issue requires human review.

^{You can customize Macroscope's approvability policy. Learn more.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 824fc7d6d3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-26T12:07:42Z

+    merged = normalize_sampling_config(
+        merged, "CLI config", merge_sampling_with_existing=True
+    )
+    return normalize_env_config_sections(merged)


Allow mixed legacy and dotted env config flags

merge_direct_eval_config() always runs normalize_env_config_sections() on CLI input, which raises when both legacy --env-args (with config.taskset/config.harness) and new dotted --taskset.*/--harness.* are provided. That makes incremental migration fail with a hard ValueError (for example, -a '{"config":{"taskset":{"split":"train"}}}' --taskset.max-turns 4), even though this change advertises legacy/new flag compatibility; these inputs should merge with deterministic precedence rather than abort.

Useful? React with 👍 / 👎.

no, we shouldn't allow cursed setups

xeophon · 2026-05-28T18:59:24Z

will be done more in-depth with #1473

Add dotted eval config CLI flags

824fc7d

cursor Bot reviewed May 26, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 26, 2026

View reviewed changes

mikasenghaas mentioned this pull request May 26, 2026

Add vf-eval-v1: pydantic-config CLI for v1 taskset/harness envs #1473

Open

xeophon closed this May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add dotted eval config CLI flags#1470

Add dotted eval config CLI flags#1470
xeophon wants to merge 1 commit into
mainfrom
cli-pydantic-config-backcompat

xeophon commented May 26, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 26, 2026

Uh oh!

xeophon May 26, 2026

Uh oh!

cursor Bot May 26, 2026

Uh oh!

macroscopeapp Bot commented May 26, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 26, 2026

Uh oh!

xeophon May 26, 2026

Uh oh!

xeophon commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xeophon commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Examples

Tests

Add dotted CLI flags for eval config sections in verifiers/scripts/eval.py

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 26, 2026

Choose a reason for hiding this comment

Legacy JSON args overlap failure

Uh oh!

xeophon May 26, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 26, 2026

Choose a reason for hiding this comment

Eval skills not updated

Uh oh!

macroscopeapp Bot commented May 26, 2026

Approvability

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

xeophon May 26, 2026

Choose a reason for hiding this comment

Uh oh!

xeophon commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xeophon commented May 26, 2026 •

edited

Loading

Add dotted CLI flags for eval config sections in `verifiers/scripts/eval.py`