Stage standalone tasksets and harnesses packages with hardened v1 program configs by willccbb · Pull Request #1475 · PrimeIntellect-ai/verifiers

willccbb · 2026-05-27T07:27:48Z

Summary

Extract reusable v1 tasksets and harnesses into standalone top-level tasksets and harnesses packages at 0.1.0.post0, with package build/publish scaffolds and optional dependency wiring from verifiers.
Harden v1 runtime boundaries around typed ModelConfig, ProgramConfig, SandboxConfig, UserConfig, ToolsetConfig, and split-aware Taskset.load_tasks(...), while keeping vf.Harness(model=..., client=..., sampling_args=...) as the narrow standalone shortcut.
Rework Harbor, OpenEnv, OpenReward, TextArena, OpenCode, Pi, MiniSWEAgent, RLM, and Terminus2 to use self-contained taskset/harness classes, rollout-scoped tool provisioning via State.add_tool, and task-level toolset show/hide controls.
Update examples, init templates, docs, Semgrep policy, and CI workflows to use the new package/config surfaces.
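As a rough illustration of what a typed program config buys over a bare string or callable, the sketch below validates an import reference when the config is built and resolves it on demand. `ToyProgramConfig` is a hypothetical stand-in built on a stdlib dataclass, and the `module:attr` reference format is an assumption; the real ProgramConfig in this PR may be shaped differently.

```python
import importlib
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToyProgramConfig:
    """Hypothetical stand-in, not the verifiers ProgramConfig."""

    fn: str  # import reference, assumed here to look like "module:attr"

    def __post_init__(self) -> None:
        # Reject malformed references at construction time.
        module, sep, attr = self.fn.partition(":")
        if not (module and sep and attr):
            raise ValueError(f"expected 'module:attr', got {self.fn!r}")

    def resolve(self) -> Callable[..., Any]:
        # Import the module lazily and return the referenced callable.
        module_name, _, attr = self.fn.partition(":")
        target = getattr(importlib.import_module(module_name), attr)
        if not callable(target):
            raise TypeError(f"{self.fn!r} does not point to a callable")
        return target


# Build the config from a dict with an 'fn' key, then resolve and call it.
config = ToyProgramConfig(**{"fn": "math:sqrt"})
program = config.resolve()
print(program(9.0))  # 3.0
```

Validating in the constructor means a typo in the reference surfaces when the config is loaded rather than midway through a rollout.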

Testing

uv run pytest tests/test_v1_runtime_lifecycle.py tests/test_v1_config_extension.py tests/test_v1_harbor_cli.py tests/test_v1_mini_swe_agent.py tests/test_v1_openenv_taskset.py tests/test_v1_openreward_taskset.py tests/test_v1_textarena_taskset.py tests/test_v1_rlm_swe.py tests/test_init_script.py -q
uv run ty check verifiers/v1 packages/tasksets/tasksets packages/harnesses/harnesses
uv run ruff check verifiers/v1 packages/tasksets/tasksets packages/harnesses/harnesses tests/test_v1_runtime_lifecycle.py tests/test_v1_config_extension.py tests/test_v1_harbor_cli.py tests/test_v1_mini_swe_agent.py tests/test_v1_openenv_taskset.py tests/test_v1_openreward_taskset.py tests/test_v1_textarena_taskset.py tests/test_v1_rlm_swe.py tests/test_init_script.py
env PYTHONWARNINGS=ignore::SyntaxWarning uv run --no-dev --group policy semgrep --metrics=off --disable-version-check --config .semgrep/verifiers.yml --error --quiet
uv build packages/tasksets
uv build packages/harnesses

Note

High Risk
Large breaking surface across imports, loaders, endpoint/config types, and many environment packages, plus new release workflows tied to tag/version alignment in package pyproject.toml files.

Overview
This PR splits reusable v1 tasksets and harnesses out of the monorepo into first-class PyPI packages under packages/tasksets and packages/harnesses, with new tag-driven publish workflows and verifiers[tasksets], verifiers[harnesses], and verifiers[packages] extras. Imports move from verifiers.v1.packages.* to top-level tasksets / harnesses.
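Downstream code that has to run against both layouts during a migration could use a fallback import. The sketch below is not part of this PR: `import_first` is a hypothetical helper, and the legacy path `verifiers.v1.packages.tasksets` is an assumption derived from the `verifiers.v1.packages.*` prefix named above.

```python
import importlib
from types import ModuleType


def import_first(*names: str) -> ModuleType:
    """Return the first module in `names` that can be imported."""
    errors: list[str] = []
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


# Intended use (both module paths are assumptions):
#   tasksets = import_first("tasksets", "verifiers.v1.packages.tasksets")
# Demonstrated with stdlib modules so the snippet runs without verifiers installed.
json_mod = import_first("module_that_does_not_exist_xyz", "json")
print(json_mod.__name__)  # json
```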

The v1 API and docs are aligned around stricter config and loading. ProgramConfig (and harness-specific subclasses) own program resolution. Task data uses load_tasks(split="train" | "eval") instead of separate eval loaders. Judges and agents use state.get_client / typed EndpointConfig instead of dict endpoints and raw API keys. Toolsets, MCP, and users are wired via load_toolsets, SandboxConfig, and User subclasses rather than string refs and ad-hoc __init__ overrides.
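The move from separate eval loaders to a single split-aware loader can be pictured with a toy class. This is a minimal sketch, not the verifiers implementation: `ToyTaskset` and its fields are hypothetical, and defining `TaskSplit` as a `Literal` of the two split names is an assumption based on the `split="train" | "eval"` wording above.

```python
from dataclasses import dataclass, field
from typing import Literal, get_args

# Assumed shape of the split type; the real TaskSplit may be defined differently.
TaskSplit = Literal["train", "eval"]


@dataclass
class ToyTaskset:
    """Hypothetical stand-in for a taskset with two splits."""

    train_rows: list[dict] = field(default_factory=list)
    eval_rows: list[dict] = field(default_factory=list)

    def load_tasks(self, split: TaskSplit = "train") -> list[dict]:
        # One entry point for both splits replaces a separate eval loader.
        if split not in get_args(TaskSplit):
            raise ValueError(f"unknown split: {split!r}")
        rows = self.train_rows if split == "train" else self.eval_rows
        return list(rows)  # return a copy so callers cannot mutate the source


taskset = ToyTaskset(
    train_rows=[{"prompt": "2+2", "answer": "4"}],
    eval_rows=[{"prompt": "3+3", "answer": "6"}],
)
print(taskset.load_tasks())              # [{'prompt': '2+2', 'answer': '4'}]
print(taskset.load_tasks(split="eval"))  # [{'prompt': '3+3', 'answer': '6'}]
```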

Examples and policy follow the new shape: Harbor/OpenEnv/RLM/BFCL and hello-* v1 envs depend on the packages, OpenEnv drops OpenEnvEnv for OpenEnvTaskset, and Semgrep/CI/local docs expand typing and lint to the new package trees while discouraging Mapping escape hatches and custom Taskset/Harness constructors.
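One way to read "discouraging Mapping escape hatches" is that inputs must be plain dicts rather than arbitrary `Mapping` objects. The sketch below shows such a check in isolation. `require_dict` is a hypothetical helper, and whether the PR uses an `isinstance` test or an exact type test is not stated here, so treat the condition as an assumption.

```python
from types import MappingProxyType
from typing import Any


def require_dict(value: Any, *, name: str = "value") -> dict[str, Any]:
    """Accept a dict (or dict subclass) and reject every other Mapping type."""
    if not isinstance(value, dict):
        raise TypeError(f"{name} must be a dict, got {type(value).__name__}")
    return value


print(require_dict({"temperature": 0.2}))  # {'temperature': 0.2}

# A read-only mapping proxy is a Mapping but not a dict, so it is rejected.
try:
    require_dict(MappingProxyType({"temperature": 0.2}), name="sampling_args")
except TypeError as exc:
    print(exc)  # sampling_args must be a dict, got mappingproxy
```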

Reviewed by Cursor Bugbot for commit 35418a8. Bugbot is set up for automated code reviews on this repo.

Note

Extract standalone `tasksets` and `harnesses` packages with structured `ProgramConfig`

Moves harness and taskset implementations out of verifiers.v1.packages into standalone installable packages at packages/harnesses and packages/tasksets, each with their own pyproject.toml and publish CI workflows.
Introduces ProgramConfig as the standard way to declare harness programs; direct callable or string references on harness/program fields are replaced by {'fn': '<import_ref>'} dicts or typed subclasses with a resolve() method.
Adds structured config models throughout: ModelConfig, SandboxConfig, VisibilityConfig, ArtifactsConfig, BindingsConfig, ObjectsConfig, UserConfig, SystemPromptConfig, and EndpointConfig (promoted from TypedDict to pydantic BaseModel).
Taskset.load_tasks now takes a split: TaskSplit = 'train' parameter; Env construction now always requires a Harness; User is now a class with a get_response method instead of a function reference.
Tool calls within a single model step are now executed concurrently via asyncio.gather in base_program.
Risk: many public API surfaces changed. ConfigMap/MutableConfigMap/TaskRow/GroupHandler are removed from exports; non-dict Mapping inputs are rejected across validators, bindings, scoring, and serialization utilities; load_eval_tasks is replaced by load_tasks(split='eval'); programs must use the dict/ProgramConfig form.
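The concurrent tool execution mentioned above follows the usual `asyncio.gather` pattern. The sketch below shows only that pattern: the tool functions, the call-dict shape, and `run_tool_calls` are hypothetical, and the actual `base_program` code is not reproduced here.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolFn = Callable[..., Awaitable[Any]]


async def add(a: int, b: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for real tool latency
    return a + b


async def shout(text: str) -> str:
    await asyncio.sleep(0.01)
    return text.upper()


async def run_tool_calls(tools: dict[str, ToolFn], calls: list[dict[str, Any]]) -> list[Any]:
    """Run all tool calls from one model step concurrently, keeping call order."""
    # Look up every tool first so an unknown name fails before any coroutine exists.
    fns = [tools[call["name"]] for call in calls]
    pending = [fn(**call["arguments"]) for fn, call in zip(fns, calls)]
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*pending)


calls = [
    {"name": "add", "arguments": {"a": 1, "b": 2}},
    {"name": "shout", "arguments": {"text": "ok"}},
]
results = asyncio.run(run_tool_calls({"add": add, "shout": shout}, calls))
print(results)  # [3, 'OK']
```

Results come back in call order regardless of which tool finishes first. Tools that share mutable state may interleave under this pattern, which is worth checking when reviewing tools that previously ran one at a time.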

Macroscope summarized 35418a8.

macroscopeapp · 2026-05-27T07:28:25Z

Approvability

Verdict: Needs human review

Unable to check for correctness in 35418a8. Diff is too large for automated approval analysis. A human reviewer should evaluate this PR.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit f9a5f58.

mikasenghaas

SHIPIT

cursor Bot reviewed May 27, 2026

Comment thread packages/harnesses/harnesses/rlm.py Outdated

Comment thread environments/wordle_v1/pyproject.toml Outdated

willccbb requested review from mikasenghaas and xeophon May 27, 2026 08:09

xeophon previously approved these changes May 27, 2026

willccbb dismissed xeophon’s stale review via d212d29 May 27, 2026 08:42

willccbb force-pushed the codex/tasksets-harnesses-packages branch from d212d29 to 17851f5 Compare May 27, 2026 08:46

cursor Bot reviewed May 27, 2026

Comment thread packages/harnesses/harnesses/opencode.py Outdated

Comment thread environments/bfcl_v3/bfcl_v3.py Outdated

cursor Bot reviewed May 27, 2026

Comment thread docs/overview.md

cursor Bot reviewed May 27, 2026

Comment thread pyproject.toml

cursor Bot reviewed May 27, 2026

Comment thread verifiers/scripts/init.py

Comment thread environments/openenv_echo/openenv_echo.py Outdated

macroscopeapp Bot reviewed May 28, 2026

Comment thread packages/tasksets/tasksets/openreward.py Outdated

Comment thread packages/tasksets/tasksets/harbor.py Outdated

cursor Bot reviewed May 28, 2026

Comment thread packages/tasksets/tasksets/openreward.py Outdated

xeophon self-requested a review May 28, 2026 09:23

xeophon previously approved these changes May 28, 2026

willccbb dismissed xeophon’s stale review via 3591543 May 28, 2026 21:38

macroscopeapp Bot reviewed May 28, 2026

Comment thread packages/harnesses/harnesses/rlm.py Outdated

macroscopeapp Bot reviewed May 28, 2026

Comment thread packages/tasksets/tasksets/harbor.py Outdated

Comment thread packages/harnesses/harnesses/opencode.py Outdated

Comment thread packages/harnesses/harnesses/rlm.py Outdated

cursor Bot reviewed May 28, 2026

Comment thread packages/tasksets/tasksets/openenv.py Outdated

Comment thread packages/tasksets/pyproject.toml