
Stage standalone tasksets and harnesses packages with hardened v1 program configs #1475

Merged
willccbb merged 18 commits into main from codex/tasksets-harnesses-packages
May 29, 2026

Conversation

@willccbb
Member

@willccbb willccbb commented May 27, 2026

Summary

  • Extract reusable v1 tasksets and harnesses into standalone top-level tasksets and harnesses packages at 0.1.0.post0, with package build/publish scaffolds and optional dependency wiring from verifiers.
  • Harden v1 runtime boundaries around typed ModelConfig, ProgramConfig, SandboxConfig, UserConfig, ToolsetConfig, and split-aware Taskset.load_tasks(...), while keeping vf.Harness(model=..., client=..., sampling_args=...) as the narrow standalone shortcut.
  • Rework Harbor, OpenEnv, OpenReward, TextArena, OpenCode, Pi, MiniSWEAgent, RLM, and Terminus2 to use self-contained taskset/harness classes, rollout-scoped tool provisioning via State.add_tool, and task-level toolset show/hide controls.
  • Update examples, init templates, docs, Semgrep policy, and CI workflows to use the new package/config surfaces.
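
To make the split-aware `Taskset.load_tasks(...)` change concrete, here is a minimal sketch of a single loader that serves both splits. `ToyTaskset` and its fields are hypothetical stand-ins written for this illustration, not the classes in `verifiers` or `tasksets`, and the two split names are assumed from the `"train"` / `"eval"` values used elsewhere in this PR.

```python
from dataclasses import dataclass, field
from typing import Literal

# Assumption: the split type is a two-value literal ("train" and "eval").
TaskSplit = Literal["train", "eval"]


@dataclass
class ToyTaskset:
    """Hypothetical stand-in for a Taskset; not the verifiers class."""

    train_rows: list[dict] = field(default_factory=list)
    eval_rows: list[dict] = field(default_factory=list)

    def load_tasks(self, split: TaskSplit = "train") -> list[dict]:
        # One loader serves both splits, so no separate eval loader is needed.
        if split == "train":
            return list(self.train_rows)
        if split == "eval":
            return list(self.eval_rows)
        raise ValueError(f"unknown split: {split!r}")


taskset = ToyTaskset(
    train_rows=[{"prompt": "2 + 2", "answer": "4"}],
    eval_rows=[{"prompt": "3 + 3", "answer": "6"}],
)
train_tasks = taskset.load_tasks()  # defaults to the train split
eval_tasks = taskset.load_tasks(split="eval")
```

Callers that previously used a dedicated eval loader would pass `split="eval"` instead. How the real classes store and materialize rows is not shown here.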

Testing

  • uv run pytest tests/test_v1_runtime_lifecycle.py tests/test_v1_config_extension.py tests/test_v1_harbor_cli.py tests/test_v1_mini_swe_agent.py tests/test_v1_openenv_taskset.py tests/test_v1_openreward_taskset.py tests/test_v1_textarena_taskset.py tests/test_v1_rlm_swe.py tests/test_init_script.py -q
  • uv run ty check verifiers/v1 packages/tasksets/tasksets packages/harnesses/harnesses
  • uv run ruff check verifiers/v1 packages/tasksets/tasksets packages/harnesses/harnesses tests/test_v1_runtime_lifecycle.py tests/test_v1_config_extension.py tests/test_v1_harbor_cli.py tests/test_v1_mini_swe_agent.py tests/test_v1_openenv_taskset.py tests/test_v1_openreward_taskset.py tests/test_v1_textarena_taskset.py tests/test_v1_rlm_swe.py tests/test_init_script.py
  • env PYTHONWARNINGS=ignore::SyntaxWarning uv run --no-dev --group policy semgrep --metrics=off --disable-version-check --config .semgrep/verifiers.yml --error --quiet
  • uv build packages/tasksets
  • uv build packages/harnesses

Note

High Risk
Large breaking surface across imports, loaders, endpoint/config types, and many environment packages, plus new release workflows tied to tag/version alignment in package pyproject.toml files.

Overview
This PR splits reusable v1 tasksets and harnesses out of the monorepo into first-class PyPI packages under packages/tasksets and packages/harnesses, with new tag-driven publish workflows and verifiers[tasksets], verifiers[harnesses], and verifiers[packages] extras. Imports move from verifiers.v1.packages.* to top-level tasksets / harnesses.

v1 API and docs are aligned around stricter config and loading: ProgramConfig (and harness-specific subclasses) own program resolution; task data uses load_tasks(split="train" | "eval") instead of separate eval loaders; judges and agents use state.get_client / typed EndpointConfig instead of dict endpoints and raw API keys; toolsets/MCP/users are wired via load_toolsets, SandboxConfig, and User subclasses rather than string refs and ad-hoc __init__ overrides.
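
As a rough illustration of a config object owning program resolution, the sketch below resolves the `{'fn': '<import_ref>'}` dict form named in this PR's notes into a callable. Everything here is an assumption made for the example: `ToyProgramConfig`, `from_dict`, and the `module:attribute` reference syntax are hypothetical and are not taken from the `verifiers` source.

```python
import importlib
from dataclasses import dataclass
from typing import Any, Callable


def import_ref(ref: str) -> Any:
    """Resolve a 'package.module:attribute' string (assumed syntax) to an object."""
    module_name, _, attr = ref.partition(":")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:attr', got {ref!r}")
    return getattr(importlib.import_module(module_name), attr)


@dataclass(frozen=True)
class ToyProgramConfig:
    """Hypothetical stand-in for ProgramConfig; not the verifiers class."""

    fn: str

    @classmethod
    def from_dict(cls, raw: dict) -> "ToyProgramConfig":
        # Accept only a plain dict carrying a string 'fn' entry.
        if type(raw) is not dict or not isinstance(raw.get("fn"), str):
            raise TypeError("program must be a dict like {'fn': '<import_ref>'}")
        return cls(fn=raw["fn"])

    def resolve(self) -> Callable[..., Any]:
        # Resolution happens in one place, so callers never import by hand.
        target = import_ref(self.fn)
        if not callable(target):
            raise TypeError(f"{self.fn!r} does not name a callable")
        return target


# A stdlib function stands in for a real harness program.
program = ToyProgramConfig.from_dict({"fn": "json:dumps"}).resolve()
```

Keeping resolution inside the config means a bare string or callable passed where a program is expected fails early with a `TypeError` rather than at rollout time.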

Examples and policy follow the new shape. Harbor/OpenEnv/RLM/BFCL and hello-* v1 envs depend on the packages, OpenEnv drops OpenEnvEnv for OpenEnvTaskset, and Semgrep/CI/local docs extend type checking and linting to the new package trees while discouraging Mapping escape hatches and custom Taskset/Harness constructors.
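
One way to picture the stance against Mapping escape hatches is a validator that accepts `dict` but refuses other `Mapping` implementations. The helper name `require_plain_dict` is hypothetical, and whether the real validators also accept `dict` subclasses is an assumption of this sketch.

```python
from collections.abc import Mapping
from types import MappingProxyType


def require_plain_dict(value: object, name: str) -> dict:
    """Accept dict (and, by assumption, dict subclasses); reject other Mappings."""
    if isinstance(value, dict):
        return value
    if isinstance(value, Mapping):
        # Mapping-like objects that are not dicts are refused explicitly.
        raise TypeError(
            f"{name} must be a dict, got Mapping type {type(value).__name__}"
        )
    raise TypeError(f"{name} must be a dict, got {type(value).__name__}")


accepted = require_plain_dict({"model": "example"}, "bindings")

try:
    require_plain_dict(MappingProxyType({"model": "example"}), "bindings")
    rejected = False
except TypeError:
    rejected = True
```

A caller holding a read-only or custom mapping would convert it with `dict(value)` before passing it in, which keeps serialization and validation working on one concrete type.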

Reviewed by Cursor Bugbot for commit 35418a8. Bugbot is set up for automated code reviews on this repo.

Note

Extract standalone tasksets and harnesses packages with structured ProgramConfig

  • Moves harness and taskset implementations out of verifiers.v1.packages into standalone installable packages at packages/harnesses and packages/tasksets, each with its own pyproject.toml and publish CI workflow.
  • Introduces ProgramConfig as the standard way to declare harness programs; direct callable or string references on harness/program fields are replaced by {'fn': '<import_ref>'} dicts or typed subclasses with a resolve() method.
  • Adds structured config models throughout: ModelConfig, SandboxConfig, VisibilityConfig, ArtifactsConfig, BindingsConfig, ObjectsConfig, UserConfig, SystemPromptConfig, and EndpointConfig (promoted from TypedDict to pydantic BaseModel).
  • Taskset.load_tasks now requires a split: TaskSplit = 'train' parameter; Env construction now always requires a Harness; User is now a class with get_response instead of a function reference.
  • Tool calls within a single model step are now executed concurrently via asyncio.gather in base_program.
  • Risk: Many public API surfaces changed — ConfigMap/MutableConfigMap/TaskRow/GroupHandler removed from exports; non-dict Mapping inputs rejected across validators, bindings, scoring, and serialization utilities; load_eval_tasks replaced by load_tasks(split='eval'); programs must use the dict/ProgramConfig form.
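
The concurrent tool execution noted above can be sketched with plain `asyncio`. The `call_tool` and `run_step` functions and the tool names are hypothetical. This shows only the general `asyncio.gather` pattern, not the code in `base_program`.

```python
import asyncio
import time


async def call_tool(name: str, delay: float) -> str:
    """Hypothetical tool call; sleeps to simulate I/O latency."""
    await asyncio.sleep(delay)
    return f"{name}:done"


async def run_step(calls: list[tuple[str, float]]) -> list[str]:
    # gather starts every call together and returns results in input order,
    # regardless of which call finishes first.
    return await asyncio.gather(*(call_tool(n, d) for n, d in calls))


start = time.perf_counter()
results = asyncio.run(run_step([("search", 0.2), ("read_file", 0.1)]))
elapsed = time.perf_counter() - start
```

The results come back in call order even though `read_file` finishes first, and the elapsed time should be close to the slowest call rather than the sum of both. By default the first exception raised by any call propagates out of `gather`; passing `return_exceptions=True` collects exceptions as results instead. Which behavior the runtime relies on is not stated in this PR.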

Macroscope summarized 35418a8.

@macroscopeapp

macroscopeapp Bot commented May 27, 2026

Approvability

Verdict: Needs human review

Unable to check for correctness in 35418a8. Diff is too large for automated approval analysis. A human reviewer should evaluate this PR.


Comment thread packages/harnesses/harnesses/rlm.py Outdated
Comment thread environments/wordle_v1/pyproject.toml Outdated
@willccbb willccbb requested review from mikasenghaas and xeophon May 27, 2026 08:09
xeophon
xeophon previously approved these changes May 27, 2026
@willccbb willccbb force-pushed the codex/tasksets-harnesses-packages branch from d212d29 to 17851f5 Compare May 27, 2026 08:46
Comment thread packages/harnesses/harnesses/opencode.py Outdated
Comment thread environments/bfcl_v3/bfcl_v3.py Outdated
Comment thread docs/overview.md
Comment thread pyproject.toml
Comment thread verifiers/scripts/init.py
Comment thread environments/openenv_echo/openenv_echo.py Outdated
Comment thread packages/tasksets/tasksets/openreward.py Outdated
Comment thread packages/tasksets/tasksets/harbor.py Outdated
Comment thread packages/tasksets/tasksets/openreward.py Outdated
@xeophon xeophon self-requested a review May 28, 2026 09:23
xeophon
xeophon previously approved these changes May 28, 2026
Comment thread packages/harnesses/harnesses/rlm.py Outdated
Comment thread packages/tasksets/tasksets/harbor.py Outdated
Comment thread packages/harnesses/harnesses/opencode.py Outdated
Comment thread packages/harnesses/harnesses/rlm.py Outdated
Comment thread packages/tasksets/tasksets/openenv.py Outdated
Comment thread packages/tasksets/pyproject.toml
Comment thread environments/tau2_bench_v1/tau2_bench_v1.py Outdated
Comment thread environments/opencode_harbor/pyproject.toml Outdated
Comment thread environments/reverse_text/reverse_text_v1.py Outdated
Comment thread .github/workflows/publish-harnesses.yml Outdated
@willccbb willccbb force-pushed the codex/tasksets-harnesses-packages branch from e6f0103 to f9a5f58 Compare May 29, 2026 18:30

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).


Reviewed by Cursor Bugbot for commit f9a5f58.

Comment thread environments/hello_subagent_v1/hello_subagent_v1.py Outdated
@willccbb willccbb force-pushed the codex/tasksets-harnesses-packages branch from f9a5f58 to 35418a8 Compare May 29, 2026 19:59
Member

@mikasenghaas mikasenghaas left a comment


SHIPIT

@willccbb willccbb merged commit 4f19876 into main May 29, 2026
14 of 15 checks passed
@willccbb willccbb deleted the codex/tasksets-harnesses-packages branch May 29, 2026 20:17

3 participants