Stage standalone tasksets and harnesses packages with hardened v1 program configs #1475
Merged
Macroscope approvability verdict: Needs human review. Unable to check for correctness in 35418a8. Diff is too large for automated approval analysis. A human reviewer should evaluate this PR.
xeophon previously approved these changes on May 27, 2026
d212d29 to 17851f5
xeophon previously approved these changes on May 28, 2026
e6f0103 to f9a5f58
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit f9a5f58.
f9a5f58 to 35418a8

Summary
- `tasksets` and `harnesses` packages at `0.1.0.post0`, with package build/publish scaffolds and optional dependency wiring from `verifiers`.
- `ModelConfig`, `ProgramConfig`, `SandboxConfig`, `UserConfig`, `ToolsetConfig`, and split-aware `Taskset.load_tasks(...)`, while keeping `vf.Harness(model=..., client=..., sampling_args=...)` as the narrow standalone shortcut.
- `State.add_tool`, and task-level toolset show/hide controls.

Testing
- `uv run pytest tests/test_v1_runtime_lifecycle.py tests/test_v1_config_extension.py tests/test_v1_harbor_cli.py tests/test_v1_mini_swe_agent.py tests/test_v1_openenv_taskset.py tests/test_v1_openreward_taskset.py tests/test_v1_textarena_taskset.py tests/test_v1_rlm_swe.py tests/test_init_script.py -q`
- `uv run ty check verifiers/v1 packages/tasksets/tasksets packages/harnesses/harnesses`
- `uv run ruff check verifiers/v1 packages/tasksets/tasksets packages/harnesses/harnesses tests/test_v1_runtime_lifecycle.py tests/test_v1_config_extension.py tests/test_v1_harbor_cli.py tests/test_v1_mini_swe_agent.py tests/test_v1_openenv_taskset.py tests/test_v1_openreward_taskset.py tests/test_v1_textarena_taskset.py tests/test_v1_rlm_swe.py tests/test_init_script.py`
- `env PYTHONWARNINGS=ignore::SyntaxWarning uv run --no-dev --group policy semgrep --metrics=off --disable-version-check --config .semgrep/verifiers.yml --error --quiet`
- `uv build packages/tasksets`
- `uv build packages/harnesses`

Note
High Risk
Large breaking surface across imports, loaders, endpoint/config types, and many environment packages, plus new release workflows tied to tag/version alignment in package `pyproject.toml` files.

Overview
This PR splits reusable v1 tasksets and harnesses out of the monorepo into first-class PyPI packages under `packages/tasksets` and `packages/harnesses`, with new tag-driven publish workflows and `verifiers[tasksets]`, `verifiers[harnesses]`, and `verifiers[packages]` extras. Imports move from `verifiers.v1.packages.*` to top-level `tasksets`/`harnesses`.

v1 API and docs are aligned around stricter config and loading: `ProgramConfig` (and harness-specific subclasses) own program resolution; task data uses `load_tasks(split="train" | "eval")` instead of separate eval loaders; judges and agents use `state.get_client` / typed `EndpointConfig` instead of dict endpoints and raw API keys; toolsets/MCP/users are wired via `load_toolsets`, `SandboxConfig`, and `User` subclasses rather than string refs and ad-hoc `__init__` overrides.

Examples and policy follow the new shape: Harbor/OpenEnv/RLM/BFCL and hello-* v1 envs depend on the packages, OpenEnv drops `OpenEnvEnv` for `OpenEnvTaskset`, and Semgrep/CI/local docs expand typing and lint to the new package trees while discouraging `Mapping` escape hatches and custom `Taskset`/`Harness` constructors.

Reviewed by Cursor Bugbot for commit 35418a8. Bugbot is set up for automated code reviews on this repo.
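To make the loader change concrete, here is a minimal sketch of a single split-aware entry point replacing a separate eval loader. `ToyTaskset` and its row fields are hypothetical stand-ins written for illustration; they are not the real `Taskset` class, and the actual `TaskSplit` definition is not shown in this thread, so the `Literal` alias below is an assumption.

```python
from dataclasses import dataclass, field
from typing import Literal

# Assumed shape of the split type; the real TaskSplit may be defined differently.
TaskSplit = Literal["train", "eval"]


@dataclass
class ToyTaskset:
    """Illustrative stand-in, not the real Taskset class."""

    train_rows: list[dict] = field(default_factory=list)
    eval_rows: list[dict] = field(default_factory=list)

    def load_tasks(self, split: TaskSplit = "train") -> list[dict]:
        # One entry point selects the split, instead of a separate eval loader.
        if split == "train":
            return list(self.train_rows)
        if split == "eval":
            return list(self.eval_rows)
        raise ValueError(f"unknown split: {split!r}")


taskset = ToyTaskset(
    train_rows=[{"prompt": "2 + 2", "answer": "4"}],
    eval_rows=[{"prompt": "3 + 3", "answer": "6"}],
)
train_tasks = taskset.load_tasks()              # defaults to the train split
eval_tasks = taskset.load_tasks(split="eval")   # replaces a dedicated eval loader
```

Callers that previously used a dedicated eval loader would, under this shape, pass `split="eval"` to the same method.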
Note
Extract standalone `tasksets` and `harnesses` packages with structured `ProgramConfig`
- `verifiers.v1.packages` into standalone installable packages at `packages/harnesses` and `packages/tasksets`, each with their own `pyproject.toml` and publish CI workflows.
- `ProgramConfig` as the standard way to declare harness programs; direct callable or string references on harness/program fields are replaced by `{'fn': '<import_ref>'}` dicts or typed subclasses with a `resolve()` method.
- `ModelConfig`, `SandboxConfig`, `VisibilityConfig`, `ArtifactsConfig`, `BindingsConfig`, `ObjectsConfig`, `UserConfig`, `SystemPromptConfig`, and `EndpointConfig` (promoted from TypedDict to pydantic `BaseModel`).
- `Taskset.load_tasks` now requires a `split: TaskSplit = 'train'` parameter; `Env` construction now always requires a `Harness`; `User` is now a class with `get_response` instead of a function reference.
- `asyncio.gather` in `base_program`.
- `ConfigMap`/`MutableConfigMap`/`TaskRow`/`GroupHandler` removed from exports; non-dict `Mapping` inputs rejected across validators, bindings, scoring, and serialization utilities; `load_eval_tasks` replaced by `load_tasks(split='eval')`; programs must use the dict/`ProgramConfig` form.

Macroscope summarized 35418a8.
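As a rough illustration of the `{'fn': '<import_ref>'}` form and the non-dict `Mapping` rejection mentioned in the summary, the sketch below resolves such a dict to a callable using only the standard library. `resolve_program` is a hypothetical helper, and the `module.path:attribute` reference format is an assumption; the real `ProgramConfig.resolve()` may use a different syntax and different error handling.

```python
import importlib
from types import MappingProxyType
from typing import Any, Callable


def resolve_program(config: Any) -> Callable[..., Any]:
    """Resolve a {'fn': '<import_ref>'} dict to a callable (illustrative only)."""
    # Accept dicts only; other Mapping implementations are rejected.
    if not isinstance(config, dict):
        raise TypeError(
            f"program config must be a plain dict, got {type(config).__name__}"
        )
    ref = config.get("fn")
    # Assumed reference format: "module.path:attribute".
    if not isinstance(ref, str) or ":" not in ref:
        raise ValueError("expected {'fn': 'module.path:attribute'}")
    module_name, _, attr_path = ref.partition(":")
    target: Any = importlib.import_module(module_name)
    for part in attr_path.split("."):
        target = getattr(target, part)
    if not callable(target):
        raise TypeError(f"{ref!r} does not resolve to a callable")
    return target


# A stdlib function stands in for a harness program here.
program = resolve_program({"fn": "json:dumps"})

# A read-only Mapping that is not a dict is refused.
rejected = None
try:
    resolve_program(MappingProxyType({"fn": "json:dumps"}))
except TypeError as exc:
    rejected = str(exc)
```

Checking `isinstance(config, dict)` rather than `Mapping` keeps the accepted input surface narrow, which matches the stated direction of the PR, though the real validators may draw the line differently (for example, around dict subclasses).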