APR-262 Add Harbor verifiers support for v1 by xeophon · Pull Request #1401 · PrimeIntellect-ai/verifiers

xeophon · 2026-05-17T13:41:54Z

Summary

- Add Harbor v1 support for resolving [verifier].environment_mode and [verifier.environment], including shared/separate validation and separate-mode sandbox config.
- Run separate Harbor verifiers in a fresh sandbox, reusing the shared test runner and archive upload helpers; task tests are uploaded only for the environment_mode = "separate" fallback without [verifier.environment].
- Transfer /logs/artifacts plus configured Harbor artifacts into verifier sandboxes, including explicit /logs/artifacts entries with exclude patterns and source-path handling for artifact destination mappings.
- Align reward parsing and docs with current Harbor behavior by reading reward.json before reward.txt and documenting the v1 Harbor verifier lifecycle.

Verification

uv run --frozen pytest tests/test_v1_harbor_cli.py tests/test_opencode_harbor.py -q
uv run --frozen ruff format .
uv run --frozen ruff check --fix .
uv run --frozen ty check verifiers
uv run --frozen --group policy semgrep --metrics=off --disable-version-check --config .semgrep/verifiers.yml --error --quiet
GitHub Actions passed: Ruff, Semgrep, Ty, CodeQL/analyze, Environments, Verifiers 3.10/3.11/3.12/3.13

Note

Medium Risk
Changes Harbor scoring to create ephemeral verifier sandboxes and move grading inputs across sandboxes; host env substitution for verifier secrets adds operational sensitivity, though behavior is covered by new tests.

Overview
Adds Harbor v1 support for [verifier] settings from task.toml: shared (tests run in the agent sandbox, the unchanged default) vs separate (tests run in a dedicated verifier sandbox). [verifier.environment] implies separate mode. An explicit environment_mode = "separate" without a verifier environment reuses the agent [environment] and sets verifier_upload_tests so that tests/ is uploaded when the image has no baked-in tests. Invalid combinations (e.g. shared + [verifier.environment]) fail at load time.
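
The mode resolution described above can be sketched roughly as follows. This is a minimal illustration, not the code in harbor.py: the function name `resolve_verifier_mode`, the `VERIFIER_MODE_SHARED` constant, the returned dictionary keys, and the error messages are assumptions made for the example. Only `VERIFIER_MODE_SEPARATE` and the `verifier_upload_tests` idea come from the PR.

```python
from typing import Any, Mapping

VERIFIER_MODE_SHARED = "shared"      # assumed constant name
VERIFIER_MODE_SEPARATE = "separate"  # name appears in the PR diff


def resolve_verifier_mode(verifier: Mapping[str, Any]) -> dict[str, Any]:
    """Resolve a parsed [verifier] table into a mode decision (hypothetical helper)."""
    mode = verifier.get("environment_mode")
    environment = verifier.get("environment")
    if mode not in (None, VERIFIER_MODE_SHARED, VERIFIER_MODE_SEPARATE):
        raise ValueError(f"Unknown environment_mode: {mode!r}")
    if environment is not None:
        # [verifier.environment] implies separate mode; shared is a conflict.
        if mode == VERIFIER_MODE_SHARED:
            raise ValueError(
                "environment_mode = 'shared' conflicts with [verifier.environment]"
            )
        mode = VERIFIER_MODE_SEPARATE
    elif mode is None:
        mode = VERIFIER_MODE_SHARED  # unchanged default
    return {
        "mode": mode,
        "environment": environment,
        # Tests are uploaded only for the separate fallback without a verifier image.
        "upload_tests": mode == VERIFIER_MODE_SEPARATE and environment is None,
    }
```

Under these assumptions, `resolve_verifier_mode({"environment_mode": "separate"})` yields separate mode with `upload_tests` set, while a table that declares only `environment` yields separate mode without the upload.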

Separate-mode scoring acquires a lease via create_sandbox_lease, optionally uploads tests, and packs /logs/artifacts plus task artifacts (honoring exclude patterns) from the agent sandbox into a tarball that is extracted on the verifier. It then runs bash test.sh with the resolved [verifier].env (${VAR} / ${VAR:-default}, substituted from the host environment) and reads the reward from reward.json, falling back to reward.txt. Shared mode still uploads tests to the agent sandbox and uses the same reward lookup order.
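
As an illustration of the artifact packing step, the sketch below builds a gzipped tarball from a set of source paths while skipping excluded files. It works on a local directory rather than across sandboxes, and the helper names `pack_artifacts` and `list_members`, the use of `fnmatch` for exclude patterns, and the choice to skip missing sources are all assumptions for the example, not the PR's implementation.

```python
import fnmatch
import io
import tarfile
from pathlib import Path


def pack_artifacts(root: Path, sources: list[str], exclude: list[str]) -> bytes:
    """Pack files under root into a gzipped tarball, skipping excluded paths."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for source in sources:
            base = root / source
            if not base.exists():
                continue  # assumption: missing artifacts are skipped, not an error
            if base.is_file():
                files = [base]
            else:
                files = sorted(p for p in base.rglob("*") if p.is_file())
            for path in files:
                relative = path.relative_to(root).as_posix()
                # fnmatch's "*" also matches "/" so "dir/*" covers nested files.
                if any(fnmatch.fnmatch(relative, pattern) for pattern in exclude):
                    continue
                archive.add(path, arcname=relative)
    return buffer.getvalue()


def list_members(payload: bytes) -> list[str]:
    """Return the sorted member names of a gzipped tarball."""
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as archive:
        return sorted(archive.getnames())
```

Building the archive in memory keeps the sketch self-contained; a real transfer would upload the resulting bytes to the verifier sandbox and extract them there.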

Also extends SandboxConfig / create_sandbox with gpu_type and vm (vm defaults to True when gpu_count > 0), and documents the verifier lifecycle in the BYO Harness docs.

Reviewed by Cursor Bugbot for commit f3f955b.

Note

Add separate verifier sandbox support to Harbor v1 taskset

- Introduces a separate verifier mode in harbor.py: when [verifier.environment] is declared in a task's TOML, a dedicated verifier sandbox is created, inputs/artifacts are transferred from the agent sandbox, tests are run, and the sandbox is deleted on completion.
- Adds harbor_verifier_env to resolve [verifier].env values from host environment variables with ${VAR:-default} template support, raising if required variables are missing.
- Updates HARBOR_REWARD_COMMAND to prefer /logs/verifier/reward.json over /logs/verifier/reward.txt when reading reward output.
- Extends SandboxConfig and create_sandbox to carry gpu_type and vm fields, defaulting vm=True when gpu_count > 0.
- Behavioral change: declaring environment_mode="shared" together with [verifier.environment] now raises ValueError at task-row construction time.

Macroscope summarized f3f955b.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a91f41ae52

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T13:43:40Z

+        if verifier_mode == VERIFIER_MODE_SEPARATE and verifier_environment is None:
+            verifier_sandbox = {**sandbox, "command_timeout": int(test_timeout)}


Upload task tests when reusing the agent image

When a task sets [verifier] environment_mode = "separate" but does not provide [verifier.environment], this branch advertises support by creating a fresh verifier sandbox from the agent sandbox config. That fresh sandbox never receives the task's tests/ directory: run_separate_harbor_verifier only creates an empty /tests directory and transfers /logs/artifacts plus configured artifacts before running bash /tests/test.sh. In this supported mode the script is absent unless the base agent image happens to contain it, so otherwise-valid Harbor tasks always get reward 0.

Harbor PR #1655 does support environment_mode = "separate" without [verifier.environment]; upstream resolves it to a fresh copy of the top-level environment. The difference is that Harbor can build the verifier image from tests/, while this v1 sandbox path cannot. I fixed the fallback by marking that case with verifier_upload_tests and uploading the task tests into the fresh verifier sandbox before running /tests/test.sh, while explicit [verifier.environment] continues to expect a pre-baked verifier image.

macroscopeapp · 2026-05-17T13:49:22Z

Approvability

Verdict: Needs human review

Unable to check for correctness in f3f955b. This PR introduces substantial new Harbor verifier functionality with separate sandbox modes, affecting task execution and reward scoring pipelines. Multiple unresolved P1 review comments identify potential runtime bugs including incorrect working directories and changed reward file precedence.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f02768b6d6

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f23a0066c7

chatgpt-codex-connector · 2026-05-20T09:47:33Z

+    "if [ -s /logs/verifier/reward.json ]; then "
+    "cat /logs/verifier/reward.json; "
+    "elif [ -s /logs/verifier/reward.txt ]; then "
+    "cat /logs/verifier/reward.txt; fi"


Restore reward.txt precedence in Harbor reward lookup

When both /logs/verifier/reward.txt and /logs/verifier/reward.json are present, this command now reads JSON first, which can score from stale/diagnostic JSON or return 0 for invalid JSON even if reward.txt contains the valid final score. That changes previously working Harbor tasks and diverges from the existing Harbor implementations in this repo that prioritize reward.txt first. Fresh evidence in this commit: HARBOR_REWARD_COMMAND checks reward.json before reward.txt, and the new test test_harbor_reward_command_prefers_json_reward explicitly locks in that order.
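
To make the ordering under discussion concrete, here is a small Python sketch that roughly mirrors the shell command's file selection: it returns the contents of the first non-empty reward file, checking reward.json before reward.txt. The function name `read_reward_output` and the `REWARD_FILES` tuple are hypothetical, and the sketch deliberately does not parse the contents, since the reward schema is not shown in this thread.

```python
from pathlib import Path

# JSON first, matching the order in the command above; swapping the tuple
# gives the reward.txt-first behavior this comment asks for.
REWARD_FILES = ("reward.json", "reward.txt")


def read_reward_output(verifier_dir: Path) -> str:
    """Return the contents of the first non-empty reward file, or "" if none exists."""
    for name in REWARD_FILES:
        path = verifier_dir / name
        # Roughly mirrors `[ -s path ]`: the file must exist and be non-empty.
        if path.is_file() and path.stat().st_size > 0:
            return path.read_text()
    return ""
```

With both files present and non-empty, this returns the JSON contents, which is exactly the case the comment is concerned about; an empty reward.json falls through to reward.txt.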

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f23a006.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9daf4b2aee

chatgpt-codex-connector · 2026-05-20T13:06:48Z

+            lease.id,
+            state,
+            command="bash /tests/test.sh",
+            working_dir=None,


Run separate Harbor tests from /tests working directory

In separate verifier mode, the test runner now invokes bash /tests/test.sh with working_dir=None, which changes execution semantics versus the existing Harbor path (bash test.sh in /tests). Any verifier script that uses relative paths (for example ./fixtures/... or python helper.py) will resolve against the sandbox default workdir (often /app or image WORKDIR) instead of /tests, causing false failures and reward 0 in otherwise valid tasks. This is specific to verifier_mode = "separate" and is introduced by the new separate-sandbox path.
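
The effect of the working directory on relative paths can be reproduced locally. The sketch below is only an analogy: it runs a script by absolute path with `subprocess.run` and an explicit `cwd`, using a Python script as a stand-in for test.sh. The helper name `run_script` is hypothetical and does not correspond to the sandbox runner in the PR.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def run_script(script: Path, working_dir: Optional[Path]) -> int:
    """Run a script by absolute path and return its exit code.

    With working_dir=None the child inherits the caller's working directory,
    so relative paths inside the script resolve against that directory rather
    than the directory that contains the script.
    """
    result = subprocess.run(
        [sys.executable, str(script)],
        cwd=working_dir,
        capture_output=True,  # keep the child's traceback out of our output
    )
    return result.returncode
```

A script that opens `fixtures/data.txt` succeeds when `working_dir` is the directory holding `fixtures/` and fails otherwise, even though the script itself is addressed by absolute path in both cases.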

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a063a69bff

chatgpt-codex-connector · 2026-05-20T13:16:45Z

+    "if [ -s /logs/verifier/reward.json ]; then "
+    "cat /logs/verifier/reward.json; "
+    "elif [ -s /logs/verifier/reward.txt ]; then "
+    "cat /logs/verifier/reward.txt; fi"


Read reward.txt before reward.json for Harbor rewards

HARBOR_REWARD_COMMAND now reads /logs/verifier/reward.json before /logs/verifier/reward.txt, which can mis-score rollouts when both files exist (for example, a stale/diagnostic JSON alongside the final scalar reward in reward.txt). In that case valid tasks can return the wrong reward or 0 if JSON is malformed, despite a correct text reward being present, so this ordering can silently corrupt evaluation results.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c43b51cc5d

chatgpt-codex-connector · 2026-05-26T08:30:50Z

+    verifier_env = harbor.get("verifier_env") or {}
+    if not isinstance(verifier_env, Mapping):
+        raise TypeError("[verifier].env must be a mapping.")
+    verifier_env = {str(key): str(value) for key, value in verifier_env.items()}


Resolve verifier env placeholders before launching tests

The new verifier env plumbing passes [verifier.env] values through as raw strings ({"KEY": "${KEY}"} stays literal) instead of resolving host-backed placeholders. Harbor tasks commonly use ${VAR} here for API keys (e.g., LLM-judge verifiers), so in those tasks the verifier process receives the literal token ${ANTHROPIC_API_KEY} rather than the actual secret, causing authentication failures and reward 0 despite valid host configuration.
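
A minimal sketch of the placeholder resolution being requested, assuming shell-like semantics for `${VAR}` and `${VAR:-default}`. The helper names `resolve_env_value` and `resolve_verifier_env`, the regular expression, and the use of ValueError for a missing variable are assumptions for illustration; the PR's harbor_verifier_env may differ in all of these.

```python
import re
from typing import Mapping

# Matches ${NAME} and ${NAME:-default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def resolve_env_value(value: str, host_env: Mapping[str, str]) -> str:
    """Substitute host-backed placeholders in a single [verifier].env value."""

    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        current = host_env.get(name)
        if default is not None:
            # Assumed shell-like rule: unset or empty falls back to the default.
            return current if current else default
        if current is None:
            raise ValueError(f"Missing required host environment variable: {name}")
        return current

    return _PLACEHOLDER.sub(substitute, value)


def resolve_verifier_env(
    verifier_env: Mapping[str, object], host_env: Mapping[str, str]
) -> dict[str, str]:
    """Resolve every value of a [verifier].env mapping against the host environment."""
    return {
        str(key): resolve_env_value(str(value), host_env)
        for key, value in verifier_env.items()
    }
```

In practice `host_env` would be `os.environ`; passing it explicitly keeps the function easy to test and avoids reading secrets implicitly.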

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3f955ba2f

chatgpt-codex-connector · 2026-05-27T16:51:00Z

+    sandbox: ConfigData = {
+        "image": environment.get("docker_image") or config.docker_image,
+        "cpu_cores": parse_number(environment.get("cpus"), config.cpu_cores),
+        "memory_gb": parse_gb(memory, config.memory_gb),
+        "disk_size_gb": parse_gb(storage, config.disk_size_gb),
+        "timeout_minutes": config.timeout_minutes,
+        "command_timeout": command_timeout,


Preserve Harbor GPU requirements in sandbox configs

For Harbor tasks whose [environment] or new [verifier.environment] requests GPUs with gpus = 1 / gpu_types = [...], this sandbox mapping drops those fields, so create_sandbox() falls back to gpu_count=0 and the verifier/agent starts without the required accelerator. This is especially visible for the new separate verifier path because Harbor documents [verifier.environment] as the same schema as [environment], but GPU-backed graders will now fail or be scheduled on CPU-only sandboxes despite the task config requesting GPUs.
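
One way the missing fields could be carried over is sketched below. It maps the Harbor `gpus` / `gpu_types` keys onto the `gpu_count`, `gpu_type`, and `vm` sandbox fields named in this PR. The helper name `gpu_sandbox_fields` and the choice to take the first entry of `gpu_types` are assumptions, not confirmed behavior of the merged change.

```python
from typing import Any, Mapping


def gpu_sandbox_fields(environment: Mapping[str, Any]) -> dict[str, Any]:
    """Derive GPU-related sandbox fields from a Harbor environment table (hypothetical)."""
    gpu_count = int(environment.get("gpus") or 0)
    gpu_types = environment.get("gpu_types") or []
    fields: dict[str, Any] = {"gpu_count": gpu_count}
    if gpu_count > 0:
        # Assumption: the first listed type is used; a real mapping may validate
        # or negotiate among several acceptable types.
        fields["gpu_type"] = gpu_types[0] if gpu_types else None
        # The PR summary states that vm defaults to True when gpu_count > 0.
        fields["vm"] = True
    return fields
```

The returned dictionary could be merged into the sandbox mapping shown in the diff above, so that both [environment] and [verifier.environment] tables keep their GPU requirements.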

chatgpt-codex-connector Bot reviewed May 17, 2026 (comment thread: verifiers/v1/packages/tasksets/harbor.py)

cursor Bot reviewed May 17, 2026 (comment thread: verifiers/v1/packages/tasksets/harbor.py)

xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch from 8ae240a to 05d4739 on May 20, 2026 09:22

cursor Bot reviewed May 20, 2026 (comment thread: verifiers/v1/packages/tasksets/harbor.py)

xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch 2 times, most recently from a953660 to f23a006, on May 20, 2026 09:41

chatgpt-codex-connector Bot reviewed May 20, 2026

cursor Bot reviewed May 20, 2026 (comment thread: verifiers/v1/packages/tasksets/harbor.py, outdated)

xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch from f23a006 to 9daf4b2 on May 20, 2026 13:01

chatgpt-codex-connector Bot reviewed May 20, 2026

xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch from 9daf4b2 to a063a69 on May 20, 2026 13:09

chatgpt-codex-connector Bot reviewed May 20, 2026

xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch from a063a69 to c43b51c on May 26, 2026 08:23

chatgpt-codex-connector Bot reviewed May 26, 2026

Commit 071c2ba: APR-262 Add Harbor verifier modes for v1

xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch from c43b51c to 071c2ba on May 26, 2026 08:34

This was referenced May 27, 2026:
Support GPU type in v1 sandbox config #1478 (Merged)
Add PostTrainBench Harbor taskset PrimeIntellect-ai/research-environments#415 (Open)

Commit f3f955b: Support GPU type in v1 sandbox config (#1478)

chatgpt-codex-connector Bot reviewed May 27, 2026
