fix(eval): bind resolved S-axis env + C-axis overlay on the sharded and run-config paths by xdotli · Pull Request #804 · benchflow-ai/benchflow

xdotli · 2026-06-18T17:34:21Z

What

Follow-up to #790, which added the --state (S) and --config-override (C) eval axes. Two delivery paths silently dropped the binding — producing wrong eval results with no error, the worst failure mode for a benchmarking pipeline. Found during a post-merge review of #790.

1. Sharded / `--worker-concurrency` runs dropped `--state` entirely

run_sharded_evaluation serialized only a manifest path (request.environment_manifest) into the worker payload. For a --state run that path is None, so every worker booted with no Environment plane; and an inline resolve_state tool subset ({"name":"env0","tools":[...]}) was lost even when a path was present (it lives only on the resolved object).

Fix: serialize the already-resolved config.environment_manifest object and rebuild it in the worker (model_dump(mode="json") ↔ model_validate), so the worker boots the exact world the parent resolved — including the filtered service subset, with no dependency on $BENCHFLOW_ENV_REGISTRY being present in the subprocess. Removes the now-redundant environment_manifest_path plumbing (worker keeps a back-compat fallback for in-flight pre-fix payloads).

2. The run-config-file path dropped `--config-override`

_run_config_file_eval threaded every CLI override onto the YAML-loaded Evaluation except config_override, so bench eval create --config run.yaml --config-override '{...}' validated the overlay (no error) then ran every task with its original config. The --tasks-dir path applied it correctly, which masked the gap.

Fix: apply plan.eval_config_override, mirroring the existing --environment-manifest-wins-over-YAML precedence right above it.

Tests

test_worker_payload_round_trips_resolved_environment_manifest — resolved manifest with a filtered service subset survives the JSON shard-payload boundary.
test_worker_payload_environment_manifest_none_stays_none — no-S-axis run serializes a null manifest.
test_cli_config_override_applies_on_run_config_file_path — --config <yaml> --config-override lands on EvaluationConfig.config_override.

Full unit suite green (4293 passed, 12 skipped); ruff check, ruff format --check, ty check all clean.

Not in scope (from the same review, deferred)

build_context_stage.py re-implements sandbox/setup.py:stage_dockerfile_deps and drops its symlink / path-escape guards (Sandbox isolation leaks host state and weakens no-internet guarantees #363/Sandbox upload and staging helpers dereference local symlinks into remote or build contexts #411).
The C-axis allowlist permits the whole sandbox section (network_mode/allowed_hosts/resources), which the default shared verifier can depend on.
Residual _agent_process_kill_pattern edges (flag-values not skipped; silent None → no teardown).

…nd run-config paths PR #790 added the --state (S) and --config-override (C) axes, but two delivery paths silently dropped them — producing wrong eval results with no error, the worst failure mode for a benchmarking pipeline: - Sharded / --worker-concurrency runs serialized only a manifest *path* into the worker payload (request.environment_manifest), which is None for a --state run, so every worker booted with no Environment plane; an inline resolve_state tool subset was lost even when a path was present. Serialize the already-resolved config.environment_manifest object and rebuild it in the worker (model_dump <-> model_validate) so the worker boots the exact world the parent resolved. Removes the now-redundant environment_manifest_path plumbing. - The run-config-file path (_run_config_file_eval) threaded every CLI override onto the YAML-loaded Evaluation except config_override, so `eval create --config run.yaml --config-override '{...}'` validated the overlay then ran every task unchanged. Apply it, mirroring the existing --environment-manifest precedence. Regression tests for both: a resolved-manifest (with filtered service subset) round-trip through the shard payload, and a CLI --config-override-on-run-config-file threading test. Full suite green.

greptile-apps · 2026-06-18T17:39:37Z

Greptile Summary

This PR fixes two silent data-correctness bugs introduced by #790. Both bugs caused evals to silently run with wrong configuration — no error, just wrong results.

Sharding fix: _config_payload previously serialized only the file-system path of the environment manifest; for --state runs this path is None, so every worker subprocess booted with no Environment plane. The fix serializes the already-resolved EnvironmentManifest object via model_dump(mode="json") and rebuilds it in the worker with model_validate, with a back-compat fallback for in-flight path-keyed payloads across upgrades.
Run-config-file fix: _run_config_file_eval applied every CLI override to the YAML-loaded Evaluation except config_override, so --config run.yaml --config-override '{...}' was a silent no-op. The fix mirrors the existing eval_env_manifest precedence rule directly above it.

Confidence Score: 5/5

Safe to merge — both changes are narrow, targeted patches to silent data-loss paths with direct test coverage.

The sharding fix correctly promotes the already-resolved EnvironmentManifest object across the JSON subprocess boundary instead of a potentially-null path, and the back-compat fallback ensures in-flight payloads survive an upgrade. The run-config-file fix is a one-liner that mirrors an identical adjacent guard already present for eval_env_manifest. Three new tests directly exercise the fixed code paths. No existing behaviour is changed for callers that don't use the S or C axes.

No files require special attention. All six changed files are clean and consistent with each other.

Important Files Changed

Filename	Overview
src/benchflow/eval_sharding.py	Removes environment_manifest_path parameter from _config_payload and run_sharded_evaluation; serializes the resolved EnvironmentManifest object instead of a path. Logic is correct and clean.
src/benchflow/eval_worker.py	Updates _environment_manifest to deserialize the new object-keyed payload first, with a back-compat fallback for pre-fix path-keyed payloads still in flight during upgrades. Handles the None case correctly.
src/benchflow/cli/main.py	Removes now-redundant environment_manifest_path kwarg from the sharded call site, and adds the missing config_override override in _run_config_file_eval, mirroring the adjacent eval_env_manifest pattern.
tests/test_eval_sharding.py	Adds two targeted round-trip tests for the manifest serialization fix; updates three existing calls to drop the removed parameter. Tests are well-scoped and exercise the exact JSON boundary the worker reads.
tests/test_config_override.py	Adds an end-to-end test that patches Evaluation.run to capture config_override and verifies the CLI overlay reaches j._config on the YAML config-file path.
tests/test_usage_tracking.py	Removes the now-dead environment_manifest_path=None keyword argument from two existing shard-payload helper calls. No behavioural changes.

_{Reviews (1): Last reviewed commit: "fix(eval): bind resolved S-axis env + C-..." | Re-trigger Greptile}

xdotli · 2026-06-18T18:38:39Z

End-to-end verification on Daytona ✅

Ran the real pipeline (deepseek-v4-pro + deepagents harness + Daytona) against the archive-amazon-shipping clawsbench task — the BF-10 canary, whose verifier reads live service state over HTTP, so a stateful pass is only possible if the Environment plane actually booted. The task ships task.toml (no task.md), so _environment_manifest_from_task_document returns None — there is no task-doc fallback to mask a missing manifest, which makes the A/B below clean.

S-axis binding resolved identically across all runs: clawsbench@v1 → sha256:6f156dec….

Bug #1 — sharded `--state` (A/B causation proof)

Identical command on both — eval … --state '{"name":"clawsbench","tools":["gmail"]}' --worker-concurrency 1 (forces the run_sharded_evaluation worker-subprocess path) — only the code differs:

Code under test	Path	Score	Detail
pre-fix `main` `44c10a23`	sharded worker	0/1 (0.0%)	`failed:1` — no env plane in the worker; verifier found wrong state
fix #804 `d8a0a60f`	sharded worker	1/1 (100%)	`passed:1`, `verifier_errored:0`, verifier reached the live gmail service

worker_count:1 in both confirms the sharded path was taken. Pre-fix, the worker received environment_manifest=None (the parent serialized only a manifest path, which is None for a --state run) → gmail service never started → 0.0. The fix serializes the resolved manifest object, so the worker boots the exact world the parent resolved.

Bug #2 — `--config-override` on the run-config-file path

eval --config run.yaml --state '{…}' --config-override '{"agent":{"timeout_sec":222}}' (the non-sharded _run_config_file_eval path):

Score: 1/1 (100%), reward: 1.0.

Persisted config.json proves the overlay threaded + applied (pre-fix it was silently dropped):

config_override: {keys:['agent'], sha256:'sha256:1ae107…', patch:{agent:{timeout_sec:222}}}
"timeout_sec": 222   ← applied to the agent config

Coverage note

The unit suite (4293 passed) already locks the serialization boundary deterministically; these runs confirm the real Daytona pipeline end-to-end.
Only one clawsbench task exists in this fork, so a multi-task/multi-worker fan-out wasn't run here; the worker-subprocess boundary (the thing Feat/v0.1.5 #1 fixes) is identical for N workers and is exercised above with worker_count:1.

xdotli · 2026-06-18T18:48:44Z

E2E follow-up — multi-worker fan-out + second agent config

Extended the verification along two axes: worker fan-out (N=4) and a second agent harness (oracle, deterministic). Same --state '{"name":"clawsbench","tools":["gmail"]}' → clawsbench@v1 sha256:6f156dec…, sharded with --concurrency 4 --worker-concurrency 1 so each task lands in its own worker subprocess (worker_count: 4).

Run	Agent / model	Tasks × workers	Score	verifier_errored
A	oracle (deterministic, no LLM)	4 × 4	4/4 (100%)	0
B	deepagents + deepseek-v4-pro	4 × 4	4/4 (100%)	0

Across 8 independent worker subprocesses (4 per run), every one deserialized the resolved manifest and booted the gmail service — verifier_errored: 0 everywhere. Run A removes agent stochasticity entirely (oracle just runs solve.sh), so its 4/4 is a deterministic proof the manifest survives the fan-out boundary; pre-fix, each worker would have received environment_manifest=None.

Full matrix now covering both bugs

#	Scenario	Code	Result
1	sharded `--state`, 1 worker, deepseek	fix	1.0 ✅
1	sharded `--state`, 1 worker, deepseek	pre-fix main	0.0 (A/B causation)
1	sharded `--state`, 4 workers, oracle	fix	4/4 ✅
1	sharded `--state`, 4 workers, deepseek	fix	4/4 ✅
2	run-config-file `--config-override`	fix	1.0 ✅ + `timeout_sec:222` applied & hashed

Honest scope: the 4 tasks are synthetic duplicates of archive-amazon-shipping (the only non-vendor stateful task on disk), so this proves the worker-subprocess boundary at N=4 across two harnesses — not task diversity. Agent coverage: deepagents + oracle (no Anthropic/OpenAI keys available this session for claude/codex harnesses).

xdotli · 2026-06-18T18:55:24Z

⚠️ Correction: the "harness-auth-error (orthogonal)" classification below was WRONG — openclaw/openhands/harvey-lab failed on a benchflow provider-resolution bug (bare model ids), fixed in #805 (openhands now runs, 0→26 tool calls). See: #804 (comment)

E2E follow-up — additional agent harnesses

Stressed the fix across different launcher shapes (each harness has a distinct launch_cmd, which is what _agent_process_kill_pattern keys on). Same sharded --state path, --model deepseek-v4-pro (the only model provider with a key available this session):

Harness	launch_cmd shape	Result	Note
deepagents	python shim	1.0 ✅	(earlier)
oracle	`solve.sh`, no LLM	4/4 ✅	(earlier) deterministic
pi-acp	`pi-acp-launcher` → exec `pi-acp`	1.0 ✅	env survived; kill-pattern did not reap the gmail service for this launcher shape
openclaw	`openclaw-acp-shim` (JS via py shim)	⚠️ errored	`suspected_api_error`, 0 tool calls — openclaw↔deepseek provider auth, not the env-axis fix. Env plane resolved fine (`verifier_errored: 0`, `errored` ≠ verifier failure).

Read of the openclaw failure: the agent never made a tool call and hit an API error → it couldn't authenticate to deepseek through openclaw's BENCHFLOW_PROVIDER_* bridge. The Environment plane itself was healthy (clawsbench@v1 resolved, no verifier error). This is a harness/provider-integration gap, orthogonal to PR #804.

Coverage ceiling this session: model is pinned to deepseek (no direct Gemini provider — only Vertex/ADC; GEMINI_API_KEY drives the judge, not an agent model), and claude-agent-acp / codex-acp / gemini harnesses need Anthropic/OpenAI/Vertex creds not present here. Net: 3 harnesses (deepagents, oracle, pi-acp) confirm the fix end-to-end across distinct launcher shapes.

xdotli · 2026-06-18T19:07:09Z

⚠️ Correction: the "harness-auth-error (orthogonal)" classification below was WRONG — openclaw/openhands/harvey-lab failed on a benchflow provider-resolution bug (bare model ids), fixed in #805 (openhands now runs, 0→26 tool calls). See: #804 (comment)

E2E follow-up — extended harness matrix (multi-agent run)

Fanned out 5 more harnesses, each a real sharded --state Daytona e2e on the fixed branch, with an adversarial critic pass to separate the env-axis fix from harness-internal issues. Raw summary.json independently re-verified.

Harness	Model	Score	Classification	Decisive evidence
opencode	deepseek-v4-pro	1.0 ✅	fix-confirmed	passed 1, verifier_errored 0, 3 tool calls; worker reached gmail mock @:9001, reward_valid
mimo	deepseek-v4-pro	1.0 ✅	fix-confirmed	passed 1, verifier_errored 0, 2 tool calls; env resolved in worker
gemini	gemini-3.1-pro-preview	1.0 ✅	fix-confirmed	passed 1, verifier_errored 0, 5 tool calls — a non-deepseek model, so model diversity too
openhands	deepseek-v4-pro	0.0 ⚠️	harness-auth-error (orthogonal)	errored 1, 0 tool calls / 0 tokens; `litellm.BadRequestError: LLM Provider NOT provided … model=deepseek-v4-pro`. Env plane fine (resolved, no verifier error)
harvey-lab-harness	deepseek-v4-pro	0.0 ⚠️	harness-auth-error (orthogonal)	errored 1, 0 tool calls; `Can't determine provider for model: deepseek-v4-pro` (Harvey adapter only maps claude/gpt/o*/gemini). Env plane fine

Both errored harnesses do their own provider resolution and never reached an LLM (0 tool calls / 0 tokens) — they are not env-failures; env_resolved=true and zero env-service signals in their logs.

Full session tally

6 harnesses confirm the fix end-to-end across the sharded --state worker path: deepagents, oracle (4/4, deterministic), pi-acp, opencode, mimo, gemini — spanning 2 models (deepseek-v4-pro + gemini-3.1-pro-preview) and distinct launcher shapes.
3 harnesses (openclaw, openhands, harvey-lab) failed on model-provider auth/resolution — orthogonal to this PR.
Zero harnesses exhibited an env-plane regression. With the pre-fix main 0.0 A/B baseline, PR #804 is comprehensively validated.

xdotli · 2026-06-18T20:23:33Z

⚠️ Correction: the "harness-auth-error (orthogonal)" calls above were wrong

The two harness-matrix follow-ups above classified openclaw, openhands, and harvey-lab as "harness-auth-error (orthogonal)" — i.e. not this codebase's bug. That was a mis-analysis. All three failed for the same reason, and it was a benchflow provider-resolution bug, not an external harness issue.

A bare model id (deepseek-v4-pro, no provider/ prefix) never resolved to its provider in resolve_provider_env / _normalize_openhands_model, so BENCHFLOW_PROVIDER_NAME/BASE_URL/API_KEY were never emitted. The symptoms I wrongly called "orthogonal" were all downstream of that:

openhands → litellm: LLM Provider NOT provided
openclaw → anthropic/deepseek-v4-pro misroute
harvey-lab → Can't determine provider (raised by benchflow's own harvey_lab_acp_shim, in-repo)

Fixed in #805 (merged to main): bare ids now route via find_provider_for_bare_model, plus a default base_url for the deepseek provider. Re-dogfooded on Daytona:

openhands — now runs, 0 → 26 tool calls ✅
openclaw — provider resolves correctly; a separate downstream stall (no first response → 400s timeout) remains
harvey-lab — the hard "can't determine provider" block is removed by a routing fix, but a deeper usage-tracking-proxy model-alias (benchflow-deepseek-v4-pro) blocks it — documented follow-up

So these were fixable benchflow issues, not orthogonal harness limitations. (#804's env-axis code + CI are unaffected — this corrects only my verification commentary.)

bingran-you

Code review (2026-06-20, automation) — logic is correct and well-tested; approving the code. Recommending a real sharded --state E2E before final merge (see note at the end).

I traced all three binding paths against the codebase and the fix is sound:

1. Sharded path (eval_sharding.py) — the core fix. _config_payload previously serialized only environment_manifest_path. For a --state run the parent resolves --state / name@version into config.environment_manifest (an object), and request.environment_manifest (the path) is None — so the old payload shipped null and every worker booted with no Environment plane, plus any inline resolve_state tool subset was lost. Serializing config.environment_manifest.model_dump(mode="json") round-trips the already-filtered services so the worker reconstructs the exact world. Correct.

2. Worker reconstruction (eval_worker.py). EnvironmentManifest.model_validate(manifest) rebuilds the object directly (no $BENCHFLOW_ENV_REGISTRY re-resolution needed in the subprocess), and the environment_manifest_path fallback is kept for pre-fix shard payloads retried mid-upgrade — good defensive back-compat. Preference order (object → path → None) is right.

3. Run-config path (cli/main.py). Applying plan.eval_config_override to j._config.config_override closes the C-axis hole where --config-override on a run-config file was silently a no-op. Consistent with how eval_env_manifest is already overlaid two lines above.

The dropped environment_manifest_path param on run_sharded_evaluation / run_batch_eval is internal and the sole call site is updated. CI is green (eval-and-judge, test, pip-audit), and the change ships focused tests (test_eval_sharding, test_config_override, test_usage_tracking).

Merge note: because this corrects a silent wrong-result bug on the sharded/--state path — the worst failure mode for an eval pipeline — the one gap before merge is an actual end-to-end sharded run with --state + --config-override on Daytona/Docker confirming workers boot the resolved world and the overlay lands (the unit tests prove the serialization round-trip but not a live worker boot). Holding the merge for the maintainer to either run that or confirm the integration gate covers it. Code itself: approve.

xdotli temporarily deployed to pypi-internal-preview June 18, 2026 17:34 — with GitHub Actions Inactive

xdotli mentioned this pull request Jun 18, 2026

fix(agents): resolve bare model ids to their provider so harnesses route correctly #805

Merged

Yiminnn added ready-to-merge Auto-runs the L3 integration final review (codex) on the PR and removed ready-to-merge Auto-runs the L3 integration final review (codex) on the PR labels Jun 19, 2026

bingran-you added bug Something isn't working P0 Blocks the next release, or security incident (secret leak / data loss / headline regression). review:pending PR is ready-for-review, no reviewer engagement yet. area:eval Issue / PR lives primarily in the "eval" subsystem. labels Jun 19, 2026

bingran-you mentioned this pull request Jun 19, 2026

chore: release v0.6.4 #801

Open

6 tasks

bingran-you approved these changes Jun 20, 2026

View reviewed changes

bingran-you added review:approved Reviewer approved; awaiting merge / CI / second review. and removed review:pending PR is ready-for-review, no reviewer engagement yet. labels Jun 20, 2026

bingran-you merged commit 514a10b into main Jun 21, 2026
10 checks passed

bingran-you deleted the fix/eval-sharded-state-and-runconfig-override branch June 21, 2026 07:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(eval): bind resolved S-axis env + C-axis overlay on the sharded and run-config paths#804

fix(eval): bind resolved S-axis env + C-axis overlay on the sharded and run-config paths#804
bingran-you merged 1 commit into
mainfrom
fix/eval-sharded-state-and-runconfig-override

xdotli commented Jun 18, 2026

Uh oh!

greptile-apps Bot commented Jun 18, 2026

Uh oh!

xdotli commented Jun 18, 2026

Uh oh!

xdotli commented Jun 18, 2026

Uh oh!

xdotli commented Jun 18, 2026 •

edited

Loading

Uh oh!

xdotli commented Jun 18, 2026 •

edited

Loading

Uh oh!

xdotli commented Jun 18, 2026

Uh oh!

bingran-you left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xdotli commented Jun 18, 2026

What

1. Sharded / --worker-concurrency runs dropped --state entirely

2. The run-config-file path dropped --config-override

Tests

Not in scope (from the same review, deferred)

Uh oh!

greptile-apps Bot commented Jun 18, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

xdotli commented Jun 18, 2026

End-to-end verification on Daytona ✅

Bug #1 — sharded --state (A/B causation proof)

Bug #2 — --config-override on the run-config-file path

Coverage note

Uh oh!

xdotli commented Jun 18, 2026

E2E follow-up — multi-worker fan-out + second agent config

Full matrix now covering both bugs

Uh oh!

xdotli commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E follow-up — additional agent harnesses

Uh oh!

xdotli commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E follow-up — extended harness matrix (multi-agent run)

Full session tally

Uh oh!

xdotli commented Jun 18, 2026

⚠️ Correction: the "harness-auth-error (orthogonal)" calls above were wrong

Uh oh!

bingran-you left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. Sharded / `--worker-concurrency` runs dropped `--state` entirely

2. The run-config-file path dropped `--config-override`

Bug #1 — sharded `--state` (A/B causation proof)

Bug #2 — `--config-override` on the run-config-file path

xdotli commented Jun 18, 2026 •

edited

Loading

xdotli commented Jun 18, 2026 •

edited

Loading