Skip to content

fix(eval): bind resolved S-axis env + C-axis overlay on the sharded and run-config paths#804

Merged
bingran-you merged 1 commit into
mainfrom
fix/eval-sharded-state-and-runconfig-override
Jun 21, 2026
Merged

fix(eval): bind resolved S-axis env + C-axis overlay on the sharded and run-config paths#804
bingran-you merged 1 commit into
mainfrom
fix/eval-sharded-state-and-runconfig-override

Conversation

@xdotli

@xdotli xdotli commented Jun 18, 2026

Copy link
Copy Markdown
Member

What

Follow-up to #790, which added the --state (S) and --config-override (C) eval axes. Two delivery paths silently dropped the binding — producing wrong eval results with no error, the worst failure mode for a benchmarking pipeline. Found during a post-merge review of #790.

1. Sharded / --worker-concurrency runs dropped --state entirely

run_sharded_evaluation serialized only a manifest path (request.environment_manifest) into the worker payload. For a --state run that path is None, so every worker booted with no Environment plane; and an inline resolve_state tool subset ({"name":"env0","tools":[...]}) was lost even when a path was present (it lives only on the resolved object).

Fix: serialize the already-resolved config.environment_manifest object and rebuild it in the worker (model_dump(mode="json")model_validate), so the worker boots the exact world the parent resolved — including the filtered service subset, with no dependency on $BENCHFLOW_ENV_REGISTRY being present in the subprocess. Removes the now-redundant environment_manifest_path plumbing (worker keeps a back-compat fallback for in-flight pre-fix payloads).

2. The run-config-file path dropped --config-override

_run_config_file_eval threaded every CLI override onto the YAML-loaded Evaluation except config_override, so bench eval create --config run.yaml --config-override '{...}' validated the overlay (no error) then ran every task with its original config. The --tasks-dir path applied it correctly, which masked the gap.

Fix: apply plan.eval_config_override, mirroring the existing --environment-manifest-wins-over-YAML precedence right above it.

Tests

  • test_worker_payload_round_trips_resolved_environment_manifest — resolved manifest with a filtered service subset survives the JSON shard-payload boundary.
  • test_worker_payload_environment_manifest_none_stays_none — no-S-axis run serializes a null manifest.
  • test_cli_config_override_applies_on_run_config_file_path--config <yaml> --config-override lands on EvaluationConfig.config_override.

Full unit suite green (4293 passed, 12 skipped); ruff check, ruff format --check, ty check all clean.

Not in scope (from the same review, deferred)

…nd run-config paths

PR #790 added the --state (S) and --config-override (C) axes, but two
delivery paths silently dropped them — producing wrong eval results with
no error, the worst failure mode for a benchmarking pipeline:

- Sharded / --worker-concurrency runs serialized only a manifest *path*
  into the worker payload (request.environment_manifest), which is None for
  a --state run, so every worker booted with no Environment plane; an
  inline resolve_state tool subset was lost even when a path was present.
  Serialize the already-resolved config.environment_manifest object and
  rebuild it in the worker (model_dump <-> model_validate) so the worker
  boots the exact world the parent resolved. Removes the now-redundant
  environment_manifest_path plumbing.

- The run-config-file path (_run_config_file_eval) threaded every CLI
  override onto the YAML-loaded Evaluation except config_override, so
  `eval create --config run.yaml --config-override '{...}'` validated the
  overlay then ran every task unchanged. Apply it, mirroring the existing
  --environment-manifest precedence.

Regression tests for both: a resolved-manifest (with filtered service
subset) round-trip through the shard payload, and a CLI
--config-override-on-run-config-file threading test. Full suite green.
@xdotli xdotli temporarily deployed to pypi-internal-preview June 18, 2026 17:34 — with GitHub Actions Inactive
@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes two silent data-correctness bugs introduced by #790. Both bugs caused evals to silently run with wrong configuration — no error, just wrong results.

  • Sharding fix: _config_payload previously serialized only the file-system path of the environment manifest; for --state runs this path is None, so every worker subprocess booted with no Environment plane. The fix serializes the already-resolved EnvironmentManifest object via model_dump(mode="json") and rebuilds it in the worker with model_validate, with a back-compat fallback for in-flight path-keyed payloads across upgrades.
  • Run-config-file fix: _run_config_file_eval applied every CLI override to the YAML-loaded Evaluation except config_override, so --config run.yaml --config-override '{...}' was a silent no-op. The fix mirrors the existing eval_env_manifest precedence rule directly above it.

Confidence Score: 5/5

Safe to merge — both changes are narrow, targeted patches to silent data-loss paths with direct test coverage.

The sharding fix correctly promotes the already-resolved EnvironmentManifest object across the JSON subprocess boundary instead of a potentially-null path, and the back-compat fallback ensures in-flight payloads survive an upgrade. The run-config-file fix is a one-liner that mirrors an identical adjacent guard already present for eval_env_manifest. Three new tests directly exercise the fixed code paths. No existing behaviour is changed for callers that don't use the S or C axes.

No files require special attention. All six changed files are clean and consistent with each other.

Important Files Changed

Filename Overview
src/benchflow/eval_sharding.py Removes environment_manifest_path parameter from _config_payload and run_sharded_evaluation; serializes the resolved EnvironmentManifest object instead of a path. Logic is correct and clean.
src/benchflow/eval_worker.py Updates _environment_manifest to deserialize the new object-keyed payload first, with a back-compat fallback for pre-fix path-keyed payloads still in flight during upgrades. Handles the None case correctly.
src/benchflow/cli/main.py Removes now-redundant environment_manifest_path kwarg from the sharded call site, and adds the missing config_override override in _run_config_file_eval, mirroring the adjacent eval_env_manifest pattern.
tests/test_eval_sharding.py Adds two targeted round-trip tests for the manifest serialization fix; updates three existing calls to drop the removed parameter. Tests are well-scoped and exercise the exact JSON boundary the worker reads.
tests/test_config_override.py Adds an end-to-end test that patches Evaluation.run to capture config_override and verifies the CLI overlay reaches j._config on the YAML config-file path.
tests/test_usage_tracking.py Removes the now-dead environment_manifest_path=None keyword argument from two existing shard-payload helper calls. No behavioural changes.

Reviews (1): Last reviewed commit: "fix(eval): bind resolved S-axis env + C-..." | Re-trigger Greptile

@xdotli

xdotli commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

End-to-end verification on Daytona ✅

Ran the real pipeline (deepseek-v4-pro + deepagents harness + Daytona) against the archive-amazon-shipping clawsbench task — the BF-10 canary, whose verifier reads live service state over HTTP, so a stateful pass is only possible if the Environment plane actually booted. The task ships task.toml (no task.md), so _environment_manifest_from_task_document returns None — there is no task-doc fallback to mask a missing manifest, which makes the A/B below clean.

S-axis binding resolved identically across all runs: clawsbench@v1sha256:6f156dec….

Bug #1 — sharded --state (A/B causation proof)

Identical command on both — eval … --state '{"name":"clawsbench","tools":["gmail"]}' --worker-concurrency 1 (forces the run_sharded_evaluation worker-subprocess path) — only the code differs:

Code under test Path Score Detail
pre-fix main 44c10a23 sharded worker 0/1 (0.0%) failed:1 — no env plane in the worker; verifier found wrong state
fix #804 d8a0a60f sharded worker 1/1 (100%) passed:1, verifier_errored:0, verifier reached the live gmail service

worker_count:1 in both confirms the sharded path was taken. Pre-fix, the worker received environment_manifest=None (the parent serialized only a manifest path, which is None for a --state run) → gmail service never started → 0.0. The fix serializes the resolved manifest object, so the worker boots the exact world the parent resolved.

Bug #2--config-override on the run-config-file path

eval --config run.yaml --state '{…}' --config-override '{"agent":{"timeout_sec":222}}' (the non-sharded _run_config_file_eval path):

  • Score: 1/1 (100%), reward: 1.0.
  • Persisted config.json proves the overlay threaded + applied (pre-fix it was silently dropped):
    config_override: {keys:['agent'], sha256:'sha256:1ae107…', patch:{agent:{timeout_sec:222}}}
    "timeout_sec": 222   ← applied to the agent config
    

Coverage note

  • The unit suite (4293 passed) already locks the serialization boundary deterministically; these runs confirm the real Daytona pipeline end-to-end.
  • Only one clawsbench task exists in this fork, so a multi-task/multi-worker fan-out wasn't run here; the worker-subprocess boundary (the thing Feat/v0.1.5 #1 fixes) is identical for N workers and is exercised above with worker_count:1.

@xdotli

xdotli commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

E2E follow-up — multi-worker fan-out + second agent config

Extended the verification along two axes: worker fan-out (N=4) and a second agent harness (oracle, deterministic). Same --state '{"name":"clawsbench","tools":["gmail"]}'clawsbench@v1 sha256:6f156dec…, sharded with --concurrency 4 --worker-concurrency 1 so each task lands in its own worker subprocess (worker_count: 4).

Run Agent / model Tasks × workers Score verifier_errored
A oracle (deterministic, no LLM) 4 × 4 4/4 (100%) 0
B deepagents + deepseek-v4-pro 4 × 4 4/4 (100%) 0

Across 8 independent worker subprocesses (4 per run), every one deserialized the resolved manifest and booted the gmail service — verifier_errored: 0 everywhere. Run A removes agent stochasticity entirely (oracle just runs solve.sh), so its 4/4 is a deterministic proof the manifest survives the fan-out boundary; pre-fix, each worker would have received environment_manifest=None.

Full matrix now covering both bugs

# Scenario Code Result
1 sharded --state, 1 worker, deepseek fix 1.0 ✅
1 sharded --state, 1 worker, deepseek pre-fix main 0.0 (A/B causation)
1 sharded --state, 4 workers, oracle fix 4/4 ✅
1 sharded --state, 4 workers, deepseek fix 4/4 ✅
2 run-config-file --config-override fix 1.0 ✅ + timeout_sec:222 applied & hashed

Honest scope: the 4 tasks are synthetic duplicates of archive-amazon-shipping (the only non-vendor stateful task on disk), so this proves the worker-subprocess boundary at N=4 across two harnesses — not task diversity. Agent coverage: deepagents + oracle (no Anthropic/OpenAI keys available this session for claude/codex harnesses).

@xdotli

xdotli commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

⚠️ Correction: the "harness-auth-error (orthogonal)" classification below was WRONG — openclaw/openhands/harvey-lab failed on a benchflow provider-resolution bug (bare model ids), fixed in #805 (openhands now runs, 0→26 tool calls). See: #804 (comment)


E2E follow-up — additional agent harnesses

Stressed the fix across different launcher shapes (each harness has a distinct launch_cmd, which is what _agent_process_kill_pattern keys on). Same sharded --state path, --model deepseek-v4-pro (the only model provider with a key available this session):

Harness launch_cmd shape Result Note
deepagents python shim 1.0 ✅ (earlier)
oracle solve.sh, no LLM 4/4 ✅ (earlier) deterministic
pi-acp pi-acp-launcher → exec pi-acp 1.0 ✅ env survived; kill-pattern did not reap the gmail service for this launcher shape
openclaw openclaw-acp-shim (JS via py shim) ⚠️ errored suspected_api_error, 0 tool calls — openclaw↔deepseek provider auth, not the env-axis fix. Env plane resolved fine (verifier_errored: 0, errored ≠ verifier failure).

Read of the openclaw failure: the agent never made a tool call and hit an API error → it couldn't authenticate to deepseek through openclaw's BENCHFLOW_PROVIDER_* bridge. The Environment plane itself was healthy (clawsbench@v1 resolved, no verifier error). This is a harness/provider-integration gap, orthogonal to PR #804.

Coverage ceiling this session: model is pinned to deepseek (no direct Gemini provider — only Vertex/ADC; GEMINI_API_KEY drives the judge, not an agent model), and claude-agent-acp / codex-acp / gemini harnesses need Anthropic/OpenAI/Vertex creds not present here. Net: 3 harnesses (deepagents, oracle, pi-acp) confirm the fix end-to-end across distinct launcher shapes.

@xdotli

xdotli commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

⚠️ Correction: the "harness-auth-error (orthogonal)" classification below was WRONG — openclaw/openhands/harvey-lab failed on a benchflow provider-resolution bug (bare model ids), fixed in #805 (openhands now runs, 0→26 tool calls). See: #804 (comment)


E2E follow-up — extended harness matrix (multi-agent run)

Fanned out 5 more harnesses, each a real sharded --state Daytona e2e on the fixed branch, with an adversarial critic pass to separate the env-axis fix from harness-internal issues. Raw summary.json independently re-verified.

Harness Model Score Classification Decisive evidence
opencode deepseek-v4-pro 1.0 fix-confirmed passed 1, verifier_errored 0, 3 tool calls; worker reached gmail mock @:9001, reward_valid
mimo deepseek-v4-pro 1.0 fix-confirmed passed 1, verifier_errored 0, 2 tool calls; env resolved in worker
gemini gemini-3.1-pro-preview 1.0 fix-confirmed passed 1, verifier_errored 0, 5 tool calls — a non-deepseek model, so model diversity too
openhands deepseek-v4-pro 0.0 ⚠️ harness-auth-error (orthogonal) errored 1, 0 tool calls / 0 tokens; litellm.BadRequestError: LLM Provider NOT provided … model=deepseek-v4-pro. Env plane fine (resolved, no verifier error)
harvey-lab-harness deepseek-v4-pro 0.0 ⚠️ harness-auth-error (orthogonal) errored 1, 0 tool calls; Can't determine provider for model: deepseek-v4-pro (Harvey adapter only maps claude/gpt/o*/gemini). Env plane fine

Both errored harnesses do their own provider resolution and never reached an LLM (0 tool calls / 0 tokens) — they are not env-failures; env_resolved=true and zero env-service signals in their logs.

Full session tally

6 harnesses confirm the fix end-to-end across the sharded --state worker path: deepagents, oracle (4/4, deterministic), pi-acp, opencode, mimo, gemini — spanning 2 models (deepseek-v4-pro + gemini-3.1-pro-preview) and distinct launcher shapes.
3 harnesses (openclaw, openhands, harvey-lab) failed on model-provider auth/resolution — orthogonal to this PR.
Zero harnesses exhibited an env-plane regression. With the pre-fix main 0.0 A/B baseline, PR #804 is comprehensively validated.

@xdotli

xdotli commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

⚠️ Correction: the "harness-auth-error (orthogonal)" calls above were wrong

The two harness-matrix follow-ups above classified openclaw, openhands, and harvey-lab as "harness-auth-error (orthogonal)" — i.e. not this codebase's bug. That was a mis-analysis. All three failed for the same reason, and it was a benchflow provider-resolution bug, not an external harness issue.

A bare model id (deepseek-v4-pro, no provider/ prefix) never resolved to its provider in resolve_provider_env / _normalize_openhands_model, so BENCHFLOW_PROVIDER_NAME/BASE_URL/API_KEY were never emitted. The symptoms I wrongly called "orthogonal" were all downstream of that:

  • openhands → litellm: LLM Provider NOT provided
  • openclaw → anthropic/deepseek-v4-pro misroute
  • harvey-lab → Can't determine provider (raised by benchflow's own harvey_lab_acp_shim, in-repo)

Fixed in #805 (merged to main): bare ids now route via find_provider_for_bare_model, plus a default base_url for the deepseek provider. Re-dogfooded on Daytona:

  • openhands — now runs, 0 → 26 tool calls
  • openclaw — provider resolves correctly; a separate downstream stall (no first response → 400s timeout) remains
  • harvey-lab — the hard "can't determine provider" block is removed by a routing fix, but a deeper usage-tracking-proxy model-alias (benchflow-deepseek-v4-pro) blocks it — documented follow-up

So these were fixable benchflow issues, not orthogonal harness limitations. (#804's env-axis code + CI are unaffected — this corrects only my verification commentary.)

@Yiminnn Yiminnn added ready-to-merge Auto-runs the L3 integration final review (codex) on the PR and removed ready-to-merge Auto-runs the L3 integration final review (codex) on the PR labels Jun 19, 2026
@bingran-you bingran-you added bug Something isn't working P0 Blocks the next release, or security incident (secret leak / data loss / headline regression). review:pending PR is ready-for-review, no reviewer engagement yet. area:eval Issue / PR lives primarily in the "eval" subsystem. labels Jun 19, 2026
@bingran-you bingran-you mentioned this pull request Jun 19, 2026
6 tasks

@bingran-you bingran-you left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review (2026-06-20, automation) — logic is correct and well-tested; approving the code. Recommending a real sharded --state E2E before final merge (see note at the end).

I traced all three binding paths against the codebase and the fix is sound:

1. Sharded path (eval_sharding.py) — the core fix. _config_payload previously serialized only environment_manifest_path. For a --state run the parent resolves --state / name@version into config.environment_manifest (an object), and request.environment_manifest (the path) is None — so the old payload shipped null and every worker booted with no Environment plane, plus any inline resolve_state tool subset was lost. Serializing config.environment_manifest.model_dump(mode="json") round-trips the already-filtered services so the worker reconstructs the exact world. Correct.

2. Worker reconstruction (eval_worker.py). EnvironmentManifest.model_validate(manifest) rebuilds the object directly (no $BENCHFLOW_ENV_REGISTRY re-resolution needed in the subprocess), and the environment_manifest_path fallback is kept for pre-fix shard payloads retried mid-upgrade — good defensive back-compat. Preference order (object → path → None) is right.

3. Run-config path (cli/main.py). Applying plan.eval_config_override to j._config.config_override closes the C-axis hole where --config-override on a run-config file was silently a no-op. Consistent with how eval_env_manifest is already overlaid two lines above.

The dropped environment_manifest_path param on run_sharded_evaluation / run_batch_eval is internal and the sole call site is updated. CI is green (eval-and-judge, test, pip-audit), and the change ships focused tests (test_eval_sharding, test_config_override, test_usage_tracking).

Merge note: because this corrects a silent wrong-result bug on the sharded/--state path — the worst failure mode for an eval pipeline — the one gap before merge is an actual end-to-end sharded run with --state + --config-override on Daytona/Docker confirming workers boot the resolved world and the overlay lands (the unit tests prove the serialization round-trip but not a live worker boot). Holding the merge for the maintainer to either run that or confirm the integration gate covers it. Code itself: approve.

@bingran-you bingran-you added review:approved Reviewer approved; awaiting merge / CI / second review. and removed review:pending PR is ready-for-review, no reviewer engagement yet. labels Jun 20, 2026
@bingran-you bingran-you merged commit 514a10b into main Jun 21, 2026
10 checks passed
@bingran-you bingran-you deleted the fix/eval-sharded-state-and-runconfig-override branch June 21, 2026 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:eval Issue / PR lives primarily in the "eval" subsystem. bug Something isn't working P0 Blocks the next release, or security incident (secret leak / data loss / headline regression). review:approved Reviewer approved; awaiting merge / CI / second review.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants