fix(eval): bind resolved S-axis env + C-axis overlay on the sharded and run-config paths#804
Conversation
…nd run-config paths PR #790 added the --state (S) and --config-override (C) axes, but two delivery paths silently dropped them — producing wrong eval results with no error, the worst failure mode for a benchmarking pipeline: - Sharded / --worker-concurrency runs serialized only a manifest *path* into the worker payload (request.environment_manifest), which is None for a --state run, so every worker booted with no Environment plane; an inline resolve_state tool subset was lost even when a path was present. Serialize the already-resolved config.environment_manifest object and rebuild it in the worker (model_dump <-> model_validate) so the worker boots the exact world the parent resolved. Removes the now-redundant environment_manifest_path plumbing. - The run-config-file path (_run_config_file_eval) threaded every CLI override onto the YAML-loaded Evaluation except config_override, so `eval create --config run.yaml --config-override '{...}'` validated the overlay then ran every task unchanged. Apply it, mirroring the existing --environment-manifest precedence. Regression tests for both: a resolved-manifest (with filtered service subset) round-trip through the shard payload, and a CLI --config-override-on-run-config-file threading test. Full suite green.
Greptile SummaryThis PR fixes two silent data-correctness bugs introduced by #790. Both bugs caused evals to silently run with wrong configuration — no error, just wrong results.
Confidence Score: 5/5Safe to merge — both changes are narrow, targeted patches to silent data-loss paths with direct test coverage. The sharding fix correctly promotes the already-resolved EnvironmentManifest object across the JSON subprocess boundary instead of a potentially-null path, and the back-compat fallback ensures in-flight payloads survive an upgrade. The run-config-file fix is a one-liner that mirrors an identical adjacent guard already present for eval_env_manifest. Three new tests directly exercise the fixed code paths. No existing behaviour is changed for callers that don't use the S or C axes. No files require special attention. All six changed files are clean and consistent with each other. Important Files Changed
Reviews (1): Last reviewed commit: "fix(eval): bind resolved S-axis env + C-..." | Re-trigger Greptile |
End-to-end verification on Daytona ✅Ran the real pipeline (deepseek-v4-pro + deepagents harness + Daytona) against the S-axis binding resolved identically across all runs: Bug #1 — sharded
|
| Code under test | Path | Score | Detail |
|---|---|---|---|
pre-fix main 44c10a23 |
sharded worker | 0/1 (0.0%) | failed:1 — no env plane in the worker; verifier found wrong state |
fix #804 d8a0a60f |
sharded worker | 1/1 (100%) | passed:1, verifier_errored:0, verifier reached the live gmail service |
worker_count:1 in both confirms the sharded path was taken. Pre-fix, the worker received environment_manifest=None (the parent serialized only a manifest path, which is None for a --state run) → gmail service never started → 0.0. The fix serializes the resolved manifest object, so the worker boots the exact world the parent resolved.
Bug #2 — --config-override on the run-config-file path
eval --config run.yaml --state '{…}' --config-override '{"agent":{"timeout_sec":222}}' (the non-sharded _run_config_file_eval path):
- Score: 1/1 (100%),
reward: 1.0. - Persisted
config.jsonproves the overlay threaded + applied (pre-fix it was silently dropped):config_override: {keys:['agent'], sha256:'sha256:1ae107…', patch:{agent:{timeout_sec:222}}} "timeout_sec": 222 ← applied to the agent config
Coverage note
- The unit suite (4293 passed) already locks the serialization boundary deterministically; these runs confirm the real Daytona pipeline end-to-end.
- Only one clawsbench task exists in this fork, so a multi-task/multi-worker fan-out wasn't run here; the worker-subprocess boundary (the thing Feat/v0.1.5 #1 fixes) is identical for N workers and is exercised above with
worker_count:1.
E2E follow-up — multi-worker fan-out + second agent configExtended the verification along two axes: worker fan-out (N=4) and a second agent harness (oracle, deterministic). Same
Across 8 independent worker subprocesses (4 per run), every one deserialized the resolved manifest and booted the gmail service — Full matrix now covering both bugs
Honest scope: the 4 tasks are synthetic duplicates of |
E2E follow-up — additional agent harnessesStressed the fix across different launcher shapes (each harness has a distinct
Read of the openclaw failure: the agent never made a tool call and hit an API error → it couldn't authenticate to deepseek through openclaw's Coverage ceiling this session: model is pinned to deepseek (no direct Gemini provider — only Vertex/ADC; |
E2E follow-up — extended harness matrix (multi-agent run)Fanned out 5 more harnesses, each a real sharded
Both errored harnesses do their own provider resolution and never reached an LLM (0 tool calls / 0 tokens) — they are not env-failures; Full session tally6 harnesses confirm the fix end-to-end across the sharded |
|
bingran-you
left a comment
There was a problem hiding this comment.
Code review (2026-06-20, automation) — logic is correct and well-tested; approving the code. Recommending a real sharded --state E2E before final merge (see note at the end).
I traced all three binding paths against the codebase and the fix is sound:
1. Sharded path (eval_sharding.py) — the core fix. _config_payload previously serialized only environment_manifest_path. For a --state run the parent resolves --state / name@version into config.environment_manifest (an object), and request.environment_manifest (the path) is None — so the old payload shipped null and every worker booted with no Environment plane, plus any inline resolve_state tool subset was lost. Serializing config.environment_manifest.model_dump(mode="json") round-trips the already-filtered services so the worker reconstructs the exact world. Correct.
2. Worker reconstruction (eval_worker.py). EnvironmentManifest.model_validate(manifest) rebuilds the object directly (no $BENCHFLOW_ENV_REGISTRY re-resolution needed in the subprocess), and the environment_manifest_path fallback is kept for pre-fix shard payloads retried mid-upgrade — good defensive back-compat. Preference order (object → path → None) is right.
3. Run-config path (cli/main.py). Applying plan.eval_config_override to j._config.config_override closes the C-axis hole where --config-override on a run-config file was silently a no-op. Consistent with how eval_env_manifest is already overlaid two lines above.
The dropped environment_manifest_path param on run_sharded_evaluation / run_batch_eval is internal and the sole call site is updated. CI is green (eval-and-judge, test, pip-audit), and the change ships focused tests (test_eval_sharding, test_config_override, test_usage_tracking).
Merge note: because this corrects a silent wrong-result bug on the sharded/--state path — the worst failure mode for an eval pipeline — the one gap before merge is an actual end-to-end sharded run with --state + --config-override on Daytona/Docker confirming workers boot the resolved world and the overlay lands (the unit tests prove the serialization round-trip but not a live worker boot). Holding the merge for the maintainer to either run that or confirm the integration gate covers it. Code itself: approve.
What
Follow-up to #790, which added the
--state(S) and--config-override(C) eval axes. Two delivery paths silently dropped the binding — producing wrong eval results with no error, the worst failure mode for a benchmarking pipeline. Found during a post-merge review of #790.1. Sharded /
--worker-concurrencyruns dropped--stateentirelyrun_sharded_evaluationserialized only a manifest path (request.environment_manifest) into the worker payload. For a--staterun that path isNone, so every worker booted with no Environment plane; and an inlineresolve_statetool subset ({"name":"env0","tools":[...]}) was lost even when a path was present (it lives only on the resolved object).Fix: serialize the already-resolved
config.environment_manifestobject and rebuild it in the worker (model_dump(mode="json")↔model_validate), so the worker boots the exact world the parent resolved — including the filtered service subset, with no dependency on$BENCHFLOW_ENV_REGISTRYbeing present in the subprocess. Removes the now-redundantenvironment_manifest_pathplumbing (worker keeps a back-compat fallback for in-flight pre-fix payloads).2. The run-config-file path dropped
--config-override_run_config_file_evalthreaded every CLI override onto the YAML-loadedEvaluationexceptconfig_override, sobench eval create --config run.yaml --config-override '{...}'validated the overlay (no error) then ran every task with its original config. The--tasks-dirpath applied it correctly, which masked the gap.Fix: apply
plan.eval_config_override, mirroring the existing--environment-manifest-wins-over-YAML precedence right above it.Tests
test_worker_payload_round_trips_resolved_environment_manifest— resolved manifest with a filtered service subset survives the JSON shard-payload boundary.test_worker_payload_environment_manifest_none_stays_none— no-S-axis run serializes a null manifest.test_cli_config_override_applies_on_run_config_file_path—--config <yaml> --config-overridelands onEvaluationConfig.config_override.Full unit suite green (4293 passed, 12 skipped);
ruff check,ruff format --check,ty checkall clean.Not in scope (from the same review, deferred)
build_context_stage.pyre-implementssandbox/setup.py:stage_dockerfile_depsand drops its symlink / path-escape guards (Sandbox isolation leaks host state and weakens no-internet guarantees #363/Sandbox upload and staging helpers dereference local symlinks into remote or build contexts #411).sandboxsection (network_mode/allowed_hosts/resources), which the default shared verifier can depend on._agent_process_kill_patternedges (flag-values not skipped; silentNone→ no teardown).