feat: parallel candidate evaluation via worktrees by KRRT7 · Pull Request #2121 · codeflash-ai/codeflash

KRRT7 · 2026-05-06T23:38:47Z

Summary

Adds --parallel-candidates N CLI flag that evaluates optimization candidates in isolated git worktrees concurrently via a WorktreePool
Behavioral tests and performance benchmarks run in parallel per-candidate, with pass/fail gating before benchmarking
Refinement and repair are dispatched immediately via the existing ThreadPoolExecutor (no async client needed)
Line profiler runs on the winning candidate after selection
Includes async worktree subprocess execution, instrumented test file copying, and XML result parsing in worktrees

Key design decisions

Worktree isolation: Each candidate gets its own worktree slot — no shared file state between concurrent evaluations
pass_fail_only=True: Parallel path compares pass/fail status only (return values are stored in SQLite relative to main tree). Serial path handles deeper comparison if needed
Line profiler after selection: Only the winner gets line-profiled (requires writing to main tree for @profile instrumentation)
ThreadPoolExecutor for refinement/repair: Integrates naturally with CandidateProcessor's concurrent.futures.Future expectations
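
To make the pass_fail_only decision concrete, here is a minimal sketch of a status-only comparison. The function name `compare_pass_fail` and the mapping of test id to boolean are assumptions for illustration; the PR's actual result model is not shown here.

```python
def compare_pass_fail(baseline: dict[str, bool], candidate: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (equivalent, differing_test_ids), looking only at pass/fail status."""
    all_tests = baseline.keys() | candidate.keys()
    # A test missing on one side counts as a difference (None != True/False).
    differing = sorted(t for t in all_tests if baseline.get(t) != candidate.get(t))
    return (not differing, differing)
```

Return values are deliberately ignored in this sketch, which mirrors the stated trade-off: a candidate that passes the same tests but returns different values would need the deeper comparison on the serial path.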

Test plan

End-to-end validation on topological_sort.py with --parallel-candidates 4
Unit tests for WorktreePool and async subprocess execution
Integration test for parallel evaluator
CI green on this branch

Move OptimizedCandidateSource and BatchRefiner models from models.py to shared_types.py to avoid a circular dependency between models.py and function_optimizer.py.

Pool of N git worktree slots with async acquire/release semantics. Each slot provides write_candidate() and mirror() for file isolation.
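
As a rough illustration of acquire/release semantics, the sketch below hands out a fixed set of slots through an `asyncio.Queue`. It is not the PR's WorktreePool: plain temporary directories stand in for git worktrees, `asyncio` replaces anyio, `mirror()` is omitted, and the `Slot` and `SlotPool` names are hypothetical.

```python
import asyncio
import tempfile
from contextlib import asynccontextmanager
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Slot:
    index: int
    root: Path  # a real pool would point this at a git worktree

    def write_candidate(self, relative_path: str, source: str) -> Path:
        target = self.root / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(source, encoding="utf-8")
        return target


class SlotPool:
    def __init__(self, roots):
        self._free = asyncio.Queue()
        for index, root in enumerate(roots):
            self._free.put_nowait(Slot(index, Path(root)))

    @asynccontextmanager
    async def acquire(self):
        slot = await self._free.get()  # waits until a slot is free
        try:
            yield slot
        finally:
            self._free.put_nowait(slot)  # always hand the slot back


async def demo():
    with tempfile.TemporaryDirectory() as base:
        pool = SlotPool([Path(base) / f"slot-{i}" for i in range(2)])

        async def evaluate(name):
            async with pool.acquire() as slot:
                path = slot.write_candidate("pkg/mod.py", f"# {name}\n")
                await asyncio.sleep(0.01)  # placeholder for running tests
                return name, slot.index, path.read_text(encoding="utf-8")

        return await asyncio.gather(*(evaluate(f"cand-{i}") for i in range(4)))


if __name__ == "__main__":
    for row in asyncio.run(demo()):
        print(row)
```

Four candidates share two slots here, and each one reads back exactly what it wrote because no two tasks hold the same slot at once.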

Add async_execute_test_subprocess() for running pytest via asyncio.create_subprocess_exec with stdout/stderr capture and timeout. Add --parallel-candidates N CLI argument. Add anyio dependency.
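
The general shape of an async subprocess call with output capture and a timeout can be sketched as follows. This is an assumption-laden illustration, not the body of async_execute_test_subprocess(): the helper name `run_with_timeout` and its return shape are made up, and only standard `asyncio` calls are used.

```python
import asyncio
import sys
from typing import Optional


async def run_with_timeout(cmd: list[str], timeout: float) -> tuple[Optional[int], str, str]:
    """Run cmd, capture stdout/stderr, and kill the process if it exceeds timeout."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()  # reap the child so it does not linger
        return None, "", ""  # None signals a timeout in this sketch
    return proc.returncode, out.decode(), err.decode()


if __name__ == "__main__":
    print(asyncio.run(run_with_timeout([sys.executable, "-c", "print('ok')"], 10.0)))
```

A real test runner would pass a pytest command line instead of the inline `-c` script, and would probably report the timeout in a richer way than a `None` return code.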

Phase 1 (concurrent): behavioral correctness tests run in parallel. Failed candidates release their worktree slot immediately. Phase 2 (sequential): only passing candidates get benchmarked, one at a time, for accurate timing without CPU contention. EvalFailure carries test diffs for repair context.
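
The two phases described above can be sketched roughly as follows. An `asyncio.Semaphore` stands in for the worktree pool, and `evaluate_two_phase`, `run_behavioral`, and `run_benchmark` are hypothetical names; the real evaluator also carries test diffs and result objects that are left out here.

```python
import asyncio


async def evaluate_two_phase(candidates, run_behavioral, run_benchmark, pool_size):
    """Phase 1 runs concurrently (bounded by pool_size); phase 2 runs one at a time."""
    slots = asyncio.Semaphore(pool_size)  # stands in for the worktree pool

    async def behavioral(candidate):
        async with slots:  # the slot is released on exit, pass or fail
            return candidate, await run_behavioral(candidate)

    verdicts = await asyncio.gather(*(behavioral(c) for c in candidates))
    passing = [c for c, ok in verdicts if ok]

    timings = {}
    for candidate in passing:  # sequential, so benchmarks do not compete for CPU
        async with slots:  # a slot is held only for the duration of the benchmark
            timings[candidate] = await run_benchmark(candidate)
    return timings
```

Because no slot is held between the phases, the number of passing candidates can exceed `pool_size` without the benchmark loop waiting on slots that will never be freed.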

Adds the API method for submitting multiple candidates for refinement in a single request — used by the parallel evaluator to dispatch refinement/repair after evaluation completes.

Line profiler needs @profile instrumentation in the main tree, so it must run after candidate selection rather than inside the worktree. This method handles write → profile → restore for the parallel path. Also adds # mypy: ignore-errors, since this file has 181 pre-existing mypy errors unrelated to this PR.
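
The write → profile → restore sequence is essentially a guarded temporary overwrite. A minimal sketch, assuming a single file and a hypothetical helper name `temporarily_written` (the PR's method is not reproduced here):

```python
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporarily_written(path: Path, new_source: str):
    """Write new_source to path, then restore the original text on exit."""
    original = path.read_text(encoding="utf-8")
    path.write_text(new_source, encoding="utf-8")
    try:
        yield path  # the caller runs the profiler while the file is modified
    finally:
        # Runs on success and on failure, so the main tree is never left modified.
        path.write_text(original, encoding="utf-8")
```

Putting the restore in `finally` is the important part: a profiler crash should not leave instrumented code in the main tree.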

Wires the parallel evaluation path into _evaluate_candidates:
- Checks --parallel-candidates flag to branch between sequential/parallel
- Batches candidates with dedup/normalization gating
- Dispatches repair and refinement futures from evaluation results
- Calls _run_line_profiler_for_winner after selection

New methods: _evaluate_candidates_parallel, _dispatch_refinement, _dispatch_repair_if_possible.

Covers the full stack: pool lifecycle/cleanup, file isolation between slots, subprocess stdout/stderr/timeout, and evaluator logic (failure with diffs, success routing, concurrent multi-candidate).

… evaluator

Critical fixes from code review:
- Deadlock: slots are now released after behavioral tests (Phase 1), re-acquired for benchmarking (Phase 2). Previously, holding slots across phases caused deadlock when passes >= pool_size.
- Pydantic ValidationError: behavior_test_results is now stored in _BehavioralPass and passed through to OptimizedCandidateResult.
- Slot leak on cancellation: catch BaseException in _behavioral_phase.

WorktreePool improvements:
- Graceful partial creation failure (one slot failing doesn't crash pool).
- Cleanup resilience (one rmtree failure doesn't abort others).
- Stream lifecycle: close send/receive in cleanup().
- Async-safe: use anyio.Path for exists() checks.
- Python 3.12+: use onexc instead of deprecated onerror for rmtree.
- Remove dead code: PID file, unused restore_file method.

Other fixes:
- _run_line_profiler_for_winner: catch all exceptions.
- _dispatch_repair_if_possible: skip when diffs are empty.
- aiservice.py: pass language to _get_valid_candidates in batch path.
- Remove unused AIServiceBatchRefinerRequest dataclass.
- Fix result file path collision: include slot.index in filename.
- Remove _code_replace_lock (no longer needed since slots are released immediately and _replace_and_capture is serialized by GIL).
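
Two of the cleanup points (one rmtree failure not aborting the others, and onexc replacing onerror on Python 3.12+) can be combined in a short sketch. The helper name `remove_trees` and the choice to merely record failures are assumptions; the PR's actual handler may do something different.

```python
import shutil
import sys


def remove_trees(paths):
    """Try to delete every path; return the paths that reported an error."""
    failed = []

    def record(_func, failed_path, _exc):
        # onexc passes an exception instance, onerror passes an exc_info tuple;
        # this handler ignores that argument, so it works with both.
        failed.append(str(failed_path))

    for path in paths:
        if sys.version_info >= (3, 12):
            shutil.rmtree(path, onexc=record)
        else:
            shutil.rmtree(path, onerror=record)
    return failed
```

Because the handler returns instead of raising, a directory that cannot be removed is noted and the loop moves on to the next slot.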

…ession test

- Parallel path now checks if a successful candidate was previously refined (via path_to_root ancestry). If so, dispatches adaptive optimization instead of batch refinement, matching sequential behavior.
- Adds regression test: 6 candidates with pool_size=2 all pass, proving no deadlock occurs when passes exceed available slots.

- Add replace_lock to serialize main-tree access in _replace_and_capture
- Fix Phase 2 benchmark not writing candidate code to fresh worktree slot
- Add _closed flag and ClosedResourceError suppression in pool release
- Broaden exception handling and protect finally restore block
- Remove unused eval_ctx/exp_type params from run_parallel_evaluation
- Add tests for re-staging, partial pool init, restore-on-failure, empty candidates

KRRT7 · 2026-05-07T01:47:55Z

superseded by stacked PRs #2124 → #2125 → #2126 → #2127

KRRT7 requested review from aseembits93 and misrasaurabh1 as code owners May 6, 2026 23:38

KRRT7 force-pushed the codeflash-agent branch 2 times, most recently from e68f7e2 to 9677b56 Compare May 7, 2026 00:25

KRRT7 added 8 commits May 6, 2026 19:28

refactor: extract shared types to break circular import

1df9a1a

feat: add WorktreePool for isolated git worktree management

aa7d582

feat: async test subprocess execution and --parallel-candidates flag

5f6e451

feat: add batch_refine endpoint to AiServiceClient

ec123fd

test: parallel evaluation unit and integration tests

96fd1ca

KRRT7 force-pushed the codeflash-agent branch from 9677b56 to 96fd1ca Compare May 7, 2026 00:32

KRRT7 added 3 commits May 6, 2026 19:46

KRRT7 closed this May 7, 2026

KRRT7 wants to merge 11 commits into main from codeflash-agent
