feat(spider-execution-manager): Add the runtime that drives the main task-dispatch loop. by LinZhihao-723 · Pull Request #329 · y-scope/spider

LinZhihao-723 · 2026-05-26T02:31:37Z

Description

This PR adds spider_execution_manager::runtime, the last piece of the execution-manager: a generic owner of the process pool, liveness actor, and clients that pulls task assignments from the scheduler, dispatches them through the executor pool, and reports outcomes back to storage.

Runtime<SchedulerClientType, StorageClientType> — generic over the two client types the loop calls every iteration. Storage carries + Clone + 'static because the outcome reporter is spawned onto a detached tokio task and needs an owned client. The liveness client is only generic at the create factory (the runtime hands its only Arc to the actor at boot and never touches it again).

Bootstrap (Runtime::create) — registers with storage via LivenessClient::register, seeds a SessionTracker with the returned session id, builds the initial ProcessPool, and spawns the liveness actor. By the time create returns, the actor has already sent its first heartbeat and the runtime is ready to serve.

Main loop (Runtime::run / Runtime::main_loop) — each iteration:

Pull an assignment. SchedulerClient::next_task is raced against the cancellation token via tokio::select! with biased, so shutdown is observed promptly even when the scheduler is idle. Any SchedulerError is treated as fatal and propagated via ? (the client is responsible for absorbing transient transport blips internally; a surfaced error means it's already given up).
Stale-session triage. The bundle's session_id is compared against the shared SessionTracker::current(): a strictly-older bundle is logged and dropped; a strictly-newer bundle nudges the liveness actor via LivenessHandle::refresh so the tracker re-syncs through the next heartbeat. The tracker is never advanced from the main loop — the liveness actor is the only writer.
Register the task instance. StorageClient::register_task_instance is racing-aware (cancellation interrupts the await). StaleSession triggers a liveness refresh and drops the assignment; InvalidInput is treated as a contract bug and is fatal; every other storage error is logged and the assignment is dropped.
Execute. An ExecuteRequest is built from the registration's ExecutionContext (the hard_timeout is derived from timeout_policy.hard_timeout_ms) and submitted to ProcessPool::execute. The pool already enforces the hard timeout, so the call is awaited to completion — losing the in-flight outcome to a cancellation-race would prevent step 5 from reporting it. Pool InternalErrors bubble up via ? (executor binary missing, encode failures, etc.) and bail the runtime.
Report outcome on a background task. A fire-and-forget tokio::spawn(report_outcome(...)) decouples result reporting from the next iteration's scheduler poll. The reporter consumes the outcome end-to-end: builds a Report (Success(Option<Vec<u8>>) or Failure(String) with a message derived from the Outcome variant), then transfers ownership into StorageClient::report_task_success / report_task_failure. No retry logic in this layer — same rationale as the scheduler call.

Shutdown semantics — run(self) wraps main_loop with explicit teardown: cancel the token (covering the scheduler-error / process-pool-error exit paths where the loop bailed without external cancellation), await liveness_join, then return result. A tokio_util::sync::DropGuard field guarantees the token is cancelled even when the run future itself is dropped at an outer await point.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Ensure all workflows pass.
Add test cases to cover the basic behavior of the runtime.

Summary by CodeRabbit

Release Notes

New Features
- Added distributed task execution framework with process-based executor management
- Implemented timeout and crash-recovery mechanisms for task execution
- Added liveness monitoring to track execution manager health via periodic heartbeats
- Introduced session tracking for coordinating distributed execution state
Tests
- Added comprehensive integration tests for task execution pipeline, including success, failure, panic, and timeout scenarios
- Added performance instrumentation tests to measure execution overhead

coderabbitai · 2026-05-26T02:31:51Z

Warning

Review limit reached

@LinZhihao-723, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 46 minutes and 31 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1f0042cc-0270-4a77-9392-719a1233b796

📥 Commits

Reviewing files that changed from the base of the PR and between 53b6b2c and bd66860.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (17)

Cargo.toml
components/spider-execution-manager/Cargo.toml
components/spider-execution-manager/src/client/scheduler.rs
components/spider-execution-manager/src/lib.rs
components/spider-execution-manager/src/runtime.rs
tests/huntsman/em-runtime/Cargo.toml
tests/huntsman/em-runtime/src/lib.rs
tests/huntsman/em-runtime/tests/test_runtime.rs
tests/huntsman/task-executor/Cargo.toml
tests/huntsman/task-executor/src/lib.rs
tests/huntsman/task-executor/tests/overhead_instrument.rs
tests/huntsman/task-executor/tests/test_executor.rs
tests/huntsman/task-executor/tests/test_process_pool.rs
tests/huntsman/test-utils/Cargo.toml
tests/huntsman/test-utils/src/executor.rs
tests/huntsman/test-utils/src/lib.rs
tests/huntsman/test-utils/src/mock.rs

Walkthrough

This PR introduces the execution manager, a new node-level service that supervises task-executor subprocesses. It adds session tracking, client contracts for distributed services, a heartbeat actor, process pooling with stdin/stdout framing, a task dispatch runtime loop, the executor binary itself, and comprehensive integration tests validating the full execution path.

Changes

Core Session Tracking & Client Contracts

Layer / File(s)	Summary
Session tracker and client trait definitions `components/spider-core/src/session.rs`, `components/spider-core/src/lib.rs`, `components/spider-core/Cargo.toml`, `components/spider-execution-manager/src/client/*.rs`, `components/spider-execution-manager/src/lib.rs`	`SessionTracker` uses `Arc<AtomicU64>` and a CAS loop to track forward-only session IDs; async trait contracts define liveness (register/heartbeat), scheduler (task assignment), and storage (task registration/outcome) client APIs with structured error and response types.

Liveness Heartbeat Actor

Layer / File(s)	Summary
Liveness actor with periodic heartbeat `components/spider-execution-manager/src/liveness.rs`	`LivenessCommand::Refresh` and `LivenessHandle` enable on-demand heartbeats; the `spawn` function creates a Tokio actor that selects between cancellation, command reception, and interval ticks. Heartbeats advance the session tracker, cancel the runtime on terminal errors, and log transient failures. Unit tests with a mock client verify correct behaviour.

Process Pool and Task Executor Protocol

Layer / File(s)	Summary
Executor wire protocol and process pooling `components/spider-task-executor/src/protocol.rs`, `components/spider-execution-manager/src/process_pool.rs`	`Request`, `Response`, and `ExecutorOutcome` enums define bincode-framed serialization for executor communication. `ProcessPool` wraps a single mutex-guarded cached executor handle, spawns subprocesses with `SPIDER_TDL_PACKAGE_DIR` set, handles stdin/stdout framing via `ExecutorHandle`, enforces hard timeouts, and transparently respawns on crash or timeout.

Runtime Main Loop and Orchestration

Layer / File(s)	Summary
Runtime initialization and main loop `components/spider-execution-manager/src/runtime.rs`	`RuntimeConfig` and `RuntimeError` establish boot parameters and error cases. `Runtime::create` registers with liveness, seeds the session tracker, creates the process pool, and spawns the liveness actor. `main_loop` polls the scheduler, filters by session freshness, registers task instances, dispatches to the process pool, and spawns detached outcome reporting tasks.

Task Executor Binary and Serialization

Layer / File(s)	Summary
Executor binary and error serialization `components/spider-task-executor/src/error.rs`, `components/spider-task-executor/src/manager.rs`, `components/spider-task-executor/src/bin/spider_task_executor.rs`, `components/spider-task-executor/Cargo.toml`	`ExecutorError` now derives serde traits and stores `String` payloads (with explicit `From` impls); `TdlPackageManager::load` returns a package reference instead of its name. The binary initializes JSON tracing, reads `SPIDER_TDL_PACKAGE_DIR`, spawns a Tokio runtime, and sequentially processes framed task execution requests, measuring FFI elapsed time.

Integration Tests and Test Infrastructure

Layer / File(s)	Summary
Test tasks, harness, and end-to-end validation `tests/huntsman/integration-test-tasks/`, `tests/huntsman/task-executor/`, `tests/huntsman/tdl-integration/tests/complex.rs`, `taskfiles/test.yaml`	Four TDL test tasks (fibonacci, always_fail, always_panic, instrument) provide rich failure/success/panic/echo scenarios. `ExecutorHandle` harness manages executor subprocess I/O. Tests validate successful execution, error decoding, crash detection, process pool respawning, hard timeout recovery, and latency instrumentation (avg/percentile). Updated build scripts stage TDL packages and the executor binary.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

y-scope/spider#325: The execution manager's process pool and runtime rely directly on the wire-protocol structure and executor binary introduction from this PR.

Suggested reviewers

sitaowang1998

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly and specifically describes the main change: adding the runtime that drives the task-dispatch loop for the spider-execution-manager.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (1)

tests/huntsman/task-executor/tests/test_process_pool.rs (1)

115-121: ⚡ Quick win

Bound ProcessPool::execute waits to prevent hanging test runs.

If executor/pool IPC wedges, these awaits can hang the test indefinitely. Add an outer test timeout for each execute call (or a small helper that applies it).

Also applies to: 133-135, 154-157, 170-172, 183-189, 202-204

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/huntsman/task-executor/tests/test_process_pool.rs` around lines 115 -
121, The test currently awaits pool.execute(...) directly (e.g., calls to
ProcessPool::execute with make_request and NORMAL_TIMEOUT), which can hang if
IPC wedges; wrap each execute call in a short outer timeout (using
tokio::time::timeout or a test helper like run_with_timeout) and fail the test
if the outer timeout elapses. Update every occurrence where pool.execute(...) is
awaited (including the calls that use make_request("fibonacci", ...),
NORMAL_TIMEOUT, and similar variants) to use that timeout wrapper (or introduce
a small helper function that takes the future returned by pool.execute and the
outer timeout duration) so tests abort instead of hanging when IPC stalls.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/spider-execution-manager/src/process_pool.rs`:
- Around line 246-247: The FramedWrite/FramedRead usage creates
LengthDelimitedCodec with the 8 MiB default which will reject larger payloads;
update the codec construction for requests and responses in process_pool.rs to
use LengthDelimitedCodec::builder().max_frame_length(N).new_write(...) and
.new_read(...) (or call set_max_frame_length) with an explicit N (pick a
project-appropriate limit) and ensure the subprocess side uses the identical
limit; also propagate a clear “frame too large” error from the codec handling so
callers of requests/responses see a descriptive error when the limit is
exceeded.

In `@components/spider-execution-manager/src/runtime.rs`:
- Around line 176-185: The run method currently drops any
tokio::spawn(report_outcome(...)) tasks, risking lost reports; update the
runtime to track outcome tasks with a tokio::task::JoinSet (add a field like
outcome_joinset: JoinSet<ResultType> to the runtime struct), replace direct
tokio::spawn(report_outcome(...)) calls in main_loop/other places with
outcome_joinset.spawn(report_outcome(...)), and during shutdown (in run after
cancelling cancellation_token and awaiting liveness_join) drain the JoinSet by
awaiting join_next in a loop (or collect/await all remaining joins) to ensure
all report_outcome tasks complete before returning; keep existing error logging
(tracing::warn) for any failed joins.
- Around line 265-266: The comment claiming "Retries and logging are handled
inside `report_outcome`" is incorrect because `report_outcome` only logs errors;
update the comment at the fire-and-forget outcome-report site to accurately
state that outcome reporting is dispatched asynchronously and that
`report_outcome` currently only logs failures (it does not perform retries), and
note that retry/backoff would need to be implemented inside `report_outcome` (or
by the caller) if desired; reference `report_outcome` and the fire-and-forget
dispatch so reviewers can locate and, if choosing to add retries later,
implement retry logic with configured attempts and backoff inside
`report_outcome`.

In `@components/spider-task-executor/src/bin/spider_task_executor.rs`:
- Around line 8-14: Update the top-level docs to accurately reflect the binary
contract: state that packages are resolved to
`${SPIDER_TDL_PACKAGE_DIR}/${package}/lib${package}.so` (note the `lib` prefix
used by run_task) and revise the execution model text to say the binary
processes many Execute requests sequentially on a single-threaded tokio runtime
(no concurrency) rather than implying a single task runs for the lifetime of the
process; update references in the header comments to match the behavior
implemented in main and run_task.

In `@tests/huntsman/integration-test-tasks/src/lib.rs`:
- Line 8: Update the intra-doc link in the module doc for task_decl::instrument
to reference the actual exported constant name INSTRUMENT_SLEEP_US (replace the
broken `INSTRUMENT_SLEEP` link with `INSTRUMENT_SLEEP_US`) so the rustdoc link
resolves correctly; ensure any surrounding text still makes sense (e.g., "sleeps
for a known [`INSTRUMENT_SLEEP_US`] duration").

In `@tests/huntsman/task-executor/tests/overhead_instrument.rs`:
- Around line 21-24: The code opens/writes the report file (using File::create /
File and a PathBuf named report_path) without ensuring the parent directory
exists, which causes failures when the directory is missing; before
creating/writing the file in the test (the block that calls
File::create(report_path) / writes the report), call std::fs::create_dir_all on
report_path.parent() (handling the None case safely) to create the directory
tree, then proceed to create and write the file.

In `@tests/huntsman/task-executor/tests/test_executor.rs`:
- Around line 22-27: The test currently awaits ExecutorHandle IPC calls without
any deadlines which can hang; wrap all awaited calls on the ExecutorHandle
(e.g., handle.recv().await, handle.shutdown_clean().await,
handle.try_recv().await, handle.wait_for_exit().await) with a hard timeout
(tokio::time::timeout with a reasonable Duration) and assert or unwrap the
timeout result so the test fails fast on hangs; update calls around
ExecutorHandle::spawn, send/recv/try_recv, shutdown_clean, and wait_for_exit to
use timeout and propagate a clear test failure when a timeout occurs.

---

Nitpick comments:
In `@tests/huntsman/task-executor/tests/test_process_pool.rs`:
- Around line 115-121: The test currently awaits pool.execute(...) directly
(e.g., calls to ProcessPool::execute with make_request and NORMAL_TIMEOUT),
which can hang if IPC wedges; wrap each execute call in a short outer timeout
(using tokio::time::timeout or a test helper like run_with_timeout) and fail the
test if the outer timeout elapses. Update every occurrence where
pool.execute(...) is awaited (including the calls that use
make_request("fibonacci", ...), NORMAL_TIMEOUT, and similar variants) to use
that timeout wrapper (or introduce a small helper function that takes the future
returned by pool.execute and the outer timeout duration) so tests abort instead
of hanging when IPC stalls.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5659315f-8271-4f2f-9ad3-dfceb77d9564

📥 Commits

Reviewing files that changed from the base of the PR and between 6e812cc and 53b6b2c.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (28)

Cargo.toml
components/spider-core/Cargo.toml
components/spider-core/src/lib.rs
components/spider-core/src/session.rs
components/spider-execution-manager/Cargo.toml
components/spider-execution-manager/src/client.rs
components/spider-execution-manager/src/client/liveness.rs
components/spider-execution-manager/src/client/scheduler.rs
components/spider-execution-manager/src/client/storage.rs
components/spider-execution-manager/src/lib.rs
components/spider-execution-manager/src/liveness.rs
components/spider-execution-manager/src/process_pool.rs
components/spider-execution-manager/src/runtime.rs
components/spider-task-executor/Cargo.toml
components/spider-task-executor/src/bin/spider_task_executor.rs
components/spider-task-executor/src/error.rs
components/spider-task-executor/src/lib.rs
components/spider-task-executor/src/manager.rs
components/spider-task-executor/src/protocol.rs
taskfiles/test.yaml
tests/huntsman/integration-test-tasks/Cargo.toml
tests/huntsman/integration-test-tasks/src/lib.rs
tests/huntsman/task-executor/Cargo.toml
tests/huntsman/task-executor/src/lib.rs
tests/huntsman/task-executor/tests/overhead_instrument.rs
tests/huntsman/task-executor/tests/test_executor.rs
tests/huntsman/task-executor/tests/test_process_pool.rs
tests/huntsman/tdl-integration/tests/complex.rs

coderabbitai · 2026-05-26T02:41:30Z

+            requests: FramedWrite::new(stdin, LengthDelimitedCodec::new()),
+            responses: FramedRead::new(stdout, LengthDelimitedCodec::new()),


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What is the default maximum frame length for tokio_util::codec::LengthDelimitedCodec::new(), and how do you configure a larger max_frame_length for both FramedRead and FramedWrite?

💡 Result:

Default max frame length: 8 MiB (8 * 1024 * 1024 bytes) for tokio_util::codec::LengthDelimitedCodec::new/Default configuration. [1][2] How to configure a larger max_frame_length for both FramedRead and FramedWrite: 1) For FramedRead (AsyncRead -> Stream of frames): Use the LengthDelimitedCodec::builder and set max_frame_length before calling new_read. Example: let framed_read = tokio_util::codec::LengthDelimitedCodec::builder.max_frame_length(16 * 1024 * 1024) // larger limit.new_read(reader); The builder’s max_frame_length setting is “the largest size this codec will accept” and larger frames are rejected; it affects decoding. [1][3] 2) For FramedWrite (AsyncWrite -> Sink of frames): Use the same builder and set max_frame_length before calling new_write. The configuration option applies to both encoding and decoding. [1][4] Example: let framed_write = tokio_util::codec::LengthDelimitedCodec::builder.max_frame_length(16 * 1024 * 1024) // larger limit.new_write(writer); The builder applies max_frame_length to both encoding (payload length checked) and decoding (frames accepted). [1][4] Note: There is also an imperative setter set_max_frame_length(&mut self, val: usize) on LengthDelimitedCodec; the change takes effect the next time a frame is decoded. [5][4]

Citations:

1: https://people.eecs.berkeley.edu/~pschafhalter/pub/erdos/doc/tokio_util/codec/length_delimited/struct.Builder.html

2: https://docs.rs/tokio-util/latest/tokio_util/codec/length_delimited/

3: https://docs.rs/tokio-util/latest/tokio_util/codec/length_delimited/index.html

4: https://docs.rs/tokio-util/latest/src/tokio_util/codec/length_delimited.rs.html

5: https://docs.rs/tokio-util/latest/tokio_util/codec/length_delimited/struct.LengthDelimitedCodec.html

Set explicit IPC frame-size limits for LengthDelimitedCodec (avoid the 8 MiB default).

components/spider-execution-manager/src/process_pool.rs constructs LengthDelimitedCodec::new() for both FramedWrite/FramedRead, whose default max_frame_length is 8 MiB; payloads larger than this will be rejected by the codec. Configure max_frame_length explicitly and mirror the same limit on the subprocess side (e.g., LengthDelimitedCodec::builder().max_frame_length(N).new_read(...) / .new_write(...), or set_max_frame_length) and surface a clear “frame too large” error when exceeded.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/spider-execution-manager/src/process_pool.rs` around lines 246 - 247, The FramedWrite/FramedRead usage creates LengthDelimitedCodec with the 8 MiB default which will reject larger payloads; update the codec construction for requests and responses in process_pool.rs to use LengthDelimitedCodec::builder().max_frame_length(N).new_write(...) and .new_read(...) (or call set_max_frame_length) with an explicit N (pick a project-appropriate limit) and ensure the subprocess side uses the identical limit; also propagate a clear “frame too large” error from the codec handling so callers of requests/responses see a descriptive error when the limit is exceeded.

coderabbitai · 2026-05-26T02:41:30Z

+    pub async fn run(self) -> Result<(), RuntimeError> {
+        tracing::info!(em_id = ? self.em_id, "Runtime main loop starting.");
+        let result = self.main_loop().await;
+        tracing::info!(em_id = ? self.em_id, "Runtime main loop exited. Shutting down.");
+        self.cancellation_token.cancel();
+        if let Err(err) = self.liveness_join.await {
+            tracing::warn!(err = ? err, "Liveness actor task did not exit cleanly.");
+        }
+        result
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Spawned outcome reports may be lost on shutdown.

When run returns, any in-flight tokio::spawn(report_outcome(...)) tasks are not awaited or tracked. If the tokio runtime shuts down immediately after, these reports may never reach storage.

Consider tracking spawned tasks via a JoinSet and draining them during shutdown, or documenting this as an accepted trade-off.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/spider-execution-manager/src/runtime.rs` around lines 176 - 185, The run method currently drops any tokio::spawn(report_outcome(...)) tasks, risking lost reports; update the runtime to track outcome tasks with a tokio::task::JoinSet (add a field like outcome_joinset: JoinSet<ResultType> to the runtime struct), replace direct tokio::spawn(report_outcome(...)) calls in main_loop/other places with outcome_joinset.spawn(report_outcome(...)), and during shutdown (in run after cancelling cancellation_token and awaiting liveness_join) drain the JoinSet by awaiting join_next in a loop (or collect/await all remaining joins) to ensure all report_outcome tasks complete before returning; keep existing error logging (tracing::warn) for any failed joins.

coderabbitai · 2026-05-26T02:41:30Z

+            // Fire-and-forget the outcome report so the main loop can dispatch the next task
+            // without waiting on storage. Retries and logging are handled inside `report_outcome`.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Misleading comment about retry handling.

The comment states "Retries and logging are handled inside report_outcome", but report_outcome (lines 468-479) only logs errors—there is no retry logic. Either update the comment to reflect the actual behaviour or implement retries.

📝 Suggested fix

- // Fire-and-forget the outcome report so the main loop can dispatch the next task - // without waiting on storage. Retries and logging are handled inside `report_outcome`. + // Fire-and-forget the outcome report so the main loop can dispatch the next task + // without waiting on storage. Errors are logged inside `report_outcome` but not + // retried.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Fire-and-forget the outcome report so the main loop can dispatch the next task

// without waiting on storage. Retries and logging are handled inside `report_outcome`.

// Fire-and-forget the outcome report so the main loop can dispatch the next task

// without waiting on storage. Errors are logged inside `report_outcome` but not

// retried.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/spider-execution-manager/src/runtime.rs` around lines 265 - 266, The comment claiming "Retries and logging are handled inside `report_outcome`" is incorrect because `report_outcome` only logs errors; update the comment at the fire-and-forget outcome-report site to accurately state that outcome reporting is dispatched asynchronously and that `report_outcome` currently only logs failures (it does not perform retries), and note that retry/backoff would need to be implemented inside `report_outcome` (or by the caller) if desired; reference `report_outcome` and the fire-and-forget dispatch so reviewers can locate and, if choosing to add retries later, implement retry logic with configured attempts and backoff inside `report_outcome`.

coderabbitai · 2026-05-26T02:41:31Z

+//! Package resolution: each `Execute` request names a TDL package; the executor looks for
+//! `${SPIDER_TDL_PACKAGE_DIR}/${package}/${package}.so` and caches the loaded library by name.
+//!
+//! Execution model: requests are processed strictly sequentially on a single-threaded tokio
+//! runtime. Tokio is used only to match the async I/O surface on the execution manager side;
+//! the executor itself has no concurrency requirements, and exactly one task runs for the
+//! lifetime of the process.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the top-level docs to match the binary’s actual contract.

Line 9 omits the lib prefix used by run_task, and Lines 11-14 say one task runs for the lifetime of the process even though main serves many requests sequentially. Those comments read like the subprocess contract, so they should match the implementation.

📝 Suggested doc fix

-//! Package resolution: each `Execute` request names a TDL package; the executor looks for -//! `${SPIDER_TDL_PACKAGE_DIR}/${package}/${package}.so` and caches the loaded library by name. +//! Package resolution: each `Execute` request names a TDL package; the executor looks for +//! `${SPIDER_TDL_PACKAGE_DIR}/${package}/lib{package}.so` and caches the loaded library by name. //! //! Execution model: requests are processed strictly sequentially on a single-threaded tokio //! runtime. Tokio is used only to match the async I/O surface on the execution manager side; -//! the executor itself has no concurrency requirements, and exactly one task runs for the -//! lifetime of the process. +//! the executor itself has no concurrency requirements, and at most one task runs at a time +//! during the process lifetime.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

//! Package resolution: each `Execute` request names a TDL package; the executor looks for

//! `${SPIDER_TDL_PACKAGE_DIR}/${package}/${package}.so` and caches the loaded library by name.

//!

//! Execution model: requests are processed strictly sequentially on a single-threaded tokio

//! runtime. Tokio is used only to match the async I/O surface on the execution manager side;

//! the executor itself has no concurrency requirements, and exactly one task runs for the

//! lifetime of the process.

//! Package resolution: each `Execute` request names a TDL package; the executor looks for

//! `${SPIDER_TDL_PACKAGE_DIR}/${package}/lib{package}.so` and caches the loaded library by name.

//!

//! Execution model: requests are processed strictly sequentially on a single-threaded tokio

//! runtime. Tokio is used only to match the async I/O surface on the execution manager side;

//! the executor itself has no concurrency requirements, and at most one task runs at a time

//! during the process lifetime.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/spider-task-executor/src/bin/spider_task_executor.rs` around lines 8 - 14, Update the top-level docs to accurately reflect the binary contract: state that packages are resolved to `${SPIDER_TDL_PACKAGE_DIR}/${package}/lib${package}.so` (note the `lib` prefix used by run_task) and revise the execution model text to say the binary processes many Execute requests sequentially on a single-threaded tokio runtime (no concurrency) rather than implying a single task runs for the lifetime of the process; update references in the header comments to match the behavior implemented in main and run_task.

coderabbitai · 2026-05-26T02:41:31Z

+use std::{
+    fs::File,
+    io::Write,
+    path::PathBuf,


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Create the output directory before writing the report file.

Line 179 can fail when Line 177 resolves to a non-existent directory. Please create the directory tree first so report emission is robust.

🛠️ Suggested patch

use std::{ - fs::File, + fs::{create_dir_all, File}, io::Write, path::PathBuf, time::{Duration, Instant}, }; @@ - let path = output_dir.join(OUTPUT_FILE); + create_dir_all(&output_dir).unwrap_or_else(|err| { + panic!("create output dir {} failed: {err}", output_dir.display()) + }); + let path = output_dir.join(OUTPUT_FILE);

Also applies to: 177-179

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/huntsman/task-executor/tests/overhead_instrument.rs` around lines 21 - 24, The code opens/writes the report file (using File::create / File and a PathBuf named report_path) without ensuring the parent directory exists, which causes failures when the directory is missing; before creating/writing the file in the test (the block that calls File::create(report_path) / writes the report), call std::fs::create_dir_all on report_path.parent() (handling the None case safely) to create the directory tree, then proceed to create and write the file.

coderabbitai · 2026-05-26T02:41:31Z

+    let mut handle = ExecutorHandle::spawn();
+    let input: u64 = 10;
+    handle
+        .send(&execute_request("fibonacci", encode_single_input(&input)))
+        .await;
+    let Response::Result { outcome, .. } = handle.recv().await;


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify unbounded awaits in these integration tests. rg -nP 'handle\.(recv|try_recv|wait_for_exit|shutdown_clean)\.await' tests/huntsman/task-executor/tests/test_executor.rs

Repository: y-scope/spider

Length of output: 362

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="tests/huntsman/task-executor/tests/test_executor.rs" echo "---- file header (first 120 lines) ----" nl -ba "$FILE" | head -n 120 echo echo "---- context around unbounded awaits ----" for L in 20 25 35 45 60 75 82 87 90; do echo "### around line $L" nl -ba "$FILE" | sed -n "$((L-5)),$((L+6))p" done

Repository: y-scope/spider

Length of output: 159

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="tests/huntsman/task-executor/tests/test_executor.rs" echo "---- file header (first 120 lines) ----" cat -n "$FILE" | head -n 120 echo echo "---- context around unbounded awaits ----" for L in 20 27 40 50 67 80 85 87 90; do echo "### around line $L" cat -n "$FILE" | sed -n "$((L-5)),$((L+6))p" done

Repository: y-scope/spider

Length of output: 8230

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="tests/huntsman/task-executor/tests/test_executor.rs" echo "---- tokio::time / Duration imports in file ----" rg -n "tokio::time|Duration|STEP_TIMEOUT|timeout\(" "$FILE" || true echo echo "---- full import section ----" sed -n '1,30p' "$FILE"

Repository: y-scope/spider

Length of output: 1186

Add bounded timeouts to prevent integration test hangs
tests/huntsman/task-executor/tests/test_executor.rs currently awaits executor IPC without any step deadlines, so a stuck child/process framing can hang the test run:

handle.recv().await (lines 27, 50)

handle.shutdown_clean().await (lines 40, 67)

handle.try_recv().await (line 80)

handle.wait_for_exit().await (line 85)

⏱️ Proposed hardening

+use std::time::Duration; + +const STEP_TIMEOUT: Duration = Duration::from_secs(10); @@ - let Response::Result { outcome, .. } = handle.recv().await; + let Response::Result { outcome, .. } = tokio::time::timeout(STEP_TIMEOUT, handle.recv()) + .await + .expect("timed out waiting for executor response"); @@ - handle.shutdown_clean().await; + tokio::time::timeout(STEP_TIMEOUT, handle.shutdown_clean()) + .await + .expect("timed out during clean shutdown"); @@ - let frame = handle.try_recv().await; + let frame = tokio::time::timeout(STEP_TIMEOUT, handle.try_recv()) + .await + .expect("timed out waiting for EOF"); @@ - let status = handle.wait_for_exit().await; + let status = tokio::time::timeout(STEP_TIMEOUT, handle.wait_for_exit()) + .await + .expect("timed out waiting for process exit");

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/huntsman/task-executor/tests/test_executor.rs` around lines 22 - 27, The test currently awaits ExecutorHandle IPC calls without any deadlines which can hang; wrap all awaited calls on the ExecutorHandle (e.g., handle.recv().await, handle.shutdown_clean().await, handle.try_recv().await, handle.wait_for_exit().await) with a hard timeout (tokio::time::timeout with a reasonable Duration) and assert or unwrap the timeout result so the test fails fast on hangs; update calls around ExecutorHandle::spawn, send/recv/try_recv, shutdown_clean, and wait_for_exit to use timeout and propagate a clear test failure when a timeout occurs.

Introduces tests/huntsman/test-utils, a shared support crate that consolidates the executor subprocess harness, TDL wire-payload helpers, and in-process mock client implementations (scheduler/storage/liveness) so multiple integration suites can reuse them. task-executor-tests is converted to a thin test-only crate that depends on it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The scheduler and storage mocks are only needed by the execution-manager runtime tests, which land on a later branch. Drop them here so this branch's test-utils exposes just the liveness client mock that the liveness work uses. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

LinZhihao-723 added 11 commits May 14, 2026 19:36

Implementation done.

320c2d0

Commit integration tests.

d25131c

gg, forgot to apply linters.

777cff5

Done with process pool implementation.

47a004f

Rename test source.

e8253e9

Merge branch 'task-executor-impl' into process-pool-basic

86bb2ed

Fix.

5946d18

Merge branch 'task-executor-impl' into process-pool-basic

cf6fdf8

Rename test binary.

c614fd3

Add client trait implement.

fd93247

Add liveness actor.

49b34d2

LinZhihao-723 requested review from a team and sitaowang1998 as code owners May 26, 2026 02:31

LinZhihao-723 marked this pull request as draft May 26, 2026 02:32

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

LinZhihao-723 and others added 6 commits May 26, 2026 21:43

Slightly polished.

ed01e30

In-progress.

802eb0e

Runtime implementation finished. Tests still pending.

c4d51fc

Done with testing.

71c039b

LinZhihao-723 force-pushed the em-runtime branch from 8aa0436 to 71c039b Compare May 27, 2026 02:09

LinZhihao-723 added 5 commits May 26, 2026 22:10

Remove useless description.

1f500ef

Merge oss-main

56815ee

Fix toml format.

1b9fad5

Merge 327.

9e98106

Merge branch 'liveness-actor' into em-runtime

bd66860

LinZhihao-723 marked this pull request as ready for review May 27, 2026 02:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(spider-execution-manager): Add the runtime that drives the main task-dispatch loop.#329

feat(spider-execution-manager): Add the runtime that drives the main task-dispatch loop.#329
LinZhihao-723 wants to merge 22 commits into
y-scope:mainfrom
LinZhihao-723:em-runtime

LinZhihao-723 commented May 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Review limit reached

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		requests: FramedWrite::new(stdin, LengthDelimitedCodec::new()),
		responses: FramedRead::new(stdout, LengthDelimitedCodec::new()),

		// Fire-and-forget the outcome report so the main loop can dispatch the next task
		// without waiting on storage. Retries and logging are handled inside `report_outcome`.

Conversation

LinZhihao-723 commented May 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LinZhihao-723 commented May 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 26, 2026 •

edited

Loading