feat(supervisor): exponential-backoff restart with component reinstantiation (COW-1033) by brunota20 · Pull Request #39 · bleu/nullis-shepherd

brunota20 · 2026-06-18T14:29:08Z

What does this PR do?

When a module traps in `on_event` (OutOfFuel, MemoryOutOfBounds, unhandled host error), the supervisor now schedules an exponential-backoff restart instead of permanently quarantining the module. After the backoff window the supervisor re-instantiates the component (fresh wasmtime Store + bindings + re-call `init`) and dispatches again. A successful event resets the failure counter.

Fifth M4 issue landed.

Why reinstantiation, not just refuel

A wasmtime trap leaves the component instance poisoned: subsequent `call_on_event` returns `wasm trap: cannot enter component instance`. Just `set_fuel` on the Store does not recover - we have to tear it down. The supervisor caches the `Component`, `init_config`, and `http_allowlist` on `LoadedModule` at boot so a restart only needs a fresh Store + re-instantiation - the compiled component bytes are reused.

Backoff schedule

failure_count	next_attempt delay
0	(steady state)
1	1s
2	2s
3	4s
...	doubles
9+	capped at 5 min

Changes

Area	Change
`runtime::restart_policy` (new)	`backoff_for(failure_count) -> Duration` + 4 unit tests
`Supervisor`	Now owns `engine + cow_pool + provider_pool + local_store` (all internally `Arc`-backed). New `reinstantiate_one(idx)` method.
`LoadedModule`	New fields: `component`, `init_config`, `http_allowlist` (cached for restart), `failure_count`, `next_attempt`.
`dispatch_block` / `dispatch_log`	Restructured into two phases: restart sweep -> steady-state dispatch. Trap path schedules backoff; success path resets it.
Logs (COW-1035)	Trap line carries `failure_count` + `backoff_ms`. New `info!` line on each restart attempt.
Metrics (COW-1034)	`shepherd_module_restarts_total{module}` increments on every restart attempt.
`modules/fixtures/flaky-bomb/` (new test fixture)	Traps via OutOfFuel for the first N events; recovers afterwards. Uses local-store for the attempt counter so progress survives wasm instance reset.

New integration test

`supervisor::tests::restart_flaky_module_recovers_after_backoff` (real-time, ~1.4s wall clock):

Boot flaky-bomb with `fail_first_n = 1`.
Dispatch 1: trap. alive=false, failure_count=1, next_attempt=+1s.
Immediate redispatch: skipped (in backoff).
`tokio::time::sleep(1.1s)`.
Dispatch 3: restart fires (fresh component instance, fresh wasm state, but local-store survived). With attempt=2 > N=1, the module returns Ok. alive=true, failure_count=0.
Dispatch 4: steady-state.

Breaking changes

`Supervisor` struct gained 4 new private fields. Callers that constructed `Supervisor { modules: ... }` directly (only the 2 unit tests in `supervisor::tests`) updated to use the new `Supervisor::empty_for_test(engine, store)` helper.

Tests

`cargo test --workspace` -> 159 host tests + 6 doctests passing (was 154 + 6; +4 from `restart_policy` unit tests + 1 from the integration test).
`cargo clippy --all-targets --workspace -- -D warnings` clean.
`cargo fmt --all --check` clean.
All existing resource-limit tests (COW-1036) still pass: their assertions are against state immediately after the trap (before backoff elapses), so the restart machinery is transparent.
`init_failure_marks_module_dead_and_excludes_from_dispatch` (COW-1070) still passes: init-failed modules carry `next_attempt = None` so the restart sweep never picks them up.

Out of scope

Cross-engine-restart persistence of `failure_count` / `next_attempt` (a 0.3 follow-up).
WS reconnect-with-backoff (COW-1071, separate axis).
Operator-tunable backoff via `engine.toml::[engine.restart]` (configurable in 0.3).
Module-side `on_restart` hook. Modules just see a fresh `init` after restart.

AI assistance disclosure

AI Assistance: this change + description was produced by a Claude Code agent (Claude Opus 4.7 1M context). The agent diagnosed the wasmtime "cannot enter component instance" semantic, designed the reinstantiation path, implemented the policy + test fixture, and authored this PR description. A human (Bruno) reviewed and is accountable for the result.

Linear: COW-1033. Stacks on #38 (COW-1034 Prometheus metrics).

…tiation (COW-1033) When a module traps in `on_event` (OutOfFuel, MemoryOutOfBounds, unhandled host error), the supervisor now: 1. Marks the module `alive = false` and increments `failure_count`. 2. Schedules a `next_attempt` instant via the new `runtime::restart_policy::backoff_for` (1s → 2s → 4s → ... cap 5 min). All dispatches before that instant skip the module. 3. On the first dispatch past the backoff window, the supervisor tears down the trapped wasmtime Store + component instance and re-instantiates from the cached `Component`. The instance state resets but host-side persistent state (local-store) survives so a module's progress counters live across restarts. 4. On a successful `on_event` after recovery, `failure_count` resets to 0 + `next_attempt = None`. ## Why the reinstantiation is required A wasmtime trap leaves the component instance poisoned: subsequent `call_on_event` returns "wasm trap: cannot enter component instance". Just refueling the Store does not recover. The supervisor caches the `Component`, `init_config`, and `http_allowlist` on `LoadedModule` at boot so a restart only needs a fresh Store + re-instantiation - the compiled component bytes are reused. ## New types / files - `crates/nexum-engine/src/runtime/restart_policy.rs`: `backoff_for(failure_count) -> Duration` with the 1s → 5min schedule. 4 unit tests covering the steady-state, first-failure, doubling, and cap arms. - `Supervisor` gains four cached backends (`engine`, `cow_pool`, `provider_pool`, `local_store`) so `reinstantiate_one(idx)` can rebuild the wasi Linker + HostState + Store + bindings on demand. - `LoadedModule` gains `component: Component`, `init_config: Config`, `http_allowlist: Vec<String>` (all cloned at boot), plus `failure_count: u32` and `next_attempt: Option<Instant>` for the schedule. ## Dispatch path changes `dispatch_block` and `dispatch_log` now restructure into two phases: 1. **Phase 1 (restart sweep)**: walk modules, collect indices of dead-but-due modules, call `reinstantiate_one` on each. Failed restarts bump the backoff again. Successful restarts flip `alive = true` so phase 2 dispatches the next event to them. 2. **Phase 2 (steady-state dispatch)**: unchanged from before - walk modules, dispatch where subscribed + alive. Trap path sets `next_attempt` + bumps `failure_count`; success path resets both. The structured logs from COW-1035 gain `failure_count` + `backoff_ms` on trap + `restart attempt` info lines on each restart. The `shepherd_module_restarts_total{module}` Prometheus counter from COW-1034 increments on every restart attempt. ## New fixture + integration test `modules/fixtures/flaky-bomb/` (test-only): traps via OutOfFuel on the first N events (N from `[config].fail_first_n`) and recovers afterwards. Uses local-store for the attempt counter because the wasm instance state resets on each reinstantiation; the counter persists in the host-side store so the module deterministically recovers after the configured N. `supervisor::tests::restart_flaky_module_recovers_after_backoff` (new): boots flaky-bomb with fail_first_n=1, dispatches, observes: - Dispatch 1: trap. alive=false, failure_count=1, next_attempt=+1s. - Immediate redispatch: skipped (still in backoff). - Sleep 1.1s. - Dispatch 3: restart fires, fresh instance attempts again. With attempt=2 > N=1, returns Ok. alive=true, failure_count=0, next_attempt=None. - Dispatch 4: steady-state, dispatches normally. Test wall-clock ~1.4s. ## Tests - `cargo test --workspace` -> 159 host tests + 6 doctests passing. +4 from `restart_policy` unit tests + 1 from the new integration test (was 154 + 6). - `cargo clippy --all-targets --workspace -- -D warnings` clean. - `cargo fmt --all --check` clean. - All existing resource-limit tests (COW-1036) still pass against the new dispatch shape: their assertions are against state *immediately* after the trap (before backoff elapses), so the restart machinery is transparent. - The `init_failure_marks_module_dead_and_excludes_from_dispatch` test (COW-1070) still passes: init-failed modules carry `next_attempt = None` so the restart sweep never picks them up. ## Out of scope - Persistence of `failure_count` / `next_attempt` across full engine restarts. The schedule resets on every boot; cross-engine persistence is a 0.3 follow-up. - WS reconnect-with-backoff for upstream RPC drops - that is COW-1071, a separate axis. - Operator-tunable backoff via `engine.toml::[engine.restart]`. The current constants are workspace literals in `runtime::restart_policy`; configurable in 0.3. - Module-side `on_restart` hook. Modules just see a fresh `init` call after a restart, same as boot. Linear: COW-1033. Fifth M4 issue landed; stacks on #38 (COW-1034).

linear-code · 2026-06-18T14:29:12Z

COW-1033

brunota20 mentioned this pull request Jun 18, 2026

feat(event-loop): WS reconnect with exponential backoff per stream (COW-1071) #40

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(supervisor): exponential-backoff restart with component reinstantiation (COW-1033)#39

feat(supervisor): exponential-backoff restart with component reinstantiation (COW-1033)#39
brunota20 wants to merge 1 commit into
feat/prometheus-metrics-cow-1034from
feat/supervisor-restart-cow-1033

brunota20 commented Jun 18, 2026

Uh oh!

linear-code Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brunota20 commented Jun 18, 2026

What does this PR do?

Why reinstantiation, not just refuel

Backoff schedule

Changes

New integration test

Breaking changes

Tests

Out of scope

AI assistance disclosure

Uh oh!

linear-code Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant