Skip to content

feat(supervisor): exponential-backoff restart with component reinstantiation (COW-1033)#39

Open
brunota20 wants to merge 1 commit into
feat/prometheus-metrics-cow-1034from
feat/supervisor-restart-cow-1033
Open

feat(supervisor): exponential-backoff restart with component reinstantiation (COW-1033)#39
brunota20 wants to merge 1 commit into
feat/prometheus-metrics-cow-1034from
feat/supervisor-restart-cow-1033

Conversation

@brunota20

Copy link
Copy Markdown
Collaborator

What does this PR do?

When a module traps in `on_event` (OutOfFuel, MemoryOutOfBounds, unhandled host error), the supervisor now schedules an exponential-backoff restart instead of permanently quarantining the module. After the backoff window the supervisor re-instantiates the component (fresh wasmtime Store + bindings + re-call `init`) and dispatches again. A successful event resets the failure counter.

Fifth M4 issue landed.

Why reinstantiation, not just refuel

A wasmtime trap leaves the component instance poisoned: subsequent `call_on_event` returns `wasm trap: cannot enter component instance`. Just `set_fuel` on the Store does not recover - we have to tear it down. The supervisor caches the `Component`, `init_config`, and `http_allowlist` on `LoadedModule` at boot so a restart only needs a fresh Store + re-instantiation - the compiled component bytes are reused.

Backoff schedule

failure_count next_attempt delay
0 (steady state)
1 1s
2 2s
3 4s
... doubles
9+ capped at 5 min

Changes

Area Change
`runtime::restart_policy` (new) `backoff_for(failure_count) -> Duration` + 4 unit tests
`Supervisor` Now owns `engine + cow_pool + provider_pool + local_store` (all internally `Arc`-backed). New `reinstantiate_one(idx)` method.
`LoadedModule` New fields: `component`, `init_config`, `http_allowlist` (cached for restart), `failure_count`, `next_attempt`.
`dispatch_block` / `dispatch_log` Restructured into two phases: restart sweep -> steady-state dispatch. Trap path schedules backoff; success path resets it.
Logs (COW-1035) Trap line carries `failure_count` + `backoff_ms`. New `info!` line on each restart attempt.
Metrics (COW-1034) `shepherd_module_restarts_total{module}` increments on every restart attempt.
`modules/fixtures/flaky-bomb/` (new test fixture) Traps via OutOfFuel for the first N events; recovers afterwards. Uses local-store for the attempt counter so progress survives wasm instance reset.

New integration test

`supervisor::tests::restart_flaky_module_recovers_after_backoff` (real-time, ~1.4s wall clock):

  1. Boot flaky-bomb with `fail_first_n = 1`.
  2. Dispatch 1: trap. alive=false, failure_count=1, next_attempt=+1s.
  3. Immediate redispatch: skipped (in backoff).
  4. `tokio::time::sleep(1.1s)`.
  5. Dispatch 3: restart fires (fresh component instance, fresh wasm state, but local-store survived). With attempt=2 > N=1, the module returns Ok. alive=true, failure_count=0.
  6. Dispatch 4: steady-state.

Breaking changes

`Supervisor` struct gained 4 new private fields. Callers that constructed `Supervisor { modules: ... }` directly (only the 2 unit tests in `supervisor::tests`) updated to use the new `Supervisor::empty_for_test(engine, store)` helper.

Tests

  • `cargo test --workspace` -> 159 host tests + 6 doctests passing (was 154 + 6; +4 from `restart_policy` unit tests + 1 from the integration test).
  • `cargo clippy --all-targets --workspace -- -D warnings` clean.
  • `cargo fmt --all --check` clean.
  • All existing resource-limit tests (COW-1036) still pass: their assertions are against state immediately after the trap (before backoff elapses), so the restart machinery is transparent.
  • `init_failure_marks_module_dead_and_excludes_from_dispatch` (COW-1070) still passes: init-failed modules carry `next_attempt = None` so the restart sweep never picks them up.

Out of scope

  • Cross-engine-restart persistence of `failure_count` / `next_attempt` (a 0.3 follow-up).
  • WS reconnect-with-backoff (COW-1071, separate axis).
  • Operator-tunable backoff via `engine.toml::[engine.restart]` (configurable in 0.3).
  • Module-side `on_restart` hook. Modules just see a fresh `init` after restart.

AI assistance disclosure

AI Assistance: this change + description was produced by a Claude Code agent (Claude Opus 4.7 1M context). The agent diagnosed the wasmtime "cannot enter component instance" semantic, designed the reinstantiation path, implemented the policy + test fixture, and authored this PR description. A human (Bruno) reviewed and is accountable for the result.

Linear: COW-1033. Stacks on #38 (COW-1034 Prometheus metrics).

…tiation (COW-1033)

When a module traps in `on_event` (OutOfFuel, MemoryOutOfBounds,
unhandled host error), the supervisor now:

1. Marks the module `alive = false` and increments `failure_count`.
2. Schedules a `next_attempt` instant via the new
   `runtime::restart_policy::backoff_for` (1s → 2s → 4s → ... cap
   5 min). All dispatches before that instant skip the module.
3. On the first dispatch past the backoff window, the supervisor
   tears down the trapped wasmtime Store + component instance and
   re-instantiates from the cached `Component`. The instance state
   resets but host-side persistent state (local-store) survives
   so a module's progress counters live across restarts.
4. On a successful `on_event` after recovery, `failure_count` resets
   to 0 + `next_attempt = None`.

## Why the reinstantiation is required

A wasmtime trap leaves the component instance poisoned: subsequent
`call_on_event` returns "wasm trap: cannot enter component instance".
Just refueling the Store does not recover. The supervisor caches
the `Component`, `init_config`, and `http_allowlist` on
`LoadedModule` at boot so a restart only needs a fresh Store +
re-instantiation - the compiled component bytes are reused.

## New types / files

- `crates/nexum-engine/src/runtime/restart_policy.rs`: `backoff_for(failure_count) -> Duration` with the 1s → 5min schedule. 4 unit tests covering the steady-state, first-failure, doubling, and cap arms.
- `Supervisor` gains four cached backends (`engine`, `cow_pool`, `provider_pool`, `local_store`) so `reinstantiate_one(idx)` can rebuild the wasi Linker + HostState + Store + bindings on demand.
- `LoadedModule` gains `component: Component`, `init_config: Config`, `http_allowlist: Vec<String>` (all cloned at boot), plus `failure_count: u32` and `next_attempt: Option<Instant>` for the schedule.

## Dispatch path changes

`dispatch_block` and `dispatch_log` now restructure into two
phases:

1. **Phase 1 (restart sweep)**: walk modules, collect indices of
   dead-but-due modules, call `reinstantiate_one` on each. Failed
   restarts bump the backoff again. Successful restarts flip
   `alive = true` so phase 2 dispatches the next event to them.
2. **Phase 2 (steady-state dispatch)**: unchanged from before -
   walk modules, dispatch where subscribed + alive. Trap path
   sets `next_attempt` + bumps `failure_count`; success path
   resets both.

The structured logs from COW-1035 gain `failure_count` + `backoff_ms`
on trap + `restart attempt` info lines on each restart. The
`shepherd_module_restarts_total{module}` Prometheus counter from
COW-1034 increments on every restart attempt.

## New fixture + integration test

`modules/fixtures/flaky-bomb/` (test-only): traps via OutOfFuel on
the first N events (N from `[config].fail_first_n`) and recovers
afterwards. Uses local-store for the attempt counter because the
wasm instance state resets on each reinstantiation; the counter
persists in the host-side store so the module deterministically
recovers after the configured N.

`supervisor::tests::restart_flaky_module_recovers_after_backoff`
(new): boots flaky-bomb with fail_first_n=1, dispatches, observes:

- Dispatch 1: trap. alive=false, failure_count=1, next_attempt=+1s.
- Immediate redispatch: skipped (still in backoff).
- Sleep 1.1s.
- Dispatch 3: restart fires, fresh instance attempts again. With
  attempt=2 > N=1, returns Ok. alive=true, failure_count=0,
  next_attempt=None.
- Dispatch 4: steady-state, dispatches normally.

Test wall-clock ~1.4s.

## Tests

- `cargo test --workspace` -> 159 host tests + 6 doctests passing.
  +4 from `restart_policy` unit tests + 1 from the new integration
  test (was 154 + 6).
- `cargo clippy --all-targets --workspace -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- All existing resource-limit tests (COW-1036) still pass against
  the new dispatch shape: their assertions are against state
  *immediately* after the trap (before backoff elapses), so the
  restart machinery is transparent.
- The `init_failure_marks_module_dead_and_excludes_from_dispatch`
  test (COW-1070) still passes: init-failed modules carry
  `next_attempt = None` so the restart sweep never picks them up.

## Out of scope

- Persistence of `failure_count` / `next_attempt` across full
  engine restarts. The schedule resets on every boot; cross-engine
  persistence is a 0.3 follow-up.
- WS reconnect-with-backoff for upstream RPC drops - that is
  COW-1071, a separate axis.
- Operator-tunable backoff via `engine.toml::[engine.restart]`.
  The current constants are workspace literals in
  `runtime::restart_policy`; configurable in 0.3.
- Module-side `on_restart` hook. Modules just see a fresh `init`
  call after a restart, same as boot.

Linear: COW-1033. Fifth M4 issue landed; stacks on #38 (COW-1034).
@linear-code

linear-code Bot commented Jun 18, 2026

Copy link
Copy Markdown

COW-1033

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant