feat(supervisor): exponential-backoff restart with component reinstantiation (COW-1033) #39
Open
brunota20 wants to merge 1 commit into
…tiation (COW-1033)
When a module traps in `on_event` (OutOfFuel, MemoryOutOfBounds,
unhandled host error), the supervisor now:
1. Marks the module `alive = false` and increments `failure_count`.
2. Schedules a `next_attempt` instant via the new
   `runtime::restart_policy::backoff_for` (1s → 2s → 4s → ... capped
   at 5 min). All dispatches before that instant skip the module.
3. On the first dispatch past the backoff window, tears down the
   trapped wasmtime Store + component instance and re-instantiates
   from the cached `Component`. The instance state resets, but
   host-side persistent state (local-store) survives, so a module's
   progress counters live across restarts.
4. On a successful `on_event` after recovery, resets `failure_count`
   to 0 and `next_attempt` to `None`.
## Why the reinstantiation is required
A wasmtime trap leaves the component instance poisoned: subsequent
`call_on_event` returns "wasm trap: cannot enter component instance".
Refueling the Store alone does not recover it. The supervisor caches
the `Component`, `init_config`, and `http_allowlist` on
`LoadedModule` at boot so a restart only needs a fresh Store +
re-instantiation - the compiled component bytes are reused.
## New types / files
- `crates/nexum-engine/src/runtime/restart_policy.rs`: `backoff_for(failure_count) -> Duration` with the 1s → 5min schedule. 4 unit tests covering the steady-state, first-failure, doubling, and cap arms.
- `Supervisor` gains four cached backends (`engine`, `cow_pool`, `provider_pool`, `local_store`) so `reinstantiate_one(idx)` can rebuild the wasi Linker + HostState + Store + bindings on demand.
- `LoadedModule` gains `component: Component`, `init_config: Config`, `http_allowlist: Vec<String>` (all cloned at boot), plus `failure_count: u32` and `next_attempt: Option<Instant>` for the schedule.
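
For orientation, here is a minimal sketch of what a `backoff_for` with this schedule could look like. The signature and the 1s → 5 min doubling come from the description above; the constant names, the `u32` parameter type, and returning `Duration::ZERO` for a zero `failure_count` (the "steady-state" arm) are assumptions, not the code in `restart_policy.rs`.

```rust
use std::time::Duration;

/// Assumed constant names; the PR only says the values are literals
/// in `runtime::restart_policy`.
const BASE_SECS: u64 = 1;
const CAP: Duration = Duration::from_secs(300); // 5 min

/// Sketch of the schedule: 0 failures -> no delay (assumed),
/// then 1s, 2s, 4s, ... doubling per failure, capped at 5 min.
pub fn backoff_for(failure_count: u32) -> Duration {
    if failure_count == 0 {
        return Duration::ZERO; // steady state: nothing to wait for
    }
    // 2^(failure_count - 1) seconds. `checked_shl` returns None once the
    // shift reaches the bit width, so huge counts saturate instead of
    // wrapping around.
    let factor = 1u64.checked_shl(failure_count - 1).unwrap_or(u64::MAX);
    Duration::from_secs(BASE_SECS.saturating_mul(factor)).min(CAP)
}
```

Under these assumptions the cap is first reached at the tenth consecutive failure (2^9 s = 512 s, clamped to 300 s).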
## Dispatch path changes
`dispatch_block` and `dispatch_log` are now restructured into two
phases:
1. **Phase 1 (restart sweep)**: walk modules, collect indices of
   dead-but-due modules, call `reinstantiate_one` on each. Failed
   restarts bump the backoff again. Successful restarts flip
   `alive = true` so phase 2 dispatches the next event to them.
2. **Phase 2 (steady-state dispatch)**: the walk is unchanged from
   before - dispatch to every module that is subscribed + alive. The
   trap path now sets `next_attempt` + bumps `failure_count`; the
   success path resets both.
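
The two phases can be sketched roughly as follows. This is a simplified model, not the supervisor code: `Module` stands in for the scheduling fields on `LoadedModule`, the `reinstantiate` and `on_event` closures stand in for `reinstantiate_one` and the wasm call, `backoff` is an inlined approximation of `backoff_for`, and subscription filtering is left out. All of these names are assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the scheduling fields on `LoadedModule`.
#[derive(Debug)]
pub struct Module {
    pub alive: bool,
    pub failure_count: u32,
    pub next_attempt: Option<Instant>,
}

/// Simplified doubling backoff: 1s, 2s, 4s, ... capped at 5 min.
fn backoff(failure_count: u32) -> Duration {
    let exp = failure_count.saturating_sub(1).min(9); // 2^9 s already exceeds the cap
    Duration::from_secs(1u64 << exp).min(Duration::from_secs(300))
}

/// Shared failure path: mark dead, bump the counter, schedule the next attempt.
fn schedule_retry(m: &mut Module, now: Instant) {
    m.alive = false;
    m.failure_count = m.failure_count.saturating_add(1);
    m.next_attempt = Some(now + backoff(m.failure_count));
}

pub fn dispatch<R, D>(modules: &mut [Module], now: Instant, mut reinstantiate: R, mut on_event: D)
where
    R: FnMut(&Module) -> Result<(), String>,
    D: FnMut(&Module) -> Result<(), String>,
{
    // Phase 1 (restart sweep): collect indices first, then restart each one.
    // Modules with `next_attempt == None` (e.g. init failures) are never due.
    let due: Vec<usize> = modules
        .iter()
        .enumerate()
        .filter(|(_, m)| !m.alive && m.next_attempt.map_or(false, |at| now >= at))
        .map(|(i, _)| i)
        .collect();
    for i in due {
        match reinstantiate(&modules[i]) {
            Ok(()) => modules[i].alive = true,
            Err(_) => schedule_retry(&mut modules[i], now),
        }
    }

    // Phase 2 (steady-state dispatch): only live modules see the event.
    for m in modules.iter_mut().filter(|m| m.alive) {
        match on_event(&*m) {
            Ok(()) => {
                m.failure_count = 0;
                m.next_attempt = None;
            }
            Err(_) => schedule_retry(m, now),
        }
    }
}
```

Note that in this sketch a successful restart only flips `alive`; `failure_count` is cleared by the first successful `on_event`, which mirrors the reset rule described above.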
On a trap, the structured logs from COW-1035 gain `failure_count` and
`backoff_ms` fields, and each restart emits a `restart attempt` info
line. The `shepherd_module_restarts_total{module}` Prometheus counter
from COW-1034 increments on every restart attempt.
## New fixture + integration test
`modules/fixtures/flaky-bomb/` (test-only): traps via OutOfFuel on
the first N events (N from `[config].fail_first_n`) and recovers
afterwards. Uses local-store for the attempt counter because the
wasm instance state resets on each reinstantiation; the counter
persists in the host-side store so the module deterministically
recovers after the configured N.
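
The fixture's recovery logic can be modelled as below. This is a host-only sketch, not the wasm fixture: `LocalStore` is a hypothetical in-memory stand-in for the host-side local-store, the `"attempts"` key is invented, and the trap is represented by an `Err` rather than real fuel exhaustion. It also assumes the counter write reaches the store before the trap.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the host-side local-store. It lives on the
/// host, so it survives when the guest instance is torn down.
#[derive(Default)]
pub struct LocalStore {
    counters: HashMap<String, u64>,
}

/// Stand-in for one guest instance. It holds only its config, which is
/// re-supplied by `init` on every reinstantiation.
pub struct FlakyBomb {
    fail_first_n: u64,
}

impl FlakyBomb {
    /// Mirrors a fresh `init` call after boot or restart.
    pub fn init(fail_first_n: u64) -> Self {
        Self { fail_first_n }
    }

    /// Bumps the persistent attempt counter, then "traps" on the first
    /// `fail_first_n` attempts and succeeds afterwards.
    pub fn on_event(&self, store: &mut LocalStore) -> Result<(), &'static str> {
        let attempt = store.counters.entry("attempts".to_string()).or_insert(0);
        *attempt += 1;
        if *attempt <= self.fail_first_n {
            Err("simulated OutOfFuel trap")
        } else {
            Ok(())
        }
    }
}
```

If the counter lived in the instance instead, every restart would start again from attempt 1 and the module would trap forever for any N ≥ 1.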
`supervisor::tests::restart_flaky_module_recovers_after_backoff`
(new): boots flaky-bomb with fail_first_n=1, dispatches, observes:
- Dispatch 1: trap. alive=false, failure_count=1, next_attempt=+1s.
- Immediate redispatch: skipped (still in backoff).
- Sleep 1.1s.
- Dispatch 3: restart fires, fresh instance attempts again. With
attempt=2 > N=1, returns Ok. alive=true, failure_count=0,
next_attempt=None.
- Dispatch 4: steady-state, dispatches normally.
Test wall-clock ~1.4s.
## Tests
- `cargo test --workspace` -> 159 host tests + 6 doctests passing.
+4 from `restart_policy` unit tests + 1 from the new integration
test (was 154 + 6).
- `cargo clippy --all-targets --workspace -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- All existing resource-limit tests (COW-1036) still pass against
the new dispatch shape: their assertions are against state
*immediately* after the trap (before backoff elapses), so the
restart machinery is transparent.
- The `init_failure_marks_module_dead_and_excludes_from_dispatch`
test (COW-1070) still passes: init-failed modules carry
`next_attempt = None` so the restart sweep never picks them up.
## Out of scope
- Persistence of `failure_count` / `next_attempt` across full
engine restarts. The schedule resets on every boot; cross-engine
persistence is a 0.3 follow-up.
- WS reconnect-with-backoff for upstream RPC drops - that is
COW-1071, a separate axis.
- Operator-tunable backoff via `engine.toml::[engine.restart]`.
The current constants are workspace literals in
`runtime::restart_policy`; configurable in 0.3.
- Module-side `on_restart` hook. Modules just see a fresh `init`
call after a restart, same as boot.
Linear: COW-1033. Fifth M4 issue landed; stacks on #38 (COW-1034).
## What does this PR do?
When a module traps in `on_event` (OutOfFuel, MemoryOutOfBounds, unhandled host error), the supervisor now schedules an exponential-backoff restart instead of permanently quarantining the module. After the backoff window the supervisor re-instantiates the component (fresh wasmtime Store + bindings + re-call `init`) and dispatches again. A successful event resets the failure counter.
Fifth M4 issue landed.
## Why reinstantiation, not just refuel
A wasmtime trap leaves the component instance poisoned: subsequent `call_on_event` returns `wasm trap: cannot enter component instance`. Calling `set_fuel` on the Store alone does not recover it - we have to tear it down. The supervisor caches the `Component`, `init_config`, and `http_allowlist` on `LoadedModule` at boot so a restart only needs a fresh Store + re-instantiation - the compiled component bytes are reused.
## Backoff schedule
## Changes
## New integration test
`supervisor::tests::restart_flaky_module_recovers_after_backoff` (real-time, ~1.4s wall clock):
## Breaking changes
The `Supervisor` struct gained 4 new private fields. Callers that constructed `Supervisor { modules: ... }` directly (only the 2 unit tests in `supervisor::tests`) were updated to use the new `Supervisor::empty_for_test(engine, store)` helper.
## Tests
## Out of scope
## AI assistance disclosure
AI Assistance: this change + description was produced by a Claude Code agent (Claude Opus 4.7 1M context). The agent diagnosed the wasmtime "cannot enter component instance" semantic, designed the reinstantiation path, implemented the policy + test fixture, and authored this PR description. A human (Bruno) reviewed and is accountable for the result.
Linear: COW-1033. Stacks on #38 (COW-1034 Prometheus metrics).