feat(supervisor): poison-pill detection + quarantine (COW-1032) by brunota20 · Pull Request #41 · bleu/nullis-shepherd

brunota20 · 2026-06-18T15:41:01Z

What does this PR do?

Escalates the COW-1033 restart policy: when a module traps at least `PoisonPolicy.max_failures` times within a sliding `PoisonPolicy.window`, the supervisor marks it poisoned and stops all dispatch / restart attempts. Recovery is operator-driven only.

Seventh M4 issue landed.

Production thresholds

5 traps inside 10 minutes -> quarantine. Aggressive enough to catch a deterministically broken module without burning every restart slot from the COW-1033 backoff schedule; lenient enough that a one-off RPC blip during a real cow-api submit does not get a module quarantined.
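
As a rough illustration of how these thresholds might be expressed, here is a minimal sketch of the policy type and threshold check. The constant names, the struct name, and the `should_poison(policy, recent_failures) -> bool` shape come from this PR; the field types (`usize`, `Duration`) and the `>=` comparison are assumptions inferred from the description, not the actual source.

```rust
use std::time::Duration;

/// Production thresholds quoted in this PR: 5 traps inside 10 minutes.
pub const POISON_MAX_FAILURES: usize = 5;
pub const POISON_WINDOW: Duration = Duration::from_secs(600);

/// Field types are assumed; the PR only names the two fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PoisonPolicy {
    pub max_failures: usize,
    pub window: Duration,
}

impl PoisonPolicy {
    /// Constructor used by tests to tighten the policy.
    pub const fn new(max_failures: usize, window: Duration) -> Self {
        Self { max_failures, window }
    }
}

impl Default for PoisonPolicy {
    /// The default points at the production thresholds.
    fn default() -> Self {
        Self::new(POISON_MAX_FAILURES, POISON_WINDOW)
    }
}

/// `recent_failures` is the number of traps still inside the window.
/// Reaching `max_failures` is enough to quarantine (assumed `>=`).
pub fn should_poison(policy: &PoisonPolicy, recent_failures: usize) -> bool {
    recent_failures >= policy.max_failures
}
```

Under these assumptions, four traps inside the window leave a module running and the fifth quarantines it.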

Changes

`crates/nexum-engine/src/runtime/poison_policy.rs` (new): `POISON_MAX_FAILURES = 5`, `POISON_WINDOW = 600 s`, `PoisonPolicy` struct, `should_poison` helper, 2 unit tests.
`supervisor.rs`:
- `Supervisor` gains `poison_policy: PoisonPolicy` (production default; tests override via `with_poison_policy`).
- `LoadedModule` gains `failure_timestamps: VecDeque<Instant>` + `poisoned: bool`.
- New `record_failure_and_maybe_poison` free fn called from every trap arm. Prunes old entries, pushes current timestamp, flips `poisoned` + emits the metric on threshold cross.
- Restart sweep + dispatch fast-path both check `poisoned` first.
- New `poisoned_count()` accessor.
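
The prune / push / flip sequence described above could look roughly like the sketch below. `ModuleState` is a hypothetical stand-in for the two fields added to `LoadedModule`, the function signature (including the explicit `now` parameter and the `bool` return) is an assumption made for testability, and the metric + log emission is reduced to a comment.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Minimal copy of the policy shape; field types are assumed.
pub struct PoisonPolicy {
    pub max_failures: usize,
    pub window: Duration,
}

/// Hypothetical stand-in for the per-module fields added to `LoadedModule`.
#[derive(Default)]
pub struct ModuleState {
    pub failure_timestamps: VecDeque<Instant>,
    pub poisoned: bool,
}

/// Records one trap at `now`. Returns true only on the call that crosses
/// the threshold, which is where the real code would emit the gauge.
pub fn record_failure_and_maybe_poison(
    module: &mut ModuleState,
    policy: &PoisonPolicy,
    now: Instant,
) -> bool {
    // Prune entries that have slid out of the window. Timestamps are pushed
    // in order, so the oldest entry is always at the front.
    while let Some(&oldest) = module.failure_timestamps.front() {
        if now.saturating_duration_since(oldest) > policy.window {
            module.failure_timestamps.pop_front();
        } else {
            break;
        }
    }

    module.failure_timestamps.push_back(now);

    let crossed =
        !module.poisoned && module.failure_timestamps.len() >= policy.max_failures;
    if crossed {
        // Real code: set the gauge to 1 and emit the WARN log here.
        module.poisoned = true;
    }
    crossed
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside, is a design choice of this sketch: it lets the window logic be exercised without sleeping.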

New metric

`shepherd_module_poisoned{module}` gauge: flips to 1 on the first quarantine. Surfaces in Grafana so SRE can page on "module went terminal".

New integration test (real-time, ~3.5 s wall clock)

`poison_pill_quarantines_module_after_threshold` boots fuel-bomb (the always-trapping COW-1036 fixture) with a tight `PoisonPolicy::new(3, 60s)`:

Dispatch 1 → trap. failure_count=1, poisoned=0.
Sleep 1.1s, dispatch 2 → trap. failure_count=2, poisoned=0.
Sleep 2.1s, dispatch 3 → trap. failure_count=3. 3 failures inside the 60-s window crosses the threshold → poisoned=1.
Dispatch 4 (no wait) → returns 0, no restart attempt entered, no dispatch entered. Silently excluded.
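
The timeline above can be replayed without real sleeps by feeding simulated timestamps into a toy dispatch function. Everything in this sketch is hypothetical: `Module` stands in for the always-trapping fuel-bomb fixture, the return value is assumed to count modules entered, and the COW-1033 backoff gating that the sleeps wait out is not modelled.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the always-trapping fuel-bomb module.
struct Module {
    failures: VecDeque<Instant>,
    poisoned: bool,
}

/// Returns how many modules were entered for this dispatch (0 or 1 here).
fn dispatch(m: &mut Module, max_failures: usize, window: Duration, now: Instant) -> usize {
    if m.poisoned {
        return 0; // fast path: quarantined modules are skipped entirely
    }
    // The fixture always traps, so every entered dispatch records a failure.
    m.failures.retain(|&t| now.saturating_duration_since(t) <= window);
    m.failures.push_back(now);
    if m.failures.len() >= max_failures {
        m.poisoned = true;
    }
    1
}

/// Replays the four dispatches with the tight (3, 60 s) policy and returns
/// (modules entered, failure count, poisoned) after each one.
fn replay_timeline() -> Vec<(usize, usize, bool)> {
    let start = Instant::now();
    // Offsets match the sleeps in the test: +0 s, +1.1 s, +3.2 s, then no wait.
    let offsets_ms = [0u64, 1_100, 3_200, 3_200];
    let mut m = Module { failures: VecDeque::new(), poisoned: false };
    offsets_ms
        .iter()
        .map(|&ms| {
            let now = start + Duration::from_millis(ms);
            let entered = dispatch(&mut m, 3, Duration::from_secs(60), now);
            (entered, m.failures.len(), m.poisoned)
        })
        .collect()
}
```

In this model the failure count stops at 3 after the fourth dispatch, because a poisoned module never reaches the trap arm again.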

Out of scope

Operator-tunable thresholds via `engine.toml::[engine.poison]` (configurable in 0.3).
Auto-recovery via slow decay. The spec is explicit: poisoned modules need operator action.
Per-module poison policies. One workspace-wide threshold today.

Tests

`cargo test --workspace` → 161 host tests + 6 doctests passing (was 159 + 6).
`cargo clippy --all-targets --workspace -- -D warnings` clean.
`cargo fmt --all --check` clean.
All existing tests pass: `restart_flaky_module_recovers_after_backoff` (COW-1033) uses fail_first_n=1 with default policy (5 failures) - module recovers well before threshold. `resource_limit_dead_bomb_does_not_starve_healthy_module` (COW-1036) does 2 dispatches with default policy - no quarantine. Init-failed module path (COW-1070) unchanged.

AI assistance disclosure

AI Assistance: this change + description was produced by a Claude Code agent (Claude Opus 4.7 1M context). The agent designed the policy + sliding-window check, implemented the helper + dispatch integration, validated against the existing test suite, and authored this PR description. A human (Bruno) reviewed and is accountable for the result.

Linear: COW-1032. Stacks on #40 (COW-1071 WS reconnect).

Escalates the COW-1033 restart policy: when a module traps at least `PoisonPolicy.max_failures` times within a sliding `PoisonPolicy.window`, the supervisor marks it **poisoned**:

- Dispatch path skips poisoned modules forever (no further restart attempts, no fuel + RPC cost on no-ops).
- A WARN log emits the module name + last error class with a hint to remove it from `engine.toml::[[modules]]` + restart.
- `shepherd_module_poisoned{module}` gauge flips to 1.

Production thresholds: 5 traps inside 10 minutes -> quarantine. Aggressive enough to catch a deterministically broken module without burning every restart slot from the COW-1033 backoff schedule; lenient enough that a one-off RPC blip during a real cow-api submit does not get a module quarantined.

Recovery requires an operator action: remove the entry from `engine.toml::[[modules]]` + restart the engine. There is no automatic recovery on the production schedule; the assumption is that 5 traps inside 10 min is a structural failure, not a transient that would self-heal.

## New file

`crates/nexum-engine/src/runtime/poison_policy.rs`:

- `POISON_MAX_FAILURES = 5`, `POISON_WINDOW = 600 s` consts.
- `PoisonPolicy { max_failures, window }` struct with `Default` pointing at production + `::new(...)` for tests.
- `should_poison(policy, recent_failures) -> bool` helper.
- 2 unit tests covering the threshold edge cases.

## supervisor.rs changes

- `Supervisor` gains `poison_policy: PoisonPolicy` (defaults to production; tests override via `with_poison_policy`).
- `LoadedModule` gains `failure_timestamps: VecDeque<Instant>` + `poisoned: bool`.
- New free-function `record_failure_and_maybe_poison` is called from every trap arm in `dispatch_block` + `dispatch_log`. It prunes old entries beyond the window, pushes the current timestamp, and flips `poisoned = true` if the window holds >= `policy.max_failures` entries.
- Restart sweep + dispatch fast-path both check `poisoned` first, excluding quarantined modules from any further work.
- New `poisoned_count()` accessor for metrics + tests.

## New integration test

`poison_pill_quarantines_module_after_threshold` (real-time, ~3.5 s wall clock):

1. Boot fuel-bomb (always-trapping fixture from COW-1036) with a tight policy: `PoisonPolicy::new(3, Duration::from_secs(60))`.
2. Dispatch 1 -> trap. failure_count=1, next_attempt=+1s, poisoned=0.
3. Sleep 1.1s, dispatch 2 -> trap. failure_count=2, poisoned=0.
4. Sleep 2.1s, dispatch 3 -> trap. failure_count=3. **3 failures inside the 60-s window crosses the threshold -> poisoned=1.**
5. Dispatch 4 (no wait) -> returns 0, no restart attempt, no dispatch entered. The module is silently excluded.

## Workspace impact

- `cargo test --workspace` -> 161 host tests + 6 doctests passing (was 159 + 6; +2 from `poison_policy` units + 1 from the integration test).
- `cargo clippy --all-targets --workspace -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- All existing tests pass against the new dispatch shape: the `restart_flaky_module_recovers_after_backoff` test (COW-1033) uses fail_first_n=1 with the default production policy, so the module recovers well before the 5-trap threshold.
- `resource_limit_dead_bomb_does_not_starve_healthy_module` (COW-1036) dispatches the bomb twice with the default policy, well under 5 traps -> no quarantine.

## Out of scope

- Operator-tunable thresholds via `engine.toml::[engine.poison]`. The current constants live in `runtime::poison_policy`; configurable in 0.3.
- Auto-recovery via slow decay (e.g. "after 1 h of being poisoned, try one more time"). The spec is explicit: poisoned modules need operator action.
- Per-module poison policies. One workspace-wide threshold today.

Linear: COW-1032. Seventh M4 issue landed; stacks on #40 (COW-1071).

linear-code · 2026-06-18T15:41:05Z

COW-1032

brunota20 mentioned this pull request Jun 18, 2026

feat(event-loop+supervisor): graceful shutdown + last-block persistence (COW-1072) #42

Open

4 tasks

brunota20 wants to merge 1 commit into feat/ws-reconnect-cow-1071 from feat/poison-pill-cow-1032
