feat(supervisor): poison-pill detection + quarantine (COW-1032)#41
Open
brunota20 wants to merge 1 commit into
Open
feat(supervisor): poison-pill detection + quarantine (COW-1032)#41brunota20 wants to merge 1 commit into
brunota20 wants to merge 1 commit into
Conversation
Escalates the COW-1033 restart policy: when a module traps more
than `PoisonPolicy.max_failures` times within a sliding
`PoisonPolicy.window`, the supervisor marks it **poisoned**:
- Dispatch path skips poisoned modules forever (no further restart
attempts, no fuel + RPC cost on no-ops).
- A WARN log emits the module name + last error class with a hint
to remove it from `engine.toml::[[modules]]` + restart.
- `shepherd_module_poisoned{module}` gauge flips to 1.
Production thresholds: 5 traps inside 10 minutes -> quarantine.
Aggressive enough to catch a deterministically broken module
without burning every restart slot from the COW-1033 backoff
schedule; lenient enough that a one-off RPC blip during a real
cow-api submit does not get a module quarantined.
Recovery requires an operator action: remove the entry from
`engine.toml::[[modules]]` + restart the engine. There is no
automatic recovery on the production schedule; the assumption is
that 5 traps inside 10 min is a structural failure, not a
transient that would self-heal.
## New file
`crates/nexum-engine/src/runtime/poison_policy.rs`:
- `POISON_MAX_FAILURES = 5`, `POISON_WINDOW = 600 s` consts.
- `PoisonPolicy { max_failures, window }` struct with `Default`
pointing at production + `::new(...)` for tests.
- `should_poison(policy, recent_failures) -> bool` helper.
- 2 unit tests covering the threshold edge cases.
## supervisor.rs changes
- `Supervisor` gains `poison_policy: PoisonPolicy` (defaults to
production; tests override via `with_poison_policy`).
- `LoadedModule` gains `failure_timestamps: VecDeque<Instant>` +
`poisoned: bool`.
- New free-function `record_failure_and_maybe_poison` is called
from every trap arm in `dispatch_block` + `dispatch_log`. It
prunes old entries beyond the window, pushes the current
timestamp, and flips `poisoned = true` if the window holds
>= `policy.max_failures` entries.
- Restart sweep + dispatch fast-path both check `poisoned` first,
excluding quarantined modules from any further work.
- New `poisoned_count()` accessor for metrics + tests.
## New integration test
`poison_pill_quarantines_module_after_threshold` (real-time,
~3.5 s wall clock):
1. Boot fuel-bomb (always-trapping fixture from COW-1036) with a
tight policy: `PoisonPolicy::new(3, Duration::from_secs(60))`.
2. Dispatch 1 -> trap. failure_count=1, next_attempt=+1s, poisoned=0.
3. Sleep 1.1s, dispatch 2 -> trap. failure_count=2, poisoned=0.
4. Sleep 2.1s, dispatch 3 -> trap. failure_count=3. **3 failures
inside the 60-s window crosses the threshold -> poisoned=1.**
5. Dispatch 4 (no wait) -> returns 0, no restart attempt, no
dispatch entered. The module is silently excluded.
## Workspace impact
- `cargo test --workspace` -> 161 host tests + 6 doctests passing
(was 159 + 6; +2 from `poison_policy` units + 1 from the
integration test).
- `cargo clippy --all-targets --workspace -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- All existing tests pass against the new dispatch shape: the
`restart_flaky_module_recovers_after_backoff` test (COW-1033)
uses fail_first_n=1 with the default production policy, so the
module recovers well before the 5-trap threshold.
- `resource_limit_dead_bomb_does_not_starve_healthy_module`
(COW-1036) dispatches the bomb twice; both with the default
policy, well under 5 traps -> no quarantine.
## Out of scope
- Operator-tunable thresholds via `engine.toml::[engine.poison]`.
The current constants live in `runtime::poison_policy`;
configurable in 0.3.
- Auto-recovery via slow decay (e.g. "after 1 h of being
poisoned, try one more time"). The spec is explicit: poisoned
modules need operator action.
- Per-module poison policies. One workspace-wide threshold today.
Linear: COW-1032. Seventh M4 issue landed; stacks on #40 (COW-1071).
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Escalates the COW-1033 restart policy: when a module traps more than `PoisonPolicy.max_failures` times within a sliding `PoisonPolicy.window`, the supervisor marks it poisoned and stops all dispatch / restart attempts. Recovery is operator-driven only.
Seventh M4 issue landed.
Production thresholds
5 traps inside 10 minutes -> quarantine. Aggressive enough to catch a deterministically broken module without burning every restart slot from the COW-1033 backoff schedule; lenient enough that a one-off RPC blip during a real cow-api submit does not get a module quarantined.
Changes
New metric
`shepherd_module_poisoned{module}` gauge: flips to 1 on the first quarantine. Surfaces in Grafana so SRE can page on "module went terminal".
New integration test (real-time, ~3.5 s wall clock)
`poison_pill_quarantines_module_after_threshold` boots fuel-bomb (the always-trapping COW-1036 fixture) with a tight `PoisonPolicy::new(3, 60s)`:
Out of scope
Tests
AI assistance disclosure
AI Assistance: this change + description was produced by a Claude Code agent (Claude Opus 4.7 1M context). The agent designed the policy + sliding-window check, implemented the helper + dispatch integration, validated against the existing test suite, and authored this PR description. A human (Bruno) reviewed and is accountable for the result.
Linear: COW-1032. Stacks on #40 (COW-1071 WS reconnect).