docs(ops): production deployment guide (COW-1030)#45
Open
brunota20 wants to merge 1 commit into
Open
Conversation
Adds `docs/production.md` — the operator handbook the
production-hardening milestone has been pointing at since M2.
Sister doc to `docs/06-production-hardening.md`: the existing
file is the architecture / design rationale (resource model,
restart policy, RPC resilience, logging + metrics design); this
new one is the concrete operator handbook (unit files, backup
recipes, alert rules, runbook procedures). Cross-referenced
both ways.
## Sections
1. **Pre-flight checklist** — every box you need ticked before
the first start: release-mode binary, persistent state dir,
metrics on loopback, paid RPC, Prometheus + log pipeline,
on-call runbook reference.
2. **systemd unit** — full `/etc/systemd/system/shepherd.service`
with: dedicated `shepherd` user, SIGINT for graceful
shutdown (30 s timeout — covers the COW-1072 last-block
persistence path), `NoNewPrivileges`/`ProtectSystem=strict`,
2 G memory cap (defence in depth on top of wasmtime's 64
MiB / module), restart-on-failure with 5 s backoff. Install
recipe + journalctl tail snippet.
3. **Docker Compose** — interim Dockerfile (multi-stage Rust
build + Debian slim runtime, non-root, EXPOSE 9100,
tini PID 1) + compose stack with bundled Prometheus, host
loopback port mapping only, `stop_signal: SIGINT`,
`stop_grace_period: 30s`, /metrics-based healthcheck.
Marked interim because the official Dockerfile is a
separate tracking issue.
4. **redb backup** — three operationally-supported paths:
cold backup (systemctl stop + cp + start; byte-identical
on graceful shutdown), hot backup (SIGSTOP + cp + SIGCONT
pattern, safe because the on-disk format is consistent at
any commit boundary), restore + `Database::check_integrity()`.
Honest about the redb 2.6 surface — no in-process snapshot
API today; flagged the roadmap. Retention policy:
7 daily / 4 weekly / 12 monthly = ~2.3 GiB on a 100 MiB
store.
5. **Logs** — JSON shape on stdout, two-tier retention model
(7 d hot full debug, 90 d cold INFO-only on S3 Glacier),
Vector config sample for journald -> Loki/hot +
journald -> S3/cold split. Sizing estimate based on the
E2E run shape (5 modules × 1 dispatch/12 s ≈ 200 MiB/wk
INFO+DEBUG combined).
6. **RPC selection** — provider plan recommendations
(Alchemy/Infura/QuickNode tiers), capacity sizing per
chain (1 block sub + N log subs + M `eth_call`/block where
M grows with TWAP active orders), why public nodes are
non-starters.
7. **Metrics + scraping** — complete metric surface table
from `grep metrics::counter/histogram/gauge`:
`shepherd_event_latency_seconds`,
`shepherd_module_errors_total`,
`shepherd_module_restarts_total`,
`shepherd_module_poisoned`,
`shepherd_chain_request_total`,
`shepherd_cow_api_submit_total`,
`shepherd_stream_reconnects_total`. Label set verified
against the source. 15 s scrape interval recommended.
8. **Workload-class tuning** — light indexer / TWAP-style
polling / multi-chain swarm classes with concrete
fuel + memory numbers. Honest that the limits are
compile-time constants today and per-module overrides via
`[engine.limits]` are a 0.3 follow-up — operator can
change `runtime/limits.rs` constants and rebuild, or
ensure load fits within defaults.
9. **Alert rules** — full `prometheus-rules.yml` covering
the seven alerts that map to the metric surface:
- `ShepherdModulePoisoned` (page) — production module
quarantined, needs operator action.
- `ShepherdModuleTraps` (ticket) — pre-poison signal.
- `ShepherdRpcErrorRate` (ticket) — > 5% RPC errors.
- `ShepherdReconnectStorm` (ticket) — WS flapping.
- `ShepherdCowApiErrorRate` (ticket) — > 20% orderbook
errors over 15 min.
- `ShepherdDispatchLatency` (ticket) — p95 > 5 s
sustained.
- `ShepherdDown` (page) — engine absent for 2 min.
10. **Operational runbook** — five common tasks:
tail-per-module, reset poisoned module, add module to
running deploy, inspect local-store, bump log level.
Each task carries a concrete shell snippet.
11. **Pre-upgrade checklist** — CHANGELOG read, cold backup,
stage binary, validate `supervisor ready modules=N
chains=M`, swap binary, restart, watch 5 min.
12. **References** — links to the architecture doc, ADRs,
runbooks, sister Linear issues.
## Drive-by fix
The COW-1064 e2e report template (committed in PR #44) had two
metric-label mistakes I caught while writing the metric
surface table in section 7:
- `result="ok|err"` -> `outcome="ok|err"` (the actual label
name in `cow_api.rs` + `chain.rs`).
- `reason="trap"` -> `error_kind="trap"` (the actual label
name in `supervisor.rs`).
Both labels appear in 4 places in the template (section 5
metric delta table + section 7 acceptance checklist). Fixed
in-place rather than as a separate PR.
## Workspace impact
- No code changes.
- No new build dependencies.
- `cargo fmt --all --check` clean.
Linear: COW-1030. Eleventh M4 issue landed; stacks on #44
(COW-1064).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
docs/production.md— the operator handbook theproduction-hardening milestone has been pointing at since M2.
Sister doc to
docs/06-production-hardening.md: that existingfile is the architecture / design rationale; this is the
concrete operator handbook (unit files, backup recipes, alert
rules, runbook procedures). Cross-referenced both ways so an
operator reading either lands on the right next step.
Stacks on #44 (COW-1064 E2E scaffold). Linear: COW-1030.
Sections (12 total)
dir, metrics on loopback, paid RPC, observability pipeline.
shepherd.servicewith dedicateduser, SIGINT for graceful shutdown (30 s timeout covers
the COW-1072 last-block-persistence path), filesystem
hardening, 2 G memory cap (defence in depth over wasmtime's
per-module 64 MiB).
with bundled Prometheus, loopback-only port mapping,
stop_signal: SIGINT,/metrics-based healthcheck. Markedinterim because the official Dockerfile is tracked
separately.
(cold/hot/restore + integrity check) with the redb 2.6
limitations stated honestly. Retention policy: 7 daily / 4
weekly / 12 monthly ≈ 2.3 GiB on a 100 MiB store.
full debug + 90 d cold INFO-only on S3 Glacier), Vector
config sample for the split. Sizing based on the E2E run
shape.
capacity sizing, why public nodes are non-starters.
verified against
grep metrics::(counter|histogram|gauge)!:shepherd_event_latency_seconds{module, event_kind}shepherd_module_errors_total{module, error_kind}shepherd_module_restarts_total{module}shepherd_module_poisoned{module}shepherd_chain_request_total{chain_id, method, outcome}shepherd_cow_api_submit_total{chain_id, outcome}shepherd_stream_reconnects_total{kind, chain_id, module?}multi-chain swarm classes with concrete fuel + memory
numbers. Honest that limits are compile-time constants
today and per-module overrides via
[engine.limits]are a0.3 follow-up.
prometheus-rules.yml(7 alerts,page vs ticket severity).
ShepherdModulePoisoned+ShepherdDownpage; the rest ticket.reset poisoned module, add module to running deploy,
inspect local-store, bump log level. Each carries a
concrete shell snippet.
promoting.
runbooks, sister Linear issues.
Drive-by fix
The COW-1064 e2e report template (committed in #44) had two
metric-label mistakes I caught while writing this PR's section
7 metric surface table:
result=\"ok|err\"->outcome=\"ok|err\"(the actual labelname in
cow_api.rs+chain.rs).reason=\"trap\"->error_kind=\"trap\"(the actual labelname in
supervisor.rs).Both labels appear in 4 places in the template; fixed
in-place rather than as a separate PR.
Workspace impact
cargo fmt --all --checkclean.AI assistance disclosure
Claude (Opus 4.7, 1M-context) drafted
docs/production.mdworking from: the issue's spec (systemd / Docker Compose /
redb backup / log retention / workload tuning / alert
templates), the existing
docs/06-production-hardening.mdshape, the actual metric surface (verified via
grep metrics::(counter|histogram|gauge)!over the engine crate),the actual runtime constants (
DEFAULT_FUEL_PER_EVENT,DEFAULT_MEMORY_LIMIT,POISON_MAX_FAILURES,POISON_WINDOW,RESTART_MAX_BACKOFF), and the existingtestnet runbooks for shape consistency. I reviewed every
section, verified each shell snippet would work as written
(systemd directives, docker compose schema, journalctl + jq
filter syntax, redb integrity API), and the drive-by metric-
label fix in the e2e report template came from cross-checking
the table against the source.