docs(ops): production deployment guide (COW-1030) by brunota20 · Pull Request #45 · bleu/nullis-shepherd

brunota20 · 2026-06-18T16:39:36Z

Summary

Adds docs/production.md — the operator handbook the
production-hardening milestone has been pointing at since M2.
Sister doc to docs/06-production-hardening.md: that existing
file is the architecture / design rationale; this is the
concrete operator handbook (unit files, backup recipes, alert
rules, runbook procedures). Cross-referenced both ways so an
operator reading either lands on the right next step.

Stacks on #44 (COW-1064 E2E scaffold). Linear: COW-1030.

Sections (12 total)

Pre-flight checklist — release binary, persistent state
dir, metrics on loopback, paid RPC, observability pipeline.
systemd unit — full shepherd.service with dedicated
user, SIGINT for graceful shutdown (30 s timeout covers
the COW-1072 last-block-persistence path), filesystem
hardening, 2 G memory cap (defence in depth over wasmtime's
per-module 64 MiB).
Docker Compose — interim Dockerfile + compose stack
with bundled Prometheus, loopback-only port mapping,
stop_signal: SIGINT, /metrics-based healthcheck. Marked
interim because the official Dockerfile is tracked
separately.
redb backup — three operationally-supported paths
(cold/hot/restore + integrity check) with the redb 2.6
limitations stated honestly. Retention policy: 7 daily / 4
weekly / 12 monthly ≈ 2.3 GiB on a 100 MiB store.
Logs — JSON shape, two-tier retention model (7 d hot
full debug + 90 d cold INFO-only on S3 Glacier), Vector
config sample for the split. Sizing based on the E2E run
shape.
RPC selection — provider tier recommendations, per-chain
capacity sizing, why public nodes are non-starters.
Metrics + scraping — complete metric surface table
verified against grep metrics::(counter|histogram|gauge)!:
- shepherd_event_latency_seconds{module, event_kind}
- shepherd_module_errors_total{module, error_kind}
- shepherd_module_restarts_total{module}
- shepherd_module_poisoned{module}
- shepherd_chain_request_total{chain_id, method, outcome}
- shepherd_cow_api_submit_total{chain_id, outcome}
- shepherd_stream_reconnects_total{kind, chain_id, module?}
Workload-class tuning — light indexer / TWAP-style /
multi-chain swarm classes with concrete fuel + memory
numbers. Honest that limits are compile-time constants
today and per-module overrides via [engine.limits] are a
0.3 follow-up.
Alert rules — full prometheus-rules.yml (7 alerts,
page vs ticket severity). ShepherdModulePoisoned +
ShepherdDown page; the rest ticket.
Operational runbook — five common tasks: tail-per-module,
reset poisoned module, add module to running deploy,
inspect local-store, bump log level. Each carries a
concrete shell snippet.
Pre-upgrade checklist — staged-binary smoke before
promoting.
References — links to the architecture doc, ADRs,
runbooks, sister Linear issues.

Drive-by fix

The COW-1064 e2e report template (committed in #44) had two
metric-label mistakes I caught while writing this PR's section
7 metric surface table:

result=\"ok|err\" -> outcome=\"ok|err\" (the actual label
name in cow_api.rs + chain.rs).
reason=\"trap\" -> error_kind=\"trap\" (the actual label
name in supervisor.rs).

Both labels appear in 4 places in the template; fixed
in-place rather than as a separate PR.

Workspace impact

No code changes (pure documentation).
No new build dependencies.
cargo fmt --all --check clean.

AI assistance disclosure

Claude (Opus 4.7, 1M-context) drafted docs/production.md
working from: the issue's spec (systemd / Docker Compose /
redb backup / log retention / workload tuning / alert
templates), the existing docs/06-production-hardening.md
shape, the actual metric surface (verified via grep metrics::(counter|histogram|gauge)! over the engine crate),
the actual runtime constants (DEFAULT_FUEL_PER_EVENT,
DEFAULT_MEMORY_LIMIT, POISON_MAX_FAILURES,
POISON_WINDOW, RESTART_MAX_BACKOFF), and the existing
testnet runbooks for shape consistency. I reviewed every
section, verified each shell snippet would work as written
(systemd directives, docker compose schema, journalctl + jq
filter syntax, redb integrity API), and the drive-by metric-
label fix in the e2e report template came from cross-checking
the table against the source.

Adds `docs/production.md` — the operator handbook the production-hardening milestone has been pointing at since M2. Sister doc to `docs/06-production-hardening.md`: the existing file is the architecture / design rationale (resource model, restart policy, RPC resilience, logging + metrics design); this new one is the concrete operator handbook (unit files, backup recipes, alert rules, runbook procedures). Cross-referenced both ways. ## Sections 1. **Pre-flight checklist** — every box you need ticked before the first start: release-mode binary, persistent state dir, metrics on loopback, paid RPC, Prometheus + log pipeline, on-call runbook reference. 2. **systemd unit** — full `/etc/systemd/system/shepherd.service` with: dedicated `shepherd` user, SIGINT for graceful shutdown (30 s timeout — covers the COW-1072 last-block persistence path), `NoNewPrivileges`/`ProtectSystem=strict`, 2 G memory cap (defence in depth on top of wasmtime's 64 MiB / module), restart-on-failure with 5 s backoff. Install recipe + journalctl tail snippet. 3. **Docker Compose** — interim Dockerfile (multi-stage Rust build + Debian slim runtime, non-root, EXPOSE 9100, tini PID 1) + compose stack with bundled Prometheus, host loopback port mapping only, `stop_signal: SIGINT`, `stop_grace_period: 30s`, /metrics-based healthcheck. Marked interim because the official Dockerfile is a separate tracking issue. 4. **redb backup** — three operationally-supported paths: cold backup (systemctl stop + cp + start; byte-identical on graceful shutdown), hot backup (SIGSTOP + cp + SIGCONT pattern, safe because the on-disk format is consistent at any commit boundary), restore + `Database::check_integrity()`. Honest about the redb 2.6 surface — no in-process snapshot API today; flagged the roadmap. Retention policy: 7 daily / 4 weekly / 12 monthly = ~2.3 GiB on a 100 MiB store. 5. **Logs** — JSON shape on stdout, two-tier retention model (7 d hot full debug, 90 d cold INFO-only on S3 Glacier), Vector config sample for journald -> Loki/hot + journald -> S3/cold split. Sizing estimate based on the E2E run shape (5 modules × 1 dispatch/12 s ≈ 200 MiB/wk INFO+DEBUG combined). 6. **RPC selection** — provider plan recommendations (Alchemy/Infura/QuickNode tiers), capacity sizing per chain (1 block sub + N log subs + M `eth_call`/block where M grows with TWAP active orders), why public nodes are non-starters. 7. **Metrics + scraping** — complete metric surface table from `grep metrics::counter/histogram/gauge`: `shepherd_event_latency_seconds`, `shepherd_module_errors_total`, `shepherd_module_restarts_total`, `shepherd_module_poisoned`, `shepherd_chain_request_total`, `shepherd_cow_api_submit_total`, `shepherd_stream_reconnects_total`. Label set verified against the source. 15 s scrape interval recommended. 8. **Workload-class tuning** — light indexer / TWAP-style polling / multi-chain swarm classes with concrete fuel + memory numbers. Honest that the limits are compile-time constants today and per-module overrides via `[engine.limits]` are a 0.3 follow-up — operator can change `runtime/limits.rs` constants and rebuild, or ensure load fits within defaults. 9. **Alert rules** — full `prometheus-rules.yml` covering the seven alerts that map to the metric surface: - `ShepherdModulePoisoned` (page) — production module quarantined, needs operator action. - `ShepherdModuleTraps` (ticket) — pre-poison signal. - `ShepherdRpcErrorRate` (ticket) — > 5% RPC errors. - `ShepherdReconnectStorm` (ticket) — WS flapping. - `ShepherdCowApiErrorRate` (ticket) — > 20% orderbook errors over 15 min. - `ShepherdDispatchLatency` (ticket) — p95 > 5 s sustained. - `ShepherdDown` (page) — engine absent for 2 min. 10. **Operational runbook** — five common tasks: tail-per-module, reset poisoned module, add module to running deploy, inspect local-store, bump log level. Each task carries a concrete shell snippet. 11. **Pre-upgrade checklist** — CHANGELOG read, cold backup, stage binary, validate `supervisor ready modules=N chains=M`, swap binary, restart, watch 5 min. 12. **References** — links to the architecture doc, ADRs, runbooks, sister Linear issues. ## Drive-by fix The COW-1064 e2e report template (committed in PR #44) had two metric-label mistakes I caught while writing the metric surface table in section 7: - `result="ok|err"` -> `outcome="ok|err"` (the actual label name in `cow_api.rs` + `chain.rs`). - `reason="trap"` -> `error_kind="trap"` (the actual label name in `supervisor.rs`). Both labels appear in 4 places in the template (section 5 metric delta table + section 7 acceptance checklist). Fixed in-place rather than as a separate PR. ## Workspace impact - No code changes. - No new build dependencies. - `cargo fmt --all --check` clean. Linear: COW-1030. Eleventh M4 issue landed; stacks on #44 (COW-1064).

linear-code · 2026-06-18T16:39:41Z

COW-1030

brunota20 mentioned this pull request Jun 18, 2026

ops(e2e): pin module configs + run-prep punch list for COW-1064 dry run #46

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(ops): production deployment guide (COW-1030)#45

docs(ops): production deployment guide (COW-1030)#45
brunota20 wants to merge 1 commit into
feat/e2e-testnet-cow-1064from
feat/deployment-guide-cow-1030

brunota20 commented Jun 18, 2026

Uh oh!

linear-code Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brunota20 commented Jun 18, 2026

Summary

Sections (12 total)

Drive-by fix

Workspace impact

AI assistance disclosure

Uh oh!

linear-code Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant