Skip to content

docs(ops): production deployment guide (COW-1030)#45

Open
brunota20 wants to merge 1 commit into
feat/e2e-testnet-cow-1064from
feat/deployment-guide-cow-1030
Open

docs(ops): production deployment guide (COW-1030)#45
brunota20 wants to merge 1 commit into
feat/e2e-testnet-cow-1064from
feat/deployment-guide-cow-1030

Conversation

@brunota20

Copy link
Copy Markdown
Collaborator

Summary

Adds docs/production.md — the operator handbook the
production-hardening milestone has been pointing at since M2.
Sister doc to docs/06-production-hardening.md: that existing
file is the architecture / design rationale; this is the
concrete operator handbook (unit files, backup recipes, alert
rules, runbook procedures). Cross-referenced both ways so an
operator reading either lands on the right next step.

Stacks on #44 (COW-1064 E2E scaffold). Linear: COW-1030.

Sections (12 total)

  1. Pre-flight checklist — release binary, persistent state
    dir, metrics on loopback, paid RPC, observability pipeline.
  2. systemd unit — full shepherd.service with dedicated
    user, SIGINT for graceful shutdown (30 s timeout covers
    the COW-1072 last-block-persistence path), filesystem
    hardening, 2 G memory cap (defence in depth over wasmtime's
    per-module 64 MiB).
  3. Docker Compose — interim Dockerfile + compose stack
    with bundled Prometheus, loopback-only port mapping,
    stop_signal: SIGINT, /metrics-based healthcheck. Marked
    interim because the official Dockerfile is tracked
    separately.
  4. redb backup — three operationally-supported paths
    (cold/hot/restore + integrity check) with the redb 2.6
    limitations stated honestly. Retention policy: 7 daily / 4
    weekly / 12 monthly ≈ 2.3 GiB on a 100 MiB store.
  5. Logs — JSON shape, two-tier retention model (7 d hot
    full debug + 90 d cold INFO-only on S3 Glacier), Vector
    config sample for the split. Sizing based on the E2E run
    shape.
  6. RPC selection — provider tier recommendations, per-chain
    capacity sizing, why public nodes are non-starters.
  7. Metrics + scraping — complete metric surface table
    verified against grep metrics::(counter|histogram|gauge)!:
    • shepherd_event_latency_seconds{module, event_kind}
    • shepherd_module_errors_total{module, error_kind}
    • shepherd_module_restarts_total{module}
    • shepherd_module_poisoned{module}
    • shepherd_chain_request_total{chain_id, method, outcome}
    • shepherd_cow_api_submit_total{chain_id, outcome}
    • shepherd_stream_reconnects_total{kind, chain_id, module?}
  8. Workload-class tuning — light indexer / TWAP-style /
    multi-chain swarm classes with concrete fuel + memory
    numbers. Honest that limits are compile-time constants
    today and per-module overrides via [engine.limits] are a
    0.3 follow-up.
  9. Alert rules — full prometheus-rules.yml (7 alerts,
    page vs ticket severity). ShepherdModulePoisoned +
    ShepherdDown page; the rest ticket.
  10. Operational runbook — five common tasks: tail-per-module,
    reset poisoned module, add module to running deploy,
    inspect local-store, bump log level. Each carries a
    concrete shell snippet.
  11. Pre-upgrade checklist — staged-binary smoke before
    promoting.
  12. References — links to the architecture doc, ADRs,
    runbooks, sister Linear issues.

Drive-by fix

The COW-1064 e2e report template (committed in #44) had two
metric-label mistakes I caught while writing this PR's section
7 metric surface table:

  • result=\"ok|err\" -> outcome=\"ok|err\" (the actual label
    name in cow_api.rs + chain.rs).
  • reason=\"trap\" -> error_kind=\"trap\" (the actual label
    name in supervisor.rs).

Both labels appear in 4 places in the template; fixed
in-place rather than as a separate PR.

Workspace impact

  • No code changes (pure documentation).
  • No new build dependencies.
  • cargo fmt --all --check clean.

AI assistance disclosure

Claude (Opus 4.7, 1M-context) drafted docs/production.md
working from: the issue's spec (systemd / Docker Compose /
redb backup / log retention / workload tuning / alert
templates), the existing docs/06-production-hardening.md
shape, the actual metric surface (verified via grep metrics::(counter|histogram|gauge)! over the engine crate),
the actual runtime constants (DEFAULT_FUEL_PER_EVENT,
DEFAULT_MEMORY_LIMIT, POISON_MAX_FAILURES,
POISON_WINDOW, RESTART_MAX_BACKOFF), and the existing
testnet runbooks for shape consistency. I reviewed every
section, verified each shell snippet would work as written
(systemd directives, docker compose schema, journalctl + jq
filter syntax, redb integrity API), and the drive-by metric-
label fix in the e2e report template came from cross-checking
the table against the source.

Adds `docs/production.md` — the operator handbook the
production-hardening milestone has been pointing at since M2.
Sister doc to `docs/06-production-hardening.md`: the existing
file is the architecture / design rationale (resource model,
restart policy, RPC resilience, logging + metrics design); this
new one is the concrete operator handbook (unit files, backup
recipes, alert rules, runbook procedures). Cross-referenced
both ways.

## Sections

1. **Pre-flight checklist** — every box you need ticked before
   the first start: release-mode binary, persistent state dir,
   metrics on loopback, paid RPC, Prometheus + log pipeline,
   on-call runbook reference.
2. **systemd unit** — full `/etc/systemd/system/shepherd.service`
   with: dedicated `shepherd` user, SIGINT for graceful
   shutdown (30 s timeout — covers the COW-1072 last-block
   persistence path), `NoNewPrivileges`/`ProtectSystem=strict`,
   2 G memory cap (defence in depth on top of wasmtime's 64
   MiB / module), restart-on-failure with 5 s backoff. Install
   recipe + journalctl tail snippet.
3. **Docker Compose** — interim Dockerfile (multi-stage Rust
   build + Debian slim runtime, non-root, EXPOSE 9100,
   tini PID 1) + compose stack with bundled Prometheus, host
   loopback port mapping only, `stop_signal: SIGINT`,
   `stop_grace_period: 30s`, /metrics-based healthcheck.
   Marked interim because the official Dockerfile is a
   separate tracking issue.
4. **redb backup** — three operationally-supported paths:
   cold backup (systemctl stop + cp + start; byte-identical
   on graceful shutdown), hot backup (SIGSTOP + cp + SIGCONT
   pattern, safe because the on-disk format is consistent at
   any commit boundary), restore + `Database::check_integrity()`.
   Honest about the redb 2.6 surface — no in-process snapshot
   API today; flagged the roadmap. Retention policy:
   7 daily / 4 weekly / 12 monthly = ~2.3 GiB on a 100 MiB
   store.
5. **Logs** — JSON shape on stdout, two-tier retention model
   (7 d hot full debug, 90 d cold INFO-only on S3 Glacier),
   Vector config sample for journald -> Loki/hot +
   journald -> S3/cold split. Sizing estimate based on the
   E2E run shape (5 modules × 1 dispatch/12 s ≈ 200 MiB/wk
   INFO+DEBUG combined).
6. **RPC selection** — provider plan recommendations
   (Alchemy/Infura/QuickNode tiers), capacity sizing per
   chain (1 block sub + N log subs + M `eth_call`/block where
   M grows with TWAP active orders), why public nodes are
   non-starters.
7. **Metrics + scraping** — complete metric surface table
   from `grep metrics::counter/histogram/gauge`:
   `shepherd_event_latency_seconds`,
   `shepherd_module_errors_total`,
   `shepherd_module_restarts_total`,
   `shepherd_module_poisoned`,
   `shepherd_chain_request_total`,
   `shepherd_cow_api_submit_total`,
   `shepherd_stream_reconnects_total`. Label set verified
   against the source. 15 s scrape interval recommended.
8. **Workload-class tuning** — light indexer / TWAP-style
   polling / multi-chain swarm classes with concrete
   fuel + memory numbers. Honest that the limits are
   compile-time constants today and per-module overrides via
   `[engine.limits]` are a 0.3 follow-up — operator can
   change `runtime/limits.rs` constants and rebuild, or
   ensure load fits within defaults.
9. **Alert rules** — full `prometheus-rules.yml` covering
   the seven alerts that map to the metric surface:
   - `ShepherdModulePoisoned` (page) — production module
     quarantined, needs operator action.
   - `ShepherdModuleTraps` (ticket) — pre-poison signal.
   - `ShepherdRpcErrorRate` (ticket) — > 5% RPC errors.
   - `ShepherdReconnectStorm` (ticket) — WS flapping.
   - `ShepherdCowApiErrorRate` (ticket) — > 20% orderbook
     errors over 15 min.
   - `ShepherdDispatchLatency` (ticket) — p95 > 5 s
     sustained.
   - `ShepherdDown` (page) — engine absent for 2 min.
10. **Operational runbook** — five common tasks:
    tail-per-module, reset poisoned module, add module to
    running deploy, inspect local-store, bump log level.
    Each task carries a concrete shell snippet.
11. **Pre-upgrade checklist** — CHANGELOG read, cold backup,
    stage binary, validate `supervisor ready modules=N
    chains=M`, swap binary, restart, watch 5 min.
12. **References** — links to the architecture doc, ADRs,
    runbooks, sister Linear issues.

## Drive-by fix

The COW-1064 e2e report template (committed in PR #44) had two
metric-label mistakes I caught while writing the metric
surface table in section 7:

- `result="ok|err"` -> `outcome="ok|err"` (the actual label
  name in `cow_api.rs` + `chain.rs`).
- `reason="trap"` -> `error_kind="trap"` (the actual label
  name in `supervisor.rs`).

Both labels appear in 4 places in the template (section 5
metric delta table + section 7 acceptance checklist). Fixed
in-place rather than as a separate PR.

## Workspace impact

- No code changes.
- No new build dependencies.
- `cargo fmt --all --check` clean.

Linear: COW-1030. Eleventh M4 issue landed; stacks on #44
(COW-1064).
@linear-code

linear-code Bot commented Jun 18, 2026

Copy link
Copy Markdown

COW-1030

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant