4 changes: 4 additions & 0 deletions .gitignore
@@ -16,6 +16,10 @@ __pycache__/
/benchmarks/
/.watchdog-cm-work/

# Watchdog runtime metrics written by tick (not test golden fixtures)
**/status.prom
**/status.prom.tmp

# Local CM / demo scratch (not part of the repo)
/input-*-output-*.bin
docs/live-demo.md
54 changes: 51 additions & 3 deletions docs/watchdog/README.md
@@ -108,12 +108,13 @@ Lua modules:
- `sequencer_reader.lua`: sequencer HTTP client (`GET /finalized_state/inclusion_block`, `GET /finalized_state`).
- `compare.lua`: raw byte comparison.
- `checkpoint.lua`: manifest-backed checkpoint persistence (`head.json` pointer).
- `state.lua`: persisted `config.json` and single-run state lock.
- `state.lua`: persisted `config.json`, atomic file writes, single-run state lock.
- `metrics.lua`: Prometheus textfile (`status.prom`) built and written each tick.
- `retry.lua`: bounded retry helper used by the runtime.
- `runner.lua`: one compare cycle — cheap `/finalized_state/inclusion_block`
poll, then (when finalized advanced) L1 fetch, CM replay, SSZ compare,
checkpoint write.
- `main.lua`: dispatches `init` and `tick`; `tick` exits `0`/`1`/`2`.
- `main.lua`: dispatches `init` and `tick`; `tick` exits `0`/`1`/`2` and writes `status.prom`.

The L1 reader follows the Rust partition strategy from
`sequencer/src/l1/partition.rs`: if an RPC provider rejects a large range, the
@@ -152,6 +153,7 @@ inputs.
state_dir/
config.json
head.json
status.prom # Prometheus textfile from the last tick (see Metrics below)
run.lock # advisory lock handle; file existence is not lock state
checkpoints/
00000000000001234567/
@@ -195,18 +197,64 @@ host scheduling should provide the same non-overlap guarantee. Each tick:
then inspects with query `state`.
5. Byte-compares the SSZ report against `GET /finalized_state`; on match writes a
new checkpoint, on mismatch emits a `watchdog_event` and exits `2`.
6. Atomically writes `$CARTESI_WATCHDOG_STATE_DIR/status.prom` (or the path in
`CARTESI_WATCHDOG_METRICS_FILE`, when set) before exit.

Runtime knobs:

- `CARTESI_WATCHDOG_BLOCKCHAIN_HTTP_ENDPOINT`: current L1 JSON-RPC endpoint for tick.
- `CARTESI_WATCHDOG_BLOCKCHAIN_ID`: optional chain id label persisted at `init` for `status.prom`.
- `CARTESI_WATCHDOG_METRICS_FILE`: optional override for the Prometheus textfile path (default `$CARTESI_WATCHDOG_STATE_DIR/status.prom`).
- `CARTESI_WATCHDOG_RETRY_ATTEMPTS`: bounded retry attempts per run, default `3`.
- `CARTESI_WATCHDOG_RETRY_DELAY_SEC`: delay between retry attempts, default `5`.
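
The path default and the atomic write from step 6 can be illustrated in shell. This is a minimal sketch, not the code in `metrics.lua` or `state.lua`: the function names are hypothetical, and the write-to-`.tmp`-then-rename approach is an assumption inferred from the `status.prom.tmp` entry in `.gitignore`.

```bash
# Hypothetical helpers for illustration; the real logic lives in the Lua modules.

# Resolve the textfile path: an explicit override wins, otherwise fall back
# to status.prom inside the state directory.
metrics_path() {
    printf '%s\n' "${CARTESI_WATCHDOG_METRICS_FILE:-${CARTESI_WATCHDOG_STATE_DIR}/status.prom}"
}

# Write stdin to "$1" atomically: fill a sibling temp file first, then rename
# it over the target so a reader never observes a half-written file.
atomic_write() {
    target=$1
    tmp="${target}.tmp"
    cat > "$tmp" && mv -f "$tmp" "$target"
}
```

A rename within one filesystem replaces the target in a single step, which is why the temp file sits next to the target rather than in `/tmp`.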

## Metrics (`status.prom`)

Each `tick` writes a [Prometheus textfile](https://github.com/prometheus/node_exporter#textfile-collector)
before exiting. Operators scrape or push it from their side — the watchdog does
not run an HTTP server.

| Exit code | `state` label | Meaning |
|-----------|---------------|---------|
| `0` | `ok` | Compare passed, or idle (finalized unchanged) |
| `1` | `warning` | Transient failure after retries |
| `2` | `failed` | Deterministic divergence |

Gauges (labels `chain`, `app_address` on every series):

- `cartesi_watchdog_status{state="ok|warning|failed"}` — exactly one series is `1`
- `cartesi_watchdog_last_tick_unix_seconds`
- `cartesi_watchdog_exit_code`
- `cartesi_watchdog_divergence_info{kind}` — only on exit `2`

Set `CARTESI_WATCHDOG_BLOCKCHAIN_ID` at `init` for the `chain` label (defaults to
`unknown`). Golden fixtures: [`tests/fixtures/watchdog_status_ok.prom`](../../tests/fixtures/watchdog_status_ok.prom),
[`tests/fixtures/watchdog_status_failed.prom`](../../tests/fixtures/watchdog_status_failed.prom).

Example after a clean tick:

```prometheus
cartesi_watchdog_status{chain="11155111",app_address="0x4CE...",state="ok"} 1
cartesi_watchdog_status{chain="11155111",app_address="0x4CE...",state="warning"} 0
cartesi_watchdog_status{chain="11155111",app_address="0x4CE...",state="failed"} 0
cartesi_watchdog_last_tick_unix_seconds{chain="11155111",app_address="0x4CE..."} 1717420800
cartesi_watchdog_exit_code{chain="11155111",app_address="0x4CE..."} 0
```
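
The exit-code-to-`state` mapping from the table above can be sketched as a small renderer. This is an illustration under assumptions, not the output logic of `metrics.lua`: the function names are hypothetical, and the sketch omits the `# HELP` / `# TYPE` lines and the other gauges.

```bash
# Hypothetical sketch; not the watchdog's metrics.lua.

# Map a tick exit code to the `state` label. Unknown codes fail.
state_for_exit() {
    case "$1" in
        0) echo ok ;;
        1) echo warning ;;
        2) echo failed ;;
        *) return 1 ;;
    esac
}

# Print the three status series; exactly one carries the value 1.
# Args: exit_code chain app_address
render_status() {
    active=$(state_for_exit "$1") || return 1
    for s in ok warning failed; do
        if [ "$s" = "$active" ]; then v=1; else v=0; fi
        printf 'cartesi_watchdog_status{chain="%s",app_address="%s",state="%s"} %s\n' \
            "$2" "$3" "$s" "$v"
    done
}
```

Emitting all three series on every tick, with `0` for the inactive states, keeps each series continuous, so an alert on `state="failed"` also resolves once a later tick writes `0`.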

Example Prometheus alert (pull or push gateway — operator choice):

```promql
cartesi_watchdog_status{state="failed"} == 1
```
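
Because the file is only rewritten at the end of a tick, a stalled scheduler leaves the previous values in place. A staleness check on `cartesi_watchdog_last_tick_unix_seconds` covers that case. The helper below is a hypothetical sketch that reads the gauge straight from the textfile. It assumes label values contain no spaces.

```bash
# Hypothetical helper: seconds elapsed since the last completed tick.
# Args: path_to_status_prom now_unix_seconds
tick_age() {
    # Take the sample value (last field) of the first matching series line;
    # "# HELP" / "# TYPE" lines do not start with the metric name.
    last=$(awk '/^cartesi_watchdog_last_tick_unix_seconds[{ ]/ { print $NF; exit }' "$1")
    [ -n "$last" ] || return 1
    echo $(( $2 - last ))
}
```

For example, `tick_age "$CARTESI_WATCHDOG_STATE_DIR/status.prom" "$(date +%s)"` prints the age in seconds. The equivalent Prometheus expression would be `time() - cartesi_watchdog_last_tick_unix_seconds > 900`, where the 900-second threshold is an arbitrary placeholder to tune against the tick schedule.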

Divergence playbook: **notify only**. The watchdog takes no corrective action;
follow-up is manual (see [`operator-deployment.md`](operator-deployment.md)).

## Local Tests

| Command | What it exercises |
|---------|-------------------|
| `just test-watchdog` | Lua unit tests (fake HTTP/RPC/CM; no live chain) |
| `just test-watchdog` | Lua unit tests (fake HTTP/RPC/CM; includes `status.prom` golden fixtures) |
| `just test-watchdog-e2e` | Real CM: advance, inspect; optional live compare if `CARTESI_WATCHDOG_E2E_SEQUENCER_URL` set |
| `just test-watchdog-compare-harness` | **Full E2E**: Anvil + devnet sequencer + `/finalized_state` + CM inspect + Lua `init`/`tick` |
| `just test-rollups-e2e` | All rollups e2e scenarios; includes watchdog genesis/non-genesis compare plus `watchdog_non_genesis_divergence_test` (needs Sepolia CM image) |
1 change: 1 addition & 0 deletions docs/watchdog/design-notes.md
@@ -81,6 +81,7 @@ fsync support.
state/
config.json
head.json
status.prom # last tick metrics (Prometheus textfile)
run.lock # advisory lock handle in the production container
checkpoints/
00000000000000000042/
4 changes: 3 additions & 1 deletion docs/watchdog/getting-started.md
@@ -132,7 +132,7 @@ export CARTESI_WATCHDOG_LUA_DEPS=.deps/lua

Success: exit **0**. If finalized has advanced, stderr ends in `compare pass complete`; if it has not, the tick exits idle after the cheap poll.

Exit codes from `sequencer-watchdog tick`: **0** clean (or idle — finalized unchanged), **1** transient failure (RPC/CM/network after retries), **2** deterministic divergence (`watchdog_event` emitted on stderr before exit).
Exit codes from `sequencer-watchdog tick`: **0** clean (or idle — finalized unchanged), **1** transient failure (RPC/CM/network after retries), **2** deterministic divergence (`watchdog_event` emitted on stderr before exit). Each tick writes `$CARTESI_WATCHDOG_STATE_DIR/status.prom` — see [`README.md` — Metrics](README.md#metrics-statusprom).

The watchdog tick runs **one cycle per process and exits** — re-run it on a timer/cron for continuous monitoring. When `inclusion_block` has not advanced since the watchdog checkpoint, the cycle **skips** L1/CM work (idle-cheap) and exits 0.
`sequencer-watchdog` takes a non-blocking `flock`; production schedulers should
@@ -164,6 +164,8 @@ Full operator runbook: **[`operator-deployment.md`](operator-deployment.md)**.
|----------|----------|-------------|
| `CARTESI_WATCHDOG_SEQUENCER_URL` | yes | e.g. `http://127.0.0.1:54321` |
| `CARTESI_WATCHDOG_BLOCKCHAIN_HTTP_ENDPOINT` | tick | Current L1 JSON-RPC; not persisted by `init` |
| `CARTESI_WATCHDOG_BLOCKCHAIN_ID` | init | Optional chain id label for `status.prom` |
| `CARTESI_WATCHDOG_METRICS_FILE` | tick | Optional override for Prometheus textfile path |
| `CARTESI_WATCHDOG_CONTRACTS_INPUT_BOX_ADDRESS` | yes | InputBox contract |
| `CARTESI_WATCHDOG_APP_ADDRESS` | yes | Rollup application contract |
| `CARTESI_WATCHDOG_STATE_DIR` | yes | Persistent watchdog state (`config.json`, `head.json`, `status.prom`, checkpoints) |
48 changes: 46 additions & 2 deletions docs/watchdog/operator-deployment.md
@@ -157,6 +157,8 @@ Today `WalletApp::default()` / `WalletConfig::sepolia()` align with Sepolia stag
| `CARTESI_WATCHDOG_STATE_DIR` | Persistent volume on watchdog host |
| `CARTESI_WATCHDOG_CM_SNAPSHOT_DIR` | Bootstrap CM snapshot (`init` only) |
| `CARTESI_WATCHDOG_CM_SNAPSHOT_SAFE_BLOCK` | L1 block that bootstrap snapshot represents (= finalized `inclusion_block` at bootstrap) |
| `CARTESI_WATCHDOG_BLOCKCHAIN_ID` | Chain id label for `status.prom` metrics (optional; defaults to `unknown`) |
| `CARTESI_WATCHDOG_METRICS_FILE` | Override path for the Prometheus textfile written by each `tick` |
| `CARTESI_WATCHDOG_LUA_DEPS` | `.deps/lua` |

The sequencer discovers and pins `input_box_address` at startup; use the same values as `CARTESI_SEQUENCER_BLOCKCHAIN_HTTP_ENDPOINT` / `CARTESI_SEQUENCER_APP_ADDRESS` configuration.
@@ -181,11 +183,53 @@ sequencer-watchdog init

After init, schedule `tick`; tick will fail if `head.json` is missing.

Each `tick` atomically writes a Prometheus textfile to
`$CARTESI_WATCHDOG_STATE_DIR/status.prom` (override with
`CARTESI_WATCHDOG_METRICS_FILE`). Operators can scrape or push it from their
side. Gauges:

- `cartesi_watchdog_status{chain,app_address,state="ok|warning|failed"}` — `1` on the active state
- `cartesi_watchdog_last_tick_unix_seconds{chain,app_address}`
- `cartesi_watchdog_exit_code{chain,app_address}`
- `cartesi_watchdog_divergence_info{chain,app_address,kind}` — present on exit `2`

Set `CARTESI_WATCHDOG_BLOCKCHAIN_ID` at `init` so `chain` is labeled (defaults to
`unknown` when omitted). Divergence playbook: notify only; manual intervention.

Example `status.prom` after a successful tick:

```prometheus
cartesi_watchdog_status{chain="11155111",app_address="0x4CE...",state="ok"} 1
cartesi_watchdog_status{chain="11155111",app_address="0x4CE...",state="warning"} 0
cartesi_watchdog_status{chain="11155111",app_address="0x4CE...",state="failed"} 0
cartesi_watchdog_last_tick_unix_seconds{chain="11155111",app_address="0x4CE..."} 1717420800
cartesi_watchdog_exit_code{chain="11155111",app_address="0x4CE..."} 0
```

On divergence (exit `2`), the `state="failed"` series is `1` and a
`cartesi_watchdog_divergence_info` series is present with `kind="state_mismatch"`
or `kind="inclusion_block_regressed"`. Example alert:

```promql
cartesi_watchdog_status{state="failed"} == 1
```

Cron + push pattern (operator pushes to Prometheus after each tick):

```bash
#!/bin/sh
set -eu
sequencer-watchdog tick || true # exit code still written to status.prom
# push $CARTESI_WATCHDOG_STATE_DIR/status.prom via your exporter
```
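
One way to fill in the push step is to hand the file to a node_exporter textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The sketch below is an assumption about the operator's setup, not part of the watchdog: the function name and the destination filename `cartesi_watchdog.prom` are hypothetical.

```bash
# Hypothetical publish step: copy the watchdog textfile into a node_exporter
# textfile-collector directory. Args: source_file collector_dir
publish_status() {
    src=$1
    dest_dir=$2
    # Temp name does not end in ".prom", so the collector skips it.
    tmp="$dest_dir/.cartesi_watchdog.prom.$$"
    # Copy first, then rename into place so a scrape never sees a partial file.
    cp "$src" "$tmp" && mv -f "$tmp" "$dest_dir/cartesi_watchdog.prom"
}
```

If the collector can read the state directory directly, pointing `CARTESI_WATCHDOG_METRICS_FILE` at a path inside the collector directory avoids the copy.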

### 6. Run tick

The watchdog runs **one tick per process, then exits** — there is no daemon
loop. Run it once as a smoke check, then schedule it (systemd timer / k8s
CronJob) and alert on the exit code:
loop. Run it once as a smoke check, then schedule it (cron, systemd timer, k8s
CronJob). Alert on `$CARTESI_WATCHDOG_STATE_DIR/status.prom` (preferred for
Prometheus push/pull) or on the process exit code. If the process is killed
mid-tick, `status.prom` keeps the last completed value until the next run.

```bash
sequencer-watchdog tick # exit 0 = clean/idle, 1 = transient, 2 = divergence
37 changes: 34 additions & 3 deletions docs/watchdog/staging-drills.md
@@ -16,7 +16,7 @@ This document covers staging and manual verification beyond the devnet tutorial.
- JSON is pure Lua (`watchdog/third_party/json.lua`); no cjson compile step
- Staging or local sequencer reachable at `CARTESI_WATCHDOG_SEQUENCER_URL`
- L1 RPC + InputBox + app addresses matching that deployment
- Log collection for `watchdog_event` lines and process exit codes
- Log collection for `watchdog_event` lines, process exit codes, and `status.prom`

## Drill 1 — Divergence signal (synthetic mismatch, no CM)

@@ -30,7 +30,8 @@ CARTESI_WATCHDOG_LUA_DEPS=.deps/lua lua watchdog/tests/drill_divergence.lua #
```

Expected: `main.lua` emits a structured `watchdog_event` with `kind=state_mismatch` and
non-zero `mismatch_offset`, then the drill process exits with code `2`.
non-zero `mismatch_offset`, then the drill process exits with code `2` and writes
`status.prom` with `state="failed"`.

Unit coverage: `just test-watchdog` (`runner returns state mismatch payload`).

@@ -64,6 +65,8 @@ export CARTESI_WATCHDOG_LUA_DEPS=.deps/lua
```

Expected: exit **0**; the tick may exit idle if the finalized block is unchanged.
`$CARTESI_WATCHDOG_STATE_DIR/status.prom` should show `state="ok"` and
`cartesi_watchdog_exit_code ... 0`.
The harness path also proves byte-identical **devnet** genesis SSZ on sequencer `/finalized_state` and CM inspect
(same bytes as `wallet_snapshot::encode(WalletConfig::devnet())`; the `.hex` fixture
is for Sepolia `default()` — do not use it as the devnet golden).
Expand All @@ -86,7 +89,35 @@ Exit codes from `sequencer-watchdog tick`:
|------|---------|
| `0` | Compare cycle completed — clean, or idle when finalized is unchanged |
| `1` | Transient error after retries (RPC, CM, network) |
| `2` | Deterministic divergence — `watchdog_event` on stderr with `{kind, previous_safe_block, sequencer_inclusion_block, mismatch_offset?}` |
| `2` | Deterministic divergence — `watchdog_event` on stderr with `{kind, previous_safe_block, sequencer_inclusion_block, mismatch_offset?}`; `status.prom` has `state="failed"` |

Each tick also writes `$CARTESI_WATCHDOG_STATE_DIR/status.prom` before exit. See
[`README.md` — Metrics](README.md#metrics-statusprom) for gauge names and alert
examples.

## Drill 4 — Metrics file (synthetic divergence)

Verifies `status.prom` is written on divergence without a live sequencer.

```bash
just watchdog-lua-deps
dir=$(mktemp -d)
export CARTESI_WATCHDOG_STATE_DIR="$dir"
export CARTESI_WATCHDOG_BLOCKCHAIN_ID=31337
# init once (needs CM snapshot env — reuse Drill 2 exports), then:
CARTESI_WATCHDOG_LUA_DEPS=.deps/lua lua watchdog/tests/drill_divergence.lua || true
cat "$dir/status.prom"
```

Or run the unit tests (includes golden fixture checks):

```bash
just test-watchdog
```

Expected (the same synthetic divergence as Drill 1):
`cartesi_watchdog_status{...,state="failed"} 1` and
`cartesi_watchdog_divergence_info{...,kind="state_mismatch"} 1` in
`$CARTESI_WATCHDOG_STATE_DIR/status.prom`.
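
The expected lines can be checked mechanically instead of by eye. The function below is a hypothetical helper, not part of the repository. It only assumes the series names and labels shown in this document.

```bash
# Hypothetical check for the drill output. Arg: path to status.prom
# Succeeds only if both the failed status and the state_mismatch info
# series are present with value 1.
assert_divergence() {
    grep -Eq '^cartesi_watchdog_status\{.*state="failed".*\} 1$' "$1" &&
    grep -Eq '^cartesi_watchdog_divergence_info\{.*kind="state_mismatch".*\} 1$' "$1"
}
```

In the drill, `assert_divergence "$dir/status.prom" && echo "drill 4 ok"` would replace the manual `cat` inspection.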

## Triage checklist

14 changes: 14 additions & 0 deletions tests/fixtures/watchdog_status_failed.prom
@@ -0,0 +1,14 @@
# HELP cartesi_watchdog_status Current watchdog compare state (1 = active).
# TYPE cartesi_watchdog_status gauge
cartesi_watchdog_status{app_address="0xdeadbeef",chain="31337",state="ok"} 0
cartesi_watchdog_status{app_address="0xdeadbeef",chain="31337",state="warning"} 0
cartesi_watchdog_status{app_address="0xdeadbeef",chain="31337",state="failed"} 1
# HELP cartesi_watchdog_last_tick_unix_seconds Unix time of the last completed tick.
# TYPE cartesi_watchdog_last_tick_unix_seconds gauge
cartesi_watchdog_last_tick_unix_seconds{app_address="0xdeadbeef",chain="31337"} 1710000000
# HELP cartesi_watchdog_exit_code Process exit code from the last tick.
# TYPE cartesi_watchdog_exit_code gauge
cartesi_watchdog_exit_code{app_address="0xdeadbeef",chain="31337"} 2
# HELP cartesi_watchdog_divergence_info Divergence kind from the last tick (1 = present).
# TYPE cartesi_watchdog_divergence_info gauge
cartesi_watchdog_divergence_info{app_address="0xdeadbeef",chain="31337",kind="state_mismatch"} 1
11 changes: 11 additions & 0 deletions tests/fixtures/watchdog_status_ok.prom
@@ -0,0 +1,11 @@
# HELP cartesi_watchdog_status Current watchdog compare state (1 = active).
# TYPE cartesi_watchdog_status gauge
cartesi_watchdog_status{app_address="0x4CE633CA71071818cD73187765ee60F696dae083",chain="11155111",state="ok"} 1
cartesi_watchdog_status{app_address="0x4CE633CA71071818cD73187765ee60F696dae083",chain="11155111",state="warning"} 0
cartesi_watchdog_status{app_address="0x4CE633CA71071818cD73187765ee60F696dae083",chain="11155111",state="failed"} 0
# HELP cartesi_watchdog_last_tick_unix_seconds Unix time of the last completed tick.
# TYPE cartesi_watchdog_last_tick_unix_seconds gauge
cartesi_watchdog_last_tick_unix_seconds{app_address="0x4CE633CA71071818cD73187765ee60F696dae083",chain="11155111"} 1710000000
# HELP cartesi_watchdog_exit_code Process exit code from the last tick.
# TYPE cartesi_watchdog_exit_code gauge
cartesi_watchdog_exit_code{app_address="0x4CE633CA71071818cD73187765ee60F696dae083",chain="11155111"} 0
5 changes: 5 additions & 0 deletions watchdog/config.lua
@@ -81,6 +81,8 @@ function config.load_init(env)
env
),
cm_image_hash = env.CARTESI_WATCHDOG_CM_IMAGE_HASH,
blockchain_id = env.CARTESI_WATCHDOG_BLOCKCHAIN_ID,
metrics_file = env.CARTESI_WATCHDOG_METRICS_FILE,
retry_attempts = optional_number("CARTESI_WATCHDOG_RETRY_ATTEMPTS", 3, env),
retry_delay_sec = optional_number("CARTESI_WATCHDOG_RETRY_DELAY_SEC", 5, env),
long_block_range_error_codes = split_csv(
@@ -116,6 +118,7 @@ function config.persisted(cfg)
app_address = cfg.app_address,
input_added_topic = cfg.input_added_topic,
cm_image_hash = cfg.cm_image_hash,
blockchain_id = cfg.blockchain_id,
retry_attempts = cfg.retry_attempts,
retry_delay_sec = cfg.retry_delay_sec,
long_block_range_error_codes = cfg.long_block_range_error_codes,
Expand All @@ -136,6 +139,8 @@ function config.from_persisted(state_dir, data, env)
app_address = required_field(data, "app_address"),
input_added_topic = data.input_added_topic,
cm_image_hash = data.cm_image_hash,
blockchain_id = data.blockchain_id,
metrics_file = env.CARTESI_WATCHDOG_METRICS_FILE,
retry_attempts = optional_field_number(data, "retry_attempts", 3),
retry_delay_sec = optional_field_number(data, "retry_delay_sec", 5),
long_block_range_error_codes = data.long_block_range_error_codes or {},