hotblocks: skip stale data packs below the finalized head instead of crash-looping by elina-chertova · Pull Request #75 · subsquid/data

elina-chertova · 2026-06-21T14:22:42Z

Cause (proven)

On hotblocks-db-0 in network-hotblocks-mainnet-internal, the binance-mainnet-traceless dataset update task was stuck in a 60s crash-restart loop:

ERROR dataset update task failed, will restart it in 1 minute
reason: failed to write new chunk 105536431-105536898 ...
Caused by: can't fork safely, because fork base is below the current finalized head
           and finalized head of the data pack is below the current

Mechanism:

A data source endpoint fell ~17 min behind (head stalled at canonical block 105536898, block-time 13:53:31Z) while a healthy endpoint had already taken the dataset to finalized head 105536909.
The multi-source layer (data-source/src/standard.rs) treats the lagging endpoint as being on a fork and, when the healthy endpoint is idle at head (forks == active) or after the 2s consensus timeout, emits a spurious DataEvent::Fork, rolling the ingest position back to ~105536431 — below the finalized head.
Ingest re-streams that already-finalized region and flushes a chunk whose finalized head is below the current one → WriteController::new_chunk hits bail!("can't fork safely …") → the epoch dies → controller restarts it in 60s → same stale pack → fails again. Lag grows ~60s per cycle, unbounded.
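
The loop in the last step can be modeled in a few lines. This is a minimal sketch, not the repository code: `Pack`, `pre_fix_check`, `lag_after`, and `RESTART_DELAY_SECS` are hypothetical names, and the failing condition is inferred from the error text above rather than copied from WriteController::new_chunk.

```rust
// Hypothetical model of the pre-fix behaviour; all names are illustrative.
const RESTART_DELAY_SECS: u64 = 60;

#[derive(Clone, Copy)]
struct Pack {
    first_block: u64,
    finalized_head: u64,
}

/// Condition inferred from the error message: the fork base lies below the
/// current finalized head AND the pack's own finalized head is below it too.
fn pre_fix_check(pack: Pack, current_finalized: u64) -> Result<(), String> {
    if pack.first_block <= current_finalized && pack.finalized_head < current_finalized {
        return Err("can't fork safely".to_string());
    }
    Ok(())
}

/// The same stale pack is re-delivered after every restart, so each failed
/// attempt adds one restart delay to the dataset's lag.
fn lag_after(cycles: u64, pack: Pack, current_finalized: u64) -> u64 {
    let mut lag = 0;
    for _ in 0..cycles {
        if pre_fix_check(pack, current_finalized).is_ok() {
            break; // the epoch would make progress from here
        }
        lag += RESTART_DELAY_SECS; // controller waits, then retries
    }
    lag
}

fn main() {
    // Numbers taken from the incident described above.
    let stale = Pack { first_block: 105_536_431, finalized_head: 105_536_898 };
    println!("lag after 10 cycles: {}s", lag_after(10, stale, 105_536_909));
}
```

Nothing in the loop changes between attempts, which is why the lag has no upper bound as long as the lagging endpoint keeps serving the same pack.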

hotblocks-db-1 (same dataset, same endpoints) was unaffected. An independent node (publicnode.com) confirmed block 105536898 hash 0x6bca4bc11c175b464e62a1ba893275e93f3f729fe873f9a45f7b920c9e9d0d9b — i.e. the lagging pack's data was canonical, just stale, not a real fork.

Fix (rule: a transient error that crash-loops the process is a code bug — fix the fatality)

In new_chunk, when the incoming pack lies entirely within the already-finalized region (chunk.last_block() <= current finalized head) and reports a finalized head below ours, log a warning and skip it instead of bail!-ing. Finalized blocks are immutable, so the existing chain is kept intact — the stale pack is dropped, not accepted as truth. Genuine forks above the finalized head are unaffected (the bail! for the ambiguous straddle case is preserved). This stops a single behind endpoint from crash-looping the whole dataset.
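
The three cases described above can be sketched as a standalone classifier. This is an illustration under assumptions, not the diff in this PR: `Chunk`, `Verdict`, and `classify` are hypothetical, and the real check lives inside WriteController::new_chunk, where it uses `self.finalized_head` and `warn!`.

```rust
// Sketch only: these types stand in for the real ones in the write controller.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Not affected by the guard; normal handling continues.
    Proceed,
    /// Entirely within the finalized region with a lower finalized head:
    /// log a warning and drop the pack.
    SkipStale,
    /// Straddles the finalized head with a lower finalized head:
    /// ambiguous, so the original `bail!` is kept.
    Unsafe,
}

struct Chunk {
    first_block: u64,
    last_block: u64,
    finalized_head: u64,
}

fn classify(chunk: &Chunk, current_finalized: u64) -> Verdict {
    // Assumed pre-existing condition: trouble arises only when the chunk starts
    // at or below our finalized head and reports a finalized head below ours.
    let conflicts = chunk.first_block <= current_finalized
        && chunk.finalized_head < current_finalized;
    if !conflicts {
        return Verdict::Proceed;
    }
    if chunk.last_block <= current_finalized {
        Verdict::SkipStale
    } else {
        Verdict::Unsafe
    }
}

fn main() {
    // The chunk from the incident: 105536431-105536898, finalized head 105536898.
    let stale = Chunk {
        first_block: 105_536_431,
        last_block: 105_536_898,
        finalized_head: 105_536_898,
    };
    println!("{:?}", classify(&stale, 105_536_909)); // prints "SkipStale"
}
```

Only the `SkipStale` branch is new behaviour in this sketch; `Unsafe` corresponds to the preserved `bail!` for the straddle case.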

Tested

Syntax validated with rustfmt (file parses cleanly). A full cargo check was not possible in the investigation sandbox (no C linker for transitive build-scripts such as librocksdb-sys); the change uses only symbols already imported and used in the same file (warn!, chunk.first_block()/last_block(), self.finalized_head).
Logic verified against the live failing path above: the skipped pack is exactly the 105536431-105536898 chunk whose finalized head (105536898) is below the current (105536909).

Falsification

If a stale-below-finalized pack is still delivered, new_chunk now logs "ignoring stale data pack below the current finalized head" and returns Ok. If instead the task keeps logging "can't fork safely" / "dataset update task failed, will restart it in 1 minute", the guard did not cover the case.
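
The check can be applied mechanically to a captured log excerpt. The helper below is a hypothetical sketch (`Outcome` and `judge` are not part of the repository); it only matches the two message substrings quoted above and assumes the log is available as plain text lines.

```rust
// Hypothetical log check; matches only the message substrings quoted in this PR.
#[derive(Debug, PartialEq)]
enum Outcome {
    GuardFired,        // the stale pack was skipped
    StillCrashLooping, // the old fatal error is still being logged
    Inconclusive,      // neither message seen in this excerpt
}

fn judge<'a>(lines: impl IntoIterator<Item = &'a str>) -> Outcome {
    let mut fired = false;
    for line in lines {
        if line.contains("can't fork safely") {
            // Any occurrence of the old error means the guard missed a case.
            return Outcome::StillCrashLooping;
        }
        if line.contains("ignoring stale data pack below the current finalized head") {
            fired = true;
        }
    }
    if fired { Outcome::GuardFired } else { Outcome::Inconclusive }
}

fn main() {
    let excerpt = vec![
        "WARN ignoring stale data pack below the current finalized head",
        "INFO some unrelated line",
    ];
    println!("{:?}", judge(excerpt)); // prints "GuardFired"
}
```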

Operator notes (not in this PR)

The currently-stuck hotblocks-db-0 epoch needs the new binary deployed (or a pod restart, or the lagging upstream to catch up) to clear the in-flight loop; this code change prevents recurrence, it does not retroactively unstick the running process.
Mitigation only: the lagging endpoint for binance-mainnet-traceless (dwellir) is the trigger. dwellir is already commented out for the sibling binance-mainnet dataset in infra/iac/network-hotblocks/values.mainnet-internal.yaml but left active for binance-mainnet-traceless. Swapping/removing a provider is an operator decision (it only removes today's trigger), so it is intentionally not part of this PR.

…crash-looping

A data source that has fallen behind can re-deliver a pack that lies entirely within the already-finalized region: its blocks end at or below the current finalized head and the finalized head it reports is below ours. Because finalized blocks are immutable, this is stale, already-committed data from a lagging endpoint, not a genuine fork.

Previously WriteController::new_chunk treated this as an unrecoverable fork and returned `bail!("can't fork safely ...")`. That error aborts the dataset update task, which the controller then restarts every 60s, re-pulls the same stale pack, and fails again - an infinite crash-loop. A single behind endpoint thus stalls the whole dataset even though the other endpoints are ahead, and the dataset's lag grows unbounded.

Fix the fatality: when the incoming pack is entirely within the finalized region and reports a lower finalized head, log a warning and skip it (no-op) rather than failing. The existing finalized chain is kept intact - the stale pack is dropped, not accepted as truth - so a lagging data source can no longer crash-loop ingestion. Genuine forks above the finalized head are unaffected.

Cause: observed on hotblocks-db-0 for dataset binance-mainnet-traceless, where a lagging upstream (head stuck ~17 min behind at canonical block 105536898) drove the loop while a healthy endpoint was already at 105536909; cross-checked against an independent node that the lagging pack's data was canonical, i.e. merely stale.

Falsification: if a stale-below-finalized pack is still delivered, new_chunk now emits "ignoring stale data pack below the current finalized head" and returns Ok; if instead the dataset task keeps logging "can't fork safely" / "dataset update task failed, will restart it in 1 minute", the guard did not cover the case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

eldargab · 2026-06-21T19:34:09Z

I haven't understood the bug explanation, but that certainly is not the right fix!

elina-chertova wants to merge 1 commit into master from alert-fix/6BRCp6-hotblocks-stale-pack-crashloop