
hotblocks: skip stale data packs below the finalized head instead of crash-looping #75

Open
elina-chertova wants to merge 1 commit into master from alert-fix/6BRCp6-hotblocks-stale-pack-crashloop


@elina-chertova


Cause (proven)

On hotblocks-db-0 in network-hotblocks-mainnet-internal, the binance-mainnet-traceless dataset update task was stuck in a 60s crash-restart loop:

ERROR dataset update task failed, will restart it in 1 minute
reason: failed to write new chunk 105536431-105536898 ...
Caused by: can't fork safely, because fork base is below the current finalized head
           and finalized head of the data pack is below the current

Mechanism:

  1. A data source endpoint fell ~17 min behind (head stalled at canonical block 105536898, block-time 13:53:31Z) while a healthy endpoint had already taken the dataset to finalized head 105536909.
  2. The multi-source layer (data-source/src/standard.rs) treats the lagging endpoint as being on a fork and, when the healthy endpoint is idle at head (forks == active) or after the 2s consensus timeout, emits a spurious DataEvent::Fork, rolling the ingest position back to ~105536431, below the finalized head.
  3. Ingest re-streams that already-finalized region and flushes a chunk whose finalized head is below the current one → WriteController::new_chunk hits bail!("can't fork safely …") → the epoch dies → controller restarts it in 60s → same stale pack → fails again. Lag grows ~60s per cycle, unbounded.
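
The loop in step 3 can be illustrated with a small model. This is only a sketch under stated assumptions: `Pack`, `write_pre_fix`, and `simulate_crash_loop` are hypothetical names, not code from this repository, and the error condition is a simplified reading of the log message above. It shows why retrying cannot help: the same stale pack is re-delivered, so every restart fails the same way and only adds another restart delay.

```rust
/// Hypothetical, simplified model of a delivered data pack (not the real type).
#[derive(Clone, Copy)]
struct Pack {
    first_block: u64,
    last_block: u64,
    finalized_head: u64,
}

/// Models the pre-fix write path (assumed condition): a pack that starts at or
/// below the current finalized head and reports a lower finalized head is fatal.
fn write_pre_fix(current_finalized: u64, pack: Pack) -> Result<(), String> {
    if pack.first_block <= current_finalized && pack.finalized_head < current_finalized {
        return Err(format!(
            "failed to write new chunk {}-{}: can't fork safely",
            pack.first_block, pack.last_block
        ));
    }
    Ok(())
}

/// Runs up to `restarts` cycles against a source that re-delivers the same pack.
/// Returns (number of failed cycles, seconds spent waiting between restarts).
fn simulate_crash_loop(
    current_finalized: u64,
    pack: Pack,
    restarts: u32,
    restart_delay_secs: u64,
) -> (u32, u64) {
    let mut failures = 0;
    let mut waited_secs = 0;
    for _ in 0..restarts {
        if write_pre_fix(current_finalized, pack).is_err() {
            // The task dies and is restarted after the delay with identical input.
            failures += 1;
            waited_secs += restart_delay_secs;
        } else {
            break;
        }
    }
    (failures, waited_secs)
}
```

With the values from the incident (pack 105536431-105536898, pack finalized head 105536898, current finalized head 105536909), five restarts at 60s each produce five failures and 300s of added delay in this model.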

hotblocks-db-1 (same dataset, same endpoints) was unaffected. An independent node (publicnode.com) confirmed block 105536898 hash 0x6bca4bc11c175b464e62a1ba893275e93f3f729fe873f9a45f7b920c9e9d0d9b — i.e. the lagging pack's data was canonical, just stale, not a real fork.

Fix (rule: a transient error that crash-loops the process is a code bug — fix the fatality)

In new_chunk, when the incoming pack lies entirely within the already-finalized region (chunk.last_block() <= current finalized head) and reports a finalized head below ours, log a warning and skip it instead of bail!-ing. Finalized blocks are immutable, so the existing chain is kept intact — the stale pack is dropped, not accepted as truth. Genuine forks above the finalized head are unaffected (the bail! for the ambiguous straddle case is preserved). This stops a single behind endpoint from crash-looping the whole dataset.
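
The three cases described above can be sketched as a pure decision function. This is a minimal illustration, not the code in this PR: `Chunk`, `ChunkDecision`, and `classify_chunk` are hypothetical names, and the exact boundary comparisons (in particular what counts as "fork base below the finalized head") are assumptions.

```rust
/// Hypothetical, simplified stand-in for an incoming chunk (not the real type).
struct Chunk {
    first_block: u64,
    last_block: u64,
    /// Finalized head reported by the data pack that produced this chunk.
    finalized_head: u64,
}

#[derive(Debug, PartialEq)]
enum ChunkDecision {
    /// Entirely inside the finalized region with a lower finalized head: warn and drop.
    SkipStale,
    /// Straddles the finalized head with a lower finalized head: still an error.
    RejectUnsafeFork,
    /// Anything else goes through the normal write / fork handling.
    Proceed,
}

fn classify_chunk(current_finalized: u64, chunk: &Chunk) -> ChunkDecision {
    // Assumption: the "can't fork safely" condition only applies when the chunk
    // starts at or below our finalized head AND the pack's finalized head is lower.
    let starts_in_finalized = chunk.first_block <= current_finalized;
    let pack_is_behind = chunk.finalized_head < current_finalized;
    if !(starts_in_finalized && pack_is_behind) {
        return ChunkDecision::Proceed;
    }
    if chunk.last_block <= current_finalized {
        // Finalized blocks are immutable, so this pack carries nothing new.
        ChunkDecision::SkipStale
    } else {
        // Ambiguous straddle case: keep failing loudly.
        ChunkDecision::RejectUnsafeFork
    }
}
```

Keeping the decision separate from the side effects (the warning and the write) would make the guard easy to unit-test without a database, which matters here because a full build was not available during the investigation.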

Tested

  • Syntax validated with rustfmt (file parses cleanly). A full cargo check was not possible in the investigation sandbox (no C linker for transitive build-scripts such as librocksdb-sys); the change uses only symbols already imported and used in the same file (warn!, chunk.first_block()/last_block(), self.finalized_head).
  • Logic verified against the live failing path above: the skipped pack is exactly the 105536431-105536898 chunk whose finalized head (105536898) is below the current (105536909).

Falsification

If a stale-below-finalized pack is still delivered, new_chunk now logs ignoring stale data pack below the current finalized head and returns Ok. If instead the task keeps logging can't fork safely / dataset update task failed, will restart it in 1 minute, the guard did not cover the case.
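
One way to apply this check to collected logs is a small classifier over log lines. The sketch below is an assumption-laden helper, not part of this PR: `GuardSignal` and `classify_log_line` are hypothetical names, and it only matches the message fragments quoted above, without assuming anything about the surrounding log format.

```rust
#[derive(Debug, PartialEq)]
enum GuardSignal {
    /// The new guard fired and the stale pack was skipped.
    GuardHit,
    /// The old failure is still occurring, so the guard did not cover the case.
    StillCrashLooping,
    /// The line says nothing about this issue.
    Unrelated,
}

fn classify_log_line(line: &str) -> GuardSignal {
    if line.contains("ignoring stale data pack below the current finalized head") {
        GuardSignal::GuardHit
    } else if line.contains("can't fork safely")
        || line.contains("dataset update task failed, will restart it in 1 minute")
    {
        GuardSignal::StillCrashLooping
    } else {
        GuardSignal::Unrelated
    }
}
```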

Operator notes (not in this PR)

  • The currently-stuck hotblocks-db-0 epoch needs the new binary deployed (or a pod restart, or the lagging upstream to catch up) to clear the in-flight loop. This code change prevents recurrence; it does not retroactively unstick the running process.
  • Mitigation only: the lagging endpoint for binance-mainnet-traceless (dwellir) is the trigger. dwellir is already commented out for the sibling binance-mainnet dataset in infra/iac/network-hotblocks/values.mainnet-internal.yaml but left active for binance-mainnet-traceless. Swapping/removing a provider is an operator decision (it only removes today's trigger), so it is intentionally not part of this PR.

…crash-looping

A data source that has fallen behind can re-deliver a pack that lies entirely
within the already-finalized region: its blocks end at or below the current
finalized head and the finalized head it reports is below ours. Because
finalized blocks are immutable, this is stale, already-committed data from a
lagging endpoint, not a genuine fork.

Previously WriteController::new_chunk treated this as an unrecoverable fork and
returned `bail!("can't fork safely ...")`. That error aborts the dataset update
task, which the controller then restarts every 60s, re-pulls the same stale
pack, and fails again - an infinite crash-loop. A single behind endpoint thus
stalls the whole dataset even though the other endpoints are ahead, and the
dataset's lag grows unbounded.

Fix the fatality: when the incoming pack is entirely within the finalized
region and reports a lower finalized head, log a warning and skip it (no-op)
rather than failing. The existing finalized chain is kept intact - the stale
pack is dropped, not accepted as truth - so a lagging data source can no longer
crash-loop ingestion. Genuine forks above the finalized head are unaffected.

Cause: observed on hotblocks-db-0 for dataset binance-mainnet-traceless, where a
lagging upstream (head stuck ~17 min behind at canonical block 105536898) drove
the loop while a healthy endpoint was already at 105536909; cross-checked
against an independent node that the lagging pack's data was canonical, i.e.
merely stale.

Falsification: if a stale-below-finalized pack is still delivered, new_chunk now
emits "ignoring stale data pack below the current finalized head" and returns
Ok; if instead the dataset task keeps logging "can't fork safely" / "dataset
update task failed, will restart it in 1 minute", the guard did not cover the
case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@eldargab

Collaborator

I haven't understood the bug explanation, but that certainly is not the right fix!

