hotblocks: skip stale data packs below the finalized head instead of crash-looping#75
Open
elina-chertova wants to merge 1 commit into
Open
Conversation
…crash-looping
A data source that has fallen behind can re-deliver a pack that lies entirely
within the already-finalized region: its blocks end at or below the current
finalized head and the finalized head it reports is below ours. Because
finalized blocks are immutable, this is stale, already-committed data from a
lagging endpoint, not a genuine fork.
Previously WriteController::new_chunk treated this as an unrecoverable fork and
returned `bail!("can't fork safely ...")`. That error aborts the dataset update
task, which the controller then restarts every 60s, re-pulls the same stale
pack, and fails again - an infinite crash-loop. A single behind endpoint thus
stalls the whole dataset even though the other endpoints are ahead, and the
dataset's lag grows unbounded.
Fix the fatality: when the incoming pack is entirely within the finalized
region and reports a lower finalized head, log a warning and skip it (no-op)
rather than failing. The existing finalized chain is kept intact - the stale
pack is dropped, not accepted as truth - so a lagging data source can no longer
crash-loop ingestion. Genuine forks above the finalized head are unaffected.
Cause: observed on hotblocks-db-0 for dataset binance-mainnet-traceless, where a
lagging upstream (head stuck ~17 min behind at canonical block 105536898) drove
the loop while a healthy endpoint was already at 105536909; cross-checked
against an independent node that the lagging pack's data was canonical, i.e.
merely stale.
Falsification: if a stale-below-finalized pack is still delivered, new_chunk now
emits "ignoring stale data pack below the current finalized head" and returns
Ok; if instead the dataset task keeps logging "can't fork safely" / "dataset
update task failed, will restart it in 1 minute", the guard did not cover the
case.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collaborator
|
I haven't understood the bug explanation, but that certainly is not the right fix! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cause (proven)
On
hotblocks-db-0innetwork-hotblocks-mainnet-internal, thebinance-mainnet-tracelessdataset update task was stuck in a 60s crash-restart loop:Mechanism:
105536898, block-time13:53:31Z) while a healthy endpoint had already taken the dataset to finalized head105536909.data-source/src/standard.rs) treats the lagging endpoint as being on a fork and, when the healthy endpoint is idle at head (forks == active) or after the 2s consensus timeout, emits a spuriousDataEvent::Fork, rolling the ingest position back to ~105536431— below the finalized head.WriteController::new_chunkhitsbail!("can't fork safely …")→ the epoch dies → controller restarts it in 60s → same stale pack → fails again. Lag grows ~60s per cycle, unbounded.hotblocks-db-1(same dataset, same endpoints) was unaffected. An independent node (publicnode.com) confirmed block105536898hash0x6bca4bc11c175b464e62a1ba893275e93f3f729fe873f9a45f7b920c9e9d0d9b— i.e. the lagging pack's data was canonical, just stale, not a real fork.Fix (rule: a transient error that crash-loops the process is a code bug — fix the fatality)
In
new_chunk, when the incoming pack lies entirely within the already-finalized region (chunk.last_block() <= current finalized head) and reports a finalized head below ours, log a warning and skip it instead ofbail!-ing. Finalized blocks are immutable, so the existing chain is kept intact — the stale pack is dropped, not accepted as truth. Genuine forks above the finalized head are unaffected (thebail!for the ambiguous straddle case is preserved). This stops a single behind endpoint from crash-looping the whole dataset.Tested
rustfmt(file parses cleanly). A fullcargo checkwas not possible in the investigation sandbox (no C linker for transitive build-scripts such aslibrocksdb-sys); the change uses only symbols already imported and used in the same file (warn!,chunk.first_block()/last_block(),self.finalized_head).105536431-105536898chunk whose finalized head (105536898) is below the current (105536909).Falsification
If a stale-below-finalized pack is still delivered,
new_chunknow logsignoring stale data pack below the current finalized headand returnsOk. If instead the task keeps loggingcan't fork safely/dataset update task failed, will restart it in 1 minute, the guard did not cover the case.Operator notes (not in this PR)
hotblocks-db-0epoch needs the new binary deployed (or a pod restart, or the lagging upstream to catch up) to clear the in-flight loop; this code change prevents recurrence, it does not retroactively unstick the running process.binance-mainnet-traceless(dwellir) is the trigger.dwelliris already commented out for the siblingbinance-mainnetdataset ininfra/iac/network-hotblocks/values.mainnet-internal.yamlbut left active forbinance-mainnet-traceless. Swapping/removing a provider is an operator decision (it only removes today's trigger), so it is intentionally not part of this PR.