dump-cli: recover from transient ForkException on finalized dump instead of crashing #502
Open
elina-chertova wants to merge 1 commit into
…ead of crashing

The finalized archive dump crashes the whole process on a ForkException. Because finalized blocks are immutable, such an exception is never a real reorg: it is a transient upstream/provider inconsistency (e.g. a load-balanced RPC backend that jumped to a recent snapshot and serves a gapped/forked view of an already-finalized slot range). The process exits, k8s restarts it, it resumes from the last persisted chunk and hits the same inconsistency, and the dumper crash-loops indefinitely, stalling the writer (no new chunks).

appendRawBlocks only persists complete chunks, so the in-flight (unflushed) buffer is discarded on throw and the resume point is re-derived from the last persisted chunk on the next call. Wrap the finalized append loop so a ForkException triggers an in-process retry from that durable boundary, identical to a clean process restart but without crash-looping.

Recovery is bounded by progress: if retries stop advancing the last written block, the exception is re-thrown so a genuine, persistent divergence still surfaces as a hard failure rather than being retried forever.

Detection is structural (isSqdForkException marker) to avoid adding a dependency on the data-source package.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Symptom
Alert solana-devnet_Writer_Short_Stall (solana-archive, kind=ingest, onset 2026-06-19 11:42 PDT). The writer stalls because no new raw chunks are being produced.
Root cause (proven)
The upstream dump-solana-devnet-0 pod (ns solana-archive) crash-loops (81 restarts, exit=1, reason=Error). According to the logs, the ForkException thrown by ChainFixer on the finalized dump stream propagates uncaught up through getFinalizedStream → ingest() → appendRawBlocks → runProgram, which calls log().fatal and exits. k8s restarts the pod, it resumes from the last persisted chunk, hits the same upstream inconsistency, and crash-loops indefinitely → writer stall.
Provider cross-check (independent node implementations; rule against blaming the chain)
getBlock for the two slots, Helius (the configured endpoint) vs the official public devnet RPC (different node implementation):
Slot 470552135: Helius A2efcChz… (clean); public devnet RPC A2efcChz… (clean)
Slot 470552136: Helius error -32007: Slot 470552136 was skipped … due to ledger jump to recent snapshot; public devnet RPC CwBp9Tj4…, parentSlot 470552135, prev A2efcChz… (clean, continuous)
The chain is continuous and fine (block 470552136 links cleanly to 470552135 on the independent source). Helius reports the slot missing due to a snapshot/ledger jump on a backend: a transient provider inconsistency across its load-balanced nodes. Finalized blocks are immutable, so the ForkException here is never a real reorg.
Fix (tested)
Make the finalized archive dump survive the transient inconsistency instead of crash-looping. appendRawBlocks only persists complete chunks, so the in-flight (unflushed) buffer, including the offending block, is discarded on throw, and the resume point is re-derived from the last persisted chunk on the next call. This PR wraps the finalized append loop so a ForkException triggers an in-process retry from that durable boundary: identical to a clean process restart, but without exiting.
Recovery is bounded by progress: each retry that persists a higher block resets the counter; if maxStuckRetries (default 10) consecutive retries make no progress, the exception is re-thrown so a genuine, persistent archive/chain divergence still surfaces as a hard failure (it does not silently mask a real regression). Detection is structural (isSqdForkException marker) to avoid adding a dependency on the data-source package. Scope is limited to the resumable (archive, dest != null) path and only to ForkException; non-fork errors (e.g. EVM assertion crashes) propagate unchanged.
Verification
appendWithForkRecovery: (1) transient forks that advance progress → recovers without re-throwing; (2) non-fork error → propagates immediately; (3) persistent fork with no progress → re-throws after maxStuckRetries.
Falsification
If the dump still crash-loops after this deploys, with a ForkException whose last written block never advances across retries, then the persisted archive itself diverges from the canonical chain (a real corruption, not transient provider noise). That is a different problem, and this guard will correctly re-throw to surface it.
Operator note (not part of this PR)
The trigger is a Helius devnet ledger/snapshot jump serving inconsistent finalized data. A provider swap would remove today's trigger, but it is a temporary mitigation rather than a durable fix, so it is left to operators instead of being shipped here. This code change is the durable resolution: a single provider hiccup can no longer crash-loop the dump.
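Illustrative sketch (not the code in this PR)
The progress-bounded retry described under Fix can be sketched roughly as follows. This is a minimal TypeScript sketch under stated assumptions, not the PR's actual diff: only the names appendWithForkRecovery, maxStuckRetries and isSqdForkException come from the description above, while the helper isForkException, the callbacks appendOnce and getLastWrittenBlock, and the exact way stuck attempts are counted are hypothetical.

```typescript
// Hypothetical structural check: any error object carrying the
// `isSqdForkException` marker is treated as a fork, so no import from the
// data-source package is needed.
function isForkException(err: unknown): err is { isSqdForkException: true } {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { isSqdForkException?: unknown }).isSqdForkException === true
  )
}

interface RecoveryOptions {
  // Consecutive no-progress failures tolerated before re-throwing.
  maxStuckRetries?: number
}

// `appendOnce` stands in for one pass of the finalized append loop; it is
// assumed to resume from the last persisted chunk on every call.
// `getLastWrittenBlock` is assumed to read that durable boundary.
async function appendWithForkRecovery(
  appendOnce: () => Promise<void>,
  getLastWrittenBlock: () => Promise<number>,
  opts: RecoveryOptions = {}
): Promise<void> {
  const maxStuckRetries = opts.maxStuckRetries ?? 10
  let stuck = 0
  let lastSeen = await getLastWrittenBlock()
  while (true) {
    try {
      await appendOnce()
      return
    } catch (err) {
      // Non-fork errors propagate unchanged.
      if (!isForkException(err)) throw err
      const now = await getLastWrittenBlock()
      if (now > lastSeen) {
        // A higher block was persisted before the throw: reset the counter.
        stuck = 0
        lastSeen = now
      } else {
        // No progress: count it and give up once the bound is reached.
        stuck += 1
        if (stuck >= maxStuckRetries) throw err
      }
    }
  }
}
```

In this sketch the initial attempt counts toward the stuck bound, which may differ from how the PR counts retries. The design point it illustrates is that the bound is tied to persisted progress rather than to a fixed number of retries, so a long run of transient forks that each advance the archive never trips it.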