dump-cli: recover from transient ForkException on finalized dump instead of crashing #502

Open
elina-chertova wants to merge 1 commit into master from alert-fix/LHg2kK-solana-dump-fork-crashloop
@elina-chertova (Contributor)

Symptom

Alert solana-devnet_Writer_Short_Stall (solana-archive, kind=ingest, onset 2026-06-19 11:42 PDT). The writer stalls because no new raw chunks are being produced.

Root cause (proven)

The upstream dump-solana-devnet-0 pod (ns solana-archive) crash-loops (81 restarts, exit=1, reason=Error). Logs:

ForkException: expected 470552136#CwBp9Tj4qG6PvAYnf7CZTF2jpPvzZXBuHY1S8h5FJBUP to have parent
  2mQdd58epwuH5r6mQFLu7i5XmebCTGvNs66zYcu3ppPE, but got 470552135#A2efcChzc1nWqQt5m1NE4kLNdSVB9WsU9rwDi1Ysx2mk
    at ChainFixer.acceptBatch (.../solana-rpc/lib/data-source/chain-fixer.js)
    at SolanaRpcDataSource.getFinalizedStream (...)
    at SolanaDumper.getBlocks (.../solana-dump/lib/dumper.js)
    at Dumper.ingest (.../util-internal-dump-cli/lib/dumper.js)

The ForkException thrown by ChainFixer on the finalized dump stream propagates uncaught up through getFinalizedStream → ingest() → appendRawBlocks → runProgram, which log().fatals and exits. k8s restarts the pod, it resumes from the last persisted chunk, hits the same upstream inconsistency, and crash-loops indefinitely → writer stall.

Provider cross-check (independent node implementations — rule against blaming the chain)

getBlock for the two slots, Helius (the configured endpoint) vs the official public devnet RPC (different node implementation):

| slot | Helius | public api.devnet.solana.com |
| --- | --- | --- |
| 470552135 | A2efcChz… (clean) | A2efcChz… (clean) |
| 470552136 | error -32007: Slot 470552136 was skipped … due to ledger jump to recent snapshot | CwBp9Tj4…, parentSlot 470552135, prev A2efcChz… (clean, continuous) |

The chain is continuous and fine (block 470552136 links cleanly to 470552135 on the independent source). Helius reports the slot missing due to a snapshot/ledger jump on a backend — transient provider inconsistency across its load-balanced nodes. Finalized blocks are immutable, so the ForkException here is never a real reorg.
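The continuity check behind this cross-check can be sketched as below. This is an illustration only, not code from the PR: `BlockLink`, `getBlockRequest` and `linksTo` are hypothetical names, and the full hashes are taken from the ForkException log line on the assumption that they are the blocks the table abbreviates. The request uses the standard Solana JSON-RPC `getBlock` method with transaction details switched off, since only the header fields matter here.

```typescript
// Header fields of a Solana `getBlock` result that matter for continuity.
// `slot` is not part of the RPC result; the caller adds it from the request.
interface BlockLink {
  slot: number
  blockhash: string
  previousBlockhash: string
  parentSlot: number
}

// JSON-RPC request body for `getBlock`, asking for header data only.
function getBlockRequest(slot: number) {
  return {
    jsonrpc: '2.0',
    id: 1,
    method: 'getBlock',
    params: [
      slot,
      {
        commitment: 'finalized',
        transactionDetails: 'none',
        rewards: false,
        maxSupportedTransactionVersion: 0,
      },
    ],
  }
}

// A child block links cleanly when both its parent slot and its
// previous blockhash match the parent block.
function linksTo(
  childBlock: BlockLink,
  parentBlock: {slot: number; blockhash: string}
): boolean {
  return (
    childBlock.parentSlot === parentBlock.slot &&
    childBlock.previousBlockhash === parentBlock.blockhash
  )
}

// Values as reported by the public devnet RPC in the table above
// (full hashes assumed from the log excerpt).
const parentBlock = {
  slot: 470552135,
  blockhash: 'A2efcChzc1nWqQt5m1NE4kLNdSVB9WsU9rwDi1Ysx2mk',
}
const childBlock: BlockLink = {
  slot: 470552136,
  blockhash: 'CwBp9Tj4qG6PvAYnf7CZTF2jpPvzZXBuHY1S8h5FJBUP',
  previousBlockhash: 'A2efcChzc1nWqQt5m1NE4kLNdSVB9WsU9rwDi1Ysx2mk',
  parentSlot: 470552135,
}
const continuous = linksTo(childBlock, parentBlock)
```

An endpoint that answers the same request with error -32007 instead of a block gives no header to compare, which is the Helius case in the table.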

Fix (tested)

Make the finalized archive dump survive the transient inconsistency instead of crash-looping. appendRawBlocks only persists complete chunks, so the in-flight (unflushed) buffer — including the offending block — is discarded on throw, and the resume point is re-derived from the last persisted chunk on the next call. This PR wraps the finalized append loop so a ForkException triggers an in-process retry from that durable boundary — identical to a clean process restart, but without exiting.
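Why an in-process retry is equivalent to a restart can be seen in a toy model of chunked persistence. The sketch below is not the actual appendRawBlocks implementation: `ChunkedArchive`, `FetchFrom` and the callback-style source are hypothetical, and end-of-range flushing is left out. It only illustrates that the buffer lives inside a single append call, so a throw drops it and the next call resumes from the last persisted chunk.

```typescript
interface Block {
  height: number
}

// Hypothetical source: emits blocks starting at `from`, and may throw midway.
type FetchFrom = (from: number, emit: (block: Block) => void) => void

class ChunkedArchive {
  readonly chunks: Block[][] = []
  private lastPersisted = -1
  private readonly chunkSize: number

  constructor(chunkSize: number) {
    this.chunkSize = chunkSize
  }

  // The resume point is derived only from what has been persisted.
  nextHeight(): number {
    return this.lastPersisted + 1
  }

  // The buffer is local to this call, so if `fetchFrom` throws,
  // any unflushed blocks are discarded with it.
  append(fetchFrom: FetchFrom): void {
    let buffer: Block[] = []
    fetchFrom(this.nextHeight(), (block) => {
      buffer.push(block)
      if (buffer.length === this.chunkSize) {
        this.chunks.push(buffer) // persist a complete chunk
        this.lastPersisted = block.height
        buffer = []
      }
    })
  }
}

// Usage: a source covering heights 0..8 that fails once at height 7.
const archive = new ChunkedArchive(3)
let failedOnce = false
const flakySource: FetchFrom = (from, emit) => {
  for (let h = from; h < 9; h++) {
    if (h === 7 && !failedOnce) {
      failedOnce = true
      throw new Error('simulated fork at height 7')
    }
    emit({height: h})
  }
}

let resumeAfterFailure = -1
try {
  archive.append(flakySource)
} catch (err) {
  // Block 6 was buffered but never persisted, so the resume point is 6.
  resumeAfterFailure = archive.nextHeight()
}
archive.append(flakySource) // second call refetches from height 6
```

In this model the second call sees exactly what a freshly restarted process would see, which is the property the fix relies on.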

Recovery is bounded by progress: each retry that persists a higher block resets the counter; if maxStuckRetries (default 10) consecutive retries make no progress, the exception is re-thrown so a genuine, persistent archive/chain divergence still surfaces as a hard failure (it does not silently mask a real regression). Detection is structural (isSqdForkException marker) to avoid adding a dependency on the data-source package. Scope is limited to the resumable (archive, dest != null) path and only to ForkException — non-fork errors (e.g. EVM assertion crashes) propagate unchanged.
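The progress-bounded retry described above could look roughly like the following. This is a minimal sketch, not the code in this PR: only `appendWithForkRecovery`, `maxStuckRetries` and the `isSqdForkException` marker are named in the description, while `RecoveryOptions`, `appendOnce` and `getLastWrittenBlock` are assumptions made for illustration. The exact counting of stuck retries in the real change may differ.

```typescript
// Structural detection: any thrown object carrying this marker counts as a
// fork exception, so the data-source package does not need to be imported.
function isForkException(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as {isSqdForkException?: unknown}).isSqdForkException === true
  )
}

// Hypothetical option names, for illustration only.
interface RecoveryOptions {
  // One pass of the append loop, resuming from the last persisted chunk.
  appendOnce: () => Promise<void>
  // Height of the last durably written block.
  getLastWrittenBlock: () => Promise<number>
  maxStuckRetries?: number
}

async function appendWithForkRecovery(opts: RecoveryOptions): Promise<void> {
  const maxStuckRetries = opts.maxStuckRetries ?? 10
  let stuck = 0 // consecutive fork failures that persisted nothing new
  let lastSeen = await opts.getLastWrittenBlock()

  while (true) {
    try {
      await opts.appendOnce()
      return
    } catch (err) {
      // Non-fork errors propagate unchanged.
      if (!isForkException(err)) throw err

      const now = await opts.getLastWrittenBlock()
      if (now > lastSeen) {
        // Progress was made before the failure: reset the counter.
        lastSeen = now
        stuck = 0
      } else {
        stuck += 1
        // The initial attempt plus maxStuckRetries retries made no
        // progress: surface the fork as a hard failure.
        if (stuck > maxStuckRetries) throw err
      }
    }
  }
}
```

A caller would pass a closure around the finalized append loop as `appendOnce`. The sketch retries immediately; whether the real change waits between retries is not stated in this description.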

Verification

  • Standalone behavior test of appendWithForkRecovery: (1) transient forks that advance progress → recovers without re-throwing; (2) non-fork error → propagates immediately; (3) persistent fork with no progress → re-throws after maxStuckRetries.
  • Provider cross-check above proves a re-fetch yields consistent finalized data (so the retry converges once a healthy backend is hit).

Falsification

If the dump still crash-loops after this deploys with a ForkException whose last written block never advances across retries, then the persisted archive itself diverges from the canonical chain (a real corruption, not transient provider noise) — that is a different problem and this guard will correctly re-throw to surface it.

Operator note (not part of this PR)

The trigger is a Helius devnet ledger/snapshot jump serving inconsistent finalized data. A provider swap would remove today's trigger but is a temporary mitigation, not a durable fix, so it is left to operators rather than shipped here. This code change is the durable resolution: a single provider hiccup can no longer crash-loop the dump.

…ead of crashing

The finalized archive dump crashes the whole process on a ForkException.
Because finalized blocks are immutable, such an exception is never a real
reorg — it is transient upstream/provider inconsistency (e.g. a load-balanced
RPC backend that jumped to a recent snapshot and serves a gapped/forked view
of an already-finalized slot range). The process exits, k8s restarts it, it
resumes from the last persisted chunk and hits the same inconsistency, and the
dumper crash-loops indefinitely, stalling the writer (no new chunks).

appendRawBlocks only persists complete chunks, so the in-flight (unflushed)
buffer is discarded on throw and the resume point is re-derived from the last
persisted chunk on the next call. Wrap the finalized append loop so a
ForkException triggers an in-process retry from that durable boundary —
identical to a clean process restart, but without crash-looping. Recovery is
bounded by progress: if retries stop advancing the last written block, the
exception is re-thrown so a genuine, persistent divergence still surfaces as a
hard failure rather than being retried forever.

Detection is structural (isSqdForkException marker) to avoid adding a
dependency on the data-source package.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>