dump-cli: recover from transient ForkException on finalized dump instead of crashing by elina-chertova · Pull Request #502 · subsquid/squid-sdk

elina-chertova · 2026-06-20T02:29:26Z

Symptom

Alert solana-devnet_Writer_Short_Stall (solana-archive, kind=ingest, onset 2026-06-19 11:42 PDT). The writer stalls because no new raw chunks are being produced.

Root cause (proven)

The upstream dump-solana-devnet-0 pod (ns solana-archive) crash-loops (81 restarts, exit=1, reason=Error). Logs:

ForkException: expected 470552136#CwBp9Tj4qG6PvAYnf7CZTF2jpPvzZXBuHY1S8h5FJBUP to have parent
  2mQdd58epwuH5r6mQFLu7i5XmebCTGvNs66zYcu3ppPE, but got 470552135#A2efcChzc1nWqQt5m1NE4kLNdSVB9WsU9rwDi1Ysx2mk
    at ChainFixer.acceptBatch (.../solana-rpc/lib/data-source/chain-fixer.js)
    at SolanaRpcDataSource.getFinalizedStream (...)
    at SolanaDumper.getBlocks (.../solana-dump/lib/dumper.js)
    at Dumper.ingest (.../util-internal-dump-cli/lib/dumper.js)

The ForkException thrown by ChainFixer on the finalized dump stream propagates uncaught up through getFinalizedStream → ingest() → appendRawBlocks → runProgram, which logs it at fatal level and exits the process. k8s restarts the pod, the pod resumes from the last persisted chunk, hits the same upstream inconsistency, and exits again. The crash-loop continues indefinitely, so no new chunks are produced and the writer stalls.

Provider cross-check (independent node implementations — rule against blaming the chain)

getBlock for the two slots, Helius (the configured endpoint) vs the official public devnet RPC (different node implementation):

slot	Helius	public api.devnet.solana.com
470552135	`A2efcChz…` (clean)	`A2efcChz…` (clean)
470552136	`error -32007: Slot 470552136 was skipped … due to ledger jump to recent snapshot`	`CwBp9Tj4…`, parentSlot 470552135, prev `A2efcChz…` (clean, continuous)

The chain is continuous and fine (block 470552136 links cleanly to 470552135 on the independent source). Helius reports the slot missing due to a snapshot/ledger jump on a backend — transient provider inconsistency across its load-balanced nodes. Finalized blocks are immutable, so the ForkException here is never a real reorg.
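The continuity claim can be expressed as a small check. This is a minimal sketch, not code from this PR: `linksTo` and the two interfaces are hypothetical names, and the field names follow the Solana JSON-RPC `getBlock` response (`blockhash`, `previousBlockhash`, `parentSlot`). The full hashes are taken from the log above, on the assumption that they match the truncated prefixes in the table.

```typescript
// Identity of a block as reported by an RPC provider.
interface BlockId {
    slot: number
    blockhash: string
}

// A block together with the parent it claims to extend.
interface BlockLink extends BlockId {
    parentSlot: number
    previousBlockhash: string
}

// Hypothetical helper: true when `child` names `parent` as its direct parent
// by both slot and hash.
function linksTo(parent: BlockId, child: BlockLink): boolean {
    return child.parentSlot === parent.slot
        && child.previousBlockhash === parent.blockhash
}

// Values as reported by the public devnet RPC in the table above
// (full hashes copied from the ForkException log).
const parent: BlockId = {
    slot: 470552135,
    blockhash: 'A2efcChzc1nWqQt5m1NE4kLNdSVB9WsU9rwDi1Ysx2mk',
}
const child: BlockLink = {
    slot: 470552136,
    blockhash: 'CwBp9Tj4qG6PvAYnf7CZTF2jpPvzZXBuHY1S8h5FJBUP',
    parentSlot: 470552135,
    previousBlockhash: 'A2efcChzc1nWqQt5m1NE4kLNdSVB9WsU9rwDi1Ysx2mk',
}

const continuous = linksTo(parent, child)
```

A provider that reports the child slot as skipped gives nothing to pass to `linksTo`, which is the kind of inconsistency the cross-check found on one side only.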

Fix (tested)

Make the finalized archive dump survive the transient inconsistency instead of crash-looping. appendRawBlocks only persists complete chunks, so the in-flight (unflushed) buffer — including the offending block — is discarded on throw, and the resume point is re-derived from the last persisted chunk on the next call. This PR wraps the finalized append loop so a ForkException triggers an in-process retry from that durable boundary — identical to a clean process restart, but without exiting.
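The durable-boundary behaviour described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the real appendRawBlocks: `ChunkStore`, `appendBlocks`, and the batch-fetching callback are hypothetical stand-ins, and chunks here are cut by block count purely for simplicity.

```typescript
interface Block {
    height: number
}

// Hypothetical in-memory stand-in for the archive destination.
class ChunkStore {
    readonly chunks: Block[][] = []

    // Height of the last block in the last persisted chunk, or -1 if empty.
    lastPersistedHeight(): number {
        const last = this.chunks[this.chunks.length - 1]
        return last ? last[last.length - 1].height : -1
    }
}

// Hypothetical simplification of an append loop: blocks are buffered and
// persisted only as complete chunks. If `nextBatch` throws, the local
// buffer is lost and nothing partial reaches the store.
function appendBlocks(
    store: ChunkStore,
    nextBatch: (from: number) => Block[],
    chunkSize: number
): void {
    let buffer: Block[] = []
    // The resume point is always derived from what is durably persisted.
    let next = store.lastPersistedHeight() + 1
    for (;;) {
        const batch = nextBatch(next) // may throw, e.g. on a fork error
        if (batch.length === 0) return // a trailing partial buffer is dropped
        for (const block of batch) {
            buffer.push(block)
            if (buffer.length === chunkSize) {
                store.chunks.push(buffer)
                buffer = []
            }
        }
        next = batch[batch.length - 1].height + 1
    }
}
```

Calling `appendBlocks` again after a throw starts from `lastPersistedHeight() + 1`, which is why an in-process retry and a process restart resume from the same block in this model.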

Recovery is bounded by progress: each retry that persists a higher block resets the counter; if maxStuckRetries (default 10) consecutive retries make no progress, the exception is re-thrown so a genuine, persistent archive/chain divergence still surfaces as a hard failure (it does not silently mask a real regression). Detection is structural (isSqdForkException marker) to avoid adding a dependency on the data-source package. Scope is limited to the resumable (archive, dest != null) path and only to ForkException — non-fork errors (e.g. EVM assertion crashes) propagate unchanged.
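The progress-bounded retry can be sketched as follows. This is an illustration of the behaviour described above, not the code in this PR: the options shape (`append`, `lastWrittenBlock`) is hypothetical, the marker is assumed to be a boolean property named `isSqdForkException` on the error object, and the exact way consecutive no-progress attempts are counted is an assumption.

```typescript
// Structural detection: no import of the exception class is needed.
// Assumption: the marker is a boolean property set to true.
function isForkException(err: unknown): boolean {
    return typeof err === 'object'
        && err !== null
        && (err as {isSqdForkException?: unknown}).isSqdForkException === true
}

interface RecoveryOptions {
    // Hypothetical: one full pass of the finalized append loop.
    append: () => Promise<void>
    // Hypothetical: highest durably persisted block, -1 if none.
    lastWrittenBlock: () => number
    // Consecutive no-progress fork failures tolerated before re-throwing.
    maxStuckRetries?: number
}

async function appendWithForkRecovery(opts: RecoveryOptions): Promise<void> {
    const maxStuckRetries = opts.maxStuckRetries ?? 10
    let stuck = 0
    for (;;) {
        const before = opts.lastWrittenBlock()
        try {
            await opts.append()
            return
        } catch (err) {
            // Non-fork errors propagate unchanged.
            if (!isForkException(err)) throw err
            if (opts.lastWrittenBlock() > before) {
                // Progress was persisted before the fork: reset the counter.
                stuck = 0
            } else {
                stuck += 1
                // No progress for too long: surface the failure.
                if (stuck >= maxStuckRetries) throw err
            }
        }
    }
}
```

Keying the bound on persisted progress rather than on a fixed retry count means a long run of transient provider errors is tolerated as long as each attempt moves the durable boundary forward, while a divergence at a fixed block still fails after a bounded number of attempts.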

Verification

Standalone behavior test of appendWithForkRecovery: (1) transient forks that advance progress → recovers without re-throwing; (2) non-fork error → propagates immediately; (3) persistent fork with no progress → re-throws after maxStuckRetries.
Provider cross-check above proves a re-fetch yields consistent finalized data (so the retry converges once a healthy backend is hit).

Falsification

If, after this change is deployed, the dump still crash-loops with a ForkException and the last written block never advances across retries, then the persisted archive itself diverges from the canonical chain (a real corruption, not transient provider noise). That is a different problem, and this guard will correctly re-throw to surface it.

Operator note (not part of this PR)

The trigger is a Helius devnet ledger/snapshot-jump serving inconsistent finalized data. A provider swap would remove today's trigger but is a temporary mitigation, not a durable fix — handing that to operators rather than shipping it here. This code change is the durable resolution: a single provider hiccup can no longer crash-loop the dump.

…ead of crashing

The finalized archive dump crashes the whole process on a ForkException. Because finalized blocks are immutable, such an exception is never a real reorg: it is transient upstream/provider inconsistency (e.g. a load-balanced RPC backend that jumped to a recent snapshot and serves a gapped/forked view of an already-finalized slot range). The process exits, k8s restarts it, it resumes from the last persisted chunk and hits the same inconsistency, and the dumper crash-loops indefinitely, stalling the writer (no new chunks).

appendRawBlocks only persists complete chunks, so the in-flight (unflushed) buffer is discarded on throw and the resume point is re-derived from the last persisted chunk on the next call. Wrap the finalized append loop so a ForkException triggers an in-process retry from that durable boundary, identical to a clean process restart but without crash-looping.

Recovery is bounded by progress: if retries stop advancing the last written block, the exception is re-thrown so a genuine, persistent divergence still surfaces as a hard failure rather than being retried forever. Detection is structural (isSqdForkException marker) to avoid adding a dependency on the data-source package.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

elina-chertova wants to merge 1 commit into master from alert-fix/LHg2kK-solana-dump-fork-crashloop
