evm-rpc: ride out intermittent missing/invalid blocks instead of crash-looping the dumper #500

Open
elina-chertova wants to merge 1 commit into master from alert-fix/blP0g7-receipts-null-backoff
Conversation

@elina-chertova

Contributor

Cause (proven)

The polygon-amoy-testnet EVM dumper (evm-archive) entered CrashLoopBackOff
(15 restarts, exit 1). The fatal error was:

Error: eth_getBlockReceipts returned null
  at getBlocks (evm/evm-rpc/lib/data-source/get-blocks.js)
  rpcUrl: https://api.uniblock.dev/.../chainId=80002
  failedBlocks: [40429264,40429265,40429266,40429267], retries: 5

addReceiptsByBlock already flags a null eth_getBlockReceipts result as
_isInvalid (rpc.ts) so getBlocks can retry it. But getBlocks retried only
5 times with a fixed 100ms wait (~0.5s total) before throwing a fatal,
non-retryable error that crash-loops the dumper.
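
How a null result becomes a retryable flag rather than an immediate throw can be sketched roughly as below. This is only an illustration: the `Block` shape and the `attachReceipts` helper are hypothetical stand-ins, and the real addReceiptsByBlock in rpc.ts may be structured differently.

```typescript
// Hypothetical block shape; the real type in evm-rpc may carry more fields.
interface Block {
  number: number;
  receipts?: unknown[];
  _isInvalid?: boolean;
}

// Attach one receipts result per block. A null (or missing) result does not
// throw; it marks the block as invalid so the caller can retry just that block.
function attachReceipts(blocks: Block[], results: (unknown[] | null)[]): void {
  blocks.forEach((block, i) => {
    const receipts = results[i];
    if (receipts == null) {
      block._isInvalid = true;
    } else {
      block.receipts = receipts;
    }
  });
}
```

With this shape, the retry policy lives entirely in the caller, which is why the retry count and wait in getBlocks decide whether a null is survivable.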

The upstream null is intermittent, not inherent to the chain. Probing block
0x268e6d0 (40429264) directly:

  • uniblock eth_getBlockReceipts → returns full receipts on two attempts,
    null on a third (the block is served fine by eth_getBlockByNumber).
  • alchemy (different node implementation) eth_getBlockReceipts → returns the
    full receipt set every time.
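
For anyone repeating the probe, the request body can be built as in the sketch below. The helper names `toBlockTag` and `buildReceiptsProbe` are hypothetical and not part of evm-rpc; only the eth_getBlockReceipts method name and the block number come from the description above. Sending the payload (for example with an HTTP POST to the provider URL) is left out on purpose.

```typescript
// Hypothetical helpers for building a receipts probe; not part of evm-rpc.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: unknown[];
}

// JSON-RPC block numbers are hex strings with a 0x prefix and no padding.
function toBlockTag(blockNumber: number): string {
  return "0x" + blockNumber.toString(16);
}

// Build the payload for a single eth_getBlockReceipts call.
function buildReceiptsProbe(blockNumber: number, id = 1): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "eth_getBlockReceipts",
    params: [toBlockTag(blockNumber)],
  };
}
```

Calling `toBlockTag(40429264)` yields `0x268e6d0`, the tag used in the probes above. Sending the same payload several times to each provider is what distinguishes an intermittent null from a persistent one.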

So the block genuinely has receipts; uniblock just drops them intermittently.
A ~0.5s retry window is too short to outlast that degradation, so a transient
provider hiccup becomes a fatal crash.

Fix (tested)

getBlocks now backs off exponentially (capped at 10s, mirroring the
rpc-client's own escalating pause) and uses a larger retry budget (10), so the
retry window spans ~40s and absorbs a transient provider degradation window
instead of crash-looping after half a second. A block that is genuinely
unservable still fails loudly once the budget is exhausted, so the existing
fail-early behaviour is preserved.
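
The ~40s figure can be checked with a small calculation. The sketch below assumes the delay starts at 100ms (the previous fixed wait) and doubles on each retry; both are assumptions, chosen because they reproduce the stated window. Only the 10s cap and the budget of 10 come from this PR. The function names are hypothetical.

```typescript
// Delay before retry number `attempt` (0-based): doubles each time, capped.
// baseMs = 100 and the doubling factor are assumptions, not taken from the diff.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 10000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Total time spent waiting across `retries` retries.
function retryWindowMs(retries: number, baseMs = 100, capMs = 10000): number {
  let total = 0;
  for (let attempt = 0; attempt < retries; attempt++) {
    total += backoffDelayMs(attempt, baseMs, capMs);
  }
  return total;
}
```

Under these assumptions the ten delays are 100, 200, 400, 800, 1600, 3200 and 6400 ms followed by three capped waits of 10000 ms, so `retryWindowMs(10)` is 42700 ms. The old policy corresponds to `retryWindowMs(5, 100, 100)`, which is 500 ms.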

This is the durable, fleet-wide guard for any _isInvalid path (receipts,
traces, statediffs, replays all route through this loop), not just today's
uniblock/amoy trigger.

Verified by unit tests (in this PR): full-success, persistent-failure-throws,
quick-recover, and a new flaky-for-7-rounds-then-recovers case that would
have crashed under the old 5-retry budget but now succeeds.
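
The flaky-then-recovers case can be illustrated with a self-contained sketch. It is not the test code from this PR: `getBlocksWithBackoff`, `flakyFetcher`, the injected `sleep`, and the way retries are counted (one initial attempt plus `maxRetries` retries) are all assumptions made for the example.

```typescript
type Receipts = unknown[];
type Fetcher = (blockNumbers: number[]) => Promise<Map<number, Receipts | null>>;
type Sleep = (ms: number) => Promise<void>;

// Hypothetical retry loop: re-request only the blocks that came back null,
// waiting with capped exponential backoff between rounds.
async function getBlocksWithBackoff(
  blockNumbers: number[],
  fetchReceipts: Fetcher,
  sleep: Sleep,
  maxRetries = 10,
  baseMs = 100,
  capMs = 10000,
): Promise<Map<number, Receipts>> {
  const done = new Map<number, Receipts>();
  let pending = blockNumbers;
  for (let attempt = 0; ; attempt++) {
    const results = await fetchReceipts(pending);
    const failed: number[] = [];
    for (const n of pending) {
      const receipts = results.get(n);
      if (receipts == null) failed.push(n);
      else done.set(n, receipts);
    }
    if (failed.length === 0) return done;
    // Budget exhausted: fail loudly, as the existing behaviour does.
    if (attempt >= maxRetries) {
      throw new Error(`receipts still null for blocks ${failed.join(",")} after ${maxRetries} retries`);
    }
    await sleep(Math.min(capMs, baseMs * 2 ** attempt));
    pending = failed;
  }
}

// Test double: returns null for every block during the first `failRounds`
// calls, then returns a one-element receipts array for each block.
function flakyFetcher(failRounds: number): Fetcher {
  let calls = 0;
  return async (blockNumbers) => {
    const healthy = calls++ >= failRounds;
    const out = new Map<number, Receipts | null>();
    for (const n of blockNumbers) out.set(n, healthy ? [{ blockNumber: n }] : null);
    return out;
  };
}
```

With a provider that fails for 7 rounds, this loop recovers on the eighth call when `maxRetries` is 10 and throws when `maxRetries` is 5. Injecting `sleep` lets a test pass a no-op and run without real delays.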

Operator follow-up (not part of this PR)

Switching polygon-amoy-testnet's dump endpoint from uniblock to alchemy removes
today's trigger (alchemy serves the receipts reliably in probing), but a
provider swap is a temporary mitigation — this code change is the durable fix. The
uniblock data-quality issue (intermittent null eth_getBlockReceipts) should be
raised with the provider.

Falsification

This is the wrong fix if the dumper still crashes with eth_getBlockReceipts
returned null after deploying an evm-dump image built from this commit, i.e. if
uniblock's null responses are sustained for >~40s per block rather than
intermittent (in which case the resolution is the provider swap, not a retry
budget). Probing showed uniblock serving the same block successfully within
seconds, so the intermittent case holds.

…h-looping

getBlocks retried transiently missing or _isInvalid blocks only 5 times
with a fixed 100ms wait (~0.5s total) before throwing a fatal error. A
provider that intermittently returns null for a block that genuinely has
data — e.g. a flaky eth_getBlockReceipts that serves the same block on a
later attempt — could fail all attempts inside that 0.5s window, turning a
transient hiccup into a fatal error that crash-loops the dumper.

Add exponential backoff (capped at 10s, mirroring the rpc-client's own
escalating pause) and extend the retry budget to 10, so the retry window
spans ~40s and absorbs a transient provider degradation window. A block
that is genuinely unservable still fails loudly once the budget is
exhausted, preserving the existing fail-early behaviour.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>