evm-rpc: ride out intermittent missing/invalid blocks instead of crash-looping the dumper #500
Open
elina-chertova wants to merge 1 commit into
…h-looping

getBlocks retried transiently missing or _isInvalid blocks only 5 times with a fixed 100ms wait (~0.5s total) before throwing a fatal error. A provider that intermittently returns null for a block that genuinely has data (e.g. a flaky eth_getBlockReceipts that serves the same block on a later attempt) could fail all attempts inside that 0.5s window, turning a transient hiccup into a fatal error that crash-loops the dumper.

Add exponential backoff (capped at 10s, mirroring the rpc-client's own escalating pause) and extend the retry budget to 10, so the retry window spans ~40s and absorbs a transient provider degradation window. A block that is genuinely unservable still fails loudly once the budget is exhausted, preserving the existing fail-early behaviour.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cause (proven)
The polygon-amoy-testnet EVM dumper (evm-archive) entered CrashLoopBackOff
(15 restarts, exit 1). The fatal error was:
addReceiptsByBlock already flags a null eth_getBlockReceipts result as
_isInvalid (rpc.ts) so getBlocks can retry it. But getBlocks retried only
5 times with a fixed 100ms wait (~0.5s total) before throwing a fatal,
non-retryable error that crash-loops the dumper.
The upstream null is intermittent, not inherent to the chain. Probing block
0x268e6d0 (40429264) directly:
- uniblock eth_getBlockReceipts → returns full receipts on two attempts, null on a third (the block is served fine by eth_getBlockByNumber).
- alchemy eth_getBlockReceipts → returns the full receipt set every time.
So the block genuinely has receipts; uniblock just flakily drops them. A ~0.5s
retry window is too short to ride out that degradation window, so a transient
provider hiccup becomes a fatal crash.
Fix (tested)
getBlocks now backs off exponentially (capped at 10s, mirroring the
rpc-client's own escalating pause) and uses a larger retry budget (10), so the
retry window spans ~40s and absorbs a transient provider degradation window
instead of crash-looping after half a second. A block that is genuinely
unservable still fails loudly once the budget is exhausted, so the existing
fail-early behaviour is preserved.
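The ~40s figure can be checked with a little arithmetic. The sketch below assumes the backoff starts at the old 100ms wait and doubles each round; neither detail is stated above, only the 10s cap and the budget of 10 are. The names backoffDelayMs and backoffSchedule are hypothetical and not taken from the PR.

```typescript
// Assumed parameters: the 100 ms base and the doubling factor are guesses.
// Only the 10 s cap and the retry budget of 10 come from the description.
const BASE_MS = 100;
const CAP_MS = 10000;
const MAX_RETRIES = 10;

// Wait before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs: number = BASE_MS, capMs: number = CAP_MS): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Every wait in the schedule for a given retry budget.
function backoffSchedule(retries: number = MAX_RETRIES): number[] {
  const waits: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    waits.push(backoffDelayMs(attempt));
  }
  return waits;
}

const schedule = backoffSchedule();
const totalMs = schedule.reduce((sum, wait) => sum + wait, 0);

console.log(schedule); // [100, 200, 400, 800, 1600, 3200, 6400, 10000, 10000, 10000]
console.log(totalMs);  // 42700, i.e. roughly 40 s versus 5 * 100 ms = 500 ms before
```

Under these assumptions the cap first takes effect on the eighth wait, and the last three waits contribute 30s of the total.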
This is the durable, fleet-wide guard for any _isInvalid path (receipts,
traces, statediffs, replays all route through this loop), not just today's
uniblock/amoy trigger.
Verified by unit tests (in this PR): full-success, persistent-failure-throws,
quick-recover, and a new flaky-for-7-rounds-then-recovers case that would
have crashed under the old 5-retry budget but now succeeds.
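The flaky-then-recovers case can be pictured with a self-contained sketch. This is not the PR's getBlocks code: fetchWithBackoff, flakyProvider, the Block shape, and the 100ms base with doubling are all assumptions made for illustration, and what the PR's tests count as a "round" may differ. The sleep function is injectable so that a test does not have to wait in real time.

```typescript
// Hypothetical stand-in for a block that may come back flagged invalid.
interface Block {
  number: number;
  _isInvalid?: boolean;
}

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

interface RetryOptions {
  retries?: number;
  baseMs?: number;
  capMs?: number;
  sleep?: Sleep;
}

// Re-fetch while the block is flagged invalid, waiting
// min(capMs, baseMs * 2^attempt) between rounds. Throws once the budget is spent.
async function fetchWithBackoff(
  fetchBlock: () => Promise<Block>,
  opts: RetryOptions = {}
): Promise<Block> {
  const { retries = 10, baseMs = 100, capMs = 10000, sleep = realSleep } = opts;
  let block = await fetchBlock();
  for (let attempt = 0; block._isInvalid && attempt < retries; attempt++) {
    await sleep(Math.min(capMs, baseMs * Math.pow(2, attempt)));
    block = await fetchBlock();
  }
  if (block._isInvalid) {
    throw new Error(`block ${block.number} still invalid after ${retries} retries`);
  }
  return block;
}

// Simulated provider: the first `failures` calls return an invalid block.
function flakyProvider(failures: number): () => Promise<Block> {
  let calls = 0;
  return async () => ({ number: 40429264, _isInvalid: calls++ < failures });
}
```

With flakyProvider(7), a budget of 5 retries makes six calls in total, all of them invalid, so the function throws. A budget of 10 reaches the eighth call, which is valid, and returns the block.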
Operator follow-up (not part of this PR)
Switching polygon-amoy-testnet's dump endpoint from uniblock to alchemy removes
today's trigger (alchemy serves the receipts reliably in probing), but a
provider swap is a temporary mitigation; this code change is the durable fix. The
uniblock data-quality issue (intermittent null eth_getBlockReceipts) should be
raised with the provider.
Falsification
This is the wrong fix if the dumper still crashes with
"eth_getBlockReceipts returned null" after deploying an evm-dump image built
from this commit, i.e. if uniblock's null responses are sustained for >~40s per
block rather than intermittent (in which case the resolution is the provider
swap, not a retry budget). Probing showed uniblock serving the same block
successfully within seconds, so the intermittent case holds.