evm-rpc: retry "response too large" (-32020) instead of crash-looping the dumper #505
Open
elina-chertova wants to merge 1 commit into
… the dumper

Some RPC backends (e.g. okx xlayer-mainnet) intermittently reject a request whose response exceeds an internal size cap with JSON-RPC error code -32020 "backend response too large". This error was neither recognised as retryable by EvmRpcClient nor by reduceBatchOnRetry, so it bubbled up as a fatal error and crashed the data-service into a "data ingestion terminated, will restart" loop, stalling hotblocks ingestion.

The condition is transient/data-dependent: the exact same block fetches fine moments later, and an oversized eth_getBlockReceipts batch can simply be split.

Recognise it as a retryable connection-class error so the client retries it and reduceBatchOnRetry (which keys off isConnectionError) splits oversized batches, mirroring the existing handling of geth's "response too large" / -32000.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cause (proven)
`okx_xlayer-mainnet_Hotblocks_Critical_Lag`: the `evm-okx-xlayer-mainnet-hotblocks-service` lag grew to ~24 min and rising while the okx RPC head itself stayed current.

Pod logs (`evm-okx-xlayer-mainnet-hotblocks-service-...-pjbsk`, ns `evm-hotblocks`) show a crash-restart loop: okx returns JSON-RPC error `-32020 "backend response too large"` on `eth_getBlockReceipts`. This code/message was recognised neither by `EvmRpcClient.isConnectionError` (so it was not retried) nor by `EvmRpcClient.isResponseTooLargeError` / `isBatchRetryableError` (so an oversized batch was never split). It therefore bubbled up to `util-internal-data-service`'s `run()` as a fatal error → restart loop → growing lag.

The condition is transient/data-dependent, not a hard limit: `eth_getBlockReceipts` for the failing block `0x3c7c244` now returns HTTP 200, 4.6 MB on okx (and identically on uniblock and the public `rpc.xlayer.tech`). This is the exact same call that returned `-32020` during the incident.

This is the same class of bug as #500/#501/#503: an intermittent upstream response should not crash the dumper.
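To make the failure mode concrete, here is a minimal sketch of how an unrecognised error code ends up classified as fatal. The `RpcError` shape, `isRetryableBefore`, and `classify` are hypothetical names for illustration only and are not the actual `EvmRpcClient` code. The pre-fix predicate is assumed to match only the geth-style `-32000` case.

```typescript
// Hypothetical JSON-RPC error shape (illustration only, not the EvmRpcClient types).
interface RpcError {
    code: number
    message: string
}

// Assumed pre-fix behaviour: only geth's -32000 "response too large" is recognised.
function isRetryableBefore(err: RpcError): boolean {
    return err.code === -32000 && /response too large/i.test(err.message)
}

// A retry loop retries what the predicate accepts and rethrows everything else.
function classify(
    err: RpcError,
    isRetryable: (e: RpcError) => boolean
): 'retry' | 'fatal' {
    return isRetryable(err) ? 'retry' : 'fatal'
}

const gethError: RpcError = {code: -32000, message: 'response too large'}
const okxError: RpcError = {code: -32020, message: 'backend response too large'}

console.log(classify(gethError, isRetryableBefore)) // retry
console.log(classify(okxError, isRetryableBefore)) // fatal
```

Under these assumptions the okx error falls through to the fatal branch even though its message contains "response too large", because the code does not match.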
Fix
Recognise `-32020` / "response too large" as a retryable connection-class error in `EvmRpcClient` (new `isResponseTooLargeError`, wired into `isConnectionError`), mirroring the existing `isUpstreamUnavailableError` (#501) and the geth `"response too large"` / `-32000` handling. Because `reduceBatchOnRetry` keys off `isConnectionError`, this also lets an oversized `eth_getBlockReceipts` batch be split in half until it fits, so both the intermittent single-call case (retry) and the oversized-batch case (split) are handled instead of crashing.

Falsification

- […] `-32020` after this change, the predicate isn't reached on the failing path (the fix would be wrong).
- […] `eth_getBlockReceipts` consistently `-32020`, not intermittent), retry/split cannot help and the durable answer is a per-tx receipts fallback or a different provider for that block, not this change. Probing showed the single-block call succeeds (4.6 MB), so that is not the current situation.

Note: a separate operator step is needed to roll a new `subsquid/evm-data-service` image carrying this commit into the infra deployment (currently pinned at `e56f20a9`).
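The retry/split behaviour described under Fix can be sketched roughly as follows. This is not the code in this PR: `isResponseTooLarge` and `fetchWithSplit` are hypothetical stand-ins for the real `isResponseTooLargeError` and `reduceBatchOnRetry`, whose signatures are not shown here. The retry budget for a single-item batch is also an assumption.

```typescript
interface RpcError {
    code: number
    message: string
}

// Hypothetical predicate: -32020, or geth-style -32000 "response too large".
function isResponseTooLarge(err: unknown): err is RpcError {
    if (typeof err !== 'object' || err === null) return false
    const e = err as Partial<RpcError>
    if (e.code === -32020) return true
    return (
        e.code === -32000 &&
        typeof e.message === 'string' &&
        /response too large/i.test(e.message)
    )
}

// Fetch a batch; on "response too large", split it in half and fetch each half.
// A batch of one item cannot be split, so it is retried a bounded number of times.
async function fetchWithSplit<T, R>(
    items: T[],
    fetchBatch: (batch: T[]) => Promise<R[]>,
    retriesLeft = 3 // assumed budget for the single-item case
): Promise<R[]> {
    try {
        return await fetchBatch(items)
    } catch (err) {
        if (!isResponseTooLarge(err)) throw err // unrelated errors stay fatal
        if (items.length <= 1) {
            if (retriesLeft <= 0) throw err
            return fetchWithSplit(items, fetchBatch, retriesLeft - 1)
        }
    }
    // Oversized batch: halve it and preserve the original order of results.
    const mid = Math.ceil(items.length / 2)
    const left = await fetchWithSplit(items.slice(0, mid), fetchBatch, retriesLeft)
    const right = await fetchWithSplit(items.slice(mid), fetchBatch, retriesLeft)
    return [...left, ...right]
}
```

Here `fetchBatch` would wrap something like a batched `eth_getBlockReceipts` call. The sketch retries immediately; a real client would normally back off between attempts.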