
evm-rpc: retry "response too large" (-32020) instead of crash-looping the dumper#505

Open
elina-chertova wants to merge 1 commit into master from alert-fix/1pUdAQ-okx-xlayer-response-too-large


@elina-chertova


Cause (proven)

okx_xlayer-mainnet_Hotblocks_Critical_Lag — the evm-okx-xlayer-mainnet-hotblocks-service lag grew to ~24 min and rising while the okx RPC head itself stayed current.

Pod logs (evm-okx-xlayer-mainnet-hotblocks-service-...-pjbsk, ns evm-hotblocks) show a crash-restart loop:

RpcError: backend response too large
    at validateError (/squid/evm/evm-rpc/lib/rpc.js:246:23)
  code: -32020, rpcUrl: https://xlayerrpc.okx.com/, rpcMethod: eth_getBlockReceipts
"data ingestion terminated, will restart in 30 seconds"

okx returns JSON-RPC error -32020 "backend response too large" on eth_getBlockReceipts. This code/message was recognised neither by EvmRpcClient.isConnectionError (so it was not retried) nor by EvmRpcClient.isResponseTooLargeError/isBatchRetryableError (so an oversized batch was never split). It therefore bubbled up to util-internal-data-service's run() as a fatal error → restart loop → growing lag.

The condition is transient/data-dependent, not a hard limit:

  • A single eth_getBlockReceipts for the failing block 0x3c7c244 now returns HTTP 200, 4.6 MB on okx (and identically on uniblock and the public rpc.xlayer.tech) — the exact same call that was -32020 during the incident.
  • Lag self-recovered (from 1.47M ms, about 24.5 min, to ~1 s) once okx's backend stopped returning the error, confirming it is intermittent rather than a permanently-too-large block.

This is the same class of bug as #500/#501/#503: an intermittent upstream response should not crash the dumper.

Fix

Recognise -32020 / "response too large" as a retryable connection-class error in EvmRpcClient (new isResponseTooLargeError, wired into isConnectionError), mirroring the existing isUpstreamUnavailableError (#501) and the geth "response too large" / -32000 handling. Because reduceBatchOnRetry keys off isConnectionError, this also lets an oversized eth_getBlockReceipts batch be split in half until it fits — so both the intermittent single-call case (retry) and the oversized-batch case (split) are handled instead of crashing.
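
A minimal sketch of what such a predicate could look like, not the code in this commit. Only the names isResponseTooLargeError and isConnectionError, the code -32020 and the message text come from the description above; the RpcErrorLike shape, the regex match and the isOtherConnectionError stand-in are assumptions made for illustration, and the real RpcError class in evm-rpc may expose different fields.

```typescript
// Hypothetical error shape; the real RpcError in evm-rpc may differ.
interface RpcErrorLike {
    code?: number
    message?: string
}

// JSON-RPC error code okx returned during the incident.
const RESPONSE_TOO_LARGE_CODE = -32020

// True for -32020, or for any error whose message contains
// "response too large" (this also covers geth's -32000 variant).
function isResponseTooLargeError(err: unknown): boolean {
    if (typeof err !== 'object' || err === null) return false
    const {code, message} = err as RpcErrorLike
    if (code === RESPONSE_TOO_LARGE_CODE) return true
    return typeof message === 'string' && /response too large/i.test(message)
}

// Stand-in for the pre-existing connection-class checks (assumption).
function isOtherConnectionError(_err: unknown): boolean {
    return false
}

// Wiring the new predicate into the connection-class check is what would
// make the error retryable and eligible for batch reduction.
function isConnectionError(err: unknown): boolean {
    return isResponseTooLargeError(err) || isOtherConnectionError(err)
}
```

Matching on both the code and the message is a deliberate choice in this sketch: providers that reuse a generic code such as -32000 would still be caught by the message text.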

Falsification

  • If the dumper still crash-loops on -32020 after this change, the predicate isn't reached on the failing path (the fix would be wrong).
  • If a single block's receipts genuinely exceed okx's cap persistently (single-call eth_getBlockReceipts consistently -32020, not intermittent), retry/split cannot help and the durable answer is a per-tx receipts fallback or a different provider for that block — not this change. Probing showed the single-block call succeeds (4.6 MB), so that is not the current situation.
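
The split behaviour, and the second falsification case, can be illustrated with a small sketch. This is not the actual reduceBatchOnRetry implementation: fetchWithSplit, its parameters and the synchronous signature are hypothetical simplifications (real RPC calls are asynchronous, and the real client would also retry a single call before giving up).

```typescript
// Hypothetical helper: halve a batch on a retryable error until it fits.
function fetchWithSplit<T, R>(
    items: T[],
    fetchBatch: (batch: T[]) => R[],
    isRetryable: (err: unknown) => boolean
): R[] {
    if (items.length === 0) return []
    try {
        return fetchBatch(items)
    } catch (err) {
        // A single-item batch cannot be split further, so a persistent
        // failure for one block surfaces here instead of looping forever.
        if (!isRetryable(err) || items.length === 1) throw err
        const mid = Math.ceil(items.length / 2)
        return [
            ...fetchWithSplit(items.slice(0, mid), fetchBatch, isRetryable),
            ...fetchWithSplit(items.slice(mid), fetchBatch, isRetryable),
        ]
    }
}
```

With a backend that rejects any batch larger than two items, a five-item request ends up being served as batches of two, one and two items. A backend that rejects even a one-item batch makes the helper rethrow, which corresponds to the case where splitting cannot help and a per-tx receipts fallback or another provider would be needed.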

Note: a separate operator step is needed to roll a new subsquid/evm-data-service image carrying this commit into the infra deployment (currently pinned at e56f20a9).

… the dumper

Some RPC backends (e.g. okx xlayer-mainnet) intermittently reject a request
whose response exceeds an internal size cap with JSON-RPC error code -32020
"backend response too large". This error was recognised as retryable neither
by EvmRpcClient nor by reduceBatchOnRetry, so it bubbled up as a fatal error
and crashed the data-service into a "data ingestion terminated, will restart"
loop, stalling hotblocks ingestion.

The condition is transient/data-dependent: the exact same block fetches fine
moments later, and an oversized eth_getBlockReceipts batch can simply be split.
Recognise it as a retryable connection-class error so the client retries it and
reduceBatchOnRetry (which keys off isConnectionError) splits oversized batches,
mirroring the existing handling of geth's "response too large" / -32000.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>