rpc-client: retry transient 'no available provider' errors instead of crashing #504

Open
elina-chertova wants to merge 1 commit into master from alert-fix/jvqYTR-retry-provider-unavailable

@elina-chertova

Contributor

See cause/remedy below.

Cause (proven)

The evm-dump pod dump-hyperliquid-testnet-0 (namespace evm-archive) was in a crash-loop (512 restarts, exit code 1 every ~50s). Each crash was:

RpcError: Errors from the following providers prevented the request from being fulfilled: dRPC, Alchemy.
  at validateError (evm/evm-rpc/lib/rpc.js:97)
  ...
code: -32503
data: { DRPC: { error: { code: 10, message: "User balance exceeded" } },
        Alchemy: { error: { code: -32001, message: "Unable to complete request at this time." } } }
rpcMethod: eth_getBlockByNumber

The configured endpoint is a uniblock aggregator (chainId 998). When all of its upstream providers momentarily fail, uniblock returns this -32503 error. RpcClient.isConnectionError() did not recognise it (it only matches rate-limit / execution-timeout / request-timed-out / connection / HTTP 4xx-5xx errors), so it escaped the retry machinery and propagated to the caller.

evm-dump builds its client with retryAttempts: Number.MAX_SAFE_INTEGER (evm/evm-dump/src/dumper.ts), and hyperliquid-testnet uses batch_limit: 1, so the request goes straight to batchCall whose only retry gate is isConnectionError. A non-retried error there therefore terminates the process → CrashLoopBackOff → the Dumper_Pod_Restarts alert.
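
To illustrate why a very large retryAttempts does not help here, the following is a minimal sketch of a retry gate, not the actual batchCall code. The names decideRetry, RetryDecision, baseDelayMs and maxDelayMs are hypothetical, and the backoff values are assumed for illustration only.

```typescript
// Hypothetical names throughout; this is not the RpcClient code from this repository.
type RetryDecision =
  | { kind: "retry"; delayMs: number }
  | { kind: "throw" };

function decideRetry(
  err: unknown,
  attempt: number, // 0-based count of failures so far
  retryAttempts: number,
  isRetryable: (err: unknown) => boolean,
  baseDelayMs: number = 100, // assumed value
  maxDelayMs: number = 5000 // assumed value
): RetryDecision {
  // The predicate is checked first: an error it does not recognise is
  // rethrown immediately, however large retryAttempts is.
  if (!isRetryable(err)) return { kind: "throw" };
  if (attempt >= retryAttempts) return { kind: "throw" };
  // Capped exponential backoff; the exponent is clamped to keep the value finite.
  const delayMs = Math.min(
    maxDelayMs,
    baseDelayMs * Math.pow(2, Math.min(attempt, 30))
  );
  return { kind: "retry", delayMs: delayMs };
}
```

With a predicate that returns false for the -32503 error, the first failure already yields { kind: "throw" }, which matches the crash-on-first-error behavior described above.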

Remedy (tested)

The error is transient: it is the JSON-RPC analog of an HTTP 503 ("no upstream could serve this right now"), which isConnectionError already retries. I re-probed the exact failing request (eth_getBlockByNumber 0x34d7f6d withTx=true) against the same uniblock endpoint, and 4 of 4 attempts succeeded, confirming that a single retry would have recovered.

This PR makes isConnectionError treat the aggregator "providers prevented the request from being fulfilled" error as a connection error, so it is retried with backoff like the sibling transient errors, instead of crashing the dump. The fix is central (one chokepoint all RPC methods flow through), so it also protects receipts/traces calls, not just eth_getBlockByNumber.
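
As a rough sketch of the kind of check this adds (the diff itself is the source of truth), the predicate below matches the aggregator error by its message text. RpcErrorLike, PROVIDERS_UNAVAILABLE_RE and isProvidersUnavailable are hypothetical names, and matching on the message rather than on the -32503 code is an assumption made for this example.

```typescript
// Hypothetical minimal shape of the error from the crash log;
// the real RpcError class may carry more fields.
interface RpcErrorLike {
  code?: number;
  message?: string;
}

// Message fragment taken from the crash log in this PR description.
const PROVIDERS_UNAVAILABLE_RE =
  /providers prevented the request from being fulfilled/i;

// Returns true when the error looks like the aggregator's
// "all upstream providers failed" response.
function isProvidersUnavailable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const message = (err as RpcErrorLike).message;
  return typeof message === "string" && PROVIDERS_UNAVAILABLE_RE.test(message);
}
```

A check of this shape would sit alongside the existing rate-limit and timeout checks, so that the error takes the same backoff path as the other transient errors.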

The underlying provider degradation (dRPC "User balance exceeded", Alchemy "Unable to complete request") is being handled separately as an operator/provider escalation; per our policy a provider swap is a temporary mitigation, while this is the durable code fix that prevents a momentary provider hiccup from crash-looping ingestion.

Falsification

If the dump still crash-loops on this exact -32503 after this change, or if the provider error is in fact sustained (not transient) — i.e. the same request fails on repeated retries against an independent node implementation — then retrying is not the right behavior and the resolution is purely the provider escalation. Re-probing showed the request succeeding on retry, so retry is correct here.

RPC aggregators (e.g. uniblock) reply with an RpcError whose message is
"Errors from the following providers prevented the request from being
fulfilled: ..." when none of their upstream providers can serve a request
at that moment. This is the JSON-RPC analog of an HTTP 503 and is
transient, but isConnectionError() did not recognise it, so the error
escaped the retry machinery and propagated to the caller.

For ingestion processes that set retryAttempts to a high value (e.g.
evm-dump), a single momentary upstream hiccup therefore terminated the
process instead of being retried, producing a crash-loop. Treat it as a
connection error so it is retried with backoff like rate-limit/timeout
errors already are.