rpc-client: retry transient 'no available provider' errors instead of crashing #504
Open
elina-chertova wants to merge 1 commit into
RPC aggregators (e.g. uniblock) reply with an RpcError whose message is "Errors from the following providers prevented the request from being fulfilled: ..." when none of their upstream providers can serve a request at that moment. This is the JSON-RPC analog of an HTTP 503 and is transient, but isConnectionError() did not recognise it, so the error escaped the retry machinery and propagated to the caller. For ingestion processes that set retryAttempts to a high value (e.g. evm-dump), a single momentary upstream hiccup therefore terminated the process instead of being retried, producing a crash-loop. Treat it as a connection error so it is retried with backoff like rate-limit/timeout errors already are.
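A minimal sketch of the kind of check described above, assuming the error exposes a string `message`. The names `RpcErrorLike`, `NO_PROVIDER_MESSAGE` and `isNoAvailableProviderError` are hypothetical and are not taken from the rpc-client source; only the matched message fragment comes from this description.

```typescript
// Hypothetical error shape for illustration; the real RpcError class may differ.
interface RpcErrorLike {
  code?: number;
  message: string;
}

// Fragment of the aggregator message quoted in this PR description.
const NO_PROVIDER_MESSAGE = "providers prevented the request from being fulfilled";

// Returns true when the error looks like the aggregator's
// "no upstream provider could serve this request" reply.
function isNoAvailableProviderError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const message = (err as Partial<RpcErrorLike>).message;
  return typeof message === "string" && message.includes(NO_PROVIDER_MESSAGE);
}
```

A check like this would sit alongside the existing rate-limit and timeout checks, so that one predicate decides whether an error is retried.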
See cause/remedy below.
Cause (proven)
The `evm-dump` pod `dump-hyperliquid-testnet-0` (namespace `evm-archive`) was in a crash-loop (512 restarts, exit code 1 every ~50s). Each crash was caused by the same `RpcError`: code `-32503`, message "Errors from the following providers prevented the request from being fulfilled: ...".

The configured endpoint is a uniblock aggregator (chainId 998). When all of its upstream providers momentarily fail, uniblock returns this `-32503` error. `RpcClient.isConnectionError()` did not recognise it (it only matches rate-limit / execution-timeout / request-timed-out / connection / HTTP 4xx-5xx errors), so it escaped the retry machinery and propagated to the caller.

`evm-dump` builds its client with `retryAttempts: Number.MAX_SAFE_INTEGER` (`evm/evm-dump/src/dumper.ts`), and hyperliquid-testnet uses `batch_limit: 1`, so the request goes straight to `batchCall`, whose only retry gate is `isConnectionError`. A non-retried error there therefore terminates the process → CrashLoopBackOff → the `Dumper_Pod_Restarts` alert.

Remedy (tested)
The error is transient: it is the JSON-RPC analog of an HTTP 503 ("no upstream could serve this right now"), which `isConnectionError` already retries. I re-probed the exact failing request (`eth_getBlockByNumber 0x34d7f6d withTx=true`) against the same uniblock endpoint and 4/4 attempts succeeded, confirming a single retry would have recovered.

This PR makes `isConnectionError` treat the aggregator "providers prevented the request from being fulfilled" error as a connection error, so it is retried with backoff like the sibling transient errors, instead of crashing the dump. The fix is central (one chokepoint all RPC methods flow through), so it also protects receipts/traces calls, not just `eth_getBlockByNumber`.

The underlying provider degradation (dRPC "User balance exceeded", Alchemy "Unable to complete request") is being handled separately as an operator/provider escalation; per our policy a provider swap is a temporary mitigation, while this is the durable code fix that prevents a momentary provider hiccup from crash-looping ingestion.
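To illustrate why a single predicate acts as the retry gate, here is a rough sketch of a retry loop with capped exponential backoff. It is not the rpc-client implementation: `retryWithBackoff`, `backoffDelayMs`, `baseDelayMs` and `maxDelayMs` are hypothetical names, and the actual backoff schedule used by the client is not stated in this PR. Only `retryAttempts` is a name taken from the description.

```typescript
// Hypothetical helper: delay before retry number `attempt` (0-based), doubling up to a cap.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  // Clamp the exponent so very large attempt counts cannot overflow to Infinity.
  return Math.min(maxDelayMs, baseDelayMs * 2 ** Math.min(attempt, 30));
}

// Hypothetical retry loop: `isRetryable` plays the role of the retry gate.
// Errors it rejects, or errors seen after `retryAttempts` retries, are rethrown.
async function retryWithBackoff<T>(
  call: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  opts: { retryAttempts: number; baseDelayMs: number; maxDelayMs: number },
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await call();
    } catch (err) {
      if (!isRetryable(err) || attempt >= opts.retryAttempts) throw err;
      const delay = backoffDelayMs(attempt, opts.baseDelayMs, opts.maxDelayMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
      attempt += 1;
    }
  }
}
```

In a loop of this shape, an error the gate does not recognise is rethrown on the first failure regardless of how large `retryAttempts` is, which matches the crash-loop behavior described under Cause.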
Falsification
If the dump still crash-loops on this exact `-32503` after this change, or if the provider error is in fact sustained rather than transient (i.e. the same request fails on repeated retries against an independent node implementation), then retrying is not the right behavior and the resolution is purely the provider escalation. Re-probing showed the request succeeding on retry, so retry is correct here.