[NET-541] [Alert rUUC7K] uniblock_hyperliquid-mainnet_Hotblocks_Critical_Lag #480

Open
elina-chertova wants to merge 1 commit into open-beta from alert-fix/ruuc7k-uniblock-hyperliquid-mainnet-hotblocks-critical-lag-squid-sdk

@elina-chertova

Automated fix proposal for alert rUUC7K.

  • Alert: uniblock_hyperliquid-mainnet_Hotblocks_Critical_Lag
  • Base branch: open-beta
  • Investigation: /root/alert/incident-agent/agent-system/data/investigations/rUUC7K
  • Report: /root/alert/incident-agent/agent-system/data/investigations/rUUC7K/report.html

Reviewer quick view

  • Scope: 1 file(s) in evm

  • Root cause (agent): not explicitly captured

  • Summary: FINAL RESPONSE — uniblock_hyperliquid-mainnet_Hotblocks_Critical_Lag
    (rUUC7K)

    Fix class: rca_fix
    Confidence: medium
    Evidence basis: code, metrics, logs
    Falsification: If Uniblock shows a confirmed RPC outage at 01:40–01:55Z on
    2026-05-18, the spike was a pure timeout; the SDK fix is still latently
    correct but not the cause of this alert.
    Follow-up: Check the Uniblock status page for that window; recover the
    spike block number.

    Verdict: accept_with_changes

    Plan stop condition: met

    The execution plan stop condition was "chain-utils.ts still has
    assertNotNull → SDK fix required." [evidence] Code audit confirms
    assertNotNull(txByHash.get(log.transactionHash)) remains in the Hyperliquid
    branch of calculateLogsBloom in open-beta SHA a6fccb11. Image 9a7fea7
    (2026-05-07) postdates the last chain-utils.ts change (2026-04-16), so it
    ships the unfixed code.

    Track A — Safe action now

    PR to subsquid/squid-sdk (open-beta) — patch already on disk at
    fixes/proposed/evm/evm-rpc/src/chain-utils.ts:

     if (this.isHyperliquidMainnet || this.isHyperliquidTestnet) {
         let txByHash = new Map(transactions.map(tx => [getTxHash(tx), tx]))
         logs = logs.filter(log => {
    -        let tx = assertNotNull(txByHash.get(log.transactionHash))
    -        return !isHyperliquidSystemTx(tx)
    +        let tx = txByHash.get(log.transactionHash)
    +        // Hyperliquid hidden system txs absent from block.transactions — skip
    +        if (tx == null) return false
    +        return !isHyperliquidSystemTx(tx)
         })
     }

    The assertNotNull import stays — still used in the Polygon branch.
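
The behavioural change in the patch can be illustrated with a minimal, self-contained sketch. The `Tx` and `Log` shapes, the `isSystem` flag, and the helper names `isSystemTx` and `filterLogs` are hypothetical stand-ins invented for this example; the real SDK types and the body of `isHyperliquidSystemTx` are not shown in this PR.

```typescript
// Hypothetical minimal shapes; the real SDK types carry many more fields.
interface Tx {
    hash: string
    isSystem: boolean
}

interface Log {
    transactionHash: string
}

// Stand-in for the SDK's isHyperliquidSystemTx (assumption: its real logic differs).
function isSystemTx(tx: Tx): boolean {
    return tx.isSystem
}

// Mirrors the patched filter: a log whose transaction is missing from the
// block is dropped instead of triggering an assertion failure.
function filterLogs(transactions: Tx[], logs: Log[]): Log[] {
    const txByHash = new Map(transactions.map(tx => [tx.hash, tx] as const))
    return logs.filter(log => {
        const tx = txByHash.get(log.transactionHash)
        if (tx == null) return false
        return !isSystemTx(tx)
    })
}

// '0xa' is a normal tx, '0xb' is a visible system tx, '0xc' is absent from the block.
const kept = filterLogs(
    [{hash: '0xa', isSystem: false}, {hash: '0xb', isSystem: true}],
    [{transactionHash: '0xa'}, {transactionHash: '0xb'}, {transactionHash: '0xc'}]
)
```

Only the log for `0xa` survives. Under the pre-patch logic, the log for `0xc` would have hit the assertion rather than being skipped.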

    Track B — Root cause ambiguity

    Two explanations fit the ~135s spike equally well:

    ┌──────────────────┬────────────────────────────────────┬──────────────────┐
    │ Hypothesis       │ Evidence for                       │ Against          │
    ├──────────────────┼────────────────────────────────────┼──────────────────┤
    │ assertNotNull on │ Confirmed in source; known 8LVGbY  │ 11 days / 0      │
    │ hidden system tx │ pattern [memory:8LVGbY]            │ restarts; only   │
    │                  │                                    │ one spike        │
    ├──────────────────┼────────────────────────────────────┼──────────────────┤
    │ Uniblock RPC     │ --http-rpc-timeout 30000 × ~4–5    │ No provider      │
    │ timeout cascade  │ retries = 120–150s ≈ 135s          │ status data      │
    │                  │ [evidence]                         │                  │
    └──────────────────┴────────────────────────────────────┴──────────────────┘
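
The timeout arithmetic in the second row can be checked with a short sketch. It assumes every attempt runs the full timeout before failing and that there is no backoff between attempts. The attempt count of 4 to 5 is the report's estimate, and `stallSeconds` is a hypothetical helper name.

```typescript
// Worst-case stall if each attempt waits out the whole timeout (assumption: no backoff).
function stallSeconds(timeoutMs: number, attempts: number): number {
    return (timeoutMs * attempts) / 1000
}

const timeoutMs = 30000 // value passed to --http-rpc-timeout, per the table
const low = stallSeconds(timeoutMs, 4)
const high = stallSeconds(timeoutMs, 5)

// The observed spike of roughly 135 s falls inside the computed window.
const observedSpikeSeconds = 135
const fits = observedSpikeSeconds >= low && observedSpikeSeconds <= high
```

Under these assumptions the window is 120 to 150 seconds, which is consistent with the spike but does not by itself rule out the assertNotNull hypothesis.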

    The SDK fix is right regardless: it eliminates the latent assertNotNull
    defect. The alert self-resolved and the pod is healthy now.
    tuning_confidence: likely

    Post-deploy signal: sqd_hotblocks_lag_ms ≤5s continuously for 7+ days with
    no spikes and no AssertionError lines in logs.
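
The post-deploy signal could be evaluated roughly as follows. This is a sketch under stated assumptions: the `LagSample` shape, the `postDeployHealthy` helper, and the idea of feeding it exported samples of sqd_hotblocks_lag_ms plus raw log lines are all hypothetical, and the check does not detect gaps between samples.

```typescript
// Hypothetical sample shape for exported sqd_hotblocks_lag_ms readings.
interface LagSample {
    timestampMs: number
    lagMs: number
}

const MAX_LAG_MS = 5_000 // "≤5s" threshold from the post-deploy signal
const MIN_WINDOW_MS = 7 * 24 * 60 * 60 * 1000 // "7+ days"

// True only if every sample stays under the threshold, the samples span at
// least seven days, and no log line mentions AssertionError.
function postDeployHealthy(samples: LagSample[], logLines: string[]): boolean {
    if (samples.length === 0) return false
    let first = Infinity
    let last = -Infinity
    for (const s of samples) {
        if (s.lagMs > MAX_LAG_MS) return false
        first = Math.min(first, s.timestampMs)
        last = Math.max(last, s.timestampMs)
    }
    if (last - first < MIN_WINDOW_MS) return false
    return !logLines.some(line => line.includes('AssertionError'))
}

// Illustrative data: hourly samples covering exactly seven days.
const hour = 60 * 60 * 1000
const steady: LagSample[] = Array.from({length: 7 * 24 + 1}, (_, i) => ({
    timestampMs: i * hour,
    lagMs: 1200,
}))
const ok = postDeployHealthy(steady, ['block processed'])
const spiky = postDeployHealthy(
    [...steady.slice(0, -1), {timestampMs: 7 * 24 * hour, lagMs: 135_000}],
    []
)
```

Here `ok` is true for the steady series, while a single 135 s sample makes `spiky` false.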

Fix metadata

  • Fix class: rca_fix
  • Confidence: medium
  • Evidence basis: code, metrics, logs
  • Falsification: If Uniblock shows a confirmed RPC outage at 01:40–01:55Z on 2026-05-18, the spike was a pure timeout; the SDK fix is still latently correct but not the cause of this alert.
  • Follow-up: Check the Uniblock status page for that window; recover the spike block number.

(Generated by the terminal-debate agent. Values reflect the agent's self-assessment, not a verified verdict. Use them as a starting point for review.)

Risk & rollout

  • Suggested rollout: canary / one-network-first, then broader rollout after signal is stable.
  • Rollback: revert this PR (or restore previous config values/files) if the incident signal worsens.

Reproduction status

Incident behavior was reproduced or corroborated strongly enough for a non-hypothesis fix proposal.

Validation checklist

  • Verify the original incident signal improves (logs/metrics/alerts) after deploy.
  • Verify no regression on sibling networks/providers/services touched by this change.
  • Confirm queue / delivery pipeline status returns to expected steady state.

Changed files

  • evm/evm-rpc/src/chain-utils.ts

Notify

cc @tmcgroul (automation opened this PR.)
