fix(backend): distributed-lock fencing, metrics on RPC failure, webhook retry consolidation, yield history tests #1292
Open
dikachim060-wq wants to merge 4 commits into
…for lock release (LabsCrypt#1163)

acquireLock now stores the unique token it set in Redis as a private instance variable. releaseLock calls a new cacheService.deleteIfEqual that evaluates a Lua compare-and-delete atomically, so a run that outlives the 10-minute LOCK_TTL cannot delete a lock acquired by a subsequent instance.

Co-Authored-By: Claude <noreply@anthropic.com>
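For reviewers, here is a minimal sketch of the compare-and-delete this commit describes. The Lua script is the commonly used Redis pattern for releasing a lock only when the caller still holds it; whether deleteIfEqual uses exactly this script is an assumption. FakeLockStore, staleReleaseScenario, and the key name are hypothetical and model Redis in memory so the scenario runs without a server.

```typescript
// Lua script for an atomic compare-and-delete: delete KEYS[1] only if it still holds ARGV[1].
// Redis runs a script atomically, so no other client can change the key between GET and DEL.
const DELETE_IF_EQUAL_LUA = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end`;

// Hypothetical in-memory stand-in for Redis, used only to illustrate the semantics.
class FakeLockStore {
  private data = new Map<string, string>();

  // Models SET key value NX: succeeds only when the key is absent.
  setIfAbsent(key: string, token: string): boolean {
    if (this.data.has(key)) return false;
    this.data.set(key, token);
    return true;
  }

  // Models the Lua script above: returns true only if this call removed the key.
  deleteIfEqual(key: string, token: string): boolean {
    if (this.data.get(key) !== token) return false;
    this.data.delete(key);
    return true;
  }

  // Models TTL expiry by dropping the key regardless of owner.
  expire(key: string): void {
    this.data.delete(key);
  }

  get(key: string): string | undefined {
    return this.data.get(key);
  }
}

// Scenario: run A outlives the TTL, run B acquires the lock, then A releases late.
function staleReleaseScenario(): { aReleased: boolean; holder: string | undefined } {
  const store = new FakeLockStore();
  const key = "lock:check_defaults"; // assumed key name
  store.setIfAbsent(key, "token-A");
  store.expire(key); // A's lock times out while A is still running
  store.setIfAbsent(key, "token-B"); // B acquires a fresh lock
  const aReleased = store.deleteIfEqual(key, "token-A"); // A's late release is a no-op
  return { aReleased, holder: store.get(key) };
}
```

With an unconditional DEL, the last step would remove B's lock and let a third run start alongside B. With the token check, A's late release leaves B's lock in place.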
…nt false-healthy readings (LabsCrypt#1201)

getLatestLedgerSequence now returns null instead of 0 on error, and increments a new indexer_rpc_fetch_errors_total counter. pollOnce skips recordIndexerLedgers when the tip is null, so an RPC outage no longer resets indexer_lag_ledgers to 0 and silently defeats the behind-chain-tip alert.

Co-Authored-By: Claude <noreply@anthropic.com>
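The null-tip handling in this commit can be sketched as follows. This is a simplified, synchronous model and not the PR's code: IndexerMetrics stands in for the Prometheus gauge and counter, the fetchTip callback is hypothetical, and computing lag as tip minus last indexed ledger is an assumption.

```typescript
// Hypothetical in-memory metrics; plain fields stand in for the Prometheus metrics.
interface IndexerMetrics {
  lagLedgers: number | undefined; // models the indexer_lag_ledgers gauge
  rpcFetchErrors: number; // models the indexer_rpc_fetch_errors_total counter
}

// Returns the chain tip, or null when the RPC call fails (instead of a misleading 0).
function getLatestLedgerSequence(
  fetchTip: () => number,
  metrics: IndexerMetrics
): number | null {
  try {
    return fetchTip();
  } catch {
    metrics.rpcFetchErrors += 1;
    return null;
  }
}

// One polling step: the gauge is only updated when a real tip was fetched.
function pollOnce(
  lastIndexed: number,
  fetchTip: () => number,
  metrics: IndexerMetrics
): void {
  const tip = getLatestLedgerSequence(fetchTip, metrics);
  if (tip === null) return; // keep the last known lag; do not record 0
  metrics.lagLedgers = tip - lastIndexed; // assumed lag definition
}
```

Returning 0 on failure makes the reported lag look healthy during exactly the outage the alert should catch. Returning null keeps the gauge at its last real value, and the error counter gives the outage its own signal.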
…s and XDR share-decoding (LabsCrypt#1171)

Adds three tests:
- Withdraw with value=null verifying cost-basis is reduced proportionally (shareRatio = withdrawn/total shares, costBasis *= 1 - shareRatio)
- Deposit + Withdraw with a non-null base64 XDR ScVec value asserting shares are decoded and BigInt values converted via the parseDepositWithdrawAmounts path
- EmergencyWithdraw encoded in XDR following the same cost-basis path

Co-Authored-By: Claude <noreply@anthropic.com>
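The proportional cost-basis rule quoted in this commit (shareRatio = withdrawn/total shares, costBasis *= 1 - shareRatio) can be worked through with a small sketch. Position and applyWithdraw are hypothetical names, and clamping the ratio at 1 is an added assumption, not something the PR states.

```typescript
// Hypothetical position shape; the service's real types are not shown in the PR.
interface Position {
  shares: number;
  costBasis: number;
}

// Applies a withdrawal by reducing cost basis in proportion to the shares withdrawn.
function applyWithdraw(pos: Position, withdrawnShares: number): Position {
  if (pos.shares <= 0) return { shares: 0, costBasis: 0 };
  // Assumption: a withdrawal larger than the position is treated as a full exit.
  const shareRatio = Math.min(1, withdrawnShares / pos.shares);
  return {
    shares: pos.shares - pos.shares * shareRatio,
    costBasis: pos.costBasis * (1 - shareRatio),
  };
}

// Withdrawing 25 of 100 shares gives shareRatio = 0.25, so a cost basis of 1000 becomes 750.
const afterWithdraw = applyWithdraw({ shares: 100, costBasis: 1000 }, 25);
```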
…ocessor as the live retry path (LabsCrypt#1167)

startWebhookRetryProcessor (which drives WebhookService.processRetries) is now wired in index.ts in place of startWebhookRetryScheduler. This makes the retry schedule single-sourced through processRetries, which respects the next_retry_at timestamps stored by sendToWebhook (RETRY_DELAYS_MS: 5m/15m/45m, MAX_RETRY_ATTEMPTS=4). The old scheduler's independent BACKOFF=[60,300,1800]s that ignored next_retry_at is no longer the live path.

Co-Authored-By: Claude <noreply@anthropic.com>
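A rough sketch of what a single-sourced schedule means here: the sender computes and stores next_retry_at, and the processor only picks up deliveries whose stored timestamp has passed. The delay values and attempt limit come from this commit message; computeNextRetryAt, dueForRetry, the Delivery shape, and the mapping of the Nth failure to the Nth delay are assumptions made for illustration.

```typescript
const RETRY_DELAYS_MS = [5 * 60 * 1000, 15 * 60 * 1000, 45 * 60 * 1000]; // 5m / 15m / 45m
const MAX_RETRY_ATTEMPTS = 4;

// Hypothetical delivery record; nextRetryAt models the stored next_retry_at column.
interface Delivery {
  id: string;
  failedAttempts: number;
  nextRetryAt: number | null; // epoch ms, or null when no further retry is scheduled
}

// Sender side: after the Nth failure, schedule the next attempt or give up.
// Assumption: failure 1 waits 5m, failure 2 waits 15m, failure 3 waits 45m, failure 4 stops.
function computeNextRetryAt(failedAttempts: number, nowMs: number): number | null {
  if (failedAttempts < 1) throw new RangeError("failedAttempts must be >= 1");
  if (failedAttempts >= MAX_RETRY_ATTEMPTS) return null;
  return nowMs + RETRY_DELAYS_MS[failedAttempts - 1];
}

// Processor side: retry only deliveries whose stored timestamp is due.
function dueForRetry(deliveries: Delivery[], nowMs: number): Delivery[] {
  return deliveries.filter((d) => d.nextRetryAt !== null && d.nextRetryAt <= nowMs);
}
```

Because the processor reads the stored timestamp and never applies a backoff table of its own, the two sides cannot disagree about when a retry is due.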
Note: this contribution was authored with AI assistance (Claude Code).
Summary
This batch PR addresses four backend issues assigned in Wave 6: atomic distributed-lock release, suppressing false-healthy metrics on RPC failure, wiring the correct webhook retry implementation, and adding missing test coverage for yield history share/cost-basis logic.
closes #1163
closes #1201
closes #1167
closes #1171
Changes
- [Backend] defaultChecker releaseLock() deletes the Redis lock unconditionally, letting a long run delete another instance's lock #1163: Atomic lock release in defaultChecker.
  acquireLock now stores the unique fencing token it set in Redis. releaseLock calls a new cacheService.deleteIfEqual method that runs an atomic Lua compare-and-delete (GET + DEL in one round-trip). A run that outlives the 10-minute LOCK_TTL can no longer delete a lock acquired by a subsequent instance, which closes the gap that allowed overlapping check_defaults submissions.
- [Backend] RPC failure to fetch chain tip records indexer_lag_ledgers=0, masking outages and defeating the lag alert #1201: Skip lag/chain-tip metrics on RPC tip-fetch failure.
  getLatestLedgerSequence now returns null instead of 0 on error, and increments a new indexer_rpc_fetch_errors_total Prometheus counter. pollOnce skips recordIndexerLedgers when the tip is null, so an RPC outage no longer resets indexer_lag_ledgers to 0 and silently defeats the behind-chain-tip alert.
- [Backend] startWebhookRetryProcessor is dead and the wired scheduler uses a different backoff than webhookService stores #1167: Wire webhookRetryProcessor as the single live retry path.
  index.ts now starts startWebhookRetryProcessor (which drives WebhookService.processRetries) instead of startWebhookRetryScheduler. This makes the retry schedule single-sourced: processRetries respects the next_retry_at timestamps stored by sendToWebhook (RETRY_DELAYS_MS: 5m/15m/45m, MAX_RETRY_ATTEMPTS=4). The old scheduler's independent BACKOFF=[60,300,1800]s that ignored stored next_retry_at is no longer the live path.
- [Testing] yieldHistoryService withdraw cost-basis and XDR share-decoding paths are uncovered #1171: Test coverage for Withdraw/EmergencyWithdraw cost-basis and XDR share-decoding.
  Three new tests in yieldHistoryService.test.ts:
  - Withdraw with value=null: asserts cost-basis is reduced proportionally via shareRatio
  - Deposit + Withdraw with a non-null base64 XDR ScVec value: asserts parseDepositWithdrawAmounts decodes shares as BigInt and converts to Number correctly
  - EmergencyWithdraw encoded in XDR: follows the same cost-basis path

Test plan
- webhookRetryProcessor.test.ts covers WebhookService.processRetries (the now-live path)
- yieldHistoryService.test.ts covers the three acceptance criteria for [Testing] yieldHistoryService withdraw cost-basis and XDR share-decoding paths are uncovered #1171
- defaultChecker.test.ts covers lock acquisition; the new deleteIfEqual is a thin Lua wrapper tested via integration

Not built/tested locally (dependencies not installed in the Wave workspace); relying on CI for compilation and full test-suite validation.