Skip to content

feat(tbtc): fault isolation for wallet registry and resilient store loading#3933

Merged
lrsaturnino merged 5 commits intofeat/psbt-covenant-final-project-prfrom
feat/fault-isolation-resilient-loading
Apr 16, 2026
Merged

feat(tbtc): fault isolation for wallet registry and resilient store loading#3933
lrsaturnino merged 5 commits intofeat/psbt-covenant-final-project-prfrom
feat/fault-isolation-resilient-loading

Conversation

@lrsaturnino
Copy link
Copy Markdown
Member

@lrsaturnino lrsaturnino commented Apr 9, 2026

Summary

1. GetWallet — wallet registry failure is fatal

Issue: GetWallet calls both the Bridge and the wallet registry. If the registry call fails (transient RPC outage, registry contract issue), the entire GetWallet returns an error — even though most callers only need Bridge-sourced fields (State, timestamps, MainUtxoHash) and never touch MembersIDsHash. A transient registry outage cascades into failures for all wallet queries.

Solution: The registry call is now best-effort. On failure, GetWallet logs a warning and returns Bridge-sourced fields with a zero-valued MembersIDsHash. Downstream consumers that need registry data (e.g. signer approval certificate) already guard against the zero value.

2. Store load aborts on any single corrupt file

Issue: Store.load() iterates persisted job files and returns a hard error on any of: unreadable file (disk I/O error), malformed JSON, or unparseable timestamp when resolving duplicate route keys. A single corrupt file prevents the entire store from loading, blocking the covenant signer from starting.

Solution: Each failure mode now logs a warning and skips the offending file instead of aborting:

  • Unreadable file: descriptor.Content() error → skip with warning.
  • Malformed JSON: json.Unmarshal error → skip with warning.
  • Invalid timestamp on duplicate route key: when isNewerOrSameJobRevision fails, the loader checks which job has a parseable timestamp — if the candidate's timestamp is valid, it replaces the existing job (whose timestamp must be the broken one); otherwise the candidate is skipped. This preserves the best available data.

Test coverage

GetWallet fault isolation (pkg/tbtc/get_wallet_fault_isolation_test.go)

Test Scenario
TestGetWalletReturnsDataWhenRegistryFails Registry error → returns Bridge fields, MembersIDsHash is zero, no error
TestGetWalletReturnsFullDataWhenRegistrySucceeds Both sources succeed → all fields including MembersIDsHash populated
TestGetWalletBridgeFailureStillReturnsError Unknown wallet (Bridge failure) → still returns error (no behavior change)

Test mock (pkg/tbtc/chain_test.go)

  • Added walletRegistryErrs map and setWalletRegistryErr() to localChain, enabling simulation of registry failures that return degraded data with zero MembersIDsHash.

Resilient store loading (pkg/covenantsigner/store_test.go)

Test Scenario
TestStoreLoadSkipsUnreadableFile One readable job + one faulting descriptor → store loads, valid job accessible
TestStoreLoadSkipsMalformedJSON One valid job + one garbage JSON file → store loads, valid job accessible
TestStoreLoadSkipsInvalidTimestampOnDuplicateRouteKey Two jobs on same route key, one with "invalid-timestamp" → store loads, valid-timestamp job wins
TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys (updated) Now expects success instead of error — validates the graceful degradation

Test helpers (pkg/covenantsigner/covenantsigner_test.go)

  • faultingDescriptor: returns injected error from Content(), simulating unreadable files.
  • contentFaultingHandle: injects faulting descriptors into the ReadAll channel alongside normal ones.

…re loading

GetWallet now degrades gracefully when the wallet registry is
unavailable, returning Bridge-sourced fields with a zero MembersIDsHash
instead of failing outright. Covenant signer store loading skips
unreadable files, malformed JSON, and invalid timestamps on duplicate
route keys rather than aborting the entire load.
Copy link
Copy Markdown

@mswilkison mswilkison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two substantive issues remain:

  1. pkg/covenantsigner/store.go
    load() now skips unreadable/malformed job files and still returns success. After a restart, that can leave the process without the persisted job that originally owned a routeRequestId, while Service.Submit still deduplicates only through store.GetByRouteRequest(...). If the skipped file belonged to an already-accepted request, a retry will create a second signing job for the same covenant request instead of failing closed. I don't think this should merge without a fail-closed safeguard for skipped persisted jobs, or an equivalent quarantine/recovery path that preserves dedupe semantics. This is not limited to the skipped > 0 && loaded == 0 case; losing even one accepted job is enough to break node-local idempotency/replay protection.

  2. pkg/chain/ethereum/tbtc.go
    The best-effort wallet-registry read looks safe for the current caller set, but the degraded-mode contract is implicit: a zero MembersIDsHash now means "registry data unavailable". That needs to be explicit in WalletChainData documentation at minimum, and the signer-approval path should surface the likely cause more clearly. Today the operator-facing error degrades into wallet chain data must include members IDs hash, which obscures the underlying registry outage. Also, the new tests exercise the local mock chain, not the real Ethereum adapter path that changed.

Non-blocking:

  • When load replaces one job with another for the same route key, the superseded job can remain reachable through byRequestID.
  • TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys now asserts success and should be renamed.

@lrsaturnino
Copy link
Copy Markdown
Member Author

Reverted 50e7acb (address store lock gosec findings) — it uses os.Root.MkdirAll which was added in Go 1.26, but CI runs Go 1.24. The commit itself is cosmetic: dataDir comes from operator config (not user input), so G304 is a false positive here, and the unchecked .Close() calls are on error-exit paths where the FD leak is harmless. If we want the G104 close-error handling later, that part is Go-1.24-compatible on its own without the os.OpenRoot rewrite.

…e operations

Add #nosec annotations for three findings in store.go:
- G304 on os.OpenFile: lockPath is built from operator config (dataDir)
  plus constants (jobsDirectory, lockFileName), not user input
- G104 on lockFile.Close(): best-effort cleanup after failed flock;
  the lock error is what gets returned
- G104 on store.Close(): best-effort lock release after failed load;
  the load error is what gets returned
@lrsaturnino lrsaturnino requested a review from mswilkison April 13, 2026 04:56
lrsaturnino added a commit that referenced this pull request Apr 16, 2026
## Summary

Implements 15 fixes from the consolidated audit of PR #3933 (covenant
signer + fault isolation). These address findings from 5 review passes
(initial triage, 2 multi-agent lens reviews, PR reviewer feedback,
validation review).

### Fixes included

**P1 - High**
- **A2**: Use `canonicaljson.Marshal` for handoff payload hash
(cross-language determinism)
- **A3/A23**: Improve wallet registry error messages, elevate log to
Error, add WalletChainData godoc
- **A4**: Extract Submit critical section into `createOrDedup` helper (5
unlock points -> defer)
- **A5**: Deterministic tiebreaker when both timestamps unparseable
(lexicographic RequestID)
- **A6**: Restrict healthz auth bypass to GET method only
- **A22**: Poison route keys from skipped jobs to preserve dedupe
semantics

**P2 - Medium**
- **A7**: Document AuthToken CLI flag /proc visibility risk
- **A12/A27**: Cancel service context on init failure + add
SIGINT/SIGTERM signal handling
- **A15**: Document advisory flock limitations and storage requirements
- **A16**: Add aggregate load summary with skip count
- **A24**: Remove superseded job from byRequestID on dedup replacement
- **A25**: Rename misleading test after resilient loading change
- **A29**: Use `errors.Is()` for errJobNotFound comparison

---

## Follow-up issues

### Pre-mainnet (must address before mainnet deployment)

**A1: Default `requireApprovalTrustRoots` to true in production**
- **File:** `pkg/covenantsigner/server.go:70-75`
- **Risk:** When `signerApprovalVerifier` is nil, the service accepts
requests without cryptographic signer approval verification. The
`requireApprovalTrustRoots` config flag exists as mitigation but
defaults to false.
- **Action:** Default `requireApprovalTrustRoots` to true in production
configs, or fail startup when port is non-zero and no verifier is
available.
- **Effort:** Low

**A8: Bitcoin regtest integration tests for witness scripts**
- **File:** `pkg/tbtc/covenant_signer.go` (design)
- **Risk:** Go unit tests cannot enforce `cleanstack`, signature
encoding quirks, or opcode limits. A script passing unit tests could be
rejected by the Bitcoin network, locking funds.
- **Action:** Introduce `btcd` or Bitcoin Core regtest integration tests
that compile witness scripts, sign transactions, and broadcast on a
local regtest node.
- **Effort:** High

### Post-merge (code quality and maintainability)

**A9: Split monolithic validation.go (1922 lines)**
- **File:** `pkg/covenantsigner/validation.go`
- **Issue:** Mixes input parsing, crypto verification, normalization,
and commitment computation in a single file.
- **Action:** Split into focused files: `validation_quote.go`,
`validation_approval.go`, `validation_template.go`.
- **Effort:** High

**A10: Deduplicate UTXO resolution logic**
- **File:** `pkg/tbtc/covenant_signer.go:512-628`
- **Issue:** `resolveSelfV1ActiveUtxo` and `resolveQcV1ActiveUtxo` are
near-identical (~60 lines each).
- **Action:** Extract shared `resolveActiveUtxo(request, witnessScript,
templateName)` helper.
- **Effort:** Medium

**A11: Deduplicate trust root normalization**
- **File:** `pkg/covenantsigner/validation.go:628-724`
- **Issue:** `normalizeDepositorTrustRoots` and
`normalizeCustodianTrustRoots` are structurally identical.
- **Action:** Unify via generics or shared inner function.
- **Effort:** Low

**A13: strictUnmarshal trailing token rejection**
- **File:** `pkg/covenantsigner/validation.go:71-75`
- **Issue:** Inconsistent with server-level `decodeJSON` which rejects
trailing JSON tokens.
- **Action:** Add trailing-token rejection or document single-document
context constraint.
- **Effort:** Low

**A14: Document Engine interface contract**
- **File:** `pkg/covenantsigner/engine.go`
- **Issue:** No documentation of what nil Transition means (OnSubmit:
default to Pending; OnPoll: no-op).
- **Action:** Add godoc specifying the contract and allowed return
states.
- **Effort:** Low

**A26: Rate limiting on submit endpoint**
- **File:** `pkg/covenantsigner/server.go:345`
- **Issue:** Authenticated callers can spawn unbounded concurrent
5-minute signing operations. Mitigated by auth and routeRequestID dedup
but no per-client throttle exists.
- **Action:** Add token-bucket rate limiting per auth token (e.g.,
`golang.org/x/time/rate`, ~5 req/min).
- **Effort:** Medium

**A28: Replace reflect.DeepEqual for Handoff comparison**
- **File:** `pkg/covenantsigner/service.go:201`
- **Issue:** `sameJobRevision` uses `reflect.DeepEqual` for
`map[string]any` Handoff comparison. Correct but potentially slow at
scale.
- **Action:** Define a concrete `HandoffData` struct with an `Equal`
method. Acceptable as-is for single-digit RPM throughput.
- **Effort:** Low

**A30: Document qcV1SignerHandoff schema**
- **File:** `pkg/tbtc/covenant_signer.go:35`
- **Issue:** The `Kind` constant serves as a version identifier but the
downstream contract for handoff artifacts is implicit.
- **Action:** Add doc comment noting this struct defines the downstream
API schema.
- **Effort:** Low

### Type-safety follow-up (from A3)

**A3 (type-level): Add `HasRegistryData` to WalletChainData**
- **File:** `pkg/tbtc/chain.go:418`
- **Issue:** Zero `[32]byte` for MembersIDsHash means "unavailable" but
this is implicit. Downstream guards work but future code could silently
use the zero value.
- **Action:** Add `HasRegistryData bool` field or use `*[32]byte` for
MembersIDsHash to make unavailability explicit at the type level.
- **Effort:** Medium (touches many callers)

---

## Pre-existing test failure

`TestSubmitHandlerPreservesContextValues` fails on the base branch
(`fix/review-findings`). Confirmed not introduced by these changes --
the test expects request context values to propagate through the service
context detachment, which is not the current design.

## Test plan
- [x] `go build ./pkg/covenantsigner/...` and `go build ./pkg/tbtc/...`
compile cleanly
- [x] All covenantsigner tests pass (except pre-existing
`TestSubmitHandlerPreservesContextValues`)
- [x] All relevant tbtc tests pass (GetWallet fault isolation, signer
approval, payload hash)
- [x] Pinned hash test
(`TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering`)
still passes after canonicaljson switch
- [ ] CI green
@lrsaturnino lrsaturnino merged commit b527032 into feat/psbt-covenant-final-project-pr Apr 16, 2026
17 checks passed
@lrsaturnino lrsaturnino deleted the feat/fault-isolation-resilient-loading branch April 16, 2026 04:10
lrsaturnino added a commit that referenced this pull request Apr 16, 2026
Implements 15 fixes from the consolidated audit of PR #3933 (covenant
signer + fault isolation). These address findings from 5 review passes
(initial triage, 2 multi-agent lens reviews, PR reviewer feedback,
validation review).

**P1 - High**
- **A2**: Use `canonicaljson.Marshal` for handoff payload hash
(cross-language determinism)
- **A3/A23**: Improve wallet registry error messages, elevate log to
Error, add WalletChainData godoc
- **A4**: Extract Submit critical section into `createOrDedup` helper (5
unlock points -> defer)
- **A5**: Deterministic tiebreaker when both timestamps unparseable
(lexicographic RequestID)
- **A6**: Restrict healthz auth bypass to GET method only
- **A22**: Poison route keys from skipped jobs to preserve dedupe
semantics

**P2 - Medium**
- **A7**: Document AuthToken CLI flag /proc visibility risk
- **A12/A27**: Cancel service context on init failure + add
SIGINT/SIGTERM signal handling
- **A15**: Document advisory flock limitations and storage requirements
- **A16**: Add aggregate load summary with skip count
- **A24**: Remove superseded job from byRequestID on dedup replacement
- **A25**: Rename misleading test after resilient loading change
- **A29**: Use `errors.Is()` for errJobNotFound comparison

---

**A1: Default `requireApprovalTrustRoots` to true in production**
- **File:** `pkg/covenantsigner/server.go:70-75`
- **Risk:** When `signerApprovalVerifier` is nil, the service accepts
requests without cryptographic signer approval verification. The
`requireApprovalTrustRoots` config flag exists as mitigation but
defaults to false.
- **Action:** Default `requireApprovalTrustRoots` to true in production
configs, or fail startup when port is non-zero and no verifier is
available.
- **Effort:** Low

**A8: Bitcoin regtest integration tests for witness scripts**
- **File:** `pkg/tbtc/covenant_signer.go` (design)
- **Risk:** Go unit tests cannot enforce `cleanstack`, signature
encoding quirks, or opcode limits. A script passing unit tests could be
rejected by the Bitcoin network, locking funds.
- **Action:** Introduce `btcd` or Bitcoin Core regtest integration tests
that compile witness scripts, sign transactions, and broadcast on a
local regtest node.
- **Effort:** High

**A9: Split monolithic validation.go (1922 lines)**
- **File:** `pkg/covenantsigner/validation.go`
- **Issue:** Mixes input parsing, crypto verification, normalization,
and commitment computation in a single file.
- **Action:** Split into focused files: `validation_quote.go`,
`validation_approval.go`, `validation_template.go`.
- **Effort:** High

**A10: Deduplicate UTXO resolution logic**
- **File:** `pkg/tbtc/covenant_signer.go:512-628`
- **Issue:** `resolveSelfV1ActiveUtxo` and `resolveQcV1ActiveUtxo` are
near-identical (~60 lines each).
- **Action:** Extract shared `resolveActiveUtxo(request, witnessScript,
templateName)` helper.
- **Effort:** Medium

**A11: Deduplicate trust root normalization**
- **File:** `pkg/covenantsigner/validation.go:628-724`
- **Issue:** `normalizeDepositorTrustRoots` and
`normalizeCustodianTrustRoots` are structurally identical.
- **Action:** Unify via generics or shared inner function.
- **Effort:** Low

**A13: strictUnmarshal trailing token rejection**
- **File:** `pkg/covenantsigner/validation.go:71-75`
- **Issue:** Inconsistent with server-level `decodeJSON` which rejects
trailing JSON tokens.
- **Action:** Add trailing-token rejection or document single-document
context constraint.
- **Effort:** Low

**A14: Document Engine interface contract**
- **File:** `pkg/covenantsigner/engine.go`
- **Issue:** No documentation of what nil Transition means (OnSubmit:
default to Pending; OnPoll: no-op).
- **Action:** Add godoc specifying the contract and allowed return
states.
- **Effort:** Low

**A26: Rate limiting on submit endpoint**
- **File:** `pkg/covenantsigner/server.go:345`
- **Issue:** Authenticated callers can spawn unbounded concurrent
5-minute signing operations. Mitigated by auth and routeRequestID dedup
but no per-client throttle exists.
- **Action:** Add token-bucket rate limiting per auth token (e.g.,
`golang.org/x/time/rate`, ~5 req/min).
- **Effort:** Medium

**A28: Replace reflect.DeepEqual for Handoff comparison**
- **File:** `pkg/covenantsigner/service.go:201`
- **Issue:** `sameJobRevision` uses `reflect.DeepEqual` for
`map[string]any` Handoff comparison. Correct but potentially slow at
scale.
- **Action:** Define a concrete `HandoffData` struct with an `Equal`
method. Acceptable as-is for single-digit RPM throughput.
- **Effort:** Low

**A30: Document qcV1SignerHandoff schema**
- **File:** `pkg/tbtc/covenant_signer.go:35`
- **Issue:** The `Kind` constant serves as a version identifier but the
downstream contract for handoff artifacts is implicit.
- **Action:** Add doc comment noting this struct defines the downstream
API schema.
- **Effort:** Low

**A3 (type-level): Add `HasRegistryData` to WalletChainData**
- **File:** `pkg/tbtc/chain.go:418`
- **Issue:** Zero `[32]byte` for MembersIDsHash means "unavailable" but
this is implicit. Downstream guards work but future code could silently
use the zero value.
- **Action:** Add `HasRegistryData bool` field or use `*[32]byte` for
MembersIDsHash to make unavailability explicit at the type level.
- **Effort:** Medium (touches many callers)

---

`TestSubmitHandlerPreservesContextValues` fails on the base branch
(`fix/review-findings`). Confirmed not introduced by these changes --
the test expects request context values to propagate through the service
context detachment, which is not the current design.

- [x] `go build ./pkg/covenantsigner/...` and `go build ./pkg/tbtc/...`
compile cleanly
- [x] All covenantsigner tests pass (except pre-existing
`TestSubmitHandlerPreservesContextValues`)
- [x] All relevant tbtc tests pass (GetWallet fault isolation, signer
approval, payload hash)
- [x] Pinned hash test
(`TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering`)
still passes after canonicaljson switch
- [ ] CI green
lrsaturnino added a commit that referenced this pull request Apr 16, 2026
…rent base (#3940)

## Problem

PR #3935 was merged into `fix/review-findings`, an intermediate base
branch that was never integrated into
`feat/psbt-covenant-final-project-pr`. This left
`feat/psbt-covenant-final-project-pr` without the audit-finding fixes
that #3935 implemented (createOrDedup mutex extraction, healthz auth
bypass restriction, canonical JSON for handoff payload, error message
improvements, gosec hardening, and more).

## Solution

This PR cherry-picks the squash commit of #3935 onto the current state
of `feat/psbt-covenant-final-project-pr`, which already contains PR
#3933 (Maclane's fault-isolation work plus his subsequent fixes for the
review comments raised on that PR). Three files required manual conflict
resolution where #3935 and #3933 made overlapping changes addressing the
same concerns. In `pkg/tbtc/signer_approval_certificate.go`, the
`ErrMissingWalletID` and `ErrMissingMembersIDsHash` sentinel errors from
#3935 were dropped because Maclane's `ensureWalletRegistryDataAvailable`
function already covers the same case and is in active use; keeping both
would have left the sentinels as dead code. In
`pkg/covenantsigner/server_test.go`, the test rewrite from #3935 (which
depended on a `serviceCtx` parameter to `newHandler`) was reverted in
favor of the existing middleware-based test, which already exercises the
`context.WithoutCancel(r.Context())` detachment path. In
`pkg/covenantsigner/store_test.go`, the test rename collision was
resolved by taking #3935's name
(`TestStoreLoadResolvesInvalidUpdatedAtForDuplicateRouteKeys`).

Three additional cleanups followed from the merge mechanics: the godoc
on `WalletChainData.MembersIDsHash` was updated to reference
`ensureWalletRegistryDataAvailable` instead of the dropped sentinels, an
orphaned `cancelService()` call was removed from `server.go` (its
`serviceCtx` infrastructure did not auto-merge), and a duplicate
`delete(s.byRequestID, existingID)` was deduplicated in `store.go`'s
load path.

## Tests

All tests across `pkg/covenantsigner`, `pkg/tbtc`, and
`pkg/chain/ethereum` pass. A behavioral diff against the base branch
shows identical build, vet, and file-listing outputs, with all `ok`
lines matching except for timing variation. The two adapter tests added
by Maclane (`TestMakeWalletChainData...`) and the two
registry-unavailable tests for the issue-and-verify signer approval
paths are preserved and pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants