feat(tbtc): fault isolation for wallet registry and resilient store loading#3933
Conversation
…re loading GetWallet now degrades gracefully when the wallet registry is unavailable, returning Bridge-sourced fields with a zero MembersIDsHash instead of failing outright. Covenant signer store loading skips unreadable files, malformed JSON, and invalid timestamps on duplicate route keys rather than aborting the entire load.
mswilkison
left a comment
There was a problem hiding this comment.
Two substantive issues remain:
-
pkg/covenantsigner/store.go
load()now skips unreadable/malformed job files and still returns success. After a restart, that can leave the process without the persisted job that originally owned arouteRequestId, whileService.Submitstill deduplicates only throughstore.GetByRouteRequest(...). If the skipped file belonged to an already-accepted request, a retry will create a second signing job for the same covenant request instead of failing closed. I don't think this should merge without a fail-closed safeguard for skipped persisted jobs, or an equivalent quarantine/recovery path that preserves dedupe semantics. This is not limited to theskipped > 0 && loaded == 0case; losing even one accepted job is enough to break node-local idempotency/replay protection. -
pkg/chain/ethereum/tbtc.go
The best-effort wallet-registry read looks safe for the current caller set, but the degraded-mode contract is implicit: a zeroMembersIDsHashnow means "registry data unavailable". That needs to be explicit inWalletChainDatadocumentation at minimum, and the signer-approval path should surface the likely cause more clearly. Today the operator-facing error degrades intowallet chain data must include members IDs hash, which obscures the underlying registry outage. Also, the new tests exercise the local mock chain, not the real Ethereum adapter path that changed.
Non-blocking:
- When load replaces one job with another for the same route key, the superseded job can remain reachable through
byRequestID. TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeysnow asserts success and should be renamed.
|
Reverted 50e7acb ( |
…e operations Add #nosec annotations for three findings in store.go: - G304 on os.OpenFile: lockPath is built from operator config (dataDir) plus constants (jobsDirectory, lockFileName), not user input - G104 on lockFile.Close(): best-effort cleanup after failed flock; the lock error is what gets returned - G104 on store.Close(): best-effort lock release after failed load; the load error is what gets returned
## Summary Implements 15 fixes from the consolidated audit of PR #3933 (covenant signer + fault isolation). These address findings from 5 review passes (initial triage, 2 multi-agent lens reviews, PR reviewer feedback, validation review). ### Fixes included **P1 - High** - **A2**: Use `canonicaljson.Marshal` for handoff payload hash (cross-language determinism) - **A3/A23**: Improve wallet registry error messages, elevate log to Error, add WalletChainData godoc - **A4**: Extract Submit critical section into `createOrDedup` helper (5 unlock points -> defer) - **A5**: Deterministic tiebreaker when both timestamps unparseable (lexicographic RequestID) - **A6**: Restrict healthz auth bypass to GET method only - **A22**: Poison route keys from skipped jobs to preserve dedupe semantics **P2 - Medium** - **A7**: Document AuthToken CLI flag /proc visibility risk - **A12/A27**: Cancel service context on init failure + add SIGINT/SIGTERM signal handling - **A15**: Document advisory flock limitations and storage requirements - **A16**: Add aggregate load summary with skip count - **A24**: Remove superseded job from byRequestID on dedup replacement - **A25**: Rename misleading test after resilient loading change - **A29**: Use `errors.Is()` for errJobNotFound comparison --- ## Follow-up issues ### Pre-mainnet (must address before mainnet deployment) **A1: Default `requireApprovalTrustRoots` to true in production** - **File:** `pkg/covenantsigner/server.go:70-75` - **Risk:** When `signerApprovalVerifier` is nil, the service accepts requests without cryptographic signer approval verification. The `requireApprovalTrustRoots` config flag exists as mitigation but defaults to false. - **Action:** Default `requireApprovalTrustRoots` to true in production configs, or fail startup when port is non-zero and no verifier is available. - **Effort:** Low **A8: Bitcoin regtest integration tests for witness scripts** - **File:** `pkg/tbtc/covenant_signer.go` (design) - **Risk:** Go unit tests cannot enforce `cleanstack`, signature encoding quirks, or opcode limits. A script passing unit tests could be rejected by the Bitcoin network, locking funds. - **Action:** Introduce `btcd` or Bitcoin Core regtest integration tests that compile witness scripts, sign transactions, and broadcast on a local regtest node. - **Effort:** High ### Post-merge (code quality and maintainability) **A9: Split monolithic validation.go (1922 lines)** - **File:** `pkg/covenantsigner/validation.go` - **Issue:** Mixes input parsing, crypto verification, normalization, and commitment computation in a single file. - **Action:** Split into focused files: `validation_quote.go`, `validation_approval.go`, `validation_template.go`. - **Effort:** High **A10: Deduplicate UTXO resolution logic** - **File:** `pkg/tbtc/covenant_signer.go:512-628` - **Issue:** `resolveSelfV1ActiveUtxo` and `resolveQcV1ActiveUtxo` are near-identical (~60 lines each). - **Action:** Extract shared `resolveActiveUtxo(request, witnessScript, templateName)` helper. - **Effort:** Medium **A11: Deduplicate trust root normalization** - **File:** `pkg/covenantsigner/validation.go:628-724` - **Issue:** `normalizeDepositorTrustRoots` and `normalizeCustodianTrustRoots` are structurally identical. - **Action:** Unify via generics or shared inner function. - **Effort:** Low **A13: strictUnmarshal trailing token rejection** - **File:** `pkg/covenantsigner/validation.go:71-75` - **Issue:** Inconsistent with server-level `decodeJSON` which rejects trailing JSON tokens. - **Action:** Add trailing-token rejection or document single-document context constraint. - **Effort:** Low **A14: Document Engine interface contract** - **File:** `pkg/covenantsigner/engine.go` - **Issue:** No documentation of what nil Transition means (OnSubmit: default to Pending; OnPoll: no-op). - **Action:** Add godoc specifying the contract and allowed return states. - **Effort:** Low **A26: Rate limiting on submit endpoint** - **File:** `pkg/covenantsigner/server.go:345` - **Issue:** Authenticated callers can spawn unbounded concurrent 5-minute signing operations. Mitigated by auth and routeRequestID dedup but no per-client throttle exists. - **Action:** Add token-bucket rate limiting per auth token (e.g., `golang.org/x/time/rate`, ~5 req/min). - **Effort:** Medium **A28: Replace reflect.DeepEqual for Handoff comparison** - **File:** `pkg/covenantsigner/service.go:201` - **Issue:** `sameJobRevision` uses `reflect.DeepEqual` for `map[string]any` Handoff comparison. Correct but potentially slow at scale. - **Action:** Define a concrete `HandoffData` struct with an `Equal` method. Acceptable as-is for single-digit RPM throughput. - **Effort:** Low **A30: Document qcV1SignerHandoff schema** - **File:** `pkg/tbtc/covenant_signer.go:35` - **Issue:** The `Kind` constant serves as a version identifier but the downstream contract for handoff artifacts is implicit. - **Action:** Add doc comment noting this struct defines the downstream API schema. - **Effort:** Low ### Type-safety follow-up (from A3) **A3 (type-level): Add `HasRegistryData` to WalletChainData** - **File:** `pkg/tbtc/chain.go:418` - **Issue:** Zero `[32]byte` for MembersIDsHash means "unavailable" but this is implicit. Downstream guards work but future code could silently use the zero value. - **Action:** Add `HasRegistryData bool` field or use `*[32]byte` for MembersIDsHash to make unavailability explicit at the type level. - **Effort:** Medium (touches many callers) --- ## Pre-existing test failure `TestSubmitHandlerPreservesContextValues` fails on the base branch (`fix/review-findings`). Confirmed not introduced by these changes -- the test expects request context values to propagate through the service context detachment, which is not the current design. ## Test plan - [x] `go build ./pkg/covenantsigner/...` and `go build ./pkg/tbtc/...` compile cleanly - [x] All covenantsigner tests pass (except pre-existing `TestSubmitHandlerPreservesContextValues`) - [x] All relevant tbtc tests pass (GetWallet fault isolation, signer approval, payload hash) - [x] Pinned hash test (`TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering`) still passes after canonicaljson switch - [ ] CI green
b527032
into
feat/psbt-covenant-final-project-pr
Implements 15 fixes from the consolidated audit of PR #3933 (covenant signer + fault isolation). These address findings from 5 review passes (initial triage, 2 multi-agent lens reviews, PR reviewer feedback, validation review). **P1 - High** - **A2**: Use `canonicaljson.Marshal` for handoff payload hash (cross-language determinism) - **A3/A23**: Improve wallet registry error messages, elevate log to Error, add WalletChainData godoc - **A4**: Extract Submit critical section into `createOrDedup` helper (5 unlock points -> defer) - **A5**: Deterministic tiebreaker when both timestamps unparseable (lexicographic RequestID) - **A6**: Restrict healthz auth bypass to GET method only - **A22**: Poison route keys from skipped jobs to preserve dedupe semantics **P2 - Medium** - **A7**: Document AuthToken CLI flag /proc visibility risk - **A12/A27**: Cancel service context on init failure + add SIGINT/SIGTERM signal handling - **A15**: Document advisory flock limitations and storage requirements - **A16**: Add aggregate load summary with skip count - **A24**: Remove superseded job from byRequestID on dedup replacement - **A25**: Rename misleading test after resilient loading change - **A29**: Use `errors.Is()` for errJobNotFound comparison --- **A1: Default `requireApprovalTrustRoots` to true in production** - **File:** `pkg/covenantsigner/server.go:70-75` - **Risk:** When `signerApprovalVerifier` is nil, the service accepts requests without cryptographic signer approval verification. The `requireApprovalTrustRoots` config flag exists as mitigation but defaults to false. - **Action:** Default `requireApprovalTrustRoots` to true in production configs, or fail startup when port is non-zero and no verifier is available. - **Effort:** Low **A8: Bitcoin regtest integration tests for witness scripts** - **File:** `pkg/tbtc/covenant_signer.go` (design) - **Risk:** Go unit tests cannot enforce `cleanstack`, signature encoding quirks, or opcode limits. A script passing unit tests could be rejected by the Bitcoin network, locking funds. - **Action:** Introduce `btcd` or Bitcoin Core regtest integration tests that compile witness scripts, sign transactions, and broadcast on a local regtest node. - **Effort:** High **A9: Split monolithic validation.go (1922 lines)** - **File:** `pkg/covenantsigner/validation.go` - **Issue:** Mixes input parsing, crypto verification, normalization, and commitment computation in a single file. - **Action:** Split into focused files: `validation_quote.go`, `validation_approval.go`, `validation_template.go`. - **Effort:** High **A10: Deduplicate UTXO resolution logic** - **File:** `pkg/tbtc/covenant_signer.go:512-628` - **Issue:** `resolveSelfV1ActiveUtxo` and `resolveQcV1ActiveUtxo` are near-identical (~60 lines each). - **Action:** Extract shared `resolveActiveUtxo(request, witnessScript, templateName)` helper. - **Effort:** Medium **A11: Deduplicate trust root normalization** - **File:** `pkg/covenantsigner/validation.go:628-724` - **Issue:** `normalizeDepositorTrustRoots` and `normalizeCustodianTrustRoots` are structurally identical. - **Action:** Unify via generics or shared inner function. - **Effort:** Low **A13: strictUnmarshal trailing token rejection** - **File:** `pkg/covenantsigner/validation.go:71-75` - **Issue:** Inconsistent with server-level `decodeJSON` which rejects trailing JSON tokens. - **Action:** Add trailing-token rejection or document single-document context constraint. - **Effort:** Low **A14: Document Engine interface contract** - **File:** `pkg/covenantsigner/engine.go` - **Issue:** No documentation of what nil Transition means (OnSubmit: default to Pending; OnPoll: no-op). - **Action:** Add godoc specifying the contract and allowed return states. - **Effort:** Low **A26: Rate limiting on submit endpoint** - **File:** `pkg/covenantsigner/server.go:345` - **Issue:** Authenticated callers can spawn unbounded concurrent 5-minute signing operations. Mitigated by auth and routeRequestID dedup but no per-client throttle exists. - **Action:** Add token-bucket rate limiting per auth token (e.g., `golang.org/x/time/rate`, ~5 req/min). - **Effort:** Medium **A28: Replace reflect.DeepEqual for Handoff comparison** - **File:** `pkg/covenantsigner/service.go:201` - **Issue:** `sameJobRevision` uses `reflect.DeepEqual` for `map[string]any` Handoff comparison. Correct but potentially slow at scale. - **Action:** Define a concrete `HandoffData` struct with an `Equal` method. Acceptable as-is for single-digit RPM throughput. - **Effort:** Low **A30: Document qcV1SignerHandoff schema** - **File:** `pkg/tbtc/covenant_signer.go:35` - **Issue:** The `Kind` constant serves as a version identifier but the downstream contract for handoff artifacts is implicit. - **Action:** Add doc comment noting this struct defines the downstream API schema. - **Effort:** Low **A3 (type-level): Add `HasRegistryData` to WalletChainData** - **File:** `pkg/tbtc/chain.go:418` - **Issue:** Zero `[32]byte` for MembersIDsHash means "unavailable" but this is implicit. Downstream guards work but future code could silently use the zero value. - **Action:** Add `HasRegistryData bool` field or use `*[32]byte` for MembersIDsHash to make unavailability explicit at the type level. - **Effort:** Medium (touches many callers) --- `TestSubmitHandlerPreservesContextValues` fails on the base branch (`fix/review-findings`). Confirmed not introduced by these changes -- the test expects request context values to propagate through the service context detachment, which is not the current design. - [x] `go build ./pkg/covenantsigner/...` and `go build ./pkg/tbtc/...` compile cleanly - [x] All covenantsigner tests pass (except pre-existing `TestSubmitHandlerPreservesContextValues`) - [x] All relevant tbtc tests pass (GetWallet fault isolation, signer approval, payload hash) - [x] Pinned hash test (`TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering`) still passes after canonicaljson switch - [ ] CI green
…rent base (#3940) ## Problem PR #3935 was merged into `fix/review-findings`, an intermediate base branch that was never integrated into `feat/psbt-covenant-final-project-pr`. This left `feat/psbt-covenant-final-project-pr` without the audit-finding fixes that #3935 implemented (createOrDedup mutex extraction, healthz auth bypass restriction, canonical JSON for handoff payload, error message improvements, gosec hardening, and more). ## Solution This PR cherry-picks the squash commit of #3935 onto the current state of `feat/psbt-covenant-final-project-pr`, which already contains PR #3933 (Maclane's fault-isolation work plus his subsequent fixes for the review comments raised on that PR). Three files required manual conflict resolution where #3935 and #3933 made overlapping changes addressing the same concerns. In `pkg/tbtc/signer_approval_certificate.go`, the `ErrMissingWalletID` and `ErrMissingMembersIDsHash` sentinel errors from #3935 were dropped because Maclane's `ensureWalletRegistryDataAvailable` function already covers the same case and is in active use; keeping both would have left the sentinels as dead code. In `pkg/covenantsigner/server_test.go`, the test rewrite from #3935 (which depended on a `serviceCtx` parameter to `newHandler`) was reverted in favor of the existing middleware-based test, which already exercises the `context.WithoutCancel(r.Context())` detachment path. In `pkg/covenantsigner/store_test.go`, the test rename collision was resolved by taking #3935's name (`TestStoreLoadResolvesInvalidUpdatedAtForDuplicateRouteKeys`). Three additional cleanups followed from the merge mechanics: the godoc on `WalletChainData.MembersIDsHash` was updated to reference `ensureWalletRegistryDataAvailable` instead of the dropped sentinels, an orphaned `cancelService()` call was removed from `server.go` (its `serviceCtx` infrastructure did not auto-merge), and a duplicate `delete(s.byRequestID, existingID)` was deduplicated in `store.go`'s load path. ## Tests All tests across `pkg/covenantsigner`, `pkg/tbtc`, and `pkg/chain/ethereum` pass. A behavioral diff against the base branch shows identical build, vet, and file-listing outputs, with all `ok` lines matching except for timing variation. The two adapter tests added by Maclane (`TestMakeWalletChainData...`) and the two registry-unavailable tests for the issue-and-verify signer approval paths are preserved and pass.
Summary
1. GetWallet — wallet registry failure is fatal
Issue:
GetWalletcalls both the Bridge and the wallet registry. If the registry call fails (transient RPC outage, registry contract issue), the entireGetWalletreturns an error — even though most callers only need Bridge-sourced fields (State, timestamps,MainUtxoHash) and never touchMembersIDsHash. A transient registry outage cascades into failures for all wallet queries.Solution: The registry call is now best-effort. On failure,
GetWalletlogs a warning and returns Bridge-sourced fields with a zero-valuedMembersIDsHash. Downstream consumers that need registry data (e.g. signer approval certificate) already guard against the zero value.2. Store load aborts on any single corrupt file
Issue:
Store.load()iterates persisted job files and returns a hard error on any of: unreadable file (disk I/O error), malformed JSON, or unparseable timestamp when resolving duplicate route keys. A single corrupt file prevents the entire store from loading, blocking the covenant signer from starting.Solution: Each failure mode now logs a warning and skips the offending file instead of aborting:
descriptor.Content()error → skip with warning.json.Unmarshalerror → skip with warning.isNewerOrSameJobRevisionfails, the loader checks which job has a parseable timestamp — if the candidate's timestamp is valid, it replaces the existing job (whose timestamp must be the broken one); otherwise the candidate is skipped. This preserves the best available data.Test coverage
GetWallet fault isolation (
pkg/tbtc/get_wallet_fault_isolation_test.go)TestGetWalletReturnsDataWhenRegistryFailsMembersIDsHashis zero, no errorTestGetWalletReturnsFullDataWhenRegistrySucceedsMembersIDsHashpopulatedTestGetWalletBridgeFailureStillReturnsErrorTest mock (
pkg/tbtc/chain_test.go)walletRegistryErrsmap andsetWalletRegistryErr()tolocalChain, enabling simulation of registry failures that return degraded data with zeroMembersIDsHash.Resilient store loading (
pkg/covenantsigner/store_test.go)TestStoreLoadSkipsUnreadableFileTestStoreLoadSkipsMalformedJSONTestStoreLoadSkipsInvalidTimestampOnDuplicateRouteKey"invalid-timestamp"→ store loads, valid-timestamp job winsTestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys(updated)Test helpers (
pkg/covenantsigner/covenantsigner_test.go)faultingDescriptor: returns injected error fromContent(), simulating unreadable files.contentFaultingHandle: injects faulting descriptors into theReadAllchannel alongside normal ones.