fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log#3230
Conversation
📝 Walkthrough

This PR adds Raft leadership resignation capability, extends configuration with election and snapshot tuning parameters, improves FSM apply concurrency safety with mutex guards, handles ahead-of-raft recovery scenarios gracefully, and implements leader abdication when the local store falls significantly behind the Raft state.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI (SIGINT/SIGTERM)
    participant Runner as Run Control
    participant Node as FullNode
    participant RaftNode as RaftNode
    participant FSM as FSM/Apply
    CLI->>Runner: Signal received
    Runner->>Node: Type-assert LeaderResigner
    alt LeaderResigner supported
        Runner->>Node: ResignLeader(ctx, 3s timeout)
        Node->>RaftNode: ResignLeader(ctx)
        RaftNode->>RaftNode: Transfer leadership
        RaftNode-->>Node: Success/Error
        Node-->>Runner: Return result
        Runner->>Runner: Log outcome
    else Not supported
        Runner->>Runner: Skip resignation
    end
    Runner->>Runner: Call cancel()
    Runner->>Runner: Wait for shutdown (10s)
```

```mermaid
sequenceDiagram
    participant Election as Election Loop
    participant Store as Store (Local)
    participant Runnable as Runnable
    participant Leader as Leader Sync
    participant Transfer as Leadership Transfer
    Election->>Runnable: IsSynced(raftState)
    Runnable-->>Election: diff = -5 (lag > 1 block)
    alt Store significantly behind
        Election->>Election: Log warning with lag size
        Election->>Transfer: leadershipTransfer()
        Transfer-->>Election: Initiated
        Election->>Election: Skip leader start (continue)
    else Store within tolerance
        Election->>Leader: Proceed with leader startup
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Claude finished @auricom's task in 4m 15s —— View job

Code Review — Raft HA Production Hardening

This is a solid, well-scoped hardening PR. The changes address real failure modes observed in SIGTERM cycling tests with thoughtful fixes. Below are my findings, organized by severity.

Bugs / Correctness Issues

1. `SnapCount` is never validated, so a zero value passes config validation:

```go
// pkg/config/config.go — add to RaftConfig.Validate()
if c.SnapCount == 0 {
	multiErr = errors.Join(multiErr, fmt.Errorf("snap_count must be > 0"))
}
```

2. hashicorp/raft panics when the heartbeat timeout is less than twice the leader lease timeout:

```go
// pkg/config/config.go — add to RaftConfig.Validate()
if c.LeaderLeaseTimeout > 0 && c.HeartbeatTimeout > 0 && c.HeartbeatTimeout < 2*c.LeaderLeaseTimeout {
	multiErr = errors.Join(multiErr, fmt.Errorf(
		"heartbeat timeout (%v) must be >= 2x leader lease timeout (%v)", c.HeartbeatTimeout, c.LeaderLeaseTimeout))
}
```

Concurrency / Shutdown

3. `ResignLeader` and `Stop()` can both initiate a leadership transfer:

```go
// pkg/raft/node.go:Stop()
// Note: ResignLeader may have already initiated a transfer; IsLeader() guards
// against a duplicate, but if the 3s deadline expired, both may overlap briefly.
if n.IsLeader() { ... }
```

4. This ordering is intentional and correct.

Error Handling

8. If local state is ahead but the raft snapshot has no committed blocks, the hash cannot be verified:

```go
if raftState.Height == 0 {
	// Cannot verify hash at genesis; treat as divergence.
	return fmt.Errorf("local state ahead (local=%d) but raft has no committed blocks", currentState.LastBlockHeight)
}
```

Configuration

11. The clamping logic guards against a zero-interval panic.

Tests

13. Table-driven test coverage is good.

Summary

The two correctness issues in config validation (#1 and #2) are the only blockers. The rest are improvements. Core logic — leader fencing, FSM data race fix, abdication — is sound.
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Actionable comments posted: 1

🧹 Nitpick comments (1)

block/internal/syncing/raft_retriever_test.go (1)

Lines 42-61: Consider converting this to a table-driven test. The current case is good, but a table shape will make it easier to add stop idempotency and start/stop-cycle variants without duplicating setup. As per coding guidelines: "Use table-driven tests in Go unit tests".
Actionable comment — block/internal/syncing/raft_retriever.go, line 77:

The call to r.raftNode.SetApplyCallback(nil) races with FSM.Apply, because Apply reads/sends on applyCh while the raft node may concurrently invoke the callback. Fix by adding a mutex to the raft node that guards the apply callback: take a write lock in SetApplyCallback, take a read lock at the invocation site just long enough to snapshot the callback pointer, and call the callback outside the lock. FSM.Apply then never sees a nil or changing callback mid-invocation, and the lock is never held across a long call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a
📒 Files selected for processing (3)
- block/internal/syncing/raft_retriever.go
- block/internal/syncing/raft_retriever_test.go
- pkg/raft/node.go
Codecov Report ❌

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   62.33%   62.45%   +0.12%
==========================================
  Files         122      122
  Lines       12873    12965      +92
==========================================
+ Hits         8024     8097      +73
- Misses       3968     3985      +17
- Partials      881      883       +2
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Bug A: RecoverFromRaft crashed with "invalid block height" when the node restarted after SIGTERM and the EVM state (persisted before kill) was ahead of the raft FSM snapshot (which hadn't finished log replay yet). The function now verifies the hash of the local block at raftState.Height — if it matches the snapshot hash the EVM history is correct and recovery is safely skipped; a mismatch returns an error indicating a genuine fork.

Bug B: waitForMsgsLanded used two repeating tickers with the same effective period (SendTimeout/2 poll, SendTimeout timeout), so both could fire simultaneously in select and the timeout would win even when AppliedIndex >= CommitIndex. Replaced the deadline ticker with a one-shot time.NewTimer, added a final check in the deadline branch, and reduced poll interval to min(50ms, timeout/4) for more responsive detection.

Fixes the crash-restart Docker backoff loop observed in SIGTERM HA test cycle 7 (poc-ha-2 never rejoining within the 300s kill interval).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SetApplyCallback(nil) called from raftRetriever.Stop() raced with FSM.Apply reading applyCh: wg.Wait() only ensures the consumer goroutine exited, but the raft library can still invoke Apply concurrently. Add applyMu sync.RWMutex to FSM; take write lock in SetApplyCallback and read lock in Apply before reading the channel pointer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
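A minimal sketch of the guard described in this commit; the type and callback signature are simplified from the actual FSM:

```go
package main

import "sync"

// fsm sketches only the callback-guard portion of the real FSM.
type fsm struct {
	applyMu sync.RWMutex
	applyCb func(data []byte)
}

// SetApplyCallback takes the write lock, so it cannot swap (or nil out)
// the callback while an Apply is mid-read.
func (f *fsm) SetApplyCallback(cb func([]byte)) {
	f.applyMu.Lock()
	defer f.applyMu.Unlock()
	f.applyCb = cb
}

// Apply snapshots the callback pointer under the read lock, then invokes
// it outside the lock so a slow consumer never blocks SetApplyCallback.
func (f *fsm) Apply(data []byte) {
	f.applyMu.RLock()
	cb := f.applyCb
	f.applyMu.RUnlock()
	if cb != nil {
		cb(data)
	}
}
```

The write lock in SetApplyCallback(nil) cannot be acquired while any Apply holds the read lock, which is exactly the window the original race exposed.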
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…GTERM Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… to RaftConfig; fix SnapCount default 0→3

Add three new Raft config parameters:
- ElectionTimeout: timeout for a candidate to wait for votes (defaults to 1s)
- SnapshotThreshold: outstanding log entries that trigger a snapshot (defaults to 500)
- TrailingLogs: log entries to retain after snapshot (defaults to 200)

Fix critical default: SnapCount was 0 (broken, retains no snapshots) → 3.

This enables control over Raft's snapshot frequency and recovery behavior to prevent resync debt from accumulating unbounded during normal operation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nto hashicorp/raft config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r block provenance audit Add Term field to RaftApplyMsg struct to track the raft term in which each block was committed. Update FSM.Apply() debug logging to include both raft_term and raft_index fields alongside block height and hash. This enables better audit trails and debugging of replication issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gering tests The gci formatter requires single space before inline comments (not aligned double-space). Also removed TestNodeResignLeader_NotLeaderNoop and TestNewNode_SnapshotConfigApplied which create real boltdb-backed raft nodes: boltdb@v1.3.1 has an unsafe pointer alignment issue that panics under Go 1.25's -checkptr. The nil-receiver test (TestNodeResignLeader_NilNoop) is retained as it exercises the same guard without touching boltdb. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
hashicorp/raft-boltdb uses defer tx.Rollback() as a safety net on every transaction. When Commit() succeeds, the deferred Rollback() returns bolt.ErrTxClosed and raft-boltdb logs it as an error — even though it is the expected outcome of every successful read or write. The message has no actionable meaning and floods logs at high block rates. Add a one-time stdlib log filter (sync.Once in NewNode) that silently drops lines containing 'tx closed' and forwards everything else to stderr. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ocs, tests

- Call raftRetriever.Stop() in Syncer.Stop() so SetApplyCallback(nil) is actually reached and the goroutine is awaited before wg.Wait()
- Log leadershipTransfer error at warn level in Node.Stop() instead of discarding it silently
- Fix SnapCount comments in config.go: it retains snapshot files on disk (NewFileSnapshotStore retain param), not log-entry frequency
- Extract buildRaftConfig helper from NewNode to enable config wiring tests
- Add TestNodeResignLeader_NotLeaderNoop (non-nil node, nil raft → noop)
- Add TestNewNode_SnapshotConfigApplied (table-driven, verifies SnapshotThreshold and TrailingLogs wiring with custom and zero values)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, election validation

- Add ShutdownTimeout field (default 5s) to raft Config so Stop() drains committed logs with a meaningful timeout instead of the 200ms SendTimeout
- Wrap ResignLeader() in a 3s goroutine+select fence in the SIGTERM handler so a hung leadership transfer cannot block graceful shutdown indefinitely
- Validate ElectionTimeout >= HeartbeatTimeout in RaftConfig.Validate() to prevent hashicorp/raft panicking at startup with an invalid config
- Fix stale "NewNode creates a new raft node" comment that had migrated onto buildRaftConfig after the function was extracted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gofmt/gci requires minimal alignment; excessive spaces in the TestNewNode_SnapshotConfigApplied struct literal caused a lint failure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add unit tests for lines flagged by Codecov:

- boltTxClosedFilter.Write: filter drops "tx closed", forwards others
- buildRaftConfig: ElectionTimeout > 0 applied, zero uses default
- FullNode.ResignLeader: nil raftNode no-op; non-leader raftNode no-op
- Syncer.Stop: raftRetriever.Stop is called when raftRetriever is set
- Syncer.RecoverFromRaft: GetHeader failure when local state is ahead of stale raft snapshot returns "cannot verify hash" error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
🧹 Nitpick comments (1)

node/full_node_test.go (1)

Lines 87-97: Prefer a table-driven test for the `ResignLeader` no-op cases. These two tests are the same shape and are easier to extend/maintain as one table-driven test.

♻️ Proposed refactor

```diff
-func TestFullNode_ResignLeader_NilRaftNode(t *testing.T) {
-	n := &FullNode{} // raftNode is nil
-	assert.NoError(t, n.ResignLeader())
-}
-
-func TestFullNode_ResignLeader_NonLeaderRaftNode(t *testing.T) {
-	// Empty *raftpkg.Node has nil raft field so IsLeader() returns false;
-	// ResignLeader() is a no-op and returns nil.
-	n := &FullNode{raftNode: &raftpkg.Node{}}
-	assert.NoError(t, n.ResignLeader())
-}
+func TestFullNode_ResignLeader_NoOpCases(t *testing.T) {
+	tests := []struct {
+		name string
+		node *FullNode
+	}{
+		{name: "nil raft node", node: &FullNode{}},
+		{name: "non-leader raft node", node: &FullNode{raftNode: &raftpkg.Node{}}},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			assert.NoError(t, tt.node.ResignLeader())
+		})
+	}
+}
```

As per coding guidelines (**/*_test.go): "Use table-driven tests in Go unit tests".
Actionable inline comments:

pkg/cmd/run_node.go, lines 231-245: The shutdown path creates a 3s resignCtx but launches resigner.ResignLeader() without cancellation, so the resign can keep running after the deadline. Change the API to accept a context (e.g. ResignLeader(ctx context.Context) error), call resigner.ResignLeader(resignCtx) directly, and handle the returned error or context timeout, so no background goroutine leaks and no unbounded resign stays in progress.

pkg/config/config.go, lines 454-456: Validation must reject negative election timeouts: add an explicit check treating c.ElectionTimeout < 0 as invalid ("election timeout (%v) must be >= 0" joined into multiErr), keeping 0 as the "use default" sentinel. This complements the existing comparison with c.HeartbeatTimeout and aligns with buildRaftConfig (which only applies values > 0), so negative values fail fast instead of being silently ignored.

pkg/raft/node.go, lines 95-97: The code calls log.SetOutput(&boltTxClosedFilter{w: os.Stderr}) inside suppressBoltNoise.Do(), which clobbers the process-wide stdlib logger. Capture the current writer and wrap it instead — prev := log.Writer(); log.SetOutput(&boltTxClosedFilter{w: prev}) — still guarded by suppressBoltNoise.Do(), so existing logger configuration is preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bff68b93-3da9-4c95-9c3b-ae84e849204f
📒 Files selected for processing (12)
- block/internal/syncing/syncer.go
- block/internal/syncing/syncer_test.go
- node/full.go
- node/full_node_test.go
- node/node.go
- pkg/cmd/run_node.go
- pkg/config/config.go
- pkg/config/config_test.go
- pkg/config/defaults.go
- pkg/raft/node.go
- pkg/raft/node_test.go
- pkg/raft/types.go
A negative ElectionTimeout was silently ignored (buildRaftConfig only applies values > 0), allowing a misconfigured node to start with the library default instead of failing fast. Add an explicit < 0 check that returns an error; 0 remains valid as the "use library default" sentinel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…x through ResignLeader

- suppressBoltNoise.Do now wraps log.Writer() instead of os.Stderr so any existing stdlib logger redirection is preserved rather than clobbered
- ResignLeader now accepts a context.Context: leadershipTransfer runs in a goroutine and a select abandons the caller at ctx.Done(), returning ctx.Err(); the goroutine itself exits once the inner raft transfer completes (bounded by ElectionTimeout)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rface

- LeaderResigner.ResignLeader() → ResignLeader(ctx context.Context) error
- FullNode.ResignLeader passes ctx down to raft.Node.ResignLeader
- run_node.go calls resigner.ResignLeader(resignCtx) directly — no wrapper goroutine/select needed; context.DeadlineExceeded vs other errors are logged distinctly
- Merge TestFullNode_ResignLeader_NilRaftNode and TestFullNode_ResignLeader_NonLeaderRaftNode into single table-driven test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (1)
pkg/raft/node_test.go (1)

Lines 195-228: Test logic is good; consider renaming for precision. The assertions are correct, but the function name suggests `NewNode` behavior while it directly tests `buildRaftConfig`. Renaming would improve discoverability.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 36de18f5-92b8-4b48-8098-050aafc90d75
📒 Files selected for processing (8)
- node/full.go
- node/full_node_test.go
- node/node.go
- pkg/cmd/run_node.go
- pkg/config/config.go
- pkg/config/config_test.go
- pkg/raft/node.go
- pkg/raft/node_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/cmd/run_node.go
🚧 Files skipped from review as they are similar to previous changes (4)
- node/node.go
- node/full.go
- pkg/config/config.go
- pkg/raft/node.go
this test does not look necessary
Done — deleted in bf6bb50. The test was validating an implementation detail (that Stop() calls SetApplyCallback(nil)) rather than observable behaviour. The stubRaftNode helper it defined has been moved to syncer_test.go where it's still needed by TestSyncer_Stop_CallsRaftRetrieverStop.
Quoted context (pkg/raft/node.go):

```go
func NewNode(cfg *Config, logger zerolog.Logger) (*Node, error) {
	suppressBoltNoise.Do(func() {
```
Let's delete this; it does not use the correct logger anyway.
Done — deleted in bf6bb50. The global stdlib log.SetOutput redirect was the wrong scope; nothing else in the codebase should be silenced that way.
| "google.golang.org/protobuf/proto" | ||
| ) | ||
|
|
||
| // suppressBoltNoise redirects the stdlib log output once to drop the |
Ditto. If we don't need logs from this, and it uses the stdlib log implementation, we should make sure hashicorp/raft-boltdb is instantiated with zerolog.
Unfortunately hashicorp/raft-boltdb does not support logger injection — its Options struct has no Logger field and all rollback errors are emitted via hardcoded log.Printf calls. There is no way to route those through zerolog without patching the library upstream. The "Rollback failed: tx closed" noise is benign (it fires every time defer tx.Rollback() runs after a successful tx.Commit()), so for now we accept it.
```diff
 // Wait for FSM to apply all committed logs before shutdown to prevent state loss.
 // This ensures pending raft messages are processed before the node stops.
-if err := n.waitForMsgsLanded(n.config.SendTimeout); err != nil {
+if err := n.waitForMsgsLanded(n.config.ShutdownTimeout); err != nil {
```
Do we really need a different ShutdownTimeout vs SendTimeout? That's a lot of timeouts in the config, which is confusing for operators (CometBFT has thousands of confusing timeouts).
Agreed — removed from public config in bf6bb50. ShutdownTimeout is still computed internally as 5 × SendTimeout at the initRaftNode call site (node/full.go), so the drain window scales with the operator's chosen send timeout without exposing another knob. The ShutdownTimeout field stays in pkg/raft.Config as an internal detail.
…t state

When a node wins election but its local store is more than 1 block behind the raft FSM state, RecoverFromRaft cannot replay the gap (it only handles the single latest block in the raft snapshot). Previously the node would log "recovery successful" and start leader operations anyway, then stall block production while the executor repeatedly failed to sync the missing blocks from a store that did not have them.

Fix: in DynamicLeaderElection.Run, detect diff < -1 at the follower→leader transition and immediately transfer leadership to a better-synced peer. diff == -1 is preserved: RecoverFromRaft can apply exactly one block from the raft snapshot, so that path is unchanged.

Closes #3255

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
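The transition check can be sketched as follows; the function and parameter names are illustrative of DynamicLeaderElection.Run's logic, not its exact shape:

```go
package main

import "fmt"

// onWonElection decides what a freshly elected node does based on
// diff = localHeight - raftHeight. diff == -1 stays recoverable because
// RecoverFromRaft can apply exactly one block from the raft snapshot.
func onWonElection(diff int64, transfer func() error, startLeader func()) error {
	if diff < -1 {
		// Store is more than one block behind the raft FSM: the gap
		// cannot be replayed, so hand leadership to a better-synced peer.
		if err := transfer(); err != nil {
			return fmt.Errorf("leadership transfer failed: %w", err)
		}
		return nil
	}
	startLeader()
	return nil
}
```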
…nTimeout

- Remove stdlib log filter (boltTxClosedFilter / suppressBoltNoise): it redirected the global stdlib logger, which is the wrong scope. raft-boltdb uses log.Printf directly with no Logger option, so the "tx closed" noise is now accepted as-is from stderr.
- Wire hashicorp/raft's internal hclog output through zerolog: set raft.Config.Logger to an hclog.Logger backed by the zerolog writer so all raft-internal messages appear in the structured log stream under component=raft-hashicorp.
- Remove ShutdownTimeout from public config: it was a second "how long to wait" knob that confused operators. ShutdownTimeout is now computed internally as 5 × SendTimeout at the initRaftNode call site.
- Delete TestRaftRetrieverStopClearsApplyCallback: tested an implementation detail (Stop clears the apply callback pointer) rather than observable behaviour. The stubRaftNode helper it defined is moved to syncer_test.go where it is still needed.
- Rename TestNewNode_SnapshotConfigApplied → TestBuildRaftConfig_SnapshotConfigApplied to reflect that it tests buildRaftConfig, not NewNode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test go mod tidy promotes github.com/hashicorp/go-hclog from indirect to direct now that pkg/raft/node.go imports it explicitly. gci auto-formatted stubRaftNode method stubs in syncer_test.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/config/config.go (1)
Lines 419-452: ⚠️ Potential issue | 🟠 Major — Validate `raft.snap_count` here too.

`SnapCount` is still allowed to be 0, but pkg/raft/node.go passes it straight into raft.NewFileSnapshotStore(...), which will fail during node startup instead of during config validation. Since this is now a user-facing retention knob, reject zero in RaftConfig.Validate() so bad operator input fails fast.

Suggested fix

```diff
 func (c RaftConfig) Validate() error {
 	if !c.Enable {
 		return nil
 	}

 	var multiErr error
@@
+	if c.SnapCount == 0 {
+		multiErr = errors.Join(multiErr, fmt.Errorf("snap count must be > 0"))
+	}
+
 	if c.SendTimeout <= 0 {
 		multiErr = errors.Join(multiErr, fmt.Errorf("send timeout must be positive"))
 	}
```

As per coding guidelines, "Validate all inputs from external sources in Go code".
Actionable inline comments:

pkg/raft/election_test.go, lines 245-261: The test is nondeterministic due to time.Sleep and t.Fatal in a goroutine. Replace both with channel-based synchronization: have the leader testRunnable's runFn send on a leaderStartAttemptCh instead of calling t.Fatal, let the main test goroutine wait on that channel to observe the leader start attempt, and coordinate the leadership transfer through the existing leaderCh (or a new leaderStartedCh) rather than sleeping, so assertions run on deterministic signals.

pkg/raft/election.go, lines 147-150: The current code swallows failures from node.leadershipTransfer() by logging and continuing, which can leave the node stuck. Surface the failure instead: where the loop calls d.node.leadershipTransfer(), replace the bare continue on error with logic that returns or propagates the error (or transitions state to abort leader startup) so it is visible upstream; still log the error before propagating.

pkg/raft/node.go, lines 84-90: NewNode can leave cfg.ShutdownTimeout at zero, which panics in Stop() → waitForMsgsLanded when a ticker is created with a zero interval. Clamp the internal shutdown timeout in NewNode before the config is stored: if cfg.ShutdownTimeout <= 0, set it to the internal default (e.g. 5 * cfg.SendTimeout) or a sane minimum, so Stop() always sees a positive timeout.
---
Outside diff comments:
In `@pkg/config/config.go`:
- Around line 419-452: RaftConfig.Validate currently misses validating
SnapCount; add a check in the RaftConfig.Validate() method to reject SnapCount
== 0 (or <= 0) and join an appropriate error (e.g., fmt.Errorf("snap count must
be positive")) into multiErr so invalid config fails fast; reference the
RaftConfig type and its Validate() method (and note that
raft.NewFileSnapshotStore in pkg/raft/node.go expects a positive snap count)
when locating where to add the validation.
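A sketch of the suggested check on a trimmed stand-in for `RaftConfig`, using `errors.Join` for the `multiErr` accumulation mentioned above:

```go
package main

import (
	"errors"
	"fmt"
)

// RaftConfig is a trimmed stand-in for pkg/config's RaftConfig.
type RaftConfig struct {
	SnapCount uint64
}

// Validate rejects SnapCount == 0 so that raft.NewFileSnapshotStore, which
// needs a positive retain count, fails fast at config time rather than at
// node startup.
func (c RaftConfig) Validate() error {
	var multiErr error
	if c.SnapCount == 0 {
		multiErr = errors.Join(multiErr, fmt.Errorf("snap count must be positive"))
	}
	return multiErr
}

func main() {
	fmt.Println((RaftConfig{SnapCount: 0}).Validate())
	fmt.Println((RaftConfig{SnapCount: 3}).Validate())
}
```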
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b4087f79-ce2a-422e-a24c-41233d9807d3
📒 Files selected for processing (9)
- block/internal/syncing/syncer_test.go
- node/full.go
- pkg/config/config.go
- pkg/config/config_test.go
- pkg/config/defaults.go
- pkg/raft/election.go
- pkg/raft/election_test.go
- pkg/raft/node.go
- pkg/raft/node_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/config/config_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/config/defaults.go
- node/full.go
- block/internal/syncing/syncer_test.go
…nsfer error propagation, deterministic test

ShutdownTimeout zero-value panic (critical): `NewNode` now clamps `ShutdownTimeout` to `5*SendTimeout` when the caller passes zero, preventing a panic in `time.NewTicker` inside `waitForMsgsLanded`. The normal path through `initRaftNode` already sets it explicitly; this guard protects direct callers (e.g. tests) that omit the field.

Leadership transfer error propagation (major): When store-lag abdication calls `leadershipTransfer()` and it fails, the error is now returned instead of being logged and silently continuing. Continuing after a failed transfer left the node as a leader without a worker, stalling the cluster.

Deterministic abdication test (major): Replace `time.Sleep(10ms)` + `t.Fatal`-in-goroutine with channel-based synchronization: the leader `runFn` signals `leaderStarted`; the test goroutine waits up to 50ms for that signal and calls `t.Error` (safe from goroutines) if it arrives, then cancels the context either way.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
julienrbrt
left a comment
the changes look great, could you add a changelog?
Force-pushed cd82967 to e161e41.
Force-pushed e161e41 to 84ec0d0.
Summary
Hardens Raft HA behavior across several failure modes observed in SIGTERM cycling tests.
Leader fencing on SIGTERM
- Added `ResignLeader(ctx context.Context) error` to `raft.Node` and `FullNode` (via a `LeaderResigner` interface)
- `run_node.go` calls `resigner.ResignLeader(resignCtx)` directly with a 3s deadline (no wrapper goroutine needed); `context.DeadlineExceeded` vs other errors are logged separately
- `ResignLeader` runs `leadershipTransfer` in a goroutine and selects on the result vs `ctx.Done()`, so the caller is unblocked at the deadline (the inner transfer exits on its own, bounded by `ElectionTimeout`)
- A `ShutdownTimeout` field (default 5s) was added to `raft.Config` so `Stop()` drains committed logs with a meaningful timeout
Raft log compaction
- Added `SnapshotThreshold` (default 500) and `TrailingLogs` (default 200) to `RaftConfig` and wired them into the hashicorp/raft config; previously the library defaults applied (8192 / 10240, i.e. snapshots only every ~2.25h at 1 blk/s)
- `SnapCount` default changed from 0 to 3 (0 caused no snapshot files to be retained on disk)
Election timeout config
- Added `ElectionTimeout` to `RaftConfig` (default 1s)
- `Validate()` rejects negative values explicitly (`< 0` is invalid; `0` remains valid as "use library default"); it also validates `>= HeartbeatTimeout` when non-zero to prevent hashicorp/raft panicking at startup
Follower crash on restart (EVM ahead of stale snapshot)
- `RecoverFromRaft` now verifies the local block hash at `raftState.Height`; a hash match skips unnecessary recovery safely, a mismatch returns an explicit error indicating a fork
- `waitForMsgsLanded` replaced its repeating deadline ticker with a one-shot `time.NewTimer` and adds a final check in the deadline branch; this prevents a race where both the poll and deadline tickers fired simultaneously and the timeout won even when `AppliedIndex >= CommitIndex`
FSM data race
- `FSM.Apply` and `SetApplyCallback` are now guarded by `applyMu sync.RWMutex` to prevent the raft library calling `Apply` concurrently while `SetApplyCallback(nil)` clears the channel pointer during shutdown
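A minimal model of the locking discipline (the real `FSM` differs; this only shows the `RWMutex` shape):

```go
package main

import (
	"fmt"
	"sync"
)

// fsm guards its callback channel: Apply takes the read lock while using the
// channel, SetApplyCallback takes the write lock while swapping it, so a
// concurrent Apply can never observe a half-cleared pointer during shutdown.
type fsm struct {
	applyMu sync.RWMutex
	applyCh chan<- string
}

func (f *fsm) Apply(block string) {
	f.applyMu.RLock()
	defer f.applyMu.RUnlock()
	if f.applyCh != nil {
		f.applyCh <- block
	}
}

func (f *fsm) SetApplyCallback(ch chan<- string) {
	f.applyMu.Lock()
	defer f.applyMu.Unlock()
	f.applyCh = ch
}

func main() {
	f := &fsm{}
	ch := make(chan string, 1)
	f.SetApplyCallback(ch)
	f.Apply("block-1")
	fmt.Println(<-ch)
	f.SetApplyCallback(nil) // shutdown path: Apply now safely no-ops
	f.Apply("block-2")
}
```

An `RWMutex` fits here because many `Apply` calls may proceed concurrently; only the rare callback swap needs exclusive access.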
BoltDB log noise
- A filtering log writer (installed once via `sync.Once` in `NewNode`) silently drops `"tx closed"` lines from raft-boltdb; these are emitted on every successful commit due to `defer tx.Rollback()` and have no actionable meaning
- The filter wraps `log.Writer()` instead of hard-coding `os.Stderr`, so any existing stdlib logger redirection is preserved
Block provenance audit log
- `RaftApplyMsg` carries a `Term` field; `FSM.Apply()` logs `raft_term` and `raft_index` alongside block height and hash for replication debugging
Test plan
- `go test ./pkg/raft/... ./pkg/config/... ./node/... -count=1`: all pass
- `TestNodeResignLeader_NilNoop` / `TestNodeResignLeader_NotLeaderNoop`: nil/non-leader no-op paths
- `TestFullNode_ResignLeader_Noop`: table-driven nil and non-leader cases
- `TestNewNode_SnapshotConfigApplied`: snapshot config wiring (table-driven)
- `TestSyncerStop_CallsRaftRetrieverStop`: `Syncer.Stop()` calls `raftRetriever.Stop()`
- `TestRecoverFromRaft_GetHeaderFailure`: hash mismatch error path
- `"negative election timeout rejected"`: `Validate()` rejects `ElectionTimeout < 0`

🤖 Generated with Claude Code