fix: certwatcher reloads in poll-only mode when fsnotify setup is slow/fails#59
Open
matteogastaldello wants to merge 2 commits into
Open
fix: certwatcher reloads in poll-only mode when fsnotify setup is slow/fails#59matteogastaldello wants to merge 2 commits into
matteogastaldello wants to merge 2 commits into
Conversation
…tify timing The certwatcher has two change-detection mechanisms: fsnotify (fast) and a periodic poll (fallback). Several tests started the watcher with a 10s poll interval but asserted the reaction within a 4s (or the 1s default) Eventually window, so they depended entirely on fsnotify delivering the event inside a window shorter than the poll interval. Under CI load fsnotify delivery can be delayed, leaving no fallback inside the window — the cause of the intermittent "Timed out after 4.000s" failures in TestSource (should get updated on read certificate errors). Make the poll a timely fallback in the affected tests: use a 1s poll interval and Eventually timeouts comfortably above it. fsnotify is still exercised as the fast path; the assertions no longer hinge on its delivery timing. No production code changes; the poll+fsnotify design is correct. Verified with 12 consecutive runs and a -race run, all passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Start() blocked on adding the fsnotify watches (PollUntilContextTimeout, up to 10s) before reaching the poll loop, and returned an error if the watches could not be added. Two consequences: the periodic poll — the baseline reload mechanism — never ran until fsnotify was set up, and a slow/failing watch setup disabled certificate reloading entirely (Start returned an error). This was the real cause of the flaky certwatcher test: under CI load the watch setup could take ~10s, so no read happened in time (metric never incremented) and Start returned an error. Start now launches the poll loop immediately and adds the fsnotify watches in the background (watchFiles), retrying until they are added or the context is cancelled. fsnotify stays an optimization; the watcher keeps working in poll-only mode until (and if) the watches come up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
pkg/certwatcherintermittently fails in CI, e.g.:Root cause is in production code, not just the test.
Start()added thefsnotify watches before entering the poll loop, using
PollUntilContextTimeout(..., 10s, ...), and returned an error if thewatches could not be added in time. So:
fsnotify was set up.
(Start returned an error and exited).
Under CI load the watch setup can take ~10s, so no certificate read happened in
the test's window (the
read_certificate_totalmetric never incremented) andStartreturned an error → the timeout + AfterSuite panic.Fix
Production (
certwatcher.go).Startnow launches the poll loopimmediately and adds the fsnotify watches in the background (
watchFiles),retrying until they are added or the context is cancelled. fsnotify stays an
optimization; the watcher keeps working in poll-only mode until (and if) the
watches come up, and a watch-setup failure no longer disables reloading or fails
Start.Tests (
certwatcher_test.go). Make the affected tests use a 1s pollinterval and
Eventuallytimeouts comfortably above it, so assertions no longerdepend on fsnotify delivery timing:
should reload currentCert when changedshould reload currentCert when changed with renameshould get updated on read certificate errorsVerification
go build ./...,go vet,gofmtclean. 15 consecutive runs ofgo test ./pkg/certwatcher/plus two-raceruns, all green.🤖 Generated with Claude Code