Skip to content

fix: certwatcher reloads in poll-only mode when fsnotify setup is slow/fails#59

Open
matteogastaldello wants to merge 2 commits into
mainfrom
fix/flaky-certwatcher-tests
Open

fix: certwatcher reloads in poll-only mode when fsnotify setup is slow/fails#59
matteogastaldello wants to merge 2 commits into
mainfrom
fix/flaky-certwatcher-tests

Conversation

@matteogastaldello

@matteogastaldello matteogastaldello commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Problem

pkg/certwatcher intermittently fails in CI, e.g.:

[FAIL] CertWatcher ... [It] should get updated on read certificate errors
[FAILED] Timed out after 10.001s.
[AfterSuite] [PANICKED] ... Start(ctx) returned an error

Root cause is in production code, not just the test. Start() added the
fsnotify watches before entering the poll loop, using
PollUntilContextTimeout(..., 10s, ...), and returned an error if the
watches could not be added in time. So:

  1. The periodic poll — the baseline reload mechanism — never ran until
    fsnotify was set up.
  2. A slow or failing watch setup disabled certificate reloading entirely
    (Start returned an error and exited).

Under CI load the watch setup can take ~10s, so no certificate read happened in
the test's window (the read_certificate_total metric never incremented) and
Start returned an error → the timeout + AfterSuite panic.

Fix

Production (certwatcher.go). Start now launches the poll loop
immediately and adds the fsnotify watches in the background (watchFiles),
retrying until they are added or the context is cancelled. fsnotify stays an
optimization; the watcher keeps working in poll-only mode until (and if) the
watches come up, and a watch-setup failure no longer disables reloading or fails
Start.

Tests (certwatcher_test.go). Make the affected tests use a 1s poll
interval and Eventually timeouts comfortably above it, so assertions no longer
depend on fsnotify delivery timing:

  • should reload currentCert when changed
  • should reload currentCert when changed with rename
  • should get updated on read certificate errors

Verification

go build ./..., go vet, gofmt clean. 15 consecutive runs of
go test ./pkg/certwatcher/ plus two -race runs, all green.

🤖 Generated with Claude Code

…tify timing

The certwatcher has two change-detection mechanisms: fsnotify (fast) and a
periodic poll (fallback). Several tests started the watcher with a 10s poll
interval but asserted the reaction within a 4s (or the 1s default) Eventually
window, so they depended entirely on fsnotify delivering the event inside a
window shorter than the poll interval. Under CI load fsnotify delivery can be
delayed, leaving no fallback inside the window — the cause of the intermittent
"Timed out after 4.000s" failures in TestSource
(should get updated on read certificate errors).

Make the poll a timely fallback in the affected tests: use a 1s poll interval
and Eventually timeouts comfortably above it. fsnotify is still exercised as the
fast path; the assertions no longer hinge on its delivery timing.

No production code changes; the poll+fsnotify design is correct.

Verified with 12 consecutive runs and a -race run, all passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 60.00000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/certwatcher/certwatcher.go 60.00% 5 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Start() blocked on adding the fsnotify watches (PollUntilContextTimeout, up to
10s) before reaching the poll loop, and returned an error if the watches could
not be added. Two consequences: the periodic poll — the baseline reload
mechanism — never ran until fsnotify was set up, and a slow/failing watch setup
disabled certificate reloading entirely (Start returned an error).

This was the real cause of the flaky certwatcher test: under CI load the watch
setup could take ~10s, so no read happened in time (metric never incremented)
and Start returned an error.

Start now launches the poll loop immediately and adds the fsnotify watches in
the background (watchFiles), retrying until they are added or the context is
cancelled. fsnotify stays an optimization; the watcher keeps working in
poll-only mode until (and if) the watches come up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@matteogastaldello matteogastaldello changed the title test: deterministic certwatcher tests (fix flaky TestSource) fix: certwatcher reloads in poll-only mode when fsnotify setup is slow/fails Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant