Skip to content

fix: use injected clock for minDrainTime check#2951

Open
jamesmt-aws wants to merge 4 commits intokubernetes-sigs:mainfrom
jamesmt-aws:fix-mindrain-clock-skew
Open

fix: use injected clock for minDrainTime check#2951
jamesmt-aws wants to merge 4 commits intokubernetes-sigs:mainfrom
jamesmt-aws:fix-mindrain-clock-skew

Conversation

@jamesmt-aws
Copy link
Copy Markdown
Contributor

@jamesmt-aws jamesmt-aws commented Apr 7, 2026

Summary

The minDrainTime check in awaitDrain compared c.clock.Since() against the condition's LastTransitionTime, which is set by operatorpkg using metav1.Now() (real clock). When tests use a fake clock, the two clocks diverge, causing the minDrainTime check to produce incorrect results.

This is the same class of bug fixed in #2783 ("deflake termination tests by using injected clock"), reintroduced by #2709 ("add minDrainTime").

  • Track drain start time using the injected clock via sync.Map on the controller
  • Clean up tracked time in removeFinalizer

Before (main, fails deterministically)

$ go test ./pkg/controllers/node/termination/ -ginkgo.focus="should not finish draining until minDrainTime has passed"

Ran 1 of 31 Specs in 8.980 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 30 Skipped
--- FAIL: TestAPIs (8.98s)
FAIL
FAIL	sigs.k8s.io/karpenter/pkg/controllers/node/termination	9.980s

After (this PR, 10/10 passes)

$ for i in $(seq 1 10); do \
    result=$(go test ./pkg/controllers/node/termination/ \
      -ginkgo.focus="should not finish draining until minDrainTime has passed" 2>&1 \
      | tail -2 | head -1); \
    echo "Run $i: $result"; \
  done

Run 1: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	8.956s
Run 2: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	9.913s
Run 3: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	9.156s
Run 4: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	9.025s
Run 5: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	8.841s
Run 6: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	10.111s
Run 7: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	8.967s
Run 8: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	9.503s
Run 9: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	9.271s
Run 10: ok  	sigs.k8s.io/karpenter/pkg/controllers/node/termination	8.583s

Test plan

  • "should not finish draining until minDrainTime has passed" passes 10/10 (was 0/10)
  • Full termination suite passes (31/31 specs)

@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Apr 7, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jamesmt-aws
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from tallaxes April 7, 2026 16:35
@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 7, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @jamesmt-aws. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 7, 2026
@jamesmt-aws jamesmt-aws force-pushed the fix-mindrain-clock-skew branch from c083788 to 4d69f1d Compare April 7, 2026 16:37
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 7, 2026
The minDrainTime check in awaitDrain compared c.clock.Since() against
the condition's LastTransitionTime, which is set by operatorpkg using
metav1.Now() (real clock). When tests use a fake clock, the real clock
and fake clock diverge, causing the minDrainTime check to never pass.

Track drain start time using the injected clock via a sync.Map on the
controller. This matches the pattern from kubernetes-sigs#2783 which fixed the same
class of real-clock-vs-fake-clock bug in other termination tests.

The test "should not finish draining until minDrainTime has passed"
failed deterministically on all k8s versions locally and on 1.35.x
in CI (timing-dependent on other versions).
@jamesmt-aws jamesmt-aws force-pushed the fix-mindrain-clock-skew branch from 4d69f1d to efcc902 Compare April 7, 2026 16:49
@jamesmt-aws
Copy link
Copy Markdown
Contributor Author

This also affects #2906, which hit the same should not finish draining until minDrainTime has passed failure on 1.29.x and 1.30.x. Recently merged PRs (#2874, #2877, #2946) passed clean, so the timing race doesn't trigger on every run, but it's deterministic locally.

@DerekFrank
Copy link
Copy Markdown
Collaborator

Should operatorpkg use an injected clock?

jamesmt-aws added a commit to jamesmt-aws/operatorpkg that referenced this pull request Apr 7, 2026
ConditionSet.Set() used metav1.Now() (real clock) for LastTransitionTime.
Controllers that use a fake clock for time-dependent logic (e.g.,
checking elapsed time since a condition was set) get incorrect results
because the condition timestamp and the controller clock diverge.

Add a WithClock option to ConditionTypes.For() that injects a
clock.Clock into the ConditionSet. Defaults to clock.RealClock{}, so
existing callers are unaffected. Tests can pass a fake clock to get
deterministic LastTransitionTime values.

This fixes a class of test flakes in downstream projects (e.g.,
kubernetes-sigs/karpenter#2951) where controllers compare
clock.Since(condition.LastTransitionTime) using a fake clock but the
condition was stamped with real time.
@jamesmt-aws
Copy link
Copy Markdown
Contributor Author

jamesmt-aws commented Apr 7, 2026

Should operatorpkg use an injected clock?

I'm happy with that. Who do I bother about awslabs/operatorpkg#204? If we can't get that upstreamed in a reasonable amount of time, I think my original fix is better than nothing? You can let me know.

DerekFrank pushed a commit to awslabs/operatorpkg that referenced this pull request Apr 9, 2026
ConditionSet.Set() used metav1.Now() (real clock) for LastTransitionTime.
Controllers that use a fake clock for time-dependent logic (e.g.,
checking elapsed time since a condition was set) get incorrect results
because the condition timestamp and the controller clock diverge.

Add a WithClock option to ConditionTypes.For() that injects a
clock.Clock into the ConditionSet. Defaults to clock.RealClock{}, so
existing callers are unaffected. Tests can pass a fake clock to get
deterministic LastTransitionTime values.

This fixes a class of test flakes in downstream projects (e.g.,
kubernetes-sigs/karpenter#2951) where controllers compare
clock.Since(condition.LastTransitionTime) using a fake clock but the
condition was stamped with real time.
@DerekFrank
Copy link
Copy Markdown
Collaborator

Alright the change merged into operator pkg, we can bump the version and consume the functionality:

awslabs/operatorpkg#204

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 9, 2026
@jamesmt-aws jamesmt-aws force-pushed the fix-mindrain-clock-skew branch from 242fd4c to 087f93d Compare April 9, 2026 15:48
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 9, 2026
@jamesmt-aws jamesmt-aws force-pushed the fix-mindrain-clock-skew branch from 087f93d to f4c02b5 Compare April 9, 2026 16:04
@jamesmt-aws jamesmt-aws force-pushed the fix-mindrain-clock-skew branch from f4c02b5 to 203df19 Compare April 9, 2026 16:24
jamesmt-aws added a commit to jamesmt-aws/karpenter that referenced this pull request Apr 9, 2026
Migrate pkg/events.NewRecorder to wrap k8s.io/client-go/tools/events.EventRecorder
instead of the deprecated k8s.io/client-go/tools/record.EventRecorder.

The new EventRecorder interface has a single Eventf method with both 'reason'
and 'action' parameters. Karpenter's Event abstraction only models a single
concept, and every existing Reason already describes the action the controller
took (e.g. EvictPod, DisruptionLaunching), so we reuse Reason as the action.
We pass nil for 'related' since karpenter events only ever describe a single
object.

Switch operator.go to mgr.GetEventRecorder, dropping one staticcheck nolint
introduced in kubernetes-sigs#2951. The three call sites in pkg/controllers/controllers.go
that pass the recorder directly to operatorpkg's status.NewController still
need their nolints until awslabs/operatorpkg#205 merges and we can bump.

Test fakes (InternalRecorder, &record.FakeRecorder{} usages across six test
suites) updated to the new interface. The new events.FakeRecorder produces
the same channel format as the old one.
if nodeClaim != nil && nodeClaim.StatusConditions().Get(v1.ConditionTypeDrained) == nil {
nodeClaim.StatusConditions().SetUnknownWithReason(v1.ConditionTypeDrained, "Draining", "Draining")
}
// Track drain start time using the injected clock. The condition's LastTransitionTime
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use LastTransitionTime if we use the operatorpkg functionality elsewhere?

operatorpkg now supports an injected clock for ConditionSet (awslabs/operatorpkg#204),
so LastTransitionTime can be stamped with a fake clock in tests. Use it.

Replace the drainStartTimes sync.Map workaround with reading
LastTransitionTime directly off the Drained condition. The termination
controller now sets the Drained condition via StatusConditionsWithClock,
which threads its injected clock through to operatorpkg's status.WithClock,
so LastTransitionTime is stamped with the same clock the controller reads
back.

Add NodeClaim.StatusConditionsWithClock as a sibling to StatusConditions().
The variadic shape we'd normally use can't go on StatusConditions() itself
because operatorpkg's status.Object interface requires a non-variadic
signature.
// LastTransitionTime is stamped with the injected clock instead of the
// real wall clock. The variadic StatusConditions() above is required to
// satisfy operatorpkg's status.Object interface, which is non-variadic.
func (in *NodeClaim) StatusConditionsWithClock(clk clock.Clock) status.ConditionSet {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short term we should probably not have a new function, we should just change the StatusConditions() to require a clock. Long term we should investigate this new package: https://go.dev/blog/testing-time

@jamesmt-aws
Copy link
Copy Markdown
Contributor Author

Makes sense. I have awslabs/operatorpkg#204 open which adds WithClock to ConditionSet.For(). Once that merges, the fix here becomes: bump operatorpkg, thread the clock through StatusConditions(), and drop the sync.Map. I'll update this PR after #204 lands.

On the testing/synctest package, agreed that's the right long-term direction. It's stdlib, no dependency, and eliminates the fake clock pattern entirely.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 21, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants