fix: use injected clock for minDrainTime check#2951
fix: use injected clock for minDrainTime check#2951jamesmt-aws wants to merge 4 commits intokubernetes-sigs:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jamesmt-aws The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @jamesmt-aws. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
c083788 to
4d69f1d
Compare
The minDrainTime check in awaitDrain compared c.clock.Since() against the condition's LastTransitionTime, which is set by operatorpkg using metav1.Now() (real clock). When tests use a fake clock, the real clock and fake clock diverge, causing the minDrainTime check to never pass. Track drain start time using the injected clock via a sync.Map on the controller. This matches the pattern from kubernetes-sigs#2783 which fixed the same class of real-clock-vs-fake-clock bug in other termination tests. The test "should not finish draining until minDrainTime has passed" failed deterministically on all k8s versions locally and on 1.35.x in CI (timing-dependent on other versions).
4d69f1d to
efcc902
Compare
|
Should operatorpkg use an injected clock? |
ConditionSet.Set() used metav1.Now() (real clock) for LastTransitionTime.
Controllers that use a fake clock for time-dependent logic (e.g.,
checking elapsed time since a condition was set) get incorrect results
because the condition timestamp and the controller clock diverge.
Add a WithClock option to ConditionTypes.For() that injects a
clock.Clock into the ConditionSet. Defaults to clock.RealClock{}, so
existing callers are unaffected. Tests can pass a fake clock to get
deterministic LastTransitionTime values.
This fixes a class of test flakes in downstream projects (e.g.,
kubernetes-sigs/karpenter#2951) where controllers compare
clock.Since(condition.LastTransitionTime) using a fake clock but the
condition was stamped with real time.
I'm happy with that. Who do I bother about awslabs/operatorpkg#204? If we can't get that upstreamed in a reasonable amount of time, I think my original fix is better than nothing? You can let me know. |
ConditionSet.Set() used metav1.Now() (real clock) for LastTransitionTime.
Controllers that use a fake clock for time-dependent logic (e.g.,
checking elapsed time since a condition was set) get incorrect results
because the condition timestamp and the controller clock diverge.
Add a WithClock option to ConditionTypes.For() that injects a
clock.Clock into the ConditionSet. Defaults to clock.RealClock{}, so
existing callers are unaffected. Tests can pass a fake clock to get
deterministic LastTransitionTime values.
This fixes a class of test flakes in downstream projects (e.g.,
kubernetes-sigs/karpenter#2951) where controllers compare
clock.Since(condition.LastTransitionTime) using a fake clock but the
condition was stamped with real time.
|
Alright the change merged into operator pkg, we can bump the version and consume the functionality: |
242fd4c to
087f93d
Compare
087f93d to
f4c02b5
Compare
f4c02b5 to
203df19
Compare
Migrate pkg/events.NewRecorder to wrap k8s.io/client-go/tools/events.EventRecorder instead of the deprecated k8s.io/client-go/tools/record.EventRecorder. The new EventRecorder interface has a single Eventf method with both 'reason' and 'action' parameters. Karpenter's Event abstraction only models a single concept, and every existing Reason already describes the action the controller took (e.g. EvictPod, DisruptionLaunching), so we reuse Reason as the action. We pass nil for 'related' since karpenter events only ever describe a single object. Switch operator.go to mgr.GetEventRecorder, dropping one staticcheck nolint introduced in kubernetes-sigs#2951. The three call sites in pkg/controllers/controllers.go that pass the recorder directly to operatorpkg's status.NewController still need their nolints until awslabs/operatorpkg#205 merges and we can bump. Test fakes (InternalRecorder, &record.FakeRecorder{} usages across six test suites) updated to the new interface. The new events.FakeRecorder produces the same channel format as the old one.
| if nodeClaim != nil && nodeClaim.StatusConditions().Get(v1.ConditionTypeDrained) == nil { | ||
| nodeClaim.StatusConditions().SetUnknownWithReason(v1.ConditionTypeDrained, "Draining", "Draining") | ||
| } | ||
| // Track drain start time using the injected clock. The condition's LastTransitionTime |
There was a problem hiding this comment.
Can't we use LastTransitionTime if we use the operatorpkg functionality elsewhere?
operatorpkg now supports an injected clock for ConditionSet (awslabs/operatorpkg#204), so LastTransitionTime can be stamped with a fake clock in tests. Use it. Replace the drainStartTimes sync.Map workaround with reading LastTransitionTime directly off the Drained condition. The termination controller now sets the Drained condition via StatusConditionsWithClock, which threads its injected clock through to operatorpkg's status.WithClock, so LastTransitionTime is stamped with the same clock the controller reads back. Add NodeClaim.StatusConditionsWithClock as a sibling to StatusConditions(). The variadic shape we'd normally use can't go on StatusConditions() itself because operatorpkg's status.Object interface requires a non-variadic signature.
| // LastTransitionTime is stamped with the injected clock instead of the | ||
| // real wall clock. The variadic StatusConditions() above is required to | ||
| // satisfy operatorpkg's status.Object interface, which is non-variadic. | ||
| func (in *NodeClaim) StatusConditionsWithClock(clk clock.Clock) status.ConditionSet { |
There was a problem hiding this comment.
Short term we should probably not have a new function, we should just change the StatusConditions() to require a clock. Long term we should investigate this new package: https://go.dev/blog/testing-time
|
Makes sense. I have awslabs/operatorpkg#204 open which adds On the |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Summary
The
minDrainTimecheck inawaitDraincomparedc.clock.Since()against the condition'sLastTransitionTime, which is set byoperatorpkgusingmetav1.Now()(real clock). When tests use a fake clock, the two clocks diverge, causing theminDrainTimecheck to produce incorrect results.This is the same class of bug fixed in #2783 ("deflake termination tests by using injected clock"), reintroduced by #2709 ("add minDrainTime").
sync.Mapon the controllerremoveFinalizerBefore (main, fails deterministically)
After (this PR, 10/10 passes)
Test plan