Skip to content

SDCICD-1840 Add MCSimulation operator restart logic#3255

Open
YiqinZhang wants to merge 2 commits into
openshift:mainfrom
YiqinZhang:sdcicd-1840-restart
Open

SDCICD-1840 Add MCSimulation operator restart logic#3255
YiqinZhang wants to merge 2 commits into
openshift:mainfrom
YiqinZhang:sdcicd-1840-restart

Conversation

@YiqinZhang

@YiqinZhang YiqinZhang commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Adds RestartOperator to the mcsimulation package. After CRD installation, operators that detect CRD presence at startup (like RMO's HostedControlPlane controller) need to be restarted to pick up the newly registered CRDs. This function handles that restart safely.

Approach mirrors RMO's restartRMODeployment pattern: delete matching pods (Deployment controller recreates them automatically) then poll until ReadyReplicas > 0. No Deployment spec mutation — equivalent to kubectl rollout restart without the annotation.

Co-authored-by: Cursor cursor@cursor.com

Summary by CodeRabbit

Release Notes

  • New Features

    • Added restart capability for McSimulation workloads in Kubernetes environments, enabling automatic pod restart through Deployment management with ready replica verification.
  • Tests

    • Added comprehensive test coverage for restart functionality, including deployment verification, timeout handling, and pod lifecycle management scenarios.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

There are test jobs defined for this repository which are not configured to run automatically. Comment /test ? to see a list of all defined jobs. Review these jobs and use /test <job> to manually trigger jobs most likely to be impacted by the proposed changes.Comment /pipeline required to trigger all required & necessary jobs.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6da9ca29-8de3-4495-a53a-696f54b60f19

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from minlei98 and ritmun June 9, 2026 14:20
@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: YiqinZhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2026
@YiqinZhang

Copy link
Copy Markdown
Contributor Author

/override ci/prow/hypershift-pr-check

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@YiqinZhang: Overrode contexts on behalf of YiqinZhang: ci/prow/hypershift-pr-check

Details

In response to this:

/override ci/prow/hypershift-pr-check

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@YiqinZhang YiqinZhang force-pushed the sdcicd-1840-restart branch from 0845cc6 to 6a62f99 Compare June 9, 2026 15:27
@YiqinZhang YiqinZhang force-pushed the sdcicd-1840-restart branch from 6a62f99 to a17640a Compare June 9, 2026 15:36
@YiqinZhang

Copy link
Copy Markdown
Contributor Author

/override ci/prow/hypershift-pr-check

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@YiqinZhang: Overrode contexts on behalf of YiqinZhang: ci/prow/hypershift-pr-check

Details

In response to this:

/override ci/prow/hypershift-pr-check

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
pkg/common/mcsimulation/restart.go (1)

58-58: ⚡ Quick win

Hardcoded timeout lacks flexibility.

The 2-minute timeout may be insufficient for operators with slow startup or excessive for fast-starting operators. Consider making this configurable via function parameter.

♻️ Proposed refactor
-func RestartOperator(ctx context.Context, clientset kubernetes.Interface, namespace, labelSelector string) error {
+func RestartOperator(ctx context.Context, clientset kubernetes.Interface, namespace, labelSelector string, readinessTimeout time.Duration) error {
 	...
-	if err := waitForDeploymentReady(ctx, clientset, namespace, deploy.Name, 2*time.Minute); err != nil {
+	if err := waitForDeploymentReady(ctx, clientset, namespace, deploy.Name, readinessTimeout); err != nil {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/common/mcsimulation/restart.go` at line 58, The call to
waitForDeploymentReady uses a hardcoded 2*time.Minute which is inflexible;
update the code to accept a configurable timeout instead: add a timeout
parameter (e.g., ctxTimeout or readyTimeout time.Duration) to the function that
calls waitForDeploymentReady and to waitForDeploymentReady itself, replace the
hardcoded 2*time.Minute with that parameter, and propagate the new parameter
through any callers (or provide a sensible default constant and
overload/variant) so operators can adjust the readiness wait duration without
changing the code.
pkg/common/mcsimulation/restart_test.go (2)

1-108: ⚡ Quick win

Add test for multiple deployments scenario.

Given the multiple-deployment ambiguity in RestartOperator (see restart.go comment), add a test case that creates two deployments matching the label selector and verifies the function returns an appropriate error.

📝 Proposed test case
func TestRestartOperator_MultipleDeployments(t *testing.T) {
	deploy1 := newDeployment(1)
	deploy2 := newDeployment(1)
	deploy2.Name = "controller-manager-2"
	client := fake.NewSimpleClientset(deploy1, deploy2)

	err := RestartOperator(context.Background(), client, testNS, "app=test-operator")
	if err == nil || !strings.Contains(err.Error(), "multiple deployments") {
		t.Fatalf("expected 'multiple deployments' error, got: %v", err)
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/common/mcsimulation/restart_test.go` around lines 1 - 108, Add a new unit
test named TestRestartOperator_MultipleDeployments that constructs two
deployments via newDeployment (change the second's Name to e.g.
"controller-manager-2"), seeds both into fake.NewSimpleClientset, calls
RestartOperator with the selector "app=test-operator", and asserts the call
returns a non-nil error whose message contains "multiple deployments" to
validate RestartOperator rejects ambiguous matches; use context.Background() for
the call and follow the same error assertion pattern as the other tests.

48-63: 💤 Low value

Test validation incomplete.

The test verifies pod deletion but doesn't validate the full restart cycle. The fake client doesn't simulate the Deployment controller recreating pods, so this test only confirms deletion succeeds and the initial deployment readiness passes immediately.

Consider adding a comment documenting this limitation or using a reactor to simulate pod recreation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/common/mcsimulation/restart_test.go` around lines 48 - 63, The test
TestRestartOperator_DeletesPodsAndWaitsReady currently only asserts pod deletion
because fake.NewSimpleClientset doesn't simulate Deployment controller behavior;
either add a clear comment above TestRestartOperator_DeletesPodsAndWaitsReady
explaining this limitation (that RestartOperator's wait-for-ready relies on real
controller behavior and the fake client only exercises deletion), or enhance the
fake client with a reactor: attach a delete-reactor on Pods (or a watch-reactor)
that, when the controller-manager pod is deleted, programmatically creates a
replacement pod (using newPod) and/or updates the Deployment object (from
newDeployment) to simulate readiness so RestartOperator's full restart-and-wait
logic is exercised; reference RestartOperator,
TestRestartOperator_DeletesPodsAndWaitsReady, newPod, newDeployment and
fake.NewSimpleClientset when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/common/mcsimulation/restart.go`:
- Around line 22-33: The current RestartOperator function silently picks the
first match when multiple Deployments match the labelSelector; update it to
detect when len(deployments.Items) > 1 and handle that case explicitly: either
return an error detailing how many matches were found and listing their names
(use deployments.Items[i].Name) so the caller can resolve the ambiguity, or
choose a deterministic selection rule (e.g., require exact one, or sort by name)
and document it; modify the code around the deployments variable and deploy
assignment in RestartOperator to implement this behavior and surface a clear
error message when multiple matches exist.

---

Nitpick comments:
In `@pkg/common/mcsimulation/restart_test.go`:
- Around line 1-108: Add a new unit test named
TestRestartOperator_MultipleDeployments that constructs two deployments via
newDeployment (change the second's Name to e.g. "controller-manager-2"), seeds
both into fake.NewSimpleClientset, calls RestartOperator with the selector
"app=test-operator", and asserts the call returns a non-nil error whose message
contains "multiple deployments" to validate RestartOperator rejects ambiguous
matches; use context.Background() for the call and follow the same error
assertion pattern as the other tests.
- Around line 48-63: The test TestRestartOperator_DeletesPodsAndWaitsReady
currently only asserts pod deletion because fake.NewSimpleClientset doesn't
simulate Deployment controller behavior; either add a clear comment above
TestRestartOperator_DeletesPodsAndWaitsReady explaining this limitation (that
RestartOperator's wait-for-ready relies on real controller behavior and the fake
client only exercises deletion), or enhance the fake client with a reactor:
attach a delete-reactor on Pods (or a watch-reactor) that, when the
controller-manager pod is deleted, programmatically creates a replacement pod
(using newPod) and/or updates the Deployment object (from newDeployment) to
simulate readiness so RestartOperator's full restart-and-wait logic is
exercised; reference RestartOperator,
TestRestartOperator_DeletesPodsAndWaitsReady, newPod, newDeployment and
fake.NewSimpleClientset when making the change.

In `@pkg/common/mcsimulation/restart.go`:
- Line 58: The call to waitForDeploymentReady uses a hardcoded 2*time.Minute
which is inflexible; update the code to accept a configurable timeout instead:
add a timeout parameter (e.g., ctxTimeout or readyTimeout time.Duration) to the
function that calls waitForDeploymentReady and to waitForDeploymentReady itself,
replace the hardcoded 2*time.Minute with that parameter, and propagate the new
parameter through any callers (or provide a sensible default constant and
overload/variant) so operators can adjust the readiness wait duration without
changing the code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c6411e1a-ede7-4294-9d59-954a2632a46d

📥 Commits

Reviewing files that changed from the base of the PR and between 1d7f2d4 and a17640a.

📒 Files selected for processing (2)
  • pkg/common/mcsimulation/restart.go
  • pkg/common/mcsimulation/restart_test.go

Comment thread pkg/common/mcsimulation/restart.go
@YiqinZhang

Copy link
Copy Markdown
Contributor Author

/override ci/prow/hypershift-pr-check

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@YiqinZhang: Overrode contexts on behalf of YiqinZhang: ci/prow/hypershift-pr-check

Details

In response to this:

/override ci/prow/hypershift-pr-check

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@YiqinZhang YiqinZhang closed this Jun 10, 2026
@YiqinZhang YiqinZhang deleted the sdcicd-1840-restart branch June 10, 2026 14:43
@YiqinZhang YiqinZhang restored the sdcicd-1840-restart branch June 10, 2026 14:46
@YiqinZhang YiqinZhang reopened this Jun 10, 2026
@YiqinZhang

Copy link
Copy Markdown
Contributor Author

/override ci/prow/hypershift-pr-check

@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@YiqinZhang: Overrode contexts on behalf of YiqinZhang: ci/prow/hypershift-pr-check

Details

In response to this:

/override ci/prow/hypershift-pr-check

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@YiqinZhang: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant