Skip to content

OCPBUGS-77344: enable dcm on route changes#778

Open
jcmoraisjr wants to merge 1 commit into
openshift:masterfrom
jcmoraisjr:OCPBUGS-77344-dcm-update-route
Open

OCPBUGS-77344: enable dcm on route changes#778
jcmoraisjr wants to merge 1 commit into
openshift:masterfrom
jcmoraisjr:OCPBUGS-77344-dcm-update-route

Conversation

@jcmoraisjr
Copy link
Copy Markdown
Member

@jcmoraisjr jcmoraisjr commented May 25, 2026

DCM was disabled on route changes due to the way the resource changes are being handled: without DCM, the template plugin applies a route remove, followed by a route add, which works well in the fork-and-reload approach since the removal is not applied until adding the route again succeed. On DCM, each of these operations are done separately as well, which means that the route is completely disabled during some time.

This update changes this behavior by saving previous and updated states of a route, identifying the differences, and dynamically applying them if possible. A fork-and-reload is made in case the dynamic update is not possible, as before.

Part of the changes also include extending the Endpoint struct so that it stores the weight and the VerifyHostname properties of each endpoint in order to make comparisons easier, as well as decrease the number of parameters sent to inner methods. These fields cannot be updated during the Endpoint creation, since they depend on the backend the endpoint is assigned. The router's updateEndpointTable() method centralizes all changes to these context based fields.

https://redhat.atlassian.net/browse/OCPBUGS-77344

Summary by CodeRabbit

  • New Features

    • Added error-detection helpers for HAProxy server management.
    • Added in-place update vs replace flow for backend endpoints to minimize disruptive changes.
  • Bug Fixes

    • Improved health-check handling: disabling checks explicitly marks servers healthy and supports reconfiguration when checks are missing.
    • More robust dynamic endpoint reconciliation with better handling of protocol/port/weight changes.
  • Refactor

    • Simplified server add/update/delete workflows and centralized endpoint table management for dynamic route updates.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 25, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@jcmoraisjr: This pull request references Jira Issue OCPBUGS-77344, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

DCM was disabled on route changes due to the way the resource changes are being handled: without DCM, the template plugin applies a route remove, followed by a route add, which works well in the fork-and-reload approach since the removal is not applied until adding the route again succeed. On DCM, each of these operations are done separately as well, which means that the route is completely disabled during some time.

This update changes this behavior by saving previous and updated states of a route, identifying the differences, and dynamically applying them if possible. A fork-and-reload is made in case the dynamic update is not possible, as before.

Part of the changes also include extending the Endpoint struct so that it stores the weight and the VerifyHostname properties of each endpoint in order to make comparisons easier, as well as decrease the number of parameters sent to inner methods. These fields cannot be updated during the Endpoint creation, since they depend on the backend the endpoint is assigned. The router's updateEndpointTable() method centralizes all changes to these context based fields.

https://redhat.atlassian.net/browse/OCPBUGS-77344

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 25, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 25, 2026

Walkthrough

Router and config-manager now track per-endpoint Weight and VerifyHostname. Backend APIs use endpoint pointers and implement update/replace semantics, health-check enable/disable and explicit health-setting commands. Router rebuilds endpoint tables and performs dynamic in-place updates when compatible; ReplaceRouteEndpoints diffs endpoints by ID and applies add→update→delete.

Changes

Endpoint-Driven Server Management Refactor

Layer / File(s) Summary
Type system and ConfigManager interface updates
pkg/router/template/types.go
Endpoint gains Weight and VerifyHostname. ConfigManager.ReplaceRouteEndpoints final parameter changes from weight int32 to activeEndpoints int.
Router endpoint table and dynamic route logic
pkg/router/template/router.go
Adds updateEndpointTable; refactors dynamic add/update/replace/remove to use stored EndpointTable, introduces dynamicallyUpdateRoute for in-place updates, and persists endpoint-table changes during service-weight redistribution.
Config manager reconciliation and deletion
pkg/router/template/configmanager/haproxy/manager.go
ReplaceRouteEndpoints now diffs endpoints by ID and deep-equality, applies add→update→delete order, aggregates errors, and performs scale-out health-check enablement with fallback replace when HAProxy reports checks unconfigured. Deletion calls pass endpoint pointers.
HAProxy backend server management API refactor
pkg/router/template/configmanager/haproxy/backend.go
AddServer, UpdateServer, ReplaceServer, DeleteServer, and health-check APIs accept *Endpoint. Adds error classifiers, derives drain from Endpoint.Weight, adds innerSetServerHealth and apiSetServerHealth, and updates execCommand validation.
Backend dynamic-update tests and fixtures
pkg/router/template/configmanager/haproxy/backend_test.go, pkg/router/template/configmanager/haproxy/blueprint_plugin_test.go
TestBackendDynamicUpdate expanded with old/new endpoint pairs, replace command coverage, h2c variants, and failure-mode expectations. Test harness builds ServiceAliasConfig/ServiceUnit and uses new backend signatures; fakeConfigManager signature updated.

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • openshift/router#763: Related prior work on HAProxy dynamic backend add/delete and backend update/replace flows.

Suggested reviewers

  • gcs278
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ❓ Inconclusive The custom check is for Ginkgo test code (BeforeEach, It blocks), but this PR uses standard Go testing with testify assertions, not Ginkgo. Clarify whether check applies to all test code or only Ginkgo tests. This PR uses Go testing.T, not Ginkgo.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title references a Jira issue (OCPBUGS-77344) and mentions enabling DCM on route changes, which aligns with the PR's primary objective to enable dynamic configuration management for route updates.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The codebase uses the standard Go testing framework (testing.T), not Ginkgo. The check is specifically for Ginkgo test names, making it inapplicable to this PR.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR—all 32 new test files are standard Go unit tests, so the MicroShift test compatibility check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR contains only unit tests using Go's standard testing package; no Ginkgo e2e tests (It(), Describe(), etc.) are present that would require SNO compatibility checks.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies router HAProxy backend and endpoint handling. No deployment manifests, controllers, or scheduling constraints introduced.
Ote Binary Stdout Contract ✅ Passed No OTE Binary Stdout violations found. PR modifies internal utility packages with no process-level functions or stdout writes that would corrupt JSON output.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR contains no new Ginkgo e2e tests. It modifies existing standard Go unit tests and implementation files only, so the IPv6/disconnected network compatibility check does not apply.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from bentito and ironcladlou May 25, 2026 18:00
@jcmoraisjr
Copy link
Copy Markdown
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 25, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@jcmoraisjr: This pull request references Jira Issue OCPBUGS-77344, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@jcmoraisjr: This pull request references Jira Issue OCPBUGS-77344, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request.

Details

In response to this:

DCM was disabled on route changes due to the way the resource changes are being handled: without DCM, the template plugin applies a route remove, followed by a route add, which works well in the fork-and-reload approach since the removal is not applied until adding the route again succeed. On DCM, each of these operations are done separately as well, which means that the route is completely disabled during some time.

This update changes this behavior by saving previous and updated states of a route, identifying the differences, and dynamically applying them if possible. A fork-and-reload is made in case the dynamic update is not possible, as before.

Part of the changes also include extending the Endpoint struct so that it stores the weight and the VerifyHostname properties of each endpoint in order to make comparisons easier, as well as decrease the number of parameters sent to inner methods. These fields cannot be updated during the Endpoint creation, since they depend on the backend the endpoint is assigned. The router's updateEndpointTable() method centralizes all changes to these context based fields.

https://redhat.atlassian.net/browse/OCPBUGS-77344

Summary by CodeRabbit

Release Notes

  • New Features

  • Added error detection helpers for HAProxy server management operations.

  • Introduced improved server endpoint replacement logic with better fallback handling.

  • Bug Fixes

  • Enhanced health check state management with explicit server health updates.

  • Improved dynamic endpoint updates with more robust comparison and fallback behavior.

  • Refactor

  • Refined server addition, update, and deletion workflows for greater reliability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jcmoraisjr
Copy link
Copy Markdown
Member Author

/assign @gcs278

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
pkg/router/template/router.go (2)

886-896: 💤 Low value

Consider using oldConfig.EndpointTable[serviceKey] for removed service endpoints.

When removing a service from a route, using service.EndpointTable relies on the invariant that the ServiceUnit's endpoint table matches what was previously synced to this backend. Using oldConfig.EndpointTable[serviceKey] would be more explicit about removing exactly what was added to this specific backend.

While the state should normally be consistent due to locking, using the old config's endpoint table would be more defensive.

♻️ Suggested change
 	for serviceKey := range oldConfig.ServiceUnits {
-		if service, ok := r.findMatchingServiceUnit(serviceKey); ok {
-			if _, newFound := newConfig.ServiceUnits[serviceKey]; !newFound {
-				// Service was removed, remove the endpoints.
-				if err := r.dynamicConfigManager.RemoveRouteEndpoints(backendKey, service.EndpointTable); err != nil {
+		if _, newFound := newConfig.ServiceUnits[serviceKey]; !newFound {
+			if service, ok := r.findMatchingServiceUnit(serviceKey); ok {
+				// Service was removed from route, remove the endpoints that were in the old config.
+				if err := r.dynamicConfigManager.RemoveRouteEndpoints(backendKey, oldConfig.EndpointTable[serviceKey]); err != nil {
 					log.Info("router will reload due to error removing endpoints", "backendKey", backendKey, "serviceKey", serviceKey, "error", err)
 					failed = true
 				}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/router.go` around lines 886 - 896, The code currently
passes service.EndpointTable when removing endpoints for services present in
oldConfig.ServiceUnits but no longer in newConfig; instead, use the exact
endpoint table that was part of the previous route config to be defensive.
Update the removal call in the loop that iterates oldConfig.ServiceUnits (around
findMatchingServiceUnit and the RemoveRouteEndpoints call) to use
oldConfig.EndpointTable[serviceKey] (or the appropriate map/key in oldConfig)
rather than service.EndpointTable, preserving backendKey and serviceKey when
calling r.dynamicConfigManager.RemoveRouteEndpoints.

957-972: 💤 Low value

Misleading weight comparison in log message.

The log at line 966 compares oldWeight (from cfg.ServiceUnitNames, the calculated per-endpoint weight) with newWeight (from cfg.ServiceUnits, the route-level service weight). These values have different meanings and scales, making the log confusing for debugging.

The actual replacement logic is correct since it compares the full endpoint structs including their Weight fields.

♻️ Suggested change for clearer logging
-			log.V(4).Info("updating weight for sibling service due to endpoint change", "service", otherServiceKey, "backendKey", backendKey, "oldWeight", oldWeight, "newWeight", newWeight)
+			log.V(4).Info("updating sibling service endpoints due to weight redistribution", "service", otherServiceKey, "backendKey", backendKey)

Alternatively, compare the actual endpoint weights:

oldEpWeight := int32(0)
newEpWeight := int32(0)
if len(oldOtherEndpoints) > 0 {
    oldEpWeight = oldOtherEndpoints[0].Weight
}
if len(newOtherEndpoints) > 0 {
    newEpWeight = newOtherEndpoints[0].Weight
}
log.V(4).Info("updating sibling service endpoints", "service", otherServiceKey, "backendKey", backendKey, "oldEpWeight", oldEpWeight, "newEpWeight", newEpWeight)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/router.go` around lines 957 - 972, The log message in the
loop that updates sibling services is misleading because it compares oldWeight
(from oldWeights/cfg.ServiceUnitNames per-endpoint weight) to newWeight
(cfg.ServiceUnits route-level weight); change the log to report endpoint-level
weights instead: read the first endpoint Weight from oldOtherEndpoints and
newOtherEndpoints (guarding for empty slices) and log those as oldEpWeight and
newEpWeight in the log.V(4).Info call inside the loop that references
otherServiceKey, backendKey, oldOtherEndpoints, newOtherEndpoints and the
ReplaceRouteEndpoints error path so the message reflects actual endpoint weight
changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/router/template/router.go`:
- Around line 886-896: The code currently passes service.EndpointTable when
removing endpoints for services present in oldConfig.ServiceUnits but no longer
in newConfig; instead, use the exact endpoint table that was part of the
previous route config to be defensive. Update the removal call in the loop that
iterates oldConfig.ServiceUnits (around findMatchingServiceUnit and the
RemoveRouteEndpoints call) to use oldConfig.EndpointTable[serviceKey] (or the
appropriate map/key in oldConfig) rather than service.EndpointTable, preserving
backendKey and serviceKey when calling
r.dynamicConfigManager.RemoveRouteEndpoints.
- Around line 957-972: The log message in the loop that updates sibling services
is misleading because it compares oldWeight (from
oldWeights/cfg.ServiceUnitNames per-endpoint weight) to newWeight
(cfg.ServiceUnits route-level weight); change the log to report endpoint-level
weights instead: read the first endpoint Weight from oldOtherEndpoints and
newOtherEndpoints (guarding for empty slices) and log those as oldEpWeight and
newEpWeight in the log.V(4).Info call inside the loop that references
otherServiceKey, backendKey, oldOtherEndpoints, newOtherEndpoints and the
ReplaceRouteEndpoints error path so the message reflects actual endpoint weight
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a3caf78-e491-44da-b467-71d20f596926

📥 Commits

Reviewing files that changed from the base of the PR and between 6761134 and a5ac103.

📒 Files selected for processing (6)
  • pkg/router/template/configmanager/haproxy/backend.go
  • pkg/router/template/configmanager/haproxy/backend_test.go
  • pkg/router/template/configmanager/haproxy/blueprint_plugin_test.go
  • pkg/router/template/configmanager/haproxy/manager.go
  • pkg/router/template/router.go
  • pkg/router/template/types.go

DCM was disabled on route changes due to the way the resource changes
are being handled: without DCM, the template plugin applies a route
remove, followed by a route add, which works well in the fork-and-reload
approach since the removal is not applied until adding the route again
succeed. On DCM, each of these operations are done separately as well,
which means that the route is completely disabled during some time.

This update changes this behavior by saving previous and updated states
of a route, identifying the differences, and dynamically applying them
if possible. A fork-and-reload is made in case the dynamic update is
not possible, as before.

Part of the changes also include extending the Endpoint struct so that
it stores the weight and the VerifyHostname properties of each endpoint
in order to make comparisons easier, as well as decrease the number of
parameters sent to inner methods. These fields cannot be updated during
the Endpoint creation, since they depend on the backend the endpoint is
assigned. The router's updateEndpointTable() method centralizes all
changes to these context based fields.

https://redhat.atlassian.net/browse/OCPBUGS-77344
@jcmoraisjr jcmoraisjr force-pushed the OCPBUGS-77344-dcm-update-route branch from a5ac103 to 5dbcbdf Compare May 26, 2026 19:44
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from gcs278. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/router/template/router.go (1)

982-996: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove endpoints per route, not from the full ServiceUnit table.

This helper now sends service.EndpointTable to every associated backend. If two routes select different target ports on the same service, Line 995 will attempt to delete endpoints that were never configured for some of those backends, which turns a normal endpoint deletion into a forced reload.

Proposed fix
-func (r *templateRouter) dynamicallyRemoveEndpoints(service ServiceUnit) bool {
+func (r *templateRouter) dynamicallyRemoveEndpoints(id ServiceUnitKey, service ServiceUnit) bool {
 	if r.dynamicConfigManager == nil || !r.synced {
 		return false
 	}
@@
 	for backendKey := range service.ServiceAliasAssociations {
-		if _, ok := r.state[backendKey]; !ok {
+		cfg, ok := r.state[backendKey]
+		if !ok {
 			continue
 		}
@@
-		if err := r.dynamicConfigManager.RemoveRouteEndpoints(backendKey, service.EndpointTable); err != nil {
+		if err := r.dynamicConfigManager.RemoveRouteEndpoints(backendKey, cfg.EndpointTable[id]); err != nil {
 			// Error dynamically modifying the config, so return false to cause a reload to happen.
 			log.Info("router will reload as the ConfigManager could not dynamically remove endpoints for backend", "backendKey", backendKey, "error", err)
 			return false
 		}
 	}

You'll also need the matching call-site update at Line 1014.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/router.go` around lines 982 - 996, The
dynamicallyRemoveEndpoints helper is passing the entire
ServiceUnit.EndpointTable to every associated backend, causing
RemoveRouteEndpoints(backendKey, service.EndpointTable) to delete endpoints that
don't belong to specific backends; change dynamicallyRemoveEndpoints (and its
call-site) to pass only the route-specific endpoints for each backend (i.e., the
endpoints that correspond to that backendKey) instead of service.EndpointTable,
updating the use of ServiceAliasAssociations and the call-site that invokes
dynamicallyRemoveEndpoints so each backend removal receives the correct
per-route endpoint slice/map.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/router/template/configmanager/haproxy/backend.go`:
- Around line 370-375: When innerDeleteServer(oldEP) fails after you already
called innerSetServerState(oldEP, false), restore the original server state to
avoid leaving the fallback endpoint drained; change the code around the
innerDeleteServer call to capture the delete error, call
b.innerSetServerState(oldEP, true) to revert the change, and return a combined
or wrapped error if the restore also fails (reference innerSetServerState and
innerDeleteServer to locate the logic).
- Around line 361-363: The code only updates HAProxy weight when oldEP.Weight !=
newEP.Weight via b.innerUpdateServerWeight, but fails to toggle the server's
state between "ready" and "drain" when the weight crosses the zero boundary (<=0
vs >0), causing drained endpoints to still serve or recovered ones to remain
drained (and passthrough always sends 100%). Modify the branch handling in the
function that compares oldEP and newEP so that when weight changes you detect a
crossing of the drain threshold (oldEP.Weight > 0 && newEP.Weight <= 0 =>
transition to "drain"; oldEP.Weight <= 0 && newEP.Weight > 0 => transition to
"ready") and invoke the appropriate HAProxy state-change routine in addition to
calling b.innerUpdateServerWeight(newEP, isPassthrough); ensure isPassthrough
still results in correct weight payload but do the state flip via the same
server-state API used elsewhere.

In `@pkg/router/template/configmanager/haproxy/manager.go`:
- Around line 524-545: The code must only attempt to re-enable the singleton
health check if that exact old endpoint still exists; before calling
backend.EnableHealthCheck on oldEndpoints[0] (and before calling
backend.ReplaceServer/adding to newEPs), verify that the endpoint is still
present in the current service/backend endpoints (e.g. check svc.Endpoints or
the current backend server list) and skip this whole branch if the endpoint no
longer exists so a prior ReplaceServer/delete isn't turned into a forced reload.

In `@pkg/router/template/router.go`:
- Around line 886-890: When removing a service from a route, the code currently
passes the raw service list (service.EndpointTable) to RemoveRouteEndpoints;
instead it should pass the route-scoped endpoint slice that was actually
attached to this backend (the EndpointTable stored on the service unit from the
old route config). Update the call in the loop over oldConfig.ServiceUnits so
that, after locating the removed service via r.findMatchingServiceUnit, you
retrieve the service unit from oldConfig.ServiceUnits[serviceKey] (or the local
variable representing that unit) and pass that unit's EndpointTable (the
PreferPort-filtered slice) to
r.dynamicConfigManager.RemoveRouteEndpoints(backendKey, ... ) instead of
service.EndpointTable, keeping the same backendKey and error handling.

---

Outside diff comments:
In `@pkg/router/template/router.go`:
- Around line 982-996: The dynamicallyRemoveEndpoints helper is passing the
entire ServiceUnit.EndpointTable to every associated backend, causing
RemoveRouteEndpoints(backendKey, service.EndpointTable) to delete endpoints that
don't belong to specific backends; change dynamicallyRemoveEndpoints (and its
call-site) to pass only the route-specific endpoints for each backend (i.e., the
endpoints that correspond to that backendKey) instead of service.EndpointTable,
updating the use of ServiceAliasAssociations and the call-site that invokes
dynamicallyRemoveEndpoints so each backend removal receives the correct
per-route endpoint slice/map.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aa126d80-4b4b-44bb-841c-1121e049641b

📥 Commits

Reviewing files that changed from the base of the PR and between a5ac103 and 5dbcbdf.

📒 Files selected for processing (6)
  • pkg/router/template/configmanager/haproxy/backend.go
  • pkg/router/template/configmanager/haproxy/backend_test.go
  • pkg/router/template/configmanager/haproxy/blueprint_plugin_test.go
  • pkg/router/template/configmanager/haproxy/manager.go
  • pkg/router/template/router.go
  • pkg/router/template/types.go

Comment on lines +361 to +363
if oldEP.Weight != newEP.Weight {
return false, b.innerUpdateServerWeight(newEP, isPassthrough)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle weight changes that cross the drain boundary.

Line 361 only updates HAProxy's weight. If the endpoint moves from > 0 to <= 0 or back, the server never flips between ready and drain, so a drained endpoint can keep serving or a recovered endpoint can stay drained. Passthrough is especially affected here because innerUpdateServerWeight() always sends 100%.

Suggested fix
 	if oldEP.Weight != newEP.Weight {
-		return false, b.innerUpdateServerWeight(newEP, isPassthrough)
+		if err := b.innerUpdateServerWeight(newEP, isPassthrough); err != nil {
+			return false, err
+		}
+		if (oldEP.Weight <= 0) != (newEP.Weight <= 0) {
+			if err := b.innerSetServerState(newEP, true); err != nil {
+				return false, err
+			}
+		}
 	}
 
 	return false, nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/configmanager/haproxy/backend.go` around lines 361 - 363,
The code only updates HAProxy weight when oldEP.Weight != newEP.Weight via
b.innerUpdateServerWeight, but fails to toggle the server's state between
"ready" and "drain" when the weight crosses the zero boundary (<=0 vs >0),
causing drained endpoints to still serve or recovered ones to remain drained
(and passthrough always sends 100%). Modify the branch handling in the function
that compares oldEP and newEP so that when weight changes you detect a crossing
of the drain threshold (oldEP.Weight > 0 && newEP.Weight <= 0 => transition to
"drain"; oldEP.Weight <= 0 && newEP.Weight > 0 => transition to "ready") and
invoke the appropriate HAProxy state-change routine in addition to calling
b.innerUpdateServerWeight(newEP, isPassthrough); ensure isPassthrough still
results in correct weight payload but do the state flip via the same
server-state API used elsewhere.

Comment on lines +370 to +375
if err := b.innerSetServerState(oldEP, false); err != nil {
return err
}
if err := b.innerDeleteServer(oldEP); err != nil {
return err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore the old server state if replace aborts before the add.

After Line 370 puts the old server in maint, a delete failure on Line 373 returns an error but leaves that endpoint drained until the reload path completes. That reintroduces a transient availability drop on the fallback path.

Suggested fix
 	if err := b.innerSetServerState(oldEP, false); err != nil {
 		return err
 	}
 	if err := b.innerDeleteServer(oldEP); err != nil {
+		if rollbackErr := b.innerSetServerState(oldEP, true); rollbackErr != nil {
+			return fmt.Errorf("deleting old server: %v; restoring old server state: %v", err, rollbackErr)
+		}
 		return err
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/configmanager/haproxy/backend.go` around lines 370 - 375,
When innerDeleteServer(oldEP) fails after you already called
innerSetServerState(oldEP, false), restore the original server state to avoid
leaving the fallback endpoint drained; change the code around the
innerDeleteServer call to capture the delete error, call
b.innerSetServerState(oldEP, true) to revert the change, and return a combined
or wrapped error if the restore also fails (reference innerSetServerState and
innerDeleteServer to locate the logic).

Comment on lines 524 to +545
if len(oldEndpoints) == 1 {
// enabling also for the former single endpoint as well
newEPs = append(newEPs, oldEndpoints[0])
ep := &oldEndpoints[0]
if !ep.NoHealthCheck {
// The backend was previously in the single server scenario, so health check should be enabled.
// Dynamically enabling health check only works if health check is configured, and we only
// configure health check upfront in dynamically added servers.
// So, we are trying to enable health check first, and if HAProxy responds that it is not
// configured, we'll need to remove and add it again.
err := backend.EnableHealthCheck(ep)
if backend.IsHealthCheckNotConfiguredError(err) {
// Health check not configured on this server. Replace it dynamically to reconfigure with health check.
err = backend.ReplaceServer(entry.backend, svc, ep, ep, cm.workingDir, cm.defaultDestinationCA)
if err == nil {
// Server replaced successfully, mark to enable health check later.
newEPs = append(newEPs, ep)
}
}
if err != nil {
// Failed either enabling health check or replacing backend server.
errs = append(errs, err)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only carry forward the old singleton health check if that server still exists.

This branch runs after the delete phase, but it still calls EnableHealthCheck on oldEndpoints[0] whenever len(oldEndpoints) == 1. If the old singleton was replaced outright, that endpoint is already gone and this turns a valid dynamic replace into a forced reload.

Proposed fix
-		if len(oldEndpoints) == 1 {
+		if len(oldEndpoints) == 1 {
+			if _, removed := deletedEndpoints[oldEndpoints[0].ID]; removed {
+				goto enableNewHealthChecks
+			}
 			ep := &oldEndpoints[0]
 			if !ep.NoHealthCheck {
 				// The backend was previously in the single server scenario, so health check should be enabled.
 				// Dynamically enabling health check only works if health check is configured, and we only
 				// configure health check upfront in dynamically added servers.
@@
 				}
 			}
 		}
+	enableNewHealthChecks:
 		for _, ep := range newEPs {
 			if !ep.NoHealthCheck {
 				if err := backend.EnableHealthCheck(ep); err != nil {
 					errs = append(errs, err)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/configmanager/haproxy/manager.go` around lines 524 - 545,
The code must only attempt to re-enable the singleton health check if that exact
old endpoint still exists; before calling backend.EnableHealthCheck on
oldEndpoints[0] (and before calling backend.ReplaceServer/adding to newEPs),
verify that the endpoint is still present in the current service/backend
endpoints (e.g. check svc.Endpoints or the current backend server list) and skip
this whole branch if the endpoint no longer exists so a prior
ReplaceServer/delete isn't turned into a forced reload.

Comment on lines +886 to +890
for serviceKey := range oldConfig.ServiceUnits {
if _, newFound := newConfig.ServiceUnits[serviceKey]; !newFound {
if service, ok := r.findMatchingServiceUnit(serviceKey); ok {
// Service was removed from route, remove the endpoints that were in the old config.
if err := r.dynamicConfigManager.RemoveRouteEndpoints(backendKey, service.EndpointTable); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the route-scoped endpoint slice when a service is removed from a route.

service.EndpointTable is the raw service list, not the PreferPort-filtered set that was actually attached to this backend. On a port-specific route, Line 890 can try to delete servers that never existed in the route backend and unnecessarily drop back to reload.

Proposed fix
-				if err := r.dynamicConfigManager.RemoveRouteEndpoints(backendKey, service.EndpointTable); err != nil {
+				if err := r.dynamicConfigManager.RemoveRouteEndpoints(backendKey, oldConfig.EndpointTable[serviceKey]); err != nil {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/router.go` around lines 886 - 890, When removing a
service from a route, the code currently passes the raw service list
(service.EndpointTable) to RemoveRouteEndpoints; instead it should pass the
route-scoped endpoint slice that was actually attached to this backend (the
EndpointTable stored on the service unit from the old route config). Update the
call in the loop over oldConfig.ServiceUnits so that, after locating the removed
service via r.findMatchingServiceUnit, you retrieve the service unit from
oldConfig.ServiceUnits[serviceKey] (or the local variable representing that
unit) and pass that unit's EndpointTable (the PreferPort-filtered slice) to
r.dynamicConfigManager.RemoveRouteEndpoints(backendKey, ... ) instead of
service.EndpointTable, keeping the same backendKey and error handling.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 26, 2026

@jcmoraisjr: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants