Skip to content

OCPBUGS-77056: Make external cert validation asynchronous#745

Open
bentito wants to merge 43 commits into
openshift:masterfrom
bentito:OCPBUGS-77056-async-sar
Open

OCPBUGS-77056: Make external cert validation asynchronous#745
bentito wants to merge 43 commits into
openshift:masterfrom
bentito:OCPBUGS-77056-async-sar

Conversation

@bentito
Copy link
Copy Markdown
Contributor

@bentito bentito commented Mar 3, 2026

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Summary by CodeRabbit

  • New Features

    • Caching and throttling for external-certificate validation
    • Configurable parallelism for route status updates
  • Improvements

    • More reliable, faster secret handling with hybrid informer strategy
    • Increased Kubernetes client request throughput
    • Architecture-aware debug image builds
    • Improved route status comparison to avoid spurious updates
  • Bug Fixes

    • Record completed access-check status for external certificates
  • Tests

    • Revised and added unit tests for secret manager, SAR caching, and status logic

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Mar 3, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 3, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-77056, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2026
@openshift-ci openshift-ci Bot requested a review from lihongan March 3, 2026 23:38
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Mar 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign grzpiotrowski for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bentito bentito changed the title WIP: OCPBUGS-77056: Make external cert validation asynchronous OCPBUGS-77056: Make external cert validation asynchronous Mar 3, 2026
@bentito bentito force-pushed the OCPBUGS-77056-async-sar branch from 325f313 to 214b603 Compare March 3, 2026 23:41
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2026
@bentito bentito marked this pull request as ready for review March 3, 2026 23:41
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2026
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented Mar 3, 2026

/hold

@openshift-ci openshift-ci Bot requested review from Miciah and grzpiotrowski March 3, 2026 23:42
@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 3, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Semaphore-throttled, cached SAR validation for external TLS; new RBAC-aware SharedSecretManager; SAR-completed route status events and condition equality fixes; WriterLease gains worker concurrency; synchronous test refactors and small build/client wiring tweaks.

Changes

External Certificate SAR Validation and Route Status

Layer / File(s) Summary
SAR validation and caching
pkg/router/routeapihelpers/validation.go, pkg/router/routeapihelpers/validation_test.go
Use tls.X509KeyPair; add semaphore-throttled SubjectAccessReview checks, short-TTL success cache, and cache invalidation helpers; tests adjusted.
SharedSecretManager (RBAC-aware hybrid informers)
pkg/router/controller/shared_secret_manager.go, pkg/router/controller/shared_secret_manager_test.go
New manager probes Secrets(list) per-namespace to choose namespace-wide vs per-secret informers, keys informers accordingly, supports idempotent RegisterRoute/UnregisterRoute, notifies matching routes on secret events, and falls back to API reads when informer cache not available.
RouteSecretManager status events and cache invalidation
pkg/router/controller/route_secret_manager.go, pkg/router/controller/route_secret_manager_test.go
Record ExternalCertificateSARCompleted after successful plugin handling+secret population; revalidate/populate synchronously on certain modifies; invalidate async SAR cache on secret Add/Update/Delete (handle tombstones); tests refactored to mutex-protected recorder and direct handler invocation.
Status equality and ignored reasons
pkg/router/controller/contention.go, pkg/router/controller/status.go, pkg/router/controller/status_test.go
Add ExternalCertificateSARCompleted to ignored ingress reasons; conditionsEqual requires Type/Status match before ignore-reason short-circuit; recordIngressCondition preserves existing ignored Reason/Message when incoming Reason empty; new test verifies behavior.
WriterLease multi-worker support and wiring
pkg/router/writerlease/writerlease.go, pkg/router/writerlease/writerlease_test.go, pkg/cmd/infra/router/template.go, pkg/router/router_test.go
Add workers parameter (min 1) to WriterLease constructors; Run launches multiple worker goroutines with a WaitGroup; follower requeue uses AddAfter; template/tests updated to pass workers and SAR concurrency parameter.
Test and build/client tweaks
go.mod, hack/Makefile.debug, pkg/cmd/infra/router/clientcmd.go, pkg/router/template/configmanager/haproxy/testing/haproxy.go
Add go.mod replace to bentito fork; export GOARCH and pass to image build; increase Kubernetes client QPS/Burst; adjust fake HAProxy test socket path and startup/error handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 errors, 2 warnings)

Check name Status Explanation Resolution
Container-Privileges ❌ Error deploy/router.yaml (new file) contains hostNetwork: true, a privileged pod networking setting that requires flagging per check instructions. Review if hostNetwork privilege is necessary or use network policies and port management alternatives instead.
No-Sensitive-Data-In-Logs ❌ Error PR logs untyped interface objects from Kubernetes Secret handlers in error paths, risking exposure of secret data. Remove obj/tombstone.Obj from error log statements; log only type/key info instead of full objects.
Docstring Coverage ⚠️ Warning Docstring coverage is 46.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Tests have indefinite busy-loop waits without timeouts in writerlease_test.go and bare t.Fatal() calls lacking diagnostic context in shared_secret_manager_test.go. Add timeout bounds to busy-loop waits (e.g., time.After) and include context in t.Fatal calls, e.g., t.Fatalf("failed to register route: %v", err).
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and concisely describes the main objective: making external certificate validation asynchronous, which aligns with the primary changes across the codebase (SAR caching, async validation, hybrid informer strategy).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Project uses standard Go testing.T, not Ginkgo. No Ginkgo tests found in repository (0/32 test files). Check not applicable; all test names are static and descriptive.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All test modifications are unit tests using standard Go testing framework, not Ginkgo syntax. The check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All changes are implementation files and unit tests (using standard testing.T pattern), not e2e tests requiring SNO compatibility checks.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only router controller, informer, and test code. No deployment manifests, pod specs, affinity constraints, or scheduling assumptions introduced.
Ote Binary Stdout Contract ✅ Passed PR adds no new stdout writes in process-level code; all fmt.Print* calls in router_test.go are pre-existing and not modified by this PR.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR does not add any Ginkgo e2e tests. All changes are to unit tests (using Go's testing package) and production code for the router package. No e2e test directories exist in the codebase.
No-Weak-Crypto ✅ Passed No weak crypto found. MD5/SHA1 only in certificate validation that rejects them. tls.X509KeyPair and bytes.Equal used correctly for certificate handling, not sensitive operations.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 288-298: The current Add handler unconditionally calls
RecordRouteRejection when a secret is replayed; change it to mirror the
UpdateFunc/SAR-completion logic: check the route's admission (via
populateRouteTLSFromSecret result or route.Status.Conditions) and if the route
is already admitted emit a normal update event instead of calling
RecordRouteRejection, otherwise record the rejection; extract the string
"ExternalCertificateSecretLoaded" into a shared constant (e.g.,
ExtCrtStatusReasonSecretLoaded) and add that constant to the
ignoreIngressConditionReason set in contention.go so the admission-ignoring
logic treats this replay reason as benign. Ensure you update references to
RecordRouteRejection, populateRouteTLSFromSecret, UpdateFunc,
ExternalCertificateSecretLoaded, and ignoreIngressConditionReason accordingly.

In `@pkg/router/routeapihelpers/validation.go`:
- Around line 503-556: The async SAR flow stores only a single pending error
(pendingErr) in asyncSARCache for a cacheKey, so subsequent calls (in validate
flow) that hit the cache drop their onComplete callback and never get
re-enqueued; change the cache value to include a list of waiting callbacks (or a
small struct with errs + []onComplete callbacks) keyed by asyncSARCache so that
when you detect a pending entry you append the current onComplete for
route.Namespace/secretName, and in the goroutine after
asyncSARCache.Store(cacheKey, errs) iterate and invoke all stored callbacks (not
just the first) and then replace the cache entry with errs only; update code
paths that read the cache (the cache-hit branch and the goroutine completion) to
handle this new struct and ensure onComplete is invoked for every waiting route.
- Around line 485-490: The global asyncSARCache currently stores completed
validation results forever; change it to avoid permanent sticky entries by (a)
replacing the raw sync.Map value with a small struct that includes the cached
field.ErrorList plus an expiration timestamp or a source type flag, (b) only
writing non-transient successful validation results (or results with a short
TTL) into asyncSARCache, and (c) adding explicit invalidation helpers (e.g.,
InvalidateAsyncSARCache(namespace, secretName) and
InvalidateAllAsyncSARCacheForSecret(namespace, secretName)) and call them from
the secret add/update/delete/recreate paths in the secret manager
(route_secret_manager.go) so secret changes force revalidation; also update
ClearAsyncSARCacheForTest to clear the new structure. Ensure references to
asyncSARCache and ClearAsyncSARCacheForTest (and the code paths mentioned around
the 503-556 region) are updated to use the new value type and invalidation APIs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e07f375c-945a-4e58-b5ec-3347eaff109c

📥 Commits

Reviewing files that changed from the base of the PR and between d8ed355 and 093ad83.

📒 Files selected for processing (5)
  • pkg/router/controller/contention.go
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/routeapihelpers/validation.go
  • pkg/router/routeapihelpers/validation_test.go

Comment thread pkg/router/controller/route_secret_manager.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
pkg/router/routeapihelpers/validation.go (1)

485-497: ⚠️ Potential issue | 🟠 Major

Completed SAR entries are still permanent.

Only secret add/update/delete clears this cache. A finished allow/deny result for namespace/secretName survives route deletion, later RBAC grants/revocations, and transient SAR/API failures, so a new route that reuses the same secret name can inherit a stale decision without performing a fresh check. Please put completed entries behind a TTL/generation and avoid caching transient internal errors indefinitely.

Also applies to: 593-605

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/routeapihelpers/validation.go` around lines 485 - 497, The cache
currently stores completed asyncSARResult entries forever; update asyncSARResult
(and uses of asyncSARCache) to include a timestamp or generation counter and
enforce a TTL/generation check before returning cached results so that completed
allow/deny decisions expire after a short interval or when generation changes.
Ensure only stable allow/deny outcomes are cached long-term; if
asyncSARResult.errs contains transient/internal errors, set a much shorter TTL
(or do not mark done) so they trigger fresh SARs. Update the cache read path to
validate timestamp/generation and the write path to record it, and keep
InvalidateAsyncSARCache(namespace, secretName) as-is to support manual
invalidation.
pkg/router/controller/route_secret_manager.go (1)

262-304: ⚠️ Potential issue | 🟠 Major

Don't clear the fresh SAR result on the initial SecretLoaded replay.

RegisterRoute() is only reached after validate() has already completed successfully, so this non-deleted AddFunc path is replaying an already-validated secret. Clearing asyncSARCache here throws away that fresh result, defeats the per-secret cache for shared secrets, and can bounce the route back into a second "authorization check pending" / ExternalCertificateValidationFailed cycle on the SecretLoaded requeue. Keep invalidation on actual secret changes (recreated/updated/deleted), but not on the initial load replay.

Suggested shape
 		AddFunc: func(obj interface{}) {
 			secret := obj.(*kapi.Secret)
 			log.V(4).Info("Secret added for route", "namespace", namespace, "secret", secret.Name, "route", routeName)
-			routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name)
 
 			// Secret re-creation scenario
 			// Check if the route key exists in the deletedSecrets map, indicating that the secret was previously deleted for this route.
 			// If it exists, it means the secret is being recreated. Remove the key from the map and proceed with handling the route.
@@
 			key := generateKey(namespace, routeName)
 			if _, deleted := p.deletedSecrets.LoadAndDelete(key); deleted {
+				routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name)
 				log.V(4).Info("Secret recreated for route", "namespace", namespace, "secret", secret.Name, "route", routeName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/controller/route_secret_manager.go` around lines 262 - 304, The
AddFunc for secret events currently calls
routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh
per-secret async SAR results during the initial cache-sync replay; to fix,
remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from
the top of the AddFunc in RegisterRoute's secret handler and only invoke
routeapihelpers.InvalidateAsyncSARCache when the secret is actually
changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete
"recreated" branch and in the Update/Delete handlers), using
generateKey(namespace, routeName) and p.deletedSecrets to determine real
recreation so you don't invalidate the SAR cache on the initial load.
🧹 Nitpick comments (1)
pkg/router/controller/route_secret_manager_test.go (1)

1355-1372: Put a timeout around these async waits.

These bare channel receives will hang the whole suite on any missed informer callback. Now that the tests intentionally depend on async completion, a tiny helper with select + time.After will turn those deadlocks into normal test failures.

Suggested helper
+func waitForRecorderEvent(t *testing.T, ch <-chan struct{}) {
+	t.Helper()
+	select {
+	case <-ch:
+	case <-time.After(2 * time.Second):
+		t.Fatal("timeout waiting for recorder event")
+	}
+}

Then replace each bare receive with:

-<-recorder.doneCh
+waitForRecorderEvent(t, recorder.doneCh)

Also applies to: 1434-1444, 1496-1516

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/controller/route_secret_manager_test.go` around lines 1355 - 1372,
Replace bare channel receives on recorder.doneCh with a timed wait helper to
avoid test hangs: add a small helper (e.g., waitForDone(ctxTimeout, ch) or
waitWithTimeout(t, ch, duration)) that uses select with the channel and
time.After to fail the test on timeout, then update every occurrence of
"<-recorder.doneCh" in route_secret_manager_test.go (including the occurrences
around the Secret Update flow that use kubeClient.CoreV1().Secrets(...).Update
and the initial Add wait) to call that helper instead so missed informer
callbacks produce a deterministic test failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 262-304: The AddFunc for secret events currently calls
routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh
per-secret async SAR results during the initial cache-sync replay; to fix,
remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from
the top of the AddFunc in RegisterRoute's secret handler and only invoke
routeapihelpers.InvalidateAsyncSARCache when the secret is actually
changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete
"recreated" branch and in the Update/Delete handlers), using
generateKey(namespace, routeName) and p.deletedSecrets to determine real
recreation so you don't invalidate the SAR cache on the initial load.

In `@pkg/router/routeapihelpers/validation.go`:
- Around line 485-497: The cache currently stores completed asyncSARResult
entries forever; update asyncSARResult (and uses of asyncSARCache) to include a
timestamp or generation counter and enforce a TTL/generation check before
returning cached results so that completed allow/deny decisions expire after a
short interval or when generation changes. Ensure only stable allow/deny
outcomes are cached long-term; if asyncSARResult.errs contains
transient/internal errors, set a much shorter TTL (or do not mark done) so they
trigger fresh SARs. Update the cache read path to validate timestamp/generation
and the write path to record it, and keep InvalidateAsyncSARCache(namespace,
secretName) as-is to support manual invalidation.

---

Nitpick comments:
In `@pkg/router/controller/route_secret_manager_test.go`:
- Around line 1355-1372: Replace bare channel receives on recorder.doneCh with a
timed wait helper to avoid test hangs: add a small helper (e.g.,
waitForDone(ctxTimeout, ch) or waitWithTimeout(t, ch, duration)) that uses
select with the channel and time.After to fail the test on timeout, then update
every occurrence of "<-recorder.doneCh" in route_secret_manager_test.go
(including the occurrences around the Secret Update flow that use
kubeClient.CoreV1().Secrets(...).Update and the initial Add wait) to call that
helper instead so missed informer callbacks produce a deterministic test
failure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c800f527-b5e1-4386-b6cf-e20eb4726a31

📥 Commits

Reviewing files that changed from the base of the PR and between 093ad83 and 8c6f16a.

📒 Files selected for processing (4)
  • pkg/router/controller/contention.go
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/routeapihelpers/validation.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/router/controller/contention.go

@lihongan
Copy link
Copy Markdown

/retest-required

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 561-568: The semaphore (asyncSARSemaphore) is being acquired on
the caller path before launching the goroutine, causing the caller to block when
the limit is reached; move the acquire into the goroutine so the caller can
spawn all per-secret goroutines without blocking, then release the token inside
that goroutine (use defer to ensure release). Concretely: start the goroutine
immediately, perform asyncSARSemaphore <- struct{}{} at the top of the goroutine
body, and keep the existing defer func() { <-asyncSARSemaphore }() to guarantee
release; ensure any early returns in the goroutine still release the token.
- Around line 532-535: The current guard that returns nil when sarc or
secretsGetter is nil hides wiring bugs and skips RBAC/secret validation;
instead, change the conditional in validation.go so that if sarc == nil ||
secretsGetter == nil you return an internal error (e.g., fmt.Errorf or the
package's internal/server error type) describing "missing validation clients"
rather than nil, or alternatively gate this behind a test-only seam/flag so
production never silently succeeds; update callers to propagate/handle the
returned error as needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 31462f9b-48bb-466f-8d90-5c399b5fab8b

📥 Commits

Reviewing files that changed from the base of the PR and between 8c6f16a and 871f184.

📒 Files selected for processing (1)
  • pkg/router/routeapihelpers/validation.go

Comment thread pkg/router/routeapihelpers/validation.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 526-529: The code currently returns a synthetic pending error on
cache miss without registering a callback if onComplete is nil; change the logic
in the validation path (the block that manipulates cached.callbacks and returns
cached.errs) so that onComplete must never be nil: either (A) make onComplete a
required parameter at the API boundary, or (B) when onComplete == nil execute
the async validation synchronously (drive the same validation routine on the
current goroutine, wait for completion, populate cached.errs, and then return
the real errors) instead of just returning the pending placeholder; if you
choose (B) ensure you reuse the same validation function that enqueues callbacks
(so cached state is updated) and that cached.callbacks is appended only when
onComplete is non-nil or after synchronous completion you return the finalized
errors.
- Around line 572-599: Wrap each SAR and secret GET with a per-request timeout
context (use context.WithTimeout and defer cancel) instead of context.TODO();
for the SAR checks stop using authorizationutil.Authorize and call the SAR
client Create() directly (e.g., sarClient.Create) with that timeout context so
you can inspect both (response, err) separately, treat API/transport errors as
retryable/internal (append field.InternalError or a distinct cached error) and
only append field.Forbidden when the SAR response explicitly denies, and call
secretsGetter.Secrets(route.Namespace).Get with the same bounded context and map
Get errors to NotFound vs InternalError accordingly. Ensure you still reference
routerServiceAccount, secretName, fldPath and preserve existing
field.Forbidden/field.InternalError/field.NotFound usages when classifying the
outcome.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbebc97e-5cff-40e0-80d7-bff7bf16a68c

📥 Commits

Reviewing files that changed from the base of the PR and between 871f184 and c2f6381.

📒 Files selected for processing (1)
  • pkg/router/routeapihelpers/validation.go

Comment thread pkg/router/routeapihelpers/validation.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/router/routeapihelpers/validation.go (1)

637-639: Consider recovering from callback panics to ensure all callbacks are invoked.

If a callback panics, the remaining callbacks in the slice won't be invoked. This is unlikely but could leave some routes stuck in a pending state if one route's callback handler has a bug.

🛡️ Proposed defensive callback invocation
 		for _, cb := range callbacks {
-			cb(route.Namespace, secretName)
+			func() {
+				defer func() {
+					if r := recover(); r != nil {
+						// Log panic but continue invoking remaining callbacks
+					}
+				}()
+				cb(route.Namespace, secretName)
+			}()
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/routeapihelpers/validation.go` around lines 637 - 639, Wrap each
callback invocation in a deferred recover to prevent a single panic from
aborting the loop: inside the loop over callbacks (callbacks, cb) call each cb
via a small closure that uses defer + recover to catch any panic, log or report
the panic along with the callback context (e.g., route.Namespace and secretName)
and continue to the next cb; ensure the recovered panic does not re-panic so all
callbacks are invoked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 637-639: Wrap each callback invocation in a deferred recover to
prevent a single panic from aborting the loop: inside the loop over callbacks
(callbacks, cb) call each cb via a small closure that uses defer + recover to
catch any panic, log or report the panic along with the callback context (e.g.,
route.Namespace and secretName) and continue to the next cb; ensure the
recovered panic does not re-panic so all callbacks are invoked.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9c78ea5-dba5-483e-8d75-13dc6f9905d7

📥 Commits

Reviewing files that changed from the base of the PR and between c2f6381 and dc6b383.

📒 Files selected for processing (1)
  • pkg/router/routeapihelpers/validation.go

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2026
@bentito bentito force-pushed the OCPBUGS-77056-async-sar branch from 716e042 to 0a99ee0 Compare March 17, 2026 20:15
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2026
Signed-off-by: Brett Tofel <btofel@redhat.com>
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 19, 2026

/retest

@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 19, 2026

/retest

@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 19, 2026

/retest

1 similar comment
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 20, 2026

/retest

@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 21, 2026

/retest

…mparison bug

The synchronous logic refactoring missed explicit status updates for
ExternalCertificateSARCompleted, which are required by E2E tests.

Additionally, conditionsEqual in contention.go had a bug where it would
incorrectly ignore status changes (e.g., from Rejected to Admitted) if
an ignored reason was present. This prevented routes from becoming
active after successful validation.

- Corrected conditionsEqual to only ignore reasons if Type and Status match.
- Restored RecordRouteUpdate calls with SARCompleted reason.
- Added SARCompleted to ignored reasons in contention tracker.
- Updated unit tests to verify status updates.
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 27, 2026

/retest

1 similar comment
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 28, 2026

/retest

bentito added 2 commits May 28, 2026 09:17
… comparison

The StatusAdmitter was inadvertently stripping meaningful reasons
(like SARCompleted) from route status during standard reconciliation.
This occurred because recordIngressCondition used literal struct
comparison (!=), bypassing the 'ignored reasons' logic.

This resulted in a status flapping loop where RouteSecretManager
would set a reason, and HandleRoute would immediately clear it,
causing high API churn and Server-Side Apply conflicts in CI.

- Updated recordIngressCondition to use conditionsEqual helper.
- Added pre-checks to RecordRouteUpdate/Rejection to skip redundant updates.
- Added a unit test to verify that ignored reasons are preserved.
…essions

The StatusAdmitter was inadvertently stripping meaningful reasons
(like SARCompleted) from route status during standard reconciliation.
This occurred because recordIngressCondition used literal struct
comparison (!=), bypassing the 'ignored reasons' logic.

Additionally, the previous optimization attempt was too aggressive,
as it only checked conditions and ignored changes to other ingress
fields like Host or WildcardPolicy, which broke standard tests.

- Updated recordIngressCondition to use conditionsEqual helper.
- Removed redundant top-level checks in RecordRouteUpdate/Rejection
  to ensure all ingress fields are correctly synced.
- Added a unit test to verify that ignored reasons are preserved.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/router/routeapihelpers/validation.go (1)

236-257: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject public/certificate PEM blocks in spec.tls.key before tls.X509KeyPair
ExtendedValidateRoute sanitizes tlsConfig.Key with sanitizePEM (which preserves PUBLIC KEY / CERTIFICATE blocks) and then appends the entire sanitized tlsConfig.Key into keyBytes passed to tls.X509KeyPair (when tlsConfig.Certificate is set). This allows tls.key that contains a private key plus certificate/public PEM blocks (see test case "A key field containing Public Key/certificate attributes should be rejected" with expectedErrors: 0).
Strip/reject PUBLIC KEY/CERTIFICATE blocks from tlsConfig.Key (only allow private-key PEM) before calling tls.X509KeyPair.

Suggested fix
 	if len(tlsConfig.Certificate) > 0 {
 		if certBytes, keyBytes, err := splitCertKey([]byte(tlsConfig.Certificate)); err != nil {
 			result = append(result, field.Invalid(tlsFieldPath.Child("certificate"), "redacted key data", err.Error()))
 		} else {
-			// Use any private key that was found in either
-			// tlsConfig.Certificate or tlsConfig.Key.
-			keyBytes = append(keyBytes, []byte(tlsConfig.Key)...)
+			// Use any private key that was found in either
+			// tlsConfig.Certificate or tlsConfig.Key, but reject
+			// public/certificate PEM blocks in the key field.
+			keyPublicBytes, keyOnlyBytes, err := splitCertKey([]byte(tlsConfig.Key))
+			if err != nil {
+				result = append(result, field.Invalid(tlsFieldPath.Child("key"), "redacted key data", err.Error()))
+				return result
+			}
+			if len(keyPublicBytes) > 0 {
+				result = append(result, field.Invalid(tlsFieldPath.Child("key"), "redacted key data", "key must not contain certificate/public PEM blocks"))
+				return result
+			}
+			keyBytes = append(keyBytes, keyOnlyBytes...)
 			if len(keyBytes) == 0 {
 				result = append(result, field.Invalid(tlsFieldPath.Child("key"), "", "no key specified"))
 			} else {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/routeapihelpers/validation.go` around lines 236 - 257,
ExtendedValidateRoute currently allows tlsConfig.Key that contains PUBLIC KEY or
CERTIFICATE PEM blocks because sanitizePEM preserves them and they get appended
to keyBytes before tls.X509KeyPair; modify the validation so that after
sanitizing/parsing tlsConfig.Key you scan its PEM blocks and either strip
non-private-key blocks or (preferably) reject the key if any PEM block has Type
"PUBLIC KEY" or "CERTIFICATE". Concretely, in the branch handling
tlsConfig.Certificate (the code that calls splitCertKey and then appends
tlsConfig.Key into keyBytes before tls.X509KeyPair), parse tlsConfig.Key into
PEM blocks, and if any block.Type equals "PUBLIC KEY" or "CERTIFICATE" append a
field.Invalid on tlsFieldPath.Child("key") (similar to other errors) and do not
call tls.X509KeyPair; otherwise proceed with only private-key PEM blocks (or the
sanitized key) as now. Use the existing symbols tlsConfig.Key, tlsFieldPath,
sanitizePEM, splitCertKey, tls.X509KeyPair and ExtendedValidateRoute to locate
and implement the change.
🧹 Nitpick comments (2)
pkg/router/template/configmanager/haproxy/testing/haproxy.go (1)

49-49: ⚡ Quick win

Consider using os.TempDir() instead of hardcoded /tmp.

The change from os.TempDir() to "/tmp" reduces portability. While /tmp is available on Unix-like systems, os.TempDir() is more portable and respects system-specific temp directory configuration (e.g., TMPDIR environment variable).

♻️ Revert to os.TempDir() for better portability
-	f, err := os.CreateTemp("/tmp", prefix)
+	f, err := os.CreateTemp(os.TempDir(), prefix)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/configmanager/haproxy/testing/haproxy.go` at line 49,
Replace the hard-coded "/tmp" argument passed to os.CreateTemp with os.TempDir()
to restore portability; locate the os.CreateTemp call (the line creating f, err
:= os.CreateTemp("/tmp", prefix) in haproxy.go) and change the directory
argument to os.TempDir() so the system-configured temp directory (and TMPDIR) is
respected.
pkg/cmd/infra/router/clientcmd.go (1)

64-68: ⚡ Quick win

Verify rate limit values are appropriate for production workloads.

The client-side rate limits have been increased from defaults (typically QPS=5, Burst=10) to QPS=50 and Burst=100 to support concurrent SAR validation during router startup. While these values seem reasonable for the use case, please confirm:

  1. These limits are appropriate for typical OpenShift cluster API server capacity
  2. The 10x increase in QPS won't contribute to API server overload during router startup storms (e.g., multiple routers restarting simultaneously)
  3. These values have been tested under realistic load conditions with many external certificate routes

Consider testing or confirming via documentation/SRE input that these rate limits align with API server capacity planning.

Based on learnings, the context snippet from pkg/cmd/infra/router/template.go:645 shows this KubeConfig() result is used to create the SubjectAccessReviewClient for external certificate SAR validation, so the increased rate limits directly affect the SAR request throughput path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cmd/infra/router/clientcmd.go` around lines 64 - 68, The hardcoded
increase of clientConfig.QPS to 50 and clientConfig.Burst to 100 may be too
aggressive for production; make these values configurable and document/validate
them: replace the fixed assignments around clientConfig.QPS and
clientConfig.Burst with configurable parameters (e.g., flags or env vars)
surfaced where KubeConfig() is used to build the SubjectAccessReviewClient, add
reasonable defaults and validation (min/max), and add a short comment
referencing SAR validation path so SRE can test; also update tests or add a
unit/integration test that exercises creating the SubjectAccessReviewClient with
custom QPS/Burst values and include guidance to verify against API server
capacity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 119-120: The SARCompleted status is being emitted too early by
calling p.recorder.RecordRouteUpdate with ExtCrtStatusReasonSARCompleted (e.g.,
the calls that use route.Spec.TLS.ExternalCertificate.Name), which sets
RouteAdmitted=True before the downstream TLS validation (ExtendedValidator)
runs; move these RecordRouteUpdate calls so they execute only after the
downstream plugin chain (ExtendedValidator) has successfully validated the
tls.crt/tls.key, or alternatively check the current route admission and only
emit SARCompleted when the route is already admitted; update all similar sites
(the calls around the shown lines and the other occurrences you noted) to use
this gating or relocation so SARCompleted is only recorded post-validation.

In `@pkg/router/controller/status.go`:
- Around line 145-155: The status-update paths (the
isIngressConditionUpdateRequired check and related callers like the one invoking
performIngressConditionUpdate with routev1.RouteIngressCondition) currently rely
on the contention-oriented comparator that ignores the Reason field; change
these checks so they use a strict equality comparator that only ignores
LastTransitionTime (i.e., compare Type, Status, Reason, Message exactly) when
deciding to skip status writes, while leaving the existing
ignored-reason/conditionsEqual logic intact for contention detection only;
update the calls to isIngressConditionUpdateRequired (and the equivalent checks
at the other mentioned locations) to use the new strict comparator so
transitions like ExternalCertificateSecretUpdated →
ExternalCertificateSARCompleted are persisted.

In `@pkg/router/writerlease/writerlease.go`:
- Around line 156-174: Run() defers l.queue.ShutDown(), which delays shutting
the queue until after wg.Wait(), causing workers blocked in l.queue.Get() to
deadlock; change this by removing the defer and explicitly calling
l.queue.ShutDown() immediately after <-stopCh (before wg.Wait()) so workers
receive shutdown=true and can exit, then call wg.Wait() to join them; update the
Run function (the goroutine loop and shutdown ordering) to reflect this explicit
shutdown ordering.

---

Outside diff comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 236-257: ExtendedValidateRoute currently allows tlsConfig.Key that
contains PUBLIC KEY or CERTIFICATE PEM blocks because sanitizePEM preserves them
and they get appended to keyBytes before tls.X509KeyPair; modify the validation
so that after sanitizing/parsing tlsConfig.Key you scan its PEM blocks and
either strip non-private-key blocks or (preferably) reject the key if any PEM
block has Type "PUBLIC KEY" or "CERTIFICATE". Concretely, in the branch handling
tlsConfig.Certificate (the code that calls splitCertKey and then appends
tlsConfig.Key into keyBytes before tls.X509KeyPair), parse tlsConfig.Key into
PEM blocks, and if any block.Type equals "PUBLIC KEY" or "CERTIFICATE" append a
field.Invalid on tlsFieldPath.Child("key") (similar to other errors) and do not
call tls.X509KeyPair; otherwise proceed with only private-key PEM blocks (or the
sanitized key) as now. Use the existing symbols tlsConfig.Key, tlsFieldPath,
sanitizePEM, splitCertKey, tls.X509KeyPair and ExtendedValidateRoute to locate
and implement the change.

---

Nitpick comments:
In `@pkg/cmd/infra/router/clientcmd.go`:
- Around line 64-68: The hardcoded increase of clientConfig.QPS to 50 and
clientConfig.Burst to 100 may be too aggressive for production; make these
values configurable and document/validate them: replace the fixed assignments
around clientConfig.QPS and clientConfig.Burst with configurable parameters
(e.g., flags or env vars) surfaced where KubeConfig() is used to build the
SubjectAccessReviewClient, add reasonable defaults and validation (min/max), and
add a short comment referencing SAR validation path so SRE can test; also update
tests or add a unit/integration test that exercises creating the
SubjectAccessReviewClient with custom QPS/Burst values and include guidance to
verify against API server capacity.

In `@pkg/router/template/configmanager/haproxy/testing/haproxy.go`:
- Line 49: Replace the hard-coded "/tmp" argument passed to os.CreateTemp with
os.TempDir() to restore portability; locate the os.CreateTemp call (the line
creating f, err := os.CreateTemp("/tmp", prefix) in haproxy.go) and change the
directory argument to os.TempDir() so the system-configured temp directory (and
TMPDIR) is respected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e0491b73-90e9-45d9-a4cd-53ee9310aca8

📥 Commits

Reviewing files that changed from the base of the PR and between bb228d6 and c9138ea.

⛔ Files ignored due to path filters (6)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/openshift/library-go/pkg/authorization/authorizationutil/subject.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/authorization/authorizationutil/util.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/route/secretmanager/manager.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/secret/secret_monitor.go is excluded by !**/vendor/**, !vendor/**
  • vendor/modules.txt is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (16)
  • go.mod
  • hack/Makefile.debug
  • pkg/cmd/infra/router/clientcmd.go
  • pkg/cmd/infra/router/template.go
  • pkg/router/controller/contention.go
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/controller/shared_secret_manager.go
  • pkg/router/controller/status.go
  • pkg/router/controller/status_test.go
  • pkg/router/routeapihelpers/validation.go
  • pkg/router/routeapihelpers/validation_test.go
  • pkg/router/router_test.go
  • pkg/router/template/configmanager/haproxy/testing/haproxy.go
  • pkg/router/writerlease/writerlease.go
  • pkg/router/writerlease/writerlease_test.go

Comment thread pkg/router/controller/route_secret_manager.go Outdated
Comment thread pkg/router/controller/status.go Outdated
Comment thread pkg/router/writerlease/writerlease.go
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 28, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-77056, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Summary by CodeRabbit

  • New Features

  • Asynchronous caching for external certificate validation

  • Configurable parallelism for route status updates

  • Improvements

  • Increased Kubernetes client request throughput

  • Architecture-aware build support for router binaries

  • More robust route status tracking and comparison

  • Stronger secret event handling and synchronization

  • Bug Fixes

  • Ensure route status reflects completed access checks for external certificates

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/router/controller/status.go (1)

145-150: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Status write path can still drop SAR reason transitions.

These paths now always call performIngressConditionUpdate, but Line 342 still uses conditionsEqual (contention comparator) to decide whether to mutate status. Because that comparator ignores reasons in ignoreIngressConditionReason, transitions like ExternalCertificateSecretUpdatedExternalCertificateSARCompleted can still be treated as unchanged and not persisted.

💡 Minimal fix
+func conditionsEqualForStatusWrite(a, b *routev1.RouteIngressCondition) bool {
+	return a.Type == b.Type &&
+		a.Status == b.Status &&
+		a.Reason == b.Reason &&
+		a.Message == b.Message
+}
+
 func recordIngressCondition(route *routev1.Route, name, hostName string, condition routev1.RouteIngressCondition) (changed, created bool, at time.Time, latest, original *routev1.RouteIngress) {
@@
 		if existingCondition != nil {
 			condition.LastTransitionTime = existingCondition.LastTransitionTime
-			if !conditionsEqual(existingCondition, &condition) {
+			if !conditionsEqualForStatusWrite(existingCondition, &condition) {
 				changed = true
 			}
 		} else {

Also applies to: 157-162

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/status.go` around lines 145 - 150, The status write
path is still skipping updates because conditionsEqual ignores condition Reason
via ignoreIngressConditionReason; update the comparator so Reason changes that
matter (e.g., transitions between ExternalCertificateSecretUpdated and
ExternalCertificateSARCompleted) cause a mutation. Concretely, modify
conditionsEqual (or add a small wrapper used before calling
performIngressConditionUpdate) to return false when the existing
Condition.Reason != new Condition.Reason for RouteAdmitted SAR-related reasons
(such as "ExternalCertificateSecretUpdated" and
"ExternalCertificateSARCompleted"), so performIngressConditionUpdate is invoked
and the new reason is persisted; keep ignoreIngressConditionReason behavior for
other benign cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pkg/router/controller/status.go`:
- Around line 145-150: The status write path is still skipping updates because
conditionsEqual ignores condition Reason via ignoreIngressConditionReason;
update the comparator so Reason changes that matter (e.g., transitions between
ExternalCertificateSecretUpdated and ExternalCertificateSARCompleted) cause a
mutation. Concretely, modify conditionsEqual (or add a small wrapper used before
calling performIngressConditionUpdate) to return false when the existing
Condition.Reason != new Condition.Reason for RouteAdmitted SAR-related reasons
(such as "ExternalCertificateSecretUpdated" and
"ExternalCertificateSARCompleted"), so performIngressConditionUpdate is invoked
and the new reason is persisted; keep ignoreIngressConditionReason behavior for
other benign cases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 647baa90-9f64-40e5-8c98-c7599fdfaa7a

📥 Commits

Reviewing files that changed from the base of the PR and between c9138ea and 450b84b.

📒 Files selected for processing (1)
  • pkg/router/controller/status.go

bentito added 3 commits May 28, 2026 13:49
The StatusAdmitter was holding a non-reentrant sync.Mutex across the
entire plugin chain execution (HandleRoute). When a downstream plugin
(like RouteSecretManager) needed to update the route status synchronously
via RecordRouteUpdate, it attempted to reacquire the same lock,
causing a deadlock that halted the event loop.

This deadlock was masked in previous versions because secret validation
occurred asynchronously. With the transition to synchronous validation,
the deadlock became deterministic on cache sync, leading to pervasive
E2E timeouts.

The StatusAdmitter does not mutate any fields after initialization
and delegates state updates to the thread-safe writerlease queue,
rendering the lock unnecessary.

- Removed the sync.Mutex from StatusAdmitter.
- Stripped lock acquisition from HandleRoute and status update methods.
…hronous

The SharedSecretManager previously used a blocking WaitForCacheSync
during RegisterRoute. Because RegisterRoute was called synchronously
from the router's main event loop (HandleRoute), the entire router
could hang indefinitely if the API server was slow or if RBAC
permissions prevented the informer from ever syncing. This caused
timeouts in core router E2E tests (like 'HAProxy should run even
if it has no routes').

This commit removes the blocking wait from RegisterRoute, allowing
the main event loop to proceed immediately. It also updates GetSecret
to use a fast-path cache check and fall back to a direct API call
if the cache is not yet synced or the secret is missing. This maintains
the performance benefits of the cache while guaranteeing that the
router will never block on informer initialization.
- route_secret_manager: Gate SARCompleted status emission on route admission
  to prevent marking structurally invalid routes as active before
  ExtendedValidator runs.
- status: Use strict equality for explicit status writes while protecting
  meaningful ignored reasons from empty overwrites. This ensures valid
  transitions like SecretUpdated -> SARCompleted are persisted without flapping.
- writerlease: Fix shutdown deadlock by explicitly calling queue.ShutDown()
  before waiting on the worker WaitGroup.
- Makefile.debug: Pass GOARCH to image builder for cross-platform support.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/router/controller/shared_secret_manager.go (1)

98-103: Per-namespace informers are never stopped; the passed ctx is ignored.

The informer goroutine is launched with context.Background().Done() and is never torn down. UnregisterRoute only removes the registry entry, so the goroutine and its namespace-wide secrets watch persist for the router's lifetime even after the last route in a namespace is unregistered. This accumulates one live LIST/WATCH over all secrets per namespace ever seen — relevant given this PR targets large-scale behavior and clusters with namespace churn. Note also that the ctx argument to RegisterRoute is unused here; a stored per-namespace stop function (or deriving from ctx) would let UnregisterRoute reclaim empty-namespace informers.

Since informers cache every secret in a watched namespace, consider tracking a per-namespace route count and a stopCh/cancel func so UnregisterRoute can stop the informer once the namespace has no remaining registrations. This bounds memory and watch connections over time.

Want me to draft the per-namespace refcount + stop-channel cleanup in UnregisterRoute, or open a tracking issue for the deferred "unregister empty namespaces" optimization noted in the comment?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/shared_secret_manager.go` around lines 98 - 103, The
informer goroutine currently uses context.Background().Done() and is never
stopped; modify RegisterRoute to use the provided ctx (or create a per-namespace
context with cancel) and store a stop/cancel function and a refcount alongside
the informer in m.informers (e.g., replace the raw inf entry with a small struct
containing inf, refCount, cancelFunc/stopCh). Increment the refcount when
RegisterRoute is called for an existing namespace and start the informer only
when first created using ctx.Done(); in UnregisterRoute, decrement the namespace
refcount and when it reaches zero call the stored cancelFunc/close stopCh and
remove the informer entry so the informer goroutine and LIST/WATCH are stopped
and resources reclaimed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/router/controller/shared_secret_manager.go`:
- Around line 182-196: SharedSecretManager.GetSecret returns informer-store
object directly on cache hits; change the cache-hit path in GetSecret to return
a caller-owned copy (use secret.DeepCopy()) to avoid mutability of shared cache
objects, and keep the existing API-call fallback when the informer isn’t synced
or the object is missing; also update the comment in populateRouteTLSFromSecret
to no longer claim RegisterRoute blocks until cache sync and to note that
GetSecret will fallback to a direct API call when the cache isn't synced or the
secret is absent; finally, review RegisterRoute/UnregisterRoute to either honor
the input ctx for informer lifetime or implement informer
shutdown/Garbage-collection when no routes remain (ensure UnregisterRoute stops
the informer for that namespace or uses ctx.Done()).

---

Nitpick comments:
In `@pkg/router/controller/shared_secret_manager.go`:
- Around line 98-103: The informer goroutine currently uses
context.Background().Done() and is never stopped; modify RegisterRoute to use
the provided ctx (or create a per-namespace context with cancel) and store a
stop/cancel function and a refcount alongside the informer in m.informers (e.g.,
replace the raw inf entry with a small struct containing inf, refCount,
cancelFunc/stopCh). Increment the refcount when RegisterRoute is called for an
existing namespace and start the informer only when first created using
ctx.Done(); in UnregisterRoute, decrement the namespace refcount and when it
reaches zero call the stored cancelFunc/close stopCh and remove the informer
entry so the informer goroutine and LIST/WATCH are stopped and resources
reclaimed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f848a726-608b-4bc3-8193-4c9f52a6e2e7

📥 Commits

Reviewing files that changed from the base of the PR and between 450b84b and c68a51c.

📒 Files selected for processing (2)
  • pkg/router/controller/shared_secret_manager.go
  • pkg/router/controller/status.go
💤 Files with no reviewable changes (1)
  • pkg/router/controller/status.go

Comment on lines +182 to +196
// Try to get from cache first
if inf.HasSynced() {
obj, exists, err := inf.GetStore().GetByKey(namespace + "/" + ref.secretName)
if err == nil && exists {
return obj.(*corev1.Secret), nil
}
}

// Fallback to direct API call if cache is not synced or object is missing
secret, err := m.kubeClient.CoreV1().Secrets(namespace).Get(ctx, ref.secretName, metav1.GetOptions{})
if err != nil {
return nil, err
}
return secret, nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find GetSecret consumers and inspect how the returned secret is used.
rg -nP --type=go -C5 '\.GetSecret\s*\(' 
# Locate the stale "blocks until cache sync" comment.
rg -nP --type=go -C2 'blocks until cache sync'

Repository: openshift/router

Length of output: 3066


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show full body usage around the known call site
sed -n '340,470p' pkg/router/controller/route_secret_manager.go | nl -ba | sed -n '1,140p'

# Find all GetSecret call sites (method call form) in the non-vendor code
rg -n --type go '\.GetSecret\s*\(' pkg/router | sed -n '1,200p'

# Find any places that mutate the returned secret variable in populateRouteTLSFromSecret
# (heuristic: after 'secret, err :=' look for 'secret.' writes)
rg -n --type go 'populateRouteTLSFromSecret|secret\.' pkg/router/controller/route_secret_manager.go

# Find any direct writes to secret.Data in the repo
rg -n --type go 'secret\.Data\s*\[' pkg/router | sed -n '1,200p'

Repository: openshift/router

Length of output: 103


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="pkg/router/controller/route_secret_manager.go"

echo "== populateRouteTLSFromSecret context =="
# Print the whole function region conservatively
sed -n '350,460p' "$FILE" | cat -n

echo "== All GetSecret call sites under pkg/router (non-vendor) =="
rg -n --type go '\.GetSecret\s*\(' pkg/router | head -n 200

echo "== Mutations in populateRouteTLSFromSecret (secret.* writes) =="
rg -n --type go 'populateRouteTLSFromSecret|secret\.' "$FILE"

echo "== Any secret.Data[...] writes under pkg/router (non-vendor) =="
rg -n --type go 'secret\.Data\s*\[' pkg/router | head -n 200

Repository: openshift/router

Length of output: 7132


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Global GetSecret call sites (non-vendor) =="
rg -n --type go '\.GetSecret\s*\(' . --glob '!vendor/**' | head -n 200

echo "== Any secret.Data[...] writes (non-vendor) =="
rg -n --type go 'secret\.Data\s*\[' . --glob '!vendor/**' | head -n 200

echo "== Inspect shared_secret_manager.go RegisterRoute and informer lifecycle =="
sed -n '1,260p' pkg/router/controller/shared_secret_manager.go | cat -n

Repository: openshift/router

Length of output: 7757


Harden GetSecret cache-hit mutability and update the stale “cache sync” comment

  • SharedSecretManager.GetSecret returns the shared informer-store object on the cache-hit path; return DeepCopy() there to keep a consistent “caller-owned” mutability contract (current populateRouteTLSFromSecret only reads secret.Data["tls.crt"]/["tls.key"], so this is defensive now).

    ♻️ Return a copy from the cache hit
    	// Try to get from cache first
    	if inf.HasSynced() {
    		obj, exists, err := inf.GetStore().GetByKey(namespace + "/" + ref.secretName)
    		if err == nil && exists {
    -			return obj.(*corev1.Secret), nil
    +			return obj.(*corev1.Secret).DeepCopy(), nil
    		}
    	}
  • Update the stale comment in populateRouteTLSFromSecret: it says RegisterRoute blocks until cache sync, but GetSecret can fall back to a direct API call when cache isn’t synced or the secret is missing.

  • RegisterRoute runs the informer with context.Background().Done() and UnregisterRoute doesn’t stop/GC the informer when a namespace becomes empty; consider honoring ctx or implementing informer shutdown when no routes remain.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Try to get from cache first
if inf.HasSynced() {
obj, exists, err := inf.GetStore().GetByKey(namespace + "/" + ref.secretName)
if err == nil && exists {
return obj.(*corev1.Secret), nil
}
}
// Fallback to direct API call if cache is not synced or object is missing
secret, err := m.kubeClient.CoreV1().Secrets(namespace).Get(ctx, ref.secretName, metav1.GetOptions{})
if err != nil {
return nil, err
}
return secret, nil
}
// Try to get from cache first
if inf.HasSynced() {
obj, exists, err := inf.GetStore().GetByKey(namespace + "/" + ref.secretName)
if err == nil && exists {
return obj.(*corev1.Secret).DeepCopy(), nil
}
}
// Fallback to direct API call if cache is not synced or object is missing
secret, err := m.kubeClient.CoreV1().Secrets(namespace).Get(ctx, ref.secretName, metav1.GetOptions{})
if err != nil {
return nil, err
}
return secret, nil
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/shared_secret_manager.go` around lines 182 - 196,
SharedSecretManager.GetSecret returns informer-store object directly on cache
hits; change the cache-hit path in GetSecret to return a caller-owned copy (use
secret.DeepCopy()) to avoid mutability of shared cache objects, and keep the
existing API-call fallback when the informer isn’t synced or the object is
missing; also update the comment in populateRouteTLSFromSecret to no longer
claim RegisterRoute blocks until cache sync and to note that GetSecret will
fallback to a direct API call when the cache isn't synced or the secret is
absent; finally, review RegisterRoute/UnregisterRoute to either honor the input
ctx for informer lifetime or implement informer shutdown/Garbage-collection when
no routes remain (ensure UnregisterRoute stops the informer for that namespace
or uses ctx.Done()).

…lity

The SharedSecretManager optimization was fundamentally incompatible with
the restrictive RBAC model used in E2E tests (resourceNames). By
attempting to list all secrets in a namespace, it triggered 403
Forbidden errors, causing the informer to never sync and the router
to remain blind to secret updates/deletions.

This commit introduces an adaptive hybrid strategy:
1. Probe each namespace for 'list' permissions on secrets.
2. If allowed (Fast Path), use a single namespace-wide informer (high performance).
3. If restricted (Safe Path/CI), fall back to a per-secret informer using a
   FieldSelector, which is compatible with granular RBAC.

Additionally:
- Implemented proper informer lifecycle management with context cancellation.
- Moved SARCompleted emission to occur only after the downstream plugin
  chain accepts the route, preventing premature admission of invalid certs.
- Added comprehensive unit tests for the hybrid logic.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/router/controller/shared_secret_manager.go (1)

107-109: 💤 Low value

Ineffectual assignment to infState when informer already exists.

When exists is true, the assigned infState is never used since the block is skipped. Use the blank identifier for the initial assignment.

♻️ Proposed fix
-	infKey := m.getInformerKey(namespace, secretName, restricted)
-	infState, exists := m.informers[infKey]
+	infKey := m.getInformerKey(namespace, secretName, restricted)
+	_, exists := m.informers[infKey]
 	if !exists {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/shared_secret_manager.go` around lines 107 - 109, The
map lookup currently assigns infState but then immediately branches when exists
is true so infState is never used; change the lookup to use the blank identifier
(e.g. "_, exists := m.informers[infKey]") when you only need the existence
check. Update the line where m.getInformerKey(namespace, secretName, restricted)
is used to assign infKey and perform "_, exists := m.informers[infKey]" instead
of "infState, exists := m.informers[infKey]" so you avoid the ineffectual
assignment while preserving the existence check for the subsequent logic that
uses infKey and m.informers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/router/controller/shared_secret_manager.go`:
- Around line 107-109: The map lookup currently assigns infState but then
immediately branches when exists is true so infState is never used; change the
lookup to use the blank identifier (e.g. "_, exists := m.informers[infKey]")
when you only need the existence check. Update the line where
m.getInformerKey(namespace, secretName, restricted) is used to assign infKey and
perform "_, exists := m.informers[infKey]" instead of "infState, exists :=
m.informers[infKey]" so you avoid the ineffectual assignment while preserving
the existence check for the subsequent logic that uses infKey and m.informers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 12c0f330-7774-44ba-88dd-cb7251258d1a

📥 Commits

Reviewing files that changed from the base of the PR and between 8b1a080 and 8b79da7.

📒 Files selected for processing (4)
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/controller/shared_secret_manager.go
  • pkg/router/controller/shared_secret_manager_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/router/controller/route_secret_manager_test.go

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 31, 2026

@bentito: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/fips-image-scan-haproxy-router d0e707e link true /test fips-image-scan-haproxy-router
ci/prow/e2e-aws-fips d0e707e link true /test e2e-aws-fips
ci/prow/okd-scos-images d0e707e link true /test okd-scos-images
ci/prow/e2e-aws-serial-2of2 d0e707e link true /test e2e-aws-serial-2of2
ci/prow/e2e-upgrade d0e707e link true /test e2e-upgrade
ci/prow/images d0e707e link true /test images
ci/prow/e2e-agnostic d0e707e link true /test e2e-agnostic
ci/prow/e2e-aws-serial-1of2 d0e707e link true /test e2e-aws-serial-1of2

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants