Skip to content

Cache OpenStack ProviderClient per auth context#317

Open
fwiesel wants to merge 1 commit into
mainfrom
cache-openstack-provider-client
Open

Cache OpenStack ProviderClient per auth context#317
fwiesel wants to merge 1 commit into
mainfrom
cache-openstack-provider-client

Conversation

@fwiesel

@fwiesel fwiesel commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

The seven controllers using the default credentials each call
AuthenticatedClient from SetupWithManager, deserialising the full
Keystone service catalog every time. With a large multi-region catalog
this is wasteful regardless of whether it is a proximate OOM cause.

Providers are now cached per auth context (AuthURL, project, user, and
their domains) so Keystone is hit once for the default credentials and
once for the test-project credentials.

Concurrent callers with the same key share a single in-flight request
via singleflight, preventing duplicate Keystone round-trips on
startup.

Summary by CodeRabbit

Release Notes

  • New Features

    • Implemented offboarding taint mechanism to prevent pods from re-registering services during hypervisor maintenance.
    • Added gating for nova-compute service deletion to wait for agent pod eviction completion.
  • Improvements

    • Enhanced OpenStack service client performance through provider credential caching.
  • Tests

    • Added comprehensive test coverage for offboarding taint application and agent pod eviction scenarios.
  • Chores

    • Updated RBAC permissions to support pod monitoring capabilities.
    • Promoted transitive dependencies to direct requirements.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@fwiesel, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 33 minutes and 30 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4e177f6f-3f04-452d-93cd-ae44e0ccb39c

📥 Commits

Reviewing files that changed from the base of the PR and between b0b6f49 and dec2a6d.

📒 Files selected for processing (2)
  • go.mod
  • internal/openstack/service_client.go
📝 Walkthrough

Walkthrough

Adds an AgentPodsEvicted condition computed by the hypervisor controller from live pod listings (filtering out succeeded, deleting, and taint-tolerating pods). The gardener node lifecycle controller gains a NoExecute offboarding taint mechanism triggered when Evicting is False. The offboarding controller gates nova-compute service deletion on AgentPodsEvicted=True. Separately, OpenStack provider clients are now cached and deduplicated via singleflight.

Changes

Agent Pod Eviction Gate for Offboarding

Layer / File(s) Summary
ConditionTypeAgentPodsEvicted type and offboarding taint constant
api/v1/hypervisor_types.go, internal/controller/constants.go
Defines the new AgentPodsEvicted condition type in the hypervisor API and the taintKeyOffboarding NoExecute taint key constant.
Compute AgentPodsEvicted condition in HypervisorController
internal/controller/hypervisor_controller.go, internal/controller/hypervisor_controller_test.go
Extends reconciliation to list node pods via field selector, filter out non-agent pods, and publish AgentPodsEvicted condition; requeues with defaultPollTime when agents remain. Adds RBAC annotation for pod access. Tests cover tolerating-only, running-agent, and deletion-pending scenarios.
Offboarding taint on Node via GardenerNodeLifecycleController
internal/controller/gardener_node_lifecycle_controller.go, internal/controller/gardener_node_lifecycle_controller_test.go
Adds ensureOffboardingTaint that patches a NoExecute taint onto the Node when MaintenanceTermination is active and Evicting is False. Reworks SetupWithManager to watch Hypervisor and enqueue Node reconciles via evictingConditionChangedPredicate. Tests validate taint creation and idempotency.
Gate compute-service deletion on AgentPodsEvicted in OffboardingController
internal/controller/offboarding_controller.go, internal/controller/offboarding_controller_test.go
Adds guard that blocks nova-compute service deletion until ConditionTypeAgentPodsEvicted is true, updating the Offboarded condition with a waiting message when not satisfied. Tests cover happy path, failure-mode bypass, and agent-pods-still-running blocking.
RBAC role and dependency updates
charts/openstack-hypervisor-operator/templates/role.yaml, go.mod
Adds pod get/list/watch to the ClusterRole. Promotes go-logr/logr and golang.org/x/sync to direct dependencies.

OpenStack Provider Client Caching

Layer / File(s) Summary
Authenticated provider cache and refactored auth boundary
internal/openstack/service_client.go
Introduces a global mutex + singleflight-protected ProviderClient cache keyed by auth-context fields. NewProviderClient returns the authenticated result directly; ServiceClientFromProvider assumes an already-authenticated provider and only resolves endpoints.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • cobaltcore-dev/openstack-hypervisor-operator#281: Modifies the same offboarding_controller.go reconciler logic and condition handling (renames decommission→offboarding and updates status messaging), which overlaps with the AgentPodsEvicted gating logic added here.
  • cobaltcore-dev/openstack-hypervisor-operator#286: Changes offboarding reconcile condition-based gating in the same controller (ConditionTypeOffboarded gating), directly adjacent to the new AgentPodsEvicted wait gate added in this PR.
  • cobaltcore-dev/openstack-hypervisor-operator#293: Refactors how hypervisor offboarding status conditions are patched (retry-on-conflict) in the same offboarding_controller.go, overlapping with the condition/status update logic extended here.

Suggested reviewers

  • mchristianl
  • notandy

Poem

🐰 A taint on the node, a gate on the door,
No nova-compute shall re-register more!
The agent pods counted, each name in a list,
Until they are gone, the deletion is missed.
The provider is cached, one login to share —
Hop hop, little rabbit, offboarding with care! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately and concisely describes the main change—caching OpenStack ProviderClient per auth context—which aligns with the primary optimization in the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cache-openstack-provider-client

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/openstack/service_client.go (1)

41-55: 💤 Low value

Cache key omits fields that can also affect the Keystone token.

The comment states the key includes "all fields that affect the resulting Keystone token" but it excludes alternative identity selectors like ProjectID, ProjectDomainID, UserID, UserDomainID, DomainID/DomainName, and application credentials. If a future caller specifies ProjectID instead of ProjectName, they would collide with a different auth context.

Current upstream usage (e.g., OnboardingController using ProjectName/ProjectDomainName) is safe, but consider either:

  1. Adding the missing ID fields to the key, or
  2. Updating the comment to accurately reflect the subset of fields covered.
Option 1: Expand the cache key to cover ID-based selectors
 func cacheKey(authInfo *clientconfig.AuthInfo) string {
 	if authInfo == nil {
 		return ""
 	}
-	// Include all fields that affect the resulting Keystone token and catalog
-	// so a future auth context differing only in user, domain, or AuthURL
-	// doesn't silently reuse the wrong cached provider.
+	// Include fields that affect the resulting Keystone token and catalog.
+	// This covers both name-based and ID-based selectors to prevent collision
+	// when callers use different identity strategies.
 	return strings.Join([]string{
 		authInfo.AuthURL,
 		authInfo.ProjectName,
+		authInfo.ProjectID,
 		authInfo.ProjectDomainName,
+		authInfo.ProjectDomainID,
 		authInfo.Username,
+		authInfo.UserID,
 		authInfo.UserDomainName,
+		authInfo.UserDomainID,
+		authInfo.DomainName,
+		authInfo.DomainID,
 	}, "\x00")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/openstack/service_client.go` around lines 41 - 55, The cacheKey
function claims to include all fields affecting the Keystone token but omits
alternative identity selectors like ProjectID, ProjectDomainID, UserID,
UserDomainID, DomainID, DomainName, and application credentials. This can cause
cache collisions when different auth contexts use ID-based selectors instead of
name-based ones. Update the strings.Join call in cacheKey to include all the
missing ID and alternative fields from the authInfo struct alongside the
existing name-based fields, ensuring the cache key accounts for all ways callers
can specify authentication parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/hypervisor_controller.go`:
- Around line 298-300: The hv.List call on line 298-300 uses a field selector
for "spec.nodeName" on Pods, but this field index is not registered with the
controller's cached client. To fix this, either register the field index by
calling IndexField with "spec.nodeName" in the SetupWithManager method for the
Pod type, or configure Pod to bypass the cache by adding it to the
cache.Options.ByObject configuration in the main setup.

---

Nitpick comments:
In `@internal/openstack/service_client.go`:
- Around line 41-55: The cacheKey function claims to include all fields
affecting the Keystone token but omits alternative identity selectors like
ProjectID, ProjectDomainID, UserID, UserDomainID, DomainID, DomainName, and
application credentials. This can cause cache collisions when different auth
contexts use ID-based selectors instead of name-based ones. Update the
strings.Join call in cacheKey to include all the missing ID and alternative
fields from the authInfo struct alongside the existing name-based fields,
ensuring the cache key accounts for all ways callers can specify authentication
parameters.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0f0c9a6-9a49-426b-9515-f34863b08adf

📥 Commits

Reviewing files that changed from the base of the PR and between 441452b and b0b6f49.

📒 Files selected for processing (11)
  • api/v1/hypervisor_types.go
  • charts/openstack-hypervisor-operator/templates/role.yaml
  • go.mod
  • internal/controller/constants.go
  • internal/controller/gardener_node_lifecycle_controller.go
  • internal/controller/gardener_node_lifecycle_controller_test.go
  • internal/controller/hypervisor_controller.go
  • internal/controller/hypervisor_controller_test.go
  • internal/controller/offboarding_controller.go
  • internal/controller/offboarding_controller_test.go
  • internal/openstack/service_client.go

Comment thread internal/controller/hypervisor_controller.go Outdated
The seven controllers using the default credentials each call
AuthenticatedClient from SetupWithManager, deserialising the full
Keystone service catalog every time. With a large multi-region catalog
this is wasteful regardless of whether it is a proximate OOM cause.
Providers are now cached per auth context (AuthURL, project, user, and
their domains) so Keystone is hit once for the default credentials and
once for the test-project credentials.
@fwiesel fwiesel force-pushed the cache-openstack-provider-client branch from b0b6f49 to dec2a6d Compare June 19, 2026 10:15
@github-actions

Copy link
Copy Markdown

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/openstack 66.45% (-11.85%) 💀

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/openstack/service_client.go 0.00% (ø) 43 (+23) 0 43 (+23)

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces repeated Keystone authentication/catalog deserialization by caching OpenStack ProviderClients per auth context, and extends the hypervisor offboarding workflow to (a) taint nodes to evict agent pods and (b) gate Nova compute-service deletion until those agent pods are confirmed gone.

Changes:

  • Cache gophercloud.ProviderClient instances keyed by auth context and deduplicate concurrent auth via singleflight.
  • Add an offboarding NoExecute taint and compute/report a new AgentPodsEvicted condition from live pod state.
  • Prevent Nova compute-service deletion during offboarding until AgentPodsEvicted=True, and update RBAC/tests accordingly.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Show a summary per file
File Description
internal/openstack/service_client.go Introduces cached ProviderClient creation and a helper to derive service clients from a cached provider.
internal/controller/offboarding_controller.go Gates compute-service deletion on a new “agent pods evicted” status condition.
internal/controller/offboarding_controller_test.go Updates tests to set/bypass the new eviction gate and adds a gate-specific test case.
internal/controller/hypervisor_controller.go Computes and persists AgentPodsEvicted based on pods running on the node.
internal/controller/hypervisor_controller_test.go Adds envtest coverage for AgentPodsEvicted condition scenarios (tolerating pods, running pods, deletion-in-progress).
internal/controller/gardener_node_lifecycle_controller.go Applies the offboarding taint once VM eviction completes; watches Hypervisor changes relevant to tainting.
internal/controller/gardener_node_lifecycle_controller_test.go Adds tests ensuring taint application and idempotency.
internal/controller/constants.go Adds the shared taintKeyOffboarding constant.
api/v1/hypervisor_types.go Adds the ConditionTypeAgentPodsEvicted condition type.
charts/openstack-hypervisor-operator/templates/role.yaml Grants pod get/list/watch for the new pod inspection logic.
go.mod Promotes logr and x/sync to direct dependencies used by new code.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants