OCPBUGS-90370: Stabilize telemetry config to prevent continuous console pod rollouts#1173
OCPBUGS-90370: Stabilize telemetry config to prevent continuous console pod rollouts#1173sg00dwin wants to merge 1 commit into
Conversation
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-90370, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🔗 Linked repositories identifiedCodeRabbit considers these linked repositories for cross-repo context during reviews:
🚧 Files skipped from review as they are similar to previous changes (2)
📜 Recent review details🧰 Additional context used🔀 Multi-repo context openshift/consoleLinked repositories findingsBased on my cross-repository exploration of openshift/consoleFrontend telemetry consumption patterns:
Impact Assessment: ✅ No breaking changes detected. The PR's guarantee to always include
Walkthrough
ChangesTelemetry Configuration Resilience
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sg00dwin The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/console/operator/sync_v400_test.go`:
- Around line 476-489: The test currently only verifies that specific shared
keys are present in both configurations, but does not assert that the complete
key sets are stable across availability changes. After the existing loop that
checks shared key presence, add an assertion that compares the full normalized
key lists from configUnavailable and configAvailable (after removing the
intentional TELEMETER_CLIENT_DISABLED exception from the unavailable config).
Use the already-computed keysUnavailable and keysAvailable variables to perform
a comprehensive comparison and fail the test if any unexpected keys are missing
or added, ensuring stronger regression coverage for ConfigMap-churn bugs.
- Around line 318-333: The test is ignoring errors from Indexer.Add calls (such
as cvIndexer.Add(cv)) and json.Marshal operations instead of handling them
explicitly. Create a test helper function marked with t.Helper() that wraps
these operations and uses t.Fatalf() to fail the test immediately if an error
occurs, then replace all instances of ignored Indexer.Add errors (lines 318,
333, 350, 378) and json.Marshal errors (lines 366, 374) with calls to this new
helper function. This ensures test setup failures are caught immediately rather
than silently propagating broken state.
In `@pkg/console/operator/sync_v400.go`:
- Around line 488-490: The TELEMETER_CLIENT_DISABLED key in telemetryConfig is
conditionally added only when telemeterClientIsAvailable is false, causing
non-deterministic ConfigMap updates when telemeter availability fluctuates.
Refactor the logic to always include the TELEMETER_CLIENT_DISABLED key in
telemetryConfig, setting its value to "true" when telemeterClientIsAvailable is
false and "false" when it is true. This ensures a stable, deterministic key-set
that prevents ConfigMap churn from telemeter availability oscillations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4f8c4fbd-88b8-48b6-8b25-dcc2325169c7
📒 Files selected for processing (2)
pkg/console/operator/sync_v400.gopkg/console/operator/sync_v400_test.go
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift/console(manual)
📜 Review details
🧰 Additional context used
📓 Path-based instructions (8)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Follow Go coding standards and patterns documented in CONVENTIONS.md
Organize imports according to conventions documented in CONVENTIONS.md
Usegofmtto format Go code with standard formatting
Rungo vetchecks on all Go packagesFollow Go coding standards and patterns as documented in CONVENTIONS.md, including proper import organization
Organize Go code following the repository structure: main entry point in
cmd/console/main.go, API constants inpkg/api/, operator command setup inpkg/cmd/operator/, and version command inpkg/cmd/version/
**/*.go: Usegofmtfor formatting Go code
Follow standard Go naming conventions
Group imports in order: standard lib, 3rd party, kube/openshift, internal (marked with comments)
Use meaningful error messages with context in Go code
Set status conditions usingstatus.Handle*functions with type prefixes (*Degraded, *Progressing, *Available, *Upgradeable)
Use typed errors and wrap errors to preserve stack context
**/*.go: Imports must be grouped with comments in the order: standard lib, 3rd party, kube, openshift, and operator (internal)
Wrap errors with context usingfmt.Errorf("failed to X: %w", err)pattern to provide meaningful error messages
**/*.go: Do NOT use deprecated ioutil package functions. Use os.ReadFile() instead of ioutil.ReadFile(), os.WriteFile() instead of ioutil.WriteFile(), and io.ReadAll() instead of ioutil.ReadAll() (Go 1.16+)
Do NOT use net.Dial() directly. Use (&net.Dialer{}).DialContext() to support context cancellation
Use %w verb in fmt.Errorf() to wrap errors and maintain error chains, not %v which loses the error chain
Add descriptive context to errors when wrapping them, including relevant identifiers and state information
Use specific error type checks (e.g., apierrors.IsNotFound()) instead of string matching with strings.Contains() on error.Error()
Always propagate the parent context in function calls instead of using context.Background(), to respect parent cont...
Files:
pkg/console/operator/sync_v400.gopkg/console/operator/sync_v400_test.go
⚙️ CodeRabbit configuration file
**/*.go: Review Go code following OpenShift operator patterns.
See CONVENTIONS.md for coding standards and patterns.Refer to the following skills based on CODE PATTERNS, not just file paths:
Refer to /controller-review when code contains:
- Controller struct types (e.g.,
type *Controller struct)func New*Controller(factory functionsfactory.New().WithFilteredEventsInformers(pattern.ToController(method callsSync(ctx context.Context, controllerContext factory.SyncContext)methodsoperatorConfig.Spec.ManagementStatechecksstatus.NewStatusHandlerorstatus.Handle*functionsRefer to /sync-handler-review when code contains:
- Main operator sync functions (e.g.,
sync_v400.gocontent)- Sequential resource syncing with early returns
- Incremental reconciliation loops
- Multiple
resourceapply.Apply*()calls in sequence- Dependency ordering of ConfigMaps → Secrets → Service Accounts → RBAC → Services → Deployments → Routes
- Feature gate conditional logic
Refer to /go-quality-review for all Go code to check:
- Deprecated imports:
ioutil.ReadFile,ioutil.WriteFile,ioutil.ReadAll- Deprecated patterns:
DialwithoutDialContext- Error handling: missing
%win fmt.Errorf- Code smells: deep nesting (4+ levels), functions >100 lines
- Magic values: unexplained numbers/strings
- Context propagation:
context.Background()instead of passed ctx- Missing godoc on exported functions
Files:
pkg/console/operator/sync_v400.gopkg/console/operator/sync_v400_test.go
{pkg,cmd}/**/*.go
📄 CodeRabbit inference engine (CLAUDE.md)
Use gofmt for code formatting on pkg and cmd directories
{pkg,cmd}/**/*.go: Format code usinggofmt -w ./pkg ./cmd
Rungo vetchecks on all Go packages in ./pkg and ./cmd
Files:
pkg/console/operator/sync_v400.gopkg/console/operator/sync_v400_test.go
**/*sync*.go
📄 CodeRabbit inference engine (CONVENTIONS.md)
Implement sync loops (
sync_v400) incrementally: start from zero, create/update missing requirements, and return to continue on next loop
**/*sync*.go: Sync handler implementations must use incremental sync pattern where each reconciliation loop returns early on error rather than collecting errors and continuing
Sync resources in dependency order: ConfigMaps/Secrets → Service Accounts → RBAC (Roles, RoleBindings) → Services → Deployments → Routes
Return early on errors to maintain incremental sync behavior instead of continuing execution or logging and proceeding
Use resourceapply.Apply*() functions to handle both resource creation and update operations
Ensure deleted resources are cleaned up from the cluster when removed from config, with proper handling for NotFound errors
Check feature gates before syncing gated resources to ensure conditional resource reconciliation
Status updates must reflect actual reconciliation state by adding conditions as operations complete using status handlers with HandleProgressingOrDegraded and HandleAvailable
Respect ManagementState configurations and implement cleanup logic for Removed state
Avoid mutating live objects; instead build desired state separately before applying
Do not sync all resources regardless of errors; stop reconciliation on dependency failures
Files:
pkg/console/operator/sync_v400.gopkg/console/operator/sync_v400_test.go
**/operator/**/*.go
📄 CodeRabbit inference engine (Custom checks)
When deployment manifests, operator code, or controllers are added/modified, ensure they do not introduce scheduling constraints assuming standard HA topology. Check ControlPlaneTopology for SingleReplica/DualReplica/HighlyAvailableArbiter/External modes before applying constraints. Use required anti-affinity with maxUnavailable >= 1 (not maxUnavailable: 0). Cap replica counts to schedulable nodes. Exclude arbiter nodes on TNA. Avoid master nodeSelectors on HyperShift. Use library-go DeploymentController hooks (WithTopologyAwareReplicasHook, WithTopologyAwareSchedulingHook, WithControlPlaneNodeSelectorHook).
Files:
pkg/console/operator/sync_v400.gopkg/console/operator/sync_v400_test.go
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}
⚙️ CodeRabbit configuration file
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}: Injection prevention (prodsec-skills):
- SQL: parameterized queries only; no string concatenation
- Command: no shell=True, os.system, or backtick exec with user input
- LDAP/XPath: escape special characters in filters
- Path traversal: canonicalize paths, reject ../
- Deserialization: no pickle/yaml.load()/eval on untrusted data
- Prototype pollution: no recursive merge of untrusted objects
- Validate at trust boundaries with allow-lists, not deny-lists
- Normalize Unicode and anchor regexes (^$); watch for ReDoS
Files:
pkg/console/operator/sync_v400.gopkg/console/operator/sync_v400_test.go
**
⚙️ CodeRabbit configuration file
**: # OpenShift Console Operator - AI Context HubThis file serves as the central AI documentation hub for the OpenShift Console Operator project. AI assistants (Claude, Cursor, Copilot, CodeRabbit, etc.) use this and the linked documents to understand project context.
Go Version and Dependencies
Go Version and Dependencies
- Go version: 1.24.0 (toolchain: go1.24.4)
- Dependency management: Uses
go.modwith vendoring- Build flags: Use
GOFLAGS="-mod=vendor"for builds and tests to ensure vendored dependencies are used- Key dependencies: openshift/api, openshift/library-go, k8s.io client libraries
- Go version: 1.24.0 (toolchain: go1.24.4)
- Dependency management: Uses
go.modwith vendoring- Build flags: Use
GOFLAGS="-mod=vendor"for builds and tests to ensure vendored dependencies are used- Key dependencies: openshift/api, openshift/library-go, k8s.io client libraries
Quick Reference
This Repository
Document Purpose ARCHITECTURE.md System architecture, components, repository structure CONVENTIONS.md Go coding standards, patterns, import organization TESTING.md Testing patterns, commands, debugging README.md Project README with setup instructions Console Repository (openshift/console)
For frontend-related guidelines, see the openshift/console repository:
Document Purpose STYLEGUIDE.md Frontend code style guidelines INTERNATIONALIZATION.md i18n patterns and translation guidelines CONTRIBUTING.md Contribution guidelines for the console project Project Summary
The **console-operator...
Files:
pkg/console/operator/sync_v400.gopkg/console/operator/sync_v400_test.go
**/*_test.go
📄 CodeRabbit inference engine (AGENTS.md)
Follow testing patterns and commands documented in TESTING.md
Follow testing patterns and commands as documented in TESTING.md, including running unit tests with 'make test-unit' and checks with 'make check'
**/*_test.go: Use table-driven tests for comprehensive coverage
Usehttptestfor HTTP handler testing in Go
Include proper cleanup functions in tests
Test both success and failure pathsCheck and handle all errors in tests using t.Fatalf() or t.Errorf(); do not ignore errors with blank identifiers
Files:
pkg/console/operator/sync_v400_test.go
⚙️ CodeRabbit configuration file
**/*_test.go: Review test code for quality and patterns.Refer to /unit-test-review when test is in pkg//*_test.go:**
- Table-driven test structure with test cases
- Use of
go-test/deepfor struct comparisons- Test naming conventions (TestFunctionName)
- Error handling with
wantErrpattern- Edge case coverage (nil, empty, boundary values)
- Proper assertions with helpful error messages
- Test isolation (no shared mutable state)
Refer to /e2e-test-review when test contains:
framework.MustNewClientset(t, nil)or similar e2e framework usagewait.Pollorwait.PollImmediatepatternsretry.RetryOnConflictfor updates- Cleanup via
deferfunctions- Console/operator CR manipulations
- Test assertions on cluster state
Suggest to use /e2e-test-review when:
- PR adds new feature requiring e2e coverage
- Test file is empty or skeleton
- Comments indicate "TODO: add test"
Review for common issues:
- Missing cleanup (defer statements)
- Using
time.Sleepinstead ofwait.Poll- Missing context timeouts
- Vague error messages in assertions
- Tests without table-driven structure when testing multiple cases
- Ignoring errors with
_- Tests without assertions
Files:
pkg/console/operator/sync_v400_test.go
pkg/**/*_test.go
📄 CodeRabbit inference engine (.claude/skills/unit-test-review.md)
pkg/**/*_test.go: Go unit tests in pkg/**/*_test.go files should use table-driven test pattern for clarity and completeness with struct containing test cases
Test function names should be descriptive and clearly indicate what is being tested (e.g., TestGetNodeComputeEnvironments, TestNewRouteConfig)
Test case names within table-driven tests should explain the scenario being tested (e.g., 'Custom hostname and TLS secret set'), not be vague (e.g., 'test1')
Use github.com/go-test/deep package for struct comparisons to show exact differences instead of simple equality checks or manual field-by-field comparisons
Test both success and failure paths, including edge cases such as empty inputs (nil, "", empty slices/maps), boundary values (0, -1, max int), missing labels/fields, duplicate values, and large inputs
Organize test code using Arrange-Act-Assert pattern: setup test data in Arrange phase, call the function in Act phase, verify results in Assert phase
Check error presence using pattern (err != nil) != tt.wantErr instead of ignoring errors with underscore or missing error checks
When checking specific error messages, use strings.Contains to verify error content instead of just checking error presence
Extract common test setup into helper functions rather than duplicating setup code across multiple tests
Write specific assertion error messages with context (e.g., 'expected %d nodes, got %d') rather than vague messages (e.g., 'failed')
Inline simple test data in test structs; extract complex fixtures into testdata files or helper functions
Avoid red flags in tests: no table-driven tests for multi-scenario functions, tests without subtests, tests depending on execution order, global mutable state, hardcoded sleeps, tests without assertions, ignoring errors with underscore, testing implementation details instead of behavior, or overly complex test setup
Files:
pkg/console/operator/sync_v400_test.go
🔇 Additional comments (1)
pkg/console/operator/sync_v400_test.go (1)
4-22: LGTM!
|
/jira refresh |
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-90370, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
c309132 to
53adcf0
Compare
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-90370, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
53adcf0 to
ce92dda
Compare
…le pod rollouts Remove early return in GetTelemetryConfiguration() so the function always produces the same key set regardless of telemeter-client availability. Make GetAccessToken errors non-fatal for disconnected clusters. Add key set stability tests. Assisted by: Claude (Opus 4.6)
ce92dda to
cc61faf
Compare
|
@sg00dwin: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Fixes continuous console pod rollouts caused by
GetTelemetryConfiguration()producing different ConfigMap content on consecutive sync cycles when telemeter-client availability flickers.Bug: Users are logged out of the web console every ~5 minutes. The operator creates ~11 replica sets in 50 minutes because the console-config ConfigMap alternates between two states, each triggering a deployment rollout that destroys in-memory sessions.
Root cause:
GetTelemetryConfiguration()has an earlyreturnwhen telemeter-client is unavailable that produces a map with a completely different key set than the available path:{CLUSTER_ID, SEGMENT_*, TELEMETER_CLIENT_DISABLED}— skipsORGANIZATION_ID,ACCOUNT_MAIL{CLUSTER_ID, SEGMENT_*, ORGANIZATION_ID, ACCOUNT_MAIL}— skipsTELEMETER_CLIENT_DISABLEDWhen availability flickers (node reboots, monitoring pod restarts),
ApplyConfigMapcorrectly detects the data change, bumpsresourceVersion, which changes the deployment annotation, triggering a new ReplicaSet.Fix:
returnso the function always sets all keys regardless of telemeter-client availabilityGetAccessTokenerrors non-fatal (log + continue with empty token) so disconnected clusters don't fail the syncResult: The telemetry config map always contains the same key set. Values may differ (empty vs populated), but key set stability means the ConfigMap only changes when values actually change.
Reviewer note
With this change, when telemeter-client is unavailable the console-config ConfigMap will include
ORGANIZATION_ID: ""andACCOUNT_MAIL: ""(empty strings) rather than omitting those keys entirely. These values are passed through to the console frontend (bridge) as telemetry config. Empty strings should be functionally equivalent to absent keys — the frontend won't send those fields in telemetry events — but I'd appreciate a reviewer familiar with the bridge telemetry consumption confirming that's harmless.Testing
make test-unitpassesmake check(gofmt + govet) cleanAssisted by: Claude (Opus 4.6)
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests