Skip to content

[SDCICD-1830] Add retry mechanism with fallback model for LLM analysis engines#3231

Open
varunraokadaparthi wants to merge 2 commits into
openshift:mainfrom
varunraokadaparthi:retry-mechanism
Open

[SDCICD-1830] Add retry mechanism with fallback model for LLM analysis engines#3231
varunraokadaparthi wants to merge 2 commits into
openshift:mainfrom
varunraokadaparthi:retry-mechanism

Conversation

@varunraokadaparthi

@varunraokadaparthi varunraokadaparthi commented May 18, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add exponential backoff retry logic (1m → 2m → 4m, 3 retries) for Gemini API calls in both analysis engines (internal/analysisengine, pkg/krknai/analysisengine)
  • On primary model (gemini-3.1-pro-preview) exhaustion, automatically switches to fallback model (gemini-2.5-pro) with the same retry strategy
  • Retryable error classification uses a status code allowlist (429, 500, 502, 503) via genai.APIError inspection; non-retryable errors (4xx, context cancellation, tool failures) short-circuit immediately
  • Both clients are pre-initialized at engine construction to avoid allocation and auth overhead on the retry path
  • Sentinel errors replace string matching for SDK-originated error conditions

Test plan

  • 15 unit tests in internal/llm/retry_test.go covering: success paths, retryable/non-retryable status codes, primary exhaustion with fallback, both models exhausted, context cancellation, network timeouts, wrapped errors
  • Existing pkg/krknai/analysisengine tests pass unchanged
  • make build succeeds

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added a fallback model and retry-with-fallback logic for analysis to improve reliability.
  • Bug Fixes / Reliability
    • Improved retry behavior for transient API and network errors, empty responses, and timeouts.
    • Analysis failures now surface detailed failure status/messages for reporting.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

There are test jobs defined for this repository which are not configured to run automatically. Comment /test ? to see a list of all defined jobs. Review these jobs and use /test <job> to manually trigger jobs most likely to be impacted by the proposed changes.Comment /pipeline required to trigger all required & necessary jobs.

@coderabbitai

coderabbitai Bot commented May 18, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8abb8c93-d634-44c4-9da7-207d8233b322

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

The PR implements a resilient LLM analysis pipeline with a primary/fallback model strategy. It introduces sentinel errors to classify Gemini failures, adds exponential-backoff retry orchestration that promotes to a fallback model on exhaustion, integrates both patterns into internal and public engines, and validates behavior across success paths, transient failures, API errors, and edge cases.

Changes

LLM Retry and Fallback Pattern

Layer / File(s) Summary
Sentinel Errors and Fallback Model
internal/llm/errors.go, internal/llm/gemini.go
Defines exported sentinel error variables (ErrNoResponseCandidates, ErrNoContentInResponse, ErrToolCallFailed, ErrMaxIterations) and adds FallbackModel constant for a second Gemini model variant.
Gemini Error Classification
internal/llm/gemini.go
Updates error handling in response parsing and tool execution to emit sentinel errors instead of formatted strings; wraps tool failures with ErrToolCallFailed and returns ErrMaxIterations on max-iteration termination.
Retry Orchestration with Primary/Fallback
internal/llm/retry.go
Implements AnalyzeWithRetry to run primary analysis with exponential-backoff retries; on exhaustion, switches to fallback function with identical retry logic; returns aggregated error if both fail. Includes isRetryable predicate for API codes, context cancellation, network timeouts, and sentinel errors.
Engine Integration
internal/analysisengine/engine.go, pkg/krknai/analysisengine/engine.go
Both engines add fallbackLLMClient field and initialize a second Gemini client with FallbackModel in New; Run replaces direct primary-client calls with llm.AnalyzeWithRetry, passing context logger and closures for primary and fallback analysis.
Comprehensive Retry Tests
internal/llm/retry_test.go
Validates primary-only success, transient failure recovery, primary exhaustion with fallback promotion, dual exhaustion with aggregated errors, non-retryable error handling, HTTP status code retryability, context cancellation, network timeouts, sentinel error classification, and error wrapping preservation through multiple layers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.52% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the primary changes: adding retry mechanism with fallback model support to LLM analysis engines.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR uses Go's standard testing package, not Ginkgo. All test names are stable identifiers with no dynamic content, timestamps, UUIDs, or pod/node/namespace names that change between runs.
Test Structure And Quality ✅ Passed Custom check targets Ginkgo tests (It/Describe/BeforeEach blocks). The PR adds standard Go tests using testing.T. No Ginkgo tests present, so check is not applicable.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All test additions are standard Go unit tests in internal/llm/retry_test.go using testing.T, not e2e tests. The check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR does not introduce new Ginkgo e2e tests. All changes are internal LLM logic and unit tests using Go's standard testing framework, which are not subject to SNO compatibility requirements.
Topology-Aware Scheduling Compatibility ✅ Passed Check not applicable. PR contains only Go library code—LLM retry logic and error handling—with no Kubernetes deployment manifests, operator code, or scheduling constraints.
Ote Binary Stdout Contract ✅ Passed PR modifies only library packages. No main packages, init/TestMain functions, or direct stdout writes. Uses logr.Logger for logging. OTE contract satisfied.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests are added. PR adds only standard Go unit tests to internal/llm/retry_test.go using testing.T. The custom check for IPv6/disconnected network compatibility is not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: varunraokadaparthi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026
@varunraokadaparthi

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 18, 2026

Copy link
Copy Markdown
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/llm/gemini.go (1)

109-117: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return ErrMaxIterations instead of success on the final tool-call iteration.

Line 109-114 returns nil error on the last iteration, so the max-iteration failure path at Line 117 is bypassed.

Suggested fix
 func (g *GeminiClient) handleConversationWithTools(ctx context.Context, contents []*genai.Content, genConfig *genai.GenerateContentConfig, toolRegistry *tools.Registry) (*AnalysisResult, error) {
 	const maxIterations = 5
 	var toolCalls []*genai.FunctionCall
+	var lastTextContent string
 
 	for i := range maxIterations {
 		resp, err := g.client.Models.GenerateContent(ctx, g.model, contents, genConfig)
 		if err != nil {
 			return nil, fmt.Errorf("gemini API error: %w", err)
@@
 		textContent, functionCalls := g.processCandidateParts(candidate)
+		lastTextContent = textContent
@@
-		// Return partial result if we hit max iterations
-		if i == maxIterations-1 {
-			return &AnalysisResult{
-				Content:   textContent,
-				ToolCalls: toolCalls,
-			}, nil
-		}
+		_ = i
 	}
 
-	return &AnalysisResult{ToolCalls: toolCalls}, ErrMaxIterations
+	return &AnalysisResult{
+		Content:   lastTextContent,
+		ToolCalls: toolCalls,
+	}, ErrMaxIterations
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/llm/gemini.go` around lines 109 - 117, The final iteration branch
currently returns (&AnalysisResult{Content: textContent, ToolCalls: toolCalls},
nil) which bypasses the ErrMaxIterations path; update the branch in the loop
that checks if i == maxIterations-1 to return the AnalysisResult along with
ErrMaxIterations instead of nil so the caller sees the max-iteration failure
(referencing AnalysisResult, ErrMaxIterations, i, maxIterations, textContent,
and toolCalls).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/llm/retry_test.go`:
- Line 15: The file internal/llm/retry_test.go is not gofumpt-formatted (issue
at line 15); run the gofumpt formatter on that file to rewrite it to gofumpt
style, stage the resulting changes, and re-run the linter so the test file
(retry_test.go) conforms to the project's gofumpt formatting rules.
- Around line 317-320: The test's fallbackFn currently returns an invalid (nil,
nil) tuple after calling t.Fatal; change fallbackFn in
internal/llm/retry_test.go so it returns a non-nil error instead of nil for the
error value (e.g., return nil, errors.New("fallback called unexpectedly") or
fmt.Errorf(...)) to avoid nilnil; ensure you add the required import (errors or
fmt) if not already present.

In `@internal/llm/retry.go`:
- Around line 37-52: Before switching to the fallback, save the original primary
error (the err returned from retryLoop with primaryFn) into a distinct variable
(e.g., primaryErr) so it isn't lost; then when the fallback retryLoop (with
fallbackFn) also fails, return a combined error that preserves both failures
(use errors.Join(primaryErr, err) or fmt.Errorf with both messages) and include
both in the logger.Error context so diagnostics show both primary and fallback
failures.
- Around line 101-104: The errors.As check uses a value-typed target
(genai.APIError) which won't match the SDK's pointer error type; change the
target to a pointer (e.g., declare apiErr as *genai.APIError) and call
errors.As(err, &apiErr) so the match succeeds and retryableStatusCodes is
consulted for apiErr.Code; update the logic around the variable name `apiErr` in
retry.go to use the pointer type and keep the existing return of
retryableStatusCodes[apiErr.Code].

---

Outside diff comments:
In `@internal/llm/gemini.go`:
- Around line 109-117: The final iteration branch currently returns
(&AnalysisResult{Content: textContent, ToolCalls: toolCalls}, nil) which
bypasses the ErrMaxIterations path; update the branch in the loop that checks if
i == maxIterations-1 to return the AnalysisResult along with ErrMaxIterations
instead of nil so the caller sees the max-iteration failure (referencing
AnalysisResult, ErrMaxIterations, i, maxIterations, textContent, and toolCalls).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2bc7f5b6-0d3e-4dd5-86d0-245f9f6591a7

📥 Commits

Reviewing files that changed from the base of the PR and between 8c5716a and 9b2d95d.

📒 Files selected for processing (6)
  • internal/analysisengine/engine.go
  • internal/llm/errors.go
  • internal/llm/gemini.go
  • internal/llm/retry.go
  • internal/llm/retry_test.go
  • pkg/krknai/analysisengine/engine.go

Comment thread internal/llm/retry_test.go Outdated
Comment thread internal/llm/retry_test.go
Comment thread internal/llm/retry.go Outdated
Comment thread internal/llm/retry.go
@ritmun

ritmun commented May 19, 2026

Copy link
Copy Markdown
Contributor

Could wait.PollUntil functions be used anywhere?

Comment thread internal/llm/gemini.go Outdated
@@ -111,17 +114,17 @@ func (g *GeminiClient) handleConversationWithTools(ctx context.Context, contents

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}, ErrMaxIterations

@YiqinZhang YiqinZhang Jun 12, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's optional, surface the error when the fallback is silenced, aligned with your change 3 lines below on line 117.

@varunraokadaparthi varunraokadaparthi Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new commit.
This will perform a final LLM call after max iterations, ensuring complete log analysis instead of partial or incomplete content.

@varunraokadaparthi varunraokadaparthi force-pushed the retry-mechanism branch 2 times, most recently from f4b0b75 to 7dbc4df Compare May 21, 2026 22:49

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/llm/retry_test.go (1)

58-58: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Test errors wrap value-type genai.APIError.

Tests use genai.APIError{Code: ...} (value), matching the current production bug. Once production code is fixed to use *genai.APIError, update tests to wrap pointer errors to reflect real SDK behavior:

-return nil, fmt.Errorf("gemini API error: %w", genai.APIError{Code: 429, Message: "rate limited"})
+return nil, fmt.Errorf("gemini API error: %w", &genai.APIError{Code: 429, Message: "rate limited"})

Also applies to: 85-85, 110-110, 115-115, 136-136, 161-161, 166-166, 201-201, 242-242, 274-274, 432-432, 436-436

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/llm/retry_test.go` at line 58, Replace value-type genai.APIError
usages in the test helper returns with pointer instances so wrapped errors match
the SDK behaviour; e.g. change occurrences like fmt.Errorf("gemini API error:
%w", genai.APIError{Code: 429, Message: "rate limited"}) to wrap a pointer
(fmt.Errorf("gemini API error: %w", &genai.APIError{Code: 429, Message: "rate
limited"})). Update all similar occurrences in internal/llm/retry_test.go (the
instances called out around the diff and at the listed offsets) so tests wrap
*genai.APIError instead of genai.APIError while keeping the fmt.Errorf wrapping
and message text unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/llm/retry_test.go`:
- Line 58: Replace value-type genai.APIError usages in the test helper returns
with pointer instances so wrapped errors match the SDK behaviour; e.g. change
occurrences like fmt.Errorf("gemini API error: %w", genai.APIError{Code: 429,
Message: "rate limited"}) to wrap a pointer (fmt.Errorf("gemini API error: %w",
&genai.APIError{Code: 429, Message: "rate limited"})). Update all similar
occurrences in internal/llm/retry_test.go (the instances called out around the
diff and at the listed offsets) so tests wrap *genai.APIError instead of
genai.APIError while keeping the fmt.Errorf wrapping and message text unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fca10d55-95ae-4299-bfad-4d1e5dc092a2

📥 Commits

Reviewing files that changed from the base of the PR and between 9b2d95d and 7dbc4df.

📒 Files selected for processing (3)
  • internal/llm/retry.go
  • internal/llm/retry_test.go
  • pkg/e2e/e2e.go

…, integration tests

- 2 retries on primary model, 1 attempt on fallback, graceful error message
- Flat 30s retry delay (no exponential backoff)
- Integration tests with mock Gemini server for all error code scenarios
- Store failed analysis result so LLM errors appear in Slack notifications

Co-Authored-By: Claude Code <noreply@anthropic.com>
@varunraokadaparthi varunraokadaparthi force-pushed the retry-mechanism branch 2 times, most recently from b515119 to 0e199e1 Compare June 13, 2026 00:51
@varunraokadaparthi

Copy link
Copy Markdown
Contributor Author

/retest

…sted

Instead of returning a partial result when the tool call loop is
exhausted, make one final API call with FunctionCallingConfigModeNone
to force the model to produce a text analysis from the gathered context.

Co-Authored-By: Claude Code <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

@varunraokadaparthi: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants