
TRT-2564: Cache Risk Analysis responses #3409

Open
sosiouxme wants to merge 16 commits into openshift:main from sosiouxme:20260402-TRT-2564-RA-caching

Conversation

@sosiouxme
Member

@sosiouxme sosiouxme commented Apr 7, 2026

Risk Analysis queries are fairly heavy and can impact Sippy DB performance. During regression periods we are likely to see an increase in repetitive calls for the same job/test failures, so this PR adds a caching mechanism.

Both RA and Sippy classic test pages use the same underlying query mechanism, so the same cache is introduced for both (replacing the existing request-level cache for the test API). This caches the test queries for an hour by default, but cache entries are invalidated if the underlying matview is refreshed, so new data is surfaced faster.

  1. Sippy's matview refresh code is updated to write a timestamp to redis whenever a matview refresh completes.
  2. Query results are written to the cache with a timestamp.
  3. On a cache hit, if the underlying matview timestamp is newer than the result timestamp, it is treated as a miss and regenerated. (In the case where the matview timestamp is missing, cached data is used until it expires normally.)

Individual commits separate the concepts that built this change, so reviewing them serially may be easiest.

openshift/origin#30965 proposes one simple way to benchmark risk analysis requests.

Summary by CodeRabbit

  • New Features

    • Global caching introduced across APIs, CLI commands, background processors and materialized-view refresh tracking.
  • Changed

    • Refresh/seed flows and server endpoints now use optional cache clients; matview refreshes record cache timestamps; test/result generation and risk analysis are cache-aware and refactored for unified result specs.
  • Removals

    • /api/canary endpoint removed.
  • Tests

    • Extensive tests added for cache and matview behaviors.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 7, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

@sosiouxme: This pull request references TRT-2564 which is a valid jira issue.

Details

In response to this:

Risk Analysis queries are fairly heavy and can impact Sippy DB performance. During regression periods we are likely to see an increase in repetitive calls for the same job / test failures. So this adds a caching mechanism.

Both RA and Sippy classic test pages use the same underlying query mechanism, so the same cache is introduced for both (replacing the existing request-level cache for the test API). This caches the test queries for an hour by default, but cache entries are invalidated if the underlying matview is refreshed, so new data is surfaced faster.

  1. Sippy's matview refresh code is updated to write a timestamp to redis whenever a matview refresh completes.
  2. Query results are written to the cache with a timestamp.
  3. On a cache hit, if the underlying matview timestamp is newer than the result timestamp, it is treated as a miss and regenerated. (In the case where the matview timestamp is missing, cached data is used until it expires normally.)

Individual commits separate the concepts that built this change, so reviewing them serially may be easiest.

openshift/origin#30965 proposes one simple way to benchmark risk analysis requests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from stbenjam and xueqzhan April 7, 2026 00:05
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 7, 2026
@sosiouxme
Member Author

/retest-required

@openshift-ci-robot

Scheduling required tests:
/test e2e

petr-muller added a commit to petr-muller/sippy that referenced this pull request Apr 9, 2026
@sosiouxme
Member Author

@coderabbitai help

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@sosiouxme
Member Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 110815a6-b9a1-4431-bd10-0c6a67897948

📥 Commits

Reviewing files that changed from the base of the PR and between f81c9a4 and e06b56a.

📒 Files selected for processing (1)
  • pkg/api/tests.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/api/tests.go

Walkthrough

Threads a cache client through commands, server, processors, and API flows; adds generic GetDataFromCacheOrMatview and RefreshMatviewKey helpers; records matview refresh timestamps in cache; adapts function signatures and command flags to accept and propagate a cache client.

Changes

  • Command entry points (cmd/sippy-daemon/main.go, cmd/sippy/load.go, cmd/sippy/refresh.go, cmd/sippy/seed_data.go):
    Add CacheFlags and initialize cacheClient; propagate cacheClient into flows; replace pinned-timestamp usage with a cache-aware RefreshData signature; tolerate a nil cache with warnings, or abort on hard errors where appropriate.
  • Cache core & tests (pkg/api/cache.go, pkg/api/cache_test.go):
    Add GetDataFromCacheOrMatview[T] and RefreshMatviewKey, and adjust CalculateRoundedCacheDuration to honor RequestOptions.Expiry; extensive tests validate nil-cache behavior, hits/misses, matview refresh semantics, timestamp ordering, and cache write behavior.
  • API logic & types (pkg/api/tests.go, pkg/api/job_runs.go):
    Introduce TestResultsSpec with build* methods; rename result slice types to exported TestsAPIResult*; make tests and job-run analysis accept cache.Cache and use cache-aware generation; remove the canary endpoint and the standalone BuildTestsResults builders.
  • Server & processors (pkg/sippyserver/server.go, pkg/sippyserver/pr_commenting_processor.go):
    Wire cache.Cache into Server, WorkProcessor, and AnalysisWorker; change the NewWorkProcessor and RefreshData signatures; record per-matview refresh timestamps via recordMatviewRefreshTime; update handlers to pass s.cache.
  • Cache API types (pkg/apis/cache/cache.go):
    Add Expiry time.Duration to the exported RequestOptions and update the CR time-rounding documentation.
  • Scripts / minor docs (scripts/updatesuites.go):
    Add a clarifying comment that the script's refresh call does not update matview cache timestamps.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server
    participant Cache
    participant DB
    participant BigQuery

    Client->>Server: Request data (e.g., /api/tests)
    Server->>Cache: GET RefreshMatviewKey(matview)
    Cache-->>Server: refresh_time or nil

    alt refresh_time missing OR cache_timestamp >= refresh_time
        Server->>Cache: GET cached data key
        Cache-->>Server: cached_value + cache_timestamp or MISS
        alt cache hit and valid
            Server->>Client: Return cached_value
        else
            Server->>DB: Query matview (or BigQuery fallback)
            DB-->>Server: query_result
            Server->>Cache: SET cached data key = {Val, Timestamp=now} (with cacheDuration)
            Cache-->>Server: set OK
            Server->>Client: Return generated data
        end
    else refresh_time > cache_timestamp
        Server->>DB: Query matview (or BigQuery)
        DB-->>Server: query_result
        Server->>Cache: SET RefreshMatviewKey(matview) = now (TTL 24h)
        Cache-->>Server: set OK
        Server->>Cache: SET cached data key = {Val, Timestamp=now} (with cacheDuration)
        Cache-->>Server: set OK
        Server->>Client: Return generated data
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Ote Binary Stdout Contract ❌ Error Pull request introduces OTE Binary Stdout Contract violations through process-level stdout writes in Cobra command initialization. Replace fmt.Fprintf(os.Stdout) with fmt.Fprintf(os.Stderr) in sippy-daemon and fmt.Println with fmt.Fprintln(os.Stderr) in PrintVersion function.
Docstring Coverage ⚠️ Warning Docstring coverage is 52.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Go Error Handling ❓ Inconclusive Unable to access modified source code files in sandbox environment to verify Go error handling patterns including error wrapping, nil pointer checks, and panic justification. Provide access to modified files (cache.go, load.go, refresh.go, server.go) to verify error wrapping with fmt.Errorf %w, cacheClient nil checks, and panic call justifications.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'TRT-2564: Cache Risk Analysis responses' clearly identifies the main change: adding caching for Risk Analysis responses. It is concise, specific, and directly relates to the primary objective of the changeset.
Sql Injection Prevention ✅ Passed Pull request uses proper parameterization for all database queries via BigQuery named parameters and GORM placeholders, with no direct user input concatenation into SQL.
Excessive Css In React Should Use Styles ✅ Passed The pull request modifies exclusively Go backend code files (cmd/, pkg/, and scripts/ directories) related to caching functionality and database operations. The custom check for 'Excessive CSS in React Should Use Styles' is specifically designed for React/JavaScript components and inline CSS styling patterns. Since all changes in this PR are confined to Go server-side code with no modifications to React components in the sippy-ng/ directory, this check is not applicable to this pull request.
Single Responsibility And Clear Naming ✅ Passed Pull request maintains clear single responsibility and naming principles with action-oriented function names and descriptive types appropriately scoped to their packages.
Stable And Deterministic Test Names ✅ Passed The pull request uses only standard Go testing package, not Ginkgo. All 27 test functions follow func Test* naming convention with static, descriptive names containing no dynamic information.
Test Structure And Quality ✅ Passed PR contains standard Go tests with testing.T, not Ginkgo BDD-style tests. Custom check for Ginkgo tests is not applicable.
Microshift Test Compatibility ✅ Passed This pull request does not introduce any new Ginkgo e2e tests. The repository is the Sippy performance monitoring tool, not a Kubernetes test suite. Only unit tests in standard Go testing package were modified.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR contains only standard Go unit tests using testing.T framework, not Ginkgo e2e tests. SNO compatibility check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only Go application code without introducing Kubernetes manifests or operator code, making it out of scope for topology-aware scheduling compatibility assessment.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The PR does not add any Ginkgo e2e tests. The only test file added contains standard Go unit tests with mocked dependencies.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot

openshift-ci-robot commented Apr 13, 2026

@sosiouxme: This pull request references TRT-2564 which is a valid jira issue.

Details

In response to this:

Risk Analysis queries are fairly heavy and can impact Sippy DB performance. During regression periods we are likely to see an increase in repetitive calls for the same job / test failures. So this adds a caching mechanism.

Both RA and Sippy classic test pages use the same underlying query mechanism, so the same cache is introduced for both (replacing the existing request-level cache for the test API). This caches the test queries for an hour by default, but cache entries are invalidated if the underlying matview is refreshed, so new data is surfaced faster.

  1. Sippy's matview refresh code is updated to write a timestamp to redis whenever a matview refresh completes.
  2. Query results are written to the cache with a timestamp.
  3. On a cache hit, if the underlying matview timestamp is newer than the result timestamp, it is treated as a miss and regenerated. (In the case where the matview timestamp is missing, cached data is used until it expires normally.)

Individual commits separate the concepts that built this change, so reviewing them serially may be easiest.

openshift/origin#30965 proposes one simple way to benchmark risk analysis requests.

Summary by CodeRabbit

  • New Features

  • Added caching infrastructure for test results and job run risk analysis to improve response times.

  • Implemented automatic cache invalidation tracking for materialized view refreshes to ensure data freshness.

  • Removals

  • Removed /api/canary endpoint.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/api/tests.go (2)

686-690: ⚠️ Potential issue | 🟡 Minor

Fix the copied completion log label.

This is the BigQuery path, but the message still says "buildTestsResultsFromPostgres completed". That makes latency/debug logs ambiguous when you're comparing the BQ and Postgres builders.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/tests.go` around lines 686 - 690, The completion log message is wrong
for the BigQuery path: update the Debug message in the log.WithFields block that
currently says "buildTestsResultsFromPostgres completed" to the correct label
(e.g., "buildTestsResultsFromBigQuery completed") so the elapsed/testReports
metrics are clearly attributed; locate the logging call that uses elapsed and
testReports (the log.WithFields(...) followed by .Debug(...)) and replace the
string to match the BigQuery builder function name.

322-335: ⚠️ Potential issue | 🟠 Major

Normalize equivalent period values before they reach the cache key.

buildTestsResultsFromPostgres only treats "twoDay" specially, so "", "default", and "current" all hit the same 7-day matview. Right now those values are serialized into different TestResultsSpec cache keys, which fragments the cache and cuts the hit rate for equivalent requests.

Suggested fix
 	period := req.URL.Query().Get("period")
 	if period != "" && period != "default" && period != "current" && period != "twoDay" {
 		RespondWithJSON(http.StatusBadRequest, w, map[string]interface{}{"code": http.StatusBadRequest, "message": "Unknown period"})
 		return
 	}
+	if period == "" || period == "default" {
+		period = "current"
+	}
 
 	spec := TestResultsSpec{
 		Release:        release,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/tests.go` around lines 322 - 335, Normalize the incoming period value
to a canonical token before constructing TestResultsSpec so equivalent periods
share the same cache key: map ""/"default"/"current" to the same normalized
value (e.g., ""), leaving "twoDay" untouched, then use that normalized variable
when building the TestResultsSpec and calling
spec.buildTestsResultsFromPostgres(dbc, cacheClient); update the validation
logic that currently checks period to still allow the accepted inputs but
perform the normalization immediately prior to creating the TestResultsSpec.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/sippy/load.go`:
- Around line 148-152: In the cache initialization block around GetCacheClient,
fix the error variable mismatch by wrapping and returning cacheErr (not err)
when cacheErr != nil; update references in the component-readiness-cache path
and any early-return logic so that errors.Wrap or the returned error uses
cacheErr with a message like "couldn't get cache client", and keep cacheClient
nil when cacheErr is present to preserve existing error hygiene.

In `@cmd/sippy/refresh.go`:
- Around line 4-6: The current cache setup turns a best-effort write into a hard
failure; instead of returning the error when initializing the cache client, log
the error (using logrus) and set cacheClient = nil so RefreshData can proceed
(RefreshData already handles a nil cacheClient). Update the cache initialization
code that currently returns the error (around the cacheClient
setup/WriteTimestamp call) to log the failure and continue, ensuring calls to
RefreshData use the possibly-nil cacheClient.

In `@pkg/api/cache_test.go`:
- Around line 637-640: The test is supposed to hit the equality branch but loses
it because seedRefreshTimestamp serializes times with time.RFC3339 (no
sub-second precision) while cachedAt usually has nanoseconds; fix by normalizing
cachedAt to the same precision before seeding so the parsed refresh time equals
cachedAt — e.g., truncate or round cachedAt to seconds (or format+parse with
time.RFC3339) so the value passed to seedMatviewCache/cachedAt matches what
seedRefreshTimestamp writes for testMatview in mc and the equality branch is
exercised.

In `@pkg/api/cache.go`:
- Around line 212-215: The current comparison in the cache invalidation logic
uses lastRefresh.Before(cacheVal.Timestamp) which treats equal timestamps as
stale; change the condition to allow equality (i.e., treat refresh newer than
cache as miss, but equal or older as a hit). Locate the block around the cache
check (where cacheVal.Timestamp and lastRefresh are compared) and replace the
strict Before check with a comparison that returns the cached value when
lastRefresh is equal or before cacheVal.Timestamp (for example, use a
!lastRefresh.After(cacheVal.Timestamp) style check) so equal timestamps reuse
the cache.

In `@pkg/api/tests.go`:
- Around line 444-458: Change buildTestsResultsFromPostgres to accept and use a
request context and pass it into GetDataFromCacheOrMatview instead of
context.TODO(); update the local generator closure to call
spec.buildTestsResults with that ctx (so the closure uses the same context) and
ensure the GetDataFromCacheOrMatview invocation uses the incoming ctx.
Specifically, modify the signature of buildTestsResultsFromPostgres to take ctx
context.Context, change generator := func(ctx context.Context) ... to
capture/forward the outer ctx to spec.buildTestsResults, and replace the
context.TODO() argument in the GetDataFromCacheOrMatview call with the incoming
ctx; update any callers of buildTestsResultsFromPostgres accordingly.

In `@scripts/updatesuites.go`:
- Around line 153-156: The call to sippyserver.RefreshData(&db.DB{DB: dbc}, nil,
false) passes nil for the cache client so the new matview refresh timestamps are
never written; change RefreshData usage so it either receives a real cache
client or makes the script explicitly offline-only: locate
sippyserver.RefreshData and the call site using db.DB and dbc, then either (A)
construct and pass the appropriate cache client (Redis client/adapter used
elsewhere in the app) instead of nil so timestamps are persisted, or (B) add a
guard in this script to refuse to run against a live deployment (e.g., require
an --offline flag or check an env var) and return an error if no offline mode is
set, documenting that RefreshData must not be invoked with a nil cache in
production.

---

Outside diff comments:
In `@pkg/api/tests.go`:
- Around line 686-690: The completion log message is wrong for the BigQuery
path: update the Debug message in the log.WithFields block that currently says
"buildTestsResultsFromPostgres completed" to the correct label (e.g.,
"buildTestsResultsFromBigQuery completed") so the elapsed/testReports metrics
are clearly attributed; locate the logging call that uses elapsed and
testReports (the log.WithFields(...) followed by .Debug(...)) and replace the
string to match the BigQuery builder function name.
- Around line 322-335: Normalize the incoming period value to a canonical token
before constructing TestResultsSpec so equivalent periods share the same cache
key: map ""/"default"/"current" to the same normalized value (e.g., ""), leaving
"twoDay" untouched, then use that normalized variable when building the
TestResultsSpec and calling spec.buildTestsResultsFromPostgres(dbc,
cacheClient); update the validation logic that currently checks period to still
allow the accepted inputs but perform the normalization immediately prior to
creating the TestResultsSpec.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 57ca2b8b-6b2a-4111-a4fa-a6ccd9c7f29a

📥 Commits

Reviewing files that changed from the base of the PR and between c14224a and 55b4ca0.

📒 Files selected for processing (11)
  • cmd/sippy-daemon/main.go
  • cmd/sippy/load.go
  • cmd/sippy/refresh.go
  • cmd/sippy/seed_data.go
  • pkg/api/cache.go
  • pkg/api/cache_test.go
  • pkg/api/job_runs.go
  • pkg/api/tests.go
  • pkg/sippyserver/pr_commenting_processor.go
  • pkg/sippyserver/server.go
  • scripts/updatesuites.go

Comment thread cmd/sippy/load.go
Comment thread cmd/sippy/refresh.go Outdated
Comment thread pkg/api/cache_test.go Outdated
Comment on lines +637 to +640
cachedAt := time.Now().UTC().Add(-5 * time.Minute)
seedMatviewCache(t, mc, spec, testResult{Value: "cached"}, cachedAt)
// Refresh at the exact same time as the cache entry
seedRefreshTimestamp(mc, testMatview, cachedAt)
Contributor


⚠️ Potential issue | 🟡 Minor

Make this test hit the equality branch.

seedRefreshTimestamp serializes with time.RFC3339, so the refresh value loses sub-second precision. As written, cachedAt usually still carries nanos, which means the parsed refresh time is before the cached time, not equal to it. This test can pass without ever exercising the equality case.

Suggested fix
-	cachedAt := time.Now().UTC().Add(-5 * time.Minute)
+	cachedAt := time.Now().UTC().Add(-5 * time.Minute).Truncate(time.Second)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cachedAt := time.Now().UTC().Add(-5 * time.Minute)
seedMatviewCache(t, mc, spec, testResult{Value: "cached"}, cachedAt)
// Refresh at the exact same time as the cache entry
seedRefreshTimestamp(mc, testMatview, cachedAt)
cachedAt := time.Now().UTC().Add(-5 * time.Minute).Truncate(time.Second)
seedMatviewCache(t, mc, spec, testResult{Value: "cached"}, cachedAt)
// Refresh at the exact same time as the cache entry
seedRefreshTimestamp(mc, testMatview, cachedAt)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/cache_test.go` around lines 637 - 640, The test is supposed to hit
the equality branch but loses it because seedRefreshTimestamp serializes times
with time.RFC3339 (no sub-second precision) while cachedAt usually has
nanoseconds; fix by normalizing cachedAt to the same precision before seeding so
the parsed refresh time equals cachedAt — e.g., truncate or round cachedAt to
seconds (or format+parse with time.RFC3339) so the value passed to
seedMatviewCache/cachedAt matches what seedRefreshTimestamp writes for
testMatview in mc and the equality branch is exercised.

Comment thread pkg/api/cache.go Outdated
Comment on lines +212 to +215
if lastRefresh.Before(cacheVal.Timestamp) {
	// not invalidated by a newer refresh, so use it (if we don't know the last refresh, still use it)
	return cacheVal.Val, nil
}
Contributor

@coderabbitai coderabbitai bot Apr 13, 2026


⚠️ Potential issue | 🟡 Minor

Allow equal timestamps to reuse the cache.

Line 212 uses a strict Before check, so a refresh timestamp equal to the cached timestamp is treated as stale. The invalidation rule described in this PR is "refresh newer than cache => miss", so equality should still return the cached value.

Suggested fix
-		if lastRefresh.Before(cacheVal.Timestamp) {
+		if !lastRefresh.After(cacheVal.Timestamp) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if lastRefresh.Before(cacheVal.Timestamp) {
	// not invalidated by a newer refresh, so use it (if we don't know the last refresh, still use it)
	return cacheVal.Val, nil
}
if !lastRefresh.After(cacheVal.Timestamp) {
	// not invalidated by a newer refresh, so use it (if we don't know the last refresh, still use it)
	return cacheVal.Val, nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/cache.go` around lines 212 - 215, The current comparison in the cache
invalidation logic uses lastRefresh.Before(cacheVal.Timestamp) which treats
equal timestamps as stale; change the condition to allow equality (i.e., treat
refresh newer than cache as miss, but equal or older as a hit). Locate the block
around the cache check (where cacheVal.Timestamp and lastRefresh are compared)
and replace the strict Before check with a comparison that returns the cached
value when lastRefresh is equal or before cacheVal.Timestamp (for example, use a
!lastRefresh.After(cacheVal.Timestamp) style check) so equal timestamps reuse
the cache.

Member Author

it will make no difference in practice and i believe the PR code is more legible.

Contributor

@sosiouxme, fair point — exact timestamp equality between a matview refresh and a cache write is effectively impossible in practice, and lastRefresh.Before(cacheVal.Timestamp) does read more naturally than the double-negation alternative. Happy to defer to your judgement here.

Comment thread pkg/api/tests.go Outdated
Comment thread scripts/updatesuites.go
Comment on lines +153 to 156
// NOTE: does not update timestamps to invalidate cached matview data; not clear if the use case for this script requires that.
sippyserver.RefreshData(&db.DB{
DB: dbc,
}, nil, false)
Contributor

@coderabbitai coderabbitai bot Apr 13, 2026

⚠️ Potential issue | 🟡 Minor

This still skips cache invalidation after the migration runs.

RefreshData(..., nil, ...) means the matview refresh timestamps introduced in this PR are never written, so Redis-backed RA/test caches can keep serving pre-migration data until TTL expiry. If this script can be run against a live Sippy deployment, it should either plumb a real cache client or be explicitly limited to offline use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/updatesuites.go` around lines 153 - 156, The call to
sippyserver.RefreshData(&db.DB{DB: dbc}, nil, false) passes nil for the cache
client so the new matview refresh timestamps are never written; change
RefreshData usage so it either receives a real cache client or makes the script
explicitly offline-only: locate sippyserver.RefreshData and the call site using
db.DB and dbc, then either (A) construct and pass the appropriate cache client
(Redis client/adapter used elsewhere in the app) instead of nil so timestamps
are persisted, or (B) add a guard in this script to refuse to run against a live
deployment (e.g., require an --offline flag or check an env var) and return an
error if no offline mode is set, documenting that RefreshData must not be
invoked with a nil cache in production.

Member Author

that is fine with me. i doubt anyone uses this script against production. if we do and anyone cares, we'll have to add the ability to specify the cache.

Contributor

@sosiouxme, understood — the inline NOTE comment already documents the limitation clearly, so anyone who needs to add cache support later has the context. That's a reasonable call.
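Option (B) from the bot's suggestion, an explicit offline acknowledgement, could look roughly like this. The flag name and helper are hypothetical; the script does not implement this today:

```go
package main

import (
	"fmt"
	"os"
)

// requireOffline refuses to proceed unless the caller explicitly acknowledges
// that this run skips cache invalidation. Purely illustrative.
func requireOffline(args []string) error {
	for _, a := range args {
		if a == "--offline" {
			return nil
		}
	}
	return fmt.Errorf("matview refresh without cache invalidation; pass --offline to acknowledge")
}

func main() {
	if err := requireOffline(os.Args[1:]); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("running offline refresh")
}
```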



Member

@petr-muller petr-muller left a comment

/hold
/lgtm
/approve
/meow

Some comments inline, nothing serious. Feel free to unhold or address, your call.

Comment thread pkg/api/cache.go
}
logrus.Debugf("matview %q refreshed at %v, will not use earlier cache entry from %v", matview, lastRefresh, cacheVal.Timestamp)
} else if strings.Contains(err.Error(), "connection refused") {
logrus.WithError(err).Fatalf("redis URL specified but got connection refused; exiting due to cost issues in this configuration")
Member

Fatal in a library is brutal 😁

could be a networking blip but I guess we can robustify if we run into this too often

Member Author

i do not know the history of why it was done this way and it did not seem likely to be a common problem that a blip would kill the server (but maybe it's been happening all the time and we just didn't notice because they're deployed redundantly)

Comment thread cmd/sippy/load.go Outdated
Member

Suggested change
return errors.Wrap(cacheErr, "couldn't get cache client")

an LLM pointed me here, it's not that I read all code

Comment thread pkg/api/tests.go Outdated
"elapsed": elapsed,
"reports": len(testReports),
}).Debug("BuildTestsResults completed")
}).Debug("buildTestsResultsFromPostgres completed")
Member

Suggested change
}).Debug("buildTestsResultsFromPostgres completed")
}).Debug("buildTestsResultsFromBigQuery completed")

Member Author

oops

Comment thread pkg/api/tests.go Outdated
generator := func(ctx context.Context) (testResults, []error) {
return spec.buildTestsResults(dbc, matview)
}
result, errs := GetDataFromCacheOrMatview(context.TODO(), cacheClient,
Member

there should be a request somewhere upstream, which means there should be a ctx somewhere too, would it be hard to plumb it here to avoid the context.TODO()?

Member Author

agreed, i guess i was being lazy

Comment thread pkg/api/tests.go Outdated
"elapsed": elapsed,
"reports": len(testReports),
}).Debug("BuildTestsResults completed")
}).Debug("buildTestsResultsFromPostgres completed")
Member

Suggested change
}).Debug("buildTestsResultsFromPostgres completed")
}).Debug("buildTestsResults completed")

Member Author

yeah... actually it needs a name change anyway.

@openshift-ci
Contributor

openshift-ci bot commented Apr 13, 2026

@petr-muller: cat image

Details

In response to this:

/hold
/lgtm
/approve
/meow

Some comments inline, nothing serious. Feel free to unhold or address, your call.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 13, 2026
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 13, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, sosiouxme

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [petr-muller,sosiouxme]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member Author

@sosiouxme sosiouxme left a comment

edits incoming


fix silly error found by coderabbit
as coderabbit pointed out, we probably don't really want a caching
problem to prevent running the matview refreshes.
practically speaking, it doesn't matter
…uery

gives a chance to cancel the query if the server request is canceled.
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2026

New changes are detected. LGTM label has been removed.

@openshift-ci-robot

openshift-ci-robot commented Apr 15, 2026

@sosiouxme: This pull request references TRT-2564 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "4.22" instead.

Details

In response to this:

Risk Analysis queries are fairly heavy and can impact Sippy DB performance. During regression periods we are likely to see an increase in repetitive calls for the same job / test failures. So this adds a caching mechanism.

Both RA and Sippy classic test pages use the same underlying query mechanism, so the same cache is introduced for both (replacing the existing request-level cache for the test API). This caches the test queries for an hour by default, but cache entries are invalidated if the underlying matview is refreshed, so new data is surfaced faster.

  1. Sippy's matview refresh code is updated to write a timestamp to redis whenever a matview refresh completes.
  2. Query results are written to the cache with a timestamp.
  3. On a cache hit, if the underlying matview timestamp is newer than the result timestamp, it is treated as a miss and regenerated. (In the case where the matview timestamp is missing, cached data is used until it expires normally.)

Individual commits separate the concepts that built this change, so reviewing them serially may be easiest.

openshift/origin#30965 proposes one simple way to benchmark risk analysis requests.

Summary by CodeRabbit

  • New Features

  • Added caching across test results, job-run risk analysis, and API responses to improve performance and use cache-aware refresh timestamps for matview freshness.

  • CLI commands for refresh/seed now support optional cache usage.

  • Changed

  • Test and risk endpoints now leverage cache when available.

  • Removals

  • Removed /api/canary endpoint.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
pkg/api/cache.go (1)

220-221: ⚠️ Potential issue | 🟠 Major

Don't Fatal from this cache helper.

GetDataFromCacheOrMatview is used on request paths; a transient Redis outage here should degrade to regeneration, not terminate the process.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/cache.go` around lines 220 - 221, In GetDataFromCacheOrMatview, do
not call logrus.Fatalf (which exits the process) on a transient Redis
"connection refused" error; instead log non-fatally (e.g.,
logrus.WithError(err).Warnf or Errorf) and treat it as a cache miss by returning
control to the caller so the data can be regenerated (e.g., return a nil cache
result / an error that the caller interprets as "no cache"). Replace the Fatalf
call in the connection-refused branch with a non-fatal log and adjust the
function return path so it does not terminate the process.
🧹 Nitpick comments (1)
cmd/sippy/load.go (1)

148-152: Log the degraded cache mode here.

Swallowing cacheErr is fine for keeping the load moving, but it also means later matview refreshes run with cacheClient == nil, so refresh timestamps are never recorded and cached RA/test responses can stay stale until TTL expiry. Please surface this path with at least a warning/metric so a successful load does not hide disabled invalidation.

Possible tweak
 // likewise get a cache client if possible, though some things operate without it.
 cacheClient, cacheErr := f.CacheFlags.GetCacheClient()
 if cacheErr != nil {
+	log.WithError(cacheErr).Warn("cache unavailable; matview refreshes will skip invalidation timestamp writes")
 	cacheClient = nil // error hygiene, since we pass this down to quite a few functions
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/sippy/load.go` around lines 148 - 152, The code swallows cacheErr after
calling f.CacheFlags.GetCacheClient(), which hides degraded-cache mode; update
the block around f.CacheFlags.GetCacheClient() so that when cacheErr != nil you
log a warning (including cacheErr.Error()) and/or emit a metric indicating cache
invalidation is disabled before setting cacheClient = nil; reference the
cacheClient and cacheErr variables and the GetCacheClient call so the log/metric
is placed right where cacheErr is handled to surface the degraded cache path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/api/cache.go`:
- Around line 201-202: The json.Unmarshal error handling in the cache retrieval
path should treat corrupt cache entries as cache-misses instead of returning an
error; modify the block that unmarshals into cacheVal to log or warn about the
unmarshal failure (include error and cacheKey), optionally evict the bad entry,
and then fall through to call generateFn to produce the value and continue
normal execution rather than returning defaultVal and an error slice; update the
code that currently references json.Unmarshal, cacheVal, cacheKey, defaultVal,
and generateFn so the function returns the generated value (and no error) on
corrupt cache blobs.

In `@pkg/api/tests.go`:
- Around line 533-537: The cache key currently uses a constant
NewCacheSpec(testResultsBQ{}, "BigQueryTestsResults~", nil), causing all
BigQuery /api/tests requests to share one entry; update the
GetDataFromCacheOrGenerate call to build the cache spec from the actual request
spec (include release, period, filter, collapse or the request
struct/stringified key) so each unique request produces a unique NewCacheSpec
(replace the constant testResultsBQ{} key with the request-derived key) while
keeping the same cache, RequestOptions (testResultsCacheDuration) and the call
to GetDataFromCacheOrGenerate.

---

Duplicate comments:
In `@pkg/api/cache.go`:
- Around line 220-221: In GetDataFromCacheOrMatview, do not call logrus.Fatalf
(which exits the process) on a transient Redis "connection refused" error;
instead log non-fatally (e.g., logrus.WithError(err).Warnf or Errorf) and treat
it as a cache miss by returning control to the caller so the data can be
regenerated (e.g., return a nil cache result / an error that the caller
interprets as "no cache"). Replace the Fatalf call in the connection-refused
branch with a non-fatal log and adjust the function return path so it does not
terminate the process.

---

Nitpick comments:
In `@cmd/sippy/load.go`:
- Around line 148-152: The code swallows cacheErr after calling
f.CacheFlags.GetCacheClient(), which hides degraded-cache mode; update the block
around f.CacheFlags.GetCacheClient() so that when cacheErr != nil you log a
warning (including cacheErr.Error()) and/or emit a metric indicating cache
invalidation is disabled before setting cacheClient = nil; reference the
cacheClient and cacheErr variables and the GetCacheClient call so the log/metric
is placed right where cacheErr is handled to surface the degraded cache path.
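The corrupt-cache-entry handling described for pkg/api/cache.go (treat a blob that fails to unmarshal as a miss rather than an error) can be sketched as follows; names are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type cachedVal struct {
	Val       string `json:"val"`
	Timestamp int64  `json:"timestamp"`
}

// decodeOrMiss treats a cache blob that fails to unmarshal as a cache miss:
// warn and fall through to the generator instead of surfacing an error.
// Illustrative sketch of the behavior described in the review prompt.
func decodeOrMiss(blob []byte, generate func() string) string {
	var cv cachedVal
	if err := json.Unmarshal(blob, &cv); err != nil {
		fmt.Println("warn: corrupt cache entry, regenerating:", err)
		return generate()
	}
	return cv.Val
}

func main() {
	fmt.Println(decodeOrMiss([]byte(`{"val":"cached","timestamp":1}`), func() string { return "fresh" }))
	fmt.Println(decodeOrMiss([]byte(`not json`), func() string { return "fresh" }))
}
```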

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 093e1c8b-5796-4aa8-958a-4556161ef67a

📥 Commits

Reviewing files that changed from the base of the PR and between 55b4ca0 and 49e5677.

📒 Files selected for processing (8)
  • cmd/sippy/load.go
  • cmd/sippy/refresh.go
  • pkg/api/cache.go
  • pkg/api/cache_test.go
  • pkg/api/job_runs.go
  • pkg/api/tests.go
  • pkg/apis/cache/cache.go
  • pkg/sippyserver/server.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/sippyserver/server.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • cmd/sippy/refresh.go
  • pkg/api/cache_test.go
  • pkg/api/job_runs.go

Comment thread pkg/api/cache.go Outdated
Comment thread pkg/api/tests.go Outdated
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e

this is not strictly necessary since the results were already cached at
the request level and we don't have a way to find out when the backing
tables in BQ are refreshed. but it seemed like a good idea and not too
much trouble to have the two forks work the same.
both postgres and BQ forks can use the same request processing code.
also the "period" is now standardized for caching purposes.
fix problem coderabbit noted where the cache entry doesn't unmarshal
for some reason... better to treat it like a cache miss.
@sosiouxme sosiouxme force-pushed the 20260402-TRT-2564-RA-caching branch from 49e5677 to f81c9a4 Compare April 15, 2026 22:33
@openshift-ci-robot

openshift-ci-robot commented Apr 15, 2026

@sosiouxme: This pull request references TRT-2564 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "4.22" instead.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/api/cache.go (1)

140-146: ⚠️ Potential issue | 🟡 Minor

Don't let rounding make Expiry a no-op.

Once Line 143 runs, the duration is recomputed from now and the previously selected cacheOptions.Expiry is discarded. Any caller that sets both Expiry and CRTimeRoundingFactor will get the next rounding boundary from "now", not from now + Expiry.

Suggested fix
 func CalculateRoundedCacheDuration(cacheOptions cache.RequestOptions) time.Duration {
 	// require cacheDuration for persistence logic
 	cacheDuration := defaultCacheDuration
 	if cacheOptions.Expiry > 0 {
 		cacheDuration = cacheOptions.Expiry
 	}
 	if cacheOptions.CRTimeRoundingFactor > 0 {
 		now := time.Now().UTC()
+		expiresAt := now.Add(cacheDuration)
 		// Only cache until the next rounding duration
-		cacheDuration = now.Truncate(cacheOptions.CRTimeRoundingFactor).Add(cacheOptions.CRTimeRoundingFactor).Sub(now)
+		cacheDuration = expiresAt.Truncate(cacheOptions.CRTimeRoundingFactor).Add(cacheOptions.CRTimeRoundingFactor).Sub(now)
 	}
 	return cacheDuration
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/cache.go` around lines 140 - 146, The current logic overwrites
cacheDuration set from cacheOptions.Expiry when
cacheOptions.CRTimeRoundingFactor is present, causing Expiry to be ignored;
update the CRTimeRoundingFactor branch to compute the rounding boundary from now
+ cacheDuration (use the existing cacheDuration after applying Expiry) instead
of from now so that the final cacheDuration = roundedBoundary.Sub(now).
Concretely, after setting cacheDuration from cacheOptions.Expiry, when
CRTimeRoundingFactor > 0 compute target := now.Add(cacheDuration), then set
cacheDuration =
target.Truncate(cacheOptions.CRTimeRoundingFactor).Add(cacheOptions.CRTimeRoundingFactor).Sub(now)
(or equivalent) so both cacheOptions.Expiry and
cacheOptions.CRTimeRoundingFactor are honored for cacheDuration.
pkg/api/tests.go (1)

251-257: ⚠️ Potential issue | 🟠 Major

Guard limit before slicing.

Line 253 and Line 286 only reject zero. A request like ?limit=-1 falls through to tests[:limit] and panics the handler.

Suggested fix
 func (tests TestsAPIResult) limit(req *http.Request) TestsAPIResult {
-	limit, _ := strconv.Atoi(req.URL.Query().Get("limit"))
-	if limit == 0 || len(tests) < limit {
+	limit, err := strconv.Atoi(req.URL.Query().Get("limit"))
+	if err != nil || limit <= 0 || len(tests) < limit {
 		return tests
 	}
 
 	return tests[:limit]
 }
 func (tests TestsAPIResultBQ) limit(req *http.Request) TestsAPIResultBQ {
-	limit, _ := strconv.Atoi(req.URL.Query().Get("limit"))
-	if limit == 0 || len(tests) < limit {
+	limit, err := strconv.Atoi(req.URL.Query().Get("limit"))
+	if err != nil || limit <= 0 || len(tests) < limit {
 		return tests
 	}
 
 	return tests[:limit]
 }

Also applies to: 284-290

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/api/tests.go` around lines 251 - 257, The limit parsing in
TestsAPIResult.limit allows negative or parse-failed values through and can
cause a panic when slicing (tests[:limit]); update the TestsAPIResult.limit
method to validate the parsed limit from strconv.Atoi by checking the Atoi
error, treating non-positive (<=0) or failed parses as "no limit", and clamp the
positive limit to at most len(tests) before returning tests[:limit]; apply the
same guard/clamping logic to the other identical limit-handling block in this
file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/api/tests.go`:
- Around line 443-445: The fmt.Errorf calls that format the variable errs (a
[]error) use the "%s" verb; change both occurrences to use "%v" so the error
slice prints its elements correctly (update the fmt.Errorf invocations that
reference errs in pkg/api/tests.go to use "%v" instead of "%s").

---

Outside diff comments:
In `@pkg/api/cache.go`:
- Around line 140-146: The current logic overwrites cacheDuration set from
cacheOptions.Expiry when cacheOptions.CRTimeRoundingFactor is present, causing
Expiry to be ignored; update the CRTimeRoundingFactor branch to compute the
rounding boundary from now + cacheDuration (use the existing cacheDuration after
applying Expiry) instead of from now so that the final cacheDuration =
roundedBoundary.Sub(now). Concretely, after setting cacheDuration from
cacheOptions.Expiry, when CRTimeRoundingFactor > 0 compute target :=
now.Add(cacheDuration), then set cacheDuration =
target.Truncate(cacheOptions.CRTimeRoundingFactor).Add(cacheOptions.CRTimeRoundingFactor).Sub(now)
(or equivalent) so both cacheOptions.Expiry and
cacheOptions.CRTimeRoundingFactor are honored for cacheDuration.

In `@pkg/api/tests.go`:
- Around line 251-257: The limit parsing in TestsAPIResult.limit allows negative
or parse-failed values through and can cause a panic when slicing
(tests[:limit]); update the TestsAPIResult.limit method to validate the parsed
limit from strconv.Atoi by checking the Atoi error, treating non-positive (<=0)
or failed parses as "no limit", and clamp the positive limit to at most
len(tests) before returning tests[:limit]; apply the same guard/clamping logic
to the other identical limit-handling block in this file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 80e6632c-b39c-4fe5-8870-31f91c66ec69

📥 Commits

Reviewing files that changed from the base of the PR and between 49e5677 and f81c9a4.

📒 Files selected for processing (4)
  • pkg/api/cache.go
  • pkg/api/tests.go
  • pkg/apis/cache/cache.go
  • pkg/sippyserver/server.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/apis/cache/cache.go
  • pkg/sippyserver/server.go

Comment thread pkg/api/tests.go
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e

@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

@sosiouxme: This pull request references TRT-2564 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "4.22" instead.


@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e

@openshift-ci
Contributor

openshift-ci bot commented Apr 16, 2026

@sosiouxme: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants