TRT-2564: Cache Risk Analysis responses #3409
sosiouxme wants to merge 16 commits into openshift:main from
Conversation
🤖 Assisted by Claude Code
🤖 Assisted by Claude Code
also minor refactoring to prep for caching 🤖 Assisted by Claude Code
... which replaces caching the tests at the api response
Pipeline controller notification: For optional jobs, comment. This repository is configured in automatic mode.
@sosiouxme: This pull request references TRT-2564 which is a valid jira issue.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest-required
Scheduling required tests:
@coderabbitai help
Chat: There are 3 ways to chat with CodeRabbit:
- CodeRabbit commands
- Other keywords and placeholders
- Status, support, documentation and community
@coderabbitai review
✅ Actions performed: Review triggered.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
Walkthrough
Threads a cache client through commands, server, processors, and API flows; adds generic GetDataFromCacheOrMatview and RefreshMatviewKey helpers; records matview refresh timestamps in cache; adapts function signatures and command flags to accept and propagate a cache client.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Server
participant Cache
participant DB
participant BigQuery
Client->>Server: Request data (e.g., /api/tests)
Server->>Cache: GET RefreshMatviewKey(matview)
Cache-->>Server: refresh_time or nil
alt refresh_time missing OR cache_timestamp >= refresh_time
Server->>Cache: GET cached data key
Cache-->>Server: cached_value + cache_timestamp or MISS
alt cache hit and valid
Server->>Client: Return cached_value
else
Server->>DB: Query matview (or BigQuery fallback)
DB-->>Server: query_result
Server->>Cache: SET cached data key = {Val, Timestamp=now} (with cacheDuration)
Cache-->>Server: set OK
Server->>Client: Return generated data
end
else refresh_time > cache_timestamp
Server->>DB: Query matview (or BigQuery)
DB-->>Server: query_result
Server->>Cache: SET RefreshMatviewKey(matview) = now (TTL 24h)
Cache-->>Server: set OK
Server->>Cache: SET cached data key = {Val, Timestamp=now} (with cacheDuration)
Cache-->>Server: set OK
Server->>Client: Return generated data
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning, 1 inconclusive)
✅ Passed checks (11 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.11.4)
Error: can't load config: unsupported version of the configuration: ""
See https://golangci-lint.run/docs/product/migration-guide for migration instructions
@sosiouxme: This pull request references TRT-2564 which is a valid jira issue.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/api/tests.go (2)
686-690: ⚠️ Potential issue | 🟡 Minor
Fix the copied completion log label.
This is the BigQuery path, but the message still says "buildTestsResultsFromPostgres completed". That makes latency/debug logs ambiguous when you're comparing the BQ and Postgres builders.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/api/tests.go` around lines 686 - 690, The completion log message is wrong for the BigQuery path: update the Debug message in the log.WithFields block that currently says "buildTestsResultsFromPostgres completed" to the correct label (e.g., "buildTestsResultsFromBigQuery completed") so the elapsed/testReports metrics are clearly attributed; locate the logging call that uses elapsed and testReports (the log.WithFields(...) followed by .Debug(...)) and replace the string to match the BigQuery builder function name.
322-335: ⚠️ Potential issue | 🟠 Major
Normalize equivalent `period` values before they reach the cache key.
`buildTestsResultsFromPostgres` only treats "twoDay" specially, so "", "default", and "current" all hit the same 7-day matview. Right now those values are serialized into different `TestResultsSpec` cache keys, which fragments the cache and cuts the hit rate for equivalent requests.
Suggested fix

    period := req.URL.Query().Get("period")
    if period != "" && period != "default" && period != "current" && period != "twoDay" {
        RespondWithJSON(http.StatusBadRequest, w, map[string]interface{}{"code": http.StatusBadRequest, "message": "Unknown period"})
        return
    }
    + if period == "" || period == "default" {
    +     period = "current"
    + }
    spec := TestResultsSpec{
        Release: release,

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/api/tests.go` around lines 322 - 335, Normalize the incoming period value to a canonical token before constructing TestResultsSpec so equivalent periods share the same cache key: map ""/"default"/"current" to the same normalized value (e.g., ""), leaving "twoDay" untouched, then use that normalized variable when building the TestResultsSpec and calling spec.buildTestsResultsFromPostgres(dbc, cacheClient); update the validation logic that currently checks period to still allow the accepted inputs but perform the normalization immediately prior to creating the TestResultsSpec.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/sippy/load.go`:
- Around line 148-152: In the cache initialization block around GetCacheClient,
fix the error variable mismatch by wrapping and returning cacheErr (not err)
when cacheErr != nil; update references in the component-readiness-cache path
and any early-return logic so that errors.Wrap or the returned error uses
cacheErr with a message like "couldn't get cache client", and keep cacheClient
nil when cacheErr is present to preserve existing error hygiene.
In `@cmd/sippy/refresh.go`:
- Around line 4-6: The current cache setup turns a best-effort write into a hard
failure; instead of returning the error when initializing the cache client, log
the error (using logrus) and set cacheClient = nil so RefreshData can proceed
(RefreshData already handles a nil cacheClient). Update the cache initialization
code that currently returns the error (around the cacheClient
setup/WriteTimestamp call) to log the failure and continue, ensuring calls to
RefreshData use the possibly-nil cacheClient.
In `@pkg/api/cache_test.go`:
- Around line 637-640: The test is supposed to hit the equality branch but loses
it because seedRefreshTimestamp serializes times with time.RFC3339 (no
sub-second precision) while cachedAt usually has nanoseconds; fix by normalizing
cachedAt to the same precision before seeding so the parsed refresh time equals
cachedAt — e.g., truncate or round cachedAt to seconds (or format+parse with
time.RFC3339) so the value passed to seedMatviewCache/cachedAt matches what
seedRefreshTimestamp writes for testMatview in mc and the equality branch is
exercised.
In `@pkg/api/cache.go`:
- Around line 212-215: The current comparison in the cache invalidation logic
uses lastRefresh.Before(cacheVal.Timestamp) which treats equal timestamps as
stale; change the condition to allow equality (i.e., treat refresh newer than
cache as miss, but equal or older as a hit). Locate the block around the cache
check (where cacheVal.Timestamp and lastRefresh are compared) and replace the
strict Before check with a comparison that returns the cached value when
lastRefresh is equal or before cacheVal.Timestamp (for example, use a
!lastRefresh.After(cacheVal.Timestamp) style check) so equal timestamps reuse
the cache.
In `@pkg/api/tests.go`:
- Around line 444-458: Change buildTestsResultsFromPostgres to accept and use a
request context and pass it into GetDataFromCacheOrMatview instead of
context.TODO(); update the local generator closure to call
spec.buildTestsResults with that ctx (so the closure uses the same context) and
ensure the GetDataFromCacheOrMatview invocation uses the incoming ctx.
Specifically, modify the signature of buildTestsResultsFromPostgres to take ctx
context.Context, change generator := func(ctx context.Context) ... to
capture/forward the outer ctx to spec.buildTestsResults, and replace the
context.TODO() argument in the GetDataFromCacheOrMatview call with the incoming
ctx; update any callers of buildTestsResultsFromPostgres accordingly.
In `@scripts/updatesuites.go`:
- Around line 153-156: The call to sippyserver.RefreshData(&db.DB{DB: dbc}, nil,
false) passes nil for the cache client so the new matview refresh timestamps are
never written; change RefreshData usage so it either receives a real cache
client or makes the script explicitly offline-only: locate
sippyserver.RefreshData and the call site using db.DB and dbc, then either (A)
construct and pass the appropriate cache client (Redis client/adapter used
elsewhere in the app) instead of nil so timestamps are persisted, or (B) add a
guard in this script to refuse to run against a live deployment (e.g., require
an --offline flag or check an env var) and return an error if no offline mode is
set, documenting that RefreshData must not be invoked with a nil cache in
production.
---
Outside diff comments:
In `@pkg/api/tests.go`:
- Around line 686-690: The completion log message is wrong for the BigQuery
path: update the Debug message in the log.WithFields block that currently says
"buildTestsResultsFromPostgres completed" to the correct label (e.g.,
"buildTestsResultsFromBigQuery completed") so the elapsed/testReports metrics
are clearly attributed; locate the logging call that uses elapsed and
testReports (the log.WithFields(...) followed by .Debug(...)) and replace the
string to match the BigQuery builder function name.
- Around line 322-335: Normalize the incoming period value to a canonical token
before constructing TestResultsSpec so equivalent periods share the same cache
key: map ""/"default"/"current" to the same normalized value (e.g., ""), leaving
"twoDay" untouched, then use that normalized variable when building the
TestResultsSpec and calling spec.buildTestsResultsFromPostgres(dbc,
cacheClient); update the validation logic that currently checks period to still
allow the accepted inputs but perform the normalization immediately prior to
creating the TestResultsSpec.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 57ca2b8b-6b2a-4111-a4fa-a6ccd9c7f29a
📒 Files selected for processing (11)
cmd/sippy-daemon/main.go
cmd/sippy/load.go
cmd/sippy/refresh.go
cmd/sippy/seed_data.go
pkg/api/cache.go
pkg/api/cache_test.go
pkg/api/job_runs.go
pkg/api/tests.go
pkg/sippyserver/pr_commenting_processor.go
pkg/sippyserver/server.go
scripts/updatesuites.go
    cachedAt := time.Now().UTC().Add(-5 * time.Minute)
    seedMatviewCache(t, mc, spec, testResult{Value: "cached"}, cachedAt)
    // Refresh at the exact same time as the cache entry
    seedRefreshTimestamp(mc, testMatview, cachedAt)
Make this test hit the equality branch.
seedRefreshTimestamp serializes with time.RFC3339, so the refresh value loses sub-second precision. As written, cachedAt usually still carries nanos, which means the parsed refresh time is before the cached time, not equal to it. This test can pass without ever exercising the equality case.
Suggested fix
- cachedAt := time.Now().UTC().Add(-5 * time.Minute)
+ cachedAt := time.Now().UTC().Add(-5 * time.Minute).Truncate(time.Second)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    cachedAt := time.Now().UTC().Add(-5 * time.Minute).Truncate(time.Second)
    seedMatviewCache(t, mc, spec, testResult{Value: "cached"}, cachedAt)
    // Refresh at the exact same time as the cache entry
    seedRefreshTimestamp(mc, testMatview, cachedAt)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/api/cache_test.go` around lines 637 - 640, The test is supposed to hit
the equality branch but loses it because seedRefreshTimestamp serializes times
with time.RFC3339 (no sub-second precision) while cachedAt usually has
nanoseconds; fix by normalizing cachedAt to the same precision before seeding so
the parsed refresh time equals cachedAt — e.g., truncate or round cachedAt to
seconds (or format+parse with time.RFC3339) so the value passed to
seedMatviewCache/cachedAt matches what seedRefreshTimestamp writes for
testMatview in mc and the equality branch is exercised.
    if lastRefresh.Before(cacheVal.Timestamp) {
        // not invalidated by a newer refresh, so use it (if we don't know the last refresh, still use it)
        return cacheVal.Val, nil
    }
Allow equal timestamps to reuse the cache.
Line 212 uses a strict Before check, so a refresh timestamp equal to the cached timestamp is treated as stale. The invalidation rule described in this PR is "refresh newer than cache => miss", so equality should still return the cached value.
Suggested fix
- if lastRefresh.Before(cacheVal.Timestamp) {
+ if !lastRefresh.After(cacheVal.Timestamp) {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    if !lastRefresh.After(cacheVal.Timestamp) {
        // not invalidated by a newer refresh, so use it (if we don't know the last refresh, still use it)
        return cacheVal.Val, nil
    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/api/cache.go` around lines 212 - 215, The current comparison in the cache
invalidation logic uses lastRefresh.Before(cacheVal.Timestamp) which treats
equal timestamps as stale; change the condition to allow equality (i.e., treat
refresh newer than cache as miss, but equal or older as a hit). Locate the block
around the cache check (where cacheVal.Timestamp and lastRefresh are compared)
and replace the strict Before check with a comparison that returns the cached
value when lastRefresh is equal or before cacheVal.Timestamp (for example, use a
!lastRefresh.After(cacheVal.Timestamp) style check) so equal timestamps reuse
the cache.
it will make no difference in practice and i believe the PR code is more legible.
@sosiouxme, fair point — exact timestamp equality between a matview refresh and a cache write is effectively impossible in practice, and lastRefresh.Before(cacheVal.Timestamp) does read more naturally than the double-negation alternative. Happy to defer to your judgement here.
    // NOTE: does not update timestamps to invalidate cached matview data; not clear if the use case for this script requires that.
    sippyserver.RefreshData(&db.DB{
        DB: dbc,
    }, nil, false)
This still skips cache invalidation after the migration runs.
RefreshData(..., nil, ...) means the matview refresh timestamps introduced in this PR are never written, so Redis-backed RA/test caches can keep serving pre-migration data until TTL expiry. If this script can be run against a live Sippy deployment, it should either plumb a real cache client or be explicitly limited to offline use.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/updatesuites.go` around lines 153 - 156, The call to
sippyserver.RefreshData(&db.DB{DB: dbc}, nil, false) passes nil for the cache
client so the new matview refresh timestamps are never written; change
RefreshData usage so it either receives a real cache client or makes the script
explicitly offline-only: locate sippyserver.RefreshData and the call site using
db.DB and dbc, then either (A) construct and pass the appropriate cache client
(Redis client/adapter used elsewhere in the app) instead of nil so timestamps
are persisted, or (B) add a guard in this script to refuse to run against a live
deployment (e.g., require an --offline flag or check an env var) and return an
error if no offline mode is set, documenting that RefreshData must not be
invoked with a nil cache in production.
that is fine with me. i doubt anyone uses this script against production. if we do and anyone cares, we'll have to add the ability to specify the cache.
@sosiouxme, understood — the inline NOTE comment already documents the limitation clearly, so anyone who needs to add cache support later has the context. That's a reasonable call.
📝 Learnings were identified and not saved (knowledge base disabled).
petr-muller left a comment
/hold
/lgtm
/approve
/meow
Some comments inline, nothing serious. Feel free to unhold or address, your call.
    }
    logrus.Debugf("matview %q refreshed at %v, will not use earlier cache entry from %v", matview, lastRefresh, cacheVal.Timestamp)
    } else if strings.Contains(err.Error(), "connection refused") {
        logrus.WithError(err).Fatalf("redis URL specified but got connection refused; exiting due to cost issues in this configuration")
Fatal in a library is brutal 😁
could be a networking blip but I guess we can robustify if we run into this too often
i do not know the history of why it was done this way and it did not seem likely to be a common problem that a blip would kill the server (but maybe it's been happening all the time and we just didn't notice because they're deployed redundantly)
    return errors.Wrap(cacheErr, "couldn't get cache client")
an LLM pointed me here, it's not that I read all code
        "elapsed": elapsed,
        "reports": len(testReports),
-   }).Debug("BuildTestsResults completed")
+   }).Debug("buildTestsResultsFromPostgres completed")
- }).Debug("buildTestsResultsFromPostgres completed")
+ }).Debug("buildTestsResultsFromBigQuery completed")
    generator := func(ctx context.Context) (testResults, []error) {
        return spec.buildTestsResults(dbc, matview)
    }
    result, errs := GetDataFromCacheOrMatview(context.TODO(), cacheClient,
there should be a request somewhere upstream, which means there should be a ctx somewhere too, would it be hard to plumb it here to avoid the context.TODO()?
agreed, i guess i was being lazy
        "elapsed": elapsed,
        "reports": len(testReports),
-   }).Debug("BuildTestsResults completed")
+   }).Debug("buildTestsResultsFromPostgres completed")
- }).Debug("buildTestsResultsFromPostgres completed")
+ }).Debug("buildTestsResults completed")
yeah... actually it needs a name change anyway.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: petr-muller, sosiouxme
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
fix silly error found by coderabbit
as coderabbit pointed out, we probably don't really want a caching problem to prevent running the matview refreshes.
practically speaking, it doesn't matter
…uery gives a chance to cancel the query if the server request is canceled.
New changes are detected. LGTM label has been removed.
@sosiouxme: This pull request references TRT-2564 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "4.22" instead.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 2
♻️ Duplicate comments (1)
pkg/api/cache.go (1)
220-221: ⚠️ Potential issue | 🟠 Major
Don't `Fatal` from this cache helper.
`GetDataFromCacheOrMatview` is used on request paths; a transient Redis outage here should degrade to regeneration, not terminate the process.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/api/cache.go` around lines 220 - 221, In GetDataFromCacheOrMatview, do not call logrus.Fatalf (which exits the process) on a transient Redis "connection refused" error; instead log non-fatally (e.g., logrus.WithError(err).Warnf or Errorf) and treat it as a cache miss by returning control to the caller so the data can be regenerated (e.g., return a nil cache result / an error that the caller interprets as "no cache"). Replace the Fatalf call in the connection-refused branch with a non-fatal log and adjust the function return path so it does not terminate the process.
🧹 Nitpick comments (1)
cmd/sippy/load.go (1)
148-152: Log the degraded cache mode here.
Swallowing `cacheErr` is fine for keeping the load moving, but it also means later matview refreshes run with `cacheClient == nil`, so refresh timestamps are never recorded and cached RA/test responses can stay stale until TTL expiry. Please surface this path with at least a warning/metric so a successful load does not hide disabled invalidation.
Possible tweak

    // likewise get a cache client if possible, though some things operate without it.
    cacheClient, cacheErr := f.CacheFlags.GetCacheClient()
    if cacheErr != nil {
    +   log.WithError(cacheErr).Warn("cache unavailable; matview refreshes will skip invalidation timestamp writes")
        cacheClient = nil // error hygiene, since we pass this down to quite a few functions
    }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sippy/load.go` around lines 148 - 152, The code swallows cacheErr after calling f.CacheFlags.GetCacheClient(), which hides degraded-cache mode; update the block around f.CacheFlags.GetCacheClient() so that when cacheErr != nil you log a warning (including cacheErr.Error()) and/or emit a metric indicating cache invalidation is disabled before setting cacheClient = nil; reference the cacheClient and cacheErr variables and the GetCacheClient call so the log/metric is placed right where cacheErr is handled to surface the degraded cache path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/api/cache.go`:
- Around line 201-202: The json.Unmarshal error handling in the cache retrieval
path should treat corrupt cache entries as cache-misses instead of returning an
error; modify the block that unmarshals into cacheVal to log or warn about the
unmarshal failure (include error and cacheKey), optionally evict the bad entry,
and then fall through to call generateFn to produce the value and continue
normal execution rather than returning defaultVal and an error slice; update the
code that currently references json.Unmarshal, cacheVal, cacheKey, defaultVal,
and generateFn so the function returns the generated value (and no error) on
corrupt cache blobs.
In `@pkg/api/tests.go`:
- Around line 533-537: The cache key currently uses a constant
NewCacheSpec(testResultsBQ{}, "BigQueryTestsResults~", nil), causing all
BigQuery /api/tests requests to share one entry; update the
GetDataFromCacheOrGenerate call to build the cache spec from the actual request
spec (include release, period, filter, collapse or the request
struct/stringified key) so each unique request produces a unique NewCacheSpec
(replace the constant testResultsBQ{} key with the request-derived key) while
keeping the same cache, RequestOptions (testResultsCacheDuration) and the call
to GetDataFromCacheOrGenerate.
---
Duplicate comments:
In `@pkg/api/cache.go`:
- Around line 220-221: In GetDataFromCacheOrMatview, do not call logrus.Fatalf
(which exits the process) on a transient Redis "connection refused" error;
instead log non-fatally (e.g., logrus.WithError(err).Warnf or Errorf) and treat
it as a cache miss by returning control to the caller so the data can be
regenerated (e.g., return a nil cache result / an error that the caller
interprets as "no cache"). Replace the Fatalf call in the connection-refused
branch with a non-fatal log and adjust the function return path so it does not
terminate the process.
---
Nitpick comments:
In `@cmd/sippy/load.go`:
- Around line 148-152: The code swallows cacheErr after calling
f.CacheFlags.GetCacheClient(), which hides degraded-cache mode; update the block
around f.CacheFlags.GetCacheClient() so that when cacheErr != nil you log a
warning (including cacheErr.Error()) and/or emit a metric indicating cache
invalidation is disabled before setting cacheClient = nil; reference the
cacheClient and cacheErr variables and the GetCacheClient call so the log/metric
is placed right where cacheErr is handled to surface the degraded cache path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 093e1c8b-5796-4aa8-958a-4556161ef67a
📒 Files selected for processing (8)
cmd/sippy/load.go
cmd/sippy/refresh.go
pkg/api/cache.go
pkg/api/cache_test.go
pkg/api/job_runs.go
pkg/api/tests.go
pkg/apis/cache/cache.go
pkg/sippyserver/server.go
✅ Files skipped from review due to trivial changes (1)
- pkg/sippyserver/server.go
🚧 Files skipped from review as they are similar to previous changes (3)
- cmd/sippy/refresh.go
- pkg/api/cache_test.go
- pkg/api/job_runs.go
Scheduling required tests:
this is not strictly necessary since the results were already cached at the request level and we don't have a way to find out when the backing tables in BQ are refreshed. but it seemed like a good idea and not too much trouble to have the two forks work the same.
both postgres and BQ forks can use the same request processing code. also the "period" is now standardized for caching purposes.
fix problem coderabbit noted where the cache entry doesn't unmarshal for some reason... better to treat it like a cache miss.
force-pushed from 49e5677 to f81c9a4 (Compare)
@sosiouxme: This pull request references TRT-2564 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "4.22" instead.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/api/cache.go (1)
140-146: ⚠️ Potential issue | 🟡 Minor
Don't let rounding make `Expiry` a no-op.
Once Line 143 runs, the duration is recomputed from `now` and the previously selected `cacheOptions.Expiry` is discarded. Any caller that sets both `Expiry` and `CRTimeRoundingFactor` will get the next rounding boundary from "now", not from `now + Expiry`.
Suggested fix
    func CalculateRoundedCacheDuration(cacheOptions cache.RequestOptions) time.Duration {
        // require cacheDuration for persistence logic
        cacheDuration := defaultCacheDuration
        if cacheOptions.Expiry > 0 {
            cacheDuration = cacheOptions.Expiry
        }
        if cacheOptions.CRTimeRoundingFactor > 0 {
            now := time.Now().UTC()
    +       expiresAt := now.Add(cacheDuration)
            // Only cache until the next rounding duration
    -       cacheDuration = now.Truncate(cacheOptions.CRTimeRoundingFactor).Add(cacheOptions.CRTimeRoundingFactor).Sub(now)
    +       cacheDuration = expiresAt.Truncate(cacheOptions.CRTimeRoundingFactor).Add(cacheOptions.CRTimeRoundingFactor).Sub(now)
        }
        return cacheDuration
    }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/api/cache.go` around lines 140-146: the current logic overwrites cacheDuration set from cacheOptions.Expiry when cacheOptions.CRTimeRoundingFactor is present, causing Expiry to be ignored; update the CRTimeRoundingFactor branch to compute the rounding boundary from now + cacheDuration (use the existing cacheDuration after applying Expiry) instead of from now, so that the final cacheDuration = roundedBoundary.Sub(now). Concretely, after setting cacheDuration from cacheOptions.Expiry, when CRTimeRoundingFactor > 0 compute target := now.Add(cacheDuration), then set cacheDuration = target.Truncate(cacheOptions.CRTimeRoundingFactor).Add(cacheOptions.CRTimeRoundingFactor).Sub(now) (or equivalent) so both cacheOptions.Expiry and cacheOptions.CRTimeRoundingFactor are honored for cacheDuration.

pkg/api/tests.go (1)
251-257: ⚠️ Potential issue | 🟠 Major — Guard `limit` before slicing.

Line 253 and line 286 only reject zero. A request like `?limit=-1` falls through to `tests[:limit]` and panics the handler.

Suggested fix:

```diff
 func (tests TestsAPIResult) limit(req *http.Request) TestsAPIResult {
-	limit, _ := strconv.Atoi(req.URL.Query().Get("limit"))
-	if limit == 0 || len(tests) < limit {
+	limit, err := strconv.Atoi(req.URL.Query().Get("limit"))
+	if err != nil || limit <= 0 || len(tests) < limit {
 		return tests
 	}
 	return tests[:limit]
 }

 func (tests TestsAPIResultBQ) limit(req *http.Request) TestsAPIResultBQ {
-	limit, _ := strconv.Atoi(req.URL.Query().Get("limit"))
-	if limit == 0 || len(tests) < limit {
+	limit, err := strconv.Atoi(req.URL.Query().Get("limit"))
+	if err != nil || limit <= 0 || len(tests) < limit {
 		return tests
 	}
 	return tests[:limit]
 }
```

Also applies to: 284-290
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/api/tests.go` around lines 251 - 257, The limit parsing in TestsAPIResult.limit allows negative or parse-failed values through and can cause a panic when slicing (tests[:limit]); update the TestsAPIResult.limit method to validate the parsed limit from strconv.Atoi by checking the Atoi error, treating non-positive (<=0) or failed parses as "no limit", and clamp the positive limit to at most len(tests) before returning tests[:limit]; apply the same guard/clamping logic to the other identical limit-handling block in this file.
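The guarded behavior this comment asks for can be exercised in isolation. `limitResults` below is an illustrative stand-in that takes the raw query value directly rather than an `*http.Request`:

```go
package main

import (
	"fmt"
	"strconv"
)

// limitResults sketches the guarded limit helper: a missing, unparsable,
// or non-positive limit means "no limit", and a valid limit never slices
// past the end of the slice, so tests[:limit] cannot panic.
func limitResults(tests []string, rawLimit string) []string {
	limit, err := strconv.Atoi(rawLimit)
	if err != nil || limit <= 0 || len(tests) < limit {
		return tests
	}
	return tests[:limit]
}

func main() {
	tests := []string{"a", "b", "c"}
	fmt.Println(limitResults(tests, "2"))  // [a b]
	fmt.Println(limitResults(tests, "-1")) // [a b c] — no panic
	fmt.Println(limitResults(tests, ""))   // [a b c]
}
```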
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/api/tests.go`:
- Around line 443-445: The fmt.Errorf calls that format the variable errs (a
[]error) use the "%s" verb; change both occurrences to use "%v" so the error
slice prints its elements correctly (update the fmt.Errorf invocations that
reference errs in pkg/api/tests.go to use "%v" instead of "%s").
---
Outside diff comments:
In `@pkg/api/cache.go`:
- Around line 140-146: The current logic overwrites cacheDuration set from
cacheOptions.Expiry when cacheOptions.CRTimeRoundingFactor is present, causing
Expiry to be ignored; update the CRTimeRoundingFactor branch to compute the
rounding boundary from now + cacheDuration (use the existing cacheDuration after
applying Expiry) instead of from now so that the final cacheDuration =
roundedBoundary.Sub(now). Concretely, after setting cacheDuration from
cacheOptions.Expiry, when CRTimeRoundingFactor > 0 compute target :=
now.Add(cacheDuration), then set cacheDuration =
target.Truncate(cacheOptions.CRTimeRoundingFactor).Add(cacheOptions.CRTimeRoundingFactor).Sub(now)
(or equivalent) so both cacheOptions.Expiry and
cacheOptions.CRTimeRoundingFactor are honored for cacheDuration.
In `@pkg/api/tests.go`:
- Around line 251-257: The limit parsing in TestsAPIResult.limit allows negative
or parse-failed values through and can cause a panic when slicing
(tests[:limit]); update the TestsAPIResult.limit method to validate the parsed
limit from strconv.Atoi by checking the Atoi error, treating non-positive (<=0)
or failed parses as "no limit", and clamp the positive limit to at most
len(tests) before returning tests[:limit]; apply the same guard/clamping logic
to the other identical limit-handling block in this file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 80e6632c-b39c-4fe5-8870-31f91c66ec69
📒 Files selected for processing (4)
- pkg/api/cache.go
- pkg/api/tests.go
- pkg/apis/cache/cache.go
- pkg/sippyserver/server.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/apis/cache/cache.go
- pkg/sippyserver/server.go
Scheduling required tests:
@sosiouxme: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Risk Analysis queries are fairly heavy and can impact Sippy DB performance, and during regression periods we are likely to see an increase in repetitive calls for the same job/test failures, so this adds a caching mechanism.
Both RA and Sippy classic test pages use the same underlying query mechanism, so the same cache is introduced for both (replacing the existing request-level cache for the test API). This caches the test queries for an hour by default, but cache entries are invalidated if the underlying matview is refreshed, so new data is surfaced faster.
Individual commits separate the concepts that built this change, so reviewing them serially may be easiest.
openshift/origin#30965 proposes one simple way to benchmark risk analysis requests.