Skip to content

APIGOV-31191 validate cache#1015

Open
alrosca wants to merge 19 commits into
mainfrom
APIGOV-31191
Open

APIGOV-31191 validate cache#1015
alrosca wants to merge 19 commits into
mainfrom
APIGOV-31191

Conversation

@alrosca

@alrosca alrosca commented Mar 10, 2026

Copy link
Copy Markdown
Collaborator

Cache validation on start up and reconnect

jcollins-axway
jcollins-axway previously approved these changes Mar 11, 2026

@jcollins-axway jcollins-axway left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from a high level, needs pounded testing wise

@jcollins-axway

Copy link
Copy Markdown
Collaborator

copilots review

Code Review: APIGOV-31191 vs main

Summary

  • Scope: reviewed 8 files (314 insertions, 4 deletions), focused on new cache validation flow, reconnect hooks, and cache manager interfaces.
  • Verdict: request changes
  • Severity: medium (2), low (0), high (0), blocker (0)

Findings

  • Medium — Cache validation uses non-unique key (name) and can collide across scoped resources

  • Medium — Cache validator hard-codes API version to v1alpha1, which can skip/incorrectly query some filters

    • Where: pkg/agent/cachevalidationjob.go, pkg/agent/cachevalidationjob.go
    • Description: validation URL is built from a synthetic ResourceInstance with fixed APIVersion: "v1alpha1". For filters whose resource version differs, GetKindLink() may produce the wrong endpoint or empty link, and validation is skipped (return true).
    • Possible impacts to the codebase: out-of-sync persisted cache may pass validation and remain in use after startup/reconnect for affected resource kinds.

Positives

  • Cache validation logic is cleanly encapsulated (cacheValidator) and integrated into both startup and reconnect paths.
  • Reconnect hooks were added consistently to poll and stream clients with minimal API surface change (WithOnReconnect options).
  • Logging around validation failures includes useful operational context (resource counts, resource names, timestamps).

Recommendations

  • Use a stable unique key for comparisons (for example selfLink or a composite of group/kind/scope/name) instead of name alone.
  • Derive API version from filter/resource metadata or from server-discovered GVK mappings rather than hard-coding "v1alpha1".
  • Add targeted unit tests for: duplicate names across scopes, non-v1alpha1 resources, and reconnect-triggered validation behavior.

@alrosca alrosca closed this Mar 12, 2026
@alrosca alrosca reopened this Mar 12, 2026
@alrosca

alrosca commented Mar 12, 2026

Copy link
Copy Markdown
Collaborator Author

Not a bad review at all

Comment thread pkg/agent/handler/discoveryaccessrequest_test.go Outdated
jcollins-axway
jcollins-axway previously approved these changes Mar 12, 2026
jcollins-axway
jcollins-axway previously approved these changes Mar 13, 2026
@jcollins-axway

Copy link
Copy Markdown
Collaborator

caching changes require an extra amount of testing as its usually to blame for our duplicate service issues

@sbolosan

Copy link
Copy Markdown
Collaborator

Give me time to look at this closer

Comment thread pkg/agent/cachevalidationjob.go Outdated
Comment thread pkg/agent/eventsync.go Outdated
Comment thread pkg/agent/eventsync.go Outdated

@sbolosan sbolosan Mar 25, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dang! @vivekschauhan @jcollins-axway I put this note in a very long time ago ;).
I was looking at another race condition down stream, but decided to see who called RebuildCache().

So RebuildCache is reachable by WithOnClientStop (poller) and With EventSyncError (stream) while the event listener go routine is alive, well technically alive.

In those paths the race is meh, maybe not a big deal because event flow has stopped like a harvester error or when the stream is not yet producing. But PublishingLock still doesn't coordinate with the listener. Maybe we should take a look at this again?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets put something on the backlog to take a look @sbolosan

Comment thread pkg/agent/resource/manager.go Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we eat the error?
At least log the error/warn and then if requirement is to rebuild, then rebuild?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe something like

    timeToRebuild, err := a.shouldRebuildCache()
    if err != nil {
        a.logger.WithError(err).Warn("unable to determine cache rebuild state, triggering rebuild")
        timeToRebuild = true
    }
    if timeToRebuild && a.rebuildCache != nil {
        a.rebuildCache.RebuildCache()
    }

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can skip the error logging outside shouldRebuildCache since we already log it inside the method. Returning true on error inside the method would not require checking the return error and explicitly putting timeToRebuild on true

Comment thread pkg/agent/resource/manager.go Outdated

@sbolosan sbolosan Mar 25, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is old and not part of your changes, but can we do a safe guard here

strVal, ok := value.(string)
            if !ok {
                logger.Warn("cacheUpdateTime is not a string, triggering rebuild")
                return true, nil
            }

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this defensive code since in pkg/agent/evensync.go, line 162 that value is put in agent details and will always be a string? I don't think there are permissions to alter the agent details subresource from cli and I don't think anyone would do that, right?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically you're right. But one extra ok check costs nothing and eliminates an entire class of runtime panic, no? I'm just saying x-agent-details is a map deserialized from JSON and JSON deseralization into interface{} has no type gurantees. It'll panic if future code changes forgets to stringify it. Up to you.

@jasonmscollins jasonmscollins left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments from Shane

@jcollins-axway jcollins-axway left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

Comment thread pkg/agent/stream/client.go Outdated
Comment thread pkg/agent/cachevalidationjob.go Outdated
kinds[management.ManagedApplicationGVK().Kind] = struct{}{}
kinds[management.AccessRequestGVK().Kind] = struct{}{}
case config.ComplianceAgent:
kinds[management.ComplianceRuntimeResultGVK().Kind] = struct{}{}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any way to use the watch topic that we create to determine these types?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the watch topic includes kinds that are not persisted in cache (subscribed for event processing only), so deriving the validated set purely from the watch topic would cause false-positive validation failures for those kinds. The per-agent-type whitelist is intentional.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The agent creates the watch topic on start, the logic that creates that has all the kinds, that is what I refer to. Not necessarily pulling the watch topic itself.

jcollins-axway
jcollins-axway previously approved these changes May 1, 2026
* APIGOV-32372 rebuild only failed kinds

* APIGOV-32372 fix for listener/cache rebuild race condition

* APIGOV-32372 merge main

* APIGOV-32372 remove else statement

Co-authored-by: Copilot <copilot@github.com>

* APIGOV-32327 wait for clients to be ready

* APIGOV-32372 improvements

* APIGOV-32372 fix for unlock on unlocked mutex error

* APIGOV-32372 guard against listener potential data race

---------

Co-authored-by: Copilot <copilot@github.com>
alrosca and others added 6 commits June 3, 2026 12:38
* APIGOV-31191 - remove API calls for metadata check for cache validation

* APIGOV-31191 - Refactoring and cleanup
- Remove entire cache flush. Flush only the kind getting processed and after the API call is successful to avoid cache inconsistency
- Cleanup instance count map that should speed up building APIservice cache items
- Fixes

* APIGOV-31191 - check publishing lock and acquire lock while publishing

* APIGOV-31191 - fixes
- Fix for endless waitForCacheRebuild
- Fix to keep original APIService resource instance in cache when duplicates are added to the cache
- Fix to hook cache rebuild on stream client reconnection instead of harvester error and to address job pool status

* APIGOV-31191 - updates
- remove waitForCacheRebuild and use the error returned from harvester to trigger the job restart

* APIGOV-31191 - fallback to count checks if harvester is not reachable

* APIGOV-31191 - review comments + lock on harvester event sync

* APIGOV-31191 - lock on harvester event sync

* APIGOV-31191 - use APIServiceInstance counts to rebuild APIService cache

* APIGOV-31191 - use CreateOrUpdateResource for processing APIService

* APIGOV-31191 - fix to not make multiple x-agent-details update for APIService

* APIGOV-31191 - clean up event listener pauser
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants