fix(caching): aggregate hit/miss counters across pods via Redis (closes #811)#961
Open
meem08 wants to merge 1 commit into
Open
fix(caching): aggregate hit/miss counters across pods via Redis (closes #811)#961meem08 wants to merge 1 commit into
meem08 wants to merge 1 commit into
Conversation
…fcode#811) CachingService previously tracked this.hits / this.misses as in-process instance variables. In a horizontally scaled deployment each pod reported only its own slice of traffic, so getStats() and Prometheus hit-rate metrics (cache_hit_rate_percentage) were skewed by 1/N. This change moves counter storage to Redis using atomic INCR on keys cache:hits:{type} and cache:misses:{type}. The cache type is derived from the second segment of the cache key (cache:{type}:...), keeping the cardinality of counter keys bounded. Highlights: * recordHit / recordMiss now await INCR against shared Redis keys. * getStats(), resetStats(), publishHitRateMetrics() are async and read from Redis so the reported hit rate reflects cluster-wide traffic. * New daily @Cron(EVERY_DAY_AT_MIDNIGHT UTC) resets all counter keys. * getAggregateStats() scans cache:hits:* and cache:misses:* (each in its own complete SCAN loop) to expose a single cluster-wide rate. * Graceful local fallback (CACHE_COUNTER_FALLBACK_LOCAL=true) is kept so single-instance dev/test runs and Redis outages still report something useful instead of silently zero. * resolveRedisFromConfig() short-circuits in NODE_ENV=test to avoid leaking live Redis connections when other suites DI the service. * Tests mock the Redis client (incr/mget/scan/del) and assert the exact key shape, distributed vs local counters, fallback behaviour and daily reset. Refs rinafcode#811
|
@meem08 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀 |
Contributor
|
Great job so far There’s just one blocker — the workflow is failing. Could you take a look and fix it so all checks pass? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #811.
CachingServicepreviously trackedthis.hits/this.missesas in-process instance variables. In a horizontally scaled deployment with N pods, the value returned bygetStats()(and ingested into thecache_hit_rate_percentagePrometheus gauge viapublishHitRateMetrics) reflected only the current pod's slice of traffic. Dashboards and alerting rules driven by this metric were therefore skewed by a factor of ~1/N, masking both regressions and genuine improvements.This change moves counter storage to Redis using atomic
INCRagainstcache:hits:{type}andcache:misses:{type}so every pod writes to and reads from the same shared source of truth.Linked Issue
Closes / references #811 — "CachingService hit/miss counters are in-process and incorrect in multi-instance deployments".
Acceptance Criteria
INCR.cache:hits:{type}/cache:misses:{type}where{type}is derived from the second segment of the cache key (e.g.cache:test:1→cache:hits:test).@Cron(EVERY_DAY_AT_MIDNIGHT, { timeZone: 'UTC' })job (dailyReset) sweeps both counter namespaces;resetStats(type)is also exposed for manual / rolling-window resets.src/caching/caching.service.spec.tsmocksincr/mget/scan/deland asserts exact key shape, cluster vs local numbers, fallback, and reset paths.Impacted Files
src/caching/caching.service.tsrecordHit/recordMisstoINCRper type; madegetStats,resetStats,publishHitRateMetricsasync and Redis-backed; addeddailyResetcron,getAggregateStats, deduped helpers, and graceful local fallback.src/caching/caching.service.spec.tsImplementation Notes
Counter namespace
Keys are built via the exported helper:
Cache keys in the codebase already follow a
cache:{type}:...convention, soderiveCacheType(key)extracts{type}from the second segment. Keys that don't follow this convention fall back to a singledefaultbucket. This keeps counter-key cardinality bounded by the number of cache types rather than the number of cached entries.Cluster-wide read path
getStats(type)readscache:hits:{type}andcache:misses:{type}via a singleMGET.getAggregateStats()runs two independentSCANloops (one per MATCH pattern) and concatenates the key sets — sharing a cursor across two parallel MATCH scans would terminate early whenever either pattern reached'0'and silently drop keys.publishHitRateMetrics(type)simply awaitsgetStats(type)and forwards it toMetricsCollectionService.updateCacheHitRate.Daily reset
resetStats(type?)deletes the targeted keys (DEL cache:hits:{type} cache:misses:{type}) when called with a type, or scans-and-deletes every counter key when called without one.Resilience
CACHE_COUNTER_ENABLE_REDIS=falsedisables the shared client entirely.CACHE_COUNTER_FALLBACK_LOCAL=true(default) keeps an in-process counter so single-instance dev / test setups and Redis outages still produce non-zero numbers.resolveRedisFromConfig()short-circuits whenNODE_ENV === 'test'so DI-instantiated services in tests don't open live Redis connections — tests inject a mock client via the constructor instead.Public API change (async)
getStats(),resetStats(), andpublishHitRateMetrics()are nowasync. An audit of the codebase (grep -r "cachingService.getStats\|cachingService.publishHitRateMetrics\|cachingService.resetStats") shows no production callers other than the spec we updated. Consumers that use the low-level API (get/set/getOrSet/delete/deleteMany/clear/deleteByPattern) are unaffected.Testing
New tests cover:
deriveCacheTypeandbuildCounterKeysexported helpersgetOrSetincrements the correct Redis key for hit/misspublishHitRateMetricsreads from Redis and emits the cluster-wide rateresetStats(type)andresetStats()(full sweep)Out of Scope
ComputationCacheServicehas the same in-process counter pattern (computationhit/miss). The issue scopes the fix toCachingServiceonly, so it is not modified by this PR. A follow-up can lift the same helpers into a sharedRedisCounterServiceand adopt them inComputationCacheServicefor consistency.