Conditionally refresh cache TTL on restore (A-1454) #4038

Draft: claude[bot] wants to merge 1 commit into main from a-1454-conditional-ttl-refresh

claude Bot commented Jun 30, 2026

Requested by Ming Guo · Slack thread

Description

The S3 cache restore path refreshes a cache object's lifecycle expiration by copying the object to itself (source == dest, MetadataDirective: REPLACE), which resets its LastModified so the bucket lifecycle rule doesn't expire it.

Before: that self-copy ran on every restore. It's a synchronous full-object server-side rewrite — about 27s of blocking work for a 2GB cache — and it hard-errors on objects above S3's 5GB CopyObject limit, failing the whole restore (and the pipeline).

After: the self-copy runs only when the cached object has 20% or less of its lifecycle window left. A hot cache that was recently refreshed (most of its window remaining) skips the copy entirely, so most restores no longer block on it. If the copy does run and fails, including the >5GB case, the restore continues normally and the failure is logged as a warning instead of aborting the restore.

How: the lifecycle window is derived per object as expires_at - LastModified. expires_at comes from the retrieve response (threaded through Blob.Download into the S3 store), and LastModified is read with one extra HeadObject per restore, a cheap metadata call, unlike the full-object copy it gates.

A small pure function, shouldRefreshExpiration(expiresAt, lastModified, now, refreshFraction), decides whether to refresh. It returns true when the remaining lifetime (expires_at - now) is at or below refreshFraction of the window. It also returns true when expires_at or LastModified is unknown (zero) or the window is non-positive, which preserves the old always-refresh behaviour when the fraction can't be computed. When it returns false the copy is skipped; when it returns true the copy runs but soft-fails on error.

restoreTTLRefreshFraction is 0.20, so the refresh fires only in the last 20% of each object's lifecycle window. Because the window is computed from the object's own LastModified and expires_at, the threshold self-tunes to the bucket's actual lifecycle with no hard-coded duration.

Context

Linear ticket: A-1454. The per-restore CopyObject was identified as the largest single cost in the cache-restore investigation.

Changes

  • internal/cache/store/blob.go: add an expiresAt time.Time parameter to the Blob.Download interface method.
  • internal/cache/store/s3.go: add the restoreTTLRefreshFraction const and the pure shouldRefreshExpiration function; in Download, read the object's LastModified via HeadObject, skip the self-copy when more than 20% of the lifecycle window remains, and soft-fail (log a warning, continue) when the copy errors.
  • internal/cache/store/file.go, internal/cache/store/nsc.go: accept and ignore the new expiresAt parameter.
  • internal/cache/restore.go: pass retrieveResp.ExpiresAt into the Download call.
  • internal/cache/store/s3_test.go: table-driven unit tests for shouldRefreshExpiration.
  • Updated existing file_test.go / nsc_test.go call sites for the new signature.

Testing

  • Tests were run locally: go test ./internal/cache/... ./api/... passes; go build ./... and go vet ./internal/cache/... are clean.
  • Code is formatted.

Disclosures / Credits

Implemented with Claude Code.


claude Bot force-pushed the a-1454-conditional-ttl-refresh branch from bdb5d84 to dea1333 on June 30, 2026 07:05

The S3 cache restore path issued a synchronous full-object self-CopyObject
on every restore to reset LastModified so the bucket lifecycle rule
wouldn't expire the object. That's a ~27s blocking server-side rewrite for
a 2GB cache and hard-errors above S3's 5GB CopyObject limit.

Now the self-copy only runs when the blob is near expiry: thread expires_at
from the retrieve response into Blob.Download and skip the copy when there's
more than restoreTTLRefreshThreshold of life left. Any copy failure
(including >5GB) is logged as a warning and no longer fails the restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HT7Sg6wJVRSf9nfstxGWG1

claude Bot force-pushed the a-1454-conditional-ttl-refresh branch from dea1333 to 8d881e6 on June 30, 2026 07:08
claude Bot added the "change" label (Not a new feature, but a user observable non-breaking behavior change.) on Jun 30, 2026