Fix: Rolling KV cache and top-k logit trimming (fixes #675) #698

Open
medmomoait wants to merge 1 commit into google-deepmind:main from medmomoait:fix/rolling-kv-cache-vram-675

Conversation

@medmomoait

Fixes #675

Changes

1. ChatSampler — Rolling KV Cache (ring buffer)

Adds rolling_cache and rolling_cache_preserve_tokens options to
ChatSampler. When enabled, old tokens are evicted from the KV cache
in a ring-buffer fashion before each turn, preventing context-exhaustion
OOM in long multi-turn conversations. A configurable prefix (e.g. system
prompt) can be protected from eviction.
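The eviction policy described above can be sketched roughly as follows. This is only an illustration of oldest-first eviction with a protected prefix, not the code in this PR: `evict_oldest` and its parameters are hypothetical names, the cache is modeled as a plain Python list of per-token entries, and list slicing stands in for in-place ring-buffer indexing.

```python
from typing import List, TypeVar

T = TypeVar("T")


def evict_oldest(cache: List[T], max_len: int, preserve: int, incoming: int) -> List[T]:
    """Return the entries to keep so that `incoming` new tokens fit in `max_len` slots.

    The first `preserve` entries (e.g. a system prompt) are never evicted.
    All names here are hypothetical and chosen for illustration only.
    """
    if preserve + incoming > max_len:
        raise ValueError("protected prefix plus incoming tokens exceed the cache size")

    # Number of entries that must go to make room for the new turn.
    overflow = len(cache) + incoming - max_len
    if overflow <= 0:
        return list(cache)  # Everything still fits; nothing to evict.

    prefix = cache[:preserve]  # Protected entries.
    tail = cache[preserve:]    # Evictable entries, oldest first.
    return prefix + tail[overflow:]


# Cache of 10 slots, already full; the first 2 entries are protected and
# the next turn needs 3 slots, so the 3 oldest unprotected entries are dropped.
kept = evict_oldest(list(range(10)), max_len=10, preserve=2, incoming=3)
print(kept)  # [0, 1, 5, 6, 7, 8, 9]
```

A real KV cache would additionally have to keep position information consistent for the entries that remain, which this sketch does not attempt to model.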

2. SamplerLoop — Top-k logit trimming

Adds a top_k_logits option to SamplerLoop. When set, logits are
masked to the top-k entries immediately after the forward pass, reducing
the transient VRAM footprint during sampling. The sampling distribution
is unchanged as long as the sampler would not have selected a token
outside the top k.
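As a rough illustration of top-k masking in general (not the code in this PR), the sketch below keeps the k largest logits and sets the rest to negative infinity, so they receive zero probability after the softmax. `mask_to_top_k` and `softmax` are hypothetical helpers written in plain Python for clarity; an actual sampler would operate on accelerator arrays.

```python
import heapq
import math
from typing import List


def mask_to_top_k(logits: List[float], k: int) -> List[float]:
    """Keep the k largest logits and replace the rest with -inf.

    Hypothetical helper for illustration. Entries tied with the k-th
    largest value are all kept, so slightly more than k may survive.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if k >= len(logits):
        return list(logits)
    threshold = heapq.nlargest(k, logits)[-1]  # Value of the k-th largest logit.
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; -inf logits map to probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


masked = mask_to_top_k([2.0, 1.0, 0.5, -1.0], k=2)
probs = softmax(masked)
print(masked)  # [2.0, 1.0, -inf, -inf]
```

The masked entries get probability zero and the remaining mass is renormalised over the retained tokens.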

Both options are disabled by default, which preserves the previous
behavior, so this change is backward compatible.

@google-cla

google-cla Bot commented Jun 14, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Development

Successfully merging this pull request may close these issues.

[Bug]: Context Exhaustion and VRAM Spikes in KV Cache & SamplerLoop

1 participant