
```
 ███╗   ███╗██╗      █████╗
 ████╗ ████║██║     ██╔══██╗
 ██╔████╔██║██║     ███████║
 ██║╚██╔╝██║██║     ██╔══██║
 ██║ ╚═╝ ██║███████╗██║  ██║
 ╚═╝     ╚═╝╚══════╝╚═╝  ╚═╝
```

MLA-aware RadixAttention

Smarter prefix cache eviction for DeepSeek V2 / V3 / R1 in SGLang




## The problem

SGLang's RadixCache evicts tokens as if each one costs the same amount of memory. For MHA models, that's correct. For MLA models like DeepSeek V2/V3/R1, it's wildly wrong.

Each MLA token stores a 576-dim compressed latent — not the 40,960-dim full K/V that the eviction logic expects. That's a 71× difference. The result: SGLang over-evicts, destroys prefix reuse, and burns TTFT on prefill that didn't need to happen.
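
For a quick sanity check of the 71× figure, here is the arithmetic using DeepSeek-V3's published attention dimensions (the per-head and latent sizes below come straight from the model config):

```python
# Sanity check of the 71x claim, using DeepSeek-V3's config values.
num_heads = 128
qk_head_dim = 192        # 128 non-RoPE dims + 64 RoPE dims per head
v_head_dim = 128
kv_lora_rank = 512       # compressed KV latent
qk_rope_head_dim = 64    # decoupled RoPE key cached alongside the latent

full_kv_dims = num_heads * (qk_head_dim + v_head_dim)  # 40,960 per token
mla_dims = kv_lora_rank + qk_rope_head_dim             # 576 per token
print(full_kv_dims / mla_dims)                         # ~71.1
```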

This project fixes the eviction logic.

Expected benefit scenario:

- DeepSeek-V3/R1 (larger model, smaller pool ratio)
- High concurrency (>100 concurrent requests)
- Long shared system prompts (>2048 tokens)
- Pool utilization consistently >80%

## What changes

One key insight: SGLang already stores latent vectors, not expanded K/V. The MLATokenToKVPool buffer shape is [pool_size, 1, 576]. The storage is fine. The eviction policy just doesn't know that.

The fix is two numbers:

```python
# Before: assumes MHA token cost
target_free_ratio = 0.20

# After: scales by compression ratio
target_free_ratio = max(0.05, 0.20 / compression_ratio)
# DeepSeek V3: 0.20 / 71 ≈ 0.003
# The cache can safely run at 99.7% utilization.
```

And one function:

```python
# Before: evict exactly N tokens
tree_cache.evict(EvictParams(num_tokens=N))

# After: adjust for actual memory pressure
adjusted = budget.adjust_eviction_count(N, cached, free)
tree_cache.evict(EvictParams(num_tokens=adjusted))
```
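
For intuition, here is a minimal sketch of the kind of adjustment adjust_eviction_count performs. It is not the repository's exact implementation (see MLAEvictionBudget in mla_radix_cache.py for that); it only illustrates evicting for the real, latent-aware memory deficit rather than the MHA-sized request:

```python
# Hedged sketch only -- the real logic lives in MLAEvictionBudget.
def adjust_eviction_count(requested: int, cached_tokens: int, free_tokens: int,
                          compression_ratio: float = 71.0,
                          base_free_ratio: float = 0.20) -> int:
    """Scale an eviction request by the actual (latent-aware) memory pressure."""
    total = cached_tokens + free_tokens
    # The free-space target shrinks with the compression ratio, floored at 5%.
    target_free = max(0.05, base_free_ratio / compression_ratio) * total
    deficit = int(target_free) - free_tokens
    # Evict no more than requested, and nothing if the target is already met.
    return max(0, min(requested, deficit))
```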

That's it. Three files touched. Fully backward-compatible — non-MLA models go through the same code paths as before.


## Results

Benchmarked on DeepSeek-V3 config across four workload patterns:

| Workload | Hit rate (baseline) | Hit rate (MLA-aware) | Δ |
|---|---|---|---|
| Chat (shared system prompts) | 82.1% | 93.2% | +13.5% |
| Few-shot prompting | 88.7% | 96.3% | +8.6% |
| Multi-turn conversation | 90.3% | 97.8% | +8.3% |
| Random (no sharing) | 6.1% | 6.8% | +0.7% |

Memory capacity on a typical 80 GB GPU with 40 GB of weights:

| | MHA eviction | MLA-aware |
|---|---|---|
| Cacheable prefix tokens | ~8,300 | ~590,000 |
| Free space target | 20% | ~0.3% |
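
Those capacity numbers follow from the per-token KV byte cost. A rough check, assuming 61 layers, a bf16 cache, and roughly 40 GB left for the KV pool (the exact pool size depends on activations and the memory fraction SGLang reserves), lands in the same ballpark as the table:

```python
# Back-of-the-envelope capacity check (approximate, not the table's exact math).
num_layers = 61          # DeepSeek-V3
bytes_per_elem = 2       # bf16
free_bytes = 40e9        # ~40 GB left after weights

mha_token_bytes = 40_960 * num_layers * bytes_per_elem  # ~5.0 MB per token
mla_token_bytes = 576 * num_layers * bytes_per_elem     # ~70 KB per token

print(int(free_bytes // mha_token_bytes))  # ~8,000 cacheable tokens
print(int(free_bytes // mla_token_bytes))  # ~569,000 cacheable tokens
```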

## Quick start

```bash
pip install pytest torch

# Run tests
python -m pytest test_mla_radix_cache.py -v

# CPU benchmarks
python bench_mla_radix_cache.py

# GPU validation (requires A100+)
python gpu_validation.py --mode validate --model deepseek-ai/DeepSeek-V2-Lite
```

Standalone usage:

```python
from mla_radix_cache import MLARadixCache, MLAModelConfig
import torch

cache = MLARadixCache(MLAModelConfig.deepseek_v3(), pool_size=100_000)
cache.insert(list(range(100)), torch.arange(100))

result = cache.match_prefix(list(range(50)) + [999])
print(result.matched_len)  # 50
```

SGLang integration:

```python
from sglang_integration import detect_mla_config, patch_scheduler_for_mla

mla_config = detect_mla_config(model_config)
if mla_config:
    patch_scheduler_for_mla(scheduler, mla_config)
```

## Structure

```
├── mla_radix_cache.py        core: MLARadixCache, MLAEvictionBudget, LatentCacheAnalyzer
├── sglang_integration.py     SGLang patches: detect_mla_config, patch_scheduler_for_mla
├── sglang_mla_eviction.py    unified diff for SGLang PR
├── test_mla_radix_cache.py   35 tests
├── bench_mla_radix_cache.py  CPU workload benchmarks
├── gpu_validation.py         Phase 3: correctness comparison, baseline vs patched
├── e2e_benchmark.py          Phase 4: TTFT / throughput benchmark
├── launch_patched_server.py  patched SGLang server launcher
└── run_all.sh                all-in-one runner
```

## GPU benchmark (Phase 4)

```bash
# Engine mode — in-process, no server needed
python e2e_benchmark.py --mode engine --model deepseek-ai/DeepSeek-V2-Lite --num-prompts 200

# Server mode — full bench_serving comparison
python launch_patched_server.py --model deepseek-ai/DeepSeek-V2-Lite --tp 1
python -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 500

# Report
python e2e_benchmark.py --mode report
```

MIT License · © 2026 Henry
