Commit bad674c

Fix KV cache size per token values in discovery doc
Corrected values based on actual model configs:

- llama3.1-8b: 128 KB (was incorrectly ~0.5 MB)
- llama3.1-70b: 320 KB (was incorrectly ~5 MB)
- mistral-7b: 128 KB (was incorrectly ~0.5 MB)
- llama2-7b: 512 KB (MHA has 4x more KV heads than GQA)

The 70b model is ~2.5x larger per token than 8b (not 10x) due to 80 layers vs 32 layers with same kv_heads=8.
1 parent e995340 commit bad674c

1 file changed

Lines changed: 6 additions & 5 deletions

kv_cache_benchmark/discovery_results_and_analysis/mlperfv3_results_and_metrics_discovery.md

@@ -676,7 +676,7 @@ done
 Larger models have larger KV cache blocks, which stress storage bandwidth more effectively:
 
 ```bash
-# Llama3.1-70b: ~10x larger KV cache per token than 8b models
+# Llama3.1-70b: ~2.5x larger KV cache per token than 8b models (320KB vs 128KB)
 # Better for systems with high-bandwidth storage (NVMe, CXL)
 for trial in 1 2 3; do
     python kv-cache.py \
@@ -694,11 +694,12 @@ done
 **Why llama3.1-70b matters:**
 | Model | KV Cache per Token | Storage I/O per Request | Use Case |
 |-------|-------------------|------------------------|----------|
-| llama3.1-8b | ~0.5 MB | Lower | Best differentiation ratio |
-| llama3.1-70b | ~5 MB | Higher | Maximum storage bandwidth stress |
-| mistral-7b | ~0.5 MB | Lower | Alternative to 8b |
+| llama3.1-8b | 128 KB | Lower | Best differentiation ratio |
+| llama3.1-70b | 320 KB | Higher | Maximum storage bandwidth stress |
+| mistral-7b | 128 KB | Lower | Alternative to 8b |
+| llama2-7b | 512 KB | Highest | MHA architecture (4x more than GQA) |
 
-The 70b model generates ~10x more storage I/O per token, making it ideal for:
+The 70b model generates ~2.5x more storage I/O per token than 8b (due to 80 vs 32 layers), making it ideal for:
 - High-bandwidth NVMe arrays (PCIe 5.0, multiple drives)
 - CXL memory expanders
 - Enterprise storage systems where small I/Os don't saturate bandwidth
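
The corrected per-token sizes in this diff can be cross-checked with the usual KV cache formula: 2 (key and value) x layers x KV heads x head dimension x bytes per element. The following is a minimal sketch, not code from `kv-cache.py`. The `ModelConfig` class and `kv_bytes_per_token` helper are hypothetical names introduced for illustration. The layer counts and `kv_heads=8` for the 8b and 70b models come from the commit message; `head_dim=128`, fp16 storage (2 bytes per element), and the mistral-7b and llama2-7b shapes are assumptions based on the publicly released model configs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the fields that determine KV cache size."""
    num_layers: int
    num_kv_heads: int
    head_dim: int = 128          # assumption: 128 for all four models
    bytes_per_element: int = 2   # assumption: fp16/bf16 KV cache


def kv_bytes_per_token(cfg: ModelConfig) -> int:
    """Bytes of KV cache written per token across all layers."""
    # Factor of 2: each KV head stores one key vector and one value vector.
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_element


# Shapes assumed from public model configs (GQA = 8 KV heads, MHA = 32 KV heads).
MODELS = {
    "llama3.1-8b": ModelConfig(num_layers=32, num_kv_heads=8),
    "llama3.1-70b": ModelConfig(num_layers=80, num_kv_heads=8),
    "mistral-7b": ModelConfig(num_layers=32, num_kv_heads=8),
    "llama2-7b": ModelConfig(num_layers=32, num_kv_heads=32),
}

if __name__ == "__main__":
    base = kv_bytes_per_token(MODELS["llama3.1-8b"])
    for name, cfg in MODELS.items():
        size = kv_bytes_per_token(cfg)
        print(f"{name}: {size // 1024} KB per token ({size / base:.1f}x of llama3.1-8b)")
```

Under these assumptions the sketch gives 128 KB, 320 KB, 128 KB, and 512 KB, matching the table. The 70b-to-8b ratio of 2.5x comes entirely from the layer count (80 / 32), since both models use the same number of KV heads.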
