Fix KV cache size per token values in discovery doc
Corrected values based on actual model configs:
- llama3.1-8b: 128 KB (was incorrectly ~0.5 MB)
- llama3.1-70b: 320 KB (was incorrectly ~5 MB)
- mistral-7b: 128 KB (was incorrectly ~0.5 MB)
- llama2-7b: 512 KB (MHA has 4x more KV heads than GQA)
The 70b model's KV cache is ~2.5x larger per token than the 8b's (not 10x),
since it has 80 layers vs 32 with the same kv_heads=8.
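
For reference, a minimal sketch of the arithmetic (assuming fp16 KV cache,
head_dim=128 for all four models, and the standard
2 x layers x kv_heads x head_dim x bytes formula; the function and model
tuples below are illustrative, not from the doc):

    # Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
    def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                                 head_dim: int = 128,
                                 bytes_per_elem: int = 2) -> int:
        return 2 * layers * kv_heads * head_dim * bytes_per_elem

    # (layers, kv_heads) per the model configs cited above
    models = {
        "llama3.1-8b":  (32, 8),   # GQA: 8 KV heads
        "llama3.1-70b": (80, 8),   # GQA: 8 KV heads, 2.5x the layers of 8b
        "mistral-7b":   (32, 8),   # GQA: 8 KV heads
        "llama2-7b":    (32, 32),  # MHA: 32 KV heads, 4x more than GQA
    }

    for name, (layers, kv_heads) in models.items():
        size = kv_cache_bytes_per_token(layers, kv_heads)
        print(f"{name}: {size // 1024} KB")
    # llama3.1-8b: 128 KB
    # llama3.1-70b: 320 KB
    # mistral-7b: 128 KB
    # llama2-7b: 512 KB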