Commit 2159bef

Merge pull request #224 from hazemawadalla/TF_KVCache
Production Tooling and Validation
2 parents 988ebc0 + bad674c commit 2159bef

73 files changed

Lines changed: 60101 additions & 0 deletions

kv_cache_benchmark/MLperf v3 KV cache proposal.md

Lines changed: 1787 additions & 0 deletions

kv_cache_benchmark/README.md

Lines changed: 766 additions & 0 deletions
Lines changed: 91 additions & 0 deletions
## Recommended Invocations by Model

### Why Two Invocations (cpu_mem=0 vs cpu_mem=4)?

| cpu_mem  | Purpose                          | Primary Metric                           | Why                                                                                                                                    |
| -------- | -------------------------------- | ---------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| **0 GB** | **Maximum Storage Stress**       | Decode Bytes Read, Wall-Clock Throughput | All I/O goes through NVMe. 4x more read traffic. True test of storage bandwidth.                                                        |
| **4 GB** | **Storage Throughput Benchmark** | Storage Throughput (tok/s)               | Some data cached in RAM. Storage Throughput metric works correctly (2.2x ratio). More representative of production inference workloads. |

---
### llama2-7b

| Parameter                 | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
| ------------------------- | -------------------------- | ---------------------- |
| `--cpu-memory-gb`         | **0**                      | **4**                  |
| `--max-concurrent-allocs` | **0**                      | **4**                  |
| `--users`                 | **150**                    | **200**                |
| `--duration`              | **300**                    | **300**                |
| `--generation-mode`       | **none**                   | **none**               |
| **Expected Ratio**        | WC Tput: **4.64x**         | Stor Tput: **2.34x**   |

```bash
# llama2-7b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama2-7b --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 150 --duration 300 --generation-mode none --output results/llama2-7b_stress_trial${N}.json

# llama2-7b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama2-7b --cpu-memory-gb 4 --max-concurrent-allocs 4 --users 200 --duration 300 --generation-mode none --output results/llama2-7b_tput_trial${N}.json
```
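Each per-model table lists an Expected Ratio between the two runs. A quick way to check a measured ratio from two throughput numbers is a small `awk` helper; this is a sketch, and the throughput values passed in below are placeholder numbers (chosen to illustrate the 4.64x target), not real results. Extracting the numbers from the result JSON files depends on the benchmark's actual output schema and is not shown here.

```shell
# Hypothetical helper: compute the throughput ratio between two runs.
# Arguments are throughput numbers extracted from the benchmark's result
# JSON files by whatever means matches your output schema.
ratio() {
  awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.2fx\n", fast / slow }'
}

# Placeholder values for illustration only:
ratio 835.2 180.0   # prints 4.64x
```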
---

### llama3.1-8b

| Parameter                 | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
| ------------------------- | -------------------------- | ---------------------- |
| `--cpu-memory-gb`         | **0**                      | **4**                  |
| `--max-concurrent-allocs` | **0**                      | **0**                  |
| `--users`                 | **200**                    | **150**                |
| `--duration`              | **300**                    | **300**                |
| `--generation-mode`       | **none**                   | **none**               |
| **Expected Ratio**        | WC Tput: **2.70x**         | Stor Tput: **2.87x**   |

```bash
# llama3.1-8b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama3.1-8b --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 200 --duration 300 --generation-mode none --output results/llama3.1-8b_stress_trial${N}.json

# llama3.1-8b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama3.1-8b --cpu-memory-gb 4 --max-concurrent-allocs 0 --users 150 --duration 300 --generation-mode none --output results/llama3.1-8b_tput_trial${N}.json
```

---
### llama3.1-70b-instruct

| Parameter                 | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
| ------------------------- | -------------------------- | ---------------------- |
| `--cpu-memory-gb`         | **0**                      | **4**                  |
| `--max-concurrent-allocs` | **0**                      | **4**                  |
| `--users`                 | **70**                     | **20**                 |
| `--duration`              | **300**                    | **300**                |
| `--generation-mode`       | **none**                   | **none**               |
| **Expected Ratio**        | WC Tput: **2.44x**         | Stor Tput: **3.25x**   |

```bash
# llama3.1-70b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama3.1-70b-instruct --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 70 --duration 300 --generation-mode none --output results/llama3.1-70b_stress_trial${N}.json

# llama3.1-70b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama3.1-70b-instruct --cpu-memory-gb 4 --max-concurrent-allocs 4 --users 20 --duration 300 --generation-mode none --output results/llama3.1-70b_tput_trial${N}.json
```

---
## Summary Table

| Model            | Invocation | cpu_mem | mca | users | Primary Metric  | Expected Ratio |
| ---------------- | ---------- | ------- | --- | ----- | --------------- | -------------- |
| **llama2-7b**    | Stress     | 0       | 0   | 150   | WC Throughput   | 4.64x          |
| **llama2-7b**    | Tput       | 4       | 4   | 200   | Stor Throughput | 2.34x          |
| **llama3.1-8b**  | Stress     | 0       | 0   | 200   | WC Throughput   | 2.70x          |
| **llama3.1-8b**  | Tput       | 4       | 0   | 150   | Stor Throughput | 2.87x          |
| **llama3.1-70b** | Stress     | 0       | 0   | 70    | WC Throughput   | 2.44x          |
| **llama3.1-70b** | Tput       | 4       | 4   | 20    | Stor Throughput | 3.25x          |

**Notes:**

- **The 70b model uses fewer users** because its larger KV cache means more memory per request
- **mca=0 is often best at cpu_mem=0** (no allocation throttling when the run is fully I/O-bound)
- **mca=4 is often best at cpu_mem=4** (moderate throttling helps throughput)
- **gen_mode=none** gives a pure storage benchmark (no simulated token delays)
- **Run 3-5 trials** and report the median
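The run-several-trials-and-report-the-median step can be sketched in shell. The trial loop reuses the llama2-7b throughput invocation from this document; the final `jq` extraction path is a hypothetical field name, since the benchmark's JSON schema is not documented here, and the loop is guarded so the sketch stays runnable without the benchmark script present.

```shell
# Run 5 trials of one invocation (guarded: only if kv-cache.py exists here).
if [ -f kv-cache.py ]; then
  for N in 1 2 3 4 5; do
    python kv-cache.py --model llama2-7b --cpu-memory-gb 4 \
      --max-concurrent-allocs 4 --users 200 --duration 300 \
      --generation-mode none --output "results/llama2-7b_tput_trial${N}.json"
  done
fi

# median: print the middle value of a list of numbers
# (mean of the two middle values, to 2 decimals, for an even count).
median() {
  printf '%s\n' "$@" | sort -n | awk '
    { vals[NR] = $1 }
    END {
      if (NR % 2) print vals[(NR + 1) / 2]
      else printf "%.2f\n", (vals[NR / 2] + vals[NR / 2 + 1]) / 2
    }'
}

# Once result files exist, extract a throughput value per trial and take
# the median. The .storage_throughput_tok_s key is an assumption; replace
# it with whatever field your result JSON actually uses, e.g.:
#   median $(jq '.storage_throughput_tok_s' results/llama2-7b_tput_trial*.json)
```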
