## Recommended Invocations by Model

### Why Two Invocations (cpu_mem=0 vs cpu_mem=4)?

| cpu_mem | Purpose | Primary Metric | Why |
| -------- | -------------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| **0 GB** | **Maximum Storage Stress** | Decode Bytes Read, Wall-Clock Throughput | All I/O goes through NVMe, producing roughly 4x more read traffic: a true test of storage bandwidth. |
| **4 GB** | **Storage Throughput Benchmark** | Storage Throughput (tok/s) | Some data is cached in RAM, the Storage Throughput metric works correctly (2.2x ratio), and the run is more representative of production inference workloads. |
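
As a quick sanity check on the "4x more read traffic" claim, you can compare the result files from a cpu_mem=0 run and a cpu_mem=4 run of the same model. The snippet below is only a sketch: the JSON keys `decode_bytes_read` and `storage_throughput_tok_s` are hypothetical names, so substitute whatever keys kv-cache.py actually writes.

```bash
# Sketch: ratio of read traffic between the two modes.
# `decode_bytes_read` and `storage_throughput_tok_s` are hypothetical keys --
# check the real JSON schema produced by kv-cache.py.
stress=results/llama2-7b_stress_trial1.json
tput=results/llama2-7b_tput_trial1.json

jq -n --slurpfile s "$stress" --slurpfile t "$tput" \
  '{bytes_read_ratio: ($s[0].decode_bytes_read / $t[0].decode_bytes_read),
    tok_s: {stress: $s[0].storage_throughput_tok_s, cached: $t[0].storage_throughput_tok_s}}'
```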

---

### llama2-7b

| Parameter | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
| ------------------------- | -------------------------- | ----------------------- |
| `--cpu-memory-gb` | **0** | **4** |
| `--max-concurrent-allocs` | **0** | **4** |
| `--users` | **150** | **200** |
| `--duration` | **300** | **300** |
| `--generation-mode` | **none** | **none** |
| **Expected Ratio** | Wall-Clock Tput: **4.64x** | Storage Tput: **2.34x** |

```bash
# llama2-7b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama2-7b --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 150 --duration 300 --generation-mode none --output results/llama2-7b_stress_trial${N}.json

# llama2-7b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama2-7b --cpu-memory-gb 4 --max-concurrent-allocs 4 --users 200 --duration 300 --generation-mode none --output results/llama2-7b_tput_trial${N}.json
```
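
Here `${N}` is the trial index; the benchmark does not set it for you. A minimal driver loop, which works the same way for every invocation in this document:

```bash
# Run trials 1-5 of the llama2-7b storage-stress invocation.
for N in 1 2 3 4 5; do
  python kv-cache.py --model llama2-7b --cpu-memory-gb 0 --max-concurrent-allocs 0 \
    --users 150 --duration 300 --generation-mode none \
    --output "results/llama2-7b_stress_trial${N}.json"
done
```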

---

### llama3.1-8b

| Parameter | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
| ------------------------- | -------------------------- | ----------------------- |
| `--cpu-memory-gb` | **0** | **4** |
| `--max-concurrent-allocs` | **0** | **0** |
| `--users` | **200** | **150** |
| `--duration` | **300** | **300** |
| `--generation-mode` | **none** | **none** |
| **Expected Ratio** | Wall-Clock Tput: **2.70x** | Storage Tput: **2.87x** |

```bash
# llama3.1-8b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama3.1-8b --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 200 --duration 300 --generation-mode none --output results/llama3.1-8b_stress_trial${N}.json

# llama3.1-8b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama3.1-8b --cpu-memory-gb 4 --max-concurrent-allocs 0 --users 150 --duration 300 --generation-mode none --output results/llama3.1-8b_tput_trial${N}.json
```

---

### llama3.1-70b-instruct

| Parameter | cpu_mem=0 (Storage Stress) | cpu_mem=4 (Throughput) |
| ------------------------- | -------------------------- | ----------------------- |
| `--cpu-memory-gb` | **0** | **4** |
| `--max-concurrent-allocs` | **0** | **4** |
| `--users` | **70** | **20** |
| `--duration` | **300** | **300** |
| `--generation-mode` | **none** | **none** |
| **Expected Ratio** | Wall-Clock Tput: **2.44x** | Storage Tput: **3.25x** |

```bash
# llama3.1-70b: Storage Stress (cpu_mem=0)
python kv-cache.py --model llama3.1-70b-instruct --cpu-memory-gb 0 --max-concurrent-allocs 0 --users 70 --duration 300 --generation-mode none --output results/llama3.1-70b_stress_trial${N}.json

# llama3.1-70b: Throughput Benchmark (cpu_mem=4)
python kv-cache.py --model llama3.1-70b-instruct --cpu-memory-gb 4 --max-concurrent-allocs 4 --users 20 --duration 300 --generation-mode none --output results/llama3.1-70b_tput_trial${N}.json
```

---

## Summary Table

| Model | Invocation | cpu_mem | mca | users | Primary Metric | Expected Ratio |
| ---------------- | ---------- | ------- | --- | ----- | ------------------ | -------------- |
| **llama2-7b** | Stress | 0 | 0 | 150 | WC Throughput | 4.64x |
| **llama2-7b** | Tput | 4 | 4 | 200 | Storage Throughput | 2.34x |
| **llama3.1-8b** | Stress | 0 | 0 | 200 | WC Throughput | 2.70x |
| **llama3.1-8b** | Tput | 4 | 0 | 150 | Storage Throughput | 2.87x |
| **llama3.1-70b** | Stress | 0 | 0 | 70 | WC Throughput | 2.44x |
| **llama3.1-70b** | Tput | 4 | 4 | 20 | Storage Throughput | 3.25x |

**Notes:**
- **The 70b model uses fewer users** because its larger KV cache consumes more memory per request.
- **mca=0 (`--max-concurrent-allocs`) is often best at cpu_mem=0**: no allocation throttling when the run is fully I/O-bound.
- **mca=4 is often best at cpu_mem=4**: moderate throttling helps throughput.
- **gen_mode=none** gives a pure storage benchmark with no simulated token delays.
- **Run 3-5 trials** per invocation and report the median; see the sketch below.
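
A minimal way to pull the median out of a set of trial files, assuming the result JSON exposes a top-level throughput field (`storage_throughput_tok_s` below is a hypothetical key; substitute the actual name from your output):

```bash
# Median storage throughput across trials (assumes an odd trial count;
# `storage_throughput_tok_s` is a hypothetical key -- check the real schema).
jq -r '.storage_throughput_tok_s' results/llama2-7b_tput_trial*.json \
  | sort -n \
  | awk '{a[NR]=$1} END {print a[int((NR+1)/2)]}'
```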