Memory bandwidth and latency for KV load
The calculation is based on the Llama 3 8B model with sequence length 1024.
- seq_len = 1024
- d_model (hidden_dim) = 4096
- number of query heads: 32
- number of key/value heads (GQA): 8
- d_type = fp16 (2 bytes per element)
- num_layers = 32
KV size per token = num_layers × 2 (K and V) × d_kv × bytes_per_element = 32 × 2 × 1024 × 2 = 131,072 bytes (128 KB), where d_kv = num_kv_heads × head_dim = 8 × 128 = 1024.
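As a sanity check, here is a minimal Python sketch of the same arithmetic (function names are illustrative, not from any library):

```python
# Minimal sketch: KV cache sizing for the Llama 3 8B configuration above.

def kv_bytes_per_token(num_layers: int, d_kv: int, bytes_per_element: int = 2) -> int:
    """Bytes of K and V cached per token across all layers (fp16 = 2 bytes)."""
    return num_layers * 2 * d_kv * bytes_per_element  # 2 = one K plus one V

d_kv = 8 * 128                       # num_kv_heads x head_dim
per_token = kv_bytes_per_token(num_layers=32, d_kv=d_kv)
total = 1024 * per_token             # seq_len = 1024
print(per_token)                     # 131072 bytes (128 KB)
print(total / 2**20)                 # 128.0 MB -> matches the table below
```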
This table compares different methods for retrieving KV cache data during inference, showing bandwidth requirements, latency estimates, and storage/retrieval approaches. The data size represents the total KV cache footprint for 1024 tokens across all 32 layers.
| Method | Medium | Bandwidth | GPU model | Data size (MB) | Latency min (ms) | Latency max (ms) | Notes |
|---|---|---|---|---|---|---|---|
| GPU retrieval | HBM | 2000 GB/s | A100 | 128 | 0.064 | 0.075 | Already in GPU memory |
| CPU retrieval | PCIe Gen4 | 32 GB/s | N/A | 128 | 4 | 6 | CPU→GPU transfer |
| CPU retrieval | PCIe Gen5 | 64 GB/s | N/A | 128 | 2 | 3 | CPU→GPU transfer |
| NVMe traditional | NVMe SSD | 5 GB/s | N/A | 128 | 29 | 43 | Disk→CPU→GPU (two-hop) |
| NVMe GPUDirect | NVMe SSD | 7 GB/s | N/A | 128 | 19 | 29 | Direct disk→GPU (GDS) |
| Storage enterprise | RDMA storage | 50 GB/s | N/A | 128 | 3 | 6 | VAST/Dell with GPUDirect |
| S3 retrieval | S3 Standard | 100 MB/s | N/A | 128 | 1313 | 1450 | Network + download |
| S3 retrieval | S3 Express | 1000 MB/s | N/A | 128 | 135 | 175 | Low-latency S3 |
| Recomputation | GPU compute | N/A | A10G 24GB | N/A | 360 | 560 | Lower-end GPU prefill |
| Recomputation | GPU compute | N/A | A100 80GB | N/A | 128 | 205 | A100 prefill (~5K tok/s) |
| Recomputation | GPU compute | N/A | H100 80GB | N/A | 80 | 128 | H100 prefill (~10K tok/s) |
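A minimal sketch of the arithmetic behind these numbers (ideal lower bounds of data_size / bandwidth; the table's figures add protocol and software overhead on top; function names are illustrative):

```python
# Lower-bound latency estimates for the KV retrieval table above.

DATA_SIZE_MB = 128  # KV cache for 1024 tokens of Llama 3 8B

def transfer_ms(data_mb: float, bandwidth_gb_per_s: float) -> float:
    """Lower-bound transfer latency; MB divided by GB/s yields milliseconds."""
    return data_mb / bandwidth_gb_per_s

def prefill_ms(num_tokens: int, tokens_per_s: float) -> float:
    """Lower-bound latency to recompute the KV cache by re-running prefill."""
    return num_tokens / tokens_per_s * 1000

print(transfer_ms(DATA_SIZE_MB, 2000))  # HBM:        0.064 ms
print(transfer_ms(DATA_SIZE_MB, 32))    # PCIe Gen4:  4.0 ms
print(transfer_ms(DATA_SIZE_MB, 5))     # NVMe:       25.6 ms
print(transfer_ms(DATA_SIZE_MB, 0.1))   # S3:         1280.0 ms
print(prefill_ms(1024, 5000))           # A100 prefill: ~204.8 ms
```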
Transformer Model Memory Analysis
The following table provides a detailed breakdown of forward-pass compute and KV cache sizes for various large language models, assuming sequence length 1024. The projection, attention, and MLP columns are forward-pass FLOPs per layer; the KV columns are memory sizes in bytes.
| Model | d_m | d_kv | d_q | d_ff | Layers | Q proj | K proj | V proj | Q×K^T | Attn×V | O proj | MLP up | MLP down | KV/layer | KV total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3 8B | 4096 | 1024 | 4096 | 14336 | 32 | 2×seq×d_m×d_q (2×1024×4096×4096=34.4B) | 2×seq×d_m×d_kv (2×1024×4096×1024=8.6B) | 2×seq×d_m×d_kv (2×1024×4096×1024=8.6B) | 2×seq²×d_q (2×1024²×4096=8.6B) | 2×seq²×d_q (2×1024²×4096=8.6B) | 2×seq×d_q×d_m (2×1024×4096×4096=34.4B) | 2×seq×d_m×d_ff (2×1024×4096×14336=120.3B) | 2×seq×d_ff×d_m (2×1024×14336×4096=120.3B) | seq×4KB (4MB) | seq×128KB (128MB) |
| Llama 3 70B | 8192 | 1024 | 8192 | 28672 | 80 | 2×seq×d_m×d_q (2×1024×8192×8192=137.4B) | 2×seq×d_m×d_kv (2×1024×8192×1024=17.2B) | 2×seq×d_m×d_kv (2×1024×8192×1024=17.2B) | 2×seq²×d_q (2×1024²×8192=17.2B) | 2×seq²×d_q (2×1024²×8192=17.2B) | 2×seq×d_q×d_m (2×1024×8192×8192=137.4B) | 2×seq×d_m×d_ff (2×1024×8192×28672=481.0B) | 2×seq×d_ff×d_m (2×1024×28672×8192=481.0B) | seq×4KB (4MB) | seq×320KB (320MB) |
| Llama 3.1 405B | 16384 | 1024 | 16384 | 53248 | 126 | 2×seq×d_m×d_q (2×1024×16384×16384=549.8B) | 2×seq×d_m×d_kv (2×1024×16384×1024=34.4B) | 2×seq×d_m×d_kv (2×1024×16384×1024=34.4B) | 2×seq²×d_q (2×1024²×16384=34.4B) | 2×seq²×d_q (2×1024²×16384=34.4B) | 2×seq×d_q×d_m (2×1024×16384×16384=549.8B) | 2×seq×d_m×d_ff (2×1024×16384×53248=1786.7B) | 2×seq×d_ff×d_m (2×1024×53248×16384=1786.7B) | seq×4KB (4MB) | seq×504KB (504MB) |
| Qwen 2.5 7B | 3584 | 512 | 3584 | 18944 | 28 | 2×seq×d_m×d_q (2×1024×3584×3584=26.3B) | 2×seq×d_m×d_kv (2×1024×3584×512=3.8B) | 2×seq×d_m×d_kv (2×1024×3584×512=3.8B) | 2×seq²×d_q (2×1024²×3584=7.5B) | 2×seq²×d_q (2×1024²×3584=7.5B) | 2×seq×d_q×d_m (2×1024×3584×3584=26.3B) | 2×seq×d_m×d_ff (2×1024×3584×18944=139.0B) | 2×seq×d_ff×d_m (2×1024×18944×3584=139.0B) | seq×2KB (2MB) | seq×56KB (56MB) |
| Qwen 2.5 14B | 5120 | 1024 | 5120 | 13824 | 48 | 2×seq×d_m×d_q (2×1024×5120×5120=53.7B) | 2×seq×d_m×d_kv (2×1024×5120×1024=10.7B) | 2×seq×d_m×d_kv (2×1024×5120×1024=10.7B) | 2×seq²×d_q (2×1024²×5120=10.7B) | 2×seq²×d_q (2×1024²×5120=10.7B) | 2×seq×d_q×d_m (2×1024×5120×5120=53.7B) | 2×seq×d_m×d_ff (2×1024×5120×13824=145.0B) | 2×seq×d_ff×d_m (2×1024×13824×5120=145.0B) | seq×4KB (4MB) | seq×192KB (192MB) |
| Qwen 2.5 72B | 8192 | 1024 | 8192 | 29568 | 80 | 2×seq×d_m×d_q (2×1024×8192×8192=137.4B) | 2×seq×d_m×d_kv (2×1024×8192×1024=17.2B) | 2×seq×d_m×d_kv (2×1024×8192×1024=17.2B) | 2×seq²×d_q (2×1024²×8192=17.2B) | 2×seq²×d_q (2×1024²×8192=17.2B) | 2×seq×d_q×d_m (2×1024×8192×8192=137.4B) | 2×seq×d_m×d_ff (2×1024×8192×29568=496.1B) | 2×seq×d_ff×d_m (2×1024×29568×8192=496.1B) | seq×4KB (4MB) | seq×320KB (320MB) |
| Gemma 2 2B | 2304 | 1024 | 2048 | 9216 | 26 | 2×seq×d_m×d_q (2×1024×2304×2048=9.7B) | 2×seq×d_m×d_kv (2×1024×2304×1024=4.8B) | 2×seq×d_m×d_kv (2×1024×2304×1024=4.8B) | 2×seq²×d_q (2×1024²×2048=4.3B) | 2×seq²×d_q (2×1024²×2048=4.3B) | 2×seq×d_q×d_m (2×1024×2048×2304=9.7B) | 2×seq×d_m×d_ff (2×1024×2304×9216=43.5B) | 2×seq×d_ff×d_m (2×1024×9216×2304=43.5B) | seq×4KB (4MB) | seq×104KB (104MB) |
| Gemma 2 9B | 3584 | 2048 | 4096 | 14336 | 42 | 2×seq×d_m×d_q (2×1024×3584×4096=30.1B) | 2×seq×d_m×d_kv (2×1024×3584×2048=15.0B) | 2×seq×d_m×d_kv (2×1024×3584×2048=15.0B) | 2×seq²×d_q (2×1024²×4096=8.6B) | 2×seq²×d_q (2×1024²×4096=8.6B) | 2×seq×d_q×d_m (2×1024×4096×3584=30.1B) | 2×seq×d_m×d_ff (2×1024×3584×14336=105.2B) | 2×seq×d_ff×d_m (2×1024×14336×3584=105.2B) | seq×8KB (8MB) | seq×336KB (336MB) |
| Gemma 2 27B | 4608 | 2048 | 4096 | 36864 | 46 | 2×seq×d_m×d_q (2×1024×4608×4096=38.7B) | 2×seq×d_m×d_kv (2×1024×4608×2048=19.3B) | 2×seq×d_m×d_kv (2×1024×4608×2048=19.3B) | 2×seq²×d_q (2×1024²×4096=8.6B) | 2×seq²×d_q (2×1024²×4096=8.6B) | 2×seq×d_q×d_m (2×1024×4096×4608=38.7B) | 2×seq×d_m×d_ff (2×1024×4608×36864=347.9B) | 2×seq×d_ff×d_m (2×1024×36864×4608=347.9B) | seq×8KB (8MB) | seq×368KB (368MB) |
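The per-operation entries can be reproduced with a short sketch (the factor of 2 counts a multiply-add as two FLOPs; function and key names are illustrative):

```python
# Minimal sketch of the per-layer forward-pass FLOP formulas used in the table.
# Convention: a matmul of shapes (m x k) . (k x n) costs 2*m*k*n FLOPs.

def layer_flops(seq: int, d_m: int, d_kv: int, d_q: int, d_ff: int) -> dict:
    return {
        "q_proj":   2 * seq * d_m * d_q,
        "k_proj":   2 * seq * d_m * d_kv,
        "v_proj":   2 * seq * d_m * d_kv,
        "qk_t":     2 * seq * seq * d_q,   # attention scores Q x K^T
        "attn_v":   2 * seq * seq * d_q,   # scores x V
        "o_proj":   2 * seq * d_q * d_m,
        "mlp_up":   2 * seq * d_m * d_ff,
        "mlp_down": 2 * seq * d_ff * d_m,
    }

# Llama 3 8B row:
f = layer_flops(seq=1024, d_m=4096, d_kv=1024, d_q=4096, d_ff=14336)
print(f["q_proj"] / 1e9)   # ~34.4 (billions, per layer)
print(f["mlp_up"] / 1e9)   # ~120.3
```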
Differences Between the Two NVMe Paths:
Traditional NVMe path (Disk→CPU→GPU):
- Bandwidth: ~5 GB/s (limited by PCIe to CPU, then CPU to GPU)
- Latency: Higher due to two-hop transfer + CPU involvement
- Steps: NVMe → System RAM → CPU processing → PCIe → GPU
- Use case: Standard systems without GPUDirect Storage
NVMe GPUDirect (Disk→GPU):
- Bandwidth: ~7 GB/s (direct DMA from NVMe to GPU memory)
- Latency: ~30-40% lower (single hop, bypasses the CPU)
- Steps: NVMe → GPU (direct DMA via PCIe)
- Use case: Modern systems with NVIDIA GPUDirect Storage support
- Requirements:
- Supported NVMe drives
- GPUDirect Storage drivers
- PCIe topology that allows direct NVMe-GPU transfers
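For concreteness, a minimal sketch of both load paths in Python, assuming CuPy and the RAPIDS KvikIO library (a Python wrapper around GPUDirect Storage) are installed; the file path and buffer size are illustrative placeholders:

```python
# Minimal sketch contrasting the two NVMe load paths for a serialized KV cache.
import numpy as np
import cupy as cp
import kvikio

PATH = "kv_cache.bin"                                    # hypothetical KV cache file
gpu_buf = cp.empty(128 * 2**20 // 2, dtype=cp.float16)   # 128 MB fp16 buffer on GPU

# Traditional path: NVMe -> system RAM (numpy) -> PCIe -> GPU (two hops).
host_buf = np.fromfile(PATH, dtype=np.float16, count=gpu_buf.size)
gpu_buf.set(host_buf)                                    # host-to-device copy over PCIe

# GPUDirect Storage path: NVMe -> GPU via direct DMA (single hop, CPU bypassed).
# KvikIO falls back to a host bounce buffer internally if GDS is unavailable.
f = kvikio.CuFile(PATH, "r")
f.read(gpu_buf)                                          # DMA straight into GPU memory
f.close()
```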