The calculation below is based on the Llama 3 8B model with a sequence length of 1024.

  • seq_len = 1024
  • d_model (hidden_dim) = 4096
  • number of query heads = 32
  • number of key/value heads (GQA) = 8
  • head_dim = d_model / num_query_heads = 4096 / 32 = 128, so d_kv = 8 × 128 = 1024
  • dtype = fp16 (2 bytes per value)
  • num_layers = 32

KV cache size per token = num_layers × 2 (K and V) × d_kv × bytes_per_value = 32 × 2 × 1024 × 2 = 131,072 bytes (128 KB, i.e. 0.125 MB)
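A minimal Python sketch of this arithmetic (variable names are illustrative; the values come from the configuration above):

```python
# KV cache size for the Llama 3 8B configuration listed above.
num_layers = 32         # decoder layers
num_kv_heads = 8        # GQA key/value heads
head_dim = 128          # d_model / num_query_heads = 4096 / 32
bytes_per_value = 2     # fp16
seq_len = 1024

d_kv = num_kv_heads * head_dim                            # 1024
kv_per_token = num_layers * 2 * d_kv * bytes_per_value    # K and V for every layer
kv_total = kv_per_token * seq_len

print(f"KV per token: {kv_per_token} bytes ({kv_per_token / 1024:.0f} KB)")
print(f"KV for {seq_len} tokens: {kv_total / 2**20:.0f} MB")
# KV per token: 131072 bytes (128 KB)
# KV for 1024 tokens: 128 MB
```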

The table below compares different methods of obtaining the KV cache during inference, showing the bandwidth of each medium, estimated latency, and the retrieval path. The data size (128 MB) is the total KV cache footprint for 1024 tokens across all 32 layers.

| Method | Medium | Bandwidth | GPU model | Data size (MB) | Latency min (ms) | Latency max (ms) | Notes |
|--------|--------|-----------|-----------|----------------|------------------|------------------|-------|
| GPU retrieval | HBM | 2000 GB/s | A100 | 128 | 0.064 | 0.075 | Already in GPU memory |
| CPU retrieval | PCIe Gen4 | 32 GB/s | N/A | 128 | 4 | 6 | CPU→GPU transfer |
| CPU retrieval | PCIe Gen5 | 64 GB/s | N/A | 128 | 2 | 3 | CPU→GPU transfer |
| NVMe (traditional) | NVMe SSD | 5 GB/s | N/A | 128 | 29 | 43 | Disk→CPU→GPU (two-hop) |
| NVMe (GPUDirect) | NVMe SSD | 7 GB/s | N/A | 128 | 19 | 29 | Direct disk→GPU (GDS) |
| Enterprise storage | RDMA storage | 50 GB/s | N/A | 128 | 3 | 6 | VAST/Dell with GPUDirect |
| S3 retrieval | S3 Standard | 100 MB/s | N/A | 128 | 1313 | 1450 | Network + download |
| S3 retrieval | S3 Express | 1000 MB/s | N/A | 128 | 135 | 175 | Low-latency S3 |
| Recomputation | GPU compute | N/A | A10G 24GB | N/A | 360 | 560 | Lower-end GPU prefill |
| Recomputation | GPU compute | N/A | A100 80GB | N/A | 128 | 205 | A100 prefill (~5K tok/s) |
| Recomputation | GPU compute | N/A | H100 80GB | N/A | 80 | 128 | H100 prefill (~10K tok/s) |
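
The latency figures are, to first order, data size divided by bandwidth (retrieval) or prompt length divided by prefill throughput (recomputation), plus per-medium overhead. A rough sketch of the lower bounds, assuming the bandwidth and throughput numbers from the table:

```python
# Lower-bound latencies for moving a 128 MB KV cache (1024 tokens, Llama 3 8B).
# Bandwidths are the table's assumptions; real systems add protocol/software overhead.
DATA_MB = 128

bandwidth_gb_per_s = {
    "HBM (A100)": 2000,
    "PCIe Gen4": 32,
    "PCIe Gen5": 64,
    "NVMe (traditional)": 5,
    "NVMe (GPUDirect)": 7,
    "RDMA storage": 50,
    "S3 Standard": 0.1,
    "S3 Express": 1.0,
}

for medium, gbps in bandwidth_gb_per_s.items():
    # 1 GB/s = 1000 MB per 1000 ms = 1 MB/ms, so MB / (GB/s) gives milliseconds.
    print(f"{medium:20s} {DATA_MB / gbps:8.3f} ms")

# Recomputation instead of retrieval: prefill time = prompt tokens / prefill throughput.
prefill_tok_per_s = {"A100 80GB": 5_000, "H100 80GB": 10_000}  # approximate, from the notes
for gpu, tps in prefill_tok_per_s.items():
    print(f"{gpu:20s} {1024 / tps * 1000:8.1f} ms recompute")
```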

Transformer Model Memory Analysis

The following table provides a detailed breakdown of forward-pass compute and KV cache sizes for several large language models, assuming a sequence length of 1024. Compute is given in FLOPs per operation; memory is given in bytes, assuming fp16 (2 bytes per value).

Each FLOP count is per layer, obtained by substituting the model's dimensions into the following formulas (seq = 1024, counting 2 FLOPs per multiply-accumulate):

  • Q proj = 2 × seq × d_m × d_q
  • K proj = V proj = 2 × seq × d_m × d_kv
  • Q×K^T = Attn×V = 2 × seq² × d_q
  • O proj = 2 × seq × d_q × d_m
  • MLP up = MLP down = 2 × seq × d_m × d_ff

KV cache per layer = seq × 2 × d_kv × 2 bytes (fp16); KV cache total = per-layer size × number of layers.

| Model | d_m | d_kv | d_q | d_ff | Layers | Q proj | K proj | V proj | Q×K^T | Attn×V | O proj | MLP up | MLP down | KV/layer | KV total |
|-------|-----|------|-----|------|--------|--------|--------|--------|-------|--------|--------|--------|----------|----------|----------|
| Llama 3 8B | 4096 | 1024 | 4096 | 14336 | 32 | 34.4B | 8.6B | 8.6B | 8.6B | 8.6B | 34.4B | 120.3B | 120.3B | 4 KB/tok (4 MB) | 128 KB/tok (128 MB) |
| Llama 3 70B | 8192 | 1024 | 8192 | 28672 | 80 | 137.4B | 17.2B | 17.2B | 17.2B | 17.2B | 137.4B | 481.4B | 481.4B | 4 KB/tok (4 MB) | 320 KB/tok (320 MB) |
| Llama 3.1 405B | 16384 | 1024 | 16384 | 53248 | 126 | 549.8B | 34.4B | 34.4B | 34.4B | 34.4B | 549.8B | 1788.2B | 1788.2B | 4 KB/tok (4 MB) | 504 KB/tok (504 MB) |
| Qwen 2.5 7B | 3584 | 512 | 3584 | 18944 | 28 | 26.3B | 3.8B | 3.8B | 7.5B | 7.5B | 26.3B | 139.0B | 139.0B | 2 KB/tok (2 MB) | 56 KB/tok (56 MB) |
| Qwen 2.5 14B | 5120 | 1024 | 5120 | 13824 | 48 | 53.7B | 10.7B | 10.7B | 10.7B | 10.7B | 53.7B | 145.1B | 145.1B | 4 KB/tok (4 MB) | 192 KB/tok (192 MB) |
| Qwen 2.5 72B | 8192 | 1024 | 8192 | 29568 | 80 | 137.4B | 17.2B | 17.2B | 17.2B | 17.2B | 137.4B | 496.3B | 496.3B | 4 KB/tok (4 MB) | 320 KB/tok (320 MB) |
| Gemma 2 2B | 2304 | 1024 | 2048 | 9216 | 26 | 9.7B | 4.8B | 4.8B | 4.3B | 4.3B | 9.7B | 43.5B | 43.5B | 4 KB/tok (4 MB) | 104 KB/tok (104 MB) |
| Gemma 2 9B | 3584 | 2048 | 4096 | 14336 | 42 | 30.1B | 15.0B | 15.0B | 8.6B | 8.6B | 30.1B | 105.4B | 105.4B | 8 KB/tok (8 MB) | 336 KB/tok (336 MB) |
| Gemma 2 27B | 4608 | 2048 | 4096 | 36864 | 46 | 38.7B | 19.3B | 19.3B | 8.6B | 8.6B | 38.7B | 348.2B | 348.2B | 8 KB/tok (8 MB) | 368 KB/tok (368 MB) |
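
A short Python sketch that reproduces the Llama 3 8B row from the formulas above; substituting the other models' dimensions reproduces the remaining rows:

```python
# Per-layer forward-pass FLOPs and KV cache sizes for Llama 3 8B (seq = 1024, fp16 KV).
seq, d_m, d_kv, d_q, d_ff, layers = 1024, 4096, 1024, 4096, 14336, 32

flops = {
    "Q proj":   2 * seq * d_m * d_q,
    "K proj":   2 * seq * d_m * d_kv,
    "V proj":   2 * seq * d_m * d_kv,
    "Q x K^T":  2 * seq**2 * d_q,
    "Attn x V": 2 * seq**2 * d_q,
    "O proj":   2 * seq * d_q * d_m,
    "MLP up":   2 * seq * d_m * d_ff,
    "MLP down": 2 * seq * d_ff * d_m,
}
for name, f in flops.items():
    print(f"{name:9s} {f / 1e9:7.1f}B FLOPs per layer")

kv_per_layer = seq * 2 * d_kv * 2            # K and V, fp16 -> 2 bytes per value
kv_total = kv_per_layer * layers
print(f"KV cache: {kv_per_layer / 2**20:.0f} MB per layer, {kv_total / 2**20:.0f} MB total")
# Q proj 34.4B, K/V proj 8.6B, MLP up/down 120.3B; KV 4 MB per layer, 128 MB total
```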

Differences Between the Two NVMe Paths (a code sketch of both follows the lists below):

Traditional NVMe path (Disk→CPU→GPU):

  • Bandwidth: ~5 GB/s (limited by PCIe to CPU, then CPU to GPU)
  • Latency: Higher due to two-hop transfer + CPU involvement
  • Steps: NVMe → System RAM → CPU processing → PCIe → GPU
  • Use case: Standard systems without GPUDirect Storage

NVMe GPUDirect (Disk→GPU):

  • Bandwidth: ~7 GB/s (direct DMA from NVMe to GPU memory)
  • Latency: ~30-40% lower - single hop, bypasses CPU
  • Steps: NVMe → GPU (direct DMA via PCIe)
  • Use case: Modern systems with NVIDIA GPUDirect Storage support
  • Requirements:
    • Supported NVMe drives
    • GPUDirect Storage drivers
    • PCIe topology that allows direct NVMe-GPU transfers
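
To make the two paths concrete, here is a rough Python sketch, assuming CuPy for the GPU buffer and the RAPIDS KvikIO bindings for GPUDirect Storage; the file path and size are illustrative, and a real KV-cache loader would also handle offsets and pinned host memory:

```python
import numpy as np
import cupy as cp
import kvikio  # RAPIDS KvikIO: Python bindings over cuFile / GPUDirect Storage

KV_BYTES = 128 * 2**20        # 128 MB KV cache blob (1024 tokens, illustrative)
PATH = "kv_cache.bin"         # hypothetical on-disk KV cache file

def load_traditional(path: str) -> cp.ndarray:
    """NVMe -> system RAM -> GPU: two hops, with the CPU and page cache in the middle."""
    host = np.fromfile(path, dtype=np.uint8, count=KV_BYTES)   # disk -> CPU memory
    return cp.asarray(host)                                    # CPU memory -> GPU over PCIe

def load_gpudirect(path: str) -> cp.ndarray:
    """NVMe -> GPU: single DMA hop via GPUDirect Storage; the CPU stays off the data path."""
    dev = cp.empty(KV_BYTES, dtype=cp.uint8)    # destination buffer in GPU memory
    with kvikio.CuFile(path, "r") as f:
        f.read(dev)                             # cuFile read directly into the GPU buffer
    return dev
```

If GPUDirect Storage is not available (unsupported drive, missing drivers, or an unfavorable PCIe topology), KvikIO falls back to a host bounce buffer, which effectively behaves like the traditional path.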