This is my quick-access note on roofline analysis for LLM inference.

Llama 3 8B model with sequence length 1024

  • n_q_head = 32
  • n_kv_head = 8
  • d_head = 128 (hidden dimension per head)
  • d_model = 4096 (= n_q_head × d_head)
  • d_ffn = 14336
  • num_layers = 32
  • vocab_size = 128256
  • d_type = 2 bytes (fp16/bf16)
  • seq_len (sequence length)
  • batch (batch size)
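
As a minimal Python sketch (the variable names here are my own choice; the later sketches redefine the constants they need so each one runs on its own):

```python
# Llama 3 8B configuration (fp16/bf16), mirroring the list above.
n_q_head = 32
n_kv_head = 8
d_head = 128            # dimension per attention head
d_model = 4096          # = n_q_head * d_head
d_ffn = 14336
num_layers = 32
vocab_size = 128256
dtype_bytes = 2         # fp16/bf16

# Workload knobs used throughout this note.
batch = 32
seq_len = 1024
```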

Roofline analysis

I will calculate the memory read/write bytes and FLOPs for prefill and decode. Memory traffic consists of model weights and activations (including the KV cache). The model weights consist of W_q, W_k, W_v, W_o, W_gate, W_up, and W_down.

FLOPs in a decoder-only transformer consist of flops_Q, flops_K, flops_V, flops_qk, flops_av, flops_O, flops_gate, flops_up, and flops_down. We ignore residual connections, layer normalization, and other insignificant computations in this post.

First, let’s do prefill.

Prefill

Memory

model weight

Numbers in parentheses are the resulting sizes for Llama 3 8B in fp16 (parameter count × 2 bytes)

Embedding = vocab_size × d_model (1GB)

W_q = d_model × d_model (≈ 34 MB)

W_k = d_model × n_kv_head × d_head (8MB)

W_v = d_model × n_kv_head × d_head (8MB) # same as W_k

W_o = d_model × n_q_head × d_head (≈ 34 MB)

W_gate = d_model × d_ffn (≈ 117 MB)

W_up = d_model × d_ffn (≈ 117 MB)

W_down = d_ffn × d_model (≈ 117 MB)

Total bytes per layer = W_q + W_k + W_v + W_o + W_gate + W_up + W_down (≈ 436 MB)

Total model weight bytes = 32 × 436 MB + 1 GB (embedding) + 1 GB (output lm_head, untied in Llama 3) ≈ 16 GB (≈ 8B parameters × 2 bytes)
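
The same arithmetic as a quick sketch (assuming, as in Llama 3, an untied output lm_head that adds another ~1 GB on top of the input embedding):

```python
d_model, d_ffn, n_kv_head, d_head = 4096, 14336, 8, 128
vocab_size, num_layers, dtype_bytes = 128256, 32, 2

w_q = d_model * d_model * dtype_bytes                 # ~33.6 MB
w_k = d_model * n_kv_head * d_head * dtype_bytes      # ~8.4 MB
w_v = w_k
w_o = d_model * d_model * dtype_bytes                 # ~33.6 MB
w_gate = d_model * d_ffn * dtype_bytes                # ~117 MB
w_up = w_gate
w_down = w_gate

per_layer = w_q + w_k + w_v + w_o + w_gate + w_up + w_down
embedding = vocab_size * d_model * dtype_bytes        # ~1.05 GB
total = num_layers * per_layer + 2 * embedding        # input embedding + output head

print(f"weights per layer: {per_layer / 1e6:.0f} MB")  # ~436 MB
print(f"total weights:     {total / 1e9:.1f} GB")      # ~16 GB
```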

activation

There are many intermediate tensors, so-called activations. We don’t need to write all of them to memory: optimized kernels avoid unnecessary reads/writes by fusing multiple operations, and the exact count depends on the specific kernel implementation. In any case, activations are not the dominant factor in memory traffic, so the details matter little. Let’s still work them out once, so we never have to look back.

Assuming well-optimized fused kernels, we count only these activations:

  • Attention input: batch × seq_len × d_model × 2 (read)
  • Attention output: batch × seq_len × d_model × 2 (write)
  • FFN output: batch × seq_len × d_model × 2 (write)
  • KV cache: batch × seq_len × n_kv_head × d_head × 2 × 2 (write: k and v)

Total activation bytes = batch × seq_len × d_model × 2 × 3 + batch × seq_len × n_kv_head × d_head × 2 × 2. For MHA (n_kv_head = n_q_head) the KV term equals batch × seq_len × d_model × 2, giving batch × seq_len × d_model × 8 in total; for Llama 3’s GQA (n_kv_head = n_q_head / 4) the KV term is only batch × seq_len × d_model, so the total is batch × seq_len × d_model × 7.
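
A sketch of the per-layer prefill activation traffic under the fused-kernel assumption above:

```python
batch, seq_len = 32, 1024
d_model, n_kv_head, d_head, dtype_bytes = 4096, 8, 128, 2

attn_in  = batch * seq_len * d_model * dtype_bytes                 # read
attn_out = batch * seq_len * d_model * dtype_bytes                 # write
ffn_out  = batch * seq_len * d_model * dtype_bytes                 # write
kv_cache = batch * seq_len * n_kv_head * d_head * dtype_bytes * 2  # write K and V

total_act = attn_in + attn_out + ffn_out + kv_cache
print(f"activation bytes per layer: {total_act / 1e6:.0f} MB")  # ~940 MB = 7 * batch * seq_len * d_model
```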

All logical activations are listed below. Again, most of them never need to be written to memory.

Attention Block - Fused Operations

  • layernorm write: [batch, seq_len, d_model] → FUSED with QKV projection
  • q write: [batch, seq_len, d_model] → FUSED, written directly as reshaped
  • k write: [batch, n_kv_head, seq_len, d_head] → WRITE
  • v write: [batch, n_kv_head, seq_len, d_head] → WRITE
  • QK^T write: [batch, n_q_head, seq_len, seq_len] → AVOIDED
  • softmax write: [batch, n_q_head, seq_len, seq_len] → AVOIDED
  • attn output write: [batch, n_q_head, seq_len, d_head] → FUSED with concat
  • concat write: [batch, seq_len, d_model] → FUSED with O projection
  • O projection write: [batch, seq_len, d_model] → WRITE
  • residual add write: [batch, seq_len, d_model] → FUSED with next layernorm

FFN Block - Fused Operations

  • layernorm write: [batch, seq_len, d_model] → FUSED with gate/up
  • gate write: [batch, seq_len, d_ffn] → FUSED with activation
  • up write: [batch, seq_len, d_ffn] → FUSED with activation
  • activation write: [batch, seq_len, d_ffn] → FUSED with down projection
  • down projection write: [batch, seq_len, d_model] → WRITE
  • residual add write: [batch, seq_len, d_model] → FUSED with next layer input

Computation (FLOPs)

Attention

flops_Q = batch × seq_len × d_model × d_model × 2

flops_K = batch × seq_len × d_model × (n_kv_head × d_head) × 2

flops_V = batch × seq_len × d_model × (n_kv_head × d_head) × 2

flops_qk = batch × n_q_head × seq_len × seq_len × d_head × 2 = batch × seq_len × d_model × seq_len × 2 (since d_model = n_q_head × d_head; the causal mask would roughly halve this, which we ignore)

flops_av = batch × n_q_head × seq_len × seq_len × d_head × 2 = batch × seq_len × d_model × seq_len × 2

flops_O = batch × seq_len × d_model × d_model × 2

FFN

flops_gate,up,down = 3 × (batch × seq_len × d_model × d_ffn) × 2
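
The same per-layer prefill FLOP count as a sketch (batch = 32, seq_len = 1024; causal masking and elementwise ops ignored, as above):

```python
batch, seq_len = 32, 1024
d_model, d_ffn, n_kv_head, d_head = 4096, 14336, 8, 128

flops_q  = batch * seq_len * d_model * d_model * 2
flops_k  = batch * seq_len * d_model * (n_kv_head * d_head) * 2
flops_v  = flops_k
flops_qk = batch * seq_len * d_model * seq_len * 2
flops_av = flops_qk
flops_o  = flops_q
flops_attn = flops_q + flops_k + flops_v + flops_qk + flops_av + flops_o
flops_ffn  = 3 * batch * seq_len * d_model * d_ffn * 2

print(f"attention: {flops_attn / 1e12:.1f} TFLOPs")               # ~3.3
print(f"ffn:       {flops_ffn / 1e12:.1f} TFLOPs")                # ~11.5
print(f"total:     {(flops_attn + flops_ffn) / 1e12:.1f} TFLOPs") # ~14.8
```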

Analysis

For Llama 3 8B with batch = 32, seq_len = 1024, and data_type_size = 2 bytes:

Total read/write bytes per layer = 436 MB (weights) + 940 MB (activations) ≈ 1,376 MB ≈ 1.4 GB

Total FLOPs per layer ≈ 14.8 TFLOPs

  • Attention: ~3.3 TFLOPs (Q, K, V, QK^T, AV, O)
  • FFN: ~11.5 TFLOPs (gate, up, down)

arithmetic intensity = FLOPs per layer / memory read/write bytes per layer ≈ 14.8 TFLOPs / (1,376 × 10^6 bytes) ≈ 10,000 FLOPs/byte

This is far above the peak arithmetic intensity (ridge point) of even an H100 SXM (295 FLOPs/byte), so Llama 3 8B prefill with batch = 32 and seq_len = 1024 is compute-bound.

If batch = 1, the FLOPs per layer drop by 32× to ≈ 0.46 TFLOPs, but the 436 MB of weight reads do not shrink; only the activations do (≈ 29 MB). So arithmetic intensity ≈ 0.46 × 10^12 FLOPs / (465 × 10^6 bytes) ≈ 1,000 FLOPs/byte.

Still compute-bound in theory (above H100’s 295 FLOPs/byte).
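
A sketch that puts the per-layer memory and FLOPs together into arithmetic intensity as a function of batch size (295 FLOPs/byte is the H100 SXM ridge point from the table below):

```python
def prefill_arith_intensity(batch, seq_len=1024):
    """Per-layer FLOPs / per-layer bytes for Llama 3 8B prefill (fused kernels)."""
    d_model, d_ffn, n_kv_head, d_head, dtype = 4096, 14336, 8, 128, 2
    # Memory bytes: weights are read once per layer regardless of batch size.
    weight_bytes = (2 * d_model * d_model
                    + 2 * d_model * n_kv_head * d_head
                    + 3 * d_model * d_ffn) * dtype                      # ~436 MB
    act_bytes = (3 * batch * seq_len * d_model
                 + 2 * batch * seq_len * n_kv_head * d_head) * dtype    # ~940 MB at batch=32
    # FLOPs (causal mask ignored).
    attn_flops = (2 * batch * seq_len * d_model * d_model               # Q and O projections
                  + 2 * batch * seq_len * d_model * n_kv_head * d_head  # K and V projections
                  + 2 * batch * seq_len * d_model * seq_len) * 2        # QK^T and AV
    ffn_flops = 3 * batch * seq_len * d_model * d_ffn * 2
    return (attn_flops + ffn_flops) / (weight_bytes + act_bytes)

H100_SXM_RIDGE = 295  # FLOPs/byte = 989 TFLOPs / 3,350 GB/s (see table below)
for b in (32, 1):
    ai = prefill_arith_intensity(b)
    print(f"batch={b:2d}: {ai:,.0f} FLOPs/byte ->",
          "compute-bound" if ai > H100_SXM_RIDGE else "memory-bound")
# batch=32 -> ~10,800 FLOPs/byte, batch=1 -> ~1,000 FLOPs/byte (both compute-bound on H100)
```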

Hardware specification

| GPU Model | Compute (TFLOPs, bf16/fp16) | Memory BW (GB/s) | Intensity (FLOPs/byte) |
|---|---|---|---|
| H100 SXM | 989 | 3,350 | 295 |
| H100 PCIe | 756 | 2,000 | 378 |
| A100 SXM | 312 | 2,039 | 153 |
| A100 PCIe | 312 | 1,555 | 201 |
| L40S | 362 | 864 | 419 |
| A10 | 125 | 600 | 208 |
| V100 | 125 (fp16) | 900 | 139 |
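
The roofline model itself is just a min(): attainable throughput = min(peak compute, arithmetic intensity × memory bandwidth). A tiny helper using the table values above (the example intensities are the ones derived in this note):

```python
def attainable_tflops(ai_flops_per_byte, peak_tflops, bw_gb_per_s):
    """Roofline: attainable TFLOPs = min(peak compute, AI * memory bandwidth)."""
    return min(peak_tflops, ai_flops_per_byte * bw_gb_per_s / 1e3)

# H100 SXM: 989 TFLOPs peak, 3,350 GB/s
print(attainable_tflops(10_000, 989, 3350))  # prefill, batch=32: 989 TFLOPs (hits the compute roof)
print(attainable_tflops(25, 989, 3350))      # decode,  batch=32: ~84 TFLOPs (limited by bandwidth)
```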

Decode

Now let’s look at the decode phase. Here seq_len = 1 (one new token per step) and seq_len_context = the cached context length.

Memory

model weight

  • Model weights: the same per-layer weights as in prefill, re-read on every decode step (≈ 436 MB per layer)

input tensor

  • Input tensor: batch × 1 × d_model × 2 (32 × 1 × 4096 × 2 = 262,144 bytes) (negligible)

activation

  • KV cache READ: batch × seq_len_context × n_kv_head × d_head × 2 × 2 (32 × 1024 × 8 × 128 × 2 × 2 = 134,217,728 bytes ≈ 134 MB)
  • KV cache WRITE: batch × 1 × n_kv_head × d_head × 2 × 2 (new token only) (32 × 1 × 8 × 128 × 2 × 2 = 131,072 bytes)
  • Attention input: batch × 1 × d_model × 2 (32 × 1 × 4096 × 2 = 262,144 bytes) (negligible)
  • Attention output: batch × 1 × d_model × 2 (32 × 1 × 4096 × 2 = 262,144 bytes) (negligible)
  • FFN intermediate (input to the down projection): batch × 1 × d_ffn × 2 (32 × 1 × 14336 × 2 = 917,504 bytes) (negligible)
  • FFN output: batch × 1 × d_model × 2 (32 × 1 × 4096 × 2 = 262,144 bytes) (negligible)

Rough total read/write bytes per layer = 436 MB (weights) + 136 MB (activations) ≈ 570 MB
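
The same decode memory arithmetic as a sketch (per layer, batch = 32, seq_len_context = 1024):

```python
batch, seq_len_context = 32, 1024
d_model, d_ffn, n_kv_head, d_head, dtype_bytes = 4096, 14336, 8, 128, 2

weights  = 436e6  # per-layer weights, same as prefill, re-read every decode step
kv_read  = batch * seq_len_context * n_kv_head * d_head * dtype_bytes * 2  # ~134 MB
kv_write = batch * 1 * n_kv_head * d_head * dtype_bytes * 2                # ~0.13 MB
misc     = (4 * batch * 1 * d_model * dtype_bytes
            + batch * 1 * d_ffn * dtype_bytes)                             # small input/output tensors

total = weights + kv_read + kv_write + misc
print(f"decode bytes per layer: {total / 1e6:.0f} MB")  # ~570 MB
```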

FLOPs

Attention

flops_Q = batch × 1 × d_model × d_model × 2

flops_K = batch × 1 × d_model × (n_kv_head × d_head) × 2

flops_V = batch × 1 × d_model × (n_kv_head × d_head) × 2

flops_qk = batch × n_q_head × 1 × seq_len_context × d_head × 2 = batch × 1 × d_model × seq_len_context × 2 (since d_model = n_q_head × d_head)

flops_av = batch × n_q_head × 1 × seq_len_context × d_head × 2 = batch × 1 × d_model × seq_len_context × 2 (since d_model = n_q_head × d_head)

flops_O = batch × 1 × d_model × d_model × 2

FFN

flops_ffn = 3 × (batch × 1 × d_ffn × d_model × 2)
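
And the decode FLOPs per layer as a sketch:

```python
batch, seq_len_context = 32, 1024
d_model, d_ffn, n_kv_head, d_head = 4096, 14336, 8, 128

flops_q  = batch * 1 * d_model * d_model * 2
flops_k  = batch * 1 * d_model * (n_kv_head * d_head) * 2
flops_v  = flops_k
flops_qk = batch * 1 * d_model * seq_len_context * 2
flops_av = flops_qk
flops_o  = flops_q
flops_attn = flops_q + flops_k + flops_v + flops_qk + flops_av + flops_o
flops_ffn  = 3 * batch * 1 * d_ffn * d_model * 2

print(f"attention: {flops_attn / 1e9:.1f} GFLOPs")               # ~3.2
print(f"ffn:       {flops_ffn / 1e9:.1f} GFLOPs")                # ~11.3
print(f"total:     {(flops_attn + flops_ffn) / 1e9:.1f} GFLOPs") # ~14.5
```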

Analysis

For Llama 3 8B with batch = 32, seq_len_context = 1024, and data_type_size = 2 bytes:

Total read/write bytes per layer = 436 MB (weights) + 136 MB (activations) ≈ 570 MB

Total FLOPs per layer ≈ 14.5 GFLOPs ≈ 0.015 TFLOPs

  • Attention: ~3.2 GFLOPs (Q, K, V, QK^T, AV, O)
  • FFN: ~11.3 GFLOPs (gate, up, down)

arithmetic intensity = FLOPs per layer / memory read/write bytes per layer ≈ 14.5 GFLOPs / (570 × 10^6 bytes) ≈ 25 FLOPs/byte ≪ H100 (295 FLOPs/byte)

Memory-bound!
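
As a final check, a sketch comparing the decode arithmetic intensity against the ridge points of the GPUs in the table above (the per-layer numbers are the estimates from this note):

```python
decode_flops_per_layer = 14.5e9   # ~15 GFLOPs, derived above
decode_bytes_per_layer = 570e6    # ~570 MB, derived above
ai = decode_flops_per_layer / decode_bytes_per_layer  # ~25 FLOPs/byte

hw = {  # peak TFLOPs (bf16/fp16), memory BW in GB/s, from the table above
    "H100 SXM": (989, 3350),
    "A100 SXM": (312, 2039),
    "L40S":     (362, 864),
    "A10":      (125, 600),
}
for name, (peak_tflops, bw) in hw.items():
    ridge = peak_tflops * 1e3 / bw  # FLOPs/byte where the compute and memory roofs meet
    print(f"{name:9s} ridge {ridge:4.0f} FLOPs/byte -> "
          f"decode (AI ~{ai:.0f}) memory-bound: {ai < ridge}")
```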