GPU Communication Architecture: NCCL, NVSHMEM, and NVLink
- GPU Communication Architecture: NCCL, NVSHMEM, and NVLink
- 1. Communication Library Overview
- 2. Two-Sided vs One-Sided Communication
- 3. Performance Analysis: Matrix Transpose Benchmark
- 4. Data Movement: Pointer Access vs Explicit Get
- 5. Address Discovery and Translation in NVLink Domains
- 6. NVLink Architecture and Functionality
- 7. NVLink Topologies: Point-to-Point vs Switch
- 8. GPU MMU vs CPU IOMMU
- 10. Expert Parallelism Use Case
- 12. Key Takeaways
- NVSHMEM Synchronization Mechanisms: A Technical Deep Dive
- 1. The Fundamental Challenge
- 2. NVSHMEM Synchronization Primitives
- 3. Two Approaches Compared
- 4. Understanding βAtomicityβ in nvshmem_put_signal()
- 5. Memory Ordering Guarantees
- 6. Symmetric Heap Model
- 7. One-Sided vs Two-Sided Communication
- 8. Performance Implications
- 9. Complete Working Example
- 10. Key Takeaways
- GPU Communication Protocols: Simple, LL, LL128
- DMA
GPU Communication Architecture: NCCL, NVSHMEM, and NVLink
This document examines GPU-to-GPU communication mechanisms in distributed systems, focusing on NVIDIAβs NCCL and NVSHMEM libraries, their architectural differences, performance characteristics, and the underlying NVLink fabric. We analyze when to use two-sided versus one-sided communication, explain the address translation pipeline, and clarify common misconceptions about NVLinkβs role in memory access.
1. Communication Library Overview
1.1 NCCL (NVIDIA Collective Communications Library)
Design Philosophy: Optimized for AI training/inference collective patterns. Key Characteristics:
Stream-based integration with CUDA runtime Collective-first API design (all-reduce, all-to-all, broadcast, etc.) Automatic protocol selection based on message size and topology Scales from single workstation to thousands of GPUs Supports all AI parallelism patterns: data, tensor, pipeline, expert
Limitations:
Two-sided communication model (synchronization coupled with data movement) No device-initiated communication in current versions (changing with recent previews) Single rank per GPU enforcement
1.2 NVSHMEM (NVIDIA Symmetric Memory)
Design Philosophy: PGAS (Partitioned Global Address Space) model derived from OpenSHMEM.
Key Characteristics:
One-sided communication (put/get) decouples synchronization from data movement Device-initiated communication: CUDA kernels can directly communicate Symmetric heap: collective memory allocation across all processing elements Direct pointer access within NVLink domains Supports host and device-side APIs
Unique Capabilities:
nvshmem_ptr(): Returns pointer for direct load/store within NVLink domains Device-initiated collectives with cooperative groups Thread, warp, and block-level communication primitives
Restrictions:
Symmetric memory allocation requirement (all PEs allocate same size) Limited flexibility for heterogeneous application composition More complex synchronization management
2. Two-Sided vs One-Sided Communication
2.1 Two-Sided (Send/Receive)
Mechanism: Sender side: Receiver side:
- Message metadata - Expected metadata
- Source buffer address - Destination buffer address
- Matching tag - Matching tag
β β
Matching protocol ensures alignment
β β
Data transfer occurs with implicit synchronization
Protocol Overhead:
- Sender posts send with metadata
- Receiver posts receive (may be before or after send)
- Matching phase: align send with receive
- Rendezvous handshake (for large messages):
- Sender notifies receiver
- Receiver registers memory and sends permission
- Sender performs RDMA write
- Completion notification Advantages:
Clear execution ordering semantics Safe by default (no accidental overwrites) Portable across all network types
Disadvantages:
Matching overhead (especially with wildcards) Enforced synchronization may be unnecessary Difficult to implement efficiently on GPUs
2.2 One-Sided (Put/Get)
Mechanism: Initiator side only:
- Source buffer address
- Target PE and offset
- Transfer size β Direct data movement β No receiver involvement Setup Phase (one-time cost): c// Collective memory allocation void *symmetric_heap = nvshmem_malloc(SIZE); // All PEs exchange addresses and register memory // Creates globally accessible address space Runtime Phase (per operation): c// Put: write to remote memory nvshmem_int_put(remote_addr, local_data, count, target_pe); // No receiver-side code needed
// Get: read from remote memory
nvshmem_int_get(local_data, remote_addr, count, target_pe);
// No sender-side code needed
Synchronization Responsibility:
c// Programmer must ensure ordering
nvshmem_put(β¦); // Write data
nvshmem_fence(); // Ensure write completes
nvshmem_put_signal(β¦); // Notify remote side
// Remote side nvshmem_wait_until(β¦); // Poll for notification // Now safe to read data Advantages:
Eliminates matching overhead Decoupled synchronization enables optimization Natural fit for unstructured communication patterns Can amortize setup cost across many operations
Disadvantages:
More complex programming model Manual synchronization required Potential for race conditions if misused
3. Performance Analysis: Matrix Transpose Benchmark
3.1 Workload Description
Algorithm: Distributed matrix transpose with accumulation c// Each GPU owns column blocks // Transpose scatters blocks to all GPUs (all-to-all pattern) for (int b = 0; b < num_blocks; b++) { transpose_and_scatter(block[b]); // Communication local_transpose(block[b]); // Computation accumulate(block[b]); // Computation } Hardware: DGX H100
8x H100 GPUs NVLink Switch interconnect Local memory bandwidth: 24 TB/s aggregate (3 TB/s per GPU) NVLink bisection bandwidth: 3.6 TB/s theoretical
3.2 Implementation Variants
NCCL All-to-All: cnccl_group_start(); for (int i = 0; i < npes; i++) { nccl_send(send_blocks[i], β¦, i, β¦); nccl_recv(recv_blocks[i], β¦, i, β¦); } nccl_group_end(); // Then transpose and accumulate locally Performance: 7.5 TB/s aggregate throughput
NCCL Point-to-Point (manual pairwise): cfor (int i = 0; i < npes; i++) { int peer = (rank + i) % npes; nccl_send(send_block, β¦, peer, β¦); nccl_recv(recv_block, β¦, peer, β¦); local_transpose(recv_block); } Performance: 4-5 TB/s (throttled by NCCLβs flow control) Reason: NCCL assumes custom collective implementation, applies conservative flow control NVSHMEM All-to-All: cnvshmem_barrier_all(); // Required: no implicit sync for (int i = 0; i < npes; i++) { nvshmem_get(local_blocks[i], remote_blocks[i], size, i); } nvshmem_barrier_all(); // Then transpose and accumulate locally Performance: 7.5 TB/s (similar to NCCL) NVSHMEM Get with Interleaving:
cnvshmem_barrier_all();
for (int i = 0; i < npes; i++) {
nvshmem_get(local_block, remote_block, size, i);
local_transpose(local_block); // Interleave comm and compute
}
nvshmem_barrier_all();
Performance: 7.5 TB/s (minimal difference at full network saturation) NVSHMEM Pointer Access (fused):
nvshmem_barrier_all();
// Single kernel launch
fused_kernel<<<...>>>() {
for (int b = 0; b < num_blocks; b++) {
int *remote = nvshmem_ptr(blocks[b], target_pe);
// Direct loads from remote memory
transpose_inplace(remote, output);
}
}
nvshmem_barrier_all();
Performance: 9.5 TB/s (30% improvement)
3.3 Performance Breakdown
NCCL All-to-All Time Budget:
- Communication: 2.7 TB/s (close to bisection limit)
- Local transpose: 23 TB/s
- Accumulate: 23 TB/s
- Serialized execution limits aggregate throughput
NVSHMEM Pointer Time Budget:
- No explicit communication phase (fused into compute)
- Remote loads occur during transpose kernel execution
- Better cache utilization
- Reduced kernel launch overhead
4. Data Movement: Pointer Access vs Explicit Get
4.1 The Critical Difference
Common Misconception: βBoth methods transfer data over NVLink, so performance should be identical.β
Reality: The data path and overhead are fundamentally different.
4.2 NVSHMEM Explicit Get Data Path
Step 1: Function call overhead
nvshmem_int_get(local_buf, remote_data, size, pe)
β
Step 2: NVSHMEM runtime processing
- Validate parameters
- Lookup PE connection info
- Determine protocol (NVLink vs IB)
β
Step 3: Initiate transfer (may use Copy Engine or DMA)
β
Step 4: Data movement
Remote GPU HBM β NVLink β Local GPU HBM (explicit copy)
β
Step 5: Synchronization (ensure completion)
β
Step 6: Application accesses data
Local GPU HBM β L2 cache β Registers
Total path: Remote HBM β Local HBM β L2 β Registers
Key overhead sources:
- Software function call and parameter validation
- Explicit copy allocates local HBM space
- Synchronization logic
- Two memory copies (remote-to-local, local-to-cache)
4.3 NVSHMEM Pointer Access Data Path
Step 1: One-time pointer lookup (outside critical path)
int *remote_ptr = nvshmem_ptr(data, pe);
β
Step 2: Direct load instruction in kernel
value = *remote_ptr;
β
Step 3: GPU MMU translation
Virtual address β (Target GPU ID + Physical offset)
β
Step 4: NVLink load transaction
Remote GPU HBM β NVLink β Local GPU L2 cache β Registers
Total path: Remote HBM β L2 β Registers
Key advantages:
No intermediate copy to local HBM
Single GPU load instruction (hardware operation)
No software overhead per access
Data goes directly to L2 cache
Can leverage GPU cache hierarchy
### 4.4 Bandwidth and Latency Comparison
MetricExplicit GetPointer AccessSoftware overhead~500-1000 ns per callOne-time lookup onlyData pathRemote HBM β Local HBM β L2Remote HBM β L2Memory copies2 (remoteβlocal, localβcache)1 (remoteβcache)HBM allocationRequires local bufferNoneLatency per access~1-5 ΞΌs~100-200 nsEffective bandwidthLimited by software overheadLimited by NVLink (450 GB/s)
### 4.5 When Explicit Get is Superior
**Case 1: Multiple Accesses to Same Data**
```cpp
// Scenario: Reuse data multiple times
// Pointer access - pays NVLink latency every time
int *remote_ptr = nvshmem_ptr(data, pe);
for (int iter = 0; iter < 1000; iter++) {
for (int i = 0; i < N; i++) {
sum += remote_ptr[i]; // Remote load each time
// If N > L2 cache size, every access goes over NVLink
}
}
// Explicit get - one-time copy, then local access
int *local_buf = malloc(N * sizeof(int));
nvshmem_int_get(local_buf, remote_data, N, pe);
for (int iter = 0; iter < 1000; iter++) {
for (int i = 0; i < N; i++) {
sum += local_buf[i]; // Local HBM: 2-3 TB/s
}
}
Break-even analysis:
NVLink bandwidth: 450 GB/s Local HBM bandwidth: 2500 GB/s Explicit get wins when: num_accesses > 2-3
Case 2: Large Data Exceeding L2 Cache cuda// Data size: 1 GB // H100 L2 cache: 50 MB per GPU
// Pointer access: // - 1 GB doesnβt fit in 50 MB cache // - Every access misses β remote fetch over NVLink // - Effective bandwidth: 450 GB/s
// Explicit get: // - One-time transfer: 1 GB / 450 GB/s = 2.2 ms // - Subsequent accesses: local HBM at 2-3 TB/s // - For compute-heavy workloads: massive win
Case 3: Communication/Computation Overlap
cuda// Explicit get enables stream-based overlap cudaStream_t comm_stream, comp_stream;
// Pipeline: fetch next while computing current nvshmem_get_on_stream(buf_A, remote_A, size, pe, comm_stream); compute_kernelΒ«<β¦, comp_streamΒ»>(buf_B);
// Pointer access: no overlap capability // Stalls on every remote load
4.6 When Pointer Access is Superior
Case 1: Single-Pass Streaming (Jeffβs transpose benchmark) cuda// Each block accessed exactly once for (int b = 0; b < num_blocks; b++) { int *remote = nvshmem_ptr(block[b], pe); transpose_kernelΒ«<β¦Β»>(remote, output); } // No reuse β no benefit from local copy // Pointer access eliminates copy overhead
Case 2: Sparse/Random Access cuda// Access 1% of remote array int *remote_ptr = nvshmem_ptr(data, pe); for (int i : sparse_indices) { // 1% of indices value = remote_ptr[i]; } // Total transfer: 1% of array size
// vs explicit get would copy 100% unnecessarily
Case 3: Small Messages cuda// 64-byte transfer int value = remote_ptr; // ~200 ns // vs nvshmem_int_get(&local, remote, 1, pe); // ~1000 ns // Software overhead dominates for small transfers Case 4: Fine-Grained Producer-Consumer cuda// Polling pattern while (!done) { __threadfence_system(); if (remote_flag == READY) { // Direct check process(*remote_data); *remote_flag = DONE; } } // Low latency, direct access beats get/put for flags
4.7 Decision Matrix
ScenarioWinnerReasonSingle-pass readPointerNo copy overheadMultiple iterations (3+)GetAmortize copy, use local HBM speedDataΒ Β» L2 cacheGetCache misses make remote slowData < 1 KBPointerSoftware overhead mattersSparse/random accessPointerFetch only whatβs neededDense computationGetKeep working set localNeed overlapGetStream-based pipeliningFine-grained syncPointerLow latency critical
5. Address Discovery and Translation in NVLink Domains
5.1 CUDA IPC Setup Phase
Initialization sequence: c// During nvshmem_init(), all PEs participate
// Step 1: Each GPU allocates symmetric heap void *my_heap = cudaMalloc(SIZE); // GPU 0: 0x7f8a00000000 // GPU 1: 0x7f9b00000000 (different virtual address!) // GPU 2: 0x7fac00000000
// Step 2: Export memory handle cudaIpcMemHandle_t my_handle; cudaIpcGetMemHandle(&my_handle, my_heap);
// Step 3: Exchange handles (collective operation) cudaIpcMemHandle_t remote_handles[8]; all_gather(my_handle, remote_handles); // via control plane
// Step 4: Import remote handles for (int pe = 0; pe < npes; pe++) { if (pe != my_pe) { void *remote_ptr; cudaIpcOpenMemHandle(&remote_ptr, remote_handles[pe], cudaIpcMemLazyEnablePeerAccess); // Now have local virtual address mapping to remote physical ipc_mappings[pe] = remote_ptr; } }
### 5.2 GPU MMU Page Tables
**Post-IPC setup, each GPU has**:
βββββββββββββββββββββββββββββββββββββββββββββββ β GPU 0 MMU Page Table β ββββββββββββββββββββ¬βββββββββββββββββββββββββββ€ β Virtual Address β Physical Location β ββββββββββββββββββββΌβββββββββββββββββββββββββββ€ β 0x7f8a00000000 β Local HBM, offset 0 β β 0x7f9b00000000 β GPU 1 HBM, offset 0 β β 0x7fac00000000 β GPU 2 HBM, offset 0 β β 0x7fbd00000000 β GPU 3 HBM, offset 0 β β β¦ β β¦ β ββββββββββββββββββββ΄βββββββββββββββββββββββββββ
Key insight: Each GPU's MMU knows which virtual addresses map to which remote GPUs.
### 5.3 NVSHMEM's Internal Translation Table
```cpp
// Simplified internal structure
struct nvshmem_pe_info {
int pe_id;
void *symmetric_base; // This PE's base address
void *ipc_mapped_ptrs[MAX_PES]; // IPC-mapped remote addresses
bool nvlink_accessible[MAX_PES]; // Topology info
// For non-NVLink PEs: network connection info
};
// nvshmem_ptr() implementation
void* nvshmem_ptr(void *source, int target_pe) {
// Calculate offset within symmetric heap
size_t offset = (char*)source - (char*)my_symmetric_base;
// Check if target is NVLink-accessible
if (pe_info[target_pe].nvlink_accessible) {
// Return IPC-mapped pointer for direct load/store
return (char*)pe_info[target_pe].ipc_mapped_ptrs + offset;
} else {
// Cross-node or no NVLink β return NULL
// Must use explicit put/get
return NULL;
}
}
5.4 Complete Memory Access Flow
// GPU 0 executes:
int *remote_ptr = nvshmem_ptr(data, 3); // Points to GPU 3
int value = remote_ptr[100]; // Load from offset 100
Step-by-step execution:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 1. NVSHMEM API Layer (one-time, outside critical path) β
β nvshmem_ptr(data, 3) β
β β Lookup IPC mapping for PE 3 β
β β Return: 0x7fbd00000000 (virtual addr for GPU 3) β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 2. CUDA Kernel Load Instruction β
β LD R0, [0x7fbd00000064] // 0x100 offset + base β
β β Issued by GPU 0's load/store unit β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 3. GPU 0 MMU (Memory Management Unit) β
β TLB lookup: 0x7fbd00000064 β
β Result: "Remote memory" β
β "Target: GPU_ID=3" β
β "Physical offset: 0x64" β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 4. GPU 0 Memory Controller / NVLink Interface β
β Create NVLink packet: β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β Header: β β
β β - Destination GPU: 3 β β
β β - Type: READ β β
β β - Size: 64 bytes (cache line) β β
β β - Transaction ID: 0x4A2B (for matching) β β
β β Physical Address: 0x64 β β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β NVLink physical lanes (18 Gbps each)
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 5. NVLink Fabric (Switch or Point-to-Point) β
β Parse packet header β
β Routing decision: Dest=3 β Output port for GPU 3 β
β Forward packet through crossbar β
β - Manage link credits (flow control) β
β - CRC check, retry on error β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 6. GPU 3 NVLink Interface β
β Receive packet, verify CRC β
β Extract: Physical_Addr=0x64, Type=READ, TxID=0x4A2B β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 7. GPU 3 Memory Controller β
β Read from HBM at physical offset 0x64 β
β Retrieve 64 bytes (cache line) β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 8. GPU 3 NVLink Interface β
β Create response packet: β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β Header: β β
β β - Destination GPU: 0 β β
β β - Type: READ_RESPONSE β β
β β - Transaction ID: 0x4A2B (match request) β β
β β Data: [64 bytes from HBM] β β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β NVLink return path
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 9. NVLink Fabric β
β Route response back to GPU 0 β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 10. GPU 0 NVLink Interface β
β Receive response β
β Match Transaction ID: 0x4A2B β
β Forward data to memory controller β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 11. GPU 0 L2 Cache (if cacheable) β
β Store cache line β
β Forward requested word to load/store unit β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββ
β
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 12. GPU 0 Load/Store Unit β
β Receive data, write to register R0 β
β Instruction completes β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Latency breakdown:
- MMU lookup: ~10 ns (TLB hit)
- Packet creation: ~20 ns
- NVLink transmission: ~50-100 ns (depending on hops)
- Remote memory access: ~50 ns
- Return path: ~50-100 ns
- Total: ~180-280 ns (vs ~30-50 ns for local HBM)
6. NVLink Architecture and Functionality
6.1 What NVLink Does NOT Do
Critical clarifications:
β Address translation: This is the GPU MMUβs responsibility β Virtual memory management: Handled by CUDA IPC and GPU page tables β Cache coherence: NVLink protocol supports it, but GPUs donβt use it β Page table lookups: Done by GPU MMU before packets reach NVLink β Computation or processing: Pure interconnect fabric
6.2 What NVLink Actually Does
β Physical packet routing: Forwards packets based on destination GPU ID in header β Flow control: Credit-based backpressure to prevent congestion β Error handling: CRC checks, automatic retries on corruption β Physical layer: High-speed SerDes signaling (18 Gbps per lane) β QoS: Priority queues for different traffic classes β Multicast/broadcast: Efficiently replicate packets to multiple destinations
Analogy:
- GPU MMU = GPS navigation (decides destination and route)
- NVLink = Highway system (physical infrastructure that moves traffic)
6.3 NVLink Packet Format
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β NVLink Packet β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Header (64-128 bits): β
β - Destination GPU ID (3-7 bits, depends on domain size)β
β - Transaction Type (read/write/atomic/response) β
β - Size (cache line: 32B, 64B, 128B) β
β - Transaction ID (for matching responses) β
β - QoS/Priority bits β
β - Routing hints (for multi-switch topologies) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Address (40-48 bits): β
β - Physical offset within target GPU's HBM β
β - Already translated by source GPU's MMU β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Data (0-128 bytes, depending on transaction type): β
β - Payload for writes β
β - Empty for read requests β
β - Filled in response packets β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β CRC (16-32 bits): β
β - Error detection for entire packet β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
7. NVLink Topologies: Point-to-Point vs Switch
7.1 Point-to-Point NVLink (Direct Connections)
Architecture (example: 8 GPUs, 6 links per GPU):
Hypercube topology:
GPU 0 βββ GPU 1 GPU 4 βββ GPU 5
β \ / β β \ / β
β \/ β β \/ β
β /\ β β /\ β
β / \ β β / \ β
GPU 2 βββ GPU 3 GPU 6 βββ GPU 7
β β β β
βββββββββββΌβββββββββββββ β
ββββββββββββββββββββββββ
Characteristics:
- Each GPU has limited NVLink ports (A100: 6, H100: 18)
- Not fully connected - some pairs require multi-hop routing
- Routing logic implemented in each GPUβs hardware
- Topology: hypercube, torus, or custom mesh
Routing example (GPU 0 β GPU 7):
GPU 0's routing table: "GPU 7 β forward via port 2 (to GPU 2)"
β
GPU 0 sends packet to GPU 2
β
GPU 2 receives, checks destination: "GPU 7"
GPU 2's routing table: "GPU 7 β forward via port 5 (to GPU 6)"
β
GPU 2 forwards packet to GPU 6
β
GPU 6 receives, checks destination: "GPU 7"
GPU 6's routing table: "GPU 7 β port 1 (direct connection)"
β
GPU 6 forwards to GPU 7
β
GPU 7 receives, serves memory request
Implications:
- Latency: Varies by hop count (1-3 hops typical)
- Bandwidth: Shared along path (multiple flows contend)
- GPU overhead: Each intermediate GPU does routing
- Bisection bandwidth: Limited by topology (e.g., 4Γ4 partition has few cross-links)
Performance characteristics:
Direct neighbors: 450 GB/s, ~100 ns latency
One hop away: 225-450 GB/s (shared), ~150-200 ns latency
Two hops away: 150-300 GB/s (shared), ~200-300 ns latency
7.2 NVLink Switch (NVSwitch)
Architecture (DGX H100, DGX B100, NVL72):
βββββββββββββββββββββββββββββββββββββββββββββ
β NVSwitch (Dedicated Chip) β
β β
β ββββββββββββββββββββββββββββββββββββββ β
β β Full Crossbar (Non-blocking) β β
β β β β
β β ββββ ββββ ββββ ββββ ββββ β β
β β βP0β βP1β βP2β βP3β βP4β β β
β β ββ¬ββ ββ¬ββ ββ¬ββ ββ¬ββ ββ¬ββ β β
β βββββββΌββββββΌββββββΌββββββΌββββββΌββββββ β
ββββββββββΌββββββΌββββββΌββββββΌββββββΌββββββββββ
β β β β β
β β β β β
GPU0 GPU1 GPU2 GPU3 GPU4 ...
Characteristics:
- Dedicated switching hardware (separate chip, not in GPUs)
- Full crossbar: any GPU to any GPU at full bandwidth simultaneously
- Single-hop for all GPU pairs (always 1 hop)
- GPUs just send/receive - no routing logic needed
- Scales to 72+ GPUs (NVL72 racks)
Routing example (GPU 0 β GPU 7):
GPU 0's NVLink interface: Create packet with Dest=7
β
Send to NVSwitch
β
NVSwitch examines header: Destination=GPU 7
NVSwitch crossbar: Connect input port 0 β output port 7
β
Forward packet to GPU 7
β
GPU 7 receives, serves memory request
Implications:
- Latency: Uniform ~150-200 ns for all pairs
- Bandwidth: Full 450 GB/s per GPU, non-blocking
- No GPU overhead: GPUs donβt route, just send/receive
- Bisection bandwidth: Full bisection (3.6 TB/s for 8 GPUs)
Performance characteristics:
Any GPU pair: 450 GB/s, ~150-200 ns latency
All-to-all: 3.6 TB/s aggregate bisection bandwidth
No contention: True non-blocking crossbar
7.3 NVSwitch Internal Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β NVSwitch Chip β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Crossbar Switch Fabric β β
β β (Non-blocking, full bisection) β β
β β β β
β β Input Buffers: β β
β β ββββββ ββββββ ββββββ ββββββ ββββββ ββββββ β β
β β β Q0 β β Q1 β β Q2 β β Q3 β β Q4 β β Q5 β ... β β
β β ββββ¬ββ ββββ¬ββ ββββ¬ββ ββββ¬ββ ββββ¬ββ ββββ¬ββ β β
β β β β β β β β β β
β β ββββββββ΄βββββββ΄βββββββ΄βββββββ΄βββββββ β β
β β β β β
β β ββββββββββββββββββββββββ β β
β β β Arbitration Logic β β β
β β β - QoS prioritizationβ β β
β β β - Round-robin/WFQ β β β
β β ββββββββββββ¬ββββββββββββ β β
β β β β β
β β ββββββββββββββββββββββββ β β
β β β Crossbar Matrix β β β
β β β (NΓN connections) β β β
β β ββββββββββββ¬ββββββββββββ β β
β β β β β
β β Output Buffers: β β
β β ββββββ ββββββ ββββββ ββββββ ββββββ ββββββ β β
β β β Q0 β β Q1 β β Q2 β β Q3 β β Q4 β β Q5 β ... β β
β β ββββ¬ββ ββββ¬ββ ββββ¬ββ ββββ¬ββ ββββ¬ββ ββββ¬ββ β β
β βββββββΌββββββΌββββββΌββββββΌββββββΌββββββΌβββββββββββββββ β
β β β β β β β β
β βββββββΌββ βββΌβββββ βββββΌβββ ββΌββββββ ββββΌββββ β
β βPort 0 β βPort 1β βPort 2β βPort 3β βPort 4β ... β
β β900GB/sβ β900GB/β β900GB/β β900GB/β β900GB/β β
β βββββ¬ββββ βββββ¬βββ βββββ¬βββ βββββ¬βββ βββββ¬βββ β
ββββββββΌββββββββββΌββββββββββΌββββββββββΌββββββββββΌββββββββββ
β β β β β
GPU0 GPU1 GPU2 GPU3 GPU4 ...
Switch functions:
- Header parsing: Extract destination GPU ID from packet
- Arbitration: If multiple packets target same output, prioritize
- Crossbar connection: Establish path from input port to output port
- Buffering: Queue packets if output temporarily busy
- Flow control: Credit-based system prevents buffer overflow
- QoS: Priority queues for different traffic classes
- Multicast: Replicate packets to multiple outputs efficiently
7.4 Comparison Table
| Feature | Point-to-Point | NVLink Switch |
|---|---|---|
| Routing location | In each GPU | Dedicated switch hardware |
| Topology | Partial mesh | Full crossbar |
| Hops (worst case) | 1-3 | Always 1 |
| Bisection bandwidth | Limited by topology | Full (non-blocking) |
| Latency uniformity | Varies by path | Uniform |
| GPU complexity | Must include routing | Just send/receive |
| Scalability | 8-16 GPUs typical | 72+ GPUs (NVL systems) |
| Cost | Lower (no switch chip) | Higher (dedicated hardware) |
| Bandwidth guarantee | Depends on contention | Non-blocking guarantee |
| Example systems | DGX-1, DGX-2, DGX A100 | DGX H100, DGX B100, NVL72 |
7.5 Performance Impact: All-to-All Pattern
Point-to-Point (hypercube, 8 GPUs):
Pattern: GPU 0-3 send to GPU 4-7 (bisection test)
Bottleneck: Limited cross-links between partitions
Available paths: ~4-6 links crossing partition boundary
Bisection bandwidth: 1.8-2.7 TB/s (theoretical)
Actual measured: 1.5-2.0 TB/s (with contention)
NVLink Switch (8 GPUs):
Pattern: GPU 0-3 send to GPU 4-7 (bisection test)
No bottleneck: Full crossbar, all paths available
Bisection bandwidth: 3.6 TB/s (50% of aggregate)
Actual measured: 2.7-3.0 TB/s (Jeff's benchmark result)
This explains Jeffβs 2.7 TB/s all-to-all result: Near-theoretical bisection bandwidth on NVSwitch-based DGX H100.
8. GPU MMU vs CPU IOMMU
8.1 Architecture Separation
Critical distinction: NVLink uses GPU MMU; PCIe P2P uses CPU IOMMU.
Within Node (NVLink domain):
βββββββββββββββ βββββββββββββββ
β GPU 0 β NVLink β GPU 1 β
β βββββββββ βββββββββββ βββββββββ β
β β MMU β β β β MMU β β
β βββββββββ β β βββββββββ β
βββββββββββββββ βββββββββββββββ
β β
ββββββ No IOMMU βββββββββ
(GPU-to-GPU direct)
Cross-Node (Network):
βββββββββββββββ ββββββββββββββββ βββββββββββββββ
β GPU 0 β PCIe β NIC β Network β NIC β
β βββββββββββ βββββββββββ β
βββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β β β
ββββββββββββββββββββββββΌββββββββββββββββββββββββββ
β
βββββββββββββββ
β IOMMU β β CPU-side translation
β (optional) β
βββββββββββββββ
8.2 GPU MMU Characteristics
Location: On-chip in each GPU
Responsibilities:
- Virtual-to-physical address translation within GPU
- IPC mapping management (cross-GPU virtual addresses)
- NVLink routing decisions (which GPU owns this address)
- Page table walks for misses
- TLB (Translation Lookaside Buffer) caching
Translation process:
Virtual Address: 0x7f9b00001000
β
GPU MMU lookup:
- Check TLB for cached translation
- If miss: walk page tables
- Result: "GPU 3, physical offset 0x1000"
β
Create NVLink packet with routing info
Performance:
- TLB hit: ~10 ns
- TLB miss: ~100-200 ns (page table walk)
- No software involvement (pure hardware)
8.3 CPU IOMMU Characteristics (for NIC-GPU P2P)
Location: In CPU die or chipset (Root Complex)
Responsibilities:
- Isolate DMA from different devices
- Translate device-side addresses (IOVA β physical)
- Provide per-VM address spaces
- Enable on-demand paging for devices
Translation process (IOMMU enabled):
NIC wants to write to GPU memory
1. NIC driver gets IOVA (not physical address)
IOVA: 0x7f8a00000000
2. NIC issues PCIe Memory Write TLP with IOVA
3. IOMMU intercepts transaction
4. IOTLB (IOMMU's TLB) lookup:
- Hit: Translate IOVA β Physical (0x380000000)
- Miss: Generate page fault
β
ATS (Address Translation Service) request
β
CPU handles page fault
β
Install translation in IOTLB
β
Retry transaction
5. Forward translated address to GPU
6. GPU memory controller receives write
Performance impact:
- IOTLB hit: ~100-500 ns additional latency
- IOTLB miss: ~2-10 ΞΌs (includes page fault handling)
- This is catastrophic for low-latency RDMA
8.4 Why IOMMU Matters for NIC-GPU but Not NVLink
NVLink (GPU-to-GPU):
Path: GPU 0 β NVLink β GPU 1
Translation: GPU 0's MMU only (on-chip, fast)
No CPU involvement
No IOMMU in path
Latency: ~150-200 ns total
PCIe P2P (NIC-to-GPU):
Path: NIC β PCIe switch β GPU
Without IOMMU:
NIC uses physical addresses directly
PCIe switch routes based on physical address
GPU memory controller receives write
Latency: ~1-2 ΞΌs
With IOMMU:
NIC uses IOVA
IOMMU must translate every transaction
IOTLB misses cause stalls
Latency: ~5-20 ΞΌs (worst case with misses)
8.5 IOMMU Modes
Disabled (IOMMU OFF):
Advantages:
β
Lowest latency (no translation)
β
Deterministic performance
β
No IOTLB misses
Disadvantages:
β No DMA isolation (security risk)
β Device can access all physical memory
β No per-VM address spaces
1:1 Passthrough (IOVA == Physical Address):
IOVA: 0x380000000 β Physical: 0x380000000 (identity mapping)
Advantages:
β
Near-zero translation overhead
β
Retains basic IOMMU infrastructure
β
Can enable ATS for local caching
Disadvantages:
β οΈ If any mapping breaks identity, reverts to slow path
β οΈ No real isolation (mappings are identity)
Full IOMMU (Translation enabled):
IOVA: 0x7f8a00000000 β Physical: 0x380000000
Advantages:
β
True DMA isolation
β
Per-VM address spaces
β
Security boundary
Disadvantages:
β High latency (IOTLB misses)
β Unpredictable tail latency
β Kills p99/p999 for RDMA
9. NVSHMEM Cross-Node Communication
9.1 Topology Detection
During initialization, NVSHMEM determines connectivity:
```cpp
cvoid nvshmem_init() {
// Discover all PEs and their locations
for (int pe = 0; pe < npes; pe++) {
if (same_nvlink_domain(my_pe, pe)) {
// Setup CUDA IPC for direct access
setup_ipc_mapping(pe);
pe_info[pe].transport = NVLINK_DIRECT;
pe_info[pe].can_use_ptr = true;
}
else if (same_node(my_pe, pe)) {
// Different NVLink domain, same node (rare)
// Could use PCIe P2P if topology allows
pe_info[pe].transport = PCIE_P2P;
pe_info[pe].can_use_ptr = false;
}
else {
// Cross-node: use network (IB/Ethernet)
setup_ib_connection(pe);
pe_info[pe].transport = NETWORK_RDMA;
pe_info[pe].can_use_ptr = false;
}
}
}
9.2 Hybrid Transport
NVSHMEM automatically selects transport:
void nvshmem_int_put(int *target, int *source, size_t nelems, int pe) {
switch (pe_info[pe].transport) {
case NVLINK_DIRECT:
// Could use direct store, but put is also efficient
nvlink_put_implementation(target, source, nelems, pe);
break;
case PCIE_P2P:
// Use PCIe P2P DMA if supported
pcie_p2p_put(target, source, nelems, pe);
break;
case NETWORK_RDMA:
// Multi-stage: GPU β Host β NIC β Network
network_put_implementation(target, source, nelems, pe);
break;
}
}
9.3 Network Transport Path (GPUDirect RDMA)
With GPUDirect RDMA enabled:
Node A, GPU 0: Node B, GPU 0:
βββββββββββββββ βββββββββββββββ
β GPU HBM β β GPU HBM β
β (source) β β (dest) β
ββββββββ¬βββββββ ββββββββ²βββββββ
β PCIe (no CPU copy) β PCIe
ββββββββΌβββββββ ββββββββ΄βββββββ
β NIC ββββββInfiniBandβββββββββ NIC β
βββββββββββββββ RDMA βββββββββββββββ
Steps:
1. NVSHMEM calls put
2. GPU memory pinned, registered with NIC
3. NIC DMAs directly from GPU HBM (if IOMMU OFF)
4. RDMA write across network
5. Remote NIC DMAs directly to remote GPU HBM
6. No CPU copies!
Without GPUDirect RDMA (fallback):
Node A, GPU 0: Node B, GPU 0:
βββββββββββββββ βββββββββββββββ
β GPU HBM β β GPU HBM β
ββββββββ¬βββββββ ββββββββ²βββββββ
β PCIe β PCIe
ββββββββΌβββββββ ββββββββ΄βββββββ
β Host RAM β β Host RAM β
β (staging) β β (staging) β
ββββββββ¬βββββββ ββββββββ²βββββββ
β PCIe β PCIe
ββββββββΌβββββββ ββββββββ΄βββββββ
β NIC ββββββInfiniBandβββββββββ NIC β
βββββββββββββββ RDMA βββββββββββββββ
Steps:
1. NVSHMEM calls put
2. GPU β Host memory copy (PCIe)
3. NIC DMAs from host memory
4. RDMA write across network
5. Remote NIC DMAs to host memory
6. Host β GPU memory copy (PCIe)
Total: 2 extra PCIe copies (huge overhead!)
9.4 Why Pointer Access Only Works Within NVLink
// Application code
void *remote_ptr = nvshmem_ptr(data, target_pe);
if (remote_ptr != NULL) {
// Within NVLink domain
// Pointer maps to virtual address via CUDA IPC
// GPU MMU translates, NVLink routes
int value = *(int*)remote_ptr; // Direct load
}
else {
// Cross-node or no NVLink
// Cannot create valid virtual address mapping
// Network requires explicit protocol
nvshmem_int_get(&local_value, data, 1, target_pe);
}
Why cross-node pointer access is impossible:
- No global virtual address space across nodes
- Network requires packet framing (not raw loads)
- Latency too high for synchronous loads (~50 ΞΌs vs 200 ns)
- RDMA requires explicit memory registration and permissions
10. Expert Parallelism Use Case
10.1 Architecture
Problem: Route tokens from multiple requests to appropriate expert GPUs.
Request 1: [token_1, token_2, token_3]
Request 2: [token_4, token_5]
Request 3: [token_6, token_7, token_8, token_9]
β
Router (GPU 0)
β
Routing decision
β
ββββββ΄βββββ¬βββββββββ¬βββββββββ
β β β β
Expert 1 Expert 2 Expert 3 Expert 4
(GPU 1) (GPU 2) (GPU 3) (GPU 4)
Challenge:
Router knows which tokens go where
Experts don't know which tokens are coming
Unstructured all-to-all pattern
Latency critical (token generation)
### 10.2 NVSHMEM Implementation (Perplexity/DeepSeek)
Dispatch phase:
```cpp
// Router kernel (device-initiated)
__global__ void dispatch_tokens() {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < num_tokens) {
// Determine which expert handles this token
int expert_pe = routing_table[token_ids[tid]];
// Calculate offset in expert's buffer
// Use atomic counter to get unique slot
size_t offset = nvshmem_uint64_atomic_fetch_add(
&remote_counters[expert_pe], 1, expert_pe);
// Put token directly to expert's buffer
nvshmem_putmem_nbi(
expert_buffers[expert_pe] + offset,
&tokens[tid],
sizeof(Token),
expert_pe);
}
}
// No host involvement - entire dispatch in one kernel
Expert processing:
cuda// Expert kernel
__global__ void process_expert() {
// Check how many tokens arrived
int num_received = local_counter;
// Process received tokens
for (int i = 0; i < num_received; i++) {
Token t = my_buffer[i];
// ... expert computation ...
}
}
Why NVSHMEM wins here:
- Unstructured pattern: Router knows destinations, experts donβt
- Device-initiated: No round-trip to CPU for each put
- One-sided: Experts donβt participate in dispatch
- Fine-grained: Can put individual tokens without batching
10.3 Performance Impact
Traditional approach (NCCL all-to-all):
1. Router collects all tokens
2. Determine send counts per expert
3. Call NCCL alltoallv (collective, all PEs participate)
4. Experts receive batched tokens
5. Process
Latency: ~50-100 ΞΌs per routing step
NVSHMEM approach:
- Router kernel directly puts tokens to expert buffers
- Experts poll for completion, begin processing
Latency: ~5-10 ΞΌs per routing step Result: 10x latency reduction for token routing (Perplexityβs measurements).
- Future Directions 11.1 NCCL Roadmap Recent additions to NCCL:
Symmetric memory support (released) Device-initiated APIs (preview available) NVLink pointer access (coming) Host-side put/get (roadmap)
Convergence: NCCL will support most NVSHMEM capabilities while maintaining collective-first API design and communicator flexibility. 11.2 When to Use Each Library Use NCCL when:
Standard collective patterns (all-reduce, all-gather, etc.) Need communicator flexibility (sub-groups, heterogeneous parallelism) Ecosystem integration important (PyTorch, JAX, etc.) Want automatic protocol tuning
Use NVSHMEM when:
Unstructured communication patterns Device-initiated communication critical Fine-grained synchronization needed Direct NVLink pointer access desired Working with symmetric PGAS model acceptable
Use both:
Different phases of application have different needs Can coexist in same application with careful memory management
12. Key Takeaways
Two-sided vs one-sided: Synchronization coupling vs decoupling trades safety for performance optimization opportunities. Pointer access vs explicit get: Pointer access eliminates copy to local HBM but only works within NVLink domains. Choose based on access frequency and data reuse. Address translation pipeline: GPU MMU handles virtual-to-physical translation and routing decisions; NVLink is pure physical routing fabric. CUDA IPC: Mechanism for creating cross-GPU virtual address mappings within NVLink domains. One-time setup enables direct load/store. No cache coherence: GPUs donβt use NVLinkβs coherence protocol. Explicit synchronization (fences, barriers) required. NVLink topologies: Point-to-point requires multi-hop routing in GPUs; NVSwitch provides single-hop full crossbar with dedicated hardware. GPU MMU β CPU IOMMU: GPU-GPU uses GPU MMU; NIC-GPU uses CPU IOMMU. IOMMU adds latency for PCIe P2P, doesnβt affect NVLink. Expert parallelism: Unstructured routing patterns benefit dramatically from one-sided device-initiated communication (10x latency reduction). Hybrid transport: NVSHMEM automatically selects NVLink direct access, PCIe P2P, or network RDMA based on topology detection. Library convergence: NCCL adding NVSHMEM-like capabilities; choice increasingly about API preference rather than feature availability.
NVSHMEM Synchronization Mechanisms: A Technical Deep Dive
1. The Fundamental Challenge
In one-sided RDMA communication, the sender writes directly to remote GPU memory without receiver involvement. This creates a synchronization problem: How does the receiver know when the senderβs write is complete? Unlike two-sided communication (e.g., MPI_Send/Recv) where both parties explicitly participate, one-sided RDMA provides no inherent completion notification to the receiver.
2. NVSHMEM Synchronization Primitives
2.1 Memory Ordering Operations
nvshmem_fence()
Ensures all prior PUT/GET operations to a specific PE are ordered before subsequent operations Called by sender to enforce ordering Does NOT guarantee completion
nvshmem_quiet()
Blocks until all outstanding PUT/GET operations complete globally Guarantees operations are visible at remote memory Performance cost: sender stalls waiting for hardware completion
2.2 Signal-Based Synchronization
nvshmem_put_signal()
Atomically writes data AND updates a signal variable at receiver Single hardware operation bundling both writes Key advantage: sender doesnβt block (non-blocking from senderβs perspective)
nvshmem_signal_wait_until()
Receiver polls signal variable until expected value Active waiting mechanism (receiver must check)
2.3 Other Mechanisms
Collective barriers: nvshmem_barrier_all() for global synchronization Atomic operations: nvshmem_atomic_* for lightweight synchronization flags
3. Two Approaches Compared
Approach 1: Separate PUT + Flag Update cuda// Sender (PE 0) nvshmem_putmem(buffer, local_data, SIZE, 1); nvshmem_quiet(); // Block until data write completes nvshmem_uint64_p(flag, 1, 1); // Separate flag update nvshmem_quiet(); // Block until flag write completes
// Receiver (PE 1) nvshmem_uint64_wait_until(flag, NVSHMEM_CMP_EQ, 1); Characteristics:
Three separate operations Sender blocks twice (performance penalty) Software-enforced ordering via nvshmem_quiet() Still one-sided (no receiver involvement in data transfer)
Approach 2: Atomic PUT + Signal cuda// Sender (PE 0) nvshmem_put_signal(buffer, local_data, SIZE, flag, 1, NVSHMEM_SIGNAL_SET, 1); // Returns immediately - no blocking!
// Receiver (PE 1) nvshmem_signal_wait_until(flag, NVSHMEM_CMP_EQ, 1); Characteristics:
Single API call, single hardware operation Non-blocking from senderβs perspective Atomic guarantee at receiver (data + signal arrive together) Single network round-trip
4. Understanding βAtomicityβ in nvshmem_put_signal()
Critical Clarification The βatomicβ property is from the receiverβs perspective, not the senderβs: Receiverβs View (Atomic):
When signal becomes visible, data is guaranteed visible No race condition between data and flag Hardware ensures signal write happens after data write completes
Senderβs View (Still One-Sided):
Returns immediately after issuing operation Does NOT know when receiver sees the data Must call nvshmem_quiet() if sender needs confirmation (e.g., to reuse buffer)
Hardware Implementation The network layer (NVLink/InfiniBand) bundles data + signal into one transaction:
Network hardware writes data to receiverβs memory Hardware writes signal value Memory ordering guarantees signal visibility implies data visibility
Similar to RDMA write with immediate data in InfiniBand.
5. Memory Ordering Guarantees
How Order is Enforced (Approach 1) cudanvshmem_putmem(buffer, local_data, SIZE, 1); // PUT #1 nvshmem_quiet(); // Blocks until PUT #1 visible at remote nvshmem_uint64_p(flag, 1, 1); // PUT #2 issued AFTER quiet returns Two mechanisms ensure ordering:
Software ordering: nvshmem_quiet() blocks sender until PUT #1 completes at remote memory before PUT #2 is issued Hardware ordering: PCIe/NVLink RDMA guarantees writes from same source to same destination arrive in order
Without Ordering - Race Condition cuda// BAD - no ordering! nvshmem_putmem(buffer, local_data, SIZE, 1); nvshmem_uint64_p(flag, 1, 1); // Could overtake data write! Flag write might complete before data write, causing receiver to proceed with invalid data.
6. Symmetric Heap Model
Key Concept NVSHMEM uses a symmetric heap: memory allocated at the same virtual address offset on all PEs, but physically separate on each GPU. cuda// Both PEs run same initialization char buffer = (char)nvshmem_malloc(SIZE); uint64_t flag = (uint64_t)nvshmem_malloc(sizeof(uint64_t)); Result:
PE 0: buffer at offset 0x1000 β GPU 0βs physical memory PE 1: buffer at offset 0x1000 β GPU 1βs physical memory Same virtual address, different physical memory
Remote Access cuda// PE 0 accessing PE 1βs memory nvshmem_putmem(buffer, // Symmetric address (same offset on all PEs) local_data, SIZE, 1); // target_pe selects which GPUβs physical memory The symmetric address + target_pe parameter identifies the specific physical memory location. Important Notes
nvshmem_malloc() only allocates on the calling PE Not automatic replication across all PEs βSymmetricβ means same virtual address offset if all PEs call nvshmem_malloc() in same order
7. One-Sided vs Two-Sided Communication
NVSHMEM (One-Sided) cuda// PE 0: Sender nvshmem_putmem(buffer, data, SIZE, 1); // Write to PE 1 nvshmem_quiet(); // Wait for hardware completion
// PE 1: Receiver (passive) nvshmem_uint64_wait_until(flag, NVSHMEM_CMP_EQ, 1); // Just wait Characteristics:
Receiver doesnβt participate in data transfer No matching βreceiveβ call required Hardware (RDMA/NVLink) confirms write completion to sender No ACK message from receiver to sender
MPI (Two-Sided) cuda// PE 0: Sender MPI_Send(data, SIZE, β¦); // Must match with Recv
// PE 1: Receiver (active) MPI_Recv(buffer, SIZE, β¦); // Must explicitly receive Characteristics:
Both sender and receiver must execute matching calls Receiver actively participates Protocol-level handshaking
8. Performance Implications
Pipeline Pattern (Optimal) cuda// Sender issues multiple operations without blocking for (int i = 0; i < N; i++) { compute_chunk(i); nvshmem_put_signal(remote_buf[i], local_buf[i], size, &remote_flags[i], i, NVSHMEM_SIGNAL_SET, target_pe); // No blocking - continues immediately! } // Only synchronize when necessary nvshmem_quiet(); // Optional: if sender needs to reuse buffers Advantages:
Overlaps communication with computation Multiple operations in flight simultaneously Minimal sender stalls
Anti-Pattern (Poor Performance) cudanvshmem_putmem(buffer, data, SIZE, 1); nvshmem_quiet(); // Stall #1 nvshmem_uint64_p(flag, 1, 1); nvshmem_quiet(); // Stall #2 Sender blocks twice per message, preventing overlap.
9. Complete Working Example
cuda#include <nvshmem.h>
#include <nvshmemx.h>
#include <stdio.h>
#include <stdlib.h>
#define SIZE 1024
int main() {
nvshmem_init();
int my_pe = nvshmem_my_pe();
// Symmetric heap allocation
char *buffer = (char*)nvshmem_malloc(SIZE);
uint64_t *flag = (uint64_t*)nvshmem_malloc(sizeof(uint64_t));
*flag = 0;
if (my_pe == 0) {
// Sender
char *local_data = (char*)malloc(SIZE);
for (int i = 0; i < SIZE; i++) {
local_data[i] = 'A';
}
nvshmem_put_signal(buffer, local_data, SIZE,
flag, 1, NVSHMEM_SIGNAL_SET, 1);
printf("PE 0: Sent data\n");
free(local_data);
}
else if (my_pe == 1) {
// Receiver
nvshmem_signal_wait_until(flag, NVSHMEM_CMP_EQ, 1);
printf("PE 1: Received data: %c\n", buffer[0]);
}
nvshmem_free(buffer);
nvshmem_free(flag);
nvshmem_finalize();
return 0;
}
10. Key Takeaways
One-sided RDMA has no inherent completion notification - receiver must actively check via polling/waiting nvshmem_put_signal() is superior to separate PUT + flag for single-message scenarios due to atomicity and non-blocking behavior βAtomicβ means receiver-side atomicity, not sender notification - sender still doesnβt know when receiver sees data without calling nvshmem_quiet() Memory ordering requires explicit synchronization - use nvshmem_quiet() or atomic nvshmem_put_signal() to prevent races Symmetric heap enables simple remote addressing - same virtual address offset across PEs, different physical memory Performance optimization requires minimizing sender blocking - pipeline operations and only synchronize when necessary Still one-sided despite ordering guarantees - hardware confirms completion, receiver never sends ACK back to sender
GPU Communication Protocols: Simple, LL, LL128
NVIDIAβs NCCL library implements three protocols (Simple, LL, LL128) with fundamentally different synchronization mechanisms. Critical finding: NVLink provides high bandwidth and unified addressing but no hardware cache coherence between GPUs. Performance optimization requires understanding protocol trade-offs, hardware topology constraints, and the distinction between software-enforced coherence mechanisms. NCCL 2.28 introduces device-initiated communication, closing the gap with NVSHMEM for irregular workloads.
1. NCCL Protocol Architecture
1.1 Core Protocols
ProtocolSync MethodChunk SizeLatencyBandwidthGPUDirect RDMAOptimal SizeSimpleMemory fences~512 KB~6 Β΅s~100%β Yes>1 MBLLAtomic flags (host mem)8B (4B data + 4B flag)~1 Β΅s25-50%β No<64 KBLL128Atomic flags (GPU mem)128B (120B data + 8B flag)~2 Β΅s~95%β Yes64KB-1MB
1.2 Simple Protocol
Mechanism: Large chunk transfers with memory fence synchronization. Latency Breakdown (~6 Β΅s):
Store buffer drain: 0.5 Β΅s L1βL2 writeback: 0.5 Β΅s Interconnect traversal: 2.0 Β΅s Remote invalidation: 1.0 Β΅s ACK return: 1.0 Β΅s
Fence Hardware Execution:
Flushes all store buffers Writes dirty L1 cache lines to L2 L2 sends invalidation messages via interconnect Waits for ACKs from all remote caches Guarantees global visibility
Performance: Fence overhead (6 Β΅s) dominates small messages but amortizes for large transfers. For 5 GB transfer: fence is 0.0001% overhead.
1.3 LL Protocol
Mechanism: 8-byte atomic operations with flag-based synchronization in host memory. Why Host Memory is Mandatory:
CPU polling GPU memory over PCIe: ~1-2 Β΅s per read CPU polling host DRAM: ~50-100 ns per read (20-40x faster) GPU memory not cacheable by CPU
Bandwidth Limitations:
Raw efficiency: 4/8 bytes = 50% PCIe TLP overhead: 4B payload in 32B TLP = 12.5% Double PCIe traversal (GPUβHost, HostβNIC) Result: 25-50% of peak bandwidth
Data Flow: GPU 0 β PCIe β Host buffer β CPU polls β NIC β Network β Remote NIC β Remote host β Remote CPU polls β PCIe β GPU 1 Cannot Use GPUDirect RDMA: Forced host staging prevents direct GPU-to-NIC path.
1.4 LL128 Protocol
Mechanism: 128-byte atomic operations in GPU memory, enables GPUDirect RDMA. Efficiency: 120/128 = 93.75% bandwidth utilization Hardware Requirement: Must support atomic 128-byte writes without splitting. NCCL disables on incompatible PCIe configurations. Intra-node: Direct NVLink transfers (~300-500ns latency) Inter-node: GPU accumulates chunks β CPU notified β NIC reads via GPUDirect β Network β Remote NIC writes via GPUDirect β GPU Degradation at Scale: For multi-GB messages across many nodes, sync overhead from millions of 128B atomic operations can make it slower than Simple.
2. Synchronization Mechanisms
2.1 Memory Fences vs Atomic Polling
Memory Fences:
Enforces ordering and visibility across entire memory hierarchy Hardware guarantees correctness via cache coherence protocol High latency (~6 Β΅s) but simple programming model Serializes pipeline until completion
Atomic Polling:
Active checking using atomic operations with acquire-release semantics Provides ordering without global flush Low latency (0.05-0.5 Β΅s) but burns CPU/GPU cycles Enables fine-grained pipelining
2.2 Acquire-Release Semantics
atomic_store_release: All prior writes complete before atomic store atomic_load_acquire: All subsequent reads happen after atomic load Key Point: Provides ordering WITHOUT heavyweight fences. Coherence still enforced by software, not hardware.
2.3 Performance Trade-offs
Small, Frequent Messages (100 Γ 4KB):
Fence approach: 100 Γ 10Β΅s = 1000 Β΅s overhead Polling approach: 100 Γ 0.1Β΅s = 10 Β΅s overhead 100x improvement
Large, Infrequent Messages (5GB):
Transfer: 8.3 ms Fence: 10 Β΅s = 0.1% overhead Fence negligible
3. Data Transfer Paths
3.1 Intra-Node Transports
P2P Transport (via NVLink or PCIe):
NVLink: Direct GPU-GPU, ~900 GB/s PCIe: Via PCIe switch, ~32 GB/s per direction
P2P_DIRECT Mode (same process):
Eliminates intermediate FIFO buffer Direct source-to-destination transfer 2x faster than multi-process P2P Still uses atomic counters for synchronization
SHM Transport:
Used when P2P suboptimal (inter-socket over PCIe) Routes through host memory to avoid CPU interconnect bottleneck Better than PCIe-to-PCIe across sockets
3.2 Inter-Node Transports
Socket Transport (net_socket.cc):
No RDMA support Host memory staging required 2x PCIe traversal overhead
IB Transport (net_ib.cc): Without GPUDirect RDMA:
GPU β Host buffer β RDMA β Network β Remote host β GPU
With GPUDirect RDMA:
GPU β NIC (direct PCIe) β Network β Remote NIC β GPU (direct) Eliminates host staging and extra PCIe traversal
QP Layout (2 QPs per rank pair):
Forward QP: Bulk data (RDMA_WRITE + RDMA_WRITE_WITH_IMM) Reverse QP: Clear-to-send messages Separation prevents head-of-line blocking
Local Flush Mechanism:
Dummy loop-back RDMA_READ ensures PCIe write ordering Cost: ~0.1-0.5 Β΅s
3.3 GPUDirect RDMA
Requirements:
GPU and NIC on same PCIe root complex Kernel support (nvidia_p2p_get_pages or DMA-BUF) Cannot efficiently cross CPU sockets
Protocol Compatibility:
Simple: β Can use (GPU memory allowed) LL: β Cannot use (forced host memory) LL128: β Can use (GPU memory allowed)
4. Hardware Capabilities and Limitations
4.1 Critical Finding: No GPU-GPU Cache Coherence
What NVLink Provides: FeatureGPU β GPU (NVLink)CPU β GPU (NVLink-C2C)Unified addressingββDirect load/storeββHigh bandwidth (900 GB/s)ββHardware cache coherenceβ NOβ YESL2 caches remote dataβ NOβ YESCoherence protocolβ Noneβ MESI
4.2 GPU Cache Architecture
Per-SM L1 Cache (128-256 KB):
Private to each SM NOT coherent across SMs Requires __threadfence() to flush
Shared L2 Cache (40-60 MB):
Coherence point within single GPU Memory-side (between SMs and HBM) Cannot cache remote GPU memory
Implication for Remote Polling: GPU 1 reading flag from GPU 0:
GPU 1βs L2 cannot cache the flag Every read traverses NVLink (~300-500ns) No local caching benefit exists
Why LL128 Polling is Still Fast:
Not because of caching (no caching exists) NVLink latency (0.5 Β΅s) < Memory fence (6 Β΅s) GPU 0βs L2 serves reads quickly (doesnβt go to HBM) Atomic operations prevent stale buffer data
4.3 Software-Enforced Coherence
Since no hardware coherence exists, software uses:
Atomic operations: Provide ordering, force writes to memory Memory fences: Flush all caches/buffers, ensure global visibility Explicit synchronization: Programmer responsibility
4.4 Grace Hopper Exception
NVLink-C2C (CPU β GPU only):
True hardware MESI coherence protocol CPU L2/L3 can cache GPU memory GPU L2 can cache CPU memory Automatic invalidations
But GPU β GPU still has NO coherence.
4.5 Channel Architecture
Purpose: Subdivide collectives across multiple GPU blocks (SMs) for parallelism. Structure:
Each channel = one CUDA block on dedicated SM Each channel has 8 slots (512 KB each for Simple) Slots enable pipelining: overlap receive, reduce, send
Data Partitioning:
Total data split across N channels Each channel processes 1/N of data Within channel: split into loop iterations β slots β chunks
Channel Count Trade-off: Too few: Underutilizes GPU parallelism Too many: Per-channel chunks < 512 KB NIC FIFO β partially filled buffers β worse efficiency Example problem:
16 MB message / 64 channels = 256 KB per channel NIC FIFO: 512 KB optimal Result: 50% NIC utilization
NCCL auto-tunes based on message size to avoid this.
5. NCCL vs NVSHMEM
5.1 Fundamental Differences
Programming Model:
NCCL: Collective operations (AllReduce, Broadcast) - all ranks participate NVSHMEM: One-sided operations (put/get) - individual rank initiates
Communication Pattern:
NCCL: Dense, structured, coordinated NVSHMEM: Sparse, irregular, no coordination
5.2 Performance Comparison: MoE Example
Task: Route 1024 tokens to 8 experts (128 tokens/expert) NCCL (Pre-2.28):
Group launch: ~10 Β΅s 8 send/recv pairs via CPU proxy: ~40 Β΅s Transfer: ~20 Β΅s Total: ~70 Β΅s
NVSHMEM:
Kernel launch: ~5 Β΅s 1024 GPU-initiated puts (pipelined): ~15 Β΅s Transfer: ~20 Β΅s Total: ~40 Β΅s (43% faster)
Why NVSHMEM Wins:
GPU-initiated (0.5 Β΅s overhead) vs CPU proxy (5 Β΅s overhead) No coordination overhead for irregular patterns Fine-grained control enables better compute-communication overlap
5.3 When Each Wins
NCCL:
Dense collectives (AllReduce 1GB): Optimized algorithms, bandwidth-optimal Large regular transfers: Protocol tuning excels Established training pipelines: Native PyTorch/TensorFlow integration
NVSHMEM:
Sparse irregular (GNN neighbor gathering): No coordination needed Fine-grained frequent: Lower per-message overhead One-sided patterns: Put without receiver coordination
5.4 NVSHMEM Architecture
Transport Options:
IB Verbs: Direct RDMA, uses GPUDirect, lowest latency UCX: Flexible multi-transport NCCL backend: Fallback for collectives MPI: Compatibility layer
Device-Side Operations:
GPU kernel directly calls put/get/atomic operations GPU posts to NIC queue without CPU involvement ~0.5 Β΅s overhead per operation
6. NCCL 2.28 Device API
6.1 Three Operation Modes
LSA (Load/Store Accessible):
Intra-node NVLink/PCIe Direct memory load/store operations
Multimem:
NVLink SHARP hardware multicast
GIN (GPU-Initiated Networking):
Inter-node InfiniBand/RoCE GPU directly initiates RDMA operations
6.2 GIN Architecture
Backend Options: GDAKI (GPUDirect Async Kernel-Initiated):
Direct GPU-to-NIC via DOCA GPUNetIO Lowest latency (~0.5 Β΅s overhead)
Proxy Backend:
Lock-free GPU-to-CPU queues CPU posts RDMA operations Broader hardware support Slightly higher latency (~1-2 Β΅s)
Impact: Reduces per-operation overhead from ~5 Β΅s (CPU proxy) to ~0.5 Β΅s (GPU-direct), bringing NCCL to parity with NVSHMEM for irregular workloads.
6.3 Copy Engine Collectives
Traditional: SMs handle data movement β contention with compute CE Collectives:
Dedicated hardware copy engines Zero-SM operation Applies to AlltoAll, AllGather Reduces SM contention, better overlap
6.4 Updated Comparison
FeatureNCCL (Pre-2.28)NCCL 2.28+ GINNVSHMEMDevice-initiatedβββOptimized collectivesβββ οΈ ManualGPU-to-NIC overhead~5 Β΅s~0.5 Β΅s~0.5 Β΅sMoE optimizedβββMaturityβ Matureπ Newβ Mature
7. Performance Characteristics
7.1 Latency by Protocol and Size
Small Message (4 KB): ProtocolTransferSync OverheadTotalvs SimpleSimple0.007 Β΅s12 Β΅s (fences)12 Β΅s1xLL0.013 Β΅s0.1 Β΅s (polling)0.2 Β΅s60x fasterLL1280.007 Β΅s0.5 Β΅s (polling)0.5 Β΅s24x faster Large Message (1 GB): ProtocolTransferSync OverheadTotalvs SimpleSimple1.7 ms12 Β΅s1.712 ms1xLL3.3 msnegligible3.3 ms2x slowerLL1281.75 msnegligible1.75 ms~same
7.2 Benchmark Results (16-node GH200)
AllReduce Inter-Node: SizeSimpleLLLL128Winner16 KB~40 Β΅s~15 Β΅s~20 Β΅sLL1 MB~500 Β΅s~2000 Β΅s~600 Β΅sSimple1 GB~50 ms~200 ms~70 msSimple AllReduce Intra-Node (NVLink):
LL128 consistently best across all sizes Only 5% slower than Simple at large sizes
7.3 Protocol Selection Heuristics
Intra-node NVLink:
All sizes: LL128 (if available) Fallback: LL for small, Simple for large
Inter-node:
<64 KB: LL 64KB-1MB: LL128 or Simple (workload dependent)
1 GB: Simple (LL128 sync overhead accumulates)
8. Decision Framework
8.1 Protocol Selection
Message size + topology β protocol choice
<64 KB: Intra-node β LL128 > LL Inter-node β LL
64KB-1MB: Intra-node β LL128 Inter-node β LL128 or Simple
1 MB: Any topology β Simple or LL128
Hardware constraints override: No 128B atomics β Simple or LL only No GPUDirect RDMA β LL for low latency
8.2 NCCL vs NVSHMEM Selection
Use NCCL for:
Dense collectives (gradient sync, parameter broadcast) Large regular transfers Existing training infrastructure
Use NVSHMEM or NCCL 2.28 GIN for:
Sparse irregular patterns (MoE routing, GNN) Fine-grained frequent messages Dynamic routing requiring compute-communication fusion
8.3 Common Pitfalls
Assuming cache coherence exists:
Reality: No hardware coherence between GPUs Must use atomics/fences explicitly
Forcing protocol manually:
NCCL autotuning generally optimal Manual NCCL_PROTO often degrades performance
Over-channelization:
NCCL_NCHANNELS ignored in recent versions Trust autotuning for NIC FIFO optimization
Ignoring topology:
Check nvidia-smi topo -m Understand NVLink vs PCIe paths Consider inter-socket costs
Conclusion
GPU communication performance requires understanding:
No hardware cache coherence between GPUs - software must enforce via atomics/fences Protocol trade-offs - latency (fences) vs bandwidth (sync granularity) Topology constraints - NVLink vs PCIe, intra vs inter-node, NUMA effects Workload patterns - dense collectives vs sparse irregular
Key insights:
Simple: Bandwidth-optimal for large messages, high latency LL: Low latency for small messages, forced host memory kills bandwidth LL128: Balanced, but requires hardware support NCCL 2.28 GIN closes gap with NVSHMEM for irregular workloads Optimization requires measurement and topology awareness
Modern applications benefit from hybrid approaches: NCCL for collectives, device-initiated (GIN/NVSHMEM) for irregular patterns, with careful attention to hardware topology and message characteristics.
DMA
DMA = Hardware moves data without CPU/kernel involvement
Three types in GPU communication:
- GPU Memory Controller DMA (NVLink):
βββββββββββββββββββββββββββββββββββββββ β Source: GPU 0 HBM β β Dest: GPU 1 HBM β β Who moves it: GPU Memory Controller β β Path: NVLink β βββββββββββββββββββββββββββββββββββββββ Triggered by: Store instructions to remote addresses Example: remote_ptr[i] = value; - NIC DMA (PCIe):
βββββββββββββββββββββββββββββββββββββββ β Source: GPU HBM or Host RAM β β Dest: NIC internal buffer β β Who moves it: NIC DMA engine β β Path: PCIe β βββββββββββββββββββββββββββββββββββββββ Triggered by: ibv_post_send() or similar NIC API - Copy Engine DMA (newer GPUs):
βββββββββββββββββββββββββββββββββββββββ β Source: GPU 0 HBM β β Dest: GPU 1 HBM β β Who moves it: Dedicated copy engine β β Path: NVLink β βββββββββββββββββββββββββββββββββββββββ Triggered by: NCCL API, not user kernel Advantage: Doesn't use SM resources