GPU Communication Architecture: NCCL, NVSHMEM, and NVLink
NVSHMEM Synchronization Mechanisms: A Technical Deep Dive
GPU Communication Protocols: Simple, LL, LL128
DMA

GPU Communication Architecture: NCCL, NVSHMEM, and NVLink

This document examines GPU-to-GPU communication mechanisms in distributed systems, focusing on NVIDIA’s NCCL and NVSHMEM libraries, their architectural differences, performance characteristics, and the underlying NVLink fabric. We analyze when to use two-sided versus one-sided communication, explain the address translation pipeline, and clarify common misconceptions about NVLink’s role in memory access.

1. Communication Library Overview

1.1 NCCL (NVIDIA Collective Communications Library)

Design Philosophy: Optimized for AI training/inference collective patterns. Key Characteristics:

Stream-based integration with CUDA runtime Collective-first API design (all-reduce, all-to-all, broadcast, etc.) Automatic protocol selection based on message size and topology Scales from single workstation to thousands of GPUs Supports all AI parallelism patterns: data, tensor, pipeline, expert

Limitations:

Two-sided communication model (synchronization coupled with data movement) No device-initiated communication in current versions (changing with recent previews) Single rank per GPU enforcement

1.2 NVSHMEM (NVIDIA Symmetric Memory)

Design Philosophy: PGAS (Partitioned Global Address Space) model derived from OpenSHMEM.

Key Characteristics:

One-sided communication (put/get) decouples synchronization from data movement Device-initiated communication: CUDA kernels can directly communicate Symmetric heap: collective memory allocation across all processing elements Direct pointer access within NVLink domains Supports host and device-side APIs

Unique Capabilities:

nvshmem_ptr(): Returns pointer for direct load/store within NVLink domains Device-initiated collectives with cooperative groups Thread, warp, and block-level communication primitives

Restrictions:

Symmetric memory allocation requirement (all PEs allocate same size) Limited flexibility for heterogeneous application composition More complex synchronization management

2. Two-Sided vs One-Sided Communication

2.1 Two-Sided (Send/Receive)

Mechanism: Sender side: Receiver side:

Message metadata - Expected metadata
Source buffer address - Destination buffer address
Matching tag - Matching tag ↓ ↓ Matching protocol ensures alignment ↓ ↓ Data transfer occurs with implicit synchronization Protocol Overhead:
1. Sender posts send with metadata
2. Receiver posts receive (may be before or after send)
3. Matching phase: align send with receive
4. Rendezvous handshake (for large messages):
- Sender notifies receiver
- Receiver registers memory and sends permission
- Sender performs RDMA write
  1. Completion notification Advantages:

Clear execution ordering semantics Safe by default (no accidental overwrites) Portable across all network types

Disadvantages:

Matching overhead (especially with wildcards) Enforced synchronization may be unnecessary Difficult to implement efficiently on GPUs

2.2 One-Sided (Put/Get)

Mechanism: Initiator side only:

Source buffer address
Target PE and offset
Transfer size ↓ Direct data movement ↓ No receiver involvement Setup Phase (one-time cost): c// Collective memory allocation void *symmetric_heap = nvshmem_malloc(SIZE); // All PEs exchange addresses and register memory // Creates globally accessible address space Runtime Phase (per operation): c// Put: write to remote memory nvshmem_int_put(remote_addr, local_data, count, target_pe); // No receiver-side code needed

// Get: read from remote memory
nvshmem_int_get(local_data, remote_addr, count, target_pe); // No sender-side code needed Synchronization Responsibility: c// Programmer must ensure ordering nvshmem_put(…); // Write data nvshmem_fence(); // Ensure write completes nvshmem_put_signal(…); // Notify remote side

// Remote side nvshmem_wait_until(…); // Poll for notification // Now safe to read data Advantages:

Eliminates matching overhead Decoupled synchronization enables optimization Natural fit for unstructured communication patterns Can amortize setup cost across many operations

Disadvantages:

More complex programming model Manual synchronization required Potential for race conditions if misused

3. Performance Analysis: Matrix Transpose Benchmark

3.1 Workload Description

Algorithm: Distributed matrix transpose with accumulation c// Each GPU owns column blocks // Transpose scatters blocks to all GPUs (all-to-all pattern) for (int b = 0; b < num_blocks; b++) { transpose_and_scatter(block[b]); // Communication local_transpose(block[b]); // Computation accumulate(block[b]); // Computation } Hardware: DGX H100

8x H100 GPUs NVLink Switch interconnect Local memory bandwidth: 24 TB/s aggregate (3 TB/s per GPU) NVLink bisection bandwidth: 3.6 TB/s theoretical

3.2 Implementation Variants

NCCL All-to-All: cnccl_group_start(); for (int i = 0; i < npes; i++) { nccl_send(send_blocks[i], …, i, …); nccl_recv(recv_blocks[i], …, i, …); } nccl_group_end(); // Then transpose and accumulate locally Performance: 7.5 TB/s aggregate throughput

NCCL Point-to-Point (manual pairwise): cfor (int i = 0; i < npes; i++) { int peer = (rank + i) % npes; nccl_send(send_block, …, peer, …); nccl_recv(recv_block, …, peer, …); local_transpose(recv_block); } Performance: 4-5 TB/s (throttled by NCCL’s flow control) Reason: NCCL assumes custom collective implementation, applies conservative flow control NVSHMEM All-to-All: cnvshmem_barrier_all(); // Required: no implicit sync for (int i = 0; i < npes; i++) { nvshmem_get(local_blocks[i], remote_blocks[i], size, i); } nvshmem_barrier_all(); // Then transpose and accumulate locally Performance: 7.5 TB/s (similar to NCCL) NVSHMEM Get with Interleaving:

cnvshmem_barrier_all();
for (int i = 0; i < npes; i++) {
    nvshmem_get(local_block, remote_block, size, i);
    local_transpose(local_block);  // Interleave comm and compute
}
nvshmem_barrier_all();

Performance: 7.5 TB/s (minimal difference at full network saturation) NVSHMEM Pointer Access (fused):

nvshmem_barrier_all();
// Single kernel launch
fused_kernel<<<...>>>() {
    for (int b = 0; b < num_blocks; b++) {
        int *remote = nvshmem_ptr(blocks[b], target_pe);
        // Direct loads from remote memory
        transpose_inplace(remote, output);
    }
}
nvshmem_barrier_all();

Performance: 9.5 TB/s (30% improvement)

3.3 Performance Breakdown

NCCL All-to-All Time Budget:

Communication: 2.7 TB/s (close to bisection limit)
Local transpose: 23 TB/s
Accumulate: 23 TB/s
Serialized execution limits aggregate throughput

NVSHMEM Pointer Time Budget:

No explicit communication phase (fused into compute)
Remote loads occur during transpose kernel execution
Better cache utilization
Reduced kernel launch overhead

4. Data Movement: Pointer Access vs Explicit Get

4.1 The Critical Difference

Common Misconception: “Both methods transfer data over NVLink, so performance should be identical.”

Reality: The data path and overhead are fundamentally different.

4.2 NVSHMEM Explicit Get Data Path

Step 1: Function call overhead
nvshmem_int_get(local_buf, remote_data, size, pe)
    ↓
Step 2: NVSHMEM runtime processing
    - Validate parameters
    - Lookup PE connection info
    - Determine protocol (NVLink vs IB)
    ↓
Step 3: Initiate transfer (may use Copy Engine or DMA)
    ↓
Step 4: Data movement
Remote GPU HBM → NVLink → Local GPU HBM (explicit copy)
    ↓
Step 5: Synchronization (ensure completion)
    ↓
Step 6: Application accesses data
Local GPU HBM → L2 cache → Registers

Total path: Remote HBM → Local HBM → L2 → Registers

Key overhead sources:

Software function call and parameter validation
Explicit copy allocates local HBM space
Synchronization logic
Two memory copies (remote-to-local, local-to-cache)

4.3 NVSHMEM Pointer Access Data Path

Step 1: One-time pointer lookup (outside critical path)
int *remote_ptr = nvshmem_ptr(data, pe);
    ↓
Step 2: Direct load instruction in kernel
value = *remote_ptr;
    ↓
Step 3: GPU MMU translation
Virtual address → (Target GPU ID + Physical offset)
    ↓
Step 4: NVLink load transaction
Remote GPU HBM → NVLink → Local GPU L2 cache → Registers
Total path: Remote HBM → L2 → Registers
Key advantages:

No intermediate copy to local HBM
Single GPU load instruction (hardware operation)
No software overhead per access
Data goes directly to L2 cache
Can leverage GPU cache hierarchy

### 4.4 Bandwidth and Latency Comparison
MetricExplicit GetPointer AccessSoftware overhead~500-1000 ns per callOne-time lookup onlyData pathRemote HBM → Local HBM → L2Remote HBM → L2Memory copies2 (remote→local, local→cache)1 (remote→cache)HBM allocationRequires local bufferNoneLatency per access~1-5 μs~100-200 nsEffective bandwidthLimited by software overheadLimited by NVLink (450 GB/s)
### 4.5 When Explicit Get is Superior

**Case 1: Multiple Accesses to Same Data**

```cpp
// Scenario: Reuse data multiple times
// Pointer access - pays NVLink latency every time
int *remote_ptr = nvshmem_ptr(data, pe);
for (int iter = 0; iter < 1000; iter++) {
    for (int i = 0; i < N; i++) {
        sum += remote_ptr[i];  // Remote load each time
        // If N > L2 cache size, every access goes over NVLink
    }
}

// Explicit get - one-time copy, then local access

int *local_buf = malloc(N * sizeof(int));
nvshmem_int_get(local_buf, remote_data, N, pe);
for (int iter = 0; iter < 1000; iter++) {
    for (int i = 0; i < N; i++) {
        sum += local_buf[i];  // Local HBM: 2-3 TB/s
    }
}

Break-even analysis:

NVLink bandwidth: 450 GB/s Local HBM bandwidth: 2500 GB/s Explicit get wins when: num_accesses > 2-3

Case 2: Large Data Exceeding L2 Cache cuda// Data size: 1 GB // H100 L2 cache: 50 MB per GPU

// Pointer access: // - 1 GB doesn’t fit in 50 MB cache // - Every access misses → remote fetch over NVLink // - Effective bandwidth: 450 GB/s

// Explicit get: // - One-time transfer: 1 GB / 450 GB/s = 2.2 ms // - Subsequent accesses: local HBM at 2-3 TB/s // - For compute-heavy workloads: massive win

Case 3: Communication/Computation Overlap

cuda// Explicit get enables stream-based overlap cudaStream_t comm_stream, comp_stream;

// Pipeline: fetch next while computing current nvshmem_get_on_stream(buf_A, remote_A, size, pe, comm_stream); compute_kernel«<…, comp_stream»>(buf_B);

// Pointer access: no overlap capability // Stalls on every remote load

4.6 When Pointer Access is Superior

Case 1: Single-Pass Streaming (Jeff’s transpose benchmark) cuda// Each block accessed exactly once for (int b = 0; b < num_blocks; b++) { int *remote = nvshmem_ptr(block[b], pe); transpose_kernel«<…»>(remote, output); } // No reuse → no benefit from local copy // Pointer access eliminates copy overhead

Case 2: Sparse/Random Access cuda// Access 1% of remote array int *remote_ptr = nvshmem_ptr(data, pe); for (int i : sparse_indices) { // 1% of indices value = remote_ptr[i]; } // Total transfer: 1% of array size

// vs explicit get would copy 100% unnecessarily

Case 3: Small Messages cuda// 64-byte transfer int value = remote_ptr; // ~200 ns // vs nvshmem_int_get(&local, remote, 1, pe); // ~1000 ns // Software overhead dominates for small transfers Case 4: Fine-Grained Producer-Consumer cuda// Polling pattern while (!done) { __threadfence_system(); if (remote_flag == READY) { // Direct check process(*remote_data); *remote_flag = DONE; } } // Low latency, direct access beats get/put for flags

4.7 Decision Matrix

ScenarioWinnerReasonSingle-pass readPointerNo copy overheadMultiple iterations (3+)GetAmortize copy, use local HBM speedData » L2 cacheGetCache misses make remote slowData < 1 KBPointerSoftware overhead mattersSparse/random accessPointerFetch only what’s neededDense computationGetKeep working set localNeed overlapGetStream-based pipeliningFine-grained syncPointerLow latency critical

5. Address Discovery and Translation in NVLink Domains

5.1 CUDA IPC Setup Phase

Initialization sequence: c// During nvshmem_init(), all PEs participate

// Step 1: Each GPU allocates symmetric heap void *my_heap = cudaMalloc(SIZE); // GPU 0: 0x7f8a00000000 // GPU 1: 0x7f9b00000000 (different virtual address!) // GPU 2: 0x7fac00000000

// Step 2: Export memory handle cudaIpcMemHandle_t my_handle; cudaIpcGetMemHandle(&my_handle, my_heap);

// Step 3: Exchange handles (collective operation) cudaIpcMemHandle_t remote_handles[8]; all_gather(my_handle, remote_handles); // via control plane

// Step 4: Import remote handles for (int pe = 0; pe < npes; pe++) { if (pe != my_pe) { void *remote_ptr; cudaIpcOpenMemHandle(&remote_ptr, remote_handles[pe], cudaIpcMemLazyEnablePeerAccess); // Now have local virtual address mapping to remote physical ipc_mappings[pe] = remote_ptr; } }

### 5.2 GPU MMU Page Tables

**Post-IPC setup, each GPU has**:

┌─────────────────────────────────────────────┐ │ GPU 0 MMU Page Table │ ├──────────────────┬──────────────────────────┤ │ Virtual Address │ Physical Location │ ├──────────────────┼──────────────────────────┤ │ 0x7f8a00000000 │ Local HBM, offset 0 │ │ 0x7f9b00000000 │ GPU 1 HBM, offset 0 │ │ 0x7fac00000000 │ GPU 2 HBM, offset 0 │ │ 0x7fbd00000000 │ GPU 3 HBM, offset 0 │ │ … │ … │ └──────────────────┴──────────────────────────┘

Key insight: Each GPU's MMU knows which virtual addresses map to which remote GPUs.

### 5.3 NVSHMEM's Internal Translation Table

```cpp
// Simplified internal structure
struct nvshmem_pe_info {
    int pe_id;
    void *symmetric_base;           // This PE's base address
    void *ipc_mapped_ptrs[MAX_PES]; // IPC-mapped remote addresses
    bool nvlink_accessible[MAX_PES]; // Topology info
    // For non-NVLink PEs: network connection info
};

// nvshmem_ptr() implementation
void* nvshmem_ptr(void *source, int target_pe) {
    // Calculate offset within symmetric heap
    size_t offset = (char*)source - (char*)my_symmetric_base;
    
    // Check if target is NVLink-accessible
    if (pe_info[target_pe].nvlink_accessible) {
        // Return IPC-mapped pointer for direct load/store
        return (char*)pe_info[target_pe].ipc_mapped_ptrs + offset;
    } else {
        // Cross-node or no NVLink → return NULL
        // Must use explicit put/get
        return NULL;
    }
}

5.4 Complete Memory Access Flow

// GPU 0 executes:
int *remote_ptr = nvshmem_ptr(data, 3);  // Points to GPU 3
int value = remote_ptr[100];  // Load from offset 100

Step-by-step execution:

┌──────────────────────────────────────────────────────────┐
│ 1. NVSHMEM API Layer (one-time, outside critical path)  │
│    nvshmem_ptr(data, 3)                                  │
│    → Lookup IPC mapping for PE 3                         │
│    → Return: 0x7fbd00000000 (virtual addr for GPU 3)     │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 2. CUDA Kernel Load Instruction                          │
│    LD R0, [0x7fbd00000064]  // 0x100 offset + base       │
│    → Issued by GPU 0's load/store unit                   │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 3. GPU 0 MMU (Memory Management Unit)                    │
│    TLB lookup: 0x7fbd00000064                            │
│    Result: "Remote memory"                               │
│            "Target: GPU_ID=3"                            │
│            "Physical offset: 0x64"                       │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 4. GPU 0 Memory Controller / NVLink Interface            │
│    Create NVLink packet:                                 │
│    ┌─────────────────────────────────────────────┐      │
│    │ Header:                                      │      │
│    │  - Destination GPU: 3                        │      │
│    │  - Type: READ                                │      │
│    │  - Size: 64 bytes (cache line)               │      │
│    │  - Transaction ID: 0x4A2B (for matching)     │      │
│    │ Physical Address: 0x64                       │      │
│    └─────────────────────────────────────────────┘      │
└────────────────────┬─────────────────────────────────────┘
                     │ NVLink physical lanes (18 Gbps each)
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 5. NVLink Fabric (Switch or Point-to-Point)              │
│    Parse packet header                                   │
│    Routing decision: Dest=3 → Output port for GPU 3      │
│    Forward packet through crossbar                       │
│    - Manage link credits (flow control)                  │
│    - CRC check, retry on error                           │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 6. GPU 3 NVLink Interface                                │
│    Receive packet, verify CRC                            │
│    Extract: Physical_Addr=0x64, Type=READ, TxID=0x4A2B   │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 7. GPU 3 Memory Controller                               │
│    Read from HBM at physical offset 0x64                 │
│    Retrieve 64 bytes (cache line)                        │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 8. GPU 3 NVLink Interface                                │
│    Create response packet:                               │
│    ┌─────────────────────────────────────────────┐      │
│    │ Header:                                      │      │
│    │  - Destination GPU: 0                        │      │
│    │  - Type: READ_RESPONSE                       │      │
│    │  - Transaction ID: 0x4A2B (match request)    │      │
│    │ Data: [64 bytes from HBM]                    │      │
│    └─────────────────────────────────────────────┘      │
└────────────────────┬─────────────────────────────────────┘
                     │ NVLink return path
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 9. NVLink Fabric                                         │
│    Route response back to GPU 0                          │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 10. GPU 0 NVLink Interface                               │
│     Receive response                                     │
│     Match Transaction ID: 0x4A2B                         │
│     Forward data to memory controller                    │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 11. GPU 0 L2 Cache (if cacheable)                        │
│     Store cache line                                     │
│     Forward requested word to load/store unit            │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ↓
┌──────────────────────────────────────────────────────────┐
│ 12. GPU 0 Load/Store Unit                                │
│     Receive data, write to register R0                   │
│     Instruction completes                                │
└──────────────────────────────────────────────────────────┘

Latency breakdown:

MMU lookup: ~10 ns (TLB hit)
Packet creation: ~20 ns
NVLink transmission: ~50-100 ns (depending on hops)
Remote memory access: ~50 ns
Return path: ~50-100 ns
Total: ~180-280 ns (vs ~30-50 ns for local HBM)

6. NVLink Architecture and Functionality

6.1 What NVLink Does NOT Do

Critical clarifications:

❌ Address translation: This is the GPU MMU’s responsibility ❌ Virtual memory management: Handled by CUDA IPC and GPU page tables ❌ Cache coherence: NVLink protocol supports it, but GPUs don’t use it ❌ Page table lookups: Done by GPU MMU before packets reach NVLink ❌ Computation or processing: Pure interconnect fabric

6.2 What NVLink Actually Does

✅ Physical packet routing: Forwards packets based on destination GPU ID in header ✅ Flow control: Credit-based backpressure to prevent congestion ✅ Error handling: CRC checks, automatic retries on corruption ✅ Physical layer: High-speed SerDes signaling (18 Gbps per lane) ✅ QoS: Priority queues for different traffic classes ✅ Multicast/broadcast: Efficiently replicate packets to multiple destinations

Analogy:

GPU MMU = GPS navigation (decides destination and route)
NVLink = Highway system (physical infrastructure that moves traffic)

6.3 NVLink Packet Format

┌─────────────────────────────────────────────────────────┐
│                    NVLink Packet                        │
├─────────────────────────────────────────────────────────┤
│ Header (64-128 bits):                                   │
│  - Destination GPU ID (3-7 bits, depends on domain size)│
│  - Transaction Type (read/write/atomic/response)        │
│  - Size (cache line: 32B, 64B, 128B)                    │
│  - Transaction ID (for matching responses)              │
│  - QoS/Priority bits                                    │
│  - Routing hints (for multi-switch topologies)          │
├─────────────────────────────────────────────────────────┤
│ Address (40-48 bits):                                   │
│  - Physical offset within target GPU's HBM              │
│  - Already translated by source GPU's MMU               │
├─────────────────────────────────────────────────────────┤
│ Data (0-128 bytes, depending on transaction type):     │
│  - Payload for writes                                   │
│  - Empty for read requests                              │
│  - Filled in response packets                           │
├─────────────────────────────────────────────────────────┤
│ CRC (16-32 bits):                                       │
│  - Error detection for entire packet                    │
└─────────────────────────────────────────────────────────┘

7. NVLink Topologies: Point-to-Point vs Switch

7.1 Point-to-Point NVLink (Direct Connections)

Architecture (example: 8 GPUs, 6 links per GPU):

Hypercube topology:

GPU 0 ─── GPU 1        GPU 4 ─── GPU 5
 │    \  /  │           │    \  /  │
 │     \/   │           │     \/   │
 │     /\   │           │     /\   │
 │    /  \  │           │    /  \  │
GPU 2 ─── GPU 3        GPU 6 ─── GPU 7
 │         │            │         │
 └─────────┼────────────┘         │
           └──────────────────────┘

Characteristics:

Each GPU has limited NVLink ports (A100: 6, H100: 18)
Not fully connected - some pairs require multi-hop routing
Routing logic implemented in each GPU’s hardware
Topology: hypercube, torus, or custom mesh

Routing example (GPU 0 → GPU 7):

GPU 0's routing table: "GPU 7 → forward via port 2 (to GPU 2)"
    ↓
GPU 0 sends packet to GPU 2
    ↓
GPU 2 receives, checks destination: "GPU 7"
GPU 2's routing table: "GPU 7 → forward via port 5 (to GPU 6)"
    ↓
GPU 2 forwards packet to GPU 6
    ↓
GPU 6 receives, checks destination: "GPU 7"
GPU 6's routing table: "GPU 7 → port 1 (direct connection)"
    ↓
GPU 6 forwards to GPU 7
    ↓
GPU 7 receives, serves memory request

Implications:

Latency: Varies by hop count (1-3 hops typical)
Bandwidth: Shared along path (multiple flows contend)
GPU overhead: Each intermediate GPU does routing
Bisection bandwidth: Limited by topology (e.g., 4×4 partition has few cross-links)

Performance characteristics:

Direct neighbors:  450 GB/s, ~100 ns latency
One hop away:      225-450 GB/s (shared), ~150-200 ns latency  
Two hops away:     150-300 GB/s (shared), ~200-300 ns latency

7.2 NVLink Switch (NVSwitch)

Architecture (DGX H100, DGX B100, NVL72):

┌───────────────────────────────────────────┐
│         NVSwitch (Dedicated Chip)         │
│                                           │
│  ┌────────────────────────────────────┐  │
│  │    Full Crossbar (Non-blocking)    │  │
│  │                                    │  │
│  │    ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  │  │
│  │    │P0│  │P1│  │P2│  │P3│  │P4│  │  │
│  │    └┬─┘  └┬─┘  └┬─┘  └┬─┘  └┬─┘  │  │
│  └─────┼─────┼─────┼─────┼─────┼─────┘  │
└────────┼─────┼─────┼─────┼─────┼─────────┘
         │     │     │     │     │
         ↓     ↓     ↓     ↓     ↓
       GPU0  GPU1  GPU2  GPU3  GPU4 ...

Characteristics:

Dedicated switching hardware (separate chip, not in GPUs)
Full crossbar: any GPU to any GPU at full bandwidth simultaneously
Single-hop for all GPU pairs (always 1 hop)
GPUs just send/receive - no routing logic needed
Scales to 72+ GPUs (NVL72 racks)

Routing example (GPU 0 → GPU 7):

GPU 0's NVLink interface: Create packet with Dest=7
    ↓
Send to NVSwitch
    ↓
NVSwitch examines header: Destination=GPU 7
NVSwitch crossbar: Connect input port 0 → output port 7
    ↓
Forward packet to GPU 7
    ↓
GPU 7 receives, serves memory request

Implications:

Latency: Uniform ~150-200 ns for all pairs
Bandwidth: Full 450 GB/s per GPU, non-blocking
No GPU overhead: GPUs don’t route, just send/receive
Bisection bandwidth: Full bisection (3.6 TB/s for 8 GPUs)

Performance characteristics:

Any GPU pair:  450 GB/s, ~150-200 ns latency
All-to-all:    3.6 TB/s aggregate bisection bandwidth
No contention: True non-blocking crossbar

7.3 NVSwitch Internal Architecture

┌─────────────────────────────────────────────────────────┐
│                   NVSwitch Chip                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────────────────────────────────────────┐  │
│  │          Crossbar Switch Fabric                   │  │
│  │        (Non-blocking, full bisection)             │  │
│  │                                                   │  │
│  │  Input Buffers:                                   │  │
│  │  ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐      │  │
│  │  │ Q0 │ │ Q1 │ │ Q2 │ │ Q3 │ │ Q4 │ │ Q5 │ ...  │  │
│  │  └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘      │  │
│  │     │      │      │      │      │      │         │  │
│  │     └──────┴──────┴──────┴──────┴──────┘         │  │
│  │                      ↓                            │  │
│  │           ┌──────────────────────┐                │  │
│  │           │  Arbitration Logic   │                │  │
│  │           │  - QoS prioritization│                │  │
│  │           │  - Round-robin/WFQ   │                │  │
│  │           └──────────┬───────────┘                │  │
│  │                      ↓                            │  │
│  │           ┌──────────────────────┐                │  │
│  │           │   Crossbar Matrix    │                │  │
│  │           │   (N×N connections)  │                │  │
│  │           └──────────┬───────────┘                │  │
│  │                      ↓                            │  │
│  │  Output Buffers:                                  │  │
│  │  ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐      │  │
│  │  │ Q0 │ │ Q1 │ │ Q2 │ │ Q3 │ │ Q4 │ │ Q5 │ ...  │  │
│  │  └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘      │  │
│  └─────┼─────┼─────┼─────┼─────┼─────┼──────────────┘  │
│        │     │     │     │     │     │                 │
│  ┌─────▼─┐ ┌─▼────┐ ┌───▼──┐ ┌▼─────┐ ┌──▼───┐       │
│  │Port 0 │ │Port 1│ │Port 2│ │Port 3│ │Port 4│ ...   │
│  │900GB/s│ │900GB/│ │900GB/│ │900GB/│ │900GB/│       │
│  └───┬───┘ └───┬──┘ └───┬──┘ └───┬──┘ └───┬──┘       │
└──────┼─────────┼─────────┼─────────┼─────────┼─────────┘
       │         │         │         │         │
      GPU0     GPU1     GPU2     GPU3     GPU4  ...

Switch functions:

Header parsing: Extract destination GPU ID from packet
Arbitration: If multiple packets target same output, prioritize
Crossbar connection: Establish path from input port to output port
Buffering: Queue packets if output temporarily busy
Flow control: Credit-based system prevents buffer overflow
QoS: Priority queues for different traffic classes
Multicast: Replicate packets to multiple outputs efficiently

7.4 Comparison Table

Feature	Point-to-Point	NVLink Switch
Routing location	In each GPU	Dedicated switch hardware
Topology	Partial mesh	Full crossbar
Hops (worst case)	1-3	Always 1
Bisection bandwidth	Limited by topology	Full (non-blocking)
Latency uniformity	Varies by path	Uniform
GPU complexity	Must include routing	Just send/receive
Scalability	8-16 GPUs typical	72+ GPUs (NVL systems)
Cost	Lower (no switch chip)	Higher (dedicated hardware)
Bandwidth guarantee	Depends on contention	Non-blocking guarantee
Example systems	DGX-1, DGX-2, DGX A100	DGX H100, DGX B100, NVL72

7.5 Performance Impact: All-to-All Pattern

Point-to-Point (hypercube, 8 GPUs):

Pattern: GPU 0-3 send to GPU 4-7 (bisection test)

Bottleneck: Limited cross-links between partitions
Available paths: ~4-6 links crossing partition boundary
Bisection bandwidth: 1.8-2.7 TB/s (theoretical)
Actual measured: 1.5-2.0 TB/s (with contention)

NVLink Switch (8 GPUs):

Pattern: GPU 0-3 send to GPU 4-7 (bisection test)

No bottleneck: Full crossbar, all paths available
Bisection bandwidth: 3.6 TB/s (50% of aggregate)
Actual measured: 2.7-3.0 TB/s (Jeff's benchmark result)

This explains Jeff’s 2.7 TB/s all-to-all result: Near-theoretical bisection bandwidth on NVSwitch-based DGX H100.

8. GPU MMU vs CPU IOMMU

8.1 Architecture Separation

Critical distinction: NVLink uses GPU MMU; PCIe P2P uses CPU IOMMU.

Within Node (NVLink domain):
┌─────────────┐         ┌─────────────┐
│   GPU 0     │ NVLink  │   GPU 1     │
│  ┌───────┐  │────────→│  ┌───────┐  │
│  │  MMU  │  │         │  │  MMU  │  │
│  └───────┘  │         │  └───────┘  │
└─────────────┘         └─────────────┘
     ↑                       ↑
     └───── No IOMMU ────────┘
     (GPU-to-GPU direct)

Cross-Node (Network):
┌─────────────┐         ┌──────────────┐         ┌─────────────┐
│   GPU 0     │  PCIe   │     NIC      │ Network │     NIC     │
│             │────────→│              │────────→│             │
└─────────────┘         └──────────────┘         └──────────────┘
       │                      │                         │
       └──────────────────────┼─────────────────────────┘
                              ↓
                       ┌─────────────┐
                       │   IOMMU     │ ← CPU-side translation
                       │ (optional)  │
                       └─────────────┘

8.2 GPU MMU Characteristics

Location: On-chip in each GPU

Responsibilities:

Virtual-to-physical address translation within GPU
IPC mapping management (cross-GPU virtual addresses)
NVLink routing decisions (which GPU owns this address)
Page table walks for misses
TLB (Translation Lookaside Buffer) caching

Translation process:

Virtual Address: 0x7f9b00001000
      ↓
GPU MMU lookup:
  - Check TLB for cached translation
  - If miss: walk page tables
  - Result: "GPU 3, physical offset 0x1000"
      ↓
Create NVLink packet with routing info

Performance:

TLB hit: ~10 ns
TLB miss: ~100-200 ns (page table walk)
No software involvement (pure hardware)

8.3 CPU IOMMU Characteristics (for NIC-GPU P2P)

Location: In CPU die or chipset (Root Complex)

Responsibilities:

Isolate DMA from different devices
Translate device-side addresses (IOVA → physical)
Provide per-VM address spaces
Enable on-demand paging for devices

Translation process (IOMMU enabled):

NIC wants to write to GPU memory

1. NIC driver gets IOVA (not physical address)
   IOVA: 0x7f8a00000000

2. NIC issues PCIe Memory Write TLP with IOVA

3. IOMMU intercepts transaction

4. IOTLB (IOMMU's TLB) lookup:
   - Hit: Translate IOVA → Physical (0x380000000)
   - Miss: Generate page fault
           ↓
           ATS (Address Translation Service) request
           ↓
           CPU handles page fault
           ↓
           Install translation in IOTLB
           ↓
           Retry transaction

5. Forward translated address to GPU

6. GPU memory controller receives write

Performance impact:

IOTLB hit: ~100-500 ns additional latency
IOTLB miss: ~2-10 μs (includes page fault handling)
This is catastrophic for low-latency RDMA

8.4 Why IOMMU Matters for NIC-GPU but Not NVLink

NVLink (GPU-to-GPU):

Path: GPU 0 → NVLink → GPU 1

Translation: GPU 0's MMU only (on-chip, fast)
No CPU involvement
No IOMMU in path
Latency: ~150-200 ns total

PCIe P2P (NIC-to-GPU):

Path: NIC → PCIe switch → GPU

Without IOMMU:
  NIC uses physical addresses directly
  PCIe switch routes based on physical address
  GPU memory controller receives write
  Latency: ~1-2 μs

With IOMMU:
  NIC uses IOVA
  IOMMU must translate every transaction
  IOTLB misses cause stalls
  Latency: ~5-20 μs (worst case with misses)

8.5 IOMMU Modes

Disabled (IOMMU OFF):

Advantages:
✅ Lowest latency (no translation)
✅ Deterministic performance
✅ No IOTLB misses

Disadvantages:
❌ No DMA isolation (security risk)
❌ Device can access all physical memory
❌ No per-VM address spaces

1:1 Passthrough (IOVA == Physical Address):

IOVA: 0x380000000 → Physical: 0x380000000 (identity mapping)

Advantages:
✅ Near-zero translation overhead
✅ Retains basic IOMMU infrastructure
✅ Can enable ATS for local caching

Disadvantages:
⚠️  If any mapping breaks identity, reverts to slow path
⚠️  No real isolation (mappings are identity)

Full IOMMU (Translation enabled):

IOVA: 0x7f8a00000000 → Physical: 0x380000000

Advantages:
✅ True DMA isolation
✅ Per-VM address spaces
✅ Security boundary

Disadvantages:
❌ High latency (IOTLB misses)
❌ Unpredictable tail latency
❌ Kills p99/p999 for RDMA

9. NVSHMEM Cross-Node Communication
9.1 Topology Detection
During initialization, NVSHMEM determines connectivity:
```cpp
cvoid nvshmem_init() {
    // Discover all PEs and their locations
    for (int pe = 0; pe < npes; pe++) {
        if (same_nvlink_domain(my_pe, pe)) {
            // Setup CUDA IPC for direct access
            setup_ipc_mapping(pe);
            pe_info[pe].transport = NVLINK_DIRECT;
            pe_info[pe].can_use_ptr = true;
        }
        else if (same_node(my_pe, pe)) {
            // Different NVLink domain, same node (rare)
            // Could use PCIe P2P if topology allows
            pe_info[pe].transport = PCIE_P2P;
            pe_info[pe].can_use_ptr = false;
        }
        else {
            // Cross-node: use network (IB/Ethernet)
            setup_ib_connection(pe);
            pe_info[pe].transport = NETWORK_RDMA;
            pe_info[pe].can_use_ptr = false;
        }
    }
}

9.2 Hybrid Transport

NVSHMEM automatically selects transport:

void nvshmem_int_put(int *target, int *source, size_t nelems, int pe) {
    switch (pe_info[pe].transport) {
        case NVLINK_DIRECT:
            // Could use direct store, but put is also efficient
            nvlink_put_implementation(target, source, nelems, pe);
            break;
            
        case PCIE_P2P:
            // Use PCIe P2P DMA if supported
            pcie_p2p_put(target, source, nelems, pe);
            break;
            
        case NETWORK_RDMA:
            // Multi-stage: GPU → Host → NIC → Network
            network_put_implementation(target, source, nelems, pe);
            break;
    }
}

9.3 Network Transport Path (GPUDirect RDMA)

With GPUDirect RDMA enabled:

Node A, GPU 0:                         Node B, GPU 0:
┌─────────────┐                       ┌─────────────┐
│  GPU HBM    │                       │  GPU HBM    │
│  (source)   │                       │ (dest)      │
└──────┬──────┘                       └──────▲──────┘
       │ PCIe (no CPU copy)                  │ PCIe
┌──────▼──────┐                       ┌──────┴──────┐
│    NIC      │─────InfiniBand───────→│    NIC      │
└─────────────┘      RDMA              └─────────────┘

Steps:
1. NVSHMEM calls put
2. GPU memory pinned, registered with NIC
3. NIC DMAs directly from GPU HBM (if IOMMU OFF)
4. RDMA write across network
5. Remote NIC DMAs directly to remote GPU HBM
6. No CPU copies!

Without GPUDirect RDMA (fallback):

Node A, GPU 0:                         Node B, GPU 0:
┌─────────────┐                       ┌─────────────┐
│  GPU HBM    │                       │  GPU HBM    │
└──────┬──────┘                       └──────▲──────┘
       │ PCIe                                │ PCIe
┌──────▼──────┐                       ┌──────┴──────┐
│  Host RAM   │                       │  Host RAM   │
│  (staging)  │                       │  (staging)  │
└──────┬──────┘                       └──────▲──────┘
       │ PCIe                                │ PCIe
┌──────▼──────┐                       ┌──────┴──────┐
│    NIC      │─────InfiniBand───────→│    NIC      │
└─────────────┘      RDMA              └─────────────┘

Steps:
1. NVSHMEM calls put
2. GPU → Host memory copy (PCIe)
3. NIC DMAs from host memory
4. RDMA write across network
5. Remote NIC DMAs to host memory
6. Host → GPU memory copy (PCIe)
Total: 2 extra PCIe copies (huge overhead!)

9.4 Why Pointer Access Only Works Within NVLink

// Application code
void *remote_ptr = nvshmem_ptr(data, target_pe);

if (remote_ptr != NULL) {
    // Within NVLink domain
    // Pointer maps to virtual address via CUDA IPC
    // GPU MMU translates, NVLink routes
    int value = *(int*)remote_ptr;  // Direct load
}
else {
    // Cross-node or no NVLink
    // Cannot create valid virtual address mapping
    // Network requires explicit protocol
    nvshmem_int_get(&local_value, data, 1, target_pe);
}

Why cross-node pointer access is impossible:

No global virtual address space across nodes
Network requires packet framing (not raw loads)
Latency too high for synchronous loads (~50 μs vs 200 ns)
RDMA requires explicit memory registration and permissions

10. Expert Parallelism Use Case

10.1 Architecture

Problem: Route tokens from multiple requests to appropriate expert GPUs.

Request 1: [token_1, token_2, token_3]
Request 2: [token_4, token_5]
Request 3: [token_6, token_7, token_8, token_9]
         ↓
    Router (GPU 0)
         ↓
   Routing decision
         ↓
    ┌────┴────┬────────┬────────┐
    ↓         ↓        ↓        ↓
Expert 1  Expert 2  Expert 3  Expert 4
(GPU 1)   (GPU 2)   (GPU 3)   (GPU 4)
Challenge:

Router knows which tokens go where
Experts don't know which tokens are coming
Unstructured all-to-all pattern
Latency critical (token generation)

### 10.2 NVSHMEM Implementation (Perplexity/DeepSeek)
Dispatch phase:
```cpp
// Router kernel (device-initiated)
__global__ void dispatch_tokens() {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    
    if (tid < num_tokens) {
        // Determine which expert handles this token
        int expert_pe = routing_table[token_ids[tid]];
        
        // Calculate offset in expert's buffer
        // Use atomic counter to get unique slot
        size_t offset = nvshmem_uint64_atomic_fetch_add(
            &remote_counters[expert_pe], 1, expert_pe);
        
        // Put token directly to expert's buffer
        nvshmem_putmem_nbi(
            expert_buffers[expert_pe] + offset,
            &tokens[tid],
            sizeof(Token),
            expert_pe);
    }
}

// No host involvement - entire dispatch in one kernel
Expert processing:
cuda// Expert kernel
__global__ void process_expert() {
    // Check how many tokens arrived
    int num_received = local_counter;
    
    // Process received tokens
    for (int i = 0; i < num_received; i++) {
        Token t = my_buffer[i];
        // ... expert computation ...
    }
}

Why NVSHMEM wins here:

Unstructured pattern: Router knows destinations, experts don’t
Device-initiated: No round-trip to CPU for each put
One-sided: Experts don’t participate in dispatch
Fine-grained: Can put individual tokens without batching

10.3 Performance Impact

Traditional approach (NCCL all-to-all):

Router collects all tokens
Determine send counts per expert
Call NCCL alltoallv (collective, all PEs participate)
Experts receive batched tokens
Process

Latency: ~50-100 μs per routing step

NVSHMEM approach:

Router kernel directly puts tokens to expert buffers
Experts poll for completion, begin processing

Latency: ~5-10 μs per routing step Result: 10x latency reduction for token routing (Perplexity’s measurements).

Future Directions 11.1 NCCL Roadmap Recent additions to NCCL:

Symmetric memory support (released) Device-initiated APIs (preview available) NVLink pointer access (coming) Host-side put/get (roadmap)

Convergence: NCCL will support most NVSHMEM capabilities while maintaining collective-first API design and communicator flexibility. 11.2 When to Use Each Library Use NCCL when:

Standard collective patterns (all-reduce, all-gather, etc.) Need communicator flexibility (sub-groups, heterogeneous parallelism) Ecosystem integration important (PyTorch, JAX, etc.) Want automatic protocol tuning

Use NVSHMEM when:

Unstructured communication patterns Device-initiated communication critical Fine-grained synchronization needed Direct NVLink pointer access desired Working with symmetric PGAS model acceptable

Use both:

Different phases of application have different needs Can coexist in same application with careful memory management

12. Key Takeaways

Two-sided vs one-sided: Synchronization coupling vs decoupling trades safety for performance optimization opportunities. Pointer access vs explicit get: Pointer access eliminates copy to local HBM but only works within NVLink domains. Choose based on access frequency and data reuse. Address translation pipeline: GPU MMU handles virtual-to-physical translation and routing decisions; NVLink is pure physical routing fabric. CUDA IPC: Mechanism for creating cross-GPU virtual address mappings within NVLink domains. One-time setup enables direct load/store. No cache coherence: GPUs don’t use NVLink’s coherence protocol. Explicit synchronization (fences, barriers) required. NVLink topologies: Point-to-point requires multi-hop routing in GPUs; NVSwitch provides single-hop full crossbar with dedicated hardware. GPU MMU ≠ CPU IOMMU: GPU-GPU uses GPU MMU; NIC-GPU uses CPU IOMMU. IOMMU adds latency for PCIe P2P, doesn’t affect NVLink. Expert parallelism: Unstructured routing patterns benefit dramatically from one-sided device-initiated communication (10x latency reduction). Hybrid transport: NVSHMEM automatically selects NVLink direct access, PCIe P2P, or network RDMA based on topology detection. Library convergence: NCCL adding NVSHMEM-like capabilities; choice increasingly about API preference rather than feature availability.

NVSHMEM Synchronization Mechanisms: A Technical Deep Dive

1. The Fundamental Challenge

In one-sided RDMA communication, the sender writes directly to remote GPU memory without receiver involvement. This creates a synchronization problem: How does the receiver know when the sender’s write is complete? Unlike two-sided communication (e.g., MPI_Send/Recv) where both parties explicitly participate, one-sided RDMA provides no inherent completion notification to the receiver.

2. NVSHMEM Synchronization Primitives

2.1 Memory Ordering Operations

nvshmem_fence()

Ensures all prior PUT/GET operations to a specific PE are ordered before subsequent operations Called by sender to enforce ordering Does NOT guarantee completion

nvshmem_quiet()

Blocks until all outstanding PUT/GET operations complete globally Guarantees operations are visible at remote memory Performance cost: sender stalls waiting for hardware completion

2.2 Signal-Based Synchronization

nvshmem_put_signal()

Atomically writes data AND updates a signal variable at receiver Single hardware operation bundling both writes Key advantage: sender doesn’t block (non-blocking from sender’s perspective)

nvshmem_signal_wait_until()

Receiver polls signal variable until expected value Active waiting mechanism (receiver must check)

2.3 Other Mechanisms

Collective barriers: nvshmem_barrier_all() for global synchronization Atomic operations: nvshmem_atomic_* for lightweight synchronization flags

3. Two Approaches Compared

Approach 1: Separate PUT + Flag Update cuda// Sender (PE 0) nvshmem_putmem(buffer, local_data, SIZE, 1); nvshmem_quiet(); // Block until data write completes nvshmem_uint64_p(flag, 1, 1); // Separate flag update nvshmem_quiet(); // Block until flag write completes

// Receiver (PE 1) nvshmem_uint64_wait_until(flag, NVSHMEM_CMP_EQ, 1); Characteristics:

Three separate operations Sender blocks twice (performance penalty) Software-enforced ordering via nvshmem_quiet() Still one-sided (no receiver involvement in data transfer)

Approach 2: Atomic PUT + Signal cuda// Sender (PE 0) nvshmem_put_signal(buffer, local_data, SIZE, flag, 1, NVSHMEM_SIGNAL_SET, 1); // Returns immediately - no blocking!

// Receiver (PE 1) nvshmem_signal_wait_until(flag, NVSHMEM_CMP_EQ, 1); Characteristics:

Single API call, single hardware operation Non-blocking from sender’s perspective Atomic guarantee at receiver (data + signal arrive together) Single network round-trip

4. Understanding “Atomicity” in nvshmem_put_signal()

Critical Clarification The “atomic” property is from the receiver’s perspective, not the sender’s: Receiver’s View (Atomic):

When signal becomes visible, data is guaranteed visible No race condition between data and flag Hardware ensures signal write happens after data write completes

Sender’s View (Still One-Sided):

Returns immediately after issuing operation Does NOT know when receiver sees the data Must call nvshmem_quiet() if sender needs confirmation (e.g., to reuse buffer)

Hardware Implementation The network layer (NVLink/InfiniBand) bundles data + signal into one transaction:

Network hardware writes data to receiver’s memory Hardware writes signal value Memory ordering guarantees signal visibility implies data visibility

Similar to RDMA write with immediate data in InfiniBand.

5. Memory Ordering Guarantees

How Order is Enforced (Approach 1) cudanvshmem_putmem(buffer, local_data, SIZE, 1); // PUT #1 nvshmem_quiet(); // Blocks until PUT #1 visible at remote nvshmem_uint64_p(flag, 1, 1); // PUT #2 issued AFTER quiet returns Two mechanisms ensure ordering:

Software ordering: nvshmem_quiet() blocks sender until PUT #1 completes at remote memory before PUT #2 is issued Hardware ordering: PCIe/NVLink RDMA guarantees writes from same source to same destination arrive in order

Without Ordering - Race Condition cuda// BAD - no ordering! nvshmem_putmem(buffer, local_data, SIZE, 1); nvshmem_uint64_p(flag, 1, 1); // Could overtake data write! Flag write might complete before data write, causing receiver to proceed with invalid data.

6. Symmetric Heap Model

Key Concept NVSHMEM uses a symmetric heap: memory allocated at the same virtual address offset on all PEs, but physically separate on each GPU. cuda// Both PEs run same initialization char buffer = (char)nvshmem_malloc(SIZE); uint64_t flag = (uint64_t)nvshmem_malloc(sizeof(uint64_t)); Result:

PE 0: buffer at offset 0x1000 → GPU 0’s physical memory PE 1: buffer at offset 0x1000 → GPU 1’s physical memory Same virtual address, different physical memory

Remote Access cuda// PE 0 accessing PE 1’s memory nvshmem_putmem(buffer, // Symmetric address (same offset on all PEs) local_data, SIZE, 1); // target_pe selects which GPU’s physical memory The symmetric address + target_pe parameter identifies the specific physical memory location. Important Notes

nvshmem_malloc() only allocates on the calling PE Not automatic replication across all PEs “Symmetric” means same virtual address offset if all PEs call nvshmem_malloc() in same order

7. One-Sided vs Two-Sided Communication

NVSHMEM (One-Sided) cuda// PE 0: Sender nvshmem_putmem(buffer, data, SIZE, 1); // Write to PE 1 nvshmem_quiet(); // Wait for hardware completion

// PE 1: Receiver (passive) nvshmem_uint64_wait_until(flag, NVSHMEM_CMP_EQ, 1); // Just wait Characteristics:

Receiver doesn’t participate in data transfer No matching “receive” call required Hardware (RDMA/NVLink) confirms write completion to sender No ACK message from receiver to sender

MPI (Two-Sided) cuda// PE 0: Sender MPI_Send(data, SIZE, …); // Must match with Recv

// PE 1: Receiver (active) MPI_Recv(buffer, SIZE, …); // Must explicitly receive Characteristics:

Both sender and receiver must execute matching calls Receiver actively participates Protocol-level handshaking

8. Performance Implications

Pipeline Pattern (Optimal) cuda// Sender issues multiple operations without blocking for (int i = 0; i < N; i++) { compute_chunk(i); nvshmem_put_signal(remote_buf[i], local_buf[i], size, &remote_flags[i], i, NVSHMEM_SIGNAL_SET, target_pe); // No blocking - continues immediately! } // Only synchronize when necessary nvshmem_quiet(); // Optional: if sender needs to reuse buffers Advantages:

Overlaps communication with computation Multiple operations in flight simultaneously Minimal sender stalls

Anti-Pattern (Poor Performance) cudanvshmem_putmem(buffer, data, SIZE, 1); nvshmem_quiet(); // Stall #1 nvshmem_uint64_p(flag, 1, 1); nvshmem_quiet(); // Stall #2 Sender blocks twice per message, preventing overlap.

9. Complete Working Example

cuda#include <nvshmem.h>
#include <nvshmemx.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1024

int main() {
    nvshmem_init();
    int my_pe = nvshmem_my_pe();
    
    // Symmetric heap allocation
    char *buffer = (char*)nvshmem_malloc(SIZE);
    uint64_t *flag = (uint64_t*)nvshmem_malloc(sizeof(uint64_t));
    *flag = 0;
    
    if (my_pe == 0) {
        // Sender
        char *local_data = (char*)malloc(SIZE);
        for (int i = 0; i < SIZE; i++) {
            local_data[i] = 'A';
        }
        
        nvshmem_put_signal(buffer, local_data, SIZE,
                          flag, 1, NVSHMEM_SIGNAL_SET, 1);
        
        printf("PE 0: Sent data\n");
        free(local_data);
    } 
    else if (my_pe == 1) {
        // Receiver
        nvshmem_signal_wait_until(flag, NVSHMEM_CMP_EQ, 1);
        printf("PE 1: Received data: %c\n", buffer[0]);
    }
    
    nvshmem_free(buffer);
    nvshmem_free(flag);
    nvshmem_finalize();
    return 0;
}

10. Key Takeaways

One-sided RDMA has no inherent completion notification - receiver must actively check via polling/waiting nvshmem_put_signal() is superior to separate PUT + flag for single-message scenarios due to atomicity and non-blocking behavior “Atomic” means receiver-side atomicity, not sender notification - sender still doesn’t know when receiver sees data without calling nvshmem_quiet() Memory ordering requires explicit synchronization - use nvshmem_quiet() or atomic nvshmem_put_signal() to prevent races Symmetric heap enables simple remote addressing - same virtual address offset across PEs, different physical memory Performance optimization requires minimizing sender blocking - pipeline operations and only synchronize when necessary Still one-sided despite ordering guarantees - hardware confirms completion, receiver never sends ACK back to sender

GPU Communication Protocols: Simple, LL, LL128

NVIDIA’s NCCL library implements three protocols (Simple, LL, LL128) with fundamentally different synchronization mechanisms. Critical finding: NVLink provides high bandwidth and unified addressing but no hardware cache coherence between GPUs. Performance optimization requires understanding protocol trade-offs, hardware topology constraints, and the distinction between software-enforced coherence mechanisms. NCCL 2.28 introduces device-initiated communication, closing the gap with NVSHMEM for irregular workloads.

1. NCCL Protocol Architecture

1.1 Core Protocols

ProtocolSync MethodChunk SizeLatencyBandwidthGPUDirect RDMAOptimal SizeSimpleMemory fences~512 KB~6 µs~100%✓ Yes>1 MBLLAtomic flags (host mem)8B (4B data + 4B flag)~1 µs25-50%✗ No<64 KBLL128Atomic flags (GPU mem)128B (120B data + 8B flag)~2 µs~95%✓ Yes64KB-1MB

1.2 Simple Protocol

Mechanism: Large chunk transfers with memory fence synchronization. Latency Breakdown (~6 µs):

Store buffer drain: 0.5 µs L1→L2 writeback: 0.5 µs Interconnect traversal: 2.0 µs Remote invalidation: 1.0 µs ACK return: 1.0 µs

Fence Hardware Execution:

Flushes all store buffers Writes dirty L1 cache lines to L2 L2 sends invalidation messages via interconnect Waits for ACKs from all remote caches Guarantees global visibility

Performance: Fence overhead (6 µs) dominates small messages but amortizes for large transfers. For 5 GB transfer: fence is 0.0001% overhead.

1.3 LL Protocol

Mechanism: 8-byte atomic operations with flag-based synchronization in host memory. Why Host Memory is Mandatory:

CPU polling GPU memory over PCIe: ~1-2 µs per read CPU polling host DRAM: ~50-100 ns per read (20-40x faster) GPU memory not cacheable by CPU

Bandwidth Limitations:

Raw efficiency: 4/8 bytes = 50% PCIe TLP overhead: 4B payload in 32B TLP = 12.5% Double PCIe traversal (GPU→Host, Host→NIC) Result: 25-50% of peak bandwidth

Data Flow: GPU 0 → PCIe → Host buffer → CPU polls → NIC → Network → Remote NIC → Remote host → Remote CPU polls → PCIe → GPU 1 Cannot Use GPUDirect RDMA: Forced host staging prevents direct GPU-to-NIC path.

1.4 LL128 Protocol

Mechanism: 128-byte atomic operations in GPU memory, enables GPUDirect RDMA. Efficiency: 120/128 = 93.75% bandwidth utilization Hardware Requirement: Must support atomic 128-byte writes without splitting. NCCL disables on incompatible PCIe configurations. Intra-node: Direct NVLink transfers (~300-500ns latency) Inter-node: GPU accumulates chunks → CPU notified → NIC reads via GPUDirect → Network → Remote NIC writes via GPUDirect → GPU Degradation at Scale: For multi-GB messages across many nodes, sync overhead from millions of 128B atomic operations can make it slower than Simple.

2. Synchronization Mechanisms

2.1 Memory Fences vs Atomic Polling

Memory Fences:

Enforces ordering and visibility across entire memory hierarchy Hardware guarantees correctness via cache coherence protocol High latency (~6 µs) but simple programming model Serializes pipeline until completion

Atomic Polling:

Active checking using atomic operations with acquire-release semantics Provides ordering without global flush Low latency (0.05-0.5 µs) but burns CPU/GPU cycles Enables fine-grained pipelining

2.2 Acquire-Release Semantics

atomic_store_release: All prior writes complete before atomic store atomic_load_acquire: All subsequent reads happen after atomic load Key Point: Provides ordering WITHOUT heavyweight fences. Coherence still enforced by software, not hardware.

2.3 Performance Trade-offs

Small, Frequent Messages (100 × 4KB):

Fence approach: 100 × 10µs = 1000 µs overhead Polling approach: 100 × 0.1µs = 10 µs overhead 100x improvement

Large, Infrequent Messages (5GB):

Transfer: 8.3 ms Fence: 10 µs = 0.1% overhead Fence negligible

3. Data Transfer Paths

3.1 Intra-Node Transports

P2P Transport (via NVLink or PCIe):

NVLink: Direct GPU-GPU, ~900 GB/s PCIe: Via PCIe switch, ~32 GB/s per direction

P2P_DIRECT Mode (same process):

Eliminates intermediate FIFO buffer Direct source-to-destination transfer 2x faster than multi-process P2P Still uses atomic counters for synchronization

SHM Transport:

Used when P2P suboptimal (inter-socket over PCIe) Routes through host memory to avoid CPU interconnect bottleneck Better than PCIe-to-PCIe across sockets

3.2 Inter-Node Transports

Socket Transport (net_socket.cc):

No RDMA support Host memory staging required 2x PCIe traversal overhead

IB Transport (net_ib.cc): Without GPUDirect RDMA:

GPU → Host buffer → RDMA → Network → Remote host → GPU

With GPUDirect RDMA:

GPU → NIC (direct PCIe) → Network → Remote NIC → GPU (direct) Eliminates host staging and extra PCIe traversal

QP Layout (2 QPs per rank pair):

Forward QP: Bulk data (RDMA_WRITE + RDMA_WRITE_WITH_IMM) Reverse QP: Clear-to-send messages Separation prevents head-of-line blocking

Local Flush Mechanism:

Dummy loop-back RDMA_READ ensures PCIe write ordering Cost: ~0.1-0.5 µs

3.3 GPUDirect RDMA

Requirements:

GPU and NIC on same PCIe root complex Kernel support (nvidia_p2p_get_pages or DMA-BUF) Cannot efficiently cross CPU sockets

Protocol Compatibility:

Simple: ✓ Can use (GPU memory allowed) LL: ✗ Cannot use (forced host memory) LL128: ✓ Can use (GPU memory allowed)

4. Hardware Capabilities and Limitations

4.1 Critical Finding: No GPU-GPU Cache Coherence

What NVLink Provides: FeatureGPU ↔ GPU (NVLink)CPU ↔ GPU (NVLink-C2C)Unified addressing✓✓Direct load/store✓✓High bandwidth (900 GB/s)✓✓Hardware cache coherence✗ NO✓ YESL2 caches remote data✗ NO✓ YESCoherence protocol✗ None✓ MESI

4.2 GPU Cache Architecture

Per-SM L1 Cache (128-256 KB):

Private to each SM NOT coherent across SMs Requires __threadfence() to flush

Shared L2 Cache (40-60 MB):

Coherence point within single GPU Memory-side (between SMs and HBM) Cannot cache remote GPU memory

Implication for Remote Polling: GPU 1 reading flag from GPU 0:

GPU 1’s L2 cannot cache the flag Every read traverses NVLink (~300-500ns) No local caching benefit exists

Why LL128 Polling is Still Fast:

Not because of caching (no caching exists) NVLink latency (0.5 µs) < Memory fence (6 µs) GPU 0’s L2 serves reads quickly (doesn’t go to HBM) Atomic operations prevent stale buffer data

4.3 Software-Enforced Coherence

Since no hardware coherence exists, software uses:

Atomic operations: Provide ordering, force writes to memory Memory fences: Flush all caches/buffers, ensure global visibility Explicit synchronization: Programmer responsibility

4.4 Grace Hopper Exception

NVLink-C2C (CPU ↔ GPU only):

True hardware MESI coherence protocol CPU L2/L3 can cache GPU memory GPU L2 can cache CPU memory Automatic invalidations

But GPU ↔ GPU still has NO coherence.

4.5 Channel Architecture

Purpose: Subdivide collectives across multiple GPU blocks (SMs) for parallelism. Structure:

Each channel = one CUDA block on dedicated SM Each channel has 8 slots (512 KB each for Simple) Slots enable pipelining: overlap receive, reduce, send

Data Partitioning:

Total data split across N channels Each channel processes 1/N of data Within channel: split into loop iterations → slots → chunks

Channel Count Trade-off: Too few: Underutilizes GPU parallelism Too many: Per-channel chunks < 512 KB NIC FIFO → partially filled buffers → worse efficiency Example problem:

16 MB message / 64 channels = 256 KB per channel NIC FIFO: 512 KB optimal Result: 50% NIC utilization

NCCL auto-tunes based on message size to avoid this.

5. NCCL vs NVSHMEM

5.1 Fundamental Differences

Programming Model:

NCCL: Collective operations (AllReduce, Broadcast) - all ranks participate NVSHMEM: One-sided operations (put/get) - individual rank initiates

Communication Pattern:

NCCL: Dense, structured, coordinated NVSHMEM: Sparse, irregular, no coordination

5.2 Performance Comparison: MoE Example

Task: Route 1024 tokens to 8 experts (128 tokens/expert) NCCL (Pre-2.28):

Group launch: ~10 µs 8 send/recv pairs via CPU proxy: ~40 µs Transfer: ~20 µs Total: ~70 µs

NVSHMEM:

Kernel launch: ~5 µs 1024 GPU-initiated puts (pipelined): ~15 µs Transfer: ~20 µs Total: ~40 µs (43% faster)

Why NVSHMEM Wins:

GPU-initiated (0.5 µs overhead) vs CPU proxy (5 µs overhead) No coordination overhead for irregular patterns Fine-grained control enables better compute-communication overlap

5.3 When Each Wins

NCCL:

Dense collectives (AllReduce 1GB): Optimized algorithms, bandwidth-optimal Large regular transfers: Protocol tuning excels Established training pipelines: Native PyTorch/TensorFlow integration

NVSHMEM:

Sparse irregular (GNN neighbor gathering): No coordination needed Fine-grained frequent: Lower per-message overhead One-sided patterns: Put without receiver coordination

5.4 NVSHMEM Architecture

Transport Options:

IB Verbs: Direct RDMA, uses GPUDirect, lowest latency UCX: Flexible multi-transport NCCL backend: Fallback for collectives MPI: Compatibility layer

Device-Side Operations:

GPU kernel directly calls put/get/atomic operations GPU posts to NIC queue without CPU involvement ~0.5 µs overhead per operation

6. NCCL 2.28 Device API

6.1 Three Operation Modes

LSA (Load/Store Accessible):

Intra-node NVLink/PCIe Direct memory load/store operations

Multimem:

NVLink SHARP hardware multicast

GIN (GPU-Initiated Networking):

Inter-node InfiniBand/RoCE GPU directly initiates RDMA operations

6.2 GIN Architecture

Backend Options: GDAKI (GPUDirect Async Kernel-Initiated):

Direct GPU-to-NIC via DOCA GPUNetIO Lowest latency (~0.5 µs overhead)

Proxy Backend:

Lock-free GPU-to-CPU queues CPU posts RDMA operations Broader hardware support Slightly higher latency (~1-2 µs)

Impact: Reduces per-operation overhead from ~5 µs (CPU proxy) to ~0.5 µs (GPU-direct), bringing NCCL to parity with NVSHMEM for irregular workloads.

6.3 Copy Engine Collectives

Traditional: SMs handle data movement → contention with compute CE Collectives:

Dedicated hardware copy engines Zero-SM operation Applies to AlltoAll, AllGather Reduces SM contention, better overlap

6.4 Updated Comparison

FeatureNCCL (Pre-2.28)NCCL 2.28+ GINNVSHMEMDevice-initiated✗✓✓Optimized collectives✓✓⚠️ ManualGPU-to-NIC overhead~5 µs~0.5 µs~0.5 µsMoE optimized✗✓✓Maturity✓ Mature🆕 New✓ Mature

7. Performance Characteristics

7.1 Latency by Protocol and Size

Small Message (4 KB): ProtocolTransferSync OverheadTotalvs SimpleSimple0.007 µs12 µs (fences)12 µs1xLL0.013 µs0.1 µs (polling)0.2 µs60x fasterLL1280.007 µs0.5 µs (polling)0.5 µs24x faster Large Message (1 GB): ProtocolTransferSync OverheadTotalvs SimpleSimple1.7 ms12 µs1.712 ms1xLL3.3 msnegligible3.3 ms2x slowerLL1281.75 msnegligible1.75 ms~same

7.2 Benchmark Results (16-node GH200)

AllReduce Inter-Node: SizeSimpleLLLL128Winner16 KB~40 µs~15 µs~20 µsLL1 MB~500 µs~2000 µs~600 µsSimple1 GB~50 ms~200 ms~70 msSimple AllReduce Intra-Node (NVLink):

LL128 consistently best across all sizes Only 5% slower than Simple at large sizes

7.3 Protocol Selection Heuristics

Intra-node NVLink:

All sizes: LL128 (if available) Fallback: LL for small, Simple for large

Inter-node:

<64 KB: LL 64KB-1MB: LL128 or Simple (workload dependent)

1 GB: Simple (LL128 sync overhead accumulates)

8. Decision Framework

8.1 Protocol Selection

Message size + topology → protocol choice

<64 KB: Intra-node → LL128 > LL Inter-node → LL

64KB-1MB: Intra-node → LL128 Inter-node → LL128 or Simple

1 MB: Any topology → Simple or LL128

Hardware constraints override: No 128B atomics → Simple or LL only No GPUDirect RDMA → LL for low latency

8.2 NCCL vs NVSHMEM Selection

Use NCCL for:

Dense collectives (gradient sync, parameter broadcast) Large regular transfers Existing training infrastructure

Use NVSHMEM or NCCL 2.28 GIN for:

Sparse irregular patterns (MoE routing, GNN) Fine-grained frequent messages Dynamic routing requiring compute-communication fusion

8.3 Common Pitfalls

Assuming cache coherence exists:

Reality: No hardware coherence between GPUs Must use atomics/fences explicitly

Forcing protocol manually:

NCCL autotuning generally optimal Manual NCCL_PROTO often degrades performance

Over-channelization:

NCCL_NCHANNELS ignored in recent versions Trust autotuning for NIC FIFO optimization

Ignoring topology:

Check nvidia-smi topo -m Understand NVLink vs PCIe paths Consider inter-socket costs

Conclusion

GPU communication performance requires understanding:

No hardware cache coherence between GPUs - software must enforce via atomics/fences Protocol trade-offs - latency (fences) vs bandwidth (sync granularity) Topology constraints - NVLink vs PCIe, intra vs inter-node, NUMA effects Workload patterns - dense collectives vs sparse irregular

Key insights:

Simple: Bandwidth-optimal for large messages, high latency LL: Low latency for small messages, forced host memory kills bandwidth LL128: Balanced, but requires hardware support NCCL 2.28 GIN closes gap with NVSHMEM for irregular workloads Optimization requires measurement and topology awareness

Modern applications benefit from hybrid approaches: NCCL for collectives, device-initiated (GIN/NVSHMEM) for irregular patterns, with careful attention to hardware topology and message characteristics.

DMA

DMA = Hardware moves data without CPU/kernel involvement

Three types in GPU communication:

GPU Memory Controller DMA (NVLink):

┌─────────────────────────────────────┐
│ Source: GPU 0 HBM                   │
│ Dest: GPU 1 HBM                     │
│ Who moves it: GPU Memory Controller │
│ Path: NVLink                        │
└─────────────────────────────────────┘
   
Triggered by: Store instructions to remote addresses
Example: remote_ptr[i] = value;

NIC DMA (PCIe):

┌─────────────────────────────────────┐
│ Source: GPU HBM or Host RAM         │
│ Dest: NIC internal buffer           │
│ Who moves it: NIC DMA engine        │
│ Path: PCIe                          │
└─────────────────────────────────────┘
   
Triggered by: ibv_post_send() or similar NIC API

Copy Engine DMA (newer GPUs):

┌─────────────────────────────────────┐
│ Source: GPU 0 HBM                   │
│ Dest: GPU 1 HBM                     │
│ Who moves it: Dedicated copy engine │
│ Path: NVLink                        │
└─────────────────────────────────────┘
   
Triggered by: NCCL API, not user kernel
Advantage: Doesn't use SM resources