
InfiniBand Performance Tuning for AI Workloads

At scale, distributed training performance is increasingly network-bound rather than compute-bound.

For large-scale LLM training, inefficient RDMA communication can reduce cluster efficiency by more than 40%.
This guide provides a production-grade methodology for tuning InfiniBand in GPU clusters.


Why InfiniBand Performance Matters for LLM Training

In modern training workloads:

  • AllReduce dominates iteration time
  • Communication overlaps with compute
  • Network imbalance breaks scaling efficiency

At scale:

  • Small latency increase → global throughput drop
  • PCIe misalignment → NCCL bandwidth collapse

End-to-End Data Path

The real performance path is:
GPU → PCIe Switch → HCA → IB Fabric → Remote HCA → Remote GPU

Key bottlenecks:

  • PCIe lane width
  • NUMA crossing
  • Retimer latency
  • GDR capability

Key Performance Metrics

1. RDMA Bandwidth Test

```bash
# Start a server instance on one host, then point the client at it:
ib_write_bw -d mlx5_0 --report_gbits              # on the server
ib_write_bw -d mlx5_0 --report_gbits <server-ip>  # on the client
ib_read_bw -d mlx5_0 <server-ip>
ib_send_bw -d mlx5_0 <server-ip>
```

2. NCCL Test

```bash
# Sweep AllReduce from 8 B to 128 MB, doubling each step, on 8 GPUs
nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

Focus on:

  • Bus Bandwidth
  • Algorithm Bandwidth
  • Latency at small message sizes
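The two bandwidth figures are related by the collective's algorithm: for ring AllReduce, each byte crosses the bus 2(n−1)/n times, which is how nccl-tests converts algorithm bandwidth into bus bandwidth. A minimal sketch (the `busbw` function name is ours):

```bash
# Ring AllReduce: busbw = algbw * 2*(n-1)/n, where n is the number of ranks.
busbw() {
  local algbw=$1 n=$2
  awk -v a="$algbw" -v n="$n" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

busbw 100 8   # 100 GB/s algbw across 8 ranks -> 175.00 GB/s busbw
```

Bus bandwidth is the number to compare against line rate, since it is independent of rank count.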

PCIe and NUMA Affinity Optimization

Check Topology

```bash
nvidia-smi topo -m   # GPU/HCA distance matrix
lspci -tv            # PCIe device tree
numactl -H           # NUMA layout
```

Goal:

  • GPU and HCA under same NUMA
  • Avoid SYS distance in topology matrix
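One quick way to verify the first goal is to read each HCA's NUMA node from sysfs and compare it with the GPU's node in `nvidia-smi topo -m`. A sketch (the sysfs root is parameterized only so the function is easy to test; on a real host the default path applies):

```bash
# Print the NUMA node of every InfiniBand HCA.
# A value of -1 means the platform did not report affinity.
hca_numa_nodes() {
  local sysfs=${1:-/sys/class/infiniband}
  local hca
  for hca in "$sysfs"/*; do
    [ -r "$hca/device/numa_node" ] || continue
    printf '%s: NUMA node %s\n' "${hca##*/}" "$(cat "$hca/device/numa_node")"
  done
}
```

Processes can then be pinned to the matching node, e.g. `numactl --cpunodebind=N --membind=N`.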

Manual Binding

```bash
export NCCL_IB_HCA=mlx5_0,mlx5_1                # restrict NCCL to these HCAs
export NCCL_TOPO_FILE=/path/to/custom_topo.xml  # override detected topology
```

GPUDirect RDMA Optimization

Verify GDR

```bash
# nvidia-smi does not report GDR status directly; check that the
# peer-memory kernel module is loaded instead:
lsmod | grep -E 'nvidia_peermem|gdrdrv'
```

Common Issues:

  • ACS enabled in PCIe switch
  • IOMMU enabled
  • Insufficient BAR space
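Each of these can be checked from the shell. The `acs_enabled` helper below is a hypothetical parsing sketch over `lspci -vvv` output; the other checks are plain one-liners noted in the comments:

```bash
# 1. ACS on PCIe switches redirects peer-to-peer traffic through the
#    root complex, killing GDR. Inspect with: sudo lspci -vvv | grep ACSCtl
acs_enabled() {
  # Succeeds if source validation or request redirect is enabled ('+')
  grep -Eq 'ACSCtl:.*(SrcValid|ReqRedir)\+' <<<"$1"
}

# 2. IOMMU: check the kernel command line:
#    grep -o 'iommu=[^ ]*' /proc/cmdline
# 3. BAR space: look for BAR assignment failures in dmesg output.
```

If `acs_enabled` fires for a switch port on the GPU-to-HCA path, disable ACS there (or via BIOS) before re-measuring.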

NCCL Environment Variable Tuning

Core Parameters

```bash
export NCCL_NET_GDR_LEVEL=2          # allow GDR up to PXB distance (legacy numeric level)
export NCCL_IB_QPS_PER_CONNECTION=4  # spread each connection over 4 queue pairs
export NCCL_IB_GID_INDEX=3           # GID index (mainly relevant for RoCE; verify with show_gids)
export NCCL_IB_TC=136                # traffic class for IB/RoCE QoS mapping
```
Tuning strategy depends on:

  • GPU count per node
  • Rail-optimized network
  • Fabric oversubscription ratio
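One way to keep these interdependent settings consistent is a small wrapper that derives them from the cluster shape. The `nccl_env` name and the QP-scaling heuristic below are our assumptions, not NCCL defaults:

```bash
# Hypothetical helper: derive NCCL IB settings from the fabric shape.
nccl_env() {
  local oversub=${1:-1}   # fabric oversubscription ratio (1 = non-blocking)
  export NCCL_NET_GDR_LEVEL=2
  # Assumption: more QPs per connection lets adaptive routing spread
  # flows across paths on oversubscribed fabrics.
  if [ "$oversub" -ge 2 ]; then
    export NCCL_IB_QPS_PER_CONNECTION=8
  else
    export NCCL_IB_QPS_PER_CONNECTION=4
  fi
  export NCCL_IB_TC=136
}
```

Call it once per launch script (e.g. `nccl_env 2` on a 2:1 fabric) so every rank sees identical settings.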

InfiniBand Fabric Tuning

Cluster-level tuning:

  • MTU = 4096
  • Adaptive routing = enabled
  • Proper SL mapping

Validate using:

```bash
ibdiagnet                      # fabric-wide diagnostics and routing report
perfquery                      # per-port performance and error counters
ibv_devinfo | grep active_mtu  # confirm the 4096 MTU is active
```

RoCE vs InfiniBand

For a lossless RoCE tuning guide see:
/guide/03-network/roce-ai-fabric

Key differences:

  • Congestion control
  • Buffer design
  • PFC impact

Benchmark Methodology

A correct benchmarking process:

  1. Single link RDMA test
  2. Intra-node NCCL
  3. Inter-node NCCL
  4. Real training workload validation
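The four stages can be scripted end to end. Hostnames, device names, and binary paths below are placeholders, and `DRY_RUN` defaults to printing each command instead of executing it:

```bash
#!/usr/bin/env bash
# Staged benchmark sketch. DRY_RUN=1 (the default) only echoes commands.
set -euo pipefail
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# 1. Single-link RDMA (start `ib_write_bw -d mlx5_0` on peer-host first)
run ib_write_bw -d mlx5_0 -F --report_gbits peer-host

# 2. Intra-node NCCL across 8 GPUs
run ./all_reduce_perf -b 8 -e 128M -f 2 -g 8

# 3. Inter-node NCCL: 8 ranks per node, 1 GPU per rank
run mpirun -np 16 -H node1:8,node2:8 ./all_reduce_perf -b 8 -e 128M -f 2 -g 1

# 4. Real workload: run a short training job and compare step time
#    against the single-node baseline.
```

Only move to the next stage once the current one hits its expected bandwidth, so a fabric problem is never debugged through a training framework.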

Real-World Tuning Case Study

Initial State:

  • 8×GPU per node
  • AllReduce Bus BW: 43 GB/s

Optimizations Applied:

  • NUMA realignment
  • GDR enabled
  • QPs increased
  • Rail-aware NCCL tuning

Final Result:

  • AllReduce Bus BW: 92 GB/s
  • Scaling efficiency: 58% → 91%

