NCCL Performance Testing & Tuning Guide
Abstract: NCCL (NVIDIA Collective Communications Library) is the backbone of distributed deep learning training. This guide details how to compile and deploy
nccl-tests, perform intra-node and inter-node benchmarking, and apply advanced BIOS/OS level tuning strategies.
1. Prerequisites
Before starting, ensure the cluster meets the following conditions:
- Hardware: NVIDIA A800/H800 GPUs, Mellanox IB NICs.
- Software: CentOS 7/Rocky 8, NVIDIA Driver 535+, CUDA 12.x.
- Health Check:
  - GPU: `nvidia-smi` shows no anomalous usage.
  - IB: `ibstat` shows `State: Active`.
  - Network: Passwordless SSH configured.
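The health checks above can be scripted for a quick per-node sweep. The helpers below are a sketch: `ib_state_ok` and `gpu_visible` are illustrative names, and `ib_state_ok` simply parses `ibstat`-style text so it can be reused in automation.

```shell
#!/usr/bin/env bash
# Sketch of scripted health checks (helper names are illustrative).

# ib_state_ok: succeeds if ibstat-style text on stdin reports an Active port.
ib_state_ok() {
  grep -q "State: Active"
}

# gpu_visible: succeeds if nvidia-smi is installed and can enumerate GPUs.
gpu_visible() {
  command -v nvidia-smi >/dev/null && nvidia-smi -L >/dev/null
}

# Example per-node usage:
#   ibstat | ib_state_ok || echo "IB port not Active"
#   gpu_visible         || echo "GPU check failed"
```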
2. Compilation
Compiling from source is recommended for optimal hardware support.
2.1 Setup Environment
```bash
export WORK_DIR=/tmp/nccl_build
mkdir -p $WORK_DIR
cd $WORK_DIR
```

2.2 Install OpenMPI
```bash
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.bz2
tar xf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5
# Compile with CUDA support
./configure --prefix=$WORK_DIR/openmpi --with-cuda
make -j$(nproc)
make install
# Set Envs
export PATH=$WORK_DIR/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$WORK_DIR/openmpi/lib:$LD_LIBRARY_PATH
```

2.3 Compile nccl-tests
Compile the benchmark tool (assuming NCCL is installed with CUDA/Driver):
```bash
cd $WORK_DIR
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
# Build
make MPI=1 \
  MPI_HOME=$WORK_DIR/openmpi \
  CUDA_HOME=/usr/local/cuda \
  -j$(nproc)
```

Binaries will be in `./build/`.
3. Benchmarking
The most common tool is all_reduce_perf.
3.1 Intra-Node Test
Test NVLink bandwidth within a single node (8 GPUs).
```bash
# -b: min bytes, -e: max bytes, -f: factor, -g: num GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

- Expectation: For A800, measured BusBW should approach 550-580 GB/s (theoretical peak is 600 GB/s).
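The pass/fail decision can be automated by parsing the summary line that nccl-tests prints at the end of a run. The helper below is a sketch: it assumes the numeric value is the last field of the `# Avg bus bandwidth` line, as in current nccl-tests output.

```shell
#!/usr/bin/env bash
# busbw_ok: compare the average bus bandwidth in an nccl-tests log against a
# floor in GB/s. Assumes the "# Avg bus bandwidth" summary line ends with the
# numeric value (true for current nccl-tests output).
busbw_ok() {                       # usage: busbw_ok <logfile> <min_gb_per_s>
  local measured
  measured=$(grep -i "Avg bus bandwidth" "$1" | awk '{print $NF}')
  # Force numeric comparison (-v assigned awk variables are strings).
  awk -v m="$measured" -v floor="$2" 'BEGIN { exit !(m + 0 >= floor + 0) }'
}

# Example, using the A800 expectation above as the floor:
#   ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 | tee ar.log
#   busbw_ok ar.log 540 && echo PASS || echo FAIL
```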
3.2 Inter-Node Test
Test cross-node IB performance.
1. Create hostfile:

```text
# Format: IP slots=NumGPUs
192.168.1.1 slots=8
192.168.1.2 slots=8
```

2. Run Test:
```bash
mpirun --allow-run-as-root \
  --hostfile hostfile \
  -np 16 \
  -x LD_LIBRARY_PATH \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -w 10 -n 100
```

- `-np 16`: Total number of processes (one per GPU across both nodes).
- `-x NCCL_IB_HCA`: Explicitly specify IB devices to avoid Ethernet fallback.
- `-w 10 -n 100`: 10 warmup iterations, then 100 timed iterations.
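The inter-node pass floor follows directly from the rail configuration. The snippet below turns the 80% criterion from Section 4 into a concrete number; the NIC count and line rate are assumptions matching the four-HCA example above, and must be adjusted to the actual cluster.

```shell
#!/usr/bin/env bash
# Hypothetical rail configuration: adjust to the actual cluster.
nics=4           # HCAs used per node (mlx5_0..mlx5_3 in the mpirun example)
rate_gbps=200    # per-HCA line rate in Gb/s (e.g. 200 Gb/s HDR InfiniBand)

# Peak unidirectional bandwidth in GB/s, and the 80% pass floor.
awk -v n="$nics" -v r="$rate_gbps" \
  'BEGIN { peak = n * r / 8; printf "peak=%.0f GB/s, floor=%.0f GB/s\n", peak, 0.8 * peak }'
# prints: peak=100 GB/s, floor=80 GB/s
```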
4. Analysis & Baselines
Focus on the busbw column.
| Column | Name | Meaning |
|---|---|---|
| algbw | Algorithm Bandwidth | Application-level throughput |
| busbw | Bus Bandwidth | Hardware utilization metric |

For AllReduce, the two are related by:
$$ \text{BusBW} = \text{AlgBW} \times \frac{2(n-1)}{n} $$
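As a worked example of the formula: 16 GPUs reporting an algbw of 40 GB/s correspond to a busbw of 40 × 2(16−1)/16 = 75 GB/s.

```shell
# BusBW = AlgBW * 2(n-1)/n for AllReduce; here n = 16 ranks, algbw = 40 GB/s.
awk 'BEGIN { n = 16; algbw = 40; printf "busbw = %.2f GB/s\n", algbw * 2 * (n - 1) / n }'
# prints: busbw = 75.00 GB/s
```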
Passing Criteria
- Intra-Node (NVLink): > 90% Theoretical BW.
- Inter-Node (IB): > 80% Theoretical BW.
5. Advanced Tuning
5.1 System Optimization
- GDR (GPUDirect RDMA): Load `nvidia_peermem` to allow the IB NIC to read/write GPU memory directly:

  ```bash
  modprobe nvidia_peermem
  ```

- PCIe ACS: Disable ACS (Access Control Services) for P2P communication.
- IOMMU: `intel_iommu=on` is typically safe; ensure it does not conflict with peer-to-peer transfers.
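Whether ACS is actually disabled can be checked from the ACSCtl flags that lspci reports. The helper below is a sketch that assumes the `lspci -vvv` flag format, where a `+` suffix marks an enabled control bit.

```shell
#!/usr/bin/env bash
# acs_enabled: succeeds if lspci-style text on stdin shows any ACS control
# bits enabled ("+" suffix); flag format assumed from lspci -vvv output.
acs_enabled() {
  grep "ACSCtl:" | grep -q "+"
}

# Example (needs root to read all capability registers):
#   lspci -vvv 2>/dev/null | acs_enabled && echo "ACS still enabled somewhere"
```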
5.2 NCCL Variables
| Variable | Recommended/Note |
|---|---|
| NCCL_DEBUG | INFO for topology check |
| NCCL_IB_HCA | mlx5_0:1 (Select devices) |
| NCCL_IB_GID_INDEX | 3 (For RoCEv2) |
| NCCL_SOCKET_IFNAME | eth0 (OOB interface) |
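Pulling the table together, a launch environment might look like the following sketch; the device and interface names (`mlx5_0:1`, `eth0`) are placeholders to adapt per cluster.

```shell
# Example NCCL environment (mlx5_0:1 and eth0 are placeholders).
export NCCL_DEBUG=INFO          # print topology and transport selection
export NCCL_IB_HCA=mlx5_0:1     # restrict NCCL to specific IB device:port pairs
export NCCL_IB_GID_INDEX=3      # GID index commonly required on RoCEv2 fabrics
export NCCL_SOCKET_IFNAME=eth0  # out-of-band (bootstrap) interface
```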
5.3 Troubleshooting
- Hangs:
  - Check firewall ports.
  - Isolate by disabling NVLink (`NCCL_P2P_DISABLE=1`) or IB (`NCCL_IB_DISABLE=1`).
- Low Performance:
  - Ensure the transport is IB, not TCP (check the logs).
  - Check PCIe link width/speed (`lspci -vv`).
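The transport check can be done directly against the `NCCL_DEBUG=INFO` log. The helper below is a sketch: the `NET/IB` and `NET/Socket` markers are assumptions based on typical NCCL INFO output.

```shell
#!/usr/bin/env bash
# transport_is_ib: succeed only if the log mentions the IB transport and shows
# no fallback to the socket (TCP) transport. Marker strings are assumptions
# based on typical NCCL_DEBUG=INFO output.
transport_is_ib() {              # usage: transport_is_ib <logfile>
  grep -q "NET/IB" "$1" && ! grep -q "NET/Socket" "$1"
}

# Example:
#   transport_is_ib nccl.log || echo "WARNING: NCCL is not using IB"
```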
