
NCCL Performance Testing & Tuning Guide

Abstract: NCCL (NVIDIA Collective Communications Library) is the backbone of distributed deep learning training. This guide details how to compile and deploy nccl-tests, perform intra-node and inter-node benchmarking, and apply advanced BIOS- and OS-level tuning strategies.

1. Prerequisites

Before starting, ensure the cluster meets the following conditions:

  • Hardware: NVIDIA A800/H800 GPUs, Mellanox IB NICs.
  • Software: CentOS 7/Rocky 8, NVIDIA Driver 535+, CUDA 12.x.
  • Health Check (a scripted pass is sketched below):
    • GPU: nvidia-smi shows no anomalous usage.
    • IB: ibstat shows State: Active.
    • Network: Passwordless SSH configured.
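
A minimal pre-flight script covering these checks (a sketch; the peer hostname node02 is a placeholder for your cluster):

bash
# Hypothetical pre-flight check; adjust the peer hostname for your cluster
nvidia-smi                                  # all GPUs visible, no stuck processes
ibstat | grep -E "State|Rate"               # every IB port should report State: Active
ssh -o BatchMode=yes node02 hostname        # must succeed without a password prompt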

2. Compilation

Compiling from source is recommended for optimal hardware support.

2.1 Setup Environment

bash
export WORK_DIR=/tmp/nccl_build
mkdir -p $WORK_DIR
cd $WORK_DIR

2.2 Install OpenMPI

bash
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.bz2
tar xf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5

# Compile with CUDA support
./configure --prefix=$WORK_DIR/openmpi --with-cuda
make -j$(nproc)
make install

# Set Envs
export PATH=$WORK_DIR/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$WORK_DIR/openmpi/lib:$LD_LIBRARY_PATH
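
To confirm CUDA awareness was actually compiled in, query ompi_info (this check is from the Open MPI documentation; it should print value:true):

bash
# Prints ...mpi_built_with_cuda_support:value:true on a CUDA-aware build
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value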

2.3 Compile nccl-tests

Compile the benchmark tool. This assumes NCCL is already installed alongside the CUDA toolkit and driver; if NCCL lives in a non-standard location, also pass NCCL_HOME=<path> to make:

bash
cd $WORK_DIR
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests

# Build
make MPI=1 \
     MPI_HOME=$WORK_DIR/openmpi \
     CUDA_HOME=/usr/local/cuda \
     -j$(nproc)

Binaries will be in ./build/.
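
Before full benchmarking, a single-GPU smoke test (no mpirun required) verifies the build; this tiny sweep finishes in seconds:

bash
# Minimal sanity run: 8B to 1MB on one GPU
./build/all_reduce_perf -b 8 -e 1M -f 2 -g 1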

3. Benchmarking

The most common tool is all_reduce_perf.

3.1 Intra-Node Test

Test NVLink bandwidth within a single node (8 GPUs).

bash
# -b: min bytes, -e: max bytes, -f: factor, -g: num GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
  • Expectation: busbw should plateau at large message sizes, close to a known-good baseline for the same platform. Note the A800's NVLink aggregate is 400GB/s (the 600GB/s peak belongs to the A100), so A100-derived expectations do not transfer directly.
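
The same flag set drives the other benchmarks built in Section 2.3, e.g. all_gather_perf or reduce_scatter_perf:

bash
# Same sweep for AllGather; the other binaries in ./build/ follow the same pattern
./build/all_gather_perf -b 8 -e 128M -f 2 -g 8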

3.2 Inter-Node Test

Test cross-node IB performance.

1. Create hostfile

text
# Format: IP slots=NumGPUs
192.168.1.1 slots=8
192.168.1.2 slots=8

2. Run Test

bash
# 2 nodes x 8 GPUs = 16 ranks; -w: warmup iterations, -n: measured iterations
mpirun --allow-run-as-root \
    --hostfile hostfile \
    -np 16 \
    -x LD_LIBRARY_PATH \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -w 10 -n 100
  • -np 16: Total MPI ranks (2 nodes × 8 GPUs; with -g 1, each rank drives one GPU).
  • -x NCCL_IB_HCA: Explicitly pin the IB devices so NCCL does not fall back to Ethernet.

4. Analysis & Baselines

Focus on the busbw column.

| Metric | Description | Formula (AllReduce) |
|--------|-------------|---------------------|
| algbw | Algorithm bandwidth: application-level throughput | data size / time |
| busbw | Bus bandwidth: hardware utilization metric | algbw × 2(n−1)/n |

$$ \text{BusBW} = \text{AlgBW} \times \frac{2(n-1)}{n} $$
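
For example, with $n = 16$ GPUs the factor is $2 \times 15 / 16 = 1.875$, so an algbw of 48GB/s corresponds to a busbw of 90GB/s.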

Passing Criteria

  • Intra-Node (NVLink): > 90% Theoretical BW.
  • Inter-Node (IB): > 80% Theoretical BW.
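
As a sanity check, assume the four-HCA node from Section 3.2 with 200Gb/s links (an assumption; substitute your fabric's rate): 4 × 200Gb/s = 800Gb/s ≈ 100GB/s of injection bandwidth per node, so the 80% criterion translates to a large-message busbw of roughly 80GB/s.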

5. Advanced Tuning

5.1 System Optimization

  1. GDR (GPUDirect RDMA): Load nvidia_peermem to allow IB to read/write GPU memory directly.
    bash
    modprobe nvidia_peermem
  2. PCIe ACS: Disable ACS (Access Control Services) so peer-to-peer traffic is not forced through the root complex (see the script after this list).
  3. IOMMU: For GPUDirect/P2P, disable the IOMMU or run it in passthrough mode (e.g., add iommu=pt to the kernel command line); an active IOMMU can break or severely slow peer-to-peer DMA.
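
A commonly used runtime approach (a sketch, using setpci from pciutils) clears the ACS Control register on every device; changes do not survive a reboot, and a BIOS-level ACS disable is preferable where available:

bash
# Clear ACS Control (offset +0x6 in the ACS extended capability) on all devices;
# errors from devices without the capability are silenced
for bdf in $(lspci | awk '{print $1}'); do
    setpci -v -s "$bdf" ECAP_ACS+0x6.w=0000 2>/dev/null
done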

5.2 NCCL Variables

| Variable | Recommended / Note |
|----------|--------------------|
| NCCL_DEBUG | INFO for topology check |
| NCCL_IB_HCA | mlx5_0:1 (select devices/ports) |
| NCCL_IB_GID_INDEX | 3 (for RoCEv2) |
| NCCL_SOCKET_IFNAME | eth0 (OOB interface) |
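
Putting these together for a RoCEv2 fabric (a sketch; device and interface names are site-specific placeholders):

bash
# Export before launching, or pass each variable via mpirun -x as in Section 3.2
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5_0:1
export NCCL_IB_GID_INDEX=3        # RoCEv2 GID index; verify with show_gids
export NCCL_SOCKET_IFNAME=eth0    # out-of-band bootstrap interface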

5.3 Troubleshooting

  • Hangs:
    • Check firewall ports.
    • Isolate by disabling NVLink (NCCL_P2P_DISABLE=1) or IB (NCCL_IB_DISABLE=1).
  • Low Performance:
    • Confirm NCCL selected the IB transport rather than falling back to TCP sockets (inspect the NCCL_DEBUG=INFO output).
    • Check PCIe link width/speed with lspci -vv, as sketched below.
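
A focused check for a downtrained PCIe link (the bus ID below is a placeholder; take the real one from the nvidia-smi query):

bash
# List GPU PCI bus IDs, then compare negotiated (LnkSta) vs. maximum (LnkCap) link
nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader
lspci -vv -s 3b:00.0 | grep -E 'LnkCap|LnkSta'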
