
NCCL Performance Testing & Tuning Guide

Abstract: NCCL (NVIDIA Collective Communications Library) is the backbone of distributed deep learning training. This guide details how to compile and deploy nccl-tests, perform intra-node and inter-node benchmarking, and apply advanced BIOS- and OS-level tuning strategies.

1. Prerequisites

Before starting, ensure the cluster meets the following conditions:

  • Hardware: NVIDIA A800/H800 GPUs, Mellanox IB NICs.
  • Software: CentOS 7/Rocky 8, NVIDIA Driver 535+, CUDA 12.x.
  • Health Check (a scripted pass is sketched below):
    • GPU: nvidia-smi shows no anomalous usage.
    • IB: ibstat shows State: Active.
    • Network: Passwordless SSH configured.
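
A minimal pre-flight script covering these checks (a sketch; the peer hostname node02 is a placeholder for your cluster):

bash
# Hypothetical pre-flight check; adjust the peer hostname for your cluster
nvidia-smi                                  # all GPUs visible, no stuck processes
ibstat | grep -E "State|Rate"               # every IB port should report State: Active
ssh -o BatchMode=yes node02 hostname        # must succeed without a password prompt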

2. Compilation

Compiling from source is recommended for optimal hardware support.

2.1 Setup Environment

bash
export WORK_DIR=/tmp/nccl_build
mkdir -p $WORK_DIR
cd $WORK_DIR

2.2 Install OpenMPI

bash
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.bz2
tar xf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5

# Compile with CUDA support
./configure --prefix=$WORK_DIR/openmpi --with-cuda
make -j$(nproc)
make install

# Set Envs
export PATH=$WORK_DIR/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$WORK_DIR/openmpi/lib:$LD_LIBRARY_PATH
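
To confirm CUDA awareness was actually compiled in, query ompi_info (this check is from the Open MPI documentation; it should print value:true):

bash
# Prints ...mpi_built_with_cuda_support:value:true on a CUDA-aware build
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value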

2.3 Compile nccl-tests

Compile the benchmark tool. This assumes NCCL is already installed alongside the CUDA toolkit and driver; if NCCL lives in a non-standard location, also pass NCCL_HOME=<path> to make:

bash
cd $WORK_DIR
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests

# Build
make MPI=1 \
     MPI_HOME=$WORK_DIR/openmpi \
     CUDA_HOME=/usr/local/cuda \
     -j$(nproc)

Binaries will be in ./build/.
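
Before full benchmarking, a single-GPU smoke test (no mpirun required) verifies the build; this tiny sweep finishes in seconds:

bash
# Minimal sanity run: 8B to 1MB on one GPU
./build/all_reduce_perf -b 8 -e 1M -f 2 -g 1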

3. Benchmarking

The most common tool is all_reduce_perf.

3.1 Intra-Node Test

Test NVLink bandwidth within a single node (8 GPUs).

bash
# -b: min bytes, -e: max bytes, -f: factor, -g: num GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
  • Expectation: busbw should plateau at large message sizes, close to a known-good baseline for the same platform. Note the A800's NVLink aggregate is 400GB/s (the 600GB/s peak belongs to the A100), so A100-derived expectations do not transfer directly.
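
The same flag set drives the other benchmarks built in Section 2.3, e.g. all_gather_perf or reduce_scatter_perf:

bash
# Same sweep for AllGather; the other binaries in ./build/ follow the same pattern
./build/all_gather_perf -b 8 -e 128M -f 2 -g 8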

3.2 Inter-Node Test

Test cross-node IB performance.

1. Create hostfile

text
# Format: IP slots=NumGPUs
192.168.1.1 slots=8
192.168.1.2 slots=8

2. Run Test

bash
# 2 nodes x 8 GPUs = 16 ranks; -w: warmup iterations, -n: measured iterations
mpirun --allow-run-as-root \
    --hostfile hostfile \
    -np 16 \
    -x LD_LIBRARY_PATH \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -w 10 -n 100
  • -np 16: Total MPI ranks (2 nodes × 8 GPUs; with -g 1, each rank drives one GPU).
  • -x NCCL_IB_HCA: Explicitly pin the IB devices so NCCL does not fall back to Ethernet.

4. Analysis & Baselines

Focus on the busbw column.

| Metric | Description | Formula (AllReduce) |
|--------|-------------|---------------------|
| algbw | Algorithm bandwidth: application-level throughput | data size / time |
| busbw | Bus bandwidth: hardware utilization metric | algbw × 2(n−1)/n |

$$ \text{BusBW} = \text{AlgBW} \times \frac{2(n-1)}{n} $$
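
For example, with $n = 16$ GPUs the factor is $2 \times 15 / 16 = 1.875$, so an algbw of 48GB/s corresponds to a busbw of 90GB/s.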

Passing Criteria

  • Intra-Node (NVLink): > 90% Theoretical BW.
  • Inter-Node (IB): > 80% Theoretical BW.
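
As a sanity check, assume the four-HCA node from Section 3.2 with 200Gb/s links (an assumption; substitute your fabric's rate): 4 × 200Gb/s = 800Gb/s ≈ 100GB/s of injection bandwidth per node, so the 80% criterion translates to a large-message busbw of roughly 80GB/s.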

5. Advanced Tuning

5.1 System Optimization

  1. GDR (GPUDirect RDMA): Load nvidia_peermem to allow IB to read/write GPU memory directly.
    bash
    modprobe nvidia_peermem
  2. PCIe ACS: Disable ACS (Access Control Services) so peer-to-peer traffic is not forced through the root complex (see the script after this list).
  3. IOMMU: For GPUDirect/P2P, disable the IOMMU or run it in passthrough mode (e.g., add iommu=pt to the kernel command line); an active IOMMU can break or severely slow peer-to-peer DMA.
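
A commonly used runtime approach (a sketch, using setpci from pciutils) clears the ACS Control register on every device; changes do not survive a reboot, and a BIOS-level ACS disable is preferable where available:

bash
# Clear ACS Control (offset +0x6 in the ACS extended capability) on all devices;
# errors from devices without the capability are silenced
for bdf in $(lspci | awk '{print $1}'); do
    setpci -v -s "$bdf" ECAP_ACS+0x6.w=0000 2>/dev/null
done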

5.2 NCCL Variables

| Variable | Recommended / Note |
|----------|--------------------|
| NCCL_DEBUG | INFO for topology check |
| NCCL_IB_HCA | mlx5_0:1 (select devices/ports) |
| NCCL_IB_GID_INDEX | 3 (for RoCEv2) |
| NCCL_SOCKET_IFNAME | eth0 (OOB interface) |
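
Putting these together for a RoCEv2 fabric (a sketch; device and interface names are site-specific placeholders):

bash
# Export before launching, or pass each variable via mpirun -x as in Section 3.2
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5_0:1
export NCCL_IB_GID_INDEX=3        # RoCEv2 GID index; verify with show_gids
export NCCL_SOCKET_IFNAME=eth0    # out-of-band bootstrap interface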

5.3 Troubleshooting

  • Hangs:
    • Check firewall ports.
    • Isolate by disabling NVLink (NCCL_P2P_DISABLE=1) or IB (NCCL_IB_DISABLE=1).
  • Low Performance:
    • Confirm NCCL selected the IB transport rather than falling back to TCP sockets (inspect the NCCL_DEBUG=INFO output).
    • Check PCIe link width/speed with lspci -vv, as sketched below.
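
A focused check for a downtrained PCIe link (the bus ID below is a placeholder; take the real one from the nvidia-smi query):

bash
# List GPU PCI bus IDs, then compare negotiated (LnkSta) vs. maximum (LnkCap) link
nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader
lspci -vv -s 3b:00.0 | grep -E 'LnkCap|LnkSta'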
