NVIDIA GPU HPL Benchmarking Guide (Docker)

Abstract: Running HPL (High-Performance Linpack) in a containerized environment is a standard way to evaluate GPU cluster performance. This guide, based on the nvcr.io/nvidia/hpc-benchmarks image, covers Docker deployment, execution commands, and parameter tuning.

1. Overview

Using the official NVIDIA optimized container offers several advantages over bare-metal compilation:

  • Consistency: Pre-built with highly optimized CUDA, NCCL, OpenMPI, and MKL.
  • Ease of Use: No need to manually compile math libraries or resolve dependencies.
  • Performance: Specifically optimized for Tensor Cores.

Image: nvcr.io/nvidia/hpc-benchmarks:25.02 (or latest).

2. Docker Configuration

To unlock maximum hardware performance, specific privileges and resource mappings are required.

2.1 Run Command

bash
docker run -it --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --privileged=true --network host \
  --shm-size=20G \
  -v /home/hpl-data:/workspace \
  nvcr.io/nvidia/hpc-benchmarks:25.02

2.2 Key Parameters

| Parameter | Purpose | Why? |
| --- | --- | --- |
| --gpus all | Pass through GPUs | Allows the container to access all host GPUs. |
| --ipc=host | Share host IPC | Critical: enables low-latency IPC (e.g., NVSHMEM) for multi-GPU communication. |
| --ulimit memlock=-1 | Unlimited memlock | Critical: allows pinned memory, required for InfiniBand RDMA. |
| --ulimit stack=... | Larger stack | Prevents stack overflow during large matrix factorization. |
| --privileged=true | Privileged mode | Allows access to InfiniBand devices and driver interfaces. |
| --network host | Host networking | Bypasses the Docker bridge for the lowest latency (crucial for IB). |
| --shm-size=20G | Shared memory | Provides room for MPI buffers and large matrix computation. |
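
Once inside the container, the effect of these settings can be verified before launching a run. A minimal sanity check (assuming the usual coreutils are present; nvidia-smi is injected by the NVIDIA container runtime):

bash
# List the GPUs visible to the container
nvidia-smi -L

# Confirm the memlock limit is unlimited and the stack limit was raised
ulimit -l
ulimit -s

# Confirm the shared-memory size requested via --shm-size
df -h /dev/shm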

3. Running HPL

Inside the container, mpirun is typically used to invoke the hpl.sh wrapper script.

3.1 Execution (Single Node 8-GPU)

bash
mpirun --bind-to none -np 8 \
  -npernode 8 \
  -x LD_LIBRARY_PATH \
  /workspace/hpl.sh --dat /workspace/HPL-8GPUs.dat

3.2 Arguments

  • --bind-to none: Important. Prevents MPI from imposing its own CPU-core binding; in containers, OS scheduling or manual affinity control (see Section 5) is often safer.
  • -np 8: Total processes (8 GPUs).
  • -npernode 8: Processes per node (1 GPU = 1 Process).
  • -x LD_LIBRARY_PATH: Explicitly pass environment variables to ensure libraries are found.
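
For reference, a multi-node launch follows the same pattern. The sketch below is hypothetical: it assumes two 8-GPU nodes listed in a hostfile, passwordless SSH between them (see Section 5), and an HPL-16GPUs.dat sized for 16 GPUs:

bash
# Hypothetical 2-node, 16-GPU launch; /workspace/hosts lists the node names
mpirun --bind-to none -np 16 \
  -npernode 8 \
  --hostfile /workspace/hosts \
  -x LD_LIBRARY_PATH \
  /workspace/hpl.sh --dat /workspace/HPL-16GPUs.dat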

3.3 Configuration (HPL.dat)

HPL.dat defines the problem size. For 8x A800/H800:

  • N (Matrix Order): 264192 (targets roughly 80-90% of total GPU memory; see the sizing sketch below).
  • NB (Block Size): 1024 (optimized for Tensor Cores).
  • P x Q (Grid): 4 x 2 for 8 processes (keep the grid as close to square as possible; conventional HPL guidance prefers P ≤ Q).
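
As a rough sizing aid for N (an illustrative sketch only, assuming 8 GPUs with 80 GiB of HBM each and an ~85% memory target), take the square root of the memory budget divided by 8 bytes per double-precision element and round down to a multiple of NB:

bash
# N ~= sqrt(target_fraction * total_GPU_memory_bytes / 8), rounded down to a
# multiple of NB. The inputs below are assumptions; adjust to your hardware.
awk 'BEGIN {
  gpus = 8; mem_bytes = 80 * 2^30;   # assumed: 8 x 80 GiB of HBM
  frac = 0.85; nb = 1024;            # target memory fraction and block size
  n = sqrt(frac * gpus * mem_bytes / 8);
  print int(n / nb) * nb;            # prints 269312 with these inputs
}'

A minimal HPL-8GPUs.dat sketch using the values above is shown below. Apart from the N, NB, P, and Q lines, the entries follow the standard HPL input layout with commonly used values and can serve as a starting point for further tuning:

text
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
264192       Ns
1            # of NBs
1024         NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)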

4. Result Analysis

Look for the summary in the standard output:

text
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4      264192  1024     4     2              90.14              1.364e+05
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.000224 ...... PASSED
================================================================================

4.1 Metrics

  1. Time: Total execution time (90.14 s).
  2. Total Gflops: System performance.
    • 1.364e+05 Gflops = 136.4 TFLOPS (see the efficiency sketch below).
  3. Residual Check: Accuracy validation.
    • Must be PASSED.
    • The scaled residual should be small, well below HPL's default threshold of 16.0 (here 0.000224).
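
To put the Gflops figure in context, compare it against the machine's theoretical peak. A rough sketch (the 19.5 TFLOPS FP64 Tensor Core peak per GPU assumed below corresponds to A100/A800-class hardware; substitute the figure for your GPUs):

bash
# HPL efficiency = measured Gflops / theoretical peak Gflops
awk 'BEGIN { printf "%.1f%%\n", 1.364e5 / (8 * 19.5e3) * 100 }'   # ~87.4%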

5. Optimization Tips

  1. NUMA Affinity: While --bind-to none is safe, for peak performance use a wrapper script to bind each MPI rank to the CPU cores and memory of the NUMA domain local to its assigned GPU (a sketch follows below).
  2. Multi-Node: Requires a hostfile and SSH trust (or Slurm/K8s); see the two-node sketch in Section 3.2. --network host (plus InfiniBand device access and unlimited memlock) is non-negotiable for multi-node RDMA performance; keep --ipc=host for fast intra-node communication.
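
A minimal sketch of such a binding wrapper is shown below. It assumes OpenMPI (which exports OMPI_COMM_WORLD_LOCAL_RANK), that numactl is available inside the container, and an illustrative GPU-to-NUMA mapping; read the real topology from nvidia-smi topo -m and adjust the map. GPU selection itself is left to the benchmark's own launcher.

bash
#!/bin/bash
# bind.sh -- hypothetical per-rank NUMA binding wrapper. Launch it in place of
# hpl.sh, e.g.:
#   mpirun --bind-to none -np 8 -npernode 8 \
#     ./bind.sh /workspace/hpl.sh --dat /workspace/HPL-8GPUs.dat
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

# Assumed topology: GPUs 0-3 attached to NUMA node 0, GPUs 4-7 to NUMA node 1.
NUMA_MAP=(0 0 0 0 1 1 1 1)
NODE=${NUMA_MAP[$LOCAL_RANK]}

# Pin this rank's CPU threads and memory allocations to its GPU's NUMA domain.
exec numactl --cpunodebind="$NODE" --membind="$NODE" "$@"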

AI-HPC Organization