NVIDIA GPU HPL Benchmarking Guide (Docker)
Abstract: Running HPL (High-Performance Linpack) in a containerized environment is standard practice for evaluating GPU cluster performance. This guide, based on the nvcr.io/nvidia/hpc-benchmarks image, covers Docker deployment, execution commands, and parameter tuning.
1. Overview
Using the official NVIDIA optimized container offers several advantages over bare-metal compilation:
- Consistency: Pre-built with highly optimized CUDA, NCCL, OpenMPI, and MKL.
- Ease of Use: No need to manually compile math libraries or resolve dependencies.
- Performance: Specifically optimized for Tensor Cores.
Image: nvcr.io/nvidia/hpc-benchmarks:25.02 (or latest).
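To fetch the image, pull it from NGC. Depending on your registry setup, a prior docker login nvcr.io with an NGC API key may be required; the login step below is shown commented out for that reason:
```bash
# Log in first only if your NGC setup requires it:
# docker login nvcr.io
docker pull nvcr.io/nvidia/hpc-benchmarks:25.02
```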
2. Docker Configuration
To unlock maximum hardware performance, specific privileges and resource mappings are required.
2.1 Run Command
```bash
docker run -it --gpus all --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    --privileged=true --network host \
    --shm-size=20G \
    -v /home/hpl-data:/workspace \
    nvcr.io/nvidia/hpc-benchmarks:25.02
```
2.2 Key Parameters
| Parameter | Purpose | Why? |
|---|---|---|
| --gpus all | Pass through GPUs | Allows the container to access all host GPUs. |
| --ipc=host | Share host IPC | Critical: enables low-latency IPC (e.g., NVSHMEM) for multi-GPU communication. |
| --ulimit memlock=-1 | Unlimited memlock | Critical: allows pinned memory, required for InfiniBand RDMA. |
| --ulimit stack=... | Increase stack | Prevents stack overflow during large matrix factorization. |
| --privileged=true | Privileged mode | Allows access to InfiniBand devices and driver interfaces. |
| --network host | Host network | Bypasses the Docker bridge for lowest latency (crucial for IB). |
| --shm-size=20G | Shared memory | Sufficient space for MPI buffers and large matrix computation. |
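Once the container is up, a few quick checks confirm that the flags above took effect. The commands below are a minimal sanity sketch using standard tools (nvidia-smi, ulimit, ls, df); the InfiniBand check only looks for the host's device nodes and prints a note if none are visible:
```bash
# All host GPUs should be listed:
nvidia-smi -L

# memlock should report "unlimited" so pinned memory / RDMA registration works:
ulimit -l

# With --privileged, the host's InfiniBand device nodes should be visible:
ls /dev/infiniband 2>/dev/null || echo "no InfiniBand device nodes visible"

# Shared memory should match --shm-size:
df -h /dev/shm
```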
3. Running HPL
Inside the container, mpirun is typically used to invoke the hpl.sh wrapper script.
3.1 Execution (Single Node 8-GPU)
```bash
mpirun --bind-to none -np 8 \
    -npernode 8 \
    -x LD_LIBRARY_PATH \
    /workspace/hpl.sh --dat /workspace/HPL-8GPUs.dat
```
3.2 Arguments
- --bind-to none: Important. Prevents MPI from forcefully binding processes to CPU cores; in containers, OS scheduling or manual affinity control is often safer.
- -np 8: Total number of processes (8 GPUs).
- -npernode 8: Processes per node (1 process per GPU).
- -x LD_LIBRARY_PATH: Explicitly passes the environment variable to every rank so the libraries are found.
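Before launching a long HPL run, it is worth a quick smoke test that MPI can spawn the expected ranks. The check below is a minimal sketch; note that if the container runs as root (the Docker default), stock OpenMPI refuses to launch unless --allow-run-as-root is added or the OMPI_ALLOW_RUN_AS_ROOT variables are set, though some NVIDIA images already set them:
```bash
# Launch 8 trivial ranks; each should print the node's hostname.
# Add --allow-run-as-root if OpenMPI complains about running as root.
mpirun --bind-to none -np 8 -npernode 8 hostname
```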
3.3 Configuration (HPL.dat)
HPL.dat defines the problem size. Key values for 8x A800/H800 (a full sample file follows this list):
- N (Matrix Order): 264192 (targets 80-90% of total GPU memory). Each double-precision element occupies 8 bytes, so a common sizing rule is N ≈ sqrt(0.85 × total GPU memory in bytes / 8), rounded to a multiple of NB; for 8 x 80 GB GPUs this lands near the value above.
- NB (Block Size): 1024 (optimized for Tensor Cores).
- P x Q (Process Grid): 4 x 2 (choose a grid as close to square as possible).
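A complete HPL.dat built around those values might look like the sketch below. The problem-size lines reflect the figures above, while the algorithmic knobs (PFACT, BCAST, SWAP, and so on) are left at common defaults and are only a starting point; the image typically ships sample .dat files that are worth adapting rather than writing one from scratch:
```text
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
264192       Ns
1            # of NBs
1024         NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```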
4. Result Analysis
Look for the summary in the standard output:
```text
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4      264192  1024     4     2              90.14              1.364e+05
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.000224 ...... PASSED
================================================================================
```
4.1 Metrics
- Time: Total execution time (90.14 s).
- Gflops: Aggregate system performance; 1.364e+05 Gflops = 136.4 TFLOPS.
- Residual Check: Accuracy validation.
  - Must report PASSED.
  - The scaled residual must fall below the threshold set in HPL.dat (16.0 by default); correct runs typically report values far smaller than 1.0, as above.
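To judge whether a result is good, compare the measured Gflops against the theoretical FP64 peak of the GPUs. The sketch below assumes 8x A800 with roughly 19.5 TFLOPS of FP64 Tensor Core peak per GPU; substitute the figure for your actual GPU model:
```bash
# Rough efficiency estimate: measured Rmax vs. theoretical Rpeak.
# The 19.5 TFLOPS per-GPU FP64 peak is an assumption for A800; adjust as needed.
awk 'BEGIN { rmax = 136.4; rpeak = 8 * 19.5; printf "Efficiency: %.1f%%\n", 100 * rmax / rpeak }'
# -> Efficiency: 87.4%
```
Well-tuned single-node GPU runs commonly land in the 80-90% range; markedly lower figures usually point to affinity, clock, or problem-size issues.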
5. Optimization Tips
- NUMA Affinity: While --bind-to none is safe, for peak performance use a wrapper script to bind each MPI rank to the CPU cores and memory of the NUMA domain local to its assigned GPU (see the sketch after this list).
- Multi-Node: Requires a hostfile and passwordless SSH between nodes (or a scheduler such as Slurm/Kubernetes). --network host and --ipc=host are non-negotiable for multi-node RDMA performance.
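A minimal binding sketch, assuming numactl is available in the image and a typical 2-socket node where GPUs 0-3 sit on NUMA node 0 and GPUs 4-7 on NUMA node 1 (verify the real mapping with nvidia-smi topo -m). The script name bind_rank.sh and the GPU-to-NUMA split are illustrative, not part of the container:
```bash
#!/bin/bash
# bind_rank.sh -- illustrative per-rank NUMA binding wrapper.
# Assumed topology: GPUs 0-3 on NUMA node 0, GPUs 4-7 on NUMA node 1;
# check `nvidia-smi topo -m` and adjust for your system.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

if [ "$LOCAL_RANK" -lt 4 ]; then
    NUMA_NODE=0
else
    NUMA_NODE=1
fi

# Pin this rank's CPU threads and memory allocations to the local NUMA node,
# then hand off to the HPL wrapper (GPU selection is left to hpl.sh).
exec numactl --cpunodebind="$NUMA_NODE" --membind="$NUMA_NODE" \
    /workspace/hpl.sh --dat /workspace/HPL-8GPUs.dat
```
Launch it in place of hpl.sh:
```bash
chmod +x bind_rank.sh
mpirun --bind-to none -np 8 -npernode 8 -x LD_LIBRARY_PATH ./bind_rank.sh
```
For multi-node runs the same launch applies, with --hostfile pointing at the node list and -np scaled to the total GPU count.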
