NVIDIA GPU HPL Benchmarking Guide (Docker)
Abstract: Running HPL (High-Performance Linpack) in a containerized environment is standard practice for evaluating GPU cluster performance. This guide, based on the nvcr.io/nvidia/hpc-benchmarks image, covers Docker deployment, execution commands, and parameter tuning.
1. Overview
Using the official NVIDIA optimized container offers several advantages over bare-metal compilation:
- Consistency: Pre-built with highly optimized CUDA, NCCL, OpenMPI, and MKL.
- Ease of Use: No need to manually compile math libraries or resolve dependencies.
- Performance: Specifically optimized for Tensor Cores.
Image: nvcr.io/nvidia/hpc-benchmarks:25.02 (or latest).
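To fetch the image, pull it from NGC. Depending on your registry setup, a prior docker login nvcr.io with an NGC API key may be required; the login step below is shown commented out for that reason:
```bash
# Log in first only if your NGC setup requires it:
# docker login nvcr.io
docker pull nvcr.io/nvidia/hpc-benchmarks:25.02
```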
2. Docker Configuration
To unlock maximum hardware performance, specific privileges and resource mappings are required.
2.1 Run Command
```bash
docker run -it --gpus all --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    --privileged=true --network host \
    --shm-size=20G \
    -v /home/hpl-data:/workspace \
    nvcr.io/nvidia/hpc-benchmarks:25.02
```
2.2 Key Parameters
| Parameter | Purpose | Why? |
|---|---|---|
| --gpus all | Pass through GPUs | Allows the container to access all host GPUs. |
| --ipc=host | Share host IPC | Critical: enables low-latency IPC (e.g., NVSHMEM) for multi-GPU communication. |
| --ulimit memlock=-1 | Unlimited memlock | Critical: allows pinned memory, required for InfiniBand RDMA. |
| --ulimit stack=... | Increase stack | Prevents stack overflow during large matrix factorization. |
| --privileged=true | Privileged mode | Allows access to InfiniBand devices and driver interfaces. |
| --network host | Host network | Bypasses the Docker bridge for lowest latency (crucial for IB). |
| --shm-size=20G | Shared memory | Sufficient space for MPI buffers and large matrix computation. |
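Once the container is up, a few quick checks confirm that the flags above took effect. The commands below are a minimal sanity sketch using standard tools (nvidia-smi, ulimit, ls, df); the InfiniBand check only looks for the host's device nodes and prints a note if none are visible:
```bash
# All host GPUs should be listed:
nvidia-smi -L

# memlock should report "unlimited" so pinned memory / RDMA registration works:
ulimit -l

# With --privileged, the host's InfiniBand device nodes should be visible:
ls /dev/infiniband 2>/dev/null || echo "no InfiniBand device nodes visible"

# Shared memory should match --shm-size:
df -h /dev/shm
```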
3. Running HPL
Inside the container, mpirun is typically used to invoke the hpl.sh wrapper script.
3.1 Execution (Single Node 8-GPU)
```bash
mpirun --bind-to none -np 8 \
    -npernode 8 \
    -x LD_LIBRARY_PATH \
    /workspace/hpl.sh --dat /workspace/HPL-8GPUs.dat
```
3.2 Arguments
- --bind-to none: Important. Prevents MPI from forcefully binding processes to CPU cores; in containers, OS scheduling or manual affinity control is often safer.
- -np 8: Total number of processes (8 GPUs).
- -npernode 8: Processes per node (1 process per GPU).
- -x LD_LIBRARY_PATH: Explicitly passes the environment variable to every rank so the libraries are found.
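Before launching a long HPL run, it is worth a quick smoke test that MPI can spawn the expected ranks. The check below is a minimal sketch; note that if the container runs as root (the Docker default), stock OpenMPI refuses to launch unless --allow-run-as-root is added or the OMPI_ALLOW_RUN_AS_ROOT variables are set, though some NVIDIA images already set them:
```bash
# Launch 8 trivial ranks; each should print the node's hostname.
# Add --allow-run-as-root if OpenMPI complains about running as root.
mpirun --bind-to none -np 8 -npernode 8 hostname
```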
3.3 Configuration (HPL.dat)
HPL.dat defines the problem size. Key values for 8x A800/H800 (a full sample file follows this list):
- N (Matrix Order): 264192 (targets 80-90% of total GPU memory). Each double-precision element occupies 8 bytes, so a common sizing rule is N ≈ sqrt(0.85 × total GPU memory in bytes / 8), rounded to a multiple of NB; for 8 x 80 GB GPUs this lands near the value above.
- NB (Block Size): 1024 (optimized for Tensor Cores).
- P x Q (Process Grid): 4 x 2 (choose a grid as close to square as possible).
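A complete HPL.dat built around those values might look like the sketch below. The problem-size lines reflect the figures above, while the algorithmic knobs (PFACT, BCAST, SWAP, and so on) are left at common defaults and are only a starting point; the image typically ships sample .dat files that are worth adapting rather than writing one from scratch:
```text
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
264192       Ns
1            # of NBs
1024         NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```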
4. Result Analysis
Look for the summary in the standard output:
```text
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4      264192  1024     4     2              90.14              1.364e+05
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.000224 ...... PASSED
================================================================================
```
4.1 Metrics
- Time: Total execution time (90.14 s).
- Gflops: Aggregate system performance; 1.364e+05 Gflops = 136.4 TFLOPS.
- Residual Check: Accuracy validation.
  - Must report PASSED.
  - The scaled residual must fall below the threshold set in HPL.dat (16.0 by default); correct runs typically report values far smaller than 1.0, as above.
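To judge whether a result is good, compare the measured Gflops against the theoretical FP64 peak of the GPUs. The sketch below assumes 8x A800 with roughly 19.5 TFLOPS of FP64 Tensor Core peak per GPU; substitute the figure for your actual GPU model:
```bash
# Rough efficiency estimate: measured Rmax vs. theoretical Rpeak.
# The 19.5 TFLOPS per-GPU FP64 peak is an assumption for A800; adjust as needed.
awk 'BEGIN { rmax = 136.4; rpeak = 8 * 19.5; printf "Efficiency: %.1f%%\n", 100 * rmax / rpeak }'
# -> Efficiency: 87.4%
```
Well-tuned single-node GPU runs commonly land in the 80-90% range; markedly lower figures usually point to affinity, clock, or problem-size issues.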
5. Optimization Tips
- NUMA Affinity: While --bind-to none is safe, for peak performance use a wrapper script to bind each MPI rank to the CPU cores and memory of the NUMA domain local to its assigned GPU (see the sketch after this list).
- Multi-Node: Requires a hostfile and passwordless SSH between nodes (or a scheduler such as Slurm/Kubernetes). --network host and --ipc=host are non-negotiable for multi-node RDMA performance.
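A minimal binding sketch, assuming numactl is available in the image and a typical 2-socket node where GPUs 0-3 sit on NUMA node 0 and GPUs 4-7 on NUMA node 1 (verify the real mapping with nvidia-smi topo -m). The script name bind_rank.sh and the GPU-to-NUMA split are illustrative, not part of the container:
```bash
#!/bin/bash
# bind_rank.sh -- illustrative per-rank NUMA binding wrapper.
# Assumed topology: GPUs 0-3 on NUMA node 0, GPUs 4-7 on NUMA node 1;
# check `nvidia-smi topo -m` and adjust for your system.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

if [ "$LOCAL_RANK" -lt 4 ]; then
    NUMA_NODE=0
else
    NUMA_NODE=1
fi

# Pin this rank's CPU threads and memory allocations to the local NUMA node,
# then hand off to the HPL wrapper (GPU selection is left to hpl.sh).
exec numactl --cpunodebind="$NUMA_NODE" --membind="$NUMA_NODE" \
    /workspace/hpl.sh --dat /workspace/HPL-8GPUs.dat
```
Launch it in place of hpl.sh:
```bash
chmod +x bind_rank.sh
mpirun --bind-to none -np 8 -npernode 8 -x LD_LIBRARY_PATH ./bind_rank.sh
```
For multi-node runs the same launch applies, with --hostfile pointing at the node list and -np scaled to the total GPU count.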
