AMD CPU Linpack (HPL) Performance Guide
Abstract: This guide covers HPL benchmarking for High Performance Computing clusters built on AMD EPYC (e.g., Genoa/Zen 4) processors. Compared with Intel platforms, AMD EPYC systems call for different best practices around MPI process binding, Block Size (NB) selection, and OpenMP thread configuration.
1. Overview
High Performance Linpack (HPL) is the official benchmark for the TOP500 supercomputer list. For AMD platforms, using optimized math libraries (like AOCL or specifically compiled HPL binaries) coupled with correct NUMA binding strategies is critical to unlocking hardware performance.
2. Preparation
2.1 System Requirements
- OS: Recommended CentOS 7.9, Rocky Linux 8/9, or Ubuntu 22.04.
- Network: Passwordless SSH between nodes, InfiniBand/RoCE drivers (Mellanox OFED) installed.
- Filesystem: Test binaries should be located on a shared filesystem (NFS/Lustre).
- BIOS Settings (AMD Recommended):
- Determinism Slider: Power Determinism (or Max Performance).
- cTDP / P-State: Max Performance.
- C-States: Disabled.
- SMT (Simultaneous Multithreading): Disabled (HPL generally runs more consistently on physical cores only).
- NUMA Nodes per Socket (NPS): NPS4 or NPS1 (depending on memory interleaving; NPS4 usually offers lower local-memory latency). These settings can be verified from the OS, as shown in the sketch below.
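Before launching a run, it can help to confirm that the BIOS settings actually took effect. The following is a minimal sketch, assuming `lscpu` and `numactl` are installed; adjust the expected NUMA node count to your NPS choice.

```bash
#!/bin/bash
# Spot-check BIOS-related settings from the OS side.

# SMT should report "off" (or "notsupported") when disabled in BIOS.
# (Available on recent kernels; otherwise rely on the lscpu output below.)
cat /sys/devices/system/cpu/smt/control

# With SMT disabled, expect "Thread(s) per core: 1".
lscpu | grep -E 'Thread\(s\) per core|Socket\(s\)|NUMA node\(s\)'

# Dual-socket with NPS4 -> 8 NUMA nodes; NPS1 -> 2 NUMA nodes.
numactl --hardware | grep available
```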
2.2 Dependencies
- MPI: OpenMPI ≥ 4.0.3 (Recommended) or MPICH.
- Tools:
  - psutil (Python library for monitoring): install with `pip install psutil` (see the check below).
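The sketch below is one way to confirm the dependencies are in place; the OpenMPI prefix is the same placeholder path used in the run scripts later in this guide.

```bash
#!/bin/bash
# Confirm MPI, psutil, and InfiniBand user tools are available.
/path/to/openmpi/bin/mpirun --version            # expect Open MPI >= 4.0.3
python3 -c 'import psutil; print("psutil", psutil.__version__)'
which ibstat                                     # present once Mellanox OFED is installed
```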
3. HPL.dat Configuration (AMD Specific)
While similar to general configs, some parameters have recommended sweet spots for Zen architectures.
| Parameter | Description | AMD Zen 3/4 Recommendation |
|---|---|---|
| N (Problem Size) | Matrix dimension | $N \approx \sqrt{\frac{\text{Total Mem (bytes)} \times 0.9}{8}}$, rounded down to a multiple of NB (worked example below) |
| NB (Block Size) | Blocking factor for the LU factorization | 384 (sweet spot for Milan/Genoa) |
| P, Q | Process grid dimensions | $P \times Q = \text{Total Sockets}$, i.e., one MPI process per socket; keep $P \le Q$ and as close to square as possible |
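As a worked example of the N formula, the sketch below derives a problem size from the installed memory and rounds it down to a multiple of NB. The 0.9 factor and NB=384 come from the table; the memory-per-node figure is illustrative only.

```bash
#!/bin/bash
# Compute an HPL problem size N from total cluster memory.
NODES=1
MEM_PER_NODE_GIB=768          # illustrative: e.g., 24 x 32 GiB DDR5 DIMMs per node
NB=384

TOTAL_BYTES=$(( NODES * MEM_PER_NODE_GIB * 1024 * 1024 * 1024 ))

# N ~= sqrt(0.9 * TotalMem / 8), rounded down to a multiple of NB
N=$(awk -v b="$TOTAL_BYTES" -v nb="$NB" 'BEGIN {
    n = sqrt(0.9 * b / 8);
    print int(n / nb) * nb;
}')
echo "Suggested N = $N"
```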
Process Model Differences
On Intel platforms, one common approach is to run one MPI process per NUMA node. On AMD platforms, the best practice is typically one MPI process per physical CPU socket, with as many OpenMP threads as that socket has cores.
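Putting the table and the per-socket process model together, a single dual-socket node (2 MPI processes, so P=1, Q=2) could use an HPL.dat along the lines of the sketch below. It follows the layout of the stock HPL 2.3 input file; N is the illustrative value computed above (~768 GiB of memory), and the algorithmic tuning lines are left at common defaults rather than being presented as a definitive AMD tuning.

```bash
cat > HPL.dat <<'EOF'
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
304512       Ns
1            # of NBs
384          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
EOF
```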
4. Single Node Test
Assuming a dual-socket AMD EPYC Genoa server (96 cores per socket), totaling 192 cores.
4.1 Running Script (run_single.sh)
Using OpenMPI for precise core binding.
```bash
#!/bin/bash
# Assuming use of a static HPL binary optimized for AMD
# MPI Flags:
# --map-by socket:PE=96 -> 1 MPI proc per Socket, each spanning 96 Slots (Cores)
# -np 2 -> Total 2 MPI processes (Dual Socket)
# --bind-to core -> Strict binding to prevent drift
/path/to/openmpi/bin/mpirun \
--map-by socket:PE=96 \
-np 2 \
--bind-to core \
--allow-run-as-root \
-x OMP_NUM_THREADS=96 \
-x OMP_PROC_BIND=spread \
-x OMP_PLACES=cores \
./xhpl-amd-static
```
4.2 Theoretical Peak (Rpeak)
AMD EPYC Zen 4 cores support AVX-512 via a double-pumped 256-bit datapath, delivering 16 double-precision floating-point operations per core per clock cycle (2 FMA pipes × 4 FP64 lanes × 2 ops per FMA).
$$ \text{Rpeak} = \text{Nodes} \times \text{Sockets} \times \text{Cores} \times \text{Freq} \times 16 $$
Example (Dual EPYC 9654, 2.4GHz): $$ Rpeak = 1 \times 2 \times 96 \times 2.4 \times 16 = 7372.8 \text{ GFlops} $$
- Target Efficiency: a well-tuned single-node run should reach roughly 95%~98% of Rpeak (the sketch below computes the efficiency from the HPL log).
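To compare the measured result against Rpeak, the Gflops figure can be taken from the final HPL result row (the line beginning with `WR`/`WC`). A minimal sketch, assuming the output went to `HPL.out` and using the Rpeak value from the example above:

```bash
#!/bin/bash
# Compute HPL efficiency = Rmax / Rpeak for the single-node example above.
RPEAK=7372.8                                   # GFlops, dual EPYC 9654 @ 2.4 GHz base clock
# The last column of the HPL result row is the measured Gflops.
RMAX=$(grep -E '^W[RC]' HPL.out | awk '{print $NF}' | tail -1)
awk -v rmax="$RMAX" -v rpeak="$RPEAK" \
    'BEGIN { printf "Rmax = %s GFlops, efficiency = %.1f%%\n", rmax, 100 * rmax / rpeak }'
```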
5. Multi-Node Test
5.1 Hostfile Preparation
Create a hostfile listing all participating hostnames.
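For example, for ten hypothetical nodes named `node01` through `node10`, the hostfile could be generated as below; `slots=2` matches the two MPI processes per node used in the script that follows.

```bash
#!/bin/bash
# Generate an OpenMPI hostfile for 10 nodes with 2 MPI slots (sockets) each.
# Hostnames are placeholders; substitute the real node names of your cluster.
for i in $(seq -w 1 10); do
    echo "node${i} slots=2"
done > hostfile
cat hostfile
```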
5.2 Running Script (run_cluster.sh)
Example for 10 nodes (20 Sockets total).
```bash
#!/bin/bash
# N: Set to value calculated for total cluster memory
# P x Q = 20 (e.g., P=4, Q=5)
/path/to/openmpi/bin/mpirun \
--map-by socket:PE=96 \
-np 20 \
-N 2 \
-hostfile hostfile \
--bind-to core \
--allow-run-as-root \
-x OMP_NUM_THREADS=96 \
-x OMP_PROC_BIND=spread \
-x OMP_PLACES=cores \
./xhpl-amd-static
```
- `-np 20`: total MPI processes (10 nodes × 2 sockets).
- `-N 2`: run 2 MPI processes per node.
5.3 Analysis
If multi-node efficiency drops significantly (e.g., below 85%):
- Network: Check InfiniBand status (`ibstat`) to confirm links run at full speed (e.g., NDR 400 Gbps); a per-node check is sketched below.
- Topology: Verify `P` and `Q` in `HPL.dat`.
- Jitter: Use `top` to check for OS noise.
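One way to run that network check across all nodes at once, reusing the hostfile from section 5.1 (hostnames there are placeholders):

```bash
#!/bin/bash
# Check InfiniBand port state and rate on every node listed in the hostfile.
for host in $(awk '{print $1}' hostfile); do
    echo "== ${host} =="
    ssh "${host}" "ibstat | grep -E 'State|Rate'"
done
```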
6. Best Practices
- Compiler: Recommend AOCC (AMD Optimizing C/C++ Compiler) or GCC + BLIS/FLAME libraries.
- Memory Bandwidth: HPL performance is sensitive to memory bandwidth in addition to raw compute. Ensure all memory channels are populated and running at their maximum rated frequency (e.g., DDR5-4800 on Genoa).
- Validation: Always check the residual test in the log: a valid run must report `PASSED` for every test, meaning the scaled residual printed by HPL is below the threshold set in `HPL.dat` (16.0 by default). A minimal post-run check is sketched below.
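The sketch assumes the HPL output was written to `HPL.out` (per the `device out` setting in `HPL.dat`):

```bash
#!/bin/bash
# Validate an HPL run: every residual check must report PASSED.
grep -E 'PASSED|FAILED' HPL.out
if grep -q FAILED HPL.out; then
    echo "ERROR: residual check failed - result is not valid" >&2
    exit 1
fi
echo "All residual checks passed."
```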
