AMD CPU Linpack (HPL) Performance Guide
Abstract: This guide covers HPL benchmarking for High Performance Computing clusters built on AMD EPYC (e.g., Genoa/Zen 4) processors. Compared with Intel platforms, AMD EPYC systems call for different best practices around MPI process binding, Block Size (NB) selection, and OpenMP thread configuration.
1. Overview
High Performance Linpack (HPL) is the official benchmark for the TOP500 supercomputer list. For AMD platforms, using optimized math libraries (like AOCL or specifically compiled HPL binaries) coupled with correct NUMA binding strategies is critical to unlocking hardware performance.
2. Preparation
2.1 System Requirements
- OS: Recommended CentOS 7.9, Rocky Linux 8/9, or Ubuntu 22.04.
- Network: Passwordless SSH between nodes, InfiniBand/RoCE drivers (Mellanox OFED) installed.
- Filesystem: Test binaries should be located on a shared filesystem (NFS/Lustre).
- BIOS Settings (AMD Recommended):
- Determinism Slider: Power Determinism (or Max Performance).
- cTDP / P-State: Max Performance.
- C-States: Disabled.
- SMT (Simultaneous Multithreading): Disabled (HPL generally runs more consistently on physical cores only).
- NUMA Nodes per Socket (NPS): NPS4 or NPS1 (depending on memory interleaving; NPS4 usually offers lower local-memory latency). These settings can be verified from the OS, as shown in the sketch below.
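Before launching a run, it can help to confirm that the BIOS settings actually took effect. The following is a minimal sketch, assuming `lscpu` and `numactl` are installed; adjust the expected NUMA node count to your NPS choice.

```bash
#!/bin/bash
# Spot-check BIOS-related settings from the OS side.

# SMT should report "off" (or "notsupported") when disabled in BIOS.
# (Available on recent kernels; otherwise rely on the lscpu output below.)
cat /sys/devices/system/cpu/smt/control

# With SMT disabled, expect "Thread(s) per core: 1".
lscpu | grep -E 'Thread\(s\) per core|Socket\(s\)|NUMA node\(s\)'

# Dual-socket with NPS4 -> 8 NUMA nodes; NPS1 -> 2 NUMA nodes.
numactl --hardware | grep available
```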
2.2 Dependencies
- MPI: OpenMPI ≥ 4.0.3 (Recommended) or MPICH.
- Tools:
  - psutil (Python library for monitoring): install with `pip install psutil` (see the check below).
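The sketch below is one way to confirm the dependencies are in place; the OpenMPI prefix is the same placeholder path used in the run scripts later in this guide.

```bash
#!/bin/bash
# Confirm MPI, psutil, and InfiniBand user tools are available.
/path/to/openmpi/bin/mpirun --version            # expect Open MPI >= 4.0.3
python3 -c 'import psutil; print("psutil", psutil.__version__)'
which ibstat                                     # present once Mellanox OFED is installed
```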
3. HPL.dat Configuration (AMD Specific)
While similar to general configs, some parameters have recommended sweet spots for Zen architectures.
| Parameter | Description | AMD Zen 3/4 Recommendation |
|---|---|---|
| N (Problem Size) | Matrix dimension | $N \approx \sqrt{\frac{\text{Total Mem (bytes)} \times 0.9}{8}}$, rounded down to a multiple of NB (worked example below) |
| NB (Block Size) | Blocking factor for the LU factorization | 384 (sweet spot for Milan/Genoa) |
| P, Q | Process grid dimensions | $P \times Q = \text{Total Sockets}$, i.e., one MPI process per socket; keep $P \le Q$ and as close to square as possible |
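As a worked example of the N formula, the sketch below derives a problem size from the installed memory and rounds it down to a multiple of NB. The 0.9 factor and NB=384 come from the table; the memory-per-node figure is illustrative only.

```bash
#!/bin/bash
# Compute an HPL problem size N from total cluster memory.
NODES=1
MEM_PER_NODE_GIB=768          # illustrative: e.g., 24 x 32 GiB DDR5 DIMMs per node
NB=384

TOTAL_BYTES=$(( NODES * MEM_PER_NODE_GIB * 1024 * 1024 * 1024 ))

# N ~= sqrt(0.9 * TotalMem / 8), rounded down to a multiple of NB
N=$(awk -v b="$TOTAL_BYTES" -v nb="$NB" 'BEGIN {
    n = sqrt(0.9 * b / 8);
    print int(n / nb) * nb;
}')
echo "Suggested N = $N"
```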
Process Model Differences
On Intel platforms, one common approach is to run one MPI process per NUMA node. On AMD platforms, the best practice is typically one MPI process per physical CPU socket, with as many OpenMP threads as that socket has cores.
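Putting the table and the per-socket process model together, a single dual-socket node (2 MPI processes, so P=1, Q=2) could use an HPL.dat along the lines of the sketch below. It follows the layout of the stock HPL 2.3 input file; N is the illustrative value computed above (~768 GiB of memory), and the algorithmic tuning lines are left at common defaults rather than being presented as a definitive AMD tuning.

```bash
cat > HPL.dat <<'EOF'
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
304512       Ns
1            # of NBs
384          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
EOF
```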
4. Single Node Test
Assuming a dual-socket AMD EPYC Genoa server (96 cores per socket), totaling 192 cores.
4.1 Running Script (run_single.sh)
Using OpenMPI for precise core binding.
```bash
#!/bin/bash
# Assuming use of a static HPL binary optimized for AMD
# MPI Flags:
# --map-by socket:PE=96 -> 1 MPI proc per Socket, each spanning 96 Slots (Cores)
# -np 2 -> Total 2 MPI processes (Dual Socket)
# --bind-to core -> Strict binding to prevent drift
/path/to/openmpi/bin/mpirun \
--map-by socket:PE=96 \
-np 2 \
--bind-to core \
--allow-run-as-root \
-x OMP_NUM_THREADS=96 \
-x OMP_PROC_BIND=spread \
-x OMP_PLACES=cores \
./xhpl-amd-static
```
4.2 Theoretical Peak (Rpeak)
AMD EPYC Zen 4 cores support AVX-512 via a double-pumped 256-bit datapath, delivering 16 double-precision floating-point operations per core per clock cycle (2 FMA pipes × 4 FP64 lanes × 2 ops per FMA).
$$ \text{Rpeak} = \text{Nodes} \times \text{Sockets} \times \text{Cores} \times \text{Freq} \times 16 $$
Example (Dual EPYC 9654, 2.4GHz): $$ Rpeak = 1 \times 2 \times 96 \times 2.4 \times 16 = 7372.8 \text{ GFlops} $$
- Target Efficiency: a well-tuned single-node run should reach roughly 95%~98% of Rpeak (the sketch below computes the efficiency from the HPL log).
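To compare the measured result against Rpeak, the Gflops figure can be taken from the final HPL result row (the line beginning with `WR`/`WC`). A minimal sketch, assuming the output went to `HPL.out` and using the Rpeak value from the example above:

```bash
#!/bin/bash
# Compute HPL efficiency = Rmax / Rpeak for the single-node example above.
RPEAK=7372.8                                   # GFlops, dual EPYC 9654 @ 2.4 GHz base clock
# The last column of the HPL result row is the measured Gflops.
RMAX=$(grep -E '^W[RC]' HPL.out | awk '{print $NF}' | tail -1)
awk -v rmax="$RMAX" -v rpeak="$RPEAK" \
    'BEGIN { printf "Rmax = %s GFlops, efficiency = %.1f%%\n", rmax, 100 * rmax / rpeak }'
```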
5. Multi-Node Test
5.1 Hostfile Preparation
Create a hostfile listing all participating hostnames.
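For example, for ten hypothetical nodes named `node01` through `node10`, the hostfile could be generated as below; `slots=2` matches the two MPI processes per node used in the script that follows.

```bash
#!/bin/bash
# Generate an OpenMPI hostfile for 10 nodes with 2 MPI slots (sockets) each.
# Hostnames are placeholders; substitute the real node names of your cluster.
for i in $(seq -w 1 10); do
    echo "node${i} slots=2"
done > hostfile
cat hostfile
```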
5.2 Running Script (run_cluster.sh)
Example for 10 nodes (20 Sockets total).
```bash
#!/bin/bash
# N: Set to value calculated for total cluster memory
# P x Q = 20 (e.g., P=4, Q=5)
/path/to/openmpi/bin/mpirun \
--map-by socket:PE=96 \
-np 20 \
-N 2 \
-hostfile hostfile \
--bind-to core \
--allow-run-as-root \
-x OMP_NUM_THREADS=96 \
-x OMP_PROC_BIND=spread \
-x OMP_PLACES=cores \
./xhpl-amd-static
```
- `-np 20`: total MPI processes (10 nodes × 2 sockets).
- `-N 2`: run 2 MPI processes per node.
5.3 Analysis
If multi-node efficiency drops significantly (e.g., below 85%):
- Network: Check InfiniBand status (`ibstat`) to confirm links run at full speed (e.g., NDR 400 Gbps); a per-node check is sketched below.
- Topology: Verify `P` and `Q` in `HPL.dat`.
- Jitter: Use `top` to check for OS noise.
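One way to run that network check across all nodes at once, reusing the hostfile from section 5.1 (hostnames there are placeholders):

```bash
#!/bin/bash
# Check InfiniBand port state and rate on every node listed in the hostfile.
for host in $(awk '{print $1}' hostfile); do
    echo "== ${host} =="
    ssh "${host}" "ibstat | grep -E 'State|Rate'"
done
```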
6. Best Practices
- Compiler: Recommend AOCC (AMD Optimizing C/C++ Compiler) or GCC + BLIS/FLAME libraries.
- Memory Bandwidth: HPL performance is sensitive to memory bandwidth in addition to raw compute. Ensure all memory channels are populated and running at their maximum rated frequency (e.g., DDR5-4800 on Genoa).
- Validation: Always check the residual test in the log: a valid run must report `PASSED` for every test, meaning the scaled residual printed by HPL is below the threshold set in `HPL.dat` (16.0 by default). A minimal post-run check is sketched below.
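The sketch assumes the HPL output was written to `HPL.out` (per the `device out` setting in `HPL.dat`):

```bash
#!/bin/bash
# Validate an HPL run: every residual check must report PASSED.
grep -E 'PASSED|FAILED' HPL.out
if grep -q FAILED HPL.out; then
    echo "ERROR: residual check failed - result is not valid" >&2
    exit 1
fi
echo "All residual checks passed."
```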
