
CPU Linpack Performance Benchmarking Guide (HPL)

Abstract: High Performance Linpack (HPL) is the de facto international standard for evaluating the floating-point performance of high-performance computing systems, and it is the benchmark used to rank the TOP500 list. This guide details the HPL testing process, parameter configuration, and optimization methods in an Intel oneAPI (MKL) environment.

1. Overview and Preparation

1.1 Introduction to HPL

HPL measures the floating-point computing capability of a system by solving a random dense linear system using LU factorization with row partial pivoting.

  • Official Benchmark: Netlib HPL
  • Industry Standard: It is the core metric for HPC project acceptance and delivery.

1.2 Environment Requirements

Before testing, ensure the following infrastructure is ready:

  • System: OS installed correctly, Passwordless SSH configured between nodes.
  • Storage & Network: NFS shared directory mounted, NIS service active (if needed), IB/RoCE drivers loaded.
  • Software: Intel OneAPI (including MKL and MPI) installed in the shared directory.

Loading Environment Variables

Newer versions of Intel OneAPI require only one line to load the entire toolchain:

bash
source /opt/intel202*/oneapi/setvars.sh intel64 --force
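A quick sanity check confirms the environment actually loaded; `MKLROOT` and `mpirun` are both set up by `setvars.sh`, so if either is missing, the source step failed:

```bash
# Verify the toolchain is on PATH after sourcing setvars.sh.
command -v mpirun >/dev/null && mpirun -V | head -n 1
if [ -n "$MKLROOT" ]; then echo "MKLROOT=$MKLROOT"; else echo "MKLROOT not set"; fi
```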

1.3 Critical BIOS Settings (Intel Xeon)

To achieve stable peak performance, configure the BIOS as follows:

  • CPU Power Policy: Performance
  • Intel Turbo Boost: Enabled
  • Hyper-Threading: Off (HPL generally performs better on physical cores).

1.4 Test File Preparation

The test binaries are typically located in the Intel MKL installation directory. It is recommended to create a dedicated directory (e.g., /home/benchmark/hpl) and copy the following files:

  • Source path example: $MKLROOT/benchmarks/mp_linpack/
  • Required Files:
    • xhpl_intel64_dynamic: Dynamically linked executable.
    • xhpl_intel64_static: Statically linked executable.
    • runme_*: Helper scripts.
    • HPL.dat: Core parameter configuration file.
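The copy step can be sketched as below; this is a dry run (each command is printed, not executed), and the fallback path `/opt/intel/oneapi/mkl/latest` is only a common default, so replace `echo` with the real `cp` once your paths are confirmed:

```bash
# Stage the HPL files into the working directory (dry run).
SRC="${MKLROOT:-/opt/intel/oneapi/mkl/latest}/benchmarks/mp_linpack"
DEST=/home/benchmark/hpl
echo "mkdir -p $DEST"
for f in xhpl_intel64_dynamic HPL.dat runme_intel64_dynamic; do
  echo "cp $SRC/$f $DEST/"
done
```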

2. HPL.dat Core Configuration

The HPL.dat file defines the problem size and grid layout, which are critical for the score.

  • N (Problem Size): Matrix dimension.
    Recommended: $N = \sqrt{\frac{\text{Total Memory (bytes)} \times \text{Usage}(0.9)}{8}} \times 0.9$
    (8 bytes per double-precision element; the 0.9 factors reserve memory to prevent OOM).
  • NB (Block Size): Partitioning block size. Recommended: 384 (Skylake and newer).
  • P, Q (Process Grid): $P \times Q = \text{Total MPI Processes}$.
    Usually P < Q, and P should ideally be a power of 2.

P x Q Configuration Example

Assuming 8 nodes, running 1 MPI process per node:

  • Total Processes = 8
  • Recommendation: P=2, Q=4 (or P=1, Q=8)
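The N formula above can be worked through with a small helper; the 256 GiB total memory here is an assumed example value, and N is rounded down to a multiple of NB, which is common practice:

```bash
# Estimate N from total memory: N = sqrt(mem_bytes * 0.9 / 8) * 0.9,
# rounded down to a multiple of NB. MEM_GIB is an illustrative value.
MEM_GIB=256
NB=384
N=$(awk -v m="$MEM_GIB" -v nb="$NB" 'BEGIN {
  n = sqrt(m * 1024^3 * 0.9 / 8) * 0.9   # 8 bytes per double; 0.9 reserves memory
  printf "%d", int(n / nb) * nb          # keep N a multiple of NB
}')
echo "Suggested N: $N"
```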

3. Testing Steps

3.1 Pre-requisite: Lock Performance Mode

Execute on all compute nodes:

bash
cpupower -c all frequency-set -g performance
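With passwordless SSH in place (section 1.2), the same command can be pushed to every node; a dry-run sketch with placeholder hostnames:

```bash
# Dry run: print the per-node command. Swap `echo` for `ssh "$node"` to apply
# it for real; node01/node02 are placeholder hostnames.
for node in node01 node02; do
  echo ssh "$node" cpupower -c all frequency-set -g performance
done
```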

3.2 Scenario A: One Process Per Node (MPI + OpenMP)

This mode is suitable for large-scale tests because it minimizes inter-process communication overhead: each node runs a single xhpl process that uses all cores via OpenMP threading.

  1. Edit HPL.dat: Set N, NB, P, Q based on the node count.
  2. Create hostfile: List all participating node hostnames.
  3. Run Command:

InfiniBand / RoCE Network:

bash
# -ppn 1 means 1 process per node
mpirun -genv I_MPI_FABRICS shm:ofi \
       -genv FI_PROVIDER mlx \
       -machinefile hostfile \
       -np <Total Nodes> -ppn 1 ./xhpl_intel64_dynamic

Ethernet (TCP):

bash
mpirun -genv I_MPI_FABRICS shm:ofi \
       -genv FI_PROVIDER tcp \
       -machinefile hostfile \
       -np <Total Nodes> -ppn 1 ./xhpl_intel64_dynamic
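The file passed to -machinefile is a plain list of hostnames, one per line; an example for the 8-node case (names are placeholders):

```bash
# Create a hostfile for an 8-node run; replace with your real hostnames.
cat > hostfile <<'EOF'
node01
node02
node03
node04
node05
node06
node07
node08
EOF
wc -l hostfile   # should report 8 nodes listed
```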

3.3 Scenario B: Multiple Processes Per Node (Runme Script)

If you need to manually control processes per NUMA node, use the runme_intel64_dynamic script.

  1. Edit HPL.dat: Ensure $P \times Q = \text{Nodes} \times \text{Processes Per Node}$.
  2. Edit Script: Modify runme_intel64_dynamic.
    • MPI_PROC_NUM: Total processes.
    • MPI_PER_NODE: Processes per node (e.g., 2).
  3. Execute:
    bash
    ./runme_intel64_dynamic
    Results will be written to HPL.out.
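Once the run finishes, the result row can be pulled out of HPL.out; HPL's summary line begins with the test-variant name (e.g. WR11C2R4) and carries N, NB, P, Q, wall time, and GFlops:

```bash
# Extract the result line(s); variant names start with "W", so a simple grep
# suffices. Guarded so the command is safe before a run exists.
if [ -f HPL.out ]; then
  grep -E '^W[RC]' HPL.out
else
  echo "HPL.out not found - run the benchmark first"
fi
```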

4. Fat Node (4-way/8-way) Considerations

For Fat Nodes with 4 or 8 CPU sockets, it is recommended to run one MPI process per CPU socket to optimize memory access (NUMA affinity).

  • Strategy:
    • P x Q: Equals Node Count × CPU Sockets per Node.
    • MPI_PER_NODE: Set to the number of sockets per node (e.g., 4 for a 4-way server).
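For a concrete (illustrative) case of 2 nodes with 4 sockets each, the runme script variables from section 3.3 would be set as follows:

```bash
# Illustrative values for 2 nodes x 4 sockets = 8 MPI processes total.
export MPI_PROC_NUM=8   # nodes × sockets per node
export MPI_PER_NODE=4   # one MPI process per socket (NUMA affinity)
echo "nodes = $((MPI_PROC_NUM / MPI_PER_NODE))"
```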

5. Analysis and Optimization

5.1 Theoretical Peak (Rpeak) Calculation

$$ \text{Rpeak (GFlops, per socket)} = \text{Frequency (GHz)} \times \text{Cores} \times \text{FLOPs/Cycle} $$

Example (Intel Xeon Gold 6126):

  • Frequency: 2.6 GHz
  • Cores: 12
  • Instruction Set: AVX-512 (32 DP FLOPs per cycle)
  • Single CPU Rpeak = $2.6 \times 12 \times 32 = 998.4 \text{ GFlops}$
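The arithmetic above can be reproduced in one line:

```bash
# Rpeak for the Xeon Gold 6126 example: 2.6 GHz × 12 cores × 32 DP FLOPs/cycle.
awk 'BEGIN { printf "%.1f GFlops\n", 2.6 * 12 * 32 }'   # → 998.4 GFlops
```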

5.2 Optimization Flags

Intel MKL automatically detects CPU architecture, but sometimes forcing AVX-512 is necessary for maximum performance.

bash
# Force usage of AVX-512 instructions (Skylake/Cascade Lake/Ice Lake/Sapphire Rapids)
export MKL_ENABLE_INSTRUCTIONS=AVX512

Real-time Monitoring

During the test, use turbostat in a separate terminal to monitor CPU frequency and ensure no throttling occurs:

bash
turbostat --interval 1

AI-HPC Organization