
CPU Linpack Performance Benchmarking Guide (HPL)

Abstract: High Performance Linpack (HPL) is the de facto international standard for evaluating the floating-point performance of high-performance computing systems, and it is the benchmark used to rank the TOP500 list. This guide details the HPL testing process, parameter configuration, and optimization methods in an Intel oneAPI (MKL) environment.

1. Overview and Preparation

1.1 Introduction to HPL

HPL measures the floating-point computing capability of a system by solving a random dense linear system using LU factorization with row partial pivoting.

  • Official Benchmark: Netlib HPL
  • Industry Standard: It is the core metric for HPC project acceptance and delivery.

1.2 Environment Requirements

Before testing, ensure the following infrastructure is ready:

  • System: OS installed correctly, Passwordless SSH configured between nodes.
  • Storage & Network: NFS shared directory mounted, NIS service active (if needed), IB/RoCE drivers loaded.
  • Software: Intel OneAPI (including MKL and MPI) installed in the shared directory.

Loading Environment Variables

Newer versions of Intel OneAPI require only one line to load the entire toolchain:

bash
source /opt/intel202*/oneapi/setvars.sh intel64 --force
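A quick sanity check confirms the environment actually loaded; `MKLROOT` and `mpirun` are both set up by `setvars.sh`, so if either is missing, the source step failed:

```bash
# Verify the toolchain is on PATH after sourcing setvars.sh.
command -v mpirun >/dev/null && mpirun -V | head -n 1
if [ -n "$MKLROOT" ]; then echo "MKLROOT=$MKLROOT"; else echo "MKLROOT not set"; fi
```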

1.3 Critical BIOS Settings (Intel Xeon)

To achieve stable peak performance, configure the BIOS as follows:

  • CPU Power Policy: Performance
  • Intel Turbo Boost: Enabled
  • Hyper-Threading: Off (HPL generally performs better on physical cores).

1.4 Test File Preparation

The test binaries are typically located in the Intel MKL installation directory. It is recommended to create a dedicated directory (e.g., /home/benchmark/hpl) and copy the following files:

  • Source path example: $MKLROOT/benchmarks/mp_linpack/
  • Required Files:
    • xhpl_intel64_dynamic: Dynamically linked executable.
    • xhpl_intel64_static: Statically linked executable.
    • runme_*: Helper scripts.
    • HPL.dat: Core parameter configuration file.
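The copy step can be sketched as below; this is a dry run (each command is printed, not executed), and the fallback path `/opt/intel/oneapi/mkl/latest` is only a common default, so replace `echo` with the real `cp` once your paths are confirmed:

```bash
# Stage the HPL files into the working directory (dry run).
SRC="${MKLROOT:-/opt/intel/oneapi/mkl/latest}/benchmarks/mp_linpack"
DEST=/home/benchmark/hpl
echo "mkdir -p $DEST"
for f in xhpl_intel64_dynamic HPL.dat runme_intel64_dynamic; do
  echo "cp $SRC/$f $DEST/"
done
```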

2. HPL.dat Core Configuration

The HPL.dat file defines the problem size and grid layout, which are critical for the score.

  • N (Problem Size): Matrix dimension.
    Recommended: $N = \sqrt{\frac{\text{Total Memory (bytes)} \times \text{Usage}(0.9)}{8}} \times 0.9$
    (8 bytes per double-precision element; the 0.9 factors reserve memory to prevent OOM).
  • NB (Block Size): Partitioning block size. Recommended: 384 (Skylake and newer).
  • P, Q (Process Grid): $P \times Q = \text{Total MPI Processes}$.
    Usually P < Q, and P should ideally be a power of 2.

P x Q Configuration Example

Assuming 8 nodes, running 1 MPI process per node:

  • Total Processes = 8
  • Recommendation: P=2, Q=4 (or P=1, Q=8)
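The N formula above can be worked through with a small helper; the 256 GiB total memory here is an assumed example value, and N is rounded down to a multiple of NB, which is common practice:

```bash
# Estimate N from total memory: N = sqrt(mem_bytes * 0.9 / 8) * 0.9,
# rounded down to a multiple of NB. MEM_GIB is an illustrative value.
MEM_GIB=256
NB=384
N=$(awk -v m="$MEM_GIB" -v nb="$NB" 'BEGIN {
  n = sqrt(m * 1024^3 * 0.9 / 8) * 0.9   # 8 bytes per double; 0.9 reserves memory
  printf "%d", int(n / nb) * nb          # keep N a multiple of NB
}')
echo "Suggested N: $N"
```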

3. Testing Steps

3.1 Pre-requisite: Lock Performance Mode

Execute on all compute nodes:

bash
cpupower -c all frequency-set -g performance
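With passwordless SSH in place (section 1.2), the same command can be pushed to every node; a dry-run sketch with placeholder hostnames:

```bash
# Dry run: print the per-node command. Swap `echo` for `ssh "$node"` to apply
# it for real; node01/node02 are placeholder hostnames.
for node in node01 node02; do
  echo ssh "$node" cpupower -c all frequency-set -g performance
done
```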

3.2 Scenario A: One Process Per Node (MPI + OpenMP)

This mode is suitable for large-scale tests because it minimizes inter-process communication overhead: each node runs a single xhpl process that uses all cores via OpenMP threading.

  1. Edit HPL.dat: Set N, NB, P, Q based on the node count.
  2. Create hostfile: List all participating node hostnames.
  3. Run Command:

InfiniBand / RoCE Network:

bash
# -ppn 1 means 1 process per node
mpirun -genv I_MPI_FABRICS shm:ofi \
       -genv FI_PROVIDER mlx \
       -machinefile hostfile \
       -np <Total Nodes> -ppn 1 ./xhpl_intel64_dynamic

Ethernet (TCP):

bash
mpirun -genv I_MPI_FABRICS shm:ofi \
       -genv FI_PROVIDER tcp \
       -machinefile hostfile \
       -np <Total Nodes> -ppn 1 ./xhpl_intel64_dynamic
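The file passed to -machinefile is a plain list of hostnames, one per line; an example for the 8-node case (names are placeholders):

```bash
# Create a hostfile for an 8-node run; replace with your real hostnames.
cat > hostfile <<'EOF'
node01
node02
node03
node04
node05
node06
node07
node08
EOF
wc -l hostfile   # should report 8 nodes listed
```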

3.3 Scenario B: Multiple Processes Per Node (Runme Script)

If you need to manually control processes per NUMA node, use the runme_intel64_dynamic script.

  1. Edit HPL.dat: Ensure $P \times Q = \text{Nodes} \times \text{Processes Per Node}$.
  2. Edit Script: Modify runme_intel64_dynamic.
    • MPI_PROC_NUM: Total processes.
    • MPI_PER_NODE: Processes per node (e.g., 2).
  3. Execute:
    bash
    ./runme_intel64_dynamic
    Results will be written to HPL.out.
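Once the run finishes, the result row can be pulled out of HPL.out; HPL's summary line begins with the test-variant name (e.g. WR11C2R4) and carries N, NB, P, Q, wall time, and GFlops:

```bash
# Extract the result line(s); variant names start with "W", so a simple grep
# suffices. Guarded so the command is safe before a run exists.
if [ -f HPL.out ]; then
  grep -E '^W[RC]' HPL.out
else
  echo "HPL.out not found - run the benchmark first"
fi
```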

4. Fat Node (4-way/8-way) Considerations

For Fat Nodes with 4 or 8 CPU sockets, it is recommended to run one MPI process per CPU socket to optimize memory access (NUMA affinity).

  • Strategy:
    • P x Q: Equals Node Count × CPU Sockets per Node.
    • MPI_PER_NODE: Set to the number of sockets per node (e.g., 4 for a 4-way server).
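For a concrete (illustrative) case of 2 nodes with 4 sockets each, the runme script variables from section 3.3 would be set as follows:

```bash
# Illustrative values for 2 nodes x 4 sockets = 8 MPI processes total.
export MPI_PROC_NUM=8   # nodes × sockets per node
export MPI_PER_NODE=4   # one MPI process per socket (NUMA affinity)
echo "nodes = $((MPI_PROC_NUM / MPI_PER_NODE))"
```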

5. Analysis and Optimization

5.1 Theoretical Peak (Rpeak) Calculation

$$ \text{Rpeak (GFlops, per socket)} = \text{Frequency (GHz)} \times \text{Cores} \times \text{FLOPs/Cycle} $$

Example (Intel Xeon Gold 6126):

  • Frequency: 2.6 GHz
  • Cores: 12
  • Instruction Set: AVX-512 (32 DP FLOPs per cycle)
  • Single CPU Rpeak = $2.6 \times 12 \times 32 = 998.4 \text{ GFlops}$
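The arithmetic above can be reproduced in one line:

```bash
# Rpeak for the Xeon Gold 6126 example: 2.6 GHz × 12 cores × 32 DP FLOPs/cycle.
awk 'BEGIN { printf "%.1f GFlops\n", 2.6 * 12 * 32 }'   # → 998.4 GFlops
```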

5.2 Optimization Flags

Intel MKL automatically detects CPU architecture, but sometimes forcing AVX-512 is necessary for maximum performance.

bash
# Force usage of AVX-512 instructions (Skylake/Cascade Lake/Ice Lake/Sapphire Rapids)
export MKL_ENABLE_INSTRUCTIONS=AVX512

Real-time Monitoring

During the test, use turbostat in a separate terminal to monitor CPU frequency and ensure no throttling occurs:

bash
turbostat --interval 1

AI-HPC Organization