Memory Bandwidth Testing Report & STREAM Guide

Abstract: Memory bandwidth is often the bottleneck for AI training and HPC applications. This guide details the theoretical calculation of DDR5 server memory bandwidth, compiler optimization with Intel oneAPI, and real-world analysis using the STREAM benchmark.

1. Test Platform Overview

The test evaluates the bandwidth performance of the Intel Sapphire Rapids platform under different memory configurations.

  • CPU: 2 × Intel Xeon Platinum 8470 (Sapphire Rapids), 2.0 GHz, 52 cores each
  • Memory Channels: 8 Channels/CPU (Total 16 Channels)
  • Configurations:
    • Half-Population: DDR5-4800 (24 × 64 GB DIMMs)
    • Full-Population: DDR5-4400 (32 × 32 GB DIMMs)
  • Goal: Verify the efficiency gap between STREAM measured results and theoretical peaks. (A quick way to confirm the populated configuration on the host is shown below.)
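
Before trusting datasheet numbers, it is worth confirming what speed the BIOS actually trained the DIMMs to. A minimal check using dmidecode (requires root; the exact field name varies slightly with dmidecode version and vendor):

```bash
# Confirm DIMM count, size, and trained speed.
# Older dmidecode versions report "Configured Clock Speed" instead.
sudo dmidecode -t memory | grep -E "Size:|Configured Memory Speed:" | sort | uniq -c
```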

2. Theoretical Calculation

Theoretical bandwidth is determined by transfer rate (MT/s), channel count, and bus width (64 bits = 8 bytes per transfer per channel).

$$ \text{BW}_{\text{theory}} = \text{Freq (MT/s)} \times \text{CPU Count} \times \text{Channels/CPU} \times \frac{64\text{bit}}{8} $$

2.1 Scenario A: Half-Population (24 DIMMs)

Memory typically trains at a higher speed when channels run in 1DPC (1 DIMM Per Channel) mode or are only partially populated.

  • Config: 24 × DDR5-4800
  • Calc: $4800 \times 2 \times 8 \times 8 = 614,400 \text{ MB/s}$

2.2 Scenario B: Full-Population (32 DIMMs)

When fully populated (2DPC), memory frequency often downclocks (e.g., from 4800 to 4400 MT/s) to maintain signal integrity.

  • Config: 32 × DDR5-4400
  • Calc: $4400 \times 2 \times 8 \times 8 = 563,200 \text{ MB/s}$
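
The same arithmetic as a shell one-liner, convenient for sanity-checking other configurations (the values below are the two scenarios above):

```bash
# Theoretical BW (MB/s) = MT/s x sockets x channels/socket x 8 bytes
echo "Half-population: $((4800 * 2 * 8 * 8)) MB/s"   # 614400
echo "Full-population: $((4400 * 2 * 8 * 8)) MB/s"   # 563200
```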

3. STREAM Benchmark Overview

STREAM is the industry standard for measuring sustainable memory bandwidth via four vector operations:

| Function | Kernel | Words/Iteration | Description |
|----------|--------|-----------------|-------------|
| Copy | a[i] = b[i] | 2 | Data transfer only |
| Scale | a[i] = q * b[i] | 2 | Multiplication |
| Add | a[i] = b[i] + c[i] | 3 | Vector addition |
| Triad | a[i] = b[i] + q * c[i] | 3 | Combined calculation, closest to real apps |

Performance Pattern

Typically: Triad ≈ Add > Copy > Scale. The more complex kernels keep more memory streams in flight, which hides latency better and yields higher measured bandwidth.

4. Compilation & Execution (Intel Env)

To approach the hardware limit, use the Intel oneAPI C/C++ compiler (icx) with non-temporal store optimizations.

4.1 Array Size

Each array must be at least 4× larger than the total L3 cache to ensure you are testing memory, not cache.

  • Example (Xeon 8470, dual-socket): total L3 ≈ 105 MB × 2 = 210 MB, so each array should exceed 4 × 210 MB = 840 MB. Setting -DSTREAM_ARRAY_SIZE=268435456 gives 2 GiB per array of doubles (6 GiB across the three arrays), comfortably above the threshold.
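
A quick way to derive the minimum element count from the cache size (numbers for this platform; substitute your own L3 total):

```bash
# Minimum elements per array: 4 x total L3 bytes / 8 bytes per double
L3_BYTES=$((210 * 1024 * 1024))
echo "Min elements: $((4 * L3_BYTES / 8))"   # ~110 million
echo "Chosen:       268435456"               # 2 GiB per double array
```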

4.2 Compilation Command

```bash
# Load OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Compile stream.c
icx stream.c -o stream_test \
    -DNTIMES=100 \
    -DOFFSET=0 \
    -DSTREAM_TYPE=double \
    -DSTREAM_ARRAY_SIZE=268435456 \
    -Wall -O3 -mcmodel=medium \
    -qopenmp \
    -shared-intel \
    -qopt-streaming-stores always
```
  • -qopenmp: Enable multi-threading.
  • -qopt-streaming-stores always: Critical. Forces non-temporal stores (bypassing cache on write), significantly boosting bandwidth.
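
If icx is unavailable, a rough GCC equivalent is sketched below. Note that GCC has no direct counterpart to -qopt-streaming-stores (it may or may not emit non-temporal stores on its own at -O3), so measured bandwidth can come out noticeably lower:

```bash
# Fallback build with GCC (approximate flag equivalents, not identical)
gcc stream.c -o stream_test \
    -DNTIMES=100 \
    -DSTREAM_ARRAY_SIZE=268435456 \
    -O3 -march=native -mcmodel=medium \
    -fopenmp
```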

4.3 Runtime Configuration

```bash
# Bind threads to physical cores
export OMP_NUM_THREADS=104
export KMP_AFFINITY=compact,granularity=fine

./stream_test
```
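
On a dual-socket system it is also worth measuring each socket in isolation; if one socket comes in far below half of the combined figure, thread or memory placement is off. A sketch using numactl (assumes NUMA node 0 corresponds to socket 0):

```bash
# Run on socket 0 only, with memory allocated from its local NUMA node
export OMP_NUM_THREADS=52
numactl --cpunodebind=0 --membind=0 ./stream_test
```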

5. Analysis of Results

Taking the Full-Population configuration (32 DIMMs, DDR5-4400) as the example:

  • Theoretical BW: 563,200 MB/s
  • Measured Data:
| Test | Bandwidth (MB/s) |
|------|------------------|
| Copy | 400,376 |
| Scale | 385,210 |
| Add | 403,436 |
| Triad | 403,436 |

5.1 Efficiency

$$ \text{Efficiency} = \frac{\text{Measured Triad}}{\text{Theoretical BW}} = \frac{403{,}436}{563{,}200} \approx 71.6\% $$
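
The same check as a one-liner, handy when comparing several runs:

```bash
# Efficiency = measured Triad / theoretical peak
awk 'BEGIN { printf "Efficiency: %.1f%%\n", 403436 / 563200 * 100 }'
```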

5.2 Conclusions

  1. Normal Range: Memory bandwidth efficiency for dual-socket servers typically falls in the 68%–75% range, so this result (≈71.6%) is healthy.
  2. Instruction Set: For STREAM, AVX-512 yields little benefit over AVX2 because the bottleneck is the memory channels, not the compute units.
  3. Capacity vs. Speed: Full population offers more capacity but drops the frequency (4800 → 4400 MT/s), reducing peak theoretical bandwidth by roughly 8%.

6. Best Practices

  • NUMA Awareness: Ensure OMP_NUM_THREADS matches the physical core count and KMP_AFFINITY is set. Incorrect placement across NUMA nodes can halve measured bandwidth (see the quick checks below).
  • Dependencies: Binaries built with recent Intel compilers may require a newer glibc than the target system provides.
  • Sanity Check: If results are absurdly high (TB/s range), your STREAM_ARRAY_SIZE is too small and you are benchmarking cache, not memory.
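
A couple of quick checks for the first two points (standard Linux tools; output varies by distro):

```bash
# NUMA topology: confirm two nodes with the expected core and memory split
numactl --hardware

# Dependency check: a "not found" library here usually means glibc is too old
ldd ./stream_test
```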

AI-HPC Organization