Memory Bandwidth Testing Report & STREAM Guide

Abstract: Memory bandwidth is often the bottleneck for AI training and HPC applications. This guide details the theoretical calculation of DDR5 server memory bandwidth, compiler optimization with Intel oneAPI, and real-world analysis using the STREAM benchmark.

1. Test Platform Overview

The test evaluates the bandwidth performance of the Intel Sapphire Rapids platform under different memory configurations.

  • CPU: 2 × Intel Xeon Platinum 8470 (Sapphire Rapids), 2.0 GHz, 52 cores each
  • Memory Channels: 8 Channels/CPU (Total 16 Channels)
  • Configurations:
    • Half-Population: DDR5-4800 (24 × 64 GB DIMMs)
    • Full-Population: DDR5-4400 (32 × 32 GB DIMMs)
  • Goal: Verify the efficiency gap between STREAM measured results and theoretical peaks. (A quick way to confirm the populated configuration on the host is shown below.)
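
Before trusting datasheet numbers, it is worth confirming what speed the BIOS actually trained the DIMMs to. A minimal check using dmidecode (requires root; the exact field name varies slightly with dmidecode version and vendor):

```bash
# Confirm DIMM count, size, and trained speed.
# Older dmidecode versions report "Configured Clock Speed" instead.
sudo dmidecode -t memory | grep -E "Size:|Configured Memory Speed:" | sort | uniq -c
```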

2. Theoretical Calculation

Theoretical bandwidth is determined by transfer rate (MT/s), channel count, and bus width (64 bits = 8 bytes per transfer per channel).

$$ \text{BW}_{\text{theory}} = \text{Freq (MT/s)} \times \text{CPU Count} \times \text{Channels/CPU} \times \frac{64\text{bit}}{8} $$

2.1 Scenario A: Half-Population (24 DIMMs)

Memory typically trains at a higher speed when channels run in 1DPC (1 DIMM Per Channel) mode or are only partially populated.

  • Config: 24 × DDR5-4800
  • Calc: $4800 \times 2 \times 8 \times 8 = 614,400 \text{ MB/s}$

2.2 Scenario B: Full-Population (32 DIMMs)

When fully populated (2DPC), memory frequency often downclocks (e.g., from 4800 to 4400 MT/s) to maintain signal integrity.

  • Config: 32 × DDR5-4400
  • Calc: $4400 \times 2 \times 8 \times 8 = 563,200 \text{ MB/s}$
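
The same arithmetic as a shell one-liner, convenient for sanity-checking other configurations (the values below are the two scenarios above):

```bash
# Theoretical BW (MB/s) = MT/s x sockets x channels/socket x 8 bytes
echo "Half-population: $((4800 * 2 * 8 * 8)) MB/s"   # 614400
echo "Full-population: $((4400 * 2 * 8 * 8)) MB/s"   # 563200
```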

3. STREAM Benchmark Overview

STREAM is the industry standard for measuring sustainable memory bandwidth via four vector operations:

| Function | Kernel | Words/Iteration | Description |
|----------|--------|-----------------|-------------|
| Copy | a[i] = b[i] | 2 | Data transfer only |
| Scale | a[i] = q * b[i] | 2 | Multiplication |
| Add | a[i] = b[i] + c[i] | 3 | Vector addition |
| Triad | a[i] = b[i] + q * c[i] | 3 | Combined calculation, closest to real apps |

Performance Pattern

Typically: Triad ≈ Add > Copy > Scale. The more complex kernels keep more memory streams in flight, which hides latency better and yields higher measured bandwidth.

4. Compilation & Execution (Intel Env)

To approach the hardware limit, use the Intel oneAPI C/C++ compiler (icx) with non-temporal store optimizations.

4.1 Array Size

Each array must be at least 4× larger than the total L3 cache to ensure you are testing memory, not cache.

  • Example (Xeon 8470, dual-socket): total L3 ≈ 105 MB × 2 = 210 MB, so each array should exceed 4 × 210 MB = 840 MB. Setting -DSTREAM_ARRAY_SIZE=268435456 gives 2 GiB per array of doubles (6 GiB across the three arrays), comfortably above the threshold.
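
A quick way to derive the minimum element count from the cache size (numbers for this platform; substitute your own L3 total):

```bash
# Minimum elements per array: 4 x total L3 bytes / 8 bytes per double
L3_BYTES=$((210 * 1024 * 1024))
echo "Min elements: $((4 * L3_BYTES / 8))"   # ~110 million
echo "Chosen:       268435456"               # 2 GiB per double array
```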

4.2 Compilation Command

```bash
# Load OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Compile stream.c
icx stream.c -o stream_test \
    -DNTIMES=100 \
    -DOFFSET=0 \
    -DSTREAM_TYPE=double \
    -DSTREAM_ARRAY_SIZE=268435456 \
    -Wall -O3 -mcmodel=medium \
    -qopenmp \
    -shared-intel \
    -qopt-streaming-stores always
```
  • -qopenmp: Enable multi-threading.
  • -qopt-streaming-stores always: Critical. Forces non-temporal stores (bypassing cache on write), significantly boosting bandwidth.
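
If icx is unavailable, a rough GCC equivalent is sketched below. Note that GCC has no direct counterpart to -qopt-streaming-stores (it may or may not emit non-temporal stores on its own at -O3), so measured bandwidth can come out noticeably lower:

```bash
# Fallback build with GCC (approximate flag equivalents, not identical)
gcc stream.c -o stream_test \
    -DNTIMES=100 \
    -DSTREAM_ARRAY_SIZE=268435456 \
    -O3 -march=native -mcmodel=medium \
    -fopenmp
```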

4.3 Runtime Configuration

```bash
# Bind threads to physical cores
export OMP_NUM_THREADS=104
export KMP_AFFINITY=compact,granularity=fine

./stream_test
```
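
On a dual-socket system it is also worth measuring each socket in isolation; if one socket comes in far below half of the combined figure, thread or memory placement is off. A sketch using numactl (assumes NUMA node 0 corresponds to socket 0):

```bash
# Run on socket 0 only, with memory allocated from its local NUMA node
export OMP_NUM_THREADS=52
numactl --cpunodebind=0 --membind=0 ./stream_test
```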

5. Analysis of Results

Taking the Full-Population configuration (32 DIMMs, DDR5-4400) as the example:

  • Theoretical BW: 563,200 MB/s
  • Measured Data:
| Test | Bandwidth (MB/s) |
|------|------------------|
| Copy | 400,376 |
| Scale | 385,210 |
| Add | 403,436 |
| Triad | 403,436 |

5.1 Efficiency

$$ \text{Efficiency} = \frac{\text{Measured Triad}}{\text{Theoretical BW}} = \frac{403{,}436}{563{,}200} \approx 71.6\% $$
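
The same check as a one-liner, handy when comparing several runs:

```bash
# Efficiency = measured Triad / theoretical peak
awk 'BEGIN { printf "Efficiency: %.1f%%\n", 403436 / 563200 * 100 }'
```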

5.2 Conclusions

  1. Normal Range: Memory bandwidth efficiency for dual-socket servers typically falls in the 68%–75% range, so this result (≈71.6%) is healthy.
  2. Instruction Set: For STREAM, AVX-512 yields little benefit over AVX2 because the bottleneck is the memory channels, not the compute units.
  3. Capacity vs. Speed: Full population offers more capacity but drops the frequency (4800 → 4400 MT/s), reducing peak theoretical bandwidth by roughly 8%.

6. Best Practices

  • NUMA Awareness: Ensure OMP_NUM_THREADS matches the physical core count and KMP_AFFINITY is set. Incorrect placement across NUMA nodes can halve measured bandwidth (see the quick checks below).
  • Dependencies: Binaries built with recent Intel compilers may require a newer glibc than the target system provides.
  • Sanity Check: If results are absurdly high (TB/s range), your STREAM_ARRAY_SIZE is too small and you are benchmarking cache, not memory.
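
A couple of quick checks for the first two points (standard Linux tools; output varies by distro):

```bash
# NUMA topology: confirm two nodes with the expected core and memory split
numactl --hardware

# Dependency check: a "not found" library here usually means glibc is too old
ldd ./stream_test
```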

AI-HPC Organization