Memory Bandwidth Testing Report & STREAM Guide
Abstract: Memory bandwidth is often the bottleneck for AI training and HPC applications. This guide covers the theoretical calculation of DDR5 server memory bandwidth, compiler optimization with Intel oneAPI, and real-world analysis using the STREAM benchmark.
1. Test Platform Overview
The test evaluates the bandwidth performance of the Intel Sapphire Rapids platform under different memory configurations.
- CPU: 2 × Intel Xeon 8470 (Sapphire Rapids), 2.0 GHz, 52 cores each
- Memory Channels: 8 Channels/CPU (Total 16 Channels)
- Configurations:
  - Half-population: DDR5-4800 MT/s (24 DIMMs, 64 GB each)
  - Full-population: DDR5-4400 MT/s (32 DIMMs, 32 GB each)
- Goal: Quantify the gap between STREAM-measured bandwidth and the theoretical peak.
2. Theoretical Calculation
Bandwidth is determined by frequency, channel count, and data width.
$$ \text{BW}_{\text{theory}} = \text{Freq (MT/s)} \times \text{CPU Count} \times \text{Channels/CPU} \times \frac{64\text{bit}}{8} $$
2.1 Scenario A: Half-Population (24 DIMMs)
Memory frequency is typically higher with partial population, e.g., when running in 1DPC (1 DIMM Per Channel) mode.
- Config: 24 × DDR5-4800
- Calc: $4800 \times 2 \times 8 \times 8 = 614,400 \text{ MB/s}$
2.2 Scenario B: Full-Population (32 DIMMs)
When fully populated (2DPC), memory frequency often downgrades (e.g., from 4800 to 4400) to maintain signal integrity.
- Config: 32 × DDR5-4400
- Calc: $4400 \times 2 \times 8 \times 8 = 563,200 \text{ MB/s}$
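These products can be sanity-checked directly in a shell; the factors are exactly the formula's terms (MT/s × sockets × channels per socket × 8 bytes per transfer):

```bash
# Theoretical peak in MB/s: MT/s × sockets × channels/socket × 8 B/transfer
echo $(( 4800 * 2 * 8 * 8 ))   # Half-population: 614400
echo $(( 4400 * 2 * 8 * 8 ))   # Full-population: 563200
```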
3. STREAM Benchmark Overview
STREAM is the industry standard for measuring sustainable memory bandwidth via four vector kernels; the Words/Iteration column counts the 8-byte words read and written per loop iteration:
| Function | Kernel | Words/Iteration | Description |
|---|---|---|---|
| Copy | a[i] = b[i] | 2 | Data transfer only |
| Scale | a[i] = q * b[i] | 2 | Multiplication |
| Add | a[i] = b[i] + c[i] | 3 | Vector addition |
| Triad | a[i] = b[i] + q * c[i] | 3 | Combined calc, closest to real apps |
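The Words/Iteration column is what STREAM uses to convert each kernel's timing into bandwidth: it counts the bytes explicitly moved per pass (words/iteration × 8 bytes × array length). A quick check for Triad, using the array size adopted later in this guide:

```bash
# Triad traffic per pass: 3 words/element (read b, read c, write a) × 8 bytes × N
echo $(( 3 * 8 * 268435456 ))   # 6442450944 bytes ≈ 6.4 GB moved per pass
```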
Performance Pattern
Typically: Triad ≈ Add > Copy > Scale. Add and Triad read two input streams per iteration instead of one, keeping more memory requests in flight; this hides latency better and typically yields slightly higher reported bandwidth.
4. Compilation & Execution (Intel Env)
To approach the hardware limit, use the Intel C/C++ compiler (icx) with non-temporal store optimizations.
4.1 Array Size
Each array must be at least 4× the size of the total L3 cache to ensure you are testing memory, not cache.
- Example (Xeon 8470): L3 Cache ≈ 105 MB × 2 = 210 MB. Recommended Array Size > 800 MB.
- Example setting: `-DSTREAM_ARRAY_SIZE=268435456` (≈ 2 GB per array for doubles; STREAM allocates three such arrays, ≈ 6 GB total)
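The minimum element count can be derived from the cache size in the same way (a rough sketch; 210 MB is the combined L3 from above):

```bash
# Each double array must hold ≥ 4 × 210 MB of data; divide by 8 bytes per double
echo $(( 4 * 210 * 1024 * 1024 / 8 ))   # 110100480 elements (~840 MB per array)
```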
4.2 Compilation Command
```bash
# Load OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Compile stream.c
icx stream.c -o stream_test \
    -DNTIMES=100 \
    -DOFFSET=0 \
    -DSTREAM_TYPE=double \
    -DSTREAM_ARRAY_SIZE=268435456 \
    -Wall -O3 -mcmodel=medium \
    -qopenmp \
    -shared-intel \
    -qopt-streaming-stores always
```
Key flags:
- `-qopenmp`: Enable OpenMP multi-threading.
- `-qopt-streaming-stores always`: Critical. Forces non-temporal stores (bypassing the cache on writes), significantly boosting measured bandwidth.
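Since `-shared-intel` links the Intel runtime dynamically, it is worth confirming that the binary resolves its libraries on the target host (the grep pattern below is just one way to filter the output; see also the dependency note in section 6):

```bash
# Check that the Intel OpenMP runtime and glibc resolve on this host
ldd ./stream_test | grep -Ei 'iomp|libc'
```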
4.3 Runtime Configuration
```bash
# Bind threads to physical cores (2 × 52 = 104)
export OMP_NUM_THREADS=104
export KMP_AFFINITY=granularity=fine,compact
./stream_test
```
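As a cross-check, each socket can also be measured in isolation with numactl (assuming it is installed); a healthy system delivers roughly half of the dual-socket figure per socket:

```bash
# Socket 0 only: 52 threads (one per physical core), memory from NUMA node 0
OMP_NUM_THREADS=52 numactl --cpunodebind=0 --membind=0 ./stream_test
```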
5. Analysis of Results
Taking Full-Population (32 DIMMs, 4400 MHz) as the example:
- Theoretical BW: 563,200 MB/s
- Measured Data:
| Test | Bandwidth (MB/s) |
|---|---|
| Copy | 400,376 |
| Scale | 385,210 |
| Add | 403,436 |
| Triad | 403,436 |
5.1 Efficiency
$$ \text{Efficiency} = \frac{\text{Measured Triad}}{\text{Theoretical BW}} = \frac{403{,}436}{563{,}200} \approx 71.6\% $$
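The same division, reproduced on the command line (illustrative only):

```bash
# Efficiency = measured Triad / theoretical peak
awk 'BEGIN { printf "%.1f%%\n", 100 * 403436 / 563200 }'   # 71.6%
```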
5.2 Conclusions
- Normal Range: Memory bandwidth efficiency on dual-socket servers is typically 68%–75%; this result (~71.6%) is healthy.
- Instruction Set: For STREAM, AVX-512 yields little benefit over AVX2, because the bottleneck is the memory channels, not the compute units.
- Capacity vs. Speed: Full population maximizes capacity but drops the frequency (4800 → 4400 MT/s), reducing peak theoretical bandwidth.
6. Best Practices
- NUMA Awareness: Ensure `OMP_NUM_THREADS` matches the physical core count and `KMP_AFFINITY` is set; crossing NUMA nodes incorrectly can halve bandwidth (see the topology check after this list).
- Dependencies: Binaries compiled with newer Intel compilers may require a newer `glibc`.
- Sanity Check: If results are absurdly high (TB/s range), your `STREAM_ARRAY_SIZE` is too small, and you are benchmarking L1/L2 cache.
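Before a run, the topology assumed by these settings can be confirmed with standard tools (assuming numactl is available):

```bash
# Inspect NUMA nodes and per-node free memory
numactl --hardware
# Summarize core-to-node mapping
lscpu | grep -i numa
```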
