Mellanox IB Operations & Tuning Guide

Abstract: This guide consolidates best practices for InfiniBand network lifecycle management, including driver installation, IP configuration, switch firmware upgrades, MPI benchmarking, performance tuning, and BER troubleshooting.

1. Driver Installation & Config

1.1 Install Mellanox OFED

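These steps target RHEL/CentOS-based systems. The package name below is for OFED 5.3 on RHEL 7.5; substitute the tarball that matches your distribution and kernel.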

bash
# 1. Extract
tar zxvf MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.5-x86_64.tgz
cd MLNX_OFED_LINUX-5.3-1.0.0.1-rhel7.5-x86_64/

# 2. Install all available packages (--all) and force the installation (--force)
./mlnxofedinstall --all --force

# 3. Reboot
reboot
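
After the reboot, it can help to confirm which OFED release is actually installed. The following is a minimal sketch, assuming ofed_info -s prints a single line of the form MLNX_OFED_LINUX-<version>: (the helper name ofed_version is illustrative, not part of OFED).

```bash
#!/bin/bash
# Hypothetical helper: extract the version string from `ofed_info -s` output
# read on stdin. Assumes the output looks like "MLNX_OFED_LINUX-5.3-1.0.0.1:".
ofed_version() {
    sed -n 's/^MLNX_OFED_LINUX-\([0-9][0-9.-]*\):\{0,1\}[[:space:]]*$/\1/p'
}

# Only query the live system when the tool is actually installed.
if command -v ofed_info >/dev/null 2>&1; then
    echo "Installed OFED version: $(ofed_info -s | ofed_version)"
fi
```

If the printed version does not match the tarball you installed, the install most likely did not complete or an older package set is still in place.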

1.2 Configure IPoIB

Assign a static IP address to the ib0 interface for management or IPoIB traffic.

bash
# Edit: /etc/sysconfig/network-scripts/ifcfg-ib0
NAME=ib0
DEVICE=ib0
BOOTPROTO=static
ONBOOT=yes
TYPE=InfiniBand
IPADDR=11.11.11.200
NETMASK=255.255.0.0
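  • Verify: run ip a to confirm the address and ibstat to confirm the port reports State: Active. The port reaches Active only once a subnet manager is running on the fabric.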
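
The ibstat check can be scripted. This is a rough sketch, assuming ibstat prints one "State: ..." line per port alongside a separate "Physical state: ..." line; the function name ib_port_active is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: succeed if ibstat output (read on stdin) contains
# at least one port whose logical state is Active. The anchored pattern
# deliberately ignores the "Physical state:" line.
ib_port_active() {
    grep -Eq '^[[:space:]]*State:[[:space:]]*Active[[:space:]]*$'
}

# Only query the live system when ibstat is installed.
if command -v ibstat >/dev/null 2>&1; then
    if ibstat | ib_port_active; then
        echo "ib: at least one port is Active"
    else
        echo "ib: no Active port found (is a subnet manager running?)" >&2
    fi
fi
```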

1.3 Start Subnet Manager (SM)

An InfiniBand fabric needs at least one running subnet manager. On small clusters without a managed switch, start OpenSM on at least one node.

bash
systemctl start opensmd
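
To confirm that a master SM is answering on the fabric, something like the following may help. It is a sketch that assumes sminfo (from infiniband-diags) is installed and that its output ends with a state name such as SMINFO_MASTER; sm_is_master is an illustrative name.

```bash
#!/bin/bash
# Hypothetical helper: succeed if sminfo output (stdin) reports a master SM.
sm_is_master() {
    grep -q 'SMINFO_MASTER'
}

if command -v sminfo >/dev/null 2>&1; then
    # systemctl enable --now opensmd   # uncomment to also start OpenSM at boot
    if sminfo 2>/dev/null | sm_is_master; then
        echo "A master subnet manager is reachable"
    else
        echo "No master subnet manager found" >&2
    fi
fi
```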

2. Switch Firmware Upgrade

2.1 MST Initialization

The Mellanox Software Tools (MST) service exposes IB devices, including switches reached in-band, as device files under /dev/mst.

bash
mst start
mst ib add
mst status

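Example output: /dev/mst/SW_MT54000_..._lid-0x0006 (the switch device). The lid suffix reflects the switch's LID in your fabric, so use the exact path that mst status reports in any later command.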
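
On a fabric with several switches, the device paths can be pulled out of the mst status output rather than copied by hand. A minimal sketch, assuming in-band switch devices appear as paths beginning with /dev/mst/SW_ (as in the example above); list_switch_devs is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: print every in-band switch device path found in
# `mst status` output (stdin). Assumes such paths start with /dev/mst/SW_.
list_switch_devs() {
    grep -Eo '/dev/mst/SW_[^[:space:]]+'
}

# Only query the live system when mst is installed.
if command -v mst >/dev/null 2>&1; then
    mst status | list_switch_devs
fi
```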

2.2 Burning Firmware

  1. Query the current firmware version:
    bash
    flint -d /dev/mst/SW_MT54000_..._lid-0x0002 query
  2. Burn the new image:
    bash
    flint -d /dev/mst/SW_MT54000_..._lid-0x0002 -i firmware.bin burn
  3. Reset the switch so the new firmware loads:
    bash
    flint -d /dev/mst/SW_MT54000_..._lid-0x0002 swreset

    Note

    On some high-end models (e.g., the 400G MQM9790), a software reset may not be enough, and a physical power cycle (unplug for about 3 minutes) may be required before the new firmware takes effect.
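
The query, burn, and reset steps above can be wrapped so that a failed burn never proceeds to the reset. This is only a sketch: the wrapper names (run, upgrade_switch) and the DRY_RUN variable are invented here, and only the flint invocations shown in the steps above are used.

```bash
#!/bin/bash
# Hypothetical wrapper around the three flint steps above.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# upgrade_switch <mst-device> <firmware-image>
upgrade_switch() {
    local dev="$1" img="$2"
    run flint -d "$dev" query || return 1              # record current version
    run flint -d "$dev" -i "$img" burn || return 1     # stop if the burn fails
    run flint -d "$dev" swreset                        # load the new firmware
}

# Example (prints the commands only, because DRY_RUN defaults to 1):
# upgrade_switch /dev/mst/<switch-device> firmware.bin
```

Running with DRY_RUN=0 executes the commands for real; reviewing the dry-run output first is a cheap way to catch a wrong device path.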

3. Benchmarking (MPI & OSU)

3.1 Intel MPI Benchmarks (IMB)

Measure point-to-point latency and bandwidth with the PingPong test. Build IMB with Intel Compiler 2020 or later.

bash
# PingPong between 2 processes, one per node (-ppn 1), over ib0
mpirun -iface ib0 -f hostfile -np 2 -ppn 1 ./IMB-MPI1 pingpong
  • Metrics:
    • Latency: the t[usec] column for the 0-byte message; expect under 2 µs.
    • Bandwidth: the Mbytes/sec column for the 4 MB message; expect a value close to line rate.
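
The two metrics above can be extracted from the benchmark output automatically. The sketch below assumes the usual IMB result table with four columns (#bytes, #repetitions, t[usec], Mbytes/sec) and uses the 2 µs latency threshold from the list; imb_summary is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: summarise IMB PingPong output read on stdin.
# Assumes the usual result columns: #bytes #repetitions t[usec] Mbytes/sec.
imb_summary() {
    awk '
        $1 == "0"       && NF == 4 { lat = $3 }   # 0-byte row -> latency
        $1 == "4194304" && NF == 4 { bw  = $4 }   # 4 MiB row  -> bandwidth
        END {
            if (lat == "" || bw == "") { print "incomplete"; exit 1 }
            verdict = (lat + 0 < 2.0) ? "OK" : "HIGH"
            printf "latency_us=%s (%s) bandwidth_MBps=%s\n", lat, verdict, bw
        }'
}

# Usage: mpirun ... ./IMB-MPI1 pingpong | imb_summary
```

The bandwidth figure is printed without a verdict because "near line rate" depends on the link speed of the fabric under test.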

3.2 OSU Micro-Benchmarks (Script)

Run osu_bw between every pair of hosts drawn from two host lists (the files host1 and host2, one hostname per line) using Open MPI.

bash
#!/bin/bash
MPI_RUN="/usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun"
OSU_DIR="/path/to/osu-micro-benchmarks"

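for i in $(cat host1); do
    for j in $(cat host2); do
        # Skip self-pairs: they would not exercise the fabric
        [ "$i" = "$j" ] && continue
        echo "== $i -> $j =="
        # Force IB device mlx5_0:1
        "$MPI_RUN" -np 2 -x UCX_NET_DEVICES=mlx5_0:1 -H "$i,$j" "$OSU_DIR/osu_bw"
    done
done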
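
With many node pairs, the full osu_bw tables are hard to compare, so reducing each run to its peak bandwidth can make slow links stand out. A sketch under the assumption that osu_bw prints header lines starting with "#" followed by two numeric columns (message size, bandwidth in MB/s); osu_peak_bw is an illustrative name.

```bash
#!/bin/bash
# Hypothetical helper: print the highest bandwidth (MB/s) seen in osu_bw
# output on stdin. Assumes two numeric columns: message size and bandwidth;
# header lines starting with "#" are skipped.
osu_peak_bw() {
    awk '!/^#/ && NF == 2 && $2 + 0 > max { max = $2 + 0 }
         END { printf "%.2f\n", max + 0 }'
}

# Usage inside the loop, tagging each result with its node pair:
# peak=$("$MPI_RUN" -x UCX_NET_DEVICES=mlx5_0:1 -H "$i,$j" "$OSU_DIR/osu_bw" | osu_peak_bw)
# echo "$i $j $peak"
```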

4. Performance Tuning

4.1 CPU Power Management

Disable CPU power saving on compute nodes: the wake-up latency it introduces shows up directly in small-message MPI latency.

bash
# Set to Performance
cpupower -c all frequency-set -g performance

# Verify
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Expected: performance
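
The verification above reads only cpu0. A short sketch that checks every CPU follows; the function name non_performance_cpus is made up, and the sysfs root is passed as a parameter purely so the function can be exercised against a test directory.

```bash
#!/bin/bash
# Hypothetical helper: list CPUs whose governor is not "performance".
# The sysfs root is a parameter so the function can be pointed at a test tree.
non_performance_cpus() {
    local root="${1:-/sys/devices/system/cpu}" f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue                      # no cpufreq on this CPU
        [ "$(cat "$f")" = "performance" ] || echo "$f"
    done
}

bad="$(non_performance_cpus)"
if [ -n "$bad" ]; then
    echo "CPUs not in performance mode:"; echo "$bad"
fi
```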

4.2 NIC Tuning

Use mlnx_tune to apply a predefined tuning profile in a single step.

bash
# High Throughput Profile
mlnx_tune -p HIGH_THROUGHPUT

5. Health Check & Troubleshooting

5.1 Deep Inspection with ibdiagnet

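Reset the port counters, wait 30 minutes (--pm_pause_time takes seconds, so 1800), then read them again so that intermittent errors show up as counter deltas. Run this while the fabric is carrying traffic, since ibdiagnet does not generate load itself.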

bash
ibdiagnet --pc --pm_pause_time 1800 -P all=1 \
  --get_phy_info --get_cable_info --sc \
  --extended_speeds all --pm_per_lane --routing --sharp
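
Because --pm_pause_time is given in seconds, it is easy to mistype when changing the test duration. A tiny sketch of the conversion (pause_seconds and SOAK_MIN are illustrative names, not ibdiagnet options):

```bash
#!/bin/bash
# Convert a test duration in minutes to the seconds expected by
# --pm_pause_time. 30 minutes -> 1800 seconds, matching the command above.
pause_seconds() {
    echo $(( $1 * 60 ))
}

SOAK_MIN=30                                  # hypothetical variable name
echo "pm_pause_time for ${SOAK_MIN} min: $(pause_seconds "$SOAK_MIN")"
```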

5.2 Failure Criteria

  • BER: must be below $10^{-12}$ (strict target) or, at the very least, below $10^{-8}$ (minimum acceptable).
  • Link Down Counter: the delta over the test window must be 0.
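
To get a feel for these thresholds, BER is simply bit errors divided by bits transferred. The sketch below classifies a measurement against the two limits above; the function name ber_verdict, the verdict labels, and the 200 Gb/s example link are assumptions for illustration, not values from this guide.

```bash
#!/bin/bash
# ber_verdict <bit_errors> <bits_transferred>
# Prints STRICT_PASS (< 1e-12), MIN_PASS (< 1e-8) or FAIL.
ber_verdict() {
    awk -v e="$1" -v b="$2" 'BEGIN {
        if (b <= 0) { print "FAIL"; exit }
        ber = e / b
        if      (ber < 1e-12) print "STRICT_PASS"
        else if (ber < 1e-8)  print "MIN_PASS"
        else                  print "FAIL"
    }'
}

# Example with a hypothetical 200 Gb/s link observed for 1800 s:
# 200e9 * 1800 = 3.6e14 bits, so 360 bit errors correspond to BER = 1e-12.
ber_verdict 100 3.6e14      # BER ~ 2.8e-13
```

On such a link, a 30-minute window tolerates only a few hundred bit errors under the strict target, which is why short runs can miss marginal cables.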

5.3 Isolation Method

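Locate the failing port by its LID, then cross-swap components to isolate the fault:

  1. Clean: clean the fiber end faces and connectors.
  2. Swap, one component at a time:
    • Replace the cable.
    • Replace the transceiver.
    • Move the link to a different NIC or switch port.
  3. Retest: run ibdiagnet again to confirm zero errors.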

AI-HPC Organization