ROCE AI Fabric Architecture Guide

Abstract: As AI models surpass the trillion-parameter mark, traditional RoCE networks based on ECMP and DCQCN struggle to maintain linear scalability. This guide details the ROCE AI Fabric solution, leveraging Adaptive Routing (AR) and a Rail-Optimized architecture to build a 400G lossless Ethernet fabric for massive GPU clusters.

1. Background & Challenges

1.1 Network Bottlenecks in the GenAI Era

AI model training is a bandwidth-sensitive workload. In the periodic "Compute-Communicate" cycle, network performance directly dictates the effective utilization of the cluster.

  • Massive Scale: Clusters are evolving from thousands to tens of thousands of GPUs.
  • Extreme Performance: Inter-GPU bandwidth reaches 3.2 Tbps, demanding zero packet loss and high throughput.
  • High Reliability: Fault impact must be contained to the sub-millisecond level to avoid training interruptions.

1.2 Traditional RoCE vs. ROCE AI Fabric

| Feature | Traditional RoCE | ROCE AI Fabric |
| --- | --- | --- |
| Load Balancing | ECMP (flow-based): static hashing on the 5-tuple; hash collisions cause uneven link utilization and congestion. | AR (packet-based): dynamically selects paths based on link congestion; per-packet forwarding ensures near-perfect link utilization. |
| Congestion Control | DCQCN: relies on ECN watermarks; complex tuning, slow feedback loop. | RTT-CC: host-side SmartNIC coordination; simplified config, fast response. |
| Ordering | Network-enforced ordering. | Out-of-Order Forwarding: relies on SmartNICs (e.g., BlueField-3) to reorder packets at the receiver. |
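
To make the ECMP limitation concrete, the toy script below maps a few RoCEv2 flows (UDP port 4791) onto four uplinks with a static md5-based hash. This is only a stand-in for the switch's real hash function, but it shows how a handful of large flows can collide on one uplink while another sits idle, which is exactly the imbalance that per-packet AR avoids.

bash
# Toy ECMP illustration: a static per-flow hash (md5 stand-in, not the
# switch's real algorithm) pins each 5-tuple to one of 4 uplinks.
uplinks=4
for flow in "10.0.0.1,10.0.1.1,udp,4791" \
            "10.0.0.2,10.0.1.2,udp,4791" \
            "10.0.0.3,10.0.1.3,udp,4791" \
            "10.0.0.4,10.0.1.4,udp,4791"; do
  h=$(( 0x$(printf '%s' "$flow" | md5sum | cut -c1-8) % uplinks ))
  echo "flow $flow -> uplink $h"
done
# With only a few large ("elephant") flows, two of them frequently land
# on the same uplink; AR instead sprays traffic across uplinks per packet.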

2. Architecture Design

2.1 Overview

ROCE AI Fabric utilizes a Spine-Leaf architecture with full 400G interconnects, designed as a 1:1 Non-blocking Rail-Optimized network.

  • Compute Network (Backend): Carries GPU parameter synchronization traffic with AR enabled.
  • Storage/Mgmt Network: Standard Ethernet configuration.

2.2 Scaling Scenarios

A. 256~512 GPU Cluster (Two-Layer)

  • Structure: 2-Layer Spine-Leaf.
  • Connectivity: Leaves connect to Spines over multiple uplinks, with AR providing multi-path load balancing.
  • Scale:
    • 256 Cards: 4 Spines + 4 Leaves (32 nodes per Leaf).
    • 512 Cards: Scale out horizontally to 12 switches.

B. 10k+ GPU Cluster (Three-Layer)

  • Super Unit (SU): Modular unit (e.g., 1K GPUs per SU).
  • SuperSpine: Introduces a third layer. Leaf and Spine handle intra-rail traffic; SuperSpine handles cross-rail traffic.
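
As a rough sizing aid, the sketch below estimates how many SUs a target cluster needs, assuming the example SU size of 1K GPUs from the list above; the real SU size depends on switch radix and rail count.

bash
# Rough sizing helper: how many Super Units (SUs) a target cluster needs,
# assuming the example SU size of 1024 GPUs (adjust to the actual design).
target_gpus=10240
gpus_per_su=1024
sus=$(( (target_gpus + gpus_per_su - 1) / gpus_per_su ))
echo "$target_gpus GPUs -> $sus SUs behind the SuperSpine layer"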

C. Rail-Optimized Design

  • Concept: Connect same-numbered GPUs (e.g., GPU0 from every server) to the same Leaf switch (or Leaf group).
  • Benefit: Most collective-communication traffic (e.g., AllReduce) flows between same-numbered GPUs, so it stays within the Leaf layer (a single hop), significantly reducing latency.
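
A minimal sketch of the cabling rule, assuming 8 GPUs per server and one rail (Leaf or Leaf group) per GPU index; the point is simply that GPU k of every server lands on the same Leaf k.

bash
# Rail-optimized cabling rule (illustrative): GPU k of every server
# connects to rail/Leaf k, so same-numbered GPUs share a single Leaf hop.
servers=32   # example: 32 servers x 8 GPUs = 256 GPUs
rails=8      # one rail (Leaf or Leaf group) per GPU index
for s in $(seq 0 $((servers - 1))); do
  for g in $(seq 0 $((rails - 1))); do
    echo "server$s/gpu$g -> leaf$g (rail $g)"
  done
done | head -16   # print a sample of the mapping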

3. Deployment & Configuration (SONiC)

This section demonstrates the configuration for a 256-card cluster using SONiC.

3.1 IP & BGP Planning

Use /31 masks for Spine-Leaf links and VLAN interfaces for server gateways.

  • Protocol: BGP
  • AS Number: Spines share a single AS (e.g., 65100); each Leaf uses its own AS (e.g., 64100, 64101, ...).
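
The /31 plan can be generated mechanically. The sketch below is illustrative only, assuming 4 Spines, 4 Leaves, and the 10.1.1.0/24 range used in the configuration examples that follow; the Spine side takes the even address of each /31 and the Leaf side the odd one.

bash
# /31 link-address planner (illustrative): one /31 per Spine-Leaf link,
# Spine side gets the even address, Leaf side the odd one.
spines=4
leaves=4
link=0
for s in $(seq 1 $spines); do
  for l in $(seq 1 $leaves); do
    base=$(( link * 2 ))
    echo "spine$s <-> leaf$l : 10.1.1.$base/31 (spine) - 10.1.1.$((base + 1))/31 (leaf)"
    link=$(( link + 1 ))
  done
done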

3.2 Switch Configuration

Step 1: Interface & BGP Setup

Spine Example:

bash
# Interface IP
sonic(config)# interface ethernet 1
sonic(config-if)# no switchport
sonic(config-if)# ip address 10.1.1.0/31

# BGP Config (FRR mode)
sonic(config)# vtysh
sonic(config)# router bgp 65100
sonic(config-router)# bgp router-id 1.1.1.1
sonic(config-router)# neighbor 10.1.1.1 remote-as 64100

Leaf Example:

bash
# Downlink VLAN
sonic(config)# interface vlan 100
sonic(config-if)# ip address 192.168.1.1/26
sonic(config)# interface ethernet 65-96
sonic(config-if)# switchport access vlan 100

# Uplink BGP
sonic(config)# router bgp 64100
sonic(config-router)# neighbor 10.1.1.0 remote-as 65100
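
Once both ends are configured, the sessions can be verified from the FRR shell; a minimal check, assuming FRR's vtysh is available on the switch as in the spine example above.

bash
# Verify that the Spine-Leaf BGP sessions are Established and that
# routes to the server VLANs are being learned (FRR vtysh commands).
vtysh -c "show ip bgp summary"
vtysh -c "show ip route bgp"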

Step 2: Enable AR (Adaptive Routing)

This is the core step. Enabling AR automatically loads the default QoS profile (PFC/ECN on Queue 3).

bash
# 1. Enable AR
sonic(config)# ar enable

# 2. Save Config
sonic# write

# 3. Reboot (Mandatory!)
sonic# reboot

Important Notes

  1. Reboot Required: ar enable only takes effect after a write and reboot.
  2. Config Lock: Once AR is enabled, QoS settings (PFC/ECN) are locked and cannot be manually modified.
  3. Verification: Check status via show ar config and show doroce status.
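
After the reboot, a quick sanity check with the show commands mentioned above confirms that AR is active and the default lossless profile has been applied.

bash
# Post-reboot sanity check: confirm AR is active and the default
# lossless profile (PFC/ECN on queue 3) is in place.
show ar config
show doroce status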

3.3 SmartNIC Configuration (Host Side)

Configure the NICs to support AR and enable RTT-CC.

bash
# 1. Enable NIC AR (RoCE Acceleration)
mlxreg -d $IB_DEV -y --reg_name ROCE_ACCL \
  --set "roce_tx_window_en=0x1,adaptive_routing_forced_en=0x1"

# 2. Enable RTT-CC & Disable DCQCN
# Disable DCQCN (cmd_type=2)
mlxreg -d $IB_DEV -y --set "cmd_type=2" --reg_name PPCC \
  --indexes "local_port=1,algo_slot=15"
# Enable RTTCC (cmd_type=1)
mlxreg -d $IB_DEV -y --set "cmd_type=1" --reg_name PPCC \
  --indexes "local_port=1,algo_slot=0"

Note: Replace $IB_DEV with the actual device name, e.g., mlx5_0.
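
On hosts with multiple NICs, the same two steps can be looped over every device and the register values read back afterwards; a minimal sketch, assuming the devices are enumerated under /sys/class/infiniband and that your MFT version supports the mlxreg --get form.

bash
# Apply the AR / RTT-CC settings to every mlx5 device on the host and
# read the ROCE_ACCL register back to confirm the writes took effect.
for dev in /sys/class/infiniband/mlx5_*; do
  IB_DEV=$(basename "$dev")
  echo "== $IB_DEV =="
  mlxreg -d "$IB_DEV" -y --reg_name ROCE_ACCL \
    --set "roce_tx_window_en=0x1,adaptive_routing_forced_en=0x1"
  mlxreg -d "$IB_DEV" -y --set "cmd_type=2" --reg_name PPCC \
    --indexes "local_port=1,algo_slot=15"
  mlxreg -d "$IB_DEV" -y --set "cmd_type=1" --reg_name PPCC \
    --indexes "local_port=1,algo_slot=0"
  mlxreg -d "$IB_DEV" --reg_name ROCE_ACCL --get   # read back to verify
done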

4. Hardware & Physical Layer

4.1 Optics Selection

400G interconnects are recommended.

| Rate | Type | Interface | Distance | Scenario |
| --- | --- | --- | --- | --- |
| 400G | QSFP112 VR4 | MPO | 50m | Intra-rack / Adjacent Rack (MMF) |
| 400G | QSFP112 DR4 | MPO | 500m | TOR to Spine (SMF) |
| 400G | QSFP112 FR4 | LC | 2km | Long Range (SMF) |

4.2 Topology Recommendations

  • Switch-to-Switch: 400G DR4/FR4.
  • Switch-to-NIC:
    • Direct: 400G Switch Port -> 400G NIC.
    • Split: 400G Switch Port -> Breakout Cable -> 2x 200G NDR NICs.

5. Appendix: Single Switch RoCE (No AR)

For small clusters (<128 Cards), a single switch is sufficient. AR is not needed, but manual lossless configuration is required.

Manual QoS Template:

bash
# 1. Enable PFC (Queue 3)
interface ethernet 1-128
 priority-flow-control priority 3

# 2. Configure ECN (WRED)
qos queue-profile profile1
 green min-threshold 153600
 green max-threshold 1536000
 probability 100
 ecn
interface ethernet 1-128
 service-policy queue-profile profile1 queue 3
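
After applying the template, it is worth confirming under load that PFC pauses and queue-3 traffic behave as expected; a minimal check, assuming the community SONiC counter utilities are available on the switch.

bash
# Observe lossless behaviour under load: PFC pause frames and
# per-queue counters for the RoCE queue (queue 3).
show pfc counters
show queue counters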
