ROCE AI Fabric Architecture Guide
Abstract: As AI models surpass a trillion parameters, traditional RoCE networks based on ECMP and DCQCN struggle to maintain linear scalability. This guide details the ROCE AI Fabric solution, which leverages Adaptive Routing (AR) and a Rail-Optimized architecture to deliver a 400G lossless Ethernet fabric for massive GPU clusters.
1. Background & Challenges
1.1 Network Bottlenecks in the GenAI Era
AI model training is a bandwidth-sensitive workload. In the periodic "Compute-Communicate" cycle, network performance directly dictates the effective utilization of the cluster.
- Massive Scale: Evolving from thousands to tens of thousands of GPUs.
- Extreme Performance: Inter-GPU bandwidth reaches 3.2 Tbps, demanding zero packet loss and high throughput.
- High Reliability: Fault impact must be contained within sub-millisecond timescales to avoid interrupting training.
1.2 Traditional RoCE vs. ROCE AI Fabric
| Feature | Traditional RoCE | ROCE AI Fabric |
|---|---|---|
| Load Balancing | ECMP (flow-based): static hashing on the 5-tuple. Hash collisions cause uneven link utilization and congestion. | AR (packet-based): paths are selected dynamically based on link congestion. Per-packet forwarding yields near-perfect link utilization. |
| Congestion Control | DCQCN: relies on ECN watermarks. Complex tuning, slow feedback loop. | RTT-CC: host-side SmartNIC coordination. Simplified configuration, fast response. |
| Ordering | Network-enforced ordering. | Out-of-order forwarding: relies on SmartNICs (e.g., BlueField-3) to reorder packets at the receiver. |
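To make the load-balancing contrast concrete, here is a minimal illustrative sketch (plain Python, not switch code; the uplink count, flows, and queue depths are hypothetical): ECMP hashes the 5-tuple once per flow and ignores load, while per-packet AR steers each packet to the least-congested uplink.

```python
import hashlib

UPLINKS = 4  # hypothetical number of Spine-facing uplinks

def ecmp_pick(flow_5tuple: tuple) -> int:
    """Flow-based ECMP: hash the 5-tuple once; every packet of the flow
    follows the same uplink, regardless of how loaded that link is."""
    digest = hashlib.md5(repr(flow_5tuple).encode()).digest()
    return digest[0] % UPLINKS

def ar_pick(uplink_queue_depth: list) -> int:
    """Per-packet adaptive routing (conceptual): each packet is steered to
    the currently least-congested uplink, so load spreads evenly."""
    return min(range(len(uplink_queue_depth)), key=uplink_queue_depth.__getitem__)

# Two elephant flows: ECMP's choice depends only on the hash, so they can
# collide on one uplink while the others sit idle.
flow_a = ("10.0.0.1", "10.0.1.1", 49152, 4791, "UDP")
flow_b = ("10.0.0.2", "10.0.1.2", 49153, 4791, "UDP")
print("ECMP picks:", ecmp_pick(flow_a), ecmp_pick(flow_b))

# AR ignores flow identity and reacts to live congestion instead.
queue_depth = [80, 10, 45, 30]  # hypothetical per-uplink occupancy
print("AR picks least-congested uplink:", ar_pick(queue_depth))
```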
2. Architecture Design
2.1 Overview
ROCE AI Fabric utilizes a Spine-Leaf architecture with full 400G interconnects, designed as a 1:1 Non-blocking Rail-Optimized network.
- Compute Network (Backend): Carries GPU parameter synchronization traffic with AR enabled.
- Storage/Mgmt Network: Standard Ethernet configuration.
2.2 Scaling Scenarios
A. 256-512 GPU Cluster (Two-Tier)
- Structure: 2-Layer Spine-Leaf.
- Connectivity: Leaves connect to Spines via AR for multi-path load balancing.
- Scale:
- 256 Cards: 4 Spines + 4 Leaves (32 nodes per Leaf).
- 512 Cards: Scale out horizontally to 12 switches.
B. 10k+ GPU Cluster (Three-Tier)
- Super Unit (SU): Modular unit (e.g., 1K GPUs per SU).
- SuperSpine: Introduces a third layer. Leaf and Spine handle intra-rail traffic; SuperSpine handles cross-rail traffic.
C. Rail-Optimized Design
- Concept: Connect same-numbered GPUs (e.g., GPU0 of every server) to the same Leaf switch (or Leaf group).
- Benefit: Most collective traffic (e.g., AllReduce) flows between same-numbered GPUs, so it stays within a single Leaf hop instead of crossing the Spine, significantly reducing latency (see the sketch below).
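As a quick illustration of the rail mapping, here is a minimal sketch with hypothetical counts (8 GPUs per server, one Leaf group per rail): same-numbered GPUs on every server land on the same Leaf, so rail-local collectives never need a second hop.

```python
GPUS_PER_SERVER = 8   # hypothetical: one NIC/rail per GPU
LEAF_GROUPS = 8       # hypothetical: one Leaf (or Leaf group) per rail

def leaf_for(server_id: int, gpu_id: int) -> int:
    """Rail-optimized cabling: the Leaf is chosen by the GPU's rail index,
    independent of which server the GPU sits in."""
    assert 0 <= gpu_id < GPUS_PER_SERVER
    return gpu_id % LEAF_GROUPS

# GPU0 of every server terminates on the same Leaf -> single-hop collectives.
for server in range(4):
    print(f"server {server}, GPU0 -> Leaf {leaf_for(server, 0)}")

print(leaf_for(0, 0) == leaf_for(3, 0))  # True: same rail, same Leaf
print(leaf_for(0, 0) == leaf_for(0, 1))  # False: cross-rail, needs Spine/SuperSpine
```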
3. Deployment & Configuration (SONiC)
This section demonstrates the configuration for a 256-card cluster using SONiC.
3.1 IP & BGP Planning
Use /31 masks for Spine-Leaf links and VLAN interfaces for server gateways.
- Protocol: BGP
- AS Numbers: All Spines share one AS (e.g., 65100); each Leaf uses its own AS (e.g., 64100, 64101, ...).
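The /31 link addressing and AS assignments can be generated mechanically. Below is a minimal sketch (assuming a hypothetical 10.1.0.0/16 link pool and the example AS numbers above) that gives the Spine the even address and the Leaf the odd one, mirroring the 10.1.1.0/10.1.1.1 pair used in the configuration that follows.

```python
import ipaddress

SPINE_AS = 65100          # all Spines share this AS (example value from above)
LEAF_AS_BASE = 64100      # Leaf N gets 64100 + N (example scheme)
LINK_POOL = ipaddress.ip_network("10.1.0.0/16")   # hypothetical link-address pool

# One /31 per Spine-Leaf link: even address on the Spine, odd on the Leaf.
links = LINK_POOL.subnets(new_prefix=31)

def plan(num_spines: int, num_leaves: int):
    for s in range(num_spines):
        for l in range(num_leaves):
            net = next(links)
            yield {
                "link": f"spine{s}-leaf{l}",
                "spine": f"{net[0]}/31 (AS {SPINE_AS})",
                "leaf":  f"{net[1]}/31 (AS {LEAF_AS_BASE + l})",
            }

# 4 Spines x 4 Leaves, matching the 256-card example.
for row in plan(num_spines=4, num_leaves=4):
    print(row)
```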
3.2 Switch Configuration
Step 1: Interface & BGP Setup
Spine Example:
```
# Interface IP
sonic(config)# interface ethernet 1
sonic(config-if)# no switchport
sonic(config-if)# ip address 10.1.1.0/31
# BGP config (FRR mode)
sonic(config)# vtysh
sonic(config)# router bgp 65100
sonic(config-router)# bgp router-id 1.1.1.1
sonic(config-router)# neighbor 10.1.1.1 remote-as 64100
```
Leaf Example:
```
# Downlink VLAN
sonic(config)# interface vlan 100
sonic(config-if)# ip address 192.168.1.1/26
sonic(config)# interface ethernet 65-96
sonic(config-if)# switchport access vlan 100
# Uplink BGP
sonic(config)# router bgp 64100
sonic(config-router)# neighbor 10.1.1.0 remote-as 65100
```
Step 2: Enable AR (Adaptive Routing)
This is the core step. Enabling AR automatically loads the default QoS profile (PFC/ECN on Queue 3).
```
# 1. Enable AR
sonic(config)# ar enable
# 2. Save config
sonic# write
# 3. Reboot (mandatory)
sonic# reboot
```
Important Notes
- Reboot Required: `ar enable` only takes effect after a `write` and `reboot`.
- Config Lock: Once AR is enabled, the QoS settings (PFC/ECN) are locked and cannot be manually modified.
- Verification: Check status via `show ar config` and `show doroce status`.
3.3 SmartNIC Configuration (Host Side)
Configure the NICs to support AR and enable RTT-CC.
```
# 1. Enable NIC AR (RoCE acceleration)
mlxreg -d $IB_DEV -y --reg_name ROCE_ACCL \
  --set "roce_tx_window_en=0x1,adaptive_routing_forced_en=0x1"
# 2. Enable RTT-CC & disable DCQCN
# Disable DCQCN (cmd_type=2)
mlxreg -d $IB_DEV -y --set "cmd_type=2" --reg_name PPCC \
  --indexes "local_port=1,algo_slot=15"
# Enable RTT-CC (cmd_type=1)
mlxreg -d $IB_DEV -y --set "cmd_type=1" --reg_name PPCC \
  --indexes "local_port=1,algo_slot=0"
```
Note: Replace `$IB_DEV` with the actual device name, e.g., `mlx5_0`.
4. Hardware & Physical Layer
4.1 Optics Selection
400G interconnects are recommended.
| Rate | Type | Interface | Distance | Scenario |
|---|---|---|---|---|
| 400G | QSFP112 VR4 | MPO | 50m | Intra-rack / Adjacent Rack (MMF) |
| 400G | QSFP112 DR4 | MPO | 500m | TOR to Spine (SMF) |
| 400G | QSFP112 FR4 | LC | 2km | Long Range (SMF) |
4.2 Topology Recommendations
- Switch-to-Switch: 400G DR4/FR4.
- Switch-to-NIC:
- Direct: 400G Switch Port -> 400G NIC.
- Split: 400G Switch Port -> Breakout Cable -> 2x 200G NDR NICs.
5. Appendix: Single Switch RoCE (No AR)
For small clusters (<128 Cards), a single switch is sufficient. AR is not needed, but manual lossless configuration is required.
Manual QoS Template:
```
# 1. Enable PFC (queue 3)
interface ethernet 1-128
priority-flow-control priority 3
# 2. Configure ECN (WRED)
qos queue-profile profile1
green min-threshold 153600
green max-threshold 1536000
probability 100
ecn
# 3. Bind the profile to queue 3 on all ports
interface ethernet 1-128
service-policy queue-profile profile1 queue 3
```
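For intuition about the WRED/ECN numbers in this template, the sketch below models the marking curve they imply (illustrative only; it assumes the thresholds are queue occupancy in bytes and that `probability 100` means 100% marking at the max threshold; confirm the units against your switch documentation).

```python
MIN_THRESHOLD = 153_600      # 'green min-threshold' from the profile above
MAX_THRESHOLD = 1_536_000    # 'green max-threshold'
MAX_PROBABILITY = 1.0        # 'probability 100' (percent), assumed meaning

def ecn_mark_probability(queue_depth: int) -> float:
    """WRED-style ECN marking: no marking below min, linear ramp between
    min and max, always mark above max."""
    if queue_depth <= MIN_THRESHOLD:
        return 0.0
    if queue_depth >= MAX_THRESHOLD:
        return 1.0
    ramp = (queue_depth - MIN_THRESHOLD) / (MAX_THRESHOLD - MIN_THRESHOLD)
    return ramp * MAX_PROBABILITY

for depth in (100_000, 300_000, 800_000, 2_000_000):
    print(f"queue depth {depth:>9} bytes: mark probability {ecn_mark_probability(depth):.2f}")
```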