ROCE AI Fabric Architecture Guide
Abstract: As AI models surpass a trillion parameters, traditional RoCE networks based on ECMP and DCQCN struggle to maintain linear scalability. This guide details the ROCE AI Fabric solution, which leverages Adaptive Routing (AR) and a Rail-Optimized architecture to deliver a 400G lossless Ethernet fabric for massive GPU clusters.
1. Background & Challenges
1.1 Network Bottlenecks in the GenAI Era
AI model training is a bandwidth-sensitive workload. In the periodic "Compute-Communicate" cycle, network performance directly dictates the effective utilization of the cluster.
- Massive Scale: Evolving from thousands to tens of thousands of GPUs.
- Extreme Performance: Inter-GPU bandwidth reaches 3.2 Tbps, demanding zero packet loss and high throughput.
- High Reliability: Fault impact must be contained within sub-millisecond timescales to avoid interrupting training.
1.2 Traditional RoCE vs. ROCE AI Fabric
| Feature | Traditional RoCE | ROCE AI Fabric |
|---|---|---|
| Load Balancing | ECMP (flow-based): static hashing on the 5-tuple. Hash collisions cause uneven link utilization and congestion. | AR (packet-based): paths are selected dynamically based on link congestion. Per-packet forwarding yields near-perfect link utilization. |
| Congestion Control | DCQCN: relies on ECN watermarks. Complex tuning, slow feedback loop. | RTT-CC: host-side SmartNIC coordination. Simplified configuration, fast response. |
| Ordering | Network-enforced ordering. | Out-of-order forwarding: relies on SmartNICs (e.g., BlueField-3) to reorder packets at the receiver. |
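To make the load-balancing contrast concrete, here is a minimal illustrative sketch (plain Python, not switch code; the uplink count, flows, and queue depths are hypothetical): ECMP hashes the 5-tuple once per flow and ignores load, while per-packet AR steers each packet to the least-congested uplink.

```python
import hashlib

UPLINKS = 4  # hypothetical number of Spine-facing uplinks

def ecmp_pick(flow_5tuple: tuple) -> int:
    """Flow-based ECMP: hash the 5-tuple once; every packet of the flow
    follows the same uplink, regardless of how loaded that link is."""
    digest = hashlib.md5(repr(flow_5tuple).encode()).digest()
    return digest[0] % UPLINKS

def ar_pick(uplink_queue_depth: list) -> int:
    """Per-packet adaptive routing (conceptual): each packet is steered to
    the currently least-congested uplink, so load spreads evenly."""
    return min(range(len(uplink_queue_depth)), key=uplink_queue_depth.__getitem__)

# Two elephant flows: ECMP's choice depends only on the hash, so they can
# collide on one uplink while the others sit idle.
flow_a = ("10.0.0.1", "10.0.1.1", 49152, 4791, "UDP")
flow_b = ("10.0.0.2", "10.0.1.2", 49153, 4791, "UDP")
print("ECMP picks:", ecmp_pick(flow_a), ecmp_pick(flow_b))

# AR ignores flow identity and reacts to live congestion instead.
queue_depth = [80, 10, 45, 30]  # hypothetical per-uplink occupancy
print("AR picks least-congested uplink:", ar_pick(queue_depth))
```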
2. Architecture Design
2.1 Overview
ROCE AI Fabric utilizes a Spine-Leaf architecture with full 400G interconnects, designed as a 1:1 Non-blocking Rail-Optimized network.
- Compute Network (Backend): Carries GPU parameter synchronization traffic with AR enabled.
- Storage/Mgmt Network: Standard Ethernet configuration.
2.2 Scaling Scenarios
A. 256-512 GPU Cluster (Two-Tier)
- Structure: 2-Layer Spine-Leaf.
- Connectivity: Leaves connect to Spines via AR for multi-path load balancing.
- Scale:
- 256 Cards: 4 Spines + 4 Leaves (32 nodes per Leaf).
- 512 Cards: Scale out horizontally to 12 switches.
B. 10k+ GPU Cluster (Three-Tier)
- Super Unit (SU): Modular unit (e.g., 1K GPUs per SU).
- SuperSpine: Introduces a third layer. Leaf and Spine handle intra-rail traffic; SuperSpine handles cross-rail traffic.
C. Rail-Optimized Design
- Concept: Connect same-numbered GPUs (e.g., GPU0 of every server) to the same Leaf switch (or Leaf group).
- Benefit: Most collective traffic (e.g., AllReduce) flows between same-numbered GPUs, so it stays within a single Leaf hop instead of crossing the Spine, significantly reducing latency (see the sketch below).
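As a quick illustration of the rail mapping, here is a minimal sketch with hypothetical counts (8 GPUs per server, one Leaf group per rail): same-numbered GPUs on every server land on the same Leaf, so rail-local collectives never need a second hop.

```python
GPUS_PER_SERVER = 8   # hypothetical: one NIC/rail per GPU
LEAF_GROUPS = 8       # hypothetical: one Leaf (or Leaf group) per rail

def leaf_for(server_id: int, gpu_id: int) -> int:
    """Rail-optimized cabling: the Leaf is chosen by the GPU's rail index,
    independent of which server the GPU sits in."""
    assert 0 <= gpu_id < GPUS_PER_SERVER
    return gpu_id % LEAF_GROUPS

# GPU0 of every server terminates on the same Leaf -> single-hop collectives.
for server in range(4):
    print(f"server {server}, GPU0 -> Leaf {leaf_for(server, 0)}")

print(leaf_for(0, 0) == leaf_for(3, 0))  # True: same rail, same Leaf
print(leaf_for(0, 0) == leaf_for(0, 1))  # False: cross-rail, needs Spine/SuperSpine
```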
3. Deployment & Configuration (SONiC)
This section demonstrates the configuration for a 256-card cluster using SONiC.
3.1 IP & BGP Planning
Use /31 masks for Spine-Leaf links and VLAN interfaces for server gateways.
- Protocol: BGP
- AS Numbers: All Spines share one AS (e.g., 65100); each Leaf uses its own AS (e.g., 64100, 64101, ...).
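The /31 link addressing and AS assignments can be generated mechanically. Below is a minimal sketch (assuming a hypothetical 10.1.0.0/16 link pool and the example AS numbers above) that gives the Spine the even address and the Leaf the odd one, mirroring the 10.1.1.0/10.1.1.1 pair used in the configuration that follows.

```python
import ipaddress

SPINE_AS = 65100          # all Spines share this AS (example value from above)
LEAF_AS_BASE = 64100      # Leaf N gets 64100 + N (example scheme)
LINK_POOL = ipaddress.ip_network("10.1.0.0/16")   # hypothetical link-address pool

# One /31 per Spine-Leaf link: even address on the Spine, odd on the Leaf.
links = LINK_POOL.subnets(new_prefix=31)

def plan(num_spines: int, num_leaves: int):
    for s in range(num_spines):
        for l in range(num_leaves):
            net = next(links)
            yield {
                "link": f"spine{s}-leaf{l}",
                "spine": f"{net[0]}/31 (AS {SPINE_AS})",
                "leaf":  f"{net[1]}/31 (AS {LEAF_AS_BASE + l})",
            }

# 4 Spines x 4 Leaves, matching the 256-card example.
for row in plan(num_spines=4, num_leaves=4):
    print(row)
```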
3.2 Switch Configuration
Step 1: Interface & BGP Setup
Spine Example:
```
# Interface IP
sonic(config)# interface ethernet 1
sonic(config-if)# no switchport
sonic(config-if)# ip address 10.1.1.0/31
# BGP config (FRR mode)
sonic(config)# vtysh
sonic(config)# router bgp 65100
sonic(config-router)# bgp router-id 1.1.1.1
sonic(config-router)# neighbor 10.1.1.1 remote-as 64100
```
Leaf Example:
```
# Downlink VLAN
sonic(config)# interface vlan 100
sonic(config-if)# ip address 192.168.1.1/26
sonic(config)# interface ethernet 65-96
sonic(config-if)# switchport access vlan 100
# Uplink BGP
sonic(config)# router bgp 64100
sonic(config-router)# neighbor 10.1.1.0 remote-as 65100
```
Step 2: Enable AR (Adaptive Routing)
This is the core step. Enabling AR automatically loads the default QoS profile (PFC/ECN on Queue 3).
```
# 1. Enable AR
sonic(config)# ar enable
# 2. Save config
sonic# write
# 3. Reboot (mandatory)
sonic# reboot
```
Important Notes
- Reboot Required: `ar enable` only takes effect after a `write` and `reboot`.
- Config Lock: Once AR is enabled, the QoS settings (PFC/ECN) are locked and cannot be manually modified.
- Verification: Check status via `show ar config` and `show doroce status`.
3.3 SmartNIC Configuration (Host Side)
Configure the NICs to support AR and enable RTT-CC.
```
# 1. Enable NIC AR (RoCE acceleration)
mlxreg -d $IB_DEV -y --reg_name ROCE_ACCL \
  --set "roce_tx_window_en=0x1,adaptive_routing_forced_en=0x1"
# 2. Enable RTT-CC & disable DCQCN
# Disable DCQCN (cmd_type=2)
mlxreg -d $IB_DEV -y --set "cmd_type=2" --reg_name PPCC \
  --indexes "local_port=1,algo_slot=15"
# Enable RTT-CC (cmd_type=1)
mlxreg -d $IB_DEV -y --set "cmd_type=1" --reg_name PPCC \
  --indexes "local_port=1,algo_slot=0"
```
Note: Replace `$IB_DEV` with the actual device name, e.g., `mlx5_0`.
4. Hardware & Physical Layer
4.1 Optics Selection
400G interconnects are recommended.
| Rate | Type | Interface | Distance | Scenario |
|---|---|---|---|---|
| 400G | QSFP112 VR4 | MPO | 50m | Intra-rack / Adjacent Rack (MMF) |
| 400G | QSFP112 DR4 | MPO | 500m | TOR to Spine (SMF) |
| 400G | QSFP112 FR4 | LC | 2km | Long Range (SMF) |
4.2 Topology Recommendations
- Switch-to-Switch: 400G DR4/FR4.
- Switch-to-NIC:
- Direct: 400G Switch Port -> 400G NIC.
- Split: 400G Switch Port -> Breakout Cable -> 2x 200G NDR NICs.
5. Appendix: Single Switch RoCE (No AR)
For small clusters (<128 Cards), a single switch is sufficient. AR is not needed, but manual lossless configuration is required.
Manual QoS Template:
```
# 1. Enable PFC (queue 3)
interface ethernet 1-128
priority-flow-control priority 3
# 2. Configure ECN (WRED)
qos queue-profile profile1
green min-threshold 153600
green max-threshold 1536000
probability 100
ecn
# 3. Bind the profile to queue 3 on all ports
interface ethernet 1-128
service-policy queue-profile profile1 queue 3
```
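For intuition about the WRED/ECN numbers in this template, the sketch below models the marking curve they imply (illustrative only; it assumes the thresholds are queue occupancy in bytes and that `probability 100` means 100% marking at the max threshold; confirm the units against your switch documentation).

```python
MIN_THRESHOLD = 153_600      # 'green min-threshold' from the profile above
MAX_THRESHOLD = 1_536_000    # 'green max-threshold'
MAX_PROBABILITY = 1.0        # 'probability 100' (percent), assumed meaning

def ecn_mark_probability(queue_depth: int) -> float:
    """WRED-style ECN marking: no marking below min, linear ramp between
    min and max, always mark above max."""
    if queue_depth <= MIN_THRESHOLD:
        return 0.0
    if queue_depth >= MAX_THRESHOLD:
        return 1.0
    ramp = (queue_depth - MIN_THRESHOLD) / (MAX_THRESHOLD - MIN_THRESHOLD)
    return ramp * MAX_PROBABILITY

for depth in (100_000, 300_000, 800_000, 2_000_000):
    print(f"queue depth {depth:>9} bytes: mark probability {ecn_mark_probability(depth):.2f}")
```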