B2B Engineering Insights & Architectural Teardowns

MRC Protocol for Resilient GPU Networks

The MRC protocol changes how networks in AI clusters behave, reducing congestion and increasing resilience to failures. This is critical for synchronous model training across tens of thousands of GPUs.

The problem does not show up at small scale; it appears once the cluster grows large enough to amplify every network anomaly. In large-model training, a single step involves millions of data transfers, and one delayed transfer is enough to stall the whole step. This creates a "failure amplifier" effect: individual packet losses, congestion events, or link failures scale up across the entire job. In traditional RDMA (RoCE) networks, each flow follows a single path. In multi-plane topologies this leads to flow collisions and uneven load distribution, which worsens jitter and reduces throughput.
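The amplification is easy to see with a back-of-the-envelope model: a synchronous step waits for its slowest transfer, so the chance that a step stalls grows with the number of transfers. The sketch below uses illustrative numbers (a 0.01% per-transfer anomaly rate is an assumption, not a figure from the article):

```python
def stall_probability(num_transfers: int, slow_prob: float) -> float:
    """Chance that at least one transfer in a synchronous step hits
    an anomaly. Since the step waits for its slowest transfer, this
    is also the chance the whole step is delayed."""
    return 1.0 - (1.0 - slow_prob) ** num_transfers

# Ten transfers rarely stall; a million almost always do.
print(stall_probability(10, 1e-4))         # per-step risk stays tiny
print(stall_probability(1_000_000, 1e-4))  # per-step risk approaches 1
```

With a million transfers per step, even a one-in-ten-thousand anomaly rate makes a stall near-certain, which is exactly the "failure amplifier" the text describes.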

The solution is built around MRC (Multipath Reliable Connection) — an extension of RoCE that alters the fundamental delivery model. Instead of a single path for transmission, hundreds of parallel routes are utilized. This reduces the likelihood of hot spots and balances the load. Additionally, SRv6 source routing is employed, where the sender explicitly specifies the packet route. This removes dependence on dynamic routing and reduces the class of control plane errors. The trade-off is evident: the system becomes more complex at the endpoint level, but the network as a whole simplifies and becomes more predictable.
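The contrast between single-path delivery and multipath spraying can be sketched as a load-balancing experiment. The function names, path count, and flow sizes below are illustrative, not part of the MRC specification:

```python
import random
import zlib
from collections import Counter

NUM_PATHS = 16  # illustrative; MRC uses hundreds of parallel routes

def hashed_load(flows):
    """Single-path RoCE: a flow hash pins every packet of a flow to
    one path, so heavy flows can collide and create a hot spot."""
    load = Counter()
    for flow_id, packets in flows:
        path = zlib.crc32(flow_id.encode()) % NUM_PATHS
        load[path] += packets
    return load

def sprayed_load(flows, seed=0):
    """Multipath delivery: every packet picks its own path, so
    per-path load stays close to the cluster-wide average."""
    rng = random.Random(seed)
    load = Counter()
    for _, packets in flows:
        for _ in range(packets):
            load[rng.randrange(NUM_PATHS)] += 1
    return load

flows = [(f"flow-{i}", 100) for i in range(8)]
print(max(hashed_load(flows).values()))   # hot spot: at least one full flow
print(max(sprayed_load(flows).values()))  # spread out near the average
```

With flow hashing, the busiest link always carries at least one entire flow (and more when hashes collide); with spraying, no link is far from the mean, which is the hot-spot reduction the paragraph describes.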

Architecturally, MRC relies on a multi-plane network. One 800Gb/s interface is divided into several channels, for example, 8×100Gb/s, each connected to a separate plane. This allows for the construction of topologies with two levels of switches instead of three or four. Fewer levels mean lower latency and fewer points of failure. However, such redundancy creates a challenge for efficient path utilization. MRC addresses this through packet spraying: packets from a single flow are distributed across all available routes. The order of delivery is not guaranteed, but each packet contains the final address in memory, allowing data to be collected without strict sequencing.
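The "final address in memory" idea is what makes out-of-order delivery harmless: each packet carries its destination offset, so the receiver writes it into place RDMA-style without a reordering buffer. A minimal sketch (the function names and 4-byte MTU are assumptions for illustration):

```python
import random

def spray(message: bytes, mtu: int = 4):
    """Cut a message into packets, each tagged with the offset of its
    destination in the receive buffer (its final address in memory)."""
    return [(off, message[off:off + mtu])
            for off in range(0, len(message), mtu)]

def deliver(packets, total_len: int) -> bytes:
    """Receiver side: write every payload at its own offset.
    Arrival order is irrelevant, so no strict sequencing is needed."""
    buf = bytearray(total_len)
    for off, payload in packets:
        buf[off:off + len(payload)] = payload
    return bytes(buf)

msg = b"gradient shard 0042"
packets = spray(msg)
random.Random(7).shuffle(packets)          # paths deliver out of order
assert deliver(packets, len(msg)) == msg   # reassembly still succeeds
```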

Additionally, the system responds adaptively to degradation. If a path starts losing packets or becomes overloaded, it is excluded from use within microseconds: on packet loss, the system assumes failure and switches immediately. To avoid false positives, packet trimming is used: under overload, the switch trims the payload and forwards only the header, triggering a selective retransmission of just that packet. This reduces the chance of misinterpreting congestion as failure.
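The two signals can be modeled as a small state machine: a trimmed header means "congested, resend this one packet, keep the path", while complete silence means "assume failure, stop using the path". This is a toy sketch of the distinction, not MRC's actual endpoint API:

```python
class PathHealth:
    """Toy model of the two degradation signals described above."""

    def __init__(self, num_paths: int):
        self.alive = set(range(num_paths))  # paths still in rotation
        self.to_retransmit = []             # packets to resend

    def on_trimmed_header(self, path: int, seq: int):
        # The switch dropped the payload but forwarded the header:
        # congestion, not failure. Resend one packet, keep the path.
        self.to_retransmit.append(seq)

    def on_timeout(self, path: int, seq: int):
        # Nothing came back, not even a trimmed header: assume the
        # path failed and stop spraying onto it, within microseconds
        # and without any routing-protocol reconvergence.
        self.alive.discard(path)
        self.to_retransmit.append(seq)
```

Trimming is what keeps `on_timeout` rare: most congestion events surface as trimmed headers and never evict a healthy path.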

A separate layer of simplification is the abandonment of dynamic routing (e.g., BGP) in favor of static tables and SRv6. Each packet carries the complete route through the network. Switches operate deterministically and do not recalculate paths during failures. This eliminates convergence delays, which in traditional networks can take seconds. In MRC, the response occurs at the connection level and is completed within microseconds.
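Under source routing, a switch's job collapses to consuming the next segment of a list the sender wrote into the packet. The sketch below illustrates that determinism with made-up switch names; real SRv6 encodes segments as IPv6 addresses in a Segment Routing Header:

```python
def forward(packet: dict, switch_id: str):
    """A switch under source routing: it never computes a route, it
    only consumes the next segment if that segment names itself."""
    segments = packet["segments"]
    if not segments or segments[0] != switch_id:
        raise ValueError(f"packet misrouted at {switch_id}")
    packet["segments"] = segments[1:]
    # Return the next hop, or None when the packet has arrived.
    return packet["segments"][0] if packet["segments"] else None

# The sender encodes the full hop list up front; no switch ever
# consults or recomputes a routing table along the way.
pkt = {"segments": ["leaf-1", "spine-4", "leaf-7"], "payload": b"..."}
```

Because no switch holds dynamic routing state, there is nothing to reconverge when a link fails; the sender simply stops listing segments that traverse the dead link.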

From an operational perspective, this changes the behavior of the system. Link failures and even switch reboots cease to be critical events. The network continues to operate, and the training job does not require a restart. Loss of some bandwidth leads to degradation, but it is usually less than the physical loss of resources. When links are restored, they automatically return to the pool of available paths.

The results are systemic. The impact of congestion is reduced, latency between flows is balanced, and sensitivity to failures is decreased. There is also the possibility of building clusters of more than 100,000 GPUs with fewer network levels. This reduces energy consumption and the number of components. Exact metrics are not available in the original data, but it is claimed that the impact of network failures on synchronous training becomes virtually negligible.

In a broader context, MRC is an attempt to shift complexity from the network to the endpoints and make system behavior deterministic. Such approaches are already being discussed in the industry, especially in the context of AI infrastructure, where predictability is more important than peak performance. At scales where the network defines the efficiency of GPU utilization, such trade-offs appear pragmatic.
