B2B Engineering Insights & Architectural Teardowns

Reverse Address Translation in multi-GPU systems

Reverse Address Translation becomes a hidden source of latency in multi-GPU clusters. This analysis shows how address translation affects All-to-All collectives and where performance is lost.

As distributed ML workloads grow, the bottleneck is no longer just the network but also the semantics of memory access. Scale-up fabrics such as NVLink and UALink provide direct memory access between GPUs but add a new step: Reverse Address Translation (NPA → SPA) on the receiver's side. This operation is performed in the Link MMU and has received little prior analysis as a source of delay. The problem does not surface immediately; it appears once the system starts running small, latency-sensitive collective operations.
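As a rough illustration (not the paper's implementation), receiver-side translation amounts to a page-number lookup plus offset reassembly. The page size, table contents, and all names below are assumptions of this sketch:

```python
# Hypothetical sketch of receiver-side Reverse Address Translation (NPA -> SPA).
# Page size and page-table contents are illustrative, not from the paper.

PAGE_SIZE = 1 << 16  # assume 64 KiB fabric pages

# Toy page table: network-physical page number -> system-physical page number
page_table = {0x10: 0xA0, 0x11: 0xA1, 0x12: 0xB7}

def translate(npa: int) -> int:
    """Translate a Network Physical Address to a System Physical Address."""
    page, offset = divmod(npa, PAGE_SIZE)
    spa_page = page_table[page]  # a miss here would trigger a page walk
    return spa_page * PAGE_SIZE + offset
```

In hardware this lookup is what the Link TLB caches; the dictionary access stands in for the page-table walk that a TLB miss would trigger.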

Architecturally, translation is a hierarchy: an L1 Link TLB at the station level, a shared L2 TLB, and then a page walk through the page table walker. The authors modeled this path in ASTRA-sim with an OMNeT++ network backend, adding detailed emulation of the Link MMU. The workload is All-to-All, one of the most intensive communication patterns in distributed training. Crucially, accesses are modeled as cache misses to isolate the impact of translation rather than data caching.
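The hierarchy above can be sketched as a toy cost model; the capacities and latencies here are illustrative assumptions, not figures from the paper:

```python
from collections import OrderedDict

class TLB:
    """Tiny LRU TLB; capacity is in page entries."""
    def __init__(self, capacity: int):
        self.entries = OrderedDict()
        self.capacity = capacity

    def lookup(self, page: int) -> bool:
        if page in self.entries:
            self.entries.move_to_end(page)  # refresh LRU position
            return True
        return False

    def insert(self, page: int) -> None:
        self.entries[page] = True
        self.entries.move_to_end(page)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

# Assumed latencies in nanoseconds, purely for illustration
L1_COST, L2_COST, WALK_COST = 5, 30, 600

def translate_cost(page: int, l1: TLB, l2: TLB) -> int:
    """Latency of one translation through the L1 TLB / L2 TLB / page-walk path."""
    if l1.lookup(page):
        return L1_COST
    if l2.lookup(page):
        l1.insert(page)
        return L1_COST + L2_COST
    # Full page walk, then fill both TLB levels
    l1.insert(page)
    l2.insert(page)
    return L1_COST + L2_COST + WALK_COST
```

A cold access pays the full walk; a warm one pays only the L1 hit, which is the asymmetry the rest of the analysis turns on.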

The key effect is cold TLB misses. For small collectives (e.g., 1 MB), they dominate: each request must traverse the full translation path, including the page walk. This results in up to 1.4× degradation compared to an ideal scenario without translation; in latency terms, up to ~30% of request time is spent solely on address translation. For larger collectives, the situation smooths out: the working set of pages stabilizes, the TLB warms up, and the cost of translation is amortized across requests.
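The two figures are consistent with each other, as a quick check shows (the baseline latency here is an arbitrary assumption; only the ratio matters):

```python
# Sanity check: a 1.4x slowdown implies translation occupies ~29% of the
# observed request time. Baseline latency is an arbitrary assumption.
ideal_us = 1.0                      # assumed request latency without translation
slowdown = 1.4                      # degradation reported for small collectives
observed_us = ideal_us * slowdown
translation_share = (observed_us - ideal_us) / observed_us
print(f"{translation_share:.0%}")   # prints 29%
```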

The reason lies in the access pattern. All-to-All in ML behaves like strided streaming: there is spatial locality within a page but almost no temporal locality between pages. Each GPU effectively streams through one active page per peer at a time, which means the translation working set scales with the number of GPUs rather than with data size. An unexpected conclusion follows: growing the L2 TLB beyond roughly one entry per GPU yields little benefit. Even a small TLB covering one page per GPU shows comparable performance.
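Under this streaming model, the translation working set reduces to a one-liner; the "one active page per remote peer" rule is this sketch's reading of the behavior described above, not a formula from the paper:

```python
def translation_working_set(num_gpus: int, message_bytes: int) -> int:
    """Pages live at once under the streaming All-to-All pattern (assumed model):
    each remote peer streams through one active page at a time, so the working
    set is one page per remote GPU, independent of message size."""
    del message_bytes  # intentionally unused: data size does not enter the model
    return num_gpus - 1  # one active page per remote peer
```

The model makes the counterintuitive result concrete: the working set for an 8-GPU node is the same 7 pages whether the collective moves 1 MB or 1 GB, so a TLB sized past that point buys nothing.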

A detailed look at the hierarchy confirms this behavior. Although more than 90% of requests formally hit in the L1 MSHR, they may still depend on incomplete page walks below. For small collectives, L2 TLB misses and page walks dominate, producing high tail latency. As data volume grows, the share of L1 and L2 TLB hits rises and the impact of the deep hierarchy diminishes. This is a classic case where average hit-rate metrics hide real latency due to "hit-under-miss" effects.
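The hit-under-miss effect can be sketched in a few lines: a request that merges into an outstanding miss (an MSHR "hit") still completes only when the underlying page walk does. Names and latencies are assumptions of this sketch:

```python
def effective_latency_ns(mshr_hit: bool, walk_remaining_ns: float,
                         l1_hit_ns: float = 5.0) -> float:
    """Latency seen by one request (toy model, assumed numbers).

    An MSHR 'hit' merges into an in-flight miss: it is counted as a hit in
    average statistics, yet it waits for the remaining page-walk time.
    """
    if mshr_hit:
        return max(l1_hit_ns, walk_remaining_ns)  # gated by the pending walk
    return l1_hit_ns  # a true L1 TLB hit returns immediately
```

A request arriving 200 ns into a 600 ns walk is logged as a hit but still waits 400 ns, which is why hit-rate averages understate tail latency.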

The practical takeaway is that optimization should target cold-start behavior rather than TLB size. The authors propose two approaches. The first is fused pre-translation kernels: translation is initiated in advance and overlapped with computation. The second is software-guided TLB prefetching, where the system preloads expected entries. Both approaches aim to hide latency rather than eliminate it, which looks like a pragmatic compromise.
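The latency-hiding idea behind both proposals can be captured in a small model; this is an assumption-laden sketch of the principle, not the authors' kernels:

```python
def exposed_translation_ns(walk_ns: float, overlap_ns: float,
                           prefetch: bool) -> float:
    """Translation latency left on the critical path (toy model).

    With pre-translation/prefetching, the page walk is issued early and
    overlaps with 'overlap_ns' of independent work (compute or earlier
    communication); only the uncovered tail remains exposed.
    """
    if prefetch:
        return max(walk_ns - overlap_ns, 0.0)  # hidden behind other work
    return walk_ns  # cold walk sits fully on the critical path
```

If the walk (say 600 ns) fits under the available overlap window, its exposed cost drops to zero; the walk itself still happens, which is exactly the "hide, not eliminate" trade-off.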

For the industry, this matters most in inference scenarios. Unlike training, where large batches smooth out delays, inference often moves small messages and is highly latency-sensitive. Under such conditions, Reverse Address Translation becomes part of the critical path, and optimizing this layer can yield significant gains without changing the network architecture or communication algorithms.

In a broader context, this research shows a shift: the bottleneck moves from the network level to the level of memory semantics. When accelerators gain direct access to remote memory, the cost of translation becomes comparable to data transfer. This changes the priorities in designing interconnects and MMUs — from throughput to latency management at the addressing level.

Information source

arXiv is the largest open preprint repository (operating since 1991 under the auspices of Cornell University), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed publications. arxiv.org

View the original research PDF
