Slice spraying, the dynamic distribution of fragments of large transfers across all available interconnects, reduces latency and increases throughput in disaggregated LLM serving.
The problem arises when GPU clusters cease to be homogeneous. Modern LLM systems run over a mix of NVLink, RDMA, PCIe, and other interconnects with widely varying bandwidths and latencies. Under these conditions, classic transfer engines rely on static path binding and simple load distribution, leading to head-of-line blocking, throughput degradation, and cluster fragmentation. This is particularly painful in disaggregated LLM serving, where "elephant flows" that move gigabytes of KVCache or model weights sit directly on the critical response path.
TENT proposes to change the very model of data transfer management. Instead of the application selecting the transport (RDMA, NVLink, etc.) in advance, the system accepts a declarative intent: what needs to be transferred and where. The engine then decides how to accomplish this. Internally, it combines all available interconnects into a single pool and breaks large transfers into smaller slices. These slices are dynamically distributed across links based on telemetry — current load, queues, and bandwidth. Architecturally, this relies on the abstraction of segments, which hide the physical location of data, and pluggable backends for different transports.
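To make the intent model concrete, here is a minimal sketch of the idea: the caller declares what to move and where, while slicing is the engine's job. All names here (`TransferIntent`, `slice_transfer`, the segment ids) are illustrative assumptions, not TENT's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TransferIntent:
    src_segment: str   # logical segment id; hides the physical location of data
    dst_segment: str
    size_bytes: int

def slice_transfer(intent: TransferIntent, slice_bytes: int) -> List[Tuple[int, int]]:
    """Split one large transfer into (offset, length) slices that the engine
    can later spray across the pooled interconnects."""
    slices = []
    offset = 0
    while offset < intent.size_bytes:
        length = min(slice_bytes, intent.size_bytes - offset)
        slices.append((offset, length))
        offset += length
    return slices

# A 10 MiB KVCache transfer split into 4 MiB slices yields three slices,
# the last one covering the 2 MiB remainder.
intent = TransferIntent("prefill/kv0", "decode/kv0", 10 * 1024 * 1024)
slices = slice_transfer(intent, 4 * 1024 * 1024)
```

Once sliced, each fragment becomes an independent scheduling unit, which is what lets the engine treat heterogeneous links as a single pool.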
The key effect comes from abandoning state-blind striping. In the traditional round-robin model, a slow link becomes a bottleneck for the entire transfer. TENT instead evaluates the expected completion time of each slice and sends it via the currently fastest path, which eliminates head-of-line blocking and balances load across multi-rail networks. In experiments on H800 HGX clusters, the system demonstrated up to a 1.36× increase in throughput and a 26% reduction in P90 TTFT compared to Mooncake TE. In the RL pipeline, parameter updates were accelerated by 20-26%, while in microbenchmarks throughput increased by 33% and P99 latency fell to 27.6% of the baseline.
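The contrast with round-robin can be sketched as a greedy scheduler that tracks queued bytes per link and places each slice on the link with the lowest expected completion time. The link names and bandwidth figures below are made up for illustration; the estimator is a simplified stand-in for the telemetry TENT actually collects.

```python
class Link:
    def __init__(self, name: str, gbps: float):
        self.name = name
        self.bandwidth = gbps * 1e9 / 8  # link bandwidth in bytes/second
        self.queued = 0                  # bytes already enqueued on this link

    def ect(self, slice_bytes: int) -> float:
        """Expected completion time if this slice is appended to the queue."""
        return (self.queued + slice_bytes) / self.bandwidth

def schedule(slice_sizes, links):
    """Greedily assign each slice to the link with the lowest ECT."""
    placement = []
    for size in slice_sizes:
        best = min(links, key=lambda link: link.ect(size))
        best.queued += size
        placement.append(best.name)
    return placement

# One fast NVLink-like rail and one slow PCIe-like rail: the slow rail still
# absorbs some slices, but never stalls the whole transfer the way a
# round-robin stripe across both rails would.
links = [Link("nvlink0", 400), Link("pcie0", 64)]
plan = schedule([4 * 2**20] * 8, links)
```

With round-robin, half the slices would land on the slow rail and dominate the transfer's completion time; here the fast rail takes the majority of slices and the rails finish at roughly the same time.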
An additional layer is fault tolerance. In production environments, failures occur constantly: NIC degradation, GPU errors, unstable links. In classic architecture, this is handled at the control plane level and requires manual intervention or a restart. TENT shifts this logic into the data plane. If a slice fails, it is automatically rerouted via an alternative path. Recovery occurs within tens of milliseconds (<50 ms), without application involvement. This transforms frequent hardware failures from incidents into brief performance fluctuations.
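The failover behavior can be sketched as a retry loop living entirely in the data plane: a failed slice is simply replayed on the next-best path, and the caller only sees the successful result. The `send` function below simulates a transport backend; all names and the health map are hypothetical.

```python
def send(link: str, slice_id: int, healthy: dict):
    """Simulated transport backend: raises if the link is unhealthy."""
    if not healthy.get(link, True):
        raise ConnectionError(f"{link} is down")
    return (link, slice_id)

def transfer_with_failover(slice_id: int, candidate_links, healthy: dict):
    """Try candidate paths in preference order; reroute the slice on failure
    without surfacing the error to the application."""
    for link in candidate_links:
        try:
            return send(link, slice_id, healthy)
        except ConnectionError:
            continue  # reroute this slice via the next alternative path
    raise RuntimeError("all paths failed")

# rdma0's NIC has degraded; the slice transparently lands on rdma1.
healthy = {"rdma0": False, "rdma1": True}
result = transfer_with_failover(7, ["rdma0", "rdma1"], healthy)
```

A real engine would also re-weight its telemetry so subsequent slices avoid the degraded link, which is what turns a hardware failure into a brief performance fluctuation rather than an incident.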
For the industry, this represents a pragmatic shift: data movement becomes a standalone layer with its own scheduling logic, rather than a side effect of compute. This approach is particularly beneficial in heterogeneous clusters and multi-tenant environments, where topology and network state are constantly changing. The trade-off is the complexity of the data plane and the need for precise telemetry. However, the gains in bandwidth utilization, latency stability, and reduced operational load make this compromise justified for high-load LLM systems.
Information source
arXiv is the largest open preprint repository (operating since 1991 under the auspices of Cornell University), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed publications. (arxiv.org)