B2B Engineering Insights & Architectural Teardowns

DWDP for LLM Inference Without Inter-GPU Synchronization

Distributed Weight Data Parallelism (DWDP) reduces the impact of synchronization in LLM inference through asynchronous execution and selective weight loading.

Scaling LLM inference across multiple GPUs is hard not because of parallelization itself, but because of synchronization. Classic strategies (tensor, pipeline, and expert parallelism) require inter-GPU coordination at every layer. In real production conditions, this becomes a bottleneck, and the reason is load imbalance: varying sequence lengths, different KV-cache hit rates, and uneven expert routing in MoE models. Per-rank latencies begin to diverge, and the whole system is limited by the slowest rank. According to the paper, even a moderate imbalance (a coefficient of variation of 20%) produces roughly 12% synchronization overhead.
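The effect of per-layer barriers under imbalance is easy to see in a toy model. The sketch below (illustrative only, not the paper's methodology) draws per-rank step times with a given coefficient of variation; synchronous execution pays the maximum across ranks each step, while barrier-free execution pays each rank's own time. The exact overhead depends on the rank count and latency distribution, both chosen arbitrarily here.

```python
import random
import statistics

def sync_overhead(n_ranks=8, cv=0.2, iters=2000, seed=0):
    """Estimate the relative cost of a per-layer barrier when per-rank
    step times vary (coefficient of variation = cv). Synchronous
    execution waits for max() across ranks; asynchronous execution
    advances each rank at its own pace (mean time)."""
    rng = random.Random(seed)
    sync_total, async_total = 0.0, 0.0
    for _ in range(iters):
        # per-rank step times: mean 1.0, std = cv (clipped positive)
        times = [max(1e-6, rng.gauss(1.0, cv)) for _ in range(n_ranks)]
        sync_total += max(times)               # barrier: wait for slowest
        async_total += statistics.mean(times)  # no barrier: average work
    return sync_total / async_total - 1.0

print(f"barrier overhead ~ {sync_overhead():.1%}")
```

With 8 ranks the max-statistic penalty is noticeably larger than with 2; this is the same mechanism that makes imbalance more costly as systems scale.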

DWDP shifts the trade-off point. Instead of a synchronous scheme, each GPU (rank) remains a data-parallel executor, but MoE weights are distributed across neighboring GPUs, and missing experts are loaded on demand. The key change is removing collective operations (e.g., all-to-all via NCCL) from the critical path. Peer-to-peer transfers via cudaMemcpyAsync are used instead; these run on the copy engine and do not occupy compute resources (SMs). Each rank can therefore proceed independently, without waiting for the others.
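A minimal sketch of the on-demand pattern, with Python threads standing in for asynchronous P2P copies on a dedicated copy stream. All names (`fetch_from_peer`, `run_layer`, the expert tables) are hypothetical illustrations, not the paper's API:

```python
from concurrent.futures import ThreadPoolExecutor

local_experts = {0: "w0", 2: "w2"}  # experts resident on this rank
peer_experts  = {1: "w1", 3: "w3"}  # experts held by a neighbor rank

def fetch_from_peer(expert_id):
    # placeholder for an async peer-to-peer copy (cudaMemcpyAsync-like);
    # returns the remote weight buffer
    return peer_experts[expert_id]

def run_layer(needed, pool):
    # 1. kick off fetches for experts this rank does not hold
    futures = {e: pool.submit(fetch_from_peer, e)
               for e in needed if e not in local_experts}
    # 2. compute proceeds with local experts while copies are in flight
    weights = {e: local_experts[e] for e in needed if e in local_experts}
    # 3. join: remote buffers are ready by the time compute needs them
    weights.update({e: f.result() for e, f in futures.items()})
    return weights

with ThreadPoolExecutor(max_workers=2) as pool:
    print(run_layer([0, 1, 3], pool))
```

The point is structural: no rank ever enters a collective call, so a slow neighbor delays only the transfers that actually target it.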

Architecturally, the system is built around overlapping computation and data transfer. While a GPU executes the MoE block of layer l and the attention of the next layer, it asynchronously prefetches the expert weights for layer l+1. If the compute window is large enough, the transfer latency is fully hidden. Formally, T_DWDP = max(T_compute, T_prefetch), in contrast to the classical T_DEP = T_compute + T_all2all. Efficiency therefore depends directly on the ratio of compute to communication: as the input sequence grows, the compute window grows with it, and DWDP begins to win. The paper reports a break-even point of about 16K tokens at batch size 1.
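The max-versus-sum formula can be made concrete with a back-of-the-envelope latency model. The constants below (FLOPs per token, GPU throughput, weight size, P2P bandwidth) are invented round numbers for illustration, not measurements from the paper:

```python
def per_layer_latency(seq_len, flops_per_tok=2e9, gpu_flops=1e15,
                      bytes_to_prefetch=2e9, bw=4e11):
    """Toy model: compute time grows with sequence length, while the
    prefetch time is fixed by expert weight size and P2P bandwidth."""
    t_compute = seq_len * flops_per_tok / gpu_flops
    t_prefetch = bytes_to_prefetch / bw
    t_dwdp = max(t_compute, t_prefetch)   # overlapped prefetch
    t_dep = t_compute + t_prefetch        # serialized all-to-all
    return t_dwdp, t_dep

for s in (1024, 8192, 65536):
    dwdp, dep = per_layer_latency(s)
    print(f"seq={s:6d}  DWDP={dwdp*1e3:.2f} ms  DEP={dep*1e3:.2f} ms")
```

Once t_compute exceeds t_prefetch, the transfer is fully hidden and DWDP pays only the compute cost; below that point, the prefetch latency itself becomes the floor. This is the mechanism behind the sequence-length threshold.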

Practical implementation reveals two classes of overhead. The first is managing the distributed weights: a naive approach merges local and remote weights into a single buffer before computation, adding a device-to-device copy. This is eliminated by modifying the groupedGEMM kernel so that it reads from multiple buffers directly. The second is degradation under contention during asynchronous requests: several GPUs may simultaneously pull data from a single source, creating many-to-one pressure on its link. To address this, time-division multiplexing is introduced: weights are split into chunks, and copies are scheduled round-robin across requesters. This reduces the likelihood of bottlenecks and makes better use of the copy engine.
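The round-robin chunk scheduling can be sketched as a simple queue rotation. This is an assumed illustration of the time-division-multiplexing idea, not the paper's scheduler:

```python
from collections import deque

def round_robin_schedule(requests, chunk_counts):
    """Time-division multiplexing sketch: each requesting rank's weight
    transfer is split into chunks, and the source's copy engine serves
    one chunk per rank per turn instead of draining one full transfer
    before starting the next."""
    queues = deque((rank, deque(range(chunk_counts[rank])))
                   for rank in requests)
    order = []
    while queues:
        rank, chunks = queues.popleft()
        order.append((rank, chunks.popleft()))  # serve one chunk
        if chunks:
            queues.append((rank, chunks))       # rotate to the back
    return order

# three ranks pulling 2, 3, and 1 chunks from the same source GPU
print(round_robin_schedule(["r0", "r1", "r2"],
                           {"r0": 2, "r1": 3, "r2": 1}))
```

Interleaving bounds how long any one requester waits for its first chunk, which is what smooths out many-to-one contention at the source.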

Metrics show that eliminating synchronization has a significant effect, but not without compromises. In the DeepSeek-R1 configuration on NVL72, the gain reaches 8.8% TPS/GPU with comparable TPS/user. In micro-benchmarks of the context phase, acceleration reaches 1.09–1.11× in throughput and up to 1.27× in TTFT. However, part of the gain is consumed by overhead: interference between compute and communication and effects like frequency throttling. After optimizations, the final improvement in iteration latency is about 11.7% instead of the theoretical ~21%.

Interestingly, DWDP particularly benefits from increasing load imbalance. The less uniform the system, the greater the penalty for synchronous strategies and the higher the relative efficiency of the asynchronous model. However, at high TPS/user, the effect diminishes: the system becomes generation-bound, and optimizing the context phase contributes less. Moreover, there is an increase in TTFT due to worsening rate matching between stages and a decrease in the number of context GPUs.

For the industry, this appears as a pragmatic shift from “perfectly synchronous” models to more loosely coupled execution models. DWDP does not eliminate the need for load balancing but reduces its criticality. In systems with high-speed interconnects (e.g., NVLink-like topologies), this approach becomes particularly applicable. However, it requires co-design at the level of runtime, kernel, and communications—without this, overhead quickly negates the gain.

The main conclusion: synchronization becomes a limitation before computational resources are exhausted. DWDP shows that partial abandonment of synchronization is not a radical step but an engineering compromise that provides measurable gains under real workloads.

Information source

arXiv is the largest open preprint repository (operating since 1991 under the auspices of Cornell University), where researchers post working versions of papers ahead of formal publication. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed venues.

