Optimizing MoE inference hinges on load balancing. CRAFT demonstrates how to manage expert replication without overspending GPU memory.
Mixture-of-Experts (MoE) has become the standard recipe for scaling LLMs through sparse activation of experts. During inference, however, a systematic problem arises: load imbalance at the expert level. The router directs tokens unevenly, so some GPUs become overloaded while others sit idle. Expert Parallelism exacerbates this: tokens are shuffled between devices via all-to-all communication, so a hot expert creates additional network contention and adds latency.
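The effect can be seen in a few lines. The token counts below are hypothetical (in practice they come from profiling the router), but they show how a couple of hot experts translate into a skewed per-GPU load:

```python
# Hypothetical per-expert token counts for one MoE layer (8 experts).
tokens_per_expert = [900, 40, 35, 25, 850, 30, 60, 60]
num_gpus = 4
experts_per_gpu = len(tokens_per_expert) // num_gpus

# Naive contiguous placement: experts 0-1 on GPU 0, experts 2-3 on GPU 1, ...
gpu_load = [
    sum(tokens_per_expert[g * experts_per_gpu:(g + 1) * experts_per_gpu])
    for g in range(num_gpus)
]
mean_load = sum(gpu_load) / num_gpus
imbalance = max(gpu_load) / mean_load  # 1.0 would be perfectly balanced

print(gpu_load)   # [940, 60, 880, 120]
print(imbalance)  # 1.88: the busiest GPU does almost 2x the average work
```

Since every device must wait for the slowest one at the all-to-all boundary, the busiest GPU's load effectively sets the layer's latency.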
Classic approaches address this through expert placement and expert replication. Placement tries to balance the load by co-locating "hot" and "cold" experts on the same devices, but it fails under strong skew, when a few experts receive a disproportionately large share of tokens. Replication removes the skew but creates a new problem: extra GPU memory consumption, which shrinks the KV cache and reduces throughput. In real systems this becomes a trade-off between load balance and memory pressure.
CRAFT proposes shifting this trade-off through fine-grained replication at the layer level. The key observation is that not all MoE layers benefit equally from replication. In layers with high skew (where one expert receives more than 10× the average token flow), replication significantly improves balance; in layers with a uniform distribution, the effect is minimal. In addition, extra replicas yield diminishing returns: beyond a certain threshold, additional copies barely affect throughput but continue to consume memory.
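Both observations can be sketched with a simple model (the function name and token counts are hypothetical, not CRAFT's actual cost model): replicating the hottest expert k times divides its effective load by k, so gains shrink as 1/k and stop once the per-replica load falls to the next-hottest expert's level.

```python
# Toy model of layer-level replication benefit (names and numbers hypothetical).
def skew_with_replicas(token_counts, k):
    """Peak-to-mean load ratio when the hottest expert has k replicas."""
    mean = sum(token_counts) / len(token_counts)
    second = sorted(token_counts)[-2]
    # The hot expert's load is split across k copies; the peak is then either
    # one of those copies or the next-hottest (unreplicated) expert.
    peak = max(max(token_counts) / k, second)
    return peak / mean

hot_layer = [800, 100, 100, 100, 100, 100, 100, 100]  # strongly skewed
flat_layer = [200] * 8                                # uniform

for k in (1, 2, 4, 8, 16):
    print(k, round(skew_with_replicas(hot_layer, k), 2),
             round(skew_with_replicas(flat_layer, k), 2))
# Hot layer: 4.27 -> 2.13 -> 1.07 -> 0.53, then flat at k=16 (no further gain).
# Uniform layer: stays at 1.0 regardless of k, so replicas are wasted memory.
```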
Architecturally, CRAFT is built around a cost-aware model. First, the system profiles the distribution of tokens across experts offline and estimates the replication benefit for each layer. Next, allocation is formalized as an optimization problem under a memory constraint, a variant of the knapsack problem in which a limited budget of replicas must be distributed among layers; dynamic programming finds a near-optimal allocation. The final step is capacity-aware placement, which balances GPU memory usage and avoids fragmenting the KV cache.
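The allocation step can be sketched as a grouped knapsack solved by dynamic programming. The `benefit` table and budget below are hypothetical stand-ins for the profiled cost model; only the DP structure is meant to mirror the approach described above:

```python
# benefit[l][r]: estimated balance gain from giving layer l exactly r extra
# replicas (hypothetical numbers; gains are concave, reflecting diminishing
# returns per replica).
benefit = [
    [0.0, 0.50, 0.70, 0.75],  # highly skewed layer: big first-replica gain
    [0.0, 0.10, 0.15, 0.17],  # mildly skewed layer
    [0.0, 0.01, 0.01, 0.01],  # near-uniform layer: replication barely helps
]
budget = 3  # total extra replicas the GPU memory constraint allows

def allocate(benefit, budget):
    """Grouped-knapsack DP: best[b] = max total gain using b replicas."""
    best = [0.0] * (budget + 1)
    choice = []
    for gains in benefit:
        new = [0.0] * (budget + 1)
        pick = [0] * (budget + 1)
        for b in range(budget + 1):
            for r in range(min(b, len(gains) - 1) + 1):
                val = best[b - r] + gains[r]
                if val > new[b]:
                    new[b], pick[b] = val, r
        best = new
        choice.append(pick)
    # Backtrack to recover per-layer replica counts.
    alloc, b = [], budget
    for pick in reversed(choice):
        alloc.append(pick[b])
        b -= pick[b]
    return list(reversed(alloc)), best[budget]

alloc, total = allocate(benefit, budget)
print(alloc, round(total, 2))  # [2, 1, 0] 0.8
```

Note the DP spends the budget where the marginal gain is highest: two replicas go to the skewed layer, one to the mildly skewed layer, and none to the uniform layer.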
Results show a systemic effect: CRAFT delivers 1.14× higher throughput on average (up to 1.2×) compared with existing replication schemes. The key factor is reducing the number of replicas without losing balance: in experiments, CRAFT uses about 7× fewer replicas than the EPLB baseline while maintaining a comparable level of balancing. This directly frees memory for the KV cache and enables larger batch sizes, which is critical for high-load inference.
An interesting point is scalability. As the number of GPUs increases, the imbalance problem intensifies: fewer experts per device means worse load "smoothing." In this regime replication becomes even more important, but it is also where the cost of a mistake (over-replication) is highest. CRAFT shows that a layer-aware strategy adapts to this regime better than uniform replication.
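A toy calculation (hypothetical loads, one artificially hot expert) illustrates why spreading experts over more GPUs makes things worse: each device hosts fewer cold experts to dilute the hot one, so the peak-to-mean load ratio grows.

```python
# 64 experts, one "hot" (hypothetical loads), placed contiguously on the GPUs.
loads = [1000] + [10] * 63
total = sum(loads)

for num_gpus in (2, 8, 32):
    per_gpu = len(loads) // num_gpus
    hot_gpu_load = 1000 + 10 * (per_gpu - 1)  # GPU that hosts the hot expert
    mean_gpu_load = total / num_gpus
    print(num_gpus, round(hot_gpu_load / mean_gpu_load, 2))
# Imbalance climbs from ~1.6x at 2 GPUs to ~19.8x at 32 GPUs: the same skew
# that is tolerable at small scale dominates at large scale.
```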
For practitioners, this is a pragmatic improvement to the inference stack: the approach requires no model retraining and can be integrated into existing serving frameworks. The main takeaway is that replication must be managed and context-dependent. Simply increasing the number of expert copies quickly hits an efficiency ceiling; a more precise cost model and layer-level decisions give a better balance between throughput and memory footprint.
Information source
arXiv is the largest open preprint repository (operating since 1991 under the auspices of Cornell University), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed journals. arxiv.org