
CPU-free LLM inference

CPU-free LLM inference changes the critical path of inference by eliminating the CPU as a source of delay and instability.

Modern LLM serving architectures are more dependent on the CPU than they appear. Although the computation itself runs on the GPU, it is the CPU that manages the lifecycle of each token: batching, scheduling, KV-cache management, and CUDA graph launches. This makes the system sensitive to CPU interference, especially under colocation. As a result, operators are forced to reserve CPU headroom, sacrificing utilization for predictable latency and SLO compliance.
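To make the per-token dependency concrete, here is a minimal model (hypothetical numbers, not measurements from the paper) of why serialized CPU control work lands directly on the latency of every decoded token:

```python
# Toy model of a conventional CPU-driven decode loop. The CPU-side steps
# (batching, scheduling, KV-cache bookkeeping, kernel launch) run before
# each GPU step, so their cost adds to every token's latency.
# Both timings below are illustrative assumptions.

GPU_STEP_MS = 10.0   # assumed GPU compute time per token
CPU_STEP_MS = 10.0   # assumed CPU control overhead per token

def decode_step_latency(cpu_overhead_ms: float) -> float:
    """End-to-end latency of one token (ms): CPU work is serialized
    ahead of the GPU launch, so it is not hidden by compute."""
    return cpu_overhead_ms + GPU_STEP_MS

baseline = decode_step_latency(CPU_STEP_MS)
cpu_share = CPU_STEP_MS / baseline
print(f"CPU share of per-token latency: {cpu_share:.0%}")  # prints "50%"
```

With these (assumed) equal costs, the CPU accounts for half of the per-token latency, which is the order of magnitude the authors report for the control path; interference inflates the CPU term further while the GPU term stays fixed.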

The Blink architecture offers a different approach: removing the CPU from the steady-state inference path and redistributing its responsibilities between the GPU and a SmartNIC. The data plane moves to the SmartNIC, which accepts requests and writes data directly into GPU memory via RDMA. The control plane moves inside the GPU as a persistent CUDA kernel that manages batching, scheduling, and the KV-cache without round-trips to the CPU. Components communicate through a GPU-resident ring buffer, eliminating unnecessary copies and host-mediated synchronization.
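The NIC-to-GPU hand-off described above follows a classic single-producer/single-consumer ring-buffer pattern. Below is a sketch of that pattern in plain Python (standing in for device memory); the class, field names, and slot count are illustrative assumptions, not Blink's actual data structures:

```python
from dataclasses import dataclass

SLOTS = 8  # illustrative capacity; a power of two in real designs

@dataclass
class Request:
    req_id: int
    prompt_len: int

class RingBuffer:
    """SPSC queue: the SmartNIC is the only writer of `tail` (producer),
    the persistent GPU kernel is the only writer of `head` (consumer),
    so neither side needs a lock or a host syscall."""

    def __init__(self):
        self.slots = [None] * SLOTS
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, req: Request) -> bool:
        if self.tail - self.head == SLOTS:
            return False              # full: NIC back-pressures the sender
        self.slots[self.tail % SLOTS] = req  # RDMA write lands in this slot
        self.tail += 1                # publishing the counter acts as a doorbell
        return True

    def pop(self):
        if self.head == self.tail:
            return None               # empty: the persistent kernel keeps polling
        req = self.slots[self.head % SLOTS]
        self.head += 1
        return req
```

On real hardware the counters would live in GPU memory and be read with volatile/atomic loads by the polling kernel, but the invariant is the same: one writer per index, so the host never sits between producer and consumer.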

A key insight from the research is that the bottleneck is not the computation but the control path. The authors show that even with optimizations, the CPU can account for up to 50% of latency. Interference exacerbates the problem: the throughput of existing systems drops to 28–54% of baseline, while P99 TTFT (time to first token) can degrade by orders of magnitude. Blink eliminates this effect: it reduces P99 TTFT by 8.47×, TPOT (time per output token) by 3.40×, increases throughput by 2.1×, and decreases energy per token by 48.6%. Importantly, Blink's performance remains stable under CPU load, whereas traditional systems degrade by two orders of magnitude.

The practical takeaway for architects is that the problem cannot be solved by tuning the OS or isolating resources. Experiments with huge pages, core pinning, and cache partitioning show limited effect: even with LLC contention eliminated, latency changes little because the CPU remains in the critical loop. This points to an architectural trade-off: either dedicated resources with low utilization, or a shared environment with unstable latency. CPU-free LLM inference offers a third way, bringing orchestration closer to the data and the computation and reducing the critical path to GPU + NIC.
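For reference, core pinning, one of the mitigations tried above, looks like this in practice (Linux-only; a hypothetical helper, not code from the paper). It reduces cross-tenant preemption but, as the experiments show, cannot take the CPU off the per-token critical path:

```python
import os

def pin_to_cores(cores):
    """Restrict the current process to the given CPU cores (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is Linux-specific")
    os.sched_setaffinity(0, set(cores))   # pid 0 = the calling process
    return os.sched_getaffinity(0)        # report the effective mask

# Example: reserve the first available core for the inference server.
# pin_to_cores(sorted(os.sched_getaffinity(0))[:1])
```

Operators typically combine this with cgroup cpusets or `isolcpus`; the article's point is that even a perfectly isolated core still serializes batching and kernel launches ahead of every GPU step.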

Information source

arXiv is the largest open preprint repository (operating since 1991, under the auspices of Cornell), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed journals. arxiv.org

View the original research PDF
