
Accelerating KV cache restoration through 3D parallelism

KV cache restoration is becoming a bottleneck in LLM serving. CacheFlow applies 3D parallelism to cut restoration latency and time-to-first-token.

The problem surfaces once the system works with long contexts. The KV cache (the intermediate attention states) grows linearly with sequence length, but restoring it becomes non-linearly expensive: recomputation cost grows quadratically because of attention, while I/O restoration runs into bandwidth limits. In real scenarios (multi-turn chat, RAG, agent pipelines) this produces delays of seconds against a Time-To-First-Token (TTFT) target of roughly 200 ms.
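To see why the trade-off flips with length, consider a back-of-the-envelope cost model. The sketch below is a minimal illustration in Python; every constant (per-token FLOPs, KV bytes per token, GPU throughput, storage bandwidth) is an assumed round number for exposition, not a figure from the paper.

```python
# Back-of-the-envelope cost model: all constants below are assumptions
# chosen for illustration, not measurements from the CacheFlow paper.

def recompute_time(n_tokens: int,
                   linear_flops_per_token: float = 2e9,  # MLP/projection work
                   attn_flops_coeff: float = 1e5,        # quadratic attention term
                   gpu_flops: float = 300e12) -> float:
    """Prefill recomputation: linear in tokens plus a quadratic attention term."""
    flops = n_tokens * linear_flops_per_token + attn_flops_coeff * n_tokens**2
    return flops / gpu_flops

def io_restore_time(n_tokens: int,
                    kv_bytes_per_token: float = 160e3,   # assumed KV footprint
                    bandwidth_bytes_s: float = 20e9) -> float:
    """Loading a saved KV cache is linear in tokens and bandwidth-bound."""
    return n_tokens * kv_bytes_per_token / bandwidth_bytes_s

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: recompute {recompute_time(n)*1e3:8.1f} ms, "
          f"load {io_restore_time(n)*1e3:8.1f} ms")
```

With these assumed numbers, recomputation wins at short contexts and loading wins at long ones, which is exactly why a fixed either/or policy leaves latency on the table.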

Traditional approaches reduce the task to a binary choice: recompute or load (I/O). This simplification breaks under load. First, restoration cost is heterogeneous: late tokens are more expensive to recompute because attention over the growing prefix is quadratic. Second, production systems run batches across multiple GPUs, where requests contend for compute and bandwidth. Per-request optimization ignores these effects and produces stragglers and degraded tail latency.

CacheFlow reframes KV cache restoration as a multidimensional scheduling problem. The architecture is built around 3D parallelism: across tokens, layers, and GPUs. At the token level, it uses a two-pointer strategy: early tokens are recomputed, late tokens are loaded via I/O, and the two pointers converge in the middle. This avoids recomputation exactly where it is most expensive. At the layer level, a similar scheme is applied along the model's depth: lower layers are recomputed, upper layers are loaded. The choice between these modes depends on sequence length and is governed by a threshold obtained from profiling.
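A minimal sketch of the token-level two-pointer split, assuming illustrative cost functions (the names and the greedy balancing rule are mine for exposition; the paper's scheduler is more elaborate):

```python
# Hedged sketch of the token-level two-pointer split. The cost functions
# and the greedy balancing rule are illustrative assumptions, not
# CacheFlow's actual implementation.

def two_pointer_split(n_tokens, recompute_cost, io_cost):
    """Return the index where recomputation ends and I/O loading begins.

    recompute_cost(i): time to recompute token i (grows with i, since
        attention over the prefix makes late tokens more expensive).
    io_cost(i): time to load token i's KV entries (roughly flat).
    """
    lo, hi = 0, n_tokens          # tokens [0, lo) recomputed, [hi, n) loaded
    t_compute, t_io = 0.0, 0.0    # accumulated time on each pipeline
    while lo < hi:
        if t_compute <= t_io:     # compute pipeline is ahead: feed it a token
            t_compute += recompute_cost(lo)
            lo += 1
        else:                     # I/O pipeline is ahead: feed it a token
            t_io += io_cost(hi - 1)
            hi -= 1
    return lo                     # pointers met: the split point

# Toy costs: recomputation grows linearly with position, I/O is flat,
# so the split lands well before the midpoint of the sequence.
split = two_pointer_split(1000,
                          recompute_cost=lambda i: 1e-6 * (i + 1),
                          io_cost=lambda i: 2e-4)
print(f"recompute tokens [0, {split}), load tokens [{split}, 1000)")
```

Both pipelines finish at roughly the same time, so neither compute nor I/O sits idle while the other drags on.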

The third dimension is multi-GPU parallelism. Instead of restoring sequentially through a pipeline, CacheFlow uses saved boundary activations to break the dependency between devices: each GPU restores its shard of the KV cache independently. In theory this yields linear speedup with the number of GPUs, since compute and I/O are spread across devices; in practice the speedup is limited by load balancing but stays close to linear.
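Conceptually, the dependency break looks like the following sketch, written under my own assumptions (restore_shard and the thread-pool dispatch are hypothetical stand-ins, not CacheFlow's API):

```python
# Under pipeline parallelism, GPU g normally waits for GPU g-1's output
# before it can recompute its layer shard. If the boundary activations
# (the inputs to each shard) are saved alongside the KV cache, every
# shard can be restored concurrently. All names here are hypothetical.

from concurrent.futures import ThreadPoolExecutor

def restore_shard(gpu_id, boundary_activation, saved_kv_shard):
    # Placeholder: recompute some of this shard's layers from the saved
    # boundary activation and load the rest from storage (details elided).
    return f"shard {gpu_id} restored"

def restore_all(boundary_activations, saved_kv_shards):
    # No inter-GPU dependency chain: wall time is max(per-shard time)
    # rather than the sum, which is why speedup approaches linear.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(restore_shard, g, act, kv)
                   for g, (act, kv)
                   in enumerate(zip(boundary_activations, saved_kv_shards))]
        return [f.result() for f in futures]
```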

A key element of the system is the batch-aware two-pointer scheduler. It distributes compute and I/O across requests, weighting them by their "value": requests with long contexts get I/O priority, since loading their caches buys the largest reduction in future recomputation cost. This matters in batches, where contention for bandwidth can sharply inflate the latency of individual requests. The approach reduces the straggler effect and stabilizes tail latency.
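One way to express that prioritization is sketched below; the quadratic scoring rule is an assumption of mine, mirroring attention's cost curve, and the paper's exact policy may differ.

```python
# Hedged sketch of batch-aware I/O prioritization. The scoring rule and
# the fixed number of I/O slots are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    context_len: int  # tokens already in this request's KV cache

def io_priority(req: Request) -> float:
    # Quadratic in context length, mirroring the quadratic cost of
    # recomputing attention over a long prefix.
    return req.context_len ** 2

def schedule_io(batch: list[Request], io_slots: int) -> list[Request]:
    """Pick which requests load from storage; the rest recompute."""
    ranked = sorted(batch, key=io_priority, reverse=True)
    return ranked[:io_slots]

batch = [Request("a", 128_000), Request("b", 2_000), Request("c", 64_000)]
print([r.req_id for r in schedule_io(batch, io_slots=2)])  # ['a', 'c']
```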

Results show that CacheFlow reduces TTFT by 10–62% compared to existing approaches (vLLM, SGLang, LMCache, Cake). The effect is amplified on long contexts and in the tails of the distribution (P90–P99). At the same time, the system sustains high resource utilization: around 88% of GPU compute and 78% of I/O bandwidth, indicating effective overlap of the two. As input length grows, the gap over recomputation-only approaches widens, again due to the quadratic cost of attention.

For the industry, this is a pragmatic shift from local optimization to system-wide scheduling. KV cache restoration can no longer be treated as a binary choice between compute and I/O; it is a problem of coordinating resources across a multidimensional space of tokens, layers, devices, and batches. Such approaches are already being discussed in the context of high-load inference, where tail latency matters as much as the average. CacheFlow shows that, even without changing the model, latency can be cut substantially through more precise scheduling and exploitation of structural parallelism.


Information source

arXiv is the largest open preprint repository (running since 1991 under the auspices of Cornell University), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed publications. arxiv.org

View the original research PDF
