B2B Engineering Insights & Architectural Teardowns

KV cache optimization for multi-LoRA agents

ForkKV rethinks KV cache optimization for multi-LoRA serving, eliminating memory duplication and increasing throughput.

The problem arises in multi-LoRA agent serving, where multiple specialized agents run on top of a single base model. LoRA makes fine-tuning cheap, but at inference time the KV cache becomes the bottleneck: because adapter activations differ, the cache can no longer be shared even for identical contexts. This breaks prefix caching and makes memory consumption grow linearly with the number of agents, which directly reduces throughput and limits parallelism.
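The divergence is easy to see from the LoRA algebra. A minimal numpy sketch, with dimensions, adapter matrices A and B, and random weights chosen purely for illustration: the key projection becomes K = xW + (xA)B, so two agents with different adapters produce different cache entries for the very same prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, seq = 64, 4, 8          # hidden dim, LoRA rank, sequence length (illustrative)

W = rng.normal(size=(d, d))   # frozen base key projection, shared by all agents
x = rng.normal(size=(seq, d)) # identical prompt activations for both agents

def lora_keys(x, W, A, B):
    # LoRA key projection: K = xW + (xA)B, with adapter-specific A (d x r), B (r x d)
    return x @ W + (x @ A) @ B

# two agents with different adapters
A1, B1 = rng.normal(size=(d, r)), rng.normal(size=(r, d))
A2, B2 = rng.normal(size=(d, r)), rng.normal(size=(r, d))

K1 = lora_keys(x, W, A1, B1)
K2 = lora_keys(x, W, A2, B2)

# Same prompt, same base weights -- yet the cached keys differ,
# so a naive prefix cache cannot be shared across agents.
print(np.allclose(K1, K2))   # False
```

With a monolithic KV cache, each agent must therefore keep its own full copy of K and V for the shared prefix.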

ForkKV proposes an architectural shift: a disaggregated KV cache. Instead of storing a single monolithic cache, the system splits it into two components: a shared bCache (base cache) and an agent-specific rCache (residual cache). The design exploits the structure of LoRA: the base projection xW is much larger than the low-rank part xA. The split is managed through a DualRadixTree and a fork model with copy-on-write (CoW) semantics, analogous to process forking in an OS: a new agent inherits the shared bCache and creates only its own rCache, eliminating data duplication.
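The fork-and-share idea can be sketched in a few lines of numpy. This is a toy model of the concept, not the ForkKV runtime: the `Agent` class, dimensions, and reconstruction step are assumptions for illustration. Each fork keeps a reference to the one shared bCache (xW) and stores only a rank-sized residual (xA), reconstructing full keys on demand.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, seq = 64, 4, 8          # hidden dim, LoRA rank, sequence length (illustrative)

W = rng.normal(size=(d, d))   # frozen base projection
x = rng.normal(size=(seq, d)) # shared prompt activations

class Agent:
    """Toy fork: a reference to the shared bCache plus an agent-owned rCache."""
    def __init__(self, bcache, x, A, B):
        self.bcache = bcache      # shared full-width cache (seq x d), never copied
        self.B = B
        self.rcache = x @ A       # agent-specific residual, only (seq x r)

    def keys(self):
        # Reconstruct the full entry on demand: K = xW + (xA)B
        return self.bcache + self.rcache @ self.B

bcache = x @ W                    # computed once, inherited by every fork (CoW-style)

a1 = Agent(bcache, x, rng.normal(size=(d, r)), rng.normal(size=(r, d)))
a2 = Agent(bcache, x, rng.normal(size=(d, r)), rng.normal(size=(r, d)))

assert a1.bcache is a2.bcache             # one copy of the large base cache
print(a1.rcache.shape)                    # (8, 4): residual is rank-sized, not d-sized
print(np.allclose(a1.keys(), a2.keys()))  # False: full keys still differ per adapter
```

In the real system this reconstruction happens inside the attention kernel rather than ahead of time, which is what keeps the residuals cheap to store.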

The key engineering trade-off is a small loss of accuracy due to state divergence between agents. Empirically, this divergence is limited: input-state similarity exceeds 99.4%, and generation quality degrades by about 1.60%. The efficiency gains, meanwhile, are substantial: with a shared context of 32K tokens, ForkKV cuts memory consumption from tens of gigabytes to single digits and delivers up to a 3.0× increase in throughput. This comes not only from memory savings but also from ResidualAttention, a custom kernel that reconstructs the full KV cache directly in SRAM, avoiding HBM round-trips while preserving batch parallelism.
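Back-of-the-envelope arithmetic shows where the "tens of gigabytes to single digits" claim comes from. The model dimensions, agent count, and LoRA rank below are assumptions chosen for illustration, not the paper's exact configuration; the point is that per-agent residuals scale with the rank rather than the full KV width.

```python
# Illustrative KV-cache memory arithmetic (assumed dims, not the paper's setup)
GiB = 1024 ** 3

layers, kv_heads, head_dim = 32, 8, 128   # assumed model architecture
tokens = 32_768                           # 32K-token shared context
bytes_per = 2                             # fp16 cache entries
agents = 8                                # assumed number of co-served agents
rank = 16                                 # assumed LoRA rank

kv_dim = kv_heads * head_dim                        # 1024 floats per token per layer
full = 2 * layers * tokens * kv_dim * bytes_per     # K and V cache for one agent

naive  = agents * full                              # monolithic: duplicated per agent
forkkv = full + agents * full * rank // kv_dim      # shared bCache + rank-sized rCaches

print(f"naive : {naive / GiB:.1f} GiB")   # 32.0 GiB
print(f"forkkv: {forkkv / GiB:.1f} GiB")  # 4.5 GiB
```

Under these assumptions each residual is rank/kv_dim = 1/64 the size of a full cache, so adding agents costs megabytes instead of gigabytes.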

For the industry, this looks like a pragmatic way to scale agent-based systems: instead of scaling models horizontally, the system packs agents more densely through more efficient memory use. The approach is especially applicable to workloads with long shared contexts, such as codebases or document pipelines. The main limitation is implementation complexity: it requires a specialized runtime, custom kernels, and a new memory-management model. Still, the underlying principle, decomposing the KV cache with CoW semantics, already looks like a durable pattern for high-load LLM serving.

Information source

arXiv is the largest open preprint repository (run since 1991 under the auspices of Cornell University), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed journals. arxiv.org

View the original research PDF

