As LLM production workloads grow, one limitation becomes clear: classic Kubernetes mechanisms do not understand the nature of inference. llm-d is an attempt to close this gap at the platform level.
The main limitation shows up as soon as inference stops looking like a stateless HTTP service. LLM requests vary widely in cost: prompt length, generation phase, and KV-cache hits all matter. To Kubernetes, though, every request looks the same. The result is poor pod placement, broken cache locality, and latency that fluctuates under load. This is especially visible in multi-tenant scenarios, where efficiency depends on context reuse, and where basic autoscaling and service-routing mechanisms are blind to inference state.
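A minimal sketch of the asymmetry described above: two requests that a plain Kubernetes Service would treat identically can differ enormously in prefill work depending on KV-cache reuse. The `Request` fields and the cost model here are illustrative assumptions, not part of llm-d.

```python
# Hypothetical sketch: why identical-looking requests have very different costs.
# Field names and the cost model are illustrative, not llm-d's actual API.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int    # total length of the prompt
    cached_prefix: int    # tokens already present in a pod's KV-cache
    max_new_tokens: int   # generation budget

def prefill_cost(req: Request) -> int:
    """Tokens that must actually be prefilled (prompt minus cache hit)."""
    return max(req.prompt_tokens - req.cached_prefix, 0)

# Two requests that look the same to a round-robin load balancer:
a = Request(prompt_tokens=8000, cached_prefix=7900, max_new_tokens=100)
b = Request(prompt_tokens=8000, cached_prefix=0, max_new_tokens=100)

print(prefill_cost(a))  # 100  -> near-zero time to first token
print(prefill_cost(b))  # 8000 -> full prefill, ~80x more prefill work
```

Routing that ignores this difference will happily send request `a` to a cold pod, throwing away the cache hit entirely.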
llm-d proposes treating distributed inference as a first-class workload. The key idea is to connect the orchestration layer (Kubernetes) with inference semantics. The project sits between the control plane (for example, KServe) and engines such as vLLM, adding state-aware routing and cache management. The trade-off is obvious: more platform complexity and a new layer of abstraction, but in return, control over key inference metrics and an end to black boxes in production.
The implementation rests on several principles. First, hierarchical KV-cache management with offload between GPU, CPU, and storage, which lets operators balance speed against cost. Second, routing based on model and state: a request goes to where the required context or suitable hardware is already available. Third, disaggregated serving: separating roles (for example, leader/worker via LWS) simplifies scaling and resource management. Special emphasis goes to a hardware-agnostic approach: the system should work equally well with NVIDIA, AMD, TPU, and other accelerators. This reduces vendor lock-in but demands careful abstraction over differences in performance and memory.
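The second principle, state-aware routing, can be sketched in a few lines: score each pod by how much of the incoming prompt it already holds in its KV-cache, and send the request to the best match. The pod registry and the token-tuple representation are assumptions made for the example; llm-d's real scheduler works with cache-block metadata, not raw token lists.

```python
# Illustrative sketch of prefix-cache-aware routing, NOT llm-d's actual code:
# pick the pod whose KV-cache shares the longest prefix with the prompt.
from typing import Dict, List, Tuple

TokenSeq = Tuple[int, ...]

def longest_cached_prefix(prompt: TokenSeq, cached: List[TokenSeq]) -> int:
    """Length of the longest prefix of `prompt` matching any cached sequence."""
    best = 0
    for seq in cached:
        n = 0
        for p, c in zip(prompt, seq):
            if p != c:
                break
            n += 1
        best = max(best, n)
    return best

def route(prompt: TokenSeq, pods: Dict[str, List[TokenSeq]]) -> str:
    """Score each pod by cache overlap with the prompt; return the winner."""
    return max(pods, key=lambda name: longest_cached_prefix(prompt, pods[name]))

pods = {
    "pod-a": [(1, 2, 3, 4, 5)],  # warm: shares a 5-token prefix
    "pod-b": [(9, 9)],           # cold for this prompt
}
print(route((1, 2, 3, 4, 5, 6, 7), pods))  # pod-a
```

A production scheduler would also weigh load and queue depth, so a slightly worse cache hit on an idle pod can beat a perfect hit on a saturated one; the sketch deliberately leaves that out.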
Practical results show exactly where the gains appear. In a multi-tenant SaaS scenario with prefix caching, llm-d keeps Time To First Token near zero as load increases, while a standard Kubernetes Service degrades quickly. Throughput reaches about 120k tokens per second on a cluster of 8 vLLM replicas across 16 H100 GPUs, with latency staying predictable. Importantly, the project emphasizes reproducible benchmarks rather than marketing numbers, which is still rare in AI infrastructure.
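For scale, the cluster figure from the text breaks down as follows; the per-GPU and per-replica numbers are simple arithmetic on the reported totals, not separately published metrics.

```python
# Back-of-envelope breakdown of the reported throughput (figures from the
# text; the per-GPU split is my arithmetic, not a published benchmark).
cluster_tps = 120_000   # tokens per second across the whole cluster
gpus = 16               # 16x H100
replicas = 8            # 8x vLLM

per_gpu = cluster_tps / gpus
per_replica = cluster_tps / replicas
print(per_gpu)        # 7500.0 tokens/s per GPU
print(per_replica)    # 15000.0 tokens/s per vLLM replica
```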
The inclusion of llm-d in the CNCF Sandbox is less about status and more about an attempt to standardize a layer that has been missing until now: state-aware orchestration of inference. If the approach takes hold, Kubernetes will start treating LLM workloads not as exceptions, but as the norm.