llm-d: Kubernetes Framework for Scalable LLM Inference Donated to CNCF

by Chief Editor

Kubernetes and AI: The Dawn of Cloud-Native Inference

The convergence of Kubernetes and Artificial Intelligence has taken a significant leap forward with the introduction of llm-d, a Kubernetes blueprint designed to streamline the deployment of inference stacks for any model, on any accelerator, and in any cloud environment. This collaborative effort, spearheaded by IBM Research, Red Hat, and Google Cloud, and now a sandbox project within the Cloud Native Computing Foundation (CNCF), promises to reshape how organizations approach large language model (LLM) inference.

llm-d: A Blueprint for Scalable AI

Launched in 2025 and built by Neural Magic (acquired by Red Hat), llm-d addresses the challenges of serving foundation models at scale. It transforms LLM inference from a model-by-model improvisation into a replicable, production-grade Kubernetes-based system. The goal, as articulated by IBM Research Distinguished Engineer Carlos Costa, is to establish large-scale model serving as a first-class cloud-native workload.

How llm-d Works: Disaggregation and Intelligent Routing

llm-d fundamentally changes how LLM inference is handled. It operates by disaggregating the inference process into prefill and decode phases, running each on separate pods. This allows for independent scaling and tuning of each phase, optimizing resource utilization. It incorporates an LLM-aware routing and scheduling layer, utilizing a gateway extension that directs requests based on KV-cache state, pod load, and hardware characteristics to enhance both latency and throughput.
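As a rough illustration of that split, the toy sketch below (hypothetical, not llm-d code) models prefill and decode as two cooperating workers: the prefill worker processes the whole prompt once and hands off its KV cache, and the decode worker extends that cache one generated token at a time. Because the two phases have different bottlenecks, running them in separate pods lets each pool scale on its own.

```python
# Toy simulation of prefill/decode disaggregation. All names here are
# illustrative assumptions, not llm-d's actual implementation.

def prefill(prompt: str) -> dict:
    """Compute-bound phase: build one KV-cache entry per prompt token."""
    return {"kv": [f"kv({tok})" for tok in prompt.split()], "prompt": prompt}

def decode(state: dict, max_new_tokens: int) -> list:
    """Memory-bandwidth-bound phase: extend the cache token by token."""
    out = []
    for i in range(max_new_tokens):
        tok = f"tok{i}"               # stand-in for real sampling
        state["kv"].append(f"kv({tok})")
        out.append(tok)
    return out

state = prefill("the quick brown fox")   # runs on a prefill pod
tokens = decode(state, 3)                # runs on a separate decode pod
```

In a real deployment the "hand-off" is the transfer of KV-cache blocks between pods; the point of the sketch is only that the two functions can live on independently scaled replicas.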

At its core, llm-d leverages vLLM as its inference engine, alongside other modular components, to provide a reusable blueprint adaptable to “any model, any accelerator, any cloud.” While vLLM executes the model itself, llm-d provides the operating layer for cluster management, intelligent scheduling, and cache-aware routing, tuned specifically for LLM traffic.
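The cache-aware routing idea can be sketched as a scoring function over candidate pods: reward pods that already hold KV-cache blocks for the request's prompt prefix, and penalize pods that are heavily loaded. The pod fields, weights, and block size below are illustrative assumptions, not llm-d's actual heuristics.

```python
# Hedged sketch of prefix-cache- and load-aware routing (hypothetical
# fields and weights; llm-d's real scheduler differs).
from dataclasses import dataclass, field

@dataclass
class PodState:
    name: str
    cached_prefixes: set = field(default_factory=set)  # keys of cached prompt prefixes
    active_requests: int = 0
    max_requests: int = 16

def prefix_hashes(prompt: str, block: int = 16) -> set:
    """Hash fixed-size prompt prefixes, mimicking paged KV-cache block keys."""
    return {hash(prompt[:i]) for i in range(block, len(prompt) + 1, block)}

def score(pod: PodState, prompt: str) -> float:
    """Higher is better: reward cache hits, penalize current load."""
    hits = len(prefix_hashes(prompt) & pod.cached_prefixes)
    load = pod.active_requests / pod.max_requests
    return hits - 5.0 * load

def route(pods, prompt):
    """Pick the pod with the best cache/load trade-off."""
    return max(pods, key=lambda p: score(p, prompt))
```

With this scorer, a request is sent to the pod that has its prefix warm in cache, but falls back to a cold pod once the warm one saturates, which is the latency/throughput trade-off the gateway extension is balancing.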

Industry Collaboration and the CNCF Donation

The donation of llm-d to the CNCF signifies a commitment to community-governed, vendor-neutral LLM inference. This move is supported by a broad coalition of collaborators including NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. Red Hat SVP and AI CTO Brian Stevens emphasized the importance of supporting diverse accelerators – TPUs, AMD, NVIDIA, and others – mirroring the open hardware philosophy of Linux.

Performance Gains and Cost Reduction

Early testing by Google Cloud demonstrated significant performance improvements with llm-d, achieving “2x improvements in time-to-first-token” for tasks like code completion. This enhancement stems from addressing limitations in traditional autoscalers and generic APIs, which weren’t designed for the stateful nature of LLM inference and the need for efficient KV cache management and heterogeneous accelerator orchestration.

llm-d tackles these issues through prefix-cache-aware routing and prefill/decode disaggregation. It also supports hierarchical cache offloading across GPU, CPU, and storage tiers, enabling larger context windows without overwhelming accelerator memory. Its traffic- and hardware-aware autoscaler dynamically adapts to workload patterns, moving beyond basic utilization metrics.
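Hierarchical cache offloading can be pictured as an LRU hierarchy: when the fast GPU tier fills, the least-recently-used KV-cache entry is demoted to CPU memory, and from CPU to storage. The sketch below is a minimal simulation under that assumption; tier names, capacities, and eviction policy are illustrative, not llm-d's implementation.

```python
# Toy tiered KV-cache with LRU demotion across GPU -> CPU -> storage
# (capacities in "slots" are arbitrary for the example).
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, capacities=(2, 4, 8)):  # GPU, CPU, storage slots
        self.tiers = [OrderedDict() for _ in capacities]
        self.caps = capacities

    def put(self, key, value):
        self._insert(0, key, value)

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)                       # mark most recently used
        if len(tier) > self.caps[level]:
            old_key, old_val = tier.popitem(last=False)  # evict LRU entry
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_val)  # demote one tier down

    def locate(self, key):
        """Return the tier index holding `key`, or None if evicted entirely."""
        for i, tier in enumerate(self.tiers):
            if key in tier:
                return i
        return None
```

The practical payoff is the one the article describes: older context spills to cheaper, larger tiers instead of forcing the accelerator to drop it, so longer context windows fit without exhausting GPU memory.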

Integration with Emerging Kubernetes APIs

llm-d is designed to function seamlessly with emerging Kubernetes APIs, including the Gateway API Inference Extension (GAIE) and LeaderWorkerSet (LWS). This trio aims to establish distributed inference as a core Kubernetes workload.

The Future of llm-d: Multi-Modality and Optimization

Looking ahead, development efforts will focus on expanding llm-d’s capabilities to encompass multi-modal workloads, multi-LoRA optimization with HuggingFace, and deeper integration with vLLM. Mistral AI is already contributing code to advance open standards around disaggregated serving.

IBM Research also plans to explore the intersection of inference and training, including reinforcement learning and self-optimizing AI infrastructure. The overarching vision is to create a common foundation stack that allows the AI ecosystem to focus on innovation rather than rebuilding fundamental components.

FAQ

What is llm-d?

llm-d is an open-source, Kubernetes-native framework for running large language model (LLM) inference as a distributed, production-grade workload.

Who created llm-d?

llm-d was originally built by Neural Magic, which was acquired by Red Hat in 2025. It is now a collaborative effort involving IBM Research, Red Hat, and Google Cloud.

What are the benefits of using llm-d?

llm-d offers improved performance, reduced costs, and simplified deployment of LLM inference workloads on Kubernetes.

Where can I find more information about llm-d?

You can find more information on the project website at https://llm-d.ai/.

Did you know? llm-d’s disaggregated approach to inference allows for independent scaling of the prefill and decode phases, optimizing resource allocation and reducing latency.

Explore more about cloud-native technologies and AI advancements on The New Stack.
