Kubernetes Becomes the AI Substrate: 66% of GenAI Inference, DRA GA, llm-d

Abhishek Gautam · 12 min read

Quick summary

CNCF data shows two-thirds of generative AI inference now runs on Kubernetes. Nvidia and Google donated their DRA drivers, and geo-distributed GPU pooling crossed into production.

66% of organizations running generative AI inference now manage those workloads on Kubernetes, according to the 2026 CNCF Annual Survey — a number that quietly settles a debate the infrastructure world has had since 2023. Kubernetes was a poor fit for GPUs for years. In the last twelve months, that changed in four concrete steps.

This matters to every team deciding this quarter whether to self-host models or stay on closed APIs.

The Four Changes That Made Kubernetes an AI Substrate

1. Dynamic Resource Allocation (DRA) went GA — in Kubernetes 1.34, giving workloads a standard way to request GPUs by property (memory, interconnect, topology) instead of the crude device-plugin counting of old.

2. Both GPU vendors standardized on it — at KubeCon Europe 2026, Nvidia donated its GPU DRA driver to CNCF and Google open-sourced its TPU DRA driver. The two dominant AI hardware vendors now treat DRA as the interface.

3. Native gang scheduling landed — Kubernetes v1.36 (70 enhancements, the most AI-oriented release to date) ships gang scheduling natively, so distributed training and multi-pod inference jobs start all-or-nothing instead of deadlocking half-scheduled.

4. Purpose-built inference frameworks entered CNCF: llm-d, a Kubernetes-native distributed LLM inference framework from Red Hat, Google Cloud, IBM Research, CoreWeave, and Nvidia, joined the CNCF Sandbox, alongside KAI Scheduler for fractional GPUs, bin-packing, and hierarchical quotas.
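To make the first change concrete, here is a minimal Python sketch of what "request by property instead of by count" means. It is not the DRA API: the `Device` and `Claim` types, the field names, and the inventory are all hypothetical, and the same-NUMA-node preference is only a stand-in for real topology awareness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """A hypothetical accelerator as a scheduler might see it."""
    name: str
    memory_gib: int
    interconnect: str  # e.g. "nvlink" or "pcie"
    numa_node: int


@dataclass(frozen=True)
class Claim:
    """A hypothetical property-based request, loosely modeled on the DRA idea."""
    min_memory_gib: int
    interconnect: str
    count: int


def select_devices(devices, claim):
    """Return claim.count devices that satisfy the claim, or [] if impossible."""
    eligible = [
        d for d in devices
        if d.memory_gib >= claim.min_memory_gib
        and d.interconnect == claim.interconnect
    ]
    # Prefer devices that share a NUMA node (a crude proxy for topology).
    by_node = {}
    for d in eligible:
        by_node.setdefault(d.numa_node, []).append(d)
    for node in sorted(by_node):
        if len(by_node[node]) >= claim.count:
            return by_node[node][:claim.count]
    # Otherwise fall back to any eligible devices, if there are enough.
    return eligible[:claim.count] if len(eligible) >= claim.count else []


# Hypothetical inventory: a count-based request for "2 GPUs" could land on
# gpu-0, which has too little memory and the wrong interconnect.
INVENTORY = [
    Device("gpu-0", 40, "pcie", 0),
    Device("gpu-1", 80, "nvlink", 0),
    Device("gpu-2", 80, "nvlink", 1),
    Device("gpu-3", 80, "nvlink", 1),
]
```

With this inventory, `select_devices(INVENTORY, Claim(80, "nvlink", 2))` picks `gpu-2` and `gpu-3`, the only pair that meets the memory and interconnect requirements on a single node. A plain "give me two devices" request has no way to express that.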

Beyond One Data Center: Geo-Distributed GPU Pooling

A CNCF blog post published June 8, 2026 detailed the k0smos stack — and a collaboration with Germany's federal innovation agency SPRIND called exalsius — pooling fragmented, heterogeneous GPUs (Nvidia A100s and AMD MI300X in the same fabric) across sites into one compute system, with WireGuard P2P data planes and energy-aware orchestration. Results were presented at EuroSys 2026.

Translation: the single-building AI factory is no longer the only architecture. Stranded GPUs in different locations can now act as one cluster, the same direction as Nvidia XFRA residential distributed compute, the subject of our most-read post.

Our Analysis: What This Means If You Are Choosing a Stack

1. The self-host decision just got easier

The hard part of self-hosting was never the weights — DeepSeek V4 Pro and Qwen solved that. It was operations: scheduling, sharing, failure recovery. With DRA GA, KAI, llm-d, and Nvidia open-sourcing DSX OS the same week, the ops layer is now standard open source. A platform team can run a credible inference service without buying a proprietary orchestration product.

2. Fractional GPUs end the utilization lie

Most inference pods do not need a whole H100. KAI's MIG partitioning and time-slicing, combined with hierarchical quotas, mean finance can finally see per-team GPU utilization. That is directly relevant after the $500M Claude bill story made AI FinOps a board topic.
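As a rough illustration of the accounting this enables, the sketch below rolls fractional GPU reservations up into a per-team report. The record layout, team names, and quota table are hypothetical, and a real scheduler would export these figures as metrics rather than as Python tuples.

```python
from collections import defaultdict

# Hypothetical records: (team, GPU fraction reserved, average busy fraction).
ALLOCATIONS = [
    ("search", 0.5, 0.8),
    ("search", 0.25, 0.4),
    ("ads", 1.0, 0.1),
]

# Hypothetical per-team quotas, in whole GPUs.
QUOTA_GPUS = {"search": 1.0, "ads": 2.0}


def team_report(allocations, quotas):
    """Summarize reserved GPUs, real utilization, and quota use per team."""
    reserved = defaultdict(float)
    used = defaultdict(float)
    for team, fraction, busy in allocations:
        reserved[team] += fraction
        used[team] += fraction * busy  # GPU-equivalents actually kept busy
    report = {}
    for team, quota in quotas.items():
        r = reserved[team]
        report[team] = {
            "reserved_gpus": r,
            "utilization": used[team] / r if r else 0.0,
            "quota_used": r / quota,
        }
    return report
```

In this made-up data, the "ads" team holds a full GPU at about 10% utilization, while "search" keeps three quarters of a GPU roughly two-thirds busy. Whole-device allocation hides exactly that gap.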

3. Vendor neutrality is real this time

DRA is vendor-neutral plumbing with both Nvidia and Google drivers under CNCF governance. exalsius mixing A100 + MI300X shows AMD silicon slotting into the same fabric. If export controls or supply constraints force a mixed fleet — common for our readers in China and Singapore — Kubernetes is now the abstraction layer that absorbs it.

4. A checklist if you migrate inference to Kubernetes this quarter

  • Be on v1.34+ for DRA GA; target v1.36 for native gang scheduling
  • Use vendor DRA drivers (CNCF-governed), not legacy device plugins
  • Evaluate llm-d for disaggregated prefill/decode before writing custom serving code
  • Add KAI or Kueue for quotas and gang scheduling of fine-tuning jobs
  • Measure tokens per watt per namespace — that is the 2026 efficiency metric
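For the last checklist item, here is one way the metric could be computed, as a sketch under stated assumptions. It assumes you can already attribute token counts and average accelerator power to a namespace per measurement window. The `Sample` type and the numbers are hypothetical. Tokens per second per watt is the same quantity as tokens per joule, so the code divides total tokens by total energy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One hypothetical measurement window for one namespace."""
    namespace: str
    tokens: int         # tokens generated during the window
    avg_power_w: float  # average accelerator power attributed to the namespace
    window_s: float     # window length in seconds


def tokens_per_watt(samples):
    """Return tokens per joule (tokens/s per watt) for each namespace."""
    tokens = defaultdict(int)
    energy_j = defaultdict(float)
    for s in samples:
        tokens[s.namespace] += s.tokens
        energy_j[s.namespace] += s.avg_power_w * s.window_s  # W * s = J
    return {ns: tokens[ns] / energy_j[ns] for ns in tokens if energy_j[ns] > 0}


# Hypothetical data: two windows for "chat", one for "batch".
SAMPLES = [
    Sample("chat", 120_000, 400.0, 60.0),
    Sample("chat", 60_000, 200.0, 60.0),
    Sample("batch", 36_000, 300.0, 60.0),
]
```

The hard part in practice is the attribution, not the arithmetic: shared or time-sliced GPUs need a rule for splitting power between namespaces, and this sketch simply assumes that split has already been made.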

Key Takeaways

  • 66% of orgs running genAI inference use Kubernetes for it — 2026 CNCF Annual Survey
  • DRA GA (k8s 1.34) + Nvidia GPU and Google TPU DRA drivers donated to CNCF at KubeCon Europe 2026
  • v1.36 ships native gang scheduling — the most AI-focused Kubernetes release yet
  • llm-d (Red Hat, Google, IBM, CoreWeave, Nvidia) and KAI Scheduler entered CNCF Sandbox
  • k0smos/exalsius demonstrated geo-distributed, mixed A100 + MI300X GPU pooling with SPRIND (CNCF, June 8)
  • For developers: the open-source ops layer for self-hosted inference is now complete — weights, scheduling, serving, and power management all have standard components
  • What to watch: llm-d graduation pace, AMD ROCm DRA parity, whether managed clouds expose DRA cleanly


Frequently Asked Questions

How many organizations run AI inference on Kubernetes in 2026?

The 2026 CNCF Annual Survey found that 66% of organizations hosting generative AI inference now use Kubernetes to manage some or all of those workloads, making it the dominant orchestration layer for AI serving.

What is Dynamic Resource Allocation (DRA) in Kubernetes?

DRA is a standard Kubernetes interface for requesting accelerators by property — GPU memory, topology, interconnect — instead of simple device counting. It reached general availability in Kubernetes 1.34, and at KubeCon Europe 2026 Nvidia donated its GPU DRA driver and Google open-sourced its TPU DRA driver to CNCF.

What is llm-d?

llm-d is a Kubernetes-native open-source framework for distributed LLM inference, built by Red Hat, Google Cloud, IBM Research, CoreWeave, and Nvidia and accepted into the CNCF Sandbox at KubeCon Europe 2026. It supports disaggregated prefill/decode serving across any model, accelerator, or cloud.
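The general idea behind disaggregated prefill/decode can be shown with a toy model. This is not llm-d's API or architecture: the class, its queues, and the `kv_cache_ready` flag are hypothetical. It only illustrates splitting prompt processing (prefill) and token generation (decode) into separate stages so each can be scaled independently.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int
    kv_cache_ready: bool = False
    generated: int = 0


class DisaggregatedServer:
    """Toy model: prefill and decode are separate stages with separate queues."""

    def __init__(self):
        self.prefill_queue = deque()
        self.decode_queue = deque()
        self.done = []

    def submit(self, req):
        self.prefill_queue.append(req)

    def prefill_step(self):
        """Process one whole prompt, then hand the request to the decode stage."""
        if self.prefill_queue:
            req = self.prefill_queue.popleft()
            req.kv_cache_ready = True  # stands in for transferring the KV cache
            self.decode_queue.append(req)

    def decode_step(self):
        """Generate one token for every in-flight request (one batched step)."""
        for _ in range(len(self.decode_queue)):
            req = self.decode_queue.popleft()
            req.generated += 1
            if req.generated >= req.max_new_tokens:
                self.done.append(req)
            else:
                self.decode_queue.append(req)
```

In a real deployment the two stages would run on different pods, possibly on different hardware, and the handoff would move the KV cache over the network. Here the handoff is a single boolean.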

What did Kubernetes v1.36 add for AI workloads?

Kubernetes v1.36 shipped 70 enhancements and is the most AI-oriented release to date, headlined by native gang scheduling so distributed training and multi-pod inference jobs start all-or-nothing instead of deadlocking partially scheduled.
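The all-or-nothing behavior is easy to sketch. The function below is a hypothetical illustration, not the Kubernetes scheduler's algorithm: it places pods on a copy of the free-capacity map and commits only if every pod in the gang fits. The first-fit-decreasing order is a simple heuristic that can miss a placement that does exist.

```python
from typing import Optional


def gang_schedule(free_gpus: dict, pod_requests: list) -> Optional[dict]:
    """Place every pod in the gang or none of them.

    free_gpus maps node name -> free GPU count and is mutated only on success.
    pod_requests[i] is the GPU count pod i needs.
    Returns {pod index: node name}, or None if the gang does not fit.
    """
    tentative = dict(free_gpus)  # work on a copy so failure leaves no trace
    placement = {}
    # Place the largest pods first (first-fit decreasing).
    order = sorted(range(len(pod_requests)), key=lambda i: -pod_requests[i])
    for pod in order:
        need = pod_requests[pod]
        node = next((n for n in sorted(tentative) if tentative[n] >= need), None)
        if node is None:
            return None  # one pod does not fit, so admit nothing
        tentative[node] -= need
        placement[pod] = node
    free_gpus.update(tentative)  # commit only once the whole gang fits
    return placement
```

With `{"node-a": 4, "node-b": 2}`, a gang of three 2-GPU pods is admitted in full. A gang of two 4-GPU pods is rejected, and the capacity map is left untouched. Without gang semantics, the first 4-GPU pod would start and hold its GPUs while waiting for a peer that can never be scheduled.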

Can Kubernetes pool GPUs across multiple data centers?

Yes. The k0smos stack and the exalsius project with Germany's SPRIND agency, detailed by CNCF in June 2026, pool heterogeneous GPUs — including Nvidia A100 and AMD MI300X — across sites into a unified compute fabric using WireGuard P2P networking and energy-aware orchestration.

Written by

Software Engineer based in Delhi, India. Writes about AI models, semiconductor supply chains, and tech geopolitics — covering the intersection of infrastructure and global events. 846+ posts cited by ChatGPT, Perplexity, and Gemini. Read in 164 countries.