Google TPU 8t and 8i at Cloud Next 2026: The Inference War Starts Now

Abhishek Gautam | April 24, 2026 | 6 min read

Quick summary

Google announced TPU 8t (training) and TPU 8i (inference) at Cloud Next 2026 on April 22. Anthropic gets access to over 1 million TPU chips. Meta signed a multibillion-dollar deal. OpenAI is adding TPU capacity alongside its Nvidia hardware.

Why Google Split Training and Inference Into Two Chips

The dominant design philosophy for AI accelerators from 2020 to 2025 was a single high-performance chip that could handle both training and inference. Nvidia built Hopper (H100/H200) and Blackwell on this model. The logic was sound: training and inference were different workloads but shared enough silicon requirements (high-bandwidth memory, fast matrix multiply) that a single design served both.

That logic breaks when the workloads diverge enough. Training is about raw compute throughput and inter-chip bandwidth: you want maximum FLOPS and the ability to sync gradients across thousands of chips simultaneously. Inference, especially agentic inference with long context windows and many simultaneous requests, is about memory capacity, memory bandwidth, and latency on collective operations. A single design is wasteful in both directions: inference pays for compute throughput it cannot use, and training pays for memory capacity and latency features it does not need.
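To make "syncing gradients" concrete, here is a minimal pure-Python sketch of the all-reduce collective used in data-parallel training: every chip ends up holding the element-wise mean of all chips' gradients. It illustrates the semantics only, not how TPU hardware or any framework implements it, and the function names are hypothetical. The second function applies the standard ring all-reduce cost model, in which each participant sends 2(n-1)/n times the gradient size.

```python
from typing import List


def all_reduce_mean(per_chip_grads: List[List[float]]) -> List[List[float]]:
    """Give every chip the element-wise mean of all chips' gradients."""
    n = len(per_chip_grads)
    # zip(*...) groups the i-th gradient element from every chip together.
    mean = [sum(values) / n for values in zip(*per_chip_grads)]
    # After the collective, each chip holds its own copy of the same result.
    return [list(mean) for _ in range(n)]


def ring_all_reduce_bytes_sent(grad_bytes: float, n_chips: int) -> float:
    """Bytes each chip sends in a ring all-reduce: 2 * (n - 1) / n * size."""
    return 2 * (n_chips - 1) / n_chips * grad_bytes


# Two chips, two parameters each (toy values).
print(all_reduce_mean([[1.0, 2.0], [3.0, 4.0]]))

# Hypothetical 10 GB of gradients: per-chip traffic approaches 20 GB as n grows.
for n in (8, 1024):
    sent_gb = ring_all_reduce_bytes_sent(10e9, n) / 1e9
    print(f"{n} chips: {sent_gb:.2f} GB sent per chip per step")
```

Because per-chip traffic stays close to twice the gradient size however many chips participate, per-chip interconnect bandwidth is what bounds synchronisation time at scale.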

With TPU 8t and TPU 8i, Google's TPU line acknowledges this divergence explicitly for the first time and builds two separate architectures around it.

TPU 8t: The Training Chip

The TPU 8t is built for large-scale pre-training. Its headline numbers: 12.6 petaFLOPS per chip at FP4 precision, 216 GB of HBM per chip, and 6,528 GB/s of memory bandwidth. Those per-chip numbers are competitive with Nvidia's Blackwell B200.

The number that separates it from Nvidia is scale: a TPU 8t superpod packs 9,600 chips delivering 121 exaFLOPS total, and Google says it can link over 1 million TPU 8t chips in a single cluster. Nvidia's current NVLink domain tops out at 72 GPUs in an NVL72 rack, and clusters beyond that size require InfiniBand fabric that adds latency and cost.
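The superpod figure follows directly from the per-chip number. A short sketch of the arithmetic, using only the vendor-quoted peak figures from this article (the function name is illustrative):

```python
# Figures quoted in the article; treat them as vendor peak claims, not measurements.
PFLOPS_PER_CHIP_FP4 = 12.6   # TPU 8t peak FP4 petaFLOPS per chip
CHIPS_PER_SUPERPOD = 9_600
GPUS_PER_NVL72_RACK = 72


def aggregate_exaflops(chips: int, pflops_per_chip: float) -> float:
    """Peak aggregate compute in exaFLOPS (1 exaFLOPS = 1,000 petaFLOPS)."""
    return chips * pflops_per_chip / 1_000


superpod_eflops = aggregate_exaflops(CHIPS_PER_SUPERPOD, PFLOPS_PER_CHIP_FP4)
racks_for_same_chip_count = CHIPS_PER_SUPERPOD / GPUS_PER_NVL72_RACK

print(f"superpod peak: {superpod_eflops:.2f} exaFLOPS")
print(f"72-GPU racks needed to match the chip count: {racks_for_same_chip_count:.1f}")
```

The rack count compares chip counts only; it says nothing about per-chip performance on either side.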

The inter-chip interconnect bandwidth is 19.2 Tbps per chip, double that of the 7th-generation Ironwood, and the data centre network delivers 47 petabits per second of non-blocking bisection bandwidth. Ironwood already had the fastest AI cluster interconnect on the market; TPU 8t extends that lead.
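Interconnect figures are quoted in bits per second, while memory and GPU link figures are usually quoted in bytes per second, so it helps to convert before comparing. A small sketch using the numbers above; the Ironwood value is only what the "double" claim implies, not a separately sourced figure:

```python
def bits_to_bytes_rate(bits_per_s: float) -> float:
    """Convert a bit rate to a byte rate in the same SI prefix (8 bits per byte)."""
    return bits_per_s / 8


TPU_8T_ICI_TBPS = 19.2                        # terabits per second, per chip
IMPLIED_IRONWOOD_TBPS = TPU_8T_ICI_TBPS / 2   # assumption: derived from "double"
NETWORK_PBPS = 47                             # petabits per second, bisection

print(f"TPU 8t ICI: {bits_to_bytes_rate(TPU_8T_ICI_TBPS):.2f} TB/s per chip")
print(f"Implied Ironwood ICI: {bits_to_bytes_rate(IMPLIED_IRONWOOD_TBPS):.2f} TB/s per chip")
print(f"Data centre network: {bits_to_bytes_rate(NETWORK_PBPS):.3f} PB/s bisection")
```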

Two other TPU 8t details matter for training-at-scale:

10x faster storage access via TPUDirect Storage using Managed Lustre. Training runs at scale are often bottlenecked on checkpoint I/O and dataset loading. If I/O previously took 20% of wall-clock time, a 10x improvement cuts it to about 2% of the original run time, or roughly 2.4% of the now-shorter run. That compounds across weeks-long runs.

97%+ compute goodput — the fraction of clock cycles doing actual useful computation rather than waiting on communication, memory, or scheduling overhead. Nvidia claims similar numbers for NVL72 configurations but typically at much smaller cluster sizes. Maintaining 97% goodput at 1-million-chip scale is a different engineering problem.
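The storage claim is an Amdahl's-law style calculation: only the I/O portion of the run speeds up, so the total shrinks by less than 10x. A minimal sketch, assuming the illustrative 20% I/O share used above and that compute time is unchanged:

```python
from typing import Tuple


def io_share_after_speedup(io_share_before: float, speedup: float) -> Tuple[float, float]:
    """Return (new I/O share of wall-clock, new wall-clock relative to the old run)."""
    compute = 1.0 - io_share_before          # compute time is assumed unchanged
    io_after = io_share_before / speedup     # only the I/O portion gets faster
    total_after = compute + io_after
    return io_after / total_after, total_after


share, wall_clock = io_share_after_speedup(io_share_before=0.20, speedup=10.0)
print(f"I/O share of the new run: {share:.1%}")
print(f"New wall-clock time relative to the old run: {wall_clock:.0%}")
```

Under these assumptions a 10x storage speedup shortens the whole run by about 18%, not 10x, which is why the gain matters mainly for I/O-heavy runs.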

TPU 8i: The Inference Chip Built for Agents

The TPU 8i has lower raw FP4 compute than the 8t (10.1 petaFLOPS vs. 12.6) but it was designed for a different problem. The key differentiators:

288 GB of HBM per chip versus the 8t's 216 GB. Inference is memory-capacity-bound for large models: the bigger the KV cache you can hold in fast memory, the longer the context you can serve without expensive memory swaps. A 1,152-chip TPU 8i pod holds 331.8 TB of HBM in total. That is enough to serve multiple simultaneous long-context requests at scale without context compression tricks.

384 MB of on-chip SRAM (Vmem), three times Ironwood's. This is the fastest memory on the chip, and the 8i uses it to hold KV cache for the current request entirely on-chip rather than in HBM. For agentic workloads doing many small attention operations in sequence, keeping KV cache on-chip eliminates the dominant latency contributor.

Collectives Acceleration Engine (CAE) reduces collective operation latency by 5x. Inference across multiple chips requires all-to-all communication to aggregate attention results. The 8i's CAE accelerates this in hardware on the chip rather than coordinating it in software. Google pairs it with the Boardfly network topology (maximum 7 hops, 50% lower all-to-all latency than a 3D torus) on the 8i, versus the Virgo 3D torus on the 8t.
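The memory figures above can be turned into rough KV cache budgets. The sketch below uses the standard per-token KV cache size (one key and one value vector per layer per KV head) with a purely hypothetical model shape; none of the model dimensions come from Google. It also assumes decimal units (1 MB = 10^6 bytes) and ignores model weights, which share HBM with the cache.

```python
MB = 1000 ** 2
GB = 1000 ** 3


def pod_hbm_tb(chips: int, hbm_gb_per_chip: int) -> float:
    """Total HBM across a pod, in terabytes."""
    return chips * hbm_gb_per_chip / 1000


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """KV cache bytes per token; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def tokens_that_fit(capacity_bytes: int, bytes_per_token: int) -> int:
    """Upper bound on cached tokens if the memory held nothing but KV cache."""
    return capacity_bytes // bytes_per_token


# Hypothetical model: 80 layers, 8 KV heads, head dimension 128, 1-byte (FP8) values.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=1)

print(f"pod HBM: {pod_hbm_tb(1152, 288):.1f} TB")
print(f"KV cache per token: {per_token / 1000:.1f} kB")
print(f"tokens in 384 MB SRAM: {tokens_that_fit(384 * MB, per_token):,}")
print(f"tokens in 288 GB HBM (no weights): {tokens_that_fit(288 * GB, per_token):,}")
```

How much of a request fits in SRAM depends heavily on the model shape and precision, so these token counts are illustrative upper bounds rather than serving capacities.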

The architectural split is coherent: 8t is optimized for compute throughput and scale-out; 8i is optimized for memory capacity, latency, and serving density.
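To see why a hop-bounded topology matters for all-to-all traffic, compare the quoted 7-hop maximum with the worst-case hop count of a wraparound 3D torus, which is the sum of half of each dimension. The 8 x 12 x 12 layout below is a hypothetical factorisation of a 1,152-chip pod, not a published configuration, and the sketch compares hop counts only, not latency.

```python
from math import prod
from typing import Sequence


def torus_diameter(dims: Sequence[int]) -> int:
    """Worst-case hops in a wraparound torus: floor(d / 2) summed over dimensions."""
    return sum(d // 2 for d in dims)


HYPOTHETICAL_TORUS_DIMS = (8, 12, 12)   # assumption: one way to arrange 1,152 chips
BOARDFLY_MAX_HOPS = 7                   # figure quoted in the article

chips = prod(HYPOTHETICAL_TORUS_DIMS)
diameter = torus_diameter(HYPOTHETICAL_TORUS_DIMS)

print(f"{chips} chips as a 3D torus: up to {diameter} hops")
print(f"Boardfly (as quoted): up to {BOARDFLY_MAX_HOPS} hops")
```

In an all-to-all exchange every chip talks to every other chip, so the worst-case path length sets a floor on completion time; a lower hop bound helps regardless of how the collective is scheduled.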

Anthropic, Meta, OpenAI: The Customer Story

The spec sheet matters. The customer list matters more.

Anthropic is Google's largest TPU customer. The company is getting access to over 1 million TPU chips and 1 gigawatt of capacity in 2026 alone. To put 1 GW in context: a single modern AI data centre typically draws 100-300 MW, so Anthropic is effectively getting the equivalent of 3-10 hyperscale AI data centres' worth of Google TPU capacity. Claude's training runs and inference serving already run on TPU infrastructure, not only on Nvidia GPUs.

Meta signed a multibillion-dollar multiyear deal with Google Cloud for TPU capacity in February 2026. Meta builds and operates its own data centres and has its own MTIA custom AI silicon programme. When a company that invests this heavily in proprietary AI infrastructure signs a multiyear external TPU deal, it signals that TPU capacity is filling a gap Meta's own infrastructure cannot cover on the timeline required.

OpenAI's move is the most strategically significant. OpenAI built its training stack on Nvidia's CUDA ecosystem. Every GPT model was trained on Nvidia hardware. The relationship was close enough that Nvidia's Jensen Huang described OpenAI as a foundational partner. OpenAI adopting TPU capacity, even as a supplement rather than a replacement, means the single-vendor GPU lock-in model for frontier AI labs is breaking.

What This Means for Nvidia

Nvidia's data centre GPU revenue in FY2025 was approximately $115 billion. The company controls roughly 92% of the AI accelerator market by revenue. TPU 8t and TPU 8i do not change those numbers in 2026 — the new chips are not generally available until H2 2026, and supply will be constrained. But the customer moves signal a structural shift in the 2027-2029 timeframe.

The specific threat is at inference. Nvidia's Blackwell architecture is competitive on training throughput, but inference economics increasingly favour memory-capacity and latency-optimized designs over raw FLOPS. The TPU 8i's 288 GB HBM, 384 MB on-chip SRAM, and at-most-7-hop network are designed precisely for the inference serving patterns that represent the majority of commercial AI compute spend.

Google is also attacking the cost structure. Both TPU 8 chips are claimed to deliver 2x better performance per watt versus Ironwood. At data centre scale where power is 40-60% of operational cost for AI workloads, a 2x efficiency advantage is a durable pricing advantage — not a margin question but a structural input cost difference.
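A quick way to size that efficiency claim: if only the power portion of operating cost scales with performance per watt, a 2x gain halves that portion and leaves the rest untouched. The sketch below assumes exactly that simplification (non-power costs unchanged, same amount of work) and uses the 40-60% power share quoted above; the function name is illustrative.

```python
def relative_opex(power_share: float, perf_per_watt_gain: float) -> float:
    """Operating cost for the same work, relative to before the efficiency gain."""
    non_power = 1.0 - power_share                 # assumed unaffected by the gain
    power_after = power_share / perf_per_watt_gain
    return non_power + power_after


for share in (0.40, 0.60):
    after = relative_opex(power_share=share, perf_per_watt_gain=2.0)
    print(f"power at {share:.0%} of opex -> {1 - after:.0%} lower opex for the same work")
```

Note that the 2x figure is measured against Ironwood, Google's own previous generation, so this is a saving relative to prior TPUs rather than a direct comparison with Nvidia hardware.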

Nvidia retains a large and defensible advantage in ecosystem: CUDA, tens of thousands of optimized kernels, Triton, the full NIM microservices stack. JAX and PyTorch on TPU have improved substantially but the software ecosystem gap is real. Any developer migrating a training or inference stack from Nvidia to TPU is taking on porting work.

The Agentic Era Framing

Google is explicitly marketing both chips as "designed for the agentic era." The TPU 8i's Collectives Acceleration Engine and Boardfly topology are optimized for the access patterns of agentic inference — many small, rapid attention operations, large KV caches for long-running agent contexts, frequent all-to-all communication across chips serving parallel agent threads.

The TPU 8t's 10x storage access improvement maps to the training patterns of reasoning models that require large reasoning trace datasets and frequent checkpointing. Reasoning model training is more I/O intensive than standard next-token prediction pre-training.

Google's Pathways distributed training framework, which runs across TPU superpods, now supports multi-cluster training at the 1-million-chip scale that TPU 8t enables. Training a GPT-5-class model on Nvidia hardware today requires complex multi-cluster orchestration that was not designed for this scale. TPU 8t's homogeneous cluster interconnect at million-chip scale simplifies the distributed training problem.

Key Takeaways

Google announced TPU 8t and TPU 8i at Cloud Next 2026 on April 22: first split of TPU into separate training and inference architectures; signals end of "general-purpose" AI accelerator era
TPU 8t (training): 12.6 petaFLOPS FP4, 216 GB HBM, 121 exaFLOPS per 9,600-chip pod, 1M+ chip cluster scale, 10x faster storage access, 97%+ compute goodput
TPU 8i (inference): 10.1 petaFLOPS FP4, 288 GB HBM per chip, 384 MB on-chip SRAM (3x Ironwood), 5x lower collective op latency via CAE, Boardfly topology cuts all-to-all latency 50%
Customer list is the story: Anthropic gets 1M chips + 1 GW capacity in 2026; Meta signed multibillion multiyear deal; OpenAI is adopting TPU capacity — frontier AI lab TPU adoption is now the norm, not the exception
Availability: H2 2026, via Google Cloud Platform and AI Hypercomputer; no per-unit pricing disclosed; 2.8x training and 80% inference price-performance improvement vs. Ironwood claimed
Nvidia implications: not an immediate revenue threat in 2026 but Anthropic/Meta/OpenAI moves signal structural shift in 2027-2029 inference spend; TPU 8i's memory architecture directly targets Nvidia's weakest inference economics

Frequently Asked Questions

What did Google announce at Cloud Next 2026 for AI chips?

Google announced two new 8th-generation TPU chips at Cloud Next 2026 on April 22, 2026: the TPU 8t optimized for AI model training and the TPU 8i optimized for AI inference. It is the first time Google has split the TPU line into separate training and inference architectures. The TPU 8t delivers 12.6 petaFLOPS FP4 per chip, scales to 1 million chips in a single cluster, and delivers 121 exaFLOPS in a 9,600-chip superpod. The TPU 8i delivers 10.1 petaFLOPS FP4 with 288 GB HBM and 384 MB on-chip SRAM per chip, optimized for agentic inference workloads with 5x lower collective operation latency.

How do Google's TPU 8t and 8i compare to Nvidia H200 and Blackwell?

Google did not release official head-to-head benchmarks against Nvidia at Cloud Next 2026. Per chip, TPU 8t and 8i are roughly competitive with Nvidia's Blackwell B200 on raw FP4 compute; the Hopper-generation H200 has no native FP4 mode, so a like-for-like FP4 comparison applies only to Blackwell. TPU's advantages are at cluster scale (1M+ chips vs. 72 GPUs per NVL72 rack for Nvidia), interconnect bandwidth (19.2 Tbps per chip vs. 900 GB/s, or about 7.2 Tbps, of NVLink bandwidth per Hopper GPU), storage access (10x faster than the prior TPU generation), and inference memory (TPU 8i's 288 GB HBM and 384 MB SRAM vs. Blackwell B200's ~192 GB of HBM). Nvidia retains a substantial advantage in the CUDA software ecosystem, broader library support, and proven large-scale training workflows.

Why is Anthropic getting 1 million TPU chips from Google?

Anthropic is Google's largest TPU customer and is receiving access to over 1 million TPU chips plus 1 gigawatt of capacity in 2026. This is consistent with Google's ongoing investment in Anthropic: Google has invested several billion dollars in the company and is a primary cloud infrastructure provider. Anthropic uses TPU infrastructure for training Claude models and serving Claude inference at scale. The 1 GW figure is equivalent to 3-10 hyperscale AI data centres (at 100-300 MW each) and signals that Anthropic's compute requirements are growing faster than what any single GPU supply chain could serve.

When will Google TPU 8t and 8i be available on Google Cloud?

Google announced TPU 8t and 8i at Cloud Next 2026 on April 22, 2026, with general availability targeted for "later in 2026", interpreted here as H2 2026. No specific month was disclosed. The chips will be available via Google Cloud Platform compute instances and through Google's AI Hypercomputer integrated training and inference platform. Customers can register interest through the GCP console. No per-hour or per-chip pricing was announced; Google disclosed only relative claims of 2.8x better training price-performance and 80% better inference price-performance versus the prior-generation Ironwood TPU.
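Since no absolute pricing was disclosed, the only thing the price-performance claims allow is a relative estimate. The sketch below assumes "price-performance" means work done per dollar and reads "80% better" as a 1.8x multiplier; both readings are assumptions, not Google's stated definitions.

```python
def relative_cost_for_same_work(price_performance_gain: float) -> float:
    """Cost of the same workload relative to the prior generation."""
    return 1.0 / price_performance_gain


TRAINING_GAIN = 2.8    # "2.8x better training price-performance"
INFERENCE_GAIN = 1.8   # assumption: "80% better" read as 1.8x

training_cost = relative_cost_for_same_work(TRAINING_GAIN)
inference_cost = relative_cost_for_same_work(INFERENCE_GAIN)

print(f"training: about {training_cost:.0%} of the Ironwood cost for the same work")
print(f"inference: about {inference_cost:.0%} of the Ironwood cost for the same work")
```

Under those assumptions the same training job would cost a little over a third of what it did on Ironwood, and the same inference volume a little over half.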

Does OpenAI using Google TPUs mean it is switching away from Nvidia?

OpenAI adopting Google TPU capacity, announced at Cloud Next 2026, does not mean it is abandoning Nvidia. It signals that OpenAI is no longer exclusively dependent on Nvidia for AI compute infrastructure. OpenAI built its training stack on Nvidia's CUDA ecosystem and trained every GPT model generation on Nvidia hardware. Adding TPU capacity is most likely a supply diversification and cost optimisation move rather than a wholesale platform switch. However, it is strategically significant: when the company most associated with Nvidia GPUs begins using competitor silicon, it validates TPU 8 as a credible production alternative rather than a research curiosity.
