AWS and Cerebras Build Disaggregated Inference to Beat Nvidia at Speed
Quick summary
AWS and Cerebras announced a disaggregated inference architecture combining AWS Trainium and Cerebras CS-3 chips on Amazon Bedrock, claiming order-of-magnitude faster AI inference than GPUs.
AWS and Cerebras announced a formal partnership on March 13, 2026, to build what they are calling disaggregated inference — a new AI serving architecture that separates the prefill and decode stages of language model inference onto different specialised hardware. AWS is the exclusive launch cloud partner: the solution will be available through Amazon Bedrock, launching within the next couple of months. The claim: inference an order of magnitude faster than what GPU-based systems deliver today.
What Disaggregated Inference Actually Means
To understand why this matters, you need to understand what makes GPU-based inference slow. When a large language model generates a response, it does two distinct things. First, it processes your entire input prompt — this is called prefill. Second, it generates the output tokens one by one — this is called decode. The two stages have very different computational profiles.

Prefill is compute-bound: you need raw processing power to digest a long context quickly. Decode is memory-bandwidth-bound: generating each new token requires streaming the model weights and the entire key-value cache of previous tokens through memory, fast enough that output generation does not feel slow.
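A back-of-envelope calculation makes the decode bottleneck concrete. At low batch sizes, each decode step must stream the model weights plus the KV cache through memory, so bandwidth caps tokens per second regardless of FLOPs. The model size and cache size below are illustrative assumptions, not figures from the announcement:

```python
# Why decode is memory-bandwidth-bound: an upper bound on single-stream
# decode speed. Every token generated must read the weights and KV cache
# once, so tokens/sec <= bandwidth / bytes-read-per-token.

def max_decode_tokens_per_sec(weight_bytes: float, kv_cache_bytes: float,
                              mem_bandwidth_bytes_per_sec: float) -> float:
    """Bandwidth-limited ceiling on decode speed for one request."""
    bytes_per_token = weight_bytes + kv_cache_bytes
    return mem_bandwidth_bytes_per_sec / bytes_per_token

weights = 70e9 * 2    # hypothetical 70B-parameter model in FP16: 140 GB
kv_cache = 10e9       # assume a 10 GB KV cache for a long context
h100_bw = 3.35e12     # H100 SXM HBM3 bandwidth, ~3.35 TB/s

# Roughly 22 tokens/sec, no matter how much compute the GPU has.
print(f"{max_decode_tokens_per_sec(weights, kv_cache, h100_bw):.1f} tokens/sec")
```

Real serving stacks batch many requests to amortise the weight reads, but the per-request ceiling illustrates why bandwidth, not compute, governs decode.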
GPUs handle both stages on the same hardware, which means they are neither optimally compute-dense for prefill nor optimally memory-bandwidth-rich for decode. The result is a performance compromise on both.
Disaggregated inference splits the two stages. AWS Trainium chips, optimised for high compute throughput, handle the prefill stage. Cerebras CS-3 chips, which deliver thousands of times more memory bandwidth than the fastest GPU, handle the decode stage. The two systems are connected via AWS's Elastic Fabric Adapter high-speed networking, which allows the prefill output (the key-value cache) to be transferred to the Cerebras decode system in real time.
The architecture essentially assigns each stage to the hardware it was built for. The result, according to both companies, is inference that is an order of magnitude faster than current GPU-based systems.
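One cost the split introduces is shipping the KV cache from the prefill system to the decode system, so the architecture only wins if that transfer is small relative to the savings. A toy latency model shows why a fast fabric like EFA matters; the cache size, fabric bandwidth, and prefill time below are assumptions for illustration, not published figures for Trainium, EFA, or the CS-3:

```python
# Toy model: time-to-first-token in a disaggregated setup is prefill time
# plus the cost of moving the KV cache across the fabric to the decoder.

def ttft_seconds(prefill_s: float, kv_cache_bytes: float,
                 link_bytes_per_sec: float) -> float:
    """Time to first token = prefill + KV-cache transfer over the fabric."""
    return prefill_s + kv_cache_bytes / link_bytes_per_sec

kv_cache = 10e9    # assume 10 GB of KV cache for a long prompt
link_bw = 400e9    # assume ~400 GB/s of effective fabric bandwidth
prefill = 0.150    # assume prefill completes in 150 ms

transfer = kv_cache / link_bw    # 25 ms at these assumed numbers
print(f"transfer adds {transfer*1e3:.0f} ms on top of {prefill*1e3:.0f} ms prefill")
```

At these assumed numbers the transfer adds tens of milliseconds once, while the bandwidth advantage applies to every generated token, which is the trade the architecture is betting on.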
What the Cerebras CS-3 Actually Is
The Cerebras CS-3 is a wafer-scale chip — literally a single chip the size of an entire silicon wafer. A standard Nvidia H100 GPU die is approximately 800 square millimetres. The Cerebras CS-3 wafer is 46,225 square millimetres — about 57 times larger.
The implications for memory bandwidth are significant. Memory bandwidth scales with die size because on-chip SRAM sits adjacent to compute elements. The CS-3 has 44GB of on-chip SRAM with 21 petabytes per second of memory bandwidth. For comparison, the Nvidia H100 SXM has 80GB of HBM3 memory with approximately 3.35 terabytes per second of bandwidth. The CS-3 delivers roughly 6,000 times more memory bandwidth than the H100.
For the decode stage of inference — where memory bandwidth is the limiting factor — this is a transformational advantage. The CS-3 can read the entire key-value cache of a long-context generation in microseconds rather than milliseconds, eliminating the memory bottleneck that makes GPU-based inference slow.
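The microseconds-versus-milliseconds claim follows directly from the bandwidth figures quoted above. Assuming an illustrative 10 GB KV cache (the real size depends on model, context length, and precision):

```python
# Time to read a KV cache once, using the article's bandwidth figures.
kv_cache_bytes = 10e9     # assumed 10 GB KV cache
h100_bw = 3.35e12         # H100 SXM: ~3.35 TB/s
cs3_bw = 21e15            # Cerebras CS-3: 21 PB/s

h100_read = kv_cache_bytes / h100_bw    # ~3 milliseconds
cs3_read = kv_cache_bytes / cs3_bw      # ~0.5 microseconds
print(f"H100: {h100_read*1e3:.2f} ms, CS-3: {cs3_read*1e6:.2f} us")
```

The ratio between the two read times is the same roughly 6,000x as the bandwidth ratio; the arithmetic says nothing about whether a given cache fits in the CS-3's 44GB of SRAM, which constrains how the advantage is realised in practice.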
The tradeoff is that wafer-scale manufacturing is expensive and yield-limited. Cerebras has not been able to compete with Nvidia on training workloads where raw compute and interconnect bandwidth across many GPUs matter more than single-chip memory bandwidth. Disaggregated inference plays directly to the CS-3's unique advantage.
Why Amazon Bedrock Is the Right Distribution Channel
Amazon Bedrock is AWS's managed model-as-a-service platform, giving developers API access to foundation models from Anthropic, Meta, Mistral, AI21, and others without managing the underlying infrastructure. Adding Cerebras CS-3 systems to Bedrock means every developer calling the Bedrock API could benefit from the faster inference without changing any code.
This matters for enterprise AI deployment. The bottleneck for many production AI applications is not model quality — it is inference latency. Applications that require real-time responses (live coding assistants, voice AI agents, customer service systems) all degrade significantly when inference latency climbs above 200-300 milliseconds. The CS-3's memory bandwidth advantage directly addresses this.
AWS will initially deploy the disaggregated inference solution for third-party models on Bedrock. Later in 2026, Amazon Nova models — Amazon's own foundation model family — will also run on Cerebras hardware through Bedrock. That second phase is the more significant signal: Amazon is not just reselling Cerebras capacity, it is optimising its own AI products around the architecture.
The Nvidia Dependency Problem This Solves
AWS is the world's largest cloud provider and one of Nvidia's biggest GPU customers. It also has strong incentives to reduce that dependency. Nvidia's H100 and H200 GPUs carry extremely high margins — Nvidia's data centre gross margins exceeded 75% in 2025. Every dollar AWS spends on Nvidia hardware is a dollar not captured as AWS margin.
AWS has been building its own silicon for years. AWS Trainium handles training workloads. AWS Inferentia handles inference. Neither has achieved the performance density to replace Nvidia at the frontier model tier. The Cerebras partnership fills the gap at the highest-performance inference tier without requiring AWS to build its own wafer-scale chip from scratch.
The framing from one financial analyst covering the announcement was pointed: "Amazon and Cerebras Forge Disaggregated Inference Alliance to Shatter Nvidia's Memory Monopoly." Whether it actually shatters anything depends on scale, but the architectural logic is sound. Inference is where AI computing economics are decided — training happens once, inference happens billions of times per day.
What This Means for Developers
For developers building on Amazon Bedrock, the practical implications are:
- Lower latency for long-context applications. If your application passes long documents, conversation histories, or code files as context, the prefill bottleneck is your current performance limiter. Trainium handling prefill at higher compute density reduces time-to-first-token significantly.
- Faster decode for streaming responses. Applications where users see tokens streaming in real time — chatbots, writing assistants, code completions — will feel meaningfully faster when decode runs on CS-3 hardware.
- No code changes required. The disaggregated inference architecture is infrastructure-level. Developers calling Bedrock APIs do not change their integration. AWS and Cerebras handle the routing of prefill and decode stages transparently.
- Potential cost efficiency. Cerebras has historically offered competitive pricing relative to Nvidia for inference workloads. If the disaggregated architecture delivers the claimed throughput, AWS may be able to serve more inference requests per dollar — potentially passing some of that efficiency on in API pricing.
- Available in months, not years. The partnership announcement specified a Bedrock launch within the next couple of months. This is not a research preview; developers should expect to run it in production by mid-2026.
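The "no code changes" point is easiest to see in an actual Bedrock call. boto3's `bedrock-runtime` client and its `invoke_model` operation are real; the model ID below is one current Bedrock identifier used purely for illustration, and whether any given request is served by the Trainium/CS-3 path is decided by AWS, invisibly to the caller:

```python
import json

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    """Request body in Anthropic's Bedrock messages format."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

body = json.dumps(build_request("Summarise this document."))

# The call itself is identical before and after the hardware change:
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   response = client.invoke_model(
#       modelId="anthropic.claude-3-5-sonnet-20240620-v1:0", body=body)
print(body)
```

Nothing in the request identifies the serving hardware, which is why existing integrations would benefit automatically.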
The Competitive Landscape This Reshapes
The AWS-Cerebras announcement is one of several signals that the inference hardware market is fragmenting away from Nvidia dominance. In the same week:
- Meta revealed its MTIA chip roadmap with four custom chip generations specifically for inference
- Google continues to run Gemini inference exclusively on its in-house TPU v5 hardware
- Microsoft has Azure Maia chips handling some inference for Copilot
The pattern is consistent: hyperscalers are investing in custom inference hardware to escape Nvidia's GPU pricing power on the highest-volume computing workload in AI. Training will stay on Nvidia GPUs for the foreseeable future — the H100 and B200 ecosystem is too established to displace. But inference, which runs 24 hours a day at massive scale, is the market where alternatives to Nvidia are finding genuine traction.
Cerebras, which has been public about pursuing a Nasdaq IPO, benefits enormously from the AWS partnership. A deployment on Amazon Bedrock gives CS-3 hardware the scale exposure it has never had. If the performance claims hold up in production, the partnership could validate Cerebras's path to profitability ahead of its public offering.
Key Takeaways
- AWS and Cerebras announced disaggregated inference on March 13, 2026 — launching on Amazon Bedrock within months, with AWS as the exclusive launch cloud partner
- Disaggregated inference splits prefill (Trainium) and decode (Cerebras CS-3) across hardware optimised for each stage, connected via Elastic Fabric Adapter networking
- Cerebras CS-3 delivers 21 petabytes per second of memory bandwidth — roughly 6,000 times more than the Nvidia H100 — making it uniquely suited for the decode stage
- No code changes for Bedrock developers — the architecture is infrastructure-level; existing API integrations benefit automatically
- Amazon Nova models will also run on Cerebras hardware later in 2026 — AWS is building its own AI products around this architecture, not just reselling capacity
- The inference hardware market is fragmenting — Meta MTIA, Google TPU v5, Microsoft Azure Maia, and now AWS-Cerebras are all building inference alternatives to reduce Nvidia dependency
Written by
Abhishek Gautam