AWS and Cerebras Build Disaggregated Inference to Beat Nvidia at Speed
Quick summary
AWS and Cerebras announced a disaggregated inference architecture combining AWS Trainium and Cerebras CS-3 chips on Amazon Bedrock, claiming order-of-magnitude faster AI inference than GPUs.
AWS and Cerebras announced a formal partnership on March 13, 2026, to build what they are calling disaggregated inference — a new AI serving architecture that separates the prefill and decode stages of language model inference onto different specialised hardware. AWS is the exclusive launch cloud partner: the solution will be available through Amazon Bedrock, launching within the next couple of months. The claim: inference an order of magnitude faster than what GPU-based systems deliver today.
What Disaggregated Inference Actually Means
To understand why this matters, you need to understand what makes GPU-based inference slow. When a large language model generates a response, it does two distinct things. First, it processes your entire input prompt — this is called prefill. Second, it generates the output tokens one by one — this is called decode. The two stages have very different computational profiles.

Prefill is compute-bound: you need raw processing power to digest a long context quickly. Decode is memory-bandwidth-bound: generating each new token requires streaming the model weights and the entire key-value cache of previous tokens through memory, fast enough that output generation does not feel slow.
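A back-of-envelope calculation makes the decode bottleneck concrete. At low batch sizes, each decode step must stream the model weights plus the KV cache through memory, so bandwidth caps tokens per second regardless of FLOPs. The model size and cache size below are illustrative assumptions, not figures from the announcement:

```python
# Why decode is memory-bandwidth-bound: an upper bound on single-stream
# decode speed. Every token generated must read the weights and KV cache
# once, so tokens/sec <= bandwidth / bytes-read-per-token.

def max_decode_tokens_per_sec(weight_bytes: float, kv_cache_bytes: float,
                              mem_bandwidth_bytes_per_sec: float) -> float:
    """Bandwidth-limited ceiling on decode speed for one request."""
    bytes_per_token = weight_bytes + kv_cache_bytes
    return mem_bandwidth_bytes_per_sec / bytes_per_token

weights = 70e9 * 2    # hypothetical 70B-parameter model in FP16: 140 GB
kv_cache = 10e9       # assume a 10 GB KV cache for a long context
h100_bw = 3.35e12     # H100 SXM HBM3 bandwidth, ~3.35 TB/s

# Roughly 22 tokens/sec, no matter how much compute the GPU has.
print(f"{max_decode_tokens_per_sec(weights, kv_cache, h100_bw):.1f} tokens/sec")
```

Real serving stacks batch many requests to amortise the weight reads, but the per-request ceiling illustrates why bandwidth, not compute, governs decode.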
GPUs handle both stages on the same hardware, which means they are neither optimally compute-dense for prefill nor optimally memory-bandwidth-rich for decode. The result is a performance compromise on both.
Disaggregated inference splits the two stages. AWS Trainium chips, optimised for high compute throughput, handle the prefill stage. Cerebras CS-3 chips, which deliver thousands of times more memory bandwidth than the fastest GPU, handle the decode stage. The two systems are connected via AWS's Elastic Fabric Adapter high-speed networking, which allows the prefill output (the key-value cache) to be transferred to the Cerebras decode system in real time.
The architecture essentially assigns each stage to the hardware it was built for. The result, according to both companies, is inference that is an order of magnitude faster than current GPU-based systems.
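One cost the split introduces is shipping the KV cache from the prefill system to the decode system, so the architecture only wins if that transfer is small relative to the savings. A toy latency model shows why a fast fabric like EFA matters; the cache size, fabric bandwidth, and prefill time below are assumptions for illustration, not published figures for Trainium, EFA, or the CS-3:

```python
# Toy model: time-to-first-token in a disaggregated setup is prefill time
# plus the cost of moving the KV cache across the fabric to the decoder.

def ttft_seconds(prefill_s: float, kv_cache_bytes: float,
                 link_bytes_per_sec: float) -> float:
    """Time to first token = prefill + KV-cache transfer over the fabric."""
    return prefill_s + kv_cache_bytes / link_bytes_per_sec

kv_cache = 10e9    # assume 10 GB of KV cache for a long prompt
link_bw = 400e9    # assume ~400 GB/s of effective fabric bandwidth
prefill = 0.150    # assume prefill completes in 150 ms

transfer = kv_cache / link_bw    # 25 ms at these assumed numbers
print(f"transfer adds {transfer*1e3:.0f} ms on top of {prefill*1e3:.0f} ms prefill")
```

At these assumed numbers the transfer adds tens of milliseconds once, while the bandwidth advantage applies to every generated token, which is the trade the architecture is betting on.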
What the Cerebras CS-3 Actually Is
The Cerebras CS-3 is a wafer-scale chip — literally a single chip the size of an entire silicon wafer. A standard Nvidia H100 GPU die is approximately 800 square millimetres. The Cerebras CS-3 wafer is 46,225 square millimetres — about 57 times larger.
The implications for memory bandwidth are significant. Memory bandwidth scales with die size because on-chip SRAM sits adjacent to compute elements. The CS-3 has 44GB of on-chip SRAM with 21 petabytes per second of memory bandwidth. For comparison, the Nvidia H100 SXM has 80GB of HBM3 memory with approximately 3.35 terabytes per second of bandwidth. The CS-3 delivers roughly 6,000 times more memory bandwidth than the H100.
For the decode stage of inference — where memory bandwidth is the limiting factor — this is a transformational advantage. The CS-3 can read the entire key-value cache of a long-context generation in microseconds rather than milliseconds, eliminating the memory bottleneck that makes GPU-based inference slow.
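The microseconds-versus-milliseconds claim follows directly from the bandwidth figures quoted above. Assuming an illustrative 10 GB KV cache (the real size depends on model, context length, and precision):

```python
# Time to read a KV cache once, using the article's bandwidth figures.
kv_cache_bytes = 10e9     # assumed 10 GB KV cache
h100_bw = 3.35e12         # H100 SXM: ~3.35 TB/s
cs3_bw = 21e15            # Cerebras CS-3: 21 PB/s

h100_read = kv_cache_bytes / h100_bw    # ~3 milliseconds
cs3_read = kv_cache_bytes / cs3_bw      # ~0.5 microseconds
print(f"H100: {h100_read*1e3:.2f} ms, CS-3: {cs3_read*1e6:.2f} us")
```

The ratio between the two read times is the same roughly 6,000x as the bandwidth ratio; the arithmetic says nothing about whether a given cache fits in the CS-3's 44GB of SRAM, which constrains how the advantage is realised in practice.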
The tradeoff is that wafer-scale manufacturing is expensive and yield-limited. Cerebras has not been able to compete with Nvidia on training workloads where raw compute and interconnect bandwidth across many GPUs matter more than single-chip memory bandwidth. Disaggregated inference plays directly to the CS-3's unique advantage.
Why Amazon Bedrock Is the Right Distribution Channel
Amazon Bedrock is AWS's managed model-as-a-service platform, giving developers API access to foundation models from Anthropic, Meta, Mistral, AI21, and others without managing the underlying infrastructure. Adding Cerebras CS-3 systems to Bedrock means every developer calling the Bedrock API could benefit from the faster inference without changing any code.
This matters for enterprise AI deployment. The bottleneck for many production AI applications is not model quality — it is inference latency. Applications that require real-time responses (live coding assistants, voice AI agents, customer service systems) all degrade significantly when inference latency climbs above 200-300 milliseconds. The CS-3's memory bandwidth advantage directly addresses this.
AWS will initially deploy the disaggregated inference solution for third-party models on Bedrock. Later in 2026, Amazon Nova models — Amazon's own foundation model family — will also run on Cerebras hardware through Bedrock. That second phase is the more significant signal: Amazon is not just reselling Cerebras capacity, it is optimising its own AI products around the architecture.
The Nvidia Dependency Problem This Solves
AWS is the world's largest cloud provider and one of Nvidia's biggest GPU customers. It also has strong incentives to reduce that dependency. Nvidia's H100 and H200 GPUs carry extremely high margins — Nvidia's data centre gross margins exceeded 75% in 2025. Every dollar AWS spends on Nvidia hardware is a dollar not captured as AWS margin.
AWS has been building its own silicon for years. AWS Trainium handles training workloads. AWS Inferentia handles inference. Neither has achieved the performance density to replace Nvidia at the frontier model tier. The Cerebras partnership fills the gap at the highest-performance inference tier without requiring AWS to build its own wafer-scale chip from scratch.
The framing from one financial analyst covering the announcement was pointed: "Amazon and Cerebras Forge Disaggregated Inference Alliance to Shatter Nvidia's Memory Monopoly." Whether it actually shatters anything depends on scale, but the architectural logic is sound. Inference is where AI computing economics are decided — training happens once, inference happens billions of times per day.
What This Means for Developers
For developers building on Amazon Bedrock, the practical implications are:
- Lower latency for long-context applications. If your application passes long documents, conversation histories, or code files as context, the prefill bottleneck is your current performance limiter. Trainium handling prefill at higher compute density reduces time-to-first-token significantly.
- Faster decode for streaming responses. Applications where users see tokens streaming in real time — chatbots, writing assistants, code completions — will feel meaningfully faster when decode runs on CS-3 hardware.
- No code changes required. The disaggregated inference architecture is infrastructure-level. Developers calling Bedrock APIs do not change their integration. AWS and Cerebras handle the routing of prefill and decode stages transparently.
- Potential cost efficiency. Cerebras has historically offered competitive pricing relative to Nvidia for inference workloads. If the disaggregated architecture delivers the claimed throughput, AWS may be able to serve more inference requests per dollar — potentially passing some of that efficiency on in API pricing.
- Available in months, not years. The partnership announcement specified a Bedrock launch within the next couple of months. This is not a research preview; developers should expect to run it in production by mid-2026.
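The "no code changes" point is easiest to see in an actual Bedrock call. boto3's `bedrock-runtime` client and its `invoke_model` operation are real; the model ID below is one current Bedrock identifier used purely for illustration, and whether any given request is served by the Trainium/CS-3 path is decided by AWS, invisibly to the caller:

```python
import json

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    """Request body in Anthropic's Bedrock messages format."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

body = json.dumps(build_request("Summarise this document."))

# The call itself is identical before and after the hardware change:
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   response = client.invoke_model(
#       modelId="anthropic.claude-3-5-sonnet-20240620-v1:0", body=body)
print(body)
```

Nothing in the request identifies the serving hardware, which is why existing integrations would benefit automatically.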
The Competitive Landscape This Reshapes
The AWS-Cerebras announcement is one of several signals that the inference hardware market is fragmenting away from Nvidia dominance. In the same week:
- Meta revealed its MTIA chip roadmap with four custom chip generations specifically for inference
- Google continues to run Gemini inference exclusively on its in-house TPU v5 hardware
- Microsoft has Azure Maia chips handling some inference for Copilot
The pattern is consistent: hyperscalers are investing in custom inference hardware to escape Nvidia's GPU pricing power on the highest-volume computing workload in AI. Training will stay on Nvidia GPUs for the foreseeable future — the H100 and B200 ecosystem is too established to displace. But inference, which runs 24 hours a day at massive scale, is the market where alternatives to Nvidia are finding genuine traction.
Cerebras, which has been public about pursuing a Nasdaq IPO, benefits enormously from the AWS partnership. A deployment on Amazon Bedrock gives CS-3 hardware the scale exposure it has never had. If the performance claims hold up in production, the partnership could validate Cerebras's path to profitability ahead of its public offering.
Key Takeaways
- AWS and Cerebras announced disaggregated inference on March 13, 2026 — launching on Amazon Bedrock within months, with AWS as the exclusive launch cloud partner
- Disaggregated inference splits prefill (Trainium) and decode (Cerebras CS-3) across hardware optimised for each stage, connected via Elastic Fabric Adapter networking
- Cerebras CS-3 delivers 21 petabytes per second of memory bandwidth — roughly 6,000 times more than the Nvidia H100 — making it uniquely suited for the decode stage
- No code changes for Bedrock developers — the architecture is infrastructure-level; existing API integrations benefit automatically
- Amazon Nova models will also run on Cerebras hardware later in 2026 — AWS is building its own AI products around this architecture, not just reselling capacity
- The inference hardware market is fragmenting — Meta MTIA, Google TPU v5, Microsoft Azure Maia, and now AWS-Cerebras are all building inference alternatives to reduce Nvidia dependency
Written by
Abhishek Gautam