Apr 28 Claude Outage: Postmortem Playbook for Production Teams
Quick summary
Anthropic reported Claude.ai and API errors on Apr 28, 2026. Learn the exact failure patterns, retry controls, and fallback changes teams need before the next incident.
Read next
- Claude.ai Outage Apr 28: API Errors, Login Failures, Dev MitigationAnthropic confirmed elevated API errors and Claude.ai login failures on Apr 28, 2026. Timeline, blast radius, and failover steps for teams shipping tonight.
- Claude 3 Haiku Retires April 19, 2026: Migration Guide for DevelopersAnthropic is retiring claude-3-haiku-20240307 on April 19, 2026. Any production application still calling this model will break. Here is exactly what to migrate to, how to do it, and what the cost difference looks like.
Anthropic's status page posted an Investigating update at 17:41 UTC and an Identified update at 17:51 UTC on April 28, confirming elevated API errors and Claude.ai access issues. In many teams, the direct outage was not the most expensive part. The expensive part was what happened next: retry storms, silent queue growth, and brittle orchestration logic that assumed one model provider would always be reachable.
If you ship production AI features, this is the useful question: what failed in your system when Claude degraded, and what must change before the next provider incident. This piece is that postmortem playbook.
The Outage Signal Was Clear; Most Internal Signals Were Not
The public signal was straightforward: vendor incident, active degradation, partial surface failures. Internal signals were usually messy:
- API error alerts fired, but dashboards did not separate provider failures from app regressions.
- Chat endpoints failed loudly, while async agent tasks failed quietly.
- On-call teams saw downstream timeout noise, not a clean root-cause label.
That pattern means your observability is too endpoint-centric and not dependency-centric. You need explicit dependency health channels for each model provider, then map app symptoms to provider state in near real time.
Retry Logic Became the Incident Multiplier
A failed request is manageable. One million failed requests with unbounded retries is not.
Many SDK wrappers use exponential backoff that is technically correct but operationally unsafe when service degradation is prolonged. If your workers keep retrying as queues pile up, you create:
- latency amplification
- token waste
- user-visible stalls long after partial recovery
Set retry budgets by feature tier. A billing assistant can fail closed after two retries. A premium support flow may justify five retries with fallback generation. Treat retries as a business decision, not a default library behavior.
Authentication Coupling Broke More Than Chat UI
Incident notes referenced login path problems affecting Claude Code. Teams often misclassify this as a client-only issue. It is broader when auth and request execution share control paths:
- service accounts with short-lived tokens fail refresh cycles
- CI jobs using interactive auth break unexpectedly
- background workers fail token renewal mid-run
If auth instability at a provider can pause your deployments, separate runtime credentials from human login paths and test that separation monthly.
Graceful Degradation Was Better Than “All-or-Nothing” Failover
Full provider failover sounds clean and often fails in practice. Schema mismatches, tool-calling differences, and safety policy variance create new incidents during the switch.
The more reliable pattern is feature-level degradation:
- keep critical flows alive with shorter prompts and reduced tool chains
- disable low-priority AI assist temporarily
- surface clear user messaging instead of generic 500 errors
This is where a live pricing and routing view matters. Teams with cost-aware routing performed better because they already knew which fallback paths were acceptable under pressure. Use /tools/llm-api-pricing as the operating table, not just a planning reference.
Multi-Provider Readiness Is a Runtime Practice, Not an Architecture Slide
Most teams say they are multi-provider. Few are tested multi-provider.
True readiness means:
- adapter layer parity across providers
- prompt and tool schema compatibility tests on every release
- failover runbooks that include legal and policy constraints
If one provider outage triggers emergency prompt rewrites at 1 AM, you are still single-provider in operational terms.
Incident Comms Quality Determined Customer Trust
Teams that published timestamped, concrete updates retained trust even during degraded performance. Teams that used vague language (“temporary instability”) burned more support time and renewal goodwill.
Your status messaging should always answer:
- what is affected
- what is not affected
- what mitigation is active
- when next update is due
This communication standard is as important as technical mitigation during public incidents.
What to Change in the Next 14 Days
Use this short execution list:
- Add provider health as a first-class dependency in monitoring.
- Cap retries per feature and enforce queue cutoffs.
- Test auth-path separation for CI and background workloads.
- Run one controlled failover drill with synthetic traffic.
- Publish and rehearse a customer-facing incident template.
Then map this reliability work to broader infrastructure risk planning. A provider outage and a regional infrastructure shock are different events, but they share one consequence: concentrated dependencies break assumptions quickly. The same risk framing appears in our Gulf cloud recovery analysis and our cloud SLA force majeure checklist.
Key Takeaways
- Confirmed window: Anthropic posted incident updates within a 10-minute span at 17:41 UTC and 17:51 UTC on Apr 28.
- Primary failure pattern was not only API errors, it was retry amplification and weak dependency observability in downstream systems.
- Best mitigation was feature-level graceful degradation, not emergency “flip everything” provider migration.
- Operational multi-provider readiness requires tested adapters, schema parity, and failover drills, not architecture diagrams.
- Customer trust outcome depended heavily on timestamped, specific status updates during the incident.
FAQ
Frequently Asked Questions
What exactly was reported in the Claude incident on April 28, 2026?
Anthropic reported elevated API errors and Claude.ai access issues, with status updates moving from Investigating to Identified within about ten minutes. Incident notes also referenced login path problems affecting Claude Code users.
Why do LLM outages create larger app incidents than expected?
They often trigger retry storms, queue buildup, and hidden dependencies such as token refresh paths or async workers. The provider outage is the trigger, but app-side amplification creates most of the operational damage.
Is full model-provider failover always the best response?
No. Full failover can introduce schema and behavior mismatches under pressure. Feature-level graceful degradation is usually safer and faster if provider adapters are not already battle-tested.
What should teams implement first after this kind of incident?
Start with dependency-level monitoring, bounded retries, and one realistic failover drill using synthetic traffic. Those three controls usually deliver the largest reliability gain in the shortest time.
Free Weekly Briefing
The AI & Dev Briefing
One honest email a week — what actually matters in AI and software engineering. No noise, no sponsored content. Read by developers across 30+ countries.
No spam. Unsubscribe anytime.
More on Anthropic
All posts →Claude.ai Outage Apr 28: API Errors, Login Failures, Dev Mitigation
Anthropic confirmed elevated API errors and Claude.ai login failures on Apr 28, 2026. Timeline, blast radius, and failover steps for teams shipping tonight.
Claude 3 Haiku Retires April 19, 2026: Migration Guide for Developers
Anthropic is retiring claude-3-haiku-20240307 on April 19, 2026. Any production application still calling this model will break. Here is exactly what to migrate to, how to do it, and what the cost difference looks like.
Anthropic Doubled Claude Usage Limits Off-Peak: What Changed for Free, Pro and Max
Anthropic doubled Claude off-peak usage limits from March 13 to 27, 2026. Free users got 2x daily messages, Pro and Max got 2x extended thinking. Here is the full breakdown.
Anthropic Is Giving 10,000 Open Source Developers Free Claude Max — How to Get It
Anthropic's Claude for Open Source program gives qualifying maintainers 6 months of Claude Max 20x ($1,200 value) free. Eligibility, step-by-step application, and what to do if you're borderline.
Written by
Software Engineer based in Delhi, India. Writes about AI models, semiconductor supply chains, and tech geopolitics — covering the intersection of infrastructure and global events. 941+ posts cited by ChatGPT, Perplexity, and Gemini. Read in 167 countries.
