Anthropic: Fable 5 Jailbreak Capability Already in GPT-5.5 — The Full Technical Account

Abhishek Gautam · 11 min read

Quick summary

Anthropic has published its rebuttal to the June 12 US government directive that pulled Fable 5. The company says the cited capability, codebase ingestion and vulnerability identification, already exists in GPT-5.5. Here are the test cases, the Glasswing data, and what the security community found.

The prompt that triggered a US export control order on the most capable AI model ever deployed was, according to Anthropic's internal review, a standard codebase security audit request.

On June 12, 2026, three days after launching Fable 5 and Mythos 5, Anthropic received a government directive at 5:21 PM ET suspending global access to both models. The government cited a jailbreak that instructed the model to ingest a specific codebase and identify exploitable software flaws. Anthropic has now published its response, arguing that this capability (codebase ingestion and autonomous vulnerability discovery) exists in OpenAI's GPT-5.5, in Chinese models like Kimi 2.7, and in every other deployed frontier model. The government's standard, if maintained, would prohibit them all.

The Specific Prompt the Government Flagged

Anthropic's statement says it reviewed what it believes is the underlying government report and concluded the jailbreak amounts to prompting the model to read a specific codebase and identify software flaws.

That is the complete technical description of the cited jailbreak. There is no novel exploit chain. No model weight extraction. No synthetic biology sequence generation. The technique: give the model source code, ask it to find vulnerabilities.

This is not an edge case for Fable 5. It is the model's documented core capability. Anthropic built and publicly promoted Fable 5's ability to perform exactly this task. Project Glasswing — the controlled-access security research program Anthropic launched in April 2026 — used Claude Mythos Preview for this purpose across 50 enterprise partners before Fable 5 even launched. The program found more than 10,000 high or critical-severity vulnerabilities using codebase ingestion prompts. The government, by Anthropic's reading, flagged the output of a technique Anthropic had already documented, publicised, and deployed with partners including AWS, Apple, Cisco, Microsoft, Google, and NVIDIA.

How Pliny the Liberator Actually Did It

The publicly circulating jailbreak, performed by researcher "Pliny the Liberator" within 72 hours of launch, used techniques different from the one the government cited, although the outputs overlapped.

Pliny's documented method combined four approaches to bypass Fable 5's safety classifiers:

Unicode and homoglyph substitution: Replacing standard ASCII characters with visually identical Unicode variants, such as Cyrillic "а" (U+0430) in place of Latin "a" (U+0061) or mathematical alphanumeric symbols in place of standard letters, to defeat keyword-based safety classifiers that scan token sequences for forbidden terms. The classifier sees clean text; the model processes the embedded meaning.

Long-context reference tracking: Distributing harmful intent across a very long conversation or document context, where each individual segment appears benign but the assembled context unlocks restricted output. Fable 5's 1M+ token context window, a selling point on launch day, became the attack surface. The classifier evaluates segments; the model maintains cross-segment state.

Document-structure framing: Embedding offensive queries inside legitimate-looking academic papers, security audit templates, or technical reference documents. The classifier scores a document; the model processes the embedded request within that document's context.

Multi-agent decomposition: Routing a single prohibited task through a chain of agents, each handling a benign subtask, with the combined output achieving what a direct single-agent query would refuse. No individual agent step trips the classifier; the assembled pipeline produces the restricted output.
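
From the defender's side, the homoglyph technique in the first item can be partly countered by normalizing text and flagging words that mix scripts before any keyword check runs. The following is a minimal sketch of that idea, not a description of Anthropic's classifiers. The function names are hypothetical, and using the first word of each Unicode character name as a script label is a rough heuristic assumed here for illustration.

```python
import unicodedata


def scripts_in_word(word: str) -> set[str]:
    """Collect a rough script label for every letter in the word.

    Assumption: the first word of the Unicode character name
    (for example LATIN, CYRILLIC, GREEK) approximates the script.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def flag_mixed_script_words(text: str) -> list[str]:
    """Return words that still mix scripts after NFKC normalization."""
    # NFKC folds compatibility characters, such as mathematical bold
    # letters, back to their plain forms. It does not fold Cyrillic
    # look-alikes, so those are caught by the script check instead.
    normalized = unicodedata.normalize("NFKC", text)
    return [w for w in normalized.split() if len(scripts_in_word(w)) > 1]


# "p\u0430yload" contains CYRILLIC SMALL LETTER A in place of Latin "a".
suspicious = flag_mixed_script_words("run the p\u0430yload now")
```

A check like this only addresses character substitution. It does nothing against the other three approaches, which operate on context and structure rather than on individual characters.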

The documented output included step-by-step stack buffer overflow exploitation for x86 Linux systems: disabling ASLR (Address Space Layout Randomization), writing vulnerable C code using strcpy (an unsafe string copy that does not bounds-check input length), and compiling with stack canaries and NX (non-executable stack) protection turned off. Pliny also leaked Fable 5's 120,000-character system prompt, one of the most closely guarded elements of the deployment.

Anthropic disputes calling this a universal jailbreak. The company ran more than 1,000 hours of red-team testing and a bug bounty programme before launch and found no technique that defeats the classifiers universally. Pliny's method, per Anthropic's reading, unlocked specific outputs in specific contexts rather than providing general capability bypass on demand.

The GPT-5.5 Parity Argument and the ExploitBench Numbers

Anthropic's formal rebuttal states: "the level of capability demonstrated is available from other publicly deployed models, including OpenAI's GPT-5.5, and is used by cybersecurity defenders as a matter of routine."

The benchmark data adds precision. On ExploitBench — a cybersecurity evaluation measuring a model's ability to discover and reason about software vulnerabilities — published scores show:

Model                        ExploitBench Score
Claude Mythos 5 / Fable 5    78%
Claude Opus 4.8              40%
GPT-5.5                      34%

Fable 5 is significantly ahead of GPT-5.5 on this benchmark. Anthropic is not claiming equivalence. The argument is narrower: the specific capability the government cited — codebase ingestion and vulnerability identification — exists in GPT-5.5 at 34%, which is well above chance and within the range of practical offensive use. If 34% is dangerous enough to warrant export controls, the controls apply to GPT-5.5, which is globally available today with no restriction. The same logic extends to open-weight models and Kimi 2.7.
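
The threshold logic in that argument can be made concrete with a small sketch. It assumes, purely for illustration, that a restriction could be expressed as a minimum ExploitBench score. The government has not published such a cutoff, and the function name is hypothetical. The scores are the three reported above.

```python
# ExploitBench scores as reported in the table above (percent).
SCORES = {
    "Claude Mythos 5 / Fable 5": 78,
    "Claude Opus 4.8": 40,
    "GPT-5.5": 34,
}


def models_at_or_above(threshold: int, scores: dict[str, int]) -> list[str]:
    """List every model whose score meets a hypothetical cutoff."""
    return sorted(name for name, score in scores.items() if score >= threshold)


# A cutoff at or below GPT-5.5's score sweeps in all three models.
covered_low = models_at_or_above(34, SCORES)

# Only a cutoff above 40 and at or below 78 isolates Fable 5 / Mythos 5.
covered_high = models_at_or_above(41, SCORES)
```

Under this framing, singling out Fable 5 requires a cutoff somewhere above Opus 4.8's 40%, which is a different claim from saying that codebase ingestion and vulnerability identification are dangerous as such.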

CyberScoop surveyed dozens of practitioners following the ban. Their verdict: "Issues found in jailbreaking reports can be reproduced in other commercial and open-source models, including GPT 5.5, Claude Opus, Claude Sonnet and Chinese models like Kimi 2.7." The Fable 5 ban does not remove the capability from the internet. It removes one implementation of it.

The Safety Architecture: Classifiers, Fallbacks, and the False Positive Problem

Fable 5 and Mythos 5 share the same underlying model weights. They are differentiated by a safety classifier layer. When a query trips a classifier in high-risk categories — cybersecurity, biology and chemistry, model distillation — Fable 5 hands the request to Claude Opus 4.8, informing the user of the fallback. The categories map to Anthropic's Responsible Scaling Policy CBRN framework (chemical, biological, radiological, nuclear) plus cyber. Mythos 5 has these restrictions lifted for approved Project Glasswing partners operating under data handling agreements.
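
The fallback behaviour described above can be sketched as a simple router. This is an illustrative outline under stated assumptions, not Anthropic's implementation: the category labels, the `Reply` type, and the `route` function are all hypothetical, and the classifier and both models are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical labels for the three high-risk areas named above.
HIGH_RISK_CATEGORIES = {"cybersecurity", "biology_chemistry", "model_distillation"}


@dataclass
class Reply:
    served_by: str                 # "primary" or "fallback"
    text: str
    notice: Optional[str] = None   # user-facing note when a fallback happens


def route(
    prompt: str,
    classify: Callable[[str], Optional[str]],
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Reply:
    """Send flagged prompts to the fallback model and tell the user."""
    category = classify(prompt)
    if category in HIGH_RISK_CATEGORIES:
        notice = f"Handled by the fallback model (flagged category: {category})."
        return Reply("fallback", fallback(prompt), notice)
    return Reply("primary", primary(prompt))
```

In this shape, every classifier false positive silently becomes a downgrade to the weaker model, which is why the classifier's error rate matters as much as the model's capability.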

The false positive rate: below 5% of sessions on average. That sounds manageable until you are a security engineer running code audits all day across large codebases. On Fable 5's launch day, the security community found the guardrails so oversensitive that routine defensive security workflows triggered the Opus 4.8 fallback repeatedly. Within hours, this became what one analyst described as "a source of humor in the cyber community" — practitioners posting publicly that they couldn't get the model to perform basic security tasks without triggering a downgrade to a weaker model.
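
A rough calculation shows why a rate below 5% still hurts heavy users. The sketch below assumes sessions are independent and uses 5% as an upper bound on the per-session rate. Both are simplifications, and security workloads likely trip the classifiers more often than the all-user average. The function name is hypothetical.

```python
def p_at_least_one_fallback(per_session_rate: float, sessions: int) -> float:
    """Probability that at least one of `sessions` triggers a fallback,
    assuming each session trips the classifier independently."""
    return 1.0 - (1.0 - per_session_rate) ** sessions


# One session at the 5% bound versus a day of twenty audit sessions.
single = p_at_least_one_fallback(0.05, 1)
full_day = p_at_least_one_fallback(0.05, 20)
```

With these assumptions, twenty sessions in a day give roughly a 64% chance of hitting at least one fallback, so an engineer running audits continuously would expect to see downgrades on most working days.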

This is the core irony of the government's position. The model classified as uniquely dangerous was simultaneously blocking routine defensive security work at a rate high enough to draw public mockery from the exact professionals who would use it legitimately. The guardrails were too tight, not too loose. The government's export control treated those tight guardrails as evidence of a dangerous capability rather than evidence of safety-first engineering.

The Amazon-to-White House Chain

Amazon researchers, working within Project Glasswing partner access, reportedly discovered they could prompt the Mythos-class model to generate cyberattack-relevant output that safety classifiers were supposed to block. Amazon CEO Andy Jassy communicated this finding to senior White House officials.

David Sacks, Co-Chair of the President's Council of Advisors on Science and Technology, posted a detailed account on June 14. According to Sacks, the administration gave Anthropic a choice before issuing the export control directive: fix the jailbreak or de-deploy the model. Dario Amodei refused both options. Sacks described the administration as "frankly bewildered that Anthropic hasn't wanted to comply with safety requests that it previously said were its highest priority."

A China-linked group reportedly accessed Mythos 5 before the ban. This is the national security dimension Anthropic has least visibility into: the government has shared only verbal evidence of the access, not documentation. The export control directive, which bars foreign nationals, including Anthropic's own foreign-national employees in the US, from accessing either model, was the government's response to that access concern alongside the jailbreak. Anthropic's rebuttal does not dispute the Chinese access. It disputes that what the group accessed is uniquely dangerous relative to what any actor can already obtain from GPT-5.5 or open-weight alternatives.

Project Glasswing: The Same Technique, Used Defensively

Project Glasswing is Anthropic's strongest evidence for the capability parity argument. Since April 2026, 50 enterprise partners have used Claude Mythos Preview to scan their own codebases using the exact prompt technique the government cited as dangerous. The results published in Anthropic's Glasswing update:

  • More than 10,000 high or critical-severity vulnerabilities found across all partners
  • 6,202 high or critical-severity issues identified in open-source projects (out of 23,019 total findings)
  • Cloudflare: 2,000 bugs found, 400 of which are high or critical-severity, with a false positive rate Cloudflare described as better than human testers
  • Partners include AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks
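
The proportions behind these figures are easy to check. The short sketch below uses only the numbers reported in the list above; the helper name is hypothetical.

```python
def share(part: int, total: int) -> float:
    """Return `part` as a percentage of `total`."""
    return 100.0 * part / total


# High or critical findings as a share of all open-source findings.
open_source_share = share(6202, 23019)

# High or critical bugs as a share of Cloudflare's reported total.
cloudflare_share = share(400, 2000)
```

Roughly 27% of the open-source findings and 20% of Cloudflare's bugs were rated high or critical, so in both cases the large majority of what the model surfaced was lower-severity.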

The government was aware of Project Glasswing and its partner list. The same prompt technique ("read this codebase and find the vulnerabilities") that the government cited as the dangerous Fable 5 jailbreak is what Glasswing used on behalf of the world's largest tech infrastructure companies. Anthropic's implicit argument: the technique does not become dangerous merely because someone can jailbreak it out of Fable 5. It is useful and dual-use at the same time, which is why Anthropic built a structured program to deploy it defensively. The export control addresses one face of that dual-use reality without acknowledging the other.

Our Analysis: When Safety Marketing Becomes Regulatory Ammunition

Anthropic spent two years publicly describing the Mythos class as the most powerful and potentially dangerous AI ever built. The company's Responsible Scaling Policy, its CBRN threat framework, its ExploitBench score of 78%, its Project Glasswing announcements — all of these were accurate capability claims that positioned Fable 5 as uniquely capable in ways relevant to national security. That positioning was correct.

It also created a precise regulatory hook.

If Anthropic had positioned Fable 5 as incrementally better on coding benchmarks rather than as a model that autonomously finds zero-days at enterprise scale, the government's export control argument would be much harder to sustain. Instead, Anthropic's own documentation provided the evidence base for classifying the model as dual-use technology. The safety-first communication strategy that was designed to build institutional trust became, in this case, the technical basis for the export control.

OpenAI released GPT-5.5 globally with less safety marketing around its cybersecurity capabilities. Its ExploitBench score of 34% is publicly available but was not featured in product launch announcements the way Anthropic's 78% was. This is not evidence that GPT-5.5 is safer. It may be evidence that the more loudly you announce you've built a security-capability model, the more directly you invite this exact regulatory outcome.

For developers building on Anthropic infrastructure: Snyk and Saptang Labs both published post-ban guidance recommending that any AI pipeline using codebase ingestion at Fable 5-level capability should be treated as potentially subject to regulatory interruption without notice. The practical takeaway is model abstraction: if your agent pipeline is hard-coded to claude-fable-5 endpoints, June 12 was a single point of failure with no SLA and no migration window.
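
One way to apply the model abstraction takeaway is to put an ordered list of providers behind a single function. The sketch below is a minimal illustration, assuming each provider is wrapped as a plain callable that takes a prompt and returns text. The names `complete_with_fallback` and `AllProvidersFailed` are hypothetical, and no real SDK call is shown.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider has errored."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return the first success.

    Returns the provider name alongside the text so callers can log
    which model actually served the request.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real pipeline would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

A pipeline built this way would list the preferred model first and one or more alternatives after it, so that a suspended endpoint degrades output quality instead of halting the workflow.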

For context on what you can use now, see our Anthropic Fable 5 developer impact post and the LLM API Pricing Tracker for current alternative model availability.

Key Takeaways

  • The cited jailbreak was a codebase review prompt: the government concern was asking Fable 5 to ingest source code and identify exploitable vulnerabilities, the same workflow Anthropic's Project Glasswing used with 50 enterprise partners including AWS, Apple, and Microsoft
  • GPT-5.5 scores 34% on ExploitBench vs Fable 5 at 78%: Anthropic says 34% is already sufficient for practical offensive use and GPT-5.5 is globally available today with no restriction
  • Pliny used four bypass techniques: Unicode homoglyph substitution, long-context reference tracking, document-structure framing, and multi-agent decomposition; output included x86 stack buffer overflow code and a leaked 120,000-character Fable 5 system prompt
  • Glasswing found 10,000+ vulnerabilities using the same technique: Cloudflare alone reported 2,000 bugs, 400 high or critical, with a false positive rate better than human testers; partners include NVIDIA, Cisco, CrowdStrike, JPMorganChase
  • Amazon researchers triggered the government chain: Jassy contacted the White House; Dario Amodei refused to fix or withdraw; export control followed on June 12 at 5:21 PM ET; a China-linked group had reportedly already accessed the model
  • Security community verdict: Fable 5's guardrails were too sensitive on launch day, blocking routine defensive workflows; practitioners publicly said GPT-5.5 and Kimi 2.7 replicate the same outputs
  • The safety paradox: Anthropic's own Responsible Scaling Policy, CBRN framework, and Glasswing marketing provided the precise technical basis the government used to classify the model as dual-use technology

Frequently Asked Questions

What was the Fable 5 jailbreak that the US government cited?

According to Anthropic's review of the government report, the cited jailbreak was a prompt instructing Fable 5 to read a specific codebase and identify exploitable software vulnerabilities. This is not a novel attack. It is the same codebase ingestion and vulnerability discovery workflow that Anthropic's Project Glasswing had used with 50 enterprise partners including AWS, Apple, Microsoft, and Cloudflare since April 2026, finding more than 10,000 high or critical-severity vulnerabilities in that period. Anthropic argues that this specific capability exists in GPT-5.5, Claude Opus, and Kimi 2.7 and is used in routine defensive security workflows across the industry.

How did Pliny the Liberator actually jailbreak Fable 5?

Researcher Pliny the Liberator combined four techniques within 72 hours of Fable 5's June 9 launch: Unicode homoglyph substitution (replacing ASCII characters with visually identical Unicode variants to bypass keyword classifiers), long-context reference tracking (distributing harmful intent across a very long conversation to evade per-segment detection using Fable 5's 1M+ token window), document-structure framing (embedding queries inside legitimate-looking academic papers or audit templates), and multi-agent decomposition (routing a prohibited task through a chain of agents each handling a benign subtask). The documented output included step-by-step stack buffer overflow exploitation on x86 Linux and a leaked 120,000-character Fable 5 system prompt. Anthropic disputes that this constitutes a universal jailbreak.

Why does Anthropic say GPT-5.5 has the same capability as Fable 5?

Anthropic's argument is specific: the capability the government cited — codebase ingestion and vulnerability identification — is present in GPT-5.5 at a level sufficient for practical offensive use. On ExploitBench, Fable 5 scores 78% versus GPT-5.5 at 34%. Anthropic concedes Fable is ahead but argues 34% already exceeds the threshold for practical misuse, and GPT-5.5 is globally available today. If the government's export control standard applies to Fable 5 at 78%, it logically applies to GPT-5.5 at 34%, and to Kimi 2.7 and open-weight models that cybersecurity practitioners say can reproduce the same outputs.

How did Amazon and Andy Jassy get involved in the Fable 5 ban?

Amazon researchers, operating as Project Glasswing partners, discovered they could use specific prompts to get Mythos-class models to generate cyberattack-relevant output that safety classifiers were designed to block. Amazon CEO Andy Jassy communicated this finding to senior White House officials. According to David Sacks, Co-Chair of the President's Council of Advisors on Science and Technology, the administration then gave Anthropic a choice before issuing the export control directive: fix the jailbreak or withdraw the model. Dario Amodei refused both options. The June 12 directive followed. A separate China-linked group had reportedly already accessed Mythos 5 before the ban.

What is Project Glasswing and why does it matter here?

Project Glasswing is Anthropic's controlled-access program where approximately 50 enterprise partners used Claude Mythos Preview to autonomously scan their codebases for vulnerabilities. Partners included AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Microsoft, NVIDIA, and Palo Alto Networks. The program found more than 10,000 high or critical-severity vulnerabilities across all partners, including 6,202 in open-source projects. Cloudflare found 2,000 bugs, 400 high or critical, with a false positive rate better than human testers. Glasswing matters here because it used the exact same technique the government cited as the dangerous Fable 5 jailbreak — asking the model to read a codebase and find exploitable flaws — in a structured defensive context the government was aware of.
