Meta Muse Spark: First MSL Model, Closed Source, Benchmark Results
Quick summary
Meta launched Muse Spark on April 8 — first model from Superintelligence Labs under Alexandr Wang. Free, closed source, beats GPT-5.4 on health and science. Trails on coding.
Meta's Superintelligence Labs shipped its first model on April 8 and immediately broke two things: the assumption that Meta would keep releasing open-weight models, and the assumption that a new entrant needs years to challenge frontier labs. Muse Spark — built over nine months from a complete ground-up rebuild of Meta's AI stack — sits in the top 5 on the Artificial Analysis Intelligence Index, beats GPT-5.4 on health benchmarks, and costs nothing to use on meta.ai.
The closed-source decision is the headline. Llama defined Meta's AI identity for three years. Muse Spark is the explicit statement that Alexandr Wang's team is building something different — and it is not sharing the weights.
What Meta Superintelligence Labs Actually Is
Meta Superintelligence Labs (MSL) is the research division that came out of Meta's $14.3 billion deal to bring in Alexandr Wang as Chief AI Officer. Wang co-founded Scale AI and built one of the most important AI data labelling and evaluation companies in the industry. His bet when joining Meta was that the company's data advantages — billions of daily active users across Instagram, WhatsApp, Facebook, and Messenger — could be turned into a training signal moat that frontier labs without a consumer base cannot replicate.
MSL spent nine months on a ground-up rebuild. They did not fine-tune Llama. They did not iterate on an existing architecture. The Muse series is a new training paradigm internally described as "deliberate and scientific model scaling where each generation validates and builds on the last before going bigger." Muse Spark is the first step — small and fast by design, built to validate the scaling approach before MSL goes to larger parameter counts.
The compute efficiency claim is notable: Meta says Muse Spark reaches the same capability level as Llama 4 Maverick with less than one-tenth of the compute. If that holds under independent evaluation, it suggests the architecture and training data improvements are doing meaningful work, rather than the model simply benefiting from increased scale.
Benchmark Results: Where It Wins and Where It Loses
On the Artificial Analysis Intelligence Index v4.0, Muse Spark scores 52, placing it fifth overall. Only three of the higher-ranked models are listed here:
- GPT-5.4: 57
- Gemini 3.1 Pro: 57
- Claude Opus 4.6: 53
- Muse Spark: 52
Where Muse Spark leads the field:
Health benchmarks (HealthBench Hard): 42.8, versus GPT-5.4's 40.1. This is the benchmark most relevant to medical and clinical AI use cases — complex clinical reasoning, drug interaction analysis, diagnostic question answering. Meta's health data advantage across its platforms appears to be producing measurable results here.
Scientific reasoning (Humanity's Last Exam): 50.2% in Contemplating mode, versus GPT-5.4 Pro's 43.9%. HLE tests graduate-level expert questions across mathematics, physics, chemistry, biology, and other sciences. A 6+ percentage point lead on this benchmark is significant — it is one of the hardest AI evaluations currently available.
Chart understanding (CharXiv): 86.4 — strong multimodal performance on data visualisation comprehension.
Where Muse Spark trails significantly:
Coding (Terminal-Bench): 59.0 versus GPT-5.4's 75.1. A 16-point gap on coding benchmarks is large. For developers evaluating whether to use Muse Spark for code generation, this number matters. The model is not competitive with GPT-5.4 or Claude on pure coding tasks at launch.
Abstract reasoning (ARC-AGI-2): 42.5 versus GPT-5.4's 76.1. That is a 33.6-point gap on a benchmark designed to test novel reasoning rather than pattern matching from training data, and it is the most striking underperformance in Muse Spark's launch results.
Agentic tasks (GDPval-AA): 1,444 Elo versus GPT-5.4's 1,672. A 228-point gap is meaningful for developers building multi-step agentic workflows.
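To put the GDPval-AA gap in more familiar terms, the sketch below converts the two ratings into an expected head-to-head preference rate. It assumes GDPval-AA ratings follow the standard Elo formula with a 400-point scale, which this article does not confirm. The function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumption: the benchmark uses the conventional 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings as quoted in this article
muse_spark = 1444
gpt_5_4 = 1672

p = elo_expected_score(muse_spark, gpt_5_4)
print(f"Rating gap: {gpt_5_4 - muse_spark} points")
print(f"Expected Muse Spark preference rate vs GPT-5.4: {p:.1%}")
```

Under that assumption, a 228-point gap means Muse Spark's output would be preferred in roughly one comparison out of five.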
The honest read: Muse Spark is a specialist that wins on health, science, and charts. It is not a generalist that challenges GPT-5.4 or Claude Opus 4.6 across the board at launch. The "top 5" framing is accurate but hides the uneven benchmark profile.
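The uneven profile is easier to see as signed gaps. This minimal sketch uses only the scores quoted in this article and subtracts the rival's score from Muse Spark's; the variable and function names are illustrative. The units differ across benchmarks (HLE is a percentage and is compared against GPT-5.4 Pro), so the gaps are comparable in sign but only loosely in size.

```python
# (benchmark, Muse Spark score, rival score) as quoted in this article.
# The rival is GPT-5.4, except Humanity's Last Exam, where it is GPT-5.4 Pro.
scores = [
    ("HealthBench Hard", 42.8, 40.1),
    ("Humanity's Last Exam", 50.2, 43.9),
    ("Terminal-Bench", 59.0, 75.1),
    ("ARC-AGI-2", 42.5, 76.1),
]


def gaps(rows):
    """Return {benchmark: Muse Spark minus rival}, rounded to one decimal."""
    return {name: round(muse - rival, 1) for name, muse, rival in rows}


result = gaps(scores)

# Print the largest lead first and the largest deficit last
for name, delta in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {delta:+.1f}")
```

The two leads (+6.3 and +2.7) are much smaller than the two deficits (-16.1 and -33.6), which is the unevenness the "top 5" framing hides.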
The Closed-Source Decision: What It Means
Every previous major Meta AI model (Llama 1, 2, 3, and 4) was released as open weights. Muse Spark is not. The weights are not available. The architecture is not documented beyond what Meta chose to share. There is no Hugging Face download.
Meta's official position is that future versions of the Muse series may be open-sourced. The current closed approach is framed as necessary for responsible deployment given capability levels — the same framing OpenAI and Anthropic use. The practical effect is that the developer community that built an ecosystem around Llama now has a Meta flagship model it cannot run, fine-tune, or audit.
This is a significant strategic shift. The Llama ecosystem created enormous goodwill, research adoption, and indirect commercial leverage for Meta. MSL is betting that the competitive advantage of keeping Muse weights proprietary outweighs the ecosystem network effects of open release. That is a bet Anthropic and OpenAI have always made — it is a new bet for Meta.
For enterprises that chose Meta AI infrastructure specifically because of open weights: Muse Spark is not a drop-in for Llama 4 in your deployment. The API is a different integration model with different control, cost, and compliance implications.
Technical Architecture: What We Know
Muse Spark is natively multimodal — voice, text, and image inputs at launch, with text-only output. Visual chain of thought, tool-use, and multi-agent orchestration are built into the base model rather than bolted on. This matters because models that add multimodal capability as an afterthought tend to perform worse at cross-modal reasoning than models where modalities are integrated at pretraining.
The model is currently powering the Meta AI app and website. It will roll out to WhatsApp, Instagram, Facebook, Messenger, and Meta's AI glasses over the coming weeks. That deployment scale — billions of users — is something no other frontier model has. If Meta uses that distribution to collect preference and feedback data at scale, the advantage compounds into the next Muse generation.
A private API preview is open to select partners now. No public API timeline has been announced. For developers who want to evaluate the model, meta.ai is the current access point, free with no subscription required.
How to Use Muse Spark Right Now
For health and science tasks: Muse Spark posts the top scores cited here on HealthBench Hard and Humanity's Last Exam. For clinical reasoning, drug research, scientific paper analysis, and medical question answering, try it first before defaulting to GPT-5.4 or Claude.
For coding: Use Claude Opus 4.6 or GPT-5.4 instead. The 16-point Terminal-Bench gap is too large to ignore for production code generation.
For agentic workflows: Too early to deploy at scale. The GDPval-AA gap and the ARC-AGI-2 underperformance suggest the model has not yet been optimised for multi-step autonomous tasks.
For chart and data analysis: Competitive benchmark scores — worth evaluating alongside Gemini 3.1 Pro for visualisation-heavy workflows.
For cost-sensitive deployments: It is free on meta.ai with no usage limits announced. For high-volume use cases where health or science context is relevant, this is a meaningful cost advantage. A full cost comparison will have to wait until the API launches publicly and pricing is published.
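The recommendations above can be condensed into a small lookup table. This is a sketch of the article's guidance only: the task keys and function name are hypothetical, and the model names are plain labels rather than API identifiers, since Muse Spark has no public API yet.

```python
# Task type -> models this article suggests evaluating first.
# Keys and structure are illustrative, not an official routing scheme.
RECOMMENDATIONS = {
    "health": ["Muse Spark"],
    "science": ["Muse Spark"],
    "coding": ["Claude Opus 4.6", "GPT-5.4"],
    "charts": ["Muse Spark", "Gemini 3.1 Pro"],
    # The article calls Muse Spark too early for agentic work at scale
    # and names no alternative, so this entry is left empty.
    "agentic": [],
}


def suggest_models(task: str) -> list[str]:
    """Return the models suggested for a task type, per this article."""
    key = task.strip().lower()
    if key not in RECOMMENDATIONS:
        raise ValueError(f"No recommendation for task type: {task!r}")
    return list(RECOMMENDATIONS[key])  # copy so callers cannot mutate the table


print(suggest_models("coding"))
print(suggest_models("health"))
```

A table like this is worth revisiting whenever new benchmark results or a public API change the picture.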
The Muse series will get larger. Muse Spark's role is to validate the training approach and the scaling laws before MSL commits to the compute required for the next generation. What comes after Muse Spark — once the architecture is validated and the health/science data advantages are combined with more scale — is the real competitive move Meta is building toward.
Key Takeaways
- Muse Spark launched April 8 from Meta Superintelligence Labs (MSL) under Alexandr Wang — Meta's first model since the $14.3B Scale AI deal
- Closed source: break from Llama heritage — weights not public; future versions may be open-sourced; no timeline given
- Benchmark scores: 52 on the Artificial Analysis Intelligence Index v4.0 (5th overall); beats GPT-5.4 on HealthBench Hard (42.8 vs 40.1) and GPT-5.4 Pro on HLE science reasoning (50.2% vs 43.9%); trails badly on Terminal-Bench coding (59.0 vs 75.1) and ARC-AGI-2 abstract reasoning (42.5 vs 76.1)
- Architecture: natively multimodal (voice, text, image input), visual chain-of-thought, tool-use, multi-agent orchestration built in at base model level
- Access: completely free on meta.ai and Meta AI app; private API preview for select partners; public API timeline unannounced
- 10x compute efficiency claimed versus Llama 4 Maverick at equivalent capability — if independently verified, significant
- Deployment: rolling to WhatsApp, Instagram, Facebook, Messenger, AI glasses — billions of users as feedback data flywheel
Frequently Asked Questions
What is Meta Muse Spark and when did it launch?
Meta Muse Spark is the first AI model from Meta Superintelligence Labs (MSL), launched on April 8, 2026. It was built over nine months from a ground-up rebuild of Meta's AI stack under Chief AI Officer Alexandr Wang. It is natively multimodal, supports tool-use and multi-agent orchestration, and is free to use on meta.ai and the Meta AI app.
Is Meta Muse Spark open source like Llama?
No. Muse Spark is closed source, a major break from Meta's previous Llama models, which were all released as open weights. The weights and architecture are not public. Meta said it hopes to open-source future Muse versions but gave no timeline. Current access is via meta.ai (free) or a private API preview for select partners.
How does Meta Muse Spark compare to GPT-5.4 and Claude Opus 4.6?
Muse Spark scores 52 on the Artificial Analysis Intelligence Index v4.0, placing it fifth overall; the models ahead of it include GPT-5.4 (57), Gemini 3.1 Pro (57), and Claude Opus 4.6 (53). It beats GPT-5.4 on HealthBench Hard (42.8 vs 40.1) and GPT-5.4 Pro on Humanity's Last Exam science reasoning (50.2% vs 43.9%), but trails significantly on Terminal-Bench coding (59.0 vs 75.1) and ARC-AGI-2 abstract reasoning (42.5 vs 76.1).
Should developers use Meta Muse Spark for coding?
Not as a primary coding model. Muse Spark scores 59 on Terminal-Bench versus GPT-5.4's 75.1 — a 16-point gap that is too large to ignore for production code generation. Use Claude Opus 4.6 or GPT-5.4 for coding tasks. Muse Spark is better suited to health, science, and chart analysis use cases where it leads or matches frontier models.
How can I access Meta Muse Spark?
Muse Spark is free on meta.ai and the Meta AI app with no subscription required. It is rolling out to WhatsApp, Instagram, Facebook, Messenger, and Meta AI glasses over the coming weeks. A private API preview is available to select partners. No public API timeline has been announced.
Written by
Software Engineer based in Delhi, India. Writes about AI models, semiconductor supply chains, and tech geopolitics — covering the intersection of infrastructure and global events. 941+ posts cited by ChatGPT, Perplexity, and Gemini. Read in 167 countries.
