Alibaba's Qwen 3.5 9B Beats GPT-OSS-120B on 3 Benchmarks — Runs on a Laptop
Quick summary
Qwen 3.5's 9B model outscores OpenAI's 120B model on GPQA Diamond, MMLU-Pro, and multilingual tests. Apache 2.0, multimodal, runs offline on 16GB RAM. Full developer guide.
Alibaba's Qwen team released the Qwen 3.5 Small Model Series on March 1, 2026: four models at 0.8B, 2B, 4B, and 9B parameters, all multimodal, all capable of running offline on consumer hardware. The 9B model beats OpenAI's GPT-OSS-120B on three major benchmarks while being 13 times smaller.
The scores: MMLU-Pro (82.5 vs 80.8), GPQA Diamond (81.7 vs 80.1), multilingual MMMLU (81.2 vs 78.2). The 9B also beats Gemini 2.5 Flash-Lite on MMMU-Pro visual reasoning (70.1 vs 59.7). These are not narrow tests — they cover graduate-level science reasoning, multilingual performance, and visual understanding, the three categories that matter most for real-world developer tasks.
All four models are Apache 2.0 licensed and available on Hugging Face. No restrictions, no fees, no usage caps.
The Four Models: What Runs Where
Qwen 3.5 0.8B — Fits in RAM on mid-range smartphones. Useful for classification, extraction, and intent detection where latency is critical and API costs aren't justified. Think on-device search query understanding or lightweight form parsing.
Qwen 3.5 2B — Runs on flagship phones (iPhone 16 Pro, Pixel 9 Pro) and entry-level laptops. Handles summarisation, light code generation, and document Q&A in offline or edge deployments. Faster than cloud API calls for simple tasks.
Qwen 3.5 4B — The practical sweet spot for laptop deployments. Any laptop with 8GB RAM. Handles multi-turn conversation, structured output, and moderate code generation with good speed. This is the model to use for developer tooling prototypes.
Qwen 3.5 9B — Requires 16GB RAM or a consumer GPU (RTX 3060 or equivalent). This is the benchmark winner. Use it for complex reasoning, multimodal document processing, and production edge hardware. On Apple Silicon (M2 Pro and above), inference is fast enough to be practical.
Why Early Fusion Changes Everything
The key architectural shift in Qwen 3.5 is early-fusion multimodal training. Most small multimodal models — including earlier Qwen versions — train the language model first, then attach a vision encoder afterwards. Because the two modules were never trained in a shared representational space, the vision encoder has to learn an approximate translation into the language model's space, and that approximation introduces errors.
Qwen 3.5 trains text and image tokens jointly from the start. The model reasons about images and text in a unified representation. This is why it scores 70.1 on MMMU-Pro visual reasoning versus 59.7 for Gemini 2.5 Flash-Lite and 57.4 for GPT-5-Nano — both of which are late-fusion. The visual reasoning gap is not a marginal improvement. It's the difference between a model that actually understands image-text relationships and one that estimates them.
The models also use grouped query attention (GQA) and sliding window attention for middle layers, reducing memory footprint for long-context tasks. Standard transformer attention scales quadratically with sequence length — this becomes a real problem at 8K+ tokens on consumer hardware. The hybrid attention keeps memory usage manageable even when processing long documents or image sequences.
Full Benchmark Table
| Model | MMLU-Pro | GPQA Diamond | MMMLU (multilingual) | MMMU-Pro (vision) |
|---|---|---|---|---|
| Qwen 3.5 9B | 82.5 | 81.7 | 81.2 | 70.1 |
| GPT-OSS-120B | 80.8 | 80.1 | 78.2 | — |
| Gemini 2.5 Flash-Lite | — | — | — | 59.7 |
| GPT-5-Nano | — | — | — | 57.4 |
The multilingual result deserves attention. 81.2 vs 78.2 on MMMLU is a substantial gap. Alibaba's training data includes higher-quality Chinese, Arabic, and Southeast Asian content than most US lab datasets. For developers building products for non-English markets — which is most of the world — this is a material advantage over GPT-OSS-120B.
Developer Use Cases
Local LLM on a developer's machine. The 4B and 9B run with Ollama or LM Studio on a standard workstation. Zero API costs, zero network latency, no data leaving your machine. For code review, document parsing, and prototyping, the 9B running locally is competitive with paid API calls to mid-tier cloud models. Nvidia's Nemotron 3 Super is stronger on pure coding benchmarks, but Qwen 3.5 9B wins on vision and multilingual.
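As a minimal sketch of the local-workstation setup, the following talks to Ollama's documented `/api/chat` endpoint over its default local port. The model tag `qwen3.5:9b` is an assumption — check `ollama list` for the tag Ollama actually publishes:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint
MODEL = "qwen3.5:9b"  # assumed tag; verify with `ollama list`

def build_payload(prompt: str,
                  system: str = "You are a concise code reviewer.") -> dict:
    """Assemble a non-streaming chat request for Ollama's /api/chat."""
    return {
        "model": MODEL,
        "stream": False,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    }

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Usage (requires a running Ollama daemon with the model pulled):
#   print(chat("Review this diff for off-by-one errors: ..."))
```

No API key, no network egress: the request never leaves localhost, which is the whole point for code review on proprietary repositories.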
Edge and air-gapped deployment. Industrial IoT, on-premise healthcare, legal tech with data residency requirements — contexts where data can't go to a cloud API. The 4B and 9B are the first small models capable of production-quality responses in these environments.
Mobile AI. The 0.8B and 2B run on Android and iOS. Unlike Apple's Gemini-powered Siri (which only runs in Apple's controlled stack), Qwen 3.5 small models are available to any mobile developer, on any platform, offline. For apps targeting emerging markets where connectivity is unreliable, the 2B offline is more reliable than any cloud API.
Multimodal document pipelines. The 9B processes documents that mix text and images — invoices, engineering diagrams, medical scans, product photos — without the quality drop that comes from late-fusion models struggling to bridge visual and text representations. For any workflow that currently sends images to GPT-4V or Claude for parsing, Qwen 3.5 9B is worth benchmarking locally.
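A hedged sketch of such a pipeline, again against Ollama's `/api/chat` endpoint, which accepts base64-encoded images in the message payload and can constrain output to JSON via the `format` field. The model tag and the assumption that the Qwen 3.5 build exposes vision through Ollama are both unverified here:

```python
import base64
import json
import urllib.request

MODEL = "qwen3.5:9b"  # assumed tag for the multimodal 9B build

def encode_image(image_bytes: bytes) -> str:
    """Ollama expects images as base64 strings inside the message."""
    return base64.b64encode(image_bytes).decode("ascii")

def build_invoice_request(image_bytes: bytes) -> dict:
    """One-shot extraction request: image in, constrained JSON out."""
    prompt = ("Extract vendor, invoice number, date, and total from this "
              "invoice. Reply as JSON with keys: vendor, invoice_number, "
              "date, total.")
    return {
        "model": MODEL,
        "stream": False,
        "format": "json",  # ask Ollama to constrain output to valid JSON
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [encode_image(image_bytes)],
        }],
    }

def parse_invoice(path: str) -> dict:
    """Read an invoice image from disk and return extracted fields."""
    with open(path, "rb") as f:
        payload = build_invoice_request(f.read())
    req = urllib.request.Request("http://localhost:11434/api/chat",
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(json.loads(resp.read())["message"]["content"])

# Usage: fields = parse_invoice("invoice_0042.png")
```

Because the image and the extracted fields never leave the machine, the same sketch works unmodified in the air-gapped deployments described above.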
Fine-tuning on domain data. Small models are cheaper to fine-tune than large ones. A Qwen 3.5 9B fine-tuned on your domain data will outperform a generic GPT-OSS-120B on your specific task. Apache 2.0 means you can fine-tune, modify, and deploy commercially without restrictions or royalties.
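A LoRA fine-tune is the cheapest way to do this. The sketch below uses Hugging Face `peft` and `transformers`; the model id `Qwen/Qwen3.5-9B` and the ChatML-style template are assumptions based on earlier Qwen releases — verify both against the actual model card before training:

```python
def to_chatml(instruction: str, response: str) -> str:
    """Format one training pair in the ChatML style earlier Qwen models use."""
    return (f"<|im_start|>user\n{instruction}<|im_end|>\n"
            f"<|im_start|>assistant\n{response}<|im_end|>")

def train_lora(pairs: list[tuple[str, str]],
               output_dir: str = "qwen35-9b-lora") -> None:
    # Imports kept local so the formatter above stays dependency-free.
    import torch
    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    model_id = "Qwen/Qwen3.5-9B"  # assumed Hugging Face repo name
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16)

    # LoRA: train small low-rank adapters instead of all 9B weights.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM"))

    ds = Dataset.from_dict({"text": [to_chatml(i, r) for i, r in pairs]})
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024))

    Trainer(
        model=model,
        args=TrainingArguments(output_dir, num_train_epochs=3,
                               per_device_train_batch_size=1,
                               gradient_accumulation_steps=8),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()
```

The resulting adapter is megabytes, not gigabytes, so you can keep one base model on disk and swap domain adapters per task.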
How It Compares to the Other Small Models
vs Meta Llama 3.2 11B: Similar benchmark performance; Llama is slightly larger. Qwen 3.5 9B has better multilingual scores and native multimodality. Llama has broader community tooling and more third-party fine-tunes. Both are Apache 2.0. Pick Llama if ecosystem size matters, Qwen if multilingual or vision quality matters.
vs Microsoft Phi-4 14B: Phi-4 is larger and has stronger coding performance on some benchmarks. Qwen 3.5 9B wins on vision and multilingual. Phi-4 is MIT licensed. If your use case is primarily code generation and you have the hardware for 14B, Phi-4 is worth testing.
vs Google Gemma 3 9B: Direct size comparison. Qwen 3.5 9B outperforms Gemma 3 9B on multilingual and visual benchmarks. Gemma integrates better with Google's Vertex AI toolchain. Choose Gemma if you're already in the Google ecosystem; Qwen if raw benchmark performance matters more.
vs Mistral Small 22B: Mistral is larger but has similar performance on some reasoning tasks. Qwen 3.5 9B wins clearly on size efficiency and visual capability. Mistral Small is better if you have hardware that can run 22B and need the strongest pure-text performance.
The Cost Trend This Represents
A year ago, GPT-4 class reasoning required a $20/month API subscription or expensive cloud GPU instances. Today, equivalent reasoning runs locally on a $1,500 laptop. Qwen 3.5 9B beating GPT-OSS-120B is the continuation of a pattern: DeepSeek R1 challenged OpenAI o1, Xiaomi's Hunter Alpha targeted trillion-parameter performance, and now a 9B model from Alibaba outscores a 120B model from OpenAI on the same benchmarks.
For developers deciding which AI infrastructure to build on: cloud API dependency is increasingly optional for many tasks. The decision between local small model and cloud frontier model is now a product architecture choice — privacy vs capability, cost vs performance — not a forced choice between capable and unusable.
Key Takeaways
- Qwen 3.5 9B beats GPT-OSS-120B (13x larger) on GPQA Diamond (81.7 vs 80.1), MMLU-Pro (82.5 vs 80.8), and MMMLU multilingual (81.2 vs 78.2)
- Four sizes: 0.8B, 2B, 4B, 9B — all multimodal, all Apache 2.0, all on Hugging Face, no commercial restrictions
- Early-fusion vision: trains text and images jointly, giving a 10+ point MMMU-Pro advantage over late-fusion competitors
- Hardware: 4B on any 8GB RAM laptop; 9B on 16GB RAM or RTX 3060
- Best use cases: local developer LLM, edge/air-gapped deployment, mobile AI on Android, multimodal document parsing, domain fine-tuning
- The trend: small models are matching 12-month-old frontier model performance — capable AI no longer requires cloud API access
Written by
Abhishek Gautam
Software Engineer based in Delhi, India. Writes about AI models, semiconductor supply chains, and tech geopolitics — covering the intersection of infrastructure and global events. 355+ posts cited by ChatGPT, Perplexity, and Gemini. Read in 121 countries.