The Qwen3.5 Small Series Shock: When 9B Beats 120B, a New Standard for Local AI
On March 2, 2026, Alibaba’s Qwen team released the Qwen3.5 Small Model Series. Comprising four dense models at 0.8B, 2B, 4B, and 9B parameters, this series shook the local AI community by outperforming OpenAI’s gpt-oss-120B on key benchmarks with just 9B parameters. The announcement racked up 1,261 upvotes on r/LocalLLaMA within hours.[^1]
Fourteen days after the 397B flagship arrived, with the medium series in between, the small series completes the lineup: eight models in two weeks. This article dives deep into the Small Series’ architecture, benchmarks, and impact on the local AI ecosystem.
Lineup at a Glance
| Model | Parameters | Layers | Context | VRAM (BF16) | VRAM (4-bit) |
|---|---|---|---|---|---|
| Qwen3.5-0.8B | 0.8B | 24 | 262K | ~1.6 GB | ~0.5 GB |
| Qwen3.5-2B | 2B | 24 | 262K | ~4 GB | ~1.5 GB |
| Qwen3.5-4B | 4B | 32 | 262K | ~8 GB | ~3 GB |
| Qwen3.5-9B | 9B | 32 | 262K (1M extended) | ~18 GB | ~5 GB |
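The VRAM column is easy to sanity-check with weights-only arithmetic: parameters times bytes per parameter. The sketch below is a back-of-envelope estimate, not a measurement; the table runs slightly higher because real deployments also hold the KV/state cache, activations, and runtime overhead.

```python
# Weights-only VRAM estimate in decimal GB: parameters * bytes per parameter.
# Real usage adds KV/state cache, activations, and framework overhead.

def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * bits_per_param / 8

for size in (0.8, 2, 4, 9):
    print(f"{size}B: BF16 ~{weight_vram_gb(size, 16):.1f} GB, "
          f"4-bit ~{weight_vram_gb(size, 4):.1f} GB")
# 0.8B: BF16 ~1.6 GB, 4-bit ~0.4 GB
# 9B:   BF16 ~18.0 GB, 4-bit ~4.5 GB
```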
All four models are released under the Apache 2.0 license and are immediately available on Hugging Face and ModelScope2. Both base and instruct models are provided, giving researchers and enterprises full freedom to fine-tune.
Architecture: Gated DeltaNet Hybrid Attention
The key to the Small Series’ performance at this scale is the Gated DeltaNet hybrid architecture. Shared across the entire Qwen3.5 family, this design directly tackles the limitations of traditional Transformers.
What Is Gated DeltaNet?
Gated Delta Networks[^3] combine Mamba2’s gated decay mechanism with delta-rule-based hidden state updates in a linear attention framework. Key characteristics include:
- Constant memory complexity: Independent of sequence length, enabling 262K-token context even in the 0.8B model.
- 3:1 hybrid ratio: Three DeltaNet linear attention blocks followed by one traditional full softmax attention block. Routine computation goes to linear attention; precision-demanding reasoning goes to full attention (see the sketch after this list).
- Solving the memory wall: Smaller models are more bottlenecked by memory bandwidth, and the linear attention blocks substantially alleviate this.
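A minimal sketch makes the interleaving concrete. Everything here is illustrative: the block names are invented, and the real Qwen3.5 schedule may place its full-attention layers differently.

```python
# Purely illustrative sketch of the 3:1 hybrid layout described above.
# Block names are ours, not Qwen3.5's actual module names.

def hybrid_layout(num_layers: int, ratio: int = 3) -> list[str]:
    """Every (ratio + 1)-th layer is full softmax attention; the rest
    are Gated DeltaNet linear-attention blocks."""
    return [
        "full_softmax_attention" if (i + 1) % (ratio + 1) == 0
        else "gated_deltanet_linear_attention"
        for i in range(num_layers)
    ]

# A 32-layer model like the 4B/9B carries only 8 quadratic-attention
# layers under this schedule, which is where the memory savings come from.
print(hybrid_layout(32).count("full_softmax_attention"))  # 8
```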
Native Multimodal: No Separate Vision Model Needed
Previous generations bolted vision encoders onto text models. Qwen3.5 is fundamentally different: with an early fusion approach, text, image, and video tokens are trained together from the start. The vision encoder uses a DeepStack Vision Transformer with Conv3D patch embeddings to capture temporal dynamics in video. Features from multiple layers (not just the final one) are merged, enabling video understanding even in the 0.8B model.
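To make the patch-embedding step concrete, here is a rough PyTorch sketch of a Conv3D patch embed. The class name, kernel sizes, and embedding dimension are illustrative assumptions rather than Qwen3.5’s actual vision-tower code, and the DeepStack multi-layer feature merging is omitted; the point is how one 3D convolution folds a span of frames into each visual token.

```python
import torch
import torch.nn as nn

# Illustrative Conv3D patch embedding for video (assumed sizes, not
# Qwen3.5's real configuration): one 3D convolution turns a span of
# frames into patch tokens, so each token carries temporal information.

class Conv3DPatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=1024,
                 temporal_patch=2, spatial_patch=14):
        super().__init__()
        self.proj = nn.Conv3d(
            in_chans, embed_dim,
            kernel_size=(temporal_patch, spatial_patch, spatial_patch),
            stride=(temporal_patch, spatial_patch, spatial_patch),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width)
        x = self.proj(video)                  # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, num_tokens, D)

tokens = Conv3DPatchEmbed()(torch.randn(1, 3, 8, 224, 224))
print(tokens.shape)  # 8/2 * (224/14)**2 = 1024 tokens of dim 1024
```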
Multi-Token Prediction (MTP)
All four models feature Multi-Token Prediction (MTP), which predicts multiple tokens per step during inference, directly improving inference speed without quality loss.
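The standard way MTP yields speed without quality loss is a draft-and-verify loop in the spirit of speculative decoding: cheap extra heads guess a few tokens ahead, and a token is only kept if the main head agrees. The release does not detail Qwen3.5’s exact inference path, so the toy below sketches the general mechanism with entirely invented names.

```python
import random

# Toy draft-and-verify loop. DraftingModel is a stand-in, not Qwen3.5's
# real inference stack: the MTP heads guess k tokens cheaply, and the main
# head accepts a draft token only if it matches its own prediction, so the
# output is identical to ordinary one-token-at-a-time decoding.

class DraftingModel:
    def main_next_token(self, context):
        # stand-in for the expensive main head: deterministic toy rule
        return (sum(context) + len(context)) % 100

    def mtp_draft(self, context, k=3):
        # stand-in for the cheap MTP heads: usually right, sometimes wrong
        draft, ctx = [], list(context)
        for _ in range(k):
            guess = self.main_next_token(ctx) if random.random() < 0.8 else -1
            draft.append(guess)
            ctx.append(guess)
        return draft

def decode_step(model, context):
    accepted = 0
    for tok in model.mtp_draft(context):
        if tok == model.main_next_token(context):   # verify before accepting
            context.append(tok)
            accepted += 1
        else:
            # first rejection: take the main head's token and stop
            context.append(model.main_next_token(context))
            break
    return context, accepted

ctx, n = decode_step(DraftingModel(), [1, 2, 3])
print(ctx, f"({n} draft tokens accepted this step)")
```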
Strong-to-Weak Distillation
The Small Series was trained via knowledge distillation using the 397B flagship and medium series as teacher models. According to the Qwen team, distillation proved more effective than direct reinforcement learning (RL) at this scale. Off-policy and on-policy transfer were combined to maximally compress and transfer teacher capabilities.
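For reference, the off-policy half of such a recipe usually looks like classic logit distillation, where the student matches the teacher’s temperature-softened next-token distribution. The sketch below is the textbook loss under assumed hyperparameters, not Qwen’s actual training code, and the on-policy half (the teacher scoring the student’s own generations) is omitted.

```python
import torch
import torch.nn.functional as F

# Textbook knowledge-distillation loss (assumed hyperparameters, not
# Qwen's recipe): KL divergence between temperature-softened teacher
# and student next-token distributions.

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient scale comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# toy shapes (batch*seq, vocab); a real run would use the 248K vocabulary
student = torch.randn(8, 512)
teacher = torch.randn(8, 512)
print(distillation_loss(student, teacher))
```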
Benchmarks: Numbers That Defy Expectations
Language Benchmarks: 9B Rivals Previous-Gen 80B
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | Qwen3-30B | Qwen3-80B |
|---|---|---|---|---|
| MMLU-Pro | 82.5 | 79.1 | 80.9 | 82.7 |
| C-Eval | 88.2 | 85.1 | 87.4 | 89.7 |
| GPQA Diamond | 81.7 | 76.2 | 73.4 | 77.2 |
| IFEval | 91.5 | 89.8 | 88.9 | 88.9 |
| LongBench v2 | 55.2 | 50.0 | 44.8 | 48.0 |
On GPQA Diamond, the 9B model (81.7) surpassed the previous-gen 80B (77.2) by 4.5 points. On instruction following (IFEval), it led with 91.5 versus 88.9. On long-context processing (LongBench v2), the gap was over 7 points (55.2 vs. 48.0). A model with 9× fewer parameters outperformed the previous generation’s top model.
Key Comparison: 9B vs. gpt-oss-120B
VentureBeat highlighted these comparison points:[^4]
| Benchmark | Qwen3.5-9B | gpt-oss-120B |
|---|---|---|
| GPQA Diamond | 81.7 | 80.1 |
| MMMLU (multilingual) | 81.2 | 78.2 |
| OmniDocBench v1.5 | 87.7 | — |
On graduate-level reasoning (GPQA Diamond), 9B edged out 120B by 1.6 points. On multilingual knowledge (MMMLU), it won by 3 points, beating a model more than 13× its size. That said, gpt-oss-120B still held the advantage on coding benchmarks.[^1]
Vision Benchmarks: Dominating GPT-5-Nano
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | GPT-5-Nano | Gemini 2.5 Flash-Lite |
|---|---|---|---|---|
| MMMU-Pro | 70.1 | 66.3 | 57.2 | 59.7 |
| MathVision | 78.9 | 74.6 | 62.2 | 52.1 |
| MathVista (mini) | 85.7 | 85.1 | 71.5 | 72.8 |
| OmniDocBench v1.5 | 87.7 | 86.2 | 55.9 | 79.4 |
| VideoMME (w/ subs) | 84.5 | — | — | 74.6 |
On MMMU-Pro, the 9B (70.1) crushed GPT-5-Nano (57.2) by 13 points. On document understanding (OmniDocBench), the gap was 31.8 points. This wasn’t a difference in size — it was a generational leap.
Smaller Models That Still Deliver: 0.8B and 2B
| Benchmark | Qwen3.5-2B | Qwen3.5-0.8B |
|---|---|---|
| MMMU (vision) | 64.2 | 49.0 |
| MathVista (vision) | 76.7 | 62.2 |
| OCRBench (vision) | 84.5 | 74.5 |
| VideoMME (w/ subs) | 75.6 | 63.8 |
The 2B model’s OCRBench score of 84.5 surpasses previous-generation 7B-class models. Even the 0.8B delivers MathVista 62.2 and OCRBench 74.5 — practical levels for edge device deployment.
201 Languages and a 248K Vocabulary
The entire Qwen3.5 family uses a 248K-token vocabulary supporting 201 languages and dialects.[^5] This includes Korean, Japanese, and Chinese, as well as Arabic, Hindi, Swahili, and other low-resource languages. The 9B model’s MMMLU score of 81.2, surpassing gpt-oss-120B (78.2), is a direct result of this vocabulary design.
Where Can You Run It: From Raspberry Pi to Laptops
The real significance of the Small Series isn’t benchmark numbers — it’s accessibility.
- Qwen3.5-0.8B: ~0.5 GB at 4-bit quantization. Runs on Raspberry Pi and smartphones.
- Qwen3.5-2B: ~1.5 GB at 4-bit. Works on regular laptop GPUs and mobile SoCs.
- Qwen3.5-4B: ~3 GB at 4-bit. Runs comfortably on RTX 3060 12GB and M1/M2 Macs.
- Qwen3.5-9B: ~5 GB at 4-bit. Runs on RTX 3090/4090 and M2 Pro+ Macs. Supports ~1M token context via YaRN extension.
All major inference frameworks are supported: vLLM, SGLang, llama.cpp, MLX, and Hugging Face Transformers. GGUF quantized versions are also available on Hugging Face. Hugging Face developer Xenova demonstrated the Qwen3.5 Small Series running directly in a web browser, performing video analysis.[^4]
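For readers who want to try it immediately, here is a hedged Transformers example. The repo id is an assumption based on Qwen’s usual naming (check the Hugging Face collection for the exact names and the vision-enabled variants), and bitsandbytes 4-bit loading is just one of several quantization routes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Repo id is assumed from Qwen's naming conventions; verify on Hugging Face.
MODEL_ID = "Qwen/Qwen3.5-9B-Instruct"

# 4-bit quantization keeps the 9B weights around the ~5 GB mark cited above.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    device_map="auto",
    # For ~1M-token context, earlier Qwen releases enable YaRN through a
    # rope_scaling entry in the model config; presumably the same applies here.
)

messages = [{"role": "user",
             "content": "Explain Gated DeltaNet in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```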
For those interested in quantization and model compression techniques, see our separate guide.
Community Reaction: “How Is This Possible?”
The response on r/LocalLLaMA was sheer amazement. The launch post hit 1,261 upvotes within 10 hours.[^1]
Paul Couvert of Blueshell AI wrote on X: “How is this even possible? The 4B version is nearly on par with the previous 80B-A3B model, and the 9B is as good as the 13× larger GPT-OSS-120B.”[^4]
Karan Kendre of Kargul Studio said, “These models run locally for free on my M1 MacBook Air.”[^4] One developer called the 4B model’s native multimodal capabilities “a game changer for mobile developers.”
Not everything was rosy, though. In r/LocalLLaMA benchmark comparison threads, some pointed out that “reasoning and coding benchmarks are lower compared to gpt-oss,”[^1] and others expressed frustration at the lack of a direct comparison with Qwen3-4B (the 2507 version).
The Full Qwen3.5 Family Picture
| Model | Release Date | Type | Active Parameters |
|---|---|---|---|
| Qwen3.5-397B-A17B | Feb 16 | MoE (flagship) | 17B |
| Qwen3.5-122B-A10B | Feb 24 | MoE | 10B |
| Qwen3.5-35B-A3B | Feb 24 | MoE | 3B |
| Qwen3.5-27B | Feb 24 | Dense | 27B |
| Qwen3.5-9B | Mar 2 | Dense | 9B |
| Qwen3.5-4B | Mar 2 | Dense | 4B |
| Qwen3.5-2B | Mar 2 | Dense | 2B |
| Qwen3.5-0.8B | Mar 2 | Dense | 0.8B |
Shipping eight models in two weeks, from a 0.8B edge model to a 397B frontier flagship, is unprecedented in open-source AI history. All models share the same Gated DeltaNet hybrid architecture and support native multimodality, 201 languages, and Thinking/Non-thinking dual modes.
What Changes
The Qwen3.5 Small Series isn’t just about “small models have arrived.” It signals several fundamental shifts.
First, a paradigm shift in parameter efficiency. A 9B model beating a 120B model isn’t mere benchmark hacking — it’s the compound result of architecture (Gated DeltaNet), training methodology (Strong-to-Weak distillation), and data quality. The fact that the same architecture scales from 0.8B to 397B means this design has achieved true versatility.
Second, the democratization of multimodal AI. The ability to process text, images, and video in a single model has now reached 0.8B. Understanding video on a smartphone, reading documents, and recognizing UI elements are all possible. Eliminating the need to load a separate vision model fundamentally reduces the complexity of edge deployment.
Third, the threshold for practical local AI has dropped. A 4-bit quantized 9B model running in 5GB VRAM means that anyone with an RTX 3060-class GPU can use gpt-oss-120B-level reasoning locally, for free. Cloud API costs, data privacy, and latency — all three problems solved simultaneously.
With Qwen3.5, Alibaba delivered a clear message: “More intelligence, less compute.” The era when small models beat large ones has arrived, and the benefits can be enjoyed even on a Raspberry Pi.
Footnotes
[^1]: r/LocalLLaMA, “Breaking: The small qwen3.5 models have been dropped,” Reddit, March 2, 2026.

[^2]: Qwen Team, Qwen3.5 Collection, Hugging Face, March 2, 2026.

[^3]: Yang et al., “Gated Delta Networks: Improving Mamba2 with Delta Rule,” arXiv:2412.06464, 2024.

[^4]: Carl Franzen, “Alibaba’s small, open source Qwen3.5-9B beats OpenAI’s gpt-oss-120B and can run on standard laptops,” VentureBeat, March 2, 2026.

[^5]: Awesome Agents, “Qwen 3.5 Small Series Ships Four Models From 0.8B to 9B,” March 2, 2026.