The Qwen3.5 Small Series Shock: When 9B Beats 120B, a New Standard for Local AI
On March 2, 2026, Alibaba’s Qwen team released the Qwen3.5 Small Model Series. Comprising four dense models at 0.8B, 2B, 4B, and 9B parameters, this series shook the local AI community by outperforming OpenAI’s gpt-oss-120B on key benchmarks with just 9B parameters. The announcement racked up 1,261 upvotes on r/LocalLLaMA within hours.[^1]
Fourteen days after the 397B flagship arrived, with the medium series in between, the small series completes the lineup: eight models in two weeks. This article dives deep into the Small Series’ architecture, benchmarks, and impact on the local AI ecosystem.
Lineup at a Glance
| Model | Parameters | Layers | Context | VRAM (BF16) | VRAM (4-bit) |
|---|---|---|---|---|---|
| Qwen3.5-0.8B | 0.8B | 24 | 262K | ~1.6 GB | ~0.5 GB |
| Qwen3.5-2B | 2B | 24 | 262K | ~4 GB | ~1.5 GB |
| Qwen3.5-4B | 4B | 32 | 262K | ~8 GB | ~3 GB |
| Qwen3.5-9B | 9B | 32 | 262K (1M extended) | ~18 GB | ~5 GB |
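The VRAM column is easy to sanity-check with weights-only arithmetic: parameters times bytes per parameter. The sketch below is a back-of-envelope estimate, not a measurement; the table runs slightly higher because real deployments also hold the KV/state cache, activations, and runtime overhead.

```python
# Weights-only VRAM estimate in decimal GB: parameters * bytes per parameter.
# Real usage adds KV/state cache, activations, and framework overhead.

def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * bits_per_param / 8

for size in (0.8, 2, 4, 9):
    print(f"{size}B: BF16 ~{weight_vram_gb(size, 16):.1f} GB, "
          f"4-bit ~{weight_vram_gb(size, 4):.1f} GB")
# 0.8B: BF16 ~1.6 GB, 4-bit ~0.4 GB
# 9B:   BF16 ~18.0 GB, 4-bit ~4.5 GB
```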
All four models are released under the Apache 2.0 license and are immediately available on Hugging Face and ModelScope2. Both base and instruct models are provided, giving researchers and enterprises full freedom to fine-tune.
Architecture: Gated DeltaNet Hybrid Attention
The key to the Small Series’ performance at this scale is the Gated DeltaNet hybrid architecture. Shared across the entire Qwen3.5 family, this design directly tackles the limitations of traditional Transformers.
What Is Gated DeltaNet?
Gated Delta Networks[^3] combine Mamba2’s gated decay mechanism with delta-rule-based hidden state updates in a linear attention framework. Key characteristics include:
- Constant memory complexity: Independent of sequence length, enabling 262K-token context even in the 0.8B model.
- 3:1 hybrid ratio: Three DeltaNet linear attention blocks followed by one traditional full softmax attention block. Routine computation goes to linear attention; precision-demanding reasoning goes to full attention (see the sketch after this list).
- Solving the memory wall: Smaller models are more bottlenecked by memory bandwidth, and the linear attention blocks substantially alleviate this.
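A minimal sketch makes the interleaving concrete. Everything here is illustrative: the block names are invented, and the real Qwen3.5 schedule may place its full-attention layers differently.

```python
# Purely illustrative sketch of the 3:1 hybrid layout described above.
# Block names are ours, not Qwen3.5's actual module names.

def hybrid_layout(num_layers: int, ratio: int = 3) -> list[str]:
    """Every (ratio + 1)-th layer is full softmax attention; the rest
    are Gated DeltaNet linear-attention blocks."""
    return [
        "full_softmax_attention" if (i + 1) % (ratio + 1) == 0
        else "gated_deltanet_linear_attention"
        for i in range(num_layers)
    ]

# A 32-layer model like the 4B/9B carries only 8 quadratic-attention
# layers under this schedule, which is where the memory savings come from.
print(hybrid_layout(32).count("full_softmax_attention"))  # 8
```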
Native Multimodal: No Separate Vision Model Needed
Previous generations bolted vision encoders onto text models. Qwen3.5 is fundamentally different: with an early fusion approach, text, image, and video tokens are trained together from the start. The vision encoder uses a DeepStack Vision Transformer with Conv3D patch embeddings to capture temporal dynamics in video. Features from multiple layers (not just the final one) are merged, enabling video understanding even in the 0.8B model.
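To make the patch-embedding step concrete, here is a rough PyTorch sketch of a Conv3D patch embed. The class name, kernel sizes, and embedding dimension are illustrative assumptions rather than Qwen3.5’s actual vision-tower code, and the DeepStack multi-layer feature merging is omitted; the point is how one 3D convolution folds a span of frames into each visual token.

```python
import torch
import torch.nn as nn

# Illustrative Conv3D patch embedding for video (assumed sizes, not
# Qwen3.5's real configuration): one 3D convolution turns a span of
# frames into patch tokens, so each token carries temporal information.

class Conv3DPatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=1024,
                 temporal_patch=2, spatial_patch=14):
        super().__init__()
        self.proj = nn.Conv3d(
            in_chans, embed_dim,
            kernel_size=(temporal_patch, spatial_patch, spatial_patch),
            stride=(temporal_patch, spatial_patch, spatial_patch),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width)
        x = self.proj(video)                  # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, num_tokens, D)

tokens = Conv3DPatchEmbed()(torch.randn(1, 3, 8, 224, 224))
print(tokens.shape)  # 8/2 * (224/14)**2 = 1024 tokens of dim 1024
```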
Multi-Token Prediction (MTP)
All four models feature Multi-Token Prediction (MTP), which predicts multiple tokens per step during inference, directly improving inference speed without quality loss.
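The standard way MTP yields speed without quality loss is a draft-and-verify loop in the spirit of speculative decoding: cheap extra heads guess a few tokens ahead, and a token is only kept if the main head agrees. The release does not detail Qwen3.5’s exact inference path, so the toy below sketches the general mechanism with entirely invented names.

```python
import random

# Toy draft-and-verify loop. DraftingModel is a stand-in, not Qwen3.5's
# real inference stack: the MTP heads guess k tokens cheaply, and the main
# head accepts a draft token only if it matches its own prediction, so the
# output is identical to ordinary one-token-at-a-time decoding.

class DraftingModel:
    def main_next_token(self, context):
        # stand-in for the expensive main head: deterministic toy rule
        return (sum(context) + len(context)) % 100

    def mtp_draft(self, context, k=3):
        # stand-in for the cheap MTP heads: usually right, sometimes wrong
        draft, ctx = [], list(context)
        for _ in range(k):
            guess = self.main_next_token(ctx) if random.random() < 0.8 else -1
            draft.append(guess)
            ctx.append(guess)
        return draft

def decode_step(model, context):
    accepted = 0
    for tok in model.mtp_draft(context):
        if tok == model.main_next_token(context):   # verify before accepting
            context.append(tok)
            accepted += 1
        else:
            # first rejection: take the main head's token and stop
            context.append(model.main_next_token(context))
            break
    return context, accepted

ctx, n = decode_step(DraftingModel(), [1, 2, 3])
print(ctx, f"({n} draft tokens accepted this step)")
```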
Strong-to-Weak Distillation
The Small Series was trained via knowledge distillation using the 397B flagship and medium series as teacher models. According to the Qwen team, distillation proved more effective than direct reinforcement learning (RL) at this scale. Off-policy and on-policy transfer were combined to maximally compress and transfer teacher capabilities.
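For reference, the off-policy half of such a recipe usually looks like classic logit distillation, where the student matches the teacher’s temperature-softened next-token distribution. The sketch below is the textbook loss under assumed hyperparameters, not Qwen’s actual training code, and the on-policy half (the teacher scoring the student’s own generations) is omitted.

```python
import torch
import torch.nn.functional as F

# Textbook knowledge-distillation loss (assumed hyperparameters, not
# Qwen's recipe): KL divergence between temperature-softened teacher
# and student next-token distributions.

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient scale comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# toy shapes (batch*seq, vocab); a real run would use the 248K vocabulary
student = torch.randn(8, 512)
teacher = torch.randn(8, 512)
print(distillation_loss(student, teacher))
```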
Benchmarks: Numbers That Defy Expectations
Language Benchmarks: 9B Rivals Previous-Gen 80B
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | Qwen3-30B | Qwen3-80B |
|---|---|---|---|---|
| MMLU-Pro | 82.5 | 79.1 | 80.9 | 82.7 |
| C-Eval | 88.2 | 85.1 | 87.4 | 89.7 |
| GPQA Diamond | 81.7 | 76.2 | 73.4 | 77.2 |
| IFEval | 91.5 | 89.8 | 88.9 | 88.9 |
| LongBench v2 | 55.2 | 50.0 | 44.8 | 48.0 |
On GPQA Diamond, the 9B model (81.7) surpassed the previous-gen 80B (77.2) by 4.5 points. On instruction following (IFEval), it led with 91.5 versus 88.9. On long-context processing (LongBench v2), the gap was over 7 points (55.2 vs. 48.0). A model with 9× fewer parameters outperformed the previous generation’s top model.
Key Comparison: 9B vs. gpt-oss-120B
VentureBeat highlighted these comparison points:[^4]
| Benchmark | Qwen3.5-9B | gpt-oss-120B |
|---|---|---|
| GPQA Diamond | 81.7 | 80.1 |
| MMMLU (multilingual) | 81.2 | 78.2 |
| OmniDocBench v1.5 | 87.7 | — |
On graduate-level reasoning (GPQA Diamond), 9B edged out 120B by 1.6 points. On multilingual knowledge (MMMLU), it won by 3 points, beating a model more than 13× its size. That said, gpt-oss-120B still held the advantage on coding benchmarks.[^1]
Vision Benchmarks: Dominating GPT-5-Nano
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | GPT-5-Nano | Gemini 2.5 Flash-Lite |
|---|---|---|---|---|
| MMMU-Pro | 70.1 | 66.3 | 57.2 | 59.7 |
| MathVision | 78.9 | 74.6 | 62.2 | 52.1 |
| MathVista (mini) | 85.7 | 85.1 | 71.5 | 72.8 |
| OmniDocBench v1.5 | 87.7 | 86.2 | 55.9 | 79.4 |
| VideoMME (w/ subs) | 84.5 | — | — | 74.6 |
On MMMU-Pro, the 9B (70.1) crushed GPT-5-Nano (57.2) by 13 points. On document understanding (OmniDocBench), the gap was 31.8 points. This wasn’t a difference in size — it was a generational leap.
Smaller Models That Still Deliver: 0.8B and 2B
| Benchmark | Qwen3.5-2B | Qwen3.5-0.8B |
|---|---|---|
| MMMU (vision) | 64.2 | 49.0 |
| MathVista (vision) | 76.7 | 62.2 |
| OCRBench (vision) | 84.5 | 74.5 |
| VideoMME (w/ subs) | 75.6 | 63.8 |
The 2B model’s OCRBench score of 84.5 surpasses previous-generation 7B-class models. Even the 0.8B delivers MathVista 62.2 and OCRBench 74.5 — practical levels for edge device deployment.
201 Languages and a 248K Vocabulary
The entire Qwen3.5 family uses a 248K-token vocabulary supporting 201 languages and dialects.[^5] This includes Korean, Japanese, and Chinese, as well as Arabic, Hindi, Swahili, and other low-resource languages. The 9B model’s MMMLU score of 81.2, surpassing gpt-oss-120B (78.2), is a direct result of this vocabulary design.
Where Can You Run It: From Raspberry Pi to Laptops
The real significance of the Small Series isn’t benchmark numbers — it’s accessibility.
- Qwen3.5-0.8B: ~0.5 GB at 4-bit quantization. Runs on Raspberry Pi and smartphones.
- Qwen3.5-2B: ~1.5 GB at 4-bit. Works on regular laptop GPUs and mobile SoCs.
- Qwen3.5-4B: ~3 GB at 4-bit. Runs comfortably on RTX 3060 12GB and M1/M2 Macs.
- Qwen3.5-9B: ~5 GB at 4-bit. Runs on RTX 3090/4090 and M2 Pro+ Macs. Supports ~1M token context via YaRN extension.
All major inference frameworks are supported: vLLM, SGLang, llama.cpp, MLX, and Hugging Face Transformers. GGUF quantized versions are also available on Hugging Face. Hugging Face developer Xenova demonstrated the Qwen3.5 Small Series running directly in a web browser, performing video analysis.[^4]
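For readers who want to try it immediately, here is a hedged Transformers example. The repo id is an assumption based on Qwen’s usual naming (check the Hugging Face collection for the exact names and the vision-enabled variants), and bitsandbytes 4-bit loading is just one of several quantization routes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Repo id is assumed from Qwen's naming conventions; verify on Hugging Face.
MODEL_ID = "Qwen/Qwen3.5-9B-Instruct"

# 4-bit quantization keeps the 9B weights around the ~5 GB mark cited above.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    device_map="auto",
    # For ~1M-token context, earlier Qwen releases enable YaRN through a
    # rope_scaling entry in the model config; presumably the same applies here.
)

messages = [{"role": "user",
             "content": "Explain Gated DeltaNet in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```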
For those interested in quantization and model compression techniques, see our separate guide.
Community Reaction: “How Is This Possible?”
The response on r/LocalLLaMA was sheer amazement. The launch post hit 1,261 upvotes within 10 hours.[^1]
Paul Couvert of Blueshell AI wrote on X: “How is this even possible? The 4B version is nearly on par with the previous 80B-A3B model, and the 9B is as good as the 13× larger GPT-OSS-120B.”[^4]
Karan Kendre of Kargul Studio said, “These models run locally for free on my M1 MacBook Air.”[^4] One developer called the 4B model’s native multimodal capabilities “a game changer for mobile developers.”
Not everything was rosy, though. In r/LocalLLaMA benchmark comparison threads, some pointed out that “reasoning and coding benchmarks are lower compared to gpt-oss,”[^1] and others expressed frustration at the lack of a direct comparison with Qwen3-4B (the 2507 version).
The Full Qwen3.5 Family Picture
| Model | Release Date | Type | Active Parameters |
|---|---|---|---|
| Qwen3.5-397B-A17B | Feb 16 | MoE (flagship) | 17B |
| Qwen3.5-122B-A10B | Feb 24 | MoE | 10B |
| Qwen3.5-35B-A3B | Feb 24 | MoE | 3B |
| Qwen3.5-27B | Feb 24 | Dense | 27B |
| Qwen3.5-9B | Mar 2 | Dense | 9B |
| Qwen3.5-4B | Mar 2 | Dense | 4B |
| Qwen3.5-2B | Mar 2 | Dense | 2B |
| Qwen3.5-0.8B | Mar 2 | Dense | 0.8B |
Shipping eight models in two weeks, from a 0.8B edge model to a 397B frontier flagship, is unprecedented in open-source AI history. All models share the same Gated DeltaNet hybrid architecture and support native multimodality, 201 languages, and Thinking/Non-thinking dual modes.
What Changes
The Qwen3.5 Small Series isn’t just about “small models have arrived.” It signals several fundamental shifts.
First, a paradigm shift in parameter efficiency. A 9B model beating a 120B model isn’t mere benchmark hacking — it’s the compound result of architecture (Gated DeltaNet), training methodology (Strong-to-Weak distillation), and data quality. The fact that the same architecture scales from 0.8B to 397B means this design has achieved true versatility.
Second, the democratization of multimodal AI. The ability to process text, images, and video in a single model has now reached 0.8B. Understanding video on a smartphone, reading documents, and recognizing UI elements are all possible. Eliminating the need to load a separate vision model fundamentally reduces the complexity of edge deployment.
Third, the threshold for practical local AI has dropped. A 4-bit quantized 9B model running in 5GB VRAM means that anyone with an RTX 3060-class GPU can use gpt-oss-120B-level reasoning locally, for free. Cloud API costs, data privacy, and latency — all three problems solved simultaneously.
With Qwen3.5, Alibaba delivered a clear message: “More intelligence, less compute.” The era when small models beat large ones has arrived, and the benefits can be enjoyed even on a Raspberry Pi.
Footnotes
[^1]: r/LocalLLaMA, “Breaking: The small qwen3.5 models have been dropped,” Reddit, March 2, 2026.

[^2]: Qwen Team, Qwen3.5 Collection, Hugging Face, March 2, 2026.

[^3]: Yang et al., “Gated Delta Networks: Improving Mamba2 with Delta Rule,” arXiv:2412.06464, 2024.

[^4]: Carl Franzen, “Alibaba’s small, open source Qwen3.5-9B beats OpenAI’s gpt-oss-120B and can run on standard laptops,” VentureBeat, March 2, 2026.

[^5]: Awesome Agents, “Qwen 3.5 Small Series Ships Four Models From 0.8B to 9B,” March 2, 2026.