From Qwen3 to Qwen3.5: Why Active 3B Now Beats Active 22B

AI Concepts · Qwen · MoE · LLM architecture · DeltaNet · hybrid attention

In February 2026, Alibaba’s Qwen team quietly released a striking statistic: Qwen3.5-35B-A3B, a 35B total parameter model, had broadly outperformed Qwen3-235B-A22B, the flagship from just 10 months earlier, across major benchmarks.[1] The numbers seem backwards. Counting active parameters, that’s 3B versus 22B: a model activating less than one-seventh as many parameters performing better.

This reversal wasn’t simply because they “trained it better.” The architecture itself changed. The 10-month journey from Qwen3 → Qwen3-Next → Qwen3.5 was a design experiment that directly challenged the “bigger equals stronger” assumption.

Qwen3: When Orthodoxy Worked

Released in April 2025, Qwen3[2] was quite polished by the standards of its time. The flagship Qwen3-235B-A22B adopted a Mixture of Experts (MoE) structure: 235B total parameters with 22B activated per token. Each MoE layer selected 8 of its 128 experts.[3]

For attention, it used Grouped-Query Attention (GQA): 64 query heads share 4 key-value heads, significantly reducing memory usage compared to traditional multi-head attention and easing the KV-cache burden for long contexts.
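To make the head sharing concrete, here is a minimal NumPy sketch of GQA’s core idea (shapes and names are illustrative, not Qwen3’s actual implementation): many query heads reuse a few key-value heads, so the KV cache only needs to store the small head count.

```python
import numpy as np

def gqa_scores(q, k):
    """Toy grouped-query attention scores: many Q heads share few KV heads."""
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    group = n_q_heads // n_kv_heads               # 64 // 4 = 16 query heads per KV head
    k_expanded = np.repeat(k, group, axis=0)      # (64, seq, d): each KV head serves 16 Q heads
    # V is shared the same way; only the scores are shown here.
    return q @ k_expanded.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

seq, d = 8, 32
q = np.random.randn(64, seq, d)   # 64 query heads
k = np.random.randn(4, seq, d)    # only 4 KV heads live in the KV cache
print(gqa_scores(q, k).shape)     # (64, 8, 8)
```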

However, the core attention computation still relied on traditional softmax attention, and the problem is computational complexity. With sequence length L, softmax attention requires O(L²) computation: when the context doubles, the computation quadruples. Processing a 20,000-token document costs four times as much attention compute as a 10,000-token one, not twice as much.
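A quick calculation makes the quadratic growth tangible:

```python
# Softmax attention materializes an L x L score matrix per head:
for L in (10_000, 20_000):
    print(f"L={L:>6}: {L * L:,} pairwise scores")
# L= 10000: 100,000,000 pairwise scores
# L= 20000: 400,000,000 pairwise scores -- double the tokens, quadruple the work
```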

Vision capabilities were delegated to a separate model, Qwen3-VL. This choice simplified the training pipeline while sacrificing synergy between the two modalities. Language support covered 119 languages.

Qwen3 showed competitive performance against DeepSeek-R1, GPT-4o and others, establishing itself as a strong contender in the open-weights space. Architecturally, though, this generation represented a polished combination of proven approaches, prioritizing refinement over innovation.

Qwen3-Next: Dismantling and Redesigning Architecture

In September 2025, the Qwen team surprised everyone with Qwen3-Next.[4] While positioned as a research preview, its content was substantial for a “preview.”

Hybrid Attention: Being “Serious” Only One Layer in Four

Qwen3-Next’s first innovation was restructuring attention. Out of 48 layers, 36 used linear attention (the DeltaNet approach) while 12 used traditional softmax attention, a 3:1 hybrid structure.[5]

Linear attention reduces O(L²) complexity to near O(L). But it has a fatal weakness. Pure linear attention compresses all history into a fixed-size hidden state (an internal state matrix). As information accumulates, early details get muddled, making it weak at “needle in a haystack” tasks, such as precisely retrieving a key clue from the third paragraph of a 10,000-word document.

DeltaNet’s delta rule mitigates this.[6] Its state update equation:

$$S_t = S_{t-1} + \beta_t\,(v_t - S_{t-1}k_t)\,k_t^\top$$

For non-technical readers: if traditional linear attention is like continuously adding notes to a memo pad, DeltaNet applies the principle “erase and rewrite if what’s written is wrong.” The $(v_t - S_{t-1}k_t)$ term is the “prediction error”: the difference between incoming information and what the past state predicted. Updating memory only by this error amount drastically improves accuracy. DeltaNet appeared at ICLR 2025, outperforming Mamba2 and pure linear attention on the MQAR associative-memory benchmark.
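As a minimal NumPy sketch of the update above (assuming unit-norm keys and a scalar β; the per-head dimensions and gating in the production Gated DeltaNet differ), note how writing to the same key twice replaces the stored value instead of piling it on:

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One delta-rule update: S_t = S_{t-1} + beta * (v - S_{t-1} k) k^T."""
    error = v - S @ k                     # prediction error: new value minus what memory predicts
    return S + beta * np.outer(error, k)  # write back only the correction

S = np.zeros((4, 4))                      # state matrix (d_v x d_k)
k = np.array([1.0, 0.0, 0.0, 0.0])        # unit-norm key
S = delta_rule_step(S, k, np.array([1.0, 2.0, 3.0, 4.0]), beta=1.0)
S = delta_rule_step(S, k, np.array([5.0, 6.0, 7.0, 8.0]), beta=1.0)
print(S @ k)  # [5. 6. 7. 8.] -- overwritten, not summed as in vanilla linear attention
```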

The reason 25% of the layers remained softmax attention relates to this limitation. Linear attention alone cannot reliably retrieve precise information from long contexts. Inserting a softmax layer every fourth layer struck a balance: “75% fast tools, 25% precision tools.” NVIDIA’s official blog confirmed this design scales GPU memory and computation near-linearly while maintaining accuracy.[5]
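A toy layer plan matching that split might look as follows; the 36:12 ratio comes from the source, while this exact interleaving pattern is an assumption for illustration:

```python
# 48 layers, softmax attention on every 4th layer, linear (DeltaNet) elsewhere
layers = ["softmax" if (i + 1) % 4 == 0 else "linear" for i in range(48)]
print(layers.count("linear"), layers.count("softmax"))  # 36 12 -- the 3:1 hybrid
```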

Ultra-Sparse MoE: 11 Out of 512

The second innovation pushed MoE sparsity to an extreme. Qwen3-Next deployed 513 experts in total (512 routed experts plus 1 shared expert), activating 10 routed experts and the 1 shared expert, 11 in all, per token.[5] The activation ratio was approximately 2.2%.
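A minimal sketch of this kind of top-k routing (hypothetical names, and none of the bias terms or balancing machinery a production router carries):

```python
import numpy as np

def route(router_logits, top_k=10):
    """Toy router: pick the top-10 of 512 routed experts; the shared expert always fires."""
    top = np.argsort(router_logits)[-top_k:]   # ids of the 10 routed experts for this token
    weights = np.exp(router_logits[top])
    weights /= weights.sum()                   # renormalize over the selected experts
    return top, weights

logits = np.random.randn(512)                  # one token's routing scores
experts, weights = route(logits)
print(len(experts) + 1, "of 513 experts active")  # 11 of 513 -- roughly 2.2%
```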

Comparing with previous generations shows the clear direction:

| Model | Total Experts | Active Experts | Active Ratio | Notes |
| --- | --- | --- | --- | --- |
| Mixtral | 8 | 2 | 1/4 (25%) | Early sparse MoE |
| Qwen3-235B-A22B | 128 | 8 | 1/16 (6.3%) | Orthodox generation |
| DeepSeek R1 | 256 | 8 | 1/32 (3.1%) | DeepSeek approach |
| Qwen3-Next | 512 | 11 | ~1/46 (2.2%) | Ultra-sparse |

Expanding to 512 experts follows a strategy of maximizing candidate-expert diversity. Each expert handles a narrower domain, improving specialization. However, training difficulty also increases: if the 512 experts don’t develop evenly, load imbalance occurs and only some experts get utilized.
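One standard countermeasure is an auxiliary load-balancing loss in the style of Switch Transformer, sketched below for the top-1 case; whether Qwen3-Next uses exactly this objective is not stated in the sources here:

```python
import numpy as np

def load_balance_loss(router_probs, assignments, n_experts):
    """Penalty that is minimal (== 1.0) when tokens spread evenly across experts.

    router_probs: (tokens, n_experts) softmax outputs
    assignments:  (tokens,) chosen expert id per token (top-1 for simplicity)
    """
    frac_tokens = np.bincount(assignments, minlength=n_experts) / len(assignments)
    frac_probs = router_probs.mean(axis=0)       # average router confidence per expert
    return n_experts * float(frac_tokens @ frac_probs)

probs = np.full((1000, 8), 1 / 8)                # a perfectly balanced toy router
print(load_balance_loss(probs, np.arange(1000) % 8, 8))  # 1.0, the minimum
```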

Ultimately, Qwen3-Next achieves 80B total parameters with only 3B active per token. An 80B model running inference at 3B-level computational cost.

MTP: Two or Three Tokens at Once

The third innovation was Multi-Token Prediction (MTP). Traditional language models predict one token at a time; MTP takes a speculative-decoding-like approach, predicting 2-3 tokens at once during inference and substantially increasing generation speed.
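To see the shape of the speed win, here is a toy speculative-style loop with made-up draft and verifier functions; Qwen’s actual MTP head and batched verification are more sophisticated than this sketch:

```python
def draft(tokens, n):
    # Toy MTP head: cheaply guesses the next n tokens in one shot (counts upward).
    return [(tokens[-1] + i) % 10 for i in range(1, n + 1)]

def full_model(tokens):
    # Toy "full model": usually counts up, but jumps past 7, so some drafts fail.
    nxt = (tokens[-1] + 1) % 10
    return 8 if nxt == 7 else nxt

def mtp_decode(tokens, n_draft=3, max_len=12):
    """Accept drafted tokens while the verifier agrees; fall back on disagreement.
    (Real speculative decoding verifies all drafts in a single forward pass.)"""
    while len(tokens) < max_len:
        for tok in draft(tokens, n_draft):
            if len(tokens) >= max_len:
                break
            if tok == full_model(tokens):          # verifier agrees: free token
                tokens.append(tok)
            else:                                  # disagreement: take the verifier's token
                tokens.append(full_model(tokens))
                break
    return tokens

print(mtp_decode([0]))  # [0, 1, 2, 3, 4, 5, 6, 8, 9, 0, 1, 2]
```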

The combination of Hybrid Attention (O(L) computation) + Ultra-sparse MoE (low active parameters) + MTP (parallel token generation) achieved 100+ tokens/second. Compared to Qwen3-32B, prefill throughput improved 7-10x and decoding throughput improved 4-10x.[7]

Qwen3.5: Production Completeness

Qwen3.5 was released on February 16, 2026.[8] This time it was not a research preview but a production release. While inheriting the hybrid attention structure validated in Qwen3-Next, it raised the level of completeness in three directions.

First, a redesigned training infrastructure. The team introduced an FP8 (8-bit floating-point) training pipeline, applying low-precision computation across activations, MoE routing, and matrix operations.[9] Trading precision for computational throughput, the Qwen team maintained multimodal training efficiency near 100% of text-only training (a toy illustration of the FP8 trade-off appears after the third point below). Reinforcement learning also shifted to an asynchronous approach, scaling to progressively complex tasks in “million-agent environments.”[10]

Second, native multimodality. Qwen3 ran separate text and vision models; Qwen3.5 adopted early fusion from the start, jointly learning text and image representations in a shared parameter space without separate adapters. The result: it matches Qwen3’s text performance while surpassing Qwen3-VL in visual understanding.

Third, expanded language support: from Qwen3’s 119 languages to 201, the broadest coverage among open models.
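As a rough illustration of the precision-for-throughput trade mentioned above (a crude simulation of FP8 E4M3 rounding, not NVIDIA’s or Qwen’s actual pipeline, which relies on hardware casts and per-block scaling):

```python
import numpy as np

def fake_fp8_e4m3(x):
    """Simulate E4M3 rounding: scale into the FP8 range (max ~448), keep ~3 mantissa bits."""
    scale = 448.0 / np.abs(x).max()                # per-tensor scale-to-fit
    y = x * scale
    exp = np.floor(np.log2(np.abs(y) + 1e-30))     # exponent of each element
    step = 2.0 ** (exp - 3)                        # 3 mantissa bits -> 8 steps per octave
    return np.round(y / step) * step / scale       # snap to the grid, undo the scale

w = np.random.randn(4, 4)
rel_err = np.abs(w - fake_fp8_e4m3(w)) / np.abs(w)
print(rel_err.max())  # bounded by ~2**-4: the precision traded for throughput
```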

The flagship Qwen3.5-397B-A17B provides a 1M-token context window by default, with built-in tool-use capabilities. But it was this generation’s Medium series, especially the 35B-A3B, that drew particular attention: with 35B total parameters and 3B active, just 13.6% of Qwen3-235B-A22B’s 22B active parameters, it surpassed the previous generation’s flagship.[1]

Key benchmark results:

| Benchmark | Qwen3.5 Score | Notes |
| --- | --- | --- |
| IFBench (instruction following) | 76.5 | Exceeds GPT-5.2 (75.4) |
| AIME 2026 (mathematical reasoning) | 91.3 | Competitive with GPT-5.2 (96.7) and Claude (93.3) |
| SWE-bench Verified (coding) | 76.4 | Approaches GPT-5.2 (80.0) |
| MMMU (multimodal understanding) | 85.0 | Major improvement over Qwen3-VL (80.6) |

(Scores from [8].)

Why Active 3B Beats Active 22B

To answer this question, we first need to understand why “parameter count = performance” held as a rule.

Early language model development established the heuristic that more parameters meant better performance. So people just made them bigger. But increasing parameters is like adding more warehouse shelves. More shelves don’t make you retrieve items faster or store them more efficiently. What matters is “which items go on which shelves, how they’re arranged, and how you retrieve them.”

The Qwen3.5 generation’s reversal resulted from combining three levers.

Lever 1: Computational Structure Efficiency. Softmax attention’s cost explodes as L² with longer contexts. The transition to hybrid attention reduced 75% of this cost to linear scale: the same compute budget can process much longer contexts, or the same context much faster.

Lever 2: Maximizing Expert Specialization. The move from Qwen3’s 128-choose-8 to Qwen3-Next/Qwen3.5’s 512-choose-11 wasn’t just about adding experts. Each expert’s domain became narrower. In hospital terms: a hospital that consults 11 of 512 ultra-specialists gives more accurate diagnoses than one that consults 8 of 128 general doctors. Even with fewer active parameters, higher expert focus improves overall reasoning quality.

Lever 3: Training Recipe Evolution. FP8 training isn’t just about saving memory; maintaining model quality at reduced precision requires solving numerous engineering challenges around training stability, gradient management, and weight initialization. The introduction of asynchronous RL followed the same pattern: not simply “more reinforcement learning,” but curriculum learning that runs millions of agents in parallel while gradually complexifying the task distribution. Ultimately it’s about efficiency: extracting more from the same data and compute.

The shift moved from dragging heavy equipment up a narrow mountain path to reaching the summit efficiently with lightweight gear on a well-designed route.

Comparison: Three Generations’ Specs

| Category | Qwen3 (2025.4) | Qwen3-Next (2025.9) | Qwen3.5 (2026.2) |
| --- | --- | --- | --- |
| Flagship | 235B-A22B | 80B-A3B | 397B-A17B |
| Efficiency model | 30B-A3B | 80B-A3B | 35B-A3B |
| Attention method | Full softmax + GQA | Hybrid 3:1 (DeltaNet + softmax) | Hybrid 3:1 (Gated DeltaNet + Gated Attention) |
| MoE experts | 128 | 512 (+1 shared) | 512 |
| Active experts | 8 | 11 | 11 |
| Activation ratio | 1/16 | ~1/46 | ~1/46 |
| Context length | 128K | 128K+ | 1M (API) |
| Inference speed | 20-30 tok/s level | 100+ tok/s | High throughput |
| Multimodal | Separate Qwen3-VL | None | Native integration |
| Language support | 119 | Unspecified | 201 |
| FP8 training | No | No | Yes |
| MTP support | No | Yes | Yes |

Direction of Architectural Evolution

What the Qwen3.5 generation demonstrated wasn’t simple performance improvement. It signaled the competitive axis in language models shifting from “parameter scale” to “activation efficiency.”

Hugging Face analyst Maxime Labonne characterized this trend as “attention mechanisms becoming the new battleground.”[8] Just a year ago, the debate over “softmax attention or not” was itself niche. But between 2025 and 2026, major Chinese AI labs each redesigned attention in their own way: Qwen’s hybrid DeltaNet, DeepSeek’s Multi-Head Latent Attention (MLA), GLM-5’s sparse attention, Kimi’s MLA variant. The single standard disappeared.

Increasing architectural diversity also means a growing burden on the hardware and software stack. It was no coincidence that NVIDIA’s official blog post, published shortly after Qwen3-Next’s launch, emphasized that Blackwell GPUs’ NVLink bandwidth was essential for routing this hybrid MoE’s experts.[5]

The era where active 3B beats active 22B means the fundamental question has shifted from “how big did you make it” to “how cleverly did you design it.” Structure matters more than size.

Footnotes

  1. MarkTechPost. (2026, February 24). “Alibaba Qwen Team Releases Qwen 3.5 Medium Model Series: A Production Powerhouse Proving that Smaller AI Models are Smarter.” MarkTechPost.

  2. Qwen Team. (2025, April). “Qwen3: Think Deeper, Act Faster.” Qwen Blog, Alibaba.

  3. ApXML. “Qwen3 235B A22B Thinking — Architecture.” ApXML Machine Learning.

  4. Bojie Li. (2025, September). “Qwen3-Next: Hybrid Attention + Ultra-Sparse MoE + MTP = SOTA Inference Speed.” 01.me.

  5. NVIDIA Developer Blog. (2025). “New Open Source Qwen3-Next Models Preview Hybrid MoE Architecture Delivering Improved Accuracy and Accelerated Parallel Processing Across NVIDIA Platform.” NVIDIA.

  6. Yang, S., et al. (2024). “Gated Delta Networks: Improving Mamba2 with Delta Rule.” arXiv:2412.06464. (ICLR 2025 camera ready)

  7. Analytics Vidhya. (2025, September 15). “Qwen3-Next: A Deep Dive into Qwen’s latest 80B model.” Analytics Vidhya. (7-10x prefill throughput, 4-10x decoding throughput improvement)

  8. Labonne, M. (2026, February). “Qwen3.5: Nobody Agrees on Attention Anymore.” Hugging Face Blog.

  9. Digital Watch Observatory. (2026). “Qwen3.5 debuts with hybrid architecture and expanded multimodal capabilities.” Digital Watch.

  10. Qwen Team. (2026). Qwen3.5-27B-FP8 Model Card. “RL Generalization: Reinforcement learning scaled across million-agent environments.” Hugging Face.
