Qwen 3.5 Medium Series: Open Source Models with 10B Active Parameters Beat GPT-5-mini
Small Models Are Starting to Beat Big Models
On February 24, 2026, Alibaba’s Qwen team released three Qwen 3.5 medium series models: 122B-A10B, 27B Dense, and 35B-A3B. The numbers might suggest mid-range models, but the benchmark results tell a different story — they surpassed not only their predecessor flagship Qwen3-235B-A22B but also OpenAI’s GPT-5-mini across multiple benchmarks. The 122B-A10B’s achievement of MMLU-Pro 86.7 and GPQA Diamond 86.6 with just 10B active parameters carries significance beyond raw numbers. It marks a turning point where open source model efficiency began catching up to closed model absolute performance.
This piece analyzes the architecture, benchmarks, local execution methods, and use cases for each model.
Lineup Overview: Four Choices
The Qwen 3.5 medium series consists of three open-weight models and one hosted version.
- Qwen3.5-122B-A10B: 122B total parameters, 10B active. 256 MoE experts (8 routing + 1 shared). The core model delivering flagship-level performance.
- Qwen3.5-27B: 27B Dense model with hybrid architecture. Vision integration for multimodal support.
- Qwen3.5-35B-A3B: 35B total parameters, 3B active. MoE structure enabling the lightest inference.
- Qwen3.5-Flash: API-hosted version of 35B-A3B with 1M context as default.
All three open-weight models support 262K native context, extendable to 1,010,000 tokens. They support 201 languages and can toggle between thinking and non-thinking modes.
Architecture: Gated Delta Networks + MoE Hybrid
The most notable change in Qwen 3.5 was architectural. It adopted a hybrid design that dramatically improved efficiency without completely replacing standard Transformer attention layers.
What is Gated DeltaNet
Gated Delta Networks (Gated DeltaNet) were designed to overcome linear attention limitations[^1]. Standard linear attention has O(n) complexity relative to sequence length but struggled with precise information retrieval. DeltaNet applied the delta rule to update memory state at each step. Adding gating mechanisms to this created Gated DeltaNet for adaptive memory control.
The key insight was compressing past context through fixed-size hidden states. While structurally similar to RNNs, the delta rule made it far more accurate than standard linear attention on associative recall tasks.
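The fixed-size-state idea above can be sketched in a few lines. This is a deliberately simplified, single-head toy of the gated delta rule (scalar gates, unit-norm keys, no normalization or chunked parallelism), not the actual Qwen 3.5 kernel: the memory is a `d_v × d_k` matrix, the delta rule erases whatever is currently stored under a key before writing the new value, and a decay gate fades old content.

```python
import numpy as np

def gated_deltanet_step(S, k, v, q, alpha, beta):
    """One recurrent step of a simplified gated delta rule.

    S     : (d_v, d_k) fixed-size memory state
    k, q  : (d_k,) key / query (keys assumed unit-norm)
    v     : (d_v,) value
    alpha : decay gate in [0, 1] (fades old memory)
    beta  : write-strength gate in [0, 1]
    """
    # Delta rule: subtract the value currently stored under k,
    # then write the new association v k^T on top.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q  # new state and the read-out for this step

# Toy demo: store one key/value pair, then retrieve it with the same key.
S = np.zeros((4, 4))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([1.0, 2.0, 3.0, 4.0])
S, out = gated_deltanet_step(S, k, v, q=k, alpha=1.0, beta=1.0)
```

With `alpha=1.0` and `beta=1.0` the read-out reproduces the stored value exactly, which is the associative-recall behavior that plain linear attention struggles with.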
3:1 Hybrid Ratio
However, Gated DeltaNet alone was limited for global context modeling. Qwen 3.5 didn’t replace all layers with DeltaNet but adopted a 3:1 hybrid structure[^2]. In the 122B-A10B model, 48 layers were organized into 16 blocks, each structured as:
3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)
Three DeltaNet layers handled efficient sequence processing while one full attention layer captured global dependencies. DeltaNet layers used 64 linear attention heads for V and 16 for QK, while full attention layers used 32 Q heads and 2 KV heads.
MoE: Sparse Activation of 256 Experts
The 122B-A10B model activated only 8 out of 256 routing experts plus 1 shared expert. Each expert’s intermediate dimension was 1,024, resulting in an active parameter ratio of about 8.2%. This sparsity enabled utilizing 122B-scale knowledge capacity at 10B-level computational cost.
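A back-of-the-envelope check of the sparsity figures quoted above:

```python
# Expert-level sparsity: 8 routed + 1 shared experts fire per token.
routed, shared, active_routed = 256, 1, 8
experts_active = active_routed + shared             # 9 experts per token
expert_fraction = experts_active / (routed + shared)  # ~3.5% of experts

# Parameter-level sparsity: 10B of 122B parameters active per token.
active_ratio = 10 / 122                              # ~8.2%
```

The parameter ratio (8.2%) is higher than the expert ratio (~3.5%) because attention, embeddings, and the shared expert are always active regardless of routing.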
Benchmark Analysis: Where It Wins and Loses
Below are the key benchmark results from official model cards[^3].
| Benchmark | GPT-5-mini | GPT-OSS-120B | Qwen3-235B | 122B-A10B | 27B Dense | 35B-A3B |
|---|---|---|---|---|---|---|
| MMLU-Pro | 83.7 | 80.8 | 84.4 | 86.7 | 86.1 | 85.3 |
| GPQA Diamond | 82.8 | 80.1 | 81.1 | 86.6 | 85.5 | 84.2 |
| SWE-bench Verified | 72.0 | 62.0 | — | 72.0 | 72.4 | 69.2 |
| IFEval | 93.9 | 88.9 | 87.8 | 93.4 | 95.0 | 91.9 |
| BFCL-V4 (tool calling) | 55.5 | — | 54.8 | 72.2 | 68.5 | 67.3 |
| HLE w/ CoT | 19.4 | 14.9 | 18.2 | 25.3 | 24.3 | 22.4 |
| BrowseComp | 48.1 | 41.1 | — | 63.8 | 61.0 | 61.0 |
| LiveCodeBench v6 | 80.5 | 82.7 | 75.1 | 78.9 | 80.7 | 74.6 |
| TerminalBench 2 | 31.9 | 18.7 | — | 49.4 | 41.6 | 40.5 |
122B-A10B’s Dominant Areas
In knowledge and reasoning benchmarks, 122B-A10B clearly outperformed GPT-5-mini. It led by 3.0 points on MMLU-Pro, 3.8 points on GPQA Diamond, and 5.9 points on HLE w/ CoT. HLE (Humanity’s Last Exam) represents expert-level difficulty problems, making the 25.3% score the highest among all compared models.
Search agent performance was also notable. It scored 63.8% on BrowseComp and 49.4% on TerminalBench 2, significantly outperforming GPT-5-mini (48.1% and 31.9% respectively).
Areas Where GPT-5-mini Still Leads
GPT models maintained advantages in coding benchmarks. GPT-OSS-120B topped LiveCodeBench v6 at 82.7%, with GPT-5-mini at 80.5%. The 122B-A10B scored 78.9%, slightly behind. However, the 27B Dense model’s 80.7% was noteworthy. In CodeForces, GPT-5-mini (2160) and GPT-OSS-120B (2157) both outpaced 122B-A10B (2100).
Most Surprising Result: 27B Dense
The 27B Dense model’s performance was particularly impressive. Its SWE-bench Verified score of 72.4% exceeded both GPT-5-mini (72.0) and 122B-A10B (72.0). The IFEval score of 95.0% was the highest among all compared models. LiveCodeBench v6’s 80.7% also surpassed 122B-A10B (78.9%). Achieving this level of performance at just 27B parameters demonstrated the hybrid architecture’s efficiency.
Tool Calling: The Significance of BFCL-V4 72.2%
BFCL-V4 (Berkeley Function Calling Leaderboard v4) measures LLM function/tool calling accuracy. The 122B-A10B’s 72.2% showed an overwhelming gap compared to GPT-5-mini (55.5%) and Qwen3-235B (54.8%). The 27B Dense scored 68.5% and 35B-A3B scored 67.3%.
This metric matters because it directly impacts local agent development. Tool calling accuracy determines viability in MCP (Model Context Protocol) based tool integration, code execution agents, and automation workflows. Breaking through 70% in this domain, where existing open source models remained in the 50s, represents a practical turning point. TAU2-Bench results of 79.5% for 122B-A10B and 81.2% for 35B-A3B further demonstrated strong performance across agent tasks.
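For local agent development, tool calling is typically exercised through the OpenAI-compatible endpoint that servers like SGLang and vLLM expose. The sketch below only builds the request payload; the `get_weather` tool is a hypothetical example, and you would POST the result to `http://localhost:8000/v1/chat/completions` on a running server.

```python
# Build an OpenAI-style function-calling request for a locally served model.
def build_tool_call_request(model, user_msg, tools):
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide whether to call a tool
    }

# Hypothetical tool schema in the OpenAI function-calling format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

req = build_tool_call_request(
    "Qwen/Qwen3.5-122B-A10B", "What's the weather in Paris?", [weather_tool]
)
```

BFCL-V4 essentially measures how reliably the model returns a well-formed `tool_calls` response to requests like this one.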
Local Execution Guide
VRAM Requirements
According to Unsloth’s official guide, memory requirements for each model are[^4]:
| Model | 4-bit | 8-bit | BF16 |
|---|---|---|---|
| 27B | 17 GB | 30 GB | 54 GB |
| 35B-A3B | 22 GB | 38 GB | 70 GB |
| 122B-A10B | 70 GB | 132 GB | 245 GB |
The 27B 4-bit quantized model runs on a single RTX 4090 (24GB). The 35B-A3B also fits in 24GB VRAM with 4-bit quantization, and thanks to the MoE structure, its actual inference speed is faster than the 27B’s. The 122B-A10B requires 70GB even at 4-bit, making single consumer GPU deployment challenging, but it’s feasible on a Mac Studio M3 Ultra (192GB unified memory) or multi-GPU setups.
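The table values can be sanity-checked with a simple rule of thumb. The function below estimates weight memory only; the 10% overhead factor is my assumption, and KV cache plus activations come on top, which is why the official numbers run somewhat higher:

```python
# Rough weight-only memory estimate in GiB for a given parameter count
# (in billions) and quantization bit width.
def approx_weight_gib(n_params_billion, bits, overhead=1.1):
    n_bytes = n_params_billion * 1e9 * bits / 8 * overhead
    return n_bytes / 2**30

size_122b_4bit = approx_weight_gib(122, 4)  # ~62 GiB of weights alone
```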
Quantization: Unsloth Dynamic 2.0
Unsloth received early access from the Qwen team and released GGUF quantization files on launch day. They applied Dynamic 2.0 quantization, upcasting critical layers to 8-bit or 16-bit. The main quantization format was MXFP4_MOE, optimized for MoE expert layers.
According to reports from r/LocalLLaMA, Qwen3.5’s hybrid architecture (qwen-next architecture) showed very strong resistance to quantization[^5]. This was attributed to DeltaNet layers having more uniform weight distributions compared to full attention.
Inference Framework Compatibility
Officially supported frameworks include:
- SGLang: Qwen3.5 support in main branch. MTP (Multi-Token Prediction) support for additional speed improvement.
- vLLM: Main branch required. Tensor Parallel support.
- KTransformers: Specialized for MoE offloading. 122B-A10B execution possible with 24GB GPU + 256GB RAM combination.
- llama.cpp: GGUF format support. SSD/HDD offloading enables inference in memory-constrained environments (with speed degradation).
Basic SGLang command for serving 122B-A10B:
```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-122B-A10B \
  --port 8000 --tp-size 8 \
  --mem-fraction-static 0.8 \
  --context-length 262144 \
  --reasoning-parser qwen3
```
For local 35B-A3B execution with llama.cpp:
```shell
./llama.cpp/llama-cli \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 16384 \
  --temp 0.6 --top-p 0.95 --top-k 20
```
One r/LocalLLaMA user reported running the 35B-A3B MXFP4 quantized model on a 64GB M2 Max MacBook Pro integrated with Claude Code, achieving 398 t/s prompt processing and 27.9 t/s generation[^6].
Recommended Inference Settings
Unsloth’s recommended settings varied by mode:
- Thinking mode (general): temperature 1.0, top_p 0.95, top_k 20, presence_penalty 1.5
- Thinking mode (coding): temperature 0.6, top_p 0.95, top_k 20, presence_penalty 0.0
- Non-thinking mode (general): temperature 0.7, top_p 0.8, top_k 20, presence_penalty 1.5
Maximum context window is 262,144 tokens, with 32,768 tokens recommended for output length in most queries. If OOM occurs, reduce context while maintaining at least 128K to preserve thinking mode quality.
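The presets above can be collected into a dict and splatted into an OpenAI-compatible request’s sampling parameters. The mode names below are mine, not an official API:

```python
# Unsloth's recommended sampling settings, keyed by usage mode.
SAMPLING_PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5},
    "thinking_coding":  {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "non_thinking":     {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "presence_penalty": 1.5},
}

coding = SAMPLING_PRESETS["thinking_coding"]
```

Note that `top_k` and `presence_penalty` are extensions to the base OpenAI schema; most local servers accept them directly or via an extra-parameters field.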
Community Reactions: r/LocalLLaMA
Several threads appeared simultaneously on r/LocalLLaMA post-launch. Main reactions:
35B-A3B Expectations vs Reality: A thread with 360+ upvotes showed users were positive about the 35B-A3B’s size-to-performance ratio, but several reports put it at roughly half the speed of the previous-generation Qwen3-30B-A3B, dropping from 85 t/s to 45 t/s[^7]: a tradeoff of raw speed for quality.
Coding Test Comparisons: One user compared Qwen3-Coder-Next, 35B-A3B, and 27B in one-shot coding tests. Qwen3-Coder-Next, the coding-specialized model, led with 5.5/6; the 35B-A3B followed closely at 4.5/6, while the 27B scored 2/6, showing relative weakness in coding tasks[^6]. However, this was a single test, and some commenters suggested the 27B might perform better in multi-step agent tasks.
Praise for Quantization Resistance: The 122B-A10B thread notably highlighted the qwen-next architecture’s strong quantization resistance. This was attributed to DeltaNet layers’ structural characteristics, meaning minimal performance degradation even at low-bit quantization.
NVIDIA DGX Spark User Reactions: NVIDIA forums reported attempts to compress the 122B model onto a single Spark unit, suggesting potential for expanded consumer hardware accessibility[^8].
Which Model for Whom
122B-A10B: For users with multi-GPU servers or high-capacity Mac environments. Suited for production workloads requiring top-tier reasoning quality, search agents, and tool calling-based automation. Requiring 70GB VRAM at 4-bit makes it more appropriate for team/organization deployment than individual use.
27B Dense: Runs 4-bit on a single 24GB GPU. The only option when multimodal (vision) support is needed. Achieved series-high scores on IFEval and SWE-bench, excelling at instruction following and software engineering tasks. For general tasks beyond coding, accuracy was higher than 35B-A3B.
35B-A3B: Provides fastest inference speed with 3B active parameters. Runs on 24GB VRAM (4-bit), ideal for interactive agents and real-time coding assistants where quick response matters. However, overall accuracy was slightly lower than 27B.
Flash (API): For those wanting 1M context without infrastructure setup. Same model as 35B-A3B but with serverless deployment convenience.
Conclusion: New Efficiency Standards
The Qwen 3.5 medium series wasn’t just a model update. The Gated DeltaNet + MoE hybrid architecture provided a concrete answer to “how far can we go with 10B active parameters.” The BFCL-V4 72.2% and TerminalBench 2 49.4% scores elevated local agent development practicality by a significant step.
Open source model efficiency competition has shifted from “how well can we build big models” to “how much can we leverage big model knowledge with small active parameters.” The Qwen 3.5 medium series represents the clearest example of this fundamental transformation.
Footnotes
[^1]: Gated Delta Networks: Improving Mamba2 with Delta Rule. OpenReview, 2024. https://openreview.net/forum?id=r8H7xhYPwz
[^2]: Sebastian Raschka. “Gated DeltaNet for Linear Attention”. https://sebastianraschka.com/llms-from-scratch/ch04/08_deltanet/
[^3]: Qwen3.5-122B-A10B Model Card. Hugging Face. https://huggingface.co/Qwen/Qwen3.5-122B-A10B
[^4]: Unsloth. “Qwen3.5 - How to Run Locally Guide”. https://unsloth.ai/docs/models/qwen3.5
[^5]: r/LocalLLaMA. Qwen/Qwen3.5-122B-A10B Thread. https://www.reddit.com/r/LocalLLaMA/comments/1rdlc02/
[^6]: r/LocalLLaMA. Qwen3-Coder-Next vs Qwen3.5-35B-A3B vs Qwen3.5-27B Coding Test. https://www.reddit.com/r/LocalLLaMA/comments/1rdnxe6/
[^7]: r/LocalLLaMA. Qwen/Qwen3.5-35B-A3B Thread. https://www.reddit.com/r/LocalLLaMA/comments/1rdlbvc/
[^8]: NVIDIA Developer Forums. Qwen3.5-122B-A10B DGX Spark Discussion. https://forums.developer.nvidia.com/t/361639