Qwen 3.5 Complete Guide — Specs, Benchmarks, VRAM, and Usage


Alibaba dropped Qwen 3.5-397B-A17B on February 16th. It's a 397B-parameter MoE model with only 17B active parameters, Apache 2.0 licensed, with native multimodal support. Getting near-GPT-5.2 or Claude 4.5 Opus performance as open weights is pretty remarkable.

Core Specs Summary

| Category | Value |
| --- | --- |
| Total Parameters | 397B |
| Active Parameters (per token) | 17B |
| Architecture | Sparse MoE + Hybrid Attention (Gated DeltaNet + Gated Attention) |
| Expert Configuration | 10 routed + 1 shared = 11 active out of 512 |
| Context Length | 262,144 tokens (up to 1M with YaRN) |
| Supported Languages | 201 languages and dialects |
| Multimodal | Native vision-language (images up to 1344×1344, 60-second video) |
| Vocabulary | 248,320 tokens |
| License | Apache 2.0 |

Out of 397B parameters, only 17B are actually used for inference, keeping costs incredibly low. The active parameter ratio is just 4.3%.
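The arithmetic behind that ratio is simple enough to sanity-check, using the figures from the spec table above (the 2·N FLOPs-per-token rule of thumb is a standard approximation, not a Qwen-specific figure):

```python
# Figures from the spec table above.
total_params = 397e9   # total parameters
active_params = 17e9   # parameters active per token

ratio = active_params / total_params
print(f"Active ratio: {ratio:.1%}")  # → Active ratio: 4.3%

# Rough ~2·N FLOPs-per-token rule of thumb for a forward pass:
flops_per_token = 2 * active_params
print(f"~{flops_per_token / 1e9:.0f} GFLOPs per generated token")  # → ~34 GFLOPs per generated token
```

Compute scales with the 17B active parameters, but memory still scales with all 397B, which is exactly the tension the VRAM section below deals with.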

Benchmark Comparison

Let’s see how it stacks up against frontier models. Here are the key benchmarks.

Language (Thinking Mode)

| Benchmark | Qwen 3.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
| --- | --- | --- | --- | --- |
| MMLU-Pro | 87.8 | 87.4 | 89.5 | 89.8 |
| AIME26 | 91.3 | 96.7 | 93.3 | 90.6 |
| GPQA Diamond | 88.4 | 92.4 | 87.0 | 91.9 |
| LiveCodeBench v6 | 83.6 | 87.7 | 84.8 | 90.7 |
| SWE-bench Verified | 76.4 | 80.0 | 80.9 | 76.2 |
| IFBench | 76.5 | 75.4 | 58.0 | 70.4 |
| LongBench v2 | 63.2 | 54.5 | 64.4 | 68.2 |

It’s slightly behind GPT-5.2 on math/reasoning, but tops IFBench (Instruction Following) and beats GPT-5.2 by a decent margin on LongBench.

Multimodal (Vision-Language)

| Benchmark | Qwen 3.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
| --- | --- | --- | --- | --- |
| MMMU | 85.0 | 86.7 | 80.7 | 87.2 |
| MathVista (mini) | 90.3 | 83.1 | 80.0 | 87.9 |
| ZEROBench | 12 | 9 | 3 | 10 |
| OCRBench | 93.1 | 80.7 | 85.8 | 90.4 |
| Video-MME | 87.5 | 86.0 | 77.6 | 88.4 |

The multimodal results are honestly impressive: #1 overall on ZEROBench (12 points), plus first place on MathVista and OCRBench. The OCR performance stands out in particular; beating GPT-5.2 by more than 12 points is significant.

What Changed from Qwen3

Qwen3.5 isn’t just an upgrade; they redesigned the architecture from the ground up.

| Category | Qwen3 (2025.05) | Qwen3.5 (2026.02) |
| --- | --- | --- |
| Largest Model | 235B-A22B | 397B-A17B |
| Architecture | Standard MoE | Hybrid MoE + Gated DeltaNet |
| Languages | 119 | 201 (+69%) |
| Multimodal | Separate VL model | Native integration (early fusion) |
| Vocabulary | 152K | 248K (+63%) |
| Context | 128K | 262K |

Three key changes:

  1. Hybrid Architecture: Mixing Gated DeltaNet (linear attention) with regular attention massively improved inference efficiency. 8.6–19x faster inference compared to Qwen3-Max (1T+ parameters), with a 60% cost reduction.
  2. Native Multimodal: Previously needed separate VL models, now handles text, images, and video in one unified model.
  3. 201 Language Support: 63% vocabulary increase reduced non-English token consumption by 10–60%.
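To give a feel for the first change, here is a toy NumPy sketch of the delta-rule update that linear-attention variants in the DeltaNet family are built on. This illustrates the general mechanism only, not Qwen3.5's actual implementation; the state shape, `beta` gate, and helper names are all illustrative assumptions:

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One recurrent step of delta-rule linear attention (toy version).

    S    : (d, d) fast-weight state matrix
    k, v : (d,) key and value vectors for the current token
    beta : scalar gate in [0, 1] controlling write strength
    """
    v_pred = S @ k                             # what the state currently predicts for this key
    S = S + beta * np.outer(v - v_pred, k)     # correct the prediction (the "delta rule")
    return S

def delta_rule_read(S, q):
    """Reading is a single matrix-vector product: O(d^2) per token,
    independent of how many tokens came before."""
    return S @ q

# Toy usage: store one key/value pair, then recall it.
d = 4
S = np.zeros((d, d))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0, 0.0])
S = delta_rule_step(S, k, v, beta=1.0)
print(delta_rule_read(S, k))  # recalls v for the stored key
```

The design point: softmax attention re-reads a KV cache that grows with sequence length, while this fixed-size state makes per-token cost constant, which is where the throughput gains in long-context inference come from.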

VRAM Requirements

The biggest gotcha with MoE models is “active parameters are small, but you still need all parameters in memory.” Here’s memory usage by quantization:

| Quantization | Disk Size | Required Memory |
| --- | --- | --- |
| BF16 (original) | ~807 GB | ~810+ GB |
| FP8 | ~400 GB | ~400+ GB |
| Q8_0 | ~420 GB | ~420+ GB |
| Q6_K | ~320 GB | ~320+ GB |
| Q5_K_M | ~280 GB | ~280+ GB |
| Q4_K_XL (UD) | ~214 GB | ~256 GB |
| Q3_K_XL (UD) | ~170 GB | ~192 GB |
| Q2_K_XL (UD) | ~146 GB | ~150+ GB |

UD = Unsloth Dynamic 2.0 quantization. Keeps important layers at 8/16-bit to minimize quality loss.
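These sizes follow almost directly from bytes-per-weight. A back-of-envelope estimator (the bits-per-weight averages here are my assumptions; real GGUF files mix tensor precisions, which is why the table's figures differ slightly):

```python
def model_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Back-of-envelope on-disk size: params × bits ÷ 8, in GB."""
    return total_params * bits_per_weight / 8 / 1e9

PARAMS = 397e9
# bits-per-weight are rough averages (assumption), not exact GGUF figures
for name, bits in [("BF16", 16.0), ("FP8", 8.0), ("Q8_0", 8.5), ("Q4_K", 4.5)]:
    print(f"{name:6s} ~{model_size_gb(PARAMS, bits):.0f} GB")
```

This lands within a few percent of the table (794 GB for BF16, ~422 GB at ~8.5 bits for Q8_0); the gap is extra tensors and the higher-precision layers the UD quants keep.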

What Hardware Can Run This

| Hardware | Feasible | Notes |
| --- | --- | --- |
| 8×H100 80GB (640 GB) | Yes | Official recommendation. ~45 tok/s with FP8 |
| 8×A100 80GB (640 GB) | Yes | Can serve with vLLM/SGLang |
| Mac Studio M3 Ultra 256GB | Yes | Q4 quantization possible. Confirmed by users |
| Mac Studio M2 Ultra 192GB | Yes | Q3/Q2 quantization possible |
| 4×A100 80GB (320 GB) | Limited | Needs Q8 or lower quantization |
| RTX 4090 24GB + 256GB RAM | Limited | Offload MoE experts to RAM, ~3-4 tok/s |
| RTX 4090 24GB solo | No | Insufficient VRAM |

Honestly, for local use, Mac Studio 192GB+ is the realistic option. Someone on Reddit got ~3 tok/s with 192GB RAM + 36GB VRAM (3090+3060) using Q2 quantization. Usable but not exactly snappy.

Usage

Ollama (Local)

ollama run qwen3.5:397b

GGUF quantized versions are provided by Unsloth. For direct llama.cpp usage:

./llama-cli \
  -hf unsloth/Qwen3.5-397B-A17B-GGUF:MXFP4_MOE \
  --ctx-size 16384 --temp 0.6

vLLM / SGLang (Server)

If you have multiple GPUs, vLLM or SGLang can serve it:

vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 --max-model-len 262144

SGLang supports MTP (Multi-Token Prediction) for additional speed.
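Once vLLM is serving, it exposes an OpenAI-compatible endpoint. A minimal client sketch using only the standard library (assumes vLLM's default port 8000 and no auth; adjust for your deployment):

```python
import json
import urllib.request

MODEL = "Qwen/Qwen3.5-397B-A17B"

def build_chat_payload(prompt: str, temperature: float = 0.6) -> dict:
    """Request body for the OpenAI-compatible /v1/chat/completions route."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With the server from the command above running, `chat("Explain MoE routing in two sentences.")` returns the model's reply; any OpenAI SDK pointed at the same base URL works the same way.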

API (Pricing)

If local deployment isn’t feasible, APIs are the practical choice.

| Provider | Input Price | Output Price |
| --- | --- | --- |
| Alibaba Cloud | $0.11/M tokens | $0.44/M tokens |
| OpenRouter | $0.13/M tokens | $0.52/M tokens |
| NVIDIA NIM | Free trial | - |

The pricing is notable. Alibaba Cloud’s $0.11/M input tokens is about 1/18th of Gemini 3 Pro’s price. Frontier-level performance at this price point is pretty aggressive.
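To make that concrete, a quick cost calculator using the prices from the table above, for a hypothetical workload of 10M input and 2M output tokens per day:

```python
PRICES = {  # USD per 1M tokens (input, output), from the table above
    "Alibaba Cloud": (0.11, 0.44),
    "OpenRouter": (0.13, 0.52),
}

def monthly_cost(provider: str, in_m_per_day: float, out_m_per_day: float, days: int = 30) -> float:
    """Monthly USD cost for a steady daily token volume (in millions of tokens)."""
    price_in, price_out = PRICES[provider]
    return days * (in_m_per_day * price_in + out_m_per_day * price_out)

for p in PRICES:
    print(f"{p}: ${monthly_cost(p, 10, 2):.2f}/month")
# → Alibaba Cloud: $59.40/month
# → OpenRouter: $70.20/month
```

At those rates, even a fairly heavy workload stays under $100/month, which is the "aggressive pricing" point in numbers.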

Community Reactions

Positive Feedback

  • MoE Efficiency Praise: “397B with 17B active? That’s a huge win for inference costs”
  • Benchmark Approval: “Near Opus 4.5/GPT-5.2 performance” was the common assessment
  • Multimodal Strength: Screenshot-to-Code was reportedly better at layout reproduction than Gemini 3 Pro
  • Local Execution: Mac Studio 192GB compatibility got high praise
  • Reddit r/LocalLLaMA Unsloth GGUF post got 454+ upvotes

Negative Feedback

  • Coding Disappointment: "Can't generate error-free code in one shot"; some users even reported it performing worse than Qwen3-30B
  • Agent Coding Weakness: “Lacks competitiveness in agent coding”
  • API Speed: 5-10 tok/s on OpenRouter right after launch (server overload)
  • LiveCodeBench Gap: 83.6 vs Gemini 3 Pro’s 90.7 is a noticeable difference

Overall assessment leans toward “ultimate bang for buck” but repeatedly mentions disappointment with pure coding tasks.

Bottom Line: Who Should Use This

Recommended for:

  • Services needing multimodal (image/video analysis)
  • API cost optimization while maintaining frontier-level performance
  • Global services where multilingual support is crucial
  • Mac Studio users wanting local LLM deployment

Less ideal for:

  • Coding agents as core functionality (Gemini 3 Pro or Claude 4.5 still better)
  • Scenarios requiring top-tier math/reasoning (GPT-5.2 more reliable)

Summary: Qwen 3.5 delivers 80–90% of frontier model performance at 1/10–1/18th the cost. The open weights + Apache 2.0 licensing makes it commercially viable. If coding isn’t your primary use case, it’s definitely worth considering as a main model.
