Qwen 3.5 Complete Guide — Specs, Benchmarks, VRAM, and Usage
Alibaba dropped Qwen 3.5-397B-A17B on February 16th. It’s a 397B-parameter MoE model with only 17B active parameters, Apache 2.0 licensed, with native multimodal support. Getting near-GPT-5.2 or Claude 4.5 Opus performance out of open weights is pretty remarkable.
Core Specs Summary
| Category | Value |
|---|---|
| Total Parameters | 397B |
| Active Parameters (per token) | 17B |
| Architecture | Sparse MoE + Hybrid Attention (Gated DeltaNet + Gated Attention) |
| Expert Configuration | 10 Routed + 1 Shared = 11 active out of 512 |
| Context Length | 262,144 tokens (up to 1M with YaRN) |
| Supported Languages | 201 languages and dialects |
| Multimodal | Native vision-language (images 1344×1344, 60-second video) |
| Vocabulary | 248,320 tokens |
| License | Apache 2.0 |
Of the 397B total parameters, only 17B are activated per token during inference, which keeps compute costs incredibly low. The active parameter ratio is just 4.3%.
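As a quick sanity check on that ratio, here's a back-of-the-envelope sketch. The ~2 FLOPs-per-active-parameter figure is a standard rule of thumb for transformer decoding, not a published number for this model:

```python
# Rough per-token compute for a sparse MoE model.
total_params = 397e9
active_params = 17e9

active_ratio = active_params / total_params
print(f"Active ratio: {active_ratio:.1%}")

# Rule of thumb: ~2 FLOPs per active parameter per decoded token.
flops_per_token = 2 * active_params
print(f"~{flops_per_token / 1e9:.0f} GFLOPs per decoded token")
```

So each generated token costs roughly what a dense ~17B model would, despite the 397B total footprint.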
Benchmark Comparison
Let’s see how it stacks up against frontier models. Here are the key benchmarks.
Language (Thinking Mode)
| Benchmark | Qwen 3.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
|---|---|---|---|---|
| MMLU-Pro | 87.8 | 87.4 | 89.5 | 89.8 |
| AIME26 | 91.3 | 96.7 | 93.3 | 90.6 |
| GPQA Diamond | 88.4 | 92.4 | 87.0 | 91.9 |
| LiveCodeBench v6 | 83.6 | 87.7 | 84.8 | 90.7 |
| SWE-bench Verified | 76.4 | 80.0 | 80.9 | 76.2 |
| IFBench | 76.5 | 75.4 | 58.0 | 70.4 |
| LongBench v2 | 63.2 | 54.5 | 64.4 | 68.2 |
It’s slightly behind GPT-5.2 on math/reasoning, but tops IFBench (Instruction Following) and beats GPT-5.2 by a decent margin on LongBench.
Multimodal (Vision-Language)
| Benchmark | Qwen 3.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
|---|---|---|---|---|
| MMMU | 85.0 | 86.7 | 80.7 | 87.2 |
| MathVista (mini) | 90.3 | 83.1 | 80.0 | 87.9 |
| ZEROBench | 12 | 9 | 3 | 10 |
| OCRBench | 93.1 | 80.7 | 85.8 | 90.4 |
| Video-MME | 87.5 | 86.0 | 77.6 | 88.4 |
The multimodal results are honestly impressive: #1 overall on ZEROBench (12 points), plus first place on MathVista and OCRBench. The OCR performance especially stands out — beating GPT-5.2 by 12+ points is significant.
What Changed from Qwen3
Qwen3.5 isn’t just an upgrade; they redesigned the architecture from the ground up.
| Category | Qwen3 (2025.05) | Qwen3.5 (2026.02) |
|---|---|---|
| Largest Model | 235B-A22B | 397B-A17B |
| Architecture | Standard MoE | Hybrid MoE + Gated DeltaNet |
| Languages | 119 | 201 (+69%) |
| Multimodal | Separate VL model | Native integration (Early Fusion) |
| Vocabulary | 152K | 248K (+63%) |
| Context | 128K | 262K |
Three key changes:
- Hybrid Architecture: Mixing Gated DeltaNet (linear attention) with regular attention massively improved inference efficiency. 8.6–19x faster inference compared to Qwen3-Max (1T+) with 60% cost reduction.
- Native Multimodal: Previously needed separate VL models, now handles text, images, and video in one unified model.
- 201 Language Support: 63% vocabulary increase reduced non-English token consumption by 10–60%.
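One intuition for why the hybrid architecture helps: linear-attention layers don't keep a per-token KV cache, so only the full-attention layers pay that memory cost. A toy calculation with entirely hypothetical numbers (layer count, 3:1 interleave ratio, and head sizes are illustrative, not Qwen 3.5's actual config):

```python
# Toy illustration of why hybrid attention shrinks the KV cache.
# All numbers below are hypothetical, not Qwen 3.5's real config.
n_layers = 48
full_attn_every = 4       # assume 3 linear-attention layers per full-attention layer
n_kv_heads, head_dim = 4, 128
bytes_per_value = 2       # fp16 cache entries

def kv_cache_bytes_per_token(n_full_layers):
    # Each full-attention layer stores one K and one V vector per token.
    return n_full_layers * 2 * n_kv_heads * head_dim * bytes_per_value

dense = kv_cache_bytes_per_token(n_layers)                      # all layers full attention
hybrid = kv_cache_bytes_per_token(n_layers // full_attn_every)  # only 1 in 4 layers

print(f"dense: {dense} B/token, hybrid: {hybrid} B/token ({hybrid / dense:.0%})")
```

With a 3:1 interleave, the KV cache shrinks to a quarter of the dense-attention size, which matters a lot at 262K context.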
VRAM Requirements
The biggest gotcha with MoE models is “active parameters are small, but you still need all parameters in memory.” Here’s memory usage by quantization:
| Quantization | Disk Size | Required Memory |
|---|---|---|
| BF16 Original | ~807 GB | ~810+ GB |
| FP8 | ~400 GB | ~400+ GB |
| Q8_0 | ~420 GB | ~420+ GB |
| Q6_K | ~320 GB | ~320+ GB |
| Q5_K_M | ~280 GB | ~280+ GB |
| Q4_K_XL (UD) | ~214 GB | ~256 GB |
| Q3_K_XL (UD) | ~170 GB | ~192 GB |
| Q2_K_XL (UD) | ~146 GB | ~150+ GB |
UD = Unsloth Dynamic 2.0 quantization. Keeps important layers at 8/16-bit to minimize quality loss.
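A rough way to sanity-check these sizes is a simple bits-per-parameter estimate. It ignores mixed-precision layers, embedding tables, and quantization metadata, so real GGUF files deviate by a few percent (the UD quants in particular mix bit widths):

```python
# Estimate model file size from parameter count and average bits per weight.
def est_size_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

N = 397e9  # total parameters
for name, bits in [("BF16", 16), ("FP8", 8), ("~4.5-bit avg (Q4-class)", 4.5)]:
    print(f"{name}: ~{est_size_gb(N, bits):.0f} GB")
```

The estimates land close to the table (794 GB vs ~807 GB for BF16); the gap is the metadata and higher-precision layers the rule of thumb leaves out.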
What Hardware Can Run This
| Hardware | Feasible | Notes |
|---|---|---|
| 8×H100 80GB (640GB) | ✓ | Official recommendation. 45 tok/s with FP8 |
| 8×A100 80GB (640GB) | ✓ | Can serve with vLLM/SGLang |
| Mac Studio M3 Ultra 256GB | ✓ | Q4 quantization possible. Confirmed by users |
| Mac Studio M2 Ultra 192GB | ✓ | Q3/Q2 quantization possible |
| 4×A100 80GB (320GB) | Limited | Needs Q8 or lower quantization |
| RTX 4090 24GB + 256GB RAM | Limited | Offload MoE to RAM, ~3-4 tok/s |
| RTX 4090 24GB Solo | ✗ | Insufficient VRAM |
Honestly, for local use, Mac Studio 192GB+ is the realistic option. Someone on Reddit got ~3 tok/s with 192GB RAM + 36GB VRAM (3090+3060) using Q2 quantization. Usable but not exactly snappy.
Usage
Ollama (Local)
```shell
ollama run qwen3.5:397b
```
GGUF quantized versions are provided by Unsloth. For direct llama.cpp usage:
```shell
./llama-cli \
  -hf unsloth/Qwen3.5-397B-A17B-GGUF:MXFP4_MOE \
  --ctx-size 16384 --temp 0.6
```
vLLM / SGLang (Server)
If you have multiple GPUs, vLLM or SGLang can serve it:
```shell
vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 --max-model-len 262144
```
SGLang supports MTP (Multi-Token Prediction) for additional speed.
API (Pricing)
If local deployment isn’t feasible, APIs are the practical choice.
| Provider | Input Price | Output Price |
|---|---|---|
| Alibaba Cloud | $0.11/M tokens | $0.44/M tokens |
| OpenRouter | $0.13/M tokens | $0.52/M tokens |
| NVIDIA NIM | Free trial | - |
The pricing is notable. Alibaba Cloud’s $0.11/M input tokens is about 1/18th of Gemini 3 Pro’s price. Frontier-level performance at this price point is pretty aggressive.
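For a concrete sense of scale, here's the monthly bill for a hypothetical workload (100M input / 20M output tokens per month, made-up volumes) at the table's prices:

```python
# Monthly API cost for a hypothetical workload, using the pricing table above.
PRICES = {  # provider: (input $/M tokens, output $/M tokens)
    "Alibaba Cloud": (0.11, 0.44),
    "OpenRouter": (0.13, 0.52),
}

in_m, out_m = 100, 20  # millions of tokens per month (hypothetical)
for provider, (p_in, p_out) in PRICES.items():
    cost = in_m * p_in + out_m * p_out
    print(f"{provider}: ${cost:.2f}/month")
```

Roughly $20/month on Alibaba Cloud for a workload that would cost an order of magnitude more on a frontier closed model.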
Community Reactions
Positive Feedback
- MoE Efficiency Praise: “397B with 17B active? That’s a huge win for inference costs”
- Benchmark Approval: “Near Opus 4.5/GPT-5.2 performance” was the common assessment
- Multimodal Strength: Screenshot-to-Code was reportedly better at layout reproduction than Gemini 3 Pro
- Local Execution: Mac Studio 192GB compatibility got high praise
- Community Traction: the Unsloth GGUF thread on Reddit’s r/LocalLLaMA drew 454+ upvotes
Negative Feedback
- Coding Disappointment: “Can’t generate error-free code in one shot”; some users even reported it performing worse than Qwen3-30B
- Agent Coding Weakness: “Lacks competitiveness in agent coding”
- API Speed: 5-10 tok/s on OpenRouter right after launch (server overload)
- LiveCodeBench Gap: 83.6 vs Gemini 3 Pro’s 90.7 is a noticeable difference
Overall assessment leans toward “ultimate bang for buck” but repeatedly mentions disappointment with pure coding tasks.
Bottom Line: Who Should Use This
Recommended for:
- Services needing multimodal (image/video analysis)
- API cost optimization while maintaining frontier-level performance
- Global services where multilingual support is crucial
- Mac Studio users wanting local LLM deployment
Less ideal for:
- Coding agents as core functionality (Gemini 3 Pro or Claude 4.5 still better)
- Scenarios requiring top-tier math/reasoning (GPT-5.2 more reliable)
Summary: Qwen 3.5 delivers 80–90% of frontier model performance at 1/10–1/18th the cost. The open weights + Apache 2.0 licensing makes it commercially viable. If coding isn’t your primary use case, it’s definitely worth considering as a main model.