Qwen 3.5 Complete Guide — Specs, Benchmarks, VRAM, and Usage


Alibaba dropped Qwen 3.5-397B-A17B on February 16th. It's a 397B-parameter MoE model with only 17B active parameters, Apache 2.0 licensed, with native multimodal support. Getting near-GPT-5.2 or Claude 4.5 Opus performance as open weights is pretty remarkable.

Core Specs Summary

| Category | Value |
| --- | --- |
| Total Parameters | 397B |
| Active Parameters (per token) | 17B |
| Architecture | Sparse MoE + Hybrid Attention (Gated DeltaNet + Gated Attention) |
| Expert Configuration | 10 routed + 1 shared = 11 active out of 512 |
| Context Length | 262,144 tokens (up to 1M with YaRN) |
| Supported Languages | 201 languages and dialects |
| Multimodal | Native vision-language (images up to 1344×1344, 60-second video) |
| Vocabulary | 248,320 tokens |
| License | Apache 2.0 |

Out of 397B parameters, only 17B are actually used for inference, keeping costs incredibly low. The active parameter ratio is just 4.3%.
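The arithmetic behind that ratio is simple enough to sanity-check, using the figures from the spec table above (the 2·N FLOPs-per-token rule of thumb is a standard approximation, not a Qwen-specific figure):

```python
# Figures from the spec table above.
total_params = 397e9   # total parameters
active_params = 17e9   # parameters active per token

ratio = active_params / total_params
print(f"Active ratio: {ratio:.1%}")  # → Active ratio: 4.3%

# Rough ~2·N FLOPs-per-token rule of thumb for a forward pass:
flops_per_token = 2 * active_params
print(f"~{flops_per_token / 1e9:.0f} GFLOPs per generated token")  # → ~34 GFLOPs per generated token
```

Compute scales with the 17B active parameters, but memory still scales with all 397B, which is exactly the tension the VRAM section below deals with.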

Benchmark Comparison

Let’s see how it stacks up against frontier models. Here are the key benchmarks.

Language (Thinking Mode)

| Benchmark | Qwen 3.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
| --- | --- | --- | --- | --- |
| MMLU-Pro | 87.8 | 87.4 | 89.5 | 89.8 |
| AIME26 | 91.3 | 96.7 | 93.3 | 90.6 |
| GPQA Diamond | 88.4 | 92.4 | 87.0 | 91.9 |
| LiveCodeBench v6 | 83.6 | 87.7 | 84.8 | 90.7 |
| SWE-bench Verified | 76.4 | 80.0 | 80.9 | 76.2 |
| IFBench | 76.5 | 75.4 | 58.0 | 70.4 |
| LongBench v2 | 63.2 | 54.5 | 64.4 | 68.2 |

It’s slightly behind GPT-5.2 on math/reasoning, but tops IFBench (Instruction Following) and beats GPT-5.2 by a decent margin on LongBench.

Multimodal (Vision-Language)

| Benchmark | Qwen 3.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
| --- | --- | --- | --- | --- |
| MMMU | 85.0 | 86.7 | 80.7 | 87.2 |
| MathVista (mini) | 90.3 | 83.1 | 80.0 | 87.9 |
| ZEROBench | 12 | 9 | 3 | 10 |
| OCRBench | 93.1 | 80.7 | 85.8 | 90.4 |
| Video-MME | 87.5 | 86.0 | 77.6 | 88.4 |

The multimodal results are honestly impressive: #1 overall on ZEROBench (12 points), plus first place on MathVista and OCRBench. The OCR performance stands out in particular; beating GPT-5.2 by more than 12 points is significant.

What Changed from Qwen3

Qwen3.5 isn’t just an upgrade; they redesigned the architecture from the ground up.

| Category | Qwen3 (2025.05) | Qwen3.5 (2026.02) |
| --- | --- | --- |
| Largest Model | 235B-A22B | 397B-A17B |
| Architecture | Standard MoE | Hybrid MoE + Gated DeltaNet |
| Languages | 119 | 201 (+69%) |
| Multimodal | Separate VL model | Native integration (early fusion) |
| Vocabulary | 152K | 248K (+63%) |
| Context | 128K | 262K |

Three key changes:

  1. Hybrid Architecture: Mixing Gated DeltaNet (linear attention) with regular attention massively improved inference efficiency. 8.6–19x faster inference compared to Qwen3-Max (1T+ parameters), with a 60% cost reduction.
  2. Native Multimodal: Previously needed separate VL models, now handles text, images, and video in one unified model.
  3. 201 Language Support: 63% vocabulary increase reduced non-English token consumption by 10–60%.
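To give a feel for the first change, here is a toy NumPy sketch of the delta-rule update that linear-attention variants in the DeltaNet family are built on. This illustrates the general mechanism only, not Qwen3.5's actual implementation; the state shape, `beta` gate, and helper names are all illustrative assumptions:

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One recurrent step of delta-rule linear attention (toy version).

    S    : (d, d) fast-weight state matrix
    k, v : (d,) key and value vectors for the current token
    beta : scalar gate in [0, 1] controlling write strength
    """
    v_pred = S @ k                             # what the state currently predicts for this key
    S = S + beta * np.outer(v - v_pred, k)     # correct the prediction (the "delta rule")
    return S

def delta_rule_read(S, q):
    """Reading is a single matrix-vector product: O(d^2) per token,
    independent of how many tokens came before."""
    return S @ q

# Toy usage: store one key/value pair, then recall it.
d = 4
S = np.zeros((d, d))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0, 0.0])
S = delta_rule_step(S, k, v, beta=1.0)
print(delta_rule_read(S, k))  # recalls v for the stored key
```

The design point: softmax attention re-reads a KV cache that grows with sequence length, while this fixed-size state makes per-token cost constant, which is where the throughput gains in long-context inference come from.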

VRAM Requirements

The biggest gotcha with MoE models is “active parameters are small, but you still need all parameters in memory.” Here’s memory usage by quantization:

| Quantization | Disk Size | Required Memory |
| --- | --- | --- |
| BF16 (original) | ~807 GB | ~810+ GB |
| FP8 | ~400 GB | ~400+ GB |
| Q8_0 | ~420 GB | ~420+ GB |
| Q6_K | ~320 GB | ~320+ GB |
| Q5_K_M | ~280 GB | ~280+ GB |
| Q4_K_XL (UD) | ~214 GB | ~256 GB |
| Q3_K_XL (UD) | ~170 GB | ~192 GB |
| Q2_K_XL (UD) | ~146 GB | ~150+ GB |

UD = Unsloth Dynamic 2.0 quantization. Keeps important layers at 8/16-bit to minimize quality loss.
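These sizes follow almost directly from bytes-per-weight. A back-of-envelope estimator (the bits-per-weight averages here are my assumptions; real GGUF files mix tensor precisions, which is why the table's figures differ slightly):

```python
def model_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Back-of-envelope on-disk size: params × bits ÷ 8, in GB."""
    return total_params * bits_per_weight / 8 / 1e9

PARAMS = 397e9
# bits-per-weight are rough averages (assumption), not exact GGUF figures
for name, bits in [("BF16", 16.0), ("FP8", 8.0), ("Q8_0", 8.5), ("Q4_K", 4.5)]:
    print(f"{name:6s} ~{model_size_gb(PARAMS, bits):.0f} GB")
```

This lands within a few percent of the table (794 GB for BF16, ~422 GB at ~8.5 bits for Q8_0); the gap is extra tensors and the higher-precision layers the UD quants keep.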

What Hardware Can Run This

| Hardware | Feasible | Notes |
| --- | --- | --- |
| 8×H100 80GB (640 GB) | Yes | Official recommendation. ~45 tok/s with FP8 |
| 8×A100 80GB (640 GB) | Yes | Can serve with vLLM/SGLang |
| Mac Studio M3 Ultra 256GB | Yes | Q4 quantization possible. Confirmed by users |
| Mac Studio M2 Ultra 192GB | Yes | Q3/Q2 quantization possible |
| 4×A100 80GB (320 GB) | Limited | Needs Q8 or lower quantization |
| RTX 4090 24GB + 256GB RAM | Limited | Offload MoE experts to RAM, ~3-4 tok/s |
| RTX 4090 24GB solo | No | Insufficient VRAM |

Honestly, for local use, Mac Studio 192GB+ is the realistic option. Someone on Reddit got ~3 tok/s with 192GB RAM + 36GB VRAM (3090+3060) using Q2 quantization. Usable but not exactly snappy.

Usage

Ollama (Local)

ollama run qwen3.5:397b

GGUF quantized versions are provided by Unsloth. For direct llama.cpp usage:

./llama-cli \
  -hf unsloth/Qwen3.5-397B-A17B-GGUF:MXFP4_MOE \
  --ctx-size 16384 --temp 0.6

vLLM / SGLang (Server)

If you have multiple GPUs, vLLM or SGLang can serve it:

vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 --max-model-len 262144

SGLang supports MTP (Multi-Token Prediction) for additional speed.
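Once vLLM is serving, it exposes an OpenAI-compatible endpoint. A minimal client sketch using only the standard library (assumes vLLM's default port 8000 and no auth; adjust for your deployment):

```python
import json
import urllib.request

MODEL = "Qwen/Qwen3.5-397B-A17B"

def build_chat_payload(prompt: str, temperature: float = 0.6) -> dict:
    """Request body for the OpenAI-compatible /v1/chat/completions route."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With the server from the command above running, `chat("Explain MoE routing in two sentences.")` returns the model's reply; any OpenAI SDK pointed at the same base URL works the same way.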

API (Pricing)

If local deployment isn’t feasible, APIs are the practical choice.

| Provider | Input Price | Output Price |
| --- | --- | --- |
| Alibaba Cloud | $0.11/M tokens | $0.44/M tokens |
| OpenRouter | $0.13/M tokens | $0.52/M tokens |
| NVIDIA NIM | Free trial | - |

The pricing is notable. Alibaba Cloud’s $0.11/M input tokens is about 1/18th of Gemini 3 Pro’s price. Frontier-level performance at this price point is pretty aggressive.
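To make that concrete, a quick cost calculator using the prices from the table above, for a hypothetical workload of 10M input and 2M output tokens per day:

```python
PRICES = {  # USD per 1M tokens (input, output), from the table above
    "Alibaba Cloud": (0.11, 0.44),
    "OpenRouter": (0.13, 0.52),
}

def monthly_cost(provider: str, in_m_per_day: float, out_m_per_day: float, days: int = 30) -> float:
    """Monthly USD cost for a steady daily token volume (in millions of tokens)."""
    price_in, price_out = PRICES[provider]
    return days * (in_m_per_day * price_in + out_m_per_day * price_out)

for p in PRICES:
    print(f"{p}: ${monthly_cost(p, 10, 2):.2f}/month")
# → Alibaba Cloud: $59.40/month
# → OpenRouter: $70.20/month
```

At those rates, even a fairly heavy workload stays under $100/month, which is the "aggressive pricing" point in numbers.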

Community Reactions

Positive Feedback

  • MoE Efficiency Praise: “397B with 17B active? That’s a huge win for inference costs”
  • Benchmark Approval: “Near Opus 4.5/GPT-5.2 performance” was the common assessment
  • Multimodal Strength: Screenshot-to-Code was reportedly better at layout reproduction than Gemini 3 Pro
  • Local Execution: Mac Studio 192GB compatibility got high praise
  • Reddit r/LocalLLaMA Unsloth GGUF post got 454+ upvotes

Negative Feedback

  • Coding Disappointment: "Can't generate error-free code in one shot"; some users even reported it performing worse than Qwen3-30B
  • Agent Coding Weakness: “Lacks competitiveness in agent coding”
  • API Speed: 5-10 tok/s on OpenRouter right after launch (server overload)
  • LiveCodeBench Gap: 83.6 vs Gemini 3 Pro’s 90.7 is a noticeable difference

Overall assessment leans toward “ultimate bang for buck” but repeatedly mentions disappointment with pure coding tasks.

Bottom Line: Who Should Use This

Recommended for:

  • Services needing multimodal (image/video analysis)
  • API cost optimization while maintaining frontier-level performance
  • Global services where multilingual support is crucial
  • Mac Studio users wanting local LLM deployment

Less ideal for:

  • Coding agents as core functionality (Gemini 3 Pro or Claude 4.5 still better)
  • Scenarios requiring top-tier math/reasoning (GPT-5.2 more reliable)

Summary: Qwen 3.5 delivers 80–90% of frontier model performance at 1/10–1/18th the cost. The open weights + Apache 2.0 licensing makes it commercially viable. If coding isn’t your primary use case, it’s definitely worth considering as a main model.
