vLLM Complete Guide — From Parameters to Optimization, Everything About Local LLM Serving


The first tool engineers encounter when trying to serve LLMs on local GPUs is vLLM. Installation to server startup takes under 10 minutes. The problem starts after that. How should you set --gpu-memory-utilization? What’s the difference between --tensor-parallel-size and --enable-expert-parallel? How much faster does FP8 quantization actually make things? The official docs don’t answer these questions. This guide provides direct answers for real-world usage.


Why vLLM is Hot: PagedAttention and Continuous Batching

vLLM’s core paper was published at SOSP 2023 by Kwon et al.[1] At the time, LLM serving systems commonly wasted GPU memory. Transformer attention generates and stores Key and Value vectors for every token, and this KV cache occupies contiguous memory blocks. Because no system can predict how many tokens a request will generate, existing systems had to reserve memory for the worst case. The result was 60–80% of KV cache memory wasted.

PagedAttention solved this by borrowing OS virtual memory paging.[1] Just as an operating system divides physical memory into fixed-size pages and allocates them non-contiguously to processes, vLLM divides the KV cache into fixed-size blocks (16 tokens by default) and stores them in non-contiguous memory. Blocks are allocated dynamically as a request actually generates tokens, eliminating the waste. According to the paper, vLLM achieved up to 24x the throughput of existing systems.
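The block arithmetic behind this is easy to sketch. A toy illustration of the allocation difference (the block size matches vLLM's 16-token default; the request sizes are made up, and this is not vLLM's actual allocator):

```python
import math

BLOCK_SIZE = 16  # vLLM's default tokens per block

def blocks_needed(num_tokens: int) -> int:
    """KV cache blocks actually allocated for a sequence of num_tokens tokens."""
    return math.ceil(num_tokens / BLOCK_SIZE)

# A request reserved for a 2048-token worst case that only generates 100 tokens:
reserved = blocks_needed(2048)  # worst-case pre-allocation: 128 blocks
used = blocks_needed(100)       # paged allocation: 7 blocks
print(f"reserved={reserved} used={used} waste={1 - used / reserved:.0%}")
```

With paged allocation the short request touches only 7 of the 128 blocks the old scheme would have reserved, which is where the 60–80% waste figure comes from.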

vLLM inference pipeline architecture

Continuous Batching was the other innovation. Traditional static batching couldn’t accept new requests until every request in the batch had completed: a 100-token request had to wait while another request generated 1,000 tokens. Continuous Batching removes completed requests from the batch and inserts new ones at every iteration, so the GPU always processes a full batch and throughput improves dramatically.[2] Anyscale research showed Continuous Batching achieving up to 23x higher throughput than static batching while also reducing p50 latency.
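A toy simulation makes the scheduling difference concrete. This is not vLLM's scheduler, just a sketch of the two policies with made-up request lengths and a batch capacity of 2:

```python
# Decode steps each request needs; batch capacity of 2 slots (both assumed).
REQUESTS = [1000, 100, 100, 100]
CAPACITY = 2

def static_batching_steps(reqs):
    # A new batch starts only after every request in the current batch finishes.
    steps = 0
    for i in range(0, len(reqs), CAPACITY):
        steps += max(reqs[i:i + CAPACITY])  # batch runs until its longest request ends
    return steps

def continuous_batching_steps(reqs):
    # Finished requests are replaced from the queue at every iteration.
    queue, running, steps = list(reqs), [], 0
    while queue or running:
        while len(running) < CAPACITY and queue:
            running.append(queue.pop(0))
        steps += 1
        running = [r - 1 for r in running if r > 1]  # drop requests on their last token
    return steps

print(static_batching_steps(REQUESTS))      # 1000 + 100 = 1100
print(continuous_batching_steps(REQUESTS))  # bounded by the 1000-step request
```

In this tiny example the total-step gain is modest, but the three short requests all finish by step ~300 under continuous batching instead of queueing behind the 1,000-step request; at real request volumes that slot reuse is what produces the 23x figure.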


Complete Parameter Breakdown

vLLM performance metrics at a glance

Model Loading: Get the Basics Right

Starting with the most basic parameters for vllm serve <model>:

| Parameter | Default | Description |
| --- | --- | --- |
| --model | — | HuggingFace model name or local path |
| --tokenizer | Same as model | Specify a separate tokenizer when needed |
| --tokenizer-mode | auto | auto/hf/mistral/deepseek_v32 |
| --dtype | auto | auto/bfloat16/float16/float32 |
| --quantization | None | awq/gptq/fp8/bitsandbytes/gguf, etc. |
| --max-model-len | From model config | Maximum context length (input + output) |
| --trust-remote-code | False | Allow custom model code execution |
| --load-format | auto | auto/safetensors/gguf/bitsandbytes, etc. |

--dtype auto automatically determines based on model’s torch_dtype. Forcing --dtype float16 on BF16 models causes precision loss. Use auto unless there’s a specific reason.

--trust-remote-code is required for models like Qwen3 and the InternLM series. Since it executes custom Python files from the HuggingFace repo, use it only with trusted official repositories.

--max-model-len isn’t just about “how long documents to process.” This value directly correlates to total KV cache memory. With heavy models like Qwen3-235B, setting this high can make KV cache consume more memory than model weights. We’ll cover memory calculations in detail below.


GPU Memory and Parallelization: The Most Important Settings

Parallelization Strategy: TP vs PP vs EP

Understanding these three parameters is key to vLLM configuration.

--tensor-parallel-size (shortened to -tp) divides model weight matrices across multiple GPUs for computation. All-reduce communication happens quickly between NVLink-connected GPUs within the same node, keeping latency low. This is the first option to consider in most situations.

--pipeline-parallel-size (shortened to -pp) distributes model layers sequentially across GPUs. GPU A processes front layers while GPU B processes back layers. This creates “pipeline bubbles” that reduce GPU utilization compared to TP. Use this supplementarily in inter-node environments with narrow bandwidth or when TP alone can’t fit the model.

--enable-expert-parallel applies only to MoE (Mixture of Experts) models. For MoE architectures like Qwen3-235B-A22B, DeepSeek V3, and Llama 4 Maverick, enabling EP distributes the expert computation load evenly across GPUs. Applying plain TP to MoE layers often fragments each expert’s weights inefficiently.

For Qwen3-235B-A22B on dual H200s, the answer is simple: -tp 2 --enable-expert-parallel.

| Strategy | Suitable Situations | Considerations |
| --- | --- | --- |
| --tensor-parallel-size N | NVLink GPUs within a node | Requires NVLink bandwidth; minimal latency |
| --pipeline-parallel-size N | Inter-node distribution, or when TP limits are exceeded | Efficiency loss from pipeline bubbles |
| --enable-expert-parallel | MoE models only | Can combine with TP (-tp 2 --enable-expert-parallel) |

Memory Triangle: gpu-memory-utilization, max-model-len, max-num-seqs

These three parameters work in tandem. Changing one affects the allowable ranges of the other two.

--gpu-memory-utilization (default 0.9) sets the fraction of GPU memory the vLLM engine may use. Setting 0.9 on an H200 (141 GB) allocates ~127 GB for model weights plus KV cache. Higher values expand KV cache space for more concurrent requests, but shrink the headroom (~14 GB here) left for the system stack and PyTorch’s internal buffers.

--max-model-len is maximum context length. Large values mean each request consumes more KV cache. Small values allow more concurrent requests with the same memory.

--max-num-seqs (default 256) is maximum simultaneous sequences. Large values increase concurrent requests, thus total KV cache consumption.

The relationship:

Available KV cache memory = (GPU memory × gpu-memory-utilization) - Model weights memory

Max concurrent sequences × KV cache/sequence ≤ Available KV cache memory
where KV cache/sequence ∝ max-model-len
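The relationship can be turned into a quick feasibility check. A rough calculator with assumed numbers (141 GB H200 at 0.90 utilization, 118 GB of weights on the GPU, ~0.05 MB of KV cache per token; the per-token cost is model-dependent, see the memory calculation section):

```python
def max_concurrent_seqs(gpu_mem_gb: float, gpu_mem_util: float,
                        weights_gb: float, kv_per_token_mb: float,
                        max_model_len: int) -> int:
    """Upper bound on sequences that each reserve max_model_len of KV cache."""
    kv_budget_gb = gpu_mem_gb * gpu_mem_util - weights_gb
    kv_per_seq_gb = kv_per_token_mb * max_model_len / 1024
    return int(kv_budget_gb // kv_per_seq_gb)

# Assumed numbers: ~8.9 GB of KV budget, ~1.6 GB of KV cache per 32k sequence.
print(max_concurrent_seqs(141, 0.90, 118, 0.05, 32768))  # → 5
```

Under these assumptions only about five full-length 32k sequences fit, which is why shrinking --max-model-len is the first OOM lever: halving it roughly doubles the sequence count for free.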

When OOM occurs, first reduce --max-model-len, then --max-num-seqs.


Quantization Choice: Real-World Differences Between fp8, awq, gptq, bitsandbytes

Practical characteristics by --quantization option:

| Method | Precision Type | Required Hardware | Speed | Memory Savings | Notes |
| --- | --- | --- | --- | --- | --- |
| fp8 | W8A8 (FP8) | H100, H200, Ada+ | ★★★★★ | ~50% | Fastest. Use official FP8 checkpoints |
| awq | W4A16 | All NVIDIA GPUs | ★★★★☆ | ~75% | Marlin kernel auto-applied. --dtype half recommended |
| gptq | W4A16 or W8A16 | All NVIDIA GPUs | ★★★★☆ | ~75% | gptq_marlin kernel auto-used |
| bitsandbytes | NF4, Int8 | Including CPU | ★★☆☆☆ | ~75%+ | Maximum memory savings; slowest |
| gguf | Various | All GPUs | ★★★☆☆ | Variable | Requires --load-format gguf. MoE support (v0.8+) |

FP8 is the strongest choice with H100/H200. Both model weights and activations use FP8, benefiting memory and computation. For models with official FP8 checkpoints like Qwen3-235B-A22B (Qwen3-235B-A22B-FP8), quality is more stable than dynamic FP8 quantization.

AWQ considers activation distributions during weight quantization to protect important channels, typically showing less quality degradation than GPTQ.[3] For 4-bit quantization without an H100, AWQ is the first choice.

bitsandbytes uses NF4 or Int8 to push memory savings to the extreme. However, without Marlin kernel support it is slower than AWQ or GPTQ. Treat it as a fallback for severely memory-constrained setups.

FP8 KV Cache Effects

--kv-cache-dtype fp8 stores only KV cache in FP8, separate from model weight quantization. Two effects:

First, KV cache memory halves: FP8 is 8 bits per element versus BF16’s 16. This allows more concurrent sequences or longer contexts in the same memory.

Second, H100/H200 FlashInfer backend utilizes FP8 GEMM hardware acceleration. Quality loss is generally negligible. Requires CUDA 11.8+ and H100/H200/Ada Lovelace GPUs.

# FP8 weights + FP8 KV cache combination (dual-H200 settings; a single H200 cannot hold the 235 GB of weights)
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --dtype bfloat16

Scheduling: Chunked Prefill and Batch Size Tuning

--max-num-batched-tokens is key to Chunked Prefill in vLLM V1. It sets maximum tokens to process in one scheduling step.

Small values (e.g., 2048) give decode requests more opportunities to run between prefill chunks, reducing Inter-Token Latency (ITL). Beneficial for real-time streaming use cases.

Large values (e.g., 32768+) process prefill chunks in bulk, improving TTFT (Time To First Token) and overall throughput. Better for offline batch processing.

V1 has --enable-chunked-prefill activated by default, so just adjust --max-num-batched-tokens situationally.
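A toy sketch shows how the token budget splits a long prompt into chunks while leaving room for running decodes in every step (the prompt length and decode count are made up; this is not vLLM's actual scheduler):

```python
def prefill_chunks(prompt_len: int, max_num_batched_tokens: int,
                   decode_tokens: int) -> int:
    """Scheduler steps needed to prefill one prompt while decode_tokens
    running decode requests are served in every step."""
    budget = max_num_batched_tokens - decode_tokens  # tokens left for prefill
    assert budget > 0, "budget must leave room for prefill"
    steps, remaining = 0, prompt_len
    while remaining > 0:
        remaining -= min(remaining, budget)
        steps += 1
    return steps

# A 16k-token prompt arriving while 64 decode requests are in flight:
print(prefill_chunks(16384, 2048, 64))    # small budget: many short steps, low ITL
print(prefill_chunks(16384, 32768, 64))   # large budget: one step, best TTFT
```

The small budget spreads the prefill over nine steps so running decodes keep streaming, while the large budget finishes the prefill in a single step at the cost of stalling decodes for that step.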


Speculative Decoding: Practical Usage

--speculative-config configures speculative decoding in JSON format. Small draft models predict multiple tokens while large main models verify them in one pass. This parallelizes verification to reduce decode latency.

Method 1: N-gram

Reuses n-grams from input prompts for token prediction. No additional models needed, no extra GPU memory. Effective for code or repetitive pattern documents.

--speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 3, "prompt_lookup_max": 10}'

Method 2: Draft Model

Uses a smaller model with the same tokenizer as the draft model. Drafting with a smaller member of the same model family improves accept rates.

--speculative-config '{"model": "meta-llama/Llama-3.1-8B-Instruct", "num_speculative_tokens": 5}'

Method 3: EAGLE3

Uses specially trained draft heads, currently showing the highest accept rates among speculative decoding methods.[4] Works with prefix caching and chunked prefill in V1.

--speculative-config '{
  "method": "eagle3",
  "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
  "num_speculative_tokens": 3,
  "draft_tensor_parallel_size": 1
}'

A num_speculative_tokens of 3–5 is generally optimal; too large a value lowers the accept rate and can actually slow things down. Speculative decoding also tends to underperform with MoE models, whose low active-parameter distributions match draft models poorly.
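Why 3–5 is the sweet spot can be seen with a standard back-of-the-envelope model: assuming each drafted token is accepted independently with probability alpha (an idealization, not a vLLM measurement), the expected tokens produced per verification step is (1 - alpha^(k+1)) / (1 - alpha):

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per verification pass with k speculative tokens,
    assuming an independent per-token accept probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for k in (1, 3, 5, 10):
    print(k, round(expected_tokens_per_step(0.7, k), 2))
```

At alpha = 0.7 the expected yield grows from ~1.7 (k=1) to ~2.9 (k=5) and then nearly plateaus near 1/(1 - alpha), while the drafting cost keeps growing linearly in k; past k≈5 you pay more drafts for almost no extra accepted tokens.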


Version Evolution: v0.4 to v0.8

vLLM version timeline

Key Changes by Version

v0.4.x (Late 2023)    — PagedAttention v1, Continuous Batching, AWQ/GPTQ quantization
v0.5.x (H1 2024)      — Chunked Prefill (optional), Prefix Caching (optional), early Speculative Decoding
v0.6.x (H2 2024)      — 1.8–2.7x throughput improvement over v0.5.3, FP8 KV Cache, bitsandbytes FP4 support
v0.7.x (January 2025) — V1 engine beta, DeepSeek V3/R1 support, FlashAttention 3
v0.8.x (February 2025)— V1 engine default, Expert Parallelism, Gemma 3, Blackwell support
v0.9+  (Mid 2025+)    — Complete V0 backend removal planned

v0.6.0’s performance improvement wasn’t just bug fixes.[5] Multi-step scheduling and an async output processor overlapped GPU computation with CPU output processing for 12% additional throughput. Simultaneous use of Chunked Prefill and Prefix Caching also became possible from this version.

Why V0 → V1 Engine Transition Matters

The V0 engine’s biggest problem was its synchronous architecture. Its scheduler processed either prefill or decode, but not both, in a single iteration, which limited resource utilization.

V1 completely redesigned this structure.[6] The API server moved to a ZMQ-based asynchronous design, and a unified scheduler handles prefill and decode in a single flow. Full torch.compile integration enabled further optimizations, and Chunked Prefill and Prefix Caching are now enabled by default.

| Category | V0 | V1 (v0.8+ default) |
| --- | --- | --- |
| Scheduler | Separate prefill/decode | Unified scheduler |
| Chunked Prefill | Optional (default off) | Default on |
| Prefix Caching | Optional (default off) | Default on |
| Preemption default | SWAP | RECOMPUTE |
| torch.compile | Partial support | Full integration |
| Architecture | Synchronous | ZMQ-based asynchronous |

To revert to V0, set environment variable VLLM_USE_V1=0. V0 fallback automatically occurs when using features V1 doesn’t support.


Common Errors and Solutions

OOM: Step-by-Step Approach

Methods to try in order when OOM occurs:

  1. Reduce --max-model-len: Directly decreases KV cache requirements. Most effective.
  2. Reduce --max-num-seqs: Lowers concurrent sequences for total KV cache reduction.
  3. Add --kv-cache-dtype fp8: 50% KV cache memory savings.
  4. Increase --tensor-parallel-size: Reduces per-GPU model memory, freeing KV cache space.
  5. Apply --quantization fp8 or awq: Reduces model weights themselves.

Raising --gpu-memory-utilization is a last resort. Above 0.95 can cause OOM during PyTorch internal buffering or CUDA graph capture.

CUDA Error Solutions

CUDA illegal memory access (common with MoE models): H20 GPU MoE FP8 illegal memory access issues were reported in pre-v0.8 versions. Upgrading to v0.8.0+ resolves this.

CUDA graph capture failed: Add --enforce-eager flag to disable CUDA graphs. Also useful for debugging.

FP16 overflow (DeepSeek V2 etc.): Switch to --dtype bfloat16. FP16’s range (max 65504) can overflow with large LLM weight values.

Model Loading Failures

KeyError: 'qwen3_moe': Caused by an outdated transformers version. Resolve with pip install "transformers>=4.51.0" (quote the spec so the shell doesn’t treat >= as a redirect).

GGUF loading failure: Must specify both --load-format gguf and --quantization gguf.

Qwen3/MoE Series Specifics

Qwen3-235B-A22B requires vllm>=0.8.5 and transformers>=4.51.0. When using Qwen3’s thinking mode, avoid greedy decoding (temperature=0): zero temperature is known to trigger infinite repetition. Keep temperature at 0.6 or higher.

Some GPUs like Tesla V100, A40, RTX series may show FusedMoE JSON config not found errors with MoE models. This happens when MoE kernel tuning configs don’t exist for those GPU architectures. Work around with --enforce-eager or adjust VLLM_FUSED_MOE_CHUNK_SIZE environment variable.

Practical Debugging Flow

# Step 1: Basic operation check with eager mode + V0 engine
VLLM_USE_V1=0 vllm serve <model> \
  --enforce-eager \
  --max-model-len 4096 \
  --trust-remote-code

# Step 2: Switch to V1 if no issues
vllm serve <model> \
  --enforce-eager \
  --max-model-len 4096 \
  --trust-remote-code

# Step 3: Enable CUDA graphs (remove eager)
vllm serve <model> \
  --max-model-len 4096 \
  --trust-remote-code

# Step 4: Gradually increase max-model-len and add parameters

Real-World Optimization: Running Qwen3-235B on H200x2

KV Cache Memory Calculation Method

Formula for calculating actual KV cache memory requirements:

KV cache memory ≈ 2 × num_layers × num_kv_heads × head_dim × max_model_len × sequences × element_size

Qwen3-235B-A22B (FP8 KV, TP=2):
- num_layers = 94
- num_kv_heads = 4 (GQA, 2 per GPU with TP=2)
- head_dim = 128
- element_size: FP8 = 1 byte

max_model_len=32768, 1 sequence:
= 2 × 94 × 2 × 128 × 32768 × 1 × 1 byte
≈ 1.6 GB / sequence

BF16 = 2x → 3.2 GB / sequence
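The same calculation as code (decimal GB; per-GPU KV heads for TP=2, as in the worked example above):

```python
def kv_cache_per_seq_gb(num_layers: int, kv_heads_per_gpu: int, head_dim: int,
                        max_model_len: int, element_size: int) -> float:
    # 2x for the separate Key and Value tensors; result in decimal GB
    kv_bytes = (2 * num_layers * kv_heads_per_gpu * head_dim
                * max_model_len * element_size)
    return kv_bytes / 1e9

# Qwen3-235B-A22B per GPU at TP=2: 94 layers, 2 of the 4 GQA KV heads, head_dim 128
fp8 = kv_cache_per_seq_gb(94, 2, 128, 32768, 1)   # the ~1.6 GB figure above
bf16 = kv_cache_per_seq_gb(94, 2, 128, 32768, 2)  # the ~3.2 GB figure above
print(round(fp8, 2), round(bf16, 2))  # → 1.58 3.15
```

Scaling max_model_len scales the result linearly, which is exactly the relationship the practical guide table below summarizes.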

Practical guide summary:

| max-model-len | KV cache / sequence (FP8) | Recommended max-num-seqs |
| --- | --- | --- |
| 8,192 | ~400 MB | 32–64 |
| 32,768 | ~1.6 GB | 16–32 |
| 131,072 | ~6.4 GB | 4–8 |

FP8 vs BF16 Tradeoff

Loading 235B model in BF16 on H200 requires 235B × 2 bytes ≈ 470 GB. Dual H200 total VRAM is 282GB. BF16 full weights simply don’t fit. FP8 checkpoint needs 235B × 1 byte ≈ 235 GB, fitting in H200x2. Adding FP8 KV cache leaves ~40–50GB for KV cache usage.
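The weight-memory arithmetic is worth sanity-checking in a few lines (decimal GB; runtime overhead and activations ignored, so this is an upper bound on what is left for KV cache):

```python
# Does the 235B checkpoint fit in two 141 GB H200s?
params = 235e9
total_vram_gb = 2 * 141          # 282 GB across the pair

bf16_gb = params * 2 / 1e9       # 2 bytes per parameter
fp8_gb = params * 1 / 1e9        # 1 byte per parameter
print(bf16_gb <= total_vram_gb,  # → False: 470 GB does not fit
      fp8_gb <= total_vram_gb,   # → True: 235 GB fits
      total_vram_gb - fp8_gb)    # → 47.0 GB left before overhead
```

The ~47 GB remainder, minus CUDA graph and runtime buffers, is where the ~40–50 GB KV cache budget in the text comes from.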

| Category | FP8 (H200) | BF16 |
| --- | --- | --- |
| Model memory (235B) | ~235 GB | ~470 GB (exceeds H200x2) |
| Compute speed | Up to 2x faster | Baseline |
| Precision | Slight loss (official ckpt validated) | Full precision |
| Additional KV cache savings | --kv-cache-dtype fp8 possible | Baseline |
| Recommended GPU | H100, H200, Ada+ | All GPUs |

Throughput vs Latency Tuning

Settings should vary by purpose:

| Purpose | max-num-seqs | max-num-batched-tokens | gpu-memory-utilization |
| --- | --- | --- | --- |
| Maximum throughput | 256+ | 32768+ | 0.95 |
| General balance | 64–128 | 8192–16384 | 0.90 |
| Minimum latency | 8–16 | 2048–4096 | 0.80–0.85 |
| Stability priority | 32 | 8192 | 0.85 |

Prefix Caching Utilization

Prefix caching is enabled by default in V1, so no additional setup is needed. To maximize its effect, always place the system prompt at the very beginning. For patterns like long-document Q&A, where multiple questions share identical context, it significantly reduces TTFT.

# Prefix caching utilization in the Python API
from vllm import LLM

llm = LLM(model="...", enable_prefix_caching=True)
# Requests sharing the same system prompt automatically reuse the KV cache

Environment Variable Cheat Sheet

# V1/V0 engine selection
VLLM_USE_V1=1           # Force V1 (default in v0.8+)
VLLM_USE_V1=0           # Force V0 (debugging/compatibility)

# Manual attention backend specification
VLLM_ATTENTION_BACKEND=FLASH_ATTN           # Flash Attention
VLLM_ATTENTION_BACKEND=FLASHINFER           # FlashInfer
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN  # For Qwen3 long contexts

# MoE related
VLLM_FUSED_MOE_CHUNK_SIZE=32768             # Adjust MoE chunk size

# Other
CUDA_VISIBLE_DEVICES=0,1                     # Specify GPUs (0, 1 only)
VLLM_MEDIA_LOADING_THREAD_COUNT=8           # Multimodal media loading threads

Practical Command Collection

Basic Server Start

vllm serve Qwen/Qwen3-8B \
  --dtype auto \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code

Large MoE Model: Qwen3-235B on H200x2

vllm serve Qwen/Qwen3-235B-A22B-FP8 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

High Performance: Prefix Cache + Speculative Decoding

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_min": 3}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92

EAGLE3 Speculative Decoding

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-config '{
    "method": "eagle3",
    "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    "num_speculative_tokens": 3,
    "draft_tensor_parallel_size": 1
  }' \
  --kv-cache-dtype fp8

Qwen3 YaRN Long Context Extension (131K)

vllm serve Qwen/Qwen3-235B-A22B-FP8 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 4 \
  --trust-remote-code

Minimal Debugging Setup

VLLM_USE_V1=0 vllm serve <model> \
  --enforce-eager \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.80

vLLM’s parameter system seems complex at first, but once you understand the two core ideas, PagedAttention and Continuous Batching, each parameter’s tradeoffs fall into place naturally. Balancing memory, latency, and throughput is the essence of vLLM configuration. That running a 235B model on two H200s is now a realistic production option shows how much this project has changed the landscape in just two years.


Footnotes

  1. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). “Efficient Memory Management for Large Language Model Serving with PagedAttention.” Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ‘23). ACM. doi:10.1145/3600006.3613165

  2. Luo, C., & Stoica, I. (2023). “How Continuous Batching Enables 23x Throughput in LLM Inference while Reducing p50 Latency.” Anyscale Blog.

  3. Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” arXiv:2306.00978.

  4. Zhang, Y., et al. (2025). “EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test.” vLLM Blog, December 13, 2025.

  5. vLLM Team. (2024). “vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction.” vLLM Blog, September 5, 2024.

  6. vLLM Team. (2025). “vLLM V1: A Major Upgrade to vLLM’s Core Architecture.” vLLM Blog, January 27, 2025.
