vLLM Complete Guide — From Parameters to Optimization, Everything About Local LLM Serving
The first tool engineers encounter when trying to serve LLMs on local GPUs is vLLM. Installation to server startup takes under 10 minutes. The problem starts after that. How should you set --gpu-memory-utilization? What’s the difference between --tensor-parallel-size and --enable-expert-parallel? How much faster does FP8 quantization actually make things? The official docs don’t answer these questions. This guide provides direct answers for real-world usage.
Why vLLM is Hot: PagedAttention and Continuous Batching
vLLM’s core paper was published at SOSP 2023 by Kwon et al.1 Back then, LLM serving systems commonly suffered from GPU memory waste. Transformer attention mechanisms generate and store Key and Value vectors for each token, with this KV cache occupying contiguous memory blocks. The problem was that systems couldn’t predict how many tokens each request would generate, so existing systems had to reserve memory for worst-case scenarios. This resulted in 60–80% KV cache memory waste.
PagedAttention solved this using OS virtual memory paging1. Just as operating systems divide physical memory into fixed-size pages and allocate them non-contiguously to processes, vLLM divides the KV cache into fixed-size blocks (default 16 tokens) and stores them in non-contiguous memory. Blocks are dynamically allocated as requests actually generate tokens, eliminating waste. The paper reports 2–4x throughput over contemporary state-of-the-art serving systems, and the vLLM team reported up to 24x over plain HuggingFace Transformers serving.
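The block-allocation idea can be sketched in a few lines of Python. This is a toy illustration of the paging scheme, not vLLM's actual allocator:

```python
# Toy sketch of PagedAttention-style block allocation (illustration only,
# not vLLM's actual allocator).
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM default)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}               # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Allocate a new block only when the current one fills up."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # first token, or current block just filled
            self.block_tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_seq(self, seq_id):
        """A finished sequence returns its blocks to the pool immediately."""
        self.free.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]

alloc = BlockAllocator(num_blocks=64)
for _ in range(40):      # a sequence that ends up generating 40 tokens
    alloc.append_token(0)
print(len(alloc.block_tables[0]))  # 3 blocks: ceil(40/16), no worst-case reservation
```

Contiguous pre-allocation would have reserved memory for the full max-model-len up front; here a sequence only ever holds ceil(tokens/16) blocks, and freeing returns them to the shared pool for other requests.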

Continuous Batching was another innovation. Traditional static batching couldn’t accept new requests until all requests in the batch completed. 100-token requests had to wait while one request generated 1000 tokens. Continuous Batching removes completed requests from batches and immediately inserts new requests at each iteration. Since GPUs always process full batches, throughput improved dramatically2. Anyscale research showed Continuous Batching achieved 23x throughput improvement while reducing p50 latency compared to static batching.
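A toy simulation makes the waste concrete (illustrative request lengths, not benchmark data):

```python
# Toy comparison of static vs continuous batching (illustrative lengths,
# not benchmark data). Each step decodes one token per active batch slot.
lengths = [100, 1000, 100, 100]  # tokens each request must generate

# Static batching: the whole batch runs until its longest member finishes.
static_steps = max(lengths)                    # 1000 steps
slot_steps = static_steps * len(lengths)       # slot-steps occupied
useful = sum(lengths)                          # slot-steps doing real work
static_util = useful / slot_steps

# Continuous batching: a finished request's slot is refilled from the queue
# at the next iteration, so with a deep request queue every slot stays useful.
continuous_util = 1.0

print(f"static:     {static_util:.1%}")   # 32.5%
print(f"continuous: {continuous_util:.1%}")
```

Three of the four slots idle for 900 of the 1000 steps under static batching; continuous batching reclaims exactly that idle capacity.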
Complete Parameter Breakdown

Model Loading: Get the Basics Right
Starting with the most basic parameters for vllm serve <model>:
| Parameter | Default | Description |
|---|---|---|
--model | — | HuggingFace model name or local path |
--tokenizer | Same as model | Specify separate tokenizer when needed |
--tokenizer-mode | auto | auto/slow/mistral/custom
--dtype | auto | auto/bfloat16/float16/float32 |
--quantization | None | awq/gptq/fp8/bitsandbytes/gguf etc. |
--max-model-len | Model config auto | Maximum context length (input+output) |
--trust-remote-code | False | Allow custom model code execution |
--load-format | auto | auto/safetensors/gguf/bitsandbytes etc. |
--dtype auto automatically determines based on model’s torch_dtype. Forcing --dtype float16 on BF16 models causes precision loss. Use auto unless there’s a specific reason.
--trust-remote-code is required for models whose modeling code ships in the HuggingFace repo rather than in transformers (the InternLM series, for example, or new releases like Qwen3 on older transformers versions). Since it executes custom Python files from HuggingFace repos, only use it with trusted official repositories.
--max-model-len isn’t just about “how long documents to process.” This value directly correlates to total KV cache memory. With heavy models like Qwen3-235B, setting this high can make KV cache consume more memory than model weights. We’ll cover memory calculations in detail below.
GPU Memory and Parallelization: The Most Important Settings
Parallelization Strategy: TP vs PP vs EP
Understanding these three parameters is key to vLLM configuration.
--tensor-parallel-size (shortened to -tp) divides model weight matrices across multiple GPUs for computation. All-reduce communication happens quickly between NVLink-connected GPUs within the same node, keeping latency low. This is the first option to consider in most situations.
--pipeline-parallel-size (shortened to -pp) distributes model layers sequentially across GPUs. GPU A processes front layers while GPU B processes back layers. This creates “pipeline bubbles” that reduce GPU utilization compared to TP. Use this supplementarily in inter-node environments with narrow bandwidth or when TP alone can’t fit the model.
--enable-expert-parallel is for MoE (Mixture of Experts) models only. For MoE architectures like Qwen3-235B-A22B, DeepSeek V3, and Llama 4 Maverick, enabling EP instead of TP distributes expert computation load evenly across GPUs. Applying TP to MoE often inefficiently fragments each expert's weights.
For Qwen3-235B-A22B on dual H200s, the answer is simple: -tp 2 --enable-expert-parallel.
| Strategy | Suitable Situations | Considerations |
|---|---|---|
--tensor-parallel-size N | Within-node NVLink GPUs | Requires NVLink bandwidth, minimal latency |
--pipeline-parallel-size N | Inter-node distribution, TP limits exceeded | Efficiency loss from pipeline bubbles |
--enable-expert-parallel | MoE models only | Can combine with TP (-tp 2 --enable-expert-parallel)
Memory Triangle: gpu-memory-utilization, max-model-len, max-num-seqs
These three parameters work in tandem. Changing one affects the allowable ranges of the other two.
--gpu-memory-utilization (default 0.9) sets the GPU memory fraction for vLLM engine use. Setting 0.9 on H200 (141GB) allocates ~127GB for model weights + KV cache. Higher values expand KV cache space for more concurrent requests, but leave less headroom for NCCL, CUDA graph capture, and PyTorch's internal buffers (~14GB at 0.9).
--max-model-len is maximum context length. Large values mean each request consumes more KV cache. Small values allow more concurrent requests with the same memory.
--max-num-seqs (default 256) is maximum simultaneous sequences. Large values increase concurrent requests, thus total KV cache consumption.
The relationship:
Available KV cache memory = (GPU memory × gpu-memory-utilization) - Model weights memory
Max concurrent sequences × KV cache/sequence ≤ Available KV cache memory
where KV cache/sequence ∝ max-model-len
When OOM occurs, first reduce --max-model-len, then --max-num-seqs.
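That relationship can be turned into a quick feasibility check. The weight and per-sequence KV figures below are illustrative assumptions, not measured values:

```python
def max_sequences(gpu_mem_gb, gpu_memory_utilization, weights_gb, kv_per_seq_gb):
    """Rough upper bound on concurrent sequences, per the relationship above."""
    kv_budget_gb = gpu_mem_gb * gpu_memory_utilization - weights_gb
    if kv_budget_gb <= 0:
        return 0  # weights alone exceed the budget: OOM before serving starts
    return int(kv_budget_gb // kv_per_seq_gb)

# Illustrative numbers (not measurements): an H200 (141 GB), a hypothetical
# 120 GB of weights, 1.6 GB of KV cache per sequence at the chosen max-model-len.
print(max_sequences(141, 0.90, 120, 1.6))  # 4
```

Halving max-model-len roughly halves kv_per_seq_gb, which is why shrinking it is the first lever to pull on OOM.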
Quantization Choice: Real-World Differences Between fp8, awq, gptq, bitsandbytes
Practical characteristics by --quantization option:
| Method | Precision Type | Required Hardware | Speed | Memory Savings | Notes |
|---|---|---|---|---|---|
fp8 | W8A8 (FP8) | H100, H200, Ada+ | ★★★★★ | ~50% | Fastest. Use official FP8 checkpoints |
awq | W4A16 | All NVIDIA GPUs | ★★★★☆ | ~75% | Marlin kernel auto-applied. Recommend --dtype half |
gptq | W4A16 or W8A16 | All NVIDIA GPUs | ★★★★☆ | ~75% | gptq_marlin kernel auto-used |
bitsandbytes | NF4, Int8 | Including CPU | ★★☆☆☆ | ~75%+ | Maximum memory savings. Slowest speed |
gguf | Various | All GPUs | ★★★☆☆ | Variable | Requires --load-format gguf. MoE support (v0.8+) |
FP8 is the strongest choice with H100/H200. Both model weights and activations use FP8, benefiting memory and computation. For models with official FP8 checkpoints like Qwen3-235B-A22B (Qwen3-235B-A22B-FP8), quality is more stable than dynamic FP8 quantization.
AWQ considers activation distributions during weight quantization to protect important channels, typically showing less quality degradation than GPTQ3. For 4-bit quantization without H100, AWQ is the first choice.
bitsandbytes uses NF4 or Int8 to minimize memory consumption to extremes. However, without Marlin kernel support, it’s slower than AWQ or GPTQ. Consider it a fallback for very memory-constrained situations.
FP8 KV Cache Effects
--kv-cache-dtype fp8 stores only KV cache in FP8, separate from model weight quantization. Two effects:
First, KV cache memory halves: BF16 stores 16 bits per element while FP8 stores 8, so the saving is exact. This enables processing more concurrent sequences or longer contexts with the same memory.
Second, H100/H200 FlashInfer backend utilizes FP8 GEMM hardware acceleration. Quality loss is generally negligible. Requires CUDA 11.8+ and H100/H200/Ada Lovelace GPUs.
# FP8 weights + FP8 KV cache combination (H200 optimal settings)
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8 \
--dtype bfloat16
Scheduling: Chunked Prefill and Batch Size Tuning
--max-num-batched-tokens is key to Chunked Prefill in vLLM V1. It sets maximum tokens to process in one scheduling step.
Small values (e.g., 2048) give decode requests more opportunities to run between prefill chunks, reducing Inter-Token Latency (ITL). Beneficial when users watch tokens stream in real time.
Large values (e.g., 32768+) process prefill chunks in bulk, improving TTFT (Time To First Token) and overall throughput. Better for offline batch processing.
V1 has --enable-chunked-prefill activated by default, so just adjust --max-num-batched-tokens situationally.
Speculative Decoding: Practical Usage
--speculative-config configures speculative decoding in JSON format. Small draft models predict multiple tokens while large main models verify them in one pass. This parallelizes verification to reduce decode latency.
Method 1: N-gram
Reuses n-grams from input prompts for token prediction. No additional models needed, no extra GPU memory. Effective for code or repetitive pattern documents.
--speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 3, "prompt_lookup_max": 10}'
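The underlying prompt-lookup idea can be sketched in Python. This is a toy character-level illustration of the technique, not vLLM's implementation:

```python
def propose_draft(prompt_tokens, generated_tokens, min_n=3, max_n=10, k=5):
    """Toy prompt-lookup drafting: if the tail of the text so far matches an
    n-gram inside the prompt, propose the tokens that followed it there."""
    context = prompt_tokens + generated_tokens
    for n in range(max_n, min_n - 1, -1):      # prefer the longest match
        tail = context[-n:]
        for i in range(len(prompt_tokens) - n + 1):
            if prompt_tokens[i:i + n] == tail:
                return prompt_tokens[i + n:i + n + k]  # draft continuation
    return []  # no match: fall back to normal decoding

# Character-level toy example: the model is re-emitting a phrase from the prompt.
prompt = list("the quick brown fox jumps over the lazy dog")
print("".join(propose_draft(prompt, list("over the"))))  # " lazy"
```

The main model then verifies the drafted tokens in one pass, which is why this pays off for code and other repetitive text where the match rate is high.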
Method 2: Draft Model
Uses smaller models with same tokenizers as draft models. Leveraging smaller versions from the same model family improves accept rates.
--speculative-config '{"model": "meta-llama/Llama-3.1-8B-Instruct", "num_speculative_tokens": 5}'
Method 3: EAGLE3
Uses specially trained draft heads, currently showing highest accept rates in speculative decoding4. Works with prefix caching and chunked prefill in V1.
--speculative-config '{
"method": "eagle3",
"model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
"num_speculative_tokens": 3,
"draft_tensor_parallel_size": 1
}'
num_speculative_tokens of 3–5 is generally optimal. Too large reduces accept rates, actually slowing things down. Speculative decoding doesn’t work well with MoE models due to mismatched distribution characteristics between low active parameter MoE models and draft models.
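The 3–5 sweet spot follows from the usual geometric acceptance model: with per-token acceptance rate α and k draft tokens, one verification pass yields (1 − α^(k+1)) / (1 − α) tokens on average, which plateaus quickly. The α value below is illustrative:

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per verification pass with k draft tokens,
    assuming each draft token is accepted independently with rate alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With alpha = 0.7 (illustrative), going from k=3 to k=8 buys little,
# while drafting and verification cost keep growing with k.
for k in (1, 3, 5, 8):
    print(k, round(expected_tokens_per_step(0.7, k), 2))
```

The expectation is bounded by 1/(1 − α) no matter how large k gets, so extra draft tokens past the plateau are pure overhead.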
Version Evolution: v0.4 to v0.8

Key Changes by Version
v0.4.x (early 2024) — PagedAttention v1, Continuous Batching, AWQ/GPTQ quantization
v0.5.x (mid-2024) — Chunked Prefill (optional), Prefix Caching (optional), early Speculative Decoding
v0.6.x (H2 2024) — 1.8–2.7x throughput improvement over v0.5.3, FP8 KV Cache, bitsandbytes FP4 support
v0.7.x (January 2025) — V1 engine beta, DeepSeek V3/R1 support, FlashAttention 3
v0.8.x (March 2025) — V1 engine default, Expert Parallelism, Gemma 3, Blackwell support
v0.9+ (mid-2025+) — Complete V0 backend removal planned
v0.6.0’s performance improvement wasn’t just bug fixes5. Multi-step scheduling and async output processor overlapped GPU computation with CPU output processing for 12% additional throughput. Simultaneous use of Chunked Prefill and Prefix Caching became possible from this version.
Why V0 → V1 Engine Transition Matters
V0 engine’s biggest problem was architectural synchronicity. Schedulers processing either prefill or decode (but not both) in one iteration limited resource utilization.
V1 completely redesigned this structure6: the API server moved to a ZMQ-based asynchronous process, and a unified scheduler handles prefill and decode in a single flow. Full torch.compile integration enabled additional optimizations, and Chunked Prefill and Prefix Caching became default-on.
| Category | V0 | V1 (v0.8+ default) |
|---|---|---|
| Scheduler | Separate prefill/decode | Unified scheduler |
| Chunked Prefill | Optional (default off) | Default on |
| Prefix Caching | Optional (default off) | Default on |
| Preemption Default | SWAP | RECOMPUTE |
| torch.compile | Partial support | Full integration |
| Architecture | Synchronous | ZMQ-based asynchronous |
To revert to V0, set environment variable VLLM_USE_V1=0. V0 fallback automatically occurs when using features V1 doesn’t support.
Common Errors and Solutions
OOM: Step-by-Step Approach
Methods to try in order when OOM occurs:
1. Reduce --max-model-len: Directly decreases KV cache requirements. Most effective.
2. Reduce --max-num-seqs: Lowers concurrent sequences for total KV cache reduction.
3. Add --kv-cache-dtype fp8: ~50% KV cache memory savings.
4. Increase --tensor-parallel-size: Reduces per-GPU model memory, freeing KV cache space.
5. Apply --quantization fp8 or awq: Shrinks the model weights themselves.
Raising --gpu-memory-utilization is a last resort. Above 0.95 can cause OOM during PyTorch internal buffering or CUDA graph capture.
CUDA Error Solutions
CUDA illegal memory access (common with MoE models):
H20 GPU MoE FP8 illegal memory access issues were reported in pre-v0.8 versions. Upgrading to v0.8.0+ resolves this.
CUDA graph capture failed:
Add --enforce-eager flag to disable CUDA graphs. Also useful for debugging.
FP16 overflow (DeepSeek V2 etc.):
Switch to --dtype bfloat16. FP16’s range (max 65504) can overflow with large LLM weight values.
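The range limit is easy to demonstrate with Python's stdlib half-precision format (struct's "e"); BF16 avoids it because it keeps FP32's 8-bit exponent:

```python
import struct

FP16_MAX = 65504.0                    # largest finite FP16 value
struct.pack("e", FP16_MAX)            # packs fine
try:
    struct.pack("e", FP16_MAX * 2)    # a value FP16 simply cannot represent
except OverflowError as err:
    print("FP16 overflow:", err)
struct.pack("f", FP16_MAX * 2)        # FP32 (and BF16's matching exponent range) is fine
```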
Model Loading Failures
KeyError: 'qwen3_moe':
Caused by a too-old transformers version. Resolve with pip install "transformers>=4.51.0" (quoted so the shell doesn't treat >= as a redirect).
GGUF loading failure:
Must specify both --load-format gguf and --quantization gguf.
Qwen3/MoE Series Specifics
Qwen3-235B-A22B requires vllm>=0.8.5 and transformers>=4.51.0. When using Qwen3’s thinking mode, avoid greedy decoding (temperature=0). Zero temperature causes known infinite repetition generation issues. Maintain temperature >= 0.6.
Some GPUs like Tesla V100, A40, RTX series may show FusedMoE JSON config not found errors with MoE models. This happens when MoE kernel tuning configs don’t exist for those GPU architectures. Work around with --enforce-eager or adjust VLLM_FUSED_MOE_CHUNK_SIZE environment variable.
Practical Debugging Flow
# Step 1: Basic operation check with eager mode + V0 engine
VLLM_USE_V1=0 vllm serve <model> \
--enforce-eager \
--max-model-len 4096 \
--trust-remote-code
# Step 2: Switch to V1 if no issues
vllm serve <model> \
--enforce-eager \
--max-model-len 4096 \
--trust-remote-code
# Step 3: Enable CUDA graphs (remove eager)
vllm serve <model> \
--max-model-len 4096 \
--trust-remote-code
# Step 4: Gradually increase max-model-len and add parameters
Real-World Optimization: Running Qwen3-235B on H200x2
KV Cache Memory Calculation Method
Formula for calculating actual KV cache memory requirements:
KV cache memory ≈ 2 × num_layers × num_kv_heads × head_dim × max_model_len × sequences × element_size
Qwen3-235B-A22B (FP8 KV, TP=2):
- num_layers = 94
- num_kv_heads = 4 (GQA; 2 per GPU with TP=2)
- head_dim = 128
- element_size: FP8 = 1 byte
Per GPU, max_model_len=32768, 1 sequence:
= 2 × 94 × 2 × 128 × 32768 × 1 × 1 byte
≈ 1.6 GB / sequence (per GPU)
BF16 = 2x → 3.2 GB / sequence
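The same arithmetic as a quick sanity-check script (decimal GB; per-GPU figures under TP=2):

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, elem_bytes):
    """Per-sequence KV cache: 2 (K and V) x layers x kv_heads x head_dim
    x tokens x bytes per element, in decimal GB."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * elem_bytes / 1e9

# Qwen3-235B-A22B, per-GPU share under TP=2 (2 of the 4 GQA KV heads per GPU)
fp8 = kv_cache_gb(num_layers=94, num_kv_heads=2, head_dim=128,
                  seq_len=32768, elem_bytes=1)
print(round(fp8, 2))       # 1.58 -> the ~1.6 GB figure above
print(round(2 * fp8, 2))   # 3.15 -> BF16 doubles it
```

Scaling seq_len to 8192 or 131072 reproduces the other rows of the table below.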
Practical guide summary:
| max-model-len | KV cache / sequence (FP8) | Recommended max-num-seqs |
|---|---|---|
| 8,192 | ~400 MB | 32–64 |
| 32,768 | ~1.6 GB | 16–32 |
| 131,072 | ~6.4 GB | 4–8 |
FP8 vs BF16 Tradeoff
Loading a 235B model in BF16 requires 235B × 2 bytes ≈ 470 GB. Dual H200 total VRAM is 282GB, so BF16 full weights simply don't fit. An FP8 checkpoint needs 235B × 1 byte ≈ 235 GB, fitting in H200x2 with roughly 40–50GB left for KV cache; storing that cache in FP8 (--kv-cache-dtype fp8) doubles its effective capacity.
| Category | FP8 (H200) | BF16 |
|---|---|---|
| Model memory (235B) | ~235 GB | ~470 GB (exceeds H200x2) |
| Compute speed | Up to 2x faster | Baseline |
| Precision | Slight loss (official ckpt validated) | Full precision |
| Additional KV cache savings | --kv-cache-dtype fp8 possible | Baseline |
| Recommended GPU | H100, H200, Ada+ | All GPUs |
Throughput vs Latency Tuning
Settings should vary by purpose:
| Purpose | max-num-seqs | max-num-batched-tokens | gpu-memory-utilization |
|---|---|---|---|
| Maximum throughput | 256+ | 32768+ | 0.95 |
| General balance | 64–128 | 8192–16384 | 0.90 |
| Minimum latency | 8–16 | 2048–4096 | 0.80–0.85 |
| Stability priority | 32 | 8192 | 0.85 |
Prefix Caching Utilization
Default activation in V1 requires no additional setup. To maximize effectiveness, always place system prompts at the beginning. For patterns like long document Q&A processing multiple questions with identical context, it significantly reduces TTFT.
# Prefix caching utilization in the Python API
from vllm import LLM
llm = LLM(model="...", enable_prefix_caching=True)
# Requests with identical system prompts automatically reuse the cached KV prefix
Environment Variable Cheat Sheet
# V1/V0 engine selection
VLLM_USE_V1=1 # Force V1 (default in v0.8+)
VLLM_USE_V1=0 # Force V0 (debugging/compatibility)
# Manual attention backend specification
VLLM_ATTENTION_BACKEND=FLASH_ATTN # Flash Attention
VLLM_ATTENTION_BACKEND=FLASHINFER # FlashInfer
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN # For Qwen3 long contexts
# MoE related
VLLM_FUSED_MOE_CHUNK_SIZE=32768 # Adjust MoE chunk size
# Other
CUDA_VISIBLE_DEVICES=0,1 # Specify GPUs (0, 1 only)
VLLM_MEDIA_LOADING_THREAD_COUNT=8 # Multimodal media loading threads
Practical Command Collection
Basic Server Start
vllm serve Qwen/Qwen3-8B \
--dtype auto \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code
Large MoE Model: Qwen3-235B on H200x2
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--dtype bfloat16 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--max-num-seqs 32 \
--max-num-batched-tokens 16384 \
--enable-chunked-prefill \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
High Performance: Prefix Cache + Speculative Decoding
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_min": 3}' \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.92
EAGLE3 Speculative Decoding
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--speculative-config '{
"method": "eagle3",
"model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
"num_speculative_tokens": 3,
"draft_tensor_parallel_size": 1
}' \
--kv-cache-dtype fp8
Qwen3 YaRN Long Context Extension (131K)
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--max-model-len 131072 \
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8 \
--max-num-seqs 4 \
--trust-remote-code
Minimal Debugging Setup
VLLM_USE_V1=0 vllm serve <model> \
--enforce-eager \
--trust-remote-code \
--max-model-len 4096 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.80
vLLM's parameter system seems complex at first, but once you understand the two core ideas, PagedAttention and Continuous Batching, each parameter's tradeoffs fall into place naturally. Balancing memory, latency, and throughput is the essence of vLLM configuration. That running a 235B model on dual H200s has become a realistic production option shows how far this project has come in just two years.
Footnotes
1. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). ACM. doi:10.1145/3600006.3613165
2. Luo, C., & Stoica, I. (2023). "How Continuous Batching Enables 23x Throughput in LLM Inference while Reducing p50 Latency." Anyscale Blog.
3. Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978.
4. Li, Y., Wei, F., Zhang, C., & Zhang, H. (2025). "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test." arXiv:2503.01840.
5. vLLM Team. (2024). "vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction." vLLM Blog, September 5, 2024.
6. vLLM Team. (2025). "vLLM V1: A Major Upgrade to vLLM's Core Architecture." vLLM Blog, January 27, 2025.