Complete LLM Serving Engine Guide — In-Depth Analysis of 18 Tools
Last Updated: February 2026
Target Audience: ML Engineers, MLOps, Infrastructure Architects
Scope: 18 production LLM serving tools + core technology comparative analysis
1. Introduction
The most critical bottleneck in practical deployment of Large Language Models (LLMs) is inference serving. Serving models with tens to hundreds of billions of parameters in real-time requires solving various technical challenges including memory management, batching strategies, attention optimization, and quantization.
The LLM serving tool ecosystem has exploded since 2023. After vLLM’s PagedAttention changed the paradigm of KV cache memory management, various approaches have emerged, including SGLang’s RadixAttention, TensorRT-LLM’s FP8 optimization, and llama.cpp’s expansion of LLM accessibility to consumer hardware.
This article provides paper-level depth analysis of 18 major LLM serving/inference tools and systematically compares their core technologies.
Core Evaluation Metrics
| Metric | Description |
|---|---|
| Throughput | Tokens generated per second (tokens/s) |
| TTFT | Time to First Token: latency until the first output token is produced |
| Latency (P50/P99) | Per-request response latency at the 50th/99th percentile |
| Memory Efficiency | GPU/CPU memory usage efficiency |
| Scalability | Performance maintenance capability as concurrent users increase |
2. Core Technology Concepts
2.1 KV Cache
A mechanism to avoid duplicate computation by reusing Key-Value tensors from previous tokens during Transformer decoding. In LLM serving, KV cache occupies 30–60% of total GPU memory, making efficient management crucial for serving performance.
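For intuition, the KV cache footprint can be estimated directly from the model shape. Below is a rough sizing helper (a simplified sketch; real engines must also account for grouped-query attention, quantized KV, and block granularity):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128, FP16 KV
size = kv_cache_bytes(32, 32, 128, seq_len=2048, batch=8)
print(size / 2**30)  # 8.0 GiB for just 8 concurrent 2K-token sequences
```

At FP16, 8 concurrent 2K-token sequences already consume 8 GiB of KV cache on this hypothetical config, which is why the 30–60% figure above is plausible on a 40–80 GB GPU.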
2.2 Continuous Batching
While static batching must wait until every sequence in the batch completes, continuous batching admits new requests as soon as any sequence finishes, maximizing GPU utilization. First proposed in the Orca system (Yu et al., 2022).
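A toy iteration-level scheduler illustrates the idea (a simplified sketch, not Orca's or vLLM's actual scheduler; real systems also handle prefill, preemption, and memory limits):

```python
from collections import deque

def continuous_batching(requests, max_batch=8, max_steps=100):
    """Toy continuous batching: finished sequences leave the batch and
    waiting requests are admitted at every iteration, not per-batch."""
    waiting = deque(requests)            # (request_id, tokens_left)
    running, finished = [], []
    for _ in range(max_steps):
        # admit new requests into any free batch slots each iteration
        while waiting and len(running) < max_batch:
            running.append(list(waiting.popleft()))
        if not running:
            break
        for req in running:              # one decode step for every sequence
            req[1] -= 1
        finished += [r[0] for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]
    return finished

print(continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch=2))
# → ['b', 'a', 'c']: "c" enters as soon as "b" finishes, without waiting for "a"
```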
2.3 Quantization
A technique to reduce memory and increase inference speed by converting FP16/BF16 weights to lower precision like INT4/INT8. There’s a tradeoff between quality loss and speed improvement.
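The basic mechanism can be sketched with symmetric per-tensor INT8 quantization (a minimal illustration of the round-and-scale idea; production schemes like GPTQ and AWQ are considerably more sophisticated):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: one FP scale maps [-max|w|, max|w|] to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(4096).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.nbytes, err)   # 4096 bytes (vs 16384 for FP32); error bounded by half a scale step
```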
3. In-Depth Tool Analysis
3.1 vLLM
- GitHub: vllm-project/vllm
- Development: UC Berkeley (Kwon et al.)
- License: Apache 2.0
- Current Status: Active development (v0.7.x+ as of 2025, official distribution via NVIDIA NGC)
Core Technology: PagedAttention
vLLM’s core innovation is PagedAttention (Kwon et al., 2023). Inspired by virtual memory paging techniques in operating systems, it partitions KV cache into fixed-size blocks and manages them through an indirection layer.
Problems with existing approach: Traditional KV cache allocates a contiguous memory region per sequence, pre-reserving space for the maximum sequence length. On average, 60–80% of that memory is wasted (internal + external fragmentation).
PagedAttention’s solution:
- Partition KV cache into fixed-size blocks (e.g., 16 tokens)
- Each sequence references non-contiguous blocks through a block table (page table)
- Dynamically allocate new blocks as sequences grow
- Enable KV cache sharing for beam search etc. via Copy-on-Write
Result: Reduces memory waste to under 5%, enabling 2–4x more concurrent requests on the same GPU.
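The block-table indirection can be sketched as follows (a toy model of the bookkeeping only; the real vLLM allocator also handles eviction, swapping, and copy-on-write):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: sequences map to
    non-contiguous fixed-size blocks via a per-sequence block table."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical blocks
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def lookup(self, seq_id, pos):
        """Translate a logical token position to (physical block, offset)."""
        table = self.tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size

cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))   # 40 tokens land in 3 blocks of 16
```

Because blocks are allocated on demand, a sequence only ever wastes at most one partially filled block, which is where the "under 5%" figure comes from.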
Architecture
Client Request → FastAPI Server → AsyncLLMEngine
→ Scheduler (continuous batching)
→ Model Runner (GPU execution)
→ PagedAttention KV Cache Manager
→ Sampler → Token Output (streaming)
- Scheduler: Schedules requests iteration-wise with continuous batching
- Model Runner: Executes model with CUDA kernels (FlashAttention/FlashInfer backend selection)
- KV Cache Manager: Block-level allocation/deallocation, Copy-on-Write support
Performance Benchmarks
| Comparison | Throughput vs vLLM |
|---|---|
| HuggingFace Transformers | 14–24x lower (Kwon et al., 2023) |
| Early TGI | 2.2–3.5x lower |
| FasterTransformer | 1.5–2x lower |
BentoML benchmark (2024, A100 80GB, Llama 3 8B):
- TTFT: Best-in-class across all concurrent user levels
- Token generation rate: ~2,300–2,500 tokens/s at 100 users (lower than LMDeploy’s 4,000 tokens/s)
- Slightly behind in decode throughput compared to engines with higher GPU utilization (LMDeploy etc.)
Supported Models/Quantization
- Models: 30+ architectures (LLaMA, Mistral, Qwen, Gemma, Phi, Command-R, DeepSeek, etc.)
- Quantization: AWQ, GPTQ, FP8, INT8 (W8A8), Marlin kernels
- Hardware: NVIDIA CUDA, AMD ROCm, AWS Neuron, CPU
Pros and Cons
| Pros | Cons |
|---|---|
| Best-in-class memory efficiency | Lack of decode speed optimization for quantized models |
| Extensive model support | Lags behind TensorRT-LLM in single-request latency |
| Active community, rapid updates | Higher setup complexity vs Ollama |
| OpenAI-compatible API | |
| Speculative decoding support | |
Suitable Use Cases
- General production LLM serving (high throughput + low TTFT)
- Serving various models on single infrastructure
- Multi-GPU distributed inference
3.2 SGLang
- GitHub: sgl-project/sglang
- Development: LMSYS (UC Berkeley, Zheng et al.)
- Paper: Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs” (2023, accepted to ICLR 2025)
- Current Status: Very active development (Diffusion model support, EAGLE 3 speculative decoding as of late 2025)
Core Technology: RadixAttention
SGLang’s innovation is RadixAttention—managing KV cache in a radix tree structure to automatically share prefixes among multiple requests.
Difference from PagedAttention:
- PagedAttention: Focus on block-level memory management (eliminating fragmentation)
- RadixAttention: Focus on prefix reuse (requests sharing the same prefix don’t duplicate KV cache computation)
Radix tree structure:
Root
├── "You are a helpful assistant. " → KV cached
│ ├── "Translate: Hello" → Branch A
│ ├── "Translate: World" → Branch B
│ └── "Summarize: ..." → Branch C
Requests sharing the same system prompt or few-shot examples compute the prefix KV cache only once and reuse it thereafter.
Cache hit rates:
- Few-shot learning (shared examples): 85–95% (vLLM PagedAttention: 15–25%)
- Multi-turn chat: 60–85% (vLLM: 30–50%)
- LMSYS production: 52.4% for LLaVA-Next-34B, 74.1% for some models
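The prefix-sharing idea can be illustrated with a character-level trie (a toy stand-in for SGLang's token-level radix tree, which additionally compresses paths and evicts with LRU):

```python
class PrefixCache:
    """Minimal radix-style prefix cache over token sequences: a new request
    reuses the KV of its longest cached prefix and only computes the rest."""
    def __init__(self):
        self.root = {}                 # token -> child node

    def insert(self, tokens):
        node, reused = self.root, 0
        for t in tokens:
            if t in node:
                reused += 1            # KV for this token is already cached
            node = node.setdefault(t, {})
        return reused                  # tokens whose prefill can be skipped

cache = PrefixCache()
system = list("You are a helpful assistant. ")
cache.insert(system + list("Translate: Hello"))
hit = cache.insert(system + list("Translate: World"))
print(hit)  # length of the shared prefix (system prompt + "Translate: ")
```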
Frontend DSL
SGLang provides not just a runtime but also a frontend DSL:

    @sgl.function
    def multi_turn_qa(s, question1, question2):
        s += sgl.system("You are a helpful assistant.")
        s += sgl.user(question1)
        s += sgl.assistant(sgl.gen("answer1", max_tokens=256))
        s += sgl.user(question2)
        s += sgl.assistant(sgl.gen("answer2", max_tokens=256))
This DSL automatically optimizes prefix sharing and supports parallel generation with fork/join.
Structured Generation
Uses a compressed finite state machine to efficiently decode structured outputs such as JSON or regex-constrained text. Deterministic spans are emitted in multi-token jumps rather than masked token by token, dramatically reducing decoding overhead.
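The jump-ahead idea can be sketched with a template whose fixed spans are emitted wholesale (a toy illustration only; SGLang's compressed FSM operates on tokenized regex/JSON grammars, and `gen_fn` here is a hypothetical stand-in for the model):

```python
import re

def jump_decode(template, gen_fn):
    """Toy 'compressed FSM' decoding: fixed scaffolding is emitted in one
    jump; the model (gen_fn) is only invoked for the free-form fields."""
    out = []
    for part in re.split(r"(\{\w+\})", template):
        if re.fullmatch(r"\{\w+\}", part):
            out.append(gen_fn(part[1:-1]))   # model fills this hole
        elif part:
            out.append(part)                 # deterministic span: single jump
    return "".join(out)

# stand-in for a constrained LLM call: just looks up canned field values
fake_model = {"name": '"Ada"', "age": "36"}.__getitem__
print(jump_decode('{"name": {name}, "age": {age}}', fake_model))
# → {"name": "Ada", "age": 36}
```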
Performance Benchmarks
LMSYS benchmark (July 2024, Llama 3):
- Llama 3 8B (A100): Both SGLang and TensorRT-LLM achieve up to 5,000 tokens/s on short inputs, vLLM lags behind
- Llama 3 70B: SGLang achieves up to 3x throughput vs vLLM in online serving
- Structured workload: Up to 6.4x throughput, 3.7x lower latency vs baseline
Clarifai benchmark (August 2025, GPT-OSS-120B, H100):
- Strong performance at medium-high concurrency (50 requests)
- TensorRT-LLM shows highest throughput for single requests, lacks scaling at extreme concurrency
Pros and Cons
| Pros | Cons |
|---|---|
| Dramatic performance improvement via prefix reuse | Smaller ecosystem vs vLLM |
| Structured output generation optimization | Relatively less online examples/documentation |
| Complex LLM program authoring via DSL | Some model support lags behind |
| EAGLE 3 speculative decoding | |
| Extension to Diffusion models | |
Suitable Use Cases
- Agent/Tool-use workflows (high prefix reuse)
- When structured outputs (JSON) are needed
- Multi-turn chat serving
- Few-shot evaluation pipelines
3.3 TensorRT-LLM
- GitHub: NVIDIA/TensorRT-LLM
- Development: NVIDIA
- License: Apache 2.0
- Current Status: v0.17+ (as of 2025), NVIDIA’s official inference stack
Core Technology
TensorRT-LLM is an LLM-specific inference engine built on NVIDIA’s TensorRT compiler, generating model-specific optimized CUDA kernels at compile-time.
Key optimizations:
- In-flight Batching: NVIDIA’s implementation of continuous batching. Insert new requests immediately as individual requests complete
- FP8/INT4 quantization: Utilizes FP8 Tensor Cores in Hopper architecture (H100). 2x+ throughput vs FP16, quality loss under 2%
- Paged KV Cache: Block-based KV management similar to vLLM
- Quantized KV Cache: Quantize KV cache itself to INT8, FP8 for memory savings
- KV Cache Reuse: Offload KV cache to CPU memory and reuse it later. Up to 14x TTFT reduction (measured on H100)
- Kernel Fusion: Fuse MHA, MLP etc. into single kernels
Architecture
Model Definition (Python) → TensorRT Engine Build (compilation)
→ Executor API → Triton Inference Server (serving)
→ In-flight Batching Scheduler
→ Fused CUDA Kernels
Important: TensorRT-LLM requires explicit compilation stage. Must build engines for each model+hardware+batch size combination, taking tens of minutes to hours.
Performance Benchmarks
- Single-request latency: Lowest on NVIDIA GPUs (strength of compiled kernels)
- Llama 3.1 8B FP8 (H100): ~2x throughput improvement vs FP16
- LMSYS benchmark: Achieves 5,000 tokens/s on short inputs alongside SGLang
- High concurrency may increase P99 latency due to aggressive batching
Pros and Cons
| Pros | Cons |
|---|---|
| Best single-request performance on NVIDIA GPUs | NVIDIA-only (vendor lock-in) |
| FP8 optimization (Hopper) | Complex setup (engine build, Triton configuration) |
| Rich KV cache options | Recompilation needed for model changes |
| Official NVIDIA support | Steepest learning curve |
Suitable Use Cases
- NVIDIA-only environments requiring maximum performance
- Latency-critical workloads (real-time chatbots)
- Fixed models where engine build investment is feasible
3.4 TGI (Text Generation Inference)
- GitHub: huggingface/text-generation-inference
- Development: Hugging Face
- License: HFOIL (v1), Apache 2.0 (v2+)
- Current Status: Maintenance mode as of December 2025 (accepting only minor bug fixes)
Core Technology
TGI was Hugging Face ecosystem’s official inference server, providing all-in-one production serving features:
- Rust-based HTTP/gRPC server: High-performance web server
- Flash Attention (Dao et al., 2022): Attention algorithm optimizing HBM ↔ SRAM IO
- Continuous Batching: Dynamic request insertion/removal
- Paged Attention: vLLM-style KV cache management
- TGI v3’s Chunked Prefill: Split long contexts into chunks for prefill, reducing memory peaks
- Prefix KV Caching: Reuse KV of long conversation history
Performance Benchmarks
- General prompts: Similar level to vLLM, vLLM slightly ahead at high concurrency
- TGI v3 + long context: 3x more token processing, up to 13x faster vs vLLM (long history + prefix caching)
- BentoML benchmark (Llama 3 8B, A100): 2,300–2,500 tokens/s (similar to vLLM)
Supported Quantization
- AWQ, GPTQ, bitsandbytes (INT4, INT8)
- FP8 (experimental)
Pros and Cons
| Pros | Cons |
|---|---|
| Perfect HuggingFace Hub integration | Entered maintenance mode (since December 2025) |
| Easy setup, excellent documentation | Lags behind latest optimizations vs vLLM/SGLang |
| Built-in safety features (watermark, safety) | Slow model support updates |
| Various hardware (CUDA, ROCm, Gaudi, Inferentia) | |
Suitable Use Cases
- HuggingFace Inference Endpoints users
- Chat workloads with long conversation history (utilizing v3’s prefix caching)
- Rapid prototyping and deployment
Note: With TGI entering maintenance mode, HuggingFace recommends vLLM/SGLang as alternatives.
3.5 llama.cpp
- GitHub: ggml-org/llama.cpp
- Development: Georgi Gerganov and community
- License: MIT
- Current Status: Daily active development (build 4000+ as of 2025)
Core Technology: GGUF Quantization
llama.cpp is a pure inference engine written in C/C++ that can run LLMs on both CPU and GPU without Python/PyTorch dependencies.
GGUF (GGML Unified Format): llama.cpp’s model file format supporting various quantization methods.
Quantization Methods Detail
| Quantization | Bits | Size (7B model) | Quality | Speed | Description |
|---|---|---|---|---|---|
| Q8_0 | 8bit | ~7.0 GB | Best | Slow | Near FP16 |
| Q6_K | 6bit | ~5.5 GB | Very good | Medium | Super-blocks with 6-bit |
| Q5_K_M | 5bit | ~4.8 GB | Good | Medium | Mixed 5-bit, recommended |
| Q4_K_M | 4bit | ~4.1 GB | Fair | Fast | Most popular balance point |
| Q4_K_S | 4bit | ~3.9 GB | Fair | Fast | Slightly smaller than Q4_K_M |
| Q3_K_M | 3bit | ~3.3 GB | Degraded | Fast | Memory constrained |
| Q2_K | 2bit | ~2.7 GB | Significantly degraded | Very fast | Extreme compression |
| IQ4_XS | ~4bit | ~3.7 GB | Q4_K_M level | Slow* | Importance Matrix based |
*IQ quantization can be very slow with partial GPU offloading.
K-Quant System: Quantizations with “K” in the name (Q4_K_M, etc.) use a super-block structure. Each super-block (usually 256 weights) carries independent scale factors; the M (medium) and S (small) suffixes differ in the precision of those scale factors.
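The super-block idea can be approximated with simple per-block scales (a simplified sketch; actual K-quants use nested super-block/sub-block scales plus offsets, and different bit packing):

```python
import numpy as np

def blockwise_q4(w, block=32):
    """Per-block 4-bit quantization sketch: each block of weights gets its
    own FP16 scale, in the spirit of llama.cpp's block quants."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # 4-bit signed: [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

np.random.seed(0)
w = np.random.randn(256).astype(np.float32)
q, s = blockwise_q4(w)
# effective size: 4 bits per weight plus one FP16 scale per 32-weight block
bits_per_weight = (q.size * 4 + s.size * 16) / w.size
print(bits_per_weight)   # 4.5 bits/weight
```

The per-block scales are why real K-quant files come out slightly above their nominal bit width (e.g. Q4_K_M at ~4.8 bits/weight rather than 4.0).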
Architecture
GGUF Model File → ggml tensor library
→ CPU: AVX2/AVX-512/ARM NEON vector operations
→ GPU: CUDA/Metal/Vulkan/OpenCL offloading
→ Multi-threaded inference
→ HTTP Server (llama-server) or CLI
- Partial GPU Offloading: Can split GPU/CPU by layer
- Metal Support: Excellent performance on Apple Silicon
- Vulkan: Universal GPU acceleration (AMD, Intel)
Performance Benchmarks
llama-bench results (Apple Silicon M-series, Qwen2 1.5B Q4_0):
- Prompt processing (pp512): 5,765 tokens/s
- Token generation (tg128): 198 tokens/s
With full GPU offloading vs ExLlamaV2:
- llama.cpp: ~7,500 tokens/s (prompt), ExLlamaV2: ~14,000 tokens/s (~2x difference)
Pros and Cons
| Pros | Cons |
|---|---|
| Hardware universality (CPU/all GPUs) | Lower throughput vs GPU-only tools |
| Single binary, minimal dependencies | Weak continuous batching |
| Extensive quantization options | Lack of production serving features |
| Apple Silicon optimization | Unsuitable for large-scale concurrent serving |
| Very active community | |
Suitable Use Cases
- Running LLMs on local PC/laptop
- CPU server deployment without GPU
- Inference on Apple Silicon Mac
- Edge device deployment
3.6 Ollama
- GitHub: ollama/ollama
- Development: Ollama Inc.
- License: MIT
- Current Status: Active development (expanded to cloud model support as of 2025)
Core Technology
Ollama is a user-friendly LLM execution environment that wraps llama.cpp. Provides Docker-like interface to pull/run models.
ollama pull llama3.1
ollama run llama3.1
Architecture
Ollama CLI/API → Go server (REST API)
→ llama.cpp (inference backend)
→ Model registry (ollama.com)
→ Modelfile (Dockerfile-like model configuration)
Key features:
- Model management: `ollama pull`, `ollama list`, `ollama rm`
- Modelfile: Declaratively set system prompts, temperature, etc.
- OpenAI-compatible API: `/v1/chat/completions` endpoint
- Multimodal: Vision model support
- 2025 updates: Cloud model integration (Turbo), local-only mode settings
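Because the API is OpenAI-compatible, any plain HTTP client works. A minimal sketch (assumes Ollama is running locally on its default port 11434 with the `llama3.1` model pulled; the request itself is left commented out):

```python
import json
import urllib.request

# Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions.
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:          # uncomment with Ollama running
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```

Existing code written against the OpenAI SDK can usually be pointed at this endpoint by changing only the base URL.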
Performance
Ollama’s performance is essentially identical to llama.cpp. Go server wrapper overhead is negligible, with main bottleneck being llama.cpp backend inference speed.
vLLM comparison (same model, same GPU):
- Single request: Nearly identical latency
- Concurrent requests: vLLM achieves 2–5x higher throughput with continuous batching
Pros and Cons
| Pros | Cons |
|---|---|
| Extremely easy installation/usage | Unsuitable for large-scale serving (weak batching) |
| Model registry ecosystem | Cannot exceed llama.cpp performance |
| Custom model creation via Modelfile | Limited GPU memory optimization |
| Cross-platform | Lower throughput vs vLLM/SGLang |
Suitable Use Cases
- Developer local environment prototyping
- AI accessibility for non-developers
- Internal AI tools for small teams
- LLM testing in CI/CD pipelines
3.7 MLC LLM
- GitHub: mlc-ai/mlc-llm
- Development: CMU/OctoAI (TVM team, Chen et al.)
- Paper: Based on Apache TVM (Chen et al., 2018)
- License: Apache 2.0
Core Technology: TVM Compiler
MLC LLM uses the Apache TVM compiler framework to compile LLMs to native code for various hardware backends.
Compilation pipeline:
HuggingFace model → Relax IR (TVM)
→ Hardware-specific optimization (fusion, tiling, vectorization)
→ Backend-specific code generation:
- CUDA (NVIDIA GPU)
- Metal (Apple GPU)
- Vulkan (Universal GPU)
- OpenCL (Mobile GPU)
- WebGPU (Browser)
- C/LLVM (CPU)
Mobile/Edge Deployment
MLC LLM’s unique strength is LLM inference on mobile devices:
- iOS: Metal backend, Swift bindings
- Android: OpenCL/Vulkan backend, Java/Kotlin bindings
- WebGPU: Direct execution in browsers (web-llm)
Mobile benchmarks (arxiv:2410.03613, 2024):
- Qualcomm Snapdragon 8 Gen 3 with 7B 4-bit model: ~10-15 tokens/s
- Apple A17 Pro with similar setup: ~20+ tokens/s
BentoML Benchmark (Llama 3 8B, A100)
- 10 users: Similar decode performance to LMDeploy, best-in-class TTFT
- 50 users: Still good TTFT
- 100 users: Sharp performance degradation under high load — both decode speed and TTFT lag behind LMDeploy
Pros and Cons
| Pros | Cons |
|---|---|
| Mobile/edge/browser deployment | Compilation stage required (cold start increase) |
| Most comprehensive hardware support | No stable releases (nightly only) |
| WebGPU support (web-llm) | Performance degradation at high concurrency |
| TVM optimization auto-tuning | Learning curve |
Suitable Use Cases
- Embedding LLMs in mobile apps
- Browser-based AI (WebGPU)
- Edge device deployment (Jetson, RPi, etc.)
- Environments with high hardware diversity
3.8 LMDeploy
- GitHub: InternLM/lmdeploy
- Development: Shanghai AI Lab (InternLM team)
- License: Apache 2.0
Core Technology: TurboMind
LMDeploy’s core inference engine, TurboMind, started from NVIDIA FasterTransformer’s GPT-NeoX implementation and was optimized for conversational model inference.
Key optimizations:
- Persistent Batching: Variant of continuous batching that maintains batches while dynamically replacing individual sequences
- Blocked KV Cache: Block-based KV management similar to vLLM PagedAttention, but with different internal layout
- Dynamic Split & Fuse: Dynamically split/fuse attention blocks for optimal GPU utilization
- KV Quantization: Quantize KV cache itself to INT8/INT4
- Weight Quantization: AWQ 4-bit, INT8 support
Performance Benchmarks
BentoML benchmark (Llama 3, A100 80GB):
| Metric | LMDeploy | vLLM | TensorRT-LLM | MLC-LLM | TGI |
|---|---|---|---|---|---|
| Decode (8B, 100 users) | ~4,000 t/s | ~2,400 t/s | ~2,400 t/s | ~2,000 t/s | ~2,300 t/s |
| TTFT (8B, 10 users) | Best | Best | Good | Best | Medium |
| Decode (70B Q4, 100 users) | ~700 t/s | ~450 t/s | ~650 t/s | N/A | ~400 t/s |
InternLM benchmark: After GQA optimization, internlm2-20b achieves 16+ RPS, 1.8x faster than vLLM.
LMDeploy achieves near 100% GPU utilization particularly with quantized models.
Pros and Cons
| Pros | Cons |
|---|---|
| Best-in-class decode throughput | NVIDIA CUDA only |
| Particularly strong in 4-bit inference | Limited model support (~20 models) |
| Easy to use (on-the-fly conversion) | Uneven English/Chinese documentation quality |
| KV quantization support | Smaller community vs vLLM |
Suitable Use Cases
- NVIDIA GPU environments requiring maximum throughput
- Quantized model serving (AWQ 4-bit)
- When using InternLM family models
- Large-scale concurrent serving (stable even at high concurrency)
3.9 Triton Inference Server
- GitHub: triton-inference-server/server
- Development: NVIDIA
- License: BSD 3-Clause
Core Technology
Triton is a universal model serving platform, not LLM-specific but serving various ML models. For LLM serving, primarily used with TensorRT-LLM backend.
Core features:
- Dynamic Batching: Automatically batch multiple requests. Configurable wait time/batch size limits
- Model Ensembles: Configure preprocessing → LLM → postprocessing as pipelines
- Multi-backends: TensorRT, ONNX Runtime, PyTorch, TensorFlow, vLLM, etc.
- Concurrent model serving: Serve multiple models simultaneously on single server
- Model versioning: Model version management for A/B testing
Architecture
Client (HTTP/gRPC) → Triton Server
→ Request Scheduler (dynamic batching)
→ Model Repository
├── Model A (TensorRT-LLM)
├── Model B (ONNX Runtime)
└── Ensemble Pipeline
→ Response Aggregator
Role in LLM Serving
Triton itself doesn’t perform LLM inference optimizations (PagedAttention etc.). Instead:
- TensorRT-LLM Backend: Serve TensorRT-LLM engines via `tensorrtllm_backend`
- vLLM Backend: Use vLLM as a Triton backend
- Actual inference optimization handled by backend engines
Pros and Cons
| Pros | Cons |
|---|---|
| Multi-model serving (LLM + vision + audio) | Complex setup as not LLM-specific |
| Production-proven stability | No LLM optimization when used alone |
| Built-in monitoring, metrics | High learning curve when used with TensorRT-LLM |
| Ensemble pipelines | |
Suitable Use Cases
- Multimodal AI pipelines (LLM + image + audio)
- Large-scale enterprise ML infrastructure
- A/B testing + model versioning needs
- Production serving wrapper for TensorRT-LLM models
3.10 ExLlamaV2
- GitHub: turboderp-org/exllamav2
- Development: turboderp
- License: MIT
Core Technology: EXL2 Quantization
ExLlamaV2’s core innovation is EXL2 (ExLlama v2 Quantization)—quantization mixing different bit counts per layer and tensor.
How it works:
- Measure importance (sensitivity) of each layer/tensor
- Quantize important layers to high bits (6-8bit), less important layers to low bits (2-3bit)
- Match overall model’s average bits to target (e.g., 4.25 bits per weight)
- Achieve higher quality than uniform quantization at same model size
Supported bits: 2, 3, 4, 5, 6, 8 bit and their mixtures
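The budget arithmetic is straightforward: per-layer bit widths, weighted by parameter count, must average out to the target bits per weight. A sketch with hypothetical layer sizes:

```python
def average_bits(layers):
    """EXL2-style budget check: mixed per-layer bit widths must average
    out to the target bits-per-weight across the whole model.

    layers: list of (num_weights, bits) pairs."""
    total_bits = sum(n * b for n, b in layers)
    total_weights = sum(n for n, _ in layers)
    return total_bits / total_weights

# hypothetical split: sensitive layers at 6-bit, bulk at 4-bit, a few at 3-bit
layers = [(10_000, 6), (70_000, 4), (20_000, 3)]
print(average_bits(layers))   # → 4.0
```

The quantizer searches over such assignments, giving high-sensitivity tensors more bits while holding the average at the user's target (e.g. 4.25 bpw).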
Performance Benchmarks
Reddit benchmark (2024, RTX 4090):
- Llama 3 8B: ExLlamaV2 achieves ~14,000 tokens/s in prompt processing (vs llama.cpp’s ~7,500, about 2x)
- At same 4-bit, EXL2 slightly higher quality than GPTQ, similar or slightly lower than GGUF
- ExLlama-based GPTQ execution shows fastest evaluation speed (oobabooga benchmark)
Quality comparison (4-bit, Llama 2 13B, measured by perplexity):
- AWQ: Best quality
- GPTQ ≈ EXL2: Similar
- GGUF (Q4_K_M): Slightly behind
Pros and Cons
| Pros | Cons |
|---|---|
| Best-in-class GPU inference speed | NVIDIA CUDA only |
| Flexible mixed-precision quantization | Limited serving features (inference library) |
| Flash Attention, context caching support | No CPU inference |
| Popular in local communities | No continuous batching |
Suitable Use Cases
- Maximum inference speed on single GPU (with TabbyAPI etc.)
- Precise quantization adjustment to fit memory
- Local AI chat (oobabooga, SillyTavern)
3.11 Ray Serve + vLLM
- Framework: Ray Serve + vLLM
- Development: Anyscale (Ray), UC Berkeley (vLLM)
Core Technology
Ray Serve is a distributed model serving framework that adds autoscaling, monitoring, fault recovery while using vLLM as backend.
Architecture:
Load Balancer → Ray Serve Router
→ Replica 1 (vLLM on GPU 0-1)
→ Replica 2 (vLLM on GPU 2-3)
→ Replica N (autoscaled)
→ Ray Dashboard (monitoring)
Key features:
- Autoscaling: Automatically increase/decrease vLLM instances based on traffic
- Multi-model serving: Serve multiple models simultaneously on one cluster
- Fault recovery: Automatic restart on replica failure
- Disaggregated Serving: Run Prefill and Decode on separate nodes (vLLM’s latest feature)
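The autoscaling decision can be sketched as a simple in-flight-load rule (a toy approximation; Ray Serve's actual policy targets a configured number of ongoing requests per replica with upscale/downscale smoothing, and the parameter names here are illustrative):

```python
def target_replicas(queued, running, per_replica=8, min_r=1, max_r=8):
    """Toy autoscaling rule: size the replica pool to the in-flight load,
    clamped to [min_r, max_r]."""
    demand = -(-(queued + running) // per_replica)   # ceil division
    return max(min_r, min(max_r, demand))

print(target_replicas(queued=30, running=10))   # → 5 replicas for 40 in-flight
print(target_replicas(queued=0, running=0))     # → 1 (never scale below min)
```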
vLLM Large-Scale Serving (December 2025):
- DeepSeek models achieve 2,200 tokens/s per H200 (Wide Expert Parallelism)
- Efficient KV transfer via NIXL/LMCache connectors
- Independent scaling of each phase (prefill/decode) with Ray’s distributed computing
Pros and Cons
| Pros | Cons |
|---|---|
| Production-level autoscaling | Complex setup (Ray + vLLM) |
| Built-in monitoring, fault recovery | Ray cluster management required |
| Multi-model, multi-node serving | Overhead exists |
| Disaggregated serving support | |
Suitable Use Cases
- Large-scale production LLM services
- Environments with high traffic fluctuation
- Multi-model / multi-tenant serving
- Cloud-native AI infrastructure
3.12 PowerInfer
- GitHub: SJTU-IPADS/PowerInfer
- Development: Shanghai Jiao Tong University
- Paper: Song et al., “PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU” (2023)
Core Technology: Neuron-Aware Sparse Inference
PowerInfer leverages activation sparsity in LLMs. In FFN layers, only a portion of neurons actually activate, and which neurons activate frequently (“hot neurons”) can be profiled beforehand.
How it works:
- Offline profiling to analyze activation frequency per neuron
- Hot neurons (frequently activated): Reside on GPU
- Cold neurons (rarely activated): Stored in CPU memory
- Runtime adaptive predictor predicts which neurons will activate
- Neuron-aware sparse operator computes only activated neurons
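The hot/cold placement step can be sketched as follows (a toy version of the offline profiling phase; `gpu_budget` and the zipf-distributed activation counts are illustrative, not from the paper):

```python
import numpy as np

def split_hot_cold(activation_counts, gpu_budget):
    """PowerInfer-style placement sketch: the most frequently activated
    neurons go to GPU, the rest stay in CPU memory."""
    order = np.argsort(activation_counts)[::-1]      # hottest first
    hot = set(order[:gpu_budget].tolist())
    cold = set(order[gpu_budget:].tolist())
    return hot, cold

np.random.seed(0)
counts = np.random.zipf(2.0, size=1000)              # skewed, power-law-like
hot, cold = split_hot_cold(counts, gpu_budget=200)
coverage = counts[list(hot)].sum() / counts.sum()
print(f"{len(hot)} hot neurons cover {coverage:.0%} of activations")
```

The more skewed the activation distribution, the larger the fraction of work a small GPU-resident hot set can absorb, which is the premise behind PowerInfer's speedups.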
Performance Benchmarks
RTX 4090 single GPU:
- Various LLMs including OPT-175B achieve average 13.20 tokens/s, max 29.08 tokens/s
- Only 18% lower performance than an A100 server, despite running on a consumer GPU
- Up to 11x faster inference than llama.cpp (on GPU memory constrained models)
Pros and Cons
| Pros | Cons |
|---|---|
| Run large models on consumer GPUs | Only effective on models with FFN sparsity |
| GPU-CPU hybrid overcomes VRAM constraints | Profiling stage required |
| Dramatic performance improvement vs llama.cpp | Limited GQA/MoE model support |
| | No production serving features |
Suitable Use Cases
- Running large models on VRAM-limited consumer GPUs
- Models with strong sparse activation patterns like OPT, Falcon
- Research/experimental purposes
3.13 Aphrodite Engine
- GitHub: aphrodite-engine/aphrodite-engine
- Development: PygmalionAI
- License: Apache 2.0
- GitHub Stars: ~1.6k
Core Technology
Aphrodite is a vLLM fork optimized for RP/storytelling community needs.
Features added over vLLM:
- Enhanced sampling parameters (fine control of temperature, repetition penalty, etc.)
- EXL2, GGUF quantization format support (vLLM focuses on GPTQ/AWQ)
- Rapid response to community requests
- PagedAttention KV cache management (vLLM-based)
- Continuous batching (async server)
Pros and Cons
| Pros | Cons |
|---|---|
| vLLM-based high performance | May lag in tracking vLLM upstream |
| Various quantization format support | Community-focused rather than production environment |
| Fine-grained sampling control | Limited documentation/support |
Suitable Use Cases
- RP/storytelling serving (SillyTavern, etc.)
- When wanting to serve EXL2/GGUF models on server
- When needing sampling features absent in vLLM
3.14 LocalAI
- GitHub: mudler/LocalAI
- Development: mudler and community
- License: MIT
Core Technology
LocalAI is a fully OpenAI API compatible local AI server that integrates various backends.
Multi-backend architecture:
OpenAI-compatible API (/v1/chat/completions, /v1/images, /v1/audio, etc.)
├── llama.cpp (text generation)
├── whisper.cpp (speech recognition)
├── stable-diffusion.cpp (image generation)
├── bark (TTS)
├── piper (TTS)
└── other backends
2025 features:
- LocalAI Core (text, image, audio, vision APIs)
- LocalAGI (autonomous agents)
- LocalRecall (semantic search)
- P2P distributed inference
- Constrained grammars (structured output)
Pros and Cons
| Pros | Cons |
|---|---|
| Complete OpenAI API drop-in replacement | Lacks performance optimization vs individual backends |
| Text+image+audio all-in-one | Setup complexity |
| P2P distributed support | Documentation insufficient for community size |
| Easy Docker-based deployment | |
Suitable Use Cases
- Converting existing OpenAI API code to local
- Multimodal AI (text+image+audio) from single server
- Privacy-sensitive environments
3.15 DeepSpeed-MII
- GitHub: deepspeedai/DeepSpeed-MII
- Development: Microsoft DeepSpeed team
- License: Apache 2.0
Core Technology
DeepSpeed-MII is a serving framework utilizing Microsoft’s DeepSpeed library’s inference optimizations.
4 core technologies:
- DeepSpeed-Inference: Accelerate Transformer inference with custom CUDA kernels
- ZeRO-Inference: When model doesn’t fit single GPU, utilize CPU memory/NVMe for offloading. Enable single GPU serving of models like Bloom-176B
- DeepSpeed-FastGen: Continuous batching + Dynamic SplitFuse (dynamically split/combine prefill and decode)
- Tensor Parallelism: Multi-GPU parallel inference
Dynamic SplitFuse: Split long prompt prefill across multiple iterations and fuse with decode tokens to maintain uniform GPU utilization.
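The token-budget scheduling can be sketched as follows (a simplified model of the idea; the real implementation also decides which sequences run each step and juggles multiple concurrent prefills):

```python
def split_fuse(prefill_tokens, decode_seqs, budget=512):
    """Dynamic SplitFuse sketch: each iteration carries one decode token
    per running sequence plus a prefill chunk that fills the remainder of
    a fixed token budget, keeping per-step work uniform."""
    schedule, done = [], 0
    while done < prefill_tokens:
        chunk = min(budget - decode_seqs, prefill_tokens - done)
        schedule.append({"decode": decode_seqs, "prefill": chunk})
        done += chunk
    return schedule

for step in split_fuse(prefill_tokens=1200, decode_seqs=32, budget=512):
    print(step)   # three uniform steps instead of one 1200-token prefill burst
```

Holding every step near the same token budget avoids the latency spikes that a monolithic long prefill would otherwise impose on in-flight decode requests.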
Performance
DeepSpeed-FastGen blog (2023):
- Up to 2.3x throughput, up to 2x latency reduction vs vLLM (specific workloads)
- However, gap has narrowed in recent comparisons as vLLM significantly evolved
Pros and Cons
| Pros | Cons |
|---|---|
| ZeRO-Inference for ultra-large model deployment | Decreasing development activity trend |
| Official Microsoft support | Lags behind vLLM/SGLang in performance (recent basis) |
| Dynamic SplitFuse technique | Limited model support range |
| Azure integration | Insufficient documentation/examples |
Suitable Use Cases
- Single GPU serving of ultra-large models (ZeRO-Inference)
- Azure/Microsoft ecosystem
- Integration with DeepSpeed training pipelines
3.16 OpenLLM (BentoML)
- GitHub: bentoml/OpenLLM
- Development: BentoML
- License: Apache 2.0
Core Technology
OpenLLM is an LLM serving tool built on BentoML framework, managing the entire lifecycle from model packaging to cloud deployment.
Features:
- Bento packaging: Package model + dependencies + serving code together
- OpenAI-compatible API
- Swappable inference backends: Use vLLM, TensorRT-LLM, etc. as backends
- BentoCloud deployment: One-click cloud deployment
- LangChain integration
Pros and Cons
| Pros | Cons |
|---|---|
| Model lifecycle management | Inference performance depends on backend |
| BentoCloud one-click deployment | Possible overhead from indirect backend usage |
| Various backend support | Limited community size |
| LangChain integration | |
Suitable Use Cases
- Teams needing ML model packaging/deployment pipelines
- BentoCloud users
- Serving LLM + other ML models together
3.17 CTranslate2
- GitHub: OpenNMT/CTranslate2
- Development: OpenNMT (SYSTRAN)
- License: MIT
Core Technology
CTranslate2 is an engine that converts Transformer models to optimized C++ format for inference. Originally developed for machine translation (NMT), expanded to LLMs.
Optimization techniques:
- Layer Fusion: Combine consecutive layers into single operations
- Padding Removal: Remove padding within batches to prevent unnecessary computation
- Batch Reordering: Sort sequences by length within batches for efficiency improvement
- In-place Operations: Minimize memory allocation
- Caching Mechanism: Cache repetitive operation results
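Batch reordering is easy to illustrate: sorting by length before forming sub-batches shrinks the padding each sub-batch must carry (a toy example with placeholder token lists):

```python
def reorder_by_length(batch):
    """Batch-reordering sketch: sort sequences by length so sub-batches
    group similar lengths; 'order' lets outputs be restored afterwards."""
    order = sorted(range(len(batch)), key=lambda i: len(batch[i]), reverse=True)
    return [batch[i] for i in order], order

def padding_waste(batch):
    """Padded positions needed to pack this batch into a rectangle."""
    longest = max(len(s) for s in batch)
    return sum(longest - len(s) for s in batch)

batch = [[1] * 5, [1] * 40, [1] * 7, [1] * 38]
# split into sub-batches of 2, with and without reordering
naive = padding_waste(batch[:2]) + padding_waste(batch[2:])
sorted_batch, _ = reorder_by_length(batch)
smart = padding_waste(sorted_batch[:2]) + padding_waste(sorted_batch[2:])
print(naive, smart)   # → 66 4: reordering cuts padded positions sharply
```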
Quantization: Supports INT8, INT16, Float16. INT8 models are 3.53x faster than Float32 (AMD ROCm benchmark).
Primary use case: Faster-Whisper (high-speed Whisper speech recognition implementation) uses CTranslate2 as core backend.
Pros and Cons
| Pros | Cons |
|---|---|
| Excellent CPU performance | No LLM-specific optimizations (PagedAttention, etc.) |
| Lightweight, minimal dependencies | Limited model support (mainly encoder-decoder) |
| Production-proven (translation services) | Decreasing community activity |
| AMD ROCm support | Slow support for latest LLM architectures |
Suitable Use Cases
- Machine translation serving
- Whisper-based speech recognition (Faster-Whisper)
- Transformer inference in CPU-only environments
- Lightweight deployment
3.18 Candle
- GitHub: huggingface/candle
- Development: Hugging Face
- License: Apache 2.0/MIT
Core Technology
Candle is a minimal ML framework written in Rust, providing PyTorch-like API with Rust’s safety and performance.
Features:
- Pure Rust implementation (no libtorch/Python dependencies)
- CUDA, Metal backend support
- Native HuggingFace Hub integration
- WASM target (browser execution)
- Flash Attention support (CUDA feature flag)
Ecosystem:
- candle-transformers: Major model implementations (LLaMA, Mistral, Phi, etc.)
- candle-einops: Rust einops implementation
- atoma-infer: Large-scale inference library based on Candle (FlashAttention2, PagedAttention)
Pros and Cons
| Pros | Cons |
|---|---|
| Rust memory safety/performance | Inference-only (no training support) |
| Python dependency elimination | Fewer model implementations vs Python ecosystem |
| WASM support (serverless/browser) | Small community size |
| Lightweight binaries | Absence of high-level serving features |
Suitable Use Cases
- Embedding ML in Rust-based applications
- Lightweight inference in serverless/edge
- WASM-based browser AI
- Direct HuggingFace model usage in Rust
4. Technology Comparison Analysis
4.1 KV Cache Management Comparison
| Method | Tools | Core Idea | Memory Efficiency | Prefix Reuse | Complexity |
|---|---|---|---|---|---|
| PagedAttention | vLLM, Aphrodite | Store KV in fixed blocks non-contiguously using OS paging techniques | ★★★★★ | △ (hash-based) | Medium |
| RadixAttention | SGLang | Automatically share prefix via radix tree | ★★★★★ | ★★★★★ | High |
| Blocked KV Cache | LMDeploy TurboMind | Block grid-based management, split & fuse optimization | ★★★★☆ | △ | Medium |
| Paged + Quantized KV | TensorRT-LLM | Block-based + INT8/FP8 KV quantization | ★★★★★ | ○ (CPU offloading) | High |
| Contiguous | llama.cpp, ExLlamaV2 | Contiguous memory, pre-allocation | ★★☆☆☆ | ✗ | Low |
Key insights:
- Fragmentation elimination: PagedAttention (vLLM) became standard. Reduced memory waste from 60-80% to under 5%
- Prefix reuse: RadixAttention (SGLang) achieves highest cache hit rates. 85-95% in few-shot vs PagedAttention’s 15-25%
- KV quantization: Supported by TensorRT-LLM and LMDeploy. Quantizing KV to FP8/INT8 saves 50% memory with minimal quality loss
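The mechanics behind fragmentation elimination can be shown with a toy block table — the core bookkeeping of PagedAttention, minus the actual KV tensors. The class and names below are illustrative, not vLLM's API:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is also 16)

class BlockTable:
    """Toy PagedAttention bookkeeping: each sequence's logical blocks map to
    physical blocks drawn from a shared free pool. No real KV data stored."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # physical block ids
        self.tables = {}                              # seq_id -> [physical ids]

    def append_token(self, seq_id, num_tokens_so_far):
        table = self.tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:       # block boundary: grab one
            table.append(self.free.pop(0))

    def free_seq(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))     # blocks return to the pool
```

Because blocks are fixed-size and non-contiguous, the only waste is the partially filled last block of each sequence — at most BLOCK_SIZE - 1 token slots — which is how vLLM brings memory waste below 5%.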
4.2 Quantization Method Comparison
| Method | Bits | Process | GPU Required | Quality | Speed | Compatible Tools |
|---|---|---|---|---|---|---|
| GPTQ | 4bit (mainly) | Post-training, Hessian-based | Required for quantization | ★★★★☆ | ★★★★★ (ExLlama) | vLLM, TGI, ExLlamaV2 |
| AWQ | 4bit | Activation-aware weight quant | Required for quantization | ★★★★★ | ★★★★☆ | vLLM, LMDeploy, TGI |
| EXL2 | 2-8bit mixed | Per-layer mixed precision | Required for quantization | ★★★★☆ | ★★★★★ | ExLlamaV2, Aphrodite |
| GGUF | 2-8bit | K-quant super-block | CPU possible | ★★★★☆ | ★★★☆☆ (CPU) | llama.cpp, Ollama, LocalAI |
| FP8 | 8bit | 8-bit floating point | Hopper GPU | ★★★★★ | ★★★★★ | TensorRT-LLM, vLLM |
| bitsandbytes | 4/8bit | NF4, INT8 | Required | ★★★☆☆ | ★★★☆☆ | TGI, HF Transformers |
Quality ranking (same 4-bit, perplexity basis): AWQ > GPTQ ≈ EXL2 > GGUF Q4_K_M > bitsandbytes NF4
Speed ranking (GPU, 4-bit): EXL2 (ExLlamaV2) > GPTQ (ExLlama backend) > AWQ (vLLM) > GGUF (llama.cpp GPU offload)
Key selection criteria:
- GPU serving, maximum speed: EXL2 (ExLlamaV2) or GPTQ (ExLlama backend)
- GPU serving, highest quality: AWQ (vLLM/LMDeploy)
- CPU/hybrid inference: GGUF (llama.cpp)
- NVIDIA Hopper, production: FP8 (TensorRT-LLM)
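The formats above differ in calibration (Hessian-based, activation-aware, mixed precision) and kernel design, but they share a group-wise core: a low-bit integer per weight plus a scale (and often a zero point) per small group. A toy asymmetric 4-bit version — group size and rounding are illustrative; real GPTQ/AWQ/GGUF kernels differ substantially:

```python
def quantize_4bit_groups(weights, group_size=4):
    """Asymmetric 4-bit group quantization: per group, store a scale and a
    zero point, plus one 4-bit integer (0..15) per weight."""
    groups = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / 15 or 1.0     # 16 levels; avoid div-by-zero
        q = [round((w - lo) / scale) for w in g]
        groups.append((q, scale, lo))
    return groups

def dequantize(groups):
    out = []
    for q, scale, zero in groups:
        out.extend(v * scale + zero for v in q)
    return out
```

Smaller groups track the weight distribution more tightly (better quality) but store more scales (worse compression) — the tradeoff behind GGUF's K-quant super-blocks and GPTQ's typical group size of 128.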
4.3 Batching Strategy Comparison
| Strategy | Description | GPU Utilization | Latency | Supporting Tools |
|---|---|---|---|---|
| Static Batching | Wait until all sequences in batch complete | ★★☆☆☆ | High (bound by longest sequence) | Basic HF Transformers |
| Continuous Batching | Insert new requests immediately upon sequence completion | ★★★★☆ | Low | vLLM, SGLang, TGI, Aphrodite |
| In-flight Batching | NVIDIA’s continuous batching implementation, iteration-level scheduling | ★★★★★ | Very low | TensorRT-LLM, Triton |
| Persistent Batching | Maintain batches while dynamically replacing individual sequences | ★★★★★ | Low | LMDeploy |
| Dynamic SplitFuse | Dynamically split/combine Prefill and decode | ★★★★☆ | Low | DeepSpeed-MII |
Key insight: Evolution from Static → Continuous → In-flight/Persistent. All modern serving engines use continuous batching or better.
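The gap between static and continuous batching is easy to see in a toy iteration-level scheduler. This is a simplification under assumed semantics (one token per sequence per step, no prefill/decode distinction), not any engine's real scheduler:

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """requests: list of (req_id, tokens_to_generate).
    Returns {req_id: decode step at which it finished}."""
    queue, running, done, step = deque(requests), {}, {}, 0
    while queue or running:
        # iteration-level scheduling: admit new work the moment a slot frees up
        while queue and len(running) < max_batch:
            rid, n = queue.popleft()
            running[rid] = n
        step += 1                  # one decode iteration for the whole batch
        for rid in list(running):
            running[rid] -= 1      # each running sequence emits one token
            if running[rid] == 0:
                done[rid] = step
                del running[rid]
    return done
```

With requests of lengths 1, 3, and 2 and a batch limit of 2, the third request starts the moment the first finishes, and everything completes in 3 steps; static batching would hold it until the whole first batch drained, finishing at step 5.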
4.4 Attention Optimization Comparison
| Technique | Paper | Core Idea | Main Effect | Using Tools |
|---|---|---|---|---|
| Flash Attention | Dao et al., 2022 | Minimize HBM access via SRAM tiling | Memory savings + 2-4x speed improvement | TGI, SGLang, Candle |
| Flash Attention 2 | Dao, 2023 | Improved work partitioning, sequence parallelization | 2x additional improvement over FA1 | Most modern engines |
| Flash Attention 3 | 2024 | Hopper asynchronous execution, FP8 support | Additional improvement over FA2 (especially H100) | SGLang (latest) |
| PagedAttention | Kwon et al., 2023 | Block-based KV management + attention | Memory efficiency maximization | vLLM, TGI, Aphrodite |
| FlashInfer | 2024 | Shared prefix batch decoding optimization, cascading | Up to 31x faster than vLLM on shared prefix | SGLang, vLLM (integrating) |
| FlexAttention | PyTorch, 2024 | BlockMask + page table integration | Combine flexible mask + paged attention | PyTorch native |
FlashInfer detail:
- When shared prefix is 32,768 tokens and batch size 256, up to 31x speed improvement vs basic PagedAttention
- Cascading technique computes shared prefix attention only once
FA3 benchmark: In SGLang, FA3 surpasses both FlashInfer and Triton backends, with gap widening as input/output size increases.
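Every FlashAttention generation rests on the same trick: compute softmax incrementally over tiles with a running max and normalizer, so the full attention matrix never has to materialize in HBM. A one-query, scalar-value sketch of the rescaling math (real kernels tile K/V matrices in SRAM and vectorize all of this):

```python
import math

def online_softmax_weighted_sum(scores, values, tile=2):
    """One query row: accumulate softmax(scores)·values tile by tile, keeping
    a running max m and normalizer d — FlashAttention's online softmax."""
    m, d, acc = float("-inf"), 0.0, 0.0
    for i in range(0, len(scores), tile):
        s_t, v_t = scores[i:i + tile], values[i:i + tile]
        m_new = max(m, max(s_t))
        corr = math.exp(m - m_new)  # 0.0 on the first tile (m = -inf)
        d *= corr                   # rescale what we accumulated so far
        acc *= corr
        for s, v in zip(s_t, v_t):
            w = math.exp(s - m_new)
            d += w
            acc += w * v
        m = m_new
    return acc / d
```

The result matches a full-softmax reference exactly (up to floating-point error), which is why FlashAttention is an exact method rather than an approximation.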
4.5 Speculative Decoding Support Status
Speculative decoding is a technique in which a small “draft model” rapidly proposes multiple tokens and a large “target model” verifies them in a single forward pass (Leviathan et al., 2023; Chen et al., 2023).
| Tool | Support | Draft Model Method | Performance Improvement |
|---|---|---|---|
| vLLM | ✅ | Separate small model, n-gram, MLPSpeculator | 2-3x (workload dependent) |
| SGLang | ✅ | EAGLE, EAGLE 2, EAGLE 3 (2025 latest) | 2-4x |
| TensorRT-LLM | ✅ | Draft model, Medusa heads | 2-3x |
| TGI | ✅ | Medusa | 2x |
| LMDeploy | △ (experimental) | - | - |
| llama.cpp | ✅ | Draft model | 1.5-2x |
| ExLlamaV2 | △ | - | - |
| Others | ✗ | - | - |
EAGLE 3 (SGLang, December 2025): LMSYS ships bundled EAGLE 3 draft models for popular base models. Groq reports a 6x+ speedup on Llama-3.1-70B, and SambaNova reports 2x+ on Llama-3.1-405B.
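The verification step can be sketched for the greedy case. This is a hedged simplification: real implementations score all draft tokens with one batched target forward pass (here `target_next_token`, a hypothetical stand-in, is called per position) and use a probabilistic accept/reject rule when sampling:

```python
def verify_draft(target_next_token, prefix, draft_tokens):
    """Greedy speculative verification sketch: accept draft tokens while they
    match the target model's own choice; on the first mismatch, emit the
    target's correction; if all match, emit one bonus target token."""
    accepted, seq = [], list(prefix)
    for t in draft_tokens:
        expected = target_next_token(seq)
        if t == expected:
            accepted.append(t)
            seq.append(t)
        else:
            accepted.append(expected)           # target's correction
            return accepted
    accepted.append(target_next_token(seq))     # bonus token
    return accepted
```

Each verification round yields between 1 and k+1 tokens for the cost of roughly one target forward pass, which is where the 2-4x speedups come from when the draft model's acceptance rate is high.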
4.6 Prefix Caching Comparison
| Tool | Method | Cache Hit Rate (few-shot) | Cache Hit Rate (chat) | Implementation |
|---|---|---|---|---|
| SGLang | RadixAttention (radix tree) | 85-95% | 60-85% | Token sequence-based tree |
| vLLM | Hash-based prefix caching | 15-25% | 30-50% | Block hash matching |
| TensorRT-LLM | KV Cache Reuse + CPU offloading | Medium | Medium | CPU-GPU transfer |
| TGI v3 | Prefix KV caching | Medium-High | High (long history) | Chunk-based |
| LMDeploy | Blocked KV reuse | Low-Medium | Medium | Block matching |
Key insight: For workloads with heavy prefix reuse (agents, few-shot, shared system prompts), SGLang's RadixAttention is the clear winner. The gap narrows for simple chatbot serving.
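RadixAttention's bookkeeping reduces to longest-prefix matching over token sequences. A toy trie version — the real SGLang implementation uses a compressed radix tree with LRU eviction and ties each node to KV cache blocks, none of which appears here:

```python
class RadixCache:
    """Toy RadixAttention-style index: a trie over token ids. A prefix hit
    means the KV cache for those tokens could be reused, not recomputed."""
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            n += 1
        return n  # number of tokens whose KV is reusable
```

Because few-shot prompts share long identical prefixes across requests, almost every query is a deep hit, which is where the 85-95% hit rates come from.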
4.7 Distributed Inference Comparison
| Method | Description | Advantages | Disadvantages | Supporting Tools |
|---|---|---|---|---|
| Tensor Parallelism(TP) | Split single layer across multiple GPUs | Low latency | All-reduce communication needed, requires high bandwidth between GPUs | vLLM, SGLang, TensorRT-LLM, LMDeploy, TGI |
| Pipeline Parallelism(PP) | Sequential layer placement across GPUs | Low communication overhead | Pipeline bubbles, high latency | TensorRT-LLM, DeepSpeed |
| Expert Parallelism(EP) | Distribute MoE model experts across GPUs | Optimal for MoE models | MoE-only | vLLM (Wide-EP), SGLang |
| Disaggregated Serving | Run Prefill and Decode on separate nodes | Independent scaling per phase | KV transfer overhead | vLLM (NIXL), SGLang |
| Sequence Parallelism | Split long sequences | Useful for long context | Complex implementation | DeepSpeed, Ring Attention |
vLLM’s latest distributed serving (December 2025):
- DeepSeek models achieve 2,200 tokens/s per H200 with Wide Expert Parallelism
- Efficient KV transfer via NIXL/LMCache connectors for prefill-decode separation
- Independent autoscaling based on Ray
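Mechanically, tensor parallelism means each device holds a weight shard, computes a partial result, and an all-reduce combines them — which is why inter-GPU bandwidth dominates. A row-parallel matrix-vector sketch, with devices simulated as list slices and a plain elementwise `sum` standing in for the NCCL all-reduce:

```python
def matvec(W, x):
    """Reference y = W x for a list-of-rows matrix and a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, num_devices=2):
    """Row-parallel sketch: split the input dimension across devices; each
    computes a partial output; the all-reduce sums the partials."""
    n = len(x)
    chunk = n // num_devices
    partials = []
    for d in range(num_devices):
        cols = slice(d * chunk, (d + 1) * chunk if d < num_devices - 1 else n)
        W_shard = [row[cols] for row in W]         # this device's weight slice
        partials.append(matvec(W_shard, x[cols]))  # partial result, no comms yet
    # all-reduce: elementwise sum of the partial outputs across devices
    return [sum(p[i] for p in partials) for i in range(len(W))]
```

The all-reduce happens once per layer, so TP gives low latency but demands NVLink-class bandwidth; pipeline parallelism communicates less but pays in bubbles, as the table above notes.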
5. Comprehensive Comparison Tables
5.1 Feature Comparison
| Tool | Language | Continuous Batching | PagedAttention | Quantization | Speculative Decoding | Distributed Inference | OpenAI API |
|---|---|---|---|---|---|---|---|
| vLLM | Python/C++ | ✅ | ✅ | AWQ,GPTQ,FP8 | ✅ | TP | ✅ |
| SGLang | Python/C++ | ✅ | ✅ (RadixAttn) | AWQ,GPTQ,FP8 | ✅ (EAGLE3) | TP,EP | ✅ |
| TensorRT-LLM | Python/C++ | ✅ (in-flight) | ✅ | FP8,INT4,INT8 | ✅ | TP,PP | via Triton |
| TGI | Rust/Python | ✅ | ✅ | AWQ,GPTQ,bnb | ✅ (Medusa) | TP | ✅ |
| llama.cpp | C/C++ | △ | ✗ | GGUF (2-8bit) | ✅ | ✗ | ✅ |
| Ollama | Go/C++ | △ | ✗ | GGUF | ✗ | ✗ | ✅ |
| MLC LLM | Python/C++ | ✅ | ✅ | 3-4bit | ✗ | ✗ | ✅ |
| LMDeploy | Python/C++ | ✅ (persistent) | ✅ (blocked) | AWQ,INT8,KV quant | △ | TP | ✅ |
| Triton Server | C++/Python | ✅ (dynamic) | via backend | via backend | via backend | via backend | ✗ |
| ExLlamaV2 | Python/C++ | ✗ | ✗ | EXL2,GPTQ | △ | ✗ | via TabbyAPI |
| Ray Serve+vLLM | Python | ✅ | ✅ | vLLM all | ✅ | TP+multi-node | ✅ |
| PowerInfer | C/C++ | ✗ | ✗ | GGUF | ✗ | ✗ | ✗ |
| Aphrodite | Python/C++ | ✅ | ✅ | EXL2,GGUF,AWQ,GPTQ | ✗ | TP | ✅ |
| LocalAI | Go/C++ | △ | ✗ | GGUF | ✗ | P2P | ✅ |
| DeepSpeed-MII | Python/C++ | ✅ | ✗ | INT8 | ✗ | TP,PP | ✅ |
| OpenLLM | Python | via backend | via backend | via backend | via backend | via backend | ✅ |
| CTranslate2 | C++/Python | △ | ✗ | INT8,INT16 | ✗ | ✗ | ✗ |
| Candle | Rust | ✗ | △ (atoma-infer) | ✗ | ✗ | ✗ | ✗ |
5.2 Hardware Support
| Tool | NVIDIA CUDA | AMD ROCm | Apple Metal | CPU | Mobile | WebGPU |
|---|---|---|---|---|---|---|
| vLLM | ✅ | ✅ | ✗ | ✅ | ✗ | ✗ |
| SGLang | ✅ | ✅ | ✗ | ✗ | ✗ | ✗ |
| TensorRT-LLM | ✅ | ✗ | ✗ | ✗ | ✗ | ✗ |
| TGI | ✅ | ✅ | ✗ | ✗ | ✗ | ✗ |
| llama.cpp | ✅ | ✅ | ✅ | ✅ | ✅ | ✗ |
| Ollama | ✅ | ✅ | ✅ | ✅ | ✗ | ✗ |
| MLC LLM | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| LMDeploy | ✅ | ✗ | ✗ | ✗ | ✗ | ✗ |
| ExLlamaV2 | ✅ | ✗ | ✗ | ✗ | ✗ | ✗ |
| PowerInfer | ✅ | ✗ | ✗ | ✅ (hybrid) | ✗ | ✗ |
| LocalAI | ✅ | ✅ | ✅ | ✅ | ✗ | ✗ |
| Candle | ✅ | ✗ | ✅ | ✅ | ✗ | ✅ (WASM) |
5.3 Performance Tiers (2025 basis, approximate ranking)
GPU Serving Throughput (high concurrency, A100/H100):
- 🥇 LMDeploy (TurboMind) — especially quantized models
- 🥇 SGLang — workloads with high prefix reuse
- 🥈 TensorRT-LLM — optimal performance after engine build
- 🥈 vLLM — general-purpose champion
- 🥉 TGI — slightly behind vLLM
- DeepSpeed-MII, MLC LLM
Single Request Latency:
- 🥇 TensorRT-LLM (compiled kernels)
- 🥈 SGLang / vLLM
- 🥉 LMDeploy
Consumer GPU (single user):
- 🥇 ExLlamaV2 — highest speed
- 🥈 llama.cpp (GPU offload)
- 🥉 Ollama / PowerInfer
6. Scenario-Based Recommendations
Scenario 1: Production Chatbot Service
Recommendation: vLLM or SGLang + Ray Serve
- High concurrency, stable TTFT needed
- If multi-turn chat, SGLang (RadixAttention advantage)
- Add Ray Serve if autoscaling needed
Scenario 2: NVIDIA-only, Maximum Performance
Recommendation: TensorRT-LLM + Triton
- Fixed models with engine build investment feasible
- Maximum throughput with FP8 (H100)
- Enterprise-level stability
Scenario 3: Local Development / Prototyping
Recommendation: Ollama
- 5-minute installation + execution
- Simple model management via model registry
Scenario 4: CPU Server / GPU-less Environment
Recommendation: llama.cpp or CTranslate2
- llama.cpp: General LLM, various quantizations
- CTranslate2: Specialized for translation/Whisper etc.
Scenario 5: Mobile App / Browser
Recommendation: MLC LLM (mobile), llama.cpp (mobile), Candle (WASM)
- MLC LLM: Most comprehensive mobile support
- web-llm: WebGPU-based browser execution
Scenario 6: Single GPU, Large Model
Recommendation: PowerInfer (sparse models) or DeepSpeed-MII (ZeRO-Inference)
- Run GPU memory-exceeding models with CPU offloading
Scenario 7: Agent / Tool-use / Structured Output
Recommendation: SGLang
- Maximize prefix reuse with RadixAttention
- JSON output optimization with Compressed FSM
- Compose complex LLM pipelines with DSL
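The idea behind FSM-constrained structured output can be shown with a toy grammar. This is only an illustration of the masking principle — SGLang's compressed FSM operates on real tokenizer vocabularies and compresses deterministic transition chains, which this sketch does not:

```python
# Toy grammar for a JSON fragment like {"ok": true}, with coarse tokens.
FSM = {  # state -> {legal token: next state}
    0: {'{': 1},
    1: {'"ok"': 2},
    2: {':': 3},
    3: {'true': 4, 'false': 4},
    4: {'}': 5},                 # state 5 = accept
}

def allowed_tokens(state):
    """The logit mask: every token outside this set gets probability zero."""
    return sorted(FSM.get(state, {}))

def constrained_decode(pick):
    """pick(allowed) stands in for argmax over the masked model logits."""
    state, out = 0, []
    while state != 5:
        tok = pick(allowed_tokens(state))
        out.append(tok)
        state = FSM[state][tok]
    return out
```

Whatever the model "prefers", the mask guarantees the output parses — invalid JSON is unrepresentable, not merely unlikely.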
Scenario 8: OpenAI API Drop-in Replacement
Recommendation: LocalAI
- Full /v1/chat/completions compatibility
- Text + image + audio all-in-one
7. Conclusion
As of 2025, the LLM serving ecosystem is maturing, with tools differentiating into distinct niches.
Key Trends
- vLLM and SGLang's two-horse race: vLLM dominates general-purpose serving, while SGLang leads in structured workloads. This split is strengthening as TGI enters maintenance mode.
- KV cache management innovation: PagedAttention became the standard, and RadixAttention opened new possibilities for prefix reuse. KV quantization (FP8) is the next frontier in memory efficiency.
- Speculative decoding everywhere: 2-4x speedups via EAGLE 3, Medusa, and similar methods are becoming routine, now supported by all major engines.
- Disaggregated serving: Architectures that separate prefill and decode for independent scaling are emerging as the new standard for large-scale serving.
- Consumer hardware accessibility: The llama.cpp/Ollama ecosystem democratized local AI, and PowerInfer is pushing the limits of consumer GPUs.
Selection Guide Summary
| Priority | Recommended Tool |
|---|---|
| General production | vLLM |
| Maximum throughput (NVIDIA) | LMDeploy or TensorRT-LLM |
| Agent/structured output | SGLang |
| Easy local execution | Ollama |
| Mobile/edge | MLC LLM |
| Maximum single GPU speed | ExLlamaV2 |
| Hardware versatility | llama.cpp |
8. References
Core Papers
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., … & Stoica, I. (2023). “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. [arXiv:2309.06180]
- Zheng, L., Yin, L., Xie, Z., Huang, J., Sun, C., Yu, C. H., … & Stoica, I. (2023). “SGLang: Efficient Execution of Structured Language Model Programs.” [arXiv:2312.07104]
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS 2022. [arXiv:2205.14135]
- Dao, T. (2023). “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” ICLR 2024. [arXiv:2307.08691]
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” ICLR 2023. [arXiv:2210.17323]
- Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2024). “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” MLSys 2024. [arXiv:2306.00978]
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). “Fast Inference from Transformers via Speculative Decoding.” ICML 2023. [arXiv:2211.17192]
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J. B., Sifre, L., & Jumper, J. (2023). “Accelerating Large Language Model Decoding with Speculative Sampling.” [arXiv:2302.01318]
- Yu, G. I., Jeong, J. S., Kim, G. W., Kim, S., & Chun, B. G. (2022). “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022.
- Song, Y., Mi, Z., Xie, H., & Chen, H. (2023). “PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU.” [arXiv:2312.12456]
- Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., … & Krishnamurthy, A. (2018). “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” OSDI 2018.
- Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.” ICML 2024. [arXiv:2401.15077]
Benchmark Sources
- BentoML. (2024). “Benchmarking LLM Inference Backends.” https://www.bentoml.com/blog/benchmarking-llm-inference-backends
- LMSYS. (2024). “Achieving Faster Open-Source Llama3 Serving with SGLang Runtime.” https://lmsys.org/blog/2024-07-25-sglang-llama3/
- LMSYS. (2024). “Fast and Expressive LLM Inference with RadixAttention and SGLang.” https://lmsys.org/blog/2024-01-17-sglang/
- Clarifai. (2025). “Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B.” https://www.clarifai.com/blog/comparing-sglang-vllm-and-tensorrt-llm-with-gpt-oss-120b
- MarkTechPost. (2025). “Comparing the Top 6 Inference Runtimes for LLM Serving in 2025.” https://www.marktechpost.com/2025/11/07/comparing-the-top-6-inference-runtimes-for-llm-serving-in-2025/
- FlashInfer. (2024). “Accelerating Self-Attentions for LLM Serving with FlashInfer.” https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- oobabooga. (2023). “A detailed comparison between GPTQ, AWQ, EXL2, q4_K_M.” https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/
- vLLM Blog. (2025). “Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP.” https://blog.vllm.ai/2025/12/17/large-scale-serving.html
This article is based on information available as of February 2026. The LLM serving ecosystem evolves rapidly, so please check each tool's official documentation and latest releases.