Complete LLM Serving Engine Guide — In-Depth Analysis of 18 Tools
Last Updated: February 2026
Target Audience: ML Engineers, MLOps, Infrastructure Architects
Scope: 18 production LLM serving tools + core technology comparative analysis
1. Introduction
The most critical bottleneck in practical deployment of Large Language Models (LLMs) is inference serving. Serving models with tens to hundreds of billions of parameters in real-time requires solving various technical challenges including memory management, batching strategies, attention optimization, and quantization.
The LLM serving tool ecosystem has exploded since 2023. After vLLM’s PagedAttention changed the paradigm of KV cache memory management, various approaches have emerged, including SGLang’s RadixAttention, TensorRT-LLM’s FP8 optimization, and llama.cpp’s expansion of LLM accessibility to consumer hardware.
This article provides paper-level depth analysis of 18 major LLM serving/inference tools and systematically compares their core technologies.
Core Evaluation Metrics
| Metric | Description |
|---|---|
| Throughput | Tokens generated per second (tokens/s) |
| TTFT | Time to First Token: latency until the first output token is produced |
| Latency (P50/P99) | Per-request response latency at the 50th/99th percentile |
| Memory Efficiency | GPU/CPU memory usage efficiency |
| Scalability | Performance maintenance capability as concurrent users increase |
2. Core Technology Concepts
2.1 KV Cache
A mechanism to avoid duplicate computation by reusing Key-Value tensors from previous tokens during Transformer decoding. In LLM serving, KV cache occupies 30–60% of total GPU memory, making efficient management crucial for serving performance.
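For intuition, the KV cache footprint can be estimated directly from the model shape. Below is a rough sizing helper (a simplified sketch; real engines must also account for grouped-query attention, quantized KV, and block granularity):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128, FP16 KV
size = kv_cache_bytes(32, 32, 128, seq_len=2048, batch=8)
print(size / 2**30)  # 8.0 GiB for just 8 concurrent 2K-token sequences
```

At FP16, 8 concurrent 2K-token sequences already consume 8 GiB of KV cache on this hypothetical config, which is why the 30–60% figure above is plausible on a 40–80 GB GPU.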
2.2 Continuous Batching
While static batching must wait until every sequence in the batch completes, continuous batching admits new requests as soon as any sequence finishes, maximizing GPU utilization. First proposed in the Orca system (Yu et al., 2022).
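A toy iteration-level scheduler illustrates the idea (a simplified sketch, not Orca's or vLLM's actual scheduler; real systems also handle prefill, preemption, and memory limits):

```python
from collections import deque

def continuous_batching(requests, max_batch=8, max_steps=100):
    """Toy continuous batching: finished sequences leave the batch and
    waiting requests are admitted at every iteration, not per-batch."""
    waiting = deque(requests)            # (request_id, tokens_left)
    running, finished = [], []
    for _ in range(max_steps):
        # admit new requests into any free batch slots each iteration
        while waiting and len(running) < max_batch:
            running.append(list(waiting.popleft()))
        if not running:
            break
        for req in running:              # one decode step for every sequence
            req[1] -= 1
        finished += [r[0] for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]
    return finished

print(continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch=2))
# → ['b', 'a', 'c']: "c" enters as soon as "b" finishes, without waiting for "a"
```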
2.3 Quantization
A technique to reduce memory and increase inference speed by converting FP16/BF16 weights to lower precision like INT4/INT8. There’s a tradeoff between quality loss and speed improvement.
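The basic mechanism can be sketched with symmetric per-tensor INT8 quantization (a minimal illustration of the round-and-scale idea; production schemes like GPTQ and AWQ are considerably more sophisticated):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: one FP scale maps [-max|w|, max|w|] to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(4096).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.nbytes, err)   # 4096 bytes (vs 16384 for FP32); error bounded by half a scale step
```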
3. In-Depth Tool Analysis
3.1 vLLM
- GitHub: vllm-project/vllm
- Development: UC Berkeley (Kwon et al.)
- License: Apache 2.0
- Current Status: Active development (v0.7.x+ as of 2025, official distribution via NVIDIA NGC)
Core Technology: PagedAttention
vLLM’s core innovation is PagedAttention (Kwon et al., 2023). Inspired by virtual memory paging techniques in operating systems, it partitions KV cache into fixed-size blocks and manages them through an indirection layer.
Problems with existing approach: Traditional KV cache allocates a contiguous memory region per sequence, pre-reserving space for the maximum sequence length. On average, 60–80% of that memory is wasted (internal + external fragmentation).
PagedAttention’s solution:
- Partition KV cache into fixed-size blocks (e.g., 16 tokens)
- Each sequence references non-contiguous blocks through a block table (page table)
- Dynamically allocate new blocks as sequences grow
- Enable KV cache sharing for beam search etc. via Copy-on-Write
Result: Reduces memory waste to under 5%, enabling 2–4x more concurrent requests on the same GPU.
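The block-table indirection can be sketched as follows (a toy model of the bookkeeping only; the real vLLM allocator also handles eviction, swapping, and copy-on-write):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: sequences map to
    non-contiguous fixed-size blocks via a per-sequence block table."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical blocks
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def lookup(self, seq_id, pos):
        """Translate a logical token position to (physical block, offset)."""
        table = self.tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size

cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))   # 40 tokens land in 3 blocks of 16
```

Because blocks are allocated on demand, a sequence only ever wastes at most one partially filled block, which is where the "under 5%" figure comes from.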
Architecture
Client Request → FastAPI Server → AsyncLLMEngine
→ Scheduler (continuous batching)
→ Model Runner (GPU execution)
→ PagedAttention KV Cache Manager
→ Sampler → Token Output (streaming)
- Scheduler: Schedules requests iteration-wise with continuous batching
- Model Runner: Executes model with CUDA kernels (FlashAttention/FlashInfer backend selection)
- KV Cache Manager: Block-level allocation/deallocation, Copy-on-Write support
Performance Benchmarks
| Comparison | Throughput vs vLLM |
|---|---|
| HuggingFace Transformers | 14–24x lower (Kwon et al., 2023) |
| Early TGI | 2.2–3.5x lower |
| FasterTransformer | 1.5–2x lower |
BentoML benchmark (2024, A100 80GB, Llama 3 8B):
- TTFT: Best-in-class across all concurrent user levels
- Token generation rate: ~2,300–2,500 tokens/s at 100 users (lower than LMDeploy’s 4,000 tokens/s)
- Slightly behind in decode throughput compared to engines with higher GPU utilization (LMDeploy etc.)
Supported Models/Quantization
- Models: 30+ architectures (LLaMA, Mistral, Qwen, Gemma, Phi, Command-R, DeepSeek, etc.)
- Quantization: AWQ, GPTQ, FP8, INT8 (W8A8), Marlin kernels
- Hardware: NVIDIA CUDA, AMD ROCm, AWS Neuron, CPU
Pros and Cons
| Pros | Cons |
|---|---|
| Best-in-class memory efficiency | Lack of decode speed optimization for quantized models |
| Extensive model support | Lags behind TensorRT-LLM in single-request latency |
| Active community, rapid updates | Higher setup complexity vs Ollama |
| OpenAI-compatible API | |
| Speculative decoding support | |
Suitable Use Cases
- General production LLM serving (high throughput + low TTFT)
- Serving various models on single infrastructure
- Multi-GPU distributed inference
3.2 SGLang
- GitHub: sgl-project/sglang
- Development: LMSYS (UC Berkeley, Zheng et al.)
- Paper: Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs” (2023, accepted to ICLR 2025)
- Current Status: Very active development (Diffusion model support, EAGLE 3 speculative decoding as of late 2025)
Core Technology: RadixAttention
SGLang’s innovation is RadixAttention—managing KV cache in a radix tree structure to automatically share prefixes among multiple requests.
Difference from PagedAttention:
- PagedAttention: Focus on block-level memory management (eliminating fragmentation)
- RadixAttention: Focus on prefix reuse (requests sharing the same prefix don’t duplicate KV cache computation)
Radix tree structure:
Root
├── "You are a helpful assistant. " → KV cached
│ ├── "Translate: Hello" → Branch A
│ ├── "Translate: World" → Branch B
│ └── "Summarize: ..." → Branch C
Requests sharing the same system prompt or few-shot examples compute the prefix KV cache only once and reuse it thereafter.
Cache hit rates:
- Few-shot learning (shared examples): 85–95% (vLLM PagedAttention: 15–25%)
- Multi-turn chat: 60–85% (vLLM: 30–50%)
- LMSYS production: 52.4% for LLaVA-Next-34B, 74.1% for some models
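The prefix-sharing idea can be illustrated with a character-level trie (a toy stand-in for SGLang's token-level radix tree, which additionally compresses paths and evicts with LRU):

```python
class PrefixCache:
    """Minimal radix-style prefix cache over token sequences: a new request
    reuses the KV of its longest cached prefix and only computes the rest."""
    def __init__(self):
        self.root = {}                 # token -> child node

    def insert(self, tokens):
        node, reused = self.root, 0
        for t in tokens:
            if t in node:
                reused += 1            # KV for this token is already cached
            node = node.setdefault(t, {})
        return reused                  # tokens whose prefill can be skipped

cache = PrefixCache()
system = list("You are a helpful assistant. ")
cache.insert(system + list("Translate: Hello"))
hit = cache.insert(system + list("Translate: World"))
print(hit)  # length of the shared prefix (system prompt + "Translate: ")
```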
Frontend DSL
SGLang provides not just a runtime but also a frontend DSL:

    @sgl.function
    def multi_turn_qa(s, question1, question2):
        s += sgl.system("You are a helpful assistant.")
        s += sgl.user(question1)
        s += sgl.assistant(sgl.gen("answer1", max_tokens=256))
        s += sgl.user(question2)
        s += sgl.assistant(sgl.gen("answer2", max_tokens=256))
This DSL automatically optimizes prefix sharing and supports parallel generation with fork/join.
Structured Generation
Uses a compressed finite state machine to efficiently decode structured outputs such as JSON or regex-constrained text. Deterministic spans are emitted in multi-token jumps rather than masked token by token, dramatically reducing decoding overhead.
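The jump-ahead idea can be sketched with a template whose fixed spans are emitted wholesale (a toy illustration only; SGLang's compressed FSM operates on tokenized regex/JSON grammars, and `gen_fn` here is a hypothetical stand-in for the model):

```python
import re

def jump_decode(template, gen_fn):
    """Toy 'compressed FSM' decoding: fixed scaffolding is emitted in one
    jump; the model (gen_fn) is only invoked for the free-form fields."""
    out = []
    for part in re.split(r"(\{\w+\})", template):
        if re.fullmatch(r"\{\w+\}", part):
            out.append(gen_fn(part[1:-1]))   # model fills this hole
        elif part:
            out.append(part)                 # deterministic span: single jump
    return "".join(out)

# stand-in for a constrained LLM call: just looks up canned field values
fake_model = {"name": '"Ada"', "age": "36"}.__getitem__
print(jump_decode('{"name": {name}, "age": {age}}', fake_model))
# → {"name": "Ada", "age": 36}
```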
Performance Benchmarks
LMSYS benchmark (July 2024, Llama 3):
- Llama 3 8B (A100): Both SGLang and TensorRT-LLM achieve up to 5,000 tokens/s on short inputs, vLLM lags behind
- Llama 3 70B: SGLang achieves up to 3x throughput vs vLLM in online serving
- Structured workload: Up to 6.4x throughput, 3.7x lower latency vs baseline
Clarifai benchmark (August 2025, GPT-OSS-120B, H100):
- Strong performance at medium-high concurrency (50 requests)
- TensorRT-LLM shows highest throughput for single requests, lacks scaling at extreme concurrency
Pros and Cons
| Pros | Cons |
|---|---|
| Dramatic performance improvement via prefix reuse | Smaller ecosystem vs vLLM |
| Structured output generation optimization | Relatively less online examples/documentation |
| Complex LLM program authoring via DSL | Some model support lags behind |
| EAGLE 3 speculative decoding | |
| Extension to Diffusion models | |
Suitable Use Cases
- Agent/Tool-use workflows (high prefix reuse)
- When structured outputs (JSON) are needed
- Multi-turn chat serving
- Few-shot evaluation pipelines
3.3 TensorRT-LLM
- GitHub: NVIDIA/TensorRT-LLM
- Development: NVIDIA
- License: Apache 2.0
- Current Status: v0.17+ (as of 2025), NVIDIA’s official inference stack
Core Technology
TensorRT-LLM is an LLM-specific inference engine built on NVIDIA’s TensorRT compiler, generating model-specific optimized CUDA kernels at compile-time.
Key optimizations:
- In-flight Batching: NVIDIA’s implementation of continuous batching. Insert new requests immediately as individual requests complete
- FP8/INT4 quantization: Utilizes FP8 Tensor Cores in Hopper architecture (H100). 2x+ throughput vs FP16, quality loss under 2%
- Paged KV Cache: Block-based KV management similar to vLLM
- Quantized KV Cache: Quantize KV cache itself to INT8, FP8 for memory savings
- KV Cache Reuse: Offload KV cache to CPU memory and reuse it later. Up to 14x TTFT reduction (measured on H100)
- Kernel Fusion: Fuse MHA, MLP etc. into single kernels
Architecture
Model Definition (Python) → TensorRT Engine Build (compilation)
→ Executor API → Triton Inference Server (serving)
→ In-flight Batching Scheduler
→ Fused CUDA Kernels
Important: TensorRT-LLM requires explicit compilation stage. Must build engines for each model+hardware+batch size combination, taking tens of minutes to hours.
Performance Benchmarks
- Single-request latency: Lowest on NVIDIA GPUs (strength of compiled kernels)
- Llama 3.1 8B FP8 (H100): ~2x throughput improvement vs FP16
- LMSYS benchmark: Achieves 5,000 tokens/s on short inputs alongside SGLang
- High concurrency may increase P99 latency due to aggressive batching
Pros and Cons
| Pros | Cons |
|---|---|
| Best single-request performance on NVIDIA GPUs | NVIDIA-only (vendor lock-in) |
| FP8 optimization (Hopper) | Complex setup (engine build, Triton configuration) |
| Rich KV cache options | Recompilation needed for model changes |
| Official NVIDIA support | Steepest learning curve |
Suitable Use Cases
- NVIDIA-only environments requiring maximum performance
- Latency-critical workloads (real-time chatbots)
- Fixed models where engine build investment is feasible
3.4 TGI (Text Generation Inference)
- GitHub: huggingface/text-generation-inference
- Development: Hugging Face
- License: HFOIL (v1), Apache 2.0 (v2+)
- Current Status: Maintenance mode as of December 2025 (accepting only minor bug fixes)
Core Technology
TGI was Hugging Face ecosystem’s official inference server, providing all-in-one production serving features:
- Rust-based HTTP/gRPC server: High-performance web server
- Flash Attention (Dao et al., 2022): Attention algorithm optimizing HBM ↔ SRAM IO
- Continuous Batching: Dynamic request insertion/removal
- Paged Attention: vLLM-style KV cache management
- TGI v3’s Chunked Prefill: Split long contexts into chunks for prefill, reducing memory peaks
- Prefix KV Caching: Reuse KV of long conversation history
Performance Benchmarks
- General prompts: Similar level to vLLM, vLLM slightly ahead at high concurrency
- TGI v3 + long context: 3x more token processing, up to 13x faster vs vLLM (long history + prefix caching)
- BentoML benchmark (Llama 3 8B, A100): 2,300–2,500 tokens/s (similar to vLLM)
Supported Quantization
- AWQ, GPTQ, bitsandbytes (INT4, INT8)
- FP8 (experimental)
Pros and Cons
| Pros | Cons |
|---|---|
| Perfect HuggingFace Hub integration | Entered maintenance mode (since December 2025) |
| Easy setup, excellent documentation | Lags behind latest optimizations vs vLLM/SGLang |
| Built-in safety features (watermark, safety) | Slow model support updates |
| Various hardware (CUDA, ROCm, Gaudi, Inferentia) | |
Suitable Use Cases
- HuggingFace Inference Endpoints users
- Chat workloads with long conversation history (utilizing v3’s prefix caching)
- Rapid prototyping and deployment
Note: With TGI entering maintenance mode, HuggingFace recommends vLLM/SGLang as alternatives.
3.5 llama.cpp
- GitHub: ggml-org/llama.cpp
- Development: Georgi Gerganov and community
- License: MIT
- Current Status: Daily active development (build 4000+ as of 2025)
Core Technology: GGUF Quantization
llama.cpp is a pure inference engine written in C/C++ that can run LLMs on both CPU and GPU without Python/PyTorch dependencies.
GGUF (GGML Unified Format): llama.cpp’s model file format supporting various quantization methods.
Quantization Methods Detail
| Quantization | Bits | Size (7B model) | Quality | Speed | Description |
|---|---|---|---|---|---|
| Q8_0 | 8bit | ~7.0 GB | Best | Slow | Near FP16 |
| Q6_K | 6bit | ~5.5 GB | Very good | Medium | Super-blocks with 6-bit |
| Q5_K_M | 5bit | ~4.8 GB | Good | Medium | Mixed 5-bit, recommended |
| Q4_K_M | 4bit | ~4.1 GB | Fair | Fast | Most popular balance point |
| Q4_K_S | 4bit | ~3.9 GB | Fair | Fast | Slightly smaller than Q4_K_M |
| Q3_K_M | 3bit | ~3.3 GB | Degraded | Fast | Memory constrained |
| Q2_K | 2bit | ~2.7 GB | Significantly degraded | Very fast | Extreme compression |
| IQ4_XS | ~4bit | ~3.7 GB | Q4_K_M level | Slow* | Importance Matrix based |
*IQ quantization can be very slow with partial GPU offloading.
K-Quant System: Quantizations with “K” in the name (Q4_K_M, etc.) use a super-block structure. Each super-block (usually 256 weights) carries independent scale factors; the M (medium) and S (small) suffixes differ in the precision of those scale factors.
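The super-block idea can be approximated with simple per-block scales (a simplified sketch; actual K-quants use nested super-block/sub-block scales plus offsets, and different bit packing):

```python
import numpy as np

def blockwise_q4(w, block=32):
    """Per-block 4-bit quantization sketch: each block of weights gets its
    own FP16 scale, in the spirit of llama.cpp's block quants."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # 4-bit signed: [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

np.random.seed(0)
w = np.random.randn(256).astype(np.float32)
q, s = blockwise_q4(w)
# effective size: 4 bits per weight plus one FP16 scale per 32-weight block
bits_per_weight = (q.size * 4 + s.size * 16) / w.size
print(bits_per_weight)   # 4.5 bits/weight
```

The per-block scales are why real K-quant files come out slightly above their nominal bit width (e.g. Q4_K_M at ~4.8 bits/weight rather than 4.0).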
Architecture
GGUF Model File → ggml tensor library
→ CPU: AVX2/AVX-512/ARM NEON vector operations
→ GPU: CUDA/Metal/Vulkan/OpenCL offloading
→ Multi-threaded inference
→ HTTP Server (llama-server) or CLI
- Partial GPU Offloading: Can split GPU/CPU by layer
- Metal Support: Excellent performance on Apple Silicon
- Vulkan: Universal GPU acceleration (AMD, Intel)
Performance Benchmarks
llama-bench results (Apple Silicon M-series, Qwen2 1.5B Q4_0):
- Prompt processing (pp512): 5,765 tokens/s
- Token generation (tg128): 198 tokens/s
With full GPU offloading vs ExLlamaV2:
- llama.cpp: ~7,500 tokens/s (prompt), ExLlamaV2: ~14,000 tokens/s (~2x difference)
Pros and Cons
| Pros | Cons |
|---|---|
| Hardware universality (CPU/all GPUs) | Lower throughput vs GPU-only tools |
| Single binary, minimal dependencies | Weak continuous batching |
| Extensive quantization options | Lack of production serving features |
| Apple Silicon optimization | Unsuitable for large-scale concurrent serving |
| Very active community | |
Suitable Use Cases
- Running LLMs on local PC/laptop
- CPU server deployment without GPU
- Inference on Apple Silicon Mac
- Edge device deployment
3.6 Ollama
- GitHub: ollama/ollama
- Development: Ollama Inc.
- License: MIT
- Current Status: Active development (expanded to cloud model support as of 2025)
Core Technology
Ollama is a user-friendly LLM execution environment that wraps llama.cpp. Provides Docker-like interface to pull/run models.
ollama pull llama3.1
ollama run llama3.1
Architecture
Ollama CLI/API → Go server (REST API)
→ llama.cpp (inference backend)
→ Model registry (ollama.com)
→ Modelfile (Dockerfile-like model configuration)
Key features:
- Model management: `ollama pull`, `ollama list`, `ollama rm`
- Modelfile: Declaratively set system prompts, temperature, etc.
- OpenAI-compatible API: `/v1/chat/completions` endpoint
- Multimodal: Vision model support
- 2025 updates: Cloud model integration (Turbo), local-only mode settings
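Because the API is OpenAI-compatible, any plain HTTP client works. A minimal sketch (assumes Ollama is running locally on its default port 11434 with the `llama3.1` model pulled; the request itself is left commented out):

```python
import json
import urllib.request

# Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions.
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:          # uncomment with Ollama running
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```

Existing code written against the OpenAI SDK can usually be pointed at this endpoint by changing only the base URL.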
Performance
Ollama’s performance is essentially identical to llama.cpp. Go server wrapper overhead is negligible, with main bottleneck being llama.cpp backend inference speed.
vLLM comparison (same model, same GPU):
- Single request: Nearly identical latency
- Concurrent requests: vLLM achieves 2–5x higher throughput with continuous batching
Pros and Cons
| Pros | Cons |
|---|---|
| Extremely easy installation/usage | Unsuitable for large-scale serving (weak batching) |
| Model registry ecosystem | Cannot exceed llama.cpp performance |
| Custom model creation via Modelfile | Limited GPU memory optimization |
| Cross-platform | Lower throughput vs vLLM/SGLang |
Suitable Use Cases
- Developer local environment prototyping
- AI accessibility for non-developers
- Internal AI tools for small teams
- LLM testing in CI/CD pipelines
3.7 MLC LLM
- GitHub: mlc-ai/mlc-llm
- Development: CMU/OctoAI (TVM team, Chen et al.)
- Paper: Based on Apache TVM (Chen et al., 2018)
- License: Apache 2.0
Core Technology: TVM Compiler
MLC LLM uses the Apache TVM compiler framework to compile LLMs to native code for various hardware backends.
Compilation pipeline:
HuggingFace model → Relax IR (TVM)
→ Hardware-specific optimization (fusion, tiling, vectorization)
→ Backend-specific code generation:
- CUDA (NVIDIA GPU)
- Metal (Apple GPU)
- Vulkan (Universal GPU)
- OpenCL (Mobile GPU)
- WebGPU (Browser)
- C/LLVM (CPU)
Mobile/Edge Deployment
MLC LLM’s unique strength is LLM inference on mobile devices:
- iOS: Metal backend, Swift bindings
- Android: OpenCL/Vulkan backend, Java/Kotlin bindings
- WebGPU: Direct execution in browsers (web-llm)
Mobile benchmarks (arxiv:2410.03613, 2024):
- Qualcomm Snapdragon 8 Gen 3 with 7B 4-bit model: ~10-15 tokens/s
- Apple A17 Pro with similar setup: ~20+ tokens/s
BentoML Benchmark (Llama 3 8B, A100)
- 10 users: Similar decode performance to LMDeploy, best-in-class TTFT
- 50 users: Still good TTFT
- 100 users: Sharp performance degradation under high load — both decode speed and TTFT lag behind LMDeploy
Pros and Cons
| Pros | Cons |
|---|---|
| Mobile/edge/browser deployment | Compilation stage required (cold start increase) |
| Most comprehensive hardware support | No stable releases (nightly only) |
| WebGPU support (web-llm) | Performance degradation at high concurrency |
| TVM optimization auto-tuning | Learning curve |
Suitable Use Cases
- Embedding LLMs in mobile apps
- Browser-based AI (WebGPU)
- Edge device deployment (Jetson, RPi, etc.)
- Environments with high hardware diversity
3.8 LMDeploy
- GitHub: InternLM/lmdeploy
- Development: Shanghai AI Lab (InternLM team)
- License: Apache 2.0
Core Technology: TurboMind
LMDeploy’s core inference engine, TurboMind, started from NVIDIA FasterTransformer’s GPT-NeoX implementation and was optimized for conversational model inference.
Key optimizations:
- Persistent Batching: Variant of continuous batching that maintains batches while dynamically replacing individual sequences
- Blocked KV Cache: Block-based KV management similar to vLLM PagedAttention, but with different internal layout
- Dynamic Split & Fuse: Dynamically split/fuse attention blocks for optimal GPU utilization
- KV Quantization: Quantize KV cache itself to INT8/INT4
- Weight Quantization: AWQ 4-bit, INT8 support
Performance Benchmarks
BentoML benchmark (Llama 3, A100 80GB):
| Metric | LMDeploy | vLLM | TensorRT-LLM | MLC-LLM | TGI |
|---|---|---|---|---|---|
| Decode (8B, 100 users) | ~4,000 t/s | ~2,400 t/s | ~2,400 t/s | ~2,000 t/s | ~2,300 t/s |
| TTFT (8B, 10 users) | Best | Best | Good | Best | Medium |
| Decode (70B Q4, 100 users) | ~700 t/s | ~450 t/s | ~650 t/s | N/A | ~400 t/s |
InternLM benchmark: After GQA optimization, internlm2-20b achieves 16+ RPS, 1.8x faster than vLLM.
LMDeploy achieves near 100% GPU utilization particularly with quantized models.
Pros and Cons
| Pros | Cons |
|---|---|
| Best-in-class decode throughput | NVIDIA CUDA only |
| Particularly strong in 4-bit inference | Limited model support (~20 models) |
| Easy to use (on-the-fly conversion) | Uneven English/Chinese documentation quality |
| KV quantization support | Smaller community vs vLLM |
Suitable Use Cases
- NVIDIA GPU environments requiring maximum throughput
- Quantized model serving (AWQ 4-bit)
- When using InternLM family models
- Large-scale concurrent serving (stable even at high concurrency)
3.9 Triton Inference Server
- GitHub: triton-inference-server/server
- Development: NVIDIA
- License: BSD 3-Clause
Core Technology
Triton is a universal model serving platform, not LLM-specific but serving various ML models. For LLM serving, primarily used with TensorRT-LLM backend.
Core features:
- Dynamic Batching: Automatically batch multiple requests. Configurable wait time/batch size limits
- Model Ensembles: Configure preprocessing → LLM → postprocessing as pipelines
- Multi-backends: TensorRT, ONNX Runtime, PyTorch, TensorFlow, vLLM, etc.
- Concurrent model serving: Serve multiple models simultaneously on single server
- Model versioning: Model version management for A/B testing
Architecture
Client (HTTP/gRPC) → Triton Server
→ Request Scheduler (dynamic batching)
→ Model Repository
├── Model A (TensorRT-LLM)
├── Model B (ONNX Runtime)
└── Ensemble Pipeline
→ Response Aggregator
Role in LLM Serving
Triton itself doesn’t perform LLM inference optimizations (PagedAttention etc.). Instead:
- TensorRT-LLM Backend: Serve TensorRT-LLM engines via `tensorrtllm_backend`
- vLLM Backend: Use vLLM as a Triton backend
- Actual inference optimization handled by backend engines
Pros and Cons
| Pros | Cons |
|---|---|
| Multi-model serving (LLM + vision + audio) | Complex setup as not LLM-specific |
| Production-proven stability | No LLM optimization when used alone |
| Built-in monitoring, metrics | High learning curve when used with TensorRT-LLM |
| Ensemble pipelines | |
Suitable Use Cases
- Multimodal AI pipelines (LLM + image + audio)
- Large-scale enterprise ML infrastructure
- A/B testing + model versioning needs
- Production serving wrapper for TensorRT-LLM models
3.10 ExLlamaV2
- GitHub: turboderp-org/exllamav2
- Development: turboderp
- License: MIT
Core Technology: EXL2 Quantization
ExLlamaV2’s core innovation is EXL2 (ExLlama v2 Quantization)—quantization mixing different bit counts per layer and tensor.
How it works:
- Measure importance (sensitivity) of each layer/tensor
- Quantize important layers to high bits (6-8bit), less important layers to low bits (2-3bit)
- Match overall model’s average bits to target (e.g., 4.25 bits per weight)
- Achieve higher quality than uniform quantization at same model size
Supported bits: 2, 3, 4, 5, 6, 8 bit and their mixtures
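The budget arithmetic is straightforward: per-layer bit widths, weighted by parameter count, must average out to the target bits per weight. A sketch with hypothetical layer sizes:

```python
def average_bits(layers):
    """EXL2-style budget check: mixed per-layer bit widths must average
    out to the target bits-per-weight across the whole model.

    layers: list of (num_weights, bits) pairs."""
    total_bits = sum(n * b for n, b in layers)
    total_weights = sum(n for n, _ in layers)
    return total_bits / total_weights

# hypothetical split: sensitive layers at 6-bit, bulk at 4-bit, a few at 3-bit
layers = [(10_000, 6), (70_000, 4), (20_000, 3)]
print(average_bits(layers))   # → 4.0
```

The quantizer searches over such assignments, giving high-sensitivity tensors more bits while holding the average at the user's target (e.g. 4.25 bpw).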
Performance Benchmarks
Reddit benchmark (2024, RTX 4090):
- Llama 3 8B: ExLlamaV2 achieves ~14,000 tokens/s in prompt processing (vs llama.cpp’s ~7,500, about 2x)
- At same 4-bit, EXL2 slightly higher quality than GPTQ, similar or slightly lower than GGUF
- ExLlama-based GPTQ execution shows fastest evaluation speed (oobabooga benchmark)
Quality comparison (4-bit, Llama 2 13B, measured by perplexity):
- AWQ: Best quality
- GPTQ ≈ EXL2: Similar
- GGUF (Q4_K_M): Slightly behind
Pros and Cons
| Pros | Cons |
|---|---|
| Best-in-class GPU inference speed | NVIDIA CUDA only |
| Flexible mixed-precision quantization | Limited serving features (inference library) |
| Flash Attention, context caching support | No CPU inference |
| Popular in local communities | No continuous batching |
Suitable Use Cases
- Maximum inference speed on single GPU (with TabbyAPI etc.)
- Precise quantization adjustment to fit memory
- Local AI chat (oobabooga, SillyTavern)
3.11 Ray Serve + vLLM
- Framework: Ray Serve + vLLM
- Development: Anyscale (Ray), UC Berkeley (vLLM)
Core Technology
Ray Serve is a distributed model serving framework that adds autoscaling, monitoring, fault recovery while using vLLM as backend.
Architecture:
Load Balancer → Ray Serve Router
→ Replica 1 (vLLM on GPU 0-1)
→ Replica 2 (vLLM on GPU 2-3)
→ Replica N (autoscaled)
→ Ray Dashboard (monitoring)
Key features:
- Autoscaling: Automatically increase/decrease vLLM instances based on traffic
- Multi-model serving: Serve multiple models simultaneously on one cluster
- Fault recovery: Automatic restart on replica failure
- Disaggregated Serving: Run Prefill and Decode on separate nodes (vLLM’s latest feature)
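The autoscaling decision can be sketched as a simple in-flight-load rule (a toy approximation; Ray Serve's actual policy targets a configured number of ongoing requests per replica with upscale/downscale smoothing, and the parameter names here are illustrative):

```python
def target_replicas(queued, running, per_replica=8, min_r=1, max_r=8):
    """Toy autoscaling rule: size the replica pool to the in-flight load,
    clamped to [min_r, max_r]."""
    demand = -(-(queued + running) // per_replica)   # ceil division
    return max(min_r, min(max_r, demand))

print(target_replicas(queued=30, running=10))   # → 5 replicas for 40 in-flight
print(target_replicas(queued=0, running=0))     # → 1 (never scale below min)
```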
vLLM Large-Scale Serving (December 2025):
- DeepSeek models achieve 2,200 tokens/s per H200 (Wide Expert Parallelism)
- Efficient KV transfer via NIXL/LMCache connectors
- Independent scaling of each phase (prefill/decode) with Ray’s distributed computing
Pros and Cons
| Pros | Cons |
|---|---|
| Production-level autoscaling | Complex setup (Ray + vLLM) |
| Built-in monitoring, fault recovery | Ray cluster management required |
| Multi-model, multi-node serving | Overhead exists |
| Disaggregated serving support | |
Suitable Use Cases
- Large-scale production LLM services
- Environments with high traffic fluctuation
- Multi-model / multi-tenant serving
- Cloud-native AI infrastructure
3.12 PowerInfer
- GitHub: SJTU-IPADS/PowerInfer
- Development: Shanghai Jiao Tong University
- Paper: Song et al., “PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU” (2023)
Core Technology: Neuron-Aware Sparse Inference
PowerInfer leverages activation sparsity in LLMs. In FFN layers, only a portion of neurons actually activate, and which neurons activate frequently (“hot neurons”) can be profiled beforehand.
How it works:
- Offline profiling to analyze activation frequency per neuron
- Hot neurons (frequently activated): Reside on GPU
- Cold neurons (rarely activated): Stored in CPU memory
- Runtime adaptive predictor predicts which neurons will activate
- Neuron-aware sparse operator computes only activated neurons
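The hot/cold placement step can be sketched as follows (a toy version of the offline profiling phase; `gpu_budget` and the zipf-distributed activation counts are illustrative, not from the paper):

```python
import numpy as np

def split_hot_cold(activation_counts, gpu_budget):
    """PowerInfer-style placement sketch: the most frequently activated
    neurons go to GPU, the rest stay in CPU memory."""
    order = np.argsort(activation_counts)[::-1]      # hottest first
    hot = set(order[:gpu_budget].tolist())
    cold = set(order[gpu_budget:].tolist())
    return hot, cold

np.random.seed(0)
counts = np.random.zipf(2.0, size=1000)              # skewed, power-law-like
hot, cold = split_hot_cold(counts, gpu_budget=200)
coverage = counts[list(hot)].sum() / counts.sum()
print(f"{len(hot)} hot neurons cover {coverage:.0%} of activations")
```

The more skewed the activation distribution, the larger the fraction of work a small GPU-resident hot set can absorb, which is the premise behind PowerInfer's speedups.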
Performance Benchmarks
RTX 4090 single GPU:
- Various LLMs including OPT-175B achieve average 13.20 tokens/s, max 29.08 tokens/s
- Only 18% lower performance than an A100 server, despite running on a consumer GPU
- Up to 11x faster inference than llama.cpp (on GPU memory constrained models)
Pros and Cons
| Pros | Cons |
|---|---|
| Run large models on consumer GPUs | Only effective on models with FFN sparsity |
| GPU-CPU hybrid overcomes VRAM constraints | Profiling stage required |
| Dramatic performance improvement vs llama.cpp | Limited GQA/MoE model support |
| | No production serving features |
Suitable Use Cases
- Running large models on VRAM-limited consumer GPUs
- Models with strong sparse activation patterns like OPT, Falcon
- Research/experimental purposes
3.13 Aphrodite Engine
- GitHub: aphrodite-engine/aphrodite-engine
- Development: PygmalionAI
- License: Apache 2.0
- GitHub Stars: ~1.6k
Core Technology
Aphrodite is a vLLM fork optimized for RP/storytelling community needs.
Features added over vLLM:
- Enhanced sampling parameters (fine control of temperature, repetition penalty, etc.)
- EXL2, GGUF quantization format support (vLLM focuses on GPTQ/AWQ)
- Rapid response to community requests
- PagedAttention KV cache management (vLLM-based)
- Continuous batching (async server)
Pros and Cons
| Pros | Cons |
|---|---|
| vLLM-based high performance | May lag in tracking vLLM upstream |
| Various quantization format support | Community-focused rather than production environment |
| Fine-grained sampling control | Limited documentation/support |
Suitable Use Cases
- RP/storytelling serving (SillyTavern, etc.)
- When wanting to serve EXL2/GGUF models on server
- When needing sampling features absent in vLLM
3.14 LocalAI
- GitHub: mudler/LocalAI
- Development: mudler and community
- License: MIT
Core Technology
LocalAI is a fully OpenAI API compatible local AI server that integrates various backends.
Multi-backend architecture:
OpenAI-compatible API (/v1/chat/completions, /v1/images, /v1/audio, etc.)
├── llama.cpp (text generation)
├── whisper.cpp (speech recognition)
├── stable-diffusion.cpp (image generation)
├── bark (TTS)
├── piper (TTS)
└── other backends
2025 features:
- LocalAI Core (text, image, audio, vision APIs)
- LocalAGI (autonomous agents)
- LocalRecall (semantic search)
- P2P distributed inference
- Constrained grammars (structured output)
Pros and Cons
| Pros | Cons |
|---|---|
| Complete OpenAI API drop-in replacement | Lacks performance optimization vs individual backends |
| Text+image+audio all-in-one | Setup complexity |
| P2P distributed support | Documentation insufficient for community size |
| Easy Docker-based deployment | |
Suitable Use Cases
- Converting existing OpenAI API code to local
- Multimodal AI (text+image+audio) from single server
- Privacy-sensitive environments
3.15 DeepSpeed-MII
- GitHub: deepspeedai/DeepSpeed-MII
- Development: Microsoft DeepSpeed team
- License: Apache 2.0
Core Technology
DeepSpeed-MII is a serving framework utilizing Microsoft’s DeepSpeed library’s inference optimizations.
4 core technologies:
- DeepSpeed-Inference: Accelerate Transformer inference with custom CUDA kernels
- ZeRO-Inference: When model doesn’t fit single GPU, utilize CPU memory/NVMe for offloading. Enable single GPU serving of models like Bloom-176B
- DeepSpeed-FastGen: Continuous batching + Dynamic SplitFuse (dynamically split/combine prefill and decode)
- Tensor Parallelism: Multi-GPU parallel inference
Dynamic SplitFuse: Split long prompt prefill across multiple iterations and fuse with decode tokens to maintain uniform GPU utilization.
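The token-budget scheduling can be sketched as follows (a simplified model of the idea; the real implementation also decides which sequences run each step and juggles multiple concurrent prefills):

```python
def split_fuse(prefill_tokens, decode_seqs, budget=512):
    """Dynamic SplitFuse sketch: each iteration carries one decode token
    per running sequence plus a prefill chunk that fills the remainder of
    a fixed token budget, keeping per-step work uniform."""
    schedule, done = [], 0
    while done < prefill_tokens:
        chunk = min(budget - decode_seqs, prefill_tokens - done)
        schedule.append({"decode": decode_seqs, "prefill": chunk})
        done += chunk
    return schedule

for step in split_fuse(prefill_tokens=1200, decode_seqs=32, budget=512):
    print(step)   # three uniform steps instead of one 1200-token prefill burst
```

Holding every step near the same token budget avoids the latency spikes that a monolithic long prefill would otherwise impose on in-flight decode requests.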
Performance
DeepSpeed-FastGen blog (2023):
- Up to 2.3x throughput, up to 2x latency reduction vs vLLM (specific workloads)
- However, gap has narrowed in recent comparisons as vLLM significantly evolved
Pros and Cons
| Pros | Cons |
|---|---|
| ZeRO-Inference for ultra-large model deployment | Decreasing development activity trend |
| Official Microsoft support | Lags behind vLLM/SGLang in performance (recent basis) |
| Dynamic SplitFuse technique | Limited model support range |
| Azure integration | Insufficient documentation/examples |
Suitable Use Cases
- Single GPU serving of ultra-large models (ZeRO-Inference)
- Azure/Microsoft ecosystem
- Integration with DeepSpeed training pipelines
3.16 OpenLLM (BentoML)
- GitHub: bentoml/OpenLLM
- Development: BentoML
- License: Apache 2.0
Core Technology
OpenLLM is an LLM serving tool built on BentoML framework, managing the entire lifecycle from model packaging to cloud deployment.
Features:
- Bento packaging: Package model + dependencies + serving code together
- OpenAI-compatible API
- Swappable inference backends: Use vLLM, TensorRT-LLM, etc. as backends
- BentoCloud deployment: One-click cloud deployment
- LangChain integration
Pros and Cons
| Pros | Cons |
|---|---|
| Model lifecycle management | Inference performance depends on backend |
| BentoCloud one-click deployment | Possible overhead from indirect backend usage |
| Various backend support | Limited community size |
| LangChain integration | |
Suitable Use Cases
- Teams needing ML model packaging/deployment pipelines
- BentoCloud users
- Serving LLM + other ML models together
3.17 CTranslate2
- GitHub: OpenNMT/CTranslate2
- Development: OpenNMT (SYSTRAN)
- License: MIT
Core Technology
CTranslate2 is an engine that converts Transformer models to optimized C++ format for inference. Originally developed for machine translation (NMT), expanded to LLMs.
Optimization techniques:
- Layer Fusion: Combine consecutive layers into single operations
- Padding Removal: Remove padding within batches to prevent unnecessary computation
- Batch Reordering: Sort sequences by length within batches for efficiency improvement
- In-place Operations: Minimize memory allocation
- Caching Mechanism: Cache repetitive operation results
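Batch reordering is easy to illustrate: sorting by length before forming sub-batches shrinks the padding each sub-batch must carry (a toy example with placeholder token lists):

```python
def reorder_by_length(batch):
    """Batch-reordering sketch: sort sequences by length so sub-batches
    group similar lengths; 'order' lets outputs be restored afterwards."""
    order = sorted(range(len(batch)), key=lambda i: len(batch[i]), reverse=True)
    return [batch[i] for i in order], order

def padding_waste(batch):
    """Padded positions needed to pack this batch into a rectangle."""
    longest = max(len(s) for s in batch)
    return sum(longest - len(s) for s in batch)

batch = [[1] * 5, [1] * 40, [1] * 7, [1] * 38]
# split into sub-batches of 2, with and without reordering
naive = padding_waste(batch[:2]) + padding_waste(batch[2:])
sorted_batch, _ = reorder_by_length(batch)
smart = padding_waste(sorted_batch[:2]) + padding_waste(sorted_batch[2:])
print(naive, smart)   # → 66 4: reordering cuts padded positions sharply
```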
Quantization: Supports INT8, INT16, Float16. INT8 models are 3.53x faster than Float32 (AMD ROCm benchmark).
Primary use case: Faster-Whisper (high-speed Whisper speech recognition implementation) uses CTranslate2 as core backend.
Pros and Cons
| Pros | Cons |
|---|---|
| Excellent CPU performance | No LLM-specific optimizations (PagedAttention, etc.) |
| Lightweight, minimal dependencies | Limited model support (mainly encoder-decoder) |
| Production-proven (translation services) | Decreasing community activity |
| AMD ROCm support | Slow support for latest LLM architectures |
Suitable Use Cases
- Machine translation serving
- Whisper-based speech recognition (Faster-Whisper)
- Transformer inference in CPU-only environments
- Lightweight deployment
3.18 Candle
- GitHub: huggingface/candle
- Development: Hugging Face
- License: Apache 2.0/MIT
Core Technology
Candle is a minimal ML framework written in Rust, providing PyTorch-like API with Rust’s safety and performance.
Features:
- Pure Rust implementation (no libtorch/Python dependencies)
- CUDA, Metal backend support
- Native HuggingFace Hub integration
- WASM target (browser execution)
- Flash Attention support (CUDA feature flag)
Ecosystem:
- candle-transformers: Major model implementations (LLaMA, Mistral, Phi, etc.)
- candle-einops: Rust einops implementation
- atoma-infer: Large-scale inference library based on Candle (FlashAttention2, PagedAttention)
Pros and Cons
| Pros | Cons |
|---|---|
| Rust memory safety/performance | Inference-only (no training support) |
| Python dependency elimination | Fewer model implementations vs Python ecosystem |
| WASM support (serverless/browser) | Small community size |
| Lightweight binaries | Absence of high-level serving features |
Suitable Use Cases
- Embedding ML in Rust-based applications
- Lightweight inference in serverless/edge
- WASM-based browser AI
- Direct HuggingFace model usage in Rust
4. Technology Comparison Analysis
4.1 KV Cache Management Comparison
| Method | Tools | Core Idea | Memory Efficiency | Prefix Reuse | Complexity |
|---|---|---|---|---|---|
| PagedAttention | vLLM, Aphrodite | Store KV in fixed blocks non-contiguously using OS paging techniques | ★★★★★ | △ (hash-based) | Medium |
| RadixAttention | SGLang | Automatically share prefix via radix tree | ★★★★★ | ★★★★★ | High |
| Blocked KV Cache | LMDeploy TurboMind | Block grid-based management, split & fuse optimization | ★★★★☆ | △ | Medium |
| Paged + Quantized KV | TensorRT-LLM | Block-based + INT8/FP8 KV quantization | ★★★★★ | ○ (CPU offloading) | High |
| Contiguous | llama.cpp, ExLlamaV2 | Contiguous memory, pre-allocation | ★★☆☆☆ | ✗ | Low |
Key insights:
- Fragmentation elimination: PagedAttention (vLLM) became standard. Reduced memory waste from 60-80% to under 5%
- Prefix reuse: RadixAttention (SGLang) achieves highest cache hit rates. 85-95% in few-shot vs PagedAttention’s 15-25%
- KV quantization: Supported by TensorRT-LLM and LMDeploy. Quantizing KV to FP8/INT8 saves 50% memory with minimal quality loss
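The mechanics behind fragmentation elimination can be shown with a toy block table — the core bookkeeping of PagedAttention, minus the actual KV tensors. The class and names below are illustrative, not vLLM's API:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is also 16)

class BlockTable:
    """Toy PagedAttention bookkeeping: each sequence's logical blocks map to
    physical blocks drawn from a shared free pool. No real KV data stored."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # physical block ids
        self.tables = {}                              # seq_id -> [physical ids]

    def append_token(self, seq_id, num_tokens_so_far):
        table = self.tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:       # block boundary: grab one
            table.append(self.free.pop(0))

    def free_seq(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))     # blocks return to the pool
```

Because blocks are fixed-size and non-contiguous, the only waste is the partially filled last block of each sequence — at most BLOCK_SIZE - 1 token slots — which is how vLLM brings memory waste below 5%.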
4.2 Quantization Method Comparison
| Method | Bits | Process | GPU Required | Quality | Speed | Compatible Tools |
|---|---|---|---|---|---|---|
| GPTQ | 4bit (mainly) | Post-training, Hessian-based | Required for quantization | ★★★★☆ | ★★★★★ (ExLlama) | vLLM, TGI, ExLlamaV2 |
| AWQ | 4bit | Activation-aware weight quant | Required for quantization | ★★★★★ | ★★★★☆ | vLLM, LMDeploy, TGI |
| EXL2 | 2-8bit mixed | Per-layer mixed precision | Required for quantization | ★★★★☆ | ★★★★★ | ExLlamaV2, Aphrodite |
| GGUF | 2-8bit | K-quant super-block | CPU possible | ★★★★☆ | ★★★☆☆ (CPU) | llama.cpp, Ollama, LocalAI |
| FP8 | 8bit | 8-bit floating point | Hopper GPU | ★★★★★ | ★★★★★ | TensorRT-LLM, vLLM |
| bitsandbytes | 4/8bit | NF4, INT8 | Required | ★★★☆☆ | ★★★☆☆ | TGI, HF Transformers |
Quality ranking (same 4-bit, perplexity basis): AWQ > GPTQ ≈ EXL2 > GGUF Q4_K_M > bitsandbytes NF4
Speed ranking (GPU, 4-bit): EXL2 (ExLlamaV2) > GPTQ (ExLlama backend) > AWQ (vLLM) > GGUF (llama.cpp GPU offload)
Key selection criteria:
- GPU serving, maximum speed: EXL2 (ExLlamaV2) or GPTQ (ExLlama backend)
- GPU serving, highest quality: AWQ (vLLM/LMDeploy)
- CPU/hybrid inference: GGUF (llama.cpp)
- NVIDIA Hopper, production: FP8 (TensorRT-LLM)
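The formats above differ in calibration (Hessian-based, activation-aware, mixed precision) and kernel design, but they share a group-wise core: a low-bit integer per weight plus a scale (and often a zero point) per small group. A toy asymmetric 4-bit version — group size and rounding are illustrative; real GPTQ/AWQ/GGUF kernels differ substantially:

```python
def quantize_4bit_groups(weights, group_size=4):
    """Asymmetric 4-bit group quantization: per group, store a scale and a
    zero point, plus one 4-bit integer (0..15) per weight."""
    groups = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / 15 or 1.0     # 16 levels; avoid div-by-zero
        q = [round((w - lo) / scale) for w in g]
        groups.append((q, scale, lo))
    return groups

def dequantize(groups):
    out = []
    for q, scale, zero in groups:
        out.extend(v * scale + zero for v in q)
    return out
```

Smaller groups track the weight distribution more tightly (better quality) but store more scales (worse compression) — the tradeoff behind GGUF's K-quant super-blocks and GPTQ's typical group size of 128.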
4.3 Batching Strategy Comparison
| Strategy | Description | GPU Utilization | Latency | Supporting Tools |
|---|---|---|---|---|
| Static Batching | Wait until all sequences in batch complete | ★★☆☆☆ | High (bound by longest sequence) | Basic HF Transformers |
| Continuous Batching | Insert new requests immediately upon sequence completion | ★★★★☆ | Low | vLLM, SGLang, TGI, Aphrodite |
| In-flight Batching | NVIDIA’s continuous batching implementation, iteration-level scheduling | ★★★★★ | Very low | TensorRT-LLM, Triton |
| Persistent Batching | Maintain batches while dynamically replacing individual sequences | ★★★★★ | Low | LMDeploy |
| Dynamic SplitFuse | Dynamically split/combine Prefill and decode | ★★★★☆ | Low | DeepSpeed-MII |
Key insight: Evolution from Static → Continuous → In-flight/Persistent. All modern serving engines use continuous batching or better.
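The gap between static and continuous batching is easy to see in a toy iteration-level scheduler. This is a simplification under assumed semantics (one token per sequence per step, no prefill/decode distinction), not any engine's real scheduler:

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """requests: list of (req_id, tokens_to_generate).
    Returns {req_id: decode step at which it finished}."""
    queue, running, done, step = deque(requests), {}, {}, 0
    while queue or running:
        # iteration-level scheduling: admit new work the moment a slot frees up
        while queue and len(running) < max_batch:
            rid, n = queue.popleft()
            running[rid] = n
        step += 1                  # one decode iteration for the whole batch
        for rid in list(running):
            running[rid] -= 1      # each running sequence emits one token
            if running[rid] == 0:
                done[rid] = step
                del running[rid]
    return done
```

With requests of lengths 1, 3, and 2 and a batch limit of 2, the third request starts the moment the first finishes, and everything completes in 3 steps; static batching would hold it until the whole first batch drained, finishing at step 5.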
4.4 Attention Optimization Comparison
| Technique | Paper | Core Idea | Main Effect | Using Tools |
|---|---|---|---|---|
| Flash Attention | Dao et al., 2022 | Minimize HBM access via SRAM tiling | Memory savings + 2-4x speed improvement | TGI, SGLang, Candle |
| Flash Attention 2 | Dao, 2023 | Improved work partitioning, sequence parallelization | 2x additional improvement over FA1 | Most modern engines |
| Flash Attention 3 | 2024 | Hopper asynchronous execution, FP8 support | Additional improvement over FA2 (especially H100) | SGLang (latest) |
| PagedAttention | Kwon et al., 2023 | Block-based KV management + attention | Memory efficiency maximization | vLLM, TGI, Aphrodite |
| FlashInfer | 2024 | Shared prefix batch decoding optimization, cascading | Up to 31x faster than vLLM on shared prefix | SGLang, vLLM (integrating) |
| FlexAttention | PyTorch, 2024 | BlockMask + page table integration | Combine flexible mask + paged attention | PyTorch native |
FlashInfer detail:
- When shared prefix is 32,768 tokens and batch size 256, up to 31x speed improvement vs basic PagedAttention
- Cascading technique computes shared prefix attention only once
FA3 benchmark: In SGLang, FA3 surpasses both FlashInfer and Triton backends, with gap widening as input/output size increases.
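Every FlashAttention generation rests on the same trick: compute softmax incrementally over tiles with a running max and normalizer, so the full attention matrix never has to materialize in HBM. A one-query, scalar-value sketch of the rescaling math (real kernels tile K/V matrices in SRAM and vectorize all of this):

```python
import math

def online_softmax_weighted_sum(scores, values, tile=2):
    """One query row: accumulate softmax(scores)·values tile by tile, keeping
    a running max m and normalizer d — FlashAttention's online softmax."""
    m, d, acc = float("-inf"), 0.0, 0.0
    for i in range(0, len(scores), tile):
        s_t, v_t = scores[i:i + tile], values[i:i + tile]
        m_new = max(m, max(s_t))
        corr = math.exp(m - m_new)  # 0.0 on the first tile (m = -inf)
        d *= corr                   # rescale what we accumulated so far
        acc *= corr
        for s, v in zip(s_t, v_t):
            w = math.exp(s - m_new)
            d += w
            acc += w * v
        m = m_new
    return acc / d
```

The result matches a full-softmax reference exactly (up to floating-point error), which is why FlashAttention is an exact method rather than an approximation.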
4.5 Speculative Decoding Support Status
Speculative decoding is a technique in which a small “draft model” rapidly proposes multiple tokens and a large “target model” verifies them in a single forward pass (Leviathan et al., 2023; Chen et al., 2023).
| Tool | Support | Draft Model Method | Performance Improvement |
|---|---|---|---|
| vLLM | ✅ | Separate small model, n-gram, MLPSpeculator | 2-3x (workload dependent) |
| SGLang | ✅ | EAGLE, EAGLE 2, EAGLE 3 (2025 latest) | 2-4x |
| TensorRT-LLM | ✅ | Draft model, Medusa heads | 2-3x |
| TGI | ✅ | Medusa | 2x |
| LMDeploy | △ (experimental) | - | - |
| llama.cpp | ✅ | Draft model | 1.5-2x |
| ExLlamaV2 | △ | - | - |
| Others | ✗ | - | - |
EAGLE 3 (SGLang, December 2025): LMSYS ships bundled EAGLE 3 draft models for popular base models. Groq reports a 6x+ speedup on Llama-3.1-70B, and SambaNova reports 2x+ on Llama-3.1-405B.
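The verification step can be sketched for the greedy case. This is a hedged simplification: real implementations score all draft tokens with one batched target forward pass (here `target_next_token`, a hypothetical stand-in, is called per position) and use a probabilistic accept/reject rule when sampling:

```python
def verify_draft(target_next_token, prefix, draft_tokens):
    """Greedy speculative verification sketch: accept draft tokens while they
    match the target model's own choice; on the first mismatch, emit the
    target's correction; if all match, emit one bonus target token."""
    accepted, seq = [], list(prefix)
    for t in draft_tokens:
        expected = target_next_token(seq)
        if t == expected:
            accepted.append(t)
            seq.append(t)
        else:
            accepted.append(expected)           # target's correction
            return accepted
    accepted.append(target_next_token(seq))     # bonus token
    return accepted
```

Each verification round yields between 1 and k+1 tokens for the cost of roughly one target forward pass, which is where the 2-4x speedups come from when the draft model's acceptance rate is high.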
4.6 Prefix Caching Comparison
| Tool | Method | Cache Hit Rate (few-shot) | Cache Hit Rate (chat) | Implementation |
|---|---|---|---|---|
| SGLang | RadixAttention (radix tree) | 85-95% | 60-85% | Token sequence-based tree |
| vLLM | Hash-based prefix caching | 15-25% | 30-50% | Block hash matching |
| TensorRT-LLM | KV Cache Reuse + CPU offloading | Medium | Medium | CPU-GPU transfer |
| TGI v3 | Prefix KV caching | Medium-High | High (long history) | Chunk-based |
| LMDeploy | Blocked KV reuse | Low-Medium | Medium | Block matching |
Key insight: For workloads with heavy prefix reuse (agents, few-shot, shared system prompts), SGLang's RadixAttention is the clear winner. The gap narrows for simple chatbot serving.
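RadixAttention's bookkeeping reduces to longest-prefix matching over token sequences. A toy trie version — the real SGLang implementation uses a compressed radix tree with LRU eviction and ties each node to KV cache blocks, none of which appears here:

```python
class RadixCache:
    """Toy RadixAttention-style index: a trie over token ids. A prefix hit
    means the KV cache for those tokens could be reused, not recomputed."""
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            n += 1
        return n  # number of tokens whose KV is reusable
```

Because few-shot prompts share long identical prefixes across requests, almost every query is a deep hit, which is where the 85-95% hit rates come from.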
4.7 Distributed Inference Comparison
| Method | Description | Advantages | Disadvantages | Supporting Tools |
|---|---|---|---|---|
| Tensor Parallelism(TP) | Split single layer across multiple GPUs | Low latency | All-reduce communication needed, requires high bandwidth between GPUs | vLLM, SGLang, TensorRT-LLM, LMDeploy, TGI |
| Pipeline Parallelism(PP) | Sequential layer placement across GPUs | Low communication overhead | Pipeline bubbles, high latency | TensorRT-LLM, DeepSpeed |
| Expert Parallelism(EP) | Distribute MoE model experts across GPUs | Optimal for MoE models | MoE-only | vLLM (Wide-EP), SGLang |
| Disaggregated Serving | Run Prefill and Decode on separate nodes | Independent scaling per phase | KV transfer overhead | vLLM (NIXL), SGLang |
| Sequence Parallelism | Split long sequences | Useful for long context | Complex implementation | DeepSpeed, Ring Attention |
vLLM’s latest distributed serving (December 2025):
- DeepSeek models achieve 2,200 tokens/s per H200 with Wide Expert Parallelism
- Efficient KV transfer via NIXL/LMCache connectors for prefill-decode separation
- Independent autoscaling based on Ray
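Mechanically, tensor parallelism means each device holds a weight shard, computes a partial result, and an all-reduce combines them — which is why inter-GPU bandwidth dominates. A row-parallel matrix-vector sketch, with devices simulated as list slices and a plain elementwise `sum` standing in for the NCCL all-reduce:

```python
def matvec(W, x):
    """Reference y = W x for a list-of-rows matrix and a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, num_devices=2):
    """Row-parallel sketch: split the input dimension across devices; each
    computes a partial output; the all-reduce sums the partials."""
    n = len(x)
    chunk = n // num_devices
    partials = []
    for d in range(num_devices):
        cols = slice(d * chunk, (d + 1) * chunk if d < num_devices - 1 else n)
        W_shard = [row[cols] for row in W]         # this device's weight slice
        partials.append(matvec(W_shard, x[cols]))  # partial result, no comms yet
    # all-reduce: elementwise sum of the partial outputs across devices
    return [sum(p[i] for p in partials) for i in range(len(W))]
```

The all-reduce happens once per layer, so TP gives low latency but demands NVLink-class bandwidth; pipeline parallelism communicates less but pays in bubbles, as the table above notes.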
5. Comprehensive Comparison Tables
5.1 Feature Comparison
| Tool | Language | Continuous Batching | PagedAttention | Quantization | Speculative Decoding | Distributed Inference | OpenAI API |
|---|---|---|---|---|---|---|---|
| vLLM | Python/C++ | ✅ | ✅ | AWQ,GPTQ,FP8 | ✅ | TP | ✅ |
| SGLang | Python/C++ | ✅ | ✅ (RadixAttn) | AWQ,GPTQ,FP8 | ✅ (EAGLE3) | TP,EP | ✅ |
| TensorRT-LLM | Python/C++ | ✅ (in-flight) | ✅ | FP8,INT4,INT8 | ✅ | TP,PP | via Triton |
| TGI | Rust/Python | ✅ | ✅ | AWQ,GPTQ,bnb | ✅ (Medusa) | TP | ✅ |
| llama.cpp | C/C++ | △ | ✗ | GGUF (2-8bit) | ✅ | ✗ | ✅ |
| Ollama | Go/C++ | △ | ✗ | GGUF | ✗ | ✗ | ✅ |
| MLC LLM | Python/C++ | ✅ | ✅ | 3-4bit | ✗ | ✗ | ✅ |
| LMDeploy | Python/C++ | ✅ (persistent) | ✅ (blocked) | AWQ,INT8,KV quant | △ | TP | ✅ |
| Triton Server | C++/Python | ✅ (dynamic) | via backend | via backend | via backend | via backend | ✗ |
| ExLlamaV2 | Python/C++ | ✗ | ✗ | EXL2,GPTQ | △ | ✗ | via TabbyAPI |
| Ray Serve+vLLM | Python | ✅ | ✅ | vLLM all | ✅ | TP+multi-node | ✅ |
| PowerInfer | C/C++ | ✗ | ✗ | GGUF | ✗ | ✗ | ✗ |
| Aphrodite | Python/C++ | ✅ | ✅ | EXL2,GGUF,AWQ,GPTQ | ✗ | TP | ✅ |
| LocalAI | Go/C++ | △ | ✗ | GGUF | ✗ | P2P | ✅ |
| DeepSpeed-MII | Python/C++ | ✅ | ✗ | INT8 | ✗ | TP,PP | ✅ |
| OpenLLM | Python | via backend | via backend | via backend | via backend | via backend | ✅ |
| CTranslate2 | C++/Python | △ | ✗ | INT8,INT16 | ✗ | ✗ | ✗ |
| Candle | Rust | ✗ | △ (atoma-infer) | ✗ | ✗ | ✗ | ✗ |
5.2 Hardware Support
| Tool | NVIDIA CUDA | AMD ROCm | Apple Metal | CPU | Mobile | WebGPU |
|---|---|---|---|---|---|---|
| vLLM | ✅ | ✅ | ✗ | ✅ | ✗ | ✗ |
| SGLang | ✅ | ✅ | ✗ | ✗ | ✗ | ✗ |
| TensorRT-LLM | ✅ | ✗ | ✗ | ✗ | ✗ | ✗ |
| TGI | ✅ | ✅ | ✗ | ✗ | ✗ | ✗ |
| llama.cpp | ✅ | ✅ | ✅ | ✅ | ✅ | ✗ |
| Ollama | ✅ | ✅ | ✅ | ✅ | ✗ | ✗ |
| MLC LLM | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| LMDeploy | ✅ | ✗ | ✗ | ✗ | ✗ | ✗ |
| ExLlamaV2 | ✅ | ✗ | ✗ | ✗ | ✗ | ✗ |
| PowerInfer | ✅ | ✗ | ✗ | ✅ (hybrid) | ✗ | ✗ |
| LocalAI | ✅ | ✅ | ✅ | ✅ | ✗ | ✗ |
| Candle | ✅ | ✗ | ✅ | ✅ | ✗ | ✅ (WASM) |
5.3 Performance Tiers (2025 basis, approximate ranking)
GPU Serving Throughput (high concurrency, A100/H100):
- 🥇 LMDeploy (TurboMind) — especially quantized models
- 🥇 SGLang — workloads with high prefix reuse
- 🥈 TensorRT-LLM — optimal performance after engine build
- 🥈 vLLM — general-purpose champion
- 🥉 TGI — slightly behind vLLM
- DeepSpeed-MII, MLC LLM
Single Request Latency:
- 🥇 TensorRT-LLM (compiled kernels)
- 🥈 SGLang / vLLM
- 🥉 LMDeploy
Consumer GPU (single user):
- 🥇 ExLlamaV2 — highest speed
- 🥈 llama.cpp (GPU offload)
- 🥉 Ollama / PowerInfer
6. Scenario-Based Recommendations
Scenario 1: Production Chatbot Service
Recommendation: vLLM or SGLang + Ray Serve
- High concurrency, stable TTFT needed
- If multi-turn chat, SGLang (RadixAttention advantage)
- Add Ray Serve if autoscaling needed
Scenario 2: NVIDIA-only, Maximum Performance
Recommendation: TensorRT-LLM + Triton
- Fixed models with engine build investment feasible
- Maximum throughput with FP8 (H100)
- Enterprise-level stability
Scenario 3: Local Development / Prototyping
Recommendation: Ollama
- 5-minute installation + execution
- Simple model management via model registry
Scenario 4: CPU Server / GPU-less Environment
Recommendation: llama.cpp or CTranslate2
- llama.cpp: General LLM, various quantizations
- CTranslate2: Specialized for translation/Whisper etc.
Scenario 5: Mobile App / Browser
Recommendation: MLC LLM (mobile), llama.cpp (mobile), Candle (WASM)
- MLC LLM: Most comprehensive mobile support
- web-llm: WebGPU-based browser execution
Scenario 6: Single GPU, Large Model
Recommendation: PowerInfer (sparse models) or DeepSpeed-MII (ZeRO-Inference)
- Run GPU memory-exceeding models with CPU offloading
Scenario 7: Agent / Tool-use / Structured Output
Recommendation: SGLang
- Maximize prefix reuse with RadixAttention
- JSON output optimization with Compressed FSM
- Compose complex LLM pipelines with DSL
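The idea behind FSM-constrained structured output can be shown with a toy grammar. This is only an illustration of the masking principle — SGLang's compressed FSM operates on real tokenizer vocabularies and compresses deterministic transition chains, which this sketch does not:

```python
# Toy grammar for a JSON fragment like {"ok": true}, with coarse tokens.
FSM = {  # state -> {legal token: next state}
    0: {'{': 1},
    1: {'"ok"': 2},
    2: {':': 3},
    3: {'true': 4, 'false': 4},
    4: {'}': 5},                 # state 5 = accept
}

def allowed_tokens(state):
    """The logit mask: every token outside this set gets probability zero."""
    return sorted(FSM.get(state, {}))

def constrained_decode(pick):
    """pick(allowed) stands in for argmax over the masked model logits."""
    state, out = 0, []
    while state != 5:
        tok = pick(allowed_tokens(state))
        out.append(tok)
        state = FSM[state][tok]
    return out
```

Whatever the model "prefers", the mask guarantees the output parses — invalid JSON is unrepresentable, not merely unlikely.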
Scenario 8: OpenAI API Drop-in Replacement
Recommendation: LocalAI
- Full /v1/chat/completions compatibility
- Text + image + audio all-in-one
7. Conclusion
As of 2025, the LLM serving ecosystem is maturing, with tools differentiating into distinct niches.
Key Trends
- vLLM and SGLang's two-horse race: vLLM dominates general-purpose serving, while SGLang leads in structured workloads. This split is strengthening as TGI enters maintenance mode.
- KV cache management innovation: PagedAttention became the standard, and RadixAttention opened new possibilities for prefix reuse. KV quantization (FP8) is the next frontier in memory efficiency.
- Speculative decoding everywhere: 2-4x speedups via EAGLE 3, Medusa, and similar methods are becoming routine, now supported by all major engines.
- Disaggregated serving: Architectures that separate prefill and decode for independent scaling are emerging as the new standard for large-scale serving.
- Consumer hardware accessibility: The llama.cpp/Ollama ecosystem democratized local AI, and PowerInfer is pushing the limits of consumer GPUs.
Selection Guide Summary
| Priority | Recommended Tool |
|---|---|
| General production | vLLM |
| Maximum throughput (NVIDIA) | LMDeploy or TensorRT-LLM |
| Agent/structured output | SGLang |
| Easy local execution | Ollama |
| Mobile/edge | MLC LLM |
| Maximum single GPU speed | ExLlamaV2 |
| Hardware versatility | llama.cpp |
8. References
Core Papers
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., … & Stoica, I. (2023). “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. [arXiv:2309.06180]
- Zheng, L., Yin, L., Xie, Z., Huang, J., Sun, C., Yu, C. H., … & Stoica, I. (2023). “SGLang: Efficient Execution of Structured Language Model Programs.” [arXiv:2312.07104]
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS 2022. [arXiv:2205.14135]
- Dao, T. (2023). “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” ICLR 2024. [arXiv:2307.08691]
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” ICLR 2023. [arXiv:2210.17323]
- Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2024). “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” MLSys 2024. [arXiv:2306.00978]
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). “Fast Inference from Transformers via Speculative Decoding.” ICML 2023. [arXiv:2211.17192]
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J. B., Sifre, L., & Jumper, J. (2023). “Accelerating Large Language Model Decoding with Speculative Sampling.” [arXiv:2302.01318]
- Yu, G. I., Jeong, J. S., Kim, G. W., Kim, S., & Chun, B. G. (2022). “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022.
- Song, Y., Mi, Z., Xie, H., & Chen, H. (2023). “PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU.” [arXiv:2312.12456]
- Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., … & Krishnamurthy, A. (2018). “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” OSDI 2018.
- Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.” ICML 2024. [arXiv:2401.15077]
Benchmark Sources
- BentoML. (2024). “Benchmarking LLM Inference Backends.” https://www.bentoml.com/blog/benchmarking-llm-inference-backends
- LMSYS. (2024). “Achieving Faster Open-Source Llama3 Serving with SGLang Runtime.” https://lmsys.org/blog/2024-07-25-sglang-llama3/
- LMSYS. (2024). “Fast and Expressive LLM Inference with RadixAttention and SGLang.” https://lmsys.org/blog/2024-01-17-sglang/
- Clarifai. (2025). “Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B.” https://www.clarifai.com/blog/comparing-sglang-vllm-and-tensorrt-llm-with-gpt-oss-120b
- MarkTechPost. (2025). “Comparing the Top 6 Inference Runtimes for LLM Serving in 2025.” https://www.marktechpost.com/2025/11/07/comparing-the-top-6-inference-runtimes-for-llm-serving-in-2025/
- FlashInfer. (2024). “Accelerating Self-Attentions for LLM Serving with FlashInfer.” https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- oobabooga. (2023). “A detailed comparison between GPTQ, AWQ, EXL2, q4_K_M.” https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/
- vLLM Blog. (2025). “Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP.” https://blog.vllm.ai/2025/12/17/large-scale-serving.html
This article is based on information available as of February 2026. The LLM serving ecosystem evolves rapidly, so please check each tool's official documentation and latest releases.