Complete LLM Serving Engine Guide — In-Depth Analysis of 18 Tools

Category: AI Concepts
Tags: LLM, vLLM, SGLang, TensorRT-LLM, inference, serving, quantization

Last Updated: February 2026
Target Audience: ML Engineers, MLOps, Infrastructure Architects
Scope: 18 production LLM serving tools + core technology comparative analysis


1. Introduction

The most critical bottleneck in practical deployment of Large Language Models (LLMs) is inference serving. Serving models with tens to hundreds of billions of parameters in real-time requires solving various technical challenges including memory management, batching strategies, attention optimization, and quantization.

The LLM serving tool ecosystem has exploded since 2023. After vLLM’s PagedAttention changed the paradigm of KV cache memory management, various approaches have emerged including SGLang’s RadixAttention, TensorRT-LLM’s FP8 optimization, and llama.cpp’s consumer hardware accessibility expansion.

This article provides paper-level depth analysis of 18 major LLM serving/inference tools and systematically compares their core technologies.

Core Evaluation Metrics

| Metric | Description |
|---|---|
| Throughput | Tokens generated per second (tokens/s) |
| TTFT | Time to First Token: latency until the first token is emitted |
| Latency (P50/P99) | Per-request response latency |
| Memory Efficiency | GPU/CPU memory usage efficiency |
| Scalability | Ability to maintain performance as concurrent users increase |

2. Core Technology Concepts

2.1 KV Cache

A mechanism to avoid duplicate computation by reusing Key-Value tensors from previous tokens during Transformer decoding. In LLM serving, KV cache occupies 30–60% of total GPU memory, making efficient management crucial for serving performance.
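A back-of-the-envelope sketch makes the memory pressure concrete. The shape parameters below are illustrative (a Llama-2-7B-like configuration), not measurements from any particular engine:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Two tensors (K and V) per layer, each of shape [batch, n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes): 0.5 MiB of KV cache per token
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)
assert per_token == 524288  # bytes

# A single 4096-token sequence already consumes ~2 GiB
assert kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) == 524288 * 4096
```

At roughly 0.5 MiB per token, a few dozen concurrent long sequences consume most of an 80 GB GPU, which is why the management schemes below matter.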

2.2 Continuous Batching

While static batching must wait until every sequence in the batch completes, continuous batching inserts new requests as soon as any sequence finishes, maximizing GPU utilization. First proposed in the Orca system (Yu et al., 2022).

2.3 Quantization

A technique to reduce memory and increase inference speed by converting FP16/BF16 weights to lower precision like INT4/INT8. There’s a tradeoff between quality loss and speed improvement.
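The core idea can be sketched with a toy symmetric INT8 round-trip (a minimal illustration, not any production quantizer; real methods like GPTQ/AWQ add calibration and per-group scales):

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map the max magnitude to +/-127
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.5, 0.33, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Reconstruction error is bounded by half a quantization step
assert all(abs(a - b) <= s / 2 for a, b in zip(w, w_hat))
```

The quality/speed tradeoff comes from exactly this rounding error: fewer bits mean a larger step size `s` and therefore larger reconstruction error.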


3. In-Depth Tool Analysis

3.1 vLLM

GitHub: vllm-project/vllm
Development: UC Berkeley (Kwon et al.)
License: Apache 2.0
Current Status: Active development (v0.7.x+ as of 2025, official distribution via NVIDIA NGC)

Core Technology: PagedAttention

vLLM’s core innovation is PagedAttention (Kwon et al., 2023). Inspired by virtual memory paging techniques in operating systems, it partitions KV cache into fixed-size blocks and manages them through an indirection layer.

Problems with existing approach: Traditional KV cache allocates continuous memory regions for each sequence. Maximum sequence length must be pre-reserved, resulting in average 60–80% memory waste (internal + external fragmentation).

PagedAttention’s solution:

  • Partition KV cache into fixed-size blocks (e.g., 16 tokens)
  • Each sequence references non-contiguous blocks through a block table (page table)
  • Dynamically allocate new blocks as sequences grow
  • Enable KV cache sharing for beam search etc. via Copy-on-Write

Result: Reduces memory waste to under 5%, enabling 2–4x more concurrent requests on the same GPU.
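The block-table mechanics above can be sketched with a toy allocator (an illustration of the idea, not vLLM's actual implementation; the `BLOCK` size of 16 follows the example in the text):

```python
BLOCK = 16  # tokens per KV block

class BlockAllocator:
    """Global pool of physical KV blocks shared by all sequences."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.length = 0

    def append_token(self):
        # Allocate a new physical block only when the current one fills up
        if self.length % BLOCK == 0:
            self.block_table.append(self.allocator.alloc())
        self.length += 1

alloc = BlockAllocator(num_blocks=8)
seq = Sequence(alloc)
for _ in range(33):            # 33 tokens -> ceil(33/16) = 3 blocks
    seq.append_token()
assert len(seq.block_table) == 3
```

Because blocks are allocated on demand rather than reserved up front for the maximum sequence length, the only waste is the unfilled tail of each sequence's last block.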

Architecture

Client Request → FastAPI Server → AsyncLLMEngine
    → Scheduler (continuous batching)
    → Model Runner (GPU execution)
    → PagedAttention KV Cache Manager
    → Sampler → Token Output (streaming)
  • Scheduler: Schedules requests iteration-wise with continuous batching
  • Model Runner: Executes model with CUDA kernels (FlashAttention/FlashInfer backend selection)
  • KV Cache Manager: Block-level allocation/deallocation, Copy-on-Write support

Performance Benchmarks

| Comparison | Throughput vs vLLM |
|---|---|
| HuggingFace Transformers | 14–24x lower (Kwon et al., 2023) |
| Early TGI | 2.2–3.5x lower |
| FasterTransformer | 1.5–2x lower |

BentoML benchmark (2024, A100 80GB, Llama 3 8B):

  • TTFT: Best-in-class across all concurrent user levels
  • Token generation rate: ~2,300–2,500 tokens/s at 100 users (lower than LMDeploy’s 4,000 tokens/s)
  • Slightly behind in decode throughput compared to engines with higher GPU utilization (LMDeploy etc.)

Supported Models/Quantization

  • Models: 30+ architectures (LLaMA, Mistral, Qwen, Gemma, Phi, Command-R, DeepSeek, etc.)
  • Quantization: AWQ, GPTQ, FP8, INT8 (W8A8), Marlin kernels
  • Hardware: NVIDIA CUDA, AMD ROCm, AWS Neuron, CPU

Pros and Cons

| Pros | Cons |
|---|---|
| Best-in-class memory efficiency | Lacks decode-speed optimization for quantized models |
| Extensive model support | Lags behind TensorRT-LLM in single-request latency |
| Active community, rapid updates | Higher setup complexity than Ollama |
| OpenAI-compatible API | |
| Speculative decoding support | |

Suitable Use Cases

  • General production LLM serving (high throughput + low TTFT)
  • Serving various models on single infrastructure
  • Multi-GPU distributed inference

3.2 SGLang

GitHub: sgl-project/sglang
Development: LMSYS (UC Berkeley, Zheng et al.)
Paper: Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs” (2023, accepted to ICLR 2025)
Current Status: Very active development (diffusion model support and EAGLE 3 speculative decoding as of late 2025)

Core Technology: RadixAttention

SGLang’s innovation is RadixAttention—managing KV cache in a radix tree structure to automatically share prefixes among multiple requests.

Difference from PagedAttention:

  • PagedAttention: Focus on block-level memory management (eliminating fragmentation)
  • RadixAttention: Focus on prefix reuse (requests sharing the same prefix don’t duplicate KV cache computation)

Radix tree structure:

Root
├── "You are a helpful assistant. " → KV cached
│   ├── "Translate: Hello" → Branch A
│   ├── "Translate: World" → Branch B
│   └── "Summarize: ..." → Branch C

Numerous requests sharing the same system prompt or few-shot examples compute and reuse KV cache only once.

Cache hit rates:

  • Few-shot learning (shared examples): 85–95% (vLLM PagedAttention: 15–25%)
  • Multi-turn chat: 60–85% (vLLM: 30–50%)
  • LMSYS production: 52.4% for LLaVA-Next-34B, 74.1% for some models
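The longest-prefix-match idea behind these hit rates can be sketched with a toy trie (a simplification of RadixAttention: a real radix tree compresses chains of nodes and attaches KV block references, which are omitted here):

```python
class PrefixCache:
    """Toy longest-prefix matcher over cached token sequences."""
    def __init__(self):
        self.root = {}            # token -> child node

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens):
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n                   # number of tokens whose KV can be reused

cache = PrefixCache()
cache.insert(["You", "are", "a", "helpful", "assistant", ".", "Translate", ":", "Hello"])
# A second request sharing the system prompt reuses 6 tokens' worth of KV
assert cache.match_len(["You", "are", "a", "helpful", "assistant", ".", "Summarize"]) == 6
```

A request only pays prefill cost for the tokens past the matched prefix, which is why shared system prompts and few-shot examples yield such high hit rates.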

Frontend DSL

SGLang provides not just runtime but also frontend DSL:

@sgl.function
def multi_turn_qa(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer1", max_tokens=256))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer2", max_tokens=256))

This DSL automatically optimizes prefix sharing and supports parallel generation with fork/join.

Structured Generation

A compressed finite state machine technique for efficiently decoding structured outputs such as JSON or regex-constrained text. Instead of masking logits token by token, deterministic spans of the grammar are emitted in a single jump, dramatically reducing overhead.
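The jump-decoding idea can be sketched as follows (a hypothetical simplification: SGLang's compressed FSM operates on tokenized grammar states, whereas this toy alternates fixed strings with regex-constrained generation):

```python
import re

def jump_decode(template_parts, gen_fn):
    """Alternate fixed spans (emitted in one 'jump', no per-token masking)
    with free-form generation constrained by a regex.

    template_parts: list of ('fixed', text) or ('gen', regex_pattern) steps.
    gen_fn: stand-in for the model; returns a string for a given pattern.
    """
    out = []
    for kind, value in template_parts:
        if kind == "fixed":
            out.append(value)              # jump: structure is emitted directly
        else:
            piece = gen_fn(value)          # model fills in the variable part
            assert re.fullmatch(value, piece), "constraint violated"
            out.append(piece)
    return "".join(out)

# Stub "model" that returns a valid number for the toy schema below
result = jump_decode(
    [("fixed", '{"age": '), ("gen", r"\d+"), ("fixed", "}")],
    gen_fn=lambda pattern: "42",
)
assert result == '{"age": 42}'
```

Only the `gen` spans require model forward passes; everything structural is free, which is where the quoted 6.4x throughput gains on structured workloads come from.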

Performance Benchmarks

LMSYS benchmark (July 2024, Llama 3):

  • Llama 3 8B (A100): Both SGLang and TensorRT-LLM achieve up to 5,000 tokens/s on short inputs, vLLM lags behind
  • Llama 3 70B: SGLang achieves up to 3x throughput vs vLLM in online serving
  • Structured workload: Up to 6.4x throughput, 3.7x lower latency vs baseline

Clarifai benchmark (August 2025, GPT-OSS-120B, H100):

  • Strong performance at medium-high concurrency (50 requests)
  • TensorRT-LLM shows highest throughput for single requests, lacks scaling at extreme concurrency

Pros and Cons

| Pros | Cons |
|---|---|
| Dramatic performance gains via prefix reuse | Smaller ecosystem than vLLM |
| Optimized structured output generation | Relatively fewer online examples and docs |
| Complex LLM programs via the frontend DSL | Support for some models lags behind |
| EAGLE 3 speculative decoding | |
| Extends to diffusion models | |

Suitable Use Cases

  • Agent/Tool-use workflows (high prefix reuse)
  • When structured outputs (JSON) are needed
  • Multi-turn chat serving
  • Few-shot evaluation pipelines

3.3 TensorRT-LLM

GitHub: NVIDIA/TensorRT-LLM
Development: NVIDIA
License: Apache 2.0
Current Status: v0.17+ (as of 2025), NVIDIA’s official inference stack

Core Technology

TensorRT-LLM is an LLM-specific inference engine built on NVIDIA’s TensorRT compiler, generating model-specific optimized CUDA kernels at compile-time.

Key optimizations:

  1. In-flight Batching: NVIDIA’s implementation of continuous batching. Insert new requests immediately as individual requests complete
  2. FP8/INT4 quantization: Utilizes FP8 Tensor Cores in Hopper architecture (H100). 2x+ throughput vs FP16, quality loss under 2%
  3. Paged KV Cache: Block-based KV management similar to vLLM
  4. Quantized KV Cache: Quantize KV cache itself to INT8, FP8 for memory savings
  5. KV Cache Reuse: KV offloading to CPU then reuse. Up to 14x TTFT reduction (H100 basis)
  6. Kernel Fusion: Fuse MHA, MLP etc. into single kernels

Architecture

Model Definition (Python) → TensorRT Engine Build (compilation)
    → Executor API → Triton Inference Server (serving)
    → In-flight Batching Scheduler
    → Fused CUDA Kernels

Important: TensorRT-LLM requires an explicit compilation stage. Engines must be built for each model + hardware + batch-size combination, which can take tens of minutes to hours.

Performance Benchmarks

  • Single-request latency: Lowest on NVIDIA GPUs (strength of compiled kernels)
  • Llama 3.1 8B FP8 (H100): ~2x throughput improvement vs FP16
  • LMSYS benchmark: Achieves 5,000 tokens/s on short inputs alongside SGLang
  • High concurrency may increase P99 latency due to aggressive batching

Pros and Cons

| Pros | Cons |
|---|---|
| Best single-request performance on NVIDIA GPUs | NVIDIA-only (vendor lock-in) |
| FP8 optimization (Hopper) | Complex setup (engine build, Triton configuration) |
| Rich KV cache options | Recompilation needed on model changes |
| Official NVIDIA support | Steepest learning curve |

Suitable Use Cases

  • NVIDIA-only environments requiring maximum performance
  • Latency-critical workloads (real-time chatbots)
  • Fixed models where engine build investment is feasible

3.4 TGI (Text Generation Inference)

GitHub: huggingface/text-generation-inference
Development: Hugging Face
License: HFOIL (v1), Apache 2.0 (v2+)
Current Status: Maintenance mode as of December 2025 (accepting only minor bug fixes)

Core Technology

TGI was Hugging Face ecosystem’s official inference server, providing all-in-one production serving features:

  1. Rust-based HTTP/gRPC server: High-performance web server
  2. Flash Attention (Dao et al., 2022): Attention algorithm optimizing HBM ↔ SRAM IO
  3. Continuous Batching: Dynamic request insertion/removal
  4. Paged Attention: vLLM-style KV cache management
  5. TGI v3’s Chunked Prefill: Split long contexts into chunks for prefill, reducing memory peaks
  6. Prefix KV Caching: Reuse KV of long conversation history

Performance Benchmarks

  • General prompts: Similar level to vLLM, vLLM slightly ahead at high concurrency
  • TGI v3 + long context: 3x more token processing, up to 13x faster vs vLLM (long history + prefix caching)
  • BentoML benchmark (Llama 3 8B, A100): 2,300–2,500 tokens/s (similar to vLLM)

Supported Quantization

  • AWQ, GPTQ, bitsandbytes (INT4, INT8)
  • FP8 (experimental)

Pros and Cons

| Pros | Cons |
|---|---|
| Seamless HuggingFace Hub integration | In maintenance mode (since December 2025) |
| Easy setup, excellent documentation | Trails vLLM/SGLang on the latest optimizations |
| Built-in safety features (watermarking, safety filters) | Slow model-support updates |
| Broad hardware support (CUDA, ROCm, Gaudi, Inferentia) | |

Suitable Use Cases

  • HuggingFace Inference Endpoints users
  • Chat workloads with long conversation history (utilizing v3’s prefix caching)
  • Rapid prototyping and deployment

Note: With TGI entering maintenance mode, HuggingFace recommends vLLM/SGLang as alternatives.


3.5 llama.cpp

GitHub: ggml-org/llama.cpp
Development: Georgi Gerganov and community
License: MIT
Current Status: Daily active development (build 4000+ as of 2025)

Core Technology: GGUF Quantization

llama.cpp is a pure inference engine written in C/C++ that can run LLMs on both CPU and GPU without Python/PyTorch dependencies.

GGUF (GGML Unified Format): llama.cpp’s model file format supporting various quantization methods.

Quantization Methods Detail

| Quantization | Bits | Size (7B model) | Quality | Speed | Description |
|---|---|---|---|---|---|
| Q8_0 | 8-bit | ~7.0 GB | Best | Slow | Near FP16 |
| Q6_K | 6-bit | ~5.5 GB | Very good | Medium | Super-blocks with 6-bit |
| Q5_K_M | 5-bit | ~4.8 GB | Good | Medium | Mixed 5-bit, recommended |
| Q4_K_M | 4-bit | ~4.1 GB | Fair | Fast | Most popular balance point |
| Q4_K_S | 4-bit | ~3.9 GB | Fair | Fast | Slightly smaller than Q4_K_M |
| Q3_K_M | 3-bit | ~3.3 GB | Degraded | Fast | For memory-constrained setups |
| Q2_K | 2-bit | ~2.7 GB | Significantly degraded | Very fast | Extreme compression |
| IQ4_XS | ~4-bit | ~3.7 GB | Q4_K_M level | Slow* | Importance-matrix based |

*IQ quantization can be very slow with partial GPU offloading.

K-Quant System: Quantizations whose names contain “K” (Q4_K_M, etc.) use a super-block structure. Each super-block (usually 256 weights) carries independent scale factors, and the M (medium) and S (small) suffixes denote different scale-factor precisions.
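The sizes in the table above can be roughly reproduced from bits-per-weight. The sketch below is an estimate, not the GGUF file-format arithmetic; the ~5% overhead factor and the 4.5 effective bpw for Q4_K_M (nominally 4-bit, plus per-super-block scales) are illustrative assumptions:

```python
def gguf_size_gb(n_params_billions, bits_per_weight, overhead=1.05):
    """Rough file size: parameters x effective bpw, plus ~5% for scales/metadata."""
    return n_params_billions * 1e9 * bits_per_weight / 8 * overhead / 1e9

# A 7B model at ~4.5 effective bpw lands near the ~4.1 GB quoted for Q4_K_M
size = gguf_size_gb(7, 4.5)
assert 3.9 < size < 4.3
```

This is why a "4-bit" GGUF file is always somewhat larger than `params / 2` bytes: the scale factors of each super-block add real storage.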

Architecture

GGUF Model File → ggml tensor library
    → CPU: AVX2/AVX-512/ARM NEON vector operations
    → GPU: CUDA/Metal/Vulkan/OpenCL offloading
    → Multi-threaded inference
    → HTTP Server (llama-server) or CLI
  • Partial GPU Offloading: Can split GPU/CPU by layer
  • Metal Support: Excellent performance on Apple Silicon
  • Vulkan: Universal GPU acceleration (AMD, Intel)

Performance Benchmarks

llama-bench results (Apple Silicon M-series, Qwen2 1.5B Q4_0):

  • Prompt processing (pp512): 5,765 tokens/s
  • Token generation (tg128): 198 tokens/s

With full GPU offloading vs ExLlamaV2:

  • llama.cpp: ~7,500 tokens/s (prompt), ExLlamaV2: ~14,000 tokens/s (~2x difference)

Pros and Cons

| Pros | Cons |
|---|---|
| Hardware universality (CPU and nearly all GPUs) | Lower throughput than GPU-only tools |
| Single binary, minimal dependencies | Weak continuous batching |
| Extensive quantization options | Lacks production serving features |
| Apple Silicon optimization | Unsuitable for large-scale concurrent serving |
| Very active community | |

Suitable Use Cases

  • Running LLMs on local PC/laptop
  • CPU server deployment without GPU
  • Inference on Apple Silicon Mac
  • Edge device deployment

3.6 Ollama

GitHub: ollama/ollama
Development: Ollama Inc.
License: MIT
Current Status: Active development (expanded to cloud model support as of 2025)

Core Technology

Ollama is a user-friendly LLM execution environment that wraps llama.cpp. Provides Docker-like interface to pull/run models.

ollama pull llama3.1
ollama run llama3.1

Architecture

Ollama CLI/API → Go server (REST API)
    → llama.cpp (inference backend)
    → Model registry (ollama.com)
    → Modelfile (Dockerfile-like model configuration)

Key features:

  • Model management: ollama pull, ollama list, ollama rm
  • Modelfile: Declaratively set system prompts, temperature etc.
  • OpenAI-compatible API: /v1/chat/completions endpoint
  • Multimodal: Vision model support
  • 2025 updates: Cloud model integration (Turbo), local-only mode settings

Performance

Ollama’s performance is essentially identical to llama.cpp’s: the Go server wrapper adds negligible overhead, so the main bottleneck is the llama.cpp backend’s inference speed.

vLLM comparison (same model, same GPU):

  • Single request: Nearly identical latency
  • Concurrent requests: vLLM achieves 2–5x higher throughput with continuous batching

Pros and Cons

| Pros | Cons |
|---|---|
| Extremely easy to install and use | Unsuitable for large-scale serving (weak batching) |
| Model registry ecosystem | Cannot exceed llama.cpp performance |
| Custom models via Modelfile | Limited GPU memory optimization |
| Cross-platform | Lower throughput than vLLM/SGLang |

Suitable Use Cases

  • Developer local environment prototyping
  • AI accessibility for non-developers
  • Internal AI tools for small teams
  • LLM testing in CI/CD pipelines

3.7 MLC LLM

GitHub: mlc-ai/mlc-llm
Development: CMU/OctoAI (TVM team, Chen et al.)
Paper: Based on Apache TVM (Chen et al., 2018)
License: Apache 2.0

Core Technology: TVM Compiler

MLC LLM uses the Apache TVM compiler framework to compile LLMs to native code for various hardware backends.

Compilation pipeline:

HuggingFace model → Relax IR (TVM)
    → Hardware-specific optimization (fusion, tiling, vectorization)
    → Backend-specific code generation:
        - CUDA (NVIDIA GPU)
        - Metal (Apple GPU)
        - Vulkan (Universal GPU)
        - OpenCL (Mobile GPU)
        - WebGPU (Browser)
        - C/LLVM (CPU)

Mobile/Edge Deployment

MLC LLM’s unique strength is LLM inference on mobile devices:

  • iOS: Metal backend, Swift bindings
  • Android: OpenCL/Vulkan backend, Java/Kotlin bindings
  • WebGPU: Direct execution in browsers (web-llm)

Mobile benchmarks (arxiv:2410.03613, 2024):

  • Qualcomm Snapdragon 8 Gen 3 with 7B 4-bit model: ~10-15 tokens/s
  • Apple A17 Pro with similar setup: ~20+ tokens/s

BentoML Benchmark (Llama 3 8B, A100)

  • 10 users: Similar decode performance to LMDeploy, best-in-class TTFT
  • 50 users: Still good TTFT
  • 100 users: Sharp performance degradation under high load — both decode speed and TTFT lag behind LMDeploy

Pros and Cons

| Pros | Cons |
|---|---|
| Mobile/edge/browser deployment | Requires a compilation stage (longer cold start) |
| Most comprehensive hardware support | No stable releases (nightly only) |
| WebGPU support (web-llm) | Performance degrades at high concurrency |
| TVM auto-tuned optimization | Learning curve |

Suitable Use Cases

  • Embedding LLMs in mobile apps
  • Browser-based AI (WebGPU)
  • Edge device deployment (Jetson, RPi, etc.)
  • Environments with high hardware diversity

3.8 LMDeploy

GitHub: InternLM/lmdeploy
Development: Shanghai AI Lab (InternLM team)
License: Apache 2.0

Core Technology: TurboMind

LMDeploy’s core inference engine, TurboMind, started from NVIDIA FasterTransformer’s GPT-NeoX implementation and was optimized for conversational model inference.

Key optimizations:

  1. Persistent Batching: Variant of continuous batching that maintains batches while dynamically replacing individual sequences
  2. Blocked KV Cache: Block-based KV management similar to vLLM PagedAttention, but with different internal layout
  3. Dynamic Split & Fuse: Dynamically split/fuse attention blocks for optimal GPU utilization
  4. KV Quantization: Quantize KV cache itself to INT8/INT4
  5. Weight Quantization: AWQ 4-bit, INT8 support

Performance Benchmarks

BentoML benchmark (Llama 3, A100 80GB):

| Metric | LMDeploy | vLLM | TensorRT-LLM | MLC-LLM | TGI |
|---|---|---|---|---|---|
| Decode (8B, 100 users) | ~4,000 t/s | ~2,400 t/s | ~2,400 t/s | ~2,000 t/s | ~2,300 t/s |
| TTFT (8B, 10 users) | Best | Best | Good | Best | Medium |
| Decode (70B Q4, 100 users) | ~700 t/s | ~450 t/s | ~650 t/s | N/A | ~400 t/s |

InternLM benchmark: After GQA optimization, internlm2-20b achieves 16+ RPS, 1.8x faster than vLLM.

LMDeploy achieves near 100% GPU utilization particularly with quantized models.

Pros and Cons

| Pros | Cons |
|---|---|
| Best-in-class decode throughput | NVIDIA CUDA only |
| Particularly strong 4-bit inference | Limited model support (~20 models) |
| Easy to use (on-the-fly conversion) | Uneven English/Chinese documentation quality |
| KV quantization support | Smaller community than vLLM |

Suitable Use Cases

  • NVIDIA GPU environments requiring maximum throughput
  • Quantized model serving (AWQ 4-bit)
  • When using InternLM family models
  • Large-scale concurrent serving (stable even at high concurrency)

3.9 Triton Inference Server

GitHub: triton-inference-server/server
Development: NVIDIA
License: BSD 3-Clause

Core Technology

Triton is a universal model serving platform, not LLM-specific but serving various ML models. For LLM serving, primarily used with TensorRT-LLM backend.

Core features:

  1. Dynamic Batching: Automatically batch multiple requests. Configurable wait time/batch size limits
  2. Model Ensembles: Configure preprocessing → LLM → postprocessing as pipelines
  3. Multi-backends: TensorRT, ONNX Runtime, PyTorch, TensorFlow, vLLM, etc.
  4. Concurrent model serving: Serve multiple models simultaneously on single server
  5. Model versioning: Model version management for A/B testing
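The wait-time/batch-size tradeoff in feature 1 can be sketched as follows (a toy simplification of a dynamic batcher; a real server would block on a condition variable instead of polling, and Triton's actual scheduler is configured declaratively, not coded like this):

```python
import time

def dynamic_batch(queue, max_batch=8, max_wait_s=0.005, now=time.monotonic):
    """Collect requests until the batch is full or the wait budget expires."""
    batch, deadline = [], now() + max_wait_s
    while len(batch) < max_batch and now() < deadline:
        if queue:
            batch.append(queue.pop(0))
        # A production server would sleep/wait here instead of busy-polling
    return batch

reqs = [f"req{i}" for i in range(10)]
batch = dynamic_batch(reqs, max_batch=8)
assert len(batch) == 8 and reqs == ["req8", "req9"]
```

Tuning `max_wait_s` trades latency (short waits) against throughput (fuller batches), which is exactly the knob Triton exposes per model.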

Architecture

Client (HTTP/gRPC) → Triton Server
    → Request Scheduler (dynamic batching)
    → Model Repository
        ├── Model A (TensorRT-LLM)
        ├── Model B (ONNX Runtime)
        └── Ensemble Pipeline
    → Response Aggregator

Role in LLM Serving

Triton itself doesn’t perform LLM inference optimizations (PagedAttention etc.). Instead:

  • TensorRT-LLM Backend: Serve TensorRT-LLM engines via tensorrtllm_backend
  • vLLM Backend: Use vLLM as Triton backend
  • Actual inference optimization handled by backend engines

Pros and Cons

| Pros | Cons |
|---|---|
| Multi-model serving (LLM + vision + audio) | Complex setup, since it is not LLM-specific |
| Production-proven stability | No LLM optimizations on its own |
| Built-in monitoring and metrics | Steep learning curve when paired with TensorRT-LLM |
| Ensemble pipelines | |

Suitable Use Cases

  • Multimodal AI pipelines (LLM + image + audio)
  • Large-scale enterprise ML infrastructure
  • A/B testing + model versioning needs
  • Production serving wrapper for TensorRT-LLM models

3.10 ExLlamaV2

GitHub: turboderp-org/exllamav2
Development: turboderp
License: MIT

Core Technology: EXL2 Quantization

ExLlamaV2’s core innovation is EXL2 (ExLlamaV2 quantization), a format that mixes different bit widths per layer and per tensor.

How it works:

  1. Measure importance (sensitivity) of each layer/tensor
  2. Quantize important layers to high bits (6-8bit), less important layers to low bits (2-3bit)
  3. Match overall model’s average bits to target (e.g., 4.25 bits per weight)
  4. Achieve higher quality than uniform quantization at same model size

Supported bits: 2, 3, 4, 5, 6, 8 bit and their mixtures
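The bit-allocation idea in steps 1–3 can be sketched greedily (a hypothetical illustration; EXL2's actual converter measures quantization error against calibration data rather than using abstract sensitivity scores):

```python
def allocate_bits(sensitivity, target_avg, choices=(2, 3, 4, 5, 6, 8)):
    """Greedy sketch: start every layer at the narrowest width, then upgrade
    the most sensitive layers while the average stays within the target."""
    n = len(sensitivity)
    bits = {layer: choices[0] for layer in sensitivity}
    order = sorted(sensitivity, key=sensitivity.get, reverse=True)
    for layer in order:
        for b in reversed(choices):            # try the widest width first
            trial = dict(bits, **{layer: b})
            if sum(trial.values()) / n <= target_avg:
                bits[layer] = b
                break
    return bits

# Hypothetical per-tensor sensitivity scores
sens = {"attn.q": 0.9, "attn.k": 0.4, "mlp.up": 0.7, "mlp.down": 0.2}
bits = allocate_bits(sens, target_avg=4.25)
assert sum(bits.values()) / len(bits) <= 4.25
assert bits["attn.q"] >= bits["mlp.down"]      # sensitive layers get more bits
```

The payoff of step 4 follows directly: at the same average bits per weight, spending precision where it matters most yields lower error than uniform quantization.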

Performance Benchmarks

Reddit benchmark (2024, RTX 4090):

  • Llama 3 8B: ExLlamaV2 achieves ~14,000 tokens/s in prompt processing (vs llama.cpp’s ~7,500, about 2x)
  • At same 4-bit, EXL2 slightly higher quality than GPTQ, similar or slightly lower than GGUF
  • ExLlama-based GPTQ execution shows fastest evaluation speed (oobabooga benchmark)

Quality comparison (4-bit, Llama 2 13B, perplexity basis):

  • AWQ: Best quality
  • GPTQ ≈ EXL2: Similar
  • GGUF (Q4_K_M): Slightly behind

Pros and Cons

| Pros | Cons |
|---|---|
| Best-in-class GPU inference speed | NVIDIA CUDA only |
| Flexible mixed-precision quantization | Limited serving features (inference library) |
| Flash Attention and context caching | No CPU inference |
| Popular in the local-AI community | No continuous batching |

Suitable Use Cases

  • Maximum inference speed on single GPU (with TabbyAPI etc.)
  • Precise quantization adjustment to fit memory
  • Local AI chat (oobabooga, SillyTavern)

3.11 Ray Serve + vLLM

Framework: Ray Serve + vLLM
Development: Anyscale (Ray), UC Berkeley (vLLM)

Core Technology

Ray Serve is a distributed model serving framework that adds autoscaling, monitoring, fault recovery while using vLLM as backend.

Architecture:

Load Balancer → Ray Serve Router
    → Replica 1 (vLLM on GPU 0-1)
    → Replica 2 (vLLM on GPU 2-3)
    → Replica N (autoscaled)
    → Ray Dashboard (monitoring)

Key features:

  1. Autoscaling: Automatically increase/decrease vLLM instances based on traffic
  2. Multi-model serving: Serve multiple models simultaneously on one cluster
  3. Fault recovery: Automatic restart on replica failure
  4. Disaggregated Serving: Run Prefill and Decode on separate nodes (vLLM’s latest feature)

vLLM Large-Scale Serving (December 2025):

  • DeepSeek models achieve 2,200 tokens/s per H200 (Wide Expert Parallelism)
  • Efficient KV transfer via NIXL/LMCache connectors
  • Independent scaling of each phase (prefill/decode) with Ray’s distributed computing

Pros and Cons

| Pros | Cons |
|---|---|
| Production-grade autoscaling | Complex setup (Ray + vLLM) |
| Built-in monitoring and fault recovery | Requires Ray cluster management |
| Multi-model, multi-node serving | Some added overhead |
| Disaggregated serving support | |

Suitable Use Cases

  • Large-scale production LLM services
  • Environments with high traffic fluctuation
  • Multi-model / multi-tenant serving
  • Cloud-native AI infrastructure

3.12 PowerInfer

GitHub: SJTU-IPADS/PowerInfer
Development: Shanghai Jiao Tong University
Paper: Song et al., “PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU” (2023)

Core Technology: Neuron-Aware Sparse Inference

PowerInfer leverages activation sparsity in LLMs. In FFN layers, only a portion of neurons actually activate, and which neurons activate frequently (“hot neurons”) can be profiled beforehand.

How it works:

  1. Offline profiling to analyze activation frequency per neuron
  2. Hot neurons (frequently activated): Reside on GPU
  3. Cold neurons (rarely activated): Stored in CPU memory
  4. Runtime adaptive predictor predicts which neurons will activate
  5. Neuron-aware sparse operator computes only activated neurons
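Steps 1–3 above can be sketched as a simple hot/cold partition (an illustration of the placement policy only; PowerInfer's real system also runs the online activation predictor and sparse operators of steps 4–5):

```python
def split_neurons(activation_counts, gpu_budget):
    """Place the most frequently activated ('hot') neurons on the GPU,
    leaving the rest ('cold') in CPU memory."""
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    hot = set(ranked[:gpu_budget])
    cold = set(ranked[gpu_budget:])
    return hot, cold

# Hypothetical activation counts per neuron from offline profiling
counts = {0: 980, 1: 12, 2: 640, 3: 3, 4: 870}
hot, cold = split_neurons(counts, gpu_budget=2)
assert hot == {0, 4} and cold == {1, 2, 3}
```

Because activation frequency follows a power law in many FFN layers, a small GPU-resident hot set covers most runtime activations, which is what makes the consumer-GPU numbers below possible.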

Performance Benchmarks

RTX 4090 single GPU:

  • Various LLMs including OPT-175B achieve average 13.20 tokens/s, max 29.08 tokens/s
  • Only 18% lower performance than A100 server — on consumer GPU!
  • Up to 11x faster inference than llama.cpp (on GPU memory constrained models)

Pros and Cons

| Pros | Cons |
|---|---|
| Runs large models on consumer GPUs | Only effective on models with FFN sparsity |
| GPU-CPU hybrid overcomes VRAM limits | Requires a profiling stage |
| Dramatic speedups over llama.cpp | Limited GQA/MoE model support |
| | No production serving features |

Suitable Use Cases

  • Running large models on VRAM-limited consumer GPUs
  • Models with strong sparse activation patterns like OPT, Falcon
  • Research/experimental purposes

3.13 Aphrodite Engine

GitHub: aphrodite-engine/aphrodite-engine
Development: PygmalionAI
License: Apache 2.0
GitHub Stars: ~1.6k

Core Technology

Aphrodite is a vLLM fork optimized for RP/storytelling community needs.

Features added over vLLM:

  • Enhanced sampling parameters (fine control of temperature, repetition penalty, etc.)
  • EXL2, GGUF quantization format support (vLLM focuses on GPTQ/AWQ)
  • Rapid response to community requests
  • PagedAttention KV cache management (vLLM-based)
  • Continuous batching (async server)

Pros and Cons

| Pros | Cons |
|---|---|
| vLLM-based high performance | May lag behind vLLM upstream |
| Broad quantization format support | Community-focused rather than production-oriented |
| Fine-grained sampling control | Limited documentation/support |

Suitable Use Cases

  • RP/storytelling serving (SillyTavern, etc.)
  • When wanting to serve EXL2/GGUF models on server
  • When needing sampling features absent in vLLM

3.14 LocalAI

GitHub: mudler/LocalAI
Development: mudler and community
License: MIT

Core Technology

LocalAI is a fully OpenAI API compatible local AI server that integrates various backends.

Multi-backend architecture:

OpenAI-compatible API (/v1/chat/completions, /v1/images, /v1/audio, etc.)
    ├── llama.cpp (text generation)
    ├── whisper.cpp (speech recognition)
    ├── stable-diffusion.cpp (image generation)
    ├── bark (TTS)
    ├── piper (TTS)
    └── other backends

2025 features:

  • LocalAI Core (text, image, audio, vision APIs)
  • LocalAGI (autonomous agents)
  • LocalRecall (semantic search)
  • P2P distributed inference
  • Constrained grammars (structured output)

Pros and Cons

| Pros | Cons |
|---|---|
| Complete OpenAI API drop-in replacement | Less optimized than the individual backends |
| Text, image, and audio in one server | Setup complexity |
| P2P distributed support | Documentation thin for the community size |
| Easy Docker-based deployment | |

Suitable Use Cases

  • Converting existing OpenAI API code to local
  • Multimodal AI (text+image+audio) from single server
  • Privacy-sensitive environments

3.15 DeepSpeed-MII

GitHub: deepspeedai/DeepSpeed-MII
Development: Microsoft DeepSpeed team
License: Apache 2.0

Core Technology

DeepSpeed-MII is a serving framework utilizing Microsoft’s DeepSpeed library’s inference optimizations.

4 core technologies:

  1. DeepSpeed-Inference: Accelerate Transformer inference with custom CUDA kernels
  2. ZeRO-Inference: When model doesn’t fit single GPU, utilize CPU memory/NVMe for offloading. Enable single GPU serving of models like Bloom-176B
  3. DeepSpeed-FastGen: Continuous batching + Dynamic SplitFuse (dynamically split/combine prefill and decode)
  4. Tensor Parallelism: Multi-GPU parallel inference

Dynamic SplitFuse: Split long prompt prefill across multiple iterations and fuse with decode tokens to maintain uniform GPU utilization.
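The token-budget idea behind Dynamic SplitFuse can be sketched as follows (a toy scheduler illustrating the split-and-fuse policy, not DeepSpeed-FastGen's implementation; the budget of 512 is an arbitrary example):

```python
def build_iteration(decode_seqs, prefill_queue, token_budget=512):
    """Fill each iteration with all pending decode tokens (one per sequence),
    then pack chunks of waiting prefills until the token budget is reached."""
    batch = [("decode", s, 1) for s in decode_seqs]
    budget = token_budget - len(batch)
    for seq, remaining in prefill_queue:
        if budget == 0:
            break
        chunk = min(remaining, budget)     # split a long prefill across iterations
        batch.append(("prefill", seq, chunk))
        budget -= chunk
    return batch

batch = build_iteration(
    decode_seqs=["a", "b", "c"],
    prefill_queue=[("p1", 600), ("p2", 100)],
    token_budget=512,
)
# 3 decode tokens fuse with a 509-token chunk of p1; p2 waits for the next pass
assert batch == [("decode", "a", 1), ("decode", "b", 1), ("decode", "c", 1),
                 ("prefill", "p1", 509)]
```

Every iteration processes roughly the same number of tokens, so a long prompt never stalls in-flight decodes, keeping GPU utilization uniform.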

Performance

DeepSpeed-FastGen blog (2023):

  • Up to 2.3x throughput, up to 2x latency reduction vs vLLM (specific workloads)
  • However, gap has narrowed in recent comparisons as vLLM significantly evolved

Pros and Cons

| Pros | Cons |
|---|---|
| ZeRO-Inference enables ultra-large model deployment | Development activity is declining |
| Official Microsoft support | Trails vLLM/SGLang in recent comparisons |
| Dynamic SplitFuse technique | Limited model support |
| Azure integration | Sparse documentation/examples |

Suitable Use Cases

  • Single GPU serving of ultra-large models (ZeRO-Inference)
  • Azure/Microsoft ecosystem
  • Integration with DeepSpeed training pipelines

3.16 OpenLLM (BentoML)

GitHub: bentoml/OpenLLM
Development: BentoML
License: Apache 2.0

Core Technology

OpenLLM is an LLM serving tool built on BentoML framework, managing the entire lifecycle from model packaging to cloud deployment.

Features:

  • Bento packaging: Package model + dependencies + serving code together
  • OpenAI-compatible API
  • Swappable inference backends: Use vLLM, TensorRT-LLM, etc. as backends
  • BentoCloud deployment: One-click cloud deployment
  • LangChain integration

Pros and Cons

| Pros | Cons |
|---|---|
| Model lifecycle management | Inference performance depends on the backend |
| One-click BentoCloud deployment | Possible overhead from backend indirection |
| Multiple backend support | Small community |
| LangChain integration | |

Suitable Use Cases

  • Teams needing ML model packaging/deployment pipelines
  • BentoCloud users
  • Serving LLM + other ML models together

3.17 CTranslate2

GitHub: OpenNMT/CTranslate2
Development: OpenNMT (SYSTRAN)
License: MIT

Core Technology

CTranslate2 is an engine that converts Transformer models to optimized C++ format for inference. Originally developed for machine translation (NMT), expanded to LLMs.

Optimization techniques:

  1. Layer Fusion: Combine consecutive layers into single operations
  2. Padding Removal: Remove padding within batches to prevent unnecessary computation
  3. Batch Reordering: Sort sequences by length within batches for efficiency improvement
  4. In-place Operations: Minimize memory allocation
  5. Caching Mechanism: Cache repetitive operation results

Quantization: Supports INT8, INT16, Float16. INT8 models are 3.53x faster than Float32 (AMD ROCm benchmark).

Primary use case: Faster-Whisper (high-speed Whisper speech recognition implementation) uses CTranslate2 as core backend.

Pros and Cons

| Pros | Cons |
|---|---|
| Excellent CPU performance | No LLM-specific optimizations (PagedAttention, etc.) |
| Lightweight, minimal dependencies | Limited model support (mainly encoder-decoder) |
| Production-proven (translation services) | Declining community activity |
| AMD ROCm support | Slow to support the latest LLM architectures |

Suitable Use Cases

  • Machine translation serving
  • Whisper-based speech recognition (Faster-Whisper)
  • Transformer inference in CPU-only environments
  • Lightweight deployment

3.18 Candle

GitHub: huggingface/candle
Development: Hugging Face
License: Apache 2.0/MIT

Core Technology

Candle is a minimal ML framework written in Rust, providing PyTorch-like API with Rust’s safety and performance.

Features:

  • Pure Rust implementation (no libtorch/Python dependencies)
  • CUDA, Metal backend support
  • Native HuggingFace Hub integration
  • WASM target (browser execution)
  • Flash Attention support (CUDA feature flag)

Ecosystem:

  • candle-transformers: Major model implementations (LLaMA, Mistral, Phi, etc.)
  • candle-einops: Rust einops implementation
  • atoma-infer: Large-scale inference library based on Candle (FlashAttention2, PagedAttention)

Pros and Cons

| Pros | Cons |
|---|---|
| Rust memory safety and performance | Inference-only (no training support) |
| No Python dependency | Fewer model implementations than the Python ecosystem |
| WASM support (serverless/browser) | Small community |
| Lightweight binaries | No high-level serving features |

Suitable Use Cases

  • Embedding ML in Rust-based applications
  • Lightweight inference in serverless/edge
  • WASM-based browser AI
  • Direct HuggingFace model usage in Rust

4. Technology Comparison Analysis

4.1 KV Cache Management Comparison

| Method | Tools | Core Idea | Memory Efficiency | Prefix Reuse | Complexity |
|---|---|---|---|---|---|
| PagedAttention | vLLM, Aphrodite | Store KV in fixed blocks non-contiguously using OS paging techniques | ★★★★★ | △ (hash-based) | Medium |
| RadixAttention | SGLang | Automatically share prefix via radix tree | ★★★★★ | ★★★★★ | High |
| Blocked KV Cache | LMDeploy TurboMind | Block grid-based management, split & fuse optimization | ★★★★☆ | — | Medium |
| Paged + Quantized KV | TensorRT-LLM | Block-based + INT8/FP8 KV quantization | ★★★★★ | ○ (CPU offloading) | High |
| Contiguous | llama.cpp, ExLlamaV2 | Contiguous memory, pre-allocation | ★★☆☆☆ | — | Low |

Key insights:

  • Fragmentation elimination: PagedAttention (vLLM) became the standard, reducing memory waste from 60-80% to under 5%
  • Prefix reuse: RadixAttention (SGLang) achieves highest cache hit rates. 85-95% in few-shot vs PagedAttention’s 15-25%
  • KV quantization: Supported by TensorRT-LLM and LMDeploy. Quantizing KV to FP8/INT8 saves 50% memory with minimal quality loss
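The block-table mechanics behind PagedAttention can be illustrated with a toy allocator. This is a pure-Python sketch of the idea only, not vLLM's actual implementation; the class names and the 8-block pool are invented for illustration (16 tokens per block mirrors vLLM's default):

```python
# Minimal sketch of PagedAttention-style block allocation (illustrative).
# The KV cache is carved into fixed-size blocks; each sequence holds a block
# table mapping its logical positions to physical blocks, so memory is never
# pre-reserved for the maximum sequence length.

BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted -> preempt/swap a sequence")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):          # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()

print(len(seq.block_table))  # 3
print(len(allocator.free))   # 5
```

Because blocks are claimed on demand, internal fragmentation is bounded by one partially filled block per sequence, which is where the "under 5% waste" figure comes from.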

4.2 Quantization Method Comparison

| Method | Bits | Process | GPU Required | Quality | Speed | Compatible Tools |
|---|---|---|---|---|---|---|
| GPTQ | 4bit (mainly) | Post-training, Hessian-based | Required for quantization | ★★★★☆ | ★★★★★ (ExLlama) | vLLM, TGI, ExLlamaV2 |
| AWQ | 4bit | Activation-aware weight quant | Required for quantization | ★★★★★ | ★★★★☆ | vLLM, LMDeploy, TGI |
| EXL2 | 2-8bit mixed | Per-layer mixed precision | Required for quantization | ★★★★☆ | ★★★★★ | ExLlamaV2, Aphrodite |
| GGUF | 2-8bit | K-quant super-block | CPU possible | ★★★★☆ | ★★★☆☆ (CPU) | llama.cpp, Ollama, LocalAI |
| FP8 | 8bit | 8-bit floating point | Hopper GPU | ★★★★★ | ★★★★★ | TensorRT-LLM, vLLM |
| bitsandbytes | 4/8bit | NF4, INT8 | Required | ★★★☆☆ | ★★★☆☆ | TGI, HF Transformers |

Quality ranking (same 4-bit, perplexity basis): AWQ > GPTQ ≈ EXL2 > GGUF Q4_K_M > bitsandbytes NF4

Speed ranking (GPU, 4-bit): EXL2 (ExLlamaV2) > GPTQ (ExLlama backend) > AWQ (vLLM) > GGUF (llama.cpp GPU offload)

Key selection criteria:

  • GPU serving, maximum speed: EXL2 (ExLlamaV2) or GPTQ (ExLlama backend)
  • GPU serving, highest quality: AWQ (vLLM/LMDeploy)
  • CPU/hybrid inference: GGUF (llama.cpp)
  • NVIDIA Hopper, production: FP8 (TensorRT-LLM)

4.3 Batching Strategy Comparison

| Strategy | Description | GPU Utilization | Latency | Supporting Tools |
|---|---|---|---|---|
| Static Batching | Wait until all sequences in batch complete | ★★☆☆☆ | High (bound by longest sequence) | Basic HF Transformers |
| Continuous Batching | Insert new requests immediately upon sequence completion | ★★★★☆ | Low | vLLM, SGLang, TGI, Aphrodite |
| In-flight Batching | NVIDIA's continuous batching implementation, iteration-level scheduling | ★★★★★ | Very low | TensorRT-LLM, Triton |
| Persistent Batching | Maintain batches while dynamically replacing individual sequences | ★★★★★ | Low | LMDeploy |
| Dynamic SplitFuse | Dynamically split/combine prefill and decode | ★★★★☆ | Low | DeepSpeed-MII |

Key insight: Evolution from Static → Continuous → In-flight/Persistent. All modern serving engines use continuous batching or better.
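The gap between static and continuous batching can be made concrete with a toy scheduler. This sketch abstracts away prefill, attention cost, and memory pressure; each request simply needs `length` decode steps and the "GPU" runs up to `capacity` sequences per step (all numbers are invented for illustration):

```python
# Toy comparison: static batching is bound by the longest sequence in each
# batch, while continuous batching admits a waiting request the moment a
# slot frees up.

from collections import deque

def static_batching_steps(lengths, capacity):
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])  # whole batch waits for longest
    return steps

def continuous_batching_steps(lengths, capacity):
    waiting, running, steps = deque(lengths), [], 0
    while waiting or running:
        while waiting and len(running) < capacity:
            running.append(waiting.popleft())   # admit immediately
        running = [r - 1 for r in running if r - 1 > 0]  # one decode step
        steps += 1
    return steps

# Two long requests mixed with short ones: short requests no longer wait
# behind long ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, capacity=4))      # 200
print(continuous_batching_steps(lengths, capacity=4))  # 110
```

The short requests' slots are recycled as soon as they finish, which is why GPU utilization (and throughput under mixed workloads) improves so sharply.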


4.4 Attention Optimization Comparison

| Technique | Paper | Core Idea | Main Effect | Using Tools |
|---|---|---|---|---|
| Flash Attention | Dao et al., 2022 | Minimize HBM access via SRAM tiling | Memory savings + 2-4x speed improvement | TGI, SGLang, Candle |
| Flash Attention 2 | Dao, 2023 | Improved work partitioning, sequence parallelization | 2x additional improvement over FA1 | Most modern engines |
| Flash Attention 3 | 2024 | Hopper asynchronous execution, FP8 support | Additional improvement over FA2 (especially H100) | SGLang (latest) |
| PagedAttention | Kwon et al., 2023 | Block-based KV management + attention | Memory efficiency maximization | vLLM, TGI, Aphrodite |
| FlashInfer | 2024 | Shared prefix batch decoding optimization, cascading | Up to 31x faster than vLLM on shared prefix | SGLang, vLLM (integrating) |
| FlexAttention | PyTorch, 2024 | BlockMask + page table integration | Combine flexible mask + paged attention | PyTorch native |

FlashInfer detail:

  • When shared prefix is 32,768 tokens and batch size 256, up to 31x speed improvement vs basic PagedAttention
  • Cascading technique computes shared prefix attention only once

FA3 benchmark: In SGLang, FA3 surpasses both the FlashInfer and Triton backends, with the gap widening as input/output size increases.
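The trick that makes SRAM tiling possible in the FlashAttention family is the "online softmax": the normalizer can be accumulated tile by tile, rescaling partial sums whenever a new running maximum appears, so the full score row never has to sit in HBM at once. A pure-Python sketch over a 1-D score vector (the real kernels do this per-tile over matrices, fused with the value accumulation):

```python
# Online (streaming) softmax: process scores one tile at a time while
# maintaining a running max m and a running rescaled sum s.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def online_softmax(xs, tile=2):
    m, s = float("-inf"), 0.0
    for i in range(0, len(xs), tile):      # stream one tile at a time
        block = xs[i:i + tile]
        m_new = max(m, max(block))
        # rescale the running sum to the new maximum, then add this tile
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in block)
        m = m_new
    return [math.exp(x - m) / s for x in xs]

scores = [0.5, 2.0, -1.0, 3.0, 0.0]
a, b = softmax(scores), online_softmax(scores)
print(max(abs(x - y) for x, y in zip(a, b)) < 1e-12)  # True
```

Because the recurrence is exact (not an approximation), FlashAttention computes bitwise-faithful attention while touching each score tile only once.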


4.5 Speculative Decoding Support Status

Speculative decoding is a technique where a small “draft model” rapidly generates multiple tokens, and a large “target model” verifies them at once (Leviathan et al., 2023; Chen et al., 2023).

| Tool | Support | Draft Model Method | Performance Improvement |
|---|---|---|---|
| vLLM | ✅ | Separate small model, n-gram, MLPSpeculator | 2-3x (workload dependent) |
| SGLang | ✅ | EAGLE, EAGLE 2, EAGLE 3 (2025 latest) | 2-4x |
| TensorRT-LLM | ✅ | Draft model, Medusa heads | 2-3x |
| TGI | ✅ | Medusa | 2x |
| LMDeploy | △ (experimental) | — | — |
| llama.cpp | ✅ | Draft model | 1.5-2x |
| ExLlamaV2 | ✗ | — | — |
| Others | ✗ | — | — |

EAGLE 3 (SGLang, December 2025): LMSYS ships bundled speculative-decoding draft models for popular base models. Groq reports a 6x+ speedup on Llama-3.1-70B, and SambaNova reports 2x+ on Llama-3.1-405B.
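The accept/verify loop of speculative decoding can be shown with a greedy toy model. Both "models" here are trivial next-token functions invented for illustration; real systems sample stochastically and use the Leviathan et al. acceptance rule, but the structure (draft k tokens, verify in one target pass, keep the matching prefix plus one target token) is the same:

```python
# Greedy speculative decoding sketch (illustrative toy, not an engine's code).

def draft_model(ctx):           # fast but sloppy: always predicts 'a'
    return "a"

def target_model(ctx):          # ground truth: alternates 'a' and 'b'
    return "a" if len(ctx) % 2 == 0 else "b"

def speculative_step(ctx, k=4):
    # 1. Draft proposes k tokens autoregressively (cheap).
    proposal, tmp = [], ctx
    for _ in range(k):
        t = draft_model(tmp)
        proposal.append(t)
        tmp += t
    # 2. Target verifies all k positions (one parallel pass in a real engine).
    accepted, tmp = [], ctx
    for t in proposal:
        if target_model(tmp) == t:
            accepted.append(t)
            tmp += t
        else:
            break
    # 3. Target always emits one corrected/extra token, so output is
    #    identical to pure target decoding, just in fewer target passes.
    accepted.append(target_model(tmp))
    return ctx + "".join(accepted)

ctx = ""
while len(ctx) < 8:
    ctx = speculative_step(ctx)
print(ctx)  # "abababab" -- same as greedy target-only decoding
```

Here the draft is right half the time, so each step yields two tokens per target pass: a 2x reduction in target-model invocations, which is exactly where the reported 2-4x wall-clock gains come from when the draft model is much cheaper than the target.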


4.6 Prefix Caching Comparison

| Tool | Method | Cache Hit Rate (few-shot) | Cache Hit Rate (chat) | Implementation |
|---|---|---|---|---|
| SGLang | RadixAttention (radix tree) | 85-95% | 60-85% | Token sequence-based tree |
| vLLM | Hash-based prefix caching | 15-25% | 30-50% | Block hash matching |
| TensorRT-LLM | KV Cache Reuse + CPU offloading | Medium | Medium | CPU-GPU transfer |
| TGI v3 | Prefix KV caching | Medium-High | High (long history) | Chunk-based |
| LMDeploy | Blocked KV reuse | Low-Medium | Medium | Block matching |

Key insight: For workloads with high prefix reuse (agents, few-shot prompting, shared system prompts), SGLang's RadixAttention is the clear winner. The difference narrows in simple chatbot serving.
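The core idea of tree-based prefix caching can be sketched with a character-level trie. This is a minimal illustration of the matching logic only, not SGLang's implementation (which works on token IDs, tracks KV block handles at each node, and evicts via LRU):

```python
# Minimal radix-tree-style prefix cache sketch: sequences are inserted into
# a trie; a new request reuses the KV of its longest cached prefix instead
# of recomputing it.

class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n  # number of tokens whose KV can be reused

cache = PrefixCache()
system_prompt = list("You are a helpful assistant. ")
cache.insert(system_prompt + list("Question 1"))

req = system_prompt + list("Question 2")
hit = cache.longest_prefix(req)
print(hit, "/", len(req))  # 38 of 39 tokens reuse cached KV
```

Because the tree matches token-by-token, any shared prefix is found automatically, with no need for the exact-block alignment that hash-based schemes require; this is why hit rates diverge so sharply on few-shot and agent workloads.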


4.7 Distributed Inference Comparison

| Method | Description | Advantages | Disadvantages | Supporting Tools |
|---|---|---|---|---|
| Tensor Parallelism (TP) | Split single layer across multiple GPUs | Low latency | All-reduce communication needed, requires high bandwidth between GPUs | vLLM, SGLang, TensorRT-LLM, LMDeploy, TGI |
| Pipeline Parallelism (PP) | Sequential layer placement across GPUs | Low communication overhead | Pipeline bubbles, high latency | TensorRT-LLM, DeepSpeed |
| Expert Parallelism (EP) | Distribute MoE model experts across GPUs | Optimal for MoE models | MoE-only | vLLM (Wide-EP), SGLang |
| Disaggregated Serving | Run prefill and decode on separate nodes | Independent scaling per phase | KV transfer overhead | vLLM (NIXL), SGLang |
| Sequence Parallelism | Split long sequences | Useful for long context | Complex implementation | DeepSpeed, Ring Attention |

vLLM’s latest distributed serving (December 2025):

  • DeepSeek models achieve 2,200 tokens/s per H200 with Wide Expert Parallelism
  • Efficient KV transfer via NIXL/LMCache connectors for prefill-decode separation
  • Independent autoscaling based on Ray
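Tensor parallelism's layer-splitting can be sketched with a column-parallel matrix multiply in plain Python. This toy ignores the communication step entirely (a real engine would all-gather the output shards, and row-parallel layers would all-reduce partial sums); the vectors and weights are invented for illustration:

```python
# Column-parallel linear layer sketch: the weight matrix is split column-wise
# across "devices"; each device computes its shard of the output features,
# and the shards are concatenated.

def matmul(x, W):  # x: input vector, W: list of weight columns
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in W]

def column_parallel_matmul(x, W, num_devices):
    shard = len(W) // num_devices
    out = []
    for d in range(num_devices):                      # each iteration = one "GPU"
        out.extend(matmul(x, W[d * shard:(d + 1) * shard]))
    return out

x = [1.0, 2.0]
W = [[1, 0], [0, 1], [2, 2], [3, -1]]  # 4 output features, stored column-major

# Sharded computation matches the single-device result exactly.
print(column_parallel_matmul(x, W, num_devices=2))  # [1.0, 2.0, 6.0, 1.0]
print(column_parallel_matmul(x, W, 2) == matmul(x, W))  # True
```

Since every device needs the full input activation each layer, TP demands high inter-GPU bandwidth (NVLink), which is the disadvantage noted in the table above.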

5. Comprehensive Comparison Tables

5.1 Feature Comparison

| Tool | Language | Continuous Batching | PagedAttention | Quantization | Speculative Decoding | Distributed Inference | OpenAI API |
|---|---|---|---|---|---|---|---|
| vLLM | Python/C++ | ✅ | ✅ | AWQ, GPTQ, FP8 | ✅ | TP | ✅ |
| SGLang | Python/C++ | ✅ | ✅ (RadixAttn) | AWQ, GPTQ, FP8 | ✅ (EAGLE3) | TP, EP | ✅ |
| TensorRT-LLM | Python/C++ | ✅ (in-flight) | ✅ | FP8, INT4, INT8 | ✅ | TP, PP | via Triton |
| TGI | Rust/Python | ✅ | ✅ | AWQ, GPTQ, bnb | ✅ (Medusa) | TP | ✅ |
| llama.cpp | C/C++ | ✗ | ✗ | GGUF (2-8bit) | ✅ | ✗ | ✅ |
| Ollama | Go/C++ | ✗ | ✗ | GGUF | ✗ | ✗ | ✅ |
| MLC LLM | Python/C++ | ✅ | ✅ | 3-4bit | ✗ | ✗ | ✅ |
| LMDeploy | Python/C++ | ✅ (persistent) | ✅ (blocked) | AWQ, INT8, KV quant | △ | TP | ✅ |
| Triton Server | C++/Python | ✅ (dynamic) | via backend | via backend | via backend | via backend | ✅ |
| ExLlamaV2 | Python/C++ | ✗ | ✗ | EXL2, GPTQ | ✗ | ✗ | via TabbyAPI |
| Ray Serve+vLLM | Python | ✅ | ✅ | vLLM all | ✅ | TP+multi-node | ✅ |
| PowerInfer | C/C++ | ✗ | ✗ | GGUF | ✗ | ✗ | ✅ |
| Aphrodite | Python/C++ | ✅ | ✅ | EXL2, GGUF, AWQ, GPTQ | ✗ | TP | ✅ |
| LocalAI | Go/C++ | ✗ | ✗ | GGUF | ✗ | P2P | ✅ |
| DeepSpeed-MII | Python/C++ | ✅ | ✗ | INT8 | ✗ | TP, PP | ✗ |
| OpenLLM | Python | via backend | via backend | via backend | via backend | via backend | ✅ |
| CTranslate2 | C++/Python | ✗ | ✗ | INT8, INT16 | ✗ | ✗ | ✗ |
| Candle | Rust | ✗ | △ (atoma-infer) | ✗ | ✗ | ✗ | ✗ |

5.2 Hardware Support

| Tool | NVIDIA CUDA | AMD ROCm | Apple Metal | CPU | Mobile | WebGPU |
|---|---|---|---|---|---|---|
| vLLM | ✅ | ✅ | ✗ | ✅ | ✗ | ✗ |
| SGLang | ✅ | ✅ | ✗ | ✗ | ✗ | ✗ |
| TensorRT-LLM | ✅ | ✗ | ✗ | ✗ | ✗ | ✗ |
| TGI | ✅ | ✅ | ✗ | ✅ | ✗ | ✗ |
| llama.cpp | ✅ | ✅ | ✅ | ✅ | ✅ | ✗ |
| Ollama | ✅ | ✅ | ✅ | ✅ | ✗ | ✗ |
| MLC LLM | ✅ | ✅ | ✅ | ✗ | ✅ | ✅ |
| LMDeploy | ✅ | ✗ | ✗ | ✗ | ✗ | ✗ |
| ExLlamaV2 | ✅ | ✅ | ✗ | ✗ | ✗ | ✗ |
| PowerInfer | ✅ | ✗ | ✅ | ✅ (hybrid) | ✗ | ✗ |
| LocalAI | ✅ | ✅ | ✅ | ✅ | ✗ | ✗ |
| Candle | ✅ | ✗ | ✅ | ✅ | ✗ | ✅ (WASM) |

5.3 Performance Tiers (2025 basis, approximate ranking)

GPU Serving Throughput (high concurrency, A100/H100):

  1. 🥇 LMDeploy (TurboMind) — especially quantized models
  2. 🥇 SGLang — workloads with high prefix reuse
  3. 🥈 TensorRT-LLM — optimal performance after engine build
  4. 🥈 vLLM — general-purpose champion
  5. 🥉 TGI — slightly behind vLLM
  6. DeepSpeed-MII, MLC LLM

Single Request Latency:

  1. 🥇 TensorRT-LLM (compiled kernels)
  2. 🥈 SGLang / vLLM
  3. 🥉 LMDeploy

Consumer GPU (single user):

  1. 🥇 ExLlamaV2 — highest speed
  2. 🥈 llama.cpp (GPU offload)
  3. 🥉 Ollama / PowerInfer

6. Scenario-Based Recommendations

Scenario 1: Production Chatbot Service

Recommendation: vLLM or SGLang + Ray Serve

  • High concurrency, stable TTFT needed
  • If multi-turn chat, SGLang (RadixAttention advantage)
  • Add Ray Serve if autoscaling needed

Scenario 2: NVIDIA-only, Maximum Performance

Recommendation: TensorRT-LLM + Triton

  • Fixed models with engine build investment feasible
  • Maximum throughput with FP8 (H100)
  • Enterprise-level stability

Scenario 3: Local Development / Prototyping

Recommendation: Ollama

  • 5-minute installation + execution
  • Simple model management via model registry

Scenario 4: CPU Server / GPU-less Environment

Recommendation: llama.cpp or CTranslate2

  • llama.cpp: General LLM, various quantizations
  • CTranslate2: Specialized for translation/Whisper etc.

Scenario 5: Mobile App / Browser

Recommendation: MLC LLM (mobile), llama.cpp (mobile), Candle (WASM)

  • MLC LLM: Most comprehensive mobile support
  • web-llm: WebGPU-based browser execution

Scenario 6: Single GPU, Large Model

Recommendation: PowerInfer (sparse models) or DeepSpeed-MII (ZeRO-Inference)

  • Run GPU memory-exceeding models with CPU offloading

Scenario 7: Agent / Tool-use / Structured Output

Recommendation: SGLang

  • Maximize prefix reuse with RadixAttention
  • JSON output optimization with Compressed FSM
  • Compose complex LLM pipelines with DSL

Scenario 8: OpenAI API Drop-in Replacement

Recommendation: LocalAI

  • Full /v1/chat/completions compatibility
  • Text + image + audio all-in-one

7. Conclusion

As of 2025, the LLM serving ecosystem is entering maturity with distinct tool differentiation.

  1. vLLM and SGLang’s two-horse race: vLLM dominates general serving, SGLang leads in structured workloads. This structure strengthens as TGI enters maintenance mode.

  2. KV cache management innovation: PagedAttention became standard, RadixAttention opened new possibilities for prefix reuse. KV quantization (FP8) is the next frontier of memory efficiency.

  3. Speculative Decoding ubiquity: 2-4x speedups via EAGLE 3, Medusa, and similar methods are becoming commonplace, with support in all major engines.

  4. Disaggregated Serving: Architectures separating Prefill and Decode for independent scaling are emerging as the new standard for large-scale serving.

  5. Consumer hardware accessibility expansion: llama.cpp/Ollama ecosystem democratized local AI, PowerInfer is expanding consumer GPU limits.

Selection Guide Summary

| Priority | Recommended Tool |
|---|---|
| General production | vLLM |
| Maximum throughput (NVIDIA) | LMDeploy or TensorRT-LLM |
| Agent/structured output | SGLang |
| Easy local execution | Ollama |
| Mobile/edge | MLC LLM |
| Maximum single GPU speed | ExLlamaV2 |
| Hardware versatility | llama.cpp |

8. References

Core Papers

  1. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., … & Stoica, I. (2023). “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. [arXiv:2309.06180]

  2. Zheng, L., Yin, L., Xie, Z., Huang, J., Sun, C., Yu, C. H., … & Stoica, I. (2023). “SGLang: Efficient Execution of Structured Language Model Programs.” NeurIPS 2024. [arXiv:2312.07104]

  3. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS 2022. [arXiv:2205.14135]

  4. Dao, T. (2023). “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” ICLR 2024. [arXiv:2307.08691]

  5. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” ICLR 2023. [arXiv:2210.17323]

  6. Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2024). “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” MLSys 2024. [arXiv:2306.00978]

  7. Leviathan, Y., Kalman, M., & Matias, Y. (2023). “Fast Inference from Transformers via Speculative Decoding.” ICML 2023. [arXiv:2211.17192]

  8. Chen, C., Borgeaud, S., Irving, G., Lespiau, J. B., Sifre, L., & Jumper, J. (2023). “Accelerating Large Language Model Decoding with Speculative Sampling.” [arXiv:2302.01318]

  9. Yu, G. I., Jeong, J. S., Kim, G. W., Kim, S., & Chun, B. G. (2022). “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022.

  10. Song, Y., Mi, Z., Xie, H., & Chen, H. (2023). “PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU.” [arXiv:2312.12456]

  11. Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., … & Krishnamurthy, A. (2018). “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” OSDI 2018.

  12. Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.” ICML 2024. [arXiv:2401.15077]


This article is written based on the latest information as of February 2026. The LLM serving ecosystem evolves rapidly, so please check official documentation and latest releases for each tool.
