LLM Compression Techniques Deep Dive — Quantization, Pruning, and Distillation


The moment I bought an RTX 4090 and tried to run Llama 70B, only to get a “CUDA out of memory” error during loading—that was pure despair. I thought 24GB VRAM would be enough, but the 70B model requires over 140GB of memory even at FP16 precision.

That’s where model compression techniques come in. Technologies that shrink a 140GB model down to 35GB so it can run on an RTX 4090. But it’s not simply about “just making it smaller.” We need to understand what exactly gets lost at each compression stage and why it still works despite these losses.

The Science of Number Representation: Anatomy of Floating Point

To understand all compression techniques, we first need to know how computers store real numbers.

FP32 (32-bit floating point) is structured according to the IEEE 754 standard:

| Bit Structure | Role | Range/Precision |
| --- | --- | --- |
| 1-bit sign | Determines positive/negative | ±1 |
| 8-bit exponent | Number magnitude range | ±3.4 × 10³⁸ |
| 23-bit mantissa | Decimal precision | ~7 digits |

This means it can accurately represent about 7 significant decimal digits. But do neural network weights really need that level of precision?
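The three fields can be inspected directly. A minimal sketch using only Python's standard `struct` module (the helper name is mine, not from any library):

```python
import struct

def fp32_bits(x: float) -> tuple[int, int, int]:
    """Split a number into its IEEE 754 single-precision fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                 # 1 bit
    exponent = (raw >> 23) & 0xFF    # 8 bits, biased by 127
    mantissa = raw & 0x7FFFFF        # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa

# -6.25 = -1.5625 x 2^2 -> sign 1, biased exponent 2 + 127 = 129
print(fp32_bits(-6.25))  # (1, 129, 4718592)
```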

FP32 → FP16: The First Revolution

The transition to FP16 isn’t simply “cutting in half”. The internal structure fundamentally changes:

| Change | FP32 | FP16 | Real Impact |
| --- | --- | --- | --- |
| Mantissa | 23-bit | 10-bit | Precision ~7 digits → 3–4 digits |
| Exponent | 8-bit | 5-bit | Range ±3.4×10³⁸ → ±65,504 |
| Representable values | ~4 billion | ~65,000 | Massive resolution decrease |

Real-world impact:

  • Most LLM inference: Almost no performance degradation (perplexity difference <1%)
  • Training process: Potential underflow issues in gradient accumulation
  • Extreme value handling: Values beyond ±65,504 overflow
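Both failure modes in the list above are easy to reproduce; a quick NumPy sketch (the numbers come from the FP16 format itself, not from any particular model):

```python
import numpy as np

print(np.float16(65504))   # largest finite FP16 value, still representable
print(np.float16(70000))   # beyond the range: overflows to inf
print(np.float16(1e-8))    # below the smallest subnormal (~6e-8): flushes to 0.0
```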

This is where BF16 (Brain Floating Point 16) emerged.

BF16’s Innovation: Range vs Precision Tradeoff

BF16, developed by Google for TPUs, took a different approach:

| Format | Exponent | Mantissa | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| FP16 | 5-bit | 10-bit | Higher precision | Narrow range |
| BF16 | 8-bit | 7-bit | Same range as FP32 | Lower precision |

BF16’s key insight is “don’t sacrifice range.” By maintaining the same 8-bit exponent as FP32, it eliminates overflow problems entirely. It trades off mantissa precision to 7 bits, but for most deep learning workloads, this is sufficient.

FP16 → INT8: From Floating Point to Integer

Now the real challenge begins. Completely abandoning floating point and moving to integers.

INT8 is simple: it uses only the 256 integers from –128 to 127. But doing this properly requires understanding the quantization process.

The Core of Quantization: Scale Factor and Zero Point

The formula for mapping continuous floating point values to 256 discrete integers:

quantized_value = round((original_value - zero_point) / scale_factor)

Where:

  • scale_factor: The ratio that divides the actual value range into 256 intervals
  • zero_point: The offset determining which integer corresponds to 0.0

Symmetric vs Asymmetric Quantization:

  • Symmetric: zero_point = 0, symmetric range (–127 to 127)
  • Asymmetric: zero_point ≠ 0, optimized for actual data distribution
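Both variants can be sketched in a few lines of NumPy. Note this sketch uses the common convention q = round(x/scale) + zero_point, where the zero point lives in the integer domain; the function names are illustrative:

```python
import numpy as np

def quantize_int8(x: np.ndarray, symmetric: bool = True):
    if symmetric:
        qmin, qmax = -127, 127                       # symmetric grid, zero_point = 0
        scale = np.abs(x).max() / qmax
        zero_point = 0
    else:
        qmin, qmax = -128, 127                       # full grid, shifted to fit the data
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.linspace(-1.0, 1.0, 101, dtype=np.float32)
q, scale, zp = quantize_int8(x)
x_hat = dequantize(q, scale, zp)
# round-trip error is bounded by about half a quantization step
print(np.max(np.abs(x_hat - x)) <= scale)
```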

Real Impact of INT8: Still Manageable

Surprisingly, most text generation tasks show nearly identical performance to FP16. But a closer look reveals where the differences start to emerge:

Performance maintained areas:

  • General conversation, summarization, translation
  • Basic common-sense reasoning

Minor degradation areas:

  • Complex mathematical problems (1–3% drop in GSM8K)
  • Logical reasoning chains
  • Coding benchmarks (2–5% drop in HumanEval)

The Outlier Problem: Why SmoothQuant Emerged

The biggest enemy of INT8 quantization is “outlier activation values”.

In some channels of LLMs, activation values 100+ times larger than normal range appear. For example:

  • Most activations: –1.0 to 1.0 range
  • Outlier channels: extreme values like –150.0 or 200.0

These extreme values ruin the scale_factor for the entire tensor. If a single 200.0 forces the quantization range out to ±200, most of the values between –1 and 1 get mapped to just –1, 0, or 1, causing severe information loss.

SmoothQuant’s solution:

  1. Shift activation extremes to weights: Mathematically equivalent transformation to “smooth” activations
  2. Weights are easy to quantize: Weight distributions are typically well-behaved (roughly zero-centered, with few outliers), so they absorb the shifted scale without much damage
  3. Result: Both activations and weights can be cleanly quantized to INT8

With SmoothQuant, perplexity increase compared to FP16 stays under 1%.
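The equivalence trick in step 1 is simple to verify numerically. A toy sketch with migration strength α = 0.5 (the smoothing formula s = max|x|^α / max|w|^(1−α) follows the SmoothQuant paper; everything else here is illustrative):

```python
import numpy as np

def smooth(x, W, alpha=0.5):
    """Divide each activation channel by s and fold s into the weights:
    x @ W == (x / s) @ (s[:, None] * W), so the layer output is unchanged."""
    act_max = np.abs(x).max(axis=0)              # per-channel activation range
    w_max = np.abs(W).max(axis=1)                # per-channel weight range
    s = act_max ** alpha / w_max ** (1 - alpha)
    return x / s, W * s[:, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x[:, 3] *= 100                                   # inject an outlier channel
W = rng.normal(size=(8, 16))
x_s, W_s = smooth(x, W)

print(np.allclose(x @ W, x_s @ W_s))             # mathematically equivalent
print(np.abs(x_s[:, 3]).max())                   # the outlier channel is tamed
```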

INT8 → INT4: The Extreme Territory

Now we truly reach the extreme: representing every weight with just 16 values (–8 to 7).

Group Quantization: Essential Survival Technique

For INT4, group quantization is almost mandatory. Quantizing entire weights with a single scale_factor causes too much information loss.

How it works:

  • Divide weights into groups of 128
  • Calculate separate scale_factor for each group
  • Total memory: 4bit × weights + scale_factors

For example, with 4096 weights:

  • 32 groups (4096 ÷ 128)
  • 4096 × 4bit + 32 × 16bit = 16,384 + 512 = 16,896 bits
  • Simple INT4: 4096 × 4bit = 16,384 bits
  • Overhead: ~3% (negligible compared to quality improvement)
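The per-group bookkeeping above looks like this in NumPy (a sketch of the scheme just described: symmetric INT4 with scale = group max / 7; the names are mine):

```python
import numpy as np

def quantize_int4_grouped(w, group_size=128):
    """Symmetric INT4 quantization with one scale factor per group of weights."""
    groups = w.reshape(-1, group_size)                    # (n_groups, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7  # map each group's max to ±7
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_grouped(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)   # the article's 4096-weight example
q, scales = quantize_int4_grouped(w)           # q: (32, 128), scales: (32, 1)
w_hat = dequantize_grouped(q, scales)
```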

Real Impact of INT4: Finally Noticeable

From INT4 onwards, performance degradation becomes clearly apparent:

| Benchmark | FP16 | INT8 | INT4 | Degradation (INT4 vs FP16) |
| --- | --- | --- | --- | --- |
| WikiText Perplexity | 5.36 | 5.41 | 5.68 | +6% |
| MMLU | 76.2% | 75.8% | 73.1% | –4% |
| GSM8K | 84.3% | 82.7% | 78.9% | –6.4% |
| HumanEval | 45.7% | 44.1% | 39.2% | –14.2% |

Particularly noticeable degradation areas:

  • Mathematical accuracy: Decimal calculations, large number operations
  • Rare tokens: Technical terms, foreign languages, specialized vocabulary
  • Long context consistency: Context retention in 4k+ tokens
  • Coding: Complex algorithms, precise grammatical structures

Why still use it:

  • Overwhelming memory savings: 70B model from 140GB → 35GB
  • Still sufficient quality for most practical tasks
  • Inference speed improvement: 25–40% faster due to memory bandwidth relief

Technical Depth of Advanced Compression Techniques

GPTQ: The Magic of Hessian-Based Optimization

GPTQ extends Optimal Brain Quantization (OBQ) to large language models.

Core idea:

  1. Hessian matrix: Calculate second derivative impact of each weight on loss function
  2. Layer-wise processing: One layer at a time, memory efficiently
  3. Error compensation: Distribute quantization error of one weight to other weights

Mathematical approach:

Δw_rest = −((w_q − quant(w_q)) / [H⁻¹]_qq) · [H⁻¹]_q,rest

Each quantized weight's error, scaled by the corresponding row of the inverse Hessian, is subtracted from the not-yet-quantized weights, so the overall increase in loss stays minimal.
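A heavily simplified single-row sketch of that update (dense inverse Hessian instead of GPTQ's blocked Cholesky implementation, fixed left-to-right order, INT4 grid; all names are illustrative):

```python
import numpy as np

def gptq_quantize_row(w, H_inv, scale):
    """Quantize weights left to right; after each one, push its quantization
    error onto the not-yet-quantized weights, weighted by the inverse Hessian."""
    w = w.astype(np.float64).copy()
    q = np.empty_like(w)
    for i in range(len(w)):
        q[i] = np.clip(np.round(w[i] / scale), -8, 7)     # snap to the INT4 grid
        err = (w[i] - q[i] * scale) / H_inv[i, i]
        w[i + 1:] -= err * H_inv[i, i + 1:]               # error compensation
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))                 # calibration activations
w = rng.normal(size=16)
H = X.T @ X + 0.01 * np.eye(16)                # damped Hessian proxy
scale = np.abs(w).max() / 7
q = gptq_quantize_row(w, np.linalg.inv(H), scale)
```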

Benchmark performance:

  • Llama 70B GPTQ-INT4: perplexity 5.54 (+3.4% vs FP16)
  • Inference speed: 2.6x improvement vs FP16 (with Marlin kernel)

AWQ: Activation-Based Weight Importance

AWQ has a completely different philosophy: “Find weights that matter when actual data flows through”

Process:

  1. Run actual inference with calibration dataset
  2. Analyze activation magnitude of each channel
  3. Identify salient weights: Weights connected to high activations
  4. Mixed precision: Important weights stay FP16, others go INT4

Formula:

importance_score = mean(|activation_values|) * weight_variance
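As a sketch, the scoring and the mixed-precision split might look like this (this follows the article's formula above; AWQ's actual criterion and per-channel scaling differ in detail, and all names here are illustrative):

```python
import numpy as np

def importance_scores(activations, W):
    """Score each input channel: mean |activation| times weight variance."""
    act_mean = np.abs(activations).mean(axis=0)   # per input channel
    w_var = W.var(axis=1)                         # variance of that channel's weights
    return act_mean * w_var

rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 64))                 # calibration activations
W = rng.normal(size=(64, 128))                    # weight matrix (in_ch x out_ch)
scores = importance_scores(acts, W)

# keep the top 1% most salient channels in FP16, quantize the rest to INT4
keep_fp16 = scores >= np.quantile(scores, 0.99)
```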

AWQ vs GPTQ real performance comparison:

| Technique | MMLU | GSM8K | HumanEval | ARC-C | Average Quality Retention |
| --- | --- | --- | --- | --- | --- |
| AWQ | 73.4% | 79.2% | 39.8% | 68.1% | 95% |
| GPTQ | 72.8% | 77.6% | 38.4% | 66.9% | 90% |

AWQ consistently shows better quality, with particularly large gaps in reasoning tasks like GSM8K and ARC-Challenge.

FP8 vs INT8: The Power of Hardware Native

FP8 structure (E4M3 format):

  • 1-bit sign + 4-bit exponent + 3-bit mantissa
  • Range: –448 to 448
  • Precision: 3 mantissa bits (roughly 1–2 significant decimal digits)

Performance comparison with INT8 (H100 basis):

| Metric | FP8 | INT8 | Improvement |
| --- | --- | --- | --- |
| Perplexity | 5.41 | 5.45 | FP8 advantage |
| Inference speed | 1,247 tok/s | 1,182 tok/s | +5.5% |
| Memory efficiency | Same | Same | Same |
| Implementation complexity | Low | Medium | FP8 advantage |

Remarkably, FP8 shows lower perplexity than INT8. This suggests floating point representation is still better suited to the shape of weight and activation distributions.

Extreme Compression: INT3, INT2, and BitNet

The Reality of 3-bit and 2-bit

INT3 (8 values): Being attempted in some research, but with 10–20% performance degradation on most benchmarks.

INT2 (4 values): Experimental level, very low practicality.

BitNet: New Paradigm of Extreme Compression

BitNet b1.58 trains from scratch with extremely compressed weights:

  • Weights: –1, 0, 1 (log₂ 3 ≈ 1.58 bits per weight)
  • Activations: kept at INT8
  • The ternary constraint is applied during training, not after

Result: Much better performance than post-training quantization, but requires completely new architecture and training process.

Practical Benchmarks: What to Choose

Comprehensive Performance Comparison

| Technique | Memory Saving | Perplexity | MMLU | GSM8K | HumanEval | Inference Speed | Implementation Difficulty |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FP16 | 0% | 5.36 | 76.2% | 84.3% | 45.7% | 1.0x | ⭐ |
| FP8 | 50% | 5.34 | 76.0% | 83.9% | 45.2% | 1.3x | ⭐⭐ (H100 required) |
| AWQ-INT4 | 75% | 5.52 | 73.4% | 79.2% | 39.8% | 2.8x | ⭐⭐⭐ |
| GPTQ-INT4 | 75% | 5.68 | 72.8% | 77.6% | 38.4% | 2.6x | ⭐⭐⭐ |
| GGUF Q4_K_M | 75% | 5.61 | 73.0% | 78.1% | 39.1% | 2.2x | ⭐⭐ |

Understanding Through Real Models: Qwen 3.5-397B

Since theory alone might not be intuitive, let’s use a recent model as an example. Looking at Qwen 3.5-397B-A17B on Hugging Face, there are these files:

  • Qwen3.5-397B-A17B — Original BF16
  • Qwen3.5-397B-A17B-FP8 — FP8 quantization
  • Qwen3.5-397B-A17B-GGUF — GGUF various quantizations

This model uses an MoE (Mixture-of-Experts) structure with 397B total parameters but only 17B active parameters. "397B but only using 17B" means that when processing a single token, only 17B of the 397B parameters actually activate. However, all 397B must be loaded into memory: the unused parameters must stay on standby because you never know when they'll activate.

| Quantization Level | Model Size | Required Memory | Runnable Environment |
| --- | --- | --- | --- |
| BF16 (original) | ~807GB | 810GB+ VRAM | H100 × 10+ units |
| FP8 | ~400GB | 420GB+ VRAM | H100 × 5 units |
| INT4 (Q4_K_M) | ~214GB | 256GB+ | M3 Ultra Mac or H100 × 3 units |
| INT3 (Q3_K_M) | ~170GB | 192GB+ | M3 Ultra Mac (192GB) |
| INT2 (Q2_K) | ~130GB | 150GB+ | High-capacity RAM server |

As shown, the original 807GB model shrinks to 214GB with INT4. Less than a quarter of the original size—this is the power of quantization.

But why isn’t it exactly one-quarter? 397B × 4bit ÷ 8 = ~199GB but it’s 214GB because of Dynamic Quantization. Techniques like Unsloth’s Dynamic 2.0 keep important layers at 8bit or 16bit while compressing less important layers to 4bit. This mixed-bit strategy significantly improves quality over pure 4bit.

The FP8 version is almost exactly half at ~400GB. Unlike group quantization, FP8 only needs to store per-tensor or per-channel scales, with virtually no additional metadata. And since the H100 natively supports FP8, inference speed is similar to or faster than BF16.

Practical meaning: To run this model personally, you need at minimum INT4 (Q4_K_M) with 256GB memory. Impossible with RTX 4090 (24GB), but possible with Apple M3 Ultra (192GB unified memory) up to INT3. With MoE offloading, 24GB GPU + 256GB RAM combination is possible but token generation will be slower.

The same model requires completely different hardware depending on quantization level. This is why quantization technology isn’t a “theoretical interest” but a “practical necessity.”

Optimal Choice Guide by GPU

RTX 4090 (24GB):

  • 1st choice: GGUF Q4_K_M (stability, compatibility)
  • 2nd choice: AWQ-INT4 (performance priority)
  • Expected performance: Llama 70B baseline 15–25 tok/s

H100 (80GB):

  • 1st choice: FP8 (if possible)
  • 2nd choice: FP16 (when highest quality needed)
  • Expected performance: 50–100+ tok/s

Apple Silicon (32GB+ unified memory):

  • Only choice: GGUF
  • Expected performance: 3–8 tok/s (CPU)

Future Outlook: Evolution of Compression Technology

Hardware Evolution

  • FP4 native support: Next-gen GPUs expected to directly support 4-bit floating point
  • Mixed precision accelerators: Efficient processing of different precisions per layer

Algorithm Evolution

  • Adaptive quantization: Dynamic precision adjustment based on input
  • Neural architecture co-design: New model architectures designed with quantization in mind

Practical Evolution

  • One-click quantization: Automatic optimal compression without complex user settings
  • Quality-speed auto balancing: Automatic compression tailored to task requirements

Personal Thoughts

Diving deep into LLM compression technology, what struck me most was the “perfect harmony of mathematical elegance and practical effectiveness.”

That we can compress from 32-bit to 4-bit—an 8x reduction—while barely noticing the difference in most tasks proves neural networks have much greater redundancy than expected. But simultaneously, the need for sophisticated techniques like outlier handling, group quantization, activation-aware weighting shows that brute-force compression has limits.

From a practical perspective:

  • Industrial settings: FP8 (H100) or AWQ will become standard
  • Personal users: GGUF remains the most accessible choice
  • Research: Extreme approaches like BitNet provide new breakthroughs

The most intriguing future scenario is “1-bit native hardware.” As BitNet proved, designing from scratch with extreme compression in mind can achieve amazing efficiency. If GPU vendors directly support 1-bit or 2-bit operations in hardware, current compression techniques will evolve to entirely new dimensions.

Ultimately, compression technology’s core is “wise choice of what to sacrifice and what to preserve.” We can sacrifice 7th decimal place precision, but must preserve reasoning ability. Memory can be reduced to one-eighth, but core performance must stay above 95%.

This delicate sense of balance is why LLM compression technology is true engineering artistry beyond simple “size reduction.”
