LLM Compression Techniques Deep Dive — Quantization, Pruning, and Distillation
The moment I bought an RTX 4090 and tried to run Llama 70B, only to get a “CUDA out of memory” error during loading—that was pure despair. I thought 24GB VRAM would be enough, but the 70B model requires over 140GB of memory even at FP16 precision.
That’s where model compression techniques come in. Technologies that shrink a 140GB model down to 35GB so it can run on an RTX 4090. But it’s not simply about “just making it smaller.” We need to understand what exactly gets lost at each compression stage and why it still works despite these losses.
The Science of Number Representation: Anatomy of Floating Point
To understand all compression techniques, we first need to know how computers store real numbers.
FP32 (32-bit floating point) is structured according to the IEEE 754 standard:
| Bit Structure | Role | Range/Precision |
|---|---|---|
| 1-bit sign | Determines positive/negative | ±1 |
| 8-bit exponent | Number magnitude range | ±3.4 × 10³⁸ |
| 23-bit mantissa | Decimal precision | ~7 digits |
This means it can represent about 7 significant decimal digits. But do neural network weights really need that level of precision?
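These fields are easy to inspect directly. A small Python sketch (the helper name `fp32_bits` is mine) that unpacks the three IEEE 754 fields from a 32-bit float:

```python
import struct

def fp32_bits(x: float):
    """Decompose an FP32 value into its IEEE 754 fields."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = raw >> 31                  # 1 bit
    exponent = (raw >> 23) & 0xFF     # 8 bits, biased by 127
    mantissa = raw & 0x7FFFFF         # 23 bits, implicit leading 1

    return sign, exponent, mantissa

# 1.0 = +1.0 x 2^0 -> sign 0, biased exponent 127, mantissa 0
print(fp32_bits(1.0))    # (0, 127, 0)
print(fp32_bits(-2.0))   # (1, 128, 0)
```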
FP32 → FP16: The First Revolution
The transition to FP16 isn’t simply “cutting in half”. The internal structure fundamentally changes:
| Change | FP32 | FP16 | Real Impact |
|---|---|---|---|
| Mantissa | 23-bit | 10-bit | Precision ~7 digits → 3–4 digits |
| Exponent | 8-bit | 5-bit | Range ±3.4×10³⁸ → ±65,504 |
| Representable values | ~4 billion | ~65,000 | Massive resolution decrease |
Real-world impact:
- Most LLM inference: Almost no performance degradation (perplexity difference <1%)
- Training process: Potential underflow issues in gradient accumulation
- Extreme value handling: Values beyond ±65,504 overflow
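All three effects are easy to reproduce with NumPy (assuming it is installed):

```python
import numpy as np

# Overflow: anything beyond +/-65,504 becomes infinity
assert np.isinf(np.float16(70000.0))

# Underflow: values below the smallest FP16 subnormal (~6e-8) collapse to zero
assert float(np.float16(1e-8)) == 0.0

# Precision: only ~3-4 significant digits survive the 10-bit mantissa
print(float(np.float16(0.1)))  # 0.0999755859375
```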
This is where BF16 (Brain Floating Point 16) emerged.
BF16’s Innovation: Range vs Precision Tradeoff
BF16, developed by Google for TPUs, took a different approach:
| Format | Exponent | Mantissa | Advantages | Disadvantages |
|---|---|---|---|---|
| FP16 | 5bit | 10bit | High precision | Narrow range |
| BF16 | 8bit | 7bit | Same range as FP32 | Lower precision |
BF16’s key insight is “don’t sacrifice range.” By maintaining the same 8-bit exponent as FP32, it eliminates overflow problems entirely. It trades off mantissa precision to 7 bits, but for most deep learning workloads, this is sufficient.
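Since BF16 is simply FP32 with the bottom 16 bits of the mantissa dropped, the conversion can be sketched as a bitmask (a simplification: real hardware rounds rather than truncates):

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate an FP32 value to BF16 by zeroing the low 16 bits,
    leaving 1 sign bit, 8 exponent bits, and 7 mantissa bits."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", raw & 0xFFFF0000))[0]

print(to_bf16(0.1))    # 0.099609375 -- only ~2-3 digits survive
print(to_bf16(1e38))   # still finite: same exponent range as FP32
```

Note that 1e38 would overflow to infinity in FP16 (max ±65,504) but survives BF16 untouched in range.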
FP16 → INT8: From Floating Point to Integer
Now the real challenge begins. Completely abandoning floating point and moving to integers.
INT8 is simple: it only uses 256 integers from –128 to 127. But to do this properly, we need to understand the quantization process.
The Core of Quantization: Scale Factor and Zero Point
The formula for mapping continuous floating point values to 256 discrete integers:
quantized_value = round(original_value / scale_factor) + zero_point
Where:
- scale_factor: the real-valued step size that divides the actual value range into 256 intervals
- zero_point: the integer that real 0.0 maps to
Symmetric vs Asymmetric Quantization:
- Symmetric: zero_point = 0, symmetric range (–127 to 127)
- Asymmetric: zero_point ≠ 0, optimized for actual data distribution
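A minimal NumPy sketch of both modes (helper names are mine):

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """zero_point fixed at 0; the grid is symmetric around zero."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale                                  # dequantize: q * scale

def quantize_asymmetric(x, bits=8):
    """zero_point shifts the grid to fit the actual data range."""
    qmin, qmax = 0, 2 ** bits - 1                    # unsigned 0..255
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))  # the integer meaning 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point                      # dequantize: (q - zp) * scale

x = np.array([-1.0, 0.0, 0.5, 2.0])
q, s = quantize_symmetric(x)
q2, s2, zp = quantize_asymmetric(x)
print(q * s)                                         # round-trips close to x
print((q2.astype(np.float64) - zp) * s2)
```

For the skewed range above (–1.0 to 2.0), the asymmetric version uses all 256 levels while the symmetric one wastes the levels between –128 and –64.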
Real Impact of INT8: Still Manageable
Surprisingly, most text generation tasks show nearly identical performance to FP16. But careful examination reveals emerging differences:
Performance maintained areas:
- General conversation, summarization, translation
- Basic common-sense reasoning
Minor degradation areas:
- Complex mathematical problems (1–3% drop in GSM8K)
- Logical reasoning chains
- Coding benchmarks (2–5% drop in HumanEval)
The Outlier Problem: Why SmoothQuant Emerged
The biggest enemy of INT8 quantization is “outlier activation values”.
In some channels of LLMs, activation values 100+ times larger than normal range appear. For example:
- Most activations: –1.0 to 1.0 range
- Outlier channels: extreme values like –150.0 or 200.0
These extreme values ruin the entire scale_factor. If a single 200.0 forces the scale to cover –200 to 200, the dense mass of –1 to 1 values collapses onto just –1, 0, and 1, causing severe information loss.
SmoothQuant’s solution:
- Shift activation extremes to weights: Mathematically equivalent transformation to “smooth” activations
- Weights are easy to quantize: Weight distributions are usually uniformly distributed and easy to quantize
- Result: Both activations and weights can be cleanly quantized to INT8
With SmoothQuant, perplexity increase compared to FP16 stays under 1%.
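The equivalence transformation at SmoothQuant's core can be sketched in a few lines (a simplified version of the paper's per-channel smoothing; alpha is the migration-strength knob):

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """SmoothQuant-style equivalence transform: (X / s) @ (s * W) == X @ W.
    alpha controls how much of the outlier difficulty migrates from
    activations into weights."""
    act_max = np.abs(X).max(axis=0)        # per input channel, from calibration
    w_max = np.abs(W).max(axis=1)          # per input channel (rows of W)
    s = act_max ** alpha / w_max ** (1 - alpha)
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 100.0                           # one outlier channel
W = rng.normal(size=(16, 8))
Xs, Ws = smooth(X, W)

assert np.allclose(Xs @ Ws, X @ W)         # output is mathematically unchanged
print(np.abs(X).max(), np.abs(Xs).max())   # activation outlier is tamed
```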
INT8 → INT4: The Extreme Territory
Now we truly reach the extreme. Representing all weights with 16 values (-8 to 7).
Group Quantization: Essential Survival Technique
For INT4, group quantization is almost mandatory. Quantizing entire weights with a single scale_factor causes too much information loss.
How it works:
- Divide weights into groups of 128
- Calculate separate scale_factor for each group
- Total memory: 4bit × weights + scale_factors
For example, with 4096 weights:
- 32 groups (4096 ÷ 128)
- 4096 × 4bit + 32 × 16bit = 16,384 + 512 = 16,896 bits
- Simple INT4: 4096 × 4bit = 16,384 bits
- Overhead: ~3% (negligible compared to quality improvement)
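The arithmetic above can be sketched directly (absmax scaling per group; the helper name is mine):

```python
import numpy as np

def quantize_int4_grouped(w, group_size=128):
    """Absmax INT4 quantization with a separate FP16 scale per group."""
    groups = w.reshape(-1, group_size)                      # one row per group
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7  # INT4 max is 7
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scales = quantize_int4_grouped(w)

# Storage check: 4096 weights x 4 bits + 32 scales x 16 bits = 16,896 bits
print(q.shape, scales.shape, 4096 * 4 + scales.size * 16)
```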
Real Impact of INT4: Finally Noticeable
From INT4 onwards, performance degradation becomes clearly apparent:
| Benchmark | FP16 | INT8 | INT4 | INT4 vs FP16 (relative) |
|---|---|---|---|---|
| WikiText Perplexity | 5.36 | 5.41 | 5.68 | +6% |
| MMLU | 76.2% | 75.8% | 73.1% | –4% |
| GSM8K | 84.3% | 82.7% | 78.9% | –6.4% |
| HumanEval | 45.7% | 44.1% | 39.2% | –14.2% |
Particularly noticeable degradation areas:
- Mathematical accuracy: Decimal calculations, large number operations
- Rare tokens: Technical terms, foreign languages, specialized vocabulary
- Long context consistency: Context retention in 4k+ tokens
- Coding: Complex algorithms, precise grammatical structures
Why still use it:
- Overwhelming memory savings: 70B model from 140GB → 35GB
- Still sufficient quality for most practical tasks
- Inference speed improvement: 25–40% faster due to memory bandwidth relief
Technical Depth of Advanced Compression Techniques
GPTQ: The Magic of Hessian-Based Optimization
GPTQ extends Optimal Brain Quantization (OBQ) to large language models.
Core idea:
- Hessian matrix: Calculate second derivative impact of each weight on loss function
- Layer-wise processing: One layer at a time, memory efficiently
- Error compensation: Distribute quantization error of one weight to other weights
Mathematical approach:
w_rest ← w_rest − ((w_q − quant(w_q)) / [H⁻¹]_qq) · H⁻¹[rest, q]
After each weight is quantized, its rounding error is spread across the not-yet-quantized weights in proportion to the inverse Hessian, keeping the layer's output error minimal. (Unlike OBQ, GPTQ quantizes weights in a fixed order rather than greedily picking the "least important" one first; this simplification is what makes it fast enough for billion-parameter models.)
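The error-compensation loop can be sketched as follows. This is a heavy simplification: real GPTQ works from a Cholesky factorization of H⁻¹ and processes weights in blocks, and the calibration setup, grid, and names here are mine:

```python
import numpy as np

def gptq_row_sketch(w, H_inv, grid):
    """Quantize one weight row left to right, pushing each weight's
    rounding error onto the not-yet-quantized weights via H^-1."""
    w = w.astype(np.float64).copy()
    q = np.empty_like(w)
    for i in range(len(w)):
        q[i] = grid[np.argmin(np.abs(grid - w[i]))]    # nearest grid point
        err = (w[i] - q[i]) / H_inv[i, i]
        w[i + 1:] -= err * H_inv[i + 1:, i]            # compensate downstream
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))                          # calibration activations
H_inv = np.linalg.inv(X.T @ X + 0.01 * np.eye(16))      # damped Hessian inverse
grid = np.linspace(-1, 1, 16)                           # 16 INT4 levels
w = rng.normal(scale=0.3, size=16)
q = gptq_row_sketch(w, H_inv, grid)
```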
Benchmark performance:
- Llama 70B GPTQ-INT4: perplexity 5.54 (+3.4% vs FP16)
- Inference speed: 2.6x improvement vs FP16 (with Marlin kernel)
AWQ: Activation-Based Weight Importance
AWQ has a completely different philosophy: “Find weights that matter when actual data flows through”
Process:
- Run actual inference with calibration dataset
- Analyze activation magnitude of each channel
- Identify salient weights: Weights connected to high activations
- Protect salient weights: Keeping them in FP16 outright works but is hardware-unfriendly, so AWQ instead folds the protection into an equivalent per-channel scale before INT4 quantization
Simplified importance heuristic:
importance_score = mean(|activation_values|) * weight_variance
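In practice the protection is implemented as a per-channel scaling folded into the weights. A hedged sketch (the helper names, the alpha handling, and the per-tensor quantizer are my simplifications, not the paper's grid search):

```python
import numpy as np

def quantize_dequantize(w, bits=4):
    """Per-tensor absmax fake-quantization (round-trip through the INT grid)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def awq_scale_sketch(X, W, alpha=0.5, bits=4):
    """Scale salient input channels up before quantizing, then fold the
    scale back out, so those channels keep more effective precision."""
    act_mag = np.abs(X).mean(axis=0)       # per-channel activation magnitude
    s = act_mag ** alpha
    s /= s.mean()                          # keep scales centered around 1
    return quantize_dequantize(W * s[:, None], bits) / s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))
X[:, 0] *= 50.0                            # channel 0 carries outlier activations
W = rng.normal(size=(16, 8))
Wq = awq_scale_sketch(X, W)
print(np.linalg.norm(X @ Wq - X @ W))      # activation-weighted quantization error
```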
AWQ vs GPTQ real performance comparison:
| Technique | MMLU | GSM8K | HumanEval | ARC-C | Average Quality Retention |
|---|---|---|---|---|---|
| AWQ | 73.4% | 79.2% | 39.8% | 68.1% | 95% |
| GPTQ | 72.8% | 77.6% | 38.4% | 66.9% | 90% |
AWQ consistently shows better quality, with particularly large gaps in reasoning tasks like GSM8K and ARC-Challenge.
FP8 vs INT8: The Power of Hardware Native
FP8 structure (E4M3 format):
- 1-bit sign + 4-bit exponent + 3-bit mantissa
- Range: –448 to 448
- Precision: ~1–2 significant digits (only 3 mantissa bits)
Performance comparison with INT8 (H100 basis):
| Metric | FP8 | INT8 | Improvement |
|---|---|---|---|
| Perplexity | 5.41 | 5.45 | FP8 advantage |
| Inference speed | 1,247 tok/s | 1,182 tok/s | +5.5% |
| Memory efficiency | Same | Same | Same |
| Implementation complexity | Low | Medium | FP8 advantage |
Remarkably, FP8 shows lower perplexity than INT8. This suggests that floating point representation still fits the bell-shaped distributions of weights and activations better than a uniform integer grid does.
Extreme Compression: INT3, INT2, and BitNet
The Reality of 3-bit and 2-bit
- INT3 (8 values): attempted in some research, but shows 10–20% degradation on most benchmarks
- INT2 (4 values): experimental only, with very low practicality
BitNet: New Paradigm of Extreme Compression
BitNet b1.58 trains from scratch with extremely compressed weights:
- Weights: –1, 0, 1 (log₂3 ≈ 1.58 bits per weight)
- Activations: Maintain INT8
- Apply constraints from training
Result: Much better performance than post-training quantization, but requires completely new architecture and training process.
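The weight quantization itself is easy to sketch: the "absmean" scheme scales each weight matrix by its mean absolute value, then rounds and clips to three values (training-time details like the straight-through estimator are omitted here):

```python
import numpy as np

def absmean_ternary(W):
    """BitNet b1.58-style weight quantization: scale by the mean absolute
    weight, then round and clip to {-1, 0, 1}."""
    gamma = np.abs(W).mean() + 1e-8        # avoid division by zero
    Wq = np.clip(np.round(W / gamma), -1, 1)
    return Wq, gamma                        # dequantize as Wq * gamma

W = np.random.default_rng(0).normal(size=(4, 4))
Wq, gamma = absmean_ternary(W)
print(Wq)                                   # entries are only -1, 0, or 1
```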
Practical Benchmarks: What to Choose
Comprehensive Performance Comparison
| Technique | Memory Saving | Perplexity | MMLU | GSM8K | HumanEval | Inference Speed | Implementation Difficulty |
|---|---|---|---|---|---|---|---|
| FP16 | 0% | 5.36 | 76.2% | 84.3% | 45.7% | 1.0x | ⭐ |
| FP8 | 50% | 5.34 | 76.0% | 83.9% | 45.2% | 1.3x | ⭐⭐ (H100 required) |
| AWQ-INT4 | 75% | 5.52 | 73.4% | 79.2% | 39.8% | 2.8x | ⭐⭐⭐ |
| GPTQ-INT4 | 75% | 5.68 | 72.8% | 77.6% | 38.4% | 2.6x | ⭐⭐⭐ |
| GGUF Q4_K_M | 75% | 5.61 | 73.0% | 78.1% | 39.1% | 2.2x | ⭐⭐ |
Understanding Through Real Models: Qwen 3.5-397B
Since theory alone might not be intuitive, let’s use a recent model as an example. Looking at Qwen 3.5-397B-A17B on Hugging Face, there are these files:
- Qwen3.5-397B-A17B — original BF16
- Qwen3.5-397B-A17B-FP8 — FP8 quantization
- Qwen3.5-397B-A17B-GGUF — various GGUF quantizations
This model uses MoE structure with 397B total parameters but 17B active parameters. “397B but only using 17B” means when processing one token, only 17B out of 397B parameters actually activate. However, all 397B must be loaded into memory. Unused parameters must stay on standby because “you never know when they’ll activate.”
| Quantization Level | Model Size | Required Memory | Runnable Environment |
|---|---|---|---|
| BF16 (original) | ~807GB | 810GB+ VRAM | H100 × 10+ units |
| FP8 | ~400GB | 420GB+ VRAM | H100 × 5 units |
| INT4 (Q4_K_M) | ~214GB | 256GB+ | M3 Ultra Mac or H100 × 3 units |
| INT3 (Q3_K_M) | ~170GB | 192GB+ | M3 Ultra Mac (192GB) |
| INT2 (Q2_K) | ~130GB | 150GB+ | High-capacity RAM server |
As shown, the original 807GB model shrinks to 214GB with INT4. Less than a quarter of the original size—this is the power of quantization.
But why isn’t it exactly one-quarter? 397B × 4bit ÷ 8 = ~199GB but it’s 214GB because of Dynamic Quantization. Techniques like Unsloth’s Dynamic 2.0 keep important layers at 8bit or 16bit while compressing less important layers to 4bit. This mixed-bit strategy significantly improves quality over pure 4bit.
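A back-of-envelope helper makes these numbers reproducible. Note the ~4.3 effective bits for the mixed-bit variant is my assumption, chosen to match the observed 214GB, not a published figure:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight-only memory estimate: params x bits / 8 bits-per-byte, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# 397B parameters at various effective bit widths
for name, bits in [("BF16", 16), ("FP8", 8), ("pure INT4", 4), ("mixed ~4.3bit", 4.3)]:
    print(f"{name:>14}: {model_size_gb(397e9, bits):6.1f} GB")
```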
The FP8 version is exactly half at ~400GB. FP8 needs only per-tensor or per-channel scales, with virtually none of the metadata overhead that group quantization carries. Since H100 natively supports FP8, inference speed is similar to or faster than BF16.
Practical meaning: To run this model personally, you need at minimum INT4 (Q4_K_M) with 256GB memory. Impossible with RTX 4090 (24GB), but possible with Apple M3 Ultra (192GB unified memory) up to INT3. With MoE offloading, 24GB GPU + 256GB RAM combination is possible but token generation will be slower.
The same model requires completely different hardware depending on quantization level. This is why quantization technology isn’t a “theoretical interest” but a “practical necessity.”
Optimal Choice Guide by GPU
RTX 4090 (24GB):
- 1st choice: GGUF Q4_K_M (stability, compatibility)
- 2nd choice: AWQ-INT4 (performance priority)
- Expected performance: Llama 70B baseline 15–25 tok/s
H100 (80GB):
- 1st choice: FP8 (if possible)
- 2nd choice: FP16 (when highest quality needed)
- Expected performance: 50–100+ tok/s
Apple Silicon (32GB+ unified memory):
- Only choice: GGUF
- Expected performance: 3–8 tok/s (CPU)
Future Outlook: Evolution of Compression Technology
Hardware Evolution
- FP4 native support: Next-gen GPUs expected to directly support 4-bit floating point
- Mixed precision accelerators: Efficient processing of different precisions per layer
Algorithm Evolution
- Adaptive quantization: Dynamic precision adjustment based on input
- Neural architecture co-design: New model architectures designed with quantization in mind
Practical Evolution
- One-click quantization: Automatic optimal compression without complex user settings
- Quality-speed auto balancing: Automatic compression tailored to task requirements
Personal Thoughts
Diving deep into LLM compression technology, what struck me most was the “perfect harmony of mathematical elegance and practical effectiveness.”
That we can compress from 32-bit to 4-bit—an 8x reduction—while barely noticing the difference in most tasks proves neural networks have much greater redundancy than expected. But simultaneously, the need for sophisticated techniques like outlier handling, group quantization, activation-aware weighting shows that brute-force compression has limits.
From a practical perspective:
- Industrial settings: FP8 (H100) or AWQ will become standard
- Personal users: GGUF remains the most accessible choice
- Research: Extreme approaches like BitNet provide new breakthroughs
The most intriguing future scenario is “1-bit native hardware.” As BitNet proved, designing from scratch with extreme compression in mind can achieve amazing efficiency. If GPU vendors directly support 1-bit or 2-bit operations in hardware, current compression techniques will evolve to entirely new dimensions.
Ultimately, compression technology’s core is “wise choice of what to sacrifice and what to preserve.” We can sacrifice 7th decimal place precision, but must preserve reasoning ability. Memory can be reduced to one-eighth, but core performance must stay above 95%.
This delicate sense of balance is why LLM compression technology is true engineering artistry beyond simple “size reduction.”