LLM Compression Techniques Deep Dive — Quantization, Pruning, and Distillation
The moment I bought an RTX 4090 and tried to run Llama 70B, only to get a “CUDA out of memory” error during loading—that was pure despair. I thought 24GB VRAM would be enough, but the 70B model requires over 140GB of memory even at FP16 precision.
That’s where model compression techniques come in. Technologies that shrink a 140GB model down to 35GB so it can run on an RTX 4090. But it’s not simply about “just making it smaller.” We need to understand what exactly gets lost at each compression stage and why it still works despite these losses.
The Science of Number Representation: Anatomy of Floating Point
To understand all compression techniques, we first need to know how computers store real numbers.
FP32 (32-bit floating point) is structured according to the IEEE 754 standard:
| Bit Structure | Role | Range/Precision |
|---|---|---|
| 1-bit sign | Determines positive/negative | ±1 |
| 8-bit exponent | Number magnitude range | ±3.4 × 10³⁸ |
| 23-bit mantissa | Decimal precision | ~7 digits |
This means it can represent about 7 significant decimal digits. But do neural network weights really need that level of precision?
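These fields are easy to inspect directly. A small Python sketch (the helper name `fp32_bits` is mine) that unpacks the three IEEE 754 fields from a 32-bit float:

```python
import struct

def fp32_bits(x: float):
    """Decompose an FP32 value into its IEEE 754 fields."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = raw >> 31                  # 1 bit
    exponent = (raw >> 23) & 0xFF     # 8 bits, biased by 127
    mantissa = raw & 0x7FFFFF         # 23 bits, implicit leading 1

    return sign, exponent, mantissa

# 1.0 = +1.0 x 2^0 -> sign 0, biased exponent 127, mantissa 0
print(fp32_bits(1.0))    # (0, 127, 0)
print(fp32_bits(-2.0))   # (1, 128, 0)
```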
FP32 → FP16: The First Revolution
The transition to FP16 isn’t simply “cutting in half”. The internal structure fundamentally changes:
| Change | FP32 | FP16 | Real Impact |
|---|---|---|---|
| Mantissa | 23-bit | 10-bit | Precision ~7 digits → 3–4 digits |
| Exponent | 8-bit | 5-bit | Range ±3.4×10³⁸ → ±65,504 |
| Representable values | ~4 billion | ~65,000 | Massive resolution decrease |
Real-world impact:
- Most LLM inference: Almost no performance degradation (perplexity difference <1%)
- Training process: Potential underflow issues in gradient accumulation
- Extreme value handling: Values beyond ±65,504 overflow
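All three effects are easy to reproduce with NumPy (assuming it is installed):

```python
import numpy as np

# Overflow: anything beyond +/-65,504 becomes infinity
assert np.isinf(np.float16(70000.0))

# Underflow: values below the smallest FP16 subnormal (~6e-8) collapse to zero
assert float(np.float16(1e-8)) == 0.0

# Precision: only ~3-4 significant digits survive the 10-bit mantissa
print(float(np.float16(0.1)))  # 0.0999755859375
```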
This is where BF16 (Brain Floating Point 16) emerged.
BF16’s Innovation: Range vs Precision Tradeoff
BF16, developed by Google for TPUs, took a different approach:
| Format | Exponent | Mantissa | Advantages | Disadvantages |
|---|---|---|---|---|
| FP16 | 5bit | 10bit | High precision | Narrow range |
| BF16 | 8bit | 7bit | Same range as FP32 | Lower precision |
BF16’s key insight is “don’t sacrifice range.” By maintaining the same 8-bit exponent as FP32, it eliminates overflow problems entirely. It trades off mantissa precision to 7 bits, but for most deep learning workloads, this is sufficient.
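Since BF16 is simply FP32 with the bottom 16 bits of the mantissa dropped, the conversion can be sketched as a bitmask (a simplification: real hardware rounds rather than truncates):

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate an FP32 value to BF16 by zeroing the low 16 bits,
    leaving 1 sign bit, 8 exponent bits, and 7 mantissa bits."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", raw & 0xFFFF0000))[0]

print(to_bf16(0.1))    # 0.099609375 -- only ~2-3 digits survive
print(to_bf16(1e38))   # still finite: same exponent range as FP32
```

Note that 1e38 would overflow to infinity in FP16 (max ±65,504) but survives BF16 untouched in range.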
FP16 → INT8: From Floating Point to Integer
Now the real challenge begins. Completely abandoning floating point and moving to integers.
INT8 is simple: it only uses 256 integers from –128 to 127. But to do this properly, we need to understand the quantization process.
The Core of Quantization: Scale Factor and Zero Point
The formula for mapping continuous floating point values to 256 discrete integers:
quantized_value = round(original_value / scale_factor) + zero_point
Where:
- scale_factor: the real-valued step size that divides the actual value range into 256 intervals
- zero_point: the integer that real 0.0 maps to
Symmetric vs Asymmetric Quantization:
- Symmetric: zero_point = 0, symmetric range (–127 to 127)
- Asymmetric: zero_point ≠ 0, optimized for actual data distribution
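A minimal NumPy sketch of both modes (helper names are mine):

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """zero_point fixed at 0; the grid is symmetric around zero."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale                                  # dequantize: q * scale

def quantize_asymmetric(x, bits=8):
    """zero_point shifts the grid to fit the actual data range."""
    qmin, qmax = 0, 2 ** bits - 1                    # unsigned 0..255
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))  # the integer meaning 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point                      # dequantize: (q - zp) * scale

x = np.array([-1.0, 0.0, 0.5, 2.0])
q, s = quantize_symmetric(x)
q2, s2, zp = quantize_asymmetric(x)
print(q * s)                                         # round-trips close to x
print((q2.astype(np.float64) - zp) * s2)
```

For the skewed range above (–1.0 to 2.0), the asymmetric version uses all 256 levels while the symmetric one wastes the levels between –128 and –64.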
Real Impact of INT8: Still Manageable
Surprisingly, most text generation tasks show nearly identical performance to FP16. But careful examination reveals emerging differences:
Performance maintained areas:
- General conversation, summarization, translation
- Basic common-sense reasoning
Minor degradation areas:
- Complex mathematical problems (1–3% drop in GSM8K)
- Logical reasoning chains
- Coding benchmarks (2–5% drop in HumanEval)
The Outlier Problem: Why SmoothQuant Emerged
The biggest enemy of INT8 quantization is “outlier activation values”.
In some channels of LLMs, activation values 100+ times larger than normal range appear. For example:
- Most activations: –1.0 to 1.0 range
- Outlier channels: extreme values like –150.0 or 200.0
These extreme values ruin the entire scale_factor. If a single 200.0 forces the scale to cover –200 to 200, the dense mass of –1 to 1 values collapses onto just –1, 0, and 1, causing severe information loss.
SmoothQuant’s solution:
- Shift activation extremes to weights: Mathematically equivalent transformation to “smooth” activations
- Weights are easy to quantize: Weight distributions are usually uniformly distributed and easy to quantize
- Result: Both activations and weights can be cleanly quantized to INT8
With SmoothQuant, perplexity increase compared to FP16 stays under 1%.
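The equivalence transformation at SmoothQuant's core can be sketched in a few lines (a simplified version of the paper's per-channel smoothing; alpha is the migration-strength knob):

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """SmoothQuant-style equivalence transform: (X / s) @ (s * W) == X @ W.
    alpha controls how much of the outlier difficulty migrates from
    activations into weights."""
    act_max = np.abs(X).max(axis=0)        # per input channel, from calibration
    w_max = np.abs(W).max(axis=1)          # per input channel (rows of W)
    s = act_max ** alpha / w_max ** (1 - alpha)
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 100.0                           # one outlier channel
W = rng.normal(size=(16, 8))
Xs, Ws = smooth(X, W)

assert np.allclose(Xs @ Ws, X @ W)         # output is mathematically unchanged
print(np.abs(X).max(), np.abs(Xs).max())   # activation outlier is tamed
```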
INT8 → INT4: The Extreme Territory
Now we truly reach the extreme. Representing all weights with 16 values (-8 to 7).
Group Quantization: Essential Survival Technique
For INT4, group quantization is almost mandatory. Quantizing entire weights with a single scale_factor causes too much information loss.
How it works:
- Divide weights into groups of 128
- Calculate separate scale_factor for each group
- Total memory: 4bit × weights + scale_factors
For example, with 4096 weights:
- 32 groups (4096 ÷ 128)
- 4096 × 4bit + 32 × 16bit = 16,384 + 512 = 16,896 bits
- Simple INT4: 4096 × 4bit = 16,384 bits
- Overhead: ~3% (negligible compared to quality improvement)
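The arithmetic above can be sketched directly (absmax scaling per group; the helper name is mine):

```python
import numpy as np

def quantize_int4_grouped(w, group_size=128):
    """Absmax INT4 quantization with a separate FP16 scale per group."""
    groups = w.reshape(-1, group_size)                      # one row per group
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7  # INT4 max is 7
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scales = quantize_int4_grouped(w)

# Storage check: 4096 weights x 4 bits + 32 scales x 16 bits = 16,896 bits
print(q.shape, scales.shape, 4096 * 4 + scales.size * 16)
```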
Real Impact of INT4: Finally Noticeable
From INT4 onwards, performance degradation becomes clearly apparent:
| Benchmark | FP16 | INT8 | INT4 | INT4 vs FP16 (relative) |
|---|---|---|---|---|
| WikiText Perplexity | 5.36 | 5.41 | 5.68 | +6% |
| MMLU | 76.2% | 75.8% | 73.1% | –4% |
| GSM8K | 84.3% | 82.7% | 78.9% | –6.4% |
| HumanEval | 45.7% | 44.1% | 39.2% | –14.2% |
Particularly noticeable degradation areas:
- Mathematical accuracy: Decimal calculations, large number operations
- Rare tokens: Technical terms, foreign languages, specialized vocabulary
- Long context consistency: Context retention in 4k+ tokens
- Coding: Complex algorithms, precise grammatical structures
Why still use it:
- Overwhelming memory savings: 70B model from 140GB → 35GB
- Still sufficient quality for most practical tasks
- Inference speed improvement: 25–40% faster due to memory bandwidth relief
Technical Depth of Advanced Compression Techniques
GPTQ: The Magic of Hessian-Based Optimization
GPTQ extends Optimal Brain Quantization (OBQ) to large language models.
Core idea:
- Hessian matrix: Calculate second derivative impact of each weight on loss function
- Layer-wise processing: One layer at a time, memory efficiently
- Error compensation: Distribute quantization error of one weight to other weights
Mathematical approach:
w_rest ← w_rest − ((w_q − quant(w_q)) / [H⁻¹]_qq) · H⁻¹[rest, q]
After each weight is quantized, its rounding error is spread across the not-yet-quantized weights in proportion to the inverse Hessian, keeping the layer's output error minimal. (Unlike OBQ, GPTQ quantizes weights in a fixed order rather than greedily picking the "least important" one first; this simplification is what makes it fast enough for billion-parameter models.)
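The error-compensation loop can be sketched as follows. This is a heavy simplification: real GPTQ works from a Cholesky factorization of H⁻¹ and processes weights in blocks, and the calibration setup, grid, and names here are mine:

```python
import numpy as np

def gptq_row_sketch(w, H_inv, grid):
    """Quantize one weight row left to right, pushing each weight's
    rounding error onto the not-yet-quantized weights via H^-1."""
    w = w.astype(np.float64).copy()
    q = np.empty_like(w)
    for i in range(len(w)):
        q[i] = grid[np.argmin(np.abs(grid - w[i]))]    # nearest grid point
        err = (w[i] - q[i]) / H_inv[i, i]
        w[i + 1:] -= err * H_inv[i + 1:, i]            # compensate downstream
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))                          # calibration activations
H_inv = np.linalg.inv(X.T @ X + 0.01 * np.eye(16))      # damped Hessian inverse
grid = np.linspace(-1, 1, 16)                           # 16 INT4 levels
w = rng.normal(scale=0.3, size=16)
q = gptq_row_sketch(w, H_inv, grid)
```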
Benchmark performance:
- Llama 70B GPTQ-INT4: perplexity 5.54 (+3.4% vs FP16)
- Inference speed: 2.6x improvement vs FP16 (with Marlin kernel)
AWQ: Activation-Based Weight Importance
AWQ has a completely different philosophy: “Find weights that matter when actual data flows through”
Process:
- Run actual inference with calibration dataset
- Analyze activation magnitude of each channel
- Identify salient weights: Weights connected to high activations
- Protect salient weights: Keeping them in FP16 outright works but is hardware-unfriendly, so AWQ instead folds the protection into an equivalent per-channel scale before INT4 quantization
Simplified importance heuristic:
importance_score = mean(|activation_values|) * weight_variance
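In practice the protection is implemented as a per-channel scaling folded into the weights. A hedged sketch (the helper names, the alpha handling, and the per-tensor quantizer are my simplifications, not the paper's grid search):

```python
import numpy as np

def quantize_dequantize(w, bits=4):
    """Per-tensor absmax fake-quantization (round-trip through the INT grid)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def awq_scale_sketch(X, W, alpha=0.5, bits=4):
    """Scale salient input channels up before quantizing, then fold the
    scale back out, so those channels keep more effective precision."""
    act_mag = np.abs(X).mean(axis=0)       # per-channel activation magnitude
    s = act_mag ** alpha
    s /= s.mean()                          # keep scales centered around 1
    return quantize_dequantize(W * s[:, None], bits) / s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))
X[:, 0] *= 50.0                            # channel 0 carries outlier activations
W = rng.normal(size=(16, 8))
Wq = awq_scale_sketch(X, W)
print(np.linalg.norm(X @ Wq - X @ W))      # activation-weighted quantization error
```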
AWQ vs GPTQ real performance comparison:
| Technique | MMLU | GSM8K | HumanEval | ARC-C | Average Quality Retention |
|---|---|---|---|---|---|
| AWQ | 73.4% | 79.2% | 39.8% | 68.1% | 95% |
| GPTQ | 72.8% | 77.6% | 38.4% | 66.9% | 90% |
AWQ consistently shows better quality, with particularly large gaps in reasoning tasks like GSM8K and ARC-Challenge.
FP8 vs INT8: The Power of Hardware Native
FP8 structure (E4M3 format):
- 1-bit sign + 4-bit exponent + 3-bit mantissa
- Range: –448 to 448
- Precision: ~1–2 significant digits (only 3 mantissa bits)
Performance comparison with INT8 (H100 basis):
| Metric | FP8 | INT8 | Improvement |
|---|---|---|---|
| Perplexity | 5.41 | 5.45 | FP8 advantage |
| Inference speed | 1,247 tok/s | 1,182 tok/s | +5.5% |
| Memory efficiency | Same | Same | Same |
| Implementation complexity | Low | Medium | FP8 advantage |
Remarkably, FP8 shows lower perplexity than INT8. This suggests that floating point representation still fits the bell-shaped distributions of weights and activations better than a uniform integer grid does.
Extreme Compression: INT3, INT2, and BitNet
The Reality of 3-bit and 2-bit
- INT3 (8 values): attempted in some research, but shows 10–20% degradation on most benchmarks
- INT2 (4 values): experimental only, with very low practicality
BitNet: New Paradigm of Extreme Compression
BitNet b1.58 trains from scratch with extremely compressed weights:
- Weights: –1, 0, 1 (log₂3 ≈ 1.58 bits per weight)
- Activations: Maintain INT8
- Apply constraints from training
Result: Much better performance than post-training quantization, but requires completely new architecture and training process.
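The weight quantization itself is easy to sketch: the "absmean" scheme scales each weight matrix by its mean absolute value, then rounds and clips to three values (training-time details like the straight-through estimator are omitted here):

```python
import numpy as np

def absmean_ternary(W):
    """BitNet b1.58-style weight quantization: scale by the mean absolute
    weight, then round and clip to {-1, 0, 1}."""
    gamma = np.abs(W).mean() + 1e-8        # avoid division by zero
    Wq = np.clip(np.round(W / gamma), -1, 1)
    return Wq, gamma                        # dequantize as Wq * gamma

W = np.random.default_rng(0).normal(size=(4, 4))
Wq, gamma = absmean_ternary(W)
print(Wq)                                   # entries are only -1, 0, or 1
```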
Practical Benchmarks: What to Choose
Comprehensive Performance Comparison
| Technique | Memory Saving | Perplexity | MMLU | GSM8K | HumanEval | Inference Speed | Implementation Difficulty |
|---|---|---|---|---|---|---|---|
| FP16 | 0% | 5.36 | 76.2% | 84.3% | 45.7% | 1.0x | ⭐ |
| FP8 | 50% | 5.34 | 76.0% | 83.9% | 45.2% | 1.3x | ⭐⭐ (H100 required) |
| AWQ-INT4 | 75% | 5.52 | 73.4% | 79.2% | 39.8% | 2.8x | ⭐⭐⭐ |
| GPTQ-INT4 | 75% | 5.68 | 72.8% | 77.6% | 38.4% | 2.6x | ⭐⭐⭐ |
| GGUF Q4_K_M | 75% | 5.61 | 73.0% | 78.1% | 39.1% | 2.2x | ⭐⭐ |
Understanding Through Real Models: Qwen 3.5-397B
Since theory alone might not be intuitive, let’s use a recent model as an example. Looking at Qwen 3.5-397B-A17B on Hugging Face, there are these files:
- Qwen3.5-397B-A17B — original BF16
- Qwen3.5-397B-A17B-FP8 — FP8 quantization
- Qwen3.5-397B-A17B-GGUF — various GGUF quantizations
This model uses MoE structure with 397B total parameters but 17B active parameters. “397B but only using 17B” means when processing one token, only 17B out of 397B parameters actually activate. However, all 397B must be loaded into memory. Unused parameters must stay on standby because “you never know when they’ll activate.”
| Quantization Level | Model Size | Required Memory | Runnable Environment |
|---|---|---|---|
| BF16 (original) | ~807GB | 810GB+ VRAM | H100 × 10+ units |
| FP8 | ~400GB | 420GB+ VRAM | H100 × 5 units |
| INT4 (Q4_K_M) | ~214GB | 256GB+ | M3 Ultra Mac or H100 × 3 units |
| INT3 (Q3_K_M) | ~170GB | 192GB+ | M3 Ultra Mac (192GB) |
| INT2 (Q2_K) | ~130GB | 150GB+ | High-capacity RAM server |
As shown, the original 807GB model shrinks to 214GB with INT4. Less than a quarter of the original size—this is the power of quantization.
But why isn’t it exactly one-quarter? 397B × 4bit ÷ 8 = ~199GB but it’s 214GB because of Dynamic Quantization. Techniques like Unsloth’s Dynamic 2.0 keep important layers at 8bit or 16bit while compressing less important layers to 4bit. This mixed-bit strategy significantly improves quality over pure 4bit.
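A back-of-envelope helper makes these numbers reproducible. Note the ~4.3 effective bits for the mixed-bit variant is my assumption, chosen to match the observed 214GB, not a published figure:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight-only memory estimate: params x bits / 8 bits-per-byte, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# 397B parameters at various effective bit widths
for name, bits in [("BF16", 16), ("FP8", 8), ("pure INT4", 4), ("mixed ~4.3bit", 4.3)]:
    print(f"{name:>14}: {model_size_gb(397e9, bits):6.1f} GB")
```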
The FP8 version is exactly half at ~400GB. FP8 needs only per-tensor or per-channel scales, with virtually none of the metadata overhead that group quantization carries. Since H100 natively supports FP8, inference speed is similar to or faster than BF16.
Practical meaning: To run this model personally, you need at minimum INT4 (Q4_K_M) with 256GB memory. Impossible with RTX 4090 (24GB), but possible with Apple M3 Ultra (192GB unified memory) up to INT3. With MoE offloading, 24GB GPU + 256GB RAM combination is possible but token generation will be slower.
The same model requires completely different hardware depending on quantization level. This is why quantization technology isn’t a “theoretical interest” but a “practical necessity.”
Optimal Choice Guide by GPU
RTX 4090 (24GB):
- 1st choice: GGUF Q4_K_M (stability, compatibility)
- 2nd choice: AWQ-INT4 (performance priority)
- Expected performance: Llama 70B baseline 15–25 tok/s
H100 (80GB):
- 1st choice: FP8 (if possible)
- 2nd choice: FP16 (when highest quality needed)
- Expected performance: 50–100+ tok/s
Apple Silicon (32GB+ unified memory):
- Only choice: GGUF
- Expected performance: 3–8 tok/s (CPU)
Future Outlook: Evolution of Compression Technology
Hardware Evolution
- FP4 native support: Next-gen GPUs expected to directly support 4-bit floating point
- Mixed precision accelerators: Efficient processing of different precisions per layer
Algorithm Evolution
- Adaptive quantization: Dynamic precision adjustment based on input
- Neural architecture co-design: New model architectures designed with quantization in mind
Practical Evolution
- One-click quantization: Automatic optimal compression without complex user settings
- Quality-speed auto balancing: Automatic compression tailored to task requirements
Personal Thoughts
Diving deep into LLM compression technology, what struck me most was the “perfect harmony of mathematical elegance and practical effectiveness.”
That we can compress from 32-bit to 4-bit—an 8x reduction—while barely noticing the difference in most tasks proves neural networks have much greater redundancy than expected. But simultaneously, the need for sophisticated techniques like outlier handling, group quantization, activation-aware weighting shows that brute-force compression has limits.
From a practical perspective:
- Industrial settings: FP8 (H100) or AWQ will become standard
- Personal users: GGUF remains the most accessible choice
- Research: Extreme approaches like BitNet provide new breakthroughs
The most intriguing future scenario is “1-bit native hardware.” As BitNet proved, designing from scratch with extreme compression in mind can achieve amazing efficiency. If GPU vendors directly support 1-bit or 2-bit operations in hardware, current compression techniques will evolve to entirely new dimensions.
Ultimately, compression technology’s core is “wise choice of what to sacrifice and what to preserve.” We can sacrifice 7th decimal place precision, but must preserve reasoning ability. Memory can be reduced to one-eighth, but core performance must stay above 95%.
This delicate sense of balance is why LLM compression technology is true engineering artistry beyond simple “size reduction.”