Quantization
Shrinking models
The Problem: LLaMA-70B needs 140GB of memory. Your laptop has 16GB. How can you possibly run such a model at home?
The Solution: JPEG for Neural Networks
Remember how JPEG compresses photos? A 10MB image becomes 1MB, and you barely notice the difference. Quantization does the same thing for AI models — it "compresses" the numbers that make up the model, speeding up inference and making deployment on consumer hardware possible!
The Trade-off: Size vs Quality
Like with JPEG, more compression means some quality loss:
- FP16: almost no quality loss, recommended default
- INT8: minimal loss, great for most tasks
- INT4: noticeable but acceptable loss, good for chatbots
- INT2: significant loss, only for experiments
Quantization is often combined with fine-tuning (via QLoRA) to train models that are both compact and specialized.
Think of it like JPEG compression for numbers:
1. Original (FP32): each number uses 32 bits. Like storing "3.14159265358979..."
2. FP16 (half precision): 16 bits per number. Like storing "3.14159" — 2x smaller!
3. INT8 (8-bit): 8 bits per number. Like storing "3.14" — 4x smaller!
4. INT4 (4-bit): 4 bits per number. Like storing "3" — 8x smaller!
A 70B model at FP32 = 280GB. At INT4 = 35GB. Now it fits on a gaming GPU!
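The sizes above follow directly from one formula: model size ≈ parameter count × bits per weight ÷ 8. A minimal sketch (pure Python, decimal gigabytes; it ignores activations, the KV cache, and runtime overhead):

```python
def model_size_gb(params: float, bits: int) -> float:
    """Approximate weight storage: params * bits / 8 bytes, in decimal GB."""
    return params * bits / 8 / 1e9

PARAMS_70B = 70e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(PARAMS_70B, bits):.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```

Halving the bits halves the footprint, which is why each step down the list doubles the compression ratio.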
Popular Quantization Methods
- GPTQ: fast inference, needs calibration data
- AWQ: better quality, protects important weights
- GGUF: works on CPU, popular for local use
- bitsandbytes: easy to use, dynamic quantization
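As one concrete example of the last item, loading a model in 4-bit with bitsandbytes through Hugging Face transformers is mostly a configuration object. This is a hedged sketch, not part of this lesson: the model id is a placeholder, and the dtype choices are illustrative assumptions (running it requires a GPU and the model download).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (bitsandbytes); values here are illustrative.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # "normal float 4", used by QLoRA
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16, store in 4-bit
)

# Placeholder repo id; any causal LM on the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```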
Fun Fact: LLaMA-7B at INT4 runs on a MacBook Air with 8GB RAM! It won't be fast, but it works. Tools like llama.cpp and Ollama make local LLMs accessible to everyone.
Try It Yourself!
See how different quantization levels affect model size and quality. Drag the slider to find your balance!
Deep Dive: Quantization Methods
Simple Rounding vs Smart Quantization
The simplest method is to round each number to the nearest value in lower precision. It's fast but crude. Advanced methods (GPTQ, AWQ) use calibration data: they run a small set of texts through the model and adjust rounding to minimize output error. The result is nearly the same quality with significant memory savings.
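The "simple rounding" baseline can be sketched in a few lines. The helper names below are mine, not from any library; this is symmetric round-to-nearest quantization, where a single scale maps floats onto a signed integer grid (calibrated methods like GPTQ and AWQ go further by tuning the rounding against real data):

```python
def quantize(values, bits=8):
    """Symmetric round-to-nearest: map floats onto a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1               # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    """Map the integers back to approximate floats."""
    return [q * scale for q in quants]

weights = [0.42, -1.37, 0.08, 0.91, -0.55]   # toy "layer weights"
for bits in (8, 4):
    q, scale = quantize(weights, bits=bits)
    err = max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))
    print(f"INT{bits}: max reconstruction error {err:.4f}")
# INT4's much coarser grid gives a visibly larger error than INT8.
```

The worst-case error of round-to-nearest is half the grid spacing, which is why each bit you remove roughly doubles the reconstruction error.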
Practical Impact
A 70 billion parameter model in full precision (FP16) takes ~140 GB — that's several expensive server GPUs. Quantized to 4-bit (Q4), it shrinks to ~35 GB and can run on a powerful consumer PC. This makes cutting-edge AI models accessible without server hardware.
Try it yourself
Choose a model for a support chatbot with a limited budget
Baseline configuration: GPT-4 (FP16)
- RAM: 32GB GPU
- Cost: ~$0.03/request
- Latency: ~800ms
- Accuracy on typical questions: 99%
10,000 requests/day = $9,000/month
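That monthly figure follows directly from the per-request price (assuming a 30-day month):

```python
requests_per_day = 10_000
gpt4_cost_per_request = 0.03   # from the baseline configuration above
monthly_cost = requests_per_day * gpt4_cost_per_request * 30
print(f"${monthly_cost:,.0f}/month")   # $9,000/month
```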
| Option | RAM  | Cost        | Latency | FAQ Quality |
|--------|------|-------------|---------|-------------|
| FP16   | 16GB | $0.002/req  | 120ms   | 97%         |
| INT4   | 4GB  | $0.001/req  | 80ms    | 95%         |
Recommendation: INT8. For a typical FAQ bot, a 1% accuracy loss (98% → 97%) is negligible. Savings: $300/month. INT4 also works ($150/month), but at 95% accuracy it may start making mistakes in addresses and numbers.
GPT-4 for FAQ is like a Ferrari for a grocery run.
Quantization is a precision-for-speed-and-cost tradeoff. For simple tasks like FAQ bots, INT8 cuts memory and cost by 2-3x without noticeable quality loss. Not every task needs FP16.
This lesson is part of a structured LLM course.