Quantization
Shrinking models
The Problem: LLaMA-70B needs 140GB of memory. Your laptop has 16GB. How can you possibly run such a model at home?
The Solution: JPEG for Neural Networks
Remember how JPEG compresses photos? A 10MB image becomes 1MB, and you barely notice the difference. Quantization does the same thing for AI models — it "compresses" the numbers that make up the model, speeding up inference and making deployment on consumer hardware possible!
The Trade-off: Size vs Quality
Like with JPEG, more compression means some quality loss:
- FP16: almost no quality loss, recommended default
- INT8: minimal loss, great for most tasks
- INT4: noticeable but acceptable loss, good for chatbots
- INT2: significant loss, only for experiments
Quantization is often combined with fine-tuning (via QLoRA) to train models that are both compact and specialized.
Think of it like JPEG compression for numbers:
1. Original (FP32): each number uses 32 bits. Like storing "3.14159265358979..."
2. FP16 (half precision): 16 bits per number. Like storing "3.14159" — 2x smaller!
3. INT8 (8-bit): 8 bits per number. Like storing "3.14" — 4x smaller!
4. INT4 (4-bit): 4 bits per number. Like storing "3" — 8x smaller!
A 70B model at FP32 = 280GB. At INT4 = 35GB. Now it fits on a gaming GPU!
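The sizes above follow directly from one formula: model size ≈ parameter count × bits per weight ÷ 8. A minimal sketch (pure Python, decimal gigabytes; it ignores activations, the KV cache, and runtime overhead):

```python
def model_size_gb(params: float, bits: int) -> float:
    """Approximate weight storage: params * bits / 8 bytes, in decimal GB."""
    return params * bits / 8 / 1e9

PARAMS_70B = 70e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(PARAMS_70B, bits):.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```

Halving the bits halves the footprint, which is why each step down the list doubles the compression ratio.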
Popular Quantization Methods
- GPTQ: fast inference, needs calibration data
- AWQ: better quality, protects important weights
- GGUF: works on CPU, popular for local use
- bitsandbytes: easy to use, dynamic quantization
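As one concrete example of the last item, loading a model in 4-bit with bitsandbytes through Hugging Face transformers is mostly a configuration object. This is a hedged sketch, not part of this lesson: the model id is a placeholder, and the dtype choices are illustrative assumptions (running it requires a GPU and the model download).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (bitsandbytes); values here are illustrative.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # "normal float 4", used by QLoRA
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16, store in 4-bit
)

# Placeholder repo id; any causal LM on the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```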
Fun Fact: LLaMA-7B at INT4 runs on a MacBook Air with 8GB RAM! It won't be fast, but it works. Tools like llama.cpp and Ollama make local LLMs accessible to everyone.
Try It Yourself!
See how different quantization levels affect model size and quality. Drag the slider to find your balance!
Deep Dive: Quantization Methods
Simple Rounding vs Smart Quantization
The simplest method is to round each number to the nearest value in lower precision. It's fast but crude. Advanced methods (GPTQ, AWQ) use calibration data: they run a small set of texts through the model and adjust rounding to minimize output error. The result is nearly the same quality with significant memory savings.
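The "simple rounding" baseline can be sketched in a few lines. The helper names below are mine, not from any library; this is symmetric round-to-nearest quantization, where a single scale maps floats onto a signed integer grid (calibrated methods like GPTQ and AWQ go further by tuning the rounding against real data):

```python
def quantize(values, bits=8):
    """Symmetric round-to-nearest: map floats onto a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1               # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    """Map the integers back to approximate floats."""
    return [q * scale for q in quants]

weights = [0.42, -1.37, 0.08, 0.91, -0.55]   # toy "layer weights"
for bits in (8, 4):
    q, scale = quantize(weights, bits=bits)
    err = max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))
    print(f"INT{bits}: max reconstruction error {err:.4f}")
# INT4's much coarser grid gives a visibly larger error than INT8.
```

The worst-case error of round-to-nearest is half the grid spacing, which is why each bit you remove roughly doubles the reconstruction error.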
Practical Impact
A 70 billion parameter model in full precision (FP16) takes ~140 GB — that's several expensive server GPUs. Quantized to 4-bit (Q4), it shrinks to ~35 GB and can run on a powerful consumer PC. This makes cutting-edge AI models accessible without server hardware.
Try it yourself
Choose a model for a support chatbot with a limited budget
Baseline configuration: GPT-4 (FP16)
- RAM: 32GB GPU
- Cost: ~$0.03/request
- Latency: ~800ms
- Accuracy on typical questions: 99%
10,000 requests/day = $9,000/month
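That monthly figure follows directly from the per-request price (assuming a 30-day month):

```python
requests_per_day = 10_000
gpt4_cost_per_request = 0.03   # from the baseline configuration above
monthly_cost = requests_per_day * gpt4_cost_per_request * 30
print(f"${monthly_cost:,.0f}/month")   # $9,000/month
```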
| Option | RAM  | Cost        | Latency | FAQ Quality |
|--------|------|-------------|---------|-------------|
| FP16   | 16GB | $0.002/req  | 120ms   | 97%         |
| INT4   | 4GB  | $0.001/req  | 80ms    | 95%         |
Recommendation: INT8. For a typical FAQ bot, a 1% accuracy loss (98% → 97%) is negligible. Savings: $300/month. INT4 also works ($150/month), but at 95% accuracy it may start making mistakes in addresses and numbers.
GPT-4 for FAQ is like a Ferrari for a grocery run.
Quantization is a precision-for-speed-and-cost tradeoff. For simple tasks like FAQ bots, INT8 cuts memory and cost by 2-3x without noticeable quality loss. Not every task needs FP16.
This lesson is part of a structured LLM course.