Knowledge Distillation
Big model knowledge, small model speed
The Problem: Your production system needs to classify thousands of requests per second, but your best model (175B parameters) costs $0.03 per request and takes 2 seconds to respond. You need the same quality at 1/10th the cost and 10x the speed. How do you compress the knowledge of a giant model into something deployable?
The Solution: Knowledge Distillation — Teaching Small Models to Think Like Big Ones
Large models like GPT-4 and Claude are highly accurate but expensive and slow for production. Knowledge distillation solves this by training a smaller student model to mimic a larger teacher model. The key insight is using soft labels — the teacher's full probability distribution over all classes — instead of just the correct answer. A temperature parameter T softens these distributions, making small probabilities visible. This reveals dark knowledge: the information hidden in the teacher's "wrong" answers that tells the student which classes are similar to each other. The loss function combines soft loss (from teacher) with hard loss (ground truth): L = alpha * L_soft + (1-alpha) * L_hard.
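The combined loss above can be sketched in a few lines. This is a minimal NumPy sketch under illustrative defaults (T=4, alpha=0.6), not a production implementation; the T² scaling on the soft term follows the original distillation paper and keeps gradient magnitudes comparable across temperatures:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_idx, T=4.0, alpha=0.6):
    """Combined loss L = alpha * L_soft + (1 - alpha) * L_hard.

    L_soft: cross-entropy between teacher and student distributions at
    temperature T, scaled by T**2 so its gradients stay comparable to L_hard.
    L_hard: standard cross-entropy against the ground-truth label true_idx.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    soft_loss = -np.sum(p_teacher * np.log(p_student_T)) * T**2
    p_student = softmax(student_logits)          # T=1 for the hard-label term
    hard_loss = -np.log(p_student[true_idx])
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

With alpha=0 this reduces to ordinary supervised training; with alpha=1 the student learns purely from the teacher's soft labels.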
Think of it like an experienced chef training an apprentice — not just giving the recipe (hard label: "this is soup"), but sharing nuances: "this is 70% soup technique, 20% sauce method, 10% stew approach" (soft labels). The apprentice learns not just the right answer, but why other answers are partially right:
1. Teacher generates soft predictions: The large teacher model produces probability distributions over all classes for each input — not just the top-1 prediction, but the full distribution revealing how confident it is about every option
2. Temperature softens the distribution: Parameter T > 1 "smooths" the distribution, making small probabilities more visible. At T=1 the top class dominates; at T=5 the dark knowledge in minor classes becomes accessible to the student
3. Student learns from soft + hard labels: The student model trains on both soft labels (from teacher, weighted by alpha) and hard labels (ground truth, weighted by 1-alpha). Typical alpha is 0.5-0.7. This dual signal gives the student both the teacher's intuition and factual correctness
4. Student deployed independently: After training, the student model works on its own — no teacher needed at inference time. The result: a model that is 2-10x smaller and faster, with 90-99% of the teacher's quality on the target task
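The four steps above can be seen end to end in a toy sketch. Here both teacher and student are plain linear classifiers (stand-ins for a large and a small network), and all hyperparameters (T=4, alpha=0.6, learning rate) are illustrative, not taken from any real system:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Step 1: a frozen "teacher" -- here a fixed random linear map standing in for a large model
D, C = 8, 4                                   # feature dim, number of classes
W_teacher = rng.normal(size=(D, C)) * 2.0

X = rng.normal(size=(256, D))                 # training inputs
y = (X @ W_teacher).argmax(axis=1)            # hard labels (teacher assumed correct here)

# Step 2: teacher's soft labels at temperature T
T, alpha, lr = 4.0, 0.6, 0.5
soft_targets = softmax(X @ W_teacher, T)

# Step 3: the student (another linear model) trains on soft + hard labels
W_student = np.zeros((D, C))
onehot = np.eye(C)[y]
for _ in range(300):
    p_T = softmax(X @ W_student, T)           # student distribution at temperature T
    p_1 = softmax(X @ W_student)              # student distribution at T=1
    # gradient of alpha*T^2*L_soft + (1-alpha)*L_hard with respect to the logits
    g = alpha * T * (p_T - soft_targets) + (1 - alpha) * (p_1 - onehot)
    W_student -= lr * (X.T @ g) / len(X)

# Step 4: at inference time the student runs alone, no teacher needed
acc = (softmax(X @ W_student).argmax(axis=1) == y).mean()
print(f"student agreement with hard labels: {acc:.2f}")
```

In a real setup the student would be a smaller network than the teacher and would see held-out data; the point here is only the shape of the pipeline: freeze teacher, generate soft targets, train on the combined gradient, deploy the student alone.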
Knowledge Distillation in Practice
- Mobile Deployment: DistilBERT retains 97% of BERT's performance with 40% fewer parameters and 60% faster inference. This makes transformer models practical for on-device applications like keyboard prediction and real-time translation
- Cost Reduction: Companies like Microsoft (Phi series) and Meta (Llama) use distillation to create models that run on consumer GPUs. A distilled 7B model can match a 70B teacher on specific tasks at 1/10th the cost per inference
- Real-Time Inference: Latency-critical applications like voice assistants, autonomous driving, and live content moderation require sub-100ms responses. Distilled models deliver teacher-quality results within strict latency budgets
- Common Pitfall: You cannot distill a 175B parameter model into a 1M parameter student — the capacity gap is too large. The student architecture must be proportional to task complexity. Also, distillation transfers behavior, not knowledge: if the teacher hallucinates, the student will too
Fun Fact: DistilBERT (2019) was one of the first major distillation success stories: 40% smaller than BERT, 60% faster, yet retaining 97% of performance. Today, many popular "small" LLMs like Microsoft Phi and parts of the Llama family use distillation from larger models as a key training ingredient.
Try It Yourself!
Explore the interactive distillation pipeline below: see how temperature affects probability distributions, compare soft vs hard labels, and observe the teacher-student knowledge transfer.
Follow the knowledge distillation pipeline step by step:
Teacher Model
Large pre-trained model (e.g. 175B params) generates soft probability distributions for training data
Prompt: "Explain how knowledge distillation helps deploy a model on a mobile device"
A generic answer merely restates the definition: knowledge distillation is a process where a small model (the student) learns from a large one (the teacher), which is useful for mobile devices. Compare that with a concrete, numbers-driven answer:
Knowledge Distillation: BERT-large → DistilBERT for a mobile email classifier
1. Soft labels from teacher: For email "Meeting rescheduled to Friday 3pm":
- Teacher (T=1): [Calendar: 0.89, Work: 0.06, Personal: 0.03, Spam: 0.02]
- Teacher (T=4): [Calendar: 0.42, Work: 0.25, Personal: 0.18, Spam: 0.15]
Soft labels at T=4 reveal: "this is calendar-related, but connected to work and somewhat personal" — dark knowledge that the student learns.
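The effect of T can be reproduced with a temperature-scaled softmax. The logits below are hypothetical, chosen so the T=1 distribution roughly matches the one above; the T=4 result comes out close to, though not exactly equal to, the illustrative numbers shown:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Divide logits by T before the softmax; larger T flattens the distribution
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for [Calendar, Work, Personal, Spam]
logits = [4.0, 1.3, 0.6, 0.2]

print("T=1:", np.round(softmax(logits, T=1), 2))   # ≈ [0.89 0.06 0.03 0.02]
print("T=4:", np.round(softmax(logits, T=4), 2))
```

Note that the ranking of classes never changes with T; only the gap between them shrinks, which is exactly what exposes the dark knowledge to the student.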
2. Role of temperature T=4:
- Hard labels only: the student sees "Calendar=1, rest=0" → no class relationships
- T=4: the student sees that Work and Personal are related to Calendar → better handling of edge cases
3. Final metrics:

| Metric | Teacher (BERT-large) | Student (DistilBERT) |
|--------|---------------------|---------------------|
| Parameters | 340M | 66M (−80%) |
| Accuracy | 94% | 91.3% (−2.7%) |
| Latency | 450ms | 85ms (−81%) |
| RAM | 1.3GB | 260MB (−80%) |
A specific scenario with numbers (340M→66M, accuracy, latency, RAM) transforms an abstract "tell me about distillation" into a practical guide. A prompt with temperature and soft labels forces the model to show mechanics, not just describe the concept.