Knowledge Distillation
Big model knowledge, small model speed
The Problem: Your production system needs to classify thousands of requests per second, but your best model (175B parameters) costs $0.03 per request and takes 2 seconds to respond. You need the same quality at 1/10th the cost and 10x the speed. How do you compress the knowledge of a giant model into something deployable?
The Solution: Knowledge Distillation — Teaching Small Models to Think Like Big Ones
Large models like GPT-4 and Claude are highly accurate but expensive and slow for production. Knowledge distillation solves this by training a smaller student model to mimic a larger teacher model. The key insight is using soft labels — the teacher's full probability distribution over all classes — instead of just the correct answer. A temperature parameter T softens these distributions, making small probabilities visible. This reveals dark knowledge: the information hidden in the teacher's "wrong" answers that tells the student which classes are similar to each other. The loss function combines soft loss (from teacher) with hard loss (ground truth): L = alpha * L_soft + (1-alpha) * L_hard.
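The combined loss above can be sketched in a few lines. This is a minimal NumPy sketch under illustrative defaults (T=4, alpha=0.6), not a production implementation; the T² scaling on the soft term follows the original distillation paper and keeps gradient magnitudes comparable across temperatures:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_idx, T=4.0, alpha=0.6):
    """Combined loss L = alpha * L_soft + (1 - alpha) * L_hard.

    L_soft: cross-entropy between teacher and student distributions at
    temperature T, scaled by T**2 so its gradients stay comparable to L_hard.
    L_hard: standard cross-entropy against the ground-truth label true_idx.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    soft_loss = -np.sum(p_teacher * np.log(p_student_T)) * T**2
    p_student = softmax(student_logits)          # T=1 for the hard-label term
    hard_loss = -np.log(p_student[true_idx])
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

With alpha=0 this reduces to ordinary supervised training; with alpha=1 the student learns purely from the teacher's soft labels.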
Think of it like an experienced chef training an apprentice — not just giving the recipe (hard label: "this is soup"), but sharing nuances: "this is 70% soup technique, 20% sauce method, 10% stew approach" (soft labels). The apprentice learns not just the right answer, but why other answers are partially right:
1. Teacher generates soft predictions: The large teacher model produces probability distributions over all classes for each input — not just the top-1 prediction, but the full distribution revealing how confident it is about every option
2. Temperature softens the distribution: Parameter T > 1 "smooths" the distribution, making small probabilities more visible. At T=1 the top class dominates; at T=5 the dark knowledge in minor classes becomes accessible to the student
3. Student learns from soft + hard labels: The student model trains on both soft labels (from teacher, weighted by alpha) and hard labels (ground truth, weighted by 1-alpha). Typical alpha is 0.5-0.7. This dual signal gives the student both the teacher's intuition and factual correctness
4. Student deployed independently: After training, the student model works on its own — no teacher needed at inference time. The result: a model that is 2-10x smaller and faster, with 90-99% of the teacher's quality on the target task
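The four steps above can be seen end to end in a toy sketch. Here both teacher and student are plain linear classifiers (stand-ins for a large and a small network), and all hyperparameters (T=4, alpha=0.6, learning rate) are illustrative, not taken from any real system:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Step 1: a frozen "teacher" -- here a fixed random linear map standing in for a large model
D, C = 8, 4                                   # feature dim, number of classes
W_teacher = rng.normal(size=(D, C)) * 2.0

X = rng.normal(size=(256, D))                 # training inputs
y = (X @ W_teacher).argmax(axis=1)            # hard labels (teacher assumed correct here)

# Step 2: teacher's soft labels at temperature T
T, alpha, lr = 4.0, 0.6, 0.5
soft_targets = softmax(X @ W_teacher, T)

# Step 3: the student (another linear model) trains on soft + hard labels
W_student = np.zeros((D, C))
onehot = np.eye(C)[y]
for _ in range(300):
    p_T = softmax(X @ W_student, T)           # student distribution at temperature T
    p_1 = softmax(X @ W_student)              # student distribution at T=1
    # gradient of alpha*T^2*L_soft + (1-alpha)*L_hard with respect to the logits
    g = alpha * T * (p_T - soft_targets) + (1 - alpha) * (p_1 - onehot)
    W_student -= lr * (X.T @ g) / len(X)

# Step 4: at inference time the student runs alone, no teacher needed
acc = (softmax(X @ W_student).argmax(axis=1) == y).mean()
print(f"student agreement with hard labels: {acc:.2f}")
```

In a real setup the student would be a smaller network than the teacher and would see held-out data; the point here is only the shape of the pipeline: freeze teacher, generate soft targets, train on the combined gradient, deploy the student alone.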
Knowledge Distillation in Practice
- Mobile Deployment: DistilBERT retains 97% of BERT's performance with 40% fewer parameters and 60% faster inference. This makes transformer models practical for on-device applications like keyboard prediction and real-time translation
- Cost Reduction: Companies like Microsoft (Phi series) and Meta (Llama) use distillation to create models that run on consumer GPUs. A distilled 7B model can match a 70B teacher on specific tasks at 1/10th the cost per inference
- Real-Time Inference: Latency-critical applications like voice assistants, autonomous driving, and live content moderation require sub-100ms responses. Distilled models deliver teacher-quality results within strict latency budgets
- Common Pitfall: You cannot distill a 175B parameter model into a 1M parameter student — the capacity gap is too large. The student architecture must be proportional to task complexity. Also, distillation transfers behavior, not knowledge: if the teacher hallucinates, the student will too
Fun Fact: DistilBERT (2019) was one of the first major distillation success stories: 40% smaller than BERT, 60% faster, yet retaining 97% of performance. Today, many popular "small" LLMs like Microsoft Phi and parts of the Llama family use distillation from larger models as a key training ingredient.
Try It Yourself!
Explore the interactive distillation pipeline below: see how temperature affects probability distributions, compare soft vs hard labels, and observe the teacher-student knowledge transfer.
Follow the knowledge distillation pipeline step by step:
Teacher Model
Large pre-trained model (e.g. 175B params) generates soft probability distributions for training data
Prompt: "Explain how knowledge distillation helps deploy a model on a mobile device"
A generic answer merely restates the definition: knowledge distillation is a process where a small model (the student) learns from a large one (the teacher), which is useful for mobile devices. Compare that with a concrete, numbers-driven answer:
Knowledge Distillation: BERT-large → DistilBERT for a mobile email classifier
1. Soft labels from teacher: For email "Meeting rescheduled to Friday 3pm":
- Teacher (T=1): [Calendar: 0.89, Work: 0.06, Personal: 0.03, Spam: 0.02]
- Teacher (T=4): [Calendar: 0.42, Work: 0.25, Personal: 0.18, Spam: 0.15]
Soft labels at T=4 reveal: "this is calendar-related, but connected to work and somewhat personal" — dark knowledge that the student learns.
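The effect of T can be reproduced with a temperature-scaled softmax. The logits below are hypothetical, chosen so the T=1 distribution roughly matches the one above; the T=4 result comes out close to, though not exactly equal to, the illustrative numbers shown:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Divide logits by T before the softmax; larger T flattens the distribution
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for [Calendar, Work, Personal, Spam]
logits = [4.0, 1.3, 0.6, 0.2]

print("T=1:", np.round(softmax(logits, T=1), 2))   # ≈ [0.89 0.06 0.03 0.02]
print("T=4:", np.round(softmax(logits, T=4), 2))
```

Note that the ranking of classes never changes with T; only the gap between them shrinks, which is exactly what exposes the dark knowledge to the student.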
2. Role of temperature T=4:
- Hard labels only: the student sees "Calendar=1, rest=0" → no class relationships
- T=4: the student sees that Work and Personal are related to Calendar → better handling of edge cases
3. Final metrics:

| Metric | Teacher (BERT-large) | Student (DistilBERT) |
|--------|---------------------|---------------------|
| Parameters | 340M | 66M (−80%) |
| Accuracy | 94% | 91.3% (−2.7%) |
| Latency | 450ms | 85ms (−81%) |
| RAM | 1.3GB | 260MB (−80%) |
A specific scenario with numbers (340M→66M, accuracy, latency, RAM) transforms an abstract "tell me about distillation" into a practical guide. A prompt with temperature and soft labels forces the model to show mechanics, not just describe the concept.