Mixture of Experts (MoE)
Trillion parameters, billion-parameter cost
The Problem: You need a model with the quality of a trillion-parameter giant, but your compute budget only covers a billion-parameter model. Scaling parameters always meant scaling compute proportionally — until MoE broke this tradeoff. How?
The Solution: MoE — Activate Only What You Need
In a standard (dense) transformer, every token passes through all parameters in each layer. MoE replaces the single FFN (feed-forward network) in each layer with multiple expert FFNs and a gating network (router). The router examines each token's embedding and assigns it to the top-K experts (typically K=1 or K=2). Only those experts compute; the rest stay idle. This is called sparse activation — the model has many parameters but uses only a fraction per token. A load balancing loss during training prevents "expert collapse" where the router sends all tokens to the same few experts.
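The load-balancing idea can be sketched as an auxiliary loss in the style of the Switch Transformer: a scaled dot product between the fraction of tokens each expert actually receives and the mean router probability it is assigned. The loss equals 1.0 at perfect balance and grows as routing collapses onto few experts. Function and variable names here are illustrative, not from a specific library:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, num_experts):
    """Auxiliary load-balancing loss (Switch Transformer style).

    router_probs:      (tokens, num_experts) softmax outputs of the gate
    expert_assignment: (tokens,) index of the top-1 expert chosen per token
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    # P_i: mean routing probability the gate assigns to expert i
    p = router_probs.mean(axis=0)
    # N * sum_i f_i * P_i -- equals 1.0 when both are uniform
    return num_experts * float(f @ p)

# Perfectly balanced routing over 2 experts -> loss == 1.0
balanced = load_balancing_loss(
    np.full((4, 2), 0.5), np.array([0, 1, 0, 1]), num_experts=2)

# All tokens collapse onto expert 0 -> loss == 2.0 (penalized)
collapsed = load_balancing_loss(
    np.array([[1.0, 0.0]] * 4), np.zeros(4, dtype=int), num_experts=2)
```

Adding this term to the training loss gives the router a gradient signal to spread tokens out, which is what prevents expert collapse.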
Think of it like a hospital with specialist doctors — the reception desk (router) directs each patient to the right one or two specialists (experts), not to every doctor at once. Each specialist is world-class in their area, but the patient only sees those relevant to their case:
1. Token arrives at router: each token's embedding is fed into a small gating network (router) at the MoE layer. The router is a learned linear layer that produces a score for each expert.
2. Router scores all experts: the gating network applies softmax to produce a probability distribution over all N experts. Each expert gets a routing weight that indicates how relevant it is for this particular token.
3. Top-K experts activated: only the K experts with the highest scores are selected (typically K=2). The selected experts process the token through their FFNs independently; all other experts remain idle, saving compute.
4. Outputs combined by weight: the outputs from the K active experts are weighted by their router scores and summed to produce the final layer output. This weighted combination lets the model blend different expert specializations for each token.
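The routing steps above can be sketched for a single token. This is a minimal illustration with toy matrix "experts" standing in for real FFNs; all shapes and names are assumptions, not Mixtral's actual layout:

```python
import numpy as np

def moe_layer(token, expert_weights, router_weights, k=2):
    """Sketch of one MoE layer forward pass for a single token.

    token:          (d,) embedding
    expert_weights: list of N (d, d) matrices -- toy stand-ins for expert FFNs
    router_weights: (N, d) learned gating matrix
    """
    # 1. Router scores every expert for this token
    logits = router_weights @ token                    # (N,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                               # softmax over N experts

    # 2. Keep only the top-k experts (sparse activation)
    top_k = np.argsort(probs)[-k:]
    gate = probs[top_k] / probs[top_k].sum()           # renormalize the k weights

    # 3. Only the selected experts compute; the rest stay idle
    out = np.zeros_like(token)
    for g, i in zip(gate, top_k):
        out += g * np.tanh(expert_weights[i] @ token)  # weighted sum of outputs
    return out, top_k

# Toy usage: 4 experts, 8-dim embeddings, top-2 routing
rng = np.random.default_rng(0)
d, n = 8, 4
token = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n)]
router = rng.normal(size=(n, d))
out, chosen = moe_layer(token, experts, router, k=2)
```

Note that only the two selected expert matrices are ever multiplied; a dense layer of the same total size would multiply all four.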
MoE in Practice
- Mixtral 8x7B: Mistral AI's open-source MoE model with 8 experts and top-2 routing. Total parameters: 46.7B, but only ~12.9B active per token. Matches or outperforms Llama 2 70B while being 6x faster at inference
- DeepSeek-V2/V3: DeepSeek's fine-grained MoE with 160 experts and top-6 routing in V2, and 256 experts in V3. Uses shared experts (always active) plus routed experts for a hybrid approach that improves stability and quality
- Switch Transformer: Google's pioneering MoE architecture using top-1 routing for maximum efficiency. Each token goes to exactly one expert, maximizing throughput at the cost of some quality. Scaled to 1.6 trillion parameters in 2021
- Common Pitfall: MoE models are compute-efficient but NOT memory-efficient. All experts must reside in GPU memory even though only a few are active. A 46.7B MoE model needs the same VRAM as a 46.7B dense model, despite using only 12.9B parameters per forward pass
Fun Fact: Mixtral 8x7B has 46.7 billion total parameters but activates only 12.9 billion per token. This means it runs at roughly the speed of a 13B dense model while achieving quality comparable to a 70B model — a 5x efficiency gain from sparse routing alone.
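The two published numbers (46.7B total, 12.9B active, top-2 of 8 experts) are enough for a back-of-envelope split into per-expert and shared parameters. This is a rough estimate, not Mistral's official breakdown:

```python
# Solve the 2-equation system implied by Mixtral 8x7B's published figures.
#   total  = shared + N * per_expert
#   active = shared + K * per_expert
# Subtracting: total - active = (N - K) * per_expert
TOTAL, ACTIVE = 46.7, 12.9   # billions of parameters (published)
N_EXPERTS, TOP_K = 8, 2

per_expert = (TOTAL - ACTIVE) / (N_EXPERTS - TOP_K)   # ~5.6B per expert FFN
shared = TOTAL - N_EXPERTS * per_expert               # ~1.6B attention/embeddings

print(f"per-expert FFN ~ {per_expert:.2f}B, shared ~ {shared:.2f}B")
```

The split also makes the memory pitfall concrete: VRAM must hold shared + 8 experts (all 46.7B), while each forward pass touches only shared + 2 experts.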
Try It Yourself!
Explore the interactive MoE visualization below: route tokens to experts, compare dense vs sparse compute, and see how load balancing distributes work across experts.
Watch how the router assigns each token to the top-2 experts:
[Interactive visualization: the token "Python" is routed across 8 experts: E1 Literature, E2 Code, E3 Math, E4 Science, E5 History, E6 Logic, E7 Language, E8 Technical.]
Explain Mixtral 8x7B architecture: parameters, routing, performance
Mixtral 8x7B is a large language model from Mistral AI. It uses Mixture of Experts architecture and is open source. The model performs well on various tasks.
Mixtral 8x7B Architecture:
- Parameters: 46.7B total, ~12.9B active per token. 8 expert FFNs of ~7B each + shared attention layers.
- Routing: Top-2 — for each token, the router (linear layer + softmax) selects 2 of 8 experts. Outputs are weighted-summed by router scores.
- Speed: Inference ~5x faster than dense 70B (12.9B active vs 70B). Benchmarks comparable or better than Llama 2 70B.
- Tradeoffs: All 46.7B in GPU VRAM (no memory savings!). Only saves FLOPs. Needs expert parallelism for multi-GPU.
Asking specific questions about MoE architecture (total vs active params, routing, tradeoffs) yields concrete numbers instead of generic phrases. Key takeaway: MoE saves compute, NOT memory.