Mixture of Experts (MoE)
Trillion parameters, billion-parameter cost
The Problem: You need a model with the quality of a trillion-parameter giant, but your compute budget only covers a billion-parameter model. Scaling parameters always meant scaling compute proportionally — until MoE broke this tradeoff. How?
The Solution: MoE — Activate Only What You Need
In a standard (dense) transformer, every token passes through all parameters in each layer. MoE replaces the single FFN (feed-forward network) in each layer with multiple expert FFNs and a gating network (router). The router examines each token's embedding and assigns it to the top-K experts (typically K=1 or K=2). Only those experts compute; the rest stay idle. This is called sparse activation — the model has many parameters but uses only a fraction per token. A load balancing loss during training prevents "expert collapse" where the router sends all tokens to the same few experts.
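The load-balancing idea can be sketched as an auxiliary loss in the style of the Switch Transformer: a scaled dot product between the fraction of tokens each expert actually receives and the mean router probability it is assigned. The loss equals 1.0 at perfect balance and grows as routing collapses onto few experts. Function and variable names here are illustrative, not from a specific library:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, num_experts):
    """Auxiliary load-balancing loss (Switch Transformer style).

    router_probs:      (tokens, num_experts) softmax outputs of the gate
    expert_assignment: (tokens,) index of the top-1 expert chosen per token
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    # P_i: mean routing probability the gate assigns to expert i
    p = router_probs.mean(axis=0)
    # N * sum_i f_i * P_i -- equals 1.0 when both are uniform
    return num_experts * float(f @ p)

# Perfectly balanced routing over 2 experts -> loss == 1.0
balanced = load_balancing_loss(
    np.full((4, 2), 0.5), np.array([0, 1, 0, 1]), num_experts=2)

# All tokens collapse onto expert 0 -> loss == 2.0 (penalized)
collapsed = load_balancing_loss(
    np.array([[1.0, 0.0]] * 4), np.zeros(4, dtype=int), num_experts=2)
```

Adding this term to the training loss gives the router a gradient signal to spread tokens out, which is what prevents expert collapse.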
Think of it like a hospital with specialist doctors — the reception desk (router) directs each patient to the right one or two specialists (experts), not to every doctor at once. Each specialist is world-class in their area, but the patient only sees those relevant to their case:
1. Token arrives at router: each token's embedding is fed into a small gating network (router) at the MoE layer. The router is a learned linear layer that produces a score for each expert.
2. Router scores all experts: the gating network applies softmax to produce a probability distribution over all N experts. Each expert gets a routing weight that indicates how relevant it is for this particular token.
3. Top-K experts activated: only the K experts with the highest scores are selected (typically K=2). The selected experts process the token through their FFNs independently; all other experts remain idle, saving compute.
4. Outputs combined by weight: the outputs from the K active experts are weighted by their router scores and summed to produce the final layer output. This weighted combination lets the model blend different expert specializations for each token.
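The routing steps above can be sketched for a single token. This is a minimal illustration with toy matrix "experts" standing in for real FFNs; all shapes and names are assumptions, not Mixtral's actual layout:

```python
import numpy as np

def moe_layer(token, expert_weights, router_weights, k=2):
    """Sketch of one MoE layer forward pass for a single token.

    token:          (d,) embedding
    expert_weights: list of N (d, d) matrices -- toy stand-ins for expert FFNs
    router_weights: (N, d) learned gating matrix
    """
    # 1. Router scores every expert for this token
    logits = router_weights @ token                    # (N,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                               # softmax over N experts

    # 2. Keep only the top-k experts (sparse activation)
    top_k = np.argsort(probs)[-k:]
    gate = probs[top_k] / probs[top_k].sum()           # renormalize the k weights

    # 3. Only the selected experts compute; the rest stay idle
    out = np.zeros_like(token)
    for g, i in zip(gate, top_k):
        out += g * np.tanh(expert_weights[i] @ token)  # weighted sum of outputs
    return out, top_k

# Toy usage: 4 experts, 8-dim embeddings, top-2 routing
rng = np.random.default_rng(0)
d, n = 8, 4
token = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n)]
router = rng.normal(size=(n, d))
out, chosen = moe_layer(token, experts, router, k=2)
```

Note that only the two selected expert matrices are ever multiplied; a dense layer of the same total size would multiply all four.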
MoE in Practice
- Mixtral 8x7B: Mistral AI's open-source MoE model with 8 experts and top-2 routing. Total parameters: 46.7B, but only ~12.9B active per token. Matches or outperforms Llama 2 70B while being 6x faster at inference
- DeepSeek-V2/V3: DeepSeek's fine-grained MoE with 160 experts and top-6 routing in V2, and 256 experts in V3. Uses shared experts (always active) plus routed experts for a hybrid approach that improves stability and quality
- Switch Transformer: Google's pioneering MoE architecture using top-1 routing for maximum efficiency. Each token goes to exactly one expert, maximizing throughput at the cost of some quality. Scaled to 1.6 trillion parameters in 2021
- Common Pitfall: MoE models are compute-efficient but NOT memory-efficient. All experts must reside in GPU memory even though only a few are active. A 46.7B MoE model needs the same VRAM as a 46.7B dense model, despite using only 12.9B parameters per forward pass
Fun Fact: Mixtral 8x7B has 46.7 billion total parameters but activates only 12.9 billion per token. This means it runs at roughly the speed of a 13B dense model while achieving quality comparable to a 70B model — a 5x efficiency gain from sparse routing alone.
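The two published numbers (46.7B total, 12.9B active, top-2 of 8 experts) are enough for a back-of-envelope split into per-expert and shared parameters. This is a rough estimate, not Mistral's official breakdown:

```python
# Solve the 2-equation system implied by Mixtral 8x7B's published figures.
#   total  = shared + N * per_expert
#   active = shared + K * per_expert
# Subtracting: total - active = (N - K) * per_expert
TOTAL, ACTIVE = 46.7, 12.9   # billions of parameters (published)
N_EXPERTS, TOP_K = 8, 2

per_expert = (TOTAL - ACTIVE) / (N_EXPERTS - TOP_K)   # ~5.6B per expert FFN
shared = TOTAL - N_EXPERTS * per_expert               # ~1.6B attention/embeddings

print(f"per-expert FFN ~ {per_expert:.2f}B, shared ~ {shared:.2f}B")
```

The split also makes the memory pitfall concrete: VRAM must hold shared + 8 experts (all 46.7B), while each forward pass touches only shared + 2 experts.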
Try It Yourself!
Explore the interactive MoE visualization below: route tokens to experts, compare dense vs sparse compute, and see how load balancing distributes work across experts.
Watch how the router assigns each token to the top-2 experts:
[Interactive visualization: the token "Python" is routed across 8 experts: E1 Literature, E2 Code, E3 Math, E4 Science, E5 History, E6 Logic, E7 Language, E8 Technical.]
Explain Mixtral 8x7B architecture: parameters, routing, performance
Mixtral 8x7B is a large language model from Mistral AI. It uses Mixture of Experts architecture and is open source. The model performs well on various tasks.
Mixtral 8x7B Architecture:
- Parameters: 46.7B total, ~12.9B active per token. 8 expert FFNs of ~7B each + shared attention layers.
- Routing: Top-2 — for each token, the router (linear layer + softmax) selects 2 of 8 experts. Outputs are weighted-summed by router scores.
- Speed: Inference ~5x faster than dense 70B (12.9B active vs 70B). Benchmarks comparable or better than Llama 2 70B.
- Tradeoffs: All 46.7B in GPU VRAM (no memory savings!). Only saves FLOPs. Needs expert parallelism for multi-GPU.
Asking specific questions about MoE architecture (total vs active params, routing, tradeoffs) yields concrete numbers instead of generic phrases. Key takeaway: MoE saves compute, NOT memory.