Training Dynamics
From random weights to predicting the next word
The Problem: You know that LLMs are "trained" on massive amounts of text — but what does training actually mean? How does a model go from random weights to predicting the next word with uncanny accuracy? Understanding training dynamics reveals why LLMs cost millions to train, why some models are better than others, and why "temperature" controls creativity.
The Solution: How LLM Training Works
LLM training is an optimization process driven by a single number: cross-entropy loss. For each token in the training data, the model predicts a probability distribution over the vocabulary. The loss measures how far this prediction is from reality. Gradient descent then computes the slope (gradient) of this loss with respect to every weight and nudges each weight in the direction that reduces the loss. The learning rate controls the step size — too large and the model overshoots; too small and training takes forever. Modern LLMs use the Adam optimizer, which adapts the learning rate for each parameter individually and uses momentum to smooth out noisy gradients.
Think of it like tuning a piano with billions of strings. The loss function tells you "how out of tune the piano is." Gradient descent tells you "which direction to turn each tuning peg." The learning rate controls "how much you turn each peg." The Adam optimizer remembers "which pegs moved well last time" and adapts the turning speed for each one individually:
- 1. Compute loss (how wrong is the prediction?): A batch of tokens is fed through the model (forward pass). For each position, the model outputs a probability distribution over ~100K tokens. Cross-entropy loss measures the gap between the predicted probabilities and the actual next token. This single number drives all learning.
- 2. Compute gradients (which way to adjust?): Backpropagation computes the gradient (slope) of the loss with respect to every weight. For GPT-3, that means computing 175 billion partial derivatives in a single backward pass. Each gradient tells us: "if this weight increases slightly, the loss will increase/decrease by this much."
- 3. Update weights (take a step downhill): The Adam optimizer updates each weight: w_new = w_old - lr * adaptive_gradient. Adam maintains running averages of past gradients (momentum) and of squared gradients (per-parameter adaptive learning rates). The learning rate schedule controls the global step size: warmup from near zero, peak, then cosine decay.
- 4. Repeat billions of times: One complete pass through all the training data is one epoch. Modern LLMs often train for less than one epoch: Chinchilla showed it is better to use more data than to see the same data twice. Llama 3 saw 15.6T tokens in a single pass. The loss curve gradually flattens as the model converges.
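The four steps above can be sketched in NumPy under heavy simplifications: a single linear layer, one training example, and a toy vocabulary of 50 tokens. Every name and size here is illustrative, nothing is taken from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 50, 16                      # toy sizes, nothing like a real LLM
W = rng.normal(0.0, 0.02, (dim, vocab))  # weights start random
x = rng.normal(size=dim)                 # hidden state for one position
target = 7                               # index of the true next token

# Adam state: running averages of gradients (m) and squared gradients (v)
m, v = np.zeros_like(W), np.zeros_like(W)
lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8

for t in range(1, 501):
    # 1. Forward pass: logits -> softmax -> cross-entropy loss
    logits = x @ W
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[target])

    # 2. Backward pass: dL/dlogits = p - one_hot(target), then chain rule
    dlogits = p.copy()
    dlogits[target] -= 1.0
    grad = np.outer(x, dlogits)

    # 3. Adam update: momentum plus a per-parameter adaptive step size
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    W -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"final loss: {loss:.4f}, P(target): {p[target]:.4f}")
```

Running it shows the loss falling from near ln(50) ≈ 3.9 (a uniform guess over 50 tokens) toward zero as the random weights learn to predict the target token. Real training differs only in scale: billions of weights, trillions of tokens, and batches instead of a single example.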
Training Dynamics in LLMs
- Why Training Costs Millions: Training GPT-3 meant pushing 300B tokens through 175B parameters, on the order of 3.6 million GPU-hours of compute. Each training step adjusted all 175 billion weights simultaneously using the Adam optimizer. At cloud GPU prices, this cost approximately $4.6M; GPT-4 reportedly cost over $100M. The training loop is the same; the scale is what drives the cost
- Temperature and the Loss Landscape: When you set temperature in LLM settings, you rescale the logits before the softmax at inference time. Temperature = 0.1 makes the distribution peaky, so sampling almost always picks the top token; temperature = 2.0 flattens it toward uniform, producing more varied output. Training itself always uses temperature 1: the loss is computed on the unscaled softmax against the actual next token, and temperature only changes how you sample from the trained model
- Perplexity: The Public Loss Metric: Perplexity = e^(cross-entropy loss). When a paper says "perplexity of 15.2," it means the model is, on average, as uncertain as choosing among roughly 15 equally likely next words. Lower perplexity means a better model: GPT-2 achieved ~35 on Penn Treebank, and GPT-3 brought it down to ~20.5. This single number captures how well training worked
- Common Pitfall: Thinking a bigger model is always better. The Chinchilla scaling law (DeepMind, 2022) proved that GPT-3 was undertrained — it should have used more data and fewer parameters. For a fixed compute budget, a 70B model trained on 1.4T tokens outperforms a 175B model on 300B tokens. Training dynamics matter more than raw size
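Temperature and perplexity are both small transformations of the same softmax and cross-entropy math. A minimal sketch with made-up logits (the numbers are illustrative, not from any real model):

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by temperature before exponentiating:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]   # hypothetical next-token scores
cold = softmax(logits, 0.1)     # peaky: top token dominates (~1.0)
hot = softmax(logits, 2.0)      # flat: closer to uniform (~0.41 for the top)
print(round(cold[0], 3), round(hot[0], 3))

# Perplexity = e^(cross-entropy). A loss of ln(15.2) means the model is
# as uncertain as picking among ~15.2 equally likely next words.
loss = math.log(15.2)
print(round(math.exp(loss), 1))
```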
Fun Fact: Llama 3 405B was trained on 15.6 trillion tokens using 16,384 H100 GPUs simultaneously. The learning rate schedule used a warmup of 8,000 steps, peaked at 8e-5, then decayed with cosine annealing. One training run took ~54 days. If any GPU failed, checkpointing allowed resuming from the last saved state.
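The shape of such a warmup-then-cosine schedule is easy to reproduce. The warmup length and peak below follow the figures quoted above; the total step count is an assumed placeholder, not the real Llama 3 value:

```python
import math

def lr_schedule(step, peak_lr=8e-5, warmup=8000, total=1_000_000):
    # 'total' is an assumed value for illustration only.
    if step < warmup:
        return peak_lr * step / warmup  # linear warmup from ~0
    progress = (step - warmup) / (total - warmup)
    # Cosine decay from peak_lr down to 0 over the rest of training
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

print(lr_schedule(4000))       # halfway through warmup -> 4e-05
print(lr_schedule(8000))       # end of warmup -> peak, 8e-05
print(lr_schedule(1_000_000))  # end of training -> ~0
```

Warmup keeps the very first Adam steps small while its gradient statistics are still noisy; the slow cosine tail lets the weights settle into a minimum rather than bouncing around it.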
Try It Yourself!
Explore the interactive training landscape below: adjust learning rates, compare optimizers, visualize learning rate schedules, and see how training transforms random weights into a language model.
The "ball" = current model weights. It rolls downhill on the loss landscape — green = low error, red = high. Adjust learning rate, press Start, and watch:
Try it: Set LR to 0.005 (crawling), then 0.100 (wild jumps). Too low = barely moves, too high = overshoots minimum. Sweet spot: 0.03-0.05.
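The same behavior can be reproduced outside the demo on a one-dimensional stand-in landscape f(w) = w², whose gradient is 2w. The sweet-spot values here are for this toy function, not the demo's surface:

```python
def descend(lr, steps=50, w=5.0):
    # Gradient descent on f(w) = w**2; the update is w -= lr * f'(w).
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(abs(descend(0.005)))  # too low: after 50 steps, barely moved from 5.0
print(abs(descend(0.03)))   # reasonable: converging toward the minimum at 0
print(abs(descend(1.1)))    # too high: each step overshoots, |w| explodes
```

For this function every update multiplies w by (1 - 2*lr), so any lr above 1.0 flips the sign and grows the error each step, which is the 1-D version of the wild jumps in the demo.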
Neural networks learn from data. They adjust their weights to make better predictions. The process involves forward and backward passes.
One LLM training step:
- Batch: "The cat sat" -> predict "on"
- Forward pass: model outputs probabilities [the=0.15, cat=0.10, sat=0.05, on=0.20, mat=0.50]
- Cross-entropy loss: -log(P("on")) = -log(0.20) = 1.61
- If P("on")=0.90, loss = 0.105 (much better)
- Backward pass: compute dL/dw for every weight
- Gradient for "on": 0.20 - 1.0 = -0.80 (needs to increase)
- Gradient for "mat": 0.50 - 0.0 = +0.50 (needs to decrease)
- Adam update: w_new = w_old - 0.0001 * adaptive_gradient
At GPT-3 scale: repeat for 175B weights x 300B tokens.
The prompt "walk me through one step with concrete numbers" transforms the abstract training process into a verifiable sequence of computations you can reproduce.
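For instance, the walkthrough's numbers check out in a few lines of Python (the probabilities are the hypothetical values from the example above):

```python
import math

probs = {"the": 0.15, "cat": 0.10, "sat": 0.05, "on": 0.20, "mat": 0.50}

# Cross-entropy loss for the true next token "on"
loss = -math.log(probs["on"])
print(round(loss, 2))             # 1.61

# A confident, correct prediction is penalized far less
print(round(-math.log(0.90), 3))  # 0.105

# Softmax + cross-entropy gradient w.r.t. each logit is p - y,
# where y = 1 for the true token and 0 otherwise.
grad_on = probs["on"] - 1.0       # roughly -0.8: logit should increase
grad_mat = probs["mat"] - 0.0     # roughly +0.5: logit should decrease
print(grad_on, grad_mat)
```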