Training Dynamics
From random weights to predicting the next word
The Problem: You know that LLMs are "trained" on massive amounts of text — but what does training actually mean? How does a model go from random weights to predicting the next word with uncanny accuracy? Understanding training dynamics reveals why LLMs cost millions to train, why some models are better than others, and why "temperature" controls creativity.
The Solution: How LLM Training Works
LLM training is an optimization process driven by a single number: cross-entropy loss. For each token in the training data, the model predicts a probability distribution over the vocabulary. The loss measures how far this prediction is from reality. Gradient descent then computes the slope (gradient) of this loss with respect to every weight and nudges each weight in the direction that reduces the loss. The learning rate controls the step size — too large and the model overshoots; too small and training takes forever. Modern LLMs use the Adam optimizer, which adapts the learning rate for each parameter individually and uses momentum to smooth out noisy gradients.
Think of it like tuning a piano with billions of strings. The loss function tells you "how out of tune the piano is." Gradient descent tells you "which direction to turn each tuning peg." The learning rate controls "how much you turn each peg." The Adam optimizer remembers "which pegs moved well last time" and adapts the turning speed for each one individually:
- 1. Compute loss (how wrong is the prediction?): A batch of tokens is fed through the model (forward pass). For each position, the model outputs a probability distribution over ~100K tokens. Cross-entropy loss measures the gap between the predicted probabilities and the actual next token. This single number drives all learning.
- 2. Compute gradients (which way to adjust?): Backpropagation computes the gradient (slope) of the loss with respect to every weight. For GPT-3, that means computing 175 billion partial derivatives in a single backward pass. Each gradient tells us: "if this weight increases slightly, the loss will increase/decrease by this much."
- 3. Update weights (take a step downhill): The Adam optimizer updates each weight: w_new = w_old - lr * adaptive_gradient. Adam maintains running averages of past gradients (momentum) and of squared gradients (per-parameter adaptive learning rates). The learning rate schedule controls the global step size: warmup from near zero, peak, then cosine decay.
- 4. Repeat billions of times: One complete pass through all the training data is one epoch. Modern LLMs often train for less than one epoch: Chinchilla showed it is better to use more data than to see the same data twice. Llama 3 saw 15.6T tokens in a single pass. The loss curve gradually flattens as the model converges.
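The four steps above can be sketched in NumPy under heavy simplifications: a single linear layer, one training example, and a toy vocabulary of 50 tokens. Every name and size here is illustrative, nothing is taken from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 50, 16                      # toy sizes, nothing like a real LLM
W = rng.normal(0.0, 0.02, (dim, vocab))  # weights start random
x = rng.normal(size=dim)                 # hidden state for one position
target = 7                               # index of the true next token

# Adam state: running averages of gradients (m) and squared gradients (v)
m, v = np.zeros_like(W), np.zeros_like(W)
lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8

for t in range(1, 501):
    # 1. Forward pass: logits -> softmax -> cross-entropy loss
    logits = x @ W
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[target])

    # 2. Backward pass: dL/dlogits = p - one_hot(target), then chain rule
    dlogits = p.copy()
    dlogits[target] -= 1.0
    grad = np.outer(x, dlogits)

    # 3. Adam update: momentum plus a per-parameter adaptive step size
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    W -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"final loss: {loss:.4f}, P(target): {p[target]:.4f}")
```

Running it shows the loss falling from near ln(50) ≈ 3.9 (a uniform guess over 50 tokens) toward zero as the random weights learn to predict the target token. Real training differs only in scale: billions of weights, trillions of tokens, and batches instead of a single example.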
Training Dynamics in LLMs
- Why Training Costs Millions: Training GPT-3 meant pushing 300B tokens through 175B parameters, on the order of 3.6 million GPU-hours of compute. Each training step adjusted all 175 billion weights simultaneously using the Adam optimizer. At cloud GPU prices, this cost approximately $4.6M; GPT-4 reportedly cost over $100M. The training loop is the same; the scale is what drives the cost
- Temperature and the Loss Landscape: When you set temperature in LLM settings, you rescale the logits before the softmax at inference time. Temperature = 0.1 makes the distribution peaky, so sampling almost always picks the top token; temperature = 2.0 flattens it toward uniform, producing more varied output. Training itself always uses temperature 1: the loss is computed on the unscaled softmax against the actual next token, and temperature only changes how you sample from the trained model
- Perplexity: The Public Loss Metric: Perplexity = e^(cross-entropy loss). When a paper says "perplexity of 15.2," it means the model is, on average, as uncertain as choosing among roughly 15 equally likely next words. Lower perplexity means a better model: GPT-2 achieved ~35 on Penn Treebank, and GPT-3 brought it down to ~20.5. This single number captures how well training worked
- Common Pitfall: Thinking a bigger model is always better. The Chinchilla scaling law (DeepMind, 2022) proved that GPT-3 was undertrained — it should have used more data and fewer parameters. For a fixed compute budget, a 70B model trained on 1.4T tokens outperforms a 175B model on 300B tokens. Training dynamics matter more than raw size
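Temperature and perplexity are both small transformations of the same softmax and cross-entropy math. A minimal sketch with made-up logits (the numbers are illustrative, not from any real model):

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by temperature before exponentiating:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]   # hypothetical next-token scores
cold = softmax(logits, 0.1)     # peaky: top token dominates (~1.0)
hot = softmax(logits, 2.0)      # flat: closer to uniform (~0.41 for the top)
print(round(cold[0], 3), round(hot[0], 3))

# Perplexity = e^(cross-entropy). A loss of ln(15.2) means the model is
# as uncertain as picking among ~15.2 equally likely next words.
loss = math.log(15.2)
print(round(math.exp(loss), 1))
```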
Fun Fact: Llama 3 405B was trained on 15.6 trillion tokens using 16,384 H100 GPUs simultaneously. The learning rate schedule used a warmup of 8,000 steps, peaked at 8e-5, then decayed with cosine annealing. One training run took ~54 days. If any GPU failed, checkpointing allowed resuming from the last saved state.
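The shape of such a warmup-then-cosine schedule is easy to reproduce. The warmup length and peak below follow the figures quoted above; the total step count is an assumed placeholder, not the real Llama 3 value:

```python
import math

def lr_schedule(step, peak_lr=8e-5, warmup=8000, total=1_000_000):
    # 'total' is an assumed value for illustration only.
    if step < warmup:
        return peak_lr * step / warmup  # linear warmup from ~0
    progress = (step - warmup) / (total - warmup)
    # Cosine decay from peak_lr down to 0 over the rest of training
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

print(lr_schedule(4000))       # halfway through warmup -> 4e-05
print(lr_schedule(8000))       # end of warmup -> peak, 8e-05
print(lr_schedule(1_000_000))  # end of training -> ~0
```

Warmup keeps the very first Adam steps small while its gradient statistics are still noisy; the slow cosine tail lets the weights settle into a minimum rather than bouncing around it.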
Try It Yourself!
Explore the interactive training landscape below: adjust learning rates, compare optimizers, visualize learning rate schedules, and see how training transforms random weights into a language model.
The "ball" = current model weights. It rolls downhill on the loss landscape — green = low error, red = high. Adjust learning rate, press Start, and watch:
Try it: Set LR to 0.005 (crawling), then 0.100 (wild jumps). Too low = barely moves, too high = overshoots minimum. Sweet spot: 0.03-0.05.
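The same behavior can be reproduced outside the demo on a one-dimensional stand-in landscape f(w) = w², whose gradient is 2w. The sweet-spot values here are for this toy function, not the demo's surface:

```python
def descend(lr, steps=50, w=5.0):
    # Gradient descent on f(w) = w**2; the update is w -= lr * f'(w).
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(abs(descend(0.005)))  # too low: after 50 steps, barely moved from 5.0
print(abs(descend(0.03)))   # reasonable: converging toward the minimum at 0
print(abs(descend(1.1)))    # too high: each step overshoots, |w| explodes
```

For this function every update multiplies w by (1 - 2*lr), so any lr above 1.0 flips the sign and grows the error each step, which is the 1-D version of the wild jumps in the demo.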
Neural networks learn from data. They adjust their weights to make better predictions. The process involves forward and backward passes.
One LLM training step:
- Batch: "The cat sat" -> predict "on"
- Forward pass: model outputs probabilities [the=0.15, cat=0.10, sat=0.05, on=0.20, mat=0.50]
- Cross-entropy loss: -log(P("on")) = -log(0.20) = 1.61
- If P("on")=0.90, loss = 0.105 (much better)
- Backward pass: compute dL/dw for every weight
- Gradient for "on": 0.20 - 1.0 = -0.80 (needs to increase)
- Gradient for "mat": 0.50 - 0.0 = +0.50 (needs to decrease)
- Adam update: w_new = w_old - 0.0001 * adaptive_gradient
At GPT-3 scale: repeat for 175B weights x 300B tokens.
The prompt "walk me through one step with concrete numbers" transforms the abstract training process into a verifiable sequence of computations you can reproduce.
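For instance, the walkthrough's numbers check out in a few lines of Python (the probabilities are the hypothetical values from the example above):

```python
import math

probs = {"the": 0.15, "cat": 0.10, "sat": 0.05, "on": 0.20, "mat": 0.50}

# Cross-entropy loss for the true next token "on"
loss = -math.log(probs["on"])
print(round(loss, 2))             # 1.61

# A confident, correct prediction is penalized far less
print(round(-math.log(0.90), 3))  # 0.105

# Softmax + cross-entropy gradient w.r.t. each logit is p - y,
# where y = 1 for the true token and 0 otherwise.
grad_on = probs["on"] - 1.0       # roughly -0.8: logit should increase
grad_mat = probs["mat"] - 0.0     # roughly +0.5: logit should decrease
print(grad_on, grad_mat)
```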