RLHF
From next-token predictor to helpful assistant
The Problem: Your pre-trained LLM generates plausible text but sometimes produces toxic, inaccurate, or unhelpful responses. It doesn't distinguish between "technically possible" and "actually useful." How do you teach it what humans actually want?
The Solution: RLHF — Teaching Models What Humans Prefer
After pre-training on internet text, an LLM can predict the next token — but it doesn't know what makes a response helpful, harmless, or honest. RLHF bridges this gap by using human judgments to create a reward model that scores responses, then fine-tuning the LLM with PPO (Proximal Policy Optimization) to maximize that score. A KL divergence penalty prevents the model from drifting too far from its pre-trained knowledge, avoiding "reward hacking" where the model finds degenerate ways to get high scores.
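The KL-penalized reward described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the log-probabilities below are made-up values standing in for per-token outputs of the fine-tuned policy and the frozen pre-trained reference model.

```python
def kl_penalized_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward signal used during PPO fine-tuning.

    The per-token KL estimate is log pi(token) - log pi_ref(token); summed
    over the response, it penalizes the policy for drifting away from the
    pre-trained reference model, discouraging reward hacking.
    """
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward_score - beta * kl

# Hypothetical log-probs: the policy has grown more confident than the reference
policy_lp = [-0.2, -0.5, -0.1]
ref_lp = [-0.7, -0.9, -0.4]

# KL estimate = 0.5 + 0.4 + 0.3 = 1.2, so the score 2.0 is reduced to 1.88
r = kl_penalized_reward(2.0, policy_lp, ref_lp)
```

A larger `beta` keeps the model closer to its pre-trained behavior at the cost of slower reward improvement; in practice this coefficient is tuned or adapted during training.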
Think of it like training a puppy with approval and disapproval — the puppy learns which behaviors the owner prefers, and eventually internalizes the rules:
1. Collect human preferences: Show annotators pairs of model responses to the same prompt. They choose which response is better (A > B, B > A, or a tie). Thousands of these comparisons form the preference dataset.
2. Train a reward model: A neural network learns to predict human preferences from these comparisons. Given a prompt and response, it outputs a scalar score: how much a human would approve of this response.
3. Fine-tune with PPO: Optimize the LLM to maximize reward model scores using PPO. A KL divergence penalty keeps the model close to its pre-trained version; without it, the model would "reward hack" by finding degenerate responses that score high but are nonsensical.
4. Iterate and improve: Collect new preference data from the improved model, update the reward model, repeat. Each iteration improves alignment. Variants like RLAIF (used in Constitutional AI) replace human annotators with AI feedback for scalability.
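Step 2 above is typically trained with a Bradley-Terry pairwise loss: the reward model should score the human-chosen response above the rejected one, and the loss is the negative log-probability of the observed preference. Here is a minimal sketch with plain Python scores standing in for the outputs of a real neural reward model:

```python
import math

def pairwise_loss(chosen_score, rejected_score):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    Near zero when the chosen response is scored far above the rejected one;
    large when the reward model prefers the response humans rejected.
    """
    margin = chosen_score - rejected_score
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The wider the margin in favor of the chosen response, the lower the loss
assert pairwise_loss(3.0, 0.0) < pairwise_loss(1.0, 0.0) < pairwise_loss(0.0, 1.0)
```

Minimizing this loss over thousands of comparisons pushes the scalar scores to reproduce human rankings, which is all PPO needs in the next step.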
RLHF in Practice
- ChatGPT Alignment: RLHF is what transformed GPT-3 from an autocomplete engine into ChatGPT. The InstructGPT paper (2022) showed that a 1.3B model with RLHF outperformed a 175B model without it on human evaluations
- Safety Training: RLHF teaches models to refuse harmful requests, avoid generating toxic content, and respond appropriately to sensitive topics. The reward model learns what "safe" means from human annotators
- Instruction Following: Pre-trained models often ignore formatting requests or ramble. RLHF optimizes for responses that actually follow user instructions: concise when asked for brevity, structured when asked for lists
- Common Pitfall: RLHF does not teach the model new knowledge — it teaches which existing knowledge to surface. Reward model biases from annotator disagreements can lead to sycophantic behavior where the model agrees with the user even when wrong
Fun Fact: The InstructGPT paper showed that a 1.3 billion parameter model fine-tuned with RLHF was preferred by human evaluators over the base 175 billion parameter GPT-3 — a 135x smaller model beating its giant counterpart through alignment alone.
Try It Yourself!
Explore the interactive RLHF pipeline below: collect preferences, train the reward model, and see how PPO optimization shapes model behavior.
Choose the better response (like a human annotator):
Prompt: "How do I pick a lock?"
Response A: First, get a tension wrench and rake pick. Insert the wrench into the bottom of the keyhole...
Response B: I can't provide instructions for lock picking as it could facilitate illegal entry. If you're locked out, I recommend calling a licensed locksmith.
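A comparison like the one above becomes a single record in the preference dataset. The schema below is a hypothetical illustration (real datasets add metadata such as annotator IDs and tie handling):

```python
# One human comparison -> one reward-model training record
preference_record = {
    "prompt": "How do I pick a lock?",
    "chosen": (
        "I can't provide instructions for lock picking as it could "
        "facilitate illegal entry. If you're locked out, I recommend "
        "calling a licensed locksmith."
    ),
    "rejected": (
        "First, get a tension wrench and rake pick. Insert the wrench "
        "into the bottom of the keyhole..."
    ),
}

# The reward model is trained to score "chosen" above "rejected"
assert set(preference_record) == {"prompt", "chosen", "rejected"}
```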
Prompt: "Explain to a user why their code is not working."
Response A: Code is wrong. Syntax error. Need closing quote. Could also use f-strings. Python supports single and double quotes. Strings in Python are immutable. The print function was introduced in Python 3...
Response B: You're missing a closing quote and parenthesis. Here's the fixed code:
print("hello world")
Error: SyntaxError: EOL while scanning string literal — Python couldn't find the end of the string. Always check for matching quotes and parentheses.
RLHF doesn't teach the model new knowledge — both responses are technically correct. But RLHF teaches the model to give useful responses: a specific fix instead of an information dump.