RLHF
From next-token predictor to helpful assistant
The Problem: Your pre-trained LLM generates plausible text but sometimes produces toxic, inaccurate, or unhelpful responses. It doesn't distinguish between "technically possible" and "actually useful." How do you teach it what humans actually want?
The Solution: RLHF — Teaching Models What Humans Prefer
After pre-training on internet text, an LLM can predict the next token — but it doesn't know what makes a response helpful, harmless, or honest. RLHF bridges this gap by using human judgments to create a reward model that scores responses, then fine-tuning the LLM with PPO (Proximal Policy Optimization) to maximize that score. A KL divergence penalty prevents the model from drifting too far from its pre-trained knowledge, avoiding "reward hacking" where the model finds degenerate ways to get high scores.
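The KL-penalized reward described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the log-probabilities below are made-up values standing in for per-token outputs of the fine-tuned policy and the frozen pre-trained reference model.

```python
def kl_penalized_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward signal used during PPO fine-tuning.

    The per-token KL estimate is log pi(token) - log pi_ref(token); summed
    over the response, it penalizes the policy for drifting away from the
    pre-trained reference model, discouraging reward hacking.
    """
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward_score - beta * kl

# Hypothetical log-probs: the policy has grown more confident than the reference
policy_lp = [-0.2, -0.5, -0.1]
ref_lp = [-0.7, -0.9, -0.4]

# KL estimate = 0.5 + 0.4 + 0.3 = 1.2, so the score 2.0 is reduced to 1.88
r = kl_penalized_reward(2.0, policy_lp, ref_lp)
```

A larger `beta` keeps the model closer to its pre-trained behavior at the cost of slower reward improvement; in practice this coefficient is tuned or adapted during training.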
Think of it like training a puppy with approval and disapproval — the puppy learns which behaviors the owner prefers, and eventually internalizes the rules:
1. Collect human preferences: Show annotators pairs of model responses to the same prompt. They choose which response is better (A > B, B > A, or a tie). Thousands of these comparisons form the preference dataset.
2. Train a reward model: A neural network learns to predict human preferences from these comparisons. Given a prompt and response, it outputs a scalar score: how much a human would approve of this response.
3. Fine-tune with PPO: Optimize the LLM to maximize reward model scores using PPO. A KL divergence penalty keeps the model close to its pre-trained version; without it, the model would "reward hack" by finding degenerate responses that score high but are nonsensical.
4. Iterate and improve: Collect new preference data from the improved model, update the reward model, repeat. Each iteration improves alignment. Variants like RLAIF (used in Constitutional AI) replace human annotators with AI feedback for scalability.
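Step 2 above is typically trained with a Bradley-Terry pairwise loss: the reward model should score the human-chosen response above the rejected one, and the loss is the negative log-probability of the observed preference. Here is a minimal sketch with plain Python scores standing in for the outputs of a real neural reward model:

```python
import math

def pairwise_loss(chosen_score, rejected_score):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    Near zero when the chosen response is scored far above the rejected one;
    large when the reward model prefers the response humans rejected.
    """
    margin = chosen_score - rejected_score
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The wider the margin in favor of the chosen response, the lower the loss
assert pairwise_loss(3.0, 0.0) < pairwise_loss(1.0, 0.0) < pairwise_loss(0.0, 1.0)
```

Minimizing this loss over thousands of comparisons pushes the scalar scores to reproduce human rankings, which is all PPO needs in the next step.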
RLHF in Practice
- ChatGPT Alignment: RLHF is what transformed GPT-3 from an autocomplete engine into ChatGPT. The InstructGPT paper (2022) showed that a 1.3B model with RLHF outperformed a 175B model without it on human evaluations
- Safety Training: RLHF teaches models to refuse harmful requests, avoid generating toxic content, and respond appropriately to sensitive topics. The reward model learns what "safe" means from human annotators
- Instruction Following: Pre-trained models often ignore formatting requests or ramble. RLHF optimizes for responses that actually follow user instructions: concise when asked for brevity, structured when asked for lists
- Common Pitfall: RLHF does not teach the model new knowledge — it teaches which existing knowledge to surface. Reward model biases from annotator disagreements can lead to sycophantic behavior where the model agrees with the user even when wrong
Fun Fact: The InstructGPT paper showed that a 1.3 billion parameter model fine-tuned with RLHF was preferred by human evaluators over the base 175 billion parameter GPT-3 — a 135x smaller model beating its giant counterpart through alignment alone.
Try It Yourself!
Explore the interactive RLHF pipeline below: collect preferences, train the reward model, and see how PPO optimization shapes model behavior.
Choose the better response (like a human annotator):
Prompt: "How do I pick a lock?"
Response A: First, get a tension wrench and rake pick. Insert the wrench into the bottom of the keyhole...
Response B: I can't provide instructions for lock picking as it could facilitate illegal entry. If you're locked out, I recommend calling a licensed locksmith.
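A comparison like the one above becomes a single record in the preference dataset. The schema below is a hypothetical illustration (real datasets add metadata such as annotator IDs and tie handling):

```python
# One human comparison -> one reward-model training record
preference_record = {
    "prompt": "How do I pick a lock?",
    "chosen": (
        "I can't provide instructions for lock picking as it could "
        "facilitate illegal entry. If you're locked out, I recommend "
        "calling a licensed locksmith."
    ),
    "rejected": (
        "First, get a tension wrench and rake pick. Insert the wrench "
        "into the bottom of the keyhole..."
    ),
}

# The reward model is trained to score "chosen" above "rejected"
assert set(preference_record) == {"prompt", "chosen", "rejected"}
```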
Prompt: "Explain to a user why their code is not working."
Response A: Code is wrong. Syntax error. Need closing quote. Could also use f-strings. Python supports single and double quotes. Strings in Python are immutable. The print function was introduced in Python 3...
Response B: You're missing a closing quote and parenthesis. Here's the fixed code:
print("hello world")
Error: SyntaxError: EOL while scanning string literal — Python couldn't find the end of the string. Always check for matching quotes and parentheses.
RLHF doesn't teach the model new knowledge — both responses are technically correct. But RLHF teaches the model to give useful responses: a specific fix instead of an information dump.