Lesson 18

RLVR and GRPO

Training reasoning models with an answer key, not a tutor's opinion

The Problem: You want to train a model to solve hard math and coding problems. RLHF would require thousands of expensive human preference labels and a learned reward model that the model can quietly game — producing answers that score high but are actually wrong. How do you reward correctness directly, without humans in the loop?

The Solution: RLVR & GRPO — Reward What You Can Verify

RLHF depends on human preference labels and a learned reward model that the policy can learn to game (reward hacking). For tasks with checkable answers (math, code, formal proofs) you can skip the learned reward model entirely. RLVR (Reinforcement Learning with Verifiable Rewards) rewards the model only when its answer passes an automatic correctness check. The training algorithm is usually GRPO (Group Relative Policy Optimization): sample a group of answers per problem, verify each, and score each answer by how much better it is than the group average, so no separate value network is required. This is the core recipe behind reasoning models such as DeepSeek-R1. It is more robust than a learned reward because a deterministic verifier is much harder to game, although a weak verifier (for example, a thin test suite) can still be exploited.

Think of it like studying with an answer key versus a tutor’s opinion. RLHF is the tutor’s subjective grade, which can be flattered or gamed. RLVR is an automatic answer key: your answer is objectively right or wrong every time. The training loop has four steps:

1. Sample a group of answers: For each problem, sample several complete answers (e.g. 8 to 16) from the current policy. Because sampling is stochastic, the group usually contains a mix of correct and incorrect reasoning paths, and this diversity is what makes group-relative comparison possible.
2. Verify each answer automatically: Run a deterministic verifier on every answer: exact-match against the known result for math, a unit-test suite for code, a proof checker for theorems. Each answer gets a binary (or graded) reward, with no human and no learned scorer involved.
3. Compute group-relative advantage (GRPO): Instead of a separate value network, GRPO uses the group itself as the baseline: each answer’s advantage = (its reward − group mean reward) / group std. Answers better than the group average get a positive advantage; worse-than-average answers get a negative one. If every answer in the group earns the same reward, every numerator is zero and that problem contributes no learning signal (implementations typically add a small epsilon to the std to avoid dividing by zero). Dropping the value network is what makes GRPO cheaper than PPO in memory and compute.
4. Update the policy toward correct reasoning: Update the policy to make positive-advantage (verifiably correct) reasoning paths more likely and negative-advantage paths less likely, with a KL penalty keeping it close to the reference model. Repeat for thousands of steps. In reported runs such as DeepSeek-R1, the model gradually learns to reason longer and more carefully, because that is what earns verifiable reward.
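Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular library: the function names `exact_match_reward` and `group_relative_advantages`, the whitespace-only answer normalization, the use of the population standard deviation, and the `eps` term are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def exact_match_reward(answer: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches the gold answer."""
    # Assumption: answers are compared as stripped strings; real verifiers
    # usually normalize more aggressively (fractions, units, formatting).
    return 1.0 if answer.strip() == gold.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each answer = (reward - group mean) / (group std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of 8 sampled final answers for one problem with gold answer "8".
group = ["8", "12", "8", "7", "8", "9", "8", "8"]
rewards = [exact_match_reward(a, "8") for a in group]
advantages = group_relative_advantages(rewards)

for answer, reward, adv in zip(group, rewards, advantages):
    print(f"answer={answer:>2}  reward={reward:.0f}  advantage={adv:+.3f}")
```

Correct answers in this group receive a positive advantage and incorrect ones a negative advantage. A group in which every answer scores the same yields all-zero advantages, which is why a mix of right and wrong samples matters.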

RLVR in Practice

Math Reasoning (DeepSeek-R1): DeepSeek-R1 was trained with RLVR on math problems where the final answer is checked automatically against the known result. The model discovered, on its own, that writing longer chains of reasoning leads to more correct answers, so it learned to "think" longer without being explicitly told to.
Code Generation: For coding tasks, the reward is simply whether the generated code passes a hidden unit-test suite. No human grades the code; the test runner is the verifier. This makes the reward signal cheap, scalable, and hard to fake as long as the tests are thorough.
Tool Use & Theorem Proving: Any task where success is checkable works: an API call that returns the expected result, a SQL query that matches the gold output, or a formal proof accepted by a proof checker like Lean. The verifier replaces the learned reward model entirely.
Common Pitfall: RLVR does not work for subjective tasks (tone, creativity, helpfulness) because there is no deterministic verifier for "good writing". For those, you still need RLHF and its gameable learned reward. Applying RLVR where no objective checker exists is a common mistake.
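For the code-generation case, a test-based verifier can be sketched as below. This is an illustrative toy, not a production harness: `code_reward`, the `solve` entry-point name, and the test cases are hypothetical. The candidate is executed with a bare `exec`, which is unsafe for untrusted model output; a real pipeline would run candidates in a sandbox with time and memory limits.

```python
def code_reward(
    source: str,
    tests: list[tuple[tuple, object]],
    entry_point: str = "solve",
) -> float:
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    namespace: dict = {}
    try:
        # WARNING: exec on untrusted code is unsafe; sandbox this in practice.
        exec(source, namespace)
        func = namespace[entry_point]
        for args, expected in tests:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, a missing entry point, and runtime errors all count as failure.
        return 0.0
    return 1.0


# Hypothetical hidden tests for the task "return the sum of two integers".
hidden_tests = [((1, 2), 3), ((-1, 1), 0), ((10, 5), 15)]

good = "def solve(a, b):\n    return a + b\n"
bad = "def solve(a, b):\n    return a * b\n"
broken = "def solve(a, b) return a + b"  # does not parse

rewards = [code_reward(src, hidden_tests) for src in (good, bad, broken)]
print(rewards)
```

Only the candidate that passes every hidden test earns reward. The quality of this signal depends on the tests: a suite that checks too few cases can be passed by code that is still wrong.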

Fun Fact: DeepSeek-R1-Zero was trained with pure RLVR + GRPO and no supervised fine-tuning at all. During training, the model spontaneously developed an "aha moment": it learned to stop, re-check its own work, and re-derive answers — behaviors nobody programmed, that emerged purely because longer correct reasoning earned more verifiable reward.

RLVR + GRPO: The Verifiable-Reward Training Loop

Step 1 — Sample a group of answers for the same problem from the current policy. Sampling is stochastic, so paths differ:

Problem: 2 + 2 × 3 = ?

Answer A

2 + 2*3 → multiply first: 2*3=6, then 2+6 = 8

Answer B

2 + 2*3 → left to right: 2+2=4, 4*3 = 12

Answer C

2 + 2*3 → 2*3=6, 6+2 = 8 (re-checked order of operations)

Answer D

2 + 2*3 → guessed 7 without showing work
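Carrying this group through the remaining steps by hand gives a worked example. The correct value is 8, so an exact-match verifier gives Answers A and C a reward of 1 and Answers B and D a reward of 0. The calculation below assumes binary rewards and the population standard deviation over the group.

```latex
\begin{align*}
(r_A, r_B, r_C, r_D) &= (1, 0, 1, 0) \\
\mu &= \frac{1 + 0 + 1 + 0}{4} = 0.5 \\
\sigma &= \sqrt{\frac{1}{4}\sum_{i}(r_i - \mu)^2} = \sqrt{\frac{4 \times 0.25}{4}} = 0.5 \\
A_i &= \frac{r_i - \mu}{\sigma}
  \quad\Rightarrow\quad A_A = A_C = +1, \qquad A_B = A_D = -1
\end{align*}
```

The update then raises the probability of the reasoning paths in A and C and lowers it for B and D. With a binary reward, the wrong order of operations in B and the unexplained guess in D are penalized equally.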

Frequently asked questions

What is RLVR and how does it differ from RLHF?

RLVR (Reinforcement Learning with Verifiable Rewards) replaces RLHF's learned reward model with a deterministic verifier. Instead of training a neural network to predict human preferences, RLVR rewards the model whenever its answer passes an automatic correctness check — unit tests for code, exact-match for math. This removes the expensive human-labeling step and eliminates the gameable learned reward model, which is why it works so well for math and coding where answers are checkable.

What is GRPO and what does it normalize the advantage against?

GRPO (Group Relative Policy Optimization) is the RL algorithm used to train DeepSeek-R1. For each problem it samples a group of answers from the current policy, verifies each one, and normalizes each answer's reward against that group: it subtracts the group's mean reward and divides by the group's standard deviation. No separate value (critic) network is needed. An answer better than the group average gets a positive advantage; one worse than average gets a negative advantage. Dropping the critic makes training cheaper than PPO and avoids the instability that a poorly fitted critic can introduce.

Why does RLVR only work in verifiable domains?

RLVR depends on an automatic, deterministic check that an answer is correct. This exists for math (exact-match against a known answer), code (passing a test suite), and formal proofs (a proof checker). For subjective or open-ended tasks — creative writing, tone, helpfulness — there is no objective verifier, so you fall back to learned reward models (RLHF) which can be gamed. RLVR is more robust precisely because a deterministic checker is far harder to reward-hack than a learned scorer.

Technique Comparison

Choose a reward signal for training a model to solve math problems via RL

Without technique

Prompt

Train a model to solve math problems. Use a reward model that scores 1-10 based on how correct and well-explained the answer looks.

Response

We use a learned reward model that scores "looks correct". The problem: the model learns to maximize the score, not correctness. After a few thousand steps it writes long, confident, well-formatted "solutions" whose final answer is often WRONG — the reward model rewards a confident tone and nice formatting, not the math. This is reward hacking: the proxy metric ("looks correct") diverged from the goal (being correct).

With technique

Prompt

Train a model to solve math problems via RLVR + GRPO. Reward = exact-match of the final answer against the known correct answer (binary: 1 if correct, 0 if not). Sample a group of 8 answers per problem and normalize advantage against the group mean.

Response

The reward is a deterministic exact-match: the answer either matches the gold answer (r=1) or not (r=0). Reward hacking becomes much harder: you cannot "look correct", you must BE correct. GRPO samples 8 answers, computes advantage = (r − group mean) / std, reinforces correct paths and suppresses wrong ones, with no separate value network. Over time the model discovers on its own that longer, self-checked reasoning chains yield more correct answers, and starts "thinking" longer. This is the core of how DeepSeek-R1 was trained.

👁️The baseline signal "looks correct" is a learned proxy the model learns to game

🧠Switching to exact-match makes the reward deterministic and robust to reward hacking

✅GRPO with group-relative advantage removes the value network and stabilizes training; the model learns to reason longer on its own

Why this works

For verifiable tasks, a deterministic reward (exact-match, unit tests) beats a learned reward model: you cannot "look correct" — you must be correct, which closes the main reward-hacking channel.
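The contrast can be made concrete with a toy example. The `proxy_score` function below is a deliberately crude, hypothetical stand-in for a learned reward model: it rewards length and confident wording, which is an assumption made purely for illustration and not a description of any real reward model. `exact_match_reward` is the verifiable alternative, and the `Answer:` marker convention is likewise assumed.

```python
def proxy_score(response: str) -> float:
    """Hypothetical stand-in for a learned 'looks correct' scorer."""
    confident_words = ("clearly", "therefore", "obviously")
    # Longer responses look more thorough (capped at 20 words).
    score = min(len(response.split()), 20) / 20
    # Confident wording looks more convincing.
    score += sum(word in response.lower() for word in confident_words) / len(confident_words)
    return score


def exact_match_reward(response: str, gold: str) -> float:
    """Verifiable reward: compare only the text after the final 'Answer:' marker."""
    final = response.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final == gold else 0.0


terse_correct = "2*3=6, 2+6=8. Answer: 8"
polished_wrong = (
    "Clearly we proceed left to right. Obviously 2+2=4, and therefore "
    "multiplying by 3 gives the final result. Answer: 12"
)

for name, resp in [("terse_correct", terse_correct), ("polished_wrong", polished_wrong)]:
    print(name, round(proxy_score(resp), 2), exact_match_reward(resp, "8"))
```

The proxy prefers the polished but wrong response, while the exact-match reward ignores style and credits only the correct final answer. A policy optimized against the proxy is pushed toward the wrong behavior; one optimized against the verifier is not.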
