AI Safety & Alignment
Learn how LLMs are trained to be helpful, harmless, and honest using RLHF, DPO, and Constitutional AI
The Problem: LLMs are trained on massive internet datasets that contain harmful, biased, and false information. How do we make these models safe, helpful, and honest — without losing their capabilities?
The Solution: Align Models with Human Values
Alignment is the process of training models to follow human values, instructions, and safety guidelines. A base LLM trained on internet data can produce harmful, biased, or dishonest outputs. Alignment techniques like RLHF, DPO, and Constitutional AI transform these raw models into helpful, harmless, and honest assistants.
Think of it like training a guard dog — it must be powerful enough to protect, but controlled enough not to attack its owner:
1. Supervised Fine-Tuning (SFT): Train the base model on high-quality instruction-response pairs written by humans. This teaches the model the FORMAT of being an assistant.
2. Preference Learning (RLHF or DPO): Human labelers rank model outputs from best to worst. RLHF trains a reward model plus PPO; DPO skips the reward model and optimizes directly on preferences.
3. Safety Training: Red-team the model to find harmful outputs, then train it to refuse dangerous requests while remaining helpful for legitimate ones.
4. Evaluate and Iterate: Test with safety benchmarks (TruthfulQA, BBQ, HarmBench) and monitor in production for reward hacking and unexpected behaviors.
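Step 2's reward model is typically fit with a pairwise Bradley-Terry loss over the labelers' rankings: the loss is low when the model scores the human-preferred response higher. A minimal sketch in plain Python; the scores below are invented placeholders, not real model outputs:

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise Bradley-Terry loss used to train RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected))."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that agrees with the labeler incurs low loss...
good = reward_model_loss(score_chosen=2.0, score_rejected=-1.0)
# ...while one that inverts the ranking incurs high loss.
bad = reward_model_loss(score_chosen=-1.0, score_rejected=2.0)
print(f"agrees with labeler: {good:.3f}, disagrees: {bad:.3f}")
```

Training the reward model just minimizes this loss over many (chosen, rejected) pairs; PPO then optimizes the policy against the trained scorer.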
Key Alignment Methods
- RLHF Pipeline: Pre-training → SFT (Supervised Fine-Tuning) → Reward Model training → PPO optimization. InstructGPT used just 40 labelers to dramatically improve GPT-3 with this pipeline
- DPO (Direct Preference Optimization): Eliminates the separate reward model — trains directly on human preference pairs (chosen vs rejected). Simpler, more stable, same quality. Used by Llama 3, Mistral
- Constitutional AI: Anthropic's approach: the model critiques its own outputs using a set of principles (constitution), then revises them. Reduces need for human labelers. Used to train Claude
- Reward Hacking: Models can learn to exploit the reward signal rather than genuinely improve. Example: a model learns to produce verbose answers because labelers preferred longer responses, not better ones
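The DPO objective mentioned above can be written down directly, with no separate reward model. A minimal sketch for a single preference pair, using invented sequence log-probabilities; `beta` is the usual KL-strength hyperparameter:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one preference pair: push the policy to raise the
    log-probability of the chosen response relative to the rejected one,
    measured against a frozen reference (SFT) model."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Hypothetical log-probs: the policy already favors the chosen answer,
# so the loss is below log(2) (the value at indifference).
loss = dpo_loss(policy_logp_chosen=-12.0, policy_logp_rejected=-20.0,
                ref_logp_chosen=-14.0, ref_logp_rejected=-14.0)
print(f"DPO loss: {loss:.3f}")
```

In practice the log-probabilities come from summing token log-probs of each full response under the policy and the frozen reference model.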
Fun Fact: OpenAI's InstructGPT paper showed that a 1.3B parameter model with RLHF was preferred by humans over the 175B parameter GPT-3 without alignment — proving that alignment matters more than raw scale.
Try It Yourself!
Explore the interactive pipeline below to see how raw models become safe assistants.
Alignment is the process of training AI systems to act in accordance with human intentions and values. Without alignment, models can be harmful, dishonest, or unhelpful — even if technically powerful.
- Pre-training: learning from internet data (next-token prediction)
- SFT: learning the assistant format from high-quality pairs
- RLHF/DPO: learning human preferences (what is better vs. worse)
- Red teaming: finding and fixing vulnerabilities

- RLHF: reward model + PPO (GPT-4, Claude, Gemini)
- DPO: direct preference optimization, no reward model (Llama 3)
- Constitutional AI: self-critique using principles (Claude)
- GRPO: group optimization for reasoning (DeepSeek R1)

- Reward hacking: the model games the reward signal
- Alignment tax: the safety vs. capabilities trade-off
- Scalable oversight: supervising superhuman systems
- Value alignment: whose values to encode?
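The reward-hacking failure mode listed above is easy to reproduce with a toy proxy reward. Here the "learned" reward simply counts words, standing in for the length bias labelers can introduce, and optimizing against it selects the padded answer over the correct concise one. All names and candidate answers are invented for illustration:

```python
def proxy_reward(answer: str) -> float:
    """A flawed learned reward that correlates length with quality,
    because labelers happened to prefer longer responses."""
    return float(len(answer.split()))

candidates = [
    "Paris.",  # correct and concise
    "The answer to your question, after careful consideration of many "
    "relevant factors and historical context, is the city of Paris.",  # padded
]

# Optimizing against the proxy picks the verbose answer: reward hacking.
best = max(candidates, key=proxy_reward)
print(best)
```

The same dynamic appears at scale: any gap between the reward signal and what humans actually want becomes a target for the optimizer.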
A raw model trained on internet text simply predicts the next token, with no concept of helpfulness or safety:

Prompt: "How to pick a lock?"
Base model: "First, you need a tension wrench and a pick. Insert the wrench into the bottom of the keyhole and apply slight pressure..."
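Constitutional AI's critique-revise loop, which targets exactly this kind of unsafe completion, has a simple control flow: draft, critique the draft against each principle, rewrite. A sketch with a hypothetical `generate` callable standing in for the LLM call; the real method also distills the revised outputs back into training data:

```python
def constitutional_revision(prompt, generate, constitution):
    """One critique-revise pass in the style of Constitutional AI.
    `generate` is a stand-in for an LLM call; the loop structure,
    not the stub, is the point."""
    draft = generate(prompt)
    for principle in constitution:
        critique = generate(f"Critique this response against the principle "
                            f"'{principle}':\n{draft}")
        draft = generate(f"Rewrite the response to address the critique:\n"
                         f"Critique: {critique}\nResponse: {draft}")
    return draft
```

Each principle costs one critique call and one revision call, so a pass over a constitution of N principles makes 2N + 1 model calls in this sketch.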
- Without alignment, a 175B-parameter LLM can be less useful than an aligned 1.3B model (InstructGPT paper).
- Jailbreaks work precisely because alignment is statistical, not absolute: the model will "probably" refuse, but not "guaranteed".
- As model capabilities grow, alignment becomes more critical: a more powerful model without alignment is more dangerous.
This lesson is part of a structured LLM course.