Reinforcement Learning from Human Feedback (RLHF) allows us to align language models with human preferences and values.
Standard training objectives (such as next-token prediction) don't necessarily produce outputs that are helpful, safe, or aligned with what people actually want.
Initial Applications (2019-2021):
Used for targeted goals like reducing harmful outputs in summarization
InstructGPT (2022):
Applied to instruction following and general helpfulness
Modern Usage (2023+):
Core component in production systems like ChatGPT, Claude, and others
1. Supervised fine-tuning (SFT)
Start with a supervised fine-tuned model
2. Collect human comparisons
Humans compare multiple model outputs and select which is better
3. Train a reward model
The reward model learns to predict human preferences from the comparison data
4. RL fine-tuning
Use RL algorithms to maximize reward model scores
Common approach: Proximal Policy Optimization (PPO) with a KL penalty to prevent the policy from drifting too far from the SFT model
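As a rough sketch of what step 4 optimizes, the policy is trained to maximize the reward model's score minus a KL penalty that keeps it close to the frozen SFT (reference) model. The function below is illustrative only; the argument names and the beta coefficient are assumptions for this sketch, not any particular library's API.

```python
import torch

def kl_penalized_reward(reward_score: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Quantity the RL step tries to maximize: reward minus a KL penalty.

    reward_score:    scalar score from the reward model for one response
    policy_logprobs: per-token log-probs of that response under the current policy
    ref_logprobs:    per-token log-probs under the frozen SFT reference model
    beta:            strength of the KL penalty
    """
    # Summing per-token log-prob differences gives a simple estimate of the
    # sequence-level KL divergence between the policy and the reference model.
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    return reward_score - beta * kl_estimate
```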
The reward model is trained to predict which response a human would prefer:
Prompt:
How do I make a vegetable soup?
Response A (chosen):
First, chop onions, carrots, and celery. Sauté in olive oil. Add vegetable broth and simmer for 20 minutes...
Response B (rejected):
I don't know how to make soup. Maybe look it up online?
Training Objective
Maximize the likelihood that the chosen response is assigned a higher reward than the rejected one
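A minimal sketch of that pairwise objective in PyTorch, assuming the reward model has already produced a scalar score for each chosen and rejected response; the standard formulation is a logistic loss on the score difference. In the soup example above, Response A plays the role of the chosen response and Response B the rejected one.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for reward model training.

    chosen_rewards / rejected_rewards: shape (batch,) scalar scores assigned
    by the reward model to the human-preferred and rejected responses.
    """
    # -log sigmoid(r_chosen - r_rejected): the loss shrinks as the chosen
    # response's reward rises above the rejected response's reward.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```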
The policy is updated to maximize reward while staying close to the original model:
1. Sample outputs
Generate responses from current policy for various prompts
2. Compute rewards
Score each response with reward model
3. Calculate advantage
Determine how much better/worse each token choice was
4. Update policy
Increase probability of high-reward sequences
5. Apply KL penalty
Prevent policy from drifting too far from reference model
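Putting the five steps together, one schematic iteration might look like the sketch below. The `policy`, `ref_model`, `reward_model`, and `ppo_update` objects are placeholders for whatever components an implementation actually uses, not a particular library's API; note that in practice the KL penalty (step 5) is folded into the reward before advantages are computed.

```python
def rlhf_iteration(policy, ref_model, reward_model, ppo_update, prompts, beta=0.1):
    """One schematic RLHF training iteration (batching, PPO clipping details,
    and the value/critic model are omitted for clarity)."""
    # 1. Sample outputs: generate responses from the current policy
    responses, policy_logprobs = policy.generate(prompts)

    # 2. Compute rewards: score each response with the reward model
    rewards = reward_model.score(prompts, responses)

    # 5. Apply KL penalty: compare per-token log-probs against the frozen
    #    SFT reference model and subtract the penalty from the reward
    ref_logprobs = ref_model.logprobs(prompts, responses)
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    shaped_rewards = rewards - beta * kl

    # 3. Calculate advantages: how much better/worse each sample was than average
    advantages = shaped_rewards - shaped_rewards.mean()

    # 4. Update policy: increase the probability of high-reward sequences
    ppo_update(policy, prompts, responses, policy_logprobs, advantages)
```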
Direct Preference Optimization (DPO)
Eliminates the explicit reward model and PPO steps by training directly on preference pairs (see the sketch after this list)
Constitutional AI
Uses AI feedback to complement human feedback
Rejection Sampling
Simpler alternative to PPO for policy improvement
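For the DPO alternative above, the published loss trains the policy directly on preference pairs, using the frozen reference model's log-probabilities in place of a learned reward. A minimal PyTorch sketch, assuming the summed log-probability of each response has already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each argument is a (batch,) tensor of summed log-probabilities of the
    chosen or rejected response under the policy or the frozen reference model.
    """
    # The implicit "reward" of each response is its log-probability ratio
    # against the reference model, scaled by beta.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Logistic loss that pushes the chosen response's implicit reward
    # above the rejected response's.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```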
After RLHF, the model:
Follows instructions more accurately
Avoids harmful or unethical content
Better matches human communication style