Reinforcement Learning from Human Feedback (RLHF) allows us to align language models with human preferences and values.
Standard training objectives (such as next-token prediction) don't necessarily produce outputs that are helpful, safe, or aligned with what people actually want.
Initial Applications (2019-2021):
Used for targeted goals like reducing harmful outputs in summarization
InstructGPT (2022):
Applied to instruction following and general helpfulness
Modern Usage (2023+):
Core component in production systems like ChatGPT, Claude, and others
1. Supervised fine-tuning (SFT)
Start with a supervised fine-tuned model
2. Collect human comparisons
Humans compare multiple model outputs and select which is better
3. Train a reward model
The reward model learns to predict human preferences from the comparison data
4. RL fine-tuning
Use RL algorithms to maximize reward model scores
Common approach: Proximal Policy Optimization (PPO) with a KL penalty to prevent the policy from drifting too far from the SFT model
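As a rough sketch of what step 4 optimizes, the policy is trained to maximize the reward model's score minus a KL penalty that keeps it close to the frozen SFT (reference) model. The function below is illustrative only; the argument names and the beta coefficient are assumptions for this sketch, not any particular library's API.

```python
import torch

def kl_penalized_reward(reward_score: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Quantity the RL step tries to maximize: reward minus a KL penalty.

    reward_score:    scalar score from the reward model for one response
    policy_logprobs: per-token log-probs of that response under the current policy
    ref_logprobs:    per-token log-probs under the frozen SFT reference model
    beta:            strength of the KL penalty
    """
    # Summing per-token log-prob differences gives a simple estimate of the
    # sequence-level KL divergence between the policy and the reference model.
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    return reward_score - beta * kl_estimate
```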
The reward model is trained to predict which response a human would prefer:
Prompt:
How do I make a vegetable soup?
Response A (chosen):
First, chop onions, carrots, and celery. Sauté in olive oil. Add vegetable broth and simmer for 20 minutes...
Response B (rejected):
I don't know how to make soup. Maybe look it up online?
Training Objective
Maximize the likelihood that the chosen response is assigned a higher reward than the rejected one
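A minimal sketch of that pairwise objective in PyTorch, assuming the reward model has already produced a scalar score for each chosen and rejected response; the standard formulation is a logistic loss on the score difference. In the soup example above, Response A plays the role of the chosen response and Response B the rejected one.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for reward model training.

    chosen_rewards / rejected_rewards: shape (batch,) scalar scores assigned
    by the reward model to the human-preferred and rejected responses.
    """
    # -log sigmoid(r_chosen - r_rejected): the loss shrinks as the chosen
    # response's reward rises above the rejected response's reward.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```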
The policy is updated to maximize reward while staying close to the original model:
1. Sample outputs
Generate responses from current policy for various prompts
2. Compute rewards
Score each response with reward model
3. Calculate advantage
Determine how much better/worse each token choice was
4. Update policy
Increase probability of high-reward sequences
5. Apply KL penalty
Prevent policy from drifting too far from reference model
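Putting the five steps together, one schematic iteration might look like the sketch below. The `policy`, `ref_model`, `reward_model`, and `ppo_update` objects are placeholders for whatever components an implementation actually uses, not a particular library's API; note that in practice the KL penalty (step 5) is folded into the reward before advantages are computed.

```python
def rlhf_iteration(policy, ref_model, reward_model, ppo_update, prompts, beta=0.1):
    """One schematic RLHF training iteration (batching, PPO clipping details,
    and the value/critic model are omitted for clarity)."""
    # 1. Sample outputs: generate responses from the current policy
    responses, policy_logprobs = policy.generate(prompts)

    # 2. Compute rewards: score each response with the reward model
    rewards = reward_model.score(prompts, responses)

    # 5. Apply KL penalty: compare per-token log-probs against the frozen
    #    SFT reference model and subtract the penalty from the reward
    ref_logprobs = ref_model.logprobs(prompts, responses)
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    shaped_rewards = rewards - beta * kl

    # 3. Calculate advantages: how much better/worse each sample was than average
    advantages = shaped_rewards - shaped_rewards.mean()

    # 4. Update policy: increase the probability of high-reward sequences
    ppo_update(policy, prompts, responses, policy_logprobs, advantages)
```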
Direct Preference Optimization (DPO)
Eliminates the explicit reward model and PPO steps by training directly on preference pairs (see the sketch after this list)
Constitutional AI
Uses AI feedback to complement human feedback
Rejection Sampling
Simpler alternative to PPO for policy improvement
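For the DPO alternative above, the published loss trains the policy directly on preference pairs, using the frozen reference model's log-probabilities in place of a learned reward. A minimal PyTorch sketch, assuming the summed log-probability of each response has already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each argument is a (batch,) tensor of summed log-probabilities of the
    chosen or rejected response under the policy or the frozen reference model.
    """
    # The implicit "reward" of each response is its log-probability ratio
    # against the reference model, scaled by beta.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Logistic loss that pushes the chosen response's implicit reward
    # above the rejected response's.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```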
After RLHF, the model:
Follows instructions more accurately
Avoids harmful or unethical content
Better matches human communication style