Deep Reinforcement Learning: Lecture 8, Advanced Offline RL & Reward Learning (Structured Notes & In-Depth Analysis)

One. Course Details

This is the eighth lecture of Stanford University's CS224R Deep Reinforcement Learning course, taught by Professor Chelsea Finn. It concludes the offline reinforcement learning module with a deep dive into Conservative Q-Learning (CQL), one of the most widely used offline RL algorithms, then transitions to the critical topic of reward learning: the problem of automatically inferring reward functions from human supervision rather than manually engineering them.

The lecture first explains how CQL solves the offline RL overestimation problem through conservative regularization, then addresses the fundamental challenge of task specification in real-world RL. It covers two primary reward learning paradigms: learning from examples of successful goals and learning from human pairwise preferences. The lecture concludes with a detailed breakdown of how reward learning powers Reinforcement Learning from Human Feedback (RLHF), the technology behind modern large language models like ChatGPT.

Two. Key Learning Objectives

By the end of this lecture, students should be able to:

Explain the core intuition behind Conservative Q-Learning and how it prevents overestimation of out-of-distribution Q-values
Derive the CQL objective function and understand the role of the conservative regularization term
Identify the fundamental limitations of manual reward engineering for complex real-world tasks
Implement reward learning from goal examples and explain how adversarial training prevents classifier exploitation
Derive the maximum likelihood objective for learning reward functions from human pairwise preferences
Describe the full three-stage pipeline for training large language models with RLHF
Compare the strengths and weaknesses of goal-based and preference-based reward learning

Three. Memorable Course Quotes

"Conservative Q-Learning takes the opposite approach to standard RL—instead of being optimistic about unknown actions, it is intentionally pessimistic about everything outside the data distribution."
"The biggest lie in reinforcement learning is that someone will give you a reward function. In the real world, rewards are never handed to you on a silver platter."
"Humans are terrible at assigning absolute scores to behaviors, but we are excellent at comparing two things and saying which one is better."
"If you train a classifier to be your reward function, your RL agent will spend all its time finding ways to fool the classifier instead of solving the actual task."
"Reinforcement Learning from Human Feedback works because it turns the hard problem of writing a reward function into the much easier problem of comparing two things."
"CQL doesn't just give you a lower bound on Q-values—it gives you a lower bound that is tight exactly where you have data and loose where you don't."

Four. Detailed Study Notes

4.1 Conservative Q-Learning (CQL): Preventing Overestimation Through Pessimism

While Implicit Q-Learning (IQL) avoids querying out-of-distribution actions entirely, Conservative Q-Learning takes a different approach: it explicitly penalizes high Q-values for actions outside the data distribution.

The core insight of CQL is that standard Q-learning overestimates Q-values for out-of-distribution actions because:

Neural networks produce arbitrary outputs for inputs they have never seen
The policy update step always selects the action with the highest Q-value, which will almost always be an overestimated out-of-distribution action

CQL solves this by adding a conservative regularization term to the standard Q-learning loss:

L_CQL(φ) = L_standard(φ) + α * ( E_{s ~ D, a ~ μ(a|s)} [ Q_φ(s,a) ] - E_{(s,a) ~ D} [ Q_φ(s,a) ] )

Where:

L_standard(φ) is the standard TD loss for Q-learning
μ(a|s) is a distribution that seeks out actions with high Q-values
α is a hyperparameter that controls the strength of the conservative penalty

This objective does two things:

It minimizes Q-values for actions that have high Q-values but are not in the dataset
It maximizes Q-values for actions that are actually present in the dataset

The result, for a sufficiently large α, is a Q-function whose expected value under the learned policy is a lower bound on that policy's true value. This counteracts the overestimation bias that plagues standard off-policy algorithms in offline settings.
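As a rough illustration of the conservative term, the following Python sketch evaluates it for a single state with a small discrete action set. The function names (`softmax`, `cql_penalty`, `cql_loss`), the example Q-values, and the choice of μ as the softmax of the current Q-values are assumptions made for illustration, not details taken from the lecture.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cql_penalty(q_values, dataset_action, mu):
    """E_{a ~ mu}[Q(s, a)] - Q(s, a_data) for one state.

    q_values:       Q(s, a) for every discrete action a
    dataset_action: index of the action stored in the dataset for s
    mu:             probabilities over actions
    """
    expected_q = sum(p * q for p, q in zip(mu, q_values))
    return expected_q - q_values[dataset_action]


def cql_loss(td_loss, q_values, dataset_action, alpha):
    """Standard TD loss plus alpha times the conservative term.

    Assumption: mu is the softmax of the current Q-values, so it
    concentrates on actions the Q-function currently rates highly.
    """
    mu = softmax(q_values)
    return td_loss + alpha * cql_penalty(q_values, dataset_action, mu)


# Hypothetical Q-values for three actions in one state.
q_inflated = [1.0, 5.0, 2.0]   # action 1 is rated highly but is not the dataset action
q_supported = [5.0, 1.0, 2.0]  # the dataset action (index 0) is rated highest

penalty_inflated = cql_penalty(q_inflated, 0, softmax(q_inflated))
penalty_supported = cql_penalty(q_supported, 0, softmax(q_supported))
```

In this toy case the term is positive when a non-dataset action carries the highest Q-value and negative when the dataset action does, so minimizing it pushes the first kind of Q-value down and the second kind up.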

4.1.1 Practical Implementation of CQL

In practice, we do not need to explicitly represent the distribution μ(a|s). If μ is chosen to maximize the expected Q-value plus an entropy bonus H(μ), the optimal μ(a|s) is proportional to exp(Q_φ(s,a)), and the maximized term has a closed form:

max_μ ( E_{a ~ μ} [ Q_φ(s,a) ] + H(μ) ) = log( Σ_a exp(Q_φ(s,a)) )

For continuous action spaces, the sum becomes an integral over actions, which is estimated via sampling.
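The log-sum-exp closed form can be checked numerically: with μ proportional to exp(Q), the expected Q-value plus the entropy of μ matches log Σ_a exp(Q(s,a)), and any other μ scores lower. The sketch below uses hypothetical Q-values and helper names (`soft_objective`, `cql_term_discrete`) chosen only for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def entropy(probs):
    """Shannon entropy H(mu) in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_objective(q_values, mu):
    """E_{a ~ mu}[Q(s, a)] + H(mu)."""
    return sum(p * q for p, q in zip(mu, q_values)) + entropy(mu)


def cql_term_discrete(q_values, dataset_action):
    """Per-state conservative term: logsumexp_a Q(s, a) - Q(s, a_data)."""
    return logsumexp(q_values) - q_values[dataset_action]


q_values = [1.0, 5.0, 2.0]                           # hypothetical Q(s, a)
best = soft_objective(q_values, softmax(q_values))   # mu proportional to exp(Q)
closed_form = logsumexp(q_values)
uniform = soft_objective(q_values, [1 / 3] * 3)      # some other mu, for contrast
```

Here `best` and `closed_form` agree up to floating-point error, while the uniform distribution gives a strictly smaller value.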

4.1.2 Real-World Application: LinkedIn Notification Optimization

LinkedIn used CQL to optimize their email notification policy, with impressive results:

15% higher click-through rate
10% fewer total notifications sent
5% increase in weekly active users

This demonstrates that offline RL can deliver significant business value by learning from existing historical data without risky online experimentation.

4.2 The Reward Learning Problem

All RL algorithms we have covered so far assume the existence of a predefined reward function. However, in almost all real-world applications, manually engineering a good reward function is extremely difficult or impossible.

Examples of hard-to-specify reward functions include:

A robot pouring water into a cup: Need to reward getting water into the cup, penalize spilling, and account for different cup sizes and positions
An autonomous car: Need to balance safety, comfort, speed, and adherence to traffic laws
A chatbot: Need to reward helpfulness, honesty, harmlessness, and conversational naturalness

Manual reward functions almost always have loopholes that RL agents will exploit to maximize reward without actually solving the intended task. Reward learning solves this by inferring the reward function directly from human supervision.

4.3 Reward Learning from Goal Examples

The simplest approach to reward learning is to train a binary classifier to distinguish between successful and unsuccessful states.

The basic pipeline is:

Collect a dataset of positive examples (states where the task is successfully completed)
Collect a dataset of negative examples (states where the task is not completed)
Train a binary classifier to predict whether a state is a success or failure
Use the classifier's output probability as the reward signal for RL
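The pipeline above can be sketched on a toy problem where each state is a single number. This is a minimal illustration, assuming a one-dimensional logistic-regression classifier trained by plain gradient ascent. The example states, learning rate, step count, and function names are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_success_classifier(positives, negatives, lr=0.1, steps=500):
    """Fit p(success | s) = sigmoid(w * s + b) by gradient ascent
    on the average log-likelihood of the labeled states."""
    data = [(s, 1.0) for s in positives] + [(s, 0.0) for s in negatives]
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [(y - sigmoid(w * s + b), s) for s, y in data]
        grad_w = sum(err * s for err, s in errors) / len(data)
        grad_b = sum(err for err, _ in errors) / len(data)
        w += lr * grad_w
        b += lr * grad_b
    return w, b


def make_reward_fn(w, b):
    """Use the classifier's success probability as the reward."""
    return lambda s: sigmoid(w * s + b)


success_states = [4.5, 5.0, 5.5]   # steps 1-2: hypothetical labeled examples
failure_states = [0.0, 0.5, 1.0]
w, b = train_success_classifier(success_states, failure_states)  # step 3
reward = make_reward_fn(w, b)                                    # step 4
```

Calling `reward(5.0)` gives a value above 0.5 and `reward(0.0)` a value below 0.5. Because this classifier is linear in the state, it also rates states beyond every success example (such as 10.0) at least as highly as the successes themselves, even though it never saw data there.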

4.3.1 The Classifier Exploitation Problem

This approach has a critical flaw: the RL agent will learn to exploit weaknesses in the classifier. The classifier is trained only on the initial dataset of positive and negative examples, but the RL agent can visit states far outside this distribution, where the classifier's predictions are arbitrary.

The solution is adversarial training:

Train the initial classifier on the positive and negative examples
Run RL using the classifier as the reward function
Add all states visited by the RL agent to the negative example set
Retrain the classifier on the expanded dataset
Repeat until convergence

This process is mathematically equivalent to training a Generative Adversarial Network (GAN), where:

The classifier is the discriminator trying to distinguish true success states from states generated by the policy
The RL policy is the generator trying to produce states that the discriminator classifies as successes
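The adversarial loop can be sketched on a one-dimensional toy problem. To keep the example small and easy to analyze, it makes several assumptions that are not from the lecture: the classifier is a class-balanced Gaussian-kernel classifier (so "retraining" just means using the updated example sets), and the RL step is replaced by a greedy agent that picks the candidate state with the highest reward. All names and numbers are hypothetical.

```python
import math

BANDWIDTH = 2.0  # kernel width, an arbitrary choice for this toy example


def kernel_mean(s, examples):
    """Average Gaussian-kernel similarity between s and a set of examples."""
    return sum(
        math.exp(-((s - e) ** 2) / (2 * BANDWIDTH ** 2)) for e in examples
    ) / len(examples)


def success_probability(s, positives, negatives):
    """Class-balanced kernel classifier: each class gets equal total weight."""
    kp = kernel_mean(s, positives)
    kn = kernel_mean(s, negatives)
    return kp / (kp + kn)


def greedy_agent(reward_fn, candidate_states):
    """Stand-in for RL: go to the candidate state with the highest reward."""
    return max(candidate_states, key=reward_fn)


def adversarial_reward_learning(positives, negatives, candidate_states, rounds):
    negatives = list(negatives)  # copy so the caller's list is not modified
    visited = []
    for _ in range(rounds):
        # Classifier defined by the current positive and negative sets.
        def reward_fn(s):
            return success_probability(s, positives, negatives)

        state = greedy_agent(reward_fn, candidate_states)  # run "RL"
        visited.append(state)
        negatives.append(state)  # visited states become negative examples
    return visited, negatives


candidates = [float(i) for i in range(11)]  # states 0.0, 1.0, ..., 10.0
visited, final_negatives = adversarial_reward_learning(
    positives=[5.0], negatives=[0.0], candidate_states=candidates, rounds=3
)
```

In this setup the first classifier, trained on one success at 5.0 and one failure at 0.0, has a success probability that keeps rising with the state, so the agent first goes to the edge state 10.0 rather than the goal. Once 10.0 is added to the negatives, the highest-reward state in the following rounds is the true goal at 5.0.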

4.3.2 Stabilizing Adversarial Reward Learning

Two critical tricks make this approach work in practice:

Data balancing: Always maintain a 50/50 balance between positive and negative examples in the training set. This prevents the classifier from collapsing to predicting "negative" for all states.
Strong regularization: Use heavy regularization on the classifier to prevent overfitting to the small number of initial positive examples.
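The data-balancing trick can be sketched as a batch sampler that always draws half of each batch from the positives and half from the negatives, no matter how large the negative set has grown. The function name and example values are hypothetical; sampling with replacement is an assumption made so that a small positive set can still fill its half of the batch.

```python
import random


def sample_balanced_batch(positives, negatives, batch_size, rng):
    """Return a shuffled list of (state, label) pairs with a 50/50 class split.

    Sampling is with replacement, so a handful of positives can be
    reused many times against a much larger pool of negatives.
    """
    half = batch_size // 2
    batch = [(rng.choice(positives), 1) for _ in range(half)]
    batch += [(rng.choice(negatives), 0) for _ in range(batch_size - half)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)                      # fixed seed for reproducibility
few_positives = [5.0, 5.1]                  # small set of success examples
many_negatives = [float(i) for i in range(100)]
batch = sample_balanced_batch(few_positives, many_negatives, 8, rng)
```

Even with 2 positives against 100 negatives, every batch of 8 contains 4 examples of each label.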

4.4 Reward Learning from Human Preferences

While goal-based reward learning works well for tasks with clear success criteria, it cannot capture more nuanced preferences. For these cases, we can learn reward functions from human pairwise preferences.

The key insight is that humans are much better at comparing two things and saying which one is better than they are at assigning an absolute score to a single thing.

4.4.1 Mathematical Formulation

If a human indicates that trajectory τ_w is better than trajectory τ_l, we assume that:

R_θ(τ_w) > R_θ(τ_l)

We model the probability that the human prefers τ_w over τ_l using the sigmoid function σ:

P(τ_w ≻ τ_l) = σ( R_θ(τ_w) - R_θ(τ_l) )

Where R_θ(τ) = Σ_{t=0}^T r_θ(s_t, a_t) is the cumulative reward of trajectory τ under the learned reward function r_θ(s,a) with parameters θ.

We train the reward function by maximizing the log-likelihood of the human preferences, which is the same as minimizing the negative log-likelihood loss:

L(θ) = - E_{(τ_w, τ_l) ~ D} [ log σ( R_θ(τ_w) - R_θ(τ_l) ) ]
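The preference loss can be written out directly. The sketch below is a minimal illustration, assuming trajectories are lists of (state, action) pairs and using a hypothetical one-parameter reward model r(s, a) = θ * s. The data and function names are made up for the example.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def trajectory_return(trajectory, reward_fn):
    """R(tau): sum of r(s_t, a_t) over the (state, action) pairs of tau."""
    return sum(reward_fn(s, a) for s, a in trajectory)


def preference_loss(preference_pairs, reward_fn):
    """Mean of -log sigmoid(R(tau_w) - R(tau_l)) over (winner, loser) pairs."""
    total = 0.0
    for tau_w, tau_l in preference_pairs:
        margin = trajectory_return(tau_w, reward_fn) - trajectory_return(tau_l, reward_fn)
        total -= log_sigmoid(margin)
    return total / len(preference_pairs)


def make_linear_reward(theta):
    """Hypothetical reward model r(s, a) = theta * s (the action is ignored)."""
    return lambda s, a: theta * s


# Each entry is (preferred trajectory, rejected trajectory).
pairs = [
    ([(1.0, 0), (2.0, 0)], [(0.0, 0), (0.5, 0)]),
    ([(3.0, 0)], [(1.0, 0)]),
]

loss_uninformed = preference_loss(pairs, make_linear_reward(0.0))
loss_agreeing = preference_loss(pairs, make_linear_reward(1.0))
loss_disagreeing = preference_loss(pairs, make_linear_reward(-1.0))
```

With θ = 0 every trajectory has the same return, so the model assigns probability 0.5 to each preference and the loss equals log 2. A reward model that ranks the preferred trajectories higher (θ = 1) gets a lower loss, and one that ranks them lower (θ = -1) gets a higher loss.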

4.4.2 Preference Data Collection

In practice, preference data is collected by:

Sampling k trajectories from the current policy starting from the same initial state
Asking a human to rank these k trajectories from best to worst
Generating k choose 2 pairwise preference labels from the ranking

This approach is much more efficient than asking humans to score individual trajectories: a single ranking of k trajectories yields k(k-1)/2 labeled comparison pairs.
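Turning one ranking into pairwise labels is a short step. The sketch below assumes the ranking is a list ordered from best to worst with no ties. The function name and trajectory labels are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(ranked_trajectories):
    """Turn a best-to-worst ranking into (winner, loser) pairs.

    combinations() keeps the input order, so in every pair the first
    element is ranked above the second.
    """
    return list(combinations(ranked_trajectories, 2))


ranking = ["tau_a", "tau_b", "tau_c", "tau_d"]  # best first
pairs = ranking_to_pairs(ranking)
```

For k = 4 ranked trajectories this produces 4 choose 2 = 6 preference pairs, each of which can be fed to the preference loss as a (τ_w, τ_l) example.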

4.4.3 Reinforcement Learning from Human Feedback (RLHF)

Preference-based reward learning is the foundation of RLHF, the technology used to train modern large language models. The full RLHF pipeline has three stages:

Pre-training: Train a large language model on a massive corpus of text using next-token prediction
Supervised Fine-Tuning (SFT): Fine-tune the model on a dataset of high-quality human demonstrations to teach it to follow instructions
RLHF:
- Collect pairwise preference data on responses from the SFT model
- Train a reward model to predict human preferences
- Fine-tune the SFT model using PPO to maximize the reward from the reward model

4.4.4 Reinforcement Learning from AI Feedback (RLAIF)

An increasingly popular alternative to human feedback is to use a stronger AI model to provide the preference labels. This is much cheaper and faster than collecting human feedback and has been shown to produce comparable results for many tasks.

These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you master the content of this subject thoroughly. Wish you continuous academic progress and great achievements in your studies.

Video Source and Usage Instructions

Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 8: Reward Learning Stanford Online
• Course Series: Stanford CS224R Deep Reinforcement Learning
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/PDIxDhA9Z6Y?si=hmojNf4N47QSoS-p
