One. Course Details
This is the eighth lecture of Stanford University's CS224R Deep Reinforcement Learning course, taught by Professor Chelsea Finn. It concludes the offline reinforcement learning module with a deep dive into Conservative Q-Learning (CQL), one of the most widely used offline RL algorithms, then transitions to the critical topic of reward learning: the problem of automatically inferring reward functions from human supervision rather than manually engineering them.
The lecture first explains how CQL solves the offline RL overestimation problem through conservative regularization, then addresses the fundamental challenge of task specification in real-world RL. It covers two primary reward learning paradigms: learning from examples of successful goals and learning from human pairwise preferences. The lecture concludes with a detailed breakdown of how reward learning powers Reinforcement Learning from Human Feedback (RLHF), the technology behind modern large language models like ChatGPT.
Two. Key Learning Objectives
By the end of this lecture, students should be able to:
- Explain the core intuition behind Conservative Q-Learning and how it prevents overestimation of out-of-distribution Q-values
- Derive the CQL objective function and understand the role of the conservative regularization term
- Identify the fundamental limitations of manual reward engineering for complex real-world tasks
- Implement reward learning from goal examples and explain how adversarial training prevents classifier exploitation
- Derive the maximum likelihood objective for learning reward functions from human pairwise preferences
- Describe the full three-stage pipeline for training large language models with RLHF
- Compare the strengths and weaknesses of goal-based and preference-based reward learning
Three. Memorable Course Quotes
- "Conservative Q-Learning takes the opposite approach to standard RL: instead of being optimistic about unknown actions, it is intentionally pessimistic about everything outside the data distribution."
- "The biggest lie in reinforcement learning is that someone will give you a reward function. In the real world, rewards are never handed to you on a silver platter."
- "Humans are terrible at assigning absolute scores to behaviors, but we are excellent at comparing two things and saying which one is better."
- "If you train a classifier to be your reward function, your RL agent will spend all its time finding ways to fool the classifier instead of solving the actual task."
- "Reinforcement Learning from Human Feedback works because it turns the hard problem of writing a reward function into the much easier problem of comparing two things."
- "CQL doesn't just give you a lower bound on Q-values; it gives you a lower bound that is tight exactly where you have data and loose where you don't."
Four. Detailed Study Notes
4.1 Conservative Q-Learning (CQL): Preventing Overestimation Through Pessimism
While Implicit Q-Learning (IQL) avoids querying out-of-distribution actions entirely, Conservative Q-Learning takes a different approach: it explicitly penalizes high Q-values for actions outside the data distribution.
The core insight of CQL is that standard Q-learning overestimates Q-values for out-of-distribution actions because:
- Neural networks produce arbitrary outputs for inputs they have never seen
- The policy update step selects the action with the highest Q-value, which is often an overestimated out-of-distribution action
CQL counteracts this by adding a conservative regularization term to the standard loss:
L_CQL(φ) = L_standard(φ) + α * ( E_{s ~ D, a ~ μ(a|s)} [ Q_φ(s,a) ] - E_{(s,a) ~ D} [ Q_φ(s,a) ] )
Where:
- L_standard(φ) is the standard TD loss for Q-learning
- μ(a|s) is a distribution that seeks out actions with high Q-values
- α is a hyperparameter that controls the strength of the conservative penalty
Minimizing the regularization term has two effects:
- It pushes down Q-values for actions that have high Q-values but are not in the dataset
- It pushes up Q-values for actions that are actually present in the dataset
4.1.1 Practical Implementation of CQL
In practice, we do not need to explicitly represent the distribution μ(a|s). If μ is chosen to maximize the expected Q-value plus an entropy bonus H(μ), the optimal μ(a|s) is proportional to exp(Q_φ(s,a)), and the maximized term has a closed form:
max_μ ( E_{a ~ μ} [ Q_φ(s,a) ] + H(μ) ) = log( Σ_a exp(Q_φ(s,a)) )
For continuous action spaces, the sum over actions becomes an integral, which is estimated via sampling.
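The closed-form regularizer can be sketched for a small discrete action space as follows. This is a minimal NumPy illustration, not a reference implementation: the function names `logsumexp` and `cql_loss`, the mean-squared TD error, and the precomputed `td_targets` are all assumptions made for the example.

```python
import numpy as np

def logsumexp(q, axis=-1):
    """Numerically stable log(sum(exp(q))) along `axis`."""
    q_max = np.max(q, axis=axis, keepdims=True)
    return np.squeeze(q_max, axis=axis) + np.log(np.sum(np.exp(q - q_max), axis=axis))

def cql_loss(q_values, actions, td_targets, alpha=1.0):
    """CQL loss for a batch with discrete actions (illustrative sketch).

    q_values:   (batch, n_actions) array of Q_phi(s, .) for dataset states
    actions:    (batch,) integer actions taken in the dataset
    td_targets: (batch,) precomputed TD targets (assumed given)
    alpha:      weight of the conservative penalty
    """
    batch_idx = np.arange(len(actions))
    q_data = q_values[batch_idx, actions]          # Q(s, a) for dataset actions
    td_loss = np.mean((q_data - td_targets) ** 2)  # assumed squared TD error
    # Conservative term: push down the log-sum-exp over all actions,
    # push up the Q-values of actions that appear in the dataset.
    penalty = np.mean(logsumexp(q_values, axis=1) - q_data)
    return td_loss + alpha * penalty, td_loss, penalty

# Toy batch: in the first state, action 1 has a high Q-value but is not the dataset action.
q = np.array([[1.0, 5.0, 0.0],
              [2.0, 2.0, 2.0]])
a = np.array([0, 1])
targets = np.array([1.0, 2.0])
total, td, penalty = cql_loss(q, a, targets, alpha=0.5)
```

Because log-sum-exp is never smaller than any single Q-value, the penalty is non-negative, and it is largest in states where some non-dataset action has a much higher Q-value than the dataset action.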
4.1.2 Real-World Application: LinkedIn Notification Optimization
LinkedIn used CQL to optimize its email notification policy, with impressive results:
- 15% higher click-through rate
- 10% fewer total notifications sent
- 5% increase in weekly active users
4.2 The Reward Learning Problem
All RL algorithms we have covered so far assume the existence of a predefined reward function. However, in almost all real-world applications, manually engineering a good reward function is extremely difficult or impossible.
Examples of hard-to-specify reward functions include:
- A robot pouring water into a cup: Need to reward getting water into the cup, penalize spilling, and account for different cup sizes and positions
- An autonomous car: Need to balance safety, comfort, speed, and adherence to traffic laws
- A chatbot: Need to reward helpfulness, honesty, harmlessness, and conversational naturalness
4.3 Reward Learning from Goal Examples
The simplest approach to reward learning is to train a binary classifier to distinguish between successful and unsuccessful states.
The basic pipeline is:
- Collect a dataset of positive examples (states where the task is successfully completed)
- Collect a dataset of negative examples (states where the task is not completed)
- Train a binary classifier to predict whether a state is a success or failure
- Use the classifier's output probability as the reward signal for RL
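The pipeline above can be sketched with a tiny logistic-regression classifier. The notes do not say which classifier the lecture uses, so the linear model, the toy 2-D states, and the names `train_success_classifier` and `reward` are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_success_classifier(positives, negatives, lr=0.5, steps=2000):
    """Fit logistic regression p(success | state) by gradient descent."""
    x = np.vstack([positives, negatives])
    y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(x @ w + b)
        grad = p - y                      # gradient of cross-entropy w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * np.mean(grad)
    return w, b

def reward(state, w, b):
    """Reward = classifier's predicted probability of success."""
    return float(sigmoid(np.dot(state, w) + b))

# Toy 2-D states (assumed): successes cluster near (1, 1), failures near (0, 0).
rng = np.random.default_rng(0)
positives = rng.normal(loc=1.0, scale=0.1, size=(50, 2))
negatives = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
w, b = train_success_classifier(positives, negatives)

r_goal = reward(np.array([1.0, 1.0]), w, b)    # high: looks like a success state
r_start = reward(np.array([0.0, 0.0]), w, b)   # low: looks like a failure state
```

An RL agent would then call `reward(state, w, b)` in place of a hand-written reward function.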
4.3.1 The Classifier Exploitation Problem
This approach has a critical flaw: the RL agent will learn to exploit weaknesses in the classifier. The classifier is only trained on the initial dataset of positive and negative examples, but the RL agent can visit states that are far outside this distribution where the classifier's predictions are arbitrary.
The solution is adversarial training:
- Train the initial classifier on the positive and negative examples
- Run RL using the classifier as the reward function
- Add all states visited by the RL agent to the negative example set
- Retrain the classifier on the expanded dataset
- Repeat until convergence
This procedure mirrors a generative adversarial network (GAN):
- The classifier is the discriminator trying to distinguish true success states from states generated by the policy
- The RL policy is the generator trying to produce states that the discriminator classifies as successes
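The adversarial loop can be sketched as a short skeleton. Everything here is hypothetical scaffolding: `fit_classifier` and `run_rl` stand in for a real classifier trainer and RL algorithm, a fixed `n_rounds` replaces the "repeat until convergence" test, and the 1-D threshold classifier exists only to make the example run.

```python
def adversarial_reward_learning(positives, negatives, fit_classifier, run_rl, n_rounds=3):
    """Alternate between classifier training and RL, growing the negative set.

    fit_classifier(positives, negatives) -> reward_fn   (hypothetical callable)
    run_rl(reward_fn) -> list of visited states         (hypothetical callable)
    """
    negatives = list(negatives)  # copy so the caller's list is not mutated
    reward_fn = fit_classifier(positives, negatives)      # train initial classifier
    for _ in range(n_rounds):
        visited = run_rl(reward_fn)                       # run RL on the current reward
        negatives.extend(visited)                         # visited states become negatives
        reward_fn = fit_classifier(positives, negatives)  # retrain on the expanded set
    return reward_fn, negatives

# Toy stand-ins on 1-D states, for illustration only.
def fit_threshold(positives, negatives):
    """Classify a state as success if it lies above the midpoint of the class means."""
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda s: 1.0 if s > threshold else 0.0

def greedy_visits(reward_fn, candidates=(0.2, 0.6, 0.9)):
    """A fake 'policy' that visits every candidate state the classifier rewards."""
    return [s for s in candidates if reward_fn(s) > 0.5]

initial_negatives = [0.0]
reward_fn, final_negatives = adversarial_reward_learning(
    [1.0], initial_negatives, fit_threshold, greedy_visits, n_rounds=3
)
```

In this toy run the state 0.6 is rewarded by the first classifier, gets added to the negative set, and is no longer rewarded after retraining, which is the exploitation-closing behavior the loop is meant to produce.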
4.3.2 Stabilizing Adversarial Reward Learning
Two critical tricks make this approach work in practice:
- Data balancing: Always maintain a 50/50 balance between positive and negative examples in the training set. This prevents the classifier from collapsing to predicting "negative" for all states.
- Strong regularization: Use heavy regularization on the classifier to prevent overfitting to the small number of initial positive examples.
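One simple way to realize the data-balancing trick is to build every training batch from equal halves of positives and negatives. The sketch below assumes states are stored as NumPy arrays; the function name `balanced_batch` and the choice to sample with replacement are illustrative assumptions, not details from the lecture.

```python
import numpy as np

def balanced_batch(positives, negatives, batch_size, rng):
    """Sample a batch with equal numbers of positive and negative examples."""
    half = batch_size // 2
    # Sample with replacement so a small positive set can still fill half the batch.
    pos_idx = rng.integers(0, len(positives), size=half)
    neg_idx = rng.integers(0, len(negatives), size=half)
    x = np.concatenate([positives[pos_idx], negatives[neg_idx]])
    y = np.concatenate([np.ones(half), np.zeros(half)])
    return x, y

rng = np.random.default_rng(0)
positives = np.ones((5, 3))       # few goal examples
negatives = np.zeros((1000, 3))   # many policy-visited states
x, y = balanced_batch(positives, negatives, batch_size=32, rng=rng)
```

Even though negatives outnumber positives 200 to 1 in this toy dataset, each batch the classifier sees is split evenly, so predicting "negative" everywhere is no longer a low-loss solution.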
4.4 Reward Learning from Human Preferences
While goal-based reward learning works well for tasks with clear success criteria, it cannot capture more nuanced preferences. For these cases, we can learn reward functions from human pairwise preferences.
The key insight is that humans are much better at comparing two things and saying which one is better than they are at assigning an absolute score to a single thing.
4.4.1 Mathematical Formulation
If a human indicates that trajectory τ_w is better than trajectory τ_l, we assume that:
R_θ(τ_w) > R_θ(τ_l)
We model the probability that the human prefers τ_w over τ_l using the sigmoid function σ (the Bradley-Terry model):
P(τ_w ≻ τ_l) = σ( R_θ(τ_w) - R_θ(τ_l) )
Where R_θ(τ) = Σ_{t=0}^T r_θ(s_t, a_t) is the cumulative reward of trajectory τ under the learned reward function r_θ(s,a) with parameters θ.
We train the reward function by maximizing the log-likelihood of the human preferences, which is equivalent to minimizing the loss:
L(θ) = - E_{(τ_w, τ_l) ~ D} [ log σ( R_θ(τ_w) - R_θ(τ_l) ) ]
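The preference loss can be checked numerically with a short NumPy sketch. The function names and the toy per-step rewards are assumptions for illustration; in a real system the returns would come from a learned reward network.

```python
import numpy as np

def trajectory_return(rewards):
    """R(tau): sum of per-step rewards along one trajectory."""
    return float(np.sum(rewards))

def preference_loss(returns_w, returns_l):
    """Negative log-likelihood: -mean(log sigmoid(R(tau_w) - R(tau_l)))."""
    diff = np.asarray(returns_w) - np.asarray(returns_l)
    # -log(sigmoid(d)) = log(1 + exp(-d)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -diff)))

r_w = trajectory_return([0.5, 1.0, 1.5])   # preferred trajectory
r_l = trajectory_return([0.2, 0.3, 0.5])   # rejected trajectory

loss_correct = preference_loss([r_w], [r_l])  # reward model agrees with the human
loss_flipped = preference_loss([r_l], [r_w])  # reward model disagrees
loss_tie = preference_loss([1.0], [1.0])      # reward model is indifferent
```

The loss is small when the reward model ranks the preferred trajectory higher, equals log 2 when the two returns tie, and grows roughly linearly with the margin when the ranking is reversed.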
4.4.2 Preference Data Collection
In practice, preference data is collected by:
- Sampling k trajectories from the current policy starting from the same initial state
- Asking a human to rank these k trajectories from best to worst
- Generating k choose 2 pairwise preference labels from the ranking
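Expanding a ranking into pairwise labels is a one-liner with the standard library. The trajectory identifiers below are placeholders, and `ranking_to_pairs` is a hypothetical helper name.

```python
from itertools import combinations

def ranking_to_pairs(ranked):
    """Turn a best-to-worst ranking into (winner, loser) pairs."""
    # combinations() preserves input order, so the first element
    # of every pair is ranked higher than the second.
    return list(combinations(ranked, 2))

# A human ranked k = 4 trajectories from best to worst.
pairs = ranking_to_pairs(["traj_a", "traj_b", "traj_c", "traj_d"])
```

With k = 4 this yields 4 choose 2 = 6 pairs, each of which can serve as one (τ_w, τ_l) training example for the preference loss.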
4.4.3 Reinforcement Learning from Human Feedback (RLHF)
Preference-based reward learning is the foundation of RLHF, the technology used to train modern large language models. The full RLHF pipeline has three stages:
- Pre-training: Train a large language model on a massive corpus of text using next-token prediction
- Supervised Fine-Tuning (SFT): Fine-tune the model on a dataset of high-quality human demonstrations to teach it to follow instructions
- RLHF:
  - Collect pairwise preference data on responses from the SFT model
  - Train a reward model to predict human preferences
  - Fine-tune the SFT model using PPO to maximize the reward from the reward model
4.4.4 Reinforcement Learning from AI Feedback (RLAIF)
An increasingly popular alternative to human feedback is to use a stronger AI model to provide the preference labels. This is much cheaper and faster than collecting human feedback and has been reported to produce comparable results for many tasks.
These are my structured study notes and in-depth interpretations, compiled while watching the open course. I hope this knowledge framework helps you master the content of this subject thoroughly. I wish you continued academic progress and great achievements in your studies.


