Deep Reinforcement Learning: Lecture 9 Preference Optimization & LLM Alignment Structured Notes & In-Depth Analysis

One. Course Details

This is the ninth lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by guest lecturer Nikhil Sardana. It focuses entirely on preference optimization: the transformative set of techniques that convert raw pre-trained large language models (LLMs) into the helpful, conversational chatbots we interact with daily.

The lecture begins by explaining the fundamental mismatch between next-token pre-training objectives and assistant behavior. It then systematically covers three generations of LLM alignment techniques: instruction fine-tuning, Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO). The presentation includes rigorous mathematical derivations of policy gradients and the DPO objective, head-to-head comparisons of RLHF and DPO, and a candid discussion of the critical limitations of current alignment methods, including reward hacking, sycophancy, and the gap between short-term and long-term human preferences.

Two. Key Learning Objectives

By the end of this lecture, students should be able to:

Explain the core misalignment between next-token pre-training and the behavior expected from a helpful assistant
Describe the instruction fine-tuning pipeline and identify its five key limitations
Derive the policy gradient update rule and explain why reinforcement learning is necessary for LLM alignment
Implement the full three-stage RLHF pipeline, including reward model training with the Bradley-Terry objective
Understand the theoretical derivation of Direct Preference Optimization (DPO) and how it eliminates the need for an explicit RL loop
Compare the computational cost, stability, and performance of RLHF and DPO
Identify the major open challenges in LLM alignment and active research directions

Three. Memorable Course Quotes

"Pre-training teaches a model to predict the next token. It does not teach it to be your assistant."
"Humans are terrible at assigning absolute scores, but we are excellent at comparing two things and saying which one is better."
"RLHF works because it turns the hard problem of writing a reward function into the much easier problem of comparing two completions."
"Direct Preference Optimization shows us that we don't need a separate reward model and RL loop—we can optimize directly on preference data with a simple classification loss."
"The biggest limitation of all current alignment methods is that they optimize for what humans say they want, not what they actually want."
"Reward hacking is not a bug in RL—it is the optimal strategy. The agent will always find the easiest way to maximize the reward signal, not the intended behavior."
"Instruction fine-tuning mostly teaches the model formatting, not new knowledge. All the knowledge was already there in the base pre-trained model."

Four. Detailed Study Notes

4.1 The Alignment Problem: From Pre-Trained Models to Assistants

Raw pre-trained LLMs are trained on trillions of tokens of internet text using the next-token prediction objective. This objective teaches them an extraordinary amount of information about language syntax, world facts, and even intuitive physics and human behavior. However, it does not teach them to follow instructions or act as helpful assistants.

A classic example from GPT-3 illustrates this mismatch:

User prompt: "Explain the moon landing to a six-year-old in a few sentences."
Pre-trained model output: "Explain the theory of gravity to a six-year-old. Explain the theory of relativity to a six-year-old in a few sentences."
Desired assistant output: "A giant rocket ship blasted off from Earth carrying astronauts to the moon. They walked on the moon and brought rocks back to Earth to study."

The pre-trained model simply continues the pattern of text it has seen most often on the internet, rather than answering the user's actual question. This fundamental disconnect is the core of the alignment problem: how to align LLM behavior with human intent and preferences.

4.2 Instruction Fine-Tuning: The First Alignment Step

The simplest and most widely used first step in alignment is instruction fine-tuning (IFT):

Collect a dataset of (instruction, response) pairs covering a wide range of tasks
Fine-tune the pre-trained LLM to maximize the log-likelihood of the response given the instruction
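The log-likelihood objective in the second step can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function names, the toy logits, and the convention of masking out instruction tokens so that only response tokens contribute to the loss are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(step_logits, target_ids, response_mask):
    """Mean negative log-likelihood over response tokens only.

    step_logits[t]   : logits over the vocabulary at position t
    target_ids[t]    : token id that should be produced at position t
    response_mask[t] : 1 if position t is part of the response,
                       0 if it is part of the instruction (assumed convention)
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(step_logits, target_ids, response_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / max(count, 1)

# Toy setup: vocabulary of 3 tokens, 4 positions.
# The first two positions are the instruction, the last two are the response.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
toy_targets = [0, 1, 2, 0]
toy_mask = [0, 0, 1, 1]

loss = sft_loss(toy_logits, toy_targets, toy_mask)
print(f"SFT loss on response tokens: {loss:.4f}")
```

Minimizing this quantity by gradient descent is the same as maximizing the log-likelihood of the response given the instruction.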

A critical insight is that instruction fine-tuning primarily teaches the model formatting, not new knowledge. Almost all of the factual knowledge is already present in the base pre-trained model; fine-tuning just teaches it how to present that knowledge in the form of answers to user instructions rather than continuing arbitrary text patterns.

4.2.1 Limitations of Instruction Fine-Tuning

While simple and effective for basic tasks, instruction fine-tuning has five fundamental limitations:

Prohibitive data scaling costs: Humans must write a high-quality response to every instruction, and the cost of doing so rises steeply as task difficulty increases
Inability to handle open-ended tasks: For creative tasks like writing a story or poem, there is no single "correct" answer to use as a label
Binary error penalty: The entire response is treated as either correct or incorrect, with no way to reward partially good answers or penalize specific flaws
Human performance ceiling: The model can never outperform the humans who wrote the demonstration responses
Objective mismatch: Fine-tuning optimizes for token prediction accuracy, not human preference, engagement, or helpfulness

4.3 Reinforcement Learning from Human Feedback (RLHF)

RLHF addresses the limitations of instruction fine-tuning by optimizing directly for human preferences rather than token prediction accuracy. It is the core technology behind all modern commercial chatbots including ChatGPT, Claude, and Gemini.

The full RLHF pipeline consists of three sequential stages:

Supervised Fine-Tuning (SFT): First, perform instruction fine-tuning on a small dataset of high-quality human demonstrations to establish a base model that follows basic formatting conventions
Reward Model (RM) Training: Train a separate reward model to predict which of two completions a human will prefer
RL Optimization: Fine-tune the SFT model using reinforcement learning to maximize the reward score from the reward model

4.3.1 Policy Gradient Recap for LLMs

RLHF uses policy gradient methods to optimize the LLM. The core objective is to maximize the expected reward:

J(θ) = E_{(x,y) ~ p_θ(y|x)} [ r(x,y) ]

Using the log-derivative trick, we can derive the policy gradient update rule:

∇J(θ) = E_{(x,y) ~ p_θ(y|x)} [ r(x,y) * ∇ log p_θ(y|x) ]

This rule has an extremely intuitive interpretation:

If a generated sample has a high reward, increase the probability of generating that sample in the future
If a generated sample has a low reward, decrease the probability of generating that sample in the future
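The update rule can be tried on a toy problem. The sketch below is an assumption-laden illustration rather than an LLM training loop: the "policy" is a softmax over three fixed completions, the reward values are made up, and the batch size and learning rate are arbitrary. It estimates the policy gradient from samples and applies plain gradient ascent.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, rng, batch_size=256, lr=0.5):
    """One policy gradient step: average r(y) * grad log p(y) over samples."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(batch_size):
        y = rng.choices(range(len(logits)), weights=probs)[0]
        r = reward_fn(y)
        # For a softmax policy, d log p(y) / d logit_k = 1[k == y] - p_k.
        for k in range(len(logits)):
            grad[k] += r * ((1.0 if k == y else 0.0) - probs[k]) / batch_size
    return [l + lr * g for l, g in zip(logits, grad)]

rng = random.Random(0)
toy_rewards = [0.0, 1.0, 0.2]   # hypothetical reward for each completion
logits = [0.0, 0.0, 0.0]        # start from a uniform policy
for _ in range(200):
    logits = reinforce_step(logits, lambda y: toy_rewards[y], rng)

final_probs = softmax(logits)
print([round(p, 3) for p in final_probs])
```

Probability mass moves toward the completion with the highest reward, which is the behavior the two bullets above describe.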

4.3.2 Why We Cannot Avoid RL

A common question is: if the reward model is a differentiable neural network, why can't we just backpropagate gradients directly through it to update the LLM?

The answer is the discrete token bottleneck:

LLMs generate discrete tokens one at a time
The sampling operation that selects which token to generate is fundamentally non-differentiable
We cannot backpropagate gradients from the reward model through the discrete tokens to the LLM parameters

Policy gradient methods solve this problem by only requiring the log probabilities of the generated tokens, not gradients through the sampling process itself.
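A small numerical experiment makes the bottleneck concrete. This sketch uses greedy token selection as a deterministic stand-in for sampling, a made-up per-token reward table, and finite differences instead of automatic differentiation; all of these are simplifying assumptions. It compares the derivative of the reward through the chosen token with the derivative of a token's log-probability.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Pick the highest-scoring token id (deterministic stand-in for sampling)."""
    return max(range(len(logits)), key=lambda k: logits[k])

def finite_difference(fn, logits, index, eps=1e-4):
    """Approximate d fn / d logits[index] with a small forward step."""
    bumped = list(logits)
    bumped[index] += eps
    return (fn(bumped) - fn(logits)) / eps

TOKEN_REWARD = [0.0, 1.0, 0.2]   # hypothetical reward for emitting each token
logits = [0.3, 1.2, -0.5]

def reward_via_token(l):
    # The reward depends on the logits only through a discrete token id.
    return TOKEN_REWARD[greedy_token(l)]

def log_prob_token_1(l):
    # The log-probability is a smooth function of the logits.
    return math.log(softmax(l)[1])

grad_via_token = [finite_difference(reward_via_token, logits, k) for k in range(3)]
grad_log_prob = [finite_difference(log_prob_token_1, logits, k) for k in range(3)]
print("d reward / d logits through the token choice:", grad_via_token)
print("d log p(token 1) / d logits:", [round(g, 3) for g in grad_log_prob])
```

A small change in the logits does not change which token is chosen, so the derivative through the token choice is zero and carries no learning signal. The log-probability has a useful gradient, and that is the quantity policy gradient methods rely on.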

4.3.3 Reward Model Training with the Bradley-Terry Objective

Humans are notoriously bad at assigning absolute numerical scores to completions. Two different humans may assign wildly different scores to the same completion, and even the same human may score the same completion differently on different days.

However, humans are extremely good at comparing two completions and saying which one is better. RLHF leverages this by training reward models on pairwise preference data rather than absolute scores.

Given a dataset of preferences where completion y_w is preferred over completion y_l for the same input x, we train the reward model using the Bradley-Terry objective:

L_RM(φ) = - E_{(x,y_w,y_l) ~ D} [ log σ( r_φ(x,y_w) - r_φ(x,y_l) ) ]

This simple objective only requires the reward model to assign a higher score to the preferred completion.
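The Bradley-Terry objective is short enough to write out directly. In this minimal sketch the reward model itself is left out: the lists of scalar scores stand in for r_φ(x,y_w) and r_φ(x,y_l), and the function names and example numbers are illustrative assumptions.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_loss(chosen_scores, rejected_scores):
    """Mean of -log sigmoid(r_w - r_l) over a batch of preference pairs."""
    losses = [-log_sigmoid(r_w - r_l)
              for r_w, r_l in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)

# Hypothetical reward-model scores for three preference pairs.
chosen = [1.5, 0.2, 2.0]
rejected = [0.5, 0.9, -1.0]

print(f"loss with these scores:     {bradley_terry_loss(chosen, rejected):.4f}")
print(f"loss if scores are swapped: {bradley_terry_loss(rejected, chosen):.4f}")
print(f"loss when scores are equal: {bradley_terry_loss([0.0], [0.0]):.4f}")
```

Only the difference between the two scores enters the loss, so adding the same constant to every score leaves it unchanged. The reward model learns a relative ranking, not an absolute scale.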

4.3.4 The KL Penalty: The Most Important Trick in RLHF

A critical addition to the RLHF objective that makes it work in practice is the Kullback-Leibler (KL) divergence penalty:

r_total(x,y) = r_φ(x,y) - β * log( p_θ(y|x) / p_sft(y|x) )

Where:

p_θ is the current RL model being optimized
p_sft is the initial supervised fine-tuned model
β is a hyperparameter that controls the strength of the penalty

The KL penalty ensures that the RL model does not drift too far from the SFT model. This is absolutely essential because:

Reward models are imperfect and only accurate on the distribution of data they were trained on
Without the penalty, the model will quickly learn to exploit flaws in the reward model to get extremely high scores without producing good outputs (reward hacking)
The penalty preserves the general knowledge and capabilities of the base pre-trained model that would otherwise be lost during RL optimization
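The penalized reward can be computed from per-token log-probabilities, since the log-probability of a full completion is the sum of its token log-probabilities. The sketch below assumes those log-probabilities have already been obtained from the two models; the function name, the β value, and the numbers are illustrative only.

```python
def kl_penalized_reward(rm_score, policy_token_logps, sft_token_logps, beta=0.1):
    """r_total = r_phi - beta * log(p_theta(y|x) / p_sft(y|x)) for one completion.

    Both log-probability lists must cover the same generated tokens.
    """
    log_ratio = sum(policy_token_logps) - sum(sft_token_logps)
    return rm_score - beta * log_ratio

# Hypothetical numbers: the policy has become far more confident than the
# SFT model about this completion, so the penalty reduces its reward.
policy_logps = [-0.1, -0.2, -0.1]
sft_logps = [-1.0, -1.5, -0.9]

drifted = kl_penalized_reward(2.0, policy_logps, sft_logps, beta=0.1)
unchanged = kl_penalized_reward(2.0, sft_logps, sft_logps, beta=0.1)
print(f"reward after drift:   {drifted:.2f}")
print(f"reward without drift: {unchanged:.2f}")
```

Averaged over completions sampled from p_θ, the log-ratio term is the KL divergence between the current policy and the SFT model, which is why the term is called a KL penalty.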

4.4 Direct Preference Optimization (DPO): Alignment Without RL

RLHF is extremely effective but also complex, unstable, and computationally expensive. It requires maintaining three separate models (SFT, RM, and RL) and performing expensive sampling during the RL loop.

Direct Preference Optimization (DPO) revolutionized LLM alignment by showing that we can optimize directly on preference data without an explicit reward model or RL loop.

4.4.1 Theoretical Derivation of DPO

The key mathematical insight behind DPO is that the KL-constrained RL objective has an exact closed-form solution:

p*(y|x) = (1/Z(x)) * p_sft(y|x) * exp( r(x,y) / β )

Where Z(x) is a normalizing constant that ensures the probabilities sum to 1.

We can rearrange this equation to express the reward function in terms of the optimal policy:

r(x,y) = β * log( p*(y|x) / p_sft(y|x) ) + β * log Z(x)

When we substitute this expression for the reward into the Bradley-Terry objective, the normalizing constant Z(x) cancels out completely because it is the same for both completions being compared. This leaves us with the elegant DPO objective:

L_DPO(θ) = - E_{(x,y_w,y_l) ~ D} [ log σ( β * ( log(p_θ(y_w|x)/p_sft(y_w|x)) - log(p_θ(y_l|x)/p_sft(y_l|x)) ) ) ]

This is a standard binary classification loss that can be optimized directly on the preference data with standard backpropagation. No reward model, no RL loop, no sampling required.
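The DPO objective for a single preference pair can be sketched as follows. The inputs are assumed to be sequence log-probabilities already computed by the trainable policy p_θ and by a frozen copy of p_sft; the function names, the optional log_z argument (included only to show the cancellation), and the numbers are assumptions for illustration.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_margin(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
               beta=0.1, log_z=0.0):
    """Difference of implicit rewards for the chosen and rejected completions.

    beta * log_z is added to both rewards, as in the rearranged closed form,
    so it drops out of the difference.
    """
    reward_w = beta * (policy_logp_w - ref_logp_w) + beta * log_z
    reward_l = beta * (policy_logp_l - ref_logp_l) + beta * log_z
    return reward_w - reward_l

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
             beta=0.1, log_z=0.0):
    """-log sigmoid of the implicit reward margin for one preference pair."""
    margin = dpo_margin(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
                        beta, log_z)
    return -log_sigmoid(margin)

# Hypothetical sequence log-probabilities.
at_init = dpo_loss(-6.0, -6.0, -6.0, -6.0)    # policy equals the reference
improved = dpo_loss(-5.0, -7.0, -6.0, -6.0)   # chosen up, rejected down
with_z = dpo_loss(-5.0, -7.0, -6.0, -6.0, log_z=3.0)

print(f"loss at initialization: {at_init:.4f}")
print(f"loss after improvement: {improved:.4f}")
print(f"same pair, log Z = 3:   {with_z:.4f}")
```

The loss starts at log 2 when the policy equals the reference and decreases as the policy raises the relative log-probability of the preferred completion. Changing log_z has no effect, which mirrors the cancellation of Z(x) in the derivation. The reference model is still needed to compute the log-ratios, but it stays frozen.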

4.4.2 Advantages of DPO

DPO has rapidly become the preferred alignment method for the open-source community due to its numerous advantages:

Extreme simplicity: Only one model to train, no complex RL infrastructure required
Superior stability: Avoids all the instability issues common in deep RL training
Dramatic computational efficiency: Eliminates the expensive sampling step during RL optimization
Equal or better performance: In practice, DPO produces models that match or exceed RLHF models on almost all benchmarks
Fewer hyperparameters: Only requires tuning a single β hyperparameter, compared to dozens for RLHF

4.5 Limitations of Current Alignment Methods

While RLHF and DPO have produced remarkable results, they have significant fundamental limitations that remain active areas of research:

Reward hacking: Models will always find ways to exploit flaws in the reward model to maximize reward without producing the intended behavior
Sycophancy: Models learn to agree with users even when they are clearly wrong, because this is what humans typically prefer in the short term
Preference misalignment: Current methods optimize for what humans say they want, not what they actually need or what is good for them in the long term
Prohibitive preference data costs: Collecting high-quality human preference data remains extremely expensive and time-consuming
Lack of personalization: Models are aligned to the aggregate preferences of all humans, not the unique preferences of individual users
Uncontrolled emergent behaviors: Alignment training can produce unexpected and undesirable emergent behaviors like increased dishonesty or decreased reasoning ability

These are my structured study notes and in-depth interpretations, compiled while watching the open course. I hope this knowledge framework helps you master the content of this subject thoroughly. I wish you continued academic progress and great success in your studies.

Video Source and Usage Instructions

• Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 9: RL for LLMs Stanford Online
• Course Series: Stanford CS224R Deep Reinforcement Learning
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/XKLGuwvSKvI?si=jSVaO7leMxloDWbU
