One. Course Details
This is the ninth lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by guest lecturer Nikhil Sardana. It focuses entirely on preference optimization: the set of techniques that convert raw pre-trained large language models (LLMs) into the helpful, conversational chatbots we interact with daily.
The lecture begins by explaining the fundamental mismatch between next-token pre-training objectives and assistant behavior. It then systematically covers three generations of LLM alignment techniques: instruction fine-tuning, Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO). The presentation includes mathematical derivations of policy gradients and the DPO objective, head-to-head comparisons of RLHF and DPO, and a candid discussion of the limitations of current alignment methods, including reward hacking, sycophancy, and the gap between short-term and long-term human preferences.
Two. Key Learning Objectives
By the end of this lecture, students should be able to:
- Explain the core misalignment between next-token pre-training and the behavior expected from a helpful assistant
- Describe the instruction fine-tuning pipeline and identify its five key limitations
- Derive the policy gradient update rule and explain why reinforcement learning is necessary for LLM alignment
- Implement the full three-stage RLHF pipeline, including reward model training with the Bradley-Terry objective
- Understand the theoretical derivation of Direct Preference Optimization (DPO) and how it eliminates the need for an explicit RL loop
- Compare the computational cost, stability, and performance of RLHF and DPO
- Identify the major open challenges in LLM alignment and active research directions
Three. Memorable Course Quotes
- "Pre-training teaches a model to predict the next token. It does not teach it to be your assistant."
- "Humans are terrible at assigning absolute scores, but we are excellent at comparing two things and saying which one is better."
- "RLHF works because it turns the hard problem of writing a reward function into the much easier problem of comparing two completions."
- "Direct Preference Optimization shows us that we don't need a separate reward model and RL loop; we can optimize directly on preference data with a simple classification loss."
- "The biggest limitation of all current alignment methods is that they optimize for what humans say they want, not what they actually want."
- "Reward hacking is not a bug in RL; it is the optimal strategy. The agent will always find the easiest way to maximize the reward signal, not the intended behavior."
- "Instruction fine-tuning mostly teaches the model formatting, not new knowledge. All the knowledge was already there in the base pre-trained model."
Four. Detailed Study Notes
4.1 The Alignment Problem: From Pre-Trained Models to Assistants
Raw pre-trained LLMs are trained on trillions of tokens of internet text using the next-token prediction objective. This objective teaches them a great deal about language syntax, world facts, and even intuitive physics and human behavior. However, it does not teach them to follow instructions or act as helpful assistants.
A classic example from GPT-3 illustrates this mismatch:
- User prompt: "Explain the moon landing to a six-year-old in a few sentences."
- Pre-trained model output: "Explain the theory of gravity to a six-year-old. Explain the theory of relativity to a six-year-old in a few sentences."
- Desired assistant output: "A giant rocket ship blasted off from Earth carrying astronauts to the moon. They walked on the moon and brought rocks back to Earth to study."
The pre-trained model treats the prompt as the start of a document to continue, here a list of similar requests, rather than as a request to answer.
4.2 Instruction Fine-Tuning: The First Alignment Step
The simplest and most widely used first step in alignment is instruction fine-tuning (IFT):
- Collect a dataset of (instruction, response) pairs covering a wide range of tasks
- Fine-tune the pre-trained LLM to maximize the log-likelihood of the response given the instruction
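
The fine-tuning objective above can be sketched in a few lines. This is a minimal illustration rather than a training loop: it assumes the model's per-token log-probabilities are already available as a plain list, and the function name `sft_loss` and the mask convention (1 for response tokens, 0 for instruction tokens) are hypothetical choices made for this example.

```python
def sft_loss(token_logprobs, response_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: log p(token_t | previous tokens) for every position
    response_mask:  1 where the token belongs to the response, 0 where it
                    belongs to the instruction (assumed convention)
    """
    if len(token_logprobs) != len(response_mask):
        raise ValueError("token_logprobs and response_mask must align")
    n_response = sum(response_mask)
    if n_response == 0:
        raise ValueError("at least one response token is required")
    total = sum(lp * m for lp, m in zip(token_logprobs, response_mask))
    return -total / n_response


# Toy sequence: 2 instruction tokens followed by 3 response tokens.
logprobs = [-2.0, -1.5, -0.5, -0.25, -0.75]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(logprobs, mask)  # only the last three tokens contribute
```

Masking the instruction tokens is one common way to realize "log-likelihood of the response given the instruction": the instruction is conditioned on but not itself predicted. Some setups train on all tokens instead, so treat the mask as an assumption of this sketch.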
4.2.1 Limitations of Instruction Fine-Tuning
While simple and effective for basic tasks, instruction fine-tuning has five fundamental limitations:
- Prohibitive data scaling costs: Humans must write high-quality responses to every instruction, which becomes far more expensive as task difficulty increases
- Inability to handle open-ended tasks: For creative tasks like writing a story or poem, there is no single "correct" answer to use as a label
- Uniform error penalty: The token-level loss penalizes every deviation from the reference response in the same way, so a harmless rewording and a serious factual error are treated alike, with no way to reward partially good answers or penalize specific flaws
- Human performance ceiling: The model is trained to imitate the demonstration responses, so its quality is largely capped by the humans who wrote them
- Objective mismatch: Fine-tuning optimizes for token prediction accuracy, not human preference, engagement, or helpfulness
4.3 Reinforcement Learning from Human Feedback (RLHF)
RLHF addresses the limitations of instruction fine-tuning by optimizing directly for human preferences rather than token prediction accuracy. RLHF and closely related preference-optimization methods are used to train modern commercial chatbots such as ChatGPT, Claude, and Gemini.
The full RLHF pipeline consists of three sequential stages:
- Supervised Fine-Tuning (SFT): First, perform instruction fine-tuning on a small dataset of high-quality human demonstrations to establish a base model that follows basic formatting conventions
- Reward Model (RM) Training: Train a separate reward model to predict which of two completions a human will prefer
- RL Optimization: Fine-tune the SFT model using reinforcement learning to maximize the reward score from the reward model
4.3.1 Policy Gradient Recap for LLMs
RLHF uses policy gradient methods to optimize the LLM. For a prompt x, the core objective is to maximize the expected reward of completions y sampled from the model (in practice, averaged over a set of prompts):
J(θ) = E_{y ~ p_θ(·|x)} [ r(x,y) ]
Using the log-derivative trick, ∇_θ p_θ(y|x) = p_θ(y|x) * ∇_θ log p_θ(y|x), we can derive the policy gradient update rule:
∇_θ J(θ) = E_{y ~ p_θ(·|x)} [ r(x,y) * ∇_θ log p_θ(y|x) ]
This rule has an intuitive interpretation:
- If a generated sample has a high reward, increase the probability of generating that sample in the future
- If a generated sample has a low reward, decrease the probability of generating that sample in the future
In practice a baseline is subtracted from the reward, so "high" and "low" mean above or below that baseline.
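
The update rule can be checked on a toy problem. The sketch below assumes a softmax policy over three enumerable "completions" as a stand-in for an LLM, with made-up reward values; because the set is tiny, the expectation is computed exactly by enumeration instead of by sampling. The names `softmax` and `policy_gradient` are illustrative.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(logits, rewards):
    """Exact E_y[ r(y) * grad log p(y) ] for a softmax policy over a
    small, enumerable set of completions (toy stand-in for an LLM)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p_y, r_y) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            # d/d logit_k of log p(y) for a softmax is 1[k == y] - p_k
            score = (1.0 if k == y else 0.0) - probs[k]
            grad[k] += p_y * r_y * score
    return grad


logits = [0.0, 0.0, 0.0]      # uniform policy over three completions
rewards = [1.0, 0.0, -1.0]    # hypothetical reward for each completion
grad = policy_gradient(logits, rewards)

# One gradient-ascent step with an illustrative learning rate of 1.0.
new_logits = [z + 1.0 * g for z, g in zip(logits, grad)]
old_probs, new_probs = softmax(logits), softmax(new_logits)
```

After the step, the highest-reward completion becomes more likely and the lowest-reward one less likely, which is the interpretation given in the two bullets. A real RLHF implementation estimates the same expectation from sampled completions rather than by enumeration.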
4.3.2 Why We Cannot Avoid RL
A common question is: if the reward model is a differentiable neural network, why can't we just backpropagate gradients directly through it to update the LLM?
The answer is the discrete token bottleneck:
- LLMs generate discrete tokens one at a time
- The sampling operation that selects which token to generate is fundamentally non-differentiable
- We cannot backpropagate gradients from the reward model through the discrete tokens to the LLM parameters
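
A small numerical experiment makes the bottleneck concrete. The sketch below is an illustration under simplifying assumptions: greedy decoding (argmax) stands in for sampling, the "reward model" is a lookup table of made-up per-token rewards, and the derivative is approximated by finite differences. All names are hypothetical.

```python
def pick_token(logits):
    """Greedy decoding: index of the largest logit (a discrete choice)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def reward_of_choice(logits, token_rewards):
    """Reward received for the token the policy emits."""
    return token_rewards[pick_token(logits)]


def finite_difference(f, logits, index, eps=1e-4):
    """Numerical derivative of f with respect to logits[index]."""
    bumped = list(logits)
    bumped[index] += eps
    return (f(bumped) - f(logits)) / eps


logits = [2.0, 1.0, 0.5]
token_rewards = [0.1, 0.9, 0.3]   # hypothetical per-token rewards

# Derivative of the reward with respect to each logit.
grads = [
    finite_difference(lambda z: reward_of_choice(z, token_rewards), logits, i)
    for i in range(len(logits))
]
```

Every entry of `grads` is zero: a small change to the logits does not change which token is chosen, so the reward is piecewise constant and carries no gradient signal, even though token 1 would earn a much higher reward. The policy gradient sidesteps this by differentiating log p_θ(y|x), which is smooth in the parameters, and using the reward only as a scalar weight.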
4.3.3 Reward Model Training with the Bradley-Terry Objective
Humans are notoriously bad at assigning absolute numerical scores to completions. Two different humans may assign wildly different scores to the same completion, and even the same human may score the same completion differently on different days.
However, humans are much better at comparing two completions and saying which one is better. RLHF leverages this by training reward models on pairwise preference data rather than absolute scores.
Given a dataset D of preferences where completion y_w is preferred over completion y_l for the same input x, we train the reward model r_φ using the Bradley-Terry objective, where σ is the logistic sigmoid:
L_RM(φ) = - E_{(x,y_w,y_l) ~ D} [ log σ( r_φ(x,y_w) - r_φ(x,y_l) ) ]
The loss depends only on the difference between the two scores, so the reward model only has to assign a higher score to the preferred completion.
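
The Bradley-Terry loss is short enough to write out directly. The sketch below assumes the reward model's scalar scores for each (preferred, rejected) pair are already computed; `log_sigmoid` and `bradley_terry_loss` are names chosen for this example, and the scores are made up.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) score pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


tied = bradley_terry_loss([(0.0, 0.0)])        # no preference expressed
correct = bradley_terry_loss([(2.0, -1.0)])    # preferred scored higher
wrong = bradley_terry_loss([(-1.0, 2.0)])      # preferred scored lower
shifted = bradley_terry_loss([(102.0, 99.0)])  # same margin, shifted scores
```

A tied pair costs log 2, a correctly ordered pair costs less, and a misordered pair costs more. `shifted` equals `correct` because only the score difference enters the loss, which is why the data constrain the ranking but not the absolute scale of the rewards.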
4.3.4 The KL Penalty: The Most Important Trick in RLHF
A critical addition to the RLHF objective that makes it work in practice is the Kullback-Leibler (KL) divergence penalty:
r_total(x,y) = r_φ(x,y) - β * log( p_θ(y|x) / p_sft(y|x) )
Where:
- p_θ is the current RL model being optimized
- p_sft is the initial supervised fine-tuned model
- β is a hyperparameter that controls the strength of the penalty
Averaged over completions sampled from p_θ, the log-ratio term is the KL divergence between p_θ and p_sft. The penalty matters for three reasons:
- Reward models are imperfect and only accurate on the distribution of data they were trained on
- Without the penalty, the model will quickly learn to exploit flaws in the reward model to get extremely high scores without producing good outputs (reward hacking)
- The penalty keeps the policy close to the SFT model, preserving general knowledge and capabilities that would otherwise be lost during RL optimization
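
The penalized reward can be computed from three numbers per completion. The sketch below assumes sequence-level log-probabilities (the sum of per-token log-probabilities) under both models are available; the function name, the value of β, and all scores are illustrative and not taken from the lecture.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta):
    """r_total = r_phi(x, y) - beta * log(p_theta(y|x) / p_sft(y|x)).

    logp_policy and logp_sft are sequence log-probabilities, i.e. the sum
    of per-token log-probabilities of the same completion y under each model.
    """
    log_ratio = logp_policy - logp_sft
    return rm_score - beta * log_ratio


beta = 0.1  # illustrative value

# Completion the policy has drifted toward: far more likely under the
# policy than under the SFT model, so the penalty is large.
drifted = kl_penalized_reward(rm_score=5.0, logp_policy=-10.0,
                              logp_sft=-40.0, beta=beta)

# Completion that both models consider about equally likely.
close = kl_penalized_reward(rm_score=4.0, logp_policy=-20.0,
                            logp_sft=-21.0, beta=beta)
```

Although the drifted completion has the higher raw reward-model score, its total reward is lower once the penalty is applied. This is the mechanism that discourages the policy from chasing high scores in regions where the reward model was never trained.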
4.4 Direct Preference Optimization (DPO): Alignment Without RL
RLHF is effective but also complex, unstable, and computationally expensive. It requires maintaining three separate models (SFT, RM, and RL) and performing expensive sampling during the RL loop.
Direct Preference Optimization (DPO) showed that we can optimize directly on preference data without an explicit reward model or RL loop.
4.4.1 Theoretical Derivation of DPO
The key mathematical insight behind DPO is that the KL-constrained RL objective has an exact closed-form solution:
p*(y|x) = (1/Z(x)) * p_sft(y|x) * exp( r(x,y) / β )
Where Z(x) = Σ_y p_sft(y|x) * exp( r(x,y) / β ) is a normalizing constant that ensures the probabilities sum to 1.
We can rearrange this equation to express the reward function in terms of the optimal policy:
r(x,y) = β * log( p*(y|x) / p_sft(y|x) ) + β * log Z(x)
When we substitute this expression for the reward into the Bradley-Terry objective, the β * log Z(x) term cancels because it is the same for both completions being compared. Replacing the optimal policy p* with the trainable policy p_θ leaves the DPO objective:
L_DPO(θ) = - E_{(x,y_w,y_l) ~ D} [ log σ( β * ( log(p_θ(y_w|x)/p_sft(y_w|x)) - log(p_θ(y_l|x)/p_sft(y_l|x)) ) ) ]
This is a standard binary classification loss that can be optimized directly on the preference data with standard backpropagation. Only the frozen SFT model is needed as a reference: no separate reward model, no RL loop, and no sampling during training.
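
The DPO objective for a single preference pair can be written as a short function. This is a minimal sketch assuming the sequence log-probabilities of both completions under the trainable model and the frozen SFT reference are already computed; the names `dpo_loss` and `log_sigmoid`, the value of β, and the log-probabilities are illustrative.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one preference pair from sequence log-probabilities.

    logp_*     : log p_theta(y|x) under the model being trained
    ref_logp_* : log p_sft(y|x) under the frozen reference model
    """
    # beta * log-ratio is the reward up to the beta * log Z(x) term,
    # which is identical for y_w and y_l and so drops out of the margin.
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(reward_w - reward_l)


beta = 0.1  # illustrative value

# At initialization the policy equals the reference, so the margin is 0.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0, beta)

# After training raises y_w and lowers y_l relative to the reference.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0, beta)
```

At initialization the loss is log 2 regardless of the data, and it falls as the policy shifts probability toward the preferred completion relative to the reference. In a real implementation the four log-probabilities come from forward passes of the two models, and gradients flow only through the trainable one.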
4.4.2 Advantages of DPO
DPO has rapidly become a widely used alignment method in the open-source community due to several advantages:
- Simplicity: Only one model is trained (alongside a frozen reference copy of the SFT model), with no complex RL infrastructure required
- Better stability: Avoids much of the instability common in deep RL training
- Computational efficiency: Eliminates the expensive sampling step during RL optimization
- Competitive performance: In practice, DPO often produces models that match or exceed RLHF models on common benchmarks
- Fewer hyperparameters: Mainly requires tuning a single β hyperparameter, compared to many more for RLHF
4.5 Limitations of Current Alignment Methods
While RLHF and DPO have produced remarkable results, they have significant limitations that remain active areas of research:
- Reward hacking: Models tend to exploit flaws in the reward model to maximize reward without producing the intended behavior
- Sycophancy: Models learn to agree with users even when they are clearly wrong, because this is what humans typically prefer in the short term
- Preference misalignment: Current methods optimize for what humans say they want, not what they actually need or what is good for them in the long term
- Prohibitive preference data costs: Collecting high-quality human preference data remains expensive and time-consuming
- Lack of personalization: Models are aligned to the aggregate preferences of the annotators, not the unique preferences of individual users
- Uncontrolled emergent behaviors: Alignment training can produce unexpected and undesirable emergent behaviors like increased dishonesty or decreased reasoning ability
These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you master the content of this subject thoroughly. Wish you continuous academic progress and great achievements in your studies.


