Lecture 5: LLM Preference Alignment, RLHF and Direct Preference Optimization

One. Course Details

This is the fifth lecture of Stanford University’s CME 295: Transformers and Large Language Models, taught by twin brothers Afshine and Shervine Amidi. The lecture opens with a brief post-midterm update, noting that both the exam and its official solutions are now posted on the course website for enrolled students and auditors alike. The final exam will exclusively cover content from lectures 5 through 9 and will follow the same closed-book, closed-notes format as the midterm.

The lecture builds directly on the previous session’s coverage of pre-training and supervised fine-tuning (SFT) to introduce the critical third stage of LLM development: preference alignment. In the first half, Afshine breaks down the industry-standard Reinforcement Learning from Human Feedback (RLHF) pipeline, covering preference data collection, reward model training using the Bradley-Terry formulation, and the Proximal Policy Optimization (PPO) algorithm. In the second half, Shervine presents two practical alternatives to full RLHF—Best-of-N (BoN) sampling and Direct Preference Optimization (DPO)—explaining their mathematical foundations, implementation simplicity, and performance tradeoffs compared to traditional RLHF.

This lecture represents a pivotal shift from model training fundamentals to the alignment techniques that transform raw language models into safe, helpful, and user-friendly assistants.

Two. Key Learning Objectives

By the end of this lecture, students will be able to:

Explain the fundamental need for preference alignment and identify three key ways it differs from supervised fine-tuning.

Compare pointwise, pairwise, and listwise preference data collection methods and justify why pairwise data is the industry standard.

Describe the two-stage RLHF pipeline and derive the reward model loss function from the Bradley-Terry formulation.

Interpret the PPO objective function and explain the role of KL divergence in preventing catastrophic forgetting and reward hacking.

Distinguish between on-policy and off-policy training and explain why PPO is classified as an on-policy algorithm.

Evaluate the tradeoffs between RLHF, Best-of-N sampling, and Direct Preference Optimization for different production use cases.

Identify the core challenges of RL-based alignment, including training instability, hyperparameter sensitivity, and reward hacking.

Three. Memorable Course Quotes

"Supervised fine-tuning teaches the model what to generate, but preference tuning teaches the model what to prefer."

"RLHF is not about teaching the model new facts—it’s about shaping the tone, safety, and helpfulness of the responses it already knows how to generate."

"Reward hacking is when your model optimizes perfectly for the metric you gave it, but completely misses the thing you actually care about."

"Direct preference optimization eliminates the entire reward modeling step by directly optimizing the model on preference pairs."

"The best alignment method is always a tradeoff between performance, compute cost, and engineering complexity."

Four. Detailed Lecture Notes

4.1 Preference Alignment Overview

The lecture begins with a concise recap of the two-stage LLM training pipeline covered in Lecture 4:

Pre-training: Trains the model on trillions of tokens of unlabeled text to learn general language understanding and world knowledge.
Supervised Fine-Tuning (SFT): Trains the model on high-quality instruction-response pairs to turn it into a basic instruction-following assistant.

Even a well-executed SFT model often produces responses that are factually correct but misaligned with human preferences. For example, it might give a blunt, unempathetic answer to a question about washing a beloved teddy bear, or generate content that is unsafe, biased, or unhelpful.

Preference alignment is the third and final stage of LLM training that addresses these limitations. It differs from SFT in three critical ways:

It teaches the model what to prefer rather than dictating exactly what to generate word-for-word.
It allows injecting explicit negative signals about what the model should not produce, which SFT cannot do effectively.
It leverages the fact that humans are far better at comparing two responses than writing a perfect response from scratch, making data collection much more scalable.

4.2 Preference Data Collection

All preference alignment techniques rely on high-quality preference data. There are three primary approaches to collecting this data:

Pointwise: Assign an absolute numerical score to each individual response. This is rarely used because humans struggle to consistently assign meaningful absolute scores to subjective text.
Pairwise: Present two responses to the same prompt and ask the rater to select the better one. This is the industry standard because it is intuitive, reliable, and produces consistent labels across raters.
Listwise: Present a list of 3-5 responses and ask the rater to rank them from best to worst. This provides more information per prompt but is significantly more cognitively demanding for raters.

Pairwise preference data is typically collected by:

Generating two distinct responses to the same prompt by sampling from the SFT model at a nonzero temperature, so that the two responses actually differ.
Having human raters select the preferred response, or using a powerful LLM as an automated judge for faster, cheaper labeling.
Optionally using a nuanced 6-point scale (much better, better, slightly better, slightly worse, worse, much worse) rather than a simple binary choice, though binary labels remain the most common.
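
A pairwise label can be stored as a small record holding the prompt, the preferred response, and the rejected response. The following is a minimal sketch, not the course's data format: the names `PreferencePair`, `SCALE`, and `to_pair` are hypothetical, and mapping the 6-point scale to an integer margin is an assumption made for illustration.

```python
from dataclasses import dataclass

# Hypothetical encoding of the 6-point scale: how response A compares to B.
# Positive means A is preferred, negative means B is preferred.
SCALE = {
    "much better": 3, "better": 2, "slightly better": 1,
    "slightly worse": -1, "worse": -2, "much worse": -3,
}

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # preferred response (y_w)
    rejected: str  # non-preferred response (y_l)
    margin: int    # strength of the preference, from 1 to 3

def to_pair(prompt: str, response_a: str, response_b: str, label: str) -> PreferencePair:
    """Turn a rater's judgment of A relative to B into a (chosen, rejected) record."""
    score = SCALE[label]
    if score > 0:
        return PreferencePair(prompt, response_a, response_b, score)
    return PreferencePair(prompt, response_b, response_a, -score)
```

With binary labels, the margin can simply be dropped and only the chosen and rejected fields kept.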

4.3 Reinforcement Learning from Human Feedback (RLHF)

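RLHF is the most widely used preference alignment technique and has been reported as part of the training of major commercial assistants such as ChatGPT, Claude, and Gemini. It consists of two sequential stages: a reward model is trained first, then frozen and used to fine-tune the LLM.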

4.3.1 Stage 1: Reward Model Training

The goal of the reward model is to learn to predict how much a human would prefer a given response to a specific prompt. It is trained exclusively on the pairwise preference data collected in the previous step.

The reward model is built on the Bradley-Terry formulation, a statistical model used to predict the outcome of pairwise comparisons. It states that the probability that response yᵢ is better than response yⱼ is: P(yᵢ ≻ yⱼ) = σ(r(x, yᵢ) - r(x, yⱼ)) where σ is the sigmoid function and r(x, y) is the scalar reward score assigned to prompt x and response y.

The loss function for training the reward model is the negative log-likelihood of the observed preferences: L_RM = -E[log σ(r(x, y_w) - r(x, y_l))] where y_w is the winning (preferred) response and y_l is the losing response.
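
The two formulas above can be sketched in a few lines of plain Python. This is an illustrative sketch only: it works on precomputed scalar reward scores rather than on a real model, and the function names are hypothetical. The log-sigmoid is written in a numerically stable form so that large score differences do not overflow.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def preference_probability(r_i: float, r_j: float) -> float:
    """Bradley-Terry probability that response i is preferred over response j."""
    return math.exp(log_sigmoid(r_i - r_j))

def reward_model_loss(pairs: list[tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (r_w, r_l) reward score pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)
```

When the two scores are equal, the preference probability is 0.5 and the loss for that pair is log 2. The loss shrinks toward zero as the winning response's score pulls ahead of the losing one.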

Key implementation details:

Reward models are almost always initialized from the SFT model, with a single linear head added to output the scalar reward score.
The reward model outputs a single continuous score for each prompt-response pair.
While trained on pairwise data, the reward model produces pointwise scores at inference time.

4.3.2 Stage 2: Reinforcement Learning with PPO

Once the reward model is trained and frozen, it is used to provide reward signals for fine-tuning the LLM using reinforcement learning. The core challenge here is to maximize the reward from the reward model while preventing the model from deviating too far from the original SFT model, which would lead to:

Catastrophic forgetting: The model loses the general knowledge and capabilities it learned during pre-training and SFT.
Reward hacking: The model learns to exploit flaws in the reward model to get high scores without actually producing good responses.

The standard algorithm for this step is Proximal Policy Optimization (PPO), which is specifically designed to make stable, incremental updates to the model policy.

The PPO objective function has two core components:

Reward maximization: Maximize the expected reward from the frozen reward model.
KL divergence penalty: Minimize the KL divergence between the current policy and the original SFT policy to prevent excessive deviation.
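
One common way to combine the two components is to subtract a scaled KL estimate from the reward model's score. The sketch below assumes a sequence-level formulation in which the KL term is estimated from the log-probabilities of the sampled tokens under the current policy and the SFT reference. The function name and the default `beta=0.1` are illustrative assumptions, not values from the lecture.

```python
def kl_penalized_reward(
    reward: float,
    policy_logprobs: list[float],
    ref_logprobs: list[float],
    beta: float = 0.1,  # assumed penalty strength, for illustration only
) -> float:
    """Reward model score minus beta times a sampled KL estimate.

    The sum of per-token log-ratios, log pi(token) - log pi_ref(token), over
    tokens sampled from the current policy is a single-sample estimate of the
    KL divergence between the policy and the reference.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate
```

If the policy still matches the reference, the penalty is zero and the reward passes through unchanged. The further the policy's token probabilities drift from the reference, the more reward is subtracted.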

Two additional concepts are critical to understanding PPO:

Value function: A token-level estimator that predicts the expected final reward given a partial generation. It is trained jointly with the policy and used to compute advantages.
Advantage function: Measures how much better a particular action (generating a specific token) is than the average expected action. It reduces the variance of policy updates and dramatically stabilizes training.
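
To make the relationship between the value function and the advantage concrete, here is a sketch using Generalized Advantage Estimation (GAE), a common choice in PPO implementations. Whether the lecture uses GAE specifically is an assumption. The function name and the default `gamma` and `lam` values are also illustrative.

```python
def gae_advantages(
    rewards: list[float],
    values: list[float],
    gamma: float = 1.0,  # assumed discount factor
    lam: float = 0.95,   # assumed GAE smoothing parameter
) -> list[float]:
    """Per-token advantages from per-token rewards and value estimates.

    rewards[t] is the reward received after generating token t (often zero
    everywhere except the final token); values[t] is the value function's
    prediction of the final reward before token t is generated.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # nothing remains to be earned after the last token
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `gamma = lam = 1`, each advantage reduces to the total remaining reward minus the value estimate at that position, which is the "better than expected" signal described above.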

There are two common variants of the PPO loss:

PPO-Clip: Clips the probability ratio between the updated policy and the previous policy to the range [1-ε, 1+ε], which prevents excessively large updates in a single iteration.
PPO-KL Penalty: Adds an explicit KL divergence term to the loss function to penalize deviations from the reference policy.
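
The clipping in PPO-Clip can be sketched for a single token as follows. This is a minimal illustration of the standard clipped surrogate objective. The function name and the default `epsilon=0.2` are assumptions, and a real implementation would average this quantity over many tokens and maximize it with gradient ascent.

```python
import math

def ppo_clip_objective(
    new_logprob: float,
    old_logprob: float,
    advantage: float,
    epsilon: float = 0.2,  # assumed clipping range
) -> float:
    """Clipped surrogate objective for one token (to be maximized)."""
    ratio = math.exp(new_logprob - old_logprob)  # pi_new(token) / pi_old(token)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio outside
    # the clipping range in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, the objective stops growing once the ratio exceeds 1+ε. For a negative advantage, it stops improving once the ratio drops below 1-ε.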

Modern production implementations typically combine elements of both variants for optimal stability and performance.

4.3.3 Key Challenges of RLHF

While RLHF produces state-of-the-art aligned models, it has significant practical drawbacks:

Complex pipeline: Requires training and maintaining four separate models in memory simultaneously (policy, reference model, reward model, value function).
Hyperparameter sensitivity: Has dozens of hyperparameters that require extensive tuning to avoid training instability.
Training instability: RL training is notoriously finicky and can diverge completely if not carefully monitored.
On-policy requirement: Requires generating new data from the current policy at each iteration, which is computationally expensive.

4.4 Alternatives to RLHF

Given the complexity and cost of full RLHF, researchers have developed simpler alternatives that work surprisingly well for many use cases.

4.4.1 Best-of-N (BoN) Sampling

Best-of-N is the simplest possible preference alignment method and requires no additional training beyond the reward model. It works by:

Generating N different responses to the same prompt using the frozen SFT model.
Scoring all N responses using the trained reward model.
Returning only the highest-scoring response to the end user.
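
The three steps above can be sketched as a short function. The `generate` and `score` callables are hypothetical stand-ins for the frozen SFT model and the trained reward model; nothing here calls a real model API.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one response from the SFT model
    score: Callable[[str, str], float],  # hypothetical: reward model score for (prompt, response)
    n: int = 4,
) -> str:
    """Sample n responses and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))
```

The cost structure is visible directly in the code: each call runs `generate` n times and `score` n times before anything is returned to the user.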

Advantages:

Extremely simple to implement and debug.
Produces high-quality responses with minimal engineering effort.

Disadvantages:

Expensive at inference time: every user query requires N generations plus N reward model scoring passes, which quickly becomes prohibitive at scale.
Increases end-to-end latency, even with parallel generation, due to the need to wait for the longest completion.

4.4.2 Direct Preference Optimization (DPO)

DPO is an alignment technique published in 2023 that eliminates both the separate reward modeling stage and the RL stage. It directly optimizes the model on pairwise preference data using a simple supervised loss function.

The core insight behind DPO is that the optimal policy under the KL-regularized RLHF objective has a closed-form expression in terms of the reward function and the reference policy. Inverting that expression writes the reward as a function of the policy itself, and substituting it into the Bradley-Terry model turns the entire RLHF pipeline into a simple binary cross-entropy loss that can be optimized with standard supervised training techniques.
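
The reasoning can be sketched in four steps. This follows the standard derivation from the DPO paper and may differ in presentation from the lecture slides. Z(x) is the normalizing constant that makes the optimal policy sum to one, and it cancels in the last step because it depends only on the prompt.

```latex
\begin{aligned}
\max_{\pi_\theta}\; & \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
\pi^*(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big) \\
r(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
P(y_w \succ y_l) &= \sigma\!\Big( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)
\end{aligned}
```

Replacing the optimal policy with the trainable policy π_θ and taking the negative log-likelihood of the observed preferences gives a loss that involves no explicit reward model.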

The DPO loss function is: L_DPO = -E[log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)))] where β is the temperature hyperparameter that controls the strength of the KL penalty, and π_ref is the original frozen SFT policy.
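
For a single preference pair, the loss above can be computed from four sequence log-probabilities. The sketch below assumes those log-probabilities have already been obtained by summing token log-probabilities under each model. The function names and the default `beta=0.1` are illustrative assumptions.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(
    policy_logp_w: float,  # log pi_theta(y_w | x)
    policy_logp_l: float,  # log pi_theta(y_l | x)
    ref_logp_w: float,     # log pi_ref(y_w | x)
    ref_logp_l: float,     # log pi_ref(y_l | x)
    beta: float = 0.1,     # assumed value, for illustration only
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple."""
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss is log 2. Training lowers the loss by raising the chosen response's probability relative to the reference and lowering the rejected response's.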

Advantages of DPO:

Much simpler pipeline than RLHF, requiring only two models (current policy and reference policy).
Far more stable training with only one primary hyperparameter (β) to tune.
Computationally cheaper and faster to implement and iterate on.

Disadvantages of DPO:

Generally performs slightly worse than well-tuned PPO on complex alignment tasks.
More susceptible to distribution shift between the preference data and the model’s native generation distribution.

4.5 Final Alignment Takeaway

The ultimate goal of all preference alignment techniques is not to teach the model new information, but to shape its existing knowledge into responses that are helpful, harmless, and aligned with human values. The choice of alignment method depends entirely on the specific use case, available compute budget, and required performance level.

These are my structured study notes and interpretations, compiled while watching the open course. I hope this framework helps you master the material, and I wish you continued progress and success in your studies.

Video Source and Usage Instructions

• Video Title: Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 5 - LLM tuning Stanford Online
• Course Series: Stanford CME295: Transformers and Large Language Models | Autumn 2025
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/PmW_TMQ3l0I?si=sHGdzWlLMQIiMqyY
