One. Course Details
This is the tenth lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by a guest lecturer specializing in large language model reasoning. It focuses entirely on reinforcement learning techniques for LLM reasoning, with a specific emphasis on mathematical problem-solving tasks.
The lecture organizes the field of reasoning RL into two eras: methods developed before the release of DeepSeek-R1 in early 2025, and the modern approaches that power today's state-of-the-art "thinking models". It presents a systematic data scaling analysis of different training paradigms, explains the critical problem of spurious steps and generalization collapse in on-policy imitation learning, and introduces step-level credit assignment as the key breakthrough that enabled dramatic improvements in reasoning performance. The lecture concludes with an explanation of how the same core RL techniques, when combined with more capable base models and extended thinking budgets, produce the remarkable reasoning abilities of modern models like DeepSeek-R1 and OpenAI's O-series.
Two. Key Learning Objectives
By the end of this lecture, students should be able to:
- Formulate the LLM reasoning problem as a sparse reward Markov Decision Process (MDP) with deterministic dynamics
- Compare the data efficiency and limitations of three core training paradigms: supervised fine-tuning (SFT), rejection fine-tuning (RFT), and full reinforcement learning
- Explain the spurious step problem and why excessive on-policy imitation learning leads to generalization collapse
- Implement step-level credit assignment using rollout-based Q-value estimation and advantage functions
- Apply advantage filtering and DPO-based offline RL to reasoning tasks
- Understand the differences between sparse end-of-trajectory RL and dense per-step RL with process reward models
- Explain the principles behind modern thinking models and why extended thinking budgets lead to better performance
Three. Memorable Course Quotes
- "Supervised fine-tuning teaches models to write solutions that look correct. Reinforcement learning teaches them to write solutions that are correct."
- "On-policy imitation learning can be 2x more data-efficient than off-policy imitation learning for reasoning tasks, but it hits a hard generalization wall if you push it too far."
- "Spurious steps are incorrect or irrelevant steps that accidentally lead to the right answer on the training set but break generalization to new problems."
- "The biggest breakthrough in RL for reasoning is moving from sparse end-of-trajectory rewards to dense per-step credit assignment."
- "The magic of modern thinking models is not in new RL algorithms; it's in giving models the freedom to define their own steps and think for longer."
- "We are running out of high-quality human-written reasoning data. The future of reasoning models lies in training on self-generated data with reinforcement learning."
Four. Detailed Study Notes
4.1 Problem Formulation: Reasoning as an MDP
We can formalize the task of solving math problems (or any step-by-step reasoning task) as a Markov Decision Process with the following components:
- Initial state S₀: The natural language problem statement
- States Sᵢ: The concatenation of the initial problem and all reasoning steps generated so far
- Actions Aᵢ: The next reasoning step (a sentence, paragraph, or logical block of text)
- Dynamics: Fully deterministic. Taking action Aᵢ in state Sᵢ always leads to state Sᵢ₊₁ = Sᵢ + Aᵢ
- Reward: A sparse binary reward of +1 if the final answer is correct, 0 otherwise
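As a minimal sketch of the formulation above, the MDP can be written in a few lines of Python. The names `ReasoningState`, `transition`, and `reward` are hypothetical, and checking the answer by exact string match is an assumption made for illustration; real graders usually normalize answers first.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReasoningState:
    """A state: the problem statement plus every reasoning step written so far."""
    problem: str
    steps: Tuple[str, ...] = ()

    def text(self) -> str:
        # The state is the concatenation of the problem and all steps.
        return "\n".join((self.problem, *self.steps))


def transition(state: ReasoningState, action: str) -> ReasoningState:
    """Deterministic dynamics: the next state is the old state with the step appended."""
    return ReasoningState(state.problem, state.steps + (action,))


def reward(final_answer: str, reference_answer: str) -> float:
    """Sparse binary reward, available only once a final answer exists.

    Exact string match is an illustrative assumption.
    """
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


s0 = ReasoningState("If 100 apples are split equally between 2 people, "
                    "how many apples does each person get?")
s1 = transition(s0, "100 / 2 = 50.")
s2 = transition(s1, "Each person gets 50 apples.")
print(s2.text())
print(reward("50", "50"))
```

Because the dynamics are plain concatenation, the same prefix and the same step always produce the same next state; all randomness comes from the policy that chooses the step.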
4.2 The Three Axes of Reasoning Training Methods
All modern reasoning training methods can be categorized along three axes based on how they use data:
- Supervised Fine-Tuning (SFT): Train on human-written correct solutions
- Rejection Fine-Tuning (RFT): Train on model-generated correct solutions
- Full Reinforcement Learning: Learn from both correct and incorrect solutions
4.2.1 Supervised Fine-Tuning (SFT)
SFT is the simplest baseline: train the model to predict the next token in human-written step-by-step solutions.
- Data efficiency: Extremely low. Test error decreases as D⁻⁰·¹⁵, where D is the amount of training data. This is a power law with a small exponent, so even a modest reduction in error requires multiplying the dataset size many times over
- Limitations:
  - High-quality human-written reasoning data is projected to run out by 2028
  - Models learn to produce solutions that look human-like but often contain logical errors
  - Performance is capped by the ability of the humans who wrote the training data
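To make the D⁻⁰·¹⁵ scaling concrete, the short calculation below shows how much more data a given error reduction would need. It assumes that test error is exactly proportional to D⁻⁰·¹⁵ with no irreducible error floor; `data_multiplier` is a hypothetical helper name.

```python
def data_multiplier(error_ratio: float, exponent: float = 0.15) -> float:
    """Factor by which D must grow so that test error shrinks by `error_ratio`.

    Assumes error = c * D ** (-exponent), so
    error_old / error_new = (D_new / D_old) ** exponent.
    """
    if error_ratio <= 0 or exponent <= 0:
        raise ValueError("error_ratio and exponent must be positive")
    return error_ratio ** (1.0 / exponent)


# Data needed to halve the error, and to cut it by a factor of 1.1.
print(f"halve the error:        x{data_multiplier(2.0):.1f} data")
print(f"error / 1.1 (about 9%): x{data_multiplier(1.1):.2f} data")
```

Under this assumption, halving the error takes roughly a hundred times more data, and even a reduction of about 9% takes close to twice as much.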
4.2.2 Rejection Fine-Tuning (RFT)
RFT is a simple but powerful improvement over SFT that leverages the model's own generations:
- Sample N solutions from the current model for each problem
- Filter out all incorrect solutions, retaining only those that produce the right answer
- Fine-tune the model on the filtered set of correct self-generated solutions
- Data efficiency: 2x higher than SFT. You can achieve the same test error with half as many problems
- Critical limitation: Generalization collapse with excessive training. As you increase the number of correct solutions per problem beyond a certain point, test error starts to increase rather than decrease
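The sample-and-filter part of RFT can be sketched as follows. This is an illustration under stated assumptions rather than the lecture's implementation: `build_rft_dataset` is a hypothetical name, `toy_sampler` is a random stand-in for sampling from an LLM, and the fine-tuning step itself is left out.

```python
import random
from typing import Callable, Dict, List, Tuple


def build_rft_dataset(
    problems: Dict[str, str],                             # problem -> reference answer
    sample_solution: Callable[[str], Tuple[str, str]],    # problem -> (solution, answer)
    n_samples: int,
) -> List[Tuple[str, str]]:
    """Sample n_samples solutions per problem and keep only the correct ones."""
    dataset = []
    for problem, reference in problems.items():
        for _ in range(n_samples):
            solution, answer = sample_solution(problem)
            if answer == reference:  # rejection step: drop wrong final answers
                dataset.append((problem, solution))
    return dataset


def toy_sampler(problem: str) -> Tuple[str, str]:
    """Toy stand-in for an LLM: returns a right or wrong answer at random."""
    answer = random.choice(["50", "200"])
    return f"... so the answer is {answer}.", answer


random.seed(0)
data = build_rft_dataset(
    {"100 apples split equally between 2 people?": "50"}, toy_sampler, n_samples=8
)
print(f"kept {len(data)} of 8 sampled solutions")
```

Note that the filter looks only at the final answer, so a solution with a flawed intermediate step still passes as long as it ends correctly.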
4.3 The Spurious Step Problem
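The generalization collapse in RFT is caused by spurious steps: incorrect or irrelevant steps that accidentally lead to the correct answer on the training set but fail catastrophically on new, unseen problems.
A classic example:
- Problem: "If 100 apples are split equally between 2 people, how many apples does each person get?"
- Spurious solution: "100 × 2 = 200. Therefore, 100 ÷ 2 = 50. Each person gets 50 apples."
The first step, 100 × 2 = 200, is irrelevant to the problem. Because the final answer is still correct, RFT keeps the whole solution and trains the model to imitate the irrelevant step along with the valid ones.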
4.4 Step-Level Credit Assignment: The Key Breakthrough
The solution to the spurious step problem is step-level credit assignment: instead of treating the entire trajectory as either good or bad, evaluate the quality of each individual step.
4.4.1 Rollout-Based Q-Value Estimation
We estimate the quality of a prefix (state) using rollouts:
- Take a partial solution prefix
- Sample K complete solutions from a rollout policy starting from this prefix
- The Q-value of the prefix is the fraction of these rollouts that produce the correct answer
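The estimate described above is a plain Monte Carlo average, sketched below. The function name `estimate_q` and the `toy_rollout` policy (which succeeds more often when the prefix already contains the correct division) are assumptions made for illustration; in practice the rollout policy would be an LLM completing the solution.

```python
import random
from typing import Callable


def estimate_q(
    prefix: str,
    rollout: Callable[[str], str],   # prefix -> final answer of one sampled completion
    reference: str,
    k: int,
) -> float:
    """Fraction of k rollouts from `prefix` that reach the reference answer."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(rollout(prefix) == reference for _ in range(k))
    return hits / k


def toy_rollout(prefix: str) -> str:
    """Toy rollout policy: more likely to succeed from a prefix with the right step."""
    p_success = 0.8 if "100 / 2" in prefix else 0.2
    return "50" if random.random() < p_success else "200"


random.seed(0)
good = estimate_q("Problem. 100 / 2 = 50.", toy_rollout, "50", k=200)
bad = estimate_q("Problem. 100 * 2 = 200.", toy_rollout, "50", k=200)
print(f"Q(good prefix) = {good:.2f}, Q(bad prefix) = {bad:.2f}")
```

The estimate gets less noisy as K grows, at the cost of K extra completions for every prefix that is evaluated.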
4.4.2 Advantage Functions for Step Evaluation
We define the advantage of a step as the change in Q-value caused by taking that step:
A(Sᵢ, Aᵢ) = Q(Sᵢ₊₁) - Q(Sᵢ)
Here Sᵢ is the prefix before the step and Sᵢ₊₁ = Sᵢ + Aᵢ is the prefix after it, using the same indexing as the MDP in Section 4.1.
- Positive advantage: The step improved the model's chance of success
- Negative advantage: The step was harmful or spurious and reduced the model's chance of success
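As a worked example, the sketch below turns a list of prefix Q-values into per-step advantages (Q after the step minus Q before it). The Q-values are made up for illustration and `step_advantages` is a hypothetical helper name.

```python
from typing import List


def step_advantages(q_values: List[float]) -> List[float]:
    """Advantage of each step: Q of the prefix after it minus Q of the prefix before it.

    q_values[0] is the Q-value of the bare problem; q_values[j] is the Q-value
    of the prefix that contains the first j steps.
    """
    return [after - before for before, after in zip(q_values, q_values[1:])]


# Made-up Q-values for a three-step solution whose first step is spurious.
q = [0.5, 0.25, 0.75, 1.0]
for j, adv in enumerate(step_advantages(q), start=1):
    label = "helpful" if adv > 0 else "harmful or spurious"
    print(f"step {j}: advantage {adv:+.2f} ({label})")
```

Here the first step lowers the estimated chance of success from 0.5 to 0.25, so it gets a negative advantage even though the trajectory as a whole ends with the correct answer.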
4.5 Training Methods with Step-Level Credit
4.5.1 Advantage Filtered Imitation Learning
The simplest way to use advantages:
- Sample trajectories from the current model
- Compute the advantage for every step in every trajectory
- Add only steps with positive advantages to the training set
- Train the model to imitate these good steps
This approach has three benefits:
- Completely eliminates the generalization collapse of RFT
- Allows us to extract good steps from incorrect trajectories
- Allows us to discard bad steps from correct trajectories
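The filtering step can be sketched as below, assuming advantages have already been computed for every step. The `Step` tuple layout, the function name `filter_positive_steps`, and the advantage values are illustrative assumptions.

```python
from typing import List, Tuple

# (prefix before the step, the step itself, estimated advantage of the step)
Step = Tuple[str, str, float]


def filter_positive_steps(
    trajectories: List[List[Step]], threshold: float = 0.0
) -> List[Tuple[str, str]]:
    """Keep (prefix, step) pairs whose advantage exceeds the threshold."""
    return [
        (prefix, step)
        for trajectory in trajectories
        for prefix, step, advantage in trajectory
        if advantage > threshold
    ]


trajectories = [
    # Correct final answer, but the first step is spurious.
    [("Q", "100 * 2 = 200.", -0.25),
     ("Q 100 * 2 = 200.", "100 / 2 = 50.", 0.5)],
    # Wrong final answer, but the first step is useful.
    [("Q", "Each person gets half of the apples.", 0.25),
     ("Q Each person gets half of the apples.", "100 - 2 = 98.", -0.75)],
]
for prefix, step in filter_positive_steps(trajectories):
    print(f"train on: {step!r} given prefix {prefix!r}")
```

The example keeps one step from each trajectory: the valid division from the correct solution and the useful first step from the incorrect one.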
4.5.2 Offline RL with DPO
We can also use step-level credit to construct preference pairs for Direct Preference Optimization:
- Take a partial solution prefix
- Take the original step from the model's trajectory
- Take an alternative step from a rollout that led to success
- Construct a preference pair: (prefix + good step) ≻ (prefix + bad step)
- Train the model using the standard DPO loss
- Data efficiency: 8x higher than SFT, 4x higher than RFT
- Stability: Inherits all the stability benefits of DPO compared to traditional RL
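For reference, the standard DPO loss on a single preference pair can be written in plain Python as below. The inputs are summed token log-probabilities of the good and bad step under the trained policy and under a frozen reference model; the function name `dpo_loss`, the value of `beta`, and the example numbers are assumptions for illustration.

```python
import math


def dpo_loss(
    logp_chosen: float,        # log pi(good step | prefix) under the policy
    logp_rejected: float,      # log pi(bad step | prefix) under the policy
    ref_logp_chosen: float,    # same quantities under the frozen reference model
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * (chosen margin - rejected margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has raised the good step's log-probability: the loss drops below log(2).
print(dpo_loss(-4.0, -7.0, -5.0, -7.0))
```

Minimizing this loss raises the policy's log-probability of the good step relative to the bad one, measured against the reference model, without sampling new trajectories during training.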
4.6 Online Reinforcement Learning for Reasoning
For maximum performance, we can use online RL methods that continuously improve the policy through interaction:
- Basic online RL: Use policy gradients with the sparse end-of-trajectory reward. This is essentially RFT without explicit filtering.
- GRPO (Group Relative Policy Optimization): A variant of PPO that estimates advantages by comparing each rollout's reward with the rewards of the other rollouts sampled for the same problem, instead of using a learned value function, making it more stable for LLM reasoning.
- Dense per-step RL with Process Advantage Verifiers:
  - Train a separate process advantage model to predict the advantage of every step
  - Use this model to provide dense per-step rewards during RL training
  - Achieves a 5-6x improvement in sample efficiency and a 6-7% absolute improvement in accuracy over sparse reward RL
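The group-relative advantage that GRPO uses in place of a learned value function can be sketched as follows: each rollout's reward is normalized by the mean and standard deviation of the rewards in its group. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for this illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its group: (r - mean) / (std + eps).

    All rewards must come from rollouts of the same problem.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one problem with sparse 0/1 rewards: two correct, two incorrect.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# All rollouts correct: every advantage is zero, so this group gives no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

With a sparse 0/1 reward, a group in which every rollout succeeds or every rollout fails contributes no gradient, which is one reason the difficulty of training problems relative to the current policy matters.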
4.7 Modern Thinking Models
The remarkable performance of modern thinking models like DeepSeek-R1 and OpenAI's O-series is not due to revolutionary new RL algorithms. Instead, it comes from three key factors:
- More capable base models: Modern base models can implement complex meta-steps like self-verification, backtracking, and error correction
- Extended thinking budgets: Allowing models to generate thousands of tokens of reasoning instead of forcing them to produce short answers
- The same core RL techniques: All the methods described in this lecture (step-level credit assignment, process reward models, and online policy gradients) are still the foundation of these systems
These are my structured study notes and in-depth interpretations, compiled while watching the open course. I hope this framework helps you master the subject thoroughly. I wish you continued academic progress and great success in your studies.


