Deep Reinforcement Learning, Lecture 10: Reinforcement Learning for LLM Reasoning (Structured Notes & In-Depth Analysis)

One. Course Details

This is the tenth lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by a guest lecturer specializing in large language model reasoning. It focuses entirely on reinforcement learning techniques for LLM reasoning, with a specific emphasis on mathematical problem-solving tasks.

The lecture organizes the field of reasoning RL into two eras: methods developed before the release of DeepSeek-R1 in early 2025, and the modern approaches that power today's state-of-the-art "thinking models". It presents a systematic data scaling analysis of different training paradigms, explains the critical problem of spurious steps and generalization collapse in on-policy imitation learning, and introduces step-level credit assignment as the key breakthrough that enabled dramatic improvements in reasoning performance. The lecture concludes with an explanation of how the same core RL techniques, when combined with more capable base models and extended thinking budgets, produce the remarkable reasoning abilities of modern models like DeepSeek-R1 and OpenAI's o-series.

Two. Key Learning Objectives

By the end of this lecture, students should be able to:

Formulate the LLM reasoning problem as a sparse reward Markov Decision Process (MDP) with deterministic dynamics
Compare the data efficiency and limitations of three core training paradigms: supervised fine-tuning (SFT), rejection fine-tuning (RFT), and full reinforcement learning
Explain the spurious step problem and why excessive on-policy imitation learning leads to generalization collapse
Implement step-level credit assignment using rollout-based Q-value estimation and advantage functions
Apply advantage filtering and DPO-based offline RL to reasoning tasks
Understand the differences between sparse end-of-trajectory RL and dense per-step RL with process reward models
Explain the principles behind modern thinking models and why extended thinking budgets lead to better performance

Three. Memorable Course Quotes

"Supervised fine-tuning teaches models to write solutions that look correct. Reinforcement learning teaches them to write solutions that are correct."
"On-policy imitation learning can be 2x more data-efficient than off-policy imitation learning for reasoning tasks—but it hits a hard generalization wall if you push it too far."
"Spurious steps are incorrect or irrelevant steps that accidentally lead to the right answer on the training set but break generalization to new problems."
"The biggest breakthrough in RL for reasoning is moving from sparse end-of-trajectory rewards to dense per-step credit assignment."
"The magic of modern thinking models is not in new RL algorithms—it's in giving models the freedom to define their own steps and think for longer."
"We are running out of high-quality human-written reasoning data. The future of reasoning models lies in training on self-generated data with reinforcement learning."

Four. Detailed Study Notes

4.1 Problem Formulation: Reasoning as an MDP

We can formalize the task of solving math problems (or any step-by-step reasoning task) as a Markov Decision Process with the following components:

Initial state S₀: The natural language problem statement
States Sᵢ: The concatenation of the initial problem and all reasoning steps generated so far
Actions Aᵢ: The next reasoning step (a sentence, paragraph, or logical block of text)
Dynamics: Fully deterministic—taking action Aᵢ in state Sᵢ always leads to state Sᵢ₊₁ = Sᵢ + Aᵢ
Reward: A sparse binary reward of +1 if the final answer is correct, 0 otherwise

This formulation reveals the core challenge of reasoning RL: we only receive feedback at the very end of a potentially long trajectory, with no information about which individual steps were correct or incorrect.
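
The formulation above can be sketched in a few lines of Python. This is only an illustrative toy, not code from the lecture: the names `ReasoningState`, `transition`, and `terminal_reward` are hypothetical, and grading by exact string match is an assumption (real graders normalize answers).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReasoningState:
    """A state is the problem statement plus every step written so far."""
    problem: str
    steps: Tuple[str, ...] = ()

    def text(self) -> str:
        return "\n".join((self.problem,) + self.steps)


def transition(state: ReasoningState, action: str) -> ReasoningState:
    """Deterministic dynamics: the next state is the old state with the step appended."""
    return ReasoningState(state.problem, state.steps + (action,))


def terminal_reward(final_answer: str, reference_answer: str) -> float:
    """Sparse binary reward, available only once the trajectory has ended.

    Exact string match is an assumption made for this sketch.
    """
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


s0 = ReasoningState("If 100 apples are split equally between 2 people, how many does each get?")
s1 = transition(s0, "100 / 2 = 50.")
s2 = transition(s1, "Each person gets 50 apples.")
print(s2.text())
print(terminal_reward("50", "50"))
```

Note that nothing in `terminal_reward` looks at the intermediate steps, which is exactly the credit assignment gap described above.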

4.2 The Three Axes of Reasoning Training Methods

Reasoning training methods can be grouped into three paradigms according to the data they learn from:

Supervised Fine-Tuning (SFT): Train on human-written correct solutions
Rejection Fine-Tuning (RFT): Train on model-generated correct solutions
Full Reinforcement Learning: Learn from both correct and incorrect solutions

4.2.1 Supervised Fine-Tuning (SFT)

SFT is the simplest baseline: train the model to predict the next token in human-written step-by-step solutions.

Data efficiency: Extremely low. Test error decreases roughly as D⁻⁰·¹⁵, a power law in the dataset size D, so every constant-factor reduction in error requires a large multiplicative increase in data
Limitations:
- We will run out of high-quality human-written reasoning data by 2028
- Models learn to produce solutions that look human-like but often contain logical errors
- Performance is capped by the ability of the humans who wrote the training data
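
To get a feel for how slow a D⁻⁰·¹⁵ scaling law is, here is a small worked calculation. It assumes the error is exactly proportional to D raised to the power -0.15, which is an idealization of the trend reported in the notes; the function name `data_multiplier` is hypothetical.

```python
def data_multiplier(error_ratio: float, exponent: float = 0.15) -> float:
    """Factor by which the dataset must grow so the error shrinks to
    `error_ratio` times its current value, assuming error ~ D ** (-exponent)."""
    if not 0.0 < error_ratio < 1.0:
        raise ValueError("error_ratio must be strictly between 0 and 1")
    # error_new / error_old = (D_new / D_old) ** (-exponent)
    # => D_new / D_old = error_ratio ** (-1 / exponent)
    return error_ratio ** (-1.0 / exponent)


print(data_multiplier(0.5))  # data growth needed to halve the error
print(data_multiplier(0.9))  # data growth needed for a 10% relative reduction
```

Under this assumption, halving the test error takes roughly 100 times more data, and even a 10% relative reduction takes about twice as much.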

4.2.2 Rejection Fine-Tuning (RFT)

RFT is a simple but powerful improvement over SFT that leverages the model's own generations:

Sample N solutions from the current model for each problem
Filter out all incorrect solutions, retaining only those that produce the right answer
Fine-tune the model on the filtered set of correct self-generated solutions

Data efficiency: 2x higher than SFT. You can achieve the same test error with half as many problems
Critical limitation: Generalization collapse with excessive training. As you increase the number of correct solutions per problem beyond a certain point, test error starts to increase rather than decrease
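
The sample-filter-train loop above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: `collect_rft_data` and `toy_sampler` are hypothetical names, the sampler is a random stand-in for an LLM, and the fine-tuning step itself is omitted.

```python
import random
from typing import Callable, Dict, List, Tuple

Solution = Tuple[str, str]  # (reasoning text, final answer)


def collect_rft_data(
    problems: Dict[str, str],
    sample_solution: Callable[[str], Solution],
    n_samples: int,
) -> List[Tuple[str, str]]:
    """Sample n_samples solutions per problem and keep those with the right answer."""
    dataset = []
    for problem, reference in problems.items():
        for _ in range(n_samples):
            reasoning, answer = sample_solution(problem)
            if answer == reference:  # filter on the final answer only
                dataset.append((problem, reasoning))
    return dataset


def toy_sampler(problem: str) -> Solution:
    """Hypothetical stand-in for an LLM: correct about 60% of the time."""
    if random.random() < 0.6:
        return ("100 / 2 = 50.", "50")
    return ("100 * 2 = 200.", "200")


random.seed(0)
problems = {"Split 100 apples between 2 people.": "50"}
rft_data = collect_rft_data(problems, toy_sampler, n_samples=8)
print(len(rft_data), "correct solutions kept out of 8")
```

Because the filter checks only the final answer, any solution that reaches the right answer is kept, whatever its intermediate steps look like.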

4.3 The Spurious Step Problem

The generalization collapse in RFT is caused by spurious steps: incorrect or irrelevant steps that accidentally lead to the correct answer on the training set but fail catastrophically on new, unseen problems. A classic example:

Problem: "If 100 apples are split equally between 2 people, how many apples does each person get?"
Spurious solution: "100 × 2 = 200. Therefore, 100 ÷ 2 = 50. Each person gets 50 apples."

While this solution produces the correct final answer, the first step is completely wrong. Excessive RFT teaches the model to reproduce these spurious patterns, which do not generalize to new problems.

4.4 Step-Level Credit Assignment: The Key Breakthrough

The solution to the spurious step problem is step-level credit assignment: instead of treating the entire trajectory as either good or bad, evaluate the quality of each individual step.

4.4.1 Rollout-Based Q-Value Estimation

We estimate the quality of a prefix (state) using rollouts:

Take a partial solution prefix
Sample K complete solutions from a rollout policy starting from this prefix
The Q-value of the prefix is the fraction of these rollouts that produce the correct answer

This Q-value represents the expected future success probability from that point onward.
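
A minimal sketch of this Monte Carlo estimate is shown below. The rollout policy is replaced by a hypothetical `toy_rollout` function with made-up success rates; in practice each call would complete the solution with an LLM and grade the final answer.

```python
import random
from typing import Callable


def estimate_q(prefix: str, rollout_is_correct: Callable[[str], bool], k: int) -> float:
    """Fraction of k sampled completions of `prefix` that reach the correct answer."""
    if k <= 0:
        raise ValueError("k must be positive")
    successes = sum(1 for _ in range(k) if rollout_is_correct(prefix))
    return successes / k


def toy_rollout(prefix: str) -> bool:
    """Hypothetical stand-in for 'complete the solution and grade it'.

    Prefixes that already contain the right operation succeed more often.
    """
    success_rate = 0.9 if "100 / 2" in prefix else 0.4
    return random.random() < success_rate


random.seed(0)
problem = "Split 100 apples between 2 people."
print(estimate_q(problem, toy_rollout, k=200))
print(estimate_q(problem + "\n100 / 2 = 50.", toy_rollout, k=200))
```

The estimate is noisy for small K, so the choice of K trades compute against the reliability of the resulting step-level signal.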

4.4.2 Advantage Functions for Step Evaluation

We define the advantage of a step as the change in Q-value caused by taking that step: A(Sᵢ₋₁, Aᵢ) = Q(Sᵢ) − Q(Sᵢ₋₁), where Sᵢ = Sᵢ₋₁ + Aᵢ is the prefix after the step has been appended.

Positive advantage: The step improved the model's chance of success
Negative advantage: The step was harmful or spurious and reduced the model's chance of success

This allows us to identify exactly which steps in a trajectory are good and which are bad, even if the final answer is correct.
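
As a small worked example, the advantages of a three-step trajectory can be computed directly from the estimated value of each prefix. The Q-values below are hypothetical numbers chosen for illustration, and `step_advantages` is a name introduced for this sketch.

```python
from typing import List


def step_advantages(q_values: List[float]) -> List[float]:
    """q_values[i] is the estimated success rate of the prefix after i steps
    (q_values[0] is the bare problem). Returns one advantage per step."""
    return [q_values[i] - q_values[i - 1] for i in range(1, len(q_values))]


# Hypothetical estimates for: problem, after step 1, after step 2, after step 3.
q = [0.5, 0.25, 0.75, 1.0]
adv = step_advantages(q)
print(adv)
```

Here the first step lowers the estimated success rate (negative advantage) even though the trajectory ends with the correct answer. The advantages telescope: their sum equals the final Q-value minus the initial one.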

4.5 Training Methods with Step-Level Credit

4.5.1 Advantage Filtered Imitation Learning

The simplest way to use advantages:

Sample trajectories from the current model
Compute the advantage for every step in every trajectory
Add only steps with positive advantages to the training set
Train the model to imitate these good steps

This method:

Completely eliminates the generalization collapse of RFT
Allows us to extract good steps from incorrect trajectories
Allows us to discard bad steps from correct trajectories
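
The filtering step can be sketched as below, assuming per-step advantages have already been estimated. The function name, the zero threshold default, and the choice to condition each kept step on the full preceding prefix (including any discarded steps) are assumptions of this sketch, not details confirmed by the lecture.

```python
from typing import List, Tuple


def filter_positive_steps(
    problem: str,
    steps: List[str],
    advantages: List[float],
    threshold: float = 0.0,
) -> List[Tuple[str, str]]:
    """Return (prefix, step) imitation targets for steps whose advantage exceeds threshold."""
    if len(steps) != len(advantages):
        raise ValueError("need exactly one advantage per step")
    examples = []
    prefix = problem
    for step, advantage in zip(steps, advantages):
        if advantage > threshold:
            examples.append((prefix, step))
        # The prefix always grows, since later steps were generated from it.
        prefix = prefix + "\n" + step
    return examples


problem = "If 100 apples are split equally between 2 people, how many does each get?"
steps = ["100 × 2 = 200.", "Therefore, 100 ÷ 2 = 50.", "Each person gets 50 apples."]
advantages = [-0.25, 0.5, 0.25]  # hypothetical values
for prefix, step in filter_positive_steps(problem, steps, advantages):
    print(repr(step))
```

With these hypothetical advantages, the spurious multiplication step is dropped from the training targets while the two useful steps are kept.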

4.5.2 Offline RL with DPO

We can also use step-level credit to construct preference pairs for Direct Preference Optimization:

Take a partial solution prefix
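Take the original step from the model's trajectory at that prefix, one whose estimated advantage is low or negative (the "bad" step)
Take an alternative step at the same prefix from a rollout that led to success (the "good" step)
Construct a preference pair: (prefix + good step) ≻ (prefix + bad step)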
Train the model using the standard DPO loss

Data efficiency: 8x higher than SFT, 4x higher than RFT
Stability: Inherits all the stability benefits of DPO compared to traditional RL
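
For reference, the standard DPO loss on a single preference pair can be written in plain Python as below. The inputs are summed token log-probabilities of the good ("chosen") and bad ("rejected") step given the shared prefix, under the policy being trained and under a frozen reference model. The value `beta=0.1` and the example numbers are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))) for one pair."""
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is zero and the loss is log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# Policy already prefers the good step more than the reference does: lower loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In a real training loop the same expression would be computed on tensors and averaged over a batch of step-level pairs.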

4.6 Online Reinforcement Learning for Reasoning

For maximum performance, we can use online RL methods that continuously improve the policy through interaction:

Basic online RL: Use policy gradients with the sparse end-of-trajectory reward. This is essentially RFT without explicit filtering.
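GRPO (Group Relative Policy Optimization): A variant of PPO that drops the learned value function. Instead, it samples a group of rollouts for the same problem and scores each one relative to the group (its reward minus the group mean, scaled by the group standard deviation), making it more stable for LLM reasoning.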
Dense per-step RL with Process Advantage Verifiers:
- Train a separate process advantage model to predict the advantage of every step
- Use this model to provide dense per-step rewards during RL training
- Achieves a 5-6x improvement in sample efficiency and a 6-7% absolute improvement in accuracy over sparse reward RL
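
To make the GRPO item above concrete, here is a minimal sketch of the group-relative advantage computation for one problem with sparse 0/1 rewards. Using the population standard deviation and the specific `eps` value are assumptions of this sketch; implementations differ on such details, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each rollout relative to the other rollouts for the same problem."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed)
print(all_correct)
```

When every rollout in a group receives the same reward, all advantages are zero and that problem contributes no learning signal. Every token in a rollout also shares one advantage value, which is the sparse-reward limitation that dense per-step methods aim to address.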

4.7 Modern Thinking Models

The remarkable performance of modern thinking models like DeepSeek-R1 and OpenAI's o-series is not due to revolutionary new RL algorithms. Instead, it comes from three key factors:

More capable base models: Modern base models can implement complex meta-steps like self-verification, backtracking, and error correction
Extended thinking budgets: Allowing models to generate thousands of tokens of reasoning instead of forcing them to produce short answers
The same core RL techniques: All the methods described in this lecture—step-level credit assignment, process reward models, and online policy gradients—are still the foundation of these systems

The key insight is that as base models become more capable of defining their own reasoning steps, the same RL algorithms yield substantially better results.

These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you master the content of this subject thoroughly. Wish you continuous academic progress and great achievements in your studies.

Video Source and Usage Instructions

• Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 10: RL for LLM Reasoning Stanford Online
• Course Series: Stanford CS224R Deep Reinforcement Learning
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/O2VpNnwB4lM?si=_CAE2ICXxshOAFsH


