One. Course Details
This is the sixth lecture of Stanford University’s CME 295: Transformers and Large Language Models, taught by twin brothers Afshine and Shervine Amidi. The lecture opens with a brief recap of the three-stage LLM training pipeline covered in previous sessions: pre-training, supervised fine-tuning (SFT), and preference alignment via RLHF. It then moves on to LLM reasoning, a field that has progressed rapidly since late 2024.
The lecture follows a four-part structure. First, it defines what reasoning means in the context of LLMs and explains how reasoning models differ from traditional "vanilla" LLMs. Second, it covers the standard benchmarks and evaluation metrics used to quantify reasoning ability, with a deep dive into the widely used pass@k metric. Third, it introduces Group Relative Policy Optimization (GRPO), the RL algorithm behind DeepSeek R1 and many other recent reasoning models, and compares it side-by-side with the traditional PPO algorithm. Finally, it breaks down the complete training pipeline of DeepSeek R1, the open-source reasoning model that matched closed-source performance in early 2025.
The material is recent: almost all of the papers and techniques covered were published in 2024 or 2025.
Two. Key Learning Objectives
By the end of this lecture, students will be able to:
- Define LLM reasoning and explain how chain-of-thought prompting enables models to solve multi-step problems.
- Compare and contrast reasoning models with traditional vanilla LLMs and identify their key strengths and weaknesses.
- Derive the pass@k evaluation metric from first principles and explain its significance for reasoning tasks.
- Describe the core intuition behind GRPO and identify the key differences between GRPO and PPO for RL training.
- Explain the length bias problem in standard GRPO and evaluate proposed solutions like DAPO and Dr. GRPO.
- Walk through the complete training pipeline of the DeepSeek R1 reasoning model, including the R1-Zero proof of concept.
- Explain how knowledge distillation works for reasoning models and why it is more efficient than training small models from scratch.
Three. Memorable Course Quotes
"Reasoning is the ability to break down a complex problem into multiple smaller, tractable steps and solve them sequentially to reach a final answer."
"Chain-of-thought works because it gives the model more compute: every additional token it generates is another full forward pass through the network."
"Pass@k estimates the probability that at least one out of k attempts at solving a problem will be correct."
"GRPO eliminates the need for a separate value function entirely by computing advantages relative to other completions for the same prompt."
"DeepSeek R1 proved that you can achieve state-of-the-art reasoning performance with a fully open-source pipeline built on verifiable rewards."
Four. Detailed Lecture Notes
4.1 LLM Reasoning Overview
The lecture begins by identifying four key weaknesses of traditional vanilla LLMs:
- Limited reasoning ability: Struggles with multi-step math, coding, and logic problems.
- Static knowledge: Trapped behind a fixed knowledge cutoff date.
- No action capability: Can only generate text, not perform real-world actions.
- Hard evaluation: Free-form text outputs are difficult to evaluate with traditional metrics.
While the last three weaknesses will be covered in later lectures, this lecture focuses exclusively on improving reasoning ability.
4.1.1 What is Reasoning?
There is no universally agreed-upon definition of reasoning, but for the purposes of this class, it is defined as the ability to solve problems that require a multi-step thought process. This includes:
- Math problems of all difficulty levels
- Competitive programming and software engineering tasks
- Logical deduction and problem solving
A key distinction is made between knowledge questions and reasoning questions:
- Knowledge question: "What is the course code for Stanford’s Transformers and LLMs class?" (Requires memorization)
- Reasoning question: "A bear was born in 2020. How old is the bear in 2025?" (Requires multi-step calculation)
4.1.2 Chain-of-Thought Foundation
All modern reasoning models are built on the core idea of chain-of-thought (CoT) prompting, which was introduced earlier in the course. Instead of asking the model to output only the final answer, CoT encourages it to first explain its reasoning step-by-step.
The lecture presents two key intuitions for why CoT works:
- It decomposes complex problems into smaller, simpler subproblems that the model has seen patterns for during pre-training.
- It effectively gives the model more compute time to solve the problem, as each additional reasoning token requires a full forward pass through the network.
4.1.3 Reasoning Model Architecture and Timeline
Reasoning models differ from vanilla LLMs in their output structure:
- Vanilla LLM: Input → Final answer
- Reasoning model: Input → Hidden reasoning chain → Final answer
The lecture provides a brief timeline of the emergence of reasoning models:
- September 2024: OpenAI releases o1-preview, the first widely available reasoning model
- December 2024: Google releases Gemini 2.0 Flash Thinking
- January 2025: DeepSeek releases R1, the first open-source model to match closed-source reasoning performance
- Early 2025: xAI, Anthropic, and Mistral all release their own reasoning models
An important practical note is that reasoning tokens are included in the output token count for API pricing, even when they are not shown to the end user.
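The pricing note above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the token counts, and the per-million-token prices are hypothetical and do not correspond to any provider's actual rates.

```python
def request_cost(input_tokens, visible_output_tokens, reasoning_tokens,
                 input_price_per_million, output_price_per_million):
    """Estimate the cost of one request when hidden reasoning tokens
    are billed at the output-token rate (all prices are hypothetical)."""
    billed_output_tokens = visible_output_tokens + reasoning_tokens
    total = (input_tokens * input_price_per_million
             + billed_output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical request: the user sees 200 output tokens,
# but the model also produced 1,800 hidden reasoning tokens.
cost = request_cost(
    input_tokens=1_000,
    visible_output_tokens=200,
    reasoning_tokens=1_800,
    input_price_per_million=1.0,   # made-up price
    output_price_per_million=4.0,  # made-up price
)
print(f"estimated cost: {cost:.4f}")
```

In this made-up case, the hidden reasoning chain accounts for nine tenths of the billed output tokens.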
4.2 Reasoning Benchmarks and Evaluation
Evaluating reasoning ability requires specialized benchmarks and metrics that differ from traditional NLP evaluation.
4.2.1 Standard Reasoning Benchmarks
There are two primary categories of reasoning benchmarks:
- Coding benchmarks: Evaluate the model’s ability to write correct code that passes test cases
  - HumanEval: 164 handwritten coding problems
  - Codeforces: Competitive programming problems of varying difficulty
  - SWE-bench: Real-world GitHub issues that require modifying existing codebases
- Math benchmarks: Evaluate the model’s ability to solve mathematical problems
  - GSM8K: Roughly 8,500 grade school math word problems
  - AIME: Problems from the American Invitational Mathematics Examination
  - Olympiad Benchmarks: Problems from international math olympiads
4.2.2 The pass@k Metric
The standard evaluation metric for reasoning tasks is pass@k, which estimates the probability that at least one out of k attempts at solving a problem will be correct.
The lecture derives the pass@k formula from first principles:
- Start with the definition: pass@k = 1 - P(all k attempts are incorrect)
- Assume we generate n total completions for a problem (with n ≥ k), of which c are correct
- Calculate the probability of drawing k incorrect completions without replacement from those n: there are C(n-c, k) ways to pick k incorrect completions out of C(n, k) ways to pick any k
- Simplify using combinatorial notation to get the final formula:
  pass@k = 1 - (C(n-c, k) / C(n, k))
A special case is pass@1, which reduces to c/n, the fraction of single attempts that are correct.
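The formula above translates directly into a few lines of Python. This is a minimal sketch using only the standard library; the function name `pass_at_k` and the example counts are chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled completions, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k completions are incorrect,
    # so any draw of k must contain a correct one and the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 10 samples for one problem, 3 of them correct.
for k in (1, 5, 10):
    print(k, round(pass_at_k(n=10, c=3, k=k), 4))
```

With k = 1 the function returns c/n (up to floating-point rounding), matching the special case noted above. Benchmark scores are then typically the average of this quantity over all problems.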
4.2.3 Temperature and pass@k
The temperature parameter used during generation has a significant impact on pass@k performance:
- Low temperature (T close to 0): Generates almost identical outputs every time (greedy decoding at T=0 is deterministic), so pass@k barely improves as k increases.
- Moderate temperature (T=0.4-0.8): Generates diverse but high-quality outputs. pass@k improves significantly as k increases.
- High temperature (T>1.0): Generates too much random noise. Overall performance degrades.
Papers typically report the temperature used to generate their benchmark results, since it can change the final numbers substantially.
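To see why temperature controls diversity, it helps to look at how it reshapes the next-token distribution. The sketch below assumes the usual convention of dividing logits by T before the softmax; the logits and function names are made up for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T=0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Sample a token index; T=0 is treated as greedy (argmax) decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
print(sample_token(logits, 0.7, random.Random(0)))
```

At low T almost all probability mass sits on the top token, so repeated attempts look alike; at high T the distribution flattens and unlikely tokens are sampled more often.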
4.3 Training Reasoning Models with RL
The lecture explains why reinforcement learning (RL) is the preferred method for training reasoning models, rather than traditional supervised fine-tuning.
4.3.1 Why Not SFT?
There are three key limitations to using SFT for reasoning training:
- Data scarcity: Writing high-quality, correct reasoning chains is extremely difficult and time-consuming.
- Mismatched reasoning styles: Human-written reasoning chains may not match the way an LLM naturally thinks.
- No negative signal: SFT only teaches the model what to do, not what to avoid.
In contrast, reasoning tasks have a unique advantage: verifiable rewards. For coding problems, we can run test cases to check if the solution is correct. For math problems, we can compare the final answer to a known ground truth. This yields reliable, automatically computed reward signals without any human labeling.
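A verifiable reward for math problems can be as simple as a string comparison. The sketch below is an assumption-laden illustration: the `Answer:` marker, the function names, and the binary 0/1 reward are conventions invented here, not a format described in the lecture.

```python
import re


def extract_final_answer(completion: str):
    """Return the text after the last 'Answer:' marker, or None if absent.
    The 'Answer:' convention is a hypothetical output format."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parsable final answer, so no reward
    return 1.0 if answer == ground_truth.strip() else 0.0


completion = "The bear was born in 2020, and 2025 - 2020 = 5.\nAnswer: 5"
print(math_reward(completion, "5"))
```

A real checker would usually normalize answers first (for example, treating `5` and `5.0` as equal); exact string matching is used here only to keep the sketch short.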
4.3.2 Group Relative Policy Optimization (GRPO)
GRPO (Group Relative Policy Optimization) is the state-of-the-art RL algorithm for training reasoning models, released in 2024. It was designed specifically to address the limitations of PPO for LLM training.
The core idea behind GRPO is extremely simple:
- For each prompt, generate g completions (a "group") from the current policy
- Compute the reward for each completion
- Calculate the advantage for each completion as its reward relative to the average reward of the group
- Update the policy using these advantages, with a KL penalty to prevent deviation from the base model
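The advantage step above can be sketched in a few lines. The notes describe subtracting the group mean; the original GRPO formulation additionally divides by the group's standard deviation, which is included here as an optional flag. The function name and the `eps` constant are choices made for this illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-6):
    """Advantage of each completion relative to its own group.

    rewards: one scalar reward per completion sampled for the same prompt.
    normalize_std: also divide by the group's standard deviation.
    eps: small constant (an assumption here) to avoid division by zero.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


# Four completions for one prompt, scored with a 0/1 verifiable reward.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards, normalize_std=False))
print([round(a, 3) for a in group_relative_advantages(rewards)])
```

Note that if every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.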
4.3.3 GRPO vs. PPO
The lecture provides a detailed side-by-side comparison of GRPO and PPO:
| Aspect | GRPO | PPO |
|---|---|---|
| Advantage calculation | Reward minus group average reward | Reward minus value function estimate |
| Required models | Policy model + reference model | Policy model + reference model + value function + reward model |
| Training complexity | Low | High |
| Stability | Excellent | Good but requires careful tuning |
The key advantage of GRPO is that it eliminates the need for a separate value function entirely, which simplifies the training pipeline, reduces memory usage, and improves stability.
Both algorithms use the same clipping mechanism to prevent excessively large policy updates, and both include a KL divergence penalty to prevent catastrophic forgetting.
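The shared clipping mechanism can be illustrated for a single token. This is a minimal sketch of the standard PPO-style clipped surrogate; it omits the KL penalty, and the default `clip_eps=0.2` is a commonly used value assumed here rather than one taken from the lecture.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-token clipped objective (to be maximized).

    logp_new / logp_old: log-probability of the token under the current
    policy and under the policy that generated the sample.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio of 1.5 with a positive advantage: the gain is capped at 1.2 * A.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
# Ratio of 0.5 with a negative advantage: capped at 0.8 * A.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```

The only difference between the two algorithms at this step is where `advantage` comes from: a learned value function in PPO, the group statistics in GRPO.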
4.3.4 Length Bias in GRPO
A major flaw in the original GRPO formulation is length bias, where the model learns to generate increasingly long reasoning chains even when they do not improve performance.
The lecture explains the mathematical cause of this bias:
- The original GRPO loss normalizes each completion’s contribution by its length
- This means tokens in short completions have much higher weight than tokens in long completions
- For an incorrect completion (negative advantage), the penalty is spread over more tokens when the completion is longer, so each token is penalized less; the model learns that generating longer bad outputs is better than generating shorter bad outputs
- Over time, this leads to ever-increasing output lengths with no corresponding performance gain
Two proposed solutions to this problem are:
- DAPO: Equalizes token-level contributions across all completions
- Dr. GRPO: Removes the length normalization factor entirely
Both methods successfully eliminate length bias while maintaining or improving reasoning performance.
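The difference between the three aggregation schemes is easiest to see by computing the weight each token receives in the loss. The sketch below reflects one reading of the three methods; the function name, the scheme labels, and the use of a fixed `max_len` constant for Dr. GRPO are assumptions made for illustration.

```python
def per_token_weights(lengths, scheme, max_len=None):
    """Weight applied to every token of each completion in a group.

    lengths: number of tokens in each completion of the group.
    scheme:  "grpo"    -> mean over tokens, then mean over completions
             "dapo"    -> mean over all tokens in the group
             "dr_grpo" -> divide by a constant instead of each length
    """
    group_size = len(lengths)
    if scheme == "grpo":
        return [1.0 / (group_size * n) for n in lengths]
    if scheme == "dapo":
        total_tokens = sum(lengths)
        return [1.0 / total_tokens for _ in lengths]
    if scheme == "dr_grpo":
        constant = max_len if max_len is not None else max(lengths)
        return [1.0 / (group_size * constant) for _ in lengths]
    raise ValueError(f"unknown scheme: {scheme}")


lengths = [10, 100]  # a short and a long completion for the same prompt
for scheme in ("grpo", "dapo", "dr_grpo"):
    print(scheme, per_token_weights(lengths, scheme))
```

Under the `"grpo"` scheme a token in the 10-token completion carries ten times the weight of a token in the 100-token one, so a long wrong answer is penalized less per token. Under the other two schemes every token carries the same weight regardless of which completion it belongs to.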
4.4 The DeepSeek R1 Training Pipeline
The lecture concludes with a detailed breakdown of the training pipeline for DeepSeek R1, the groundbreaking open-source reasoning model released in January 2025.
4.4.1 R1-Zero: The Proof of Concept
DeepSeek first demonstrated the power of RL for reasoning with their R1-Zero model, which:
- Started directly from a raw pre-trained base model with no SFT
- Trained exclusively using verifiable rewards for correct answers and proper formatting
- Achieved state-of-the-art reasoning performance on math benchmarks
However, R1-Zero had significant limitations:
- Reasoning chains often mixed languages
- Outputs had syntax and formatting issues
- The model lacked general assistant capabilities
4.4.2 Full DeepSeek R1 Pipeline
The complete R1 training pipeline addresses these limitations with a multi-stage approach:
- Pre-training: Train a standard decoder-only transformer on trillions of tokens of text and code
- Cold-start SFT: Train on a small set of high-quality, human-curated reasoning chains to fix formatting and language issues
- Reasoning RL: Train using GRPO with verifiable rewards for correct answers, proper formatting, and language consistency
- Large-scale SFT: Train on a mixture of about 200k general assistant examples and about 600k high-quality reasoning examples generated via rejection sampling
- Mixed RL: Final RL stage that combines reasoning rewards with standard helpfulness and harmlessness rewards for general assistant capabilities
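The rejection sampling used to build the large-scale SFT data can be sketched as "sample several completions, keep only the ones that pass a check". Everything below is a toy illustration: `generate` stands in for a call to the RL-trained model, and the canned completions, the `Answer:` suffix check, and the function names are hypothetical.

```python
def rejection_sample(prompt, ground_truth, generate, is_correct, num_samples=4):
    """Keep only sampled completions that pass the correctness check."""
    kept = []
    for _ in range(num_samples):
        completion = generate(prompt)
        if is_correct(completion, ground_truth):
            kept.append({"prompt": prompt, "completion": completion})
    return kept


# Stand-in for a model: returns canned completions one after another.
canned = iter([
    "2025 - 2020 = 4. Answer: 4",
    "2025 - 2020 = 5. Answer: 5",
    "Count the years: 5. Answer: 5",
    "Bears are mammals.",
])


def fake_generate(prompt):
    return next(canned)


def ends_with_answer(completion, truth):
    return completion.strip().endswith(f"Answer: {truth}")


sft_examples = rejection_sample(
    "A bear was born in 2020. How old is the bear in 2025?",
    "5", fake_generate, ends_with_answer,
)
print(len(sft_examples))
```

The surviving prompt/completion pairs are then used as ordinary SFT training examples.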
4.4.3 Distilling Reasoning Models
For smaller models that cannot be trained from scratch with RL, knowledge distillation is the most effective approach:
- Use a large teacher reasoning model to generate complete outputs including reasoning chains
- Train a smaller student model via standard SFT to mimic the teacher’s outputs exactly
- The student model inherits most of the teacher’s reasoning ability at a fraction of the size
This approach is significantly more efficient than trying to train small reasoning models from scratch with RL.
These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you thoroughly master the content of this subject. Wish you continuous academic progress and great achievements in your studies.


