Lecture 6: LLM Reasoning, GRPO and the DeepSeek R1 Training Pipeline

One. Course Details

This is the sixth lecture of Stanford University’s CME 295: Transformers and Large Language Models, taught by twin brothers Afshine and Shervine Amidi. The lecture opens with a brief recap of the three-stage LLM training pipeline covered in previous sessions: pre-training, supervised fine-tuning (SFT), and preference alignment via RLHF. It then turns to LLM reasoning, an area that has advanced rapidly since late 2024.

The lecture follows a four-part structure. First, it defines what reasoning means in the context of LLMs and explains how reasoning models differ from traditional "vanilla" LLMs. Second, it covers the standard benchmarks and evaluation metrics used to quantify reasoning ability, with a deep dive into the widely used pass@k metric. Third, it introduces Group Relative Policy Optimization (GRPO), the RL algorithm used to train DeepSeek R1 and widely adopted for reasoning models since, and compares it side-by-side with the traditional PPO algorithm. Finally, it breaks down the complete training pipeline of DeepSeek R1, the open-source reasoning model that matched closed-source performance in early 2025.

The material is recent: almost all of the papers and techniques covered in this lecture were published in 2024 or 2025.

Two. Key Learning Objectives

By the end of this lecture, students will be able to:

Define LLM reasoning and explain how chain-of-thought prompting enables models to solve multi-step problems.

Compare and contrast reasoning models with traditional vanilla LLMs and identify their key strengths and weaknesses.

Derive the pass@k evaluation metric from first principles and explain its significance for reasoning tasks.

Describe the core intuition behind GRPO and identify the key differences between GRPO and PPO for RL training.

Explain the length bias problem in standard GRPO and evaluate proposed solutions like DAPO and Dr. GRPO.

Walk through the complete training pipeline of the DeepSeek R1 reasoning model, including the R1-Zero proof of concept.

Explain how knowledge distillation works for reasoning models and why it is more efficient than training small models from scratch.

Three. Memorable Course Quotes

"Reasoning is the ability to break down a complex problem into multiple smaller, tractable steps and solve them sequentially to reach a final answer."

"Chain-of-thought works because it gives the model more compute—every additional token it generates is another full forward pass through the network."

"Pass@k estimates the probability that at least one out of k attempts at solving a problem will be correct."

"GRPO eliminates the need for a separate value function entirely by computing advantages relative to other completions for the same prompt."

"DeepSeek R1 proved that you can achieve state-of-the-art reasoning performance with a fully open-source pipeline built on verifiable rewards."

Four. Detailed Lecture Notes

4.1 LLM Reasoning Overview

The lecture begins by identifying four key weaknesses of traditional vanilla LLMs:

Limited reasoning ability: Struggles with multi-step math, coding, and logic problems.
Static knowledge: Trapped behind a fixed knowledge cutoff date.
No action capability: Can only generate text, not perform real-world actions.
Hard evaluation: Free-form text outputs are difficult to evaluate with traditional metrics.

While the last three weaknesses will be covered in later lectures, this lecture focuses exclusively on improving reasoning ability.

4.1.1 What is Reasoning?

There is no universally agreed-upon definition of reasoning, but for the purposes of this class, it is defined as the ability to solve problems that require a multi-step thought process. This includes:

Math problems of all difficulty levels
Competitive programming and software engineering tasks
Logical deduction and problem solving

A key distinction is made between knowledge questions and reasoning questions:

Knowledge question: "What is the course code for Stanford’s Transformers and LLMs class?" (Requires memorization)
Reasoning question: "A bear was born in 2020. How old is the bear in 2025?" (Requires multi-step calculation)

4.1.2 Chain-of-Thought Foundation

All modern reasoning models are built on the core idea of chain-of-thought (CoT) prompting, which was introduced earlier in the course. Instead of asking the model to output only the final answer, CoT encourages it to first explain its reasoning step-by-step.

The lecture presents two key intuitions for why CoT works:

It decomposes complex problems into smaller, simpler subproblems whose patterns the model has likely seen during pre-training.
It effectively gives the model more compute time to solve the problem, as each additional reasoning token requires a full forward pass through the network.
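
The second intuition can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the common rule of thumb of about 2 FLOPs per parameter per generated token for a forward pass; the 7B model size and the token counts are hypothetical, and the helper name forward_flops is not from the lecture.

```python
def forward_flops(num_params: float, num_tokens: int) -> float:
    """Rough forward-pass cost for generating `num_tokens` tokens.

    Assumes about 2 FLOPs per parameter per token. This ignores the cost of
    attention over the growing context, so it is an illustrative lower bound.
    """
    return 2.0 * num_params * num_tokens


# Hypothetical 7B-parameter model answering the same question two ways.
direct_answer = forward_flops(7e9, num_tokens=5)     # final answer only
with_reasoning = forward_flops(7e9, num_tokens=200)  # reasoning chain + answer

print(f"direct:    {direct_answer:.2e} FLOPs")
print(f"reasoning: {with_reasoning:.2e} FLOPs")
print(f"ratio:     {with_reasoning / direct_answer:.0f}x")
```

Under these assumptions the reasoning chain spends 40 times more compute on the same question, simply because it emits 40 times more tokens.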

4.1.3 Reasoning Model Architecture and Timeline

Reasoning models differ from vanilla LLMs in their output structure:

Vanilla LLM: Input → Final answer
Reasoning model: Input → Hidden reasoning chain → Final answer

The lecture provides a brief timeline of the emergence of reasoning models:

September 2024: OpenAI releases o1-preview, the first widely available reasoning model
December 2024: Google releases Gemini 2.0 Flash Thinking
January 2025: DeepSeek releases R1, the first open-source model to match closed-source reasoning performance
Over the course of 2025: xAI, Anthropic, and Mistral all release their own reasoning models

An important practical note is that reasoning tokens are included in the output token count for API pricing, even when they are not shown to the end user.
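
A small arithmetic sketch of what this means for a bill. The prices and token counts below are hypothetical placeholders, not any provider's actual rates, and request_cost is an illustrative helper rather than a real API.

```python
def request_cost(input_tokens: int, reasoning_tokens: int, answer_tokens: int,
                 price_in_per_million: float, price_out_per_million: float) -> float:
    """Cost of one request when hidden reasoning tokens are billed as output."""
    billed_output = reasoning_tokens + answer_tokens
    total = input_tokens * price_in_per_million + billed_output * price_out_per_million
    return total / 1_000_000


# Hypothetical request: short prompt, long hidden reasoning, short visible answer.
cost = request_cost(
    input_tokens=500,
    reasoning_tokens=4000,
    answer_tokens=300,
    price_in_per_million=1.0,   # assumed price, in dollars per million tokens
    price_out_per_million=4.0,  # assumed price, in dollars per million tokens
)
print(f"${cost:.4f}")
```

In this made-up example the hidden reasoning accounts for most of the cost even though the user only sees the 300-token answer.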

4.2 Reasoning Benchmarks and Evaluation

Evaluating reasoning ability requires specialized benchmarks and metrics that differ from traditional NLP evaluation.

4.2.1 Standard Reasoning Benchmarks

There are two primary categories of reasoning benchmarks:

Coding benchmarks: Evaluate the model’s ability to write correct code that passes test cases
- HumanEval: 164 handwritten coding problems
- Codeforces: Competitive programming problems of varying difficulty
- SWE-bench: Real-world GitHub issues that require modifying existing codebases
Math benchmarks: Evaluate the model’s ability to solve mathematical problems
- GSM8K: Roughly 8,500 grade school math word problems
- AIME: Problems from the American Invitational Mathematics Examination
- Olympiad Benchmarks: Problems from international math olympiads

4.2.2 The pass@k Metric

The standard evaluation metric for reasoning tasks is pass@k, which estimates the probability that at least one out of k attempts at solving a problem will be correct.

The lecture derives the pass@k formula from first principles:

Start with the definition: pass@k = 1 - P(all k attempts are incorrect)
Assume we generate n total completions for a problem, with c correct completions
Calculate the probability of drawing k incorrect completions without replacement
Simplify using combinatorial notation to get the final formula: pass@k = 1 - (C(n-c, k) / C(n, k))
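
Written out, the simplification in the last step goes as follows. This is a sketch of the standard argument using the same symbols n, c, and k as the steps above; it assumes the k attempts are drawn uniformly without replacement from the n generated completions.

```latex
\begin{align*}
\text{pass@}k
  &= 1 - P(\text{all } k \text{ drawn completions are incorrect}) \\
  &= 1 - \frac{n-c}{n}\cdot\frac{n-c-1}{n-1}\cdots\frac{n-c-k+1}{n-k+1} \\
  &= 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}
\end{align*}
```

When fewer than k incorrect completions exist (n - c < k), the binomial coefficient in the numerator is zero and pass@k equals 1.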

A special case is pass@1, which reduces to c/n, the fraction of single attempts that are correct.
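
The formula is straightforward to compute directly. The following is a minimal sketch using Python's math.comb; the function name pass_at_k and the example numbers are illustrative, not taken from the lecture.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled completions, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect completions exist,
    # in which case every draw of k completions contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 completions sampled, 3 of them correct.
for k in (1, 5, 8):
    print(k, round(pass_at_k(n=10, c=3, k=k), 4))
```

For k = 1 this returns c/n, matching the special case above. For many problems, the per-problem values are averaged to give the benchmark score.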

4.2.3 Temperature and pass@k

The temperature parameter used during generation has a significant impact on pass@k performance:

Low temperature (T close to 0): Generates nearly identical outputs every time (exactly identical under greedy decoding at T=0), so pass@k barely improves as k increases.
Moderate temperature (T=0.4-0.8): Generates diverse but high-quality outputs. pass@k improves significantly as k increases.
High temperature (T>1.0): Generates too much random noise. Overall performance degrades.

Because temperature has such a large effect on the final numbers, benchmark results are normally reported together with the sampling temperature used to generate them.
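
To see why temperature controls diversity, here is a minimal sketch of temperature-scaled softmax over a toy set of next-token scores. The logits are made up and softmax_with_temperature is an illustrative helper; real decoders apply the same scaling over the full vocabulary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T=0 means greedy argmax")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At low temperature almost all probability mass sits on the top token, so repeated samples coincide; at higher temperature the mass spreads out and samples diverge.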

4.3 Training Reasoning Models with RL

The lecture explains why reinforcement learning (RL) is the preferred method for training reasoning models, rather than traditional supervised fine-tuning.

4.3.1 Why Not SFT?

There are three key limitations to using SFT for reasoning training:

Data scarcity: Writing high-quality, correct reasoning chains is extremely difficult and time-consuming.
Mismatched reasoning styles: Human-written reasoning chains may not match the way an LLM naturally thinks.
No negative signal: SFT only teaches the model what to do, not what to avoid.

In contrast, reasoning tasks have a unique advantage: verifiable rewards. For coding problems, we can run test cases to check if the solution is correct. For math problems, we can compare the final answer to a known ground truth. This allows us to compute reliable reward signals automatically, without any human labeling.
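
A minimal sketch of what such verifiable rewards can look like. The function names, the 0/1 reward scale, and the answer normalization are assumptions made for illustration; production systems use more careful answer matching and run candidate code inside a sandbox.

```python
def _normalize(answer: str) -> str:
    """Strip whitespace, thousands separators, and a trailing period."""
    return answer.strip().replace(",", "").rstrip(".")


def math_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if _normalize(model_answer) == _normalize(ground_truth) else 0.0


def code_reward(solution, test_cases) -> float:
    """Return 1.0 only if `solution` passes every (args, expected) test case."""
    for args, expected in test_cases:
        try:
            if solution(*args) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failed test
    return 1.0


print(math_reward(" 1,000 ", "1000"))
print(code_reward(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]))
```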

4.3.2 Group Relative Policy Optimization (GRPO)

GRPO is an RL algorithm introduced by DeepSeek in 2024 (in the DeepSeekMath paper) and now widely used for training reasoning models. It was designed specifically to address the limitations of PPO for LLM training.

The core idea behind GRPO is extremely simple:

For each prompt, generate g completions (a "group") from the current policy
Compute the reward for each completion
Calculate the advantage for each completion as its reward minus the group’s average reward (the original formulation also divides by the group’s reward standard deviation)
Update the policy using these advantages, with a KL penalty that keeps it close to a frozen reference model
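
The advantage step can be sketched in a few lines. This assumes the variant that standardizes by the group's standard deviation; the use of the population standard deviation and the small eps constant are implementation choices made here for illustration, not details from the lecture.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each completion relative to its own group.

    Subtracts the group mean and divides by the group standard deviation.
    `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 completions for one prompt, with 0/1 rewards.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# If every completion gets the same reward, no completion stands out
# and the group contributes no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Every token of a completion shares that completion's advantage, which is why no learned value function is needed.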

4.3.3 GRPO vs. PPO

The lecture provides a detailed side-by-side comparison of GRPO and PPO:

Aspect	GRPO	PPO
Advantage calculation	Reward minus group average reward	Reward minus value function estimate
Required models	Policy model + reference model (rewards come from a verifier or a reward model)	Policy model + reference model + value function + reward model
Training complexity	Low	High
Stability	Excellent	Good but requires careful tuning

The key advantage of GRPO is that it eliminates the need for a separate value function entirely, which simplifies the training pipeline, reduces memory usage, and improves stability.

Both algorithms use the same clipping mechanism to prevent excessively large policy updates, and both include a KL divergence penalty to prevent catastrophic forgetting.
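
A per-token sketch of the clipped objective with a KL penalty. It follows the GRPO-style placement of the KL term inside the objective and uses the estimator r - log r - 1 with r = pi_ref / pi_new; many PPO-based RLHF implementations instead fold the KL term into the reward. The default values of clip_eps and kl_coef are illustrative, and the function name is hypothetical.

```python
import math


def clipped_token_objective(logp_new, logp_old, logp_ref, advantage,
                            clip_eps=0.2, kl_coef=0.04):
    """Objective (to maximize) for one token, given log-probabilities.

    logp_new: log-prob under the policy being updated
    logp_old: log-prob under the policy that generated the sample
    logp_ref: log-prob under the frozen reference model
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside the [1 - clip_eps, 1 + clip_eps] range.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)

    # Non-negative KL estimate between the policy and the reference model.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    return surrogate - kl_coef * kl


# The ratio is e (about 2.72); with a positive advantage the gain is capped at 1.2.
print(clipped_token_objective(0.0, -1.0, 0.0, advantage=1.0))
```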

4.3.4 Length Bias in GRPO

A major flaw in the original GRPO formulation is length bias, where the model learns to generate increasingly long reasoning chains even when they do not improve performance.

The lecture explains the mathematical cause of this bias:

The original GRPO loss normalizes each completion’s contribution by its length
This means tokens in short completions have much higher weight than tokens in long completions
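For an incorrect completion, the negative advantage is therefore spread over more tokens when the output is long, so the model learns that generating longer bad outputs is penalized less per token than generating shorter bad outputs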
Over time, this leads to ever-increasing output lengths with no corresponding performance gain

Two proposed solutions to this problem are:

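DAPO: Averages the loss over all tokens in the group, so every token carries the same weight regardless of which completion it belongs to
Dr. GRPO: Removes the length normalization factor entirely (and also drops the division by the group’s reward standard deviation)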

Both methods are reported to remove the length bias while maintaining or improving reasoning performance.
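
The effect of each normalization can be seen by computing the weight a single token receives in the group loss. This is a simplified sketch: it ignores clipping and the KL term, treats Dr. GRPO's normalizer as a constant of 1, and the function name per_token_weights is hypothetical.

```python
def per_token_weights(lengths, scheme):
    """Weight of one token from each completion in the group loss."""
    g = len(lengths)
    if scheme == "grpo":
        # Mean over tokens within each completion, then mean over the group.
        return [1.0 / (g * n) for n in lengths]
    if scheme == "dapo":
        # Single mean over all tokens in the group.
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    if scheme == "dr_grpo":
        # Sum over tokens with no length division, then mean over the group.
        return [1.0 / g for _ in lengths]
    raise ValueError(f"unknown scheme: {scheme}")


# Two incorrect completions with the same negative advantage.
lengths = [10, 100]
for scheme in ("grpo", "dapo", "dr_grpo"):
    print(scheme, per_token_weights(lengths, scheme))
```

Under the original normalization each token of the 100-token completion is penalized ten times less than each token of the 10-token one; under the other two schemes the per-token weight no longer depends on the completion's length.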

4.4 The DeepSeek R1 Training Pipeline

The lecture concludes with a detailed breakdown of the training pipeline for DeepSeek R1, the groundbreaking open-source reasoning model released in January 2025.

4.4.1 R1-Zero: The Proof of Concept

DeepSeek first demonstrated the power of RL for reasoning with their R1-Zero model, which:

Started directly from a raw pre-trained base model with no SFT
Trained exclusively using verifiable rewards for correct answers and proper formatting
Achieved strong reasoning performance on math benchmarks, competitive with leading closed-source reasoning models at the time
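
A formatting reward of this kind can be as simple as a pattern check. The sketch below assumes the completion must consist of a <think>...</think> block followed by an <answer>...</answer> block; the exact tags, the regular expression, and the 0/1 scale are assumptions made for illustration.

```python
import re

# Assumed template: reasoning inside <think> tags, then the final answer.
_FORMAT = re.compile(
    r"\s*<think>.+?</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if _FORMAT.fullmatch(completion) else 0.0


def extract_answer(completion: str):
    """Return the text inside <answer> tags, or None if the format is wrong."""
    match = _FORMAT.fullmatch(completion)
    return match.group(1).strip() if match else None


good = "<think>2025 - 2020 = 5</think>\n<answer>5</answer>"
bad = "The bear is 5 years old."
print(format_reward(good), extract_answer(good))
print(format_reward(bad), extract_answer(bad))
```

The extracted answer can then be passed to an accuracy check against the known ground truth, giving the second reward component.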

However, R1-Zero had significant limitations:

Its reasoning chains often mixed languages
Its outputs had readability and formatting issues
It lacked general assistant capabilities

4.4.2 Full DeepSeek R1 Pipeline

The complete R1 training pipeline addresses these limitations with a multi-stage approach:

Pre-training: Train a standard decoder-only transformer on trillions of tokens of text and code
Cold-start SFT: Train on a small set of high-quality, human-curated reasoning chains to fix formatting and language issues
Reasoning RL: Train using GRPO with verifiable rewards for correct answers, proper formatting, and language consistency
Large-scale SFT: Train on a mixture of roughly 200k general assistant examples and roughly 600k high-quality reasoning examples generated via rejection sampling (figures as reported in the DeepSeek R1 paper)
Mixed RL: Final RL stage that combines reasoning rewards with standard helpfulness and harmlessness rewards for general assistant capabilities
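
The rejection sampling used to build the reasoning data for the large-scale SFT stage can be sketched as follows: sample several completions per prompt and keep only those a verifier accepts. The generator below is a deterministic toy stand-in for a real model, and all names are hypothetical.

```python
from itertools import cycle


def rejection_sample(prompt, reference, generate, is_correct, num_samples=8):
    """Sample completions for a prompt and keep only the verified ones."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    return [c for c in candidates if is_correct(c, reference)]


# Toy stand-in for a model: cycles through fixed answers instead of sampling.
_toy_outputs = cycle(["4", "5", "5", "6"])


def toy_generate(prompt):
    return next(_toy_outputs)


kept = rejection_sample(
    prompt="A bear was born in 2020. How old is the bear in 2025?",
    reference="5",
    generate=toy_generate,
    is_correct=lambda completion, ref: completion == ref,
    num_samples=8,
)
print(kept)  # only completions whose answer matches the reference survive
```

The surviving prompt and completion pairs become SFT training examples, so the model is fine-tuned only on reasoning chains that led to a correct answer.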

4.4.3 Distilling Reasoning Models

For smaller models that cannot be trained from scratch with RL, knowledge distillation is the most effective approach:

Use a large teacher reasoning model to generate complete outputs including reasoning chains
Train a smaller student model via standard SFT to mimic the teacher’s outputs exactly
The student model inherits most of the teacher’s reasoning ability at a fraction of the size

This approach is significantly more efficient than trying to train small reasoning models from scratch with RL.
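
In data terms, distillation amounts to turning teacher outputs into ordinary SFT examples. The sketch below assumes a <think> tag layout and a prompt/completion record format, both hypothetical, and uses the common SFT convention of computing the loss only on completion tokens, which is not something the lecture specifies.

```python
def build_sft_example(prompt, teacher_reasoning, teacher_answer):
    """Package one teacher output as a prompt/completion training pair."""
    completion = f"<think>\n{teacher_reasoning}\n</think>\n{teacher_answer}"
    return {"prompt": prompt, "completion": completion}


def loss_mask(prompt_tokens, completion_tokens):
    """1 where the student is trained to predict the token, 0 elsewhere."""
    return [0] * len(prompt_tokens) + [1] * len(completion_tokens)


example = build_sft_example(
    prompt="A bear was born in 2020. How old is the bear in 2025?",
    teacher_reasoning="2025 - 2020 = 5, so the bear is 5 years old.",
    teacher_answer="5",
)
print(example["completion"])

# Whitespace splitting stands in for a real tokenizer here.
mask = loss_mask(example["prompt"].split(), example["completion"].split())
print(mask)
```

The student sees the full reasoning chain as its training target, which is how it picks up the teacher's step-by-step behavior without any RL of its own.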

These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you thoroughly master the content of this subject. Wish you continuous academic progress and great achievements in your studies.

Video Source and Usage Instructions

Video Title: Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 6 - LLM Reasoning Stanford Online
• Course Series: Stanford CME295: Transformers and Large Language Models | Autumn 2025
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/k5Fh-UgTuCo?si=VxyoR4K7lpd8kcb9
