Deep Reinforcement Learning: Lecture 2 Imitation Learning Structured Notes & In-Depth Analysis

One. Course Details

This is the second lecture of Stanford University's CS224R Deep Reinforcement Learning course, taught by Professor Chelsea Finn. It builds directly on the reinforcement learning fundamentals introduced in Lecture 1 and focuses entirely on imitation learning—one of the most practical and widely deployed approaches to training RL agents.
The lecture covers four core topics: the basic formulation of imitation learning, the critical importance of learning expressive policy distributions, methods to address compounding errors through online interventions, and practical considerations for collecting demonstration data. Unlike pure reinforcement learning, imitation learning does not require manually defining reward functions, making it especially valuable for real-world tasks where reward engineering is difficult.
The core goal of this lecture is to equip students to implement robust imitation learning systems, diagnose common failure modes (such as multimodal action ambiguity and compounding errors), and select appropriate distribution models and correction strategies for different use cases.

Two. Key Learning Objectives

By the end of this lecture, students should be able to:

Define imitation learning and explain its relationship to supervised learning and reinforcement learning
Identify why deterministic policies fail catastrophically on multimodal demonstration data
Implement three types of expressive policy distributions: Gaussian mixture models, discretized autoregressive models, and diffusion models
Explain the root causes of compounding errors and covariate shift in sequential decision-making
Describe the Dataset Aggregation (DAgger) algorithm and expert intervention methods for correcting policy drift
Compare the trade-offs between offline and online imitation learning approaches

Three. Memorable Course Quotes

"Neural network expressivity is often distinct from distribution expressivity—making your model bigger won't fix the fact that it's only predicting the mean."
"Compounding errors are the single biggest reason simple behavior cloning fails on long-horizon tasks."
"If you have data from one consistent demonstrator, a unimodal policy works fine. If you have data from multiple people, you need an expressive generative model—this is non-negotiable."
"DAgger has a terrible name because it's not descriptive at all, but it's one of the most practical imitation learning algorithms ever invented."
"Nearly all state-of-the-art robotics and autonomous driving systems today are built on the expressive imitation learning techniques we're covering today."

Four. Detailed Study Notes

4.1 Imitation Learning: Problem Formulation

Imitation learning is a paradigm for training agents by mimicking expert behavior, rather than learning from trial and error with reward signals.

Core Assumption: We are given a dataset D of expert demonstrations, which are trajectories (s₁, a₁, s₂, a₂, ..., s_T, a_T) collected from an unknown expert policy π_expert.
Goal: Learn a policy π_θ that performs as well as the expert policy by matching the state-action pairs in the demonstration dataset.
Version 0: Behavior Cloning. The simplest approach is to treat imitation learning as a standard supervised regression problem. We train the policy to minimize the difference between its predicted actions and the expert's actions:
min_θ Σ_{(s,a)∈D} ||π_θ(s) - a||²
In practice, this means running a forward pass of the neural network on a batch of states, computing the loss against the expert's actions, and backpropagating the gradients to update θ with stochastic gradient descent.
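To make the regression view concrete, here is a minimal sketch of behavior cloning in plain Python. It is an illustration under stated assumptions, not the lecture's implementation: the policy is a one-dimensional linear function rather than a neural network, the gradients are written out by hand, and the expert rule a = 2s + 1 and the names train_behavior_cloning and demos are made up for this example.

```python
import random

def train_behavior_cloning(dataset, lr=0.05, epochs=200, seed=0):
    """Fit a linear policy pi(s) = w * s + b by SGD on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(dataset)
    for _ in range(epochs):
        rng.shuffle(data)              # visit the demonstrations in random order
        for s, a in data:
            pred = w * s + b           # forward pass
            err = pred - a             # loss is err ** 2
            w -= lr * 2.0 * err * s    # gradient of err ** 2 with respect to w
            b -= lr * 2.0 * err        # gradient of err ** 2 with respect to b
    return w, b

# Hypothetical expert (an assumption for this sketch): a = 2 * s + 1
demos = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = train_behavior_cloning(demos)
```

Because these toy demonstrations come from a single consistent expert with no noise, the learned w and b recover the expert's rule. The next section explains why this stops working once demonstrations disagree.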

4.2 The Fatal Flaw of Deterministic Policies

While behavior cloning seems straightforward, it fails catastrophically on most real-world datasets due to multimodality in expert actions.

Driving Example: In identical highway scenarios, some human drivers will stay straight while others merge left to pass slower traffic. This creates a bimodal distribution of steering commands.
Failure Mode of L2 Regression: A deterministic policy trained with mean squared error will predict the average of the two modes—an action that straddles the lane lines and has almost zero probability under the expert data distribution.
Generalization: This problem is not contrived. It occurs whenever demonstrations are collected from multiple people with different preferences or strategies, which is almost always the case for large-scale real-world datasets.
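The averaging failure can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical dataset in which, for one and the same state, half of the drivers keep the wheel straight (0.0) and half steer left (-1.0). The numbers and function names are illustrative only.

```python
def best_constant_action(actions):
    """The constant prediction that minimizes mean squared error is the sample mean."""
    return sum(actions) / len(actions)

def mse(prediction, actions):
    """Mean squared error of a single predicted action against all observed actions."""
    return sum((prediction - a) ** 2 for a in actions) / len(actions)

# Hypothetical steering commands recorded in the same highway state.
steering = [0.0] * 50 + [-1.0] * 50

avg = best_constant_action(steering)
loss_at_mean = mse(avg, steering)        # loss of the averaged action
loss_at_straight = mse(0.0, steering)    # loss of committing to "stay straight"
loss_at_left = mse(-1.0, steering)       # loss of committing to "merge left"
```

The averaged action has the lowest squared error of the three, even though no demonstrator ever chose it. The loss rewards exactly the behavior we want to avoid.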

4.3 Learning Expressive Policy Distributions

The solution is to train the neural network to output parameters of a probability distribution over actions, rather than a single deterministic action.

4.3.1 Core Concept: Distribution vs. Neural Network Expressivity

This is the most important insight of the lecture:

Neural network expressivity: How well the network can map states to any set of output values.
Distribution expressivity: How many different types of probability distributions those output values can represent.
An infinitely large neural network that only outputs the mean of a Gaussian can never represent a multimodal distribution, no matter how much data you train it on.

4.3.2 Three Expressive Distribution Models

Gaussian Mixture Models (GMMs): The network outputs the mean, standard deviation, and weight for multiple Gaussian components. GMMs are strictly more expressive than single Gaussians and work well for moderately multimodal data. The number of mixture components is a hyperparameter.
Discretized Autoregressive Models: Inspired by language models, this approach discretizes continuous action dimensions into bins and predicts one dimension at a time.
- First, discretize each action dimension (e.g., steering angle, acceleration) into a fixed number of bins.
- The network predicts a categorical distribution over the first action dimension.
- It then samples a bin for that dimension and feeds the sampled value back into the network to predict the next dimension.
- This approach can model arbitrarily complex joint distributions and is widely used in autonomous driving systems from Waymo and Wayve.
Diffusion Models: These models generate actions through an iterative denoising process, starting from random noise and gradually refining it into a valid action. Diffusion models are the most expressive option and excel at modeling high-dimensional, highly multimodal action spaces for robotics tasks.
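The discretize-then-predict-one-dimension-at-a-time procedure can be sketched as follows. This is a toy illustration, not a production driving model: toy_conditional is a hand-written stand-in for the neural network, and the bin layout (3 bins per dimension, steering first, acceleration second) is an assumption made for the example.

```python
import random

def discretize(value, low, high, n_bins):
    """Map a continuous value in [low, high] to a bin index in [0, n_bins - 1]."""
    frac = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(frac * n_bins)))

def bin_center(index, low, high, n_bins):
    """Map a bin index back to a representative continuous value."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width

def toy_conditional(prefix):
    """Stand-in for the network: categorical probabilities over 3 bins,
    conditioned on the bins already chosen for earlier action dimensions."""
    if len(prefix) == 0:        # steering: merge left (bin 0) or stay straight (bin 1)
        return [0.5, 0.5, 0.0]
    if prefix[0] == 0:          # merging left -> accelerate (bin 2)
        return [0.0, 0.0, 1.0]
    return [0.0, 1.0, 0.0]      # staying straight -> hold speed (bin 1)

def sample_action_bins(conditional, n_dims, rng):
    """Sample one action dimension at a time, feeding earlier choices back in."""
    bins = []
    for _ in range(n_dims):
        probs = conditional(bins)
        index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        bins.append(index)
    return bins

rng = random.Random(0)
samples = [sample_action_bins(toy_conditional, 2, rng) for _ in range(200)]
```

Every sample is either "merge left and accelerate" or "stay straight and hold speed". The two modes are never blended, because the second dimension is predicted after the first has been committed to.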

4.3.3 Training Objective

All expressive imitation learning models use the same core training objective: minimize the negative log-likelihood of the expert actions under the learned policy:
min_θ -E_{(s,a)∈D} [log π_θ(a | s)]
For discrete distributions, this reduces to standard cross-entropy loss. For continuous distributions, it uses the probability density function of the chosen distribution class.
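As a rough illustration of this objective, the sketch below evaluates the negative log-likelihood for a categorical distribution and for a one-dimensional Gaussian mixture. The data and all parameter values are hypothetical and hand-picked, not learned. A real implementation would have the network output these parameters per state and would typically use a log-sum-exp formulation for numerical stability.

```python
import math

def categorical_nll(probs, target_bin):
    """Cross-entropy for one sample: -log p(target_bin)."""
    return -math.log(probs[target_bin])

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gmm_nll(x, weights, means, stds):
    """-log sum_k w_k * N(x; mean_k, std_k) for a 1-D action x."""
    density = sum(w * gaussian_pdf(x, m, s)
                  for w, m, s in zip(weights, means, stds))
    return -math.log(density)

def average_nll(actions, weights, means, stds):
    """Average negative log-likelihood of a set of actions under one mixture."""
    return sum(gmm_nll(a, weights, means, stds) for a in actions) / len(actions)

# Hypothetical bimodal expert actions for a single state.
actions = [0.0] * 50 + [-1.0] * 50

# A single Gaussian (mean -0.5, std 0.5) versus a two-component mixture
# with one narrow component on each mode.
nll_single = average_nll(actions, [1.0], [-0.5], [0.5])
nll_mixture = average_nll(actions, [0.5, 0.5], [-1.0, 0.0], [0.1, 0.1])
```

On this data the mixture reaches a lower negative log-likelihood than the single Gaussian, since it can place probability mass on both modes instead of spreading it across the gap between them.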

4.4 Empirical Evidence for Expressive Distributions

Simulated Transport Task: Diffusion models outperform GMMs on both single-human and multi-human demonstration datasets. The performance gap widens significantly for multi-human data, where multimodality is more pronounced.
Real-World Robot Task: For the complex task of hanging a shirt on a hanger, using diffusion models doubles the success rate compared to a deterministic policy trained with L1 loss.
Industry Adoption: All leading robotics foundation models (NVIDIA, Figure, OpenVLA) and autonomous driving systems rely on expressive policy distributions as a core component.

4.5 The Compounding Errors Problem

Even with perfect distribution matching, imitation learning still suffers from a fundamental limitation that distinguishes it from standard supervised learning: compounding errors.

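Covariate Shift: Standard supervised learning assumes inputs are independent and identically distributed (i.i.d.), drawn from the same distribution at training and test time. In imitation learning, the states visited by the learned policy depend on the actions it takes, so the state distribution at deployment can differ from the expert's state distribution in the training data.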
Error Propagation: A small mistake early in a trajectory can push the agent into a state that was never seen in the expert demonstrations. From this unfamiliar state, the agent is even more likely to make another mistake, leading to a rapid drift away from the expert distribution.
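Long-Horizon Impact: This problem worsens quickly as trajectories get longer. In the standard worst-case analysis of behavior cloning, a per-step error rate ε over a horizon of T steps leads to total error on the order of εT², rather than the εT expected under i.i.d. supervised learning, so even tiny per-step errors can accumulate into catastrophic failures.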
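A small calculation shows how quickly errors can pile up. The sketch below uses a deliberately pessimistic toy model, which is an assumption of this example and not a claim about any particular system: the policy errs with probability eps at each step while it is still on the expert's state distribution, and once it has erred, every later step also counts as a mistake because it never recovers.

```python
def expected_mistakes_iid(eps, horizon):
    """Expected number of mistakes if every step were an independent test example."""
    return eps * horizon

def expected_mistakes_compounding(eps, horizon):
    """Toy worst case: after the first mistake, all remaining steps are mistakes."""
    total = 0.0
    p_on_track = 1.0
    for _ in range(horizon):
        p_on_track *= 1.0 - eps      # probability of no mistake so far
        total += 1.0 - p_on_track    # probability that this step is a mistake
    return total

eps, horizon = 0.01, 100
iid = expected_mistakes_iid(eps, horizon)
compounding = expected_mistakes_compounding(eps, horizon)
quadratic_bound = eps * horizon * (horizon + 1) / 2.0
```

With a 1% per-step error rate over 100 steps, the i.i.d. view predicts about one mistake, while the no-recovery model predicts dozens. The result stays below the quadratic bound eps * T * (T + 1) / 2, which follows from 1 - (1 - eps)^t <= eps * t.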

4.6 Solutions to Compounding Errors

There are two primary approaches to addressing compounding errors:

Massive Data Collection: Collect enough diverse demonstration data to cover almost all possible states the agent might encounter. This is the approach used by large-scale autonomous driving companies, but it is extremely expensive.
Online Corrective Data Collection: Collect additional data from the states visited by the learned policy to teach it how to recover from mistakes.
- Dataset Aggregation (DAgger):
  1. Roll out the current policy to collect trajectories.
  2. Query an expert to label the correct action for every state visited by the policy.
  3. Combine this corrective data with the original demonstration dataset.
  4. Retrain the policy on the combined dataset and repeat.
- Expert Intervention: A more practical alternative for real-world systems. The expert takes full control of the agent whenever it makes a mistake, providing partial demonstrations of how to recover. This is the standard approach used for training self-driving cars with safety drivers.
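The four DAgger steps above can be sketched as a short loop. Everything concrete here is a stand-in chosen for illustration: the one-dimensional lane-keeping dynamics, the expert rule, and the 1-nearest-neighbor "policy" that replaces neural network training are all assumptions of this example, not part of the algorithm's definition.

```python
import random

def expert(state):
    """Hypothetical expert: steer back toward the lane center at 0."""
    return -0.5 * state

def make_policy(dataset):
    """'Training' stand-in: a 1-nearest-neighbor lookup over (state, action) pairs."""
    def policy(state):
        _, action = min(dataset, key=lambda pair: abs(pair[0] - state))
        return action
    return policy

def rollout(policy, horizon, rng, start=0.0):
    """Run a policy in toy dynamics x' = x + a + noise and return the visited states."""
    states, x = [], start
    for _ in range(horizon):
        states.append(x)
        x = x + policy(x) + rng.uniform(-0.2, 0.2)
    return states

def dagger(n_iters=5, horizon=20, seed=0):
    rng = random.Random(seed)
    # Initial demonstrations: states visited by the expert, labeled by the expert.
    dataset = [(s, expert(s)) for s in rollout(expert, horizon, rng)]
    policy = make_policy(dataset)
    for _ in range(n_iters):
        visited = rollout(policy, horizon, rng)           # 1. roll out the current policy
        corrections = [(s, expert(s)) for s in visited]   # 2. expert labels every visited state
        dataset = dataset + corrections                   # 3. aggregate with earlier data
        policy = make_policy(dataset)                     # 4. retrain, then repeat
    return policy, dataset

policy, dataset = dagger()
```

The key design point is that the states come from the learner's own rollouts while the labels always come from the expert, so the aggregated dataset gradually covers the states the learner actually reaches.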

Both methods are online algorithms that require additional data collection after initial training, but they produce significantly more robust policies than pure offline behavior cloning.
These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you master the content of this subject thoroughly. Wish you continuous academic progress and great achievements in your studies.

Video Source and Usage Instructions

• Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 2: Imitation Learning
• Course Series: Stanford CS224R Deep Reinforcement Learning
• Original Platform: YouTube
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/WxRDyObrm_M?si=Pcn2SLEMs6XkIbaQ
