Deep Reinforcement Learning, Lecture 11: Model-Based Reinforcement Learning (Structured Notes & In-Depth Analysis)

One. Course Details

This is the eleventh lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by Professor Chelsea Finn. It provides a comprehensive introduction to model-based reinforcement learning, a fundamental paradigm that learns a simulator of the environment to enable planning and more data-efficient learning.

The lecture begins with a high-level recap of all reinforcement learning algorithms covered in the course, organizing them into online/offline, on-policy/off-policy, and imitation learning categories. It then explains the core idea behind model-based RL, how to learn accurate dynamics models, and two primary ways to use these models: planning and data generation. The lecture covers both gradient-based and sampling-based planning methods, introduces the widely used Model Predictive Control (MPC) framework, and concludes with a detailed case study of dexterous robot manipulation that demonstrates the data efficiency advantages of model-based RL over model-free approaches.

Two. Key Learning Objectives

By the end of this lecture, students should be able to:

Classify all major RL algorithms along the online/offline, on-policy/off-policy, and imitation learning/reinforcement learning axes
Explain the core intuition behind model-based RL and identify its three fundamental challenges
Compare the tradeoffs between learning dynamics models in pixel space versus learned latent space
Implement both gradient-based and sampling-based planning methods using learned dynamics models
Describe the Model Predictive Control (MPC) framework and explain why it is robust to model errors
Understand the role of model ensembles in mitigating model inaccuracies
Appreciate the data efficiency advantages of model-based RL for real-world robotics applications

Three. Memorable Course Quotes

"The key idea behind model-based reinforcement learning is essentially we're going to try to learn a simulator."
"Planning is at every single state, we're actually going to be planning forward and thinking ahead: what are the actions I should consider that I think will lead to good reward?"
"Model Predictive Control is very, very common in certain robotic systems, very widely used."
"Data efficiency is really important for fragile hardware. This robot hand is very fragile, and all the students are afraid to touch it because they don't want to break it."
"The model-based method is using maybe on the order of 100,000 time steps of data, whereas the model-free methods are using half a million time steps. So it's a lot more efficient."
"If you have a good universal model of the world, you can plug in different rewards or different goals at test time and get a policy that can follow them."

Four. Detailed Study Notes

4.1 Algorithm Landscape Recap

All reinforcement learning and imitation learning algorithms can be organized along three primary axes:

Category	Subcategories	Algorithms
Online RL	On-policy	REINFORCE, Vanilla Policy Gradient
	Off-policy (no replay buffer)	PPO, Importance Sampling
	Off-policy (with replay buffer)	DQN, SAC
Offline RL	In-distribution constraint	AWR, AWAC, IQL
	Conservative regularization	CQL
Imitation Learning	Offline	Behavior Cloning
	Online	DAgger

Key properties:

Policy gradient methods (shown in blue on the lecture slide): Train only a policy
Actor-critic methods (purple): Train both a policy and a critic (PPO, SAC)
Q-learning methods (green): Train only a critic (DQN)
On the slide, moving from left to right, algorithms require more online data and are generally more expensive to train

4.2 Core Idea of Model-Based RL

Model-based RL takes a fundamentally different approach from all the model-free algorithms covered so far:

First, learn a dynamics model (simulator) that predicts the next state given the current state and action: s' = f(s, a)
Then, use this learned model to plan actions or generate synthetic data for training
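
As a minimal sketch of the first step, assume a hypothetical 1-D system whose true dynamics are s' = 0.9 s + 0.5 a (values chosen purely for illustration, not taken from the lecture). A linear model can then be fit to logged transitions by least squares. All function names here are illustrative.

```python
import random

def true_step(s, a):
    """Hypothetical ground-truth environment (unknown to the learner)."""
    return 0.9 * s + 0.5 * a

def collect_transitions(n, rng):
    """Log (s, a, s') tuples under uniformly random states and actions."""
    data = []
    for _ in range(n):
        s, a = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        data.append((s, a, true_step(s, a)))
    return data

def fit_linear_model(data):
    """Least-squares fit of s' ~ w_s * s + w_a * a via the 2x2 normal equations."""
    ss = sum(s * s for s, a, y in data)
    sa = sum(s * a for s, a, y in data)
    aa = sum(a * a for s, a, y in data)
    sy = sum(s * y for s, a, y in data)
    ay = sum(a * y for s, a, y in data)
    det = ss * aa - sa * sa
    w_s = (sy * aa - sa * ay) / det
    w_a = (ss * ay - sa * sy) / det
    return w_s, w_a

rng = random.Random(0)
w_s, w_a = fit_linear_model(collect_transitions(200, rng))
```

Because the toy data is noise-free and the true system is linear, the fit recovers the coefficients almost exactly. A neural network model would replace `fit_linear_model` in realistic settings.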

This paradigm has several compelling advantages:

Data efficiency: Can learn good policies with substantially less real-world data than model-free methods (roughly 5x less in the case study in Section 4.8)
Transferability: A single good model can be used for multiple different tasks with different reward functions
Interpretability: The model provides insight into how the world works

However, it also faces three critical challenges:

Model inaccuracy: Learned models are never perfect, and small errors can compound over time
Distribution mismatch: The model is only accurate on states covered by the training data
Planning complexity: Planning with high-dimensional state and action spaces is computationally expensive
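
The compounding of model errors can be illustrated with a small sketch. It assumes a hypothetical 1-D linear system and a learned model whose state coefficient is off by 2%; both sets of numbers are made up for illustration.

```python
def rollout(w_s, w_a, s0, actions):
    """Roll a 1-D linear model s' = w_s * s + w_a * a forward open-loop."""
    states = [s0]
    for a in actions:
        states.append(w_s * states[-1] + w_a * a)
    return states

actions = [1.0] * 30
true_traj = rollout(1.00, 0.1, 0.0, actions)    # hypothetical true dynamics
model_traj = rollout(1.02, 0.1, 0.0, actions)   # learned model, 2% error in w_s

# Absolute prediction error at each step of the open-loop rollout.
errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]
```

In this sketch the one-step prediction from the start state is exact, yet the error grows at every later step because each prediction is fed back in as the next input.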

4.3 Learning Dynamics Models

There are three scenarios for obtaining a dynamics model:

Known model: You already have an accurate simulator (e.g., chess, Atari games)
Partially known model: You know most of the physics but need to fit unknown parameters (e.g., friction coefficients)
Unknown model: You learn the entire model from scratch using neural networks
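
For the partially known case, a minimal sketch might look like the following. It assumes a block sliding on a surface with Coulomb friction, where the structure of the physics is known and only the friction coefficient is fit from data. The constants, noise level, and function names are assumptions for illustration.

```python
import random

G, DT = 9.81, 0.01   # gravity (m/s^2) and time step (s): the known physics

def step(v, mu):
    """Known structure: a sliding block decelerates by mu * g until it stops."""
    return max(v - mu * G * DT, 0.0)

def fit_friction(velocities):
    """Least-squares estimate of mu from velocity changes while the block moves."""
    dvs = [v1 - v0 for v0, v1 in zip(velocities, velocities[1:]) if v1 > 0.0]
    return -sum(dvs) / (len(dvs) * G * DT)

rng = random.Random(0)
true_mu = 0.3        # hypothetical value the learner does not know
vs = [2.0]
for _ in range(50):
    vs.append(step(vs[-1], true_mu))
noisy_vs = [v + rng.gauss(0.0, 1e-4) for v in vs]   # small measurement noise
mu_hat = fit_friction(noisy_vs)
```

Only one scalar is estimated here, which is why this setting typically needs far less data than learning the full model from scratch.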

4.3.1 Model Learning Approaches

Pixel space modeling: Directly predict future images from current images and actions. This is very expressive but computationally expensive, and it forces the model to predict visual details that are irrelevant to the task.
Latent space modeling: First learn a low-dimensional representation of the state, then model dynamics in this latent space. This is more computationally efficient but requires learning a good representation that preserves all task-relevant information.

In addition to the dynamics model, you will almost always also need to learn a reward model that predicts the reward for a given state and action.

4.4 Using Learned Models

There are two primary ways to use a learned dynamics model:

Planning: At test time, use the model to simulate different possible action sequences and select the one with the highest predicted reward
Data generation: Generate synthetic trajectories using the model and use them to train a model-free policy
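
Although the lecture concentrates on planning, the data generation option can be sketched briefly. The example below assumes stand-in dynamics and reward models for a 1-D system and branches short model rollouts from real states. The rollouts are kept short on the assumption that shorter horizons limit compounding error. All names are hypothetical.

```python
import random

def model_step(s, a):
    """Stand-in for a learned dynamics model (hypothetical 1-D linear model)."""
    return 0.9 * s + 0.5 * a

def model_reward(s, a):
    """Stand-in for a learned reward model: stay near the origin, use small actions."""
    return -(s * s) - 0.1 * (a * a)

def generate_synthetic(real_states, policy, horizon, rng):
    """Branch short model rollouts from real states; return (s, a, r, s') tuples."""
    buffer = []
    for s in real_states:
        for _ in range(horizon):
            a = policy(s, rng)
            s_next = model_step(s, a)
            buffer.append((s, a, model_reward(s, a), s_next))
            s = s_next
    return buffer

def random_policy(s, rng):
    """Placeholder policy; a model-free learner would supply its own."""
    return rng.uniform(-1.0, 1.0)

rng = random.Random(0)
synthetic = generate_synthetic([0.5, -0.2, 1.0], random_policy, horizon=5, rng=rng)
```

The resulting tuples could be added to the replay buffer of an off-policy model-free learner alongside real transitions.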

This lecture focuses primarily on the planning approach.

4.5 Planning with Learned Models

Planning is the process of finding a sequence of actions that maximizes the sum of future rewards according to the learned model. There are two main families of planning methods:

4.5.1 Gradient-Based Planning

Gradient-based planning backpropagates gradients from the reward through the dynamics model to optimize the action sequence:

Initialize a sequence of actions randomly
Simulate the trajectory using the dynamics model
Compute the gradient of the total reward with respect to the actions
Update the actions using gradient ascent
Repeat until convergence
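
The steps above can be sketched for a toy problem. This example assumes a differentiable 1-D linear model and a quadratic reward, so the gradient can be backpropagated by hand. With a neural network model, an automatic differentiation library would take over the backward pass. The model, reward, and hyperparameters are illustrative assumptions, and a fixed iteration count stands in for a convergence check.

```python
import random

def plan_by_gradient(s0, goal, horizon=10, iters=200, lr=0.05,
                     A=0.9, B=0.5, c=0.01, seed=0):
    """Gradient ascent on an action sequence through a differentiable model.

    Model:  s_{t+1} = A * s_t + B * a_t        (stand-in for a learned model)
    Reward: r_t = -(s_{t+1} - goal)^2 - c * a_t^2
    """
    rng = random.Random(seed)
    actions = [rng.uniform(-0.1, 0.1) for _ in range(horizon)]  # random init
    for _ in range(iters):
        # Forward pass: simulate the trajectory with the model.
        states = [s0]
        for a in actions:
            states.append(A * states[-1] + B * a)
        # Backward pass: accumulate d(total reward)/d(s_{t+1}) from the end.
        grads = [0.0] * horizon
        d_state = 0.0
        for t in reversed(range(horizon)):
            d_state += -2.0 * (states[t + 1] - goal)    # reward term at s_{t+1}
            grads[t] = B * d_state - 2.0 * c * actions[t]
            d_state *= A                                # propagate back to s_t
        # Gradient ascent step on the whole action sequence.
        actions = [a + lr * g for a, g in zip(actions, grads)]
    return actions

def total_reward(s0, goal, actions, A=0.9, B=0.5, c=0.01):
    """Evaluate an action sequence under the same model and reward."""
    s, total = s0, 0.0
    for a in actions:
        s = A * s + B * a
        total += -(s - goal) ** 2 - c * a * a
    return total

plan = plan_by_gradient(s0=0.0, goal=1.0)
```

The objective in this toy case is a concave quadratic, so gradient ascent has no local optima to get stuck in. That is not true in general for learned nonlinear models.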

Advantages:

Scales well to high-dimensional action spaces
Works well with smooth optimization landscapes

Disadvantages:

Can get stuck in local optima
Requires differentiable dynamics and reward models

4.5.2 Sampling-Based Planning

Sampling-based planning (also called zero-order optimization) does not require gradients:

Random shooting: Sample many action sequences, simulate all of them with the model, and select the one with the highest predicted cumulative reward. This is surprisingly effective for low-dimensional action spaces.
Cross-Entropy Method (CEM): An iterative refinement of random shooting:
- Sample action sequences from an initial distribution
- Rank them by predicted cumulative reward
- Fit a new distribution to the top-performing samples
- Resample from the new distribution and repeat
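
Both methods can be sketched on a toy problem. The example assumes a stand-in 1-D linear model, a goal-reaching reward, uniform sampling for random shooting, and an independent Gaussian per time step for CEM. These choices, the names, and the hyperparameters are illustrative assumptions rather than the lecture's exact setup.

```python
import random
import statistics

def predicted_return(s0, actions, goal=1.0):
    """Score an action sequence under a stand-in model s' = 0.9 s + 0.5 a."""
    s, total = s0, 0.0
    for a in actions:
        s = 0.9 * s + 0.5 * a
        total += -(s - goal) ** 2
    return total

def random_shooting(s0, horizon, n_samples, rng):
    """Sample action sequences uniformly and keep the best-scoring one."""
    candidates = [[rng.uniform(-2.0, 2.0) for _ in range(horizon)]
                  for _ in range(n_samples)]
    return max(candidates, key=lambda acts: predicted_return(s0, acts))

def cem(s0, horizon, n_samples, n_elite, n_iters, rng):
    """Cross-entropy method: repeatedly refit a Gaussian to the elite samples."""
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(n_iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        samples.sort(key=lambda acts: predicted_return(s0, acts), reverse=True)
        elites = samples[:n_elite]
        # Refit the sampling distribution per time step; the small constant
        # keeps the standard deviation strictly positive.
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mean

rng = random.Random(0)
shooting_plan = random_shooting(0.0, horizon=5, n_samples=500, rng=rng)
cem_plan = cem(0.0, horizon=5, n_samples=200, n_elite=20, n_iters=10, rng=rng)
```

Every candidate is scored independently of the others, which is what makes these methods easy to parallelize.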

Advantages:

Easy to parallelize
Works with non-differentiable models and discrete action spaces
Less prone to local optima than gradient-based methods

Disadvantages:

Scales poorly to very high-dimensional action spaces
Requires many samples to cover the space well

4.6 Model Predictive Control (MPC)

The naive planning approach of executing an entire planned action sequence is called open-loop control. It is very brittle to model errors and environmental disturbances.

Model Predictive Control (MPC) is a closed-loop alternative that addresses these issues:

At each time step, plan a sequence of H actions into the future
Execute only the first action in the sequence
Observe the resulting next state
Replan a new sequence of H actions starting from the new state
Repeat until the episode or task ends
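
A minimal sketch of this loop follows. It assumes a deliberately imperfect stand-in model, a hypothetical true environment with different coefficients, and a simple random-shooting planner. All names and numbers are illustrative.

```python
import random

def model_step(s, a):
    """Imperfect learned model (hypothetical): coefficients are slightly wrong."""
    return 0.9 * s + 0.5 * a

def env_step(s, a):
    """Hypothetical true environment; the planner never sees these coefficients."""
    return 0.8 * s + 0.6 * a

def plan(s, horizon, n_samples, rng, goal=1.0):
    """Random-shooting planner: best sampled action sequence under the model."""
    best_actions, best_return = None, float("-inf")
    for _ in range(n_samples):
        actions = [rng.uniform(-2.0, 2.0) for _ in range(horizon)]
        sim, total = s, 0.0
        for a in actions:
            sim = model_step(sim, a)
            total += -(sim - goal) ** 2
        if total > best_return:
            best_actions, best_return = actions, total
    return best_actions

def run_mpc(s0, steps, horizon, n_samples, rng):
    """Closed loop: plan H actions, execute only the first, observe, replan."""
    s, visited = s0, [s0]
    for _ in range(steps):
        first_action = plan(s, horizon, n_samples, rng)[0]
        s = env_step(s, first_action)    # observe the real next state
        visited.append(s)
    return visited

trajectory = run_mpc(0.0, steps=15, horizon=3, n_samples=200, rng=random.Random(0))
```

Each replanning step starts from the observed state rather than the predicted one, so the model's errors do not accumulate across the episode the way they would in an open-loop rollout.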

Key advantages:

Robustness to model errors: If the model makes a mistake, the replanning step corrects it at the next time step
Robustness to environmental changes: Reacts to unexpected events in real-time
Stability: Generally produces more stable and reliable behavior than open-loop control

Practical improvements:

Warm starting: Initialize the new plan with the tail of the previous plan to reduce computation and prevent oscillation
Adaptive replanning frequency: Replan less frequently as the model becomes more accurate
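
Warm starting can be sketched in a few lines. The example assumes a fixed planning horizon and pads the shifted plan with a zero action. Repeating the last action is another reasonable padding choice. The function name is hypothetical.

```python
def warm_start(previous_plan, pad_action=0.0):
    """Shift the previous plan by one step to seed the next planning call.

    The first action was just executed, so it is dropped; the horizon is
    kept fixed by appending a padding action (zero here, by assumption).
    """
    return previous_plan[1:] + [pad_action]

previous_plan = [0.8, 0.4, 0.2, 0.1]
initial_guess = warm_start(previous_plan)   # [0.4, 0.2, 0.1, 0.0]
```

The shifted sequence can serve as the starting point for gradient-based planning or as the initial mean of a CEM sampling distribution.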

4.7 Addressing the Distribution Mismatch Problem

A central challenge in model-based RL is that the model is only accurate on states covered by its training data, and data from an initial random policy rarely covers the states a good policy will visit. The solution is an iterative data collection loop:

Collect initial data using a random policy
Train the dynamics model on this data
Use MPC to select actions and execute them in the environment
Add the new trajectories to the dataset
Retrain the dynamics model on the expanded dataset
Repeat until convergence

This process continuously expands the region of state space where the model is accurate.
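
The loop can be sketched end to end on a toy problem. The example assumes a hypothetical 1-D linear environment, a least-squares model, and a random-shooting MPC controller. A fixed number of outer iterations stands in for a convergence check, and all names are illustrative.

```python
import random

def env_step(s, a):
    """Hypothetical true dynamics, unknown to the agent."""
    return 0.8 * s + 0.6 * a

def fit_model(data):
    """Least-squares fit of s' ~ w_s * s + w_a * a (2x2 normal equations)."""
    ss = sum(s * s for s, a, y in data)
    sa = sum(s * a for s, a, y in data)
    aa = sum(a * a for s, a, y in data)
    sy = sum(s * y for s, a, y in data)
    ay = sum(a * y for s, a, y in data)
    det = ss * aa - sa * sa
    return (sy * aa - sa * ay) / det, (ss * ay - sa * sy) / det

def mpc_action(s, model, rng, goal=1.0, horizon=3, n_samples=100):
    """Random-shooting MPC: first action of the best sampled plan."""
    w_s, w_a = model
    best_a, best_ret = 0.0, float("-inf")
    for _ in range(n_samples):
        actions = [rng.uniform(-2.0, 2.0) for _ in range(horizon)]
        sim, ret = s, 0.0
        for a in actions:
            sim = w_s * sim + w_a * a
            ret -= (sim - goal) ** 2
        if ret > best_ret:
            best_a, best_ret = actions[0], ret
    return best_a

rng = random.Random(0)
dataset, s = [], 0.0
for _ in range(20):                      # collect initial data with a random policy
    a = rng.uniform(-2.0, 2.0)
    s_next = env_step(s, a)
    dataset.append((s, a, s_next))
    s = s_next
for _ in range(3):                       # outer loop (fixed count for illustration)
    model = fit_model(dataset)           # train or retrain on all data so far
    s = 0.0
    for _ in range(10):                  # act with MPC in the real environment
        a = mpc_action(s, model, rng)
        s_next = env_step(s, a)
        dataset.append((s, a, s_next))   # aggregate the new transitions
        s = s_next
```

In this toy linear case the model is already accurate after the first fit, so the sketch shows only the structure of the loop. The benefit of aggregating on-policy data appears with nonlinear models that extrapolate poorly.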

4.8 Case Study: Dexterous Robot Manipulation

The lecture concludes with a case study of the PDDM algorithm for dexterous manipulation with a five-fingered robot hand:

State space: 24 joint angles of the hand plus 3D position of the object
Action space: 24-dimensional joint torques
Reward: Negative distance to the target object trajectory, plus a penalty for dropping the object
Model: Ensemble of 3 neural networks to mitigate model errors
Planner: Modified cross-entropy method with temporal smoothing

Key results:

Only the model-based method was able to solve the complex ball rotation task
Achieved 100% success rate on 90-degree rotations after only 4 hours of real-world data
Was 5x more data-efficient than model-free methods like SAC and Natural Policy Gradient
The ensemble of models was critical for good performance, as it reduced the tendency to exploit model errors
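
One way an ensemble can temper exploitation of model errors is to score each candidate action sequence under every member and average the predicted returns. The sketch below assumes three stand-in 1-D linear models with slightly different, made-up coefficients. It is an illustration of the idea, not the case study's implementation.

```python
import statistics

# Three stand-in ensemble members with slightly different parameters,
# mimicking models trained from different initializations (hypothetical values).
ENSEMBLE = [(0.90, 0.50), (0.88, 0.53), (0.93, 0.48)]

def rollout_return(model, s0, actions, goal=1.0):
    """Predicted return of an action sequence under one ensemble member."""
    w_s, w_a = model
    s, total = s0, 0.0
    for a in actions:
        s = w_s * s + w_a * a
        total -= (s - goal) ** 2
    return total

def ensemble_score(s0, actions):
    """Mean predicted return across members, plus their disagreement (std)."""
    returns = [rollout_return(m, s0, actions) for m in ENSEMBLE]
    return statistics.fmean(returns), statistics.pstdev(returns)

mean_return, disagreement = ensemble_score(0.0, [1.0, 0.5, 0.2])
```

An action sequence that looks good only because of one member's error tends to score poorly under the others, so averaging lowers its rank. The disagreement value can additionally serve as a rough uncertainty signal.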

Video Source and Usage Instructions

Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 11: Model-Based RL Stanford Online
• Course Series: Stanford CS224R Deep Reinforcement Learning
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/PvqyGnOirgA?si=UW4gaX-dKsZ4URHc
