One. Course Details
This is the eleventh lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by Professor Chelsea Finn. It provides a comprehensive introduction to model-based reinforcement learning, a fundamental paradigm that learns a simulator of the environment to enable planning and more data-efficient learning.
The lecture begins with a high-level recap of all reinforcement learning algorithms covered in the course, organizing them into online/offline, on-policy/off-policy, and imitation learning categories. It then explains the core idea behind model-based RL, how to learn accurate dynamics models, and two primary ways to use these models: planning and data generation. The lecture covers both gradient-based and sampling-based planning methods, introduces the widely used Model Predictive Control (MPC) framework, and concludes with a detailed case study of dexterous robot manipulation that demonstrates the data efficiency advantages of model-based RL over model-free approaches.
Two. Key Learning Objectives
By the end of this lecture, students should be able to:
- Classify all major RL algorithms along the online/offline, on-policy/off-policy, and imitation learning/reinforcement learning axes
- Explain the core intuition behind model-based RL and identify its three fundamental challenges
- Compare the tradeoffs between learning dynamics models in pixel space versus learned latent space
- Implement both gradient-based and sampling-based planning methods using learned dynamics models
- Describe the Model Predictive Control (MPC) framework and explain why it is robust to model errors
- Understand the role of model ensembles in mitigating model inaccuracies
- Appreciate the data efficiency advantages of model-based RL for real-world robotics applications
Three. Memorable Course Quotes
- "The key idea behind model-based reinforcement learning is essentially we're going to try to learn a simulator."
- "Planning is at every single state, we're actually going to be planning forward and thinking ahead: what are the actions I should consider that I think will lead to good reward?"
- "Model Predictive Control is very, very common in certain robotic systems, very widely used."
- "Data efficiency is really important for fragile hardware. This robot hand is very fragile, and all the students are afraid to touch it because they don't want to break it."
- "The model-based method is using maybe on the order of 100,000 time steps of data, whereas the model-free methods are using half a million time steps. So it's a lot more efficient."
- "If you have a good universal model of the world, you can plug in different rewards or different goals at test time and get a policy that can follow them."
Four. Detailed Study Notes
4.1 Algorithm Landscape Recap
All reinforcement learning and imitation learning algorithms can be organized along three primary axes:
| Category | Subcategory | Algorithms |
|---|---|---|
| Online RL | On-policy | REINFORCE, Vanilla Policy Gradient |
| Online RL | Off-policy (no replay buffer) | PPO, Importance Sampling |
| Online RL | Off-policy (with replay buffer) | DQN, SAC |
| Offline RL | In-distribution constraint | AWR, AWAC, IQL |
| Offline RL | Conservative regularization | CQL |
| Imitation Learning | Offline | Behavior Cloning |
| Imitation Learning | Online | DAgger |
The lecture's overview of these algorithms is also color-coded by what each method trains:
- Policy gradient methods (blue): Only train a policy
- Actor-critic methods (purple): Train both a policy and a critic (PPO, SAC)
- Q-learning methods (green): Only train a critic (DQN)
- As you move from left to right, algorithms require more online data and are generally more expensive to train
4.2 Core Idea of Model-Based RL
Model-based RL takes a fundamentally different approach from the model-free algorithms covered so far:
- First, learn a dynamics model (simulator) that predicts the next state given the current state and action:
  s' = f(s, a)
- Then, use this learned model to plan actions or generate synthetic data for training
Advantages:
- Data efficiency: Can learn good policies with far less real-world data than model-free methods (roughly 5x less in the case study in 4.8)
- Transferability: A single good model can be used for multiple different tasks with different reward functions
- Interpretability: The model provides insight into how the world works
Challenges:
- Model inaccuracy: Learned models are never perfect, and small one-step errors compound over a multi-step rollout
- Distribution mismatch: The model is only accurate on states covered by the training data
- Planning complexity: Planning with high-dimensional state and action spaces is computationally expensive
4.3 Learning Dynamics Models
There are three scenarios for obtaining a dynamics model:
- Known model: You already have an accurate simulator (e.g., chess, Atari games)
- Partially known model: You know most of the physics but need to fit unknown parameters (e.g., friction coefficients)
- Unknown model: You learn the entire model from scratch using neural networks
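In the partially known and unknown cases, fitting the model is a supervised regression problem on logged (s, a, s') transitions. The sketch below is a minimal illustration under strong assumptions: a hypothetical 1-D system whose true dynamics happen to be linear, and a two-parameter least-squares fit standing in for the neural network a real system would use. The names `true_step`, `collect_transitions`, and `fit_linear_dynamics` are made up for this example.

```python
import random

def true_step(s, a):
    """Hypothetical ground-truth dynamics, unknown to the learner."""
    return 0.9 * s + 0.5 * a

def collect_transitions(n, rng):
    """Gather (s, a, s') tuples from randomly chosen states and actions."""
    data = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-1.0, 1.0)
        data.append((s, a, true_step(s, a)))
    return data

def fit_linear_dynamics(data):
    """Least-squares fit of s' = w_s * s + w_a * a via the 2x2 normal equations."""
    ss = sum(s * s for s, _, _ in data)
    sa = sum(s * a for s, a, _ in data)
    aa = sum(a * a for _, a, _ in data)
    sy = sum(s * y for s, _, y in data)
    ay = sum(a * y for _, a, y in data)
    det = ss * aa - sa * sa
    w_s = (sy * aa - sa * ay) / det
    w_a = (ss * ay - sa * sy) / det
    return w_s, w_a

rng = random.Random(0)
w_s, w_a = fit_linear_dynamics(collect_transitions(200, rng))
print(w_s, w_a)  # fitted coefficients of the learned model
```

Because the toy data is noise-free and the model class matches the true dynamics, the fit recovers the coefficients almost exactly; with real data neither condition usually holds, which is where the model-inaccuracy challenge comes from.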
4.3.1 Model Learning Approaches
- Pixel space modeling: Directly predict future images from current images and actions. This is very expressive but computationally expensive and requires modeling irrelevant details.
- Latent space modeling: First learn a low-dimensional representation of the state, then model dynamics in this latent space. This is more computationally efficient but requires learning a good representation that preserves all task-relevant information.
4.4 Using Learned Models
There are two primary ways to use a learned dynamics model:
- Planning: At test time, use the model to simulate different possible action sequences and select the one that maximizes reward
- Data generation: Generate synthetic trajectories using the model and use them to train a model-free policy
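The data-generation use can be sketched as follows. This is an illustrative example, not the lecture's algorithm: `learned_model`, `reward`, and `noisy_policy` are hypothetical stand-ins, and branching short rollouts from real states (rather than simulating long trajectories) is an assumed design choice that limits how far model errors can compound.

```python
import random

def learned_model(s, a):
    """Stand-in for a learned dynamics model (hypothetical coefficients)."""
    return 0.9 * s + 0.5 * a

def reward(s, a):
    """Assumed known reward: stay near the origin using small actions."""
    return -(s ** 2) - 0.1 * (a ** 2)

def generate_synthetic(real_states, policy, horizon, rng):
    """Branch short model rollouts from real states; return (s, a, r, s') tuples."""
    synthetic = []
    for s in real_states:
        for _ in range(horizon):
            a = policy(s, rng)
            s_next = learned_model(s, a)
            synthetic.append((s, a, reward(s, a), s_next))
            s = s_next
    return synthetic

def noisy_policy(s, rng):
    """Hypothetical exploration policy: proportional feedback plus Gaussian noise."""
    return -0.5 * s + rng.gauss(0.0, 0.1)

rng = random.Random(0)
real_states = [rng.uniform(-1.0, 1.0) for _ in range(10)]
batch = generate_synthetic(real_states, noisy_policy, horizon=5, rng=rng)
print(len(batch))  # 10 start states x 5 model steps each
```

The resulting tuples have the same shape as real replay-buffer transitions, so they could be fed to an off-policy model-free learner alongside real data.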
4.5 Planning with Learned Models
Planning is the process of finding a sequence of actions that maximizes the sum of future rewards according to the learned model. There are two main families of planning methods: gradient-based and sampling-based.
4.5.1 Gradient-Based Planning
Gradient-based planning backpropagates gradients from the reward through the dynamics model to optimize the action sequence:
- Initialize a sequence of actions randomly
- Simulate the trajectory using the dynamics model
- Compute the gradient of the total reward with respect to the actions
- Update the actions using gradient ascent
- Repeat until convergence
Advantages:
- Scales well to high-dimensional action spaces
- Works well with smooth optimization landscapes
Disadvantages:
- Can get stuck in local optima
- Requires differentiable dynamics and reward models
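The gradient-based planning loop can be sketched on a toy problem. The example below assumes a hypothetical 1-D linear model s' = A*s + B*a and a quadratic reward, so the gradient can be written by hand as a backward pass through time; with a neural-network model an automatic differentiation library would do this step. All constants and function names are illustrative.

```python
import random

# Hypothetical 1-D learned model s' = A*s + B*a with a quadratic reward.
A, B = 0.9, 0.5
GOAL, ACTION_COST = 1.0, 0.01

def rollout(s0, actions):
    """Simulate the model forward and return [s_0, s_1, ..., s_H]."""
    states = [s0]
    for a in actions:
        states.append(A * states[-1] + B * a)
    return states

def total_reward(s0, actions):
    """Sum of -(s_t - GOAL)^2 over predicted states, minus an action penalty."""
    states = rollout(s0, actions)
    tracking = sum(-(s - GOAL) ** 2 for s in states[1:])
    effort = ACTION_COST * sum(a * a for a in actions)
    return tracking - effort

def reward_gradient(s0, actions):
    """Backpropagate through time to get d(total_reward)/d(a_t) for every t."""
    states = rollout(s0, actions)
    grads = [0.0] * len(actions)
    d_state = 0.0  # d(total_reward)/d(s_{t+1}), accumulated backwards in time
    for t in reversed(range(len(actions))):
        d_state = -2.0 * (states[t + 1] - GOAL) + A * d_state
        grads[t] = B * d_state - 2.0 * ACTION_COST * actions[t]
    return grads

def plan_by_gradient_ascent(s0, horizon, n_steps=300, lr=0.1, seed=0):
    """Random initialization followed by repeated gradient-ascent updates."""
    rng = random.Random(seed)
    actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
    for _ in range(n_steps):
        grads = reward_gradient(s0, actions)
        actions = [a + lr * g for a, g in zip(actions, grads)]
    return actions

initial = plan_by_gradient_ascent(0.0, horizon=5, n_steps=0)  # the random starting plan
planned = plan_by_gradient_ascent(0.0, horizon=5)
print(total_reward(0.0, initial), total_reward(0.0, planned))
```

This toy objective is concave, so gradient ascent cannot get stuck here; the local-optima caveat applies once the model or reward is nonlinear.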
4.5.2 Sampling-Based Planning
Sampling-based planning (also called zeroth-order or derivative-free optimization) does not require gradients:
- Random shooting: Sample many action sequences, simulate all of them, and select the one with the highest reward. This is surprisingly effective for low-dimensional action spaces.
- Cross-Entropy Method (CEM): An iterative refinement of random shooting:
  - Sample action sequences from an initial distribution
  - Rank them by reward
  - Fit a new distribution to the top-performing samples
  - Resample from the new distribution and repeat
Advantages:
- Easy to parallelize
- Works with non-differentiable models and discrete action spaces
- Less prone to local optima than gradient-based methods
Disadvantages:
- Scales poorly to very high-dimensional action spaces
- Requires many samples to cover the space well
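Both sampling-based planners fit in a few lines on a toy problem. The sketch below assumes a hypothetical 1-D linear model and a reward that penalizes distance to a goal; the sample counts, elite fraction, and the diagonal Gaussian sampling distribution are illustrative choices rather than values from the lecture.

```python
import random
import statistics

A, B, GOAL = 0.9, 0.5, 1.0  # hypothetical learned model s' = A*s + B*a

def plan_return(s0, actions):
    """Predicted sum of rewards -(s - GOAL)^2 along a model rollout."""
    s, total = s0, 0.0
    for a in actions:
        s = A * s + B * a
        total -= (s - GOAL) ** 2
    return total

def random_shooting(s0, horizon, n_samples, rng):
    """Sample action sequences uniformly and keep the best-scoring one."""
    candidates = [[rng.uniform(-2.0, 2.0) for _ in range(horizon)]
                  for _ in range(n_samples)]
    return max(candidates, key=lambda acts: plan_return(s0, acts))

def cem(s0, horizon, n_samples, n_elite, n_iters, rng):
    """Cross-entropy method with an independent Gaussian per time step."""
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(n_iters):
        samples = [[rng.gauss(mean[t], std[t]) for t in range(horizon)]
                   for _ in range(n_samples)]
        samples.sort(key=lambda acts: plan_return(s0, acts), reverse=True)
        elites = samples[:n_elite]
        # Refit the sampling distribution to the top-performing sequences.
        mean = [statistics.fmean([e[t] for e in elites]) for t in range(horizon)]
        std = [statistics.pstdev([e[t] for e in elites]) + 1e-6
               for t in range(horizon)]
    return mean

rng = random.Random(0)
shooting_plan = random_shooting(0.0, horizon=5, n_samples=200, rng=rng)
cem_plan = cem(0.0, horizon=5, n_samples=200, n_elite=20, n_iters=5, rng=rng)
print(plan_return(0.0, shooting_plan), plan_return(0.0, cem_plan))
```

Every candidate is scored independently, which is why these methods parallelize well: the inner scoring loop could be batched across samples on a GPU.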
4.6 Model Predictive Control (MPC)
The naive planning approach of executing an entire planned action sequence is called open-loop control. It is very brittle to model errors and environmental disturbances.
Model Predictive Control (MPC) is a closed-loop alternative that addresses these issues:
- At each time step, plan a sequence of H actions into the future
- Execute only the first action in the sequence
- Observe the resulting next state
- Replan a new sequence of H actions starting from the new state
- Repeat for the rest of the episode
Advantages:
- Robustness to model errors: If the model makes a mistake, the replanning step corrects it at the next time step
- Robustness to environmental changes: Reacts to unexpected events in real time
- Stability: Generally produces more stable and reliable behavior than open-loop control
Practical refinements:
- Warm starting: Initialize the new plan with the tail of the previous plan to reduce computation and prevent oscillation
- Adaptive replanning frequency: Replan less frequently as the model becomes more accurate
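The MPC loop can be sketched as below. To show why replanning helps, the hypothetical real environment `true_step` includes a constant disturbance that the planner's `model` does not know about. The planner is plain random shooting; the horizon and sample counts are arbitrary illustrative values.

```python
import random

GOAL = 1.0

def true_step(s, a):
    """Real environment (hypothetical): has a disturbance the model lacks."""
    return 0.9 * s + 0.5 * a + 0.05

def model(s, a):
    """Imperfect learned model used only inside the planner."""
    return 0.9 * s + 0.5 * a

def plan(s0, horizon, n_samples, rng):
    """Random-shooting planner: best of n_samples sequences under the model."""
    best_actions, best_return = None, float("-inf")
    for _ in range(n_samples):
        actions = [rng.uniform(-2.0, 2.0) for _ in range(horizon)]
        s, ret = s0, 0.0
        for a in actions:
            s = model(s, a)
            ret -= (s - GOAL) ** 2
        if ret > best_return:
            best_actions, best_return = actions, ret
    return best_actions

def run_mpc(s0, n_steps, horizon, n_samples, rng):
    """Closed loop: plan H steps, execute one action, observe, replan."""
    s = s0
    visited = [s]
    for _ in range(n_steps):
        actions = plan(s, horizon, n_samples, rng)
        s = true_step(s, actions[0])  # only the first planned action is executed
        visited.append(s)             # the next plan starts from the observed state
    return visited

rng = random.Random(0)
visited = run_mpc(0.0, n_steps=20, horizon=3, n_samples=500, rng=rng)
print(visited[-1])  # final state reached in the real environment
```

Because each plan starts from the state actually observed, the unmodeled disturbance is absorbed one step at a time instead of accumulating over the whole horizon as it would in open-loop execution.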
4.7 Addressing the Distribution Mismatch Problem
The biggest challenge in model-based RL is that the model is only accurate on states covered by the initial training data. The solution is an iterative data collection loop:
- Collect initial data using a random policy
- Train the dynamics model on this data
- Use MPC to select actions and execute them in the environment
- Add the new trajectories to the dataset
- Retrain the dynamics model on the expanded dataset
- Repeat until convergence
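The loop above can be sketched end to end on a toy system. Everything here is an assumption made for illustration: a hypothetical noisy 1-D environment, a linear least-squares model in place of a neural network, a random-shooting MPC planner, and a fixed three iterations instead of a convergence check.

```python
import random

GOAL = 1.0

def true_step(s, a, rng):
    """Real environment (hypothetical): linear dynamics with small noise."""
    return 0.9 * s + 0.5 * a + rng.gauss(0.0, 0.01)

def fit_linear(data):
    """Least-squares fit of s' = w_s * s + w_a * a from (s, a, s') tuples."""
    ss = sum(s * s for s, _, _ in data)
    sa = sum(s * a for s, a, _ in data)
    aa = sum(a * a for _, a, _ in data)
    sy = sum(s * y for s, _, y in data)
    ay = sum(a * y for _, a, y in data)
    det = ss * aa - sa * sa
    return (sy * aa - sa * ay) / det, (ss * ay - sa * sy) / det

def mpc_action(weights, s0, rng, horizon=3, n_samples=100):
    """Random-shooting planner on the learned model; returns the first action."""
    w_s, w_a = weights
    best_first, best_return = 0.0, float("-inf")
    for _ in range(n_samples):
        actions = [rng.uniform(-2.0, 2.0) for _ in range(horizon)]
        s, ret = s0, 0.0
        for a in actions:
            s = w_s * s + w_a * a
            ret -= (s - GOAL) ** 2
        if ret > best_return:
            best_first, best_return = actions[0], ret
    return best_first

def collect(policy, n_steps, rng):
    """Run a policy in the real environment and log (s, a, s') transitions."""
    s, data = 0.0, []
    for _ in range(n_steps):
        a = policy(s)
        s_next = true_step(s, a, rng)
        data.append((s, a, s_next))
        s = s_next
    return data

rng = random.Random(0)
dataset = collect(lambda s: rng.uniform(-1.0, 1.0), 50, rng)  # initial random data
for _ in range(3):
    weights = fit_linear(dataset)                              # (re)train the model
    # Act with MPC in the real environment and aggregate the new transitions.
    dataset += collect(lambda s: mpc_action(weights, s, rng), 20, rng)
weights = fit_linear(dataset)                                  # final retraining
print(len(dataset), weights)
```

The data gathered by the MPC policy concentrates on the states the planner actually visits, which is exactly the region where the model needs to be accurate.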
4.8 Case Study: Dexterous Robot Manipulation
The lecture concludes with a case study of the PDDM algorithm for dexterous manipulation with a five-fingered robot hand.
Setup:
- State space: 24 joint angles of the hand plus 3D position of the object
- Action space: 24-dimensional joint torques
- Reward: Distance to target object trajectory plus penalty for dropping the object
- Model: Ensemble of 3 neural networks to mitigate model errors
- Planner: Modified cross-entropy method with temporal smoothing
Results:
- Only the model-based method was able to solve the complex ball rotation task
- Achieved 100% success rate on 90-degree rotations after only 4 hours of real-world data
- Was 5x more data-efficient than model-free methods like SAC and Natural Policy Gradient
- The ensemble of models was critical for good performance, as it reduced the tendency to exploit model errors
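One way an ensemble can discourage a planner from exploiting model errors is to score each candidate action sequence under every ensemble member and aggregate the results. The sketch below is not the PDDM implementation: the linear models, the bootstrap resampling used to make members differ, and the mean/spread aggregation are all assumptions chosen to keep the example small.

```python
import random
import statistics

def true_step(s, a, rng):
    """Real environment (hypothetical): noisy 1-D linear dynamics."""
    return 0.9 * s + 0.5 * a + rng.gauss(0.0, 0.05)

def fit_linear(data):
    """Least-squares fit of s' = w_s * s + w_a * a from (s, a, s') tuples."""
    ss = sum(s * s for s, _, _ in data)
    sa = sum(s * a for s, a, _ in data)
    aa = sum(a * a for _, a, _ in data)
    sy = sum(s * y for s, _, y in data)
    ay = sum(a * y for _, a, y in data)
    det = ss * aa - sa * sa
    return (sy * aa - sa * ay) / det, (ss * ay - sa * sy) / det

def train_ensemble(data, n_models, rng):
    """Fit each member on a bootstrap resample so the members disagree slightly."""
    models = []
    for _ in range(n_models):
        resampled = [rng.choice(data) for _ in data]
        models.append(fit_linear(resampled))
    return models

def ensemble_return(models, s0, actions, goal):
    """Score one action sequence under every member; return (mean, spread)."""
    returns = []
    for w_s, w_a in models:
        s, ret = s0, 0.0
        for a in actions:
            s = w_s * s + w_a * a
            ret -= (s - goal) ** 2
        returns.append(ret)
    return statistics.fmean(returns), statistics.pstdev(returns)

rng = random.Random(0)
data = []
for _ in range(100):
    s, a = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    data.append((s, a, true_step(s, a, rng)))

models = train_ensemble(data, n_models=3, rng=rng)
mean_ret, spread = ensemble_return(models, 0.0, [1.0, 1.0, 1.0], goal=1.0)
print(mean_ret, spread)
```

A planner that ranks candidates by `mean_ret` is less likely to chase a reward that only one flawed model predicts, and a large `spread` flags action sequences that lead into regions where the models disagree.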


