One. Course Details
This is the twelfth lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by Professor Chelsea Finn. It completes the coverage of model-based reinforcement learning and introduces the fundamental concepts of multitask learning.

The lecture first presents the second major paradigm for using learned dynamics models: generating synthetic training data to augment real-world experience. It provides a detailed analysis of the tradeoffs involved in model-based RL and clear guidance on when to use it versus model-free methods. The second half of the lecture focuses on multitask learning, covering both imitation learning and reinforcement learning settings, task representation methods, architectural considerations, and the powerful hindsight relabeling technique that dramatically improves data efficiency for goal-reaching tasks.

Two. Key Learning Objectives
By the end of this lecture, students should be able to:
- Implement the synthetic data generation pipeline for model-based reinforcement learning
- Compare the two primary uses of learned models (planning vs. data generation) and their respective tradeoffs
- Evaluate whether model-based RL is appropriate for a given domain based on its characteristics
- Formulate the multitask reinforcement learning problem and represent tasks using different types of identifiers
- Implement multitask imitation learning with best practices for stable training
- Explain the hindsight relabeling technique and apply it to goal-conditioned reinforcement learning problems
- Design neural network architectures for multimodal multitask policies that accept language, video, or goal state task descriptors

Three. Memorable Course Quotes
- "You're kind of trading data in the environment for compute. And one of the reasons why that trade-off is often favorable is because collecting online data in the real environment might be unsafe, or it might just be really expensive if you have to run experiments on physical equipment."
- "The big upside is that if you can learn a pretty good model, then it is immensely useful to be able to predict the future. And you can get away with using far, far less real data by using the data that's generated."
- "One downside is that the model doesn't optimize for task performance and errors. There's this mismatch between what you're optimizing for when you train a model and what you actually might care about, in terms of downstream policy performance."
- "Generalist systems end up being more reliable and more performant than if we were to just train on one task."
- "Hindsight relabeling: you look at what you did. If you did something good, then in hindsight, you can then relabel that to be a different task."

Four. Detailed Study Notes
4.1 Model-Based RL with Synthetic Data Generation
In the previous lecture, we covered the first major use of learned models: planning at test time. This lecture covers the second major use: generating synthetic data to train policies.

4.1.1 The Synthetic Data Generation Pipeline
The core idea is to use the learned dynamics model to imagine additional trajectories that supplement the real data collected from the environment:
- Collect real data: Gather trajectories from the real environment using any initial policy
- Train the dynamics model: Learn a model p(s' | s, a) that predicts the next state given the current state and action. Also train a reward model r(s, a) if the reward is not known.
- Generate synthetic data:
  - Sample states uniformly from all states in the real dataset
  - Generate short rollouts (typically 1-5 steps) starting from these sampled states using the current policy
  - These short rollouts avoid the compounding error problem that plagues long synthetic trajectories
- Update the policy: Train the policy using both the real data and the generated synthetic data
- Repeat: Iterate the process by collecting more real data, updating the model, generating more synthetic data, and improving the policy

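The steps above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the lecture's implementation: the environment is a hypothetical 1-D linear system (`true_step`), the dynamics model is a least-squares linear fit, the policy is a fixed placeholder, and the reward model and the policy update itself are omitted. All function names are made up for this example.

```python
import random

def true_step(s, a, rng):
    """Stand-in for the real environment (hypothetical 1-D linear system)."""
    return 0.9 * s + 0.5 * a + rng.gauss(0.0, 0.01)

def collect_real_data(n, rng):
    """Step 1: gather (s, a, s') transitions with a random initial policy."""
    data = []
    s = rng.uniform(-1.0, 1.0)
    for _ in range(n):
        a = rng.uniform(-1.0, 1.0)
        s_next = true_step(s, a, rng)
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_linear_model(data):
    """Step 2: least-squares fit of s' ~ w_s * s + w_a * a (2x2 normal equations)."""
    ss = sum(s * s for s, a, _ in data)
    sa = sum(s * a for s, a, _ in data)
    aa = sum(a * a for s, a, _ in data)
    sy = sum(s * y for s, a, y in data)
    ay = sum(a * y for s, a, y in data)
    det = ss * aa - sa * sa
    w_s = (sy * aa - sa * ay) / det
    w_a = (ss * ay - sa * sy) / det
    return w_s, w_a

def generate_synthetic(data, model, policy, n_rollouts, horizon, rng):
    """Step 3: short model rollouts that branch off states seen in real data."""
    w_s, w_a = model
    synthetic = []
    for _ in range(n_rollouts):
        s = rng.choice(data)[0]          # start state sampled uniformly from real data
        for _ in range(horizon):         # keep the horizon short to limit model error
            a = policy(s)
            s_next = w_s * s + w_a * a   # model prediction, not an environment step
            synthetic.append((s, a, s_next))
            s = s_next
    return synthetic

def policy(s):
    """Placeholder for the current policy (assumed for illustration)."""
    return -0.5 * s

rng = random.Random(0)
real = collect_real_data(200, rng)
model = fit_linear_model(real)
synthetic = generate_synthetic(real, model, policy, n_rollouts=100, horizon=3, rng=rng)
training_set = real + synthetic          # Step 4: the policy trains on both
```

In a full loop, the updated policy would collect more real data and the four steps would repeat.
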
4.1.2 Key Properties
- Test time efficiency: Unlike planning, this approach does not increase test time computation. The final policy is a standard parametric policy that runs in a single forward pass.
- Data-compute tradeoff: Exchanges expensive real-world environment data for cheaper computational resources. This is especially valuable for domains where real data collection is dangerous, expensive, or time-consuming.
- Algorithm agnostic: Compatible with almost any reinforcement learning algorithm, including both on-policy and off-policy methods.

4.1.3 When to Use Model-Based RL
Model-based RL offers significant advantages but also has important drawbacks. The decision to use it depends heavily on the domain:

| Advantages of Model-Based RL | Disadvantages of Model-Based RL |
|---|---|
| Dramatically higher data efficiency | Model optimization objective is mismatched to downstream task performance |
| Task-agnostic models can transfer across multiple tasks | Learning a good model can be harder than learning a good policy directly |
| Can leverage large amounts of unlabeled data without rewards | Additional hyperparameters and increased training complexity |
| Enables both planning and data generation | Model errors compound over long horizons |

Track record by domain:
- Highly successful: Low-dimensional state spaces (robot joint angles, object positions), legged locomotion, dexterous manipulation
- Less successful: High-dimensional image observations, complex fluid dynamics, scenarios with highly stochastic dynamics

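A small worked example may help show why model errors compound over long horizons. The sketch below assumes a hypothetical 1-D system whose true dynamics are s' = 1.00 * s, and a learned model that is off by 2% per step; both coefficients are invented for illustration.

```python
def rollout(coef, s0, horizon):
    """Roll a 1-D linear system s' = coef * s forward for `horizon` steps."""
    states = [s0]
    for _ in range(horizon):
        states.append(coef * states[-1])
    return states

true_coef = 1.00    # hypothetical true dynamics
model_coef = 1.02   # hypothetical learned model with a 2% one-step error

real = rollout(true_coef, 1.0, 50)
imagined = rollout(model_coef, 1.0, 50)
errors = [abs(r - m) for r, m in zip(real, imagined)]

for h in (1, 5, 50):
    print(f"horizon {h:2d}: absolute error = {errors[h]:.3f}")
```

Under these assumptions the error is about 0.02 after one step, roughly 0.10 after five steps, and larger than the true state itself after 50 steps, which is the reason short rollouts are preferred when generating synthetic data.
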
4.2 Multitask Reinforcement Learning
Multitask RL aims to train a single generalist policy that can perform many different tasks, rather than training separate specialist policies for each task.

4.2.1 Motivation
There are two primary motivations for multitask learning:
- Improved data efficiency: Tasks share common structure and skills. Learning them together amortizes the cost of learning these shared skills across all tasks.
- More capable generalist systems: Empirically, generalist policies trained on many tasks often outperform specialist policies trained on only one task, even on that single task.

4.2.2 Problem Formulation
Each task is defined as a separate Markov Decision Process (MDP) with its own:
- State space S_i
- Action space A_i
- Transition dynamics T_i(s' | s, a)
- Reward function r_i(s, a)
- Initial state distribution p_i(s_0)

Each task also comes with a task identifier z_i that tells the policy which task it should perform. Task identifiers can take many forms:
- Integer indices (simplest but least expressive)
- Natural language descriptions (most flexible and human-interpretable)
- Goal states (for goal-reaching tasks)
- Video demonstrations of the task

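As a rough illustration of how a task identifier reaches the policy, the sketch below builds the policy input by concatenating the state with z. It covers only the two simplest identifier types: an integer index (turned into a one-hot vector) and a goal state. The helper names are hypothetical, and language or video identifiers would first need a learned encoder, which is not shown.

```python
def one_hot(task_index, num_tasks):
    """Represent an integer task index as a one-hot vector."""
    z = [0.0] * num_tasks
    z[task_index] = 1.0
    return z

def policy_input(state, z):
    """Concatenate the state and the task identifier into one input vector."""
    return list(state) + list(z)

state = [0.2, -0.1, 0.7]

# Task given as an integer index (task 2 out of 4)
x_index = policy_input(state, one_hot(2, num_tasks=4))

# Task given as a goal state in the same space as the state
x_goal = policy_input(state, [0.5, 0.5, 0.0])
```

The same policy network can then be trained on any task, as long as every identifier of a given type has a fixed size.
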
4.2.3 Multitask Imitation Learning
Multitask imitation learning extends standard behavior cloning to the multitask setting:
- Collect expert demonstration trajectories for each task
- Train a single policy π_θ(a | s, z) that takes both the state and task identifier as input
- Minimize the average behavior cloning loss across all tasks:
  L(θ) = E_{i ~ Uniform(1..N), (s,a) ~ D_i} [ -log π_θ(a | s, z_i) ]

Best practices for stable training:
- Stratified mini-batch sampling: Ensure each mini-batch contains data from all tasks to prevent high-variance gradient updates that cause catastrophic forgetting
- Per-task normalization: Normalize states and actions separately for each task to ensure they are on similar scales
- Balanced task weighting: Adjust the loss weights for different tasks to prevent high-data tasks from dominating the training process

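The sketch below illustrates stratified sampling and per-task normalization on toy data, together with a behavior cloning loss. It makes several assumptions for brevity: states and actions are scalars, the two datasets are synthetic, and the loss is a mean squared error, which matches -log π_θ only up to constants for a Gaussian policy with fixed variance. Function names are hypothetical.

```python
import random
import statistics

def stratified_batch(datasets, per_task, rng):
    """Draw the same number of (state, action) pairs from every task."""
    batch = []
    for task_id, pairs in datasets.items():
        for s, a in rng.choices(pairs, k=per_task):   # sampling with replacement
            batch.append((task_id, s, a))
    return batch

def per_task_stats(datasets):
    """Mean and standard deviation of the states of each task."""
    stats = {}
    for task_id, pairs in datasets.items():
        states = [s for s, _ in pairs]
        stats[task_id] = (statistics.mean(states), statistics.pstdev(states) or 1.0)
    return stats

def normalize(batch, stats):
    """Rescale each state with the statistics of its own task."""
    return [(t, (s - stats[t][0]) / stats[t][1], a) for t, s, a in batch]

def bc_loss(batch, policy):
    """Mean squared error between predicted and expert actions."""
    return sum((policy(s, t) - a) ** 2 for t, s, a in batch) / len(batch)

rng = random.Random(0)
datasets = {
    0: [(100.0 + rng.gauss(0, 10), 1.0) for _ in range(1000)],   # large task, large scale
    1: [(rng.gauss(0, 1), -1.0) for _ in range(10)],             # small task, small scale
}

batch = stratified_batch(datasets, per_task=8, rng=rng)
norm_batch = normalize(batch, per_task_stats(datasets))
loss = bc_loss(norm_batch, lambda s, t: 1.0 if t == 0 else -1.0)
```

Because every task contributes the same number of examples to each batch, the 1000-example task cannot dominate the 10-example task, which mirrors the uniform expectation over tasks in the loss above.
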
4.2.4 Architectural Considerations
Multitask policies must handle multimodal inputs (states, language descriptions, images, videos). Modern architectures typically use:
- Separate encoders for each input modality (language encoder for text, image encoder for images/videos)
- A shared backbone network that processes the combined state and task embedding
- Transformer architectures, which are increasingly dominant for both language and robotic multitask policies

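The data flow of such an architecture can be sketched with toy components. Everything here is an assumption made for illustration: the "language encoder" is a hashed bag of words rather than a real language model, the goal encoder and backbone are single linear layers with random weights, and the dimensions are arbitrary. The only point is that each modality gets its own encoder, all encoders emit embeddings of the same size, and one shared backbone consumes the state plus the embedding.

```python
import math
import random
import zlib

EMBED_DIM = 4    # assumed size of the task embedding produced by every encoder
STATE_DIM = 3
ACTION_DIM = 2

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def linear(weights, x):
    """Matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def encode_language(text):
    """Toy text encoder: hashed bag of words, standing in for a language model."""
    z = [0.0] * EMBED_DIM
    for word in text.lower().split():
        z[zlib.crc32(word.encode("utf-8")) % EMBED_DIM] += 1.0
    return z

def encode_goal(goal_state, goal_weights):
    """Toy goal encoder: one linear layer mapping a goal state to the embedding."""
    return linear(goal_weights, goal_state)

def backbone(state, z, backbone_weights):
    """Shared backbone: one tanh layer over the concatenated [state, embedding]."""
    return [math.tanh(v) for v in linear(backbone_weights, list(state) + list(z))]

rng = random.Random(0)
goal_weights = random_matrix(EMBED_DIM, STATE_DIM, rng)
backbone_weights = random_matrix(ACTION_DIM, STATE_DIM + EMBED_DIM, rng)

state = [0.2, -0.1, 0.7]
action_from_text = backbone(state, encode_language("open the drawer"), backbone_weights)
action_from_goal = backbone(state, encode_goal([0.5, 0.5, 0.0], goal_weights), backbone_weights)
```

Both calls go through the same `backbone_weights`, so skills learned from language-specified tasks and goal-specified tasks share parameters.
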
4.2.5 Multitask Reinforcement Learning & Hindsight Relabeling
Multitask RL extends the same task conditioning approach to reinforcement learning. The most important innovation in this space is hindsight relabeling.

The core insight: When an agent tries to achieve one goal but accidentally achieves a different goal, that experience is still valuable for learning how to achieve the second goal.

How hindsight relabeling works:
- Collect a trajectory while trying to achieve task z_i
- Observe that the trajectory actually achieves a different task z_j
- Relabel the entire trajectory with task identifier z_j and the corresponding rewards for task z_j
- Add the relabeled trajectory to the replay buffer for task z_j

Requirements for hindsight relabeling:
- All tasks share the same state and action space
- All tasks share the same transition dynamics
- An off-policy reinforcement learning algorithm is used (since the relabeled data is off-policy for the new task)
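
A minimal sketch of hindsight relabeling for goal-reaching tasks follows. It assumes a grid world where goals are states, a sparse reward of 1 when the next state equals the goal and 0 otherwise, and the final state of the trajectory as the relabeled goal. These choices, and the function names, are assumptions for illustration and are not taken from the lecture.

```python
def reward(next_state, goal):
    """Assumed sparse goal-reaching reward: 1 on reaching the goal, else 0."""
    return 1.0 if next_state == goal else 0.0

def relabel(trajectory, goal):
    """Attach `goal` to every transition and recompute the reward for it."""
    return [(s, a, reward(s_next, goal), s_next, goal)
            for s, a, s_next in trajectory]

# The agent was asked to reach (3, 3) but ended up at (2, 0).
intended_goal = (3, 3)
trajectory = [
    ((0, 0), "right", (1, 0)),
    ((1, 0), "right", (2, 0)),
]

replay_buffer = []

# Original task: every reward is 0 because (3, 3) was never reached.
replay_buffer += relabel(trajectory, intended_goal)

# Hindsight task: treat the state that was actually reached as the goal.
achieved_goal = trajectory[-1][2]
replay_buffer += relabel(trajectory, achieved_goal)
```

The relabeled copy ends with a reward of 1, so a failed attempt at one goal becomes a successful demonstration for another. An off-policy learner can then sample both copies from `replay_buffer`.
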
These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you master the content of this subject thoroughly. Wish you continuous academic progress and great achievements in your studies.


