Deep Reinforcement Learning, Lecture 12: Synthetic Data Generation & Multitask RL (Structured Notes & In-Depth Analysis)

One. Course Details

This is the twelfth lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by Professor Chelsea Finn. It completes the coverage of model-based reinforcement learning and introduces the fundamental concepts of multitask learning.

The lecture first presents the second major paradigm for using learned dynamics models: generating synthetic training data to augment real-world experience. It provides a detailed analysis of the tradeoffs involved in model-based RL and clear guidance on when to use it versus model-free methods. The second half of the lecture focuses on multitask learning, covering both imitation learning and reinforcement learning settings, task representation methods, architectural considerations, and the powerful hindsight relabeling technique that dramatically improves data efficiency for goal-reaching tasks.

Two. Key Learning Objectives

By the end of this lecture, students should be able to:

Implement the synthetic data generation pipeline for model-based reinforcement learning
Compare the two primary uses of learned models (planning vs. data generation) and their respective tradeoffs
Evaluate whether model-based RL is appropriate for a given domain based on its characteristics
Formulate the multitask reinforcement learning problem and represent tasks using different types of identifiers
Implement multitask imitation learning with best practices for stable training
Explain the hindsight relabeling technique and apply it to goal-conditioned reinforcement learning problems
Design neural network architectures for multimodal multitask policies that accept language, video, or goal state task descriptors

Three. Memorable Course Quotes

"You're kind of trading data in the environment for compute. And one of the reasons why that trade-off is often favorable is because collecting online data in the real environment might be unsafe, or it might just be really expensive if you have to run experiments on physical equipment."
"The big upside is that if you can learn a pretty good model, then it is immensely useful to be able to predict the future. And you can get away with using far, far less real data by using the data that's generated."
"One downside is that the model doesn't optimize for task performance and errors. There's this mismatch between what you're optimizing for when you train a model and what you actually might care about, in terms of downstream policy performance."
"Generalist systems end up being more reliable and more performant than if we were to just train on one task."
"Hindsight relabeling: you look at what you did. If you did something good, then in hindsight, you can then relabel that to be a different task."

Four. Detailed Study Notes

4.1 Model-Based RL with Synthetic Data Generation

In the previous lecture, we covered the first major use of learned models: planning at test time. This lecture covers the second major use: generating synthetic data to train policies.

4.1.1 The Synthetic Data Generation Pipeline

The core idea is to use the learned dynamics model to imagine additional trajectories that supplement the real data collected from the environment:

1. Collect real data: Gather trajectories from the real environment using any initial policy
2. Train the dynamics model: Learn a model p(s' | s, a) that predicts the next state given the current state and action. Also train a reward model r(s, a) if the reward is not known.
3. Generate synthetic data:
- Sample states uniformly from all states in the real dataset
- Generate short rollouts (typically 1-5 steps) starting from these sampled states using the current policy
- These short rollouts limit the compounding error problem that plagues long synthetic trajectories
4. Update the policy: Train the policy using both the real data and the generated synthetic data
5. Repeat: Iterate the process by collecting more real data, updating the model, generating more synthetic data, and improving the policy
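
The data-generation steps above can be sketched on a toy problem. This is a minimal illustration, not the lecture's implementation: the 1-D environment, the known reward, the linear model fit, and the random placeholder policy are all assumptions made so that the example runs with the Python standard library alone.

```python
import random

def true_step(s, a):
    """Real environment (toy, 1-D, assumed): the state moves by the action."""
    return s + a

def reward_fn(s, a):
    """Reward assumed known here: stay close to the origin."""
    return -abs(s + a)

def fit_linear_model(real_data):
    """Least-squares fit of s' ~ s + w * a from real transitions (s, a, r, s')."""
    num = sum(a * (s2 - s) for s, a, _, s2 in real_data)
    den = sum(a * a for _, a, _, _ in real_data)
    w = num / den
    return lambda s, a: s + w * a

def generate_synthetic(model, policy, real_data, n_starts, horizon):
    """Branch short model rollouts from states sampled uniformly from real data."""
    synthetic = []
    for _ in range(n_starts):
        s = random.choice(real_data)[0]      # uniform over real states
        for _ in range(horizon):             # short horizon limits compounding error
            a = policy(s)
            s2 = model(s, a)                 # imagined next state
            synthetic.append((s, a, reward_fn(s, a), s2))
            s = s2
    return synthetic

random.seed(0)
policy = lambda s: random.uniform(-1.0, 1.0)   # placeholder for the current policy

# Step 1: collect real transitions.
real_data = []
s = 0.0
for _ in range(50):
    a = policy(s)
    s2 = true_step(s, a)
    real_data.append((s, a, reward_fn(s, a), s2))
    s = s2

# Steps 2 and 3: fit the model, then generate short synthetic rollouts.
model = fit_linear_model(real_data)
synthetic = generate_synthetic(model, policy, real_data, n_starts=20, horizon=3)

# Step 4 would update the policy on the union of real and synthetic data.
training_set = real_data + synthetic
print(len(real_data), len(synthetic))
```

In a full loop, the improved policy would then collect more real data and the cycle would repeat.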

4.1.2 Key Properties

Test time efficiency: Unlike planning, this approach does not increase test time computation. The final policy is a standard parametric policy that runs in a single forward pass.
Data-compute tradeoff: Exchanges expensive real-world environment data for cheaper computational resources. This is especially valuable for domains where real data collection is dangerous, expensive, or time-consuming.
Algorithm agnostic: Compatible with almost any reinforcement learning algorithm, including both on-policy and off-policy methods.

4.1.3 When to Use Model-Based RL

Model-based RL offers significant advantages but also has important drawbacks. The decision to use it depends heavily on the domain:

Advantages of Model-Based RL	Disadvantages of Model-Based RL
Dramatically higher data efficiency	Model optimization objective is mismatched to downstream task performance
Task-agnostic models can transfer across multiple tasks	Learning a good model can be harder than learning a good policy directly
Can leverage large amounts of unlabeled data without rewards	Additional hyperparameters and increased training complexity
Enables both planning and data generation	Model errors compound over long horizons
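
The last disadvantage in the table, compounding model error, can be seen in a small numerical sketch. The dynamics and the size of the model error below are made-up values chosen only to illustrate the effect: a model that is slightly wrong at every step is fed its own predictions, so the gap to the real trajectory grows with the rollout horizon.

```python
def true_step(s):
    return 1.10 * s        # hypothetical real dynamics

def model_step(s):
    return 1.12 * s        # hypothetical learned model with a small one-step error

def rollout_error(s0, horizon):
    """Absolute gap between model and real state after `horizon` steps."""
    s_true, s_model = s0, s0
    for _ in range(horizon):
        s_true = true_step(s_true)
        s_model = model_step(s_model)   # the model consumes its own prediction
    return abs(s_model - s_true)

errors = {h: rollout_error(1.0, h) for h in (1, 5, 20, 50)}
for h, e in errors.items():
    print(f"horizon {h:2d}: error {e:.3f}")
```

This is the reason the pipeline in 4.1.1 keeps synthetic rollouts short.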

Domain suitability:

Highly successful: Low-dimensional state spaces (robot joint angles, object positions), legged locomotion, dexterous manipulation
Less successful: High-dimensional image observations, complex fluid dynamics, scenarios with highly stochastic dynamics

4.2 Multitask Reinforcement Learning

Multitask RL aims to train a single generalist policy that can perform many different tasks, rather than training separate specialist policies for each task.

4.2.1 Motivation

There are two primary motivations for multitask learning:

Improved data efficiency: Tasks share common structure and skills. Learning them together amortizes the cost of learning these shared skills across all tasks.
More capable generalist systems: Empirically, generalist policies trained on many tasks often outperform specialist policies trained on only one task, even on that single task.

4.2.2 Problem Formulation

Each task is defined as a separate Markov Decision Process (MDP) with its own:

State space S_i
Action space A_i
Transition dynamics T_i(s' | s, a)
Reward function r_i(s, a)
Initial state distribution p_i(s_0)

To train a single policy across all tasks, we introduce a task identifier z_i that tells the policy which task it should perform. Task identifiers can take many forms:

Integer indices (simplest but least expressive)
Natural language descriptions (most flexible and human-interpretable)
Goal states (for goal-reaching tasks)
Video demonstrations of the task

With the task identifier, we can construct a single aggregated MDP that covers all tasks. The state in this aggregated MDP is simply the original state concatenated with the task identifier.
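
As a minimal sketch of this construction, assume the simplest identifier, an integer index encoded as a one-hot vector; the state values and the helper names are illustrative only.

```python
def one_hot(index, num_tasks):
    """Encode an integer task index as a one-hot task identifier z_i."""
    return [1.0 if k == index else 0.0 for k in range(num_tasks)]

def augment_state(state, task_id):
    """State of the aggregated MDP: original state concatenated with z_i."""
    return list(state) + list(task_id)

num_tasks = 3
state = [0.2, -1.5]                    # hypothetical 2-D state
z = one_hot(1, num_tasks)              # task index 1 of 3
augmented = augment_state(state, z)
print(augmented)
```

A language or goal-state identifier would replace the one-hot vector with an embedding or a goal vector, but the concatenation step stays the same.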

4.2.3 Multitask Imitation Learning

Multitask imitation learning extends standard behavior cloning to the multitask setting:

Collect expert demonstration trajectories for each task
Train a single policy π_θ(a | s, z) that takes both the state and task identifier as input
Minimize the average behavior cloning loss across all tasks: L(θ) = E_{i ~ Uniform(1..N), (s,a) ~ D_i} [ -log π_θ(a | s, z_i) ]
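
The loss above can be estimated by sampling, as in the following sketch. The linear-softmax policy over two discrete actions, the tiny demonstration sets, and the parameter values are assumptions for illustration; a real policy would be a neural network trained by gradient descent on this quantity.

```python
import math
import random

def policy_log_prob(theta, state, z, action):
    """log pi_theta(a | s, z) for a linear-softmax policy over discrete actions."""
    x = list(state) + list(z)                       # state concatenated with task id
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in theta]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[action] - log_norm

def multitask_bc_loss(theta, datasets, task_ids, n_samples, rng):
    """Monte Carlo estimate of E_{i ~ Uniform, (s,a) ~ D_i}[-log pi(a | s, z_i)]."""
    total = 0.0
    for _ in range(n_samples):
        i = rng.randrange(len(datasets))            # task sampled uniformly
        s, a = rng.choice(datasets[i])              # (s, a) sampled from D_i
        total += -policy_log_prob(theta, s, task_ids[i], a)
    return total / n_samples

rng = random.Random(0)
task_ids = [[1.0, 0.0], [0.0, 1.0]]                 # one-hot z_i for two tasks
datasets = [
    [([0.5], 0), ([0.7], 0)],                       # task 0 demos: always action 0
    [([0.5], 1), ([0.1], 1)],                       # task 1 demos: always action 1
]
# Parameters: 2 actions x (1 state dim + 2 task dims).
theta_uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # ignores the task identifier
theta_task_aware = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
loss_uniform = multitask_bc_loss(theta_uniform, datasets, task_ids, 100, rng)
loss_task_aware = multitask_bc_loss(theta_task_aware, datasets, task_ids, 100, rng)
print(loss_uniform, loss_task_aware)
```

The two tasks here demand different actions in similar states, so only the parameters that use z_i reach a low loss.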

Best practices for stable training:

Stratified mini-batch sampling: Ensure each mini-batch contains data from all tasks to prevent high-variance gradient updates that can cause catastrophic forgetting
Per-task normalization: Normalize states and actions separately for each task to ensure they are on similar scales
Balanced task weighting: Adjust the loss weights for different tasks to prevent high-data tasks from dominating the training process
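
A minimal sketch of the first two practices, assuming scalar features and two tasks of very different size and scale (both invented for illustration). Drawing the same number of samples per task is also one simple way to get balanced weighting; explicit loss weights are an alternative not shown here.

```python
import random
import statistics

def normalize_per_task(values):
    """Standardize one task's scalar features to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0          # guard against constant features
    return [(v - mean) / std for v in values]

def stratified_batch(datasets, per_task, rng):
    """Draw the same number of samples from every task for one mini-batch."""
    batch = []
    for task_index, data in enumerate(datasets):
        for sample in rng.choices(data, k=per_task):   # sampled with replacement
            batch.append((task_index, sample))
    return batch

rng = random.Random(0)
raw = [
    [float(v) for v in range(1000)],        # large task, large scale
    [v / 10.0 for v in range(10)],          # small task, small scale
]
datasets = [normalize_per_task(d) for d in raw]
batch = stratified_batch(datasets, per_task=4, rng=rng)
counts = [sum(1 for t, _ in batch if t == i) for i in range(len(datasets))]
print(counts)
```

Without stratification, uniform sampling over the pooled data would draw roughly 99% of each batch from the large task.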

4.2.4 Architectural Considerations

Multitask policies must handle multimodal inputs (states, language descriptions, images, videos). Modern architectures typically use:

Separate encoders for each input modality (language encoder for text, image encoder for images/videos)
A shared backbone network that processes the combined state and task embedding
Transformer architectures are increasingly dominant for both language and robotic multitask policies
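
The encoder-plus-shared-backbone layout can be sketched as follows. This is a structural illustration only: the random linear layers are hypothetical stand-ins for real language or image encoders, the backbone is a single tanh layer rather than a Transformer, and all dimensions are arbitrary.

```python
import math
import random

def linear(in_dim, out_dim, rng):
    """Randomly initialised weight matrix (out_dim rows of in_dim weights)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def tanh_layer(weights, x):
    return [math.tanh(v) for v in matvec(weights, x)]

class MultitaskPolicy:
    """One encoder per task-descriptor modality, one shared backbone."""

    def __init__(self, state_dim, embed_dim, action_dim, descriptor_dims, seed=0):
        rng = random.Random(seed)
        # Hypothetical stand-ins for real language / image / goal encoders.
        self.encoders = {name: linear(dim, embed_dim, rng)
                         for name, dim in descriptor_dims.items()}
        self.backbone = linear(state_dim + embed_dim, embed_dim, rng)
        self.head = linear(embed_dim, action_dim, rng)

    def act(self, state, modality, descriptor):
        z = tanh_layer(self.encoders[modality], descriptor)   # task embedding
        h = tanh_layer(self.backbone, list(state) + z)        # shared processing
        return matvec(self.head, h)                           # action output

policy = MultitaskPolicy(state_dim=3, embed_dim=4, action_dim=2,
                         descriptor_dims={"language": 5, "goal_state": 3})
a_lang = policy.act([0.1, 0.2, 0.3], "language", [1.0, 0.0, 0.0, 1.0, 0.0])
a_goal = policy.act([0.1, 0.2, 0.3], "goal_state", [0.5, 0.5, 0.5])
print(len(a_lang), len(a_goal))
```

Because every encoder maps into the same embedding size, the backbone and action head are shared regardless of how the task was described.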

4.2.5 Multitask Reinforcement Learning & Hindsight Relabeling

Multitask RL extends the same task conditioning approach to reinforcement learning. The most important innovation in this space is hindsight relabeling.

The core insight: When an agent tries to achieve one goal but accidentally achieves a different goal, that experience is still valuable for learning how to achieve the second goal.

How hindsight relabeling works:

Collect a trajectory while trying to achieve task z_i
Observe that the trajectory actually achieves a different task z_j
Relabel the entire trajectory with task identifier z_j and the corresponding rewards for task z_j
Add the relabeled trajectory to the replay buffer for task z_j
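
These four steps can be sketched for a toy pair of tasks. The task names, the success checks, and the sparse rewards below are hypothetical; the sketch also assumes that success is judged from the final state of the trajectory.

```python
def relabel(trajectory, tasks, intended):
    """Return copies of `trajectory` relabeled for every other task it achieved.

    trajectory: list of (state, action, next_state) tuples.
    tasks: dict mapping task id -> (success_fn, reward_fn); hypothetical helpers.
    """
    final_state = trajectory[-1][2]
    relabeled = {}
    for task_id, (success_fn, reward_fn) in tasks.items():
        if task_id != intended and success_fn(final_state):
            # Same states and actions, new task id, rewards recomputed for z_j.
            relabeled[task_id] = [(s, a, reward_fn(s2), s2, task_id)
                                  for s, a, s2 in trajectory]
    return relabeled

tasks = {
    "reach_right": (lambda s: s >= 1.0, lambda s: 1.0 if s >= 1.0 else 0.0),
    "reach_left": (lambda s: s <= -1.0, lambda s: 1.0 if s <= -1.0 else 0.0),
}
# Collected while attempting "reach_right", but the agent drifted left.
trajectory = [(0.0, -0.5, -0.5), (-0.5, -0.5, -1.0)]
replay_buffer = {task_id: [] for task_id in tasks}
for task_id, transitions in relabel(trajectory, tasks, "reach_right").items():
    replay_buffer[task_id].extend(transitions)
print(replay_buffer["reach_left"])
```

The sketch stores only the relabeled copy; the original trajectory, labeled with the task that was actually attempted, would normally be kept in the buffer as well.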

Requirements for hindsight relabeling:

All tasks share the same state and action space
All tasks share the same transition dynamics
The learning algorithm is off-policy (since the relabeled data is off-policy for the new task)

Goal-conditioned RL: A special case of multitask RL where all tasks are goal-reaching tasks and the task identifier is the goal state itself. A common reward function is the negative distance between the current state and the goal state (a sparse indicator of whether the goal has been reached is another common choice). Hindsight relabeling is particularly powerful in this setting, as any trajectory that reaches any state can be relabeled as a successful trajectory for reaching that state.
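
A minimal sketch of relabeling in the goal-conditioned case, under two assumptions that the notes do not fix: the final state of the trajectory is chosen as the new goal, and the negative-distance reward is evaluated on the state reached after each action.

```python
import math

def distance(a, b):
    """Euclidean distance between two states given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relabel_with_final_state(trajectory):
    """Relabel a trajectory so that its final state becomes the goal.

    trajectory: list of (state, action, next_state); states are tuples.
    Returns (state, action, reward, next_state, goal) tuples with
    reward = -distance(next_state, goal)  (assumed reward convention).
    """
    goal = trajectory[-1][2]
    return [(s, a, -distance(s2, goal), s2, goal) for s, a, s2 in trajectory]

# Hypothetical 2-D trajectory collected while pursuing some other goal.
trajectory = [((0.0, 0.0), "right", (1.0, 0.0)),
              ((1.0, 0.0), "up", (1.0, 1.0))]
relabeled = relabel_with_final_state(trajectory)
for transition in relabeled:
    print(transition)
```

Under the new goal the last transition has zero distance to the goal, so the trajectory counts as a success no matter which goal was originally intended.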

These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you master the content of this subject thoroughly. Wish you continuous academic progress and great achievements in your studies.

Video Source and Usage Instructions

Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 12: Multi-Task RL Stanford Online
• Course Series: Stanford CS224R Deep Reinforcement Learning
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/qNdsI_4AQJw?si=BCQJ2GoyN6Z-i_qG
