Lecture 16: Autonomous Reinforcement Learning for Robots

1. Course Details

This is the 15th lecture of Stanford University's CS224R Deep Reinforcement Learning course. Building on prior topics of multi-task learning and meta-reinforcement learning, this lecture introduces hierarchical learning as a framework to solve complex long-horizon tasks that require sequencing multiple subtasks.

The lecture covers the fundamental motivation for hierarchical approaches, the core two-level policy architecture, three critical design choices for building practical hierarchical systems, and detailed examples of state-of-the-art hierarchical imitation learning systems for robotics. It concludes with a discussion of hierarchical reinforcement learning and open research directions in the field.

2. Key Learning Objectives

By the end of this lecture, students will be able to:

Explain the core challenges of long-horizon tasks that make them difficult for standard flat policies
Describe the basic hierarchical policy architecture and its key advantages over flat policies
Evaluate the tradeoffs between different goal representations including language, images, and structured states
Design effective supervision strategies for hierarchical systems to avoid dangerous module mismatch
Implement practical hierarchical imitation learning systems using both language and image subgoals
Compare hierarchical approaches to alternative methods like chain-of-thought reasoning
Identify the current limitations of hierarchical reinforcement learning and promising future research directions

3. Memorable Quotes

"Long-horizon tasks are some of the hardest problems for AI systems to solve, because they require navigating vast state spaces, recovering from mistakes, and avoiding getting stuck in loops of non-progress."
"The core idea behind hierarchy is simple: break a single overwhelming long task into smaller, manageable subtasks, then compose them with a higher-level controller."
"Intermediate supervision is the single biggest benefit of hierarchical approaches—without it, sparse rewards for long tasks make learning nearly impossible."
"The biggest pitfall of modular hierarchical systems is the mismatch between levels: a low-level policy trained in isolation will fail when called from states the high-level policy actually produces."
"Errors in estimating when a subtask is complete are far more fatal than frequent high-level re-planning, which is why almost all practical systems use fixed-frequency goal updates."

4. Detailed Lecture Notes

4.1 The Challenge of Long-Horizon Tasks

Long-horizon tasks are complex activities that require multiple sequential steps to complete, often with interdependent subtasks. Examples include cooking a meal, debugging a neural network, driving across the country, or drafting a legal brief. These tasks present three unique challenges for AI systems:

Vast state spaces: The agent must make appropriate decisions across thousands or millions of unique states encountered during the task
Error recovery: Mistakes are inevitable, and the agent must be able to recognize and correct them rather than failing catastrophically
Progress stagnation: Agents easily get stuck in loops of repetitive behavior that makes no forward progress toward the goal

Standard flat policies that map directly from states to actions struggle with these tasks because they suffer from extreme credit assignment problems, cannot effectively share knowledge across subtasks, and waste compute by performing high-level reasoning at every single timestep.

4.2 Core Hierarchical Policy Architecture

Hierarchical learning addresses these challenges by decomposing the problem into multiple levels of abstraction. The standard two-level architecture consists of two separate policies operating at different timescales:

Low-level policy (π_LL): Operates at the original action frequency (e.g., 20 hertz for robot motors, word-by-word for language models). It takes the current state and an intermediate subgoal as input and outputs primitive actions to execute that subgoal.
High-level policy (π_HL): Operates at a much coarser timescale. It takes the current state and the overall high-level task as input and outputs a sequence of intermediate subgoals for the low-level policy to accomplish.

Execution Workflow

The high-level policy observes the initial state and plans the first intermediate subgoal
The low-level policy repeatedly executes actions to achieve the current subgoal
After a predetermined period or when the subgoal is complete, the high-level policy observes the new state and plans the next subgoal
This process repeats until the overall task is completed
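
The workflow above can be sketched as a short control loop. This is a minimal illustration on a toy 1-D line world, not the lecture's implementation: the function names, the waypoint rule in `toy_high_level`, and the `replan_every` parameter are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def run_hierarchical_episode(
    high_level: Callable[[int, int], int],
    low_level: Callable[[int, int], int],
    start: int,
    task_goal: int,
    replan_every: int = 5,
    max_steps: int = 100,
) -> Tuple[int, List[int]]:
    """Roll out a two-level policy; returns the final state and all subgoals issued."""
    state = start
    subgoal = high_level(state, task_goal)  # plan the first subgoal
    subgoals = [subgoal]
    for step in range(max_steps):
        if state == task_goal:
            break
        # The high level runs at a coarser timescale: once every `replan_every` steps.
        if step > 0 and step % replan_every == 0:
            subgoal = high_level(state, task_goal)
            subgoals.append(subgoal)
        # The low level runs at every step and only sees the current subgoal.
        state += low_level(state, subgoal)
    return state, subgoals


def toy_high_level(state: int, task_goal: int) -> int:
    """Hypothetical high-level policy: a waypoint at most 3 cells toward the goal."""
    return state + max(-3, min(3, task_goal - state))


def toy_low_level(state: int, subgoal: int) -> int:
    """Hypothetical low-level policy: one primitive step toward the subgoal."""
    if subgoal > state:
        return 1
    if subgoal < state:
        return -1
    return 0


final_state, subgoals = run_hierarchical_episode(
    toy_high_level, toy_low_level, start=0, task_goal=10
)
```

In this toy run the low-level policy reaches each waypoint after three steps and then idles for two steps until the next scheduled re-plan, which is the small delay that a fixed update period introduces.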

This architecture can be extended to more than two levels for extremely complex tasks, creating a pyramid of abstraction from primitive motor control all the way up to high-level strategic planning.

Alternative: Chain-of-Thought Reasoning

An intermediate approach that retains many benefits of hierarchy without separate policies is chain-of-thought reasoning. In this approach, a single policy first outputs an intermediate subgoal or reasoning step, then outputs the corresponding low-level action.

Advantages: Enables knowledge sharing between the reasoning and execution components
Disadvantages: No computational efficiency benefit, as reasoning must be performed at every timestep

There is currently no conclusive empirical evidence that one approach is strictly better than the other, except for the clear computational advantage of separate hierarchical policies.

4.3 Three Critical Design Choices

Building an effective hierarchical system requires making three fundamental design decisions that will largely determine its performance.

4.3.1 Goal Representation

The choice of how to represent intermediate subgoals is highly domain-dependent and must balance three competing requirements:

Expressiveness: The ability to describe all necessary subtasks for the domain
Structure: Similar goals should have similar representations to enable generalization
Appropriate abstraction: Goals should be neither too fine-grained (defeating the purpose of hierarchy) nor too coarse-grained (failing to decompose the problem)

Common goal representations include:

Natural language: The most general and expressive representation, ideal for tasks like cooking or cleaning. It is easy for humans to annotate and understand but can suffer from ambiguity.
Goal images: Excellent for visual robotic tasks. They eliminate the need for language annotation and can leverage vast amounts of unlabeled video data for pre-training.
Structured representations: GPS coordinates, bounding boxes, or joint angles for navigation and control tasks. These have strong inherent structure that makes learning easier.
Latent representations: Learned neural activations that require no human annotation but are completely uninterpretable.

4.3.2 Supervision for Each Level

The most common failure mode for hierarchical systems is module mismatch: the low-level policy is trained on a distribution of states and subgoals that differs from what the high-level policy actually produces during deployment, so it fails when called from those unfamiliar states. To avoid this:

Pre-train the low-level policy on a wide distribution of states and goals to ensure it is robust
Adapt the high-level policy to the actual capabilities and failure modes of the low-level policy
Jointly fine-tune both policies end-to-end whenever possible to align their behaviors

A particularly effective technique for adaptation is DAgger (Dataset Aggregation) applied at the high level. Human operators intervene to correct incorrect subgoals during deployment, and this intervention data is used to fine-tune the high-level policy to compensate for deficiencies in the low-level policy.

Critical note: A hierarchical architecture without explicit intermediate supervision is functionally identical to a flat policy with a more complex neural network architecture and will not provide any of the benefits of hierarchy.

4.3.3 When to Switch Goals

There are two basic strategies for deciding when to request a new subgoal from the high-level policy:

Completion-based switching: Request a new goal only when the low-level policy indicates it has completed the current one
- Advantages: Maximum computational efficiency
- Disadvantages: Extremely difficult to accurately estimate task completion; cannot handle errors that require revisiting previous subtasks; failures lead to permanent stagnation
Fixed-frequency switching: Request a new goal at regular, predetermined intervals
- Advantages: Simple, robust, and automatically recovers from errors by re-planning frequently
- Disadvantages: Introduces small delays between subtask completion and new goal assignment; higher computational cost

Virtually all practical hierarchical systems use fixed-frequency switching because the potential failures of completion-based switching are far more severe than the minor inefficiencies of frequent re-planning.

4.4 Practical Hierarchical Imitation Learning Systems

Most successful hierarchical systems deployed today use imitation learning rather than reinforcement learning, as RL remains unstable and sample-inefficient for long-horizon tasks.

4.4.1 Language Subgoal Systems

These systems use natural language as the intermediate goal representation:

Data collection: Long-horizon demonstrations are segmented and annotated with language labels describing each subtask. This can be done either during data collection or post-hoc.
Architecture: A high-level policy takes RGB images as input and predicts language subgoals. A low-level policy takes RGB images, joint angles, and the language subgoal as input and predicts joint motor commands.
Training process:
1. Pre-train both policies offline on the annotated demonstration data
2. Freeze the low-level policy
3. Fine-tune the high-level policy using DAgger with human language interventions
Results: Hierarchical systems achieve a 34% improvement in task progress compared to equivalent flat policies and can complete complex tasks like putting three objects into a sealed bag or cleaning an entire bedroom.

4.4.2 Image Subgoal Systems

These systems use generated images as the intermediate goal representation:

Data collection: Only raw video demonstrations are required, with no language annotation.
Architecture: A high-level diffusion model takes the current image and high-level task as input and generates a subgoal image representing what the scene should look like several seconds in the future. A goal-conditioned low-level policy takes the current image and subgoal image as input and predicts actions.
Key advantages:
- Can handle novel objects that have no standard language name
- Can leverage massive amounts of unlabeled human video data for pre-training
- Adding human video data to the training set significantly improves generalization performance

4.5 Hierarchical Reinforcement Learning & Open Directions

While hierarchical imitation learning has seen significant recent success, hierarchical reinforcement learning remains less mature but offers the potential for much higher reliability and performance. Current approaches typically use:

A goal-conditioned low-level policy trained to reach arbitrary states
A high-level policy that outputs goal states or language subgoals
Hindsight relabeling to make efficient use of off-policy data for training the high-level policy
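
As a rough illustration of hindsight relabeling, the sketch below shows the generic goal-conditioned form: a trajectory that missed its commanded goal is rewritten as if the state it actually reached had been the goal. The `Transition` fields, the sparse 0/1 reward, and the choice to relabel with the final state are assumptions for the example; the exact variant used for high-level policies in the lecture may differ.

```python
from typing import List, NamedTuple


class Transition(NamedTuple):
    state: int
    action: int
    next_state: int
    goal: int
    reward: float


def hindsight_relabel(trajectory: List[Transition]) -> List[Transition]:
    """Replace the commanded goal with the state actually reached at the end."""
    achieved = trajectory[-1].next_state
    return [
        # Assumed sparse reward: 1 when the transition lands on the relabeled goal.
        t._replace(goal=achieved, reward=1.0 if t.next_state == achieved else 0.0)
        for t in trajectory
    ]


# The agent was asked to reach state 5 but only got to state 2 (all rewards 0).
failed = [
    Transition(state=0, action=1, next_state=1, goal=5, reward=0.0),
    Transition(state=1, action=1, next_state=2, goal=5, reward=0.0),
]
relabeled = hindsight_relabel(failed)
```

After relabeling, the same off-policy data carries a useful learning signal for the goal "reach state 2", even though the original attempt failed.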

Major open research questions include:

1. Course Details

This is the 16th lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by Professor Chelsea Finn. This lecture focuses on autonomous reinforcement learning specifically tailored for physical robots, addressing the fundamental barrier that has prevented widespread deployment of RL-trained robots in the real world.
The lecture covers the core reset problem in robotic RL, two distinct evaluation paradigms for autonomous learning systems, practical reset-free learning algorithms including forward-backward RL and MEDAL, multi-task cycle learning frameworks, and the emerging field of single-life reinforcement learning for deployment-time adaptation. It concludes with a discussion of safety considerations for autonomous learning systems.
2. Key Learning Objectives
By the end of this lecture, students will be able to:
1. Explain the reset problem in physical robotic reinforcement learning and why it prevents fully autonomous training
2. Distinguish between deployed policy evaluation and continuing policy evaluation for autonomous systems
3. Implement the forward-backward reinforcement learning framework for reset-free training
4. Describe how the MEDAL algorithm uses expert demonstrations to improve reset efficiency and learning performance
5. Design a multi-task cycle learning system that eliminates the need for dedicated reset policies
6. Define the single-life reinforcement learning problem and its relevance to real-world deployment
7. Identify key safety challenges in autonomous robotic learning and practical mitigation strategies
3. Memorable Quotes
- "The problem is reset, often in practice, requires some human intervention, that we want to be able to ideally eliminate so that our agents can practice and get better on their own."
- "We can't reset the physical world to our own will."
- "One very simple idea is to try to learn a policy that brings us from wherever we ended up after attempting the task back to the initial state distribution."
- "You can actually run these algorithms on real robots, and it's cool. You can watch time lapses of them improving just through a fully autonomous process."
- "We'd like to think about this setting, where we could actually adapt, within an episode, during deployment."
4. Detailed Lecture Notes

4.1 The Core Motivation: The Reset Problem in Robotic RL
Traditional reinforcement learning relies on a critical assumption that is completely violated in the physical world: the ability to reset the environment to an initial state with a simple env.reset() call. In simulation, this is trivial, but for physical robots, resetting typically requires human intervention:
- A robot learning to hit a hockey puck needs a human to retrieve the puck and place it back in the starting position
- A robot learning to open doors needs a human to close the door after each attempt
- A robot learning to fold towels needs a human to unfold the towel and place it back on the table
This is particularly problematic because reinforcement learning typically requires thousands or even millions of attempts to learn a single task. The constant need for human supervision makes large-scale robotic RL prohibitively expensive and impractical.
4.2 Two Evaluation Metrics for Autonomous RL
There are two fundamentally different ways to evaluate autonomous reinforcement learning systems, depending on the ultimate goal of the system:
4.2.1 Deployed Policy Evaluation
This metric measures the quality of the final policy learned by the agent. After the autonomous training process is complete, we deploy the policy starting from the standard initial state distribution and measure its performance.
Use case: A cooking robot that will be deployed in a restaurant kitchen. We care about how well it can cook meals repeatedly after training, not about how many meals it burned during the learning process.
4.2.2 Continuing Policy Evaluation
This metric measures the average reward the agent receives over its entire lifetime, including during the learning process. There is no separate training and deployment phase: the agent is learning and performing simultaneously.
Use case: A Mars rover that has no opportunity for human intervention. We care about how much scientific data it collects over its entire mission, not just how good its policy would be if we could reset it and deploy it again.
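
The difference between the two metrics can be made concrete with a small numerical sketch. The function names and the toy reward values below are assumptions chosen for illustration, not figures from the lecture.

```python
from typing import Sequence


def deployed_policy_evaluation(episode_returns: Sequence[float]) -> float:
    """Mean return of the final policy over evaluation episodes that start
    from the standard initial state distribution (training rewards ignored)."""
    return sum(episode_returns) / len(episode_returns)


def continuing_policy_evaluation(lifetime_rewards: Sequence[float]) -> float:
    """Average reward per step over the agent's whole lifetime, learning included."""
    return sum(lifetime_rewards) / len(lifetime_rewards)


# Toy lifetime: the agent earns nothing for three steps while it learns,
# then earns reward 1 on each of the remaining five steps.
lifetime_rewards = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# Toy evaluation of the final policy from fresh initial states.
final_policy_returns = [1.0, 1.0, 1.0]

continuing_score = continuing_policy_evaluation(lifetime_rewards)
deployed_score = deployed_policy_evaluation(final_policy_returns)
```

The same agent scores 1.0 under deployed evaluation but only 0.625 under continuing evaluation, because the latter also charges it for the early steps spent learning.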
4.3 Why Standard RL Fails Without Resets
A naive approach to autonomous RL would be to simply increase the length of episodes, reducing the frequency of required resets. However, experiments show that this approach fails catastrophically even on simple tasks.
When running Soft Actor-Critic (SAC) on a simple fish navigation task:
- Episode length of 1,000 steps: The algorithm learns effectively
- Episode length of 2,000 steps: Performance degrades significantly
- Episode length of 10,000 steps: The algorithm barely learns anything at all
There are two core reasons for this failure:
1. State Distribution Drift: The agent will inevitably make mistakes and drift away from the initial state distribution into regions of the state space that are difficult or impossible to recover from.
2. Distribution Collapse: Even if the agent eventually reaches the goal, it will tend to stay in the region around the goal. The replay buffer becomes filled with data from the goal region, and the agent never learns how to get from the initial state to the goal.
4.4 Forward-Backward Reinforcement Learning
Forward-backward RL is the simplest and most intuitive solution to the reset problem. The core insight is to train two separate policies:
- Forward policy (πf): Learns to complete the primary task
- Backward policy (πb): Learns to reset the environment by returning to the initial state distribution
Training Pipeline
1. Initialize the environment once from the initial state distribution
2. Run the forward policy for H time steps
3. Update the forward policy to maximize the task reward function rf
4. Run the backward policy for H time steps
5. Update the backward policy to maximize the reset reward function rb, which rewards reaching states in the initial state distribution
6. Repeat steps 2-5 indefinitely, with only extremely infrequent human resets (if any)
At test time, the backward policy is discarded entirely, and only the forward policy is deployed.
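
A minimal sketch of this pipeline, assuming a toy six-state chain and tabular Q-learning in place of a deep RL algorithm. The environment, the hyperparameters, and the choice to update after every step (rather than once per phase) are all assumptions made to keep the example small and runnable.

```python
import random

N_STATES = 6                      # toy chain: states 0..5
MOVES = (-1, +1)                  # action 0 = left, action 1 = right
START, GOAL = 0, N_STATES - 1     # initial state and task goal


def step(state, action):
    """Deterministic move along the chain, clipped at both ends."""
    return min(max(state + MOVES[action], 0), N_STATES - 1)


def r_forward(next_state):        # task reward r_f
    return 1.0 if next_state == GOAL else 0.0


def r_backward(next_state):       # reset reward r_b
    return 1.0 if next_state == START else 0.0


def greedy(q, state, rng):
    left, right = q[state]
    if left == right:             # break ties randomly
        return rng.randrange(2)
    return 0 if left > right else 1


def run_phase(q, reward_fn, state, horizon, rng, eps=0.2, alpha=0.5, gamma=0.9):
    """Run one policy for `horizon` steps with epsilon-greedy Q-learning updates."""
    for _ in range(horizon):
        action = rng.randrange(2) if rng.random() < eps else greedy(q, state, rng)
        nxt = step(state, action)
        target = reward_fn(nxt) + gamma * max(q[nxt])
        q[state][action] += alpha * (target - q[state][action])
        state = nxt
    return state                  # the environment is NOT reset between phases


def train(cycles=500, horizon=20, seed=0):
    rng = random.Random(seed)
    q_f = [[0.0, 0.0] for _ in range(N_STATES)]   # forward policy
    q_b = [[0.0, 0.0] for _ in range(N_STATES)]   # backward policy
    state = START                                  # the only reset
    for _ in range(cycles):
        state = run_phase(q_f, r_forward, state, horizon, rng)
        state = run_phase(q_b, r_backward, state, horizon, rng)
    return q_f, q_b


def greedy_rollout(q, state, steps):
    rng = random.Random(0)
    for _ in range(steps):
        state = step(state, greedy(q, state, rng))
    return state


q_forward, q_backward = train()
```

After training, `greedy_rollout(q_forward, START, 10)` follows only the forward policy, which mirrors the test-time setup where the backward policy is discarded.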
Advantages and Disadvantages
- Advantages: Conceptually simple, easy to implement, works with any standard RL algorithm
- Disadvantages: The backward policy serves no purpose beyond enabling training, which can feel computationally wasteful; there may be a mismatch in difficulty between the forward and backward tasks
4.5 MEDAL: Reset to Expert State Distribution
MEDAL (Matching Expert Distribution for Autonomous Learning) is an improvement over vanilla forward-backward RL that addresses the distribution collapse problem and improves learning efficiency.
Instead of resetting all the way back to the initial state, MEDAL resets to any state in the expert state distribution, that is, the set of states visited by human demonstrators performing the task.
How It Works
1. Collect a small number of expert demonstrations of the task
2. Train a discriminator to distinguish between states visited by the expert and states visited by the policy
3. The backward policy's reward function is the output of this discriminator—it is rewarded for visiting states that look like they came from the expert demonstrations
4. The forward policy is trained as before to maximize the task reward
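
Steps 2 and 3 can be illustrated with a tiny discriminator. This is a sketch under strong assumptions: states are 1-D positions, the discriminator is a logistic regression trained once by gradient descent, and the backward reward is its raw output as described above. A practical system would use a neural network retrained as the policy's state distribution changes, and may transform the discriminator output before using it as a reward.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_discriminator(expert_states, policy_states, lr=0.5, epochs=200):
    """Fit p(expert | state) = sigmoid(w * state + b) with full-batch gradient descent."""
    w, b = 0.0, 0.0
    data = [(s, 1.0) for s in expert_states] + [(s, 0.0) for s in policy_states]
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for s, label in data:
            err = sigmoid(w * s + b) - label   # gradient of the log-loss w.r.t. the logit
            grad_w += err * s
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


def backward_reward(state, w, b):
    """Reward for the backward policy: how expert-like the state looks."""
    return sigmoid(w * state + b)


# Hypothetical 1-D positions: demonstrations stay on the left, the policy drifted right.
expert_states = [-2.0, -1.5, -1.0]
policy_states = [1.0, 1.5, 2.0]
w, b = train_discriminator(expert_states, policy_states)
```

With these toy states, `backward_reward(-1.5, w, b)` is above 0.5 and `backward_reward(1.5, w, b)` is below it, so the backward policy is pushed toward the region the demonstrations cover.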
Key Advantages
- The agent practices the task from a wide variety of starting states along the expert trajectory, not just the initial state
- This significantly improves learning efficiency, as the agent gets more meaningful practice
- It naturally prevents distribution collapse, as the backward policy is incentivized to visit all parts of the expert state distribution
- The expert demonstrations can be used to bootstrap both the forward and backward policies
4.6 Benchmark Results on EARL
The EARL benchmark is the standard evaluation suite for autonomous reinforcement learning algorithms. It features a variety of robotic manipulation and locomotion tasks, with resets only allowed every 200,000 environment steps.
Key results from the benchmark:
- Standard SAC without a reset policy performs extremely poorly
- Simple novelty-seeking exploration methods also perform poorly
- Vanilla forward-backward RL and MEDAL both perform close to the Oracle baseline (which gets frequent resets)
- MEDAL generally outperforms vanilla forward-backward RL on more complex tasks
An important practical consideration is that the forward and backward tasks may have very different difficulties. For example, closing a door may be much easier than opening it, or the reverse, depending on the mechanism. This mismatch can make training more challenging.
4.7 Multi-Task Cycle Learning
For systems that need to learn multiple tasks, we can eliminate the need for a dedicated reset policy entirely by cycling between different tasks.
Core Idea
Instead of having a separate reset task, we simply select the next task to practice based on the current state of the environment. If the robot drops a cup while trying to place it in the coffee machine, it doesn't need a human to reset it—it can practice the task of picking up the cup instead.
Training Pipeline
1. Observe the current state of the environment
2. Propose a task that can be started from the current state
3. Run the multi-task policy to complete the proposed task
4. Update the policy using the reward for the completed task
5. Repeat indefinitely
Task Proposal Methods
- Predefined task graph: Manually define which tasks can be started from which states
- VLM/LLM-based proposal: Use a vision-language model or large language model to automatically identify feasible tasks in the current scene
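
The predefined task graph option can be sketched with plain dictionaries, using the coffee-cup example from above. The state names, task names, deterministic outcomes, and the "least-practiced task first" rule are all hypothetical choices for illustration; the policy update in step 4 is left as a comment.

```python
from typing import Dict, List, Tuple

# Hypothetical task graph: which tasks can be started from which state.
TASK_GRAPH: Dict[str, List[str]] = {
    "cup_on_floor": ["pick_up_cup"],
    "cup_on_table": ["pick_up_cup"],
    "cup_in_gripper": ["place_cup_in_machine", "put_cup_on_table"],
    "cup_in_machine": ["remove_cup_from_machine"],
}

# Assumed deterministic outcomes so the example is reproducible.
NEXT_STATE: Dict[Tuple[str, str], str] = {
    ("cup_on_floor", "pick_up_cup"): "cup_in_gripper",
    ("cup_on_table", "pick_up_cup"): "cup_in_gripper",
    ("cup_in_gripper", "place_cup_in_machine"): "cup_in_machine",
    ("cup_in_gripper", "put_cup_on_table"): "cup_on_table",
    ("cup_in_machine", "remove_cup_from_machine"): "cup_in_gripper",
}


def propose_task(state: str, practice_counts: Dict[str, int]) -> str:
    """Pick the feasible task practiced least so far (an assumed heuristic)."""
    return min(TASK_GRAPH[state], key=lambda task: practice_counts.get(task, 0))


def practice(start_state: str, num_tasks: int):
    state, counts, log = start_state, {}, []
    for _ in range(num_tasks):
        task = propose_task(state, counts)        # steps 1-2: observe, propose
        log.append(task)
        counts[task] = counts.get(task, 0) + 1
        # Step 3 would run the multi-task policy here; step 4 would update it.
        state = NEXT_STATE[(state, task)]
    return state, counts, log


final_state, counts, log = practice("cup_on_floor", num_tasks=6)
```

Starting from a dropped cup, the loop never needs a human reset: every state in the graph has at least one task that can be started from it.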
This approach is particularly powerful because it naturally handles errors and allows the agent to continuously improve at all tasks simultaneously.
4.8 Single-Life Reinforcement Learning
Single-life RL addresses a different but equally important problem: how to enable robots to adapt to new, unexpected situations during deployment, without any resets or retraining.
Problem Setting
The agent has been pre-trained on a variety of tasks, but during deployment, it encounters an out-of-distribution situation that it has never seen before. It has only one chance to complete the task and must adapt on the fly within a single episode.
Key Challenges
- Directly fine-tuning the policy at test time often leads to catastrophic failures, as the agent can get stuck in unrecoverable states
- The agent has no opportunity to reset and try again
Promising Solutions
1. Familiar State Guidance: When the agent encounters difficulty, it first tries to return to a familiar state that it knows how to handle, rather than continuing to push toward the goal
2. High-Level Skill Adaptation: Instead of adapting low-level motor commands, the agent adapts at the level of pre-trained skills, allowing it to try completely different strategies
3. LLM/VLM Common Sense: Leverage the common sense reasoning capabilities of large pre-trained models to propose novel solutions to unexpected problems
4.9 Safety Considerations in Autonomous RL
Autonomous learning robots operate in the physical world without constant human supervision, making safety a critical concern:
- Hardware safety: Robots must avoid actions that could damage themselves or the environment
- Hard constraints: Implement low-level safety constraints in the controller to prevent dangerous actions
- Safety recovery policies: Train dedicated recovery policies that can return the robot to a safe state if something goes wrong
- VLM-based safety: Use vision-language models to identify potentially dangerous situations and avoid them
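
One simple form of the hard-constraint idea above is to clamp every commanded action to fixed limits before it reaches the motors, regardless of what the learned policy outputs. The function name and the limit values below are assumptions for illustration; real limits come from the robot's hardware specification.

```python
from typing import List, Sequence


def clamp_action(
    action: Sequence[float], low: Sequence[float], high: Sequence[float]
) -> List[float]:
    """Clip each action dimension to its [low, high] safety limit."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


# Hypothetical joint-velocity limits of +/- 1.0 on three joints.
LOW = [-1.0, -1.0, -1.0]
HIGH = [1.0, 1.0, 1.0]

raw_action = [0.5, -2.0, 3.0]          # output of the learned policy
safe_action = clamp_action(raw_action, LOW, HIGH)
```

Because the clamp sits below the learned policy, it holds even while the policy is still exploring.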
Open research questions in hierarchical reinforcement learning (continuing Section 4.5 of the hierarchical learning notes above):
Unsupervised skill discovery: Can we automatically discover useful reusable skills without any human supervision or annotation?
Scalable hierarchical RL: How can we effectively fine-tune large pre-trained hierarchical imitation learning systems with reinforcement learning?
Multi-level hierarchy: Can we learn deep hierarchies with more than two levels of abstraction automatically?
These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you thoroughly master the content of this subject. Wish you continuous academic progress and great achievements in your studies.

Video Source and Usage Instructions

Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 16: RL for Robots Stanford Online
• Course Series: Stanford CS224R Deep Reinforcement Learning
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/rbaWQQLrzl0?si=jQ0EQFiv1ZsMDqn4
