Lecture 18: Open Problems in Deep Reinforcement Learning & How to Conduct Impactful Research

1. Course Details

This is the 18th and final lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by Professor Chelsea Finn. This lecture serves as a capstone to the course, covering the most pressing unsolved challenges in deep reinforcement learning (RL) and providing practical, actionable advice for conducting high-impact empirical RL research.

The lecture is divided into two major parts. The first part categorizes open problems into three core areas: challenges in problem formulation, methodological challenges, and challenges in deployment and evaluation. The second part provides a comprehensive guide to the research process, including how to select research problems, manage risk, execute experiments effectively, and communicate results to maximize impact.

2. Key Learning Objectives

By the end of this lecture, students will be able to:

Identify and categorize the fundamental open problems in modern deep reinforcement learning
Analyze the limitations of current reward specification approaches and their real-world consequences
Evaluate the challenges and opportunities of leveraging world models and pre-trained knowledge for RL
Distinguish between problem-driven research and idea-driven research and understand their respective tradeoffs
Implement practical strategies for managing risk in empirical research projects
Design effective evaluation protocols for generalist RL systems
Apply evidence-based best practices for writing papers and presenting research findings

3. Memorable Quotes

"Very, very few research ideas are good. Very, very few research ideas have impact, especially lasting impact. A lot of ideas don't work."
"Research is inherently incremental. We are building upon the knowledge of previous research that has been done."
"Oftentimes it is actually the simple ideas that have more impact than the fancy ideas or the complex algorithms."
"A research project needs two things: an important problem and a plan for how to approach that problem. If you only have one of these, that doesn't constitute a good project direction."
"The output of research is essentially ideas, knowledge, things that you learned from doing the research. If no one knows about what you learned, then there was really no impact or no output as a result."

4. Detailed Lecture Notes

4.1 Part One: Open Problems in Deep Reinforcement Learning

All major successes in reinforcement learning have occurred in domains with well-specified rewards and scalable policy execution. However, extending RL to the broader real world requires solving three interconnected classes of challenges.

4.1.1 Challenges in Problem Formulation

The most fundamental challenge in RL remains reward specification: defining what we actually want the agent to optimize for.

The Preference Optimization Paradox:

Human preferences systematically prioritize confidence and agreement over factuality
Chatbots optimized for human preference often produce convincing but incorrect answers
Personalization can lead to echo chambers and polarization, even as it improves individual user satisfaction
Balancing multiple competing objectives (e.g., engagement vs. truthfulness) remains largely a manual, ad-hoc process

Domain-Specific Reward Challenges:

Robotics: Binary success/failure rewards provide insufficient signal, while hand-shaped rewards require enormous engineering effort and are prone to reward hacking
Recommendation systems: Metrics like click-through rate and watch time optimize for short-term engagement at the expense of long-term user well-being
Scientific discovery: Unlike math problems, we often do not know the correct answer in advance, making verification extremely difficult
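
To make the robotics trade-off concrete, here is a minimal sketch that contrasts a binary success reward with a hand-shaped one. It assumes a hypothetical 1-D reaching task; the function names, the goal position, and the tolerance are illustrative and do not come from the lecture.

```python
def sparse_reward(position: float, goal: float, tol: float = 0.05) -> float:
    """Binary success signal: 1.0 only when the agent is within tol of the goal."""
    return 1.0 if abs(position - goal) <= tol else 0.0


def shaped_reward(position: float, goal: float) -> float:
    """Hand-shaped dense signal: negative distance to the goal."""
    return -abs(position - goal)


goal = 1.0
positions = [0.0, 0.5, 0.9, 1.0]  # hypothetical states along one trajectory

sparse = [sparse_reward(p, goal) for p in positions]
shaped = [shaped_reward(p, goal) for p in positions]

# The sparse reward is identical for every state except the last one,
# so it cannot tell the learner that 0.9 is better than 0.0.
# The shaped reward increases at every step toward the goal.
for p, s, d in zip(positions, sparse, shaped):
    print(f"position={p:.1f}  sparse={s:.1f}  shaped={d:.2f}")
```

The shaped version gives a learning signal everywhere, but only because someone chose distance as the proxy. On a real robot the proxy usually has more terms, and each one is an opportunity for the policy to score well without completing the task.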

Intrinsic Motivation:

Humans learn through unstructured play and exploration without explicit external rewards
Developing algorithms that can learn general-purpose skills through autonomous exploration remains one of the grandest challenges in AI

4.1.2 Methodological Challenges

Even when rewards are well-specified, current RL methods face significant limitations that prevent them from scaling to more complex problems.

Leveraging Prior Data and Knowledge:

The default approach of initializing weights and replay buffers with prior data is surprisingly limited
Pre-training can sometimes constrain exploration and lead to worse final performance, especially when the agent needs to discover novel behaviors beyond human capabilities
Incorporating abstract knowledge (e.g., from text, news articles) into RL systems remains an open problem

World Models and Video Generation:

Modern video generation models possess rich commonsense knowledge about the physical world
However, action-conditioned world models suffer from severe out-of-distribution generalization issues:
- They are trained almost exclusively on successful demonstrations
- They cannot accurately predict the outcomes of mistakes or novel actions
- Small physical inaccuracies compound over time and lead to catastrophic control failures
Promising alternative: Use world models for high-level planning rather than low-level action prediction
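
The compounding-error point can be illustrated with a toy calculation. This is only a sketch under strong assumptions: the "world" is a scalar linear system and the "model" has a 2% error in its single coefficient. Neither is a video model, and the names and numbers are hypothetical.

```python
def rollout(x0: float, coeff: float, horizon: int) -> list[float]:
    """Open-loop rollout of x_{t+1} = coeff * x_t for `horizon` steps."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(coeff * xs[-1])
    return xs


HORIZON = 50
true_traj = rollout(1.0, 1.00, HORIZON)   # assumed ground-truth dynamics
model_traj = rollout(1.0, 1.02, HORIZON)  # learned model, 2% off per step

# Absolute prediction error at each step of the imagined rollout.
errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]

print(f"error after  1 step : {errors[1]:.3f}")   # about 0.02
print(f"error after 50 steps: {errors[50]:.3f}")  # about 1.69
```

A per-step error that looks negligible grows geometrically once the model is fed its own predictions, which is one reason to plan at a coarser, high-level timescale where fewer model steps are chained together.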

Scaling Reinforcement Learning:

Current large-scale RL is limited to relatively short-horizon problems
Long-horizon RL requires accurate value functions, but current value functions trained with PPO are only used for variance reduction and are not accurate enough for decision-making
Batch online RL: Collecting large batches of data before updating policies is more practical for real-world systems but introduces new algorithmic challenges
Diffusion policies have shown particular promise in this setting due to their ability to generate diverse actions
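
To show what "used for variance reduction" means in practice, here is a minimal sketch of generalized advantage estimation (GAE), an estimator commonly paired with PPO. The notes do not say which estimator the lecture had in mind, so treat this as an assumption; the function name and default hyperparameters are illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    rewards: list of T rewards.
    values:  list of T + 1 value estimates; the last entry is the
             bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better the step was than predicted.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [0.0, 0.0, 1.0]

# With no baseline (all values zero) every step gets the full return.
no_baseline = gae_advantages(rewards, [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)

# With a perfect value function on this deterministic trajectory the
# advantages vanish: nothing was better or worse than predicted.
perfect = gae_advantages(rewards, [1.0, 1.0, 1.0, 0.0], gamma=1.0, lam=1.0)

print(no_baseline, perfect)
```

Here the value function only recenters the policy-gradient signal. It never has to rank actions by itself, so it can be fairly inaccurate and still help, which is different from the accuracy that long-horizon decision-making would demand.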

4.1.3 Challenges in Deployment and Evaluation

The greatest gap between RL research and real-world impact lies in deployment and evaluation.

Safety in Open-World Environments:

Formal verification methods fail to scale to the complexity of real-world scenarios
Data-driven safety approaches require collecting data on unsafe situations, which is ethically problematic and potentially dangerous
Humans can learn new skills without catastrophic failure by following a natural curriculum of increasing difficulty
Developing algorithms that can explore safely while learning remains a critical open challenge

Human-AI Collaboration:

AI systems often perform worse when paired with humans than when operating independently
Example: Physicians assisted by a diagnostic AI that was 92% accurate on its own reached only 76% accuracy, barely better than the 74% accuracy of physicians working alone
Reinforcement learning degrades model calibration, making it harder for humans to trust and appropriately use model outputs
Models that can verbalize their uncertainty and provide multiple hypotheses show promise for improving human-AI collaboration
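
Calibration can be measured with expected calibration error (ECE), a standard metric that compares stated confidence with observed accuracy. The sketch below is a simplified equal-width-bin version with made-up confidences and outcomes; it is not the evaluation used in the lecture.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical overconfident model: claims 95% but is right half the time.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, False, True, False]
)

# Hypothetical calibrated model: claims 75% and is right 3 times out of 4.
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)

print(f"overconfident ECE = {overconfident:.2f}, calibrated ECE = {calibrated:.2f}")
```

A human collaborator can only decide when to defer to the model if its stated confidence tracks its accuracy, so a large ECE undermines appropriate trust even when raw accuracy is high.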

Evaluation of Generalist Systems:

Unlike supervised learning, RL has no reliable offline evaluation metrics
Policy performance must be measured through online rollouts, which is expensive and time-consuming
Generalist systems that perform many tasks require evaluation across an enormous number of scenarios
Open questions:
- Can we develop offline metrics that can at least rule out bad models?
- How can we select representative test scenarios that accurately predict general performance?
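
One way to see why rollout-based evaluation is expensive is to look at the statistical uncertainty of a measured success rate. The sketch below uses the Wilson score interval for a binomial proportion; the rollout counts are hypothetical and the choice of interval is an assumption, not something prescribed in the lecture.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a success rate (z = 1.96 gives about 95%)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials**2))
    ) / denom
    return center - half_width, center + half_width


# Same observed 75% success rate, measured with 20 versus 200 rollouts.
low_20, high_20 = wilson_interval(15, 20)
low_200, high_200 = wilson_interval(150, 200)

print(f" 20 rollouts: [{low_20:.2f}, {high_20:.2f}]")
print(f"200 rollouts: [{low_200:.2f}, {high_200:.2f}]")
```

With only 20 rollouts per task the interval is wide enough that two quite different policies can look indistinguishable. A generalist system multiplies that cost by the number of tasks and scenarios, which is what motivates the search for cheap offline filters and representative test sets.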

4.2 Part Two: How to Conduct Impactful Deep RL Research

Conducting successful empirical research requires more than technical skill: it also takes a strategic approach to problem selection, risk management, and communication.

4.2.1 Fundamental Realities of Research

Most ideas do not work, and most papers do not have lasting impact
Research is inherently incremental; even breakthroughs like AlphaFold build on decades of prior work
Simple ideas often have greater impact than complex ones because they are easier to implement, scale, and build upon

4.2.2 How to Select Research Problems

A good research project requires three essential elements:

An important problem that, if solved, would have significant real-world or scientific impact
A concrete plan for how to approach the problem
Genuine excitement about the problem that will sustain you through inevitable setbacks

Problem-Driven vs. Idea-Driven Research:

Idea-driven research: Start with a cool algorithm and look for a problem it solves. Risk: The algorithm may not solve any important problem, or there may be a simpler solution.
Problem-driven research: Start with an important problem and look for the best solution. Risk: The solution may seem obvious in retrospect, making it harder to publish.
Recommendation: Lean toward problem-driven research. You are guaranteed to be working on something important, even if the solution is not the most technically sophisticated.

Identifying Bottlenecks:

The most impactful research addresses the actual bottleneck preventing progress on a problem
If a problem is bottlenecked by data, working on new algorithms will yield only marginal improvements
Think about the long-term vision for a field and identify the specific subproblems that need to be solved to get there

Crossing Disciplinary Boundaries:

Many of the most impactful discoveries occur at the intersection of fields
If you encounter a limitation in a tool from another field, fixing that tool can lead to significant advances
Do not box yourself into a narrow research area; be willing to learn new skills and tackle problems outside your comfort zone

4.2.3 How to Execute Research and Manage Risk

The biggest challenge in empirical research is dealing with the high probability of failure.

Frontload Risk:

Identify the core unknowns in your project as early as possible
Run small, fast experiments to test these core assumptions before building full infrastructure
This is uncomfortable because building infrastructure feels like progress, but it saves enormous time in the long run

Iterate Rapidly:

You create your own luck by trying many ideas
Discard ideas that do not show signs of life quickly and move on to the next one
Do not commit to a single idea until you have preliminary evidence that it works

The Simplification Debugging Strategy:

If your system is not working, simplify the problem dramatically until something does work
Then gradually add complexity back in, one component at a time
This is far more effective than trying to debug a complex system that fails completely

Know When to Pivot:

The sunk cost fallacy is the biggest enemy of productive research
If an idea is not working despite significant effort, it is probably not going to work
Mental trick: Frame the decision as "continue with project A vs. start project B" rather than "continue vs. quit"

4.2.4 How to Share Your Research

Research has no impact if no one knows about it. Sharing your work is not self-promotion; it is a service to the community.

Principles of Effective Communication:

Prioritize clarity above all else
Assume your audience knows less than you think they do
Avoid unnecessary jargon
Use clear visuals and concrete examples to illustrate abstract concepts

Writing and Presenting:

Start with an outline before writing a full paper
Practice presentations repeatedly and get honest feedback
The most important part of any paper or talk is the introduction, which should clearly state the problem, your approach, and your key results


Video Source and Usage Instructions

Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 18: Frontiers Stanford Online
• Course Series: Stanford CS224R Deep Reinforcement Learning
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/FacJ_1tTSx4?si=JvNEAHU6sDz8bYcc
