1. Course Details
This is the 18th and final lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by Professor Chelsea Finn. This lecture serves as a capstone to the course, covering the most pressing unsolved challenges in deep reinforcement learning (RL) and providing practical, actionable advice for conducting high-impact empirical RL research.
The lecture is divided into two major parts. The first part categorizes open problems into three core areas: challenges in problem formulation, methodological challenges, and challenges in deployment and evaluation. The second part provides a comprehensive guide to the research process, including how to select research problems, manage risk, execute experiments effectively, and communicate results to maximize impact.
2. Key Learning Objectives
By the end of this lecture, students will be able to:
- Identify and categorize the fundamental open problems in modern deep reinforcement learning
- Analyze the limitations of current reward specification approaches and their real-world consequences
- Evaluate the challenges and opportunities of leveraging world models and pre-trained knowledge for RL
- Distinguish between problem-driven research and idea-driven research and understand their respective tradeoffs
- Implement practical strategies for managing risk in empirical research projects
- Design effective evaluation protocols for generalist RL systems
- Apply evidence-based best practices for writing papers and presenting research findings
3. Memorable Quotes
- "Very, very few research ideas are good. Very, very few research ideas have impact, especially lasting impact. A lot of ideas don't work."
- "Research is inherently incremental. We are building upon the knowledge of previous research that has been done."
- "Oftentimes it is actually the simple ideas that have more impact than the fancy ideas or the complex algorithms."
- "A research project needs two things: an important problem and a plan for how to approach that problem. If you only have one of these, that doesn't constitute a good project direction."
- "The output of research is essentially ideas, knowledge, things that you learned from doing the research. If no one knows about what you learned, then there was really no impact or no output as a result."
4. Detailed Lecture Notes
4.1 Part One: Open Problems in Deep Reinforcement Learning
All major successes in reinforcement learning have occurred in domains with well-specified rewards and scalable policy execution. However, extending RL to the broader real world requires solving three interconnected classes of challenges.
4.1.1 Challenges in Problem Formulation
The most fundamental challenge in RL remains reward specification: defining what we actually want the agent to optimize for.
The Preference Optimization Paradox:
- Human preferences systematically prioritize confidence and agreement over factuality
- Chatbots optimized for human preference often produce convincing but incorrect answers
- Personalization can lead to echo chambers and polarization, even as it improves individual user satisfaction
- Balancing multiple competing objectives (e.g., engagement vs. truthfulness) remains largely a manual, ad-hoc process
- Robotics: Binary success/failure rewards provide insufficient signal, while hand-shaped rewards require enormous engineering effort and are prone to reward hacking
- Recommendation systems: Metrics like click-through rate and watch time optimize for short-term engagement at the expense of long-term user well-being
- Scientific discovery: Unlike math problems, we often do not know the correct answer in advance, making verification extremely difficult
- Humans learn through unstructured play and exploration without explicit external rewards
- Developing algorithms that can learn general-purpose skills through autonomous exploration remains one of the grandest challenges in AI
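To make the robotics point above concrete, here is a minimal Python sketch of the two reward styles on a hypothetical one-dimensional reaching task. The task itself, the function names `sparse_reward` and `shaped_reward`, and the `tolerance` parameter are illustrative assumptions, not something taken from the lecture.

```python
def sparse_reward(position: float, goal: float, tolerance: float = 0.05) -> float:
    """Binary success/failure reward (hypothetical task).

    Returns 1.0 only when the position is within `tolerance` of the goal,
    and 0.0 everywhere else.
    """
    return 1.0 if abs(position - goal) <= tolerance else 0.0


def shaped_reward(position: float, goal: float) -> float:
    """Hand-shaped reward (hypothetical task).

    Returns the negative distance to the goal, so every step closer
    yields a strictly higher reward.
    """
    return -abs(position - goal)


# Compare the signal each reward gives at a few positions along the way.
goal = 1.0
for position in (0.0, 0.5, 0.9, 1.0):
    print(position, sparse_reward(position, goal), shaped_reward(position, goal))
```

In this toy setup the sparse reward is zero at 0.0, 0.5, and 0.9 alike, so it says nothing about progress, while the shaped reward increases steadily. The cost is that someone had to design the distance term by hand, and a poorly chosen term is the kind of thing an agent can exploit.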
4.1.2 Methodological Challenges
Even when rewards are well-specified, current RL methods face significant limitations that prevent them from scaling to more complex problems.
Leveraging Prior Data and Knowledge:
- The default approach of initializing weights and replay buffers with prior data is surprisingly limited
- Pre-training can sometimes constrain exploration and lead to worse final performance, especially when the agent needs to discover novel behaviors beyond human capabilities
- Incorporating abstract knowledge (e.g., from text, news articles) into RL systems remains an open problem
- Modern video generation models possess rich commonsense knowledge about the physical world
- However, action-conditioned world models suffer from severe out-of-distribution generalization issues:
  - They are trained almost exclusively on successful demonstrations
  - They cannot accurately predict the outcomes of mistakes or novel actions
  - Small physical inaccuracies compound over time and lead to catastrophic control failures
- Promising alternative: Use world models for high-level planning rather than low-level action prediction
- Current large-scale RL is limited to relatively short-horizon problems
- Long-horizon RL requires accurate value functions, but current value functions trained with PPO are only used for variance reduction and are not accurate enough for decision-making
- Batch online RL: Collecting large batches of data before updating policies is more practical for real-world systems but introduces new algorithmic challenges
- Diffusion policies have shown particular promise in this setting due to their ability to generate diverse actions
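As a rough illustration of what "used for variance reduction" means for PPO value functions, the sketch below computes generalized advantage estimation (GAE), a common way PPO implementations turn a learned value function into a baseline. The function name `gae_advantages`, its argument layout, and the default `gamma` and `lam` values are assumptions made for this example; the lecture does not prescribe a specific estimator.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Compute GAE advantages for a single trajectory.

    `values` must hold one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final reward (use 0.0 if
    the episode terminated there).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly len(rewards) + 1 entries")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better the step went than the value predicted.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy trajectory: two steps with reward 1.0 each and made-up value estimates.
print(gae_advantages([1.0, 1.0], [0.5, 0.5, 0.0], gamma=1.0, lam=1.0))
```

In this role the value estimates act as a baseline subtracted from the returns that weight the policy gradient, rather than as something the agent plans with directly.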
4.1.3 Challenges in Deployment and Evaluation
The greatest gap between RL research and real-world impact lies in deployment and evaluation.
Safety in Open-World Environments:
- Formal verification methods fail to scale to the complexity of real-world scenarios
- Data-driven safety approaches require collecting data on unsafe situations, which is ethically problematic and potentially dangerous
- Humans can learn new skills without catastrophic failure by following a natural curriculum of increasing difficulty
- Developing algorithms that can explore safely while learning remains a critical open challenge
- AI systems often perform worse when paired with humans than when operating independently
- Example: Physicians assisted by a diagnostic AI that was 92% accurate on its own reached only 76% accuracy, barely better than the 74% accuracy of physicians working alone
- Reinforcement learning degrades model calibration, making it harder for humans to trust and appropriately use model outputs
- Models that can verbalize their uncertainty and provide multiple hypotheses show promise for improving human-AI collaboration
- Unlike supervised learning, RL has no reliable offline evaluation metrics
- Policy performance must be measured through online rollouts, which is expensive and time-consuming
- Generalist systems that perform many tasks require evaluation across an enormous number of scenarios
- Open questions:
  - Can we develop offline metrics that can at least rule out bad models?
  - How can we select representative test scenarios that accurately predict general performance?
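One common way to quantify the calibration issue mentioned above is expected calibration error (ECE), which compares a model's stated confidence with its observed accuracy inside confidence bins. The sketch below is a minimal version under stated assumptions: the function name `expected_calibration_error`, the equal-width binning, and the toy numbers are illustrative and do not come from the lecture.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Expected calibration error with equal-width confidence bins.

    For each bin, take the gap between mean confidence and accuracy, then
    average the gaps weighted by the fraction of samples in the bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one prediction is required")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    count = [0] * num_bins
    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * num_bins), num_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    total = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        if count[b] == 0:
            continue
        accuracy = hit_sum[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / total) * abs(accuracy - mean_conf)
    return ece


# Toy case: a model that says 95% on every answer but is right half the time.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
print(overconfident)  # about 0.45
```

A lower value means stated confidence tracks accuracy more closely. Comparing this number before and after RL fine-tuning would be one way to check for the degradation the notes describe.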
4.2 Part Two: How to Conduct Impactful Deep RL Research
Conducting successful empirical research requires more than just technical skill; it requires a strategic approach to problem selection, risk management, and communication.
4.2.1 Fundamental Realities of Research
- Most ideas do not work, and most papers do not have lasting impact
- Research is inherently incremental; even breakthroughs like AlphaFold build on decades of prior work
- Simple ideas often have greater impact than complex ones because they are easier to implement, scale, and build upon
4.2.2 How to Select Research Problems
A good research project requires three essential elements:
- An important problem that, if solved, would have significant real-world or scientific impact
- A concrete plan for how to approach the problem
- Genuine excitement about the problem that will sustain you through inevitable setbacks
- Idea-driven research: Start with a cool algorithm and look for a problem it solves. Risk: The algorithm may not solve any important problem, or there may be a simpler solution.
- Problem-driven research: Start with an important problem and look for the best solution. Risk: The solution may seem obvious in retrospect, making it harder to publish.
- Recommendation: Lean toward problem-driven research. You are guaranteed to be working on something important, even if the solution is not the most technically sophisticated.
- The most impactful research addresses the actual bottleneck preventing progress on a problem
- If a problem is bottlenecked by data, working on new algorithms will yield only marginal improvements
- Think about the long-term vision for a field and identify the specific subproblems that need to be solved to get there
- Many of the most impactful discoveries occur at the intersection of fields
- If you encounter a limitation in a tool from another field, fixing that tool can lead to significant advances
- Do not box yourself into a narrow research area; be willing to learn new skills and tackle problems outside your comfort zone
4.2.3 How to Execute Research and Manage Risk
The biggest challenge in empirical research is dealing with the high probability of failure.
Frontload Risk:
- Identify the core unknowns in your project as early as possible
- Run small, fast experiments to test these core assumptions before building full infrastructure
- This is uncomfortable because building infrastructure feels like progress, but it saves enormous time in the long run
- You create your own luck by trying many ideas
- Discard ideas that do not show signs of life quickly and move on to the next one
- Do not commit to a single idea until you have preliminary evidence that it works
- If your system is not working, simplify the problem dramatically until something does work
- Then gradually add complexity back in, one component at a time
- This is far more effective than trying to debug a complex system that fails completely
- The sunk cost fallacy is the biggest enemy of productive research
- If an idea is not working despite significant effort, it is probably not going to work
- Mental trick: Frame the decision as "continue with project A vs. start project B" rather than "continue vs. quit"
4.2.4 How to Share Your Research
Research has no impact if no one knows about it. Sharing your work is not self-promotion; it is a service to the community.
Principles of Effective Communication:
- Prioritize clarity above all else
- Assume your audience knows less than you think they do
- Avoid unnecessary jargon
- Use clear visuals and concrete examples to illustrate abstract concepts
- Start with an outline before writing a full paper
- Practice presentations repeatedly and get honest feedback
- The most important part of any paper or talk is the introduction, which should clearly state the problem, your approach, and your key results
These are my structured study notes and interpretations, compiled while watching the open course. I hope this framework helps you master the material, and I wish you continued progress and success in your studies.


