One. Course Details
This is the first lecture of Stanford University's CS224R Deep Reinforcement Learning course, taught by Chelsea Finn, an assistant professor at Stanford whose research focuses on reinforcement learning, robotics, and large language models.
The course centers on deep RL methods, meaning RL approaches that scale to deep neural networks, with minimal coverage of non-neural-network RL methods. It covers imitation learning, model-free and model-based RL, offline and online RL, and multi-task and meta-RL, with special emphasis on RL applications for large language models and robotics.
The core course goals are to enable students to understand and implement both existing and emerging deep RL methods, master core concepts to grasp advanced techniques independently, and gain hands-on experience with algorithm implementation through lectures and projects. For students seeking more theoretical depth or broader applications, CS234 is recommended as a complementary course.
Two. Key Learning Objectives
By the end of this lecture, students should be able to:
- Define deep reinforcement learning and distinguish it from traditional supervised machine learning
- Represent agent behavior using standard RL notation and data structures
- Formulate any sequential decision-making problem as a reinforcement learning problem
- Explain the Markov property and the difference between MDPs and POMDPs
- State the formal optimization objective of reinforcement learning
- Identify the main categories of RL algorithms and their core trade-offs
Three. Memorable Course Quotes
- "Reinforcement learning enables this ability to get better with practice that isn't present in other machine learning systems."
- "Learning from experience seems fundamental to intelligence, both for humans and for building intelligent machines."
- "Reinforcement learning is a tool to discover new solutions rather than just mimicking existing data."
- "Nearly all modern language models use some form of RL for post-training, especially for advanced reasoning capabilities."
- "RL algorithms make different trade-offs and thrive under different assumptions—there is no one-size-fits-all solution."
Four. Detailed Study Notes
4.1 Deep RL vs. Traditional Supervised Learning
The most fundamental differences lie in data distribution and supervision type:
- Supervised learning: Learns a mapping from input x to output y using labeled i.i.d. (independent and identically distributed) data. The model receives direct, explicit feedback about the correct output for each input.
- Reinforcement learning: Learns a mapping from states/observations to actions (the policy, denoted π). Feedback is indirect and often delayed, arriving as rewards rather than as correct answers. Critically, the data distribution depends on the policy being learned: the agent's actions shape the future data it will see, breaking the i.i.d. assumption.
RL applies to any scenario where decisions have long-term consequences, direct supervision is unavailable, or the objective is non-differentiable.
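The claim that the agent's actions shape its own data can be illustrated with a small sketch. The corridor environment, the `step` function, and the two policies below are hypothetical examples, not taken from the lecture; the point is only that two policies in the same environment end up collecting data from almost disjoint sets of states.

```python
from collections import Counter


def step(state, action):
    """Hypothetical corridor with positions 0..4; action is -1 (left) or +1 (right)."""
    return min(4, max(0, state + action))


def visited_states(policy, start=2, horizon=6):
    """Roll out a deterministic policy and count how often each state is visited."""
    state, counts = start, Counter()
    for _ in range(horizon):
        counts[state] += 1
        state = step(state, policy(state))
    return counts


def always_left(state):
    return -1


def always_right(state):
    return +1


left_counts = visited_states(always_left)
right_counts = visited_states(always_right)

# Same environment, same start state, different policies: the only state the
# two datasets share is the start position.
print(sorted(left_counts.items()))   # [(0, 4), (1, 1), (2, 1)]
print(sorted(right_counts.items()))  # [(2, 1), (3, 1), (4, 4)]
```

In supervised learning the training set is fixed before learning starts; here, changing the policy changes which states ever appear in the data.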
4.2 Core Components of an RL Problem
Every RL problem can be decomposed into these standard components:
- State (S): A complete description of the current world state that contains all information needed to make optimal decisions.
- Observation (O): A partial description of the world state that the agent can actually perceive. Observations may omit critical information, requiring the agent to use past observation history to infer the true state.
- Action (a): The decision the agent makes at each time step, which changes the world state.
- Trajectory: A sequence of states/observations and actions (s₁, a₁, s₂, a₂, ..., s_T, a_T) representing one complete interaction between the agent and the environment, also called a rollout or episode.
- Reward function (r(s, a)): A scalar-valued function that quantifies how good a state-action pair is. It defines the agent's goal and can depend on states only, or on both states and actions (e.g., penalizing excessive energy use in robots).
- Dynamics function (P(s' | s, a)): The probability distribution over next states given the current state and action, which models how the world evolves.
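As a rough illustration of how these components fit together, here is a minimal sketch of a toy environment. Everything in it is an assumption made for the example: the line of states 0..3, the 0.8 success probability, the 0.1 movement cost, and the names `Trajectory`, `dynamics`, `reward`, and `rollout`.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One rollout/episode: parallel lists of states, actions, and rewards."""
    states: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    rewards: list = field(default_factory=list)


GOAL = 3  # hypothetical goal position on a line of states 0..3


def dynamics(state, action, rng):
    """Sample s' ~ P(s' | s, a): the move succeeds with probability 0.8,
    otherwise the agent stays where it is."""
    if rng.random() < 0.8:
        return min(GOAL, max(0, state + action))
    return state


def reward(state, action):
    """r(s, a): +1 when the agent is at the goal, minus a small cost for moving."""
    return (1.0 if state == GOAL else 0.0) - 0.1 * abs(action)


def rollout(policy, start=0, horizon=5, seed=0):
    """Run the policy for `horizon` steps and record the trajectory."""
    rng = random.Random(seed)
    traj, state = Trajectory(), start
    for _ in range(horizon):
        action = policy(state)
        traj.states.append(state)
        traj.actions.append(action)
        traj.rewards.append(reward(state, action))
        state = dynamics(state, action, rng)
    return traj


traj = rollout(lambda s: +1)  # a policy that always tries to move right
print(list(zip(traj.states, traj.actions, traj.rewards)))
```

The reward here depends on both the state and the action, mirroring the energy-penalty example above; dropping the `abs(action)` term would make it a state-only reward.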
4.3 Markov Property, MDPs, and POMDPs
- Markov property: The future is independent of the past given the present. Formally, P(sₜ₊₁ | s₁, a₁, ..., sₜ, aₜ) = P(sₜ₊₁ | sₜ, aₜ). This simplifies RL problems because the current state alone is enough to predict what happens next, so the agent does not need to track the full history.
- Markov Decision Process (MDP): A fully observable RL problem where the agent has access to the complete state S at all times.
- Partially Observable Markov Decision Process (POMDP): A more general RL problem where the agent only receives observations O instead of full states. POMDPs require policies with memory (e.g., sequence models) to incorporate past observations.
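The difference between a state and an observation can be made concrete with a small sketch. The setup is a hypothetical example, not from the lecture: the true state is a (position, velocity) pair, but the agent only observes the position. A single observation cannot distinguish the two states below, while a short observation history can.

```python
def observe(state):
    """Observation function: the agent sees position but not velocity."""
    position, _velocity = state
    return position


def next_state(state):
    """Deterministic dynamics for illustration: position advances by velocity."""
    position, velocity = state
    return (position + velocity, velocity)


def estimate_velocity(observation_history):
    """Recover the hidden velocity from the two most recent observations."""
    if len(observation_history) < 2:
        return None  # a single observation is not enough
    return observation_history[-1] - observation_history[-2]


# Two different states that produce the same observation.
moving_right, moving_left = (5, +1), (5, -1)
print(observe(moving_right) == observe(moving_left))  # True

# With memory of past observations, the hidden part of the state becomes recoverable.
history = [observe(moving_right)]
history.append(observe(next_state(moving_right)))
print(estimate_velocity(history))  # 1
```

This is the reason policies for POMDPs are given memory: the history of observations stands in for the state the agent cannot see directly.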
4.4 Policies and the RL Objective
- Policy (π_θ): A function that maps states/observations to actions, parameterized by θ (typically the weights of a neural network). Policies are often stochastic rather than deterministic to enable exploration and model diverse human behaviors.
- Formal RL objective: Maximize the expected sum of rewards over trajectories generated by the policy: max_θ E_{τ∼π_θ}[Σ_{t=1}^{T} r(sₜ, aₜ)], where τ denotes a trajectory (s₁, a₁, ..., s_T, a_T).
- Discount factor (γ): A value between 0 and 1 that weights immediate rewards more heavily than future rewards. It keeps the sum of rewards finite in infinite-horizon problems and models the human preference for immediate outcomes.
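The objective and the discount factor can be sketched numerically. The code below is a minimal illustration under stated assumptions: `discounted_return`, `estimate_objective`, and the coin-flip reward source are hypothetical names and choices, and the expectation is approximated by averaging over sampled rollouts (a Monte Carlo estimate) rather than computed exactly.

```python
import random


def discounted_return(rewards, gamma=1.0):
    """Sum of rewards where the first reward is undiscounted and each later
    reward is multiplied by one more factor of gamma."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def estimate_objective(sample_rewards, num_rollouts=1000, gamma=1.0, seed=0):
    """Monte Carlo estimate of the expected return: average over sampled rollouts."""
    rng = random.Random(seed)
    returns = [discounted_return(sample_rewards(rng), gamma) for _ in range(num_rollouts)]
    return sum(returns) / len(returns)


def coin_flip_rewards(rng, horizon=3):
    """Hypothetical reward source: each step pays 1 with probability 0.5, else 0."""
    return [1.0 if rng.random() < 0.5 else 0.0 for _ in range(horizon)]


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(estimate_objective(coin_flip_rewards))          # close to the true value 1.5
```

With gamma set to 1 this reduces to the plain sum of rewards in the objective above; with gamma below 1, rewards far in the future contribute geometrically less.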
4.5 Overview of RL Algorithm Categories
The course will cover five main classes of RL algorithms, each with distinct trade-offs:
- Imitation learning: Mimics expert demonstrations to learn a high-performing policy
- Policy gradients: Directly differentiate the RL objective to update the policy
- Actor-critic methods: Combine policy learning with value function estimation for more stable updates
- Value-based methods: Estimate the value of states and actions under the optimal policy and derive a policy from these estimates
- Model-based methods: Learn a dynamics model of the world and use it for planning or policy improvement
Algorithm choice depends on factors including data collection cost, availability of demonstrations, required stability, action space dimensionality, and ease of learning a dynamics model.
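To make "directly differentiates the RL objective" a little more concrete, here is a minimal sketch for the simplest possible case: a one-step problem with three actions and fixed rewards. The reward values, the softmax parameterization, and the step size of 1.0 are assumptions chosen for the example. Because the problem is so small, the expected reward and its gradient are computed exactly; practical policy gradient methods estimate this gradient from sampled trajectories instead.

```python
import math


def softmax(logits):
    """Turn policy parameters (logits) into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(logits, rewards):
    """The objective J: expected reward under the softmax policy."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def policy_gradient(logits, rewards):
    """Exact gradient of J with respect to each logit: pi(a) * (r(a) - J)."""
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]


rewards = [0.2, 1.0, 0.5]   # hypothetical reward for each of three actions
logits = [0.0, 0.0, 0.0]    # start from a uniform policy

for _ in range(500):
    grad = policy_gradient(logits, rewards)
    logits = [theta + 1.0 * g for theta, g in zip(logits, grad)]  # gradient ascent

final_probs = softmax(logits)
print([round(p, 3) for p in final_probs])
```

Gradient ascent shifts probability toward the action with the highest reward, which is the behavior the objective asks for.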
These are my structured study notes and in-depth interpretations, compiled while watching the open course. I hope this knowledge framework helps you master the subject thoroughly, and I wish you continued academic progress and great success in your studies.


