One. Course Details
This is the sixth lecture of Stanford University's CS224R Deep Reinforcement Learning course, taught by Professor Chelsea Finn. It marks a shift from policy-based and actor-critic methods to value-based reinforcement learning, the paradigm that produced the first widely successful deep RL algorithm, Deep Q-Networks (DQN).
The lecture covers the mathematical foundations of Q-learning, including the Bellman equation and Bellman optimality equation, the policy iteration framework, and the core Q-learning update rule. It then addresses the fundamental instability issues that arise when combining Q-learning with neural networks and presents three critical innovations that make DQN work reliably in practice: target networks, Double DQN, and n-step returns. The lecture concludes with practical guidance on when to use value-based methods versus policy-based methods like PPO and SAC.
Value-based methods eliminate the need for an explicit policy network, instead deriving the optimal policy directly from the learned Q-function. This makes them particularly efficient for discrete action spaces and has led to breakthrough applications in game playing and robotic manipulation.
Two. Key Learning Objectives
By the end of this lecture, students should be able to:
- Define and distinguish between the Bellman equation and Bellman optimality equation
- Explain the two-step policy iteration process (policy evaluation and policy improvement)
- Derive the Q-learning update rule and explain why Q-learning is an off-policy algorithm
- Identify the three main sources of instability in vanilla deep Q-learning
- Implement the three core DQN improvements: target networks, Double DQN, and n-step returns
- Design effective exploration strategies for Q-learning, including epsilon-greedy exploration
- Compare the strengths, weaknesses, and optimal use cases of PPO, SAC, and DQN
Three. Memorable Course Quotes
- "Deep Q-Networks was arguably the very first deep reinforcement learning method that demonstrated neural networks could learn complex tasks directly from raw pixel inputs."
- "The biggest challenge with vanilla Q-learning is instability: your target is constantly moving as you update your Q-function, turning supervised learning into a moving target problem."
- "Overestimation of Q-values is a fundamental flaw in vanilla Q-learning, caused by using the same network to both select and evaluate actions."
- "Exploration is not optional in Q-learning: you need to try bad actions to learn which ones are actually good."
- "Value-based methods eliminate the need for an explicit policy network: your policy is just the argmax of your Q-function."
- "Don't be alarmed if your DQN loss goes up during training: this is normal and often means your policy is improving and discovering higher-value states."
Four. Detailed Study Notes
4.1 Recap: Policy-Based vs. Value-Based RL
All previous algorithms covered (policy gradients, actor-critic, PPO, SAC) are policy-based methods that explicitly learn a parameterized policy network π_θ(a | s). In contrast, value-based methods take a fundamentally different approach:
- They learn a Q-function Q(s, a) that estimates the expected future reward of taking action a in state s and then following the optimal policy thereafter
- The optimal policy is derived implicitly as the action that maximizes the Q-value at each state: π*(s) = argmax_a Q*(s, a)
- No separate policy network is required, which simplifies the architecture and reduces computational overhead
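As a small illustration of the implicit policy described above, the sketch below picks the greedy action from a list of Q-values for one state. The function name `greedy_action` and the example numbers are assumptions made for illustration, not something from the lecture.

```python
def greedy_action(q_values):
    """Return the index of the action with the largest Q-value.

    Ties are broken in favor of the lowest action index.
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for a single state with three discrete actions.
q_for_state = [0.2, 1.5, -0.3]
action = greedy_action(q_for_state)  # the policy is just the argmax over actions
```

With a neural network, `q_values` would be the network's output vector for the current state, one entry per discrete action.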
4.2 Policy Iteration: The Foundation of Value-Based RL
Value-based methods are built on the framework of policy iteration, an iterative algorithm that alternates between two steps:
- Policy Evaluation: Given a fixed policy π, compute the Q-function Q^π(s, a), which estimates the expected future reward of following π after taking action a in state s
- Policy Improvement: Update the policy to be greedy with respect to the current Q-function: π'(s) = argmax_a Q^π(s, a)
Repeating these two steps moves the policy toward the optimal policy π*.
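The two steps can be sketched on a toy problem. The example below assumes a hypothetical deterministic MDP with two states and two actions (the tables `NEXT_STATE` and `REWARD` are invented for illustration). Evaluation repeatedly applies the Bellman equation for a fixed policy, Q^π(s, a) = r(s, a) + γ * Q^π(s', π(s')), and improvement takes the argmax.

```python
GAMMA = 0.9
# Hypothetical deterministic MDP: NEXT_STATE[s][a] and REWARD[s][a].
NEXT_STATE = [[0, 1], [0, 1]]
REWARD = [[0.0, 1.0], [0.0, 2.0]]
NUM_STATES, NUM_ACTIONS = 2, 2


def evaluate_policy(policy, sweeps=500):
    """Policy evaluation: iterate the Bellman equation for a fixed policy."""
    q = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]
    for _ in range(sweeps):
        q = [
            [
                REWARD[s][a]
                + GAMMA * q[NEXT_STATE[s][a]][policy[NEXT_STATE[s][a]]]
                for a in range(NUM_ACTIONS)
            ]
            for s in range(NUM_STATES)
        ]
    return q


def improve_policy(q):
    """Policy improvement: act greedily with respect to the current Q-function."""
    return [max(range(NUM_ACTIONS), key=lambda a: q[s][a]) for s in range(NUM_STATES)]


def policy_iteration():
    policy = [0] * NUM_STATES  # arbitrary starting policy
    while True:
        q = evaluate_policy(policy)
        new_policy = improve_policy(q)
        if new_policy == policy:  # greedy policy stopped changing
            return policy, q
        policy = new_policy


final_policy, final_q = policy_iteration()
```

In this toy MDP the loop stops after two rounds, once the greedy policy no longer changes. A fixed number of evaluation sweeps is a simplification; an implementation could instead stop when the Q-values change by less than a tolerance.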
4.3 Q-Learning: Learning the Optimal Q-Function Directly
Q-learning skips the explicit policy representation and learns the optimal Q-function Q*(s, a) directly. It is based on the Bellman optimality equation, which defines the recursive relationship between optimal Q-values:
Q*(s, a) = r(s, a) + γ * E_{s' ~ P(s' | s, a)} [ max_{a'} Q*(s', a') ]
This equation states that the optimal Q-value for a state-action pair equals the immediate reward plus the discounted expected maximum Q-value of the next state.
The Q-learning update rule follows directly from this equation:
Q(s, a) ← Q(s, a) + α [ r + γ * max_{a'} Q(s', a') - Q(s, a) ]
where α is the learning rate.
A critical property of Q-learning is that it is fully off-policy. It can learn the optimal Q-function using data collected from any exploration policy, not just the current greedy policy. This makes it extremely data-efficient when combined with a replay buffer.
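The update rule can be written as a few lines of tabular code. This is a minimal sketch: the Q-table is a plain list of lists indexed as `q[s][a]`, and the `done` flag (which drops the bootstrap term at episode end) is an assumption added here, not part of the formula above.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Bootstrap from the best action in the next state (zero if the episode ended).
    bootstrap = 0.0 if done else max(q[s_next])
    td_target = r + gamma * bootstrap
    # Move Q(s, a) a fraction alpha of the way toward the target.
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]


# Hypothetical two-state, two-action table.
q_table = [[0.0, 0.0], [1.0, 3.0]]
new_value = q_learning_update(q_table, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

Here the target is 1.0 + 0.9 * 3.0 = 3.7, so with α = 0.5 the entry Q(0, 1) moves from 0.0 to 1.85. Note that the update never asks which policy produced the transition, which is why the method is off-policy.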
4.4 Exploration Strategies for Q-Learning
Since Q-learning uses a deterministic greedy policy for improvement, it requires a separate exploration policy to collect data that covers the state-action space. The two most common exploration strategies are:
4.4.1 Epsilon-Greedy Exploration
The simplest and most widely used strategy:
- With probability ε, take a completely random action
- With probability 1-ε, take the greedy action that maximizes the current Q-value
ε is typically annealed over time from a high initial value (e.g., 1.0) to a low final value (e.g., 0.01). This encourages extensive exploration early in training and gradual exploitation of learned knowledge later.
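A minimal sketch of epsilon-greedy action selection with annealing follows. The linear schedule, the `decay_steps` value, and the function names are assumptions for illustration; the notes only say that ε is annealed from a high value to a low one.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.01, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Example: choose an action halfway through the annealing schedule.
eps = epsilon_at(step=5_000)
action = epsilon_greedy([0.1, 0.9, 0.3], eps)
```

Passing a seeded `random.Random` instance as `rng` makes the behavior reproducible, which helps when debugging exploration.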
4.4.2 Boltzmann Exploration
A more sophisticated strategy that selects actions with probability proportional to the exponential of their Q-values, so higher-valued actions are tried more often but no action has zero probability:
π(a | s) ∝ exp(Q(s, a) / τ)
where τ is a temperature parameter that controls the level of exploration. High temperature leads to uniform random action selection; low temperature leads to greedy action selection.
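The effect of the temperature can be checked numerically. The sketch below computes the Boltzmann (softmax) action probabilities for a list of Q-values; subtracting the maximum before exponentiating is a standard numerical-stability step added here, and the function name is an assumption.

```python
import math


def boltzmann_probs(q_values, temperature):
    """Return action probabilities proportional to exp(Q / temperature)."""
    scaled = [q / temperature for q in q_values]
    largest = max(scaled)
    # Subtracting the max leaves the probabilities unchanged but avoids overflow.
    weights = [math.exp(x - largest) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 2.0, 3.0]
hot = boltzmann_probs(q_values, temperature=1000.0)  # close to uniform
cold = boltzmann_probs(q_values, temperature=0.01)   # close to greedy
```

An action can then be drawn with `random.choices(range(len(q_values)), weights=probs)`.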
4.5 Deep Q-Networks (DQN): Stabilizing Q-Learning with Neural Networks
While Q-learning converges reliably in small tabular environments, it becomes unstable when combined with neural networks for large state spaces. There are three main sources of instability:
- Moving targets: The target value r + γ * max_{a'} Q(s', a') depends on the same Q-network being updated, causing the target to change with every gradient step
- Correlated data: Sequential transitions from trajectories are highly correlated, violating the i.i.d. assumption that stochastic gradient descent relies on
- Q-value overestimation: Using the same network to both select and evaluate actions leads to systematic overestimation of Q-values, because the max operator tends to pick actions whose estimates are inflated by noise
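The overestimation effect can be seen in a small simulation. The sketch assumes every action has a true value of zero and that the estimates carry independent Gaussian noise (both are simplifying assumptions for illustration). Even though each estimate is unbiased, the maximum over them is biased upward.

```python
import random


def mean_max_of_noisy_estimates(num_actions=5, noise_std=1.0, trials=10_000, seed=0):
    """Average of max_a (noisy estimate of Q(s, a)) when every true Q-value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(noisy_q)  # the max picks whichever estimate was inflated most
    return total / trials


bias_five_actions = mean_max_of_noisy_estimates(num_actions=5)
bias_one_action = mean_max_of_noisy_estimates(num_actions=1)
```

With a single action there is nothing to maximize over and the average stays near zero; with five actions it is clearly positive, even though the true maximum is zero.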
4.5.1 Experience Replay Buffer
Store all collected transitions (s, a, r, s') in a large replay buffer. During training, sample mini-batches of transitions uniformly from the buffer instead of using sequential data. This de-correlates the training data and stabilizes learning.
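A replay buffer can be sketched in a few lines. The fixed `capacity` (oldest transitions are dropped once the buffer is full) is an assumption commonly made in practice, and the class and method names are illustrative.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions with uniform sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up temporal correlation.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(s=step, a=0, r=0.0, s_next=step + 1)
batch = buffer.sample(batch_size=2)
```

Because Q-learning is off-policy, transitions collected by older versions of the policy remain valid training data, which is what makes reuse from the buffer possible.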
4.5.2 Target Network
Maintain two separate Q-networks:
- Online network: Updated every gradient step to learn the Q-function
- Target network: A frozen copy of the online network that is only updated periodically (e.g., every 1000 gradient steps)
The training target is computed with the target network:
y = r + γ * max_{a'} Q_target(s', a')
This fixes the target values for multiple gradient steps, turning the moving target problem into a standard supervised learning problem.
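The mechanics of a periodically synced target can be sketched with Q-tables standing in for the two networks. This is an illustration under that assumption; with real networks the sync step would copy parameters instead of a table, and the names `td_target` and `maybe_sync` are hypothetical.

```python
import copy


def td_target(q_target, r, s_next, gamma=0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') using the frozen copy."""
    return r + gamma * max(q_target[s_next])


def maybe_sync(q_online, q_target, step, sync_every=1000):
    """Return a fresh copy of the online table every sync_every steps."""
    if step % sync_every == 0:
        return copy.deepcopy(q_online)
    return q_target


q_online = [[0.0, 0.0], [1.0, 2.0]]
q_target = copy.deepcopy(q_online)

q_online[1][1] = 5.0  # pretend gradient steps changed the online estimate
y_before_sync = td_target(q_target, r=1.0, s_next=1, gamma=0.9)  # still uses 2.0

q_target = maybe_sync(q_online, q_target, step=1000)
y_after_sync = td_target(q_target, r=1.0, s_next=1, gamma=0.9)   # now uses 5.0
```

Between syncs the target stays at 1.0 + 0.9 * 2.0 = 2.8 no matter how the online table changes; after the sync it becomes 1.0 + 0.9 * 5.0 = 5.5.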
4.5.3 Double DQN
Address Q-value overestimation by decoupling action selection from action evaluation:
- Use the online network to select the best action in the next state: a'* = argmax_a Q_online(s', a)
- Use the target network to evaluate the value of that action: y = r + γ * Q_target(s', a'*)
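The difference between the standard and Double DQN targets can be made concrete with Q-tables standing in for the two networks. The tables and their values below are invented for illustration.

```python
def dqn_target(q_target, r, s_next, gamma):
    """Standard target: the target network both selects and evaluates the action."""
    return r + gamma * max(q_target[s_next])


def double_dqn_target(q_online, q_target, r, s_next, gamma):
    """Double DQN target: online network selects, target network evaluates."""
    online_values = q_online[s_next]
    best_action = max(range(len(online_values)), key=lambda a: online_values[a])
    return r + gamma * q_target[s_next][best_action]


# Hypothetical estimates for a single next state with three actions.
q_online = [[1.0, 3.0, 2.0]]  # online network prefers action 1
q_target = [[2.5, 1.0, 4.0]]  # target network's largest estimate is action 2

y_standard = dqn_target(q_target, r=0.0, s_next=0, gamma=0.9)
y_double = double_dqn_target(q_online, q_target, r=0.0, s_next=0, gamma=0.9)
```

Here the standard target is 0.9 * 4.0 = 3.6, while the Double DQN target is 0.9 * 1.0 = 0.9. For the same target network the Double DQN target can never exceed the standard one, since it evaluates one particular action instead of taking the maximum.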
4.5.4 N-Step Returns
Balance bias and variance in the Q-learning target by using a combination of observed rewards and bootstrapped Q-values:
y = Σ_{k=0}^{n-1} γ^k r_{t+k} + γ^n * max_{a'} Q_target(s_{t+n}, a')
N-step returns use more observed reward information, reducing bias early in training when the Q-network is inaccurate, at the cost of higher variance from the sampled rewards. This is technically incorrect for off-policy learning, because the intermediate rewards were generated by the behavior policy rather than the current greedy policy, but it almost always improves performance in practice.
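The n-step target is straightforward to compute once the n observed rewards and the target network's Q-values at s_{t+n} are available. The sketch below assumes both are passed in as plain lists; the function and argument names are illustrative.

```python
def n_step_target(rewards, q_target_next, gamma):
    """Compute y = sum_{k<n} gamma^k * r_{t+k} + gamma^n * max_a' Q_target(s_{t+n}, a').

    rewards: the n observed rewards r_t, ..., r_{t+n-1}
    q_target_next: target-network Q-values for every action in s_{t+n}
    """
    n = len(rewards)
    observed = sum((gamma ** k) * r for k, r in enumerate(rewards))
    bootstrapped = (gamma ** n) * max(q_target_next)
    return observed + bootstrapped


y_three_step = n_step_target(rewards=[1.0, 0.0, 2.0], q_target_next=[0.5, 4.0], gamma=0.5)
y_one_step = n_step_target(rewards=[1.0], q_target_next=[0.5, 4.0], gamma=0.5)
```

With γ = 0.5 the three-step target is 1.0 + 0.0 + 0.25 * 2.0 + 0.125 * 4.0 = 2.0. With a single reward the function reduces to the ordinary one-step target, here 1.0 + 0.5 * 4.0 = 3.0.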
4.6 Practical Algorithm Selection Guide
The lecture concludes with a practical comparison of the three main deep RL algorithms covered:

| Algorithm | Best For | Key Strengths | Key Weaknesses |
|---|---|---|---|
| PPO | Simulation environments, large language models | Extremely stable, minimal hyperparameter tuning | Very data inefficient |
| SAC | Real-world robotics, expensive data | Extremely data efficient | Moderately difficult to tune |
| DQN | Discrete action spaces, game playing | Fast inference, simple architecture | Less stable than PPO, limited to discrete or low-dimensional action spaces (the max over actions must be tractable) |
These are my structured study notes and interpretations, compiled while watching the open course. I hope this framework helps you master the material. I wish you continued progress and success in your studies.


