Deep Reinforcement Learning: Lecture 6 Value-Based RL & Deep Q-Networks (DQN) Structured Notes & In-Depth Analysis

One. Course Details

This is the sixth lecture of Stanford University's CS224R Deep Reinforcement Learning course, taught by Professor Chelsea Finn. It marks a shift from policy-based and actor-critic methods to value-based reinforcement learning, the paradigm that produced the first widely successful deep RL algorithm, Deep Q-Networks (DQN).

The lecture covers the mathematical foundations of Q-learning, including the Bellman equation and Bellman optimality equation, the policy iteration framework, and the core Q-learning update rule. It then addresses the fundamental instability issues that arise when combining Q-learning with neural networks and presents three critical innovations that make DQN work reliably in practice: target networks, Double DQN, and n-step returns. The lecture concludes with practical guidance on when to use value-based methods versus policy-based methods like PPO and SAC.

Value-based methods eliminate the need for an explicit policy network, instead deriving the optimal policy directly from the learned Q-function. This makes them particularly efficient for discrete action spaces and has led to breakthrough applications in game playing and robotic manipulation.

Two. Key Learning Objectives

By the end of this lecture, students should be able to:

Define and distinguish between the Bellman equation and Bellman optimality equation
Explain the two-step policy iteration process (policy evaluation and policy improvement)
Derive the Q-learning update rule and explain why Q-learning is an off-policy algorithm
Identify the three main sources of instability in vanilla deep Q-learning
Implement the three core DQN improvements: target networks, Double DQN, and n-step returns
Design effective exploration strategies for Q-learning, including epsilon-greedy exploration
Compare the strengths, weaknesses, and optimal use cases of PPO, SAC, and DQN

Three. Memorable Course Quotes

"Deep Q-Networks was arguably the very first deep reinforcement learning method that demonstrated neural networks could learn complex tasks directly from raw pixel inputs."
"The biggest challenge with vanilla Q-learning is instability—your target is constantly moving as you update your Q-function, turning supervised learning into a moving target problem."
"Overestimation of Q-values is a fundamental flaw in vanilla Q-learning, caused by using the same network to both select and evaluate actions."
"Exploration is not optional in Q-learning—you need to try bad actions to learn which ones are actually good."
"Value-based methods eliminate the need for an explicit policy network—your policy is just the argmax of your Q-function."
"Don't be alarmed if your DQN loss goes up during training—this is normal and often means your policy is improving and discovering higher-value states."

Four. Detailed Study Notes

4.1 Recap: Policy-Based vs. Value-Based RL

All previous algorithms covered (policy gradients, actor-critic, PPO, SAC) are policy-based methods that explicitly learn a parameterized policy network π_θ(a | s). In contrast, value-based methods take a fundamentally different approach:

They learn a Q-function Q(s, a) that estimates the expected future reward of taking action a in state s and then following the optimal policy thereafter
The optimal policy is derived implicitly as the action that maximizes the Q-value at each state: π*(s) = argmax_a Q*(s, a)
No separate policy network is required, which simplifies the architecture and reduces computational overhead

This approach is most natural for discrete action spaces, where the argmax operation is trivial to compute. For continuous action spaces, additional optimization steps are required to find the action that maximizes the Q-function.
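As a minimal sketch of how the implicit policy falls out of the Q-function in the discrete case, the snippet below takes the argmax over a small Q-table. The table values and the helper name greedy_policy are made up for illustration and do not come from the lecture.

```python
import numpy as np

# Hypothetical Q-table: 3 states (rows) x 2 actions (columns); values are made up.
Q = np.array([[1.0, 2.5],
              [0.3, 0.1],
              [4.0, 4.2]])

def greedy_policy(Q):
    """Return the greedy action for every state: argmax over the action axis."""
    return np.argmax(Q, axis=1)

policy = greedy_policy(Q)
print(policy)  # [1 0 1]
```

With a neural network the table is replaced by a forward pass that outputs one Q-value per action, but the argmax step stays the same.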

4.2 Policy Iteration: The Foundation of Value-Based RL

Value-based methods are built on the framework of policy iteration, an iterative algorithm that alternates between two steps:

Policy Evaluation: Given a fixed policy π, compute the Q-function Q^π(s, a) that estimates the expected future reward of following π after taking action a in state s
Policy Improvement: Update the policy to be greedy with respect to the current Q-function: π'(s) = argmax_a Q^π(s, a)

A key theoretical result, the policy improvement theorem, guarantees that each policy improvement step produces a policy that is at least as good as the previous one. In a finite MDP, the process repeats until the policy stops changing, at which point it is the optimal policy π*.
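The two-step loop can be sketched on a tiny tabular MDP. Everything concrete here is an assumption for illustration: the two-state MDP, the array layout P[s, a, s'] and R[s, a], and the function names. Policy evaluation is done exactly by solving a linear system, which is only practical for small state spaces.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Policy evaluation: compute Q^pi exactly for a deterministic policy."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]          # (S, S) transitions under the policy
    R_pi = R[idx, policy]          # (S,) rewards under the policy
    # Solve V = R_pi + gamma * P_pi V for V^pi.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    return R + gamma * P @ V       # (S, A) array of Q^pi(s, a)

def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(R.shape[0], dtype=int)
    while True:
        Q = evaluate_policy(P, R, policy, gamma)
        new_policy = np.argmax(Q, axis=1)   # policy improvement
        if np.array_equal(new_policy, policy):
            return policy, Q
        policy = new_policy

# Hypothetical 2-state, 2-action deterministic MDP.
# State 0: action 0 stays, action 1 moves to state 1 (both reward 0).
# State 1: action 0 stays with reward 1, action 1 moves to state 0 with reward 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])

policy, Q = policy_iteration(P, R, gamma=0.9)
print(policy)  # [1 0]
```

Starting from the policy that always picks action 0, a single improvement step already reaches the optimal policy in this example: move to state 1, then stay there.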

4.3 Q-Learning: Learning the Optimal Q-Function Directly

Q-learning skips the explicit policy representation and learns the optimal Q-function Q*(s, a) directly. It is based on the Bellman optimality equation, which defines the recursive relationship between optimal Q-values:

Q*(s, a) = r(s, a) + γ * E_{s' ~ P(s' | s, a)} [ max_{a'} Q*(s', a') ]

This equation states that the optimal Q-value for a state-action pair equals the immediate reward plus the discounted expected maximum Q-value of the next state.

The Q-learning update rule follows directly from this equation, using a single sampled next state s' in place of the expectation:

Q(s, a) ← Q(s, a) + α [ r + γ * max_{a'} Q(s', a') - Q(s, a) ]

where α is the learning rate.

A critical property of Q-learning is that it is fully off-policy. It can learn the optimal Q-function using data collected from any exploration policy, not just the current greedy policy. This makes it extremely data-efficient when combined with a replay buffer.
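A minimal tabular sketch of the update rule is shown below. The done flag, which drops the bootstrap term at terminal states, is an assumption added for illustration, as are the function name and the example numbers.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update to the table Q in place."""
    # Assumption: no bootstrapping from the next state once the episode has ended.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical 2-state, 2-action table.
Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]

q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, done=False, alpha=0.5, gamma=0.9)
# target = 1.0 + 0.9 * 3.0 = 3.7, so Q[0, 1] moves halfway from 0 to 3.7.
print(Q[0, 1])
```

Note that the update never asks which policy produced the transition (s, a, r, s'), which is what makes the rule off-policy.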

4.4 Exploration Strategies for Q-Learning

Since Q-learning uses a deterministic greedy policy for improvement, it requires a separate exploration policy to collect data that covers the state-action space. The two most common exploration strategies are:

4.4.1 Epsilon-Greedy Exploration

The simplest and most widely used strategy:

With probability ε, take a completely random action
With probability 1-ε, take the greedy action that maximizes the current Q-value

In practice, ε is typically annealed over time from a high initial value (e.g., 1.0) to a low final value (e.g., 0.01). This encourages extensive exploration early in training and gradual exploitation of learned knowledge later.
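One way this could look in code is sketched below. The linear schedule and the decay_steps parameter are assumptions; the notes only say that ε is annealed from a high value to a low one, and other schedules (for example exponential decay) are equally plausible.

```python
import numpy as np

def epsilon_at(step, eps_start=1.0, eps_end=0.01, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it fixed."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.0, 5.0, 1.0])   # made-up Q-values for one state
action = epsilon_greedy(q_values, epsilon_at(step=5_000), rng)
```

At step 5,000 this schedule gives ε of about 0.505, so roughly half of the actions are still random.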

4.4.2 Boltzmann Exploration

A more sophisticated strategy that selects actions with probability proportional to their exponentiated Q-values:

π(a | s) ∝ exp(Q(s, a) / τ)

where τ is a temperature parameter that controls the level of exploration. High temperature leads to nearly uniform random action selection; low temperature leads to nearly greedy action selection.
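A small sketch of Boltzmann action probabilities follows. Subtracting the maximum Q-value before exponentiating is a standard numerical-stability step that does not change the resulting distribution; the function name and the example Q-values are made up.

```python
import numpy as np

def boltzmann_probs(q_values, tau):
    """Softmax over Q-values with temperature tau."""
    # Shift by the max so exp() cannot overflow; the probabilities are unchanged.
    z = (q_values - np.max(q_values)) / tau
    p = np.exp(z)
    return p / p.sum()

q_values = np.array([1.0, 2.0, 3.0])
hot = boltzmann_probs(q_values, tau=1000.0)   # close to uniform
cold = boltzmann_probs(q_values, tau=0.01)    # close to one-hot on the best action

rng = np.random.default_rng(0)
action = int(rng.choice(len(q_values), p=boltzmann_probs(q_values, tau=1.0)))
```

Unlike epsilon-greedy, this scheme explores the second-best action more often than clearly bad ones, because the sampling probabilities depend on the Q-values themselves.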

4.5 Deep Q-Networks (DQN): Stabilizing Q-Learning with Neural Networks

While Q-learning works reliably in small tabular environments, it often becomes unstable when combined with neural networks for large state spaces. There are three main sources of instability:

Moving targets: The target value r + γ * max_{a'} Q(s', a') depends on the same Q-network being updated, causing the target to change with every gradient step
Correlated data: Sequential transitions from trajectories are highly correlated, violating the i.i.d. assumption required for stable gradient descent
Q-value overestimation: Using the same network to both select and evaluate actions leads to systematic overestimation of Q-values, because taking a max over noisy estimates tends to pick whichever action's noise happens to be largest
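The overestimation effect can be illustrated numerically. In the sketch below, every action has a true value of zero, and the estimates are corrupted by independent unit Gaussian noise; the number of actions, the noise model, and the sample count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_q = np.zeros(4)                            # every action is truly worth 0
noise = rng.normal(0.0, 1.0, size=(10_000, 4))  # hypothetical estimation noise
noisy_q = true_q + noise                        # 10,000 noisy Q-estimates

max_of_truth = true_q.max()                     # the correct value of max_a Q(s, a)
max_of_estimates = noisy_q.max(axis=1).mean()   # average of max over noisy estimates

print(max_of_truth, max_of_estimates)
```

Even though each individual estimate is unbiased, the average of the max comes out well above zero, which is the bias that the max in the Q-learning target introduces.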

DQN and its common extensions address these issues with four techniques:

4.5.1 Experience Replay Buffer

Store all collected transitions (s, a, r, s') in a large replay buffer. During training, sample mini-batches of transitions uniformly from the buffer instead of using sequential data. This de-correlates the training data and stabilizes learning.
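A minimal replay buffer might look like the sketch below. The fixed capacity with oldest-first eviction, the done flag stored with each transition, and the class and method names are assumptions for illustration rather than details from the lecture.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Return (states, actions, rewards, next_states, dones) as tuples."""
        batch = self.rng.sample(list(self.buffer), batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(s=i, a=0, r=1.0, s_next=i + 1, done=False)

states, actions, rewards, next_states, dones = buf.sample(2)
```

Because the batch is drawn uniformly from the whole buffer, consecutive transitions from the same trajectory rarely end up in the same mini-batch.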

4.5.2 Target Network

Maintain two separate Q-networks:

Online network: Updated every gradient step to learn the Q-function
Target network: A frozen copy of the online network that is only updated periodically (e.g., every 1000 gradient steps)

Use the target network to compute the target values:

y = r + γ * max_{a'} Q_target(s', a')

This fixes the target values for multiple gradient steps, turning the moving target problem into a standard supervised learning problem.
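The sketch below shows the two pieces in isolation: computing targets from frozen parameters and syncing them periodically. To stay self-contained it uses a linear Q-function (states times a weight matrix) instead of a neural network; that substitution, the dones mask, and all names are assumptions for illustration.

```python
import numpy as np

def q_values(W, states):
    """Hypothetical linear Q-function: one Q-value per action for each state."""
    return states @ W                      # shape (batch, n_actions)

def td_targets(W_target, rewards, next_states, dones, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states."""
    next_q = q_values(W_target, next_states).max(axis=1)
    return rewards + gamma * (1.0 - dones) * next_q

def maybe_sync(W_online, W_target, step, sync_every=1000):
    """Copy the online parameters into the target every sync_every steps."""
    if step % sync_every == 0:
        return W_online.copy()
    return W_target

W_target = np.array([[1.0, 2.0],
                     [0.0, 1.0]])
next_states = np.array([[1.0, 0.0],
                        [0.0, 1.0]])
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])

y = td_targets(W_target, rewards, next_states, dones, gamma=0.9)
print(y)  # [2.8 0. ]
```

Between syncs the targets y depend only on W_target, so gradient steps on the online parameters cannot move them.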

4.5.3 Double DQN

Address Q-value overestimation by decoupling action selection from action evaluation:

Use the online network to select the best action in the next state: a'* = argmax_a Q_online(s', a)
Use the target network to evaluate the value of that action: y = r + γ * Q_target(s', a'*)

This de-correlates the noise in the action selection and evaluation steps, drastically reducing overestimation bias.
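The two steps above can be sketched directly on arrays of next-state Q-values. The inputs stand in for the outputs of the online and target networks; the numbers, the dones mask, and the function name are made up for illustration.

```python
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Select a' with the online network, evaluate it with the target network."""
    best = np.argmax(q_online_next, axis=1)                  # action selection
    evaluated = q_target_next[np.arange(len(best)), best]    # action evaluation
    return rewards + gamma * (1.0 - dones) * evaluated

# Hypothetical next-state Q-values for a batch of 2 transitions and 2 actions.
q_online_next = np.array([[1.0, 5.0],
                          [3.0, 2.0]])
q_target_next = np.array([[4.0, 2.0],
                          [1.0, 6.0]])
rewards = np.array([0.0, 1.0])
dones = np.array([0.0, 0.0])

y_double = double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.5)
y_vanilla = rewards + 0.5 * q_target_next.max(axis=1)
print(y_double)   # [1.  1.5]
print(y_vanilla)  # [2. 4.]
```

In this made-up batch the two networks disagree about the best action, so the Double DQN targets come out lower than the targets from the plain max over the target network.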

4.5.4 N-Step Returns

Balance bias and variance in the Q-learning target by using a combination of observed rewards and bootstrapped Q-values:

y = Σ_{k=0}^{n-1} γ^k r_{t+k} + γ^n * max_{a'} Q_target(s_{t+n}, a')

N-step returns use more observed reward information, reducing bias early in training when the Q-network is inaccurate, at the cost of higher variance. This is technically incorrect for off-policy learning, because the n intermediate rewards were generated by the behavior policy rather than the current greedy policy, but it almost always improves performance in practice.
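As a worked instance of the n-step target, the sketch below sums the discounted observed rewards and adds the discounted bootstrap term. The argument bootstrap_value stands in for max_{a'} Q_target(s_{t+n}, a'); the function name and the numbers are made up for illustration.

```python
import numpy as np

def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Sum of gamma^k * r_{t+k} for k < n, plus gamma^n times the bootstrap value."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma ** n * bootstrap_value)

# Three observed rewards of 1, then bootstrap from a next-state value of 8.
y = n_step_target([1.0, 1.0, 1.0], bootstrap_value=8.0, gamma=0.5)
# 1 + 0.5 + 0.25 + 0.125 * 8 = 2.75
print(y)  # 2.75
```

With n = 1 this reduces to the ordinary one-step target r + γ * max_{a'} Q_target(s', a').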

4.6 Practical Algorithm Selection Guide

The lecture concludes with a practical comparison of the three main deep RL algorithms covered:

Algorithm	Best For	Key Strengths	Key Weaknesses
PPO	Simulation environments, large language models	Extremely stable, minimal hyperparameter tuning	Very data inefficient
SAC	Real-world robotics, expensive data	Extremely data efficient	Moderately difficult to tune
DQN	Discrete action spaces, game playing	Fast inference, simple architecture	Less stable than PPO, limited to discrete or low-dimensional action spaces

General rule of thumb: Use PPO first for most problems unless data efficiency is a critical concern, in which case use SAC. Use DQN specifically for discrete action spaces where fast inference is required.

These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you master the content of this subject thoroughly. Wish you continuous academic progress and great achievements in your studies.

Video Source and Usage Instructions

Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 6: Q-Learning Stanford Online
• Course Series: Stanford CS224R Deep Reinforcement Learning
• Original Platform: YouTube
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/-7kv6jf0isQ?si=OWxZ7g-6TfzvYzfZ
