
Lecture 19: Q-Learning Fundamentals & Practical Implementation

This lecture covers Q-learning fundamentals from tabular dynamic programming to deep Q-learning, including the bias-variance tradeoff, stabilization techniques, and solutions to Q-value overestimation.

Jun 05, 2026

1. Course Details

This is the 19th lecture of Stanford University's CS224R Deep Reinforcement Learning course, delivered by teaching assistant Anikait. This lecture provides a comprehensive, ground-up introduction to Q-learning, starting from foundational Markov Decision Process (MDP) concepts and progressing to practical implementation details for modern deep Q-learning algorithms. The lecture covers a brief review of MDP notation and core value function concepts, tabular Q-learning via dynamic programming, the transition to parametric function approximation with neural networks, the fundamental bias-variance tradeoff between Monte Carlo and Temporal Difference (TD) learning, and critical practical techniques for stabilizing deep Q-learning including replay buffers, target networks, and methods to mitigate Q-value overestimation.

2. Key Learning Objectives

By the end of this lecture, students will be able to:

Define and distinguish between value functions, Q-functions, and advantage functions in the context of MDPs
Derive the Bellman equation for optimal Q-values and implement iterative dynamic programming updates for tabular environments
Explain the fundamental bias-variance tradeoff between Monte Carlo rollouts and TD learning
Design parametric Q-function architectures for both discrete and continuous action spaces
Implement core stabilization techniques for deep Q-learning including target networks, gradient clipping, and Huber loss
Describe the Q-value overestimation problem caused by the max operation and explain how double Q-learning and critic ensembling solve it
Design and use replay buffers to break temporal correlations and improve sample efficiency in Q-learning

3. Memorable Quotes

"A Q value function allows you to trade-off that variance for bias and is another way to deal with specific environment configurations."
"The main distinction between Monte Carlo and TD is that Monte Carlo has no bias but high variance, while TD introduces bias from the value function but significantly reduces variance."
"One of the most powerful properties of dynamic programming is stitching: you can combine partial trajectories to find the optimal path even if you never saw the full trajectory during training."
"Even zero-mean noise in Q-value estimates becomes positively biased after the max operation, leading to systematic overestimation that compounds through iterative updates."
"Replay buffers break temporal correlations in your data and prevent recency bias, allowing the agent to reuse past experience and dramatically improve sample efficiency."

4. Detailed Lecture Notes

4.1 MDP and Value Function Foundations

All reinforcement learning problems in this course are formalized as Markov Decision Processes, defined by five core components:

A set of states S
A set of actions A
Transition dynamics \(P(s' | s, a)\) that define the probability of moving to state \(s'\) after taking action a in state s
A reward function \(R(s, a)\) that provides immediate feedback
A discount factor \(\gamma \in [0, 1]\) that prioritizes immediate rewards over distant future rewards

Core Value Functions

Three interrelated functions form the foundation of value-based RL:

Value function \(V^\pi(s)\): The expected total discounted reward starting from state s and following policy \(\pi\)
Q-function \(Q^\pi(s, a)\): The expected total discounted reward starting from state s, taking action a, and then following policy \(\pi\)
Advantage function \(A^\pi(s, a)\): The relative advantage of taking action a compared to the average action under policy \(\pi\), defined as \(A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\)

Key relationships:

\(V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)} [Q^\pi(s, a)]\)
For the optimal policy \(\pi^*\), the advantage of the optimal action is zero, and all other actions have non-positive advantage
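
As a small worked check of these relationships, the sketch below computes \(V^\pi(s)\) and \(A^\pi(s, a)\) for a single state. The Q-values and policy probabilities are made-up illustrative numbers, not values from the lecture.

```python
import numpy as np

# Hypothetical Q^pi(s, a) for one state with three actions (illustrative only).
q_values = np.array([1.0, 2.0, 4.0])
# Hypothetical stochastic policy pi(a | s) over the same three actions.
policy_probs = np.array([0.25, 0.25, 0.5])

# V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
state_value = float(np.dot(policy_probs, q_values))

# A^pi(s, a) = Q^pi(s, a) - V^pi(s)
advantages = q_values - state_value

# Averaged under the policy, the advantage is zero by construction.
mean_advantage = float(np.dot(policy_probs, advantages))

print(state_value, advantages, mean_advantage)
```

Because \(V^\pi\) is the policy-weighted average of \(Q^\pi\), the policy-weighted average of the advantages always comes out to zero, which is what makes the advantage a "relative" quantity.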

4.2 Tabular Q-Learning with Dynamic Programming

In tabular environments with discrete state and action spaces, we can represent the Q-function as a table with one entry for each state-action pair.

Bellman Optimality Equation

The optimal Q-function satisfies the Bellman optimality equation: \(Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot|s, a)} \left[ R(s, a) + \gamma \max_{a'} Q^*(s', a') \right]\)

This equation forms the basis for iterative dynamic programming updates: \(Q_{k+1}(s, a) = \mathbb{E}_{s' \sim P(\cdot|s, a)} \left[ R(s, a) + \gamma \max_{a'} Q_k(s', a') \right]\)

Iterative Value Propagation

The algorithm proceeds as follows:

Initialize \(Q_0(s, a) = 0\) for all state-action pairs
For terminal states, set \(Q(s, a) = R(s, a)\) for all actions
Iteratively apply the Bellman update to propagate value backward from terminal states to all other states
Continue until the Q-function converges to within a small threshold

Once the optimal Q-function is learned, the optimal policy is simply the greedy policy: \(\pi^*(s) = \arg\max_a Q^*(s, a)\)
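
The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration on a hypothetical three-state chain, not the lecture's own code: the function name `q_value_iteration`, the array layout of `P` and `R`, and the toy MDP are all assumptions. The terminal state is modeled as an absorbing state with zero reward, which is one common way to handle termination.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular Q-iteration. P has shape (S, A, S), R has shape (S, A)."""
    num_states, num_actions, _ = P.shape
    Q = np.zeros((num_states, num_actions))  # Q_0(s, a) = 0
    for _ in range(max_iters):
        # Bellman backup: R(s, a) + gamma * E_{s'}[max_a' Q_k(s', a')]
        Q_next = R + gamma * (P @ Q.max(axis=1))
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
    return Q

# Hypothetical chain 0 -> 1 -> 2, where state 2 is terminal (absorbing).
# Action 0 moves left (or stays at state 0), action 1 moves right.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 0] = 1.0
P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0
R = np.zeros((3, 2))
R[1, 1] = 1.0  # reward for stepping from state 1 into the terminal state

Q_star = q_value_iteration(P, R, gamma=0.9)
greedy_policy = Q_star.argmax(axis=1)  # pi*(s) = argmax_a Q*(s, a)
print(Q_star)
print(greedy_policy)
```

Value appears first at the state next to the terminal state and reaches state 0 one iteration later, which is the backward propagation described in the steps above.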

4.3 Parametric Q-Learning

Tabular methods fail to scale to large or continuous state spaces. Instead, we use parametric function approximators (typically neural networks) to represent the Q-function, which allows for generalization across similar states.

Q-Function Architectures

Two standard architectures are used depending on the action space:

Discrete action spaces: The network takes a state as input and outputs a vector of Q-values, one for each possible action. This allows exact maximization over actions.
Continuous action spaces: The network takes both a state and an action as input and outputs a single scalar Q-value. The maximum over actions can no longer be computed exactly, so it is approximated, for example by evaluating actions sampled from the policy.
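
A rough sketch of the two input/output layouts, using a tiny NumPy MLP with random weights. The layer sizes, the helper names `init_mlp` and `mlp`, and the use of uniformly sampled candidate actions (standing in for samples from a policy) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small MLP, sizes = [in, hidden, ..., out]."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Forward pass with ReLU on every layer except the last."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

state_dim, action_dim, num_actions = 4, 2, 3
state = rng.normal(size=state_dim)

# Discrete actions: state in, one Q-value per action out.
discrete_q = init_mlp([state_dim, 32, num_actions])
q_per_action = mlp(discrete_q, state)          # shape (num_actions,)
greedy_action = int(np.argmax(q_per_action))   # exact max over actions

# Continuous actions: (state, action) in, a single scalar Q-value out.
continuous_q = init_mlp([state_dim + action_dim, 32, 1])
candidates = rng.uniform(-1.0, 1.0, size=(64, action_dim))  # stand-in for policy samples
inputs = np.concatenate([np.tile(state, (64, 1)), candidates], axis=1)
q_candidates = mlp(continuous_q, inputs)[:, 0]  # shape (64,)
best_candidate = candidates[np.argmax(q_candidates)]  # only an approximate max
```

The discrete head gets the max with one forward pass and an `argmax`. The continuous head needs one evaluation per candidate action and only finds the best of the candidates it was given.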

4.4 Monte Carlo vs. Temporal Difference Learning

There are two fundamental approaches to estimating returns for training value functions, each with distinct bias-variance characteristics.

Monte Carlo Estimation

Monte Carlo methods roll out full trajectories until termination and use the actual observed return as the target: \(G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^{T-t-1} R_{T-1}\)

Advantages: Unbiased estimate of the true return
Disadvantages: Very high variance due to the long horizon of random transitions and rewards
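
The Monte Carlo return for every time step of one episode can be computed with a single backward pass over the rewards. The function name and the sample rewards below are illustrative assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] for one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backward using G_t = R_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_rewards = [1.0, 0.0, 2.0]  # hypothetical rewards R_0, R_1, R_2
mc_returns = discounted_returns(episode_rewards, gamma=0.5)
print(mc_returns)  # [1.5, 1.0, 2.0]
```

Each \(G_t\) depends on every reward that follows it, so randomness anywhere later in the episode shows up in the target. That is the source of the high variance.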

Temporal Difference (TD) Learning

TD learning bootstraps from the current value function estimate to create a one-step target: \(y_t = R_t + \gamma V(s_{t+1})\)

Advantages: Much lower variance, because the target depends on only a single sampled transition rather than a full trajectory
Disadvantages: Biased because the value function \(V(s_{t+1})\) is itself an estimate and may be inaccurate

N-Step Returns

N-step returns interpolate between TD and Monte Carlo learning by unrolling the trajectory for n steps before bootstrapping: \(G_t^{(n)} = R_t + \gamma R_{t+1} + \dots + \gamma^{n-1} R_{t+n-1} + \gamma^n V(s_{t+n})\)

Setting n = 1 recovers the one-step TD target, and letting n reach the end of the episode recovers the Monte Carlo return. Larger n reduces bias and increases variance, so n is tuned to balance the two for the specific environment.
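
A minimal sketch of the n-step target for a single recorded episode. The function name, the convention that `values` has one more entry than `rewards` (with the terminal state's value set to 0), and the sample numbers are assumptions for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): n discounted rewards, then bootstrap from V(s_{t+n}).

    rewards[k] is R_k and values[k] is the current estimate V(s_k).
    values has len(rewards) + 1 entries and values[-1] (terminal) is 0.
    """
    end = min(t + n, len(rewards))  # stop at the end of the episode
    reward_part = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return reward_part + gamma ** (end - t) * values[end]

rewards = [1.0, 0.0, 2.0]        # hypothetical R_0, R_1, R_2
values = [0.0, 2.0, 1.0, 0.0]    # hypothetical V(s_0..s_3), s_3 terminal

td_target = n_step_return(rewards, values, t=0, n=1, gamma=0.5)   # 1 + 0.5 * 2.0
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.5)    # 1 + 0 + 0.25 * 1.0
full_return = n_step_return(rewards, values, t=0, n=3, gamma=0.5) # Monte Carlo return
print(td_target, two_step, full_return)  # 2.0 1.25 1.5
```

With n = 1 the target leans entirely on the estimate `values[1]`. With n = 3 the value estimates drop out and only observed rewards remain.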

The Stitching Property of Dynamic Programming

A distinctive advantage of TD learning and dynamic programming is the ability to stitch together partial trajectories. If an agent has seen one trajectory from A to B and another from B to C, dynamic programming can recover the path A→C through the shared state B, even though no single trajectory ever went from A to C. Stitching requires the trajectories to overlap in at least one state; it cannot bridge segments that no data connects. This property significantly improves the sample efficiency of value-based methods.
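
A minimal sketch of stitching, assuming two hypothetical logged trajectories that both pass through a shared state `M`. The state and action names, the rewards, and the choice to sweep the dataset with a learning rate of 1 (valid here because the toy environment is deterministic) are all illustrative assumptions.

```python
from collections import defaultdict

# Each transition is (state, action, reward, next_state, done).
traj_1 = [("S0", "right", 0.0, "M", False), ("M", "down", 0.0, "Dead", True)]
traj_2 = [("S1", "right", 0.0, "M", False), ("M", "up", 1.0, "G", True)]
dataset = traj_1 + traj_2  # no trajectory goes from S0 to the goal G

gamma = 0.9
Q = defaultdict(float)

# Only maximize over actions that actually appear in the data for each state.
actions_seen = defaultdict(set)
for s, a, _, _, _ in dataset:
    actions_seen[s].add(a)

for _ in range(50):  # repeated sweeps of Q-learning backups over the dataset
    for s, a, r, s_next, done in dataset:
        if done:
            target = r
        else:
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions_seen[s_next])
        Q[(s, a)] = target

best_at_M = max(actions_seen["M"], key=lambda a: Q[("M", a)])
print(Q[("S0", "right")], best_at_M)  # 0.9 up
```

The only trajectory starting at `S0` ended with zero return, so a Monte Carlo estimate from `S0` would be 0. The bootstrapped backup instead credits `S0` with the better continuation observed in the other trajectory.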

4.5 Practical Techniques for Stabilizing Deep Q-Learning

Training neural networks with TD targets is inherently unstable. Several key techniques have been developed to address this instability.

Target Networks and Semi-Gradient Updates

The standard TD update uses the same network to predict both the current Q-value and the target Q-value, leading to a moving target problem. The solution is to use two separate networks:

Online network: The network being actively trained, used to predict \(Q(s_t, a_t)\)
Target network: A delayed copy of the online network with weights \(\theta'\), used to compute the bootstrapped target \(r_t + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, a')\)

Two update strategies are common:

Hard update: Copy the online network weights to the target network every N steps
Soft update (Polyak averaging): Slowly interpolate the target network weights toward the online network weights at every step: \(\theta' \leftarrow \tau \theta + (1-\tau) \theta'\), typically with \(\tau \approx 0.001\)

Additionally, gradients are only propagated through the online network, not the target network (semi-gradient update), which further stabilizes training.
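
Both update strategies are short to write down. The sketch below stores parameters in plain dictionaries of NumPy arrays; the function names and the parameter layout are assumptions, and deep learning frameworks provide their own ways to copy or blend weights.

```python
import numpy as np

def hard_update(online, target):
    """Copy every online parameter into the target network (every N steps)."""
    for name in online:
        target[name] = online[name].copy()

def soft_update(online, target, tau=0.001):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    for name in online:
        target[name] = tau * online[name] + (1.0 - tau) * target[name]

online_params = {"w": np.ones(3)}    # hypothetical online weights
target_params = {"w": np.zeros(3)}   # hypothetical target weights

soft_update(online_params, target_params, tau=0.1)
after_soft = target_params["w"].copy()   # moved 10% of the way toward online

hard_update(online_params, target_params)
after_hard = target_params["w"].copy()   # now an exact copy
```

In either case the target network is never updated by gradient descent. It only changes through these copies, which is what keeps the regression target from shifting at every optimizer step.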

Gradient Clipping and Huber Loss

TD updates can produce very large gradients that destabilize training. Two solutions are widely used:

Gradient clipping: Limit the maximum norm of the gradient vector to prevent catastrophic parameter updates
Huber loss: A combination of L1 and L2 loss that is quadratic (like L2) for small errors, allowing fine-grained adjustments, and linear (like L1) for large errors, which bounds the gradient magnitude and reduces sensitivity to outliers
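
A minimal NumPy sketch of both techniques. The threshold `delta=1.0`, the function names, and the sample numbers are illustrative assumptions; in practice the equivalents built into the training framework would normally be used.

```python
import numpy as np

def huber_loss(td_error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_err = np.abs(td_error)
    quadratic = 0.5 * td_error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

losses = huber_loss(np.array([0.5, 3.0]))     # small error vs. large error
clipped, norm_before = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
print(losses, clipped, norm_before)
```

For the error of 3.0 the squared loss would be 4.5, while the Huber loss is 2.5 and its gradient has magnitude `delta` instead of 3. Clipping rescales the whole gradient, so its direction is preserved.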

Replay Buffers

A replay buffer is a data structure that stores past experience tuples \((s_t, a_t, r_t, s_{t+1})\). During training, mini-batches are sampled uniformly from the replay buffer rather than using consecutive experience.

Key benefits:

Breaks temporal correlations in the training data, which is essential for training neural networks
Prevents recency bias by ensuring the agent learns from both recent and distant past experience
Dramatically improves sample efficiency by reusing experience multiple times
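
A minimal replay buffer sketch using only the standard library. The class name, the fixed capacity with oldest-first eviction, and the extra `done` flag stored alongside \((s_t, a_t, r_t, s_{t+1})\) are assumptions; the flag is a common addition so that targets at terminal transitions do not bootstrap.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)  # oldest entries drop out when full
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniform sample without replacement, returned as five tuples."""
        batch = self.rng.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):  # hypothetical transitions with integer states
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=(step == 4))

states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
```

With a capacity of 3, only the last three transitions (states 2, 3 and 4) remain available after five insertions. Real buffers are usually far larger so that older experience stays in the sampling pool.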

4.6 Q-Value Overestimation and Mitigation

A fundamental problem with Q-learning is systematic overestimation of Q-values, caused by the max operation in the Bellman update.

The Overestimation Mechanism

Even if the Q-function approximation has zero-mean noise, the max operation will select the action with the highest positive noise. This creates a positive bias that compounds through iterative updates, leading to increasingly overestimated Q-values. Overestimation causes the agent to prefer overoptimistic actions and can lead to unstable training and poor final performance.
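
The effect is easy to reproduce numerically. In this sketch every action has a true value of 0 and the estimates carry independent zero-mean Gaussian noise; the number of actions, the noise scale, and the number of trials are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, num_trials = 10, 100_000

true_q = np.zeros(num_actions)  # every action is truly worth 0
noisy_q = true_q + rng.normal(0.0, 1.0, size=(num_trials, num_actions))

mean_noise = float(noisy_q.mean())               # close to 0: the noise is unbiased
mean_of_max = float(noisy_q.max(axis=1).mean())  # well above max(true_q) = 0

print(mean_noise, mean_of_max)
```

The individual estimates average out to roughly zero, yet the average of the per-trial maximum is clearly positive (on the order of 1.5 noise standard deviations for ten actions). That gap is the bias that gets fed back into the next round of targets.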

Solutions

Double Q-learning: Use two separate Q-networks. One network selects the action, and the other network evaluates the value of that action. This decorrelates the selection and evaluation steps, so noise that inflates one network's estimate for an action is unlikely to inflate the other's, which substantially reduces the overestimation bias.
Critic ensembling: Train an ensemble of Q-networks and use the minimum value across the ensemble as the target. This produces a conservative underestimate of the true Q-value, which is particularly effective in offline RL settings where distribution shift is a major concern.

For standard online Q-learning, two Q-networks (as in double Q-learning) are typically sufficient to remove most of the overestimation.
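
The two target computations can be sketched as follows. Using the online network to select and the target network to evaluate is one common pairing and is an assumption here, as are the function names, the `done` mask, and the choice to take each ensemble member's max before taking the minimum across members.

```python
import numpy as np

def double_q_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """q_*_next: (batch, num_actions) Q-values at s_{t+1} from each network."""
    best_action = np.argmax(q_online_next, axis=1)                   # selection
    evaluated = q_target_next[np.arange(len(best_action)), best_action]  # evaluation
    return reward + gamma * (1.0 - done) * evaluated

def ensemble_min_target(reward, done, q_next_ensemble, gamma=0.99):
    """q_next_ensemble: (ensemble, batch, num_actions) Q-values at s_{t+1}."""
    per_member_max = q_next_ensemble.max(axis=2)   # (ensemble, batch)
    conservative = per_member_max.min(axis=0)      # most pessimistic member
    return reward + gamma * (1.0 - done) * conservative

reward = np.array([1.0])
done = np.array([0.0])
q_online_next = np.array([[1.0, 3.0]])   # hypothetical: online net prefers action 1
q_target_next = np.array([[2.0, 0.5]])   # hypothetical: target net rates action 1 low

naive = reward + 0.5 * q_target_next.max(axis=1)                      # plain max target
double = double_q_target(reward, done, q_online_next, q_target_next, gamma=0.5)
ensemble = ensemble_min_target(
    reward, done, np.stack([q_online_next, q_target_next]), gamma=0.5)
print(naive, double, ensemble)  # [2.] [1.25] [2.]
```

In this example the double Q target is lower than the plain max target because the action chosen by one network is scored by the other, which does not share the same optimistic error.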

These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you thoroughly master the content of this subject. Wish you continuous academic progress and great achievements in your studies.
