Deep Reinforcement Learning: Lecture 4 Actor-Critic Methods Structured Notes & In-Depth Analysis

One. Course Details

This is the fourth lecture of Stanford University's CS224R Deep Reinforcement Learning course, taught by Professor Chelsea Finn. It builds directly on the policy gradient methods introduced in Lecture 3 and presents actor-critic algorithms, one of the most widely used families of deep RL algorithms in practice and the foundation for methods such as PPO, which is used to train large language models and humanoid robots.
The lecture covers the mathematical definition and intuitive interpretation of value functions, Q-functions, and advantage functions, three core techniques for policy evaluation (Monte Carlo estimation, temporal difference learning, and n-step returns), and the complete end-to-end actor-critic algorithm pipeline. A key focus is on how actor-critic methods address the high variance and poor data efficiency limitations of vanilla policy gradients by learning an explicit estimate of future rewards.
The core goal of this lecture is to equip students to understand the theory behind actor-critic methods, implement them correctly, and make informed trade-offs between bias and variance when estimating value functions for real-world tasks.

Two. Key Learning Objectives

By the end of this lecture, students should be able to:

Define and distinguish between value functions (V), Q-functions (Q), and advantage functions (A)
Explain the mathematical relationship between V, Q, and A
Describe three methods for policy evaluation and compare their bias-variance trade-offs
Derive the actor-critic policy gradient using the advantage function
Implement the complete actor-critic algorithm with separate actor and critic networks
Explain how bootstrapping works in temporal difference learning and its benefits
Apply discount factors and n-step returns to improve value function estimation

Three. Memorable Course Quotes

"Policy gradients don't always make efficient use of the data we've collected. If we can figure out what actions are advantageous versus not, we can make much better use of that data."
"Actor-critic methods learn to estimate what's good and bad, then do more of the good stuff according to your estimate."
"Monte Carlo estimation has high variance but is completely unbiased. Bootstrapping has much lower variance but can be biased by the quality of your current value estimate."
"The advantage function tells you exactly how much better taking a particular action is than following your average policy in that state."
"Actor-critic algorithms have two separate neural networks: the actor that decides what to do, and the critic that tells the actor how good its decisions were."

Four. Detailed Study Notes

4.1 Recap: Limitations of Vanilla Policy Gradients

Before introducing actor-critic methods, Professor Finn recaps the key limitations of the policy gradient algorithms covered in the previous lecture:

High variance: Gradient estimates are based on single trajectory returns, leading to noisy updates
Poor data efficiency: On-policy nature requires discarding all data after one gradient step
Inefficient credit assignment: All actions in a trajectory are weighted by the total return, even if some actions were good and others were bad

For example, if a robot takes a small step forward but then falls backward, vanilla policy gradients will penalize the forward step because the total trajectory return is negative, even though the step itself was progress toward the goal. Actor-critic methods solve this problem by learning an explicit estimate of how good each state and action is.

4.2 Core Value Function Concepts

Actor-critic methods rely on three fundamental quantities to evaluate policies:

4.2.1 Value Function (V^π(s))

The value function of a state under policy π is the expected sum of future rewards the agent will receive if it is in that state at time t and follows policy π for all subsequent steps: V^π(s_t) = E_{τ ~ π} [ Σ_{t'=t}^{T} r(s_{t'}, a_{t'}) | s_t ]
Intuitively, it answers the question: "How good is it to be in this state right now if I follow my current policy?"

4.2.2 Q-Function (Q^π(s, a))

The Q-function (or action-value function) of a state-action pair under policy π is the expected sum of future rewards the agent will receive if it is in state s_t at time t, takes action a_t, and then follows policy π for all subsequent steps: Q^π(s_t, a_t) = E_{τ ~ π} [ Σ_{t'=t}^{T} r(s_{t'}, a_{t'}) | s_t, a_t ]
It answers the question: "How good is it to take this specific action in this state right now?"

4.2.3 Advantage Function (A^π(s, a))

The advantage function measures how much better taking action a in state s is than acting according to policy π in that state: A^π(s, a) = Q^π(s, a) - V^π(s)
This is the most important quantity for actor-critic methods. A positive advantage means the action is better than the policy's average action in that state; a negative advantage means it is worse. Because V^π(s) is the expectation of Q^π(s, a) over actions a drawn from π(· | s), the advantage averages to zero under π.

4.2.4 Intuitive Example: Learning to Play Drums

To illustrate these concepts, Professor Finn uses a simple example:

Reward: 1 if you can play drums in one month, 0 otherwise
Actions: Sit on the beach, watch TV, practice drums
Current policy: Always sit on the beach

For this scenario:

V^π(s) = 0 (following the current policy will never lead to learning drums)
Q^π(s, sit on beach) = 0, Q^π(s, watch TV) = 0
Q^π(s, practice drums) ≈ 1 (practicing once gives a chance to learn)
A^π(s, practice drums) = Q^π(s, practice drums) - V^π(s) ≈ 1 - 0 = 1 (practicing is much better than what the current policy does on average)
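
The drums numbers can be reproduced with a few lines of Python. This is only an illustrative sketch: the dictionaries and function names are made up for this example, Q^π(s, practice drums) is rounded to exactly 1.0, and V^π(s) is computed with the standard identity V^π(s) = Σ_a π(a | s) Q^π(s, a).

```python
# Q-values from the drums example, under the "always sit on the beach" policy.
# The value 1.0 for practicing is the notes' approximation, not an exact figure.
q_values = {"sit on beach": 0.0, "watch TV": 0.0, "practice drums": 1.0}

# Current policy: probability 1 on sitting on the beach, 0 on everything else.
policy = {"sit on beach": 1.0, "watch TV": 0.0, "practice drums": 0.0}

def state_value(policy, q_values):
    """V(s) = sum over actions of pi(a|s) * Q(s, a)."""
    return sum(policy[a] * q_values[a] for a in policy)

def advantages(policy, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action."""
    v = state_value(policy, q_values)
    return {a: q_values[a] - v for a in q_values}

v = state_value(policy, q_values)
adv = advantages(policy, q_values)
print(v)    # 0.0
print(adv)  # {'sit on beach': 0.0, 'watch TV': 0.0, 'practice drums': 1.0}
```

Weighting each advantage by the policy's probability of that action gives 0, which matches the fact that the advantage is measured relative to the policy's own average behavior.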

4.3 Improving Policy Gradients with Advantage Functions

The vanilla policy gradient can be rewritten so that each log-probability term is weighted by the advantage of that action instead of the total trajectory return, which gives better credit assignment and a lower-variance gradient estimate: ∇_θ J(θ) = E_{τ ~ π_θ} [ Σ_{t=1}^{T} ∇_θ log π_θ(a_t | s_t) A^π(s_t, a_t) ]
This is a significant improvement over vanilla policy gradients because:

It only rewards actions that are better than average, rather than all actions in high-return trajectories
It correctly penalizes actions that are worse than average, even in high-return trajectories
It produces much lower-variance gradient estimates

The challenge now becomes how to accurately estimate the advantage function A^π(s, a).
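
To make the advantage-weighted gradient concrete, here is a minimal sketch for a single state with a softmax policy over three actions. The setup is an assumption made for illustration (it is not taken from the lecture): the policy parameters are the logits themselves, so ∇ log π(a) with respect to the logits is onehot(a) - π, and the advantage values are made-up numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def policy_gradient(logits, actions, advantages):
    """Sample average of grad log pi(a) * A(a) over the sampled actions."""
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        g = grad_log_prob(logits, a)
        for i in range(len(grad)):
            grad[i] += g[i] * adv / len(actions)
    return grad

logits = [0.0, 0.0, 0.0]   # uniform policy over three actions
actions = [0, 1, 2]        # one sample of each action (illustrative)
advs = [-0.5, 0.0, 1.0]    # hypothetical advantage estimates for those samples

grad = policy_gradient(logits, actions, advs)
new_logits = [l + 1.0 * g for l, g in zip(logits, grad)]  # one gradient-ascent step
print(softmax(new_logits))
```

After one ascent step the probability of the positive-advantage action (index 2) rises above 1/3 and the probability of the negative-advantage action (index 0) falls below it, while the zero-advantage sample contributes nothing to the gradient.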

4.4 Policy Evaluation: Estimating Value Functions

To compute the advantage function, we first need to estimate the value function V^π(s). There are three main methods for policy evaluation, each with different bias-variance trade-offs:

4.4.1 Monte Carlo Estimation

The simplest approach is to use the actual sum of future rewards observed in trajectories as the target for the value function:

Target: y_t = Σ_{t'=t}^{T} r(s_{t'}, a_{t'})
Training objective: min_φ Σ ||V_φ(s_t) - y_t||²
Properties: Unbiased (uses actual observed returns) but high variance (each trajectory is a single sample from the distribution)

Monte Carlo estimation requires waiting until the end of each trajectory to compute the target, making it unsuitable for very long or infinite-horizon tasks.
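
A minimal sketch of Monte Carlo targets, assuming a trajectory is stored as a plain list of rewards. The function names are hypothetical, and a lookup table stands in for the critic network V_φ; for a table, the squared-error objective is minimized by the mean target observed in each state.

```python
def monte_carlo_targets(rewards):
    """Reward-to-go y_t = sum of rewards from step t to the end of the trajectory."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        targets[t] = running
    return targets

def fit_tabular_values(states, targets):
    """Tabular critic: the mean target per state minimizes the squared error."""
    sums, counts = {}, {}
    for s, y in zip(states, targets):
        sums[s] = sums.get(s, 0.0) + y
        counts[s] = counts.get(s, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}

print(monte_carlo_targets([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]

# Two trajectories through the same states with different outcomes.
states = ["A", "B", "C", "A", "B", "C"]
targets = monte_carlo_targets([1.0, 0.0, 2.0]) + monte_carlo_targets([0.0, 0.0, 0.0])
values = fit_tabular_values(states, targets)
print(values)  # {'A': 1.5, 'B': 1.0, 'C': 1.0}
```

The two trajectories give targets of 3.0 and 0.0 for state A, which illustrates the variance: each target is a single sample of the return.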

4.4.2 Temporal Difference (TD) Learning (Bootstrapping)

TD learning addresses the high variance of Monte Carlo estimation by bootstrapping: it uses the current estimate of the value function at the next state to compute the target:

Target: y_t = r(s_t, a_t) + V_φ(s_{t+1})
Training objective: min_φ Σ ||V_φ(s_t) - y_t||²
Properties: Low variance (only depends on one immediate reward and one value estimate) but biased (relies on the current, imperfect value function estimate)

TD learning can update the value function after every time step, making it much more efficient for long-horizon tasks. It propagates value information backward through time as the agent gains experience.
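
The backward propagation of value information can be seen in a small tabular sketch. The three-state chain, the learning rate of 0.5, and the function name are assumptions chosen for illustration; the target is the undiscounted one-step target given above, with the value of a terminal state taken to be 0.

```python
def td0_sweep(values, transitions, lr=0.5):
    """One pass of tabular TD(0) over (state, reward, next_state, done) tuples."""
    for s, r, s_next, done in transitions:
        bootstrap = 0.0 if done else values[s_next]
        target = r + bootstrap                  # y_t = r + V(s_{t+1})
        values[s] += lr * (target - values[s])  # move V(s) toward the target
    return values

# Chain A -> B -> C -> end, with reward 1 only on the final transition.
transitions = [("A", 0.0, "B", False), ("B", 0.0, "C", False), ("C", 1.0, None, True)]
values = {"A": 0.0, "B": 0.0, "C": 0.0}

history = []
for _ in range(3):
    td0_sweep(values, transitions)
    history.append(dict(values))

for snapshot in history:
    print(snapshot)
# {'A': 0.0, 'B': 0.0, 'C': 0.5}
# {'A': 0.0, 'B': 0.25, 'C': 0.75}
# {'A': 0.125, 'B': 0.5, 'C': 0.875}
```

The reward reaches C after the first pass, B after the second, and A after the third: each state's estimate can only improve once its successor's estimate has, which is also where the bias comes from while the estimates are still wrong.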

4.4.3 N-Step Returns

N-step returns provide a middle ground between Monte Carlo and TD learning by balancing bias and variance:

Target: y_t = Σ_{k=0}^{n-1} r(s_{t+k}, a_{t+k}) + V_φ(s_{t+n})
Properties: Lower variance than Monte Carlo, lower bias than 1-step TD

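The choice of n is a hyperparameter that controls the bias-variance trade-off. Smaller n gives lower variance but higher bias; larger n gives lower bias but higher variance. At the extremes, n = 1 recovers the TD target, and an n large enough to reach the end of the trajectory recovers the Monte Carlo target. In practice, values between 5 and 20 often work well, though the best choice depends on the task.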
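
A minimal sketch of undiscounted n-step targets for one finished trajectory. The function name and the list layout are assumptions for this example: next_values[t] holds the critic's estimate of V(s_{t+1}), and the last entry is 0.0 because the episode ends after the last step.

```python
def n_step_targets(rewards, next_values, n):
    """Undiscounted n-step targets for a single finished trajectory.

    rewards[t] is r(s_t, a_t); next_values[t] is the critic's estimate of
    V(s_{t+1}), with next_values[-1] == 0.0 for the terminal state.
    """
    T = len(rewards)
    targets = []
    for t in range(T):
        end = min(t + n, T)                 # exclusive end of the summed rewards
        target = sum(rewards[t:end])
        if end < T:                         # bootstrap only if the episode is not over
            target += next_values[end - 1]  # V(s_{t+n})
        targets.append(target)
    return targets

rewards = [1.0, 0.0, 2.0]
next_values = [5.0, 4.0, 0.0]  # hypothetical critic estimates of V(s_1), V(s_2), V(s_3)

print(n_step_targets(rewards, next_values, n=1))  # [6.0, 4.0, 2.0]  (one-step TD)
print(n_step_targets(rewards, next_values, n=2))  # [5.0, 2.0, 2.0]
print(n_step_targets(rewards, next_values, n=3))  # [3.0, 2.0, 2.0]  (Monte Carlo)
```

With n = 1 every target leans on the critic's estimate, and with n = 3 the critic is not used at all; intermediate n mixes the two.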

4.4.4 Discount Factors

For long or infinite-horizon tasks, we typically add a discount factor γ (between 0 and 1) to weight immediate rewards more heavily than future rewards:

Discounted n-step target: y_t = Σ_{k=0}^{n-1} γ^k r(s_{t+k}, a_{t+k}) + γ^n V_φ(s_{t+n})

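Intuitively, this is equivalent to assuming there is a 1-γ probability that the episode will end at each time step. With bounded rewards and γ < 1, the discounted sum stays finite even over an infinite horizon (it is at most r_max / (1 - γ), where r_max is the largest possible reward), which keeps value estimates bounded and helps stabilize training.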
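
The discounted target and the bound on discounted returns can be checked numerically. This is a sketch with made-up numbers: the function name, γ = 0.9, and the constant reward of 1 are assumptions for illustration.

```python
def discounted_n_step_target(rewards, bootstrap_value, gamma):
    """y_t = sum_k gamma^k * r_{t+k} + gamma^n * V(s_{t+n}), with n = len(rewards)."""
    n = len(rewards)
    target = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return target + (gamma ** n) * bootstrap_value

gamma = 0.9

# Three rewards of 1, then bootstrap from a value estimate of 10.
y = discounted_n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=gamma)
print(y)  # approximately 10.0 (1 + 0.9 + 0.81 + 0.729 * 10)

# A very long run of reward 1 approaches, but never exceeds, r_max / (1 - gamma).
long_return = discounted_n_step_target([1.0] * 1000, bootstrap_value=0.0, gamma=gamma)
bound = 1.0 / (1.0 - gamma)
print(long_return, bound)  # both approximately 10.0
```

The first target equals the bootstrap value because 10 is the discounted value of receiving reward 1 forever when γ = 0.9, so the three-step target and the value estimate agree.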

4.5 The Complete Actor-Critic Algorithm

Actor-critic algorithms use two separate neural networks:

Actor network (π_θ): Maps states to a distribution over actions (the policy)
Critic network (V_φ): Maps states to value estimates (evaluates the policy)

The full algorithm follows this iterative loop:

1. Sample trajectories: Run the current actor policy in the environment to collect a batch of trajectories
2. Fit the critic: Train the critic network to predict value functions using either Monte Carlo, TD, or n-step targets
3. Compute advantages: For each state-action pair, estimate the advantage function using A(s_t, a_t) ≈ r(s_t, a_t) + γ V_φ(s_{t+1}) - V_φ(s_t)
4. Update the actor: Compute the policy gradient using the estimated advantages and update the actor network parameters
5. Repeat: Discard the old data and collect new trajectories with the updated actor policy
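
The loop above can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration and is not from the lecture: a four-state line world with reward 1 for reaching the rightmost state, lookup tables in place of the actor and critic networks, one-step TD targets for the critic, and hand-picked learning rates and batch sizes.

```python
import math
import random

N_STATES, GOAL, GAMMA = 4, 3, 0.9   # states 0..3 on a line; state 3 is terminal
LEFT, RIGHT = 0, 1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def step(state, action):
    """Toy dynamics: move left or right; reward 1 on reaching the goal."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def sample_batch(logits, rng, episodes=20, max_steps=20):
    """Sample trajectories: run the current actor, record (s, a, r, s', done)."""
    batch = []
    for _episode in range(episodes):
        s = 0
        for _t in range(max_steps):
            a = rng.choices([LEFT, RIGHT], weights=softmax(logits[s]))[0]
            s_next, r, done = step(s, a)
            batch.append((s, a, r, s_next, done))
            if done:
                break
            s = s_next
    return batch

def fit_critic(values, batch, lr=0.05, passes=5):
    """Fit the critic: tabular TD(0) toward y = r + gamma * V(s')."""
    for _pass in range(passes):
        for s, _a, r, s_next, done in batch:
            target = r + (0.0 if done else GAMMA * values[s_next])
            values[s] += lr * (target - values[s])

def update_actor(logits, values, batch, lr=2.0):
    """Compute advantages r + gamma*V(s') - V(s), then ascend grad log pi * A."""
    grads = [[0.0, 0.0] for _ in logits]
    for s, a, r, s_next, done in batch:
        adv = r + (0.0 if done else GAMMA * values[s_next]) - values[s]
        probs = softmax(logits[s])
        for i in range(2):
            grads[s][i] += adv * ((1.0 if i == a else 0.0) - probs[i]) / len(batch)
    for s in range(len(logits)):
        for i in range(2):
            logits[s][i] += lr * grads[s][i]

def train(iterations=150, seed=0):
    rng = random.Random(seed)
    logits = [[0.0, 0.0] for _ in range(N_STATES)]  # actor: one row of logits per state
    values = [0.0] * N_STATES                       # critic: one value per state
    for _iteration in range(iterations):
        batch = sample_batch(logits, rng)           # fresh on-policy data every iteration
        fit_critic(values, batch)
        update_actor(logits, values, batch)
    return logits, values

logits, values = train()
prob_right = [softmax(logits[s])[RIGHT] for s in range(GOAL)]
print(prob_right)  # probability of moving right in states 0, 1, 2
print(values[:GOAL])
```

In this sketch the policy should come to prefer moving right in every non-terminal state, since moving left only delays the discounted reward. Swapping the tables for neural networks and the hand-written gradient for automatic differentiation gives the usual deep actor-critic setup, but the order of the steps stays the same.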

4.6 Key Benefits of Actor-Critic Methods

Lower variance gradients: Advantage function estimates produce much more stable updates than vanilla policy gradients
Better credit assignment: Correctly identifies which actions are good or bad, rather than weighting all actions in a trajectory equally
Higher data efficiency: Can take multiple gradient steps on the critic network for each batch of data
Works well with sparse rewards: Can learn from partial progress toward the goal by estimating intermediate state values

Video Source and Usage Instructions

Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 4: Actor-Critic Methods Stanford Online
Course Series: Stanford CS224R Deep Reinforcement Learning
Original Platform:
Original Publisher: Stanford
Original Video URL: https://youtu.be/oejFZShW9hU?si=HSZnNNqL4oF0c4NX
