Deep Reinforcement Learning, Lecture 5: Practical Deep RL Algorithms (PPO & SAC). Structured Notes & In-Depth Analysis

One. Course Details

This is the fifth lecture of Stanford University's CS224R Deep Reinforcement Learning course, taught by Professor Chelsea Finn. It builds directly on the actor-critic methods introduced in Lecture 4 and dives into the two most widely used and practical deep reinforcement learning algorithms in industry and research: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC).

The lecture addresses the core limitations of vanilla policy gradients and on-policy actor-critic methods—poor data efficiency and unstable learning—and presents the key innovations that make PPO and SAC work reliably in real-world scenarios. It covers the mathematical formulation of both algorithms, their core design choices, hyperparameter tuning considerations, and head-to-head comparisons of their trade-offs. Both algorithms form the backbone of state-of-the-art systems for training large language models, humanoid robots, and game-playing agents.

The core goal of this lecture is to equip students to understand, implement, and debug PPO and SAC, and to make informed decisions about which algorithm to use for different problem domains based on their unique strengths and weaknesses.

Two. Key Learning Objectives

By the end of this lecture, students should be able to:

Explain how importance sampling enables multiple gradient steps on a single batch of data
Identify the root cause of instability in unconstrained off-policy policy gradients
Describe the clipped surrogate objective that forms the core of PPO
Implement the full PPO algorithm with generalized advantage estimation (GAE)
Explain how replay buffers enable extreme data efficiency in off-policy RL
Solve the value estimation problem for replay buffers using Q-functions instead of value functions
Compare the bias-variance, stability, and data efficiency trade-offs between PPO and SAC
Select the appropriate algorithm for a given task based on data availability and computational resources

Three. Memorable Course Quotes

"PPO is the workhorse of modern deep reinforcement learning—it's stable, reliable, and works surprisingly well out of the box for most problems."
"Soft Actor-Critic is far more data-efficient than PPO, but it comes with the trade-off of being harder to tune and less stable in practice."
"The key to stable off-policy learning is keeping your new policy close enough to the old policy that your advantage estimates remain valid."
"Replay buffers let you reuse every single experience your agent ever has, which is a complete game-changer for data efficiency."
"PPO's clipped objective doesn't explicitly constrain the policy—it just removes the incentive for the policy to change too much, which is why it works so well."

Four. Detailed Study Notes

4.1 Recap: The Fundamental Limitation of On-Policy Learning

All the algorithms covered so far—vanilla policy gradients and basic actor-critic—are on-policy algorithms, meaning they can only use data collected from the exact current version of the policy. This creates a crippling data efficiency problem:

You must discard all data after taking a single gradient step
For neural networks that require thousands of gradient steps to converge, this means collecting millions of time steps of experience
This is prohibitively expensive for real-world systems like robots, where each second of experience costs real time and resources

The solution to this problem is importance sampling, a statistical technique that allows us to estimate expectations under one distribution using samples from another distribution. For policy gradients, this means we can use data collected from an old policy π_θ to update a new policy π_θ' by weighting each sample by the importance ratio:

w_t = π_θ'(a_t | s_t) / π_θ(a_t | s_t)

This allows us to take multiple gradient steps on a single batch of data, drastically improving data efficiency.
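As a sanity check on the identity behind importance sampling, here is a minimal Python sketch for a single state with three actions. The probabilities and the per-action quantity `f` are made-up illustrative values, not numbers from the lecture. It verifies by exact enumeration that reweighting by the importance ratio recovers the expectation under the new policy.

```python
# Toy check of the importance-sampling identity for a single state.
# All numbers below are made up for illustration.

old_policy = [0.5, 0.3, 0.2]   # pi_theta(a | s): the policy that collected the data
new_policy = [0.2, 0.3, 0.5]   # pi_theta'(a | s): the policy being updated
f = [1.0, 2.0, 4.0]            # any per-action quantity, e.g. an advantage


def importance_ratio(new_prob, old_prob):
    """w = pi_theta'(a | s) / pi_theta(a | s)."""
    return new_prob / old_prob


# Expectation of f under the new policy, computed directly.
direct = sum(p_new * fa for p_new, fa in zip(new_policy, f))

# The same expectation written as an expectation under the OLD policy,
# with each action reweighted by its importance ratio.
reweighted = sum(
    p_old * importance_ratio(p_new, p_old) * fa
    for p_old, p_new, fa in zip(old_policy, new_policy, f)
)

print(direct, reweighted)  # the two values agree up to floating-point error
```

In practice the sum over actions is replaced by an average over sampled (s_t, a_t) pairs from the old policy, which is what makes the estimate noisy when the ratios are large.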

4.2 The Instability of Unconstrained Importance Sampling

While importance sampling works in theory, it causes catastrophic instability in practice:

The surrogate objective encourages the policy to maximize the importance ratio for actions with positive advantages
If an action had a very low probability under the old policy, its ratio has a tiny denominator, so increasing its probability even slightly produces an enormous gain in the surrogate objective
This leads to the policy changing drastically from the old policy in a single update
Once the policy has changed too much, the advantage estimates (which were computed for the old policy) become completely invalid
This causes learning to collapse entirely, with the policy's performance dropping to near zero
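To make the low-probability case concrete, the sketch below uses made-up probabilities. The per-sample surrogate term w_t * A_t is linear in the new probability with slope A_t / π_θ(a_t | s_t), so the same small increase in probability pays off far more for an action that was rare under the old policy.

```python
def surrogate_term(new_prob, old_prob, advantage):
    """Unclipped per-sample surrogate: importance ratio times advantage."""
    return (new_prob / old_prob) * advantage


advantage = 1.0
bump = 0.05  # same absolute increase in probability for both actions

# Action that was common under the old policy (probability 0.50).
common_gain = (surrogate_term(0.50 + bump, 0.50, advantage)
               - surrogate_term(0.50, 0.50, advantage))

# Action that was rare under the old policy (probability 0.01).
rare_gain = (surrogate_term(0.01 + bump, 0.01, advantage)
             - surrogate_term(0.01, 0.01, advantage))

print(common_gain, rare_gain)  # roughly 0.1 versus roughly 5.0
```

With many gradient steps on the same batch, the optimizer keeps following this steep slope, which is one way to read the collapse described above.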

Professor Finn illustrates this with a simple plot:

1 gradient step per batch: Slow but stable learning
5 gradient steps per batch: Faster learning with minor instability
50+ gradient steps per batch: Rapid initial improvement followed by complete performance collapse

4.3 Proximal Policy Optimization (PPO): Stable Multi-Step Updates

PPO solves the instability problem by preventing the policy from changing too much from the old policy that collected the data. There are two main approaches to this:

4.3.1 Approach 1: KL Divergence Penalty

Add a penalty term to the surrogate objective that discourages large differences between the new and old policies:

L_KL(θ') = L_surrogate(θ') - β * KL(π_θ' || π_θ)

where β is a hyperparameter that controls the strength of the penalty. While effective, this approach requires tuning β and keeping the old policy in memory, which can be expensive for large models.
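A minimal sketch of the penalized objective for a single state with discrete actions, computed exactly rather than from samples. The function names, the two-action example, and the β values are illustrative assumptions. It shows why β needs tuning: with β = 0 the large policy change scores higher, while with a large enough β the small change wins.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_penalized_objective(new_policy, old_policy, advantages, beta):
    """L_surrogate - beta * KL(new || old) for one state.

    The surrogate is the expectation under the old policy of ratio * advantage;
    it assumes every old-policy probability is strictly positive.
    """
    surrogate = sum(
        p_old * (p_new / p_old) * adv
        for p_old, p_new, adv in zip(old_policy, new_policy, advantages)
    )
    return surrogate - beta * kl_divergence(new_policy, old_policy)


old = [0.5, 0.5]
advantages = [1.0, -1.0]      # action 0 looked good, action 1 looked bad
small_step = [0.6, 0.4]       # modest change toward action 0
big_step = [0.99, 0.01]       # near-deterministic jump toward action 0

for beta in (0.0, 2.0):
    print(beta,
          kl_penalized_objective(small_step, old, advantages, beta),
          kl_penalized_objective(big_step, old, advantages, beta))
```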

4.3.2 Approach 2: Clipped Surrogate Objective (Most Common)

The more popular and simpler approach is to clip the importance ratio to a small range around 1, typically [1-ε, 1+ε] with ε = 0.2 as the standard choice. This removes any incentive for the policy to move the ratio beyond this range. The final PPO surrogate objective takes the minimum of the original and clipped objectives:

L_PPO(θ') = E[ min( w_t * A_t, clip(w_t, 1-ε, 1+ε) * A_t ) ]

This ensures that:

We are always maximizing a lower bound on the unclipped surrogate objective
The policy can never get a better objective value by changing the ratio beyond the clipped range
No explicit KL penalty or copy of the old policy network is required; the ratio only needs π_θ(a_t | s_t), which can be recorded when the data is collected
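The per-sample clipped term can be sketched in a few lines of plain Python. The function names are illustrative, and a real implementation would operate on tensors with automatic differentiation; this version only evaluates the objective so the four sign cases can be inspected.

```python
def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(x, high))


def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """min(w * A, clip(w, 1 - eps, 1 + eps) * A) for one sample."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)


def ppo_clipped_objective(ratios, advantages, epsilon=0.2):
    """Batch average of the per-sample clipped terms."""
    terms = [ppo_clipped_term(w, a, epsilon) for w, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# The four combinations of ratio direction and advantage sign.
for ratio, advantage in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(ratio, advantage, ppo_clipped_term(ratio, advantage))
```

With a positive advantage, pushing the ratio above 1+ε earns nothing extra (the term stays at about 1.2). With a negative advantage, a ratio that has grown to 1.5 is not clipped, so the objective still penalizes it in full. This asymmetry is what makes the minimum a pessimistic bound.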

4.3.3 Generalized Advantage Estimation (GAE)

PPO typically uses generalized advantage estimation to compute advantage functions. GAE is a weighted average of n-step advantage estimates with exponentially decaying weights:

A_t^GAE(γ, λ) = Σ_{k=0}^∞ (γλ)^k δ_{t+k}

where δ_t = r_t + γ V(s_{t+1}) - V(s_t) is the TD error. This provides a flexible way to balance bias and variance in advantage estimates, with λ = 0.95 being a common default. Setting λ = 0 recovers the one-step TD advantage (lower variance, more bias), while λ = 1 recovers the Monte Carlo return minus the value baseline (higher variance, less bias).
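The GAE sum is usually computed with a backward recursion, A_t = δ_t + γλ A_{t+1}. Below is a minimal sketch for one trajectory segment. It assumes the infinite sum is truncated at the end of the segment, that no episode terminates mid-segment, and that `values` carries one extra bootstrap entry; these conventions are assumptions of this sketch rather than details from the lecture.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory segment.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)], one entry longer than rewards; the last
             entry is the bootstrap value (use 0.0 if s_T is terminal).
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]
print(compute_gae(rewards, values, gamma=1.0, lam=0.0))  # one-step TD errors
print(compute_gae(rewards, values, gamma=1.0, lam=1.0))  # Monte Carlo return minus V
```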

4.3.4 Full PPO Algorithm

The complete PPO algorithm follows this iterative loop:

Sample a batch of trajectories using the current policy
Fit a value function to the batch of data using Monte Carlo or TD targets
Compute generalized advantage estimates for all state-action pairs
Take M epochs of gradient steps (typically M = 10-30) on the clipped surrogate objective, reusing the same batch
Discard the old batch and repeat
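The control flow of the loop above can be sketched as follows. All four callables are hypothetical placeholders that a real implementation would supply; the stubs here only count calls so the structure can be checked.

```python
from collections import Counter


def ppo_training_loop(collect_batch, fit_value_function, compute_advantages,
                      clipped_gradient_step, num_iterations, num_epochs=10):
    """Structural sketch of the PPO outer loop; the callables are hypothetical."""
    for _ in range(num_iterations):
        batch = collect_batch()                  # sample with the current policy
        fit_value_function(batch)                # regress V onto MC or TD targets
        advantages = compute_advantages(batch)   # e.g. GAE
        for _ in range(num_epochs):              # several passes over the SAME batch
            clipped_gradient_step(batch, advantages)
        # the batch is dropped here and never reused by later iterations


counts = Counter()


def collect_batch():
    counts["collect"] += 1
    return ["placeholder transition"]


def fit_value_function(batch):
    counts["fit_value"] += 1


def compute_advantages(batch):
    counts["advantages"] += 1
    return [0.0 for _ in batch]


def clipped_gradient_step(batch, advantages):
    counts["gradient_step"] += 1


ppo_training_loop(collect_batch, fit_value_function, compute_advantages,
                  clipped_gradient_step, num_iterations=3, num_epochs=10)
print(dict(counts))  # 3 collections, 30 gradient steps
```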

Typical hyperparameters for PPO include:

Batch size: 2048 time steps
Number of epochs per batch: 10
Clipping parameter ε: 0.2
Discount factor γ: 0.99
GAE parameter λ: 0.95
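One way to keep these defaults together in code is a small frozen dataclass. The class and field names are hypothetical; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    batch_size: int = 2048      # time steps collected per outer iteration
    num_epochs: int = 10        # passes over each batch
    clip_epsilon: float = 0.2   # clipping parameter
    gamma: float = 0.99         # discount factor
    gae_lambda: float = 0.95    # GAE parameter


config = PPOConfig()
print(config)
```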

4.4 Going Further Off-Policy: Soft Actor-Critic (SAC)

PPO allows multiple gradient steps per batch but still discards data after each outer loop iteration. Soft Actor-Critic (SAC) goes a step further and uses a replay buffer to store all past experience, allowing it to reuse data from all previous policies.
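A replay buffer can be sketched with the standard library alone. The class below is an illustrative assumption, not the lecture's implementation. In particular it uses a fixed capacity and evicts the oldest transitions, whereas the notes describe storing all past experience; a capacity larger than the total amount of data collected reproduces that behavior.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self._storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self._storage.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Uniformly sample a mini-batch without replacement."""
        return rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1)

print(len(buffer))       # capacity caps the size at 3
print(buffer.sample(2))  # two of the three most recent transitions
```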

4.4.1 The Replay Buffer Challenge

Replay buffers create a fundamental problem for value estimation:

The buffer contains data from dozens or hundreds of different old policies
If you train a value function V^π(s) on this mixed data, you are not learning the value function for your current policy—you are learning the value function for some unknown mixture of all past policies

The solution is to replace the state-value function with a Q-function Q^π(s, a). Q-functions have a critical property that makes them suitable for off-policy learning: because the action a is an input, Q^π(s, a) is well defined no matter which policy chose a. Only the actions taken from the next state onward need to follow the current policy π.

4.4.2 Off-Policy Q-Learning with Replay Buffers

To learn the Q-function for the current policy using data from the replay buffer:

Sample a transition (s, a, r, s') from the replay buffer
Sample an action a' from the current policy at state s' (not the action from the buffer)
Compute the target Q-value: y = r + γ Q(s', a')
Update the Q-network to minimize the L2 error between Q(s, a) and y

This works because:

The dynamics of the environment are constant, so s' is valid regardless of which policy took action a
By sampling a' from the current policy, we ensure the target reflects the value of following the current policy from s'
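The target computation above can be sketched with a tabular Q-function. The state and action names, the table values, and the deterministic stand-in for the current policy are all made up for illustration; terminal-state handling is left out to match the simplified target in these notes.

```python
def q_target(reward, next_state, q_function, sample_action, gamma=0.99):
    """y = r + gamma * Q(s', a') with a' drawn from the CURRENT policy."""
    next_action = sample_action(next_state)  # not the action stored in the buffer
    return reward + gamma * q_function(next_state, next_action)


def squared_td_error(q_function, transition, sample_action, gamma=0.99):
    """(Q(s, a) - y)^2 for one replayed transition (s, a, r, s')."""
    state, action, reward, next_state = transition
    y = q_target(reward, next_state, q_function, sample_action, gamma)
    return (q_function(state, action) - y) ** 2


q_table = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
           ("s1", "left"): 2.0, ("s1", "right"): 4.0}


def q_function(state, action):
    return q_table[(state, action)]


def current_policy(state):
    return "right"  # deterministic stand-in for sampling from the current policy


# The stored action "left" was chosen by some older policy; that is fine,
# because only the action at s1 has to come from the current policy.
transition = ("s0", "left", 1.0, "s1")
print(q_target(1.0, "s1", q_function, current_policy, gamma=0.5))       # 1.0 + 0.5 * 4.0 = 3.0
print(squared_td_error(q_function, transition, current_policy, gamma=0.5))  # (0.0 - 3.0)^2 = 9.0
```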

4.4.3 Full SAC Algorithm

The complete SAC algorithm follows this loop:

1. Initialize an empty replay buffer
2. Collect experience using the current policy and add it to the buffer
3. Sample a mini-batch of transitions from the buffer
4. Update the Q-network using the off-policy target described above
5. Update the policy to maximize the expected Q-value E[ Q(s, a) ], with s drawn from the mini-batch and a sampled from the current policy π_θ, using the gradient estimate E[ ∇_θ log π_θ(a | s) * Q(s, a) ]
6. Repeat steps 3-5 many times before collecting new experience, then return to step 2
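The control flow of this loop can be sketched as below. The callables and the `updates_per_collection` parameter are hypothetical placeholders, and the stubs only count calls. The point of contrast with PPO is that the buffer is never cleared, so every update can draw on data from all earlier policies.

```python
import random
from collections import Counter


def off_policy_loop(collect_experience, sample_minibatch, update_q_network,
                    update_policy, num_iterations, updates_per_collection):
    """Structural sketch of the replay-buffer loop; the callables are hypothetical."""
    replay_buffer = []                                    # step 1
    for _ in range(num_iterations):
        replay_buffer.extend(collect_experience())        # step 2
        for _ in range(updates_per_collection):           # step 6
            minibatch = sample_minibatch(replay_buffer)   # step 3
            update_q_network(minibatch)                   # step 4
            update_policy(minibatch)                      # step 5
    return replay_buffer


counts = Counter()


def collect_experience():
    return [("s", "a", 0.0, "s_next")] * 4  # four placeholder transitions


def sample_minibatch(buffer):
    return random.sample(buffer, 2)


def update_q_network(minibatch):
    counts["q_updates"] += 1


def update_policy(minibatch):
    counts["policy_updates"] += 1


buffer = off_policy_loop(collect_experience, sample_minibatch, update_q_network,
                         update_policy, num_iterations=3, updates_per_collection=5)
print(len(buffer), dict(counts))  # 12 stored transitions, 15 updates of each kind
```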

4.5 PPO vs. SAC: Head-to-Head Comparison

Aspect	PPO	SAC
Policy Type	Mildly off-policy	Fully off-policy
Data Efficiency	Low	Extremely high
Stability	Excellent	Good but less than PPO
Hyperparameter Tuning	Very easy	Moderately difficult
Best For	Simulation environments, cheap data, large language models	Real-world systems, expensive data, robots
Industry Adoption	Extremely widespread	Growing rapidly for robotics

Key takeaways:

Use PPO if you have unlimited simulated data and want something that just works
Use SAC if you are working with real robots or any scenario where data collection is expensive
PPO is the default choice for fine-tuning large language models with reinforcement learning

Video Source and Usage Instructions

Video Title: Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 5: Off-Policy Actor Critic Stanford Onlin
Course Series: Stanford CS224R Deep Reinforcement Learning
Original Platform:
Original Publisher: Stanford
Original Video URL: https://youtu.be/cRGKc-nAWho?si=UGQX2eePzLn91tF7
