One. Course Details
This is the fifth lecture of Stanford University's CS224R Deep Reinforcement Learning course, taught by Professor Chelsea Finn. It builds directly on the actor-critic methods introduced in Lecture 4 and dives into the two most widely used and practical deep reinforcement learning algorithms in industry and research: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC).
The lecture addresses the core limitations of vanilla policy gradients and on-policy actor-critic methods—poor data efficiency and unstable learning—and presents the key innovations that make PPO and SAC work reliably in real-world scenarios. It covers the mathematical formulation of both algorithms, their core design choices, hyperparameter tuning considerations, and head-to-head comparisons of their trade-offs. Both algorithms form the backbone of state-of-the-art systems for training large language models, humanoid robots, and game-playing agents.
The core goal of this lecture is to equip students to understand, implement, and debug PPO and SAC, and to make informed decisions about which algorithm to use for different problem domains based on their unique strengths and weaknesses.
Two. Key Learning Objectives
By the end of this lecture, students should be able to:
- Explain how importance sampling enables multiple gradient steps on a single batch of data
- Identify the root cause of instability in unconstrained off-policy policy gradients
- Describe the clipped surrogate objective that forms the core of PPO
- Implement the full PPO algorithm with generalized advantage estimation (GAE)
- Explain how replay buffers enable extreme data efficiency in off-policy RL
- Solve the value estimation problem for replay buffers using Q-functions instead of value functions
- Compare the bias-variance, stability, and data efficiency trade-offs between PPO and SAC
- Select the appropriate algorithm for a given task based on data availability and computational resources
Three. Memorable Course Quotes
- "PPO is the workhorse of modern deep reinforcement learning: it's stable, reliable, and works surprisingly well out of the box for most problems."
- "Soft Actor-Critic is far more data-efficient than PPO, but it comes with the trade-off of being harder to tune and less stable in practice."
- "The key to stable off-policy learning is keeping your new policy close enough to the old policy that your advantage estimates remain valid."
- "Replay buffers let you reuse every single experience your agent ever has, which is a complete game-changer for data efficiency."
- "PPO's clipped objective doesn't explicitly constrain the policy; it just removes the incentive for the policy to change too much, which is why it works so well."
Four. Detailed Study Notes
4.1 Recap: The Fundamental Limitation of On-Policy Learning
All the algorithms covered so far (vanilla policy gradients and basic actor-critic) are on-policy algorithms, meaning they can only use data collected by the current version of the policy. This creates a crippling data efficiency problem:
- You must discard all data after taking a single gradient step
- For neural networks that require thousands of gradient steps to converge, this means collecting millions of time steps of experience
- This is prohibitively expensive for real-world systems like robots, where each second of experience costs real time and resources
Importance sampling addresses this problem. It lets us use data collected by an old policy π_θ to update a new policy π_θ' by weighting each sample by the importance ratio:
w_t = π_θ'(a_t | s_t) / π_θ(a_t | s_t)
This allows us to take multiple gradient steps on a single batch of data, drastically improving data efficiency.
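As a minimal sketch of the importance ratio, the snippet below computes w_t from log-probabilities and forms the importance-weighted surrogate objective (the mean of w_t * A_t). The function names and the toy numbers are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def importance_weights(new_log_probs, old_log_probs):
    """w_t = pi_theta'(a_t | s_t) / pi_theta(a_t | s_t), from log-probabilities."""
    return np.exp(np.asarray(new_log_probs) - np.asarray(old_log_probs))

def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """Importance-weighted surrogate: mean over samples of w_t * A_t."""
    w = importance_weights(new_log_probs, old_log_probs)
    return float(np.mean(w * np.asarray(advantages)))

# Toy batch (hypothetical numbers): probabilities of the actions actually taken.
old_lp = np.log([0.50, 0.25, 0.25])   # under the data-collecting policy
new_lp = np.log([0.60, 0.25, 0.15])   # under the policy being updated
adv = np.array([1.0, 0.0, -1.0])

weights = importance_weights(new_lp, old_lp)
objective = surrogate_objective(new_lp, old_lp, adv)
```

Working in log-probabilities and exponentiating the difference is a common choice because it avoids dividing two very small numbers.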
4.2 The Instability of Unconstrained Importance Sampling
While importance sampling works in theory, it causes catastrophic instability in practice:
- The surrogate objective encourages the policy to maximize the importance ratio for actions with positive advantages
- If an action had a very low probability under the old policy, even a slight increase in its probability produces a very large ratio, and therefore a very large gain in the surrogate objective
- This leads to the policy changing drastically from the old policy in a single update
- Once the policy has changed too much, the advantage estimates (which were computed for the old policy) become completely invalid
- This causes learning to collapse entirely, with the policy's performance dropping to near zero
The number of gradient steps taken per batch determines how severe the problem is:
- 1 gradient step per batch: Slow but stable learning
- 5 gradient steps per batch: Faster learning with minor instability
- 50+ gradient steps per batch: Rapid initial improvement followed by complete performance collapse
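A small arithmetic sketch shows why rare actions dominate the unconstrained objective: the same absolute change in probability yields a far larger importance ratio when the old probability was tiny. The probabilities below are made-up illustrative values.

```python
def importance_ratio(new_prob, old_prob):
    """Importance ratio w = new_prob / old_prob for a single action."""
    return new_prob / old_prob

# Same absolute increase (+0.01) applied to a rare and to a common action.
rare_ratio = importance_ratio(0.011, 0.001)    # roughly 11x
common_ratio = importance_ratio(0.51, 0.50)    # roughly 1.02x
```

With a positive advantage, the rare action's term in the surrogate is scaled about eleven times, so the optimizer is strongly rewarded for exactly the kind of change that invalidates the old advantage estimates.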
4.3 Proximal Policy Optimization (PPO): Stable Multi-Step Updates
PPO solves the instability problem by preventing the policy from changing too much from the old policy that collected the data. There are two main approaches to this:
4.3.1 Approach 1: KL Divergence Penalty
Add a penalty term to the surrogate objective that discourages large differences between the new and old policies:
L_KL(θ') = L_surrogate(θ') - β * KL(π_θ' || π_θ)
where β is a hyperparameter that controls the strength of the penalty. While effective, this approach requires tuning β and keeping the old policy in memory, which can be expensive for large models.
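The penalized objective can be sketched for a discrete-action policy as follows. This assumes each policy is given as a table of strictly positive action probabilities per sampled state; the function names and array layout are illustrative assumptions, and the KL direction follows the formula as written in these notes.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for categorical distributions along the last axis.
    Assumes all probabilities are strictly positive."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def kl_penalized_objective(new_probs, old_probs, actions, advantages, beta):
    """L_KL = mean(w_t * A_t) - beta * mean KL(pi_new || pi_old).
    new_probs, old_probs: arrays of shape (T, n_actions)."""
    new_probs = np.asarray(new_probs, dtype=float)
    old_probs = np.asarray(old_probs, dtype=float)
    idx = np.arange(len(actions))
    w = new_probs[idx, actions] / old_probs[idx, actions]   # importance ratios
    surrogate = np.mean(w * np.asarray(advantages))
    penalty = np.mean(kl_divergence(new_probs, old_probs))
    return float(surrogate - beta * penalty)
```

Note that the full old distribution is needed at every sampled state to evaluate the KL term, which is the memory cost mentioned above.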
4.3.2 Approach 2: Clipped Surrogate Objective (Most Common)
The more popular and simpler approach is to clip the importance ratio to a small range around 1, typically [1-ε, 1+ε], where ε = 0.2 is standard. This removes any incentive for the policy to move the ratio beyond this range. The final PPO surrogate objective takes the minimum of the original and clipped objectives:
L_PPO(θ') = E[ min( w_t * A_t, clip(w_t, 1-ε, 1+ε) * A_t ) ]
This ensures that:
- We are always maximizing a lower bound on the original (unclipped) surrogate objective
- The policy can never get a better objective value by moving the ratio beyond the clipped range
- No explicit KL penalty is required, and the old policy network does not need to be kept in memory; the old action probabilities π_θ(a_t | s_t) recorded at collection time are enough to compute w_t
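The clipped objective translates almost directly into array code. The sketch below is one way to write it with NumPy; the function name and the default `clip_eps` argument are assumptions for illustration.

```python
import numpy as np

def ppo_clip_objective(ratio, advantages, clip_eps=0.2):
    """Mean over samples of min(w * A, clip(w, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum keeps the more pessimistic of the two terms.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

For a positive advantage, a ratio above 1+ε earns no more than a ratio of exactly 1+ε; for a negative advantage, a ratio below 1-ε earns no more than 1-ε. Moves in the harmful direction (for example, a ratio of 1.5 with a negative advantage) are still penalized in full, which is what the minimum is for.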
4.3.3 Generalized Advantage Estimation (GAE)
PPO typically uses generalized advantage estimation to compute advantage functions. GAE is a weighted average of n-step advantage estimates with exponentially decaying weights:
A_t^GAE(γ, λ) = Σ_{k=0}^∞ (γλ)^k δ_{t+k}
where δ_t = r_t + γ V(s_{t+1}) - V(s_t) is the TD error. This provides a flexible way to balance bias and variance in advantage estimates, with λ = 0.95 being a common default.
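In practice the infinite sum is computed over a finite trajectory with a backward recursion, A_t = δ_t + γλ A_{t+1}. The sketch below assumes a single trajectory with no terminal state in the middle and a bootstrap value `last_value` for the state after the final step (0 if the episode ended); these names are illustrative assumptions.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.
    rewards[t] = r_t, values[t] = V(s_t), last_value = V(s_T) for bootstrapping."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages
```

Setting λ = 0 reduces this to the one-step TD error (lower variance, more bias), while λ = 1 recovers the Monte Carlo return minus the value baseline (less bias, higher variance).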
4.3.4 Full PPO Algorithm
The complete PPO algorithm follows this iterative loop:
1. Sample a batch of trajectories using the current policy
2. Fit a value function to the batch of data using Monte Carlo or TD targets
3. Compute generalized advantage estimates for all state-action pairs
4. Take M gradient steps (typically 10-30 epochs over the batch) on the clipped surrogate objective
5. Discard the old batch and repeat
Common default hyperparameters:
- Batch size: 2048 time steps
- Number of epochs per batch: 10
- Clipping parameter ε: 0.2
- Discount factor γ: 0.99
- GAE parameter λ: 0.95
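To make the loop concrete, here is a toy sketch on a multi-armed bandit with a softmax policy over logits. It is deliberately simplified and not a full PPO implementation: episodes last one step, so the fitted value function and GAE are replaced by a batch-mean baseline, rewards are deterministic, and the gradient of the clipped objective is written out by hand. The function `ppo_bandit`, its learning rate, and the smaller batch size are all assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def ppo_bandit(arm_rewards, iterations=50, batch_size=256, epochs=10,
               clip_eps=0.2, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    arm_rewards = np.asarray(arm_rewards, dtype=float)
    n = len(arm_rewards)
    logits = np.zeros(n)
    for _ in range(iterations):
        # Step 1: sample a batch with the current (soon to be "old") policy.
        old_probs = softmax(logits)
        actions = rng.choice(n, size=batch_size, p=old_probs)
        rewards = arm_rewards[actions]
        # Steps 2-3 (simplified): batch-mean baseline instead of V(s) and GAE.
        adv = rewards - rewards.mean()
        # Step 4: several gradient steps on the clipped surrogate, same batch.
        for _ in range(epochs):
            probs = softmax(logits)
            ratio = probs[actions] / old_probs[actions]
            # The clipped term is the active minimum here, so its gradient is zero.
            clipped = ((adv > 0) & (ratio > 1 + clip_eps)) | \
                      ((adv < 0) & (ratio < 1 - clip_eps))
            # For a softmax policy: d log pi(a) / d logits = onehot(a) - probs.
            grad_logp = np.eye(n)[actions] - probs
            weight = np.where(clipped, 0.0, ratio * adv)
            logits = logits + lr * (weight[:, None] * grad_logp).mean(axis=0)
        # Step 5: the batch is discarded when the next iteration resamples.
    return softmax(logits)

final_probs = ppo_bandit([0.1, 0.5, 0.9])
```

The inner loop reuses one batch for ten updates, and the `clipped` mask is what stops those repeated updates from pushing the ratio far outside [1-ε, 1+ε].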
4.4 Going Further Off-Policy: Soft Actor-Critic (SAC)
PPO allows multiple gradient steps per batch but still discards data after each outer loop iteration. Soft Actor-Critic (SAC) goes a step further and uses a replay buffer to store all past experience, allowing it to reuse data from all previous policies.
4.4.1 The Replay Buffer Challenge
Replay buffers create a fundamental problem for value estimation:
- The buffer contains data from dozens or hundreds of different old policies
- If you train a value function V^π(s) on this mixed data, you are not learning the value function for your current policy; you are learning the value function for some unknown mixture of all past policies
The solution is to learn a Q-function Q^π(s, a) instead. Q-functions have a critical property that makes them suitable for off-policy learning: they are conditioned on both the state and the action, so a stored transition remains usable no matter which policy originally chose that action.
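A replay buffer itself is a simple data structure. The sketch below stores (s, a, r, s') tuples and samples uniformly at random; the class name and the fixed `capacity` (which evicts the oldest transitions once full) are assumptions for illustration, since practical buffers are usually bounded even though the idea is to keep as much past experience as possible.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest transition when capacity is exceeded
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement, regardless of which policy
        # generated each transition.
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)
```

Usage: call `add` after every environment step and `sample` whenever a mini-batch is needed for an update.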
4.4.2 Off-Policy Q-Learning with Replay Buffers
To learn the Q-function for the current policy using data from the replay buffer:
1. Sample a transition (s, a, r, s') from the replay buffer
2. Sample an action a' from the current policy at state s' (not the action from the buffer)
3. Compute the target Q-value: y = r + γ Q(s', a')
4. Update the Q-network to minimize the L2 error between Q(s, a) and y
This works because:
- The dynamics of the environment are constant, so s' is valid regardless of which policy took action a
- By sampling a' from the current policy, we ensure the target reflects the value of following the current policy from s'
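The update can be sketched with a tabular Q-function standing in for the Q-network. This assumes discrete states and actions, a policy given as a table of action probabilities, and no terminal-state handling; the function name and learning rate are illustrative assumptions.

```python
import numpy as np

def q_update(Q, policy_probs, transition, rng, gamma=0.99, lr=0.1):
    """One off-policy update of a tabular Q of shape (n_states, n_actions).
    policy_probs[s] holds the CURRENT policy's action probabilities at state s."""
    s, a, r, s_next = transition                       # sampled from the buffer
    # a' comes from the current policy at s', not from the stored data.
    a_next = rng.choice(Q.shape[1], p=policy_probs[s_next])
    target = r + gamma * Q[s_next, a_next]             # y = r + gamma * Q(s', a')
    # Gradient step on 0.5 * (Q(s, a) - y)^2 with the target held fixed.
    Q[s, a] += lr * (target - Q[s, a])
    return target
```

The stored action `a` is used only to index Q(s, a); everything about the future is evaluated under the current policy, which is why stale data in the buffer does not bias the target toward old policies.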
4.4.3 Full SAC Algorithm
The complete SAC algorithm follows this loop:
1. Initialize an empty replay buffer
2. Collect experience using the current policy and add it to the buffer
3. Sample a mini-batch of transitions from the buffer
4. Update the Q-network using the off-policy target described above
5. Update the policy to maximize the expected Q-value, using the gradient E_{a ~ π_θ}[ ∇_θ log π_θ(a | s) * Q(s, a) ]
6. Repeat steps 3-5 many times before collecting new experience
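The policy update step can be illustrated for a softmax policy over discrete actions at a single state, where the expectation over actions can be taken exactly rather than sampled. This is only a sketch of the policy step as described in these notes, not a full SAC implementation; the function names and learning rate are assumptions.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def policy_step(logits, q_values, lr=0.1):
    """One gradient-ascent step on the expected Q-value sum_a pi(a|s) Q(s, a)."""
    probs = softmax(logits)
    expected_q = probs @ q_values
    # Closed form of E_{a~pi}[ grad log pi(a|s) * Q(s, a) ] for softmax logits:
    # component a equals pi(a|s) * (Q(s, a) - expected_q).
    grad = probs * (q_values - expected_q)
    return logits + lr * grad

# Toy usage with hypothetical Q-values for two actions at one state.
logits = np.zeros(2)
q_values = np.array([0.0, 1.0])
for _ in range(200):
    logits = policy_step(logits, q_values, lr=1.0)
final_policy = softmax(logits)
```

Because the Q-values here come from the learned critic rather than from fresh rollouts, this step can be repeated many times on buffer samples between rounds of data collection.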
4.5 PPO vs. SAC: Head-to-Head Comparison
| Aspect | PPO | SAC |
|---|---|---|
| Policy Type | Mildly off-policy | Fully off-policy |
| Data Efficiency | Low | Extremely high |
| Stability | Excellent | Good but less than PPO |
| Hyperparameter Tuning | Very easy | Moderately difficult |
| Best For | Simulation environments, cheap data, large language models | Real-world systems, expensive data, robots |
| Industry Adoption | Extremely widespread | Growing rapidly for robotics |
Rules of thumb:
- Use PPO if you have unlimited simulated data and want something that just works
- Use SAC if you are working with real robots or any scenario where data collection is expensive
- PPO is the default choice for fine-tuning large language models with reinforcement learning
These are my structured study notes and in-depth interpretations, compiled while watching the open course. I hope this framework helps you master the material thoroughly, and I wish you continued progress in your studies.


