Home - Open Courses - Engineering and Technology - Information and Communication

Reinforcement Learning for Games: From TD-Learning to Simultaneous Games

This Stanford lecture teaches how to apply reinforcement learning and self-play to master complex games. It covers TD-learning for value estimation, historical game AI breakthroughs, and game theory for simultaneous zero-sum and non-zero-sum games.

By:

4 Views
May 22, 2026

One. Course Details

This lecture continues Stanford University's CME 296 coverage of game theory, bridging the gap between reinforcement learning and adversarial game playing. It builds directly on the previous lecture's introduction to minimax search, alpha-beta pruning, and hand-crafted evaluation functions, showing how we can learn evaluation functions automatically from experience.

The lecture follows a logical progression: it begins with a review of core reinforcement learning concepts (Q-values, V-values, SARSA), introduces temporal difference (TD) learning for value function estimation, explains the self-play paradigm for training game agents, presents historical breakthroughs from checkers to AlphaGo, and concludes with an in-depth exploration of simultaneous zero-sum and non-zero-sum games, including the minimax theorem and Nash equilibrium. The central thesis is that reinforcement learning provides a powerful framework for mastering complex games where exhaustive search is infeasible, and game theory offers rigorous solutions for strategic interactions.

Two. Key Learning Takeaways

We use reinforcement learning for games not because we don't know the rules, but because the number of states is exponentially large, making value iteration and exhaustive search computationally impossible.

TD learning estimates the state value function Vπ, which tells us how good a given state is under policy π, and is the natural counterpart to SARSA for Q-value estimation.

Self-play is the core training paradigm for game AI: both the agent and opponent use the same value function, with the agent maximizing and the opponent minimizing the game value.

In simultaneous zero-sum games, pure strategies are always exploitable, but optimal mixed strategies exist that cannot be beaten even if your opponent knows your strategy.

The von Neumann minimax theorem guarantees that for any finite two-player zero-sum simultaneous game, the value of going first equals the value of going second when using mixed strategies.

Nash equilibrium is the solution concept for non-zero-sum games, representing a stable state where no player can improve their outcome by unilaterally changing their strategy.

The prisoner's dilemma demonstrates a fundamental paradox: rational self-interested behavior can lead to a suboptimal outcome for all players.

Three. Course Gold Quotes

"TD learning is to Vπ as SARSA is to Qπ. That's the only difference you need to remember."

"We don't use reinforcement learning for games because we don't know the MDP. We know the MDP perfectly. We use it because there are just too many states to count."

"Self-play is elegant. You only need to learn one value function, and it tells you both how to play as the maximizer and the minimizer."

"Revealing your optimal mixed strategy doesn't hurt you. You can tell everyone exactly what probabilities you're using, and there's nothing they can do to beat you in expectation."

"The minimax theorem is one of the most beautiful results in game theory. It says that for zero-sum games, there's a single fair value of the game, no matter who goes first."

"Nash equilibrium isn't about what's best for everyone. It's about what's stable—no one wants to change their strategy once we're there."

"The prisoner's dilemma shows that sometimes, what's rational for the individual is terrible for the group. That's why we have contracts and institutions."

Four. Layered Learning Notes

Module 1: Reinforcement Learning Review and Motivation

The lecture opens with a review of core reinforcement learning concepts, focusing on the distinction between value-based methods and their application to games. We recall two key quantities:

Qπ(s, a): The expected utility of taking action a in state s and then following policy π
Vπ(s): The expected utility of following policy π from state s

While SARSA estimates Q-values, which directly give us a policy via argmax, TD learning estimates V-values. For games, this is sufficient because we know the transition function perfectly—given a state and action, we know exactly what the next state will be. This allows us to compute Q-values from V-values on the fly:

Q(s, a) = R(s, a) + γ V(s')

The critical motivation for using reinforcement learning in games is state space size. Games like chess and Go have exponentially many states, making exact dynamic programming infeasible. Reinforcement learning with function approximation allows us to generalize across similar states and learn effective policies without visiting every possible state.

Module 2: TD Learning for Game Evaluation

Temporal difference (TD) learning is the workhorse algorithm for learning value functions in games. It follows the same bootstrapping principle as SARSA but operates on state values rather than state-action values.

The TD learning update rule is:

Observe a transition (s, a, r, s')
Compute the target: target = r + γ V(s')
Update the value function: V(s) ← V(s) + η (target - V(s))

Where η is the learning rate and γ is the discount factor. For games, we typically use γ = 1 and only give a reward at the end of the game (+1 for winning, -1 for losing, 0 for draw).

When using function approximation (e.g., a neural network), we define a parameterized value function V(s; w) and minimize the squared loss between the prediction and the target:

Loss = (V(s; w) - target)²

We then take a gradient step to update the weights w, treating the target as a constant (detached from the computation graph) to stabilize training.

Module 3: Self-Play and Historical Breakthroughs

The most powerful training paradigm for game AI is self-play, where the agent plays against itself using the same value function. This works because:

The agent maximizes the value function when it is its turn to move
The opponent minimizes the value function when it is its turn to move

This simple idea has led to some of the most impressive achievements in AI history:

Arthur Samuel's Checkers (1959): The first self-playing game program. It used a linear value function with hand-crafted features and reached amateur-level play on 9 kilobytes of memory.
TD-Gammon (1992): Gerald Tesauro's backgammon program used a neural network with minimal feature engineering and learned entirely through self-play. It reached human expert level and discovered new opening strategies that changed how the game is played.
AlphaGo Zero (2017): The most famous example, which learned to play Go entirely through self-play without any human data. It used a deep neural network and Monte Carlo Tree Search, defeating the previous version of AlphaGo which had beaten world champion Lee Sedol.

A clear trend emerges: as compute power increases, we need less hand-engineered feature engineering and can rely more on raw self-play to learn effective strategies.

Module 4: Simultaneous Zero-Sum Games

The lecture then shifts from turn-based games to simultaneous games, where both players choose their actions at the same time. Classic examples include rock-paper-scissors and two-finger Morra.

Simultaneous games are represented using a payoff matrix V(a, b), which gives the utility for player A when player A chooses action a and player B chooses action b. Since the game is zero-sum, player B's utility is -V(a, b).

A pure strategy is a deterministic choice of a single action. A mixed strategy is a probability distribution over actions. For simultaneous games, pure strategies are always exploitable—if you always play rock, your opponent will always play paper.

The optimal solution comes from the von Neumann minimax theorem, which states that for any finite two-player zero-sum simultaneous game:

maxπA minπB V(πA, πB) = minπB maxπA V(πA, πB)

This means there exists a single value of the game, and both players can play optimally using mixed strategies. Crucially, revealing your optimal mixed strategy does not give your opponent an advantage—they cannot improve their expected outcome by changing their strategy.

For the two-finger Morra example from the lecture, the optimal strategy for player A is to play one finger with probability 7/12 and two fingers with probability 5/12, resulting in a game value of -1/12 for player A.

Module 5: Non-Zero-Sum Games and Nash Equilibrium

The final section introduces non-zero-sum games, where the sum of the players' utilities is not zero. This covers a much wider range of real-world interactions, from business negotiations to international relations.

The solution concept for non-zero-sum games is the Nash equilibrium, defined as a pair of strategies (πA*, πB*) such that:

Player A cannot improve their utility by changing from πA* while player B plays πB*
Player B cannot improve their utility by changing from πB* while player A plays πA*

Nash proved that every finite game with any number of players has at least one Nash equilibrium (in mixed strategies if necessary).

The most famous example of a non-zero-sum game is the prisoner's dilemma:

	B testifies	B refuses
A testifies	(-5, -5)	(0, -10)
A refuses	(-10, 0)	(-1, -1)

The only Nash equilibrium is for both players to testify, even though both would be better off if they both refused. This demonstrates a fundamental tension between individual rationality and collective welfare.

Wishing you great success as you explore the fascinating intersection of reinforcement learning and game theory. May you develop an intuitive understanding of how to build intelligent agents that can learn to master complex games and make optimal strategic decisions. Whether you go on to build game AI, study economic interactions, or design multi-agent systems, may these skills give you a powerful framework for reasoning about strategic behavior. Good luck with your studies and the upcoming lectures on Bayesian networks!

Bayesian Networks: Foundations of Probabilistic Reasoning Under Uncertainty

Turning Statistics Into Art: A Case Study of Chris Jordan’s Visual Critique of Consumer Culture