Bayesian Network Parameter Learning: From Counting to Expectation Maximization

One. Course Details

This lecture continues Stanford University's CME 296 coverage of Bayesian networks, focusing exclusively on parameter learning—the critical process of deriving probability tables from data. It builds directly on previous lectures covering Bayesian network structure, exact and approximate inference algorithms, and conditional independence properties.

The lecture follows a progressive, example-driven structure. It begins with a concise review of Bayesian network fundamentals and conditional independence, then introduces parameter learning in the fully observed setting using simple, intuitive examples. It progresses to more advanced topics including parameter sharing for reducing model complexity, Laplace smoothing for preventing overfitting, and finally the expectation maximization (EM) algorithm for handling partially observed data. The central thesis is that all Bayesian network parameter learning techniques derive from the maximum likelihood principle, with elegant adaptations for different data availability scenarios.

Two. Key Learning Takeaways

The parameters of a Bayesian network are exactly the local conditional probability tables associated with each node, defining the distribution of a variable given its parents.

In the fully observed setting, maximum likelihood estimation has a trivial closed-form solution: simply count occurrences of each variable-parent configuration and normalize to get probabilities.

Parameter sharing is a powerful technique that allows multiple nodes to use the same conditional distribution, drastically reducing the number of parameters and improving generalization with limited data.

Laplace smoothing addresses overfitting in small datasets by adding pseudocounts to all possible outcomes, preventing zero probability estimates for never-observed events.

The expectation maximization (EM) algorithm solves the chicken-and-egg problem of learning with hidden variables by alternating between guessing the values of unobserved variables and re-estimating parameters.

EM is guaranteed to monotonically increase the likelihood of the observed data but will only converge to a local maximum, making initialization critically important.

Conditional independence in Bayesian networks is determined by d-separation: two variables are independent given a set of evidence if all paths between them are blocked by observed nodes.

Three. Course Gold Quotes

"Everything we've done so far assumed probabilities just popped out of thin air. But that's not principled—we need to learn these parameters from data."

"Count and normalize. That's the theme of this entire lecture, and everything else is just variations on that theme."

"Maximum likelihood estimation is not some fancy black box. It just means find the parameters that make the data you saw most probable."

"If you only see two movie ratings and assign zero probability to all others, that's just being close-minded. Laplace smoothing gives you a more reasonable estimate."

"EM is magical because it lets you learn even when you're missing data. You guess what you don't know, then count what you guessed, then repeat."

"Parameter sharing is like passing an object by reference in programming. Change it once, and it changes everywhere it's used."

"EM will always converge, but it might converge to the wrong hill. You have to try multiple initializations to find the best solution."

Four. Layered Learning Notes

Module 1: Bayesian Network Review and Conditional Independence

The lecture opens with a quick recap of Bayesian network fundamentals. A Bayesian network is a directed acyclic graph (DAG) that compactly represents a joint distribution over a set of random variables. Each node stores a local conditional distribution P(Xᵢ | Parents(Xᵢ)), and the joint distribution is simply the product of these local distributions.

The lecture then revisits conditional independence with a clearer presentation of the d-separation criterion:

Two variables are conditionally independent given evidence C if every path between them is blocked
A path is blocked if it contains:
1. A chain structure X → C → Y where C is observed
2. A common cause structure C → X and C → Y where C is observed
3. A V-structure X → Z ← Y where neither Z nor any of its descendants are observed

This last case is particularly important and leads to the explaining away phenomenon: observing a common effect makes its previously independent causes become dependent.

Module 2: Parameter Learning in the Fully Observed Setting

The core of the lecture focuses on learning the parameters of a Bayesian network when we have complete data—every example assigns a value to every variable in the network.

The simplest case is a single node with no parents. To estimate its distribution:

Count the number of times each value appears in the training data
Normalize the counts by the total number of examples to get probabilities

For a node with parents, the process is almost identical:

For each possible combination of parent values, count how many times each child value occurs
Normalize each group of counts separately to get a conditional distribution for that parent configuration

This pattern generalizes to any Bayesian network structure. For every local conditional distribution P(X | Parents(X)):

Count occurrences of each (X, Parents(X)) tuple in the data
Normalize for each parent configuration to get probabilities

This simple count-and-normalize procedure is not just intuitive—it is exactly the maximum likelihood estimate (MLE) for the parameters. Maximum likelihood chooses the parameters that maximize the probability of observing the training data. For Bayesian networks, this optimization problem has a closed-form solution that requires no iterative gradient descent.

Module 3: Parameter Sharing

The lecture introduces parameter sharing as a crucial technique for reducing model complexity and improving generalization. By default, each node in a Bayesian network has its own independent conditional distribution table. However, in many cases, multiple nodes follow the same probabilistic relationship.

The classic example is a hidden Markov model (HMM):

We have a sequence of hidden states H₁, H₂, ..., Hₜ
We have a sequence of observations E₁, E₂, ..., Eₜ
All transition probabilities P(Hₜ | Hₜ₋₁) are identical across time steps
All emission probabilities P(Eₜ | Hₜ) are identical across time steps

Instead of learning 2T-1 separate distributions, we only need to learn 3: the initial state distribution, the transition distribution, and the emission distribution. This drastically reduces the number of parameters and allows us to learn from sequences of arbitrary length.

Parameter sharing involves a fundamental tradeoff:

Fewer parameters require less data to estimate and generalize better
More parameters provide greater flexibility to model nuanced differences
The choice depends on how similar the nodes are and how much data you have

Module 4: Laplace Smoothing

Maximum likelihood estimation works well with large datasets but suffers from severe overfitting when data is scarce. If a particular outcome never appears in the training data, MLE assigns it a probability of zero, which is often overly confident and incorrect.

Laplace smoothing solves this problem by adding a small pseudocount λ to every possible outcome before normalizing. For a variable with k possible values:

Add λ to each count
Normalize by (total count + λ × k)

This has several desirable properties:

As λ → 0, we recover the original maximum likelihood estimates
As λ → ∞, we approach a uniform distribution
With more data, the effect of the pseudocounts becomes negligible

A common choice is λ=1, which corresponds to adding one imaginary observation of each possible outcome. This prevents zero probabilities while still giving more weight to actually observed events.

Module 5: Expectation Maximization (EM) Algorithm

The final and most advanced topic is learning with partially observed data, where some variables are hidden or missing in the training examples. This breaks the simple count-and-normalize approach because we don't know the values of the hidden variables to count them.

The EM algorithm solves this problem through an iterative two-step process:

E-step (Expectation): Using the current parameter estimates, compute the posterior distribution over the hidden variables given the observed variables. This produces a set of weighted complete assignments, where each possible value of the hidden variable is weighted by its probability.
M-step (Maximization): Treat the weighted assignments as if they were fully observed data. Count and normalize using the weights instead of 1s to get new parameter estimates.

The algorithm repeats these two steps until convergence. Each iteration is guaranteed to increase the likelihood of the observed data, and the process will eventually converge to a local maximum.

Important properties of EM:

It only finds local maxima, so multiple random initializations are recommended
It cannot recover the absolute labels of hidden variables, only their relative clustering
It is widely used in unsupervised learning, clustering, and many other machine learning applications
Wishing you great success as you master the art of probabilistic modeling. May you develop an intuitive understanding of how to turn raw data into meaningful probability distributions, and may these skills serve you well in all your future machine learning endeavors. Whether you go on to build complex graphical models, work with time series data, or tackle unsupervised learning problems, may you always approach each challenge with curiosity and rigor. Good luck with your studies and the upcoming lectures on logical reasoning!

Video Source and Usage Instructions

Video Title: Stanford CS221 | Autumn 2025 | Lecture 14: Bayesian Networks and Learning
• Course Series: Stanford CS221 | Autumn 2025 | Lecture 14: Bayesian Networks and Learning
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/4d9V6Sxa6gU?si=X6ggfKwsM_1uOQ5r

Information About Website Advertising

This site is a non-profit educational sharing platform. The advertisements displayed on the pages are solely intended to cover basic operational costs such as server maintenance, bandwidth, and content upkeep. We do not generate any form of commercial profit from the video content, nor do we charge any fees for the original video content.

Copyright and Compliance Statement

1. We have preserved the original video in its entirety without making any modifications, edits, or alterations to the course content, in order to ensure the authenticity and integrity of the academic material.
2. All copyrights and intellectual property rights related to this video belong to the original author and Stanford. This repost strictly adheres to Creative Commons license and is intended solely for educational, research, and personal communication purposes.
3. If the original copyright holder believes this repost infringes upon your legitimate rights and interests, or if you have any objections to the operation of this site, please contact us through the website. We will remove the relevant content as soon as possible upon receiving notification.

1.If you have any questions, please email us.：[gwang4821@gmail.com]
2. You can also go directly to the Feedback Center,Feedback
3. We will address your feedback immediately upon receipt.