One. Course Details
This lecture continues Stanford University's CME 296 coverage of Bayesian networks, focusing exclusively on parameter learning—the critical process of deriving probability tables from data. It builds directly on previous lectures covering Bayesian network structure, exact and approximate inference algorithms, and conditional independence properties.
The lecture follows a progressive, example-driven structure. It begins with a concise review of Bayesian network fundamentals and conditional independence, then introduces parameter learning in the fully observed setting using simple, intuitive examples. It progresses to more advanced topics including parameter sharing for reducing model complexity, Laplace smoothing for preventing overfitting, and finally the expectation maximization (EM) algorithm for handling partially observed data. The central thesis is that all Bayesian network parameter learning techniques derive from the maximum likelihood principle, with elegant adaptations for different data availability scenarios.
Two. Key Learning Takeaways
The parameters of a Bayesian network are exactly the local conditional probability tables associated with each node, defining the distribution of a variable given its parents.
In the fully observed setting, maximum likelihood estimation has a trivial closed-form solution: simply count occurrences of each variable-parent configuration and normalize to get probabilities.
Parameter sharing is a powerful technique that allows multiple nodes to use the same conditional distribution, drastically reducing the number of parameters and improving generalization with limited data.
Laplace smoothing addresses overfitting in small datasets by adding pseudocounts to all possible outcomes, preventing zero probability estimates for never-observed events.
The expectation maximization (EM) algorithm solves the chicken-and-egg problem of learning with hidden variables by alternating between guessing the values of unobserved variables and re-estimating parameters.
EM is guaranteed to monotonically increase the likelihood of the observed data but will only converge to a local maximum, making initialization critically important.
Conditional independence in Bayesian networks is determined by d-separation: two variables are independent given a set of evidence if all paths between them are blocked by observed nodes.
Three. Course Gold Quotes
"Everything we've done so far assumed probabilities just popped out of thin air. But that's not principled—we need to learn these parameters from data."
"Count and normalize. That's the theme of this entire lecture, and everything else is just variations on that theme."
"Maximum likelihood estimation is not some fancy black box. It just means find the parameters that make the data you saw most probable."
"If you only see two movie ratings and assign zero probability to all others, that's just being close-minded. Laplace smoothing gives you a more reasonable estimate."
"EM is magical because it lets you learn even when you're missing data. You guess what you don't know, then count what you guessed, then repeat."
"Parameter sharing is like passing an object by reference in programming. Change it once, and it changes everywhere it's used."
"EM will always converge, but it might converge to the wrong hill. You have to try multiple initializations to find the best solution."
Four. Layered Learning Notes
Module 1: Bayesian Network Review and Conditional Independence
The lecture opens with a quick recap of Bayesian network fundamentals. A Bayesian network is a directed acyclic graph (DAG) that compactly represents a joint distribution over a set of random variables. Each node stores a local conditional distribution P(Xᵢ | Parents(Xᵢ)), and the joint distribution is simply the product of these local distributions.
The lecture then revisits conditional independence with a clearer presentation of the d-separation criterion:
-
Two variables are conditionally independent given evidence C if every path between them is blocked
-
A path is blocked if it contains:
-
A chain structure X → C → Y where C is observed
-
A common cause structure C → X and C → Y where C is observed
-
A V-structure X → Z ← Y where neither Z nor any of its descendants are observed
-
This last case is particularly important and leads to the explaining away phenomenon: observing a common effect makes its previously independent causes become dependent.
Module 2: Parameter Learning in the Fully Observed Setting
The core of the lecture focuses on learning the parameters of a Bayesian network when we have complete data—every example assigns a value to every variable in the network.
The simplest case is a single node with no parents. To estimate its distribution:
-
Count the number of times each value appears in the training data
-
Normalize the counts by the total number of examples to get probabilities
For a node with parents, the process is almost identical:
-
For each possible combination of parent values, count how many times each child value occurs
-
Normalize each group of counts separately to get a conditional distribution for that parent configuration
This pattern generalizes to any Bayesian network structure. For every local conditional distribution P(X | Parents(X)):
-
Count occurrences of each (X, Parents(X)) tuple in the data
-
Normalize for each parent configuration to get probabilities
This simple count-and-normalize procedure is not just intuitive—it is exactly the maximum likelihood estimate (MLE) for the parameters. Maximum likelihood chooses the parameters that maximize the probability of observing the training data. For Bayesian networks, this optimization problem has a closed-form solution that requires no iterative gradient descent.
Module 3: Parameter Sharing
The lecture introduces parameter sharing as a crucial technique for reducing model complexity and improving generalization. By default, each node in a Bayesian network has its own independent conditional distribution table. However, in many cases, multiple nodes follow the same probabilistic relationship.
The classic example is a hidden Markov model (HMM):
-
We have a sequence of hidden states H₁, H₂, ..., Hₜ
-
We have a sequence of observations E₁, E₂, ..., Eₜ
-
All transition probabilities P(Hₜ | Hₜ₋₁) are identical across time steps
-
All emission probabilities P(Eₜ | Hₜ) are identical across time steps
Instead of learning 2T-1 separate distributions, we only need to learn 3: the initial state distribution, the transition distribution, and the emission distribution. This drastically reduces the number of parameters and allows us to learn from sequences of arbitrary length.
Parameter sharing involves a fundamental tradeoff:
-
Fewer parameters require less data to estimate and generalize better
-
More parameters provide greater flexibility to model nuanced differences
-
The choice depends on how similar the nodes are and how much data you have
Module 4: Laplace Smoothing
Maximum likelihood estimation works well with large datasets but suffers from severe overfitting when data is scarce. If a particular outcome never appears in the training data, MLE assigns it a probability of zero, which is often overly confident and incorrect.
Laplace smoothing solves this problem by adding a small pseudocount λ to every possible outcome before normalizing. For a variable with k possible values:
-
Add λ to each count
-
Normalize by (total count + λ × k)
This has several desirable properties:
-
As λ → 0, we recover the original maximum likelihood estimates
-
As λ → ∞, we approach a uniform distribution
-
With more data, the effect of the pseudocounts becomes negligible
A common choice is λ=1, which corresponds to adding one imaginary observation of each possible outcome. This prevents zero probabilities while still giving more weight to actually observed events.
Module 5: Expectation Maximization (EM) Algorithm
The final and most advanced topic is learning with partially observed data, where some variables are hidden or missing in the training examples. This breaks the simple count-and-normalize approach because we don't know the values of the hidden variables to count them.
The EM algorithm solves this problem through an iterative two-step process:
-
E-step (Expectation): Using the current parameter estimates, compute the posterior distribution over the hidden variables given the observed variables. This produces a set of weighted complete assignments, where each possible value of the hidden variable is weighted by its probability.
-
M-step (Maximization): Treat the weighted assignments as if they were fully observed data. Count and normalize using the weights instead of 1s to get new parameter estimates.
The algorithm repeats these two steps until convergence. Each iteration is guaranteed to increase the likelihood of the observed data, and the process will eventually converge to a local maximum.
Important properties of EM:
-
It only finds local maxima, so multiple random initializations are recommended
-
It cannot recover the absolute labels of hidden variables, only their relative clustering
-
It is widely used in unsupervised learning, clustering, and many other machine learning applications
-
Wishing you great success as you master the art of probabilistic modeling. May you develop an intuitive understanding of how to turn raw data into meaningful probability distributions, and may these skills serve you well in all your future machine learning endeavors. Whether you go on to build complex graphical models, work with time series data, or tackle unsupervised learning problems, may you always approach each challenge with curiosity and rigor. Good luck with your studies and the upcoming lectures on logical reasoning!


