One. Course Details
This lecture continues Stanford University's CME 296 coverage of Bayesian networks, focusing on advanced probabilistic inference techniques and conditional independence properties. It builds directly on the previous lecture's introduction to Bayesian network structure and rejection sampling, providing a deeper dive into more efficient approximate inference methods and the fundamental graph-theoretic properties that make Bayesian networks powerful.
The lecture follows a logical progression: it begins with a comprehensive review of Bayesian network fundamentals and exact inference, then introduces rejection sampling and its limitations, transitions to Gibbs sampling as a more efficient alternative, explains the Markov blanket optimization for Gibbs sampling, and concludes with an in-depth exploration of conditional independence in graphical models. The central thesis is that approximate inference algorithms like Gibbs sampling allow us to scale Bayesian networks to real-world problems, while conditional independence provides the theoretical foundation for understanding and optimizing these models.
Two. Key Learning Takeaways
A Bayesian network compactly represents a joint distribution as the product of local conditional probability tables, one for each node given its parents.
Exact inference is computationally intractable for large networks because the number of possible assignments grows exponentially with the number of variables.
Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm that generates samples by iteratively updating one variable at a time conditioned on all other variables.
The Markov blanket of a node consists of its parents, children, and co-parents, and is the only information needed to sample that node in Gibbs sampling.
Rejection sampling and Gibbs sampling have complementary weaknesses: rejection sampling fails for rare evidence, while Gibbs sampling struggles with highly correlated variables.
Conditional independence in Bayesian networks is a graph property: two variables are independent given evidence if all paths between them are blocked in the processed graph.
The explaining away phenomenon occurs when observing a common effect makes its previously independent causes become dependent.
Three. Course Gold Quotes
"Rejection sampling is simple and works in the limit, but if your evidence is rare, you'll be sitting around rejecting samples forever."
"Gibbs sampling is incremental. Instead of throwing everything away and starting fresh, you gradually modify your sample while always satisfying the evidence."
"The Markov blanket is beautiful. You don't need to look at the entire network to sample a node—just its immediate neighborhood."
"Rejection sampling hates rare events. Gibbs sampling hates highly correlated variables. And the bad news is that most real-world problems have both."
"Conditional independence is what makes Bayesian networks useful. It lets us break a huge joint distribution into small, manageable pieces."
"Two variables that are marginally independent can become dependent once you observe their common effect. That's explaining away, and it's one of the most counterintuitive but important concepts in probability."
"A Bayesian network is both a probabilistic object and a combinatorial object. The graph structure tells us everything about the independence properties of the distribution."
Four. Layered Learning Notes
Module 1: Bayesian Network Fundamentals Review
The lecture opens with a thorough review of Bayesian network basics. A Bayesian network is a directed acyclic graph (DAG) where each node represents a random variable, and edges represent direct probabilistic dependencies. The joint distribution over all variables is simply the product of the local conditional distributions:
P(X₁, X₂, ..., Xₙ) = ∏ P(Xᵢ | Parents(Xᵢ))
Using the classic alarm example with burglary (B), earthquake (E), and alarm (A) variables, the joint distribution is P(B) × P(E) × P(A | B, E). Each local conditional distribution is represented as a table specifying the probability of each value of the node given every possible combination of parent values.
Exact inference in Bayesian networks involves three steps:
-
Condition on evidence: Slice the joint distribution to only include assignments that match the observed evidence
-
Marginalize out irrelevant variables: Sum over all values of variables not in the query or evidence
-
Normalize: Divide by the probability of the evidence to get a valid conditional distribution
While exact inference works for small networks, it becomes computationally infeasible for large networks with hundreds of variables, as the number of possible assignments grows exponentially. This necessitates approximate inference methods.
Module 2: Rejection Sampling and Its Limitations
Rejection sampling is the simplest approximate inference algorithm. The algorithm works as follows:
-
Run the probabilistic program corresponding to the Bayesian network to generate a complete sample
-
Check if the sample satisfies the evidence
-
If it does, keep the sample; if not, reject it
-
After generating enough samples, estimate the query distribution by counting the frequency of each query value in the kept samples
The key advantage of rejection sampling is its simplicity and correctness in the limit. As the number of samples approaches infinity, the estimated distribution converges to the true distribution.
However, rejection sampling has a critical flaw: it is extremely inefficient when the evidence is rare. If the probability of the evidence is 10⁻³⁰, you would need to generate an astronomically large number of samples to get even one that satisfies the evidence. This makes rejection sampling impractical for most real-world problems where evidence is often unlikely.
Module 3: Gibbs Sampling Fundamentals
Gibbs sampling addresses the limitations of rejection sampling by always working with samples that satisfy the evidence. It is a type of Markov chain Monte Carlo (MCMC) algorithm, meaning it generates a sequence of samples where each sample depends only on the previous one.
The Gibbs sampling algorithm proceeds as follows:
-
Initialize: Start with an arbitrary assignment to all variables that satisfies the evidence
-
Iterate: For each non-evidence variable in turn:a. Compute the conditional distribution of the variable given the current values of all other variablesb. Sample a new value for the variable from this conditional distribution
-
Collect samples: After a burn-in period, collect samples to estimate the query distribution
Unlike rejection sampling, Gibbs sampling never rejects any samples. All samples are used to estimate the distribution, making it much more efficient for problems with rare evidence.
However, Gibbs sampling has its own limitations. The samples are not independent—each sample is correlated with the previous one. This means you need more samples to get the same level of accuracy as you would with independent samples from rejection sampling.
Module 4: Markov Blanket Optimization
A major optimization for Gibbs sampling comes from the concept of the Markov blanket. When sampling a variable, you do not need to consider the entire network—only the variables in its Markov blanket.
The Markov blanket of a node X consists of exactly three sets of nodes:
-
The parents of X
-
The children of X
-
The co-parents of X (other parents of X's children)
This is because the conditional distribution P(X | all other variables) depends only on the local conditional distributions that involve X:
-
P(X | Parents(X)): The distribution of X given its parents
-
P(Y | Parents(Y)) for each child Y of X: The distribution of each child given its parents (which include X)
All other variables in the network are conditionally independent of X given its Markov blanket. This optimization drastically reduces the computational cost of Gibbs sampling, especially for large networks with sparse connections.
Module 5: Comparative Analysis of Sampling Methods
Rejection sampling and Gibbs sampling have complementary strengths and weaknesses, making them suitable for different types of problems:
|
Aspect |
Rejection Sampling |
Gibbs Sampling |
|---|---|---|
|
Sample independence |
Fully independent samples |
Correlated samples |
|
Evidence handling |
Fails for rare evidence |
Works well for rare evidence |
|
Correlated variables |
Works well |
Gets stuck or mixes slowly |
|
Implementation complexity |
Extremely simple |
Moderately complex |
|
Convergence speed |
Fast for common evidence |
Can be slow to mix |
The lecture provides extreme examples to illustrate these differences:
-
Rejection sampling worst case: Evidence with probability 10⁻⁵. Rejection sampling will reject 99,999 out of every 100,000 samples.
-
Gibbs sampling worst case: Two perfectly correlated variables (A = B). Gibbs sampling will get stuck in whatever state it starts in and never explore the other state.
In practice, Gibbs sampling is more widely used than rejection sampling because rare evidence is a more common problem than perfectly correlated variables in most real-world applications.
Module 6: Conditional Independence
The final section of the lecture introduces conditional independence, a fundamental property of Bayesian networks that connects the graph structure to the probabilistic properties of the distribution.
Two variables A and B are independent if P(A, B) = P(A) × P(B) for all values of A and B. They are conditionally independent given C if P(A, B | C) = P(A | C) × P(B | C) for all values of A, B, and C.
The lecture presents four basic graph structures and their independence properties:
-
No edge between A and B: A and B are marginally independent
-
Chain A → C → B: A and B are marginally dependent but conditionally independent given C
-
Common cause C → A and C → B: A and B are marginally dependent but conditionally independent given C
-
Common effect A → C ← B: A and B are marginally independent but conditionally dependent given C (explaining away)
The last case is the most counterintuitive and gives rise to the explaining away phenomenon. For example, burglary and earthquake are marginally independent, but once you observe that the alarm has gone off, they become dependent—learning that there was an earthquake reduces the probability that there was a burglary.
To determine if two variables are independent given a set of evidence in an arbitrary Bayesian network, you can use the following algorithm:
-
Shade in all evidence nodes
-
Recursively remove any unshaded leaf nodes
-
"Marry" the parents of each node (connect them with undirected edges)
-
Check if there is any path between the two variables that does not pass through a shaded node
If no such path exists, the variables are conditionally independent given the evidence.
Wishing you great success as you continue your journey into probabilistic modeling and graphical models. May you develop an intuitive understanding of how to apply these powerful techniques to real-world problems, and may you always appreciate the elegant connection between graph theory and probability theory. Whether you go on to build complex Bayesian models, work with time series data, or tackle challenging inference problems, may you approach each challenge with curiosity and rigor. Good luck with your studies and the upcoming lecture on Bayesian network parameter learning!


