One. Course Details
This is the introductory lecture on Bayesian networks in Stanford University's CME 296 course, marking a shift from reactive machine learning models to model-based reasoning under uncertainty. It builds on previous units covering search, Markov decision processes, and games, introducing graphical models as a powerful framework for representing and reasoning about complex real-world systems.
The lecture follows a carefully structured progression: it begins with a review of fundamental probability operations and their tensor implementation using einops, then introduces the four-step procedure for building Bayesian networks through two concrete examples (the burglar-alarm system and medical diagnosis), explains the counterintuitive explaining away phenomenon, and concludes with an introduction to probabilistic programming and rejection sampling for approximate inference. The central thesis is that Bayesian networks provide a modular, interpretable way to decompose large joint distributions into manageable local components, enabling rigorous reasoning about uncertainty.
Two. Key Learning Takeaways
Bayesian networks are directed acyclic graphs (DAGs) that compactly represent joint distributions over random variables as the product of local conditional probability tables.
All probability operations reduce to two fundamental actions: marginalization (collapsing over irrelevant variables) and conditioning (selecting assignments that match observed evidence).
Probability tables are tensors, and all standard probability computations can be elegantly expressed using einops notation for efficient implementation.
The four-step Bayesian network construction process is: define variables, add directed edges representing direct influence, specify local conditional distributions for each node, and multiply them to get the joint distribution.
Explaining away is a characteristic reasoning pattern of Bayesian networks: observing a common effect makes its otherwise independent causes dependent, so that evidence for one cause lowers the probability of the other.
Probabilistic inference is analogous to database queries: you condition on evidence, marginalize out irrelevant variables, and normalize to get the conditional distribution of query variables.
Probabilistic programs are executable definitions of distributions: a program that generates samples from a distribution formally defines that distribution.
Rejection sampling is a universal approximate inference algorithm that works for any Bayesian network but becomes inefficient when the evidence is rare.
Three. Course Gold Quotes
"The joint distribution is like a SQL database, and probabilistic inference is essentially doing SQL queries on that database."
"Model-free methods are direct and cheaper, which is why most practical applications use them. But model-based methods are more flexible—you can change the reward function on the fly and recompute the optimal policy."
"Explaining away is magical. Two causes that are completely independent become dependent the second you observe their common effect."
"Before Bayesian networks, people used all sorts of inconsistent heuristics to reason about uncertainty. Only when you lay mathematical clarity to these problems do they start making sense."
"You have one local conditional distribution per node, not per edge. The parents always act as a unit—they are married, so you can't split them apart."
"A probabilistic program is just a program that returns a sample. If samples from it follow a distribution, then that program is as good a definition of the distribution as any mathematical formula."
"Rejection sampling is mathematically perfect and will converge to the true answer in the limit. The only problem is that you might have to wait until the heat death of the universe to get enough samples if your evidence is rare."
Four. Layered Learning Notes
Module 1: Probability Fundamentals and Tensor Representation
The lecture opens with a review of basic probability concepts, framed through the lens of representing states of the world. A joint distribution assigns a probability to every possible assignment of values to a set of random variables, representing how likely each state of the world is.
Two core operations form the basis of all probabilistic reasoning:
- Marginalization: Collapses a joint distribution over a subset of variables by summing out the variables you want to ignore. For example, to get the probability of sunshine regardless of rain, you add the probabilities of (sunny, rainy) and (sunny, not rainy).
- Conditioning: Updates your beliefs based on observed evidence. You first select all assignments that match the evidence, then renormalize their probabilities so they sum to one.
A key insight is that all probability tables are tensors, and both marginalization and conditioning can be implemented concisely using einops notation. This provides a unified, efficient way to perform probabilistic computations that scales to multiple variables.
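As a rough illustration of the two operations on a tensor, the sketch below uses NumPy's `np.einsum`, whose index-string notation is in the same spirit as einops; it is a stand-in, not the lecture's own einops code. The table values and the variable names `S` (sunshine) and `R` (rain) are made up for this example.

```python
import numpy as np

# Hypothetical joint table P(S, R); the numbers are illustrative only.
# Axis 0 indexes S (0 = not sunny, 1 = sunny).
# Axis 1 indexes R (0 = no rain, 1 = rain).
joint = np.array([[0.20, 0.30],
                  [0.45, 0.05]])

# Marginalization: sum out R to obtain P(S).
p_s = np.einsum("sr->s", joint)

# Conditioning on R = 1: select the matching slice, then renormalize.
selected = joint[:, 1]
p_s_given_rain = selected / selected.sum()

print("P(S):", p_s)
print("P(S | R=1):", p_s_given_rain)
```

The string `"sr->s"` names the axes of the input and lists only the axes to keep, so any axis missing from the right-hand side is summed out. Conditioning is plain indexing followed by a division.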
Module 2: Bayesian Network Construction
Bayesian networks solve the problem of representing large joint distributions. An explicit table over n binary variables needs 2^n entries, so it becomes intractable beyond a handful of variables. Instead of writing down every possible assignment explicitly, you decompose the joint distribution into a product of local conditional distributions using a graph structure.
The four-step construction process is:
- Define variables: Identify the random variables that represent the attributes of the world you care about.
- Add directed edges: Draw edges from cause variables to effect variables, representing direct probabilistic influence.
- Specify local conditional distributions: For each node, write a probability table that defines the distribution of the node given every possible combination of its parents' values.
- Compute the joint distribution: Multiply all the local conditional distributions together to get the full joint distribution over all variables.
This modular approach makes it much easier to build and modify large models, as you only need to specify local relationships rather than the entire global distribution at once.
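The four steps can be sketched in plain Python for the burglar-alarm example, assuming two independent causes that each occur with probability 5% and an alarm that sounds exactly when either cause is present. The function names (`p_b`, `p_e`, `p_a_given_be`) and the dictionary layout are choices made for this illustration, not the lecture's code.

```python
from itertools import product

# Step 1: variables B (burglary), E (earthquake), A (alarm), each 0 or 1.
# Step 2: edges B -> A and E -> A, so A has parents (B, E).

# Step 3: one local conditional distribution per node.
def p_b(b):
    return 0.05 if b == 1 else 0.95

def p_e(e):
    return 0.05 if e == 1 else 0.95

def p_a_given_be(a, b, e):
    # Deterministic OR: the alarm sounds exactly when B or E is 1.
    return 1.0 if a == int(b or e) else 0.0

# Step 4: multiply the local distributions to get the joint P(B, E, A).
joint = {
    (b, e, a): p_b(b) * p_e(e) * p_a_given_be(a, b, e)
    for b, e, a in product([0, 1], repeat=3)
}

for assignment, prob in sorted(joint.items()):
    print(assignment, round(prob, 4))
```

Note that the alarm node gets a single table indexed by both parents together, rather than one table per incoming edge.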
Module 3: Explaining Away Phenomenon
The most counterintuitive and important property of Bayesian networks is explaining away. It arises when two independent causes share a common effect and that effect is observed: conditioning on the effect makes the causes dependent.
Using the classic burglar-alarm example:
- Burglaries and earthquakes are independent events, each with a 5% probability.
- The alarm goes off if either a burglary or an earthquake occurs.
- If you observe the alarm going off, the probability of a burglary rises from 5% to about 51%.
- If you then learn there was an earthquake, the probability of a burglary drops back down to 5%.
The earthquake explains away the alarm, reducing the need to invoke the burglary as an explanation. This reasoning pattern is extremely common in real-world diagnosis and decision-making, and Bayesian networks capture it naturally through the mathematics of probability.
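The numbers in this example can be checked by direct enumeration. The sketch below assumes the model exactly as stated (independent 5% causes, alarm as a deterministic OR); the helper name `joint` and the variable names are illustrative.

```python
from itertools import product

def joint(b, e, a):
    """P(B=b, E=e, A=a) for the burglar-alarm network."""
    p_b = 0.05 if b else 0.95
    p_e = 0.05 if e else 0.95
    p_a = 1.0 if a == int(b or e) else 0.0
    return p_b * p_e * p_a

# P(B=1 | A=1): sum out E, then divide by P(A=1).
p_alarm = sum(joint(b, e, 1) for b, e in product([0, 1], repeat=2))
p_burglary_given_alarm = sum(joint(1, e, 1) for e in [0, 1]) / p_alarm

# P(B=1 | A=1, E=1): the earthquake is now part of the evidence.
p_alarm_and_quake = sum(joint(b, 1, 1) for b in [0, 1])
p_burglary_given_alarm_and_quake = joint(1, 1, 1) / p_alarm_and_quake

print(round(p_burglary_given_alarm, 4))
print(round(p_burglary_given_alarm_and_quake, 4))
```

In closed form, P(B=1 | A=1) = 0.05 / (1 - 0.95^2) = 0.05 / 0.0975, which is about 0.513. Once the earthquake is known, the alarm carries no further information about the burglary, so the posterior returns to the 5% prior.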
Module 4: Probabilistic Inference
Probabilistic inference is the process of answering questions about the world using your Bayesian network. Any inference query follows the same three-step procedure:
- Select evidence: Slice the joint distribution to only include assignments that match the observed evidence.
- Marginalize irrelevant variables: Sum out all variables that are not in the query or the evidence.
- Normalize: Divide by the probability of the evidence to get a valid conditional distribution over the query variables.
While this procedure is conceptually simple, it can be computationally expensive for large networks. Exact inference by brute-force enumeration sums over every possible assignment, and the number of assignments grows exponentially with the number of variables. This motivates the development of approximate inference methods.
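The three-step procedure can be sketched as a small brute-force routine over the burglar-alarm network (independent 5% causes, alarm as a deterministic OR). The function `infer`, the `VARIABLES` tuple, and the dictionary-based evidence format are assumptions of this illustration. The loop visits all 2^n assignments, which is exactly the exponential cost described above.

```python
from itertools import product

VARIABLES = ("B", "E", "A")

def joint_prob(assignment):
    """Joint probability of one full assignment {"B": b, "E": e, "A": a}."""
    b, e, a = assignment["B"], assignment["E"], assignment["A"]
    p_b = 0.05 if b else 0.95
    p_e = 0.05 if e else 0.95
    p_a = 1.0 if a == int(b or e) else 0.0
    return p_b * p_e * p_a

def infer(query_var, evidence):
    """Return P(query_var | evidence) as a dict {value: probability}."""
    weights = {0: 0.0, 1: 0.0}
    for values in product([0, 1], repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        # Step 1: keep only assignments consistent with the evidence.
        if any(assignment[var] != val for var, val in evidence.items()):
            continue
        # Step 2: marginalize by accumulating over all non-query variables.
        weights[assignment[query_var]] += joint_prob(assignment)
    # Step 3: normalize by the probability of the evidence.
    p_evidence = sum(weights.values())
    return {value: w / p_evidence for value, w in weights.items()}

posterior = infer("B", {"A": 1})
print(posterior)
```

Calling `infer("B", {"A": 1, "E": 1})` answers the second query from the explaining-away example with the same code, since only the evidence dictionary changes.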
Module 5: Probabilistic Programming and Rejection Sampling
Probabilistic programming provides an alternative way to define Bayesian networks using executable code instead of mathematical formulas. A probabilistic program is simply a function that generates samples from the joint distribution of the model.
This representation naturally leads to the rejection sampling algorithm for approximate inference:
- Generate a large number of samples from the probabilistic program.
- Reject all samples that do not match the observed evidence.
- Estimate the query distribution by counting the frequency of each query value in the remaining samples.
Rejection sampling is extremely general and works for any Bayesian network that can be written as a probabilistic program. However, it suffers from a critical limitation: the fraction of samples kept is roughly the probability of the evidence, so if the evidence is rare, almost all samples will be rejected, making the algorithm prohibitively slow.
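A minimal sketch of both ideas, using only Python's standard `random` module rather than any particular probabilistic programming library: `sample_network` is a probabilistic program for the burglar-alarm network (independent 5% causes, alarm as a deterministic OR), and `rejection_sample` estimates P(B=1 | A=1) from it. The function names, the sample count, and the fixed seed are choices made for this illustration.

```python
import random

def sample_network(rng):
    """Probabilistic program: return one sample (b, e, a) from the joint."""
    b = int(rng.random() < 0.05)
    e = int(rng.random() < 0.05)
    a = int(b or e)
    return b, e, a

def rejection_sample(num_samples, seed=0):
    """Estimate P(B=1 | A=1) and report how many samples were kept."""
    rng = random.Random(seed)
    kept = 0
    burglaries = 0
    for _ in range(num_samples):
        b, e, a = sample_network(rng)
        if a != 1:
            continue  # reject: sample does not match the evidence A = 1
        kept += 1
        burglaries += b
    estimate = burglaries / kept if kept else float("nan")
    return estimate, kept

estimate, kept = rejection_sample(200_000)
print(f"estimate={estimate:.3f}, kept {kept} of 200000 samples")
```

The estimate should land near the exact value 0.05 / 0.0975 (about 0.513), while only around 10% of the samples survive, because P(A=1) = 0.0975. With rarer evidence the kept fraction shrinks in proportion, which is the inefficiency noted above.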


