One. Course Details
This is the opening lecture of Stanford University's CME 296: Diffusion and Large Vision Models, taught by twin brothers Afshine and Shervine Amidi. The two instructors have closely parallel academic and industry backgrounds: both graduated from Centrale Paris in France, pursued graduate studies at MIT and Stanford ICME respectively, and worked at Uber and Google before joining Netflix.
The course has two core goals: first, to build a deep intuitive and mathematical understanding of modern image generation paradigms, and second, to teach how these models are trained, optimized, and evaluated in practice. It is designed for students interested in generative AI research, industry careers, or personal projects, with strong prerequisites in linear algebra, probability theory, differential equations, and basic machine learning.
Logistically, this is a two-unit class held every Friday from three thirty to five twenty Pacific Time. It features no homework assignments but includes two ninety-minute pen-and-paper exams focused on core concepts and intuition rather than detailed derivations. All course materials—including slides, recordings, and a four-page condensed cheat sheet—are posted on the class website, with slides released the Thursday evening before each lecture.
This first lecture lays the foundation for the entire course by introducing Denoising Diffusion Probabilistic Models (DDPM), the breakthrough 2020 paper that revolutionized image generation. It covers the core intuition of diffusion, the forward noise process, the variational derivation of the training objective, and standard DDPM training and inference, and concludes with DDIM, a technique that speeds up generation by 10 to 50 times.
Two. Key Learning Takeaways
The image generation paradigms covered in this course all start from Gaussian noise and gradually refine it into realistic images, leveraging noise for both tractability and generation diversity.
Diffusion models operate through two complementary processes: a predefined forward process that gradually adds Gaussian noise to clean images, and a learned reverse process that removes noise step-by-step.
The forward diffusion process is variance-preserving (VP), meaning the variance of the noisy image remains approximately one at every time step when inputs are normalized.
Direct maximum likelihood estimation for diffusion is intractable, so we optimize the Evidence Lower Bound (ELBO) instead, which simplifies to a trivial L2 regression loss.
The final DDPM training objective requires only predicting the Gaussian noise added to a clean image, making it extremely stable and easy to implement.
Standard DDPM inference requires 1000 sequential steps, which is computationally expensive for real-world applications.
Denoising Diffusion Implicit Models (DDIM) eliminate intermediate stochasticity to enable fast deterministic sampling, using the exact same trained DDPM model without retraining.
DDIM achieves a 10 to 50 times speedup with minimal quality degradation, making diffusion models practical for production use cases.
Three. Course Gold Quotes
"Image generation is like sculpting: you start with a block of rock—noise—and carve away the excess to reveal the image inside."
"We don't generate images out of thin air. We start from noise because it's easy to sample, it gives us diversity, and it has beautiful mathematical properties."
"All the complexity of diffusion melts away into a single L2 loss. You just add noise and learn to predict what you added. That's the magic of DDPM."
"The forward process is the only thing we define. Everything else—the reverse process, the loss function—flows naturally from that single choice."
"DDPM gave us amazing quality, but it was too slow. DDIM didn't change the model—it just changed how we use it. Sometimes the best innovations are just better ways to look at what you already have."
"Stochasticity is great for diversity, but it's terrible for speed. If you can remove the randomness between steps, you can take huge leaps forward."
"You don't need to memorize every derivation. What matters is understanding why the math works the way it does, and what it means for building better models."
Four. Layered Learning Notes
Module 1: Generative Modeling Basics and Course Roadmap
The core goal of all unconditioned generative models is to sample new observations from an unknown, complex data distribution \(p_{\text{data}}(x)\). Unlike text generation, which produces one token at a time, image generation requires a fundamentally different approach because images are high-dimensional and spatially coherent.
The course is structured into eight lectures covering the full stack of modern image generation:
- Diffusion models (DDPM) and DDIM
- Score matching and continuous SDE formulations
- Flow matching
- Conditioned generation and guidance techniques
- Model architectures (U-Net, Diffusion Transformers)
- Training optimization and distillation
- Evaluation metrics and multi-modal evaluation
All three core generation paradigms—diffusion, score matching, and flow matching—share a common foundation: they start from simple Gaussian noise and transform it into realistic data. This lecture focuses on the first and most historically influential paradigm: diffusion.
Module 2: Core Intuition of Diffusion Models
Diffusion models draw inspiration from thermodynamics, where particles naturally diffuse from regions of high concentration to low concentration. In generative modeling, we reverse this process: we start with unstructured Gaussian noise (high entropy) and gradually move it toward the structured data distribution (low entropy).
The diffusion framework has two core components:
- Forward process: A predefined procedure with no learned parameters that adds small amounts of random Gaussian noise to clean images over many steps, eventually turning them into pure noise.
- Reverse process: A learned neural network that takes a noisy image and predicts the noise that was added, allowing us to reverse the corruption step-by-step.
A key insight is that we never need to explicitly model the data distribution. Instead, we only need to model the difference between a noisy image and a slightly less noisy image, which is a much simpler regression task.
Module 3: Forward Noise Process (Variance Preserving)
The forward process is defined recursively for each time step t from one to T: \(x_t = \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon_t\), where \(\epsilon_t \sim \mathcal{N}(0, I)\) is standard Gaussian noise and \(\beta_t\) is the noise schedule, a sequence of small values that gradually increases from \(10^{-4}\) to \(0.02\) in the original DDPM paper.
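As a rough illustration of the recursion above, here is a minimal NumPy sketch. The notes do not specify a framework, so NumPy, the function names `linear_beta_schedule` and `forward_step`, the 8x8 array standing in for an image, and the linear spacing of the schedule between the two endpoints are all assumptions made for this example.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Noise schedule beta_1..beta_T, stored at indices 0..T-1 (linear spacing assumed)."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """One forward step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps_t."""
    eps_t = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps_t

rng = np.random.default_rng(0)
betas = linear_beta_schedule()

x_start = rng.standard_normal((8, 8))   # hypothetical stand-in for a normalized image
x = x_start.copy()
for beta_t in betas:                    # apply all T noising steps in order
    x = forward_step(x, beta_t, rng)
```

After all T steps, `x` retains almost nothing of `x_start`; each individual step only perturbs the image slightly because every `beta_t` is small.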
This formulation is called variance-preserving because the variance of \(x_t\) remains approximately one at every step when inputs are normalized to have zero mean and unit variance. This stability is critical for training deep neural networks.
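The variance claim can be checked with a short calculation. Assuming \(x_{t-1}\) and \(\epsilon_t\) are independent, the per-coordinate variance follows \(v_t = (1 - \beta_t) v_{t-1} + \beta_t\), so \(v_{t-1} = 1\) gives \(v_t = 1\). The sketch below iterates that recursion; the helper name `variance_trajectory` and the linear schedule are assumptions for illustration.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule

def variance_trajectory(v0, betas):
    """Track Var(x_t) per coordinate: v_t = (1 - beta_t) * v_{t-1} + beta_t."""
    v = v0
    out = []
    for beta_t in betas:
        v = (1.0 - beta_t) * v + beta_t
        out.append(v)
    return np.array(out)

unit = variance_trajectory(1.0, betas)     # normalized input: variance stays at one
small = variance_trajectory(0.25, betas)   # under-scaled input: variance drifts toward one
```

The second trajectory shows why the notes say "approximately": if the input variance is not exactly one, the process still pulls it toward one as noise accumulates.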
A critical mathematical simplification allows us to compute the noisy image at any time step t directly from the clean image \(x_0\), without iterating through all intermediate steps: \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), where \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^t \alpha_i\) is the cumulative product of alphas up to step t. This closed-form expression is the foundation of the DDPM training procedure.
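A minimal sketch of the closed-form jump, again in NumPy. The name `q_sample`, the 1-indexed time convention, and the linear schedule are assumptions for this example. The variable `two_step_noise_var` checks the identity behind the shortcut for t = 2: merging the two independent noise terms gives variance \(\alpha_2 \beta_1 + \beta_2 = 1 - \alpha_1 \alpha_2\).

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # alpha_bars[t - 1] = alpha_1 * ... * alpha_t

def q_sample(x0, t, eps):
    """Jump straight to step t (1-indexed): sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar_t = alpha_bars[t - 1]
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# Two forward steps inject noise with total variance alpha_2 * beta_1 + beta_2,
# which matches 1 - alpha_bar_2 from the closed form.
two_step_noise_var = alphas[1] * betas[0] + betas[1]

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # hypothetical stand-in for a normalized image
eps = rng.standard_normal(x0.shape)
x_early = q_sample(x0, 1, eps)          # still very close to x0
x_late = q_sample(x0, T, eps)           # very close to pure noise eps
```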
Module 4: Variational Formulation and ELBO Derivation
Our goal is to learn the reverse process \(p_\theta(x_{t-1} | x_t)\), which takes a noisy image \(x_t\) and produces a slightly less noisy image \(x_{t-1}\). We train this model by maximizing the log-likelihood of the training data under our model.
Directly computing this log-likelihood is intractable because it requires marginalizing over all possible noisy trajectories \(x_{1:T}\) that could lead to \(x_0\). Instead, we use a standard variational inference technique: we derive a lower bound on the log-likelihood called the Evidence Lower Bound (ELBO) and maximize this bound instead.
The ELBO derivation relies on three key steps:
- Introduce the known forward process \(q(x_{1:T} | x_0)\) into the likelihood expression
- Apply Jensen's inequality to the logarithm of an expectation, which gives us the lower bound
- Expand the bound into a sum of KL divergence terms between the forward and reverse processes
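The terms in the ELBO that depend on the model parameters \(\theta\) are, for each time step t, the KL divergence between the true conditional distribution \(q(x_{t-1} | x_t, x_0)\) and our learned distribution \(p_\theta(x_{t-1} | x_t)\), together with a reconstruction term for the final denoising step. The remaining term compares \(q(x_T | x_0)\) with the fixed Gaussian prior and involves no learned parameters.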
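The three steps can be written out as follows. This is the standard form of the derivation as it appears in the DDPM literature, not a transcription of the lecture slides, and the grouping of terms in the last line is one common way to present it.

```latex
\begin{aligned}
\log p_\theta(x_0)
  &= \log \int p_\theta(x_{0:T}) \, dx_{1:T} \\
  &= \log \mathbb{E}_{q(x_{1:T} \mid x_0)}
     \left[ \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right] \\
  &\ge \mathbb{E}_{q(x_{1:T} \mid x_0)}
     \left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right] \\
  &= \mathbb{E}_q \left[ \log p_\theta(x_0 \mid x_1) \right]
     - D_{\mathrm{KL}}\left( q(x_T \mid x_0) \,\|\, p(x_T) \right) \\
  &\quad - \sum_{t=2}^{T} \mathbb{E}_q \left[
       D_{\mathrm{KL}}\left( q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t) \right)
     \right]
\end{aligned}
```

The second line introduces the forward process, the third applies Jensen's inequality, and the last line is the expansion into KL terms.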
Module 5: Tractable Loss Function for DDPM
Both \(q(x_{t-1} | x_t, x_0)\) and \(p_\theta(x_{t-1} | x_t)\) are Gaussian distributions. The KL divergence between two Gaussians has a closed-form solution that simplifies dramatically when we fix the variance of the learned distribution.
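To see why fixing the variance helps, here is a small numerical sketch of the closed-form KL divergence between two isotropic Gaussians. The function name `kl_isotropic` and the example numbers are made up for illustration. Once both variances are fixed, the only part of the KL that changes with the learned mean is a scaled squared distance between the means, which is where the L2 form comes from.

```python
import numpy as np

def kl_isotropic(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q * I) || N(mu_p, var_p * I) ) for scalar variances."""
    mu_q = np.asarray(mu_q, dtype=float)
    mu_p = np.asarray(mu_p, dtype=float)
    d = mu_q.size
    mean_term = np.sum((mu_p - mu_q) ** 2) / var_p
    return 0.5 * (d * var_q / var_p + mean_term - d + d * np.log(var_p / var_q))

mu_q = [1.0, 2.0]

# Equal variances: the KL is exactly ||mu_q - mu_p||^2 / (2 * var).
kl_equal = kl_isotropic(mu_q, 0.5, [0.0, 0.0], 0.5)

# Unequal but fixed variances: subtracting the squared-distance term
# leaves a value that does not depend on the learned mean mu_p.
def leftover(mu_p):
    diff = np.asarray(mu_q) - np.asarray(mu_p)
    return kl_isotropic(mu_q, 0.3, mu_p, 0.5) - np.sum(diff ** 2) / (2 * 0.5)

leftover_a = leftover([0.0, 0.0])
leftover_b = leftover([3.0, -1.0])
```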
After algebraic simplification, and after dropping a time-dependent weighting factor as the original DDPM paper does, the objective reduces to an extremely simple L2 regression loss: \(\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]\), where \(\epsilon_\theta(x_t, t)\) is our neural network, which takes the noisy image \(x_t\) and the time step t as input and predicts the noise \(\epsilon\) that was added to the clean image \(x_0\) to produce \(x_t\).
This loss function is revolutionary for three reasons:
- It is trivial to implement and extremely stable to train
- It requires no complex likelihood calculations
- It leverages the closed-form forward process expression to generate training examples on the fly
Module 6: DDPM Training and Inference Pipelines
The DDPM training procedure is remarkably straightforward:
- Sample a clean image \(x_0\) from the training dataset
- Sample a random time step t uniformly from one to T
- Sample Gaussian noise \(\epsilon\) from \(\mathcal{N}(0, I)\)
- Compute the noisy image \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\)
- Pass \(x_t\) and t through the neural network to get the predicted noise \(\epsilon_\theta\)
- Compute the L2 loss between the predicted noise and the true noise
- Backpropagate the loss and update the model parameters
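The first six steps can be sketched in NumPy as below. The names `ddpm_loss` and `zero_model`, the batch of random 8x8 arrays, and the linear schedule are placeholders for illustration. NumPy has no automatic differentiation, so the backpropagation step is left out; in practice `eps_model` would be a network in an autodiff framework.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def ddpm_loss(eps_model, x0_batch, rng):
    """Monte Carlo estimate of E[ ||eps - eps_model(x_t, t)||^2 ] over one batch."""
    batch = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=batch)              # uniform over 1..T
    eps = rng.standard_normal(x0_batch.shape)           # true noise
    # Reshape abar_t so it broadcasts over the non-batch dimensions.
    abar = alpha_bars[t - 1].reshape(batch, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(abar) * x0_batch + np.sqrt(1.0 - abar) * eps
    pred = eps_model(x_t, t)                            # predicted noise
    per_example = np.sum((eps - pred) ** 2, axis=tuple(range(1, x0_batch.ndim)))
    return float(np.mean(per_example))

def zero_model(x_t, t):
    """Hypothetical untrained baseline that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0_batch = rng.standard_normal((16, 8, 8))  # stand-in for a batch of normalized images
loss = ddpm_loss(zero_model, x0_batch, rng)
```

For the zero baseline the loss is just the average squared norm of the noise, roughly 64 for 8x8 inputs, which gives a reference point that a trained model should beat.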
Standard DDPM inference is equally simple but computationally expensive:
- Sample pure noise \(x_T\) from \(\mathcal{N}(0, I)\)
- For each time step t from T down to one:
  a. Predict the noise \(\epsilon_\theta(x_t, t)\)
  b. Compute the mean of the reverse distribution
  c. Sample \(x_{t-1}\) from the Gaussian distribution with this mean and fixed variance
- The final image \(x_0\) is the generated sample
The original DDPM paper uses \(T = 1000\) steps, meaning inference requires 1000 sequential forward passes through the neural network.
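A minimal sketch of this sampling loop follows. The notes do not write out the reverse mean, so the expression used here, \(\frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)\) with fixed variance \(\beta_t\), is taken from the original DDPM paper's sampling algorithm rather than from the lecture. The `oracle` noise predictor is purely hypothetical: it cheats by knowing a target image, which makes the loop easy to sanity-check without a trained network.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: one model call per time step, from t = T down to 1."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        i = t - 1                                       # array index for step t
        eps_hat = eps_model(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        # No fresh noise on the last step, so x_0 is the mean itself.
        noise = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * noise            # fixed variance beta_t
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)                   # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
target = rng.standard_normal((8, 8))                    # hypothetical "clean image"

def oracle(x_t, t):
    """Hypothetical perfect predictor: the noise implied by x_t if x_0 were `target`."""
    abar = alpha_bars[t - 1]
    return (x_t - np.sqrt(abar) * target) / np.sqrt(1.0 - abar)

sample = ddpm_sample(oracle, target.shape, betas, rng)  # 1000 sequential model calls
```

With the oracle, the loop lands back on `target`, which is a useful correctness check before plugging in a real network.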
Module 7: DDIM for Faster Deterministic Sampling
The 1000-step inference requirement made DDPM impractical for real-world applications. Denoising Diffusion Implicit Models (DDIM) solve this problem by reinterpreting the diffusion process to eliminate intermediate stochasticity.
The key insight behind DDIM is that the DDPM training objective only constrains the marginal distributions \(q(x_t | x_0)\), not the joint distribution over trajectories. This means we can define a new generation process that matches these marginals but is deterministic between steps.
In the deterministic DDIM formulation, the reverse update becomes: \(x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \hat{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}} \epsilon_\theta(x_t, t)\), where \(\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\) is the model's current prediction of the clean image.
More generally, DDIM has a stochasticity parameter \(\sigma_t\) that controls how much fresh noise is injected at each step, and the update above is the \(\sigma_t = 0\) case. With \(\sigma_t = 0\), the process is completely deterministic: the same initial noise \(x_T\) will always produce the same final image \(x_0\). This allows us to skip large numbers of steps without significant quality degradation.
DDIM achieves a 10 to 50 times speedup over standard DDPM, reducing inference from 1000 steps to 20 to 100 steps, while using the exact same trained model with no retraining required.
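A minimal sketch of deterministic DDIM sampling on a shortened schedule is shown below. The evenly spaced choice of time steps, the convention \(\bar{\alpha}_0 = 1\) for the final step, and the names `ddim_sample` and `oracle` are assumptions for this example; the `oracle` predictor is hypothetical and stands in for a trained DDPM network.

```python
import numpy as np

def ddim_sample(eps_model, x_T, alpha_bars, num_steps):
    """Deterministic DDIM (sigma = 0) over an evenly spaced subset of time steps."""
    T = len(alpha_bars)
    steps = np.linspace(T, 1, num_steps).round().astype(int)   # e.g. 1000, 980, ..., 1
    x = x_T
    for k, t in enumerate(steps):
        abar_t = alpha_bars[t - 1]
        t_prev = steps[k + 1] if k + 1 < len(steps) else 0
        abar_prev = alpha_bars[t_prev - 1] if t_prev > 0 else 1.0  # abar_0 = 1
        eps_hat = eps_model(x, t)
        x0_hat = (x - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
        x = np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps_hat
    return x

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed linear schedule
target = rng.standard_normal((8, 8))                           # hypothetical "clean image"

def oracle(x_t, t):
    """Hypothetical perfect predictor: the noise implied by x_t if x_0 were `target`."""
    abar = alpha_bars[t - 1]
    return (x_t - np.sqrt(abar) * target) / np.sqrt(1.0 - abar)

x_T = rng.standard_normal(target.shape)
out_a = ddim_sample(oracle, x_T, alpha_bars, num_steps=50)     # 50 model calls, not 1000
out_b = ddim_sample(oracle, x_T, alpha_bars, num_steps=50)     # same x_T, same result
```

Because no random numbers are drawn inside the loop, the output is a pure function of `x_T`, and the number of model calls equals `num_steps`.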
Wishing you all the best as you begin your journey into diffusion models. May your noise predictions always be precise, your ELBO bounds stay tight, and your training losses drop smoothly and steadily. May your DDPM samples be crisp and diverse, and your DDIM steps run lightning-fast without sacrificing an ounce of quality. The fundamentals you're mastering today power every state-of-the-art image generation model from Stable Diffusion to Sora—keep digging into the math, keep experimenting with code, and keep pushing the boundaries of what generative AI can create. Happy generating!


