One. Course Details
This is Lecture 2 of Stanford University's CME 296 course, introducing the second core generative modeling paradigm: score matching. Following the discrete-time diffusion (DDPM) framework from Lecture 1, this lecture builds a continuous-time mathematical foundation that unifies diffusion and score-based methods under the Stochastic Differential Equation (SDE) umbrella.
The lecture progresses systematically from first principles: it starts with the motivation for score functions, derives denoising score matching as a tractable training objective, introduces Noise Conditional Score Networks (NCSN) to address low-density estimation issues, and finally transitions to continuous SDE formulations. It concludes with advanced sampling techniques including the Probability Flow ODE (PF-ODE) and DPM-Solver, which enable fast, high-quality generation with minimal function evaluations.
Two. Key Learning Takeaways
The score function—defined as the gradient of the log probability density—eliminates the intractable normalizing constant problem that plagues direct density estimation.
Denoising Score Matching (DSM) transforms the impossible task of estimating the true data score into a simple L2 regression problem by adding controlled Gaussian noise to training samples.
Single-noise score matching suffers from a fundamental tradeoff between fidelity to the data distribution (small noise) and accurate scores in low-density regions (large noise); Noise Conditional Score Networks (NCSN) solve this by training a single model to predict scores at multiple noise levels.
Annealed Langevin Dynamics uses the multi-noise score estimates to progressively refine samples, starting from high noise and gradually reducing it to converge to the true data distribution.
DDPM and score matching are mathematically equivalent and can both be expressed as SDEs with different drift and diffusion terms.
The reverse SDE allows us to denoise images by reversing the forward noising process, and it requires only the learned score function as input.
The Probability Flow ODE (PF-ODE) converts the stochastic reverse SDE into a deterministic ordinary differential equation, removing the stochastic component of the sampling error and enabling adaptive step sizes.
DPM-Solver leverages the linear structure of diffusion ODEs to achieve state-of-the-art sample quality with as few as 10-20 function evaluations, compared to the hundreds to a thousand steps used by traditional samplers.
Three. Course Gold Quotes
"The score is your compass in the high-dimensional image space—it always points toward regions of real images."
"We don't need to know the absolute probability of an image—we just need to know which direction is more probable. That's the magic of the score function."
"Denoising score matching turns a hard unsupervised problem into a trivial supervised regression problem. You just add noise and learn to predict where to go back."
"Single noise level is a catch-22: too little noise and you can't see far, too much noise and you can't see clearly."
"Diffusion and score matching aren't competing methods—they're two sides of the same coin. One predicts noise, the other predicts direction, but they're mathematically interchangeable."
"The continuous SDE framework didn't just unify two fields—it gave us a whole toolkit of mathematical techniques to make generative models faster and better."
"PF-ODE doesn't change where you end up—it just changes how you get there. No more random detours, just a straight shot to your destination."
Four. Layered Learning Notes
Module 1: Course Recap and Score Function Motivation
Lecture 1 introduced DDPM, a discrete-time generative model that gradually adds Gaussian noise to clean images and learns to reverse the process by predicting the added noise. While effective, DDPM is just one way to approach generative modeling. Score matching offers an alternative perspective rooted in probability theory.
The core goal of all generative models is to sample from the complex, unknown data distribution \(p_{\text{data}}(x)\). A naive approach would be to follow the gradient of \(p(x)\) toward regions of higher density, but this has three fatal flaws:
- The normalizing constant Z in \(p(x) = \frac{1}{Z} \tilde{p}(x)\) is intractable to compute for high-dimensional data
- Probability densities are extremely small in low-density regions, causing numerical instability
- The gradient magnitude vanishes far from the data distribution
The score function solves all three problems. Defined as \(\nabla_x \log p(x)\), it:
- Eliminates the normalizing constant entirely (the gradient of \(\log Z\) is zero)
- Points in exactly the same direction as \(\nabla p(x)\)
- Has much better numerical stability because it normalizes the gradient by \(p(x)\)
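The first property can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the lecture: it uses an unnormalized standard Gaussian as a toy density, and the helper names `log_p_tilde` and `numerical_score` are hypothetical. It estimates the score by finite differences for two very different scaling constants and compares the results.

```python
import numpy as np

def log_p_tilde(x, scale=1.0):
    """Log of an unnormalized standard Gaussian multiplied by an arbitrary constant `scale`."""
    return np.log(scale) - 0.5 * np.sum(x ** 2)

def numerical_score(log_density, x, eps=1e-5):
    """Central finite-difference estimate of grad_x log_density(x)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (log_density(x + step) - log_density(x - step)) / (2 * eps)
    return grad

x = np.array([0.5, -1.2, 2.0])
# Same density shape, wildly different constants in front of it.
score_a = numerical_score(lambda v: log_p_tilde(v, scale=1.0), x)
score_b = numerical_score(lambda v: log_p_tilde(v, scale=1e6), x)
print(score_a, score_b)
```

Both estimates should agree with each other and with the analytic score \(-x\) of a standard Gaussian, because the constant only adds a term to \(\log p(x)\) that does not depend on \(x\).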
Module 2: Langevin Sampling and Score Estimation Challenges
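If we knew the true score function, we could sample from \(p_{\text{data}}\) using Langevin sampling, which iteratively updates samples according to: \(x_{t+1} = x_t + \alpha \nabla_x \log p(x_t) + \sqrt{2\alpha} \epsilon\), where \(\alpha > 0\) is the step size and \(\epsilon \sim \mathcal{N}(0, I)\) is fresh Gaussian noise drawn at every step.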
The gradient term moves samples toward higher density regions, while the noise term ensures diversity and prevents all samples from collapsing to a single mode.
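As a minimal sketch of this update rule, the example below runs Langevin sampling on a 1D Gaussian whose score is known in closed form. The target (mean 2.0, standard deviation 0.5), the step size, and the function names are illustrative assumptions; in practice the score would come from a trained model.

```python
import numpy as np

def gaussian_score(x, mean=2.0, std=0.5):
    """Exact score of N(mean, std^2); stands in for a learned score function."""
    return -(x - mean) / std ** 2

def langevin_sample(score_fn, x0, step_size=1e-3, n_steps=2000, rng=None):
    """Iterate x <- x + alpha * score(x) + sqrt(2 * alpha) * noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2 * step_size) * noise
    return x

rng = np.random.default_rng(0)
x_init = rng.uniform(-5, 5, size=5000)   # arbitrary starting points
samples = langevin_sample(gaussian_score, x_init, rng=rng)
print(samples.mean(), samples.std())
```

The printed mean and standard deviation should land close to 2.0 and 0.5. Dropping the noise term would instead drive every chain to the mode at 2.0, which is the collapse the text warns about.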
The problem is that we do not know the true score function. Direct score matching is impossible because we do not have ground truth scores for real data. Early methods like implicit score matching and sliced score matching addressed this but were computationally expensive or unstable.
Module 3: Denoising Score Matching (DSM)
Denoising score matching is the breakthrough that made score-based generative modeling practical. The key insight is simple: while we do not know the score of the true data distribution, we do know the score of a Gaussian distribution.
The method works as follows:
- Take a clean sample x from the training set
- Add controlled Gaussian noise to get a noisy sample \(\tilde{x} = x + \sigma \epsilon\)
- The noisy sample follows a conditional Gaussian distribution \(q_\sigma(\tilde{x}|x) = \mathcal{N}(x, \sigma^2 I)\)
- The score of this conditional distribution has a closed-form solution: \(\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -\frac{\tilde{x} - x}{\sigma^2} = -\frac{\epsilon}{\sigma}\)
A fundamental theorem proves that minimizing the expected squared distance between the predicted score and the conditional score is equivalent to minimizing the expected squared distance between the predicted score and the true score of the noisy distribution \(q_\sigma(\tilde{x})\): the two objectives differ only by a constant that does not depend on the model parameters, so they share the same minimizer. This reduces the entire problem to a simple L2 regression loss.
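The equivalence can be illustrated on a toy case where everything is known in closed form. The sketch below assumes 1D data \(x \sim \mathcal{N}(0, 1)\), so the noisy distribution is \(\mathcal{N}(0, 1 + \sigma^2)\) with true score \(-\tilde{x} / (1 + \sigma^2)\). The linear model \(s(\tilde{x}) = a \tilde{x}\) and all variable names are assumptions made for illustration; the regression only ever sees the conditional targets \(-\epsilon / \sigma\).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

x = rng.standard_normal(200_000)        # clean samples, x ~ N(0, 1)
eps = rng.standard_normal(x.shape)      # Gaussian noise
x_noisy = x + sigma * eps               # noisy samples
target = -(x_noisy - x) / sigma ** 2    # conditional score, equal to -eps / sigma

def dsm_loss(a):
    """Empirical DSM loss for the toy linear score model s(x_noisy) = a * x_noisy."""
    return 0.5 * np.mean((a * x_noisy - target) ** 2)

# Least-squares minimizer of the DSM loss over the slope a.
a_hat = np.sum(x_noisy * target) / np.sum(x_noisy ** 2)

# Slope of the true score of the noisy marginal N(0, 1 + sigma^2).
a_true = -1.0 / (1.0 + sigma ** 2)

print(a_hat, a_true, dsm_loss(a_hat))
```

The fitted slope should come out close to \(-0.8\), the slope of the marginal score, even though no marginal score was ever used as a training target.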
Module 4: Noise Conditional Score Networks (NCSN)
Denoising score matching solves the tractability problem but introduces a new tradeoff:
- If \(\sigma\) is too small: The noisy distribution is close to the true data distribution, but the model learns poor scores in low-density regions (where samples are rare)
- If \(\sigma\) is too large: The model learns good scores everywhere, but the noisy distribution is very different from the true data distribution
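NCSN resolves this tradeoff by training a single network, conditioned on the noise level, to predict the score at a whole sequence of noise levels ranging from large to small. During sampling, these multi-level scores drive Annealed Langevin Dynamics: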
- Start with a sample from a high-noise Gaussian distribution
- Run Langevin sampling using the score for the highest noise level
- Gradually decrease the noise level and repeat Langevin sampling at each level
- Continue until reaching the lowest noise level, which approximates the true data distribution
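The procedure above can be sketched on a toy problem. This example assumes 1D data from an equal-weight mixture of two Gaussians at \(\pm 4\), for which the score at every noise level is available analytically; `noisy_score` therefore stands in for a trained noise-conditional network. The geometric noise schedule and the step size scaled with \(\sigma^2\) are illustrative choices, not values from the lecture.

```python
import numpy as np

MEANS = np.array([-4.0, 4.0])   # toy data: equal-weight two-mode mixture
DATA_STD = 0.3

def noisy_score(x, sigma):
    """Exact score of the mixture after adding N(0, sigma^2) noise."""
    var = DATA_STD ** 2 + sigma ** 2
    logits = -0.5 * (x[:, None] - MEANS[None, :]) ** 2 / var
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)           # component responsibilities
    return np.sum(resp * (MEANS[None, :] - x[:, None]), axis=1) / var

def annealed_langevin(sigmas, n_samples=4000, steps_per_level=100, eps=1e-3, seed=0):
    """Run Langevin sampling at each noise level, from the largest to the smallest."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples) * sigmas[0]    # high-noise Gaussian start
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2       # larger steps at higher noise
        for _ in range(steps_per_level):
            z = rng.standard_normal(n_samples)
            x = x + alpha * noisy_score(x, sigma) + np.sqrt(2 * alpha) * z
    return x

sigmas = np.geomspace(10.0, 0.1, 10)   # decreasing noise levels
samples = annealed_langevin(sigmas)
print(np.mean(samples > 0), np.abs(samples).mean())
```

Roughly half of the samples should end up near each mode. Running plain Langevin sampling at the smallest noise level alone, from the same wide initialization, would rely on scores far from the data where a learned model is least reliable.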
Module 5: Unification with Diffusion via SDEs
Both DDPM and NCSN can be generalized to continuous time using Stochastic Differential Equations (SDEs). The forward noising process for both models can be written as: \(dx = f(x, t) dt + g(t) dW\), where \(f(x, t)\) is the deterministic drift term and \(g(t) dW\) is the stochastic diffusion term (driven by a Wiener process W).
For DDPM (variance-preserving):
- Drift: \(f(x, t) = -\frac{1}{2} \beta(t) x\)
- Diffusion: \(g(t) = \sqrt{\beta(t)}\)
For NCSN (variance-exploding):
- Drift: \(f(x, t) = 0\)
- Diffusion: \(g(t) = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\), so that the accumulated noise variance at time t grows as \(\sigma^2(t)\)
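To make the forward SDE concrete, the sketch below simulates the variance-preserving case with the Euler-Maruyama method. The linear schedule \(\beta(t) = 0.1 + 19.9 t\) on \(t \in [0, 1]\), the constant starting value 3.0, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def beta(t, beta_min=0.1, beta_max=20.0):
    """Assumed linear noise schedule on t in [0, 1]."""
    return beta_min + t * (beta_max - beta_min)

def simulate_vp_forward(x0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dW."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        drift = -0.5 * beta(t) * x
        diffusion = np.sqrt(beta(t))
        x = x + drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

x0 = np.full(20_000, 3.0)        # every sample starts at the same "data" value
x1 = simulate_vp_forward(x0)
print(x1.mean(), x1.std())
```

By \(t = 1\) the drift has shrunk the mean toward zero while the injected noise has brought the standard deviation to roughly one, which is the sense in which this process preserves variance.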
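A landmark result from stochastic calculus shows that the reverse process (denoising) is also an SDE: \(dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t) d\bar{W}\), where time runs backward from the final noise level to zero, \(p_t\) is the marginal distribution of the forward process at time t, and \(\bar{W}\) is a Wiener process in reverse time.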
This is why the score function is so important: it is the only additional information needed to reverse any forward SDE.
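A minimal sketch of reverse-SDE sampling follows, under strong simplifying assumptions: the data is a 1D Gaussian with mean 3.0 and standard deviation 0.5, the forward process is the variance-preserving SDE with an assumed linear \(\beta(t)\), and the exact score of \(p_t\) replaces a learned network. All names and constants are illustrative.

```python
import numpy as np

DATA_MEAN, DATA_STD = 3.0, 0.5      # assumed toy data distribution
BETA_MIN, BETA_MAX = 0.1, 20.0      # assumed linear schedule

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def score(x, t):
    """Exact score of p_t for Gaussian data under the VP forward SDE."""
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2
    alpha = np.exp(-0.5 * integral)                  # signal scale at time t
    var_t = (alpha * DATA_STD) ** 2 + 1.0 - alpha ** 2
    return -(x - alpha * DATA_MEAN) / var_t

def reverse_sde_sample(n_samples=10_000, n_steps=1000, seed=0):
    """Euler-Maruyama integration of the reverse SDE from t = 1 down to t = 0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)               # start from the N(0, 1) prior
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        f = -0.5 * beta(t) * x                       # forward drift
        g2 = beta(t)                                 # squared diffusion coefficient
        z = rng.standard_normal(n_samples)
        # Stepping backward in time flips the sign of the drift contribution.
        x = x - (f - g2 * score(x, t)) * dt + np.sqrt(g2 * dt) * z
    return x

samples = reverse_sde_sample()
print(samples.mean(), samples.std())
```

Starting from pure noise, the samples should recover a mean near 3.0 and a standard deviation near 0.5, using nothing beyond the forward coefficients and the score.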
Module 6: Fast Sampling with PF-ODE and DPM-Solver
While the reverse SDE produces high-quality samples, it has two major limitations:
- It requires many small steps to accurately simulate the stochastic process
- It accumulates both discretization error and stochastic error
The Probability Flow ODE (PF-ODE) solves these problems by converting the stochastic SDE into a deterministic ODE that preserves the same marginal probability distribution at every time step: \(\frac{dx}{dt} = f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x)\)
The PF-ODE has several advantages:
- It is deterministic: the same initial noise always produces the same sample
- It has only discretization error, no stochastic error
- It supports adaptive step size solvers that can take much larger steps
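The determinism property can be seen in a small sketch. As an assumption for illustration, the data is a 1D Gaussian (mean 3.0, standard deviation 0.5) under a variance-preserving SDE with a linear \(\beta(t)\), so the exact score is available. A fixed-step Euler integrator is used here for simplicity; an adaptive solver could be substituted.

```python
import numpy as np

DATA_MEAN, DATA_STD = 3.0, 0.5      # assumed toy data distribution
BETA_MIN, BETA_MAX = 0.1, 20.0      # assumed linear schedule

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def score(x, t):
    """Exact score of p_t for Gaussian data under the VP forward SDE."""
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2
    alpha = np.exp(-0.5 * integral)
    var_t = (alpha * DATA_STD) ** 2 + 1.0 - alpha ** 2
    return -(x - alpha * DATA_MEAN) / var_t

def pf_ode_sample(x_T, n_steps=1000):
    """Euler integration of dx/dt = f - 0.5 * g^2 * score from t = 1 down to t = 0."""
    x = np.array(x_T, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        velocity = -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t)
        x = x - velocity * dt            # no noise term: the path is deterministic
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(10_000)        # initial noise
run_a = pf_ode_sample(x_T)
run_b = pf_ode_sample(x_T)               # same noise, second run
print(run_a.mean(), run_a.std(), np.array_equal(run_a, run_b))
```

Two runs from the same initial noise produce identical outputs, while the population of outputs still matches the target mean and spread. That is the "same marginals, different paths" relationship between the PF-ODE and the reverse SDE.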
DPM-Solver further optimizes sampling by leveraging the linear structure of the drift term in diffusion ODEs. It splits the ODE into an exactly solvable linear part and a nonlinear part (the score), and discretizes only the nonlinear part. This allows DPM-Solver to achieve state-of-the-art sample quality with just 10-20 function evaluations, compared to 1000 steps for the original DDPM sampler.
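A hedged sketch of the idea, using only the first-order update (commonly written DPM-Solver-1, which coincides with the deterministic DDIM step): the linear part is handled exactly through the ratio \(\alpha_t / \alpha_s\), and only the noise-prediction term is approximated over each step. The toy Gaussian data, the linear \(\beta(t)\), the uniform time grid, and the analytic `eps_pred` standing in for a trained noise-prediction network are all assumptions for illustration.

```python
import numpy as np

DATA_MEAN, DATA_STD = 3.0, 0.5      # assumed toy data distribution
BETA_MIN, BETA_MAX = 0.1, 20.0      # assumed linear schedule

def alpha_sigma(t):
    """VP schedule: x_t = alpha_t * x_0 + sigma_t * eps."""
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2
    alpha = np.exp(-0.5 * integral)
    return alpha, np.sqrt(1.0 - alpha ** 2)

def eps_pred(x, t):
    """Exact noise prediction for Gaussian data, equal to -sigma_t times the score."""
    alpha, sigma = alpha_sigma(t)
    var_t = (alpha * DATA_STD) ** 2 + sigma ** 2
    return sigma * (x - alpha * DATA_MEAN) / var_t

def dpm_solver_1(x, n_steps=20, t_start=1.0, t_end=1e-3):
    """First-order exponential-integrator steps on a uniform time grid."""
    ts = np.linspace(t_start, t_end, n_steps + 1)
    for s, t in zip(ts[:-1], ts[1:]):
        a_s, sig_s = alpha_sigma(s)
        a_t, sig_t = alpha_sigma(t)
        # h is the step in log-SNR space, lambda = log(alpha / sigma).
        h = np.log(a_t / sig_t) - np.log(a_s / sig_s)
        x = (a_t / a_s) * x - sig_t * np.expm1(h) * eps_pred(x, s)
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(10_000)
x_0 = dpm_solver_1(x_T)                  # 20 function evaluations in total
print(x_0.mean(), x_0.std())
```

With 20 evaluations the mean lands close to 3.0. The spread comes out somewhat below 0.5 in this toy setup, because a first-order step slightly under-scales deviations from the mean; the higher-order variants of the solver are intended to reduce that kind of discretization error.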


