One. Course Details
This is Lecture 3 of Stanford University's CME 296 course. It completes a trilogy of core generative modeling paradigms, following diffusion (DDPM) in Lecture 1 and score matching in Lecture 2. Flow matching has been widely adopted in recent generative models because of its mathematical simplicity, training stability, and fast inference.
The lecture balances intuitive explanations with rigorous mathematical derivation, starting with foundational concepts from optimal transport, then building up to the conditional flow matching loss that powers models like FLUX and Stable Diffusion 3. It concludes with a discussion of rectified flow for faster inference and a unifying framework that connects diffusion, score matching, and flow matching as different perspectives on the same underlying problem.
Two. Key Learning Takeaways
Flow matching frames generative modeling as an optimal transport problem, where the goal is to transport probability mass from a simple initial distribution (Gaussian noise) to the complex target data distribution.
The core object of interest in flow matching is the vector field (velocity field), which tells each particle how fast and in which direction to move at every point in space and time.
Conditional Flow Matching (CFM) reduces the training objective to a simple L2 regression loss between the predicted velocity and the target velocity (x₁ - x₀), so training never has to simulate the ODE or estimate likelihoods.
Lipschitz continuity of the vector field guarantees unique trajectories for each initial point, ensuring a one-to-one mapping between the initial and target distributions.
Rectified Flow is a fine-tuning procedure that straightens the learned trajectories, allowing for high-quality generation with as few as one or two inference steps.
The three major generative paradigms (diffusion, score matching, and flow matching) are closely related: for Gaussian probability paths their training targets can be converted into one another, and all three can be unified under the stochastic interpolants framework.
Flow matching produces deterministic trajectories by default. All randomness comes from the initial noise sample, which is what keeps generation diverse, whereas traditional diffusion samplers such as DDPM inject fresh noise at every step.
The training process for flow matching is significantly more stable than earlier methods, with fewer hyperparameters to tune and better scaling properties with model size.
Three. Course Gold Quotes
"The vector field is like giving self-driving cars turn-by-turn directions. The score is just a compass pointing toward high-density regions."
"All the complexity of optimal transport melts away into a single L2 loss. That's the magic of flow matching."
"If your vector field isn't Lipschitz continuous, you can have two particles starting at the same point ending up in completely different places. That's a disaster for generative modeling."
"We don't match the flow directly—we match the velocity. But if the velocity is Lipschitz, matching the velocity is exactly the same as matching the flow."
"Rectified flow doesn't make your model better—it makes your model faster. Sometimes speed is the most important feature of all."
"Diffusion, score matching, flow matching—they're all just different ways of looking at the same mountain. The view is different, but the summit is the same."
"The beauty of flow matching is that you don't need to be a math genius to implement it. The loss is literally just subtracting two vectors and squaring them."
Four. Layered Learning Notes
Module 1: Course Recap and Paradigm Comparison
The first two lectures established two foundational generative modeling paradigms:
- DDPM Diffusion: A discrete-time process that gradually adds noise to clean images and learns to reverse it by predicting the added noise
- Score Matching: A continuous-time process that learns the gradient of the log probability density (the score) and uses Langevin dynamics to sample from the data distribution
Both paradigms result in an L2 regression loss, but they have important limitations: diffusion requires carefully designed noise schedules, and score matching has complex sampling procedures. Flow matching addresses these limitations by reframing the entire problem from an optimal transport perspective.
A critical convention change is introduced in this lecture:
- In diffusion and score matching: t = 0 is clean data and t = T is pure noise
- In flow matching: t = 0 is pure noise (initial distribution p₀) and t = 1 is clean data (target distribution p₁)
This convention aligns with the optimal transport literature and is widely used in flow-based generative models.
Module 2: Core Terminology in Flow Matching
Flow matching is built on four interrelated concepts that form the foundation of the entire framework:
- Trajectory (xₜ): The path taken by a single particle from its initial position at t=0 to its final position at t=1
- Flow (ψₜ(x₀)): A function that maps an initial point x₀ to its position at time t. The flow can be thought of as the collection of all possible trajectories.
- Probability Path (pₜ(x)): The probability distribution of particles at time t. p₀ is the initial Gaussian distribution, and p₁ is the target data distribution.
- Vector Field (uₜ(x)): A time-dependent function that assigns a velocity vector to every point in space. The vector field tells each particle how to move at every moment in time.
The lecture uses an intuitive analogy to distinguish the vector field from the score function learned in score matching:
- The vector field is like turn-by-turn directions for self-driving cars, telling each car exactly where to go and how fast to drive
- The score function is like a compass that only points toward the nearest city, without giving specific directions
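The four objects defined in this module, and the contrast with the score, can be made concrete with a minimal one-dimensional sketch. The vector field u_t(x) = -x is a toy choice that does not come from the lecture; it is used only because its flow has the closed form ψ_t(x₀) = x₀·e^(-t), so every definition can be checked by hand. All function names are hypothetical.

```python
import math

def vector_field(x, t):
    """Toy velocity field u_t(x) = -x (an assumed example; it ignores t)."""
    return -x

def flow(x0, t):
    """Closed-form flow psi_t(x0) = x0 * exp(-t) generated by u_t(x) = -x."""
    return x0 * math.exp(-t)

def trajectory(x0, num_points=5):
    """Positions of one particle at evenly spaced times in [0, 1]."""
    return [flow(x0, k / (num_points - 1)) for k in range(num_points)]

def path_std(t):
    """If p_0 = N(0, 1), the probability path is p_t = N(0, exp(-2t))."""
    return math.exp(-t)

def score(x, t):
    """Score of p_t: d/dx log N(x; 0, std^2) = -x / std^2."""
    return -x / path_std(t) ** 2

def flow_time_derivative(x0, t, h=1e-6):
    """Central finite difference of the flow in t; should equal u_t(psi_t(x0))."""
    return (flow(x0, t + h) - flow(x0, t - h)) / (2 * h)

# The flow is the identity at t = 0 and its time derivative matches the field.
print(trajectory(2.0))
print(flow_time_derivative(2.0, 0.3), vector_field(flow(2.0, 0.3), 0.3))
```

In this toy case both the velocity and the score point toward the origin, but only the velocity says how far a particle moves per unit time; the score's magnitude depends on the current spread of the distribution instead.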
Module 3: Foundational Equations
Flow matching relies on two fundamental equations that connect the micro behavior of individual particles to the macro behavior of the entire probability distribution:
- Ordinary Differential Equation (ODE): Describes the motion of a single particle
  dx/dt = uₜ(x)
  This equation states that the velocity of a particle at position x and time t is exactly equal to the vector field at that point. A key mathematical result guarantees that if the vector field is Lipschitz continuous in x, each initial point has a unique trajectory.
- Continuity Equation: Describes the evolution of the entire probability distribution
  ∂pₜ/∂t = -∇ · (pₜ(x) uₜ(x))
  This equation enforces conservation of mass: the rate of change of density at any point equals the net inflow of probability mass to that point. The divergence operator ∇· is applied to the probability flux pₜ uₜ and measures how much that flux spreads out from (positive divergence) or converges on (negative divergence) a given point.
Together, these two equations form the complete mathematical description of the flow matching problem. If you know the vector field uₜ(x), you can solve the ODE to generate individual samples and use the continuity equation to verify that the resulting distribution matches the target data distribution.
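A small numerical sketch can illustrate how the two equations fit together. It assumes a toy field u_t(x) = -x in one dimension (not from the lecture), for which p₀ = N(0, 1) evolves into pₜ = N(0, e^(-2t)). Euler integration plays the role of the ODE solver, and finite differences check the continuity equation; the function names are hypothetical.

```python
import math
import random

def vector_field(x, t):
    """Toy field u_t(x) = -x (assumed for illustration); its flow is x0 * exp(-t)."""
    return -x

def density(x, t):
    """p_t(x) = N(x; 0, exp(-2t)), obtained by pushing N(0, 1) through the flow."""
    std = math.exp(-t)
    return math.exp(-0.5 * (x / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def euler_solve(x0, num_steps=100):
    """Integrate dx/dt = u_t(x) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x += dt * vector_field(x, k * dt)
    return x

def continuity_residual(x, t, h=1e-4):
    """dp/dt + d(p*u)/dx by central differences; close to 0 if the equation holds."""
    dp_dt = (density(x, t + h) - density(x, t - h)) / (2 * h)

    def flux(y):
        return density(y, t) * vector_field(y, t)

    dflux_dx = (flux(x + h) - flux(x - h)) / (2 * h)
    return dp_dt + dflux_dx

# Micro view: push many particles through the ODE.
rng = random.Random(0)
endpoints = [euler_solve(rng.gauss(0.0, 1.0)) for _ in range(5_000)]
mean = sum(endpoints) / len(endpoints)
empirical_std = math.sqrt(sum((x - mean) ** 2 for x in endpoints) / len(endpoints))

# Macro view: the spread should be close to exp(-1), and the residual close to 0.
print(empirical_std, math.exp(-1.0))
print(continuity_residual(0.5, 0.3))
```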
Module 4: Conditional Flow Matching Derivation
The biggest challenge in flow matching is learning the vector field uₜ(x) from data. Direct maximum likelihood estimation (used in earlier continuous normalizing flows) is prohibitively expensive because it requires solving an ODE at every training step.
Conditional Flow Matching (CFM) solves this problem with a clever simplification:
- Instead of trying to transport the entire initial distribution to the entire target distribution at once, consider the simpler problem of transporting the initial distribution to a single data point x₁
- For this conditional problem, define a simple Gaussian probability path that interpolates between the initial Gaussian and a Dirac delta at x₁:
  pₜ(x|x₁) = 𝒩(t x₁, (1-t)² I)
- The corresponding conditional vector field has a closed-form solution:
  uₜ(x|x₁) = (x₁ - x) / (1 - t)
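One way to see where this closed form comes from is sketched below. It assumes the conditional flow is the straight-line map toward x₁, which reproduces the Gaussian path above when x₀ is standard normal; the symbol ψₜ(x₀ | x₁) for the conditional flow is notation introduced here for the derivation.

```latex
\begin{aligned}
\psi_t(x_0 \mid x_1) &= t\,x_1 + (1-t)\,x_0,
  \qquad x_0 \sim \mathcal{N}(0, I)
  \;\Rightarrow\; \psi_t(x_0 \mid x_1) \sim \mathcal{N}\!\big(t\,x_1,\,(1-t)^2 I\big) \\
\frac{d}{dt}\,\psi_t(x_0 \mid x_1) &= x_1 - x_0 \\
x = \psi_t(x_0 \mid x_1) \;&\Rightarrow\; x_0 = \frac{x - t\,x_1}{1-t} \\
u_t(x \mid x_1) &= x_1 - \frac{x - t\,x_1}{1-t}
  = \frac{(1-t)\,x_1 - x + t\,x_1}{1-t}
  = \frac{x_1 - x}{1-t}
\end{aligned}
```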
A critical simplification occurs when xₜ is sampled from this conditional probability path. In this case, xₜ can be written as:
xₜ = t x₁ + (1 - t) x₀
where x₀ is sampled from the initial Gaussian distribution. Substituting this into the conditional vector field gives:
uₜ(xₜ|x₁) = (x₁ - t x₁ - (1 - t) x₀) / (1 - t) = x₁ - x₀
This is an extremely simple result: the target velocity for any point on the straight line between x₀ and x₁ is just the difference between the two endpoints.
The final conditional flow matching loss is then:
L(θ) = 𝔼_{t~𝒰(0,1), x₁~p_data, x₀~p₀} [ || u_θ(xₜ, t) - (x₁ - x₀) ||² ]
This is a standard L2 regression loss that is trivial to implement and extremely stable to train.
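As a rough illustration, the loss can be estimated by Monte Carlo in a few lines. This is a one-dimensional sketch in plain Python, assuming the model is any callable `model(x, t)`; the names `cfm_loss`, `zero_model`, and `oracle_model` are hypothetical, and a real implementation would use batched tensors and automatic differentiation.

```python
import random

def cfm_loss(model, data, num_samples=10_000, seed=0):
    """Monte Carlo estimate of E[(u_theta(x_t, t) - (x1 - x0))^2] in 1D."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x1 = rng.choice(data)            # x1 ~ p_data (here: a list of floats)
        x0 = rng.gauss(0.0, 1.0)         # x0 ~ p_0 = N(0, 1)
        t = rng.random()                 # t ~ U[0, 1)
        xt = t * x1 + (1.0 - t) * x0     # point on the straight line from x0 to x1
        target = x1 - x0                 # conditional target velocity
        total += (model(xt, t) - target) ** 2
    return total / num_samples

# Toy dataset with a single point, so the conditional field is also the exact answer.
data = [2.0]

def zero_model(x, t):
    """Baseline that always predicts zero velocity."""
    return 0.0

def oracle_model(x, t):
    """Closed-form conditional field (x1 - x) / (1 - t) for x1 = 2."""
    return (2.0 - x) / (1.0 - t)

print(cfm_loss(zero_model, data))    # roughly E[(2 - x0)^2] = 5
print(cfm_loss(oracle_model, data))  # essentially 0
```

With more than one data point the minimum of this loss is no longer zero: the regression target x₁ - x₀ is noisy given xₜ, and the best possible model predicts its conditional average.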
Module 5: Training and Inference
The training process for flow matching is remarkably simple:
- Sample a random noise vector x₀ from the standard Gaussian distribution
- Sample a random clean image x₁ from the training dataset
- Sample a random time step t uniformly from [0, 1]
- Construct the noisy intermediate point xₜ = t x₁ + (1 - t) x₀
- Use the neural network to predict the velocity u_θ(xₜ, t)
- Compute the L2 loss between the predicted velocity and the target velocity (x₁ - x₀)
- Backpropagate the loss and update the model parameters
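The training steps above can be sketched end to end on a toy problem. This is a minimal sketch, not the lecture's implementation: the data are scalars drawn from N(3, 0.5²), the "network" is a hypothetical linear model on the features [1, x, t, x·t], and the gradient is written out by hand instead of using backpropagation. The learning rate and step count are arbitrary choices.

```python
import random

def features(x, t):
    """Hand-picked features for a tiny linear velocity model (an assumption)."""
    return [1.0, x, t, x * t]

def predict(weights, x, t):
    """u_theta(x, t) = weights . features(x, t)."""
    return sum(w * f for w, f in zip(weights, features(x, t)))

def sample_training_example(data, rng):
    x0 = rng.gauss(0.0, 1.0)          # sample noise x0
    x1 = rng.choice(data)             # sample a data point x1
    t = rng.random()                  # sample a time t
    xt = t * x1 + (1.0 - t) * x0      # build the intermediate point x_t
    return xt, t, x1 - x0             # inputs and target velocity

def train(data, num_steps=20_000, lr=0.005, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * 4
    for _ in range(num_steps):
        xt, t, target = sample_training_example(data, rng)
        error = predict(weights, xt, t) - target            # prediction minus target
        grad = [2.0 * error * f for f in features(xt, t)]   # gradient of squared error
        weights = [w - lr * g for w, g in zip(weights, grad)]  # SGD update
    return weights

def evaluate(weights, data, num_samples=5_000, seed=1):
    """Held-out Monte Carlo estimate of the same L2 loss."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        xt, t, target = sample_training_example(data, rng)
        total += (predict(weights, xt, t) - target) ** 2
    return total / num_samples

data_rng = random.Random(42)
data = [data_rng.gauss(3.0, 0.5) for _ in range(1_000)]

loss_before = evaluate([0.0] * 4, data)
weights = train(data)
loss_after = evaluate(weights, data)
print(loss_before, loss_after)
```

The loss does not reach zero here, both because the linear model is too simple and because the target x₁ - x₀ is inherently noisy given xₜ.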
Inference is equally straightforward:
- Sample an initial point x₀ from the standard Gaussian distribution
- Numerically solve the ODE dx/dt = u_θ(x, t) from t=0 to t=1 using a numerical solver like Euler or Heun
- The resulting point x₁ is the generated sample
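The two solvers mentioned above can be sketched as follows. Since no trained network is available here, the sketch substitutes an analytic velocity field as a stand-in: for a one-dimensional target N(μ, σ²) and independently sampled (x₀, x₁), Gaussian conditioning gives uₜ(x) = μ + (tσ² - (1 - t)) / mₜ · (x - tμ) with mₜ = (1 - t)² + t²σ², and the exact endpoint is μ + σx₀. The values μ = 3 and σ = 0.5 and all function names are assumptions of this sketch.

```python
import random

def velocity(x, t, mu=3.0, sigma=0.5):
    """Analytic marginal velocity for a 1D Gaussian target (stand-in for u_theta)."""
    m = (1.0 - t) ** 2 + (t * sigma) ** 2
    return mu + (t * sigma ** 2 - (1.0 - t)) / m * (x - t * mu)

def sample_euler(x0, velocity_fn, num_steps=100):
    """First-order solver: one velocity evaluation per step."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x += dt * velocity_fn(x, k * dt)
    return x

def sample_heun(x0, velocity_fn, num_steps=100):
    """Second-order solver: average the velocity at the start and predicted end."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v_start = velocity_fn(x, t)
        x_pred = x + dt * v_start              # Euler predictor
        v_end = velocity_fn(x_pred, t + dt)    # velocity at the predicted point
        x += 0.5 * dt * (v_start + v_end)      # trapezoidal corrector
    return x

rng = random.Random(0)
x0 = rng.gauss(0.0, 1.0)              # sample the initial point
exact = 3.0 + 0.5 * x0                # known endpoint for this toy field
print(sample_euler(x0, velocity), sample_heun(x0, velocity), exact)
```

Heun costs two velocity evaluations per step but is usually noticeably more accurate at the same step count, which is why it is a common choice when each evaluation is an expensive network call.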
Module 6: Rectified Flow for Faster Inference
While standard flow matching works well, it has one important limitation: the learned trajectories are often curved. This means that numerical solvers require many steps (typically 20-50) to accurately follow the trajectory, making inference slow.
Rectified Flow solves this problem with a simple fine-tuning procedure:
- Train an initial flow matching model as described above
- Generate a large number of samples by solving the ODE from x₀ to x₁
- Use these coupled (x₀, x₁) pairs, instead of independently sampled noise and data, to retrain the model with the same loss
- Repeat this process 1-2 times
Each reflow step brings the trajectories closer to straight lines. After just one reflow step, high-quality samples can be generated with as few as 2-4 inference steps. After two reflow steps, even one-step generation becomes possible.
The tradeoff is that reflow can introduce small degradations in sample quality, so it is typically only done once or twice.
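The data-generation half of reflow can be sketched in plain Python. As a stand-in for a trained model, this sketch assumes the analytic velocity field for a one-dimensional N(3, 0.5²) target with independently sampled (x₀, x₁); the retraining step itself is not shown and would reuse the usual regression loss on the examples produced here. The `curvature` measure and all names are hypothetical.

```python
import random

MU, SIGMA = 3.0, 0.5  # toy target N(MU, SIGMA^2); assumed for illustration

def teacher_velocity(x, t):
    """Analytic velocity of the initial (pre-reflow) model for this toy target."""
    m = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    return MU + (t * SIGMA ** 2 - (1.0 - t)) / m * (x - t * MU)

def simulate(x0, velocity_fn, num_steps=200):
    """Euler-integrate from t = 0 to t = 1; return the endpoint and velocities."""
    x, dt = x0, 1.0 / num_steps
    velocities = []
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        velocities.append(v)
        x += dt * v
    return x, velocities

def generate_reflow_pairs(velocity_fn, num_pairs, seed=0):
    """Solve the ODE from fresh noise to obtain coupled (x0, x1) pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        x0 = rng.gauss(0.0, 1.0)
        x1, _ = simulate(x0, velocity_fn)
        pairs.append((x0, x1))
    return pairs

def reflow_training_example(pair, rng):
    """Same regression target as before, but x0 and x1 now come as a pair."""
    x0, x1 = pair
    t = rng.random()
    return t * x1 + (1.0 - t) * x0, t, x1 - x0

def curvature(x0, velocity_fn):
    """Mean squared gap between the velocity along the path and the straight-line one."""
    x1, velocities = simulate(x0, velocity_fn)
    displacement = x1 - x0
    return sum((v - displacement) ** 2 for v in velocities) / len(velocities)

pairs = generate_reflow_pairs(teacher_velocity, num_pairs=100)
xt, t, target = reflow_training_example(pairs[0], random.Random(1))
print(pairs[0], (xt, t, target))
print(curvature(2.0, teacher_velocity))  # positive: the original path is curved
```

A perfectly straight trajectory would have zero curvature under this measure, which is the property that allows a solver to take very few steps.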
Module 7: Unification of Generative Paradigms
The lecture concludes with a powerful insight: diffusion, score matching, and flow matching are not competing methods—they are different perspectives on the same underlying problem.
All three paradigms can be unified under the Stochastic Interpolants framework, which shows that:
- The noise predicted by diffusion models
- The score predicted by score matching models
- The velocity predicted by flow matching models
are all mathematically related. For a Gaussian probability path, knowing any one of them at a point xₜ and time t is enough to compute the other two.
This unification explains why all three methods produce similar results when implemented correctly. It also allows researchers to combine insights from all three paradigms to develop even better generative models.
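For the specific linear path used in these notes, xₜ = t x₁ + (1 - t) x₀ with x₀ drawn from a standard Gaussian, the conversions can be written out explicitly. The sketch below assumes that path and this document's time convention, and it is only valid for 0 < t < 1; other noise schedules lead to different coefficients. The function names are hypothetical.

```python
def velocity_from_noise(xt, t, eps):
    """v = x1 - x0 with eps = x0 and x1 = (xt - (1 - t) * eps) / t."""
    return (xt - eps) / t

def noise_from_velocity(xt, t, v):
    """Invert the relation above: eps = xt - t * v."""
    return xt - t * v

def score_from_noise(t, eps):
    """Score of N(t * x1, (1 - t)^2): -(xt - t * x1) / (1 - t)^2 = -eps / (1 - t)."""
    return -eps / (1.0 - t)

def noise_from_score(t, score):
    return -(1.0 - t) * score

def score_from_velocity(xt, t, v):
    return score_from_noise(t, noise_from_velocity(xt, t, v))

def velocity_from_score(xt, t, score):
    return velocity_from_noise(xt, t, noise_from_score(t, score))

# Consistency check on one hand-picked point.
x0, x1, t = 0.7, -1.2, 0.3
xt = t * x1 + (1.0 - t) * x0
eps, v = x0, x1 - x0
print(velocity_from_noise(xt, t, eps), v)
print(score_from_velocity(xt, t, v), score_from_noise(t, eps))
```

Because every relation is linear in the predicted quantity, the same conversions carry over from these per-sample targets to the averaged quantities that a trained network actually predicts.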
Wishing you all the best as you continue your journey into generative modeling. May your vector fields be perfectly Lipschitz continuous, your trajectories be straight and true, and your conditional flow matching loss drop smoothly to zero. May your rectified flow steps give you lightning-fast inference without sacrificing quality, and may your numerical solvers converge perfectly on the first try. The flow matching techniques you're learning today power the fastest and most powerful generative models in the world—keep exploring, keep deriving, and keep pushing the boundaries of what AI can create. Happy generating!


