Multi-Modal Guided Generation

One. Course Details

This is Lecture 4 of Stanford University's CME 296 course, marking the conclusion of the mathematically intensive first half of the class. After three lectures deriving the core theoretical frameworks for unconditional image generation—DDPM diffusion, score-based methods, and flow matching—this lecture transitions to conditional generation, the foundation of modern text-to-image systems.

The lecture is structured into three interconnected modules: first, it addresses the critical problem of image representation, moving from high-dimensional pixel space to structured latent spaces using Variational Autoencoders (VAEs). Second, it covers how to represent conditional inputs like text and images using transformers and contrastive learning. Finally, it introduces two dominant methods for guiding generation: classifier-based guidance and the industry-standard classifier-free guidance (CFG). All concepts are grounded in practical implementation details that power state-of-the-art models like Stable Diffusion and FLUX.

Two. Key Learning Takeaways

Pixel space is unsuitable for diffusion models due to its extreme dimensionality, redundancy, and lack of meaningful structure for generative tasks.

Variational Autoencoders (VAEs) solve these issues by learning a low-dimensional latent space that is both compact and semantically meaningful, reducing computational costs by up to 64x.

Standard VAEs produce blurry reconstructions due to pixel-wise L2 loss and probabilistic latent mapping; this is mitigated by adding perceptual loss and adversarial loss.

Text conditions are represented using transformer encoders, while image conditions use Vision Transformers (ViTs), both producing dense embeddings that capture semantic meaning.

CLIP (Contrastive Language-Image Pre-training) aligns text and image embeddings into a shared space, enabling cross-modal similarity comparisons that are critical for conditional generation.

Classifier-based guidance uses an external noisy-image classifier to steer generation, but it requires training a separate model and is computationally expensive.

Classifier-Free Guidance (CFG) eliminates the need for an external classifier by training a single model to handle both conditional and unconditional generation, making it the standard for all modern text-to-image systems.

The guidance scale parameter in CFG controls how strongly the model follows the input prompt, balancing creativity and adherence to instructions.

Three. Course Gold Quotes

"Semantic similarity is about what the image is. Perceptual similarity is about what the image looks like to the human eye."

"The encoder is a low-pass filter that captures the big picture. The decoder is a high-pass filter that adds all the fine details."

"Pixel-wise L2 loss is a terrible judge of image quality. It penalizes a one-pixel shift more than it penalizes blurriness."

"Classifier-free guidance is brilliant in its simplicity. You don't need a separate model—you just train one model to do two jobs."

"The guidance scale is the dial between 'do whatever you want' and 'do exactly what I said.'"

"Latent diffusion didn't invent diffusion—it made diffusion practical by moving it to a space where computers can actually run it."

"Contrastive learning teaches the model what goes together and what doesn't. That's the foundation of all multi-modal AI."

Four. Layered Learning Notes

Module 1: Course Recap and Core Problem Statement

The first three lectures of CME 296 established three equivalent mathematical frameworks for unconditional image generation:

DDPM: Discrete diffusion that predicts added noise
Score-based: Continuous generation that estimates the gradient of the log probability
Flow matching: Optimal transport that predicts the velocity vector field

All three frameworks result in a simple L2 regression loss, but they all share a critical unstated assumption: that images are represented in some unspecified d-dimensional space. This lecture addresses that assumption and extends the theory to conditional generation, where the model produces images based on user inputs like text prompts or reference images.

Before diving into solutions, the lecture outlines three fatal flaws of using raw pixel space for generative modeling:

Prohibitive dimensionality: A 1024×1024 RGB image has over 3 million dimensions, making diffusion training computationally infeasible
Extreme redundancy: Adjacent pixels are almost always highly correlated, wasting computational resources
Unstructured distribution: Valid images form sparse, spiky clusters in pixel space, making it impossible for generative models to learn smooth transitions between concepts
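
To make the dimensionality argument concrete, here is a small arithmetic sketch. The 1024×1024 RGB size comes from the notes; the 8x spatial downsampling factor and the 4 latent channels are assumptions chosen to resemble a Stable Diffusion-style VAE, not figures stated in the lecture.

```python
def num_values(height, width, channels):
    """Number of scalar values needed to store one image or latent."""
    return height * width * channels

# Pixel space: the 1024x1024 RGB example from the notes.
pixel_dims = num_values(1024, 1024, 3)

# Latent space: both constants below are illustrative assumptions.
DOWNSAMPLE = 8        # assumed spatial downsampling factor per side
LATENT_CHANNELS = 4   # assumed number of latent channels

latent_dims = num_values(1024 // DOWNSAMPLE, 1024 // DOWNSAMPLE, LATENT_CHANNELS)

# Downsampling by 8 on each side leaves 8 * 8 = 64 times fewer spatial positions.
spatial_reduction = DOWNSAMPLE ** 2

print("pixel dimensions: ", pixel_dims)
print("latent dimensions:", latent_dims)
print("spatial reduction:", spatial_reduction)
print("overall reduction:", pixel_dims / latent_dims)
```

Under these assumptions the number of spatial positions shrinks by a factor of 64, which is one plausible reading of the "up to 64x" figure in the takeaways; the overall reduction in stored values is smaller because the latent has more channels than RGB.
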

Module 2: Image Representation with Autoencoders and VAEs

The first attempt to solve these problems is the standard autoencoder, which consists of two symmetric networks:

Encoder: Compresses high-dimensional pixel images into low-dimensional latent vectors using convolutions and pooling operations
Decoder: Reconstructs the original image from the latent vector using transpose convolutions

Autoencoders successfully reduce dimensionality and remove redundancy, but they have a critical limitation: they do not enforce any structure on the latent space. The model is only incentivized to reconstruct inputs accurately, so latent vectors end up scattered randomly throughout the space with no meaningful organization. This makes them useless for diffusion models, which require sampling from a smooth, continuous distribution.

The Variational Autoencoder (VAE) solves this problem by adding a probabilistic layer to the encoder. Instead of outputting a single deterministic latent vector, the encoder predicts the mean and variance of a Gaussian distribution for each input image. Latent vectors are then sampled from this distribution during training. A KL divergence term is added to the loss function, penalizing distributions that deviate from a standard normal prior. This forces the latent space to be smooth, continuous, and centered around the origin, which is exactly what generative modeling needs.
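
The two VAE ingredients described above, sampling a latent from the predicted Gaussian and penalizing its divergence from a standard normal, can be sketched in plain Python. This is a minimal illustration that assumes a diagonal Gaussian parameterized by a mean and a log-variance per dimension; the function names are hypothetical, and a real encoder would produce `mu` and `log_var` from an image.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -0.5]   # made-up encoder outputs
z = reparameterize(mu, log_var, rng)
print("sampled latent:", z)
print("KL penalty:    ", kl_to_standard_normal(mu, log_var))
```

The KL term is zero only when the predicted distribution is exactly the standard normal, so minimizing it pulls every per-image distribution toward the origin with unit variance.
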
However, VAEs introduce a new problem: blurry reconstructions. This has two primary causes:

Pixel-wise L2 reconstruction loss: This loss averages over plausible reconstructions, producing blurry results rather than sharp, detailed images
Probabilistic latent mapping: The stochastic sampling step introduces additional uncertainty that the decoder averages out
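
A tiny worked example shows why an L2 loss favors blur. Suppose a 1-D "image" contains a sharp edge whose position is uncertain, so two reconstructions are equally plausible. The signals below are made up for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def expected_mse(prediction, targets):
    """Average MSE of one prediction against equally likely targets."""
    return sum(mse(prediction, t) for t in targets) / len(targets)

# Two equally plausible sharp edges, one pixel apart.
edge_left = [0.0, 1.0, 1.0, 1.0]
edge_right = [0.0, 0.0, 1.0, 1.0]
targets = [edge_left, edge_right]

# The pixel-wise average of the two is a soft (blurry) edge.
blurry = [(a + b) / 2 for a, b in zip(edge_left, edge_right)]

print("expected loss, sharp guess: ", expected_mse(edge_left, targets))
print("expected loss, blurry guess:", expected_mse(blurry, targets))
```

The blurry average achieves a lower expected loss than either sharp edge, so a decoder trained with L2 alone is rewarded for hedging between plausible outputs.
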

Two complementary techniques are used to combat blurriness:

Perceptual loss: Compares feature maps from a pre-trained convolutional network rather than raw pixels, capturing structural and textural similarity instead of exact pixel alignment
Adversarial loss: Adds a discriminator network that tries to distinguish real images from reconstructed ones, incentivizing the decoder to produce photorealistic outputs

The lecture emphasizes that modern VAEs used in diffusion models are asymmetric: the encoder is relatively small and focuses on capturing semantic information, while the decoder is much larger and focuses on adding fine perceptual details. This division of labor makes the latent space easy for diffusion models to learn while still producing high-quality final images.

Module 3: Conditional Input Representation

With a suitable latent space established, the lecture turns to representing conditional inputs. The two most common conditions are text and images, both of which are converted into dense embeddings using transformer-based models.

For text conditions:

Text is first tokenized into subword units using algorithms like Byte-Pair Encoding (BPE)
Token embeddings are passed through a transformer encoder, which captures contextual relationships between words
The final hidden state of the encoder (or the hidden state of a special summary token such as [CLS]) is used as the text embedding
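
As a rough illustration of the tokenization step, the following sketch performs a single Byte-Pair Encoding merge on a toy word-frequency table. The corpus and function names are made up for this example, and production tokenizers add details (byte-level fallback, end-of-word markers, a learned merge table) that are omitted here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is split into characters, with a made-up count.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
best = most_frequent_pair(corpus)
corpus_after = merge_pair(corpus, best)
print("merged pair:", best)
print("vocabulary: ", list(corpus_after))
```

Repeating this merge step many times builds the subword vocabulary; frequent character sequences become single tokens while rare words stay split into smaller pieces.
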

For image conditions:

Images are divided into non-overlapping patches, which are flattened and projected into embeddings
Position embeddings are added to preserve spatial information
The patch embeddings are processed by a Vision Transformer (ViT), producing a global image embedding
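
The patch-splitting step can be sketched as follows. This is a simplified, single-channel version using nested lists; the learned linear projection and the position embeddings mentioned above are left out, and `patchify` is a hypothetical helper name.

```python
def patchify(image, patch_size):
    """Split a 2-D image (list of rows) into flattened non-overlapping patches.

    Patches are returned in row-major order, each flattened row by row.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A 4x4 toy image whose pixel values are 0..15 in reading order.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
for index, patch in enumerate(patches):
    print(index, patch)
```

Each flattened patch plays the role of one "token" for the Vision Transformer, which is why position embeddings are needed: the flattening alone discards where the patch came from.
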

The biggest challenge in multi-modal generation is aligning these two different embedding spaces. This is solved by CLIP (Contrastive Language-Image Pre-training), which trains a text encoder and an image encoder simultaneously on millions of image-caption pairs. The training objective is simple: maximize the similarity between matching image-caption pairs while minimizing the similarity between non-matching pairs. This produces a shared embedding space where semantically similar concepts, whether text or image, are clustered together.
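
The contrastive objective can be sketched as a symmetric cross-entropy over a matrix of image-text similarities. The version below is a minimal pure-Python illustration, not the actual CLIP training code: the embeddings are hand-written vectors, and the fixed temperature of 0.07 is an assumption made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k in each list is a true match."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    n = len(logits)
    # Image-to-text: each row should pick out its own caption.
    i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should pick out its own image.
    t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print("loss when pairs match:  ", contrastive_loss(images, aligned_texts))
print("loss when pairs swapped:", contrastive_loss(images, swapped_texts))
```

The loss is near zero when each image is most similar to its own caption and large when the pairing is wrong, which is the pressure that pulls matching text and image embeddings together in the shared space.
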

Module 4: Guided Generation Techniques

The final and most important part of the lecture covers how to use these conditional embeddings to steer the generation process. Two main approaches are presented.

Classifier-Based Guidance was the first successful method for conditional diffusion. It works by:

Training a separate classifier on noisy images that predicts the class label (or other condition)
During generation, computing the gradient of the classifier's output with respect to the noisy latent
Shifting the denoising step in the direction of this gradient to increase the probability of the desired condition
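
The three steps above amount to adding a scaled classifier gradient to the unconditional score. The sketch below shows this on a 1-D toy problem where everything is available in closed form: the "data" prior is a standard normal and the "classifier" is a logistic function. Both choices, and the `sharpness` parameter, are assumptions made for illustration; a real system would use a trained network and automatic differentiation on noisy latents.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prior_score(x):
    """Gradient of log N(x; 0, 1) with respect to x."""
    return -x

def classifier_log_prob_grad(x, sharpness=2.0):
    """Gradient of log p(y=1 | x) for the toy classifier sigmoid(sharpness * x)."""
    return sharpness * (1.0 - sigmoid(sharpness * x))

def guided_score(x, scale):
    """Unconditional score shifted toward the desired class."""
    return prior_score(x) + scale * classifier_log_prob_grad(x)

for scale in (0.0, 1.0, 3.0):
    print("scale", scale, "-> guided score at x=0:", guided_score(0.0, scale))
```

With a scale of zero the sampler follows the unconditional score; larger scales push samples further toward regions the classifier labels as the target class, which also hints at why very large scales can produce artifacts.
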

While effective, classifier-based guidance has significant drawbacks: it requires training a separate classifier, it is computationally expensive to compute gradients at every step, and it often produces artifacts when the guidance scale is set too high.

Classifier-Free Guidance (CFG) addresses all these issues and has become the industry standard. The key insight is that a single diffusion model can be trained to perform both conditional and unconditional generation. During training:

With probability p (typically 10-20%), the model is trained without any condition (unconditional generation)
With probability 1-p, the model is trained with the condition embedding as input
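
The training-time rule above is often implemented as random "condition dropout". The following sketch assumes a hypothetical `NULL_CONDITION` placeholder standing in for the empty condition; in practice this is usually a learned null embedding or the embedding of an empty prompt.

```python
import random

# Hypothetical stand-in for "no condition".
NULL_CONDITION = None

def maybe_drop_condition(condition, drop_prob, rng):
    """With probability drop_prob, replace the condition with the null one."""
    return NULL_CONDITION if rng.random() < drop_prob else condition

rng = random.Random(0)
drop_prob = 0.1  # within the 10-20% range mentioned in the notes
batch = [maybe_drop_condition("a photo of a cat", drop_prob, rng)
         for _ in range(10_000)]
dropped_fraction = sum(c is NULL_CONDITION for c in batch) / len(batch)
print("fraction trained unconditionally:", dropped_fraction)
```

Because the same network sees both cases during training, it can later produce a conditional and an unconditional prediction for the same noisy input.
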

During inference, the guided noise prediction is computed as a weighted combination of the conditional and unconditional predictions:

ε_guided = ε_unconditional + w * (ε_conditional - ε_unconditional)

The guidance scale w controls how strongly the model follows the condition. A value of 1 gives standard conditional generation, while higher values (typically 3-7) produce images that adhere more closely to the prompt but may become less diverse or more artificial.
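
The CFG combination rule is a one-line element-wise operation. The sketch below applies it to short made-up lists standing in for the two noise predictions; in a real sampler these would be tensors produced by two forward passes of the same model.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # made-up unconditional prediction
eps_cond = [1.0, 3.0]     # made-up conditional prediction

for w in (0.0, 1.0, 2.0, 5.0):
    print("w =", w, "->", cfg_combine(eps_uncond, eps_cond, w))
```

With w = 0 the result is the unconditional prediction, with w = 1 it is the conditional one, and with w > 1 the result extrapolates past the conditional prediction in the direction away from the unconditional one.
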

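CFG eliminates the need for a separate classifier and for backpropagating through one at every sampling step, although it does require both a conditional and an unconditional prediction per step. It typically produces higher-quality results with fewer artifacts, and it is used in most major text-to-image models released since 2021.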

Wishing you all the best as you continue exploring the world of conditional generative AI. May your VAEs learn smooth latent spaces, your CLIP embeddings align perfectly, and your classifier-free guidance strikes just the right balance between creativity and prompt adherence. May your training runs converge quickly, your loss curves drop steadily, and your generated images exceed every expectation. The techniques you're learning today power every text-to-image tool you use—keep experimenting, keep building, and never stop pushing the boundaries of what AI can create. Happy generating!

Video Source and Usage Instructions

• Video Title: Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 4 - Latent Space & Guidance
• Course Series: Stanford CME296: Diffusion & Large Vision Models
• Original Platform: YouTube
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/WUUq6TVAu8U?si=LX_iymiz5RWps6mV
