One. Course Details
This is Lecture 5 of Stanford University's CME 296 course, marking the official start of the second half of the class. After three lectures deriving tractable training losses for image generation and one lecture on latent space representation with VAEs, the course shifts focus to the practical implementation of modern image generation systems. This lecture specifically covers the core architectures that power state-of-the-art text-to-image models.
The lecture begins with a comprehensive recap of the first half of the course, reviewing the three theoretical frameworks for diffusion models (DDPM, score-based, and flow matching) along with VAE latent space representation and classifier-free guidance. It then systematically introduces the two dominant architectures in image generation: the U-Net, which dominated the field until 2022, and the Diffusion Transformer (DiT), which has since become the standard backbone for modern systems. The lecture concludes with a deep dive into position embedding techniques, a critical but often overlooked component of transformer-based models.
Two. Key Learning Takeaways
All modern image generation models take three inputs: the noisy latent x_t, the timestep t indicating noise level, and a condition c (typically text), and output the velocity/noise/score needed for denoising.
The U-Net architecture uses an encoder-decoder structure with skip connections to balance global structure understanding and local detail preservation, leveraging the inductive bias of convolutions.
The Diffusion Transformer (DiT) addresses U-Net's limitation in modeling long-range dependencies by using self-attention, allowing every patch of the image to interact with every other patch.
Adaptive Layer Normalization (adaLN) is the most effective method for injecting timestep and condition information into DiT blocks, using gate, scale, and shift parameters to modulate patch embeddings.
Multi-Modal DiT (MM-DiT) improves upon standard DiT by using joint attention instead of uniform modulation, allowing different parts of the image to respond differently to different parts of the text prompt.
Absolute position embeddings add position information at the input level, while Rotary Position Embeddings (RoPE) rotate queries and keys in the attention layer, providing superior position encoding and extrapolation capabilities.
2D position embeddings have two main variants: axial RoPE, which separates x and y axes, and mixed RoPE, which combines rotations from both axes for better spatial interaction.
The evolution of image generation architectures follows a clear timeline: U-Nets dominated the early 2020s, standard DiTs emerged in 2022, and MM-DiTs have become the standard since 2024.
Three. Notable Course Quotes
"We want our model to be biased towards looking at images the way we humans look at them. That's what we call an inductive bias."
"The U-Net is great, but it fails at things like a teddy bear looking in a mirror. It can't connect the local details in two distant parts of the image."
"Adaptive LayerNorm is like a dimmer switch for your patch embeddings. It turns up the dimensions that matter for your prompt and timestep, and turns down the ones that don't."
"Cross-attention is like a painter looking at a poem written by someone else. Joint attention is like the painter and poet sitting in the same room, working together to create the perfect image."
"Position embeddings are the unsung heroes of transformers. You don't notice them when they work, but everything falls apart when they don't."
"RoPE doesn't just add position information—it rotates the representations themselves. That's why it works so much better for long sequences and extrapolation."
"Architecture is not about picking the fanciest model. It's about understanding the tradeoffs and choosing the right tool for the job."
Four. Layered Learning Notes
Module 1: Course Recap and Architecture Requirements
The first half of CME 296 focused entirely on the theoretical foundations of image generation. Lectures 1 through 3 derived three equivalent frameworks for training generative models:
- DDPM: Predict the noise added to clean images
- Score-based: Estimate the gradient of the log probability distribution
- Flow matching: Regress the vector field that transports noise to data
All three frameworks result in a simple L2 regression loss, with flow matching emerging as the industry standard for modern systems due to its stability and simplicity. Lecture 4 then covered latent space representation using Variational Autoencoders (VAEs), which reduce the dimensionality of images by a factor of 4-8, dramatically lowering the computational cost of training and inference.
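To make the "simple L2 regression loss" concrete, here is a minimal NumPy sketch of a flow matching loss for one batch. It assumes a straight-line path from noise (t = 0) to data (t = 1), which is one common convention and not necessarily the one used in the lecture; `flow_matching_loss` and `predict_velocity` are hypothetical names introduced only for illustration.

```python
import numpy as np

def flow_matching_loss(predict_velocity, x_data, rng):
    """Monte Carlo estimate of a flow matching L2 loss for one batch.

    Assumption: linear path x_t = (1 - t) * noise + t * data, so the
    regression target (the velocity along the path) is data - noise.
    """
    batch = x_data.shape[0]
    noise = rng.standard_normal(x_data.shape)
    # One time value per sample, shaped so it broadcasts over the rest.
    t = rng.uniform(size=(batch,) + (1,) * (x_data.ndim - 1))
    x_t = (1.0 - t) * noise + t * x_data   # point on the noise -> data path
    target = x_data - noise                # constant velocity of that path
    pred = predict_velocity(x_t, t)        # stand-in for the network
    return np.mean((pred - target) ** 2)   # plain L2 regression

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # hypothetical batch of 4 latents
loss = flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), x, rng)
```

In a real system the lambda would be the U-Net or DiT discussed later, which also receives the condition c.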
Before diving into specific architectures, the lecture outlines four core requirements for any image generation model:
- Global structure understanding: The model must grasp the overall composition and layout of the image
- Local detail preservation: The model must generate crisp, high-fidelity textures and fine details
- Conditioning capability: The model must effectively incorporate external signals like text prompts
- Scalability: The model must perform well at high resolutions and scale efficiently with more parameters
Module 2: Convolutions and the U-Net Architecture
The U-Net architecture is built around the inductive bias of convolutions, which mimic how humans scan images by focusing on local regions. A convolution operation applies a learnable filter across an input image, extracting local features like edges, corners, and textures. Each filter is a 3D tensor with dimensions F × F × C, where F is the filter size and C is the number of input channels.
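As a rough illustration of a single F × F × C filter sliding over an image, the sketch below implements the operation with explicit loops in NumPy. The function name, the 5 × 5 test image, and the edge-detecting kernel are all made up for this example; deep learning libraries compute the same thing (technically a cross-correlation) far more efficiently.

```python
import numpy as np

def conv2d_single_filter(image, kernel):
    """Apply one F x F x C filter to an H x W x C image (no padding, stride 1)."""
    H, W, C = image.shape
    F = kernel.shape[0]
    assert kernel.shape == (F, F, C)
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + F, j:j + F, :]   # local F x F x C region
            out[i, j] = np.sum(window * kernel)   # one number per location
    return out

# Hypothetical input: dark on the left, bright from column 2 onward.
image = np.zeros((5, 5, 1))
image[:, 2:, 0] = 1.0
# A hand-written vertical-edge filter (a learned filter would be trained).
kernel = np.array([[-1.0, 0.0, 1.0]] * 3).reshape(3, 3, 1)
response = conv2d_single_filter(image, kernel)
```

Each output value depends only on a 3 × 3 window of the input, which is exactly the limited receptive field discussed next.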
A key limitation of convolutions is their limited receptive field—the area of the input that a single output neuron can see. To address this, U-Net uses a symmetric encoder-decoder structure:
- Downsampling path (encoder): Alternates between convolutions and pooling operations to progressively reduce spatial resolution while increasing the receptive field. This allows the model to build a global understanding of the image structure at the bottleneck.
- Upsampling path (decoder): Uses transpose convolutions to gradually increase spatial resolution back to the input size, reconstructing the image from the compressed bottleneck representation.
- Skip connections: Directly copy feature maps from corresponding levels of the encoder to the decoder. This preserves local details that would otherwise be lost during downsampling, combining them with the global structure information from the bottleneck.
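The data flow of the three components above can be sketched without any learned weights. In this toy version, average pooling stands in for the encoder, nearest-neighbour upsampling stands in for the transpose convolutions, and the skip connection is a channel-wise concatenation; a real U-Net would apply learned convolutions at every level. All function names are hypothetical.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling on an H x W x C feature map (H and W even)."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour 2x upsampling (stand-in for a transpose convolution)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet(x, depth=2):
    skips = []
    for _ in range(depth):                        # encoder path
        skips.append(x)                           # remember full-detail features
        x = downsample(x)
    bottleneck_shape = x.shape                    # coarse, global view
    for _ in range(depth):                        # decoder path
        x = upsample(x)
        skip = skips.pop()                        # matching encoder level
        x = np.concatenate([x, skip], axis=-1)    # skip connection
    return x, bottleneck_shape

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 4))              # hypothetical feature map
out, bottleneck_shape = toy_unet(x, depth=2)
```

The last four channels of `out` are an exact copy of the input, which is the sense in which skip connections carry fine detail around the bottleneck.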
The U-Net was originally developed for medical image segmentation in 2015 but was adopted for diffusion models in the early 2020s, powering breakthroughs like Stable Diffusion. However, it has a fundamental limitation: convolutions only model local interactions, making it difficult for U-Nets to capture long-range dependencies between distant parts of the image.
Module 3: Diffusion Transformer (DiT) Architecture
The Diffusion Transformer (DiT), introduced in 2022, addresses U-Net's long-range dependency problem by replacing convolutions with self-attention. Self-attention allows every patch of the image to directly interact with every other patch, regardless of their distance.
The DiT pipeline proceeds in four main steps:
- Patchification: The input noisy latent is divided into non-overlapping patches of size P × P. Each patch is flattened and projected to a D-dimensional embedding, resulting in a sequence of tokens similar to those in text transformers.
- Position embedding: A position embedding is added to each patch token to preserve spatial information, which is lost during the patchification process.
- DiT blocks: The sequence of patch tokens passes through multiple DiT blocks, each consisting of layer normalization, self-attention, and a feed-forward network.
- Output projection: The final sequence of tokens is linearly projected and reshaped back to the original latent dimensions, producing the predicted velocity/noise for denoising.
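The first and last steps of this pipeline are mostly reshaping, which a short NumPy sketch can make explicit. The 32 × 32 × 4 latent, the patch size P = 2, the width D = 64, and the random projection matrix are illustrative assumptions, not values taken from the lecture.

```python
import numpy as np

def patchify(latent, P):
    """Split an H x W x C latent into (H/P * W/P) flattened patches."""
    H, W, C = latent.shape
    assert H % P == 0 and W % P == 0
    x = latent.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/P, W/P, P, P, C)
    return x.reshape((H // P) * (W // P), P * P * C)

def unpatchify(tokens, P, H, W, C):
    """Inverse of patchify: rebuild the H x W x C grid from the sequence."""
    x = tokens.reshape(H // P, W // P, P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/P, P, W/P, P, C)
    return x.reshape(H, W, C)

rng = np.random.default_rng(0)
latent = rng.standard_normal((32, 32, 4))        # hypothetical noisy latent
P, D = 2, 64
tokens = patchify(latent, P)                     # (256, 16): one row per patch
W_embed = rng.standard_normal((P * P * 4, D)) * 0.02   # stand-in for learned weights
embedded = tokens @ W_embed                      # (256, 64) token sequence
restored = unpatchify(tokens, P, 32, 32, 4)      # round trip back to the grid
```

Halving P quadruples the sequence length, which is why patch size is one of the main levers on compute cost.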
The most critical design choice in DiT is how to inject timestep and condition information. The paper evaluated three methods:
- Concatenation (in-context conditioning): Append the condition embedding to the input sequence as extra tokens
- Cross-attention: Use the condition as keys and values in an additional attention layer
- Adaptive Layer Normalization (adaLN): Modulate the normalized patch embeddings using gate, scale, and shift parameters derived from the condition
adaLN was found to be the most effective by a significant margin. It works by passing the combined timestep and condition embedding through a small MLP to produce three parameters:
- alpha: A gate that controls how much of the modulated signal is added to the residual
- gamma: A scale factor that amplifies or attenuates different dimensions of the patch embedding
- beta: A shift factor that offsets the patch embedding
adaLN-zero, a variant that initializes the modulation MLP's output layer to zero so that every gate starts at zero, further improves training stability by ensuring each block starts as the identity function.
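A minimal sketch of one adaLN-modulated residual branch is shown below. It makes several simplifying assumptions: a single linear layer stands in for the small MLP, the scale is applied as (1 + gamma) so that zero means "no change", and `sublayer` is a placeholder for self-attention or the feed-forward network. All names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_block(x, cond, W_mod, b_mod, sublayer):
    """x: (N, D) patch tokens; cond: (D_c,) timestep + condition embedding."""
    # One projection produces shift (beta), scale (gamma), and gate (alpha).
    shift, scale, gate = np.split(cond @ W_mod + b_mod, 3)
    h = layer_norm(x) * (1.0 + scale) + shift   # same modulation for every token
    return x + gate * sublayer(h)               # gated residual update

rng = np.random.default_rng(0)
N, D, D_c = 6, 8, 5                              # illustrative sizes
x = rng.standard_normal((N, D))
cond = rng.standard_normal(D_c)
W_ff = rng.standard_normal((D, D))
sublayer = lambda h: np.tanh(h @ W_ff)           # placeholder for attention / FFN

# adaLN-zero style initialization: the modulation projection starts at zero.
out_zero = adaln_block(x, cond, np.zeros((D_c, 3 * D)), np.zeros(3 * D), sublayer)
# After training the projection is non-zero and the block changes its input.
W_mod = rng.standard_normal((D_c, 3 * D)) * 0.1
out_trained = adaln_block(x, cond, W_mod, np.zeros(3 * D), sublayer)
```

With the zero initialization, `out_zero` equals `x`, which is the identity-at-start behavior described above.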
Module 4: Multi-Modal DiT (MM-DiT)
While standard DiT with adaLN works well, it has a critical limitation: it applies the same modulation to all patch tokens uniformly. This means every part of the image receives the same signal from the text prompt, making it hard to generate images where different regions correspond to different parts of the prompt.
The Multi-Modal DiT (MM-DiT), popularized by Stable Diffusion 3 in 2024, solves this problem by using joint attention instead of uniform modulation. In joint attention, text embeddings and image patch embeddings are concatenated into a single sequence and processed together through the same self-attention layers. This allows:
- Image patches to attend to relevant parts of the text
- Text tokens to attend to image patches and adjust their meaning based on visual context
- Natural interaction between the two modalities
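Joint attention can be sketched as ordinary self-attention over the concatenated sequence. The version below is single-head and uses one shared set of projection matrices for both modalities, which is a simplification; actual MM-DiT variants may use separate weights per modality. Sizes and names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(text, image, Wq, Wk, Wv):
    """text: (T, D) tokens, image: (N, D) patch tokens; single head."""
    tokens = np.concatenate([text, image], axis=0)        # (T + N, D)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (T + N, T + N)
    out = weights @ v
    T = text.shape[0]
    return out[:T], out[T:], weights                      # split back by modality

rng = np.random.default_rng(0)
T, N, D = 3, 4, 8                                         # illustrative sizes
text = rng.standard_normal((T, D))
image = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
text_out, image_out, weights = joint_attention(text, image, Wq, Wk, Wv)

# weights[T:, :T] holds how strongly each image patch attends to each text token;
# weights[:T, T:] holds the reverse direction (text attending to image).
image_to_text = weights[T:, :T]
```

Because each image patch has its own row of attention weights, different patches can draw on different words of the prompt, unlike the uniform modulation of adaLN.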
MM-DiT architectures are classified into three types based on how they process different modalities:
- Single-stream: All tokens (text and image) share the same attention and feed-forward layers
- Dual-stream: Text and image tokens have separate feed-forward layers but share attention layers
- Hybrid: A combination of single-stream and dual-stream layers, balancing performance and efficiency
All state-of-the-art image generation models released since 2024 use some variant of MM-DiT, including Qwen-Image, Z-Image, and FLUX.1.
Module 5: Position Embedding Optimization
Position embeddings are a critical component of all transformer architectures, as self-attention is inherently order-agnostic. Without position information, the model would treat the image as an unordered bag of patches.
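The "unordered bag of patches" point can be checked numerically: without position information, shuffling the input tokens simply shuffles the outputs in the same way. The sketch below uses a single-head attention layer with random weights; all sizes and names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention with no position information."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
n_tokens, D = 5, 8
x = rng.standard_normal((n_tokens, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

perm = rng.permutation(n_tokens)                 # a random reordering of patches
out = self_attention(x, Wq, Wk, Wv)
out_shuffled = self_attention(x[perm], Wq, Wk, Wv)

# Each token gets the same output wherever it sits in the sequence.
is_order_agnostic = np.allclose(out_shuffled, out[perm])
```

The layer therefore cannot tell a patch in the top-left corner from the same patch in the bottom-right unless position is injected somewhere.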
The lecture covers two main types of position embeddings:
- Absolute Position Embeddings: Add a fixed or learned position vector to each token embedding at the input level. The original transformer used sinusoidal absolute embeddings, while Vision Transformers (ViT) typically use learned embeddings. While simple, absolute embeddings have limited extrapolation capabilities and interact poorly with content embeddings.
- Rotary Position Embeddings (RoPE): Introduced in 2021, RoPE has become the industry standard for modern transformers. Instead of adding position information, RoPE rotates the query and key vectors in the attention layer by an amount proportional to their position. This makes the attention score between a query and a key depend on their positions only through the relative offset between them, not through their absolute positions.
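A small 1D sketch of the rotation makes the relative-position property easy to verify. It pairs up consecutive dimensions and uses the frequency schedule base^(-2i/D) with base 10000, a common convention that is assumed here for illustration; `rope_1d` is a hypothetical helper name.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive pairs of x (shape (D,), D even) by pos * theta_i."""
    D = x.shape[-1]
    theta = base ** (-np.arange(0, D, 2) / D)    # (D/2,) rotation frequencies
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin    # 2D rotation of each pair
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(16)                       # one query vector
k = rng.standard_normal(16)                       # one key vector

# Same offset (4 positions apart) at two very different absolute positions.
score_near = rope_1d(q, 3) @ rope_1d(k, 7)
score_far = rope_1d(q, 103) @ rope_1d(k, 107)
# A different offset between the same two vectors.
score_other = rope_1d(q, 3) @ rope_1d(k, 8)
```

Here `score_near` and `score_far` agree up to floating-point error, while `score_other` generally differs; rotations also leave vector norms unchanged, so only the angle between query and key carries position.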
For 2D images, RoPE has two main variants:
- Axial RoPE: Splits the embedding dimension into two halves, one for the x-axis and one for the y-axis. While simple, it produces axial artifacts due to the lack of interaction between the two axes.
- Mixed RoPE: Combines rotations from both x and y axes in the same embedding dimensions, allowing for natural spatial interaction and eliminating axial artifacts.
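The difference between the two variants comes down to which rotation angle each pair of dimensions receives. In the sketch below, axial RoPE gives half the pairs an angle driven only by x and the other half an angle driven only by y, while mixed RoPE gives every pair an angle that combines both. The frequency values are placeholders (in practice the mixed frequencies may be learned), and all names are hypothetical.

```python
import numpy as np

def rotate_pairs(x, angle):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by angle[i]."""
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def axial_rope(x, px, py, base=100.0):
    """First half of x encodes the x-coordinate, second half the y-coordinate."""
    half = x.shape[-1] // 2
    theta = base ** (-np.arange(0, half, 2) / half)   # (half/2,) frequencies
    return np.concatenate([rotate_pairs(x[:half], px * theta),
                           rotate_pairs(x[half:], py * theta)])

def mixed_rope(x, px, py, theta_x, theta_y):
    """Every pair is rotated by an angle that mixes both coordinates."""
    return rotate_pairs(x, px * theta_x + py * theta_y)

rng = np.random.default_rng(0)
D = 8                                             # must be divisible by 4 here
q, k = rng.standard_normal(D), rng.standard_normal(D)
theta_x = rng.uniform(0.1, 1.0, size=D // 2)      # placeholder frequencies
theta_y = rng.uniform(0.1, 1.0, size=D // 2)

# Same 2D offset (dx = 2, dy = -4) at two different absolute locations.
axial_a = axial_rope(q, 2, 5) @ axial_rope(k, 4, 1)
axial_b = axial_rope(q, 12, 25) @ axial_rope(k, 14, 21)
mixed_a = mixed_rope(q, 2, 5, theta_x, theta_y) @ mixed_rope(k, 4, 1, theta_x, theta_y)
mixed_b = mixed_rope(q, 12, 25, theta_x, theta_y) @ mixed_rope(k, 14, 21, theta_x, theta_y)
```

Both variants keep the relative-offset property in 2D; the mixed form differs in that no dimension is blind to one of the axes.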
The lecture also covers advanced position embedding techniques for variable-resolution images and multi-modal models, including centered coordinate systems and diagonal position encoding for text tokens in MM-DiTs.
Wishing you all the best as you continue your journey into image generation model design. May your skip connections preserve every fine detail, your attention mechanisms capture all long-range dependencies, and your RoPE rotations perfectly encode spatial relationships. May your DiT blocks train smoothly, your adaLN modulations hit just the right balance, and your generated images exceed every expectation. The architectures you're learning today are the foundation of the next generation of visual AI—keep exploring, keep building, and never stop asking why things are designed the way they are. Happy coding!


