One. Course Details
This is Lecture 6 of Stanford University's CME 296 course, marking the transition from theoretical foundations to practical implementation of text-to-image generation systems. The lecture builds on previous coverage of UNet and Diffusion Transformer (DiT) architectures, providing a complete end-to-end guide to training production-grade image generation models.
The lecture is structured around the full training lifecycle of modern diffusion models, covering four sequential phases: pre-training, post-training, personalization tuning, and production distillation. It emphasizes practical engineering tricks and optimizations that are critical for real-world performance but rarely covered in theoretical papers. All concepts are grounded in the flow-matching framework, which has become the industry standard for modern text-to-image systems.
Two. Key Learning Takeaways
Modern text-to-image training follows a four-stage lifecycle: pre-training, post-training, personalization, and distillation, each with distinct objectives and techniques.
Flow matching has emerged as the dominant training objective for contemporary diffusion models, replacing earlier DDPM and score-based formulations for improved stability and performance.
Uniform time step sampling is suboptimal; logit-normal sampling that emphasizes middle timesteps significantly improves training efficiency and final model quality.
Perceived noise level varies with image resolution, requiring resolution-dependent time step shifting to maintain consistent training difficulty across scales.
Representation Alignment (REPA) accelerates training by up to 18x by aligning model representations with those of pre-trained encoders.
Curriculum learning—starting with simple low-resolution images and progressing to complex high-resolution inputs—dramatically improves model convergence.
Post-training combines continued training for domain specialization, supervised fine-tuning for instruction following, and preference tuning for aesthetic and alignment goals.
DreamBooth with Low-Rank Adaptation (LoRA) enables efficient personalization of pre-trained models to specific subjects with minimal compute.
Progressive distillation, InstaFlow, and consistency models reduce inference from hundreds of steps to between one and four while preserving most image quality.
Three. Course Gold Quotes
"Pre-training teaches your model how to generate any images. Post-training teaches it how to generate good images."
"If you want to learn about a topic, get a book about it. That's exactly what REPA does—it gives your model a pre-written book of visual representations to learn from faster."
"Training a diffusion model is like learning to cook. Pre-training is learning about ingredients and basic techniques. Post-training is learning to make restaurant-quality dishes. Preference tuning is learning what your customers actually like to eat."
"Prompt enhancement is the waiter between the customer and the chef. It takes a simple request like 'I want meat' and translates it into the detailed instructions the chef needs to make a perfect meal."
"Distillation is not about making a worse model faster. It's about capturing all the knowledge of a big slow model into a small fast one that can run in production."
"The hardest part of training is not getting the model to generate images—it's getting it to generate the right images reliably, every single time."
Four. Layered Learning Notes
Module 1: Full Training Lifecycle Overview
Modern text-to-image generation systems follow a structured four-phase training process that has been refined through years of industry practice. Each phase builds on the previous one, addressing specific limitations and adding new capabilities.
The first phase is pre-training, where the model learns the fundamental structure of visual data and how to map text descriptions to images. This phase is the most compute-intensive, requiring billions of image-text pairs and thousands of GPU hours. The goal here is not perfection but broad coverage of concepts, objects, and scenes.
The second phase is post-training, which transforms a general-purpose pre-trained model into a high-quality production system. This phase has three sub-components: continued training for domain specialization, supervised fine-tuning for instruction following, and preference tuning for aesthetic quality and alignment.
The third optional phase is personalization tuning, which allows adapting the base model to generate specific subjects, styles, or characters that were not present in the original training data. This is critical for many real-world applications where users want to generate images of their own products, pets, or avatars.
The final phase is production distillation, which optimizes the model for low-latency inference. This step reduces the number of denoising steps from hundreds to between one and four, making the model fast enough for real-time applications while preserving most of its quality.
Module 2: Loss Functions and Time Step Sampling
Three theoretical frameworks exist for diffusion models: DDPM, score-based modeling, and flow matching. The lecture emphasizes that flow matching is now the industry standard for modern text-to-image systems, offering better training stability, faster convergence, and a simpler implementation than the earlier approaches.
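The flow-matching objective can be sketched in a few lines. The block below is a minimal illustration, not the lecture's implementation: it assumes the convention used in these notes (t = 0 is pure noise, t = 1 is clean data), a straight-line interpolation between noise and data, and a hypothetical `model(x_t, t)` callable that returns a predicted velocity.

```python
import numpy as np

def flow_matching_loss(model, x_data, rng):
    """Flow-matching loss for one batch (assumed convention: t=0 noise, t=1 data)."""
    batch = x_data.shape[0]
    noise = rng.standard_normal(x_data.shape)      # x_0 ~ N(0, I)
    t = rng.uniform(0.0, 1.0, size=(batch, 1))     # uniform timesteps, the textbook choice
    x_t = (1.0 - t) * noise + t * x_data           # point on the straight line from noise to data
    target_velocity = x_data - noise               # derivative of x_t with respect to t
    predicted_velocity = model(x_t, t)             # hypothetical velocity-prediction network
    return float(np.mean((predicted_velocity - target_velocity) ** 2))
```

The model regresses the constant velocity that carries a noise sample to its paired data sample, so the loss is a plain mean squared error with no noise-schedule coefficients to tune.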
A critical but often overlooked detail is time step sampling. The theoretical derivation assumes uniform sampling of timesteps from 0 to 1, but this is highly suboptimal in practice. The difficulty of the denoising task varies dramatically across timesteps:
- Early timesteps (t near 0): Easy task, as the model only needs to predict the general direction toward the data distribution
- Late timesteps (t near 1): Easy task, as the image is almost clean and only requires minor adjustments
- Middle timesteps (t near 0.5): Extremely hard task, as the model must make all the critical decisions about object placement, composition, and details
To address this imbalance, practitioners use a logit-normal distribution for time step sampling. This distribution concentrates samples in the middle timesteps where the model learns the most, while still providing some coverage of early and late steps. The distribution can be tuned by adjusting its mean and standard deviation to prioritize different timesteps based on the specific use case.
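A logit-normal sample is simply a Gaussian sample passed through a sigmoid. The sketch below assumes a mean of 0 and a standard deviation of 1 as illustrative defaults; the function name and parameter names are hypothetical, and the values used in any particular training run may differ.

```python
import math
import random

def sample_logit_normal(mean=0.0, std=1.0, rng=random):
    """Draw one timestep in (0, 1) by squashing a Gaussian sample with a sigmoid."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

def fraction_in_middle(samples, low=0.25, high=0.75):
    """Share of samples that land in the middle band of timesteps."""
    return sum(low < t < high for t in samples) / len(samples)
```

With these defaults, noticeably more than half of the samples fall between 0.25 and 0.75, whereas uniform sampling would put exactly half there. Moving `mean` away from 0 biases the samples toward one end of the schedule, and increasing `std` spreads them back toward the extremes.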
Another important optimization is resolution-dependent time step shifting. The same absolute noise level appears much more severe in low-resolution images than in high-resolution images, because neighboring pixels in natural images are strongly correlated. At a given noise level, a high-resolution image has more pixels over which the noise averages out, so the underlying structure remains easier to discern. To compensate, the timestep distribution is shifted toward higher noise levels as resolution increases, which keeps the perceived noise level, and therefore the training difficulty, consistent across scales.
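One commonly used form of this shift scales the noise level by the square root of the pixel-count ratio. The sketch below uses that form as an assumption; whether the lecture uses exactly this formula is not stated in these notes. It works in terms of a noise level `sigma` (1 is pure noise, 0 is clean) and converts to the t = 0 noise, t = 1 data convention used above. All function names are hypothetical.

```python
import math

def shift_noise_level(sigma, base_pixels, target_pixels):
    """Map a noise level chosen at the base resolution to the target resolution."""
    shift = math.sqrt(target_pixels / base_pixels)
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

def shift_timestep(t, base_pixels, target_pixels):
    """Same shift, expressed for timesteps where t=0 is noise and t=1 is data."""
    return 1.0 - shift_noise_level(1.0 - t, base_pixels, target_pixels)
```

For example, going from 256x256 to 1024x1024 gives a shift factor of 4, so a mid-schedule noise level of 0.5 at the base resolution maps to 0.8 at the higher resolution. The endpoints 0 and 1 are left unchanged, and equal resolutions leave every value unchanged.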
Module 3: Pre-Training and Representation Alignment
Pre-training is the foundation of any text-to-image model, and its success depends almost entirely on the quality and diversity of the training data. The lecture emphasizes that data curation is by far the most time-consuming and impactful part of the pre-training process, involving multiple stages of filtering, deduplication, cleaning, and resolution sorting.
A powerful technique for accelerating pre-training is Representation Alignment (REPA), introduced in 2024. REPA works by adding an auxiliary loss that encourages the diffusion transformer's internal representations to match those of a frozen pre-trained image encoder. This is analogous to giving a student a textbook to learn from, rather than making them discover everything from scratch.
REPA provides several key benefits:
- Speeds up training by up to 18x, dramatically reducing compute costs
- Improves final model quality by leveraging knowledge from larger pre-trained encoders
- Works best when aligning earlier layers of the DiT, which capture semantic information rather than fine details
- Scales better with larger models, providing even greater benefits for billion-parameter systems
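The auxiliary alignment loss described above can be sketched as a negative cosine similarity between projected DiT hidden states and the frozen encoder's features. This is a simplified illustration: it assumes a single linear projection where a small MLP would more typically be used, one feature vector per token, and a hypothetical loss weight of 0.5.

```python
import numpy as np

def repa_alignment_loss(hidden, encoder_features, projection):
    """Negative mean cosine similarity between projected DiT tokens and encoder tokens.

    hidden:           (tokens, d_model) activations from an early DiT layer
    encoder_features: (tokens, d_enc) features from the frozen pre-trained encoder
    projection:       (d_model, d_enc) trainable projection (a linear map in this sketch)
    """
    projected = hidden @ projection
    projected = projected / np.linalg.norm(projected, axis=-1, keepdims=True)
    target = encoder_features / np.linalg.norm(encoder_features, axis=-1, keepdims=True)
    return float(-np.mean(np.sum(projected * target, axis=-1)))

def total_loss(denoising_loss, alignment_loss, weight=0.5):
    """Main objective plus the weighted auxiliary term (weight is an assumed value)."""
    return denoising_loss + weight * alignment_loss
```

The loss reaches its minimum of -1 when every projected token points in the same direction as the corresponding encoder token. Only the diffusion model and the projection receive gradients; the encoder stays frozen.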
Another critical pre-training practice is curriculum learning. Instead of training on all data from the start, the model is first trained on simple examples—low-resolution square images with short, simple prompts—and gradually progresses to more complex examples—high-resolution images with arbitrary aspect ratios and long, detailed prompts. This mimics how humans learn, starting with basic concepts before moving to more complex ones, and significantly improves model convergence and final performance.
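A curriculum of this kind is often expressed as a small table of stages keyed by training step. The sketch below is illustrative only: the stage names, step boundaries, resolutions, and prompt-length caps are hypothetical values, not figures from the lecture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    until_step: int            # stage applies while step < until_step
    resolution: int            # base side length in pixels
    mixed_aspect_ratios: bool  # False means square crops only
    max_prompt_tokens: int     # cap on caption length

# Hypothetical three-stage schedule, ordered from simple to complex.
CURRICULUM = (
    Stage("low-res", 100_000, 256, False, 32),
    Stage("mid-res", 200_000, 512, True, 128),
    Stage("high-res", 300_000, 1024, True, 512),
)

def stage_for_step(step, curriculum=CURRICULUM):
    """Return the stage active at a given global training step."""
    for stage in curriculum:
        if step < stage.until_step:
            return stage
    return curriculum[-1]  # stay on the final stage after the schedule ends
```

The data loader would consult `stage_for_step` to decide which resolution bucket and caption length to serve at each point in training.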
Module 4: Post-Training and Preference Alignment
Post-training transforms a general-purpose pre-trained model into a high-quality, user-friendly system. It addresses the fact that pre-trained models often generate technically correct but aesthetically unpleasing images, and frequently fail to follow detailed user prompts accurately.
The first component of post-training is Continued Training (CT), which involves further training the model on a smaller, higher-quality dataset focused on the specific domain or use case. For example, a general-purpose model might be further trained on a dataset of professional photographs to improve its aesthetic quality.
The second component is Supervised Fine-Tuning (SFT), which focuses on improving instruction following. The model is trained on a dataset of high-quality prompt-image pairs that demonstrate how to translate detailed natural language descriptions into images. This teaches the model to pay attention to all parts of the prompt and generate images that accurately reflect user intent.
The third and most important component is preference tuning, which teaches the model what not to do. Unlike supervised learning, which only shows the model correct examples, preference tuning uses pairwise comparisons in which one image is clearly better than the other. Three main preference tuning methods are covered:
- Reward Feedback Learning (ReFL): Trains a separate reward model on pairwise comparisons, then uses it to provide gradients to the generation model
- Flow-GRPO: Adapts the Group Relative Policy Optimization algorithm from LLMs to image generation, using relative rewards within a group of generated images
- Diffusion-DPO: The diffusion equivalent of Direct Preference Optimization, which directly optimizes the model to prefer winning images without a separate reward model
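Two of the ideas in the list above reduce to short formulas. The sketch below shows the group-relative advantage used in GRPO-style methods and a simplified per-pair Diffusion-DPO loss. It assumes scalar denoising errors where lower is better, and it folds any timestep-dependent constant into a single hypothetical `beta`; the function names are illustrative.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of images generated from the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def diffusion_dpo_loss(err_win, err_lose, ref_err_win, ref_err_lose, beta=1.0):
    """Preference loss on denoising errors of the trained model versus a frozen reference.

    The loss falls when the trained model improves on the reference more for the
    preferred image than for the rejected one.
    """
    margin = (err_win - ref_err_win) - (err_lose - ref_err_lose)
    return -log_sigmoid(-beta * margin)
```

When the trained model matches the reference on both images, the loss equals log 2. It drops below that as the model's error on the winning image shrinks relative to its error on the losing image.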
A critical complementary technique is Prompt Enhancement (PE), which bridges the gap between the short, simple prompts users actually write and the long, detailed prompts the model was trained on. A prompt enhancement model takes a user's simple request and expands it into the detailed, descriptive prompt that will produce the best results.
Module 5: Personalization Tuning
Personalization tuning allows adapting a pre-trained model to generate specific subjects, styles, or characters that were not present in the original training data. The most popular method for this is DreamBooth, which enables teaching the model to recognize and generate a specific subject from just a few example images.
DreamBooth works by associating the subject with a rare, unused token in the model's vocabulary. The model is then trained on the few example images, learning to generate the subject when that rare token appears in the prompt. A key challenge is preventing catastrophic forgetting, where the model loses its ability to generate other concepts after being fine-tuned on the new subject. This is addressed with a prior preservation loss, which encourages the model to retain its original capabilities while learning the new subject.
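The two ingredients described above, a rare-token prompt and a prior preservation term, can be sketched as follows. The prompt template, the example token `sks`, and the default weight are illustrative assumptions rather than details taken from the lecture.

```python
def build_prompts(rare_token, class_noun):
    """Return (instance_prompt, class_prompt) for subject-driven fine-tuning."""
    instance_prompt = f"a photo of {rare_token} {class_noun}"  # paired with the user's photos
    class_prompt = f"a photo of {class_noun}"                  # paired with images the base model generated
    return instance_prompt, class_prompt

def dreambooth_loss(instance_loss, prior_loss, prior_weight=1.0):
    """Denoising loss on the subject images plus a weighted prior preservation term."""
    return instance_loss + prior_weight * prior_loss
```

For a pet dog, `build_prompts("sks", "dog")` yields "a photo of sks dog" and "a photo of dog". The second prompt is paired with images sampled from the original model before fine-tuning, so the prior term penalizes drift in how the model renders dogs in general.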
To make personalization efficient and accessible, DreamBooth is almost always combined with Low-Rank Adaptation (LoRA). Instead of updating all the model's parameters, LoRA freezes the original weights and trains only a pair of small low-rank matrices for each adapted layer; their product is added to that layer's weight matrix. This reduces the number of trainable parameters by orders of magnitude, making personalization possible on consumer GPUs in just a few minutes.
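A LoRA-adapted linear layer can be sketched as a frozen weight plus a scaled product of two small matrices. The class below is a minimal NumPy illustration with hypothetical names; it follows the usual convention of initializing one factor to zero so that the adapted layer starts out identical to the base layer.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update: W + (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen base weight, (d_out, d_in)
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size
```

For a 1024 by 1024 weight and rank 8, the adapter has 2 x 8 x 1024 = 16,384 trainable values against 1,048,576 in the full matrix, a 64-fold reduction for that layer.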
LoRA has become the de facto standard for model customization in the image generation community, with millions of LoRA models available for every conceivable subject, style, and character. It allows users to mix and match different customizations by loading multiple LoRAs simultaneously, creating endless possibilities for creative expression.
Module 6: Production Distillation Techniques
Even the best-trained model is useless in production if it is too slow and expensive to run. Distillation techniques address this by reducing the number of inference steps required to generate an image, from hundreds to between one and four, while preserving most of the image quality.
The foundational distillation technique is Progressive Distillation, which works by iteratively halving the number of steps. In each iteration, a student model is trained to replicate the output of a teacher model in half the number of steps. This process is repeated until the desired number of steps is reached. Progressive distillation is simple and reliable, but it requires multiple rounds of training and can lose some quality at very low step counts.
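The halving procedure can be sketched with two helpers: one that lists the step counts across distillation rounds, and one that builds a student's training target from two teacher steps. The sketch assumes a simple Euler sampler and a hypothetical `velocity(x, t)` callable standing in for the teacher.

```python
def halving_schedule(teacher_steps, target_steps):
    """Step counts for successive rounds, halving until the target is reached."""
    schedule = [teacher_steps]
    while schedule[-1] > target_steps:
        schedule.append(max(target_steps, schedule[-1] // 2))
    return schedule

def two_teacher_steps(velocity, x, t, dt):
    """Target for one student step of size 2*dt: the result of two Euler teacher steps of size dt."""
    x_mid = x + dt * velocity(x, t)
    return x_mid + dt * velocity(x_mid, t + dt)
```

Starting from a 128-step teacher and aiming for 4 steps gives the rounds 128, 64, 32, 16, 8, 4. In each round the student is trained so that a single step from `x` reproduces `two_teacher_steps`, and the trained student then serves as the teacher for the next round.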
InstaFlow approaches few-step generation through rectified flow rather than step halving: it first applies a reflow step that straightens the flow paths in the latent space, then distills the straightened model. Straight paths are much easier to approximate in few steps, allowing InstaFlow to achieve high-quality one-step generation. It uses a combination of MSE loss and LPIPS (Learned Perceptual Image Patch Similarity) loss to preserve both pixel accuracy and perceptual quality.
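A toy example shows why straight paths matter for few-step sampling. The sketch below integrates two hypothetical one-dimensional velocity fields with an Euler sampler: a constant field, which traces a straight path, and a field that depends on position, which traces a curved one. It illustrates the principle only and is not InstaFlow's training procedure.

```python
def euler_integrate(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with a fixed number of Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

def straight(x, t):
    """Constant velocity: the trajectory is a straight line."""
    return 3.0

def curved(x, t):
    """Velocity depends on position: the trajectory bends."""
    return x
```

For the straight field, one step and one hundred steps land on the same endpoint, so a single-step sampler loses nothing. For the curved field, one step from 1.0 reaches 2.0 while many small steps approach e (about 2.718), which is the kind of discretization error that reflow is meant to remove before distillation.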
Consistency Models take a different approach, training the model to predict the final clean image directly from any noisy input along the same trajectory. This allows generating an image in a single step from pure noise, without any iterative denoising. Consistency models are particularly flexible and work well with LoRA personalization, making them popular for real-time applications.
More advanced techniques like Distribution Matching Distillation (DMD) and Adversarial Diffusion Distillation (ADD) replace or supplement per-sample regression with distribution-level objectives (distribution matching in DMD, an adversarial loss in ADD) to address the regression-to-the-mean problem that plagues MSE-based distillation methods. These techniques produce sharper, more detailed images but are more complex to train and can suffer from stability issues.
Wishing you all the best as you embark on your text-to-image generation journey. May your training runs converge smoothly, your loss curves drop steadily, and your generated images exceed all expectations. May your LoRAs capture every detail perfectly, your distillations preserve all the quality, and your models run fast enough for even the most demanding production workloads. The skills you are learning here are at the cutting edge of AI, and they will empower you to create amazing things. Keep experimenting, keep learning, and never stop pushing the boundaries of what is possible with generative AI. Happy generating!


