One. Course Details
This is the fourth lecture of Stanford University’s CME 295: Transformers and Large Language Models, taught by twin brothers Afshine and Shervine Amidi. The lecture opens with critical exam logistics: the midterm exam will take place on Friday, October 24, from 3:30 PM to 5:00 PM in the regular classroom, covering all material from lectures 1 through 4. The final exam has been finalized for Wednesday, December 10, from 7:00 PM to 8:30 PM in a different location, and will only cover content from lectures 5 through 9. Both exams are closed-book, closed-notes, and no calculators are required.
The lecture follows a clear two-part structure. In the first half, Afshine explains the pre-training stage of LLM development, covers the famous scaling laws that govern model performance, and breaks down the core training optimizations that make training billion-parameter models feasible. In the second half, Shervine introduces the supervised fine tuning (SFT) stage that transforms a general language model into a helpful assistant, discusses the challenges of evaluating LLM performance, and provides an in-depth explanation of parameter-efficient fine tuning techniques, most notably LoRA and QLoRA.
Two. Key Learning Objectives
By the end of this lecture, students will be able to:
Explain the two-stage transfer learning paradigm that underpins all modern LLM development.
Describe the Chinchilla scaling law and understand the optimal relationship between model size, training data volume, and compute budget.
Compare and contrast data parallelism, model parallelism, and ZeRO redundancy optimization for distributed LLM training.
Explain the core intuition behind Flash Attention and how it leverages GPU memory hierarchy to accelerate attention computation.
Distinguish between pre-training and supervised fine tuning objectives and understand how instruction tuning aligns models with user needs.
Evaluate LLM performance using both standard benchmarks and human preference-based methods.
Describe how Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) enable efficient fine tuning of large models on consumer hardware.
Three. Memorable Course Quotes
"Pre-training is by far the most expensive part of LLM training, costing millions to hundreds of millions of dollars and requiring thousands of GPUs running for months."
"The Chinchilla law states that optimal training uses approximately 20 tokens per parameter. Most early models like GPT-3 were significantly undertrained."
"Flash Attention is the best of all worlds: it is faster, uses less memory, and produces exactly the same output as vanilla attention with no approximations."
"Supervised fine tuning turns a model that only predicts the next token into a helpful assistant that responds to user instructions."
"LoRA reduces the number of trainable parameters by orders of magnitude while retaining almost full performance of full fine tuning."
Four. Detailed Lecture Notes
4.1 LLM Training Paradigm Overview
All modern LLMs follow a two-stage transfer learning paradigm that has revolutionized natural language processing:
-
Pre-training: A large model is trained on trillions of tokens of unlabeled text from the internet to learn general language understanding and world knowledge.
-
Fine tuning: The pre-trained model is further trained on a small set of high-quality labeled data to adapt it to specific tasks or align it with human preferences.
This paradigm eliminates the need to train a separate model for each task, leveraging the general knowledge acquired during pre-training across all downstream applications.
4.2 Pre-training Fundamentals
Pre-training is the most computationally intensive and expensive stage of LLM development. The core objective is next token prediction: given a sequence of tokens, the model learns to predict the probability distribution over the vocabulary for the next token in the sequence.
4.2.1 Pre-training Data
Pre-training datasets are massive collections of text from diverse sources, including:
-
Common Crawl (billions of web pages)
-
Wikipedia and other encyclopedias
-
Books and academic papers
-
Code repositories (GitHub, Stack Overflow)
-
Social media conversations (Reddit, Twitter)
Modern LLMs are trained on hundreds of billions to tens of trillions of tokens. For example, GPT-3 was trained on 300 billion tokens, while Llama 3 was trained on 15 trillion tokens.
4.2.2 Compute Metrics
Two critical metrics are used to quantify the computational requirements of LLM training:
-
FLOPs (Floating Point Operations): A count of the arithmetic operations performed, used as a unit of total compute. Training a frontier LLM takes on the order of 10²⁵ FLOPs.
-
FLOPS (Floating Point Operations per Second): A measure of hardware compute speed, indicating how many operations a GPU can perform per second.
The total compute required for pre-training is roughly proportional to the product of the number of parameters and the number of training tokens.
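The proportionality above is often written as C ≈ 6 · N · D, where N is the parameter count and D the number of training tokens. The factor of 6 is a common rule of thumb and is an assumption here, not something stated in these notes. A minimal sketch that turns it into a FLOPs estimate and, given a hardware speed in FLOPS, a rough training time (the cluster size, peak speed, and utilization below are hypothetical):

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute using the C ~ 6 * N * D rule of thumb."""
    return 6.0 * num_params * num_tokens


def training_days(total_flops: float, num_gpus: int,
                  peak_flops_per_gpu: float, utilization: float) -> float:
    """Rough wall-clock days given hardware speed (FLOPS) and an assumed utilization."""
    effective_flops_per_second = num_gpus * peak_flops_per_gpu * utilization
    seconds = total_flops / effective_flops_per_second
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # GPT-3 figures quoted in these notes: 175B parameters, 300B tokens.
    flops = training_flops(175e9, 300e9)
    print(f"Estimated compute: {flops:.2e} FLOPs")
    # Hypothetical cluster: 1,024 GPUs, 1e15 FLOPS peak each, 40% utilization.
    print(f"Estimated duration: {training_days(flops, 1024, 1e15, 0.4):.1f} days")
```

Note the two different units: `training_flops` returns a count of operations (FLOPs), while `peak_flops_per_gpu` is a rate (FLOPS).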
4.3 Scaling Laws and the Chinchilla Law
A landmark 2020 paper showed that LLM performance follows predictable scaling laws: the loss falls smoothly, following a power law, as model size, training data volume, and compute budget increase, with no sign of a plateau at the scales tested.
Bigger models are also more sample efficient: they achieve better performance than smaller models given the same amount of training data.
The Chinchilla law, published in 2022, answered a critical question: given a fixed compute budget, what is the optimal split between model size and training data? The researchers found that optimal training uses approximately 20 tokens per parameter. This revealed that most early large models, including GPT-3 (175B parameters trained on 300B tokens, or under 2 tokens per parameter), were significantly undertrained.
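To make the 20-tokens-per-parameter rule concrete, the sketch below combines it with the C ≈ 6 · N · D compute approximation (an assumption, not taken from these notes). Substituting D = 20N gives C = 120 · N², so N = sqrt(C / 120). The function names are illustrative only.

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted above


def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend the budget at 20 tokens per parameter.

    Assumes C = 6 * N * D, hence C = 6 * 20 * N^2.
    """
    num_params = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    num_tokens = TOKENS_PER_PARAM * num_params
    return num_params, num_tokens


def tokens_per_param(num_params: float, num_tokens: float) -> float:
    """How many training tokens a model saw per parameter."""
    return num_tokens / num_params


if __name__ == "__main__":
    # GPT-3 as trained: far below 20 tokens per parameter.
    print(f"GPT-3 tokens/param: {tokens_per_param(175e9, 300e9):.2f}")
    # Same compute budget, re-split according to the rule of thumb.
    budget = 6 * 175e9 * 300e9
    n, d = chinchilla_optimal(budget)
    print(f"Compute-optimal split: {n / 1e9:.0f}B params, {d / 1e12:.2f}T tokens")
```

Under these assumptions, the same budget would favor a model several times smaller than GPT-3 trained on several times more tokens.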
4.4 LLM Training Optimizations
Training a billion-parameter model on trillions of tokens presents enormous technical challenges, primarily related to memory limitations and computational efficiency.
4.4.1 The Memory Challenge
A single training step requires storing three types of data in GPU memory:
-
Model parameters: The weights of the neural network.
-
Activations: The intermediate values computed during the forward pass, needed for backpropagation.
-
Optimizer states: The first and second moments used by the Adam optimizer, which require twice as much memory as the parameters themselves.
Even the most powerful GPUs (such as the NVIDIA H100 with 80GB of memory) cannot fit a large LLM entirely in memory, necessitating distributed training across multiple GPUs.
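A back-of-the-envelope estimate shows why a single 80GB GPU is not enough. The sketch below assumes everything is stored in FP32 (4 bytes per value) and also counts gradients, which are not listed above but must be held during the backward pass. Activations are left out because they depend on batch size and sequence length. The function name and the 7B example are illustrative.

```python
BYTES_FP32 = 4


def training_memory_gb(num_params: float, bytes_per_value: int = BYTES_FP32) -> dict:
    """Estimate model-state memory in GB for Adam training (activations excluded)."""
    parameters = num_params * bytes_per_value
    gradients = num_params * bytes_per_value             # assumption: one gradient per weight
    optimizer_states = 2 * num_params * bytes_per_value  # Adam first and second moments
    total = parameters + gradients + optimizer_states
    return {
        "parameters": parameters / 1e9,
        "gradients": gradients / 1e9,
        "optimizer_states": optimizer_states / 1e9,
        "total": total / 1e9,
    }


if __name__ == "__main__":
    for name, gigabytes in training_memory_gb(7e9).items():
        print(f"{name:>16}: {gigabytes:6.1f} GB")
```

Even a 7B-parameter model exceeds 80GB under these assumptions before any activations are stored.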
4.4.2 Data Parallelism and ZeRO
Data parallelism is the most basic distributed training technique. The training batch is split across multiple GPUs, each with a full copy of the model. After each forward and backward pass, gradients are averaged across all GPUs, and the model weights are updated synchronously.
ZeRO (Zero Redundancy Optimizer) eliminates the memory redundancy in standard data parallelism by partitioning model parameters, gradients, and optimizer states across GPUs rather than replicating them on every device. ZeRO comes in three stages of increasing memory savings:
-
ZeRO-1: Partitions only optimizer states.
-
ZeRO-2: Partitions optimizer states and gradients.
-
ZeRO-3: Partitions optimizer states, gradients, and parameters.
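The effect of each stage can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam with the commonly used accounting of 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for optimizer states (FP32 master weights plus two Adam moments). That byte accounting is an assumption not given in these notes, and the function name is illustrative.

```python
def zero_memory_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory in GB for ZeRO stage 0 (plain data parallel) to 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params   # FP16 weights
    grads = 2.0 * num_params    # FP16 gradients
    optim = 12.0 * num_params   # FP32 master weights + Adam first and second moments
    if stage >= 1:
        optim /= num_gpus       # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus       # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= num_gpus      # ZeRO-3: also shard parameters
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Illustrative setting: 7.5B parameters on 64 GPUs.
    for stage in range(4):
        print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):7.2f} GB per GPU")
```

Because optimizer states dominate under this accounting, ZeRO-1 alone already removes most of the redundancy. ZeRO-3 scales memory down almost linearly with the number of GPUs at the cost of extra communication.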
4.4.3 Model Parallelism
Model parallelism splits the model itself across multiple GPUs, allowing training of models too large to fit on a single GPU. There are three main variants:
-
Tensor parallelism: Splits individual matrix multiplications across GPUs.
-
Pipeline parallelism: Splits the model by layers, with different GPUs responsible for different layers.
-
Expert parallelism: Places different experts of a mixture of experts model on different GPUs.
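As an illustration of the tensor parallelism variant, the sketch below splits one weight matrix column-wise across simulated devices, lets each compute its slice of the output, and concatenates the slices. It runs on a single CPU with plain Python lists, so the "devices" and the gather step are stand-ins for real GPUs and collective communication. All names are hypothetical.

```python
def matmul(x, w):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_devices):
    """Shard a weight matrix column-wise into equal blocks, one per device."""
    num_cols = len(w[0])
    if num_cols % num_devices != 0:
        raise ValueError("column count must be divisible by num_devices")
    width = num_cols // num_devices
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_devices)]


def tensor_parallel_matmul(x, w, num_devices):
    """Compute x @ w with each simulated device holding only its column shard of w."""
    shards = split_columns(w, num_devices)
    # Each device multiplies the full input by its own shard.
    partial_outputs = [matmul(x, shard) for shard in shards]
    # Gather step: concatenate the per-device output slices along the column axis.
    return [sum((part[r] for part in partial_outputs), []) for r in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(tensor_parallel_matmul(x, w, num_devices=2))
    print(matmul(x, w))  # same result computed on one device
```

Each device stores only a fraction of the weight matrix, which is what lets a layer exceed the memory of a single GPU.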
4.4.4 Flash Attention
Flash Attention, developed at Stanford in 2022, is a revolutionary optimization that accelerates self-attention computation by leveraging the GPU memory hierarchy.
GPUs have two types of memory:
-
HBM (High Bandwidth Memory): Large (tens of gigabytes) but relatively slow.
-
SRAM (Static Random Access Memory): Small (tens of megabytes) but 10-100 times faster than HBM.
A vanilla attention implementation spends most of its time reading and writing intermediate results to and from slow HBM. Flash Attention uses a tiling technique to break the attention computation into small blocks that fit entirely in fast SRAM, minimizing expensive HBM accesses.
Flash Attention also relies on recomputation: instead of storing the full attention matrix from the forward pass, it recomputes the blocks it needs during the backward pass. The result is both faster training and lower memory usage with no loss in accuracy.
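The reason tiling can give exactly the same output is that softmax can be computed block by block with a running maximum and running sum, rescaling earlier partial results whenever the maximum changes. The sketch below shows that numerical idea in plain Python for one attention head. It does not model SRAM or HBM, it only tiles over keys and values, and it is not the actual fused GPU kernel. Function names are illustrative.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vanilla_attention(q, k, v):
    """Reference attention: materializes every score for each query row."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * _dot(qi, kj) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block_size):
    """Same result, but keys/values are visited one block at a time."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * len(v[0])
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * _dot(qi, kj) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum (0.0 on the first block).
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


if __name__ == "__main__":
    q = [[0.1 * i + 0.2 * j for j in range(4)] for i in range(3)]
    k = [[0.3 * i - 0.1 * j for j in range(4)] for i in range(7)]
    v = [[float(i + j) for j in range(2)] for i in range(7)]
    print(vanilla_attention(q, k, v)[0])
    print(tiled_attention(q, k, v, block_size=3)[0])
```

Only one block of scores exists at any time in `tiled_attention`, which is the property that lets the real kernel keep its working set in fast on-chip memory.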
4.4.5 Quantization and Mixed Precision Training
Quantization reduces the memory footprint of model weights by representing them with fewer bits. Modern GPUs also achieve much higher compute speeds with lower-precision number formats.
Mixed precision training combines the benefits of high and low precision:
-
Model weights are stored in 32-bit floating point (FP32) for numerical stability.
-
All forward and backward pass computations are performed in 16-bit floating point (FP16) or Brain Float 16 (BF16) for speed and memory efficiency.
-
Weight updates are performed in FP32 to prevent accumulation of quantization errors.
This technique typically provides up to about a 2x speedup and up to about a 2x memory reduction, depending on the model and hardware, with negligible impact on model performance.
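A small experiment shows why the weight update is kept in higher precision. The sketch below uses the standard library's half-precision format code to round values to FP16. Python's built-in 64-bit floats stand in for the FP32 master copy, which is a simplification, and the weight and update values are made up for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def apply_updates(weight: float, update: float, steps: int, master_in_fp16: bool) -> float:
    """Add `update` to `weight` repeatedly, optionally rounding to FP16 after each step."""
    master = to_fp16(weight) if master_in_fp16 else weight
    for _ in range(steps):
        master = master + update
        if master_in_fp16:
            master = to_fp16(master)  # a small update can be rounded away entirely
    return master


if __name__ == "__main__":
    # 1,000 small updates of 1e-4 applied to a weight of 1.0.
    print("FP16 master weight:            ", apply_updates(1.0, 1e-4, 1000, True))
    print("Higher-precision master weight:", apply_updates(1.0, 1e-4, 1000, False))
```

Near 1.0 the spacing between adjacent FP16 values is about 0.001, so each 1e-4 update rounds back to the old weight and the FP16 copy never moves, while the higher-precision copy accumulates all of the updates.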
4.5 Supervised Fine Tuning (SFT)
A pre-trained LLM is only a next-token predictor: it does not naturally know how to respond to user instructions or be a helpful assistant. Supervised fine tuning transforms the pre-trained model into a useful tool by training it on a dataset of input-output pairs.
4.5.1 SFT Objective
The SFT objective is similar to pre-training, with one critical difference: no loss is calculated over the input prompt. The model is only trained to predict the response tokens, not to parrot the user's input.
4.5.2 Instruction Tuning
Instruction tuning is a specific type of supervised fine tuning that trains the model to follow natural language instructions. The training dataset consists of diverse examples of instructions and corresponding helpful responses, covering tasks like:
-
Question answering
-
Story writing
-
Code generation
-
Explanation and summarization
-
Safety and alignment
Instruction tuning datasets are much smaller than pre-training datasets, typically consisting of thousands to millions of high-quality examples. For example, InstructGPT (an instruction-tuned GPT-3) used about 13,000 examples for supervised fine tuning, while Llama 3 used about 10 million examples.
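Each instruction-response example is trained with the SFT objective from 4.5.1: the loss is averaged over response tokens only, and prompt tokens are masked out. A minimal sketch of that masking, assuming we already have the probability the model assigned to each correct next token (the numbers and function names below are made up for illustration):

```python
import math


def sft_loss(target_token_probs, is_response_token):
    """Mean negative log-likelihood over response tokens; prompt tokens are ignored."""
    losses = [-math.log(p)
              for p, keep in zip(target_token_probs, is_response_token) if keep]
    if not losses:
        raise ValueError("at least one response token is required")
    return sum(losses) / len(losses)


def pretraining_loss(target_token_probs):
    """Pre-training objective for comparison: every token contributes to the loss."""
    return sft_loss(target_token_probs, [True] * len(target_token_probs))


if __name__ == "__main__":
    # Two prompt tokens followed by three response tokens (illustrative values).
    probs = [0.1, 0.2, 0.5, 0.8, 0.9]
    mask = [False, False, True, True, True]
    print(f"SFT loss (response only): {sft_loss(probs, mask):.4f}")
    print(f"Loss over all tokens:     {pretraining_loss(probs):.4f}")
```

In practice, training frameworks usually implement the same idea by assigning prompt positions a label value that the loss function is configured to ignore.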
4.6 LLM Evaluation
Evaluating LLM performance is notoriously challenging because language understanding and generation are inherently subjective tasks.
4.6.1 Standard Benchmarks
Standard benchmarks measure model performance on specific objective tasks:
-
MMLU (Massive Multitask Language Understanding): Tests general knowledge and reasoning across 57 subjects.
-
GSM8K: Tests elementary school math problem solving.
-
HumanEval: Tests code generation ability.
However, benchmarks have significant limitations: models often overfit to benchmark tasks, and high benchmark scores do not always correlate with real-world user satisfaction.
4.6.2 Human Preference Evaluation
Human preference evaluation directly measures how well models align with user expectations. The most famous example is Chatbot Arena, where users compare anonymous responses from two models and vote for the better one. An Elo rating system is then used to rank models based on these pairwise comparisons.
Human evaluation also has limitations: it is expensive, subjective, and prone to biases.
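The Elo update behind such rankings is short enough to sketch. The version below uses the conventional constants from chess (a 400-point scale and K = 32). These are assumptions for illustration and not necessarily the settings any particular leaderboard uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise vote.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


if __name__ == "__main__":
    a, b = 1000.0, 1000.0
    # Model A wins three votes in a row against model B.
    for _ in range(3):
        a, b = elo_update(a, b, score_a=1.0)
        print(f"A: {a:.1f}  B: {b:.1f}")
```

Each win is worth less than the previous one: as A's rating pulls ahead, its expected score rises, so beating B again moves the ratings by a smaller amount.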
4.7 Parameter-Efficient Fine Tuning
Full fine tuning of large LLMs is prohibitively expensive for most users. Parameter-efficient fine tuning (PEFT) techniques allow adapting pre-trained models to new tasks while only training a tiny fraction of the model's parameters.
4.7.1 Low-Rank Adaptation (LoRA)
LoRA is the most widely used PEFT technique. Instead of updating the full weight matrix W₀, LoRA expresses the weight update as the product of two small low-rank matrices B and A:
W = W₀ + BA
The pre-trained weights W₀ are frozen, and only the matrices A and B are trained during fine tuning. If W₀ has shape d × k, then B has shape d × r and A has shape r × k, so the update has r(d + k) trainable values instead of d·k. The rank r is typically very small (4-64), which can reduce the number of trainable parameters by a factor of 1000 or more.
LoRA is typically applied to the attention projection matrices and feed-forward networks in the transformer. It achieves performance comparable to full fine tuning while requiring a fraction of the compute and memory.
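A minimal sketch of the update W = W₀ + BA for a single weight matrix, in plain Python. It assumes W₀ has shape d × k, B has shape d × r, and A has shape r × k. Initializing B to zeros so that training starts from the pre-trained behavior is the usual convention, but the shapes, the `scaling` argument, and the helper names are assumptions for illustration.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(x, w0, a, b, scaling=1.0):
    """Compute (W0 + scaling * B A) x without ever forming the full update matrix."""
    base = matvec(w0, x)              # frozen pre-trained path
    low_rank = matvec(b, matvec(a, x))  # trainable path: project down to r, then back up
    return [u + scaling * v for u, v in zip(base, low_rank)]


def lora_trainable_params(d, k, r):
    """Trainable values in B (d x r) plus A (r x k)."""
    return r * (d + k)


if __name__ == "__main__":
    w0 = [[1, 2], [3, 4]]   # frozen weights, d = k = 2
    a = [[1, 1]]            # r = 1
    x = [1, 1]
    print(lora_forward(x, w0, a, b=[[0], [0]]))  # B = 0: identical to the base model
    print(lora_forward(x, w0, a, b=[[1], [2]]))  # after B has been trained

    d = k = 4096
    full, lora = d * k, lora_trainable_params(d, k, r=8)
    print(f"full: {full:,}  LoRA: {lora:,}  reduction: {full // lora}x")
```

The per-matrix reduction in this example is 256x. The reduction for a whole model is typically larger because only a subset of its matrices receive adapters.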
4.7.2 Quantized LoRA (QLoRA)
QLoRA further extends LoRA by quantizing the frozen pre-trained weights to 4-bit precision, cutting the memory needed for those weights by about 4x relative to 16-bit storage (8x relative to 32-bit). QLoRA uses a novel NF4 (4-bit NormalFloat) quantization format optimized for normally distributed neural network weights, and double quantization to further reduce memory overhead.
QLoRA makes it possible to fine tune models with roughly 65B to 70B parameters on a single GPU with 48GB of memory, with almost no performance degradation compared to full 16-bit fine tuning.
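The memory claim can be checked with simple arithmetic on the frozen weights alone. The sketch below ignores quantization constants, the LoRA adapters, optimizer state, and activations, so real usage is higher. The function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory in GB needed to store the weights at a given bit width."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    num_params = 70e9  # the 70B model size mentioned in these notes
    for bits in (32, 16, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gb(num_params, bits):6.1f} GB")
```

At 4 bits, the weights of a 70B-parameter model take about 35GB, which leaves some headroom on a 48GB GPU for adapters and activations. At 16 bits the same weights would need about 140GB.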
These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you thoroughly master the content of this subject. Wish you continuous academic progress and great achievements in your studies.


