Lecture 4: LLM Pre-training, Supervised Fine Tuning and Parameter-Efficient Adaptation

One. Course Details

This is the fourth lecture of Stanford University’s CME 295: Transformers and Large Language Models, taught by twin brothers Afshine and Shervine Amidi. The lecture opens with critical exam logistics: the midterm exam will take place on Friday, October 24, from 3:30 PM to 5:00 PM in the regular classroom, covering all material from lectures 1 through 4. The final exam has been finalized for Wednesday, December 10, from 7:00 PM to 8:30 PM in a different location, and will only cover content from lectures 5 through 9. Both exams are closed-book, closed-notes, and no calculators are required.

The lecture follows a clear two-part structure. In the first half, Afshine explains the pre-training stage of LLM development, covers the famous scaling laws that govern model performance, and breaks down the core training optimizations that make training billion-parameter models feasible. In the second half, Shervine introduces the supervised fine tuning (SFT) stage that transforms a general language model into a helpful assistant, discusses the challenges of evaluating LLM performance, and provides an in-depth explanation of parameter-efficient fine tuning techniques, most notably LoRA and QLoRA.

Two. Key Learning Objectives

By the end of this lecture, students will be able to:

Explain the two-stage transfer learning paradigm that underpins all modern LLM development.

Describe the Chinchilla scaling law and understand the optimal relationship between model size, training data volume, and compute budget.

Compare and contrast data parallelism, model parallelism, and the Zero Redundancy Optimizer (ZeRO) for distributed LLM training.

Explain the core intuition behind Flash Attention and how it leverages GPU memory hierarchy to accelerate attention computation.

Distinguish between pre-training and supervised fine tuning objectives and understand how instruction tuning aligns models with user needs.

Evaluate LLM performance using both standard benchmarks and human preference-based methods.

Describe how Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) enable efficient fine tuning of large models on consumer hardware.

Three. Memorable Course Quotes

"Pre-training is by far the most expensive part of LLM training, costing millions to hundreds of millions of dollars and requiring thousands of GPUs running for months."

"The Chinchilla law states that optimal training uses approximately 20 tokens per parameter. Most early models like GPT-3 were significantly undertrained."

"Flash Attention is the best of all worlds: it is faster, uses less memory, and produces exactly the same output as vanilla attention with no approximations."

"Supervised fine tuning turns a model that only predicts the next token into a helpful assistant that responds to user instructions."

"LoRA reduces the number of trainable parameters by orders of magnitude while retaining almost full performance of full fine tuning."

Four. Detailed Lecture Notes

4.1 LLM Training Paradigm Overview

All modern LLMs follow a two-stage transfer learning paradigm that has revolutionized natural language processing:

Pre-training: A large model is trained on trillions of tokens of unlabeled text from the internet to learn general language understanding and world knowledge.
Fine tuning: The pre-trained model is further trained on a small set of high-quality labeled data to adapt it to specific tasks or align it with human preferences.

This paradigm eliminates the need to train a separate model for each task, leveraging the general knowledge acquired during pre-training across all downstream applications.

4.2 Pre-training Fundamentals

Pre-training is the most computationally intensive and expensive stage of LLM development. The core objective is next token prediction: given a sequence of tokens, the model learns to predict the probability distribution over the vocabulary for the next token in the sequence.
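The next token prediction objective can be sketched as an average cross-entropy loss, where the model output at position t is scored against the token at position t+1. This is a minimal illustration in plain Python; the function names and the toy 4-token vocabulary are assumptions made for the example, not something taken from the lecture.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy: logits at position t are scored against token t+1."""
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        losses.append(-math.log(probs[token_ids[t + 1]]))
    return sum(losses) / len(losses)

# Toy setup (hypothetical): vocabulary of 4 tokens, sequence of 3 tokens.
tokens = [0, 2, 1]
uniform_logits = [[0.0] * 4 for _ in tokens]

# A model that knows nothing assigns probability 1/4 to every token,
# so its loss equals log(vocabulary size).
loss_uniform = next_token_loss(uniform_logits, tokens)
print(loss_uniform, math.log(4))
```

A useful sanity check when training from scratch is that the initial loss should be close to the log of the vocabulary size, as in this toy case.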

4.2.1 Pre-training Data

Pre-training datasets are massive collections of text from diverse sources, including:

Common Crawl (billions of web pages)
Wikipedia and other encyclopedias
Books and academic papers
Code and programming Q&A (GitHub, Stack Overflow)
Social media conversations (Reddit, Twitter)

Modern LLMs are trained on hundreds of billions to tens of trillions of tokens. For example, GPT-3 was trained on 300 billion tokens, while Llama 3 was trained on 15 trillion tokens.

4.2.2 Compute Metrics

Two critical metrics are used to quantify the computational requirements of LLM training:

FLOPs (Floating Point Operations): A unit of compute that measures the number of arithmetic operations performed. Training a modern LLM requires approximately 10²⁵ FLOPs.
FLOPS (Floating Point Operations per Second): A measure of hardware compute speed, indicating how many operations a GPU can perform per second.

The total compute required for pre-training is roughly proportional to the product of the number of parameters and the number of training tokens.
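The proportionality above is often written as C ≈ 6 · N · D, where N is the number of parameters and D the number of training tokens. The factor 6 is a commonly used rule of thumb and is an assumption here, since the notes only state the proportionality. The sketch below applies it to the GPT-3 figures quoted earlier; the GPU count, peak speed, and utilization are made-up values used only to show how FLOPs and FLOPS combine into a time estimate.

```python
def training_flops(n_params, n_tokens, flops_per_param_token=6):
    """Rule-of-thumb total training compute (FLOPs): C ~ 6 * N * D."""
    return flops_per_param_token * n_params * n_tokens

def training_days(total_flops, n_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock estimate: total FLOPs divided by the effective cluster FLOPS."""
    seconds = total_flops / (n_gpus * peak_flops_per_gpu * utilization)
    return seconds / 86_400

# GPT-3 figures from these notes: 175B parameters, 300B tokens.
gpt3_flops = training_flops(175e9, 300e9)

# Hypothetical cluster: 1000 GPUs, 1e15 FLOPS peak each, 40% utilization.
days = training_days(gpt3_flops, n_gpus=1000, peak_flops_per_gpu=1e15, utilization=0.4)
print(f"{gpt3_flops:.2e} FLOPs, about {days:.1f} days on the hypothetical cluster")
```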

4.3 Scaling Laws and the Chinchilla Law

A landmark 2020 paper found that LLM performance follows predictable scaling laws: test loss falls smoothly, roughly as a power law, as model size, training data volume, and compute budget increase, with no sign of a plateau at the scales tested.

Bigger models are also more sample efficient: they achieve better performance with the same amount of training data compared to smaller models.

The Chinchilla law, published in 2022, answered a critical question: given a fixed compute budget, what is the optimal split between model size and training data? The researchers found that optimal training uses approximately 20 tokens per parameter. This revealed that most early large models, including GPT-3 (175B parameters trained on 300B tokens), were significantly undertrained.
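The 20 tokens per parameter rule can be turned into a small calculation. The sketch below checks the GPT-3 ratio and then asks what model size the rule would suggest for the same compute budget. It assumes the C ≈ 6 · N · D approximation for compute, which is not stated in these notes, and the function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted in the notes

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

def compute_optimal(compute_flops):
    """Solve C = 6 * N * D with D = 20 * N, so N = sqrt(C / 120)."""
    n_params = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# GPT-3: 175B parameters trained on 300B tokens.
gpt3_ratio = tokens_per_param(175e9, 300e9)

# Same compute budget as GPT-3 under the 6*N*D assumption.
n_opt, d_opt = compute_optimal(6 * 175e9 * 300e9)
print(f"GPT-3 ratio: {gpt3_ratio:.2f} tokens per parameter")
print(f"Rule-of-thumb optimum: {n_opt / 1e9:.0f}B parameters, {d_opt / 1e12:.2f}T tokens")
```

Under these assumptions GPT-3 sits at fewer than 2 tokens per parameter, far below 20, which is the sense in which it was undertrained: the same budget would have been better spent on a smaller model trained on more tokens.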

4.4 LLM Training Optimizations

Training a billion-parameter model on trillions of tokens presents enormous technical challenges, primarily related to memory limitations and computational efficiency.

4.4.1 The Memory Challenge

A single training step requires storing three types of data in GPU memory:

Model parameters: The weights of the neural network.
Activations: The intermediate values computed during the forward pass, needed for backpropagation.
Optimizer states: The first and second moment estimates kept by the Adam optimizer, one of each per parameter, so together they take twice as much memory as the parameters stored at the same precision.

Even the most powerful GPUs (such as the NVIDIA H100 with 80GB of memory) cannot fit a large LLM entirely in memory, necessitating distributed training across multiple GPUs.
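A back-of-the-envelope calculation shows how quickly this adds up. The sketch below assumes everything is kept in FP32 (4 bytes per value) and also counts gradients, which are not in the list above but must be held during the backward pass. Activations are left out because they depend on batch size and sequence length. The 7B model size is a hypothetical example.

```python
def training_memory_gb(n_params, bytes_weights=4, bytes_grads=4, bytes_adam=8):
    """Memory for weights, gradients and Adam moments, excluding activations.

    Assumes FP32 storage: 4 bytes per weight, 4 per gradient, and
    2 * 4 bytes for Adam's first and second moments.
    """
    bytes_per_param = bytes_weights + bytes_grads + bytes_adam
    return n_params * bytes_per_param / 1e9

# Hypothetical 7B-parameter model.
needed_gb = training_memory_gb(7e9)
print(f"{needed_gb:.0f} GB needed vs 80 GB on one H100")
```

Even before activations are counted, this hypothetical 7B model exceeds the 80GB of a single H100.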

4.4.2 Data Parallelism and ZeRO

Data parallelism is the most basic distributed training technique. The training batch is split across multiple GPUs, each with a full copy of the model. After each forward and backward pass, gradients are averaged across all GPUs, and the model weights are updated synchronously.
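The gradient averaging step can be illustrated with a toy example. The sketch below uses a one-parameter linear model with a squared-error loss and simulates two "GPUs" as two list slices; the model, the data, and the names are all assumptions chosen to keep the example small.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5  # every simulated GPU holds the same copy of the weight

# Split the batch into two equal shards, one per simulated GPU.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]

# "All-reduce": average the local gradients across devices.
averaged = sum(local_grads) / len(local_grads)

# Reference: gradient computed on the full batch on a single device.
full = grad_mse(w, xs, ys)
print(averaged, full)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why every replica ends up applying the same weight update.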

ZeRO (Zero Redundancy Optimizer) eliminates the memory redundancy in standard data parallelism by partitioning model parameters, gradients, and optimizer states across GPUs rather than replicating them on every device. ZeRO comes in three stages of increasing memory savings:

ZeRO-1: Partitions only optimizer states.
ZeRO-2: Partitions optimizer states and gradients.
ZeRO-3: Partitions optimizer states, gradients, and parameters.
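The effect of each stage on per-GPU memory can be estimated with simple arithmetic. The sketch below assumes the accounting commonly used for mixed-precision Adam training: 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for optimizer states (an FP32 weight copy plus two FP32 moments). These byte counts, the 7.5B model size, and the 64 GPUs are assumptions for illustration and do not come from the lecture notes. Activations are ignored.

```python
def zero_memory_gb(n_params, n_gpus, stage):
    """Per-GPU memory in GB for model states under a given ZeRO stage (0 = none)."""
    params = 2 * n_params   # FP16 weights
    grads = 2 * n_params    # FP16 gradients
    optim = 12 * n_params   # FP32 weight copy + Adam first and second moments
    if stage >= 1:
        optim /= n_gpus     # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= n_gpus     # ZeRO-2: also partition gradients
    if stage >= 3:
        params /= n_gpus    # ZeRO-3: also partition parameters
    return (params + grads + optim) / 1e9

# Hypothetical setup: 7.5B parameters on 64 GPUs.
per_stage = {stage: zero_memory_gb(7.5e9, 64, stage) for stage in range(4)}
for stage, gb in per_stage.items():
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```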

4.4.3 Model Parallelism

Model parallelism splits the model itself across multiple GPUs, allowing training of models too large to fit on a single GPU. There are three main variants:

Tensor parallelism: Splits individual matrix multiplications across GPUs.
Pipeline parallelism: Splits the model by layers, with different GPUs responsible for different layers.
Expert parallelism: Places different experts of a mixture of experts model on different GPUs.
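Tensor parallelism is the least intuitive of the three, so here is a minimal sketch of one common way to do it: split the weight matrix by columns, let each device compute its slice of the output, then concatenate the slices. Devices are simulated with plain Python lists, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def split_columns(w, parts):
    """Split a weight matrix into `parts` column blocks, one per device."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [[1.0, 2.0]]                      # one input row with 2 features
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]            # 2 x 4 weight matrix

shards = split_columns(w, parts=2)    # each simulated GPU holds a 2 x 2 block
partial = [matmul(x, shard) for shard in shards]

# "All-gather": concatenate the output slices from every device.
gathered = [sum((p[i] for p in partial), []) for i in range(len(x))]
full = matmul(x, w)
print(gathered, full)
```

Each device only ever stores its own block of the weight matrix, yet the gathered result is identical to the single-device product.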

4.4.4 Flash Attention

Flash Attention, developed at Stanford in 2022, is a revolutionary optimization that accelerates self-attention computation by leveraging the GPU memory hierarchy.

GPUs have two types of memory:

HBM (High Bandwidth Memory): Large (tens of gigabytes) but relatively slow.
SRAM (Static Random Access Memory): Small (on the order of tens of megabytes in total across the chip) but roughly an order of magnitude faster than HBM.

A vanilla attention implementation spends most of its time reading and writing intermediate results, in particular the full matrix of attention scores, to and from slow HBM. Flash Attention uses a tiling technique to break the attention computation into small blocks that fit entirely in fast SRAM, minimizing expensive HBM accesses.

Flash Attention also avoids storing the full attention matrix during the forward pass. Instead, it recomputes the needed blocks during the backward pass. Because the extra arithmetic is cheaper than the memory traffic it replaces, the result is both faster training and lower memory usage with no loss in accuracy.
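The tiling idea relies on computing softmax incrementally, one block of keys at a time, while keeping a running maximum and a running normalizer. The sketch below shows that trick for a single query in plain Python and checks it against the naive computation. It only illustrates the arithmetic: it does not model SRAM, GPU kernels, or the backward-pass recomputation, and the function names and toy data are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, block=2):
    """Process keys block by block with a running max and running sum."""
    scale = 1 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

out_naive = naive_attention(q, keys, values)
out_tiled = tiled_attention(q, keys, values, block=2)
print(out_naive, out_tiled)
```

The two outputs agree up to floating-point rounding, which is the sense in which the tiled computation is exact rather than an approximation.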

4.4.5 Quantization and Mixed Precision Training

Quantization reduces the memory footprint of model weights by representing them with fewer bits. Modern GPUs also achieve much higher compute speeds with lower-precision number formats.

Mixed precision training combines the benefits of high and low precision:

Model weights are stored in 32-bit floating point (FP32) for numerical stability.
All forward and backward pass computations are performed in 16-bit floating point (FP16) or Brain Float 16 (BF16) for speed and memory efficiency.
Weight updates are performed in FP32 to prevent the accumulation of rounding errors.

This technique can provide up to roughly a 2x speedup and substantial memory savings (mainly on activations, since an FP32 copy of the weights is still kept), with negligible impact on model performance.
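The reason for keeping the weights and their updates in FP32 can be shown with a tiny experiment. Near 1.0, adjacent FP16 values are about 0.001 apart, so an update of 0.0001 rounds away to nothing. The sketch below emulates FP16 and FP32 rounding with Python's struct module; the starting weight and update size are made-up values chosen to make the effect visible.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

update = 1e-4          # hypothetical small weight update (learning rate * gradient)
w_fp16 = 1.0           # weight stored and updated in FP16
w_fp32 = 1.0           # FP32 master copy of the same weight

for _ in range(1000):
    w_fp16 = to_fp16(w_fp16 + update)   # update is lost to rounding every step
    w_fp32 = to_fp32(w_fp32 + update)   # update accumulates as expected

print(w_fp16, w_fp32)
```

After 1000 steps the FP16 weight has not moved at all, while the FP32 master copy has moved by about 0.1.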

4.5 Supervised Fine Tuning (SFT)

A pre-trained LLM is only a next-token predictor—it does not naturally know how to respond to user instructions or be a helpful assistant. Supervised fine tuning transforms the pre-trained model into a useful tool by training it on a dataset of input-output pairs.

4.5.1 SFT Objective

The SFT objective is similar to pre-training, with one critical difference: no loss is calculated over the input prompt. The model is only trained to predict the response tokens, not to parrot the user's input.
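In practice this is usually done with a loss mask that is false over prompt tokens and true over response tokens. The sketch below shows the masking on a toy sequence of four tokens; the per-token probabilities are invented for the example and the function name is hypothetical.

```python
import math

def masked_nll(token_log_probs, is_response):
    """Average negative log-likelihood over response tokens only."""
    kept = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    return sum(kept) / len(kept)

# Log-probabilities the model assigned to each correct token (made-up values).
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]

# First two tokens belong to the prompt, last two to the response.
mask = [False, False, True, True]

sft_loss = masked_nll(log_probs, mask)
pretrain_style_loss = masked_nll(log_probs, [True] * len(log_probs))
print(sft_loss, pretrain_style_loss)
```

The prompt tokens still pass through the model as context; they simply contribute nothing to the loss or the gradients.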

4.5.2 Instruction Tuning

Instruction tuning is a specific type of supervised fine tuning that trains the model to follow natural language instructions. The training dataset consists of diverse examples of instructions and corresponding helpful responses, covering tasks like:

Question answering
Story writing
Code generation
Explanation and summarization
Safety and alignment

Instruction tuning datasets are much smaller than pre-training datasets, typically consisting of thousands to millions of high-quality examples. For example, InstructGPT, the instruction-tuned model built on GPT-3, used roughly 13,000 examples for supervised fine tuning, while Llama 3 used about 10 million examples.

4.6 LLM Evaluation

Evaluating LLM performance is notoriously challenging because language understanding and generation are inherently subjective tasks.

4.6.1 Standard Benchmarks

Standard benchmarks measure model performance on specific objective tasks:

MMLU (Massive Multitask Language Understanding): Tests general knowledge and reasoning across 57 subjects.
GSM8K: Tests elementary school math problem solving.
HumanEval: Tests code generation ability.

However, benchmarks have significant limitations: models often overfit to benchmark tasks, and high benchmark scores do not always correlate with real-world user satisfaction.

4.6.2 Human Preference Evaluation

Human preference evaluation directly measures how well models align with user expectations. The most famous example is Chatbot Arena, where users compare anonymous responses from two models and vote for the better one. An Elo rating system is then used to rank models based on these pairwise comparisons.
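A single Elo update from one pairwise vote can be sketched as follows. This uses the textbook Elo formula with the conventional constants 400 and K = 32, both of which are assumptions here; the exact rating procedure used by Chatbot Arena is not described in these notes and may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that model A is preferred over model B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# Two models start with equal ratings; a user prefers model A.
new_a, new_b = elo_update(1000, 1000, score_a=1.0)
print(new_a, new_b)
```

Beating a higher-rated model moves the ratings more than beating a lower-rated one, because the update is proportional to how surprising the outcome was.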

Human evaluation also has limitations: it is expensive, subjective, and prone to biases.

4.7 Parameter-Efficient Fine Tuning

Full fine tuning of large LLMs is prohibitively expensive for most users. Parameter-efficient fine tuning (PEFT) techniques allow adapting pre-trained models to new tasks while only training a tiny fraction of the model's parameters.

4.7.1 Low-Rank Adaptation (LoRA)

LoRA is the most widely used PEFT technique. Instead of updating the full weight matrix W₀ (of size d × k), LoRA expresses the weight update as the product of two small low-rank matrices B (d × r) and A (r × k): W = W₀ + BA

The pre-trained weights W₀ are frozen, and only the matrices A and B are trained during fine tuning. The rank r of these matrices is typically very small (4-64), so each adapted matrix has r · (d + k) trainable parameters instead of d · k. Because most of the model stays frozen, the total number of trainable parameters can drop by a factor of 1000 or more.

LoRA is typically applied to the attention projection matrices and feed-forward networks in the transformer. It achieves performance comparable to full fine tuning while requiring a fraction of the compute and memory.
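The forward pass and the parameter savings can be sketched in a few lines. The example below uses a tiny 3 × 3 frozen matrix and rank 1 so the numbers are easy to follow. Initializing B to zero, so that the adapted model starts out identical to the pre-trained one, follows the usual LoRA recipe but is an assumption here since the notes do not mention initialization. The 4096 × 4096 matrix with rank 8 is a hypothetical example.

```python
def matvec(m, v):
    """Matrix-vector product with plain lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w0, a, b, x, scale=1.0):
    """Compute (W0 + scale * B A) x without ever forming the full update matrix."""
    base = matvec(w0, x)                 # frozen pre-trained path
    delta = matvec(b, matvec(a, x))      # low-rank trainable path
    return [y + scale * d for y, d in zip(base, delta)]

def trainable_params(d_out, d_in, r):
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_in + d_out)

w0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]                   # frozen weights (toy identity matrix)
a = [[0.1, 0.2, 0.3]]                    # r x d_in with r = 1
b = [[0.0], [0.0], [0.0]]                # d_out x r, zero at initialization
x = [1.0, 2.0, 3.0]

y_start = lora_forward(w0, a, b, x)      # equals W0 x because B is zero

# Hypothetical 4096 x 4096 projection matrix adapted with rank 8.
reduction = (4096 * 4096) / trainable_params(4096, 4096, 8)
print(y_start, reduction)
```

For this single hypothetical matrix the reduction is 256x; larger overall savings come from leaving the many other matrices in the model frozen with no adapter at all.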

4.7.2 Quantized LoRA (QLoRA)

QLoRA further extends LoRA by quantizing the frozen pre-trained weights to 4-bit precision, which cuts the memory needed for those weights by about 4x relative to 16-bit storage (8x relative to FP32). QLoRA uses the NF4 (4-bit NormalFloat) quantization format, designed for normally distributed neural network weights, and double quantization, which also quantizes the quantization constants to further reduce memory overhead.

QLoRA enables fine tuning of a 65B parameter model on a single GPU with 48GB of VRAM, with almost no performance degradation compared to 16-bit full fine tuning.
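To make the idea of 4-bit block quantization concrete, the sketch below quantizes a small block of weights to signed integers in the range -7 to 7 using a single scale per block, then restores them. Note that this is a simplified uniform grid swapped in for illustration: it is not the NF4 format, whose levels are spaced to match a normal distribution, and it does not implement double quantization. The weight values are made up, and the last line is only a rough size estimate for the weights of a hypothetical 65B parameter model.

```python
def quantize_block(weights, levels=7):
    """Map each weight to an integer code in [-levels, levels] (fits in 4 bits)."""
    absmax = max(abs(w) for w in weights) or 1.0
    scale = absmax / levels              # one scale constant stored per block
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [c * scale for c in codes]

block = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44, -0.02, 0.0]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(w, ) if False else abs(w - r) for w, r in zip(block, restored))

# Rough weight storage at 4 bits (0.5 bytes) per parameter, ignoring scale constants.
weights_gb_4bit = 65e9 * 0.5 / 1e9
print(codes, max_err, weights_gb_4bit)
```

The rounding error per weight is at most half the scale, and the frozen weights are only dequantized on the fly when they are needed for a matrix multiplication.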

These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you thoroughly master the content of this subject. Wish you continuous academic progress and great achievements in your studies.

Video Source and Usage Instructions

Video Title: Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 4 - LLM Training Stanford Online
• Course Series: Stanford CME295: Transformers and Large Language Models | Autumn 2025
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/VlA_jt_3Qc4?si=OH5naRm0FYmyYdwR
