One. Course Details
This is a foundational lecture of Stanford University's CME 296: Diffusion and Large Vision Models, dedicated entirely to demystifying large language models (LLMs)—the technology powering ChatGPT, Cursor, and virtually every modern AI application. Unlike most technical lectures in the course, this session features no executable code or hands-on exercises. Instead, it provides a high-level conceptual framework for understanding what language models are, why they work so surprisingly well, the key engineering innovations that enabled their rise, and the current state of the industry.
The lecture is structured into four core sections: a formal definition of language models and their mathematical underpinnings, three fundamental reasons why modeling language has proven to be such a powerful general-purpose approach, a deep dive into the technical and engineering breakthroughs that make modern LLMs possible, and an overview of the current competitive landscape between closed-source frontier models and open-weight alternatives. The central thesis is that the seemingly magical capabilities of LLMs emerge from an extremely simple core idea—predicting the next token in a sequence—scaled to unimaginable sizes with massive amounts of data and compute.
Two. Key Learning Takeaways
A modern large language model is fundamentally just a probability distribution over sequences of tokens, trained to predict the most likely next token given all previous tokens in the context.
Next-token prediction is an extraordinarily powerful self-supervised learning objective that forces the model to learn not just grammar and vocabulary, but also facts, reasoning, logic, and cultural knowledge from raw text.
The scaling laws of neural language models are the most important discovery in AI of the past decade: model performance improves predictably and smoothly as you increase the number of parameters, the amount of training data, and the compute used for training, with no signs of fundamental saturation yet.
Modern LLM development follows a strict two-stage paradigm: pre-training on trillions of tokens of general internet text to build a knowledgeable base model, followed by post-training to align the model to follow instructions, be safe, and behave in useful ways.
The Transformer architecture is the only practical foundation for large language models today, solving three critical limitations of earlier approaches: it scales efficiently to long sequences, supports dynamic attention to relevant context, and enables massive computation reuse during inference.
Language modeling has become an industrial-scale engineering discipline, with frontier models now requiring hundreds of millions of dollars in compute, hundreds of engineers, and global supply chains to build.
The AI industry is currently split between two competing paradigms: closed-source API-only models controlled by a small number of large tech companies, and open-weight models that anyone can download, modify, and run locally.
Three. Notable Course Quotes
"Qwen 3 is trained on 36 trillion tokens. If you stacked that text on paper, it would reach 9,000 kilometers above the Earth's surface. The International Space Station only orbits at 400 kilometers."
"Language models are unsupervised multitask learners. Train them well enough on next token prediction, and they will learn to do almost every task you can describe in text."
"Every time you double the model size, you just expect the loss to keep dropping. This is the most boring, predictable, and transformative discovery in the history of computer science."
"Pre-training is where the model learns all the knowledge and capabilities. Post-training is just teaching it which direction to face when it talks to you."
"A base model is just autocomplete on steroids. It doesn't know you're asking it a question. It just knows what usually comes after that sequence of characters on the internet."
"Transformer architecture didn't win because it was the most mathematically elegant. It won because it was the only architecture that scaled efficiently to thousands of GPUs."
"Today, if you want to train a frontier model, you don't need a brilliant new algorithm. You need $100 million and a contract with NVIDIA for 10,000 H100s."
Four. Layered Learning Notes
Module 1: What Language Models Actually Are
At their core, language models are mathematical objects that assign a probability to any sequence of tokens. Given a prefix of text, a language model outputs a probability distribution over all possible next tokens in the vocabulary. This simple definition belies their extraordinary power.
The lecture formalizes language modeling using the chain rule of probability. The joint probability of an entire sequence can be decomposed into the product of conditional probabilities of each token given all previous tokens:
P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × ... × P(wₙ|w₁,...,wₙ₋₁)
This decomposition is the foundation of autoregressive language modeling, the approach used by virtually all modern LLMs. To generate text, you predict the next token, append it to the context, and repeat until the model emits an end-of-sequence token or a length limit is reached.
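As a minimal sketch of the chain rule and the generation loop, the toy model below is a hand-written probability table (TOY_MODEL, its tokens, and its probabilities are all made up for illustration and do not come from the lecture). Unlike a real LLM, it conditions only on the last token, which keeps the table small while leaving the loop structure the same.

```python
import math
import random

# Hypothetical toy model: P(next token | last token). A real LLM conditions on
# the whole prefix; this table is an illustrative assumption, not lecture data.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 0.8, "</s>": 0.2},
    "dog": {"sleeps": 0.4, "</s>": 0.6},
    "sleeps": {"</s>": 1.0},
}


def next_token_distribution(prefix):
    """Return P(next token | prefix) as a dict of token -> probability."""
    return TOY_MODEL[prefix[-1]]


def sequence_log_prob(tokens):
    """Chain rule: log P(w_1..w_n) = sum_i log P(w_i | w_1..w_{i-1})."""
    prefix = ["<s>"]
    total = 0.0
    for tok in tokens:
        total += math.log(next_token_distribution(prefix)[tok])
        prefix.append(tok)
    return total


def generate(rng, max_tokens=10):
    """Autoregressive loop: sample a token, append it, repeat until </s>."""
    prefix = ["<s>"]
    for _ in range(max_tokens):
        dist = next_token_distribution(prefix)
        tokens, probs = zip(*dist.items())
        tok = rng.choices(tokens, weights=probs, k=1)[0]
        if tok == "</s>":
            break
        prefix.append(tok)
    return prefix[1:]


# P("the cat sleeps </s>") = 0.6 * 0.5 * 0.8 * 1.0
print(math.exp(sequence_log_prob(["the", "cat", "sleeps", "</s>"])))
print(generate(random.Random(0)))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is why the sketch sums logs instead of multiplying raw probabilities.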
Early approaches to language modeling relied on counting n-grams—sequences of n consecutive words—in training data. While simple, these models failed catastrophically on sequences that had never been seen before, as they could not generalize beyond their explicit training examples. Neural language models solved this problem by learning distributed representations (embeddings) of words, allowing them to generalize to novel sequences by leveraging semantic similarity.
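The failure mode of count-based models can be seen in a small sketch. The corpus and function names below are hypothetical, and the model is a plain maximum-likelihood bigram model with no smoothing, which is an assumption about the simplest possible setup, not necessarily the exact variant shown in the lecture.

```python
from collections import Counter

# Tiny illustrative corpus (an assumption for this sketch).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])  # every token that has a successor


def bigram_prob(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sequence_prob(words):
    """Product of bigram probabilities over consecutive word pairs."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p


# Every bigram here was seen in training, so the probability is non-zero.
print(sequence_prob("the dog sat on the mat".split()))
# "cat slept" never occurred, so the whole sentence gets probability zero,
# even though "slept" is semantically close to "sat".
print(sequence_prob("the cat slept on the mat".split()))
```

A single unseen pair zeroes out the entire sequence. Smoothing schemes soften this, but the counts still carry no notion that two words are similar, which is the gap that learned embeddings fill.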
Module 2: Why Modeling Language Is Revolutionary
The lecture presents three interconnected reasons why language modeling has become the dominant paradigm in AI:
- Virtually all cognitive tasks can be cast as sequence completion: Writing emails, writing code, answering questions, solving math problems, and even reasoning are all fundamentally problems of predicting what comes next in a sequence of text. If you can model language well enough, you can solve almost any task that can be described in words.
- Next-token prediction enables unsupervised multitask learning: Unlike traditional machine learning, which requires manually labeled training data for each individual task, language models learn from raw unlabeled text. Training on trillions of tokens of diverse internet text forces the model to learn not just language, but also facts about the world, logical reasoning, and how to perform thousands of different tasks.
- Language models scale predictably and reliably: The discovery of scaling laws in 2020 showed that model performance improves as a smooth power law function of compute, data, and parameters. There are no sudden jumps or plateaus: if you invest more resources, you get a better model. This predictability has allowed companies to invest billions of dollars in training larger models with confidence.
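To make the "smooth power law" claim concrete, here is a small sketch of a loss curve of the form L(N) = (N_c / N)^alpha. The constants n_c and alpha are illustrative placeholders chosen for this example, not fitted values quoted in the lecture.

```python
import math


def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    """Loss as a power law in parameter count: L(N) = (n_c / N) ** alpha.

    n_c and alpha are hypothetical constants used only for illustration.
    """
    return (n_c / n_params) ** alpha


for n in (1e9, 2e9, 4e9, 8e9):
    print(f"{n:.0e} params -> loss {power_law_loss(n):.3f}")

# Doubling the model multiplies the loss by the same factor, 2 ** -alpha,
# no matter where on the curve you start.
ratio = power_law_loss(2e9) / power_law_loss(1e9)

# On log-log axes the curve is a straight line with slope -alpha.
slope = (math.log(power_law_loss(8e9)) - math.log(power_law_loss(1e9))) / (
    math.log(8e9) - math.log(1e9)
)
print(ratio, slope)
```

The constant per-doubling ratio is what makes the relationship predictable: a few small training runs can be used to extrapolate the loss of a much larger one, under the assumption that the same power law keeps holding.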
Module 3: The Transformer Architecture
While the idea of language modeling has existed for decades, it only became practical at scale with the invention of the Transformer architecture in 2017. The lecture contrasts Transformers with naive multi-layer perceptrons (MLPs) to highlight three critical advantages:
- Parameter efficiency: Unlike MLPs, whose parameter count grows quadratically with sequence length, Transformers have a parameter count that is almost independent of sequence length. This allows them to scale to context windows of hundreds of thousands or even millions of tokens.
- Dynamic attention: Transformers use self-attention mechanisms to dynamically weight the importance of different tokens in the input context. This allows the model to focus on relevant information and ignore irrelevant filler words, a capability that is essential for understanding long and complex texts.
- Computation reuse: Transformers are designed to enable efficient caching of intermediate computations during autoregressive generation. When generating text one token at a time, you only need to compute the activations for the new token, rather than reprocessing the entire sequence from scratch.
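The three points above can be illustrated with a single-head causal self-attention layer written in plain Python. This is a simplified sketch under stated assumptions: one head, no output projection, random toy weights, and a hypothetical KVCache class. It is not the lecture's implementation.

```python
import math
import random

D = 4  # model width (hypothetical toy size)
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Projection weights: their shapes depend only on D, never on sequence length.
W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)
PARAM_COUNT = sum(len(w) * len(w[0]) for w in (W_Q, W_K, W_V))


def project(x, w):
    """Multiply row vector x (length D) by matrix w (D x D)."""
    return [sum(x[i] * w[i][j] for i in range(D)) for j in range(D)]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]  # dynamic, input-dependent weights
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(D)]


def full_forward(xs):
    """Recompute causal self-attention for every position from scratch."""
    keys = [project(x, W_K) for x in xs]
    values = [project(x, W_V) for x in xs]
    return [
        attend(project(x, W_Q), keys[: t + 1], values[: t + 1])
        for t, x in enumerate(xs)
    ]


class KVCache:
    """Store past keys/values so each new token needs only its own projections."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        return attend(project(x, W_Q), self.keys, self.values)


tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]
cache = KVCache()
incremental = [cache.step(x) for x in tokens]  # one token at a time
reference = full_forward(tokens)  # whole sequence at once
print(PARAM_COUNT, incremental[-1], reference[-1])
```

The incremental path produces the same outputs as the full recomputation because, with causal masking, earlier keys and values never change when a new token arrives. The parameter count stays at 3 × D × D whether the sequence has 6 tokens or 6 million.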
Module 4: Pre-Training: Building the Base Model
Modern LLM training is split into two distinct phases: pre-training and post-training. Pre-training is the most computationally expensive and important phase, where the model learns almost all of its knowledge and capabilities.
During pre-training, the model is trained on trillions of tokens of text scraped from the internet. The raw Common Crawl dataset is extremely noisy, and only about 1.4% of it is considered high enough quality to use for training. Data curation—filtering, deduplication, and quality scoring—is now one of the most important and secretive parts of LLM development.
The only objective used during pre-training is next-token prediction. Despite its simplicity, this objective produces models with remarkable capabilities. The 2020 GPT-3 paper demonstrated that large pre-trained models exhibit few-shot in-context learning: they can perform new tasks simply by being shown a few examples in the input prompt, without any additional training.
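Few-shot in-context learning amounts to packing worked examples into the prompt and letting the model continue the pattern. The helper below is a hypothetical sketch: the function name and the "Input:/Output:" layout are assumptions made for illustration, not a format prescribed by the lecture or by any particular model.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a prompt that shows solved examples before the new query.

    The "Input:" / "Output:" labels are an arbitrary illustrative convention.
    """
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    examples=[("cheese", "fromage"), ("house", "maison")],
    query="book",
    instruction="Translate English to French.",
)
print(prompt)
```

No weights are updated here. The task is specified entirely through the context, and the same next-token objective does the rest.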
Module 5: Post-Training: Aligning the Model
A raw pre-trained base model is an autocomplete system—it will continue any sequence in whatever way is most statistically likely based on its training data. It will not naturally follow instructions or answer questions helpfully. Post-training is the process of turning a base model into a useful assistant like ChatGPT.
The primary post-training techniques are:
- Supervised Fine-Tuning (SFT): The model is fine-tuned on a dataset of human-written instruction-response pairs to teach it to follow instructions rather than just continue text.
- Reinforcement Learning from Human Feedback (RLHF): Human labelers rank different model responses to the same prompt, and a separate reward model is trained to predict which responses humans prefer. The model, typically the checkpoint produced by SFT, is then fine-tuned using reinforcement learning to maximize the reward from the reward model.
- Safety Tuning: A specialized form of post-training that teaches the model to refuse harmful requests. However, safety guardrails are notoriously fragile, and researchers continuously discover new "jailbreak" techniques that can bypass them.
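The reward-model step of RLHF can be sketched as follows. A common formulation, assumed here and not confirmed by the lecture, turns each human ranking into (preferred, rejected) pairs and trains the reward model with a Bradley-Terry style loss, -log sigmoid(r_preferred - r_rejected). Function names are hypothetical.

```python
import itertools
import math


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into all (preferred, rejected) pairs."""
    return list(itertools.combinations(ranked_responses, 2))


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Compute -log(sigmoid(margin)) in a numerically stable way.

    The loss is small when the reward model scores the preferred response
    higher than the rejected one, and large when it gets the order wrong.
    """
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pairs = ranking_to_pairs(["response_a", "response_b", "response_c"])
print(pairs)
print(pairwise_preference_loss(2.0, 0.0))  # correct ordering, low loss
print(pairwise_preference_loss(0.0, 2.0))  # wrong ordering, high loss
```

In a real pipeline the rewards would come from a neural network and the loss would be minimized by gradient descent. The reinforcement learning stage that follows then uses the trained reward model as its scoring function.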
Module 6: Systems Engineering for Large Language Models
Training and deploying large language models is as much a systems engineering problem as a machine learning problem. A single 70-billion-parameter model requires 140 gigabytes of memory just to store the weights at 16-bit precision (2 bytes per parameter), and training requires over a terabyte of memory for activations, gradients, and optimizer states.
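The memory figures can be checked with back-of-the-envelope arithmetic. The weight calculation follows directly from 2 bytes per parameter. The training estimate assumes mixed-precision training with the Adam optimizer, roughly 16 bytes per parameter before activations are counted. That breakdown is a common rule of thumb and an assumption here, not a figure given in the lecture.

```python
GB = 1e9  # decimal gigabytes
N_PARAMS = 70_000_000_000


def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed just to hold the weights at a given precision."""
    return n_params * bytes_per_param / GB


def training_state_memory_gb(n_params):
    """Rough per-parameter cost of mixed-precision Adam (assumed breakdown).

    2 bytes  16-bit weights
    2 bytes  16-bit gradients
    4 bytes  32-bit master copy of the weights
    4 bytes  Adam first moment
    4 bytes  Adam second moment
    Activations are excluded; they depend on batch size and sequence length.
    """
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return n_params * bytes_per_param / GB


print(weight_memory_gb(N_PARAMS, 2))       # 16-bit weights
print(weight_memory_gb(N_PARAMS, 0.5))     # 4-bit weights
print(training_state_memory_gb(N_PARAMS))  # weights + gradients + optimizer
```

Under these assumptions the training state alone comes to about 1.1 terabytes, far more than a single accelerator holds, which is why the model has to be split across many devices.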
Key systems techniques that make large-scale training possible:
- Quantization: Reducing the precision of model weights from 16-bit or 32-bit floating-point values to 4-bit or even 2-bit integers to reduce memory usage.
- Model Parallelism: Splitting the model across multiple GPUs, either by layer (pipeline parallelism) or by matrix dimension (tensor parallelism).
- Kernel Fusion: Combining multiple small operations into a single GPU kernel to reduce memory bandwidth usage, the primary bottleneck in LLM training and inference.
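Of the techniques above, quantization is the easiest to show in a few lines. The sketch below uses symmetric round-to-nearest quantization with a single scale per tensor, one simple scheme chosen for illustration. Production systems generally use more elaborate variants (per-group scales, calibration), and the function names here are hypothetical.

```python
def quantize_symmetric(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
q, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs 4 bits plus a share of one stored scale, at the cost of a rounding error of at most half a quantization step per weight.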
Module 7: The Modern AI Landscape
The lecture concludes with an overview of the current AI industry, which is split between two competing models:
- Closed-source frontier models: Developed by companies like OpenAI and Anthropic, these models are only accessible via paid APIs. They are generally the most capable, but their internal details are closely guarded secrets.
- Open-weight models: Developed by companies like Meta, Alibaba, and Mistral, these models are released publicly for anyone to download, modify, and run. While slightly less capable than the top closed models, they are rapidly catching up and offer unparalleled flexibility and privacy.
Academic research in language modeling has been fundamentally transformed by the rise of industrial LLMs. Universities no longer have the resources to train frontier models, so most academic research now focuses on open models, alignment, evaluation, and understanding how existing models work.
Wishing you a clear and intuitive understanding of the technology that is reshaping our world. May you look beyond the hype to see the simple yet powerful ideas that make language models work, and may you use this knowledge to build tools that benefit everyone. Whether you go on to train models, build applications, or study their societal impact, may your curiosity drive you to ask the hard questions and push the boundaries of what is possible. Good luck with the rest of your studies and all your future endeavors!


