One. Course Details
This is the third lecture of Stanford University’s CME 295: Transformers and Large Language Models, taught by twin brothers Afshine and Shervine Amidi. The lecture opens with a key logistical update: course slides will now be published on the official website every Thursday evening, allowing students to download and annotate them in advance of each Friday’s class.
The lecture follows a clear two-part structure designed to transition from foundational architecture to real-world LLM functionality. In the first half, Afshine formally defines large language models, introduces the mixture of experts (MoE) architecture that powers most state-of-the-art models, and breaks down the mathematical and practical details of text generation strategies. In the second half, Shervine covers core prompting techniques that enable zero-shot and few-shot learning, followed by a deep dive into the inference optimization methods that make LLMs fast and efficient enough for production use.
This lecture marks a critical turning point in the course, as it moves beyond the theoretical transformer architecture to explain how modern LLMs actually work, how to interact with them effectively, and how they are optimized for real-world deployment.
Two. Key Learning Objectives
By the end of this lecture, students will be able to:
- Define a large language model, name the three dimensions in which it is "large", and explain why the term is reserved for decoder-only architectures.
- Distinguish between dense and sparse mixture of experts models, and analyze the tradeoffs between model capacity, inference cost, and training challenges.
- Compare and contrast three core text generation strategies (greedy decoding, beam search, and sampling-based methods) and identify their optimal use cases.
- Explain how the temperature parameter modulates the probability distribution of next tokens and controls the tradeoff between output determinism and creativity.
- Implement core prompting techniques including zero-shot learning, few-shot learning, chain-of-thought reasoning, and self-consistency to improve LLM performance.
- Describe how key-value (KV) caching, grouped query attention, and PagedAttention reduce inference latency and memory usage.
- Explain the mathematical intuition behind speculative decoding and how it leverages the memory-bound nature of LLM inference to accelerate generation.
Three. Memorable Course Quotes
"Large language models are decoder-only models that predict the probability of the next token. They are large in three dimensions: model size, training data volume, and compute requirements."
"Mixture of experts is like having a room full of specialists: you only ask the mathematician your math question, not the historian. This lets us scale model capacity without scaling inference cost."
"Nothing in the transformer architecture is probabilistic. The only source of randomness in LLM outputs is how you sample the next token."
"Chain of thought works because forcing the model to show its reasoning before giving the answer dramatically improves its ability to solve complex problems."
"At inference time, LLMs are memory-bound, not compute-bound. This is why caching and memory management techniques are so critical for performance."
Four. Detailed Lecture Notes
4.1 Transformer Model Families Recap
The lecture begins with a quick review of the three core transformer model families introduced in Lecture 2:
- Encoder-decoder models: Retain both components of the original transformer, used for sequence-to-sequence tasks like machine translation. The primary example is the T5 family.
- Encoder-only models: Remove the decoder, use bidirectional attention, and are optimized for classification and token-level tasks. BERT is the most prominent example.
- Decoder-only models: Remove the encoder and cross-attention layers, use causal masked self-attention, and are optimized for text generation.
Over 90% of modern large language models use the decoder-only architecture, including GPT, LLaMA, Gemma, Mistral, DeepSeek, and Qwen. This architecture has proven to be the most scalable and general-purpose for a wide range of generative AI tasks.
4.2 Large Language Model Definition
A large language model (LLM) is first and foremost a language model: a model that assigns probabilities to sequences of tokens. Specifically, LLMs are trained to predict the probability of the next token given all previous tokens in the sequence.
The term "large" refers to three distinct dimensions:
- Model size: Modern LLMs typically have at least 1 billion parameters, with state-of-the-art models reaching hundreds of billions or even trillions of parameters.
- Training data volume: LLMs are trained on hundreds of billions to tens of trillions of tokens of text data from the internet, books, and other sources.
- Compute requirements: Training and running LLMs requires massive amounts of GPU compute, though recent optimizations have made smaller models runnable on consumer GPUs.
An important clarification: under the current standard definition, encoder-only models like BERT are not considered LLMs because they cannot generate text. Only decoder-only text-to-text models qualify as LLMs.
4.3 Mixture of Experts (MoE) Architecture
The mixture of experts architecture addresses a fundamental limitation of dense transformer models: as model size increases, the computational cost of every forward pass increases proportionally.
4.3.1 Core Intuition
The MoE approach is inspired by human expertise: just as you would ask a mathematician to solve a math problem rather than a historian, an MoE model only activates a small subset of its parameters for each input token.
An MoE model consists of:
- N expert networks: Each expert is a specialized feedforward neural network.
- A gating (router) network: A small linear layer followed by a softmax, which takes a token's representation as input and outputs a probability distribution over the experts.
The final output is a weighted sum of the outputs of the selected experts, weighted by the gating network's probabilities.
4.3.2 Dense vs. Sparse MoE
There are two primary variants of MoE:
- Dense MoE: All experts contribute to the output, with different weights. This does not reduce computational cost but can improve model performance.
- Sparse MoE: Only the top-k highest-probability experts are activated for each token (typically k=1 or k=2). Compared with running all N experts, this cuts the floating-point operations (FLOPs) spent in the expert layers by roughly a factor of N/k. The attention layers and the router cost the same as before.
Sparse MoE is the variant used in modern large-scale MoE LLMs, as it allows scaling model capacity to trillions of parameters while keeping inference costs manageable.
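The top-k routing described above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the toy experts, the router weights, and the choice to renormalize the top-k probabilities so they sum to 1 are all assumptions made for the example.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_moe(token, router_weights, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    # Router: one linear score per expert, then a softmax over experts.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Keep the k most probable experts and renormalize their weights (assumption).
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    output = [0.0] * len(token)
    for i in top:
        expert_out = experts[i](token)  # only k experts are ever evaluated
        for d, value in enumerate(expert_out):
            output[d] += (probs[i] / norm) * value
    return output, top

# Four hypothetical "experts": each just scales the input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, chosen = sparse_moe([2.0, 1.0], router_weights, experts, k=2)
print(chosen)  # indices of the experts that were actually run
print(output)
```

With k=2 out of 4 experts, only half of the expert computation runs for this token, which is the N/k saving in miniature.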
4.3.3 MoE Implementation in LLMs
In decoder-only LLMs, the mixture of experts replaces the feedforward neural network (FFN) in each transformer block. The FFN holds most of a dense transformer's parameters, which makes it the natural place to add capacity through experts.
Each token in the sequence is routed independently to its own set of experts. This means different tokens in the same sequence can be processed by different experts, allowing the model to specialize in different types of content and linguistic patterns.
4.3.4 Training Challenges and Solutions
The primary challenge in training sparse MoE models is routing collapse: the gating network learns to route almost all tokens to just a few experts, leaving the remaining experts unused. This negates the capacity benefits of the MoE architecture.
To mitigate routing collapse, researchers add a load balancing loss term to the overall training objective. This term penalizes imbalanced expert usage and encourages the gating network to distribute tokens evenly across all experts. Additional techniques like noisy gating (adding random noise to the router's scores before the top-k selection) also help prevent collapse.
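To make the load balancing term concrete, here is a sketch of one common formulation, in the style of the Switch Transformer auxiliary loss: N times the sum over experts of (fraction of tokens sent to the expert) times (mean router probability for the expert). The notes do not say which formulation the lecture used, so treat this choice, and the function name, as assumptions.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """router_probs: one probability list per token.
    assignments: index of the expert each token was routed to (top-1)."""
    num_tokens = len(assignments)
    # f[i]: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # p[i]: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    # Smallest when both f and p are uniform; grows as routing concentrates.
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing across 4 experts.
balanced = load_balancing_loss([[0.25, 0.25, 0.25, 0.25]] * 4, [0, 1, 2, 3], 4)
# Collapsed routing: every token goes to expert 0.
collapsed = load_balancing_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], 4)
print(balanced, collapsed)
```

The balanced case evaluates to 1 and the collapsed case to nearly 4 (the number of experts), so adding this term to the training objective pushes the router away from collapse.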
4.4 Text Generation Strategies
All LLMs generate text one token at a time by predicting the probability distribution over the vocabulary for the next token. The choice of how to select the next token from this distribution has a profound impact on the quality, diversity, and creativity of the output.
4.4.1 Greedy Decoding
The simplest generation strategy is greedy decoding, which always selects the token with the highest probability at each step.
Advantages: Fast, simple, and deterministic.
Disadvantages:
- Produces repetitive and boring output with no diversity.
- Locally optimal choices do not always lead to the globally optimal sequence. A slightly lower-probability token early on may lead to a much higher-probability overall sequence.
4.4.2 Beam Search
Beam search addresses the local optimality problem by keeping track of the k most probable partial sequences (called beams) at each step. At the end of generation, the sequence with the highest overall probability is selected.
Advantages: Produces higher-quality and more coherent output than greedy decoding.
Disadvantages:
- Computationally expensive, as it requires maintaining and expanding k separate sequences.
- Still deterministic, so the output lacks diversity and creativity.
- Biased toward shorter sequences, as longer sequences have lower overall probability (the product of many numbers less than 1).
Beam search is primarily used for tasks where accuracy and coherence are more important than creativity, such as machine translation and summarization.
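The difference between greedy decoding and beam search shows up even in a two-step toy problem. The sketch below uses a hypothetical hand-written "model" (a dictionary of next-token probabilities, invented for this example) in which the most likely first token leads to a less likely overall sequence. Running beam search with a beam width of 1 is the same as greedy decoding.

```python
import math

# Hypothetical toy model: maps a prefix to next-token probabilities.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, steps, beam_width):
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the beam_width best partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_seq, greedy_score = beam_search(TOY_MODEL, steps=2, beam_width=1)
beam_seq, beam_score = beam_search(TOY_MODEL, steps=2, beam_width=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

Greedy commits to "A" (0.6) and ends with a sequence of probability 0.6 × 0.5 = 0.30, while a beam of width 2 keeps "B" alive and finds the sequence with probability 0.4 × 0.9 = 0.36. Scores are summed in log space because multiplying many probabilities below 1 quickly underflows, and that same shrinking product is the source of the bias toward shorter sequences.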
4.4.3 Sampling-Based Methods
Sampling-based methods select the next token randomly according to the probability distribution output by the model. This introduces randomness and produces diverse, creative output.
Two common refinements to basic sampling prevent the model from generating extremely low-probability and nonsensical tokens:
- Top-k sampling: Only sample from the k highest-probability tokens.
- Top-p (nucleus) sampling: Only sample from the smallest set of tokens whose cumulative probability reaches or exceeds a threshold p. This dynamically adjusts the number of tokens considered based on the model's confidence.
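Both filters can be written as small functions over a next-token distribution. This is an illustrative sketch: the two example distributions are made up, and stopping as soon as the cumulative probability is at least p (rather than strictly greater) is an assumption about the cutoff convention.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

def sample(probs, rng):
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]

# Hypothetical next-token distributions: one confident, one uncertain.
confident = {"Paris": 0.90, "Lyon": 0.05, "Nice": 0.03, "Rome": 0.02}
uncertain = {"red": 0.30, "blue": 0.28, "green": 0.22, "pink": 0.20}

print(top_k_filter(confident, k=2))
print(top_p_filter(confident, p=0.9))  # a single token already covers 0.9
print(top_p_filter(uncertain, p=0.9))  # all four tokens are needed
print(sample(top_p_filter(uncertain, p=0.9), random.Random(0)))
```

The same threshold p keeps one candidate when the model is confident and four when it is not, which is the dynamic adjustment that a fixed k cannot provide.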
4.4.4 Temperature Parameter
The temperature parameter T controls the shape of the probability distribution before sampling. It scales the logits (the inputs to the softmax function) as follows:
P(token_i) = exp(logit_i / T) / sum_j exp(logit_j / T)
- Low temperature (T < 1, for example T = 0.5 or below): Sharpens the probability distribution, making high-probability tokens even more likely. Produces more deterministic, focused output.
- High temperature (T > 1): Flattens the probability distribution toward uniform, making all tokens more equally likely. Produces creative, diverse, and sometimes nonsensical output.
- T = 0: The formula is undefined at T = 0, so this setting is treated as the limit as T approaches 0, where all probability mass moves to the highest-logit token. This is equivalent to greedy decoding.
An important practical note: while the transformer's forward pass is deterministic in exact arithmetic, GPU floating-point addition is not associative. Parallel kernels that sum values in a different order from run to run can produce slightly different logits, and therefore occasionally different tokens, even at T=0.
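The effect of temperature is easy to see numerically. The sketch below applies the temperature-scaled softmax to a small set of made-up logits (the values are an assumption chosen for illustration).

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T = 0.1 almost all of the probability sits on the first token, at T = 1 the distribution is the ordinary softmax, and at T = 10 the three probabilities are nearly equal. T = 0 is deliberately not called here because it would divide by zero; implementations typically special-case it as an argmax.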
4.4.5 Guided Decoding
Guided decoding ensures that the generated output adheres to a specific format (such as JSON, XML, or a custom grammar) by masking out invalid next tokens at each step. This largely removes the need for post-processing and retry loops when generating structured output.
4.5 Prompting Strategies
Prompting is the practice of designing input text to elicit specific desired outputs from an LLM without modifying its weights.
4.5.1 Context Length and Context Rot
The context length (also called context size or window size) of an LLM is the maximum number of tokens it can process in a single forward pass. Modern LLMs have context lengths ranging from tens of thousands to millions of tokens.
However, longer context lengths do not always lead to better performance. A phenomenon called context rot, closely related to the "lost in the middle" effect, shows that LLMs have difficulty retrieving information that is located in the middle of long contexts, especially when there are distracting irrelevant passages. For retrieval tasks, it is still best to provide only the most relevant context to the model.
4.5.2 Prompt Structure
Effective prompts typically consist of four components:
- Context: Background information and setup for the task.
- Instructions: A clear description of what the model should do.
- Inputs: The specific data the model should process.
- Constraints: Rules and limitations on the output format and content.
4.5.3 In-Context Learning
LLMs can perform new tasks without any fine-tuning through in-context learning, where task instructions and examples are provided as part of the prompt. There are two primary variants:
- Zero-shot learning: The model is given only the task instruction with no examples.
- Few-shot learning: The model is given several examples of input-output pairs before the actual query.
Few-shot learning generally produces better performance than zero-shot learning, especially for smaller models. However, modern state-of-the-art LLMs can often achieve comparable or better performance with well-written detailed instructions than with few-shot examples.
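The only difference between the two variants is what goes into the prompt string. The helper below is a hypothetical sketch (the function name, the "Input:/Output:" layout, and the sentiment task are all assumptions for illustration) that builds a zero-shot prompt when no examples are passed and a few-shot prompt otherwise.

```python
def build_prompt(instruction, query, examples=()):
    """Zero-shot when examples is empty, few-shot otherwise."""
    parts = [instruction]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The prompt ends with an open "Output:" for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

instruction = "Classify the sentiment of each review as positive or negative."
examples = [
    ("The battery lasts all day.", "positive"),
    ("It broke after a week.", "negative"),
]
query = "Great screen, terrible speakers."

zero_shot = build_prompt(instruction, query)
few_shot = build_prompt(instruction, query, examples)
print(few_shot)
```

No weights change in either case: the examples simply become extra tokens in the context, which is also why few-shot prompts consume more of the context length.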
4.5.4 Chain-of-Thought (CoT) Prompting
Chain-of-thought prompting dramatically improves the performance of LLMs on complex reasoning tasks by asking the model to generate a step-by-step reasoning process before giving the final answer.
Advantages:
- Significantly improves performance on arithmetic, logic, and problem-solving tasks.
- Makes the model's reasoning process transparent, allowing for easier debugging and error analysis.
4.5.5 Self-Consistency
Self-consistency is an extension of chain-of-thought prompting that further improves performance. The model is sampled multiple times to generate different reasoning paths, and the final answer is selected by majority vote among the generated answers.
Self-consistency leverages the diversity of sampling to produce more robust and accurate results. Compute cost grows linearly with the number of samples, but because the generations are independent they can be run in parallel, so latency stays close to that of a single generation.
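The voting step can be sketched as follows. The `fake_llm` function is a hypothetical stand-in for a sampled chain-of-thought call that returns only the extracted final answer; a real system would call a model with a non-zero temperature and parse the answer out of each reasoning trace.

```python
from collections import Counter

def self_consistency(sample_answer, question, num_samples=5):
    """Sample several reasoning paths and return the most common final answer."""
    answers = [sample_answer(question) for _ in range(num_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / num_samples

# Hypothetical stand-in for an LLM: replays canned final answers,
# as if five sampled reasoning paths had disagreed slightly.
canned_answers = iter(["18", "18", "26", "18", "17"])

def fake_llm(question):
    return next(canned_answers)

answer, agreement = self_consistency(fake_llm, "How many apples are left?", 5)
print(answer, agreement)
```

Three of the five paths agree on "18", so that answer wins with an agreement rate of 0.6. The agreement rate can also serve as a rough confidence signal.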
4.6 LLM Inference Optimizations
Inference optimization is critical for making LLMs fast and cost-effective enough for production deployment. These techniques fall into two categories: exact optimizations that produce identical output, and approximate optimizations that trade minor quality for significant speedups.
4.6.1 Key-Value (KV) Caching
KV caching is the most fundamental and widely used inference optimization. During autoregressive generation, the keys and values computed for earlier tokens are stored and reused at every subsequent step, so only the newest token's query, key, and value need to be computed.
This reduces the attention cost of each generation step from O(n²) to O(n) in the current sequence length n, and typically provides a 2-10x speedup over naive generation.
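The sketch below compares generation with and without a KV cache on a toy single-head attention layer. It is a simplified illustration under several assumptions: the `project` function stands in for learned key and value projections, each token is used directly as its own query, and cost is measured only by counting key/value projections.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

def project(x, scale):
    # Hypothetical stand-in for a learned linear projection.
    return [scale * v for v in x]

def generate_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        # Recompute keys and values for the whole prefix at every step.
        keys = [project(x, 0.5) for x in tokens[:t]]
        values = [project(x, 2.0) for x in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs, projections

def generate_with_cache(tokens):
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for x in tokens:
        # Only the newest token is projected; older entries come from the cache.
        key_cache.append(project(x, 0.5))
        value_cache.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(x, key_cache, value_cache))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
slow, slow_cost = generate_without_cache(tokens)
fast, fast_cost = generate_with_cache(tokens)
print(slow_cost, fast_cost)
```

Both versions produce the same attention outputs, but the uncached loop performs 20 key/value projections for four tokens and the cached loop performs 8. The gap widens as the sequence grows, at the price of keeping the cache in memory.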
4.6.2 Grouped Query Attention (GQA)
Grouped query attention reduces the memory footprint of the KV cache by sharing key and value projection matrices across groups of attention heads. This is a middle ground between:
- Multi-head attention (MHA): Each head has its own key and value projections (highest quality, highest memory usage).
- Multi-query attention (MQA): All heads share a single key and value projection (lowest quality, lowest memory usage).
GQA provides almost the same quality as MHA with significantly lower memory usage and is used in most modern LLMs.
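A quick back-of-the-envelope calculation shows why the number of key/value heads matters. The model shape below (32 layers, 32 query heads of dimension 128, 16-bit values, a 4096-token sequence) is a hypothetical configuration chosen for round numbers, not one taken from the lecture.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, for illustration only.
config = dict(num_layers=32, head_dim=128, seq_len=4096)

for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(num_kv_heads=kv_heads, **config) / 2**30
    print(f"{name}: {kv_heads} KV heads -> {gib:.4f} GiB per sequence")
```

Under these assumptions the cache for one sequence takes 2 GiB with MHA, 0.5 GiB with GQA using 8 key/value heads, and 0.0625 GiB with MQA. The saving is exactly the ratio of query heads to key/value heads.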
4.6.3 PagedAttention
PagedAttention is a memory management technique for KV caching inspired by virtual memory in operating systems. It divides the KV cache into fixed-size blocks that can be stored non-contiguously in memory, which eliminates external fragmentation and limits internal fragmentation to the last, partially filled block of each sequence.
PagedAttention is the core technology behind the vLLM inference engine, which is reported to increase serving throughput by 2-4x compared to traditional inference systems.
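The bookkeeping behind this idea can be illustrated with a toy block allocator. This is only a sketch of the paging concept: the class and method names are invented for the example and do not correspond to vLLM's API, and no actual key or value tensors are stored.

```python
class PagedKVCache:
    """Toy block allocator: logical token positions map to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of cached tokens

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # All current blocks are full: take any free block, wherever it is.
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        # Physical (block, offset) slot where this token's K/V would be written.
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")  # sequence "a" takes blocks 0 and 1
cache.append_token("b")      # sequence "b" takes block 2
cache.free("a")              # blocks 0 and 1 return to the pool
for _ in range(4):
    cache.append_token("b")  # "b" grows into whichever blocks are free
print(cache.block_tables["b"])
```

Sequence "b" ends up spread over blocks 2, 3, and 0, which are not contiguous, and the only wasted space is the unused slot in its last block. A contiguous allocator would instead have to reserve the maximum possible length for every sequence up front.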
4.6.4 Speculative Decoding
Speculative decoding accelerates generation by leveraging the observation that LLM inference is memory-bound rather than compute-bound: scoring several tokens in one forward pass of the large model costs little more than scoring one. A small, fast "draft model" proposes k tokens autoregressively, and the large target model then verifies all k of them in a single forward pass.
Draft tokens are accepted from left to right. At the first rejection, that token and everything after it are discarded, and the target model supplies a replacement token. This technique can provide 2-3x speedups without any degradation in output quality, as the acceptance rule mathematically preserves the target model's output distribution.
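The verification step can be sketched with the standard acceptance rule from the speculative sampling literature: a draft token x is accepted with probability min(1, p(x)/q(x)), where q is the draft model's distribution and p is the target's, and on rejection a replacement is drawn from the normalized positive part of p − q. The notes do not spell out this rule, so presenting it here is an assumption about what the lecture covered. The sketch also omits the extra "bonus" token that is usually sampled from the target when every draft token is accepted, and the distributions are hand-written rather than produced by models.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the tokens kept after the target model checks a draft.

    draft_probs[i] and target_probs[i] are dicts mapping each vocabulary
    token to its probability at draft position i.
    """
    kept = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p[token] / q[token]):
            kept.append(token)  # accepted: move on to the next draft token
            continue
        # First rejection: resample from the normalized residual max(0, p - q),
        # then drop every later draft token.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        total = sum(residual.values())
        tokens = list(residual)
        weights = [residual[t] / total for t in tokens]
        kept.append(rng.choices(tokens, weights=weights)[0])
        break
    return kept

rng = random.Random(0)

# Case 1: draft and target agree exactly, so every draft token is accepted.
same = {"the": 0.7, "a": 0.3}
all_kept = verify_draft(["the", "a"], [same, same], [same, same], rng)

# Case 2: the target gives the first draft token zero probability.
q = {"cat": 0.9, "dog": 0.1}
p = {"cat": 0.0, "dog": 1.0}
corrected = verify_draft(["cat", "sat"], [q, q], [p, p], rng)
print(all_kept, corrected)
```

In the first case both draft tokens survive, so two tokens are produced for one target-model pass. In the second case the draft is cut at the first position and replaced by the target's own choice, so the output is what the target model would have generated anyway.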


