One. Course Details
This is the first lecture of Stanford University’s CME 295: Transformers and Large Language Models course, taught by twin brothers Afshine and Shervine Amidi. Both instructors share nearly identical academic and industry backgrounds: they graduated from Centrale Paris in France, pursued graduate degrees at MIT and Stanford respectively, and worked together at Uber, Google, and most recently Netflix, where they specialize in large language model development.
The course originated as an annual NLP workshop that the pair ran from 2021 to 2024. Following the explosive growth of interest in LLMs after the launch of ChatGPT in 2022, it was converted into an official Stanford course in spring 2024, and this is its second offering. The course’s dual mission is to teach students the underlying mechanics of transformers, the foundational architecture of all modern LLMs, and to explain how these models are trained and applied across industries.
The course is designed for anyone with an interest in generative AI, including aspiring research scientists, ML engineers building LLM-powered projects, and professionals from other fields looking to apply LLMs to their domains. The minimum prerequisites are basic machine learning foundations (understanding neural network training) and fundamental linear algebra (matrix multiplication).
Logistically, the 2-unit class meets every Friday from 3:30 to 5:20 PM. Students can take it for a letter grade or credit/no-credit. All lectures are recorded and posted online within 24 hours, along with slides and the full syllabus. Grading is based entirely on two exams: a midterm on October 24 and a final exam during the week of December 8, with no homework assignments. The course uses the instructors’ own textbook Transformer LLMs Super Study Guide and a free VIP cheat sheet available on GitHub in multiple languages.
Two. Key Learning Objectives
By the end of this lecture, students will be able to:
- Classify all major natural language processing tasks into three core categories and identify appropriate evaluation metrics for each.
- Compare and contrast the three primary tokenization approaches (word-level, subword-level, character-level) and explain their respective tradeoffs.
- Describe the limitations of one-hot encoding for word representation and explain how Word2vec learns meaningful embeddings through proxy tasks.
- Outline the architecture of recurrent neural networks (RNNs) and LSTMs, and analyze their fundamental limitations with long-range dependencies.
- Explain the core intuition behind the attention mechanism and how it addresses the long-range dependency problem that vanishing gradients cause in sequential models.
- Identify the key components of the original transformer architecture, including encoders, decoders, multi-head attention, and positional encodings.
- Walk through the end-to-end workflow of a transformer model performing a machine translation task.
Three. Memorable Course Quotes
"The field of NLP is full of abbreviations—I was completely scared of them when I started. By the end of this class, you’ll have a mental map of what every abbreviation means, and that’s how we’ll know we did a good job."
"One-hot encoding makes all words orthogonal to each other, which is exactly what we don’t want. We want similar words to have similar vectors, and unrelated words to be independent."
"RNNs have this fundamental problem: they forget things from the distant past. We call this the vanishing gradient problem, and it’s why they never worked well for very long sequences."
"Attention is all you need. That’s not just a catchy paper title—it’s the insight that revolutionized the entire field of natural language processing."
"Multi-head attention is like having multiple filters in a convolutional neural network. It lets the model learn different types of relationships between words at the same time."
Four. Detailed Lecture Notes
4.1 NLP Task Taxonomy
Natural Language Processing (NLP) is the field of AI focused on manipulating and understanding human text. All NLP tasks fall into three broad categories:
- Classification: Takes a single text input and predicts a single output label. Examples include sentiment analysis (positive/negative/neutral), intent detection, language identification, and topic modeling. Evaluation uses standard classification metrics: accuracy, precision, recall, and F1-score. Precision, recall, and F1-score matter most on imbalanced datasets, where accuracy alone can be misleading.
- Multi-classification: Takes a single text input and predicts multiple outputs, typically at the token level. The most common example is Named Entity Recognition (NER), which labels words as locations, people, times, or organizations. Other tasks include part-of-speech tagging and dependency parsing.
- Generation: Takes text input and produces variable-length text output. This is the most popular category today and includes machine translation, question answering, summarization, code generation, and creative writing. Traditional evaluation metrics include BLEU and ROUGE (which require reference texts), while modern approaches use reference-free LLM-based metrics. Perplexity, which measures how surprised the model is by its output, is also commonly used (lower is better).
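As a rough illustration of the metrics named above, the sketch below computes accuracy, precision, recall, and F1 on a small imbalanced toy dataset, plus perplexity from a list of per-token probabilities. The function names and the toy labels are assumptions made for this example, and perplexity is taken here as the exponential of the average negative log-probability per token.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def perplexity(token_probs):
    """exp(average negative log-probability) over the tokens of a sequence."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Toy imbalanced data (assumed): 9 negatives, 1 positive.
# A classifier that always predicts "negative" looks good on accuracy only.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print(accuracy(y_true, y_pred))             # 0.9
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)

# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

The always-negative classifier shows why accuracy on its own says little when one class dominates the data.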
4.2 Tokenization: The First Step in Text Processing
Since machine learning models only understand numbers, text must first be converted into numerical units called tokens. Tokenization is the process of splitting text into these units, and there are three primary approaches:
- Word-level tokenization: Splits text into individual words. Simple to implement but suffers from high out-of-vocabulary (OOV) rates and cannot leverage shared word roots (e.g., "bear" and "bears" are treated as completely separate tokens).
- Subword tokenization: Splits text into meaningful subword units (e.g., "unhappiness" becomes "un", "happi", "ness"). This is the standard approach used in virtually all modern LLMs. It balances low OOV rates with reasonable sequence lengths and leverages shared morphological roots.
- Character-level tokenization: Splits text into individual characters. Robust to misspellings and has zero OOV rate but produces extremely long sequences, making computation prohibitively expensive for most tasks.
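The three approaches can be compared on a toy scale with the sketch below. The subword function uses a greedy longest-match split against a small hand-written vocabulary, which is a simplification chosen for illustration: production tokenizers learn their vocabulary from data, and `TOY_VOCAB` and the function names are assumptions. Because the pieces are literal substrings, "unhappiness" splits into "un", "happi", "ness".

```python
def word_tokenize(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_tokenize(text):
    """Character-level: one token per character."""
    return list(text)

def subword_tokenize(word, vocab):
    """Greedy longest-match split of one word against a fixed vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["[UNK]"]  # no known piece covers this position
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary, written by hand for this example.
TOY_VOCAB = {"un", "happi", "ness", "bear", "s"}

print(word_tokenize("bear bears"))                 # ['bear', 'bears']
print(subword_tokenize("bears", TOY_VOCAB))        # ['bear', 's']
print(subword_tokenize("unhappiness", TOY_VOCAB))  # ['un', 'happi', 'ness']
print(len(char_tokenize("unhappiness")))           # 11
```

Note how "bear" and "bears" share a piece at the subword level, while the character-level split turns one word into eleven tokens.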
4.3 Word Representation: From One-Hot to Word2vec
Once text is tokenized, each token needs a numerical representation. The simplest approach is one-hot encoding, where each token is represented as a vector with a single 1 and all other entries 0. However, this approach is useless for measuring similarity between words, as all one-hot vectors are orthogonal.
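A short sketch makes the orthogonality point concrete: the dot product of any two distinct one-hot vectors is 0, so "bear" is exactly as unrelated to "bears" as it is to "book". The four-word vocabulary is an assumption made for this example.

```python
def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at the token's index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vocabulary.
vocab = ["bear", "bears", "teddy", "book"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

print(vectors["bears"])                       # [0, 1, 0, 0]
print(dot(vectors["bear"], vectors["bear"]))  # 1
print(dot(vectors["bear"], vectors["bears"])) # 0
print(dot(vectors["bear"], vectors["book"]))  # 0
```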
The breakthrough came in 2013 with Word2vec, which learns meaningful embeddings through proxy tasks that predict words from their context. There are two variants:
- Continuous Bag of Words (CBOW): Predicts a target word from its surrounding context words.
- Skip-gram: Predicts the surrounding context words from a single target word.
The key insight is that a model that can accurately predict a word from its context (or the context from a word) must have learned something meaningful about language. The learned embeddings capture semantic relationships (e.g., "king" - "man" + "woman" ≈ "queen"), and the idea of learned dense embeddings underlies modern NLP models.
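To show how the two proxy tasks differ, the sketch below builds the training examples each variant would see for one short sentence. It covers only the data preparation, not the neural network that learns the embeddings; the function name, the window size, and the example sentence are assumptions.

```python
def training_pairs(tokens, window=1):
    """Build CBOW and skip-gram training examples from a token list.

    CBOW examples are (context_words, target_word).
    Skip-gram examples are (target_word, one_context_word).
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

tokens = ["a", "cute", "teddy", "bear"]
cbow, skipgram = training_pairs(tokens, window=1)

print(cbow[1])       # (['a', 'teddy'], 'cute')
print(skipgram[:3])  # [('a', 'cute'), ('cute', 'a'), ('cute', 'teddy')]
```

CBOW produces one example per position with the whole context as input, while skip-gram produces one example per (target, context word) pair.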
4.4 Recurrent Neural Networks and Their Limitations
Word2vec produces static token embeddings that do not change based on context. To capture the sequential nature of text and learn contextual representations, researchers developed Recurrent Neural Networks (RNNs).
RNNs process text one token at a time, maintaining a hidden state that represents the meaning of the sequence processed so far. For classification tasks, the final hidden state is used as the sentence representation. For generation tasks, the hidden state is used to predict the next token in the sequence.
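The recurrence can be sketched with a deliberately tiny RNN whose input and hidden state are single numbers. Real RNNs use vectors and weight matrices; the scalar form, the tanh activation, and the weight values below are assumptions chosen to keep the example readable.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar RNN over a sequence and return every hidden state.

    Each step mixes the current input with the previous hidden state:
        h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Toy "token embeddings" as single numbers (assumed values).
states = rnn_forward([1.0, 0.5, -0.3], w_x=0.8, w_h=0.5, b=0.0)

# For classification, the last state summarizes the whole sequence.
final_state = states[-1]
print(len(states), final_state)
```

The loop is the important part: step t cannot start until step t-1 has finished, which is why the computation cannot be parallelized across the sequence.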
However, RNNs have a critical flaw: the vanishing gradient problem. When training on long sequences, gradients tend to shrink exponentially as they propagate back through time, making it very difficult for the model to learn long-range dependencies. Long Short-Term Memory (LSTM) networks were developed to mitigate this issue by adding a cell state that tracks important information over time, but they still struggle with very long sequences and are inherently sequential, making them slow to train on GPUs.
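The shrinking gradient can be observed numerically in a scalar tanh RNN. By the chain rule, the derivative of the last hidden state with respect to the initial one is a product of one factor per time step, and each factor here has magnitude at most `w_h`. The scalar model and the weight values are assumptions for illustration only.

```python
import math

def gradient_through_time(inputs, w_x, w_h, b=0.0):
    """Return |d h_T / d h_0| for a scalar RNN h_t = tanh(w_x*x_t + w_h*h_{t-1} + b)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        # d h_t / d h_{t-1} = w_h * (1 - h_t^2), and (1 - h_t^2) <= 1.
        grad *= w_h * (1.0 - h * h)
    return abs(grad)

grad_short = gradient_through_time([0.5] * 5, w_x=1.0, w_h=0.5)
grad_long = gradient_through_time([0.5] * 50, w_x=1.0, w_h=0.5)

print(grad_short)  # already small after 5 steps
print(grad_long)   # many orders of magnitude smaller after 50 steps
```

With a recurrent weight below 1 the product decays exponentially with sequence length, so the earliest tokens receive almost no learning signal.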
4.5 Attention Mechanism: The Breakthrough
The attention mechanism, introduced in 2014, solves the long-range dependency problem by allowing the model to create direct connections between any two tokens in a sequence. Instead of compressing the entire input into a single hidden state, the model can "attend" to different parts of the input when generating each output token.
The core idea of self-attention (used in transformers) is that the representation of each token is computed as a weighted sum over all tokens in the sequence, including the token itself. Each token is mapped through three learned projections, and the weights come from the similarity between one token's query and every token's key:
- Query: What the current token is looking for
- Key: What each token in the sequence offers
- Value: The actual content of each token
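Because any two tokens are connected directly rather than through a long chain of recurrent steps, this approach sidesteps the vanishing gradient problem that limits RNNs. It also allows all positions to be computed in parallel, which makes it well suited to GPU acceleration.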
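The query, key, and value roles can be sketched as scaled dot-product attention in plain Python. In a real transformer the queries, keys, and values come from learned projection matrices applied to the token embeddings; to stay short, this sketch takes them as given inputs, and the toy vectors are assumptions. The division by the square root of the key dimension follows the original transformer paper.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights): outputs[i] is the weighted sum of all value
    vectors, weighted by how well query i matches each key.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (assumed), used as queries, keys and values alike.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)

for row in weights:
    print([round(w, 3) for w in row])
```

Each query is handled independently of the others, so all rows can be computed at once; a matrix library would express the whole thing as two matrix multiplications and a row-wise softmax.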
4.6 Transformer Architecture Overview
The transformer architecture, introduced in the 2017 paper Attention Is All You Need, is built entirely on the self-attention mechanism. It consists of two main components: an encoder and a decoder.
- Encoder: Processes the input sequence and produces context-aware embeddings. Each encoder layer contains a multi-head self-attention layer and a feed-forward neural network, with residual connections and layer normalization around each component. Multi-head attention runs multiple self-attention computations in parallel with different projection matrices, allowing the model to learn different types of relationships between tokens.
- Decoder: Generates the output sequence one token at a time. Each decoder layer contains three components: a masked multi-head self-attention layer (which only attends to previously generated tokens), a cross-attention layer (which attends to the encoder’s output), and a feed-forward neural network.
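The masking in the decoder's self-attention can be illustrated with a causal mask applied before the softmax. This is a minimal sketch under stated assumptions: the helper names and the example scores are made up, and only a single row of attention scores is processed.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because a position can always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
scores = [0.2, 1.0, 3.0]  # assumed raw attention scores for one query

row0 = masked_softmax(scores, mask[0])  # first token sees only itself
row1 = masked_softmax(scores, mask[1])  # second token sees tokens 0 and 1

print(row0)  # [1.0, 0.0, 0.0]
print(row1)  # last entry is 0.0 even though its raw score was the largest
```

Setting the blocked scores to negative infinity before the softmax gives them exactly zero weight, so no information flows from tokens that have not been generated yet.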
Since self-attention is order-agnostic, transformers use positional encodings to add information about the position of each token in the sequence. The original paper uses sine and cosine functions to generate these encodings, which are added element-wise to the token embeddings.
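The sinusoidal scheme can be written out as follows, using the formulation from the original paper: even dimensions use a sine, odd dimensions use a cosine, and each pair of dimensions shares a frequency of 1 / 10000^(2i / d_model). The function names and the toy embedding values are assumptions.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings, one row of length d_model per position."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_positions(embeddings, pe):
    """Element-wise sum of token embeddings and positional encodings."""
    return [[e + p for e, p in zip(emb, pos)]
            for emb, pos in zip(embeddings, pe)]

pe = positional_encoding(num_positions=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]

# Two identical toy embeddings become distinguishable once positions are added.
embeddings = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_positions = add_positions(embeddings, pe)
print(with_positions[0] != with_positions[1])  # True
```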
4.7 End-to-End Transformer Example
For a machine translation task (e.g., English to French), the transformer workflow proceeds as follows:
- The input English sentence is tokenized, and each token is converted to an embedding.
- Positional encodings are added to the embeddings to preserve sequence order.
- The embeddings pass through multiple stacked encoder layers, producing context-aware encoder outputs.
- Decoding starts with a special beginning-of-sequence (BOS) token.
- The decoder processes the BOS token, using masked self-attention to attend to itself and cross-attention to attend to the encoder outputs.
- The final decoder output is projected to the vocabulary size and passed through a softmax layer to produce a probability distribution over the next token.
- The most probable token is selected and added to the output sequence. The decoder then repeats the previous steps with the extended sequence until it produces an end-of-sequence token.
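The decoding loop can be sketched as greedy decoding: start from BOS, turn the decoder's scores into probabilities, pick the most probable token, and repeat until an end-of-sequence token appears. There is no trained model here. `toy_decoder_logits` is a hypothetical stand-in that returns scripted scores, and the vocabulary, the special token strings, and the French output are all assumptions made for illustration.

```python
import math

BOS, EOS = "<bos>", "<eos>"
# Hypothetical toy target vocabulary.
VOCAB = [BOS, EOS, "un", "ours", "en", "peluche"]

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_decoder_logits(encoder_outputs, generated):
    """Hypothetical stand-in for a trained decoder: returns scripted scores.

    A real decoder would compute these from masked self-attention over
    `generated` and cross-attention over `encoder_outputs`.
    """
    script = ["un", "ours", "en", "peluche", EOS]
    step = len(generated) - 1  # tokens produced so far, not counting BOS
    target = script[min(step, len(script) - 1)]
    return [5.0 if tok == target else 0.0 for tok in VOCAB]

def greedy_decode(encoder_outputs, logits_fn, max_len=10):
    """Repeatedly pick the most probable next token until EOS or max_len."""
    generated = [BOS]
    while len(generated) < max_len:
        probs = softmax(logits_fn(encoder_outputs, generated))
        best = max(range(len(VOCAB)), key=probs.__getitem__)
        next_token = VOCAB[best]
        if next_token == EOS:
            break
        generated.append(next_token)
    return generated[1:]  # drop the BOS token

translation = greedy_decode(encoder_outputs=None, logits_fn=toy_decoder_logits)
print(translation)  # ['un', 'ours', 'en', 'peluche']
```

Always taking the single most probable token is only one possible selection rule; the loop structure stays the same if the selection step is swapped for another strategy.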


