One. Course Details
This is the opening lecture of the third edition of Stanford University's CS 336: Language Models from Scratch, taught by Percy Liang, Tatsu Hashimoto, Marcel, Herman, and Steven. The course is built around the uncompromising "from scratch" philosophy, which holds that true understanding of LLMs comes only from building every component yourself rather than relying on high-level abstractions.
The lecture opens with instructor introductions and a discussion of how the course has evolved over the past two years. This year's edition places increased emphasis on mixture of experts architectures, long-context modeling, and the latest developments in open-source LLMs from Meta, Mistral, DeepSeek, and Qwen. The instructors emphasize that while frontier closed-source models are now prohibitively expensive to replicate, the fundamental principles of LLM construction remain fully accessible.
The course is structured into five interconnected units, each corresponding to a hands-on assignment:
- Basics: Build a complete transformer language model from scratch, including tokenizer, architecture, optimizer, and training loop
- Systems: Optimize performance with custom kernels, distributed training, and inference techniques
- Scaling Laws: Learn to predict model performance at scale using small-scale experiments
- Data: Curate and process high-quality training datasets from raw web crawls
- Alignment: Implement preference learning algorithms like DPO and GRPO
All lectures are executable Python programs that demonstrate concepts through working code, and the course provides access to GPU compute through Modal for all students.
Two. Key Learning Takeaways
Abstractions are leaky, and relying solely on prompting or fine-tuning severely limits the design space available for fundamental LLM research.
Efficiency matters far more at scale: when a frontier training run costs hundreds of millions of dollars, even a 5% efficiency gain saves millions, and a run that takes twice as long can cost hundreds of millions more.
Small-scale experiments do not always translate to large-scale results, as both compute distribution and emergent capabilities change dramatically with model size.
There are three types of transferable knowledge in LLM engineering: mechanics (how things work), mindset (how to approach problems), and intuitions (what decisions work well).
The bitter lesson does not mean algorithms don't matter—it means algorithms that scale are what ultimately matter.
Tokenization is a necessary evil in modern LLMs, balancing compression efficiency with semantic representation despite its many annoying edge cases.
Byte Pair Encoding (BPE) has emerged as the standard tokenization algorithm, creating data-driven vocabularies that merge frequent character sequences into single tokens.
The open-source LLM ecosystem has matured dramatically, with models like Llama 3, DeepSeek, and Qwen now approaching the performance of closed-source alternatives.
Three. Course Gold Quotes
"Abstractions are great, but they are leaky. Sometimes you want to do something, and it just can't do it, and there's no recourse."
"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." (Noam Shazeer, on SwiGLU)
"The wrong interpretation of the bitter lesson is that scale is all that matters. The right interpretation is that algorithms that scale are what matter."
"If you're doing a small scale experiment, if your run takes twice as long, maybe you just wait twice as long. But if you're doing things at scale, that could be hundreds of millions of dollars."
"Every year, I'm hoping that I don't have to teach tokenization, because the dream is to really have an end-to-end way that directly operates on bytes."
"Data quality basically specifies how good your model is going to be. A lot of language model performance comes down to what you train on."
"Predictability is at least as important as optimality. You don't want to get surprised when you spend $100 million training a model."
Four. Layered Learning Notes
Module 1: Course Philosophy and Motivation
The core motivation for CS 336 is the growing disconnect between AI researchers and the underlying technology of LLMs. The instructors argue that while prompting and fine-tuning are powerful tools, they create a layer of abstraction that prevents researchers from exploring the full design space of language models.
The course addresses the challenge of industrialization in LLM development. Frontier models now cost hundreds of millions to billions of dollars to train, and companies no longer publish implementation details. However, the instructors emphasize that the fundamental principles of LLM construction remain the same across scales, and these principles are fully teachable.
A critical theme throughout the course is the difference between small-scale and large-scale behavior. Two key examples illustrate this:
- Compute distribution shifts dramatically with scale: at small scales, attention accounts for ~56% of FLOPs, while at 175B parameters, MLP layers dominate at 80% of FLOPs
- Emergent capabilities only appear above certain model sizes, meaning small-scale experiments may not reveal important phenomena
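The shift in compute distribution can be illustrated with a rough FLOP count. The sketch below is not the accounting used in the lecture: it assumes a standard transformer layer with an MLP hidden size of 4x the model dimension, counts only matrix multiplications in the forward pass, and groups the Q/K/V/output projections together with the attention scores. The function names and the two configurations (sequence length 2048, model dimensions 768 and 12288) are illustrative assumptions, so the percentages will not reproduce the figures quoted above. Only the trend is the point.

```python
def layer_flops(seq_len: int, d_model: int, mlp_ratio: int = 4) -> dict:
    """Approximate forward-pass matmul FLOPs for one transformer layer on one sequence.

    Assumptions (hypothetical, for illustration): dense attention, no causal
    masking savings, softmax/normalization/embedding costs ignored.
    """
    n, d = seq_len, d_model
    # Q, K, V and output projections: four (n x d) @ (d x d) matmuls, 2*n*d*d FLOPs each.
    projections = 4 * 2 * n * d * d
    # Attention scores Q @ K^T and weights @ V: two matmuls, 2*n*n*d FLOPs each.
    scores = 2 * 2 * n * n * d
    # MLP: (n x d) @ (d x 4d) followed by (n x 4d) @ (4d x d).
    mlp = 2 * 2 * n * d * (mlp_ratio * d)
    return {"attention": projections + scores, "mlp": mlp}


def flop_fractions(seq_len: int, d_model: int) -> dict:
    """Share of per-layer FLOPs spent in attention versus the MLP."""
    flops = layer_flops(seq_len, d_model)
    total = sum(flops.values())
    return {name: value / total for name, value in flops.items()}


small = flop_fractions(seq_len=2048, d_model=768)     # small-model-like width
large = flop_fractions(seq_len=2048, d_model=12288)   # width comparable to a 175B model

for label, frac in [("small", small), ("large", large)]:
    print(f"{label}: attention {frac['attention']:.1%}, mlp {frac['mlp']:.1%}")
```

Under these assumptions the attention-score term grows with sequence length while the projection and MLP terms grow with the square of the model dimension, so widening the model at a fixed context length moves the bulk of the compute into the MLP.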
The instructors stress that while we cannot train frontier models in the course, we can teach the transferable skills that apply at all scales.
Module 2: The Three Types of Transferable Knowledge
The course organizes LLM knowledge into three distinct categories that transfer across scales:
- Mechanics: The concrete, verifiable facts about how systems work, including how transformers operate, how model parallelism functions, and how kernels execute on GPUs. These are universal and do not change with scale.
- Mindset: The approach to building and optimizing LLMs, including profiling everything, benchmarking rigorously, and prioritizing efficiency above all else. This mindset is critical for successful large-scale training.
- Intuitions: The empirical knowledge about what design decisions work well, including which activation functions perform best, how to set learning rates, and what data mixtures yield good results. While some intuitions may not transfer perfectly across scales, many do, and they are developed through extensive experimentation.
The course focuses heavily on teaching mechanics and mindset, as these are the most reliably transferable. Intuitions are developed through the hands-on assignments and leaderboards.
Module 3: Course Structure and Assignments
The course consists of five intensive assignments that build incrementally to a complete LLM training pipeline:
- Assignment 1: Implement a BPE tokenizer, transformer architecture, AdamW optimizer, and full training loop. Students train models on TinyStories and OpenWebText and compete on a speed-perplexity leaderboard.
- Assignment 2: Write custom Triton kernels for performance-critical operations and implement distributed training across multiple GPUs. This assignment teaches students to squeeze maximum performance out of hardware.
- Assignment 3: Explore scaling laws by running small-scale experiments and predicting performance at larger scales. This assignment simulates the high-stakes environment of large-scale training where you only get one shot.
- Assignment 4: Process a raw web crawl into a high-quality training dataset through filtering, deduplication, and mixture optimization. This assignment exposes students to the "dirty work" of data curation that is critical for model performance.
- Assignment 5: Implement alignment algorithms like DPO and GRPO to train models that follow human preferences.
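To make the scaling-law idea in Assignment 3 concrete, here is a minimal sketch of fitting a power law to small-scale runs and extrapolating to a larger budget. It is not the assignment's method: it assumes loss follows a pure power law in compute, L = a * C^(-b), with no irreducible-loss term, and the data points are synthetic values generated from that formula, not course results. The function names are hypothetical.

```python
import math


def fit_power_law(compute: list[float], loss: list[float]) -> tuple[float, float]:
    """Fit loss = a * compute**(-b) by least squares in log-log space.

    Returns (a, b). Assumes a pure power law (an illustrative simplification).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope and intercept for log(loss) vs log(compute).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a: float, b: float, compute: float) -> float:
    """Extrapolate the fitted power law to a new compute budget."""
    return a * compute ** (-b)


# Synthetic "small-scale runs" (FLOPs, loss), generated from a = 20, b = 0.05.
small_compute = [1e15, 1e16, 1e17, 1e18]
small_loss = [20.0 * c ** (-0.05) for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
predicted = predict_loss(a, b, 1e21)  # extrapolate three orders of magnitude
print(f"a = {a:.3f}, b = {b:.4f}, predicted loss at 1e21 FLOPs = {predicted:.3f}")
```

Real runs are noisy and rarely follow a single clean power law, which is why the fit is only as trustworthy as the range of scales it was measured on.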
The instructors emphasize that the course is intentionally difficult and time-consuming, equivalent in workload to five CS 224n assignments. They advise students to carefully consider their course load before enrolling.
Module 4: Tokenization Fundamentals
Tokenization is the first step in any LLM pipeline, converting raw Unicode text into sequences of integers that the model can process. Despite its apparent simplicity, tokenization has far-reaching effects on model performance and behavior.
The lecture reviews three suboptimal tokenization approaches:
- Character-level tokenization: Treats each Unicode character as a token, resulting in very long sequences and inefficient modeling
- Byte-level tokenization: Converts text to UTF-8 bytes, resulting in a fixed 256-token vocabulary but even longer sequences
- Word-level tokenization: Splits text on whitespace, resulting in good compression but an unbounded vocabulary and unknown tokens
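The trade-offs among these three approaches can be seen on a single string. The following is a minimal sketch, not the lecture's code: the sample text and the helper name `bytes_per_token` are illustrative choices, and "compression" is measured here simply as UTF-8 bytes per token.

```python
# Sample text with ASCII, an emoji (U+1F30D), and two CJK characters (U+4F60, U+597D).
text = "Hello, \U0001F30D! \u4f60\u597d!"

# Character-level: one token per Unicode code point (vocabulary spans all of Unicode).
char_tokens = [ord(ch) for ch in text]

# Byte-level: one token per UTF-8 byte (vocabulary fixed at 256 values).
byte_tokens = list(text.encode("utf-8"))

# Word-level: split on whitespace (vocabulary is unbounded; unseen words are unknown).
word_tokens = text.split()


def bytes_per_token(text: str, tokens: list) -> float:
    """Compression ratio: how many UTF-8 bytes each token covers on average."""
    return len(text.encode("utf-8")) / len(tokens)


for name, tokens in [("char", char_tokens), ("byte", byte_tokens), ("word", word_tokens)]:
    print(f"{name}: {len(tokens)} tokens, {bytes_per_token(text, tokens):.2f} bytes/token")
```

Byte-level tokens always give a ratio of exactly 1, word-level tokens give the highest ratio, and character-level tokens fall in between but require token IDs far above 255 for non-ASCII text.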
Most modern LLMs use some variant of Byte Pair Encoding (BPE). The algorithm was originally developed for data compression by Philip Gage in 1994, was adapted to NLP for neural machine translation by Sennrich et al., and was adopted in byte-level form by GPT-2. BPE creates a vocabulary tailored to the training data by iteratively merging the most frequent adjacent token pair.
Key properties of BPE include:
- No unknown tokens: rare sequences are simply split into smaller units
- Adaptive compression: common sequences are represented as single tokens, while rare sequences are split into multiple tokens
- Round-trip guarantee: any text can be encoded and decoded back to the original without loss
Module 5: Byte Pair Encoding (BPE) Algorithm
The BPE training algorithm proceeds in three simple steps:
- Start with a byte-level vocabulary of 256 tokens, representing all possible byte values
- Count the frequency of every adjacent token pair in the training corpus
- Merge the most frequent pair into a new token, add it to the vocabulary, and repeat until the desired vocabulary size is reached
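The three steps above can be sketched in a few lines of Python. This is a deliberately naive version for illustration, not the lecture's implementation or an efficient one: it recounts every pair after each merge, skips any pre-tokenization step, and breaks ties by taking the first most-frequent pair encountered (an assumption). The names `train_bpe` and `merge` are hypothetical.

```python
from collections import Counter


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int):
    """Naive BPE training: returns (merges, vocab)."""
    # Step 1: start from raw UTF-8 bytes, so the base vocabulary is 0..255.
    ids = list(text.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}
    merges: dict[tuple[int, int], int] = {}
    for step in range(num_merges):
        # Step 2: count every adjacent pair in the current token sequence.
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        # Step 3: merge the most frequent pair (ties: first one seen, by assumption).
        pair = max(counts, key=counts.get)
        new_id = 256 + step
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        ids = merge(ids, pair, new_id)
    return merges, vocab


merges, vocab = train_bpe("the cat in the hat", num_merges=3)
for pair, new_id in merges.items():
    print(pair, "->", new_id, vocab[new_id])
```

Recounting all pairs on every iteration makes this quadratic in practice; a fast implementation would update pair counts incrementally, which is where most of the optimization effort goes.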
Once trained, encoding new text with BPE involves applying the learned merges in order to the input byte sequence. Decoding is simply the reverse process, converting each token back to its corresponding byte sequence and concatenating the results.
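Encoding and decoding can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's code: the two merges are hard-coded as if they had been learned earlier (the pairs for b"th" and b"the"), the function names are hypothetical, and encoding simply replays each merge in training order over the whole input, which is simple but slow.

```python
def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


# Hypothetical learned merges, in training order: "t"+"h" -> 256, then 256+"e" -> 257.
merges = {(116, 104): 256, (256, 101): 257}

# Vocabulary: the 256 base bytes plus the byte string each merged token stands for.
vocab = {i: bytes([i]) for i in range(256)}
for (left, right), new_id in merges.items():
    vocab[new_id] = vocab[left] + vocab[right]


def encode(text: str) -> list[int]:
    """Convert text to bytes, then apply each merge in the order it was learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion (training) order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids: list[int]) -> str:
    """Map each token back to its bytes, concatenate, and decode as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8")


ids = encode("the theme")
print(ids)
print(decode(ids))

# Text never seen in training still round-trips, because it falls back to raw bytes.
unseen = "na\u00efve \U0001F30D"
assert decode(encode(unseen)) == unseen
```

Because every input reduces to bytes before any merge is applied, there is no unknown token, and decoding an encoded string always recovers the original text.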
The lecture includes a complete working implementation of BPE in Python, noting that while the basic algorithm is simple, optimizing it for speed is non-trivial. Assignment 1 challenges students to implement a fast BPE tokenizer that can process large corpora efficiently.
The instructors conclude by noting that while everyone hopes to eliminate tokenization in favor of end-to-end byte-level models, BPE remains the standard for all frontier models today. Any future replacement will still need to provide adaptive compression and semantic abstraction, the two core functions of tokenization.
Wishing you all the best as you embark on this journey to build language models from scratch. May your tokenizers round trip perfectly, your gradients flow smoothly, and your training runs converge without a single NaN. May your kernels run fast, your scaling laws predict accurately, and your data be clean and high-quality. The skills you are building here are the foundation of modern AI, and they will empower you to push the boundaries of what language models can do. Keep building, keep experimenting, and never stop asking how things work under the hood. Happy coding!


