One. Course Details
This is the core architecture lecture in CS 336: Language Modeling from Scratch at Stanford University, designed to demystify the practical, empirically-driven design choices that define modern large language models. The instructor frames architecture not as a theoretical discipline, but as a collection of battle-tested conventions derived from thousands of training runs across dozens of state-of-the-art models.
The lecture is structured into four interconnected sections that build incrementally:
The first section covers core transformer architecture modifications, including the universal shift from post-norm to pre-norm designs, the adoption of RMSNorm over LayerNorm, and the near-ubiquitous use of gated linear units for feed-forward networks.
The second section explores hyperparameter engineering, presenting the consensus empirical rules of thumb that have emerged across hundreds of models, covering feed-forward dimension ratios, attention head configurations, depth-to-width aspect ratios, and vocabulary size tradeoffs.
The third section addresses training stability, a critical concern for expensive large-scale training runs, covering techniques like z-loss, QK normalization, and logit soft-capping that prevent catastrophic divergence during training.
The fourth section concludes with practical attention optimizations that balance performance and efficiency, including grouped query attention for faster inference and hybrid sliding window attention for long context support.
The instructor emphasizes that while the original transformer architecture has remained surprisingly stable, the cumulative effect of these small, empirically-derived improvements has resulted in order-of-magnitude gains in both training stability and inference efficiency. All concepts are illustrated with references to real-world models including Llama 2, Gemma 4, Qwen 3, DeepSeek v3, and PaLM.
Two. Key Learning Takeaways
The single most impactful architectural improvement over the original transformer is moving layer normalization outside the residual stream, which dramatically improves gradient propagation and allows for much deeper models.
RMSNorm has universally replaced LayerNorm in modern models not for representational reasons, but because it eliminates unnecessary memory operations that can account for up to 25% of runtime in memory-bound workloads.
Gated linear units (GLUs) like SwiGLU and GeGLU provide consistent performance gains over non-gated activations with minimal computational overhead, making them the standard choice for all modern feed-forward networks.
RoPE (Rotary Position Embedding) has emerged as the dominant position embedding scheme due to its true relative position invariance and excellent long context extrapolation properties.
Most transformer hyperparameters are surprisingly forgiving, with wide basins of good performance around the consensus values, allowing practitioners to focus on systems optimization rather than exhaustive hyperparameter search.
Weight decay in large language model training functions primarily as an optimization aid rather than a regularizer, improving convergence even in single-pass training regimes where overfitting is not a concern.
Training stability has become the primary architectural concern for large models, with techniques like QK norm and z-loss now standard to prevent the catastrophic gradient spikes that can waste millions of dollars in compute.
Grouped query attention (GQA) provides nearly the same performance as full multi-head attention while drastically reducing KV cache size and memory bandwidth requirements during inference.
Three. Course Gold Quotes
"Architecture has always been pretty inscrutable. We all wished we lived in a world where the only things you had to know were VC dimension or something very simple theoretical tools, but that's not really where we are."
"The best thing to do, better than listening to this lecture even, is for you to go out and train your own models and try different architectures. That's by far the best thing to do."
"There is one thing that everyone agrees on: the original transformer paper got almost everything right except where you put the layer norm."
"Keep your residual stream clean. That's the mantra of modern architecture design. You want gradients to propagate straight through without being distorted by normalization operations."
"Arithmetic intensity is everything. We want to keep our GPUs hot doing matrix multiplies, not wasting cycles moving tiny bits of memory back and forth."
"Weight decay is not a regularizer in language model training. It's an optimization intervention. That's the most counterintuitive thing you'll learn today."
"If you have stability issues, just throw a LayerNorm in there. It sounds ridiculous, but it actually works almost every time."
"The most successful architectures are not the most mathematically elegant ones. They are the ones that train stably and run fast on our hardware."
Four. Layered Learning Notes
Module 1: Layer Normalization: The Foundation of Modern Transformers
The single most important architectural evolution since the original transformer is the shift in where layer normalization is placed relative to the residual stream.
In the original 2017 transformer paper, layer normalization was placed inside the residual path: it was applied to the sum of the residual input and the attention or feed-forward output, that is, after the residual addition. This design, now called post-norm, required careful learning rate warmup and suffered from gradient attenuation as models grew deeper.
All modern models use pre-norm, where layer normalization is applied before each attention and feed-forward operation, outside the residual stream. This design preserves a clean identity path for gradients to propagate through the entire network during backpropagation, eliminating the need for aggressive warmup and allowing models to scale to hundreds of layers reliably.
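The difference between the two placements can be sketched in a few lines. This is a minimal illustration in plain Python, not a reference implementation: the normalization has no learned gain or bias, and `zero_sublayer` is a hypothetical stand-in for attention or the feed-forward network, chosen so that only the residual path is visible.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [p + q for p, q in zip(a, b)]

def post_norm_block(x, sublayer):
    # Post-norm: the norm sits on the residual path, after the addition.
    return layer_norm(add(x, sublayer(x)))

def pre_norm_block(x, sublayer):
    # Pre-norm: only the sublayer input is normalized; x itself passes through untouched.
    return add(x, sublayer(layer_norm(x)))

# Hypothetical sublayer that outputs zeros, to expose what happens to the residual path.
zero_sublayer = lambda h: [0.0] * len(h)

x = [1.0, 2.0, 4.0]
print(pre_norm_block(x, zero_sublayer))   # the input, unchanged
print(post_norm_block(x, zero_sublayer))  # a normalized copy, not the input
```

With the sublayer silenced, the pre-norm block is an exact identity, which is the "clean residual stream" property; the post-norm block still rewrites the signal at every layer.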
Recent models have further evolved this design with double norm configurations, adding additional layer normalizations after attention and feed-forward operations. While theoretically unnecessary, these additional norms provide significant stability benefits with minimal performance cost.
The second major evolution in normalization is the near-universal adoption of RMSNorm over the original LayerNorm. RMSNorm simplifies the normalization operation by removing the mean subtraction step and the additive bias, retaining only a rescaling by the root mean square of the activations together with a learned gain.
While LayerNorm is technically more expressive, extensive empirical testing has shown no meaningful performance difference between the two for language modeling tasks. However, RMSNorm is significantly faster in practice, as it eliminates a memory-bound reduction operation that can account for a surprising portion of runtime, especially in smaller models.
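To make the comparison concrete, here is a small sketch of both operations on a single vector, written in plain Python for readability. The function names and the `eps` default are illustrative assumptions; production code would use fused tensor kernels.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square and scale. No mean, no bias."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]

x = [1.0, -1.0, 2.0, -2.0]          # already zero-mean
ones, zeros = [1.0] * 4, [0.0] * 4
print(layer_norm(x, ones, zeros))
print(rms_norm(x, ones))            # matches LayerNorm here because the mean is zero
```

On zero-mean input the two agree, which gives some intuition for why dropping the mean statistic costs little in practice while saving a reduction and a parameter vector.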
A related convention that has emerged is the removal of bias terms from almost all linear layers and normalization operations. Like RMSNorm, this change is driven primarily by systems considerations: bias terms add memory overhead and require additional memory operations without providing measurable representational benefits for most language modeling workloads.
Module 2: Activation Functions: The Rise of Gated Linear Units
The evolution of activation functions in transformers has followed a clear trajectory toward increasingly expressive gated designs.
The original transformer used ReLU activation, which is simple and computationally efficient but suffers from the dying ReLU problem where neurons can become permanently inactive during training. The GPT family, from GPT-1 through GPT-3, improved on this with GeLU (Gaussian Error Linear Unit), which adds a smooth transition around zero and provides better gradient flow.
Nearly all modern language models now use some variant of Gated Linear Unit (GLU) for their feed-forward networks. GLUs split the feed-forward computation into two separate projections: one that computes the activation values and a second that computes a gate that modulates those values element-wise.
The two most common GLU variants are:
- SwiGLU: Uses Swish (x * sigmoid(x)) as the activation function, adopted by PaLM, Llama, and most Llama-derived models
- GeGLU: Uses GeLU as the activation function, preferred by Google for models like Gemma and T5 v1.1
When switching from a standard feed-forward network to a GLU-based design, practitioners typically shrink the intermediate dimension to 2/3 of its non-gated size (for example, from 4 times the model dimension to roughly 2.67 times) so that the three projection matrices of the GLU hold approximately the same number of parameters as the two matrices of the standard design. Even with this parameter adjustment, GLUs consistently outperform non-gated activations across model sizes and benchmarks.
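A SwiGLU feed-forward layer and the parameter-matching arithmetic can be sketched as follows. This is a plain-Python illustration under stated assumptions: the weight names `W_gate`, `W_up`, and `W_down` are illustrative labels, biases are omitted, and the dimensions in the example are hypothetical values picked so that 2/3 of the hidden size is a whole number.

```python
import math

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(W_down, hidden)

def ffn_params(d_model, d_ff, gated):
    """Weight count without biases: 2 matrices when non-gated, 3 when gated."""
    return (3 if gated else 2) * d_model * d_ff

d_model = 3072                                   # hypothetical model dimension
standard = ffn_params(d_model, 4 * d_model, gated=False)
gated = ffn_params(d_model, (8 * d_model) // 3, gated=True)
print(standard, gated)                           # equal parameter counts
```

The 8/3 multiplier is simply 4 times 2/3, which is where the 2.67 ratio quoted for GLU-based models comes from.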
Module 3: Position Embeddings: From Sine Waves to RoPE
Position embedding is one of the most active areas of transformer architecture research, as it directly determines a model's ability to handle long context.
The original transformer used fixed sine and cosine position embeddings, which encode absolute position through trigonometric functions of different frequencies. Early models like BERT and GPT-2 replaced these with learned absolute position embeddings, which provided slightly better performance but limited extrapolation beyond the training context length.
Google's T5 and DeepMind's Chinchilla models used relative position embeddings, which add position-dependent biases directly to the attention scores rather than modifying the token embeddings. This approach provides better relative position invariance but breaks the clean inner product structure of attention.
Rotary Position Embedding (RoPE) has emerged as the clear standard for all modern models. RoPE encodes position information by rotating token embeddings in the vector space according to their position.
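The core insight behind RoPE is that rotating two vectors by the same angle preserves their inner product. Because a query at position m is rotated by an angle proportional to m and a key at position n by an angle proportional to n, their inner product depends only on the angle difference, which is proportional to m - n. The attention score between any two tokens therefore depends only on their relative distance, not their absolute positions in the sequence. This property is a major reason RoPE is well suited to long context extension.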
RoPE works by splitting the embedding vector into pairs of dimensions and rotating each pair by an angle that depends on the position. Low-frequency rotations capture long-range dependencies, while high-frequency rotations capture fine-grained local structure.
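The relative-position property can be checked numerically with a short sketch. The version below pairs adjacent dimensions and uses a frequency base of 10000; both are assumptions made for illustration, since implementations differ in how they pair dimensions.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[i], x[i+1]) by an angle that grows with pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # early pairs rotate fast, later pairs slowly
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (3 tokens apart) at two very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 103), rope(k, 100))
print(near, far)   # equal up to floating-point error
```

Shifting both tokens by the same amount leaves the score unchanged, which is exactly the dependence on relative distance described above.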
Module 4: Hyperparameter Engineering: Consensus Rules of Thumb
While transformer hyperparameters were once the subject of extensive search, the community has converged on a set of robust default values that work well across almost all model sizes.
The feed-forward dimension ratio is the most consistent hyperparameter across models. For non-gated feed-forward networks, the standard ratio is 4:1, meaning the intermediate dimension is four times the model dimension. For GLU-based networks, this ratio is typically adjusted to 2.67:1 to account for the additional projection matrix, though Llama popularized a slightly higher 3.5:1 ratio that emphasizes feed-forward capacity.
For multi-head attention, the universal convention is to set the head dimension such that the product of the number of heads and the head dimension equals the model dimension. While this ratio is not theoretically required, it has proven remarkably robust across all model sizes and architectures.
The width-to-depth aspect ratio is another critical hyperparameter that controls the tradeoff between model expressiveness and systems efficiency. Most modern models cluster around an aspect ratio of approximately 100, meaning the model dimension is roughly 100 times the number of layers.
This ratio represents a balance between two competing considerations: deeper models are generally more expressive but difficult to parallelize efficiently, while wider models are easier to parallelize using tensor parallelism but may be less parameter-efficient.
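These rules of thumb are easy to check against a candidate configuration. The sketch below computes the three ratios discussed above; the `Config` class and the numbers in the example are hypothetical values at roughly the 7B-parameter scale, used only to show the arithmetic.

```python
from dataclasses import dataclass

@dataclass
class Config:
    d_model: int    # model (residual stream) dimension
    n_layers: int   # number of transformer blocks
    n_heads: int    # number of attention (query) heads
    d_ff: int       # feed-forward intermediate dimension

def summarize(cfg):
    """Compute the ratios that the consensus rules of thumb refer to."""
    return {
        "head_dim": cfg.d_model // cfg.n_heads,
        "heads_divide_d_model": cfg.d_model % cfg.n_heads == 0,
        "ffn_ratio": cfg.d_ff / cfg.d_model,          # about 4 non-gated, about 2.67 gated
        "aspect_ratio": cfg.d_model / cfg.n_layers,   # model dimension per layer
    }

# Hypothetical configuration for illustration only.
cfg = Config(d_model=4096, n_layers=32, n_heads=32, d_ff=11008)
for name, value in summarize(cfg).items():
    print(name, value)
```

For this example the feed-forward ratio is 2.6875, the head dimension is 128, and the aspect ratio is 128, all inside the ranges described in this module.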
Vocabulary size has trended significantly upward in recent years, with modern models typically using vocabularies between 100,000 and 200,000 tokens. Larger vocabularies reduce sequence length by encoding more information per token, which improves both training and inference efficiency, especially for multilingual models.
Module 5: Training Stability: Preventing Catastrophic Failure
As models have grown larger and training runs more expensive, training stability has become the single most important architectural concern. A single gradient spike halfway through a training run can waste millions of dollars in compute.
The primary sources of instability in transformer training are the softmax operations in the attention mechanism and the output layer. Softmax combines two numerically dangerous operations: exponentiation, which can blow up to infinity, and division, which can produce NaNs if the denominator becomes zero.
The z-loss trick is a simple but effective technique for stabilizing the output softmax. It adds an auxiliary loss term that penalizes the log normalizer of the softmax, keeping it close to zero and preventing numerical underflow or overflow.
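A minimal sketch of the z-loss term for a single prediction is shown below. The coefficient of 1e-4 follows the value reported for PaLM; the function names are illustrative, and a real training loop would compute this over batched tensors.

```python
import math

def log_sum_exp(logits):
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Return (total loss, cross-entropy, z-loss) for one token."""
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]        # standard cross-entropy
    z_loss = z_coeff * log_z ** 2      # pulls log Z toward zero
    return ce + z_loss, ce, z_loss

logits = [2.0, 1.0, 0.0]
shifted = [v + 10.0 for v in logits]   # same softmax, much larger log Z

print(cross_entropy_with_z_loss(logits, target=0))
print(cross_entropy_with_z_loss(shifted, target=0))
```

Adding a constant to every logit leaves the cross-entropy unchanged, so the ordinary loss gives the model no reason to keep logits small; the z-loss term is what penalizes that drift.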
QK normalization is another standard stability intervention that applies a normalization layer (LayerNorm or RMSNorm) to the queries and keys before they are multiplied in the attention operation. Because the normalized queries and keys have a controlled scale, the attention logits fed into the softmax stay within a bounded range, preventing them from growing large enough to saturate the softmax.
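The effect on a single attention logit can be sketched as follows. This simplified version uses RMSNorm without a learned gain, which is an assumption made for clarity; real implementations typically include a learned scale per head dimension.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMS-normalize a vector (no learned gain in this sketch)."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]

def attention_logit(q, k, qk_norm=False):
    """Scaled dot-product logit for one query/key pair, optionally with QK norm."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
big_q = [1000.0 * v for v in q]   # simulate a query whose norm has blown up

print(attention_logit(q, k), attention_logit(big_q, k))                              # grows 1000x
print(attention_logit(q, k, qk_norm=True), attention_logit(big_q, k, qk_norm=True))  # unchanged
```

Without a learned gain, the normalized logit can never exceed the square root of the head dimension in magnitude, no matter how large the raw activations become.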
For models that require extreme stability, logit soft-capping passes the attention logits through a scaled tanh function, cap * tanh(logit / cap), which smoothly bounds them between -cap and +cap. While this provides the strongest stability guarantees, it can slightly degrade model performance by preventing the model from expressing very confident attention patterns.
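Soft-capping is a one-line transformation, sketched here under the assumption of a cap value of 50, which is an illustrative choice rather than a recommendation.

```python
import math

def soft_cap(logits, cap=50.0):
    """Smoothly squash logits into the open interval (-cap, +cap) using tanh."""
    return [cap * math.tanh(v / cap) for v in logits]

logits = [0.5, 10.0, 1000.0, -1000.0]
print(soft_cap(logits))   # small values barely change, extreme values saturate near +/-50
```

Because tanh is close to the identity near zero, ordinary logits pass through almost unchanged, and only outliers are compressed. That same compression is what limits very sharp attention patterns.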
Module 6: Attention Optimizations: Balancing Performance and Efficiency
Two attention optimizations have become nearly universal in modern models: grouped query attention for inference efficiency and hybrid sliding window attention for long context support.
Grouped Query Attention (GQA) addresses the memory bandwidth bottleneck that plagues autoregressive inference. In standard multi-head attention, each attention head has its own key and value projection, resulting in a large KV cache that must be read from memory for every generation step.
GQA reduces the number of key and value heads while keeping the number of query heads the same. This drastically reduces the size of the KV cache and the associated memory bandwidth requirements, with minimal impact on model performance. A typical configuration uses 8 key-value heads for 32 query heads, providing most of the efficiency benefits of full multi-query attention while retaining nearly all the expressive power of multi-head attention.
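The memory saving is straightforward arithmetic. The sketch below estimates KV cache size and shows how query heads map onto shared key-value heads; the layer count, head dimension, sequence length, and 2-byte element size are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values (the leading 2) for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

def kv_head_for_query(q_head, n_q_heads, n_kv_heads):
    """Index of the key-value head shared by a given query head."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# Hypothetical model: 32 layers, 32 query heads of dimension 128, 4096-token context, 16-bit cache.
shape = dict(n_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(n_kv_heads=32, **shape)   # multi-head: one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)    # grouped-query: 4 query heads per KV head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)    # multi-query: a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(name, size / 2**20, "MiB")

print([kv_head_for_query(h, 32, 8) for h in range(8)])   # first eight query heads
```

With 8 key-value heads for 32 query heads, the cache is one quarter the size of the multi-head case, and every decoding step reads correspondingly less memory.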
Sliding window attention is a hybrid approach to long context that alternates between layers of full global attention and layers of local attention that can only attend to tokens within a fixed window. This design balances the expressive power of global attention with the efficiency of local attention, allowing models to handle very long contexts without the full O(n²) cost of global attention.
Several recent long-context models use a pattern where one out of every four layers uses full global attention, while the remaining three use sliding window attention. This structure allows local information to be aggregated and propagated globally through the periodic full attention layers.
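The interleaving and the two mask shapes can be sketched as follows. Placing the global layer last in each group of four is an assumption made for illustration, as are the function names and the tiny sequence length.

```python
def attention_mask(n, window=None):
    """Causal mask: token i may attend to token j if j <= i and, when a window is set, i - j < window."""
    return [[j <= i and (window is None or i - j < window) for j in range(n)]
            for i in range(n)]

def layer_pattern(n_layers, global_every=4):
    """Label each layer 'local' (sliding window) or 'global' (full attention)."""
    return ["global" if (layer + 1) % global_every == 0 else "local"
            for layer in range(n_layers)]

def attended_pairs(mask):
    """Count the query/key pairs that are actually computed."""
    return sum(sum(row) for row in mask)

print(layer_pattern(8))
print(attended_pairs(attention_mask(6)))            # full causal attention
print(attended_pairs(attention_mask(6, window=2)))  # sliding window of 2 tokens
```

The local mask grows linearly with sequence length for a fixed window, while the global mask grows quadratically, which is why reserving full attention for a minority of layers keeps long contexts affordable.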
Wishing you all the best as you apply these practical architecture lessons to your own language model projects. May your layer norms stay properly placed, your gradients flow smoothly, and your training runs complete without a single stability spike. The empirical knowledge you're building here is the foundation of every successful LLM deployment, and it will serve you well whether you're training small open source models or scaling to the largest frontier systems. Keep experimenting, keep measuring, and never underestimate the power of a well-chosen hyperparameter. Happy training!


