One. Course Details
This is the advanced architecture lecture in CS 336: Language Modeling from Scratch at Stanford University, building directly on the previous week's foundational transformer lecture. The instructor transitions from basic transformer modifications to two of the most impactful architectural innovations driving modern large language models: linear-time attention alternatives for long context and mixture of experts (MoE) for improved hardware utilization.
The lecture is structured into two distinct, self-contained sections:
-
The first section addresses the quadratic complexity bottleneck of standard self-attention, covering the mathematical foundations of linear attention, its evolution into state space models like Mamba-2 and Gated Delta Net, and the DeepSeek Sparse Attention (DSA) approach.
-
The second section provides a comprehensive deep dive into mixture of experts, including core concepts, routing design space, training challenges and solutions, systems optimizations, and the evolution of industrial MoE architectures from early Google designs to modern DeepSeek and Qwen models.
The instructor emphasizes practical, battle-tested approaches that have been proven to work at scale in production systems, with extensive references to real-world models including DeepSeek v3.2, Qwen 3.5, Nemotron-3, Llama 4, and GLM-5. All concepts are framed in the context of the fundamental tradeoffs between expressiveness, compute cost, and memory efficiency that define modern LLM design.
Two. Key Learning Takeaways
-
Standard self-attention has inherent O(n²) complexity that becomes prohibitive for long context lengths, with attention cost quickly surpassing feed-forward cost as sequence length increases.
-
Linear attention leverages the associativity of matrix multiplication to reorder operations, transforming the O(n²) term into an O(n) term that scales gracefully to millions of tokens.
-
All modern linear-time attention architectures are hybrids that combine multiple linear attention layers with occasional full softmax attention layers to preserve expressiveness while maintaining efficiency.
-
Mixture of experts offers a nearly free gain in parameter count: models can hold significantly more total parameters while keeping active compute per forward pass roughly constant.
-
Expert collapse is the primary training challenge for MoEs: a small number of experts receive almost all tokens while the others remain unused. It is addressed primarily through heuristic auxiliary balancing losses.
-
TopK selection with auxiliary balancing losses has emerged as the standard solution for nondifferentiable routing problems, used in both MoEs and sparse attention architectures like DSA.
-
MoEs provide an additional axis of parallelization that complements data and model parallelism, making them uniquely suited for scaling to extremely large model sizes.
-
Shared experts have become a standard component of modern MoE designs, handling common token processing while allowing specialized experts to focus on niche patterns.
Three. Course Gold Quotes
-
"There's a clear rush by all the top LLM vendors to provide larger and larger context sizes to support more complex workloads."
-
"Constant factors really, really matter. It's very easy for theory-trained computer scientists to think only about big O notation, but FlashAttention proved that constant factor improvements can be transformative."
-
"Mixture of experts is just a more efficient MLP. You take your MLP, and someone gives you a better one that has more parameters for the same compute cost."
-
"The rich get richer effect is the core problem with MoE training. Experts that are chosen early get more gradient signal, become stronger, and end up taking all the tokens."
-
"You don't need to be scared of TopK selection. It turns out that if you just add a simple balancing loss, you can pump gradients straight through the nondifferentiable operation and it works surprisingly well."
-
"MoEs are here to stay. Every model past a certain size these days is an MoE, and that will be the case for the foreseeable future."
-
"The most successful architectures are not the most mathematically elegant ones. They are the ones that work well with our hardware and are easy to train at scale."
Four. Layered Learning Notes
Module 1: The Long Context Attention Bottleneck
The demand for longer context windows has exploded in recent years, with models now supporting hundreds of thousands and even millions of tokens. This creates a fundamental problem for the standard transformer architecture.
Standard self-attention computes an all-to-all interaction between every pair of tokens in the sequence, resulting in O(n²) time and memory complexity. For short sequences, the feed-forward network (FFN) dominates compute cost. However, as sequence length increases, attention cost grows quadratically and quickly becomes the dominant bottleneck.
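The crossover between FFN-dominated and attention-dominated compute can be estimated with a rough FLOP count. The sketch below is a back-of-the-envelope model, not a profile of any real system: it assumes a model width of 4096, a two-matrix FFN with 4x expansion, and it ignores the Q/K/V/output projections, softmax, and causal masking. The function names are made up for this illustration.

```python
def attention_flops(n: int, d: int) -> int:
    """FLOPs for QK^T and (scores)V in one layer: two n x n x d matmuls."""
    return 2 * (2 * n * n * d)


def ffn_flops(n: int, d: int, expansion: int = 4) -> int:
    """FLOPs for the up- and down-projection of a two-matrix FFN."""
    return 2 * (2 * n * d * expansion * d)


d = 4096  # assumed model width
for n in (1_024, 16_384, 131_072, 1_048_576):
    ratio = attention_flops(n, d) / ffn_flops(n, d)
    print(f"n={n:>9,}  attention/FFN FLOP ratio = {ratio:.2f}")
```

Under these assumptions the ratio is n / (4d), so the two costs match at n = 4d and attention dominates beyond that point.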
Two broad categories of solutions have emerged to address this problem:
-
Systems optimizations: Techniques like FlashAttention rearrange the attention computation to minimize memory transfer overhead, providing 2-4x speedups without changing the underlying algorithm. While extremely impactful, these do not address the fundamental quadratic complexity.
-
Architectural modifications: Approaches that change the attention mechanism itself to achieve linear or near-linear time complexity, allowing scaling to much longer sequence lengths.
The lecture focuses primarily on the second category, covering the most successful architectural approaches that have been proven to work at scale.
Module 2: Linear Time Attention: Mathematical Foundations
The core insight behind all linear attention approaches is the associativity of matrix multiplication. Standard attention is computed as:
Attention(Q, K, V) = softmax(QKᵀ)V
If we temporarily ignore the softmax normalization, we can reorder the operations:
(QKᵀ)V = Q(KᵀV)
This seemingly trivial reordering completely changes the complexity characteristics. The original formulation has O(n²d) complexity, where n is sequence length and d is hidden dimension. The reordered formulation has O(nd²) complexity. Since d is typically on the order of thousands while n can be millions, this is an enormous improvement for long sequences.
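The reordering can be checked numerically. The sketch below uses toy sizes chosen for illustration, drops the softmax as in the text, and applies no causal mask. It only confirms that both orderings give the same result while building very different intermediate matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 16  # assumed toy sizes: sequence length and head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

quadratic = (Q @ K.T) @ V   # intermediate is n x n: about 2 * n * n * d multiply-adds
linear = Q @ (K.T @ V)      # intermediate is d x d: about 2 * n * d * d multiply-adds

print("same result:", np.allclose(quadratic, linear))
print("multiply-adds:", 2 * n * n * d, "vs", 2 * n * d * d)
```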
This formulation also reveals a beautiful duality:
-
The dense matrix multiply form is highly parallel and ideal for training
-
The equivalent recurrent form maintains a fixed-size state and is ideal for inference
This duality allows linear attention models to get the best of both worlds: fast parallel training and efficient autoregressive inference.
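The duality can be illustrated with a small causal example. This is a minimal sketch with toy sizes and no softmax. The masked parallel form shown here still materializes an n x n matrix and is meant only to demonstrate equivalence; practical training kernels typically use chunked variants.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 8  # assumed toy sizes
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Parallel (training) form: mask the scores so token t only sees tokens <= t.
parallel_out = np.tril(Q @ K.T) @ V

# Recurrent (inference) form: carry a fixed d x d state, updated once per token.
state = np.zeros((d, d))
recurrent_out = np.empty((n, d))
for t in range(n):
    state += np.outer(K[t], V[t])      # s_t = s_{t-1} + k_t v_t^T
    recurrent_out[t] = Q[t] @ state    # y_t = q_t^T s_t

print("forms agree:", np.allclose(parallel_out, recurrent_out))
```

The recurrent state has a fixed d x d size no matter how many tokens have been processed, which is what makes autoregressive decoding cheap.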
Module 3: Evolution of Linear Attention Architectures
The basic linear attention formulation is too simple for high-performance models, but it provides the foundation for more advanced state space models.
Linear Attention + Hybrid Architectures
The simplest approach is to use a hybrid architecture that combines mostly linear attention layers with occasional full softmax attention layers. For example, the Minimax M1 model uses a 7:1 ratio of linear to full attention layers, achieving strong performance while maintaining near-linear scaling with context length. No fully linear attention model has yet matched the performance of full attention at scale, making hybrid architectures the current standard.
Mamba-2
Mamba-2 adds a single input-dependent gate to the basic linear attention recurrence:
sₜ = γₜ ⊙ sₜ₋₁ + kₜvₜᵀ
yₜ = qₜᵀsₜ + vₜ ⊙ dₜ
The gate γₜ controls how much of the previous state is carried forward, inspired by the forget gate in LSTMs. This simple addition significantly improves expressiveness while preserving the linear complexity and training/inference duality. Nemotron-3 uses a hybrid Mamba-2 and attention architecture to achieve excellent long context performance.
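A token-by-token version of the gated recurrence might look like the sketch below. It is not the Mamba-2 implementation: the gate is a scalar per token, and the way it is derived from the input (a sigmoid of a hypothetical projection `w_gate`) is an assumption made purely for illustration, as is the function name.

```python
import numpy as np


def gated_linear_attention(Q, K, V, gamma, skip):
    """Run s_t = gamma_t * s_{t-1} + k_t v_t^T and y_t = q_t^T s_t + v_t * d_t."""
    n, d_k = K.shape
    d_v = V.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.empty((n, d_v))
    for t in range(n):
        state = gamma[t] * state + np.outer(K[t], V[t])  # decay old state, add new pair
        out[t] = Q[t] @ state + V[t] * skip[t]           # read out plus skip term
    return out


rng = np.random.default_rng(0)
n, d_k, d_v = 32, 8, 8  # assumed toy sizes
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
w_gate = rng.standard_normal(d_v)              # hypothetical gate projection
gamma = 1.0 / (1.0 + np.exp(-(V @ w_gate)))    # input-dependent gate in (0, 1)
skip = np.zeros((n, d_v))                      # d_t set to zero for simplicity
Y = gated_linear_attention(Q, K, V, gamma, skip)
print(Y.shape)
```

Setting every gate to 1 recovers plain causal linear attention, while a gate of 0 makes each output depend on the current token only.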
Gated Delta Net
Gated Delta Net extends Mamba-2 with a second gate βₜ that controls how much new information is written to the state:
sₜ = γₜ ⊙ (I - βₜkₜkₜᵀ)sₜ₋₁ + βₜkₜvₜᵀ
The term (I - βₜkₜkₜᵀ) erases the information stored in the state along the direction of the current key before the new key-value pair is added, allowing for more precise state updates. Qwen 3.5 and Qwen Next use a 3:1 hybrid of Gated Delta Net and attention layers, delivering state-of-the-art open source performance with exceptional long context throughput.
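The effect of erasing before writing can be seen in a tiny example. The sketch assumes the delta-rule form of the update, sₜ = γₜ(I - βₜkₜkₜᵀ)sₜ₋₁ + βₜkₜvₜᵀ, with scalar gates and a unit-norm key. The function name and the toy vectors are illustrative only.

```python
import numpy as np


def gated_delta_step(state, k, v, gamma=1.0, beta=1.0):
    """One update s_t = gamma * (I - beta * k k^T) s_{t-1} + beta * k v^T (assumed form)."""
    erased = state - beta * np.outer(k, k @ state)  # (I - beta k k^T) s_{t-1}
    return gamma * erased + beta * np.outer(k, v)


d_k, d_v = 4, 3
k = np.array([1.0, 0.0, 0.0, 0.0])   # unit-norm key, written twice
v_old = np.array([1.0, 2.0, 3.0])
v_new = np.array([-1.0, 0.5, 4.0])

# Plain linear attention accumulates both values under the same key.
additive = np.outer(k, v_old) + np.outer(k, v_new)

# The delta rule overwrites the old value stored under that key.
delta = gated_delta_step(np.zeros((d_k, d_v)), k, v_old)
delta = gated_delta_step(delta, k, v_new)

print(k @ additive)  # reads back v_old + v_new
print(k @ delta)     # reads back v_new only
```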
Controlled studies show that hybrid architectures with up to 75% linear layers maintain performance comparable to full attention, while pure linear models suffer significant degradation on complex reasoning and long context tasks.
Module 4: DeepSeek Sparse Attention (DSA): Sparse Attention Alternative
DeepSeek Sparse Attention (DSA) represents a fundamentally different approach to reducing attention cost. Instead of modifying the attention mechanism itself, DSA uses a lightweight indexer to select a small subset of relevant tokens before performing full attention only on that subset.
The DSA pipeline works as follows:
-
Compute low-dimensional Q and K projections for all tokens
-
Calculate similarity scores between the query and all keys
-
Select the top-k most relevant tokens using these scores
-
Perform full standard attention only on the selected subset
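The four steps above can be sketched for a single query as follows. This is a simplified illustration rather than the DSA implementation: the indexer is reduced to a plain dot product in a low-dimensional space, and the projection matrices `W_q_idx` and `W_k_idx`, the sizes, and the function name are all hypothetical.

```python
import numpy as np


def sparse_attention_for_query(q, K, V, q_idx, K_idx, top_k):
    """Attend from one query to the top_k keys chosen by a low-dimensional indexer."""
    index_scores = K_idx @ q_idx                   # step 2: one cheap score per key
    keep = np.argsort(index_scores)[-top_k:]       # step 3: indices of the top_k keys
    logits = K[keep] @ q / np.sqrt(q.shape[0])     # step 4: full attention on the subset
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[keep], keep


rng = np.random.default_rng(0)
n, d, d_idx, top_k = 1024, 64, 8, 32          # assumed toy sizes
X = rng.standard_normal((n, d))               # token vectors, used as keys and values here
W_q_idx = rng.standard_normal((d, d_idx))     # hypothetical indexer projections (step 1)
W_k_idx = rng.standard_normal((d, d_idx))
q = X[-1]                                     # query for the last token
out, keep = sparse_attention_for_query(q, X, X, q @ W_q_idx, X @ W_k_idx, top_k)
print(out.shape, keep.shape)
```

The expensive softmax attention touches only `top_k` keys, while the indexer touches all n keys but in only `d_idx` dimensions.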
This approach has several key advantages:
-
It preserves the full expressiveness of standard softmax attention for the selected tokens
-
It can be added as a post-training step to existing pre-trained models without full retraining
-
It provides predictable performance that scales well with context length
DeepSeek v3.2 and GLM-5 both use DSA to achieve industry-leading long context performance. The indexer still scores every earlier token for each query, which is O(n) work per query and O(n²) over the full sequence, but its constant factors are extremely low, making DSA competitive with linear attention approaches in practice.
Module 5: Mixture of Experts: Core Concepts
Mixture of experts is an architectural innovation that replaces the standard transformer MLP layer with a collection of smaller "expert" MLPs and a router that sends each token to a subset of experts.
The core value proposition of MoEs is parameter efficiency:
-
A model with N experts, each the size of the dense MLP, has roughly N times the MLP parameters of an equivalent dense model
-
Only k experts are active per token, so per-token MLP compute scales with k rather than N and stays close to that of a dense model when k is small
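A rough parameter count makes the gap between total and active parameters concrete. The numbers below are assumptions chosen for illustration: each expert is taken to be a full-size copy of a two-matrix dense MLP, and router weights and biases are ignored.

```python
def mlp_params(d_model: int, d_hidden: int) -> int:
    """Weights in a two-matrix MLP (biases ignored)."""
    return 2 * d_model * d_hidden


d_model, d_hidden = 4096, 16384   # assumed dense MLP shape
num_experts, k = 64, 2            # assumed MoE configuration

dense = mlp_params(d_model, d_hidden)
total = num_experts * dense       # every expert is a full-size copy of the dense MLP
active = k * dense                # only the k selected experts run for a given token

print(f"dense MLP params : {dense:,}")
print(f"MoE total params : {total:,}")
print(f"MoE active params: {active:,}")
```

In this configuration the layer stores 64 times the dense MLP's weights while each token touches only 2 experts' worth of them.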
Numerous studies have consistently shown that for a fixed compute budget, MoEs outperform dense models on both training speed and final quality. This is because model performance scales more strongly with total parameters than with active compute.
MoEs also provide a natural additional axis of parallelization. Each expert can be placed on a separate GPU, allowing models to scale far beyond the limits of traditional model parallelism. This has made MoEs the architecture of choice for many of the largest frontier models, including Llama 4 and DeepSeek v3, and reportedly GPT-4o, whose architecture has not been publicly disclosed.
Module 6: MoE Routing Design Space
The router is the most critical component of any MoE architecture, responsible for assigning tokens to experts. Almost all modern MoEs use token-choice TopK routing, where each token selects its top-k highest-scoring experts.
Standard TopK Routing
The standard router is an extremely simple linear projection:
s = softmax(xWᵣ)
selected_experts = topk(s, k)
y = Σ(sᵢ * Expertᵢ(x)) for i in selected_experts
Despite its simplicity, this routing mechanism works surprisingly well in practice and has become the industry standard.
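For a single token, the routing rule can be sketched in a few lines. The experts here are toy single-matrix functions and all sizes are assumptions. The gate weights are used exactly as in the formula, without renormalizing over the selected experts, which some implementations do.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def moe_forward(x, W_r, experts, k):
    """Route one token x to its top-k experts and mix their outputs by router score."""
    s = softmax(x @ W_r)                   # s = softmax(x W_r)
    selected = np.argsort(s)[-k:]          # indices of the k highest scores
    y = sum(s[i] * experts[i](x) for i in selected)
    return y, selected


rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2                 # assumed toy sizes
W_r = rng.standard_normal((d, n_experts))
expert_weights = rng.standard_normal((n_experts, d, d))
# Toy experts: one matrix and a nonlinearity each (real experts are full MLPs).
experts = [lambda x, W=W: np.tanh(x @ W) for W in expert_weights]

x = rng.standard_normal(d)
y, selected = moe_forward(x, W_r, experts, k)
print("selected experts:", selected, "output shape:", y.shape)
```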
Shared Experts
DeepSeekMoE pioneered the now-universal concept of shared experts. In addition to the routed experts, a small number of experts are always active for all tokens. This handles common, general-purpose processing while allowing the routed experts to specialize in more niche patterns.
Ablations consistently show that shared experts improve performance across almost all benchmarks. Modern MoE designs typically use 1-2 shared experts combined with 16-64 routed experts.
Alternative Routing Approaches
While token-choice TopK is dominant, several alternative approaches have been explored:
-
Expert-choice routing: Experts select their favorite tokens
-
Hash-based routing: Tokens are assigned to experts based on a hash function
-
Reinforcement learning-based routing: RL is used to optimize routing decisions
-
Linear assignment routing: Globally optimal token-expert assignments are computed
None of these alternatives have yet matched the simplicity and performance of standard token-choice TopK routing at scale.
Module 7: MoE Training Challenges and Solutions
Training MoEs presents unique challenges that do not exist for dense models. The most significant of these is expert collapse (also called expert starvation).
Expert collapse occurs when a small number of experts become significantly better than others early in training. These experts receive more tokens, get more gradient signal, and become even stronger, creating a positive feedback loop that eventually results in almost all tokens being routed to just 2-3 experts while the rest remain completely unused.
The standard solution to expert collapse is the load balancing auxiliary loss:
L_balance = n * Σ(fᵢ * pᵢ)
Here n is the number of experts, fᵢ is the fraction of tokens routed to expert i, and pᵢ is the average router probability assigned to expert i across the batch. This loss penalizes popular experts, pushing the router to distribute tokens more evenly across all experts.
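The behavior of the balancing loss is easiest to see on two extreme cases. The sketch below assumes that fᵢ is measured over the top-k assignments in a batch and that pᵢ is the batch mean of the router probabilities. The function name and the toy batches are illustrative.

```python
import numpy as np


def load_balance_loss(router_probs, k=1):
    """n * sum_i f_i * p_i for router probabilities of shape (tokens, n)."""
    tokens, n = router_probs.shape
    top = np.argsort(router_probs, axis=1)[:, -k:]   # experts chosen per token
    counts = np.bincount(top.ravel(), minlength=n)
    f = counts / (tokens * k)                        # fraction of assignments per expert
    p = router_probs.mean(axis=0)                    # mean router probability per expert
    return n * float(np.sum(f * p))


n = 8
balanced = np.eye(n)[np.arange(64) % n]              # tokens spread evenly over experts
collapsed = np.eye(n)[np.zeros(64, dtype=int)]       # every token sent to expert 0

print("balanced :", load_balance_loss(balanced))
print("collapsed:", load_balance_loss(collapsed))
```

With perfectly even routing the loss is 1, and with every token on one expert it grows to n, so minimizing it pushes the router toward the balanced case.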
Additional training techniques include:
-
Device-level balancing losses to ensure even utilization across GPUs
-
Z-loss on the router softmax to improve numerical stability
-
Using higher precision (FP32) for router computations to prevent instability
-
Stochastic perturbations during early training to encourage exploration
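As one example from this list, a router z-loss can be sketched as the mean squared log-sum-exp of the router logits, which discourages the logits from growing large. The formulation, the FP32 cast, and the function name are assumptions for illustration. In practice the term would be scaled by a small coefficient and added to the training loss.

```python
import numpy as np


def router_z_loss(logits):
    """Mean squared log-sum-exp of router logits, shape (tokens, n_experts)."""
    logits = logits.astype(np.float32)               # keep router math in FP32
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))  # stable log-sum-exp
    return float(np.mean(lse ** 2))


rng = np.random.default_rng(0)
small_logits = rng.standard_normal((16, 8))          # assumed toy batch
large_logits = 10.0 * small_logits                   # same pattern, larger magnitude

print("z-loss, small logits:", router_z_loss(small_logits))
print("z-loss, large logits:", router_z_loss(large_logits))
```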
Ablations clearly demonstrate that removing the load balancing loss results in catastrophic expert collapse and significantly worse model performance.
Module 8: MoE Systems Optimization and Industrial Evolution
MoEs introduce unique systems challenges related to parallelization and communication. The standard approach is expert parallelism, where each expert is placed on a separate GPU. Tokens are dynamically routed to the appropriate GPU for processing.
Key systems optimizations for MoEs include:
-
Communication-computation overlap: Overlap token routing communication with local computation
-
Downprojection before communication: Reduce activation dimension before sending to other GPUs to minimize bandwidth usage
-
Block-sparse matrix multiplication: Leverage hardware-accelerated sparse operations for efficient expert computation
-
MegaBlocks: Open source library that reformulates expert computation as block-sparse operations, avoiding token dropping and padding in MoE training
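The dispatch pattern behind expert parallelism can be sketched in a single process: tokens are grouped by their assigned expert, each expert runs once on its contiguous batch, and the results are scattered back to the original order. The sketch assumes top-1 routing and experts that preserve the feature dimension. In a real system each group would be sent to the GPU holding that expert, a communication step omitted here.

```python
import numpy as np


def dispatch_and_combine(X, expert_ids, experts):
    """Group tokens by expert, run each expert on its batch, restore token order."""
    order = np.argsort(expert_ids, kind="stable")    # makes each expert's tokens contiguous
    counts = np.bincount(expert_ids, minlength=len(experts))
    out_sorted = np.empty_like(X)
    start = 0
    for e, count in enumerate(counts):
        if count:
            rows = order[start:start + count]
            out_sorted[start:start + count] = experts[e](X[rows])
        start += count
    out = np.empty_like(X)
    out[order] = out_sorted                          # undo the sort
    return out, counts


rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))                     # assumed toy batch of 10 tokens
expert_ids = rng.integers(0, 4, size=10)             # top-1 expert per token
# Toy experts that just scale their input, so the result is easy to verify.
experts = [lambda T, s=s: T * s for s in (1.0, 2.0, 3.0, 4.0)]

out, counts = dispatch_and_combine(X, expert_ids, experts)
print("tokens per expert:", counts)
```

The per-expert counts are exactly the load that a balancing loss tries to even out, since the busiest expert determines how long every device waits.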
The evolution of industrial MoE architectures shows a clear progression:
-
Early Google designs (GShard, Switch Transformer): Basic TopK routing with load balancing loss, no shared experts
-
DeepSeekMoE v1: Introduced fine-grained experts and shared experts, became the de facto standard
-
DeepSeekMoE v2: Added device-level balancing and communication optimizations
-
DeepSeekMoE v3: Introduced auxiliary-loss-free balancing and sigmoid gating
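The idea behind auxiliary-loss-free balancing can be sketched as a per-expert bias that is added to the routing scores for selection only, then nudged down for overloaded experts and up for underloaded ones. The code below is a simplified illustration under assumed settings (step size, batch, artificial skew toward one expert), not the DeepSeek implementation. The function names are hypothetical.

```python
import numpy as np


def route_with_bias(scores, bias, k):
    """Pick top-k experts per token using scores + bias (bias affects selection only)."""
    return np.argsort(scores + bias, axis=1)[:, -k:]


def update_bias(bias, selected, n_experts, step=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = np.bincount(selected.ravel(), minlength=n_experts)
    return bias - step * np.sign(load - load.mean())


rng = np.random.default_rng(0)
tokens, n_experts, k = 512, 8, 2                     # assumed toy sizes
logits = rng.standard_normal((tokens, n_experts))
logits[:, 0] += 2.0                                  # make expert 0 artificially popular
scores = 1.0 / (1.0 + np.exp(-logits))               # sigmoid gating scores

bias = np.zeros(n_experts)
initial_load = np.bincount(route_with_bias(scores, bias, k).ravel(), minlength=n_experts)
for _ in range(200):
    selected = route_with_bias(scores, bias, k)
    bias = update_bias(bias, selected, n_experts)
final_load = np.bincount(route_with_bias(scores, bias, k).ravel(), minlength=n_experts)

print("initial load:", initial_load)
print("final load  :", final_load)
```

Because the bias only changes which experts are selected, no balancing term enters the loss and no extra gradient flows into the model weights.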
Modern MoEs also incorporate additional innovations like multi-head latent attention (MLA) for reduced KV cache size and multi-token prediction for faster decoding.