One. Course Details
This is week ten of CS 336: Language Modeling from Scratch (AI Coachella) at Stanford University, building on Percy Liang's previous lecture on parallelism fundamentals. This deep dive covers the practical complexities of training trillion-parameter language models across massive GPU clusters, including the full spectrum of parallelization strategies and their real-world tradeoffs.
The instructor breaks down the evolution of parallelism from basic data parallelism to modern 4D parallelism, explains how hardware topologies shape parallelization choices, and provides concrete examples from recent frontier model training runs including Llama 3, DeepSeek V3, and Mistral. The lecture emphasizes systems-level optimizations and the critical balance between compute utilization, memory usage, and communication overhead.
The lecture covers:
-
Hardware topology differences between TPUs, GPUs, and custom accelerators
-
Collective communication primitives and their performance characteristics
-
Zero Redundancy Optimizer (ZeRO) and Fully Sharded Data Parallel (FSDP)
-
Pipeline parallelism and bubble reduction techniques
-
Tensor parallelism and sequence parallelism for activation memory reduction
-
Expert parallelism for Mixture of Experts (MoE) models
-
4D parallelism composition strategies
-
Real-world parallelism configurations from production models
-
Fault tolerance and reliability considerations at scale
Two. Key Learning Takeaways
-
Modern LLM training requires combining multiple parallelism strategies (4D parallelism) to fully utilize large clusters. No single strategy can address all bottlenecks of compute, memory, and communication.
-
FSDP (ZeRO Stage 3) provides near-free memory savings by sharding parameters, gradients, and optimizer states across GPUs while overlapping communication with computation to minimize overhead.
-
Pipeline parallelism is ideal for slow inter-node connections because it only requires point-to-point communication of activations rather than expensive all-to-all operations.
-
Tensor parallelism is extremely communication-heavy and should only be used within a single node connected by fast NVLink. It is the only strategy that effectively reduces activation memory.
-
Expert parallelism has largely replaced tensor parallelism for MoE models because it provides better GPU utilization and more efficient scaling of large feed-forward networks.
-
Activation memory often dwarfs parameter memory for large models with long sequence lengths, making sequence parallelism and activation recomputation essential optimizations.
-
The optimal parallelization strategy follows a simple hierarchy: maximize data parallelism first, use tensor/expert parallelism within nodes for fast interconnects, and use pipeline parallelism across nodes for slower connections.
Three. Course Gold Quotes
-
"The new unit of compute is not the GPU. It is the entire data center."
-
"There is no one strictly dominant parallelization strategy. It's all a whole bunch of tradeoffs that you somehow have to manage gracefully to get a good outcome."
-
"The remarkable thing about ZeRO is that stages 1 and 2 are literally free. They have exactly the same communication cost as naive data parallelism."
-
"Pipeline parallelism is terrible until you have no other choice. But it has the best communication properties of any parallelism strategy for slow links."
-
"Tensor parallelism is great if your network is fast enough. If it's not, it will absolutely destroy your performance."
-
"If you look at all frontier models, they all follow the same pattern: maximize data parallelism, keep tensor parallelism below 8, and use pipeline parallelism to scale across nodes."
-
"GPUs fail all the time. During Llama 3 405B training, GPUs failed 148 times. Parallelism is not just about speed—it's about reliability."
Four. Layered Learning Notes
Module 1: Hardware Foundations of Parallelism
-
Parallelism exists to solve two fundamental bottlenecks: insufficient compute on a single chip and insufficient memory to fit large models.
-
The critical distinction is between intra-node parallelism (fast NVLink connections within a single machine) and internode parallelism (slower Ethernet/Infiniband connections between machines).
-
Different hardware architectures favor different parallelism strategies:
-
TPUs: Use a toroidal mesh topology that excels at regular, neighbor-to-neighbor communication patterns like tensor parallelism. Google's new TPU v8i has switched to a more GPU-like tree topology to better support MoE workloads.
-
GPUs: Use a fat-tree topology that provides flexible all-to-all connectivity but becomes more expensive as clusters grow. This makes them ideal for data parallelism and MoE routing.
-
Custom accelerators: Huawei's Ascend 910 uses massive fiber optic switching to connect 384 chips in a single rack, trading higher power consumption for extreme scalability.
-
-
All parallelism strategies are implemented using collective communication primitives: all-reduce, reduce-scatter, all-gather, and point-to-point sends. The equivalence between all-reduce and reduce-scatter plus all-gather is the foundation of ZeRO.
Module 2: Data Parallelism and ZeRO/FSDP
-
Naive data parallelism replicates the entire model across all GPUs and splits the batch. It provides perfect compute scaling but no memory savings.
-
The memory problem in training is far worse than just storing parameters. For Adam optimizer, you need to store:
-
Model parameters (2 bytes per parameter for BF16)
-
Gradients (2 bytes per parameter)
-
First moment estimate (4 bytes per parameter)
-
Second moment estimate (4 bytes per parameter)
-
-
This totals 12 bytes per parameter, meaning a 7B model requires 84GB of memory just for training state—more than the 80GB available on an A100.
-
ZeRO Stage 1: Shards only the optimizer state across GPUs. Communication cost is identical to naive DDP (2× parameters per step) because reduce-scatter + all-gather = all-reduce.
-
ZeRO Stage 2: Shards both optimizer state and gradients. Uses incremental gradient communication during the backward pass to avoid materializing the full gradient vector. Still has the same communication cost as DDP.
-
ZeRO Stage 3 (FSDP): Shards everything—parameters, gradients, and optimizer states. Fetches parameters on demand during forward and backward passes and frees them immediately after use.
-
FSDP achieves near-zero overhead by overlapping communication with computation. While technically requiring an extra all-gather per layer, this is hidden behind the computation of the previous layer.
-
FSDP allows fitting models up to 50B parameters on a single A100 GPU, compared to less than 7B for naive DDP.
Module 3: Pipeline Parallelism
-
Pipeline parallelism splits the model vertically by layer, placing different layers on different GPUs.
-
The naive implementation has terrible utilization (the "bubble" problem) because only one GPU is active at a time.
-
The solution is to split the batch into micro-batches and pipeline their execution. Utilization approaches 100% as the number of micro-batches increases.
-
Advanced scheduling techniques further reduce bubbles:
-
Interleaving forward and backward passes of different micro-batches
-
Zero Bubble Pipelining: Separates weight gradient computation from activation gradient propagation. Weight gradients can be deferred to fill idle slots in the pipeline, eliminating almost all bubbles.
-
-
Pipeline parallelism has the lowest communication overhead of any model parallelism strategy. It only requires sending activations (batch × sequence × hidden dimension) between adjacent stages.
-
This makes it ideal for slow inter-node connections. Most large models use pipeline parallelism to scale across multiple machines.
Module 4: Tensor and Sequence Parallelism
-
Tensor parallelism splits individual matrix operations horizontally across GPUs. It is the only parallelism strategy that reduces activation memory.
-
Matrix multiplication can be split either column-wise or row-wise. Column-wise splits are used for input projections, while row-wise splits are used for output projections.
-
An all-reduce is required after each parallel operation to combine partial results. This makes tensor parallelism extremely communication-heavy.
-
Tensor parallelism should only be used within a single node connected by fast NVLink. Scaling beyond 8 GPUs with tensor parallelism results in severe performance degradation.
-
Sequence Parallelism: Extends tensor parallelism to split layer norms, dropouts, and residual connections along the sequence dimension.
-
Without sequence parallelism, these operations remain replicated across all GPUs, creating a memory floor that cannot be reduced by increasing tensor parallelism.
-
With tensor + sequence parallelism, activation memory scales perfectly linearly with the number of GPUs: 34 × s × b × h / t.
Module 5: Expert Parallelism
-
Expert parallelism is the standard parallelism strategy for MoE models. It splits individual experts across different GPUs.
-
Expert parallelism is preferred over tensor parallelism for MoEs because:
-
It preserves large matrix sizes, maintaining high GPU utilization
-
It naturally aligns with the sparse computation pattern of MoEs
-
It provides better memory scaling for large feed-forward networks
-
-
The main challenge of expert parallelism is efficient all-to-all routing of tokens to the appropriate experts. This requires extremely low-latency communication.
-
Frontier libraries like DeepSeek's DPP and NVIDIA's Hybrid EP use low-level GPU networking primitives and even undocumented PTX instructions to optimize routing performance.
-
Expert parallelism introduces additional complexity when combined with data parallelism. Modern implementations decouple expert and data parallel domains to allow independent scaling.
Module 6: 4D Parallelism Composition
-
Modern large-scale training combines four different parallelism strategies:
-
Data Parallelism: Maximizes compute utilization by splitting the batch across replicas
-
Tensor/Expert Parallelism: Reduces memory usage within a node using fast interconnects
-
Pipeline Parallelism: Scales across nodes using low-bandwidth-friendly communication
-
Sequence/Context Parallelism: Reduces activation memory for long sequences
-
-
The standard prescription for optimal parallelization:
-
Use tensor/expert parallelism up to the number of GPUs per node (typically 8)
-
Use pipeline parallelism to scale across additional nodes
-
Use data parallelism for all remaining GPUs
-
Add sequence/context parallelism for long sequence workloads
-
-
This strategy is universally followed by all frontier models:
-
Llama 3 405B: 8-way tensor parallel, 16-way pipeline parallel, 128-way data parallel
-
DeepSeek V3: 64-way expert parallel, 8-way pipeline parallel
-
Mistral 8x22B: 8-way expert parallel, 4-way pipeline parallel, 4-way tensor parallel
-
Wishing you all the best as you tackle the challenges of distributed training and build the next generation of large language models. May your GPUs stay cool, your networks stay fast, and your training runs complete without a single failure. The skills you've learned here will allow you to scale models to sizes that were unimaginable just a few years ago. Keep experimenting, keep optimizing, and never stop pushing the boundaries of what's possible. Happy training!


