Lecture Notes: Parallelism Fundamentals for Multi-GPU Language Model Training

One. Course Details

This is week eight of CS 336: Language Modeling from Scratch (AI Coachella) at Stanford University, building directly on the previous week's deep dive into single-GPU kernel optimization. This foundational lecture transitions from making individual GPUs fast to leveraging entire clusters of GPUs for training trillion-parameter language models.

The instructor provides a ground-up explanation of distributed computing principles, starting with the mathematical foundations of collective communication operations, moving through modern GPU networking hardware, and concluding with hands-on PyTorch implementations of the three core parallelization strategies. All examples use multi-layer perceptrons (MLPs) to isolate the core parallelism logic from the complexity of full transformer architectures. This material directly prepares students for assignment two, where they will implement their own distributed training system from scratch.

The lecture covers:

The two fundamental motivations for multi-GPU parallelism
Eight collective communication primitives and their properties
Hierarchical GPU networking topologies and performance characteristics
Remote Direct Memory Access (RDMA) and low-latency communication
NVIDIA Collective Communications Library (NCCL) architecture
PyTorch distributed programming model and API
Data Parallelism (DDP) implementation and tradeoffs
Tensor Parallelism for matrix operation decomposition
Pipeline Parallelism and micro-batching techniques
Practical guidelines for parallelism strategy selection

Two. Key Learning Takeaways

Multi-GPU parallelism solves two non-negotiable problems: models that exceed the memory capacity of a single GPU, and the need to reduce training time from years to months for state-of-the-art models.
Collective communication primitives are the universal building blocks of all distributed training systems. The critical identity all-reduce = reduce-scatter + all-gather underpins nearly all advanced parallelism techniques including FSDP and ZeRO.
Hardware topology is the primary driver of parallelism strategy. Tensor parallelism requires extremely high bandwidth and should only be used within a single NVLink domain, while pipeline parallelism can tolerate much slower inter-node connections.
Data Parallelism (DDP) is the most elegant and modular strategy. It requires no modifications to model architecture code, only a single all-reduce operation after the backward pass to average gradients across all workers.
Tensor Parallelism splits individual matrix operations across GPUs, reducing both parameter and activation memory linearly with parallelism size but incurring frequent communication overhead.
Pipeline Parallelism splits models vertically by layer, creating a sequential processing pipeline. Micro-batching is the standard technique to mitigate the "pipeline bubble" where most GPUs sit idle.
Overlapping communication with computation is the single most important optimization for achieving high parallel efficiency. All production distributed training libraries use asynchronous collective operations to hide latency behind useful computation.

Three. Course Gold Quotes

"The game is to orchestrate the computation to try to avoid data transfer bottlenecks. It's very easy to use a ton of GPUs, but it's hard to use them effectively."
"All reduce is equal to reduce scatter plus all gather. This is the most important identity in all of distributed training."
"Data parallelism is elegant because it treats the model as a black box. You just average the gradients after the backward pass."
"Tensor parallelism requires very fast interconnects. You never want to do tensor parallelism across a slow network."
"Pipeline bubbles are the biggest enemy of pipeline parallelism. Micro-batches are the best weapon we have against them."
"You can either recompute or store in memory. Or, in the distributed case, you can store on a different GPU."
"The new unit of compute is not the GPU. It is the entire data center."

Four. Layered Learning Notes

Module 1: The Hierarchy of Memory and Communication

Parallelism exists because we face fundamental physical limits in both compute density and memory capacity on individual chips. As models grow to trillions of parameters, they cannot fit on any single GPU, and even 7B parameter models would take impractically long to train on a single device.
The memory and communication hierarchy defines all performance tradeoffs in distributed training, ordered from fastest to slowest:
1. Registers and L1 cache (on-chip, ~100 TB/s)
2. L2 cache (on-chip, ~10 TB/s)
3. High Bandwidth Memory (HBM, off-chip, ~8 TB/s for B200)
4. NVLink (intra-node, ~1.8 TB/s for NVLink 5)
5. InfiniBand (inter-node, ~800 GB/s)
6. Ethernet (inter-pod, ~100-400 GB/s)
The core principle of all high-performance distributed systems is to minimize movement of data between slower levels of this hierarchy. This is a direct extension of the single-GPU optimization principles covered in the previous lecture.

Module 2: Collective Communication Primitives

Collective operations are standardized communication patterns that allow groups of devices to coordinate without explicit point-to-point messaging. They have been used in scientific parallel computing since the 1980s and remain the foundation of modern LLM training.
Basic foundational operations (primarily used for initialization and setup):
- Broadcast: Sends a single tensor from one designated rank to all other ranks. Used primarily for loading initial checkpoints and synchronizing random seeds.
- Scatter: Splits a tensor on one rank into equal chunks and sends each chunk to a different rank.
- Gather: Collects tensors from all ranks and concatenates them on a single target rank.
- Reduce: Applies an associative and commutative operation (sum, max, min) to tensors from all ranks and stores the result on a single rank.
The three core operations used in all LLM training pipelines:
- All-gather: Every rank collects tensors from all other ranks, resulting in all ranks having the full concatenated tensor. Used extensively in FSDP and tensor parallelism.
- Reduce-scatter: Applies a reduction operation to each dimension of the input tensors and scatters the results to different ranks. Used in gradient synchronization for advanced data parallelism techniques.
- All-reduce: Applies a reduction operation to tensors from all ranks and replicates the result to all ranks. Mathematically equivalent to reduce-scatter followed by all-gather. Used in DDP for gradient averaging.
- All-to-all: Each rank sends a different piece of data to every other rank. Used exclusively for Mixture of Experts (MoE) routing, where tokens are dynamically assigned to different experts.

Module 3: Hardware Networking Topologies

Modern GPU clusters are organized in a hierarchical topology that directly maps to the communication hierarchy:
- Intra-node: Typically 8 GPUs per server connected via NVLink and NVSwitch, providing full all-to-all connectivity at high bandwidth. NVIDIA's NVL72 extends this domain to 72 GPUs for premium configurations.
- Intra-pod: Multiple servers connected via InfiniBand switches, supporting RDMA for direct GPU-to-GPU communication without CPU intervention.
- Inter-pod: Large groups of pods connected via high-speed Ethernet, often using RoCE (RDMA over Converged Ethernet) to reduce latency.
RDMA is a critical technology that allows one GPU to directly read from or write to another GPU's memory without involving the CPU. This eliminates significant latency and CPU overhead associated with traditional network stacks.
Traditional Ethernet requires data to be copied from GPU memory to CPU memory, then to the network interface card, and back again at the destination. RDMA bypasses this entire path, resulting in order-of-magnitude lower latency.

Module 4: PyTorch Distributed Programming

PyTorch provides the torch.distributed library as a high-level interface to collective communication operations, with support for multiple backends optimized for different hardware:
- NCCL: NVIDIA's optimized backend for GPU communication, the industry standard for all LLM training.
- Gloo: CPU-based backend, used primarily for debugging on systems without GPUs.
Key concepts in distributed programming:
- Rank: A unique integer identifier for each process/device in the distributed system. In this course, each rank maps directly to one GPU.
- World size: The total number of processes/devices participating in the computation.
- Barrier: A synchronization primitive that blocks all processes until every process has reached the barrier point. Used to coordinate execution across asynchronous processes.
Collective operations can be executed synchronously or asynchronously. Asynchronous operations return immediately, allowing the program to perform useful computation while communication proceeds in the background. This is the foundation of communication-computation overlap.
Effective bandwidth is the standard metric for measuring communication performance, calculated as total bytes transferred divided by total wall-clock time. NCCL automatically optimizes collective operations based on the underlying hardware topology to maximize effective bandwidth.

Module 5: Data Parallelism (DDP)

Data Parallelism is the simplest and most widely used parallelism strategy. It works by replicating the entire model across all GPUs and splitting the training batch across workers.
The implementation is remarkably simple and requires minimal changes to existing training code:
1. Each GPU loads its own subset of the training batch.
2. Each GPU performs a forward pass and computes gradients locally.
3. An all-reduce operation averages the gradients across all GPUs.
4. Each GPU updates its local copy of the parameters using the averaged gradients.
Advantages:
- Completely modular and model-agnostic. Works with any neural network architecture without modification.
- Near-perfect linear scaling efficiency when the global batch size is large enough.
Disadvantages:
- Extremely memory inefficient. Each GPU must store the full model, gradients, and optimizer state.
- Limited by the critical batch size, beyond which increasing batch size no longer improves optimization performance.
The next lecture will cover Fully Sharded Data Parallelism (FSDP), which addresses the memory inefficiency of DDP by sharding all training state across GPUs while maintaining similar scaling properties.

Module 6: Tensor Parallelism

Tensor Parallelism splits individual matrix operations across multiple GPUs, allowing models to be scaled beyond the memory limits of a single device.
There are two fundamental ways to decompose a matrix multiplication:
- Column-wise parallelism: Split the weight matrix along its columns. Each GPU computes a portion of the output, and an all-gather operation combines the results.
- Row-wise parallelism: Split the weight matrix along its rows. An all-gather operation first combines the input activations, and each GPU computes its portion of the output.
Advantages:
- Reduces both parameter memory and activation memory linearly with the number of GPUs.
- No pipeline bubbles, so can achieve high utilization with sufficiently fast networking.
Disadvantages:
- Requires an all-reduce or all-gather operation after every layer, resulting in very high communication volume.
- Only practical within a single NVLink domain. Scaling beyond 8 GPUs with tensor parallelism results in severe performance degradation.
There is a natural duality between forward and backward passes: an all-gather in the forward pass corresponds to a reduce-scatter in the backward pass, and vice versa. This duality is exploited by all tensor parallelism implementations.

Module 7: Pipeline Parallelism

Pipeline Parallelism splits the model vertically by layer, assigning different layers to different GPUs. Activations are passed sequentially from one GPU to the next in the pipeline.
The naive implementation suffers from the pipeline bubble problem: only one GPU is active at any time, resulting in utilization as low as 1/N where N is the number of pipeline stages.
The standard solution is micro-batching: split the large global batch into many small micro-batches and pipeline their execution through the stages. Utilization approaches 100% as the number of micro-batches increases.
Advantages:
- Very low communication overhead. Only requires sending activations between adjacent stages.
- Can tolerate much slower network connections than tensor parallelism.
Disadvantages:
- Implementation is significantly more complex than data or tensor parallelism.
- Still requires large batch sizes to achieve high utilization.
The most important optimization for pipeline parallelism is overlapping communication with computation: start sending the output of one micro-batch to the next stage while still computing the next micro-batch.

Module 8: Practical Parallelism Strategy Selection

No single parallelism strategy is optimal for all scenarios. The best approach depends on your model size, hardware configuration, and batch size constraints.
General guidelines for strategy selection:
1. Use data parallelism as much as possible, as it is the simplest, most efficient, and most robust strategy.
2. Use tensor parallelism within a single node (up to 8 GPUs) to reduce memory usage when models no longer fit on a single GPU.
3. Use pipeline parallelism to scale across multiple nodes when tensor parallelism is no longer feasible.
4. Combine multiple strategies (3D parallelism) for the largest models.
The choice of parallelism strategy is always a tradeoff between memory usage, communication overhead, and implementation complexity.

Wishing you all the best as you dive into the fascinating world of distributed training and build your own multi-GPU systems. May your collective operations run smoothly, your pipelines stay bubble-free, and your GPUs remain fully utilized at all times. The skills you're mastering here will allow you to train models that were impossible just a few years ago. Keep experimenting, keep learning, and never stop pushing the boundaries of what's possible with parallel computing. Happy coding!

Video Source and Usage Instructions

Video Title: tanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 6: Kernels, Triton, XLA
• Course Series: Stanford CS336: Language Modeling from Scratch
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/xnDHaNUvHBg?si=Mo7KDx7LYjenRauA

Information About Website Advertising

This site is a non-profit educational sharing platform. The advertisements displayed on the pages are solely intended to cover basic operational costs such as server maintenance, bandwidth, and content upkeep. We do not generate any form of commercial profit from the video content, nor do we charge any fees for the original video content.

Copyright and Compliance Statement

1. We have preserved the original video in its entirety without making any modifications, edits, or alterations to the course content, in order to ensure the authenticity and integrity of the academic material.
2. All copyrights and intellectual property rights related to this video belong to the original author and Stanford. This repost strictly adheres to Creative Commons license and is intended solely for educational, research, and personal communication purposes.
3. If the original copyright holder believes this repost infringes upon your legitimate rights and interests, or if you have any objections to the operation of this site, please contact us through the website. We will remove the relevant content as soon as possible upon receiving notification.

1.If you have any questions, please email us.：[gwang4821@gmail.com]
2. You can also go directly to the Feedback Center,Feedback
3. We will address your feedback immediately upon receipt.