One. Course Details
This is the opening lecture of the systems track in CS 336: Language Modeling from Scratch at Stanford University, transitioning from neural network architecture fundamentals to the practical systems work that powers modern large language models. The instructor frames systems as the most logical and satisfying part of the stack, where every performance improvement follows clear, explainable steps rather than feeling like magic.
The lecture is structured into three distinct parts that build incrementally:
-
A ground-up introduction to GPU hardware architecture and programming models, designed for students with little to no prior GPU programming experience
-
Six core optimization techniques that form the foundation of all fast GPU code for machine learning
-
A deep dive into FlashAttention, demonstrating how all the previous concepts come together to create one of the most impactful optimizations in recent LLM history
The instructor provides curated resources for further learning, including Horace He's technical blogs, the GPU Mode enthusiast community, and Google's comprehensive TPU/GPU book. All concepts are directly tied to upcoming course assignments, where students will implement their own optimized GPU kernels. The overarching promise of the lecture is that by the end, students will be able to explain seemingly mysterious GPU performance behaviors, such as why matrix multiplication throughput varies wildly with minor changes in dimension size.
Two. Key Learning Takeaways
-
GPUs and CPUs are optimized for fundamentally different workloads: CPUs prioritize low-latency serial execution with complex control flow, while GPUs maximize throughput through massive parallelism of simple, identical operations.
-
The memory hierarchy defines all modern LLM optimization: Compute throughput has grown exponentially faster than memory bandwidth, making data movement the primary bottleneck in almost all workloads.
-
Matrix multiplication is the only privileged operation on modern GPUs: Tensor cores deliver 10x higher throughput than any other floating-point operation, making dense matrix math the foundation of all scalable model architectures.
-
All six core GPU optimization techniques revolve around the same central goal: minimizing unnecessary data movement between slow global memory and fast on-chip memory.
-
Matrix dimension alignment has a disproportionate impact on performance: Even a single extra dimension can reduce throughput by 50% or more due to tiling inefficiencies and wave quantization effects.
-
FlashAttention achieves its revolutionary performance gains through three simple but clever applications of basic principles: tiling to reduce global memory access, online softmax computation to avoid global synchronization, and activation recomputation to save memory.
-
Control divergence is a silent killer of GPU utilization: Conditional branches force threads in a warp to execute both paths sequentially, leaving half the hardware idle at any given time.
Three. Course Gold Quotes
-
"Compute is the currency by which we operate in language modeling. Faster hardware, better utilization, more chips, and improved parallelization all drive progress."
-
"Modern hardware and LLM optimization is really defined by the memory. Everything we do has to respect the memory hierarchy."
-
"Matrix multiplies are the one privileged operation in machine learning. There's a gigantic gap in throughput between matrix operations and everything else."
-
"The game for the last few years has been how to bring more compute to bear in training models—and in the last year, how to bring more compute to bear in inference."
-
"You don't want to cargo cult 32 multipliers for your matrices. You want to really understand why we do those things, all the way down to the hardware level."
-
"FlashAttention is just tiling and recomputation done cleverly. There's no magic—just applying the basic principles we've learned really well."
-
"As we go further into the future, memory and communication bottlenecks will only get worse. It will take more and more work to fully utilize the hardware we have."
Four. Layered Learning Notes
Module 1: CPU vs GPU: Fundamental Design Philosophies
The core difference between CPUs and GPUs lies in their design priorities, which evolved to solve very different problems.
CPUs are optimized for general-purpose serial execution. They feature large, complex control units that handle sophisticated branching and out-of-order execution, paired with a small number of powerful arithmetic logic units (ALUs). The primary goal is to minimize the latency of individual instructions, completing each task as quickly as possible.
GPUs, by contrast, are throughput-optimized devices. They sacrifice individual instruction latency to maximize the total number of operations executed per second. This is achieved through hundreds of lightweight streaming multiprocessors (SMs) that run thousands of threads in parallel. The GPU programming model follows Single Instruction, Multiple Threads (SIMT), where all threads in a warp (a group of 32 consecutive threads) execute exactly the same instruction on different data.
This fundamental difference explains why GPUs revolutionized machine learning. Neural network training consists almost entirely of identical mathematical operations applied to large datasets—exactly the workload GPUs were designed to accelerate.
Module 2: GPU Hardware Architecture Deep Dive
The basic building block of any modern GPU is the streaming multiprocessor (SM). An A100 GPU contains 128 independent SMs, each with its own execution units, registers, L1 cache, and shared memory. All SMs share a common L2 cache and connect to high-bandwidth memory (HBM) off-chip.
The memory hierarchy is the most critical aspect of GPU performance, ordered from fastest to slowest:
-
Registers: Fastest memory, private to each thread, with single-cycle access latency
-
L1 cache / Shared memory: On-chip memory shared by all threads in a block, ~20-30 cycle latency
-
L2 cache: Shared across all SMs, ~200 cycle latency
-
HBM (Global memory): Off-chip DRAM, ~300-500 cycle latency, but with extremely high bandwidth
Shared memory is a programmable scratchpad that developers explicitly control, unlike caches which operate automatically. This makes it the most powerful tool for optimizing memory-bound workloads.
The lecture also compares GPUs to TPUs, noting their convergent evolution toward similar architectures. Both rely on a hierarchy of fast and slow memory paired with dedicated matrix multiplication units. The primary difference is naming convention: TPUs call their processing units "tensor cores," while GPUs reserve that name for their matrix multiplication accelerators—a common source of confusion.
Module 3: The Growing Compute-Memory Gap
Dennard scaling, which drove decades of performance improvements through smaller transistors and faster clock speeds, ended in the early 2000s. Since then, all performance gains have come from parallel scaling rather than faster individual instructions.
GPU compute throughput has grown super-exponentially since the introduction of tensor cores in the V100 generation. However, memory bandwidth and interconnect speeds have grown much more slowly. This creates an ever-widening gap between how fast we can compute and how fast we can feed data to the compute units.
The practical implication is that almost all modern LLM workloads are memory-bound rather than compute-bound. Even the most optimized matrix multiplication kernels spend most of their time waiting for data to arrive from HBM. This is why every major optimization technique focuses on reducing data movement rather than speeding up computation itself.
Module 4: Six Core GPU Optimization Techniques
The instructor presents six foundational techniques that form the toolkit of any GPU performance engineer.
one. Avoid Control DivergenceUnder the SIMT model, all threads in a warp execute the same instruction. If threads take different branches of an if statement, the warp must execute both branches sequentially, masking out threads that are not on the current path. This can cut utilization in half or worse. The standard solution is to replace conditional branches with arithmetic operations using masks whenever possible.
two. Low-Precision ComputationReducing the number of bits used to represent numbers is the most straightforward way to reduce memory bandwidth usage. The industry has progressed from FP32 to BF16 as the standard training format, with FP8 now becoming mainstream for both training and inference.
Advanced formats like MXFP8 use block-level scaling factors to maintain numerical stability at 8 bits. However, these formats introduce new complexities, such as the need to store both a matrix and its transpose to avoid expensive requantization operations. The frontier is now pushing to FP4, though this remains experimental for large-scale training.
three. Operator FusionNaive implementations of neural networks execute each operation as a separate kernel, writing intermediate results back to global memory after every step. Operator fusion combines multiple operations into a single kernel, reading input once, performing all computations on-chip, and writing the final result only once. This can reduce memory traffic by an order of magnitude for simple element-wise operations. Modern compilers like TorchScript and XLA automatically fuse simple operations, though more complex fusions still require manual intervention.
four. RecomputationActivation recomputation trades compute for memory by discarding intermediate activations during the forward pass and recalculating them during the backward pass. This eliminates the need to store large activation tensors in global memory, which is particularly valuable for deep models with long sequences. While it increases total computation by roughly 20-30%, it often results in net performance gains by reducing memory pressure and improving cache utilization.
five. Coalesced Memory AccessesDRAM memory is organized into burst sections of typically 128 bytes. When accessing memory, the GPU reads an entire burst section at once, even if only a single byte is requested. Coalesced access occurs when all threads in a warp access consecutive memory addresses, fully utilizing the entire burst. Strided or scattered access patterns waste most of the memory bandwidth, as each thread triggers a separate burst read.
six. TilingTiling is the single most impactful optimization technique for matrix multiplication and other dense operations. The idea is to break large matrices into smaller submatrices (tiles) that fit entirely in shared memory. Each tile is loaded from global memory once, then reused repeatedly for multiple computations. This reduces global memory accesses by a factor equal to the tile size, dramatically improving arithmetic intensity.
Tiling introduces its own complexities, however. Optimal tile size depends on the specific GPU architecture, shared memory size, and matrix dimensions. The max-autotune flag in PyTorch benchmarks different tile sizes to find the fastest configuration for a given workload.
Module 5: Demystifying the Matrix Multiply Performance Plot
The lecture opens with a mysterious plot showing matrix multiplication throughput varying wildly with dimension size, featuring sharp drops and periodic patterns. The instructor breaks down the causes of these behaviors using the concepts covered earlier.
The general upward trend follows the classic roofline model. For small matrices, the workload is memory-bound, and throughput increases with matrix size as arithmetic intensity improves. For large matrices, the workload becomes compute-bound, and throughput plateaus at the maximum theoretical performance of the GPU.
The different colored lines correspond to matrices with different divisibility properties. Matrices whose dimensions are multiples of 32 or 64 achieve much higher throughput because they align perfectly with burst sections and tile boundaries. Matrices with prime dimensions suffer from severe tiling inefficiencies and uncoalesced memory accesses.
The sharp, sudden drops in performance are caused by wave quantization. The A100 has 108 SMs, so any workload that requires 109 tiles will need two full waves of execution, leaving most SMs idle during the second wave. This explains why increasing the matrix dimension by just one can cut throughput in half.
Module 6: FlashAttention: Putting It All Together
FlashAttention is presented as the ultimate demonstration of how basic GPU optimization principles can be combined to create revolutionary performance gains. The naive attention implementation has O(n²) memory complexity and makes repeated trips to global memory, making it extremely slow for long sequences.
FlashAttention solves this through three key innovations:
-
Tiled QKV multiplication: The query, key, and value matrices are split into tiles that fit in shared memory. Attention scores are computed tile by tile, eliminating the need to store the full O(n²) attention matrix in global memory.
-
Online softmax computation: A mathematically equivalent formulation of softmax allows it to be computed incrementally tile by tile, keeping track of running sums and maximum values to maintain numerical stability.
-
Activation recomputation: Intermediate attention scores are discarded after the forward pass and recomputed during the backward pass, further reducing memory usage.
All operations are fused into a single kernel, eliminating almost all intermediate global memory writes. The result is a 2-4x speedup over naive attention and support for much longer sequence lengths.
Wishing you all the best as you dive into the exciting world of GPU programming and systems optimization. May your kernels run at peak utilization, your memory accesses stay perfectly coalesced, and your matrix dimensions always align to powers of 32. The skills you're learning here are the foundation of every modern large language model, and they will serve you well whether you're training trillion-parameter models or building efficient inference systems. Keep experimenting, keep asking questions, and never stop digging into the hardware that makes it all possible. Happy coding!


