One. Course Details
This is the second lecture in Stanford University's CS 336: Language Modeling from Scratch, focused entirely on the systems side of large language model training. The instructor opens with exciting news: the Marin project's 1e23 FLOPs training run has completed successfully, matching the scaling law prediction within 0.05 loss, validating the predictive power of scaling laws for model performance.
The lecture is structured to build a foundational understanding of resource accounting, the critical skill of estimating compute and memory requirements for any model training job. The instructor emphasizes that the core goal of the course is to train the best possible model given finite resources, making computational efficiency the primary optimization target.
The lecture progresses from first principles to practical back-of-the-envelope calculations, covering tensor fundamentals, numerical precision tradeoffs, FLOPs counting, arithmetic intensity and roofline analysis, training memory breakdown, and common memory optimization techniques. All concepts are illustrated with concrete examples using H100 GPUs and real-world model sizes like 70B parameter transformers.
Two. Key Learning Takeaways
All data and computation in deep learning is represented as tensors, and understanding tensor memory usage is the foundation of all resource accounting.
Bfloat16 (bf16) is the sweet spot for modern LLM training, offering the same dynamic range as float32 with half the memory footprint, avoiding the underflow and overflow issues that plagued float16 training.
The standard formula for estimating training FLOPs is 6 times the number of parameters times the number of tokens, derived from counting both forward and backward pass computations.
Model FLOPs Utilization (MFU) measures how efficiently a model uses hardware compute, with 0.5 being an excellent target for real-world transformer training runs.
Arithmetic intensity determines whether a computation is memory-bound or compute-bound; matrix multiplications are the only common operations that achieve compute-bound status on modern GPUs.
Training memory usage consists of four primary components: model parameters, gradients, optimizer states, and activations, with optimizer states often being the largest consumer for AdamW.
Gradient accumulation and activation checkpointing are the two most widely used techniques to reduce memory pressure, allowing larger models or batch sizes to fit on limited GPU memory.
Transformer inference is fundamentally memory-bound due to matrix-vector operations, creating a very different performance profile compared to compute-bound training.
Three. Course Gold Quotes
"Resource accounting is going to be very crucial. I want everyone to get in the habit whenever you write this line of code, think about the performance characteristics."
"There's going to be no ML magic today. I'll leave that to Tatsu for the next lecture."
"Arithmetic intensity is everything. We want to keep our GPUs hot doing matrix multiplies, not wasting cycles moving tiny bits of memory back and forth."
"Matrix multiplications dominate generally the computations. And that's by design."
"If you have stability issues, just throw a LayerNorm in there. It sounds ridiculous, but it actually works almost every time."
"The most successful architectures are not the most mathematically elegant ones. They are the ones that train stably and run fast on our hardware."
"Memory serves two purposes: you have to store things in your HBM, and you have to ship them to the accelerators. Both matter for performance."
Four. Layered Learning Notes
Module 1: Tensors and Numerical Precision
Tensors are the fundamental building block of all deep learning computations, storing parameters, gradients, optimizer states, activations, and training data. Every tensor has a shape and a precision, both of which directly impact memory usage and computational performance.
The memory footprint of a tensor is simply the number of elements multiplied by the size of each element in bytes. This basic calculation is the starting point for all resource accounting.
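As a minimal sketch of that calculation, assuming a hypothetical 8192 x 8192 weight matrix (the helper name `tensor_bytes` and the example shape are illustrative, not from the lecture):

```python
import math

# Bytes per element for a few common precisions.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}

def tensor_bytes(shape, dtype):
    """Memory footprint = number of elements x bytes per element."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]

# Hypothetical example: one 8192 x 8192 weight matrix.
shape = (8192, 8192)
fp32_bytes = tensor_bytes(shape, "fp32")
bf16_bytes = tensor_bytes(shape, "bf16")
print(fp32_bytes / 2**20, "MiB in fp32")
print(bf16_bytes / 2**20, "MiB in bf16")
```

The same function applies to any tensor in a training job (parameters, gradients, optimizer states, activations), so it can serve as the base of larger estimates.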
Numerical precision is one of the most impactful levers for optimizing both memory usage and speed. The lecture covers the full spectrum of common precisions used in deep learning:
- Float32 (fp32): The default precision, using 4 bytes per element with 8 exponent bits and 23 mantissa bits. It offers excellent stability but high memory usage.
- Float16 (fp16): Uses 2 bytes per element with only 5 exponent bits, leading to severe underflow and overflow issues during training. It is rarely used for modern LLM training.
- Bfloat16 (bf16): The industry standard for modern training, also using 2 bytes per element but with the same 8 exponent bits as fp32. It trades reduced mantissa precision for full dynamic range, avoiding the overflow and underflow problems of fp16 while cutting memory usage in half relative to fp32.
- FP8 and FP4: Emerging lower-precision formats that offer even greater memory savings. FP4 uses only 4 bits per element with block-wise scaling to maintain dynamic range, and has been successfully used to train the Nemotron 3 Super model.
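The dynamic-range difference between these formats can be checked from the bit widths alone. The sketch below assumes the standard IEEE-style layout; the mantissa widths for fp16 (10 bits) and bf16 (7 bits) are not stated in these notes and are filled in here as known format facts. Function names are illustrative.

```python
def largest_finite(exp_bits, man_bits):
    """Largest finite value of an IEEE-style float with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias

def smallest_normal(exp_bits):
    """Smallest positive normal value; anything much smaller starts to underflow."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias)

# (exponent bits, mantissa bits) per format.
FORMATS = {"fp32": (8, 23), "fp16": (5, 10), "bf16": (8, 7)}

for name, (e, m) in FORMATS.items():
    print(f"{name}: max={largest_finite(e, m):.3e}, "
          f"min normal={smallest_normal(e):.3e}")
```

Because bf16 shares the 8-bit exponent of fp32, its range is essentially the same as fp32, while fp16 tops out at 65504.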
Mixed precision training is the standard practice, using bf16 for parameters, activations, and gradients while retaining fp32 for optimizer states to ensure numerical stability. PyTorch's Automatic Mixed Precision (AMP) library handles most of this automatically, casting operations to appropriate precisions based on their numerical sensitivity.
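One way to see why accumulated state benefits from higher precision is to simulate bf16 rounding by hand. This is an illustration added here, not code from the lecture: `to_bf16` is a hypothetical helper that rounds a value to the nearest bf16 number, and a Python float (fp64) stands in for the higher-precision accumulator.

```python
import struct

def to_bf16(x):
    """Round a Python float to the nearest bf16 value (round-to-nearest-even).

    bf16 is the top 16 bits of a float32, so the low 16 bits are rounded away.
    Sketch only: NaN, infinity, and overflow are not handled.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

increment = 0.001
acc_bf16 = 1.0   # accumulator rounded to bf16 after every add
acc_high = 1.0   # higher-precision reference accumulator
for _ in range(1000):
    acc_bf16 = to_bf16(acc_bf16 + increment)
    acc_high = acc_high + increment

print("bf16 accumulator:", acc_bf16)
print("high-precision accumulator:", acc_high)
```

Near 1.0, adjacent bf16 values are 2^-7 apart, so each small increment is rounded away and the bf16 accumulator never moves, while the higher-precision one reaches roughly 2.0.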
Module 2: FLOPs Calculation and Hardware Performance
FLOPs (Floating-Point Operations) are the standard unit for measuring computational cost. The instructor clarifies the critical distinction between FLOPs (total computation) and FLOP/s (computation speed), noting that hardware spec sheets often quote theoretical peak performance with sparsity, which should be divided by 2 for realistic dense computation estimates.
Matrix multiplication is the dominant operation in transformer training, and counting its FLOPs is straightforward: a matrix multiplication of dimensions B×D and D×K requires 2×B×D×K FLOPs, accounting for both multiplication and addition operations. All other operations (elementwise operations, activations, reductions) are negligible in comparison for large matrices.
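The counting rule can be written as a one-line helper. The layer sizes below are hypothetical and chosen only to give a sense of scale:

```python
def matmul_flops(b, d, k):
    """FLOPs for a (b x d) @ (d x k) product.

    Each of the b*k outputs needs d multiplies and about d adds.
    """
    return 2 * b * d * k

# Hypothetical sizes: 16384 tokens through an 8192 -> 32768 linear layer.
flops = matmul_flops(16384, 8192, 32768)
print(f"{flops:.3e} FLOPs")
```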
The total FLOPs for training a neural network can be derived by counting both forward and backward passes. The forward pass requires 2×parameters×tokens FLOPs, while the backward pass requires twice that amount (4×parameters×tokens FLOPs) because it must compute gradients with respect to both inputs and parameters. This gives the famous 6×parameters×tokens formula that is widely used for estimating training cost.
Model FLOPs Utilization (MFU) is defined as the actual achieved FLOP/s divided by the theoretical peak FLOP/s of the hardware. A well-optimized transformer training run typically achieves around 0.5 MFU, while a pure matrix multiplication can reach up to 0.8-0.9 MFU. Lower MFU values indicate significant performance bottlenecks that need to be addressed.
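Combining the 6×parameters×tokens rule with MFU gives a back-of-the-envelope training-time estimate. The sketch below assumes a dense bf16 peak of 989 TFLOP/s per H100 (a spec-sheet figure without sparsity, not stated in these notes); the token count and GPU count are hypothetical.

```python
H100_PEAK_FLOPS = 989e12  # assumed dense bf16 peak FLOP/s per H100 (no sparsity)

def training_flops(n_params, n_tokens):
    """2*N*T for the forward pass plus 4*N*T for the backward pass."""
    return 6 * n_params * n_tokens

def mfu(achieved_flops_per_s, peak_flops_per_s=H100_PEAK_FLOPS):
    """Model FLOPs Utilization: achieved FLOP/s over theoretical peak FLOP/s."""
    return achieved_flops_per_s / peak_flops_per_s

def training_days(n_params, n_tokens, n_gpus, mfu_value, peak=H100_PEAK_FLOPS):
    """Wall-clock days, assuming the cluster sustains mfu_value * peak per GPU."""
    seconds = training_flops(n_params, n_tokens) / (n_gpus * peak * mfu_value)
    return seconds / 86400

# Hypothetical run: 70B parameters, 15T tokens, 1024 H100s at 0.5 MFU.
total = training_flops(70e9, 15e12)
days = training_days(70e9, 15e12, n_gpus=1024, mfu_value=0.5)
print(f"{total:.2e} FLOPs, about {days:.0f} days")
```

Halving the MFU doubles the estimated time, which is why utilization is tracked so closely.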
Module 3: Arithmetic Intensity and Roofline Analysis
Arithmetic intensity is the key concept for understanding performance bottlenecks in deep learning. It is defined as the number of FLOPs performed per byte of memory transferred between high-bandwidth memory (HBM) and the compute cores.
Every GPU has a characteristic arithmetic intensity threshold, calculated as peak FLOP/s divided by peak memory bandwidth. For an H100 GPU, this threshold is approximately 295 FLOPs per byte. Operations with arithmetic intensity below this threshold are memory-bound, spending most of their time waiting for data to arrive from memory. Operations above the threshold are compute-bound, fully utilizing the GPU's compute cores.
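The threshold is a single division. The H100 figures below (989 TFLOP/s dense bf16, 3.35 TB/s HBM bandwidth) are assumed spec-sheet values rather than numbers quoted in these notes, and the function names are illustrative:

```python
H100_PEAK_FLOPS = 989e12      # assumed dense bf16 peak, FLOP/s
H100_HBM_BANDWIDTH = 3.35e12  # assumed HBM bandwidth, bytes/s

def intensity_threshold(peak_flops, bandwidth):
    """FLOPs per byte at which compute time equals memory-transfer time."""
    return peak_flops / bandwidth

def is_compute_bound(intensity, peak_flops=H100_PEAK_FLOPS,
                     bandwidth=H100_HBM_BANDWIDTH):
    """True if an operation with this arithmetic intensity is compute-bound."""
    return intensity >= intensity_threshold(peak_flops, bandwidth)

threshold = intensity_threshold(H100_PEAK_FLOPS, H100_HBM_BANDWIDTH)
print(f"H100 threshold: about {threshold:.0f} FLOPs per byte")
```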
The lecture analyzes the arithmetic intensity of common operations:
- Elementwise operations (ReLU, GELU): Very low arithmetic intensity (0.25-5), always memory-bound.
- Dot products and matrix-vector products: Low arithmetic intensity, memory-bound.
- Large matrix multiplications: High arithmetic intensity (~300 for square matrices), compute-bound.
This explains why transformers are designed around large matrix multiplications: they are the only operations that can fully utilize modern GPU hardware. It also explains why transformer inference is so much less efficient than training: inference uses matrix-vector operations which are fundamentally memory-bound.
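The intensities in the list above can be approximated with a simple idealized model, assuming bf16 operands and that every input and output crosses HBM exactly once (real kernels can move more data). The function names and sizes are illustrative:

```python
BF16_BYTES = 2

def relu_intensity(n):
    """Elementwise op on n values: one FLOP per element."""
    flops = n
    bytes_moved = 2 * n * BF16_BYTES  # read n inputs, write n outputs
    return flops / bytes_moved

def matvec_intensity(n):
    """(n x n) matrix times an n-vector."""
    flops = 2 * n * n
    bytes_moved = (n * n + 2 * n) * BF16_BYTES  # read matrix and vector, write result
    return flops / bytes_moved

def matmul_intensity(n):
    """(n x n) matrix times an (n x n) matrix."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * BF16_BYTES  # read two matrices, write one
    return flops / bytes_moved

for n in (256, 1024, 4096):
    print(n, relu_intensity(n), round(matvec_intensity(n), 3),
          round(matmul_intensity(n), 1))
```

Under this model the matrix-multiply intensity is n/3, so it grows with matrix size, while the elementwise and matrix-vector cases stay at or below about 1 FLOP per byte no matter how large n gets.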
The roofline model visualizes this relationship, plotting achieved FLOP/s against arithmetic intensity. It shows a linear region for memory-bound operations and a flat region for compute-bound operations, providing a clear framework for diagnosing performance issues.
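A minimal sketch of the roofline itself: attainable throughput is the smaller of the compute ceiling and the bandwidth-limited slope. The default hardware numbers are assumed H100 spec-sheet values, not figures from these notes.

```python
def attainable_flops(intensity, peak_flops=989e12, bandwidth=3.35e12):
    """Roofline: min(compute ceiling, arithmetic intensity * memory bandwidth)."""
    return min(peak_flops, intensity * bandwidth)

for intensity in (0.25, 1, 10, 100, 295, 1000):
    tflops = attainable_flops(intensity) / 1e12
    print(f"intensity {intensity:>6}: at most {tflops:.1f} TFLOP/s")
```

Below the threshold the bound rises linearly with intensity; above it the bound is flat at the hardware peak.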
Module 4: Training Memory Breakdown
Understanding training memory usage is critical for determining the largest model that can fit on a given GPU. The total memory footprint consists of four main components:
- Model parameters: Stored in bf16, requiring 2 bytes per parameter.
- Gradients: Also stored in bf16, requiring another 2 bytes per parameter.
- Optimizer states: For AdamW, this includes first and second moment estimates, both stored in fp32 for stability, requiring 8 bytes per parameter.
- Activations: Stored during the forward pass for use in backpropagation, scaling linearly with batch size, sequence length, and number of layers.
For AdamW training, this gives a total of 12 bytes per parameter just for parameters, gradients, and optimizer states. This means a single H100 with 80GB of HBM can fit approximately 6.7 billion parameters (80 GB divided by 12 bytes per parameter) before accounting for activations.
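The per-parameter accounting can be sketched as follows, using the byte counts from the breakdown above and ignoring activations. The helper names are illustrative:

```python
# Bytes per parameter for each persistent component.
BYTES_PER_PARAM = {
    "parameters (bf16)": 2,
    "gradients (bf16)": 2,
    "adamw first moment (fp32)": 4,
    "adamw second moment (fp32)": 4,
}
TOTAL_BYTES_PER_PARAM = sum(BYTES_PER_PARAM.values())

def max_params(hbm_bytes):
    """Largest parameter count whose training state fits, ignoring activations."""
    return hbm_bytes / TOTAL_BYTES_PER_PARAM

def state_bytes(n_params):
    """Bytes needed for parameters, gradients, and optimizer states."""
    return n_params * TOTAL_BYTES_PER_PARAM

print(f"{TOTAL_BYTES_PER_PARAM} bytes per parameter")
print(f"80 GB fits about {max_params(80e9) / 1e9:.1f}B parameters")
print(f"a 70B model needs about {state_bytes(70e9) / 1e9:.0f} GB of state")
```

By this estimate a 70B-parameter model needs far more than one 80GB GPU for its training state alone, before any activations are counted.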
Activation memory is often the limiting factor for large batch sizes or long sequence lengths. It scales linearly with batch size, making large batches particularly memory-intensive. This creates a fundamental tradeoff between batch size and model size for a given GPU memory budget.
Module 5: Memory Optimization Techniques
Two primary techniques are used to address memory limitations in LLM training: gradient accumulation and activation checkpointing.
Gradient accumulation allows simulating larger batch sizes without increasing memory usage. Instead of updating parameters after every micro-batch, gradients are accumulated across multiple micro-batches, and parameters are updated only after the desired effective batch size is reached. This requires only a trivial code change and has minimal impact on training dynamics.
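The idea can be shown without any framework. The sketch below uses a toy one-parameter model y = w * x with made-up data and a plain SGD step standing in for the optimizer; none of these specifics come from the lecture. It checks that accumulating scaled micro-batch gradients reproduces the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

# Reference: gradient over the whole batch at once.
full_batch_grad = grad_mse(w, xs, ys)

# Gradient accumulation: process equal-sized micro-batches one at a time.
micro_batch_size = 2
n_micro = len(xs) // micro_batch_size
accumulated = 0.0
for i in range(n_micro):
    lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_micro

# One parameter update per effective batch, not per micro-batch.
lr = 0.01
w_new = w - lr * accumulated
print(full_batch_grad, accumulated, w_new)
```

Only one micro-batch of activations needs to be live at a time, which is where the memory saving comes from.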
Activation checkpointing (also called rematerialization) trades compute for memory by not storing all activations during the forward pass. Instead, only a subset of activations (checkpoints) are stored, and missing activations are recomputed during the backward pass. This can reduce activation memory by up to 50% with a roughly 20% increase in compute time.
In the extreme case, no activations are stored at all, and all are recomputed during backpropagation. This minimizes memory usage but maximizes compute overhead. A balanced approach stores a checkpoint every √L layers, where L is the number of layers, achieving a good tradeoff between memory and compute.
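A rough way to see where √L comes from is a simplified cost model, which is an assumption made here rather than the lecture's derivation: with a checkpoint every k layers, about L/k checkpoints are held, plus up to k activations for the one segment being recomputed during the backward pass.

```python
import math

def stored_activations(num_layers, segment):
    """Simplified peak count of layer activations held at once.

    ceil(num_layers / segment) checkpoints, plus up to `segment` activations
    recomputed for the segment currently being backpropagated.
    """
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment

num_layers = 64
best_segment = min(range(1, num_layers + 1),
                   key=lambda k: stored_activations(num_layers, k))
print("best segment length:", best_segment)
print("peak activations held:", stored_activations(num_layers, best_segment))
print("sqrt(L):", math.isqrt(num_layers))
```

For 64 layers the minimum lands at a segment length of 8, matching √L, and the count of held activations drops from roughly L to roughly 2√L under this model.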


