One. Course Details
This is week nine of CS 336: Language Modeling from Scratch (AI Coachella) at Stanford University, which temporarily shifts from systems topics to deep learning fundamentals. This is the first of two lectures on scaling laws, covering the basic principles, historical origins, and core mathematical frameworks that underpin modern large language model training. The second, more advanced lecture will cover cutting-edge topics including μP (maximal update parameterization), optimizer scaling, and tech reports from modern open models.
The instructor breaks down how scaling laws allow engineers to predict large-scale model performance from small-scale experiments—a critical skill for avoiding wasted millions on failed training runs. The lecture covers both theoretical foundations and practical engineering applications, including a deep dive into the famous Kaplan vs. Chinchilla debate and its lasting impact on the industry.
The lecture covers:
- Historical roots of scaling laws dating back to 1993
- Mathematical foundations of power law performance scaling
- Data scaling laws and their statistical interpretation
- Architecture and model size scaling principles
- Hyperparameter scaling for batch size and learning rate
- Joint compute-optimal scaling laws
- The Kaplan vs. Chinchilla controversy and its resolution
- Practical scaling methodologies and common pitfalls
- The relationship between upstream perplexity and downstream performance
Two. Key Learning Takeaways
- Neural scaling laws are predictable power law relationships between resources (compute, parameters, data) and model performance. When plotted on log-log axes, these relationships form straight lines that can be reliably extrapolated across many orders of magnitude.
- Most interventions only change the intercept of scaling laws, not the slope. Improvements to data quality, architecture, or optimization typically shift the entire curve up or down but preserve the fundamental rate of improvement with scale.
- The best configuration at small scale is almost always the best configuration at large scale. This is the core insight that makes scaling laws such a powerful engineering tool for optimizing hyperparameters, data mixtures, and architectures before committing to large runs.
- Compute-optimal training balances model size and data size. The Chinchilla paper established that approximately 20 tokens of data per parameter is the optimal ratio for minimizing training loss, though production models are almost always overtrained to reduce inference costs.
- IsoFLOP analysis is the most reliable method for scaling studies. By fixing the total compute budget and sweeping across different trade-offs between parameters and data, you can robustly determine optimal configurations without making strong modeling assumptions.
- All parameters are not created equal. How you count parameters (including or excluding embeddings, softmax layers, or inactive MoE parameters) can dramatically change the shape of your scaling laws and lead to incorrect conclusions.
- Upstream perplexity and downstream performance are not perfectly correlated. While perplexity scales very predictably, transfer to downstream tasks can be less reliable and requires separate validation.
Three. Course Gold Quotes
- "Scaling laws are these simple, predictive rules of how to go from small scale model performance and behavior, and try to extrapolate them up to large scale behavior. To some people in big labs, it's almost a way of life."
- "The best mixture at small scale is also the best mixture at large scale. That's why you can optimize your data mix on tiny models and scale it up with confidence."
- "All parameters are not created equal. This is a huge can of worms, but it explains almost all of the disagreement between different scaling law papers."
- "Scaling laws aren't magic. They are engineered, not automatic. You have to pick the right axes and set your hyperparameters correctly to get predictable scaling."
- "Chinchilla is important not because the 20-to-1 ratio is some golden rule written in stone. It's important because it taught us how to do scaling law analysis properly."
- "If you only have a tiny slice of a compute range, it's very difficult to tell if something is scaling polynomially or exponentially. Everything looks linear if you zoom in enough."
- "In production, you almost never want the compute-optimal model. You want the smallest possible model that can hit your target performance, because inference costs will dominate your total spend."
Four. Layered Learning Notes
Module 1: Historical Foundations of Scaling Laws
- Scaling laws are not a modern invention unique to large language models. The earliest work dates back to 1993 at Bell Labs, where Cortes, Vapnik, and colleagues proposed fitting error decay curves to estimate classifier performance on large datasets.
- In 2001, Banko and Brill showed that natural language processing systems improved predictably with increasing data size, sparking the first debates about whether more data beats better algorithms.
- The modern era of neural scaling laws began with Hestness et al. in 2017, who demonstrated consistent power law scaling across speech recognition, machine translation, and language modeling tasks. This paper was also the first to discuss emergence and the importance of compute scaling.
- Scaling laws have deep connections to classical machine learning theory, particularly generalization bounds and sample complexity. Both describe how error decreases as a function of increasing resources.
Module 2: Data Scaling Laws
- Data scaling laws describe how test loss decreases as you increase the size of your training dataset, while keeping the model large enough that its capacity does not become the bottleneck.
- The general form is L = a * D^(-α) + L₀, where L is test loss, D is dataset size, a is a fitted constant, α is the scaling exponent, and L₀ is the irreducible error of the task.
- For language models, α typically falls between 0.1 and 0.3, a much slower rate than the exponent of 1.0 for simple parametric problems such as mean estimation.
- This slow exponent suggests that neural networks behave like non-parametric estimators, whose error decays with an exponent of roughly 1/d in d dimensions. An exponent near 0.1 then corresponds to a high-dimensional space (approximately 10 dimensions), in which the model learns gradually from additional data.
- Practical applications of data scaling laws:
  - Optimizing data mixtures by testing different ratios on small models
  - Determining the maximum number of epochs you can run before performance degrades (typically 4 epochs for standard recipes)
  - Understanding how data filtering strategies change with scale: looser filters are better at larger compute budgets
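The data scaling form above can be fitted with a straight-line regression once the irreducible loss is subtracted. The following is a minimal sketch, not the lecture's own code: it assumes L₀ is already known (in practice it has to be estimated as well), and the function name `fit_power_law` and all numbers are made up for illustration.

```python
import math

def fit_power_law(dataset_sizes, losses, irreducible_loss):
    """Fit L = a * D**(-alpha) + L0, with L0 assumed known.

    Subtracting L0 and taking logs turns the power law into a line:
        log(L - L0) = log(a) - alpha * log(D)
    which is fitted here by ordinary least squares.
    """
    xs = [math.log(d) for d in dataset_sizes]
    ys = [math.log(loss - irreducible_loss) for loss in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic, noise-free measurements (hypothetical values, not real results).
TRUE_A, TRUE_ALPHA, TRUE_L0 = 5.0, 0.2, 1.5
sizes = [10 ** k for k in range(6, 11)]
losses = [TRUE_A * d ** (-TRUE_ALPHA) + TRUE_L0 for d in sizes]

a_hat, alpha_hat = fit_power_law(sizes, losses, TRUE_L0)

# Extrapolate two orders of magnitude beyond the largest measured dataset.
predicted_loss = a_hat * (1e12) ** (-alpha_hat) + TRUE_L0
print(f"a={a_hat:.3f} alpha={alpha_hat:.3f} predicted loss at 1e12: {predicted_loss:.4f}")
```

With real measurements the points are noisy and L₀ is unknown, so a nonlinear fit over all three quantities is usually needed; the log-log regression only shows why the relationship appears as a straight line.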
Module 3: Architecture and Model Size Scaling
- Architecture comparisons can be done rigorously using scaling laws by training models of different sizes across a range of compute budgets.
- The key insight is that if one architecture has a better intercept and the same slope as another, it will be better at all scales. If it has a worse slope, it will eventually fall behind as you scale up.
- This methodology has been used to validate almost all modern architectural innovations, including gated linear units, mixture of experts, and state space models like Mamba.
- Scale-invariant quantities are properties that remain optimal across different model sizes. The most important is the aspect ratio (depth-to-width ratio), which stays roughly constant as models grow larger.
- Mixture of experts (MoE) models require special scaling analysis because total parameters and active parameters are decoupled. Studies show that even inactive parameters improve performance, justifying the use of increasingly sparse models at larger scales.
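The intercept-versus-slope argument can be made concrete by computing where two fitted curves cross. This is a small sketch under simplifying assumptions: each architecture's loss is modeled as a pure power law in compute, L = a * C^(-b), with no irreducible term, and the coefficients and the function names `power_law_loss` and `crossover_compute` are hypothetical.

```python
def power_law_loss(compute, a, b):
    """Loss predicted by a pure power law L = a * C**(-b)."""
    return a * compute ** (-b)

def crossover_compute(a1, b1, a2, b2):
    """Compute budget at which two power laws predict the same loss.

    Setting a1 * C**(-b1) == a2 * C**(-b2) gives C = (a1 / a2) ** (1 / (b1 - b2)).
    Returns None when the slopes are equal: the curves are then parallel on
    log-log axes and the one with the lower intercept wins at every scale.
    """
    if b1 == b2:
        return None
    return (a1 / a2) ** (1.0 / (b1 - b2))

# Hypothetical fits: B starts worse (higher intercept) but improves faster.
ARCH_A = (10.0, 0.05)
ARCH_B = (12.0, 0.06)

c_star = crossover_compute(*ARCH_A, *ARCH_B)
print(f"curves cross at about {c_star:.3g} FLOPs")
for budget in (1e3, 1e10):
    loss_a = power_law_loss(budget, *ARCH_A)
    loss_b = power_law_loss(budget, *ARCH_B)
    print(f"C={budget:.0e}: A={loss_a:.3f} B={loss_b:.3f}")
```

Below the crossover the architecture with the better intercept looks superior, which is why a comparison run at a single small scale can point the wrong way when the slopes differ.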
Module 4: Hyperparameter Scaling
- Critical batch size is the point beyond which increasing batch size no longer provides proportional improvements in convergence speed.
- Below the critical batch size, training is noise-limited, and larger batches reduce gradient variance to speed up convergence. Above the critical batch size, training becomes bias-limited: the gradient estimate is already accurate, so larger batches bring rapidly diminishing returns.
- The critical batch size increases predictably as model performance improves (that is, as the target loss decreases), following another power law relationship. This means larger models can and should use larger batch sizes.
- Learning rate scaling is the other critical hyperparameter consideration. There are two dominant approaches:
  - Traditional scaling: Decrease learning rate as model width increases, typically proportional to 1/√width
  - μP (maximal update parameterization): Rescale the model to keep the optimal learning rate constant across all sizes
- Surprisingly, even major changes like switching from SGD to Adam only change the intercept of scaling laws, not the slope.
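The diminishing returns around the critical batch size can be illustrated with a simple trade-off model. This sketch assumes the form used in the large-batch training literature (McCandlish et al., 2018), steps = min_steps * (1 + B_crit / B), which these notes do not state explicitly; the constants and function names are hypothetical.

```python
def steps_to_target(batch_size, critical_batch_size, min_steps):
    """Optimizer steps needed to reach a target loss at a given batch size.

    Assumed model: steps = min_steps * (1 + B_crit / B).
    For B << B_crit, doubling B roughly halves the step count (noise-limited).
    For B >> B_crit, the step count flattens out at min_steps.
    """
    return min_steps * (1.0 + critical_batch_size / batch_size)

def examples_to_target(batch_size, critical_batch_size, min_steps):
    """Total training examples processed, i.e. batch size times steps."""
    return batch_size * steps_to_target(batch_size, critical_batch_size, min_steps)

CRITICAL_BATCH = 1024  # hypothetical critical batch size
MIN_STEPS = 1000       # hypothetical step count in the infinite-batch limit

for batch in (16, 32, 1024, 65536, 131072):
    steps = steps_to_target(batch, CRITICAL_BATCH, MIN_STEPS)
    examples = examples_to_target(batch, CRITICAL_BATCH, MIN_STEPS)
    print(f"B={batch:>6}: steps={steps:>9.1f} examples={examples:.3g}")

# Speed-up from doubling the batch size, far below and far above B_crit.
speedup_small = steps_to_target(16, CRITICAL_BATCH, MIN_STEPS) / steps_to_target(32, CRITICAL_BATCH, MIN_STEPS)
speedup_large = steps_to_target(65536, CRITICAL_BATCH, MIN_STEPS) / steps_to_target(131072, CRITICAL_BATCH, MIN_STEPS)
```

Under this model, doubling the batch from 16 to 32 cuts the step count almost in half, while doubling it from 65536 to 131072 saves less than one percent of the steps and nearly doubles the data consumed.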
Module 5: Joint Compute-Optimal Scaling
- The most important practical question in large model training is: given a fixed compute budget, how should I allocate it between model size and data size?
- Kaplan et al. (2020) proposed that model size should scale faster than data size, leading to the era of giant under-trained models like GPT-3 (175B parameters trained on about 300B tokens, which is under 2 tokens per parameter).
- Hoffmann et al. (2022), the Chinchilla paper, challenged this conclusion, showing that a ratio of roughly 20 tokens per parameter is actually compute-optimal. Their Chinchilla 70B model outperformed much larger models, including the 280B Gopher trained with the same compute.
- The disagreement between the two papers stemmed from seemingly minor details:
  - Kaplan excluded both embedding and softmax parameters from their count
  - Many small models in the Kaplan study were not fully converged due to poor learning rate warmup
  - Kaplan used a fixed batch size that was suboptimal for small models
- While the 20:1 ratio is compute-optimal for training, production models are almost always overtrained (50-100 tokens per parameter) because inference costs are far larger than training costs at scale.
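As a worked example of the allocation question, the sketch below splits a FLOP budget at a fixed tokens-per-parameter ratio. It relies on the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6 * N * D), which is an assumption not stated in these notes; the function name `allocate_compute` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: training compute C ≈ 6 * N * D

def allocate_compute(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens) at a fixed ratio.

    Solves C = 6 * N * D together with D = tokens_per_param * N, giving
        N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget matching a 70B-parameter model trained on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12

n_opt, d_opt = allocate_compute(budget)                           # 20 tokens per parameter
n_over, d_over = allocate_compute(budget, tokens_per_param=100.0)  # overtrained

print(f"20:1  -> {n_opt / 1e9:.1f}B params, {d_opt / 1e12:.2f}T tokens")
print(f"100:1 -> {n_over / 1e9:.1f}B params, {d_over / 1e12:.2f}T tokens")
```

At the same budget, the overtrained configuration yields a model less than half the size trained on more than twice the tokens, which is the trade that production systems make to lower inference cost.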
Module 6: Practical Methodologies and Pitfalls
- IsoFLOP analysis is the most robust and widely used method for scaling studies. You fix your total compute budget, then sweep across different parameter-data trade-offs to find the minimum loss.
- This method avoids making strong assumptions about the functional form of the scaling law and is less sensitive to modeling errors than surface fitting approaches.
- Common pitfalls to avoid:
  - Using too small a compute range, which makes it impossible to distinguish between different functional forms
  - Counting parameters incorrectly, which can shift your entire scaling curve
  - Using suboptimal hyperparameters for small models, which will lead to incorrect extrapolations
  - Assuming that good perplexity automatically translates to good downstream performance
- Always validate your scaling predictions with intermediate-sized runs before committing to your full-scale training run.
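One IsoFLOP curve can be analyzed as sketched below: for a single compute budget, fit a parabola to loss against log10(parameters) and read off the minimum. This is an illustrative sketch rather than the procedure from any specific paper; the helper names and the synthetic losses are hypothetical.

```python
import math

def fit_quadratic(xs, ys):
    """Least-squares fit of y = c2*x**2 + c1*x + c0 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sums of y * x^k
    m = [[s[0], s[1], s[2]],
         [s[1], s[2], s[3]],
         [s[2], s[3], s[4]]]

    def det3(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    # Solve m @ [c0, c1, c2] = t with Cramer's rule.
    d = det3(m)
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = t[r]
        coeffs.append(det3(replaced) / d)
    c0, c1, c2 = coeffs
    return c2, c1, c0

def isoflop_optimum(param_counts, losses):
    """Parameter count at the minimum of the fitted parabola in log10(N)."""
    logs = [math.log10(n) for n in param_counts]
    center = sum(logs) / len(logs)  # centering keeps the fit well conditioned
    c2, c1, _ = fit_quadratic([x - center for x in logs], losses)
    if c2 <= 0:
        raise ValueError("no interior minimum: the sweep does not bracket the optimum")
    return 10 ** (center - c1 / (2 * c2))

# Synthetic sweep at one fixed budget (hypothetical losses, minimum placed at 1e9).
param_counts = [1e8, 3e8, 1e9, 3e9, 1e10]
sweep_losses = [2.0 + 0.05 * (math.log10(n) - 9.0) ** 2 for n in param_counts]

n_star = isoflop_optimum(param_counts, sweep_losses)
print(f"estimated optimal size: {n_star:.3g} parameters")
```

Repeating this for several budgets gives one optimum per budget, and a straight-line fit through those optima on log-log axes then gives the scaling of optimal model size with compute.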
Wishing you all the best as you apply these powerful scaling principles to your own language modeling projects. May your small runs predict your big runs accurately, your hyperparameters stay perfectly tuned, and your models scale smoothly to new frontiers. Remember that scaling laws are tools, not rules—always validate your assumptions and don't be afraid to question conventional wisdom. The next breakthrough in AI will come from someone who combines careful scaling with creative new ideas. Happy training!


