Lecture Notes: AI Inference Fundamentals and Optimization Techniques

One. Course Details

This is week eight of CS 153: Frontier Systems (AI Coachella) at Stanford University, following Tatsu Hashimoto's lecture on scaling laws. The lecture focuses entirely on AI inference—the process of generating outputs from a trained model—and its rapidly growing importance in the AI ecosystem.
The instructor breaks down the fundamental differences between training and inference workloads, explains why inference is inherently memory-bound rather than compute-bound, and provides a comprehensive survey of state-of-the-art optimization techniques. The lecture covers both algorithmic innovations and systems-level optimizations, with concrete mathematical analysis of transformer performance and real-world examples from production inference systems.
The lecture covers:

The growing economic and practical importance of inference
Core performance metrics: time to first token (TTFT), latency, and throughput
Arithmetic intensity analysis for transformer layers
The KV cache and its critical role in inference efficiency
Architectural optimizations: grouped query attention, multi-head latent attention, and sliding window attention
Model compression techniques: quantization and pruning
Speculative decoding and lossless acceleration methods
Systems optimizations for dynamic workloads: continuous batching and paged attention
Future directions in inference-friendly model architectures

Two. Key Learning Takeaways

Inference is the dominant cost of AI, not training. Training is a one-time expense, but inference costs repeat every single day. OpenAI generates 8.6 trillion tokens per day—more than the entire training dataset of DeepSeek V4 every four days.
Transformer inference is fundamentally memory-bound, not compute-bound. The auto-regressive nature of generation means you cannot parallelize across the sequence dimension, leading to extremely low arithmetic intensity and underutilized GPU compute.
The KV cache is the single most important optimization for inference, but it also becomes the primary memory bottleneck. Reducing KV cache size without sacrificing accuracy is the highest leverage way to improve both latency and throughput.
There is an unavoidable trade-off between latency and throughput. Smaller batch sizes deliver better latency for interactive use cases, while larger batch sizes maximize throughput for batch processing workloads.
Speculative decoding provides lossless speedups by exploiting the asymmetry between fast verification and slow generation. A small draft model generates candidate tokens one at a time, which are then verified in parallel by the large target model in a single forward pass.
Systems-level optimizations deliver massive real-world gains. Techniques like continuous batching and paged attention can double or triple throughput on production workloads without any changes to the model itself.
The transformer architecture is inherently inference-unfriendly. New architectures like state space models and linear attention have enormous potential to unlock order-of-magnitude improvements in inference efficiency.

Three. Course Gold Quotes

"Training is a one-time cost. It could be very expensive, it is very expensive, but once you're done with it that's it. But inference is a repeated cost you incur every single day."
"Prefill is compute bound, generation is memory bound. That is the single most important thing to remember about transformer inference."
"If you can make inference twice as fast or even 10% faster, that is a big deal. At the scale we're operating today, 10% faster inference saves billions of dollars a year."
"The KV cache is both our greatest friend and our greatest enemy. It saves us from O(n³) time complexity, but it becomes the dominant memory consumer at scale."
"Checking is faster than generation. If I give you a sequence, it's fast to tell me how good it is—much faster than it is to generate one token at a time."
"There is no free lunch in inference optimization. Every technique that makes the model faster will either hurt accuracy, increase complexity, or both."
"The biggest gains in inference over the next five years will not come from better GPUs. They will come from better model architectures designed from the ground up for inference."

Four. Layered Learning Notes

Module 1: The Growing Importance of Inference

Inference is the process of taking a trained model, feeding it a prompt, and producing a response as accurately and quickly as possible.
While training gets most of the media attention, inference is where AI actually delivers value to end users. It powers chatbots, code completion, agents, batch data processing, model evaluation, and even reinforcement learning training loops.
The importance of inference has exploded with the rise of agentic AI. In the chatbot era, most tokens were generated for humans to read, and human reading speed imposed a natural limit on inference demand.
In the agentic era, most tokens are generated for internal reasoning, tool calls, and introspection. There is no upper limit to how much inference compute an agent can use to solve a difficult problem.
The scale is staggering: OpenAI generates 8.6 trillion tokens per day. For comparison, DeepSeek V4 was trained on just 32 trillion tokens total. Inference compute demand already dwarfs training compute demand, and this gap will only widen.

Module 2: Core Inference Metrics

There are three primary metrics used to evaluate inference performance, each relevant to different use cases:

Time to First Token (TTFT): The time between when a user submits a prompt and when the first token appears. This is the most important metric for interactive applications, as users perceive latency as the wait time before any response starts.
Per-Token Latency: The time between consecutive tokens once generation has started. This is also important for interactive use cases, but users are far more sensitive to TTFT than to per-token latency as long as tokens stream faster than reading speed.
Throughput: The total number of tokens generated per second across all concurrent requests. This is the most important metric for batch processing workloads and for maximizing the utilization of expensive GPU hardware.
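
As a rough illustration of how these three metrics could be computed from request timestamps, here is a minimal Python sketch. The function names and the timestamp layout are assumptions made for this example and do not come from any particular serving framework.

```python
def request_metrics(submit_time, token_times):
    """Return (TTFT, mean per-token latency) for one request.

    submit_time: when the prompt was submitted, in seconds.
    token_times: arrival time of each generated token, in ascending order.
    """
    ttft = token_times[0] - submit_time
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    per_token = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, per_token


def throughput(all_token_times, window_start, window_end):
    """Tokens per second across all requests inside [window_start, window_end)."""
    count = sum(
        1
        for token_times in all_token_times
        for t in token_times
        if window_start <= t < window_end
    )
    return count / (window_end - window_start)


# Two hypothetical requests that each stream one token every 0.25 s.
req_a = [0.5, 0.75, 1.0, 1.25]
req_b = [0.75, 1.0, 1.25, 1.5]
ttft_a, latency_a = request_metrics(0.0, req_a)
tokens_per_second = throughput([req_a, req_b], 0.0, 2.0)
print(ttft_a, latency_a, tokens_per_second)
```

TTFT and per-token latency are measured per request, while throughput is measured across the whole server, which is why the two can move in opposite directions.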

There is a fundamental trade-off between latency and throughput. Increasing batch size improves throughput by amortizing the cost of loading model parameters across more requests, but it worsens latency because every decoding step has to move more data and do more work, so each request waits longer for its next token.
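
The trade-off can be sketched with a deliberately simplified model in which a decoding step is limited purely by memory traffic: the weights are read once per step, and each request adds its own KV cache reads. All numbers below are hypothetical and chosen only for illustration, and compute time is ignored.

```python
def decode_step_seconds(batch_size, weight_bytes, kv_bytes_per_request, bandwidth):
    """Time for one decoding step if it is limited purely by memory traffic."""
    bytes_moved = weight_bytes + batch_size * kv_bytes_per_request
    return bytes_moved / bandwidth


# Hypothetical numbers, not measurements.
WEIGHT_BYTES = 16e9           # e.g. 8B parameters stored in 16-bit precision
KV_BYTES_PER_REQUEST = 0.5e9  # KV cache read per request per step
BANDWIDTH = 3.35e12           # bytes per second of memory bandwidth

rows = []
for batch_size in (1, 8, 64, 256):
    step = decode_step_seconds(
        batch_size, WEIGHT_BYTES, KV_BYTES_PER_REQUEST, BANDWIDTH
    )
    # Each request receives one token per step, so the step time is the
    # per-token latency and batch_size / step is the aggregate throughput.
    rows.append((batch_size, step, batch_size / step))

for batch_size, latency, tokens_per_second in rows:
    print(f"batch={batch_size:4d}  latency={latency * 1e3:7.2f} ms  "
          f"throughput={tokens_per_second:9.1f} tok/s")
```

Under these assumptions both columns rise with batch size: throughput improves because the weight reads are shared, while per-token latency degrades because each step moves more bytes.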

Module 3: Why Inference Is Fundamentally Different From Training

The core difference between training and inference comes down to parallelism. In training, you see all tokens in the sequence at once and can fully parallelize across the sequence dimension.
In inference, you cannot do this. The auto-regressive nature of transformers means you must generate tokens one at a time, and each token depends on all previous tokens.
This leads to drastically different arithmetic intensity. Arithmetic intensity is defined as the number of floating-point operations performed per byte of memory transferred.
For matrix multiplication, arithmetic intensity scales with batch size. On an H100 GPU, the break-even point is a batch size of about 295, which is the ratio of the GPU's peak FLOP/s to its memory bandwidth in bytes per second. Below that batch size you are memory-bound, and your expensive GPU compute sits idle.
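
A small sketch of the arithmetic behind this claim, assuming 16-bit elements and a single matrix multiplication of a (batch x d_in) activation by a (d_in x d_out) weight. The H100 figures used here (989 TFLOP/s of dense 16-bit compute and 3.35 TB/s of memory bandwidth) are commonly quoted specifications and are treated as assumptions, as are the matrix dimensions.

```python
def matmul_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for X (batch x d_in) @ W (d_in x d_out)."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per term
    # Read X and W, write the (batch x d_out) result.
    bytes_moved = bytes_per_elem * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved


# Assumed H100 figures; substitute the values for your own hardware.
PEAK_FLOPS = 989e12      # FLOP/s
MEM_BANDWIDTH = 3.35e12  # bytes/s
break_even = PEAK_FLOPS / MEM_BANDWIDTH  # about 295 FLOPs per byte

for batch in (1, 64, 256, 512):
    intensity = matmul_intensity(batch, 16384, 16384)
    regime = "compute-bound" if intensity >= break_even else "memory-bound"
    print(f"batch={batch:4d}  intensity={intensity:7.1f}  {regime}")
```

When the weight matrix is much larger than the batch, the intensity is approximately equal to the batch size, which is why the break-even batch size matches the hardware ratio.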
Prefill phase: When you first process the input prompt, you can parallelize across all prompt tokens. This phase is compute-bound and has high arithmetic intensity.
Generation phase: When you generate output tokens one at a time, each sequence contributes only a single new token per step, so the effective size along the sequence dimension is 1. This phase is memory-bound and has arithmetic intensity close to 1.
The attention layer is the primary bottleneck in generation. For MLP layers, arithmetic intensity scales with batch size, so you can achieve good utilization with enough concurrent requests. For attention layers, every request has its own KV cache, so the bytes read grow in step with the FLOPs and arithmetic intensity does not scale with batch size at all. This makes attention the fundamental bottleneck.
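
The batch-size independence of attention can be seen from a short derivation. The symbols are assumptions introduced for this sketch: B sequences in the batch, S cached key/value positions, T query positions, D total attention width, 2-byte elements, and a fused kernel that never writes the score matrix to memory.

```latex
\begin{align*}
\text{FLOPs} &= \underbrace{2BSTD}_{QK^\top} + \underbrace{2BSTD}_{\text{scores}\,\cdot\,V} = 4BSTD \\
\text{bytes} &= \underbrace{2BTD}_{Q} + \underbrace{2BSD}_{K} + \underbrace{2BSD}_{V} + \underbrace{2BTD}_{\text{output}} = 4BD(S+T) \\
\text{intensity} &= \frac{4BSTD}{4BD(S+T)} = \frac{ST}{S+T}
\end{align*}
```

B cancels, so batching does not help. In prefill, T = S and the intensity is S/2, which is large for long prompts. In generation, T = 1 and the intensity is S/(S+1), which is always below 1.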

Module 4: The KV Cache

The naive approach to inference would reprocess the entire sequence from scratch for every new token, leading to O(n³) time complexity for generating n tokens.
The KV cache solves this problem by storing the key and value vectors of all previous tokens at every layer so they do not need to be recomputed. This reduces generation time from O(n³) to O(n²).
The KV cache is stored in high-bandwidth memory (HBM) and grows linearly with sequence length, batch size, and model size (number of layers, number of KV heads, and head dimension). For large batches and long sequences, the KV cache can consume more memory than the model parameters themselves.
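
The two complexity figures can be checked with a simple count of query-key interactions. This is a simplified cost model that ignores the hidden dimension and the MLP layers, and the function names are made up for this example.

```python
def attention_work_without_cache(n):
    """Query-key pairs scored when the whole prefix is reprocessed at every step."""
    # Step t runs full attention over t tokens: t queries times t keys.
    return sum(t * t for t in range(1, n + 1))


def attention_work_with_cache(n):
    """Query-key pairs scored when keys and values of earlier tokens are cached."""
    # Step t only scores the single new query against t cached keys.
    return sum(t for t in range(1, n + 1))


n = 1000
naive = attention_work_without_cache(n)  # grows like n**3 / 3
cached = attention_work_with_cache(n)    # grows like n**2 / 2
print(naive, cached, naive / cached)
```

For 1,000 generated tokens the cached version does several hundred times less attention work under this model, and the gap widens linearly with sequence length.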
The size of the KV cache is the primary limiting factor for how many concurrent requests you can serve on a single GPU. Reducing KV cache size is therefore the highest leverage optimization for inference.
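
A back-of-the-envelope size calculation makes the scaling concrete. The model configuration below is hypothetical and was chosen so the numbers come out round; it does not describe any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Total KV cache size: one key and one value vector per token, head, and layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


GIB = 1024 ** 3

# Hypothetical configuration: 32 layers, 32 KV heads of width 128, 16-bit cache.
single = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
batched = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=64)
print(f"batch 1:  {single / GIB:.0f} GiB")
print(f"batch 64: {batched / GIB:.0f} GiB")
```

With this configuration a single 4,096-token request already needs 2 GiB of cache, so a batch of 64 such requests would need 128 GiB, more than the memory of a single 80 GB GPU before counting the weights.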

Module 5: KV Cache Optimization Techniques

All modern inference optimizations revolve around reducing the size of the KV cache without sacrificing model accuracy:

Grouped Query Attention (GQA): Reduces the number of key-value heads while keeping the same number of query heads. This reduces KV cache size by a factor equal to the number of queries per KV head, with minimal accuracy loss.
Multi-Head Latent Attention (MLA): Compresses the key and value vectors into a lower-dimensional latent space before storing them in the cache. DeepSeek V2 uses this technique to shrink what is cached per token from roughly 16,000 dimensions to just 512, with accuracy comparable to full attention.
Cross-Layer Attention: Shares KV cache entries across multiple layers instead of storing separate KV caches for each layer.
Sliding Window Attention: Only stores the most recent K tokens in the cache instead of the entire sequence. This is especially useful for long context workloads, but it reduces the model's ability to access information from earlier in the conversation.
Sparse and Compressed Attention: Uses lightweight models to select only the most important tokens to keep in the cache, compressing or discarding less relevant information.
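
The reduction factors described in this list can be estimated with simple arithmetic. The sketch below is illustrative only: the head counts, window size, and sharing group are hypothetical, and the MLA figure of 16,384 assumes 128 heads of width 128, which is what the rounded "16,000" above appears to refer to.

```python
import math


def gqa_reduction(num_query_heads, num_kv_heads):
    """GQA stores one K/V pair per KV head instead of one per query head."""
    return num_query_heads / num_kv_heads


def latent_reduction(full_dim, latent_dim):
    """MLA-style compression: cache a latent vector instead of the full one."""
    return full_dim / latent_dim


def cross_layer_reduction(num_layers, layers_per_shared_cache):
    """Cross-layer sharing: one cache per group of layers."""
    num_caches = math.ceil(num_layers / layers_per_shared_cache)
    return num_layers / num_caches


def sliding_window_cached_tokens(seq_len, window):
    """Sliding window attention never keeps more than `window` tokens."""
    return min(seq_len, window)


print(gqa_reduction(num_query_heads=32, num_kv_heads=8))
print(latent_reduction(full_dim=16_384, latent_dim=512))
print(cross_layer_reduction(num_layers=32, layers_per_shared_cache=2))
print(sliding_window_cached_tokens(seq_len=100_000, window=4_096))
```

These techniques act on different axes of the cache (heads, vector width, layers, and sequence length), so in principle their savings multiply when they are combined.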

Module 6: Model Compression Techniques

These techniques reduce the overall size of the model, improving both memory usage and inference speed:

Quantization: Reduces the precision of model weights and activations from 16-bit floating point to 8-bit, 4-bit, or even lower. This directly reduces memory usage by a factor equal to the precision reduction.
- Post-training quantization is the simplest approach but can lead to accuracy loss at very low precisions.
- Quantization-aware training simulates quantization errors during training, allowing the model to adapt and maintain accuracy at lower precisions.
- Activation-aware quantization uses activation statistics to identify the small set of weights that matter most and protects them from quantization error, achieving better accuracy for the same average precision.
Pruning: Removes redundant or unimportant neurons and layers from a trained model, then fine-tunes the remaining weights to recover accuracy. NVIDIA has demonstrated that you can prune a 15B parameter model down to 8B parameters with almost no accuracy loss.
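
To make the quantization idea concrete, here is a minimal sketch of the simplest post-training scheme: symmetric 8-bit quantization with a single scale per tensor. It is a toy in pure Python, not how production libraries implement it, and the function names are chosen for this example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0, 0.8]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each weight now occupies 1 byte instead of 2, which halves both the memory footprint and the bytes moved per decoding step. The price is a rounding error of at most half a quantization step per weight, which grows as the bit width shrinks.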

Module 7: Speculative Decoding

Speculative decoding is a lossless acceleration technique that exploits the asymmetry between fast verification and slow generation.
The algorithm works as follows:
1. A small, fast "draft model" generates K candidate tokens in sequence.
2. The large, slow "target model" processes all K candidate tokens in parallel in a single forward pass.
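3. The target model checks the candidates in order and accepts them up to the first rejection. With sampling, a candidate is accepted with a probability that depends on the ratio of target to draft probability rather than by exact match. At the first rejection the target model supplies a corrected token, and if every candidate is accepted it generates one additional token itself.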
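
The verification step can be sketched as follows. This is a minimal illustration of the standard speculative sampling acceptance rule, assuming the draft and target probabilities for every position are already available as plain lists. In a real system they come from model forward passes, and the function name and argument layout here are hypothetical.

```python
import random


def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """Verify K drafted tokens and return the tokens actually emitted.

    draft_probs[i]  : draft distribution over the vocabulary at position i.
    target_probs[i] : target distribution at position i (K + 1 entries).
    draft_tokens[i] : token the draft model sampled at position i.
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        q = draft_probs[i][token]
        p = target_probs[i][token]
        if rng.random() < min(1.0, p / q):
            emitted.append(token)  # accepted: keep the drafted token
            continue
        # Rejected: resample from the normalized residual max(0, p - q),
        # which keeps the output distribution equal to the target's.
        residual = [max(0.0, pi - qi)
                    for pi, qi in zip(target_probs[i], draft_probs[i])]
        emitted.append(rng.choices(range(len(residual)), weights=residual)[0])
        return emitted
    # Every draft accepted: the same target pass yields one bonus token.
    vocab = range(len(target_probs[-1]))
    emitted.append(rng.choices(vocab, weights=target_probs[-1])[0])
    return emitted


rng = random.Random(0)
# Draft and target agree exactly, so both drafted tokens are accepted.
agree = speculative_step([[0.5, 0.5, 0.0]] * 2, [[0.5, 0.5, 0.0]] * 3, [0, 1], rng)
# Target gives the drafted token zero probability, so it is rejected at once.
reject = speculative_step([[1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0]] * 2, [0], rng)
print(agree, reject)
```

Every call emits at least one token, so even a draft model that is always rejected still makes progress at the cost of one target forward pass per token.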
This approach provides speedups of 2-4x in practice with zero change to the output distribution of the target model.
The optimal number of draft tokens is typically 3-4. Too few, and you do not leverage the parallelism of the target model. Too many, and most tokens get rejected, wasting compute.
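
The diminishing return from longer drafts can be illustrated with a simple model that assumes each drafted token is accepted independently with the same probability. The acceptance rate and relative draft cost below are hypothetical values picked for illustration, and the model ignores any overhead beyond the forward passes.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target pass with k drafts, acceptance rate alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, draft_cost):
    """Speedup over plain decoding; draft_cost is draft time / target time per pass."""
    time_per_round = k * draft_cost + 1  # k draft passes plus one target pass
    return expected_tokens_per_pass(alpha, k) / time_per_round


ALPHA = 0.7       # hypothetical acceptance rate
DRAFT_COST = 0.1  # hypothetical: draft model is 10x cheaper per pass

for k in range(1, 9):
    print(f"k={k}  tokens/pass={expected_tokens_per_pass(ALPHA, k):.2f}  "
          f"speedup={estimated_speedup(ALPHA, k, DRAFT_COST):.2f}")

best_k = max(range(1, 11), key=lambda k: estimated_speedup(ALPHA, k, DRAFT_COST))
print("best k:", best_k)
```

The expected number of accepted tokens saturates as k grows, while the drafting cost keeps rising linearly, so the estimated speedup peaks at a small k and then declines.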

Module 8: Systems Optimizations for Dynamic Workloads

Production inference systems face dynamic workloads where requests arrive at different times, have different lengths, and share common prefixes. Two key systems optimizations address these challenges:

Continuous Batching: Instead of processing fixed batches of requests, the system dynamically adds new requests to the batch as old requests finish. This keeps the GPU fully utilized even when request lengths vary widely.
Paged Attention: Treats the KV cache like virtual memory in an operating system, dividing it into fixed-size blocks that can be stored non-contiguously in memory.
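- Nearly eliminates fragmentation of KV cache memory: external fragmentation disappears because all blocks are the same size, and internal fragmentation is limited to the last, partially filled block of each sequence.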
- Allows sharing of KV cache blocks between requests that share common prefixes (like system prompts).
- Implements copy-on-write semantics for efficient generation of multiple responses from the same prompt.
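
The block-table bookkeeping behind paged attention can be sketched in a few lines. This toy tracks only block ownership with reference counts; it stores no tensors and copies no data, and the class and method names are invented for this example rather than taken from vLLM or any other framework.

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks how many sequences use each."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Maps logical block positions to physical block ids (the block table)."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        alloc = self.allocator
        if self.num_tokens % alloc.block_size == 0:
            # Last block is full (or there is none yet): grab any free block.
            self.block_table.append(alloc.allocate())
        elif alloc.refcount[self.block_table[-1]] > 1:
            # Copy-on-write: the partially filled block is shared, so take a
            # private block (a real system would also copy its contents).
            alloc.release(self.block_table[-1])
            self.block_table[-1] = alloc.allocate()
        self.num_tokens += 1

    def fork(self):
        """Start a new sequence that shares every existing block with this one."""
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child


allocator = BlockAllocator(num_blocks=8, block_size=4)
parent = Sequence(allocator)
for _ in range(6):  # a 6-token shared prompt occupies 2 blocks
    parent.append_token()
child = parent.fork()  # second response from the same prompt
child.append_token()   # triggers copy-on-write on the last block
print(parent.block_table, child.block_table, len(allocator.free))
```

After the fork, the full first block stays shared between both sequences, and only the partially filled last block is duplicated when the child writes to it.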

These techniques are implemented in popular inference frameworks like vLLM and TensorRT-LLM, and they can double or triple throughput on real-world workloads.
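
To see where the throughput gain from continuous batching comes from, consider a toy simulation that counts decoding steps under both scheduling policies. It assumes every step takes the same time regardless of how many slots are occupied, and the request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request has finished."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A waiting request takes over a slot as soon as one frees up."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decoding step produces one token per active request
        running = [left - 1 for left in running if left > 1]
    return steps


# Hypothetical workload: two long requests mixed with many short ones.
lengths = [8, 1, 1, 1, 8, 1, 1, 1]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With static batching the short requests leave their slots idle while the long request in the same batch finishes. Continuous batching refills those slots immediately, so the same work completes in fewer steps.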


Video Source and Usage Instructions

• Video Title: Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 10: Inference
• Course Series: Stanford CS336: Language Modeling from Scratch
• Original Platform:
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/EfM546A79aM?si=hw5Hm8mAzQ8ukmH0
