One. Course Details
This is the seventh lecture of Stanford University’s CME 295: Transformers and Large Language Models, taught by twin brothers Afshine and Shervine Amidi. The lecture opens with a concise recap of the previous session on reasoning models and the GRPO reinforcement learning algorithm, then transitions to the most practical topic in the course: connecting LLMs to the external world.
The lecture follows a clear two-part structure. In the first half, Afshine breaks down Retrieval Augmented Generation (RAG), the industry-standard technique for giving LLMs access to up-to-date and private information. He covers the complete RAG pipeline, from knowledge base construction through two-stage retrieval to evaluation metrics. In the second half, Shervine introduces tool calling and agentic workflows, explaining how LLMs can perform actions rather than just generate text. He concludes with a critical discussion of the safety risks associated with autonomous agents.
This lecture represents a pivotal shift from model training fundamentals to real-world deployment, addressing two of the four core weaknesses of vanilla LLMs identified earlier: static knowledge and lack of action capability.
Two. Key Learning Objectives
By the end of this lecture, students will be able to:
Explain the core intuition behind RAG and identify the key reasons why it is preferable to continued pre-training for knowledge injection.
Describe the three fundamental steps of the RAG pipeline and explain the tradeoffs involved in choosing chunk size, embedding dimension, and chunk overlap.
Compare and contrast bi-encoder and cross-encoder architectures and justify their respective roles in the two-stage retrieval process.
Evaluate the performance of a RAG system using standard ranking metrics including NDCG, MRR, Precision@k, and Recall@k.
Explain the three-stage tool calling workflow and describe both supervised and zero-shot approaches to training tool use capabilities.
Describe the ReAct agent framework and explain how it decomposes complex tasks into iterative observe-plan-act loops.
Identify the primary safety risks associated with tool-enabled LLMs and autonomous agents and evaluate common mitigation strategies.
Three. Memorable Course Quotes
"RAG stands for Retrieval Augmented Generation, and the most important word in that name is augmented. We only add relevant information to the prompt, not everything."
"The needle in a haystack test proves a hard truth: more context is not always better context. Irrelevant information confuses LLMs just as much as it confuses humans."
"Tool calling is what allows LLMs to stop just talking and start doing. It turns them from passive information sources into active problem solvers."
"Agents are systems that autonomously pursue goals and complete tasks on a user's behalf. They add reasoning and iteration on top of basic tool calling."
"With great power comes great responsibility. The same agentic capabilities that can write your code can also exfiltrate your data if not properly secured."
Four. Detailed Lecture Notes
4.1 Retrieval Augmented Generation (RAG) Overview
The lecture begins by revisiting the second core weakness of vanilla LLMs: static knowledge trapped behind a fixed pre-training cutoff date. An LLM knows only what it was trained on and cannot answer questions about events that happened after its cutoff date.
4.1.1 Why Not Just Retrain?
Neither continued pre-training nor simply placing all available knowledge in the prompt is a practical way to keep an LLM's knowledge current. The first two reasons below concern retraining; the last three concern very long prompts:
- Catastrophic forgetting: Changing model weights to add new knowledge often causes the model to forget previously learned information.
- Maintenance overhead: If you have multiple fine-tuned models for different use cases, you would need to retrain all of them every time you want to update knowledge.
- Context window limitations: Even the largest context windows cannot hold all human knowledge.
- Attention degradation: The "needle in a haystack" test demonstrates that LLMs struggle to retrieve relevant information from very long prompts, especially when the information is located in the middle of the context.
- Cost: LLM API calls are priced per token, so unnecessarily long prompts quickly become expensive.
4.1.2 The RAG Solution
RAG addresses these problems in three steps:
- Retrieve: Given a user query, fetch only the most relevant pieces of information from an external knowledge base.
- Augment: Insert the retrieved information directly into the user's prompt.
- Generate: Feed the augmented prompt to the LLM, which then generates an answer based on the provided context.
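The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the word-overlap retriever stands in for a real retriever, generate() is a placeholder for an actual LLM call, and all function names and sample sentences are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Step 1: return the k chunks that share the most words with the query.

    Word overlap is an assumption used only to keep the sketch dependency-free.
    """
    q = tokenize(query)
    ranked = sorted(knowledge_base, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def augment(query: str, chunks: list[str]) -> str:
    """Step 2: insert the retrieved chunks into the prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Step 3: placeholder for an LLM call (hypothetical; no real API is used)."""
    return f"[LLM answer conditioned on {len(prompt)} prompt characters]"


knowledge_base = [
    "The 2024 product launch took place in March.",
    "Refunds are processed within 14 days.",
    "The office is closed on public holidays.",
]
query = "When are refunds processed?"
chunks = retrieve(query, knowledge_base, k=1)
prompt = augment(query, chunks)
print(prompt)
print(generate(prompt))
```

Only the refund sentence reaches the prompt here, which is the point of the "augmented" step: the model sees the relevant chunk rather than the whole knowledge base.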
4.2 Building a RAG System
A production RAG system has two main phases: offline knowledge base construction and online inference.
4.2.1 Knowledge Base Construction
The first step in building any RAG system is creating a searchable knowledge base:
- Document collection: Gather all documents that may contain relevant information.
- Chunking: Split documents into smaller pieces called chunks, typically 200-1000 tokens long.
  - Too small: Chunks lose context and meaning.
  - Too large: Embeddings become unfocused and retrieval quality degrades.
- Overlap: Add 50-200 tokens of overlap between consecutive chunks to preserve context across chunk boundaries.
- Embedding: Compute a numerical vector embedding for each chunk using an encoder model.
Three hyperparameters drive the tradeoffs in this phase:
- Embedding dimension: Typically 768-1536 dimensions. Larger dimensions capture more nuance but require more storage and compute.
- Chunk size: The single most impactful hyperparameter for RAG performance.
- Chunk overlap: Prevents information loss at chunk boundaries.
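To make chunk size and overlap concrete, here is a minimal sliding-window chunker. It is a sketch under stated assumptions: the input is already tokenized (a real system would use the embedding model's tokenizer), and the function name chunk_tokens and the tiny sizes in the example are hypothetical choices for readability.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(chunk)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary is still seen whole by at least one chunk when the overlap is large enough.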
4.2.2 Two-Stage Retrieval
Retrieval itself follows a two-stage process borrowed from search and recommendation systems.
Stage 1: Candidate Retrieval
The goal of this stage is to quickly filter millions of chunks down to a manageable set of 50-200 potentially relevant candidates. It prioritizes recall over precision.
The standard approach is semantic similarity search using bi-encoders:
- Both the query and each chunk are encoded into embeddings using separate passes through the same encoder model.
- Similarity is measured using cosine similarity between the query embedding and chunk embeddings.
- Approximate Nearest Neighbor (ANN) algorithms are used to make this search efficient at scale.
Several techniques complement or refine pure semantic search:
- BM25: A heuristic keyword-based scoring function that ensures documents containing exact keywords from the query are retrieved.
- Hybrid search: Combines semantic similarity scores and BM25 scores to get the best of both approaches.
- HyDE (Hypothetical Document Embeddings): First generate a fake hypothetical answer to the query, then embed this fake answer to retrieve relevant chunks.
- Contextual retrieval: Prepend a short LLM-generated context summary to each chunk to improve embedding quality.
- Prompt caching: Reuse computations for repeated prompt prefixes to reduce costs by up to 90% for bulk operations.
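The BM25 and hybrid search ideas can be sketched as follows. This is one common formulation rather than the lecture's own: the smoothed IDF term, the defaults k1=1.5 and b=0.75, and the min-max normalization with weight alpha are all assumptions, and the semantic scores in the example are made-up numbers standing in for cosine similarities from an embedding model.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document against the tokenized query with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def min_max(xs: list[float]) -> list[float]:
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 for _ in xs] if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def hybrid_scores(semantic: list[float], keyword: list[float],
                  alpha: float = 0.5) -> list[float]:
    """Weighted sum of normalized semantic and keyword scores."""
    s, kw = min_max(semantic), min_max(keyword)
    return [alpha * a + (1 - alpha) * c for a, c in zip(s, kw)]


docs = [
    "error code e42 on startup".split(),
    "how to restart the service".split(),
    "troubleshooting startup failures".split(),
]
keyword = bm25_scores(["e42"], docs)
semantic = [0.2, 0.1, 0.9]  # hypothetical embedding similarities
hybrid = hybrid_scores(semantic, keyword, alpha=0.5)
best = max(range(len(docs)), key=lambda i: hybrid[i])
print(best)  # 0: the exact keyword match wins once BM25 is blended in
```

With alpha=1.0 the ranking falls back to the semantic scores alone and the third document wins, which illustrates why exact identifiers such as error codes benefit from the keyword signal.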
Stage 2: Re-ranking
The second stage re-scores the candidates from Stage 1 to produce the final ordering. Re-ranking uses cross-encoders:
- Both the query and a single chunk are fed into the encoder together, allowing the model to capture fine-grained interactions between them.
- The model outputs a single relevance score for the query-chunk pair.
- Cross-encoders are much more accurate than bi-encoders but also much slower, which is why they are only used on the small candidate set.
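The division of labor between the two stages can be illustrated with a small sketch. Everything model-related here is a stand-in: the 2-D vectors play the role of precomputed bi-encoder embeddings, and toy_pair_scorer is a hypothetical substitute for a cross-encoder that simply counts shared words. Only the control flow (cheap vector search first, expensive pairwise scoring on the survivors) reflects the technique described above.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def stage1_candidates(query_vec, chunk_vecs, n_candidates: int) -> list[int]:
    """Bi-encoder stage: rank chunk indices by similarity of independent embeddings."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]), reverse=True)
    return ranked[:n_candidates]


def stage2_rerank(query: str, chunks: list[str], candidate_ids: list[int],
                  score_pair, top_k: int) -> list[int]:
    """Cross-encoder stage: score each (query, chunk) pair jointly, keep top_k."""
    ranked = sorted(candidate_ids,
                    key=lambda i: score_pair(query, chunks[i]), reverse=True)
    return ranked[:top_k]


def toy_pair_scorer(query: str, chunk: str) -> float:
    """Hypothetical cross-encoder stand-in: fraction of query words in the chunk."""
    q = query.lower().split()
    c = set(chunk.lower().split())
    return sum(w in c for w in q) / len(q)


chunks = [
    "reset your password from the login page",
    "password rules require twelve characters",
    "billing is charged monthly",
]
chunk_vecs = [[0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]  # assumed precomputed offline
query, query_vec = "how to reset password", [1.0, 0.05]

candidates = stage1_candidates(query_vec, chunk_vecs, n_candidates=2)
final = stage2_rerank(query, chunks, candidates, toy_pair_scorer, top_k=1)
print(candidates, final)  # [1, 0] [0]
```

Stage 1 narrowly prefers the second chunk, but the pairwise scorer, which sees query and chunk together, promotes the chunk that actually answers the question.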
4.2.3 RAG Evaluation
Evaluating ranking systems requires specialized metrics that account for the position of relevant documents:
- NDCG (Normalized Discounted Cumulative Gain): The gold standard ranking metric. It rewards ranking relevant documents higher and is normalized to a 0-1 scale, where 1 means the ranking matches the ideal ordering.
- MRR (Mean Reciprocal Rank): The inverse of the rank of the first relevant document, averaged over all queries. Simple and intuitive.
- Precision@k: The proportion of documents in the top k results that are relevant.
- Recall@k: The proportion of all relevant documents that are retrieved in the top k results.
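The four metrics can be written out directly from their definitions. This sketch assumes the linear-gain form of DCG (gain divided by log2 of rank + 1); another common variant uses 2^gain - 1 in the numerator. Function names and document ids are hypothetical.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(d in relevant for d in ranked[:k]) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    return sum(d in relevant for d in ranked[:k]) / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranking, relevant set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked: list[str], gain_by_doc: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [gain_by_doc.get(d, 0.0) for d in ranked]
    ideal = sorted(gain_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 3))   # 1 of the top 3 is relevant
print(recall_at_k(ranked, relevant, 3))      # 1 of the 2 relevant docs is found
print(reciprocal_rank(ranked, relevant))     # first relevant doc is at rank 2
print(ndcg_at_k(ranked, {"d1": 1.0, "d2": 1.0}, 3))
```

In the example, the first relevant document sits at rank 2, so the reciprocal rank is 0.5, while NDCG@3 is well below 1 because the ideal ordering would place both relevant documents in the top two positions.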
4.3 Tool Calling
Tool calling extends RAG by allowing LLMs to interact with structured data and perform actions rather than just read unstructured text.
4.3.1 What is Tool Calling?
Tool calling allows LLMs to dynamically invoke external functions to complete tasks. A tool is simply a function with a well-defined API, input parameters, and output format.
The standard tool calling workflow has three stages:
- Tool selection and argument parsing: The LLM analyzes the user query and decides which tool to call and what arguments to pass to it.
- Tool execution: The function is executed on the server, not by the LLM itself.
- Response generation: The tool's output is fed back to the LLM, which synthesizes a natural language response for the user.
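The three stages can be traced in a small sketch. No real LLM or weather API is involved: fake_llm_select_tool and fake_llm_respond are hypothetical stand-ins for the two model calls, get_weather returns canned data, and the JSON shape with "tool" and "arguments" keys is an assumed convention rather than any particular vendor's format.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool: returns canned data instead of calling a real API."""
    return {"city": city, "temp_c": 18}


# Registry mapping tool names to the functions the application may run.
TOOLS = {"get_weather": get_weather}


def fake_llm_select_tool(user_query: str) -> str:
    """Stage 1 stand-in: a real LLM would produce this structured call itself."""
    return json.dumps({"tool": "get_weather", "arguments": {"city": "Paris"}})


def execute_tool_call(raw_call: str) -> dict:
    """Stage 2: the application, not the LLM, parses and executes the call."""
    call = json.loads(raw_call)
    name = call["tool"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


def fake_llm_respond(user_query: str, tool_result: dict) -> str:
    """Stage 3 stand-in: turn the tool output into a user-facing sentence."""
    return f"It is currently {tool_result['temp_c']} degrees Celsius in {tool_result['city']}."


user_query = "What is the weather in Paris?"
raw_call = fake_llm_select_tool(user_query)   # stage 1: selection and arguments
result = execute_tool_call(raw_call)          # stage 2: execution on the server
print(fake_llm_respond(user_query, result))   # stage 3: response generation
```

Keeping execution behind a registry lookup means the model can only request tools the application has explicitly exposed.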
4.3.2 Training Tool Use
There are two main approaches to giving an LLM tool use capabilities:
- Supervised Fine-Tuning (SFT): Train the model on a dataset of (query, tool call, response) triples. This produces the most reliable results but requires significant data collection.
- Zero-shot prompting: Modern LLMs have strong code understanding capabilities that allow them to use tools without specific fine-tuning, given clear documentation and instructions. Advanced techniques use reasoning models to automatically generate and refine these instructions.
4.3.3 Scaling Tool Use
As the number of tools grows beyond a handful, two key challenges emerge:
- Tool selection: It becomes impractical to fit all tool definitions in the context window. The solution is a two-stage process where a lightweight router first selects only the relevant tools for a given query.
- Standardization: Different LLMs have different conventions for defining tools. The Model Context Protocol (MCP) from Anthropic is emerging as the industry standard for defining and exposing tools to LLMs.
4.4 LLM Agents
Agents represent the next evolution of tool-enabled LLMs, adding autonomous reasoning and iteration.
4.4.1 What is an Agent?
An agent is a system that autonomously pursues goals and completes tasks on a user's behalf. Unlike basic tool calling, which is typically a single step, agents can perform multiple iterations of reasoning and action to solve complex problems.
4.4.2 The ReAct Framework
The most widely used agent architecture is ReAct (Reasoning + Acting), which decomposes problem solving into a continuous loop of three steps:
- Observe: Process the current state of the world, including any new information from tool calls.
- Plan: Decide what to do next based on the observation and the overall goal.
- Act: Execute the chosen action using a tool.
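The observe-plan-act loop can be sketched as a short driver function. This is a simplified illustration, not a full ReAct implementation: the planner is a scripted function standing in for the LLM's reasoning step, the "add" tool is a toy, and the step limit, names, and history format are assumptions.

```python
def run_agent(goal: str, plan, tools: dict, max_steps: int = 5):
    """Run an observe-plan-act loop until the planner finishes or steps run out."""
    observation = f"Goal: {goal}"
    history = []
    for _ in range(max_steps):
        history.append(("observe", observation))        # Observe
        action, argument = plan(goal, history)           # Plan
        history.append(("plan", f"{action}({argument!r})"))
        if action == "finish":
            return argument, history
        observation = str(tools[action](argument))       # Act
    return None, history  # step budget exhausted without an answer


def scripted_plan(goal: str, history: list):
    """Hypothetical planner: call the adder once, then report its result."""
    observations = [text for kind, text in history if kind == "observe"]
    if len(observations) == 1:
        return "add", "17+25"
    return "finish", f"The answer is {observations[-1]}"


# Toy tool: sums integers separated by '+'.
tools = {"add": lambda expr: sum(int(x) for x in expr.split("+"))}

answer, history = run_agent("What is 17 + 25?", scripted_plan, tools)
print(answer)  # The answer is 42
for kind, text in history:
    print(kind, "->", text)
```

The max_steps budget is a simple guard against an agent that never decides to finish, which becomes relevant once the planner is a real model rather than a script.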
4.4.3 Multi-Agent Systems
For very complex tasks, it is often useful to have multiple specialized agents that work together. The Agent2Agent protocol from Google standardizes how agents can communicate and collaborate with each other.
4.4.4 Safety Considerations
Autonomous agents introduce significant new safety risks that do not exist with vanilla LLMs:
- Data exfiltration: Malicious prompts can trick agents into sending sensitive data to external parties.
- Unauthorized actions: Agents can perform real-world actions like sending emails, making purchases, or modifying infrastructure.
- Goal misalignment: Agents may find unintended ways to achieve the stated goal that are harmful or undesirable.
Common mitigation strategies include:
- Training-time safety alignment with harmlessness objectives.
- Inference-time safety classifiers that scan all inputs and outputs.
- Human-in-the-loop approval for high-stakes actions.
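As an illustration of human-in-the-loop approval, the sketch below wraps tool execution in a gate that asks for confirmation before any action marked as high-stakes. The set of high-stakes tool names, the approve callback, and the returned status dictionaries are all hypothetical design choices, not something prescribed in the lecture.

```python
# Hypothetical list of actions that must never run without explicit approval.
HIGH_STAKES = {"send_email", "make_purchase", "delete_resource"}


def guarded_execute(tool_name: str, arguments: dict, tools: dict, approve) -> dict:
    """Run a tool, but require approve(tool_name, arguments) for high-stakes ones."""
    if tool_name not in tools:
        return {"status": "blocked", "reason": "unknown tool"}
    if tool_name in HIGH_STAKES and not approve(tool_name, arguments):
        return {"status": "blocked", "reason": "human approval denied"}
    return {"status": "ok", "result": tools[tool_name](**arguments)}


tools = {
    "search_docs": lambda query: f"results for {query}",
    "send_email": lambda to, body: f"sent to {to}",
}


def deny_all(tool_name: str, arguments: dict) -> bool:
    """Stand-in for a human reviewer who rejects every request."""
    return False


print(guarded_execute("search_docs", {"query": "rag"}, tools, deny_all))
print(guarded_execute("send_email", {"to": "a@example.com", "body": "hi"}, tools, deny_all))
```

Low-risk tools run unimpeded, while the email request is blocked until a reviewer approves it. In an interactive setting the approve callback would prompt a person instead of returning a fixed value.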


