One. Course Details
This is the eighth lecture of Stanford University's CME 295: Transformers and Large Language Models, taught by twin brothers Afshine and Shervine Amidi. The lecture opens by emphasizing that evaluation is the most critical topic of the entire course: without the ability to measure LLM performance, there is no way to systematically improve models.
The lecture follows a logical progression from traditional evaluation methods to modern approaches. It begins with human evaluation and the challenges of inter-rater agreement, then covers classic rule-based metrics like BLEU and ROUGE, before introducing the dominant modern technique: LLM-as-a-Judge. The second half of the lecture focuses on evaluating agentic workflows, including a detailed breakdown of common failure modes, and concludes with a survey of standard LLM benchmarks across five core categories.
This lecture addresses the fourth and final core weakness of vanilla LLMs identified earlier in the course: the difficulty of evaluating free-form text outputs.
Two. Key Learning Objectives
By the end of this lecture, students will be able to:
- Explain the limitations of human evaluation for LLMs and calculate inter-rater agreement using metrics like Cohen's kappa.
- Compare and contrast rule-based evaluation metrics including METEOR, BLEU, and ROUGE, and identify their key strengths and weaknesses.
- Describe the LLM-as-a-Judge framework and explain how structured outputs ensure parsable and consistent evaluations.
- Identify and mitigate the three most common biases in LLM-as-a-Judge evaluations: position bias, verbosity bias, and self-enhancement bias.
- Implement a factuality verification pipeline that decomposes text into verifiable claims and aggregates results.
- Diagnose and troubleshoot seven common failure modes in tool calling and agentic workflows.
- Categorize standard LLM benchmarks into five core types and explain what capabilities each benchmark measures.
Three. Memorable Course Quotes
"If we cannot measure the performance of our LLM, we do not really know what to improve."
"LLM-as-a-Judge is revolutionary not just because it can score outputs, but because it can explain why it gave that score, something no rule-based metric can do."
"When a measure becomes a target, it ceases to be a good measure. This is Goodhart's law, and it applies more to LLMs than to any other field I know."
"Agent evaluation is fundamentally harder than evaluating single outputs because you have to measure an entire iterative process, not just a final result."
"Benchmarks are useful for characterizing the profile of an LLM, but they will never replace your own experience using the model for your specific task."
Four. Detailed Lecture Notes
4.1 Human Evaluation and Inter-Rater Agreement
The ideal evaluation scenario would have a human rate every LLM output, but this approach has three critical limitations:
- Cost: Human evaluation is extremely expensive at scale.
- Speed: Humans cannot evaluate thousands of outputs quickly enough for iterative model development.
- Subjectivity: Different humans often disagree on the quality of the same output, especially for subjective tasks.
4.1.1 Inter-Rater Agreement Metrics
To address subjectivity, researchers use inter-rater agreement metrics to quantify how consistent human raters are:
- Simple agreement rate: The proportion of items on which two raters give the same rating. This is misleading on imbalanced datasets: when one label dominates, two raters can agree most of the time purely by chance.
- Cohen's kappa: Corrects the observed agreement rate p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A kappa of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.
- Fleiss's kappa: Extends Cohen's kappa to more than two raters.
- Krippendorff's alpha: The most general metric, supporting any number of raters and any type of rating scale.
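To make the difference between raw agreement and Cohen's kappa concrete, here is a minimal sketch in Python. The two rating lists are invented toy data, and the function names are illustrative rather than taken from the lecture.

```python
from collections import Counter

def agreement_rate(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = agreement_rate(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both raters independently pick the same label.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1 - p_e)

# Imbalanced toy data: both raters label almost everything "good".
rater_1 = ["good"] * 8 + ["bad", "good"]
rater_2 = ["good"] * 8 + ["good", "bad"]

print(agreement_rate(rater_1, rater_2))  # 0.8
print(cohens_kappa(rater_1, rater_2))    # about -0.11
```

The raters agree on 80% of items, yet kappa is slightly negative: they never agree on which output is "bad", so their agreement is no better than what the label imbalance alone would produce.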
4.2 Rule-Based Evaluation Metrics
Rule-based metrics compare LLM outputs to human-written reference outputs using string matching techniques. They were the standard for NLP evaluation before the rise of modern LLMs.
4.2.1 Common Rule-Based Metrics
| Metric | Full Name | Primary Use Case | Core Idea |
|---|---|---|---|
| METEOR | Metric for Evaluation of Translation with Explicit Ordering | Machine translation | F-score of unigram matches with a penalty for incorrect word order |
| BLEU | Bilingual Evaluation Understudy | Machine translation | Precision of n-gram matches with a brevity penalty for short outputs |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation | Text summarization | Recall of n-gram matches between output and reference |
4.2.2 Limitations of Rule-Based Metrics
All rule-based metrics share three fundamental flaws:
- No support for stylistic variation: They cannot recognize that two sentences with completely different words can have identical meaning.
- Poor correlation with human judgment: Even the best rule-based metrics correlate only weakly with how humans actually rate output quality.
- Require human references: You still need humans to write reference outputs for every prompt, which is expensive and time-consuming.
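The n-gram matching behind these metrics, and the stylistic-variation weakness, can be illustrated with a simplified sketch. This is not full BLEU or ROUGE (there is no brevity penalty, no combination of n-gram orders, and no stemming); it only computes clipped n-gram precision and recall on whitespace tokens, and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(candidate, reference, n=1):
    """Clipped count of candidate n-grams that also appear in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    return sum(min(count, ref[gram]) for gram, count in cand.items())

def ngram_precision(candidate, reference, n=1):
    """BLEU-style view: share of candidate n-grams found in the reference."""
    total = sum(ngrams(candidate, n).values())
    return overlap(candidate, reference, n) / total if total else 0.0

def ngram_recall(candidate, reference, n=1):
    """ROUGE-style view: share of reference n-grams found in the candidate."""
    total = sum(ngrams(reference, n).values())
    return overlap(candidate, reference, n) / total if total else 0.0

reference = "the cat sat on the mat".split()
exact_copy = "the cat sat on the mat".split()
paraphrase = "a feline rested upon a rug".split()

print(ngram_precision(exact_copy, reference), ngram_recall(exact_copy, reference))  # 1.0 1.0
print(ngram_precision(paraphrase, reference), ngram_recall(paraphrase, reference))  # 0.0 0.0
```

The paraphrase carries essentially the same meaning as the reference but shares no words with it, so both scores drop to zero.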
4.3 LLM-as-a-Judge: The Modern Standard
LLM-as-a-Judge is the dominant evaluation technique for modern LLMs. It uses a powerful LLM to evaluate the outputs of another LLM.
4.3.1 How LLM-as-a-Judge Works
The standard workflow is:
- Provide the judge LLM with the original prompt, the model's response, and clear evaluation criteria.
- Ask the judge to first output a detailed rationale explaining its evaluation, then output a score.
- Use structured outputs to guarantee that the response follows a parsable format with exactly the required fields.
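The workflow above can be sketched as follows. Everything here is an assumption made for illustration: the prompt template, the `Verdict` schema, and `call_llm`, a hypothetical function that takes a prompt string and returns the judge's raw reply. Real structured-output features constrain the model at generation time; this sketch only validates the reply after the fact.

```python
import json
from dataclasses import dataclass

@dataclass
class Verdict:
    rationale: str
    passed: bool

# Hypothetical template: the rationale field comes before the verdict.
JUDGE_TEMPLATE = """You are grading a model response.
Criteria: {criteria}
Prompt: {prompt}
Response: {response}
Reply with JSON only, rationale first:
{{"rationale": "<why>", "passed": true or false}}"""

def build_judge_prompt(prompt, response, criteria):
    return JUDGE_TEMPLATE.format(criteria=criteria, prompt=prompt, response=response)

def parse_verdict(raw):
    """Parse the judge's reply and reject anything outside the expected schema."""
    data = json.loads(raw)
    if set(data) != {"rationale", "passed"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if not isinstance(data["rationale"], str) or not isinstance(data["passed"], bool):
        raise ValueError("wrong field types")
    return Verdict(rationale=data["rationale"], passed=data["passed"])

def judge(call_llm, prompt, response, criteria):
    """call_llm is a hypothetical callable: prompt string in, raw reply string out."""
    return parse_verdict(call_llm(build_judge_prompt(prompt, response, criteria)))

# Stand-in for a real model call, used only to make the sketch runnable.
def fake_llm(_prompt):
    return '{"rationale": "The answer states the correct capital.", "passed": true}'

verdict = judge(fake_llm, "Capital of France?", "Paris.", "Answer must be factually correct.")
print(verdict.passed, "-", verdict.rationale)
```

Asking for the rationale before the verdict means the judge commits to its reasoning before it produces the score.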
4.3.2 Common Biases and Mitigations
LLM-as-a-Judge is not perfect and suffers from three well-documented biases:
- Position bias: The judge tends to prefer whichever response is presented first in pairwise comparisons.
  - Mitigation: Run the evaluation with the responses in both orders and aggregate the verdicts, for example by majority vote over repeated runs. A preference that flips when the order is swapped should count as a tie.
- Verbosity bias: The judge tends to prefer longer, more detailed responses even when they are not more correct.
  - Mitigation: Explicitly instruct the judge to ignore verbosity, provide few-shot examples of concise high-quality responses, or apply a length penalty to scores.
- Self-enhancement bias: A model will tend to prefer outputs that it generated itself over outputs from other models.
  - Mitigation: Use a different model as the judge than the one being evaluated, preferably a larger and more capable one.
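One way to implement the order-swap mitigation for position bias is sketched below. `judge_pair` is a hypothetical callable that returns "first" or "second" for a pairwise comparison, and treating an order-dependent verdict as a tie is an assumption of this sketch rather than something the lecture prescribes. The two toy judges exist only to exercise the logic.

```python
def compare_both_orders(judge_pair, prompt, response_a, response_b):
    """Judge A vs. B in both presentation orders and keep only consistent verdicts."""
    first_run = judge_pair(prompt, response_a, response_b)   # A shown first
    second_run = judge_pair(prompt, response_b, response_a)  # B shown first
    a_wins_run1 = first_run == "first"
    a_wins_run2 = second_run == "second"
    if a_wins_run1 and a_wins_run2:
        return "A"
    if not a_wins_run1 and not a_wins_run2:
        return "B"
    return "tie"  # the verdict flipped with the order: likely position bias

# Toy judge that is purely position-biased.
def always_first(prompt, first, second):
    return "first"

# Toy judge with an order-independent preference (shorter answer wins).
def prefers_shorter(prompt, first, second):
    return "first" if len(first) <= len(second) else "second"

a, b = "Paris.", "I believe the capital is probably Paris."
print(compare_both_orders(always_first, "Capital of France?", a, b))     # tie
print(compare_both_orders(prefers_shorter, "Capital of France?", a, b))  # A
```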
4.3.3 Best Practices for LLM-as-a-Judge
- Use binary scoring (pass/fail) rather than granular scales. Binary judgments are easier for both humans and LLMs to make consistently.
- Write crisp, unambiguous evaluation guidelines that leave no room for interpretation.
- Use a low temperature (0.1-0.2) for the judge to ensure reproducible results.
- Calibrate the judge against human ratings on a small subset of examples to ensure it aligns with human preferences.
- Never optimize exclusively against LLM-as-a-Judge scores; always verify improvements with human evaluation periodically.
4.3.4 Factuality Evaluation
Factuality is one of the most important and challenging dimensions to evaluate. The standard approach is:
- Decompose: Use an LLM to break the output into a list of individual factual claims.
- Verify: Check each claim against a trusted knowledge source using RAG or web search.
- Aggregate: Calculate an overall factuality score by weighting the importance of each correct and incorrect claim.
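The decompose, verify, aggregate steps can be sketched as a small pipeline. The three callables are hypothetical placeholders: in practice `decompose` would be an LLM call and `verify` would combine retrieval or web search with a check, while here they are toy stand-ins (sentence splitting and lookup in a hand-written set of trusted facts) so that the sketch runs on its own.

```python
def factuality_score(text, decompose, verify, weight=lambda claim: 1.0):
    """Weighted share of claims that are supported by the trusted source.

    decompose(text) -> list of claim strings
    verify(claim)   -> True if the claim is supported
    weight(claim)   -> importance of the claim (uniform by default)
    """
    claims = decompose(text)
    results = [(claim, verify(claim), weight(claim)) for claim in claims]
    total = sum(w for _, _, w in results)
    supported = sum(w for _, ok, w in results if ok)
    score = supported / total if total else None  # None: nothing to verify
    return score, results

# Toy stand-ins, for illustration only.
TRUSTED = {"Paris is the capital of France", "The Seine flows through Paris"}

def toy_decompose(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def toy_verify(claim):
    return claim in TRUSTED

text = ("Paris is the capital of France. The Seine flows through Paris. "
        "Paris has 50 million residents.")
score, details = factuality_score(text, toy_decompose, toy_verify)
print(round(score, 2))  # 0.67
for claim, ok, _ in details:
    print("supported" if ok else "unsupported", "-", claim)
```

Passing a non-uniform `weight` function lets a central claim count for more than a peripheral one.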
4.4 Agent Evaluation and Failure Modes
Evaluating agents is significantly harder than evaluating single outputs because agents perform multiple iterative steps of reasoning and action. The lecture breaks down seven common failure modes in tool calling and agentic workflows:
4.4.1 Tool Prediction Errors
- Failure to call a tool: The model does not use an available tool when it should, instead giving a wrong answer or punting on the question.
  - Causes: Tool router recall errors, insufficient SFT examples, unclear instructions.
- Tool hallucination: The model calls a function that does not exist.
  - Causes: Weak model, unclear tool descriptions, missing top-level instructions about using only provided tools.
- Wrong tool selection: The model calls a tool that is not appropriate for the task.
  - Causes: Overlapping tool scopes, ambiguous tool descriptions.
- Incorrect arguments: The model calls the right tool but passes invalid or incorrect arguments.
  - Causes: Missing context information, unclear argument descriptions.
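Two of these failure modes, tool hallucination and invalid arguments, can be detected mechanically by checking each predicted call against the tool registry. The sketch below assumes a hypothetical registry format and two made-up tools. It cannot detect a missing tool call, a wrong but existing tool, or arguments that are well-typed but semantically wrong; those need labeled expected calls to compare against.

```python
# Hypothetical tool registry: argument names mapped to expected Python types.
TOOLS = {
    "get_weather": {"required": {"city": str}, "optional": {"unit": str}},
    "search_flights": {"required": {"origin": str, "destination": str}, "optional": {}},
}

def check_tool_call(name, arguments, tools=TOOLS):
    """Return a list of problems with a predicted tool call (empty list means valid)."""
    if name not in tools:
        return [f"tool hallucination: '{name}' is not a registered tool"]
    spec = tools[name]
    allowed = {**spec["required"], **spec["optional"]}
    problems = []
    for arg in spec["required"]:
        if arg not in arguments:
            problems.append(f"missing required argument '{arg}'")
    for arg, value in arguments.items():
        if arg not in allowed:
            problems.append(f"unknown argument '{arg}'")
        elif not isinstance(value, allowed[arg]):
            problems.append(f"argument '{arg}' should be {allowed[arg].__name__}")
    return problems

print(check_tool_call("get_weather", {"city": "Paris"}))                           # []
print(check_tool_call("get_forecast", {"city": "Paris"}))                          # hallucinated tool
print(check_tool_call("search_flights", {"origin": "SFO", "destination": 42}))     # wrong type
```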
4.4.2 Tool Execution Errors
- Tool returns error: The tool itself has a bug and returns an error message.
  - Best practice: Always return structured error messages that explain what went wrong, not raw exceptions.
- Tool returns no response: The tool executes successfully but returns nothing.
  - Best practice: Always return a structured response, even if it is just an empty JSON object indicating success.
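Both best practices can be applied in one place by wrapping every tool in a decorator that always returns a structured result. This is a minimal sketch: the `status`, `error_type`, `message`, and `result` field names are assumptions, and `divide` and `delete_record` are made-up tools.

```python
import functools

def structured_tool(func):
    """Wrap a tool so it always returns a JSON-serializable dict, never a raw exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            # Explain what went wrong in a form the model can read and act on.
            return {
                "status": "error",
                "error_type": type(exc).__name__,
                "message": str(exc),
            }
        # Even a tool with no return value reports explicit success.
        return {"status": "ok", "result": result}
    return wrapper

@structured_tool
def divide(a, b):
    return a / b

@structured_tool
def delete_record(record_id):
    return None  # succeeds without producing a value

print(divide(6, 3))
print(divide(1, 0))
print(delete_record(7))
```

With this wrapper the model can distinguish "the tool failed" from "the tool succeeded and had nothing to return", which an empty response would leave ambiguous.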
4.4.3 Response Synthesis Errors
- Failure to ground on tool output: The model ignores the results from the tool and generates a response based on its own knowledge.
  - Causes: Weak model, too much irrelevant information in the tool output.
4.5 Standard LLM Benchmarks
Benchmarks are standardized test suites used to compare the performance of different LLMs. They fall into five core categories:
4.5.1 Knowledge Benchmarks
Measure the model's ability to recall factual information across diverse domains.
- MMLU (Massive Multitask Language Understanding): The most widely used knowledge benchmark, with a test set of about 14,000 multiple-choice questions across 57 subjects ranging from elementary science to advanced law and medicine.
4.5.2 Reasoning Benchmarks
Measure the model's ability to solve multi-step problems.
- AIME (American Invitational Mathematics Examination): A challenging high school math competition benchmark that requires advanced multi-step reasoning.
- PIQA (Physical Interaction Question Answering): Measures common sense reasoning about physical objects and everyday situations.
4.5.3 Coding Benchmarks
Measure the model's ability to write and understand code.
- SWE-bench: A realistic software engineering benchmark that asks models to fix real GitHub issues in popular Python repositories. Performance is measured by whether the repository's tests pass once the model's patch is applied, including the tests that failed before the fix.
4.5.4 Safety Benchmarks
Measure the model's ability to refuse harmful requests.
- HarmBench: A comprehensive safety benchmark covering standard harmful behavior, copyright violations, contextual harm, and multimodal harm.
4.5.5 Agent Benchmarks
Measure the performance of tool-enabled agents.
- Tau-Bench: A benchmark for tool agents in the airline and retail domains. It uses LLM-simulated users to interact with agents and measures success rate based on database state changes. It introduces the pass^k metric, which measures the probability that all k attempts at a task succeed. This differs from the more familiar pass@k, which only requires at least one of k attempts to succeed.
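The "all k attempts succeed" reliability metric (written pass^k in the Tau-Bench paper) and the more common "at least one of k succeeds" metric (pass@k) can both be estimated from n recorded trials of a task, c of which succeeded. The sketch below uses the standard combinatorial estimators for a single task; the trial counts are invented, and a benchmark score would average these values over all tasks.

```python
from math import comb

def pass_at_k(n, c, k):
    """Chance that at least one of k trials drawn from the n recorded ones succeeded."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Chance that all k trials drawn from the n recorded ones succeeded."""
    return comb(c, k) / comb(n, k)

# Toy numbers: one task, 8 trials, 6 successes.
n, c = 8, 6
for k in (1, 2, 4):
    print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

For k = 1 both metrics equal the plain success rate (0.75 here). As k grows, the at-least-once metric rises toward 1 while the all-k metric falls, which is why the latter is the stricter test of agent reliability.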
4.6 Limitations of Benchmarks
Benchmarks are extremely useful but have important limitations:
- Data contamination: Models may have seen benchmark questions during pre-training, leading to inflated scores.
- Goodhart's law: When a benchmark becomes the primary target for optimization, models learn to game the benchmark rather than improving general capabilities.
- Narrow scope: Benchmarks only measure a small subset of the capabilities that matter for real-world use cases.
These are my structured study notes and in-depth interpretations compiled by watching the open course. I hope this knowledge framework helps you thoroughly master the content of this subject. Wish you continuous academic progress and great achievements in your studies.


