Lecture 8: LLM Evaluation: Human Ratings, Rule-Based Metrics and LLM-as-a-Judge

One. Course Details

This is the eighth lecture of Stanford University’s CME 295: Transformers and Large Language Models, taught by twin brothers Afshine and Shervine Amidi. The lecture opens by emphasizing that evaluation is the most critical topic of the entire course—without the ability to measure LLM performance, there is no way to systematically improve models.

The lecture follows a logical progression from traditional evaluation methods to modern approaches. It begins with human evaluation and the challenges of inter-rater agreement, then covers classic rule-based metrics like BLEU and ROUGE, before introducing the dominant modern technique: LLM-as-a-Judge. The second half of the lecture focuses on evaluating agentic workflows, including a detailed breakdown of common failure modes, and concludes with a survey of standard LLM benchmarks across five core categories.

This lecture addresses the fourth and final core weakness of vanilla LLMs identified earlier in the course: the difficulty of evaluating free-form text outputs.

Two. Key Learning Objectives

By the end of this lecture, students will be able to:

Explain the limitations of human evaluation for LLMs and calculate inter-rater agreement using metrics like Cohen's kappa.

Compare and contrast rule-based evaluation metrics including METEOR, BLEU, and ROUGE, and identify their key strengths and weaknesses.

Describe the LLM-as-a-Judge framework and explain how structured outputs ensure parsable and consistent evaluations.

Identify and mitigate the three most common biases in LLM-as-a-Judge evaluations: position bias, verbosity bias, and self-enhancement bias.

Implement a factuality verification pipeline that decomposes text into verifiable claims and aggregates results.

Diagnose and troubleshoot seven common failure modes in tool calling and agentic workflows.

Categorize standard LLM benchmarks into five core types and explain what capabilities each benchmark measures.

Three. Memorable Course Quotes

"If we cannot measure the performance of our LLM, we do not really know what to improve."

"LLM-as-a-Judge is revolutionary not just because it can score outputs, but because it can explain why it gave that score—something no rule-based metric can do."

"When a measure becomes a target, it ceases to be a good measure. This is Goodhart's law, and it applies more to LLMs than to any other field I know."

"Agent evaluation is fundamentally harder than evaluating single outputs because you have to measure an entire iterative process, not just a final result."

"Benchmarks are useful for characterizing the profile of an LLM, but they will never replace your own experience using the model for your specific task."

Four. Detailed Lecture Notes

4.1 Human Evaluation and Inter-Rater Agreement

The ideal evaluation scenario would have a human rate every LLM output, but this approach has three critical limitations:

Cost: Human evaluation is extremely expensive at scale.
Speed: Humans cannot evaluate thousands of outputs quickly enough for iterative model development.
Subjectivity: Different humans often disagree on the quality of the same output, especially for subjective tasks.

4.1.1 Inter-Rater Agreement Metrics

To address subjectivity, researchers use inter-rater agreement metrics to quantify how consistent human raters are:

Simple agreement rate: The proportion of times two raters give the same rating. This is misleading because random guessing can produce high agreement rates for imbalanced datasets.
Cohen's kappa: Corrects the observed agreement rate p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A score of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.
Fleiss's kappa: Extends Cohen's kappa to more than two raters.
Krippendorff's alpha: The most general metric, supporting any number of raters and any type of rating scale.

In practice, teams use these metrics to monitor rating consistency. If agreement is too low, they hold alignment sessions to refine rating guidelines until consistency improves.
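To make the chance correction concrete, here is a minimal sketch of Cohen's kappa for two raters. The function name `cohens_kappa` and the toy pass/fail labels are illustrative assumptions, not material from the lecture.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined,
        # and this sketch reports it as 1.0 (an assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Imbalanced toy data (hypothetical): the raters agree on 8 of 10 items.
rater_a = ["pass"] * 9 + ["fail"]
rater_b = ["pass"] * 8 + ["fail", "pass"]
kappa = cohens_kappa(rater_a, rater_b)
```

On this toy data the simple agreement rate is 0.8, yet kappa is slightly negative because both raters say "pass" 90% of the time, so 0.82 agreement is already expected by chance.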

4.2 Rule-Based Evaluation Metrics

Rule-based metrics compare LLM outputs to human-written reference outputs using string matching techniques. They were the standard for NLP evaluation before the rise of modern LLMs.

4.2.1 Common Rule-Based Metrics

Metric	Full Name	Primary Use Case	Core Idea
METEOR	Metric for Evaluation of Translation with Explicit Ordering	Machine translation	F-score of unigram matches with a penalty for incorrect word order
BLEU	Bilingual Evaluation Understudy	Machine translation	Precision of n-gram matches with a brevity penalty for short outputs
ROUGE	Recall-Oriented Understudy for Gisting Evaluation	Text summarization	Recall of n-gram matches between output and reference
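The core quantities behind the table can be sketched in a few lines. This is a simplified illustration, not a full BLEU or ROUGE implementation: it computes clipped n-gram precision (the building block of BLEU, without the brevity penalty or the averaging over n-gram orders) and n-gram recall (the idea behind ROUGE-N). Whitespace tokenization and the function names are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_counts(candidate, reference, n):
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection clips each n-gram count to the minimum of the
    # two sides, so repeating a matching word cannot inflate the score.
    matched = sum((cand & ref).values())
    return matched, sum(cand.values()), sum(ref.values())


def ngram_precision(candidate, reference, n=1):
    """BLEU-style: share of candidate n-grams that appear in the reference."""
    matched, cand_total, _ = overlap_counts(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0


def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style: share of reference n-grams recovered by the candidate."""
    matched, _, ref_total = overlap_counts(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0


reference = "the cat sat on the mat"
close = "the cat is on the mat"
paraphrase = "a feline rested upon the rug"

p1_close = ngram_precision(close, reference, n=1)       # 5 of 6 unigrams match
p2_close = ngram_precision(close, reference, n=2)       # 3 of 5 bigrams match
r1_para = ngram_recall(paraphrase, reference, n=1)      # only "the" matches
```

The paraphrase keeps the meaning of the reference but shares a single word with it, so its string-overlap score is close to zero.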

4.2.2 Limitations of Rule-Based Metrics

All rule-based metrics share three fundamental flaws:

No support for stylistic variation: They cannot recognize that two sentences with completely different words can have identical meaning.
Poor correlation with human judgment: Rule-based metrics often correlate only weakly with how humans actually rate output quality, especially for open-ended generation where many different outputs are acceptable.
Require human references: You still need humans to write reference outputs for every prompt, which is expensive and time-consuming.

4.3 LLM-as-a-Judge: The Modern Standard

LLM-as-a-Judge is the dominant evaluation technique for modern LLMs. It uses a powerful LLM to evaluate the outputs of another LLM.

4.3.1 How LLM-as-a-Judge Works

The standard workflow is:

Provide the judge LLM with the original prompt, the model's response, and clear evaluation criteria.
Ask the judge to first output a detailed rationale explaining its evaluation, then output a score.
Use structured outputs to guarantee that the response follows a parsable format with exactly the required fields.

The requirement to output the rationale before the score is critical—it forces the judge to externalize its reasoning, significantly improving evaluation accuracy and consistency. This is directly analogous to chain-of-thought prompting for reasoning tasks.
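The workflow above can be sketched as follows. This is a minimal illustration under stated assumptions: `call_llm` is a hypothetical callable standing in for whatever LLM client is used, the prompt template and the binary 0/1 score are choices made for the example, and validation is done by hand here rather than through a provider's structured-output feature.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt template; the rationale field is requested first so the
# judge writes out its reasoning before committing to a score.
JUDGE_TEMPLATE = """You are grading a model response.
Criteria: {criteria}
Prompt: {prompt}
Response: {response}
Reply with JSON only, with the keys in this order:
{{"rationale": "<your reasoning>", "score": 0 or 1}}"""


@dataclass
class Verdict:
    rationale: str
    score: int


def parse_verdict(raw: str) -> Verdict:
    """Reject anything that does not have exactly the required fields."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"rationale", "score"}:
        raise ValueError("Judge output must contain exactly 'rationale' and 'score'.")
    if not isinstance(data["rationale"], str):
        raise ValueError("'rationale' must be a string.")
    # type(...) is int excludes booleans, which Python treats as ints.
    if type(data["score"]) is not int or data["score"] not in (0, 1):
        raise ValueError("'score' must be the integer 0 or 1.")
    return Verdict(rationale=data["rationale"], score=data["score"])


def judge(prompt: str, response: str, criteria: str,
          call_llm: Callable[[str], str]) -> Verdict:
    judge_prompt = JUDGE_TEMPLATE.format(
        criteria=criteria, prompt=prompt, response=response)
    return parse_verdict(call_llm(judge_prompt))


# Stub standing in for a real judge model, so the sketch runs offline.
def fake_llm(judge_prompt: str) -> str:
    return '{"rationale": "The response names the correct capital.", "score": 1}'


verdict = judge("What is the capital of France?", "Paris.",
                "The answer must be factually correct.", fake_llm)
```

In a real setup the hand-written checks in `parse_verdict` would typically be replaced or backed by the schema enforcement that the model provider offers.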

4.3.2 Common Biases and Mitigations

LLM-as-a-Judge is not perfect and suffers from three well-documented biases:

Position bias: The judge tends to prefer whichever response is presented first in pairwise comparisons.
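- Mitigation: Run the evaluation twice with the order of responses swapped and keep a preference only when both runs agree; otherwise treat the comparison as a tie, or add more runs and take a majority vote.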
Verbosity bias: The judge tends to prefer longer, more detailed responses even when they are not more correct.
- Mitigation: Explicitly instruct the judge to ignore verbosity, provide few-shot examples of concise high-quality responses, or apply a length penalty to scores.
Self-enhancement bias: A model will tend to prefer outputs that it generated itself over outputs from other models.
- Mitigation: Use a different model as the judge than the one being evaluated, preferably a larger and more capable model.
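The order-swapping mitigation for position bias can be sketched as below. The `pairwise_judge` callable is hypothetical and is assumed to return the string "first" or "second"; declaring a tie when the two runs disagree is one common convention, chosen here as an assumption because two runs alone cannot always produce a majority.

```python
from typing import Callable


def debiased_preference(prompt: str, response_a: str, response_b: str,
                        pairwise_judge: Callable[[str, str, str], str]) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = pairwise_judge(prompt, response_a, response_b)   # A shown first
    backward = pairwise_judge(prompt, response_b, response_a)  # B shown first
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    # A preference counts only if it survives the swap.
    return winner_forward if winner_forward == winner_backward else "tie"


# Stub judges for illustration only.
def position_biased_judge(prompt, first, second):
    return "first"  # always prefers whatever is shown first


def length_based_judge(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"


biased_result = debiased_preference("q", "short", "a longer answer", position_biased_judge)
stable_result = debiased_preference("q", "short", "a longer answer", length_based_judge)
```

A purely position-driven judge flips its verdict when the order is swapped, so its preference is discarded as a tie, while a judge with an order-independent preference keeps its verdict.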

4.3.3 Best Practices for LLM-as-a-Judge

Use binary scoring (pass/fail) rather than granular scales. Binary judgments are easier for both humans and LLMs to make consistently.
Write crisp, unambiguous evaluation guidelines that leave no room for interpretation.
Use a low temperature (0.1-0.2) for the judge to ensure reproducible results.
Calibrate the judge against human ratings on a small subset of examples to ensure it aligns with human preferences.
Never optimize exclusively against LLM-as-a-Judge scores—always verify improvements with human evaluation periodically.

4.3.4 Factuality Evaluation

Factuality is one of the most important and challenging dimensions to evaluate. The standard approach is:

Decompose: Use an LLM to break the output into a list of individual factual claims.
Verify: Check each claim against a trusted knowledge source using RAG or web search.
Aggregate: Calculate an overall factuality score by weighting the importance of each correct and incorrect claim.

This approach captures nuance—an output with one minor factual error will receive a higher score than an output that is completely wrong.
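The decompose, verify, and aggregate steps can be sketched as a small pipeline. Everything model-related is stubbed: `decompose` and `verify` are hypothetical callables that would be backed by an LLM and by RAG or web search in practice, and the per-claim `weight` field is an assumed way to express claim importance.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Claim:
    text: str
    weight: float = 1.0  # assumed importance of the claim


def factuality_score(output: str,
                     decompose: Callable[[str], list[Claim]],
                     verify: Callable[[str], bool]) -> float:
    """Weighted share of claims that the verifier supports."""
    claims = decompose(output)
    total = sum(claim.weight for claim in claims)
    if total == 0:
        return 1.0  # no checkable claims; treated as fully factual (assumption)
    supported = sum(claim.weight for claim in claims if verify(claim.text))
    return supported / total


# Toy stand-ins so the sketch runs without a model or a search backend.
TRUSTED_FACTS = {"Paris is the capital of France"}


def toy_decompose(output: str) -> list[Claim]:
    return [Claim(text=s.strip()) for s in output.split(".") if s.strip()]


def toy_verify(claim_text: str) -> bool:
    return claim_text in TRUSTED_FACTS


text = "Paris is the capital of France. The Eiffel Tower was built in 1789."
score = factuality_score(text, toy_decompose, toy_verify)
```

With equal weights the toy output scores 0.5: one of its two claims is supported. Raising the weight of the more important claim shifts the score accordingly.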

4.4 Agent Evaluation and Failure Modes

Evaluating agents is significantly harder than evaluating single outputs because agents perform multiple iterative steps of reasoning and action. The lecture breaks down seven common failure modes in tool calling and agentic workflows:

4.4.1 Tool Prediction Errors

Failure to call a tool: The model does not use an available tool when it should, instead giving a wrong answer or punting on the question.
- Causes: Tool router recall errors, insufficient SFT examples, unclear instructions.
Tool hallucination: The model calls a function that does not exist.
- Causes: Weak model, unclear tool descriptions, missing top-level instructions about using only provided tools.
Wrong tool selection: The model calls a tool that is not appropriate for the task.
- Causes: Overlapping tool scopes, ambiguous tool descriptions.
Incorrect arguments: The model calls the right tool but passes invalid or incorrect arguments.
- Causes: Missing context information, unclear argument descriptions.
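When logging agent traces, it can help to label each predicted tool call with one of the four error types above. The sketch below is one possible way to do that; the `TOOLS` registry, the tool names, and the call format (`{"name": ..., "arguments": {...}}`) are all hypothetical. It checks only the shape of the arguments (names and types), not whether their values are semantically right.

```python
# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str},
    "convert_currency": {"amount": float, "source": str, "target": str},
}


def diagnose_tool_call(call, expected_tool=None):
    """Classify a predicted call; `call` is None or {"name": ..., "arguments": {...}}."""
    if call is None:
        # No call was made although the labeled example required one.
        return "failure_to_call" if expected_tool else "ok"
    name = call["name"]
    if name not in TOOLS:
        return "tool_hallucination"
    if expected_tool is not None and name != expected_tool:
        return "wrong_tool"
    schema = TOOLS[name]
    arguments = call.get("arguments", {})
    # Missing or unknown argument names.
    if set(arguments) != set(schema):
        return "incorrect_arguments"
    # Argument values of the wrong type.
    if any(not isinstance(arguments[key], kind) for key, kind in schema.items()):
        return "incorrect_arguments"
    return "ok"


good_call = {"name": "get_weather", "arguments": {"city": "Paris"}}
status = diagnose_tool_call(good_call, expected_tool="get_weather")
```

Counting these labels over an evaluation set shows which cause (router recall, tool descriptions, argument descriptions) deserves attention first.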

4.4.2 Tool Execution Errors

Tool returns error: The tool itself has a bug and returns an error message.
- Best practice: Always return structured error messages that explain what went wrong, not raw exceptions.
Tool returns no response: The tool executes successfully but returns nothing.
- Best practice: Always return a structured response, even if it is just an empty JSON object indicating success.
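Both best practices can be enforced with a thin wrapper around every tool. The sketch below is an assumption about how such a wrapper might look; the `run_tool` name, the response fields, and the two toy tools are illustrative.

```python
import json
from typing import Any, Callable


def run_tool(tool: Callable[..., Any], **arguments: Any) -> str:
    """Execute a tool and always return a structured JSON string."""
    try:
        result = tool(**arguments)
    except Exception as exc:
        # Summarize the failure instead of leaking a raw traceback.
        return json.dumps({
            "status": "error",
            "error_type": type(exc).__name__,
            "message": str(exc),
        })
    # A tool that returns nothing still yields an explicit success payload;
    # default=str is a fallback for values json cannot serialize directly.
    return json.dumps({"status": "success", "result": result}, default=str)


# Toy tools for illustration only.
def divide(a: float, b: float) -> float:
    return a / b


def log_event(name: str) -> None:
    return None  # side effect only, no return value


error_payload = json.loads(run_tool(divide, a=1, b=0))
empty_payload = json.loads(run_tool(log_event, name="checkout"))
```

The model then sees either an explicit error it can react to or an explicit success marker, rather than a stack trace or an empty string.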

4.4.3 Response Synthesis Errors

Failure to ground on tool output: The model ignores the results from the tool and generates a response based on its own knowledge.
- Causes: Weak model, too much irrelevant information in the tool output.

4.5 Standard LLM Benchmarks

Benchmarks are standardized test suites used to compare the performance of different LLMs. They fall into five core categories:

4.5.1 Knowledge Benchmarks

Measure the model's ability to recall factual information across diverse domains.

MMLU (Massive Multitask Language Understanding): The most widely used knowledge benchmark, with a test set of roughly 14,000 multiple-choice questions across 57 subjects ranging from elementary science to advanced law and medicine.

4.5.2 Reasoning Benchmarks

Measure the model's ability to solve multi-step problems.

AIME (American Invitational Mathematics Examination): A challenging high school math competition benchmark that requires advanced multi-step reasoning.
PIQA (Physical Interaction Question Answering): Measures common sense reasoning about physical objects and everyday situations.

4.5.3 Coding Benchmarks

Measure the model's ability to write and understand code.

SWE-bench: A realistic software engineering benchmark that asks models to fix real GitHub issues in popular Python repositories. Performance is measured by whether the model's patch makes the tests associated with the issue pass without breaking tests that previously passed.

4.5.4 Safety Benchmarks

Measure the model's ability to refuse harmful requests.

HarmBench: A comprehensive safety benchmark covering standard harmful behavior, copyright violations, contextual harm, and multimodal harm.

4.5.5 Agent Benchmarks

Measure the performance of tool-enabled agents.

Tau-Bench: A benchmark for tool agents in the airline and retail domains. It uses LLM-simulated users to interact with agents and measures success rate based on database state changes. It introduces the pass^k metric, which measures the probability that all k attempts at a task succeed, in contrast to the more familiar pass@k, which only requires at least one of k attempts to succeed.
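The "all k attempts succeed" reliability metric is usually written pass^k, and it is easy to confuse with pass@k from code generation, which asks whether at least one of k attempts succeeds. The sketch below contrasts the two using the standard combinatorial estimators for a task run n times with c successes; the function names and the example numbers are assumptions.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("Require 0 < k <= n and 0 <= c <= n.")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds) from c successes in n trials."""
    _check(n, c, k)
    # Probability that k trials drawn without replacement are all failures,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed) from c successes in n trials."""
    _check(n, c, k)
    # Probability that k trials drawn without replacement are all successes.
    return comb(c, k) / comb(n, k)


# Hypothetical task: 8 trials, 6 successes.
at_least_one = pass_at_k(n=8, c=6, k=2)   # 27/28
all_succeed = pass_hat_k(n=8, c=6, k=2)   # 15/28
```

For k = 1 both estimators reduce to the plain success rate c / n. As k grows, pass@k rises toward 1 while pass^k falls, which is why pass^k is the stricter measure of an agent's consistency.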

4.6 Limitations of Benchmarks

Benchmarks are extremely useful but have important limitations:

Data contamination: Models may have seen benchmark questions during pre-training, leading to inflated scores.
Goodhart's law: When a benchmark becomes the primary target for optimization, models learn to game the benchmark rather than improving general capabilities.
Narrow scope: Benchmarks only measure a small subset of the capabilities that matter for real-world use cases.

The lecture concludes by emphasizing that while benchmarks are valuable for comparing models, the ultimate test of an LLM is how well it performs on your specific tasks.


Video Source and Usage Instructions

Video Title: Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation Stanford Online
• Course Series: Stanford CME295: Transformers and Large Language Models | Autumn 2025
• Original Platform: YouTube
• Original Publisher: Stanford
• Original Video URL: https://youtu.be/8fNP4N46RRo?si=SBSRdo0kpAqJD2Nr
