Agent Evaluation & Testing
Metrics, testing pipelines, scorecards
📖 Analogy
Evaluating an AI agent is like a performance review for an employee. You don't just check if they showed up — you review their decision-making process, how they handled edge cases, whether they used the right tools, and if the final result met quality standards.
Why Agent Evaluation Is Harder Than Traditional Software Testing
Non-determinism
The same input can produce different outputs. Traditional unit tests with exact matching don't work — you need semantic evaluation and statistical testing.
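One common pattern is to replace exact-match assertions with a pass-rate threshold over repeated runs. Below is a minimal sketch: `run_agent` is a hypothetical stand-in for a real agent call, and the semantic check is simplified to a substring match (real suites often use embedding similarity or an LLM judge).

```python
import random

def run_agent(prompt: str) -> str:
    # Stand-in for a real, non-deterministic agent call.
    return random.choice(["Paris", "Paris.", "The capital is Paris."])

def passes(output: str, expected: str) -> bool:
    # Semantic check sketched as a substring match; real suites
    # use embedding similarity or an LLM judge instead.
    return expected.lower() in output.lower()

def pass_rate(prompt: str, expected: str, trials: int = 20) -> float:
    # Run many trials and score statistically, not with exact equality.
    return sum(passes(run_agent(prompt), expected) for _ in range(trials)) / trials

# Assert a statistical threshold instead of a single exact match.
assert pass_rate("What is the capital of France?", "Paris") >= 0.9
```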
Multi-step reasoning
Agents take multiple steps. An error in step 2 can compound into step 5. You need to evaluate not just the final answer but the entire trajectory.
Tool interactions
Agents call external APIs, databases, and tools. Testing requires mocking external services and validating that the right tool was called with correct parameters.
Error compounding
If each step has 90% accuracy, a 5-step agent has only 59% accuracy (0.9^5). Small per-step improvements have outsized impact on end-to-end quality.
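The arithmetic behind this is worth making explicit: end-to-end accuracy is the product of per-step accuracies, so modest per-step gains compound too.

```python
steps = 5
end_to_end = 0.90 ** steps   # 0.9^5 ~ 0.59
improved = 0.95 ** steps     # 0.95^5 ~ 0.77

print(f"90% per step -> {end_to_end:.0%} end-to-end")
print(f"95% per step -> {improved:.0%} end-to-end")
```

Raising each step from 90% to 95% lifts the 5-step pipeline from roughly 59% to roughly 77%.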
4 Levels of Evaluation
Component Testing
Test individual pieces: prompt templates, tool parsers, output formatters. Fast, cheap, catches 60% of bugs.
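A component test for a tool-call parser might look like the sketch below (the JSON shape and error message are illustrative assumptions). These tests are fast and deterministic because they never call a model.

```python
import json

def parse_tool_call(raw: str) -> dict:
    """Parse a model's JSON tool call; a typical component under test."""
    call = json.loads(raw)
    if "tool" not in call or "args" not in call:
        raise ValueError("tool call missing 'tool' or 'args'")
    return call

# Happy path: well-formed call parses and exposes the tool name.
ok = parse_tool_call('{"tool": "weather", "args": {"city": "Tokyo"}}')
assert ok["tool"] == "weather"

# Malformed call (no args) must raise, not silently pass through.
try:
    parse_tool_call('{"tool": "weather"}')
    assert False, "should have raised"
except ValueError:
    pass
```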
Trajectory Evaluation
Compare the agent's reasoning path against golden trajectories. Did it choose the right tools in the right order?
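One simple trajectory score is the fraction of golden tool calls that appear in the agent's actual call sequence, in order; the tool names below are hypothetical.

```python
def in_order_score(actual: list[str], golden: list[str]) -> float:
    """Fraction of golden tool calls matched in order.

    Extra calls in the actual trajectory are tolerated; missing
    or out-of-order golden calls lower the score.
    """
    i = 0
    for call in actual:
        if i < len(golden) and call == golden[i]:
            i += 1
    return i / len(golden)

golden = ["search_flights", "check_prices", "book_flight"]
actual = ["search_flights", "get_weather", "check_prices", "book_flight"]

assert in_order_score(actual, golden) == 1.0       # all golden steps, right order
assert in_order_score(["search_flights"], golden) == 1/3  # stopped early
```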
End-to-End Testing
Run the full agent on real-world scenarios. Measure task completion, cost, and latency. Slowest but most realistic.
Human Evaluation
Expert review of agent outputs on a sample. Catches subtle quality issues that automated metrics miss.
Key Metrics
Track task completion rate, cost per task, and latency end to end, alongside trajectory scores and human quality ratings.
⚠️ Common Pitfall
Demo-driven development: your agent works perfectly on 3 hand-picked examples, then fails on 100 real ones. Always test with a diverse golden dataset of 50+ cases covering edge cases, not just the happy path.
Testing Patterns
Golden Datasets
Curate 50-200 test cases with expected outputs. Include happy paths, edge cases, and failure modes. Version them like code.
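A golden dataset can be as simple as a versioned list of cases with expected outcomes. The cases below are hypothetical (a refunds agent with invented ids and labels), sketching the happy-path/edge-case/failure-mode mix.

```python
# Hypothetical golden cases for a refunds agent; ids, inputs,
# and labels are illustrative, not from a real product.
GOLDEN = [
    {"id": "happy-01", "input": "Refund order 123", "expect": "refund_issued"},
    {"id": "edge-01",  "input": "Refund my order",  "expect": "ask_order_id"},
    {"id": "fail-01",  "input": "Refund order 999", "expect": "order_not_found"},
]

def run_suite(agent, cases):
    """Return ids of failing cases; an empty list means the suite passes."""
    return [c["id"] for c in cases if agent(c["input"]) != c["expect"]]

# A stub agent that only handles the happy path fails both other cases.
stub = lambda text: "refund_issued" if "123" in text else "unknown"
assert run_suite(stub, GOLDEN) == ["edge-01", "fail-01"]
```

Because the dataset is plain data, it can live in version control and grow with every bug found, which is exactly how the regression suites below accumulate.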
Regression Suites
Every bug becomes a test case. Run the suite on every prompt change or model update. Catch regressions before they reach production.
Sandbox Environments
Mock external APIs and databases. Test tool interactions without side effects. Use record-replay for deterministic tests.
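A record-replay stub can be sketched in a few lines: it serves recorded responses instead of hitting the network, and logs every call so the test can verify the agent invoked the tool with the right parameters (the API shape here is an assumption).

```python
class ReplayAPI:
    """Record-replay stub: serves previously recorded responses
    instead of calling the real service, and logs every call."""

    def __init__(self, recordings: dict):
        self.recordings = recordings  # (city, date) -> recorded response
        self.calls = []

    def get_weather(self, city: str, date: str) -> dict:
        self.calls.append((city, date))
        return self.recordings[(city, date)]

api = ReplayAPI({("Tokyo", "2025-01-02"): {"temp_c": 8, "rain": False}})

# Deterministic response, no network, no side effects.
assert api.get_weather("Tokyo", "2025-01-02")["temp_c"] == 8
# The test can also assert the tool was called with correct parameters.
assert api.calls == [("Tokyo", "2025-01-02")]
```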
Adversarial Testing
Throw edge cases, malformed inputs, and prompt injections at your agent. Test recovery behavior when tools fail or return unexpected data.
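Recovery behavior can be tested directly by injecting a failing tool. The wrapper below is a minimal sketch, not a real agent loop: it only shows the shape of a "tool failed, degrade gracefully" test.

```python
def agent(prompt: str, weather_tool) -> str:
    # Minimal agent wrapper: recover gracefully when a tool fails.
    try:
        report = weather_tool(prompt)
        return f"Forecast: {report}"
    except Exception:
        return "Sorry, the weather service is unavailable right now."

def failing_tool(_prompt):
    # Simulated tool failure for adversarial testing.
    raise TimeoutError("upstream timeout")

# The agent must degrade gracefully, not crash, when the tool fails.
assert agent("weather in Tokyo?", failing_tool).startswith("Sorry")
```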
💡 Fun Fact
Anthropic's internal agent testing revealed that 73% of agent failures stem from incorrect tool parameter formatting — not reasoning errors. Simple input validation on tool calls can dramatically improve agent reliability.
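Such validation can be a cheap schema check run before dispatching the tool call. The parameter names and rules below are illustrative assumptions for a weather tool.

```python
import re

def validate_weather_args(args: dict) -> list[str]:
    """Cheap parameter check run before dispatching a tool call.

    Returns a list of error messages; empty means the call is valid.
    """
    errors = []
    city = args.get("city")
    if not isinstance(city, str) or not city.strip():
        errors.append("city must be a non-empty string")
    date = args.get("date", "")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(date)):
        errors.append("date must be YYYY-MM-DD")
    return errors

assert validate_weather_args({"city": "Tokyo", "date": "2025-01-02"}) == []
assert validate_weather_args({"city": "Tokyo", "date": "tomorrow"}) == ["date must be YYYY-MM-DD"]
```

Errors can be fed back to the model for a retry instead of letting a malformed call reach the tool.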
Example: a single tool-selection step.
User: What's the weather in Tokyo tomorrow?
Agent: I need to check the weather for Tokyo. I'll use the weather API tool with the city name and tomorrow's date.
This lesson is part of a structured LLM course.