Evals for AI Agents: How to Know Your Agent Actually Works
Everyone builds agents, almost nobody measures them. We break down why classical tests don't work here, how to assemble your first eval dataset in an hour, the difference between end-to-end and trajectory evaluation, and how not to fool yourself with LLM-as-judge.
Intermediate · AI DevOps · 35 min · DeepEval, Langfuse, Claude API
1
Why unit tests break down here
You built an agent. It works on your test queries — so it's ready? No: LLMs are non-deterministic, the same input yields different outputs, and 'correct' is often subjective. A classical unit test checks equality, but equality isn't what needs checking here.
An eval isn't 'pass / fail'. It's a measurement: how well, in what percentage of cases, where it breaks. The question isn't 'did it pass?' but 'it passed 47 of 50 cases, and here's what the 3 failures have in common'. Failure means something different here: it's not a bug, it's data.
🧪 Unit test
- Exact match: 0 or 1
- Runs on every commit
- Failure = stop and fix
📊 Eval
- Pass rate and distribution
- Runs scheduled or pre-release
- Failure = data to analyze
The first thing to change is your relationship with failure. If you're getting a 100% pass rate, you're either writing trivial tests or fooling yourself. The goal is to see where the agent breaks and decide whether it's worth fixing.
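A minimal sketch of the difference in practice: unlike a unit test, an eval records a failure and moves on, then reports a pass rate and the failures themselves. The `cases` schema and toy agent here are illustrative, not a fixed API.

```python
# Minimal eval loop: a failure is data to collect, not an exception to raise.
def run_eval(agent, cases):
    failures = []
    for case in cases:
        answer = agent(case["input"])
        if answer != case["expected"]:
            failures.append({"input": case["input"], "got": answer,
                             "expected": case["expected"]})
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures  # failures feed analysis, they don't stop the run

# Toy example: an "agent" that upper-cases input, with one deliberately failing case
cases = [
    {"input": "hi", "expected": "HI"},
    {"input": "ok", "expected": "OK"},
    {"input": "no", "expected": "NO!"},  # fails: agent returns "NO"
]
pass_rate, failures = run_eval(str.upper, cases)
print(f"passed {round(pass_rate * len(cases))} of {len(cases)}")  # passed 2 of 3
```

The return value is the point: you get 'passed 2 of 3, and here is the failing case', not a red build.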
2
The dataset matters more than the metric — always
Metrics look impressive — accuracy, precision, recall. But without a good dataset they're a thermometer in an empty room: measuring something, just not what matters. The right order: first 20 real examples, then a metric around them.
Where do the first 20 come from? Three sources. Production or beta-user logs — that's where real users broke things. Corner cases you hunted manually — empty input, ambiguous query, a language you didn't prepare for. And synthetic data — but only as a supplement, never a replacement: synthetic is smooth, reality isn't.
Production logs → Curate manually → Eval dataset (20) → Run → Analyze failures → New cases → back into the dataset
Don't aim for 100 tests on day one. 20 carefully chosen examples covering different query types beat 200 random ones. The dataset grows as you find new failure modes in production.
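One way to keep the three sources honest is to tag each case with where it came from, so you can see at a glance how much of the set is synthetic. A sketch with an illustrative JSONL schema (field names are assumptions, not a standard):

```python
# Each eval case records its origin: production log, hand-picked corner case,
# or synthetic. The schema here is illustrative.
import json

cases = [
    {"input": "how many orders did customer 42 place in March?",
     "expected": "14", "source": "production_log"},
    {"input": "", "expected": "ask_for_clarification", "source": "corner_case"},
    {"input": "¿cuántos pedidos hice?", "expected": "ask_for_clarification",
     "source": "corner_case"},
    {"input": "list my three latest orders", "expected": "orders_list",
     "source": "synthetic"},
]

# Synthetic data is a supplement, never a replacement: fail loudly if it dominates
synthetic_share = sum(c["source"] == "synthetic" for c in cases) / len(cases)
assert synthetic_share < 0.5, "too much synthetic data in the eval set"

with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
```

A flat JSONL file is enough at 20 examples; tooling like Langfuse datasets can come later, once the set stabilizes.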
3
Two views on an agent: outcome and trajectory
The agent answers '14' to 'how many orders did this customer place in March?'. Right or wrong? Two ways to check, each answering a different question.
End-to-end compares '14' to the reference. Simple and fast — but tells you nothing about how the agent got there. It could have guessed and still returned the right number.
Trajectory looks at the sequence of steps: did it call `get_orders` with a March filter, did it apply `group_by customer`. If any step is wrong but the result matches by accident — trajectory catches it.
🎯 End-to-end
- Compare answer to reference
- Fast and cheap
- Blind to lucky guesses
🔬 Trajectory
- Verify each step and its parameters
- More expensive, more reliable
- Catches fake 'correct' answers
end_to_end_eval:
    query → agent → answer
    compare(answer, reference) → pass/fail

trajectory_eval:
    query → agent → [step1, step2, step3, answer]
    for each step:
        correct_tool? correct_parameters?
    final_score = average over steps + final result

For trajectory, don't compare steps literally. 'Called `get_orders` instead of `get_customer_orders`' isn't a failure if both return correct data. Compare intent, not syntax.
4
LLM-as-judge — powerful, but easy to fool yourself with
How do you check quality when 'correct' is a paragraph of text, not a number? Put a second LLM in the judge's seat: give it the agent's answer, a reference, and criteria — ask it to grade. Works, but breaks in three ways.
First trap — the judge rates everything 'good'. LLMs default to politeness. Fix: a rubric with explicit criteria and mandatory justification for each score.
Second — the judge favors longer answers and overrates them. Fix: check conciseness as an explicit separate criterion.
Third — you're using the same model as the agent. Models tend to approve of their own style. Fix: the judge must be a different model, ideally stronger.
Signs of a good LLM-judge rubric
✓ Scale 0–N, not "good/bad"
✓ Requires justification for each score
✓ Explicit criteria: correctness, completeness, tone
✓ Judge is a different model than the agent
✗ Single vague "rate the quality" prompt
rubric: support answer evaluation
criteria (each 0–2):
    correctness — facts match the knowledge base
    completeness — all parts of the question addressed
    tone — neutral, not condescending
output: { correctness: 2, completeness: 1, tone: 2, justification: "..." }
pass ≥ 5 of 6

Once a week, grab 20 random judge scores and check them by hand. If you agree less than 80% of the time, your rubric is bad and the metric measures the judge, not the agent.
5
From first test to production: three stages and where to stop
A common mistake is building a perfect eval before launch. The right path has three stages, and you watch different things at each.
Stage 1: 20 examples, run locally, fix obvious failures. Don't automate yet — you don't know what to measure.
Stage 2: 100 examples, regression before every merge. LLM-as-judge and trajectory metrics live here. A separate smoke set (10 examples) runs on every commit — a fast signal that nothing broke radically.
Stage 3: production tracing. Every real call is logged, some logs feed back into the eval dataset. This closes the loop: the agent learns from actual failures, not the ones you dreamt up over coffee.
| Stage | Size | Frequency | Purpose |
|---|---|---|---|
| Local | 20 | Manual | Find obvious gaps |
| Regression | 100 | Before merge | Catch quality regressions |
| Smoke | 10 | Every commit | Fast signal |
| Production | ∞ | Continuous | Close the loop |
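The smoke/regression split in the table can come from one dataset rather than two files to maintain: a fixed-seed sample gives a stable 10-case smoke set on every commit, while the merge gate runs everything. Function and field names are illustrative.

```python
# One dataset, two stages: a deterministic smoke subset plus the full
# regression set. Fixed seed keeps the smoke set stable across commits.
import random

def split_stages(cases, smoke_size=10, seed=0):
    rng = random.Random(seed)  # seeded: same smoke set every run
    smoke = rng.sample(cases, min(smoke_size, len(cases)))
    return {"smoke": smoke, "regression": cases}

cases = [{"id": i, "input": f"case {i}"} for i in range(100)]
stages = split_stages(cases)
print(len(stages["smoke"]), len(stages["regression"]))  # 10 100
```

Production tracing is the one stage this sketch can't cover: that's where Langfuse-style logging of every real call takes over, feeding new failures back into `cases`.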
The sign your eval system is done isn't a pass-rate number — it's when new failures start coming from production, not from your imagination. That means the dataset has caught up with reality.
Result
A working agent evaluation system: a dataset of real examples, metrics that distinguish lucky guesses from actual correctness, and a process where production failures become new tests — not bugs everyone forgot about.