Evals for AI Agents: How to Know Your Agent Actually Works
Everyone builds agents, almost nobody measures them. We break down why classical tests don't work here, how to assemble your first eval dataset in an hour, the difference between end-to-end and trajectory evaluation, and how not to fool yourself with LLM-as-judge.
Intermediate · AI DevOps · 35 min · DeepEval, Langfuse, Claude API
1
Why unit tests break down here
You built an agent. It works on your test queries — so it's ready? No: LLMs are non-deterministic, the same input yields different outputs, and 'correct' is often subjective. A classical unit test checks equality, but equality isn't what needs checking here.
An eval isn't 'pass / fail'. It's a measurement: how well, in what percentage of cases, where it breaks. The question isn't 'did it pass?' but 'it passed 47 of 50 cases, and here's what the 3 failures have in common'. Failure means something different here: it's not a bug, it's data.
🧪 Unit test
- Exact match: 0 or 1
- Runs on every commit
- Failure = stop and fix
📊 Eval
- Pass rate and distribution
- Runs scheduled or pre-release
- Failure = data to analyze
The first thing to change is your relationship with failure. If you're getting a 100% pass rate, you're either writing trivial tests or fooling yourself. The goal is to see where the agent breaks and decide whether it's worth fixing.
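A minimal sketch of the difference in practice: unlike a unit test, an eval records a failure and moves on, then reports a pass rate and the failures themselves. The `cases` schema and toy agent here are illustrative, not a fixed API.

```python
# Minimal eval loop: a failure is data to collect, not an exception to raise.
def run_eval(agent, cases):
    failures = []
    for case in cases:
        answer = agent(case["input"])
        if answer != case["expected"]:
            failures.append({"input": case["input"], "got": answer,
                             "expected": case["expected"]})
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures  # failures feed analysis, they don't stop the run

# Toy example: an "agent" that upper-cases input, with one deliberately failing case
cases = [
    {"input": "hi", "expected": "HI"},
    {"input": "ok", "expected": "OK"},
    {"input": "no", "expected": "NO!"},  # fails: agent returns "NO"
]
pass_rate, failures = run_eval(str.upper, cases)
print(f"passed {round(pass_rate * len(cases))} of {len(cases)}")  # passed 2 of 3
```

The return value is the point: you get 'passed 2 of 3, and here is the failing case', not a red build.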
2
The dataset matters more than the metric — always
Metrics look impressive — accuracy, precision, recall. But without a good dataset they're a thermometer in an empty room: measuring something, just not what matters. The right order: first 20 real examples, then a metric around them.
Where do the first 20 come from? Three sources. Production or beta-user logs — that's where real users broke things. Corner cases you hunted manually — empty input, ambiguous query, a language you didn't prepare for. And synthetic data — but only as a supplement, never a replacement: synthetic is smooth, reality isn't.
Production logs → Curate manually → Eval dataset (20) → Run → Analyze failures → New cases → back into the dataset
Don't aim for 100 tests on day one. 20 carefully chosen examples covering different query types beat 200 random ones. The dataset grows as you find new failure modes in production.
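One way to keep the three sources honest is to tag each case with where it came from, so you can see at a glance how much of the set is synthetic. A sketch with an illustrative JSONL schema (field names are assumptions, not a standard):

```python
# Each eval case records its origin: production log, hand-picked corner case,
# or synthetic. The schema here is illustrative.
import json

cases = [
    {"input": "how many orders did customer 42 place in March?",
     "expected": "14", "source": "production_log"},
    {"input": "", "expected": "ask_for_clarification", "source": "corner_case"},
    {"input": "¿cuántos pedidos hice?", "expected": "ask_for_clarification",
     "source": "corner_case"},
    {"input": "list my three latest orders", "expected": "orders_list",
     "source": "synthetic"},
]

# Synthetic data is a supplement, never a replacement: fail loudly if it dominates
synthetic_share = sum(c["source"] == "synthetic" for c in cases) / len(cases)
assert synthetic_share < 0.5, "too much synthetic data in the eval set"

with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
```

A flat JSONL file is enough at 20 examples; tooling like Langfuse datasets can come later, once the set stabilizes.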
3
Two views on an agent: outcome and trajectory
The agent answers '14' to 'how many orders did this customer place in March?'. Right or wrong? Two ways to check, each answering a different question.
End-to-end compares '14' to the reference. Simple and fast — but tells you nothing about how the agent got there. It could have guessed and still returned the right number.
Trajectory looks at the sequence of steps: did it call `get_orders` with a March filter, did it apply `group_by customer`. If any step is wrong but the result matches by accident — trajectory catches it.
🎯 End-to-end
- Compare answer to reference
- Fast and cheap
- Blind to lucky guesses
🔬 Trajectory
- Verify each step and its parameters
- More expensive, more reliable
- Catches fake 'correct' answers
end_to_end_eval:
    query → agent → answer
    compare(answer, reference) → pass/fail

trajectory_eval:
    query → agent → [step1, step2, step3, answer]
    for each step:
        correct_tool? correct_parameters?
    final_score = average over steps + final result

For trajectory, don't compare steps literally. 'Called `get_orders` instead of `get_customer_orders`' isn't a failure if both return correct data. Compare intent, not syntax.
4
LLM-as-judge — powerful, but easy to fool yourself with
How do you check quality when 'correct' is a paragraph of text, not a number? Put a second LLM in the judge's seat: give it the agent's answer, a reference, and criteria — ask it to grade. Works, but breaks in three ways.
First trap — the judge rates everything 'good'. LLMs default to politeness. Fix: a rubric with explicit criteria and mandatory justification for each score.
Second — the judge favors longer answers and overrates them. Fix: check conciseness as an explicit separate criterion.
Third — you're using the same model as the agent. Models tend to approve of their own style. Fix: the judge must be a different model, ideally stronger.
Signs of a good LLM-judge rubric
✓ Scale 0–N, not "good/bad"
✓ Requires justification for each score
✓ Explicit criteria: correctness, completeness, tone
✓ Judge is a different model than the agent
✗ Single vague "rate the quality" prompt
rubric: support answer evaluation
criteria (each 0–2):
    correctness — facts match the knowledge base
    completeness — all parts of the question addressed
    tone — neutral, not condescending
output: { correctness: 2, completeness: 1, tone: 2, justification: "..." }
pass ≥ 5 of 6

Once a week, grab 20 random judge scores and check them by hand. If you agree less than 80% of the time, your rubric is bad and the metric measures the judge, not the agent.
5
From first test to production: three stages and where to stop
A common mistake is building a perfect eval before launch. The right path has three stages, and you watch different things at each.
Stage 1: 20 examples, run locally, fix obvious failures. Don't automate yet — you don't know what to measure.
Stage 2: 100 examples, regression before every merge. LLM-as-judge and trajectory metrics live here. A separate smoke set (10 examples) runs on every commit — a fast signal that nothing broke radically.
Stage 3: production tracing. Every real call is logged, some logs feed back into the eval dataset. This closes the loop: the agent learns from actual failures, not the ones you dreamt up over coffee.
| Stage | Size | Frequency | Purpose |
|---|---|---|---|
| Local | 20 | Manual | Find obvious gaps |
| Regression | 100 | Before merge | Catch quality regressions |
| Smoke | 10 | Every commit | Fast signal |
| Production | ∞ | Continuous | Close the loop |
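The smoke/regression split in the table can come from one dataset rather than two files to maintain: a fixed-seed sample gives a stable 10-case smoke set on every commit, while the merge gate runs everything. Function and field names are illustrative.

```python
# One dataset, two stages: a deterministic smoke subset plus the full
# regression set. Fixed seed keeps the smoke set stable across commits.
import random

def split_stages(cases, smoke_size=10, seed=0):
    rng = random.Random(seed)  # seeded: same smoke set every run
    smoke = rng.sample(cases, min(smoke_size, len(cases)))
    return {"smoke": smoke, "regression": cases}

cases = [{"id": i, "input": f"case {i}"} for i in range(100)]
stages = split_stages(cases)
print(len(stages["smoke"]), len(stages["regression"]))  # 10 100
```

Production tracing is the one stage this sketch can't cover: that's where Langfuse-style logging of every real call takes over, feeding new failures back into `cases`.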
The sign your eval system is done isn't a pass-rate number — it's when new failures start coming from production, not from your imagination. That means the dataset has caught up with reality.
Result
A working agent evaluation system: a dataset of real examples, metrics that distinguish lucky guesses from actual correctness, and a process where production failures become new tests — not bugs everyone forgot about.