TLDR
Evaluation is how you measure tool-call correctness against ground-truth datasets. A full deep-dive guide is coming soon.
What it is
An evaluation (or eval) is a structured test harness that runs prompts and agents against a suite of test cases, scores each result, and measures improvement over iterations. Unlike manual spot-checking, evals are automated and repeatable: the same 50 cases run against Prompt-v1, Prompt-v2, Prompt-v3, with scores improving from 68% to 87%. Evals are the difference between hope (a prompt feels good) and proof (test cases show +19 points).
The eval workflow has four phases. Define test cases (30-100 input-expected pairs). Run agents against each case, capture response and tool calls. Score results with code-based or model-based grading. Analyze and iterate: which cases failed and why? Refine the prompt, re-run, measure delta. Success is when the score plateaus or new cases reveal new failure modes (regression detection).
Golden test suites are the foundation. Golden means the expected outputs are verified (human-reviewed, audited, canonical). A 50-case suite in which only 10 cases are golden produces misleading scores. Build the golden set incrementally: week 1 = 10 golden cases; month 2 = 50; month 6 = 200. The exam tests whether you understand that evals without golden data are theater, not measurement.
Tool-calling evals are distinct from output evals. An output eval scores final text. A tool-calling eval scores behavior: did the agent call the right tools in the right order with the right arguments? For agents, tool-call evals matter more, because the agent's job is orchestration, not prose. The exam drills this distinction repeatedly.
How it works
An eval harness is three components: test data, scoring function, and a loop. Test data is JSONL (one case per line): {input, expected_tools, expected_outcome}. The loop runs each case through the agent, parses tool calls, and feeds both to the scoring function. The function returns a numeric score (0-1) or pass/fail. Aggregate across all cases, run on Prompt-v1, refine, run on v2, measure delta.
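A minimal sketch of the test-data side of the harness. The field layout (`id`, `input`, and a nested `expected` object holding `tools` and `outcome`) follows what the code example on this page reads; the two sample cases, the tool names, and the `load_cases` helper are hypothetical.

```python
import json

# Two illustrative cases in JSONL form: one JSON object per line.
# The scenarios and tool names are made up for this sketch.
SAMPLE_JSONL = """\
{"id": "refund-001", "input": "Refund order 1234, it arrived damaged.", "expected": {"tools": ["lookup_order", "issue_refund"], "outcome": "refunded"}}
{"id": "refund-002", "input": "Refund an order from two years ago.", "expected": {"tools": ["lookup_order", "escalate"], "outcome": "escalated"}}
"""

REQUIRED_KEYS = {"id", "input", "expected"}


def load_cases(text: str) -> list[dict]:
    """Parse JSONL text into cases, rejecting lines with missing fields."""
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing {sorted(missing)}")
        cases.append(case)
    return cases


cases = load_cases(SAMPLE_JSONL)
print(len(cases), cases[0]["expected"]["tools"])
```

Validating the shape of each case at load time keeps a malformed line from surfacing later as a confusing scoring failure.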
Code-based scoring is deterministic: parse the agent's response, extract tool calls, compare against expected. expected == actual → pass. Works for tool-call evals, JSON schema validation, rule-based checks. 100% reproducible, no hallucination. Use for anything measurable.
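As a sketch of the exact-match idea, the function below compares tool names, call order, and arguments. The call representation (a list of dicts with `name` and `args`) and the refund tool names are assumptions for illustration, not a format the source prescribes.

```python
def tool_calls_match(expected: list[dict], actual: list[dict]) -> bool:
    """Pass only when tool names, call order, and arguments all match exactly."""
    if len(expected) != len(actual):
        return False
    return all(
        exp["name"] == act["name"] and exp["args"] == act["args"]
        for exp, act in zip(expected, actual)
    )


expected = [
    {"name": "lookup_order", "args": {"order_id": "1234"}},
    {"name": "issue_refund", "args": {"order_id": "1234", "amount": 20}},
]
wrong_amount = [
    {"name": "lookup_order", "args": {"order_id": "1234"}},
    {"name": "issue_refund", "args": {"order_id": "1234", "amount": 25}},
]
wrong_order = list(reversed(expected))

print(tool_calls_match(expected, expected))      # True
print(tool_calls_match(expected, wrong_amount))  # False
print(tool_calls_match(expected, wrong_order))   # False
```

Because nothing here depends on a model, the same inputs always produce the same verdict.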
Model-based scoring lets Claude grade subjective cases (tone, completeness, empathy). Pass the input, agent output, and rubric to Claude; Claude returns a structured score. Useful but not reproducible across model versions. Use it sparingly, and always verify a few examples manually.
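A rough sketch of the grading flow, under stated assumptions: `call_model` is a placeholder for whatever client call sends a prompt to Claude and returns its text, the rubric wording and the 1-5 scale are invented for illustration, and `stub_grader` stands in for the model so the example runs offline.

```python
import json
from typing import Callable

RUBRIC = (
    "Rate the reply from 1 (poor) to 5 (excellent) for empathy and completeness. "
    'Answer with JSON only: {"score": <int>, "reason": "<one sentence>"}'
)


def build_grader_prompt(user_input: str, agent_output: str, rubric: str) -> str:
    """Assemble the rubric, the input, and the agent's output into one prompt."""
    return (
        f"Rubric:\n{rubric}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Agent output:\n{agent_output}\n"
    )


def grade(user_input: str, agent_output: str, call_model: Callable[[str], str]) -> dict:
    """Ask the grader for a structured score and validate its shape."""
    raw = call_model(build_grader_prompt(user_input, agent_output, RUBRIC))
    result = json.loads(raw)  # raises if the grader did not return JSON
    score = result.get("score")
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"grader returned an out-of-range score: {score!r}")
    return result


def stub_grader(prompt: str) -> str:
    """Offline stand-in for a real model call; always returns the same grade."""
    return '{"score": 4, "reason": "Apologizes and states the refund timeline."}'


print(grade("My order arrived broken.", "Sorry about that. Refund in 3-5 days.", stub_grader))
```

Validating the returned score in code is one way to catch a grader that drifts from the requested format; it does not make the grade itself reproducible.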
Regression detection is the hidden value. Run evals monthly: month 1 = 85%, month 2 = 84% after a system prompt edit. Regression detected. Roll back or understand the trade-off. Without evals, the regression goes unnoticed for weeks. With evals, it is caught on the next run.
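The comparison between runs can be sketched as below. The `tolerance` parameter is an assumption added here (the text treats any drop as a regression, which corresponds to the default of 0.0), and the case IDs are made up.

```python
def is_regression(baseline: float, current: float, tolerance: float = 0.0) -> bool:
    """True when the score dropped by more than the allowed tolerance."""
    return (baseline - current) > tolerance


def newly_failing(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case IDs that passed in the baseline run but fail (or are missing) now."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )


# Suite-level check using the figures from the text: 85% then 84%.
print(is_regression(0.85, 0.84))                  # True
print(is_regression(0.85, 0.84, tolerance=0.02))  # False

# Per-case view: which cases flipped from pass to fail?
before = {"case-01": True, "case-02": True, "case-03": False}
after = {"case-01": True, "case-02": False, "case-03": False}
print(newly_failing(before, after))  # ['case-02']
```

The per-case view is usually the more useful of the two, since it points at the specific cases to inspect before deciding whether to roll back.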

Where you'll see it
Golden test suite for refund agent
50 cases: 10 happy path, 10 edge cases, 20 escalations, 10 errors. Run, score, refine. After 3 iterations, the score climbs from 80% to 95%. The 5% of cases that still fail route to human review. Eval-driven refinement beats manual spot-checking by 4x in time-to-production.
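One way to see where a suite like this is weak is to break the pass rate down by category. The sketch below uses the four categories and the 10/10/20/10 split from the text; the pass/fail values and the `category` and `passed` field names are made up.

```python
from collections import defaultdict


def pass_rate_by_category(results: list[dict]) -> dict[str, float]:
    """Return the fraction of passing cases for each category."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        passes[r["category"]] += int(r["passed"])
    return {cat: passes[cat] / totals[cat] for cat in totals}


# Hypothetical results for a 50-case refund suite.
results = (
    [{"category": "happy_path", "passed": True}] * 10
    + [{"category": "edge_case", "passed": True}] * 8
    + [{"category": "edge_case", "passed": False}] * 2
    + [{"category": "escalation", "passed": True}] * 19
    + [{"category": "escalation", "passed": False}] * 1
    + [{"category": "error", "passed": True}] * 10
)

rates = pass_rate_by_category(results)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%}")
```

A single aggregate score would hide that, in this made-up run, the failures cluster in edge cases.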
Regression detection in research agent
Daily eval on 100 cases. Tuesday baseline 82%. Wednesday's prompt refinement scores 79%. Regression caught in hours, not weeks. Rollback or adjust. Without evals, 50+ bad outputs would ship before users complained.
Tool-call eval for CI/CD bot
30 PRs, expected tool sequence [Read, Grep, Bash, Report]. 28/30 pass. The 2 failures called Report twice or skipped Bash. Refine tool descriptions, re-run: 29/30. The eval prevents shipping a bot with incomplete analysis.
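A sketch of how the two failure shapes above (a duplicated `Report`, a skipped `Bash`) could be told apart automatically. The `diagnose` helper and its return fields are hypothetical; only the expected sequence comes from the text.

```python
from collections import Counter

EXPECTED = ["Read", "Grep", "Bash", "Report"]


def diagnose(expected: list[str], actual: list[str]) -> dict:
    """Classify a tool-sequence mismatch as missing, extra, or reordered calls."""
    exp_counts, act_counts = Counter(expected), Counter(actual)
    missing = sorted((exp_counts - act_counts).elements())
    extra = sorted((act_counts - exp_counts).elements())
    passed = actual == expected
    return {
        "passed": passed,
        "missing": missing,
        "extra": extra,
        # Same calls, different order: neither missing nor extra, yet not equal.
        "reordered": not passed and not missing and not extra,
    }


print(diagnose(EXPECTED, ["Read", "Grep", "Bash", "Report", "Report"]))
print(diagnose(EXPECTED, ["Read", "Grep", "Report"]))
print(diagnose(EXPECTED, ["Grep", "Read", "Bash", "Report"]))
```

Knowing whether a call was skipped or repeated suggests which tool description to refine before the re-run.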
Continuous evaluation in shadow mode
Deploy improved agent in parallel (no user impact). Score 1000 real interactions. 94% pass. Fix the 6% (wrong tool, invalid customer_id), redeploy: 98%. Once at 95%+, ship to production. Evals are continuous, not one-time.
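The ship/no-ship decision can be sketched as a threshold check over scored shadow traffic. The 95% bar and the 94% pass rate come from the text; the record fields and the 35/25 split between the two failure reasons are assumptions for illustration.

```python
from collections import Counter

SHIP_THRESHOLD = 0.95  # the "95%+" bar from the text


def shadow_report(interactions: list[dict]) -> dict:
    """Summarize a shadow run: pass rate, ship decision, and failure reasons."""
    total = len(interactions)
    passed = sum(1 for i in interactions if i["passed"])
    reasons = Counter(i["failure_reason"] for i in interactions if not i["passed"])
    pass_rate = passed / total if total else 0.0
    return {
        "pass_rate": pass_rate,
        "ship": pass_rate >= SHIP_THRESHOLD,
        "failure_reasons": dict(reasons),
    }


# Hypothetical shadow traffic: 1000 interactions, 940 passing.
run_1 = (
    [{"passed": True}] * 940
    + [{"passed": False, "failure_reason": "wrong_tool"}] * 35
    + [{"passed": False, "failure_reason": "invalid_customer_id"}] * 25
)
report = shadow_report(run_1)
print(report)
```

Tallying failure reasons alongside the pass rate shows what to fix before the next shadow run.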
Code examples
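import json


def score_agent(test_case: dict, agent_response: dict) -> dict:
    """Compare actual tool calls, outcome, and reason to the expected values."""
    expected = test_case["expected"]
    # Tool-call entries are assumed to expose a .name attribute.
    actual_tools = [b.name for b in agent_response.get("tool_calls", [])]
    score = 0
    if actual_tools == expected["tools"]:
        score += 50  # Tool sequence correct
    if expected["outcome"] == agent_response.get("outcome"):
        score += 30  # Outcome matches
    # A case with no expected reason earns these points automatically;
    # this also avoids the TypeError from evaluating `None in "..."`.
    expected_reason = expected.get("reason")
    if expected_reason is None or expected_reason in (agent_response.get("reason") or ""):
        score += 20  # Specific reason correct
    return {"score": score / 100, "case_id": test_case["id"]}


def run_eval_suite(cases_path: str, agent_fn) -> float:
    """Run every JSONL case through agent_fn and return the average score."""
    with open(cases_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]  # skip blank lines
    scores = []
    for case in cases:
        response = agent_fn(case["input"])
        result = score_agent(case, response)
        scores.append(result["score"])
        if result["score"] < 1.0:
            print(f"FAIL case {case['id']}: {result}")
    avg = sum(scores) / len(scores) if scores else 0.0
    passed = sum(1 for s in scores if s == 1.0)
    print(f"Avg: {avg:.2%} ({passed}/{len(scores)})")
    return avg


# refund_agent_v2 is the agent under test: a callable that takes a case input
# and returns a response dict. It is not defined in this snippet.
run_eval_suite("refund_cases.jsonl", refund_agent_v2)
Looks right, isn't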
Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.
| Looks right | What actually goes wrong |
|---|---|
| Run a few manual examples to verify the prompt works, then deploy. | Manual spot-checks miss edge cases and provide no proof. A prompt working on 5 chosen examples might fail on 10% of real traffic. Evals run 50-100 diverse cases and catch corner cases. |
| Evals are only for NLP text quality; tool-using agents don't need them. | Tool-using agents benefit more from evals. Tool calls are deterministic and crisp to measure: did it call the right tools in the right order with the right arguments? |
| If evals score 95%, the agent is production-ready. | Evals score only on test cases you designed. Missing scenarios (international, mobile, peak load) don't show up. 95% on the suite is necessary but not sufficient. Pair with production monitoring. |
| Use model-based grading for all eval scoring; it's fast and objective. | Model-based grading fails on deterministic measurements. Use it for subjective cases only (tone, completeness). For tool sequences and JSON schemas, code-based scoring is unambiguous and reproducible. |
| Generate test cases by running the agent and treating its output as expected. | That's circular validation. You're measuring whether the agent repeats itself, not whether it's correct. Golden cases must be externally validated (human review, expert sign-off). |
Side-by-side
| Type | Test Data | Scoring | Reproducibility | Best For |
|---|---|---|---|---|
| Code-based (tool calls) | Expected tool sequence | Exact match | 100% | Agent behavior, compliance |
| Code-based (output) | Expected JSON schema | Schema validation | 100% | Structured output |
| Model-based | Rubric + expected | Claude grades | Not reproducible | Subjective (tone, completeness) |
| Hybrid | Code checks + rubric | Code first, model fallback | Mostly reproducible | Complex cases |
| Production monitoring | Real interactions | Live scoring | Always current | Catching real-world regressions |
| Manual spot-check | Hand-picked examples | Subjective judgment | Not reproducible | Quick sanity check (not proof) |
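The hybrid row above ("code first, model fallback") can be read in more than one way. The sketch below takes one reading: the deterministic tool-sequence check decides whenever it can, and a model grader is consulted only when a subjective rubric remains. The function signature and `stub_grader` are hypothetical, and the stub stands in for a real model call.

```python
from typing import Callable, Optional


def hybrid_score(
    expected_tools: list[str],
    actual_tools: list[str],
    rubric: Optional[str],
    agent_output: str,
    model_grader: Callable[[str, str], bool],
) -> dict:
    """Run the deterministic check first; call the model only if still needed."""
    if actual_tools != expected_tools:
        # A wrong tool sequence fails regardless of how good the prose is.
        return {"passed": False, "graded_by": "code"}
    if rubric is None:
        return {"passed": True, "graded_by": "code"}
    return {"passed": model_grader(rubric, agent_output), "graded_by": "model"}


def stub_grader(rubric: str, output: str) -> bool:
    """Offline stand-in for a model-based grader."""
    return "sorry" in output.lower()


tools = ["lookup_order", "issue_refund"]
print(hybrid_score(tools, ["issue_refund"], "Be empathetic.", "Sorry!", stub_grader))
print(hybrid_score(tools, tools, None, "Done.", stub_grader))
print(hybrid_score(tools, tools, "Be empathetic.", "Sorry about that.", stub_grader))
```

Ordering the checks this way keeps model calls, and their non-reproducibility, confined to the cases that code cannot settle.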
Decision tree
Can the correct outcome be checked with code (exact match, schema validation)?
Do you have 50+ test cases that have been externally validated?
Is the metric tied to business impact (compliance, revenue, legal)?
Will you run evals only once or continuously?
Are there edge cases you haven't tested yet?
Frequently asked
What is a 'golden' test suite?
Should I evaluate the agent's text or its tool calls?
How many test cases do I need?
Can I use Claude to grade my agent?
What if evals show the agent is worse after my changes?
How often should I run evals?
Can evals replace user testing?
Difference between eval cases and unit tests?
Can multiple answers be valid in a test case?
[A, B, C] OR [B, B, C] may both pass.
Should evals be in CI/CD?
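On the question of multiple valid answers, one simple approach is to store a list of accepted tool sequences per case and pass if the actual sequence equals any of them. The `matches_any` helper is hypothetical; the sequences are the placeholder ones from the FAQ.

```python
def matches_any(actual: list[str], accepted: list[list[str]]) -> bool:
    """Pass if the actual tool sequence equals any accepted sequence."""
    return any(actual == sequence for sequence in accepted)


accepted = [["A", "B", "C"], ["B", "B", "C"]]

print(matches_any(["A", "B", "C"], accepted))  # True
print(matches_any(["B", "B", "C"], accepted))  # True
print(matches_any(["C", "B", "A"], accepted))  # False
```

Each accepted sequence should still be externally validated, so the suite stays golden.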
