D2.5 · Domain 2 · Tool Design + Integration · 18% of CCA-F

Tool Evaluation & Testing.


Evaluation is how you measure tool-call correctness against ground-truth datasets. A full deep-dive guide is coming soon.

01 · Summary

TLDR

Evaluation is how you measure tool-call correctness against ground-truth datasets. A full deep-dive guide is coming soon.

Coverage tier: C · Exam domain: D2 · Status: stub · Vault depth: thin · Action: research
02 · Definition

What it is

An evaluation (or eval) is a structured test harness that runs prompts and agents against a suite of test cases, scores each result, and measures improvement over iterations. Unlike manual spot-checking, evals are automated and repeatable: the same 50 cases run against Prompt-v1, Prompt-v2, and Prompt-v3, with scores improving from 68% to 87%. Evals are the difference between hope (a prompt feels good) and proof (test cases show +19 points).

The eval workflow has four phases. First, define test cases (30-100 input-expected pairs). Second, run the agent against each case and capture its response and tool calls. Third, score the results with code-based or model-based grading. Fourth, analyze and iterate: which cases failed and why? Refine the prompt, re-run, and measure the delta. A round of iteration is done when the score plateaus. When new cases reveal new failure modes, add them to the suite and keep re-running the old cases so regressions are detected.

Golden test suites are the foundation. Golden means the expected outputs are verified (human-reviewed, audited, canonical). A suite of 50 cases in which only 10 are golden produces misleading scores. Build the golden set incrementally: 10 golden cases in week 1, 50 by month 2, 200 by month 6. The exam tests whether you understand that evals without golden data are theater, not measurement.

Tool-calling evals are distinct from output evals. An output eval scores final text. A tool-calling eval scores behavior: did the agent call the right tools in the right order with the right arguments? For agents, tool-call evals matter more because the agent's job is orchestration, not prose. The exam drills this distinction repeatedly.
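
To make the three questions (right tools, right order, right arguments) concrete, here is a minimal sketch of a tool-call check. The `ToolCall` shape and the tool names `lookup_order` and `issue_refund` are hypothetical, chosen only for illustration; real agent frameworks expose tool calls in their own formats.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation: a name plus its arguments."""
    name: str
    args: dict = field(default_factory=dict)


def check_tool_calls(expected: list[ToolCall], actual: list[ToolCall]) -> dict:
    """Report separately whether the tools, their order, and their arguments match."""
    expected_names = [call.name for call in expected]
    actual_names = [call.name for call in actual]
    right_tools = sorted(expected_names) == sorted(actual_names)
    right_order = expected_names == actual_names
    # Arguments are only comparable position by position once the order is right.
    right_args = right_order and all(e.args == a.args for e, a in zip(expected, actual))
    return {"right_tools": right_tools, "right_order": right_order, "right_args": right_args}


expected = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1"}),
]
swapped = list(reversed(expected))

print(check_tool_calls(expected, expected))
print(check_tool_calls(expected, swapped))
```

Reporting the three checks separately, rather than a single pass/fail, makes it easier to see whether a failure is a tool-selection problem or an ordering problem.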

03 · Mechanics

How it works

An eval harness has three components: test data, a scoring function, and a loop. Test data is JSONL (one case per line): {input, expected_tools, expected_outcome}. The loop runs each case through the agent, parses the tool calls, and feeds the case and the parsed calls to the scoring function. The function returns a numeric score (0-1) or pass/fail. Aggregate across all cases, run on Prompt-v1, refine, run on v2, and measure the delta.
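
The three components and the version-to-version delta can be sketched in a few lines. This is an illustrative sketch, not a reference harness: the two agents are stand-in functions that return tool names directly, and the case fields use the `input` and `expected_tools` names from the paragraph above.

```python
from typing import Callable


def run_suite(
    cases: list[dict],
    agent: Callable[[str], list[str]],
    score: Callable[[dict, list[str]], float],
) -> float:
    """Loop over the cases, score each response, and return the mean score."""
    if not cases:
        raise ValueError("empty test suite")
    return sum(score(case, agent(case["input"])) for case in cases) / len(cases)


def exact_match(case: dict, tool_calls: list[str]) -> float:
    """Code-based scorer: 1.0 only if the tool sequence matches exactly."""
    return 1.0 if tool_calls == case["expected_tools"] else 0.0


cases = [
    {"input": "refund order A1", "expected_tools": ["lookup_order", "issue_refund"]},
    {"input": "where is order B2", "expected_tools": ["lookup_order"]},
]


def agent_v1(text: str) -> list[str]:
    """Stand-in for a first prompt version that never issues refunds."""
    return ["lookup_order"]


def agent_v2(text: str) -> list[str]:
    """Stand-in for a refined prompt version."""
    return ["lookup_order", "issue_refund"] if "refund" in text else ["lookup_order"]


v1 = run_suite(cases, agent_v1, exact_match)
v2 = run_suite(cases, agent_v2, exact_match)
print(f"v1={v1:.0%} v2={v2:.0%} delta={v2 - v1:+.0%}")
```

Passing the scorer in as a function keeps the loop unchanged when a different grading method is swapped in.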

Code-based scoring is deterministic: parse the agent's response, extract tool calls, compare against expected. expected == actual → pass. Works for tool-call evals, JSON schema validation, rule-based checks. 100% reproducible, no hallucination. Use for anything measurable.
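
As an illustration of a rule-based check, the sketch below validates an agent's JSON output against a small set of required fields. It is a hand-rolled stand-in for schema validation rather than a full JSON Schema validator, and the field names in `REQUIRED_FIELDS` are hypothetical.

```python
import json

# Hypothetical output contract: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"action": str, "amount": (int, float), "reason": str}


def validate_output(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    return errors


print(validate_output('{"action": "refund", "amount": 20, "reason": "damaged"}'))
print(validate_output('{"action": "refund"}'))
```

The same input always produces the same list of violations, which is the reproducibility property the paragraph above describes.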

Model-based scoring lets Claude grade subjective cases (tone, completeness, empathy). Pass the input, the agent output, and a rubric to Claude; Claude returns a structured score. This is useful but not reproducible across model versions. Use it sparingly, and always verify a few graded examples manually.
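
A rough sketch of that flow follows. To stay self-contained it takes the grader as a plain callable, `call_model`, which is a hypothetical placeholder for whatever client you use to reach the model; the rubric text and the JSON reply shape are also assumptions made for the example.

```python
import json
from typing import Callable

# Hypothetical rubric; the reply format is an assumption, not a fixed convention.
RUBRIC = (
    "Score tone from 1 (hostile) to 5 (empathetic). "
    'Reply with JSON only: {"score": <int>, "why": <string>}'
)


def build_grading_prompt(user_input: str, agent_output: str, rubric: str = RUBRIC) -> str:
    """Assemble the three things the grader needs: rubric, input, and output."""
    return f"Rubric:\n{rubric}\n\nUser input:\n{user_input}\n\nAgent output:\n{agent_output}"


def grade(user_input: str, agent_output: str, call_model: Callable[[str], str]) -> dict:
    """Ask the grader for a structured score and reject replies that do not parse."""
    reply = call_model(build_grading_prompt(user_input, agent_output))
    try:
        parsed = json.loads(reply)
        score = int(parsed["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"score": None, "why": "unparseable grader reply"}
    if not 1 <= score <= 5:
        return {"score": None, "why": "score outside rubric range"}
    return {"score": score, "why": str(parsed.get("why", ""))}


def fake_model(prompt: str) -> str:
    """Stub grader so the sketch runs without a network call."""
    return '{"score": 4, "why": "polite and complete"}'


print(grade("Where is my refund?", "I'm sorry for the delay; it was issued today.", fake_model))
```

Treating an unparseable or out-of-range reply as "no score" rather than a zero keeps grader failures from being counted as agent failures.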

Regression detection is the hidden value. Run evals monthly: month 1 scores 85%, month 2 scores 84% after a system prompt edit. That is a regression: roll back or understand the trade-off. Without evals, the regression goes unnoticed for weeks. With evals, it is caught on the next run.
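
One simple way to automate this is to compare each run with the one before it. The sketch below uses the month 1 and month 2 scores from the paragraph above; the `tolerance` value is an assumption, and a real threshold would depend on suite size and score noise.

```python
def detect_regression(history: list[tuple[str, float]], tolerance: float = 0.005) -> list[str]:
    """Flag every run whose score fell more than `tolerance` below the previous run."""
    alerts = []
    for (prev_label, prev_score), (label, score) in zip(history, history[1:]):
        if prev_score - score > tolerance:
            alerts.append(f"{label}: {score:.0%} is down from {prev_score:.0%} ({prev_label})")
    return alerts


history = [("month 1", 0.85), ("month 2", 0.84)]
for alert in detect_regression(history):
    print("REGRESSION", alert)
```

In a CI setting the same comparison could fail the build when the alert list is non-empty.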

04 · In production

Where you'll see it

Golden test suite for refund agent

50 cases: 10 happy path, 10 edge cases, 20 escalations, 10 errors. Run, score, refine. After 3 iterations, the score climbs from 80% to 95%. The remaining 5% of cases, the ones that still fail, are routed to human review. Eval-driven refinement beats manual spot-checking by 4x in time-to-production.
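
Because the suite is split into categories, a per-category breakdown shows where the failures concentrate. The sketch below assumes each result carries a `category` label and a `passed` flag; the sample results are invented for illustration and are not measurements from a real agent.

```python
from collections import defaultdict


def pass_rate_by_category(results: list[dict]) -> dict[str, float]:
    """Group pass/fail results by case category and return each category's pass rate."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for result in results:
        totals[result["category"]] += 1
        passed[result["category"]] += int(result["passed"])
    return {category: passed[category] / totals[category] for category in totals}


# Invented results for illustration only.
results = (
    [{"category": "happy_path", "passed": True}] * 10
    + [{"category": "edge_case", "passed": True}] * 8
    + [{"category": "edge_case", "passed": False}] * 2
)

rates = pass_rate_by_category(results)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%}")
```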

Regression detection in research agent

Daily eval on 100 cases. Tuesday's baseline is 82%. Wednesday's prompt refinement scores 79%. The regression is caught in hours, not weeks. Roll back or adjust. Without evals, 50+ bad outputs would ship before users complained.

Tool-call eval for CI/CD bot

30 PRs, expected tool sequence [Read, Grep, Bash, Report]. 28/30 pass. The 2 failures called Report twice or skipped Bash. Refine tool descriptions, re-run: 29/30. The eval prevents shipping a bot with incomplete analysis.
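
A pass/fail count says that 2 cases failed; a small diagnostic says why. The sketch below classifies a failing sequence against the expected `[Read, Grep, Bash, Report]` order. The message wording and the `diagnose` helper are illustrative, not part of any particular framework.

```python
from collections import Counter

EXPECTED = ["Read", "Grep", "Bash", "Report"]


def diagnose(actual: list[str], expected: list[str] = EXPECTED) -> list[str]:
    """Explain how a tool sequence differs from the expected one (empty list = pass)."""
    if actual == expected:
        return []
    expected_counts, actual_counts = Counter(expected), Counter(actual)
    problems = []
    for tool in expected_counts:
        if actual_counts[tool] < expected_counts[tool]:
            problems.append(f"skipped {tool}")
        elif actual_counts[tool] > expected_counts[tool]:
            problems.append(f"called {tool} {actual_counts[tool]} times")
    for tool in actual_counts:
        if tool not in expected_counts:
            problems.append(f"unexpected {tool}")
    # Same tools with the same counts but a different sequence.
    return problems or ["wrong order"]


print(diagnose(["Read", "Grep", "Report"]))
print(diagnose(["Read", "Grep", "Bash", "Report", "Report"]))
```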

Continuous evaluation in shadow mode

Deploy improved agent in parallel (no user impact). Score 1000 real interactions. 94% pass. Fix the 6% (wrong tool, invalid customer_id), redeploy: 98%. Once at 95%+, ship to production. Evals are continuous, not one-time.
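
The promotion decision can be expressed as a small gate over the shadow-run outcomes. In this sketch each interaction is labelled either `pass` or with a failure reason; the 95% threshold comes from the text above, while the split of the failing 6% between the two reasons is invented for the example.

```python
from collections import Counter


def summarize_shadow_run(outcomes: list[str], threshold: float = 0.95) -> dict:
    """Compute the pass rate, the ship decision, and a tally of failure reasons."""
    if not outcomes:
        raise ValueError("no shadow-mode outcomes to summarize")
    counts = Counter(outcomes)
    pass_rate = counts["pass"] / len(outcomes)
    failures = {reason: n for reason, n in counts.items() if reason != "pass"}
    return {"pass_rate": pass_rate, "ship": pass_rate >= threshold, "failures": failures}


# 1000 interactions at 94% pass; the 35/25 split of failure reasons is made up.
outcomes = ["pass"] * 940 + ["wrong_tool"] * 35 + ["invalid_customer_id"] * 25
summary = summarize_shadow_run(outcomes)
print(summary)
```

Keeping the failure tally next to the decision points directly at what to fix before the next shadow run.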

05 · Implementation

Code examples

Golden test suite + scoring loop
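import json

# Each JSONL case is expected to look like:
# {"id": ..., "input": ..., "expected": {"tools": [...], "outcome": ..., "reason": ...}}

def score_agent(test_case: dict, agent_response: dict) -> dict:
    """Compare actual tool calls, outcome, and reason to the expected values."""
    expected = test_case["expected"]
    # Assumes each tool call is an object exposing a .name attribute.
    actual_tools = [b.name for b in agent_response.get("tool_calls", [])]

    score = 0
    if actual_tools == expected["tools"]:
        score += 50  # Tool sequence correct
    if expected["outcome"] == agent_response.get("outcome"):
        score += 30  # Outcome matches
    # `None in "..."` raises TypeError, so only run the substring check when the
    # case specifies a reason. Assumption: a case without one earns these points.
    expected_reason = expected.get("reason")
    if expected_reason is None or expected_reason in (agent_response.get("reason") or ""):
        score += 20  # Specific reason correct (or none required)

    return {"score": score / 100, "case_id": test_case["id"]}

def run_eval_suite(cases_path: str, agent_fn) -> float:
    # Close the file promptly and skip blank lines, which json.loads rejects.
    with open(cases_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    if not cases:
        raise ValueError(f"no test cases found in {cases_path}")

    scores = []
    for case in cases:
        response = agent_fn(case["input"])
        result = score_agent(case, response)
        scores.append(result["score"])
        if result["score"] < 1.0:
            print(f"FAIL case {case['id']}: {result}")

    avg = sum(scores) / len(scores)
    print(f"Avg: {avg:.2%} ({sum(1 for s in scores if s == 1.0)}/{len(scores)})")
    return avg

# refund_agent_v2 is your own agent callable: it takes a case input and returns a dict.
run_eval_suite("refund_cases.jsonl", refund_agent_v2)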
Code-based scoring: deterministic comparison of tool sequences and outcomes. Reproducible, no hallucination.
06 · Distractor patterns

Looks right, isn't

Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.

Looks right

Run a few manual examples to verify the prompt works, then deploy.

Actually wrong

Manual spot-checks miss edge cases and provide no proof. A prompt working on 5 chosen examples might fail on 10% of real traffic. Evals run 50-100 diverse cases and catch corner cases.

Looks right

Evals are only for NLP text quality; tool-using agents don't need them.

Actually wrong

Tool-using agents benefit more from evals. Tool calls are discrete and can be checked deterministically: did it call the right tools in the right order with the right arguments?

Looks right

If evals score 95%, the agent is production-ready.

Actually wrong

Evals score only on test cases you designed. Missing scenarios (international, mobile, peak load) don't show up. 95% on the suite is necessary but not sufficient. Pair with production monitoring.

Looks right

Use model-based grading for all eval scoring; it's fast and objective.

Actually wrong

Model-based grading is a poor fit for deterministic measurements. Use it for subjective cases only (tone, completeness). For tool sequences and JSON schemas, code-based scoring is unambiguous and reproducible.

Looks right

Generate test cases by running the agent and treating its output as expected.

Actually wrong

That's circular validation. You're measuring whether the agent repeats itself, not whether it's correct. Golden cases must be externally validated (human review, expert sign-off).

07 · Compare

Side-by-side

Type | Test Data | Scoring | Reproducibility | Best For
Code-based (tool calls) | Expected tool sequence | Exact match | 100% | Agent behavior, compliance
Code-based (output) | Expected JSON schema | Schema validation | 100% | Structured output
Model-based | Rubric + expected | Claude grades | Not reproducible | Subjective (tone, completeness)
Hybrid | Code checks + rubric | Code first, model fallback | Mostly reproducible | Complex cases
Production monitoring | Real interactions | Live scoring | Always current | Catching real-world regressions
Manual spot-check | Hand-picked examples | Subjective judgment | Not reproducible | Quick sanity check (not proof)
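
The hybrid row can be read as "run the deterministic checks first and call the model only for what code cannot decide". The sketch below is one possible interpretation of that row, not a definitive design: `model_grader` is a hypothetical callable standing in for a rubric-based grader, and the `expected_tools` and `rubric` case fields are assumptions.

```python
from typing import Callable


def hybrid_score(
    case: dict,
    response: dict,
    model_grader: Callable[[dict, dict], float],
) -> dict:
    """Code checks first; fall back to the model grader only when a rubric applies."""
    # Deterministic gate: a wrong tool sequence fails without consulting the model.
    if response.get("tool_calls") != case["expected_tools"]:
        return {"score": 0.0, "graded_by": "code"}
    # No subjective rubric on this case, so the code check is the whole verdict.
    if "rubric" not in case:
        return {"score": 1.0, "graded_by": "code"}
    return {"score": model_grader(case, response), "graded_by": "model"}


def stub_grader(case: dict, response: dict) -> float:
    """Stand-in for a rubric-based model grader."""
    return 0.8


case = {"expected_tools": ["lookup_order"], "rubric": "Is the tone empathetic?"}
print(hybrid_score(case, {"tool_calls": ["issue_refund"]}, stub_grader))
print(hybrid_score(case, {"tool_calls": ["lookup_order"]}, stub_grader))
```

Recording which grader produced each score makes it clear which part of the aggregate is reproducible and which is not.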
08 · When to use

Decision tree

01

Can the correct outcome be checked with code (exact match, schema validation)?

Yes: Use code-based eval. Deterministic, reproducible.
No: Use model-based eval with a rubric. Slower, not reproducible, but handles subjective cases.
02

Do you have 50+ test cases that have been externally validated?

Yes: You have a golden suite. Run evals on every change.
No: Build the golden suite incrementally: 10 this week, 30 next, 50+ by month 2.
03

Is the metric tied to business impact (compliance, revenue, legal)?

Yes: Run evals daily. Automate regression detection and alerting.
No: Weekly evals are sufficient.
04

Will you run evals continuously rather than only once?

Yes: Set up the harness with history tracking and regression detection.
No: A simple pass/fail report is enough.
05

Are there edge cases you haven't tested yet?

Yes: Production monitoring and shadow-mode evals reveal blind spots faster than test cases.
No: The golden suite is comprehensive; evals are the gate to production.
09 · On the exam

Question patterns


139 V2 questions are wired to this concept. Six sample question stems:

  • Your code-review CI bot returns valid JSON with empty findings: [] even on PRs that clearly have issues. Why?
  • Two subagents are returning conflicting reports about the same bug. How do you resolve it?
  • You spawned 4 subagents in parallel; 3 finished and the 4th hangs forever. How do you debug?
  • Two vendors return contradictory market sizes. The agent picks the median and continues. Why is this wrong?
  • A 403 permission failure comes back from infrastructure during a tool call. Should the agent retry or escalate?
  • What is the difference between an error and an escalation?

133 additional questions for this concept live in the practice pillar.

10 · FAQ

Frequently asked

What is a 'golden' test suite?
Golden means the expected outputs are verified (human-reviewed, audited, correct). Without golden data, evals measure consistency, not correctness.
Should I evaluate the agent's text or its tool calls?
For agents, tool calls matter more. An agent that makes the right calls but writes bad text still succeeds at its job. An agent that makes bad calls but writes great text fails.
How many test cases do I need?
30-50 for general use. 100+ for high-stakes (financial, legal). Enough to catch failure modes, not so many you can't review them.
Can I use Claude to grade my agent?
Yes, for subjective cases. Don't use it for deterministic properties (tool sequence, schema). Code-based grading is unambiguous.
What if evals show the agent is worse after my changes?
You've detected a regression. Roll back first, then analyze which cases regressed and decide whether the trade-off is worth it or whether to iterate further.
How often should I run evals?
After every significant change in development. At least weekly in production. Continuous (shadow mode) for high-stakes systems.
Can evals replace user testing?
No. Evals measure performance on test cases you designed. Users try scenarios you didn't anticipate. Use both.
Difference between eval cases and unit tests?
Unit tests check functions. Evals check that the agent solves business problems. Evals are higher-level, testing the full agent loop.
Can multiple answers be valid in a test case?
Yes. Use a flexible scorer that accepts any tool sequence achieving the goal. [A, B, C] or [B, A, C] may both pass.
Should evals be in CI/CD?
Yes. Every PR runs the eval suite. Score drop fails the gate. Evals are part of tool design, not just QA.
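
For the FAQ entry on multiple valid answers, a flexible scorer might look like the sketch below. It shows two hedged options: an explicit allow-list of accepted sequences, and an order-insensitive check that only requires certain tools to appear and a given tool to come last. The tool names `A`, `B`, `C` are placeholders.

```python
def matches_any(actual: list[str], accepted: list[list[str]]) -> bool:
    """Pass if the actual sequence equals any one of the accepted sequences."""
    return actual in accepted


def matches_unordered(actual: list[str], required: set[str], last: str) -> bool:
    """Pass if every required tool was called, in any order, and `last` came last."""
    return bool(actual) and actual[-1] == last and required <= set(actual)


accepted = [["A", "B", "C"], ["B", "A", "C"]]
print(matches_any(["B", "A", "C"], accepted))
print(matches_any(["C", "A", "B"], accepted))
print(matches_unordered(["B", "A", "C"], {"A", "B", "C"}, last="C"))
```

The allow-list is stricter and easier to audit; the unordered check is more permissive and better suited to cases where only the final step has a fixed position.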
11 · Practice with AI

Work this with your AI

Work this concept hands-on with Claude Code, Codex, or claude.ai. Copy a prompt, paste it into your assistant, and practise in tandem. Each one keeps you active (explain it back, get drilled, or build) rather than just reading.

  • Drill it like the exam (scenario MCQs)
    Practice in the exam's scenario-MCQ format with trap awareness.
  • Explain it back (Feynman)
    Build durable, transferable understanding of a concept you can half-state.
  • Test me, adapting the difficulty
    Active recall practice on a concept you think you know.
  • Check my prerequisites first
    Before studying a concept that keeps not sticking.
  • Find the high-leverage 20%
    When a domain feels too big and you are short on time.
Self-check

Test yourself

Three diagnostic questions on this primitive.

Q1. You spot-check 5 examples manually. Production then shows a 12% failure rate. What's the architectural gap?
Manual spot-checks miss edge cases. A prompt working on 5 chosen examples might fail on 10% of real traffic. Evals run 50-100 diverse cases automatically and catch corner cases. Spot-checks are a sanity check, not proof.
Q2. A tool-using agent scores 95% on text-output evals, but production failures are still high. Why?
Text-output evals do not measure what the agent actually did. Tool-call evals matter more for agents: an agent that makes the right calls but writes bad text still succeeds at its job, while one that makes bad calls but writes great text fails. Always run tool-sequence evals; text quality is secondary.
Q3. The eval score climbs from 80% to 87% after a prompt edit. Is the change ready to ship?
Not yet. Run it on fresh test cases that you have not tuned the prompt against. A gain on a fixed suite can reflect overfitting of the prompt to that suite. Production monitoring and shadow-mode evals reveal blind spots faster than test cases.
Last reviewed: 2026-05-04·Refresh cadence: monthly