Tool Evaluation & Testing (D2, 18% of CCA-F) - Claude Architect Concept

01 · Summary

TLDR

Evaluation is how you measure tool-call correctness against ground-truth datasets.

Coverage tier: C

Exam domain: D2

Status: stub

Vault depth: thin

Action: research

02 · Definition

What it is

An evaluation (or eval) is a structured test harness that runs prompts and agents against a suite of test cases, scores each result, and tracks how the score changes across iterations. Unlike manual spot-checking, evals are automated and repeatable: the same 50 cases run against Prompt-v1, Prompt-v2, and Prompt-v3, with scores improving from 68% to 87%. Evals are the difference between hope (a prompt feels good) and proof (test cases show +19 points).

The eval workflow has four phases. Define test cases (30-100 input-expected pairs). Run the agent against each case and capture the response and tool calls. Score results with code-based or model-based grading. Analyze and iterate: which cases failed and why? Refine the prompt, re-run, and measure the delta. Stop iterating on a given suite when the score plateaus; then add new cases to surface new failure modes, and keep re-running the existing ones so regressions are detected.

Golden test suites are the foundation. Golden means the expected outputs are verified (human-reviewed, audited, canonical). A 50-case suite in which only 10 cases are golden produces misleading scores, because the other 40 are graded against unverified answers. Build the golden set incrementally: 10 golden cases in week 1, 50 by month 2, 200 by month 6. The exam tests whether you understand that evals without golden data are theater, not measurement.
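One way to keep unverified cases from inflating the score is to tag each case and report on the golden subset only. This is a minimal sketch: the `golden` field and both helper names are assumptions made for illustration, not part of any case format described above.

```python
def golden_only(cases: list[dict]) -> list[dict]:
    """Keep only cases whose expected output has been externally verified."""
    return [c for c in cases if c.get("golden") is True]


def golden_coverage(cases: list[dict]) -> float:
    """Fraction of the suite that is golden (0.0 for an empty suite)."""
    return len(golden_only(cases)) / len(cases) if cases else 0.0


# Hypothetical suite: the "golden" flag is set by a human reviewer.
suite = [
    {"id": "r-001", "golden": True},
    {"id": "r-002", "golden": False},
    {"id": "r-003"},  # never reviewed, so treated as not golden
    {"id": "r-004", "golden": True},
]

print([c["id"] for c in golden_only(suite)])
print(f"golden coverage: {golden_coverage(suite):.0%}")
```

Reporting coverage next to the score makes it obvious when a headline number rests on only a handful of verified cases.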

Tool-calling evals are distinct from output evals. An output eval scores the final text. A tool-calling eval scores behavior: did the agent call the right tools in the right order with the right arguments? For agents, tool-call evals matter more, because the agent's job is orchestration, not prose. The exam drills this distinction repeatedly.
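The three questions a tool-calling eval asks (right tools, right order, right arguments) can be checked separately, which makes failures easier to read. The sketch below assumes each tool call is a dict with `name` and `args` keys; that shape and the tool names are hypothetical.

```python
def tool_call_checks(expected: list[dict], actual: list[dict]) -> dict:
    """Check tool selection, ordering, and arguments as separate signals."""
    exp_names = [c["name"] for c in expected]
    act_names = [c["name"] for c in actual]
    right_tools = sorted(exp_names) == sorted(act_names)  # same tools, any order
    right_order = exp_names == act_names                   # same tools, same order
    # Arguments are only compared when the sequence itself lines up.
    right_args = right_order and all(
        e.get("args", {}) == a.get("args", {}) for e, a in zip(expected, actual)
    )
    return {"right_tools": right_tools, "right_order": right_order, "right_args": right_args}


expected_calls = [
    {"name": "lookup_order", "args": {"order_id": "A1"}},
    {"name": "issue_refund", "args": {"order_id": "A1", "amount": 20}},
]
swapped = list(reversed(expected_calls))
wrong_amount = [
    expected_calls[0],
    {"name": "issue_refund", "args": {"order_id": "A1", "amount": 200}},
]

print(tool_call_checks(expected_calls, swapped))
print(tool_call_checks(expected_calls, wrong_amount))
```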

03 · Mechanics

How it works

An eval harness has three components: test data, a scoring function, and a loop. Test data is JSONL (one case per line), for example {id, input, expected: {tools, outcome}}, which is the shape the scoring example in section 05 reads. The loop runs each case through the agent, parses the tool calls, and passes the case and the response to the scoring function. The function returns a numeric score (0-1) or pass/fail. Aggregate across all cases, run on Prompt-v1, refine, run on v2, and measure the delta.
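As an illustration of the test-data component, here is a sketch of two cases serialized to JSONL and read back. The field layout follows what the scoring example in section 05 reads (`id`, `input`, `expected.tools`, `expected.outcome`); the inputs and tool names are made up.

```python
import json

# Hypothetical refund cases; one JSON object per line once serialized.
cases = [
    {"id": "r-001", "input": "Refund order A1, the item arrived broken.",
     "expected": {"tools": ["lookup_order", "issue_refund"], "outcome": "refunded"}},
    {"id": "r-002", "input": "Refund order B7 from two years ago.",
     "expected": {"tools": ["lookup_order", "escalate"], "outcome": "escalated"}},
]


def to_jsonl(items: list[dict]) -> str:
    """Serialize cases as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(item) for item in items) + "\n"


def from_jsonl(text: str) -> list[dict]:
    """Parse JSONL, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(cases)
loaded = from_jsonl(jsonl_text)
print(len(loaded), "cases loaded")
```

JSONL keeps each case on its own line, so adding or reviewing one case produces a one-line diff.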

Code-based scoring is deterministic: parse the agent's response, extract tool calls, compare against expected. expected == actual → pass. Works for tool-call evals, JSON schema validation, rule-based checks. 100% reproducible, no hallucination. Use for anything measurable.
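For the structured-output side of code-based scoring, a check can be as small as "does it parse, and are the required fields present with the right types". The sketch below uses only the standard library; the required fields are assumptions chosen for a refund-style output, and a full JSON Schema validator could replace this in a real harness.

```python
import json

# Hypothetical required fields and their accepted Python types.
REQUIRED = {"outcome": str, "amount": (int, float), "reason": str}


def validate_output(raw: str, required: dict = REQUIRED) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in required.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


print(validate_output('{"outcome": "refunded", "amount": 20, "reason": "damaged"}'))
print(validate_output('{"outcome": "refunded"}'))
```

The same input always yields the same list of problems, which is what makes this kind of check reproducible.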

Model-based scoring lets Claude grade subjective cases (tone, completeness, empathy). Pass the input, the agent output, and a rubric to Claude; Claude returns a structured score. This is useful but not reproducible across model versions. Use it sparingly, and always verify a few graded examples manually.
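A sketch of the model-based path follows. To stay runnable without network access, the model call is injected as a plain function (`call_model`, a hypothetical name): in a real harness it would wrap whatever client you use to reach Claude. The rubric text and the 1-5 scale are also assumptions.

```python
import json
from typing import Callable

# Hypothetical rubric; the grader is asked to reply with JSON only.
RUBRIC = ('Score tone and completeness from 1 to 5. '
          'Reply with JSON only: {"score": <int>, "why": "<short reason>"}')


def build_grading_prompt(case_input: str, agent_output: str, rubric: str = RUBRIC) -> str:
    """Assemble the input, the agent output, and the rubric into one grading prompt."""
    return f"Rubric:\n{rubric}\n\nUser input:\n{case_input}\n\nAgent output:\n{agent_output}\n"


def grade(case_input: str, agent_output: str, call_model: Callable[[str], str]) -> dict:
    """Ask the grader for a structured score; flag anything unparseable for a human."""
    raw = call_model(build_grading_prompt(case_input, agent_output))
    try:
        parsed = json.loads(raw)
        score = int(parsed["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"score": None, "why": "unparseable grader reply", "needs_human": True}
    if not 1 <= score <= 5:
        return {"score": None, "why": "score out of range", "needs_human": True}
    return {"score": score, "why": str(parsed.get("why", "")), "needs_human": False}


# Stand-in graders used here instead of a real model call.
def fake_grader(prompt: str) -> str:
    return '{"score": 4, "why": "polite and complete"}'


def broken_grader(prompt: str) -> str:
    return "I think it deserves a four."


print(grade("Where is my refund?", "Your refund was issued today.", fake_grader))
print(grade("Where is my refund?", "Your refund was issued today.", broken_grader))
```

Routing unparseable or out-of-range replies to `needs_human` is one way to act on the advice to verify model grades manually.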

Regression detection is the hidden value. Run evals on a schedule: month 1 scores 85%, month 2 scores 84% after a system prompt edit. That is a regression. Roll back, or understand the trade-off and accept it deliberately. Without evals, the regression goes unnoticed for weeks; with evals, it is caught on the next run.
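Regression detection reduces to comparing the latest run with the previous one, both in aggregate and per case. A minimal sketch, assuming scores are stored as fractions and per-case results as pass/fail booleans; the `tolerance` parameter and both function names are hypothetical.

```python
def detect_regression(history: list[float], tolerance: float = 0.005) -> dict:
    """Compare the latest aggregate score with the previous run."""
    if len(history) < 2:
        return {"regressed": False, "delta": 0.0}
    delta = history[-1] - history[-2]
    # A drop larger than the tolerance counts as a regression.
    return {"regressed": delta < -tolerance, "delta": round(delta, 4)}


def regressed_cases(prev: dict[str, bool], curr: dict[str, bool]) -> list[str]:
    """Case ids that passed in the previous run but do not pass now."""
    return sorted(cid for cid, ok in prev.items() if ok and not curr.get(cid, False))


# The 85% -> 84% example from the text.
print(detect_regression([0.85, 0.84]))

# Hypothetical per-case results for two consecutive runs.
previous = {"r-001": True, "r-002": True, "r-003": False}
current = {"r-001": True, "r-002": False, "r-003": False}
print(regressed_cases(previous, current))
```

The per-case list matters as much as the aggregate: a flat score can hide one case that started failing while another started passing.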


04 · In production

Where you'll see it

Golden test suite for refund agent

50 cases: 10 happy path, 10 edge cases, 20 escalations, 10 errors. Run, score, refine. After 3 iterations, the score climbs from 80% to 95%. The remaining 5% of cases are routed to human review. Eval-driven refinement beats manual spot-checking by 4x in time-to-production.

Regression detection in research agent

Daily eval on 100 cases. Tuesday baseline 82%. Wednesday's prompt refinement scores 79%. Regression caught in hours, not weeks. Rollback or adjust. Without evals, 50+ bad outputs would ship before users complained.

Tool-call eval for CI/CD bot

30 PRs, expected tool sequence [Read, Grep, Bash, Report]. 28/30 pass. The 2 failures called Report twice or skipped Bash. Refine tool descriptions, re-run: 29/30. The eval prevents shipping a bot with incomplete analysis.
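When a sequence check fails, it helps to say how it failed (skipped tool, repeated tool, wrong order) rather than only that it failed. The sketch below does this for the [Read, Grep, Bash, Report] sequence from the scenario; the `diagnose` helper and its message wording are assumptions.

```python
from collections import Counter

EXPECTED_SEQUENCE = ["Read", "Grep", "Bash", "Report"]


def diagnose(actual: list[str], expected: list[str] = EXPECTED_SEQUENCE) -> list[str]:
    """Explain why a tool sequence differs from the expected one ([] means pass)."""
    if actual == expected:
        return []
    exp_counts, act_counts = Counter(expected), Counter(actual)
    problems = []
    for tool in dict.fromkeys(expected):  # each expected tool once, in order
        if act_counts[tool] < exp_counts[tool]:
            problems.append(f"skipped {tool}")
        elif act_counts[tool] > exp_counts[tool]:
            problems.append(f"extra {tool} x{act_counts[tool] - exp_counts[tool]}")
    for tool in act_counts:
        if tool not in exp_counts:
            problems.append(f"unexpected {tool}")
    # Same tools with the same counts, so only the order can differ.
    return problems or ["wrong order"]


print(diagnose(["Read", "Grep", "Bash", "Report", "Report"]))
print(diagnose(["Read", "Grep", "Report"]))
```

Grouping failures by diagnosis shows at a glance whether the tool descriptions or the prompt need attention.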

Continuous evaluation in shadow mode

Deploy improved agent in parallel (no user impact). Score 1000 real interactions. 94% pass. Fix the 6% (wrong tool, invalid customer_id), redeploy: 98%. Once at 95%+, ship to production. Evals are continuous, not one-time.
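The shadow-mode decision above can be sketched as a pass rate plus a tally of failure categories, compared against a ship threshold. The record shape (`failure` set to `None` on a pass) and the split of the failing 6% between the two categories are made up for illustration; only the 94%, 98%, and 95% figures come from the scenario.

```python
from collections import Counter


def shadow_report(results: list[dict], ship_threshold: float = 0.95) -> dict:
    """Summarize shadow-mode results: pass rate, failure tally, ship decision."""
    total = len(results)
    failures = Counter(r["failure"] for r in results if r["failure"])
    pass_rate = (total - sum(failures.values())) / total if total else 0.0
    return {
        "pass_rate": pass_rate,
        "failures": dict(failures),
        "ship": pass_rate >= ship_threshold,
    }


# 1000 shadow interactions; the 40/20 failure split is hypothetical.
first_run = (
    [{"failure": None}] * 940
    + [{"failure": "wrong_tool"}] * 40
    + [{"failure": "invalid_customer_id"}] * 20
)
after_fixes = [{"failure": None}] * 980 + [{"failure": "wrong_tool"}] * 20

print(shadow_report(first_run))
print(shadow_report(after_fixes))
```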

05 · Implementation

Code examples

Golden test suite + scoring loop

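import json

def score_agent(test_case: dict, agent_response: dict) -> dict:
    """Compare actual tool calls and outcome to the expected values."""
    expected = test_case["expected"]
    # Tool calls are assumed to be objects exposing a .name attribute.
    actual_tools = [b.name for b in agent_response.get("tool_calls", [])]

    score = 0
    if actual_tools == expected["tools"]:
        score += 50  # Tool sequence correct (names and order)
    if expected["outcome"] == agent_response.get("outcome"):
        score += 30  # Outcome matches
    # A case with no expected reason earns these points automatically;
    # otherwise the expected reason must appear in the agent's reason.
    reason = expected.get("reason")
    if reason is None or reason in (agent_response.get("reason") or ""):
        score += 20  # Specific reason correct

    return {"score": score / 100, "case_id": test_case["id"]}

def run_eval_suite(cases_path: str, agent_fn) -> float:
    # One JSON object per line; blank lines are skipped.
    with open(cases_path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f if line.strip()]
    if not cases:
        raise ValueError(f"no test cases found in {cases_path}")

    scores = []
    for case in cases:
        response = agent_fn(case["input"])
        result = score_agent(case, response)
        scores.append(result["score"])
        if result["score"] < 1.0:
            print(f"FAIL case {case['id']}: {result}")

    avg = sum(scores) / len(scores)
    passed = sum(1 for s in scores if s == 1.0)
    print(f"Avg: {avg:.2%} ({passed}/{len(scores)} fully passed)")
    return avg

# refund_agent_v2 is the agent under test: it takes a case input and returns a response dict.
run_eval_suite("refund_cases.jsonl", refund_agent_v2)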

Code-based scoring: deterministic comparison of tool sequences and outcomes. Reproducible, no hallucination.

06 · Distractor patterns

Looks right, isn't

Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.

Looks right

Run a few manual examples to verify the prompt works, then deploy.

Actually wrong

Manual spot-checks miss edge cases and provide no proof. A prompt working on 5 chosen examples might fail on 10% of real traffic. Evals run 50-100 diverse cases and catch corner cases.

Looks right

Evals are only for NLP text quality; tool-using agents don't need them.

Actually wrong

Tool-using agents benefit more from evals. Tool calls are discrete and crisp to measure with deterministic code: did it call the right tools in the right order with the right arguments?

Looks right

If evals score 95%, the agent is production-ready.

Actually wrong

Evals score only on test cases you designed. Missing scenarios (international, mobile, peak load) don't show up. 95% on the suite is necessary but not sufficient. Pair with production monitoring.

Looks right

Use model-based grading for all eval scoring; it's fast and objective.

Actually wrong

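Model-based grading is the wrong tool for deterministic measurements: it adds cost and grader variance where an exact comparison would do, and it is neither fast nor objective. Use it for subjective cases only (tone, completeness). For tool sequences and JSON schemas, code-based scoring is unambiguous and reproducible.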

Looks right

Generate test cases by running the agent and treating its output as expected.

Actually wrong

That's circular validation. You're measuring whether the agent repeats itself, not whether it's correct. Golden cases must be externally validated (human review, expert sign-off).

07 · Compare

Side-by-side

Type	Test Data	Scoring	Reproducibility	Best For
Code-based (tool calls)	Expected tool sequence	Exact match	100%	Agent behavior, compliance
Code-based (output)	Expected JSON schema	Schema validation	100%	Structured output
Model-based	Rubric + expected	Claude grades	Not reproducible	Subjective (tone, completeness)
Hybrid	Code checks + rubric	Code first, model fallback	Mostly reproducible	Complex cases
Production monitoring	Real interactions	Live scoring	Always current	Catching real-world regressions
Manual spot-check	Hand-picked examples	Subjective judgment	Not reproducible	Quick sanity check (not proof)
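The Hybrid row (code first, model fallback) can be sketched as a scorer that only calls the model grader when the code check cannot decide. Both graders are injected as functions; the convention that a code check returns `None` for "cannot decide", and all names below, are assumptions.

```python
from typing import Callable, Optional


def hybrid_score(
    case: dict,
    response: dict,
    code_check: Callable[[dict, dict], Optional[bool]],
    model_grade: Callable[[dict, dict], float],
) -> dict:
    """Use the deterministic check when it applies; fall back to the model grader."""
    verdict = code_check(case, response)
    if verdict is not None:
        return {"score": 1.0 if verdict else 0.0, "graded_by": "code"}
    return {"score": model_grade(case, response), "graded_by": "model"}


def exact_tools(case: dict, response: dict) -> Optional[bool]:
    """Deterministic check; returns None when the case has no expected tools."""
    if "expected_tools" not in case:
        return None
    return case["expected_tools"] == response.get("tools")


def stub_grader(case: dict, response: dict) -> float:
    """Stand-in for a rubric-based model grader."""
    return 0.8


print(hybrid_score({"expected_tools": ["lookup_order"]}, {"tools": ["lookup_order"]}, exact_tools, stub_grader))
print(hybrid_score({"rubric": "empathetic tone"}, {"text": "Sorry about that."}, exact_tools, stub_grader))
```

Recording `graded_by` keeps the reproducible and non-reproducible portions of the score separable in reports.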

08 · When to use

Decision tree

01

Can the correct outcome be checked with code (exact match, schema validation)?

Yes: Use code-based eval. Deterministic, reproducible.

No: Use model-based with rubric. Slower, not reproducible, but handles subjective cases.

02

Do you have 50+ test cases that have been externally validated?

Yes: You have a golden suite. Run evals on every change.

No: Build the golden suite incrementally: 10 this week, 30 next, 50+ by month 2.

03

Is the metric tied to business impact (compliance, revenue, legal)?

Yes: Run evals daily. Automate regression detection and alert.

No: Weekly evals are sufficient.

04

Will you run evals continuously rather than only once?

Yes: Set up the harness with history tracking and regression detection.

No: A simple pass/fail report is enough.

05

Are there edge cases you haven't tested yet?

Yes: Production monitoring + shadow mode evals reveal blind spots faster than test cases.

No: Golden suite is comprehensive; evals are the gate to production.

09 · On the exam

Question patterns


139 V2 questions are wired to this concept; six are listed here.

Your code-review CI bot returns valid JSON with empty findings: [] even on PRs that clearly have issues. Why?

Two subagents are returning conflicting reports about the same bug. How do you resolve it?

You spawned 4 subagents in parallel; 3 finished and the 4th hangs forever. How do you debug?

Two vendors return contradictory market sizes. The agent picks the median and continues. Why is this wrong?

A 403 permission failure comes back from infrastructure during a tool call. Should the agent retry or escalate?

What is the difference between an error and an escalation?

10 · FAQ

Frequently asked

What is a 'golden' test suite?

Golden means the expected outputs are verified (human-reviewed, audited, correct). Without golden data, evals measure consistency, not correctness.

Should I evaluate the agent's text or its tool calls?

For agents, tool calls matter more. An agent that makes the right calls but writes bad text still succeeds at its job. An agent that makes bad calls but writes great text fails.

How many test cases do I need?

30-50 for general use. 100+ for high-stakes (financial, legal). Enough to catch failure modes, not so many you can't review them.

Can I use Claude to grade my agent?

Yes, for subjective cases. Don't use it for deterministic properties (tool sequence, schema). Code-based grading is unambiguous.

What if evals show the agent is worse after my changes?

Roll back immediately; you've detected a regression. Then analyze which cases regressed and decide whether the trade-off is worth re-applying the change, or iterate further.

How often should I run evals?

After every significant change in development. At least weekly in production. Continuous (shadow mode) for high-stakes systems.

Can evals replace user testing?

No. Evals measure performance on test cases you designed. Users try scenarios you didn't anticipate. Use both.

Difference between eval cases and unit tests?

Unit tests check functions. Evals check that the agent solves business problems. Evals are higher-level, testing the full agent loop.

Can multiple answers be valid in a test case?

Yes. Use a flexible scorer that accepts any tool sequence achieving the goal. [A, B, C] OR [B, B, C] may both pass.
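A flexible scorer of this kind can be as simple as storing a list of accepted sequences per case and passing if the actual sequence matches any of them. A minimal sketch using the two sequences from the answer above; the function name is hypothetical.

```python
def passes_any(actual: list[str], accepted: list[list[str]]) -> bool:
    """Pass if the actual tool sequence equals any accepted sequence."""
    return any(actual == sequence for sequence in accepted)


# The two accepted sequences from the example above.
ACCEPTED = [["A", "B", "C"], ["B", "B", "C"]]

print(passes_any(["A", "B", "C"], ACCEPTED))
print(passes_any(["B", "B", "C"], ACCEPTED))
print(passes_any(["C", "B", "A"], ACCEPTED))
```

Each accepted sequence should itself be externally verified, otherwise the flexibility quietly weakens the golden suite.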

Should evals be in CI/CD?

Yes. Every PR runs the eval suite. Score drop fails the gate. Evals are part of tool design, not just QA.
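In a pipeline, "score drop fails the gate" usually means turning the comparison into a process exit code. The sketch below assumes the baseline score is stored somewhere the job can read it; the `gate` function and its `max_drop` parameter are hypothetical.

```python
def gate(current: float, baseline: float, max_drop: float = 0.0) -> int:
    """Return a process exit code: 0 passes the gate, 1 fails it."""
    return 0 if current >= baseline - max_drop else 1


# An 82% baseline and a 79% run on the PR branch fail the gate.
print(gate(current=0.79, baseline=0.82))
# With a 5-point allowance, the same run passes.
print(gate(current=0.79, baseline=0.82, max_drop=0.05))

# In a CI script, the result would typically be passed to sys.exit(), e.g.
#   sys.exit(gate(current_score, baseline_score))
```

A small `max_drop` can absorb noise from model-graded cases, while code-graded suites can usually run with no allowance at all.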

11 · Practice with AI

Work this with your AI

Work this concept hands-on with Claude Code, Codex, or claude.ai. Copy a prompt, paste it into your assistant, and practise in tandem. Each one keeps you active (explain it back, get drilled, or build) rather than just reading.

Drill it like the exam (scenario MCQs)
Practice in the exam's scenario-MCQ format with trap awareness.
Explain it back (Feynman)
Build durable, transferable understanding of a concept you can half-state.
Test me, adapting the difficulty
Active recall practice on a concept you think you know.
Check my prerequisites first
Before studying a concept that keeps not sticking.
Find the high-leverage 20%
When a domain feels too big and you are short on time.

