# Tool Evaluation & Testing

> Evaluation is how you measure tool-call correctness against ground-truth datasets. Coverage in vault is thin; needs Phase 6 research for full authoring.

**Domain:** D2 · Tool Design + Integration (18% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/evaluation
**Last reviewed:** 2026-05-04

## Quick stats

- **Coverage tier:** C
- **Exam domain:** D2
- **Status:** stub
- **Vault depth:** thin
- **Action:** research

## What it is

An evaluation (or eval) is a structured test harness that runs prompts and agents against a suite of test cases, scores each result, and measures improvement over iterations. Unlike manual spot-checking, evals are automated and repeatable: the same 50 cases run against Prompt-v1, Prompt-v2, Prompt-v3, with scores improving from 68% to 87%. Evals are the difference between hope (a prompt feels good) and proof (test cases show +13 points).

The eval workflow has four phases. Define test cases (30-100 input-expected pairs). Run the agent against each case; capture the response and tool calls. Score results with code-based or model-based grading. Analyze and iterate: which cases failed and why? Refine the prompt, re-run, measure the delta. You're done iterating when the score plateaus; after that, the suite's job is surfacing new failure modes and catching regressions from later edits.

Golden test suites are the foundation. Golden means the expected outputs are verified (human-reviewed, audited, canonical). A 50-case suite where only 10 cases are golden produces misleading scores. Build golden incrementally: week 1 = 10 golden cases; month 2 = 50; month 6 = 200. The exam tests whether you understand that evals without golden data are theater, not measurement.
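
As a minimal sketch of keeping the suite honest, the harness can load only cases that have passed review. The `verified` flag here is a hypothetical field you would set during human sign-off, not something from the source material:

```python
import json

def load_golden_cases(path: str) -> list[dict]:
    """Load only cases whose expected outputs have been human-verified.

    Assumes each JSONL case carries a hypothetical "verified": true/false flag
    set during review; unverified cases are excluded so the score measures
    correctness, not consistency with unreviewed guesses.
    """
    with open(path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    golden = [case for case in cases if case.get("verified")]
    print(f"Using {len(golden)}/{len(cases)} verified cases")
    return golden
```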

Tool-calling evals are distinct from output evals. An output eval scores final text. A tool-calling eval scores behavior: did the agent call the right tools in the right order with the right arguments? For agents, tool-call evals matter more: the agent's job is orchestration, not prose. The exam drills this distinction repeatedly.

## How it works

An eval harness has three components: test data, a scoring function, and a loop. Test data is JSONL (one case per line): {input, expected_tools, expected_outcome}. The loop runs each case through the agent, parses the tool calls, and feeds both to the scoring function. The scoring function returns a numeric score (0-1) or pass/fail. Aggregate across all cases, run on Prompt-v1, refine, run on v2, measure the delta.

Code-based scoring is deterministic: parse the agent's response, extract the tool calls, compare against expected. expected == actual → pass. Works for tool-call evals, JSON schema validation, rule-based checks. 100% reproducible, no hallucination. Use it for anything measurable.
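
A short sketch of such deterministic checks, combining exact tool-sequence comparison with JSON Schema validation. It uses the third-party `jsonschema` package, and the refund schema itself is illustrative, not from the source:

```python
from jsonschema import ValidationError, validate  # third-party: pip install jsonschema

# Illustrative schema for a structured refund decision (assumption, not canonical)
REFUND_SCHEMA = {
    "type": "object",
    "properties": {
        "outcome": {"enum": ["approved", "denied", "escalated"]},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["outcome"],
}

def code_based_check(expected_tools: list[str], actual_tools: list[str], output: dict) -> bool:
    """Deterministic pass/fail: tool sequence matches exactly AND output fits the schema."""
    if actual_tools != expected_tools:
        return False
    try:
        validate(instance=output, schema=REFUND_SCHEMA)
    except ValidationError:
        return False
    return True
```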

Model-based scoring lets Claude grade subjective cases (tone, completeness, empathy). Pass the input, agent output, and rubric to Claude; Claude returns a structured score. Useful but not reproducible across model versions. Use sparingly, always verify a few examples manually.
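A hedged sketch of model-based grading with the Anthropic Python SDK. The model name, rubric, and reply format are placeholders, and the JSON parse assumes the grader follows instructions, so keep a manual spot-check in the loop:

```python
import json

import anthropic  # official SDK: pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def model_graded_score(case_input: str, agent_output: str, rubric: str) -> dict:
    """Ask Claude to grade one response against a rubric; returns {"score", "rationale"}."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Input:\n{case_input}\n\n"
        f"Agent output:\n{agent_output}\n\n"
        'Grade the output against the rubric. Reply with JSON only: '
        '{"score": <0.0-1.0>, "rationale": "<one sentence>"}'
    )
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; pin the exact model version you standardize on
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    # A production harness should handle malformed or non-JSON replies here.
    return json.loads(message.content[0].text)
```
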

Regression detection is the hidden value. Run evals monthly: month 1 = 85%, month 2 = 84% after a system prompt edit. Regression detected. Roll back or understand the trade-off. Without evals, the regression goes unnoticed for weeks. With evals, caught immediately.
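
A minimal regression-gate sketch, assuming scores are appended to a local history file. The `eval_history.json` path and the 2-point threshold are arbitrary choices, not part of the source:

```python
import json
from datetime import date
from pathlib import Path

HISTORY = Path("eval_history.json")  # hypothetical score log
THRESHOLD = 0.02  # flag drops larger than 2 points

def check_regression(new_score: float) -> bool:
    """Append today's score to the history and flag a regression against the last run."""
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    regressed = bool(history) and (history[-1]["score"] - new_score) > THRESHOLD
    history.append({"date": date.today().isoformat(), "score": new_score})
    HISTORY.write_text(json.dumps(history, indent=2))
    if regressed:
        print(f"REGRESSION: {history[-2]['score']:.2%} -> {new_score:.2%}")
    return regressed
```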

## Where you'll see it in production

### Golden test suite for refund agent

50 cases: 10 happy path, 10 edge cases, 20 escalations, 10 errors. Run, score, refine. After 3 iterations, score climbs from 80% to 95%. Failures route to the 5% that need human review. Eval-driven refinement beats manual spot-checking by 4x in time-to-production.

### Regression detection in research agent

Daily eval on 100 cases. Tuesday baseline 82%. Wednesday's prompt refinement scores 79%. Regression caught in hours, not weeks. Roll back or adjust. Without evals, 50+ bad outputs would ship before users complained.

### Tool-call eval for CI/CD bot

30 PRs, expected tool sequence [Read, Grep, Bash, Report]. 28/30 pass. The 2 failures called Report twice or skipped Bash. Refine tool descriptions, re-run: 29/30. The eval prevents shipping a bot with incomplete analysis.

### Continuous evaluation in shadow mode

Deploy improved agent in parallel (no user impact). Score 1000 real interactions. 94% pass. Fix the 6% (wrong tool, invalid customer_id), redeploy: 98%. Once at 95%+, ship to production. Evals are continuous, not one-time.

## Code examples

### Golden test suite + scoring loop

**Python:**

```python
import json

def score_agent(test_case: dict, agent_response: dict) -> dict:
    """Compare actual tool calls and outcome against the expected (golden) values."""
    expected = test_case["expected"]
    # Tool calls assumed to be dicts like {"name": "get_customer", "arguments": {...}}
    actual_tools = [call["name"] for call in agent_response.get("tool_calls", [])]

    score = 0
    if actual_tools == expected["tools"]:
        score += 50  # Tool sequence correct
    if expected["outcome"] == agent_response.get("outcome"):
        score += 30  # Outcome matches
    expected_reason = expected.get("reason")
    if expected_reason is None or expected_reason in agent_response.get("reason", ""):
        score += 20  # Specific reason correct (or no reason required for this case)

    return {"score": score / 100, "case_id": test_case["id"]}

def run_eval_suite(cases_path: str, agent_fn) -> float:
    """Run every case through the agent, score it, print failures, return the average."""
    with open(cases_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]

    scores = []
    for case in cases:
        response = agent_fn(case["input"])
        result = score_agent(case, response)
        scores.append(result["score"])
        if result["score"] < 1.0:
            print(f"FAIL case {case['id']}: {result}")

    avg = sum(scores) / len(scores)
    perfect = sum(1 for s in scores if s == 1.0)
    print(f"Avg: {avg:.2%} ({perfect}/{len(scores)} perfect)")
    return avg

run_eval_suite("refund_cases.jsonl", refund_agent_v2)  # refund_agent_v2: your agent callable (dict in, dict out)
```

> Code-based scoring: deterministic comparison of tool sequences and outcomes. Reproducible, no hallucination.

**Test cases (JSONL):**

```json
{"id": 1, "scenario": "eligible refund", "input": {"customer_id": "c123", "amount": 200}, "expected": {"tools": ["get_customer", "get_order", "process_refund"], "outcome": "approved"}}
{"id": 2, "scenario": "amount exceeds limit", "input": {"customer_id": "c123", "amount": 600}, "expected": {"tools": ["get_customer", "get_order"], "outcome": "escalated", "reason": "exceeds_policy"}}
{"id": 3, "scenario": "customer suspended", "input": {"customer_id": "c789", "amount": 100}, "expected": {"tools": ["get_customer"], "outcome": "denied", "reason": "status_suspended"}}
```

> JSONL: one case per line. Each carries input, expected tool sequence, expected outcome, and optional reason for fine-grained scoring.

## Looks-right vs actually-wrong

| Looks right | Actually wrong |
|---|---|
| Run a few manual examples to verify the prompt works, then deploy. | Manual spot-checks miss edge cases and provide no proof. A prompt working on 5 chosen examples might fail on 10% of real traffic. Evals run 50-100 diverse cases and catch corner cases. |
| Evals are only for NLP text quality; tool-using agents don't need them. | Tool-using agents benefit more from evals. Tool calls are deterministic and crisp to measure: did it call the right tools in the right order with the right arguments? |
| If evals score 95%, the agent is production-ready. | Evals score only on test cases you designed. Missing scenarios (international, mobile, peak load) don't show up. 95% on the suite is necessary but not sufficient. Pair with production monitoring. |
| Use model-based grading for all eval scoring; it's fast and objective. | Model-based grading fails on deterministic measurements. Use it for subjective cases only (tone, completeness). For tool sequences and JSON schemas, code-based scoring is unambiguous and reproducible. |
| Generate test cases by running the agent and treating its output as expected. | That's circular validation. You're measuring whether the agent repeats itself, not whether it's correct. Golden cases must be externally validated (human review, expert sign-off). |

## Comparison

| Type | Test Data | Scoring | Reproducibility | Best For |
| --- | --- | --- | --- | --- |
| Code-based (tool calls) | Expected tool sequence | Exact match | 100% | Agent behavior, compliance |
| Code-based (output) | Expected JSON schema | Schema validation | 100% | Structured output |
| Model-based | Rubric + expected | Claude grades | Not reproducible | Subjective (tone, completeness) |
| Hybrid | Code checks + rubric | Code first, model fallback | Mostly reproducible | Complex cases |
| Production monitoring | Real interactions | Live scoring | Always current | Catching real-world regressions |
| Manual spot-check | Hand-picked examples | Subjective judgment | Not reproducible | Quick sanity check (not proof) |

## Decision tree

1. **Can the correct outcome be checked with code (exact match, schema validation)?**
   - **Yes:** Use code-based eval. Deterministic, reproducible.
   - **No:** Use model-based with rubric. Slower, not reproducible, but handles subjective cases.

2. **Do you have 50+ test cases that have been externally validated?**
   - **Yes:** You have a golden suite. Run evals every change.
   - **No:** Build golden suite incrementally: 10 this week, 30 next, 50+ by month 2.

3. **Is the metric tied to business impact (compliance, revenue, legal)?**
   - **Yes:** Run evals daily. Automate regression detection and alert.
   - **No:** Weekly evals are sufficient.

4. **Will you run evals continuously (on every change), not just once?**
   - **Yes:** Set up the harness with history tracking and regression detection.
   - **No:** A simple pass/fail report is enough.

5. **Are there edge cases you haven't tested yet?**
   - **Yes:** Production monitoring + shadow mode evals reveal blind spots faster than test cases.
   - **No:** Golden suite is comprehensive; evals are the gate to production.

## Exam-pattern questions

### Q1. You spot-check 5 examples manually. Production scores reveal 12% failure rate. What's the architectural gap?

Manual spot-checks miss edge cases. A prompt working on 5 chosen examples might fail on 10% of real traffic. Evals run 50-100 diverse cases automatically and catch corner cases. Spot-checks are sanity, not proof.

### Q2. Tool-using agent's score is 95% on text-output evals. Production fails are still high. Why?

Tool-call evals matter more for agents. An agent that makes the right calls but writes bad text succeeds at its job; one that makes bad calls with great text fails. Always run tool-sequence evals; text quality is secondary.

### Q3. Eval score climbs from 80% to 87% after a prompt edit. Is the change ready to ship?

Not yet. Run on fresh test cases you haven't tuned against: rising scores on a fixed suite can mask overfitting the prompt to those cases. Production monitoring and shadow-mode evals reveal blind spots faster than fixed test cases.

### Q4. A teammate says "Use Claude to grade the agent's outputs." When is this advice wrong?

For deterministic properties (tool sequence, schema validation, rule compliance), use code-based grading. Deterministic, reproducible, no hallucination. Use Claude grading only for subjective cases (tone, completeness, empathy).

### Q5. You generate test cases by running the current agent and treating its output as expected. Why is this circular?

You're measuring whether the agent repeats itself, not whether it's correct. Golden cases must be externally validated: human review, expert sign-off, or reference implementations. Grading against unverified data is theater.

### Q6. A prompt edit drops the score from 85% to 82%. What should you do?

Roll back immediately. You've detected a regression. Analyze which cases regressed, decide if the trade-off is worth it, or iterate further. Evals let you make this call in minutes, not weeks.

### Q7. Your golden suite has 50 cases, but a production failure happens on a scenario not in the suite. What does this prove?

Evals score only on test cases you designed. 95% on the suite is necessary but not sufficient. Pair evals with production monitoring; real users reveal blind spots faster than fixed test cases.

### Q8. Why are evals part of tool design, not just QA?

Tool descriptions and schemas evolve; evals catch when an edit causes regressions. Run on every PR; failures gate the merge. Evals make tool design measurable, not just artisanal.

## FAQ

### Q1. What is a 'golden' test suite?

Golden means the expected outputs are verified (human-reviewed, audited, correct). Without golden data, evals measure consistency, not correctness.

### Q2. Should I evaluate the agent's text or its tool calls?

For agents, tool calls matter more. An agent that makes the right calls but writes bad text succeeds at its job; one that makes bad calls with great text fails.

### Q3. How many test cases do I need?

30-50 for general use. 100+ for high-stakes (financial, legal). Enough to catch failure modes, not so many you can't review them.

### Q4. Can I use Claude to grade my agent?

Yes, for subjective cases. Don't use it for deterministic properties (tool sequence, schema). Code-based grading is unambiguous.

### Q5. What if evals show the agent is worse after my changes?

Roll back immediately. You've detected a regression. Analyze which cases regressed, decide if the trade-off is worth it, or iterate further.

### Q6. How often should I run evals?

After every significant change in development. At least weekly in production. Continuous (shadow mode) for high-stakes systems.

### Q7. Can evals replace user testing?

No. Evals measure performance on test cases you designed. Users try scenarios you didn't anticipate. Use both.

### Q8. Difference between eval cases and unit tests?

Unit tests check functions. Evals check that the agent solves business problems. Evals are higher-level, testing the full agent loop.

### Q9. Can multiple answers be valid in a test case?

Yes. Use a flexible scorer that accepts any tool sequence achieving the goal. [A, B, C] OR [B, B, C] may both pass.
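
A small sketch of such a scorer, where the accepted sequences are whatever orderings you have verified as goal-achieving for that case (the example sequences are hypothetical):

```python
def flexible_tool_score(accepted_sequences: list[list[str]], actual_tools: list[str]) -> bool:
    """Pass if the actual tool sequence matches ANY sequence verified to achieve the goal."""
    return actual_tools in accepted_sequences

# Hypothetical example: two verified orderings both count as correct.
accepted = [
    ["get_customer", "get_order", "process_refund"],
    ["get_order", "get_customer", "process_refund"],
]
print(flexible_tool_score(accepted, ["get_order", "get_customer", "process_refund"]))  # True
```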

### Q10. Should evals be in CI/CD?

Yes. Every PR runs the eval suite. Score drop fails the gate. Evals are part of tool design, not just QA.
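
A sketch of the CI gate as a small script; the module names `eval_harness` and `agents` are hypothetical stand-ins for wherever your harness and agent callable live, and the 90% gate is an arbitrary example:

```python
import sys

from eval_harness import run_eval_suite  # hypothetical module holding the harness shown earlier
from agents import refund_agent_v2       # hypothetical: your agent callable

GATE = 0.90  # minimum acceptable average score; tune per project

if __name__ == "__main__":
    score = run_eval_suite("refund_cases.jsonl", refund_agent_v2)
    if score < GATE:
        print(f"Eval gate failed: {score:.2%} < {GATE:.0%}")
        sys.exit(1)  # non-zero exit fails the PR check
    print(f"Eval gate passed: {score:.2%}")
```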

---

**Source:** https://claudearchitectcertification.com/concepts/evaluation
**Vault sources:** ACP-T03 §4.3 evaluation overview; ASC-A01 Course 6 evals lessons
**Last reviewed:** 2026-05-04

**Evidence tiers** — 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.
