D2.1 · Domain 2 · Tool Design + Integration · 18% of CCA-F

Tool Calling.

8 min read·10 sections·Tier A

Tool calling is how Claude decides to invoke external functions and pass structured arguments. Tools are routing infrastructure, good descriptions reduce the need for classifiers or few-shot examples. The quality of tool design is the primary lever for correct task routing, not model size. messages API tools schema

Foundation patternDomain 218+ anti-patterns
Tool Calling, hero illustration featuring Loop mascot in a warm gallery scene.
Domain D2Tool Design + Integration · 18%
On this page
01 · Summary

TLDR

Tool calling is how Claude decides to invoke external functions and pass structured arguments. Tools are routing infrastructure, good descriptions reduce the need for classifiers or few-shot examples. The quality of tool design is the primary lever for correct task routing, not model size. messages API tools schema

5
Description components
4–5
Optimal tools/agent
18+
Degradation threshold
D2
Exam domain
3
tool_choice modes
02 · Definition

What it is

Tool calling is the mechanism that lets Claude request function execution. You define a tool schema (name, description, input_schema JSON), pass it to messages.create() with the tools parameter, and Claude decides whether to call a tool. A tool_use block lands in the response with the tool's name and input. Your harness executes the function, captures the result, and appends a tool_result block. From Claude's perspective, tool calling is probabilistic negotiation: the model reads what's available and decides if any tool fits.

The description field is load-bearing. It's not documentation, it's a policy document Claude reads every turn to decide when and how to call the tool. Vague descriptions cause misrouting, Claude guesses which tool is right. Strong descriptions have four parts: (1) what it does (one sentence), (2) when to use (concrete trigger), (3) edge cases (what's excluded), (4) boundaries (preconditions, ordering). A 4-line description dramatically reduces misrouting; bigger models do not fix vague descriptions.

The tool registry is your side of the contract. When Claude calls verify_customer, you look up the function in a registry, execute it, and catch all exceptions. Never let exceptions escape, Claude sees only what you put in tool_result.content. If you swallow errors silently, Claude retries the same broken call. If you return a structured error like {"error": "customer_not_found", "hint": "verify cus_ prefix"}, Claude reads it, adjusts, and retries successfully.

Production tool-calling fails in two predictable ways: description-shape misrouting (vague or overlapping descriptions) and tool proliferation (beyond ~18 tools, selection accuracy degrades sharply). The second is architectural: if you need 30 capabilities, split into specialized subagents. The exam drills both patterns. A common distractor: "upgrade the model to fix misrouting." The real fix is to rewrite the descriptions.

03 · Mechanics

How it works

Tool definitions are structured metadata. Each tool has a name, description, and input_schema. The input_schema specifies field types, required fields, and field descriptions. Claude does not see the function body, only the schema. Precise schemas matter: they constrain Claude's input guessing. A permissive schema like {data: any} forces hallucination; a strict schema like {customer_id: string, method: enum} gives the model a deterministic target.

When Claude decides to call a tool, the response includes a tool_use block: {type: "tool_use", id: "tooluse_xyz", name: "verify_customer", input: {customer_id: "cus_123", method: "email_otp"}}. Your harness looks up the tool by name, validates the input, executes the function, and appends a tool_result block: {type: "tool_result", tool_use_id: "tooluse_xyz", content: "{...result...}"}. The id field links the result back to the tool_use so Claude knows which produced which.

The execution harness is critical. Wrap every tool call in try/except. If the tool throws, return a structured error string. Claude reads errors and adjusts, your harness's job is to surface them, not hide them. Structured errors look like {error: "out_of_stock", detail: "qty exceeds available", retry_hint: "try smaller quantity"}. Claude responds to hints; opaque errors break the loop.

Multiple tool_use blocks in a single response are independent. Claude might call verify_customer and lookup_order in the same turn. Your harness executes both, builds a tool_results array, and appends them as a single user-role message with all results. Don't append one at a time, batch them in a single {role: "user", content: [tool_result, tool_result, ...]} block. This is a frequent exam distractor.

Tool Calling mechanics, painterly diagram featuring Loop mascot.
04 · In production

Where you'll see it

Customer support routing

Agent has 5 tools: verify_customer, lookup_order, refund_policy, process_refund, escalate. Each tool's description spells out when-to-use and edge cases. Misrouting drops from 30% → 4% when descriptions become precise rather than vague.

Multi-source synthesis

Research agent has search_papers, search_blogs, query_database, web_fetch. Tool descriptions disambiguate (e.g., 'search_papers: peer-reviewed only; use when citation discipline matters'). Without the disambiguation, agent calls web_fetch instead of search_papers and pollutes results.

Sequential code review passes

Three sequential tool sets: pass 1 (security_scan), pass 2 (style_check), pass 3 (perf_check). Each pass dedicates the model's full attention to one concern. A single mega-tool with 20+ params dilutes attention and misses edge cases.

05 · Implementation

Code examples

Tool definition with strong description anatomy
# Anatomy of a good tool description:
# 1. What it does (one verb-led sentence)
# 2. When to use (concrete trigger conditions)
# 3. Edge cases (what's excluded; what's risky)
# 4. Boundaries (preconditions; ordering with other tools)

verify_customer = {
    "name": "verify_customer",
    "description": (
        # 1. What it does
        "Verify customer identity against the auth service. "
        # 2. When to use
        "Call FIRST in any refund or account-modification workflow. "
        # 3. Edge cases
        "Returns is_verified=false for closed accounts; do not proceed. "
        # 4. Boundaries
        "Must be called before lookup_order or process_refund."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "Stripe customer ID (cus_xxx). Not email.",
            },
            "method": {
                "type": "string",
                "enum": ["email_otp", "sms_otp", "security_question"],
            },
        },
        "required": ["customer_id", "method"],
    },
}

# Anti-pattern (vague):
verify_customer_bad = {
    "name": "verify_customer",
    "description": "Verifies a customer.",  # Too vague, model guesses
    "input_schema": {
        "type": "object",
        "properties": {
            "customer": {"type": "string"},  # email? id? phone?
        },
    },
}
Good descriptions are 4-line policy documents. Vague descriptions are the #1 cause of misrouting, bigger models don't fix it.
06 · Distractor patterns

Looks right, isn't

Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.

Looks right

Misrouting is fixed by upgrading the model.

Actually wrong

Bigger models route slightly better but cannot recover from vague descriptions. Fix the description anatomy first (what / when / edge cases / boundaries), almost always the root cause.

Looks right

Add 25 tools so the agent can handle every edge case.

Actually wrong

Beyond ~18 tools, selection accuracy degrades sharply. Reduce to 4-5 core tools per agent role. If you need more capabilities, split into specialized subagents.

Looks right

Use a permissive schema like data: any to accept any extraction shape.

Actually wrong

Permissive schemas force the model to guess structure, it returns null or fabricated values. Specify exact field types so the model has a deterministic target.

Looks right

Append each tool_result as its own user message so Claude processes them sequentially.

Actually wrong

When a single response contains multiple tool_use blocks, all `tool_result` blocks must be packed into ONE `user` message. Splitting them across messages breaks the tool_use_idtool_result pairing rule and the next API call returns a 400 error: tool_use ids without corresponding tool_result.

Looks right

If a tool fails, return an empty string so Claude knows nothing happened.

Actually wrong

Empty tool_result.content is interpreted as a silent success by the model, which then proceeds with the wrong assumption. Always return a structured error like {"error": "...", "hint": "..."} AND set the optional `is_error: true` flag on the tool_result block. Both signals matter, the flag tells Claude to retry, the content tells it how.

07 · Compare

Side-by-side

AspectTool callingStructured outputsMCP server
What it isMechanism (Claude calls funcs)Pattern (force a tool with JSON schema)Infrastructure (pre-built tool sets)
You writeTool definitions per agentSchema in tool definition + tool_choice forced.mcp.json config; server is pre-built
Best forCustom logic per agentReliable extractionStandard integrations (GitHub, Slack)
Failure modeVague descriptions cause misroutingPermissive schema causes fabricationStale server config; auth misalignment
Caching surfaceTool definitions are cacheable (cache_control on the tools array)Same as tool calling, schema lives inside the cached toolServer can also cache responses independently
Latency cost+1 round-trip per tool turn+1 round-trip per extraction+1 round-trip + server's own latency budget
08 · When to use

Decision tree

01

Is the integration with a known service (GitHub, Slack, Postgres)?

YesUse an MCP server. No custom plumbing. Configure in .mcp.json.
NoDefine your own tools. Apply the 4-line description anatomy.
02

Does the agent have more than ~18 tools?

YesSplit into specialized subagents. Selection degrades past this threshold.
NoSingle agent is fine. Keep tool count to 4-5 per role if possible.
03

Is misrouting happening?

YesAudit tool descriptions first. Add when/edge-cases/boundaries. Don't reach for fine-tuning.
NoTool design is healthy.
04

Does Claude call tools in parallel within one turn?

YesPack all tool_result blocks into a SINGLE user message. Splitting across messages causes 400 errors. Order inside the array does not matter, the tool_use_id does the routing.
NoSequential single-tool turns are simpler. One result, one user message.
05

Are tool definitions stable across requests in this session?

YesMark the tools array with cache_control: {type: "ephemeral"}. Cached tool definitions cut input cost ~90% for the cached prefix on every subsequent turn.
NoIf tools are dynamic per-request, caching the system prompt is the only remaining lever.
09 · On the exam

Question patterns

Tool Calling exam trap, painterly cautionary scene featuring Loop mascot.

127 V2 questions wired to this concept. Tap an answer to check it instantly — you'll see whether it's right and why — then expand the full breakdown for the mental model and all four rationales.

Your agentic loop keeps running after Claude has clearly finished its task. Which control was most likely missed?

Tap your answer to check it.

An agent loop hits a budget ceiling and you keep raising max_iterations to make it pass. What is the real problem?

Tap your answer to check it.

Your refund agent emits the words I am processing your refund now but then makes another tool call. The harness exits early. Why?

Tap your answer to check it.

A tool throws an exception inside your harness; on the next iteration Claude requests the same tool again with the same arguments. Why?

Tap your answer to check it.

Two of your tools have similar names (fetch_data and get_data). The model picks the wrong one 30% of the time. What is the best first fix?

Tap your answer to check it.

Your code-review subagent missed three critical issues that you can clearly see in the file. Why?

Tap your answer to check it.

121 additional questions for this concept live in the practice pillar. Take a mock exam ↗

10 · FAQ

Frequently asked

Why does Claude sometimes call the wrong tool even with good descriptions?
Two tools with overlapping triggers look the same to the model. Audit descriptions for the same verb ("search", "fetch", "lookup") used across multiple tools. Add Use this when X, NOT when Y boundary clauses; the explicit exclusion does more than any positive description.
Does the order of tools in the `tools` array matter?
Mildly. The model has a slight recency bias toward later tools in the array, especially with vague descriptions. Place rarely-used tools last and frequently-used tools first. Don't rely on order as a routing mechanism, it's a tiebreaker, not a rule.
Can a single `tool_use` block call the same tool with different args twice?
No. Each tool_use block is one call with one input object. Parallel tool use emits multiple separate tool_use blocks in the same response, each with a unique id. If you need the same tool with two different inputs, you'll get two blocks.
How do I stop Claude from calling tools in an infinite loop?
Three structural checks: (1) you're appending tool_result after every tool_use (most common bug), (2) tool descriptions don't conflict, (3) stop_reason branching is correct. `max_iterations` caps mask the bug, not fix it. A healthy loop converges in 3-8 turns for most tasks.
Should the tool name be a verb or a noun?
Verb-led: verify_customer, fetch_invoice, search_papers. The model parses tool names as actions; nouns like customer_info ambiguate (is it a getter? a setter?). Verb_object pattern is the canonical Anthropic style.
Do tool definitions count toward `max_tokens` for the response?
No. Tools live in the input budget. They're billed as input tokens on every request unless cached. Large tool sets (10+ tools with verbose descriptions) can cost 1500-3000 tokens per request, prompt caching is the standard mitigation.
What's the JSON Schema dialect Claude expects?
A subset of JSON Schema 2020-12. Supported: type, properties, required, enum, description, items, format (some), nested objects. Not supported: $ref resolution across files, custom keywords, oneOf/anyOf with complex discriminators (works in simple cases). Keep schemas flat and explicit.
How do I handle a tool that has both required and conditional fields?
Use a discriminator pattern: a required type enum that gates which other fields are required. Example: {type: "refund" | "credit", amount: number, refund_reason?: string, credit_account?: string}. Document the conditional in the description; JSON Schema's structural enforcement is limited.
Can I update a tool's description mid-conversation?
Yes, just send the new tools array with the next request. The new description applies from that turn forward; previous turns are unaffected. This is occasionally useful for dynamic tools whose semantics change based on user permissions, but it invalidates prompt caching for the tools section.
What's the difference between `tool_result` and a regular user message?
tool_result is a content block type, not a message role. It lives inside a user message: {role: "user", content: [{type: "tool_result", tool_use_id: "...", content: "..."}]}. The block is what links back to the tool_use by id. Sending raw text in a user message instead of a structured tool_result block breaks the pairing.
11 · Practice with AI

Work this with your AI

Work this concept hands-on with Claude Code, Codex, or claude.ai. Copy a prompt, paste it into your assistant, and practise in tandem. Each one keeps you active (explain it back, get drilled, or build) rather than just reading.

  • Drill it like the exam (scenario MCQs)
    Practice in the exam's scenario-MCQ format with trap awareness.
  • Explain it back (Feynman)
    Build durable, transferable understanding of a concept you can half-state.
  • Test me, adapting the difficulty
    Active recall practice on a concept you think you know.
  • Check my prerequisites first
    Before studying a concept that keeps not sticking.
  • Find the high-leverage 20%
    When a domain feels too big and you are short on time.
Self-check

Test yourself

Three diagnostic questions on this primitive. Reveal each answer when you have a guess. Want a full 60-question mock? Open the mock hub →

Q1Your agent calls the wrong tool 30% of the time across 8 similar tools. What's the first fix?
Audit tool descriptions. Add the four-part anatomy: what it does, when to use, edge cases, ordering boundaries. Vague descriptions cause misrouting; bigger models don't fix it.
Q2You add 25 tools and selection accuracy drops sharply. Why?
Beyond ~18 tools, attention dilutes and Claude can't reliably select. Split into specialized subagents, each with 4-5 tools max. Or merge overlapping tools.
Q3A tool's `input_schema` is `{data: any}` and Claude returns null. What happened?
The schema gave no constraint, so Claude has no deterministic target. Model guessed null. Specify exact fields with types and examples; constraints reduce hallucination.
Last reviewed: 2026-05-04·Refresh cadence: monthly
D2.1 · D2 · Tool Design + Integration

Tool Calling, complete.

You've covered the full ten-section breakdown for this primitive, definition, mechanics, code, false positives, comparison, decision tree, exam patterns, and FAQ. One technical primitive down on the path to CCA-F.

More platforms →