# Agentic Tool Design

> A meta-skill scenario about designing tool registries. The optimum is 4-5 tools per agent (past 5, routing accuracy drops 8% per extra tool); each tool follows the Anthropic 4-line description pattern (what / when / edge cases / ordering); risky calls go through a PreToolUse hook; results pass through a PostToolUse hook for normalization and side-effect guards; errors are labeled with one of four structured buckets (Transient · Permission · Data · Business) so retry logic can branch correctly. The most-tested distractor: a 15-tool agent that 'just needs a smarter model'. No, it needs a smaller registry.

**Sub-marker:** P3.7
**Domains:** D2 · Tool Design + Integration, D3 · Agent Operations
**Exam weight:** 38% of CCA-F (D2 + D3)
**Build time:** 24 minutes
**Source:** 🟡 Beyond-guide scenario · OP-claimed (Reddit 1s34iyl) · architecture matches Anthropic public guidance
**Canonical:** https://claudearchitectcertification.com/scenarios/agentic-tool-design
**Last reviewed:** 2026-05-04

## In plain English

Think of this as designing the toolbox before the agent picks it up. What tools to give, how to describe them, what guard-rails to put around them. The trap is to give the agent fifteen tools and hope; the discipline is to give it four or five really well-described tools, write each description in the same four-line pattern (what / when / edge cases / ordering), wrap risky calls in PreToolUse hooks, normalize the results in PostToolUse hooks, and label every error with one of four buckets so retry logic can reason about it. The whole point is that tool design IS agent design. Get the toolbox right and the rest of the agent works.

## Exam impact

Domain 2 (Tool Design, 18%) tests the 4-5-tool optimum, the 4-line description pattern, and tool_choice mechanics. Domain 3 (Agent Operations, 20%) tests hook-based policy enforcement (PreToolUse + PostToolUse) and structured-error contracts. Beyond-guide, but architecturally consistent with Anthropic's published tool-use guidance. The 'why does my 15-tool agent route badly?' question is the canonical exam distractor.

## The problem

### What the customer needs
- A tool registry the agent routes accurately: the right tool fires on the first try ≥ 95% of the time.
- Risky operations gated structurally: refund caps, destructive Bash, and write access policed by hooks, not prompts.
- Errors the agent can reason about: a permission denial looks different from a transient timeout, and the agent retries accordingly.

### Why naive approaches fail
- 15-tool registry → routing accuracy drops 8% per tool past 5; the agent alternates and misses obvious matches.
- Vague one-line tool descriptions → agent picks the wrong tool ~12% of the time.
- Prompt-only policy enforcement ('never refund > $500') → leaks 3-5% in production despite emphatic phrasing.

### Definition of done
- Tool count per agent ≤ 5; rare tools moved to specialist sub-agents
- Every tool description follows the 4-line pattern (what / when / edge cases / ordering)
- PreToolUse hook gates every policy-bearing tool; exit 2 on violation
- PostToolUse hook normalizes outputs and logs every call to the audit trail
- Tool errors emit one of the 4 structured buckets (Transient · Permission · Data · Business)
- MCP servers used for cross-agent tool sharing. No inline duplication

## Concepts in play

- 🟢 **Tool calling** (`tool-calling`): tool registry contract + 4-line description pattern
- 🟢 **tool_choice** (`tool-choice`): auto for specialists; forced only for mandatory extraction
- 🟢 **Hooks** (`hooks`): PreToolUse + PostToolUse as structural gates
- 🟢 **Model Context Protocol** (`mcp`): tool sharing across agents via MCP servers
- 🟢 **Evaluation** (`evaluation`): tool selection accuracy as the routing test
- 🟢 **Structured outputs** (`structured-outputs`): is_error + 4-bucket error contract
- 🟢 **Subagents** (`subagents`): move rare tools to specialist sub-agents
- 🟢 **Agentic loops** (`agentic-loops`): stop_reason: tool_use → execute → continue
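The loop in that last bullet is worth seeing as code. Below is a minimal, runnable simulation of the control flow: `model_step` and `execute_tool` are injected stand-ins (not SDK names) so the loop is visible without a live API call; in production, `model_step` would wrap `client.messages.create`.

```python
def agent_loop(model_step, execute_tool, user_message: str, max_turns: int = 10):
    """Minimal agentic loop: call the model, run requested tools, feed results back."""
    messages = [{"role": "user", "content": user_message}]
    resp = None
    for _ in range(max_turns):
        resp = model_step(messages)  # stand-in for client.messages.create(...)
        messages.append({"role": "assistant", "content": resp["content"]})
        if resp["stop_reason"] != "tool_use":
            break  # model produced a final answer; stop looping
        # Execute every tool_use block and return all results in one user turn
        tool_results = [
            {"type": "tool_result", "tool_use_id": block["id"],
             "content": execute_tool(block["name"], block["input"])}
            for block in resp["content"] if block["type"] == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
    return resp
```

Both hooks and the error contract below slot into `execute_tool`: PreToolUse gates before the call, PostToolUse normalizes after it.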

## Components

### Tool Registry (4-5 Tools), the optimum, not the maximum

The agent's toolbox. Empirically, 4-5 tools is the routing-accuracy sweet spot; past that, accuracy drops ~8% per added tool because descriptions overlap and the model alternates. If the use case needs more tools, split into specialist sub-agents, each with its own 4-5-tool registry, and route between them with a triage classifier.

**Configuration:** Cap: 4-5 tools. Beyond: split agents. Don't cram a customer-support tool, a refund tool, a sentiment tool, and 11 admin tools into one agent. That's three agents pretending to be one.
**Concept:** `tool-calling`

### 4-Line Description Pattern, what · when · edge cases · ordering

The canonical tool description shape. Line 1: what the tool does. Line 2: when to call it. Line 3: edge cases (returns, failure modes). Line 4: ordering (which tools must come before / after). This pattern makes routing structural. The model reads the pattern in every description and learns the shape. Vague one-liners produce ~12% wrong-tool selection; the 4-line pattern drops it below 3%.

**Configuration:** description: "Look up a customer by customer_id and confirm they are active.\nUse this BEFORE any other tool that mentions the customer.\nEdge cases: returns 'not_found' if customer_id is missing.\nAlways run before lookup_order or process_refund."
**Concept:** `tool-calling`

### PreToolUse Hook (Policy Gate), deterministic, before exec

Sits between the model's tool_use request and the actual tool execution. Reads tool_input (e.g., refund_amount), compares it to policy (amount <= cap), and exits 0 (allow) or 2 (deny with a stderr message). A deny routes the model back with the policy reason, and the agent re-plans. The single most effective lever for converting probabilistic prompt-only policies into fully deterministic gates.

**Configuration:** matcher: "process_refund". Hook reads stdin JSON: {tool_name, tool_input}. Exits 0 to allow, exits 2 with stderr to deny. SDK forwards stderr back to the model as a tool_result with is_error=true.
**Concept:** `hooks`

### PostToolUse Hook (Normalization + Audit), after exec, before next turn

Fires AFTER the tool runs but BEFORE the result is fed to the model. Normalizes raw outputs (timestamps to ISO-8601, status codes to enum names, field renames), captures side-effect signals, and writes the canonical audit log entry. Without it, the model sees inconsistent output shapes across calls; with it, every call has a predictable contract and an audit trail.

**Configuration:** matcher: '*'. Hook reads stdin: {tool_name, tool_input, tool_result}. Transforms tool_result into the normalized shape. Writes an audit row to a durable log. Always exits 0 (it never denies; denying is PreToolUse's job).
**Concept:** `hooks`

### 4-Bucket Structured Error Contract, Transient · Permission · Data · Business

Every tool that can fail emits an error in one of four explicit buckets, not a free-form string. The harness reads is_error: true + error.bucket, then routes: Transient → retry; Permission → escalate (don't retry, won't fix itself); Data → surface to user; Business → block + log + escalate. Without this contract, the agent retries permission errors forever and surfaces transient ones as catastrophes.

**Configuration:** tool_result on failure: { is_error: true, content: { bucket: "Transient"|"Permission"|"Data"|"Business", code, detail, retryable: bool } }. Agent reads bucket and retryable; never branches on detail text.
**Concept:** `structured-outputs`

## Build steps

### 1. Cap the registry at 4-5 tools (split otherwise)

Audit your current tool list. Past 5, you're guaranteed to lose routing accuracy. The fix is structural: identify which use cases actually share state and which don't, then split into specialist agents with their own 4-5-tool registries. Use a triage classifier (or a top-level coordinator agent) to route each user request to the right specialist.

**Python:**

```python
# AUDIT: count + classify tools
SUPPORT_TOOLS = ["verify_customer", "lookup_order", "process_refund",
                 "escalate_to_human", "audit_log"]  # 5. At the optimum
ADMIN_TOOLS = ["create_user", "delete_user", "reset_password",
               "lock_account", "unlock_account", "audit_admin"]  # 6. Split

# WRONG: cram all 11 into one agent
# tools = SUPPORT_TOOLS + ADMIN_TOOLS  # 11 tools: 6 past the cap, ~8% accuracy loss each

# RIGHT: two specialist agents, triage routes between them
def triage(user_request: str) -> str:
    """Tiny classifier. Pick the specialist agent."""
    if any(w in user_request.lower() for w in ["refund", "order", "ticket"]):
        return "support"
    if any(w in user_request.lower() for w in ["password", "account", "user"]):
        return "admin"
    return "support"  # default

def route(user_request: str) -> dict:
    specialist = triage(user_request)
    tools = SUPPORT_TOOLS if specialist == "support" else ADMIN_TOOLS
    return run_agent(tools=tools, message=user_request)
```

**TypeScript:**

```typescript
// AUDIT: count + classify tools
const SUPPORT_TOOLS = [
  "verify_customer", "lookup_order", "process_refund",
  "escalate_to_human", "audit_log",
] as const; // 5. At the optimum
const ADMIN_TOOLS = [
  "create_user", "delete_user", "reset_password",
  "lock_account", "unlock_account", "audit_admin",
] as const; // 6. Split

// WRONG: cram all 11 into one agent
// const tools = [...SUPPORT_TOOLS, ...ADMIN_TOOLS]; // 11 tools: 6 past the cap, ~8% accuracy loss each

// RIGHT: two specialist agents, triage routes between them
function triage(userRequest: string): "support" | "admin" {
  const r = userRequest.toLowerCase();
  if (["refund", "order", "ticket"].some((w) => r.includes(w))) return "support";
  if (["password", "account", "user"].some((w) => r.includes(w))) return "admin";
  return "support";
}

async function route(userRequest: string) {
  const specialist = triage(userRequest);
  const tools = specialist === "support" ? SUPPORT_TOOLS : ADMIN_TOOLS;
  return runAgent({ tools, message: userRequest });
}
```

Concept: `tool-calling`

### 2. Write every tool description in the 4-line pattern

Line 1: what (one sentence). Line 2: when (which user intent triggers this). Line 3: edge cases (what happens on failure or missing args). Line 4: ordering (which tools must come before / after). The pattern is a structural cue: the model reads the same shape across all 5 tool descriptions and routes accordingly. Vague one-liners produce ~12% wrong-tool selection; this pattern drops it below 3%.

**Python:**

```python
# Anthropic 4-line pattern. What / when / edge cases / ordering
TOOLS = [
    {
        "name": "verify_customer",
        "description": (
            "Look up a customer by customer_id and confirm they are active.\n"
            "Use this BEFORE any other tool that mentions the customer.\n"
            "Edge cases: returns 'not_found' if customer_id is missing or stale.\n"
            "Always run before lookup_order or process_refund."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string", "pattern": "^cust_[0-9]+$"}},
            "required": ["customer_id"],
        },
    },
    {
        "name": "process_refund",
        "description": (
            "Issue a refund to a verified customer up to the policy cap.\n"
            "Use ONLY after verify_customer has confirmed the customer is active.\n"
            "Edge cases: returns Permission error if amount > policy_cap (handled by hook).\n"
            "Never call before verify_customer; never call twice in one conversation."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "amount": {"type": "number", "minimum": 0},
                "reason": {"type": "string", "enum": ["damage", "wrong_item", "late", "other"]},
            },
            "required": ["customer_id", "amount", "reason"],
        },
    },
]
```

**TypeScript:**

```typescript
// Anthropic 4-line pattern. What / when / edge cases / ordering
const TOOLS: Anthropic.Tool[] = [
  {
    name: "verify_customer",
    description:
      "Look up a customer by customer_id and confirm they are active.\n" +
      "Use this BEFORE any other tool that mentions the customer.\n" +
      "Edge cases: returns 'not_found' if customer_id is missing or stale.\n" +
      "Always run before lookup_order or process_refund.",
    input_schema: {
      type: "object",
      properties: { customer_id: { type: "string", pattern: "^cust_[0-9]+$" } },
      required: ["customer_id"],
    },
  },
  {
    name: "process_refund",
    description:
      "Issue a refund to a verified customer up to the policy cap.\n" +
      "Use ONLY after verify_customer has confirmed the customer is active.\n" +
      "Edge cases: returns Permission error if amount > policy_cap (handled by hook).\n" +
      "Never call before verify_customer; never call twice in one conversation.",
    input_schema: {
      type: "object",
      properties: {
        customer_id: { type: "string" },
        amount: { type: "number", minimum: 0 },
        reason: { type: "string", enum: ["damage", "wrong_item", "late", "other"] },
      },
      required: ["customer_id", "amount", "reason"],
    },
  },
];
```
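One way to keep the pattern honest is a tiny lint in CI. The helper below is illustrative (`lint_descriptions` is not part of any SDK): it only checks that each description has exactly four non-empty lines, which catches most drift back toward vague one-liners.

```python
def lint_descriptions(tools: list[dict]) -> list[str]:
    """Return a problem report for every tool description that isn't 4 lines."""
    problems = []
    for tool in tools:
        lines = [ln for ln in tool["description"].split("\n") if ln.strip()]
        if len(lines) != 4:
            problems.append(f"{tool['name']}: {len(lines)} lines, expected 4"
                            " (what / when / edge cases / ordering)")
    return problems

# In CI: fail the build if any tool drifts from the pattern
# assert not lint_descriptions(TOOLS), lint_descriptions(TOOLS)
```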

Concept: `tool-calling`

### 3. Wire the PreToolUse hook on policy-bearing tools

For every tool that touches money, identity, or destructive state, the PreToolUse hook is the architectural gate. It reads tool_input from stdin JSON, applies the policy check in code (not in a prompt), and exits 0 or 2. Exit 2's stderr message is fed back to the model as a tool_result with is_error: true. The model re-plans with the policy in view.

**Python:**

```python
# .claude/hooks/refund_policy.py
import sys, json, os

POLICY_CAP = float(os.environ.get("REFUND_CAP", "500"))

def main():
    payload = json.loads(sys.stdin.read())
    tool_name = payload["tool_name"]
    tool_input = payload["tool_input"]

    if tool_name != "process_refund":
        sys.exit(0)  # not our concern, allow

    amount = tool_input.get("amount", 0)
    if amount > POLICY_CAP:
        # Stderr is fed back to the model as a tool_result with is_error=true
        print(
            f"refund ${amount} exceeds policy cap ${POLICY_CAP}; "
            f"escalate via escalate_to_human or reduce the amount",
            file=sys.stderr,
        )
        sys.exit(2)  # DENY

    # Additional structural checks. Verify customer is active, etc.
    if not tool_input.get("customer_id", "").startswith("cust_"):
        print("customer_id missing or malformed; call verify_customer first",
              file=sys.stderr)
        sys.exit(2)

    sys.exit(0)  # allow

if __name__ == "__main__":
    main()
```

**TypeScript:**

```typescript
// .claude/hooks/refund-policy.ts
import { readFileSync } from "node:fs";

const POLICY_CAP = Number(process.env.REFUND_CAP ?? "500");

const payload = JSON.parse(readFileSync(0, "utf8"));
const toolName: string = payload.tool_name;
const toolInput = payload.tool_input ?? {};

if (toolName !== "process_refund") {
  process.exit(0); // not our concern, allow
}

const amount = (toolInput.amount as number) ?? 0;
if (amount > POLICY_CAP) {
  // Stderr is fed back to the model as a tool_result with is_error=true
  process.stderr.write(
    `refund $${amount} exceeds policy cap $${POLICY_CAP}; ` +
      `escalate via escalate_to_human or reduce the amount\n`,
  );
  process.exit(2); // DENY
}

if (!String(toolInput.customer_id ?? "").startsWith("cust_")) {
  process.stderr.write("customer_id missing or malformed; call verify_customer first\n");
  process.exit(2);
}

process.exit(0); // allow
```
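Neither script fires until it's registered. A sketch of the wiring in `.claude/settings.json`, assuming the Claude Code hook schema (a tool-name matcher plus a command); treat the exact field names as version-dependent and check your installed docs:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "process_refund",
        "hooks": [
          { "type": "command", "command": "python .claude/hooks/refund_policy.py" }
        ]
      }
    ]
  }
}
```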

Concept: `hooks`

### 4. Wire the PostToolUse hook for normalization + audit

PostToolUse fires AFTER the tool runs, BEFORE the model sees the result. Two jobs: normalize the output shape (timestamps to ISO-8601, status codes to enum names, ms to seconds, etc.) so the model sees a consistent contract across calls; and write a canonical audit row capturing tool_name, tool_input, normalized_output, latency, and stop_reason context. The audit log is the replay tool when production breaks at turn 18.

**Python:**

```python
# .claude/hooks/postuse_normalize.py
import sys, json, datetime

def normalize(tool_name: str, raw: dict) -> dict:
    """Project tool-specific raw output into a stable shape."""
    if tool_name == "lookup_order":
        return {
            "order_id": raw.get("id") or raw.get("order_id"),
            "status": (raw.get("status") or "unknown").upper(),
            "created_at": (
                datetime.datetime
                .fromtimestamp(raw["created_unix"], tz=datetime.timezone.utc)
                .isoformat()
            ) if "created_unix" in raw else raw.get("created_at"),
            "total_cents": int(raw.get("total_cents", round(raw.get("total_dollars", 0) * 100))),
        }
    if tool_name == "verify_customer":
        return {
            "customer_id": raw.get("customer_id") or raw.get("id"),
            "active": bool(raw.get("active") or raw.get("is_active")),
            "tier": raw.get("tier") or raw.get("plan") or "standard",
        }
    return raw  # tools without a known shape pass through

def main():
    payload = json.loads(sys.stdin.read())
    tool_name = payload["tool_name"]
    raw_result = payload["tool_result"]

    normalized = normalize(tool_name, raw_result)
    payload["tool_result"] = normalized

    # Append canonical audit row
    with open("audit.jsonl", "a") as f:
        f.write(json.dumps({
            "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "tool": tool_name,
            "input": payload["tool_input"],
            "output": normalized,
            "latency_ms": payload.get("latency_ms"),
        }) + "\n")

    # Pipe normalized payload back to stdout. SDK uses this as the new tool_result
    print(json.dumps(payload))
    sys.exit(0)

if __name__ == "__main__":
    main()
```

**TypeScript:**

```typescript
// .claude/hooks/postuse-normalize.ts
import { readFileSync, appendFileSync } from "node:fs";

function normalize(toolName: string, raw: Record<string, unknown>) {
  if (toolName === "lookup_order") {
    return {
      order_id: raw.id ?? raw.order_id,
      status: String(raw.status ?? "unknown").toUpperCase(),
      created_at: raw.created_unix
        ? new Date((raw.created_unix as number) * 1000).toISOString()
        : raw.created_at,
      total_cents:
        (raw.total_cents as number) ??
        Math.round(((raw.total_dollars as number) ?? 0) * 100),
    };
  }
  if (toolName === "verify_customer") {
    return {
      customer_id: raw.customer_id ?? raw.id,
      active: Boolean(raw.active ?? raw.is_active),
      tier: raw.tier ?? raw.plan ?? "standard",
    };
  }
  return raw; // tools without a known shape pass through
}

const payload = JSON.parse(readFileSync(0, "utf8"));
const normalized = normalize(payload.tool_name, payload.tool_result ?? {});
payload.tool_result = normalized;

// Append canonical audit row
appendFileSync(
  "audit.jsonl",
  JSON.stringify({
    ts: new Date().toISOString(),
    tool: payload.tool_name,
    input: payload.tool_input,
    output: normalized,
    latency_ms: payload.latency_ms,
  }) + "\n",
);

// Pipe normalized payload back to stdout. SDK uses this as the new tool_result
process.stdout.write(JSON.stringify(payload));
process.exit(0);
```

Concept: `hooks`

### 5. Emit errors in 4 structured buckets

Every tool that can fail returns an error tagged with one of four buckets: Transient (network blip: retry), Permission (401/403: escalate; retrying won't fix it), Data (input malformed: surface to the user), Business (policy violation: log + escalate). The agent reads bucket and retryable, and routes accordingly. Without this contract, the agent retries permission errors forever and surfaces transient blips as catastrophes.

**Python:**

```python
from enum import Enum
from typing import TypedDict

class ErrorBucket(str, Enum):
    TRANSIENT  = "Transient"
    PERMISSION = "Permission"
    DATA       = "Data"
    BUSINESS   = "Business"

class ToolError(TypedDict):
    bucket: ErrorBucket
    code: str          # e.g. "RATE_LIMITED", "FORBIDDEN", "INVALID_INPUT", "POLICY_BREACH"
    detail: str        # human-readable; for the model
    retryable: bool

def classify(http_status: int, body: dict) -> ToolError:
    """Project arbitrary backend error → 4-bucket contract."""
    if http_status >= 500 or http_status == 429:
        return {"bucket": ErrorBucket.TRANSIENT, "code": "RETRY",
                "detail": f"upstream {http_status}; retry with backoff",
                "retryable": True}
    if http_status in (401, 403):
        return {"bucket": ErrorBucket.PERMISSION, "code": "FORBIDDEN",
                "detail": "agent lacks permission; escalate, do not retry",
                "retryable": False}
    if http_status == 400:
        return {"bucket": ErrorBucket.DATA, "code": "INVALID_INPUT",
                "detail": body.get("message", "input failed validation"),
                "retryable": False}  # retry won't fix; user must reformulate
    if http_status == 422:
        return {"bucket": ErrorBucket.BUSINESS, "code": "POLICY_BREACH",
                "detail": body.get("message", "request violates business policy"),
                "retryable": False}
    return {"bucket": ErrorBucket.TRANSIENT, "code": "UNKNOWN",
            "detail": str(body), "retryable": True}

# Tool wrapper emits the contract via tool_result
def call_lookup_order(order_id: str) -> dict:
    resp = http_get(f"/orders/{order_id}")
    if resp.status_code != 200:
        return {"is_error": True, "error": classify(resp.status_code, resp.json())}
    return {"is_error": False, "data": resp.json()}
```

**TypeScript:**

```typescript
enum ErrorBucket {
  Transient  = "Transient",
  Permission = "Permission",
  Data       = "Data",
  Business   = "Business",
}

interface ToolError {
  bucket: ErrorBucket;
  code: string;       // e.g. "RATE_LIMITED", "FORBIDDEN", "INVALID_INPUT", "POLICY_BREACH"
  detail: string;     // human-readable; for the model
  retryable: boolean;
}

function classify(httpStatus: number, body: Record<string, unknown>): ToolError {
  if (httpStatus >= 500 || httpStatus === 429) {
    return { bucket: ErrorBucket.Transient, code: "RETRY",
             detail: `upstream ${httpStatus}; retry with backoff`, retryable: true };
  }
  if (httpStatus === 401 || httpStatus === 403) {
    return { bucket: ErrorBucket.Permission, code: "FORBIDDEN",
             detail: "agent lacks permission; escalate, do not retry",
             retryable: false };
  }
  if (httpStatus === 400) {
    return { bucket: ErrorBucket.Data, code: "INVALID_INPUT",
             detail: (body.message as string) ?? "input failed validation",
             retryable: false };
  }
  if (httpStatus === 422) {
    return { bucket: ErrorBucket.Business, code: "POLICY_BREACH",
             detail: (body.message as string) ?? "request violates business policy",
             retryable: false };
  }
  return { bucket: ErrorBucket.Transient, code: "UNKNOWN",
           detail: JSON.stringify(body), retryable: true };
}

// Tool wrapper emits the contract via tool_result
async function callLookupOrder(orderId: string) {
  const resp = await fetch(`/orders/${orderId}`);
  if (!resp.ok) {
    return { is_error: true as const, error: classify(resp.status, await resp.json()) };
  }
  return { is_error: false as const, data: await resp.json() };
}
```
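On the harness side, the branch table is small enough to write out. A minimal sketch, assuming the ToolError shape above; `route_error` and `call_with_retry` are illustrative names, not SDK functions.

```python
import time
from typing import Callable

def route_error(error: dict) -> str:
    """Map a 4-bucket ToolError onto the harness action it demands."""
    actions = {"Transient": "retry", "Permission": "escalate",
               "Data": "surface_to_user", "Business": "block_and_log"}
    return actions[error["bucket"]]

def call_with_retry(tool: Callable[[], dict], max_attempts: int = 3,
                    base_delay: float = 0.0) -> dict:
    """Retry only Transient errors with exponential backoff; return others as-is."""
    result: dict = {}
    for attempt in range(max_attempts):
        result = tool()
        if not result.get("is_error"):
            return result
        if route_error(result["error"]) != "retry":
            return result  # Permission/Data/Business: retrying won't help
        time.sleep(base_delay * (2 ** attempt))
    return result  # still failing after max_attempts: surface the last error
```

Note the asymmetry: a Permission error exits on the first attempt, while a Transient one burns all attempts before giving up.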

Concept: `structured-outputs`

### 6. Use tool_choice 'auto' for specialists; 'forced' only for mandatory extraction

tool_choice: 'auto' is the right default: the model decides whether to call any tool, and which one, based on the request. tool_choice: 'any' forces the model to call SOME tool (rarely useful). tool_choice: { type: 'tool', name: ... } forces a specific tool and is only correct for extraction pipelines where the tool call is mandatory. Forcing a tool on a conversational specialist agent removes its capacity to reason about whether a tool is needed at all.

**Python:**

```python
# auto. The right default for specialist agents
def support_agent(message: str):
    return client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=1024,
        tools=SUPPORT_TOOLS,
        tool_choice={"type": "auto"},  # agent decides
        messages=[{"role": "user", "content": message}],
    )

# forced. Only for mandatory extraction
def extract_one(email: str):
    return client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=1024,
        tools=[EXTRACT_TOOL],
        tool_choice={"type": "tool", "name": "extract_record"},  # MUST fire
        messages=[{"role": "user", "content": email}],
    )

# any. Rarely useful; "must call SOME tool but I won't pick which"
# def some_specialist():
#     return client.messages.create(
#         tool_choice={"type": "any"},
#         ...  # only for unusual flows where any of N tools is acceptable
```

**TypeScript:**

```typescript
// auto. The right default for specialist agents
async function supportAgent(message: string) {
  return client.messages.create({
    model: "claude-sonnet-4.5",
    max_tokens: 1024,
    tools: SUPPORT_TOOLS,
    tool_choice: { type: "auto" }, // agent decides
    messages: [{ role: "user", content: message }],
  });
}

// forced. Only for mandatory extraction
async function extractOne(email: string) {
  return client.messages.create({
    model: "claude-sonnet-4.5",
    max_tokens: 1024,
    tools: [EXTRACT_TOOL],
    tool_choice: { type: "tool", name: "extract_record" }, // MUST fire
    messages: [{ role: "user", content: email }],
  });
}

// any. Rarely useful; "must call SOME tool but I won't pick which"
// async function someSpecialist() {
//   return client.messages.create({
//     tool_choice: { type: "any" },
//     ...
//   });
// }
```

Concept: `tool-choice`

### 7. Share tools across agents via MCP servers

When two agents both need lookup_order or verify_customer, don't duplicate the tool inline. Expose it through an MCP server: each agent connects to the server, which advertises its tools, so there is a single source of truth. Updating the tool's behavior is a single deploy; the agents pick it up automatically. MCP also abstracts auth, observability, and rate-limiting away from each agent.

**Python:**

```python
# .claude/mcp.json. Declare the MCP servers your project uses
# {
#   "mcpServers": {
#     "crm": {
#       "command": "npx",
#       "args": ["-y", "@yourorg/crm-mcp-server"],
#       "env": {
#         "CRM_API_KEY": "${CRM_API_KEY}",
#         "CRM_BASE_URL": "https://crm.example.com"
#       }
#     }
#   }
# }

# In your agent. MCP tools auto-show up in the tools list
async def support_agent(message: str):
    # The CRM MCP server contributes verify_customer + lookup_order
    return await client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=1024,
        # tools are auto-loaded from MCP. Your local registry only adds:
        tools=[PROCESS_REFUND_TOOL, ESCALATE_TO_HUMAN_TOOL],
        tool_choice={"type": "auto"},
        messages=[{"role": "user", "content": message}],
        mcp_servers=["crm"],  # fictional pseudo-API; real call sites vary by SDK
    )
```

**TypeScript:**

```typescript
// .claude/mcp.json. Declare the MCP servers your project uses
// {
//   "mcpServers": {
//     "crm": {
//       "command": "npx",
//       "args": ["-y", "@yourorg/crm-mcp-server"],
//       "env": {
//         "CRM_API_KEY": "${CRM_API_KEY}",
//         "CRM_BASE_URL": "https://crm.example.com"
//       }
//     }
//   }
// }

// In your agent. MCP tools auto-show up in the tools list
async function supportAgent(message: string) {
  // The CRM MCP server contributes verify_customer + lookup_order
  return client.messages.create({
    model: "claude-sonnet-4.5",
    max_tokens: 1024,
    // tools are auto-loaded from MCP. Your local registry only adds:
    tools: [PROCESS_REFUND_TOOL, ESCALATE_TO_HUMAN_TOOL],
    tool_choice: { type: "auto" },
    messages: [{ role: "user", content: message }],
    // mcpServers: ["crm"], // pseudo-API; real call sites vary by SDK
  });
}
```

Concept: `mcp`

### 8. Test routing accuracy with a 50-intent eval set

Tool design is empirical. Build a 50-intent eval set where each intent has a known correct tool. Run the agent over it, count first-call accuracy. Below 95% routing accuracy means the descriptions need work; below 90% likely means too many tools. Re-run the eval after every tool addition or description tweak.

**Python:**

```python
# eval/tool_routing.py. Measure first-call routing accuracy
INTENTS = [
    {"text": "I want a refund for order 12345",
     "expected_first_tool": "verify_customer"},
    {"text": "My order hasn't arrived",
     "expected_first_tool": "verify_customer"},
    {"text": "Cancel my account please",
     "expected_first_tool": "escalate_to_human"},
    # ... 47 more
]

def routing_accuracy() -> dict:
    correct = total = 0
    misses = []
    for intent in INTENTS:
        resp = client.messages.create(
            model="claude-sonnet-4.5",
            max_tokens=512,
            tools=SUPPORT_TOOLS,
            tool_choice={"type": "auto"},
            messages=[{"role": "user", "content": intent["text"]}],
        )
        first_tool = next(
            (b.name for b in resp.content if b.type == "tool_use"),
            None,
        )
        total += 1
        if first_tool == intent["expected_first_tool"]:
            correct += 1
        else:
            misses.append({
                "text": intent["text"],
                "expected": intent["expected_first_tool"],
                "got": first_tool,
            })
    return {
        "accuracy": correct / total,
        "n": total,
        "misses": misses[:10],  # first 10 for review
    }

# Re-run after every change to the tool registry; gate deploys on >= 95%
```

**TypeScript:**

```typescript
// eval/tool-routing.ts. Measure first-call routing accuracy
const INTENTS = [
  { text: "I want a refund for order 12345", expected_first_tool: "verify_customer" },
  { text: "My order hasn't arrived", expected_first_tool: "verify_customer" },
  { text: "Cancel my account please", expected_first_tool: "escalate_to_human" },
  // ... 47 more
];

async function routingAccuracy() {
  let correct = 0;
  let total = 0;
  const misses: Array<{ text: string; expected: string; got: string | null }> = [];
  for (const intent of INTENTS) {
    const resp = await client.messages.create({
      model: "claude-sonnet-4.5",
      max_tokens: 512,
      tools: SUPPORT_TOOLS,
      tool_choice: { type: "auto" },
      messages: [{ role: "user", content: intent.text }],
    });
    const firstTool =
      resp.content.find((b): b is Anthropic.ToolUseBlock => b.type === "tool_use")
        ?.name ?? null;
    total++;
    if (firstTool === intent.expected_first_tool) {
      correct++;
    } else {
      misses.push({
        text: intent.text,
        expected: intent.expected_first_tool,
        got: firstTool,
      });
    }
  }
  return {
    accuracy: correct / total,
    n: total,
    misses: misses.slice(0, 10),
  };
}

// Re-run after every change to the tool registry; gate deploys on >= 95%
```

Concept: `evaluation`

## Decision matrix

| Decision | Right answer | Wrong answer | Why |
|---|---|---|---|
| Tool count for a specialist agent | 4-5 tools (the optimum); split into specialist sub-agents past 5 | 15 tools 'because the model is smart enough' | Routing accuracy drops ~8% per tool past 5. By 15 tools, accuracy is ~30% lower than at 5. The fix is structural (split the agent), not model-side (bigger model doesn't compensate for ambiguous tool descriptions). |
| Tool description format | 4-line pattern: what / when / edge cases / ordering | One vague sentence ('verifies a customer') | Vague descriptions cause ~12% wrong-tool selection. The 4-line pattern is the model's structural cue. It reads the pattern across all 5 descriptions and routes accordingly. Wrong-tool rate drops below 3%. |
| Refund-cap policy enforcement | PreToolUse hook reads tool_input.amount, exits 2 on violation | System prompt: 'never refund more than $500' | Prompt-only enforcement leaks 3-5% in production despite emphatic phrasing. PreToolUse hooks are deterministic. Exit 2 means deny, full stop. For policy-bearing tools, the hook is the only credible architecture. |
| Tool error contract | 4 buckets (Transient · Permission · Data · Business) + retryable boolean | Free-form error messages parsed by the agent | Free-form errors force the agent to interpret strings; structured buckets let it BRANCH (Transient → retry; Permission → escalate; Data → surface; Business → block + log). Without buckets, the agent retries permission errors forever and panics on transient blips. |
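
The 4-line description pattern in the second row can be made concrete as a tool definition. This is a minimal sketch: the `name` / `description` / `input_schema` fields match the Anthropic Messages API `tools` array shape, but the `verify_customer` tool and its schema are hypothetical.

```typescript
// Hypothetical verify_customer tool written in the 4-line pattern.
// Lines: what / when / edge cases / ordering.
const VERIFY_CUSTOMER = {
  name: "verify_customer",
  description: [
    "Verifies a customer's identity against the CRM by email or customer_id.",
    "Use when the request references an order, refund, or account change.",
    "Fails with a Data error if neither email nor customer_id is present.",
    "Always run BEFORE lookup_order and process_refund.",
  ].join("\n"),
  input_schema: {
    type: "object" as const,
    properties: {
      email: { type: "string", description: "Customer email, if known" },
      customer_id: { type: "string", description: "CRM customer id, if known" },
    },
  },
};
```

Because every description carries the same four lines in the same order, the model can compare "when" lines across the registry instead of inferring each tool's shape from scratch.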

## Failure modes

| Anti-pattern | Failure | Fix |
|---|---|---|
| AP-ATD-01 · Tool count > 5 | 15-tool agent with overlapping descriptions. Routing accuracy at first-call drops to ~65%; the agent alternates between similar tools, sometimes calls 3 tools before settling on the right one. Latency up; cost up; quality down. | Cap at 4-5 tools per agent. Move rare tools (used <10% of conversations) to specialist sub-agents. Use a triage classifier to route requests to the right sub-agent. Each sub-agent stays at 4-5 tools. |
| AP-ATD-02 · Vague tool descriptions | Tool descriptions are one-liners ('verifies the customer', 'looks up an order'). Agent misroutes ~12% of calls because it can't tell when each tool applies. | Anthropic 4-line pattern: what / when / edge cases / ordering. Each line targets a specific routing decision the model has to make. Wrong-tool rate drops below 3%. |
| AP-ATD-03 · No PreToolUse hook on risky operations | Refund cap enforced via system-prompt language. Production logs show 3-5% of refunds violate the cap. Audit fails; finance rolls back; trust in the agent drops. | PreToolUse hook reads tool_input.amount, compares to policy, exits 2 with stderr message on violation. Deterministic, not probabilistic. Policy violations drop to 0. |
| AP-ATD-04 · Unhandled tool errors | Tool returns 403; agent retries indefinitely. Tool returns 500; agent crashes. Tool returns 400 with malformed input; agent gives up without surfacing the input issue to the user. | 4-bucket structured error contract (Transient · Permission · Data · Business) with retryable boolean. Agent reads bucket, branches: Transient → retry with backoff; Permission → escalate; Data → surface to user; Business → block + log + escalate. |
| AP-ATD-05 · No tool ordering guidance | Agent calls lookup_order before verify_customer. Wrong record returned 12% of the time because the customer_id wasn't validated first. Bad data pollutes downstream decisions. | Tool descriptions explicitly state ordering ('Always run BEFORE process_refund'; 'Use ONLY after verify_customer has confirmed the customer is active'). The 4-line pattern's last line is for ordering precisely because ordering is so often the routing failure. |
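
The AP-ATD-03 fix can be sketched as a pure check function, assuming a hook payload that carries `tool_name` and `tool_input` as in Claude Code's PreToolUse JSON; the $500 cap and the `process_refund` name are illustrative. Factoring the check out of the I/O makes the allow/deny logic unit-testable.

```typescript
// PreToolUse refund-cap check, factored as a pure function so it can be
// unit-tested for allow/deny behavior.
const REFUND_CAP = 500; // assumed policy value, in dollars

type HookPayload = { tool_name: string; tool_input?: { amount?: number } };

// Returns null to allow, or a model-readable denial message.
function checkRefundCap(payload: HookPayload): string | null {
  if (payload.tool_name !== "process_refund") return null; // only gate refunds
  const amount = Number(payload.tool_input?.amount ?? 0);
  if (amount > REFUND_CAP) {
    return `Denied: refund of $${amount} exceeds the $${REFUND_CAP} cap. Escalate to a human.`;
  }
  return null;
}

// Hook entry point (in the real script): parse stdin JSON, then
//   const denial = checkRefundCap(payload);
//   if (denial) { process.stderr.write(denial); process.exit(2); } // deny
//   process.exit(0);                                               // allow
```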

## Implementation checklist

- [ ] Tool count per agent ≤ 5; rare tools moved to specialist sub-agents (`tool-calling`)
- [ ] Every tool description follows the 4-line pattern (what / when / edge cases / ordering) (`tool-calling`)
- [ ] PreToolUse hook on every policy-bearing tool; exit 2 on violation (`hooks`)
- [ ] PostToolUse hook normalizes outputs and writes the audit log (`hooks`)
- [ ] All tool errors emit one of 4 buckets (Transient · Permission · Data · Business) + retryable boolean (`structured-outputs`)
- [ ] tool_choice: 'auto' on specialist agents; 'forced' only for mandatory extraction (`tool-choice`)
- [ ] Shared tools exposed via MCP servers; no inline duplication across agents (`mcp`)
- [ ] 50-intent routing-accuracy eval set; gate deploys on ≥ 95% (`evaluation`)
- [ ] Agent branches on bucket + retryable; no parsing of error.detail text
- [ ] Hook stderr messages are model-readable (specific, actionable, reference policy)
- [ ] Audit log: tool_name, tool_input, normalized_output, latency_ms, stop_reason context

## Cost & latency

- **Per-tool-call overhead (Pre + Post hooks):** ~5-15ms latency; ~0% token cost. Hooks run as subprocesses reading stdin JSON. No LLM call; pure local Python/TS. The latency is below the noise floor of a typical tool API call.
- **Routing accuracy eval (50 intents, weekly):** ~$0.05/run: 50 messages × ~500 tokens input + ~50 tokens output at Sonnet 4.5 prices. Cheap insurance against routing regressions; run on every tool-registry change and weekly in CI.
- **Tool description token cost (5 tools × 4 lines):** ~600-1000 tokens per call (cached after the first). The tools array is stable; mark it with cache_control: ephemeral. A schema-cache hit rate ≥ 70% drops the effective per-call cost of the tools array by ~90%.
- **MCP server overhead:** ~+50-100ms per tool call (network). MCP runs as a separate process or service, so the network round-trip adds latency. Worth the cost when 2+ agents share the tool: a single source of truth beats inline duplication.
- **Audit log write:** ~5ms; ~1KB per row. Append-only JSONL write in the PostToolUse hook. At 1000 calls/day that is 1MB/day, 30MB/month. Negligible storage; indispensable for production debugging.

## Domain weights

- **D2 · Tool Design + Integration (18%):** Tool registry · 4-line description pattern · 4-bucket error contract · MCP integration
- **D3 · Agent Operations (20%):** PreToolUse hook · PostToolUse hook · routing-accuracy evals · Claude Code hook config

## Practice questions

### Q1. Your agent has 6 tools. Routing accuracy drops from 95% (with 5 tools) to 87% (with 6). What's the cause and the architectural fix?

Tool count past 4-5 degrades selection. Each new tool adds description overlap; the model alternates between similar tools and sometimes picks the wrong one. The fix is structural, not model-side: either consolidate (merge lookup_order_status + lookup_order_details → lookup_order) or move the rare tool to a specialist sub-agent and route requests to the right specialist with a triage classifier. Don't 'just use a smarter model'. Bigger models don't compensate for ambiguous tool descriptions. Tagged to AP-ATD-01.

### Q2. Your tool description is one sentence: 'verifies a customer'. The agent uses it correctly ~88% of the time. What format produces ≥97% accuracy?

The Anthropic 4-line description pattern: line 1 _what_ (one sentence), line 2 _when_ (which user intent triggers it), line 3 _edge cases_ (failure modes, missing args), line 4 _ordering_ (which tools must come before / after). Each line targets a specific routing decision the model has to make on every turn. Vague one-liners produce ~12% wrong-tool selection; the 4-line pattern drops it below 3%. Tagged to AP-ATD-02.

### Q3. PreToolUse hook fires before tool_use; PostToolUse fires after. Which one blocks risky operations like a refund-cap violation, and why?

PreToolUse. It fires BEFORE the tool runs, so it can deny execution by exiting 2. PostToolUse fires AFTER the tool runs and is meant for normalization + audit, not denial. For deterministic policy enforcement (refund cap, destructive-command blocklist, sensitive-data redaction), PreToolUse is the only credible gate. Prompt-only policies leak 3-5% in production; PreToolUse leaks 0%.

### Q4. A tool returns HTTP 403. Should the agent retry?

No. 403 is a non-retryable Permission error. The 4-bucket error contract is decisive here: bucket: 'Permission', retryable: false. The agent reads bucket and retryable and routes: Permission means 'agent lacks the privilege; escalate, don't retry'. Retry won't fix it (the missing permission won't appear in the next 30 seconds). Without this contract, the agent retries permission errors forever and panics on transient ones. Exactly the failure mode the buckets exist to prevent. Tagged to AP-ATD-04.

### Q5. When should tool_choice be 'auto', 'any', or { type: 'tool', name: ... }?

- **auto** — the right default for specialist agents that converse and pick tools as they go (~95% of production flows).
- **{ type: 'tool', name: ... }** — only for mandatory extraction pipelines where the tool MUST fire (no agency).
- **any** — rarely useful; says "must call SOME tool but I won't pick which". Fine for a narrow flow where any of N tools is acceptable.

Forcing tool_choice on a specialist agent removes its reasoning capacity; 99% of the time, auto is correct.
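
As request fragments, the three forms look like this; the tool name `extract_entities` is hypothetical.

```typescript
// The three tool_choice values as they appear in a Messages API request.
const AUTO = { type: "auto" } as const;   // default: model decides whether and which
const ANY = { type: "any" } as const;     // must call some tool; model picks which
const FORCED = { type: "tool", name: "extract_entities" } as const; // must call this one
```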

## FAQ

### Q1. Why does the 4-line description pattern matter so much?

It gives the model structural cues across the registry. Each tool has the same 4 lines in the same order. After reading 5 such descriptions, the model has a stable mental model: 'when I want to know WHAT, I read line 1; WHEN, line 2; ORDERING, line 4'. Vague one-liners force the model to infer shape every time. The pattern cuts wrong-tool selection from ~12% to <3%.

### Q2. Can I use the 4-line pattern in MCP tool descriptions?

Yes. Same pattern, same effect. MCP tools surface to the agent through the same tools[] array as inline tools; the description format is identical. If you ship an MCP server, write the 4-line pattern into the server's tool definitions. Downstream agents inherit the routing accuracy without doing anything.

### Q3. What if the policy is too complex for a hook?

Then the hook calls a policy service. PreToolUse hooks are subprocesses, not pure functions. They can hit a Convex action, a feature flag service, a rules engine. The point is the gate is OUTSIDE the prompt: deterministic code makes the deny decision, not the model. Complex policies live in the service the hook calls; simple bounds live in the hook itself.

### Q4. How do I add a new tool to the registry without breaking routing?

Three steps: (1) add the tool with a full 4-line description; (2) re-run the 50-intent routing-accuracy eval (if accuracy drops below 95%, the new tool's description overlaps with an existing one, so fix the descriptions); (3) gate the deploy on the eval threshold. Adding tools blindly is the #1 way registries degrade.

### Q5. Should every tool emit the 4-bucket error contract?

Every tool that can fail. Read-only lookups can mostly emit Transient or Data; write tools add Business; auth-protected tools add Permission. The contract is uniform: { is_error: true, error: { bucket, code, detail, retryable } }. The agent's retry logic relies on it; without uniformity, you'd write per-tool retry code and miss bugs.
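
The contract shape above can be written down as a type, with the branching taken from the decision matrix (Transient → retry; Permission → escalate; Data → surface; Business → block). A sketch; the handling of a non-retryable Transient error is an assumption.

```typescript
type ErrorBucket = "Transient" | "Permission" | "Data" | "Business";

// Uniform contract: every failing tool wrapper returns this shape.
interface ToolError {
  is_error: true;
  error: { bucket: ErrorBucket; code: string; detail: string; retryable: boolean };
}

// The agent branches on bucket + retryable only; it never parses error.detail.
function nextAction(e: ToolError): "retry" | "escalate" | "surface" | "block" {
  switch (e.error.bucket) {
    case "Transient":
      return e.error.retryable ? "retry" : "surface"; // retry with backoff
    case "Permission":
      return "escalate"; // e.g. HTTP 403: retrying won't help
    case "Data":
      return "surface"; // bad input: tell the user
    case "Business":
      return "block"; // plus log + escalate in production
  }
}
```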

### Q6. When do I split into specialist sub-agents vs add tools to the existing one?

Five tools is the rule-of-thumb threshold. Tools that share state (e.g. verify_customer + lookup_order + process_refund) can stay in one agent. Tools that don't share state (admin functions vs support functions vs analytics) belong in different agents. Split, route with a triage classifier, and keep each agent at 4-5 tools.

### Q7. Is the 4-bucket model Anthropic's or community-derived?

Community-derived but architecturally consistent with Anthropic's tool-use guidance. The buckets formalize the patterns Anthropic's docs hint at (transient retry; permission escalate; etc.). The catalog (ACP-T05) marks this scenario as 🟡 OP-claimed (Reddit thread 1s34iyl) but architecturally well-grounded. Drilling it benefits real exam prep.

## Production readiness

- [ ] Tool registry audit: every agent has ≤ 5 tools; documented in repo
- [ ] Description format lint: every tool has 4 lines (what / when / edges / ordering)
- [ ] PreToolUse hook on every policy-bearing tool; unit-tested for allow/deny
- [ ] PostToolUse hook normalization tested on at least 5 representative output shapes
- [ ] 4-bucket error contract enforced in CI: every tool wrapper returns the contract
- [ ] 50-intent routing eval runs in CI; gates deploy on ≥ 95% accuracy
- [ ] MCP server health checks before agent invocation; degraded mode on outage
- [ ] Audit log retained ≥ 90 days; indexed by tool_name + customer_id

---

**Source:** https://claudearchitectcertification.com/scenarios/agentic-tool-design
**Vault sources:** ACP-T05 §Scenario 7 (🟡 beyond-guide; OP-claimed Reddit 1s34iyl); ACP-T08 §3.7 metadata; Course 12 Claude with Vertex. Lesson 90 workflows-vs-agents; ACP-T06 (5 practice Qs tagged to components); GAI-K04 Claude Certified Architect Exam Reference; COD-K01 AI design + orchestration patterns
**Last reviewed:** 2026-05-04

**Evidence tiers:** 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.
