# Claude Architect Certification Prep — Full Corpus

> Concatenated markdown for every concept (25), scenario (14), and knowledge entry (13). One file, 52 pages, no tokens wasted on HTML.

Canonical pages: https://claudearchitectcertification.com
Companion: https://claudearchitectcertification.com/llms-small.txt (exam-critical subset)
Per-page twins: https://claudearchitectcertification.com/{concepts,scenarios,knowledge}/{slug}.md

---

# Agentic Loops

> An agentic loop turns one API response into a controlled, multi-turn system. The harness reads stop_reason, dispatches tools, appends observations, and decides whether to continue. The common failure is treating natural-language text as the termination signal.

**Domain:** D1 · Agentic Architectures (27% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/agentic-loops
**Last reviewed:** 2026-05-04

## Quick stats

- **Loop stages:** 4
- **Exam domain:** D1
- **Core exit lines:** 3
- **Text parsing tolerance:** 0
- **Scenario links:** 6

## What it is

A single call is one prompt and one response. An agentic loop is a harness pattern that repeats the call after every tool result. The user experience feels like one agent, but the engineering surface is a loop, a message list, and a termination check.

## Exam-pattern questions

### Q1. Your agentic loop keeps running after Claude has clearly finished its task. Which control was likely missed?

The harness was checking content shape (text presence) instead of stop_reason. The most-tested distractor is "add a max_iterations cap". The right answer is "branch the loop on stop_reason 'end_turn'": a structural field, never natural language.

### Q2. A subagent returns wrong facts about a customer the coordinator clearly mentioned earlier. Why?

The coordinator forgot that subagents do not inherit conversation history. The distractor answer says "the subagent's max_tokens was too low". The actual fix is to pass every needed fact in the subagent's task string explicitly.

### Q3. An agent loop hits a budget ceiling and you keep raising max_iterations to make it pass. What is the real problem?

max_iterations is a safety cap, not the primary exit condition. Repeatedly raising it masks the underlying bug, usually a missing tool_result append, ambiguous tool descriptions, or two tools the model alternates between.

### Q4. Your refund agent emits the words "I'm processing your refund now" but then makes another tool call. The harness exits early. Why?

The harness is parsing the response text for completion phrases. Claude can return text and a tool_use block in the same response; only stop_reason is authoritative. The fix is to ignore the text and read stop_reason: "tool_use" to continue the loop.

### Q5. Long-document extraction stalls at chunk 18 every single time, regardless of model. What is happening?

By turn 17 the message list contains 17 turns of tool_use plus tool_result blocks; turn 18 hits stop_reason: "max_tokens". The fix is context windowing: summarize old turns into a facts block, drop the verbose history, continue. Raising max_tokens does not fix it past a certain point.

### Q6. Your code-review CI bot returns valid JSON with empty findings: [] even on PRs that clearly have issues. Why?

The bot is treating the presence of any text response as success. It hit stop_reason: "max_tokens" mid-review, but the code did not branch on that signal, so it serialized whatever was in the buffer, which happened to be empty. The fix is to handle max_tokens as a partial-result signal and emit truncated: true.

### Q7. A tool throws an exception inside your harness; on the next iteration Claude requests the same tool again with the same arguments. Why?

Either the exception escaped to the loop and was caught silently, or the harness wrote a generic "tool failed" to tool_result. Claude saw no actionable error and re-requested. The fix is to catch the exception and return a structured error with a hint, e.g. {"error": "customer_not_found", "hint": "verify cus_ prefix"}.

### Q8. Two of your tools have similar names (fetch_data and get_data). The model picks the wrong one 30% of the time. Best first fix?

Rewrite the descriptions, not the model. Each description should be 4 lines: what it does, when to use it, edge cases, and ordering boundaries with other tools. Bigger models route slightly better, but vague descriptions are the root cause; fine-tuning is a last resort.

## FAQ

### Q1. How is this different from a normal API call?

The harness sends the cumulative message list again after each tool result. State accumulates over turns.

### Q2. Where do hooks fit?

Hooks validate or enforce around tool calls. They do not replace stop_reason-based termination.
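
A minimal harness sketch in Python, assuming the anthropic SDK, hypothetical helpers (TOOLS, run_tool, handle_partial), and an illustrative model id. It shows the pattern the questions above test: branch on stop_reason, append every tool_result, and keep the iteration cap as a safety buffer rather than the exit condition.

```python
# Minimal agentic loop sketch. TOOLS, run_tool, and handle_partial are
# hypothetical placeholders; the model id is illustrative.
import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Process the refund for order ord_123"}]

for _ in range(20):  # safety cap, not the primary exit condition
    response = client.messages.create(
        model="claude-sonnet-4-5",      # illustrative model id
        max_tokens=1024,
        tools=TOOLS,                    # your tool schemas (placeholder)
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason == "tool_use":
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),  # your dispatcher (placeholder)
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})  # skip this and Claude re-requests forever
        continue

    if response.stop_reason == "max_tokens":
        handle_partial(response)        # partial result, not a failure (placeholder)
    break                               # end_turn or defensive default: structural exit, never text parsing
```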
---

**Source:** https://claudearchitectcertification.com/concepts/agentic-loops
**Vault sources:** ACP-T03 §4.1; GAI-K04 §1
**Last reviewed:** 2026-05-04
**Evidence tiers:** 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.

---

# Subagents

> Subagents are specialized, isolated agents spawned by a coordinator to handle domain-specific tasks while preserving conversation isolation. They do NOT inherit memory; the coordinator passes context explicitly in the prompt. Hub-and-spoke prevents context creep from parallel work.

**Domain:** D1 · Agentic Architectures (27% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/subagents
**Last reviewed:** 2026-05-04

## Quick stats

- **Coordinator duties:** 4
- **Tool categories:** 3
- **Inheritance modes:** 1
- **Exam domain:** D1
- **Scopes:** 2

## What it is

Subagents are markdown-defined agents that operate in isolated contexts, spawned on demand by a coordinator. Each has its own system prompt, tool permissions, and model selection. Unlike multi-turn continuation, subagents receive complete context passed explicitly by the coordinator and do not carry forward prior conversation state. They excel at parallelizing independent tasks, scoping tool access (e.g., read-only review agents), and keeping verbose work out of the main thread.

## Exam-pattern questions

### Q1. Your code-review subagent missed three critical issues that you can clearly see in the file. Why?

The subagent received only the file path, not the file contents, and lacked a Read tool. Subagents do not inherit coordinator history or environment. Embed the file content (or grant Read) explicitly in the task string.

### Q2. Two subagents are returning conflicting reports about the same bug. How do you resolve it?

You don't; the coordinator does. Subagents never communicate directly. Aggregate both reports in the coordinator, ask Claude to reconcile, or escalate to a human reviewer with both reports as evidence.

### Q3. A research subagent ran for 40 turns and returned a perfect summary, but the bill is huge. What's the architectural fix?

Define a structured output format in the subagent's system prompt. Without one, subagents wander. A clear output shape ({findings: [], confidence: number}) doubles as a stopping cue and caps token cost.

### Q4. Your retriever subagent has [Read, Grep, Glob, WebSearch, Bash, Edit] and accidentally modified a config file. What was the design error?

Tool overscoping. A retriever should never have Edit. Restrict allowed-tools to [Read, Grep, Glob, WebSearch], the minimum needed for the role.

### Q5. You spawned 4 subagents in parallel; 3 finished, the 4th hangs forever. How do you debug?

Check the 4th subagent's task string. The most common cause is ambiguity: the subagent is asking for clarification or wandering. Add max_iterations and a deterministic output format to bound it. Don't retry blindly.

### Q6. Your coordinator passes the entire chat history to each subagent for "context." Subagents respond with confused outputs. Why?

Subagents don't inherit history; passing it confuses them (they were not part of that conversation). Embed only the facts the subagent needs in a self-contained task string. Treat each subagent as starting from zero.

### Q7. A subagent returns stop_reason: "max_tokens" with a partial summary. Production code aborts. What should it do instead?

Treat it as a partial result, not a failure. Either accept the partial (with a confidence flag) or spawn a refined subagent ("focus on files X to Y only"). Aborting wastes the subagent's actual work.

### Q8. Your coordinator spawns Tool A, then waits for results, then spawns Tool B based on A's output. Why is this slower than ideal?

It's sequential when it could be parallel for independent work. If Tool B doesn't strictly depend on Tool A's specific output, spawn both at once with Promise.all. The coordinator merges both outputs.

## FAQ

### Q1. Can subagents call other subagents?

Yes, but not directly: the original coordinator must spawn the nested ones and manage all context passing.

### Q2. Subagent vs. separate Claude Code project?

Subagents are lightweight isolated sessions inside one project. Separate projects are fully independent. Use subagents for in-project parallelism.

### Q3. When should work stay in the main thread vs. fork?

If the work is verbose or exploratory and doesn't need to stay visible, fork. If reasoning needs to flow naturally in context, keep inline.
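
A coordinator sketch under these assumptions: plain Messages API calls stand in for subagents, ask_subagent is a hypothetical wrapper, and the model id is illustrative. Each task string is self-contained (subagents start from zero context), the output format doubles as a stopping cue, and independent tasks run in parallel with asyncio.gather, the Python analogue of Promise.all.

```python
# Coordinator sketch: isolated, explicitly-briefed subagents spawned in parallel.
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def ask_subagent(system: str, task: str) -> str:
    # Hypothetical wrapper: a fresh, single-turn session with its own system prompt.
    response = await client.messages.create(
        model="claude-sonnet-4-5",   # illustrative model id
        max_tokens=2048,
        system=system,               # role + output format doubles as the stopping cue
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

async def review_file(path: str, contents: str) -> list[str]:
    shared_facts = f"File: {path}\n\n{contents}\n\n"   # every needed fact, embedded explicitly
    output_format = 'Return JSON only: {"findings": [], "confidence": 0.0}'
    # Independent subtasks: spawn both at once; the coordinator merges the reports.
    return await asyncio.gather(
        ask_subagent(
            "You are a read-only security reviewer.",
            shared_facts + "Check for SQL injection and hardcoded secrets. " + output_format,
        ),
        ask_subagent(
            "You are a read-only style reviewer.",
            shared_facts + "Check naming and formatting only. " + output_format,
        ),
    )
```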
---

**Source:** https://claudearchitectcertification.com/concepts/subagents
**Vault sources:** ACP-T03 §4 (Domain 1); GAI-K04 §8 (hub-and-spoke); ASC-A01 Course 16
**Last reviewed:** 2026-05-04
**Evidence tiers:** 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.

---

# Stop Reason

> stop_reason is the authoritative struct field that says why Claude stopped: either end_turn (finished) or tool_use (wants to call a tool). Parsing natural-language phrases for termination is the most-tested distractor; it's unreliable and fails in production. Checking stop_reason is the only deterministic loop control.

**Domain:** D1 · Agentic Architectures (27% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/stop-reason
**Last reviewed:** 2026-05-04

## Quick stats

- **Core values:** 2
- **Canonical pattern:** 1
- **Anti-patterns:** 3
- **Exam domain:** D1
- **Reliability vs NL parsing:** 100%

## What it is

stop_reason is a field in Claude's API response that signals why generation stopped. The two values you act on are "end_turn" (model is done) and "tool_use" (model wants tools executed). It is the authoritative loop-control signal.
Termination based on regex or substring matches against text is probabilistic and breaks in production. Structured field checking is the correct pattern. ## Exam-pattern questions ### Q1. Your loop checks response.text.includes('done') to decide termination. What can go wrong? Claude may say "I'm done now" while emitting a tool_use block in the same response. The text is preamble; the tool_use is the real next step. Branch on stop_reason, not text. ### Q2. A response has both a text block and a tool_use block. Which should you handle first? Branch on stop_reason. If it's tool_use, execute the tool call regardless of text presence. If it's end_turn, the text is the final response. Block-level inspection is unreliable; the field is authoritative. ### Q3. You set max_iterations = 5 to prevent infinite loops. The agent fails on legitimate 7-iteration tasks. What's the real fix? Find why the loop is unbounded: missing tool_result append, ambiguous tool descriptions, or two tools that look interchangeable. Caps mask bugs; stop_reason is the primary signal. Raise the cap to a safety buffer, not the primary control. ### Q4. An agent loop hits stop_reason: "max_tokens". Your code throws an error and exits. What should production code do? Treat max_tokens as a partial-result signal, not a failure. Save what was generated, then either raise max_tokens for a retry or chunk the input. The agent did real work; don't discard it. ### Q5. You set a custom stop_sequences parameter. The agent stops mid-task. Why? Your stop sequence matched an unintended substring. Inspect response.stop_sequence to see which one triggered. Either tighten the sequence (more specific) or remove it. Custom stop sequences are a sharp tool. ### Q6. Your TypeScript handler has if/else branches for end_turn, tool_use, and max_tokens. What's missing? A stop_sequence branch for completeness, and a default branch for unknown values (defensive). Use a discriminated union type so the compiler enforces exhaustive checks. ### Q7. An agent calls the same tool 5 times in a row with identical input. Why? You forgot to append the tool_result block to the message list, so Claude doesn't know the tool ran. Without the result, the model re-requests indefinitely until max_iterations saves you. ### Q8. Why is stop_reason better than counting tool_use blocks for loop control? Block counting requires you to enumerate response.content and check types. stop_reason is a single authoritative field set during generation, designed for control flow. It's faster, cleaner, and immune to multi-block responses. ## FAQ ### Q1. Can Claude return text alongside a tool_use block? Yes, both can appear in the same response. Always check stop_reason, not text presence. ### Q2. What if stop_reason is something else? Other values like 'stop_sequence' or 'max_tokens' indicate error or limit conditions, not normal agentic flow. ### Q3. Does stop_reason guarantee tool args are valid? No. It tells you Claude wanted to call. You must still validate arguments before execution. --- **Source:** https://claudearchitectcertification.com/concepts/stop-reason **Vault sources:** ACP-T03 §4.1; GAI-K04 §1 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Session State & Persistence > Session state is the running message list (and any external persistence) that gives an agentic loop continuity. The exam tests progressive summarization tradeoffs (token savings vs. 
precision loss) and when checkpointing or external store reads are required. Full content lands in SCRUM-21 follow-up. **Domain:** D1 · Agentic Architectures (27% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/session-state **Last reviewed:** 2026-05-04 ## Quick stats - **Persistence layers:** 3 - **Exam domain:** D1 - **Trap pattern:** summary loss - **Coverage tier:** B - **Linked scenarios:** 2 ## Exam-pattern questions ### Q1. Your support agent forgets the customer ID by turn 30. What's the architectural fix? Pin a CASE_FACTS block in the system prompt with customer_id, order_id, refund_amount. Re-read every turn. The most-tested distractor is "increase max_tokens"; the right answer is structural state management. ### Q2. A subagent returns wrong findings; coordinator rediscovers the same problems on the next subagent spawn. Why? Coordinator didn't capture obstacles_encountered from the first subagent. Add a structured field so the coordinator stores and acts on the discovery, then includes the resolution in the next spawn's task string. ### Q3. Long-document extraction dies at chapter 18 every run. What's happening? Message list contains 17 turns of tool_use + tool_result; chapter 18 hits stop_reason: "max_tokens". Fix: checkpoint state, reset the session, reload the checkpoint, continue. Don't increase the model size. ### Q4. Which is wrong: summarize verbose reasoning, or summarize critical facts? Summarizing facts is wrong. Pin them in a CASE_FACTS block. Reasoning chains are summarizable; transactional values (amounts, IDs) are not. ### Q5. Subagent spawns receive the parent's full conversation history. Why is this an anti-pattern? Bloats the subagent context and dilutes focus. Pass a focused prompt with extracted facts and the specific subtask. Subagents work better with less, not more. ### Q6. Cross-task context (vendor matrix path) lives in: project Instructions or session messages? Project Instructions. Sessions are task-scoped (discarded at end). Project Instructions persist across tasks; folder files persist evolving state. ### Q7. When you hit max_tokens, does increasing the model's window solve the problem? Temporarily, yes. Architecturally, no. Larger windows defer the problem; the fix is active state management (case-facts, checkpointing, summarization) regardless of window size. ### Q8. An escalation hands off the entire conversation transcript to a human. What's the better pattern? Structured escalation block: customer_id, order_id, amount, reason, partial_status, recommended_action. Compact (200-500 chars). Human triages in 10 seconds vs 5 minutes for transcript. --- **Source:** https://claudearchitectcertification.com/concepts/session-state **Vault sources:** ACP-T03 §5 progressive summarization; GAI-K04 §8 multi-agent memory; ASC-A01 Course 3 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Human-in-the-Loop Escalation > Escalation is the deterministic handoff path when policy thresholds, low confidence, or repeated failure conditions are hit. The exam pattern: prompt-only escalation policy is wrong; the correct architecture uses hooks or hard gates. Full content in SCRUM-21 follow-up. 
**Domain:** D1 · Agentic Architectures (27% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/escalation **Last reviewed:** 2026-05-04 ## Quick stats - **Trigger types:** 3 - **Exam domain:** D1 - **Right answer:** deterministic gate - **Coverage tier:** B - **Trap:** prompt-only policy ## Exam-pattern questions ### Q1. Refund agent uses prompt-only enforcement: "escalate refunds over $500". Production failures show 5% of refunds violate the policy. Fix? Replace with a PreToolUse hook that checks if amount > 500: escalate. Deterministic gate. Prompt-only is probabilistic; hook is 100% enforcement. ### Q2. Two vendors return contradictory market sizes. The agent picks the median and continues. Why is this wrong? Ambiguity is an escalation trigger. Agent must surface the conflict, not silently average. Block: {conflict, sources, recommendation: "human sourcing"}. Otherwise the report is silently inaccurate. ### Q3. After PreToolUse blocks a refund, the agent retries the same call on the next loop. Why? The escalation didn't update the database with refund_approval flag. Round-trip pattern: human approves → DB flag set → agent retries → hook sees flag → allows execution. Without the flag, indefinite re-escalation. ### Q4. User says "speak to a manager." Should the agent negotiate first or escalate immediately? Escalate immediately. No reasoning, no multi-turn negotiation. Structural keyword check: if "manager" in message: escalate_immediately. This is policy, not judgment. ### Q5. Permission failure (403 from infrastructure), should the agent retry or escalate? Escalate. Permission failures are non-retryable error category 4. Block routes to the relevant on-call engineer with task description and partial status. Retrying wastes time and obscures the real fix (privilege grant). ### Q6. Difference between an error and an escalation? Error = unhandled (agent crashes, tool fails unexpectedly). Escalation = handled (agent recognizes a designed condition, stops gracefully, submits structured block). Escalation is a designed path, not a failure. ### Q7. Should a customer's tone (anger, frustration) trigger escalation? No. Sentiment is not a trigger. Only explicit conditions: policy exception, permission failure, ambiguous input, explicit request for human. Sentiment is orthogonal to escalation rules. ### Q8. Customer-blocking workflow vs batch workflow: how does the escalation pattern differ? Customer-blocking: agent stops, async queue with 5-10 min SLA, user sees "escalated, response in ~5min". Batch: save block, continue with fallback (auto-approve up to $100, escalate the rest). Design determines the pattern, not the trigger. --- **Source:** https://claudearchitectcertification.com/concepts/escalation **Vault sources:** ACP-T03 §4.1 escalation protocol; ACP-T05 Scenario 1 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Tool Calling > Tool calling is how Claude decides to invoke external functions and pass structured arguments. Tools are routing infrastructure, good descriptions reduce the need for classifiers or few-shot examples. The quality of tool design is the primary lever for correct task routing, not model size. 
**Domain:** D2 · Tool Design + Integration (18% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/tool-calling **Last reviewed:** 2026-05-04 ## Quick stats - **Description components:** 5 - **Optimal tools/agent:** 4–5 - **Degradation threshold:** 18+ - **Exam domain:** D2 - **tool_choice modes:** 3 ## What it is Tool calling is Claude's ability to request execution of external functions by returning structured tool_use blocks with function name and arguments. Tools serve as routing infrastructure, they guide Claude's decisions. Each tool needs a JSON schema describing inputs, outputs, and behavior. Good descriptions (what it does, when to use, edge cases, boundaries) are the primary determinant of correct routing. ## Exam-pattern questions ### Q1. Your agent calls the wrong tool 30% of the time across 8 similar tools. What's the first fix? Audit tool descriptions. Add the four-part anatomy: what it does, when to use, edge cases, ordering boundaries. Vague descriptions cause misrouting; bigger models don't fix it. ### Q2. You add 25 tools and selection accuracy drops sharply. Why? Beyond ~18 tools, attention dilutes and Claude can't reliably select. Split into specialized subagents, each with 4-5 tools max. Or merge overlapping tools. ### Q3. A tool's input_schema is {data: any} and Claude returns null. What happened? The schema gave no constraint, so Claude has no deterministic target. Model guessed null. Specify exact fields with types and examples; constraints reduce hallucination. ### Q4. Your tool throws an exception. The agent retries the same call with the same input. Why? Your harness swallowed the exception and returned an empty tool_result. Claude saw no error and re-requested. Return {error: "reason", hint: "how to fix"} so Claude can recover. ### Q5. Two tools have similar names: fetch_user and get_user. The agent alternates between them. Fix? Disambiguate in descriptions: "fetch_user: use only when you have an email. get_user: use when you have a numeric ID." Or merge them into one tool with an enum input. ### Q6. You force tool_choice: any to guarantee structured output. The agent still returns empty results. Why? any forces a tool call but the model still picks which one, and might pick wrong if descriptions overlap. Either force a specific tool or fix description disambiguation. ### Q7. Your tool returns 50KB of JSON per call. The loop hits max_tokens after 3 iterations. Architectural fix? Tools should return what's needed, not everything. Add a fields parameter to the tool's input schema so Claude can request specific fields. Or use a resource (read-only catalog) for catalog data. ### Q8. A tool needs the customer ID but Claude calls it without one. The harness errors out. What should the schema enforce? Mark customer_id as required in input_schema. Claude won't call without it. Combine with a clear description: "call only after verify_customer returns a customer_id." ## FAQ ### Q1. Is tool calling the same as MCP? No. Tool calling is the mechanism. MCP is infrastructure that exposes pre-built tools through that mechanism. ### Q2. tool_choice 'any' vs specific tool? 'any' forces a tool call but lets Claude pick which. Specific forces a particular tool, used for mandatory first steps. ### Q3. Should every action be a tool? No. Use tools for external operations and actions you must control. Keep internal reasoning off-tool. 
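
A sketch of the four-part description anatomy (what it does, when to use it, edge cases, ordering boundaries) plus a constrained input_schema with required fields. The tool name, fields, and boundaries are illustrative, not a real API.

```python
# Tool definition sketch: routing lives in the description; constraints live in the schema.
LOOKUP_ORDER_TOOL = {
    "name": "lookup_order",
    "description": (
        "Looks up a single order by its order ID and returns status, amount, and items. "   # what it does
        "Use when the user references a specific existing order. "                           # when to use it
        "Returns a structured error if the order does not exist; do not guess IDs. "         # edge cases
        "Call verify_customer first; for refunds use process_refund, not this tool."         # ordering boundaries
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, e.g. ord_8f3k2",
                "pattern": "^ord_[a-z0-9]+$",
            },
            "fields": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Optional subset of fields to return, to keep tool results small.",
            },
        },
        "required": ["order_id"],   # Claude won't call without it
    },
}
```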
--- **Source:** https://claudearchitectcertification.com/concepts/tool-calling **Vault sources:** ACP-T03 §4.2; GAI-K04 §11; ASC-A01 Course 6 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Tool Choice > tool_choice is the parameter that controls whether Claude can decide to use a tool ("auto"), must call any tool ("any"), or must call a specific tool by name. Use specific-tool forcing for mandatory-first-step operations like identity verification before refunds. **Domain:** D2 · Tool Design + Integration (18% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/tool-choice **Last reviewed:** 2026-05-04 ## Quick stats - **Modes:** 3 - **Default:** auto - **Forced use cases:** 2 - **Exam domain:** D2 - **Guarantee:** 100% ## What it is tool_choice is an API parameter that constrains Claude's tool decision. "auto" lets the model decide whether to call any tool (default). "any" forces a tool call from the available set. {"type":"tool","name":"X"} forces tool X. Forced modes guarantee execution for compliance, structured-output extraction, or mandatory verification gates. ## Exam-pattern questions ### Q1. You enable extended_thinking and set tool_choice: any. The API returns a 400 error. Why? Extended thinking is incompatible with any and forced tool_choice. Use auto only when thinking is enabled. The model reasons, then decides whether to call a tool. ### Q2. Your refund flow uses tool_choice: any to guarantee a tool call. The agent calls lookup_order instead of verify_customer first. Fix? Switch to forced ({type: "tool", name: "verify_customer"}). any lets the model pick; only forced guarantees a specific tool. Use forced for mandatory architecture gates. ### Q3. The user asks a simple question and your forced tool_choice calls process_refund with empty input. What went wrong? Forced tool_choice removes the model's option to return text. For ambiguous requests where text is a valid response, use auto. Forced is for mandatory operations only. ### Q4. Your support agent uses tool_choice: auto. 30% of the time it asks clarifying questions instead of calling tools. Is this a bug? Not a bug, that's auto behavior. The model decides whether a tool is needed. If you need guaranteed tool calls, switch to any (model picks) or forced (you pick). Each has a tradeoff. ### Q5. You set tool_choice: any but the agent calls a tool and immediately returns valid input. Then a clarifying tool... Why is it slower than expected? Each tool call is a full turn round-trip. If the same task can be done in fewer calls (e.g. one tool with more params instead of three sequential calls), redesign the tool surface for fewer turns. ### Q6. Your forced tool_choice produces tool calls with hallucinated arguments. How do you guard against this? Force the tool, then validate inputs in your harness. Return structured errors via tool_result if inputs are invalid. The model retries with corrected arguments. Forced tool_choice does not validate inputs. ### Q7. Why is tool_choice: any rejected when extended_thinking is on, but allowed without it? Extended thinking generates reasoning tokens before the final response. Forcing a tool call would constrain the post-thinking output, breaking the reasoning contract. Anthropic disabled the combination at the API layer. ### Q8. You want both deep reasoning and a guaranteed tool call. Is that possible? Not in a single turn. 
Use extended thinking with auto for the reasoning step, then a follow-up turn with forced tool_choice if the model decides to act. Two turns, not one. ## FAQ ### Q1. Can I force multiple tools in sequence? Not in one call. tool_choice forces one call per request. Sequence by setting tool_choice on each successive call. ### Q2. What if I force a tool but the context doesn't fit? Claude still calls it, often with bad arguments. Only force when context guarantees a sensible call. ### Q3. Is forcing a tool the same as a hook? No. tool_choice is an API constraint on the model. Hooks run code before/after execution. Use both for full control. --- **Source:** https://claudearchitectcertification.com/concepts/tool-choice **Vault sources:** ACP-T03 §4.2 modes table; GAI-K04 §7 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Model Context Protocol > MCP is a communication standard that lets Claude access pre-built tools, resources, and prompts from specialized servers without you writing integration code. Connect to a GitHub MCP server instead of writing GitHub API tools yourself. Servers are configured in .mcp.json (project) or ~/.claude.json (user). **Domain:** D2 · Tool Design + Integration (18% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/mcp **Last reviewed:** 2026-05-04 ## Quick stats - **Primitives:** 3 - **Config levels:** 2 - **Transports:** 3 - **Exam domain:** D2 - **Pre-built sets:** 18+ ## What it is MCP decouples tool definition from application code. An MCP server wraps a service's API and exposes pre-built tools, resources (read-only data catalogs), and prompts (pre-written instructions). Your Claude application is an MCP client, connecting via stdio, SSE, or streamable HTTP. This eliminates repetitive schema authoring and error handling. Servers configure with environment variables for credentials. ## Exam-pattern questions ### Q1. You add a new MCP server to .mcp.json but Claude Code doesn't see its tools. What did you forget? Restart the Claude Code session. The framework caches the tool list at startup. Changes to .mcp.json require a restart, or a manual mcp restart command in newer versions. ### Q2. Two MCP servers expose tools with the same name (Read). What happens? The config is invalid. Tool names must be globally unique across all MCP servers. Prefix tools by server name (GithubRead, FileRead) or disable one of the conflicting servers. ### Q3. A cloud MCP server returns {error: "rate_limited"}. Your agent retries 5 times and exhausts budget. Better approach? Server should return structured rate-limit info: {error: "rate_limited", retry_after: 30}. Your harness reads this and waits before retrying. Don't hammer; respect the hint. ### Q4. You hardcoded a GitHub token in .mcp.json. A teammate clones the repo and your token leaks. Fix? Use ${GITHUB_TOKEN} env-var expansion. Each developer sets their own token in their environment. The .mcp.json is committed; the token is not. ### Q5. An MCP server resource lists 1000 items. Claude calls a tool 1000 times, one per item. Why? The resource is a catalog (read-only data Claude can scan), not a list of pre-staged actions. Reduce tool calls by structuring the resource so Claude can scan and pick targeted items, not iterate. ### Q6. Your custom MCP server crashes when Claude calls a tool with malformed JSON. What's the right error handling? Wrap tool execution in try/except. 
Return {error: "invalid_input", detail: "..."} as the tool_result. Claude reads it and retries with corrected input. Crashes leave the framework hanging.

### Q7. You want to add a custom internal API. Should you build a custom MCP server or expose tools directly via the SDK?

MCP for shared/team use (multiple developers, multiple Claude sessions). Direct SDK tools for app-specific custom logic. MCP adds a small protocol overhead but enables reuse.

### Q8. An MCP server's tool description is auto-generated from the underlying API and reads like docstring noise. The model misroutes. Fix?

Override the description in your MCP server config. The vendor's auto-generated description may be too generic. Write a 4-line description with what/when/edge-cases/boundaries.

## FAQ

### Q1. Do I have to use MCP?

No. Write tools yourself if you prefer. MCP saves work when pre-built servers exist for the service you need.

### Q2. Can I create my own MCP server?

Yes. Anthropic ships Python and Node.js SDKs; community implementations cover other languages.

### Q3. What languages can MCP servers be written in?

Any. Python and Node.js are official; community SDKs cover the rest.

---

**Source:** https://claudearchitectcertification.com/concepts/mcp
**Vault sources:** ACP-T03 §4.2; GAI-K04 §13; ASC-A01 Courses 7 + 10
**Last reviewed:** 2026-05-04
**Evidence tiers:** 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.

---

# SDK Hooks (Pre/PostToolUse)

> Hooks are deterministic code that runs before or after tool calls. They enforce policy that prompt-only patterns can't guarantee: refund limits, identity verification, audit logging. The exam pattern: prompt-only enforcement is wrong; the hook is the deterministic gate. Full content in SCRUM-21 follow-up.

**Domain:** D2 · Tool Design + Integration (18% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/hooks
**Last reviewed:** 2026-05-04

## Quick stats

- **Hook types:** 2
- **Exam domain:** D2
- **Trap:** prompt-only policy
- **Coverage tier:** B
- **Right answer:** deterministic

## Exam-pattern questions

### Q1. System prompt says "never refund more than 500." Production shows 3% violation rate. What's the architectural fix?

Replace the prompt with a PreToolUse hook that checks amount <= 500 and exits 2 (deny) on violation. The prompt is probabilistic; the hook is deterministic. The exam tests this Prompt-vs-Hook heuristic repeatedly.

### Q2. PostToolUse hook detects a policy violation and exits 2. Did the violation happen?

Yes. PostToolUse runs after the tool completes. The transaction already happened. Use PreToolUse to prevent operations; PostToolUse is for logging, normalizing, cascading after the fact.

### Q3. Hook exits with code 1 and the agent halts unexpectedly. Why?

Exit 1 = internal hook error (halts the agent). Exit 2 = deny (Claude reads the stderr error, retries with adjusted args). Use exit 1 only for hook bugs (JSON parse failure, uncaught exception). For policy denial, always exit 2.

### Q4. Matcher "process.*" is supposed to fire on process_refund and process_charge, but it fires on neither. Why?

Matchers are literal pipe-separated strings, not regex. Use "process_refund|process_charge" to match both. Regex patterns are silently ignored; matching is exact-string by tool name.

### Q5. Why must hook commands use absolute paths?

Relative paths are vulnerable to path-interception attacks (MITRE T1574.007). An attacker plants a malicious script earlier in PATH; your hook runs theirs. Use /home/user/hooks/script.sh, not ./hooks/script.sh. Use ${HOME} or setup-script substitution for portability.

### Q6. Setting "async": true on a PreToolUse hook to avoid blocking. Why does this fail?

PreToolUse cannot be async. Claude is waiting for the allow/deny decision before the tool runs. Async PreToolUse causes a timeout. Async only works on PostToolUse (the tool already ran; the hook is reactive feedback).

### Q7. A hook needs the customer ID from prior conversation history. How does it get it?

It doesn't. Hooks are stateless and isolated. They receive only tool_name, tool_input, session_id, hook_event_name. Pass anything you need explicitly via tool_input. The agent's job is to include needed context in tool calls.

### Q8. Two PreToolUse hooks match the same tool. What happens?

The SDK runs them in sequence. Each hook gets stdin from the previous (or the original tool_input). If the first exits 2 (deny), subsequent hooks don't run. Use multiple hooks to separate concerns: security check + quota check + audit.
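
A PreToolUse hook sketch, assuming the payload and exit-code contract described above (tool_name and tool_input arrive as JSON on stdin, exit 2 denies with the reason on stderr, exit 1 is reserved for hook bugs). The $500 threshold and tool name are illustrative.

```python
#!/usr/bin/env python3
# PreToolUse hook sketch: deterministic refund gate.
import json
import sys

def main() -> int:
    try:
        payload = json.load(sys.stdin)          # {"tool_name": ..., "tool_input": ..., ...}
    except json.JSONDecodeError:
        print("hook error: invalid JSON payload", file=sys.stderr)
        return 1                                # internal hook failure, not a policy denial

    if payload.get("tool_name") != "process_refund":
        return 0                                # not our concern; allow

    amount = payload.get("tool_input", {}).get("amount", 0)
    if amount > 500:
        print(f"refund of {amount} exceeds the 500 limit; escalate to a human", file=sys.stderr)
        return 2                                # deny: Claude reads stderr and adjusts
    return 0

if __name__ == "__main__":
    sys.exit(main())
```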
---

**Source:** https://claudearchitectcertification.com/concepts/hooks
**Vault sources:** ACP-T03 §4.2 enforcement hierarchy; ACP-T03 §4.4 prompt vs hook trap; ASC-A01 Course 6
**Last reviewed:** 2026-05-04
**Evidence tiers:** 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.

---

# Tool Evaluation & Testing

> Evaluation is how you measure tool-call correctness against ground-truth datasets. Coverage in the vault is thin; needs Phase 6 research for full authoring.

**Domain:** D2 · Tool Design + Integration (18% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/evaluation
**Last reviewed:** 2026-05-04

## Quick stats

- **Coverage tier:** C
- **Exam domain:** D2
- **Status:** stub
- **Vault depth:** thin
- **Action:** research

## Exam-pattern questions

### Q1. You spot-check 5 examples manually. Production scores reveal a 12% failure rate. What's the architectural gap?

Manual spot-checks miss edge cases. A prompt working on 5 chosen examples might fail on 10% of real traffic. Evals run 50-100 diverse cases automatically and catch corner cases. Spot-checks are sanity, not proof.

### Q2. A tool-using agent's score is 95% on text-output evals. Production failures are still high. Why?

Tool-call evals matter more for agents. An agent making the right calls but writing bad text succeeds at its job. Bad calls with great text fail. Always run tool-sequence evals; text quality is secondary.

### Q3. Eval score climbs from 80% to 87% after a prompt edit. Is the change ready to ship?

Not yet. Run on fresh test cases you haven't trained against. A score plateau on a fixed suite can mask overfitting to the prompt. Production monitoring + shadow-mode evals reveal blind spots faster than test cases.

### Q4. A teammate says "Use Claude to grade the agent's outputs." When is this advice wrong?

For deterministic properties (tool sequence, schema validation, rule compliance), use code-based grading. Deterministic, reproducible, no hallucination. Use Claude grading only for subjective cases (tone, completeness, empathy).

### Q5. You generate test cases by running the current agent and treating its output as expected. Why is this circular?

You're measuring whether the agent repeats itself, not whether it's correct. Golden cases must be externally validated: human review, expert sign-off, or reference implementations. Grading against unverified data is theater.

### Q6. A prompt edit drops the score from 85% to 82%. What should you do?

Roll back immediately. You've detected a regression. Analyze which cases regressed, decide if the trade-off is worth it, or iterate further. Evals let you make this call in minutes, not weeks.

### Q7. Your golden suite has 50 cases, but a production failure happens on a scenario not in the suite. What does this prove?

Evals score only on test cases you designed. 95% on the suite is necessary but not sufficient. Pair evals with production monitoring; real users reveal blind spots faster than fixed test cases.

### Q8. Why are evals part of tool design, not just QA?

Tool descriptions and schemas evolve; evals catch when an edit causes regressions. Run on every PR; failures gate the merge. Evals make tool design measurable, not just artisanal.
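
A code-graded eval sketch for tool-call sequences. run_agent is a hypothetical wrapper that returns the ordered list of tool names the agent called, and golden_cases.json is an externally validated suite, not the agent's own output. Grading is deterministic comparison, no model judge.

```python
# Code-graded eval sketch: compare the agent's tool sequence against golden cases.
import json

def tool_sequence_score(golden_path: str) -> float:
    with open(golden_path) as f:
        cases = json.load(f)   # [{"input": "...", "expected_tools": ["verify_customer", ...]}, ...]

    passed = 0
    for case in cases:
        actual = run_agent(case["input"])   # hypothetical: returns e.g. ["verify_customer", "process_refund"]
        if actual == case["expected_tools"]:
            passed += 1
        else:
            print(f"FAIL {case['input'][:40]!r}: got {actual}, expected {case['expected_tools']}")
    return passed / len(cases)

# Gate the merge in CI: fail the build if the score drops below the current baseline.
# assert tool_sequence_score("golden_cases.json") >= 0.85
```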
---

**Source:** https://claudearchitectcertification.com/concepts/evaluation
**Vault sources:** ACP-T03 §4.3 evaluation overview; ASC-A01 Course 6 evals lessons
**Last reviewed:** 2026-05-04
**Evidence tiers:** 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.

---

# CLAUDE.md Hierarchy

> CLAUDE.md is persistent project memory loaded automatically by Claude Code. A three-level hierarchy (user → project → directory) lets you set global defaults, team standards, and per-directory rules. Project-level files are version-controlled and shared. Use @import to keep CLAUDE.md modular.

**Domain:** D3 · Agent Operations (20% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/claude-md-hierarchy
**Last reviewed:** 2026-05-04

## Quick stats

- **Hierarchy levels:** 3
- **File locations:** 2
- **Import syntax:** @import
- **Exam domain:** D3
- **Hard rule limit:** ,

## What it is

CLAUDE.md is a markdown file that gives Claude Code persistent project memory. It loads automatically at session start and is appended to the system context. Three scopes: user (~/.claude/CLAUDE.md), project (.claude/CLAUDE.md or root CLAUDE.md), and directory (per-folder file that loads when editing inside it). Project files are version-controlled and shared with your team. @import lets you split rules into modular files.

## Exam-pattern questions

### Q1. A teammate's PRs use a different code style than yours. They have rules in ~/.claude/CLAUDE.md but not in the repo. Fix?

Move team rules to .claude/CLAUDE.md (project root, version-controlled). User-level rules are personal; only project rules are shared via git. Team standards belong in the repo.

### Q2. You add a rule to .claude/CLAUDE.md mid-session. Claude doesn't follow it. Why?

CLAUDE.md is loaded once at session start. Mid-session edits don't take effect until restart. End the session, restart, and the new rules apply.

### Q3. Your testing rule file sits in .claude/rules/testing.md but doesn't auto-load when editing test files. What's missing?

Check the YAML frontmatter paths glob. Make sure it matches your file paths exactly. Try `paths: ["**/*.test.tsx", "**/*.spec.ts"]` for broad coverage.

### Q4. You have @import ./shared.md in two CLAUDE.md files, and shared.md has @import ../main.md. Claude Code hangs on load. Why?

Circular import. Claude Code does not detect cycles. Refactor: don't have child rule files import their parents. Keep the import graph acyclic.

### Q5. A new contributor follows the README but writes code that violates project conventions. They don't read CLAUDE.md. Architectural fix?

CLAUDE.md auto-loads when they use Claude Code; no reading required. If they bypass Claude Code (write by hand), enforce conventions via linters and CI checks.
CLAUDE.md complements but doesn't replace tooling.

### Q6. Your project root CLAUDE.md is 1,500 lines and Claude's responses are getting slower. Why?

CLAUDE.md is appended to every prompt. 1,500 lines is too much. Modularize: move detailed rules to .claude/rules/*.md with paths globs. Keep the root CLAUDE.md to ~300 lines of universal team rules.

### Q7. A path-scoped rule file's `paths: ["src/**/*.ts"]` matches but the rules don't apply to a file in src/lib/utils.ts. What's wrong?

Glob syntax. `**` matches multiple directory levels, so `src/**/*.ts` should match src/lib/utils.ts. If not, check for typos in the glob or restart Claude Code.

### Q8. You want different rules for api/ (backend) and app/ (frontend). Best structure?

Two path-scoped rule files: .claude/rules/backend.md with `paths: ["api/**"]`, .claude/rules/frontend.md with `paths: ["app/**"]`. Keep the root CLAUDE.md for shared rules. Each rule auto-loads only when relevant.

## FAQ

### Q1. If I have project + directory CLAUDE.md, which wins?

Both load. Directory rules apply only inside that directory; project rules apply everywhere. They are scoped, not conflicting.

### Q2. What goes in user-level CLAUDE.md?

Personal preferences that don't vary by project: your indent style, commit message format. Not team standards.

### Q3. Does CLAUDE.md affect performance?

Slightly; it appends to every prompt. Keep it concise; use @import if it grows past ~1000 lines.

---

**Source:** https://claudearchitectcertification.com/concepts/claude-md-hierarchy
**Vault sources:** ACP-T03 §4.3; GAI-K04 §3; ASC-A01 Courses 2 + 4
**Last reviewed:** 2026-05-04
**Evidence tiers:** 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.

---

# Plan Mode

> Plan Mode forces Claude Code to draft and approve a multi-step plan before executing. Use it for non-trivial changes (3+ steps); skip it for trivial edits. The exam pattern: when do you require Plan Mode vs. direct execution. Full content in SCRUM-21 follow-up.

**Domain:** D3 · Agent Operations (20% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/plan-mode
**Last reviewed:** 2026-05-04

## Quick stats

- **Trigger threshold:** 3+ steps
- **Exam domain:** D3
- **Decision factor:** blast radius
- **Coverage tier:** B
- **Skip case:** trivial edit

## Exam-pattern questions

### Q1. Single-file null-check: should you use plan mode?

No. Plan mode has overhead. Reserve it for 3+ step architectural decisions where there are multiple valid approaches and rework would be expensive. A clear bug-fix uses direct execution.

### Q2. 30-file refactor: direct execution or plan mode?

Plan mode is mandatory. The blast radius is too large to wing it. Without upfront analysis, you commit to a suboptimal architecture mid-refactor and rework hours of work. Plan mode's iteration cost (milliseconds) replaces rework cost (hours).

### Q3. The plan looks bad. Should you execute and fix the result later?

Reject the plan and iterate (zero cost, same session). Tell Claude what's wrong; Claude revises. Executing a bad plan and reworking is expensive. Plan mode's value is in iteration cost being free, not in execution being safe.

### Q4. You're working in an unfamiliar codebase (different framework, different conventions). Plan mode optional?

Mandatory. Unfamiliar code means hidden interdependencies. Plan mode makes them visible: Claude explores, identifies risks, proposes a strategy respecting framework conventions. Direct execution in unfamiliar code is a minefield.

### Q5.
Can you switch from direct execution to plan mode mid-task? Plan mode is a mode, not a patch. Switching mid-work commits to prior decisions made in direct mode. For complex tasks, start in plan mode. Switching late captures only future steps; the architectural commitments are already made. ### Q6. Plan mode in CI/CD pipelines: viable? No. Plan mode requires human approval (a decision gate). Only for interactive development. For automated CI/CD refactors, use direct execution with comprehensive testing. ### Q7. Plan mode and the -p flag: same purpose? Different. -p makes Claude Code non-interactive (CI/CD). Plan mode is interactive approval-gated. Different tools for different goals: plan mode for refactoring, -p for headless automation. ### Q8. Codebase is too large for plan mode to read entirely. How do you scope? Scope the request. Instead of "refactor the whole monolith," say "refactor the customer domain into a microservice." Plan mode explores only the relevant parts. For massive codebases, pre-summarize architecture in ARCHITECTURE.md. --- **Source:** https://claudearchitectcertification.com/concepts/plan-mode **Vault sources:** ACP-T03 §4.3 plan mode; GAI-K04 §15 decision table; ASC-A01 Course 4 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Skills > Skills are reusable, on-demand task workflows defined in markdown with YAML frontmatter. Unlike CLAUDE.md (always loaded), skills load only when Claude matches the description to the current task. Skills can restrict tool access and accept arguments. **Domain:** D3 · Agent Operations (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/skills **Last reviewed:** 2026-05-04 ## Quick stats - **Frontmatter fields:** 5 - **Scopes:** 2 - **Load pattern:** on-demand - **Exam domain:** D3 - **Context isolation:** fork ## What it is Skills are markdown files that teach Claude Code how to handle specific, repetitive tasks. Each has a YAML frontmatter (name, description, allowed-tools, argument-hint, context). The description drives Claude's matching, if you ask 'review this PR,' Claude matches the PR-review skill and loads it. Project skills (.claude/skills/) ship with the team; user skills (~/.claude/skills/) follow you. ## Exam-pattern questions ### Q1. Your skill description is "helps with code" and Claude activates it for every request. Fix? Description is too generic. Rewrite as task-specific: "validates Python security patterns; use when reviewing code for SQL injection or hardcoded secrets." Specific descriptions enable accurate matching. ### Q2. A skill has allowed-tools: Read, Grep, Glob, Bash and you ask Claude to edit a file. It refuses. Why? Skills enforce allowed-tools at activation time. While the skill is active, Claude can only use those tools. Either add Edit to allowed-tools, or invoke a different skill that has Edit. ### Q3. You wrote a 2,000-line SKILL.md. Claude responses are slow when the skill is active. Better structure? Use progressive disclosure. Keep SKILL.md under 500 lines. Move detailed references to references/, examples to examples/, scripts to scripts/. Link from SKILL.md: "if user asks about X, read references/guide.md." ### Q4. Your skill works on your machine but a teammate can't trigger it. Where's the file? Personal skills (~/.claude/skills/) don't sync. Move it to .claude/skills/ (project, version-controlled). Commit and push. 
Teammates pull and the skill is available. ### Q5. Two skills match the same request. Which one activates? Both load into context simultaneously. If their instructions conflict, results are unpredictable. Make descriptions non-overlapping: each skill should describe a unique task with no semantic overlap. ### Q6. A skill has scripts in scripts/ but they don't execute when invoked. Why? Scripts must be marked executable (chmod +x) and the SKILL.md must reference them with the correct relative path. Test the script standalone first; then verify SKILL.md instructions are clear about when/how to run it. ### Q7. When does a skill trump CLAUDE.md, and when does CLAUDE.md trump a skill? CLAUDE.md is always-loaded, applying to every conversation. Skills load on-demand when their description matches. They're complementary: CLAUDE.md for universal team standards; skills for repeated task workflows. Neither trumps the other; they layer. ### Q8. Your CI pipeline invokes Claude with --skill ci-review but the skill isn't activated. What's wrong? The flag must match the skill's name field exactly (lowercase, hyphenated). Check the SKILL.md frontmatter. Also verify the skill file is in .claude/skills/ (not ~/.claude/skills/ for project-scoped CI). ## FAQ ### Q1. What does context: fork do? Runs the skill in an isolated subagent and returns a summary. Keeps the main conversation clean. ### Q2. Can I invoke a skill explicitly? Yes, type @skill skill-name or just the name. Most flows rely on auto-detection via description. ### Q3. Do subagents inherit my skills? No, list them in the subagent's frontmatter skills field if you want them available. --- **Source:** https://claudearchitectcertification.com/concepts/skills **Vault sources:** ACP-T03 §4.3 skills; GAI-K04 §6 frontmatter; ASC-A01 Course 15 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Checkpoints & Session Management > Checkpoints save and restore conversation state. Vault coverage is thin; full authoring requires Phase 6 research. **Domain:** D3 · Agent Operations (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/checkpoints **Last reviewed:** 2026-05-04 ## Quick stats - **Coverage tier:** C - **Exam domain:** D3 - **Status:** stub - **Vault depth:** thin - **Action:** research ## Exam-pattern questions ### Q1. Session crashes at turn 12. You restart fresh. What did you lose? All agent context: prior reasoning, accumulated facts, learned patterns. Files survive but agent context does not. Checkpoints preserve the full message thread, so resuming is precision-perfect. ### Q2. Long task: should you summarize at each checkpoint to save tokens? No. Summarization erases precision. Checkpoints preserve the full message list (verbatim). The exam tests this distinction repeatedly: summarization is the opposite of checkpointing. ### Q3. Checkpoints and fork_session: same pattern? Different. Checkpoints resume linearly (A → A+1 → A+2). fork_session branches from a common baseline (A → A1 AND A2). Linear resumption vs branching exploration. ### Q4. Two checkpoints from different sessions, can you merge them? No. Checkpoints are single-session snapshots. Merging is unsupported and corrupts context. To combine work, use manual synthesis or explicit hand-offs (output of A as input to B). ### Q5. When should you save a checkpoint? 
At meaningful milestones: daily boundaries, after a major phase completes, before a risky decision, after human-in-the-loop approval. For a 30-turn project, 3-5 checkpoints (one per phase or week). ### Q6. Checkpoint loaded; can you edit it before continuing? Technically yes, but don't. Editing corrupts message sequences and breaks Claude's understanding. If you need to correct something, load the checkpoint, acknowledge the issue in a new message, and continue. Claude adjusts. ### Q7. Restoring a checkpoint resets the conversation cost? No. Restoring is one logical session (no reset). Tokens are charged for the restored messages as if part of one continuous session. Plan accordingly: a 30-turn checkpoint costs 30 turns to resume. ### Q8. Can checkpoints be used with the Batch API? No. The Batch API processes asynchronously and returns results; it's not a continuous agent session. Checkpoints are for interactive multi-turn agent work that spans days or sessions. --- **Source:** https://claudearchitectcertification.com/concepts/checkpoints **Vault sources:** ASC-A01 Course 3 checkpoints **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # System Prompts & Instructions > System prompts establish role, constraints, format, and tool guidance. Anatomy: role, capability boundaries, style/format rules, tool guidance, examples. Full content in SCRUM-21 follow-up. **Domain:** D4 · Prompt Engineering (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/system-prompts **Last reviewed:** 2026-05-04 ## Quick stats - **Anatomy parts:** 5 - **Exam domain:** D4 - **Coverage tier:** B - **Common trap:** vague role - **Best length:** concise ## Exam-pattern questions ### Q1. System prompt says "be conservative with refunds." Production shows 5% policy violations. Why? Natural-language guidance is advice the model ignores under pressure. Encode the policy in tool code (if amount > 500: deny). System prompt = macro guidance; tools = enforcement. ### Q2. Vague system prompt vs precise five-section charter: how much accuracy difference? 30-60% reduction in misrouting and off-policy behavior. A precise charter (role, boundaries, format, tool guidance, examples) outperforms "be helpful" by orders of magnitude. Every section is load-bearing. ### Q3. Update the system prompt mid-conversation: what happens? Each call re-reads the system prompt; the new prompt takes effect on the next call. Creates inconsistency mid-conversation: turn 5 with old prompt vs turn 6 with new prompt produces shifting behavior. Define the full charter upfront. ### Q4. System prompt is 5,000 tokens long. What's the issue? Attention dilution. Tight 500-1500 token charters outperform verbose. The model allocates less attention per section when the prompt bloats. Every section must be load-bearing; prune the rest. ### Q5. Why are example patterns the highest-ROI section of a system prompt? A 2-3 example before/after pair reduces misclassification by 30-60%. Examples show Claude exactly what success looks like in concrete cases. Vague descriptions can't substitute for one worked example. ### Q6. System prompt + tool descriptions: which takes precedence on routing? Tool descriptions take precedence (structural enforcement via SDK). System prompt is linguistic guidance. Align them to avoid confusion: prompt says "verify first"; tool description says "verify_customer: Use first." Redundant but safe. ### Q7. 
Cache the system prompt: when is it not worth it? Below ~1024 tokens (cache overhead exceeds savings) or for one-off requests (no repeat reads). For loops repeating the same system + tools, caching saves ~88% on those tokens per turn. ### Q8. Use a system prompt instead of an agentic loop for complex tasks: viable? No. System prompt defines role and rules. Agentic loop is the control structure (the while block that re-sends messages on tool_use). They're orthogonal; you need both. --- **Source:** https://claudearchitectcertification.com/concepts/system-prompts **Vault sources:** ACP-T03 §4.4 prompt engineering; ACP-T04 §6.B system prompt anatomy; ASC-A01 Course 6 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Prompt Caching > Prompt caching reduces cost (~90%) on repeated context like long system prompts and tool definitions. Cache breakpoints, TTL, and cache_control field placement are exam patterns. Full content in SCRUM-21 follow-up. **Domain:** D4 · Prompt Engineering (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/prompt-caching **Last reviewed:** 2026-05-04 ## Quick stats - **Cost reduction:** ~90% - **Exam domain:** D4 - **Coverage tier:** B - **Trigger:** repeated context - **Field:** cache_control ## Exam-pattern questions ### Q1. Cache the entire message history to speed up long conversations: works? No. The message list grows every turn (new user + assistant messages). Caching requires immutable content. Cache the system prompt and tool definitions, not the messages. ### Q2. Daily news content marked cache_control: ephemeral. Why is this wrong? Ephemeral caching is for content that doesn't change within 5 minutes. Daily content causes cache misses on every change. Cache only immutable content; daily-update content gets no benefit. ### Q3. Cached content from one API key leaks to another. Security issue? No, this can't happen. Caching is per API key, per conversation. Each conversation gets its own cache. Cache isolation is a security feature. ### Q4. After 5 minutes the cache automatically extends? No. After 5 minutes of no access, the cache expires completely. The next call re-reads at full token cost. Plan for cache expiry in long-running loops. ### Q5. How much does caching save on a 1000-token system prompt called 10 times? ~81-88%. First call: 1000 tokens (full). Calls 2-10: 100 each (10% of cache price). Total: 1000 + 900 = ~1900 tokens vs 10,000 fresh. ### Q6. Caching vs the Batch API: same purpose? Different. Caching: 90% on reused content within a 5-min window (instant). Batch API: 50% on all tokens (24-hour wait). Caching for interactive loops; Batch for async bulk. ### Q7. Tool definitions: cacheable like the system prompt? Yes. Tool definitions are stable across turns. Mark with cache_control: ephemeral (or rely on SDK auto-caching). Combined with cached system prompt, saves ~90% on the fixed prefix per turn. ### Q8. Caching with subagents: each subagent gets its own cache? Yes. Each subagent caches its own system prompt and tools. Caches are separate (per subagent). Subagent A's cache doesn't affect subagent B. 
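
A caching sketch, assuming the anthropic SDK's cache_control blocks. LONG_SYSTEM_PROMPT, the tool definition, messages, and the model id are placeholders. The stable prefix (tool definitions + long system prompt) is marked ephemeral; the growing message list is left uncached.

```python
# Prompt caching sketch: cache the fixed prefix, not the per-turn messages.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",                       # illustrative model id
    max_tokens=1024,
    tools=[
        {
            "name": "lookup_order",
            "description": "Look up an order by ID.",
            "input_schema": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
            "cache_control": {"type": "ephemeral"},  # cache up to and including this block
        },
    ],
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,              # stable content; worth caching above ~1024 tokens
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=messages,                               # grows every turn; not cached
)
```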
--- **Source:** https://claudearchitectcertification.com/concepts/prompt-caching **Vault sources:** ACP-T03 §5 prompt caching; ACP-T04 prompt-caching-optimization; ASC-A01 Course 6 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Batch API > Message Batches API: 50% discount for async, non-time-sensitive workloads. Vault coverage thin; needs Phase 6 research. **Domain:** D4 · Prompt Engineering (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/batch-api **Last reviewed:** 2026-05-04 ## Quick stats - **Discount:** 50% - **Exam domain:** D4 - **Coverage tier:** C - **Status:** stub - **Action:** research ## Exam-pattern questions ### Q1. Use Batch API for a CI/CD pre-merge check that must block the PR. Viable? No. Batch processes within 24 hours, not immediately. Pre-merge needs synchronous responses (minutes). Use the standard Messages API for latency-sensitive workflows. ### Q2. Customer-facing chatbot uses Batch API. What goes wrong? 24-hour latency is unacceptable for interactive UX. The customer waits 24 hours for a response; they leave. Use synchronous API for real-time. Batch is for no one waiting. ### Q3. Batch supports multi-turn tool calling, right? No. Batch is single-turn per request. No tool continuation, no agentic loops, no streaming. Each JSONL line is a separate messages.create() request. For tool loops, use synchronous API. ### Q4. 50% savings means always use Batch. True? No. 50% savings only justify 24-hour latency if no one is waiting. For interactive tasks, the cost of waiting (user frustration, business loss) exceeds the 50% savings. Batch is for asynchronous workloads only. ### Q5. Batch API processes 50,000 requests faster than synchronous? No, cheaper not faster. Batch processes within 24 hours; synchronous in parallel finishes in minutes. If you need speed, use synchronous + concurrency. If you need cost, use Batch. ### Q6. What's the maximum batch size? Up to 10,000 requests per batch. For larger workloads, split into multiple batches. Each batch is independent; submit them in parallel for higher throughput. ### Q7. How do you correlate requests with results? Use the custom_id field. Each request defines a custom_id; the result includes the same custom_id. Match them to pair request with response. Without custom_id, results are unordered and untraceable. ### Q8. Can you cancel a batch after submitting? No. Once submitted, the batch runs to completion. Plan carefully before submitting; cancellation is not supported. Test on small batches first, then scale to 10,000. --- **Source:** https://claudearchitectcertification.com/concepts/batch-api **Vault sources:** ACP-T03 §5 batches **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Structured Outputs > Structured outputs guarantee Claude returns JSON matching a schema instead of natural language. Force a tool with a JSON schema; the API enforces structure, not your parser. Schema design (nullable fields, 'unclear' enum values) prevents fabrication. 
**Domain:** D4 · Prompt Engineering (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/structured-outputs **Last reviewed:** 2026-05-04 ## Quick stats - **Anti-fabrication patterns:** 3 - **Enforcement:** tool_use - **Schema elements:** 2 - **Exam domain:** D4 - **Guarantee:** 100% ## What it is Structured outputs are JSON responses guaranteed to match a schema by forcing Claude to call a tool instead of returning unstructured text. Define a tool with a JSON schema; set tool_choice to force it. Claude must return valid JSON. This eliminates parsing and makes downstream processing deterministic. Schema design prevents fabrication: nullable fields, 'unclear' enum values, structured claim-source mappings. ## Exam-pattern questions ### Q1. You prompt Claude for JSON and 15% of responses include explanatory text. How do you fix this for production? Stop prompting for JSON; use tool_use with a JSON schema and tool_choice: forced. Token generation is constrained to match the schema; structure is guaranteed. Prompt-only approaches are probabilistic. ### Q2. Your schema has {refund_reason: {type: "string"}}. Claude returns "reason: unable to determine" when the contract is silent. What's wrong? The schema forces a string, so Claude fabricates one when source is silent. Fix: {refund_reason: {type: ["string", "null"]}}. Or add an enum: ["refund", "replacement", "unclear"]. Give Claude a way to say "I don't know." ### Q3. A validation-retry loop runs 5 times and still fails. What's the next step? Don't retry forever. After 2-3 retries, escalate to a human reviewer with the document and last-attempt extraction. Failure to converge usually means the source is genuinely ambiguous, not that Claude is broken. ### Q4. You force tool_choice: forced for extraction. Sometimes Claude returns tool_use with empty input. Why? Forced tool_choice guarantees the tool fires, not that input is valid. Empty input means the source has no extractable data. Validate input in your harness; return a structured error so Claude can ask for more context or escalate. ### Q5. Your extraction tool returns confidence: 0.4 for a critical field. What should production do? Route low-confidence extractions to human review. Low confidence is a feature: it means Claude is honest about uncertainty. Auto-accept high-confidence; queue medium for batch review; escalate low immediately. ### Q6. You use tool_choice: forced with extended_thinking. The API returns 400. Why? Extended thinking is incompatible with forced or any tool_choice. Use auto when thinking is on. If extraction must be guaranteed, drop thinking; if reasoning matters more, accept text fallback. ### Q7. Schema validation passes but the extracted customer_id is "customer_001" when the document says "cus_abc123". What's the gap? Schema enforces structure (string), not content correctness. Add a regex pattern: "pattern": "^cus_[a-z0-9]+$". Or validate semantically in your harness and trigger validation-retry on mismatch. ### Q8. You're using Batch API for 1,000 extractions. 50 fail validation. What's the cost-aware retry strategy? Batch doesn't auto-retry. Submit failures as a new batch with the original document + the validation error. Most converge on the second pass. Truly stubborn cases go to human review. Don't re-run all 1,000. ## FAQ ### Q1. Does forcing a tool guarantee correctness? No. It guarantees the format. Content can still be wrong; validate semantics after. ### Q2. Use tool_choice for every extraction? 
If downstream code expects JSON, yes. For exploratory work, auto is fine. ### Q3. Can I use tool_choice for nested structures? Yes. Define nested objects in the schema with "type": "object" and "properties". --- **Source:** https://claudearchitectcertification.com/concepts/structured-outputs **Vault sources:** ACP-T03 §4.4; GAI-K04 §18 anti-fabrication schema **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Vision & Multimodal > Vision lets Claude process images alongside text. Vault coverage thin; needs Phase 6 research. **Domain:** D4 · Prompt Engineering (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/vision-multimodal **Last reviewed:** 2026-05-04 ## Quick stats - **Input type:** image - **Exam domain:** D4 - **Coverage tier:** C - **Status:** stub - **Action:** research ## Exam-pattern questions ### Q1. Vision request: should you use base64 inline or the Files API? Base64 for <10 images, Files API for batch (50+). Base64 inflates JSON ~33%; Files API stores once and references via file_id, cutting bandwidth and ~40% on token cost for repeated reuse. ### Q2. High-res 4000×3000 images for OCR accuracy. Worth the cost? No. Higher resolution increases token cost linearly with no accuracy gain above 1200×1500. Diminishing returns at ~1000 px width. Downresample for cost; you don't lose precision. ### Q3. Claude returns invalid JSON from a vision extraction. Increase max_tokens? No. Schema-validation failure usually signals ambiguity in the image (poor scan, missing data), not token exhaustion. Retry with clarification or escalate to human review. More tokens won't help. ### Q4. Vision always produces accurate output. True? No. Claude hallucinates fields when tables are ambiguous or images degraded. Always validate JSON against your schema, track confidence per field, escalate low-confidence results. ### Q5. 100-page PDF: send as one mega-image or split by page? Split by page, process each separately. 100 requests but each cheaper than one mega-request (token cost is sublinear per page). Batch process pages in parallel for speed. ### Q6. Image size 1200×1500 vs 1024×768 for invoice extraction. Which? 1200×1500 for dense text and tables; 1024×768 for screenshots or clean documents. Match resolution to information density. Anything above 1500 wastes tokens. ### Q7. Vision and PII (faces, credit cards): can Claude redact in-place? No explicit redaction. Ask Claude to flag PII regions in JSON: {region: "top-left", pii_type: "credit_card"}. Redact client-side with image processing. Don't ship raw images with PII to logs. ### Q8. Token cost of a 1024×768 screenshot vs a 1200×1500 document page? ~400 tokens for the screenshot, ~1000 for the document page. Roughly equal per pixel. Plan token budget by resolution + density. --- **Source:** https://claudearchitectcertification.com/concepts/vision-multimodal **Vault sources:** ACP-T03 capabilities; ASC-A01 Course 6 vision **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Streaming > Streaming returns tokens as they're generated for low-latency UX. Vault coverage thin; needs Phase 6 research. 
**Domain:** D4 · Prompt Engineering (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/streaming **Last reviewed:** 2026-05-04 ## Quick stats - **Use case:** low latency - **Exam domain:** D4 - **Coverage tier:** C - **Status:** stub - **Action:** research ## Exam-pattern questions ### Q1. Streaming reduces token cost: true? No. Streaming costs the same per token. It's a UX feature (responsive text), not a cost optimization. The total tokens generated and billed are identical to non-streaming. ### Q2. Stream connection drops mid-response. Retry the entire request? No. The drop doesn't lose what was already received. The client retains the buffered text. Retrying sends a new prompt and wastes tokens on duplicate work. Log the error, decide based on context. ### Q3. Tool_use blocks stream character-by-character: parse JSON as it arrives? No. Tool_use blocks arrive complete (in ContentBlockStart) or as chunked input_json_delta events that must be fully accumulated before parsing. Parsing partial JSON fails. ### Q4. Display every event to the user immediately: good UX? No. Some events (MessageStart) are metadata, not displayable. Filter: display only ContentBlockDelta.text. Meta events go to logging or internal state tracking. ### Q5. Cancel a streaming request by closing the connection: stops the bill? No. Closing stops receiving, but the request is still processed server-side and billed up to that point. Cancellation is not a cost-saving mechanism; budget for completion before opening the stream. ### Q6. Latency of the first streamed token? ~100-200ms from request to first ContentBlockDelta.text event. Similar to non-streaming first-token time. Streaming optimizes time-to-screen, not time-to-first-token. ### Q7. Streaming response shorter than 1 second: noticeable UX improvement? No. Responses >3-5 seconds become noticeably more responsive with streaming. Shorter (<1 sec) shows negligible improvement. Use streaming for long-form responses, not quick queries. ### Q8. Emit streamed chunks to a browser client: which protocol? Server-Sent Events (SSE). Server: open /stream endpoint, iterate Claude stream, response.write(event) each chunk. Client: const es = new EventSource('/stream'); es.onmessage = .... Simple, reliable, browser-native. --- **Source:** https://claudearchitectcertification.com/concepts/streaming **Vault sources:** ACP-T03 capabilities **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Attention Engineering > Attention engineering is the discipline of placing critical context where the model attends most strongly, high in the prompt, in the system message, or in repeated facts blocks. Full content in SCRUM-21 follow-up. **Domain:** D4 · Prompt Engineering (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/attention-engineering **Last reviewed:** 2026-05-04 ## Quick stats - **Strong-attention zones:** 3 - **Exam domain:** D4 - **Coverage tier:** B - **Trap:** buried context - **Pattern:** place high ## Exam-pattern questions ### Q1. Critical fact buried in paragraph 5 of context. Agent misses it 40% of the time. Why? Lost-in-the-Middle effect. Mid-context positions get ~40-50% attention vs 90%+ at start/end. Move the fact to the top (system prompt or CASE_FACTS block) or end (current question). Position trumps content. ### Q2. Bold or capitalize important facts to boost attention? No. 
The transformer doesn't treat Markdown emphasis or capitalization as importance signals: **customer_id: 12345** in bold and customer_id: 12345 in plain text carry the same attention weight. Position is what matters. ### Q3. Add important context at the very end for maximum emphasis? Recency bias helps, but only for the final question/request. Context placed at the end is treated as supporting detail, not fact. System prompt is always position 0 for constraints; case-facts at top of user message for transactional data. ### Q4. Put all context in the system prompt to ensure it's always attended? No. System prompts are limited (~2000 tokens effective); they're for role and constraints, not transactional facts. Use system prompt for rules; CASE_FACTS block in user message for facts. ### Q5. Lost-in-the-Middle is a myth: it doesn't affect Claude? False. Empirically documented (Liu et al. 2023) and replicated across all transformers including Claude. Mid-context accuracy drops 40-50%. Mitigate with structural changes, not by ignoring the effect. ### Q6. Agent forgets a key fact despite it being in context. Increase max_tokens? No. Forgetting is an attention-weight issue, not a token budget issue. More tokens won't help. Restructure: move the fact to the top or end. ### Q7. When should you start windowing in an agentic loop? After 4-6 turns (8-12 messages). Beyond that, early turns degrade. Optimal: window at turn 5 or when message list >10KB. Pre-emptive is better than reactive. ### Q8. What goes in the system prompt vs the CASE_FACTS block? System prompt: role, constraints, output schema (rules that don't change per instance). CASE_FACTS: customer ID, order ID, amount, dispute summary (transactional facts that change per instance). Both are needed; they layer. --- **Source:** https://claudearchitectcertification.com/concepts/attention-engineering **Vault sources:** ACP-T03 §15 attention engineering **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # 4D Framework > Delegation, Description, Discernment, Diligence: Anthropic's 4D framework for agent prompt design. Vault coverage thin; needs Phase 6 research. **Domain:** D4 · Prompt Engineering (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/4d-framework **Last reviewed:** 2026-05-04 ## Quick stats - **Pillars:** 4 - **Exam domain:** D4 - **Coverage tier:** C - **Status:** stub - **Action:** research ## Exam-pattern questions ### Q1. Should we use Claude to verify legal text before sending to a client? No; that's the wrong Delegation. Legal text requires human authority and accountability. Use Claude to draft for human review. Delegation assigns work to leverage each agent's strengths; lawyers review, Claude drafts. ### Q2. "Be helpful" as the prompt: what's missing? Description (Product, Process, Performance). "Be helpful" is vague. Specify the format (JSON, bullets), the reasoning steps (chain-of-thought), and the tone (friendly advisor, expert reviewer). Vague Description produces vague output. ### Q3. Discernment fails: agent's output is wrong but you ship anyway. What's the meta-failure? Skipping Diligence. Discernment detected the flaw; Diligence demands you don't deploy without addressing it. Either fix (loop back to Description) or escalate to human. Shipping known-flawed output is the opposite of Diligence. ### Q4. Auto-approve refunds <$100 without human review: smart automation or bad practice? Bad practice without Diligence.
Even small amounts demand audit logging and periodic human review. Diligence is "I validated this before shipping"; auto-approval without validation skips that step. ### Q5. Description includes Product, Process, Performance. What's the difference? Product: desired output format (JSON schema, markdown, prose). Process: step-by-step instructions for reasoning (chain-of-thought, few-shot examples). Performance: tone, style, role (friendly, expert, advisor). All three together = strong Description. ### Q6. Skip Diligence on low-stakes tasks: ever acceptable? Sometimes. For brainstorming or summarizing a blog post, minimal Diligence is fine. For compliance, financial, legal, healthcare, all four Ds are mandatory. Match Diligence rigor to stakes. ### Q7. The 4D Framework is a technical architecture (like MVC)? No. It's a mental model for human-AI collaboration. Guides how you think about delegating, communicating, validating, deploying. Not a code pattern; a discipline. ### Q8. Discernment detects a flaw: where do you loop back? To Description. Refine the prompt: tighten Product (output format), clarify Process (reasoning steps), adjust Performance (tone). If refinement doesn't help, reconsider Delegation (is this task suited for AI at all?). --- **Source:** https://claudearchitectcertification.com/concepts/4d-framework **Vault sources:** ClaudeCertifications course materials **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Prompt Engineering Techniques > Prompt engineering is a craft of seven techniques that turn a brittle one-shot prompt into a production-grade contract: few-shot examples, iterative refinement against an eval suite, anti-fabrication schemas, structured templates, explicit do-don't constraints, output anchoring via tools, and test-driven prompt edits. The exam trap is treating any one technique as 'the answer'; production prompts compose all seven. **Domain:** D4 · Prompt Engineering (20% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/prompt-engineering-techniques **Last reviewed:** 2026-05-04 ## Quick stats - **Core techniques:** 7 - **Iterations to converge:** 5-15 - **Eval suite size:** 20-50 cases - **Few-shot example sweet spot:** 2-5 pairs - **Exam domain:** D4 ## What it is Prompt engineering is the discipline of shaping Claude's output by writing a prompt, measuring it against test cases, observing where it fails, and tightening the prompt until the eval score plateaus. It is not writing one clever sentence. A production prompt converges over 5-15 iterations against a frozen suite of 20-50 cases, of which 3-5 are known-failure cases that anchor the regression boundary. Per ASC-A01 Course 11 §16, prompt v1 scoring 7.66 climbs to 8.7 in v2 only because the eval gave a quantitative delta to chase. 
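That workflow reduces to a small gate you can run in CI. A minimal sketch in plain Python, assuming per-case scores are stored as JSONL (the file names, pass mark, and score scale are illustrative, not from the course material): ship a new prompt version only when the mean score does not drop and no previously-passing case regresses.

```python
import json

def load_scores(path: str) -> dict[str, float]:
    # One JSON object per line, e.g. {"case_id": "sarcasm_03", "score": 0.9}
    with open(path) as f:
        return {row["case_id"]: row["score"] for row in map(json.loads, f)}

def regression_gate(old_path: str, new_path: str, pass_mark: float = 0.8) -> bool:
    old, new = load_scores(old_path), load_scores(new_path)
    mean_old = sum(old.values()) / len(old)
    mean_new = sum(new.values()) / len(new)
    # A case that passed under the old prompt must still pass under the new one.
    regressions = [
        case_id for case_id, score in old.items()
        if score >= pass_mark and new.get(case_id, 0.0) < pass_mark
    ]
    if mean_new < mean_old or regressions:
        print(f"BLOCK: mean {mean_old:.2f} -> {mean_new:.2f}; regressed: {regressions}")
        return False
    print(f"SHIP: mean {mean_old:.2f} -> {mean_new:.2f}; no regressions")
    return True

# Gate a prompt-change PR: block the merge unless the new version is regression-free.
# regression_gate("scores_v1.jsonl", "scores_v2.jsonl")
```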
The seven load-bearing techniques are few-shot prompting (showing 2-5 input-output pairs so Claude can pattern-match the harder cases), iterative refinement (the Draft → Test → Observe → Refine → Anchor loop), anti-fabrication patterns (nullable fields, unclear enum values, citation-required answers), structured templates (role + objective + tone + tools + constraints + format + escalation), constraint elicitation (concrete do-don't lists with worked examples), output anchoring (force a tool_use schema, do not ask 'output JSON' in prose), and test-driven prompting (every change validated against a frozen eval suite, regression-free or roll back). The reason every Domain 4 question on the exam is a composition question, never a single-technique question, is that real production prompts are layered. A refund agent uses a structured system-prompt template, three few-shot examples covering the sarcasm-style edge case, an unclear enum option for the refund-reason field, a forced tool_choice for output anchoring, and a 50-case eval that gates every prompt PR. Strip any one layer and the leak rate climbs above 5%. Per ACP-T03 §4.4, natural-language 'output JSON' prompts leak structure ~15% of the time; tool-anchored prompts leak 0%. ## How it works Iterative refinement is the spine. You start with a baseline prompt that almost works, build a small eval set (20-50 cases is the sweet spot, of which 3-5 are deliberate failure cases you have already seen Claude get wrong), and run the prompt against the suite to get a score. Score is the only signal that matters; subjective 'this feels better' is theater. Per ASC-A01 Course 12 §22, the score itself isn't inherently good or bad. What matters is whether you can improve it by refining your prompts. Each refinement targets the lowest-scoring case: you write a tighter instruction or add an example, then re-run the entire suite to make sure no other case regressed. Few-shot prompting is the highest-ROI lever inside the loop. Per ASC-A01 Course 12 §29, you wrap each example in XML tags, choose 2-5 examples that cover the failure modes (sarcasm, ambiguity, edge formats), and optionally add a one-line explanation of *why* the example is ideal. A 2-3 example before-after pair reduces misclassification by 30-60%, the biggest accuracy gain a single edit can buy. The mistake is using examples Claude already gets right; always pick examples that match the cases you have observed Claude fail on. Output anchoring and anti-fabrication are where natural-language prompting reaches its ceiling. Asking 'please output JSON' in prose leaks structure roughly 15% of the time under load. The fix is tool_use with a JSON schema and tool_choice: forced; the API constrains token generation to match the schema, so structure is guaranteed at 100%. Fabrication is a separate problem: schemas guarantee shape, not truth. Per ACP-T03 §4.4 and the structured-data-extraction scenario doc, when a source is genuinely silent, give the model two honest exits: nullable fields (type: ['string', 'null']) and an unclear / not_provided enum value. Without these escape hatches, required-string fields force fabrication and the leak rate climbs above 5%. ## Where you'll see it in production ### Refund agent prompt PR gate Every prompt-change PR runs against a 60-case golden suite (20 happy path, 20 edge, 10 escalations, 10 known-failure). The CI fails the PR if the average score drops or any previously-passing case regresses.
Per ASC-A01 Course 11 §15, the alternative, 'test once and decide it's good enough', carries a significant risk of breaking in production. Three iterations typically take a prompt from 80% to 95%. ### Sarcasm-aware sentiment classifier A social-listening team's first prompt scored 71% on a sarcasm-heavy suite. Adding three / pairs that explicitly covered Plan-9-style ironic praise lifted the score to 88%. The team picked the examples directly from the eval's lowest-scoring cases, which per ASC-A01 Course 12 §29 is the canonical way to source few-shot examples. No model upgrade required. ### Contract-extraction anti-fabrication A legal-ops extractor was fabricating termination_clause text when the contract was silent. The fix was schema-level, not prompt-level: nullable strings plus an unclear enum option in the tool input schema. Per the structured-data-extraction scenario doc, fabrication rate dropped from 8% to under 0.5% once the model had an honest exit. The system prompt remained almost unchanged. ## Code examples ### Few-shot prompt with iterative-eval harness **Python:** ```python from anthropic import Anthropic import json client = Anthropic() SYSTEM = """You classify tweet sentiment as positive, negative, or unclear. DO emit one of: positive | negative | unclear. DON'T add commentary or explanation. DON'T pick positive when the tone is sarcastic. Here are example input-output pairs: I love how my flight was delayed three hours. negative Best coffee I've had all week, genuinely. positive idk what to make of this product unclear""" def classify(tweet: str) -> str: resp = client.messages.create( model="claude-opus-4-5", max_tokens=10, system=SYSTEM, messages=[{"role": "user", "content": tweet}], ) return resp.content[0].text.strip().lower() def run_eval(cases_path: str) -> float: cases = [json.loads(line) for line in open(cases_path)] correct = sum(1 for c in cases if classify(c["tweet"]) == c["expected"]) score = correct / len(cases) print(f"Score: {score:.2%} ({correct}/{len(cases)})") return score # Iterate: edit SYSTEM, re-run, compare. Ship only on regression-free improvement. v1_score = run_eval("sentiment_cases.jsonl") ``` > Few-shot pairs cover the sarcasm failure case explicitly. The eval harness reports a single number per prompt version; iterate until the score plateaus. **TypeScript:** ```typescript import Anthropic from "@anthropic-ai/sdk"; import { readFileSync } from "node:fs"; const client = new Anthropic(); const SYSTEM = `You classify tweet sentiment as positive, negative, or unclear. DO emit one of: positive | negative | unclear. DON'T add commentary or explanation. DON'T pick positive when the tone is sarcastic. I love how my flight was delayed three hours. negative Best coffee I've had all week, genuinely. positive idk what to make of this product unclear`; async function classify(tweet: string): Promise { const resp = await client.messages.create({ model: "claude-opus-4-5", max_tokens: 10, system: SYSTEM, messages: [{ role: "user", content: tweet }], }); const block = resp.content[0]; return block.type === "text" ? 
block.text.trim().toLowerCase() : ""; } async function runEval(casesPath: string): Promise { const cases = readFileSync(casesPath, "utf8") .trim() .split("\n") .map((l) => JSON.parse(l) as { tweet: string; expected: string }); let correct = 0; for (const c of cases) { const out = await classify(c.tweet); if (out === c.expected) correct += 1; } const score = correct / cases.length; console.log(`Score: ${(score * 100).toFixed(1)}% (${correct}/${cases.length})`); return score; } await runEval("sentiment_cases.jsonl"); ``` > Same shape in TypeScript. Each prompt revision becomes a new SYSTEM constant; you compare scores side-by-side. ### Output anchoring with anti-fabrication schema **Python:** ```python from anthropic import Anthropic client = Anthropic() # Anti-fabrication: nullable + 'unclear' enum exit. EXTRACT_TOOL = { "name": "extract_refund_decision", "description": "Extract refund decision from a customer ticket.", "input_schema": { "type": "object", "properties": { "decision": { "type": "string", "enum": ["approve", "deny", "escalate", "unclear"], }, "refund_reason": {"type": ["string", "null"]}, "amount_usd": {"type": ["number", "null"]}, "confidence": {"type": "number", "minimum": 0, "maximum": 1}, }, "required": ["decision", "confidence"], }, } def extract(ticket: str) -> dict: resp = client.messages.create( model="claude-opus-4-5", max_tokens=512, tools=[EXTRACT_TOOL], tool_choice={"type": "tool", "name": "extract_refund_decision"}, messages=[{"role": "user", "content": ticket}], ) for block in resp.content: if block.type == "tool_use": return block.input raise RuntimeError("forced tool_choice did not fire") # 'unclear' + nullable fields let the model say 'I don't know' honestly. result = extract("Customer wants a refund. No order ID provided.") # Expected: {"decision": "escalate", "refund_reason": null, "amount_usd": null, "confidence": 0.3} ``` > Tool-anchored output is 100% structured. The 'unclear' enum and nullable fields cut fabrication on silent sources from 8% to under 0.5%. **TypeScript:** ```typescript import Anthropic from "@anthropic-ai/sdk"; const client = new Anthropic(); const EXTRACT_TOOL: Anthropic.Tool = { name: "extract_refund_decision", description: "Extract refund decision from a customer ticket.", input_schema: { type: "object", properties: { decision: { type: "string", enum: ["approve", "deny", "escalate", "unclear"], }, refund_reason: { type: ["string", "null"] }, amount_usd: { type: ["number", "null"] }, confidence: { type: "number", minimum: 0, maximum: 1 }, }, required: ["decision", "confidence"], }, }; async function extract(ticket: string): Promise { const resp = await client.messages.create({ model: "claude-opus-4-5", max_tokens: 512, tools: [EXTRACT_TOOL], tool_choice: { type: "tool", name: "extract_refund_decision" }, messages: [{ role: "user", content: ticket }], }); const tool = resp.content.find( (b): b is Anthropic.ToolUseBlock => b.type === "tool_use", ); if (!tool) throw new Error("forced tool_choice did not fire"); return tool.input; } ``` > Same anti-fabrication pattern in TypeScript. The schema is the contract; prompt-level pleas to 'be honest' are not a substitute. ## Looks-right vs actually-wrong | Looks right | Actually wrong | |---|---| | Adding 'output valid JSON, do not include any other text' to the system prompt to enforce structure. | Natural-language prompting for structure leaks ~15% of the time under load. Per ACP-T03 §4.4, the only structural guarantee is tool_use with a JSON schema and tool_choice: forced. 
Prompt instructions are advice, not enforcement. | | Spending three days hand-tuning the wording of the system prompt to fix a 12% accuracy gap. | Per ASC-A01 Course 12 §29, adding 2-3 well-chosen few-shot examples typically yields a 30-60% misclassification reduction in a single edit. Prose tweaks deliver diminishing returns; concrete examples deliver step-changes. | | Using Claude itself to grade 100% of your eval cases because it scales better than humans. | Per the evaluation concept (docid #f86eb3), model-based grading is not reproducible across model versions and fails on deterministic checks (tool sequences, JSON schemas). Use code-based grading for anything measurable; reserve model grading for subjective qualities like tone. | | Required-string field for refund_reason so the schema always returns a value the downstream code can parse. | Required strings force fabrication when the source is silent. Per the structured-data-extraction scenario, fabrication climbs above 5% without an unclear enum option or a nullable type. Always give the model an honest exit. | | Iterating the prompt 20+ times against the same 10 hand-picked test cases until the score reaches 100%. | Per the evaluation concept §Q3, that's overfitting. Score plateau on a tiny fixed suite masks production blind spots. Grow the suite to 50+ externally-validated cases and pair with shadow-mode production evals. | ## Decision tree 1. **Are you anchoring the output format?** - **Yes:** Use tool_use with a JSON schema and tool_choice: forced. Structure is guaranteed at 100%, no parsing risk. - **No:** Prompt-only 'output JSON' leaks ~15% under load. Switch to tool-anchored output before iterating further. 2. **Do you have a frozen 20-50 case eval suite, of which 3-5 are known-failure cases?** - **Yes:** Iterate: edit prompt, re-run suite, ship only on regression-free improvement. Per ASC-A01 Course 11 §16, target a measurable score delta per version. - **No:** Build the suite before tuning the prompt. Even a 10-case suite catches 80% of regressions; without it, every change is a coin flip. 3. **Is your accuracy gap concentrated in specific edge cases (sarcasm, ambiguity, format variance)?** - **Yes:** Add 2-3 few-shot examples that cover those exact cases. Per ASC-A01 Course 12 §29, this is the highest-ROI single edit. - **No:** If the gap is uniform, your problem is the structured template (role + objective + format + constraints), not the examples. Rewrite the system prompt skeleton first. 4. **Is the model fabricating values when the source is silent?** - **Yes:** Schema-level fix: nullable types and an unclear / not_provided enum option. Per the structured-data-extraction scenario, this drops fabrication from 8% to under 0.5%. - **No:** If fabrication only happens under load, check whether your schema requires fields the source rarely contains; relax the required list. ## Exam-pattern questions ### Q1. Your extraction prompt says 'output valid JSON'. 12% of responses include explanatory prose. Best fix? Stop prompting for JSON; use tool_use with a JSON schema and tool_choice: forced. The most-tested distractor is "add 'do not include any text outside the JSON' to the system prompt". Per ACP-T03 §4.4, prompt-only is probabilistic (~15% leak); tool-anchored output is structurally guaranteed (0% leak). ### Q2. Your sentiment classifier misses sarcastic tweets ~30% of the time. The cheapest fix? Add 2-3 few-shot examples that specifically demonstrate sarcasm, wrapped in XML tags.
Per ASC-A01 Course 12 §29, this technique reduces misclassification by 30-60%. The distractor is "upgrade to a larger model"; the right answer is one prompt edit with concrete failure-case examples. ### Q3. A teammate edited the system prompt and shipped to production. A week later, accuracy dropped 6 points. What was missing from their workflow? An eval suite with regression detection. Per ASC-A01 Course 11 §16, every prompt edit must run against a 20-50 case suite to compare scores (e.g. v1 = 7.66, v2 = 8.7) before merging. Distractor: "more A/B testing in production". Right answer: gate prompt PRs on offline eval delta, not on subjective review. ### Q4. Your refund agent's refund_reason field returns fabricated reasons when the contract is silent. Schema-level fix? Make the field nullable AND add an unclear enum option: {type: ['string', 'null'], enum: ['refund', 'replacement', 'unclear', null]}. Per the structured-data-extraction scenario doc, required-string fields force fabrication when the source is silent. The distractor is "add a temperature: 0 setting"; the actual fix is giving the model an honest exit in the schema. ### Q5. Your prompt iterates 47 times and the eval score keeps oscillating between 78% and 82%. What's the underlying problem? The eval set is too small or not golden. Per the evaluation concept page, scores oscillating without a trend means random variance is dominating signal. Distractor: "use a more capable model". Right answer: grow the suite to 50+ externally-validated cases, then iterate. Without golden data, evals measure consistency, not correctness. ### Q6. Your system prompt uses vague guidance like 'be careful with refunds' and policy violations stay at 5%. Why doesn't the model improve when you add 'be especially careful'? Vague descriptions are floor-level guidance the model ignores under load. Per the system-prompts concept (docid #5ca2ae), encode the policy as a do-don't list with concrete examples (e.g. "DO escalate refunds > $500. DON'T auto-approve any refund without an order_id."). Distractor: "add stronger language like NEVER". Right answer: replace adjectives with concrete rules and worked examples. ### Q7. You add a great new few-shot example and the targeted case finally passes, but two unrelated cases regress. What do you do? Roll back, then iterate. Per the evaluation concept (docid #f86eb3), regression-free is a non-negotiable gate. The distractor is "ship anyway because the targeted case was the priority". Right answer: the new example is teaching Claude something that conflicts with the others; either add a counter-example for the regressing cases or rewrite the new example to be more specific. ### Q8. A new prompt scores 92% on your 50-case suite. You ship. Production failures climb. Why was the eval misleading? The suite was overfit to known cases. Per the evaluation concept page §Q3, score plateau on a fixed suite can mask overfitting; production users hit scenarios you didn't design for. Distractor: "you should have raised the bar to 95%". Right answer: pair offline evals with shadow-mode evaluation on real traffic, and grow the golden suite from production failures. ## FAQ ### Q1. Why XML tags for few-shot examples instead of plain bullets? Per ASC-A01 Course 12 §29, XML tags create unambiguous boundaries Claude can attend to. Plain bullets blur into the surrounding instructions. ### Q2. How many few-shot examples is too many? Past 5-7 examples you hit attention dilution and token cost without meaningful gains.
Curate the set; replace weak examples instead of adding more. ### Q3. Should the eval grader be Claude or code? Use code-based grading for deterministic checks (tool sequence, JSON schema). Use model-based grading only for subjective qualities (tone, completeness). Per the evaluation concept page, code grading is reproducible; model grading is not. ### Q4. Can I skip the eval suite if my use case is internal-only? No. Without an eval, every prompt edit is a coin flip. Even a 10-case suite catches 80% of regressions. Build the suite before you tune the prompt. ### Q5. Where do constraints belong, system prompt or tool descriptions? Hard constraints (refund > $500 escalates) belong in tool code. System prompt restates them as guidance. Per the system-prompts concept, tool descriptions take precedence on routing; system prompt is linguistic guidance. --- **Source:** https://claudearchitectcertification.com/concepts/prompt-engineering-techniques **Vault sources:** ASC-A01 Course 12 §29 providing examples (#70748e); ASC-A01 Course 11 §16 typical eval workflow (#9d53f3); ASC-A01 Course 12 §22 code-based grading (#af5dc2); ACP-T03 §4.4 prompt engineering and anti-fabrication; Scenario: structured-data-extraction (#4fe21b); Concept: evaluation (#f86eb3); Concept: system-prompts (#5ca2ae) **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Context Window Management > Context window management is how you keep long conversations within limits without dropping critical facts. Patterns: case-facts blocks, progressive summarization, retrieval. Full content in SCRUM-21 follow-up. **Domain:** D5 · Context + Reliability (15% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/context-window **Last reviewed:** 2026-05-04 ## Quick stats - **Patterns:** 3 - **Exam domain:** D5 - **Coverage tier:** B - **Trap:** summary loss - **Right answer:** facts block ## Exam-pattern questions ### Q1. Context window resets after each turn? No. Cumulative across all turns. Messages accumulate; budget shrinks. Once you hit 200K total, new requests fail with context_length_exceeded. ### Q2. Streaming reduces context window cost? No. Streaming is a UX feature; it doesn't change token cost. Same total tokens billed, streamed or not. Use streaming for responsiveness, not cost. ### Q3. Windowing at turn 11 of a 12-turn loop. Worth it? Marginal value. You've already paid for turns 1-10 of accumulated context. Optimal: window at turn 5 if the conversation is verbose, or turn 3 if each turn is expensive. Windowing late is reactive; preemptive is better. ### Q4. Subagent receives the parent's full 100KB conversation history. What goes wrong? Context starvation. Subagent wastes 100K of its 200K budget on irrelevant history. Coordinator should extract a 2KB TASK_CONTEXT and pass only that. Subagents do better with less, not more. ### Q5. Increasing max_tokens fixes a context_length_exceeded error? No. 200K is the input budget; max_tokens is the output budget. Increasing output budget doesn't help an input limit. Window the input (summarize old turns into a CASE_FACTS block, drop verbose history). ### Q6. Cache the message history to keep all turns within budget? No. The message list grows every turn; caching requires immutable content. Cache the system prompt and tool definitions (constants); the growing message list is always fresh. ### Q7. 
1M-token window models: caching not needed, just use the bigger window? Bigger windows are not free. Cost scales with input tokens. Filling a 1M window costs 5x a 200K window even if it fits. Use a 200K window thoughtfully (case-facts + summary + recent turns) for most production workloads. ### Q8. Optimal windowing target: at what % of capacity? 50-60% capacity. If your loop generates 20K tokens/turn, window at turn 5 (100K used) to have 100K left for the next 5 turns. Below 50% wastes capacity; above 70% risks running out before the next window cycle. --- **Source:** https://claudearchitectcertification.com/concepts/context-window **Vault sources:** ACP-T03 §5 context; GAI-K04 §9 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Case Facts Block > A case-facts block is an immutable set of transactional data (customer ID, order details, refund amount, policy limits) included at the top of every prompt during a multi-turn conversation. Unlike summaries (which lose precision), case facts are complete and never paraphrased. **Domain:** D5 · Context + Reliability (15% of CCA-F exam) **Canonical:** https://claudearchitectcertification.com/concepts/case-facts-block **Last reviewed:** 2026-05-04 ## Quick stats - **Typical facts:** 5–10 - **Turn threshold:** 20+ - **Position:** top - **Exam domain:** D5 - **Turns supported:** ∞ ## What it is A case-facts block is a structured, immutable set of core transactional data included in every prompt of a conversation. It contains facts that don't change (customer ID, order amount, policy limits), extracted once and reused. Unlike summarization (which compresses and loses detail), facts are preserved in full. Place at the top of every message so Claude always has accurate context without re-reading prior history. Critical in long conversations where context fill forces omission of prior reasoning. ## Exam-pattern questions ### Q1. A support conversation hits turn 30 and the model loses the customer ID. You're using progressive summarization. What's missing? A case-facts block at the top of every message. The summary compressed the customer ID into pronouns. The block survives summarization because it lives outside the summarizable history. ### Q2. Your case-facts block has customer_id, order_id, amount. By turn 20 the agent calls process_refund(amount="about $250"). Why? Either the block was not re-injected per turn, or it was paraphrased into the summary. The block must be immutable and at the top of every message. Verify with a sanity check helper before each messages.create() call. ### Q3. The user provides a corrected customer ID mid-conversation. How do you update the case-facts block? Only the user can correct facts. Listen for explicit corrections ("the customer ID is actually X") and replace the field. Add a new field if facts diverge (customer_id_revised). Never let the agent self-modify the block. ### Q4. You pass the case-facts block to a subagent in the task string. The subagent ignores it and asks for the customer ID. Why? Either the task string buried the block at the bottom (low attention) or the subagent's system prompt didn't reference it. Place the block at the top of the task and explicitly instruct: "use only the values in CASE_FACTS." ### Q5. Your case-facts block grows to 50 fields. Tokens add up. How do you trim? Include only essential transactional data: IDs, amounts, dates, status flags. 
Remove narrative or explanatory prose, those belong in the conversation. Aim for 50-500 tokens. ### Q6. After multi-session storage and retrieval, the case-facts block sometimes loses fields. What's the architecture fix? Store the authoritative facts in your application database, not just in the message list. Rebuild the block from DB on session resume. Backup ensures recovery if the block is corrupted in transit. ### Q7. You're using Anthropic's prompt caching. Should the case-facts block be cached? Yes, ideal candidate. The block is stable across turns (identity-level data) and contains no per-request secrets. Mark with cache_control: {type: "ephemeral"}. Saves ~90% on input cost for that section per turn. ### Q8. The conversation is 60 turns and you're hitting context limits. What's the right preservation strategy? Progressive summarization with the case-facts preservation helper. Extract the block, summarize middle turns into 3 sentences (excluding the block from summary input), rebuild: block + summary + last 3 turns. Block survives intact. ## FAQ ### Q1. What if the facts block is 100+ fields? Prioritize the critical ones. If genuinely that many, the case scope is too broad. Usually 5–10 fields cover 80% of decisions. ### Q2. Should facts include intermediate findings? No. Facts are immutable source data. Findings belong in the conversation. ### Q3. Can the facts block change between turns? No. That defeats the purpose. Identical across all turns. New constraints become NEW fields, never overwrites. --- **Source:** https://claudearchitectcertification.com/concepts/case-facts-block **Vault sources:** ACP-T03 §5; GAI-K04 §9 **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Customer Support Resolution Agent > A production refund/escalation agent built on Claude. The harness reads stop_reason, dispatches tools through a registry, gates risky operations with a PreToolUse hook (deterministic policy enforcement), pins customer state in a case-facts block, and routes blocked calls to a structured escalation queue. Three-domain coverage makes this the single highest-weight scenario on the exam. **Sub-marker:** P3.1 **Domains:** D1 · Agentic Architectures, D2 · Tool Design + Integration, D5 · Context + Reliability **Exam weight:** 60% of CCA-F (D1 + D2 + D5) **Build time:** 22 minutes **Source:** 🟢 Official Anthropic guide scenario **Canonical:** https://claudearchitectcertification.com/scenarios/customer-support-resolution-agent **Last reviewed:** 2026-05-04 ## In plain English Think of this as the AI teammate that sits behind your support inbox. When a customer writes in about a refund, a tech glitch, or an account question, this agent reads the message, looks up who the customer is, decides what they actually need, and either solves it on the spot or hands the case to a human with all the context already prepared. It exists because most support questions follow the same handful of patterns, and answering them in seconds (instead of hours) is the difference between a customer who stays and a customer who churns. Everything below is how that simple idea is wired up safely in production. ## Exam impact 60% of total CCA-F weight rides through this scenario. Domain 1 (Agentic Architectures, 27%) tests the loop + escalation. Domain 2 (Tool Calling, 18%) tests the registry + hook. Domain 5 (Context, 15%) tests case-facts + session state. 
Master this one and you've covered the majority of the exam. ## The problem ### What the customer needs - Resolve the request in one turn without multiple agent transfers. - Get identity verified before any account-modifying action. - See a clear path to a human when the agent can't help. ### Why naive approaches fail - Single-agent chatbots forget the customer ID by turn 8 (no case-facts pinning). - Prompt-only refund-cap policy leaks 3% of refunds above the limit (no deterministic hook). - Sentiment-triggered escalation creates false positives: angry users with valid policy denials get escalated unnecessarily. ### Definition of done - p95 resolution latency < 12 seconds end-to-end - Refund-cap violations = 0 (hook-enforced, not prompt-enforced) - Audit log entry per ticket with case-facts snapshot - CSAT ≥ 4.2/5 across resolved tickets ## Concepts in play - 🟢 **Agentic loops** (`agentic-loops`), Specialist agent main loop - 🟢 **stop_reason** (`stop-reason`), Loop termination control - 🟢 **Tool calling** (`tool-calling`), Tool registry contract - 🟢 **tool_choice** (`tool-choice`), auto for open flows - 🟢 **Hooks** (`hooks`), PreToolUse policy gate - 🟠 **Case-facts block** (`case-facts-block`), Pinned customer state - 🟡 **Escalation** (`escalation`), Structured handoff queue - 🟢 **System prompts** (`system-prompts`), Role + constraints ## Components ### Tool Registry, verify · lookup · process · escalate Holds the 4-5 tools the specialist agent can call. Each tool has a clear description and JSON schema. Tool count stays low to keep routing accurate. **Configuration:** tools: [verify_customer, lookup_order, process_refund, escalate_to_human]. tool_choice: auto. Each description is 4 lines: what it does, when to use, edge cases, ordering with peers. **Concept:** `tool-calling` ### PreToolUse Hook, policy gate · deterministic Sits between Claude's tool_use request and actual tool execution. Enforces refund caps, escalation triggers, and time-of-day limits. Exits 2 (deny) on violation. **Configuration:** Hook fires before process_refund. Reads tool_input.amount, compares to policy.refund_cap. Exit 2 with stderr message routes Claude to retry with adjusted args or escalate. **Concept:** `hooks` ### Case-Facts Block, pinned customer state Pinned at the top of every system-prompt iteration. Holds customer_id, order_id, refund_amount, policy_limit. Survives summarization. Re-read every turn. **Configuration:** system: f"CASE_FACTS: {customer_id} · {order_id} · ${amount} · cap=${cap}". Updated by hooks after state-changing tool calls. **Concept:** `case-facts-block` ### Specialist Agent, the agentic loop Runs the messages.create() loop. Reads stop_reason after every response: end_turn → exit, tool_use → execute + append result + continue, max_tokens → save partial. **Configuration:** while True: resp = client.messages.create(...). if resp.stop_reason "end_turn": break. if resp.stop_reason "tool_use": execute_tools(...). **Concept:** `agentic-loops` ### Escalation Queue, structured handoff Receives blocked calls from PreToolUse hook + low-confidence + sentiment-triggered escalations. Each entry has a structured context block (cus_id, reason, partial_status, recommended_action). **Configuration:** queue.push({customer_id, intent, partial_state, blocked_tool, reason, recommended_action}). Human triages in ~10s vs 5min for transcript review. **Concept:** `escalation` ## Build steps ### 1. 
Define the system prompt with case-facts Anchor the agent's role + constraints + the case-facts block at the very top of the system prompt. The case-facts block is the immutable truth about this customer + order + policy. **Python:** ```python from anthropic import Anthropic client = Anthropic() def build_system_prompt(case_facts: dict) -> str: return f"""You are a customer support agent for ACME. CASE_FACTS (immutable; re-read every turn): - customer_id: {case_facts['customer_id']} - order_id: {case_facts['order_id']} - refund_amount: ${case_facts['amount']} - policy_cap: ${case_facts['cap']} Constraints: - Verify customer before ANY account-modifying call. - Refunds above policy_cap MUST escalate (a hook enforces this). - Branch on stop_reason. Never on response text.""" ``` **TypeScript:** ```typescript import Anthropic from "@anthropic-ai/sdk"; const client = new Anthropic(); function buildSystemPrompt(caseFacts: { customer_id: string; order_id: string; amount: number; cap: number; }): string { return `You are a customer support agent for ACME. CASE_FACTS (immutable; re-read every turn): - customer_id: ${caseFacts.customer_id} - order_id: ${caseFacts.order_id} - refund_amount: \${caseFacts.amount} - policy_cap: \${caseFacts.cap} Constraints: - Verify customer before ANY account-modifying call. - Refunds above policy_cap MUST escalate (a hook enforces this). - Branch on stop_reason. Never on response text.`; } ``` Concept: `case-facts-block` ### 2. Define the 4-tool registry Keep the tool count between 4-5. Each tool description is structured in 4 lines: what / when / edge cases / ordering. This is the primary lever for correct routing, fix descriptions, not the model. **Python:** ```python tools = [ { "name": "verify_customer", "description": ( "Look up a customer by customer_id and confirm they are active.\n" "Use this BEFORE any other tool that mentions the customer.\n" "Edge cases: returns 'not_found' if customer_id is missing.\n" "Always run before lookup_order or process_refund." ), "input_schema": {"type": "object", "properties": { "customer_id": {"type": "string"} }, "required": ["customer_id"]}, }, # ... lookup_order, process_refund, escalate_to_human ] ``` **TypeScript:** ```typescript const tools: Anthropic.Tool[] = [ { name: "verify_customer", description: `Look up a customer by customer_id and confirm they are active. Use this BEFORE any other tool that mentions the customer. Edge cases: returns 'not_found' if customer_id is missing. Always run before lookup_order or process_refund.`, input_schema: { type: "object", properties: { customer_id: { type: "string" } }, required: ["customer_id"], }, }, // ... lookup_order, process_refund, escalate_to_human ]; ``` Concept: `tool-calling` ### 3. Wire the PreToolUse policy hook The hook is deterministic, prompt-only enforcement leaks 3% of cases past the cap. Exit 2 to deny; exit 0 to allow; the SDK reads stderr to route Claude back with feedback. 
**Python:** ```python # .claude/hooks/refund_policy.py import sys, json, os POLICY_CAP = float(os.environ.get("REFUND_CAP", "500")) def main(): payload = json.loads(sys.stdin.read()) if payload["tool_name"] != "process_refund": sys.exit(0) # not our concern, allow amount = payload["tool_input"].get("amount", 0) if amount > POLICY_CAP: print(f"refund ${amount} exceeds cap ${POLICY_CAP}, escalate", file=sys.stderr) sys.exit(2) # DENY sys.exit(0) # allow if __name__ == "__main__": main() ``` **TypeScript:** ```typescript // .claude/hooks/refund-policy.ts import { readFileSync } from "node:fs"; const POLICY_CAP = parseFloat(process.env.REFUND_CAP ?? "500"); const payload = JSON.parse(readFileSync(0, "utf8")); if (payload.tool_name !== "process_refund") process.exit(0); const amount = payload.tool_input?.amount ?? 0; if (amount > POLICY_CAP) { process.stderr.write(`refund \${amount} exceeds cap \${POLICY_CAP}, escalate\n`); process.exit(2); // DENY } process.exit(0); // allow ``` Concept: `hooks` ### 4. Run the agent loop on stop_reason Branch on the structured field, never the response text. end_turn → exit. tool_use → execute, append, continue. max_tokens → save partial. stop_sequence → custom termination. **Python:** ```python def run_agent_loop(user_msg: str, case_facts: dict, max_iter: int = 15): messages = [{"role": "user", "content": user_msg}] for _ in range(max_iter): resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=4096, system=build_system_prompt(case_facts), tools=tools, messages=messages, ) if resp.stop_reason == "end_turn": return extract_text(resp) if resp.stop_reason == "tool_use": tool_uses = [b for b in resp.content if b.type == "tool_use"] results = [execute_tool(t) for t in tool_uses] messages.append({"role": "assistant", "content": resp.content}) messages.append({"role": "user", "content": results}) continue if resp.stop_reason == "max_tokens": return {"status": "partial", "text": extract_text(resp)} return {"status": "iteration_cap"} ``` **TypeScript:** ```typescript async function runAgentLoop( userMsg: string, caseFacts: { customer_id: string; order_id: string; amount: number; cap: number }, maxIter = 15, ) { const messages: Anthropic.MessageParam[] = [{ role: "user", content: userMsg }]; for (let i = 0; i < maxIter; i++) { const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 4096, system: buildSystemPrompt(caseFacts), tools, messages, }); if (resp.stop_reason === "end_turn") return extractText(resp); if (resp.stop_reason === "tool_use") { const toolUses = resp.content.filter((b) => b.type === "tool_use"); const results = await Promise.all(toolUses.map(executeTool)); messages.push({ role: "assistant", content: resp.content }); messages.push({ role: "user", content: results }); continue; } if (resp.stop_reason === "max_tokens") return { status: "partial", text: extractText(resp) }; } return { status: "iteration_cap" }; } ``` Concept: `stop-reason` ### 5. Add the structured escalation block When the hook denies or the agent reaches stop_reason with low confidence, push a structured block, not the transcript, to the human queue. Triage time drops from 5 minutes to 10 seconds. 
**Python:**

```python
def escalate(case_facts: dict, reason: str, partial: dict) -> dict:
    return {
        "customer_id": case_facts["customer_id"],
        "order_id": case_facts["order_id"],
        "intent": partial.get("intent", "unknown"),
        "partial_status": partial.get("last_action"),
        "blocker": reason,
        "recommended_action": derive_recommendation(reason),
        "evidence": [partial.get("last_tool_result")],
    }
```

**TypeScript:**

```typescript
function escalate(
  caseFacts: { customer_id: string; order_id: string },
  reason: string,
  partial: Record<string, any>,
) {
  return {
    customer_id: caseFacts.customer_id,
    order_id: caseFacts.order_id,
    intent: partial.intent ?? "unknown",
    partial_status: partial.last_action,
    blocker: reason,
    recommended_action: deriveRecommendation(reason),
    evidence: [partial.last_tool_result],
  };
}
```

Concept: `escalation`

### 6. Wire the sentiment + confidence gates

Two final guards on the response: sentiment monitor (orthogonal to policy, distress alone never triggers a refund) + confidence threshold. Either gate can route to escalation.

**Python:**

```python
def post_response_gates(response: str, agent_confidence: float):
    sentiment = sentiment_score(response)
    if sentiment == "distressed" and agent_confidence < 0.7:
        return {"action": "escalate", "reason": "low_confidence_distressed"}
    if agent_confidence < 0.5:
        return {"action": "escalate", "reason": "low_confidence"}
    return {"action": "send"}
```

**TypeScript:**

```typescript
function postResponseGates(response: string, agentConfidence: number) {
  const sentiment = sentimentScore(response);
  if (sentiment === "distressed" && agentConfidence < 0.7) {
    return { action: "escalate", reason: "low_confidence_distressed" };
  }
  if (agentConfidence < 0.5) {
    return { action: "escalate", reason: "low_confidence" };
  }
  return { action: "send" };
}
```

Concept: `escalation`

### 7. Cache the system prompt for cost

The system prompt + tool definitions don't change between turns. Mark them with cache_control: ephemeral and pay roughly 90% less for those bytes on every turn after the first.

**Python:**

```python
resp = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": build_system_prompt(case_facts),
            "cache_control": {"type": "ephemeral"},
        },
    ],
    tools=tools,  # tools also auto-cached when stable
    messages=messages,
)
```

**TypeScript:**

```typescript
const resp = await client.messages.create({
  model: "claude-sonnet-4.5",
  max_tokens: 4096,
  system: [
    {
      type: "text",
      text: buildSystemPrompt(caseFacts),
      cache_control: { type: "ephemeral" },
    },
  ],
  tools, // tools also auto-cached when stable
  messages,
});
```

Concept: `prompt-caching`

### 8. Audit log every resolution

Every closed ticket writes a row: customer_id, agent_path, tool_calls, escalation_reason (if any), elapsed_ms, CSAT. This is your replay tool when production breaks at turn 18.
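Before the audit-log code, two more helpers assumed by the escalation and gating code above, `derive_recommendation` and `sentiment_score`, deserve stubs. A minimal sketch: the reason keys and the keyword heuristic are placeholders, and production would use a real classifier.

```python
def derive_recommendation(reason: str) -> str:
    # Hypothetical reason-to-action mapping; extend per your escalation runbook.
    return {
        "cap_exceeded": "human approves refund or counter-offers store credit",
        "identity_unverified": "verify identity through a secondary channel",
        "low_confidence": "review the structured trace and reply directly",
    }.get(reason, "review escalation block and decide")

def sentiment_score(text: str) -> str:
    # Stand-in for a real sentiment classifier (small model, vendor API, or rules).
    distress_markers = ("unacceptable", "furious", "ridiculous", "right now")
    return "distressed" if any(m in text.lower() for m in distress_markers) else "neutral"
```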
**Python:** ```python def audit_log(case_facts: dict, agent_path: list, elapsed_ms: int, csat: int | None): db.audit.insert({ "ts": datetime.utcnow(), "customer_id": case_facts["customer_id"], "order_id": case_facts["order_id"], "tool_calls": [c["name"] for c in agent_path if c["type"] == "tool_use"], "stop_reasons": [c["stop_reason"] for c in agent_path], "elapsed_ms": elapsed_ms, "csat": csat, }) ``` **TypeScript:** ```typescript async function auditLog( caseFacts: { customer_id: string; order_id: string }, agentPath: Array<{ type: string; name?: string; stop_reason?: string }>, elapsedMs: number, csat: number | null, ) { await db.audit.insert({ ts: new Date(), customer_id: caseFacts.customer_id, order_id: caseFacts.order_id, tool_calls: agentPath.filter((c) => c.type === "tool_use").map((c) => c.name), stop_reasons: agentPath.map((c) => c.stop_reason), elapsed_ms: elapsedMs, csat, }); } ``` Concept: `evaluation` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | tool_choice | "auto" (default) | "any" or {type:"tool",name:"X"} | Customer requests are open-ended, let Claude pick. Forced tools only for mandatory first steps. | | stop_reason handling | branch on field; max_tokens = partial | parse response text for 'done' | Text-shape parsing is the most-tested distractor. The structured field is authoritative. | | Session state | case-facts block + threaded messages | progressive summarization of customer_id | Transactional values (IDs, amounts) must be pinned, never paraphrased. | | Cache TTL | ephemeral on system + tools | no caching | System prompt + tool defs are stable across turns. ~90% cost reduction on those bytes. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-12 · Loop termination | Code checks response.text.includes('done') to decide termination. | Branch on stop_reason 'end_turn'. Text + tool_use can co-exist in one response. | | AP-18 · Refund cap enforcement | System prompt says 'never refund more than $500'. Production sees 3% violations. | PreToolUse hook checks tool_input.amount <= 500. Deterministic gate. | | AP-22 · Escalation triggers | Customer raises voice → agent escalates regardless of policy. | Sentiment is orthogonal. Trigger only on policy exception, ambiguity, or explicit request. | | AP-35 · Customer state retention | By turn 8, agent has summarized cus_42 → 'a customer wanting a refund'. | Pin CASE_FACTS block in system prompt. Re-read every turn. Never paraphrased. | | AP-08 · Identity verification skip | Agent calls lookup_order first; pulls wrong record 12% of the time. | Programmatic prerequisite: verify_customer is called via tool description ordering. Hook can also enforce. 
| ## Implementation checklist - [ ] System prompt anatomy: role · constraints · case-facts · escalation trigger - [ ] Case-facts block pinned + re-read every turn (`case-facts-block`) - [ ] 4-tool registry with structured 4-line descriptions (`tool-calling`) - [ ] PreToolUse hook for refund cap (deterministic) (`hooks`) - [ ] Loop branches on stop_reason, never on text (`stop-reason`) - [ ] Identity verification prerequisite enforced via tool ordering - [ ] Structured escalation block (not transcript) on handoff (`escalation`) - [ ] Sentiment + confidence post-response gates - [ ] Prompt caching on system + tools (cache_control: ephemeral) (`prompt-caching`) - [ ] Audit log written per closed ticket - [ ] Conversation history bounded by case-facts windowing - [ ] Iteration cap (max_iter=15) as a safety net, not the primary control ## Cost & latency - **Per-conversation tokens:** ~3,200 input · 1,400 output, 8 avg turns × (system + tools + accumulating history). Cache hits ~70% on system+tools. - **Per-conversation cost:** ~$0.018 (Sonnet 4.5), Pre-cache: ~$0.04. With ephemeral cache on system+tools: ~$0.018. ~55% reduction. - **p95 latency:** 8.2 seconds, Streaming first token in ~150ms. Tool round-trips 1.5-2s each. 4 tool calls × 2s + 800ms compose. - **Cache hit rate:** ≥ 70% on system+tools, 5-min TTL on ephemeral. Continuous traffic keeps cache warm. ## Domain weights - **D1 · Agentic Architectures (27%):** Specialist Agent + Loop + Escalation - **D2 · Tool Design + Integration (18%):** Tool Registry + PreToolUse Hook + tool_choice - **D5 · Context + Reliability (15%):** Case-Facts Block + Session State + Prompt Caching ## Practice questions ### Q1. Your refund agent uses prompt-only enforcement: 'never refund over $500'. Production logs show 3% of refunds violate the policy. What's the architectural fix? Replace prompt-only enforcement with a PreToolUse hook that validates tool_input.amount <= 500. The hook exits 2 (deny) on violation, providing deterministic policy enforcement. Prompt-only is probabilistic and leaks ~3-5% in production. Tagged to AP-18 in the anti-pattern catalog. ### Q2. Your agent loop terminates after 7 turns by checking response.text.includes('done'). The customer says they're stuck. What's wrong? Text-parse termination is unreliable. Claude can return [text, tool_use] in the same response, where text is preamble and tool_use is the real next step. Branch on stop_reason "end_turn". The text "I'm done" can appear while stop_reason is still tool_use. Tagged to AP-12. ### Q3. By turn 8, the agent has lost the customer's order ID. What's the architectural fix? Pin a CASE_FACTS block at the top of the system prompt with customer_id, order_id, amount, policy_cap. Re-read every turn. Transactional values (IDs, amounts) must never be summarized, only reasoning chains can be paraphrased. Tagged to AP-35. ### Q4. An angry customer asks for a refund that exceeds the policy. Your agent escalates. Why is this wrong? Sentiment is orthogonal to policy. Distress alone is not an escalation trigger. The hook should evaluate the policy violation independently. If amount > cap, hook denies (escalation by policy, not sentiment). If amount ≤ cap, agent processes regardless of customer mood. Tagged to AP-22. ### Q5. You add a 6th tool to the registry and the agent's tool-selection accuracy drops 8%. What's happening? Tool count past 4-5 degrades routing. Each new tool adds ambiguity; descriptions overlap; the model alternates. 
Either (a) consolidate tools (merge lookup_order_details + lookup_order_status → lookup_order), or (b) move rare-use tools to a sub-agent. The Anthropic guide caps the optimum at ~5 tools per agent. ## FAQ ### Q1. Why a separate hook for refund cap instead of putting it in the system prompt? Prompt-only enforcement is probabilistic. Claude follows the rule ~95-97% of the time, leaving 3-5% leakage. Hooks are deterministic, they read structured tool_input fields and exit 2 to deny. For policy-bearing limits (refunds, escalation thresholds), determinism is required. Use prompts for tone and behavior; use hooks for hard policy. ### Q2. How many tools should the registry have? 4-5 is the optimum per Anthropic's customer-support guide. Fewer means the agent has to compose multiple low-level calls into one task. More degrades selection accuracy, overlapping descriptions cause the model to alternate. If you need >5, split into specialist agents (e.g., refund agent + tech agent + account agent) and route between them with a triage classifier. ### Q3. Should the system prompt include few-shot examples of past conversations? Sparingly. 1-2 high-quality examples can lock tone and tool-use pattern. More than 3 starts crowding the cache and dilutes attention. Better leverage: pin a clear tool registry with detailed descriptions + a sharp CASE_FACTS block. Examples are for edge-case behavior; descriptions are for routing. ### Q4. What's the difference between sentiment escalation and policy escalation? Policy escalation: the agent hits a structural condition that requires a human (refund > cap, identity unverifiable, ambiguous request). Triggered by hooks or explicit conditions. Sentiment escalation: the customer shows distress. Sentiment is *orthogonal*, distress alone never warrants escalation. Combine them only as a tie-breaker (low confidence + distress = escalate). ### Q5. How do I handle a customer who switches topics mid-conversation? Re-route through triage. If the new intent maps to a different specialist (e.g., refund → tech), don't try to handle it inline. Push the original case-facts to the new specialist's task string + spawn (or context-switch) a new agent. Trying to handle multi-intent in one specialist agent erodes accuracy and pollutes the case-facts block. ### Q6. What's a good escalation queue SLA? 5-10 minutes for customer-blocking flows; 2-4 hours for batch flows (overnight refund reconciliation). Mark each escalation with intent + urgency from the triage stage; route customer-blocking ones to the live queue, batch ones to a daily review. The structured block format is the same; only the SLA differs. ### Q7. Should I cache the message history across turns? No. The message list grows monotonically, caching it has marginal value (each turn changes the cache key). Cache the system prompt + tool definitions instead, those are stable across turns and account for 60-80% of token cost on long conversations. ~5-min TTL on ephemeral cache is sufficient for live chat traffic. ### Q8. When should I use a sub-agent instead of expanding this one? When (a) the new flow is parallelizable (e.g., research a customer's order history while another agent handles billing), (b) the new flow needs different tool scope (read-only research vs write-capable refund), or (c) the new flow generates verbose intermediate work that pollutes the main case-facts block. Use sub-agents for isolation; use this agent for inline reasoning. ### Q9. How do I prevent infinite loops? 
stop_reason is the primary control, branch on it, never on text. Iteration cap (max_iter=15) is a safety net, not the primary control. If you hit the cap regularly, the bug is upstream: missing tool_result append, ambiguous tool descriptions, or two tools alternating. Raising the cap masks the real issue. ### Q10. What should the audit log capture? Per closed ticket: customer_id, order_id, full tool_call sequence (just names + timestamps), stop_reason per turn, elapsed_ms, escalation_reason (if any), csat (if surveyed). Skip the full transcript, the structured trace is enough to replay any failure. Store for 90 days minimum. ## Production readiness - [ ] Unit tests on every tool's input validation - [ ] Integration test: end-to-end refund flow against test CRM - [ ] Hook test: fire mock tool_input with amount > cap, verify exit 2 - [ ] Sentiment classifier evaluated against ≥ 200 labeled tickets - [ ] Latency monitor: alert if p95 > 12s for ≥ 5 min - [ ] Cost monitor: alert if per-conversation cost > $0.03 - [ ] Escalation queue dashboard with SLA breach alerts - [ ] Runbook: top-5 escalation reasons + recommended human actions --- **Source:** https://claudearchitectcertification.com/scenarios/customer-support-resolution-agent **Vault sources:** ACP-T05 §Scenario 1 (5 ✅/❌ pairs); ACP-T08 §3.2 metadata; Anthropic customer support agent guide; ACP-T06 (5 practice Qs tagged to components) **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Conversational AI Patterns > A multi-turn conversational pattern that survives context compression. The harness pins a CASE_FACTS block at the top of every system-prompt iteration (immutable, re-read every turn), summarizes turns 2-14 into 3 lines at turn 15, gates clarification through a PreToolUse hook (not the prompt), respects explicit human-handoff requests immediately (not sentiment), and keeps the agent reading stop_reason rather than message text. Confirmed on the real exam by two independent pass-takers as one of the highest-leverage scenarios outside the published guide. **Sub-marker:** P3.10 **Domains:** D5 · Context + Reliability, D4 · Prompt Engineering **Exam weight:** 35% of CCA-F (D5 + D4) **Build time:** 24 minutes **Source:** 🟢 Beyond-guide scenario · empirically witnessed on the real CCA-F exam **Canonical:** https://claudearchitectcertification.com/scenarios/conversational-ai-patterns **Last reviewed:** 2026-05-04 ## In plain English Think of this as the long-running version of the support agent. The one that keeps making sense on turn 15 just like it did on turn 1. When a customer comes back five messages later, this agent still knows who they are, what they decided, and what they explicitly asked you not to do. It exists because most production conversations are not three turns; they are fifteen, and a system that quietly forgets the customer ID by turn eight is worse than no system at all. Everything below is how to make a conversation continue feeling like a conversation, even after the underlying context has been compressed. ## Exam impact Domain 5 (Context Management, 15%) tests the case-facts block + history compression. Domain 4 (Prompt Engineering, 20%) tests the prompt-vs-hook distractor and the sentiment-vs-policy escalation question. This is the single highest-weight scenario outside the published exam guide and is empirically confirmed on the real exam by multiple pass-takers. Drilling it materially raises pass probability. 
## The problem ### What the customer needs - Pick up the conversation at turn 15 and still see the customer ID, decision, and contact preference from turn 2. - Be honored immediately when they say 'I want to speak to a human'. No negotiation, no 'let me try first'. - Not be re-asked the same clarifying question three turns after they already answered it. ### Why naive approaches fail - Single-block message history hits lost-in-the-middle by turn 9; the agent loses the order ID and re-asks. - Prompt-only clarification language ('don't repeat questions') leaks 8% of cases; the agent re-asks anyway. - Sentiment-triggered escalation creates 50% false positives: angry-but-valid customers get escalated unnecessarily. ### Definition of done - Turn-15 retrieval of customer_id + prior decision = 100% (case-facts pinned, never summarized) - Repeat-clarification rate < 1% (programmatic prerequisite block, not prompt language) - False-escalation rate < 5% (policy-gap + explicit-request triggers only, sentiment ignored) - p95 turn latency < 5s including hook overhead and history compression ## Concepts in play - 🟢 **Agentic loops** (`agentic-loops`), Conversational specialist loop - 🟢 **Session state** (`session-state`), Decision flags + tool-call history - 🟠 **Case-facts block** (`case-facts-block`), Immutable customer state. The load-bearing pattern - 🟢 **Context window** (`context-window`), History compression at turn 15 - 🟢 **Hooks** (`hooks`), PreToolUse clarification gate - 🟡 **Escalation** (`escalation`), Honor explicit human-handoff requests - 🟢 **System prompts** (`system-prompts`), Anchor for case-facts at top - 🟢 **Prompt caching** (`prompt-caching`), Stable system + tools across turns ## Components ### Case-Facts Block, immutable customer state, top of prompt Pinned at the very top of every system-prompt iteration. Holds customer_id, decision_made, contact_preference, escalation_requested, policy_cap. Survives history compression and is re-read every turn. That is the entire point. **Configuration:** system: f"CASE_FACTS:\n customer={cust_id} · decision={decision} · contact={pref} · escalated={escalated}". Updated by hooks after any state-changing tool call. Never paraphrased; always exact. **Concept:** `case-facts-block` ### Session State Manager, decisions + flags between turns Tracks the structured state that case-facts cannot: which clarification questions have been answered, which tool results are still in play, whether the customer has explicitly asked for a human. Updated post-each-tool-call. Read by the hook before any subsequent tool dispatch. **Configuration:** state: {clarifications_answered: [order_id, refund_or_credit], last_tool_result, escalation_requested: false, contact_preference: 'email'}. Persisted in session store, loaded into prompt as a serialized block. **Concept:** `session-state` ### History Summarizer, turns 2-14 → 3 lines at turn 15 Watches conversation length. When the message list exceeds 15 entries, replaces turns 2-N-1 with a single 3-line summary preserving decisions, not transcripts. Case-facts stays untouched at the prompt top. The summary lives in the message list, not in case-facts. **Configuration:** if len(messages) > 15: summary = compress_to_3_lines(messages[1:-1]); messages = [messages[0], {role: 'user', content: summary}, messages[-1]]. Keeps token count flat while preserving decision continuity. **Concept:** `context-window` ### Clarification Gate Hook, PreToolUse · prerequisite block Sits between Claude's tool_use request and tool execution. 
If a downstream tool needs verified_id and case-facts.verified_id is null, exits 2 with a deterministic message routing Claude to call get_customer first. This is the difference between probabilistic prompt language and 100% prerequisite enforcement.

**Configuration:** Hook fires before process_refund / update_account / escalate_to_human. Reads case_facts.verified_id and conversation flags. Exit 2 with stderr message routes Claude back; no leakage, no exceptions.

**Concept:** `hooks`

### Stop-Reason Loop Control, branch on the field, not the text

Reads stop_reason after every API response. end_turn → exit cleanly. tool_use → execute, append result, continue. max_tokens → save partial state and escalate (never silently truncate). Never branches on response text containing 'done' or 'goodbye'.

**Configuration:** while True: resp = client.messages.create(...); if resp.stop_reason == "end_turn": return; if resp.stop_reason == "tool_use": dispatch + append; if resp.stop_reason == "max_tokens": persist + escalate.

**Concept:** `agentic-loops`

## Build steps

### 1. Define the case-facts anchor block

Pin the immutable customer facts at the very top of the system prompt. These survive compression, are re-read every turn, and are never paraphrased. The block is the single load-bearing pattern of the whole scenario. Get this wrong and turn 15 forgets turn 2.

**Python:**

```python
from anthropic import Anthropic

client = Anthropic()

def build_system_prompt(case_facts: dict) -> str:
    return f"""You are a conversational support agent.

CASE_FACTS (immutable; re-read every turn; never paraphrased):
- customer_id: {case_facts['customer_id']}
- decision_made: {case_facts.get('decision_made', 'none')}
- contact_preference: {case_facts.get('contact_preference', 'unset')}
- escalation_requested: {case_facts.get('escalation_requested', False)}
- policy_cap: ${case_facts.get('cap', 500)}

Constraints:
- Branch on stop_reason. Never on response text.
- If escalation_requested is True: route to human queue, no negotiation.
- If a clarifying question was already answered, do not re-ask (state below)."""
```

**TypeScript:**

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

function buildSystemPrompt(caseFacts: {
  customer_id: string;
  decision_made?: string;
  contact_preference?: string;
  escalation_requested?: boolean;
  cap?: number;
}): string {
  return `You are a conversational support agent.

CASE_FACTS (immutable; re-read every turn; never paraphrased):
- customer_id: ${caseFacts.customer_id}
- decision_made: ${caseFacts.decision_made ?? "none"}
- contact_preference: ${caseFacts.contact_preference ?? "unset"}
- escalation_requested: ${caseFacts.escalation_requested ?? false}
- policy_cap: $${caseFacts.cap ?? 500}

Constraints:
- Branch on stop_reason. Never on response text.
- If escalation_requested is true: route to human queue, no negotiation.
- If a clarifying question was already answered, do not re-ask.`;
}
```

Concept: `case-facts-block`

### 2. Build the session-state structure

Case-facts holds immutable customer state; session-state holds the conversational state: answered clarifications, last tool result, escalation flag. Together they replace the lost-in-the-middle problem with structural retrieval. Loaded into the prompt as a serialized block right after CASE_FACTS.
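The step-2 code below persists state through a `session_store` object that the scenario references but never defines. A minimal in-memory sketch, purely illustrative; the save/load interface keyed by customer_id is an assumption, and production would back it with Redis or a database table.

```python
# Illustrative in-memory session store so the snippets below are self-contained.
class SessionStore:
    def __init__(self):
        self._data: dict[str, dict] = {}

    def save(self, customer_id: str, state: dict) -> None:
        self._data[customer_id] = state

    def load(self, customer_id: str) -> dict:
        return self._data.get(
            customer_id,
            {"clarifications_answered": [], "escalation_requested": False},
        )

session_store = SessionStore()
```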
**Python:** ```python def build_session_block(state: dict) -> str: return f"""SESSION_STATE (updated post-each-turn): - clarifications_answered: {state.get('clarifications_answered', [])} - last_tool: {state.get('last_tool')} → {state.get('last_tool_result_summary')} - escalation_requested: {state.get('escalation_requested', False)} - contact_preference: {state.get('contact_preference', 'email')} """ # After every tool call, update + persist state['clarifications_answered'].append('order_id') state['last_tool'] = 'lookup_order' state['last_tool_result_summary'] = 'order_status=delivered' session_store.save(case_facts['customer_id'], state) ``` **TypeScript:** ```typescript function buildSessionBlock(state: { clarifications_answered: string[]; last_tool?: string; last_tool_result_summary?: string; escalation_requested?: boolean; contact_preference?: string; }): string { return `SESSION_STATE (updated post-each-turn): - clarifications_answered: ${JSON.stringify(state.clarifications_answered)} - last_tool: ${state.last_tool} → ${state.last_tool_result_summary} - escalation_requested: ${state.escalation_requested ?? false} - contact_preference: ${state.contact_preference ?? "email"}`; } // After every tool call, update + persist state.clarifications_answered.push("order_id"); state.last_tool = "lookup_order"; state.last_tool_result_summary = "order_status=delivered"; await sessionStore.save(caseFacts.customer_id, state); ``` Concept: `session-state` ### 3. Wire the PreToolUse clarification hook Programmatic prerequisite enforcement. Before any account-modifying tool, the hook checks case-facts + session-state. Missing prerequisites → exit 2 with a structured stderr message; Claude reads it and routes to the prerequisite tool first. Prompt language alone leaks 8%; this hook is 100%. **Python:** ```python # .claude/hooks/clarification_gate.py import sys, json def main(): payload = json.loads(sys.stdin.read()) tool_name = payload["tool_name"] case_facts = payload.get("case_facts", {}) session = payload.get("session_state", {}) # Account-modifying tools require verified identity if tool_name in ("process_refund", "update_account", "escalate_to_human"): if not case_facts.get("verified_id"): print("verified_id missing. Call get_customer first", file=sys.stderr) sys.exit(2) # Honor explicit escalation request. No further tool calls if session.get("escalation_requested") and tool_name != "escalate_to_human": print("user requested human; route to escalation queue, no other tools", file=sys.stderr) sys.exit(2) sys.exit(0) if __name__ == "__main__": main() ``` **TypeScript:** ```typescript // .claude/hooks/clarification-gate.ts import { readFileSync } from "node:fs"; const payload = JSON.parse(readFileSync(0, "utf8")); const toolName: string = payload.tool_name; const caseFacts = payload.case_facts ?? {}; const session = payload.session_state ?? {}; // Account-modifying tools require verified identity if (["process_refund", "update_account", "escalate_to_human"].includes(toolName)) { if (!caseFacts.verified_id) { process.stderr.write("verified_id missing. Call get_customer first\n"); process.exit(2); } } // Honor explicit escalation request. No further tool calls if (session.escalation_requested && toolName !== "escalate_to_human") { process.stderr.write("user requested human; route to escalation queue, no other tools\n"); process.exit(2); } process.exit(0); ``` Concept: `hooks` ### 4. 
Run the loop on stop_reason, not text The single most-tested distractor in this scenario is parsing response text for 'done'. Claude can return text + tool_use in the same message; the structured stop_reason field is the only authoritative termination signal. Branch on it. Always. **Python:** ```python def run_conversation_turn(user_msg: str, case_facts: dict, state: dict, max_iter: int = 12): messages = load_history(case_facts['customer_id']) + [{"role": "user", "content": user_msg}] system = build_system_prompt(case_facts) + "\n\n" + build_session_block(state) for _ in range(max_iter): resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=2048, system=system, tools=tools, messages=messages, ) if resp.stop_reason == "end_turn": persist_history(case_facts['customer_id'], messages, resp) return extract_text(resp) if resp.stop_reason == "tool_use": tool_uses = [b for b in resp.content if b.type == "tool_use"] results = [execute_tool(t, case_facts, state) for t in tool_uses] messages.append({"role": "assistant", "content": resp.content}) messages.append({"role": "user", "content": results}) update_state(state, tool_uses, results) continue if resp.stop_reason == "max_tokens": persist_partial(case_facts, state, resp) return {"status": "partial_escalate", "text": extract_text(resp)} return {"status": "iteration_cap"} ``` **TypeScript:** ```typescript async function runConversationTurn( userMsg: string, caseFacts: Record, state: Record, maxIter = 12, ) { const messages: Anthropic.MessageParam[] = [ ...(await loadHistory(caseFacts.customer_id as string)), { role: "user", content: userMsg }, ]; const system = buildSystemPrompt(caseFacts as never) + "\n\n" + buildSessionBlock(state as never); for (let i = 0; i < maxIter; i++) { const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 2048, system, tools, messages, }); if (resp.stop_reason === "end_turn") { await persistHistory(caseFacts.customer_id as string, messages, resp); return extractText(resp); } if (resp.stop_reason === "tool_use") { const toolUses = resp.content.filter((b) => b.type === "tool_use"); const results = await Promise.all(toolUses.map((t) => executeTool(t, caseFacts, state))); messages.push({ role: "assistant", content: resp.content }); messages.push({ role: "user", content: results }); updateState(state, toolUses, results); continue; } if (resp.stop_reason === "max_tokens") { await persistPartial(caseFacts, state, resp); return { status: "partial_escalate", text: extractText(resp) }; } } return { status: "iteration_cap" }; } ``` Concept: `agentic-loops` ### 5. Compress conversation history at turn 15 When the message list exceeds 15 entries, replace turns 2 through N-1 with a single summary that preserves decisions, not transcripts. Case-facts stays at the prompt top, untouched. The summary lives in the message list. This frees ~40% of tokens with zero decision loss. **Python:** ```python def compress_history(messages: list, threshold: int = 15) -> list: if len(messages) <= threshold: return messages # Preserve the original user message + the most recent exchange. first = messages[0] last = messages[-1] middle = messages[1:-1] # Summarize: extract decision points only (not full transcript) decisions = extract_decisions(middle) # e.g. 
["user_asked_for_refund", "agent_verified_customer", "policy_allowed_$50"] summary = "CONVERSATION SUMMARY (turns 2-" + str(len(messages) - 1) + "):\n" + "\n".join(f"- {d}" for d in decisions) return [first, {"role": "user", "content": summary}, last] # Usage in the turn loop messages = compress_history(messages, threshold=15) ``` **TypeScript:** ```typescript function compressHistory( messages: Anthropic.MessageParam[], threshold = 15, ): Anthropic.MessageParam[] { if (messages.length <= threshold) return messages; const first = messages[0]; const last = messages[messages.length - 1]; const middle = messages.slice(1, -1); // Summarize: extract decision points only (not full transcript) const decisions = extractDecisions(middle); // e.g. ["user_asked_for_refund", "agent_verified", "policy_allowed_$50"] const summary = `CONVERSATION SUMMARY (turns 2-${messages.length - 1}):\n` + decisions.map((d) => `- ${d}`).join("\n"); return [first, { role: "user", content: summary }, last]; } // Usage in the turn loop messages = compressHistory(messages, 15); ``` Concept: `context-window` ### 6. Honor explicit human-handoff requests immediately When the customer says 'speak to a human', the agent does not negotiate. The hook flips session_state.escalation_requested → true. The next tool dispatch is escalate_to_human; everything else is blocked. Sentiment is orthogonal. Angry customers with valid requests still get the answer first. **Python:** ```python # Detection lives in the agent's prompt; latching lives in state EXPLICIT_HUMAN_PHRASES = [ "speak to a human", "talk to a person", "i want a human", "transfer me", "give me a human", ] def detect_explicit_handoff(user_msg: str) -> bool: msg = user_msg.lower() return any(phrase in msg for phrase in EXPLICIT_HUMAN_PHRASES) def handle_user_turn(user_msg: str, case_facts: dict, state: dict): if detect_explicit_handoff(user_msg): state['escalation_requested'] = True # The next agent loop will see this in session_state and the hook # will block any tool except escalate_to_human. return run_conversation_turn(user_msg, case_facts, state) # Sentiment is intentionally NOT consulted here. ``` **TypeScript:** ```typescript // Detection lives in the agent's prompt; latching lives in state const EXPLICIT_HUMAN_PHRASES = [ "speak to a human", "talk to a person", "i want a human", "transfer me", "give me a human", ]; function detectExplicitHandoff(userMsg: string): boolean { const msg = userMsg.toLowerCase(); return EXPLICIT_HUMAN_PHRASES.some((p) => msg.includes(p)); } async function handleUserTurn( userMsg: string, caseFacts: Record, state: Record, ) { if (detectExplicitHandoff(userMsg)) { state.escalation_requested = true; // The next agent loop sees this in session_state and the hook // blocks any tool except escalate_to_human. } return runConversationTurn(userMsg, caseFacts, state); } // Sentiment is intentionally NOT consulted here. ``` Concept: `escalation` ### 7. Cache the system prompt + tools System prompt + tool definitions are stable across turns; only case-facts and session-state change. Mark the stable parts with cache_control: ephemeral and pay ~90% less for those bytes on every turn after the first. With 5-min TTL on continuous traffic, hit rate stays above 70%. 
**Python:** ```python # Split system into stable (cached) + dynamic (fresh) blocks resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=2048, system=[ { "type": "text", "text": STABLE_SYSTEM_PREAMBLE, # role + constraints, never changes "cache_control": {"type": "ephemeral"}, }, { "type": "text", "text": build_case_facts_block(case_facts) + build_session_block(state), # changes per turn }, ], tools=tools, # tools array also auto-cached when stable messages=messages, ) # Inspect resp.usage.cache_creation_input_tokens / cache_read_input_tokens to verify hit rate. ``` **TypeScript:** ```typescript // Split system into stable (cached) + dynamic (fresh) blocks const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 2048, system: [ { type: "text", text: STABLE_SYSTEM_PREAMBLE, // role + constraints, never changes cache_control: { type: "ephemeral" }, }, { type: "text", text: buildCaseFactsBlock(caseFacts) + buildSessionBlock(state), // changes per turn }, ], tools, // tools array also auto-cached when stable messages, }); // Inspect resp.usage.cache_creation_input_tokens / cache_read_input_tokens for hit rate. ``` Concept: `prompt-caching` ### 8. Audit-log the conversation arc Every closed conversation writes a structured row: customer_id, turn_count, tool_calls_in_order, escalation_reason (if any), elapsed_ms_total, csat. Skip the full transcript. The structured trace is enough to replay any failure and is 50× smaller. Store for 90 days minimum. **Python:** ```python def audit_conversation(case_facts: dict, state: dict, agent_path: list, elapsed_ms: int, csat: int | None): db.audit.insert({ "ts": datetime.utcnow(), "customer_id": case_facts["customer_id"], "turn_count": len(agent_path), "tool_calls": [c["name"] for c in agent_path if c.get("type") == "tool_use"], "stop_reasons": [c.get("stop_reason") for c in agent_path], "escalation_reason": state.get("escalation_reason"), "compression_fired_at_turn": state.get("compression_at_turn"), "elapsed_ms": elapsed_ms, "csat": csat, }) ``` **TypeScript:** ```typescript async function auditConversation( caseFacts: { customer_id: string }, state: Record, agentPath: Array<{ type?: string; name?: string; stop_reason?: string }>, elapsedMs: number, csat: number | null, ) { await db.audit.insert({ ts: new Date(), customer_id: caseFacts.customer_id, turn_count: agentPath.length, tool_calls: agentPath.filter((c) => c.type === "tool_use").map((c) => c.name), stop_reasons: agentPath.map((c) => c.stop_reason), escalation_reason: state.escalation_reason, compression_fired_at_turn: state.compression_at_turn, elapsed_ms: elapsedMs, csat, }); } ``` Concept: `evaluation` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Multi-turn customer state | case-facts block at top of prompt + session-state block | progressive summarization of customer_id + amount | Transactional values must be pinned, never paraphrased. Summarization erodes precision; case-facts is structural. | | Customer says 'speak to a human' | set escalation_requested → block all tools except escalate_to_human | negotiate ('let me try once more') or suggest alternatives | Explicit user requests are non-negotiable. Cost of overriding stated preference (churn, complaint escalation) exceeds any benefit of one-more-attempt. 
| | Long conversation context fills | compress turns 2 through N-1 into 3 lines at turn 15; case-facts stays untouched | keep all messages OR summarize the case-facts block | Conversation history has diminishing returns; case-facts are structural. Compress history, never facts. | | Angry customer with valid request | process the request normally; sentiment does not trigger escalation | escalate on negative sentiment | Sentiment is orthogonal to escalation need. Angry-but-valid customers should get the answer; only policy gaps + tool limits + explicit requests warrant escalation. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-35 · Context loss after compression | By turn 9, agent has summarized cust_4711's order ID to 'a recent order'. Treats turn 10 as a new conversation. | Pin CASE_FACTS at top of system prompt. Re-read every turn. Never paraphrased. Compression only touches the message list, never the case-facts block. | | AP-02 · Prompt-only clarification gate | System prompt says 'do not re-ask answered questions'. Agent re-asks 'which order?' on turn 4 and turn 8. 8% leakage. | Track answered clarifications in session_state. PreToolUse hook checks state and blocks downstream tools if a prerequisite clarification is unanswered. Deterministic, not probabilistic. | | AP-03 · Conversation history inflation | 50 turns fill the context window. Lost-in-the-middle effect drops the order ID. Agent makes contradictory recommendations. | At turn 15, summarize turns 2 through N-1 into 3 lines preserving decisions only. Case-facts stays at prompt top. Frees ~40% tokens with zero decision loss. | | AP-04 · No session-state tracking | User said 'I do not want to be contacted by phone' on turn 3. Agent suggests phone callback on turn 9. | Persist session_state with contact_preference + decision flags. Read into prompt every turn alongside case-facts. | | AP-22 · Sentiment-based escalation | Angry customer with a valid refund request is escalated because tone is negative. 50% false-positive rate, customers learn that anger = faster service. | Escalation triggers only on (a) policy gap, (b) tool limit, (c) explicit user request. Sentiment is logged for reporting but never gates escalation. | ## Implementation checklist - [ ] Case-facts block pinned at top of every system-prompt iteration (`case-facts-block`) - [ ] Session-state block serialized into prompt after case-facts (`session-state`) - [ ] PreToolUse clarification gate hook (deterministic prerequisite block) (`hooks`) - [ ] Loop branches on stop_reason, never on response text (`agentic-loops`) - [ ] History summarizer fires at turn 15; case-facts left untouched (`context-window`) - [ ] Explicit human-handoff phrase detection latches escalation_requested = true (`escalation`) - [ ] Sentiment is logged but never gates escalation - [ ] Stable system preamble cached with cache_control: ephemeral (`prompt-caching`) - [ ] Tool definitions cached (unchanged across turns) (`tool-calling`) - [ ] Audit log per closed conversation (structured, not transcript) - [ ] Iteration cap (max_iter=12) as a safety net, not the primary control ## Cost & latency - **Per-conversation tokens:** ~4,800 input · 1,800 output (avg 12 turns), 12 avg turns × (cached system + tools + dynamic case-facts/session-state + accumulating history). Cache hits ~70% on stable preamble + tools. - **Per-conversation cost:** ~$0.022 (Sonnet 4.5), Pre-cache: ~$0.05. With ephemeral cache on stable system + tools: ~$0.022. ~55% reduction on long conversations. 
- **p95 turn latency:** ~4.6 seconds, Streaming first token in ~150ms. Tool round-trips 1.5-2s each. Average 1.5 tool calls per turn + hook (<100ms) + compose. - **History compression saving:** ~40% token reduction post-turn-15, 12 verbose message blocks (≈8K tokens) → 3-line summary (≈250 tokens). Frees the long-tail conversations from lost-in-the-middle and OOM-on-history. - **Cache hit rate:** ≥ 70% on stable system + tools, 5-min ephemeral TTL on stable preamble + tool definitions. Continuous chat traffic keeps cache warm. Per-turn case-facts/session-state stays fresh, as it should. ## Domain weights - **D5 · Context + Reliability (15%):** Case-Facts Block + Session State + History Summarizer - **D4 · Prompt Engineering (20%):** System Prompt anatomy + Clarification Gate Hook + Stop-Reason Loop ## Practice questions ### Q1. By turn 8 of a long conversation, your agent has lost the customer's order ID and refund amount. The agent treats turn 9 as if it were turn 1. What is the architectural fix? Pin a CASE_FACTS block at the very top of every system-prompt iteration. The block holds customer_id, order_id, refund_amount, policy_cap, decision_made, contact_preference, escalation_requested. It is immutable and re-read every turn. Transactional values (IDs, amounts) must never be summarized; only conversation reasoning chains can be paraphrased. When the message list grows past 15, compress turns 2 through N-1 into a 3-line summary. The case-facts block stays untouched at the prompt top. Tagged to AP-35. ### Q2. Your agent asks 'which order?' on turn 4 and again on turn 8, even though the customer specified the order ID on turn 2. The system prompt says 'do not re-ask answered questions'. What's leaking? Prompt-only clarification gates leak ~8% in production because the model is probabilistic about following instructions. The fix is structural: track answered clarifications in session-state (clarifications_answered: ['order_id', 'refund_or_credit']). A PreToolUse hook reads session-state before any downstream tool dispatch and exits 2 if a prerequisite clarification is unanswered. The hook is deterministic; the prompt is probabilistic. For business-critical guarantees, structural beats linguistic. Tagged to AP-02. ### Q3. A 15-turn conversation has filled 60% of the context window. The agent begins making contradictory recommendations (suggests escalation on turn 12, then re-engages with solving on turn 14). What approach minimizes token waste while preserving conversation quality? Separate case-facts (immutable, pinned to prompt top) from conversation history (the message list). At turn 15, replace turns 2 through N-1 in the message list with a single 3-4 line summary that preserves decisions only: 'user requested refund · agent verified customer · policy allows full refund · customer chose refund over credit'. Discard the verbose back-and-forth. Case-facts is untouched. It was never going to be summarized. This frees ~40% of tokens with zero decision loss. ### Q4. On turn 5, the customer says 'I want to speak to a human'. Your agent says 'Let me see if I can solve this for you first' and continues the conversation. Why is this wrong, and how do you fix it architecturally? Explicit customer requests are non-negotiable. The cost of overriding a stated preference (churn risk, support complaint, regulatory friction) exceeds any benefit of 'one more attempt'. 
Architecturally: detect explicit human-handoff phrases on the user turn, latch state.escalation_requested = true, and configure the PreToolUse hook to block all tools except escalate_to_human when that flag is set. The agent's only legal next move is the structured handoff to the human queue. No negotiation, no alternatives. ### Q5. An angry customer is requesting a refund that the policy clearly allows. Your agent escalates because tone analysis flagged the message as 'distressed'. What's the failure mode, and what should the escalation criteria actually be? Sentiment is orthogonal to policy. Distress alone is not an escalation trigger; otherwise angry-but-valid customers learn that anger = faster service (perverse incentive), and you generate ~50% false-positive escalations. Correct escalation criteria are structural: (1) the agent lacks a tool to solve the request, (2) policy explicitly blocks the request, (3) the customer explicitly asks for a human, or (4) confidence falls below a threshold on a high-stakes decision. Tone is logged for analytics but never gates routing. Tagged to AP-22. ## FAQ ### Q1. Why pin case-facts in the system prompt instead of passing them as a tool result? System prompt content is re-read and weighted highest by the model. Tool results live in the message list, which can be compressed, summarized, or fall victim to lost-in-the-middle as the conversation grows. Case-facts must survive every single turn unchanged, so it lives at the structural top of the prompt. The only place that's immune to summarization and attention drift. ### Q2. What's the right threshold for triggering history compression? ~15 messages is a defensible default. Below that, full history fits comfortably and the cost of compression isn't justified. Above it, lost-in-the-middle starts degrading recall on details from turns 4-8. Tune per workload: customer-support averages 8-12 turns and rarely needs compression; technical-support can run 30+ and benefits from compressing earlier (turn 10). ### Q3. If case-facts is immutable, how do I update it when the customer changes their decision? Case-facts is immutable per-iteration, not immutable forever. After every state-changing tool call (e.g., customer switches from refund to store credit), the harness re-builds case-facts with the updated values and pins the new version on the next turn. Within one turn, it's fixed; between turns, it's the deliberate, hook-controlled write path. ### Q4. Why a PreToolUse hook for clarification, instead of putting the rule in the prompt? Prompt-only enforcement is probabilistic (~92% in this scenario). Hooks are deterministic. They read structured state, exit 2, and route Claude back. For business-bearing guarantees (don't re-ask answered questions, don't process unverified accounts), the 8% leak from prompt-only is unacceptable. Use prompts for tone and persona; use hooks for hard guarantees. ### Q5. Should I cache the case-facts block? No. Case-facts changes whenever the customer makes a decision or a tool updates state. Caching it kills hit rate. Split the system prompt into a stable preamble (role + constraints, cached with cache_control: ephemeral) and a dynamic block (case-facts + session-state, fresh every turn). You get ~70% hit rate on the cached portion and zero staleness on the dynamic portion. ### Q6. What if the customer switches topics mid-conversation (refund → tech support)? Re-route through triage. 
Push current case-facts (customer_id + decisions made so far) to the new specialist's task string, and either context-switch this agent or spawn a sub-agent for the new intent. Trying to handle multi-intent in a single specialist erodes accuracy and pollutes the case-facts block. The original intent's decisions get tangled with the new intent's evidence. ### Q7. How do I test that conversation continuity is actually working? Two-step regression: (1) On turn 1, customer says 'My order is #123, I want a refund.' Verify case-facts is updated. (2) On turn 8, after compression has fired and verbose history is summarized, ask 'Which order was that again?'. The agent must NOT re-ask; it must read case-facts and answer immediately. If it re-asks, your case-facts block isn't being re-read every turn. That's the bug to chase. ### Q8. What goes in the audit log for a long conversation? Per closed conversation: customer_id, turn_count, ordered list of tool_calls (just names + timestamps), per-turn stop_reasons, compression_fired_at_turn (if any), escalation_reason (if any), elapsed_ms_total, csat (if surveyed). Skip the full transcript. The structured trace is 50× smaller and replays any failure path. Retain 90 days minimum for production debugging and exam-style retrospectives. ## Production readiness - [ ] Unit tests for case-facts block construction (immutable, ordered, exact-string) - [ ] Integration test: 20-turn conversation with case-facts assertion at every turn - [ ] Hook test: missing verified_id → exit 2; escalation_requested + non-escalate tool → exit 2 - [ ] Compression test: at turn 15, message list shrinks; case-facts unchanged; decisions preserved - [ ] Explicit-handoff test: 'I want a human' → escalation_requested latches true → hook blocks other tools - [ ] Latency monitor: alert if p95 turn-latency > 6s for ≥ 5 min - [ ] Cost monitor: alert if per-conversation cost > $0.04 (signals cache hit rate dropped) - [ ] False-escalation monitor: alert if sentiment-only escalations exceed 1% of total --- **Source:** https://claudearchitectcertification.com/scenarios/conversational-ai-patterns **Vault sources:** ACP-T05 §Scenario 10 (🟢 confirmed beyond-guide; u/Gracious_Tesla 746 + u/ZealousidealFill6044); ACP-T06 (5 practice Qs, prompt-vs-hook + sentiment-vs-policy distractors); ACP-T08 §3.10 (route + metadata); Anthropic context management + prompt caching guides; GAI-K05 CCA exam questions and scenarios **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Code Generation with Claude Code > Claude Code as a team-aware code-generation agent. The repo carries a CLAUDE.md hierarchy (project + personal + system) committed to version control, area-specific rules in .claude/rules/ keyed by file glob, Plan Mode as a mandatory exploration gate before any complex refactor, Skills to isolate domains (React vs API vs DB) and prevent context pollution, and a code-reviewer Subagent with scoped tools for two-pass PR review (per-file, then integration). The single most-tested distractor: jumping to code before exploring; the second: monolithic context on multi-file PRs. 
**Sub-marker:** P3.2 **Domains:** D3 · Agent Operations, D2 · Tool Design + Integration **Exam weight:** 38% of CCA-F (D3 + D2) **Build time:** 24 minutes **Source:** 🟢 Official Anthropic guide scenario · in published exam guide and practice exam **Canonical:** https://claudearchitectcertification.com/scenarios/code-generation-with-claude-code **Last reviewed:** 2026-05-04 ## In plain English Think of this as the way a whole team uses Claude Code on the same codebase without each developer getting a different answer. The team writes its conventions down once. Stack, code style, the commands you actually use. And Claude reads them on every session, like a runbook. Before any complex change, the developer asks Claude to plan first (read the code, sketch the approach, get approval), and only then to edit files. The result: code that looks like one team wrote it, refactors that don't break imports, and pull-request reviews that don't lose track of what changed by the fourteenth file. ## Exam impact Domain 3 (Claude Code Configuration, 20%) tests CLAUDE.md hierarchy + Plan Mode + Skills. Domain 2 (Tool Design, 18%) tests subagent scoping + tool registry hygiene. This scenario is in the published exam guide AND in the practice exam. The questions you study here match the live exam closely. The CLAUDE.md hierarchy + Plan Mode questions are nearly free points if internalized. ## The problem ### What the customer needs - Generated code that matches the team's conventions automatically. Same file naming, same style, same architecture decisions across every developer's session. - Plan a complex refactor before touching files so reviewers see the design first and rework drops below 10%. - Review a 14-file PR without losing track of file 3's decisions by the time the reviewer reaches file 14. ### Why naive approaches fail - No CLAUDE.md → conventions drift across sessions; snake_case in one file, camelCase in another, even within the same PR. - Skipping Plan Mode → Claude jumps to writing files; 40% of changes get reworked when the design turns out to be wrong. - Single-context PR review → by file 14, lost-in-the-middle drops the conventions established on file 3; inline comments contradict each other. ### Definition of done - PR-level convention drift = 0 (CLAUDE.md + .claude/rules/ enforced) - Plan-Mode usage = 100% of refactors > 5 files (developer norm, not optional) - Two-pass PR review on every PR > 6 files (per-file local, then integration) - Subagent code review on the auto-generated diff before commit ## Concepts in play - 🟢 **CLAUDE.md hierarchy** (`claude-md-hierarchy`), Project + personal + system memory - 🟢 **Plan Mode** (`plan-mode`), Exploration before execution - 🟢 **Skills** (`skills`), Domain-isolated code generation - 🟢 **Subagents** (`subagents`), Code-review isolation - 🟢 **Context window** (`context-window`), Lost-in-the-middle on long PRs - 🟢 **Tool calling** (`tool-calling`), Scoped tool whitelist per Skill - 🟢 **Attention engineering** (`attention-engineering`), Skills frontmatter + .claude/rules globs - 🟢 **Evaluation** (`evaluation`), Subagent reviewer as a quality gate ## Components ### CLAUDE.md Hierarchy, three-level persistent memory Anchors team conventions, personal preferences, and system-wide rules. Project-level CLAUDE.md is committed to the repo; personal CLAUDE.local.md is gitignored; system-level lives in ~/.claude. Claude Code reads them in order on every session. The project file always wins on conflicts. 
**Configuration:** .claude/CLAUDE.md (project, committed): stack + commands + code style. .claude/CLAUDE.local.md (personal, gitignored): individual prefs. ~/.claude/CLAUDE.md (system): cross-project defaults. Conflicts: project > personal > system. **Concept:** `claude-md-hierarchy` ### Plan Mode, exploration before execution Shift+Tab puts Claude into a read-only state where it can explore files, sketch dependencies, and propose a design. But cannot edit anything until the developer approves. This single gate eliminates the most expensive refactor anti-pattern (jump-to-code-without-understanding-deps). **Configuration:** Shift+Tab toggles Plan Mode. Claude responses include a structured plan section. Developer reviews, refines (Adjust the plan to use Drizzle instead of Prisma), then approves. Approval flips to Code Mode and execution begins. **Concept:** `plan-mode` ### Skills Registry, domain-specific code generation One Skill per domain (React components, API routes, DB queries). Each Skill has its own system-prompt additions, allowed-tools whitelist, and conventions. The developer invokes the right Skill per task and gets focused output without cross-domain context pollution. **Configuration:** .claude/skills/react-components/SKILL.md (Tailwind + TS + named exports). .claude/skills/api-routes/SKILL.md (zod + middleware + structured returns). Each has frontmatter + allowed-tools list. Loaded conditionally. **Concept:** `skills` ### .claude/rules/ Globs, rules keyed by file path Splits the monolithic CLAUDE.md into area-scoped rule files that load only when Claude is editing matching files. Prevents the prompt from carrying API conventions when editing React, or vice versa. The attention budget stays focused on what's relevant to the current edit. **Configuration:** .claude/rules/react.md (globs: /*.tsx, /*.jsx). .claude/rules/api.md (globs: app/api/). .claude/rules/db.md (globs: lib/db/). Claude reads only the rule files matching the current edit's path. **Concept:** `attention-engineering` ### Code-Review Subagent, scoped tools, fresh context, two-pass review Spawned per PR with [Read, Grep, Bash] only. No Edit, no Write. Runs two passes: per-file local review against .claude/rules/, then a separate integration pass for cross-file consistency. Fresh context prevents lost-in-the-middle on long PRs. **Configuration:** Subagent task: 'Review changed files [list]. Pass 1: per-file style + tests against .claude/rules/. Pass 2: integration. API boundaries, shared state, type alignment.' allowed-tools: ['Read','Grep','Bash']. Returns structured verdict. **Concept:** `subagents` ## Build steps ### 1. Create the project-level CLAUDE.md At the repo root, write .claude/CLAUDE.md documenting stack, commands, and code style. Keep it tight. This file is read on every session and competes for attention with the actual task. 200-400 words is the sweet spot. Commit to version control so the whole team's Claude Code sessions inherit the same rules. **Python:** ```python # .claude/CLAUDE.md (committed to repo, ~300 words) # Project Next.js 15 App Router + TypeScript strict + Tailwind v4 + Drizzle ORM. # Commands - Dev: `pnpm dev` - Tests: `pnpm test` - Lint + typecheck: `pnpm lint && pnpm typecheck` - Build: `pnpm build` # Code Style - Named exports only (no default exports) - 2-space indent - Server Actions in app/actions/. 
Prefer over /api/ - Drizzle queries in lib/db/queries/ - Components: server by default; "use client" only when state or effects needed # Architecture - Auth: middleware.ts at root; protect /account and /admin - Schema: lib/db/schema.ts is the single source of truth - Types: derive from schema where possible (z.infer / drizzle types) ``` **TypeScript:** ```typescript // .claude/CLAUDE.md (committed to repo, ~300 words) // // # Project // Next.js 15 App Router + TypeScript strict + Tailwind v4 + Drizzle ORM. // // # Commands // - Dev: `pnpm dev` // - Tests: `pnpm test` // - Lint + typecheck: `pnpm lint && pnpm typecheck` // - Build: `pnpm build` // // # Code Style // - Named exports only (no default exports) // - 2-space indent // - Server Actions in app/actions/. Prefer over /api/ // - Drizzle queries in lib/db/queries/ // - Components: server by default; "use client" only when state or effects needed // // # Architecture // - Auth: middleware.ts at root; protect /account and /admin // - Schema: lib/db/schema.ts is the single source of truth // - Types: derive from schema where possible (z.infer / drizzle types) ``` Concept: `claude-md-hierarchy` ### 2. Split conventions into .claude/rules/ globs Once CLAUDE.md grows past ~500 words, split it. Move React conventions to .claude/rules/react.md with a glob /*.tsx, API routes to .claude/rules/api.md (app/api/), DB to .claude/rules/db.md (lib/db/). Claude loads only the rule files matching the file being edited. Attention stays focused. **Python:** ```python # .claude/rules/react.md # globs: ["**/*.tsx", "**/*.jsx"] # # ## React Component Rules # - Tailwind for all styles; no CSS modules, no styled-components # - Props: typed interface (no inline { foo: string }) # - Server Components by default # - "use client" only when useState/useEffect/event handlers required # .claude/rules/api.md # globs: ["app/api/**"] # # ## API Route Rules # - Always validate input with zod schema imported from lib/schemas/ # - Wrap handler in middleware that injects { user, request_id } # - Return shape: { status: "ok", data } | { status: "error", error: { code, message } } # - Log errors with request_id for trace correlation # .claude/rules/db.md # globs: ["lib/db/**"] # # ## Database Rules # - Drizzle ORM only. No raw SQL except in migrations # - Queries in lib/db/queries/ subfolder # - Types: drizzle-zod for runtime + drizzle-orm InferSelectModel for static ``` **TypeScript:** ```typescript // .claude/rules/react.md // globs: ["**/*.tsx", "**/*.jsx"] // // ## React Component Rules // - Tailwind for all styles; no CSS modules, no styled-components // - Props: typed interface (no inline { foo: string }) // - Server Components by default // - "use client" only when useState/useEffect/event handlers required // .claude/rules/api.md // globs: ["app/api/**"] // // ## API Route Rules // - Always validate input with zod schema imported from lib/schemas/ // - Wrap handler in middleware that injects { user, request_id } // - Return shape: { status: "ok", data } | { status: "error", error: { code, message } } // - Log errors with request_id for trace correlation // .claude/rules/db.md // globs: ["lib/db/**"] // // ## Database Rules // - Drizzle ORM only. No raw SQL except in migrations // - Queries in lib/db/queries/ subfolder // - Types: drizzle-zod for runtime + drizzle-orm InferSelectModel for static ``` Concept: `attention-engineering` ### 3. Use Plan Mode for any complex refactor Shift+Tab → Plan Mode. 
Ask Claude to analyze the codebase first: read the relevant files, sketch dependencies, identify shared state, propose the refactor approach. Review the plan; refine if needed; only then approve and switch to Code Mode. The 5-minute plan review cost is dwarfed by the 30+ minute rework cost it prevents. **Python:** ```python # Workflow: Shift+Tab → Plan Mode # # Developer prompt: # "Plan: split this monolith into 3 services. Auth, billing, notifications. # Identify shared utilities, shared DB schemas, and call boundaries." # # Claude (in Plan Mode, read-only): # 1. Reads app/, lib/, prisma/schema.prisma # 2. Sketches dependency graph # 3. Returns structured plan: # - auth/ owns: User, Session, AuthMethod # - billing/ owns: Customer, Invoice, Payment # - notifications/ owns: Email, Push, AuditLog # - shared/: lib/db/connection.ts, lib/utils/id.ts # - call boundaries: HTTP for sync, queue for async # - migration order: shared → auth → billing → notifications # # Developer reviews, asks: "Is the queue Redis or SQS?" # Claude refines plan with the answer. # Developer: "Approved. Execute step 1 only." # Claude switches to Code Mode and executes that step. ``` **TypeScript:** ```typescript // Workflow: Shift+Tab → Plan Mode // // Developer prompt: // "Plan: split this monolith into 3 services. Auth, billing, notifications. // Identify shared utilities, shared DB schemas, and call boundaries." // // Claude (in Plan Mode, read-only): // 1. Reads app/, lib/, prisma/schema.prisma // 2. Sketches dependency graph // 3. Returns structured plan: // - auth/ owns: User, Session, AuthMethod // - billing/ owns: Customer, Invoice, Payment // - notifications/ owns: Email, Push, AuditLog // - shared/: lib/db/connection.ts, lib/utils/id.ts // - call boundaries: HTTP for sync, queue for async // - migration order: shared → auth → billing → notifications // // Developer reviews, asks: "Is the queue Redis or SQS?" // Claude refines plan with the answer. // Developer: "Approved. Execute step 1 only." // Claude switches to Code Mode and executes that step. ``` Concept: `plan-mode` ### 4. Build Skills for domain-specific generation Create one Skill per domain. Each Skill has frontmatter (name + description + when_to_use), system-prompt additions for that domain, and an allowed-tools whitelist. The developer invokes the right Skill per task. Skills replace the monolithic-prompt pattern that bloats context with irrelevant rules. **Python:** ```python # .claude/skills/react-components/SKILL.md --- name: react-components description: Generate production-grade React components (Tailwind + TS strict). when_to_use: When asked to create or modify a .tsx component file. allowed-tools: [Read, Grep, Glob, Edit] --- You are generating a React component for this Next.js 15 codebase. ## Conventions - Server Components by default. Use "use client" only when state/effects required. - Props: typed interface, never inline. - Tailwind for all styling; no CSS modules. - Named export, never default. - Include a JSDoc one-liner for any non-trivial prop. ## Pattern ```tsx interface FooProps { // ... } export function Foo({ ... }: FooProps) { return ( ... ); } ``` ``` **TypeScript:** ```typescript // .claude/skills/react-components/SKILL.md // --- // name: react-components // description: Generate production-grade React components (Tailwind + TS strict). // when_to_use: When asked to create or modify a .tsx component file. 
// allowed-tools: [Read, Grep, Glob, Edit] // --- // // You are generating a React component for this Next.js 15 codebase. // // ## Conventions // - Server Components by default. Use "use client" only when state/effects required. // - Props: typed interface, never inline. // - Tailwind for all styling; no CSS modules. // - Named export, never default. // - Include a JSDoc one-liner for any non-trivial prop. // // ## Pattern // ```tsx // interface FooProps { // // ... // } // // export function Foo({ ... }: FooProps) { // return ( ... ); // } // ``` ``` Concept: `skills` ### 5. Spawn a code-review Subagent on every PR After Claude Code generates a diff, spawn a code-reviewer Subagent with [Read, Grep, Bash] only. No edits. The subagent runs in a fresh context and does two passes: per-file local review (style, tests) and integration (cross-file consistency, API boundaries). Fresh context = no lost-in-the-middle on long PRs. **Python:** ```python # Spawn a code-review subagent from anthropic import Anthropic client = Anthropic() REVIEW_PROMPT = """You are a code reviewer. Two passes. PASS 1. Per-file: For each changed file, check: - Conformance with .claude/rules/ matching this file's path - Test coverage for changed lines - No regressions in imports or exports PASS 2. Integration: Across all changed files: - API boundaries consistent (caller signature == callee signature) - Shared types aligned - DB schema and ORM usage match Return STRUCTURED verdict: { "per_file": [{file, issues: [...]}], "integration": [{issue, files_affected: [...]}], "blocker_count": int, "warning_count": int }""" # Spawn. Note allowed-tools is scoped Read/Grep/Bash only resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=8192, system=REVIEW_PROMPT, tools=[READ_TOOL, GREP_TOOL, BASH_TOOL], messages=[{"role": "user", "content": f"Review PR diff: {pr_diff}"}], ) ``` **TypeScript:** ```typescript // Spawn a code-review subagent import Anthropic from "@anthropic-ai/sdk"; const client = new Anthropic(); const REVIEW_PROMPT = `You are a code reviewer. Two passes. PASS 1. Per-file: For each changed file, check: - Conformance with .claude/rules/ matching this file's path - Test coverage for changed lines - No regressions in imports or exports PASS 2. Integration: Across all changed files: - API boundaries consistent (caller signature == callee signature) - Shared types aligned - DB schema and ORM usage match Return STRUCTURED verdict: { "per_file": [{file, issues: [...]}], "integration": [{issue, files_affected: [...]}], "blocker_count": int, "warning_count": int }`; // Spawn. Note allowed-tools is scoped Read/Grep/Bash only const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 8192, system: REVIEW_PROMPT, tools: [READ_TOOL, GREP_TOOL, BASH_TOOL], messages: [{ role: "user", content: `Review PR diff: ${prDiff}` }], }); ``` Concept: `subagents` ### 6. Use @mentions to target context, not load the whole repo When asking about a specific area, use @path/to/file in the prompt. Claude pulls in only that file plus its imports. Not the entire codebase. Pair with CLAUDE.md's Key Files section so the team has a documented set of high-value entry points to mention. **Python:** ```python # In CLAUDE.md, document the key files: # # # Key Files # - Database schema: @lib/db/schema.ts # - Auth middleware: @middleware.ts # - API request validator: @lib/validators/api.ts # - Server action helpers: @lib/actions/_helpers.ts # # Then in a prompt: # "How does the rate limiter integrate with auth? 
@middleware.ts @lib/auth.ts" # # Claude pulls in only those two files + transitively their imports. # Context stays focused; the whole repo doesn't bloat the message. ``` **TypeScript:** ```typescript // In CLAUDE.md, document the key files: // // # Key Files // - Database schema: @lib/db/schema.ts // - Auth middleware: @middleware.ts // - API request validator: @lib/validators/api.ts // - Server action helpers: @lib/actions/_helpers.ts // // Then in a prompt: // "How does the rate limiter integrate with auth? @middleware.ts @lib/auth.ts" // // Claude pulls in only those two files + transitively their imports. // Context stays focused; the whole repo doesn't bloat the message. ``` Concept: `context-window` ### 7. Sequence orthogonal tasks; never mix them in one session Refactor + optimize is two tasks, not one. If the developer asks for both in a single session, expect ~40% rework. The optimization feedback invalidates half the refactor. Run them as sequential sessions: refactor cleanly first, commit, then optimize the new structure with fresh context. **Python:** ```python # Wrong: one prompt, two orthogonal goals # "Refactor this Redux store to Zustand and also optimize bundle size." # Result: refactor 80% done, then optimization feedback causes rework. # ~40% wasted work. # Right: sequential sessions # # Session 1 (Plan Mode): refactor only # "Plan: refactor app/store/ from Redux to Zustand. Pure structural; # do not change file boundaries or optimize anything yet." # Approve plan, execute, commit. # # Session 2 (new context, Plan Mode): optimize # "Plan: now optimize bundle size on the new Zustand store. What dynamic # imports? What tree-shake wins? Any code-split candidates?" # Approve plan, execute, commit. # # Each session has clean context and one goal. Code review is per-PR easier. ``` **TypeScript:** ```typescript // Wrong: one prompt, two orthogonal goals // "Refactor this Redux store to Zustand and also optimize bundle size." // Result: refactor 80% done, then optimization feedback causes rework. // ~40% wasted work. // Right: sequential sessions // // Session 1 (Plan Mode): refactor only // "Plan: refactor app/store/ from Redux to Zustand. Pure structural; // do not change file boundaries or optimize anything yet." // Approve plan, execute, commit. // // Session 2 (new context, Plan Mode): optimize // "Plan: now optimize bundle size on the new Zustand store. What dynamic // imports? What tree-shake wins? Any code-split candidates?" // Approve plan, execute, commit. // // Each session has clean context and one goal. Code review is per-PR easier. ``` Concept: `evaluation` ### 8. Save the rule when Claude makes a mistake When Claude generates something the team doesn't want (default exports, wrong import order, /api/ instead of server actions), don't just correct it in chat. Ask Claude to save the rule: 'Save this to CLAUDE.md or the right .claude/rules/ file.' The next session, and every other team member's session, inherits the fix. **Python:** ```python # Developer flow: # # 1. Claude generates: `export default function Foo() {}` # 2. Developer corrects: "Use named export. Save this to .claude/rules/react.md." # 3. Claude: # - Edits .claude/rules/react.md, adds: # "## Exports # - Named exports only. Never `export default`." # - Updates the original file to use named export. # - Commits both in the same change. # # 4. Next session. For ANY developer on the team. Claude reads the rule # and never generates the default export again. Convention drift = 0.
def save_rule_to_repo(rule_text: str, target_file: str): """Pseudocode: Claude actually does this through Edit + Bash tools.""" with open(f".claude/rules/{target_file}", "a") as f: f.write(f"\n{rule_text}\n") subprocess.run(["git", "add", f".claude/rules/{target_file}"]) ``` **TypeScript:** ```typescript // Developer flow: // // 1. Claude generates: `export default function Foo() {}` // 2. Developer corrects: "Use named export. Save this to .claude/rules/react.md." // 3. Claude: // - Edits .claude/rules/react.md, adds: // "## Exports // - Named exports only. Never `export default`." // - Updates the original file to use named export. // - Commits both in the same change. // // 4. Next session. For ANY developer on the team. Claude reads the rule // and never generates the default export again. Convention drift = 0. async function saveRuleToRepo(ruleText: string, targetFile: string) { // Pseudocode: Claude actually does this through Edit + Bash tools. await fs.appendFile(`.claude/rules/${targetFile}`, `\n${ruleText}\n`); await execAsync(`git add .claude/rules/${targetFile}`); } ``` Concept: `claude-md-hierarchy` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Where do team coding conventions live? | .claude/CLAUDE.md (project-level, committed); split to .claude/rules/*.md when it exceeds ~500 words | personal ~/.claude/CLAUDE.md (not shared) or inline comments in source files | Conventions have to be shared and version-controlled to survive team turnover. Personal config doesn't reach the next developer; inline comments rot. | | Complex refactor (20+ files, multi-service) | Plan Mode first; review the plan; approve; only then execute. Sequential subtasks per session. | Jump to code; correct as you go; merge refactor + optimization in one prompt | Plan Mode's 5-minute review cost is dwarfed by the 30+ minute rework when the model guesses at structure. Mixed-goal sessions cause ~40% rework. | | Multi-domain repo (React + API + DB) | One Skill per domain in .claude/skills/; .claude/rules/*.md keyed by file glob | One monolithic CLAUDE.md with every convention; one mega-tool list for everything | Attention budget per turn is finite. Loading API rules when editing React wastes attention and dilutes routing accuracy. | | PR review on a 14-file PR | Two-pass via a code-review Subagent with [Read, Grep, Bash] only. Pass 1 per-file, pass 2 integration | Single-pass review in the same session that wrote the code | Same-session review inherits the writer's context bias. By file 14, lost-in-the-middle drops the conventions established on file 3. Fresh context fixes both. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-06 · Team conventions drift | No CLAUDE.md. Each developer's Claude session generates code with different style. Snake_case in one file, camelCase in another, default exports in a repo of named exports. | Create .claude/CLAUDE.md at the project root, commit it, and document stack + commands + code style. Claude reads it on every session. When you correct Claude, ask it to save the new rule to CLAUDE.md so the next session inherits it. | | AP-07 · Refactor without exploration | Developer asks Claude to 'split this monolith into microservices'. Claude jumps to creating new files without understanding shared state, dependencies, or boundaries. ~40% of the changes get reworked. | Use Plan Mode (Shift+Tab) first. Have Claude analyze the monolith, identify boundaries, sketch the dependency graph, and present a plan. 
Review and approve, then switch to Code Mode for execution. | | AP-08 · Monolithic context for multi-domain work | Single Claude Code session refactoring 50 files across React, API, and DB. Context bloats with file reads and assumptions; later edits are inconsistent with earlier ones. | Split into Skills (react-components, api-routes, database-queries) with their own system-prompt additions and tool whitelists. Invoke the right Skill per task; each runs in a focused context. | | AP-09 · Single-pass PR review | GitHub Action reviews a 14-file PR in one Claude session. By file 14, lost-in-the-middle has dropped the conventions seen on file 3. Inline comments contradict each other. | Spawn a code-review Subagent with scoped tools (Read + Grep + Bash). Two passes: per-file local against .claude/rules/, then integration for cross-file consistency. Fresh context, no carry-over noise. | | AP-11 · Refactor + optimize in one session | Developer prompts: 'Refactor Redux to Zustand AND optimize bundle size.' Claude attempts both; refactor is 80% done when optimization feedback invalidates half. ~40% rework. | Sequence orthogonal tasks. Session 1: refactor only, commit. Session 2: optimize the new structure with fresh context. Each session has one clear goal; code review is per-PR easier; rework drops to <10%. | ## Implementation checklist - [ ] Project-level .claude/CLAUDE.md committed; covers stack + commands + code style (`claude-md-hierarchy`) - [ ] .claude/rules/*.md split by file glob once CLAUDE.md exceeds ~500 words (`attention-engineering`) - [ ] Plan Mode (Shift+Tab) is the team norm for any refactor > 5 files (`plan-mode`) - [ ] Skills directory populated for each domain (react, api, db, review) (`skills`) - [ ] Each Skill has a tight allowed-tools whitelist (no Bash + Edit unless needed) (`tool-calling`) - [ ] Code-review Subagent spawned per PR with [Read, Grep, Bash] only (`subagents`) - [ ] Two-pass review: per-file local + integration (`context-window`) - [ ] @mentions document Key Files in CLAUDE.md to target context efficiently - [ ] Orthogonal tasks (refactor + optimize) sequenced into separate sessions - [ ] Corrections saved to CLAUDE.md or .claude/rules/ on the spot, not just in chat - [ ] CI/CD job runs the code-review Subagent on every PR pre-merge ## Cost & latency - **CLAUDE.md + rules storage:** ~2-5 KB committed to repo, Project CLAUDE.md ~600 bytes; three .claude/rules/*.md ~400 bytes each; 1-2 Skills ~500 bytes each. Negligible repo bloat; inlined to system prompt at session start. - **Plan Mode exploration (avg complex refactor):** ~$0.02-0.05, Claude reads 6-10 key files (~15K input tokens) and returns a 1-2K-token plan. Cost ~$0.02-0.05. Saves ~$0.30+ in rework. ROI ~15×. - **Per-100-line generation cycle:** ~$0.08-0.15, Initial code: ~20K input tokens (codebase + rules) + 5K output. Tests + revisions: +$0.02-0.05. Total ~$0.08-0.15. - **Code-review Subagent per PR:** ~$0.01-0.03, Scoped to changed files only (~5K tokens) + 1-2K verdict. Two-pass adds ~30%. Catches 70%+ of style + integration bugs before human review. - **Skills system overhead:** ~0% (system-prompt inline), Skills are .md files that load into the system prompt only when invoked. Per-Skill ~500 bytes; loading <100ms. No extra API calls; just attention engineering. 
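The implementation checklist above calls for splitting CLAUDE.md into .claude/rules/*.md once it passes ~500 words, and the production-readiness list further down mentions a pre-commit hook that flags drift past that threshold. A minimal sketch of such a guard follows; the file path and the 500-word threshold come from this scenario, while the script itself (and wiring it into pre-commit) is illustrative, not an official tool.

```python
# check_claude_md.py. Illustrative pre-commit guard: fail the commit when
# .claude/CLAUDE.md drifts past the ~500-word split threshold used above.
import pathlib
import sys

THRESHOLD_WORDS = 500  # the "split into .claude/rules/" trigger from this scenario
CLAUDE_MD = pathlib.Path(".claude/CLAUDE.md")


def main() -> int:
    if not CLAUDE_MD.exists():
        print("no .claude/CLAUDE.md found; nothing to check")
        return 0
    words = len(CLAUDE_MD.read_text(encoding="utf-8").split())
    if words > THRESHOLD_WORDS:
        print(
            f"{CLAUDE_MD} is {words} words (> {THRESHOLD_WORDS}): "
            "split area-specific conventions into .claude/rules/*.md keyed by glob"
        )
        return 1
    print(f"{CLAUDE_MD} is {words} words: OK")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```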
## Domain weights - **D3 · Agent Operations (20%):** CLAUDE.md Hierarchy + Plan Mode + Skills + .claude/rules globs - **D2 · Tool Design + Integration (18%):** Code-Review Subagent (scoped tool whitelist) + tool registry hygiene ## Practice questions ### Q1. Your team is using Claude Code on the same Next.js codebase. Each developer's session generates code with different style. Some default exports, some named; some snake_case files, some kebab-case. What is the most maintainable way to enforce one set of conventions across the team? Create .claude/CLAUDE.md at the project root. Document stack, commands, and code style in 200-400 words. Commit it to version control so every team member's Claude Code session reads the same rules. When Claude generates something inconsistent, correct it AND ask Claude to save the corrected rule to CLAUDE.md (or the relevant .claude/rules/ file) so the next session inherits the fix. One source of truth, version-controlled, shared via git pull. ### Q2. You're restructuring a monolith (20+ services, 100+ files) into microservices. Claude Code begins making changes immediately on the first prompt. What should you have done first, and why? Use Plan Mode (Shift+Tab) before any prompt that asks for multi-file changes. Plan Mode is a read-only state where Claude can explore files, sketch dependencies, and propose a refactor design. But cannot edit. The developer reviews the plan, refines (Use Drizzle instead of Prisma), and approves; only then does Claude switch to Code Mode and execute. The 5-minute review cost prevents the 30+ minute rework when Claude guesses at structure. ### Q3. Your repo has React components (Tailwind), API routes (zod + middleware), and database queries (Drizzle). Should Claude Code use one big tool list and one CLAUDE.md, or specialised setup? Specialise. Create one Skill per domain: react-components (Tailwind + TS), api-routes (zod + middleware), database-queries (Drizzle patterns). Each Skill has its own system-prompt additions and an allowed-tools whitelist. Pair with .claude/rules/*.md keyed by file glob (react.md for /*.tsx, api.md for app/api/). The developer invokes the right Skill per task; attention stays focused on the relevant domain. ### Q4. GitHub Actions runs Claude Code to review a 14-file PR. The review feedback on file 14 contradicts the decision on file 3. Why, and how do you fix it architecturally? Lost-in-the-middle effect. By file 14, the model's context is dense with 13 prior file reads; attention dilutes and earlier conventions drop out. Fix: spawn a code-review Subagent with scoped tools ([Read, Grep, Bash]) and run two passes. Pass 1 reviews each file in isolation against .claude/rules/; pass 2 reviews integration (cross-file consistency, API boundaries). Fresh, scoped context. No carry-over noise. ### Q5. Should .claude/CLAUDE.md be committed to the repository, and who owns updating it? Yes, always commit it. It's team memory. The shared conventions Claude Code reads on every session. Anyone can update it: when Claude makes a mistake and you correct it, ask Claude to 'save this rule to CLAUDE.md', then commit the change. Code review on the CLAUDE.md commit catches drift early. .claude/CLAUDE.local.md (gitignored) is for personal preferences that shouldn't be team-wide. ## FAQ ### Q1. Why split CLAUDE.md into .claude/rules/ instead of keeping it monolithic? Attention budget. Every line in the system prompt competes for the model's attention on every turn. 
A 1,500-word CLAUDE.md that mixes React, API, and DB conventions wastes attention when Claude is editing a React file (the API and DB rules aren't relevant). .claude/rules/*.md keyed by file glob loads only the rules matching the current edit. The prompt stays focused on what's actually relevant. ### Q2. What's the difference between Plan Mode and just asking Claude to read the code first? Plan Mode is enforced via the SDK. Claude literally cannot edit files until the developer toggles back to Code Mode. Asking 'read the code first' is a soft suggestion the model can ignore under pressure. For complex refactors, the hard gate matters: it forces the plan-then-execute discipline that prevents the most expensive class of refactor mistakes. ### Q3. Can I use Skills and Subagents together? Yes. They compose naturally. A Subagent is an isolation unit (fresh context, scoped tools); a Skill is a system-prompt customisation for a domain. A code-review Subagent might invoke a code-reviewer Skill that defines the review rubric. Subagent owns the context boundary; Skill owns the domain expertise. ### Q4. How small should a Skill be? Tight scope, single domain. A good Skill fits in 50-200 lines of frontmatter + system-prompt. If a Skill grows past 500 lines or starts covering multiple domains, split it (api-routes-rest vs api-routes-graphql, or react-components-server vs react-components-client). Smaller Skills route more accurately and load faster. ### Q5. When should I use a Subagent vs. just continuing in the current session? Use a Subagent when (a) you need context isolation (PR review, audit, parallel research), (b) the task has a scope boundary that won't pollute the parent context, or (c) you're running read-only analysis that doesn't need write access. Stay in the current session for inline reasoning where context continuity matters. ### Q6. What's the fastest way to discover what Skills the team has? Run claude skills list from the repo root. It scans .claude/skills/ (project), ~/.claude/skills/ (personal), and reports each Skill's name, description, and when_to_use. Pair with a CLAUDE.md ## Available Skills section that points to the canonical Skill set so new developers see them on day one. ### Q7. Should I commit .claude/skills/ to the repo? Yes for team Skills, no for personal experiments. Project-level Skills go in .claude/skills/ and are committed. They're team conventions, just like .claude/rules/. Personal experiments live in ~/.claude/skills/ (system-wide) and stay out of the repo until they're proven and ready to be promoted. ## Production readiness - [ ] CLAUDE.md exists, committed, < 500 words; pre-commit hook flags drift past that - [ ] .claude/rules/*.md keyed by glob, each ≤ 200 words - [ ] Plan Mode used in the last 5 multi-file refactors (audit via git log + chat history) - [ ] Skills directory has at least one Skill per major domain - [ ] Code-review Subagent runs in CI on every PR > 3 files - [ ] Two-pass review verdict posts to PR as a structured comment - [ ] No Skill has Edit + Bash + Write all granted (privilege bloat alert) - [ ] Developer onboarding doc references .claude/CLAUDE.md and claude skills list on day one --- **Source:** https://claudearchitectcertification.com/scenarios/code-generation-with-claude-code **Vault sources:** ACP-T05 §Scenario 2 (5 ✅/❌ pairs · official guide scenario); ACP-T08 §3.2 metadata; Course 02 Claude Code 101. 
Lessons 01, 05, 08; ACP-T06 (5 practice Qs tagged to components); GAI-K05 CCA exam questions and scenarios **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Multi-Agent Research System > A hub-and-spoke research system. The coordinator owns task decomposition (semantic, not lexical), spawns 3-5 research subagents in parallel with isolated contexts and scoped tools, routes findings through a verification subagent that preserves contradictions with attribution (45% Pew vs 12% McKinsey, both kept), and hands verified claims to a read-only synthesis subagent that emits cited Markdown. Subagents NEVER talk to each other directly. All communication routes through the coordinator. Timeouts return structured error context, not silence. The single most-tested distractor: blaming the subagent for narrow coverage when the coordinator's decomposition was the bug. **Sub-marker:** P3.3 **Domains:** D1 · Agentic Architectures, D2 · Tool Design + Integration **Exam weight:** 45% of CCA-F (D1 + D2) **Build time:** 26 minutes **Source:** 🟢 Official Anthropic guide scenario · in published exam guide and practice exam **Canonical:** https://claudearchitectcertification.com/scenarios/multi-agent-research-system **Last reviewed:** 2026-05-04 ## In plain English Think of this as how you ask one question and get back a properly cited briefing from many sources at once. A coordinator splits the question into the obvious sub-questions (visual arts, music, writing, film, performing arts. Not just the first one that comes to mind), spawns a small team of researchers in parallel, each works alone in their lane, then a separate fact-checker reconciles anything they disagree on, and a final writer turns the verified findings into a single readable report with citations. The whole point is that one big agent thinking by itself misses things; a small team with the right division of labour does not. ## Exam impact Domain 1 (Agentic Architecture, 27%) tests coordinator vs subagent responsibilities, hub-and-spoke topology, and where decomposition actually lives. Domain 2 (Tool Design, 18%) tests scoped tool access per subagent, structured error context, and verify_fact scoping. This scenario is in the published guide AND the practice exam. Questions you drill here match the live exam closely. The 'who's at fault when coverage is narrow' question is the canonical exam distractor. ## The problem ### What the customer needs - Complete coverage of the research topic. Every relevant sub-domain enumerated, none silently dropped. - Reconciled contradictions preserved with attribution, not flattened into one 'most likely' number. - Cited final report that traces every claim to a verifiable source and acknowledges data gaps. ### Why naive approaches fail - Coordinator decomposes 'creative industries' into only visual arts. Misses music, writing, film, performing arts. - Web-search subagent times out and returns empty results as success. Coordinator treats as 'no info' instead of 'needs retry'. - Synthesis picks the 'more likely' statistic between 45% (Pew) and 12% (McKinsey). Drops the conflict, ships misinformation. 
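The first two of these failures are detectable at the coordinator before synthesis ever runs. Below is a minimal sketch of that kind of pre-synthesis audit; the helper name, the expected-subdomain list, and the result shape are assumptions for illustration, not part of the official scenario code (the build steps below show the canonical structured-error and verification handling).

```python
# Illustrative pre-synthesis audit. Assumes each subagent result looks like
# {"task": ..., "result": {...}} as in the build steps below; the helper name
# and shapes are a sketch, not official scenario code.
def audit_before_synthesis(expected_subdomains: list[str], results: list[dict]) -> list[dict]:
    flags = []
    # 1. Coverage: every enumerated sub-domain must map to a spawned task.
    covered = {r.get("task", "") for r in results}
    for sub in expected_subdomains:
        if not any(sub in task for task in covered):
            flags.append({"type": "coverage_gap", "subdomain": sub})
    # 2. Silent emptiness: empty findings with no status field is ambiguous.
    #    "No info exists" and "we never got the data" must stay distinguishable.
    for r in results:
        payload = r.get("result", {}) or {}
        if not payload.get("findings") and "status" not in payload:
            flags.append({"type": "silent_empty_result", "task": r.get("task")})
    return flags
```

The third failure, dropped contradictions, is enforced at the verification subagent rather than by a coordinator check, as build step 5 below shows.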
### Definition of done - Topic-coverage gap rate = 0 (decomposition reviewed before spawn) - Timeout-as-empty-results rate = 0 (structured error context required) - Contradiction-preservation rate = 100% (both stats + sources retained) - Subagent-to-subagent direct call rate = 0 (all routes through coordinator) ## Concepts in play - 🟢 **Subagents** (`subagents`), Isolated parallel research workers - 🟢 **Agentic loops** (`agentic-loops`), Coordinator + subagent loops - 🟢 **Tool calling** (`tool-calling`), Scoped tool whitelist per subagent - 🟢 **tool_choice** (`tool-choice`), auto on research, forced on verification - 🟢 **stop_reason** (`stop-reason`), Coordinator + subagent termination - 🟢 **Structured outputs** (`structured-outputs`), Findings + verifications JSON shape - 🟢 **Evaluation** (`evaluation`), Verification subagent as fact-check gate - 🟢 **Context window** (`context-window`), Subagent isolation prevents bloat ## Components ### Coordinator Agent, the hub of hub-and-spoke Receives the user query, performs SEMANTIC decomposition (not lexical) into all relevant sub-domains, spawns research subagents in parallel with explicit task prompts, awaits all results, routes findings to verification, hands verified claims to synthesis. Owns every cross-subagent communication path. **Configuration:** Decomposition is the load-bearing step. For 'impact of AI on creative industries' the coordinator must enumerate visual + music + writing + film + performing arts, not stop at the first sub-domain. Spawn pattern: asyncio.gather (Python) / Promise.all (TS). **Concept:** `subagents` ### Research Subagent (parallel), scoped tools, isolated context One subagent per sub-domain. Receives an explicit task prompt. No inherited history, no parent context. Runs research with a narrow tool whitelist (Read, WebSearch, Bash). Returns structured findings JSON: {claim, sources: [{url, date, confidence}]}. Never editorialises; reports facts as stated. **Configuration:** system: "You are a research specialist. Find authoritative sources. Return JSON {findings: [{claim, sources: [...]}]}." tools: [Read, WebSearch, Bash]. messages: [{role: "user", content: task_from_coordinator}]. **Concept:** `tool-calling` ### Verification Subagent, fact-check + reconcile contradictions Cross-checks all claims from research subagents. When two sources conflict (45% Pew vs 12% McKinsey), preserves both with their context and attribution rather than picking the 'more likely' one. Returns verified claims with confidence scores; the verification step is what protects the report from misinformation. **Configuration:** Input: pooled claims from all research subagents. Output: {verifications: [{claim, verified, confidence, sources_reconciled, notes}]}. Notes field captures the context that explains apparent contradictions (different timeframes, definitions, populations). **Concept:** `evaluation` ### Synthesis Subagent, read-only narrative generator Receives verified claims + the coordinator's narrative prompt. Writes a cohesive Markdown report with inline citations [1], [2]. CRITICAL: tools restricted to Read only. No WebSearch, no Bash. This prevents re-research and keeps synthesis focused on stitching the verified facts into a story. **Configuration:** system: "You are a synthesis specialist. Read verified findings and write a cited narrative. Do NOT research." tools: [Read]. Input: {verified_claims, narrative_prompt}. Output: Markdown with [n] citations. 
**Concept:** `context-window` ### Error Propagation Layer, structured timeout context When a subagent times out or hits a dead end, returns structured error context the coordinator can act on: {status: 'timeout', query, partial_results, alternatives}. Coordinator inspects status_code and either retries with a narrower scope, accepts partial data, or transparently marks the gap in the final report. **Configuration:** On timeout: {status: 'timeout', query, partial, alternatives: ['narrower query', 'different keywords', ...]}. On no_results: {status: 'no_results', query, alternatives}. Never return [] as success. Silence loses the failure context. **Concept:** `structured-outputs` ## Build steps ### 1. Build the coordinator's semantic decomposition The decomposition step is where most coverage failures actually originate. Analyse the topic semantically and enumerate ALL relevant sub-domains before spawning anything. For 'creative industries', that means visual arts AND music AND writing AND film AND performing arts. Not the first one that comes to mind. The decomposition is the coordinator's load-bearing responsibility. **Python:** ```python from typing import List def decompose_query(query: str) -> List[str]: """Semantic decomposition. Enumerate all relevant sub-domains. The exam-question distractor is to blame subagents for narrow coverage when the coordinator's decomposition was the bug. """ q = query.lower() if "creative industries" in q: domains = [ "visual arts (digital art, graphic design, photography)", "music production and composition", "writing (novels, journalism, screenwriting)", "film and video production", "performing arts (theater, dance)", ] elif "healthcare" in q: domains = [ "clinical decision support", "medical imaging and diagnostics", "drug discovery and trials", "patient-facing communication", "administrative + revenue cycle", ] else: # Generic fallback. STILL decompose, never single-shot domains = [ f"{query}. Recent academic literature", f"{query}. Industry case studies", f"{query}. Empirical adoption data", ] return [f"Find AI impact on {d}" for d in domains] ``` **TypeScript:** ```typescript function decomposeQuery(query: string): string[] { // Semantic decomposition. Enumerate all relevant sub-domains. // The exam-question distractor is to blame subagents for narrow // coverage when the coordinator's decomposition was the bug. const q = query.toLowerCase(); let domains: string[]; if (q.includes("creative industries")) { domains = [ "visual arts (digital art, graphic design, photography)", "music production and composition", "writing (novels, journalism, screenwriting)", "film and video production", "performing arts (theater, dance)", ]; } else if (q.includes("healthcare")) { domains = [ "clinical decision support", "medical imaging and diagnostics", "drug discovery and trials", "patient-facing communication", "administrative + revenue cycle", ]; } else { // Generic fallback. STILL decompose, never single-shot domains = [ `${query}. Recent academic literature`, `${query}. Industry case studies`, `${query}. Empirical adoption data`, ]; } return domains.map((d) => `Find AI impact on ${d}`); } ``` Concept: `subagents` ### 2. Define subagent system prompts and tool whitelists Every subagent gets its own system prompt + scoped tool list. Research subagents get [Read, WebSearch, Bash]; verification gets [Read, WebSearch, Bash] + a fact-check rubric; synthesis gets [Read] only. 
That read-only restriction is the architectural detail that prevents synthesis from re-researching mid-narrative. **Python:** ```python RESEARCH_SUBAGENT_SYSTEM = """You are a research specialist. 1. Find authoritative sources on the assigned sub-domain. 2. Extract key claims with evidence. 3. Return structured JSON ONLY: {"findings": [ {"claim": "...", "sources": [{"url": "...", "date": "YYYY-MM-DD", "confidence": 0.0-1.0}]} ]} Do NOT synthesize, editorialize, or pick winners between conflicting sources. Report facts as stated. Preserve contradictions for verification.""" VERIFICATION_SUBAGENT_SYSTEM = """You are a fact-checker. 1. Read pooled claims from research subagents. 2. Cross-check each claim against its sources. 3. When sources conflict, PRESERVE BOTH with attribution + context. 4. Return JSON: {"verifications": [ {"claim": "...", "verified": true|false, "confidence": 0.0-1.0, "sources_reconciled": [...], "notes": "context about conflicts"} ]}""" SYNTHESIS_SUBAGENT_SYSTEM = """You are a synthesis specialist. 1. Read verified findings (JSON). 2. Write a cohesive Markdown narrative with inline [n] citations. 3. Acknowledge data gaps transparently when verifications flag them. 4. Do NOT conduct new research. You have ONLY the Read tool.""" ``` **TypeScript:** ```typescript const RESEARCH_SUBAGENT_SYSTEM = `You are a research specialist. 1. Find authoritative sources on the assigned sub-domain. 2. Extract key claims with evidence. 3. Return structured JSON ONLY: {"findings": [ {"claim": "...", "sources": [{"url": "...", "date": "YYYY-MM-DD", "confidence": 0.0-1.0}]} ]} Do NOT synthesize, editorialize, or pick winners between conflicting sources. Report facts as stated. Preserve contradictions for verification.`; const VERIFICATION_SUBAGENT_SYSTEM = `You are a fact-checker. 1. Read pooled claims from research subagents. 2. Cross-check each claim against its sources. 3. When sources conflict, PRESERVE BOTH with attribution + context. 4. Return JSON: {"verifications": [ {"claim": "...", "verified": true|false, "confidence": 0.0-1.0, "sources_reconciled": [...], "notes": "context about conflicts"} ]}`; const SYNTHESIS_SUBAGENT_SYSTEM = `You are a synthesis specialist. 1. Read verified findings (JSON). 2. Write a cohesive Markdown narrative with inline [n] citations. 3. Acknowledge data gaps transparently when verifications flag them. 4. Do NOT conduct new research. You have ONLY the Read tool.`; ``` Concept: `tool-calling` ### 3. Spawn research subagents in parallel All research subagents fire at once via async fan-out. Latency is max(subagents), not sum. The whole point of the architecture. Each subagent receives an explicit task prompt with the context it needs; nothing is inherited from the coordinator's history. Cost: N separate API calls. Worth it. **Python:** ```python import asyncio from anthropic import AsyncAnthropic client = AsyncAnthropic() async def spawn_research(task: str) -> dict: """One isolated research subagent. No inherited history.""" resp = await client.messages.create( model="claude-sonnet-4.5", max_tokens=2048, system=RESEARCH_SUBAGENT_SYSTEM, tools=RESEARCH_TOOLS, # [Read, WebSearch, Bash] messages=[{"role": "user", "content": task}], ) return { "task": task, "stop_reason": resp.stop_reason, "result": extract_json(resp), } async def research_in_parallel(tasks: list[str]) -> list[dict]: """Fan out N subagents at once. 
Latency = max, not sum.""" return await asyncio.gather(*(spawn_research(t) for t in tasks)) # Coordinator usage tasks = decompose_query("impact of AI on creative industries") results = asyncio.run(research_in_parallel(tasks)) ``` **TypeScript:** ```typescript import Anthropic from "@anthropic-ai/sdk"; const client = new Anthropic(); async function spawnResearch(task: string) { // One isolated research subagent. No inherited history. const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 2048, system: RESEARCH_SUBAGENT_SYSTEM, tools: RESEARCH_TOOLS, // [Read, WebSearch, Bash] messages: [{ role: "user", content: task }], }); return { task, stop_reason: resp.stop_reason, result: extractJson(resp), }; } async function researchInParallel(tasks: string[]) { // Fan out N subagents at once. Latency = max, not sum. return Promise.all(tasks.map((t) => spawnResearch(t))); } // Coordinator usage const tasks = decomposeQuery("impact of AI on creative industries"); const results = await researchInParallel(tasks); ``` Concept: `agentic-loops` ### 4. Return structured error context, never silence When a subagent times out or hits no results, the WORST thing it can do is return []. The coordinator then can't tell whether 'no info exists' or 'we never got the data'. A critical distinction for the final report. Always return a structured error: status + query + partial_results + alternatives. The coordinator inspects status_code and decides: retry, narrow, or transparently mark the gap. **Python:** ```python def handle_subagent_error(error_type: str, query: str, partial: list | None = None) -> dict: """Structured error context. Coordinator consumes status_code to decide.""" if error_type == "timeout": return { "status": "timeout", "query": query, "partial_results": partial or [], "alternatives": [ "Narrow the query to a single sub-aspect", "Try synonyms or domain-specific terms", "Reduce the time horizon (e.g., last 12 months)", ], } if error_type == "no_results": return { "status": "no_results", "query": query, "alternatives": ["Broaden keywords", "Check spelling", "Try sister terms"], } if error_type == "rate_limited": return {"status": "rate_limited", "query": query, "retry_after_s": 60} return {"status": "unknown_error", "query": query} # Coordinator inspects status_code def coordinator_handle(result: dict, original_task: str): status = result.get("status") if status == "timeout": # Either retry with the first alternative or include a transparent gap return retry_with_narrower_query(result["alternatives"][0]) if status == "no_results": return mark_data_unavailable(original_task) return result # success ``` **TypeScript:** ```typescript function handleSubagentError( errorType: string, query: string, partial?: unknown[], ): Record<string, unknown> { // Structured error context. Coordinator consumes status_code to decide. if (errorType === "timeout") { return { status: "timeout", query, partial_results: partial ??
[], alternatives: [ "Narrow the query to a single sub-aspect", "Try synonyms or domain-specific terms", "Reduce the time horizon (e.g., last 12 months)", ], }; } if (errorType === "no_results") { return { status: "no_results", query, alternatives: ["Broaden keywords", "Check spelling", "Try sister terms"], }; } if (errorType === "rate_limited") { return { status: "rate_limited", query, retry_after_s: 60 }; } return { status: "unknown_error", query }; } // Coordinator inspects status_code function coordinatorHandle( result: Record<string, unknown>, originalTask: string, ) { const status = result.status; if (status === "timeout") { const alts = result.alternatives as string[]; return retryWithNarrowerQuery(alts[0]); } if (status === "no_results") { return markDataUnavailable(originalTask); } return result; // success } ``` Concept: `structured-outputs` ### 5. Run the verification subagent and preserve contradictions Pool all claims from research subagents and pass them to a single verification subagent. When two sources conflict (45% Pew vs 12% McKinsey), the verification subagent's job is NOT to pick a winner. It is to preserve both with their context (different definitions, different timeframes, different populations) and attribute each to its source. Picking one is misinformation; preserving both is journalism. **Python:** ```python def build_verification_task(pooled_claims: list[dict]) -> str: """Pool all research-subagent claims for fact-checking.""" body = "\n".join( f"- Claim: {c['claim']}\n Sources: {c['sources']}" for c in pooled_claims ) return f"""Verify the following claims pooled from research subagents. For each claim: 1. Check that sources are credible and dated. 2. If two or more sources CONFLICT, do NOT pick a winner. Preserve both with their context (timeframe, definition, population) and emit both in sources_reconciled with notes explaining the apparent contradiction. The reader gets BOTH numbers + WHY they differ. 3. Return JSON: {{"verifications": [ {{"claim": "...", "verified": true|false, "confidence": 0.0-1.0, "sources_reconciled": [...], "notes": "..." }} ]}} CLAIMS TO VERIFY: {body}""" # Coordinator pools claims and dispatches. Findings sit under each # subagent's "result" key, as returned by spawn_research. all_claims = [c for r in research_results for c in r.get("result", {}).get("findings", [])] verification_task = build_verification_task(all_claims) verified = client.messages.create( model="claude-sonnet-4.5", max_tokens=4096, system=VERIFICATION_SUBAGENT_SYSTEM, tools=VERIFICATION_TOOLS, messages=[{"role": "user", "content": verification_task}], ) ``` **TypeScript:** ```typescript function buildVerificationTask( pooledClaims: Array<{ claim: string; sources: unknown[] }>, ): string { // Pool all research-subagent claims for fact-checking. const body = pooledClaims .map( (c) => `- Claim: ${c.claim}\n Sources: ${JSON.stringify(c.sources)}`, ) .join("\n"); return `Verify the following claims pooled from research subagents. For each claim: 1. Check that sources are credible and dated. 2. If two or more sources CONFLICT, do NOT pick a winner. Preserve both with their context (timeframe, definition, population) and emit both in sources_reconciled with notes explaining the apparent contradiction. The reader gets BOTH numbers + WHY they differ. 3. Return JSON: {"verifications": [ {"claim": "...", "verified": true|false, "confidence": 0.0-1.0, "sources_reconciled": [...], "notes": "..." } ]} CLAIMS TO VERIFY: ${body}`; } // Coordinator pools claims and dispatches. Findings live under each result's "result" key. const allClaims = researchResults.flatMap( (r) => ((r.result as { findings?: Array<{ claim: string; sources: unknown[] }> }).findings) ??
[], ); const verificationTask = buildVerificationTask(allClaims); const verified = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 4096, system: VERIFICATION_SUBAGENT_SYSTEM, tools: VERIFICATION_TOOLS, messages: [{ role: "user", content: verificationTask }], }); ``` Concept: `evaluation` ### 6. Run the synthesis subagent with READ-ONLY tools Synthesis is the final step. It receives verified claims + the coordinator's narrative prompt, and emits Markdown with inline citations. The crucial detail: the synthesis subagent's tool list is [Read] only. No WebSearch, no Bash. That restriction prevents it from re-researching mid-narrative (a common failure mode where synthesis fact-checks itself again and inflates latency 3-5×). **Python:** ```python import json def synthesize(verified_claims: list[dict], narrative_prompt: str) -> str: """Read-only narrative generation. No re-research.""" task = f"""Narrative prompt from coordinator: {narrative_prompt} Verified findings (JSON): {json.dumps(verified_claims, indent=2)} Write a Markdown report that: 1. Flows logically from finding to finding 2. Uses inline citations [1], [2], etc. matching sources_reconciled order 3. ACKNOWLEDGES data gaps transparently where verifications.notes flagged them 4. Does NOT re-research or speculate beyond the verified findings You have ONLY the Read tool. No WebSearch, no Bash.""" resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=3000, system=SYNTHESIS_SUBAGENT_SYSTEM, tools=[READ_TOOL_ONLY], # Critical: read-only messages=[{"role": "user", "content": task}], ) return extract_text(resp) ``` **TypeScript:** ```typescript async function synthesize( verifiedClaims: unknown[], narrativePrompt: string, ): Promise<string> { // Read-only narrative generation. No re-research. const task = `Narrative prompt from coordinator: ${narrativePrompt} Verified findings (JSON): ${JSON.stringify(verifiedClaims, null, 2)} Write a Markdown report that: 1. Flows logically from finding to finding 2. Uses inline citations [1], [2], etc. matching sources_reconciled order 3. ACKNOWLEDGES data gaps transparently where verifications.notes flagged them 4. Does NOT re-research or speculate beyond the verified findings You have ONLY the Read tool. No WebSearch, no Bash.`; const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 3000, system: SYNTHESIS_SUBAGENT_SYSTEM, tools: [READ_TOOL_ONLY], // Critical: read-only messages: [{ role: "user", content: task }], }); return extractText(resp); } ``` Concept: `context-window` ### 7. Route ALL communication through the coordinator If subagent B needs a finding from subagent A, the answer is NOT to call A from B. The answer is: A finishes, returns to coordinator, coordinator passes the finding into B's task prompt. This single rule preserves isolation (each subagent has clean context), parallelism (when dependencies allow), and visibility (the coordinator owns the whole orchestration graph). **Python:** ```python # WRONG. Direct subagent-to-subagent communication # class ResearcherA: # def __init__(self, researcher_b): # self.b = researcher_b # ❌ creates hidden coupling # async def find_papers(self, topic): # finding = await self._search(topic) # await self.b.search_web(finding) # ❌ breaks isolation # RIGHT.
Coordinator owns all routing async def coordinated_dependent_research(topic: str): # Phase 1: A runs alone a_finding = await spawn_research(f"Find academic papers on {topic}") # Phase 2: coordinator passes A's finding into B's TASK PROMPT b_task = ( f"Topic: {topic}. " f"Key research direction surfaced by paper search: " f"{a_finding['result']['top_direction']}. " f"Now search the web for industry coverage of that direction." ) b_result = await spawn_research(b_task) # Phase 3: coordinator continues with both return {"a": a_finding, "b": b_result} ``` **TypeScript:** ```typescript // WRONG. Direct subagent-to-subagent communication // class ResearcherA { // constructor(private b: ResearcherB) {} // ❌ creates hidden coupling // async findPapers(topic: string) { // const finding = await this.search(topic); // await this.b.searchWeb(finding); // ❌ breaks isolation // } // } // RIGHT. Coordinator owns all routing async function coordinatedDependentResearch(topic: string) { // Phase 1: A runs alone const aFinding = await spawnResearch(`Find academic papers on ${topic}`); // Phase 2: coordinator passes A's finding into B's TASK PROMPT const bTask = `Topic: ${topic}. Key research direction surfaced by paper search: ` + `${(aFinding.result as { top_direction: string }).top_direction}. ` + `Now search the web for industry coverage of that direction.`; const bResult = await spawnResearch(bTask); // Phase 3: coordinator continues with both return { a: aFinding, b: bResult }; } ``` Concept: `subagents` ### 8. Cap parallelism and add retry budgets Parallel fan-out has diminishing returns past 5-7 subagents. API concurrency limits, context-window contention on the coordinator side, and rate-limit backpressure all kick in. Cap concurrency, set a retry budget per subagent (typically 2 retries with narrowed queries), and emit telemetry: spawn count, parallel max, retry rate, partial-data rate. These metrics are how you tune the system in production. **Python:** ```python import asyncio MAX_PARALLEL = 5 RETRY_BUDGET = 2 semaphore = asyncio.Semaphore(MAX_PARALLEL) async def spawn_with_retry(task: str, attempt: int = 0) -> dict: """Bounded-concurrency research with structured-error retry.""" async with semaphore: result = await spawn_research(task) status = result.get("result", {}).get("status") if status == "timeout" and attempt < RETRY_BUDGET: narrower = narrow_query(result["result"].get("alternatives", [task])[0]) telemetry.increment("subagent.retry", labels={"reason": "timeout"}) return await spawn_with_retry(narrower, attempt + 1) return result # Coordinator with bounded fan-out + retries results = await asyncio.gather(*(spawn_with_retry(t) for t in tasks)) ``` **TypeScript:** ```typescript const MAX_PARALLEL = 5; const RETRY_BUDGET = 2; // Lightweight semaphore for bounded concurrency class Semaphore { private q: Array<() => void> = []; constructor(private avail: number) {} async acquire(): Promise<void> { if (this.avail > 0) { this.avail--; return; } return new Promise((res) => this.q.push(res)); } release(): void { const next = this.q.shift(); if (next) next(); else this.avail++; } } const sem = new Semaphore(MAX_PARALLEL); async function spawnWithRetry(task: string, attempt = 0): Promise<Record<string, unknown>> { await sem.acquire(); try { const result = await spawnResearch(task); const status = (result.result as { status?: string })?.status; if (status === "timeout" && attempt < RETRY_BUDGET) { const alts = (result.result as { alternatives?: string[] })?.alternatives ??
[task]; telemetry.increment("subagent.retry", { reason: "timeout" }); // finally releases this permit; the retry acquires its own return spawnWithRetry(narrowQuery(alts[0]), attempt + 1); } return result; } finally { sem.release(); } } const results = await Promise.all(tasks.map((t) => spawnWithRetry(t))); ``` Concept: `evaluation` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Coverage gap appears in the final report | Audit the coordinator's decomposition first. It almost always lives there | Tune subagent prompts or upgrade their model | Decomposition is the coordinator's job. If 4 of 5 sub-domains were never enumerated, no amount of subagent quality recovers them. Fix decomposition; the rest follows. | | Subagent timed out. What does it return? | Structured error: {status: 'timeout', query, partial_results, alternatives} | Empty list [], marked as success | Silence loses the failure context. The coordinator can't distinguish 'no info exists' from 'we never got the data', so the final report can't acknowledge the gap honestly. | | Two sources disagree (45% Pew vs 12% McKinsey) | Verification preserves both with attribution + context (different definitions, different timeframes) | Synthesis picks the 'more likely' one and drops the other | Both numbers are correct under their respective definitions. Dropping one is misinformation. Preserving both with context is the journalistic move and the architectural one. | | Subagent A's output is needed by Subagent B | A returns to coordinator; coordinator passes A's finding into B's task prompt | A calls B directly with the finding | Direct subagent-to-subagent calls break isolation, kill parallelism (B waits for A), and hide the dependency from the coordinator's view. Hub-and-spoke is the architecture for a reason. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-10 · Narrow task decomposition | Coordinator decomposes 'creative industries' into visual arts only; report misses music, writing, film, performing arts. Subagents finished successfully. The bug is upstream. | Fix the coordinator's semantic decomposition. Enumerate ALL relevant sub-domains before spawning. The decomposition step is the coordinator's load-bearing responsibility. | | AP-11 · Silent timeout returns empty as success | Web-search subagent times out, returns []. Coordinator treats as 'no information exists' instead of 'timed out'. Final report has a silent gap. | Return structured error: {status: 'timeout', query, partial_results, alternatives}. Coordinator inspects status_code, retries with a narrower scope, or marks the gap transparently in the report. | | AP-12 · Latency bloat from over-broad verification | Synthesis subagent calls verify_fact for 100 claims sequentially. 80 are simple (Wikipedia), 20 complex. Total 60+ seconds for what should be 10. | Scope verify_fact narrowly: simple-claim batch verification (parallel, ~3s) + dedicated complex-verification subagent (parallel, pre-synthesis). Synthesis assumes facts are pre-verified. | | AP-13 · Dropped contradictions | Two sources conflict (45% any-use Pew vs 12% daily-use McKinsey). Synthesis picks the 'more likely' one; the other is dropped. Report is misinformation. | Preserve both at the verification step with sources_reconciled + notes explaining the apparent conflict. Synthesis presents both with attribution. Reader sees both numbers and why they differ. | | AP-14 · Direct subagent-to-subagent communication | Researcher A (papers) directly hands a finding to Researcher B (web).
Isolation breaks; parallelism degrades to sequential; coordinator loses visibility. | Route everything through the coordinator. A returns to coordinator; coordinator constructs B's task prompt with A's finding embedded. Hub-and-spoke is non-negotiable. | ## Implementation checklist - [ ] Coordinator's decomposition function reviewed for completeness BEFORE first spawn (`subagents`) - [ ] Each subagent has its own system prompt + scoped tool whitelist (`tool-calling`) - [ ] Synthesis subagent has Read-only tool list (no WebSearch, no Bash) (`context-window`) - [ ] All subagent fan-out via async gather / Promise.all (`agentic-loops`) - [ ] Structured error context on every timeout / no-results path (`structured-outputs`) - [ ] Verification subagent preserves contradictions with attribution (`evaluation`) - [ ] No direct subagent-to-subagent calls anywhere in the codebase (`subagents`) - [ ] Coordinator inspects status_code on every subagent return - [ ] Bounded concurrency (semaphore) to cap parallel API calls - [ ] Per-subagent retry budget with narrowed-query alternatives - [ ] Telemetry: spawn count, parallel max, retry rate, partial-data rate ## Cost & latency - **Research subagents (3-5 parallel):** ~$0.06-0.15 per query, 3 subagents × ~20K input + ~2K output ≈ $0.06; 5 subagents ≈ $0.10. Parallel: latency = max(subagents) ≈ 3-5s, cost = sum. - **Verification subagent:** ~$0.03-0.05 per query, Reads pooled findings (~15K) + cross-checks (~10K) + emits verifications (~1K) ≈ $0.04. Single pass, serial. - **Synthesis subagent:** ~$0.02-0.03 per query, Reads verified findings (~5K) + generates Markdown narrative (~3K). Read-only tools keep cost low; no re-research. - **Retry overhead (timeouts):** ~$0.01-0.02 per retry, Narrowed retries (~10K input + 1K output). Cap at 2 retries per subagent to bound cost; beyond that, accept partial data and mark the gap. - **p95 end-to-end latency:** ~10-14s, Decompose ~0.5s + parallel research ~5s + verification ~3s + synthesis ~3s + coordinator overhead. Subagents in parallel save ~10s vs sequential. ## Domain weights - **D1 · Agentic Architectures (27%):** Coordinator + research/verification/synthesis subagent loops + hub-and-spoke routing - **D2 · Tool Design + Integration (18%):** Per-subagent tool whitelist + structured error context + verification rubric ## Practice questions ### Q1. A research system decomposes 'impact of AI on creative industries' into three subtopics: visual arts, music, writing. The web-search subagent finds results for all three. The synthesis subagent produces a report covering only visual arts. Why? The root cause is the coordinator's decomposition, not the subagents. Subagents finished successfully on what they were assigned; the coordinator only assigned visual-arts tasks. The fix is upstream: analyse the topic semantically and enumerate all relevant sub-domains (visual + music + writing + film + performing arts) before spawning. Decomposition is the coordinator's load-bearing responsibility, and the most-tested distractor on this scenario is to blame the subagents instead. ### Q2. A web-search subagent times out and returns an empty result list. The coordinator treats this as 'no information available' and moves forward. The final report is incomplete. What's the architectural fix? Subagents must return structured error context on timeout, never silence. The shape: {status: 'timeout', query, partial_results, alternatives: ['narrower query', 'different keywords', ...]}. 
The coordinator inspects status_code and decides: retry with the first alternative, accept partial data, or transparently mark the gap in the synthesis. Returning [] as success conflates 'no info exists' with 'we never got the data'. And the report can't acknowledge a gap it can't see. Tagged to AP-11. ### Q3. A research report cites two conflicting statistics: '45% of creative workers use AI' (Pew) and '12% use AI daily' (McKinsey). Should synthesis pick the more likely one? No. Both are correct. They measure different things (any-use vs daily-use) under different methodologies. The verification subagent's job is to preserve both with attribution + context, structured as sources_reconciled: [{stat: '45%', source: 'Pew', context: 'any use'}, {stat: '12%', source: 'McKinsey', context: 'daily use'}] plus a notes field explaining the apparent contradiction. Synthesis then presents both with attribution. Picking one is misinformation; preserving both is journalism. Tagged to AP-13. ### Q4. Subagent A (academic papers) finds a key research direction. Subagent B (web search) needs that finding to guide its queries. Should A pass it directly to B? No. All cross-subagent communication routes through the coordinator. A returns its finding; the coordinator constructs B's task prompt with the finding embedded: Topic: ... Key direction from paper search: [A's finding]. Now search the web for that direction. Direct calls break isolation (B inherits A's context noise), kill parallelism (B waits for A even when it could run in parallel), and hide the dependency from the coordinator's orchestration graph. Hub-and-spoke is non-negotiable. Tagged to AP-14. ### Q5. A synthesis subagent needs to verify ~100 facts in a final report. Calling verify_fact sequentially takes 60+ seconds. What's the architectural fix? Two layers. First, scope verify_fact narrowly: simple-claim verification (Wikipedia lookups, ~5ms each) batched in parallel completes ~80 claims in ~2s. Second, dedicate a separate verification subagent that runs before synthesis, processing complex multi-source reconciliation in parallel; synthesis then assumes facts are pre-verified and only reads. Total latency drops from 60s+ to ~3-5s. The general lesson: don't let the synthesis subagent re-research mid-narrative. Tagged to AP-12. ## FAQ ### Q1. Should subagents run in parallel or in sequence? Parallel whenever possible. Independent research tasks (visual arts, music, writing) all run at once via asyncio.gather / Promise.all. Cost: N separate API calls. Latency: max(N) ≈ 5-8s, not sum. Sequence only when there's a true data dependency (B needs A's output). And even then, the coordinator handles the chaining; subagents never call each other. ### Q2. Can subagents inherit the coordinator's conversation history? No. Subagents are isolated by design. That's the architectural win. The coordinator passes context explicitly in the subagent's task prompt: User asked: [query]. Key context so far: [pinned facts]. Your task: [focused research goal]. Subagent starts fresh with only what's in the task. Inheriting history defeats parallelism and bloats per-subagent context cost. ### Q3. What happens if multiple subagents return partial / timeout results? Coordinator collects what came back, invokes synthesis with a narrative-prompt note: Research is incomplete due to timeouts on [X, Y]. The report should acknowledge gaps in those areas explicitly. Transparency beats silence. 
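A minimal coordinator-side sketch of that hand-off, assuming the structured timeout shape above; `spawn_research` and `run_synthesis` are hypothetical stand-ins for whatever subagent and synthesis calls the harness actually makes:

```python
import asyncio

async def coordinate(tasks, spawn_research, run_synthesis):
    # Fan out research subagents in parallel, then tolerate partial failure.
    results = await asyncio.gather(*(spawn_research(t) for t in tasks))

    findings, gaps = [], []
    for r in results:
        if r.get("status") == "timeout":
            gaps.append(r["query"])                        # remember which sub-domain we lost
            findings.extend(r.get("partial_results", []))  # keep whatever did arrive
        else:
            findings.extend(r.get("findings", []))

    note = ""
    if gaps:
        note = (f"Research is incomplete due to timeouts on {gaps}. "
                "The report should acknowledge gaps in those areas explicitly.")
    return await run_synthesis(findings, extra_instructions=note)
```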
The reader sees a report that says 'we got these 3 sub-domains; the other 2 timed out' rather than a confidently-misleading report missing 2 whole sub-domains. ### Q4. Should the synthesis subagent have web-search access? No. Read-only is the architectural detail. Synthesis stitches verified findings into a narrative; it does not re-research. If synthesis needs to verify a fact mid-sentence, that's a sign verification should have been broader upstream. Fix the verification phase, not the synthesis tool list. Read-only also caps the latency and cost of synthesis predictably. ### Q5. How do we handle contradictions surfaced by research subagents? Don't resolve them at the subagent level. Pass conflicting findings to the verification subagent with sources intact. Verification reconciles: preserves both with attribution + a notes field explaining the conflict (different timeframes, definitions, populations, methodologies). Synthesis then presents both with context. The reader gets transparency; the system avoids fabricating false certainty. ### Q6. Can a subagent spawn another subagent (nested)? In theory yes; in practice avoid it. Nested subagents increase latency, complicate context flow, and obscure the orchestration graph from the coordinator. Keep the hierarchy shallow: coordinator → leaf subagents. If you need 'meta-research' (one subagent's job is to figure out what to research), have the coordinator do that decomposition step explicitly. ### Q7. What's the maximum number of subagents to run in parallel? No hard limit, diminishing returns past 5-7. API concurrency limits, rate-limit backpressure, and context-window contention on the coordinator side all start kicking in. Use a bounded semaphore (MAX_PARALLEL = 5), measure latency at different fan-outs, and tune to your workload. For 10+ tasks, consider Batch API or sequential task chains. ## Production readiness - [ ] Decomposition function unit-tested with 5+ representative topic types - [ ] Each subagent's tool whitelist verified. No Bash + Edit + Write together unless intended - [ ] Structured error contract enforced via TypeScript / Pydantic schema - [ ] Bounded concurrency (semaphore) tested under burst load (20+ queued tasks) - [ ] Retry budget per subagent capped (default 2); telemetry on retry rate - [ ] Verification subagent has fact-check rubric + contradiction-preservation rule documented - [ ] Synthesis subagent's tool list is [Read] only; lint check prevents regression - [ ] End-to-end test: contradictions survive from research → verification → synthesis with attribution --- **Source:** https://claudearchitectcertification.com/scenarios/multi-agent-research-system **Vault sources:** ACP-T05 §Scenario 3 (5 ✅/❌ pairs · official guide scenario); ACP-T08 §3.3 metadata; Course 16 Subagents. Lesson 03 designing effective subagents; ACP-T06 (5 practice Qs tagged to components); GAI-K05 CCA exam questions and scenarios (Scenario 3 walkthrough); COD-K04 Feynman architecture review (multi-agent patterns) **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Claude Code for CI/CD > Claude Code as a headless PR reviewer in GitHub Actions. The workflow uses claude -p for non-interactive execution, runs per-file independent sessions (no shared context across the 14 files in a PR. 
That prevents lost-in-the-middle), emits --output-format json for structured verdicts the next workflow step parses into PR comments, and explicitly declares allowed_tools in YAML (no wildcards in CI). Custom instructions in the workflow file inject the project's CLAUDE.md context. The most-tested distractor: thinking one big session reviewing all 14 files is faster. It's not, it's worse, because by file 14 the early conventions have dropped out of attention. **Sub-marker:** P3.5 **Domains:** D3 · Agent Operations, D2 · Tool Design + Integration **Exam weight:** 38% of CCA-F (D3 + D2) **Build time:** 25 minutes **Source:** 🟢 Official Anthropic guide scenario · in published exam guide and practice exam **Canonical:** https://claudearchitectcertification.com/scenarios/claude-code-for-cicd **Last reviewed:** 2026-05-04 ## In plain English Think of this as Claude Code working inside your CI without anyone watching. It runs every time a pull request opens, reviews each changed file in isolation, and posts inline comments back on the PR. No interactive prompts, no human at a keyboard. The agent is invoked as a one-shot command (`claude -p`), it returns a structured JSON verdict, and the workflow turns that JSON into the comments your team actually reads. The whole point is that PR review at scale needs a reviewer that does not get tired by file 14, and a CI-native agent that runs per file is exactly that. ## Exam impact Domain 3 (Claude Code Configuration, 20%) tests the headless `-p` flag, custom_instructions, and per-file session boundaries. Domain 2 (Tool Design, 18%) tests explicit allowed_tools in CI (no wildcards) and structured-JSON output contracts. This is in the published exam guide AND practice exam. High-yield drilling territory. The 'why does file 14 contradict file 3' question is canonical. ## The problem ### What the customer needs - PR review on every pull request without a human running the CLI manually. Claude Code triggered by GitHub Actions on pull_request. - Per-file feedback that doesn't lose track of conventions established earlier in the diff. File 14 must agree with file 3 on style and tests. - Structured output the workflow can parse into PR review comments, not free-form prose with brittle regex. ### Why naive approaches fail - Single session for all 14 files → context pollutes → file 14 contradicts file 3 (lost-in-the-middle). - Forgetting -p → workflow hangs waiting for interactive input → CI timeout after 6 hours, no review posted. - Free-form text output → next step uses regex to extract issues → brittle, breaks on every Claude wording change. 
### Definition of done - Per-file PR review fires on every pull_request event - Each file reviewed in its own claude -p session (isolated context) - Output is JSON (--output-format json); workflow parses with jq, posts comments via gh CLI - allowed_tools is an explicit list in YAML (no * wildcard) - Project context flows in via custom_instructions reading .github/claude-context.md ## Concepts in play - 🟢 **CLAUDE.md hierarchy** (`claude-md-hierarchy`), Project context injected via custom_instructions - 🟢 **Context window** (`context-window`), Per-file isolation prevents lost-in-the-middle - 🟢 **Tool calling** (`tool-calling`), Explicit allowed_tools whitelist in CI - 🟢 **Structured outputs** (`structured-outputs`), --output-format json contract - 🟢 **Subagents** (`subagents`), Each per-file session is effectively a subagent - 🟢 **Evaluation** (`evaluation`), Two-pass: per-file local then PR-level integration - 🟢 **Batch API** (`batch-api`), Overnight nightly audits for non-blocking checks - 🟢 **Hooks** (`hooks`), PreToolUse hooks for CI cost guards (deny network egress) ## Components ### GitHub Actions Workflow, .github/workflows/claude.yml Triggers on pull_request events. Authenticates with the Claude Code GitHub App. Loops over the changed files (via gh pr diff --name-only) and dispatches a per-file claude -p invocation. Owns retries, concurrency, and the eventual gh pr review post. **Configuration:** on: pull_request. Steps: actions/checkout → install Claude Code CLI → for each changed file, run claude review pr -p --output-format json --custom-instructions .github/claude-context.md. Concurrency: max 4 parallel files; the rest queue. **Concept:** `claude-md-hierarchy` ### claude -p Headless Invocation, non-interactive, one-shot per file Runs Claude Code in headless mode. The -p flag disables the interactive REPL; the agent processes a single task, emits output to stdout, and exits. No human at a keyboard, no waiting on prompts. This is the CI primitive. Without -p, the workflow hangs. **Configuration:** claude review pr -p --output-format json --custom-instructions ".github/claude-context.md" --files "src/auth/login.ts" --max-turns 6 **Concept:** `context-window` ### Per-File Session Isolation, one claude -p invocation per changed file Each changed file gets its OWN headless session. No shared context across files. This is the single biggest architectural decision: 14 isolated sessions × small context > 1 session × 14 files of accumulated context. The latter triggers lost-in-the-middle by file 8-10; the former does not. **Configuration:** Loop in workflow YAML: for f in $(gh pr diff --name-only); do claude review pr -p --files $f >> review-$f.json; done. Each file's review is independent; the parent workflow aggregates JSON. **Concept:** `subagents` ### Structured JSON Output Contract, --output-format json + jq parsing Claude emits a structured object per file: { file, verdict (approve | request_changes | comment), issues: [{ line, severity, message, suggestion? }], summary }. The workflow's next step parses with jq and posts inline comments via gh pr review --comment-line. No regex parsing of free-form prose. 
**Configuration:** Schema (per file): { file: string, verdict: 'approve'|'request_changes'|'comment', issues: [{line: int, severity: 'blocker'|'nit'|'praise', message: string, suggestion?: string}], summary: string } **Concept:** `structured-outputs` ### PR Comment Poster + Allowed-Tools Gate, gh pr review + explicit tool whitelist Final workflow step reads the aggregated JSON, runs gh pr review --request-changes --body $SUMMARY and gh pr review --comment-line N $MSG per issue. Critical: the claude -p invocation declares --allowed-tools Read,Grep,Glob,Bash (no wildcard, no Edit, no Write). CI agents never need write access to the repo; they read and report. **Configuration:** --allowed-tools "Read,Grep,Glob,Bash(git diff,git log,gh pr diff)". Wildcards in CI are a red flag. They expand the blast radius of any prompt-injection in PR content. **Concept:** `tool-calling` ## Build steps ### 1. Scaffold the GitHub Actions workflow Create .github/workflows/claude.yml. Trigger on pull_request events. Install the Claude Code GitHub App via /install-github-app (one-time, generates the OAuth token stored as secrets.CLAUDE_CODE_OAUTH_TOKEN). Checkout the PR head ref, install the CLI, then dispatch the per-file review loop. **Python:** ```python # .github/workflows/claude.yml name: Claude PR review on: pull_request: types: [opened, synchronize] jobs: review: runs-on: ubuntu-latest permissions: contents: read pull-requests: write steps: - uses: actions/checkout@v4 with: ref: ${{ github.event.pull_request.head.sha }} fetch-depth: 2 - name: Install Claude Code CLI run: npm i -g @anthropic-ai/claude-code - name: Per-file review env: CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }} GH_TOKEN: ${{ github.token }} run: ./.github/scripts/per-file-review.sh ``` **TypeScript:** ```typescript // .github/workflows/claude.yml // name: Claude PR review // on: // pull_request: // types: [opened, synchronize] // // jobs: // review: // runs-on: ubuntu-latest // permissions: // contents: read // pull-requests: write // steps: // - uses: actions/checkout@v4 // with: // ref: ${{ github.event.pull_request.head.sha }} // fetch-depth: 2 // // - name: Install Claude Code CLI // run: npm i -g @anthropic-ai/claude-code // // - name: Per-file review // env: // CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }} // GH_TOKEN: ${{ github.token }} // run: ./.github/scripts/per-file-review.sh ``` Concept: `claude-md-hierarchy` ### 2. Run claude -p with --output-format json per file Loop over the changed files (gh pr diff --name-only). For each one, run claude review in headless mode (-p), with --output-format json, --allowed-tools explicit, and --custom-instructions pointing at a markdown file that holds the project's CLAUDE.md context. Capture each file's JSON output to disk; aggregate at the end. **Python:** ```python # .github/scripts/per-file-review.sh #!/usr/bin/env bash set -euo pipefail REVIEW_DIR=$(mktemp -d) echo "files=$REVIEW_DIR" >> "$GITHUB_OUTPUT" # Per-file independent sessions. NOT one big session for f in $(gh pr diff --name-only); do echo "::group::Reviewing $f" out="$REVIEW_DIR/$(echo "$f" | tr '/' '_').json" claude review pr \ -p \ --output-format json \ --custom-instructions ".github/claude-context.md" \ --allowed-tools "Read,Grep,Glob,Bash(git diff,git log,gh pr diff)" \ --files "$f" \ --max-turns 6 \ > "$out" || echo '{"file":"'"$f"'","verdict":"comment","issues":[],"summary":"review failed"}' > "$out" echo "::endgroup::" done # Aggregate to one file the next step parses jq -s '.' 
"$REVIEW_DIR"/*.json > review-aggregate.json ``` **TypeScript:** ```typescript // .github/scripts/per-file-review.ts (Node.js variant) import { execSync } from "node:child_process"; import { writeFileSync, mkdtempSync } from "node:fs"; import { join } from "node:path"; import { tmpdir } from "node:os"; const reviewDir = mkdtempSync(join(tmpdir(), "claude-review-")); const files = execSync("gh pr diff --name-only", { encoding: "utf8" }) .trim() .split("\n") .filter(Boolean); // Per-file independent sessions. NOT one big session for (const f of files) { const out = join(reviewDir, f.replaceAll("/", "_") + ".json"); try { const json = execSync( `claude review pr -p --output-format json ` + `--custom-instructions ".github/claude-context.md" ` + `--allowed-tools "Read,Grep,Glob,Bash(git diff,git log,gh pr diff)" ` + `--files "${f}" --max-turns 6`, { encoding: "utf8" }, ); writeFileSync(out, json); } catch { writeFileSync( out, JSON.stringify({ file: f, verdict: "comment", issues: [], summary: "review failed", }), ); } } ``` Concept: `structured-outputs` ### 3. Inject project context via custom_instructions Don't pass the entire project CLAUDE.md to claude -p. Too much. Instead, create .github/claude-context.md (committed to the repo) with the CI-relevant slice: stack, code style, what counts as a blocker vs a nit, what to skip (generated files, lockfiles). The --custom-instructions flag injects this into every per-file session. **Python:** ```python # .github/claude-context.md (committed; loaded by --custom-instructions) # Project review rubric # Stack: Next.js 15 + TypeScript strict + Drizzle. We use Server Actions # instead of API routes; named exports only; 2-space indent. # # Severity rubric: # blocker = type unsafe, secret leaked, missing test for changed code, # breaks API contract, server-action without zod validation # nit = naming inconsistency, missing JSDoc, slightly verbose # praise = clever simplification, good test, well-named refactor # # Skip (do NOT review): # - pnpm-lock.yaml # - public/scenarios/*.svg (generated) # - any file under .next/, dist/, build/, node_modules/ # # Output format reminder: # Always emit JSON: { file, verdict, issues[], summary }. # verdict ∈ {approve, request_changes, comment}. # issues[].severity ∈ {blocker, nit, praise}. ``` **TypeScript:** ```typescript // .github/claude-context.md (committed; loaded by --custom-instructions) // # Project review rubric // # Stack: Next.js 15 + TypeScript strict + Drizzle. We use Server Actions // # instead of API routes; named exports only; 2-space indent. // # // # Severity rubric: // # blocker = type unsafe, secret leaked, missing test for changed code, // # breaks API contract, server-action without zod validation // # nit = naming inconsistency, missing JSDoc, slightly verbose // # praise = clever simplification, good test, well-named refactor // # // # Skip (do NOT review): // # - pnpm-lock.yaml // # - public/scenarios/*.svg (generated) // # - any file under .next/, dist/, build/, node_modules/ // # // # Output format reminder: // # Always emit JSON: { file, verdict, issues[], summary }. // # verdict ∈ {approve, request_changes, comment}. // # issues[].severity ∈ {blocker, nit, praise}. ``` Concept: `claude-md-hierarchy` ### 4. Lock allowed_tools. No wildcards in CI In CI, the agent processes content from PR authors. Including authors outside your org. That content is untrusted. Wildcard --allowed-tools '*' lets a prompt-injection in a PR body or commit message escalate to write access on your repo. 
Always declare an explicit list: Read, Grep, Glob, and a NARROW Bash whitelist. Never Edit, Write, or open-ended Bash. **Python:** ```python # WRONG. Open invitation to prompt injection # claude review pr -p --allowed-tools "*" # # WRONG. Bash with no whitelist # claude review pr -p --allowed-tools "Read,Bash" # # RIGHT. Every tool explicit, Bash narrowly scoped to read-only commands ALLOWED='Read,Grep,Glob,Bash(git diff,git log --oneline -20,gh pr diff,gh pr view)' claude review pr -p \ --allowed-tools "$ALLOWED" \ --files "$f" \ --max-turns 6 \ --output-format json # # Periodically audit: grep workflows for any "*" or unscoped Bash # rg -n 'allowed-tools.*"[^"]*\*' .github/workflows/ ``` **TypeScript:** ```typescript // WRONG. Open invitation to prompt injection // execSync('claude review pr -p --allowed-tools "*"'); // // WRONG. Bash with no whitelist // execSync('claude review pr -p --allowed-tools "Read,Bash"'); // // RIGHT. Every tool explicit, Bash narrowly scoped to read-only commands const ALLOWED = "Read,Grep,Glob,Bash(git diff,git log --oneline -20,gh pr diff,gh pr view)"; execSync( `claude review pr -p --allowed-tools "${ALLOWED}" --files "${f}" --max-turns 6 --output-format json`, ); // Periodically audit: grep workflows for any "*" or unscoped Bash // rg -n 'allowed-tools.*"[^"]*\*' .github/workflows/ ``` Concept: `tool-calling` ### 5. Parse JSON and post inline PR comments Aggregate the per-file JSON outputs, then transform into gh pr review calls. One inline comment per issue (line + body), one summary review at the end (approve / request_changes / comment based on whether any blockers fired). The gh CLI handles the GitHub REST mechanics; you just feed it structured input. **Python:** ```python # .github/scripts/post-comments.sh #!/usr/bin/env bash set -euo pipefail # Aggregate file from previous step agg=review-aggregate.json # Per-issue inline comments jq -c '.[] | .issues[] as $i | { file: .file, line: $i.line, body: $i.message }' "$agg" \ | while read -r row; do file=$(echo "$row" | jq -r .file) line=$(echo "$row" | jq -r .line) body=$(echo "$row" | jq -r .body) gh pr review --comment --body "$body" --comment-line "$line" -- "$file" done # Summary verdict blockers=$(jq '[.[] | .issues[] | select(.severity=="blocker")] | length' "$agg") if [ "$blockers" -gt 0 ]; then gh pr review --request-changes --body "$blockers blocker(s) found. See inline comments." else gh pr review --approve --body "LGTM. Inline nits/praise where applicable." fi ``` **TypeScript:** ```typescript // .github/scripts/post-comments.ts import { execSync } from "node:child_process"; import { readFileSync } from "node:fs"; interface Issue { line: number; severity: "blocker" | "nit" | "praise"; message: string; } interface FileReview { file: string; verdict: "approve" | "request_changes" | "comment"; issues: Issue[]; summary: string; } const agg: FileReview[] = JSON.parse( readFileSync("review-aggregate.json", "utf8"), ); // Per-issue inline comments for (const fr of agg) { for (const issue of fr.issues) { execSync( `gh pr review --comment --body "${issue.message}" --comment-line ${issue.line} -- "${fr.file}"`, ); } } // Summary verdict const blockers = agg.flatMap((fr) => fr.issues.filter((i) => i.severity === "blocker"), ); if (blockers.length > 0) { execSync( `gh pr review --request-changes --body "${blockers.length} blocker(s) found. See inline comments."`, ); } else { execSync( `gh pr review --approve --body "LGTM. Inline nits/praise where applicable."`, ); } ``` Concept: `structured-outputs` ### 6.
Cap concurrency and add a per-PR cost budget Per-file fan-out is parallel by default. But uncapped parallelism can exhaust the GitHub Actions concurrent-job limit and stack up token spend. Cap at ~4 parallel files. Add a per-PR token budget (env var checked at the start of each file) that aborts further reviews if the running PR would exceed the cap. Cost predictability beats marginal latency wins. **Python:** ```python # In per-file-review.sh. Bounded concurrency via xargs -P gh pr diff --name-only \ | xargs -n 1 -P 4 -I {} bash -c ' file="$1" out="$REVIEW_DIR/$(echo "$file" | tr / _).json" claude review pr -p --output-format json --files "$file" \ --custom-instructions ".github/claude-context.md" \ --allowed-tools "Read,Grep,Glob,Bash(git diff,git log,gh pr diff)" \ --max-turns 6 > "$out" ' _ {} # Per-PR token budget guard (run before the loop) TOKEN_BUDGET=${CLAUDE_PR_TOKEN_BUDGET:-200000} estimated=$(gh pr diff --name-only | wc -l | awk -v b="$TOKEN_BUDGET" '{ print int(b / $1) }') if [ "$estimated" -lt 5000 ]; then echo "::warning::PR has too many files for budget $TOKEN_BUDGET; consider splitting" exit 0 fi ``` **TypeScript:** ```typescript // In per-file-review.ts. Bounded concurrency via simple semaphore const MAX_PARALLEL = 4; let inFlight = 0; const queue: Array<() => Promise<void>> = files.map((f) => async () => { // ...claude review for f... }); async function run() { while (queue.length || inFlight) { while (inFlight < MAX_PARALLEL && queue.length) { const job = queue.shift()!; inFlight++; job().finally(() => { inFlight--; }); } await new Promise((r) => setTimeout(r, 50)); } } // Per-PR token budget guard const TOKEN_BUDGET = Number(process.env.CLAUDE_PR_TOKEN_BUDGET ?? 200_000); const estimatedPerFile = Math.floor(TOKEN_BUDGET / files.length); if (estimatedPerFile < 5000) { console.warn( `::warning::PR has too many files for budget ${TOKEN_BUDGET}; consider splitting`, ); process.exit(0); } await run(); ``` Concept: `context-window` ### 7. Use Batch API for nightly audits, sync API for blocking review PR review is blocking. The developer is waiting; sync API is the right call. But you also want a nightly audit pass (drift detection, security regression scan) that doesn't need to finish in minutes. That's where the Batch API earns its 50% discount: submit overnight, results in 24h, review the next morning. Two different APIs for two different latency budgets. **Python:** ```python # Sync API. Pre-merge blocking review (latency matters) # (this is what the per-file-review.sh script above uses) # Batch API. Overnight audit job (latency doesn't matter, cost does) import anthropic, json client = anthropic.Anthropic() # Build a batch from yesterday's merged PRs requests = [] for pr in get_merged_prs(since="24h"): for f in pr.changed_files: requests.append({ "custom_id": f"audit-{pr.number}-{f.path}", "params": { "model": "claude-sonnet-4.5", "max_tokens": 1024, "messages": [{"role": "user", "content": f.diff}], }, }) batch = client.messages.batches.create(requests=requests) print(f"Submitted batch {batch.id}; will be ready in ~24h") # Tomorrow morning, fetch results, write to a Slack #audit-drift channel. ``` **TypeScript:** ```typescript // Sync API. Pre-merge blocking review (latency matters) // (this is what the per-file-review.ts script above uses) // Batch API.
Overnight audit job (latency doesn't matter, cost does) import Anthropic from "@anthropic-ai/sdk"; const client = new Anthropic(); interface MergedPr { number: number; changed_files: Array<{ path: string; diff: string }>; } declare function getMergedPrs(opts: { since: string }): MergedPr[]; const requests = []; for (const pr of getMergedPrs({ since: "24h" })) { for (const f of pr.changed_files) { requests.push({ custom_id: `audit-${pr.number}-${f.path}`, params: { model: "claude-sonnet-4.5", max_tokens: 1024, messages: [{ role: "user", content: f.diff }], }, }); } } const batch = await client.messages.batches.create({ requests }); console.log(`Submitted batch ${batch.id}; will be ready in ~24h`); // Tomorrow morning, fetch results, write to a Slack #audit-drift channel. ``` Concept: `batch-api` ### 8. Add a CI cost-guard hook + alerting Wrap the claude -p invocation in a PreToolUse hook that aborts the review if the PR is over a token-budget threshold (e.g. >100K tokens of diff). This protects you from a runaway 10-million-line PR exhausting your monthly Claude budget in one CI run. Pair with a workflow alert to a Slack channel when the hook fires. **Python:** ```python # .claude/hooks/ci_cost_guard.py import sys, json, subprocess MAX_TOKENS_PER_PR = 200_000 def main(): payload = json.loads(sys.stdin.read()) if payload["tool_name"] != "claude_review_pr": sys.exit(0) diff = subprocess.check_output(["gh", "pr", "diff"], text=True) estimated = len(diff) // 4 # ~4 chars per token rule of thumb if estimated > MAX_TOKENS_PER_PR: print( f"PR exceeds token budget: ~{estimated} tokens > " f"{MAX_TOKENS_PER_PR}. Split the PR or raise the limit " f"deliberately.", file=sys.stderr, ) sys.exit(2) # DENY sys.exit(0) ``` **TypeScript:** ```typescript // .claude/hooks/ci-cost-guard.ts import { readFileSync } from "node:fs"; import { execSync } from "node:child_process"; const MAX_TOKENS_PER_PR = 200_000; const payload = JSON.parse(readFileSync(0, "utf8")); if (payload.tool_name !== "claude_review_pr") process.exit(0); const diff = execSync("gh pr diff", { encoding: "utf8" }); const estimated = Math.floor(diff.length / 4); if (estimated > MAX_TOKENS_PER_PR) { process.stderr.write( `PR exceeds token budget: ~${estimated} tokens > ${MAX_TOKENS_PER_PR}. ` + `Split the PR or raise the limit deliberately.\n`, ); process.exit(2); // DENY } process.exit(0); ``` Concept: `hooks` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Reviewing a 14-file PR | Per-file independent claude -p sessions, aggregated at the end | One claude -p session reviewing all 14 files together | Single-session review hits lost-in-the-middle by file 8-10; conventions established on file 3 drop out of attention by file 14. Per-file isolation eliminates that failure mode entirely. | | Output format from claude -p in CI | --output-format json (parsed with jq → gh pr review) | Free-form prose, parsed downstream with regex | Regex over prose is brittle. Every Claude wording change breaks the pipeline. JSON contract is stable and the SDK guarantees the shape. | | Tool access in CI | Explicit --allowed-tools list (Read, Grep, Glob, narrow Bash) | Wildcard --allowed-tools '*' or unscoped Bash | PR content is untrusted (external contributors, prompt injection vectors). Wildcards expand the blast radius; explicit whitelists cap it. CI agents never need write access. They read and report. 
| | Pre-merge review vs nightly audit | Sync API for blocking pre-merge; Batch API for non-blocking overnight | Use one API for both | Pre-merge review is latency-critical (developer waiting); sync API is right. Nightly audit is cost-sensitive but not latency-bound; Batch API at 50% discount earns its keep. Two budgets, two APIs. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-15 · Same session for all files | Workflow runs ONE claude -p over all 14 changed files. By file 14, lost-in-the-middle has dropped the conventions established on file 3. Inline comments on file 14 contradict the comments on file 3. | Per-file independent sessions. Loop the 14 files, run one claude -p invocation per file, aggregate the JSON outputs at the end. Each file gets a fresh, focused context. | | AP-16 · Forgot the -p flag | Workflow invokes claude review without -p. The CLI starts in interactive mode, waits for input, and the GitHub Actions runner times out after 6 hours with no review posted. | Always pass -p for non-interactive headless execution. CI hangs are silent failures; the -p flag is what makes Claude Code CI-safe in the first place. | | AP-17 · Unstructured text output | Workflow asks Claude to 'review the file and post inline comments'. Output is free-form prose. Next step uses regex to extract issues. Every Claude phrasing change breaks the regex; the workflow silently posts nothing. | Always pass --output-format json. The output is a structured contract: { file, verdict, issues[], summary }. Workflow parses with jq and posts via gh pr review --comment-line. | | AP-18 · Wildcard --allowed-tools in CI | Workflow uses --allowed-tools '*' for convenience. A prompt-injection in a PR description tricks the agent into running Bash(rm -rf .) or writing a malicious file. Repo state corrupted; PR reviewer can't tell what happened. | Always declare an explicit allowed-tools list: Read,Grep,Glob,Bash(git diff,git log,gh pr diff). CI agents never need write tools. Wildcards expand the prompt-injection blast radius; explicit lists cap it. | | AP-19 · No project context in CI | claude -p runs without --custom-instructions. Claude doesn't know the project's stack, code style, or what counts as a blocker. Reviews are generic; flags style-correct code as 'should use named exports' in a default-exports codebase. | Commit .github/claude-context.md with the CI-relevant slice of CLAUDE.md (stack, severity rubric, files to skip). Pass it via --custom-instructions .github/claude-context.md on every claude -p invocation. | ## Implementation checklist - [ ] .github/workflows/claude.yml exists; triggers on pull_request opened+synchronize - [ ] Claude Code GitHub App installed; CLAUDE_CODE_OAUTH_TOKEN in repo secrets - [ ] Workflow runs claude review with -p (headless) on every invocation (`context-window`) - [ ] Per-file loop. 
Each changed file in its own claude -p session (`subagents`) - [ ] --output-format json on every invocation; jq parses the aggregate (`structured-outputs`) - [ ] --allowed-tools is an explicit list (no wildcards, no Edit/Write) (`tool-calling`) - [ ] .github/claude-context.md committed; passed via --custom-instructions (`claude-md-hierarchy`) - [ ] Concurrency capped (xargs -P 4 or simple JS semaphore) - [ ] PreToolUse cost-guard hook denies PRs over the token budget (`hooks`) - [ ] Nightly audit job uses Batch API (50% discount, 24h SLA) (`batch-api`) - [ ] PR comments posted via gh pr review --comment-line + summary verdict ## Cost & latency - **Per-PR review (avg 8 files, 4 parallel):** ~$0.05-0.12 per PR, 8 files × ~10K input tokens (file diff + context-md) + ~2K output (JSON verdict) ≈ $0.04-0.10 + parallel overhead. Cap at ~$0.15 with the cost-guard hook to bound runaway 50-file PRs. - **Nightly audit (Batch API, ~50 PRs/day):** ~$0.50-1.20 per night, 50 PRs × 8 files × ~5K tokens (audit-only, lighter prompt) at Batch API 50% discount. Result ready next morning; no developer waiting. - **Custom-instructions caching:** ~30% savings on warm cache, .github/claude-context.md is stable across all per-file invocations within a single PR. Mark it cache_control: ephemeral; 5-min TTL keeps it warm across files in one workflow run. - **p95 PR-review latency:** ~45-90 seconds for 8-file PR, Per-file ~8-15s × 4 parallel = ~30s + JSON aggregation + gh pr review posts. Fast enough that the developer doesn't context-switch away while waiting. - **Cost-guard hook savings:** Prevents $X.XX runaways on outlier PRs, A 200-file PR (rare but real. E.g. lockfile bumps) without the hook would burn ~$3-6 in one workflow run. The hook denies before the loop starts; cost reverts to $0. ## Domain weights - **D3 · Agent Operations (20%):** claude -p flag + custom_instructions + .github/workflows/claude.yml + per-file isolation - **D2 · Tool Design + Integration (18%):** Explicit allowed_tools whitelist + structured JSON output contract + Batch API split ## Practice questions ### Q1. A GitHub Actions CI pipeline runs Claude Code to review PRs. It processes all 14 modified files in a SINGLE claude -p session. After 8 files the output becomes repetitive and misses issues. By file 14, an inline comment contradicts a decision made on file 3. Why? Lost-in-the-middle effect across the 14-file context. Single-session review accumulates context; by file 8-10 attention dilutes and conventions established on file 3 drop out. The fix is per-file independent sessions: loop the changed files, run one claude -p invocation per file, aggregate the JSON outputs at the end. Each file gets fresh, focused context. Tagged to AP-15. ### Q2. Your CI/CD workflow hangs when calling claude review pr.md. The GitHub Actions runner eventually times out after 6 hours with no review posted. What flag is missing? The -p flag. Without it, Claude Code starts in interactive mode and waits for human input. Which never arrives in CI, so the runner just hangs. -p (or --print) puts the CLI into headless one-shot mode: it processes a single task, emits output to stdout, and exits. Always pass -p in any non-interactive context. Tagged to AP-16. ### Q3. A GitHub Action posts inline review comments on PR lines. Claude outputs free-form prose; the workflow uses regex to extract issues, line numbers, and severities. After every Claude wording update, the regex breaks and reviews silently fail. What architectural change fixes this for good? 
Pass --output-format json. Claude emits a structured object per file: { file, verdict, issues: [{ line, severity, message, suggestion? }], summary }. The workflow parses with jq and posts via gh pr review --comment-line. Regex over prose is brittle by construction; structured-output contracts are the only durable answer. Tagged to AP-17. ### Q4. Your CI/CD pipeline uses --allowed-tools '*' because the team didn't want to maintain a list. A PR description includes a prompt-injection that tricks Claude into running Bash(rm -rf .). Repo state corrupted. What's the rule? No wildcards in CI. Ever. PR content (description, commit messages, diff, file contents) is untrusted. An external contributor or a prompt-injection in any of those vectors can escalate. Always declare an explicit list: --allowed-tools "Read,Grep,Glob,Bash(git diff,git log,gh pr diff,gh pr view)". CI review agents never need Edit, Write, or unscoped Bash; they read and report. Tagged to AP-18. ### Q5. Your Claude Code CI action runs without project context. Reviews flag style-correct code as 'should use named exports' in a default-exports codebase. What YAML field provides the project-specific rubric? --custom-instructions pointing at a committed markdown file (e.g. .github/claude-context.md). The file documents the project's stack, severity rubric (what counts as a blocker vs nit), and files to skip (lockfiles, generated assets). Don't pass the entire CLAUDE.md. Too much context. Pass the CI-relevant slice. Tagged to AP-19. ## FAQ ### Q1. Is claude -p the same as the SDK? Functionally similar; ergonomically different. The SDK is for code that needs programmatic control (custom orchestration, response streaming, tool definitions in code). claude -p is for shell-driven workflows (CI, cron, dev scripts) where the input is a prompt and the output is a structured response. CI workflows almost always want -p; bespoke automation almost always wants the SDK. ### Q2. What's the maximum file count per PR before this approach breaks down? No hard limit, but ~30 files is the ergonomic ceiling for blocking pre-merge review. Past that, sync-API latency adds up (~30s × ceil(N/4) parallel). For 30+ file PRs, switch to Batch-API audit (results next morning) and use the sync API only for files that touch security-critical paths (auth/, payments/, infra/). Two-track review. ### Q3. Can I run claude -p on the same PR every time it's updated? Yes. That's the synchronize event. The workflow trigger should be on: pull_request: types: [opened, synchronize]. Synchronize fires on every push to the PR head ref. Idempotency: each run reviews the *current* diff, so old comments stay until they're stale; if you want to dismiss outdated reviews, add a gh pr review --dismiss step that targets reviews on commits no longer at HEAD. ### Q4. How do I keep the cost predictable if the team merges 100+ PRs a day? Three levers, in priority order: (1) PreToolUse cost-guard hook that denies any single PR over a token budget. Protects from outliers; (2) --max-turns capped at 4-6. Bounds the worst case per file; (3) Concurrency cap (xargs -P 4 or JS semaphore). Bounds parallel Claude API calls in flight. With all three, monthly cost variance stays inside ±10%. ### Q5. Should the CI workflow have write access to the repo? No. Read-only on the repo, write-only on PR comments and reviews. GitHub Actions permissions: block: contents: read, pull-requests: write. The CI agent reads code, runs gh pr diff, posts comments. It never needs to push commits. 
Same logic that bans write tools in --allowed-tools applies at the GitHub permission layer. ### Q6. What's in .github/claude-context.md vs the project's main CLAUDE.md? The CI slice, not the whole thing. Main CLAUDE.md targets developers running Claude Code interactively. It covers full stack, conventions, examples, troubleshooting (~300-500 lines). .github/claude-context.md is the trimmed CI rubric: stack one-liner, severity definitions, files to skip, output-format reminder (~40-80 lines). Smaller context = faster review + cheaper tokens. ### Q7. Can I run claude -p reviews with custom Skills? Yes, and you should. Create .claude/skills/code-reviewer/SKILL.md with the team's review rubric, allowed tools, and output schema. The CI workflow invokes the Skill explicitly: claude review pr -p --skill code-reviewer .... Skills are version-controlled (live in the repo), so the rubric evolves with the codebase and CI behaviour stays in sync. ## Production readiness - [ ] .github/workflows/claude.yml committed and triggers on pull_request - [ ] Smoke test: open a 1-file PR; review posts within 60s with structured comments - [ ] Stress test: open a 30-file PR; concurrency cap prevents runner exhaustion - [ ] Cost-guard test: open a synthetic 200-file PR; hook denies before loop starts - [ ] Schema lint: validate every JSON output against the file-review schema - [ ] Permissions audit: workflow permissions: block is contents:read + pull-requests:write only - [ ] Tools audit: no --allowed-tools '*' anywhere in .github/workflows/ - [ ] Nightly audit Batch API job runs and posts to #audit-drift on completion --- **Source:** https://claudearchitectcertification.com/scenarios/claude-code-for-cicd **Vault sources:** ACP-T05 §Scenario 5 (5 ✅/❌ pairs · official guide scenario); ACP-T08 §3.5 metadata; Course 04 Claude Code in Action. Lesson 12 GitHub integration; ACP-T06 (5 practice Qs tagged to components); ACP-T07 §Lab 5 spec (claude -p, JSON output, independent sessions); GAI-K05 CCA exam questions and scenarios **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Agent Skills for Enterprise KM > An enterprise-scale Skills registry. Each Skill is a markdown file with frontmatter (name + version + description + tags + dependencies + access_level), stored in .claude/skills/{team}/{name}.md so naming collisions become structural impossibilities. The registry indexer rebuilds on every commit, the search service surfaces the right Skill from 200+, semver gates breaking changes, and a permission-aware layer enforces ACLs before invocation (support agents cannot invoke finance Skills, no matter how cleverly prompted). Empirically confirmed on the real CCA-F exam by multiple pass-takers as one of the highest-leverage beyond-guide scenarios. **Sub-marker:** P3.11 **Domains:** D3 · Agent Operations, D2 · Tool Design + Integration **Exam weight:** 38% of CCA-F (D3 + D2) **Build time:** 26 minutes **Source:** 🟢 Beyond-guide scenario · empirically witnessed on the real CCA-F exam **Canonical:** https://claudearchitectcertification.com/scenarios/agent-skills-for-enterprise-km **Last reviewed:** 2026-05-04 ## In plain English Think of this as the way a 5,000-person company stops re-inventing the same agent prompt fifteen times. Every team writes its own Skills. Refund handling, expense reporting, deployment runbooks. And they all live in one shared library, organised by team and version, just like a code repo. 
When an agent on any team needs to do something, it searches the library, finds the right Skill, checks that the user is allowed to use it (finance Skills are not for the support team), and runs it. The whole point is that knowledge gets re-used safely at enterprise scale, not copy-pasted into a hundred system prompts. ## Exam impact Domain 3 (Claude Code Configuration, 20%) tests Skills frontmatter, namespace conventions, and dependency resolution. Domain 2 (Tool Design, 18%) tests permission-aware invocation and the search-service contract. Confirmed on the real exam by two independent pass-takers (per ACP-T05 §Scenario 11 catalog). One of the two beyond-guide scenarios with the highest known exam frequency. Drilling this scenario lifts pass probability materially. ## The problem ### What the customer needs - One source of truth across 15 teams. No copy-pasted Skill prompts drifting in 15 different repos. - Discoverable at enterprise scale. An agent on the marketing team finds the right finance Skill in seconds, not hours. - Permission-aware. Finance's budget-approval Skill must be unreachable from support's agent, no matter how the support agent is prompted. ### Why naive approaches fail - 200+ Skills in one flat folder → collision week 1 (refund-resolver exists in support/, growth/, AND finance/, all mean different things). - No semver → v2 silently breaks v1 callers when frontmatter shape changes; agents start failing silently across the org. - No ACL → support agent invokes finance/budget-approval because the Skill description sounded relevant; policy violation at scale. ### Definition of done - Naming collision rate = 0 (team namespace prefix enforced) - Breaking-change incidents = 0 (semver in frontmatter, callers pin major) - Cross-team unauthorized invocation rate = 0 (ACL check before execution) - Skill-discovery p95 latency < 200ms (embeddings or full-text index) - Reindex SLA < 60s from commit to searchable ## Concepts in play - 🟢 **Skills** (`skills`), Markdown + frontmatter as the unit of reusable knowledge - 🟢 **Project memory** (`claude-md-hierarchy`), Skills extend project-level CLAUDE.md across teams - 🟢 **Tool calling** (`tool-calling`), Skill invocation as a structured tool call - 🟢 **Attention engineering** (`attention-engineering`), Frontmatter routes the LLM to the right Skill - 🟢 **Evaluation** (`evaluation`), Pre-execution ACL check is a gate, not a heuristic - 🟢 **Context window** (`context-window`), Search returns top-k Skills, not all 200+ - 🟢 **Subagents** (`subagents`), Some Skills spawn isolated subagents internally - 🟢 **Structured outputs** (`structured-outputs`), Skill frontmatter is the contract ## Components ### Skill Definition File, .claude/skills/{team}/{name}.md The unit of enterprise knowledge. Markdown body holds the instructions; YAML frontmatter holds the metadata the registry indexes (name, version, description, tags, depends_on, access_level). Lives in version control next to code, reviewed via PR like any other team artifact. **Configuration:** Path convention: .claude/skills/{team}/{name}.md. Frontmatter required: name, version (semver), description, tags, depends_on, access_level. Body: the actual prompt + examples. Reviewed in PRs. **Concept:** `skills` ### Shared Registry & Indexer, rebuilds on every commit A CI job that walks .claude/skills//*.md, parses frontmatter, validates schema, builds a searchable index, and publishes it to the registry service. Idempotent. Fast (sub-minute on 500 Skills). 
When a Skill commit lands, the index is fresh within 60s and the new version is discoverable. **Configuration:** Triggered on push to main. Steps: glob skills, parse YAML, validate (semver, ACL, deps exist), upload index to registry. Reindex SLA: <60s. Failed parses fail the CI; bad Skills never reach the registry. **Concept:** `structured-outputs` ### Search Service, embeddings-based at scale Indexes Skill descriptions + tags + frontmatter. Agents query in natural language ('find a Skill for processing customer refunds') and get the top-k matches with their metadata. Full-text works at <50 Skills; embeddings (OpenAI / Voyage) become essential past 100; org-wide deployments use a hybrid (embeddings for recall, full-text for precision). **Configuration:** POST /search { query, k=5, filters: { team?, access_level?, tag? } } → [{slug, version, description, score}]. Latency p95 < 200ms. Cache embeddings keyed by (skill_slug, content_hash); recompute only on content change. **Concept:** `context-window` ### Git-Based Versioning, semver in frontmatter + Git tags Every Skill carries a semver version in its frontmatter; every release tags Git so older versions stay reachable. Callers pin a MAJOR version (refund-resolver:v1.x); the registry serves the latest patch within that major. Breaking changes bump the major; old callers keep working until they migrate. **Configuration:** Frontmatter: version: 1.2.3. Caller: depends_on: ['support/refund-resolver:1.x']. Registry resolves to latest patch within pinned major. Deprecated versions stay queryable for 6 months before archive. **Concept:** `tool-calling` ### Access Control Layer, permission-aware invocation Sits between Skill discovery and Skill execution. Reads the calling agent's role + the Skill's access_level (public | team | role-restricted | sensitive). Denies invocation when the agent's role isn't in the allowlist. Returns a structured permission-denied error. The agent observes it and can request access via the org's standard flow, not bypass it. **Configuration:** Pre-invocation: { agent_role, skill_acl } → { allowed: bool, reason }. ACL stored in frontmatter access_level + team-level org config. Denied: structured error { code: 'ACL_DENIED', skill, reason, request_url }. **Concept:** `evaluation` ## Build steps ### 1. Lay out the team-namespaced directory Create .claude/skills/{team}/{name}.md per team. Even on day one with 5 Skills, namespace from the start. Retrofitting a flat layout into namespaces at 100 Skills is painful. The directory IS the registry's source of truth. 
**Python:** ```python # Repository layout # .claude/ # └── skills/ # ├── support/ # │ ├── refund-resolver.md # support/refund-resolver # │ └── escalation-router.md # support/escalation-router # ├── platform/ # │ ├── deploy-runbook.md # │ └── incident-triage.md # ├── data/ # │ ├── query-builder.md # │ └── pii-redactor.md # └── finance/ # └── budget-approval.md # access_level: sensitive # Bootstrap script for a fresh repo import os TEAMS = ["support", "platform", "data", "growth", "finance"] for t in TEAMS: os.makedirs(f".claude/skills/{t}", exist_ok=True) with open(f".claude/skills/{t}/.gitkeep", "w") as f: pass print("namespace-by-team layout ready; commit and start authoring.") ``` **TypeScript:** ```typescript // Repository layout // .claude/ // └── skills/ // ├── support/ // │ ├── refund-resolver.md // support/refund-resolver // │ └── escalation-router.md // support/escalation-router // ├── platform/ // │ ├── deploy-runbook.md // │ └── incident-triage.md // ├── data/ // │ ├── query-builder.md // │ └── pii-redactor.md // └── finance/ // └── budget-approval.md // access_level: sensitive // Bootstrap script for a fresh repo import { mkdirSync, writeFileSync } from "node:fs"; const teams = ["support", "platform", "data", "growth", "finance"]; for (const t of teams) { mkdirSync(`.claude/skills/${t}`, { recursive: true }); writeFileSync(`.claude/skills/${t}/.gitkeep`, ""); } console.log("namespace-by-team layout ready; commit and start authoring."); ``` Concept: `skills` ### 2. Define the Skill frontmatter schema Every Skill carries the same YAML frontmatter shape, validated by the indexer. Required: name, version (semver), description, tags, access_level. Optional: depends_on, deprecated, owners. Schema lives in the repo so PRs that break it fail CI before merging. **Python:** ```python # .claude/skills/_schema.yaml. The frontmatter contract # Validated by the indexer; PRs that violate this schema fail CI. required: - name # team/skill-name (e.g. support/refund-resolver) - version # semver: MAJOR.MINOR.PATCH - description # 1-2 sentence description, search-indexed - tags # array, search-indexed - access_level # public | team | role-restricted | sensitive optional: - depends_on # ['support/case-facts:1.x', ...] - deprecated # 'use support/refund-resolver-v2 instead' - owners # ['@support-team', '@jane.doe'] # Example skill. Support/refund-resolver.md --- name: support/refund-resolver version: 1.2.3 description: | Resolves customer refund requests up to $500 using the case-facts block and escalation queue. For amounts above cap, escalates. tags: [refund, customer-support, payment] access_level: team depends_on: - support/case-facts:1.x - shared/escalation-queue:2.x owners: - "@support-team" --- # Body: the actual instructions and examples ... ``` **TypeScript:** ```typescript // .claude/skills/_schema.yaml. The frontmatter contract // Validated by the indexer; PRs that violate this schema fail CI. // // required: // - name // team/skill-name (e.g. support/refund-resolver) // - version // semver: MAJOR.MINOR.PATCH // - description // 1-2 sentence description, search-indexed // - tags // array, search-indexed // - access_level // public | team | role-restricted | sensitive // // optional: // - depends_on // ['support/case-facts:1.x', ...] // - deprecated // 'use support/refund-resolver-v2 instead' // - owners // ['@support-team', '@jane.doe'] // Example skill. 
Support/refund-resolver.md // --- // name: support/refund-resolver // version: 1.2.3 // description: | // Resolves customer refund requests up to $500 using the case-facts // block and escalation queue. For amounts above cap, escalates. // tags: [refund, customer-support, payment] // access_level: team // depends_on: // - support/case-facts:1.x // - shared/escalation-queue:2.x // owners: // - "@support-team" // --- // // # Body: the actual instructions and examples ... ``` Concept: `structured-outputs` ### 3. Build the registry indexer A CI job walks .claude/skills//*.md, parses each Skill's frontmatter, validates the schema, resolves dependencies, and writes a searchable index. Runs on every push to main; reindex SLA <60s on 500 Skills. Bad Skills (broken schema, missing dep, semver violation) fail the CI. They never reach the registry. **Python:** ```python # scripts/index_skills.py. Runs in CI on push to main import yaml, json, glob, sys, hashlib, semver from pathlib import Path REQUIRED = {"name", "version", "description", "tags", "access_level"} ACCESS_LEVELS = {"public", "team", "role-restricted", "sensitive"} def parse(path: Path) -> dict: text = path.read_text() if not text.startswith("---"): raise ValueError(f"{path}: missing frontmatter") _, fm, body = text.split("---", 2) meta = yaml.safe_load(fm) missing = REQUIRED - set(meta) if missing: raise ValueError(f"{path}: missing keys: {missing}") if meta["access_level"] not in ACCESS_LEVELS: raise ValueError(f"{path}: bad access_level: {meta['access_level']}") semver.VersionInfo.parse(meta["version"]) # raises if invalid meta["body_hash"] = hashlib.sha256(body.encode()).hexdigest()[:12] meta["path"] = str(path) return meta def build_index() -> list[dict]: skills = [parse(Path(p)) for p in glob.glob(".claude/skills/**/*.md", recursive=True)] # Resolve dependencies. Every depends_on must exist names = {s["name"] for s in skills} for s in skills: for dep in s.get("depends_on", []): dep_name = dep.split(":")[0] if dep_name not in names: raise ValueError(f"{s['name']}: missing dep {dep_name}") return skills if __name__ == "__main__": try: index = build_index() Path("dist/skill-registry.json").write_text(json.dumps(index, indent=2)) print(f"indexed {len(index)} skills; pushed to registry") except ValueError as e: print(f"::error::{e}", file=sys.stderr) sys.exit(1) ``` **TypeScript:** ```typescript // scripts/index-skills.ts. 
Runs in CI on push to main import { readFileSync, writeFileSync } from "node:fs"; import { glob } from "glob"; import { parse as parseYaml } from "yaml"; import { createHash } from "node:crypto"; import semver from "semver"; const REQUIRED = ["name", "version", "description", "tags", "access_level"]; const ACCESS_LEVELS = ["public", "team", "role-restricted", "sensitive"]; interface Skill { name: string; version: string; description: string; tags: string[]; access_level: string; depends_on?: string[]; deprecated?: string; owners?: string[]; body_hash: string; path: string; } function parse(path: string): Skill { const text = readFileSync(path, "utf8"); if (!text.startsWith("---")) throw new Error(`${path}: missing frontmatter`); const [, fm, body] = text.split("---", 3); const meta = parseYaml(fm) as Record<string, unknown>; for (const k of REQUIRED) { if (!(k in meta)) throw new Error(`${path}: missing key ${k}`); } if (!ACCESS_LEVELS.includes(meta.access_level as string)) { throw new Error(`${path}: bad access_level: ${meta.access_level}`); } if (!semver.valid(meta.version as string)) { throw new Error(`${path}: invalid semver ${meta.version}`); } const body_hash = createHash("sha256").update(body).digest("hex").slice(0, 12); return { ...meta, body_hash, path } as Skill; } async function buildIndex(): Promise<Skill[]> { const paths = await glob(".claude/skills/**/*.md"); const skills = paths.map((p) => parse(p)); const names = new Set(skills.map((s) => s.name)); for (const s of skills) { for (const dep of s.depends_on ?? []) { const depName = dep.split(":")[0]; if (!names.has(depName)) throw new Error(`${s.name}: missing dep ${depName}`); } } return skills; } try { const index = await buildIndex(); writeFileSync("dist/skill-registry.json", JSON.stringify(index, null, 2)); console.log(`indexed ${index.length} skills; pushed to registry`); } catch (e) { console.error(`::error::${(e as Error).message}`); process.exit(1); } ``` Concept: `structured-outputs` ### 4. Add semantic search over the registry At <50 Skills, full-text on description+tags is enough. Past 100, agents need to discover by intent rather than keyword (a query like 'help a customer get their money back' should still surface refund-resolver even though it never contains the word 'refund'). Embeddings + vector index over Skill description+tags is the play; cache embeddings keyed by body_hash so re-embedding only fires on content change.
**Python:** ```python # scripts/search_service.py from anthropic import Anthropic import numpy as np, json from pathlib import Path # Index loaded from registry SKILLS = json.loads(Path("dist/skill-registry.json").read_text()) # Embedding cache keyed by (skill name, body_hash) _emb_cache: dict[tuple[str, str], list[float]] = {} def embed_text(text: str) -> list[float]: """Stand-in for any embeddings provider (Voyage, OpenAI, etc.).""" # In production, batch-embed once at index time and cache: # client.embed(text, model="voyage-2-large") raise NotImplementedError def index_skill(s: dict): key = (s["name"], s["body_hash"]) if key not in _emb_cache: text = f"{s['description']} {' '.join(s['tags'])}" _emb_cache[key] = embed_text(text) def cosine(a, b): a, b = np.array(a), np.array(b) return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))) def search(query: str, k: int = 5, team: str | None = None, access_level: str | None = None) -> list[dict]: qv = embed_text(query) scored = [] for s in SKILLS: if team and not s["name"].startswith(f"{team}/"): continue if access_level and s["access_level"] != access_level: continue index_skill(s) sim = cosine(qv, _emb_cache[(s["name"], s["body_hash"])]) scored.append((sim, s)) scored.sort(key=lambda x: -x[0]) return [ {**s, "score": round(score, 3)} for score, s in scored[:k] ] # Example # results = search("handle a customer refund up to $500", k=3, team="support") ``` **TypeScript:** ```typescript // scripts/search-service.ts import { readFileSync } from "node:fs"; interface Skill { name: string; version: string; description: string; tags: string[]; access_level: string; body_hash: string; } const SKILLS: Skill[] = JSON.parse( readFileSync("dist/skill-registry.json", "utf8"), ); const embCache = new Map<string, number[]>(); async function embedText(text: string): Promise<number[]> { // Stand-in for Voyage / OpenAI / etc. // In production: batch-embed at index time and cache. throw new Error("not implemented"); } async function indexSkill(s: Skill) { const key = `${s.name}|${s.body_hash}`; if (!embCache.has(key)) { const text = `${s.description} ${s.tags.join(" ")}`; embCache.set(key, await embedText(text)); } } function cosine(a: number[], b: number[]): number { let dot = 0, na = 0, nb = 0; for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; } return dot / (Math.sqrt(na) * Math.sqrt(nb)); } export async function search( query: string, opts: { k?: number; team?: string; access_level?: string } = {}, ) { const { k = 5, team, access_level } = opts; const qv = await embedText(query); const scored: Array<{ score: number; skill: Skill }> = []; for (const s of SKILLS) { if (team && !s.name.startsWith(`${team}/`)) continue; if (access_level && s.access_level !== access_level) continue; await indexSkill(s); const sim = cosine(qv, embCache.get(`${s.name}|${s.body_hash}`)!); scored.push({ score: sim, skill: s }); } scored.sort((a, b) => b.score - a.score); return scored.slice(0, k).map(({ score, skill }) => ({ ...skill, score: Math.round(score * 1000) / 1000, })); } ``` Concept: `context-window` ### 5. Pin versions on every dependency edge Every depends_on in a Skill's frontmatter pins a MAJOR version (support/case-facts:1.x), not a fixed PATCH. The registry resolves to the latest PATCH within the pinned major. When case-facts ships a breaking change, it bumps to v2. Old callers continue against v1.x; new callers opt in to v2 explicitly. This is exactly how pip / npm work, applied to Skills. **Python:** ```python # scripts/resolve_deps.py.
Given a Skill, resolve its depends_on graph import json, semver from pathlib import Path SKILLS = json.loads(Path("dist/skill-registry.json").read_text()) INDEX = {s["name"]: [] for s in SKILLS} for s in SKILLS: INDEX[s["name"]].append(s) for name in INDEX: INDEX[name].sort(key=lambda s: semver.VersionInfo.parse(s["version"])) def resolve(spec: str) -> dict: """spec: 'team/skill:1.x' or 'team/skill:>=2.0.0 <3.0.0'.""" name, _, constraint = spec.partition(":") versions = INDEX.get(name, []) if not versions: raise LookupError(f"unknown skill: {name}") if constraint.endswith(".x"): major = int(constraint.split(".")[0]) candidates = [ v for v in versions if semver.VersionInfo.parse(v["version"]).major == major ] else: candidates = [ v for v in versions if semver.match(v["version"], constraint) ] if not candidates: raise LookupError(f"{name}: no version satisfies {constraint}") return candidates[-1] # latest matching def topo_resolve(skill_spec: str, seen: set | None = None) -> list[dict]: """Resolve full dep graph in topological order.""" seen = seen or set() skill = resolve(skill_spec) if skill["name"] in seen: return [] seen.add(skill["name"]) out = [] for dep_spec in skill.get("depends_on", []): out.extend(topo_resolve(dep_spec, seen)) out.append(skill) return out ``` **TypeScript:** ```typescript // scripts/resolve-deps.ts import { readFileSync } from "node:fs"; import semver from "semver"; interface Skill { name: string; version: string; depends_on?: string[]; [k: string]: unknown; } const SKILLS: Skill[] = JSON.parse( readFileSync("dist/skill-registry.json", "utf8"), ); const INDEX = new Map(); for (const s of SKILLS) { if (!INDEX.has(s.name)) INDEX.set(s.name, []); INDEX.get(s.name)!.push(s); } for (const versions of INDEX.values()) { versions.sort((a, b) => semver.compare(a.version, b.version)); } export function resolve(spec: string): Skill { // spec: 'team/skill:1.x' or 'team/skill:>=2.0.0 <3.0.0' const [name, constraint] = spec.split(":"); const versions = INDEX.get(name); if (!versions) throw new Error(`unknown skill: ${name}`); let candidates: Skill[]; if (constraint?.endsWith(".x")) { const major = Number(constraint.split(".")[0]); candidates = versions.filter((v) => semver.major(v.version) === major); } else { candidates = versions.filter((v) => semver.satisfies(v.version, constraint)); } if (candidates.length === 0) { throw new Error(`${name}: no version satisfies ${constraint}`); } return candidates[candidates.length - 1]; } export function topoResolve(spec: string, seen = new Set()): Skill[] { const skill = resolve(spec); if (seen.has(skill.name)) return []; seen.add(skill.name); const out: Skill[] = []; for (const dep of skill.depends_on ?? []) { out.push(...topoResolve(dep, seen)); } out.push(skill); return out; } ``` Concept: `tool-calling` ### 6. Enforce ACLs before invocation Permission-aware RAG isn't built into Claude. You implement it. Read the calling agent's role + the Skill's access_level, run a hard check before invoking, and return a structured error on deny. This is a deterministic gate, not a prompt-language constraint; the Skill's body never executes if ACL fails. **Python:** ```python # scripts/acl_gate.py from typing import TypedDict, Literal class Skill(TypedDict): name: str access_level: Literal["public", "team", "role-restricted", "sensitive"] owners: list[str] class AgentContext(TypedDict): role: str # e.g. 
'support-agent', 'finance-agent' teams: list[str] # ['support', 'shared'] elevated: bool # has the user explicitly elevated to invoke sensitive skills? ROLE_ACL = { # Each access_level → which roles may invoke "public": lambda ctx, s: True, "team": lambda ctx, s: any(s["name"].startswith(f"{t}/") for t in ctx["teams"]), "role-restricted": lambda ctx, s: ctx["role"] in s.get("allowed_roles", []), "sensitive": lambda ctx, s: ctx["elevated"] and any( s["name"].startswith(f"{t}/") for t in ctx["teams"] ), } def check(ctx: AgentContext, skill: Skill) -> dict: """Returns {allowed, reason, request_url?}.""" rule = ROLE_ACL[skill["access_level"]] if rule(ctx, skill): return {"allowed": True, "reason": "access_granted"} return { "allowed": False, "reason": f"agent role={ctx['role']} cannot invoke {skill['name']} (access_level={skill['access_level']})", "request_url": f"https://internal.example.com/skills/request-access?skill={skill['name']}", } # Usage in the agent loop def invoke_skill(ctx: AgentContext, skill_spec: str, payload: dict): from resolve_deps import resolve skill = resolve(skill_spec) decision = check(ctx, skill) if not decision["allowed"]: return {"error": "ACL_DENIED", **decision} # ...actually invoke the Skill body... ``` **TypeScript:** ```typescript // scripts/acl-gate.ts type AccessLevel = "public" | "team" | "role-restricted" | "sensitive"; interface Skill { name: string; access_level: AccessLevel; allowed_roles?: string[]; owners?: string[]; } interface AgentContext { role: string; // e.g. 'support-agent', 'finance-agent' teams: string[]; // ['support', 'shared'] elevated: boolean; // has the user explicitly elevated to invoke sensitive skills? } const ROLE_ACL: Record<AccessLevel, (ctx: AgentContext, s: Skill) => boolean> = { public: () => true, team: (ctx, s) => ctx.teams.some((t) => s.name.startsWith(`${t}/`)), "role-restricted": (ctx, s) => (s.allowed_roles ?? []).includes(ctx.role), sensitive: (ctx, s) => ctx.elevated && ctx.teams.some((t) => s.name.startsWith(`${t}/`)), }; export function check(ctx: AgentContext, skill: Skill) { if (ROLE_ACL[skill.access_level](ctx, skill)) { return { allowed: true as const, reason: "access_granted" }; } return { allowed: false as const, reason: `agent role=${ctx.role} cannot invoke ${skill.name} (access_level=${skill.access_level})`, request_url: `https://internal.example.com/skills/request-access?skill=${skill.name}`, }; } // Usage in the agent loop export async function invokeSkill( ctx: AgentContext, skillSpec: string, payload: Record<string, unknown>, ) { const { resolve } = await import("./resolve-deps"); const skill = resolve(skillSpec) as Skill; const decision = check(ctx, skill); if (!decision.allowed) { return { error: "ACL_DENIED" as const, ...decision }; } // ...actually invoke the Skill body... } ``` Concept: `evaluation` ### 7. Wire the agent's Skill discovery into its tool loop Expose two tools to every agent: search_skills(query, filters) and invoke_skill(name, version, payload). The agent finds Skills by intent, the ACL gate runs inside invoke_skill, and the Skill body executes only on allow. The agent never sees the registry's raw 200+ entries. Just the top-k matches for its query, gated by access_level. **Python:** ```python # Skills are exposed as two tools to every agent TOOLS = [ { "name": "search_skills", "description": ( "Find a Skill in the enterprise registry by natural-language query. " "Returns up to k matches with name, version, description, score. " "Use BEFORE invoke_skill so you have a name+version to invoke."
), "input_schema": { "type": "object", "properties": { "query": {"type": "string"}, "k": {"type": "integer", "default": 5}, "team": {"type": "string"}, # optional filter "access_level": {"type": "string"}, # optional filter }, "required": ["query"], }, }, { "name": "invoke_skill", "description": ( "Invoke a Skill from the registry. ACL is checked before the " "Skill body executes; if denied, returns ACL_DENIED with a " "request_url for access. Always pin a major version (e.g. 1.x)." ), "input_schema": { "type": "object", "properties": { "name": {"type": "string"}, "version_constraint": {"type": "string", "default": "*"}, "payload": {"type": "object"}, }, "required": ["name", "payload"], }, }, ] ``` **TypeScript:** ```typescript // Skills are exposed as two tools to every agent import type Anthropic from "@anthropic-ai/sdk"; export const tools: Anthropic.Tool[] = [ { name: "search_skills", description: "Find a Skill in the enterprise registry by natural-language query. " + "Returns up to k matches with name, version, description, score. " + "Use BEFORE invoke_skill so you have a name+version to invoke.", input_schema: { type: "object", properties: { query: { type: "string" }, k: { type: "integer", default: 5 }, team: { type: "string" }, access_level: { type: "string" }, }, required: ["query"], }, }, { name: "invoke_skill", description: "Invoke a Skill from the registry. ACL is checked before the Skill " + "body executes; if denied, returns ACL_DENIED with a request_url for " + "access. Always pin a major version (e.g. 1.x).", input_schema: { type: "object", properties: { name: { type: "string" }, version_constraint: { type: "string", default: "*" }, payload: { type: "object" }, }, required: ["name", "payload"], }, }, ]; ``` Concept: `tool-calling` ### 8. Track usage + deprecation lifecycle Once Skills are in production, the registry needs to know which Skills are hot, which are stale, which have known broken versions. Log every invoke_skill call with name, version, agent role, outcome. Surface a deprecation notice in search_skills results when an old version is queried. Auto-archive Skills with zero invocations in 6 months. 
**Python:** ```python # scripts/usage_tracker.py from datetime import datetime, timedelta from collections import Counter import json from pathlib import Path # Append-only log of every invoke_skill call def log_invocation(name: str, version: str, agent_role: str, outcome: str): record = { "ts": datetime.utcnow().isoformat() + "Z", "name": name, "version": version, "agent_role": agent_role, "outcome": outcome, # 'success' | 'acl_denied' | 'error' } with open("logs/skill-invocations.jsonl", "a") as f: f.write(json.dumps(record) + "\n") # Nightly: surface deprecation candidates + hot Skills def nightly_report(): cutoff = datetime.utcnow() - timedelta(days=180) invocations = [ json.loads(line) for line in Path("logs/skill-invocations.jsonl").read_text().splitlines() ] recent = [r for r in invocations if datetime.fromisoformat(r["ts"][:-1]) > cutoff] hot = Counter((r["name"], r["version"]) for r in recent).most_common(20) invoked_names = {r["name"] for r in recent} all_names = {s["name"] for s in json.loads(Path("dist/skill-registry.json").read_text())} cold = sorted(all_names - invoked_names) print("=== Hot Skills (last 180d) ===") for (name, ver), count in hot: print(f" {name}:{ver} {count}") print(f"\n=== Cold Skills (deprecation candidates) ({len(cold)}) ===") for name in cold: print(f" {name}") ``` **TypeScript:** ```typescript // scripts/usage-tracker.ts import { appendFileSync, readFileSync } from "node:fs"; interface Record { ts: string; name: string; version: string; agent_role: string; outcome: "success" | "acl_denied" | "error"; } export function logInvocation( name: string, version: string, agent_role: string, outcome: Record["outcome"], ) { const r: Record = { ts: new Date().toISOString(), name, version, agent_role, outcome }; appendFileSync("logs/skill-invocations.jsonl", JSON.stringify(r) + "\n"); } export function nightlyReport() { const cutoff = Date.now() - 180 * 24 * 3600 * 1000; const invocations = readFileSync("logs/skill-invocations.jsonl", "utf8") .split("\n") .filter(Boolean) .map((line) => JSON.parse(line) as Record); const recent = invocations.filter((r) => new Date(r.ts).getTime() > cutoff); const hotMap = new Map(); for (const r of recent) { const k = `${r.name}:${r.version}`; hotMap.set(k, (hotMap.get(k) ?? 0) + 1); } const hot = [...hotMap.entries()] .sort((a, b) => b[1] - a[1]) .slice(0, 20); const invokedNames = new Set(recent.map((r) => r.name)); const allNames = new Set( JSON.parse(readFileSync("dist/skill-registry.json", "utf8")).map( (s: { name: string }) => s.name, ), ); const cold = [...allNames].filter((n) => !invokedNames.has(n)).sort(); console.log("=== Hot Skills (last 180d) ==="); for (const [k, n] of hot) console.log(` ${k} ${n}`); console.log(`\n=== Cold Skills (deprecation candidates) (${cold.length}) ===`); for (const name of cold) console.log(` ${name}`); } ``` Concept: `evaluation` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Org has 200+ Skills across 15 teams | Team-namespaced directory ({team}/{name}) + shared registry + embeddings search + ACL layer | Flat folder, full-text search, no ACL ('we'll add permissions later') | Naming collisions, version drift, and cross-team ACL violations all become structural impossibilities at the directory + frontmatter level. Retrofitting them at 200 Skills costs an order of magnitude more than starting clean. 
| | Skill case-facts is shipping a breaking change | Bump major (v2.0.0); existing callers stay on v1.x until they migrate; deprecation notice in v1's frontmatter | Edit v1 in place; tell teams to update their callers | Semver + Git tags let old callers keep working while new callers opt into v2 deliberately. Editing in place breaks every agent in the org silently. A class of incident that's painful to debug because the symptoms surface in agent loops, not in the Skill itself. | | Support agent's prompt suggests calling finance/budget-approval | ACL gate denies pre-execution; structured ACL_DENIED error with request_url returned to the agent | Trust the prompt; finance Skills not in support agent's tool list | Prompt-only restriction leaks under prompt injection or clever phrasing. A deterministic ACL gate that runs before the Skill body executes is the only real boundary. Tool-list restriction is the second layer; ACL is the first. | | Agent needs to find a Skill but doesn't know its exact name | search_skills(query). Embeddings/full-text returns top-k matches with metadata | Show all 200+ Skills in the agent's tool list | 200 tools in a single agent's tool list destroys routing accuracy (per Scenario P3.1's tool-count rule). Search-by-intent surfaces only the top-k relevant matches; the agent picks one and invokes it. Two tools (search + invoke) cover the whole space. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-20 · Unbounded skill count in flat layout | 200+ Skills in .claude/skills/ flat folder. Naming collisions appear in week 1 (refund-resolver exists in support, growth, and finance contexts, all meaning different things). Discovery becomes a grep contest. | Team-namespaced layout: .claude/skills/{team}/{name}.md. Collisions become structurally impossible. support/refund-resolver and growth/refund-resolver are distinct paths. Past 50 Skills, add an embeddings-based search service. | | AP-21 · No versioning | Skill case-facts ships a breaking change (frontmatter shape changes). Every agent in the org that depends on it starts failing silently. No way to roll back a single Skill's update. | Semver in frontmatter (version: 1.2.3) + Git tags. Callers pin major (case-facts:1.x); registry resolves to latest patch. Breaking changes bump major, callers migrate deliberately. | | AP-22 · Naming collisions across teams | Two teams independently author a refund-resolver Skill. Both end up in .claude/skills/refund-resolver.md (last commit wins). Agents call the wrong one; nobody notices for weeks. | Team namespace prefix: support/refund-resolver vs growth/refund-resolver. The directory layout enforces uniqueness; the indexer rejects duplicates. PR review surfaces collisions before merge. | | AP-23 · No access control | Support agent's prompt is cleverly engineered (or injected via PR content) to invoke finance/budget-approval. The Skill executes; an unauthorized $50K refund is approved. Audit log shows the agent did it; ACL log shows nothing because there is no ACL. | ACL gate (access_level: public | team | role-restricted | sensitive) on every Skill, checked pre-invocation. Denied calls return a structured ACL_DENIED error; the agent observes it and either escalates or routes differently. Deterministic, not prompt-based. | | AP-24 · Skills as one-off prompts | Each agent's system prompt copy-pastes the relevant Skill content inline. When the Skill changes, 12 agents need updating. Nobody updates them all; behavior drifts over months. 
| Skills are reusable, composable, versioned units. Agents reference them via invoke_skill('support/refund-resolver:1.x', payload). One source of truth; one Skill update propagates to every caller automatically. | ## Implementation checklist - [ ] Team-namespaced directory layout: .claude/skills/{team}/{name}.md (`skills`) - [ ] Frontmatter schema documented and validated by the indexer (name, version, description, tags, access_level required) (`structured-outputs`) - [ ] Semver enforced. Every Skill has a valid semver in frontmatter (`tool-calling`) - [ ] CI indexer runs on push to main; reindex SLA <60s on 500 Skills - [ ] Search service deployed with p95 <200ms; embeddings cached by body_hash (`context-window`) - [ ] Two tools exposed to every agent: search_skills + invoke_skill (`tool-calling`) - [ ] ACL gate runs PRE-invocation; denied calls return structured ACL_DENIED (`evaluation`) - [ ] Dependency resolution: callers pin major; registry resolves to latest patch - [ ] Deprecation lifecycle: zero-invocation Skills auto-flagged at 180d - [ ] Usage log appended on every invoke_skill call (jsonl) - [ ] PR review on every Skill change. Including the frontmatter shape ## Cost & latency - **Skill execution (avg 800 tokens):** ~$0.0024 per invocation, Skill body ~500 tokens system + ~200 input + ~100 output. Sonnet 4.5 pricing. Most Skills are narrow, focused units. No inflation from generic prompt scaffolding. - **Search service (embeddings):** ~$0.0001 per query, Voyage / OpenAI embedding ~512 dims at fractional cost per query. Embeddings cached by body_hash so re-embedding only fires on content change. At 1M queries/month, ~$100. - **Reindex CI job (per push to main):** ~$0 (compute) + ~$0.01 (embedding refresh), Indexer is pure parsing on GitHub Actions free tier. Only cost is re-embedding Skills with changed content. Typically <5% of the registry per push. - **ACL check overhead:** ~+0.01ms per invocation, ~0% token cost, ACL is a deterministic dictionary lookup against frontmatter + agent role. No LLM call. Latency is unmeasurable in the pipeline; cost is in maintenance, not execution. - **Annual registry hosting (5K Skills, 20K queries/day):** ~$3K-8K/year, Embeddings store + search service + reindex compute. Small relative to the per-invocation Skill execution cost which dominates total spend at scale. ## Domain weights - **D3 · Agent Operations (20%):** Skill definition file + frontmatter schema + Git versioning + deprecation lifecycle - **D2 · Tool Design + Integration (18%):** search_skills + invoke_skill tool design + ACL gate + dependency resolution ## Practice questions ### Q1. An enterprise has 200+ Skills across 15 teams. Skill-name collisions occur weekly (refund-resolver exists in support/, growth/, AND finance/, all meaning different things). How should you structure the registry to prevent this structurally? Adopt a team namespace prefix convention enforced by the directory layout: .claude/skills/{team}/{name}.md, so the canonical name is support/refund-resolver vs growth/refund-resolver. Collisions become impossible at the filesystem level (different paths) and at the registry level (the indexer rejects duplicate name fields in frontmatter). Pair with PR review on every Skill change to catch deliberate naming drift before merge. Tagged to AP-22. ### Q2. A Skill for customer-support refund processing is updated frequently. Last week, an in-place edit broke 12 dependent agents silently. How do you prevent this? Semver in frontmatter plus Git tags. 
Every Skill carries a version: MAJOR.MINOR.PATCH; every release tags the Git history. Callers pin a major (support/refund-resolver:1.x); the registry resolves to the latest patch within that major. Breaking changes bump the major (v2.0.0); existing callers continue against v1.x until they migrate deliberately. The in-place edit becomes structurally impossible. The indexer rejects two Skills with the same name and version. Tagged to AP-21. ### Q3. Finance team has sensitive Skills (e.g. budget-approval). The support team's agent must NEVER invoke them, no matter how cleverly prompted (or prompt-injected). How do you enforce this architecturally? Every Skill carries an access_level in frontmatter (public | team | role-restricted | sensitive). An ACL gate runs before Skill invocation: read the calling agent's role + the Skill's access_level, deny pre-execution if not allowed, return structured {error: 'ACL_DENIED', skill, reason, request_url}. The agent observes the denial and either escalates or routes differently. It cannot bypass. This is deterministic, not prompt-based; cleverness in the prompt cannot defeat a hard pre-invocation check. Tagged to AP-23. ### Q4. An agent on the marketing team needs to discover the right Skill from 50+ available. Searching by exact name is slow and requires the agent to already know what's there. What infrastructure should you add? An embeddings-based search service keyed over description + tags with optional filters by team and access_level. The agent calls search_skills('process customer refund up to $500', k=5) and gets the top-5 matches with {name, version, description, score}. Pair with a full-text fallback for exact-keyword queries. Cache embeddings keyed by body_hash so re-embedding only fires on content change. p95 query latency stays <200ms even at 5,000 Skills. ### Q5. A Skill captures enterprise knowledge (policies, procedures) for support. Should it be a single 500-line markdown file or modular across multiple files with depends_on? Modular composition via depends_on. The Skill's frontmatter declares its dependencies (depends_on: ['support/case-facts:1.x', 'shared/escalation-queue:2.x']); the registry resolves them topologically at invocation time. Benefits: each unit is independently versioned (case-facts evolves separately from escalation-queue), reusable across multiple parent Skills, and easier to PR-review (smaller files). The dep resolver enforces no cycles and that every referenced version exists. ## FAQ ### Q1. What's the maximum number of Skills per organization? Unbounded with the right infrastructure. Per-project (a single agent's working set), keep <12 for discoverability. Per-team, low hundreds is comfortable with a search service. Org-wide, thousands work with embeddings + namespaces + ACLs. The bottleneck is rarely raw Skill count. It's how the agent finds the right one and how the org governs change. ### Q2. Can a Skill depend on other Skills? Yes, declared in frontmatter. depends_on: ['support/case-facts:1.x', 'shared/escalation-queue:2.x']. The registry validates dependencies exist at index time (CI fails on missing dep) and resolves them topologically at invocation time. Avoid cycles. The dep resolver detects them and rejects. ### Q3. How do you version Skills without breaking existing agents? Semver in frontmatter + callers pin major. A Skill at v1.2.3 keeps backward compatibility for all v1.x callers. When a breaking change is needed, bump to v2.0.0; existing callers continue against v1.x until they migrate deliberately. 
Deprecation notices in the v1 frontmatter point to v2; the registry surfaces the warning in search_skills results. ### Q4. Is permission-aware RAG built into Claude? No. You implement it. Claude's tool layer doesn't know about your org's roles. Implement an ACL gate that runs pre-invocation: read agent role + Skill access_level, deny if not allowed, return structured ACL_DENIED. The Skill body never executes if the ACL check fails. This is the same pattern as authorization middleware in any HTTP service. Deterministic, not LLM-judged. ### Q5. Should sensitive Skills be versioned differently? No. Same versioning, different access control. Versioning is about backward compatibility; access control is about who can invoke. They're orthogonal. A sensitive Skill ships v1.2.3 just like a public one; the ACL gate gates who can call it, regardless of version. ### Q6. How do you find the right Skill from 200+? Two tools, one query. First, search_skills(query, k=5, filters). Embeddings search returns top-k matches by intent. Second, invoke_skill(name, version, payload). Runs the chosen Skill with ACL check. The agent never sees raw access to the registry; it queries through the search tool. This keeps the agent's tool list small (just 2 tools) while exposing the entire Skills library. ### Q7. What happens to old Skill versions when a new major ships? They stay queryable for 6 months by default. The deprecation lifecycle: ship v2.0.0 → mark v1.x with a deprecation note in frontmatter → registry serves v1.x to existing callers but flags the deprecation in search results → after 6 months of zero invocations, auto-archive. Active Skills stay forever; truly cold ones get cleaned up. ## Production readiness - [ ] All Skills have valid frontmatter (CI fails the merge if not) - [ ] Indexer reindex SLA <60s on the live registry - [ ] Search p95 latency <200ms under steady-state load - [ ] ACL gate unit-tested per access_level (public, team, role-restricted, sensitive) - [ ] Dep resolver tested with cycle, missing-dep, and major-bump scenarios - [ ] Usage log persisted append-only (jsonl) with retention policy documented - [ ] Nightly deprecation report runs; cold Skills surfaced to owners - [ ] Prod deploy of search service has fallback to full-text on embeddings outage --- **Source:** https://claudearchitectcertification.com/scenarios/agent-skills-for-enterprise-km **Vault sources:** ACP-T05 §Scenario 11 (🟢 confirmed beyond-guide; u/ZealousidealFill6044); ACP-T07 §Lab 11 spec (HIGHEST PRIORITY beyond-guide lab); ACP-T08 §3.11 (multi-file skills, registry, search, version, ACL); Course 15 Introduction to Agent Skills. Overview + lesson 5 (sharing skills); ACP-T06 (5 practice Qs tagged to components); COD-K12 Hermes agent architecture review (self-improving skill systems) **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Developer Productivity Agent > A team of specialised subagents that handle codebase exploration, code review, and doc generation. Each with a narrow tool whitelist and isolated context. The lead agent uses Grep + Glob first to locate relevant files (never Read everything), delegates review to an independent reviewer subagent to avoid confirmation bias, and gates every destructive action behind the 4D Framework (Delegation → Description → Discernment → Diligence) where a human clicks approve. 
Distribution beats monoliths: 4-5 tools per subagent across 2-3 specialists routes more accurately than one 15-tool agent. **Sub-marker:** P3.4 **Domains:** D1 · Agentic Architectures, D2 · Tool Design + Integration **Exam weight:** 45% of CCA-F (D1 + D2) **Build time:** 22 minutes **Source:** 🟢 Official Anthropic guide scenario · in published exam guide **Canonical:** https://claudearchitectcertification.com/scenarios/developer-productivity-agent **Last reviewed:** 2026-05-04 ## In plain English Think of this as the agent that helps a developer rename a function across a 1,000-file codebase without losing their afternoon to it. Instead of one big agent reading every file (which fails because there's too much to remember), the work is split: one tiny helper finds the matching files with grep, another reads only those files, a third reviews the proposed changes from a fresh perspective so it doesn't just rubber-stamp its own work, and a human approves the final commit. The whole point is that productivity tasks are easier when you delegate to specialists rather than asking one agent to do everything alone. ## Exam impact Domain 1 (Agentic Architecture, 27%) tests subagent distribution, the grep-then-read exploration sequence, and the 4D Framework approval gate. Domain 2 (Tool Design, 18%) tests built-in vs MCP tool selection (`Read` vs `Bash(cat ...)`), tool count per agent (4-5, not 15), and structured output for tool results. In the published guide and the practice exam. High-yield drilling. ## The problem ### What the customer needs - Find every reference to a function across a 1,000-file repo without reading every file. - Review the agent's proposed changes with a fresh perspective so confirmation bias doesn't pass through. - Generate up-to-date docs that reflect the actual code, not last quarter's spec. ### Why naive approaches fail - Monolithic single agent with 15 tools trying to do everything → routing accuracy drops 8% per tool past 5; the agent alternates between similar tools and misses obvious matches. - Read every file first to 'understand the codebase' → context floods at file 80; the agent has lost track of the original task by file 200. - Same session generates AND reviews the code → confirmation bias passes through; the reviewer agrees with the writer because it shares the writer's context. 
### Definition of done - Grep / Glob locates matched files first; Read opens only those files - Code review runs in an independent subagent with fresh context - Tool count ≤ 5 per subagent; specialists distributed across reader / reviewer / docgen - 4D Framework approval gate before any auto-merge or destructive action - Bash structured-output flags (--format json, --porcelain) replace fragile regex parsing ## Concepts in play - 🟢 **Subagents** (`subagents`), Reader / Reviewer / Doc-generator specialists - 🟢 **Tool calling** (`tool-calling`), 4-5 tools per subagent, distributed not monolithic - 🟢 **Context window** (`context-window`), Grep first, read only what matched - 🟢 **Evaluation** (`evaluation`), Independent reviewer subagent (no confirmation bias) - 🟢 **4D Framework** (`4d-framework`), Delegation, Description, Discernment, Diligence - 🟢 **Project memory** (`claude-md-hierarchy`), Codebase context loader + scratchpad files - 🟢 **Model Context Protocol** (`mcp`), Language servers, linters, formatters - 🟢 **Agentic loops** (`agentic-loops`), Lead agent's coordination loop ## Components ### Built-in Tool Suite, Read · Write · Edit · Bash · Grep · Glob The six built-in tools cover almost every productivity task. Use Grep + Glob to locate before Read; use Edit (not Write) for in-place changes to existing files; reserve Bash for actual commands (compile, test, run). Never as a fallback for file I/O. The grep-then-read sequence is the single biggest token-efficiency lever. **Configuration:** Tool whitelist per subagent: Reader=[Read,Grep,Glob]; Reviewer=[Read,Grep,Bash(test,lint)]; Doc-gen=[Read,Write,Glob]. Never grant 15 tools to one agent. Accuracy drops 8% per tool past 5. **Concept:** `tool-calling` ### Codebase Context Loader, imports · dependencies · architecture Walks the repo at session start: parses package.json / pyproject.toml, reads README.md + .claude/CLAUDE.md, builds a dependency graph (top 50 imports), surfaces the architecture-decisions doc. Loaded once into the lead agent's context so it doesn't re-discover on every turn. **Configuration:** Run on first invocation: project_type, key_dirs, top_imports[], architecture_summary. Persist as .claude/context-cache.json with hash-based invalidation. Refresh on package.json change. **Concept:** `claude-md-hierarchy` ### Code-Review Subagent, independent · fresh context Spawned per change-set with [Read, Grep, Bash(test,lint)] only. No Edit, no Write. Fresh context, no inherited history from the writing agent. Reviews against .claude/rules/ and the codebase context. Returns { verdict, issues: [{ line, severity, message }], summary }. Independence is the architectural point. Same-session review just rubber-stamps. **Configuration:** system: 'You are a code reviewer. Read only. Check against .claude/rules/ and existing patterns.' tools: [Read, Grep, Bash(npm test, npm run lint)]. Receives diff + context summary; returns structured verdict. **Concept:** `evaluation` ### Doc-Generator Subagent, writes from code, not from spec Specialised subagent that reads source files and generates documentation that reflects what the code actually does, not what last quarter's spec said it would do. Writes JSDoc / docstrings inline (Edit), README sections (Write), or external API docs (Write to docs/). Tools scoped to read source + write docs only. **Configuration:** system: 'Generate docs from source. Use the code as ground truth.' tools: [Read, Write(docs/, *.md), Glob]. Run after code-review-subagent passes; auto-attached as part of the PR. 
**Concept:** `subagents` ### MCP Server Integrations, language servers · linters · formatters Optional but high-leverage in multi-language repos. Hook in pyright for Python type info, tsc --noEmit for TS, eslint/ruff for lint, prettier/black for formatting. Each MCP server adds language-specific context the agent can query without re-implementing it. Selected per file extension; not all loaded at once. **Configuration:** MCP registry per workspace: { '.ts': ['tsc', 'eslint'], '.py': ['pyright', 'ruff'], '.go': ['gopls'] }. Routed by file extension; agent calls mcp.lint(file) and gets structured diagnostics. **Concept:** `mcp` ## Build steps ### 1. Profile the codebase before any task On first invocation, run a one-time codebase-context loader: language(s), framework, top entry points, dependency graph. Cache to .claude/context-cache.json. The lead agent loads this on every subsequent task instead of re-discovering. Saves ~4K tokens per turn. **Python:** ```python # scripts/profile_codebase.py import json, subprocess from pathlib import Path from collections import Counter def profile() -> dict: profile = {"languages": [], "framework": None, "key_dirs": [], "top_imports": []} if Path("package.json").exists(): pkg = json.loads(Path("package.json").read_text()) profile["languages"].append("typescript") deps = list((pkg.get("dependencies") or {}).keys()) if "next" in deps: profile["framework"] = "next.js" elif "react" in deps: profile["framework"] = "react" if Path("pyproject.toml").exists(): profile["languages"].append("python") # Top 50 imported modules across the repo imports = Counter() for ext, regex in [("ts", r"^import .* from ['\"]"), ("py", r"^(import|from) ")]: for f in Path(".").rglob(f"*.{ext}"): if "node_modules" in f.parts or ".venv" in f.parts: continue for line in f.read_text(errors="ignore").splitlines()[:50]: if line.startswith(("import", "from")): imports[line.split()[1].rstrip(',')] += 1 profile["top_imports"] = [m for m, _ in imports.most_common(50)] profile["key_dirs"] = [ p.name for p in Path(".").iterdir() if p.is_dir() and not p.name.startswith(".") and p.name not in ("node_modules", "dist") ] Path(".claude/context-cache.json").write_text(json.dumps(profile, indent=2)) return profile if __name__ == "__main__": print(json.dumps(profile(), indent=2)) ``` **TypeScript:** ```typescript // scripts/profile-codebase.ts import { readFileSync, writeFileSync, readdirSync, statSync } from "node:fs"; import { join } from "node:path"; interface Profile { languages: string[]; framework: string | null; key_dirs: string[]; top_imports: string[]; } export function profile(): Profile { const out: Profile = { languages: [], framework: null, key_dirs: [], top_imports: [] }; try { const pkg = JSON.parse(readFileSync("package.json", "utf8")); out.languages.push("typescript"); const deps = Object.keys(pkg.dependencies ?? {}); if (deps.includes("next")) out.framework = "next.js"; else if (deps.includes("react")) out.framework = "react"; } catch {} try { statSync("pyproject.toml"); out.languages.push("python"); } catch {} // Top imports. 
Walk repo, count import statements const counts = new Map(); function walk(dir: string) { for (const name of readdirSync(dir)) { if (name.startsWith(".") || name === "node_modules") continue; const p = join(dir, name); const st = statSync(p); if (st.isDirectory()) walk(p); else if (/\.(ts|tsx|js|py)$/.test(name)) { const lines = readFileSync(p, "utf8").split("\n").slice(0, 50); for (const line of lines) { const m = line.match(/^(import|from)\s+([\w./@-]+)/); if (m) counts.set(m[2], (counts.get(m[2]) ?? 0) + 1); } } } } walk("."); out.top_imports = [...counts.entries()] .sort((a, b) => b[1] - a[1]) .slice(0, 50) .map(([k]) => k); out.key_dirs = readdirSync(".").filter( (n) => !n.startsWith(".") && !["node_modules", "dist"].includes(n), ); writeFileSync(".claude/context-cache.json", JSON.stringify(out, null, 2)); return out; } ``` Concept: `claude-md-hierarchy` ### 2. Always Grep + Glob before Read The single biggest exploration anti-pattern is Read on every file in a directory. Instead, Grep finds the symbol or pattern across the repo (returns matched files + line numbers); Glob narrows by path; only Read the files that matched. On a 1000-file repo searching for a function name, this is the difference between 1000 Reads and 12 Reads. **Python:** ```python # Wrong. Read everything, hope for the best # for f in glob('**/*.ts'): # content = read_file(f) # 1000 reads, context floods, agent gets lost # Right. Grep first to locate, then Read only matched def find_function_usages(name: str) -> list[dict]: """Grep + Glob first, Read only what matched.""" # Step 1: Grep finds the pattern across the repo grep_result = run_tool("Grep", { "pattern": rf"\b{name}\b", "type": "ts", "output_mode": "files_with_matches", # NOT 'content'. Too verbose }) matched_files = grep_result["files"] # e.g. 12 files, not 1000 # Step 2: Glob filters further if needed if len(matched_files) > 30: # Too many. Narrow by path matched_files = [f for f in matched_files if "src/" in f] # Step 3: Read only the files that matched file_contents = {} for f in matched_files[:20]: # cap at 20 to bound context file_contents[f] = run_tool("Read", {"file_path": f}) return [{"file": f, "content": c} for f, c in file_contents.items()] ``` **TypeScript:** ```typescript // Wrong. Read everything, hope for the best // for (const f of glob.sync("**/*.ts")) { // const content = readFile(f); // 1000 reads, context floods, agent gets lost // } // Right. Grep first to locate, then Read only matched async function findFunctionUsages(name: string) { // Step 1: Grep finds the pattern across the repo const grepResult = await runTool("Grep", { pattern: `\\b${name}\\b`, type: "ts", output_mode: "files_with_matches", // NOT 'content'. Too verbose }); let matchedFiles: string[] = grepResult.files; // Step 2: Glob filters further if needed if (matchedFiles.length > 30) { matchedFiles = matchedFiles.filter((f) => f.includes("src/")); } // Step 3: Read only the files that matched const out: Array<{ file: string; content: string }> = []; for (const f of matchedFiles.slice(0, 20)) { const content = await runTool("Read", { file_path: f }); out.push({ file: f, content }); } return out; } ``` Concept: `context-window` ### 3. Distribute tools across 2-3 specialised subagents One 15-tool agent routes worse than three 5-tool specialists. Define Reader (Grep + Glob + Read), Reviewer (Read + Grep + Bash test/lint), Doc-Generator (Read + Write + Glob). Each runs in its own context. The lead agent coordinates and merges results. 
Routing accuracy stays high because each specialist has a clear domain. **Python:** ```python # Specialist subagent definitions SPECIALISTS = { "reader": { "system": "You are a codebase reader. Use Grep+Glob to locate, " "then Read only what matched. Never modify files.", "tools": ["Read", "Grep", "Glob"], }, "reviewer": { "system": "You are an independent code reviewer. Fresh context, " "no carry-over from the writer. Check against .claude/rules/.", "tools": ["Read", "Grep", "Bash(npm test, npm run lint, ruff check)"], }, "docgen": { "system": "You generate docs from source. Use the code as ground truth.", "tools": ["Read", "Write(docs/**, *.md)", "Glob"], }, } def spawn_specialist(role: str, task: str) -> dict: spec = SPECIALISTS[role] return client.messages.create( model="claude-sonnet-4.5", max_tokens=2048, system=spec["system"], tools=load_tools(spec["tools"]), messages=[{"role": "user", "content": task}], ) # Lead agent coordinates def refactor_function(name: str, new_name: str): locations = spawn_specialist("reader", f"Find all usages of {name}") proposed_diff = generate_renames(locations, new_name) review = spawn_specialist("reviewer", f"Review this diff: {proposed_diff}") if review["verdict"] != "approve": return {"status": "review_failed", "issues": review["issues"]} return {"status": "ready_to_apply", "diff": proposed_diff, "review": review} ``` **TypeScript:** ```typescript // Specialist subagent definitions const SPECIALISTS = { reader: { system: "You are a codebase reader. Use Grep+Glob to locate, " + "then Read only what matched. Never modify files.", tools: ["Read", "Grep", "Glob"], }, reviewer: { system: "You are an independent code reviewer. Fresh context, " + "no carry-over from the writer. Check against .claude/rules/.", tools: ["Read", "Grep", "Bash(npm test, npm run lint, ruff check)"], }, docgen: { system: "You generate docs from source. Use the code as ground truth.", tools: ["Read", "Write(docs/**, *.md)", "Glob"], }, } as const; async function spawnSpecialist( role: keyof typeof SPECIALISTS, task: string, ) { const spec = SPECIALISTS[role]; return client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 2048, system: spec.system, tools: loadTools(spec.tools as readonly string[]), messages: [{ role: "user", content: task }], }); } // Lead agent coordinates async function refactorFunction(name: string, newName: string) { const locations = await spawnSpecialist("reader", `Find all usages of ${name}`); const proposedDiff = generateRenames(locations, newName); const review = await spawnSpecialist("reviewer", `Review this diff: ${proposedDiff}`); if (review.verdict !== "approve") { return { status: "review_failed" as const, issues: review.issues }; } return { status: "ready_to_apply" as const, diff: proposedDiff, review }; } ``` Concept: `subagents` ### 4. Spawn the reviewer in a SEPARATE session If the reviewer shares context with the writer, it inherits the writer's assumptions and rubber-stamps. The fix is structural: the reviewer subagent runs with a fresh messages array, fresh system prompt, no inherited tool history. It sees only the diff + the project rules. The independence is the entire point. Confirmation bias is structural, not philosophical. **Python:** ```python def review_diff_independently(diff: str, file_paths: list[str]) -> dict: """Spawn a fresh-context reviewer. NO carry-over from the writer.""" REVIEWER_SYSTEM = """You are an independent code reviewer. You have NO context from the agent that wrote this diff. You see only: 1. 
The diff itself 2. The current state of the affected files 3. The project rules in .claude/rules/ Check for: - Style conformance (matches .claude/rules/) - Test coverage (does the diff change behavior without test changes?) - Type safety (any unsafe casts or 'any' added?) - API breakage (did a public signature change?) Return JSON: { verdict: 'approve'|'request_changes', issues: [{ file, line, severity, message }], summary }""" # CRITICAL: fresh messages array, fresh system, fresh tool list return client.messages.create( model="claude-sonnet-4.5", max_tokens=2048, system=REVIEWER_SYSTEM, tools=[READ_TOOL, GREP_TOOL, BASH_TEST_LINT_TOOL], # no Edit/Write messages=[{ "role": "user", "content": f"Diff to review:\n{diff}\n\nAffected files: {file_paths}", }], ) ``` **TypeScript:** ```typescript async function reviewDiffIndependently(diff: string, filePaths: string[]) { // Spawn a fresh-context reviewer. NO carry-over from the writer. const REVIEWER_SYSTEM = `You are an independent code reviewer. You have NO context from the agent that wrote this diff. You see only: 1. The diff itself 2. The current state of the affected files 3. The project rules in .claude/rules/ Check for: - Style conformance (matches .claude/rules/) - Test coverage (does the diff change behavior without test changes?) - Type safety (any unsafe casts or 'any' added?) - API breakage (did a public signature change?) Return JSON: { verdict: 'approve'|'request_changes', issues: [{ file, line, severity, message }], summary }`; // CRITICAL: fresh messages array, fresh system, fresh tool list return client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 2048, system: REVIEWER_SYSTEM, tools: [READ_TOOL, GREP_TOOL, BASH_TEST_LINT_TOOL], // no Edit/Write messages: [ { role: "user", content: `Diff to review:\n${diff}\n\nAffected files: ${filePaths.join(", ")}`, }, ], }); } ``` Concept: `evaluation` ### 5. Use a scratchpad file for cross-turn state Long productivity tasks (rename across 50 files, refactor 3 services) span many turns. Don't try to keep all state in the message history. Write it to .claude/scratchpad.md. The lead agent reads it at the start of every turn, updates it after each subagent returns, and trims old entries. The scratchpad is the agent's working memory; the message history is the audit trail. **Python:** ```python # .claude/scratchpad.md. Agent working memory across turns SCRATCHPAD = Path(".claude/scratchpad.md") def update_scratchpad(section: str, content: str): """Append-and-replace by section. Trims old entries past 20.""" if not SCRATCHPAD.exists(): SCRATCHPAD.write_text("# Agent scratchpad\n\n") text = SCRATCHPAD.read_text() # Replace existing section or append marker = f"## {section}" if marker in text: # Replace existing section before, _, rest = text.partition(marker) _, _, after = rest.partition("\n## ") new = before + marker + "\n" + content + "\n## " + after if "\n## " in rest else before + marker + "\n" + content else: new = text + "\n" + marker + "\n" + content SCRATCHPAD.write_text(new) # Usage in the agent loop update_scratchpad( "Rename Task: FooBar -> BazQux", "Files matched: 12. Already updated: 8. Remaining: src/old.ts, src/legacy.ts.", ) ``` **TypeScript:** ```typescript // .claude/scratchpad.md. 
Agent working memory across turns import { existsSync, readFileSync, writeFileSync } from "node:fs"; const SCRATCHPAD = ".claude/scratchpad.md"; export function updateScratchpad(section: string, content: string) { if (!existsSync(SCRATCHPAD)) { writeFileSync(SCRATCHPAD, "# Agent scratchpad\n\n"); } const text = readFileSync(SCRATCHPAD, "utf8"); const marker = `## ${section}`; let next: string; if (text.includes(marker)) { // Replace existing section const before = text.slice(0, text.indexOf(marker)); const rest = text.slice(text.indexOf(marker) + marker.length); const nextMarker = rest.indexOf("\n## "); const after = nextMarker >= 0 ? rest.slice(nextMarker) : ""; next = `${before}${marker}\n${content}${after}`; } else { next = `${text}\n${marker}\n${content}`; } writeFileSync(SCRATCHPAD, next); } // Usage in the agent loop updateScratchpad( "Rename Task: FooBar -> BazQux", "Files matched: 12. Already updated: 8. Remaining: src/old.ts, src/legacy.ts.", ); ``` Concept: `claude-md-hierarchy` ### 6. Wire MCP servers per language extension Multi-language repos benefit from language-specific MCP servers. pyright for Python types, tsc --noEmit for TS, eslint and prettier for JS, ruff for Python lint. Don't load all MCP servers at session start. Route by file extension when a tool needs them. Keeps context lean, reduces overhead. **Python:** ```python # Per-extension MCP routing from pathlib import Path MCP_BY_EXTENSION = { ".ts": ["tsc", "eslint", "prettier"], ".tsx": ["tsc", "eslint", "prettier"], ".py": ["pyright", "ruff", "black"], ".go": ["gopls", "gofmt"], ".rs": ["rust-analyzer", "rustfmt"], } def select_mcp_servers(file_paths: list[str]) -> list[str]: """Pick the minimum set of MCP servers needed for these files.""" extensions = {Path(f).suffix for f in file_paths} servers = set() for ext in extensions: for server in MCP_BY_EXTENSION.get(ext, []): servers.add(server) return sorted(servers) # In the lead agent before spawning the reviewer def review_with_mcp(diff: str, files: list[str]): mcp_servers = select_mcp_servers(files) return spawn_specialist( "reviewer", f"Review diff: {diff}\nUse MCP servers: {mcp_servers}", ) ``` **TypeScript:** ```typescript // Per-extension MCP routing const MCP_BY_EXTENSION: Record<string, string[]> = { ".ts": ["tsc", "eslint", "prettier"], ".tsx": ["tsc", "eslint", "prettier"], ".py": ["pyright", "ruff", "black"], ".go": ["gopls", "gofmt"], ".rs": ["rust-analyzer", "rustfmt"], }; export function selectMcpServers(filePaths: string[]): string[] { const exts = new Set(filePaths.map((f) => "." + (f.split(".").pop() ?? ""))); const servers = new Set<string>(); for (const ext of exts) { for (const s of MCP_BY_EXTENSION[ext] ?? []) servers.add(s); } return [...servers].sort(); } // In the lead agent before spawning the reviewer async function reviewWithMcp(diff: string, files: string[]) { const mcpServers = selectMcpServers(files); return spawnSpecialist( "reviewer", `Review diff: ${diff}\nUse MCP servers: ${mcpServers.join(", ")}`, ); } ``` Concept: `mcp` ### 7. Implement the 4D approval gate Delegation (agent proposes), Description (clear output), Discernment (human reviews), Diligence (human approves). The 4D Framework. Code-changing actions are gated behind explicit human approval. The agent presents the diff + the reviewer's verdict + a one-paragraph summary; the human clicks approve or rejects. No auto-merge for productivity tasks; the gate exists because the cost of a bad merge is far higher than the friction of one click.
**Python:** ```python from typing import Literal, TypedDict class ApprovalRequest(TypedDict): delegation: str # what the agent proposes (1 sentence) description: str # the diff + summary of changes discernment: dict # reviewer subagent's verdict + issues risk_level: Literal["low", "medium", "high"] def request_approval(req: ApprovalRequest) -> Literal["approve", "reject"]: """4D gate. ALWAYS waits for human input. No auto-approve.""" print("=" * 60) print(f"DELEGATION: {req['delegation']}") print(f"\nDESCRIPTION:\n{req['description']}") print(f"\nDISCERNMENT (review verdict): {req['discernment']['verdict']}") if req['discernment'].get('issues'): print(f" Issues:") for issue in req['discernment']['issues']: print(f" - {issue['severity']}: {issue['message']}") print(f"\nRISK LEVEL: {req['risk_level']}") print("=" * 60) print("DILIGENCE: Approve? [y/N]: ", end="") return "approve" if input().strip().lower() == "y" else "reject" # Usage in the lead agent def apply_refactor_safely(diff: str, files: list[str], reviewer_verdict: dict): decision = request_approval({ "delegation": f"Apply rename refactor across {len(files)} files", "description": f"Diff:\n{diff}", "discernment": reviewer_verdict, "risk_level": "medium" if len(files) > 20 else "low", }) if decision == "approve": run_tool("Edit", {"diff": diff}) else: print("Refactor rejected by human; no files changed.") ``` **TypeScript:** ```typescript interface ApprovalRequest { delegation: string; // what the agent proposes (1 sentence) description: string; // the diff + summary of changes discernment: { // reviewer subagent's verdict + issues verdict: "approve" | "request_changes"; issues: Array<{ severity: string; message: string }>; }; risk_level: "low" | "medium" | "high"; } export async function requestApproval(req: ApprovalRequest): Promise<"approve" | "reject"> { // 4D gate. ALWAYS waits for human input. No auto-approve. console.log("=".repeat(60)); console.log(`DELEGATION: ${req.delegation}`); console.log(`\nDESCRIPTION:\n${req.description}`); console.log(`\nDISCERNMENT (review verdict): ${req.discernment.verdict}`); if (req.discernment.issues?.length) { console.log(" Issues:"); for (const issue of req.discernment.issues) { console.log(` - ${issue.severity}: ${issue.message}`); } } console.log(`\nRISK LEVEL: ${req.risk_level}`); console.log("=".repeat(60)); process.stdout.write("DILIGENCE: Approve? [y/N]: "); const answer = await new Promise<string>((res) => { process.stdin.once("data", (d) => res(d.toString())); }); return answer.trim().toLowerCase() === "y" ? "approve" : "reject"; } // Usage in the lead agent export async function applyRefactorSafely( diff: string, files: string[], reviewerVerdict: ApprovalRequest["discernment"], ) { const decision = await requestApproval({ delegation: `Apply rename refactor across ${files.length} files`, description: `Diff:\n${diff}`, discernment: reviewerVerdict, risk_level: files.length > 20 ? "medium" : "low", }); if (decision === "approve") { await runTool("Edit", { diff }); } else { console.log("Refactor rejected by human; no files changed."); } } ``` Concept: `4d-framework` ### 8. Use Bash structured output, not regex Tools that emit JSON (gh pr diff --json, git log --pretty=format:%H%x09%s, npm test --reporter json, eslint --format json) replace fragile prose-parsing. Even simple commands have structured-output flags (git status --porcelain, ls -la --time-style=long-iso). Always prefer them. Regex on free-form output is the reason most agent pipelines break in the third week of production.
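git status --porcelain from the list above isn't shown in the snippets below; a minimal parsing sketch for the v1 porcelain format (two status columns, a space, then the path; rename entries keep the 'old -> new' form in the path field):

```python
# Parse `git status --porcelain` into structured records. No regex over prose output.
import subprocess

def working_tree_status() -> list[dict]:
    raw = subprocess.check_output(["git", "status", "--porcelain"], text=True)
    entries = []
    for line in raw.splitlines():
        # Columns 0-1 are index/worktree status codes ("M ", " M", "??", "A ", ...)
        entries.append({"index": line[0], "worktree": line[1], "path": line[3:]})
    return entries

# Example
# [e for e in working_tree_status() if e["worktree"] == "M"]  # unstaged modifications
```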
**Python:** ```python # Wrong. Regex on prose output # output = run("git log --oneline -10") # commits = re.findall(r"^([a-f0-9]{7,}) (.+)$", output, re.M) # # Breaks when commit message contains a colon or special chars # Right. Structured output, parse JSON import json, subprocess def recent_commits(n: int = 10) -> list[dict]: """git log with structured output, no regex needed.""" raw = subprocess.check_output( ["git", "log", f"-{n}", "--pretty=format:%H%x09%h%x09%s%x09%an%x09%aI"], text=True, ) return [ dict(zip(["sha", "short", "subject", "author", "date"], line.split("\t"))) for line in raw.splitlines() if line ] def lint_results(file: str) -> list[dict]: """eslint --format json gives a structured contract.""" raw = subprocess.check_output( ["npx", "eslint", "--format", "json", file], text=True, ) return json.loads(raw) # array of { filePath, messages: [{ ruleId, severity, line, message }] } def pr_diff_files() -> list[str]: """gh pr diff --name-only. One file per line, no parsing needed.""" return subprocess.check_output( ["gh", "pr", "diff", "--name-only"], text=True, ).strip().split("\n") ``` **TypeScript:** ```typescript // Wrong. Regex on prose output // const output = execSync("git log --oneline -10").toString(); // const commits = [...output.matchAll(/^([a-f0-9]{7,}) (.+)$/gm)]; // // Breaks when commit message contains a colon or special chars // Right. Structured output, parse JSON import { execSync } from "node:child_process"; interface Commit { sha: string; short: string; subject: string; author: string; date: string; } export function recentCommits(n = 10): Commit[] { // git log with structured output, no regex needed. const raw = execSync( `git log -${n} --pretty=format:%H%x09%h%x09%s%x09%an%x09%aI`, { encoding: "utf8" }, ); return raw .split("\n") .filter(Boolean) .map((line) => { const [sha, short, subject, author, date] = line.split("\t"); return { sha, short, subject, author, date }; }); } export function lintResults(file: string) { // eslint --format json gives a structured contract. const raw = execSync(`npx eslint --format json ${file}`, { encoding: "utf8" }); return JSON.parse(raw); // array of { filePath, messages: [{ ruleId, severity, line, message }] } } export function prDiffFiles(): string[] { // gh pr diff --name-only. One file per line, no parsing needed. return execSync("gh pr diff --name-only", { encoding: "utf8" }) .trim() .split("\n"); } ``` Concept: `tool-calling` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Multi-file refactoring task (20+ files) | Distribute work across 2-3 specialist subagents (reader → reviewer → applier); 4-5 tools each | One monolithic 15-tool agent does everything in one session | Routing accuracy drops ~8% per tool past 5; 15 tools degrades quality dramatically. Specialist subagents with narrow tool whitelists keep accuracy high and contexts clean. | | Exploring a 1000-file codebase for usages of a function | Grep + Glob first to locate matched files, then Read only those | Read every file to 'understand the codebase' before deciding | Read-everything-first floods context and the agent loses track of the original task. Grep returns matched files in seconds with structured output; Read narrows to ~10-20 files instead of ~1000. | | Reviewing the agent's own code changes | Spawn an independent reviewer subagent with fresh context, no inherited history | Same session generates AND reviews the code | Same-session review inherits the writer's assumptions and rubber-stamps. 
Confirmation bias is structural here, not philosophical. Independence at the architectural layer is the only fix. | | Auto-merging the agent's code changes | 4D Framework: Delegation → Description → Discernment → Diligence (human click) | Auto-merge if reviewer subagent says 'approve' | The cost of a bad merge (broken main, rolled-back deploy, lost time across the team) is far higher than the friction of one human click. The 4D gate exists because trust in agent output is best earned incrementally with humans in the loop on destructive actions. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-25 · Monolithic agent with 15+ tools | Single agent has 15 tools (Read, Write, Edit, Bash, Grep, Glob, WebSearch, gh, npm, jest, eslint, prettier, git, jq, sed). Tool selection accuracy drops; the agent calls Bash(cat) when Read would do, alternates between similar tools, and occasionally hangs comparing options. | Distribute tools across 2-3 specialised subagents (Reader, Reviewer, Doc-gen). Each subagent has 4-5 tools maximum. Routing accuracy stays high because each specialist has a clear tool set and a clear job. | | AP-26 · Auto-merge without human approval | Agent generates a refactor diff, the reviewer subagent approves it, and the workflow auto-merges. A subtle regression ships to main. Rollback takes 40 min; the team's trust in the agent drops for the rest of the quarter. | 4D Framework: Delegation (agent proposes) → Description (clear output) → Discernment (reviewer verdict) → Diligence (human clicks approve). The human gate is non-negotiable for code-changing actions. | | AP-27 · Code review in the same session as code generation | Same Claude session writes the code AND reviews it. The reviewer agrees with the writer on every choice. Same context, same assumptions, same blind spots. Confirmation bias passes through; bugs reach production. | Spawn the reviewer as an independent subagent: fresh messages array, fresh system prompt, no inherited history. It sees only the diff + the project rules. Independence is structural, not philosophical. | | AP-28 · Read-first codebase exploration | Agent runs Read on every file in src/ to 'understand the codebase'. By file 80, context is full of irrelevant content; the agent has lost the original question. Returns generic recommendations instead of specific matches. | Grep + Glob first to locate; Read only matched files. The exploration sequence is non-negotiable: locate → narrow → read. On a 1000-file repo, this is the difference between 1000 Reads and 12. | | AP-29 · Fragile regex on tool output | Workflow parses git log --oneline output with regex to extract commit SHAs. A commit message containing parentheses or a colon breaks the regex; the workflow silently misses commits or attributes them to the wrong author. | Use structured output flags (git log --pretty=format:%H%x09%s, npm test --reporter json, eslint --format json, gh pr diff --json). Parse JSON or fixed-delimiter output. Regex over prose is the reason most agent pipelines break in the third week. | ## Implementation checklist - [ ] Codebase context loader runs once per session, caches to .claude/context-cache.json (`claude-md-hierarchy`) - [ ] Grep + Glob exploration sequence enforced; never Read-first across many files (`context-window`) - [ ] Specialist subagents defined: Reader, Reviewer, Doc-gen. 
Each with 4-5 tools (`subagents`) - [ ] Reviewer subagent runs in fresh context (no inherited writer history) (`evaluation`) - [ ] Tool whitelists scoped per specialist (no Edit/Write on Reader; no Edit on Reviewer) (`tool-calling`) - [ ] Scratchpad file (.claude/scratchpad.md) persists cross-turn agent state - [ ] MCP servers selected per file extension; not all loaded at session start (`mcp`) - [ ] 4D Framework approval gate before any auto-merge or destructive action (`4d-framework`) - [ ] Bash structured-output flags used everywhere (--format json, --porcelain, etc.) - [ ] Tool count audit per subagent: ≤ 5 tools, no exceptions - [ ] Per-task telemetry: tool selection accuracy, reviewer verdict distribution, approval-gate latency ## Cost & latency - **Codebase exploration (Grep + Read 10-15 matched files):** ~$0.003-0.008 per task, Grep is one tool call (~200 tokens output, files-with-matches mode). Read on 10-15 files at ~2K tokens each = ~25K input + ~500 output. ~$0.005 typical. Avoiding read-everything saves orders of magnitude on big repos. - **Code-review subagent per change-set:** ~$0.012 per review, Reviewer reads the diff (~3K tokens) + affected files (~6K tokens) + .claude/rules (~500 tokens) + emits structured verdict (~1K). ~$0.012 at Sonnet 4.5 prices. Cheap insurance against a bad merge. - **Doc-generation subagent (per source file):** ~$0.024 per file, Doc-gen reads source (~4K) + writes JSDoc/docstring (~2K output) + writes README section if asked (~3K output). ~$0.024 typical. Worth running on every commit-back-to-main to keep docs current. - **Full refactor workflow end-to-end:** ~$0.04-0.08 per refactor, Profile + Grep + Read 12 files + propose diff + Review + Approve + Apply. Sums to ~$0.05 typical. The 4D gate adds zero token cost (human input); reviewer subagent is the dominant line item. - **MCP server overhead (when loaded):** ~+10-15% per call, MCP context (language-server output, lint diagnostics) adds ~500-2K tokens per call. Routed per file extension so the cost only applies when language-specific context actually helps. ## Domain weights - **D1 · Agentic Architectures (27%):** Subagent distribution + lead agent loop + 4D Framework gate - **D2 · Tool Design + Integration (18%):** Built-in tool selection + per-subagent tool whitelist + Bash structured output + MCP routing ## Practice questions ### Q1. A developer-productivity agent explores a 1,000-file codebase looking for references to a function. It uses Read on every file and runs out of context after 200 files. How should you fix this? Replace Read-everything with the Grep + Glob first, Read only matched sequence. Step 1: Grep(pattern: '\\bFunctionName\\b', output_mode: 'files_with_matches') returns the 12 files that actually contain the symbol. Step 2: Glob narrows further if needed (e.g. only src/). Step 3: Read only the matched files. On a 1,000-file repo this is the difference between 1,000 Reads and 12. Orders of magnitude on tokens and the difference between a focused agent and a context-flooded one. Tagged to AP-28. ### Q2. An agent auto-generates and auto-reviews code in a single Claude session. The reviewer always agrees with the writer, even when bugs slip through. What architectural change fixes this? Spawn the reviewer as an independent subagent with a fresh messages array, fresh system prompt, and a tool whitelist scoped to [Read, Grep, Bash(test, lint)]. No Edit, no Write. The reviewer sees only the diff + the project rules; it has zero context from the writer. 
Confirmation bias is structural in same-session review; only architectural independence eliminates it. Tagged to AP-27. ### Q3. You're building a documentation agent that processes 12 source files across React components, API routes, and database queries. Your first attempt is one agent with 15 tools. How many specialised subagents should you create? Three specialists, 4-5 tools each: a Reader (Read + Grep + Glob) for codebase exploration, a Reviewer (Read + Grep + Bash test/lint) for verdicts, and a Doc-Generator (Read + Write + Glob) for output. Distribution beats monolithic 15-tool agents because routing accuracy drops ~8% per tool past 5. Each specialist has a clear domain and a clear tool set, and they coordinate via a lead agent. Tagged to AP-25. ### Q4. Your productivity agent successfully rewrites a critical function and immediately auto-merges the change to main. A subtle regression ships to production. What gate was missing? The 4D Framework approval gate. Delegation (agent proposes the change), Description (the diff + summary), Discernment (the reviewer subagent's verdict), and Diligence (a human clicks approve). The human-click step is non-negotiable for code-changing actions because the cost of a bad merge (broken main, rollback time, lost team trust) is orders of magnitude higher than the friction of one approval click. Auto-merge defeats the entire architecture. Tagged to AP-26. ### Q5. An agent uses git log --oneline and parses the output with regex to extract commit SHAs and messages. After a commit message containing parentheses ships, the workflow silently misattributes commits. What's the rule? Always use structured-output flags. Replace git log --oneline parsed by regex with git log --pretty=format:%H%x09%h%x09%s%x09%an%x09%aI (tab-delimited, no ambiguity). Or --pretty=format:%H then git show for full info. Same applies to npm test --reporter json, eslint --format json, gh pr diff --json. Regex over prose is the reason most agent pipelines break in their third week. Structured output is a contract; regex on prose is a guess. Tagged to AP-29. ## FAQ ### Q1. Should I use Read or Grep to explore a large codebase? Grep first, always. Grep finds the symbol or pattern across the whole repo in one tool call (returns matched files + line numbers). Then Read only the matched files. Read-everything-first is the single biggest token waste in productivity workflows. And it floods context, which makes the agent worse, not just slower. ### Q2. Can an agent review code it just generated in the same session? Not effectively. Same-session review inherits the writer's context, assumptions, and blind spots. The reviewer rubber-stamps because it shares the writer's mental model. Spawn an independent reviewer subagent with fresh context: it sees only the diff + the project rules, with no carry-over. Independence is structural, not philosophical. ### Q3. How many tools should a developer-productivity agent have? 4-5 per subagent, distributed across 2-3 specialists. One 15-tool agent loses ~8% routing accuracy per tool past 5. By 15 tools, the agent is alternating between similar options and missing obvious matches. Three specialists with clear tool sets (Reader / Reviewer / Doc-gen) outperform one generalist on every dimension. ### Q4. Do I need MCP servers for code-productivity workflows? No, but they help in multi-language repos. Built-in tools (Read, Grep, Edit, Bash) cover most tasks. 
MCP servers add language-specific context: pyright for Python types, tsc --noEmit for TS, ruff for Python lint, eslint for JS. Route per file extension. Don't load all MCP servers at session start. ### Q5. Should productivity agents auto-merge code? No. The 4D Framework gate is non-negotiable for code-changing actions: Delegation → Description → Discernment → Diligence (human click). The cost of a bad merge is far higher than the friction of one click. If the team really wants automation, the right move is to auto-create the PR, run the reviewer subagent, and post the verdict. But the human still clicks merge. ### Q6. How do you handle multi-language codebases? Per-extension routing for both subagents and MCP servers. The Reader subagent uses Grep with --type ts or --type py to narrow searches by language. The Reviewer subagent loads pyright for .py files and tsc for .ts files. The Doc-Generator emits language-appropriate doc syntax (JSDoc for JS/TS, docstrings for Python). One agent topology, multiple language plug-ins. ### Q7. How do I secure MCP servers against tool-arg injection and credential leakage? Five-line MCP security checklist. (1) Secrets via ${ENV_VAR} expansion in .mcp.json, never inline. (2) Tool input schemas treat every parameter as untrusted (no eval, no string concatenation into shell or SQL inside the MCP server). (3) Allowlist binaries the MCP server can invoke; deny everything else. (4) Per-server transport: stdio for local, HTTPS-only for remote, never plain HTTP. (5) Audit-log every tool_use through the MCP server with the input + outcome + caller agent. Cross-link: P3.7 (agentic-tool-design) covers the 4-bucket structured-error contract MCP servers should emit; pair both pages when designing a new MCP server. Tagged related: mcp-security cluster. ## Production readiness - [ ] Codebase context loader runs in <2s on first session; cached for subsequent sessions - [ ] Grep+Glob+Read sequence enforced (lint check on subagent system prompts) - [ ] Specialist subagent definitions reviewed via PR. Tool whitelists are an audit gate - [ ] Reviewer subagent independence verified by integration test (no shared message history) - [ ] 4D Framework approval gate fires on every code-changing action; auto-merge disabled - [ ] Scratchpad TTL + size-bound: archive old entries past 20 sections - [ ] MCP server health checks per session start; degraded mode on outage - [ ] Telemetry: tool-selection accuracy, reviewer verdict distribution, approval-gate p95 latency --- **Source:** https://claudearchitectcertification.com/scenarios/developer-productivity-agent **Vault sources:** ACP-T05 §Scenario 4 (5 ✅/❌ pairs · official guide scenario); ACP-T06 (5 practice Qs tagged to components); ACP-T07 §Lab 4 spec (built-in tools + codebase navigation + scratchpad); ACP-T08 §3.4 metadata; Course 02 Claude Code 101. Lesson 01 what is Claude Code; GAI-K05 CCA exam questions and scenarios (Scenario 4 walkthrough); 4D Framework concept page **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Structured Data Extraction > A schema-driven extraction agent. 
The harness defines the output shape as a tool with input_schema, sets tool_choice to force that tool, runs a validation-retry loop (parse → validate → on failure feed the error back), uses nullable fields and enum escape hatches so the model says 'unclear' instead of fabricating, gates sensitive bounds with a PreToolUse hook (refund_amount > cap deny), and caches the schema with cache_control: ephemeral for ~90% cost reduction on bulk runs. The most-tested distractor: prompting 'output JSON' instead of forced tool_use. The former leaks 15%, the latter is a structural guarantee. **Sub-marker:** P3.6 **Domains:** D2 · Tool Design + Integration, D5 · Context + Reliability **Exam weight:** 33% of CCA-F (D2 + D5) **Build time:** 22 minutes **Source:** 🟢 Official Anthropic guide scenario · in published exam guide **Canonical:** https://claudearchitectcertification.com/scenarios/structured-data-extraction **Last reviewed:** 2026-05-04 ## In plain English Think of this as the way you turn a messy email or a 200-page contract into a clean spreadsheet row, reliably, every time. Instead of asking the model to 'output JSON' and hoping (which fails about 15% of the time in production), you define exactly the shape you want as a tool schema, force the model to use that tool, and then validate every record before accepting it. When something is missing, you tell the model nullable fields are okay so it doesn't make data up. When the answer comes back wrong, the harness feeds the specific error back and asks again. The whole point is that extraction at scale needs deterministic shape AND honest gaps, not creative writing. ## Exam impact Domain 2 (Tool Design, 18%) tests forced tool_choice, JSON schema authoring, and the validation-retry contract. Domain 5 (Context, 15%) tests nullable-vs-required design and prompt-cache strategy on schemas. In the published guide and tested heavily. The 'why does my prompt-output-JSON pipeline leak 15%?' question is the canonical exam distractor. ## The problem ### What the customer needs - Guaranteed structure on every output. A downstream pipeline must never see a record missing a field. - Honest non-answers when source data is genuinely missing. Better an explicit 'unclear' than a fabricated value. - Bulk extraction at acceptable cost. 1000 documents/night at <$5 total. ### Why naive approaches fail - Prompt 'output JSON' → 15% leak with prose wrapping ('Sure, here's the JSON:'); downstream parser breaks. - Required fields with no nullable option → model fabricates values when source is silent (refund_reason becomes 'customer dissatisfied' even when the email said nothing). - Single-pass extraction with no retry → semantic errors slip through (date as 'next Tuesday', amount as -50, customer_id with embedded whitespace). 
### Definition of done

- Schema conformance = 100% (forced tool_use guarantees shape)
- Fabrication rate < 1% (nullable + enum escapes give the model an honest opt-out)
- Validation-retry convergence ≥ 95% within 3 attempts; remainder routed to human review
- Bulk runs use Batch API (50% discount, 24h SLA) for non-blocking volume
- Schema cached with cache_control: ephemeral for ~90% savings on sustained traffic

## Concepts in play

- 🟢 **Structured outputs** (`structured-outputs`), Forced tool_use as the structural contract
- 🟢 **Tool calling** (`tool-calling`), Schema lives in tools[0].input_schema
- 🟢 **tool_choice** (`tool-choice`), Forced (not auto) for guaranteed extraction
- 🟢 **Evaluation** (`evaluation`), Semantic validation beyond schema shape
- 🟢 **Prompt caching** (`prompt-caching`), Schema cached ephemeral, ~90% savings
- 🟢 **Batch API** (`batch-api`), 50% discount for bulk overnight extraction
- 🟢 **Hooks** (`hooks`), PreToolUse on policy bounds (refund cap)
- 🟢 **Context window** (`context-window`), Schema reuse across documents in one session

## Components

### JSON Schema Definition, the contract, in input_schema

The output shape lives inside a tool definition, not as freeform text instruction. Required vs nullable, enum vs string, integer vs number. Every property is explicit. The model can only emit a tool_use call that matches; the SDK rejects anything else.

**Configuration:** tools = [{ name: "extract_record", input_schema: { type: "object", properties: { customer_id: { type: "string", pattern: "^cust_[0-9]+$" }, refund_amount: { type: ["number", "null"] }, refund_reason: { type: "string", enum: ["damage", "wrong_item", "late", "other", "unclear"] } }, required: ["customer_id", "refund_amount", "refund_reason"] } }]

**Concept:** `structured-outputs`

### Forced tool_choice, tool_choice: { type: 'tool', name: ... }

Setting tool_choice to a specific tool name guarantees the model fires that tool. No prose wrapping, no 'I'd be happy to help' preamble, no probabilistic adherence. This is the single biggest reliability lever; it converts 85% prompt-only adherence into 100% structural adherence.

**Configuration:** tool_choice: { type: 'tool', name: 'extract_record' }. Use 'auto' only for open-ended flows where the model decides whether to call any tool. Forced is for mandatory extraction.

**Concept:** `tool-choice`

### Validation-Retry Loop, parse → validate → feed-error-back

Schema enforcement guarantees STRUCTURE. Semantic validation (date format, amount sign, ID pattern, business rules) runs in code after parse. On failure, the harness feeds a specific error message back to the model ('refund_amount is -50, must be ≥ 0') and retries. Typically converges in ≤ 2 retries. Generic 'try again' doesn't work; specific errors do.

**Configuration:** loop: extract → parse → validate_semantically → if invalid, append { role: 'user', content: 'Validation failed: <specific errors>. Re-extract.' } → retry. Max retries: 3. After 3, route to human review.

**Concept:** `evaluation`

### Nullable Fields + Enum Escape Hatches, the anti-fabrication architecture

When a source genuinely doesn't contain a value, the model has two honest options: emit null (if the schema allows nullable) or emit a designated 'unclear' / 'not_provided' enum value. Without these escape hatches, required-string fields force the model to invent. Fabrication rate climbs above 5%. With them, fabrication drops below 1%.

**Configuration:** Field types: ["string", "null"] for optional values. Enums always include "unclear" or "other" as the last option. Few-shot examples explicitly show the model emitting unclear when source is silent. Anchors the behaviour.

**Concept:** `structured-outputs`

### Schema Caching + Batch API, cost discipline at volume

The schema is the largest stable token cost (~500-2000 tokens depending on complexity). Mark the tools array with cache_control: ephemeral; the 5-min TTL keeps it warm across sustained traffic, dropping schema-token cost ~90%. For overnight bulk runs, the Batch API gives a flat 50% discount. Combined with caching, bulk extraction cost drops 95%+ vs naive sync calls.

**Configuration:** Sync API: tools array with cache_control: { type: 'ephemeral' }. Cache hit rate ≥ 70% within 5-min windows. Batch API: submit 1000+ extractions overnight, results within 24h, no real-time retries (resubmit failures the next batch).

**Concept:** `prompt-caching`

## Build steps

### 1. Author the JSON schema as a tool definition

Define the output shape in tools[0].input_schema. A JSON Schema object. Every required field listed in required[]. Every optional field has ["<type>", "null"] so the model can emit null. Every constrained string is an enum with an explicit escape (unclear, not_provided, other). Add pattern regex on IDs that have a known format.

**Python:**

```python
from anthropic import Anthropic

client = Anthropic()

EXTRACT_TOOL = {
    "name": "extract_record",
    "description": "Extract a structured record from a customer email.",
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "pattern": "^cust_[0-9]+$"},
            "refund_amount": {"type": ["number", "null"]},
            "refund_reason": {
                "type": "string",
                "enum": ["damage", "wrong_item", "late", "other", "unclear"],
            },
            "urgency": {
                "type": "string",
                "enum": ["low", "medium", "high", "unclear"],
            },
        },
        "required": ["customer_id", "refund_amount", "refund_reason", "urgency"],
    },
}
```

**TypeScript:**

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const EXTRACT_TOOL: Anthropic.Tool = {
  name: "extract_record",
  description: "Extract a structured record from a customer email.",
  input_schema: {
    type: "object",
    properties: {
      customer_id: { type: "string", pattern: "^cust_[0-9]+$" },
      refund_amount: { type: ["number", "null"] },
      refund_reason: {
        type: "string",
        enum: ["damage", "wrong_item", "late", "other", "unclear"],
      },
      urgency: {
        type: "string",
        enum: ["low", "medium", "high", "unclear"],
      },
    },
    required: ["customer_id", "refund_amount", "refund_reason", "urgency"],
  },
};
```

Concept: `structured-outputs`

### 2. Force tool_choice to the extraction tool

tool_choice: { type: 'tool', name: 'extract_record' } is the structural contract. The model has no choice but to fire the tool with arguments matching the schema. Any prose preamble or wrapping disappears. This single setting turns 85% prompt-only adherence into 100% structural adherence.

**Python:**

```python
def extract_one(email_text: str) -> dict:
    resp = client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=1024,
        tools=[EXTRACT_TOOL],
        tool_choice={"type": "tool", "name": "extract_record"},
        messages=[{"role": "user", "content": email_text}],
    )
    # The model MUST emit a tool_use block matching EXTRACT_TOOL.input_schema
    for block in resp.content:
        if block.type == "tool_use" and block.name == "extract_record":
            return block.input  # already a dict matching the schema shape
    raise RuntimeError("forced tool_choice did not yield tool_use. SDK bug")
```

**TypeScript:**

```typescript
async function extractOne(emailText: string) {
  const resp = await client.messages.create({
    model: "claude-sonnet-4.5",
    max_tokens: 1024,
    tools: [EXTRACT_TOOL],
    tool_choice: { type: "tool", name: "extract_record" },
    messages: [{ role: "user", content: emailText }],
  });
  // The model MUST emit a tool_use block matching EXTRACT_TOOL.input_schema
  for (const block of resp.content) {
    if (block.type === "tool_use" && block.name === "extract_record") {
      return block.input as Record<string, unknown>;
    }
  }
  throw new Error("forced tool_choice did not yield tool_use. SDK bug");
}
```

Concept: `tool-choice`

### 3. Add nullable types and enum escape hatches

Every field that might be genuinely missing in the source gets ["<type>", "null"]. Every constrained string includes an explicit unclear / not_provided / other option. This gives the model an honest exit when the source is silent. Without it, required-string fields force fabrication. Pair with a few-shot example showing the model correctly emitting unclear.

**Python:**

```python
# Few-shot: show the model how to emit 'unclear' on a silent source
FEW_SHOT_EXAMPLES = [
    {
        "role": "user",
        "content": "I want a refund for my order. Thanks.",
    },
    {
        "role": "assistant",
        "content": [{
            "type": "tool_use",
            "id": "toolu_demo_1",
            "name": "extract_record",
            "input": {
                "customer_id": "cust_unknown",  # signals: pattern won't match, route to human
                "refund_amount": None,
                "refund_reason": "unclear",
                "urgency": "unclear",
            },
        }],
    },
    {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": "toolu_demo_1",
            "content": "ok",
        }],
    },
]

# Then your real message follows:
# messages = FEW_SHOT_EXAMPLES + [{"role": "user", "content": email_text}]
```

**TypeScript:**

```typescript
// Few-shot: show the model how to emit 'unclear' on a silent source
const FEW_SHOT_EXAMPLES: Anthropic.MessageParam[] = [
  { role: "user", content: "I want a refund for my order. Thanks." },
  {
    role: "assistant",
    content: [
      {
        type: "tool_use",
        id: "toolu_demo_1",
        name: "extract_record",
        input: {
          customer_id: "cust_unknown",
          refund_amount: null,
          refund_reason: "unclear",
          urgency: "unclear",
        },
      },
    ],
  },
  {
    role: "user",
    content: [{ type: "tool_result", tool_use_id: "toolu_demo_1", content: "ok" }],
  },
];

// Then prepend FEW_SHOT_EXAMPLES to your real messages array.
```

Concept: `structured-outputs`

### 4. Wrap extraction in a validation-retry loop

Schema guarantees structure; semantics need code. After parsing the tool_use input, validate semantically: refund_amount ≥ 0, customer_id matches the canonical pattern beyond the regex, urgency-vs-amount sanity (a $50K refund tagged 'low' is suspicious). On failure, feed a specific error back to the model and retry. Specific errors converge; generic 'try again' loops forever.
**Python:**

```python
import re

def validate(record: dict) -> list[str]:
    """Returns a list of specific error messages (empty = valid)."""
    errors = []
    if record.get("refund_amount") is not None and record["refund_amount"] < 0:
        errors.append("refund_amount must be non-negative")
    if not re.fullmatch(r"cust_\d{4,}", record.get("customer_id", "")):
        errors.append("customer_id must match cust_<4+ digits>")
    if (record.get("refund_amount") or 0) > 1000 and record.get("urgency") == "low":
        errors.append("refund_amount > 1000 with urgency='low' is suspicious")
    return errors

def extract_with_retry(email_text: str, max_retries: int = 3) -> dict:
    messages = [{"role": "user", "content": email_text}]
    for attempt in range(max_retries):
        resp = client.messages.create(
            model="claude-sonnet-4.5",
            max_tokens=1024,
            tools=[EXTRACT_TOOL],
            tool_choice={"type": "tool", "name": "extract_record"},
            messages=messages,
        )
        tool_use = next(b for b in resp.content if b.type == "tool_use")
        record = tool_use.input
        errors = validate(record)
        if not errors:
            return record
        # Feed the SPECIFIC error back; generic 'retry' doesn't converge
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": f"Validation failed: {'; '.join(errors)}. Re-extract.",
                "is_error": True,
            }],
        })
    raise ValueError(f"extraction did not converge in {max_retries} attempts")
```

**TypeScript:**

```typescript
function validate(record: Record<string, unknown>): string[] {
  // Returns a list of specific error messages (empty = valid).
  const errors: string[] = [];
  const amount = record.refund_amount as number | null;
  if (amount !== null && amount !== undefined && amount < 0) {
    errors.push("refund_amount must be non-negative");
  }
  if (!/^cust_\d{4,}$/.test(record.customer_id as string)) {
    errors.push("customer_id must match cust_<4+ digits>");
  }
  if (amount && amount > 1000 && record.urgency === "low") {
    errors.push("refund_amount > 1000 with urgency='low' is suspicious");
  }
  return errors;
}

async function extractWithRetry(emailText: string, maxRetries = 3) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: emailText }];
  for (let i = 0; i < maxRetries; i++) {
    const resp = await client.messages.create({
      model: "claude-sonnet-4.5",
      max_tokens: 1024,
      tools: [EXTRACT_TOOL],
      tool_choice: { type: "tool", name: "extract_record" },
      messages,
    });
    const toolUse = resp.content.find((b) => b.type === "tool_use");
    if (!toolUse || toolUse.type !== "tool_use") throw new Error("no tool_use");
    const errors = validate(toolUse.input as Record<string, unknown>);
    if (errors.length === 0) return toolUse.input;
    // Feed the SPECIFIC error back; generic 'retry' doesn't converge
    messages.push({ role: "assistant", content: resp.content });
    messages.push({
      role: "user",
      content: [
        {
          type: "tool_result",
          tool_use_id: toolUse.id,
          content: `Validation failed: ${errors.join("; ")}. Re-extract.`,
          is_error: true,
        },
      ],
    });
  }
  throw new Error(`extraction did not converge in ${maxRetries} attempts`);
}
```

Concept: `evaluation`

### 5. Cache the schema with cache_control: ephemeral

The schema is the largest stable token cost in steady-state extraction (~500-2000 tokens for non-trivial shapes). Mark the tools array with cache_control: { type: 'ephemeral' }. The 5-min TTL keeps it warm across sustained traffic; cached input tokens cost ~10% of fresh tokens. Hit rate stays ≥ 70% with continuous extraction; ~90% schema-token savings.
**Python:**

```python
def extract_with_cache(email_text: str) -> dict:
    resp = client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=1024,
        tools=[
            {
                **EXTRACT_TOOL,
                "cache_control": {"type": "ephemeral"},  # 5-min TTL
            },
        ],
        tool_choice={"type": "tool", "name": "extract_record"},
        messages=[{"role": "user", "content": email_text}],
    )
    # Inspect cache stats for observability
    print(f"cache_creation: {resp.usage.cache_creation_input_tokens}")
    print(f"cache_read: {resp.usage.cache_read_input_tokens}")
    return next(b.input for b in resp.content if b.type == "tool_use")
```

**TypeScript:**

```typescript
async function extractWithCache(emailText: string) {
  const resp = await client.messages.create({
    model: "claude-sonnet-4.5",
    max_tokens: 1024,
    tools: [
      {
        ...EXTRACT_TOOL,
        cache_control: { type: "ephemeral" }, // 5-min TTL
      },
    ],
    tool_choice: { type: "tool", name: "extract_record" },
    messages: [{ role: "user", content: emailText }],
  });
  // Inspect cache stats for observability
  console.log(`cache_creation: ${resp.usage.cache_creation_input_tokens}`);
  console.log(`cache_read: ${resp.usage.cache_read_input_tokens}`);
  const tu = resp.content.find((b) => b.type === "tool_use");
  return tu?.type === "tool_use" ? tu.input : null;
}
```

Concept: `prompt-caching`

### 6. Add a PreToolUse hook on policy bounds

When extraction touches policy-bearing values (refund cap, transaction limits), don't trust the model. Wrap it in a PreToolUse hook that exits 2 on violation. The hook reads tool_input.refund_amount and compares to the known cap; on breach, it returns an error to the model with the cap reference, and the model re-extracts with the constraint visible. Deterministic policy enforcement, not probabilistic.

**Python:**

```python
# .claude/hooks/extract_policy.py
import sys, json, os

REFUND_CAP = float(os.environ.get("REFUND_CAP", "500"))

payload = json.loads(sys.stdin.read())
if payload["tool_name"] != "extract_record":
    sys.exit(0)

amount = payload["tool_input"].get("refund_amount") or 0
if amount > REFUND_CAP:
    print(
        f"refund_amount ${amount} exceeds policy cap ${REFUND_CAP}; "
        f"emit refund_amount=null and refund_reason='unclear' instead",
        file=sys.stderr,
    )
    sys.exit(2)  # DENY. Model sees error, re-extracts with the bound
sys.exit(0)
```

**TypeScript:**

```typescript
// .claude/hooks/extract-policy.ts
import { readFileSync } from "node:fs";

const REFUND_CAP = Number(process.env.REFUND_CAP ?? 500);

const payload = JSON.parse(readFileSync(0, "utf8"));
if (payload.tool_name !== "extract_record") process.exit(0);

const amount = (payload.tool_input?.refund_amount as number | null) ?? 0;
if (amount > REFUND_CAP) {
  process.stderr.write(
    `refund_amount $${amount} exceeds policy cap $${REFUND_CAP}; ` +
    `emit refund_amount=null and refund_reason='unclear' instead\n`,
  );
  process.exit(2); // DENY. Model sees error, re-extracts with the bound
}
process.exit(0);
```

Concept: `hooks`

### 7. Use Batch API for bulk overnight runs

Sync API is the right call when latency matters. For overnight backfills (1000+ documents), the Batch API gives a flat 50% discount with a 24h SLA. Combined with schema caching, bulk extraction cost drops 95%+ vs naive sync calls. Resubmit failures as a new batch the next morning. Batch API is async, no real-time retry inside the batch.
**Python:**

```python
import json

def submit_batch_extraction(emails: list[dict]) -> str:
    """Submit a batch of extraction requests for overnight processing."""
    requests = []
    for email in emails:
        requests.append({
            "custom_id": f"extract-{email['id']}",
            "params": {
                "model": "claude-sonnet-4.5",
                "max_tokens": 1024,
                "tools": [EXTRACT_TOOL],
                "tool_choice": {"type": "tool", "name": "extract_record"},
                "messages": [{"role": "user", "content": email["body"]}],
            },
        })
    batch = client.messages.batches.create(requests=requests)
    print(f"Batch {batch.id} submitted with {len(requests)} extractions")
    print(f"Expected ready: {batch.expires_at}")
    return batch.id

# Next morning. Fetch results, validate each, requeue failures
def harvest_batch(batch_id: str):
    batch = client.messages.batches.retrieve(batch_id)
    if batch.processing_status != "ended":
        return {"status": "not_ready"}
    results = client.messages.batches.results(batch_id)
    accepted, rejected = [], []
    for r in results:
        if r.result.type == "succeeded":
            tu = next(b for b in r.result.message.content if b.type == "tool_use")
            if not validate(tu.input):
                accepted.append(tu.input)
                continue
        rejected.append(r.custom_id)
    return {"accepted": accepted, "rejected_for_retry": rejected}
```

**TypeScript:**

```typescript
async function submitBatchExtraction(emails: Array<{ id: string; body: string }>) {
  const requests = emails.map((email) => ({
    custom_id: `extract-${email.id}`,
    params: {
      model: "claude-sonnet-4.5",
      max_tokens: 1024,
      tools: [EXTRACT_TOOL],
      tool_choice: { type: "tool", name: "extract_record" } as const,
      messages: [{ role: "user" as const, content: email.body }],
    },
  }));
  const batch = await client.messages.batches.create({ requests });
  console.log(`Batch ${batch.id} submitted with ${requests.length} extractions`);
  console.log(`Expected ready: ${batch.expires_at}`);
  return batch.id;
}

// Next morning. Fetch results, validate each, requeue failures
async function harvestBatch(batchId: string) {
  const batch = await client.messages.batches.retrieve(batchId);
  if (batch.processing_status !== "ended") return { status: "not_ready" };
  const results = await client.messages.batches.results(batchId);
  const accepted: unknown[] = [];
  const rejected: string[] = [];
  for await (const r of results) {
    if (r.result.type === "succeeded") {
      const tu = r.result.message.content.find((b) => b.type === "tool_use");
      if (tu?.type === "tool_use" && validate(tu.input as Record<string, unknown>).length === 0) {
        accepted.push(tu.input);
        continue;
      }
    }
    rejected.push(r.custom_id);
  }
  return { accepted, rejected_for_retry: rejected };
}
```

Concept: `batch-api`

### 8. Stratified accuracy reporting (not aggregate)

A 95% aggregate accuracy can hide a 60% accuracy on a critical document type. Track validation pass rate stratified by source. By document type (email vs PDF vs HTML), by sender domain, by extraction date. Bad strata surface fast. Aggregate metrics lie; stratified ones tell the truth.

**Python:**

```python
from collections import defaultdict

def stratified_accuracy(records: list[dict]) -> dict:
    """Group validation results by document type, surface weak strata."""
    by_type = defaultdict(lambda: {"pass": 0, "fail": 0})
    for r in records:
        errors = validate(r["extracted"])
        bucket = "pass" if not errors else "fail"
        by_type[r["doc_type"]][bucket] += 1
    report = {}
    for doc_type, counts in by_type.items():
        total = counts["pass"] + counts["fail"]
        report[doc_type] = {
            "total": total,
            "pass_rate": counts["pass"] / total if total else 0,
            "fail_count": counts["fail"],
        }
    # Sort by pass_rate ascending. Worst strata first
    return dict(sorted(report.items(), key=lambda kv: kv[1]["pass_rate"]))

# Output:
# {
#   "html_email": {"total": 200, "pass_rate": 0.62, "fail_count": 76},  # WEAK
#   "pdf": {"total": 150, "pass_rate": 0.91, "fail_count": 14},
#   "plain_text": {"total": 650, "pass_rate": 0.99, "fail_count": 6},
# }
```

**TypeScript:**

```typescript
function stratifiedAccuracy(
  records: Array<{ doc_type: string; extracted: Record<string, unknown> }>,
) {
  // Group validation results by document type, surface weak strata.
  const byType = new Map<string, { pass: number; fail: number }>();
  for (const r of records) {
    const errors = validate(r.extracted);
    const bucket = errors.length === 0 ? "pass" : "fail";
    const counts = byType.get(r.doc_type) ?? { pass: 0, fail: 0 };
    counts[bucket]++;
    byType.set(r.doc_type, counts);
  }
  const report: Record<string, { total: number; pass_rate: number; fail_count: number }> = {};
  for (const [docType, counts] of byType) {
    const total = counts.pass + counts.fail;
    report[docType] = {
      total,
      pass_rate: total ? counts.pass / total : 0,
      fail_count: counts.fail,
    };
  }
  // Sort by pass_rate ascending. Worst strata first
  return Object.fromEntries(
    Object.entries(report).sort(([, a], [, b]) => a.pass_rate - b.pass_rate),
  );
}
```

Concept: `evaluation`

## Decision matrix

| Decision | Right answer | Wrong answer | Why |
|---|---|---|---|
| Output shape guarantee | Forced tool_choice with input_schema as the contract | Prompt instruction 'output JSON' or 'respond with valid JSON' | Prompt instruction is probabilistic (~85% adherence in production); forced tool_use is structural (100%). The cost difference is negligible; the reliability difference is decisive. |
| Field that might be missing in the source | ["<type>", "null"] AND/OR enum with explicit 'unclear' option | Required field with no nullable / no escape. Force the model to invent | Without an honest exit, the model fabricates. Fabrication rate climbs above 5% on required-string fields. Nullable + enum escapes drop it below 1%. |
| Validation failure | Validation-retry loop with the SPECIFIC error fed back | Generic 'please try again' or single-pass with no retry | Specific errors (e.g. 'refund_amount is -50, must be ≥ 0') converge in ≤ 2 retries. Generic retries don't converge; the model regenerates the same bad output. |
| 1000 extractions overnight | Batch API + schema caching (50% × ~90% = ~95% savings) | Sync API with caching, or sync API without caching | Bulk + non-blocking = Batch API by default. Cap is the 24h SLA. For latency-critical extractions, stay sync + cached; for backfill, switch to Batch. |

## Failure modes

| Anti-pattern | Failure | Fix |
|---|---|---|
| AP-SDE-01 · Prompt-only JSON | System prompt says 'respond with JSON only'. ~15% of responses include prose wrapping ('Sure, here's the JSON:'); downstream parser breaks on roughly every seventh document. | Forced tool_choice + input_schema. The model has no choice but to fire the tool with arguments matching the schema. 100% structural adherence. |
| AP-SDE-02 · Non-nullable required fields | refund_reason is required string. Source email says nothing about a reason. Model fabricates 'customer dissatisfied' to satisfy the schema. Fabrication rate ~7%. | Make the field nullable AND/OR add an explicit 'unclear' enum option. Few-shot one example showing the model emit 'unclear' on a silent source. Fabrication drops below 1%. |
| AP-SDE-03 · Single-pass extraction | Validation fails (refund_amount = -50). The pipeline drops the record entirely. Operator sees 5% silent loss; nobody notices for two weeks. | Validation-retry loop: feed the specific error back to the model, retry up to 3 times. 70-80% of validation failures converge within 2 retries. |
| AP-SDE-04 · Schema-only validation | Schema accepts refund_amount: -50 (number type passes). Schema accepts customer_id: 'cust_ 42' (string type passes). Bad data ships downstream. | Validate semantically AFTER schema parse: bounds checks (amount ≥ 0), regex on IDs (no embedded whitespace), business rules (urgency-vs-amount sanity). Schema enforces shape; code enforces meaning. |
| AP-SDE-05 · No policy hook on sensitive bounds | Refund cap is enforced in the system prompt ('never extract refund_amount > 500'). Production sees 3% violations leak through; auditor flags. | PreToolUse hook on extract_record: reads tool_input.refund_amount, compares to policy cap, exits 2 on breach with a specific error message. Deterministic, not probabilistic. |

## Implementation checklist

- [ ] Schema lives in tools[0].input_schema (not in the prompt) (`structured-outputs`)
- [ ] tool_choice forced to the extraction tool name (`tool-choice`)
- [ ] Every optional field uses ["<type>", "null"] or includes an 'unclear' enum (`structured-outputs`)
- [ ] Few-shot example demonstrates the 'unclear' / null exit on silent source
- [ ] Validation-retry loop with SPECIFIC errors fed back; max_retries = 3 (`evaluation`)
- [ ] Schema cached with cache_control: ephemeral; hit rate monitored ≥ 70% (`prompt-caching`)
- [ ] PreToolUse hook on policy-bearing fields (cap, threshold) (`hooks`)
- [ ] Batch API used for bulk overnight runs (>100 docs) (`batch-api`)
- [ ] Stratified accuracy reporting (by doc_type, sender, date). Never aggregate-only
- [ ] Failed-after-retries records routed to human review queue with original + last error
- [ ] Telemetry: schema cache hit rate, validation pass rate, retry distribution, fabrication rate

## Cost & latency

- **Per-extraction (sync, cached schema):** ~$0.0008-0.002, Schema ~1500 tokens at cache-read price (~$0.0001) + email body ~500 input tokens + ~150 output tokens. Sustained traffic with ≥70% cache hit rate keeps per-record cost predictable.
- **Validation-retry overhead:** ~+30% on records that retry, 5-10% of records retry once; 1-2% retry twice. Specific-error feedback converges quickly. Overall pipeline cost up ~5% to gain ~99% schema-conformance + ~99% semantic-conformance.
- **Bulk overnight (Batch API):** ~50% off sync, ~95% off naive uncached sync, Batch API flat 50% discount × schema caching ~90% savings = ~95% total. 1000 documents @ ~$0.0008 each (sync cached) drops to ~$0.0004 each (batch + cache). $0.40 vs $0.80 per 1000.
- **Hook overhead:** ~0% token cost; ~50ms latency, PreToolUse hook is a Python/TS subprocess reading stdin and exiting 0/2. No LLM call. Cost is purely syscall-level latency.
- **Per-1000-docs total (steady state, sync + cached):** ~$0.80-2.00, Real production extraction at typical complexity. Batch + cache halves it. Adding human review of unconverged records adds operator-time cost but recovers the long tail.

## Domain weights

- **D2 · Tool Design + Integration (18%):** Schema in tool · forced tool_choice · validation-retry · PreToolUse hook
- **D5 · Context + Reliability (15%):** Nullable design · enum escapes · prompt caching strategy · stratified reporting

## Practice questions

### Q1. Your prompt-only JSON extractor leaks ~15% with prose wrapping ('Sure, here is the JSON: ...'). Downstream parser breaks every 7th document. What's the architectural fix?

Move the schema into tools[0].input_schema and set tool_choice: { type: 'tool', name: 'extract_record' }.
This forces the model to emit a structured tool_use call matching the schema. No prose wrapping, no preamble, no probabilistic adherence. The 15% leak collapses to 0%; the parser sees only valid structured input. Tagged to AP-SDE-01. ### Q2. Schema has refund_reason: { type: 'string' } (required). Source email says nothing about a reason. Production logs show ~7% of records have invented reasons like 'customer dissatisfied'. How do you stop the model from fabricating? Give the model an honest exit. Either make the field nullable (type: ['string', 'null']) or use an enum that includes an explicit 'unclear' option. Pair with one few-shot example showing the model correctly emit 'unclear' on a silent source. Fabrication rate drops from ~7% to <1% because the model now has a structurally-correct way to say 'I don't know'. Tagged to AP-SDE-02. ### Q3. Validation fails on a record (refund_amount = -50). Your pipeline drops the record silently. What architectural change recovers most of these without human intervention? Validation-retry loop with the SPECIFIC error fed back. After parsing, validate semantically; on failure, append a tool_result with is_error: true and the specific error message ('refund_amount is -50, must be ≥ 0'); retry up to 3 times. 70-80% of validation failures converge within 2 retries because the model now sees what's wrong. Generic 'try again' messages don't converge. Tagged to AP-SDE-03. ### Q4. You're processing 1000 customer emails overnight for a backfill. What API choice minimizes cost without sacrificing reliability? Batch API + prompt-caching on the schema. Batch API gives a flat 50% discount with a 24h SLA. Perfect for non-blocking backfill. Schema cached with cache_control: ephemeral saves another ~90% on the schema-token cost. Combined: ~95% savings vs naive sync calls. Resubmit failed records as a fresh batch the next morning; Batch API is async, no real-time retry inside the batch. ### Q5. Aggregate accuracy is 95%. Your CTO wants to know if it's safe to ship. What additional view do you produce before answering? Stratified accuracy. Pass-rate broken down by document type (email vs PDF vs HTML), sender domain, date range. A 95% aggregate can hide a 62% pass rate on html_email (which dominates volume) while plain-text scores 99%. Surface the worst stratum. Aggregate metrics lie; stratified metrics tell the truth and surface the documents that need targeted few-shots, schema tweaks, or human review. ## FAQ ### Q1. Why not just prompt 'output JSON'. Claude is good at it now? Probabilistic ≠ guaranteed. Even at 95% adherence, a 5% prose-wrapped output rate is one broken record per 20 documents. Forced tool_choice is structural. The SDK rejects anything that doesn't match the schema. The cost is identical; the reliability difference is decisive. Use prompts for tone; use forced tools for shape. ### Q2. Can I cache the schema if it changes between calls? Cache the stable parts. If the schema body is fixed but a few enum values vary by tenant, split the tools array: stable common schema (cached) + small per-tenant additions (fresh). Cache only what's stable across calls; the cache key is sensitive to byte-level changes. ### Q3. How does Batch API interact with the validation-retry loop? Batch is async. No inside-the-batch retry. Submit, wait 24h, harvest results. Validate each; if some fail, submit those failures (with the specific error in the next message) as a NEW batch. Most converge in batch-2. 
For records that need real-time retry, route them to the sync pipeline. ### Q4. What's the difference between the schema's pattern regex and code-level validation? Schema validates shape; code validates meaning. pattern: '^cust_[0-9]+$' rejects malformed IDs at parse time. Faster, structural. Semantic checks (the customer_id must exist in our DB, the refund_amount must be ≤ the original purchase total) need code; they're business rules, not syntax. Use pattern for cheap structural rejection; use code for everything that requires lookups or business logic. ### Q5. Should I run extended_thinking with structured extraction? Generally no. They're incompatible with forced tool_choice. When the model needs to reason about ambiguous source text, set tool_choice: 'auto' and accept ~95% reliability, OR run a sync pre-pass to disambiguate, then a forced extraction on the cleaned input. Don't try to combine extended_thinking with forced tool_choice in one call. ### Q6. How do I test that nullable + enum escapes are working? Adversarial test set. Compose 50 short inputs where the value is GENUINELY missing or ambiguous. Run extraction. The right behaviour is a mix of null and 'unclear'. Never invented values. If you see invented values (e.g. a refund_reason that's not in the source), the few-shot or the schema doesn't yet give the model an honest exit. Iterate. ### Q7. Does forced tool_choice work with multi-tool registries? Yes. Tool_choice picks one specific tool by name. Even in a 5-tool registry, tool_choice: { type: 'tool', name: 'extract_record' } forces THAT tool. The other 4 are inert for this call. Use this when extraction is mandatory but the broader agent has other tools available; for extraction-only, the registry can have just one tool. ### Q8. When do I use Batch API vs sync vs caching for cost optimization? Three-axis decision. (1) Latency-sensitive (interactive review, real-time agent loops): sync API + prompt-cache on the schema and stable system prompt. ~80% off the cached portion at >=70% hit rate. (2) Latency-tolerant bulk (>=100 docs overnight): Batch API for the flat 50% discount, 24h SLA. Combine with caching only at the per-batch sub-batch level (ephemeral cache is per-request in Batch). (3) Mixed traffic: sync the long tail, batch the predictable backlog. Cross-references: P3.5 uses Batch API for nightly audits; P3.8 uses Batch API for bulk long-doc extraction; P3.7 covers the 4-line tool-description pattern that keeps cached tools array stable. Tagged related: cost-optimization cluster. ### Q9. Why does the model's self-reported confidence not gate routing? Calibration anti-pattern. Models emit confidence scores that correlate weakly with actual correctness, especially on out-of-distribution inputs. A 0.95 confidence on a fabricated value still ships fabricated data. The right architecture is structural: validation-retry on schema + semantic checks (sum equals total, currency in enum, date sanity), and stratified accuracy reporting by document type. Use confidence as a soft signal in routing decisions (escalate to human if confidence < 0.5) but never as the primary gate. Tagged related: evaluation-and-evals cluster. 
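To make Q9 concrete, here is a minimal Python sketch of that routing posture. It reuses `extract_with_retry()` (and therefore `validate()`) from build step 4; the self-reported `confidence` field, the 0.5 threshold, and the `enqueue_human_review()` helper are illustrative assumptions rather than part of the published scenario.

```python
def route_record(email_text: str) -> dict:
    """Structural checks gate acceptance; confidence only nudges records toward review."""
    try:
        # Schema + semantic validation is the hard gate (build step 4)
        record = extract_with_retry(email_text)
    except ValueError:
        # Did not converge within max_retries: route to the (hypothetical) review queue
        enqueue_human_review(email_text, reason="did_not_converge")
        return {"status": "human_review"}

    # Soft signal only: a low self-reported confidence routes to review,
    # but a high confidence never bypasses the validation above.
    confidence = record.get("confidence")  # illustrative field, not part of EXTRACT_TOOL
    if confidence is not None and confidence < 0.5:
        enqueue_human_review(email_text, reason="low_confidence", record=record)
        return {"status": "human_review", "record": record}

    return {"status": "accepted", "record": record}
```

The property to preserve is that deleting the confidence branch only changes how much lands in the review queue, never what the structural gate lets through to downstream systems.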
## Production readiness - [ ] Schema versioned in source control; PR-reviewed before deploy - [ ] Few-shot 'unclear' / null example in every extraction prompt - [ ] Validation-retry loop with specific-error feedback; max_retries = 3 - [ ] PreToolUse hook on policy-bearing fields with unit tests - [ ] Schema cache hit rate monitored; alert if drops below 50% - [ ] Batch API job for nightly backfill with auto-resubmit on transient failures - [ ] Stratified accuracy dashboard updated daily; alert on any stratum < 90% - [ ] Human-review queue for records that fail after 3 retries; SLA documented --- **Source:** https://claudearchitectcertification.com/scenarios/structured-data-extraction **Vault sources:** ACP-T05 §Scenario 6 (5 ✅/❌ pairs · official guide scenario); ACP-T08 §3.6 metadata; Course 11 Claude in Bedrock. Lessons 32, 39, 40 (JSON Schema, structured data, flexible extraction); Course 06 Claude with Anthropic API. Lesson 13 structured data; GAI-K04 Claude Certified Architect Exam Reference **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Agentic Tool Design > A meta-skill scenario about designing tool registries. The optimum is 4-5 tools per agent (past 5, routing accuracy drops 8% per extra tool); each tool follows the Anthropic 4-line description pattern (what / when / edge cases / ordering); risky calls go through a PreToolUse hook; results pass through a PostToolUse hook for normalization and side-effect guards; errors are labeled with one of four structured buckets (Transient · Permission · Data · Business) so retry logic can branch correctly. The most-tested distractor: a 15-tool agent that 'just needs a smarter model'. No, it needs a smaller registry. **Sub-marker:** P3.7 **Domains:** D2 · Tool Design + Integration, D3 · Agent Operations **Exam weight:** 38% of CCA-F (D2 + D3) **Build time:** 24 minutes **Source:** 🟡 Beyond-guide scenario · OP-claimed (Reddit 1s34iyl) · architecture matches Anthropic public guidance **Canonical:** https://claudearchitectcertification.com/scenarios/agentic-tool-design **Last reviewed:** 2026-05-04 ## In plain English Think of this as designing the toolbox before the agent picks it up. What tools to give, how to describe them, what guard-rails to put around them. The trap is to give the agent fifteen tools and hope; the discipline is to give it four or five really well-described tools, write each description in the same four-line pattern (what / when / edge cases / ordering), wrap risky calls in PreToolUse hooks, normalize the results in PostToolUse hooks, and label every error with one of four buckets so retry logic can reason about it. The whole point is that tool design IS agent design. Get the toolbox right and the rest of the agent works. ## Exam impact Domain 2 (Tool Design, 18%) tests the 4-5-tool optimum, the 4-line description pattern, and tool_choice mechanics. Domain 3 (Claude Code Configuration, 20%) tests hook-based policy enforcement (PreToolUse + PostToolUse) and structured-error contracts. Beyond-guide but architecturally consistent with Anthropic's published tool-use guide. The 'why does my 15-tool agent route badly?' question is the canonical exam distractor. ## The problem ### What the customer needs - A tool registry the agent routes accurately. The right tool fires on the first try ≥ 95% of the time. - Risky operations gated structurally. Refund cap, destructive Bash, write access policed by hooks not prompts. 
- Errors that the agent can reason about. A permission-denied looks different from a transient timeout, and the agent retries accordingly. ### Why naive approaches fail - 15-tool registry → routing accuracy drops 8% per tool past 5; the agent alternates and misses obvious matches. - Vague one-line tool descriptions → agent picks the wrong tool ~12% of the time. - Prompt-only policy enforcement ('never refund > $500') → leaks 3-5% in production despite emphatic phrasing. ### Definition of done - Tool count per agent ≤ 5; rare tools moved to specialist sub-agents - Every tool description follows the 4-line pattern (what / when / edge cases / ordering) - PreToolUse hook gates every policy-bearing tool; exit 2 on violation - PostToolUse hook normalizes outputs and logs every call to the audit trail - Tool errors emit one of the 4 structured buckets (Transient · Permission · Data · Business) - MCP servers used for cross-agent tool sharing. No inline duplication ## Concepts in play - 🟢 **Tool calling** (`tool-calling`), Tool registry contract + 4-line description pattern - 🟢 **tool_choice** (`tool-choice`), auto for specialist; forced only for mandatory extraction - 🟢 **Hooks** (`hooks`), PreToolUse + PostToolUse as structural gates - 🟢 **Model Context Protocol** (`mcp`), Tool sharing across agents via MCP servers - 🟢 **Evaluation** (`evaluation`), Tool selection accuracy as the routing test - 🟢 **Structured outputs** (`structured-outputs`), is_error + 4-bucket error contract - 🟢 **Subagents** (`subagents`), Move rare tools to specialist sub-agents - 🟢 **Agentic loops** (`agentic-loops`), stop_reason: tool_use → execute → continue ## Components ### Tool Registry (4-5 Tools), the optimum, not the maximum The agent's toolbox. Empirically, 4-5 tools is the routing-accuracy sweet spot; past that, accuracy drops ~8% per added tool because descriptions overlap and the model alternates. If the use case needs more tools, split into specialist sub-agents. Each with its own 4-5-tool registry. And route between them with a triage classifier. **Configuration:** Cap: 4-5 tools. Beyond: split agents. Don't cram a customer-support tool, a refund tool, a sentiment tool, and 11 admin tools into one agent. That's three agents pretending to be one. **Concept:** `tool-calling` ### 4-Line Description Pattern, what · when · edge cases · ordering The canonical tool description shape. Line 1: what the tool does. Line 2: when to call it. Line 3: edge cases (returns, failure modes). Line 4: ordering (which tools must come before / after). This pattern makes routing structural. The model reads the pattern in every description and learns the shape. Vague one-liners produce ~12% wrong-tool selection; the 4-line pattern drops it below 3%. **Configuration:** description: "Look up a customer by customer_id and confirm they are active.\nUse this BEFORE any other tool that mentions the customer.\nEdge cases: returns 'not_found' if customer_id is missing.\nAlways run before lookup_order or process_refund." **Concept:** `tool-calling` ### PreToolUse Hook (Policy Gate), deterministic, before exec Sits between the model's tool_use request and the actual tool execution. Reads tool_input (e.g., refund_amount), compares to policy (amount <= cap), exits 0 (allow) or 2 (deny with stderr message). Deny routes the model back with the policy reason, and the agent re-plans. The single-most-effective lever for converting probabilistic prompt-only policies into 100%-deterministic gates. **Configuration:** matcher: "process_refund". 
Hook reads stdin JSON: {tool_name, tool_input}. Exits 0 to allow, exits 2 with stderr to deny. SDK forwards stderr back to the model as a tool_result with is_error=true. **Concept:** `hooks` ### PostToolUse Hook (Normalization + Audit), after exec, before next turn Fires AFTER the tool runs but BEFORE the result is fed to the model. Normalizes raw outputs (timestamps to ISO-8601, status codes to enum names, field renames), captures side-effect signals, and writes the canonical audit log entry. Without it, the model sees inconsistent output shapes across calls; with it, every call has a predictable contract and an audit trail. **Configuration:** matcher: '*'. Hook reads stdin: {tool_name, tool_input, tool_result}. Transforms tool_result into normalized shape. Writes audit row to durable log. Always exits 0 (does not deny. That's PreToolUse's job). **Concept:** `hooks` ### 4-Bucket Structured Error Contract, Transient · Permission · Data · Business Every tool that can fail emits an error in one of four explicit buckets, not a free-form string. The harness reads is_error: true + error.bucket, then routes: Transient → retry; Permission → escalate (don't retry, won't fix itself); Data → surface to user; Business → block + log + escalate. Without this contract, the agent retries permission errors forever and surfaces transient ones as catastrophes. **Configuration:** tool_result on failure: { is_error: true, content: { bucket: "Transient"|"Permission"|"Data"|"Business", code, detail, retryable: bool } }. Agent reads bucket and retryable; never branches on detail text. **Concept:** `structured-outputs` ## Build steps ### 1. Cap the registry at 4-5 tools (split otherwise) Audit your current tool list. Past 5, you're guaranteed losing routing accuracy. The fix is structural: identify the use cases that actually share state vs those that don't, then split into specialist agents with their own 4-5-tool registries. Use a triage classifier (or a top-level coordinator agent) to route the user request to the right specialist. **Python:** ```python # AUDIT: count + classify tools SUPPORT_TOOLS = ["verify_customer", "lookup_order", "process_refund", "escalate_to_human", "audit_log"] # 5. At the optimum ADMIN_TOOLS = ["create_user", "delete_user", "reset_password", "lock_account", "unlock_account", "audit_admin"] # 6. Split # WRONG: cram all 11 into one agent # tools = SUPPORT_TOOLS + ADMIN_TOOLS # 11. Routing accuracy drops ~32% # RIGHT: two specialist agents, triage routes between them def triage(user_request: str) -> str: """Tiny classifier. Pick the specialist agent.""" if any(w in user_request.lower() for w in ["refund", "order", "ticket"]): return "support" if any(w in user_request.lower() for w in ["password", "account", "user"]): return "admin" return "support" # default def route(user_request: str) -> dict: specialist = triage(user_request) tools = SUPPORT_TOOLS if specialist == "support" else ADMIN_TOOLS return run_agent(tools=tools, message=user_request) ``` **TypeScript:** ```typescript // AUDIT: count + classify tools const SUPPORT_TOOLS = [ "verify_customer", "lookup_order", "process_refund", "escalate_to_human", "audit_log", ] as const; // 5. At the optimum const ADMIN_TOOLS = [ "create_user", "delete_user", "reset_password", "lock_account", "unlock_account", "audit_admin", ] as const; // 6. Split // WRONG: cram all 11 into one agent // const tools = [...SUPPORT_TOOLS, ...ADMIN_TOOLS]; // 11. 
Routing accuracy drops ~32% // RIGHT: two specialist agents, triage routes between them function triage(userRequest: string): "support" | "admin" { const r = userRequest.toLowerCase(); if (["refund", "order", "ticket"].some((w) => r.includes(w))) return "support"; if (["password", "account", "user"].some((w) => r.includes(w))) return "admin"; return "support"; } async function route(userRequest: string) { const specialist = triage(userRequest); const tools = specialist === "support" ? SUPPORT_TOOLS : ADMIN_TOOLS; return runAgent({ tools, message: userRequest }); } ``` Concept: `tool-calling` ### 2. Write every tool description in the 4-line pattern Line 1: what (one sentence). Line 2: when (which user intent triggers this). Line 3: edge cases (what happens on failure, missing args). Line 4: ordering (which tools must come before / after). This pattern is the model's structural cue. It reads the pattern across all 5 tool descriptions and routes accordingly. Vague one-liners produce ~12% wrong-tool selection; this pattern drops it below 3%. **Python:** ```python # Anthropic 4-line pattern. What / when / edge cases / ordering TOOLS = [ { "name": "verify_customer", "description": ( "Look up a customer by customer_id and confirm they are active.\n" "Use this BEFORE any other tool that mentions the customer.\n" "Edge cases: returns 'not_found' if customer_id is missing or stale.\n" "Always run before lookup_order or process_refund." ), "input_schema": { "type": "object", "properties": {"customer_id": {"type": "string", "pattern": "^cust_[0-9]+$"}}, "required": ["customer_id"], }, }, { "name": "process_refund", "description": ( "Issue a refund to a verified customer up to the policy cap.\n" "Use ONLY after verify_customer has confirmed the customer is active.\n" "Edge cases: returns Permission error if amount > policy_cap (handled by hook).\n" "Never call before verify_customer; never call twice in one conversation." ), "input_schema": { "type": "object", "properties": { "customer_id": {"type": "string"}, "amount": {"type": "number", "minimum": 0}, "reason": {"type": "string", "enum": ["damage", "wrong_item", "late", "other"]}, }, "required": ["customer_id", "amount", "reason"], }, }, ] ``` **TypeScript:** ```typescript // Anthropic 4-line pattern. What / when / edge cases / ordering const TOOLS: Anthropic.Tool[] = [ { name: "verify_customer", description: "Look up a customer by customer_id and confirm they are active.\n" + "Use this BEFORE any other tool that mentions the customer.\n" + "Edge cases: returns 'not_found' if customer_id is missing or stale.\n" + "Always run before lookup_order or process_refund.", input_schema: { type: "object", properties: { customer_id: { type: "string", pattern: "^cust_[0-9]+$" } }, required: ["customer_id"], }, }, { name: "process_refund", description: "Issue a refund to a verified customer up to the policy cap.\n" + "Use ONLY after verify_customer has confirmed the customer is active.\n" + "Edge cases: returns Permission error if amount > policy_cap (handled by hook).\n" + "Never call before verify_customer; never call twice in one conversation.", input_schema: { type: "object", properties: { customer_id: { type: "string" }, amount: { type: "number", minimum: 0 }, reason: { type: "string", enum: ["damage", "wrong_item", "late", "other"] }, }, required: ["customer_id", "amount", "reason"], }, }, ]; ``` Concept: `tool-calling`
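The registry above is also where prompt caching pays off: the tools array is stable across calls, and the Cost & latency section later on this page says to mark it with cache_control: ephemeral. A minimal sketch of what that looks like on the TOOLS list just defined; the send_support_turn helper is illustrative, and the savings figure is the rough cached-read rate, not a guarantee:

```python
# Sketch: put a cache_control breakpoint on the LAST tool so the whole stable
# tools array is cached after the first call (cached reads bill at roughly a
# tenth of the normal input rate).
CACHED_TOOLS = [
    *TOOLS[:-1],
    {**TOOLS[-1], "cache_control": {"type": "ephemeral"}},  # breakpoint on the final tool
]

def send_support_turn(messages: list[dict]):
    # Illustrative helper; `client` is the same Anthropic SDK client used elsewhere on this page.
    return client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=1024,
        tools=CACHED_TOOLS,
        tool_choice={"type": "auto"},
        messages=messages,
    )
```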
### 3. Wire the PreToolUse hook on policy-bearing tools For every tool that touches money, identity, or destructive state, the PreToolUse hook is the architectural gate. It reads tool_input from stdin JSON, applies the policy check in code (not in a prompt), and exits 0 or 2. Exit 2's stderr message is fed back to the model as a tool_result with is_error: true. The model re-plans with the policy in view. **Python:** ```python # .claude/hooks/refund_policy.py import sys, json, os POLICY_CAP = float(os.environ.get("REFUND_CAP", "500")) def main(): payload = json.loads(sys.stdin.read()) tool_name = payload["tool_name"] tool_input = payload["tool_input"] if tool_name != "process_refund": sys.exit(0) # not our concern, allow amount = tool_input.get("amount", 0) if amount > POLICY_CAP: # Stderr is fed back to the model as a tool_result with is_error=true print( f"refund ${amount} exceeds policy cap ${POLICY_CAP}; " f"escalate via escalate_to_human or reduce the amount", file=sys.stderr, ) sys.exit(2) # DENY # Additional structural checks. Verify customer is active, etc. if not tool_input.get("customer_id", "").startswith("cust_"): print("customer_id missing or malformed; call verify_customer first", file=sys.stderr) sys.exit(2) sys.exit(0) # allow if __name__ == "__main__": main() ``` **TypeScript:** ```typescript // .claude/hooks/refund-policy.ts import { readFileSync } from "node:fs"; const POLICY_CAP = Number(process.env.REFUND_CAP ?? "500"); const payload = JSON.parse(readFileSync(0, "utf8")); const toolName: string = payload.tool_name; const toolInput = payload.tool_input ?? {}; if (toolName !== "process_refund") { process.exit(0); // not our concern, allow } const amount = (toolInput.amount as number) ?? 0; if (amount > POLICY_CAP) { // Stderr is fed back to the model as a tool_result with is_error=true process.stderr.write( `refund $${amount} exceeds policy cap $${POLICY_CAP}; ` + `escalate via escalate_to_human or reduce the amount\n`, ); process.exit(2); // DENY } if (!String(toolInput.customer_id ?? "").startsWith("cust_")) { process.stderr.write("customer_id missing or malformed; call verify_customer first\n"); process.exit(2); } process.exit(0); // allow ``` Concept: `hooks` ### 4. Wire the PostToolUse hook for normalization + audit PostToolUse fires AFTER the tool runs, BEFORE the model sees the result. Two jobs: normalize the output shape (timestamps to ISO-8601, status codes to enum names, ms to seconds, etc.) so the model sees a consistent contract across calls; and write a canonical audit row capturing tool_name, tool_input, normalized_output, latency, and stop_reason context. The audit log is the replay tool when production breaks at turn 18.
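Neither the step-3 nor the step-4 hook fires unless the harness knows about it. A minimal registration sketch, written as Python that emits the `.claude/settings.json` hooks block; the event names (PreToolUse / PostToolUse), matcher, and command fields follow Claude Code's hooks configuration as I understand it, so verify the exact schema against your version:

```python
# Sketch: register the step-3 and step-4 hook scripts in .claude/settings.json.
# Assumed schema (PreToolUse / PostToolUse -> matcher -> command); confirm against
# the Claude Code hooks docs for your version before relying on it.
import json
from pathlib import Path

HOOKS_BLOCK = {
    "hooks": {
        "PreToolUse": [
            {
                "matcher": "process_refund",
                "hooks": [{"type": "command", "command": "python .claude/hooks/refund_policy.py"}],
            }
        ],
        "PostToolUse": [
            {
                "matcher": "*",
                "hooks": [{"type": "command", "command": "python .claude/hooks/postuse_normalize.py"}],
            }
        ],
    }
}

settings = Path(".claude/settings.json")
settings.parent.mkdir(parents=True, exist_ok=True)
existing = json.loads(settings.read_text()) if settings.exists() else {}
settings.write_text(json.dumps({**existing, **HOOKS_BLOCK}, indent=2))  # shallow merge; adjust if you keep other hook entries
```

With both hooks registered, the hook bodies themselves follow.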
**Python:** ```python # .claude/hooks/postuse_normalize.py import sys, json, datetime def normalize(tool_name: str, raw: dict) -> dict: """Project tool-specific raw output into a stable shape.""" if tool_name == "lookup_order": return { "order_id": raw.get("id") or raw.get("order_id"), "status": (raw.get("status") or "unknown").upper(), "created_at": ( datetime.datetime .fromtimestamp(raw["created_unix"]) .isoformat() + "Z" ) if "created_unix" in raw else raw.get("created_at"), "total_cents": int(raw.get("total_cents", round(raw.get("total_dollars", 0) * 100))), } if tool_name == "verify_customer": return { "customer_id": raw.get("customer_id") or raw.get("id"), "active": bool(raw.get("active") or raw.get("is_active")), "tier": raw.get("tier") or raw.get("plan") or "standard", } return raw # tools without a known shape pass through def main(): payload = json.loads(sys.stdin.read()) tool_name = payload["tool_name"] raw_result = payload["tool_result"] normalized = normalize(tool_name, raw_result) payload["tool_result"] = normalized # Append canonical audit row with open("audit.jsonl", "a") as f: f.write(json.dumps({ "ts": datetime.datetime.utcnow().isoformat() + "Z", "tool": tool_name, "input": payload["tool_input"], "output": normalized, "latency_ms": payload.get("latency_ms"), }) + "\n") # Pipe normalized payload back to stdout. SDK uses this as the new tool_result print(json.dumps(payload)) sys.exit(0) if __name__ == "__main__": main() ``` **TypeScript:** ```typescript // .claude/hooks/postuse-normalize.ts import { readFileSync, appendFileSync } from "node:fs"; function normalize(toolName: string, raw: Record) { if (toolName === "lookup_order") { return { order_id: raw.id ?? raw.order_id, status: String(raw.status ?? "unknown").toUpperCase(), created_at: raw.created_unix ? new Date((raw.created_unix as number) * 1000).toISOString() : raw.created_at, total_cents: (raw.total_cents as number) ?? Math.round(((raw.total_dollars as number) ?? 0) * 100), }; } if (toolName === "verify_customer") { return { customer_id: raw.customer_id ?? raw.id, active: Boolean(raw.active ?? raw.is_active), tier: raw.tier ?? raw.plan ?? "standard", }; } return raw; // tools without a known shape pass through } const payload = JSON.parse(readFileSync(0, "utf8")); const normalized = normalize(payload.tool_name, payload.tool_result ?? {}); payload.tool_result = normalized; // Append canonical audit row appendFileSync( "audit.jsonl", JSON.stringify({ ts: new Date().toISOString(), tool: payload.tool_name, input: payload.tool_input, output: normalized, latency_ms: payload.latency_ms, }) + "\n", ); // Pipe normalized payload back to stdout. SDK uses this as the new tool_result process.stdout.write(JSON.stringify(payload)); process.exit(0); ``` Concept: `hooks` ### 5. Emit errors in 4 structured buckets Every tool that can fail returns an error tagged with one of four buckets: Transient (network blip, retry), Permission (403/401, escalate. Don't retry, won't fix itself), Data (input malformed, surface to user), Business (policy violation, log + escalate). The agent reads bucket and retryable, and routes accordingly. Without this contract, the agent retries permission errors forever and surfaces transient blips as catastrophes. **Python:** ```python from enum import Enum from typing import TypedDict class ErrorBucket(str, Enum): TRANSIENT = "Transient" PERMISSION = "Permission" DATA = "Data" BUSINESS = "Business" class ToolError(TypedDict): bucket: ErrorBucket code: str # e.g. 
"RATE_LIMITED", "FORBIDDEN", "INVALID_INPUT", "POLICY_BREACH" detail: str # human-readable; for the model retryable: bool def classify(http_status: int, body: dict) -> ToolError: """Project arbitrary backend error → 4-bucket contract.""" if http_status >= 500 or http_status == 429: return {"bucket": ErrorBucket.TRANSIENT, "code": "RETRY", "detail": f"upstream {http_status}; retry with backoff", "retryable": True} if http_status in (401, 403): return {"bucket": ErrorBucket.PERMISSION, "code": "FORBIDDEN", "detail": "agent lacks permission; escalate, do not retry", "retryable": False} if http_status == 400: return {"bucket": ErrorBucket.DATA, "code": "INVALID_INPUT", "detail": body.get("message", "input failed validation"), "retryable": False} # retry won't fix; user must reformulate if http_status == 422: return {"bucket": ErrorBucket.BUSINESS, "code": "POLICY_BREACH", "detail": body.get("message", "request violates business policy"), "retryable": False} return {"bucket": ErrorBucket.TRANSIENT, "code": "UNKNOWN", "detail": str(body), "retryable": True} # Tool wrapper emits the contract via tool_result def call_lookup_order(order_id: str) -> dict: resp = http_get(f"/orders/{order_id}") if resp.status_code != 200: return {"is_error": True, "error": classify(resp.status_code, resp.json())} return {"is_error": False, "data": resp.json()} ``` **TypeScript:** ```typescript enum ErrorBucket { Transient = "Transient", Permission = "Permission", Data = "Data", Business = "Business", } interface ToolError { bucket: ErrorBucket; code: string; // e.g. "RATE_LIMITED", "FORBIDDEN", "INVALID_INPUT", "POLICY_BREACH" detail: string; // human-readable; for the model retryable: boolean; } function classify(httpStatus: number, body: Record): ToolError { if (httpStatus >= 500 || httpStatus === 429) { return { bucket: ErrorBucket.Transient, code: "RETRY", detail: `upstream ${httpStatus}; retry with backoff`, retryable: true }; } if (httpStatus === 401 || httpStatus === 403) { return { bucket: ErrorBucket.Permission, code: "FORBIDDEN", detail: "agent lacks permission; escalate, do not retry", retryable: false }; } if (httpStatus === 400) { return { bucket: ErrorBucket.Data, code: "INVALID_INPUT", detail: (body.message as string) ?? "input failed validation", retryable: false }; } if (httpStatus === 422) { return { bucket: ErrorBucket.Business, code: "POLICY_BREACH", detail: (body.message as string) ?? "request violates business policy", retryable: false }; } return { bucket: ErrorBucket.Transient, code: "UNKNOWN", detail: JSON.stringify(body), retryable: true }; } // Tool wrapper emits the contract via tool_result async function callLookupOrder(orderId: string) { const resp = await fetch(`/orders/${orderId}`); if (!resp.ok) { return { is_error: true as const, error: classify(resp.status, await resp.json()) }; } return { is_error: false as const, data: await resp.json() }; } ``` Concept: `structured-outputs` ### 6. Use tool_choice 'auto' for specialists; 'forced' only for mandatory extraction tool_choice: 'auto' is the right default. The model decides whether to call any tool, and which one, based on the request. tool_choice: 'any' forces the model to call SOME tool (rarely useful). tool_choice: { type: 'tool', name: ... } forces a specific tool. Only correct for extraction pipelines where the tool is mandatory. Forced tool_choice on a conversational specialist agent removes the agent's reasoning capacity. **Python:** ```python # auto. 
### 6. Use tool_choice 'auto' for specialists; 'forced' only for mandatory extraction tool_choice: 'auto' is the right default. The model decides whether to call any tool, and which one, based on the request. tool_choice: 'any' forces the model to call SOME tool (rarely useful). tool_choice: { type: 'tool', name: ... } forces a specific tool. Only correct for extraction pipelines where the tool is mandatory. Forced tool_choice on a conversational specialist agent removes the agent's reasoning capacity. **Python:** ```python # auto. The right default for specialist agents def support_agent(message: str): return client.messages.create( model="claude-sonnet-4.5", max_tokens=1024, tools=SUPPORT_TOOLS, tool_choice={"type": "auto"}, # agent decides messages=[{"role": "user", "content": message}], ) # forced. Only for mandatory extraction def extract_one(email: str): return client.messages.create( model="claude-sonnet-4.5", max_tokens=1024, tools=[EXTRACT_TOOL], tool_choice={"type": "tool", "name": "extract_record"}, # MUST fire messages=[{"role": "user", "content": email}], ) # any. Rarely useful; "must call SOME tool but I won't pick which" # def some_specialist(): # return client.messages.create( # tool_choice={"type": "any"}, # ... # only for unusual flows where any of N tools is acceptable ``` **TypeScript:** ```typescript // auto. The right default for specialist agents async function supportAgent(message: string) { return client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 1024, tools: SUPPORT_TOOLS, tool_choice: { type: "auto" }, // agent decides messages: [{ role: "user", content: message }], }); } // forced. Only for mandatory extraction async function extractOne(email: string) { return client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 1024, tools: [EXTRACT_TOOL], tool_choice: { type: "tool", name: "extract_record" }, // MUST fire messages: [{ role: "user", content: email }], }); } // any. Rarely useful; "must call SOME tool but I won't pick which" // async function someSpecialist() { // return client.messages.create({ // tool_choice: { type: "any" }, // ... // }); // } ``` Concept: `tool-choice` ### 7. Share tools across agents via MCP servers When two agents both need lookup_order or verify_customer, don't duplicate the tool inline. Expose it through an MCP server. Each agent connects to the MCP server, advertises it as a tool, and gets a single source of truth. Updating the tool's behavior is a single deploy; the agents pick it up automatically. MCP also abstracts auth, observability, and rate-limiting away from each agent. **Python:** ```python # .claude/mcp.json. Declare the MCP servers your project uses { "mcpServers": { "crm": { "command": "npx", "args": ["-y", "@yourorg/crm-mcp-server"], "env": { "CRM_API_KEY": "${CRM_API_KEY}", "CRM_BASE_URL": "https://crm.example.com" } } } } # In your agent. MCP tools auto-show up in the tools list async def support_agent(message: str): # The CRM MCP server contributes verify_customer + lookup_order return await client.messages.create( model="claude-sonnet-4.5", max_tokens=1024, # tools are auto-loaded from MCP. Your local registry only adds: tools=[PROCESS_REFUND_TOOL, ESCALATE_TO_HUMAN_TOOL], tool_choice={"type": "auto"}, messages=[{"role": "user", "content": message}], mcp_servers=["crm"], # fictional pseudo-API; real call sites vary by SDK ) ``` **TypeScript:** ```typescript // .claude/mcp.json. Declare the MCP servers your project uses // { // "mcpServers": { // "crm": { // "command": "npx", // "args": ["-y", "@yourorg/crm-mcp-server"], // "env": { // "CRM_API_KEY": "${CRM_API_KEY}", // "CRM_BASE_URL": "https://crm.example.com" // } // } // } // } // In your agent. MCP tools auto-show up in the tools list async function supportAgent(message: string) { // The CRM MCP server contributes verify_customer + lookup_order return client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 1024, // tools are auto-loaded from MCP.
Your local registry only adds: tools: [PROCESS_REFUND_TOOL, ESCALATE_TO_HUMAN_TOOL], tool_choice: { type: "auto" }, messages: [{ role: "user", content: message }], // mcpServers: ["crm"], // pseudo-API; real call sites vary by SDK }); } ``` Concept: `mcp` ### 8. Test routing accuracy with a 50-intent eval set Tool design is empirical. Build a 50-intent eval set where each intent has a known correct tool. Run the agent over it, count first-call accuracy. Below 95% routing accuracy means the descriptions need work; below 90% likely means too many tools. Re-run the eval after every tool addition or description tweak. **Python:** ```python # eval/tool_routing.py. Measure first-call routing accuracy INTENTS = [ {"text": "I want a refund for order 12345", "expected_first_tool": "verify_customer"}, {"text": "My order hasn't arrived", "expected_first_tool": "verify_customer"}, {"text": "Cancel my account please", "expected_first_tool": "escalate_to_human"}, # ... 47 more ] def routing_accuracy() -> dict: correct = total = 0 misses = [] for intent in INTENTS: resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=512, tools=SUPPORT_TOOLS, tool_choice={"type": "auto"}, messages=[{"role": "user", "content": intent["text"]}], ) first_tool = next( (b.name for b in resp.content if b.type == "tool_use"), None, ) total += 1 if first_tool == intent["expected_first_tool"]: correct += 1 else: misses.append({ "text": intent["text"], "expected": intent["expected_first_tool"], "got": first_tool, }) return { "accuracy": correct / total, "n": total, "misses": misses[:10], # first 10 for review } # Re-run after every change to the tool registry; gate deploys on >= 95% ``` **TypeScript:** ```typescript // eval/tool-routing.ts. Measure first-call routing accuracy const INTENTS = [ { text: "I want a refund for order 12345", expected_first_tool: "verify_customer" }, { text: "My order hasn't arrived", expected_first_tool: "verify_customer" }, { text: "Cancel my account please", expected_first_tool: "escalate_to_human" }, // ... 47 more ]; async function routingAccuracy() { let correct = 0; let total = 0; const misses: Array<{ text: string; expected: string; got: string | null }> = []; for (const intent of INTENTS) { const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 512, tools: SUPPORT_TOOLS, tool_choice: { type: "auto" }, messages: [{ role: "user", content: intent.text }], }); const firstTool = resp.content.find((b) => b.type === "tool_use")?.name ?? null; total++; if (firstTool === intent.expected_first_tool) { correct++; } else { misses.push({ text: intent.text, expected: intent.expected_first_tool, got: firstTool, }); } } return { accuracy: correct / total, n: total, misses: misses.slice(0, 10), }; } // Re-run after every change to the tool registry; gate deploys on >= 95% ``` Concept: `evaluation` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Tool count for a specialist agent | 4-5 tools (the optimum); split into specialist sub-agents past 5 | 15 tools 'because the model is smart enough' | Routing accuracy drops ~8% per tool past 5. By 15 tools, accuracy is ~30% lower than at 5. The fix is structural (split the agent), not model-side (bigger model doesn't compensate for ambiguous tool descriptions). | | Tool description format | 4-line pattern: what / when / edge cases / ordering | One vague sentence ('verifies a customer') | Vague descriptions cause ~12% wrong-tool selection. The 4-line pattern is the model's structural cue. 
It reads the pattern across all 5 descriptions and routes accordingly. Wrong-tool rate drops below 3%. | | Refund-cap policy enforcement | PreToolUse hook reads tool_input.amount, exits 2 on violation | System prompt: 'never refund more than $500' | Prompt-only enforcement leaks 3-5% in production despite emphatic phrasing. PreToolUse hooks are deterministic. Exit 2 means deny, full stop. For policy-bearing tools, the hook is the only credible architecture. | | Tool error contract | 4 buckets (Transient · Permission · Data · Business) + retryable boolean | Free-form error messages parsed by the agent | Free-form errors force the agent to interpret strings; structured buckets let it BRANCH (Transient → retry; Permission → escalate; Data → surface; Business → block + log). Without buckets, the agent retries permission errors forever and panics on transient blips. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-ATD-01 · Tool count > 5 | 15-tool agent with overlapping descriptions. Routing accuracy at first-call drops to ~65%; the agent alternates between similar tools, sometimes calls 3 tools before settling on the right one. Latency up; cost up; quality down. | Cap at 4-5 tools per agent. Move rare tools (used <10% of conversations) to specialist sub-agents. Use a triage classifier to route requests to the right sub-agent. Each sub-agent stays at 4-5 tools. | | AP-ATD-02 · Vague tool descriptions | Tool descriptions are one-liners ('verifies the customer', 'looks up an order'). Agent misroutes ~12% of calls because it can't tell when each tool applies. | Anthropic 4-line pattern: what / when / edge cases / ordering. Each line targets a specific routing decision the model has to make. Wrong-tool rate drops below 3%. | | AP-ATD-03 · No PreToolUse hook on risky operations | Refund cap enforced via system-prompt language. Production logs show 3-5% of refunds violate the cap. Audit fails; finance rolls back; trust in the agent drops. | PreToolUse hook reads tool_input.amount, compares to policy, exits 2 with stderr message on violation. Deterministic, not probabilistic. Policy violations drop to 0. | | AP-ATD-04 · Unhandled tool errors | Tool returns 403; agent retries indefinitely. Tool returns 500; agent crashes. Tool returns 400 with malformed input; agent gives up without surfacing the input issue to the user. | 4-bucket structured error contract (Transient · Permission · Data · Business) with retryable boolean. Agent reads bucket, branches: Transient → retry with backoff; Permission → escalate; Data → surface to user; Business → block + log + escalate. | | AP-ATD-05 · No tool ordering guidance | Agent calls lookup_order before verify_customer. Wrong record returned 12% of the time because the customer_id wasn't validated first. Bad data pollutes downstream decisions. | Tool descriptions explicitly state ordering ('Always run BEFORE process_refund'; 'Use ONLY after verify_customer has confirmed the customer is active'). The 4-line pattern's last line is for ordering precisely because ordering is so often the routing failure. 
| ## Implementation checklist - [ ] Tool count per agent ≤ 5; rare tools moved to specialist sub-agents (`tool-calling`) - [ ] Every tool description follows the 4-line pattern (what / when / edge cases / ordering) (`tool-calling`) - [ ] PreToolUse hook on every policy-bearing tool; exit 2 on violation (`hooks`) - [ ] PostToolUse hook normalizes outputs and writes the audit log (`hooks`) - [ ] All tool errors emit one of 4 buckets (Transient · Permission · Data · Business) + retryable boolean (`structured-outputs`) - [ ] tool_choice: 'auto' on specialist agents; 'forced' only for mandatory extraction (`tool-choice`) - [ ] Shared tools exposed via MCP servers; no inline duplication across agents (`mcp`) - [ ] 50-intent routing-accuracy eval set; gate deploys on ≥ 95% (`evaluation`) - [ ] Agent branches on bucket + retryable; no parsing of error.detail text - [ ] Hook stderr messages are model-readable (specific, actionable, reference policy) - [ ] Audit log: tool_name, tool_input, normalized_output, latency_ms, stop_reason context ## Cost & latency - **Per-tool-call overhead (Pre + Post hooks):** ~5-15ms latency; ~0% token cost, Hooks run as subprocesses reading stdin JSON. No LLM call. Pure local Python/TS. The latency is below the noise floor of a typical tool API call. - **Routing accuracy eval (50 intents, weekly):** ~$0.05/run, 50 messages × ~500 tokens input + ~50 tokens output at Sonnet 4.5 prices. Cheap insurance against routing regressions; run on every tool registry change and weekly in CI. - **Tool description token cost (5 tools × 4 lines):** ~600-1000 tokens per call (cached after first), Tools array is stable; mark with cache_control: ephemeral. Schema-cache hit rate ≥ 70% drops effective per-call cost ~90% on the tools array. - **MCP server overhead:** ~+50-100ms per tool call (network), MCP runs as a separate process or service; network round-trip adds latency. Worth the cost when 2+ agents share the tool. Single source of truth beats inline duplication. - **Audit log write:** ~5ms; ~1KB per row, Append-only JSONL write in the PostToolUse hook. At 1000 calls/day, 1MB/day, 30MB/month. Negligible storage. Indispensable for production debugging. ## Domain weights - **D2 · Tool Design + Integration (18%):** Tool registry · 4-line description pattern · 4-bucket error contract · MCP integration - **D3 · Agent Operations (20%):** PreToolUse hook · PostToolUse hook · routing-accuracy evals · Claude Code hook config ## Practice questions ### Q1. Your agent has 6 tools. Routing accuracy drops from 95% (with 5 tools) to 87% (with 6). What's the cause and the architectural fix? Tool count past 4-5 degrades selection. Each new tool adds description overlap; the model alternates between similar tools and sometimes picks the wrong one. The fix is structural, not model-side: either consolidate (merge lookup_order_status + lookup_order_details → lookup_order) or move the rare tool to a specialist sub-agent and route requests to the right specialist with a triage classifier. Don't 'just use a smarter model'. Bigger models don't compensate for ambiguous tool descriptions. Tagged to AP-ATD-01. ### Q2. Your tool description is one sentence: 'verifies a customer'. The agent uses it correctly ~88% of the time. What format produces ≥97% accuracy? The Anthropic 4-line description pattern: line 1 _what_ (one sentence), line 2 _when_ (which user intent triggers it), line 3 _edge cases_ (failure modes, missing args), line 4 _ordering_ (which tools must come before / after). 
Each line targets a specific routing decision the model has to make on every turn. Vague one-liners produce ~12% wrong-tool selection; the 4-line pattern drops it below 3%. Tagged to AP-ATD-02. ### Q3. PreToolUse hook fires before tool_use; PostToolUse fires after. Which one blocks risky operations like a refund-cap violation, and why? PreToolUse. It fires BEFORE the tool runs, so it can deny execution by exiting 2. PostToolUse fires AFTER the tool runs and is meant for normalization + audit, not denial. For deterministic policy enforcement (refund cap, destructive-command blocklist, sensitive-data redaction), PreToolUse is the only credible gate. Prompt-only policies leak 3-5% in production; PreToolUse leaks 0%. ### Q4. A tool returns HTTP 403. Should the agent retry? No. 403 is a non-retryable Permission error. The 4-bucket error contract is decisive here: bucket: 'Permission', retryable: false. The agent reads bucket and retryable and routes: Permission means 'agent lacks the privilege; escalate, don't retry'. Retry won't fix it (the missing permission won't appear in the next 30 seconds). Without this contract, the agent retries permission errors forever and panics on transient ones. Exactly the failure mode the buckets exist to prevent. Tagged to AP-ATD-04. ### Q5. When should tool_choice be 'auto', 'any', or { type: 'tool', name: ... }? auto. The right default for specialist agents that converse and pick tools as they go (~95% of production flows). { type: 'tool', name: ... }. Only for mandatory extraction pipelines where the tool MUST fire (no agency). any. Rarely useful; says 'must call SOME tool but I won't pick which'. Fine for a narrow flow where any of N tools is acceptable. Forcing tool_choice on a specialist agent removes its reasoning capacity; 99% of the time, auto is correct. ## FAQ ### Q1. Why does the 4-line description pattern matter so much? It gives the model structural cues across the registry. Each tool has the same 4 lines in the same order. After reading 5 such descriptions, the model has a stable mental model: 'when I want to know WHAT, I read line 1; WHEN, line 2; ORDERING, line 4'. Vague one-liners force the model to infer shape every time. The pattern cuts wrong-tool selection from ~12% to <3%. ### Q2. Can I use the 4-line pattern in MCP tool descriptions? Yes. Same pattern, same effect. MCP tools surface to the agent through the same tools[] array as inline tools; the description format is identical. If you ship an MCP server, write the 4-line pattern into the server's tool definitions. Downstream agents inherit the routing accuracy without doing anything. ### Q3. What if the policy is too complex for a hook? Then the hook calls a policy service. PreToolUse hooks are subprocesses, not pure functions. They can hit a Convex action, a feature flag service, a rules engine. The point is the gate is OUTSIDE the prompt: deterministic code makes the deny decision, not the model. Complex policies live in the service the hook calls; simple bounds live in the hook itself. ### Q4. How do I add a new tool to the registry without breaking routing? Three steps: (1) add the tool with a full 4-line description; (2) re-run the 50-intent routing-accuracy eval. If accuracy drops below 95%, the new tool's description overlaps with an existing one, fix the descriptions; (3) gate the deploy on the eval threshold. Adding tools blindly is the #1 way registries degrade. ### Q5. Should every tool emit the 4-bucket error contract? Every tool that can fail. 
Read-only lookups can mostly emit Transient or Data; write tools add Business; auth-protected tools add Permission. The contract is uniform: { is_error: true, error: { bucket, code, detail, retryable } }. The agent's retry logic relies on it; without uniformity, you'd write per-tool retry code and miss bugs. ### Q6. When do I split into specialist sub-agents vs add tools to the existing one? At 5 tools is the rule of thumb. Tools that share state (e.g. verify_customer + lookup_order + process_refund) can stay in one agent. Tools that don't share state (admin functions vs support functions vs analytics) belong in different agents. Split, route with a triage classifier, each agent stays at 4-5 tools. ### Q7. Is the 4-bucket model Anthropic's or community-derived? Community-derived but architecturally consistent with Anthropic's tool-use guidance. The buckets formalize the patterns Anthropic's docs hint at (transient retry; permission escalate; etc.). The catalog (ACP-T05) marks this scenario as 🟡 OP-claimed (Reddit thread 1s34iyl) but architecturally well-grounded. Drilling it benefits real exam prep. ### Q8. How does this scenario compose with MCP server security? Tightly. When tools are exposed via MCP rather than inline, the 4-line description pattern, the PreToolUse policy gate, and the 4-bucket structured-error contract all live on the MCP server side. The MCP-security checklist (secrets via ${ENV_VAR}, every parameter untrusted, binary allowlist, HTTPS-only transport, audit-log every tool_use) is the operational hardening; the patterns on this page are the design contract. Cross-link: P3.4 (developer-productivity-agent) FAQ covers the MCP-security checklist in depth. Pair both pages when designing a new MCP server. Tagged related: mcp-security cluster. ## Production readiness - [ ] Tool registry audit: every agent has ≤ 5 tools; documented in repo - [ ] Description format lint: every tool has 4 lines (what / when / edges / ordering) - [ ] PreToolUse hook on every policy-bearing tool; unit-tested for allow/deny - [ ] PostToolUse hook normalization tested on at least 5 representative output shapes - [ ] 4-bucket error contract enforced in CI: every tool wrapper returns the contract - [ ] 50-intent routing eval runs in CI; gates deploy on ≥ 95% accuracy - [ ] MCP server health checks before agent invocation; degraded mode on outage - [ ] Audit log retained ≥ 90 days; indexed by tool_name + customer_id --- **Source:** https://claudearchitectcertification.com/scenarios/agentic-tool-design **Vault sources:** ACP-T05 §Scenario 7 (🟡 beyond-guide; OP-claimed Reddit 1s34iyl); ACP-T08 §3.7 metadata; Course 12 Claude with Vertex. Lesson 90 workflows-vs-agents; ACP-T06 (5 practice Qs tagged to components); GAI-K04 Claude Certified Architect Exam Reference; COD-K01 AI design + orchestration patterns **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Long Document Processing > A retrieval-augmented long-document agent. Semantic chunking (paragraph- and section-aware, not fixed-size) preserves meaning; an embedding index supports top-K retrieval so only the relevant chunks enter context; an immutable CASE_FACTS block anchors transactional values (doc_id, extracted_count, decisions made) at the prompt top, surviving every chunk; a checkpoint-and-resume pattern saves state on max_tokens and continues in a fresh session; citations (chunk_id + page) flow through every output for audit-grade provenance. 
The most-tested distractor: progressive summarization of CASE_FACTS. Exact values like '$247.83' get paraphrased to '~$250' and the audit fails. **Sub-marker:** P3.8 **Domains:** D5 · Context + Reliability, D2 · Tool Design + Integration **Exam weight:** 33% of CCA-F (D5 + D2) **Build time:** 26 minutes **Source:** 🟡 Beyond-guide scenario · OP-claimed (Reddit 1s34iyl) · architecture matches Anthropic public guidance **Canonical:** https://claudearchitectcertification.com/scenarios/long-document-processing **Last reviewed:** 2026-05-04 ## In plain English Think of this as how you read a 200-page contract and extract every clause that matters without losing your place. The naive way. Paste the whole document into the prompt. Fails because the model runs out of room around page 120 and forgets what it saw on page 1. The right way is to chunk the document into reasonable pieces, build a search index over the chunks, retrieve only the chunks relevant to the current question, and pin the immutable facts (document ID, extraction state, decisions already made) in a CASE_FACTS block at the top of every prompt. When the run hits the model's token limit mid-document, you save state and resume. Like a bookmark. The whole point is that long documents are not one big prompt; they're many small ones with an immutable thread. ## Exam impact Domain 5 (Context, 15%) tests CASE_FACTS pinning, lost-in-the-middle mitigation, and checkpoint-and-resume. Domain 2 (Tool Design, 18%) tests retrieval contract, citation propagation, and Batch API for bulk extraction. Beyond-guide but architecturally consistent with Anthropic's contextual-retrieval guide. The 'why did the agent forget the order ID by page 120?' question is the canonical exam distractor. ## The problem ### What the customer needs - Process 200-page contracts without max_tokens errors and without losing the order/case ID partway through. - Audit-grade citations. Every extracted clause traces back to a specific chunk and page number. - Bulk overnight runs for backfills. 1000 documents in one batch, results next morning. ### Why naive approaches fail - Stuff the whole document into the prompt → max_tokens at page 120; lost-in-the-middle drops the order ID established on page 1. - Progressive summarization of facts → '$247.83' becomes '~$250' in the case-facts; audit fails because exact values were paraphrased. - RAG without citation tracking → model hallucinates source pages; auditor can't verify any claim against the original document. ### Definition of done - Semantic chunking (paragraph + section boundaries), not fixed-size, with 10-20% overlap - Top-K retrieval (K = 5 typical) returns only relevant chunks; full document never enters context - CASE_FACTS block pinned at every prompt top. 
Never summarized, only the conversation history is - Checkpoint-and-resume on max_tokens: state saved, fresh session, resume from checkpoint - Citations (chunk_id + page) propagate through every extraction; auditor can verify each claim - Batch API for bulk overnight extraction (≥ 100 docs at 50% off) ## Concepts in play - 🟢 **Context window** (`context-window`), Top-K retrieval keeps prompt small even on 200-page docs - 🟠 **Case-facts block** (`case-facts-block`), Immutable anchor for doc_id + extraction state - 🟢 **Checkpoints** (`checkpoints`), Save state on max_tokens, resume in a fresh session - 🟢 **Structured outputs** (`structured-outputs`), Citations carry chunk_id + page through every extraction - 🟢 **Tool calling** (`tool-calling`), search_chunks + extract_clause as the tool registry - 🟢 **Batch API** (`batch-api`), Bulk overnight extraction at 50% off - 🟢 **Prompt caching** (`prompt-caching`), System prompt + tool registry cached across chunks - 🟢 **Evaluation** (`evaluation`), Stratified accuracy by document type + page section ## Components ### Semantic Chunker, paragraph + section boundaries, not fixed-size Splits the document at natural boundaries (sections, paragraphs, list items) rather than at fixed byte counts. Preserves the meaning of each chunk; a sentence is never cut in half. Adds 10-20% overlap between adjacent chunks so a clause that spans a boundary still appears whole in at least one chunk. Fixed-size chunking destroys context; semantic chunking preserves it. **Configuration:** Chunk size: 500-2000 tokens (≈400-1500 words). Overlap: 10-20%. Boundary precedence: section header → paragraph → sentence (never break mid-sentence). Each chunk gets a deterministic chunk_id and a page number for citation. **Concept:** `context-window` ### Retrieval Index (top-K), embeddings + cosine similarity Embeds every chunk once at ingest time; stores embeddings in a vector index (FAISS, pgvector, Pinecone). At extraction time, the agent's question gets embedded; the index returns the top-K most-similar chunks (K = 5 typical). Only those K chunks enter context. The full document never does. Latency p95 < 100ms even on 10K-chunk documents. **Configuration:** Embedding model: Voyage-3 or OpenAI text-embedding-3-small (~$0.13 / M tokens). Distance: cosine similarity. K: 5 (sweet spot. Bigger dilutes context, smaller misses relevance). Re-rank with Claude Haiku for the top-20 → top-5 if precision matters. **Concept:** `context-window` ### CASE_FACTS Block, immutable anchor at prompt top Pinned at the very top of every system prompt iteration. Holds doc_id, extracted_count, decisions already made, policy_cap. Survives every chunk swap. NEVER summarized. Exact values like '$247.83' stay exact across hundreds of turns. The conversation history below it CAN be summarized; case-facts cannot. The architectural difference between 'reliable extraction' and 'paraphrased nonsense'. **Configuration:** system: "CASE_FACTS (immutable; re-read every turn): doc_id={doc_id}, extracted_count={count}, last_clause_id={cid}". Updated by hooks after state-changing tool calls. **Concept:** `case-facts-block` ### Checkpoint-and-Resume, the architectural fix for max_tokens When the model returns stop_reason: max_tokens, the harness writes the current case-facts + last extracted record + chunk position to a durable store (Convex DB, S3, local JSONL), then starts a FRESH session and reads the checkpoint as its case-facts. The new session continues from where the old left off. 
No data loss; no re-processing; no manual intervention. **Configuration:** On stop_reasonmax_tokens: persist({doc_id, extracted_count, last_chunk_id, partial_extraction}). New session: system_prompt loads case-facts from checkpoint. Idempotent on the chunk_id key. **Concept:** `checkpoints` ### Citation Tracker, chunk_id + page on every output Every tool result includes the chunk_id(s) and page number(s) that supported the extraction. The model's output schema requires citations: [{chunk_id, page, span?}]; downstream consumers can click any extracted value and see the exact paragraph in the original document. Audit-grade provenance, structurally enforced. **Configuration:** extract_clause output_schema: { clause_text, clause_type, citations: [{ chunk_id, page, span?: 'character offsets within chunk' }] }. The model can't emit a clause without at least one citation. **Concept:** `structured-outputs` ## Build steps ### 1. Semantic chunking with overlap Walk the document; split at section / paragraph / sentence boundaries (in that precedence). Aim for 500-2000-token chunks; add 10-20% overlap between adjacent chunks so a clause spanning a boundary stays whole in at least one chunk. Each chunk gets a deterministic chunk_id (hash of content) and a page number for citation. **Python:** ```python import hashlib from typing import TypedDict class Chunk(TypedDict): chunk_id: str page: int text: str def chunk_document(pages: list[str], target_tokens: int = 1200, overlap: float = 0.15) -> list[Chunk]: """Semantic chunking with overlap. Split at section, paragraph, sentence.""" chunks = [] buffer = "" page_buffer_started = 1 for page_num, page_text in enumerate(pages, start=1): for paragraph in split_into_paragraphs(page_text): # If adding this paragraph would exceed target, emit current buffer if approx_tokens(buffer + paragraph) > target_tokens and buffer: chunks.append({ "chunk_id": hashlib.md5(buffer.encode()).hexdigest()[:12], "page": page_buffer_started, "text": buffer.strip(), }) # Carry forward the last 15% as overlap tail = buffer[-int(len(buffer) * overlap):] buffer = tail + paragraph + "\n\n" page_buffer_started = page_num else: buffer += paragraph + "\n\n" if buffer.strip(): chunks.append({ "chunk_id": hashlib.md5(buffer.encode()).hexdigest()[:12], "page": page_buffer_started, "text": buffer.strip(), }) return chunks def split_into_paragraphs(page_text: str) -> list[str]: return [p for p in page_text.split("\n\n") if p.strip()] def approx_tokens(text: str) -> int: return len(text) // 4 # rule of thumb ``` **TypeScript:** ```typescript import { createHash } from "node:crypto"; interface Chunk { chunk_id: string; page: number; text: string; } export function chunkDocument( pages: string[], targetTokens = 1200, overlap = 0.15, ): Chunk[] { // Semantic chunking with overlap. Split at section, paragraph, sentence. 
const chunks: Chunk[] = []; let buffer = ""; let pageBufferStarted = 1; pages.forEach((pageText, idx) => { const pageNum = idx + 1; for (const paragraph of pageText.split(/\n\n+/).filter(Boolean)) { if (approxTokens(buffer + paragraph) > targetTokens && buffer) { chunks.push({ chunk_id: createHash("md5").update(buffer).digest("hex").slice(0, 12), page: pageBufferStarted, text: buffer.trim(), }); // Carry forward the last 15% as overlap const tail = buffer.slice(-Math.floor(buffer.length * overlap)); buffer = tail + paragraph + "\n\n"; pageBufferStarted = pageNum; } else { buffer += paragraph + "\n\n"; } } }); if (buffer.trim()) { chunks.push({ chunk_id: createHash("md5").update(buffer).digest("hex").slice(0, 12), page: pageBufferStarted, text: buffer.trim(), }); } return chunks; } function approxTokens(text: string): number { return Math.floor(text.length / 4); // rule of thumb } ``` Concept: `context-window` ### 2. Embed and index every chunk At ingest time, embed each chunk once with a strong embedding model (Voyage-3 or OpenAI text-embedding-3-small) and store in a vector index. Index by chunk_id; the embedding becomes the search key. Re-embedding only fires on content change (hash-keyed cache). For a 200-page document at ~500 chunks, embedding is a ~$0.05 one-time cost. **Python:** ```python # embed_and_index.py import voyageai vo = voyageai.Client() # picks up VOYAGE_API_KEY def embed_and_index(chunks: list[Chunk], collection: str): """Embed every chunk; store in vector DB keyed by chunk_id.""" texts = [c["text"] for c in chunks] # Voyage supports batched embedding. Much cheaper than per-chunk embeddings = vo.embed(texts, model="voyage-3", input_type="document").embeddings # Pseudo-code for vector store; real impl uses pgvector / Pinecone / FAISS for chunk, embedding in zip(chunks, embeddings): vector_store.upsert( collection=collection, id=chunk["chunk_id"], vector=embedding, metadata={"page": chunk["page"], "text_preview": chunk["text"][:200]}, ) return len(embeddings) # Re-embedding cache: skip if chunk content hash hasn't changed def smart_reindex(doc_id: str, new_chunks: list[Chunk]): existing = vector_store.list_chunks(collection=doc_id) existing_ids = {c["id"] for c in existing} new_ids = {c["chunk_id"] for c in new_chunks} to_delete = existing_ids - new_ids to_add = [c for c in new_chunks if c["chunk_id"] not in existing_ids] vector_store.delete_many(doc_id, list(to_delete)) embed_and_index(to_add, doc_id) ``` **TypeScript:** ```typescript // embed-and-index.ts import { VoyageAIClient } from "voyageai"; const vo = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY! }); export async function embedAndIndex(chunks: Chunk[], collection: string) { const texts = chunks.map((c) => c.text); // Voyage supports batched embedding. 
Much cheaper than per-chunk const { embeddings } = await vo.embed({ input: texts, model: "voyage-3", inputType: "document", }); // Pseudo-code for vector store; real impl uses pgvector / Pinecone / FAISS for (let i = 0; i < chunks.length; i++) { await vectorStore.upsert({ collection, id: chunks[i].chunk_id, vector: embeddings![i], metadata: { page: chunks[i].page, text_preview: chunks[i].text.slice(0, 200), }, }); } return embeddings!.length; } // Re-embedding cache: skip if chunk content hash hasn't changed export async function smartReindex(docId: string, newChunks: Chunk[]) { const existing = await vectorStore.listChunks(docId); const existingIds = new Set(existing.map((c) => c.id)); const newIds = new Set(newChunks.map((c) => c.chunk_id)); const toDelete = [...existingIds].filter((id) => !newIds.has(id)); const toAdd = newChunks.filter((c) => !existingIds.has(c.chunk_id)); await vectorStore.deleteMany(docId, toDelete); await embedAndIndex(toAdd, docId); } ``` Concept: `context-window` ### 3. Retrieve top-K chunks; never the full document When the agent asks a question (e.g., 'what's the indemnification clause?'), embed the question, retrieve the top-K=5 most-similar chunks, and pass ONLY those into context. The full document never enters the prompt. K=5 is the sweet spot. Bigger K dilutes context with marginally-relevant chunks; smaller K misses the right one. Re-rank top-20 with Claude Haiku if precision matters. **Python:** ```python def retrieve_top_k(question: str, doc_id: str, k: int = 5) -> list[dict]: """Top-K retrieval. Full doc never enters context.""" q_embed = vo.embed([question], model="voyage-3", input_type="query").embeddings[0] candidates = vector_store.search(collection=doc_id, vector=q_embed, top_k=20) # Optional re-rank with Haiku for precision if len(candidates) > k: candidates = rerank_with_haiku(question, candidates)[:k] return [ { "chunk_id": c["id"], "page": c["metadata"]["page"], "text": c["metadata"]["text_full"], # rehydrate from chunk store "score": c["score"], } for c in candidates[:k] ] def rerank_with_haiku(question: str, candidates: list[dict]) -> list[dict]: """Use Haiku to re-rank embedding candidates (cheap, focused).""" prompt = ( f"Question: {question}\n\n" + "\n\n".join(f"[{i}] {c['metadata']['text_preview']}" for i, c in enumerate(candidates)) + "\n\nReturn the indices of the top 5 most relevant chunks, JSON only:" ) resp = client.messages.create( model="claude-haiku-4-5-20251001", max_tokens=64, messages=[{"role": "user", "content": prompt}], ) indices = json.loads(resp.content[0].text) return [candidates[i] for i in indices] ``` **TypeScript:** ```typescript export async function retrieveTopK( question: string, docId: string, k = 5, ) { // Top-K retrieval. Full doc never enters context. 
const { embeddings } = await vo.embed({ input: [question], model: "voyage-3", inputType: "query", }); let candidates = await vectorStore.search({ collection: docId, vector: embeddings![0], topK: 20, }); // Optional re-rank with Haiku for precision if (candidates.length > k) { candidates = (await rerankWithHaiku(question, candidates)).slice(0, k); } return candidates.slice(0, k).map((c) => ({ chunk_id: c.id, page: c.metadata.page, text: c.metadata.text_full, score: c.score, })); } async function rerankWithHaiku( question: string, candidates: Array<{ id: string; metadata: Record; score: number }>, ) { const prompt = `Question: ${question}\n\n` + candidates .map((c, i) => `[${i}] ${c.metadata.text_preview}`) .join("\n\n") + "\n\nReturn the indices of the top 5 most relevant chunks, JSON only:"; const resp = await client.messages.create({ model: "claude-haiku-4-5-20251001", max_tokens: 64, messages: [{ role: "user", content: prompt }], }); const indices = JSON.parse( resp.content[0].type === "text" ? resp.content[0].text : "[]", ) as number[]; return indices.map((i) => candidates[i]); } ``` Concept: `context-window` ### 4. Pin CASE_FACTS at the prompt top. Never summarize Every prompt iteration starts with a CASE_FACTS block: doc_id, extracted_count, last_clause_id, decisions already made. The block is rebuilt from durable state every turn. It survives summarization, model swaps, and session resets. Critically, EXACT VALUES ($247.83, not ~$250; cust_4711, not the customer) stay verbatim. The conversation history below it CAN be summarized; the case-facts cannot. **Python:** ```python def build_system_prompt(case_facts: dict, retrieved_chunks: list[dict]) -> str: """Pin CASE_FACTS at the top; retrieved chunks below; conversation last.""" chunks_text = "\n\n".join( f"### CHUNK {c['chunk_id']} (page {c['page']})\n{c['text']}" for c in retrieved_chunks ) return f"""You are a long-document extraction agent. CASE_FACTS (immutable; re-read every turn; values are EXACT, never paraphrased): - doc_id: {case_facts['doc_id']} - doc_type: {case_facts.get('doc_type', 'unknown')} - extracted_count: {case_facts.get('extracted_count', 0)} - last_clause_id: {case_facts.get('last_clause_id', 'none')} - policy_cap: ${case_facts.get('policy_cap', 0):,.2f} Constraints: - Cite every extraction with chunk_id + page from the chunks below. - Never paraphrase exact values from CASE_FACTS or chunk text. - Branch on stop_reason. On max_tokens, save state. The harness will resume. RETRIEVED CHUNKS (top-K; only these are in context): {chunks_text}""" def update_case_facts(case_facts: dict, new_extraction: dict) -> dict: """Hook-style update; preserves all values verbatim.""" return { **case_facts, "extracted_count": case_facts.get("extracted_count", 0) + 1, "last_clause_id": new_extraction["clause_id"], } ``` **TypeScript:** ```typescript interface CaseFacts { doc_id: string; doc_type?: string; extracted_count?: number; last_clause_id?: string; policy_cap?: number; } export function buildSystemPrompt( caseFacts: CaseFacts, retrievedChunks: Array<{ chunk_id: string; page: number; text: string }>, ): string { // Pin CASE_FACTS at the top; retrieved chunks below; conversation last. const chunksText = retrievedChunks .map( (c) => `### CHUNK ${c.chunk_id} (page ${c.page})\n${c.text}`, ) .join("\n\n"); return `You are a long-document extraction agent. CASE_FACTS (immutable; re-read every turn; values are EXACT, never paraphrased): - doc_id: ${caseFacts.doc_id} - doc_type: ${caseFacts.doc_type ?? 
"unknown"} - extracted_count: ${caseFacts.extracted_count ?? 0} - last_clause_id: ${caseFacts.last_clause_id ?? "none"} - policy_cap: \${(caseFacts.policy_cap ?? 0).toFixed(2)} Constraints: - Cite every extraction with chunk_id + page from the chunks below. - Never paraphrase exact values from CASE_FACTS or chunk text. - Branch on stop_reason. On max_tokens, save state. The harness will resume. RETRIEVED CHUNKS (top-K; only these are in context): ${chunksText}`; } export function updateCaseFacts( caseFacts: CaseFacts, newExtraction: { clause_id: string }, ): CaseFacts { // Hook-style update; preserves all values verbatim. return { ...caseFacts, extracted_count: (caseFacts.extracted_count ?? 0) + 1, last_clause_id: newExtraction.clause_id, }; } ``` Concept: `case-facts-block` ### 5. Checkpoint on max_tokens; resume in a fresh session When stop_reason 'max_tokens', the harness writes the current case-facts + last extracted record + chunk position to a durable store, then starts a fresh session and re-loads the checkpoint as its case-facts. Because case-facts are at the prompt top, the new session continues exactly where the old left off. No data loss; no manual intervention; no need for the agent to even know. **Python:** ```python import json from datetime import datetime CHECKPOINT_DIR = ".checkpoints" def save_checkpoint(case_facts: dict, last_record: dict, position: dict): """Persist state on max_tokens. Idempotent on doc_id.""" path = f"{CHECKPOINT_DIR}/{case_facts['doc_id']}.json" with open(path, "w") as f: json.dump({ "case_facts": case_facts, "last_record": last_record, "position": position, # {chunk_id, paragraph_offset} "saved_at": datetime.utcnow().isoformat() + "Z", }, f) return path def load_checkpoint(doc_id: str) -> dict | None: path = f"{CHECKPOINT_DIR}/{doc_id}.json" if not os.path.exists(path): return None with open(path) as f: return json.load(f) def extract_with_resume(doc_id: str, question: str, max_iter: int = 50): """Top-level loop with automatic checkpoint-and-resume.""" checkpoint = load_checkpoint(doc_id) or {} case_facts = checkpoint.get("case_facts") or {"doc_id": doc_id} for iteration in range(max_iter): chunks = retrieve_top_k(question, doc_id, k=5) resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=4096, system=build_system_prompt(case_facts, chunks), tools=[EXTRACT_CLAUSE_TOOL], messages=[{"role": "user", "content": question}], ) if resp.stop_reason == "end_turn": return {"status": "complete", "case_facts": case_facts} if resp.stop_reason == "max_tokens": # Save and continue in a fresh session save_checkpoint(case_facts, last_record={}, position={}) continue # next iteration starts a fresh session if resp.stop_reason == "tool_use": tool_use = next(b for b in resp.content if b.type == "tool_use") case_facts = update_case_facts(case_facts, tool_use.input) return {"status": "iteration_cap", "case_facts": case_facts} ``` **TypeScript:** ```typescript import { writeFileSync, readFileSync, existsSync, mkdirSync } from "node:fs"; import { join } from "node:path"; const CHECKPOINT_DIR = ".checkpoints"; mkdirSync(CHECKPOINT_DIR, { recursive: true }); export function saveCheckpoint( caseFacts: CaseFacts, lastRecord: Record, position: Record, ) { // Persist state on max_tokens. Idempotent on doc_id. 
const path = join(CHECKPOINT_DIR, `${caseFacts.doc_id}.json`); writeFileSync( path, JSON.stringify({ case_facts: caseFacts, last_record: lastRecord, position, saved_at: new Date().toISOString(), }), ); return path; } export function loadCheckpoint(docId: string): { case_facts?: CaseFacts } | null { const path = join(CHECKPOINT_DIR, `${docId}.json`); return existsSync(path) ? JSON.parse(readFileSync(path, "utf8")) : null; } export async function extractWithResume( docId: string, question: string, maxIter = 50, ) { // Top-level loop with automatic checkpoint-and-resume. const checkpoint = loadCheckpoint(docId); let caseFacts: CaseFacts = checkpoint?.case_facts ?? { doc_id: docId }; for (let i = 0; i < maxIter; i++) { const chunks = await retrieveTopK(question, docId, 5); const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 4096, system: buildSystemPrompt(caseFacts, chunks), tools: [EXTRACT_CLAUSE_TOOL], messages: [{ role: "user", content: question }], }); if (resp.stop_reason === "end_turn") { return { status: "complete" as const, case_facts: caseFacts }; } if (resp.stop_reason === "max_tokens") { saveCheckpoint(caseFacts, {}, {}); continue; // next iteration starts a fresh session } if (resp.stop_reason === "tool_use") { const tu = resp.content.find((b) => b.type === "tool_use"); if (tu?.type === "tool_use") { caseFacts = updateCaseFacts( caseFacts, tu.input as { clause_id: string }, ); } } } return { status: "iteration_cap" as const, case_facts: caseFacts }; } ``` Concept: `checkpoints` ### 6. Citations. Chunk_id + page on every output Every extraction tool emits a citations: [{ chunk_id, page }] array; the schema makes citations REQUIRED. The model can't extract a clause without pointing at the chunks that supported it. Downstream consumers (auditors, reviewers, regulators) click any extracted value and see the exact paragraph in the original. Audit-grade provenance, structurally enforced. **Python:** ```python EXTRACT_CLAUSE_TOOL = { "name": "extract_clause", "description": ( "Extract a contractual clause from the retrieved chunks.\n" "Use this when the user asks for a specific clause type.\n" "Edge cases: if the clause type is not present in any retrieved chunk, " "emit clause_text='not_found' with empty citations.\n" "ALWAYS cite chunk_id and page for every extraction." ), "input_schema": { "type": "object", "properties": { "clause_id": {"type": "string"}, "clause_type": { "type": "string", "enum": ["indemnification", "termination", "payment", "ip", "other"], }, "clause_text": {"type": "string"}, "citations": { "type": "array", "minItems": 1, # require at least one citation "items": { "type": "object", "properties": { "chunk_id": {"type": "string"}, "page": {"type": "integer", "minimum": 1}, "span": { "type": "string", "description": "character offsets within the chunk, e.g. 
'128-340'", }, }, "required": ["chunk_id", "page"], }, }, }, "required": ["clause_id", "clause_type", "clause_text", "citations"], }, } ``` **TypeScript:** ```typescript const EXTRACT_CLAUSE_TOOL: Anthropic.Tool = { name: "extract_clause", description: "Extract a contractual clause from the retrieved chunks.\n" + "Use this when the user asks for a specific clause type.\n" + "Edge cases: if the clause type is not present in any retrieved chunk, " + "emit clause_text='not_found' with empty citations.\n" + "ALWAYS cite chunk_id and page for every extraction.", input_schema: { type: "object", properties: { clause_id: { type: "string" }, clause_type: { type: "string", enum: ["indemnification", "termination", "payment", "ip", "other"], }, clause_text: { type: "string" }, citations: { type: "array", minItems: 1, // require at least one citation items: { type: "object", properties: { chunk_id: { type: "string" }, page: { type: "integer", minimum: 1 }, span: { type: "string", description: "character offsets within the chunk, e.g. '128-340'", }, }, required: ["chunk_id", "page"], }, }, }, required: ["clause_id", "clause_type", "clause_text", "citations"], }, }; ``` Concept: `structured-outputs` ### 7. Bulk extraction via Batch API (50% off, 24h) When the use case is 'extract every payment clause from 1000 contracts overnight', the Batch API earns its 50% discount. Submit at 6 PM, results ready at 6 AM. No real-time retry inside the batch; failures get resubmitted in the next batch with their specific error in the next message. Combined with prompt caching on the system prompt + tool registry, bulk extraction cost drops 95%+ vs naive sync calls. **Python:** ```python def submit_bulk_extraction(docs: list[dict], clause_type: str) -> str: """Submit a batch of clause-extraction requests for overnight processing.""" requests = [] for doc in docs: chunks = retrieve_top_k(f"Find {clause_type} clauses", doc["id"], k=5) case_facts = load_checkpoint(doc["id"]) or {"doc_id": doc["id"]} requests.append({ "custom_id": f"clause-{doc['id']}-{clause_type}", "params": { "model": "claude-sonnet-4.5", "max_tokens": 2048, "system": [{ "type": "text", "text": build_system_prompt(case_facts, chunks), "cache_control": {"type": "ephemeral"}, # cache the system prompt }], "tools": [ {**EXTRACT_CLAUSE_TOOL, "cache_control": {"type": "ephemeral"}}, ], "tool_choice": {"type": "tool", "name": "extract_clause"}, "messages": [{"role": "user", "content": f"Extract all {clause_type} clauses."}], }, }) batch = client.messages.batches.create(requests=requests) return batch.id # Next morning. Fetch + harvest def harvest_bulk(batch_id: str): results = client.messages.batches.results(batch_id) accepted, retry_queue = [], [] for r in results: if r.result.type == "succeeded": tu = next(b for b in r.result.message.content if b.type == "tool_use") accepted.append(tu.input) else: retry_queue.append(r.custom_id) return {"accepted": accepted, "retry": retry_queue} ``` **TypeScript:** ```typescript async function submitBulkExtraction( docs: Array<{ id: string }>, clauseType: string, ) { // Submit a batch of clause-extraction requests for overnight processing. const requests = await Promise.all( docs.map(async (doc) => { const chunks = await retrieveTopK(`Find ${clauseType} clauses`, doc.id, 5); const cp = loadCheckpoint(doc.id); const caseFacts: CaseFacts = cp?.case_facts ?? 
{ doc_id: doc.id }; return { custom_id: `clause-${doc.id}-${clauseType}`, params: { model: "claude-sonnet-4.5", max_tokens: 2048, system: [ { type: "text", text: buildSystemPrompt(caseFacts, chunks), cache_control: { type: "ephemeral" }, // cache the system prompt }, ], tools: [ { ...EXTRACT_CLAUSE_TOOL, cache_control: { type: "ephemeral" }, }, ], tool_choice: { type: "tool", name: "extract_clause" } as const, messages: [ { role: "user" as const, content: `Extract all ${clauseType} clauses.`, }, ], }, }; }), ); const batch = await client.messages.batches.create({ requests }); return batch.id; } // Next morning. Fetch + harvest async function harvestBulk(batchId: string) { const results = client.messages.batches.results(batchId); const accepted: unknown[] = []; const retryQueue: string[] = []; for await (const r of results) { if (r.result.type === "succeeded") { const tu = r.result.message.content.find((b) => b.type === "tool_use"); if (tu?.type === "tool_use") accepted.push(tu.input); } else { retryQueue.push(r.custom_id); } } return { accepted, retry: retryQueue }; } ``` Concept: `batch-api` ### 8. Stratified accuracy + adversarial 'silent source' tests Aggregate accuracy hides per-document-type weakness. Stratify by doc_type (MSA vs DPA vs SOW), by clause_type, by page-section (front/middle/back). Surface the worst stratum. Pair with an adversarial test set of 50 documents where the requested clause is GENUINELY absent. The right behaviour is clause_text='not_found' with empty citations, NEVER an invented clause. Hallucinated extractions = audit-fail. **Python:** ```python from collections import defaultdict def stratified_accuracy(extractions: list[dict]) -> dict: """Pass rate by doc_type × clause_type × page-section.""" buckets = defaultdict(lambda: {"pass": 0, "fail": 0}) for e in extractions: section = ("front" if e["citations"][0]["page"] <= 30 else "back" if e["citations"][0]["page"] > 150 else "middle") key = (e["doc_type"], e["clause_type"], section) bucket = "pass" if validate_extraction(e) else "fail" buckets[key][bucket] += 1 report = {} for (doc_type, clause_type, section), counts in buckets.items(): total = counts["pass"] + counts["fail"] report[f"{doc_type}/{clause_type}/{section}"] = { "total": total, "pass_rate": counts["pass"] / total if total else 0, } return dict(sorted(report.items(), key=lambda kv: kv[1]["pass_rate"])) def adversarial_silent_source_test() -> float: """50 docs where the requested clause is GENUINELY absent.""" correct = 0 for doc in load_silent_source_docs(): # known to NOT contain the clause result = extract_with_resume(doc["id"], "Find the indemnification clause") # Right: clause_text='not_found' with empty citations # Wrong: ANY invented clause text if result.get("case_facts", {}).get("last_extraction", {}).get("clause_text") == "not_found": correct += 1 return correct / 50 # target: ≥ 95% ``` **TypeScript:** ```typescript function stratifiedAccuracy( extractions: Array<{ doc_type: string; clause_type: string; citations: Array<{ page: number }>; }>, ) { // Pass rate by doc_type × clause_type × page-section. const buckets = new Map(); for (const e of extractions) { const page = e.citations[0].page; const section = page <= 30 ? "front" : page > 150 ? "back" : "middle"; const key = `${e.doc_type}/${e.clause_type}/${section}`; const bucket = validateExtraction(e) ? "pass" : "fail"; const counts = buckets.get(key) ?? 
{ pass: 0, fail: 0 }; counts[bucket]++; buckets.set(key, counts); } const report: Record = {}; for (const [key, counts] of buckets) { const total = counts.pass + counts.fail; report[key] = { total, pass_rate: total ? counts.pass / total : 0 }; } return Object.fromEntries( Object.entries(report).sort(([, a], [, b]) => a.pass_rate - b.pass_rate), ); } async function adversarialSilentSourceTest(): Promise { // 50 docs where the requested clause is GENUINELY absent. let correct = 0; for (const doc of await loadSilentSourceDocs()) { const result = await extractWithResume(doc.id, "Find the indemnification clause"); // Right: clause_text='not_found' with empty citations // Wrong: ANY invented clause text const lastClause = (result.case_facts as { last_extraction?: { clause_text?: string } }) .last_extraction?.clause_text; if (lastClause === "not_found") correct++; } return correct / 50; // target: ≥ 95% } ``` Concept: `evaluation` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | 200-page document into the prompt | Chunk + index + retrieve top-K (K=5) | Stuff the whole document into one prompt | Stuffing hits max_tokens around page 120 and triggers lost-in-the-middle. Top-K retrieval keeps the prompt small and focused. The full document never enters context, so length is bounded by chunk count not document size. | | Storing transactional values across many turns | CASE_FACTS block. Exact values, never paraphrased | Progressive summarization that paraphrases the conversation including facts | Summarization erodes precision. '$247.83' becomes '~$250'; 'cust_4711' becomes 'the customer'. CASE_FACTS keeps exact values verbatim; conversation history can be summarized, facts cannot. | | Hit max_tokens mid-document | Save state + start fresh session + reload from checkpoint | Increase max_tokens or just retry from scratch | Larger windows defer the problem; checkpoint-and-resume permanently solves it. Restarting from scratch loses everything extracted so far. The architectural pattern scales to documents of any length. | | Bulk overnight processing of 1000 documents | Batch API + cached system prompt + cached tool registry | Sync API in a tight loop | Batch API gives a flat 50% discount with a 24h SLA. Fine for non-blocking backfill. Caching adds another ~90% off the system + tools. Combined: ~95% savings vs naive sync. Sync API is for latency-critical extraction only. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-LDP-01 · Stuff the whole document into the prompt | Try to paste a 150-page contract into a single prompt. Hits max_tokens at page 120; lost-in-the-middle drops the order ID established on page 1; agent makes contradictory recommendations on later pages. | Chunk + index + top-K retrieval. Only K=5 chunks ever enter context. The full document is searchable but never present. Length is bounded by retrieval, not document size. | | AP-LDP-02 · Progressive summarization of facts | Long conversation summarizes every 10 turns. Refund amount '$247.83' becomes '~$250' in the summary; customer ID 'cust_4711' becomes 'the customer'. Audit fails because exact values were paraphrased. | CASE_FACTS block at every prompt top. Never summarized. Holds exact values verbatim. Only the message history below is summarized; the case-facts persist verbatim across every iteration. | | AP-LDP-03 · No checkpoint on max_tokens | Long batch job processes 200 pages, hits max_tokens at turn 15. 
The whole pipeline aborts; everything extracted so far is lost; operator has to restart from page 1. | Checkpoint-and-resume: on stop_reason: max_tokens, persist state (case_facts + last extraction + chunk position) and start a fresh session that reloads the checkpoint as its case-facts. Idempotent on chunk_id. | | AP-LDP-04 · RAG without citations | Agent retrieves chunks and emits extracted clauses without saying which chunk supported each one. Auditor asks 'where does this come from?' and there's no answer; auditor flags the run as un-verifiable. | Citation tracker: every extraction emits citations: [{ chunk_id, page }] with minItems: 1 in the schema. The model can't extract a clause without pointing at the supporting chunks. Audit-grade provenance is structurally enforced. | | AP-LDP-05 · Chunking destroys context | Fixed-size chunks (every 1000 characters) split mid-sentence and mid-paragraph. A clause that spans a chunk boundary appears truncated in both adjacent chunks; retrieval misses it; extraction is wrong. | Semantic chunking with overlap. Split at section / paragraph / sentence boundaries (in that precedence). Add 10-20% overlap between adjacent chunks so a boundary-spanning clause stays whole in at least one chunk. Each chunk is meaning-complete. | ## Implementation checklist - [ ] Semantic chunking (paragraph + section boundaries) with 10-20% overlap (`context-window`; sketch below) - [ ] Each chunk has a deterministic chunk_id and a page number - [ ] Embeddings indexed once at ingest; re-embedding only on content change (`context-window`) - [ ] Top-K retrieval (K=5 typical); full document never enters context - [ ] CASE_FACTS block pinned at every prompt top. Exact values, never paraphrased (`case-facts-block`) - [ ] Checkpoint-and-resume on max_tokens; idempotent on chunk_id (`checkpoints`) - [ ] Citations REQUIRED in every extraction tool's schema (minItems: 1) (`structured-outputs`) - [ ] System prompt + tool registry cached with cache_control: ephemeral (`prompt-caching`) - [ ] Batch API for bulk overnight runs (≥ 100 docs) (`batch-api`) - [ ] Stratified accuracy: doc_type × clause_type × page-section (`evaluation`) - [ ] Adversarial 'silent source' test (≥ 50 docs known to lack the clause); target ≥ 95% not_found rate ## Cost & latency - **One-time embedding (200-page doc, ~500 chunks):** ~$0.06-0.07, 500 chunks × ~1000 tokens × Voyage-3 at $0.13/M tokens ≈ $0.065. One-time cost at ingest; never re-paid unless content changes. - **Per-extraction retrieval + Claude call (cached system + tools):** ~$0.008-0.015, Embedding query (~$0.0001) + vector lookup (~$0.0001) + 5 chunks × ~1000 tokens (cached) + ~500 output tokens. Cache hit rate ≥ 70% drops effective per-call cost ~80%. - **Bulk overnight (Batch API + caching):** ~$0.004-0.008 per extraction, Batch API 50% discount × prompt caching ~90% off the system + tools = ~95% off naive sync. 1000 extractions @ $0.008 each = $8 total overnight. - **Checkpoint storage:** ~5-15 KB per document, JSON dump of case_facts + last extraction + chunk position. Negligible per-document; at 1000 docs in flight, <15MB total. Idempotent re-loads add no cost. - **p95 latency per extraction (cached, sync):** ~2-4 seconds, Embed query (50ms) + vector lookup (50ms) + Claude call (2-3s with cache hit). Acceptable for interactive review of contracts; bulk uses Batch API.
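The checklist's first three items (semantic chunking with 10-20% overlap, a deterministic chunk_id, a page number per chunk) are the one part of this pipeline not shown in code above. A minimal Python sketch, assuming the document arrives as a `pages` list (one string per page), approximating token counts by word count, and skipping section-header detection for brevity; the function name is illustrative, not an SDK API:

```python
import hashlib
import re

def semantic_chunks(
    pages: list[str],             # one string per page, in reading order
    target_tokens: int = 1000,    # ~1000-1200 tokens is the FAQ's default
    overlap_ratio: float = 0.15,  # 10-20% overlap between adjacent chunks
) -> list[dict]:
    """Split at paragraph boundaries (sentence fallback), never mid-sentence."""
    units: list[tuple[int, str]] = []  # (page_number, paragraph-or-sentence)
    for page_no, page_text in enumerate(pages, start=1):
        for para in re.split(r"\n\s*\n", page_text):
            para = para.strip()
            if not para:
                continue
            if len(para.split()) > target_tokens:
                # Oversized paragraph: fall back to sentence boundaries
                units += [(page_no, s) for s in re.split(r"(?<=[.!?])\s+", para) if s]
            else:
                units.append((page_no, para))

    def emit(buf: list[tuple[int, str]]) -> dict:
        text = "\n\n".join(u[1] for u in buf)
        return {
            "chunk_id": "chk_" + hashlib.sha1(text.encode()).hexdigest()[:12],  # deterministic
            "page": buf[0][0],  # page the chunk starts on
            "text": text,
        }

    chunks: list[dict] = []
    buf: list[tuple[int, str]] = []
    size = 0                 # word count as a rough token proxy
    new_since_emit = False
    for unit in units:
        buf.append(unit)
        new_since_emit = True
        size += len(unit[1].split())
        if size >= target_tokens:
            chunks.append(emit(buf))
            # Carry the tail forward so a boundary-spanning clause stays whole
            # in at least one chunk (the AP-LDP-05 fix)
            keep = max(1, int(len(buf) * overlap_ratio))
            buf = buf[-keep:]
            size = sum(len(u[1].split()) for u in buf)
            new_since_emit = False
    if buf and new_since_emit:
        chunks.append(emit(buf))
    return chunks
```

The emitted shape ({chunk_id, page, text}) matches what `build_system_prompt` and the top-K retrieval above expect, and hashing the chunk text keeps chunk_id stable across re-ingests, which is what makes "re-embedding only on content change" possible.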
## Domain weights - **D5 · Context + Reliability (15%):** Semantic chunking · top-K retrieval · CASE_FACTS pinning · checkpoint-and-resume - **D2 · Tool Design + Integration (18%):** search_chunks + extract_clause tool registry · citation contract · Batch API integration ## Practice questions ### Q1. 150-page contract; the agent processes it in one prompt and dies at page 120 with max_tokens. Recovery without losing everything? Checkpoint-and-resume. On stop_reason: max_tokens, persist {case_facts, last_extraction, chunk_position} to durable storage; start a FRESH session whose system prompt loads the checkpoint as its CASE_FACTS; continue from chunk_position. The new session has clean context but inherits exactly the state of the old one. Idempotent on chunk_id. Re-running a chunk just produces the same extraction. Tagged to AP-LDP-03. ### Q2. RAG: should you retrieve all chunks similar to the query, or top-K ranked? Top-K (K=5 typical). All-similar floods context with marginally-relevant chunks; the model wastes attention. Top-5 keeps the prompt focused on the most relevant evidence. Re-rank top-20 with Claude Haiku for the top-5 if precision matters; the cost is negligible and the precision lift on hard queries is real. ### Q3. Chunking strategy: fixed-size or semantic? Semantic (break at section / paragraph / sentence boundaries, in that precedence). Fixed-size (every 1000 chars) splits mid-sentence; clauses that span a boundary appear truncated in both adjacent chunks. Semantic chunking preserves meaning; pair with 10-20% overlap so a clause spanning a paragraph boundary still appears whole in at least one chunk. Tagged to AP-LDP-05. ### Q4. How do you preserve citations through long-document extraction? Make citations REQUIRED in the extraction tool's schema (minItems: 1). Every extracted clause emits citations: [{ chunk_id, page, span? }]. The model literally cannot return a clause without pointing at the supporting chunks. Downstream auditors click any extracted value and see the exact paragraph in the original document. Audit-grade provenance, structurally enforced. The model has no way to forget it. Tagged to AP-LDP-04. ### Q5. Long conversation summarized at every 10 turns. Customer ID 'cust_4711' becomes 'the customer'; refund amount '$247.83' becomes '~$250'. The audit fails. What's the architectural fix? CASE_FACTS block. Pinned at the top of every system prompt iteration, holding exact transactional values verbatim. The conversation history below it CAN be summarized; the case-facts CANNOT. Structurally separate the two: facts go in case-facts (immutable, exact), reasoning chains go in conversation history (summarizable). Tagged to AP-LDP-02. ## FAQ ### Q1. What's the optimal chunk size? 500-2000 tokens (~400-1500 words). Smaller and you make too many retrieval calls; larger and you lose the granularity that makes top-K retrieval useful. 1000-1200 tokens is a good default. Always pair with 10-20% overlap so boundary-spanning content survives. ### Q2. Should I retrieve all matching chunks or top-K? Top-K (default K=5). All-matching floods context with marginally-relevant noise; the model wastes attention. Top-5 keeps the prompt focused. If precision is critical, retrieve top-20 with embeddings and re-rank with Claude Haiku to top-5. ### Q3. How do I prevent lost-in-the-middle? Anchor critical facts at context top (CASE_FACTS), retrieve top-K only, trim verbose tool results. Long contexts dilute attention to middle content. 
Keep the prompt structure: CASE_FACTS at top, retrieved chunks in the middle, the user's latest message at the end. Don't put case-facts in the middle of the prompt. ### Q4. Does prompt caching help with RAG? Partially. Cache the stable parts: system prompt + tool registry + (optionally) the CASE_FACTS scaffold. The retrieved chunks change every query, so they're always fresh. Realistic savings: ~30-50% total cost reduction depending on system-prompt size. Not as dramatic as caching a 200-page document would have been, but RAG never had that overhead in the first place. ### Q5. Where do I store the checkpoint? Durable, idempotent storage. Convex DB, S3, or a local JSONL file in dev. Key by doc_id. The checkpoint write must be atomic; partial writes confuse the resume path. Retain checkpoints until the document is fully processed; delete on completion or after 30 days, whichever comes first. ### Q6. Can I combine Batch API with checkpoint-and-resume? Yes. For the long-running ones. Submit each document as one batch request. If a request hits max_tokens, the harvest step writes a checkpoint and includes that document in the NEXT batch with the checkpoint as case-facts. Two batches usually finish a long document; rare cases need three. ### Q7. How do I handle a 500-page doc that exceeds even the chunked + paged max_tokens cap? Checkpoint after every ~50 pages or natural boundary (chapter, section). The harness writes checkpoints on max_tokens automatically; at 500 pages you'll see ~5-8 checkpoint events across multiple sessions. Each session is bounded; the document length isn't. ## Production readiness - [ ] Semantic chunker tested on 5 representative document types (contracts, papers, manuals) - [ ] Embedding cache by chunk content hash; rebuild only on change - [ ] CASE_FACTS schema versioned; migration plan documented - [ ] Checkpoint write is atomic; idempotency tested with deliberate restarts - [ ] Citations schema enforced (minItems: 1); CI lint catches schemas missing the constraint - [ ] Batch-API job retries failures in next batch with the specific error in the next message - [ ] Stratified accuracy dashboard updated daily; alert on any stratum < 90% - [ ] Adversarial 'silent source' eval runs weekly; ≥ 95% not_found rate required --- **Source:** https://claudearchitectcertification.com/scenarios/long-document-processing **Vault sources:** ACP-T05 §Scenario 8 (🟡 beyond-guide; OP-claimed Reddit 1s34iyl); ACP-T08 §3.8 metadata; Course 12 Claude with Vertex. Lessons 46, 54 (RAG, contextual retrieval); Course 11 Claude in Bedrock. Lesson 43 RAG introduction; ACP-T06 (5 practice Qs tagged to components); GAI-K05 CCA exam questions and scenarios; COD-K04 Feynman architecture review (long-doc patterns) **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Claude for Operations > An on-call ops agent. Alerts arrive structured (service, severity, root_cause_signal, 4-bucket isError: Transient · Permission · Data · Business); the agent matches a runbook from a small registry (3-5 safe playbooks: restart, rollback, drain, scale); a PreToolUse hook denies destructive Bash (rm -rf, sudo, drop database) before it executes; on ambiguity or denial, the harness emits a structured escalation block (incident_id, service, severity, root_cause, partial_status, recommended_action) to PagerDuty; every tool call writes a PostToolUse audit log entry with the full call context. 
The most-tested distractor: sentiment-based escalation. Sentiment is orthogonal to ops. Escalate on policy gaps and access failures, never on log-message tone. **Sub-marker:** P3.9 **Domains:** D1 · Agentic Architectures, D5 · Context + Reliability **Exam weight:** 42% of CCA-F (D1 + D5) **Build time:** 26 minutes **Source:** 🟡 Beyond-guide scenario · OP-claimed (Reddit 1s34iyl) · architecture matches Anthropic public guidance **Canonical:** https://claudearchitectcertification.com/scenarios/claude-for-operations **Last reviewed:** 2026-05-04 ## In plain English Think of this as the on-call agent that picks up the 3 AM page so a human doesn't have to. But only for the boring, predictable incidents, and only when the action is reversible. The agent reads a structured alert, classifies it into one of four buckets (transient blip vs permission failure vs bad data vs business policy), matches the right runbook, and executes the safe steps. The dangerous steps (`rm -rf prod`, dropping a production database) are blocked by a deterministic hook before they ever run. The agent literally cannot execute them, no matter how cleverly the alert text is phrased. Anything ambiguous gets a structured escalation block (incident ID, service, severity, root cause, recommended next step) routed to a human in PagerDuty. The whole point is composure under pressure: hooks for safety, structured isError for clarity, audit logs for after-action review. ## Exam impact Domain 1 (Agentic Architecture, 27%) tests the runbook + hook composition, the structured escalation block, and the loop's stop_reason handling on `tool_use → hook denied`. Domain 5 (Context, 15%) tests case-facts pinning of incident metadata across multi-turn ops conversations. Beyond-guide but architecturally well-grounded. The 'why does the agent retry permission errors forever?' question is the canonical exam distractor on this scenario. ## The problem ### What the customer needs - Auto-resolve boring incidents. Pod restarts, cache flushes, log rotation. Without paging a human. - Block destructive commands deterministically. rm -rf /prod must NEVER execute, no matter what the alert text says. - Hand off ambiguous incidents cleanly. When the agent can't safely act, the on-call human gets a structured block (not a raw transcript) and can decide in 30 seconds. ### Why naive approaches fail - Agent runs commands without a hook gate → rm -rf /prod executes because the alert text said 'clear stuck pods'. - Permission-denied confused with empty result → agent retries forever thinking 'no data yet'; backoff never converges. - Sentiment-trigger escalation → angry-but-correct alerts wake humans; calm-but-broken alerts get ignored. - No audit log → post-mortem can't reconstruct what the agent did at 3 AM; trust evaporates. ### Definition of done - Every monitoring alert classified into the 4-bucket isError contract (Transient · Permission · Data · Business) - 3-5 safe runbooks in the registry; rare/unique incidents always escalate (don't add a 6th runbook for an edge case) - PreToolUse hook denies destructive commands by regex; exit 2 returns a model-readable reason - Structured escalation block for every human handoff: incident_id, service, severity, root_cause, partial_status, recommended_action - PostToolUse audit log writes every tool call with input, output, latency, and the bucket the result fell into - Sentiment is logged but never gates routing. 
Escalation triggers are policy gaps, access failures, and explicit user request only ## Concepts in play - 🟢 **Agentic loops** (`agentic-loops`), On-call agent loop with stop_reason branching - 🟢 **Hooks** (`hooks`), PreToolUse blocklist + PostToolUse audit (the load-bearing pair) - 🟡 **Escalation** (`escalation`), Structured handoff block to PagerDuty - 🟢 **Structured outputs** (`structured-outputs`), 4-bucket isError contract + escalation block - 🟠 **Case-facts block** (`case-facts-block`), Incident metadata pinned across multi-turn ops - 🟢 **Tool calling** (`tool-calling`), Runbook tools + Bash with allowlist - 🟢 **Evaluation** (`evaluation`), Audit log + post-mortem replay - 🟢 **System prompts** (`system-prompts`), On-call persona + escalation triggers documented ## Components ### Monitoring Alert Parser, structured isError, 4 buckets Receives raw alert payloads (Datadog, Prometheus, Sentry, custom monitors) and projects them into a stable 4-bucket isError contract: {bucket: 'Transient'|'Permission'|'Data'|'Business', service, severity, root_cause_signal, retryable}. The agent reads bucket and retryable and routes; it never parses raw alert text. Without this contract, the agent retries permission errors forever and panics on transient blips. **Configuration:** Webhook receiver → schema validator → bucket classifier (rule-based or Haiku-classified) → enriched alert. Schema requires bucket + retryable; the alert ingestor rejects alerts missing the contract. **Concept:** `structured-outputs` ### Runbook Registry (3-5 Safe Playbooks), small, audited, reversible A tight registry of 3-5 named runbooks the agent can execute. Each runbook is a sequence of safe, reversible commands with explicit preconditions, timeouts, and rollback steps. Rare or unique incidents NEVER get a runbook. They escalate. The registry stays small on purpose; expanding it past 5 erodes the agent's routing accuracy and increases blast radius. **Configuration:** Registry: ["restart_service", "drain_region", "rollback_release", "rotate_credentials", "scale_replicas"]. Each has {preconditions: [...], steps: [...], timeout_s, rollback: [...]}. Reviewed in PRs; production-deployed via the same CI/CD as application code. **Concept:** `tool-calling` ### PreToolUse Hook (Blocklist Gate), deterministic destructive-command guard Sits between the model's tool_use request and Bash/destructive tool execution. Reads the proposed command; matches against an explicit blocklist regex (rm -rf, sudo, drop database, kill -9, chmod 777, >:). On match, exits 2 with a model-readable reason. The agent observes the deny in the next turn as a tool_result with is_error: true and re-plans (typically by escalating). Deterministic. No prompt-injection bypass. **Configuration:** matcher: "Bash". Blocklist regex (compiled once): r"\b(rm\s+-rf|sudo\s|drop\s+(database|table)|kill\s+-9|chmod\s+777|>:)\b". Exit 2 with stderr message. Allowlist for known-safe binaries (kubectl, docker, journalctl, curl, jq). **Concept:** `hooks` ### Structured Escalation Block, 30-second human triage When the agent can't (or shouldn't) act. Unknown runbook, hook denied, ambiguous alert, explicit request. The harness writes a STRUCTURED escalation block to PagerDuty / Slack. Six fields, every field required: incident_id, service, severity, root_cause_signal, partial_status (what the agent already did), recommended_action. Humans triage in ~30 seconds vs ~5 minutes reading a raw transcript. 
**Configuration:** Schema: {incident_id, service, severity: "P1|P2|P3", root_cause_signal: "permission|transient|data|business|unknown", partial_status: "what agent did before stopping", recommended_action: "single sentence"}. Posted to PagerDuty REST + Slack #ops-incidents. **Concept:** `escalation` ### PostToolUse Audit Log, the post-mortem replay tool Fires AFTER every tool call (Bash, runbook, escalate, log query). Writes a canonical row: ts, tool_name, tool_input, tool_result_bucket, latency_ms, stop_reason_context, hook_decisions. Append-only JSONL on durable storage. Indispensable for the 4 AM 'what did the agent do?' post-mortem; without it, trust in the agent collapses on the first incident. **Configuration:** matcher: '*'. Append to audit/{YYYY-MM-DD}.jsonl. Retain ≥ 90 days. Searchable by incident_id, service, tool_name. Includes both successful and denied calls. **Concept:** `evaluation` ## Build steps ### 1. Parse alerts into the 4-bucket isError contract The webhook receiver projects raw monitoring payloads (Datadog/Prometheus/Sentry/custom) into a stable shape: {bucket, service, severity, root_cause_signal, retryable, raw}. The bucket is the single most important field. It's what the agent's routing logic branches on. Bucket assignment is rule-based for known patterns; ambiguous alerts get classified by Haiku (cheap, fast) and emit the bucket + a confidence score. **Python:** ```python from enum import Enum from typing import TypedDict, Literal class Bucket(str, Enum): TRANSIENT = "Transient" # network blip, rate limit. Retry PERMISSION = "Permission" # 403/401. Escalate, won't fix itself DATA = "Data" # malformed input. Surface to user BUSINESS = "Business" # policy violation. Block + log + escalate class Alert(TypedDict): bucket: Bucket service: str severity: Literal["P1", "P2", "P3"] root_cause_signal: str retryable: bool raw: dict def classify_alert(raw: dict) -> Alert: """Project arbitrary monitoring payload into the 4-bucket contract.""" msg = (raw.get("message") or "").lower() if any(s in msg for s in ["503", "timeout", "rate limit", "circuit breaker"]): bucket, retryable = Bucket.TRANSIENT, True elif any(s in msg for s in ["403", "401", "forbidden", "unauthorized"]): bucket, retryable = Bucket.PERMISSION, False elif any(s in msg for s in ["malformed", "schema", "validation", "400"]): bucket, retryable = Bucket.DATA, False elif any(s in msg for s in ["policy", "compliance", "limit exceeded"]): bucket, retryable = Bucket.BUSINESS, False else: # Ambiguous. Classify with Haiku (cheap) and trust the bucket it returns bucket, retryable = haiku_classify_bucket(raw) return { "bucket": bucket, "service": raw.get("service", "unknown"), "severity": raw.get("severity", "P2"), "root_cause_signal": raw.get("root_cause") or msg[:100], "retryable": retryable, "raw": raw, } def haiku_classify_bucket(raw: dict) -> tuple[Bucket, bool]: """Cheap fallback classifier for ambiguous alerts.""" resp = client.messages.create( model="claude-haiku-4-5-20251001", max_tokens=64, messages=[{"role": "user", "content": f"Classify into one bucket (Transient|Permission|Data|Business) " f"+ retryable (true|false). Alert: {raw}\n" f"Output JSON only: {{\"bucket\": ..., \"retryable\": ...}}"}], ) parsed = json.loads(resp.content[0].text) return Bucket(parsed["bucket"]), parsed["retryable"] ``` **TypeScript:** ```typescript enum Bucket { Transient = "Transient", // network blip, rate limit. Retry Permission = "Permission", // 403/401. Escalate, won't fix itself Data = "Data", // malformed input. 
Surface to user Business = "Business", // policy violation. Block + log + escalate } interface Alert { bucket: Bucket; service: string; severity: "P1" | "P2" | "P3"; root_cause_signal: string; retryable: boolean; raw: Record<string, unknown>; } export async function classifyAlert(raw: Record<string, unknown>): Promise<Alert> { // Project arbitrary monitoring payload into the 4-bucket contract. const msg = String(raw.message ?? "").toLowerCase(); let bucket: Bucket; let retryable: boolean; if (["503", "timeout", "rate limit", "circuit breaker"].some((s) => msg.includes(s))) { bucket = Bucket.Transient; retryable = true; } else if (["403", "401", "forbidden", "unauthorized"].some((s) => msg.includes(s))) { bucket = Bucket.Permission; retryable = false; } else if (["malformed", "schema", "validation", "400"].some((s) => msg.includes(s))) { bucket = Bucket.Data; retryable = false; } else if (["policy", "compliance", "limit exceeded"].some((s) => msg.includes(s))) { bucket = Bucket.Business; retryable = false; } else { [bucket, retryable] = await haikuClassifyBucket(raw); } return { bucket, service: String(raw.service ?? "unknown"), severity: (raw.severity as Alert["severity"]) ?? "P2", root_cause_signal: String(raw.root_cause ?? msg.slice(0, 100)), retryable, raw, }; } async function haikuClassifyBucket( raw: Record<string, unknown>, ): Promise<[Bucket, boolean]> { // Cheap fallback classifier for ambiguous alerts. const resp = await client.messages.create({ model: "claude-haiku-4-5-20251001", max_tokens: 64, messages: [ { role: "user", content: "Classify into one bucket (Transient|Permission|Data|Business) " + "+ retryable (true|false). Alert: " + JSON.stringify(raw) + "\n" + "Output JSON only: {\"bucket\": ..., \"retryable\": ...}", }, ], }); const text = resp.content[0].type === "text" ? resp.content[0].text : "{}"; const parsed = JSON.parse(text); return [parsed.bucket as Bucket, parsed.retryable as boolean]; } ``` Concept: `structured-outputs` ### 2. Define a small runbook registry (3-5 entries) Every runbook is a named sequence of safe, reversible commands with explicit preconditions, timeouts, and rollback steps. The registry stays at 3-5 entries. The agent's routing accuracy degrades past 5, and a sixth runbook for an edge case is exactly the wrong move (escalate the edge case instead). Each runbook is reviewed in PRs and deployed through CI like application code. **Python:** ```python # Runbooks live in code, version-controlled, PR-reviewed from typing import TypedDict class RunbookStep(TypedDict): cmd: str # the actual shell command timeout_s: int rollback: str # the inverse command, run on failure class Runbook(TypedDict): name: str description: str # 4-line pattern (what / when / edges / ordering) preconditions: list[str] # gating conditions steps: list[RunbookStep] requires_human_confirm: bool RUNBOOKS: list[Runbook] = [ { "name": "restart_service", "description": ( "Restart a stuck Kubernetes pod or systemd service.\n" "Use when service health check fails 3 consecutive times.\n" "Edge cases: returns 'still-stuck' if pod re-enters CrashLoopBackOff; escalate.\n" "Prerequisite: service is in the runbook allowlist."
), "preconditions": ["service in ALLOWLIST", "alert.bucket == Transient"], "steps": [ {"cmd": "kubectl rollout restart deploy/{service}", "timeout_s": 60, "rollback": "kubectl rollout undo deploy/{service}"}, {"cmd": "kubectl rollout status deploy/{service} --timeout=2m", "timeout_s": 120, "rollback": ""}, ], "requires_human_confirm": False, }, { "name": "rollback_release", "description": ( "Roll the service back to the last known-good release.\n" "Use when a release introduces a P1/P2 regression.\n" "Edge cases: returns 'no-prior-release' if no rollback target; escalate.\n" "Prerequisite: service has a deploy history." ), "preconditions": ["service in ALLOWLIST", "deploy_history_count >= 2"], "steps": [ {"cmd": "kubectl rollout undo deploy/{service}", "timeout_s": 60, "rollback": "kubectl rollout undo deploy/{service} --to-revision={current}"}, ], "requires_human_confirm": True, # destructive, gates on human click }, # ... drain_region, rotate_credentials, scale_replicas. 5 total ] def match_runbook(alert: Alert) -> Runbook | None: """Pick the runbook whose preconditions match. Returns None to escalate.""" for rb in RUNBOOKS: if all(eval_precondition(p, alert) for p in rb["preconditions"]): return rb return None # no match. Escalate ``` **TypeScript:** ```typescript // Runbooks live in code, version-controlled, PR-reviewed interface RunbookStep { cmd: string; // the actual shell command timeout_s: number; rollback: string; // the inverse command, run on failure } interface Runbook { name: string; description: string; // 4-line pattern (what / when / edges / ordering) preconditions: string[]; steps: RunbookStep[]; requires_human_confirm: boolean; } export const RUNBOOKS: Runbook[] = [ { name: "restart_service", description: "Restart a stuck Kubernetes pod or systemd service.\n" + "Use when service health check fails 3 consecutive times.\n" + "Edge cases: returns 'still-stuck' if pod re-enters CrashLoopBackOff; escalate.\n" + "Prerequisite: service is in the runbook allowlist.", preconditions: ["service in ALLOWLIST", "alert.bucket == Transient"], steps: [ { cmd: "kubectl rollout restart deploy/{service}", timeout_s: 60, rollback: "kubectl rollout undo deploy/{service}", }, { cmd: "kubectl rollout status deploy/{service} --timeout=2m", timeout_s: 120, rollback: "", }, ], requires_human_confirm: false, }, { name: "rollback_release", description: "Roll the service back to the last known-good release.\n" + "Use when a release introduces a P1/P2 regression.\n" + "Edge cases: returns 'no-prior-release' if no rollback target; escalate.\n" + "Prerequisite: service has a deploy history.", preconditions: ["service in ALLOWLIST", "deploy_history_count >= 2"], steps: [ { cmd: "kubectl rollout undo deploy/{service}", timeout_s: 60, rollback: "kubectl rollout undo deploy/{service} --to-revision={current}", }, ], requires_human_confirm: true, // destructive, gates on human click }, // ... drain_region, rotate_credentials, scale_replicas. 5 total ]; export function matchRunbook(alert: Alert): Runbook | null { // Pick the runbook whose preconditions match. Returns null to escalate. for (const rb of RUNBOOKS) { if (rb.preconditions.every((p) => evalPrecondition(p, alert))) { return rb; } } return null; // no match. Escalate } ``` Concept: `tool-calling` ### 3. Wire the PreToolUse hook with a destructive blocklist The hook is the deterministic safety gate. It reads tool_name + tool_input.command from stdin JSON, applies a regex blocklist (compiled once), exits 2 with a stderr message on match. 
The model sees the deny as a tool_result with is_error: true and re-plans. No prompt-injection bypass; the blocklist is in code, not in the prompt. **Python:** ```python # .claude/hooks/ops_blocklist.py import sys, json, re BLOCKLIST = re.compile( r"(?:" r"\brm\s+-rf\b" # destructive recursive delete r"|\bsudo\s" # privilege escalation r"|\bdrop\s+(?:database|table)\b" r"|\bkill\s+-9\b" # uncatchable termination r"|\bchmod\s+777\b" # world-writable r"|>\s*/(?:etc|usr|var)/" # redirect to system paths r"|\bcurl\s+\S+\s*\|\s*sh\b" # remote shell exec r")", re.IGNORECASE, ) ALLOWLIST_BINS = {"kubectl", "docker", "journalctl", "curl", "jq", "ps", "df", "top"} def main(): payload = json.loads(sys.stdin.read()) if payload["tool_name"] != "Bash": sys.exit(0) cmd = payload["tool_input"].get("command", "") # Hard blocklist if BLOCKLIST.search(cmd): print( f"BLOCKED: command matches destructive pattern. " f"Escalate via the structured escalation block. command={cmd!r}", file=sys.stderr, ) sys.exit(2) # DENY # Soft allowlist on the binary (first token) first = cmd.strip().split()[0] if cmd.strip() else "" if first and first not in ALLOWLIST_BINS: print( f"BLOCKED: binary {first!r} not on the ops allowlist. " f"Allowed: {sorted(ALLOWLIST_BINS)}", file=sys.stderr, ) sys.exit(2) sys.exit(0) if __name__ == "__main__": main() ``` **TypeScript:** ```typescript // .claude/hooks/ops-blocklist.ts import { readFileSync } from "node:fs"; const BLOCKLIST = new RegExp( String.raw`(?:` + String.raw`\brm\s+-rf\b` + // destructive recursive delete String.raw`|\bsudo\s` + // privilege escalation String.raw`|\bdrop\s+(?:database|table)\b` + String.raw`|\bkill\s+-9\b` + // uncatchable termination String.raw`|\bchmod\s+777\b` + // world-writable String.raw`|>\s*/(?:etc|usr|var)/` + // redirect to system paths String.raw`|\bcurl\s+\S+\s*\|\s*sh\b` + // remote shell exec String.raw`)`, "i", ); const ALLOWLIST_BINS = new Set([ "kubectl", "docker", "journalctl", "curl", "jq", "ps", "df", "top", ]); const payload = JSON.parse(readFileSync(0, "utf8")); if (payload.tool_name !== "Bash") process.exit(0); const cmd = String(payload.tool_input?.command ?? ""); // Hard blocklist if (BLOCKLIST.test(cmd)) { process.stderr.write( `BLOCKED: command matches destructive pattern. ` + `Escalate via the structured escalation block. command=${JSON.stringify(cmd)}\n`, ); process.exit(2); // DENY } // Soft allowlist on the binary (first token) const first = cmd.trim().split(/\s+/)[0] ?? ""; if (first && !ALLOWLIST_BINS.has(first)) { process.stderr.write( `BLOCKED: binary ${JSON.stringify(first)} not on the ops allowlist. ` + `Allowed: ${[...ALLOWLIST_BINS].sort().join(", ")}\n`, ); process.exit(2); } process.exit(0); ``` Concept: `hooks` ### 4. Build the structured escalation block When the agent can't act safely. Unknown runbook, hook denied, ambiguous alert, explicit escalation request. The harness emits a structured escalation block to PagerDuty / Slack. Six required fields. Humans triage in ~30 seconds vs ~5 minutes reading a raw transcript. The block is the single most important UX artifact in the whole system; tune the wording to be a one-screen read.
**Python:** ```python import json, requests from typing import TypedDict, Literal class EscalationBlock(TypedDict): incident_id: str service: str severity: Literal["P1", "P2", "P3"] root_cause_signal: str partial_status: str recommended_action: str PAGERDUTY_KEY = os.environ["PAGERDUTY_INTEGRATION_KEY"] def escalate(alert: Alert, partial_actions: list[str], reason: str) -> EscalationBlock: """Emit the structured handoff block. Six fields, every field required.""" block: EscalationBlock = { "incident_id": f"PD-{alert['service']}-{int(time.time())}", "service": alert["service"], "severity": alert["severity"], "root_cause_signal": alert["root_cause_signal"], "partial_status": ( "; ".join(partial_actions) if partial_actions else "agent stopped before any action" ), "recommended_action": derive_recommendation(alert, reason), } # Post to PagerDuty requests.post( "https://events.pagerduty.com/v2/enqueue", json={ "routing_key": PAGERDUTY_KEY, "event_action": "trigger", "payload": { "summary": f"[{block['severity']}] {block['service']}: {block['root_cause_signal']}", "severity": "critical" if block["severity"] == "P1" else "warning", "source": "claude-ops-agent", "custom_details": block, }, }, ) # Also write to audit log audit_log("escalation", block) return block def derive_recommendation(alert: Alert, reason: str) -> str: """Single-sentence recommended action for the on-call human.""" if reason == "hook_denied": return f"Run the blocked command manually after verifying it is safe; agent will not retry." if reason == "unknown_runbook": return f"No runbook matched. Investigate root cause; consider adding a runbook if pattern recurs." if alert["bucket"] == "Permission": return f"Rotate or grant the missing credential; agent cannot self-recover from Permission errors." return f"Investigate the alert and apply the appropriate remediation manually." ``` **TypeScript:** ```typescript interface EscalationBlock { incident_id: string; service: string; severity: "P1" | "P2" | "P3"; root_cause_signal: string; partial_status: string; recommended_action: string; } const PAGERDUTY_KEY = process.env.PAGERDUTY_INTEGRATION_KEY!; export async function escalate( alert: Alert, partialActions: string[], reason: string, ): Promise { // Emit the structured handoff block. Six fields, every field required. const block: EscalationBlock = { incident_id: `PD-${alert.service}-${Math.floor(Date.now() / 1000)}`, service: alert.service, severity: alert.severity, root_cause_signal: alert.root_cause_signal, partial_status: partialActions.length > 0 ? partialActions.join("; ") : "agent stopped before any action", recommended_action: deriveRecommendation(alert, reason), }; // Post to PagerDuty await fetch("https://events.pagerduty.com/v2/enqueue", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ routing_key: PAGERDUTY_KEY, event_action: "trigger", payload: { summary: `[${block.severity}] ${block.service}: ${block.root_cause_signal}`, severity: block.severity === "P1" ? "critical" : "warning", source: "claude-ops-agent", custom_details: block, }, }), }); await auditLog("escalation", block); return block; } function deriveRecommendation(alert: Alert, reason: string): string { if (reason === "hook_denied") { return "Run the blocked command manually after verifying it is safe; agent will not retry."; } if (reason === "unknown_runbook") { return "No runbook matched. 
Investigate root cause; consider adding a runbook if pattern recurs."; } if (alert.bucket === Bucket.Permission) { return "Rotate or grant the missing credential; agent cannot self-recover from Permission errors."; } return "Investigate the alert and apply the appropriate remediation manually."; } ``` Concept: `escalation` ### 5. Wire the PostToolUse audit log PostToolUse fires AFTER every tool call (Bash, runbook, escalate, log query, hook decisions). Writes a canonical row: ts, tool_name, tool_input, tool_result, latency_ms, hook_decisions, incident_id. Append-only JSONL on durable storage; retain ≥ 90 days. Indexed by incident_id, service, tool_name. The post-mortem replay tool depends on it; without it, trust collapses on the first incident. **Python:** ```python # .claude/hooks/ops_audit.py. PostToolUse import sys, json, datetime from pathlib import Path AUDIT_DIR = Path("audit") AUDIT_DIR.mkdir(exist_ok=True) def main(): payload = json.loads(sys.stdin.read()) today = datetime.date.today().isoformat() row = { "ts": datetime.datetime.utcnow().isoformat() + "Z", "tool_name": payload["tool_name"], "tool_input": payload.get("tool_input", {}), "tool_result": payload.get("tool_result", {}), "latency_ms": payload.get("latency_ms"), "hook_decisions": payload.get("hook_decisions", []), "incident_id": payload.get("incident_id") or "no-incident", "stop_reason_context": payload.get("stop_reason"), } with open(AUDIT_DIR / f"{today}.jsonl", "a") as f: f.write(json.dumps(row) + "\n") sys.exit(0) if __name__ == "__main__": main() # Replay tool. Used in post-mortems def replay(incident_id: str) -> list[dict]: """Reconstruct what the agent did on a given incident.""" rows = [] for path in sorted(AUDIT_DIR.glob("*.jsonl")): for line in path.read_text().splitlines(): row = json.loads(line) if row["incident_id"] == incident_id: rows.append(row) return sorted(rows, key=lambda r: r["ts"]) ``` **TypeScript:** ```typescript // .claude/hooks/ops-audit.ts. PostToolUse import { readFileSync, appendFileSync, mkdirSync, readdirSync } from "node:fs"; import { join } from "node:path"; const AUDIT_DIR = "audit"; mkdirSync(AUDIT_DIR, { recursive: true }); const payload = JSON.parse(readFileSync(0, "utf8")); const today = new Date().toISOString().slice(0, 10); const row = { ts: new Date().toISOString(), tool_name: payload.tool_name, tool_input: payload.tool_input ?? {}, tool_result: payload.tool_result ?? {}, latency_ms: payload.latency_ms, hook_decisions: payload.hook_decisions ?? [], incident_id: payload.incident_id ?? "no-incident", stop_reason_context: payload.stop_reason, }; appendFileSync(join(AUDIT_DIR, `${today}.jsonl`), JSON.stringify(row) + "\n"); process.exit(0); // Replay tool. Used in post-mortems export function replay(incidentId: string) { // Reconstruct what the agent did on a given incident. const rows: Array> = []; for (const fname of readdirSync(AUDIT_DIR).sort()) { if (!fname.endsWith(".jsonl")) continue; const lines = readFileSync(join(AUDIT_DIR, fname), "utf8") .split("\n") .filter(Boolean); for (const line of lines) { const r = JSON.parse(line); if (r.incident_id === incidentId) rows.push(r); } } return rows.sort((a, b) => String(a.ts).localeCompare(String(b.ts))); } ``` Concept: `evaluation` ### 6. Run the agent loop with stop_reason branching The on-call agent loop is a strict stop_reason FSM. end_turn: success, log + close incident. tool_use: extract the tool call, run hooks, execute, append result, continue. 
tool_use → hook denied: append the deny as a tool_result with is_error: true, continue (the agent re-plans, usually by escalating). max_tokens: persist partial state, escalate. The loop NEVER branches on response text. The structured stop_reason is the only contract. **Python:** ```python OPS_TOOLS = [BASH_TOOL, RUNBOOK_TOOL, ESCALATE_TOOL, GET_LOGS_TOOL] # 4 tools def run_ops_agent(alert: Alert, max_iter: int = 10) -> dict: """On-call agent loop. Strict stop_reason branching.""" case_facts = { "incident_id": alert.get("incident_id") or alert["raw"].get("incident_id"), "service": alert["service"], "bucket": alert["bucket"], "severity": alert["severity"], } messages = [{"role": "user", "content": json.dumps(alert)}] partial_actions = [] for iteration in range(max_iter): resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=2048, system=build_ops_system_prompt(case_facts), tools=OPS_TOOLS, messages=messages, ) if resp.stop_reason == "end_turn": audit_log("incident_resolved", {"case_facts": case_facts, "partial_actions": partial_actions}) return {"status": "resolved", "actions": partial_actions} if resp.stop_reason == "tool_use": tool_uses = [b for b in resp.content if b.type == "tool_use"] results = [] for tu in tool_uses: # PreToolUse hook runs here in the SDK; if it exits 2, # the SDK emits a tool_result with is_error: true automatically. result = execute_tool_with_hooks(tu, case_facts) results.append(result) if not result.get("is_error"): partial_actions.append(f"{tu.name}: {tu.input}") messages.append({"role": "assistant", "content": resp.content}) messages.append({"role": "user", "content": results}) continue if resp.stop_reason == "max_tokens": # Persist + escalate; harness will resume in a fresh session block = escalate(alert, partial_actions, reason="max_tokens_exhaustion") return {"status": "escalated", "block": block} # Iteration cap. Escalate block = escalate(alert, partial_actions, reason="iteration_cap") return {"status": "escalated", "block": block} ``` **TypeScript:** ```typescript const OPS_TOOLS = [BASH_TOOL, RUNBOOK_TOOL, ESCALATE_TOOL, GET_LOGS_TOOL]; // 4 tools export async function runOpsAgent(alert: Alert, maxIter = 10) { // On-call agent loop. Strict stop_reason branching. const caseFacts = { incident_id: (alert as { incident_id?: string }).incident_id ?? (alert.raw.incident_id as string | undefined), service: alert.service, bucket: alert.bucket, severity: alert.severity, }; const messages: Anthropic.MessageParam[] = [ { role: "user", content: JSON.stringify(alert) }, ]; const partialActions: string[] = []; for (let i = 0; i < maxIter; i++) { const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 2048, system: buildOpsSystemPrompt(caseFacts), tools: OPS_TOOLS, messages, }); if (resp.stop_reason === "end_turn") { await auditLog("incident_resolved", { case_facts: caseFacts, partial_actions: partialActions }); return { status: "resolved" as const, actions: partialActions }; } if (resp.stop_reason === "tool_use") { const toolUses = resp.content.filter((b) => b.type === "tool_use"); const results = await Promise.all( toolUses.map(async (tu) => { if (tu.type !== "tool_use") return null; // PreToolUse hook runs here in the SDK; on exit 2 the SDK emits // a tool_result with is_error: true automatically. 
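            // executeToolWithHooks is an assumed harness helper (not defined in this scenario):
            // it runs the PreToolUse hook, executes the runbook/Bash call on allow, and wraps
            // the outcome as a tool_result block for tu.id (is_error: true on deny or failure).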
const result = await executeToolWithHooks(tu, caseFacts); if (!(result as { is_error?: boolean }).is_error) { partialActions.push(`${tu.name}: ${JSON.stringify(tu.input)}`); } return result; }), ); messages.push({ role: "assistant", content: resp.content }); messages.push({ role: "user", content: results.filter(Boolean) as Anthropic.MessageParam["content"] }); continue; } if (resp.stop_reason === "max_tokens") { const block = await escalate(alert, partialActions, "max_tokens_exhaustion"); return { status: "escalated" as const, block }; } } const block = await escalate(alert, partialActions, "iteration_cap"); return { status: "escalated" as const, block }; } ``` Concept: `agentic-loops` ### 7. Pin incident metadata in CASE_FACTS Multi-turn ops conversations need the incident metadata anchored. Pin incident_id, service, severity, bucket, partial_status, runbook_in_progress at the top of every system prompt iteration. This survives summarization and ensures the agent on turn 8 still knows it's working on PD-4711, not a generic incident. Without case-facts pinning, multi-turn ops conversations regress to single-turn quality. **Python:** ```python def build_ops_system_prompt(case_facts: dict) -> str: return f"""You are a 3-AM on-call ops agent. Composure under pressure. CASE_FACTS (immutable; re-read every turn; values are EXACT): - incident_id: {case_facts['incident_id']} - service: {case_facts['service']} - severity: {case_facts['severity']} - bucket: {case_facts['bucket']} - runbook_in_progress: {case_facts.get('runbook', 'none')} - partial_status: {case_facts.get('partial_status', 'no actions yet')} Constraints: - ESCALATION TRIGGERS: policy gap | access failure (Permission bucket) | explicit user request | unknown runbook | hook denial. - Sentiment is logged, NEVER gates routing. An angry alert with a clean runbook gets the runbook. - Branch on stop_reason. Never on response text. - Cite the runbook name when you call it; cite the incident_id in every tool call. - For destructive Bash, expect the PreToolUse hook to deny. Re-plan to escalate, don't retry. RUNBOOKS AVAILABLE (5 total): {', '.join(rb['name'] for rb in RUNBOOKS)}. ESCALATE if no runbook matches; escalate with the structured block, not a transcript.""" ``` **TypeScript:** ```typescript function buildOpsSystemPrompt(caseFacts: Record): string { return `You are a 3-AM on-call ops agent. Composure under pressure. CASE_FACTS (immutable; re-read every turn; values are EXACT): - incident_id: ${caseFacts.incident_id} - service: ${caseFacts.service} - severity: ${caseFacts.severity} - bucket: ${caseFacts.bucket} - runbook_in_progress: ${caseFacts.runbook ?? "none"} - partial_status: ${caseFacts.partial_status ?? "no actions yet"} Constraints: - ESCALATION TRIGGERS: policy gap | access failure (Permission bucket) | explicit user request | unknown runbook | hook denial. - Sentiment is logged, NEVER gates routing. An angry alert with a clean runbook gets the runbook. - Branch on stop_reason. Never on response text. - Cite the runbook name when you call it; cite the incident_id in every tool call. - For destructive Bash, expect the PreToolUse hook to deny. Re-plan to escalate, don't retry. RUNBOOKS AVAILABLE (5 total): ${RUNBOOKS.map((rb) => rb.name).join(", ")}. ESCALATE if no runbook matches; escalate with the structured block, not a transcript.`; } ``` Concept: `case-facts-block` ### 8. 
Test incident handling with adversarial scenarios Build an eval set of 50 incidents across the 4 buckets + a 'sentiment trap' subset (alerts with angry phrasing but a clean runbook should run the runbook, not escalate). Run the agent over the set; measure: bucket-classification accuracy, correct-runbook rate, false-escalation rate, hook-deny-then-recover rate. Re-run on every change to the runbook registry, hooks, or system prompt. **Python:** ```python EVAL_INCIDENTS = [ { "name": "transient_pod_crash", "alert": {"service": "checkout", "message": "503 timeout on health check", "severity": "P2"}, "expected_bucket": "Transient", "expected_action": "runbook:restart_service", "expected_outcome": "resolved", }, { "name": "permission_denied", "alert": {"service": "billing", "message": "403 forbidden on /payments", "severity": "P1"}, "expected_bucket": "Permission", "expected_action": "escalate", # never auto-resolve permission errors "expected_outcome": "escalated", }, { "name": "sentiment_trap", "alert": {"service": "checkout", "message": "URGENT!! THIS IS BROKEN!! 503!!", "severity": "P2"}, "expected_bucket": "Transient", "expected_action": "runbook:restart_service", # ignore the angry phrasing "expected_outcome": "resolved", }, { "name": "destructive_attempt", "alert": {"service": "platform", "message": "stuck pod; clear with rm -rf", "severity": "P3"}, "expected_bucket": "Transient", "expected_action": "runbook:restart_service", # NOT rm -rf. Hook denies "expected_outcome": "resolved", }, # ... 46 more ] def run_eval() -> dict: correct_bucket = correct_action = 0 false_escalations = [] for case in EVAL_INCIDENTS: alert = classify_alert(case["alert"]) if alert["bucket"] == case["expected_bucket"]: correct_bucket += 1 result = run_ops_agent(alert) # Simplification: check action taken matches expectation if matches(result, case["expected_action"]): correct_action += 1 elif "escalate" in str(result) and case["expected_outcome"] == "resolved": false_escalations.append(case["name"]) return { "bucket_accuracy": correct_bucket / len(EVAL_INCIDENTS), "action_accuracy": correct_action / len(EVAL_INCIDENTS), "false_escalations": false_escalations, } ``` **TypeScript:** ```typescript const EVAL_INCIDENTS = [ { name: "transient_pod_crash", alert: { service: "checkout", message: "503 timeout on health check", severity: "P2" }, expected_bucket: "Transient", expected_action: "runbook:restart_service", expected_outcome: "resolved", }, { name: "permission_denied", alert: { service: "billing", message: "403 forbidden on /payments", severity: "P1" }, expected_bucket: "Permission", expected_action: "escalate", // never auto-resolve permission errors expected_outcome: "escalated", }, { name: "sentiment_trap", alert: { service: "checkout", message: "URGENT!! THIS IS BROKEN!! 503!!", severity: "P2" }, expected_bucket: "Transient", expected_action: "runbook:restart_service", // ignore the angry phrasing expected_outcome: "resolved", }, { name: "destructive_attempt", alert: { service: "platform", message: "stuck pod; clear with rm -rf", severity: "P3" }, expected_bucket: "Transient", expected_action: "runbook:restart_service", // NOT rm -rf. Hook denies expected_outcome: "resolved", }, // ... 
46 more ]; export async function runEval() { let correctBucket = 0; let correctAction = 0; const falseEscalations: string[] = []; for (const c of EVAL_INCIDENTS) { const alert = await classifyAlert(c.alert as Record); if (alert.bucket === c.expected_bucket) correctBucket++; const result = await runOpsAgent(alert); if (matches(result, c.expected_action)) { correctAction++; } else if ( JSON.stringify(result).includes("escalate") && c.expected_outcome === "resolved" ) { falseEscalations.push(c.name); } } return { bucket_accuracy: correctBucket / EVAL_INCIDENTS.length, action_accuracy: correctAction / EVAL_INCIDENTS.length, false_escalations: falseEscalations, }; } ``` Concept: `evaluation` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Destructive command in a runbook step | PreToolUse hook denies via blocklist regex; agent escalates | Trust the system prompt to instruct the model never to run rm -rf | Prompt-only enforcement leaks under prompt injection or clever phrasing. The hook is a deterministic exit-2; the destructive command literally cannot execute. This is the only credible architecture for ops automation. | | Alert classification ambiguous | Haiku-classify into one of the 4 buckets + retryable; trust the bucket | Pass the raw alert to the agent and ask it to figure out the bucket | Bucket classification is a focused, cheap task. Haiku does it accurately for ~$0.0001 per alert. Letting the main agent figure it out wastes context and produces inconsistent buckets across the same incident pattern. | | Runbook registry size | 3-5 safe, reversible runbooks; rare incidents always escalate | Add a 6th, 7th, 8th runbook for edge cases as they appear | Past 5 runbooks, routing accuracy drops and blast radius grows. The cost of ONE wrong runbook firing on a 'similar but different' incident exceeds the cost of escalating that edge case to a human. Stay small on purpose. | | Agent can't resolve. Escalate | Structured 6-field escalation block to PagerDuty | Forward the full transcript to the on-call human | Humans triage a structured block in ~30 seconds (incident_id, service, severity, root_cause, partial_status, recommended_action. One-screen read). They take 5+ minutes parsing a raw transcript at 3 AM. The block is the single most-impactful UX artifact in the system. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-OP-01 · Runbook execution without hook enforcement | Agent calls Bash with rm -rf /tmp/stale_pods. Typo'd as rm -rf / in production. The command runs because the only guard was a system-prompt warning. Service goes down; recovery takes hours. | PreToolUse hook with a blocklist regex (rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh). Exit 2 with stderr message on match. The agent observes the deny, escalates instead of retrying. Deterministic, no bypass. | | AP-OP-02 · Prompt-only constraints | System prompt says 'never run dangerous commands'. Production logs show the agent sometimes runs them anyway (3-5%) because the alert text was very persuasive or used unusual phrasing. | Move the constraint to a deterministic PreToolUse hook. Exit 2 on the blocklist match. The system prompt becomes a soft suggestion that complements the hard hook, not a substitute for it. | | AP-OP-03 · Ambiguous incident classification | Agent reads raw alert text and infers severity / type. Same alert pattern produces different classifications on different days; routing drifts; on-call humans get inconsistent escalations. 
| Project every alert into the 4-bucket isError contract at the webhook receiver (rule-based + Haiku fallback). The agent reads bucket and retryable; never parses raw alert text. Classification is consistent and auditable. | | AP-OP-04 · Silent escalation, no audit trail | Agent escalates ambiguous incidents but writes nothing to a durable log. Post-mortem at 9 AM can't reconstruct what the agent saw, what it tried, what it skipped. Trust collapses on the first incident. | PostToolUse audit hook on every tool call: append to audit/{date}.jsonl with ts, tool_name, tool_input, tool_result, latency_ms, hook_decisions, incident_id. Retain ≥ 90 days. Replay tool reconstructs any incident in seconds. | | AP-OP-05 · Access-failure as empty-result | Monitoring API returns 403 forbidden; agent treats empty response as 'no alerts to handle' and goes back to sleep. Real incident festers undetected for hours; eventually a human finds the auth-token expired. | Structured isError bucket (Permission). The monitoring tool wrapper returns {is_error: true, bucket: 'Permission', retryable: false, detail: 'auth token expired'}. Agent reads bucket: Permission and immediately escalates. Permission errors NEVER look like empty results. | ## Implementation checklist - [ ] Webhook receiver projects alerts into 4-bucket isError contract (`structured-outputs`) - [ ] Runbook registry capped at 3-5 entries; PR-reviewed; deployed via CI (`tool-calling`) - [ ] PreToolUse hook with destructive blocklist regex; allowlist of safe binaries (`hooks`) - [ ] PostToolUse audit log writes every tool call to durable storage (`evaluation`) - [ ] Structured 6-field escalation block (incident_id, service, severity, root_cause_signal, partial_status, recommended_action) (`escalation`) - [ ] Sentiment is logged but NEVER gates escalation; sentiment-trap eval passes (`system-prompts`) - [ ] CASE_FACTS pins incident metadata across multi-turn ops conversations (`case-facts-block`) - [ ] Agent loop branches strictly on stop_reason, never on response text (`agentic-loops`) - [ ] 50-incident eval set covers all 4 buckets + sentiment-trap + destructive-attempt; gated in CI - [ ] Audit log retention ≥ 90 days; replay tool reconstructs any incident - [ ] PagerDuty integration delivers structured block with custom_details intact ## Cost & latency - **Per-incident cost (typical, with cache):** ~$0.005-0.015, Haiku alert classification (~$0.0001) + Sonnet 4.5 ops loop ~3-5 turns × ~1500 cached input + ~300 output tokens. Total per resolved incident: ~$0.01. PagerDuty integration adds no token cost. - **Hook overhead:** ~5-15ms per tool call; ~0% token cost, PreToolUse + PostToolUse run as subprocesses reading stdin JSON. No LLM call. Pure local Python/TS. Latency below the noise floor of typical Bash / kubectl execution. - **Audit log storage:** ~5-10 MB/month at 1000 incidents/month, Each row ~5 KB; 1000 incidents × ~5 tool calls each = 5000 rows/month ≈ 25 MB. At $0.023/GB/month object storage, <$0.001/month. Negligible. - **Eval set run (50 incidents, weekly):** ~$0.30/run, 50 × $0.006 average per incident. Weekly = $1.20/month. Insurance against runbook registry / hook regression; gates deploys. - **False-escalation cost (avoided):** ~$30-100 per false-escalation prevented, A false escalation wakes an on-call human at 3 AM (~30 min wasted at $60/hr fully-loaded ≈ $30) or worse, trains them to ignore alerts (compounding cost). The eval set + sentiment-trap test pays for itself many times over by preventing this. 
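The hook-overhead bullet above describes both hooks as local subprocesses that read the event JSON from stdin. A minimal sketch of the PostToolUse audit hook required by AP-OP-04 and the checklist, under that assumption; the exact stdin field names follow the audit-row spec above and should be treated as assumptions, not a documented contract:

```python
#!/usr/bin/env python3
"""PostToolUse audit hook (sketch): append one JSONL row per tool call."""
import datetime
import json
import pathlib
import sys


def main() -> int:
    # Assumed stdin shape: the event JSON carries tool_name, tool_input,
    # tool_result, latency_ms and any prior hook_decisions for this call.
    event = json.load(sys.stdin)
    tool_input = event.get("tool_input") or {}
    row = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "tool_name": event.get("tool_name"),
        "tool_input": tool_input,
        "tool_result": event.get("tool_result"),
        "latency_ms": event.get("latency_ms"),
        "hook_decisions": event.get("hook_decisions", []),
        "incident_id": tool_input.get("incident_id"),
    }
    log_dir = pathlib.Path("audit")
    log_dir.mkdir(exist_ok=True)
    log_path = log_dir / f"{datetime.date.today().isoformat()}.jsonl"
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, default=str) + "\n")
    return 0  # the audit hook observes; it never blocks the tool call


if __name__ == "__main__":
    sys.exit(main())
```

Because the hook appends and always exits 0, it adds rows without ever changing routing; the replay tool only has to read the day's JSONL file back in order.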
## Domain weights - **D1 · Agentic Architectures (27%):** Ops agent loop · stop_reason FSM · runbook registry · hook composition - **D5 · Context + Reliability (15%):** CASE_FACTS pinning · structured escalation block · audit log replay ## Practice questions ### Q1. An ops agent processes alerts. In ~8% of cases it calls restart_pod when the alert was actually a 403 permission-denied error from monitoring. What architectural change distinguishes access-failure from actual pod crash? Project every alert into the 4-bucket isError contract at the webhook receiver: {bucket: 'Transient'|'Permission'|'Data'|'Business', retryable, ...}. The agent reads bucket and retryable. It never parses raw alert text. Permission → escalate (won't fix itself); Transient → retry / runbook; Data → surface to user; Business → block + log + escalate. Classification is consistent across the same pattern; the agent's routing logic becomes reliable. Tagged to AP-OP-05. ### Q2. Your runbook enforcement uses a system-prompt warning: 'never run destructive commands like rm -rf'. In production, ~3-5% of destructive commands still slip through despite the warning. The agent keeps requesting rm -rf after a cleverly phrased alert. Architectural response? Move the constraint to a PreToolUse hook with a blocklist regex. The hook reads tool_input.command, matches it against a compiled regex (rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh), and exits 2 on match. The agent observes the deny as tool_result: {is_error: true} and re-plans (typically by escalating). Hooks are deterministic; prompts are probabilistic. This is the only credible architecture for ops automation. Tagged to AP-OP-01. ### Q3. An incident-response agent escalates ambiguous cases. Today's baseline: escalation triggers on negative-sentiment phrasing OR low confidence. False-escalation rate is ~50%. What deterministic criteria should replace those heuristics? Structural triggers only: (a) policy gap, (b) access failure (Permission bucket), (c) explicit user request, (d) unknown runbook (no match in the registry), (e) hook denial. Sentiment is logged for analytics but NEVER gates routing. An angry alert with a clean runbook gets the runbook; a calm-but-broken alert escalates if it hits one of (a)-(e). The eval set's 'sentiment-trap' subset specifically tests this and gates deploys at ≥ 95% pass rate. ### Q4. You need to audit every ops action for compliance. Should audit logging live in a PostToolUse hook or in a prompt instruction? PostToolUse hook. Tool-based logging via the agent is *easily skipped* (the model can simply not emit the audit-log tool call); hooks are *automatic* and run on every tool execution by the SDK. The PostToolUse hook writes a canonical row (ts, tool_name, tool_input, tool_result, latency_ms, hook_decisions, incident_id) to append-only JSONL. Replay tool reconstructs any incident in seconds. Without this, post-mortems can't reconstruct what the agent did at 3 AM and trust evaporates. Tagged to AP-OP-04. ### Q5. Runbook spec: 'Restart payment service if latency > 5s for >2min'. This combines monitoring signal parsing + action criteria. Where should this logic live: agent prompt, hook, or specialized tool? Specialized tool with the threshold logic in code, not in the prompt. The runbook's preconditions field encodes the structural condition (latency_p99 > 5s AND duration > 120s); the agent's job is to match the alert to the runbook, not to evaluate the threshold itself. Putting the threshold in the prompt makes it probabilistic (the model might decide 4.8s is 'close enough'); putting it in the hook is too narrow (hooks gate, they don't compute); a typed runbook with code-evaluated preconditions is the right composition.
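A minimal sketch of the composition Q5 describes: the thresholds live in typed, code-evaluated preconditions and the agent only matches, with match_runbook returning the first full match or None (meaning escalate). Metric field names such as latency_p99_s and duration_s are illustrative assumptions:

```python
# Each registry entry pairs a runbook name with code-evaluated preconditions.
# Metric field names (latency_p99_s, duration_s) are illustrative assumptions.
RUNBOOKS: list[dict] = [
    {
        "name": "restart_payment_service",
        "preconditions": [
            lambda a: a.get("service") == "payments",
            lambda a: a.get("latency_p99_s", 0) > 5,   # latency_p99 > 5s
            lambda a: a.get("duration_s", 0) > 120,    # sustained for > 2 min
        ],
    },
    # ... remaining entries; the registry stays capped at 3-5 runbooks
]


def match_runbook(alert: dict) -> dict | None:
    """Return the first runbook whose preconditions all hold; None means escalate."""
    for rb in RUNBOOKS:
        if all(check(alert) for check in rb["preconditions"]):
            return rb
    return None
```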
## FAQ ### Q1. How do I distinguish a permission-denied error from an actual service failure? Both bucket projection AND structured isError on the monitoring tool. The webhook receiver classifies the alert into one of 4 buckets; the monitoring tool wrapper that fetches additional context returns {is_error: true, bucket: 'Permission', retryable: false, detail: '...'} instead of an empty result. The agent reads bucket and retryable and routes. Permission → escalate, Transient → retry. Don't parse error text. ### Q2. Can the hook block dangerous commands? Yes. That's its whole job. PreToolUse hook with a blocklist regex on Bash. Exit 2 with a stderr message on match; the agent sees that message as tool_result: {is_error: true, content: <the stderr text>} and re-plans (a minimal hook sketch follows this FAQ). Tested against rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh. Pair with an allowlist of safe binaries (kubectl, docker, journalctl, etc.) so unknown binaries also deny. ### Q3. Should audit logging be a tool call or a hook? A PostToolUse hook. Tool-based logging via the agent is easily skipped (the model decides not to call it on a busy turn); hooks run automatically by the SDK on every tool execution. The hook writes a canonical row to append-only JSONL; the replay tool reconstructs any incident later. This is the difference between 'we have logs sometimes' and 'we have logs always'. ### Q4. What triggers an escalation? Five structural triggers, in priority order: (1) explicit user request, (2) hook denial of a needed action, (3) unknown runbook (no match in the registry), (4) Permission bucket (access failure won't self-recover), (5) max_tokens or iteration cap. Sentiment is logged but NEVER triggers escalation. The 'sentiment trap' eval subset specifically tests that an angry-but-clean alert gets the runbook, not an escalation. ### Q5. How does the agent know which runbook to execute? Runbook registry has explicit preconditions; the agent's match_runbook(alert) function returns the first runbook whose preconditions all match (see the match_runbook sketch above). No match → escalate (don't bend a runbook to fit an edge case). The match logic is deterministic (rule evaluation), not LLM-judged. Adding a runbook means adding a precondition definition + the steps + the rollback; it's a code change, PR-reviewed, deployed through CI. ### Q6. What if the operator approves an escalation later? Round-trip via the escalation block + a separate approval flow. The structured block hits PagerDuty / Slack; the operator clicks approve in their tool; an approval webhook fires; the harness picks up the approval, loads the case-facts checkpoint, and resumes the agent loop with the approved action injected. The agent never bypasses the escalation. Humans authorize the next step explicitly. ### Q7. Can the agent execute arbitrary bash? No. Two layers: (1) PreToolUse hook with a destructive blocklist regex denies known-dangerous patterns; (2) allowlist of safe binaries (kubectl, docker, journalctl, curl, jq, etc.). Anything else is denied even if not on the blocklist. Both layers are deterministic. The agent's Bash access is narrow enough that a prompt-injection attack can't escalate to repo-wide damage.
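A minimal sketch of the PreToolUse gate described in Q2 and Q7, assuming the hook runs as a subprocess that receives {tool_name, tool_input} as JSON on stdin and signals a deny by exiting 2 with the reason on stderr; the allowlist membership shown is illustrative:

```python
#!/usr/bin/env python3
"""PreToolUse hook for Bash (sketch): destructive blocklist + safe-binary allowlist."""
import json
import re
import shlex
import sys

BLOCKLIST = re.compile(
    r"(rm\s+-rf|\bsudo\b|drop\s+database|kill\s+-9|chmod\s+777|curl\s+.*\|\s*sh)",
    re.IGNORECASE,
)
# Illustrative allowlist; the scenario names kubectl, docker, journalctl, curl, jq.
SAFE_BINARIES = {"kubectl", "docker", "journalctl", "curl", "jq", "ps", "df", "top"}


def main() -> int:
    # Assumed stdin shape: {"tool_name": ..., "tool_input": {"command": ...}}
    event = json.load(sys.stdin)
    if event.get("tool_name") != "Bash":
        return 0  # only Bash is gated here
    command = (event.get("tool_input") or {}).get("command", "")
    if BLOCKLIST.search(command):
        print(f"denied: destructive pattern in {command!r}; escalate instead of retrying", file=sys.stderr)
        return 2  # exit 2 = deny; the agent observes is_error: true
    try:
        binary = shlex.split(command)[0] if command.strip() else ""
    except ValueError:
        print("denied: command could not be parsed", file=sys.stderr)
        return 2
    if binary not in SAFE_BINARIES:
        print(f"denied: {binary!r} is not on the safe-binary allowlist", file=sys.stderr)
        return 2
    return 0  # allow


if __name__ == "__main__":
    sys.exit(main())
```

The blocklist check runs before the allowlist check, so an allowlisted binary piped into sh is still denied.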
## Production readiness - [ ] Alert webhook receiver enforces 4-bucket schema; rejects malformed payloads - [ ] Runbook registry size capped at ≤ 5 in CI lint; PRs to add a 6th require explicit override - [ ] PreToolUse blocklist regex tested against an adversarial 'destructive-attempt' eval subset - [ ] PostToolUse audit hook writes every tool call; retention ≥ 90 days - [ ] Structured escalation block schema enforced; PagerDuty integration delivers custom_details intact - [ ] Sentiment-trap eval subset gates deploys at ≥ 95% correct-action rate - [ ] Replay tool reconstructs any incident from audit logs in <5 seconds - [ ] On-call human runbook exists for the case 'agent escalates and no human is on PagerDuty' (escalation-of-escalation) --- **Source:** https://claudearchitectcertification.com/scenarios/claude-for-operations **Vault sources:** ACP-T05 §Scenario 9 (🟡 beyond-guide; OP-claimed Reddit 1s34iyl); ACP-T08 §3.9 metadata; ACP-T07 §Lab 9 spec (incident-response agent + structured escalation); ACP-T06 (5 practice Qs tagged to components); Concepts: hooks · escalation · case-facts-block; GAI-K05 CCA exam questions and scenarios **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Agent Skills for Developer Tooling > A CLI-first Skills architecture for developer tooling. Each Skill is a markdown file with frontmatter (name, version, parameters, allowed-tools); the CLI invokes them via claude skills invoke --param key=value; risky exploration runs inside context: fork so the working tree is untouched; the allowed-tools whitelist denies Edit and Bash by default and grants only what the Skill needs (Read, Grep, Glob); parameterization (directory, language, target_pattern) makes Skills reusable across repos; Git semver tagging (skill-name@1.2.3) pins versions so a v2 release does not silently break v1 callers; IDE extensions are thin wrappers over the CLI, not a parallel implementation. The most-tested distractor: building Skills as a parallel IDE-only feature instead of a CLI-first primitive. **Sub-marker:** P3.12 **Domains:** D3 · Agent Operations, D2 · Tool Design + Integration **Exam weight:** 38% of CCA-F (D3 + D2) **Build time:** 22 minutes **Source:** 🟡 Beyond-guide scenario. OP-claimed (Reddit 1s34iyl). Architecture matches Anthropic public guidance. **Canonical:** https://claudearchitectcertification.com/scenarios/agent-skills-for-developer-tooling **Last reviewed:** 2026-05-04 ## In plain English Think of this as the way you give every developer on the team a small library of pre-baked Claude workflows they can invoke from the command line. A refactoring Skill explores a function in an isolated child session and proposes the change without ever touching the working tree. A test-generation Skill reads a source file and writes a matching test file. A code-generation Skill scaffolds a new component in your team's exact style. Each Skill has a tight whitelist of tools it can use, a parameter shape that makes it reusable across repos, and a Git tag that pins which version each invocation runs against. The whole point is that developer tooling Skills work like npm packages built on top of Claude rather than fragile prompt copy-paste. ## Exam impact Domain 3 (Claude Code Configuration, 20%) tests Skill frontmatter, `context: fork` semantics, allowed-tools whitelist, and the Skill-vs-Command decision tree. Domain 2 (Tool Design, 18%) tests Skill parameterization and Git semver tagging. 
Beyond-guide but architecturally well-grounded. The 'why is the IDE plugin a wrapper, not the source?' question is the canonical exam distractor. ## The problem ### What the customer needs - One source of truth for refactoring, test generation, doc generation. Not 12 copy-pasted prompts in 12 repos. - Risk-free exploration. A refactoring Skill must propose changes without touching the working tree. - Reusable across repos. A Skill written for the React team should work on the Python team's repo with parameter changes only. - Versioned upgrades. A breaking change to a Skill must NOT silently break agents on the prior version. ### Why naive approaches fail - Skills built as IDE plugins first. Other editors get a parallel implementation that drifts. The CLI never exists. - Skills with unrestricted tool access. A test-generation Skill accidentally calls Edit on a real source file. - Skills hardcoded to one codebase. A team has to re-author the Skill for every new repo. - Skills versioned by edit-in-place. A v2 frontmatter change silently breaks 12 agents. ### Definition of done - Skills live in .claude/skills/{team}/{name}.md with frontmatter (name, version, description, parameters, allowed-tools). - CLI invocation: claude skills invoke --param key=value. IDE extensions wrap the CLI; they do not bypass it. - Exploratory Skills run inside context: fork. The parent session is untouched. - allowed-tools is an explicit whitelist on every Skill. Edit and Bash are not on the list unless required. - Parameters are declared in frontmatter and validated by the CLI before invocation. - Git tags pin versions: skill-refactor@1.2.3. Callers reference the major (@1.x); registry resolves the latest patch. ## Concepts in play - 🟢 **Skills** (`skills`), Markdown plus frontmatter as the unit of reusable workflow - 🟢 **Project memory** (`claude-md-hierarchy`), Skills extend project-level CLAUDE.md across teams - 🟢 **Tool calling** (`tool-calling`), Skill invocation as a structured tool call from the CLI - 🟢 **Subagents** (`subagents`), context: fork is a lightweight subagent for one Skill invocation - 🟢 **Attention engineering** (`attention-engineering`), Frontmatter routes the LLM to the right Skill - 🟢 **Evaluation** (`evaluation`), Per-Skill regression evals catch frontmatter drift - 🟢 **Structured outputs** (`structured-outputs`), Skill parameter contract is the schema - 🟢 **Context window** (`context-window`), context: fork keeps parent context lean ## Components ### Skill Definition File, .claude/skills/{team}/{name}.md The unit of dev-tooling Skills. Markdown body holds the instructions; YAML frontmatter holds the metadata: name, version (semver), description, parameters (with types and defaults), allowed-tools (whitelist), context_mode (session or fork). Lives in version control. Reviewed via PR. **Configuration:** Path: .claude/skills/{team}/{name}.md. Required frontmatter: name, version, description, parameters, allowed-tools, context_mode. Optional: deprecated, owners, requires_human_confirm. **Concept:** `skills` ### Skill Frontmatter (Attention Engineering), metadata routes the LLM to the right Skill The frontmatter is read into the agent's system prompt at invocation; the LLM forward-pass uses it to decide whether the Skill fits the request. It is NOT a regex classifier. Good frontmatter (clear description, accurate when_to_use, well-typed parameters) lifts routing accuracy substantially. **Configuration:** name: refactor-fn. version: 1.2.3. 
description: 'Rename a function and update every call site.'. when_to_use: 'When the user asks to rename a function across the repo.'. parameters: { directory: string, old_name: string, new_name: string }. allowed-tools: [Read, Grep, Glob, Edit]. **Concept:** `attention-engineering` ### context: fork Isolation, child session runs in isolation, parent untouched When a Skill is exploratory (refactoring, test-gen, doc-gen), the CLI spawns a child session with context: fork. The child has its own conversation history and its own working tree view. Whatever the Skill explores or proposes stays in the child until the parent receives the final tool_result and decides whether to merge. Lighter than a full subagent, sufficient for one-Skill scope. **Configuration:** context_mode: fork in the Skill frontmatter. CLI spawns an isolated session per invocation. Parent receives only the tool_result payload (proposed diff, generated test file, doc string). Parent decides whether to apply. **Concept:** `subagents` ### allowed-tools Whitelist, explicit, structural, deny-by-default Every Skill declares its allowed-tools array in frontmatter. The CLI enforces the whitelist at tool_use interception: any call to a non-whitelisted tool fails with is_error: true. By default, Edit and Bash are NOT on the list. A code-gen Skill that only needs to read files lists [Read, Grep, Glob]; a refactoring Skill that needs to write changes adds Edit. Side-effect prevention is structural, not prompt-based. **Configuration:** allowed-tools: [Read, Grep, Glob]. SDK-side enforcement: tool_use calls outside this list return tool_result with is_error: true. The Skill body cannot escalate its own tool list. **Concept:** `tool-calling` ### IDE/CLI Integration Wrapper, CLI-first; IDE is a thin shell over the CLI The CLI is the canonical entry point. IDE extensions (VSCode, JetBrains, Neovim) shell out to the CLI rather than re-implementing Skill invocation logic. This means a Skill update in the registry propagates to every editor immediately. New editors get supported by writing a 200-line shell-out extension, not a full Skill engine. **Configuration:** VSCode extension binds keybinds and context-menu items to claude skills invoke --param .... The extension's only job is to translate UI events to CLI calls and stream output back to the editor. **Concept:** `claude-md-hierarchy` ## Build steps ### 1. Lay out the team-namespaced directory Create .claude/skills/{team}/{name}.md per team. The directory IS the registry's source of truth. Even on day one, namespace from the start. Retrofitting a flat layout into namespaces at 50 Skills is painful. **Python:** ```python import os TEAMS = ["frontend", "backend", "data", "shared"] for t in TEAMS: os.makedirs(f".claude/skills/{t}", exist_ok=True) with open(f".claude/skills/{t}/.gitkeep", "w") as f: pass print("namespace-by-team layout ready; commit and start authoring.") ``` **TypeScript:** ```typescript import { mkdirSync, writeFileSync } from "node:fs"; const teams = ["frontend", "backend", "data", "shared"]; for (const t of teams) { mkdirSync(`.claude/skills/${t}`, { recursive: true }); writeFileSync(`.claude/skills/${t}/.gitkeep`, ""); } console.log("namespace-by-team layout ready; commit and start authoring."); ``` Concept: `skills` ### 2. Author the Skill with full frontmatter Required keys: name, version, description, when_to_use, parameters (with types and defaults), allowed-tools (explicit whitelist), context_mode (session for state-shared, fork for isolated). Body holds the prompt. 
The CLI validates frontmatter at parse time; invalid Skills are rejected. **Python:** ```python # .claude/skills/frontend/refactor-component.md SKILL_TEMPLATE = """--- name: frontend/refactor-component version: 1.2.3 description: | Rename a React component and update every import + usage in the repo. Runs in context: fork so the working tree is untouched until you approve. when_to_use: | When the user asks to rename a React component or move it between files. parameters: old_name: type: string description: Current PascalCase component name (e.g. UserCard) required: true new_name: type: string description: Target PascalCase component name (e.g. UserProfile) required: true directory: type: string default: src/ description: Directory to scope the search (default: src/) allowed-tools: - Read - Grep - Glob - Edit context_mode: fork --- You are renaming a React component across this repository. Steps: 1. Glob {directory}/**/*.{tsx,ts,jsx,js} to find candidate files. 2. Grep for {old_name} across the matched files; collect file + line. 3. For each match, Read the file and decide whether the occurrence is the component (the import/export/JSX-tag) or a coincidental string. 4. Edit each file. Update the filename if the file is currently {old_name}.tsx. 5. Return a structured summary: files touched, occurrences changed. """ ``` **TypeScript:** ```typescript // .claude/skills/frontend/refactor-component.md const SKILL_TEMPLATE = `--- name: frontend/refactor-component version: 1.2.3 description: | Rename a React component and update every import + usage in the repo. Runs in context: fork so the working tree is untouched until you approve. when_to_use: | When the user asks to rename a React component or move it between files. parameters: old_name: type: string required: true new_name: type: string required: true directory: type: string default: src/ allowed-tools: - Read - Grep - Glob - Edit context_mode: fork --- You are renaming a React component across this repository. Steps: 1. Glob {directory}/**/*.{tsx,ts,jsx,js} to find candidate files. 2. Grep for {old_name} across the matched files. 3. Decide whether each occurrence is the component or a coincidental string. 4. Edit each file. Update the filename if the file is currently {old_name}.tsx. 5. Return a structured summary: files touched, occurrences changed. `; ``` Concept: `attention-engineering` ### 3. Spawn the Skill in context: fork When context_mode: fork is set, the CLI runs the Skill in a child session with its own conversation history, its own tool whitelist, and its own working tree view. The parent session is untouched. The child returns a tool_result with the proposed change; the parent decides whether to apply. **Python:** ```python from anthropic import Anthropic import yaml client = Anthropic() def parse_skill(path: str) -> dict: text = open(path).read() if not text.startswith("---"): raise ValueError(f"{path}: missing frontmatter") _, fm, body = text.split("---", 2) return {"frontmatter": yaml.safe_load(fm), "body": body.strip()} def invoke_skill_in_fork(skill_path: str, params: dict, user_message: str) -> dict: """Spawn a child session for the Skill. 
Parent state untouched.""" skill = parse_skill(skill_path) fm = skill["frontmatter"] rendered_body = skill["body"] for k, v in params.items(): rendered_body = rendered_body.replace("{" + k + "}", str(v)) child_response = client.messages.create( model="claude-sonnet-4.5", max_tokens=4096, system=rendered_body, tools=load_tools_from_whitelist(fm["allowed-tools"]), messages=[{"role": "user", "content": user_message}], ) return { "skill_name": fm["name"], "skill_version": fm["version"], "child_stop_reason": child_response.stop_reason, "child_content": child_response.content, } ``` **TypeScript:** ```typescript import Anthropic from "@anthropic-ai/sdk"; import { readFileSync } from "node:fs"; import { parse as parseYaml } from "yaml"; const client = new Anthropic(); function parseSkill(path: string) { const text = readFileSync(path, "utf8"); if (!text.startsWith("---")) throw new Error(`${path}: missing frontmatter`); const [, fm, body] = text.split("---", 3); return { frontmatter: parseYaml(fm) as Record, body: body.trim() }; } export async function invokeSkillInFork( skillPath: string, params: Record, userMessage: string, ) { const skill = parseSkill(skillPath); const fm = skill.frontmatter as { name: string; version: string; "allowed-tools": string[] }; let renderedBody = skill.body; for (const [k, v] of Object.entries(params)) { renderedBody = renderedBody.replaceAll(`{${k}}`, v); } const childResponse = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 4096, system: renderedBody, tools: loadToolsFromWhitelist(fm["allowed-tools"]), messages: [{ role: "user", content: userMessage }], }); return { skill_name: fm.name, skill_version: fm.version, child_stop_reason: childResponse.stop_reason, child_content: childResponse.content, }; } ``` Concept: `subagents` ### 4. Enforce allowed-tools at the SDK boundary The frontmatter declares the whitelist; the CLI enforces it. Any tool_use call that targets a non-whitelisted tool fails with is_error: true. The Skill body cannot escalate its own tool list. This is structural, not prompt-based: a clever prompt cannot trick the SDK into calling Edit on a Skill that does not whitelist Edit. 
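The loader examples below only restrict which tool definitions the child session is given; the dispatch-time half of the enforcement is sketched here, assuming a harness-side dispatcher. The execute callable and the exact tool_result wrapper are illustrative assumptions, not part of the documented CLI contract:

```python
from typing import Any, Callable


def dispatch_tool_use(tool_use: Any, whitelist: set[str], execute: Callable[[str, dict], str]) -> dict:
    """Turn a non-whitelisted tool_use into an is_error tool_result; execute the rest."""
    if tool_use.name not in whitelist:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "is_error": True,
            "content": f"tool {tool_use.name!r} is not in this Skill's allowed-tools list",
        }
    return {
        "type": "tool_result",
        "tool_use_id": tool_use.id,
        "content": execute(tool_use.name, tool_use.input),
    }
```

The two layers compose: the loader keeps unlisted tool definitions out of the child session entirely, while the dispatcher guarantees that even a hallucinated tool name comes back as is_error: true instead of executing.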
**Python:** ```python from anthropic.types import ToolParam ALL_TOOLS: dict[str, ToolParam] = { "Read": {"name": "Read", "description": "...", "input_schema": {"type": "object", "properties": {}}}, "Edit": {"name": "Edit", "description": "...", "input_schema": {"type": "object", "properties": {}}}, "Bash": {"name": "Bash", "description": "...", "input_schema": {"type": "object", "properties": {}}}, "Grep": {"name": "Grep", "description": "...", "input_schema": {"type": "object", "properties": {}}}, "Glob": {"name": "Glob", "description": "...", "input_schema": {"type": "object", "properties": {}}}, "Write": {"name": "Write", "description": "...", "input_schema": {"type": "object", "properties": {}}}, } KNOWN_TOOLS = set(ALL_TOOLS) def load_tools_from_whitelist(whitelist: list[str]) -> list[ToolParam]: unknown = set(whitelist) - KNOWN_TOOLS if unknown: raise ValueError(f"Skill frontmatter references unknown tools: {unknown}") return [ALL_TOOLS[name] for name in whitelist] # A refactor Skill whitelist TOOLS_FOR_REFACTOR = load_tools_from_whitelist(["Read", "Grep", "Glob", "Edit"]) # A test-gen Skill whitelist (no Edit, no Bash) TOOLS_FOR_TESTGEN = load_tools_from_whitelist(["Read", "Grep", "Glob"]) ``` **TypeScript:** ```typescript import type Anthropic from "@anthropic-ai/sdk"; const ALL_TOOLS: Record<string, Anthropic.Tool> = { Read: { name: "Read", description: "...", input_schema: { type: "object", properties: {} } }, Edit: { name: "Edit", description: "...", input_schema: { type: "object", properties: {} } }, Bash: { name: "Bash", description: "...", input_schema: { type: "object", properties: {} } }, Grep: { name: "Grep", description: "...", input_schema: { type: "object", properties: {} } }, Glob: { name: "Glob", description: "...", input_schema: { type: "object", properties: {} } }, Write: { name: "Write", description: "...", input_schema: { type: "object", properties: {} } }, }; const KNOWN_TOOLS = new Set(Object.keys(ALL_TOOLS)); export function loadToolsFromWhitelist(whitelist: string[]): Anthropic.Tool[] { const unknown = whitelist.filter((t) => !KNOWN_TOOLS.has(t)); if (unknown.length > 0) { throw new Error(`Skill frontmatter references unknown tools: ${unknown.join(", ")}`); } return whitelist.map((name) => ALL_TOOLS[name]); } const TOOLS_FOR_REFACTOR = loadToolsFromWhitelist(["Read", "Grep", "Glob", "Edit"]); const TOOLS_FOR_TESTGEN = loadToolsFromWhitelist(["Read", "Grep", "Glob"]); ``` Concept: `tool-calling` ### 5. Parameterize for cross-repo reuse A good Skill is generic across repos. The Skill body uses {param_name} placeholders; the CLI fills them in from --param key=value arguments at invocation time. Required vs optional parameters are declared in the frontmatter; the CLI rejects invocations that miss required params before any LLM call.
**Python:** ```python import jsonschema def validate_params(skill_fm: dict, params: dict) -> None: schema = {"type": "object", "properties": {}, "required": []} for name, spec in skill_fm.get("parameters", {}).items(): schema["properties"][name] = {"type": spec["type"]} if spec.get("required"): schema["required"].append(name) try: jsonschema.validate(instance=params, schema=schema) except jsonschema.ValidationError as e: raise ValueError(f"invalid params for skill {skill_fm['name']}: {e.message}") # CLI: claude skills invoke --param key=value def parse_cli_params(argv: list[str]) -> dict: params = {} i = 0 while i < len(argv): if argv[i] == "--param" and i + 1 < len(argv): k, _, v = argv[i + 1].partition("=") params[k] = v i += 2 else: i += 1 return params ``` **TypeScript:** ```typescript import Ajv from "ajv"; const ajv = new Ajv(); export function validateParams( skillFm: { name: string; parameters?: Record }, params: Record, ): void { const schema: Record = { type: "object", properties: {} as Record, required: [] as string[], }; for (const [name, spec] of Object.entries(skillFm.parameters ?? {})) { (schema.properties as Record)[name] = { type: spec.type }; if (spec.required) (schema.required as string[]).push(name); } const validate = ajv.compile(schema); if (!validate(params)) { throw new Error(`invalid params for skill ${skillFm.name}: ${ajv.errorsText(validate.errors)}`); } } export function parseCliParams(argv: string[]): Record { const params: Record = {}; for (let i = 0; i < argv.length; i++) { if (argv[i] === "--param" && i + 1 < argv.length) { const [k, ...rest] = argv[i + 1].split("="); params[k] = rest.join("="); i++; } } return params; } ``` Concept: `structured-outputs` ### 6. Build the IDE wrapper as a thin shell over the CLI VSCode (or JetBrains, or Neovim) extension is the smallest possible shell over the CLI. It registers commands and keybinds, captures the developer's selection, builds a claude skills invoke shell command, runs it, and streams the output back into the editor. **Python:** ```python # VSCode extension (TypeScript) - pseudo-Python summary of what it does def vscode_command_refactor_component(editor_state): """The user invoked the 'Claude: Refactor This Component' palette item.""" selection = editor_state.get_selected_text() workspace_root = editor_state.get_workspace_root() cli_args = [ "claude", "skills", "invoke", "frontend/refactor-component", "--param", f"old_name={selection}", "--param", "new_name=AskUserViaInputBox", "--param", f"directory={workspace_root}/src/", ] proc = run_subprocess(cli_args, cwd=workspace_root) stream_to_editor_panel(proc.stdout) ``` **TypeScript:** ```typescript import * as vscode from "vscode"; import { spawn } from "node:child_process"; export function activate(ctx: vscode.ExtensionContext) { const cmd = vscode.commands.registerCommand("claude.refactorComponent", async () => { const editor = vscode.window.activeTextEditor; if (!editor) return; const selection = editor.document.getText(editor.selection); const newName = await vscode.window.showInputBox({ prompt: `New name for ${selection}` }); if (!newName) return; const ws = vscode.workspace.workspaceFolders?.[0]?.uri.fsPath ?? 
"."; const proc = spawn( "claude", ["skills", "invoke", "frontend/refactor-component", "--param", `old_name=${selection}`, "--param", `new_name=${newName}`, "--param", `directory=${ws}/src/`], { cwd: ws }, ); const out = vscode.window.createOutputChannel("Claude refactor"); proc.stdout.on("data", (d) => out.append(d.toString())); proc.stderr.on("data", (d) => out.append(d.toString())); out.show(); }); ctx.subscriptions.push(cmd); } ``` Concept: `claude-md-hierarchy` ### 7. Discover Skills via the CLI registry claude skills list walks .claude/skills//*.md and ~/.claude/skills//*.md, parses frontmatter, and prints a discoverable table. IDE extensions call this and feed the result into command palettes. **Python:** ```python import glob, yaml from pathlib import Path def list_skills(roots: list[str]) -> list[dict]: out = [] for root in roots: for path in glob.glob(f"{root}/**/*.md", recursive=True): try: text = Path(path).read_text() _, fm, _ = text.split("---", 2) meta = yaml.safe_load(fm) out.append({ "name": meta["name"], "version": meta["version"], "description": meta["description"].strip().split("\n")[0], "parameters": list(meta.get("parameters", {}).keys()), "path": path, }) except (ValueError, KeyError): pass return sorted(out, key=lambda s: s["name"]) def cmd_skills_list(team: str | None = None): skills = list_skills([".claude/skills", str(Path.home() / ".claude/skills")]) if team: skills = [s for s in skills if s["name"].startswith(f"{team}/")] for s in skills: print(f"{s['name']}@{s['version']:<8} {s['description']}") ``` **TypeScript:** ```typescript import { glob } from "glob"; import { parse as parseYaml } from "yaml"; import { readFileSync } from "node:fs"; import { homedir } from "node:os"; import { join } from "node:path"; interface SkillSummary { name: string; version: string; description: string; parameters: string[]; path: string; } export async function listSkills(roots: string[]): Promise { const out: SkillSummary[] = []; for (const root of roots) { const paths = await glob(`${root}/**/*.md`); for (const path of paths) { try { const text = readFileSync(path, "utf8"); const [, fm] = text.split("---", 3); const meta = parseYaml(fm) as { name: string; version: string; description: string; parameters?: Record }; out.push({ name: meta.name, version: meta.version, description: meta.description.trim().split("\n")[0], parameters: Object.keys(meta.parameters ?? {}), path, }); } catch { /* skip malformed */ } } } return out.sort((a, b) => a.name.localeCompare(b.name)); } export async function cmdSkillsList(opts: { team?: string } = {}) { let skills = await listSkills([".claude/skills", join(homedir(), ".claude/skills")]); if (opts.team) skills = skills.filter((s) => s.name.startsWith(`${opts.team}/`)); for (const s of skills) { console.log(`${s.name}@${s.version.padEnd(8)} ${s.description}`); } } ``` Concept: `evaluation` ### 8. Version Skills via Git tags; pin majors Each Skill carries a semver in frontmatter. Each release tags the Git history (git tag skill-refactor@1.2.3). Callers pin a major (@1.x); the registry resolves to the latest patch within that major. Edit-in-place is forbidden by PR review. 
**Python:** ```python import subprocess, semver def tag_skill_release(skill_path: str, new_version: str): semver.VersionInfo.parse(new_version) skill_name = parse_skill(skill_path)["frontmatter"]["name"] tag = f"{skill_name.replace('/', '-')}@{new_version}" update_frontmatter_version(skill_path, new_version) subprocess.run(["git", "add", skill_path], check=True) subprocess.run(["git", "commit", "-m", f"chore(skill): bump {skill_name} to {new_version}"], check=True) subprocess.run(["git", "tag", "-a", tag, "-m", f"{skill_name} {new_version}"], check=True) print(f"tagged {tag}") def resolve_caller_pin(skill_name: str, pin: str) -> str: if pin.endswith(".x"): major = int(pin.split(".")[0]) tags = subprocess.check_output( ["git", "tag", "-l", f"{skill_name.replace('/', '-')}@{major}.*"], text=True, ).strip().split("\n") versions = [t.split("@", 1)[1] for t in tags if t] return max(versions, key=lambda v: semver.VersionInfo.parse(v)) return pin ``` **TypeScript:** ```typescript import { execSync } from "node:child_process"; import semver from "semver"; export function tagSkillRelease(skillPath: string, newVersion: string) { if (!semver.valid(newVersion)) throw new Error(`invalid semver ${newVersion}`); const skillName = (parseSkill(skillPath).frontmatter as { name: string }).name; const tag = `${skillName.replaceAll("/", "-")}@${newVersion}`; updateFrontmatterVersion(skillPath, newVersion); execSync(`git add ${skillPath}`, { stdio: "inherit" }); execSync(`git commit -m "chore(skill): bump ${skillName} to ${newVersion}"`, { stdio: "inherit" }); execSync(`git tag -a ${tag} -m "${skillName} ${newVersion}"`, { stdio: "inherit" }); console.log(`tagged ${tag}`); } export function resolveCallerPin(skillName: string, pin: string): string { if (pin.endsWith(".x")) { const major = Number(pin.split(".")[0]); const raw = execSync(`git tag -l "${skillName.replaceAll("/", "-")}@${major}.*"`, { encoding: "utf8" }); const tags = raw.trim().split("\n").filter(Boolean); const versions = tags.map((t) => t.split("@")[1]); return versions.sort(semver.rcompare)[0]; } return pin; } declare function parseSkill(path: string): { frontmatter: Record; body: string }; declare function updateFrontmatterVersion(path: string, version: string): void; ``` Concept: `tool-calling` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | IDE-first or CLI-first? | CLI-first. IDE extensions wrap the CLI. | IDE-first. Skills built into a VSCode plugin and ported to other editors as parallel implementations. | CLI-first is portable across every editor and shell workflow. New editors get supported with a 200-line shell-out wrapper instead of a parallel Skill engine. One source of truth; one update path. | | Skill or slash Command? | Skill if the work needs isolated exploration (context: fork) or reusable parameters. Command if the work has session-wide effects. | Use Command for everything because it is simpler. | Skills give you isolation (fork), reusable parameter schemas, and discoverability via claude skills list. Commands are inline and per-session. The wrong choice produces fragile workflows that break when copied between repos. | | Tool access in a refactoring Skill | Explicit allowed-tools whitelist (Read, Grep, Glob, Edit). No Bash. Edit only because the Skill genuinely needs it. | Unrestricted tools. The agent will be careful. | Whitelisting is structural and SDK-enforced. Prompt-based caution is probabilistic and leaks under unusual phrasing. 
A test-gen Skill that does not list Edit literally cannot Edit, no matter how the prompt phrases the request. | | Skill reusability across repos | Parameterize: directory, language, target_pattern. The Skill body uses {placeholders}. The CLI fills them in. | Hardcode paths and language. Fork the Skill per repo. | Parameterization scales linearly with use-cases. Hardcoding scales linearly with repos and produces drift. Once you have 5 forks of the same Skill, the next breaking change requires updating all 5. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-DEVTOOLS-01 · Skills in IDE without CLI foundation | The team builds Skills as a VSCode extension first. Six months later, JetBrains and Neovim users are stuck or get a parallel re-implementation that drifts. Updates ship to one editor at a time. | CLI-first architecture. The CLI is the canonical entry point. IDE extensions are ~200-line shells over the CLI. New editors get supported with a tiny wrapper. The CLI stays the source of truth. | | AP-DEVTOOLS-02 · Unlimited tool access in skill context | A test-generation Skill is granted full tool access. A clever prompt-injection in source comments tricks it into calling Edit on a real source file and overwriting the working tree. | allowed-tools whitelist on every Skill. Explicit list. No Bash, no Edit unless the Skill genuinely needs them. SDK enforces the whitelist; the Skill body cannot escalate. | | AP-DEVTOOLS-03 · Skill designed for one codebase only | A refactor Skill hardcodes directory=src/ and language=tsx. Backend team needs the same Skill on app/ with language=py and forks the file. Now there are 5 forks across teams. | Parameterize: declare directory, language, target_pattern in frontmatter. The Skill body uses {placeholders}. The CLI substitutes them at invocation. One Skill, infinite repos. | | AP-DEVTOOLS-04 · Skills vs Commands ambiguity | The team has no clear criterion. Some workflows are Skills, some are Commands, the choice is ad-hoc. New developers cannot predict which to author for a new use case. | Explicit decision tree. Skill if context: fork is needed (exploration without touching parent state) or if parameters make it reusable. Command if the work has session-wide effects. | | AP-DEVTOOLS-05 · Shared Skills without version control | Skills are edited in place. A v2 frontmatter change ships; 12 agents that depended on the v1 shape silently break. Nobody knows which Skill regression caused the failure. | Git semver tagging. skill-refactor@1.2.3. Callers pin major (@1.x); the registry resolves to the latest patch. Breaking changes bump the major and ship as @2.0.0. | ## Implementation checklist - [ ] Team-namespaced layout: .claude/skills/{team}/{name}.md (`skills`) - [ ] Frontmatter schema (name, version, description, when_to_use, parameters, allowed-tools, context_mode) validated by the CLI at parse time (`structured-outputs`) - [ ] Every Skill declares allowed-tools explicitly. Edit and Bash are NOT default (`tool-calling`) - [ ] Exploratory Skills use context_mode: fork (`subagents`) - [ ] Parameters declared with types in frontmatter. CLI validates before invocation (`structured-outputs`) - [ ] CLI is the canonical entry point. IDE extensions are thin wrappers over the CLI (`claude-md-hierarchy`) - [ ] claude skills list returns name, version, description, when_to_use, parameters (`evaluation`) - [ ] Git semver tags per release. Callers pin major (`tool-calling`) - [ ] PR review on every Skill change. 
Frontmatter shape changes require a major bump - [ ] Per-Skill regression eval set. Runs in CI on every Skill PR ## Cost & latency - **Per-Skill invocation (in fork mode):** ~$0.004 to $0.012, Skill body ~500 tokens system + parameters ~50 tokens + child working tokens ~1500-3000 input + ~500 output. Sonnet 4.5 pricing. - **context: fork overhead:** ~50 tokens per invocation, Fork setup writes a fresh system prompt and instantiates child message history. No LLM-side cost beyond a few extra tokens. - **IDE wrapper integration latency p95:** ~2 to 3 seconds, Editor event triggers shell-out to the CLI; CLI parses Skill; spawns child session; child returns. The Claude API call dominates. - **Skill registry storage:** ~10 MB for 100 Skills across 5 versions each, Frontmatter and body per Skill ~5-15 KB. 100 Skills with full Git history of 5 versions per Skill ~10 MB checked into the repo. - **Cost per invocation at production scale:** ~$0.004, Combined Claude tokens, CLI overhead, registry lookup. At 1000 invocations per day across the team, ~$4 per day, ~$120 per month. ## Domain weights - **D3 · Agent Operations (20%):** Skill definition file. Frontmatter schema. context: fork. allowed-tools whitelist. - **D2 · Tool Design + Integration (18%):** Parameter contract. CLI invocation shape. Git semver tagging. IDE wrapper protocol. ## Practice questions ### Q1. You are designing a Skill for TypeScript refactoring. The agent should explore changes without affecting the working directory. Which feature isolates the exploration: context: fork or allowed-tools? context: fork. Setting context_mode: fork in the Skill frontmatter spawns a child session with its own conversation history and its own working tree view; the parent session is untouched. The child returns a tool_result with the proposed change and the parent decides whether to apply. allowed-tools is a separate axis: it restricts which tools the child can call. The two compose. For exploration that may need to write changes, you set context_mode: fork AND list Edit in allowed-tools. Tagged to AP-DEVTOOLS-04. ### Q2. A Skill needs parameters for directory, target_pattern, backup_location. How should these parameters be defined to make the Skill reusable across different codebases? Declare them in the Skill frontmatter with types and defaults. The Skill body uses {directory}, {target_pattern}, {backup_location} placeholders that the CLI fills in from --param key=value arguments. Required parameters that are missing cause the CLI to reject the invocation before any Claude API call. Result: one generic Skill that works on any repo by passing different parameters. Tagged to AP-DEVTOOLS-03. ### Q3. Your IDE integration uses Skills. When should a developer use a Skill vs a slash Command? Use a Skill when (a) the work needs isolated exploration that should not touch the parent session (context: fork), or (b) the work has reusable parameters that vary across invocations. Use a Command when the work has session-wide effects (persisting state into the current conversation, sharing context with subsequent Commands). Skills are versioned, parameterized, discoverable. Commands are inline and per-session. ### Q4. A team shares a test-gen Skill. It currently has no version identifier. How should you version the Skill? Git semver tagging. Add version: 1.0.0 to the frontmatter and cut a Git tag skill-test-gen@1.0.0 on release. Callers pin a major version: claude skills invoke shared/test-gen@1.x. 
The registry resolves the pin to the latest patch within the major. Breaking changes bump the major to 2.0.0; existing v1.x callers continue to work; new callers explicitly opt in to v2. Tagged to AP-DEVTOOLS-05. ### Q5. A Skill can technically execute Bash, Edit, and Read. A developer wants to run a Skill that should NOT modify files. How do you prevent the Skill from calling Edit? Set allowed-tools: [Read, Grep, Glob] in the Skill frontmatter. Edit is omitted. The CLI loads only the listed tools into the child session; any tool_use call to Edit returns tool_result with is_error: true. This is structural: the Skill body cannot escalate its own whitelist, no matter how the prompt phrases the request. ## FAQ ### Q1. Can a Skill modify files? Only if allowed-tools includes Edit (or Write). Refactoring Skills typically grant Edit. Exploratory Skills (test-gen, code-gen, doc-gen) often deny it: they propose changes via the tool_result payload and let the parent session decide whether to apply. ### Q2. How does the IDE know which Skills are available? The IDE extension calls claude skills list (which walks .claude/skills//*.md and ~/.claude/skills//*.md) and feeds the result into its command palette. The CLI is the source of truth. ### Q3. What is the difference between context: fork and a Subagent? context: fork is lightweight isolation for a single Skill invocation: fresh messages, scoped tools, parent untouched. A full Subagent is a separate agent loop with its own task and full autonomy. Use fork for one-shot exploration. Use Subagent for delegated work that needs its own multi-turn loop. ### Q4. How do I version a Skill? version: MAJOR.MINOR.PATCH in the frontmatter; git tag skill-name@1.2.3 on release. Callers pin a major (@1.x); the registry resolves to the latest patch. Breaking changes bump the major; existing callers stay on v1.x until they migrate. ### Q5. Can a Skill call another Skill? Yes, if both are in allowed-tools. The composing Skill lists invoke_skill as an allowed tool. Composition enables shared building blocks. Avoid deep nesting (depth > 2): debugging multi-level Skill chains is painful. ### Q6. Does Skill frontmatter override the agent's decision-making? No. Frontmatter is attention engineering, not a regex classifier. The LLM forward-pass reads the frontmatter into context and uses it to decide whether the Skill is the right fit. Good frontmatter lifts routing accuracy substantially without removing the model's agency. ### Q7. What happens if a Skill fails? The child session returns a tool_result with is_error: true and a structured error payload. The parent agent observes the error and decides: retry with different parameters, propose an alternative Skill, or escalate. Failures do NOT propagate to the parent's working tree because of context: fork. 
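FAQ Q7's contract (failures surface to the parent as an is_error tool_result and never touch the parent's working tree) can be enforced with a small wrapper around the invoke_skill_in_fork helper from build step 3. A minimal sketch, with the retry or escalate decision left to the parent agent; the summarized payload shape is an assumption:

```python
import json


def run_skill_for_parent(skill_path: str, params: dict, user_message: str, tool_use_id: str) -> dict:
    """Surface a fork-mode Skill outcome (or failure) to the parent as a tool_result."""
    try:
        outcome = invoke_skill_in_fork(skill_path, params, user_message)  # from build step 3
    except Exception as exc:
        # The child crashed; the parent's working tree is untouched (context: fork).
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "is_error": True,
            "content": json.dumps({"error": type(exc).__name__, "detail": str(exc)}),
        }
    proposed = [
        getattr(block, "text", "")
        for block in outcome["child_content"]
        if getattr(block, "type", None) == "text"
    ]
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": json.dumps({
            "skill": f"{outcome['skill_name']}@{outcome['skill_version']}",
            "stop_reason": outcome["child_stop_reason"],
            "proposed": proposed,  # the parent decides whether to apply the proposal
        }),
    }
```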
## Production readiness - [ ] Skills directory structure: team-namespaced; PR-reviewed on every change - [ ] Frontmatter schema validated in CI; PRs that violate the schema fail to merge - [ ] allowed-tools audit: lint blocks any Skill that grants Edit + Bash + Write together - [ ] context_mode: fork tested end-to-end: parent state must be unchanged after a fork-mode invocation - [ ] Parameter validation tested with missing required, wrong type, and extra-key cases - [ ] Git tag automation: release script bumps version, tags, pushes; PR review gates the bump - [ ] IDE wrapper extensions kept under 300 lines; review enforces 'thin wrapper' principle - [ ] Per-Skill regression eval suite; runs in CI on every Skill PR --- **Source:** https://claudearchitectcertification.com/scenarios/agent-skills-for-developer-tooling **Vault sources:** ACP-T05 Scenario 12 (yellow beyond-guide; OP-claimed Reddit 1s34iyl); ACP-T07 Lab 12 spec (Skills for developer tooling); ACP-T08 section 3.12 (CLI-first, IDE wrappers, version control); Course 15 Introduction to Agent Skills (overview + lesson 5 sharing skills); Course 01 Claude 101 lesson 7 (working with Skills); ACP-T06 (5 practice Qs tagged to components) **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Agent Skills with Code Execution > A code-execution Skill with four layers of safety. Layer 1: route file I/O to built-in tools (Read, Write, Edit) and reserve Bash only for actual execution. Layer 2: a PreToolUse hook scans the proposed Bash command for destructive patterns and exits 2 on match. Layer 3: a Docker or Firecracker sandbox runs the code with kernel-level limits (CPU 2, memory 1GB, timeout 30s, network deny). Layer 4: a PostToolUse hook normalizes the raw output to JSON, then a semantic validator confirms the result shape matches the task. The most-tested distractor: Python signal-handler timeouts. The right answer is kernel-level via systemd-run or cgroups; signal handlers can be caught. **Sub-marker:** P3.13 **Domains:** D2 · Tool Design + Integration, D3 · Agent Operations **Exam weight:** 38% of CCA-F (D2 + D3) **Build time:** 24 minutes **Source:** 🟡 Beyond-guide scenario. OP-claimed (Reddit 1s34iyl). Architecture matches Anthropic public guidance. **Canonical:** https://claudearchitectcertification.com/scenarios/agent-skills-with-code-execution **Last reviewed:** 2026-05-04 ## In plain English Think of this as the way you let an agent run actual code (a Python data-analysis script, a one-off shell command, a compiled binary) without giving it the keys to your machine. The script runs inside a sandbox: a small isolated container with strict limits on CPU, memory, time, and network. A PreToolUse hook scans the proposed command BEFORE it runs and refuses anything destructive. After the script runs, a PostToolUse hook normalizes the messy raw output into a clean structured result. A semantic validator confirms the output makes sense given the task before the agent acts on it. The whole point is that code execution is too dangerous to be a free tool; it needs four layers of containment. ## Exam impact Domain 2 (Tool Design, 18%) tests the Bash-vs-built-ins distinction, the PreToolUse blocklist contract, and the structured-output normalization shape. Domain 3 (Claude Code Configuration, 20%) tests sandboxing strategy, kernel-level resource enforcement, and the four-layer containment model. 
The 'why does Python signal-handler timeout fail to actually stop a runaway script?' question is the canonical exam distractor. ## The problem ### What the customer needs - Run real Python or shell scripts as part of the agent's workflow, not just simulate them. - Untrusted code stays contained. A misbehaving script does not destroy the host's filesystem or exhaust its memory. - Predictable termination. A runaway loop or infinite recursion stops at exactly the configured time limit. - Consistent output shape. The agent sees a predictable JSON contract regardless of which tool ran or how the script printed. ### Why naive approaches fail - Use Bash for everything (including cat file.txt instead of Read). Audit trail is opaque, file I/O and execution conflate. - No PreToolUse blocklist. A clever prompt-injection in the alert text gets rm -rf /prod to execute. - No resource limits. A loop allocates 10 GB or runs forever; the sandbox runner is exhausted. - Heterogeneous raw output passed to the agent. The agent parses inconsistently and routes wrong. - Schema-only validation. The output matches {status: string} but status is 'banana'. The agent acts on nonsense. ### Definition of done - File I/O routes to Read / Write / Edit. Bash is reserved for actual command execution. - PreToolUse hook on Bash with a destructive blocklist (regex). Exit 2 on match. - Sandbox runtime (Docker or Firecracker) with kernel-level limits: CPU 2, memory 1GB, timeout 30s, network deny. - PostToolUse hook normalizes raw output to JSON: {status, stdout, stderr, duration_ms, peak_memory_mb}. - Semantic validator confirms result shape matches the task type. - Audit log: every code-exec invocation writes an append-only row. ## Concepts in play - 🟢 **Skills** (`skills`), Code-execution capability packaged as a Skill - 🟢 **Tool calling** (`tool-calling`), Bash vs built-in tool selection - 🟢 **Hooks** (`hooks`), PreToolUse blocklist and PostToolUse normalizer - 🟢 **Evaluation** (`evaluation`), Semantic validation beyond schema - 🟢 **Structured outputs** (`structured-outputs`), Normalized result contract - 🟢 **Subagents** (`subagents`), Sandbox child session for isolated execution - 🟢 **Context window** (`context-window`), Truncate huge stdout before feeding to the agent - 🟢 **Agentic loops** (`agentic-loops`), stop_reason branching on tool_use and is_error ## Components ### Bash Tool with Destructive Blocklist, PreToolUse gate, regex-driven Bash sits behind a PreToolUse hook with a compiled regex blocklist (rm -rf, sudo, drop database, kill -9, chmod 777, curl ... | sh). Match exits 2 with a model-readable stderr message; agent observes the deny as tool_result: is_error: true and re-plans. No prompt-injection bypass: the blocklist is in code, not in the prompt. **Configuration:** matcher: 'Bash'. Blocklist regex compiled at hook-load time. Allowlist of safe binaries (kubectl, docker, journalctl, jq, ps, df, top). Exit 2 with stderr. **Concept:** `hooks` ### Sandbox Runtime (Docker or Firecracker), fresh sandbox per invocation Each code-exec invocation runs in a freshly spawned isolated environment. Docker for most cases; Firecracker for stronger isolation when running fully untrusted user code. The image is cached so spin-up stays fast (~500ms warm). The sandbox is destroyed after the run; no state leaks between invocations. **Configuration:** Sandbox config: { image: code-exec:latest, cpus: 2, memory_mb: 1024, timeout_sec: 30, network: deny, ipc: private, pid: private }. 
Image is read-only with a small writable tmpfs scratch space. **Concept:** `subagents` ### Resource Limit Enforcement, kernel-level via cgroups or systemd CPU, memory, time, and network limits are enforced at the kernel level (cgroups for Docker, jailer for Firecracker, systemd-run --property=TimeoutStartSec=30s for raw process spawn). Kernel limits cannot be caught or ignored. Python signal-handler-based timeouts are the canonical wrong answer: a busy-loop or a try: pass swallows them. **Configuration:** cgroup limits: cpu.max=2, memory.max=1G, network deny via iptables egress rule. Timeout via systemd-run --property=TimeoutStartSec=30s. Process exit code 137 (SIGKILL) means OOM; 124 means timeout. **Concept:** `tool-calling` ### PostToolUse Output Normalizer, raw bytes to structured JSON Real shell output is messy: mixed Unix timestamps and ISO 8601, mixed status code conventions, multiline stack traces, ANSI color codes. The PostToolUse hook normalizes everything into a stable contract: {status, stdout, stderr, duration_ms, peak_memory_mb, exit_code}. Timestamps converted to ISO 8601 UTC. Color codes stripped. Long stdout truncated. **Configuration:** matcher: 'Bash'. Hook reads stdin: {tool_name, tool_input, tool_result, latency_ms, peak_memory_mb}. Returns normalized JSON. Truncates stdout > 4 KB. Strips ANSI codes. Always exits 0. **Concept:** `structured-outputs` ### Semantic Result Validator, task-aware sanity check Schema validation guarantees shape; semantic validation guarantees meaning. Given the task context, check that the normalized result is sensible: passed + failed + skipped equals total; counts are non-negative. Failed semantic validation routes the agent back with a specific error message; it does NOT propagate bad data. **Configuration:** Per-task validators registered by Skill. Test-runner validator: passed + failed + skipped total. Data-analysis validator: row count > 0; required columns present. **Concept:** `evaluation` ## Build steps ### 1. Route file I/O to built-in tools; reserve Bash for execution The first layer of safety is tool selection. cat file.txt should be a Read call, not a Bash call. grep -r foo should be a Grep call. find . -name '*.py' should be a Glob call. Bash is reserved for what the built-ins cannot do: compile code, run tests, execute a Python data-analysis script. This single distinction shrinks the Bash blast-radius by ~80%. **Python:** ```python # Wrong: Bash for everything # tool_use: Bash, command: "cat config.json" # tool_use: Bash, command: "grep -r 'TODO' src/" # Right: built-in tools for I/O; Bash only for execution # tool_use: Read, file_path: "config.json" # tool_use: Grep, pattern: "TODO", path: "src/" # tool_use: Glob, pattern: "**/*.py" # Bash legitimately for execution: # tool_use: Bash, command: "pytest tests/ --json-report" # tool_use: Bash, command: "python analyze.py --input data.csv" import re FILE_IO_VIA_BASH = re.compile( r"^\s*(cat|head|tail|less|more|grep|find|ls|wc|sort|uniq|cut|awk|sed)\s", ) def warn_on_io_via_bash(tool_name: str, command: str) -> str | None: if tool_name != "Bash": return None if FILE_IO_VIA_BASH.match(command): first = command.strip().split()[0] return ( f"Bash command starts with {first!r}. " f"For file I/O prefer Read / Grep / Glob; reserve Bash for execution." 
) return None ``` **TypeScript:** ```typescript // Wrong: Bash for everything // tool_use: Bash, command: "cat config.json" // Right: built-in tools for I/O; Bash only for execution // tool_use: Read, file_path: "config.json" // tool_use: Grep, pattern: "TODO", path: "src/" const FILE_IO_VIA_BASH = /^\s*(cat|head|tail|less|more|grep|find|ls|wc|sort|uniq|cut|awk|sed)\s/; export function warnOnIoViaBash(toolName: string, command: string): string | null { if (toolName !== "Bash") return null; const m = FILE_IO_VIA_BASH.exec(command); if (!m) return null; const first = command.trim().split(/\s+/)[0]; return ( `Bash command starts with "${first}". ` + `For file I/O prefer Read / Grep / Glob; reserve Bash for execution.` ); } ``` Concept: `tool-calling` ### 2. Wire the PreToolUse blocklist hook on Bash The destructive blocklist runs before the sandbox is even spawned. Compiled regex against rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh. Match exits 2; the agent sees the deny as a tool_result with is_error: true and re-plans. **Python:** ```python # .claude/hooks/codeexec_blocklist.py import sys, json, re BLOCKLIST = re.compile( r"\b(" r"rm\s+-rf" r"|sudo\s+" r"|drop\s+(database|table)" r"|kill\s+-9" r"|chmod\s+777" r"|>\s*/(etc|usr|var)/" r"|curl\s+[^|]+\|\s*sh" r")\b", re.IGNORECASE, ) ALLOWLIST_BINS = { "python", "python3", "pytest", "node", "npm", "pnpm", "tsc", "eslint", "prettier", "ruff", "black", "go", "cargo", "kubectl", "docker", "jq", } def main(): payload = json.loads(sys.stdin.read()) if payload["tool_name"] != "Bash": sys.exit(0) cmd = (payload["tool_input"].get("command") or "").strip() if BLOCKLIST.search(cmd): print(f"BLOCKED: command matches destructive pattern. command={cmd!r}", file=sys.stderr) sys.exit(2) first = cmd.split()[0] if cmd else "" if first and first not in ALLOWLIST_BINS: print(f"BLOCKED: binary {first!r} not on the code-exec allowlist.", file=sys.stderr) sys.exit(2) sys.exit(0) if __name__ == "__main__": main() ``` **TypeScript:** ```typescript // .claude/hooks/codeexec-blocklist.ts import { readFileSync } from "node:fs"; const BLOCKLIST = new RegExp( String.raw`\b(` + String.raw`rm\s+-rf` + String.raw`|sudo\s+` + String.raw`|drop\s+(database|table)` + String.raw`|kill\s+-9` + String.raw`|chmod\s+777` + String.raw`|>\s*/(etc|usr|var)/` + String.raw`|curl\s+[^|]+\|\s*sh` + String.raw`)\b`, "i", ); const ALLOWLIST_BINS = new Set([ "python", "python3", "pytest", "node", "npm", "pnpm", "tsc", "eslint", "prettier", "ruff", "black", "go", "cargo", "kubectl", "docker", "jq", ]); const payload = JSON.parse(readFileSync(0, "utf8")); if (payload.tool_name !== "Bash") process.exit(0); const cmd = String(payload.tool_input?.command ?? "").trim(); if (BLOCKLIST.test(cmd)) { process.stderr.write(`BLOCKED: command matches destructive pattern. command=${JSON.stringify(cmd)}\n`); process.exit(2); } const first = cmd.split(/\s+/)[0] ?? ""; if (first && !ALLOWLIST_BINS.has(first)) { process.stderr.write(`BLOCKED: binary ${JSON.stringify(first)} not on the code-exec allowlist.\n`); process.exit(2); } process.exit(0); ``` Concept: `hooks` ### 3. Spawn the sandbox with kernel-level limits Once the blocklist allows the command, the sandbox runs the actual code. Docker is the default; Firecracker for stronger isolation. The sandbox is fresh per invocation, runs read-only with a tmpfs scratch space, and enforces CPU / memory / time / network limits at the kernel level via cgroups. 
**Python:** ```python import subprocess, time def run_in_sandbox(command: str, timeout_s: int = 30) -> dict: """Run command in a Docker sandbox with kernel-level limits.""" start = time.monotonic() try: result = subprocess.run( [ "docker", "run", "--rm", "--cpus", "2", "--memory", "1024m", "--memory-swap", "1024m", "--network", "none", "--read-only", "--tmpfs", "/tmp:size=128m", "--ipc", "private", "--pid", "private", "code-exec:latest", "bash", "-c", command, ], capture_output=True, text=True, timeout=timeout_s, ) return { "status": "ok" if result.returncode == 0 else "exit_nonzero", "exit_code": result.returncode, "stdout": result.stdout, "stderr": result.stderr, "duration_ms": int((time.monotonic() - start) * 1000), } except subprocess.TimeoutExpired: return { "status": "timeout", "exit_code": 124, "stdout": "", "stderr": f"command exceeded {timeout_s}s timeout", "duration_ms": timeout_s * 1000, } ``` **TypeScript:** ```typescript import { spawn } from "node:child_process"; interface SandboxResult { status: "ok" | "exit_nonzero" | "timeout"; exit_code: number; stdout: string; stderr: string; duration_ms: number; } export async function runInSandbox(command: string, timeoutSec = 30): Promise { const start = Date.now(); return new Promise((resolve) => { const child = spawn( "docker", ["run", "--rm", "--cpus", "2", "--memory", "1024m", "--memory-swap", "1024m", "--network", "none", "--read-only", "--tmpfs", "/tmp:size=128m", "--ipc", "private", "--pid", "private", "code-exec:latest", "bash", "-c", command], { stdio: ["ignore", "pipe", "pipe"] }, ); let stdout = ""; let stderr = ""; child.stdout.on("data", (d) => { stdout += d.toString(); }); child.stderr.on("data", (d) => { stderr += d.toString(); }); const killer = setTimeout(() => { child.kill("SIGKILL"); resolve({ status: "timeout", exit_code: 124, stdout: "", stderr: `command exceeded ${timeoutSec}s timeout`, duration_ms: timeoutSec * 1000, }); }, timeoutSec * 1000); child.on("close", (code) => { clearTimeout(killer); resolve({ status: code === 0 ? "ok" : "exit_nonzero", exit_code: code ?? -1, stdout, stderr, duration_ms: Date.now() - start, }); }); }); } ``` Concept: `subagents` ### 4. Use kernel timeouts, not Python signal handlers The canonical wrong answer to 'how do I time-out a script after 30 seconds?' is signal.signal(signal.SIGALRM, handler). Python signal handlers can be caught (try: ... except: pass), can be ignored, and do not fire inside C extensions. Use kernel-level timeouts: systemd-run --property=TimeoutStartSec=30s, Docker's intrinsic timeout. Kernel timeouts cannot be caught. **Python:** ```python # Wrong: Python signal handler # import signal # def handler(signum, frame): # raise TimeoutError("timed out") # signal.signal(signal.SIGALRM, handler) # signal.alarm(30) # # User script can do try/except and swallow the SIGALRM. # Right: kernel timeout via systemd-run import subprocess def run_with_kernel_timeout(command: str, timeout_s: int = 30) -> int: result = subprocess.run( [ "systemd-run", "--user", "--scope", f"--property=TimeoutStartSec={timeout_s}s", "--property=MemoryMax=1G", "--property=CPUQuota=200%", "bash", "-c", command, ], ) # Exit code 124 means systemd killed it for timeout; user script cannot prevent. return result.returncode ``` **TypeScript:** ```typescript // Wrong: setTimeout // User script in a tight loop blocks the event loop; setTimeout never fires. 
// Right: kernel timeout via spawn with systemd-run import { spawnSync } from "node:child_process"; export function runWithKernelTimeout(command: string, timeoutSec = 30): number { const result = spawnSync( "systemd-run", [ "--user", "--scope", `--property=TimeoutStartSec=${timeoutSec}s`, "--property=MemoryMax=1G", "--property=CPUQuota=200%", "bash", "-c", command, ], { stdio: "inherit" }, ); return result.status ?? -1; } ``` Concept: `tool-calling` ### 5. PostToolUse output normalizer Real shell output is messy. The PostToolUse hook normalizes everything into a stable contract before the agent sees it. Strip ANSI color codes, convert timestamps to ISO 8601 UTC, truncate stdout / stderr above 4 KB. **Python:** ```python import sys, json, re, datetime ANSI = re.compile(r"\x1b\[[0-9;]*m") UNIX_TS = re.compile(r"\b1[6-9]\d{8}\b") def normalize_stdout(text: str, max_bytes: int = 4096) -> str: text = ANSI.sub("", text) text = UNIX_TS.sub( lambda m: datetime.datetime.utcfromtimestamp(int(m.group())).isoformat() + "Z", text, ) if len(text) > max_bytes: head = text[: max_bytes // 2] tail = text[-max_bytes // 2 :] omitted_lines = text[max_bytes // 2 : -max_bytes // 2].count("\n") text = f"{head}\n... truncated ({omitted_lines} more lines) ...\n{tail}" return text def main(): payload = json.loads(sys.stdin.read()) if payload["tool_name"] != "Bash": print(json.dumps(payload)) sys.exit(0) raw_result = payload.get("tool_result") or {} normalized = { "status": raw_result.get("status", "unknown"), "exit_code": raw_result.get("exit_code", -1), "stdout": normalize_stdout(raw_result.get("stdout", "")), "stderr": normalize_stdout(raw_result.get("stderr", "")), "duration_ms": raw_result.get("duration_ms"), "peak_memory_mb": raw_result.get("peak_memory_mb"), } payload["tool_result"] = normalized print(json.dumps(payload)) sys.exit(0) if __name__ == "__main__": main() ``` **TypeScript:** ```typescript import { readFileSync } from "node:fs"; const ANSI = /\x1b\[[0-9;]*m/g; const UNIX_TS = /\b1[6-9]\d{8}\b/g; function normalizeStdout(text: string, maxBytes = 4096): string { let out = text.replace(ANSI, ""); out = out.replace(UNIX_TS, (m) => new Date(Number(m) * 1000).toISOString()); if (out.length > maxBytes) { const head = out.slice(0, maxBytes / 2); const tail = out.slice(-maxBytes / 2); const omittedLines = out.slice(maxBytes / 2, -maxBytes / 2).split("\n").length - 1; out = `${head}\n... truncated (${omittedLines} more lines) ...\n${tail}`; } return out; } const payload = JSON.parse(readFileSync(0, "utf8")); if (payload.tool_name !== "Bash") { process.stdout.write(JSON.stringify(payload)); process.exit(0); } const raw = payload.tool_result ?? {}; const normalized = { status: raw.status ?? "unknown", exit_code: raw.exit_code ?? -1, stdout: normalizeStdout(String(raw.stdout ?? "")), stderr: normalizeStdout(String(raw.stderr ?? "")), duration_ms: raw.duration_ms, peak_memory_mb: raw.peak_memory_mb, }; payload.tool_result = normalized; process.stdout.write(JSON.stringify(payload)); process.exit(0); ``` Concept: `structured-outputs` ### 6. Validate the result semantically Schema validation guarantees shape; semantic validation guarantees meaning. After normalization, run a task-specific validator. For a test runner: passed + failed + skipped total; counts are non-negative. Failed validation returns is_error: true with the specific check that failed. 
**Python:** ```python import json from typing import TypedDict class ValidationResult(TypedDict): valid: bool errors: list[str] def validate_pytest_output(result: dict) -> ValidationResult: errors = [] try: report = json.loads(result["stdout"]) except json.JSONDecodeError: errors.append("pytest stdout is not valid JSON") return {"valid": False, "errors": errors} summary = report.get("summary", {}) passed = summary.get("passed", 0) failed = summary.get("failed", 0) skipped = summary.get("skipped", 0) total = summary.get("total", 0) if any(v < 0 for v in (passed, failed, skipped, total)): errors.append(f"negative test counts in summary: {summary}") if passed + failed + skipped != total: errors.append(f"counts inconsistent: passed={passed} + failed={failed} + skipped={skipped} != total={total}") if total == 0 and result.get("exit_code") == 0: errors.append("pytest reported total=0 but exited 0; expected tests to run") return {"valid": len(errors) == 0, "errors": errors} ``` **TypeScript:** ```typescript interface ValidationResult { valid: boolean; errors: string[]; } export function validatePytestOutput(result: { stdout: string; exit_code: number }): ValidationResult { const errors: string[] = []; let report: { summary?: Record }; try { report = JSON.parse(result.stdout); } catch { return { valid: false, errors: ["pytest stdout is not valid JSON"] }; } const summary = report.summary ?? {}; const passed = summary.passed ?? 0; const failed = summary.failed ?? 0; const skipped = summary.skipped ?? 0; const total = summary.total ?? 0; if ([passed, failed, skipped, total].some((v) => v < 0)) { errors.push(`negative test counts in summary: ${JSON.stringify(summary)}`); } if (passed + failed + skipped !== total) { errors.push(`counts inconsistent: passed=${passed} + failed=${failed} + skipped=${skipped} != total=${total}`); } if (total === 0 && result.exit_code === 0) { errors.push("pytest reported total=0 but exited 0; expected tests to run"); } return { valid: errors.length === 0, errors }; } ``` Concept: `evaluation` ### 7. Test resource limits and timeouts adversarially Ship the sandbox config; then break it on purpose. Run scripts that allocate 10 GB; verify the kernel kills them at 1 GB with exit code 137. Run busy loops; verify timeout at 30s with exit code 124. Run network-egress attempts; verify the deny rule fires. **Python:** ```python def adversarial_oom_test() -> bool: """Allocate 10 GB. Sandbox should kill at 1 GB cap.""" code = "x = bytearray(10 * 1024 * 1024 * 1024)" result = run_in_sandbox(f"python3 -c {code!r}") if result["exit_code"] != 137: print(f"FAIL: expected exit 137 (OOM kill), got {result['exit_code']}") return False print("PASS: OOM kill at 1 GB cap") return True def adversarial_timeout_test() -> bool: """Busy loop. 
Sandbox should kill at 30s cap.""" result = run_in_sandbox("python3 -c 'while True: pass'", timeout_s=30) if result["status"] != "timeout": print(f"FAIL: expected status=timeout, got {result['status']}") return False print("PASS: timeout kill at 30s") return True ``` **TypeScript:** ```typescript export async function adversarialOomTest(): Promise { const result = await runInSandbox(`python3 -c "x = bytearray(10 * 1024 * 1024 * 1024)"`); if (result.exit_code !== 137) { console.log(`FAIL: expected exit 137, got ${result.exit_code}`); return false; } console.log("PASS: OOM kill at 1 GB cap"); return true; } export async function adversarialTimeoutTest(): Promise { const result = await runInSandbox(`python3 -c "while True: pass"`, 30); if (result.status !== "timeout") { console.log(`FAIL: expected status=timeout, got ${result.status}`); return false; } console.log("PASS: timeout kill at 30s"); return true; } ``` Concept: `evaluation` ### 8. Audit-log every code-exec invocation Every code-exec invocation writes an append-only row to durable storage: timestamp, command, hook decisions, sandbox metrics (duration, peak memory, exit code), validation outcome, agent that requested it. Retain for at least 90 days. **Python:** ```python import datetime, json from pathlib import Path AUDIT_DIR = Path("audit") AUDIT_DIR.mkdir(exist_ok=True) def audit_code_exec(command, pre_decision, sandbox_result, validation, agent_id): today = datetime.date.today().isoformat() row = { "ts": datetime.datetime.utcnow().isoformat() + "Z", "agent_id": agent_id, "command": command[:200], "pre_decision": pre_decision, "sandbox": { "status": sandbox_result.get("status"), "exit_code": sandbox_result.get("exit_code"), "duration_ms": sandbox_result.get("duration_ms"), "peak_memory_mb": sandbox_result.get("peak_memory_mb"), }, "validation": validation, } with open(AUDIT_DIR / f"{today}.jsonl", "a") as f: f.write(json.dumps(row) + "\n") ``` **TypeScript:** ```typescript import { appendFileSync, mkdirSync } from "node:fs"; import { join } from "node:path"; const AUDIT_DIR = "audit"; mkdirSync(AUDIT_DIR, { recursive: true }); export function auditCodeExec( command: string, preDecision: string, sandboxResult: { status?: string; exit_code?: number; duration_ms?: number; peak_memory_mb?: number }, validation: ValidationResult, agentId: string, ) { const today = new Date().toISOString().slice(0, 10); const row = { ts: new Date().toISOString(), agent_id: agentId, command: command.slice(0, 200), pre_decision: preDecision, sandbox: { status: sandboxResult.status, exit_code: sandboxResult.exit_code, duration_ms: sandboxResult.duration_ms, peak_memory_mb: sandboxResult.peak_memory_mb, }, validation, }; appendFileSync(join(AUDIT_DIR, `${today}.jsonl`), JSON.stringify(row) + "\n"); } ``` Concept: `evaluation` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Reading the contents of config.json from agent code | Use the Read tool. file_path: 'config.json'. | Use Bash. command: 'cat config.json'. | Built-in tools are auditable, fast, and structurally distinct from execution. Bash conflates file I/O with execution and bloats the audit trail; the PreToolUse blocklist also has to reason about every cat/grep/find. | | Preventing destructive Bash commands at runtime | PreToolUse hook on Bash with a compiled regex blocklist. Exit 2 on match. | System prompt instruction: 'never run destructive commands'. | Prompts are probabilistic and leak under prompt injection or unusual phrasing. 
Hooks are deterministic, run before the sandbox spawns, and emit a model-readable stderr message that becomes a tool_result is_error: true. | | Stopping a runaway script after 30 seconds | Kernel timeout: systemd-run --property=TimeoutStartSec=30s, or Docker --timeout, or cgroup limit. | Python signal handler: signal.signal(signal.SIGALRM, handler). | Signal handlers can be caught, ignored, or never delivered (e.g. blocked inside a C extension). Kernel timeouts cannot be caught by the running code; the kernel sends SIGKILL and the process exits regardless. | | Validating that a test-runner result is sane | Semantic validation: passed + failed + skipped equals total; counts are non-negative; total > 0 for a non-empty run. | Schema validation only: the JSON has the expected shape. | Schema guarantees shape; semantic guarantees meaning. A schema-valid output of {passed: -5, failed: 2, total: 0} is structurally fine but semantically nonsense. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-CODEEXEC-01 · Using Bash for file I/O | Agent calls Bash with cat data.json, grep -r foo, find . -name *.py. Audit trail is opaque and the PreToolUse blocklist has to reason about every cat / grep / find. | Route file I/O to built-in tools: Read, Grep, Glob. Reserve Bash for actual command execution. The first layer of safety is tool selection. | | AP-CODEEXEC-02 · No PreToolUse blocklist on Bash | A clever prompt-injection in alert text gets rm -rf /prod past the agent. The Bash command runs because no hook scanned it first. | PreToolUse hook with a compiled regex blocklist (rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh). Match exits 2 with stderr. | | AP-CODEEXEC-03 · No resource limits on code execution | An agent's Python script allocates 10 GB and crashes the runner. Another runs an infinite loop and starves the queue. | Sandbox config with kernel-level limits: CPU 2, memory 1024 MB, timeout 30 s, network deny. Enforced via Docker cgroups, Firecracker jailer, or systemd-run scopes. | | AP-CODEEXEC-04 · Heterogeneous raw output to the agent | Bash output goes straight back: mixed Unix timestamps and ISO 8601, ANSI color codes, multiline stack traces. The agent parses inconsistently. | PostToolUse hook normalizes everything: strip ANSI, convert timestamps to ISO 8601 UTC, truncate stdout above 4 KB, emit a stable contract. | | AP-CODEEXEC-05 · Schema-only validation | The result JSON has the expected shape but passed: -5, total: 0. Schema-valid; semantically nonsense. The agent acts on it. | Semantic validators registered per task: passed + failed + skipped equals total; counts are non-negative; total > 0 for a non-empty run. | ## Implementation checklist - [ ] File I/O routes to Read / Write / Edit. Bash reserved for actual execution (`tool-calling`) - [ ] PreToolUse hook on Bash with compiled regex blocklist. Allowlist of safe binaries (`hooks`) - [ ] Sandbox runtime (Docker or Firecracker) per invocation. Image cached for warm spin-up (`subagents`) - [ ] Kernel-level resource limits: CPU 2, memory 1024 MB, timeout 30 s, network deny (`tool-calling`) - [ ] PostToolUse hook normalizes raw output to JSON contract (`structured-outputs`) - [ ] Per-task semantic validators (test runner, data analysis, type checker) (`evaluation`) - [ ] Adversarial test suite: OOM kill, timeout kill, network deny each verified end-to-end - [ ] Audit log: append-only JSONL with command, hook decisions, sandbox metrics, validation outcome (`evaluation`) - [ ] Retention 90+ days. 
Indexed by agent_id and timestamp - [ ] Telemetry: per-invocation duration, peak_memory, hook deny rate, validation pass rate ## Cost & latency - **Per-invocation Claude API:** ~$0.005 to $0.015, Skill body ~500 tokens system + parameters ~50 tokens + working tokens ~1500-3000 input + ~500 output. - **Sandbox spin-up overhead:** ~500 ms warm; ~3 s cold, Docker image cached in the runner. Warm spin-up is dominated by container start and cgroup setup. - **Code execution duration p95:** ~3 to 8 seconds end-to-end, Sandbox spin-up (500 ms) + actual code (1-5 s) + PostToolUse normalization (50 ms) + semantic validation (20 ms). - **Sandbox resource usage at the runner level:** ~50-200 MB memory, ~100-500 ms CPU per typical invocation, Most data-analysis or test-runner Skills are short and small. Resource caps prevent the long tail from dominating. - **Storage for audit log:** ~1 GB per month at 10 K invocations, JSONL row ~2-5 KB per invocation. 10 K invocations per month ~30-50 MB. Negligible at object-storage prices. ## Domain weights - **D2 · Tool Design + Integration (18%):** Bash vs built-ins. PreToolUse blocklist. PostToolUse normalizer. JSON contract. - **D3 · Agent Operations (20%):** Sandbox config. Kernel-level limits. Adversarial verification. Audit-log integration. ## Practice questions ### Q1. A Skill executes Python code. The agent calls Bash with command: 'cat data.json'. What is the correct tool to use here and why? Use the Read tool with file_path: 'data.json'. Reserve Bash for actual command execution (compiling, running tests, executing a Python data-analysis script). Built-in tools are auditable and structurally distinct from execution; the PreToolUse blocklist has a smaller reasoning surface when Bash usage is narrow. Tagged to AP-CODEEXEC-01. ### Q2. Your code-execution Skill runs untrusted Python. What prevents the code from running rm -rf / or other destructive commands? A PreToolUse hook on Bash with a compiled regex blocklist (rm -rf, sudo, drop (database|table), kill -9, chmod 777, curl | sh). On match, the hook exits 2 with a model-readable stderr message; the SDK delivers that as a tool_result with is_error: true. The agent observes the deny and re-plans. The blocklist lives in code, not in the prompt; prompt-injection cannot bypass it. Tagged to AP-CODEEXEC-02. ### Q3. A Skill executes a data-analysis script. The script can run for 5 minutes (max) or 10 seconds (expected). How should you enforce the time limit? Kernel-level timeout, not a Python signal handler. Use systemd-run --property=TimeoutStartSec=30s, or Docker's intrinsic timeout via cgroups. Kernel timeouts cannot be caught: the kernel sends SIGKILL and the process exits regardless. Python signal.signal(signal.SIGALRM, ...) is the canonical wrong answer because the user script can try: ... except: pass it. Exit code 124 is the standard timeout signature. Tagged to AP-CODEEXEC-03. ### Q4. A PostToolUse hook normalizes code-execution output. The raw output is heterogeneous: Unix timestamp, ISO 8601 date, status code, ANSI color codes, multi-line stack trace. How should the hook normalize this? Emit a stable JSON contract: {status, exit_code, stdout, stderr, duration_ms, peak_memory_mb}. Strip ANSI color codes via regex. Convert Unix timestamps to ISO 8601 UTC. Truncate stdout above 4 KB with a ... truncated (N more lines) ... marker. Map exit_code to status (0 -> ok, 137 -> oom, 124 -> timeout). Always exit the hook with code 0; this hook is for shape, not for denial. Tagged to AP-CODEEXEC-04. ### Q5. 
A Skill validates the result semantically. It runs a test suite and gets back: {passed: 5, failed: 1, skipped: 0, total: 6}. What semantic check should validate this result? Three semantic checks beyond schema. (1) passed + failed + skipped total (here 5 + 1 + 0 = 6: passes). (2) All counts are non-negative. (3) total > 0 for a non-empty run. If any check fails, return is_error: true with the specific failure. Schema validation alone catches malformed JSON; semantic validation catches schema-valid nonsense like {passed: -5, total: 0}. Tagged to AP-CODEEXEC-05. ## FAQ ### Q1. What languages can a code-execution Skill run? Anything in the sandbox base image. Common: Python, Node.js, Go, Rust, shell. The image determines available runtimes. For most teams, python:3.12-slim plus a few preinstalled libraries covers 90% of use cases. ### Q2. Can code access the network? No by default. --network none denies all egress at the kernel level. Opt in for specific tasks by spawning the sandbox with --network bridge and specific iptables rules. ### Q3. What happens if code runs out of memory? The kernel sends SIGKILL when the cgroup memory limit is hit; the process exits with code 137. The harness detects 137 and emits status: oom in the normalized result. ### Q4. Can code persist state across Skill invocations? Not by default. Each invocation gets a fresh sandbox with no state from prior runs. Opt-in persistence via a mounted volume on the runner. ### Q5. How do I validate output semantically? Per-task validators registered by Skill key. Test-runner validator: passed + failed + skipped total. Data-analysis validator: row count, required columns, aggregate sanity. ### Q6. Can a Skill call another Skill that does code execution? Yes. Skill-to-Skill calls are tool calls. Each Skill invocation gets its own sandbox; nesting does not share resources. ### Q7. What is the timeout for code execution? 30 seconds by default. Configurable per Skill via the sandbox_timeout_sec parameter in the Skill frontmatter. The kernel enforces it; the running code cannot extend or ignore it. ## Production readiness - [ ] Sandbox image built and cached on every runner; cold-start time documented - [ ] PreToolUse blocklist regex tested against an adversarial 'destructive-attempt' eval set - [ ] Kernel-level resource limits verified end-to-end (OOM kill, timeout kill, network deny) - [ ] PostToolUse normalizer tested against 5 representative output shapes - [ ] Per-task semantic validators registered for every Skill that emits code-exec results - [ ] Audit log retention 90+ days; indexed; replay tool reconstructs any invocation in seconds - [ ] Allowlist of safe binaries kept narrow; PR review on every additional binary - [ ] Telemetry: per-invocation duration, peak memory, hook deny rate, validation pass rate --- **Source:** https://claudearchitectcertification.com/scenarios/agent-skills-with-code-execution **Vault sources:** ACP-T05 Scenario 13 (yellow beyond-guide; OP-claimed Reddit 1s34iyl); ACP-T07 Lab 13 spec (Skills with code execution); ACP-T08 section 3.13 (sandbox strategy, resource caps, normalization); Course 01 Claude 101 lesson 7 (working with Skills); GAI-K05 CCA exam questions and scenarios; ACP-T06 (5 practice Qs tagged to components) **Last reviewed:** 2026-05-04 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Invoice Processing Agent > An AP-automation agent that wraps four guarantees around invoice approval. 
(1) Forced tool_use with a strict JSON schema (vendor_id, invoice_number, line_items[], total_amount, currency ISO 4217, due_date ISO 8601, PO_reference nullable) prevents fabrication. (2) Validation-retry loop confirms sum(line_items) total, currency in ISO 4217 enum, due_date >= invoice_date. (3) Three-way match reconciles invoice with the purchase order and the goods receipt; variance > 2% routes to human review. (4) PreToolUse hook on approve_payment denies if invoice_amount > vendor_authorization_cap, or if vendor on blocklist, or if (vendor_id, invoice_number) was seen in the last 90 days (duplicate detection). PostToolUse audit log captures every approval and rejection. The most-tested distractor: prompt-only field extraction leaks ~15% on edge invoices; forced tool_choice is the only credible architecture. **Sub-marker:** P3.14 **Domains:** D2 · Tool Design + Integration, D5 · Context + Reliability **Exam weight:** 33% of CCA-F (D2 + D5) **Build time:** 26 minutes **Source:** 🟠 Applied scenario. Community-derived from AP-automation patterns; composes P3.6 + P3.7 + P3.8 in a finance workflow. **Canonical:** https://claudearchitectcertification.com/scenarios/invoice-processing-agent **Last reviewed:** 2026-05-05 ## In plain English Think of this as the agent that handles your accounts-payable inbox without the team paying the same invoice twice. A vendor emails a PDF or scanned image; the agent extracts the structured fields (vendor, invoice number, line items, total, currency, due date, PO reference) using a strict schema so it cannot make values up; then it checks the math (line totals must equal the header total), looks up the matching purchase order and goods receipt to make sure the three documents agree, asks a deterministic policy hook whether the vendor still has authorization headroom and whether this exact invoice has been seen in the last 90 days, and only then approves payment. Anything ambiguous routes to a human AP analyst with a structured exception block. The whole point is that AP automation is one wrong cap-policy or duplicate-detection check away from a real money loss. ## Exam impact Domain 2 (Tool Design, 18%) tests forced tool_choice + JSON schema authoring + the PreToolUse cap and duplicate-detection hooks. Domain 5 (Context, 15%) tests CASE_FACTS pinning across multi-page invoices, schema caching, and Batch API for overnight bulk runs. Composes the patterns from P3.6 (structured-data-extraction), P3.7 (agentic-tool-design), and P3.8 (long-document-processing) into a single deployable workflow. The 'why does my prompt-only AP agent leak 15% on edge invoices?' question is the canonical exam distractor. ## The problem ### What the customer needs - Schema-conformant extraction on every invoice: vendor, number, line items, total, currency, due date, PO reference. No prose wrapping; downstream systems must parse cleanly. - Three-way match before approval: invoice, purchase order, goods receipt all agree on amount, vendor, and quantities. - Cap-policy enforcement that cannot be bypassed by clever invoice phrasing: vendor authorization caps, duplicate detection, blocklisted-vendor checks. - Audit-grade trail of every approval and rejection so finance can replay any decision in a quarterly close. ### Why naive approaches fail - Prompt 'output JSON' for invoice extraction: ~15% leakage on edge invoices (handwritten notes, mixed languages, credit memos, rotated scans). 
- Single-pass extraction with no semantic validation: line totals do not match the header; corrupted records ship downstream. - No three-way match: the agent approves an invoice for goods that were never received, or against a PO that does not exist. - Cap policy in the system prompt: ~3% of approvals exceed authorization cap because prompts leak under unusual phrasing. - No duplicate-invoice check: the same invoice number gets paid twice when the vendor re-sends after a delivery confirmation. ### Definition of done - Forced tool_choice: { type: 'tool', name: 'extract_invoice' } on every extraction call. - JSON schema requires vendor_id, invoice_number, line_items[], total_amount, currency (ISO 4217 enum), due_date (ISO 8601), PO_reference (nullable). - Validation-retry loop confirms sum(line_items) total, currency in enum, due_date >= invoice_date. - Three-way match service reconciles invoice + PO + GRN; variance > 2% routes to human review. - PreToolUse hook on approve_payment: deny on cap exceeded, vendor blocklisted, or duplicate (vendor_id, invoice_number) in the last 90 days. - PostToolUse audit log writes every approval / rejection / hook decision. ## Concepts in play - 🟢 **Structured outputs** (`structured-outputs`), Forced tool_use as the structural contract - 🟢 **Tool calling** (`tool-calling`), Schema lives in tools[0].input_schema - 🟢 **tool_choice** (`tool-choice`), Forced for guaranteed extraction - 🟢 **Evaluation** (`evaluation`), Semantic validation + three-way match - 🟢 **Hooks** (`hooks`), PreToolUse cap + duplicate gate; PostToolUse audit - 🟠 **Case-facts block** (`case-facts-block`), Vendor + invoice_number pinned across multi-page runs - 🟢 **Prompt caching** (`prompt-caching`), Schema + vendor master cached for sustained traffic - 🟢 **Batch API** (`batch-api`), Overnight bulk processing at 50% off ## Components ### Invoice JSON Schema, the contract, in tools[0].input_schema The output shape lives inside a tool definition, not as freeform text. Required: vendor_id, invoice_number, line_items[], total_amount, currency (ISO 4217 enum), due_date (ISO 8601 string). Optional and nullable: PO_reference, tax_amount, notes. Every numeric field has a minimum: 0. Every line item has description, quantity, unit_price, total. **Configuration:** tools = [{ name: 'extract_invoice', input_schema: { type: 'object', properties: { vendor_id: {type: 'string'}, invoice_number: {type: 'string'}, total_amount: {type: 'number', minimum: 0}, currency: {type: 'string', enum: ['USD', 'EUR', 'GBP', 'INR', 'JPY', 'unclear']}, due_date: {type: 'string', format: 'date'}, line_items: {type: 'array', items: {...}}, PO_reference: {type: ['string', 'null']} }, required: ['vendor_id', 'invoice_number', 'total_amount', 'currency', 'due_date', 'line_items'] } }] **Concept:** `structured-outputs` ### Forced tool_use Extractor, tool_choice: { type: 'tool', name: 'extract_invoice' } Forces the model to fire extract_invoice with arguments matching the schema. No prose preamble, no probabilistic adherence. Vision-capable invocation reads the PDF or image; the model emits a structured tool_use. Pair with few-shot examples that show currency: 'unclear' on truly ambiguous source. **Configuration:** tool_choice: { type: 'tool', name: 'extract_invoice' }. Use auto only on triage-style flows. Forced is for mandatory extraction. **Concept:** `tool-choice` ### Validation-Retry Loop, sum check, currency enum, date sanity Schema enforces shape. Code enforces meaning. 
After parse: sum(line_items[].total) == total_amount (within a 0.01 tolerance for FX rounding); currency in the enum; due_date format YYYY-MM-DD; due_date >= invoice_date. On failure, feed the specific error back to the model ('line totals sum to 4950 but header total is 5000'); typical convergence in 1-2 retries.

**Configuration:** loop: extract -> parse -> validate_semantically -> on failure, append { role: 'user', content: tool_result with is_error: true and a specific error } -> retry. Max retries: 3. After 3, route to human review.

**Concept:** `evaluation`

### Three-Way Match Service, invoice + PO + goods receipt

Queries the PO master and the goods-receipt ledger by PO_reference. Compares amount (variance <= 2% OK for FX rounding and small price changes), vendor identity (normalized vendor name fuzzy match), line-item count (must match), and date sanity (invoice date >= PO date; receipt date >= PO date). Variance above thresholds returns a structured exception; invoice is held pending human review.

**Configuration:** match(invoice, po, grn) -> { match: bool, variance_pct, mismatched_fields[], routed_to: 'auto-approve' | 'human-review' }. Threshold: amount variance > 2% -> human-review. Vendor mismatch -> human-review. Line-item count mismatch -> human-review.

**Concept:** `evaluation`

### PreToolUse Cap and Duplicate Hook, deterministic policy gate before approve_payment

Sits between the model's tool_use for approve_payment and actual execution. Reads tool_input.vendor_id, tool_input.amount, tool_input.invoice_number. Three checks. (1) Cap: vendor_ytd_spend + amount <= vendor_authorization_cap. (2) Blocklist: vendor not in the active blocklist. (3) Duplicate: no row in the audit log with the same (vendor_id, invoice_number) in the last 90 days. If any check fails, the hook exits 2 with a structured stderr message; the agent observes the deny as tool_result is_error: true and routes to a structured exception block for the AP analyst.

**Configuration:** matcher: 'approve_payment'. Hook exits 2 with stderr { reason: 'cap_exceeded' | 'vendor_blocklisted' | 'duplicate_detected', detail: ..., recommended_action: ... }. SDK forwards stderr to the model as a tool_result with is_error: true.

**Concept:** `hooks`

## Build steps

### 1. Author the invoice JSON schema as a tool definition

Define the output shape in tools[0].input_schema. Every required field listed in required[]. Currency is an enum that includes an 'unclear' escape hatch. PO_reference is ['string', 'null'] because cash invoices and credit memos have no PO. Every numeric field has minimum: 0. Line items are an array with description, quantity, unit_price, total. The schema is the contract; everything downstream depends on it being right.
**Python:** ```python from anthropic import Anthropic client = Anthropic() EXTRACT_INVOICE_TOOL = { "name": "extract_invoice", "description": "Extract a structured invoice record from a PDF or image.", "input_schema": { "type": "object", "properties": { "vendor_id": {"type": "string"}, "invoice_number": {"type": "string"}, "invoice_date": {"type": "string", "format": "date"}, "due_date": {"type": "string", "format": "date"}, "currency": { "type": "string", "enum": ["USD", "EUR", "GBP", "INR", "JPY", "unclear"], }, "total_amount": {"type": "number", "minimum": 0}, "tax_amount": {"type": ["number", "null"], "minimum": 0}, "PO_reference": {"type": ["string", "null"]}, "line_items": { "type": "array", "minItems": 1, "items": { "type": "object", "properties": { "description": {"type": "string"}, "quantity": {"type": "number", "minimum": 0}, "unit_price": {"type": "number", "minimum": 0}, "total": {"type": "number", "minimum": 0}, }, "required": ["description", "quantity", "unit_price", "total"], }, }, }, "required": [ "vendor_id", "invoice_number", "invoice_date", "due_date", "currency", "total_amount", "line_items", ], }, } ``` **TypeScript:** ```typescript import Anthropic from "@anthropic-ai/sdk"; const client = new Anthropic(); const EXTRACT_INVOICE_TOOL: Anthropic.Tool = { name: "extract_invoice", description: "Extract a structured invoice record from a PDF or image.", input_schema: { type: "object", properties: { vendor_id: { type: "string" }, invoice_number: { type: "string" }, invoice_date: { type: "string", format: "date" }, due_date: { type: "string", format: "date" }, currency: { type: "string", enum: ["USD", "EUR", "GBP", "INR", "JPY", "unclear"], }, total_amount: { type: "number", minimum: 0 }, tax_amount: { type: ["number", "null"], minimum: 0 }, PO_reference: { type: ["string", "null"] }, line_items: { type: "array", minItems: 1, items: { type: "object", properties: { description: { type: "string" }, quantity: { type: "number", minimum: 0 }, unit_price: { type: "number", minimum: 0 }, total: { type: "number", minimum: 0 }, }, required: ["description", "quantity", "unit_price", "total"], }, }, }, required: [ "vendor_id", "invoice_number", "invoice_date", "due_date", "currency", "total_amount", "line_items", ], }, }; ``` Concept: `structured-outputs` ### 2. Force tool_choice and run extraction with vision input Set tool_choice: { type: 'tool', name: 'extract_invoice' } so the model has no choice but to fire the tool with arguments matching the schema. Pass the invoice as a vision input (PDF page rasterized to image, or direct image upload). The model emits a structured tool_use; the harness extracts tool_use.input as the candidate record. 
**Python:** ```python import base64 def extract_invoice(invoice_image_bytes: bytes, mime_type: str = "image/png") -> dict: image_b64 = base64.b64encode(invoice_image_bytes).decode("ascii") resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=2048, tools=[EXTRACT_INVOICE_TOOL], tool_choice={"type": "tool", "name": "extract_invoice"}, messages=[ { "role": "user", "content": [ { "type": "image", "source": {"type": "base64", "media_type": mime_type, "data": image_b64}, }, {"type": "text", "text": "Extract this invoice into the schema."}, ], } ], ) for block in resp.content: if block.type == "tool_use" and block.name == "extract_invoice": return block.input raise RuntimeError("forced tool_choice did not yield tool_use") ``` **TypeScript:** ```typescript async function extractInvoice( invoiceBytes: Uint8Array, mimeType: "image/png" | "image/jpeg" = "image/png", ) { const imageB64 = Buffer.from(invoiceBytes).toString("base64"); const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 2048, tools: [EXTRACT_INVOICE_TOOL], tool_choice: { type: "tool", name: "extract_invoice" }, messages: [ { role: "user", content: [ { type: "image", source: { type: "base64", media_type: mimeType, data: imageB64 }, }, { type: "text", text: "Extract this invoice into the schema." }, ], }, ], }); for (const block of resp.content) { if (block.type === "tool_use" && block.name === "extract_invoice") { return block.input as Record; } } throw new Error("forced tool_choice did not yield tool_use"); } ``` Concept: `tool-choice` ### 3. Wrap extraction in a validation-retry loop Schema guarantees structure; semantics need code. After parsing, validate: sum(line_items[].total) equals total_amount within 0.01 tolerance; currency in the enum; due_date format and >= invoice_date. On failure, feed the specific error back via tool_result with is_error: true so the model sees what was wrong; retry up to 3 times. Most failures converge in 1-2 retries because the model now knows what the validator rejected. 
**Python:**

```python
from datetime import date

def validate(record: dict) -> list[str]:
    errors = []
    items_sum = sum(it.get("total", 0) for it in record.get("line_items", []))
    if abs(items_sum - record.get("total_amount", 0)) > 0.01:
        errors.append(
            f"line items sum to {items_sum:.2f} but total_amount is "
            f"{record.get('total_amount', 0):.2f}; reconcile"
        )
    if record.get("currency") not in {"USD", "EUR", "GBP", "INR", "JPY", "unclear"}:
        errors.append(f"currency {record.get('currency')!r} not in ISO 4217 enum")
    try:
        inv_date = date.fromisoformat(record.get("invoice_date", ""))
        due_date = date.fromisoformat(record.get("due_date", ""))
        if due_date < inv_date:
            errors.append(
                f"due_date {due_date} is before invoice_date {inv_date}"
            )
    except ValueError as e:
        errors.append(f"date parse failed: {e}")
    return errors

def extract_with_retry(invoice_image_bytes: bytes, max_retries: int = 3) -> dict:
    image_b64 = base64.b64encode(invoice_image_bytes).decode("ascii")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "Extract this invoice into the schema."},
        ],
    }]
    for attempt in range(max_retries):
        resp = client.messages.create(
            model="claude-sonnet-4.5",
            max_tokens=2048,
            tools=[EXTRACT_INVOICE_TOOL],
            tool_choice={"type": "tool", "name": "extract_invoice"},
            messages=messages,
        )
        tool_use = next(b for b in resp.content if b.type == "tool_use")
        record = tool_use.input
        errors = validate(record)
        if not errors:
            return record
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": "Validation failed: " + "; ".join(errors) + ". Re-extract.",
                "is_error": True,
            }],
        })
    raise ValueError(f"extraction did not converge in {max_retries} attempts")
```

**TypeScript:**

```typescript
function validate(record: Record<string, any>): string[] {
  const errors: string[] = [];
  const itemsSum = (record.line_items as Array<{ total: number }>)?.reduce(
    (s, it) => s + (it.total ?? 0),
    0,
  ) ?? 0;
  if (Math.abs(itemsSum - (record.total_amount ?? 0)) > 0.01) {
    errors.push(
      `line items sum to ${itemsSum.toFixed(2)} but total_amount is ` +
      `${(record.total_amount ?? 0).toFixed(2)}; reconcile`,
    );
  }
  const validCurrencies = new Set(["USD", "EUR", "GBP", "INR", "JPY", "unclear"]);
  if (!validCurrencies.has(record.currency)) {
    errors.push(`currency ${JSON.stringify(record.currency)} not in ISO 4217 enum`);
  }
  try {
    const inv = new Date(record.invoice_date);
    const due = new Date(record.due_date);
    if (due < inv) {
      errors.push(`due_date ${record.due_date} is before invoice_date ${record.invoice_date}`);
    }
  } catch (e) {
    errors.push(`date parse failed: ${(e as Error).message}`);
  }
  return errors;
}
```

Concept: `evaluation`

### 4. Run a three-way match against PO and goods receipt

Query the PO master by PO_reference and the goods-receipt ledger by the same key. Compare amount (variance <= 2% OK for FX rounding and minor price changes), vendor identity (normalized fuzzy match on vendor name), and line-item count (must match exactly). Variance above any threshold routes to human review with a structured exception block; otherwise auto-proceed.
**Python:** ```python def three_way_match(invoice: dict, po: dict, grn: dict) -> dict: """Reconcile invoice with purchase order and goods receipt note.""" issues = [] inv_amount = invoice["total_amount"] po_amount = po.get("total_amount", 0) if po_amount > 0: variance_pct = abs(inv_amount - po_amount) / po_amount * 100 if variance_pct > 2.0: issues.append( f"amount variance {variance_pct:.2f}% exceeds 2% threshold" ) if normalize_vendor(invoice["vendor_id"]) != normalize_vendor(po["vendor_id"]): issues.append( f"vendor mismatch: invoice {invoice['vendor_id']!r} " f"vs PO {po['vendor_id']!r}" ) if len(invoice["line_items"]) != len(grn.get("line_items", [])): issues.append( f"line-item count mismatch: invoice {len(invoice['line_items'])} " f"vs GRN {len(grn.get('line_items', []))}" ) return { "match": len(issues) == 0, "issues": issues, "routed_to": "auto-approve" if not issues else "human-review", } def normalize_vendor(name: str) -> str: return "".join(ch.lower() for ch in name if ch.isalnum()) ``` **TypeScript:** ```typescript interface Invoice { vendor_id: string; total_amount: number; line_items: unknown[]; } interface PO { vendor_id: string; total_amount: number; } interface GRN { line_items: unknown[]; } export function threeWayMatch(invoice: Invoice, po: PO, grn: GRN) { // Reconcile invoice with purchase order and goods receipt note. const issues: string[] = []; if (po.total_amount > 0) { const variancePct = (Math.abs(invoice.total_amount - po.total_amount) / po.total_amount) * 100; if (variancePct > 2.0) { issues.push(`amount variance ${variancePct.toFixed(2)}% exceeds 2% threshold`); } } if (normalizeVendor(invoice.vendor_id) !== normalizeVendor(po.vendor_id)) { issues.push( `vendor mismatch: invoice ${JSON.stringify(invoice.vendor_id)} ` + `vs PO ${JSON.stringify(po.vendor_id)}`, ); } if (invoice.line_items.length !== (grn.line_items?.length ?? 0)) { issues.push( `line-item count mismatch: invoice ${invoice.line_items.length} ` + `vs GRN ${grn.line_items?.length ?? 0}`, ); } return { match: issues.length === 0, issues, routed_to: issues.length === 0 ? "auto-approve" : "human-review", }; } function normalizeVendor(name: string): string { return name.toLowerCase().replace(/[^a-z0-9]/g, ""); } ``` Concept: `evaluation` ### 5. Wire the PreToolUse cap and duplicate-detection hook Hook on approve_payment. Three checks. (1) Cap: vendor_ytd_spend + amount <= vendor_authorization_cap. (2) Blocklist: vendor not on the active blocklist. (3) Duplicate: no audit-log row with the same (vendor_id, invoice_number) in the last 90 days. Any check fails and the hook exits 2 with a structured stderr message; the agent observes the deny and routes to an exception block for the AP analyst. Deterministic, no prompt-injection bypass. 
**Python:** ```python # .claude/hooks/invoice_approval.py import sys, json, os, sqlite3 from datetime import date, timedelta DB = sqlite3.connect(os.environ.get("AUDIT_DB", "audit.sqlite3")) def vendor_cap_check(vendor_id: str, amount: float) -> str | None: row = DB.execute( "SELECT cap, ytd_spend FROM vendor_master WHERE vendor_id = ?", (vendor_id,), ).fetchone() if not row: return f"vendor {vendor_id!r} not in master; escalate" cap, ytd = row if ytd + amount > cap: remaining = cap - ytd return ( f"vendor cap exceeded: ytd_spend={ytd:.2f} + amount={amount:.2f} > " f"cap={cap:.2f}; cap_remaining={remaining:.2f}" ) return None def blocklist_check(vendor_id: str) -> str | None: row = DB.execute( "SELECT 1 FROM vendor_blocklist WHERE vendor_id = ?", (vendor_id,) ).fetchone() if row: return f"vendor {vendor_id!r} on active blocklist" return None def duplicate_check(vendor_id: str, invoice_number: str) -> str | None: cutoff = (date.today() - timedelta(days=90)).isoformat() row = DB.execute( "SELECT approved_at FROM audit_log WHERE vendor_id = ? " "AND invoice_number = ? AND approved_at >= ? ORDER BY approved_at DESC LIMIT 1", (vendor_id, invoice_number, cutoff), ).fetchone() if row: return ( f"duplicate detected: same (vendor_id, invoice_number) approved on " f"{row[0]}; reject this submission" ) return None def main(): payload = json.loads(sys.stdin.read()) if payload["tool_name"] != "approve_payment": sys.exit(0) inp = payload["tool_input"] for check in ( vendor_cap_check(inp["vendor_id"], inp["amount"]), blocklist_check(inp["vendor_id"]), duplicate_check(inp["vendor_id"], inp["invoice_number"]), ): if check: print(check, file=sys.stderr) sys.exit(2) sys.exit(0) if __name__ == "__main__": main() ``` **TypeScript:** ```typescript // .claude/hooks/invoice-approval.ts import { readFileSync } from "node:fs"; import Database from "better-sqlite3"; const db = new Database(process.env.AUDIT_DB ?? "audit.sqlite3"); function vendorCapCheck(vendorId: string, amount: number): string | null { const row = db .prepare("SELECT cap, ytd_spend FROM vendor_master WHERE vendor_id = ?") .get(vendorId) as { cap: number; ytd_spend: number } | undefined; if (!row) return `vendor ${JSON.stringify(vendorId)} not in master; escalate`; if (row.ytd_spend + amount > row.cap) { const remaining = row.cap - row.ytd_spend; return ( `vendor cap exceeded: ytd_spend=${row.ytd_spend.toFixed(2)} + ` + `amount=${amount.toFixed(2)} > cap=${row.cap.toFixed(2)}; ` + `cap_remaining=${remaining.toFixed(2)}` ); } return null; } function blocklistCheck(vendorId: string): string | null { const row = db.prepare("SELECT 1 FROM vendor_blocklist WHERE vendor_id = ?").get(vendorId); return row ? `vendor ${JSON.stringify(vendorId)} on active blocklist` : null; } function duplicateCheck(vendorId: string, invoiceNumber: string): string | null { const cutoff = new Date(Date.now() - 90 * 86400_000).toISOString().slice(0, 10); const row = db .prepare( "SELECT approved_at FROM audit_log WHERE vendor_id = ? AND invoice_number = ? " + "AND approved_at >= ? ORDER BY approved_at DESC LIMIT 1", ) .get(vendorId, invoiceNumber, cutoff) as { approved_at: string } | undefined; return row ? 
`duplicate detected: same (vendor_id, invoice_number) approved on ${row.approved_at}; reject this submission` : null; } const payload = JSON.parse(readFileSync(0, "utf8")); if (payload.tool_name !== "approve_payment") process.exit(0); const inp = payload.tool_input; for (const check of [ vendorCapCheck(inp.vendor_id, inp.amount), blocklistCheck(inp.vendor_id), duplicateCheck(inp.vendor_id, inp.invoice_number), ]) { if (check) { process.stderr.write(check + "\n"); process.exit(2); } } process.exit(0); ``` Concept: `hooks` ### 6. Cache the schema and the vendor master The schema is the largest stable token cost (~1500 tokens for invoice extraction). The vendor master (caps, blocklist, name normalization rules) is also stable per session. Mark both with cache_control: ephemeral so a 5-minute TTL keeps them warm across sustained AP traffic. Realistic savings: ~80% on cached portions, ~50% reduction on overall steady-state cost. **Python:** ```python def extract_with_cache(invoice_image_bytes: bytes, vendor_master_blob: str) -> dict: image_b64 = base64.b64encode(invoice_image_bytes).decode("ascii") resp = client.messages.create( model="claude-sonnet-4.5", max_tokens=2048, system=[ { "type": "text", "text": ( "You are an AP-automation extraction agent. Return only " "structured tool_use; never prose." ), "cache_control": {"type": "ephemeral"}, }, { "type": "text", "text": vendor_master_blob, "cache_control": {"type": "ephemeral"}, }, ], tools=[ {**EXTRACT_INVOICE_TOOL, "cache_control": {"type": "ephemeral"}}, ], tool_choice={"type": "tool", "name": "extract_invoice"}, messages=[{ "role": "user", "content": [ {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}}, {"type": "text", "text": "Extract this invoice into the schema."}, ], }], ) print(f"cache_creation: {resp.usage.cache_creation_input_tokens}") print(f"cache_read: {resp.usage.cache_read_input_tokens}") return next(b.input for b in resp.content if b.type == "tool_use") ``` **TypeScript:** ```typescript async function extractWithCache( invoiceBytes: Uint8Array, vendorMasterBlob: string, ) { const imageB64 = Buffer.from(invoiceBytes).toString("base64"); const resp = await client.messages.create({ model: "claude-sonnet-4.5", max_tokens: 2048, system: [ { type: "text", text: "You are an AP-automation extraction agent. Return only structured " + "tool_use; never prose.", cache_control: { type: "ephemeral" }, }, { type: "text", text: vendorMasterBlob, cache_control: { type: "ephemeral" }, }, ], tools: [ { ...EXTRACT_INVOICE_TOOL, cache_control: { type: "ephemeral" } }, ], tool_choice: { type: "tool", name: "extract_invoice" }, messages: [ { role: "user", content: [ { type: "image", source: { type: "base64", media_type: "image/png", data: imageB64 }, }, { type: "text", text: "Extract this invoice into the schema." }, ], }, ], }); console.log(`cache_creation: ${resp.usage.cache_creation_input_tokens}`); console.log(`cache_read: ${resp.usage.cache_read_input_tokens}`); const tu = resp.content.find((b) => b.type === "tool_use"); return tu?.type === "tool_use" ? tu.input : null; } ``` Concept: `prompt-caching` ### 7. Use Batch API for overnight bulk runs Sync API for inbox-arrival latency. For nightly backfills (10K invoices), the Batch API gives a flat 50% discount with a 24-hour SLA. Combined with schema and vendor-master caching (per-100-item sub-batches keep ephemeral cache warm), bulk extraction cost drops ~75% versus naive sync. 
Resubmit failures the next morning as a fresh batch, including the specific error in the retry message. **Python:** ```python def submit_bulk_extraction(invoices: list[dict]) -> str: """Submit a batch of invoice extractions for overnight processing.""" requests = [] for inv in invoices: image_b64 = base64.b64encode(inv["image_bytes"]).decode("ascii") requests.append({ "custom_id": f"extract-{inv['id']}", "params": { "model": "claude-sonnet-4.5", "max_tokens": 2048, "tools": [EXTRACT_INVOICE_TOOL], "tool_choice": {"type": "tool", "name": "extract_invoice"}, "messages": [{ "role": "user", "content": [ {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}}, {"type": "text", "text": "Extract this invoice into the schema."}, ], }], }, }) batch = client.messages.batches.create(requests=requests) print(f"Batch {batch.id} submitted with {len(requests)} extractions") return batch.id def harvest_batch(batch_id: str): batch = client.messages.batches.retrieve(batch_id) if batch.processing_status != "ended": return {"status": "not_ready"} accepted, rejected = [], [] for r in client.messages.batches.results(batch_id): if r.result.type == "succeeded": tu = next(b for b in r.result.message.content if b.type == "tool_use") if not validate(tu.input): accepted.append(tu.input) continue rejected.append(r.custom_id) return {"accepted": accepted, "rejected_for_retry": rejected} ``` **TypeScript:** ```typescript async function submitBulkExtraction( invoices: Array<{ id: string; image_bytes: Uint8Array }>, ) { const requests = invoices.map((inv) => { const imageB64 = Buffer.from(inv.image_bytes).toString("base64"); return { custom_id: `extract-${inv.id}`, params: { model: "claude-sonnet-4.5", max_tokens: 2048, tools: [EXTRACT_INVOICE_TOOL], tool_choice: { type: "tool", name: "extract_invoice" } as const, messages: [ { role: "user" as const, content: [ { type: "image", source: { type: "base64", media_type: "image/png", data: imageB64 }, }, { type: "text", text: "Extract this invoice into the schema." }, ], }, ], }, }; }); const batch = await client.messages.batches.create({ requests }); console.log(`Batch ${batch.id} submitted with ${requests.length} extractions`); return batch.id; } async function harvestBatch(batchId: string) { const batch = await client.messages.batches.retrieve(batchId); if (batch.processing_status !== "ended") return { status: "not_ready" }; const accepted: unknown[] = []; const rejected: string[] = []; for await (const r of client.messages.batches.results(batchId)) { if (r.result.type === "succeeded") { const tu = r.result.message.content.find((b) => b.type === "tool_use"); if ( tu?.type === "tool_use" && validate(tu.input as Record<string, unknown>).length === 0 ) { accepted.push(tu.input); continue; } } rejected.push(r.custom_id); } return { accepted, rejected_for_retry: rejected }; } ``` Concept: `batch-api` ### 8. Audit-log every approval, rejection, and hook decision PostToolUse hook on every approve_payment call. Append a row to durable storage: timestamp, vendor_id, invoice_number, amount, currency, three-way-match outcome, hook decisions (cap, blocklist, duplicate), final routing (approved | human-review | denied). Retain at least 7 years for audit compliance. The audit log is the replay tool when finance asks 'why did we approve this in May?' three months later.
**Python:** ```python import datetime, json, sqlite3 AUDIT_DB = sqlite3.connect("audit.sqlite3") AUDIT_DB.execute(""" CREATE TABLE IF NOT EXISTS audit_log ( ts TEXT PRIMARY KEY, vendor_id TEXT, invoice_number TEXT, amount REAL, currency TEXT, match_outcome TEXT, hook_decisions TEXT, final_routing TEXT, approved_at TEXT ) """) def audit(invoice: dict, match_result: dict, hook_decisions: dict, routing: str): AUDIT_DB.execute( "INSERT INTO audit_log (ts, vendor_id, invoice_number, amount, currency, " "match_outcome, hook_decisions, final_routing, approved_at) " "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", ( datetime.datetime.utcnow().isoformat() + "Z", invoice["vendor_id"], invoice["invoice_number"], invoice["total_amount"], invoice["currency"], json.dumps(match_result), json.dumps(hook_decisions), routing, datetime.date.today().isoformat() if routing == "approved" else None, ), ) AUDIT_DB.commit() ``` **TypeScript:** ```typescript import Database from "better-sqlite3"; const auditDb = new Database("audit.sqlite3"); auditDb.exec(` CREATE TABLE IF NOT EXISTS audit_log ( ts TEXT PRIMARY KEY, vendor_id TEXT, invoice_number TEXT, amount REAL, currency TEXT, match_outcome TEXT, hook_decisions TEXT, final_routing TEXT, approved_at TEXT ) `); interface InvoiceRecord { vendor_id: string; invoice_number: string; total_amount: number; currency: string; } export function audit( invoice: InvoiceRecord, matchResult: Record<string, unknown>, hookDecisions: Record<string, unknown>, routing: "approved" | "human-review" | "denied", ) { auditDb .prepare( "INSERT INTO audit_log (ts, vendor_id, invoice_number, amount, currency, " + "match_outcome, hook_decisions, final_routing, approved_at) " + "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", ) .run( new Date().toISOString(), invoice.vendor_id, invoice.invoice_number, invoice.total_amount, invoice.currency, JSON.stringify(matchResult), JSON.stringify(hookDecisions), routing, routing === "approved" ? new Date().toISOString().slice(0, 10) : null, ); } ``` Concept: `evaluation` ## Decision matrix | Decision | Right answer | Wrong answer | Why | |---|---|---|---| | Output shape guarantee on extraction | Forced tool_choice with input_schema as the contract | Prompt 'output JSON' or 'respond with valid JSON only' | Prompt-only is probabilistic (~85% adherence); ~15% leakage on edge invoices (handwritten notes, mixed languages, credit memos, rotated scans). Forced tool_use is structural (100% adherence). The cost is identical; the reliability gap is decisive in finance. | | Vendor authorization cap enforcement | PreToolUse hook reads vendor_ytd_spend, exits 2 on violation | System prompt: 'never approve above the vendor cap' | Prompts leak ~3% in production. Hooks are deterministic. For policy-bearing limits (cap, duplicate, blocklist), the deterministic gate is the only credible architecture. Prompt-only enforcement is a finding waiting to be flagged in the next audit. | | Same invoice arriving twice (vendor re-sends after delivery) | PreToolUse duplicate-detection hook keyed on (vendor_id, invoice_number) over last 90 days | Trust the model to notice duplicates in conversation context | Context memory is unreliable across multi-turn or batch runs. The hook is stateless, queries the audit log, and prevents race conditions when two parallel extractions hit the same invoice within seconds. | | Bulk overnight processing of 10K invoices | Batch API + schema and vendor-master caching | Sync API in a tight loop or sync API without caching | Batch API gives a flat 50% discount with a 24-hour SLA.
Caching adds another ~80% off the schema and vendor-master tokens. Combined: ~75% savings versus naive sync. Sync API is reserved for inbox-arrival latency. | ## Failure modes | Anti-pattern | Failure | Fix | |---|---|---| | AP-INV-01 · Prompt-only field extraction | Prompt 'extract this invoice as JSON' leaks ~15% on edge invoices. Downstream parser breaks every seventh document; AP analyst spends the morning re-keying invoices the agent botched. | Forced tool_choice: { type: 'tool', name: 'extract_invoice' } plus a strict JSON schema in tools[0].input_schema. The model has no choice but to fire the tool with arguments matching the schema. 100% structural adherence. | | AP-INV-02 · No semantic validation | Single-pass extraction with no math check. The model returns a structurally-valid record where line totals sum to 4950 but the header total says 5000. Bad data ships downstream; quarterly close finds the discrepancy three months later. | Validation-retry loop. After parse, validate sum(line_items[].total) == total_amount (within 0.01 tolerance), currency in ISO 4217 enum, due_date >= invoice_date. On failure, feed the specific error back; retry up to 3 times; route to human review if still failing. | | AP-INV-03 · No three-way match | Agent approves an invoice that has no matching purchase order, or where the goods receipt was for fewer items, or where the vendor name on the invoice does not match the vendor on the PO. AP pays for goods never received, or pays the wrong vendor. | Three-way match service queries PO master and goods-receipt ledger. Compares amount (variance <= 2% OK), normalized vendor name, line-item count. Variance above thresholds routes to human review with a structured exception block. | | AP-INV-04 · Cap policy in the system prompt | System prompt: 'never approve more than the vendor authorization cap'. Production logs show ~3% of approvals exceed the cap because the prompt language leaks under unusual phrasing or when the agent is processing many invoices in one session. | PreToolUse hook on approve_payment reads tool_input.vendor_id and tool_input.amount, queries the vendor master for vendor_ytd_spend + cap, exits 2 on violation with a structured message including cap_remaining. Deterministic, not probabilistic. | | AP-INV-05 · No duplicate-invoice check | Vendor re-sends the same invoice number after delivery confirmation, or the same invoice is uploaded twice through different channels (email + portal). The agent approves both. AP discovers the duplicate payment in next month's reconciliation. | PreToolUse hook queries the audit log for any row with the same (vendor_id, invoice_number) in the last 90 days. On match, exits 2 with the prior approval date. Stateless, auditable, prevents race conditions in parallel runs.
| ## Implementation checklist - [ ] Invoice JSON schema lives in tools[0].input_schema with required and nullable fields explicit (`structured-outputs`) - [ ] tool_choice forced to extract_invoice on every extraction call (`tool-choice`) - [ ] Currency field is an enum with an 'unclear' escape hatch (`structured-outputs`) - [ ] Validation-retry loop with sum check, currency enum, date sanity (`evaluation`) - [ ] Three-way match service against PO master and goods-receipt ledger (`evaluation`) - [ ] PreToolUse cap-and-duplicate hook on approve_payment (`hooks`) - [ ] Schema and vendor master cached with cache_control: ephemeral (`prompt-caching`) - [ ] Batch API for nightly bulk runs (greater than 100 invoices) (`batch-api`) - [ ] PostToolUse audit log writes every approval, rejection, and hook decision; 7-year retention - [ ] Stratified accuracy reporting by vendor, currency, document type - [ ] Human-review queue for invoices that fail validation, three-way match, or hook checks ## Cost & latency - **Per-invoice synchronous extraction (cached schema):** ~$0.001 to $0.003, Schema ~1500 tokens at cache-read price plus image vision tokens (~1000-2000) plus ~150 output. Sustained AP traffic with cache hits >= 70% drops effective cost predictably. - **Three-way match service:** ~$0 token cost; ~10-30 ms latency, Pure SQL queries against PO master and goods-receipt ledger. No LLM call. Latency is dominated by the database round-trip. - **PreToolUse hook overhead:** ~$0; ~5 ms latency, Subprocess reads stdin JSON, runs three SQL queries (vendor cap, blocklist, duplicate), exits 0 or 2. No LLM call. Latency below the noise floor of any tool dispatch. - **Batch overnight (10K invoices, batch + caching):** ~75% off naive sync, Batch API flat 50% discount times schema and vendor-master cache (~80% off cached portion). 10K invoices at typical complexity drop from ~$30 sync uncached to ~$8 batch cached. - **Validation-retry overhead:** ~+25% on records that retry, 5-10% of records retry once; 1-2% retry twice. Specific-error feedback converges quickly. Pipeline cost up ~5% to gain ~99% schema-conformance plus ~99% semantic-conformance. - **Per-1000-invoices total (steady state):** ~$1.00 to $3.00, Sync cached extraction at scale. Adding human review of unconverged records adds operator-time cost but recovers the long tail of edge invoices. ## Domain weights - **D2 · Tool Design + Integration (18%):** Invoice schema. Forced tool_choice. PreToolUse cap and duplicate hook. Three-way match contract. - **D5 · Context + Reliability (15%):** Schema caching. Vendor master caching. Multi-page invoice CASE_FACTS. Batch API integration. ## Practice questions ### Q1. Your invoice extraction agent uses prompt-only extraction. Production logs show ~15% of records arrive with prose wrapping ('Sure, here is the JSON:') and the downstream parser breaks. What is the architectural fix? Move the schema into tools[0].input_schema and set tool_choice: { type: 'tool', name: 'extract_invoice' }. This forces the model to emit a structured tool_use call matching the schema. No prose wrapping, no preamble, no probabilistic adherence. The 15% leak collapses to 0% because the SDK rejects anything that does not match the schema. Tagged to AP-INV-01. ### Q2. You notice line-item totals do not match the invoice header total. The model extracted all items correctly but the math is wrong. How do you prevent this from shipping bad data downstream? Add a validation-retry loop. After parsing, compute sum(line_items[].total) in code. 
If it differs from total_amount by more than 0.01, send the specific error back to the model in a tool_result with is_error: true ('line items sum to 4950 but header total says 5000'); retry up to 3 times. About 95% of records converge on the second attempt because the model now sees what was wrong. Pair with currency enum and date sanity checks. Tagged to AP-INV-02. ### Q3. An invoice arrives with no matching purchase order in your system. The vendor claims the PO was issued verbally last quarter. Should the agent approve based on the vendor reputation? No. The three-way match service must find an active PO before the agent can approve. No PO means the invoice routes to human review with a structured exception block; the AP analyst either creates a retroactive PO and reprocesses, or rejects the invoice. The agent never bypasses the three-way match; verbal POs are not a valid input to the workflow. Tagged to AP-INV-03. ### Q4. Your system prompt says 'never approve invoices above the vendor authorization cap'. Production logs show ~3% of approvals still exceed the cap. What is the architectural fix? Move the constraint to a PreToolUse hook on approve_payment. The hook reads tool_input.vendor_id and tool_input.amount, queries the vendor master for vendor_ytd_spend + cap, exits 2 on violation with a structured stderr message including cap_remaining. Deterministic, not probabilistic. Pair with blocklist and duplicate checks in the same hook. Prompts leak; hooks do not. Tagged to AP-INV-04. ### Q5. A vendor re-sends the same invoice 30 days later because they did not see the payment confirmation. How does your agent prevent paying the same invoice twice? PreToolUse hook on approve_payment queries the audit log for any row with the same (vendor_id, invoice_number) in the last 90 days. On match, exits 2 with the prior approval date and routes to a structured exception block. The check is stateless, auditable, and prevents race conditions when two parallel extractions hit the same invoice within seconds. The 90-day window is a configurable policy. Tagged to AP-INV-05. ## FAQ ### Q1. How do you handle handwritten or scanned invoices with poor image quality? Vision-capable extraction handles most cases. For edge invoices (rotated scans, faded ink, handwritten amendments), the validation-retry loop catches arithmetic mismatches and the three-way match catches structural issues. Records that fail after 3 retries route to human review with the original image attached. Stratified accuracy reporting by document-type quickly surfaces vendors whose invoices need a layout-aware preprocessing step. ### Q2. What happens if a vendor has multiple naming variations (Apple Inc, APPLE, Apple, Inc.)? The vendor master holds the canonical vendor_id and a list of name variations. The extraction schema requires the model to extract the vendor as text; a normalization step (lowercase, strip punctuation, fuzzy match against the vendor master) resolves it to a vendor_id. The duplicate-detection hook keys on vendor_id, not the raw name, so naming variation does not break uniqueness. ### Q3. Can the agent process multi-currency invoices in one workflow? Yes. The schema enforces currency as an ISO 4217 enum. The cap policy and duplicate detection key on vendor_id and amount in the invoice currency; the cap can be denominated per-vendor in the vendor master. For consolidated reporting, a daily FX-rate table converts to a base currency at audit-log write time. ### Q4. How do you handle credit memos (negative invoices)? 
Credit memos use the same schema with total_amount representing the credit (positive number) and a separate document_type enum field that distinguishes invoice from credit_memo. The PreToolUse hook treats credit memos as vendor_ytd_spend - amount (effectively decreasing YTD spend). Three-way match runs against the original invoice and the credit-memo reason code instead of a PO and GRN. ### Q5. Should the agent auto-approve, or always route to human review? Auto-approve only when all gates pass: schema valid, semantic validation passed, three-way match within thresholds, PreToolUse hook approved (cap, blocklist, duplicate). Any failure routes to human review with a structured exception block. Auto-approval rate at steady state is typically 75-85%; the remaining 15-25% needs an analyst's eye. The point of the agent is not to remove the analyst; it is to make the analyst's queue much smaller and every queued invoice well-explained. ### Q6. How long do you retain the audit log? At least 7 years for financial-record compliance (US SOX, EU equivalent). Append-only schema; immutable rows; indexed by vendor_id, invoice_number, and date. Replay tool reconstructs any approval decision in seconds when finance asks 'why did we approve this in May?' three months later. ### Q7. Is Batch API worth using for fewer than 1000 invoices a night? Sometimes. Batch API gives a flat 50% discount but with a 24-hour SLA. For under 500 invoices, the Batch overhead and the latency may not be worth it; sync extraction is cheaper end-to-end when AP needs same-day processing. For nightly backfills of historical invoices or large vendor consolidations (more than 1000 documents), Batch API earns its keep. ## Production readiness - [ ] JSON schema versioned in source control; PR-reviewed before deploy - [ ] Vendor master kept current; cap and blocklist updates flow through change control - [ ] Validation-retry loop unit-tested for line-total mismatch, currency drift, date inversion - [ ] Three-way match service tested against synthetic PO + GRN cases including 1.9% and 2.1% variance edge cases - [ ] PreToolUse hook unit-tested for cap exceeded, blocklisted vendor, duplicate within 90 days, all three pass - [ ] PostToolUse audit log retention confirmed at 7 years; index on (vendor_id, invoice_number, date) - [ ] Schema cache hit rate monitored; alert if drops below 50% - [ ] Stratified accuracy dashboard by vendor and document type; alert on any vendor below 90% pass rate - [ ] Human-review queue with SLA documented and on-call for invoices held more than 48 hours - [ ] Batch API job for nightly backfill with auto-resubmit on transient failures --- **Source:** https://claudearchitectcertification.com/scenarios/invoice-processing-agent **Vault sources:** P3.6 structured-data-extraction (canonical extraction patterns); P3.7 agentic-tool-design (PreToolUse hook lifecycle); P3.8 long-document-processing (multi-page invoice handling); P3.5 claude-code-for-cicd (Batch API for bulk runs); concepts/vision-multimodal (PDF and image input); concepts/structured-outputs (forced tool_choice patterns); concepts/hooks (PreToolUse + PostToolUse policy enforcement) **Last reviewed:** 2026-05-05 **Evidence tiers**, 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed. --- # Subagents: Context Management + Task Delegation > Subagents are specialized helpers that Claude Code spawns into their own context window to handle a focused task and return only a summary. 
They keep your main conversation clean by isolating tool calls, file reads, and search noise. Use built-in subagents (Explore, Plan, General) or define custom ones with scoped system prompts and tool whitelists. **Domain:** D1 · Agentic Architectures (27%) **Difficulty:** intro **Skilljar course:** Introduction to Subagents (4 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/subagents-intro **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 27% (D1) Direct prep for D1 task statements on agent-to-subagent delegation, hub-and-spoke topology, and context isolation. Subagents appear in P3.3 multi-agent-research and the developer-productivity scenario. ## What you'll learn - Why subagents exist and what problem they solve in long Claude Code sessions - How to invoke a built-in subagent vs spawn a custom one from a config file - How to write a focused system prompt + tool whitelist that keeps a subagent on-task - When to use a subagent and when to keep work inline in the main thread ## Prerequisites - **Agentic loops (concept)** (concepts · `agentic-loops`) - **Context window (concept)** (concepts · `context-window`) ## Lesson outline ### 1. What are subagents? Specialized helpers that run in their own context and return only a summary, isolating tool noise from your main thread. ### 2. Creating a subagent Define a markdown config with name, description, tools, and system prompt; place it in .claude/agents/ to make it discoverable. ### 3. Designing effective subagents Scope tightly: focused role, narrow tool whitelist, explicit output format. Broad agents drift; narrow ones land. ### 4. Using subagents effectively Invoke explicitly when you want isolation, accept the visibility tradeoff, and aggregate findings in the parent thread. ## Our simplification Subagents exist because every Claude Code session has a finite context window, and every tool call, file read, and search result fills that window with noise. After 50 turns of debugging, you might need to scroll past dozens of intermediate tool outputs to find the answer Claude actually gave you. Subagents fix this by running their own conversation in a separate window, doing their work, and returning only a summary to the main thread. The intermediate work is discarded; you keep the answer. Mechanically, a subagent receives two inputs at spawn: a custom system prompt (defining role and behavior) and a task description (written by the parent agent based on the user's request). It then runs its own loop with whatever tools it has access to, isolated from the parent's context. When it returns, only the summary lands in the main thread. The parent never sees the intermediate steps, which is both the value (cleaner context) and the limitation (less debuggability if the subagent reaches a wrong answer). Claude Code ships three built-in subagents: General (multi-step tasks needing both exploration and action), Explore (fast search and navigation), and Plan (used during plan mode for research before producing a plan). For most tasks, the built-ins are enough - Claude picks which one to use automatically when you ask the right shape of question. The customization story matters when your team has recurring specialist patterns - a code reviewer, a test writer, a docs generator - that warrant a fixed config. Custom subagents live as markdown files in .claude/agents/ (or ~/.claude/agents/ for personal). 
Each file declares: a name, a description (which Claude reads to decide when to invoke), a tool whitelist, and a system prompt. The tool whitelist is load-bearing: a code-reviewer subagent should have [Read, Grep, Glob] only, never [Edit, Write]. Tool overscoping is the canonical anti-pattern. Claude is allowed to call any whitelisted tool but will never call one outside the list, so the whitelist is your safety contract. Designing effective subagents comes down to three rules: scope tightly (one role, not three), constrain tools (minimum needed for the role), and define output format (so the parent can aggregate cleanly). A subagent without an output schema wanders. A subagent with [Read, Grep, Glob, Bash, Edit, Write] is just another general-purpose agent in disguise. The narrow ones land; the broad ones drift, every time. When you should *not* use a subagent: short tasks where the noise is minimal, exploratory work where you want to see the journey, or anything where you'll need to debug the path to the answer. The clean-context benefit is also a visibility cost - once the subagent returns, you cannot ask it follow-up questions, and the intermediate state is gone. For inline reasoning that should remain in the parent thread, just stay inline. Subagents are for delegation, not micro-isolation. ## Patterns ### 3 anti-patterns when designing subagents Each of these will be a question on the exam in disguise. Watch for the failure mode behind each. - **Tool overscoping.** Granting [Read, Edit, Write, Bash] to a code-reviewer subagent. The reviewer should never edit; the whitelist is the contract. Restrict to [Read, Grep, Glob] so accidents become impossible. - **No output format.** A subagent without a defined output schema wanders for 30+ turns and returns a wall of prose. Define a structured shape ({findings: [], confidence: number}) so the parent can aggregate. - **Treating subagents as inheritance.** Subagents do NOT inherit the parent's chat history. Pass every fact the subagent needs in the task string. This is the single most-tested distractor pattern. ## Key takeaways - Subagents run in their own context window and return only a summary, isolating tool noise from the parent thread. (`subagents`) - Claude Code ships three built-in subagents (General, Explore, Plan); custom ones live as markdown files in .claude/agents/. (`subagents`) - Tool whitelist is the load-bearing contract for safety - restrict to the minimum needed for the role; never trust prompt-only constraints. (`tool-calling`) - Subagents do NOT inherit the parent's conversation; pass every required fact in the task string explicitly. (`multi-agent-research-system`) - Define a structured output format in the subagent's system prompt to bound runtime and enable clean parent-side aggregation. 
(`structured-outputs`) ## Concepts in play - **Subagents** (`subagents`), Core primitive - **Context window** (`context-window`), What subagents preserve - **Tool calling** (`tool-calling`), How subagents access capabilities - **Agentic loops** (`agentic-loops`), How the parent agent dispatches ## Scenarios in play - **Multi-agent research system** (`multi-agent-research-system`), Full hub-and-spoke architecture using subagents end-to-end - **Developer productivity agent** (`developer-productivity-agent`), Practical subagent specialization (4-5 tools per agent) ## Curated sources - **How we built our multi-agent research system** (anthropic-blog, 2025-06-19): Anthropic's own engineering write-up of the hub-and-spoke pattern Claude Code's subagents are modelled on; deeper context behind the Skilljar lessons. - **Subagents - Anthropic Claude Code documentation** (anthropic-blog, 2025-09-01): Canonical reference for the markdown config schema and tool whitelist mechanics. Pair with the Skilljar lesson when you start authoring custom subagents. ## FAQ ### Q1. What is a subagent in Claude Code and how is it different from the main agent? A subagent is a specialized helper that runs in its own separate context window, does a focused task, and returns only a summary to the main thread. The main agent stays clean of all the intermediate file reads, searches, and tool calls the subagent performed. Subagents are stateless across invocations and do not inherit the parent's conversation. ### Q2. When should I use a subagent vs keep work inline in the main Claude Code session? Use a subagent when the task is bounded, well-described, and you don't need to see the journey - exploration, research, code review, automated tests. Stay inline when you want visibility into the reasoning, when the work is short enough that context noise doesn't matter, or when you'll iterate based on intermediate findings. ### Q3. How do I create a custom subagent in Claude Code? Create a markdown file in .claude/agents/{name}.md with frontmatter declaring name, description, and tools (the whitelist); the markdown body below the frontmatter is the system prompt. Claude Code auto-discovers files in this directory and uses the description field to decide when to invoke the subagent. Place personal subagents in ~/.claude/agents/ and project ones in .claude/agents/. ### Q4. Why does my subagent return wrong answers about facts the main conversation already established? Subagents do not inherit the parent's conversation history. The parent must pass every required fact in the task string explicitly. If the parent says investigate the refund flow without naming which customer, file, or service, the subagent has nothing to anchor on and will guess. The fix is to embed all needed context in the task description. ### Q5. What tools should I give a code-review subagent? Restrict to [Read, Grep, Glob] - strictly read-only. Reviewers should never edit, write, or run code. Tool overscoping is the canonical anti-pattern: granting Edit or Write to a reviewer means accidents become possible. The whitelist is your safety contract; the prompt alone won't enforce it. ### Q6. Are subagents and Claude API agents the same thing? No. Subagents are a Claude Code-specific feature for delegating work inside a coding session. Agentic loops on the Claude API are a more general pattern where any client harness orchestrates messages.create calls with tool use. The mental model is similar (parent dispatches, child does focused work), but the runtime, tool surface, and configuration are distinct. ### Q7.
How many built-in subagents does Claude Code have and what do they do? Three: General for multi-step tasks needing both exploration and action; Explore for fast search and codebase navigation; Plan for research and analysis during plan mode. Claude picks which built-in to invoke based on the task shape. Custom subagents extend this set with team-specific specialists like reviewers, test writers, or docs generators. --- **Source:** https://claudearchitectcertification.com/knowledge/subagents-intro **Vault sources:** Course_16/Lesson_01_what-are-subagents.md; Course_16/Lesson_02_creating-a-subagent.md; Course_16/Lesson_03_designing-effective-subagents.md; Course_16/Lesson_04_using-subagents-effectively.md **Last reviewed:** 2026-05-06 --- # Claude 101: Foundations for Everyday Work > Claude 101 is the on-ramp course that teaches you what Claude is, how to write prompts that land, and how to organize work using Projects, Artifacts, Skills, and Connectors. It is a product tour, not an architecture course, but it grounds every later concept in the actual UI you will be tested on. If you already use claude.ai daily you can skim it; if not, do not skip it. **Domain:** D5 · Context + Reliability (15%) **Difficulty:** intro **Skilljar course:** Claude 101 (14 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/claude-101 **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 15% (D5) Direct prep for D5 task statements on the Claude product surface (Chat, Projects, Artifacts, Skills, Connectors) and the human-in-the-loop habits the exam frames as evaluation and iteration. Light overlap with D2 prompt clarity practices. ## What you'll learn - What Claude is, what Constitutional AI means in plain language, and where Claude is available (web, desktop, Slack, Excel) - How to write prompts that beat the five canonical failure modes (too generic, wrong length, wrong format, hallucinated facts, wrong tone) - How Projects (knowledge), Artifacts (output), Skills (procedure), and Connectors (tools) divide responsibility cleanly - When to reach for Skills vs Projects (procedure vs knowledge) without overlapping the two - How the 4D Framework (Delegation, Description, Discernment, Diligence) frames every later course ## Lesson outline ### 1. What is Claude? Claude is a steerable AI assistant trained under Constitutional AI; available on web, desktop, mobile, Slack, and Excel. ### 2. Your first conversation with Claude Walk through the chat UI, message history, and the basic conversational rhythm of asking and refining. ### 3. Getting better results Five canonical failure modes (generic, length, format, hallucination, tone) with the iteration mindset and a simple eval recipe. ### 4. Navigating the Claude desktop app: Chat, Cowork, Code Three surfaces in one app; Chat for conversation, Cowork for delegation on real files, Code for agentic dev work. ### 5. Introduction to projects Self-contained workspaces with knowledge, instructions, and chat history; auto-RAG when knowledge exceeds context. ### 6. Creating with artifacts Standalone interactive outputs (docs, code, HTML, SVG, Mermaid, React) rendered alongside the chat for reuse and sharing. ### 7. Working with skills Folders of instructions and scripts Claude loads dynamically; Anthropic-built or custom; procedure not knowledge. ### 8. Connecting your tools MCP-powered web connectors (Drive, Slack, Linear, Stripe) and desktop extensions; Claude only sees what you can see. ### 9. 
Enterprise search Claude for Work feature that connects Claude to your org's knowledge sources with custom prompts. ### 10. Research mode for deep dives Long-running research that searches, reads, and synthesizes citations into a structured report. ### 11. Claude in action: use-cases by role A use-case gallery walkthrough; sales, marketing, finance, HR, legal, research patterns. ### 12. Other ways to work with Claude Pointer to Claude Code, Claude in Chrome, Claude in Excel, Claude in Slack as specialized surfaces. ### 13. What you've learned Recap of the four organizing primitives and the iteration mindset. ### 14. Certificate of completion Skilljar certificate; not exam-relevant on its own but a signal of foundations coverage. ## Our simplification Claude 101 looks like a product tour but it is actually the course that teaches you the four primitives every later course assumes you know. Claude is not a chatbot; it is a steerable assistant trained under Constitutional AI, available across web, desktop, mobile, Slack, and Excel. The course frames Claude as a thinking partner, which sounds soft but matters on the exam because every D5 question about appropriate use, evaluation, and human-in-the-loop traces back to this framing. Skip the lesson and you miss the vocabulary the rest of the curriculum uses. The load-bearing lesson is Lesson 3, Getting better results. It catalogs the five failure modes you will hit on day one: responses too generic, wrong length, wrong format, confidently wrong facts, wrong tone. Each has a fix that is mechanical, not magical: add audience and constraints, ask for explicit length, show a format example, ask for sources, describe tone in plain language. The lesson also introduces the iteration mindset (treat first drafts as starting points, give specific feedback, know when to start fresh) and the 4D Framework (Delegation, Description, Discernment, Diligence). The 4D vocabulary reappears across every Anthropic course, and the iteration habits are the practical answer when D5 exam questions ask how a responsible operator should use Claude. Learn this lesson once, deeply, and you save time on every later course. Projects are the organizing primitive for knowledge. A project is a self-contained workspace with its own chat history, knowledge base, and instructions. Project instructions apply to every conversation in the project; this is how you stop re-explaining tone, audience, and constraints turn after turn. When the knowledge base exceeds the context window, Claude transparently switches to RAG mode (search-and-retrieve) so the project capacity expands roughly 10x. For Claude for Work users, projects also enable team collaboration with view, edit, and owner roles. Artifacts are the organizing primitive for output. When Claude generates something significant and self-contained (typically over 15 lines, something you would edit, iterate on, or reuse) it renders it in a dedicated panel beside the chat instead of dumping it inline. Artifacts cover documents, code, HTML pages, SVGs, Mermaid diagrams, and live React components. The interaction model is iterate incrementally: one change at a time so you can see what each instruction did. Free, Pro, and Max users can publish artifacts to a public link; Team and Enterprise users share them inside the org. Skills vs Projects is the single most-confused distinction in the course, and it lands on the exam in disguise. The clean rule: projects store knowledge, skills perform tasks. 
A project holds the brand guidelines PDF; a skill encodes the procedure for applying them to every deck you generate. Skills are folders of instructions and scripts Claude loads dynamically; Anthropic ships built-in skills for Excel, Word, PowerPoint, and PDF creation, and you can author custom skills by chatting with Claude itself. Skills require code execution and file creation to be enabled. Connectors are the organizing primitive for reach. They are powered by MCP (Model Context Protocol), which the course memorably describes as USB-C for AI: a universal standard that lets any developer build a connector and any Claude surface use it. There are two kinds: web connectors (Drive, Notion, Slack, Asana, Linear, Stripe, and many others) and desktop extensions (local file access, browser control, native app integration via the Claude Desktop app). The safety contract is simple and worth memorizing: Claude only sees what you see. Connecting your work email never grants Claude access to anyone else's inbox, every permission is scoped to what the connector actually needs, and every grant is revocable from either Claude's settings or the third-party service. For enterprise environments, Enterprise Search (Lesson 9) layers custom prompts on top of organization-specific knowledge sources. If you already use claude.ai daily, you can skim Claude 101 for the vocabulary and move on. If you do not, do not skip it; later courses will assume you know what an artifact is, what a connector does, and why projects exist. Treat Lessons 3, 5, 6, 7, and 8 as the load-bearing five; the rest are useful but not exam-critical. The 4D Framework, the projects-vs-skills split, and Claude only sees what you see are the three things you should be able to recite on demand. ## Patterns ### 5 canonical prompt failure modes (and the fix for each) Lesson 3 of Claude 101 catalogs the five most common day-one failures. Memorize the fix, not the failure. - **Response is too generic.** Your prompt didn't include enough context. Fix: add audience, role, and constraints. Write an email about the delay becomes Write an email to our enterprise client explaining the integration is delayed two weeks; this is the second delay; keep it professional but apologetic. - **Response is the wrong length.** Claude is guessing. Be explicit: two-paragraph summary, under 100 words, comprehensive analysis, length isn't a concern. Length is a parameter, not a hint. - **Format isn't what you wanted.** Claude understood the what but not the how. Fix: show, don't tell. Provide a format example or describe structure: bullet points with bold headers per section. - **Confidently wrong facts.** Hallucinations are most likely on niche specifics. Fix: ask Claude to cite sources or indicate confidence; for high-stakes work, verify independently; enable web search to ground in current information. - **Tone isn't right.** Claude defaults to helpful and professional. Fix: describe tone in plain language (more conversational, authoritative and formal) and provide an example of writing in the style you want. ### Projects vs Skills, in one sentence each The cleanest mental model from Lesson 7. If you can recite this, you've already passed half the D5 questions on the surface. - **Projects store knowledge.** Long-term reference materials, persistent context across all chats in the project, team-collaboration target. Use for client hubs, research repositories, brand kits. - **Skills perform tasks.** Procedural how-to packages Claude loads when relevant. 
Use for repeatable workflows: brand-voice review, quarterly variance analysis, deck templating. - **They compose.** A customer call prep skill can pull from customer profiles in a project's knowledge base. The project provides the what (information); the skill provides the how (process). ## Key takeaways - Claude is built around four primitives: Projects (knowledge), Artifacts (output), Skills (procedure), Connectors (reach). Every later course inherits this vocabulary. (`skills`) - Project instructions apply to every chat in the project; when knowledge exceeds the context window, Claude transparently switches to RAG mode for ~10x capacity. (`context-window`) - Artifacts trigger automatically for content that's substantial, self-contained, and reusable; you can also explicitly request Create this as an artifact if Claude misses the cue. (`structured-outputs`) - Skills are procedural (how Claude executes); Projects are referential (what Claude knows). They compose, not overlap. (`skills`) - Connectors are powered by MCP and follow the rule Claude only sees what you see; permissions are scoped, revocable, and per-service. (`mcp`) - The 4D Framework (Delegation, Description, Discernment, Diligence) is the AI-fluency vocabulary every later Anthropic course assumes; learn it here. (`4d-framework`) ## Concepts in play - **Skills** (`skills`), One of the four product primitives - **Model Context Protocol** (`mcp`), How connectors are wired - **4D Framework** (`4d-framework`), AI-fluency vocabulary used across every Anthropic course - **Context window** (`context-window`), Why Projects auto-switch to RAG mode - **System prompts** (`system-prompts`), What Project Instructions actually are ## Scenarios in play - **Claude for operations** (`claude-for-operations`), Practical use of Projects + Connectors for ops-team knowledge work - **Agent Skills for enterprise KM** (`agent-skills-for-enterprise-km`), Where the Projects-vs-Skills split shows up at enterprise scale ## Curated sources - **What are Skills? Anthropic Help Center** (anthropic-blog, 2025-10-15): Canonical reference for the Skills feature; pairs with Lesson 7 when you start building custom skills and need the exact enable-step sequence. - **How can I create and manage projects? Anthropic Help Center** (anthropic-blog, 2025-08-01): The walkthrough Lesson 5 abbreviates; useful when you want the screenshots, sharing-permission matrix, and current pricing-tier gating. - **Introducing the Model Context Protocol** (anthropic-blog, 2024-11-25): Anthropic's announcement of MCP itself; deeper than Lesson 8's USB-C for AI analogy and required reading before MCP Foundations course. ## FAQ ### Q1. What is Claude 101 and is it worth taking before the Claude Architect exam? Claude 101 is Anthropic's free 14-lesson Skilljar course that introduces the Claude product surface: Chat, Projects, Artifacts, Skills, and Connectors. It's the only course that teaches the vocabulary every later course assumes you know, so yes, take it first if you don't use claude.ai daily. If you do, skim Lessons 3, 5, 6, 7, 8 and move on. ### Q2. What's the difference between Claude Projects and Claude Skills? Projects store knowledge; Skills perform tasks. A Project is a workspace with persistent knowledge, instructions, and chat history Claude references across every conversation in the project. A Skill is a procedural package (instructions, scripts, resources) Claude loads when it determines the workflow applies. 
They compose: a skill can read knowledge from a project's knowledge base. ### Q3. When does Claude create an artifact vs reply inline in chat? Claude creates an artifact when content is significant and self-contained (typically over 15 lines), is something you'll likely edit or reuse, and stands on its own without the surrounding conversation. If Claude replies inline when you expected an artifact, just ask: Create this as an artifact. ### Q4. How do I get Claude to follow a specific format every time? Show, don't just tell. Provide an example of the format directly in the prompt or in your project's Instructions. Vague tells (use bullet points) work less reliably than examples. For repeatable workflows that need the same format every time, encode the procedure in a custom Skill so Claude follows the steps consistently. ### Q5. What can Claude actually access through connectors and is it safe? Connectors are powered by MCP (Model Context Protocol) and grant Claude scoped, revocable access to your tools (Drive, Slack, Linear, Stripe, and others). The safety rule is Claude only sees what you see: connecting your work email gives Claude access to your inbox, never anyone else's. Permissions are revocable from Claude's settings or the third-party service at any time. ### Q6. Why does Claude give me confidently wrong information sometimes? Claude occasionally generates plausible but incorrect content, especially on niche specifics or recent events. The fixes are mechanical: ask Claude to cite sources, ask it to indicate confidence level, enable web search to ground responses in current information, and for high-stakes work verify key facts independently. Treat first drafts as starting points, never as final answers. ### Q7. Do I need a paid Claude plan to use Projects, Artifacts, Skills, and Connectors? Projects and Artifacts are available on Free, Pro, Max, Team, and Enterprise. Skills require Pro, Max, Team, or Enterprise (and code execution + file creation enabled). Connectors are available across plans, but enterprise-grade connectors and higher rate limits are gated to paid tiers. Check your plan's Capabilities settings. ### Q8. What is the 4D Framework and why does it matter for the certification? The 4D Framework is Anthropic's AI-fluency vocabulary: Delegation (deciding human vs AI work), Description (clearly communicating with AI), Discernment (critically evaluating AI output), Diligence (using AI responsibly). Every later Anthropic course assumes this vocabulary, and exam questions on appropriate use, evaluation, and oversight trace back to it. Learn it once in Claude 101 and you don't have to relearn it. --- **Source:** https://claudearchitectcertification.com/knowledge/claude-101 **Vault sources:** Course_01_claude-101/_Course_Overview.md; Course_01_claude-101/Lesson_01_what-is-claude.md; Course_01_claude-101/Lesson_03_getting-better-results.md; Course_01_claude-101/Lesson_05_introduction-to-projects.md; Course_01_claude-101/Lesson_06_creating-with-artifacts.md; Course_01_claude-101/Lesson_07_working-with-skills.md; Course_01_claude-101/Lesson_08_connecting-your-tools.md; Course_01_claude-101/Lesson_11_claude-in-action-use-cases-by-role.md **Last reviewed:** 2026-05-06 --- # Claude Code 101: Workflows + Context Mastery > Claude Code 101 is the daily-driver course for the agentic coding tool that reads your codebase, edits files, and runs commands inside an explicit Explore, Plan, Code, Commit loop. 
It teaches the five customization surfaces you will be tested on: CLAUDE.md, Subagents, Skills, MCP, Hooks. Take it after Claude 101 and before any of the deeper Claude Code courses. **Domain:** D1 · Agentic Architectures (27%) **Difficulty:** intro **Skilljar course:** Claude Code 101 (13 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/claude-code-101 **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 27% (D1) Direct prep for D1 task statements on agentic loops, plan mode, context management, and tool use; D3 questions on extension surfaces (CLAUDE.md, Subagents, Skills, MCP, Hooks); and D5 habits like start without a CLAUDE.md so you can see what to add. ## What you'll learn - What an agentic coding tool is and how Claude Code differs from claude.ai (it touches your files, terminal, and codebase directly) - The Explore-Plan-Code-Commit workflow with Plan Mode (Shift + Tab) as the safe place to course-correct before any code is written - How to manage the context window with /compact, /clear, /context, and what the trade-offs are - How to write a CLAUDE.md that gives Claude persistent project memory; the project vs user hierarchy - The five customization surfaces: Subagents, Skills, MCP, Hooks, and CLAUDE.md, and which one to reach for when - Why Hooks are the deterministic answer when CLAUDE.md instructions aren't enough (if it must happen every time, don't put it in a prompt) ## Prerequisites - **Claude 101 (knowledge)** (knowledge · `claude-101`) ## Lesson outline ### 1. What is Claude Code? An agentic coding tool that reads your codebase, edits files, runs commands; available in terminal, VS Code, JetBrains, Desktop, and web. ### 2. How Claude Code works An LLM in a real-time loop with tool access; agent means software that takes actions in its environment to reach a goal. ### 3. Installing Claude Code Install via npm, pip, or the official installer; sign in with your Anthropic account; verify in terminal. ### 4. Your first prompt First-prompt mechanics: how Claude Code asks for permission, how it streams output, how to stop it. ### 5. The explore → plan → code → commit workflow Use Plan Mode (Shift + Tab) to read-only explore and produce a plan; review and revise before any code is written; then code, then commit. ### 6. Context management Use /compact to summarize when nearing limit; /clear to start fresh; /context to inspect what's filling the window. ### 7. Code review Run a code-reviewer subagent before commit; fresh-eyes review without the main session's bias. ### 8. The CLAUDE.md file Project-level Markdown that's auto-appended to your prompt every session; project + user hierarchy; commit it for team. ### 9. Subagents Specialized helpers in their own context window; built-in (General, Explore, Plan) plus custom ones in .claude/agents/. ### 10. Skills Reusable instruction packages; lighter context cost than MCP because only name + description load until invoked. ### 11. MCP Open standard for connecting external tools (HTTP or stdio); add with claude mcp add; scope local/user/project; watch context cost. ### 12. Hooks Deterministic commands that run on lifecycle events (PreToolUse, PostToolUse, Stop, Notification); use exit code 2 to block. ### 13. Course quiz 13-question Skilljar quiz; not exam-relevant on its own but useful as a self-check on the five customization surfaces. 
## Our simplification Claude Code is described in Lesson 1 as an agentic coding tool that understands your codebase, edits your files, runs commands, and integrates with your existing developer tools. The framing matters: Claude Code is not a smarter chat box; it is an agent. The course defines an AI agent as software that interacts with its environment and performs actions to reach a goal, and that definition lands directly on the D1 exam questions about agentic loops. Three ground rules govern every session: the context window is finite, Claude asks for permission by default, and Claude can make mistakes, so staying in the loop is not optional. The single most-tested concept in this course is the Explore, Plan, Code, Commit workflow in Lesson 5. The course is explicit: If you take one thing away from this course, let it be this workflow. The fastest way to do the first two steps is Plan Mode, which you toggle with Shift + Tab until you see Plan Mode under the input. In plan mode Claude can read but cannot edit, so it can gather context safely and produce a plan you review before any code is written. This is the cheapest place to course-correct; once code lands, every fix is more expensive. Context management (Lesson 6) is the rest of the D1 surface. The context window is Claude's working memory, and every prompt, file read, and tool call eats into it. When you approach the limit, Claude compacts the conversation: it summarizes important details and drops noise. Compaction can lose details, so you can also run /compact manually before that happens, /clear to start fully fresh, and /context to inspect what is consuming space. The three habits the course pushes: be specific (vague prompts cost MORE context, not less), manage MCP servers (they load all tools by default), and use subagents for 'where is X?' questions. CLAUDE.md (Lesson 8) is the persistent-memory primitive. It is a Markdown file at your project root that Claude Code automatically appends to every prompt at session start. Treat it as an onboarding script: stack, commands, code style, anything Claude would otherwise rediscover. The file has a hierarchy: project-level (./CLAUDE.md, committed to version control, shared with the team) and user-level (your config folder, personal preferences across projects). The course's strongest tip: start without one and add to it whenever you find yourself correcting Claude on the same thing twice. That keeps the file compact and signal-rich. The customization story is the five surfaces you will be tested on. Subagents (Lesson 9) run in their own context window and return only a summary; Claude Code ships three built-in ones (General, Explore, Plan), and custom ones live as Markdown files in .claude/agents/. Skills (Lesson 10) are procedural packages that only load their full body when Claude decides to invoke them, so they are lighter on context than MCP. MCP (Lesson 11) connects external tools; you add servers with claude mcp add, scope them local/user/project, and watch the context cost because MCP loads all tool definitions by default. Hooks (Lesson 12) are the deterministic escape hatch: shell commands that fire on lifecycle events (PreToolUse, PostToolUse, UserPromptSubmit, Stop, Notification). The course's mic-drop line: if something needs to happen every time without fail, do not put it in a prompt; put it in a hook.
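To make that rule concrete, here is a minimal sketch of a PreToolUse hook that blocks destructive shell commands. It follows the same stdin-JSON / exit-code-2 contract as the invoice-approval hook earlier in this corpus; the specific blocked patterns and file name are illustrative assumptions, not the course's example. ```python #!/usr/bin/env python3 # .claude/hooks/block_dangerous_bash.py - illustrative sketch, not the course's example. # PreToolUse hooks receive the pending tool call as JSON on stdin; # exit code 2 blocks the call and feeds the stderr message back to Claude. import json import re import sys  BLOCKED = [     r"rm\s+-rf\s+/",          # recursive delete from root     r"git\s+push\s+--force",  # force-push over shared history ]  payload = json.loads(sys.stdin.read()) if payload.get("tool_name") != "Bash":     sys.exit(0)  # only guard shell commands  command = payload.get("tool_input", {}).get("command", "") for pattern in BLOCKED:     if re.search(pattern, command):         print(f"blocked by policy: command matches {pattern!r}", file=sys.stderr)         sys.exit(2)  # deny; Claude sees the stderr message and can adjust  sys.exit(0)  # allow ``` Swap the tool_name check and the patterns and the same skeleton becomes a policy gate in the style of the invoice-approval hook above.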
The mental model the course teaches for picking the right surface: CLAUDE.md for Claude should always know this; Skills for Claude should follow this procedure when relevant; MCP for Claude needs access to this external system; Subagents for this side-task should not pollute my main context; Hooks for this must happen, no exceptions. MCP and Skills both add capability, but Skills are cheaper on context because only the name and description load until use; MCP loads all tool definitions upfront. If your MCP tools exceed 10% of the context window, Claude Code automatically switches to tool search mode, which discovers tools on demand but is less reliable. Take Claude Code 101 right after Claude 101. It is the second-most-load-bearing course on Pillar 4 and supplies vocabulary the Subagents, Agent Skills, and MCP courses all assume. The five customization surfaces, the Explore-Plan-Code-Commit loop, and the deterministic-hook rule are the three exam anchors you should be able to recite without notes. After this course, you have enough mechanical fluency to take Claude Code in Action (integration patterns), Subagents Intro (deeper context isolation), Agent Skills Intro (composable procedure), and MCP Foundations (external-tool surface) in any order. ## Patterns ### 5 customization surfaces in Claude Code (and which to pick when) Lessons 8 through 12 cover the five surfaces. Picking the right one is the most-tested practical skill in the course. - **CLAUDE.md, persistent project memory.** Use when Claude should always know something about the project (stack, commands, conventions). Auto-appended to every session. Cheapest, most universal surface. - **Subagents, context isolation.** Use when a side task should not pollute your main context. Built-in (General, Explore, Plan) or custom ones in .claude/agents/. Returns only a summary. - **Skills, procedural know-how.** Use when there's a repeatable workflow Claude should execute consistently. Lighter on context than MCP because only name and description load until invoked. - **MCP, external tools and data.** Use when Claude needs to reach an external system (Linear, GitHub, internal database). Add with claude mcp add; scope local/user/project. Watch the context cost. - **Hooks, deterministic enforcement.** Use when something must happen every time, with no exceptions: auto-formatting, blocking dangerous commands, audit logging. Configured in settings.json; PreToolUse exit code 2 blocks the action. ## Key takeaways - Claude Code is an agentic coding tool: an LLM in a real-time loop with direct access to your files, terminal, and codebase. Permission-gated by default, but staying in the loop catches its mistakes. (`agentic-loops`) - The Explore-Plan-Code-Commit workflow with Plan Mode (Shift + Tab) is the highest-leverage habit in the course; course-correcting in plan mode is far cheaper than fixing committed code. (`plan-mode`) - Context management is mechanical: /compact to summarize, /clear to start fresh, /context to inspect; vague prompts cost MORE context than specific ones because Claude has to explore. (`context-window`) - CLAUDE.md is auto-appended to every session; the project file goes in version control, the user file is personal. Start without one and add only what you find yourself repeating. (`claude-md-hierarchy`) - Skills are cheaper on context than MCP because only the name and description load until invoked; MCP loads all tool definitions upfront, which can dominate the context window. 
(`skills`) - Hooks are the deterministic escape hatch: if something must happen every time, don't put it in a prompt; put it in a hook. PreToolUse with exit code 2 blocks the action and feeds back stderr. (`hooks`) ## Concepts in play - **Agentic loops** (`agentic-loops`), What Claude Code runs at its core - **Plan mode** (`plan-mode`), Read-only exploration before code - **Context window** (`context-window`), What /compact and /clear manage - **CLAUDE.md hierarchy** (`claude-md-hierarchy`), Project and user memory layers - **Subagents** (`subagents`), Isolated-context delegation surface - **Skills** (`skills`), Procedural extension surface - **Model Context Protocol** (`mcp`), External-tool extension surface - **Hooks** (`hooks`), Deterministic enforcement surface ## Scenarios in play - **Code generation with Claude Code** (`code-generation-with-claude-code`), End-to-end Explore-Plan-Code-Commit workflow on a real feature - **Developer productivity agent** (`developer-productivity-agent`), Customizing Claude Code with subagents and skills for a recurring team workflow - **Claude Code for CI/CD** (`claude-code-for-cicd`), Hooks and headless mode for deterministic pipeline behavior ## Curated sources - **Claude Code: Best practices for agentic coding** (anthropic-blog, 2025-04-18): Anthropic's engineering write-up on the workflow patterns this course teaches; deeper rationale behind Plan Mode, CLAUDE.md, and the customization surfaces. - **Claude Code documentation: Hooks** (anthropic-blog, 2025-09-01): Canonical reference for the hook event types and exit-code semantics; pair with Lesson 12 when you start authoring your first PreToolUse hook. - **Claude Code documentation: Memory (CLAUDE.md)** (anthropic-blog, 2025-09-01): The hierarchy spec for CLAUDE.md (project, user, enterprise) and import syntax; the lesson abbreviates this. ## FAQ ### Q1. What is Claude Code and how is it different from claude.ai? Claude Code is Anthropic's agentic coding tool: an LLM running in a real-time loop with direct access to your files, terminal, and codebase. Unlike claude.ai, it does the work itself (reading code, editing files, running tests) instead of giving you text to paste. It's available in the terminal, VS Code, JetBrains, the Claude Desktop app, and the web. ### Q2. What is the Explore Plan Code Commit workflow in Claude Code? It's the four-step pattern Lesson 5 calls the single most important takeaway. Explore gathers relevant context, Plan produces a written plan you review (use Plan Mode via Shift + Tab for read-only safety), Code executes the plan with Claude editing files, Commit runs a code-reviewer subagent and pushes. Course-correcting in plan mode is far cheaper than fixing committed code. ### Q3. How do I manage the context window in Claude Code when it gets full? Three commands: /compact summarizes important details and drops noise (run manually before auto-compaction loses things you wanted), /clear wipes context to start fully fresh, /context shows you a visual breakdown of what's consuming space. Vague prompts cost MORE context, not less, because Claude has to explore the codebase to figure out what you meant. ### Q4. What goes in a CLAUDE.md file and where should I put it? A CLAUDE.md is Markdown that Claude Code auto-appends to every session. Put it at project root for team-shared rules (stack, commands, conventions, code style); use ~/.claude/CLAUDE.md for personal preferences across projects. 
The strong recommendation from Lesson 8: start without one and add only what you find yourself correcting Claude on twice. Run /init when you're ready to generate one. ### Q5. What's the difference between Skills and MCP servers in Claude Code? Both extend Claude Code's capability, but they differ in context cost. Skills only load name and description until invoked, so adding more skills costs almost nothing in context. MCP servers load all tool definitions upfront, so a misconfigured MCP can dominate the context window. Reach for Skills when you have a repeatable procedure; reach for MCP when you need access to an external system. ### Q6. Why is my Claude Code session forgetting things between turns? Two likely causes. First, automatic compaction kicked in near the context limit and dropped detail you wanted preserved; run /compact manually before auto-compaction so you control what gets summarized. Second, you're starting fresh sessions without a CLAUDE.md; anything Claude should always know belongs in CLAUDE.md so it's loaded automatically every session. ### Q7. When should I use a hook instead of putting an instruction in CLAUDE.md? Use a hook when something must happen every single time without exception. Lesson 12's exact framing: if something needs to happen every time without fail, don't put it in a prompt; put it in a hook. Common cases: auto-formatting after edits (PostToolUse), blocking writes to production directories (PreToolUse with exit code 2), audit logging, finish notifications. CLAUDE.md instructions are guidance; hooks are deterministic. ### Q8. Is Claude Code 101 enough to pass the Claude Architect certification? No, but it's the second-most-load-bearing course in the curriculum and it's a hard prerequisite for almost everything else. It anchors D1 (agentic systems), parts of D3 (development environments), and D5 (responsible use). Take it after Claude 101 and before Subagents, Agent Skills, MCP Foundations, and Claude Code in Action. --- **Source:** https://claudearchitectcertification.com/knowledge/claude-code-101 **Vault sources:** Course_02_claude-code-101/_Course_Overview.md; Course_02_claude-code-101/Lesson_01_what-is-claude-code.md; Course_02_claude-code-101/Lesson_05_explore-plan-code-commit-workflow.md; Course_02_claude-code-101/Lesson_06_context-management.md; Course_02_claude-code-101/Lesson_08_the-claude-md-file.md; Course_02_claude-code-101/Lesson_09_subagents.md; Course_02_claude-code-101/Lesson_10_skills.md; Course_02_claude-code-101/Lesson_11_mcp.md; Course_02_claude-code-101/Lesson_12_hooks.md **Last reviewed:** 2026-05-06 --- # Introduction to Claude Cowork: Delegation for Knowledge Work > Claude Cowork is the same agentic architecture that powers Claude Code, retargeted at knowledge work: analysis, research, writing, and the documents you produce every day. The shift the course teaches is from conversation to delegation: describe an outcome, review the plan, let it run, and review the finished file. It's the right mental model for any exam question about when to use chat vs an agent. **Domain:** D5 · Context + Reliability (15%) **Difficulty:** intro **Skilljar course:** Introduction to Claude Cowork (11 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/claude-cowork-intro **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 15% (D5) Direct prep for D5 task statements on appropriate use, human-in-the-loop review, permissions, and model selection. 
Light D1 overlap on the agentic loop (Cowork shares the Claude Code architecture for non-developer knowledge work). ## What you'll learn - What Cowork is, how it differs from chat (delegation, not conversation), and when each fits the work - The 4-step task loop: describe what you want back, answer follow-up questions, step away (or steer), open the finished file - How project-level Instructions and global instructions give Cowork context that carries across every session - How plugins bundle skills, connectors, and workflows for a specific function (sales, finance, research) - The permission and safety model: isolated execution, controlled file access, gated deletion, network policy compliance - How to choose between Opus, Sonnet, and Haiku based on task complexity and usage allocation ## Prerequisites - **Claude 101 (knowledge)** (knowledge · `claude-101`) ## Lesson outline ### 1. What is Cowork? The Claude Code agent architecture retargeted at knowledge work; describe an outcome, get a finished file, not pasted text. ### 2. Getting set up Install the Cowork app, sign in, point it at a folder; that's the first step in any task. ### 3. The task loop 4-step pattern: describe outcome, answer follow-ups, step away or steer, open the file. Repeat for every task. ### 4. Giving Cowork context Projects carry context across sessions; Instructions panel for project rules; global instructions for cross-project preferences. ### 5. Plugins: Cowork as a specialist Plugins bundle skills, connectors, and workflows for a specific function so Cowork approaches your task as a domain specialist. ### 6. Scheduled tasks Run a workflow on a recurring cadence; work you'd otherwise remember to do happens automatically. ### 7. File & document tasks Read, edit, and create real files (.docx, .xlsx, .pptx, .pdf) in your folder; output is a file, not a paste. ### 8. Research & analysis at scale Subagents parallelize independent pieces (e.g., one agent per vendor in a comparison) and synthesize results back. ### 9. Permissions, usage, & choosing your model Isolated execution, gated deletion, network-policy compliance; pick Opus for complex multi-step work, Sonnet for everyday default, Haiku for light tasks. ### 10. Troubleshooting & next steps Common failure modes (wrong folder grant, missing context, runaway tasks) and the next courses in the Cowork sequence. ### 11. Quiz on Claude Cowork 11-question Skilljar quiz; useful self-check on the task loop, plugins, permissions, and model selection. ## Our simplification Cowork is described in Lesson 1 as built on the same architecture as Claude Code, the agentic system used to write and ship production software, but retargeted at knowledge work. The shift the course frames is from conversation to delegation. In chat, you ask a question and get text back; you still move information between tools, assemble the output, and handle the steps in between. In Cowork, you describe an outcome, Claude plans it, works through it, and delivers a finished file to your drive. Three things enable the loop: Plan (review before work starts), Execute (long-running work in an isolated environment), and Connect (reach the systems where your work already lives). The load-bearing lesson is Lesson 3, The Task Loop. 
The pattern is four steps you will repeat for every task: (1) describe what to look at, what you want back, and where it should go; (2) answer a few follow-up questions Cowork asks before it starts; (3) step away or steer (a progress panel shows each step, and you can type in chat to redirect at any point); (4) open the finished file. The course's framing for the output is precise: treat the result as a draft, the way you'd read a first pass from a capable colleague. Cowork's confidence is the same whether the work is right or wrong; your review catches the difference. Cowork tasks start fresh by default; the way context carries across sessions is projects (Lesson 4). A Cowork project is a named workspace backed by a real folder on your machine, with persistent Instructions and memory. The Instructions panel takes shorthand-style context: who's involved (Rachel is head of product), where things live (contracts in ./Contracts, archive in Drive /old-reports), output preferences (drafts in .docx, finals in PDF), project-specific rules. A few lines is enough. Once Cowork knows the people and the file layout, "send this to Rachel" and "file it where the Q3 report went" start meaning specific things. Global instructions handle preferences that don't change between projects. Plugins (Lesson 5) are the domain-specialist surface. A plugin bundles skills, connectors, and workflows for a function (sales, finance, research, legal) so Cowork approaches your task the way a domain specialist would, not a generalist. Scheduled tasks (Lesson 6) let you run a workflow on a recurring cadence: a Monday-morning research digest, a weekly metrics summary, a quarterly compliance check. File and document tasks (Lesson 7) do real work on real files in real formats: .docx, .xlsx, .pptx, .pdf. The output is a file you open, not text you paste. Subagents in Cowork (Lesson 8) parallelize independent pieces of a task. The course's example is comparing four vendors: Cowork spins up one subagent per vendor, each researching pricing, integrations, and reviews without the others' material crowding its view, and synthesizes the results into one output. Tasks too large to hold in one conversation get split into pieces that each fit comfortably, and each piece gets focused attention. This is the same hub-and-spoke pattern Claude Code uses, retargeted for knowledge-work parallelism. Permissions and safety (Lesson 9) is the D5 anchor of the course. Four boundaries shape every session: isolated execution (Cowork runs in a sandboxed environment separate from your OS); controlled file access (you decide which folders Cowork can see; no grant, no access); network policies respected (org rules apply); deletion is gated (permanent deletion always requires explicit approval). The reviewing habit the course pushes hard: open the file, check a number, follow one thread of reasoning. Polished-looking outputs need MORE scrutiny, not less, because Cowork's confidence does not correlate with correctness. Model selection (Lesson 9) is the practical handle on usage. Claude models come in three tiers: Opus for complex multi-step work (uses the most allocation), Haiku for the quickest and lightest tasks, Sonnet as the everyday default. Match the model to what the task actually needs rather than defaulting to Opus for everything. Three usage-management habits: batch related work (fresh sessions have overhead), use chat for tasks that don't need files or tools (faster and cheaper), monitor where you stand in settings.
The Lesson 1 question (does this need my files, my connected tools, or a real output file?) is the same question for both is this a Cowork task? and am I spending my allocation well? ## Patterns ### When to reach for Cowork vs chat The Lesson 1 framing is the cleanest decision rule in the course. Memorize the question. - **Reach for Cowork.** When you want a finished file you can open, when the work touches files on your drive or tools you're connected to, when there are many steps or many items to process, or when you want to let it run while you do something else. - **Reach for chat.** When you want an answer or a draft you'll refine yourself, when everything Claude needs fits in a single paste, or when you want to think turn-by-turn together. - **The deciding question.** Does this task need to touch your files, your connected tools, or produce a real output file? If yes, Cowork. If no, chat is faster and cheaper. ### The 4 safety boundaries every Cowork session operates within Lesson 9 is the D5 anchor. These four are testable. - **Isolated execution.** Cowork runs in a sandboxed environment on your computer, separate from your operating system. It can't reach what hasn't been granted. - **Controlled file access.** You decide which folders Cowork can see. No grant, no access. The grant is per-folder, not blanket. - **Network policies respected.** Cowork follows your organization's network rules. Restricted environments stay restricted. - **Deletion is gated.** Permanent deletion always requires your explicit approval. You'll always see a prompt first. ## Key takeaways - Cowork is Claude Code's agent architecture retargeted at knowledge work; the course's framing is the shift from conversation to delegation. (`agentic-loops`) - The 4-step task loop (describe outcome, answer follow-ups, step away or steer, open the file) is the pattern for every Cowork task; the result is always a draft, never a final. (`plan-mode`) - Projects with an Instructions panel are how context carries across sessions; a few lines about people, file locations, and output preferences is enough to make shorthand work. (`system-prompts`) - Subagents parallelize independent pieces of a task (one per vendor in a comparison) and synthesize results back; same hub-and-spoke pattern as Claude Code subagents, retargeted for knowledge work. (`multi-agent-research-system`) - Cowork's safety boundaries are the D5 exam anchor: isolated execution, controlled file access, network-policy compliance, gated deletion. (`evaluation`) - Model selection matters for both quality and usage: Opus for complex multi-step, Sonnet for everyday default, Haiku for light tasks. Match the model to the work. 
(`evaluation`) ## Concepts in play - **Agentic loops** (`agentic-loops`), What Cowork inherits from Claude Code - **Plan mode** (`plan-mode`), Cowork's plan-then-execute pattern - **Subagents** (`subagents`), How Cowork parallelizes independent task pieces - **System prompts** (`system-prompts`), What the Instructions panel actually configures - **Skills** (`skills`), What plugins bundle for domain specialization - **Evaluation** (`evaluation`), The review-the-output habit Lesson 9 enforces ## Scenarios in play - **Claude for operations** (`claude-for-operations`), Cowork's natural home: knowledge work, recurring tasks, file outputs - **Long document processing** (`long-document-processing`), Where Cowork's parallel subagents and persistent file output earn their keep - **Agent Skills for enterprise KM** (`agent-skills-for-enterprise-km`), Plugins as the enterprise-scale specialization surface ## Curated sources - **Claude Cowork: a research preview** (anthropic-blog, 2026-01-15): Anthropic's design and purpose write-up; deeper rationale for the conversation-to-delegation shift than the Skilljar lessons abbreviate. - **Choosing the right Claude model** (anthropic-blog, 2025-11-01): The full Opus vs Sonnet vs Haiku tradeoff matrix Lesson 9 abbreviates; pair when you need the per-tier capability and pricing breakdown. - **Organize your tasks with projects in Cowork** (anthropic-blog, 2026-02-01): Canonical reference for project setup, importing from chat projects, and the Instructions panel; expands Lesson 4's brief tour. ## FAQ ### Q1. What is Claude Cowork and how is it different from claude.ai chat? Claude Cowork is built on the same agentic architecture as Claude Code but retargeted at knowledge work. The shift is from conversation to delegation. In chat, you ask and refine text back-and-forth; in Cowork, you describe an outcome and get a finished file (PowerPoint, Excel, Word, PDF) on your drive. Cowork plans the steps, executes in an isolated environment, and connects to the tools where your work already lives. ### Q2. When should I use Claude Cowork instead of Claude chat? The deciding question from Lesson 1: does this task need to touch your files, your connected tools, or produce a real output file? If yes, Cowork. If no, chat is faster and uses less of your allocation. Lean toward Cowork when there are many steps or many items, when you want to let it run while you do something else, or when chat ran out of room on a similar task before. ### Q3. What is the 4-step task loop in Claude Cowork? Every Cowork task follows the same pattern. (1) Describe what to look at, what you want back, and where it should go. (2) Answer a few follow-up questions Cowork asks before it starts. (3) Step away while it runs, or type in chat to redirect at any point. (4) Open the finished file and review it the way you'd read a first pass from a capable colleague. ### Q4. How do I give Claude Cowork context that carries across sessions? Use a project: a named workspace backed by a real folder on your machine, with persistent Instructions and memory. The Instructions panel takes shorthand context: who's involved (Rachel is head of product), where things live (contracts in ./Contracts), output preferences (drafts in .docx, finals in PDF). For preferences that don't change between projects, use global instructions in Settings. ### Q5. Is Claude Cowork safe to give access to my files and connected tools? Cowork operates within four explicit boundaries. 
Isolated execution (sandboxed environment separate from your OS); controlled file access (you decide which folders are visible, per-folder, not blanket); network policies respected (org rules apply); gated deletion (permanent deletion always requires explicit approval). Conversation history stores locally on your machine; check your plan documentation for current audit-logging and compliance details if you have regulated workloads. ### Q6. Why does my Claude Cowork output look polished but contain errors? Cowork's confidence is the same whether the work is right or wrong. The more polished an output looks, the more a second look is worth. Polish is a styling artifact, not a correctness signal. The habit Lesson 9 pushes: open the file, check a number, follow one thread of reasoning. Treat every output as a first draft, regardless of how finished it looks. ### Q7. Should I use Opus, Sonnet, or Haiku for Claude Cowork tasks? Match the model to the work. Opus for complex multi-step tasks where capability matters most (uses the most allocation). Haiku for quick, light tasks where speed matters most. Sonnet sits in the middle as a sensible default for everyday work. Defaulting to Opus for everything wastes allocation; defaulting to Haiku for everything sacrifices quality where it mattered. ### Q8. How do plugins in Claude Cowork work and when should I use one? A plugin bundles skills, connectors, and workflows for a specific function (sales, finance, research, legal) so Cowork approaches your task the way a domain specialist would, not a generalist. Install one when your work is predominantly in that domain and you want consistent specialist behavior across tasks. Plugins are the highest-leverage way to get role-specific quality without writing custom skills yourself. --- **Source:** https://claudearchitectcertification.com/knowledge/claude-cowork-intro **Vault sources:** Course_03_introduction-to-claude-cowork/_Course_Overview.md; Course_03_introduction-to-claude-cowork/Lesson_01_what-is-cowork.md; Course_03_introduction-to-claude-cowork/Lesson_03_the-task-loop.md; Course_03_introduction-to-claude-cowork/Lesson_04_giving-cowork-context.md; Course_03_introduction-to-claude-cowork/Lesson_05_plugins-cowork-as-specialist.md; Course_03_introduction-to-claude-cowork/Lesson_08_research-and-analysis-at-scale.md; Course_03_introduction-to-claude-cowork/Lesson_09_permissions-usage-choosing-model.md **Last reviewed:** 2026-05-06 --- # Claude Code in Action: Integration Patterns > Claude Code in Action moves past the basics into the eight integration surfaces that turn Claude Code from a chat assistant into a real engineering tool: /init and the three CLAUDE.md tiers, @ file mentions, custom slash commands, MCP servers, GitHub integration, PreToolUse and PostToolUse hooks, and the Claude Code SDK. The settings hierarchy (global, project-shared, project-personal) is the meta-pattern that ties every layer together. Treat the course as an integration manual every team needs before they ship Claude Code into a real codebase. 
**Domain:** D3 · Agent Operations (20%) **Difficulty:** intermediate **Skilljar course:** Claude Code in Action (21 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/claude-code-in-action **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 20% (D3) + spillover into D1, D2 Direct prep for D3 task statements on Claude Code workflow integration: CLAUDE.md hierarchy, custom slash commands, hooks, MCP servers in Claude Code, GitHub Actions, and the Claude Code SDK. Also reinforces D1 agentic-loop framing and D2 tool whitelisting. ## What you'll learn - How /init bootstraps a project and where the three CLAUDE.md tiers live (project, local, user-global) - How to author custom slash commands in .claude/commands/ and pass arguments via $ARGUMENTS - How PreToolUse and PostToolUse hooks insert before or after tool execution, and what each can or cannot block - How to wire MCP servers into Claude Code and when to reach for the gh CLI for GitHub workflows - How the Claude Code SDK runs the same Claude Code programmatically from TypeScript, Python, or the CLI ## Prerequisites - **Claude Code 101 (knowledge)** (knowledge · `claude-code-101`) - **Agentic loops (concept)** (concepts · `agentic-loops`) - **CLAUDE.md hierarchy (concept)** (concepts · `claude-md-hierarchy`) ## Lesson outline ### 1. Introduction Course frame: Claude Code is more than a chat box; it integrates with your editor, repo, CI, and shell. ### 2. What is a coding assistant? Coding assistants pair an LLM with file access, shell access, and a tool loop, not just autocomplete. ### 3. Claude Code in action Live demo of Claude Code reading the repo, planning, and editing across files in one session. ### 4. Claude Code setup Install via npm/Homebrew, sign in with your Anthropic API key or Console subscription, verify in a sandbox repo. ### 5. Project setup Cd into the repo and run Claude Code; it auto-detects the project root and respects .gitignore. ### 6. Adding context /init writes CLAUDE.md from the codebase; @file injects file contents; # enters memory mode for inline edits. ### 7. Making changes Claude proposes edits, you approve per-file or shift-tab to auto-accept; diffs render inline before commit. ### 8. Course satisfaction survey Mid-course survey checkpoint, no technical content. ### 9. Controlling context Use /clear to reset, /compact to summarize, and @-mentions to keep only relevant files in working memory. ### 10. Custom commands Create one markdown file per command in .claude/commands/; the filename becomes the slash command (write_tests.md → /write_tests) and $ARGUMENTS pipes user input. ### 11. MCP servers with Claude Code Add MCP servers via /mcp or settings JSON to give Claude Code access to GitHub, Sentry, databases, browsers. ### 12. GitHub integration Use the gh CLI inside Claude Code or the GitHub Action to run Claude Code on PRs and issues. ### 13. Introducing hooks Hooks run shell commands before or after a tool call; PreToolUse can block, PostToolUse can only react. ### 14. Defining hooks Hooks live in ~/.claude/settings.json, .claude/settings.json, or .claude/settings.local.json; matchers target tool names. ### 15. Implementing a hook A hook is a command that reads a JSON tool-call payload from stdin and exits 0/non-zero or writes feedback. ### 16. Gotchas around hooks Hooks run as your shell user with full permissions; matcher regex traps and infinite-loop hooks are common failure modes. ### 17. Useful hooks! Auto-format on Edit, run tests on Write, log every Bash to a file, block reads from .env files. ### 18.
Another useful hook Notification hooks: ping a webhook, play a sound, or open a desktop notification when Claude finishes a task. ### 19. The Claude Code SDK Run Claude Code programmatically from TypeScript or Python; default is read-only, opt in to writes via allowedTools. ### 20. Quiz on Claude Code End-of-course knowledge check across CLAUDE.md, hooks, MCP, slash commands, and the SDK. ### 21. Summary and next steps Recap and pointers to Subagents, MCP Advanced, and Agent Skills as natural follow-ups. ## Our simplification Claude Code in Action is a tour of the eight integration surfaces that take Claude Code from "a chat box that reads files" to "a build-time tool every engineer on the team uses the same way." The eight: /init and CLAUDE.md, @ file mentions, custom slash commands, MCP servers, GitHub integration, hooks (Pre and Post), the SDK, and the settings hierarchy that lets each layer be team-shared or personal. The course is short on theory and heavy on "here is the file you put it in". Treat it as an integration manual, not an introduction. The CLAUDE.md hierarchy is the load-bearing primitive. Three locations: CLAUDE.md at repo root (committed, shared with the team), CLAUDE.local.md (gitignored, personal overrides), and ~/.claude/CLAUDE.md (user-global across every project). Claude reads all three on every request, layered. The /init command bootstraps the repo-root file by analyzing your codebase. The # memory-mode shortcut merges new instructions into the right tier without you opening the file. @file mentions inline file contents on demand, and you can put @-mentions inside CLAUDE.md so referenced files load on every turn. Custom slash commands live in .claude/commands/, one markdown file per command. The filename becomes the command, the body is the prompt, and $ARGUMENTS is the placeholder for whatever the user types after the command name. A write_tests.md file becomes /write_tests. This is the cleanest way to encode a team's repeatable workflows (audit, ship, refactor, write_tests, lint-fix) without each engineer reinventing the prompt every time. Restart Claude Code after creating a new command file; discovery is at startup, not live. Personal commands live in ~/.claude/commands/ and shadow project commands of the same name, so each engineer can override a team-shared command locally without forking the file. The pattern matters: custom commands are to your team's shared engineering vocabulary what built-in slash commands are to ad-hoc prompting. Hooks are where Claude Code stops being a chat assistant and becomes part of your build pipeline. PreToolUse runs before a tool is executed (matcher targets Read, Edit, Bash, etc.) and can block by exiting with code 2; PostToolUse runs after and can only react. Real uses: auto-format with Prettier on every Edit, run pnpm typecheck after Write, block reads from .env files, log every Bash call to a security audit trail. The hook is just a shell command receiving a JSON payload on stdin. Configuration lives in the same three-tier settings system (global, project, project-local) so you can ship team hooks via .claude/settings.json while letting each engineer add personal ones in .claude/settings.local.json. MCP servers in Claude Code extend the tool surface beyond the built-ins. Claude Code is an MCP client; you point it at MCP servers via /mcp or settings JSON, and those servers expose their tools, resources, and prompts to the session. Install GitHub, Sentry, Postgres, Playwright, Linear, or Notion in one line of config and Claude can query and act against them.
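That "one line of config" is easiest to see as a project-scoped server entry. A hedged sketch, shown here as a .mcp.json at the repo root; the file location, the package name, and the env-var expansion are assumptions to verify against the current MCP docs:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}" }
    }
  }
}
```

Project-scoped servers are committed and shared with the team, which is the same team-vs-personal split the settings hierarchy applies to every other surface.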
The gh CLI is often the simpler GitHub path: Claude Code calls gh directly through the Bash tool, no MCP wiring needed. The Claude Code GitHub Action lets the same harness run on PRs and issues from CI. The Claude Code SDK is the same Claude Code, just runnable from your own scripts. The TypeScript and Python packages expose a query() async iterator; you give it a prompt, and it streams the same tool-call conversation you'd see in the terminal. Default permissions are read-only; the SDK can Read, Grep, Glob, Bash (read-only commands), but cannot Edit or Write unless you opt in via allowedTools. Use cases: pre-commit hooks that review diffs, CI checks that summarize changes, batch refactors that apply Claude Code across many repos, custom slash commands inside other tools. The implicit thread through the whole course is the settings hierarchy. Three tiers (global at ~/.claude/settings.json, project-shared at .claude/settings.json committed, project-personal at .claude/settings.local.json gitignored) apply to MCP servers, hooks, custom commands, and permissions in exactly the same way. Think of .claude/settings.json as your team's contract for how Claude Code behaves in this repo, and .claude/settings.local.json as your personal overrides on top. The deeper layers shadow earlier ones, never silently. Once you see the pattern in one surface (hooks), you see it in all of them, and the course's eight topics collapse into one coherent integration model. The exam-relevant takeaway: Claude Code is not a chat box; it is a configurable harness whose every surface obeys the same three-tier hierarchy, and an architect designing rollout strategy works at that meta-layer rather than at any single surface. ## Patterns ### 5 hooks every Claude Code repo should consider These show up in real production setups; each maps to a real failure mode that hooks prevent. - **PostToolUse: format on Edit.** Match Write|Edit|MultiEdit and run prettier --write or ruff format on the changed file. Removes the entire "Claude wrote ugly code" class of complaints in one config block. - **PostToolUse: typecheck on Edit.** Run pnpm typecheck or mypy after edits; the hook can echo errors back to Claude as feedback. Claude self-corrects without the user mediating. - **PreToolUse: block .env reads.** Match Read and reject any path matching .env* with a non-zero exit. Never trust prompt-only constraints when you have a hook contract you can enforce. - **PreToolUse: log every Bash.** Match Bash and append the command + cwd to an audit log before letting it run. Cheap forensics for any "what did Claude do?" investigation. - **Notification hook on Stop.** Match the Stop event and play a sound or send a webhook so you know when a long task finishes. Removes the "is it still running?" tab-switching tax. ### 3 things to put in your team's CLAUDE.md The repo-root CLAUDE.md is committed, shared, and read on every request. Optimize it for the highest-leverage instructions, not for verbosity. - **Project commands.** The exact dev, build, test, typecheck, lint commands. Claude defaults to npm if you don't tell it you use pnpm/yarn/bun. - **Architecture sketch.** A 5-10 line description of the app's shape: frontend, backend, database, auth, deploy target. Saves Claude from grepping the same files every session. - **House rules.** "Never edit db/migrations/", "Use Tailwind utilities, not CSS modules", "Run typecheck before claiming done". Concrete rules, not platitudes. 
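A hedged sketch of the first and third hook patterns above, wired as a team-shared .claude/settings.json plus a small helper script. The JSON shape follows the hooks documentation in the curated sources below; the stdin payload field (tool_input.file_path) and the jq and prettier dependencies are assumptions to verify before shipping:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit|MultiEdit",
        "hooks": [
          { "type": "command", "command": "prettier --write \"$(jq -r '.tool_input.file_path')\"" }
        ]
      }
    ],
    "PreToolUse": [
      {
        "matcher": "Read",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/block-env-reads.sh" }
        ]
      }
    ]
  }
}
```

```bash
#!/usr/bin/env bash
# .claude/hooks/block-env-reads.sh (hypothetical path; make it executable with chmod +x)
# The harness pipes the tool-call JSON to stdin; pull out the requested file path.
path=$(jq -r '.tool_input.file_path // empty')
if [[ "$path" == *.env* ]]; then
  echo "Blocked: .env files are off limits" >&2
  exit 2   # exit code 2 blocks the call and feeds stderr back to Claude
fi
exit 0
```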
## Key takeaways - Claude Code reads three CLAUDE.md tiers (repo-root, repo-local, user-global) on every request; design instructions to live at the right layer. (`claude-md-hierarchy`) - Custom slash commands are markdown files in .claude/commands/; $ARGUMENTS substitutes user input and the filename becomes the command. (`code-generation-with-claude-code`) - PreToolUse hooks can block tool execution (exit code 2); PostToolUse hooks can only react, not block. Choose by what you need to enforce. (`hooks`) - MCP servers in Claude Code extend the tool surface to GitHub, Sentry, Postgres, browsers, etc.; configure via /mcp or settings JSON in the same three-tier hierarchy. (`mcp`) - The Claude Code SDK runs the same harness programmatically with read-only defaults; opt in to writes via allowedTools when you trust the prompt. (`claude-code-for-cicd`) - Settings hierarchy is the meta-pattern: global / project-shared / project-personal applies to hooks, MCP, commands, and permissions identically. (`claude-md-hierarchy`) ## Concepts in play - **CLAUDE.md hierarchy** (`claude-md-hierarchy`), Three-tier context primitive at the heart of every lesson - **Hooks** (`hooks`), PreToolUse / PostToolUse, the integration mechanism for build pipelines - **MCP** (`mcp`), How Claude Code extends its tool surface to external systems - **Tool calling** (`tool-calling`), What hooks intercept and what slash commands invoke - **Agentic loops** (`agentic-loops`), The runtime model under every Claude Code session ## Scenarios in play - **Code generation with Claude Code** (`code-generation-with-claude-code`), Primary scenario covering custom commands and CLAUDE.md authoring - **Claude Code for CI/CD** (`claude-code-for-cicd`), SDK + GitHub Action use case from Lessons 12 and 19 - **Developer productivity agent** (`developer-productivity-agent`), Hooks + slash commands as a daily productivity layer ## Curated sources - **Claude Code Hooks; Anthropic documentation** (anthropic-blog, 2025-08-01): Canonical reference for hook event names, matcher syntax, JSON payload schema, and exit-code semantics. The Skilljar lessons demo hooks; this doc is what you keep open while writing them. - **Claude Code SDK; Anthropic documentation** (anthropic-blog, 2025-09-15): TypeScript and Python SDK reference with query() signatures, allowedTools semantics, and permission modes. Pair with Lesson 19 when you start building automation. - **Claude Code best practices for agentic coding** (anthropic-blog, 2025-04-18): Anthropic's own engineering write-up of how they use Claude Code internally; covers CLAUDE.md authoring, command design, and hook patterns at scale. ## FAQ ### Q1. What is the difference between CLAUDE.md, CLAUDE.local.md, and the user-global CLAUDE.md? Three tiers, all read on every request. CLAUDE.md at repo root is committed and shared with the whole team; put project commands, architecture, house rules here. CLAUDE.local.md is gitignored and holds your personal overrides for this repo. ~/.claude/CLAUDE.md is user-global and applies across every project on your machine. Claude layers all three: later tiers do not overwrite earlier ones; they accumulate. ### Q2. How do I create a custom slash command in Claude Code? Create a markdown file under .claude/commands/ in your repo. The filename becomes the command (audit.md → /audit), the body of the file is the prompt Claude runs when invoked, and $ARGUMENTS is the placeholder substituted with whatever the user types after the command name.
Restart Claude Code after creating the file; discovery happens at startup, not live. ### Q3. Can a PostToolUse hook block a tool call from running? No. Only PreToolUse hooks can block; they run before the tool executes, and exit code 2 rejects the call (stderr is fed back to Claude as the reason). PostToolUse runs after the tool has already executed, so it can only react: format the file, run tests, log to an audit trail, or echo feedback back to Claude. If you need to enforce something, use PreToolUse. ### Q4. Why does my custom slash command show up in /help but not run? Most likely you forgot to restart Claude Code after creating or editing the file. Discovery happens at startup, not live. Other common causes: the file is in the wrong directory (must be .claude/commands/, not .claude/command/ or commands/), or the filename has spaces or uppercase letters that are getting stripped. Use lowercase-hyphenated names. ### Q5. Do I need MCP to integrate Claude Code with GitHub? No, the simplest path is the gh CLI. Claude Code can call gh directly through its Bash tool; no MCP wiring needed for issues, PRs, comments, or repo operations. Use the GitHub MCP server when you want stricter tool boundaries, structured outputs, or features the CLI doesn't expose. Use the official Claude Code GitHub Action when you want Claude to run on PRs and issues from CI. ### Q6. Is the Claude Code SDK the same thing as the Anthropic Messages API? No. The Anthropic Messages API is the raw messages.create() endpoint; you build the agent loop yourself. The Claude Code SDK wraps the *full Claude Code harness* (same tools, same agentic loop, same MCP servers, same CLAUDE.md reading) and exposes it as a query() function in TypeScript or Python. Use the SDK when you want Claude Code's capabilities programmatically; use the raw API when you're building a different harness from scratch. ### Q7. What does /init actually do when I run it in a new project? Claude Code analyzes the codebase (reads the package manifest, samples key directories, checks for tests, infers the framework) and writes a starter CLAUDE.md at the repo root summarizing the project's purpose, architecture, key commands, and important files. Treat the output as a draft, not the final version; almost every team rewrites it after the first week as they learn what Claude actually needs to know. ### Q8. Are hooks safe? Can a hostile prompt make Claude run a destructive shell command via a hook? Hooks run as your shell user with full filesystem and network permissions; that's by design. Claude does not invoke hooks; the harness does, on tool events Claude triggers. The risk is your own hook logic: a Bash hook that pipes the command into a shell evaluator without sanitization is dangerous. Treat hooks like cron jobs you wrote: review them, scope tightly, and never accept hook configs from untrusted sources.
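To make the Q6 distinction concrete, a minimal sketch of driving the harness from Python. The course only promises a query() entry point and an allowedTools opt-in; the package name, import path, and option fields below are assumptions to check against the current SDK reference:

```python
import asyncio

# Assumed import names; query() streams the same tool-call conversation
# you would see in the terminal.
from claude_code_sdk import ClaudeCodeOptions, query


async def main() -> None:
    options = ClaudeCodeOptions(
        allowed_tools=["Read", "Grep", "Glob"],  # read-only defaults; add Edit/Write to opt in
        max_turns=5,
    )
    async for message in query(
        prompt="Summarize the riskiest parts of the latest diff",
        options=options,
    ):
        print(message)


asyncio.run(main())
```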
--- **Source:** https://claudearchitectcertification.com/knowledge/claude-code-in-action **Vault sources:** Course_04/Lesson_06_adding-context.md; Course_04/Lesson_10_custom-commands.md; Course_04/Lesson_11_mcp-servers-with-claude-code.md; Course_04/Lesson_12_github-integration.md; Course_04/Lesson_13_introducing-hooks.md; Course_04/Lesson_14_defining-hooks.md; Course_04/Lesson_15_implementing-a-hook.md; Course_04/Lesson_19_the-claude-code-sdk.md **Last reviewed:** 2026-05-06 --- # AI Fluency: The 4D Framework for Working with AI > AI Fluency teaches a four-competency framework for collaborating with AI: Delegation (deciding what to give the AI), Description (communicating clearly), Discernment (evaluating outputs), and Diligence (taking responsibility). The course is light on technical depth but heavy on the disciplined judgment that separates effective AI users from prompt-and-pray users. Treat it as the operating manual under every other Skilljar course you will take. **Domain:** D4 · Prompt Engineering (20%) **Difficulty:** intro **Skilljar course:** AI Fluency: Framework & Foundations (15 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/ai-fluency-framework **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 20% (D4) + spillover into D5 Direct prep for D4 prompt-engineering competencies and the human-judgment side of D5 reliability. The 4D framework (Delegation, Description, Discernment, Diligence) is the meta-skill that frames every other domain on the exam. ## What you'll learn - What AI Fluency is and the three modes of human-AI engagement (automation, augmentation, agency) - The 4D framework end-to-end: Delegation, Description, Discernment, Diligence - How to apply Problem / Platform / Task awareness inside Delegation - How to write Product, Process, and Performance descriptions when prompting - Six foundational prompting techniques and how to troubleshoot when output is off - How the Description-Discernment loop tightens iteratively, and what Diligence statements are for ## Prerequisites - **Claude 101 (knowledge)** (knowledge · `claude-101`) ## Lesson outline ### 1. Introduction to AI Fluency AI Fluency means engaging with AI effectively, efficiently, ethically, and safely; framework-driven, not vibes-driven. ### 2. Why do we need AI Fluency? Generative AI shifts work from execution to judgment; without a framework, output quality regresses to the prompt-and-pray mean. ### 3. The 4D Framework Four competencies; Delegation, Description, Discernment, Diligence; that compose into every productive AI interaction. ### 4. Generative AI fundamentals How LLMs are trained, what tokens are, why models hallucinate; the technical floor under all four Ds. ### 5. Capabilities & limitations What Claude can do well, what it cannot, and why knowing both is the prerequisite for good Delegation. ### 6. A closer look at Delegation Three sub-skills: Problem Awareness, Platform Awareness, Task Delegation; decide what to give the AI before you start typing. ### 7. Project planning and Delegation Map a project's tasks, mark which need human judgment vs. which can go to AI, build a delegation plan together with Claude. ### 8. A closer look at Description Three sub-skills: Product Description (what), Process Description (how), Performance Description (style of interaction). ### 9. Effective prompting techniques Six techniques: give context, show examples, specify constraints, break into steps, ask AI to think first, define role/tone. ### 10. 
A closer look at Discernment Three sub-skills: Product Discernment (output quality), Process Discernment (reasoning), Performance Discernment (interaction). ### 11. The Description-Discernment loop Description and Discernment form a feedback loop; sharper description improves output, sharper discernment improves your next description. ### 12. A closer look at Diligence Three sub-skills: Creation Diligence (system choice), Transparency Diligence (disclosure), Deployment Diligence (own the output). ### 13. Conclusion Course wrap-up: the 4Ds compose; weakness in any one degrades the others; develop them deliberately. ### 14. Certificate of completion Skilljar issues a verifiable certificate URL; logistics only. ### 15. Additional activities Optional exercises: draft a Diligence statement, audit a recent AI interaction through the 4D lens. ## Our simplification The 4D Framework is the operating system. Every other Skilljar course you take (Claude Code, MCP, Tool Use, RAG) teaches a *capability*. AI Fluency teaches the *discipline* under those capabilities: how to decide what to delegate, how to communicate it, how to evaluate the result, and how to take responsibility for what you ship. The four Ds (Delegation, Description, Discernment, Diligence) are not stages; they are competencies that compose. A great prompt with bad delegation is wasted effort. Great evaluation with no diligence is dishonest. Treat the 4Ds as a checklist you run on every meaningful AI interaction, not a one-time mindset. Delegation is the first D and the most under-practiced. It has three sub-skills: Problem Awareness (do I actually understand what I'm trying to do?), Platform Awareness (what can this specific AI system do, and where does it fall over?), and Task Delegation (which slices go to me, which to the AI, which to a human collaborator). The course's central insight: most AI failures trace back to Delegation, not prompting. People hand off tasks the AI cannot do (deep domain reasoning without context) or keep tasks the AI could do better (mechanical reformatting). The fix is not a better prompt; it's a better split. Description is the prompting competency, and the course breaks it into three layers most people collapse into one. Product Description is *what* you want: the output, format, audience, length. Process Description is *how* you want the AI to get there: show your work, ask before you guess, break into steps. Performance Description is *how you want to interact*: be concise, push back on bad ideas, ask clarifying questions before starting. Six prompting techniques follow: give context, show examples, specify constraints, break into steps, ask the AI to think first, define role or tone. The biggest unlock is realizing AI is a partner, not a vending machine; Performance Description is the layer that earns you the partnership. Discernment is Description's mirror image, and the course makes a sharp claim: you cannot evaluate AI output well without your own expertise. There are three flavors. Product Discernment: is the output accurate, appropriate, coherent, relevant? Process Discernment: did the reasoning track, or did it skip steps and confabulate? Performance Discernment: did the AI behave well during the interaction, or did it sycophantically agree, drift, or refuse to push back? The Discernment-Description feedback loop is where fluency compounds: each round of evaluation sharpens the next round of description, and the AI's output quality climbs with you.
Diligence is the ethics layer and the one that's hardest to formalize. Three sub-skills: Creation Diligence (which system you choose and why; privacy, alignment, capability fit), Transparency Diligence (being honest with everyone who needs to know that AI was involved; readers, clients, students, regulators), and Deployment Diligence (you own the output, full stop, AI did not write it, *you* shipped it). The course recommends drafting a personal or project-level diligence statement declaring how you use AI in your work. The exam-relevant take is that Diligence is what stops Claude's helpfulness from becoming your liability, and a senior architect operationalizes it with policy, not memos. Three modes of engagement frame the whole framework. Automation is AI executing a specific task you scoped (write this email, summarize this doc). Augmentation is you and AI as creative partners thinking through a problem together. Agency is you setting up the AI to act independently; agents, autonomous loops, scheduled tasks. The 4Ds apply to all three modes, but the *weight* shifts. In Automation, Description carries the load. In Augmentation, Discernment dominates because you're iterating live. In Agency, Diligence becomes structural; you cannot review every action live, so the system itself has to encode your judgment. Most exam scenarios live in Agency mode, which is why the framework matters for the architect role. Where this course fits in your prep: if you're coming from Claude 101 or Claude Code 101, AI Fluency is the meta-layer that ties them together. It will not teach you a single API call or a single feature. It will fix the mental model that determines how well every other course you take actually sticks. The course is short (15 lessons, ~75 minutes including videos), heavy on reflection exercises, and intentionally cross-domain; the same framework applies whether you're a writer using Claude.ai, a developer using Claude Code, or an architect designing an agentic system. Skim the lesson videos at 1.5x; the framework itself is what stays with you. ## Patterns ### The 4 Ds at a glance Each D has three sub-skills. Memorize the structure once and the rest of the course is filling in the cells. - **Delegation.** Problem Awareness, Platform Awareness, Task Delegation. Decide what to give the AI before you type the first prompt. - **Description.** Product (what), Process (how), Performance (interaction style). The prompting competency, three layers most people collapse into one. - **Discernment.** Product (output quality), Process (reasoning), Performance (interaction). You cannot evaluate output you don't understand; expertise gates discernment. - **Diligence.** Creation (system choice), Transparency (disclosure), Deployment (you own it). The ethics layer, operationalized through policy, not memos. ### 6 foundational prompting techniques These come from Lesson 9 and map directly to Description sub-skills. Use them as a self-check when output is off. - **Give context.** What you want, why you want it, who it's for, relevant background. Most prompt failures are missing context, not bad wording. - **Show examples.** One or two examples of the output style or format you want. Multishot beats explanation almost every time. - **Specify constraints.** Format, length, what to include, what to exclude. Constraints are how you make output checkable. - **Break into steps.** Decompose multi-step reasoning. Claude follows numbered plans more reliably than implicit chains. 
- **Ask the AI to think first.** Give Claude room to reason before answering. Use scratchpad XML tags or extended thinking when stakes are high. - **Define role or tone.** Specify who Claude is acting as and how it should communicate. Performance Description in one line. ## Key takeaways - The 4D framework; Delegation, Description, Discernment, Diligence; is the operating system under every other AI skill; weakness in one degrades the others. (`4d-framework`) - Most AI failures trace back to Delegation (wrong split), not prompting; fix the split before you fix the prompt. (`4d-framework`) - Description has three layers: Product (what), Process (how), Performance (interaction style). Collapsing them into one is the most common prompting mistake. (`prompt-engineering-techniques`) - Discernment requires domain expertise; you cannot reliably evaluate output you don't understand, which is why human-in-the-loop is structural, not optional. (`evaluation`) - Diligence is operationalized through Creation (system choice), Transparency (disclosure), and Deployment (you own the output). Write a Diligence statement for any project you ship. (`claude-for-operations`) - Three modes of engagement; Automation, Augmentation, Agency; shift which D carries the load; in Agency mode (agents), Diligence has to be encoded structurally. (`conversational-ai-patterns`) ## Concepts in play - **4D framework** (`4d-framework`), The course's central organizing concept - **Prompt engineering techniques** (`prompt-engineering-techniques`), Description's tactical layer - **Evaluation** (`evaluation`), Discernment's structural counterpart - **System prompts** (`system-prompts`), Where Performance Description lives in practice - **Attention engineering** (`attention-engineering`), Process Description at the prompt-architecture level ## Scenarios in play - **Conversational AI patterns** (`conversational-ai-patterns`), Augmentation-mode use case where Discernment dominates - **Claude for operations** (`claude-for-operations`), Agency-mode use case where Diligence becomes structural ## Curated sources - **AI Fluency framework; Anthropic Learn** (anthropic-blog, 2025-06-01): Anthropic's home page for the framework with the source white paper by Rick Dakan and Joseph Feller, plus the diligence statement template referenced in Lesson 12. - **Prompt engineering overview; Anthropic documentation** (anthropic-blog, 2025-09-01): Canonical Anthropic prompt-engineering guide that the Lesson 9 six techniques map onto cleanly. Use it as the technical companion to the framework. - **AI Fluency: A Framework for the AI Age (Dakan & Feller)** (paper, 2024-09-01): Original academic source for the 4D framework; deeper than the Skilljar course on the pedagogical and ethical foundations. ## FAQ ### Q1. What does the 4D framework stand for in AI fluency? The four Ds are Delegation, Description, Discernment, and Diligence. Delegation is deciding what to give the AI; Description is communicating clearly with the AI; Discernment is evaluating what the AI produces; Diligence is taking responsibility for the outcome. They are competencies that compose, not stages; every meaningful AI interaction touches all four. ### Q2. How is description different from prompt engineering? Prompt engineering is *one layer* of Description. Description has three sub-skills: Product Description (what you want), Process Description (how the AI should get there), and Performance Description (how you want the interaction to feel). 
Prompt engineering techniques; context, examples, constraints, decomposition; sit inside Product and Process Description. Performance Description ("push back on bad ideas", "ask before guessing") is the layer most prompt-engineering guides leave out. ### Q3. Why do AI experts say delegation matters more than prompting? Because most AI failures are wrong-split failures, not wrong-prompt failures. If you hand Claude a task it cannot do well; deep domain reasoning without your context, novel research with no source material; no amount of prompting saves the output. The reverse is also true: if you keep tasks Claude could do faster (mechanical reformatting, boilerplate generation), you're paying a tax for no reason. Fix the split first, then optimize the prompt. ### Q4. What is a diligence statement and why would I write one? A diligence statement is a short written declaration of how you use AI in a specific project or role; which systems, what you delegate, how you disclose, how you verify before shipping. It operationalizes Diligence into policy rather than vague intent. The Skilljar course links to the example diligence statement Anthropic itself published for the AI Fluency course. Useful in academic, professional, regulated, or client-facing work where AI involvement is a question that will be asked. ### Q5. How do I evaluate Claude's output if I'm not an expert in the topic? You can't, reliably, and the course is direct about that. Discernment depends on domain expertise; without it, you can spot fluency but not accuracy. The practical workarounds: only delegate evaluation-required work in domains where you *are* expert, recruit a human reviewer who is, or design a verification step (independent source, eval set, automated test) that does not require you to judge the content. Trusting AI output you cannot evaluate is the failure mode the framework exists to prevent. ### Q6. What are the three modes of engaging with AI according to the framework? Automation, where AI executes a specific task you scoped. Augmentation, where you and AI collaborate as creative partners on the same problem. Agency, where you guide AI to act independently on your behalf; agents, scheduled jobs, autonomous loops. The 4Ds apply to all three modes, but the weight shifts: Description carries Automation, Discernment carries Augmentation, Diligence has to be structurally encoded in Agency. ### Q7. Is AI fluency a technical skill or a soft skill? Both, deliberately. Delegation and Description involve technical literacy (knowing what models can do, how prompts compose, what context windows are). Discernment and Diligence involve judgment, ethics, and communication. The framework's premise is that effective AI use cannot be split into "technical" and "soft"; the architect role specifically requires fluency on both axes, which is why the course is foundational for exam prep. ### Q8. Why does my prompt seem clear but Claude still misunderstands? Most often you've written strong Product Description (what you want) but skipped Process Description (how to get there) or Performance Description (interaction style). Try adding "think step by step before answering", "ask me clarifying questions if anything is ambiguous", or "show your reasoning in tags". The course's secret-weapon move: ask Claude itself to critique and improve your prompt before you run it on the actual task. 
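To make the three Description layers concrete, here is a minimal sketch using the Anthropic Python SDK. The editor persona, the summarization task, and the model alias are illustrative assumptions, not taken from the course; the point is only where each layer lives in the call.

```python
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

# Performance Description: how the interaction should feel (lives in the system prompt).
system = (
    "You are a blunt senior editor. Push back on weak arguments and "
    "ask one clarifying question before guessing at missing context."
)

response = client.messages.create(
    model="claude-sonnet-4-0",  # example alias; use whichever model you are targeting
    max_tokens=1024,
    system=system,
    messages=[{
        "role": "user",
        "content": (
            # Product Description: what you want.
            "Rewrite the draft below as a three-bullet executive summary.\n"
            # Process Description: how to get there.
            "First list the draft's claims, keep the three strongest, "
            "then compress each into one bullet of at most 20 words.\n\n"
            "<draft>...paste the draft here...</draft>"
        ),
    }],
)

for block in response.content:
    if block.type == "text":
        print(block.text)
```

If the output is off, the 4D diagnosis is usually that one of the three layers is missing, not that the wording of the others needs more polish.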
--- **Source:** https://claudearchitectcertification.com/knowledge/ai-fluency-framework **Vault sources:** Course_05/Lesson_03_the-4d-framework.md; Course_05/Lesson_05_capabilities-and-limitations.md; Course_05/Lesson_06_a-closer-look-at-delegation.md; Course_05/Lesson_08_a-closer-look-at-description.md; Course_05/Lesson_09_effective-prompting-techniques.md; Course_05/Lesson_10_a-closer-look-at-discernment.md; Course_05/Lesson_11_the-description-discernment-loop.md; Course_05/Lesson_12_a-closer-look-at-diligence.md **Last reviewed:** 2026-05-06 --- # Building with the Claude API: Foundations to Agents > Building with the Claude API is the comprehensive bottom-up tour of the Anthropic SDK: messages, system prompts, streaming, structured outputs, evals, prompt engineering, tool use, RAG, MCP, prompt caching, vision, citations, and finally agents and workflows. Eighty-five lessons across fourteen sections, organized so each capability builds on the prior. Treat it as the canonical API reference course; every other course in the catalog assumes you know what is here. **Domain:** D2 · Tool Design + Integration (18%) **Difficulty:** intermediate **Skilljar course:** Building with the Claude API (85 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/claude-api-foundations **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 18% (D2) + heavy spillover into D1 / D4 / D5 The most exam-relevant single course in the catalog. Direct prep for D2 tool design, D1 agentic loops, D4 prompt engineering, and D5 context features (caching, batch, streaming, citations, vision). If you can only take one Skilljar course before the exam, this is it. ## What you'll learn - How client.messages.create() works, including system prompts, temperature, streaming, and structured outputs - How to build a prompt-evaluation pipeline with model-based and code-based grading on a curated test set - Six prompt-engineering techniques: clarity, specificity, XML tags, examples, role, decomposition - How tool use works end-to-end: schemas, message blocks, tool results, multi-turn tool calling, fine-grained streaming - How RAG composes (chunking, embeddings, BM25, multi-index reranking) and how prompt caching, vision, citations stack on top - How MCP exposes tools/resources/prompts, and when to choose workflows (chaining, routing, parallelization) versus full agents ## Prerequisites - **Claude 101 (knowledge)** (knowledge · `claude-101`) ## Lesson outline ### 1. Welcome to the course Course frame: bottom-up API tour from first request to full agent architectures. ### 2. Overview of Claude models Opus, Sonnet, Haiku tiers; pick by capability vs. cost vs. latency tradeoff. ### 3. Accessing the API Anthropic Console, direct API, AWS Bedrock, Google Vertex; same model, different access path. ### 4. Getting an API key Generate a key from console.anthropic.com; never commit it; use .env plus python-dotenv. ### 5. Making a request client.messages.create(model, max_tokens, messages); the three required params; max_tokens is a safety cap, not a target. ### 6. Multi-turn conversations Append assistant responses to your messages list; the model is stateless, the conversation is your responsibility. ### 7. Chat exercise Hands-on: build a minimal CLI chatbot with multi-turn message history. ### 8. System prompts Pass system="..." to set role, behavior, constraints; system prompts are not in the messages array. ### 9. 
System prompts exercise Hands-on: write a system prompt that turns Claude into a Socratic math tutor without giving direct answers. ### 10. Temperature 0.0 = deterministic and focused; 1.0 = creative and varied. Default 1.0; lower for extraction, higher for ideation. ### 11. Course satisfaction survey Mid-course survey checkpoint, no technical content. ### 12. Response streaming Use stream=True and iterate events; ship tokens to the UI as they arrive instead of waiting for the full response. ### 13. Structured data Ask for JSON in the prompt and parse; for strict schemas use tool-calling as a structured-output mechanism. ### 14. Structured data exercise Hands-on: extract structured fields (name, date, amount) from unstructured invoice text. ### 15. Quiz on accessing Claude with the API Knowledge check covering messages, system prompts, temperature, streaming, structured data. ### 16. Prompt evaluation Don't ship a prompt without evaluating it; evals are the difference between a demo and a system. ### 17. A typical eval workflow Generate test dataset, run prompt against each, grade outputs, score, iterate. ### 18. Generating test datasets Use Claude itself to generate diverse test inputs covering edge cases, then hand-curate. ### 19. Running the eval Run prompt against the test set in parallel with rate-limit handling; collect raw outputs. ### 20. Model-based grading Use a grader prompt (often a stronger model) to score outputs against a rubric; cheap and flexible. ### 21. Code-based grading Programmatic checks: regex, schema validation, exact match. Use when criteria are mechanical, model-grade when subjective. ### 22. Exercise on prompt evals Hands-on: build an eval pipeline for a meal-plan prompt with mixed code-based and model-based grading. ### 23. Quiz on prompt evaluation Knowledge check covering eval workflow, dataset generation, model vs. code grading. ### 24. Prompt engineering Iterative loop: goal → initial prompt → eval → apply technique → re-eval; repeat until you hit your bar. ### 25. Being clear and direct Tell Claude exactly what you want; ambiguity is the most common failure mode. ### 26. Being specific Concrete constraints beat abstract instructions; "three bullet points, max 15 words each" beats "be concise". ### 27. Structure with XML tags Wrap inputs in XML tags; Claude is trained to attend to XML structure. ### 28. Providing examples Multishot beats explanation; one or two well-chosen examples shape output style faster than five paragraphs of description. ### 29. Exercise on prompting Hands-on: take a weak prompt and apply the four engineering techniques to lift its eval score. ### 30. Quiz on prompt engineering techniques Knowledge check covering clarity, specificity, XML, examples, role, decomposition. ### 31. Introducing tool use Tool use lets Claude call your functions; you describe the tool, Claude decides when to invoke it, you run it and return the result. ### 32. Project overview Course-long project frame: build a customer-data agent using tool use end-to-end. ### 33. Tool functions Write the actual Python functions Claude will call; pure logic, no Anthropic-specific glue. ### 34. Tool schemas JSON schema with name, description, input_schema; the description is what Claude uses to decide when to call. ### 35. Handling message blocks Response content is a list of blocks: text, tool_use, thinking. Iterate the list, don't index [0]. ### 36. Sending tool results After running the tool, send a tool_result block in the next user message with the matching tool_use_id.
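To ground Lessons 31-36 before the loop lessons that follow, here is a minimal sketch of one tool round trip; the get_weather tool, its schema, and the model alias are illustrative assumptions, not course code.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-0"  # example alias

# Lesson 34: the schema; the description is what Claude reads to decide when to call.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city. Use when the user asks about weather.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny, 22°C in {city}"  # stub standing in for a real lookup

messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]
response = client.messages.create(model=MODEL, max_tokens=1024, tools=tools, messages=messages)

if response.stop_reason == "tool_use":
    # Lesson 35: iterate the blocks, don't index [0].
    tool_use = next(b for b in response.content if b.type == "tool_use")
    result = get_weather(**tool_use.input)

    # Lesson 36: echo the assistant turn, then return a tool_result with the matching id.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{"type": "tool_result", "tool_use_id": tool_use.id, "content": result}],
    })
    response = client.messages.create(model=MODEL, max_tokens=1024, tools=tools, messages=messages)

for block in response.content:
    if block.type == "text":
        print(block.text)
```

Lessons 37-38 wrap exactly this round trip in a loop keyed on stop_reason == 'tool_use'.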
### 37. Multi-turn conversations with tools Loop: Claude requests tool → you run it → you append result → Claude responds or requests another tool. ### 38. Implementing multiple turns Hands-on: write the agent loop with stop_reason 'tool_use' as the continuation signal. ### 39. Using multiple tools Pass an array of tools; Claude picks based on tool descriptions; use tool_choice to force or restrict selection. ### 40. Fine-grained tool calling Stream tool inputs token-by-token as Claude generates them; useful for long arguments and progressive UIs. ### 41. The text edit tool Anthropic-defined tool that lets Claude edit files via structured ops (view, create, str_replace, insert). ### 42. The web search tool Anthropic-hosted web search tool; Claude issues search queries and you get cited results back. ### 43. Quiz on tool use with Claude Knowledge check covering schemas, message blocks, tool results, multi-tool selection, fine-grained streaming. ### 44. Introducing retrieval augmented generation RAG = retrieve relevant context from your data, stuff it into the prompt, let Claude answer with citations. ### 45. Text chunking strategies Split documents into chunks; size and overlap matter; semantic boundaries beat naive token splits. ### 46. Text embeddings Vectorize chunks with an embedding model; cosine similarity finds semantically near chunks at query time. ### 47. The full RAG flow Ingest → chunk → embed → store. Query → embed → search → rerank → stuff prompt → generate. ### 48. Implementing the RAG flow Hands-on: build the ingest and query path with a vector DB and embedding API. ### 49. BM25 lexical search Keyword-based ranking that complements semantic search; catches exact terms semantic search misses. ### 50. A multi-index RAG pipeline Combine dense (embeddings) + sparse (BM25) retrieval and rerank; the production-quality pattern. ### 51. Extended thinking thinking={'type': 'enabled', 'budget_tokens': N} gives Claude scratchpad reasoning before answering; use for hard problems. ### 52. Image support Pass images as base64 or URL in message content; Claude reads diagrams, screenshots, charts, photos natively. ### 53. PDF support Upload PDFs as document content blocks; Claude reads text and visual layout (tables, figures) together. ### 54. Citations Citations API attaches source spans to Claude's claims so users can verify; built-in for document content blocks. ### 55. Prompt caching Mark a prefix as cache_control to cache it; subsequent calls reuse the cached prefix at ~10% cost. ### 56. Rules of prompt caching Cache breakpoints must be deterministic, ordered, and exact; caching breaks on any change above the breakpoint. ### 57. Prompt caching in action Hands-on: cache a long system prompt + RAG context block; measure cost and latency reduction. ### 58. Code execution and the files API Anthropic-hosted Python sandbox + file storage; Claude can run code and read its output back. ### 59. Quiz on features of Claude Knowledge check covering extended thinking, vision, PDFs, citations, caching, code execution. ### 60. Introducing MCP Model Context Protocol: open standard for LLM clients to discover and use tools, resources, and prompts from servers. ### 61. MCP clients Clients (Claude Desktop, Claude Code, Cursor) speak MCP; same server works across all of them. ### 62. Project setup Hands-on project frame: build an MCP server that exposes customer data to any MCP client. ### 63. Defining tools with MCP Same tool concept as the API, exposed via the MCP tools/list and tools/call methods. 
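As a minimal sketch of Lesson 63's point, here is the same tool concept expressed with the official Python SDK's FastMCP class; the customer-lookup tool is illustrative and its return value is a stub.

```python
# server.py — run with `uv run server.py`, or `mcp dev server.py` to open the Inspector
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("customer-data")  # illustrative server name

@mcp.tool()
def lookup_customer(customer_id: str) -> str:
    """Look up a customer record by ID. Use when the user references a specific customer."""
    # Stub standing in for a real database query.
    return f"Customer {customer_id}: plan=pro, status=active"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```

The docstring becomes the tool's description and the type hints become its input schema; that is what clients see via tools/list, and it is what the Inspector in the next lesson lets you exercise by hand.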
### 64. The server inspector MCP Inspector is a dev tool that connects to your server and lets you call tools/resources/prompts manually. ### 65. Implementing a client Build a Python MCP client that lists server tools and routes Claude's tool calls to the server. ### 66. Defining resources Resources expose read-only context (files, records) the client can fetch on demand; not tool calls. ### 67. Accessing resources Client lists resources, fetches by URI, includes content in the prompt; cleaner than ad-hoc context loading. ### 68. Defining prompts MCP prompts are reusable prompt templates the server publishes; client invokes by name with arguments. ### 69. Prompts in the client Hands-on: list and invoke server prompts from the Python client. ### 70. MCP review Recap: tools = actions, resources = read-only context, prompts = reusable templates. Three primitives, one protocol. ### 71. Quiz on Model Context Protocol Knowledge check covering tools, resources, prompts, client/server architecture. ### 72. Anthropic apps Tour of Anthropic-built apps: Claude.ai, Claude Code, Computer Use; how each layers on top of the API. ### 73. Claude Code setup Install Claude Code, point it at the repo, see CLAUDE.md and slash commands in their native habitat. ### 74. Claude Code in action Live walkthrough of Claude Code editing a real codebase end-to-end; same harness available via SDK. ### 75. Enhancements with MCP servers Add MCP servers (GitHub, Postgres, Sentry) to Claude Code or Claude Desktop and watch capability bloom. ### 76. Agents and workflows Two architectures: workflows (predefined steps, model fills gaps) vs. agents (model drives flow, tool use loops). ### 77. Parallelization workflows Run multiple LLM calls in parallel and aggregate; lowers latency on independent subtasks. ### 78. Chaining workflows Sequential pipeline: output of one call is input to the next; classic prompt chain. ### 79. Routing workflows Classify input, route to specialized prompt or tool; cheaper and more accurate than one giant prompt. ### 80. Agents and tools True agents loop on tool use until stop_reason != 'tool_use'; the model decides when it's done. ### 81. Environment inspection Give the agent a tool to inspect its environment (list files, read state) before it acts. ### 82. Workflows vs agents Use workflows when steps are predictable; use agents when the path depends on what the model finds. ### 83. Quiz on agents and workflows Knowledge check covering parallelization, chaining, routing, agent loops, when to choose each. ### 84. Final assessment Course-final assessment integrating everything from messages through agents. ### 85. Course wrap-up Recap and pointers to MCP Advanced, Subagents, Claude Code in Action as next courses. ## Our simplification Building with the Claude API is the spine of the entire Skilljar catalog. Eighty-five lessons, fourteen sections, organized as a strict bottom-up build: messages → system prompts → streaming → structured outputs → evals → prompt engineering → tool use → RAG → features → MCP → agents and workflows. Each section assumes you have the prior. The course is heavy on hands-on Jupyter-notebook exercises and rewards code-along; passive watching loses ~60% of the value. If you can take only one Skilljar course before the exam, take this one, because it is the canonical reference every other course assumes. The API surface itself collapses to one function and three primitives. The function is client.messages.create(model, max_tokens, messages, ...). 
The primitives layered on top are system prompt (role and behavior), temperature (0.0 deterministic to 1.0 creative), and streaming (token-by-token via stream=True). The conversation is *your* responsibility; Claude is stateless, you append assistant responses back into the messages list yourself. max_tokens is a safety cap, not a target; Claude doesn't try to fill it. Structured outputs come for free if you ask for JSON in the prompt, but for strict schemas the production-quality pattern is to use tool-calling as a structured-output mechanism, which the course transitions to in Section 6. Evaluation is the gate between a demo and a system, and prompt engineering is what you do once that gate is open. Section 4 (Lessons 16-23) walks the eval workflow in five steps: generate a diverse test dataset (use Claude itself, then hand-curate), run the prompt against each input in parallel with rate-limit handling, grade outputs (model-based for subjective criteria, code-based for mechanical ones), score, iterate. Section 5 (Lessons 24-30) layers prompt engineering on top: be clear and direct (ambiguity is the dominant failure mode), be specific (concrete constraints beat abstract instructions), use XML tags (Claude is trained to attend to them), provide examples (multishot beats explanation), define a role, and decompose multi-step reasoning. The five-step engineering loop (goal → prompt → eval → apply technique → re-eval) is *only* possible if you have the eval pipeline from Section 4 in place; without evals, every prompt change is vibes-based. Pair Lesson 27 (XML tags) with the docs.claude.com prompt-engineering guide; both are required reading for the exam's prompt-engineering domain. Tool use is the longest section (Lessons 31-43) and the most tested concept on the exam. The mental model: you describe tools to Claude (name, description, JSON input schema), Claude decides when to invoke them, you actually run them, you send the result back via a tool_result block, repeat. The agent loop is just while stop_reason == 'tool_use'; when stop_reason flips to end_turn, the model is done. Three sub-skills the course emphasizes: writing good tool descriptions (the description is what Claude reads to choose), iterating message *blocks* not message *strings* (responses are arrays of text, tool_use, thinking blocks), and matching tool_use_id between request and result. Fine-grained tool calling (Lesson 40) streams tool inputs as they're generated; important for progressive UIs but optional for first builds. RAG is presented as a composition pattern, not a product. Lessons 44-50 walk the full pipeline: chunk documents (semantic boundaries beat naive token splits), embed chunks (vectorize with an embedding model), store in a vector DB, retrieve at query time by cosine similarity, rerank, stuff into the prompt, generate. The production move is multi-index retrieval: combine dense (embeddings, captures meaning) + sparse (BM25, captures exact terms) and rerank. The course's RAG section is video-heavy and short on code, so the actual implementation work happens in Lesson 48; treat the rest as conceptual scaffolding. Pair this section with the Citations and Prompt Caching lessons from Section 8, because production RAG always involves both. Section 8 (Features of Claude) is where the exam-tested optimization knobs live.
Prompt caching (Lessons 55-57) marks a prefix as cache_control and reuses it across calls at ~10% cost; the rules are strict; cache breakpoints must be deterministic, ordered, and exact, and any change above the breakpoint invalidates everything. Extended thinking gives Claude an explicit reasoning budget (thinking={'type': 'enabled', 'budget_tokens': N}) for hard problems. Vision (image and PDF) is a content-block extension; pass image or document blocks alongside text. Citations attach source spans to claims; built-in for document blocks. Code execution and the Files API give Claude a Python sandbox. All five of these are highly testable on the certification and stack with tool use and RAG; learning them together is faster than learning them separately. MCP and agents close the course. Lessons 60-71 build a working MCP server and client from scratch, with three primitives: tools (actions Claude can take), resources (read-only context Claude can fetch by URI), prompts (reusable templates the server publishes). The MCP Inspector is a dev tool that connects to your server and lets you call everything manually; use it before you wire up any client. The agents section (Lessons 76-83) frames the architectural choice cleanly: workflows for predictable, predefined paths (parallelization, chaining, routing) and agents for paths that depend on what the model finds (loop on tool_use, model decides when done). The course's final claim is that workflows and agents are not opposites; production systems usually mix them, with agents inside specific workflow steps. This is the architect-role conceptual takeaway; the certification's D1 domain hinges on getting it right. ## Patterns ### The 8 load-bearing themes across 85 lessons Don't try to memorize every lesson. Anchor on these eight themes; the lessons fill them in. - **Messages + parameters.** client.messages.create(), system prompt, temperature, max_tokens, streaming. The base layer everything else sits on. - **Structured outputs.** JSON-in-prompt for casual, tool-call-as-output for strict schemas. Lessons 13-14, then revisited in Section 6. - **Evals.** Generate dataset, run, grade (model + code), score, iterate. Section 4 is the gate from demo to system. - **Prompt engineering.** Clear, specific, XML-tagged, exemplified, role-defined, decomposed. Six techniques applied iteratively. - **Tool use.** Schema → tool_use block → run → tool_result → loop. The longest section and most tested concept. - **RAG.** Chunk, embed, retrieve, rerank, stuff, generate. Multi-index (BM25 + dense) is the production pattern. - **Features (caching, vision, citations, thinking, code exec).** The optimization knobs. Highly testable; learn together because they stack. - **MCP + agents/workflows.** MCP exposes tools/resources/prompts via protocol. Workflows (predictable) vs. agents (path depends on findings). ### 5 concrete eval-pipeline moves from Lessons 16-23 Section 4 is short on celebrity but heavy on exam relevance. These five moves are the eval blueprint. - **Generate test data with Claude.** Use a stronger model with a clear rubric to generate diverse test inputs covering edge cases. Then hand-curate. - **Run with concurrency control.** Start max_concurrent_tasks=3 to avoid rate limits; raise once you know your quota. Async + retry on 429. - **Use model-based grading for subjective criteria.** Tone, helpfulness, completeness; these are model-grader territory. Pin the grader to a stronger model than the generator. 
- **Use code-based grading for mechanical criteria.** Schema validation, regex match, length bounds, keyword presence. Cheap, deterministic, no model in the grading loop. - **Iterate with measurable deltas.** Each prompt change should lift the eval score, not just feel better. Without this discipline, prompt-engineering is vibes-based tuning. ### Workflows vs. agents; when to choose each Lessons 76-82 frame the choice; this is the D1 architect-role decision the exam tests. - **Use a workflow when the steps are predictable.** Sequential chain (chaining), parallel fan-out (parallelization), classify-then-route (routing). Cheaper, more debuggable, lower variance. - **Use an agent when the path depends on findings.** Loop on stop_reason == 'tool_use'. The model decides when to stop. Higher variance, higher capability ceiling. - **Mix them in real systems.** Agents inside specific workflow steps. The agent does the open-ended sub-task; the workflow handles deterministic before/after. ## Key takeaways - client.messages.create() is the only function; model, max_tokens, and messages are the three required parameters. Claude is stateless; the conversation is your job. (`agentic-loops`) - Evals (model-based + code-based grading on a curated test set) are the gate between a demo and a system; without them prompt iteration is vibes-based. (`evaluation`) - Tool use is a four-step loop: describe (schema) → request (tool_use block) → run → return (tool_result block); continue while stop_reason == 'tool_use'. (`tool-calling`) - Production-quality RAG means multi-index retrieval; dense (embeddings) + sparse (BM25) plus a reranker; chunking strategy matters more than embedding model choice. (`long-document-processing`) - Prompt caching reuses a deterministic, ordered, exact prefix at ~10% cost; pair it with extended thinking, vision, citations, and code execution as the five exam-tested optimization knobs. (`prompt-caching`) - MCP exposes three primitives; tools (actions), resources (read-only context), prompts (reusable templates); over an open protocol; same server runs across Claude Desktop, Claude Code, Cursor. (`mcp`) ## Concepts in play - **System prompts** (`system-prompts`), Section 3 primitive, used everywhere - **Tool calling** (`tool-calling`), Longest section (31-43), most tested concept - **Structured outputs** (`structured-outputs`), Tool-call-as-output is the strict-schema pattern - **Evaluation** (`evaluation`), Section 4, the gate from demo to system - **Prompt engineering techniques** (`prompt-engineering-techniques`), Section 5, the six techniques - **Prompt caching** (`prompt-caching`), Section 8 optimization, cost reduction at ~10% - **Vision and multimodal** (`vision-multimodal`), Section 8 image and PDF support - **Streaming** (`streaming`), Section 3 + fine-grained tool streaming in Section 6 - **MCP** (`mcp`), Section 9, full protocol walkthrough - **Agentic loops** (`agentic-loops`), Section 11, workflows vs.
agents architectural choice ## Scenarios in play - **Long document processing** (`long-document-processing`), RAG section (44-50) plus PDF support and citations - **Structured data extraction** (`structured-data-extraction`), Section 3 structured outputs + Section 6 tool-call-as-output - **Agentic tool design** (`agentic-tool-design`), Section 6 tool use end-to-end, the most exam-relevant scenario - **Customer support resolution agent** (`customer-support-resolution-agent`), Section 11 agents + tool use composition ## Curated sources - **Building effective agents; Anthropic engineering** (anthropic-blog, 2024-12-19): Anthropic's canonical engineering essay distinguishing workflows (chaining, routing, parallelization) from agents. Lessons 76-82 are the course version; this is the source-of-truth. - **Prompt engineering overview; Anthropic documentation** (anthropic-blog, 2025-09-01): Canonical reference for the six prompt-engineering techniques the course teaches in Lessons 24-30. Keep open while building production prompts. - **Prompt caching; Anthropic documentation** (anthropic-blog, 2025-08-01): Authoritative reference for cache breakpoint rules, eligible content, TTL behavior, and pricing math. The course's caching lessons are video-heavy; the doc is what you reference while shipping. ## FAQ ### Q1. What does client messages create do in the Anthropic Python SDK? client.messages.create() is the single API function for all Claude generation. You pass model (e.g. claude-sonnet-4-0), max_tokens (a safety cap, not a target), and messages (a list of {role, content} dicts). Optional params like system, temperature, stream, tools, and tool_choice shape the call. Claude is stateless; every call sends the full conversation, and you append assistant responses back into messages yourself. ### Q2. When should I use a system prompt versus putting the same instructions in the user message? Use a system prompt for stable role and behavior that doesn't change turn-to-turn; "you are a math tutor", "respond in JSON", "never reveal internal IDs". Put turn-specific instructions in the user message. System prompts are passed as the system parameter (not inside messages) and are weighted slightly higher in attention, which makes them the right place for guardrails and persona. ### Q3. How does tool use actually work end-to-end with the Claude API? Four steps. Step 1: pass tools=[...] describing each tool with name, description, and input_schema. Step 2: Claude returns a response containing a tool_use content block with the tool name and arguments; stop_reason will be tool_use. Step 3: you run the tool yourself and capture its output. Step 4: append a user message with a tool_result block matching the tool_use_id, then call messages.create() again. Loop while stop_reason == 'tool_use'. ### Q4. Why is my prompt caching not actually saving any tokens? Three common causes. First, the content above your cache breakpoint isn't truly stable; any change above the cache_control block invalidates the entire cache. Second, the cached content is below the minimum cache size (1024 tokens for most models). Third, calls are spaced more than the cache TTL apart (default 5 minutes for ephemeral caching). Check the cache_creation_input_tokens vs. cache_read_input_tokens counts in the response usage to confirm hits. ### Q5. What is the difference between a workflow and an agent in the Claude API? A workflow has predefined steps; the model fills in the content of each step but does not choose the path.
Examples: chaining (step A → step B → step C), routing (classify input, dispatch to specialized prompt), parallelization (fan out, aggregate). An agent loops on tool use; the model decides what to call next based on what it has found, and decides when to stop. Use workflows for predictability and lower cost; use agents when the path depends on what the model discovers. ### Q6. Do I need a vector database to do RAG with Claude? Not for small corpora. If your data fits in the context window (with prompt caching to keep it cheap), you can skip retrieval entirely and stuff everything in. For larger corpora, yes; and the production pattern is multi-index: a vector DB for semantic similarity *plus* BM25 keyword search, with results reranked. Pinecone, Weaviate, pgvector, Turbopuffer all work; Anthropic doesn't ship a vector DB. ### Q7. Is MCP the same thing as tool use in the Claude API? Related but distinct. Tool use is an API feature where you describe tools in your messages.create() call and run them yourself. MCP is a protocol that lets a Claude client (Desktop, Claude Code, Cursor) discover and call tools, fetch resources, and use prompts from an external server. MCP servers expose tools that *become* tool-use entries inside the API call the client makes. So MCP is upstream of tool use: it's how the tools get registered with the client; tool use is how they get called. ### Q8. How long does it take to complete the Building with the Claude API course? Skilljar estimates ~8 hours of video and exercises across 85 lessons, plus several hands-on coding exercises that double the wall-clock time if you actually code along. The course is heavy on Jupyter-notebook walkthroughs; passive watching loses ~60% of the value. Plan two full work-day blocks if you want to internalize tool use, RAG, and the agent/workflow distinction at exam-prep depth. --- **Source:** https://claudearchitectcertification.com/knowledge/claude-api-foundations **Vault sources:** Course_06/Lesson_05_making-a-request.md; Course_06/Lesson_08_system-prompts.md; Course_06/Lesson_10_temperature.md; Course_06/Lesson_12_response-streaming.md; Course_06/Lesson_13_structured-data.md; Course_06/Lesson_16_prompt-evaluation.md; Course_06/Lesson_24_prompt-engineering.md; Course_06/Lesson_27_structure-with-xml-tags.md; Course_06/Lesson_31_introducing-tool-use.md; Course_06/Lesson_44_introducing-retrieval-augmented-generation.md; Course_06/Lesson_55_prompt-caching.md; Course_06/Lesson_60_introducing-mcp.md; Course_06/Lesson_76_agents-and-workflows.md; Course_06/Lesson_82_workflows-vs-agents.md **Last reviewed:** 2026-05-06 --- # MCP Foundations: Tools, Resources, Prompts > Model Context Protocol (MCP) is an open spec that lets any Claude client talk to any external system through a thin server, using three primitives: tools, resources, and prompts. You write a Python (or TypeScript) MCP server once, and every MCP-aware client (Claude Desktop, Claude Code, Cursor, your own harness) can use it without bespoke glue. This course builds a working server and client end-to-end so you understand the wire protocol, the inspector, and how the three primitives map to real agent behavior. 
**Domain:** D2 · Tool Design + Integration (18%) **Difficulty:** intermediate **Skilljar course:** Introduction to Model Context Protocol (14 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/mcp-foundations **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 18% (D2) + 20% (D3) Direct prep for D2 task statements on extending Claude with external tools and D3 questions on integration patterns. MCP shows up across claude-code-in-action, the agentic-tool-design scenario, and any item that asks how Claude reaches systems outside its context. ## What you'll learn - What MCP is, why it exists, and how it differs from raw tool-calling on the Claude API - The three MCP primitives (tools, resources, prompts) and which one to reach for in a given situation - How to scaffold an MCP server in Python with the official SDK and verify it with the MCP Inspector - How to implement an MCP client that handshakes, lists capabilities, and routes calls to the server - How resources differ from tools (read-only data exposure vs callable actions) and why that distinction matters - How prompts work as reusable, parameterized prompt templates the user explicitly invokes ## Prerequisites - **Tool calling (concept)** (concepts · `tool-calling`) - **MCP overview (concept)** (concepts · `mcp`) - **Claude API foundations (knowledge)** (knowledge · `claude-api-foundations`) ## Lesson outline ### 1. Welcome to the course Course framing: build an MCP server and client from scratch in Python; tools, resources, prompts are the three primitives. ### 2. Introducing MCP MCP is the USB-C of LLM integrations: one open protocol so any client speaks to any tool server without bespoke glue. ### 3. MCP clients Clients (Claude Desktop, Claude Code, Cursor, custom harnesses) discover and call MCP servers over stdio or streamable-http. ### 4. Project setup Scaffold a Python project with uv, install the MCP SDK, and create the empty server entrypoint that the inspector can hit. ### 5. Defining tools with MCP A tool is a callable function the model can decide to invoke; declare it with a name, description, and JSON-Schema input shape. ### 6. The server inspector MCP Inspector is the curl-equivalent for your server; use it to manually fire tools and verify the wire protocol before wiring a client. ### 7. Course satisfaction survey Mid-course feedback checkpoint; no technical content. ### 8. Implementing a client Build a minimal Python client that handshakes, calls list_tools, and routes Claude's tool-use blocks to the server. ### 9. Defining resources A resource is read-only data exposed at a URI (file, DB row, log) that the client can fetch and inject as context. ### 10. Accessing resources Clients enumerate resources via list_resources and fetch bytes via read_resource; the user (not the model) usually picks which to attach. ### 11. Defining prompts Prompts are server-defined, parameterized prompt templates a user explicitly invokes (e.g., a slash command in the client). ### 12. Prompts in the client Wire the client to surface server prompts as user-selectable shortcuts; the user fills in arguments, the server returns a message list. ### 13. Final assessment on MCP Assessment covering primitives, control flow, and which surface to use for which job. ### 14. MCP review Recap: tools are model-controlled actions, resources are user-controlled data, prompts are user-invoked templates. 
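Lesson 14's recap translates almost line-for-line into the Python SDK. Here is a minimal sketch of all three primitives on one FastMCP server; the server name, URI, file path, and template text are illustrative assumptions, not course code.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-server")  # illustrative name

# Tool: model-controlled action (the model decides when to call it).
@mcp.tool()
def create_ticket(title: str, body: str) -> str:
    """Create a support ticket. Use when the user asks to file or escalate an issue."""
    return "ticket-123 created"  # stub for the real side-effectful call

# Resource: user-controlled, read-only context exposed at a URI. No side effects.
@mcp.resource("docs://support-policy")
def support_policy() -> str:
    """The support policy document, attached at the user's discretion."""
    with open("support_policy.md") as f:  # illustrative file
        return f.read()

# Prompt: user-invoked template, surfaced as a shortcut (e.g. a slash command).
@mcp.prompt()
def draft_reply(customer_name: str, issue: str) -> str:
    """Draft a reply that cites the support policy."""
    return f"Draft a polite reply to {customer_name} about: {issue}. Cite the support policy."

if __name__ == "__main__":
    mcp.run()  # stdio by default; verify each surface with `mcp dev server.py`
```

The control points are visible in the decorators: only the tool has side effects, the resource only returns bytes, and the prompt only returns a message template.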
## Our simplification MCP exists because every team that integrated Claude with their internal systems was solving the same plumbing problem: write a tool schema, write the dispatcher, write the auth, write it again next quarter when the API changes. Model Context Protocol standardizes that plumbing as an open spec so any MCP-aware client (Claude Desktop, Claude Code, Cursor, your own harness) can talk to any MCP server without bespoke wiring. Anthropic publishes the spec and reference SDKs in Python and TypeScript; the community publishes hundreds of servers for GitHub, Slack, Postgres, filesystems, browsers, and more. The protocol exposes three primitives, and the exam loves to test which one fits which job. Tools are functions the *model* decides to call (write to a database, run a query, send a Slack message). Resources are read-only data the *user or client* decides to attach (a file's contents, a row from a table, a log snippet). Prompts are parameterized prompt templates the *user* explicitly invokes, often as slash commands. The control point is the load-bearing distinction: model-controlled tools, user-controlled resources, user-invoked prompts. Mechanically, an MCP server is a process that speaks JSON-RPC over either stdio (local subprocess; default) or streamable-http (remote, multiplexed). The client launches or connects to the server, performs a capability handshake, then calls list_tools, list_resources, and list_prompts to discover what the server offers. Discovery is dynamic: the server can publish or revoke capabilities at runtime, and the client surfaces them to the user or the model accordingly. The Python SDK wraps this so you write decorators (@mcp.tool(), @mcp.resource(), @mcp.prompt()) and the SDK handles the wire format. Tools are where most beginners over-reach. The temptation is to ship a single do_anything tool with a free-form query string. Don't. The exam, and reality, both reward narrow tools with strict JSON-Schema inputs, like create_issue(title, body, labels[]) rather than github(query). The description field is what the model reads to decide whether to call the tool, so write it like a function docstring aimed at a teammate, not a marketing blurb. Tool errors should return structured failure (isError: true plus a message) rather than raising; the model can then recover. Resources are the primitive that's easiest to misunderstand. A resource has a URI, a MIME type, and a read_resource handler that returns bytes. Think of them as files that live behind the server. The user (or the client UI) picks which ones to attach to the conversation, the server reads them on demand, and Claude treats the contents as plain context. Resources should never have side effects; if you find yourself wanting one to do work, you wanted a tool. The MCP Inspector lets you enumerate resources and fetch them by URI without writing any client code, which is invaluable while developing. Prompts round out the trio: they are server-side prompt templates with named arguments, surfaced to the user as shortcuts (Claude Desktop renders them as slash commands; Claude Code surfaces them in its prompt chooser). When the user invokes one, the server returns a list of messages, ready to send. Prompts are a UX primitive, not a model primitive. The model never decides to invoke a prompt; the user does. Treat them as the right home for canonical workflows your team runs over and over: /standup, /review-pr, /draft-incident-postmortem. 
Each one becomes shareable, version-controlled, and discoverable across every MCP-aware client. When you've finished the course you should be able to: scaffold a server with the official Python SDK, expose a tool with strict JSON-Schema inputs, expose a resource at a stable URI, define a prompt template with named arguments, verify each surface via the MCP Inspector, and wire a thin Python client that proxies Claude's tool-use blocks to the server. That's the whole loop. The exam is built around recognizing this loop in different shapes. A question that says "Claude should be able to read a project file but not modify it" is testing resources vs tools. A question about a slash command in Claude Desktop is testing prompts. A question about why tool descriptions matter is testing the model-controlled discovery flow. From there, advanced MCP topics (sampling, notifications, roots, transport choice) are the natural next layer, covered in mcp-advanced, and they are where production deployments actually live. ## Patterns ### The 3 MCP primitives and who controls each The single most-tested MCP concept is which primitive fits which job. Memorize the control point. - **Tools: model-controlled actions.** The model decides whether and when to call a tool. Use for side-effectful actions: writes, sends, mutations. Declare with a name, description, and JSON-Schema input. Return structured success or isError: true for failures. - **Resources: user-controlled data.** The user (or client UI) decides which resources to attach to the conversation. Use for read-only context: file contents, DB rows, logs. Each resource has a URI and MIME type; the server reads bytes on demand. Never side-effectful. - **Prompts: user-invoked templates.** The user explicitly invokes a prompt (often as a slash command). The server returns a list of messages with arguments interpolated. Treat as a UX shortcut for canonical team workflows. The model never picks prompts. ### 5 anti-patterns when designing MCP tools Each of these will be a question on the exam in disguise. Watch for the failure mode behind each. - **The kitchen-sink tool.** A single do_thing(query: string) that branches internally on the query. The model can't tell when to call it; descriptions become marketing. Split into narrow tools with typed inputs. - **Free-text inputs instead of JSON-Schema.** Letting Claude pass args: string and parsing inside the tool. The schema is your validation contract; without it the model hallucinates shapes. Always type every input. - **Raising exceptions instead of structured errors.** An unhandled exception kills the tool call without giving the model anything to recover from. Return isError: true with a message so the loop can adapt. - **Side-effects in resources.** Treating resources as triggerable actions. Resources are read-only; the moment one mutates state, you wanted a tool. Mixing them up makes audit and gating impossible. - **No description, or a marketing description.** The description is what the model reads to decide whether to call the tool. Write it like a function docstring, not a tagline. State what it does, when to use it, and what it returns. ## Key takeaways - MCP standardizes the LLM-to-system integration plumbing so any MCP-aware client speaks to any MCP server without bespoke glue. (`mcp`) - The three primitives split by control point: tools are model-controlled, resources are user-controlled, prompts are user-invoked. 
(`mcp`) - Tools should be narrow with strict JSON-Schema inputs; the description field is what the model reads to decide whether to call them. (`tool-calling`) - Resources are read-only data attached at user discretion; if you want side effects, you wanted a tool, not a resource. (`agentic-tool-design`) - Use the MCP Inspector to verify your server end-to-end before wiring a client; it is the curl-equivalent for the protocol. (`mcp`) - Prompts are server-side templates the user invokes (often as slash commands); they are a UX primitive, not a model primitive. (`prompt-engineering-techniques`) ## Concepts in play - **Model Context Protocol** (`mcp`), Core protocol the course builds around - **Tool calling** (`tool-calling`), How MCP tools become Claude tool-use blocks - **Structured outputs** (`structured-outputs`), JSON-Schema input validation for tools - **Prompt engineering techniques** (`prompt-engineering-techniques`), Crafting reusable MCP prompt templates ## Scenarios in play - **Agentic tool design** (`agentic-tool-design`), Applying tool-narrowness rules when designing MCP tool surfaces - **Code generation with Claude Code** (`code-generation-with-claude-code`), Claude Code is the most-used MCP client; this is the canonical consumer scenario ## Curated sources - **Introducing the Model Context Protocol** (anthropic-blog, 2024-11-25): Anthropic's launch announcement framing the why behind MCP; pair with the Skilljar lessons for the architectural rationale the videos skim past. - **Model Context Protocol: Introduction** (anthropic-blog, 2025-03-01): The canonical spec home with primitive references, SDK guides, and the Inspector docs. The Skilljar course teaches Python; this is where you go for TypeScript, Java, or transport details. - **Awesome MCP Servers** (community-post, 2025-04-01): Living catalog of community MCP servers (GitHub, Postgres, browser, filesystem, Slack). Useful when designing your own to see what conventions have settled across the ecosystem. ## FAQ ### Q1. What is Model Context Protocol and why do I need it? MCP is an open spec that standardizes how an LLM client (Claude Desktop, Claude Code, Cursor) talks to external systems. You write a thin MCP server for your system once, and every MCP-aware client can use it without bespoke integration code. The alternative is rebuilding tool wiring for every client surface, which is exactly what MCP was designed to eliminate. ### Q2. What are the three MCP primitives and when do I use each one? Tools are model-controlled actions (the model decides when to call them; use for writes and side effects). Resources are user-controlled read-only data (the user attaches a file or DB row as context). Prompts are user-invoked templates (the user runs them as slash commands). The control point is the discriminator: model, user-attached, user-invoked. ### Q3. How do I create my first MCP server in Python? Install the official SDK with uv add mcp, create a server.py that instantiates FastMCP, and decorate functions with @mcp.tool(), @mcp.resource(), or @mcp.prompt(). Run it with uv run server.py and verify with mcp dev server.py to launch the MCP Inspector. The Inspector lets you fire tools and read resources without writing a client. ### Q4. What is the difference between an MCP tool and an MCP resource? Tools are callable actions the model invokes (often with side effects); resources are read-only data the user attaches. The control point is the load-bearing distinction. 
If your function mutates state or sends a message, it is a tool. If it returns bytes for context with no side effects, it is a resource. Mixing them breaks audit and confuses the model. ### Q5. Why isn't my MCP server showing up in Claude Desktop? First check claude_desktop_config.json in the right OS-specific path (~/Library/Application Support/Claude/ on macOS) and confirm the JSON is valid. Then restart Claude Desktop; the config is read at startup, not live-reloaded. Run the same command from the config block in a terminal to confirm the server actually starts. Use the MCP Inspector to bypass Claude Desktop entirely and isolate whether the bug is in the server or the wiring. ### Q6. Are MCP tools the same as Claude API tool use? They translate to the same thing on the wire to Claude (a tool_use block), but the discovery and dispatch layer is different. With raw Claude API tool use, your code declares the tools inline and dispatches the call. With MCP, the client discovers tools dynamically from a server, and the server handles dispatch. MCP shines when many clients need to share the same tool surface. ### Q7. How does MCP differ from agent skills in Claude Code? MCP provides external tools and integrations through a separate server process. Agent skills are local markdown files Claude Code loads on demand to add task-specific knowledge to the conversation. They are complementary: MCP gives Claude new capabilities; skills teach Claude how to use them. The Skilljar agent-skills-intro course goes deeper on the latter. ### Q8. What is the MCP Inspector and how do I use it? The Inspector is a local web UI that connects to any MCP server and lets you manually call tools, fetch resources, and invoke prompts. Launch it with mcp dev server.py (pointed at your server entrypoint). Treat it as curl for MCP: use it to verify the protocol surface before wiring a real client, and to reproduce bugs without involving Claude. --- **Source:** https://claudearchitectcertification.com/knowledge/mcp-foundations **Vault sources:** Course_07/_Course_Overview.md; Course_07/Lesson_01_welcome-to-the-course.md; Course_07/Lesson_02_introducing-mcp.md; Course_07/Lesson_03_mcp-clients.md; Course_07/Lesson_04_project-setup.md; Course_07/Lesson_05_defining-tools-with-mcp.md; Course_07/Lesson_06_the-server-inspector.md; Course_07/Lesson_08_implementing-a-client.md; Course_07/Lesson_09_defining-resources.md; Course_07/Lesson_10_accessing-resources.md; Course_07/Lesson_11_defining-prompts.md; Course_07/Lesson_12_prompts-in-the-client.md; Course_07/Lesson_14_mcp-review.md **Last reviewed:** 2026-05-06 --- # MCP Advanced: Sampling, Notifications, Roots, Transports > This course extends the MCP foundations with the capabilities production servers actually need: sampling (server asks the client to call the model), log and progress notifications (server streams updates), roots (server reads files only inside client-allowed boundaries), and the two transports (stdio for local subprocesses and streamable-http for remote, multi-client deployments). You also dig into the JSON message types underneath so you can debug at the wire level. Finish here and you can ship an MCP server that handles long-running work, respects security boundaries, and runs at scale.
**Domain:** D3 · Agent Operations (20%) **Difficulty:** advanced **Skilljar course:** Model Context Protocol: Advanced Topics (15 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/mcp-advanced **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 20% (D3) + 18% (D2) Direct prep for D3 task statements on production integration patterns and D2 questions on tool surface design under operational constraints. Advanced MCP topics surface in the agentic-tool-design scenario, in long-running server discussions, and in any item that contrasts stdio vs remote HTTP deployments. ## What you'll learn - How sampling lets a server delegate LLM calls back to the client (and why that beats putting an API key on the server) - How to emit log and progress notifications during long-running tool calls so the client can render real-time feedback - What roots are, why they matter for filesystem security, and how a server enumerates and respects them - The full JSON-RPC 2.0 message vocabulary MCP uses (request, response, notification, error) - When to choose stdio vs streamable-http transport, and how state and reconnection differ between them - How session state works in streamable-http and the failure modes to design around ## Prerequisites - **MCP Foundations (knowledge)** (knowledge · `mcp-foundations`) - **Tool calling (concept)** (concepts · `tool-calling`) - **Streaming (concept)** (concepts · `streaming`) ## Lesson outline ### 1. Let's get started! Course framing: the four advanced surfaces are sampling, notifications, roots, and transports. Each fixes a real production gap. ### 2. Sampling Sampling lets a server ask the client to make an LLM call on its behalf, so secrets and model choice stay on the client. ### 3. Sampling walkthrough End-to-end: server requests sampling/createMessage, client prompts user (optional), invokes model, returns the completion. ### 4. Log and progress notifications Notifications are one-way messages: notifications/message for logs, notifications/progress for percent-complete on long calls. ### 5. Notifications walkthrough Hands-on: emit progress with a token from the request, stream log lines, watch them surface in the client UI. ### 6. Roots Roots are filesystem boundaries the client publishes; the server enumerates them via roots/list and stays inside them. ### 7. Roots walkthrough Wire a server that respects roots, listens for roots/list_changed, and refuses access outside the published set. ### 8. Survey Mid-course feedback checkpoint; no technical content. ### 9. JSON message types MCP rides JSON-RPC 2.0: requests have id and expect a response, notifications have no id and never get one. ### 10. The STDIO transport stdio is the default for local servers: client spawns the server as a subprocess, messages flow over stdin/stdout as newline-delimited JSON-RPC. ### 11. The StreamableHTTP transport streamable-http is the remote transport: HTTP POST for requests, optional SSE for server-to-client streaming, sessions over Mcp-Session-Id header. ### 12. StreamableHTTP in depth Walk through the headers, session lifecycle, GET-for-stream pattern, and resume tokens; this is the layer to debug when production breaks. ### 13. State and the StreamableHTTP transport Sessions are sticky to a server instance; for horizontal scaling either pin sessions or externalize state to Redis or similar. ### 14. Assessment on MCP concepts Final assessment covering all four advanced surfaces. ### 15.
Wrapping up Recap and pointers: combine these surfaces to ship a production-ready MCP server with security and observability. ## Our simplification MCP Foundations gave you tools, resources, and prompts. That gets you a working server. Production servers need four more capabilities, and this course is built around them: sampling, notifications, roots, and the two transports. Each one solves a problem you hit the moment a real team starts using your server. Skip them and your server breaks the first time a tool call takes 30 seconds, or the moment your security review asks where the LLM API key lives. Sampling is the most counterintuitive of the four. It lets the server ask the *client* to make an LLM call on the server's behalf. That sounds backwards until you realize it solves a real problem: the client already has an API key, model preference, and rate limits configured. Putting another LLM key on the server doubles the surface area of secrets and model-version drift. With sampling, the server requests sampling/createMessage, the client (optionally) shows the user what's about to be sent, the client invokes the model, and the response comes back. Common use case: a database-querying server that wants to summarize a result in natural language without ever holding an Anthropic key. Log and progress notifications are the streaming layer. A long-running tool call (a 30-second build, a multi-step research task) needs to give the client feedback or the user thinks the system has hung. MCP defines two notification flavors: notifications/message for log lines (with severity levels) and notifications/progress for percent-complete updates tied to a progress token included in the original request. The notifications are one-way: the client doesn't ack them, and the server doesn't expect a response. The Python SDK exposes ctx.report_progress() and ctx.log() so you don't write JSON-RPC by hand. Roots are the security boundary the spec adds for filesystem access. A client publishes a list of roots, meaning directories the server is allowed to read. The server queries them with roots/list and listens for roots/list_changed to react to live changes. This is the spec's answer to "how do I let an MCP filesystem server work without giving it /". The server still implements its own enforcement (the spec is advisory at the protocol layer), but a well-behaved server treats roots as a hard fence and rejects paths that fall outside. Expect exam questions on which side enforces the boundary (server) vs which side declares it (client). Underneath all four surfaces is JSON-RPC 2.0. MCP uses three message shapes: requests (have id, expect a response), responses (carry the matching id, return result or error), and notifications (no id, no response, fire-and-forget). Knowing this saves you when a server appears hung. Odds are you sent a notification when a request was needed, or vice versa. The Inspector and most SDKs hide the JSON, but when production breaks at 2 a.m., you read the raw frames and you need to know which is which. Transport is the choice you make first and rarely revisit. stdio is the local default: the client spawns the server as a subprocess, messages flow over stdin/stdout as newline-delimited JSON-RPC, the connection lives as long as the subprocess does. Simple, secure (no network), but single-client. streamable-http is the remote transport: HTTP POST carries requests, optional Server-Sent Events stream server-to-client messages, and an Mcp-Session-Id header tracks per-client session state.
It is the right choice for shared SaaS-style MCP servers but introduces session-stickiness, reconnection, and horizontal-scaling concerns that stdio doesn't have. The takeaway: combine these four surfaces with the foundation primitives and you have an MCP server that can run long jobs, stream feedback, respect security boundaries, delegate model calls cleanly, and scale horizontally. Most production MCP outages trace to one of these four surfaces being skipped or misconfigured. The exam knows this; questions around "which surface fixes problem X" are reliable territory. When in doubt, run the heuristic: long-running work means notifications, server-side LLM calls means sampling, filesystem access means roots, deployment shape means transport choice. The other heuristic the exam loves is the control point. Sampling is server-requested but client-executed. Notifications are server-emitted with no client response. Roots are client-declared and server-enforced. Transport is client-chosen at connection time. Every advanced MCP question can be answered by figuring out which side owns the decision. Combine this with the foundation course's tools-resources-prompts breakdown, and you have a complete mental model of the protocol. ## Patterns ### The 4 advanced MCP surfaces and what each fixes Every advanced MCP feature solves a specific production problem. Memorize the failure mode each one prevents. - **Sampling: the server delegates LLM calls.** Server calls sampling/createMessage; client invokes the model and returns the completion. Use when the server needs LLM output but you do not want to put another API key on it. Keeps model choice and secrets on the client. - **Notifications: server streams progress and logs.** notifications/progress (with a token from the request) and notifications/message (with a severity level). One-way; no ack. Use for any tool call longer than a few seconds so the client UI can render feedback. - **Roots: filesystem security boundary.** Client publishes a list of allowed directories. Server reads via roots/list, listens for roots/list_changed, and refuses access outside. The client declares; the server enforces. Always assume the spec is advisory and write a hard check. - **Transport: `stdio` vs `streamable-http`.** stdio is local subprocess, single-client, simplest. streamable-http is remote, multi-client, with session IDs and optional SSE streaming. Pick streamable-http when many clients share one server; pick stdio for everything else. ### 3 transport pitfalls in production When streamable-http MCP servers fail in production, it is almost always one of these three. - **Session affinity not enforced.** Mcp-Session-Id ties a client to a specific server instance's in-memory state. Without sticky routing at the load balancer, the next request hits a server that has never seen this session and 404s. Pin sessions or externalize state. - **SSE connection dropped without resume.** Long-lived SSE streams die at proxies, load balancers, and on flaky mobile networks. The spec supports resume tokens; if you do not implement them, the client loses notifications on every reconnect. - **Treating `stdio` like a daemon.** stdio servers live and die with the client subprocess; they do not persist. Storing in-memory state and assuming the next session will see it is wrong. Persist to disk, or accept session-scoped state. ## Key takeaways - Sampling lets the server request LLM completions through the client, keeping API keys and model choice on the client side. 
(`mcp`) - Use notifications/progress with a token from the request and notifications/message for logs whenever a tool call takes more than a few seconds. (`streaming`) - Roots are the filesystem boundary the client publishes and the server is expected to enforce; never trust the spec alone, write a hard path check. (`mcp`) - MCP rides JSON-RPC 2.0 with three message shapes: requests carry an id, responses match by id, notifications have no id and never get a response. (`mcp`) - Choose stdio for local single-client servers and streamable-http only when multiple clients share one server; the latter introduces session-stickiness and reconnection. (`agentic-tool-design`) - streamable-http sessions are pinned to a specific server instance via Mcp-Session-Id; horizontal scaling requires sticky routing or externalized session state. (`session-state`) ## Concepts in play - **Model Context Protocol** (`mcp`), Core protocol the course extends - **Streaming** (`streaming`), Mechanism behind progress and log notifications - **Session state** (`session-state`), How streamable-http tracks per-client context - **Tool calling** (`tool-calling`), Where sampling and notifications live in the loop ## Scenarios in play - **Agentic tool design** (`agentic-tool-design`), Production tool surface design decisions including transport and notifications - **Claude for operations** (`claude-for-operations`), Long-running operational tooling where progress notifications carry real value ## Curated sources - **Model Context Protocol: Specification** (anthropic-blog, 2025-06-18): The authoritative spec covering sampling, notifications, roots, and both transports at the wire level. The Skilljar videos teach intuition; this is where you go to settle a debate. - **Streamable HTTP transport: MCP docs** (anthropic-blog, 2025-06-18): Deep-dive into headers, session lifecycle, SSE upgrade path, and resume semantics. The single best reference for debugging a streamable-http MCP server in production. - **MCP Inspector** (community-post, 2025-04-01): Official Inspector repo with usage examples for testing sampling, notifications, and both transports. Indispensable when you cannot reproduce a wire-level bug from a real client. ## FAQ ### Q1. What is MCP sampling and when should I use it? Sampling is when an MCP server asks the client to make an LLM call on the server's behalf via sampling/createMessage. Use it when your server needs natural-language output but you do not want to put a separate Anthropic key, model preference, and rate limit on it. The client owns the model decision; the server just describes what it needs. ### Q2. How do I show progress for a long-running MCP tool call? Include a progressToken in the original request, then have the server emit notifications/progress updates with that token and a percent-complete value. Pair with notifications/message for log lines. The Python SDK exposes ctx.report_progress(progress, total) so you do not write JSON-RPC frames by hand. ### Q3. What are MCP roots and who enforces them? Roots are filesystem directories the client publishes as the boundary the server is allowed to read inside. The server queries them via roots/list and reacts to roots/list_changed. The client declares; the server enforces. Treat the spec as advisory and write a hard path check, because clients vary in how strictly they police access. ### Q4. Should I use stdio or streamable-http transport for my MCP server? 
Use stdio for local single-client servers (simplest, no network, lifetime tied to the subprocess). Use streamable-http when many clients share one server (Slack-team-wide deployments, hosted SaaS). The latter introduces session-stickiness, reconnection, and horizontal-scaling concerns that stdio does not have. ### Q5. Why is my streamable-http MCP server losing sessions when scaled to multiple instances? Mcp-Session-Id is pinned to a specific server instance's in-memory session state. Without sticky routing at your load balancer, the next request lands on an instance that has never seen the session and returns 404. Either configure session affinity at the proxy or externalize session state to Redis or a similar shared store. ### Q6. How is an MCP notification different from a request? A request has an id and the receiver must respond (success or error). A notification has no id and the receiver must never respond. Notifications are fire-and-forget. Mixing these up is the most common source of MCP bugs at the wire level: a missing response on what you thought was a notification will hang the client forever. ### Q7. Does streamable-http require Server-Sent Events? Not for every interaction. Requests flow over standard HTTP POST. SSE is the optional upgrade path when the server needs to stream messages to the client (notifications, sampling requests, async results). The client opens the SSE stream with a GET; the server pushes events when it has them. Without SSE you can still do request-response, just no server-initiated traffic. ### Q8. How do I test the advanced MCP features without writing a full client? Use the MCP Inspector. Run mcp dev and the Inspector launches a local web UI that supports sampling (it will prompt you for completions), notifications (you see them stream in), roots (you can publish test roots), and both transports. It is the curl-equivalent for the protocol and the fastest way to isolate bugs. --- **Source:** https://claudearchitectcertification.com/knowledge/mcp-advanced **Vault sources:** Course_10/_Course_Overview.md; Course_10/Lesson_02_sampling.md; Course_10/Lesson_03_sampling-walkthrough.md; Course_10/Lesson_04_log-and-progress-notifications.md; Course_10/Lesson_05_notifications-walkthrough.md; Course_10/Lesson_06_roots.md; Course_10/Lesson_07_roots-walkthrough.md; Course_10/Lesson_09_json-message-types.md; Course_10/Lesson_10_the-stdio-transport.md; Course_10/Lesson_11_the-streamablehttp-transport.md; Course_10/Lesson_12_streamablehttp-in-depth.md; Course_10/Lesson_13_state-and-the-streamablehttp-transport.md; Course_10/Lesson_15_wrapping-up.md **Last reviewed:** 2026-05-06 --- # Agent Skills: Reusable Prompts in Claude Code > Skills are reusable markdown files that teach Claude Code how to handle specific tasks automatically, so you stop repeating the same instructions every conversation. Each skill is a directory with a SKILL.md declaring name + description, and Claude semantically matches incoming requests against descriptions to load the right skill on demand. The course walks through creating a skill, advanced configuration (allowed-tools, multi-file progressive disclosure), how skills relate to CLAUDE.md, subagents, hooks, and MCP, plus how to share and troubleshoot them. 
**Domain:** D1 · Agentic Architectures (27%) **Difficulty:** intro **Skilljar course:** Introduction to Agent Skills (6 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/agent-skills-intro **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 27% (D1) + 18% (D2) Direct prep for D1 task statements on Claude Code customization features and D2 questions on choosing the right primitive for the job. Skills appear in agent-skills-for-enterprise-km, agent-skills-for-developer-tooling, and agent-skills-with-code-execution scenarios. ## What you'll learn - What a skill is, where it lives (personal ~/.claude/skills vs project .claude/skills), and how Claude Code matches one to a request - How to author a SKILL.md with name, description, and instructions that reliably trigger when the task comes up - Advanced fields: allowed-tools for restricting capabilities and model for routing to a specific Claude model - Progressive disclosure: keeping SKILL.md under 500 lines and linking to supporting files Claude reads only when needed - When to use skills versus CLAUDE.md, subagents, hooks, or MCP, and how they complement each other - How to share skills via repo commits, plugins, or enterprise managed settings, and how to troubleshoot when one does not trigger ## Prerequisites - **Claude Code 101 (knowledge)** (knowledge · `claude-code-101`) - **Skills (concept)** (concepts · `skills`) - **CLAUDE.md hierarchy (concept)** (concepts · `claude-md-hierarchy`) ## Lesson outline ### 1. What are skills? Skills are folders of instructions Claude Code discovers and applies automatically; if you find yourself repeating instructions, that is a skill waiting to be written. ### 2. Creating your first skill Create a SKILL.md with name + description frontmatter, drop it in ~/.claude/skills//, restart Claude Code to pick it up. ### 3. Configuration and multi-file skills Add allowed-tools for capability restriction, use progressive disclosure to keep SKILL.md small and link to supporting files. ### 4. Skills vs. other Claude Code features Skills are on-demand and request-driven; CLAUDE.md is always-on, subagents are isolated execution, hooks are event-driven, MCP is external tools. ### 5. Sharing skills Personal in ~/.claude/skills, project (committed) in .claude/skills, plugins for distribution, managed settings for enterprise rollout. ### 6. Troubleshooting skills Use the skills validator first; if it does not trigger, fix the description; if it does not load, check SKILL.md is in a named directory. ## Our simplification Skills exist because every Claude Code user eventually notices they are typing the same instructions over and over. Every PR review, you re-describe how you want feedback structured. Every commit, you remind Claude of your message format. Skills fix this by letting you write a markdown file once that Claude applies automatically the next time the task comes up. The mental model is npm install for Claude's task-specific knowledge: scoped, on-demand, version-controlled. Mechanically a skill is a directory containing a SKILL.md. The file has YAML frontmatter declaring at minimum name and description, then markdown instructions below. The description is load-bearing: at startup Claude Code loads only the names and descriptions of every available skill (cheap, no token cost). When you make a request, Claude semantically matches your phrasing against those descriptions and loads the full body of the matching skill into context. You see a confirmation prompt before the load. 
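To make the shape concrete, here is a minimal sketch that writes a hypothetical personal skill to disk; the skill name, description, and instructions are invented, and the frontmatter fields are the ones this course describes:

```python
# Hypothetical example: scaffolds a personal "pr-review" skill.
from pathlib import Path

skill_dir = Path.home() / ".claude" / "skills" / "pr-review"
skill_dir.mkdir(parents=True, exist_ok=True)

(skill_dir / "SKILL.md").write_text("""\
---
name: pr-review
description: Use when reviewing pull requests or when asked for PR feedback. Groups findings by severity and checks tests, naming, and error handling.
allowed-tools: [Read, Grep, Glob]
---
When reviewing a PR:
1. Summarize the change in two sentences.
2. List findings grouped by severity (blocker, major, nit).
3. Never edit files; report only.
""")
```

The description carries the trigger phrases a reviewer actually types ("reviewing pull requests", "PR feedback"), and allowed-tools keeps the skill read-only; restart Claude Code after writing the file so it is discovered.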
Skills are inert until they match. Skills live in two places: personal at ~/.claude/skills// (follows you across all projects on your machine) and project at .claude/skills// inside a repo (committed to git, shared with everyone who clones). When names conflict, the priority hierarchy is Enterprise → Personal → Project → Plugins. Use personal for your own commit-message style or PR-review preferences; use project for team standards like coding conventions, brand guidelines, or framework-specific debugging checklists. Restart Claude Code after creating or editing a skill so it picks up the change. Advanced configuration unlocks the production-grade use cases. allowed-tools restricts which tools Claude can use while the skill is active; set [Read, Grep, Glob] for a code-review skill so the reviewer cannot edit files even if the prompt drifts. model lets a skill request a specific Claude model. Progressive disclosure is the technique that keeps skills efficient: keep SKILL.md under 500 lines, and put deep references, scripts, and assets in sibling files that Claude reads on demand. The trick: Claude can execute a script and consume only its output, never the script's source, so a 2,000-line tax-rules-by-state lookup costs zero context tokens until the output lands. Skills versus everything else is the question every team asks and the exam loves to test. CLAUDE.md loads into every conversation; use it for project-wide always-on standards (TypeScript strict mode, never modify the schema). Skills load on demand; use them for task-specific expertise that would clutter every conversation if it lived in CLAUDE.md. Subagents run in an isolated context with their own tool access; use them when you want delegation, not knowledge injection. Hooks fire on events (file save, tool call); use them for automation, not reasoning. MCP provides external tools; a different category entirely from skills. The right setup combines all of them. Sharing scales the value. A personal skill stays on your machine. A project skill gets committed alongside the code; the next teammate who clones the repo gets it free. Plugins package a set of related skills (plus optional commands and config) for distribution outside a single repo. Managed settings let an enterprise admin push a skill set to every user's Claude Code installation, with the highest priority in the hierarchy. The shape that emerges is: personal skills are your habits, project skills are your team's standards, plugin skills are community contributions, enterprise skills are policy. Troubleshooting follows a predictable pattern. Run the skills validator first; it catches structural problems (missing frontmatter, wrong filename, directory shape) faster than any other technique. If a skill does not trigger, the description is almost always the cause; add the trigger phrases users actually type. If it does not load, check that SKILL.md lives inside a named directory and that the filename is exactly SKILL.md (case-sensitive). If the wrong skill activates, two descriptions are too similar; disambiguate. For runtime errors in skill scripts, check dependencies, file permissions (chmod +x), and use forward slashes in paths. The exam treats skills as a recognition problem more than a creation problem. You will see scenarios describing team pain (a brand-guidelines doc Claude keeps missing, a PR-review style that drifts) and be asked which Claude Code feature solves it. 
Skill is the right answer when the knowledge is task-specific and on-demand; CLAUDE.md when it should always apply; subagent when you want isolation; hook when an event triggers automation; MCP when you need an external system. ## Patterns ### Skills vs CLAUDE.md vs subagents vs hooks vs MCP The single most-tested skills concept is which feature fits which job. Memorize the load model and the trigger. - **CLAUDE.md: always-on standards.** Loads into every conversation. Use for project-wide constraints (TypeScript strict, never modify schema, framework preferences). Token cost on every turn, so keep it tight. - **Skills: on-demand, request-driven.** Loads only when Claude matches a request to the skill's description. Use for task-specific expertise (PR review style, commit format, debugging checklists). Cost is zero until activated. - **Subagents: isolated execution.** Runs in a separate context window with its own tool access; returns only a summary. Use for delegation, not knowledge injection. Cleaner parent context, less debuggability. - **Hooks: event-driven automation.** Fires on events (file save, tool call). Use for side effects: linters, validators, telemetry. Not for reasoning. Configured in settings.json, not markdown. - **MCP: external tools.** Provides callable capabilities from a separate server. Use for connecting Claude to systems outside the conversation. A different category entirely; complement skills, do not replace them. ## Key takeaways - Skills are markdown files that Claude Code loads on demand when your request matches their description, so they cost zero context tokens until activated. (`skills`) - Personal skills go in ~/.claude/skills, project skills go in .claude/skills (committed to git), and the priority hierarchy is Enterprise > Personal > Project > Plugins. (`skills`) - The description field is the load-bearing decision input; write it to answer two questions: what does the skill do, and when should Claude use it. (`agent-skills-for-enterprise-km`) - Use allowed-tools to restrict capabilities while a skill is active and use progressive disclosure (sub-500-line SKILL.md plus linked support files) to keep large skills efficient. (`skills`) - Skills are knowledge-injection; subagents are isolated execution; hooks are event-driven automation; CLAUDE.md is always-on standards; MCP is external tools, and each handles its own specialty. (`agent-skills-for-developer-tooling`) - When a skill does not trigger, the description is almost always the cause; add the trigger phrases users actually type and run the skills validator first when debugging. 
(`skills`) ## Concepts in play - **Skills** (`skills`), Core primitive the course centers on - **CLAUDE.md hierarchy** (`claude-md-hierarchy`), The always-on alternative skills complement - **Subagents** (`subagents`), Isolated execution alternative for delegation - **Hooks** (`hooks`), Event-driven alternative for automation - **MCP** (`mcp`), External tool surface that pairs with skills ## Scenarios in play - **Agent skills for enterprise KM** (`agent-skills-for-enterprise-km`), Enterprise rollout pattern: managed settings, shared brand and policy skills - **Agent skills for developer tooling** (`agent-skills-for-developer-tooling`), Team-shared project skills for code review, commit format, debugging checklists - **Agent skills with code execution** (`agent-skills-with-code-execution`), Progressive disclosure pattern using scripts that return data without loading source ## Curated sources - **Equipping agents for the real world with Agent Skills** (anthropic-blog, 2025-10-16): Anthropic's launch announcement framing the why behind skills, including the progressive-disclosure design choice. Pair with the Skilljar lessons for the architectural rationale. - **Agent Skills: Anthropic Claude Code documentation** (anthropic-blog, 2025-10-20): Canonical reference for the SKILL.md schema, allowed-tools, progressive disclosure, and the priority hierarchy. The doc you bookmark when authoring real skills. ## FAQ ### Q1. What are agent skills in Claude Code and how are they different from CLAUDE.md? Skills are reusable markdown files that Claude Code loads on demand when your request matches the skill's description. CLAUDE.md loads into every conversation always. Use CLAUDE.md for project-wide standards that always apply (TypeScript strict mode); use skills for task-specific expertise that would clutter every conversation if it lived in CLAUDE.md. ### Q2. How do I create my first skill in Claude Code? Create a directory at ~/.claude/skills// (personal) or .claude/skills// (project). Inside, create a SKILL.md with YAML frontmatter declaring name and description, then write the instructions below the frontmatter. Restart Claude Code so it discovers the new skill. Verify it loads by making a request that should trigger it; you will see a confirmation prompt. ### Q3. Why is my skill not triggering when I make a request? The cause is almost always the description. Claude matches your request semantically against skill descriptions, so vague or generic descriptions miss real-world phrasings. Add the trigger phrases users actually type ('use when reviewing PRs', 'use when writing commit messages'). Also run the skills validator first; it catches structural issues (missing frontmatter, wrong filename, wrong directory shape) before you waste time on the description. ### Q4. Where do skills live and how does Claude Code find them? Personal skills go in ~/.claude/skills// and follow you across every project on your machine. Project skills go in .claude/skills// inside a repo and are committed to version control so the team shares them. At startup Claude Code loads only the names and descriptions of every skill it finds; the full body loads on demand when a request matches. ### Q5. What is progressive disclosure in agent skills? Progressive disclosure is the technique of keeping SKILL.md small (under 500 lines) and linking to supporting files Claude reads only when needed. The trick: Claude can execute a script and consume only its output, not the script's source. 
A 2,000-line lookup table costs zero context tokens until the output lands. Use this pattern for any skill that would otherwise bloat the context window. ### Q6. When should I use a skill versus a subagent versus a hook? Use a skill when you want to inject task-specific knowledge into the current conversation. Use a subagent when you want to delegate work to a separate context with its own tool access (returns only a summary). Use a hook when you want automated side effects on events (file save, tool call). Skills inform reasoning; subagents do work; hooks do automation. They are complementary, not alternatives. ### Q7. How do I share a skill with my team? Put the skill directory in .claude/skills// inside the repo and commit it. Anyone who clones the repo gets the skill automatically. For broader distribution, package related skills as a plugin. For enterprise-wide rollout, use managed settings to push skills to every user's Claude Code installation; managed-setting skills sit at the top of the priority hierarchy. ### Q8. Can I restrict which tools a skill is allowed to use? Yes. Add an allowed-tools field to the skill's frontmatter listing only the tools Claude can use while the skill is active. For a code-review skill, set allowed-tools: [Read, Grep, Glob] so the reviewer cannot edit files even if the prompt drifts. The whitelist is your safety contract; relying on prompt-only constraints is the canonical anti-pattern. --- **Source:** https://claudearchitectcertification.com/knowledge/agent-skills-intro **Vault sources:** Course_15/_Course_Overview.md; Course_15/Lesson_01_what-are-skills.md; Course_15/Lesson_02_creating-your-first-skill.md; Course_15/Lesson_03_configuration-and-multi-file-skills.md; Course_15/Lesson_04_skills-vs-other-claude-code-features.md; Course_15/Lesson_05_sharing-skills.md; Course_15/Lesson_06_troubleshooting-skills.md **Last reviewed:** 2026-05-06 --- # AI Capabilities and Limitations: A Mental Model of the Machine > Generative AI has four properties that each sit on a continuum from capability to limitation: Next Token Prediction, Knowledge, Working Memory, and Steerability. Most real-world failures are two properties colliding (a hallucinated citation is Next Token Prediction meeting a Knowledge gap), and naming the pair points you straight to the fix. This is the durable mental model that stays useful even as models keep improving. **Domain:** D1 · Agentic Architectures (27%) **Difficulty:** intro **Skilljar course:** AI Capabilities and Limitations (14 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/ai-capabilities-limitations **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 27% (D1) + 18% (D2) Builds the calibrated-trust mental model behind D1 task statements about hallucination, knowledge cutoffs, and context limits, and grounds D2 prompt-engineering decisions in why each technique exists. The four properties (Next Token Prediction, Knowledge, Working Memory, Steerability) are the diagnostic vocabulary the exam expects you to apply. 
## What you'll learn - Why generative AI is fundamentally a prediction system, not a search engine, and what that implies for fluency vs accuracy - How pretraining and fine-tuning give each model its character, and the four behavioral fingerprints those stages leave (sycophancy, verbosity, over-caution, loose calibration) - How to locate any task on each of the four property continuums and predict where it will struggle before you run it - How to diagnose real failures by naming which two properties collided, then choose a targeted mitigation - How this framework connects to the 4D Framework (Delegation, Description, Discernment, Diligence) as two halves of one calibrated-trust system ## Prerequisites - **AI Fluency Framework (companion course)** (knowledge · `ai-fluency-framework`) - **Context window (concept)** (concepts · `context-window`) ## Lesson outline ### 1. Intro to AI Capabilities and Limitations Course roadmap; this is the machine-side companion to the 4D Framework's human-side competencies. ### 2. What We Mean by AI Generative AI produces new content; classification AI sorts existing content. Four properties define what generative AI can and cannot do. ### 3. How AI Gets Its Character Pretraining builds a document completer; fine-tuning layers an assistant on top, leaving fingerprints like sycophancy and verbosity. ### 4. Next Token Prediction AI writes one fragment at a time based on what tends to follow what. Fluency and accuracy are independent variables. ### 5. Try it out: Next Token Prediction Hands-on probes: capability zone vs specificity-under-pressure vs sampling variance on the same prompt. ### 6. Knowledge What the model knows is frozen at the training cutoff. Mainstream and stable wins; rare, recent, niche, contested loses. ### 7. Try it out: Knowledge Outsider-test exercises: coverage gaps, staleness, default-assumption blind spots in your domain. ### 8. Working Memory Fixed-size context window with a cliff failure mode. Lost-in-the-middle is real; corrections do not persist across sessions. ### 9. Try it out: Working Memory Cold-start vs context-supplied probes; demonstrates that context is leverage and the blank slate between sessions. ### 10. Steerability Instructions are followed via pattern matching, not understanding. Short concrete asks land; long reasoning chains drift. ### 11. Try it out: Steerability Goal-rewrite exercise: state intent alongside format, insert mid-process checkpoints, watch letter-vs-spirit failures. ### 12. When Properties Collide Real failures are two properties meeting. Hallucinated citation = NTP x Knowledge; long-conversation drift = Working Memory x Steerability. ### 13. Next Steps Synthesis: calibrated trust is a habit, not an attitude; the property shapes stay useful as models keep changing. ### 14. Course Quiz Self-check on the four properties, the two training stages, and the diagnostic-pair vocabulary. ## Our simplification Generative AI is not uniformly capable or uniformly unreliable. It is strong and weak along four predictable axes, and the same underlying mechanism that produces a strength often produces the matching weakness. Anthropic's framework names those axes Next Token Prediction, Knowledge, Working Memory, and Steerability. Each one is a continuum, not a switch. Your job before delegating any task is to locate it on each continuum and decide what verification or context to supply. That move is what Anthropic calls calibrated trust, and it is the single most exam-relevant idea in the course. 
Next Token Prediction answers *where do AI answers come from?* The model is writing what statistically comes next, one fragment at a time. On well-worn paths (summarize, reformat, explain a common concept) the patterns are dense and the output is reliable. On novel territory the same fluent prose keeps coming, but accuracy thins. Fabrication concentrates in specificity; names, dates, statistics, citations, URLs, quotes. Confident tone is not an accuracy signal; smoothness and correctness are independent variables. Product features (citations, uncertainty signaling, constrained generation, generator-verifier loops) push the edge out, but the verification habit is yours to build. Knowledge answers *what does the model actually know?* Everything it learned came from training data and is frozen at a cutoff date. Mainstream, well-documented, stable topics land in the capability zone. Rare, post-cutoff, niche, local, or contested topics drift toward the edge. The characteristic failures are staleness (true-at-training-time is not true-now), uneven coverage (minority languages and recent developments suffer), inherited bias (the model's sense of default reflects training-data blind spots), and source amnesia (I read this somewhere is not a citation). Web search, RAG/retrieval, MCP, and tool use are all mitigations that extend knowledge at runtime; if you are not using them you are relying entirely on what the model absorbed. Working Memory answers *what is the AI paying attention to right now?* This is the context window; your instructions, uploaded docs, prior responses, all in one finite container the model rereads every turn. Unlike the other three properties, this one has a cliff rather than a gradient. When the window overflows, oldest material falls off silently. Attention is not uniform across the window either; the lost-in-the-middle effect means buried instructions carry less weight than top-or-tail ones. The model does not learn from your corrections; it only responds to what is currently in context. Memory features, projects, compaction, larger windows, and skills push the cliff further out, but front-loading critical material and chunking long work are your operator-side defenses. Steerability answers *how much am I in control?* Fine-tuning taught the model to treat your input as a request and follow rules, which gives precise control over format, role, length, and tone. But instructions are pattern-matched, not understood. Short, concrete, verifiable asks (respond as a table, under 100 words) land cleanly. Long reasoning chains drift; abstract asks like be insightful get patchy results; native arithmetic precision is brittle without code execution. The two characteristic failures are reasoning drift (small errors compound and the model does not notice) and letter-over-spirit (instruction honored literally but uselessly). When an instruction is followed literally but uselessly, restate the goal, not the instruction; repeating be concise louder will not fix what was really an intent problem. The diagnostic move that turns this into a working tool: most real-world AI failures are two properties colliding, not one. A hallucinated citation is Next Token Prediction meeting a Knowledge gap (the model generates what a plausible citation looks like while the training data is sparse). Long-conversation drift is Working Memory meeting Steerability (early constraints fade as the window fills, and steerability follows whatever instructions are most salient now). 
Confidently-wrong arithmetic is Next Token Prediction meeting Steerability without code execution. Before reaching for a prompt fix, name which two properties are at play. The fix follows automatically from the diagnosis: verify specifics, re-supply context, offload to a tool, or invite explicit pushback. Two training stages give every model its character. Pretraining reads vast amounts of text and learns one thing: predict what comes next. The result is a document completer with no concept of you or of helping. Fine-tuning is a second round on curated examples of helpful behavior plus reward signals shaped by human preferences. That second stage leaves fingerprints: sycophancy (the model validates your framing and backs down under light pushback), verbosity (thoroughness scored well in training, so essays come back when you wanted bullets), over-caution (conservative safety training means hedging on requests that are actually fine), and loose calibration between stated confidence and actual reliability. These are not bugs in one model; they are training artifacts that appear across all of them; knowing them puts you in control. ## Patterns ### The 4 properties of generative AI, on one page Each property is a continuum with a capability zone, a limitation zone, and product features that push the edge further out. Locate your task on each one before delegating. - **Next Token Prediction; where do answers come from?.** Capability: well-worn paths (summarize, reformat, explain). Limitation: novel territory and specificity (names, dates, citations, URLs). The same mechanism produces fluency and hallucination. - **Knowledge; what does it actually know?.** Capability: frequent, recent-in-training, consistent topics. Limitation: rare, post-cutoff, niche, local, contested. Mitigations: web search, RAG, tool use, MCP. - **Working Memory; what is it paying attention to right now?.** Capability: material fits comfortably, session is current, context supplied. Limitation: very long docs/conversations, cross-session continuity, lost-in-the-middle. The cliff is silent; you will not always be warned. - **Steerability; how much am I in control?.** Capability: short, concrete, verifiable instructions. Limitation: long reasoning chains, abstract asks, native precision. Failures: reasoning drift and letter-over-spirit. Most failures are two properties meeting; naming the pair points to the fix. ### 4 fingerprints fine-tuning leaves on every model These are not bugs in one model; they are training artifacts that appear across all of them. Spotting them puts you back in control. - **Sycophancy.** People prefer agreeable responses, so the model learns to validate your framing and back down under light pushback even when it was right. Counter by explicitly inviting disagreement: genuinely push back if you think I am wrong. - **Verbosity.** Thoroughness scored better in training, so the default is longer answers. Counter with explicit length constraints (one sentence, under 100 words, bullets only). - **Over-caution.** Conservative safety training means hedging on requests that are actually fine. Counter by stating the legitimate context up front so the model has a frame for why the ask is reasonable. - **Loose calibration.** Stated confidence and actual reliability are not tightly coupled. The model can sound certain while being wrong. Verify specifics independently regardless of tone. 
## Key takeaways - Generative AI is a prediction system whose strengths and weaknesses live on four continuums, not in a single capable/unreliable verdict. (`prompt-engineering-techniques`) - Fabrication concentrates in specificity (names, dates, citations, URLs); confident tone is not an accuracy signal and the model cannot reliably tell grounded from invented. (`evaluation`) - Working Memory has a cliff failure mode; silent truncation, lost-in-the-middle, no learning from corrections; so front-loading and chunking are operator-side defenses. (`context-window`) - Steerability fails as reasoning drift on long chains and as letter-over-spirit on abstract asks; restate the goal, not the instruction, when output lands literally but uselessly. (`prompt-engineering-techniques`) - Most real-world AI failures are two properties colliding (hallucinated citation = NTP x Knowledge; long-conversation drift = Working Memory x Steerability); naming the pair points to the targeted fix. (`evaluation`) - Fine-tuning leaves four fingerprints across every model; sycophancy, verbosity, over-caution, loose calibration; and recognizing them is part of using AI well. (`system-prompts`) ## Concepts in play - **Context window** (`context-window`), Working Memory in mechanical terms - **Prompt engineering techniques** (`prompt-engineering-techniques`), How to operate within Steerability's capability zone - **System prompts** (`system-prompts`), Standing directions that resist Working Memory dilution - **Evaluation** (`evaluation`), How you measure Next Token Prediction failure rates in your own domain - **MCP** (`mcp`), Knowledge mitigation: connect models to runtime sources beyond the cutoff ## Scenarios in play - **Long document processing** (`long-document-processing`), Working Memory + Next Token Prediction failure modes show up first here; the cliff and the fabrication zone meet on a single 50-page PDF - **Structured data extraction** (`structured-data-extraction`), Steerability + Next Token Prediction in tension; format constraints land cleanly, but specificity in extracted fields is the verification surface ## Curated sources - **Building effective agents** (anthropic-blog, 2024-12-19): Pairs naturally with the four-property model; explains how the limitations the course names get engineered around in production agents (verifiers, tool use, context management). - **Effective context engineering for AI agents** (anthropic-blog, 2025-09-29): Direct extension of the Working Memory lesson; concrete patterns for managing the context window cliff in real systems, with the same mental model the course teaches. - **Lost in the Middle: How Language Models Use Long Contexts** (paper, 2023-07-06): The original empirical paper behind the lost-in-the-middle effect referenced in the Working Memory lesson; useful when you need to cite the phenomenon in a design review. ## FAQ ### Q1. What are the four properties of generative AI in Anthropic's capabilities and limitations framework? Next Token Prediction (where answers come from), Knowledge (what the model knows), Working Memory (what it is paying attention to right now), and Steerability (how much you are in control). Each sits on a continuum from capability to limitation, and most real-world failures are two of them colliding rather than one acting up alone. ### Q2. Why does an AI hallucinate citations when it sounds so confident? Confident tone and accuracy are independent variables in a generative model. 
The model writes what a plausible citation looks like using Next Token Prediction; when the underlying Knowledge is sparse on that niche topic, it generates citation-shaped text that may or may not point to a real paper. Fabrication concentrates in specificity; names, dates, journal titles, URLs; so verify those independently no matter how smooth the prose sounds. ### Q3. What is the difference between Knowledge and Working Memory in an AI model? Knowledge is what the model absorbed during training and is frozen at a cutoff date. Working Memory is the context window: what the model is paying attention to *right now*; your prompt, uploaded docs, prior turns. Knowledge fails through staleness and uneven coverage; Working Memory fails through silent truncation, lost-in-the-middle, and the blank slate between sessions. The mitigations are different: web search and RAG for Knowledge, front-loading and projects/memory for Working Memory. ### Q4. How do I stop the AI from agreeing with me when I want honest feedback? That behavior is sycophancy, a fingerprint left by fine-tuning on human preference data; people prefer agreeable responses, so the model learns to validate your framing. Counter it by explicitly inviting disagreement in the prompt: I want you to genuinely disagree if you think I am wrong; do not agree just because I sounded confident. The pattern only changes when you give the model permission to push back. ### Q5. Why does my AI assistant ignore the rules I set 20 messages ago? That is long-conversation drift; Working Memory meeting Steerability. Your early constraints have either fallen out of the context window (silent truncation) or are now buried so deep in the conversation that lost-in-the-middle is suppressing them. The fix is to either re-supply the critical constraints in the current turn, move them into a system prompt or Project so they stay persistent, or start a fresh conversation with the essentials front-loaded. ### Q6. Are these four properties going to change as models get better? The properties stay; the boundaries move. Context windows grow, hallucination rates drop, features close gaps. But generative AI will keep being a predictor whose fluency runs ahead of its accuracy, with uneven knowledge frozen at a cutoff, working inside a finite window, following instructions through a gap between words and intent. That is why the framework is durable on purpose; it remains useful even when version numbers change. ### Q7. When should I add web search or RAG to my AI workflow? Whenever your task lives in the Knowledge limitation zone: rare topics, post-cutoff events, niche regulations, local information, fast-moving fields, contested claims, or anywhere staleness is a real risk. Web search routes around the cutoff for time-sensitive questions; RAG and MCP connect the model to documents it never trained on (your wiki, a specialized database). If the task is in the capability zone; mainstream, stable, well-documented; the absorbed knowledge is usually enough. ### Q8. How does this framework relate to the 4D Framework (Delegation, Description, Discernment, Diligence)? They are two halves of one calibrated-trust system. The 4Ds are what *you* do; the four properties are what you are responding to when you do them. Next Token Prediction sharpens Discernment (fluency and accuracy are independent). Working Memory sharpens Description (context is leverage, the model does not remember). Steerability sharpens Delegation (you know where control is tight and where it is loose). 
Knowledge sharpens all of them by telling you when to hand off and when to bring the context yourself. --- **Source:** https://claudearchitectcertification.com/knowledge/ai-capabilities-limitations **Vault sources:** Course_17/Lesson_01_intro-to-ai-capabilities-and-limitations.md; Course_17/Lesson_02_what-we-mean-by-ai.md; Course_17/Lesson_03_how-ai-gets-its-character.md; Course_17/Lesson_04_next-token-prediction.md; Course_17/Lesson_06_knowledge.md; Course_17/Lesson_08_working-memory.md; Course_17/Lesson_10_steerability.md; Course_17/Lesson_12_when-properties-collide.md; Course_17/Lesson_13_next-steps.md **Last reviewed:** 2026-05-06 --- # Claude with Google Cloud Vertex AI: Deployment + GCP Integration > This 93-lesson course teaches the same Claude API surface as claude-api-foundations but accessed through Google Cloud's Vertex AI rather than the direct Anthropic API. The deployment-specific substance is the AnthropicVertex SDK, gcloud Application Default Credentials auth, Model Garden enablement, project + region binding, and the regional model-availability model. Everything else (prompt engineering, evals, tool use, RAG, MCP, agents) mirrors Course 6 lesson-for-lesson. **Domain:** D5 · Context + Reliability (15%) **Difficulty:** intermediate **Skilljar course:** Claude with Google Cloud's Vertex AI (93 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/claude-with-vertex **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 15% (D5) + 18% (D2) Maps directly to D5 task statements about deploying Claude in customer-managed cloud environments, IAM-bound auth, regional model availability, and quota management on Vertex AI. The API/prompt/tool/RAG/MCP/agent content is identical to Course 6, so this page focuses on what differs at the deployment seam. ## What you'll learn - How to enable Claude models in Vertex AI Model Garden and authenticate via gcloud Application Default Credentials - How the AnthropicVertex Python SDK differs from the direct Anthropic SDK (project_id, region, model name format) - Which Claude features ride on top of Vertex unchanged (prompt caching, vision, PDF support, citations, extended thinking, batch) - How regional model availability and quota management work on Vertex compared to the direct API - How IAM, VPC Service Controls, and Cloud Logging fit into a Vertex-hosted Claude deployment - When to choose Vertex AI vs the direct Anthropic API vs Amazon Bedrock for a given workload ## Prerequisites - **Claude API Foundations (the platform-agnostic content)** (knowledge · `claude-api-foundations`) - **Claude 101: First Principles** (knowledge · `claude-101`) ## Lesson outline ### 1. Welcome to the course Course intro and what Vertex AI adds vs the direct Anthropic API. ### 2. Overview of Claude models Model family overview; same models, accessed via Vertex. ### 3. Accessing the API Request lifecycle: client to server to Vertex to model and back; never call from browser. ### 4. Vertex AI Setup DEPLOYMENT-SPECIFIC: enable Anthropic models in Model Garden, install gcloud CLI, run gcloud auth application-default login. ### 5. Making a request DEPLOYMENT-SPECIFIC: pip install anthropic[vertex], instantiate AnthropicVertex(region=..., project_id=...), model id format claude-sonnet-4@20250514. ### 6. Multi-turn conversations Append assistant + user messages to maintain dialogue state; mirrors Course 6. ### 7. Chat exercise Hands-on: build a minimal chat loop against Vertex. ### 8. 
System prompts Set role and behavior via the system parameter; identical semantics to direct API. ### 9. System prompts exercise Hands-on: experiment with system prompt variations. ### 10. Temperature Sampling control 0 to 1; lower for deterministic, higher for creative. ### 11. Course satisfaction survey Mid-course feedback prompt. ### 12. Response streaming Stream tokens as they generate via SSE; reduces TTFT for chat UIs. ### 13. Controlling model output max_tokens, stop_sequences, and response shaping basics. ### 14. Structured data Coax JSON via prompting; introduce schema thinking before tool use. ### 15. Structured data exercise Hands-on: extract structured fields with prompted JSON. ### 16. Quiz on accessing Claude with the API Section quiz on auth, request shape, and response handling. ### 17. Prompt evaluation Why systematic eval beats vibes; same content as Course 6. ### 18. A typical eval workflow Test set + grader + iteration loop; canonical pattern. ### 19. Generating test datasets Use Claude itself (via Vertex) to synthesize test inputs at scale. ### 20. Running the eval Loop the test set through your prompt and capture outputs. ### 21. Model-based grading LLM-as-judge: rubric-driven grading with a separate Claude call. ### 22. Code-based grading Deterministic graders for format, regex, schema validation. ### 23. Exercise on prompt evals Hands-on: stand up a small eval harness. ### 24. Quiz on prompt evaluation Section quiz on eval workflow. ### 25. Prompt engineering Intro to the canonical Anthropic prompt-engineering techniques. ### 26. Being clear and direct State the task plainly; ambiguity costs more than verbosity. ### 27. Being specific Specificity collapses the response space; vague asks invite drift. ### 28. Structure with XML tags XML tags as structural anchors Claude attends to reliably. ### 29. Providing examples Few-shot examples in the prompt steer style and format. ### 30. Exercise on prompting Hands-on: refactor a weak prompt using the four techniques. ### 31. Quiz on prompt engineering techniques Section quiz on the prompting toolkit. ### 32. Introducing tool use Tools let Claude call your functions; same protocol on Vertex. ### 33. Project overview Multi-tool project setup for the section. ### 34. Tool functions Define the Python functions Claude will call. ### 35. Tool schemas JSON schema for tool inputs; the model only sees this schema. ### 36. Handling message blocks Iterate the response content blocks (text + tool_use) and dispatch. ### 37. Sending tool results Append tool_result messages and re-call the model to continue. ### 38. Multi-turn conversations with tools Maintain the agentic loop across multiple tool calls. ### 39. Implementing multiple turns Hands-on: code the loop with stop_reason guards. ### 40. Using multiple tools Register a tool list; let Claude pick the right one per turn. ### 41. The batch tool Submit many requests at once for cost-and-throughput wins. ### 42. Tools for structured data Tool schemas as the cleanest path to structured outputs. ### 43. The text edit tool Built-in text-editor tool for file-edit workflows. ### 44. The web search tool Built-in web search tool (note: availability differs across deployment platforms; check Vertex docs). ### 45. Quiz on tool use with Claude Section quiz on tool-use mechanics. ### 46. Introducing retrieval-augmented generation RAG = retrieve relevant docs and add them to context; Knowledge mitigation. ### 47. 
Text chunking strategies Fixed-size, sentence-boundary, and semantic chunking tradeoffs. ### 48. Text embeddings Dense vector representations for semantic search; on GCP often Vertex-hosted embedding models. ### 49. The full RAG flow Embed -> store -> retrieve top-k -> augment prompt -> generate. ### 50. Implementing the RAG flow Hands-on RAG pipeline. ### 51. BM25 lexical search Keyword/term-frequency retrieval as a complement to embedding search. ### 52. A multi-index RAG pipeline Combine BM25 + embeddings with reciprocal rank fusion. ### 53. Reranking results Cross-encoder rerank of top-k retrievals before context insertion. ### 54. Contextual retrieval Anthropic's contextual retrieval: prepend a chunk-aware summary before embedding. ### 55. Quiz on retrieval-augmented generation Section quiz on RAG components. ### 56. Extended thinking Reasoning mode where the model thinks before answering; surfaced as a content block. ### 57. Image support Vision: pass images as base64 or URL content blocks. ### 58. PDF support Native PDF inputs; large docs feed the Working Memory cliff. ### 59. Citations Built-in citations: model returns spans tying claims back to source docs. ### 60. Prompt caching Cache stable prefixes (system prompt, RAG context) for major cost wins. ### 61. Rules of prompt caching TTLs, breakpoints, minimum sizes; cache-hit accounting on Vertex. ### 62. Prompt caching in action Hands-on: measure cache-hit savings on a realistic workload. ### 63. Quiz on features of Claude Section quiz on cross-cutting features. ### 64. Introducing MCP Model Context Protocol: standard for connecting tools/data to Claude. ### 65. MCP clients Claude Desktop, Claude Code, custom clients; all speak the same protocol. ### 66. Project setup Stand up an MCP server scaffold for the section. ### 67. Defining tools with MCP Expose tools through MCP rather than per-app tool schemas. ### 68. The server inspector Anthropic's MCP inspector for debugging server output. ### 69. Implementing a client Build a custom MCP client around Claude on Vertex. ### 70. Defining resources MCP resources: model-pull data sources. ### 71. Accessing resources Wire resources into the client and let Claude read them. ### 72. Defining prompts MCP prompts: server-provided prompt templates. ### 73. Prompts in the client Surface MCP-served prompts in client UI. ### 74. MCP review Recap of tools/resources/prompts split. ### 75. Quiz on Model Context Protocol Section quiz on MCP architecture. ### 76. Anthropic apps Overview of Claude Desktop and Claude Code as Vertex-compatible clients. ### 77. Claude Code setup Install + configure Claude Code; works against Vertex-backed deployments. ### 78. Claude Code in action Live coding session demoing common workflows. ### 79. Enhancements with MCP servers Plug MCP servers into Claude Code for repo/db/issue access. ### 80. Parallelizing Claude Code git worktrees + multiple sessions for parallel feature work. ### 81. Automated debugging Subagent-driven bug repro and fix loop. ### 82. Computer use Computer-use tool: Claude controls a virtual desktop via screenshots + actions. ### 83. How computer use works Action loop: screenshot -> reason -> click/type -> repeat. ### 84. Agents and workflows Distinction: workflows are scripted, agents choose their own path. ### 85. Parallelization workflows Fan-out workflow: same task across many inputs in parallel. ### 86. Chaining workflows Sequential workflow: output of step N feeds step N+1. ### 87. 
Routing workflows Classifier-driven routing to specialized downstream prompts. ### 88. Agents and tools Agentic loop with a curated tool whitelist; same primitives on Vertex. ### 89. Environment inspection Let the agent probe its environment before acting. ### 90. Workflows vs agents Choose workflow when steps are known; choose agent when path is open. ### 91. Quiz on agents and workflows Section quiz on agent design. ### 92. Final assessment quiz End-of-course assessment across all sections. ### 93. Course wrap-up Recap and pointers to deeper Vertex + Anthropic resources. ## Our simplification This course is the platform-agnostic Claude API course (Course 6 claude-api-foundations) wrapped in Google Cloud; same prompt engineering, same eval workflow, same tool-use mechanics, same RAG patterns, same MCP protocol, same agent design. If you have done Course 6, roughly 85 of the 93 lessons will feel familiar verbatim. The ~8 lessons that *justify a separate Knowledge page* are the deployment seam: how you authenticate, how you address models, how regions and quotas work, and how Vertex's enterprise controls (IAM, VPC-SC, Cloud Logging) fit on top. This page focuses on those seams; for everything else, lean on claude-api-foundations as the canonical reference. Authentication on Vertex is gcloud-mediated, not API-key-mediated. You install the gcloud CLI, run gcloud init and gcloud auth application-default login, set a project with gcloud config set project YOUR_PROJECT_ID, and from then on the AnthropicVertex SDK picks up Application Default Credentials automatically. There is no ANTHROPIC_API_KEY. The implication for your architecture is that auth is bound to a Google Cloud identity (a user account in dev, a service account in production), which means your IAM model becomes the security boundary. Grant roles/aiplatform.user to the service account, scope it to the project, and rotate via Google Cloud's normal service-account-key lifecycle (or Workload Identity Federation if you are running outside GCP). The SDK surface differs in three load-bearing ways. First, the import: from anthropic import AnthropicVertex instead of from anthropic import Anthropic. Second, the constructor: AnthropicVertex(region="global", project_id="your-project-id"); the region and project_id are mandatory and bind every request to a specific Vertex tenant. Third, the model id format: Vertex uses claude-sonnet-4@20250514 rather than claude-sonnet-4-20250514 (note the @ instead of trailing dash). Everything below those three lines (messages.create, content blocks, tool schemas, streaming, prompt caching) is byte-for-byte identical to the direct API. pip install "anthropic[vertex]" pulls the right extras. Regional model availability is a real operational concern. Not every Claude model is hosted in every Vertex region; new models often launch in us-east5 or us-central1 first and roll out elsewhere over weeks. The Vertex region="global" setting routes to the nearest available region and is usually the right default for production unless you have data-residency constraints. If you do need a specific region (EU data residency, regulated workloads), check the Model Garden listing for that region before you commit; a model that exists in us-east5 will return a not found error in europe-west4 even though both are valid Vertex regions. Quotas are per-project per-region and are managed in the Cloud Console under Quotas & system limits; default quotas are conservative and you will likely raise them before production traffic. 
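Putting the three deltas together, a minimal request looks roughly like this (a sketch assuming pip install "anthropic[vertex]" has run and Application Default Credentials are configured; the project id and prompt are placeholders):

```python
# Sketch of a Vertex-addressed call; everything below the constructor matches the direct API.
from anthropic import AnthropicVertex

client = AnthropicVertex(region="global", project_id="your-project-id")

message = client.messages.create(
    model="claude-sonnet-4@20250514",  # Vertex format: @ before the date, not a dash
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this incident report in three bullets."}],
)
print(message.content[0].text)
```

Only the client construction and the model id change; the messages.create call, content blocks, and streaming behave exactly as they do against the direct API.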
Enterprise controls ride on top of Vertex unchanged. This is the main reason customers choose Vertex over the direct Anthropic API: VPC Service Controls confine traffic to a security perimeter, Cloud Audit Logs capture every messages.create invocation with caller identity, Customer-Managed Encryption Keys (CMEK) wrap inputs and outputs, and Private Service Connect avoids public-internet egress. None of those are Anthropic features per se; they are GCP features that Vertex inherits because Claude is served as a first-class Vertex AI model. The compliance story is Vertex's, not Anthropic's: SOC 2, ISO 27001, HIPAA BAA (where applicable), FedRAMP for government workloads. If your org has a Google Cloud landing zone, deploying Claude on Vertex slots into existing policy controls instead of standing up a parallel data-flow review. Feature parity is high but not perfect, and the gaps move over time. Prompt caching, vision, PDF support, citations, extended thinking, and the batch API generally land on Vertex within weeks of the direct API release; the message-format protocol is identical. The exceptions tend to be at the *tool* layer: the built-in web search tool and computer use have shipped with deployment-specific availability gates, so check the Anthropic on Vertex docs before depending on them. Pricing is set by Google Cloud (not Anthropic) and is typically priced per 1M input/output tokens at parity with the direct API, billed through your GCP invoice. Cache hits and batch requests get the same multipliers you see on direct API. When to choose Vertex vs the direct Anthropic API vs Bedrock. Pick Vertex when your stack is already on Google Cloud; the auth, billing, IAM, audit, and data-residency stories all consolidate, and you avoid a second vendor relationship. Pick the direct Anthropic API when you want fastest access to new models and features, simpler key-based auth, and no cloud lock-in. Pick Bedrock (covered in claude-with-bedrock) when your stack is on AWS for the symmetric reasons. The application code is roughly 95% portable across all three; the differences are auth, model id format, the SDK constructor, and operational integrations. Picking a deployment platform is mostly an organizational decision, not a technical one; and the exam expects you to recognize that. ## Patterns ### 5 things that change when you move from the direct Anthropic API to Vertex If you already know the direct API (Course 6), these are the deltas you actually need to internalize. Everything else is unchanged. - **Auth: gcloud ADC instead of API key.** No ANTHROPIC_API_KEY. Run gcloud auth application-default login in dev; use a service account with roles/aiplatform.user in prod. Auth is bound to a Google Cloud identity, which means IAM is your security boundary. - **SDK constructor: `AnthropicVertex(region, project_id)`.** from anthropic import AnthropicVertex and pass region and project_id. region="global" is a sensible default unless data residency dictates otherwise. Install with pip install "anthropic[vertex]". - **Model id format uses `@` not `-`.** claude-sonnet-4@20250514 on Vertex, claude-sonnet-4-20250514 on the direct API. A small but easy-to-trip-on difference; copy from the Model Garden listing rather than from the Anthropic docs. - **Regional availability is real.** Not every model is in every region. New models tend to launch in us-east5 first. Check Model Garden for your target region before you commit. region="global" routes to nearest available. 
- **Enterprise controls come from GCP.** VPC-SC, CMEK, Cloud Audit Logs, Private Service Connect, and the GCP compliance posture (SOC 2, ISO 27001, HIPAA BAA, FedRAMP) all apply because Claude is served as a Vertex model. Compliance story is GCP's, not Anthropic's directly. ## Key takeaways - Vertex deployment is the same Claude API surface as claude-api-foundations plus a different auth and addressing model; about 85 of 93 lessons mirror Course 6 verbatim. (`tool-calling`) - Authentication is gcloud Application Default Credentials, not an API key; production uses a service account with roles/aiplatform.user and IAM is the security boundary. (`system-prompts`) - The AnthropicVertex SDK requires region and project_id, and the model id format uses @ (e.g. claude-sonnet-4@20250514) instead of a trailing dash. (`tool-calling`) - Regional model availability is a real operational gate; new models launch in specific regions first and region="global" is the sensible production default unless data residency requires otherwise. (`claude-for-operations`) - VPC Service Controls, CMEK, Cloud Audit Logs, and the GCP compliance posture (SOC 2, HIPAA BAA, FedRAMP) ride on top of Vertex unchanged; that is the main reason enterprises pick Vertex over the direct API. (`evaluation`) - Application code is roughly 95% portable across direct API, Vertex, and Bedrock; choosing a deployment platform is mostly an organizational decision (where your cloud landing zone lives), not a technical one. (`tool-calling`) ## Concepts in play - **Tool calling** (`tool-calling`), Same protocol on Vertex; tool schemas are byte-identical to direct API - **Prompt caching** (`prompt-caching`), Available on Vertex with same TTL/breakpoint rules as direct API - **Batch API** (`batch-api`), Available on Vertex; useful for cost-sensitive bulk workloads - **MCP** (`mcp`), Protocol-level, deployment-independent; works with Vertex-backed Claude - **Vision and multimodal** (`vision-multimodal`), Image and PDF inputs work identically on Vertex ## Scenarios in play - **Claude for operations** (`claude-for-operations`), Vertex's IAM, audit, and VPC-SC controls are the operational substrate this scenario depends on for enterprise rollouts - **Structured data extraction** (`structured-data-extraction`), Common Vertex workload: route extraction jobs through Cloud Run / Cloud Functions backed by Claude on Vertex ## Curated sources - **Claude on Google Cloud Vertex AI; Anthropic API documentation** (anthropic-blog, 2025-09-01): Canonical reference for the SDK surface, model id formats, and feature-availability table on Vertex. Pair with Skilljar Lessons 4-5 when you start authenticating against a real GCP project. - **Anthropic Claude in Vertex AI Model Garden** (anthropic-blog, 2025-10-15): Google's own integration docs covering Model Garden enablement, regional availability, quota management, and the IAM permissions required to invoke Claude on Vertex. ## FAQ ### Q1. What is the difference between using Claude through the Anthropic API and through Google Vertex AI? The application code is roughly 95% identical; the differences are at the deployment seam. Vertex uses gcloud Application Default Credentials instead of an ANTHROPIC_API_KEY, requires AnthropicVertex(region, project_id) instead of Anthropic(), uses @ in the model id (e.g. claude-sonnet-4@20250514), and inherits Google Cloud's IAM, audit, VPC-SC, and compliance controls. Choose Vertex when your stack is on GCP; choose direct API for simpler auth and fastest access to new features. 
### Q2. How do I authenticate with Claude on Vertex AI? Install the gcloud CLI, run gcloud init and gcloud auth login, set your project with gcloud config set project YOUR_PROJECT_ID, then run gcloud auth application-default login. The AnthropicVertex SDK picks up Application Default Credentials automatically. There is no API key; auth is bound to a Google Cloud identity (your user account in dev, a service account with roles/aiplatform.user in production). ### Q3. Why does my Vertex AI request return a model-not-found error when the model exists? Almost always a regional availability mismatch. Not every Claude model is hosted in every Vertex region, and new models often launch in us-east5 or us-central1 first. Check the Model Garden listing for your target region before you commit. The fix is usually to switch to region="global", which routes to the nearest available region, unless data residency dictates a specific region. Also confirm the model id format; Vertex uses claude-sonnet-4@20250514 with an @, not a trailing dash. ### Q4. Does prompt caching work on Claude through Vertex AI? Yes. Prompt caching, vision, PDF support, citations, extended thinking, and the batch API all work on Vertex with the same TTLs, breakpoint rules, and pricing multipliers as the direct API. The message-format protocol is identical. The features that occasionally lag are at the tool layer; the built-in web search tool and computer use have shipped with deployment-specific availability gates, so check the Anthropic on Vertex docs before depending on them. ### Q5. Should I use Vertex AI or the direct Anthropic API for my production deployment? Pick Vertex when your stack is already on Google Cloud; the auth, billing, IAM, audit, VPC-SC, and data-residency stories all consolidate into your existing GCP landing zone. Pick the direct Anthropic API when you want the fastest access to new models and features, simpler key-based auth, and no cloud lock-in. The application code is portable both ways, so this is mostly an organizational decision (where does your security review live, who pays the invoice) rather than a technical one. ### Q6. How do I install the right Anthropic SDK for Vertex AI in Python? Run pip install "anthropic[vertex]"; the [vertex] extras pull in the Google Auth dependencies needed to connect to Vertex. Then import AnthropicVertex (not Anthropic) and instantiate it with region and project_id. The same messages.create API works on both clients, so application code below the constructor line is unchanged. ### Q7. What IAM permissions does my service account need to call Claude on Vertex? At minimum, roles/aiplatform.user on the project. For production, scope tightly: grant only that role, and only on the project hosting your AI workloads, and avoid long-lived service-account keys by using Workload Identity Federation if your code runs outside GCP. Auth is bound to identity, so any IAM policy you apply to the service account flows through to your Claude calls, including organizational policies on which regions are allowed and which models are enabled. ### Q8. Can I use HIPAA, SOC 2, or FedRAMP-covered Claude through Vertex? Yes; the compliance posture is Google Cloud's, and Claude on Vertex inherits it. SOC 2, ISO 27001, HIPAA BAA (where applicable), and FedRAMP for government workloads all extend to Anthropic models served through Vertex AI. The compliance story is Vertex's, not Anthropic's directly, which is one of the main reasons regulated industries pick Vertex over the direct API. 
Confirm the specific certifications in the Google Cloud Compliance Resource Center for your target region before going to production. --- **Source:** https://claudearchitectcertification.com/knowledge/claude-with-vertex **Vault sources:** Course_12/Lesson_03_accessing-the-api.md; Course_12/Lesson_04_vertex-ai-setup.md; Course_12/Lesson_05_making-a-request.md; Course_12/Lesson_60_prompt-caching.md; Course_12/Lesson_61_rules-of-prompt-caching.md; Course_12/Lesson_64_introducing-mcp.md; Course_12/Lesson_84_agents-and-workflows.md **Last reviewed:** 2026-05-06 --- # Claude in Amazon Bedrock: Deployment + AWS Integration > This 83-lesson course teaches the same Claude API surface as claude-api-foundations but accessed through Amazon Bedrock rather than the direct Anthropic API. The deployment-specific substance is the AWS boto3 SDK, IAM-bound auth, the Bedrock converse API shape, regional model availability with cross-region inference profiles, and Bedrock-only features (Guardrails, Knowledge Bases, Agents). Everything else (prompt engineering, evals, tool use, RAG, MCP) mirrors Course 6 lesson-for-lesson. **Domain:** D5 · Context + Reliability (15%) **Difficulty:** intermediate **Skilljar course:** Claude with Amazon Bedrock (83 lessons) **Canonical:** https://claudearchitectcertification.com/knowledge/claude-with-bedrock **Last reviewed:** 2026-05-06 ## Exam mapping **Blueprint share:** 15% (D5) + 18% (D2) Maps directly to D5 task statements about deploying Claude in customer-managed cloud environments, IAM-bound auth, regional model availability, cross-region inference profiles, and quota management on AWS. The API/prompt/tool/RAG/MCP/agent content is the same as Course 6, so this page focuses on what differs at the deployment seam. ## What you'll learn - How to enable Claude models in Bedrock and authenticate via AWS IAM (access keys, IAM roles, or SSO) - How the boto3 bedrock-runtime client and the converse API differ from the direct Anthropic SDK - What inference profiles are and how cross-region inference solves Bedrock's regional availability problem - Which Claude features ride on top of Bedrock unchanged (prompt caching, vision, PDF support, tool use, extended thinking) and which are AWS-specific add-ons (Guardrails, Knowledge Bases, Bedrock Agents) - How AWS-native controls (IAM, VPC endpoints, CloudTrail, KMS) fit into a Bedrock-hosted Claude deployment - When to choose Bedrock vs the direct Anthropic API vs Vertex AI for a given workload ## Prerequisites - **Claude API Foundations (the platform-agnostic content)** (knowledge · `claude-api-foundations`) - **Claude 101: First Principles** (knowledge · `claude-101`) ## Lesson outline ### 1. Introduction to the course Course intro; originally produced for AWS employees, now public. Same Claude API content via Bedrock. ### 2. Overview of Claude models Model family overview; same models, accessed via Bedrock. ### 3. Accessing the API Request lifecycle: client to server to Bedrock to model and back; never call from browser. ### 4. Making a request DEPLOYMENT-SPECIFIC: boto3.client('bedrock-runtime'), model id vs inference profile, converse API, regional availability. ### 5. Multi-turn conversations Append assistant + user messages to maintain dialogue state; mirrors Course 6 with Bedrock content-block shape. ### 6. Chat bot exercise Hands-on: build a minimal chat loop against Bedrock. ### 7. System prompts DEPLOYMENT-SPECIFIC: Bedrock converse takes system=[{"text": "..."}] (list of blocks) instead of a plain string. ### 8. 
System prompt exercise Hands-on: experiment with system prompts on Bedrock. ### 9. Temperature Sampling control via inferenceConfig.temperature parameter on converse. ### 10. Streaming DEPLOYMENT-SPECIFIC: Bedrock uses converse_stream (different method) returning EventStream. ### 11. Controlling model output max_tokens via inferenceConfig.maxTokens; stop sequences via stopSequences. ### 12. Structured data Coax JSON via prompting; introduce schema thinking before tool use. ### 13. Structured data exercise Hands-on: extract structured fields with prompted JSON. ### 14. Quiz on working with the API Section quiz on Bedrock API mechanics. ### 15. Prompt evaluation Why systematic eval beats vibes; same content as Course 6. ### 16. A typical eval workflow Test set + grader + iteration loop; canonical pattern. ### 17. Generating test datasets Use Claude itself (via Bedrock) to synthesize test inputs at scale. ### 18. Running the eval Loop the test set through your prompt and capture outputs. ### 19. Model-based grading LLM-as-judge: rubric-driven grading with a separate Claude call. ### 20. Code-based grading Deterministic graders for format, regex, schema validation. ### 21. Exercise on prompt evals Hands-on: stand up a small eval harness. ### 22. Quiz on prompt evaluations Section quiz on eval workflow. ### 23. Prompt engineering Intro to the canonical Anthropic prompt-engineering techniques. ### 24. Being clear and direct State the task plainly; ambiguity costs more than verbosity. ### 25. Being specific Specificity collapses the response space; vague asks invite drift. ### 26. Structure with XML tags XML tags as structural anchors Claude attends to reliably. ### 27. Providing examples Few-shot examples in the prompt steer style and format. ### 28. Exercise on prompting Hands-on: refactor a weak prompt using the four techniques. ### 29. Quiz on prompt engineering Section quiz on the prompting toolkit. ### 30. Introducing tool use Tools let Claude call your functions; Bedrock converse uses toolConfig shape. ### 31. Tool functions Define the Python functions Claude will call. ### 32. JSON schema for tools JSON schema for tool inputs; the model only sees this schema. ### 33. Handling tool use responses Iterate the response content blocks (text + toolUse) and dispatch. ### 34. Running tool functions Execute the tool function with the model-supplied arguments. ### 35. Sending tool results Append toolResult content blocks and re-call converse to continue. ### 36. Multi-turn conversations with tools Maintain the agentic loop across multiple tool calls. ### 37. Adding multiple tools Register a tool list in toolConfig.tools; let Claude pick the right one per turn. ### 38. Batch tool use Submit many tool-use requests at once for cost-and-throughput wins. ### 39. Structured data with tools Tool schemas as the cleanest path to structured outputs. ### 40. Flexible tool extraction Patterns for parsing tool use across varied inputs. ### 41. The text editor tool Built-in text-editor tool for file-edit workflows. ### 42. Quiz on tool use Section quiz on tool-use mechanics. ### 43. Introducing retrieval-augmented generation RAG = retrieve relevant docs and add them to context; Knowledge mitigation. ### 44. Text chunking strategies Fixed-size, sentence-boundary, and semantic chunking tradeoffs. ### 45. Text embeddings Dense vector representations for semantic search; on AWS often Bedrock-hosted Titan or Cohere embeddings. ### 46. The full RAG flow Embed -> store -> retrieve top-k -> augment prompt -> generate. 
### 47. Implementing the RAG flow Hands-on RAG pipeline. Note: Bedrock Knowledge Bases provides a managed alternative. ### 48. BM25 lexical search Keyword/term-frequency retrieval as a complement to embedding search. ### 49. A multi-search RAG pipeline Combine BM25 + embeddings with reciprocal rank fusion. ### 50. Reranking results Cross-encoder rerank of top-k retrievals before context insertion. ### 51. Contextual retrieval Anthropic's contextual retrieval: prepend a chunk-aware summary before embedding. ### 52. Quiz on retrieval-augmented generation Section quiz on RAG components. ### 53. Extended thinking Reasoning mode where the model thinks before answering; surfaced as a content block. ### 54. Image support Vision: pass images as base64 or S3 references in content blocks. ### 55. PDF support Native PDF inputs; large docs feed the Working Memory cliff. ### 56. Citations Built-in citations: model returns spans tying claims back to source docs. ### 57. Prompt caching Cache stable prefixes (system prompt, RAG context) for major cost wins. ### 58. Rules of prompt caching TTLs, breakpoints, minimum sizes; cache-hit accounting on Bedrock. ### 59. Prompt caching in action Hands-on: measure cache-hit savings on a realistic workload. ### 60. Quiz on features of Claude Section quiz on cross-cutting features. ### 61. Introducing MCP Model Context Protocol: standard for connecting tools/data to Claude. ### 62. MCP clients Claude Desktop, Claude Code, custom clients; all speak the same protocol. ### 63. Project setup Stand up an MCP server scaffold for the section. ### 64. Defining tools with MCP Expose tools through MCP rather than per-app tool schemas. ### 65. The server inspector Anthropic's MCP inspector for debugging server output. ### 66. Implementing a client Build a custom MCP client around Claude on Bedrock. ### 67. Defining resources MCP resources: model-pull data sources. ### 68. Accessing resources Wire resources into the client and let Claude read them. ### 69. Defining prompts MCP prompts: server-provided prompt templates. ### 70. Prompts in the client Surface MCP-served prompts in client UI. ### 71. MCP review Recap of tools/resources/prompts split. ### 72. Quiz on Model Context Protocol Section quiz on MCP architecture. ### 73. Agents overview Agentic loop intro. Note: Bedrock Agents is an AWS-managed alternative not covered in detail here. ### 74. Claude Code setup Install + configure Claude Code; can be configured against Bedrock-backed deployments. ### 75. Claude Code in action Live coding session demoing common workflows. ### 76. Enhancements with MCP servers Plug MCP servers into Claude Code for repo/db/issue access. ### 77. Parallelizing Claude Code git worktrees + multiple sessions for parallel feature work. ### 78. Automated debugging Subagent-driven bug repro and fix loop. ### 79. Computer use Computer-use tool: Claude controls a virtual desktop via screenshots + actions. ### 80. How computer use works Action loop: screenshot -> reason -> click/type -> repeat. ### 81. Qualities of agents What makes agents reliable vs brittle: scope, tools, evaluation, escalation. ### 82. Final assessment quiz End-of-course assessment across all sections. ### 83. Course wrap-up Recap and pointers to deeper Bedrock + Anthropic resources. ## Our simplification This course is the platform-agnostic Claude API course (Course 6 claude-api-foundations) wrapped in AWS; same prompt engineering, same eval workflow, same tool-use mechanics, same RAG patterns, same MCP protocol. 
If you have done Course 6, roughly 75 of the 83 lessons will feel familiar verbatim. The ~8 lessons that *justify a separate Knowledge page* are the deployment seam: how you authenticate, which API shape you call, how regions and inference profiles work, and how AWS-native enterprise features (Guardrails, Knowledge Bases, IAM, VPC endpoints, CloudTrail, KMS) fit on top. This page focuses on those seams; for everything else, lean on claude-api-foundations as the canonical reference. Authentication on Bedrock is IAM-bound, not API-key-bound. You get credentials the AWS way: an IAM user with access keys for development (set via aws configure or environment variables), or an IAM role attached to your compute (EC2 instance profile, ECS task role, Lambda execution role) for production. The boto3.client('bedrock-runtime', region_name='us-west-2') call picks up credentials from the standard AWS credential chain automatically. There is no ANTHROPIC_API_KEY; auth is bound to an AWS principal, which means your IAM policy is the security boundary. Grant bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream on the specific model ARNs you need, scope by region, and prefer short-lived credentials (IAM Identity Center or STS) over long-lived access keys. The SDK and API shape diverge from the direct Anthropic SDK in three load-bearing ways. First, the client: boto3.client('bedrock-runtime') instead of Anthropic() or AnthropicBedrock() (the latter exists as an Anthropic convenience wrapper, but the course uses raw boto3). Second, the API method: client.converse(modelId=..., messages=...) rather than messages.create(...). Third, the message shape: Bedrock's converse uses lists-of-content-blocks even for plain text, so a user message is {"role": "user", "content": [{"text": "What is 1+1?"}]} rather than the direct API's flat string. The system prompt is a list of blocks too: system=[{"text": "You are a helpful assistant."}]. Tool use goes through toolConfig and toolUse/toolResult blocks. The semantics are identical to the direct API; the wrapping is verbose. Regional model availability is the single biggest operational gotcha, and inference profiles are the AWS-specific fix. Not every Claude model is hosted in every AWS region; Claude Sonnet might be hosted in us-west-2 while you are calling from us-east-1, and you will get a cryptic model-not-found error. The pre-2024 fix was to manually pin every call to the right region. The current AWS-native solution is inference profiles: a profile id (listed under Cross-region inference in the Bedrock console, not under the main model catalog) that AWS resolves at request time, routing the call to a region where the model exists. You pass the inference profile id as modelId and AWS handles regional load balancing. This is a Bedrock-only concept; there is no analogue on Vertex or the direct API, so it is one of the most exam-relevant deployment-specific facts in the course. AWS-side bonuses come in two flavors: managed services and enterprise controls. The managed-service tier sits on top of Bedrock and has no analogue on the direct API or Vertex: Bedrock Guardrails for content filters, PII redaction, and topic restrictions; Bedrock Knowledge Bases as managed RAG (ingest from S3, OpenSearch Serverless or Aurora as vector store); Bedrock Agents as higher-level agent orchestration with action groups, knowledge bases, and AWS-handled control flow. The Skilljar course teaches you the underlying primitives (tool use, RAG, agents) at the API level; the AWS-managed services are *alternatives* you can use instead of rolling your own. 
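A minimal sketch of the converse shape and credential chain described above, using raw boto3; the region, model id, and prompt are illustrative, and in production you would typically pass a cross-region inference profile id as modelId.

```python
# Minimal sketch: calling Claude on Bedrock via boto3's converse API.
# Assumes AWS credentials are available through the standard credential chain
# (aws configure in dev, an IAM role in production) and that the illustrative
# model id below is enabled in your account and region.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    # Illustrative id; in production this is often a cross-region inference profile id.
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    # System prompt is a list of blocks, not a plain string.
    system=[{"text": "You are a helpful assistant."}],
    # Content is always a list of blocks, even for plain text.
    messages=[{"role": "user", "content": [{"text": "What is 1+1?"}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])
```

The verbosity is the only real difference; the request/response semantics map one-to-one onto messages.create on the direct API.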
The control tier is the main reason regulated customers pick Bedrock: VPC endpoints (PrivateLink) keep traffic off the public internet, CloudTrail logs every InvokeModel call with caller identity, AWS KMS Customer-Managed Keys wrap inputs and outputs, AWS Config tracks compliance drift. The compliance story is Bedrock's, not Anthropic's directly: SOC, ISO, HIPAA-eligible (with a BAA), FedRAMP High in GovCloud, IRAP for Australia. Both tiers slot into existing AWS landing zones rather than requiring parallel security review. Feature parity is high but not perfect, and the gaps move over time. Prompt caching, vision, PDF support, citations, extended thinking, and tool use generally land on Bedrock within weeks of the direct API release; the message-format protocol is identical though more verbose. The exceptions tend to be at the *tool* layer: the built-in web search tool and computer use have shipped with deployment-specific availability gates and may lag the direct API. The batch API is exposed as Bedrock's separate Batch Inference job runner rather than as inline messages.batches calls. Pricing is set by AWS and is typically priced per 1K input/output tokens at parity with the direct API, billed through your AWS invoice. Cache hits get the same multipliers; reserved throughput is an AWS-specific provisioned-capacity option for predictable high-volume workloads. When to choose Bedrock vs the direct Anthropic API vs Vertex AI. Pick Bedrock when your stack is already on AWS; IAM, VPC endpoints, CloudTrail, KMS, billing, and your security review consolidate, and you avoid a second vendor relationship. The AWS-only managed services (Guardrails, Knowledge Bases, Agents) are real bonuses for regulated industries. Pick the direct Anthropic API when you want fastest access to new models and features, simpler key-based auth, and no cloud lock-in. Pick Vertex when you are on Google Cloud for the symmetric reasons. The application code is roughly 95% portable across all three; the differences are auth, the API method shape, the message wrapping, regional/inference-profile mechanics, and AWS-specific managed services. Picking a deployment platform is mostly an organizational decision (where your landing zone lives), not a technical one; and the exam expects you to recognize that. ## Patterns ### 6 things that change when you move from the direct Anthropic API to Bedrock If you already know the direct API (Course 6), these are the deltas you actually need to internalize. Everything else is unchanged. - **Auth: IAM, not API key.** No ANTHROPIC_API_KEY. Use aws configure, IAM roles on EC2/ECS/Lambda, or short-lived STS credentials. Auth is bound to an AWS principal and IAM policy is your security boundary. Required permission: bedrock:InvokeModel on the model ARN. - **Client: boto3 `bedrock-runtime`.** client = boto3.client('bedrock-runtime', region_name='us-west-2'). Use AWS standard credential chain. The Anthropic SDK ships an AnthropicBedrock convenience wrapper, but the official Skilljar course teaches raw boto3. - **Method: `converse`, not `messages.create`.** client.converse(modelId=..., messages=[user_message]). Streaming uses converse_stream. Response shape: response['output']['message']['content'][0]['text']. The semantics map 1:1 to the direct API but the wrapper is more verbose. - **Content blocks are always lists.** User message: {'role': 'user', 'content': [{'text': '...'}]}. System prompt: system=[{'text': '...'}]. 
The list shape is so multimodal content (images, documents) can be mixed; plain text just looks more verbose. - **Inference profiles for cross-region routing.** Models exist in specific regions. Use cross-region inference profiles (under Cross-region inference in the Bedrock console) so AWS auto-routes to a region where the model is available. Pass the profile id as modelId. Bedrock-only concept; no analogue on Vertex or direct API. - **AWS-managed bonuses: Guardrails, Knowledge Bases, Agents.** Bedrock Guardrails for content controls, Bedrock Knowledge Bases for managed RAG, Bedrock Agents for managed agent orchestration. These are alternatives to rolling your own; useful for regulated industries that want managed compliance and operations. ## Key takeaways - Bedrock deployment is the same Claude API surface as claude-api-foundations plus a different auth, API method, and message shape; about 75 of 83 lessons mirror Course 6 verbatim. (`tool-calling`) - Authentication is AWS IAM (access keys, IAM roles, or STS short-lived credentials), not an API key; production uses an IAM role with bedrock:InvokeModel and the policy is the security boundary. (`system-prompts`) - Bedrock uses boto3's bedrock-runtime client and the converse API; messages and system prompts are lists of content blocks even for plain text, which makes the API more verbose than the direct Anthropic SDK. (`tool-calling`) - Cross-region inference profiles are the AWS-specific solution to regional model availability; pass an inference profile id as modelId and AWS routes to a region where the model exists. No analogue on Vertex or the direct API. (`claude-for-operations`) - Bedrock-only managed services (Guardrails, Knowledge Bases, Agents) are alternatives to rolling your own; useful for regulated industries needing managed compliance, but the underlying primitives the exam tests are still tool use, RAG, and agents. (`evaluation`) - Application code is roughly 95% portable across direct API, Bedrock, and Vertex; choosing a deployment platform is mostly an organizational decision (where your AWS or GCP landing zone lives), not a technical one. 
(`tool-calling`) ## Concepts in play - **Tool calling** (`tool-calling`), Same protocol on Bedrock; tool schemas wrap into toolConfig / toolUse / toolResult blocks - **Prompt caching** (`prompt-caching`), Available on Bedrock with same TTL/breakpoint rules as direct API - **Batch API** (`batch-api`), Exposed on Bedrock as the separate Batch Inference job runner rather than inline calls - **MCP** (`mcp`), Protocol-level, deployment-independent; works with Bedrock-backed Claude - **Vision and multimodal** (`vision-multimodal`), Image and PDF inputs work via converse content blocks (base64 or S3 references) - **Evaluation** (`evaluation`), Same eval workflow; AWS-specific tooling like CloudWatch metrics and Bedrock model evaluation jobs are optional add-ons ## Scenarios in play - **Claude for operations** (`claude-for-operations`), Bedrock's IAM, CloudTrail, VPC endpoints, and KMS controls are the operational substrate this scenario depends on for AWS-native enterprise rollouts - **Structured data extraction** (`structured-data-extraction`), Common Bedrock workload: route extraction jobs through Lambda or Step Functions backed by Claude on Bedrock, often with Knowledge Bases for the document corpus ## Curated sources - **Claude on Amazon Bedrock; Anthropic API documentation** (anthropic-blog, 2025-09-01): Canonical reference for the SDK surface, model id formats, inference profile usage, and feature-availability table on Bedrock. Pair with Skilljar Lesson 4 when you start authenticating against a real AWS account. - **Use Anthropic models in Amazon Bedrock; AWS documentation** (anthropic-blog, 2025-10-15): AWS's own integration docs covering model enablement, IAM permissions, cross-region inference profiles, and the converse API parameter reference for Anthropic models on Bedrock. - **Building effective agents** (anthropic-blog, 2024-12-19): Anthropic's primitives-first take on agents; useful when deciding whether to use Bedrock Agents (managed) or roll your own with the raw tool-use primitives the course teaches. ## FAQ ### Q1. What is the difference between using Claude through the Anthropic API and through Amazon Bedrock? The application code is roughly 95% identical; the differences are at the deployment seam. Bedrock uses AWS IAM instead of an ANTHROPIC_API_KEY, calls boto3.client('bedrock-runtime').converse(...) instead of Anthropic().messages.create(...), wraps content in lists of blocks even for plain text, uses inference profiles for cross-region availability, and gives you AWS-managed extras (Guardrails, Knowledge Bases, Agents). Choose Bedrock when your stack is on AWS; choose direct API for simpler auth and fastest access to new features. ### Q2. How do I authenticate with Claude on Amazon Bedrock? Use AWS standard credentials. In dev, run aws configure with an access key + secret. In production, attach an IAM role to your compute (EC2 instance profile, ECS task role, Lambda execution role); boto3 picks up the role automatically. Required permission is bedrock:InvokeModel (and bedrock:InvokeModelWithResponseStream for streaming) on the specific model ARN. There is no API key; auth is bound to an AWS principal and IAM policy is your security boundary. ### Q3. Why does my Bedrock request return a model-not-found error when the model exists? Almost always a regional availability mismatch. Not every Claude model is hosted in every AWS region; for example, Claude Sonnet might be in us-west-2 while you are calling from us-east-1. 
The fix is to use a cross-region inference profile rather than the raw model id. Look under Cross-region inference in the Bedrock console (not the main model catalog), copy the profile id, and pass it as modelId. AWS will route the request to a region where the model is available. ### Q4. What is a Bedrock inference profile and when should I use one? An inference profile is an AWS-Bedrock-specific abstraction that bundles a model with one or more regions where it can be served. Use one whenever you do not want to manually track model-region availability; which in practice means almost always in production. Pass the profile id as modelId and AWS automatically routes your request to a region with capacity. There is no analogue on Vertex AI or the direct Anthropic API; this is a Bedrock-only concept and a frequent exam topic for D5 deployment questions. ### Q5. Should I use Bedrock Knowledge Bases or build my own RAG pipeline on Bedrock? Knowledge Bases is AWS's managed RAG service: it handles S3 ingestion, chunking, embedding (Titan or Cohere), vector storage (OpenSearch Serverless or Aurora), and retrieval. Use it when you want managed operations and your corpus lives in S3. Build your own when you need custom chunking strategies, contextual retrieval, hybrid BM25+embeddings with reranking, or non-S3 sources. The Skilljar course teaches the underlying primitives so you can do either; Knowledge Bases is the managed shortcut, your own pipeline is the precision option. ### Q6. Does prompt caching work on Claude through Amazon Bedrock? Yes. Prompt caching, vision, PDF support, citations, extended thinking, and tool use all work on Bedrock with the same TTLs, breakpoint rules, and pricing multipliers as the direct API. The message-format protocol is identical (just wrapped in content-block lists). The features that occasionally lag are at the tool layer; built-in web search and computer use have shipped with deployment-specific availability gates. The batch API is exposed as Bedrock's separate Batch Inference jobs rather than inline messages.batches. ### Q7. How do I install and use the boto3 SDK for Bedrock with Claude? Run pip install boto3 (which most AWS Python projects already have). Then: client = boto3.client('bedrock-runtime', region_name='us-west-2'). Call client.converse(modelId=..., messages=...) to send a request. Extract the response with response['output']['message']['content'][0]['text']. For streaming, use client.converse_stream(...) and iterate the EventStream. There is also an AnthropicBedrock SDK from Anthropic that wraps boto3 with a more familiar Anthropic-style API surface; pick whichever feels more natural to your codebase. ### Q8. Can I use HIPAA, FedRAMP, or IRAP-compliant Claude through Bedrock? Yes; the compliance posture is AWS's, and Claude on Bedrock inherits it. SOC 1/2/3, ISO 27001, HIPAA-eligible (with a BAA), FedRAMP High in GovCloud regions, and IRAP for Australia all extend to Anthropic models served through Bedrock. The compliance story is Bedrock's, not Anthropic's directly, which is one of the main reasons regulated industries pick Bedrock over the direct API. Confirm the specific certifications in the AWS compliance program for your target region before going to production. 
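To make Q3 and Q7 concrete, here is a minimal streaming sketch; it assumes AWS credentials from the standard chain and a cross-region inference profile id copied from the Bedrock console (the id shown is illustrative).

```python
# Minimal sketch: streaming with converse_stream and a cross-region inference profile.
# The profile id below is illustrative; copy the real one from the Bedrock console
# under Cross-region inference. Assumes standard AWS credentials are configured.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

stream = client.converse_stream(
    modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",  # inference profile id (illustrative)
    messages=[{"role": "user", "content": [{"text": "List three uses of CloudTrail."}]}],
    inferenceConfig={"maxTokens": 512},
)

# converse_stream returns an EventStream; text arrives in contentBlockDelta events.
for event in stream["stream"]:
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"].get("text", ""), end="", flush=True)
print()
```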
--- **Source:** https://claudearchitectcertification.com/knowledge/claude-with-bedrock **Vault sources:** Course_11/Lesson_03_accessing-the-api.md; Course_11/Lesson_04_making-a-request.md; Course_11/Lesson_05_multi-turn-conversations.md; Course_11/Lesson_07_system-prompts.md; Course_11/Lesson_10_streaming.md; Course_11/Lesson_30_introducing-tool-use.md; Course_11/Lesson_57_prompt-caching.md; Course_11/Lesson_61_introducing-mcp.md; Course_11/Lesson_73_agents-overview.md **Last reviewed:** 2026-05-06