Prompt Caching (D4, 20% of CCA-F) - Claude Architect Concept

01 · Summary

TLDR

Prompt caching cuts the cost of repeated context, such as long system prompts and tool definitions, by about 90% on each cache hit. Cache breakpoints, TTL, and cache_control field placement are the recurring exam patterns. Source: Anthropic caching API.

Cost reduction: ~90%

Exam domain: D4

Coverage tier: B

Trigger: repeated context

Field: cache_control

02 · Definition

What it is

Prompt caching is a cost optimization that lets you reuse expensive prompt tokens across multiple API calls. Mark a section with cache_control: {type: "ephemeral"}, and Claude caches the prompt up to and including that section for 5 minutes. Every subsequent API call within the TTL that begins with the same content pays ~90% less for those tokens. The cache preserves the model's KV state for that prefix, so it does not have to be recomputed.

The cached section persists across turns in agentic loops, so it's ideal for content that doesn't change: system prompts, tool definitions, large reference docs, fact blocks. What can be cached: (1) system prompt, (2) tool definitions, (3) static fact block, (4) reusable context document. What cannot be served from cache: anything that differs from the previous call, such as the newest turns of a growing message history or the latest user question.

The cache has a 5-minute TTL by default, extendable to 1 hour. Each cache hit restarts the window; after 5 minutes with no access, the cache expires and the next call is processed at full token cost. This is intentional: caching is designed for bursty workloads (a customer support loop running for 5 minutes), not permanent storage.

Production failures cluster around two gaps: caching content that changes (treating the growing message list as if it were static) and underscoping what's cacheable (not caching the system prompt when it offers the biggest savings). A 1000-token system prompt sent 10 times costs 10,000 tokens fresh; with caching, it costs about 1000 + 9 × 100 = 1900 token-equivalents (81% savings), before any surcharge for the initial cache write.
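The savings arithmetic can be checked with a short sketch. It assumes the simplified billing model used in this guide (first call billed in full, each later call billed at about 10% of the block's size); the function names are illustrative, not part of any SDK.

```python
def cached_cost(prompt_tokens: int, calls: int, read_rate: float = 0.10) -> float:
    """Token-equivalents billed for a static block sent `calls` times with caching.

    Assumption: the first call is billed in full and every later call is a
    cache hit billed at `read_rate` of the block's size.
    """
    if calls <= 0:
        return 0.0
    return prompt_tokens + (calls - 1) * prompt_tokens * read_rate


def savings_fraction(prompt_tokens: int, calls: int, read_rate: float = 0.10) -> float:
    """Fraction saved versus sending the block uncached on every call."""
    fresh = prompt_tokens * calls
    if fresh <= 0:
        return 0.0
    return 1.0 - cached_cost(prompt_tokens, calls, read_rate) / fresh


if __name__ == "__main__":
    print(cached_cost(1000, 10))                 # about 1900 token-equivalents
    print(round(savings_fraction(1000, 10), 2))  # 0.81
```

The per-reuse discount is ~90%, but the overall saving is lower for short loops because the first call is always paid in full.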

03 · Mechanics

How it works

Caching is enabled by adding cache_control: {type: "ephemeral"} to a content block. The tools, system, and messages arrays all support it, and a breakpoint covers everything in the prompt up to and including the marked block. On the first call, Claude processes the prompt and writes the cache. On the second call (within 5 min) with identical content, Claude skips reprocessing and uses the cached KV state, paying only the cache-read cost (~10% of the original).

The cache key is the exact content. Send the system prompt with a typo on turn 1 and fix it on turn 2, and you get a cache miss: the content changed, so turn 2 caches a new version and starts a new 5-minute window. This is why immutable content is crucial. A refund policy (never changes), tool definitions (stable), and customer facts (extracted once) are cacheable. The newest turns of the growing message list are not.
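The exact-match behavior can be modeled locally. The sketch below is a toy, not the service's real keying scheme (which is internal): `prefix_key` and `ToyPromptCache` are hypothetical names, and the sliding 5-minute window is an assumption based on the "no access" wording above.

```python
import hashlib
import json


def prefix_key(system_text: str, tools: list) -> str:
    """Hash the exact static prefix; a one-character edit yields a new key."""
    payload = json.dumps({"tools": tools, "system": system_text}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyPromptCache:
    """Local model of hit/miss behavior with a sliding TTL (not the real service)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl_seconds = ttl_seconds
        self._last_used = {}  # key -> time of the last write or hit

    def lookup(self, key: str, now: float) -> str:
        last = self._last_used.get(key)
        hit = last is not None and (now - last) <= self.ttl_seconds
        # A hit refreshes the window; a miss writes a fresh entry.
        self._last_used[key] = now
        return "hit" if hit else "miss"


if __name__ == "__main__":
    cache = ToyPromptCache()
    with_typo = prefix_key("Refund limt is $500.", [])
    fixed = prefix_key("Refund limit is $500.", [])
    print(cache.lookup(with_typo, now=0))   # miss: first write
    print(cache.lookup(fixed, now=10))      # miss: content changed
    print(cache.lookup(fixed, now=200))     # hit: same content, inside the window
```

Hashing the whole prefix is only one way to illustrate the idea; the point is that any byte-level change produces a different entry.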

The TTL is 5 minutes by default. Each hit within the window keeps the cache alive; after 5 minutes with no access, it expires. To extend the window to 1 hour, set the ttl field: cache_control: {type: "ephemeral", ttl: "1h"}. max_tokens only caps output length and has no effect on cache duration. In longer loops, re-caching is automatic: once the cache expires, the next call writes it again.
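A minimal sketch of building a cacheable system block with either window, assuming the `ttl` field on `cache_control` with the values "5m" and "1h" as described in Anthropic's prompt caching documentation. The helper name `cached_text_block` is hypothetical.

```python
from typing import Literal


def cached_text_block(text: str, ttl: Literal["5m", "1h"] = "5m") -> dict:
    """Return a text block marked as a cache breakpoint.

    The 5-minute window is the default, so `ttl` is only written for "1h".
    """
    if ttl not in ("5m", "1h"):
        raise ValueError("ttl must be '5m' or '1h'")
    block = {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}
    if ttl == "1h":
        block["cache_control"]["ttl"] = "1h"
    return block


if __name__ == "__main__":
    # Pass the result as the `system` argument of a messages request.
    system = [cached_text_block("You are a customer support agent.", ttl="1h")]
    print(system)
```

The longer window suits workloads whose calls are spaced more than 5 minutes apart; bursty loops are usually fine with the default.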

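Caches are scoped to the account that created them (organization or workspace level in Anthropic's documentation), not shared globally. Two accounts sending the same system prompt each get their own cache and never read each other's. Within one account, any request that starts with the same cached prefix can reuse it, so separate conversations can share a cached system prompt, but they cannot share message history because their messages differ. Isolation is a security feature.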

04 · In production

Where you'll see it

Customer support loop with cached system prompt

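A 15-turn refund conversation. The system prompt (1000 tokens) is cached on turn 1. Turns 2-15 reuse it, paying ~100 tokens each instead of 1000. Total savings: 14 × 900 = 12,600 tokens on the system prompt. The percentage saved on the whole conversation depends on how large the uncached message history grows.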

Parallel subagents with shared tool definitions

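A coordinator spawns 4 subagents to analyze 4 repos. Each gets the same 5-tool definition (400 tokens). With caching, the first subagent writes the cache (400 tokens) and subagents 2-4 read it (~40 each), for 520 tokens instead of 1600: 1080 tokens saved, about 68% off tool-definition cost. The reuse only happens if the later requests start after the first one has populated the cache.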
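One way to make the shared-prefix reuse more likely is to let one subagent finish before fanning out the rest. The sketch below assumes that a cache entry only becomes readable once the first request has been processed, so truly simultaneous requests may each pay full price. `call_model` is a hypothetical coroutine standing in for whatever sends a request with the shared, cache-marked tool definitions.

```python
import asyncio
from typing import Awaitable, Callable, List


async def warm_then_fan_out(
    call_model: Callable[[str], Awaitable[str]],
    tasks: List[str],
) -> List[str]:
    """Run the first task alone so it can write the cache, then run the rest concurrently."""
    if not tasks:
        return []
    first = await call_model(tasks[0])  # expected cache write
    rest = await asyncio.gather(*(call_model(t) for t in tasks[1:]))  # expected cache reads
    return [first, *rest]


if __name__ == "__main__":
    async def fake_call(repo: str) -> str:
        # Stand-in for a real API call; it only echoes the task name.
        await asyncio.sleep(0)
        return f"analysis of {repo}"

    print(asyncio.run(warm_then_fan_out(fake_call, ["repo-a", "repo-b", "repo-c", "repo-d"])))
```

The trade-off is latency: the first call runs serially, so this pattern pays off only when the shared prefix is large enough for the savings to matter.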

Long-context loop with immutable fact block

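An invoice extraction loop over 50 invoices. The system prompt and a 200-token customer-facts block are both cached, and all 50 extractions reuse both. Assuming the 1000-token system prompt from the earlier examples, the fixed content costs 60,000 tokens uncached versus 1200 + 49 × 120 = 7080 cached: about 52,900 tokens saved, roughly 88% on fixed content.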

Batch-job polling with consistent system rules

An overnight batch processing 1000 documents. The system prompt is cached once and the remaining calls reuse it. The cache expires after 5 min of no access (for example, at the end of the first batch), and the next batch starts a fresh window. Roughly 90% savings on system-prompt overhead.

05 · Implementation

Code examples

Cached system prompt + tool definitions

from anthropic import Anthropic
client = Anthropic()

SYSTEM_PROMPT = """You are a customer support agent enforcing a $500 lifetime refund limit. Always verify customer first."""

TOOLS = [
    {"name": "verify_customer", "description": "Verify identity, retrieve refund history. Call first.", "input_schema": {...}},
    {"name": "lookup_order", "description": "Look up order by customer + order ID.", "input_schema": {...}},
    {"name": "process_refund", "description": "Process refund. Call only after verify + lookup.", "input_schema": {...}},
]

def run_support(customer_id: str, request: str):
    messages = [{"role": "user", "content": request}]
    for i in range(10):
        resp = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            # Cache the system prompt (90% savings on turns 2+)
            system=[{
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }],
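            # Tools precede the system prompt in the cached prefix, so the
            # breakpoint on the system block covers them too (the SDK does not
            # cache tools on its own)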
            tools=TOOLS,
            messages=messages,
        )
        if resp.stop_reason == "end_turn":
            return resp.content[0].text
        messages.append({"role": "assistant", "content": resp.content})
        # ... handle tool_use, append tool_result ...
    return "max_iterations"

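# Turn 1: full cost (~1500 tokens for system + tools, assuming the real prompt and schemas are that long)
# Turns 2-10: cache reads (~150 tokens each = 90% savings)

cache_control: ephemeral on system. The cache persists for 5 minutes after the last hit. Tools are covered because they precede the system prompt in the cached prefix, not because the SDK caches them separately. Very short prefixes may fall below the model's minimum cacheable length and be processed uncached.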
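To confirm that a loop like this is hitting the cache, the usage counters on each response can be inspected. This is a sketch that assumes the `usage` object exposes `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`; the helper `cache_report` is a hypothetical name, and the stand-in object in the demo replaces a real response.

```python
from types import SimpleNamespace


def cache_report(usage) -> dict:
    """Summarize how many input tokens were written to, read from, or bypassed the cache."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = written + read + uncached
    return {
        "written": written,
        "read": read,
        "uncached": uncached,
        "hit_ratio": read / total if total else 0.0,
    }


if __name__ == "__main__":
    # Stand-in for `resp.usage` on a later turn of the loop.
    fake_usage = SimpleNamespace(
        input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=1450
    )
    print(cache_report(fake_usage))
```

A first turn should show tokens under "written" and none under "read"; later turns within the window should show the reverse. If "read" stays at zero, the prefix is probably changing between calls.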

06 · Distractor patterns

Looks right, isn't

Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.

Looks right

Cache the entire message history to speed up long conversations.

Actually wrong

The message list grows every turn (new user + assistant messages), so the newest content is never in the cache. Caching pays off on the stable prefix. Cache the system prompt and tool definitions first, and expect the latest turns to be processed at full cost.

Looks right

Use ephemeral caching for frequently-updated content like daily news.

Actually wrong

Ephemeral caching is for content that doesn't change within 5 minutes. Daily-updated content causes cache misses on every change. Cache only immutable content.

Looks right

Enable caching on the longest document in the prompt.

Actually wrong

Enable on immutable content, not just long. A 10,000-token policy doc that never changes is perfect. A 500-token fact block updated every turn is not. Cache what's constant.

Looks right

Caching is global; once enabled, applies to all conversations.

Actually wrong

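Caches are isolated per account (organization or workspace), and a hit requires an identical prefix. Conversations with different prompts never share a cache, and other customers can never read yours. Cache isolation is a security feature.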

Looks right

After 5 minutes the cache extends automatically.

Actually wrong

After 5 minutes of no access, the cache expires completely. Next call re-reads at full cost, then starts a new 5-minute window. Plan for cache expiry in long-running loops.

07 · Compare

Side-by-side

Aspect	Cached (5 min)	Cached (1 hr)	Not cached	Batch API (50%)
Content type	Immutable system, tools	Recurring reference docs	Growing message list	Non-urgent bulk
TTL	5 min default	1 hr (set via ttl)	None	N/A (results within 24 hr)
Savings	90% per reuse	90% per reuse	0%	50% flat
Use when	Repeated calls, same prompt	Recurring queries	First call or content changes	Can wait for results
Cost of first call	~125% (cache write)	~200% (cache write)	100%	50%
Cost of 2nd-10th call	10% each	10% each	100% each	50% each
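The comparison can be turned into a break-even check. The sketch below assumes cache writes carry a premium over the base input price (1.25× for the 5-minute window and 2× for the 1-hour window are the multipliers assumed here) and that reads cost 0.1×. The function name is illustrative.

```python
def min_calls_to_break_even(
    write_multiplier: float,
    read_multiplier: float = 0.10,
    max_calls: int = 1000,
) -> int:
    """Smallest number of calls at which caching is cheaper than sending the block uncached.

    Costs are in units of one uncached send of the block.
    """
    for n in range(1, max_calls + 1):
        cached = write_multiplier + (n - 1) * read_multiplier
        if cached < n:
            return n
    raise ValueError("caching never breaks even within max_calls")


if __name__ == "__main__":
    print(min_calls_to_break_even(1.25))  # 2: the 5-minute cache pays off on the second call
    print(min_calls_to_break_even(2.0))   # 3: the 1-hour cache needs a third call
```

Under these assumptions a single-call workload never benefits from caching, which matches the guidance that single-turn calls gain little.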

08 · When to use

Decision tree

01

Running an agentic loop (5+ turns) with the same system prompt?

Yes: Cache the system prompt. Save ~90% on system-prompt tokens.

No: Single-turn call, so caching is marginal.

02

Is the content constant within 5 minutes?

Yes: Cache it. Immutable content is the cache's best use.

No: Don't cache. Changing content causes misses.

03

Large reference doc reused across calls?

Yes: Cache it. A 1000-token doc × 10 calls saves ~8100 tokens (1900 instead of 10,000).

No: Caching is not applicable.

04

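Gaps longer than 5 minutes between calls?

Yes: The cache will expire mid-loop. Plan for the next call to re-cache, which is cheap (one re-cache cost), or use the 1-hour TTL.

No: Each hit refreshes the window, so the cache persists for the entire loop.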

05

Caching or Batch API?

Caching: Results are needed interactively or right away. Savings apply immediately.

Batch API: You can wait up to 24 hours and want 50% off all tokens.
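The decision tree can be condensed into a small helper. This is one possible encoding, not an official rule set: the function name, parameters, and returned labels are all hypothetical, and the thresholds simply mirror the 5-minute and 1-hour windows discussed above.

```python
FIVE_MINUTES = 300
ONE_HOUR = 3600


def choose_strategy(
    content_is_static: bool,
    expected_calls: int,
    max_gap_seconds: float,
    can_wait_for_results: bool,
) -> str:
    """Return 'cache-5m', 'cache-1h', 'batch', or 'no-cache' for a workload."""
    reusable = content_is_static and expected_calls >= 2
    if not reusable:
        # Nothing to reuse: only the Batch discount can help, and only if latency allows.
        return "batch" if can_wait_for_results else "no-cache"
    if max_gap_seconds <= FIVE_MINUTES:
        return "cache-5m"
    if max_gap_seconds <= ONE_HOUR:
        return "cache-1h"
    # Calls are too far apart for either window to survive between them.
    return "batch" if can_wait_for_results else "no-cache"


if __name__ == "__main__":
    print(choose_strategy(True, 15, 20, False))     # cache-5m: interactive support loop
    print(choose_strategy(True, 50, 1200, False))   # cache-1h: calls every 20 minutes
    print(choose_strategy(False, 1000, 5, True))    # batch: nothing static, can wait
```

Caching and the Batch API are complementary, so a fuller version could return both; the sketch keeps a single label for readability.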

09 · On the exam

Question patterns

32 V2 questions are wired to this concept. Six are listed below.

Your project root CLAUDE.md is 1,500 lines and Claude's responses are getting slower. Why, and what is the fix?

You wrote a 2,000-line SKILL.md and Claude responses are slow when the skill is active. What is the better structure?

On a long task should you summarize the conversation at each checkpoint to save tokens?

Does restoring a checkpoint reset the cost of the conversation so far?

What happens if you change the system prompt mid-conversation rather than between conversations?

When is caching the system prompt not worth the complexity?

26 additional questions for this concept live in the practice pillar.

10 · FAQ

Frequently asked

How much does caching save?

~90% on cached content per reuse. 1000-token system prompt × 10 calls = ~1900 tokens instead of 10,000 (~81% savings).

How long does the cache persist?

5 minutes by default. Extendable to 1 hour by setting ttl: "1h" in cache_control.

What if I change the cached content mid-loop?

Cache miss. New content cached as a different key, starting a new 5-minute window. If you change content frequently, caching doesn't help.

Can I cache the entire prompt?

Not usefully. The system prompt, tool definitions, and static docs are cacheable. The message list grows every turn, so its newest content is always processed fresh. Cache the constants.

Cheaper than the Batch API?

They are complementary. Caching saves 90% on reused content. Batch saves 50% on all tokens, but results can take up to 24 hours. Use caching for interactive work and Batch for async.

Manually manage cache expiry?

No. After 5 min without a hit, the cache expires automatically. The next call is processed at full cost and starts a new window.

Cache different sections with different TTLs?

Yes, within limits. Each breakpoint can carry its own ttl (5 minutes by default, or 1 hour), and blocks with the longer TTL must come before blocks with the shorter one.

Caching with tool_use and multi-turn loops?

Yes. The system prompt and tool definitions are cached; the newest turns of the growing message list are processed fresh. Cache the constants and let the list grow.

Trade-off between caching and Batch API?

Caching: instant, 90% off reused content, 5-min window. Batch: wait up to 24h, 50% off all tokens. Use caching for interactive work; Batch for async.

Caching with subagents?

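Yes. Each subagent marks its own system prompt and tools for caching. Subagents with different prompts get separate cache entries, so subagent A's cache doesn't affect subagent B. Subagents that send an identical prefix can reuse the same entry.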

11 · Practice with AI

Work this with your AI

Work this concept hands-on with Claude Code, Codex, or claude.ai. Copy a prompt, paste it into your assistant, and practise in tandem. Each one keeps you active (explain it back, get drilled, or build) rather than just reading.

Drill it like the exam (scenario MCQs): Practice in the exam's scenario-MCQ format with trap awareness.
Explain it back (Feynman): Build durable, transferable understanding of a concept you can half-state.
Test me, adapting the difficulty: Active recall practice on a concept you think you know.
Check my prerequisites first: Before studying a concept that keeps not sticking.
Find the high-leverage 20%: When a domain feels too big and you are short on time.
