# Prompt Caching

> Prompt caching reduces cost (~90%) on repeated context like long system prompts and tool definitions. Cache breakpoints, TTL, and cache_control field placement are exam patterns. Full content in SCRUM-21 follow-up.

**Domain:** D4 · Prompt Engineering (20% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/prompt-caching
**Last reviewed:** 2026-05-04

## Quick stats

- **Cost reduction:** ~90%
- **Exam domain:** D4
- **Coverage tier:** B
- **Trigger:** repeated context
- **Field:** cache_control

## What it is

Prompt caching is a cost optimization that lets you reuse expensive prompt tokens across multiple API calls. Mark a block with cache_control: {type: "ephemeral"} and the API caches the prompt prefix up to that point for 5 minutes. Every subsequent call within the TTL that starts with the exact same prefix pays ~90% less for those tokens. Under the hood, the cache stores the model's already-processed (KV) state for the prefix so it doesn't have to be recomputed.

The cached section persists across turns in agentic loops, so it's ideal for content that doesn't change: system prompts, tool definitions, large reference docs, static fact blocks. Good cache targets: (1) the system prompt, (2) tool definitions, (3) a static fact block, (4) a reusable context document. Poor targets: the growing message history and the latest user question, which change every turn.

The cache has a 5-minute TTL by default, extendable to 1 hour, and the window resets every time the cached prefix is reused. After 5 minutes with no access the entry expires and the next call pays full token cost again. This is intentional: caching is built for bursty workloads (a customer support loop running over a few minutes), not permanent storage.

Production failures cluster around two gaps: caching content that changes (marking the growing message list as cacheable) and under-scoping what's cacheable (not caching the system prompt when it offers the biggest savings). A 1000-token system prompt called 10 times costs 10,000 tokens fresh; with caching, roughly 1,900 (~80% savings).
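
The arithmetic behind that claim, as a minimal sketch (simplified: it counts only the fixed system-prompt tokens and ignores the small one-time cache-write premium):

```python
# 1000-token system prompt reused across 10 calls; cache reads bill at ~10%
# of the base input-token price.
system_tokens = 1_000
calls = 10

fresh = system_tokens * calls                                 # 10,000
cached = system_tokens + system_tokens * 0.10 * (calls - 1)   # 1,000 + 900 = 1,900

print(f"fresh={fresh}  cached={cached:.0f}  savings={1 - cached / fresh:.0%}")
# fresh=10000  cached=1900  savings=81%
```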

## How it works

Caching is enabled by adding cache_control: {type: "ephemeral"} to a content block; the system array, tool definitions, and message content blocks all accept it. A breakpoint caches the entire prompt prefix up to and including that block (the prefix is assembled as tools, then system, then messages). On the first call the prefix is processed and written to the cache; cache writes carry a small premium over base input tokens. On any call within the TTL that starts with an identical prefix, the model reuses the cached state and pays only the cache-read price (~10% of the base input rate).
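
A minimal sketch of how to confirm caching from the response itself: the usage block on each response reports cache_creation_input_tokens when a cache entry is written and cache_read_input_tokens when one is hit (the model name and policy text here are placeholders):

```python
from anthropic import Anthropic

client = Anthropic()

system = [{
    "type": "text",
    "text": "You are a refund-policy assistant. <long, immutable policy text here>",
    "cache_control": {"type": "ephemeral"},  # breakpoint: cache the prefix up to here
}]

for turn in range(2):
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        system=system,
        messages=[{"role": "user", "content": "Summarize the refund policy."}],
    )
    # Call 1: cache_creation_input_tokens > 0 (cache write).
    # Call 2 (within the TTL, identical prefix): cache_read_input_tokens > 0 (hit).
    # Note: prefixes below the model's minimum cacheable length (~1024 tokens
    # for most models) are not cached at all.
    print(turn, resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)
```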

The cache key is the exact content of the prefix. Send the system prompt with a typo on turn 1 and fix it on turn 2, and turn 2 is a cache miss: the content changed, so turn 2 writes a new cache entry and starts a new 5-minute window. This is why immutable content is crucial: a refund policy (never changes), tool definitions (stable), and customer facts (extracted once) are all good cache targets; the growing message list is not.
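
Because a hit requires a byte-identical prefix, anything nondeterministic in how you assemble it (timestamps, unsorted dicts, re-fetched text) silently turns hits into misses. A small sketch of keeping a fact block stable across turns; build_fact_block is a hypothetical helper, not part of the SDK:

```python
import json

def build_fact_block(facts: dict) -> str:
    # Hypothetical helper: render customer facts deterministically so the text
    # is byte-identical on every call (sorted keys, no timestamps, fixed layout).
    return "Customer facts:\n" + json.dumps(facts, sort_keys=True, indent=2)

facts = {"customer_id": "C-1042", "lifetime_refunds_usd": 180, "tier": "gold"}

turn_1 = build_fact_block(facts)
turn_2 = build_fact_block(facts)
assert turn_1 == turn_2  # identical bytes -> same cache key -> cache hit
```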

The TTL is 5 minutes by default, and each cache hit resets the window; after 5 minutes with no access the entry expires. A longer 1-hour TTL can be requested with cache_control: {type: "ephemeral", ttl: "1h"}. Loops that outlive the TTL re-cache automatically: the cache expires, and the next call writes a fresh entry.
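
A sketch of requesting the 1-hour tier, assuming extended TTL is available on your account (it rolled out as an opt-in, so check the current docs for any required beta header):

```python
REFERENCE_DOC = "<large, immutable reference document>"

system = [{
    "type": "text",
    "text": REFERENCE_DOC,
    # Default TTL is 5 minutes; "ttl": "1h" requests the 1-hour tier.
    # Either way, the window resets each time the cached prefix is reused.
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}]
```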

Caches are never shared across organizations. Within your organization, any request whose prefix is byte-identical to a cached one can reuse it, so separate conversations can share a cached system prompt but never each other's message histories (the histories differ, so the prefixes differ). A hit only changes price and latency; cached content is never surfaced to anyone who didn't already send it, which is why cache isolation doubles as a security property.

## Where you'll see it in production

### Customer support loop with cached system prompt

A 15-turn refund conversation. The system prompt (1000 tokens) is cached on turn 1; turns 2-15 reuse it, paying ~100 tokens each instead of 1000. That saves ~12,600 tokens of system-prompt cost over the conversation (~84% of the fixed prefix).

### Parallel subagents with shared tool definitions

A coordinator spawns 4 subagents to analyze 4 repos. Each sends the same 5-tool definition block (400 tokens). With caching, the first subagent call writes the cache (400 tokens) and the other three read it (~40 tokens each): roughly 1,080 tokens saved, ~68% off the tool-definition cost.

### Long-context loop with immutable fact block

An invoice extraction loop over 50 invoices. The system prompt and a 200-token customer-facts block are both cached; all 50 extractions reuse them, saving close to 90% of the fixed-prefix cost across the run.

### Batch-job polling with consistent system rules

Overnight processing of 1000 documents. The system prompt is cached once, and as long as successive calls land within 5 minutes of each other, every later call reuses it. When a batch finishes and the cache sits idle for 5 minutes, it expires; the next batch pays one fresh cache write and starts a new window. Net effect: ~90% savings on system-prompt overhead.
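
The scenario figures above all come from the same simple model; a hedged sketch of it as a helper (it ignores the one-time cache-write premium and counts only the fixed prefix, not per-call variable tokens):

```python
def prefix_cost(prefix_tokens: int, calls: int, read_fraction: float = 0.10) -> tuple[int, float]:
    """Return (fresh, cached) cost in token-equivalents for a fixed prompt prefix."""
    fresh = prefix_tokens * calls
    cached = prefix_tokens + prefix_tokens * read_fraction * (calls - 1)
    return fresh, cached

print(prefix_cost(1_000, 15))  # support loop:    (15000, 2400.0) -> ~84% off the prefix
print(prefix_cost(400, 4))     # 4 subagents:     (1600, 520.0)   -> ~68% off tool defs
print(prefix_cost(1_200, 50))  # 50-invoice loop, assuming a ~1,200-token fixed prefix:
                               #                  (60000, 7080.0) -> ~88% off the prefix
```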

## Code examples

### Cached system prompt + tool definitions

**Python:**

```python
from anthropic import Anthropic
client = Anthropic()

SYSTEM_PROMPT = """You are a customer support agent enforcing a $500 lifetime refund limit. Always verify customer first."""

TOOLS = [
    {"name": "verify_customer", "description": "Verify identity, retrieve refund history. Call first.", "input_schema": {...}},
    {"name": "lookup_order", "description": "Look up order by customer + order ID.", "input_schema": {...}},
    {"name": "process_refund", "description": "Process refund. Call only after verify + lookup.", "input_schema": {...}},
]

def run_support(customer_id: str, request: str):
    messages = [{"role": "user", "content": request}]
    for i in range(10):
        resp = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            # Cache the system prompt (90% savings on turns 2+)
            system=[{
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }],
            # Tools come before the system prompt in the cached prefix,
            # so the breakpoint above also covers the tool definitions
            tools=TOOLS,
            messages=messages,
        )
        if resp.stop_reason == "end_turn":
            return resp.content[0].text
        messages.append({"role": "assistant", "content": resp.content})
        # ... handle tool_use, append tool_result ...
    return "max_iterations"

# Turn 1: full cost (~1500 tokens for system + tools)
# Turns 2-10: cached (~150 tokens each = 90% savings)
```

> cache_control: ephemeral on system. Cache persists 5 minutes. Tools sit before the system prompt in the cached prefix, so the same breakpoint covers them.
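
If you'd rather give the tool block its own breakpoint instead of relying on the system-prompt breakpoint, you can mark the last tool explicitly; a sketch reusing client, SYSTEM_PROMPT, and TOOLS from the example above (up to 4 cache breakpoints are allowed per request):

```python
# A breakpoint caches everything up to and including the block it sits on,
# so marking the final tool covers all three tool definitions.
tools_with_breakpoint = [dict(t) for t in TOOLS]
tools_with_breakpoint[-1]["cache_control"] = {"type": "ephemeral"}

resp = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[{"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    tools=tools_with_breakpoint,
    messages=[{"role": "user", "content": "I want a refund for order #1042."}],
)
```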

**TypeScript:**

```typescript
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

const SYSTEM = "You are a customer support agent enforcing $500 lifetime refund limit. Always verify customer first.";
const TOOLS: Anthropic.Tool[] = [
  { name: "verify_customer", description: "Verify identity. Call first.", input_schema: { type: "object", properties: {} } },
  { name: "lookup_order", description: "Look up order.", input_schema: { type: "object", properties: {} } },
  { name: "process_refund", description: "Process refund.", input_schema: { type: "object", properties: {} } },
];

async function runSupport(customerId: string, request: string) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: request }];
  for (let i = 0; i < 10; i++) {
    const resp = await client.messages.create({
      model: "claude-opus-4-5",
      max_tokens: 1024,
      system: [{
        type: "text",
        text: SYSTEM,
        cache_control: { type: "ephemeral" },  // 5-min TTL, 90% savings
      }] as any,
      tools: TOOLS,
      messages,
    });
    if (resp.stop_reason === "end_turn") {
      return resp.content[0].type === "text" ? resp.content[0].text : "";
    }
    messages.push({ role: "assistant", content: resp.content });
    // ... tool_use handling ...
  }
  return "max_iterations";
}
```

> TypeScript pattern: array form for system, cache_control annotation. Cache reused across turns within 5 minutes.

## Looks-right vs actually-wrong

| Looks right | Actually wrong |
|---|---|
| Cache the entire message history to speed up long conversations. | The message list grows every turn (new user + assistant messages). Caching requires immutable content. Cache the system prompt and tool definitions, not the messages. |
| Use ephemeral caching for frequently-updated content like daily news. | Ephemeral caching is for content that doesn't change within 5 minutes. Daily-updated content causes cache misses on every change. Cache only immutable content. |
| Enable caching on the longest document in the prompt. | Enable on immutable content, not just long. A 10,000-token policy doc that never changes is perfect. A 500-token fact block updated every turn is not. Cache what's constant. |
| Caching is global; once enabled, applies to all conversations. | Caching is scoped to your organization and keyed on the exact prompt prefix; it is never shared across organizations, and a hit requires sending an identical prefix. Cache isolation is a security feature. |
| After 5 minutes the cache extends automatically. | After 5 minutes of no access, the cache expires completely. Next call re-reads at full cost, then starts a new 5-minute window. Plan for cache expiry in long-running loops. |

## Comparison

| Aspect | Cached (5 min) | Cached (1 hr) | Not cached | Batch API (50%) |
| --- | --- | --- | --- | --- |
| Content type | Immutable system, tools | Recurring reference docs | Growing message list | Non-urgent bulk |
| TTL | 5 min default | 1 hr configurable | None | Up to 24 hr |
| Savings | 90% per reuse | 90% per reuse | 0% | 50% flat |
| Use when | Repeated calls same prompt | Recurring queries | First call or content changes | Can wait for results |
| Cost first call | ~125% (cache-write premium) | ~200% (cache-write premium) | 100% | 50% |
| Cost 2nd-10th call | 10% each | 10% each | 100% each | 50% each |

## Decision tree

1. **Running an agentic loop (5+ turns) with the same system prompt?**
   - **Yes:** Cache the system prompt. Save ~90% on system-prompt tokens.
   - **No:** For a one-off call, the savings are marginal.

2. **Is the content constant within 5 minutes?**
   - **Yes:** Cache it. Immutable content is the cache's best use.
   - **No:** Don't cache. Changing content causes misses.

3. **Large reference doc reused across calls?**
   - **Yes:** Cache it. A 1000-token doc reused across 10 calls saves ~8,100 tokens.
   - **No:** Caching not applicable.

4. **Loop longer than 5 minutes?**
   - **Yes:** Cache will expire mid-loop. Plan for next call to re-cache. Cheap (one re-cache cost).
   - **No:** Cache persists for entire loop.

5. **Caching or Batch API?**
   - **Need results now (interactive loop):** Use caching. ~90% off reused context with no added latency.
   - **Can wait up to 24 hours:** Use the Batch API for 50% off all tokens.

## Exam-pattern questions

### Q1. Cache the entire message history to speed up long conversations: works?

No. The message list grows every turn (new user + assistant messages). Caching requires immutable content. Cache the system prompt and tool definitions, not the messages.

### Q2. Daily news content marked cache_control: ephemeral. Why is this wrong?

Ephemeral caching is for content that doesn't change within 5 minutes. Daily content causes cache misses on every change. Cache only immutable content; daily-update content gets no benefit.

### Q3. Cached content from one API key leaks to another. Security issue?

No. Caches are never shared across organizations, and a hit never returns content to a caller who didn't already send that exact prefix; it only changes price and latency. Cache isolation is a security feature.

### Q4. After 5 minutes the cache automatically extends?

No. After 5 minutes of no access, the cache expires completely. The next call re-reads at full token cost. Plan for cache expiry in long-running loops.

### Q5. How much does caching save on a 1000-token system prompt called 10 times?

~80%. First call: 1000 tokens (full, plus a small cache-write premium). Calls 2-10: ~100 each (10% of the base input price). Total: roughly 1000 + 900 = 1,900 tokens vs 10,000 fresh.

### Q6. Caching vs the Batch API: same purpose?

Different. Caching: 90% on reused content within a 5-min window (instant). Batch API: 50% on all tokens (24-hour wait). Caching for interactive loops; Batch for async bulk.

### Q7. Tool definitions: cacheable like the system prompt?

Yes. Tool definitions are stable across turns. Mark the last tool with cache_control: ephemeral, or place the breakpoint on the system prompt, which comes after tools in the prompt prefix and therefore covers them. Combined with a cached system prompt, this saves ~90% on the fixed prefix per turn.

### Q8. Caching with subagents: each subagent gets its own cache?

Not if they share the same prefix. Subagents that send identical tool definitions and system prompts reuse one cache entry: the first call writes it and the rest read it, which is exactly where the parallel-subagent savings come from. Content that differs per subagent (its task prompt, its findings) is never shared.

## FAQ

### Q1. How much does caching save?

~90% on cached content per reuse. 1000-token system prompt × 10 calls = ~1900 tokens instead of 10,000 (~81% savings).

### Q2. How long does the cache persist?

5 minutes by default, and the window resets on each cache hit. Extendable to 1 hour with cache_control: {type: "ephemeral", ttl: "1h"}.

### Q3. What if I change the cached content mid-loop?

Cache miss. New content cached as a different key, starting a new 5-minute window. If you change content frequently, caching doesn't help.

### Q4. Can I cache the entire prompt?

No. System prompt and static docs cacheable. Message list grows every turn. Cache constants; message list is fresh.

### Q5. Cheaper than the Batch API?

Complementary. Caching saves 90% on reused content. Batch saves 50% on all tokens but waits 24 hours. Caching for interactive, Batch for async.

### Q6. Manually manage cache expiry?

No. After 5 min, cache automatically expires. Next call re-reads at full cost, then starts new window. Automatic.

### Q7. Cache different sections with different TTLs?

Yes, within limits. Breakpoints default to the 5-minute TTL; where the 1-hour TTL is available, 1-hour and 5-minute breakpoints can be mixed in a single request, with the longer-TTL breakpoints placed before the shorter ones.

### Q8. Caching with tool_use and multi-turn loops?

Yes. System prompt and tool definitions cached. Growing message list always fresh. Cache constants, let list grow.

### Q9. Trade-off between caching and Batch API?

Caching: instant, 90% on reused, 5-min window. Batch: wait 24h, 50% on all tokens. Use caching for interactive; Batch for async.

### Q10. Caching with subagents?

Yes. Subagents that send an identical prefix (same system prompt and tool definitions) reuse the same cache entry: the first call writes it and the rest read it. Anything that differs per subagent stays uncached and unshared.

---

**Source:** https://claudearchitectcertification.com/concepts/prompt-caching
**Vault sources:** ACP-T03 §5 prompt caching; ACP-T04 prompt-caching-optimization; ASC-A01 Course 6
**Last reviewed:** 2026-05-04

**Evidence tiers** — 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.
