# Context Window Management

> Context window management is how you keep long conversations within the token limit without dropping critical facts. Patterns: case-facts blocks, progressive summarization (windowing), and prompt caching.

**Domain:** D5 · Context + Reliability (15% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/context-window
**Last reviewed:** 2026-05-04

## Quick stats

- **Patterns:** 3
- **Exam domain:** D5
- **Coverage tier:** B
- **Trap:** summary loss
- **Right answer:** facts block

## What it is

The context window is the maximum token budget Claude can process in a single request. Claude Sonnet and Opus have a 200K-token window, equivalent to roughly 500 pages of text or about an hour of conversation. It is a hard architectural boundary (fixed by the model's positional embeddings): it cannot be exceeded, and over-limit requests fail with context_length_exceeded. Managing the window is core production architecture: inefficient windowing wastes budget; strategic windowing unlocks large-scale automation.

The 200K window applies to total input tokens: system prompt, all messages, and all tool definitions. Example: system (500) + prior conversation (5,000) + current request (1,000) + tool definitions (2,000) = 8,500 tokens used, 191.5K remaining. If you're building an agent loop and messages grow unchecked, you can hit the limit at turn 12. The window doesn't reset between turns; it's a shared budget for the entire sequence.

Three production patterns manage the window: prompt caching (roughly 90% off the repeated prefix), context windowing (summarize old turns into a FACTS block and drop verbose history), and streaming output (doesn't reduce tokens, but improves UX during long waits). A fourth, underutilized pattern is case-facts immutability: the CASE_FACTS block is rewritten, not appended, at every turn.

The most-tested anti-pattern is windowing too late. If your loop runs for 12 turns and you window at turn 11, you've already paid for 10 turns of accumulated context. Optimal: window at turn 5 if the conversation is verbose, or preemptively at turn 3 if each turn is expensive. Also tested: context starvation in subagents. A coordinator passes 100K tokens of history to a subagent; the subagent wastes half its 200K window on irrelevant context. Mitigation: subagents receive only what they need, in a compact TASK_CONTEXT block.

## How it works

Token counting is deterministic: the SDK provides a count_tokens() endpoint that returns the exact input token count before you execute. Always call it on large requests. Track usage at every turn; when fewer than 20K tokens remain, window immediately.
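
A minimal pre-check sketch following those numbers (should_window is a hypothetical helper; the 200K/20K figures are the ones used in this section):

```python
from anthropic import Anthropic

client = Anthropic()

def should_window(system_prompt: str, tools: list, messages: list) -> bool:
    """Count exact input tokens for the next request and decide whether to window now."""
    count = client.messages.count_tokens(
        model="claude-opus-4-5",
        system=system_prompt,
        tools=tools,
        messages=messages,
    )
    remaining = 200_000 - count.input_tokens
    return remaining < 20_000  # fewer than 20K tokens left: window immediately
```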

Windowing mechanics: at turn N, extract a summary of turns 1 through N-1 into a PRIOR_CONTEXT block, drop the verbose message history, append PRIOR_CONTEXT + current turn as the new message list. Before windowing, messages = [user_msg, asst_1, tool_1, ..., asst_4, tool_4, user_5]. After windowing at turn 5, messages = [{role: "user", content: "PRIOR_CONTEXT: ...\n\nCURRENT: {turn_5}"}]. The new list is 10% the size of the old list.
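
A sketch of that windowing step, assuming the summary of turns 1 through N-1 has already been produced (the helper name and sample strings are illustrative):

```python
def window_messages(prior_summary: str, current_turn: str) -> list[dict]:
    """Collapse verbose history into a single PRIOR_CONTEXT block plus the current turn."""
    return [{
        "role": "user",
        "content": f"PRIOR_CONTEXT: {prior_summary}\n\nCURRENT: {current_turn}",
    }]

# Turn 5: the old [user_msg, asst_1, tool_1, ..., asst_4, tool_4, user_5] list is
# replaced by one compact message carrying only the facts worth keeping.
messages = window_messages("Customer verified; order found; refund amount $150.", "Process the refund.")
```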

Prompt caching exploits the fact that the system prompt and tool definitions don't change between turns. Mark them with cache_control: {type: "ephemeral"}. The first request pays full price and writes the cache; subsequent requests read the cached prefix at roughly 10% of the base input price. Over a long loop the savings compound (about 88% on the repeated prefix in the 50-call example below).
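
A minimal caching sketch, assuming the same support-agent prompt as the examples below; the breakpoint on the system block also covers the tool definitions that sit ahead of it in the cached prefix:

```python
from anthropic import Anthropic

client = Anthropic()

tools = [{"name": "lookup_order", "description": "Find order details",
          "input_schema": {"type": "object", "properties": {}}}]

resp = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=512,
    # The ephemeral marker sets a cache breakpoint: the first call writes the cache,
    # later identical prefixes are read back at a fraction of the base input price.
    system=[{"type": "text",
             "text": "You are a support agent. Refunds >$500 require approval.",
             "cache_control": {"type": "ephemeral"}}],
    tools=tools,
    messages=[{"role": "user", "content": "CASE_FACTS: ...\nQuestion: status of order O-1?"}],
)
```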

Case-facts immutability means CASE_FACTS is rewritten at every turn, never appended. Turn 1: CASE_FACTS: {customer_id, amount}\nQUESTION: .... Turn 2: CASE_FACTS: {customer_id, amount, order_found: true}\nQUESTION: ... (facts updated, not accumulated). Prevents the block from growing unboundedly. Combined with windowing: production pattern for multi-turn customer service.
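
A sketch of the rewrite rule; render_case_facts is a hypothetical helper that re-renders the block from the current facts dict every turn:

```python
def render_case_facts(facts: dict, question: str) -> str:
    """Rewrite the CASE_FACTS block from the current facts; never append to old text."""
    lines = "\n".join(f"- {key}: {value}" for key, value in facts.items())
    return f"CASE_FACTS:\n{lines}\nQUESTION: {question}"

facts = {"customer_id": "C-1017", "amount": 150}
print(render_case_facts(facts, "Is this refund within policy?"))

# Turn 2: update the dict and re-render; the block stays roughly the same size.
facts["order_found"] = True
print(render_case_facts(facts, "Process the refund."))
```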

## Where you'll see it in production

### 200-page contract analysis with windowing

Legal team uploads 200-page contract. Without windowing, loop dies at page 15 (context exhausted). With windowing at page 10: "Pages 1-10 analyzed; risks: X, Y, Z." Claude continues fresh on pages 11-20, final turn aggregates findings. ~50% of non-windowed cost despite same work.
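
A sketch of that page-batched loop, assuming the contract is already split into page strings; the batch size, prompts, and helper name are illustrative:

```python
from anthropic import Anthropic

client = Anthropic()

def analyze_contract(pages: list[str], batch_size: int = 10) -> str:
    """Analyze a long contract in page batches, carrying only a rolling risk summary forward."""
    summary = "No findings yet."
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        resp = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": (f"PRIOR_FINDINGS: {summary}\n\n"
                            f"Analyze pages {start + 1}-{start + len(batch)} for legal risks:\n\n"
                            + "\n\n".join(batch)),
            }],
        )
        # Each batch starts fresh: only the compact summary survives, not the raw pages.
        summary = resp.content[0].text
    return summary
```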

### Long-running customer support with case-facts immutability

Loop starts CASE_FACTS = {customer_id, order_id}. Turn 2: updated = {..., refund_amount: 150}. Turn 3: {..., policy_check: "under $500, approved", manager_approval: "needed"}. Each turn rewrites (not appends). Combined with windowing at turn 5, agent runs 20+ turns without exhaustion.

### Subagent context starvation mitigation

Coordinator delegates with the full 150K-token case context, expecting the subagent to inherit it. The subagent wastes most of its window on irrelevant history. Fix: the coordinator extracts a ~2K-token TASK_CONTEXT ("Analyze competitor pricing; they offer X, Y, Z") and passes only that. The subagent keeps ~198K tokens for actual research.
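
A sketch of the hand-off; extract_task_context is a hypothetical coordinator-side helper, and the field names are illustrative:

```python
def extract_task_context(case_facts: dict, task: str, keys: tuple = ("competitor", "products", "deadline")) -> str:
    """Build a compact TASK_CONTEXT block containing only what the subagent needs."""
    relevant = {k: case_facts[k] for k in keys if k in case_facts}
    facts = "\n".join(f"- {k}: {v}" for k, v in relevant.items())
    return f"TASK: {task}\nTASK_CONTEXT:\n{facts}"

coordinator_facts = {
    "competitor": "Acme Corp",
    "products": ["X", "Y", "Z"],
    "deadline": "Friday",
    "full_history": "...150K tokens of prior conversation, never forwarded...",
}
subagent_prompt = extract_task_context(coordinator_facts, "Analyze competitor pricing.")
```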

### Prompt caching for repeated system prompt + tools

Support agent loop runs 50 times/day. System (500 tokens) + 5 tools (2000) = 2500 tokens. Normal: 50 × $0.0375 = $1.875/day. Cached: 1 uncached + 49 cached @ 10% = $0.0375 + 49 × $0.00375 = $0.221/day. 88% savings.
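
Working that arithmetic (assuming the $15/MTok input rate implied by the $0.0375-per-call figure, and ignoring the cache-write premium as the example does):

```python
CALLS_PER_DAY = 50
PREFIX_TOKENS = 2_500               # system prompt (500) + 5 tool definitions (2,000)
INPUT_PRICE = 15 / 1_000_000        # $/token implied by $0.0375 per 2,500-token call

uncached = CALLS_PER_DAY * PREFIX_TOKENS * INPUT_PRICE
cached = PREFIX_TOKENS * INPUT_PRICE + (CALLS_PER_DAY - 1) * PREFIX_TOKENS * INPUT_PRICE * 0.10

print(f"uncached: ${uncached:.3f}/day")          # $1.875/day
print(f"cached:   ${cached:.3f}/day")            # ~$0.221/day
print(f"savings:  {1 - cached / uncached:.0%}")  # ~88%
```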

## Code examples

### Window + cache + case-facts immutability

**Python:**

```python
from anthropic import Anthropic
client = Anthropic()

def long_agent_with_windowing(case_facts: dict, max_turns: int = 50):
    messages = []
    turn = 0
    system_prompt = """You are a support agent. Refunds >$500 require approval."""
    tools = [
        {"name": "verify_customer", "description": "Check customer exists", "input_schema": {"type": "object", "properties": {}}},
        {"name": "lookup_order", "description": "Find order details", "input_schema": {"type": "object", "properties": {}}},
        {"name": "process_refund", "description": "Issue refund (with policy)", "input_schema": {"type": "object", "properties": {}}},
    ]

    # tool_result blocks owed to the API from the previous turn's tool calls
    pending_tool_results = []

    while turn < max_turns:
        turn += 1

        # WINDOWING at turn 5: drop history, keep CASE_FACTS
        if turn == 5:
            case_facts["prior_work"] = "Customer verified, order found."
            messages = []
            pending_tool_results = []

        # CASE-FACTS IMMUTABILITY: rewrite (not append)
        block = f"""CASE_FACTS:
- Customer ID: {case_facts.get('customer_id', 'N/A')}
- Order ID: {case_facts.get('order_id', 'N/A')}
- Refund Amount: ${case_facts.get('amount', 0)}
- Prior Work: {case_facts.get('prior_work', 'None')}
"""
        user_message = f"{block}\nQuestion: {case_facts.get('question', 'Help')}" if turn == 1 else f"{block}\nContinue."
        # Any tool_use from the prior turn must be answered with tool_result blocks first
        messages.append({"role": "user", "content": pending_tool_results + [{"type": "text", "text": user_message}]})
        pending_tool_results = []

        # PRE-CHECK token count
        token_count = client.messages.count_tokens(model="claude-opus-4-5", system=system_prompt, tools=tools, messages=messages)
        if token_count.input_tokens > 190000:
            print(f"Warning: {token_count.input_tokens} tokens; near limit.")

        resp = client.messages.create(
            model="claude-opus-4-5", max_tokens=1024,
            system=[{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}],
            tools=tools, messages=messages,
        )

        if resp.stop_reason == "end_turn":
            return resp.content[0].text if resp.content else "Done"
        messages.append({"role": "assistant", "content": resp.content})
        # Stub tool execution: answer every tool_use so the next request is valid
        pending_tool_results = [
            {"type": "tool_result", "tool_use_id": b.id, "content": "ok"}
            for b in resp.content if b.type == "tool_use"
        ]

    return "Max turns exceeded."
```

> Combined here: windowing at turn 5, case-facts rewritten rather than appended, caching on the system prompt, and a token pre-check before each call. Tool calls are stubbed so every tool_use block gets a matching tool_result.

**TypeScript:**

```typescript
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

interface Facts { customer_id?: string; order_id?: string; amount?: number; prior_work?: string; question?: string; }

async function longAgentWithWindowing(facts: Facts, maxTurns: number = 50) {
  const messages: Anthropic.MessageParam[] = [];
  let turn = 0;
  const systemPrompt = "You are a support agent. Refunds >$500 require approval.";
  const tools: Anthropic.Tool[] = [
    { name: "verify_customer", description: "Check customer", input_schema: { type: "object", properties: {} } },
    { name: "lookup_order", description: "Find order", input_schema: { type: "object", properties: {} } },
    { name: "process_refund", description: "Issue refund", input_schema: { type: "object", properties: {} } },
  ];

  // tool_result blocks owed to the API from the previous turn's tool calls
  let pendingToolResults: Anthropic.ToolResultBlockParam[] = [];

  while (turn < maxTurns) {
    turn += 1;

    if (turn === 5) {
      facts.prior_work = "Customer verified, order found.";
      messages.splice(0);
      pendingToolResults = [];
    }

    const block = `CASE_FACTS:\n- Customer ID: ${facts.customer_id || 'N/A'}\n- Order ID: ${facts.order_id || 'N/A'}\n- Refund Amount: $${facts.amount || 0}\n- Prior Work: ${facts.prior_work || 'None'}\n`;
    const userMessage = turn === 1 ? `${block}\nQuestion: ${facts.question || 'Help'}` : `${block}\nContinue.`;
    // Any tool_use from the prior turn must be answered with tool_result blocks first
    messages.push({ role: "user", content: [...pendingToolResults, { type: "text", text: userMessage }] });
    pendingToolResults = [];

    const tokenCount = await client.messages.countTokens({ model: "claude-opus-4-5", system: systemPrompt, tools, messages });
    if (tokenCount.input_tokens > 190000) console.warn(`Warning: ${tokenCount.input_tokens} tokens`);

    const resp = await client.messages.create({
      model: "claude-opus-4-5", max_tokens: 1024,
      system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
      tools, messages,
    });

    if (resp.stop_reason === "end_turn") return resp.content[0].type === "text" ? resp.content[0].text : "Done";
    messages.push({ role: "assistant", content: resp.content });
    // Stub tool execution: answer every tool_use so the next request is valid
    pendingToolResults = resp.content
      .filter((b): b is Anthropic.ToolUseBlock => b.type === "tool_use")
      .map((b): Anthropic.ToolResultBlockParam => ({ type: "tool_result", tool_use_id: b.id, content: "ok" }));
  }
  return "Max turns exceeded.";
}
```

> The same combination in TypeScript: windowing, case-facts immutability, caching, token pre-check, and stubbed tool results.

## Looks-right vs actually-wrong

| Looks right | Actually wrong |
|---|---|
| The context window resets after each turn. | Cumulative across all turns. Messages accumulate; budget shrinks. Once you hit 200K, new requests fail. |
| Streaming reduces context window cost. | Streaming is UX, not cost. You pay for every token, streamed or not. |
| Windowing loses information; only for cost reduction. | Windowing trades verbose intermediate details for critical facts. Information not lost; summarized. Also improves accuracy via attention engineering (facts at top). |
| Prompt caching is free; no trade-off. | Caching has a minimum cacheable prefix (1,024 tokens on most models); shorter prefixes aren't cached at all, and cache writes cost more than regular input. Use it for system + tools in loops, not one-off requests. |
| If you hit the context limit, increase max_tokens. | 200K is the input budget, not output. Increasing max_tokens (output budget) doesn't help. Window the input. |

## Comparison

| Aspect | 200K window | 100K window | Streaming | Caching |
| --- | --- | --- | --- | --- |
| Total budget | 200K tokens | 100K tokens | Same as non-streaming | Save 90% on cached |
| Effective usage | ~500 pages or 10 turns | ~250 pages or 5 turns | No savings; UX only | Big if repeated |
| When to use | Default; most tasks | Cost-constrained | Chat UX, long responses | Loops with fixed system/tools |
| Management | Window at turn 5 | Window at turn 2-3 | Stream events; buffer | Mark ephemeral; monitor TTL |
| Cost impact | Baseline | 2x cheaper per task | Neutral; UX benefit | 88% savings per repeated turn |
| Failure mode | Hit 200K limit | Hit 100K limit faster | Network drops | Cache miss on content change |

## Decision tree

1. **Single turn or a loop?**
   - **Single turn:** Count tokens, ensure <200K, execute.
   - **Loop:** Plan windowing; continue.

2. **Loop likely to run >5 turns?**
   - **Yes:** Window at turn 5: summarize prior work, reset message list.
   - **No:** No windowing needed.

3. **Repeatedly sending the same system prompt + tools?**
   - **Yes:** Enable prompt caching: mark with cache_control: ephemeral. Save 90% per turn.
   - **No:** No caching needed.

4. **Response is long (>5 sec wait) and user-facing?**
   - **Yes:** Use streaming for UX. No cost difference.
   - **No:** Non-streaming fine.

5. **After turn N, approaching 200K tokens?**
   - **Yes:** Window immediately: extract prior context, drop history.
   - **No:** Continue normally.

## Exam-pattern questions

### Q1. Context window resets after each turn?

No. Cumulative across all turns. Messages accumulate; budget shrinks. Once you hit 200K total, new requests fail with context_length_exceeded.

### Q2. Streaming reduces context window cost?

No. Streaming is a UX feature; it doesn't change token cost. Same total tokens billed, streamed or not. Use streaming for responsiveness, not cost.

### Q3. Windowing at turn 11 of a 12-turn loop. Worth it?

Marginal value. You've already paid for turns 1-10 of accumulated context. Optimal: window at turn 5 if the conversation is verbose, or turn 3 if each turn is expensive. Windowing late is reactive; preemptive is better.

### Q4. Subagent receives the parent's full 100K-token conversation history. What goes wrong?

Context starvation. Subagent wastes 100K of its 200K budget on irrelevant history. Coordinator should extract a 2KB TASK_CONTEXT and pass only that. Subagents do better with less, not more.

### Q5. Increasing max_tokens fixes a context_length_exceeded error?

No. 200K is the input budget; max_tokens is the output budget. Increasing output budget doesn't help an input limit. Window the input (summarize old turns into a CASE_FACTS block, drop verbose history).

### Q6. Cache the message history to keep all turns within budget?

No, not to stay within budget. Caching reduces cost, not context usage: cached tokens still count against the 200K window. Cache the stable prefix (system prompt and tool definitions); a growing message list still has to fit in the window, so window it instead.

### Q7. 1M-token window models: caching not needed, just use the bigger window?

Bigger windows are not free. Cost scales with input tokens. Filling a 1M window costs 5x a 200K window even if it fits. Use a 200K window thoughtfully (case-facts + summary + recent turns) for most production workloads.

### Q8. Optimal windowing target: at what % of capacity?

50-60% capacity. If your loop generates 20K tokens/turn, window at turn 5 (100K used) to have 100K left for the next 5 turns. Below 50% wastes capacity; above 70% risks running out before the next window cycle.

## FAQ

### Q1. What if I exceed 200K tokens?

Request fails with context_length_exceeded. No partial response; you get an error and are not billed for the failed request.

### Q2. How many pages fit in 200K tokens?

Roughly 500 pages of plain text (400 tokens/page). Dense prose or structured data: fewer pages.

### Q3. Prompt caching worth it for short requests?

Only if the cached prefix is at least 1,024 tokens (the minimum cacheable length on most models); shorter prefixes aren't cached at all, so there is nothing to save.

### Q4. Can I cache user messages?

You can place cache_control breakpoints on message content (for example, a large document in the first user turn), but content that changes every turn gains nothing from caching. In loops, cache the stable prefix: system prompt and tool definitions.

### Q5. How long does cached content stay in memory?

Ephemeral cache: 5 minutes default, up to 1 hour if extended. After TTL, discarded.

### Q6. Optimal window size?

Window at 50-60% capacity. If your loop generates 20K tokens/turn, window at turn 5 (100K) to have 100K left for next 5 turns.

### Q7. Does windowing affect output quality?

Slightly. Summarizing loses detail. But improves attention engineering by keeping facts at the top.

### Q8. Manually partition context into multiple calls?

Yes. Instead of one loop with windowing, split the work into several independent API calls. More expensive (no context reuse), but architecturally simpler.

### Q9. How do I detect when I'm close to the limit?

Use count_tokens() before each turn. Set a threshold (e.g., 180K); window when crossed.

### Q10. Subagent context limits?

Each subagent has its own 200K. Coordinator's window doesn't subtract from subagent's. But don't pass full coordinator history to subagent; pass only the relevant TASK_CONTEXT.

---

**Source:** https://claudearchitectcertification.com/concepts/context-window
**Vault sources:** ACP-T03 §5 context; GAI-K04 §9
**Last reviewed:** 2026-05-04

**Evidence tiers** — 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.
