D5.1 · Domain 5 · Context + Reliability · 15% of CCA-F

Context Window Management.

9 min read·10 sections·Tier A

Context window management is how you keep long conversations within limits without dropping critical facts. Patterns: case-facts blocks, progressive summarization, retrieval. Exact limits: see the model card.

01 · Summary

TLDR

Context window management is how you keep long conversations within limits without dropping critical facts. Patterns: case-facts blocks, progressive summarization, retrieval. Exact limits: see the model card.

Patterns: 3
Exam domain: D5
Coverage tier: B
Trap: summary loss
Right answer: facts block
02 · Definition

What it is

The context window is the maximum token budget Claude can process in a single request. Claude Sonnet and Opus have a 200K-token window, roughly 500 pages of plain text. It is a hard limit: a request that exceeds it is rejected with an error (an HTTP 400 invalid_request_error on the Anthropic API) instead of being silently truncated. Managing the window is core production architecture: inefficient windowing wastes budget, while strategic windowing unlocks large-scale automation.

The 200K budget covers everything in the request: the system prompt, all messages, and all tool definitions, plus the tokens Claude generates in response. Example: system (500) + prior conversation (5000) + current request (1000) + tool definitions (2000) = 8500 used, 191.5K remaining. If you're building an agent loop and messages grow unchecked, you can hit the limit within a dozen turns. The budget does not reset between turns: every request resends the accumulated history, so the entire sequence shares one budget.
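The budget arithmetic in the example above can be checked in a few lines. This is a minimal sketch: `CONTEXT_WINDOW` and `remaining_budget` are illustrative names (not SDK identifiers), and the token figures are the ones quoted in the example.

```python
CONTEXT_WINDOW = 200_000  # window size quoted in the text above


def remaining_budget(components: dict[str, int], window: int = CONTEXT_WINDOW) -> int:
    """Return the tokens left after summing every part of the request."""
    used = sum(components.values())
    if used > window:
        raise ValueError(f"request needs {used} tokens, window is {window}")
    return window - used


request = {
    "system": 500,
    "prior_conversation": 5_000,
    "current_request": 1_000,
    "tool_definitions": 2_000,
}

print(sum(request.values()))      # 8500
print(remaining_budget(request))  # 191500
```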

Three production patterns manage the window: prompt caching (about 90% off a repeated prefix; it cuts cost, not token count), context windowing (summarize old turns into a FACTS block and drop the verbose history), and streaming output (does not reduce tokens but improves UX during long waits). A fourth, underused pattern is case-facts immutability: the CASE_FACTS block is rewritten at every turn, never appended to.

The most-tested anti-pattern is windowing too late. If your loop runs for 12 turns and you window at turn 11, you have already paid to resend the growing history for 10 turns. Better: window at turn 5 if the conversation is verbose, or preemptively at turn 3 if each turn is expensive. Also tested: context starvation in subagents. The coordinator passes 100KB of history to a subagent, and the subagent spends budget and attention on context that is irrelevant to its task. Mitigation: give subagents only what they need, in a compact TASK_CONTEXT block.
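To see why the timing matters, the toy model below sums the input tokens sent across a loop in which every request resends the full history. It is a sketch under stated assumptions: the 5,000 tokens per turn and the 2,000-token summary are made-up figures, `total_input_tokens` is an illustrative helper, and the cost of generating the summary itself is ignored.

```python
from typing import Optional


def total_input_tokens(turns: int, per_turn: int,
                       window_at: Optional[int] = None, summary: int = 0) -> int:
    """Sum input tokens over a loop where each request resends the history."""
    history = 0
    total = 0
    for turn in range(1, turns + 1):
        if window_at is not None and turn == window_at:
            history = summary  # prior turns collapse into a summary block
        history += per_turn    # this turn's new content joins the history
        total += history       # the whole history is sent as input
    return total


print(total_input_tokens(12, 5_000))                               # 390000
print(total_input_tokens(12, 5_000, window_at=11, summary=2_000))  # 294000
print(total_input_tokens(12, 5_000, window_at=5, summary=2_000))   # 246000
```

Under these assumptions the late window still saves something, but windowing at turn 5 saves noticeably more because fewer requests carry the long history.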

03 · Mechanics

How it works

Token counting is available before you send: the SDK exposes a count_tokens() method that returns the input token count for a request (treat it as a close estimate) without running the model. Call it on large requests and track the count at every turn; when fewer than 20K tokens remain, window immediately.
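A threshold check along these lines might look like the sketch below. The counter is injected so the logic can run offline: `needs_windowing` and `rough_count` are illustrative names, and the 4-characters-per-token heuristic is only a stand-in. In a real loop the counter would wrap `client.messages.count_tokens(...)` and read `.input_tokens` from the result.

```python
from typing import Callable

WINDOW_LIMIT = 200_000
RESERVE = 20_000  # window once fewer than this many tokens remain


def needs_windowing(count_input_tokens: Callable[[list[dict]], int],
                    messages: list[dict],
                    limit: int = WINDOW_LIMIT,
                    reserve: int = RESERVE) -> bool:
    """Return True when the remaining budget has dropped below the reserve."""
    used = count_input_tokens(messages)
    return limit - used < reserve


def rough_count(messages: list[dict]) -> int:
    # Offline stand-in for a real token counter: about 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4


short_chat = [{"role": "user", "content": "x" * 4_000}]
long_chat = [{"role": "user", "content": "x" * 760_000}]
print(needs_windowing(rough_count, short_chat))  # False
print(needs_windowing(rough_count, long_chat))   # True
```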

Windowing mechanics: at turn N, extract a summary of turns 1 through N-1 into a PRIOR_CONTEXT block, drop the verbose message history, and send PRIOR_CONTEXT plus the current turn as the new message list. Before windowing, messages = [user_msg, asst_1, tool_1, ..., asst_4, tool_4, user_5]. After windowing at turn 5, messages = [{role: "user", content: "PRIOR_CONTEXT: ...\n\nCURRENT: {turn_5}"}]. In this example the new list is roughly 10% the size of the old one.
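The before/after shapes described above can be sketched as a small function. `window_messages` and `naive_summary` are illustrative names; the summarizer here just truncates each turn, whereas a production loop would more likely ask the model for the summary. The sample history is invented for the demo.

```python
from typing import Callable

Message = dict[str, str]


def window_messages(history: list[Message], current_turn: str,
                    summarize: Callable[[list[Message]], str]) -> list[Message]:
    """Collapse all prior turns into one PRIOR_CONTEXT block plus the current turn."""
    summary = summarize(history)
    content = f"PRIOR_CONTEXT: {summary}\n\nCURRENT: {current_turn}"
    return [{"role": "user", "content": content}]


def naive_summary(history: list[Message]) -> str:
    # Placeholder summarizer: keeps the first 40 characters of each turn.
    return " | ".join(m["content"][:40] for m in history)


history = [
    {"role": "user", "content": "I want a refund for my last order."},
    {"role": "assistant", "content": "I found the order and verified the customer."},
]
windowed = window_messages(history, "Is the refund approved?", naive_summary)
print(len(windowed))  # 1
```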

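Prompt caching exploits the fact that the system prompt and tool definitions don't change between turns. Mark the end of the static prefix with cache_control: {type: "ephemeral"}. The first request pays full price for the prefix (plus a cache-write premium); later requests that hit the cache pay about 10% of the base input price for it. Ignoring the write premium, 9 turns cost 1 + 8 × 0.1 = 1.8 prefix units instead of 9, roughly 80% off, and the saving approaches 90% as the number of turns grows.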
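As a rough illustration of where the marker goes, the helper below assembles keyword arguments for `client.messages.create` without sending anything. `build_cached_request` is a hypothetical helper, and the default model name is simply the one used elsewhere on this page.

```python
from typing import Any


def build_cached_request(system_prompt: str, tools: list[dict], messages: list[dict],
                         model: str = "claude-opus-4-5",
                         max_tokens: int = 1024) -> dict[str, Any]:
    """Assemble create() arguments with a cache breakpoint on the static prefix."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # Tools are rendered ahead of the system prompt in the cached prefix,
        # so one breakpoint on the system block covers both.
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        "tools": tools,
        "messages": messages,
    }


kwargs = build_cached_request(
    "You are a support agent.",
    tools=[],
    messages=[{"role": "user", "content": "Hi"}],
)
print(kwargs["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

With an API key configured, the dictionary can be passed as `Anthropic().messages.create(**kwargs)`; the response's `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` fields show whether the prefix was written or read.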

Case-facts immutability means the CASE_FACTS block is rewritten at every turn, never appended to. Turn 1: CASE_FACTS: {customer_id, amount}\nQUESTION: .... Turn 2: CASE_FACTS: {customer_id, amount, order_found: true}\nQUESTION: ... (facts updated, not accumulated). This keeps the block from growing without bound. Combined with windowing, it is the production pattern for multi-turn customer service.
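A minimal sketch of the rewrite-not-append idea follows. `render_turn` is an illustrative helper and the customer values are invented; the point is that each turn renders exactly one block from the current facts, so the block's size tracks the number of facts and not the number of turns.

```python
def render_turn(case_facts: dict, question: str) -> str:
    """Render one CASE_FACTS block from the current facts plus the question."""
    facts = ", ".join(f"{key}: {value}" for key, value in case_facts.items())
    return f"CASE_FACTS: {{{facts}}}\nQUESTION: {question}"


facts = {"customer_id": "C-1042", "amount": 150}
turn_1 = render_turn(facts, "Can I get a refund?")

# Turn 2: build an updated facts dict and render it again from scratch.
facts = {**facts, "order_found": True}
turn_2 = render_turn(facts, "Continue.")

print(turn_2)
```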

04 · In production

Where you'll see it

200-page contract analysis with windowing

A legal team uploads a 200-page contract. Without windowing, the loop dies around page 15 (context exhausted). With windowing at page 10, the history collapses to "Pages 1-10 analyzed; risks: X, Y, Z." Claude continues fresh on pages 11-20, and so on, and a final turn aggregates the findings. The result is about 50% of the non-windowed cost for the same work.
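The page-by-page flow might be structured like the sketch below, where only a running findings string crosses from one batch to the next. `analyze_in_windows` is an illustrative name and `fake_analyze` is a stand-in for the model call; a real analyzer would return a bounded summary of risks, not the page ranges recorded here.

```python
from typing import Callable


def analyze_in_windows(pages: list[str],
                       analyze: Callable[[str, str], str],
                       pages_per_window: int = 10) -> str:
    """Process pages in batches, carrying only the running findings forward."""
    findings = ""
    for start in range(0, len(pages), pages_per_window):
        batch = "\n".join(pages[start:start + pages_per_window])
        findings = analyze(findings, batch)  # prior findings + fresh pages only
    return findings


def fake_analyze(prior: str, batch: str) -> str:
    # Stand-in for a model call: records which page range it was shown.
    lines = batch.splitlines()
    return f"{prior}[{lines[0]}..{lines[-1]}]"


pages = [f"page {n}" for n in range(1, 201)]
result = analyze_in_windows(pages, fake_analyze)
print(result.count("["))  # 20
```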

Long-running customer support with case-facts immutability

The loop starts with CASE_FACTS = {customer_id, order_id}. Turn 2 updates it to {..., refund_amount: 150}. Turn 3: {..., policy_check: "under $500, approved", manager_approval: "not needed"}. Each turn rewrites the block (it does not append). Combined with windowing at turn 5, the agent runs 20+ turns without exhausting the window.

Subagent context starvation mitigation

The coordinator delegates with the full 150KB case context, expecting the subagent to inherit what it needs. The subagent spends 100KB of that on irrelevant history. Fix: the coordinator extracts a 2KB TASK_CONTEXT ("Analyze competitor pricing; they offer X, Y, Z") and passes only that, leaving the subagent nearly its whole 200K-token window for actual research.
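One way a coordinator could enforce this is to build the block from an explicit allow-list of keys and refuse to exceed a size cap. The sketch below is illustrative only: `build_task_context` and the case fields are hypothetical, and the 2KB budget is approximated as 2,000 characters.

```python
def build_task_context(task: str, case: dict, needed_keys: list[str],
                       max_chars: int = 2_000) -> str:
    """Build a compact TASK_CONTEXT from only the keys the subagent needs."""
    lines = ["TASK_CONTEXT", f"task: {task}"]
    lines += [f"{key}: {case[key]}" for key in needed_keys if key in case]
    block = "\n".join(lines)
    if len(block) > max_chars:
        raise ValueError(f"TASK_CONTEXT is {len(block)} chars, cap is {max_chars}")
    return block


case = {
    "competitor_offers": "X, Y, Z",
    "chat_history": "..." * 50_000,  # about 150KB the subagent does not need
}
context = build_task_context("Analyze competitor pricing", case, ["competitor_offers"])
print(context)
```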

Prompt caching for repeated system prompt + tools

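A support agent loop runs 50 times/day. System (500 tokens) + 5 tools (2000) = 2500 tokens, which costs $0.0375 per call at $15 per million input tokens. Uncached: 50 × $0.0375 = $1.875/day. Cached: 1 uncached + 49 cached @ 10% = $0.0375 + 49 × $0.00375 = $0.221/day, an 88% saving. This assumes consecutive calls land within the cache TTL and ignores the cache-write premium on the first call.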
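The same arithmetic as a small calculator, so other call volumes or prices can be tried. This is a sketch: `daily_prefix_cost` is an illustrative name, the $15 per million tokens is the rate implied by the $0.0375 figure above, and `write_multiplier` defaults to 1.0 to match the example (raise it to model a cache-write premium).

```python
def daily_prefix_cost(calls: int, prefix_tokens: int, price_per_mtok: float,
                      cached: bool = False, read_multiplier: float = 0.10,
                      write_multiplier: float = 1.0) -> float:
    """Cost of the static prefix across `calls` requests, with or without caching."""
    full = prefix_tokens * price_per_mtok / 1_000_000
    if not cached:
        return calls * full
    # First call writes the cache; the remaining calls read it at a discount.
    return full * write_multiplier + (calls - 1) * full * read_multiplier


uncached = daily_prefix_cost(50, 2_500, 15.0)
cached = daily_prefix_cost(50, 2_500, 15.0, cached=True)
savings = 1 - cached / uncached

print(f"{uncached:.3f}")  # 1.875
print(f"{cached:.3f}")    # 0.221
print(f"{savings:.0%}")   # 88%
```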

05 · Implementation

Code examples

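Window + cache + case-facts immutability
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-5"
NEAR_LIMIT = 180_000  # leaves about 20K of a 200K window

def run_tool(name: str, tool_input: dict) -> str:
    # Placeholder executor, not a real integration: replace with actual tool calls.
    return f"{name} completed"

def long_agent_with_windowing(case_facts: dict, max_turns: int = 50):
    messages = []
    pending = []  # tool_result blocks owed for the previous assistant turn
    system_prompt = "You are a support agent. Refunds >$500 require approval."
    empty_schema = {"type": "object", "properties": {}}
    tools = [
        {"name": "verify_customer", "description": "Check customer exists", "input_schema": empty_schema},
        {"name": "lookup_order", "description": "Find order details", "input_schema": empty_schema},
        {"name": "process_refund", "description": "Issue refund (with policy)", "input_schema": empty_schema},
    ]
    # A breakpoint on the system block caches the tools + system prefix.
    # Prefixes shorter than the model's minimum cacheable length are not cached.
    system = [{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}]

    for turn in range(1, max_turns + 1):
        # WINDOWING at turn 5: drop history, keep CASE_FACTS.
        # The summary is hardcoded for the demo; unsent tool output is folded in.
        if turn == 5:
            notes = " ".join(r["content"] for r in pending)
            case_facts["prior_work"] = f"Customer verified, order found. {notes}".strip()
            messages, pending = [], []

        # CASE-FACTS IMMUTABILITY: rebuild the block each turn (never append to it)
        block = f"""CASE_FACTS:
- Customer ID: {case_facts.get('customer_id', 'N/A')}
- Order ID: {case_facts.get('order_id', 'N/A')}
- Refund Amount: ${case_facts.get('amount', 0)}
- Prior Work: {case_facts.get('prior_work', 'None')}
"""
        # Restate the question whenever the history is empty (turn 1 and after windowing)
        ask = f"Question: {case_facts.get('question', 'Help')}" if not messages else "Continue."
        # tool_result blocks must come first in the user turn that answers a tool_use
        messages.append({"role": "user", "content": pending + [{"type": "text", "text": f"{block}\n{ask}"}]})
        pending = []

        # PRE-CHECK token count (input tokens for this request)
        count = client.messages.count_tokens(model=MODEL, system=system_prompt, tools=tools, messages=messages)
        if count.input_tokens > NEAR_LIMIT:
            print(f"Warning: {count.input_tokens} tokens; near limit.")

        resp = client.messages.create(
            model=MODEL, max_tokens=1024,
            system=system, tools=tools, messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason == "tool_use":
            # Every tool_use block needs a matching tool_result in the next user turn
            pending = [
                {"type": "tool_result", "tool_use_id": b.id, "content": run_tool(b.name, b.input)}
                for b in resp.content if b.type == "tool_use"
            ]
        elif resp.stop_reason == "end_turn":
            return next((b.text for b in resp.content if b.type == "text"), "Done")

    return "Max turns exceeded."
Four pieces combined: windowing at turn 5, case-facts rewritten not appended, caching on the system prompt, and a token pre-check. Tool calls are answered through the placeholder run_tool.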
06 · Distractor patterns

Looks right, isn't

Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.

Looks right

The context window resets after each turn.

Actually wrong

Cumulative across all turns. Messages accumulate; budget shrinks. Once you hit 200K, new requests fail.

Looks right

Streaming reduces context window cost.

Actually wrong

Streaming is UX, not cost. You pay for every token, streamed or not.

Looks right

Windowing loses information; only for cost reduction.

Actually wrong

Windowing trades verbose intermediate details for critical facts. The critical facts survive; only the verbose detail is compressed. It can also improve accuracy via attention engineering (facts at the top).

Looks right

Prompt caching is free; no trade-off.

Actually wrong

Caching has costs and limits. Cache writes are billed at a premium over normal input, and a prefix below the minimum cacheable length (1,024 tokens on many models) is not cached at all. Use it for system + tools in loops, not one-off requests.

Looks right

If you hit the context limit, increase max_tokens.

Actually wrong

max_tokens caps the output, not the input. Raising it does nothing to shrink an oversized prompt, and because input and output share the 200K window it can leave even less room. Window the input.

07 · Compare

Side-by-side

Aspect | 200K window | 100K window | Streaming | Caching
--- | --- | --- | --- | ---
Total budget | 200K tokens | 100K tokens | Same as non-streaming | Save 90% on cached
Effective usage | ~500 pages or 10 turns | ~250 pages or 5 turns | No savings; UX only | Big if repeated
When to use | Default; most tasks | Cost-constrained | Chat UX, long responses | Loops with fixed system/tools
Management | Window at turn 5 | Window at turn 2-3 | Stream events; buffer | Mark ephemeral; monitor TTL
Cost impact | Baseline | 2x cheaper per task | Neutral; UX benefit | ~90% off the cached prefix per cache hit
Failure mode | Hit 200K limit | Hit 100K limit faster | Network drops | Cache miss on content change
08 · When to use

Decision tree

01 · Single turn or a loop?

Single turn: count tokens, ensure <200K, execute.
Loop: plan windowing, then continue.

02 · Loop likely to run >5 turns?

Yes: Window at turn 5: summarize prior work, reset message list.
No: No windowing needed.

03 · Repeatedly sending the same system prompt + tools?

Yes: Enable prompt caching: mark with cache_control: ephemeral. Saves about 90% on the cached prefix per cache hit.
No: No caching needed.

04 · Response is long (>5 sec wait) and user-facing?

Yes: Use streaming for UX. No cost difference.
No: Non-streaming fine.

05 · After turn N, approaching 200K tokens?

Yes: Window immediately: extract prior context, drop history.
No: Continue normally.
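The decision tree above can be read as a small planning function. This is one possible encoding, not a prescribed API: `Plan` and `plan_context` are illustrative names, and "approaching 200K" is interpreted as fewer than 20K tokens remaining, which is an assumption borrowed from the mechanics section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    window_at_turn: Optional[int]  # scheduled windowing turn, if any
    use_caching: bool
    use_streaming: bool
    window_now: bool               # emergency windowing near the limit


def plan_context(expected_turns: int, static_prefix_reused: bool,
                 long_user_facing_response: bool, tokens_used: int,
                 limit: int = 200_000, reserve: int = 20_000) -> Plan:
    """Apply the five questions of the decision tree in order."""
    return Plan(
        window_at_turn=5 if expected_turns > 5 else None,  # question 02
        use_caching=static_prefix_reused,                  # question 03
        use_streaming=long_user_facing_response,           # question 04
        window_now=limit - tokens_used < reserve,          # question 05
    )


plan = plan_context(expected_turns=20, static_prefix_reused=True,
                    long_user_facing_response=False, tokens_used=185_000)
print(plan)
```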
09 · On the exam

Question patterns

78 V2 questions wired to this concept.

Long-document extraction stalls at chunk 18 every single time, regardless of which model you use. What is happening?

Your coordinator passes the entire chat history to each subagent for context. Subagents respond with confused outputs. Why?

Your support agent forgets the customer ID by turn 30 of a long conversation. What is the architectural fix?

Long-document extraction dies at chapter 18 every run with stop_reason of max_tokens. What is the right fix?

When you hit max_tokens, does increasing the model's context window solve the problem?

Your tool returns 50KB of JSON per call and the loop hits max_tokens after 3 iterations. Architectural fix?


10 · FAQ

Frequently asked

What if I exceed 200K tokens?
The request is rejected with an invalid_request_error (HTTP 400) before any generation starts. There is no partial response and no generated output to bill for.
How many pages fit in 200K tokens?
Roughly 500 pages of plain text (400 tokens/page). Dense prose or structured data: fewer pages.
Prompt caching worth it for short requests?
Only if the cached section meets the minimum cacheable length (1,024 tokens on many models). Below that, the prefix is not cached at all, so there is nothing to save.
Can I cache user messages?
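Yes, but it rarely pays off in a windowed loop. cache_control can be placed on message content blocks as well as on the system prompt and tool definitions. The cache matches on an unchanged prefix, so it helps only when earlier turns are resent verbatim; rewriting or summarizing the history invalidates it.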
How long does cached content stay in memory?
Ephemeral cache: 5 minutes by default, refreshed each time the cached prefix is read, with an optional 1-hour TTL. After the TTL expires, the entry is discarded.
Optimal window size?
Window at 50-60% capacity. If your loop generates 20K tokens/turn, window at turn 5 (100K) to have 100K left for next 5 turns.
Does windowing affect output quality?
Slightly. Summarizing loses detail. But improves attention engineering by keeping facts at the top.
Manually partition context into multiple calls?
Yes. Instead of one loop with windowing, make 5 separate API calls. More expensive (no context reuse), but simpler architecturally.
How do I detect when I'm close to the limit?
Use count_tokens() before each turn. Set a threshold (e.g., 180K); window when crossed.
Subagent context limits?
Each subagent has its own 200K. Coordinator's window doesn't subtract from subagent's. But don't pass full coordinator history to subagent; pass only the relevant TASK_CONTEXT.
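The "window at 50-60% capacity" rule from the FAQ above reduces to one line of arithmetic. A small sketch, with `window_turn` as an illustrative name and the 20K tokens per turn taken from the FAQ example:

```python
import math


def window_turn(tokens_per_turn: int, window: int = 200_000,
                fraction: float = 0.5) -> int:
    """First turn at which accumulated tokens reach `fraction` of the window."""
    return max(1, math.ceil(window * fraction / tokens_per_turn))


print(window_turn(20_000))  # 5
```

Heavier turns pull the windowing point earlier; for example, doubling the per-turn size halves the turn at which the threshold is reached.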
11 · Practice with AI

Work this with your AI

Work this concept hands-on with Claude Code, Codex, or claude.ai. Copy a prompt, paste it into your assistant, and practise in tandem. Each one keeps you active (explain it back, get drilled, or build) rather than just reading.

  • Drill it like the exam (scenario MCQs)
    Practice in the exam's scenario-MCQ format with trap awareness.
  • Explain it back (Feynman)
    Build durable, transferable understanding of a concept you can half-state.
  • Test me, adapting the difficulty
    Active recall practice on a concept you think you know.
  • Check my prerequisites first
    Before studying a concept that keeps not sticking.
  • Find the high-leverage 20%
    When a domain feels too big and you are short on time.
Self-check

Test yourself

Three diagnostic questions on this primitive.

Q1. Context window resets after each turn?
No. Cumulative across all turns. Messages accumulate; budget shrinks. Once you hit 200K total, new requests are rejected with an invalid_request_error.
Q2. Streaming reduces context window cost?
No. Streaming is a UX feature; it doesn't change token cost. Same total tokens billed, streamed or not. Use streaming for responsiveness, not cost.
Q3. Windowing at turn 11 of a 12-turn loop. Worth it?
Marginal value. You've already paid for turns 1-10 of accumulated context. Optimal: window at turn 5 if the conversation is verbose, or turn 3 if each turn is expensive. Windowing late is reactive; preemptive is better.
Last reviewed: 2026-05-04·Refresh cadence: monthly