On this page
TLDR
Context window management is how you keep long conversations within limits without dropping critical facts. Patterns: case-facts blocks, progressive summarization, retrieval. model card limits
What it is
The context window is the maximum token budget Claude can process in a single request. Claude Sonnet and Opus have a 200K token window, equivalent to ~500 pages of text or ~1 hour of conversation. Hard architectural boundary set by positional embeddings. Cannot be exceeded; requests fail with context_length_exceeded. Managing the window is core production architecture: inefficient windowing wastes budget; strategic windowing unlocks large-scale automation.
The 200K window applies to total input tokens, including system prompt, all messages, all tool definitions. Example: system (500) + prior conversation (5000) + current request (1000) + tool definitions (2000) = 8500 used, 191.5K remaining. If you're building an agent loop and messages grow unchecked, you can hit the limit at turn 12. The window doesn't reset between turns; shared budget for the entire sequence.
Three production patterns manage the window: prompt caching (90% off repeated prefix), context windowing (summarize old turns into FACTS block, drop verbose history), and streaming output (doesn't reduce tokens but improves UX during long waits). The fourth, underutilized pattern: case-facts immutability (CASE_FACTS rewritten not appended at every turn).
The most-tested anti-pattern is windowing too late. If your loop runs for 12 turns and you window at turn 11, you've wasted tokens for 10 turns. Optimal: window at turn 5 if the conversation is verbose, or preemptively at turn 3 if each turn is expensive. Also tested: context starvation in subagents. Coordinator passes 100KB to a subagent; subagent wastes half its 200K window on irrelevant context. Mitigate: subagents receive only what they need in a compact TASK_CONTEXT block.
How it works
Token counting is deterministic: SDK provides count_tokens() endpoint that returns exact cost before you execute. Always call on large requests. Track at every turn; when remaining <20K, window immediately.
Windowing mechanics: at turn N, extract a summary of turns 1 through N-1 into a PRIOR_CONTEXT block, drop the verbose message history, append PRIOR_CONTEXT + current turn as the new message list. Before windowing, messages = [user_msg, asst_1, tool_1, ..., asst_4, tool_4, user_5]. After windowing at turn 5, messages = [{role: "user", content: "PRIOR_CONTEXT: ...\n\nCURRENT: {turn_5}"}]. The new list is 10% the size of the old list.
Prompt caching exploits that system prompt and tool definitions don't change between turns. Mark with cache_control: {type: "ephemeral"}. On first request, full price; turns 2+ pay only 10% of cache price. Save 88% over 9 turns.
Case-facts immutability means CASE_FACTS is rewritten at every turn, never appended. Turn 1: CASE_FACTS: {customer_id, amount}\nQUESTION: .... Turn 2: CASE_FACTS: {customer_id, amount, order_found: true}\nQUESTION: ... (facts updated, not accumulated). Prevents the block from growing unboundedly. Combined with windowing: production pattern for multi-turn customer service.

Where you'll see it
200-page contract analysis with windowing
Legal team uploads 200-page contract. Without windowing, loop dies at page 15 (context exhausted). With windowing at page 10: "Pages 1-10 analyzed; risks: X, Y, Z." Claude continues fresh on pages 11-20, final turn aggregates findings. ~50% of non-windowed cost despite same work.
Long-running customer support with case-facts immutability
Loop starts CASE_FACTS = {customer_id, order_id}. Turn 2: updated = {..., refund_amount: 150}. Turn 3: {..., policy_check: "under $500, approved", manager_approval: "needed"}. Each turn rewrites (not appends). Combined with windowing at turn 5, agent runs 20+ turns without exhaustion.
Subagent context starvation mitigation
Coordinator delegates with full 150KB case context expecting inheritance. Subagent wastes 100KB on irrelevant history. Fix: coordinator extracts a 2KB TASK_CONTEXT ("Analyze competitor pricing; they offer X, Y, Z") and passes only that. Subagent has 198K for actual research.
Prompt caching for repeated system prompt + tools
Support agent loop runs 50 times/day. System (500 tokens) + 5 tools (2000) = 2500 tokens. Normal: 50 × $0.0375 = $1.875/day. Cached: 1 uncached + 49 cached @ 10% = $0.0375 + 49 × $0.00375 = $0.221/day. 88% savings.
Code examples
from anthropic import Anthropic
client = Anthropic()
def long_agent_with_windowing(case_facts: dict, max_turns: int = 50):
messages = []
turn = 0
system_prompt = """You are a support agent. Refunds >$500 require approval."""
tools = [
{"name": "verify_customer", "description": "Check customer exists", "input_schema": {"type": "object", "properties": {}}},
{"name": "lookup_order", "description": "Find order details", "input_schema": {"type": "object", "properties": {}}},
{"name": "process_refund", "description": "Issue refund (with policy)", "input_schema": {"type": "object", "properties": {}}},
]
while turn < max_turns:
turn += 1
# WINDOWING at turn 5: drop history, keep CASE_FACTS
if turn == 5:
case_facts["prior_work"] = "Customer verified, order found."
messages = []
# CASE-FACTS IMMUTABILITY: rewrite (not append)
block = f"""CASE_FACTS:
- Customer ID: {case_facts.get('customer_id', 'N/A')}
- Order ID: {case_facts.get('order_id', 'N/A')}
- Refund Amount: ${case_facts.get('amount', 0)}
- Prior Work: {case_facts.get('prior_work', 'None')}
"""
user_message = f"{block}\nQuestion: {case_facts.get('question', 'Help')}" if turn == 1 else f"{block}\nContinue."
messages.append({"role": "user", "content": user_message})
# PRE-CHECK token count
token_count = client.messages.count_tokens(model="claude-opus-4-5", system=system_prompt, tools=tools, messages=messages)
if token_count.input_tokens > 190000:
print(f"Warning: {token_count.input_tokens} tokens; near limit.")
resp = client.messages.create(
model="claude-opus-4-5", max_tokens=1024,
system=[{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}],
tools=tools, messages=messages,
)
if resp.stop_reason == "end_turn":
return resp.content[0].text if resp.content else "Done"
messages.append({"role": "assistant", "content": resp.content})
return "Max turns exceeded."Looks right, isn't
Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.
The context window resets after each turn.
Cumulative across all turns. Messages accumulate; budget shrinks. Once you hit 200K, new requests fail.
Streaming reduces context window cost.
Streaming is UX, not cost. You pay for every token, streamed or not.
Windowing loses information; only for cost reduction.
Windowing trades verbose intermediate details for critical facts. Information not lost; summarized. Also improves accuracy via attention engineering (facts at top).
Prompt caching is free; no trade-off.
Caching requires 1024-token minimum to be worthwhile. Short requests: caching overhead exceeds savings. Use for system + tools in loops, not one-off requests.
If you hit the context limit, increase max_tokens.
200K is the input budget, not output. Increasing max_tokens (output budget) doesn't help. Window the input.
Side-by-side
| Aspect | 200K window | 100K window | Streaming | Caching |
|---|---|---|---|---|
| Total budget | 200K tokens | 100K tokens | Same as non-streaming | Save 90% on cached |
| Effective usage | ~500 pages or 10 turns | ~250 pages or 5 turns | No savings; UX only | Big if repeated |
| When to use | Default; most tasks | Cost-constrained | Chat UX, long responses | Loops with fixed system/tools |
| Management | Window at turn 5 | Window at turn 2-3 | Stream events; buffer | Mark ephemeral; monitor TTL |
| Cost impact | Baseline | 2x cheaper per task | Neutral; UX benefit | 88% savings per repeated turn |
| Failure mode | Hit 200K limit | Hit 100K limit faster | Network drops | Cache miss on content change |
Decision tree
Single turn or a loop?
Loop likely to run >5 turns?
Repeatedly sending the same system prompt + tools?
cache_control: ephemeral. Save 90% per turn.Response is long (>5 sec wait) and user-facing?
After turn N, approaching 200K tokens?
Question patterns

78 V2 questions wired to this concept. Tap an answer to check it instantly — you'll see whether it's right and why — then expand the full breakdown for the mental model and all four rationales.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
72 additional questions for this concept live in the practice pillar. Take a mock exam ↗
Frequently asked
What if I exceed 200K tokens?
context_length_exceeded. No partial response; you get an error and are billed nothing (or only for excess).How many pages fit in 200K tokens?
Prompt caching worth it for short requests?
Can I cache user messages?
How long does cached content stay in memory?
Optimal window size?
Does windowing affect output quality?
Manually partition context into multiple calls?
How do I detect when I'm close to the limit?
count_tokens() before each turn. Set a threshold (e.g., 180K); window when crossed.Subagent context limits?
Work this with your AI
Work this concept hands-on with Claude Code, Codex, or claude.ai. Copy a prompt, paste it into your assistant, and practise in tandem. Each one keeps you active (explain it back, get drilled, or build) rather than just reading.
- Drill it like the exam (scenario MCQs)Practice in the exam's scenario-MCQ format with trap awareness.
- Explain it back (Feynman)Build durable, transferable understanding of a concept you can half-state.
- Test me, adapting the difficultyActive recall practice on a concept you think you know.
- Check my prerequisites firstBefore studying a concept that keeps not sticking.
- Find the high-leverage 20%When a domain feels too big and you are short on time.
