Knowledge · D2 Tool Design + D5 Reliability

Handling Token Limits When Processing Large JSON Responses in an MCP Client/Server Flow.

Our first GitHub MCP integration returned the full issue payload on every call; one Issues.list dropped 14,000 tokens into context before the model had read a word. Four mitigation patterns, in order of leverage, take that to under 500.

D2 · Tool DesignD5 · Context & ReliabilityHow-to

Last updated

01 · TLDR

The short version

MCP injects tool results verbatim into the model's context. A 50 KB JSON payload routinely costs 12,000+ tokens. Four mitigations in order of leverage: server-side field projection (strip what is not needed), pagination via follow-up tool calls, chunk-and-summarize on the client when pagination is not possible, and streaming with hard truncation thresholds when the transport supports it. Silent truncation is the worst outcome - always surface oversize as an explicit error contract. On the CCA-F this is a D2 plus D5 topic; the distractor to reject is "tell the model to ignore unused fields."

02 · Why this matters in production

The silent context drain

An agent integrates the GitHub MCP server, calls list_issues with default parameters, and the response includes every field on every issue: title, body, labels, assignees, milestones, reactions, links to comments, embedded HTML. The raw payload is 40 KB; the tokenized version is closer to 14,000 tokens. The agent has burned 7% of a 200K window on a single tool call before doing any work. Three calls later it is out of context. The user sees a generic "I ran out of room to think about this" failure and nobody traces it back to the issue list.

Per the /concepts/mcp page in the vault: "The result is wrapped in a tool_result block and appended to the message list. Claude never knows where the tool ran or what authentication was used." That verbatim injection is the cost surface. The model cannot "skim" the payload; it cannot decline to read it; the tokens are there until the conversation rolls them out. Shrinking the payload before it enters context is the only structural fix.

03 · The mechanics

Four patterns, in order of leverage, with examples

1. Server-side field projection. Highest leverage. The server accepts a fields parameter (or applies a default policy filter) and returns only what the agent needs. A representative SaaS API response typically drops from 8,000 tokens to about 400 with just three fields preserved. Cost: a few lines of orchestrator code. Benefit: the largest single token reduction available.

# Before: full payload, ~8,000 tokens
{
  "id": 8814,
  "title": "Refund flow breaks on canceled subscriptions",
  "body": "## Steps to reproduce ...",      # 2,400 tokens of markdown
  "user": { ...20 fields... },               # 600 tokens
  "labels": [ ...12 labels with metadata... ],
  "comments": [ ...embedded 30 comments... ],
  "reactions": { ...11 emoji counts... },
  ...
}

# After: projected to {id, status, summary}, ~400 tokens
{
  "id": 8814,
  "status": "open",
  "summary": "Refund flow breaks on canceled subscriptions; bug in webhook handler."
}

2. Pagination via follow-up tool calls. When the response is naturally a list, expose a cursor or offset plus a has_more flag. The agent fetches one page; if it needs more, it calls again with the next cursor. Per page-level cost stays bounded; total cost grows with the actual fetch volume, not with the worst-case payload. Pagination compounds with projection: paginate the list, project the items.

# tool_result page 1
{
  "items": [ {id:1,...}, {id:2,...}, {id:3,...} ],   # 3 items, ~600 tokens
  "next_cursor": "eyJ...",
  "has_more": true
}

# tool_result page 2 (after model decides to keep fetching)
{
  "items": [ {id:4,...}, {id:5,...}, {id:6,...} ],
  "next_cursor": "eyJ...",
  "has_more": true
}

3. Chunk-and-summarize. When the response is one large document (a contract, a transcript, a long article) and pagination is not natural, slice the payload into chunks and run a summarization step between them. The model reasons over the summary, not the raw chunks. Best implemented server-side when possible - a deterministic top-N extractor is cheaper than a client-side LLM summarization call. Save client-side LLM summarization for cases where deterministic summarization would be wrong (long-form prose, narrative).

4. Stream and accumulate. When the MCP transport supports streaming, accumulate state on the client incrementally. Pair with a token-counting middleware and a hard truncation threshold. When the threshold trips, return an explicit error contract (retryable: true, errorCode: RESULT_TRUNCATED) so the orchestrator can choose to paginate or project. Per /knowledge/mcp-error-contracts-retry-behavior, never truncate silently - the model reasoning over an unknown-incomplete payload is the worst failure mode.

The patterns compound. Projection plus pagination handles 90% of production cases. Add chunk-and-summarize for the long-document outliers. Reserve streaming for the high-throughput, low-latency cases where the orchestrator wants to start processing before the full payload lands. Skip projection and the others paper over the symptom without fixing the cause.

04 · Decision rule and checklist

Seven checks for every MCP tool that returns JSON

  1. Measure with count_tokens, not eyeballed bytes. Tokens are the actual cost; bytes are not a reliable proxy.
  2. Apply server-side field projection by default. Pick the fields the agent actually needs; drop everything else at the source.
  3. Cap the per-call result size. Set a hard token ceiling on the tool response; return an explicit error contract above it.
  4. Expose pagination on any naturally-listy response. Cursor or offset plus a has_more flag.
  5. Choose chunk-and-summarize over pagination for one-shot documents. Server-side deterministic summarization first; client-side LLM only when the structure demands it.
  6. Never truncate silently. Surface oversize as a retryable error so the orchestrator can paginate or project.
  7. Measure again after any schema change.A "harmless" new field can quietly add 30% to every response.
05 · Common anti-patterns

Five recurring oversized-result mistakes

  1. Ignore-the-fields prompt. Telling Claude to skip irrelevant fields. Cause: confusing model attention with token cost. Fix: strip fields at the server.
  2. Silent truncation. Server caps at 4,000 tokens, drops the rest, model never knows. Cause: a defensive default that hides the symptom. Fix: explicit error contract per /knowledge/mcp-error-contracts-retry-behavior.
  3. Full payload on every call. No projection, no pagination, no cap. Cause: shipping the first thing that worked. Fix: projection as the default; pagination on lists.
  4. Client-side LLM summarization for everything. Paying for a summarization model call where a deterministic top-N filter would do. Cause: cargo cult from RAG patterns. Fix: deterministic summarization first; LLM only for prose.
  5. No size cap. A schema change adds a new field; every response grows by 30%. Cause: no per-call ceiling. Fix: a hard token cap on every tool response, enforced server-side.
06 · CCA-F exam mapping

How this shows up on the exam

Domains
D2 Tool Design + Integration (18%) · D5 Context + Reliability (15%)
What is tested
Whether you reach for server-side projection (the structural fix) or for prompt instructions (the wrong fix) when an oversized tool result exhausts context.
Stem pattern
An MCP tool returns a 50 KB JSON response. The agent exhausts its context window after three calls. What is the highest-leverage fix?
Distractor to reject
"Tell the model to ignore unused fields." Models do not skip tokens they were told to ignore; the budget is still consumed.
Second distractor
"Increase the context window." Defers the problem; does not solve it. Three calls become five; five become eight. The trajectory is the same.
Third distractor
"Use a more capable model." Model vs Design heuristic per ACP-T03 §6. The defect is data shape, not capability.
07 · Sources

Vault and external references

  • Vault: data/aeo/reports/2026-05-17-recommendations.md §Signal 1 - source of the four canonical mitigation patterns ranked by leverage.
  • Vault: data/aeo/reports/2026-05-16-recommendations.md §Signal 1 - earliest formulation; same four-pattern recommendation.
  • Vault: data/aeo/reports/2026-05-16-page-type-mix.md §Signal 1 - this question is the highest-frequency MCP-related search across competitor surfaces.
  • Vault: public/concepts/mcp.md §How it works - tool_result is appended to the message list verbatim, which is why payload shape matters.
  • Vault: public/concepts/context-window.md §How it works - count_tokens is deterministic; always measure large requests.
  • Vault: public/concepts/attention-engineering.md §How it works - the U-shaped attention curve and why unused middle content degrades retrieval fidelity.
  • Vault: public/concepts/stop-reason.md §How it works - max_tokens stop reason as the explicit truncation signal, vs silent stream cutoff.
  • Vault: data/aeo/reports/2026-05-17-competitor-teardown.md - external competitor coverage; reference for what other prep sites get wrong on this question.
08 · FAQ

Frequently asked

How do I handle the token limit when processing a large JSON response in an MCP client/server flow?
Apply the four canonical mitigations in order of leverage: project unneeded fields server-side, paginate via follow-up tool calls, chunk-and-summarize on the client, and stream when transport allows. A raw 50 KB JSON payload can consume 12,000+ tokens; field projection alone often takes that to 400 tokens.
Why does a 50 KB JSON response cost so many tokens?
MCP injects tool results verbatim into the conversation context. Keys, brackets, quotes, and whitespace all tokenize, so an API payload that looks compact on the wire can balloon to 12,000+ tokens before the model has reasoned at all. Per the /concepts/mcp page in the vault: 'The result is wrapped in a tool_result block and appended to the message list.' Verbatim means verbatim. This is a direct D5 context-budget failure mode when ignored.
Can I just tell Claude to ignore irrelevant fields?
No. Models do not actually skip tokens they were told to ignore. Unused fields still occupy context budget and degrade retrieval fidelity on the fields you do need. Per /concepts/attention-engineering, transformers exhibit a U-shaped attention curve - extra middle content reduces the model's ability to attend to the fields that matter. Strip the fields server-side before returning the tool_result block; that is the CCA-F-correct fix.
When should I paginate vs. summarize vs. stream?
Paginate when the response is naturally listy (issues, orders, search results) and the client can issue follow-up tool calls with cursor or offset. Chunk-and-summarize when the response is one large document (a contract, a transcript) and you can compress between chunks. Stream when the MCP transport supports it and you want incremental state accumulation rather than a single oversized blob. The decision rule maps to the shape of the data, not to your preference.
What if I cannot predict the response size at design time?
Instrument the orchestrator with token-counting middleware and set a hard truncation threshold. When the threshold is exceeded, return an explicit error signal to the orchestrator instead of silently truncating. Per /knowledge/mcp-error-contracts-retry-behavior, the error should be a typed contract with retryable, errorCode, retryAfterMs, humanMessage; the orchestrator can then choose to paginate, project, or surface. Silent truncation is the worst outcome because the model reasons over a partial payload it believes is complete.
How do I measure the actual token cost of a tool result?
Use the Anthropic count_tokens endpoint. Serialize the tool_result block exactly as it would be sent to the model and pass it through count_tokens. Repeat for representative samples (smallest expected, median, largest). Per /concepts/context-window: 'Token counting is deterministic: SDK provides count_tokens endpoint that returns exact cost before you execute. Always call on large requests.' Without measurement you are guessing, and guesses tend to be optimistic by an order of magnitude.
Does prompt caching help with oversized tool results?
Partially. Caching the tool result helps if the same exact payload appears in a later turn (rare in practice; tool results usually change). Caching the system prompt and tool manifest helps a lot - that is the high-leverage caching surface (see /knowledge/mcp-tool-description-context-optimization). Treat caching as orthogonal to result-shrinking: cache the stable surfaces; shrink the variable ones.
What is the difference between field projection and field filtering?
Field projection is the orchestrator or server explicitly requesting only the fields the agent needs (typically via a query parameter like fields=id,status,summary). Field filtering is the server dropping unwanted fields by policy regardless of what was asked. Projection is more flexible (the agent picks); filtering is more defensive (the server enforces a max). Use projection when you can predict the field set; use filtering as a fail-safe ceiling on every response.
How does this map to the CCA-F exam?
D2 (Tool Design + Integration, 18%) for the server-side projection design decision; D5 (Context + Reliability, 15%) for the context-budget consequence. Stem pattern: 'An MCP tool returns a 50 KB JSON response. The agent exhausts its context window after three calls. What is the highest-leverage fix?' Right answer: server-side field projection. Distractor to reject: 'Tell the model to ignore unused fields' (does not save tokens). Second distractor: 'Increase the context window' (defers the problem; does not solve it).
What is the right pagination signal to expose?
A cursor (opaque token returned by the server) or a numeric offset, plus a boolean has_more flag. Cursor pagination is more robust to inserts and deletes between calls; offset pagination is simpler. Always include has_more so the model knows when to stop calling. Return the next cursor or offset inside the tool_result body so the model can pass it back on the next turn without orchestrator code threading it through.
Can I summarize on the server instead of the client?
Yes, and often you should. A server-side summarization step (deterministic transform: extract the top N items, drop the verbose fields, append a count) is cheaper than a client-side LLM call to summarize. Reserve client-side summarization for cases where the structure of the document makes deterministic summarization wrong (long-form prose, customer transcripts). Server-side summarization is the projection step generalized.
What happens if I silently truncate?
The model reasons over a partial payload it believes is complete and produces confident wrong answers. The worst version: a search tool returns the first 10 results when there are 200, the model concludes the right answer is not in the data, and the agent responds 'no match found.' There is no error, no warning, no signal that truncation happened. The fix is to surface truncation as an explicit error per /knowledge/mcp-error-contracts-retry-behavior - retryable: true with an errorCode like RESULT_TRUNCATED so the orchestrator can paginate or project.