The short version
Claude retries an MCP tool call only when the error contract says the failure is retryable. The minimum contract is four fields: retryable (boolean), errorCode (enum), retryAfterMs (integer), humanMessage (one-line string). Return a non-2xx HTTP status whenever a call actually failed; HTTP 200 with an error body is the #1 silent-abandonment bug in production MCP servers. On the CCA-F exam this lives in D2 (tool contract design) and D5 (retry and reliability), and the most common distractor is "increase Claude's retry budget" - there is no server-side knob for that.
The single bug that breaks every multi-tool agent
A long-running agent makes 40 tool calls in a single task. One upstream service has a transient hiccup at call 17. The MCP server catches the exception, wraps it in a 200 OK with a polite "error: upstream timeout" body, and hands it back. The agent reads the body as a successful tool result, treats "upstream timeout" as the answer, and proceeds with 23 more calls that all reference the wrong result. The whole pipeline is poisoned because the protocol's retry machinery never fired - the response looked like success.
Per the /concepts/mcp page in the vault: "Silent error suppression (returning empty as success) is the #1 production failure: Claude does not know the tool failed and re-requests indefinitely." The same page nails the requirement: "Error handling flows back to Claude via tool_result. If the server hits an API error (timeout, 403, invalid request), it returns a structured message in content." Structured being the operative word. Free-form strings do not trigger retry; structured contracts do.
How Claude decides whether to retry, and what your contract must carry
Claude's tool-use loop runs a small decision tree after every tool_result. Step one: is the result tagged as an error? In MCP that is the is_error flag in the tool_result envelope, set by the framework when the HTTP status is non-2xx or when the body declares it. Step two: if it is an error, is it retryable? The model reads the retryable boolean from the error payload. Step three: if retryable, how long to wait? It reads retryAfterMs. Step four: how many times have we already tried this exact call? The orchestrator tracks the counter against maxRetries.
Three of the four steps fail silently if the contract is missing. No is_error means the response is treated as data. No retryable means the model defaults to terminal (give up). No retryAfterMs means immediate retry, which on a 503 storm creates a thundering herd. Only the fourth step - the counter - is always enforced, because the orchestrator owns it independently of the contract.
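The four steps and their silent defaults can be written down as a small pure function. This is a minimal sketch, not Claude's or any orchestrator's actual code: the names ToolResult, Decision, and decide are hypothetical, and the defaults (missing retryable is terminal, missing retryAfterMs is an immediate retry) are taken from the description above.

```typescript
// Hypothetical shapes; field names follow the four-field contract in the text.
interface ErrorPayload {
  retryable?: boolean;
  errorCode?: string;
  retryAfterMs?: number;
  humanMessage?: string;
}

interface ToolResult {
  isError: boolean;      // step 1: the is_error flag on the tool_result
  payload: ErrorPayload; // parsed error body; ignored when isError is false
}

type Decision =
  | { action: "use_result" }
  | { action: "give_up"; reason: string }
  | { action: "retry"; waitMs: number };

// attempts = retries already made for this exact call; maxRetries is owned by the orchestrator.
function decide(result: ToolResult, attempts: number, maxRetries: number): Decision {
  // Step 1: no error flag means the response is treated as data.
  if (!result.isError) return { action: "use_result" };
  // Step 2: a missing retryable field defaults to terminal.
  if (result.payload.retryable !== true) return { action: "give_up", reason: "not retryable" };
  // Step 4: the counter is enforced regardless of what the contract says.
  if (attempts >= maxRetries) return { action: "give_up", reason: "retry budget exhausted" };
  // Step 3: a missing retryAfterMs means an immediate retry.
  return { action: "retry", waitMs: result.payload.retryAfterMs ?? 0 };
}
```

Calling decide with isError set to false returns use_result no matter what the payload says, which is the 200-with-error failure in miniature.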
Per the agentic-tool-design scenario in the vault: "Every tool that can fail emits an error in one of four explicit buckets: Transient, Permission, Data, Business. The harness reads is_error: true + error.bucket, then routes: Transient retry; Permission escalate; Data surface to user; Business block + log + escalate." The four-property contract is the envelope; the bucket is the routing tag inside it. The Skilljar Defining Tools With MCP lesson reinforces the principle: "The MCP framework returns a structured error to Claude (tool_error block) and attempts reconnection on the next tool call. Claude sees the error and can adjust strategy. Without graceful error handling on your side, repeated crashes look like the tool just stopped working."
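The routing rule in that quote amounts to a lookup table keyed by bucket. The sketch below assumes the result shape implied by the quote (is_error plus error.bucket); the route labels and the routeError function are hypothetical names chosen for illustration.

```typescript
type ErrorBucket = "Transient" | "Permission" | "Data" | "Business";
type Route = "retry" | "escalate" | "surface_to_user" | "block_log_escalate";

// One route per bucket, as listed in the scenario text.
const ROUTES: Record<ErrorBucket, Route> = {
  Transient: "retry",
  Permission: "escalate",
  Data: "surface_to_user",
  Business: "block_log_escalate",
};

interface BucketedResult {
  is_error: boolean;
  error?: { bucket?: string };
}

function routeError(result: BucketedResult): Route | "use_result" | "unroutable" {
  if (!result.is_error) return "use_result";
  const bucket = result.error?.bucket;
  // Own-property check so that strings like "toString" are not mistaken for buckets.
  if (bucket !== undefined && Object.prototype.hasOwnProperty.call(ROUTES, bucket)) {
    return ROUTES[bucket as ErrorBucket];
  }
  // An error without a recognized bucket cannot be routed; the harness has to treat it as a contract violation.
  return "unroutable";
}
```

The branch is on the bucket tag alone, never on the text of the message.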
Side-by-side: the abandoned call versus the retried call.
Causes abandonment
HTTP/1.1 200 OK
Content-Type: application/json
{ "error": "upstream timeout" }
200 OK + free-form error string. No retryable signal. Claude treats this as a successful tool call with garbage output and proceeds without retry.
Causes retry
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
{
  "retryable": true,
  "errorCode": "UPSTREAM_TIMEOUT",
  "retryAfterMs": 2000,
  "humanMessage": "Upstream took >5s; retry after 2s.",
  "bucket": "Transient"
}
Non-2xx status + structured contract with all four properties (plus the routing bucket). Claude reads retryable: true, waits retryAfterMs, retries up to the orchestrator's maxRetries.
A reference implementation in TypeScript:
// mcp-server-billing/src/errors.ts
export type ErrorBucket = "Transient" | "Permission" | "Data" | "Business";
export interface McpError {
  retryable: boolean;
  errorCode: string;    // enum, e.g. UPSTREAM_TIMEOUT
  retryAfterMs: number; // 0 if non-retryable
  humanMessage: string; // one line for the model and the on-call
  bucket: ErrorBucket;
}
// Transient failures are retryable; the suggested delay is jittered here, once, for every caller.
export function transient(code: string, ms: number, msg: string): McpError {
  return { retryable: true, errorCode: code, retryAfterMs: jitter(ms), humanMessage: msg, bucket: "Transient" };
}
// Permission failures are never retryable, so the delay is always 0.
export function permission(code: string, msg: string): McpError {
  return { retryable: false, errorCode: code, retryAfterMs: 0, humanMessage: msg, bucket: "Permission" };
}
// Spread the delay across 75%-125% of the base value so clients do not retry in lockstep.
function jitter(ms: number): number {
  return Math.round(ms * (0.75 + Math.random() * 0.5));
}
// in the handler:
// if (upstream.status === 504) return res.status(503).json(transient("UPSTREAM_TIMEOUT", 2000, "Upstream over 5s; retry"));
Three things matter about that snippet. The error constructors are typed - no handler can forget retryable because the type forces it. The jitter is centralized so every Transient error gets it without per-handler code. The HTTP status (503 for transient, the upstream-appropriate code for terminal) is set on the response object - the body alone is not the signal.
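One way to keep the status and the body from drifting apart is to have handlers return an outcome value and let a single function turn it into the HTTP reply. This is a framework-agnostic sketch under assumptions: Outcome, HttpReply, and toReply are hypothetical names, and falling back to 500 when a handler supplies a non-error status is an illustrative choice, not something the source prescribes.

```typescript
type ErrorBucket = "Transient" | "Permission" | "Data" | "Business";

interface McpError {
  retryable: boolean;
  errorCode: string;
  retryAfterMs: number;
  humanMessage: string;
  bucket: ErrorBucket;
}

// A handler either succeeds with a value or fails with a status and a typed error.
type Outcome<T> =
  | { ok: true; value: T }
  | { ok: false; status: number; error: McpError };

interface HttpReply {
  status: number;
  body: unknown;
}

function toReply<T>(outcome: Outcome<T>): HttpReply {
  if (outcome.ok) return { status: 200, body: outcome.value };
  // Guard: an error body must never leave with a 2xx (or any non-error) status.
  const status = outcome.status >= 400 && outcome.status <= 599 ? outcome.status : 500;
  return { status, body: outcome.error };
}
```

Because the success path and the error path are different branches of one type, a catch-all that wraps an error in a 200 has nowhere to hide.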
Seven checks for every MCP server you ship
- Every error path returns a non-2xx HTTP status. If it can fail, it can fail loudly. No 200 OK with error bodies.
- Every error body carries the four required fields. retryable, errorCode, retryAfterMs, humanMessage. Type-enforce in the language of your choice.
- errorCode is an enum owned by the server. Documented list, one bucket per code, no free-form strings.
- retryAfterMs is jittered on the server. Otherwise a 503 storm becomes a synchronized retry storm exactly one retryAfterMs later.
- The four-bucket tag is set on every error. Transient, Permission, Data, Business. Orchestrator branches on bucket, not on free text.
- Permission and Business errors set retryable: false. Never retry what will not fix itself. Escalate and log.
- You wrote a contract test for the error envelope. Same rigor as any other API contract. CI fails on schema drift.
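Several of these checks can be folded into one validator that a contract test runs against every error body the server can emit. This is a minimal sketch: envelopeViolations is a hypothetical helper, and the list of documented codes is passed in by the caller because the real enum belongs to your server.

```typescript
const BUCKETS: readonly string[] = ["Transient", "Permission", "Data", "Business"];

// Returns a list of contract violations; an empty list means the envelope passes.
function envelopeViolations(body: unknown, knownCodes: readonly string[]): string[] {
  if (typeof body !== "object" || body === null) return ["body is not a JSON object"];
  const fields = body as Record<string, unknown>;
  const retryable = fields["retryable"];
  const errorCode = fields["errorCode"];
  const retryAfterMs = fields["retryAfterMs"];
  const humanMessage = fields["humanMessage"];
  const bucket = fields["bucket"];
  const problems: string[] = [];

  if (typeof retryable !== "boolean") problems.push("retryable must be a boolean");
  if (typeof errorCode !== "string" || knownCodes.indexOf(errorCode) === -1) {
    problems.push("errorCode must be a documented code");
  }
  if (typeof retryAfterMs !== "number" || retryAfterMs < 0 || Math.floor(retryAfterMs) !== retryAfterMs) {
    problems.push("retryAfterMs must be a non-negative integer");
  }
  if (typeof humanMessage !== "string" || humanMessage.length === 0 || humanMessage.indexOf("\n") !== -1) {
    problems.push("humanMessage must be one non-empty line");
  }
  if (typeof bucket !== "string" || BUCKETS.indexOf(bucket) === -1) {
    problems.push("bucket must be one of the four buckets");
  }
  if ((bucket === "Permission" || bucket === "Business") && retryable === true) {
    problems.push("Permission and Business errors must set retryable: false");
  }
  return problems;
}
```

A CI test would call this on each sampled error response and fail the build when the returned list is non-empty. It does not cover the HTTP status or the jitter, which need their own assertions.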
Five failure modes we keep seeing
- 200-with-error. The single biggest production bug. Cause: a catch-all middleware that converts every exception into a 200 with an error body. Fix: route exceptions through the typed error constructors above, never through the success-response path.
- Missing retryable boolean. The error body has detail and a message but no retryable field. Cause: the author forgot it or assumed Claude would infer it. Fix: type-enforce the contract; the field is required, not optional.
- Synchronized retry storms. Every client retries at exactly the delay you returned. Cause: flat retryAfterMs. Fix: jitter on the server.
- Retrying permission errors. The agent burns the rate limit retrying 403s. Cause: retryable: true on a permission failure. Fix: hard rule - Permission bucket is always retryable: false.
- Free-form errorCode. The orchestrator dashboard cannot group failures because every handler invents its own code. Cause: no enum. Fix: a documented list checked at the type layer; adding a new code requires a deploy.
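Three of these failure modes (missing retryable, retried permission errors, free-form codes) go away if the code list is a single registry and every other field is derived from it. The sketch below is one possible shape, not the author's implementation: the four codes are hypothetical examples, and treating only the Transient bucket as retryable is an assumption (the text fixes Permission and Business as non-retryable but does not rule on Data).

```typescript
type ErrorBucket = "Transient" | "Permission" | "Data" | "Business";

// Hypothetical registry: one documented code, one bucket.
const ERROR_CODES = {
  UPSTREAM_TIMEOUT: "Transient",
  FORBIDDEN: "Permission",
  INVALID_INPUT: "Data",
  POLICY_BLOCKED: "Business",
} as const;

type ErrorCode = keyof typeof ERROR_CODES;

interface McpError {
  retryable: boolean;
  errorCode: ErrorCode;
  retryAfterMs: number;
  humanMessage: string;
  bucket: ErrorBucket;
}

function makeError(code: ErrorCode, humanMessage: string, baseDelayMs: number = 0): McpError {
  const bucket: ErrorBucket = ERROR_CODES[code];
  // Assumption: only Transient errors are retryable, so a Permission error can never be retried.
  const retryable = bucket === "Transient";
  const retryAfterMs = retryable ? jitter(baseDelayMs) : 0;
  return { retryable, errorCode: code, retryAfterMs, humanMessage, bucket };
}

// Same jitter range as the reference implementation: 75% to 125% of the base delay.
function jitter(ms: number): number {
  return Math.round(ms * (0.75 + Math.random() * 0.5));
}
```

With this shape, a call such as makeError("UPSTREAM_TIMOUT", "...") fails to compile, and a handler has no way to mark a FORBIDDEN error as retryable.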
How this shows up on the exam
Vault and external references
- Vault: data/aeo/reports/2026-05-17-recommendations.md §Signal 2 - source of the original recommendation, including the four-property envelope and D2/D5 mapping.
- Vault: data/aeo/reports/2026-05-16-recommendations.md - earliest formulation of the JSON envelope fields (retryable, retry_after_ms, attempt, max_attempts).
- Vault: public/concepts/mcp.md §How it works - "Silent error suppression is the #1 production failure" and the structured-message requirement.
- Vault: public/scenarios/agentic-tool-design.md §4-Bucket Structured Error Contract - canonical four-bucket model (Transient / Permission / Data / Business) and routing rules.
- Vault: 99-attachements/asc-a01-skilljar-course-content/course-11-claude-in-amazon-bedrock/lesson-64-defining-tools-with-mcp.md - Skilljar coverage of MCP error envelope behavior under Bedrock-hosted Claude.
- Vault: 02-tasks/acp-t06-sample-question-bank.md OA-Q9 - "Structured error context gives the coordinator the information it needs to make intelligent recovery decisions."
- Vault: 02-tasks/acp-t03-claude-architect-competitive-research.md §6 Distractor Heuristics - Model vs Design and Prompt vs Hook patterns referenced in the exam mapping.