Blog · 2026-05-24· 5 min read

The PGE harness: why Anthropic spends 15x more on Claude and still calls it cheap

Anthropic's Planner-Generator-Evaluator harness lifted ==SWE-bench Pro from 64.3% to 90.2%== at ==15x-19x token cost==, demoed alongside ==Claude Opus 4.7 (April 16, 2026)== with the new xhigh effort level and Task Budgets. The economics only flip past 12+ tool-domains; below that the harness is a tax. The architectural unlock is not the three agents — it is the shared markdown spec that survives every handoff.

D1D2D5multi-agentsubagentsagent-harness
Painterly walnut orchestrator's pavilion. A central conductor's podium with brass batons. Three subordinate music stands arranged in an arc labelled PLANNER, GENERATOR, EVALUATOR. A parchment handoff card rests on the conductor's stand. Loop in a small dark-walnut stagehand cabinet to the side.

Quick answer

The PGE harness — Planner, Generator, Evaluator — is Anthropic's reference shape for complex application development on Claude Opus 4.7. It lifted SWE-bench Pro from 64.3% to 90.2% at 15x-19x token cost by isolating each role's context and routing them through a shared markdown spec rather than chat history. The economics only justify themselves past 12+ tool-domains; below that, a single agent with prompt caching wins. The architectural unlock is not three agents — it is what they pass between each other.

The mistake most multi-agent demos make

The popular framing is that multi-agent systems work because more agents are smarter than one. That is the wrong abstraction.

Multi-agent demos that ship in production do not win on aggregate IQ. They win on two architectural choices: context isolation (the Evaluator does not see the Generator's reasoning, so it cannot inherit the Generator's blind spots) and deterministic routing (the harness code, not the model, picks which tool subset each role sees). When teams skip both and just spin up three Claude calls that pass chat history around, they get the worst of every world — higher cost, slower iteration, and the same blind spots they started with.

The four patterns below are what the Anthropic demo actually demonstrates, once you strip out the marketing.

Four patterns the PGE harness retires

Pattern 1. Decomposition is the work; the model is not.

The Planner does not "think harder." It runs on xhigh effort (the new mid-tier between high and max introduced with Claude Opus 4.7 on April 16, 2026) and produces 5-10 user stories with explicit acceptance criteria. The Generator never sees the high-level request. It receives one story at a time, with a clear definition of done. That is the whole point: the Generator's failure modes shrink when the surface area of its task shrinks. The model did not get better. The task got smaller.

Pattern 2. The Evaluator must be structurally isolated.

A single agent grading its own code suffers from self-evaluation leniency. Anthropic's reference setup runs the Evaluator as a separate agent with its own system prompt — "Senior QA Lead / Adversarial Auditor" — and gives it Playwright MCP to interact with the live application. The Evaluator returns structured JSON (status, critique, playwright_logs, recommended_fix), not prose. The pedantic rule that ships with the demo: "If a criterion is 99% complete, it is a FAIL." That is not theatre; it is the only way to prevent the loop from terminating on optimism.

Pattern 3. The handoff is a file, not a conversation.

Passing chat history between agents is the agent telephone game. Every handoff loses a little intent and adds a little noise; by iteration five the Evaluator is judging work against a goal the Planner never set. Anthropic's documented practice is the Shared Structured Artifact — a single evolving markdown spec (often AGENTS.md or CLAUDE.md at the repo root) that all three agents read and update. The spec is durable across retries, survives context-window rollover, and gives the Evaluator a stable target. The Planner writes; the Generator annotates; the Evaluator critiques in place.

Pattern 4. Task Budgets are not a nice-to-have.

The new Task Budgets API caps the entire loop, not just one call. A multi-agent harness without a hard ceiling can quietly burn 15x-19x more tokens on the wrong problem and you would not notice until the monthly bill. The April 16 release notes ship Task Budgets alongside Adaptive Thinking (dynamic reasoning-token allocation per sub-task), so the harness can spend deeply on the Planner's decomposition and shallowly on the Generator's third unit test. Without budgets, the economics never close. With them, the 15x-19x premium is bounded and decidable.

The nuanced point

Multi-agent harnesses are not a universal upgrade. The benchmark data is unambiguous: below 12 distinct tool-domains, a single agent with prompt caching outperforms the PGE harness on cost and matches it on quality. The jump from 32.1% to 84.0% success on >12 tool-domain tasks is real, but it is also a threshold function. Spin up three agents for a task that touches only Git and a database, and you have paid the harness tax for nothing.

The decision is therefore architectural, not aspirational. Ask whether the task crosses tool-domain boundaries (browser + DB + shell + API + Git, say), whether the agent will run for more than a handful of iterations, and whether the cost of an undetected failure is higher than the cost of three agents in a loop. If all three are yes, the PGE harness pays for itself. If any is no, a single Claude Opus 4.7 call with the right system prompt is the more honest answer.

How this shows up on the exam

D1 (Agentic Architectures, 27%) consistently presents scenarios where a single agent is under-performing on a multi-step, multi-tool task. The trap answers are "increase the temperature", "add more few-shot examples", or "upgrade to a larger model". The architecturally correct answer is almost always the orchestrator-worker pattern with isolated subagent context — what PGE names. The exam rewards candidates who reach for decomposition + context isolation before model selection. Memorise the shape: a coordinating agent that owns the spec, worker subagents with narrowed tool surfaces, and an adversarial reviewer that returns structured verdicts.

D5 (Context Window Management, 15%) asks directly about state handoff between agents. The exam reliably distinguishes between passing conversation history (anti-pattern, context rot, telephone game) and passing a persisted artifact (durable, re-readable, survives retries). D2 (Tool Design, 18%) rounds it out: questions about an agent flailing with too many tools are asking whether you understand deterministic routing — the harness, not the model, decides which tool subset each role sees per sprint state. PGE touches all three domains; recognise the pattern and the exam reduces to label-matching.

Where does your current agent setup fail first — the model, the tools, or the handoff?

The honest answer for most teams is the handoff. The model is fine. The tools are fine. But the agents are passing chat transcripts around like a relay race where each runner forgets the baton, and by lap three the goal has quietly drifted. The PGE harness is not interesting because it has three agents. It is interesting because it forces the question of what survives between them — and the answer, written down, on disk, in markdown, is the boring architectural choice that makes the 15x premium worth paying.

01 · Read next in the pillars

Where this lands in the exam-prep map

Each blog post bridges into the evergreen pillars. These are the most relevant follow-ups for this story.

02 · FAQ

6 questions answered

What is the PGE harness in one paragraph?
Planner, Generator, Evaluator. The Planner decomposes a vague request into 5-10 user stories with explicit acceptance criteria. The Generator implements one story at a time, plus unit tests. The Evaluator runs Playwright MCP against the live app and returns a structured PASS/FAIL JSON with critique. The three agents do not share conversation history; they share a single markdown specification that each one re-reads and updates. That shared artifact is the unlock, not the agent count.
Why does the harness beat a single agent by 26 points on SWE-bench Pro?
Two reasons. First, context isolation: the Evaluator never sees the Generator's reasoning, so it cannot inherit the Generator's blind spots — a single agent reviewing its own work suffers from self-evaluation leniency. Second, deterministic routing: the harness (code, not the model) decides which tool subset each role sees, so the Generator does not get drowned in 50+ tools when it only needs a chisel. The jump from 64.3% to 90.2% is what falls out when those two things are enforced architecturally.
Why is the 12-tool-domain threshold important?
Below 12 distinct tool-domains (mixing Git, DB, shell, browser, API, etc.), a single agent with prompt caching is cheaper and faster than running three agents in a loop. Past 12, the single agent starts losing track of which tool to reach for and context rot sets in. The benchmark data shows multi-agent success on >12 tool-domains climbs from 32.1% to 84.0%. Below the threshold, 15x-19x cost buys you nothing. Above it, the harness is the only thing that ships.
What is the Shared Structured Artifact and why not pass chat history?
The Shared Structured Artifact is a single evolving markdown document — usually a spec at the repo root — that the Planner authors, the Generator updates with implementation notes, and the Evaluator annotates with PASS/FAIL critiques. Passing raw conversation history between agents is the agent telephone game: every handoff loses a little intent and adds a little noise, so by iteration five the Evaluator is judging code against a goal the Planner never set. A markdown spec is durable; a chat transcript is not.
What changes with Claude Opus 4.7 and the new xhigh effort level?
Released ==April 16, 2026==, Claude Opus 4.7 introduces the `xhigh` effort level (deeper reasoning without max-tier latency), Adaptive Thinking (dynamic reasoning-token allocation per sub-task), and most importantly Task Budgets — a hard token/cost ceiling for an entire multi-agent loop. Without Task Budgets, a runaway PGE loop is a small finance incident. With them, the harness has a kill switch. Planner and Evaluator on xhigh is the documented configuration; the Generator can run lower.
How does the PGE harness map to the CCA-F exam?
D1 (Agentic Architectures, 27%) rewards candidates who reach for the orchestrator-worker pattern when a single agent under-performs on multi-step or multi-tool tasks. The distractor answers are *use a stronger model* or *add more examples*; the architecturally correct answer is *split the work into specialised subagents with isolated context*. D2 (Tool Design, 18%) tests whether you understand that exposing 50+ tools to one agent is a failure mode — the right move is deterministic routing of a tool subset per role. D5 (Context Window Management, 15%) asks about state handoff: the exam expects you to choose a persisted artifact over conversation history when agents must coordinate across iterations. PGE is the named pattern that touches all three.

Synthesized from research output on 2026-05-24. LinkedIn cross-post pending.
Last reviewed 2026-05-24.

Blog post · D1 · Blog

The PGE harness: why Anthropic spends 15x more on Claude and still calls it cheap, complete.

You've covered the full ten-section breakdown for this primitive, definition, mechanics, code, false positives, comparison, decision tree, exam patterns, and FAQ. One technical primitive down on the path to CCA-F.

More platforms →