Quick answer
The PGE harness — Planner, Generator, Evaluator — is Anthropic's reference shape for complex application development on Claude Opus 4.7. It lifted SWE-bench Pro from 64.3% to 90.2% at 15x-19x token cost by isolating each role's context and routing them through a shared markdown spec rather than chat history. The economics only justify themselves past 12+ tool-domains; below that, a single agent with prompt caching wins. The architectural unlock is not three agents — it is what they pass between each other.
The mistake most multi-agent demos make
The popular framing is that multi-agent systems work because more agents are smarter than one. That is the wrong abstraction.
Multi-agent demos that ship in production do not win on aggregate IQ. They win on two architectural choices: context isolation (the Evaluator does not see the Generator's reasoning, so it cannot inherit the Generator's blind spots) and deterministic routing (the harness code, not the model, picks which tool subset each role sees). When teams skip both and just spin up three Claude calls that pass chat history around, they get the worst of every world — higher cost, slower iteration, and the same blind spots they started with.
The four patterns below are what the Anthropic demo actually demonstrates, once you strip out the marketing.
Four patterns the PGE harness retires
Pattern 1. Decomposition is the work; the model is not.
The Planner does not "think harder." It runs on xhigh effort (the new mid-tier between high and max introduced with Claude Opus 4.7 on April 16, 2026) and produces 5-10 user stories with explicit acceptance criteria. The Generator never sees the high-level request. It receives one story at a time, with a clear definition of done. That is the whole point: the Generator's failure modes shrink when the surface area of its task shrinks. The model did not get better. The task got smaller.
Pattern 2. The Evaluator must be structurally isolated.
A single agent grading its own code suffers from self-evaluation leniency. Anthropic's reference setup runs the Evaluator as a separate agent with its own system prompt — "Senior QA Lead / Adversarial Auditor" — and gives it Playwright MCP to interact with the live application. The Evaluator returns structured JSON (status, critique, playwright_logs, recommended_fix), not prose. The pedantic rule that ships with the demo: "If a criterion is 99% complete, it is a FAIL." That is not theatre; it is the only way to prevent the loop from terminating on optimism.
Pattern 3. The handoff is a file, not a conversation.
Passing chat history between agents is the agent telephone game. Every handoff loses a little intent and adds a little noise; by iteration five the Evaluator is judging work against a goal the Planner never set. Anthropic's documented practice is the Shared Structured Artifact — a single evolving markdown spec (often AGENTS.md or CLAUDE.md at the repo root) that all three agents read and update. The spec is durable across retries, survives context-window rollover, and gives the Evaluator a stable target. The Planner writes; the Generator annotates; the Evaluator critiques in place.
Pattern 4. Task Budgets are not a nice-to-have.
The new Task Budgets API caps the entire loop, not just one call. A multi-agent harness without a hard ceiling can quietly burn 15x-19x more tokens on the wrong problem and you would not notice until the monthly bill. The April 16 release notes ship Task Budgets alongside Adaptive Thinking (dynamic reasoning-token allocation per sub-task), so the harness can spend deeply on the Planner's decomposition and shallowly on the Generator's third unit test. Without budgets, the economics never close. With them, the 15x-19x premium is bounded and decidable.
The nuanced point
Multi-agent harnesses are not a universal upgrade. The benchmark data is unambiguous: below 12 distinct tool-domains, a single agent with prompt caching outperforms the PGE harness on cost and matches it on quality. The jump from 32.1% to 84.0% success on >12 tool-domain tasks is real, but it is also a threshold function. Spin up three agents for a task that touches only Git and a database, and you have paid the harness tax for nothing.
The decision is therefore architectural, not aspirational. Ask whether the task crosses tool-domain boundaries (browser + DB + shell + API + Git, say), whether the agent will run for more than a handful of iterations, and whether the cost of an undetected failure is higher than the cost of three agents in a loop. If all three are yes, the PGE harness pays for itself. If any is no, a single Claude Opus 4.7 call with the right system prompt is the more honest answer.
How this shows up on the exam
D1 (Agentic Architectures, 27%) consistently presents scenarios where a single agent is under-performing on a multi-step, multi-tool task. The trap answers are "increase the temperature", "add more few-shot examples", or "upgrade to a larger model". The architecturally correct answer is almost always the orchestrator-worker pattern with isolated subagent context — what PGE names. The exam rewards candidates who reach for decomposition + context isolation before model selection. Memorise the shape: a coordinating agent that owns the spec, worker subagents with narrowed tool surfaces, and an adversarial reviewer that returns structured verdicts.
D5 (Context Window Management, 15%) asks directly about state handoff between agents. The exam reliably distinguishes between passing conversation history (anti-pattern, context rot, telephone game) and passing a persisted artifact (durable, re-readable, survives retries). D2 (Tool Design, 18%) rounds it out: questions about an agent flailing with too many tools are asking whether you understand deterministic routing — the harness, not the model, decides which tool subset each role sees per sprint state. PGE touches all three domains; recognise the pattern and the exam reduces to label-matching.
Where does your current agent setup fail first — the model, the tools, or the handoff?
The honest answer for most teams is the handoff. The model is fine. The tools are fine. But the agents are passing chat transcripts around like a relay race where each runner forgets the baton, and by lap three the goal has quietly drifted. The PGE harness is not interesting because it has three agents. It is interesting because it forces the question of what survives between them — and the answer, written down, on disk, in markdown, is the boring architectural choice that makes the 15x premium worth paying.
Where this lands in the exam-prep map
Each blog post bridges into the evergreen pillars. These are the most relevant follow-ups for this story.
Concept
Subagents
The Planner, Generator, and Evaluator are each subagents with isolated context. The PGE harness is the production-shaped version of the subagent primitive.
Open ↗Concept
Subagent state handoff
The Shared Structured Artifact pattern (a markdown spec all three agents read and update) is the canonical handoff mechanism. Passing chat history is the anti-pattern.
Open ↗Scenario
Multi-agent research system
Same orchestrator-worker shape, different verbs. The research system reads, the PGE harness builds. Both win on decomposition + isolated context, not on model strength.
Open ↗Knowledge
Architecture-aware agentic workflows
The 12+ tool-domain breakpoint and the harness-vs-model debate are architectural, not prompting decisions. This is the knowledge ground for why.
Open ↗6 questions answered
What is the PGE harness in one paragraph?
PASS/FAIL JSON with critique. The three agents do not share conversation history; they share a single markdown specification that each one re-reads and updates. That shared artifact is the unlock, not the agent count.Why does the harness beat a single agent by 26 points on SWE-bench Pro?
Why is the 12-tool-domain threshold important?
What is the Shared Structured Artifact and why not pass chat history?
What changes with Claude Opus 4.7 and the new xhigh effort level?
max-tier latency), Adaptive Thinking (dynamic reasoning-token allocation per sub-task), and most importantly Task Budgets — a hard token/cost ceiling for an entire multi-agent loop. Without Task Budgets, a runaway PGE loop is a small finance incident. With them, the harness has a kill switch. Planner and Evaluator on xhigh is the documented configuration; the Generator can run lower.How does the PGE harness map to the CCA-F exam?
Synthesized from research output on 2026-05-24. LinkedIn cross-post pending.
Last reviewed 2026-05-24.
