TL;DR
- Project Deal is an Anthropic experiment from April 27, 2026 where Claude agents transacted with 69 employees on a $100/seat budget
- Opus 4.5 closed 78% of deals; Haiku 4.5 closed 52% - but user satisfaction barely moved (4.8 vs 4.6)
- The perception gap is the lesson: a polite-feeling smaller model can quietly overpay and undersell
- The right production pattern is stage-aware routing: Haiku for discovery, Opus for the closing turn
- Maps to CCA-F D1 (27%) + D2 evaluation material - measure outcomes, not chat-feel
Quick answer
Project Deal is an Anthropic-published experiment from April 27, 2026 where Claude agents negotiated real transactions inside the company. The headline is the perception gap: Opus 4.5 closed 78% of deals; Haiku 4.5 closed 52%, but user satisfaction was nearly identical (4.8 vs 4.6). The smaller model felt fine while quietly overpaying and underselling. This is a CCA-F evaluation discipline lesson dressed as a marketplace demo.
What just happened
On April 27, 2026, Anthropic published results from Project Deal - an internal experiment where Claude agents transacted with 69 employees over a $100-per-participant budget. The headline framing is "Claude bought and sold stuff in an internal marketplace." The useful framing is something the CCA-F already tests aggressively: the gap between objective outcomes (close rate, price discipline, negotiation persistence) and subjective satisfaction (the chat felt fine).
When stakeholders evaluate agentic workflows on satisfaction surveys, they pick the wrong model. When they evaluate on benchmarks tied to the actual goal, they pick the right one. Project Deal is the cleanest public worked example of that divergence.
That divergence is exactly the territory the CCA-F's evaluation and model-selection questions probe - which is why this experiment matters far beyond the marketplace use case.
Why this matters now
Marketplace and negotiation workflows are one of the next big surfaces for agentic deployment. Stripe, Shopify, eBay, and several regional marketplaces are running agent pilots in 2026; OpenAI's Operator agent and Anthropic's own consumer-app agent both treat negotiation as a near-term capability. Project Deal is the cleanest public benchmark in that space - and the headline finding is that picking by feel rather than by outcome silently destroys margin. That signal is going to influence how every team writes their evaluation rubric for the next 12 months.
The trap is subtler than "use the bigger model." If you read Project Deal as "always pick Opus," you've missed the point. Opus also costs ~5× more per turn than Haiku at scale - running every step on Opus is itself the wrong answer because the cost compounds across discovery and listing turns where Haiku is genuinely fine. The real production pattern is stage-aware routing: Haiku for the cheap-but-frequent steps, Opus for the closing turn where the dollars actually move. That's the same logic Specialist Routing applies to coding agents.
For builders shipping today, the practical implication is that your evaluation harness chooses your model. If your eval is satisfaction-survey driven, you ship the smaller model and bleed margin. If your eval is outcome-driven (close rate, ask-vs-sell, escalation rate), you ship a staged pipeline that costs more per turn but converts better per transaction. The Project Deal authors made this point implicitly by publishing both metrics; teams replicating the pattern internally need to decide which metric they trust before they decide which model to ship.
This material is exam-relevant for two CCA-F domains, not one. D1 (Agentic Architecture, 27%) probes the stage-aware-routing pattern. D2 (Tool Design & Evaluation, 18%) probes the eval-rubric question. Both domains plant distractors in the same shape: an answer that suggests "tune the prompt" or "use the bigger model" when the right answer is "rewrite the eval and re-route per stage." Recognizing the perception-gap shape on the exam is worth at least 2-3 questions - and it generalizes beyond marketplaces to support, scheduling, and any agentic flow with measurable outcomes.
The open question is whether the perception-gap result holds across model families. Project Deal compared Opus 4.5 vs Haiku 4.5 - both Anthropic, both negotiation-tuned via the same RLHF lineage. The harder test is GPT-5.5 vs Haiku 4.5, or Gemini 3.1 vs Opus 4.5 - and the published replications haven't dropped yet. If the gap reverses on cross-vendor comparisons, the operational rule becomes "stage-aware routing AND vendor-aware fallback." That's a 2027-exam question; for the 2026 sitting, expect the within-Anthropic pattern only.
What the Project Deal numbers actually say
-
Close-rate gap was 26 points. Opus 4.5 closed 78% of deals; Haiku 4.5 closed 52%. On the most basic agent KPI - did the transaction happen - the smaller model lost on more than half its conversations.
-
Sell-side discipline lost 13 points to ask. Opus sold at 94% of asking price; Haiku at 81%. On a $100 listing, that's $13 walking out the door per transaction. Multiply across a workflow and the cost compounds quickly.
-
Buy-side discipline was 10 points worse. Opus came in 12% under budget on purchases; Haiku 2%. The smaller model accepts the first counter-offer. Negotiation persistence, not raw IQ, is where the gap shows up.
-
Negotiation depth doubled. Opus stayed in for 4.2 turns vs Haiku's 2.1. Persistence converts to revenue. The chat "feels long" on Opus, but the long chat is exactly where the value comes from.
-
User satisfaction barely moved. 4.8 vs 4.6 on a 5-point scale. That is the perception gap: stakeholders evaluating on chat feel will pick the wrong model and quietly bleed margin. The smaller model is polite-but-expensive.
3 production patterns Project Deal validates
-
Workflow split - faster agents for discovery, stronger for closing. Don't pick one model for the whole pipeline. Use Haiku for listing and discovery (cheap, fast, satisfaction stays high), and Opus for the closing turn and dispute mitigation (where the dollars actually move). This is stage-aware routing, a sibling of D1 Specialist Routing.
-
Outcome-tied evals beat satisfaction surveys. If your evaluation metric is "did the chat feel good?" you ship Haiku. If it is "closed at price within X% of ask?" you ship Opus on the closing turn. The CCA-F D2 evaluation questions hammer this exact substitution.
-
Benchmark audits before pilot-to-prod. Anthropic published Project Deal's numbers publicly; most internal demos don't. Before promoting an agentic workflow past pilot, run a benchmark step (close rate, ask-vs-sell, turns-to-close, regression vs baseline). Skipping it is how perception gaps ship.
Three mistakes Project Deal exposes
-
Treating marketplace as a single-agent task. Marketplace flows are a workflow split: faster agents for discovery and listing, stronger agents for final negotiation and disputes. Single-agent magic-trick framing always loses to staged routing.
-
Optimizing on satisfaction surveys. If your eval is "did the chat feel good?" you pick Haiku and ship. If your eval is "closed at price within X% of ask?" you pick Opus for the closing turn. The metric you choose chooses the model.
-
Skipping the benchmark audit before approval. Project Deal published the gap publicly; most internal demos won't. Add a benchmark step (close rate, ask-vs-sell, turns-to-close) before any agentic workflow rolls past pilot. The exam probes this discipline directly.
How this shows up on the exam
Domain 1 (Agentic Architecture, 27%) and Domain 2 (Tool Design & Evaluation, 18%) both probe this material. Expect questions where two model choices look equivalent on a UX dimension, and the correct answer is "pick by the benchmark tied to the actual outcome, not by the satisfaction survey." Project Deal is the canonical worked example. It is also a near-relative of the "Model vs Design" distractor pattern: when you read "add system prompt instructions to negotiate harder" as an answer choice, recognize that prompt-tuning rarely closes a 26-point benchmark gap.
For study-next, pair this post with the Evaluation concept page (which formalizes ask-vs-sell and other outcome metrics), the Customer support resolution agent scenario (the build-along that uses the same negotiation handoff loop), and the Day-of distractor patterns in the Exam Guide (especially the Prompt vs Hook and Model vs Design patterns). The exam-relevant principle is: measure the outcome, not the chat.
Sources
Where this lands in the exam-prep map
Each blog post bridges into the evergreen pillars. These are the most relevant follow-ups for this story.
Concept
Evaluation
Project Deal is the canonical example of why benchmarks beat satisfaction surveys.
Open ↗Scenario
Customer support resolution
The negotiation handoff loop is structurally identical to support escalation.
Open ↗Exam Guide
Day-of distractor patterns
Perception-gap is a Model-vs-Design distractor in disguise.
Open ↗7 questions answered
What was Project Deal and what did Anthropic learn?
What does Project Deal teach about model selection on the CCA-F?
Is Project Deal an example of the 'Model vs Design' distractor pattern?
Should I use Opus 4.5 or Haiku 4.5 for an agentic marketplace workflow?
How big was the price-discipline gap and why does it matter?
What evaluation metric should I use for an agentic workflow?
Why did user satisfaction barely move between Opus and Haiku?
Synthesized from research output on 2026-05-02. LinkedIn cross-post pending.
Last reviewed 2026-05-06.
