Blog · 2026-05-02· 6 min read

Claude's Marketplace Agent: The Project Deal Experiment

Anthropic's Project Deal had Claude agents negotiate real transactions with 69 employees and a $100 budget. Opus 4.5 closed deals at 78%; Haiku 4.5 at 52%. But user satisfaction barely moved (4.8 vs 4.6) - the perception gap trap. The smaller model felt fine while quietly overpaying and underselling.

D1D5marketplaceevaluationopus
Painterly walnut market-stall scene: a Loop behind a Project Deal storefront, brass coins and a wax-sealed contract on the counter.

TL;DR

  • Project Deal is an Anthropic experiment from April 27, 2026 where Claude agents transacted with 69 employees on a $100/seat budget
  • Opus 4.5 closed 78% of deals; Haiku 4.5 closed 52% - but user satisfaction barely moved (4.8 vs 4.6)
  • The perception gap is the lesson: a polite-feeling smaller model can quietly overpay and undersell
  • The right production pattern is stage-aware routing: Haiku for discovery, Opus for the closing turn
  • Maps to CCA-F D1 (27%) + D2 evaluation material - measure outcomes, not chat-feel

Quick answer

Project Deal is an Anthropic-published experiment from April 27, 2026 where Claude agents negotiated real transactions inside the company. The headline is the perception gap: Opus 4.5 closed 78% of deals; Haiku 4.5 closed 52%, but user satisfaction was nearly identical (4.8 vs 4.6). The smaller model felt fine while quietly overpaying and underselling. This is a CCA-F evaluation discipline lesson dressed as a marketplace demo.

What just happened

On April 27, 2026, Anthropic published results from Project Deal - an internal experiment where Claude agents transacted with 69 employees over a $100-per-participant budget. The headline framing is "Claude bought and sold stuff in an internal marketplace." The useful framing is something the CCA-F already tests aggressively: the gap between objective outcomes (close rate, price discipline, negotiation persistence) and subjective satisfaction (the chat felt fine).

When stakeholders evaluate agentic workflows on satisfaction surveys, they pick the wrong model. When they evaluate on benchmarks tied to the actual goal, they pick the right one. Project Deal is the cleanest public worked example of that divergence.

That divergence is exactly the territory the CCA-F's evaluation and model-selection questions probe - which is why this experiment matters far beyond the marketplace use case.

Why this matters now

Marketplace and negotiation workflows are one of the next big surfaces for agentic deployment. Stripe, Shopify, eBay, and several regional marketplaces are running agent pilots in 2026; OpenAI's Operator agent and Anthropic's own consumer-app agent both treat negotiation as a near-term capability. Project Deal is the cleanest public benchmark in that space - and the headline finding is that picking by feel rather than by outcome silently destroys margin. That signal is going to influence how every team writes their evaluation rubric for the next 12 months.

The trap is subtler than "use the bigger model." If you read Project Deal as "always pick Opus," you've missed the point. Opus also costs ~5× more per turn than Haiku at scale - running every step on Opus is itself the wrong answer because the cost compounds across discovery and listing turns where Haiku is genuinely fine. The real production pattern is stage-aware routing: Haiku for the cheap-but-frequent steps, Opus for the closing turn where the dollars actually move. That's the same logic Specialist Routing applies to coding agents.

For builders shipping today, the practical implication is that your evaluation harness chooses your model. If your eval is satisfaction-survey driven, you ship the smaller model and bleed margin. If your eval is outcome-driven (close rate, ask-vs-sell, escalation rate), you ship a staged pipeline that costs more per turn but converts better per transaction. The Project Deal authors made this point implicitly by publishing both metrics; teams replicating the pattern internally need to decide which metric they trust before they decide which model to ship.

This material is exam-relevant for two CCA-F domains, not one. D1 (Agentic Architecture, 27%) probes the stage-aware-routing pattern. D2 (Tool Design & Evaluation, 18%) probes the eval-rubric question. Both domains plant distractors in the same shape: an answer that suggests "tune the prompt" or "use the bigger model" when the right answer is "rewrite the eval and re-route per stage." Recognizing the perception-gap shape on the exam is worth at least 2-3 questions - and it generalizes beyond marketplaces to support, scheduling, and any agentic flow with measurable outcomes.

The open question is whether the perception-gap result holds across model families. Project Deal compared Opus 4.5 vs Haiku 4.5 - both Anthropic, both negotiation-tuned via the same RLHF lineage. The harder test is GPT-5.5 vs Haiku 4.5, or Gemini 3.1 vs Opus 4.5 - and the published replications haven't dropped yet. If the gap reverses on cross-vendor comparisons, the operational rule becomes "stage-aware routing AND vendor-aware fallback." That's a 2027-exam question; for the 2026 sitting, expect the within-Anthropic pattern only.

What the Project Deal numbers actually say

  1. Close-rate gap was 26 points. Opus 4.5 closed 78% of deals; Haiku 4.5 closed 52%. On the most basic agent KPI - did the transaction happen - the smaller model lost on more than half its conversations.

  2. Sell-side discipline lost 13 points to ask. Opus sold at 94% of asking price; Haiku at 81%. On a $100 listing, that's $13 walking out the door per transaction. Multiply across a workflow and the cost compounds quickly.

  3. Buy-side discipline was 10 points worse. Opus came in 12% under budget on purchases; Haiku 2%. The smaller model accepts the first counter-offer. Negotiation persistence, not raw IQ, is where the gap shows up.

  4. Negotiation depth doubled. Opus stayed in for 4.2 turns vs Haiku's 2.1. Persistence converts to revenue. The chat "feels long" on Opus, but the long chat is exactly where the value comes from.

  5. User satisfaction barely moved. 4.8 vs 4.6 on a 5-point scale. That is the perception gap: stakeholders evaluating on chat feel will pick the wrong model and quietly bleed margin. The smaller model is polite-but-expensive.

3 production patterns Project Deal validates

  1. Workflow split - faster agents for discovery, stronger for closing. Don't pick one model for the whole pipeline. Use Haiku for listing and discovery (cheap, fast, satisfaction stays high), and Opus for the closing turn and dispute mitigation (where the dollars actually move). This is stage-aware routing, a sibling of D1 Specialist Routing.

  2. Outcome-tied evals beat satisfaction surveys. If your evaluation metric is "did the chat feel good?" you ship Haiku. If it is "closed at price within X% of ask?" you ship Opus on the closing turn. The CCA-F D2 evaluation questions hammer this exact substitution.

  3. Benchmark audits before pilot-to-prod. Anthropic published Project Deal's numbers publicly; most internal demos don't. Before promoting an agentic workflow past pilot, run a benchmark step (close rate, ask-vs-sell, turns-to-close, regression vs baseline). Skipping it is how perception gaps ship.

Three mistakes Project Deal exposes

  • Treating marketplace as a single-agent task. Marketplace flows are a workflow split: faster agents for discovery and listing, stronger agents for final negotiation and disputes. Single-agent magic-trick framing always loses to staged routing.

  • Optimizing on satisfaction surveys. If your eval is "did the chat feel good?" you pick Haiku and ship. If your eval is "closed at price within X% of ask?" you pick Opus for the closing turn. The metric you choose chooses the model.

  • Skipping the benchmark audit before approval. Project Deal published the gap publicly; most internal demos won't. Add a benchmark step (close rate, ask-vs-sell, turns-to-close) before any agentic workflow rolls past pilot. The exam probes this discipline directly.

How this shows up on the exam

Domain 1 (Agentic Architecture, 27%) and Domain 2 (Tool Design & Evaluation, 18%) both probe this material. Expect questions where two model choices look equivalent on a UX dimension, and the correct answer is "pick by the benchmark tied to the actual outcome, not by the satisfaction survey." Project Deal is the canonical worked example. It is also a near-relative of the "Model vs Design" distractor pattern: when you read "add system prompt instructions to negotiate harder" as an answer choice, recognize that prompt-tuning rarely closes a 26-point benchmark gap.

For study-next, pair this post with the Evaluation concept page (which formalizes ask-vs-sell and other outcome metrics), the Customer support resolution agent scenario (the build-along that uses the same negotiation handoff loop), and the Day-of distractor patterns in the Exam Guide (especially the Prompt vs Hook and Model vs Design patterns). The exam-relevant principle is: measure the outcome, not the chat.

Sources

01 · Read next in the pillars

Where this lands in the exam-prep map

Each blog post bridges into the evergreen pillars. These are the most relevant follow-ups for this story.

02 · FAQ

7 questions answered

What was Project Deal and what did Anthropic learn?
Project Deal was an Anthropic-internal experiment where Claude agents (Opus 4.5 and Haiku 4.5) transacted with 69 employees on an internal marketplace, each with a $100 budget. The headline finding was the perception gap: Opus closed 78% of deals at 94% of ask while Haiku closed 52% at 81%, yet user satisfaction was nearly identical (4.8 vs 4.6). The smaller model felt fine while losing money on every transaction.
What does Project Deal teach about model selection on the CCA-F?
Don't trust the chat-feel signal when picking a model. The CCA-F's D1 and D2 questions repeatedly probe whether candidates can pick by objective outcome (benchmarks, evals tied to the goal) rather than subjective UX. Project Deal is the cleanest public worked example of those two diverging.
Is Project Deal an example of the 'Model vs Design' distractor pattern?
Sibling pattern. Model vs Design says *'if behavior is wrong, fix design before escalating model.'* Project Deal says *'if the smaller model feels right, that doesn't mean it is right - measure the outcome.'* Both share the same root cause: trusting feel over evidence.
Should I use Opus 4.5 or Haiku 4.5 for an agentic marketplace workflow?
Both, staged. Use Haiku for listing, discovery, and routine answers - it preserves satisfaction at low cost. Use Opus for the closing turn and dispute mitigation - that is where the 26-point benchmark gap moves real money. This is stage-aware routing, a sibling of Specialist Routing.
How big was the price-discipline gap and why does it matter?
Opus sold at 94% of asking price; Haiku at 81% - a 13-point gap. On the buy side, Opus came in 12% under budget vs Haiku at 2%. Across a marketplace workflow, the compounded margin loss from running the wrong model on the closing turn is significant - and invisible to satisfaction surveys.
What evaluation metric should I use for an agentic workflow?
Pick metrics tied to the actual goal, not chat satisfaction. For marketplace: close rate, ask-vs-sell %, buy price as % of budget, turns-to-close. For support: first-contact resolution, escalation rate. For research: citation accuracy, claim coverage. The CCA-F D2 evaluation questions reward outcome-tied metrics over UX surveys.
Why did user satisfaction barely move between Opus and Haiku?
Because politeness and conversational flow are easy for the smaller model to match. The gap shows up in persistence and discipline - staying in negotiation for 4 turns instead of 2, refusing to accept the first counter-offer, holding asking price. None of that registers on a 5-point satisfaction scale.

Synthesized from research output on 2026-05-02. LinkedIn cross-post pending.
Last reviewed 2026-05-06.

Blog post · D1 · Blog

Claude's Marketplace Agent: The Project Deal Experiment, complete.

You've covered the full ten-section breakdown for this primitive, definition, mechanics, code, false positives, comparison, decision tree, exam patterns, and FAQ. One technical primitive down on the path to CCA-F.

More platforms →