D4.9 · Domain 4 · Prompt Engineering · 20% of CCA-F

Prompt Engineering Techniques.



01 · Summary

TLDR

Prompt engineering is a craft of seven techniques that turn a brittle one-shot prompt into a production-grade contract: few-shot examples, iterative refinement against an eval suite, anti-fabrication schemas, structured templates, explicit do/don't constraints, output anchoring via tools, and test-driven prompt edits. The exam trap is treating any one technique as 'the answer'; production prompts compose all seven.

Core techniques: 7
Iterations to converge: 5-15
Eval suite size: 20-50 cases
Few-shot example sweet spot: 2-5 pairs
Exam domain: D4

02 · Definition

What it is

Prompt engineering is the discipline of shaping Claude's output by writing a prompt, measuring it against test cases, observing where it fails, and tightening the prompt until the eval score plateaus. It is not writing one clever sentence. A production prompt converges over 5-15 iterations against a frozen suite of 20-50 cases, of which 3-5 are known-failure cases that anchor the regression boundary. Per ASC-A01 Course 11 §16, prompt v1 scoring 7.66 climbs to 8.7 in v2 only because the eval gave a quantitative delta to chase.

The seven load-bearing techniques are:

  • Few-shot prompting: show 2-5 input-output pairs so Claude can pattern-match the harder cases.
  • Iterative refinement: the Draft → Test → Observe → Refine → Anchor loop.
  • Anti-fabrication patterns: nullable fields, an `unclear` enum value, and citation-required answers.
  • Structured templates: role + objective + tone + tools + constraints + format + escalation.
  • Constraint elicitation: concrete do/don't lists with worked examples.
  • Output anchoring: force a tool_use schema instead of asking for 'output JSON' in prose.
  • Test-driven prompting: validate every change against a frozen eval suite, and roll back anything that is not regression-free.
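As a rough illustration of the structured-template technique, the sketch below assembles a system prompt from the seven slots (role, objective, tone, tools, constraints, format, escalation). The `PromptTemplate` class, its field names, and the section labels are hypothetical, chosen for this example and not taken from any Anthropic API.

```python
from dataclasses import dataclass, field


@dataclass
class PromptTemplate:
    """Hypothetical container for the seven template slots."""
    role: str
    objective: str
    tone: str
    tools: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    output_format: str = ""
    escalation: str = ""

    def render(self) -> str:
        sections = [
            ("ROLE", self.role),
            ("OBJECTIVE", self.objective),
            ("TONE", self.tone),
            ("TOOLS", "\n".join(f"- {t}" for t in self.tools)),
            ("CONSTRAINTS", "\n".join(f"- {c}" for c in self.constraints)),
            ("FORMAT", self.output_format),
            ("ESCALATION", self.escalation),
        ]
        # Skip empty slots so the prompt never carries blank headings.
        return "\n\n".join(f"{name}:\n{body}" for name, body in sections if body)


refund_prompt = PromptTemplate(
    role="You are a refund-support agent.",
    objective="Decide whether a ticket qualifies for a refund.",
    tone="Plain and courteous.",
    tools=["extract_refund_decision"],
    constraints=["DO cite the order ID.", "DON'T promise a refund amount."],
    output_format="Always answer through the extract_refund_decision tool.",
    escalation="Escalate when the order ID is missing.",
).render()
```

Keeping the slots as separate fields means a prompt PR can change one layer (say, the constraints) while the eval suite confirms the others still hold.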

Every Domain 4 question on the exam is a composition question, never a single-technique question, because real production prompts are layered. A refund agent uses a structured system-prompt template, three few-shot examples covering the sarcasm-style edge case, an `unclear` enum option for the refund-reason field, a forced tool_choice for output anchoring, and a 50-case eval that gates every prompt PR. Strip any one layer and the leak rate climbs above 5%. In production testing, natural-language 'output JSON' prompts leak structure ~15% of the time; tool-anchored prompts leak 0%.

03 · Mechanics

How it works

Iterative refinement is the spine. You start with a baseline prompt that almost works, build a small eval set (20-50 cases is the sweet spot, of which 3-5 are deliberate failure cases you have already seen Claude get wrong), and run the prompt against the suite to get a score. The score is the only signal that matters; a subjective 'this feels better' is theater. Per ASC-A01 Course 12 §22, the score itself isn't inherently good or bad; what matters is whether you can improve it by refining your prompts. Each refinement targets the lowest-scoring case: you write a tighter instruction or add an example, then re-run the entire suite to make sure no other case regressed.
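Two small pieces of that loop can be sketched in plain Python: choosing the case to refine against next, and deciding when the suite score has plateaued. The helper names, the three-version window, and the 0.1 minimum gain are assumptions made for this example. In `suite_history`, only the first two scores (7.66 and 8.7) come from the text; the rest are invented for illustration.

```python
def next_target(case_scores: dict[str, float]) -> str:
    """Return the ID of the lowest-scoring case, the one to refine against next."""
    return min(case_scores, key=case_scores.get)


def has_plateaued(history: list[float], window: int = 3, min_gain: float = 0.1) -> bool:
    """True when the last `window` versions gained less than `min_gain` in total."""
    if len(history) <= window:
        return False  # not enough versions yet to judge a plateau
    return history[-1] - history[-1 - window] < min_gain


# Hypothetical per-case scores for one prompt version.
case_scores = {"happy_path": 9.0, "sarcasm": 4.0, "missing_order_id": 7.0}

# Suite score per prompt version; values after 8.7 are illustrative only.
suite_history = [7.66, 8.7, 8.9, 8.95, 8.95, 8.96]

target = next_target(case_scores)
done = has_plateaued(suite_history)
```

With these numbers the next edit would target the sarcasm case, and the history shows the last three versions adding only a few hundredths of a point, which is the signal to stop tuning and anchor the prompt.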

Few-shot prompting is the highest-ROI lever inside the loop. Per ASC-A01 Course 12 §29, you wrap each example in <sample_input> and <ideal_output> XML tags, choose 2-5 examples that cover the failure modes (sarcasm, ambiguity, edge formats), and optionally add a one-line explanation of *why* each example is ideal. Adding 2-3 such examples reduces misclassification by 30-60%, the biggest accuracy gain available from a single edit. The common mistake is using examples Claude already gets right; always pick examples that match the cases you have observed Claude fail on.
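A minimal sketch of sourcing few-shot examples from observed failures follows. It assumes eval results are stored as dicts with `input`, `expected`, and `passed` keys, which is a hypothetical shape and not a format from the course. The function keeps only failed cases, caps the count at five, and wraps each one in the two XML tags.

```python
def build_few_shot_block(results: list[dict], max_examples: int = 5) -> str:
    """Render failed eval cases as <sample_input>/<ideal_output> pairs."""
    # Only cases the current prompt gets wrong are worth showing as examples.
    failures = [r for r in results if not r["passed"]][:max_examples]
    pairs = [
        f"<sample_input>{r['input']}</sample_input>\n"
        f"<ideal_output>{r['expected']}</ideal_output>"
        for r in failures
    ]
    return "\n\n".join(pairs)


# Hypothetical eval results for a sentiment prompt.
results = [
    {"input": "Great, another delayed flight.", "expected": "negative", "passed": False},
    {"input": "Lovely weather today!", "expected": "positive", "passed": True},
]
block = build_few_shot_block(results)
```

The passing case is deliberately left out: an example Claude already handles adds tokens without moving the score.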

Output anchoring and anti-fabrication are where natural-language prompting reaches its ceiling. Asking 'please output JSON' in prose leaks structure roughly 15% of the time under load. The fix is tool_use with a JSON schema and a forced tool_choice (`{"type": "tool", "name": "<tool_name>"}`): the response arrives as a tool_use block shaped by the schema instead of free-form prose, and this guide treats structure as guaranteed at 100%. Fabrication is a separate problem, because schemas guarantee shape, not truth. Per the structured-data-extraction scenario, when a source is genuinely silent, give the model two honest exits: nullable fields (`type: ['string', 'null']`) and an `unclear` / `not_provided` enum value. Without these escape hatches, required-string fields force fabrication and the leak rate climbs above 5%.
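To make the two honest exits concrete, the hypothetical helpers below rewrite a JSON Schema property so that it also accepts `null`, or extend an enum with an `unclear` value. They operate on plain dicts and are a sketch under those assumptions, not part of the Anthropic SDK.

```python
import copy


def make_nullable(prop: dict) -> dict:
    """Return a copy of a JSON Schema property that also accepts null."""
    out = copy.deepcopy(prop)
    types = out["type"] if isinstance(out["type"], list) else [out["type"]]
    if "null" not in types:
        types = types + ["null"]
    out["type"] = types
    return out


def add_unclear(prop: dict, exit_value: str = "unclear") -> dict:
    """Return a copy of an enum property with an explicit 'don't know' value."""
    out = copy.deepcopy(prop)
    if exit_value not in out["enum"]:
        out["enum"] = out["enum"] + [exit_value]
    return out


# Brittle versions: the model must always produce some string.
brittle_reason = {"type": "string"}
brittle_decision = {"type": "string", "enum": ["approve", "deny", "escalate"]}

# Honest-exit versions: a silent source can map to null or 'unclear'.
honest_reason = make_nullable(brittle_reason)
honest_decision = add_unclear(brittle_decision)
```

A field made nullable this way should also be left out of the schema's `required` list, or the model is still pushed to supply a value.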

04 · In production

Where you'll see it

Refund agent prompt PR gate

Every prompt change PR runs against a 60-case golden suite (20 happy path, 20 edge, 10 escalations, 10 known-failure). The CI fails the PR if the average score drops or any previously-passing case regresses. Per ASC-A01 Course 11 §15, the alternative, 'test once and decide it's good enough', carries a significant risk of breaking in production. Three iterations typically take a prompt from 80% to 95%.
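The gate described here could look roughly like the following. It assumes each run yields a per-case score between 0 and 1; the function name `gate_prompt_pr`, the 0.5 pass mark, and the case IDs are assumptions for illustration, not a real CI API.

```python
def gate_prompt_pr(
    baseline: dict[str, float],
    candidate: dict[str, float],
    pass_mark: float = 0.5,
) -> tuple[bool, list[str]]:
    """Return (ok, regressed_cases) for a prompt-change PR."""
    # A regression is a case that passed on the baseline and fails on the candidate.
    regressed = sorted(
        case for case, score in baseline.items()
        if score >= pass_mark and candidate.get(case, 0.0) < pass_mark
    )
    avg_before = sum(baseline.values()) / len(baseline)
    avg_after = sum(candidate.get(case, 0.0) for case in baseline) / len(baseline)
    ok = not regressed and avg_after >= avg_before
    return ok, regressed


baseline = {"happy_01": 1.0, "edge_07": 1.0, "known_fail_03": 0.0}
improved = {"happy_01": 1.0, "edge_07": 1.0, "known_fail_03": 1.0}
broken = {"happy_01": 1.0, "edge_07": 0.0, "known_fail_03": 1.0}

improved_result = gate_prompt_pr(baseline, improved)
broken_result = gate_prompt_pr(baseline, broken)
```

The `broken` run keeps the same average as the baseline yet still fails the gate, which is why the per-case regression check sits alongside the average.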

Sarcasm-aware sentiment classifier

A social-listening team's first prompt scored 71% on a sarcasm-heavy suite. Adding three <sample_input> / <ideal_output> pairs that explicitly covered Plan-9-style ironic praise lifted the score to 88%. The team picked the examples directly from the eval's lowest-scoring cases, which per ASC-A01 Course 12 §29 is the canonical way to source few-shot examples. No model upgrade required.

Contract-extraction anti-fabrication

A legal-ops extractor was fabricating termination_clause text when the contract was silent. The fix was schema-level, not prompt-level: nullable strings plus an unclear enum option in the tool input schema. Per the structured-data-extraction scenario doc, fabrication rate dropped from 8% to under 0.5% once the model had an honest exit. The system prompt remained almost unchanged.

05 · Implementation

Code examples

Few-shot prompt with iterative-eval harness
from anthropic import Anthropic
import json

client = Anthropic()

SYSTEM = """You classify tweet sentiment as positive, negative, or unclear.

DO emit one of: positive | negative | unclear.
DON'T add commentary or explanation.
DON'T pick positive when the tone is sarcastic.

Here are example input-output pairs:

<sample_input>I love how my flight was delayed three hours.</sample_input>
<ideal_output>negative</ideal_output>

<sample_input>Best coffee I've had all week, genuinely.</sample_input>
<ideal_output>positive</ideal_output>

<sample_input>idk what to make of this product</sample_input>
<ideal_output>unclear</ideal_output>"""

def classify(tweet: str) -> str:
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=10,
        system=SYSTEM,
        messages=[{"role": "user", "content": tweet}],
    )
    return resp.content[0].text.strip().lower()

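def run_eval(cases_path: str) -> float:
    # One JSON object per line: {"tweet": ..., "expected": ...}; skip blank lines.
    with open(cases_path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f if line.strip()]
    # Exact-match grading: the model's label must equal the expected label.
    correct = sum(1 for c in cases if classify(c["tweet"]) == c["expected"])
    score = correct / len(cases)
    print(f"Score: {score:.2%} ({correct}/{len(cases)})")
    return score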

# Iterate: edit SYSTEM, re-run, compare. Ship only on regression-free improvement.
v1_score = run_eval("sentiment_cases.jsonl")
Few-shot pairs cover the sarcasm failure case explicitly. The eval harness reports a single number per prompt version; iterate until the score plateaus.

Output anchoring with anti-fabrication schema
from anthropic import Anthropic

client = Anthropic()

# Anti-fabrication: nullable + 'unclear' enum exit.
EXTRACT_TOOL = {
    "name": "extract_refund_decision",
    "description": "Extract refund decision from a customer ticket.",
    "input_schema": {
        "type": "object",
        "properties": {
            "decision": {
                "type": "string",
                "enum": ["approve", "deny", "escalate", "unclear"],
            },
            "refund_reason": {"type": ["string", "null"]},
            "amount_usd": {"type": ["number", "null"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["decision", "confidence"],
    },
}

def extract(ticket: str) -> dict:
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        tools=[EXTRACT_TOOL],
        tool_choice={"type": "tool", "name": "extract_refund_decision"},
        messages=[{"role": "user", "content": ticket}],
    )
    for block in resp.content:
        if block.type == "tool_use":
            return block.input
    raise RuntimeError("forced tool_choice did not fire")

# 'unclear' + nullable fields let the model say 'I don't know' honestly.
result = extract("Customer wants a refund. No order ID provided.")
# Model output varies; one plausible result (as a Python dict):
# {"decision": "escalate", "refund_reason": None, "amount_usd": None, "confidence": 0.3}
Tool-anchored output is 100% structured. The 'unclear' enum and nullable fields cut fabrication on silent sources from 8% to under 0.5%.
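Downstream code then has to treat the honest exits as real outcomes and not as parse errors. A minimal sketch, assuming the dict shape returned by `extract` above; the `route` function, the queue names, and the 0.5 confidence floor are hypothetical.

```python
def route(result: dict, min_confidence: float = 0.5) -> str:
    """Pick a handling queue for one extracted refund decision."""
    decision = result.get("decision", "unclear")
    if decision in ("unclear", "escalate"):
        return "human_review"
    if result.get("confidence", 0.0) < min_confidence:
        return "human_review"
    if decision == "approve" and result.get("amount_usd") is None:
        # Approving without an amount would mean inventing one downstream.
        return "human_review"
    return "auto_approve" if decision == "approve" else "auto_deny"


queue = route({"decision": "approve", "confidence": 0.9, "amount_usd": None})
```

Here an approval with no amount goes to a person, so the null the schema allowed is never silently replaced by a guessed number.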
06 · Distractor patterns

Looks right, isn't

Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.

Looks right

Adding 'output valid JSON, do not include any other text' to the system prompt to enforce structure.

Actually wrong

In production testing, natural-language prompting for structure leaks ~15% of the time under load. The only structural guarantee is tool_use with a JSON schema and tool_choice: forced. Prompt instructions are advice, not enforcement.

Looks right

Spending three days hand-tuning the wording of the system prompt to fix a 12% accuracy gap.

Actually wrong

Per ASC-A01 Course 12 §29, adding 2-3 well-chosen few-shot examples typically yields a 30-60% misclassification reduction in a single edit. Prose tweaks deliver diminishing returns; concrete examples deliver step-changes.

Looks right

Using Claude itself to grade 100% of your eval cases because it scales better than humans.

Actually wrong

Per the evaluation concept (docid #f86eb3), model-based grading is not reproducible across model versions and fails on deterministic checks (tool sequences, JSON schemas). Use code-based grading for anything measurable; reserve model grading for subjective qualities like tone.

Looks right

Required-string field for refund_reason so the schema always returns a value the downstream code can parse.

Actually wrong

Required strings force fabrication when the source is silent. Per the structured-data-extraction scenario, fabrication climbs above 5% without an unclear enum option or a nullable type. Always give the model an honest exit.

Looks right

Iterating the prompt 20+ times against the same 10 hand-picked test cases until the score reaches 100%.

Actually wrong

Per the evaluation concept §Q3, that's overfitting. Score plateau on a tiny fixed suite masks production blind spots. Grow the suite to 50+ externally-validated cases and pair with shadow-mode production evals.
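For the grading row above, a code-based grader can be as plain as the sketch below. It checks that an output parses as a JSON object with the required keys, and that the observed tool sequence matches the expected one. The function names and the tool names in the usage lines are hypothetical.

```python
import json


def grade_json_output(raw: str, required_keys: set[str]) -> bool:
    """Deterministic check: output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def grade_tool_sequence(observed: list[str], expected: list[str]) -> bool:
    """Deterministic check: tools were called in exactly the expected order."""
    return observed == expected


clean = grade_json_output('{"decision": "deny", "confidence": 0.9}', {"decision", "confidence"})
leaky = grade_json_output('Sure! Here is the JSON: {"decision": "deny"}', {"decision"})
ordered = grade_tool_sequence(["lookup_order", "issue_refund"], ["lookup_order", "issue_refund"])
```

Both checks return the same verdict on every run and across model versions, which is the reproducibility that model-based grading lacks.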

07 · When to use

Decision tree

01

Are you anchoring the output format?

Yes: Use tool_use with a JSON schema and tool_choice: forced. Structure is guaranteed at 100%, no parsing risk.
No: Prompt-only 'output JSON' leaks ~15% under load. Switch to tool-anchored output before iterating further.

02

Do you have a frozen 20-50 case eval suite, of which 3-5 are known-failure cases?

Yes: Iterate: edit prompt, re-run suite, ship only on regression-free improvement. Per ASC-A01 Course 11 §16, target a measurable score delta per version.
No: Build the suite before tuning the prompt. Even a 10-case suite catches 80% of regressions; without it, every change is a coin flip.

03

Is your accuracy gap concentrated in specific edge cases (sarcasm, ambiguity, format variance)?

Yes: Add 2-3 few-shot <sample_input> / <ideal_output> examples that cover those exact cases. Per ASC-A01 Course 12 §29, this is the highest-ROI single edit.
No: If the gap is uniform, your problem is the structured template (role + objective + format + constraints), not the examples. Rewrite the system prompt skeleton first.

04

Is the model fabricating values when the source is silent?

Yes: Schema-level fix: nullable types and an unclear / not_provided enum option. Per the structured-data-extraction scenario, this drops fabrication from 8% to under 0.5%.
No: If fabrication only happens under load, check whether your schema requires fields the source rarely contains; relax the required list.

08 · On the exam

Question patterns


54 V2 questions are wired to this concept. Six representative questions:

  • Two of your tools have similar names (fetch_data and get_data). The model picks the wrong one 30% of the time. What is the best first fix?
  • A tool's input_schema is {data: any} and Claude returns null. What happened?
  • Why are example patterns the highest-ROI section of a system prompt?
  • Does bolding or capitalizing important facts in the prompt boost the model's attention to them?
  • If the prompt is just "be helpful", which D in the 4D Framework is missing?
  • Description in the 4D Framework has three sub-parts: Product, Process, and Performance. What does each one cover?

48 additional questions for this concept live in the practice pillar.

09 · FAQ

Frequently asked

Why XML tags for few-shot examples instead of plain bullets?
Per ASC-A01 Course 12 §29, <sample_input> and <ideal_output> tags create unambiguous boundaries Claude can attend to. Plain bullets blur into the surrounding instructions.
How many few-shot examples is too many?
Past 5-7 examples you hit attention dilution and token cost without meaningful gains. Curate the set; replace weak examples instead of adding more.
Should the eval grader be Claude or code?
Use code-based grading for deterministic checks (tool sequence, JSON schema). Use model-based grading only for subjective qualities (tone, completeness). Per the evaluation concept page, code grading is reproducible; model grading is not.
Can I skip the eval suite if my use case is internal-only?
No. Without an eval, every prompt edit is a coin flip. Even a 10-case suite catches 80% of regressions. Build the suite before you tune the prompt.
Where do constraints belong, system prompt or tool descriptions?
Hard constraints (for example, refunds over $500 must escalate) belong in tool code; the system prompt restates them as guidance. Per the system-prompts concept, tool descriptions take precedence on routing, and the system prompt is linguistic guidance.
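A minimal sketch of the split described in the last answer: the $500 threshold is enforced in tool code, so no prompt wording can bypass it. `issue_refund` and its return shape are hypothetical; only the rule that refunds over $500 escalate comes from the text.

```python
ESCALATION_THRESHOLD_USD = 500.0


def issue_refund(order_id: str, amount_usd: float) -> dict:
    """Hypothetical tool handler: the limit is checked in code, not in the prompt."""
    if amount_usd <= 0:
        return {"status": "rejected", "reason": "amount must be positive"}
    if amount_usd > ESCALATION_THRESHOLD_USD:
        # The model may request this refund, but the tool refuses to execute it.
        return {"status": "escalated", "reason": "refund exceeds $500 limit"}
    return {"status": "refunded", "order_id": order_id, "amount_usd": amount_usd}


large = issue_refund("A-1001", 750.0)
small = issue_refund("A-1002", 120.0)
```

The system prompt can still say "escalate refunds over $500" so the model explains the outcome gracefully, but the guarantee lives in the handler.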
10 · Practice with AI

Work this with your AI

Work this concept hands-on with Claude Code, Codex, or claude.ai. Copy a prompt, paste it into your assistant, and practise in tandem. Each one keeps you active (explain it back, get drilled, or build) rather than just reading.

  • Drill it like the exam (scenario MCQs)
    Practice in the exam's scenario-MCQ format with trap awareness.
  • Explain it back (Feynman)
    Build durable, transferable understanding of a concept you can half-state.
  • Test me, adapting the difficulty
    Active recall practice on a concept you think you know.
  • Check my prerequisites first
    Before studying a concept that keeps not sticking.
  • Find the high-leverage 20%
    When a domain feels too big and you are short on time.
Self-check

Test yourself

Three diagnostic questions on this primitive.

Q1. Your extraction prompt says 'output valid JSON'. 12% of responses include explanatory prose. Best fix?
Stop prompting for JSON; use tool_use with a JSON schema and `tool_choice: forced`. The most-tested distractor is "add 'do not include any text outside the JSON' to the system prompt". In production testing, prompt-only is probabilistic (~15% leak); tool-anchored output is structurally guaranteed (0% leak).

Q2. Your sentiment classifier misses sarcastic tweets ~30% of the time. The cheapest fix?
Add 2-3 few-shot examples that specifically demonstrate sarcasm, wrapped in <sample_input> / <ideal_output> XML tags. Per ASC-A01 Course 12 §29, this technique reduces misclassification by 30-60%. The distractor is "upgrade to a larger model"; the right answer is one prompt edit with concrete failure-case examples.

Q3. A teammate edited the system prompt and shipped to production. A week later, accuracy dropped 6 points. What was missing from their workflow?
An eval suite with regression detection. Per ASC-A01 Course 11 §16, every prompt edit must run against a 20-50 case suite to compare scores (e.g. v1 = 7.66, v2 = 8.7) before merging. Distractor: "more A/B testing in production". Right answer: gate prompt PRs on offline eval delta, not on subjective review.

Last reviewed: 2026-05-04·Refresh cadence: monthly