# Prompt Engineering Techniques

> Prompt engineering is a craft of seven techniques that turn a brittle one-shot prompt into a production-grade contract: few-shot examples, iterative refinement against an eval suite, anti-fabrication schemas, structured templates, explicit do-don't constraints, output anchoring via tools, and test-driven prompt edits. The exam trap is treating any one technique as 'the answer'; production prompts compose all seven.

**Domain:** D4 · Prompt Engineering (20% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/prompt-engineering-techniques
**Last reviewed:** 2026-05-04

## Quick stats

- **Core techniques:** 7
- **Iterations to converge:** 5-15
- **Eval suite size:** 20-50 cases
- **Few-shot example sweet spot:** 2-5 pairs
- **Exam domain:** D4

## What it is

Prompt engineering is the discipline of shaping Claude's output by writing a prompt, measuring it against test cases, observing where it fails, and tightening the prompt until the eval score plateaus. It is not writing one clever sentence. A production prompt converges over 5-15 iterations against a frozen suite of 20-50 cases, of which 3-5 are known-failure cases that anchor the regression boundary. Per ASC-A01 Course 11 §16, a v1 prompt scoring 7.66 climbs to 8.7 in v2 only because the eval gives you a quantitative delta to chase.

The seven load-bearing techniques:

- **Few-shot prompting:** show 2-5 input-output pairs so Claude can pattern-match the harder cases.
- **Iterative refinement:** the Draft → Test → Observe → Refine → Anchor loop.
- **Anti-fabrication patterns:** nullable fields, unclear enum values, citation-required answers.
- **Structured templates:** role + objective + tone + tools + constraints + format + escalation.
- **Constraint elicitation:** concrete do-don't lists with worked examples.
- **Output anchoring:** force a tool_use schema; do not ask 'output JSON' in prose.
- **Test-driven prompting:** every change validated against a frozen eval suite; regression-free or roll back.

The reason every Domain 4 question on the exam is a composition question, never a single-technique question, is that real production prompts are layered. A refund agent uses a structured system-prompt template, three few-shot examples covering the sarcasm-style edge case, an unclear enum option for the refund-reason field, a forced tool_choice for output anchoring, and a 50-case eval that gates every prompt PR. Strip any one layer and the failure rate climbs above 5%. Per ACP-T03 §4.4, natural-language 'output JSON' prompts leak structure ~15% of the time; tool-anchored prompts leak 0%.
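
The structured template is easiest to see as a skeleton. Here is a minimal sketch for a hypothetical refund agent; AcmeStore, lookup_order, and issue_refund are placeholder names, and the slot wording is illustrative rather than a canonical format:

```python
# Hypothetical seven-slot skeleton: role, objective, tone, tools,
# constraints, format, escalation. Contents are placeholders.
REFUND_AGENT_SYSTEM = """You are a refund-support agent for AcmeStore.

Objective: resolve refund tickets in one pass or escalate.
Tone: concise and professional; never speculate about policy.
Tools: call lookup_order and issue_refund only; never invent order data.

Constraints:
DO escalate any refund over $500.
DON'T approve a refund without an order_id.

Format: respond only via the issue_refund tool, never in prose.
Escalation: when policy is ambiguous, set decision to "escalate"."""
```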

## How it works

Iterative refinement is the spine. You start with a baseline prompt that almost works, build a small eval set (20-50 cases is the sweet spot, of which 3-5 are deliberate failure cases you have already seen Claude get wrong), and run the prompt against the suite to get a score. Score is the only signal that matters; subjective 'this feels better' is theater. Per ASC-A01 Course 12 §22, the score itself isn't inherently good or bad; what matters is whether you can improve it by refining your prompts. Each refinement targets the lowest-scoring case: write a tighter instruction or add an example, then re-run the entire suite to make sure no other case regressed.
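
A minimal sketch of that regression gate, assuming each eval case carries an `id` field in the same JSONL shape used in the Code examples below (the helper names are illustrative, not a canonical harness):

```python
import json

def run_cases(classify_fn, cases_path: str) -> dict[str, bool]:
    """Pass/fail per case id, so two prompt versions can be diffed case-by-case."""
    with open(cases_path) as f:
        cases = [json.loads(line) for line in f]
    return {c["id"]: classify_fn(c["tweet"]) == c["expected"] for c in cases}

def regression_free(old: dict[str, bool], new: dict[str, bool]) -> bool:
    """Gate: no case that passed under the old prompt may fail under the new one."""
    regressed = [cid for cid, ok in old.items() if ok and not new.get(cid, False)]
    if regressed:
        print(f"Regressed cases: {regressed} -- roll back the edit")
    return not regressed

# Ship only on regression-free improvement:
# old, new = run_cases(classify_v1, PATH), run_cases(classify_v2, PATH)
# ship = regression_free(old, new) and sum(new.values()) > sum(old.values())
```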

Few-shot prompting is the highest-ROI lever inside the loop. Per ASC-A01 Course 12 §29, you wrap each example in <sample_input> and <ideal_output> XML tags, choose 2-5 examples that cover the failure modes (sarcasm, ambiguity, edge formats), and optionally add a one-line explanation of *why* the example is ideal. Adding 2-3 well-chosen examples reduces misclassification by 30-60%, the largest single-edit accuracy gain you can make. The mistake is using examples Claude already gets right; always pick examples that match the cases you have observed Claude fail on.
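
Because the best examples come from observed failures, they can be harvested mechanically from an eval run. A sketch under the same assumed case format; `harvest_few_shot` is an illustrative helper, not part of any SDK:

```python
def harvest_few_shot(cases: list[dict], classify_fn, k: int = 3) -> str:
    """Wrap the cases Claude currently gets wrong as few-shot pairs."""
    failures = [c for c in cases if classify_fn(c["tweet"]) != c["expected"]]
    pairs = [
        f"<sample_input>{c['tweet']}</sample_input>\n"
        f"<ideal_output>{c['expected']}</ideal_output>"
        for c in failures[:k]  # 2-5 pairs is the sweet spot
    ]
    return "\n\n".join(pairs)

# Paste the result into the system prompt, then re-run the full suite.
```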

Output anchoring and anti-fabrication are where natural-language prompting reaches its ceiling. Asking 'please output JSON' in prose leaks structure roughly 15% of the time under load. The fix is tool_use with a JSON schema and a forced tool_choice: the API constrains token generation to match the schema, so structure is guaranteed at 100%. Fabrication is a separate problem; schemas guarantee shape, not truth. Per ACP-T03 §4.4 and the structured-data-extraction scenario doc, when a source is genuinely silent, give the model two honest exits: nullable fields (type: ['string', 'null']) and an unclear / not_provided enum value. Without these escape hatches, required-string fields force fabrication and the fabrication rate climbs above 5%.

## Where you'll see it in production

### Refund agent prompt PR gate

Every prompt change PR runs against a 60-case golden suite (20 happy path, 20 edge, 10 escalations, 10 known-failure). The CI fails the PR if the average score drops or any previously-passing case regresses. Per ASC-A01 Course 11 §15, the alternative, 'test once and decide it's good enough', carries a significant risk of breaking in production. Three iterations typically take a prompt from 80% to 95%.

### Sarcasm-aware sentiment classifier

A social-listening team's first prompt scored 71% on a sarcasm-heavy suite. Adding three <sample_input> / <ideal_output> pairs that explicitly covered Plan-9-style ironic praise lifted the score to 88%. The team picked the examples directly from the eval's lowest-scoring cases, which per ASC-A01 Course 12 §29 is the canonical way to source few-shot examples. No model upgrade required.

### Contract-extraction anti-fabrication

A legal-ops extractor was fabricating termination_clause text when the contract was silent. The fix was schema-level, not prompt-level: nullable strings plus an unclear enum option in the tool input schema. Per the structured-data-extraction scenario doc, fabrication rate dropped from 8% to under 0.5% once the model had an honest exit. The system prompt remained almost unchanged.
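
A sketch of what that schema-level fix might look like; the field names are illustrative, and the pattern is the same nullable-plus-enum exit shown in the Code examples section below:

```python
# Hypothetical input_schema fragment for the legal-ops extractor.
# null and "unclear" are the two honest exits for a silent contract.
TERMINATION_SCHEMA = {
    "type": "object",
    "properties": {
        "termination_clause": {"type": ["string", "null"]},
        "clause_status": {
            "type": "string",
            "enum": ["present", "absent", "unclear"],
        },
    },
    "required": ["clause_status"],  # the clause text itself is NOT required
}
```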

## Code examples

### Few-shot prompt with iterative-eval harness

**Python:**

```python
from anthropic import Anthropic
import json

client = Anthropic()

SYSTEM = """You classify tweet sentiment as positive, negative, or unclear.

DO emit one of: positive | negative | unclear.
DON'T add commentary or explanation.
DON'T pick positive when the tone is sarcastic.

Here are example input-output pairs:

<sample_input>I love how my flight was delayed three hours.</sample_input>
<ideal_output>negative</ideal_output>

<sample_input>Best coffee I've had all week, genuinely.</sample_input>
<ideal_output>positive</ideal_output>

<sample_input>idk what to make of this product</sample_input>
<ideal_output>unclear</ideal_output>"""

def classify(tweet: str) -> str:
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=10,
        system=SYSTEM,
        messages=[{"role": "user", "content": tweet}],
    )
    return resp.content[0].text.strip().lower()

def run_eval(cases_path: str) -> float:
    with open(cases_path) as f:
        cases = [json.loads(line) for line in f]
    correct = sum(1 for c in cases if classify(c["tweet"]) == c["expected"])
    score = correct / len(cases)
    print(f"Score: {score:.2%} ({correct}/{len(cases)})")
    return score

# Iterate: edit SYSTEM, re-run, compare. Ship only on regression-free improvement.
v1_score = run_eval("sentiment_cases.jsonl")
```

> Few-shot pairs cover the sarcasm failure case explicitly. The eval harness reports a single number per prompt version; iterate until the score plateaus.

**TypeScript:**

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

const client = new Anthropic();

const SYSTEM = `You classify tweet sentiment as positive, negative, or unclear.

DO emit one of: positive | negative | unclear.
DON'T add commentary or explanation.
DON'T pick positive when the tone is sarcastic.

<sample_input>I love how my flight was delayed three hours.</sample_input>
<ideal_output>negative</ideal_output>

<sample_input>Best coffee I've had all week, genuinely.</sample_input>
<ideal_output>positive</ideal_output>

<sample_input>idk what to make of this product</sample_input>
<ideal_output>unclear</ideal_output>`;

async function classify(tweet: string): Promise<string> {
  const resp = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 10,
    system: SYSTEM,
    messages: [{ role: "user", content: tweet }],
  });
  const block = resp.content[0];
  return block.type === "text" ? block.text.trim().toLowerCase() : "";
}

async function runEval(casesPath: string): Promise<number> {
  const cases = readFileSync(casesPath, "utf8")
    .trim()
    .split("\n")
    .map((l) => JSON.parse(l) as { tweet: string; expected: string });
  let correct = 0;
  for (const c of cases) {
    const out = await classify(c.tweet);
    if (out === c.expected) correct += 1;
  }
  const score = correct / cases.length;
  console.log(`Score: ${(score * 100).toFixed(1)}% (${correct}/${cases.length})`);
  return score;
}

await runEval("sentiment_cases.jsonl");
```

> Same shape in TypeScript. Each prompt revision becomes a new SYSTEM constant; you compare scores side-by-side.

### Output anchoring with anti-fabrication schema

**Python:**

```python
from anthropic import Anthropic

client = Anthropic()

# Anti-fabrication: nullable + 'unclear' enum exit.
EXTRACT_TOOL = {
    "name": "extract_refund_decision",
    "description": "Extract refund decision from a customer ticket.",
    "input_schema": {
        "type": "object",
        "properties": {
            "decision": {
                "type": "string",
                "enum": ["approve", "deny", "escalate", "unclear"],
            },
            "refund_reason": {"type": ["string", "null"]},
            "amount_usd": {"type": ["number", "null"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["decision", "confidence"],
    },
}

def extract(ticket: str) -> dict:
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        tools=[EXTRACT_TOOL],
        tool_choice={"type": "tool", "name": "extract_refund_decision"},
        messages=[{"role": "user", "content": ticket}],
    )
    for block in resp.content:
        if block.type == "tool_use":
            return block.input
    raise RuntimeError("forced tool_choice did not fire")

# 'unclear' + nullable fields let the model say 'I don't know' honestly.
result = extract("Customer wants a refund. No order ID provided.")
# Expected: {"decision": "escalate", "refund_reason": null, "amount_usd": null, "confidence": 0.3}
```

> Tool-anchored output is 100% structured. The 'unclear' enum and nullable fields cut fabrication on silent sources from 8% to under 0.5%.

**TypeScript:**

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const EXTRACT_TOOL: Anthropic.Tool = {
  name: "extract_refund_decision",
  description: "Extract refund decision from a customer ticket.",
  input_schema: {
    type: "object",
    properties: {
      decision: {
        type: "string",
        enum: ["approve", "deny", "escalate", "unclear"],
      },
      refund_reason: { type: ["string", "null"] },
      amount_usd: { type: ["number", "null"] },
      confidence: { type: "number", minimum: 0, maximum: 1 },
    },
    required: ["decision", "confidence"],
  },
};

async function extract(ticket: string): Promise<unknown> {
  const resp = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 512,
    tools: [EXTRACT_TOOL],
    tool_choice: { type: "tool", name: "extract_refund_decision" },
    messages: [{ role: "user", content: ticket }],
  });
  const tool = resp.content.find(
    (b): b is Anthropic.ToolUseBlock => b.type === "tool_use",
  );
  if (!tool) throw new Error("forced tool_choice did not fire");
  return tool.input;
}
```

> Same anti-fabrication pattern in TypeScript. The schema is the contract; prompt-level pleas to 'be honest' are not a substitute.

## Looks-right vs actually-wrong

| Looks right | Actually wrong |
|---|---|
| Adding 'output valid JSON, do not include any other text' to the system prompt to enforce structure. | Natural-language prompting for structure leaks ~15% of the time under load. Per ACP-T03 §4.4, the only structural guarantee is tool_use with a JSON schema and tool_choice: forced. Prompt instructions are advice, not enforcement. |
| Spending three days hand-tuning the wording of the system prompt to fix a 12% accuracy gap. | Per ASC-A01 Course 12 §29, adding 2-3 well-chosen few-shot examples typically yields a 30-60% misclassification reduction in a single edit. Prose tweaks deliver diminishing returns; concrete examples deliver step-changes. |
| Using Claude itself to grade 100% of your eval cases because it scales better than humans. | Per the evaluation concept (docid #f86eb3), model-based grading is not reproducible across model versions and fails on deterministic checks (tool sequences, JSON schemas). Use code-based grading for anything measurable; reserve model grading for subjective qualities like tone. |
| Required-string field for refund_reason so the schema always returns a value the downstream code can parse. | Required strings force fabrication when the source is silent. Per the structured-data-extraction scenario, fabrication climbs above 5% without an unclear enum option or a nullable type. Always give the model an honest exit. |
| Iterating the prompt 20+ times against the same 10 hand-picked test cases until the score reaches 100%. | Per the evaluation concept §Q3, that's overfitting. Score plateau on a tiny fixed suite masks production blind spots. Grow the suite to 50+ externally-validated cases and pair with shadow-mode production evals. |

## Decision tree

1. **Are you anchoring the output format?**
   - **Yes:** Use tool_use with a JSON schema and tool_choice: forced. Structure is guaranteed at 100%, with no parsing risk.
   - **No:** Prompt-only 'output JSON' leaks ~15% under load. Switch to tool-anchored output before iterating further.

2. **Do you have a frozen 20-50 case eval suite, of which 3-5 are known-failure cases?**
   - **Yes:** Iterate: edit prompt, re-run suite, ship only on regression-free improvement. Per ASC-A01 Course 11 §16, target a measurable score delta per version.
   - **No:** Build the suite before tuning the prompt. Even a 10-case suite catches 80% of regressions; without it, every change is a coin flip.

3. **Is your accuracy gap concentrated in specific edge cases (sarcasm, ambiguity, format variance)?**
   - **Yes:** Add 2-3 few-shot <sample_input> / <ideal_output> examples that cover those exact cases. Per ASC-A01 Course 12 §29, this is the highest-ROI single edit.
   - **No:** If the gap is uniform, your problem is the structured template (role + objective + format + constraints), not the examples. Rewrite the system prompt skeleton first.

4. **Is the model fabricating values when the source is silent?**
   - **Yes:** Schema-level fix: nullable types and an unclear / not_provided enum option. Per the structured-data-extraction scenario, this drops fabrication from 8% to under 0.5%.
   - **No:** If fabrication only happens under load, check whether your schema requires fields the source rarely contains; relax the required list.

## Exam-pattern questions

### Q1. Your extraction prompt says 'output valid JSON'. 12% of responses include explanatory prose. Best fix?

Stop prompting for JSON; use tool_use with a JSON schema and tool_choice: forced. The most-tested distractor is "add 'do not include any text outside the JSON' to the system prompt". Per ACP-T03 §4.4, prompt-only is probabilistic (~15% leak); tool-anchored output is structurally guaranteed (0% leak).

### Q2. Your sentiment classifier misses sarcastic tweets ~30% of the time. The cheapest fix?

Add 2-3 few-shot examples that specifically demonstrate sarcasm, wrapped in <sample_input> / <ideal_output> XML tags. Per ASC-A01 Course 12 §29, this technique reduces misclassification by 30-60%. The distractor is "upgrade to a larger model"; the right answer is one prompt edit with concrete failure-case examples.

### Q3. A teammate edited the system prompt and shipped to production. A week later, accuracy dropped 6 points. What was missing from their workflow?

An eval suite with regression detection. Per ASC-A01 Course 11 §16, every prompt edit must run against a 20-50 case suite to compare scores (e.g. v1 = 7.66, v2 = 8.7) before merging. Distractor: "more A/B testing in production". Right answer: gate prompt PRs on offline eval delta, not on subjective review.

### Q4. Your refund agent's refund_reason field returns fabricated reasons when the contract is silent. Schema-level fix?

Make the field nullable AND add an unclear enum option: {type: ['string', 'null'], enum: ['refund', 'replacement', 'unclear', null]}. Per the structured-data-extraction scenario doc, required-string fields force fabrication when the source is silent. The distractor is "add a temperature: 0 setting"; the actual fix is giving the model an honest exit in the schema.

### Q5. Your prompt iterates 47 times and the eval score keeps oscillating between 78% and 82%. What's the underlying problem?

The eval set is too small or not golden. Per the evaluation concept page, scores oscillating without a trend means random variance is dominating signal. Distractor: "use a more capable model". Right answer: grow the suite to 50+ externally-validated cases, then iterate. Without golden data, evals measure consistency, not correctness.

### Q6. Your system prompt uses vague guidance like 'be careful with refunds' and policy violations stay at 5%. Why doesn't the model improve when you add 'be especially careful'?

Vague descriptions are floor-level guidance the model ignores under load. Per the system-prompts concept (docid #5ca2ae), encode the policy as a do-don't list with concrete examples (e.g. "DO escalate refunds > $500. DON'T auto-approve any refund without an order_id."). Distractor: "add stronger language like NEVER". Right answer: replace adjectives with concrete rules and worked examples.

### Q7. You add a great new few-shot example and the targeted case finally passes, but two unrelated cases regress. What do you do?

Roll back, then iterate. Per the evaluation concept (docid #f86eb3), regression-free is a non-negotiable gate. The distractor is "ship anyway because the targeted case was the priority". Right answer: the new example is teaching Claude something that conflicts with the others; either add a counter-example for the regressing cases or rewrite the new example to be more specific.

### Q8. A new prompt scores 92% on your 50-case suite. You ship. Production failures climb. Why was the eval misleading?

The suite was overfit to known cases. Per the evaluation concept page §Q3, score plateau on a fixed suite can mask overfitting; production users hit scenarios you didn't design for. Distractor: "you should have raised the bar to 95%". Right answer: pair offline evals with shadow-mode evaluation on real traffic, and grow the golden suite from production failures.

## FAQ

### Q1. Why XML tags for few-shot examples instead of plain bullets?

Per ASC-A01 Course 12 §29, <sample_input> and <ideal_output> tags create unambiguous boundaries Claude can attend to. Plain bullets blur into the surrounding instructions.

### Q2. How many few-shot examples is too many?

Past 5-7 examples you hit attention dilution and token cost without meaningful gains. Curate the set; replace weak examples instead of adding more.

### Q3. Should the eval grader be Claude or code?

Use code-based grading for deterministic checks (tool sequence, JSON schema). Use model-based grading only for subjective qualities (tone, completeness). Per the evaluation concept page, code grading is reproducible; model grading is not.

### Q4. Can I skip the eval suite if my use case is internal-only?

No. Without an eval, every prompt edit is a coin flip. Even a 10-case suite catches 80% of regressions. Build the suite before you tune the prompt.

### Q5. Where do constraints belong, system prompt or tool descriptions?

Hard constraints (refund > $500 escalates) belong in tool code. System prompt restates them as guidance. Per the system-prompts concept, tool descriptions take precedence on routing; system prompt is linguistic guidance.
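
A minimal sketch of a hard constraint living in tool code; the handler name and $500 threshold mirror the examples above, but the implementation is illustrative:

```python
def issue_refund(order_id: str | None, amount_usd: float) -> dict:
    """Tool handler: policy is enforced here, whatever the model decided."""
    if order_id is None:
        return {"status": "rejected", "reason": "missing order_id"}
    if amount_usd > 500:
        return {"status": "escalated", "reason": "refunds over $500 need a human"}
    return {"status": "approved", "order_id": order_id, "amount_usd": amount_usd}
```

A prompt regression can change what the model asks for, but it cannot approve a $600 refund; enforcement lives below the prompt.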

---

**Source:** https://claudearchitectcertification.com/concepts/prompt-engineering-techniques
**Vault sources:** ASC-A01 Course 12 §29 providing examples (#70748e); ASC-A01 Course 11 §16 typical eval workflow (#9d53f3); ASC-A01 Course 12 §22 code-based grading (#af5dc2); ACP-T03 §4.4 prompt engineering and anti-fabrication; Scenario: structured-data-extraction (#4fe21b); Concept: evaluation (#f86eb3); Concept: system-prompts (#5ca2ae)
**Last reviewed:** 2026-05-04

**Evidence tiers:** 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.
