D4.5 · Domain 4 · Prompt Engineering · 20% of CCA-F

Vision & Multimodal.

9 min read·10 sections·Tier A

Vision lets Claude process images alongside text. A full deep-dive guide is coming soon.

01 · Summary

TLDR

Vision lets Claude process images alongside text. A full deep-dive guide is coming soon.

Input type: image
Exam domain: D4
Coverage tier: C
Status: stub
Action: research
02 · Definition

What it is

Vision is Claude's multimodal capability to analyze images, charts, screenshots, diagrams, and documents alongside text in a single request. Image content blocks ({type: "image", source: {...}}) go in a message's content array next to text blocks, and Claude reasons over both together. Each image is billed in input tokens that scale with its pixel dimensions: roughly ~300 for thumbnails and ~1000-1200 for document pages. There is no flat per-image fee, only per-token pricing.

What makes this multimodal rather than sequential is that text and image understanding happen in a single model pass. Claude does not look at the image first, then the prompt, then reason backwards. Conceptually, image patches (via vision tokenization) and text tokens are encoded into the same input sequence, which lets the model ground language directly in visual content.

Image content can be passed in three forms: base64-encoded data (embedded in the JSON body, no external URL), a URL reference (fetched from a publicly reachable HTTPS address), or a file_id from the Files API (for sensitive documents or when base64 bloats the payload). Typical production use cases: OCR (text from scans), chart reading, document understanding (invoices, contracts, forms), and UI/UX review (screenshots).
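The three source forms can be sketched as small builder functions. This is a minimal illustration, not an official helper: the function names are hypothetical, and only the dictionary shapes follow the block formats quoted in this guide.

```python
import base64


def base64_source(image_bytes: bytes, media_type: str = "image/jpeg") -> dict:
    """Inline the raw image bytes in the JSON request body."""
    data = base64.standard_b64encode(image_bytes).decode("utf-8")
    return {"type": "base64", "media_type": media_type, "data": data}


def url_source(url: str) -> dict:
    """Reference an image by URL; the guide requires public HTTPS."""
    if not url.startswith("https://"):
        raise ValueError("expected a public HTTPS URL")
    return {"type": "url", "url": url}


def file_source(file_id: str) -> dict:
    """Reference an image previously uploaded through the Files API."""
    return {"type": "file", "file_id": file_id}


def image_block(source: dict) -> dict:
    """Wrap any of the three sources in an image content block."""
    return {"type": "image", "source": source}


if __name__ == "__main__":
    print(image_block(url_source("https://example.com/invoice.jpg")))
```

Keeping the source separate from the block makes it easy to switch transport (inline, URL, stored file) without touching the rest of the request.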

The primary risk is token cost explosion. A single high-resolution image can consume 2000+ tokens, so a 50-document batch gets expensive quickly without optimization. Mitigations: downsample to 1024×768 or 1200×1500, prefer JPEG over PNG, crop to the region of interest (ROI), and batch similar documents. The secondary risk is hallucination in structured extraction when images are ambiguous. Always validate JSON against the schema and flag low-confidence fields.
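The downsampling mitigation comes down to choosing target dimensions that fit a bounding box without distorting the image. A minimal sketch of that arithmetic, assuming the 1024×768 and 1200×1500 boxes named above; `fit_within` is a hypothetical helper, and the actual resize would be done by an imaging library (not shown).

```python
def fit_within(width: int, height: int, max_w: int, max_h: int) -> tuple:
    """Scale (width, height) down to fit inside max_w x max_h.

    Preserves aspect ratio and never upscales.
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def pixel_reduction(width: int, height: int, new_w: int, new_h: int) -> float:
    """Fraction of pixels removed by the resize (0.0 means unchanged)."""
    return 1 - (new_w * new_h) / (width * height)


if __name__ == "__main__":
    new_w, new_h = fit_within(4000, 3000, 1024, 768)
    print(new_w, new_h, f"{pixel_reduction(4000, 3000, new_w, new_h):.0%} fewer pixels")
```

Because token cost tracks pixel count, the pixel reduction is a reasonable first proxy for the saving; confirm the real figure with a token count before and after.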

03 · Mechanics

How it works

The request structure is identical to a text-only request, except that the content array contains mixed blocks. A user message might be [{type: "image", source: {type: "base64", media_type: "image/jpeg", data: "..."}}, {type: "text", text: "Extract invoice total"}]. The image is processed in the same forward pass as the text: conceptually, a vision encoder turns image patches into tokens, these are interleaved with the text tokens, and attention spans both modalities at once.

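Token counting is available before you pay: the SDK exposes a count_tokens endpoint that accepts image blocks, so call it first. Token count tracks pixel dimensions rather than file size; as a rough guide, a typical 1MB JPEG lands around 500 tokens and a 5MB scanned page around 2500. Images are billed as ordinary input tokens. At an illustrative ~$0.75 per 1M input tokens, a 50-image batch at 800 tokens each runs ~$0.03.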
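The count-then-estimate step can be sketched as two small functions. This assumes `client` is an `anthropic.Anthropic()` instance and uses the SDK's `messages.count_tokens` call; the $0.75 rate is the illustrative figure from this section, not a quoted price, and `estimate_cost_usd` is a hypothetical helper.

```python
def count_image_tokens(client, model: str, image_b64: str, media_type: str, prompt: str) -> int:
    """Ask the API how many input tokens this image + prompt would consume."""
    result = client.messages.count_tokens(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": image_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return result.input_tokens


def estimate_cost_usd(tokens_per_image: int, image_count: int, usd_per_million: float) -> float:
    """Input-token cost for a batch, given a per-million-token rate."""
    return tokens_per_image * image_count * usd_per_million / 1_000_000


if __name__ == "__main__":
    # 50 images x 800 tokens at an illustrative $0.75 per 1M input tokens.
    print(f"${estimate_cost_usd(800, 50, 0.75):.2f}")
```

Running the estimate on one representative image per document type is usually enough to budget a batch.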

For structured extraction, pass the image plus a JSON schema, ask for valid JSON only, catch validation errors, and retry with a clarifying prompt. Tool use pairs well with this: define an extract_from_image tool whose input schema matches the desired output, let Claude invoke it with JSON arguments (returned as a `tool_use` block), and have the harness validate and store the result. Without this pattern, extraction becomes a manual text-comparison game.
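A minimal sketch of the tool-based pattern, assuming the invoice fields used elsewhere in this guide. `EXTRACT_TOOL` would be passed as `tools=[EXTRACT_TOOL]` and `FORCE_TOOL` as `tool_choice` in the messages call (not shown); `tool_input` and `validation_errors` are hypothetical harness helpers, and the validator covers only required keys and basic types rather than full JSON Schema.

```python
EXTRACT_TOOL = {
    "name": "extract_from_image",
    "description": "Record the fields extracted from the invoice image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "invoice_date": {"type": "string"},
            "amount": {"type": "number", "minimum": 0},
        },
        "required": ["vendor", "invoice_date", "amount"],
    },
}

# Forces the model to answer by calling the extraction tool.
FORCE_TOOL = {"type": "tool", "name": EXTRACT_TOOL["name"]}


def tool_input(content_blocks, tool_name: str):
    """Return the input dict of the first matching tool_use block, else None."""
    for block in content_blocks:
        if getattr(block, "type", None) == "tool_use" and block.name == tool_name:
            return block.input
    return None


def validation_errors(record: dict, schema: dict) -> list:
    """Check required keys, basic types, and numeric minimums."""
    errors = [f"missing field: {name}" for name in schema["required"] if name not in record]
    for name, rules in schema["properties"].items():
        value = record.get(name)
        if value is None:
            continue  # absent or null: already reported if required
        if rules["type"] == "number":
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{name}: expected a number")
            elif "minimum" in rules and value < rules["minimum"]:
                errors.append(f"{name}: below minimum {rules['minimum']}")
        elif rules["type"] == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected a string")
    return errors
```

An empty error list means the record can be stored; a non-empty list is a natural payload for the clarifying retry prompt.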

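An underused optimization is image reuse via the Files API. With 500 recurring documents, upload each once, store the returned file_id, and reference it in subsequent requests via {type: "image", source: {type: "file", file_id: "..."}}. This avoids re-encoding and re-uploading the same bytes on every request, which shrinks payloads and upload time on recurring work. The image is still tokenized on each request, so measure token cost with count_tokens rather than assuming a discount.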
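The upload-once, reference-many pattern can be sketched as a small cache keyed by content hash. This is an illustration under assumptions: `FileIdCache` is a hypothetical class, and `upload_fn` stands in for whatever call uploads bytes through the Files API and returns a file_id.

```python
import hashlib


class FileIdCache:
    """Remember the file_id each document was uploaded under, keyed by content hash."""

    def __init__(self, upload_fn):
        # upload_fn(data: bytes) -> str is a hypothetical callable that uploads
        # the bytes (for example through the Files API) and returns a file_id.
        self._upload_fn = upload_fn
        self._ids = {}

    def file_id_for(self, data: bytes) -> str:
        """Upload only if these exact bytes have not been seen before."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self._ids:
            self._ids[key] = self._upload_fn(data)
        return self._ids[key]

    def image_block(self, data: bytes) -> dict:
        """Build an image content block that references the stored file."""
        return {"type": "image", "source": {"type": "file", "file_id": self.file_id_for(data)}}
```

In a real pipeline the mapping would live in a database rather than in memory, so that file_ids survive restarts.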

04 · In production

Where you'll see it

Intelligent document OCR pipeline

Compliance team processes 200+ regulatory filings weekly. Claude agentic loop accepts scanned PDFs, splits by page, downsamples to 1200×1500 JPEG, calls vision + extraction schema, validates JSON, retries on failure. Cost: ~$2/document at scale. 98% extraction accuracy vs 40% from commodity OCR.

Real-time chart interpretation for BI dashboards

Finance analytics tool embeds vision in Streamlit. Users upload chart screenshot. Claude analyzes, detects anomalies ("Q3 dropped 15% vs Q2"), generates insights. Stream response for <500ms latency. ~400 tokens per chart.

Invoice and receipt automation

Expense-management SaaS ingests receipts as images. Vision extracts merchant, amount, date, category, tax → structured JSON. Loop catches "too blurry" or "unreadable" failures and escalates to human with image attached. Cuts manual entry 90%.

UI/UX design review

QA team provides screenshots of new design. Vision analyzes: "Are form labels visible? Color contrast sufficient? Buttons keyboard-accessible?" Catches ~70% of accessibility violations before human testing.

05 · Implementation

Code examples

Multimodal extraction with schema validation
import base64
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_invoice(image_path: str, max_attempts: int = 2) -> dict:
    # Base64-encode the image so it can travel inside the JSON request body.
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    schema = {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "invoice_date": {"type": "string"},
            "amount": {"type": "number", "minimum": 0},
            "line_items": {"type": "array"},
        },
        "required": ["vendor", "invoice_date", "amount"],
    }
    prompt = (
        "Extract the invoice fields. Return ONLY JSON matching this schema:\n"
        f"{json.dumps(schema)}\n"
        "If a field is unclear, set it to null. No markdown."
    )

    for _ in range(max_attempts):
        resp = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": [
                    # media_type must match the real file format (JPEG assumed here).
                    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data}},
                    {"type": "text", "text": prompt},
                ],
            }],
        )
        text = resp.content[0].text if resp.content else ""
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry
        # Minimal validation: a JSON object that contains every required key.
        if isinstance(data, dict) and all(k in data for k in schema["required"]):
            return data
    return {"error": "No valid JSON after retries"}
The image is base64-encoded, the reply is parsed and checked for required keys, and the call is retried once on failure. A production version adds full schema validation and confidence scoring.
06 · Distractor patterns

Looks right, isn't

Each row pairs a plausible-looking pattern with the failure it actually creates. These are the shapes exam distractors are built from.

Looks right

Upload all images as base64 inline.

Actually wrong

Base64 increases JSON size ~33%. For batches (50+), upload once via the Files API and reference the stored file_ids. Inline is fine for <10 images.

Looks right

Use high-resolution (4000×3000) for max OCR accuracy.

Actually wrong

Token cost scales with pixel count, with no accuracy gain above 1200×1500 and diminishing returns from ~1000px width. Downsample to 1024×768 for tables, 1200×1500 for dense text.

Looks right

If JSON is invalid, increase max_tokens and retry.

Actually wrong

Schema-validation failure usually signals ambiguity in the image (poor scans), not token exhaustion. Retry with clarification or escalate to human.

Looks right

Vision always produces accurate output; trust first extraction.

Actually wrong

Claude hallucinates fields when tables are ambiguous or images are degraded. Always validate JSON, track confidence per field, and escalate low-confidence fields.

Looks right

Embed full-page scans of 100-page documents as single mega-image.

Actually wrong

Split by page and process each page separately. A single stitched image has to be downscaled to fit the model's input limits, which makes dense text unreadable. 100 per-page requests keep every page legible and can be batched in parallel.
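The "track confidence per field, escalate low-confidence" advice above can be sketched as a routing step. This assumes a hypothetical extraction format in which each field carries a self-reported confidence between 0 and 1; the 0.8 threshold is illustrative, not a recommended value.

```python
def route_extraction(fields: dict, threshold: float = 0.8) -> dict:
    """Decide whether an extracted record can be accepted automatically.

    `fields` maps a field name to {"value": ..., "confidence": float}.
    A field is flagged when its value is null or its confidence is
    below the threshold (a missing confidence counts as 0.0).
    """
    flagged = sorted(
        name
        for name, field in fields.items()
        if field.get("value") is None or field.get("confidence", 0.0) < threshold
    )
    return {
        "status": "needs_review" if flagged else "accepted",
        "low_confidence_fields": flagged,
    }


if __name__ == "__main__":
    record = {
        "vendor": {"value": "Acme", "confidence": 0.97},
        "amount": {"value": 120.0, "confidence": 0.55},
    }
    print(route_extraction(record))
```

Self-reported confidence is only a signal, so it works best alongside schema validation rather than as a replacement for it.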

07 · Compare

Side-by-side

Aspect | Vision multimodal | Text-only | Traditional OCR | Captioning pipeline
Input | Image + text together | Text only | Image only | Image → text description
Accuracy | 90%+ structured | N/A | 70-80% handwritten | Lossy
Cost per document | 400-1500 tokens (~$0.0003-0.001) | N/A | Free or vendor fees | Vision + extra tokens
Speed | 1-3 sec per image | Negligible | 2-10 sec | Slower
Structured output | JSON schema | N/A | Unstructured | No schema
Best for | Invoices, forms, charts | Queries on text | Bulk digitization | User-facing summaries
08 · When to use

Decision tree

01

Need to extract structured data from images?

Yes: Vision + tool_use with JSON schema. Validate, retry on failure.
No: Vision + free-form text for summaries.
02

Processing >20 images in a batch?

Yes: Files API. Upload once, reference by file_id. Avoids re-uploading the same bytes on every request.
No: Base64 inline OK.
03

Image is handwritten, blurry, or low-res?

Yes: Expect 60-70% accuracy. Plan human review and confidence thresholds.
No: Expect 90%+ on printed text.
04

Documents >5MB each?

Yes: Resample/compress. JPEG quality 85 cuts size ~50% with little visible loss.
No: No optimization needed.
05

Output is sensitive (PII, financial)?

Yes: Files API (documents stay encrypted at rest). Validate and redact.
No: Base64 or URL fine.
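The five questions above can be folded into one planning function. This is a sketch only: `Workload` and `plan` are hypothetical names, and the thresholds (more than 20 images, more than 5MB) are copied from the decision tree rather than from any API limit.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_structured_output: bool
    image_count: int
    degraded_input: bool  # handwritten, blurry, or low-res
    max_file_mb: float
    sensitive: bool  # PII or financial content


def plan(w: Workload) -> dict:
    """Map the decision-tree answers to a request strategy."""
    return {
        "output": "tool_use + JSON schema" if w.needs_structured_output else "free-form text",
        "transport": "files_api" if (w.image_count > 20 or w.sensitive) else "base64_inline",
        "human_review": w.degraded_input,
        "compress_first": w.max_file_mb > 5,
    }


if __name__ == "__main__":
    print(plan(Workload(True, 200, False, 8.0, True)))
```

Encoding the tree this way makes the thresholds explicit and easy to tune once real accuracy and cost numbers are available.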
09 · On the exam

Question patterns


20 V2 questions are wired to this concept.

  • For a vision request, should you use base64 inline or the Files API?
  • Are 4000x3000 high-res images worth the cost for OCR accuracy?
  • Claude returns invalid JSON from a vision extraction. Should you increase max_tokens?
  • Vision always produces accurate output. True or false?
  • For a 100-page PDF, should you send it as one mega-image or split by page?
  • Image size 1200x1500 vs 1024x768 for invoice extraction. Which?

14 additional questions for this concept live in the practice pillar.

10 · FAQ

Frequently asked

How much does vision cost per image?
No per-image fee, only per-token. JPEG invoice (1200×1500) ~1000 tokens ≈ $0.00075. 1000 invoices ~$0.75.
Can Claude read handwritten text?
Printed: 95%+. Handwritten: 70-85% by legibility. Validate against schema, escalate low-confidence.
Image formats supported?
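JPEG, PNG, GIF, WebP. Size limits are enforced per image, so check the current API limit rather than assuming 20MB. PDFs are supported as document content blocks rather than image blocks and are analyzed page by page.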
URL instead of base64?
Yes. {type: "image", source: {type: "url", url: "https://..."}}. Requires public HTTPS.
Hallucinated fields?
Validate all JSON against schema. Set required fields explicitly, reject records that fail validation.
Cache images?
Partly. The Files API with a file_id avoids re-encoding and re-uploading. Prompt caching can also cover image blocks when they sit inside a cached prompt prefix.
Split a PDF into pages?
Use pypdf or pdf2image to extract pages as JPEG. Batch process in parallel.
Token cost screenshot vs document?
Equal per pixel. 1024×768 screenshot ~400 tokens; 1200×1500 doc page ~1000 tokens.
Detect and redact PII?
No explicit redaction, but ask Claude to flag PII regions in JSON. Redact client-side.
Charts with legends?
Very well. Claude understands axes, colors, legends. Interprets trends, anomalies, relationships.
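The "split a PDF into pages, then batch process in parallel" answer above can be sketched with a thread pool. Page splitting itself (pypdf or pdf2image, as named in the FAQ) is not shown; `process_pages` is a hypothetical helper and `extract_page` stands in for whatever function sends one page image to the API.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pages(pages, extract_page, max_workers: int = 4) -> list:
    """Run extract_page over every page in parallel, preserving page order.

    A failure on one page is recorded instead of aborting the whole batch.
    """
    def safe(indexed):
        index, page = indexed
        try:
            return {"page": index, "result": extract_page(page)}
        except Exception as exc:  # keep going; report the failed page
            return {"page": index, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, enumerate(pages, start=1)))


if __name__ == "__main__":
    fake_pages = [b"page-1", b"page-2", b"page-3"]
    print(process_pages(fake_pages, lambda page: len(page)))
```

Threads fit here because each page spends most of its time waiting on the network; keep `max_workers` within the account's rate limits.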
11 · Practice with AI

Work this with your AI

Work this concept hands-on with Claude Code, Codex, or claude.ai. Copy a prompt, paste it into your assistant, and practise in tandem. Each one keeps you active (explain it back, get drilled, or build) rather than just reading.

  • Drill it like the exam (scenario MCQs)
    Practice in the exam's scenario-MCQ format with trap awareness.
  • Explain it back (Feynman)
    Build durable, transferable understanding of a concept you can half-state.
  • Test me, adapting the difficulty
    Active recall practice on a concept you think you know.
  • Check my prerequisites first
    Before studying a concept that keeps not sticking.
  • Find the high-leverage 20%
    When a domain feels too big and you are short on time.
Self-check

Test yourself

Three diagnostic questions on this primitive. Reveal each answer when you have a guess.

Q1. Vision request: should you use base64 inline or the Files API?
Base64 for <10 images, Files API for batches (50+). Base64 inflates JSON ~33%; the Files API stores each file once and references it via file_id, cutting bandwidth and upload time on repeated reuse.
Q2. High-res 4000×3000 images for OCR accuracy. Worth the cost?
No. Token cost scales with pixel count, with no accuracy gain above 1200×1500 and diminishing returns from ~1000 px width. Downsample to cut cost; you don't lose precision.
Q3. Claude returns invalid JSON from a vision extraction. Increase max_tokens?
No. Schema-validation failure usually signals ambiguity in the image (poor scan, missing data), not token exhaustion. Retry with clarification or escalate to human review. More tokens won't help.
Last reviewed: 2026-05-04·Refresh cadence: monthly

Vision & Multimodal, complete.

You've covered the full ten-section breakdown for this primitive, definition, mechanics, code, false positives, comparison, decision tree, exam patterns, and FAQ. One technical primitive down on the path to CCA-F.
