# Vision & Multimodal

> Vision lets Claude process images alongside text. Vault coverage thin; needs Phase 6 research.

**Domain:** D4 · Prompt Engineering (20% of CCA-F exam)
**Canonical:** https://claudearchitectcertification.com/concepts/vision-multimodal
**Last reviewed:** 2026-05-04

## Quick stats

- **Input type:** image
- **Exam domain:** D4
- **Coverage tier:** C
- **Status:** stub
- **Action:** research

## What it is

Vision multimodal is Claude's capability to analyze images, charts, screenshots, diagrams, and documents alongside text in a single request. Image content blocks ({type: "image", source: {...}}) are appended to the messages array just like text, and Claude processes them holistically. Each image costs tokens based on size: ~300 for thumbnails, ~1000-1200 for document pages. No per-image flat fee, only per-token pricing.

What makes vision multimodal rather than sequential is that text and image understanding happen in a single model pass. Claude doesn't see image first, then prompt, then reason backwards. The transformer's input embeddings encode both image patches (via vision tokenization) and text tokens in the same sequence, allowing the model to ground language in visual content directly.

Image content can be passed in three forms: base64-encoded (data embedded in JSON, no external URL), URL reference (Claude fetches from public HTTPS), or file upload via Files API (sensitive documents or when base64 bloats payload). Production use cases: OCR (text from scans), chart reading, document understanding (invoices, contracts, forms), UI/UX review (screenshots).
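
The three forms differ only in the shape of the source object. A minimal Python sketch of the block shapes (the helper name is ours; the dict layouts follow the three forms described above):

```python
def image_block(kind: str, value: str, media_type: str = "image/jpeg") -> dict:
    """Build an image content block in one of the three source forms.

    kind: "base64" (inline data), "url" (public HTTPS), or "file" (Files API id).
    """
    if kind == "base64":
        source = {"type": "base64", "media_type": media_type, "data": value}
    elif kind == "url":
        source = {"type": "url", "url": value}
    elif kind == "file":
        source = {"type": "file", "file_id": value}
    else:
        raise ValueError(f"unknown source kind: {kind}")
    return {"type": "image", "source": source}

# A mixed-content user message: image first, then the instruction.
message = {
    "role": "user",
    "content": [
        image_block("url", "https://example.com/invoice.jpg"),
        {"type": "text", "text": "Extract the invoice total."},
    ],
}
```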

The primary risk is token cost explosion. A 20MB high-res image can consume 2000+ tokens, and a 50-document batch becomes prohibitively expensive without optimization. Mitigations: downsample to 1024×768 or 1200×1500, prefer JPEG over PNG, crop to the region of interest, and batch similar documents. The secondary risk is hallucination in structured extraction when images are ambiguous. Always validate JSON against a schema and flag low confidence.
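
The downsampling step reduces to a pure fit-within calculation (function name is ours; the resize call in the comment assumes Pillow is installed):

```python
def fit_within(width: int, height: int, max_w: int = 1200, max_h: int = 1500) -> tuple[int, int]:
    """Scale (width, height) to fit inside max_w x max_h, preserving aspect ratio.
    Never upscales small images."""
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)

# With Pillow (assumed installed) the resize itself is one call:
#   from PIL import Image
#   img = Image.open("scan.png")
#   img = img.resize(fit_within(*img.size))
#   img.save("scan.jpg", "JPEG", quality=85)   # JPEG over PNG, quality 85

print(fit_within(4000, 3000))  # a 4000x3000 scan fits at (1200, 900)
```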

## How it works

Request structure is identical to text-only, except content array contains mixed blocks. A user message might be [{type: "image", source: {type: "base64", media_type: "image/jpeg", data: "..."}}, {type: "text", text: "Extract invoice total"}]. Image is processed during the forward pass; vision encoder tokenizes patches, interleaves with text tokens, unified embedding attends across modalities simultaneously.

Token counting is deterministic: the SDK's count_tokens endpoint accepts image blocks. Call it before committing to a paid request. A 1MB JPEG runs ~500 tokens; a 5MB scanned PDF page ~2500. At vision pricing of ~$0.75/1M input tokens, a 50-image batch at 800 tokens each costs ~$0.03.
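
The arithmetic can be wrapped in a small estimator (helper name is ours; the commented call assumes the count_tokens endpoint described above):

```python
def estimate_cost_usd(input_tokens: int, rate_per_million: float = 0.75) -> float:
    """Dollar cost of input tokens at the quoted per-million rate."""
    return input_tokens * rate_per_million / 1_000_000

# Counting before paying (sketch):
#   import anthropic
#   client = anthropic.Anthropic()
#   n = client.messages.count_tokens(
#       model="claude-opus-4-5",
#       messages=[{"role": "user", "content": [image_block, text_block]}],
#   ).input_tokens
#   print(estimate_cost_usd(n))

# The 50-image batch from above: 50 images x 800 tokens each.
print(f"${estimate_cost_usd(50 * 800):.2f}")  # $0.03
```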

For structured extraction, pass image + JSON schema, ask for valid JSON only, catch validation errors, retry with clarification. tool_use block pairs well: define extract_from_image with input schema matching desired output, Claude invokes with JSON, harness validates and appends. Without this pattern, extraction becomes a manual text-comparison game.
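
A sketch of the pattern (field names are illustrative, and the validation helper is ours, standing in for a full JSON Schema validator):

```python
# Tool definition: input_schema mirrors the desired output shape.
EXTRACT_TOOL = {
    "name": "extract_from_image",
    "description": "Report fields extracted from the supplied image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "amount": {"type": "number"},
        },
        "required": ["vendor", "amount"],
    },
}

def validate_extraction(payload: dict, required: list[str]) -> list[str]:
    """Return the list of missing required fields (empty means valid)."""
    return [field for field in required if payload.get(field) is None]

# Harness side (sketch): find the tool_use block, validate its input, and
# append a tool_result; on missing fields, return an error result so Claude
# can retry with clarification.
#   for block in resp.content:
#       if block.type == "tool_use" and block.name == "extract_from_image":
#           missing = validate_extraction(
#               block.input, EXTRACT_TOOL["input_schema"]["required"])

print(validate_extraction({"vendor": "Acme"}, ["vendor", "amount"]))  # ['amount']
```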

The most underutilized optimization is image context reuse via Files API. With 500 recurring documents, upload each once with purpose: "vision", store the file_id, reference in subsequent requests via {type: "image", source: {type: "file", file_id: "..."}}. Avoids re-encoding, reduces token cost 40-50% on recurring work.
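
The upload-once logic can be sketched as a small cache (class name and fake uploader are ours; the real upload function would call the Files API and return the new file's id):

```python
class FileIdCache:
    """Upload each document once; reuse the stored file_id afterwards.

    `upload` is injectable so the caching logic is testable; in production
    it would wrap the Files API upload call and return the new file's id.
    """
    def __init__(self, upload):
        self._upload = upload
        self._ids: dict[str, str] = {}

    def block_for(self, path: str) -> dict:
        if path not in self._ids:
            self._ids[path] = self._upload(path)  # network call happens once
        return {"type": "image", "source": {"type": "file", "file_id": self._ids[path]}}

# Demo with a fake uploader that counts calls:
calls = []
cache = FileIdCache(lambda p: (calls.append(p), f"file_{len(calls)}")[1])
a = cache.block_for("contract.pdf")
b = cache.block_for("contract.pdf")   # cache hit, no second upload
print(len(calls), a == b)  # 1 True
```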

## Where you'll see it in production

### Intelligent document OCR pipeline

Compliance team processes 200+ regulatory filings weekly. Claude agentic loop accepts scanned PDFs, splits by page, downsamples to 1200×1500 JPEG, calls vision + extraction schema, validates JSON, retries on failure. Cost: ~$2/document at scale. 98% extraction accuracy vs 40% from commodity OCR.

### Real-time chart interpretation for BI dashboards

Finance analytics tool embeds vision in Streamlit. Users upload chart screenshot. Claude analyzes, detects anomalies ("Q3 dropped 15% vs Q2"), generates insights. Stream response for <500ms latency. ~400 tokens per chart.

### Invoice and receipt automation

Expense-management SaaS ingests receipts as images. Vision extracts merchant, amount, date, category, tax → structured JSON. Loop catches "too blurry" or "unreadable" failures and escalates to human with image attached. Cuts manual entry 90%.

### UI/UX design review

QA team provides screenshots of new design. Vision analyzes: "Are form labels visible? Color contrast sufficient? Buttons keyboard-accessible?" Catches ~70% of accessibility violations before human testing.

## Code examples

### Multimodal extraction with schema validation

**Python:**

```python
import anthropic
import base64
import json

client = anthropic.Anthropic()

def extract_invoice(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    schema = {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "invoice_date": {"type": "string"},
            "amount": {"type": "number", "minimum": 0},
            "line_items": {"type": "array"},
        },
        "required": ["vendor", "invoice_date", "amount"],
    }

    for attempt in range(2):
        resp = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data}},
                    {"type": "text", "text": f"Extract invoice. Return ONLY JSON matching this schema:\n{json.dumps(schema)}\nIf field unclear, set null. No markdown."},
                ],
            }],
        )
        text = resp.content[0].text if resp.content and resp.content[0].type == "text" else "{}"
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            if attempt == 0:
                continue  # Retry once
            return {"error": "Failed to parse JSON"}
```

> Image base64-encoded; JSON validation; one retry on parse failure. Production version adds confidence scoring.

**TypeScript:**

```typescript
import Anthropic from "@anthropic-ai/sdk";
import fs from "fs";

const client = new Anthropic();

interface Invoice { vendor: string; invoice_date: string; amount: number; }

async function extractInvoice(imagePath: string): Promise<Invoice | { error: string }> {
  const buffer = fs.readFileSync(imagePath);
  const base64 = buffer.toString("base64");

  const resp = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/jpeg", data: base64 } },
        { type: "text", text: "Extract invoice. Return ONLY JSON: {vendor, invoice_date, amount, line_items[]}. Null for unclear fields. No markdown." },
      ],
    }],
  });

  const text = resp.content[0].type === "text" ? resp.content[0].text : "{}";
  try { return JSON.parse(text); } catch { return { error: "Failed to parse" }; }
}
```

> Base64 inline here; swap to Files API file_ids for batch reuse. Schema enforced client-side only. Production adds a retry loop.

## Looks-right vs actually-wrong

| Looks right | Actually wrong |
|---|---|
| Upload all images as base64 inline. | Base64 increases JSON size ~33%. For batch (50+), use Files API to reference stored file_ids. Inline OK for <10 images. |
| Use high-resolution (4000×3000) for max OCR accuracy. | Higher resolution increases token cost linearly with no accuracy gain above 1200×1500. Diminishing returns at ~1000px width. Downsample to 1200×1500 for dense text and tables, 1024×768 for screenshots and clean documents. |
| If JSON is invalid, increase max_tokens and retry. | Schema-validation failure usually signals ambiguity in the image (poor scans), not token exhaustion. Retry with clarification or escalate to human. |
| Vision always produces accurate output; trust first extraction. | Claude hallucinates fields when tables ambiguous or images degraded. Always validate JSON, track confidence per field, escalate low-confidence. |
| Embed full-page scans of 100-page documents as single mega-image. | Split by page, process each separately. 100 requests, but a single mega-image hits the 20MB limit or gets downscaled past legibility; per-page images stay in the accuracy sweet spot. Batch in parallel. |

## Comparison

| Aspect | Vision multimodal | Text-only | Traditional OCR | Captioning pipeline |
| --- | --- | --- | --- | --- |
| Input | Image + text together | Text only | Image only | Image → text desc |
| Accuracy | 90%+ structured | N/A | 70-80% handwritten | Lossy |
| Cost per document | 400-1500 tokens (~$0.0003-0.001) | N/A | Free or vendor fees | Vision + extra tokens |
| Speed | 1-3 sec per image | Negligible | 2-10 sec | Slower |
| Structured output | JSON schema | N/A | Unstructured | No schema |
| Best for | Invoices, forms, charts | Queries on text | Bulk digitization | User-facing summaries |

## Decision tree

1. **Need to extract structured data from images?**
   - **Yes:** Vision + tool_use with JSON schema. Validate, retry on failure.
   - **No:** Vision + free-form text for summaries.

2. **Processing >20 images in a batch?**
   - **Yes:** Files API. Upload once, reference file_id. Cuts cost ~40%.
   - **No:** Base64 inline OK.

3. **Image is handwritten, blurry, or low-res?**
   - **Yes:** Expect 60-70% accuracy. Plan human review and confidence thresholds.
   - **No:** Expect 90%+ on printed.

4. **Documents >5MB each?**
   - **Yes:** Resample/compress. JPEG quality 85 cuts size 50% with no visual loss.
   - **No:** No optimization needed.

5. **Output is sensitive (PII, financial)?**
   - **Yes:** Files API: documents stay encrypted at rest. Validate and redact.
   - **No:** Base64 or URL fine.

## Exam-pattern questions

### Q1. Vision request: should you use base64 inline or the Files API?

Base64 for <10 images, Files API for batch (50+). Base64 inflates JSON ~33%; Files API stores once and references via file_id, cutting bandwidth and ~40% on token cost for repeated reuse.

### Q2. High-res 4000×3000 images for OCR accuracy. Worth the cost?

No. Higher resolution increases token cost linearly with no accuracy gain above 1200×1500. Diminishing returns set in at ~1000 px width. Downsample for cost; you don't lose precision.

### Q3. Claude returns invalid JSON from a vision extraction. Increase max_tokens?

No. Schema-validation failure usually signals ambiguity in the image (poor scan, missing data), not token exhaustion. Retry with clarification or escalate to human review. More tokens won't help.

### Q4. Vision always produces accurate output. True?

No. Claude hallucinates fields when tables are ambiguous or images degraded. Always validate JSON against your schema, track confidence per field, escalate low-confidence results.

### Q5. 100-page PDF: send as one mega-image or split by page?

Split by page, process each separately. That makes 100 requests, but a single mega-image exceeds the 20MB limit or gets downscaled past legibility, while per-page images stay at full fidelity. Batch process pages in parallel for speed.

### Q6. Image size 1200×1500 vs 1024×768 for invoice extraction. Which?

1200×1500 for dense text and tables; 1024×768 for screenshots or clean documents. Match resolution to information density. Anything above 1500 wastes tokens.

### Q7. Vision and PII (faces, credit cards): can Claude redact in-place?

No explicit redaction. Ask Claude to flag PII regions in JSON: {region: "top-left", pii_type: "credit_card"}. Redact client-side with image processing. Don't ship raw images with PII to logs.

### Q8. Token cost of a 1024×768 screenshot vs a 1200×1500 document page?

~400 tokens for the screenshot, ~1000 for the document page. Roughly equal per pixel. Plan token budget by resolution + density.

## FAQ

### Q1. How much does vision cost per image?

No per-image fee, only per-token. JPEG invoice (1200×1500) ~1000 tokens ≈ $0.00075. 1000 invoices ~$0.75.

### Q2. Can Claude read handwritten text?

Printed: 95%+. Handwritten: 70-85% by legibility. Validate against schema, escalate low-confidence.

### Q3. Image formats supported?

JPEG, PNG, GIF, WebP. Max 20MB. PDF supported (analyzed page-by-page internally).

### Q4. URL instead of base64?

Yes. {type: "image", source: {type: "url", url: "https://..."}}. Requires public HTTPS.

### Q5. Hallucinated fields?

Validate all JSON against schema. Set required fields explicitly, reject records that fail validation.

### Q6. Cache images?

Not yet natively. Files API with file_id avoids re-encoding. Prompt caching doesn't apply to images.

### Q7. Split a PDF into pages?

Use pypdf or pdf2image to extract pages as JPEG. Batch process in parallel.
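
A sketch (pdf2image assumed installed, with poppler available; the batching helper is ours):

```python
# With pdf2image:
#   from pdf2image import convert_from_path
#   pages = convert_from_path("filing.pdf", dpi=150)
#   for i, page in enumerate(pages):
#       page.save(f"filing_p{i:03d}.jpg", "JPEG", quality=85)

def batches(items: list, size: int) -> list[list]:
    """Chunk pages into fixed-size batches for parallel processing."""
    return [items[i:i + size] for i in range(0, len(items), size)]

print(batches(list(range(7)), 3))  # [[0, 1, 2], [3, 4, 5], [6]]
```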

### Q8. Token cost screenshot vs document?

Equal per pixel. 1024×768 screenshot ~400 tokens; 1200×1500 doc page ~1000 tokens.

### Q9. Detect and redact PII?

No explicit redaction, but ask Claude to flag PII regions in JSON. Redact client-side.

### Q10. Charts with legends?

Very well. Claude understands axes, colors, legends. Interprets trends, anomalies, relationships.

---

**Source:** https://claudearchitectcertification.com/concepts/vision-multimodal
**Vault sources:** ACP-T03 capabilities; ASC-A01 Course 6 vision
**Last reviewed:** 2026-05-04

**Evidence tiers** — 🟢 official Anthropic doc / API contract · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.
