Invoice Processing Agent | Claude Architect Certification Prep

Q: What happens if a vendor has multiple naming variations (Apple Inc, APPLE, Apple, Inc.)?

The vendor master holds the canonical vendor_id and a list of name variations. The extraction schema requires the model to extract the vendor as text; a normalization step (lowercase, strip punctuation, fuzzy match against the vendor master) resolves it to a vendor_id. The duplicate-detection hook keys on vendor_id, not the raw name, so naming variation does not break uniqueness.
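
The normalization step described above can be sketched as follows. This is a minimal illustration, not the production resolver: the VENDOR_MASTER contents, the vendor IDs, the function names, and the 0.85 similarity cutoff are all assumptions, and difflib from the standard library stands in for whatever fuzzy matcher the real system uses.

```python
import difflib
import string
from typing import Optional

# Hypothetical vendor master: canonical vendor_id -> known name variations.
VENDOR_MASTER = {
    "V-0001": ["Apple Inc", "APPLE", "Apple, Inc."],
    "V-0002": ["Acme Industrial Supply", "ACME Ind. Supply"],
}


def normalize_name(raw: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    cleaned = raw.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def resolve_vendor_id(raw_name: str, cutoff: float = 0.85) -> Optional[str]:
    """Map an extracted vendor name to a vendor_id, or None if nothing is close."""
    target = normalize_name(raw_name)
    # Index every normalized variation back to its canonical vendor_id.
    index = {
        normalize_name(variation): vendor_id
        for vendor_id, variations in VENDOR_MASTER.items()
        for variation in variations
    }
    if target in index:  # exact hit after normalization
        return index[target]
    # Fall back to the single closest variation above the assumed cutoff.
    close = difflib.get_close_matches(target, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None
```

A None result is a reasonable signal to route the invoice to human review rather than guess at a vendor.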

Q: Can the agent process multi-currency invoices in one workflow?

Yes. The schema enforces currency as an ISO 4217 enum. The cap policy evaluates vendor_id and amount in the invoice currency, and the cap can be denominated per-vendor in the vendor master. Duplicate detection keys on (vendor_id, invoice_number), so currency does not affect it. For consolidated reporting, a daily FX-rate table converts to a base currency at audit-log write time.
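
The base-currency conversion mentioned above might look like the sketch below. The FX_TO_USD table, its placeholder rates, the choice of USD as base currency, and the function name are assumptions for illustration only. Decimal is used so that rounding to two places is explicit.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical daily FX table: base-currency (USD) units per 1 unit of invoice currency.
# The rates are placeholders for illustration, not market data.
FX_TO_USD = {
    "2025-01-15": {
        "USD": Decimal("1"),
        "EUR": Decimal("1.10"),
        "GBP": Decimal("1.25"),
    },
}


def to_base_currency(amount: str, currency: str, rate_date: str) -> Decimal:
    """Convert an invoice amount to the base currency using that day's rate."""
    rates = FX_TO_USD.get(rate_date)
    if rates is None or currency not in rates:
        # Covers a missing rate date and the 'unclear' currency escape hatch.
        raise LookupError(f"no FX rate for {currency!r} on {rate_date}")
    converted = Decimal(amount) * rates[currency]
    return converted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Raising on a missing rate, rather than defaulting to 1.0, keeps a silent mis-conversion out of the audit log.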

Q: How do you handle credit memos (negative invoices)?

Credit memos use the same schema with total_amount representing the credit (positive number) and a separate document_type enum field that distinguishes invoice from credit_memo. The PreToolUse hook treats credit memos as vendor_ytd_spend - amount (effectively decreasing YTD spend). Three-way match runs against the original invoice and the credit-memo reason code instead of a PO and GRN.
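
As a rough sketch of the sign handling described above: the amount stays positive in the record, and the document_type decides whether it adds to or subtracts from year-to-date spend. The function names are hypothetical, and the real hook reads these values from the vendor master rather than taking them as arguments.

```python
def signed_amount(document_type: str, total_amount: float) -> float:
    """Return the effect of a document on vendor YTD spend."""
    if total_amount < 0:
        # The schema keeps amounts non-negative; the sign comes from document_type.
        raise ValueError("total_amount must be non-negative")
    if document_type == "invoice":
        return total_amount
    if document_type == "credit_memo":
        return -total_amount
    raise ValueError(f"unknown document_type {document_type!r}")


def projected_ytd(vendor_ytd_spend: float, document_type: str, total_amount: float) -> float:
    """YTD spend after this document is applied; compare this against the cap."""
    return vendor_ytd_spend + signed_amount(document_type, total_amount)
```

With this shape, a credit memo can never trip the cap check on its own, because it only lowers the projected spend.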

01 · Problem framing

The problem

What the customer needs

Schema-conformant extraction on every invoice: vendor, number, line items, total, currency, due date, PO reference. No prose wrapping; downstream systems must parse cleanly.
Three-way match before approval: invoice, purchase order, goods receipt all agree on amount, vendor, and quantities.
Cap-policy enforcement that cannot be bypassed by clever invoice phrasing: vendor authorization caps, duplicate detection, blocklisted-vendor checks.
Audit-grade trail of every approval and rejection so finance can replay any decision in a quarterly close.

Why naive approaches fail

Prompt 'output JSON' for invoice extraction: ~15% leakage on edge invoices (handwritten notes, mixed languages, credit memos, rotated scans).
Single-pass extraction with no semantic validation: line totals do not match the header; corrupted records ship downstream.
No three-way match: the agent approves an invoice for goods that were never received, or against a PO that does not exist.
Cap policy in the system prompt: ~3% of approvals exceed authorization cap because prompts leak under unusual phrasing.
No duplicate-invoice check: the same invoice number gets paid twice when the vendor re-sends after a delivery confirmation.

Definition of done

Forced tool_choice: { type: 'tool', name: 'extract_invoice' } on every extraction call.
JSON schema requires vendor_id, invoice_number, invoice_date (ISO 8601), line_items[], total_amount, currency (ISO 4217 enum), due_date (ISO 8601); PO_reference is optional and nullable.
Validation-retry loop confirms sum(line_items) == total, currency in enum, due_date >= invoice_date.
Three-way match service reconciles invoice + PO + GRN; variance > 2% routes to human review.
PreToolUse hook on approve_payment: deny on cap exceeded, vendor blocklisted, or duplicate (vendor_id, invoice_number) in the last 90 days.
PostToolUse audit log writes every approval / rejection / hook decision.

02 · Architecture

The system

03 · Component detail

What each part does

Five components; each owns a concept.

Invoice JSON Schema

the contract, in tools[0].input_schema

The output shape lives inside a tool definition, not as freeform text. Required: vendor_id, invoice_number, invoice_date (ISO 8601 string), line_items[], total_amount, currency (ISO 4217 enum), due_date (ISO 8601 string). Optional and nullable: PO_reference, tax_amount, notes. Every numeric field has minimum: 0. Every line item has description, quantity, unit_price, total.

Configuration

tools = [{ name: 'extract_invoice', input_schema: { type: 'object', properties: { vendor_id: {type: 'string'}, invoice_number: {type: 'string'}, invoice_date: {type: 'string', format: 'date'}, total_amount: {type: 'number', minimum: 0}, currency: {type: 'string', enum: ['USD', 'EUR', 'GBP', 'INR', 'JPY', 'unclear']}, due_date: {type: 'string', format: 'date'}, line_items: {type: 'array', items: {...}}, PO_reference: {type: ['string', 'null']} }, required: ['vendor_id', 'invoice_number', 'invoice_date', 'total_amount', 'currency', 'due_date', 'line_items'] } }]

Concept: structured-outputs ↗

Forced tool_use Extractor

tool_choice: { type: 'tool', name: 'extract_invoice' }

Forces the model to fire extract_invoice with arguments matching the schema. No prose preamble, no probabilistic adherence. Vision-capable invocation reads the PDF or image; the model emits a structured tool_use. Pair with few-shot examples that show currency: 'unclear' on truly ambiguous source.

Configuration

tool_choice: { type: 'tool', name: 'extract_invoice' }. Use auto only on triage-style flows. Forced is for mandatory extraction.

Concept: tool-choice ↗

Validation-Retry Loop

sum check, currency enum, date sanity

Schema enforces shape. Code enforces meaning. After parse: sum(line_items[].total) == total_amount (within a 0.01 tolerance for rounding); currency in the enum; due_date format YYYY-MM-DD; due_date >= invoice_date. On failure, feed the specific error back to the model ('line totals sum to 4950 but header total is 5000'); typical convergence in 1-2 retries.

Configuration

loop: extract -> parse -> validate_semantically -> on failure, append { role: 'user', content: tool_result with is_error: true and a specific error } -> retry. Max retries: 3. After 3, route to human review.

Concept: evaluation ↗

Three-Way Match Service

invoice + PO + goods receipt

Queries the PO master and the goods-receipt ledger by PO_reference. Compares amount (variance <= 2% OK for FX rounding and small price changes), vendor identity (normalized vendor name fuzzy match), line-item count (must match), and date sanity (invoice date >= PO date; receipt date >= PO date). Variance above thresholds returns a structured exception; invoice is held pending human review.

Configuration

match(invoice, po, grn) -> { match: bool, variance_pct, mismatched_fields[], routed_to: 'auto-approve' | 'human-review' }. Threshold: amount variance > 2% -> human-review. Vendor mismatch -> human-review. Line-item count mismatch -> human-review.
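
The date-sanity rule above (invoice date >= PO date; receipt date >= PO date) can be sketched as a small helper. The function name and the choice to pass ISO 8601 date strings are assumptions; the real service presumably reads these dates from the PO master and goods-receipt ledger.

```python
from datetime import date
from typing import List


def date_sanity_issues(invoice_date: str, po_date: str, receipt_date: str) -> List[str]:
    """Return human-readable issues; an empty list means the dates are consistent."""
    inv = date.fromisoformat(invoice_date)
    po = date.fromisoformat(po_date)
    grn = date.fromisoformat(receipt_date)
    issues = []
    if inv < po:
        issues.append(f"invoice date {inv} precedes PO date {po}")
    if grn < po:
        issues.append(f"receipt date {grn} precedes PO date {po}")
    return issues
```

Any returned issue can be appended to mismatched_fields[] so the exception block tells the analyst which date failed.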

Concept: evaluation ↗

PreToolUse Cap and Duplicate Hook

deterministic policy gate before approve_payment

Sits between the model's tool_use for approve_payment and actual execution. Reads tool_input.vendor_id, tool_input.amount, tool_input.invoice_number. Three checks. (1) Cap: vendor_ytd_spend + amount <= vendor_authorization_cap. (2) Blocklist: vendor not in the active blocklist. (3) Duplicate: no row in the audit log with the same (vendor_id, invoice_number) in the last 90 days. Any check fails and the hook exits 2 with a structured stderr message; the agent observes the deny as tool_result is_error: true and routes to a structured exception block for the AP analyst.

Configuration

matcher: 'approve_payment'. Hook exits 2 with stderr { reason: 'cap_exceeded' | 'vendor_blocklisted' | 'duplicate_detected', detail: ..., recommended_action: ... }. SDK forwards stderr to the model as a tool_result with is_error: true.
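
One way to produce the structured stderr message described above is to serialize it as JSON before exiting. This is a sketch under assumptions: the helper names build_deny and deny are hypothetical, and only the three reason codes and the exit code 2 come from the text.

```python
import json
import sys

# Reason codes named in the hook configuration above.
ALLOWED_REASONS = {"cap_exceeded", "vendor_blocklisted", "duplicate_detected"}


def build_deny(reason: str, detail: str, recommended_action: str) -> str:
    """Serialize a deny decision so the agent and the AP analyst can parse it."""
    if reason not in ALLOWED_REASONS:
        raise ValueError(f"unknown deny reason {reason!r}")
    return json.dumps(
        {"reason": reason, "detail": detail, "recommended_action": recommended_action},
        sort_keys=True,
    )


def deny(reason: str, detail: str, recommended_action: str) -> None:
    """Write the structured message to stderr and exit 2 to block the tool call."""
    print(build_deny(reason, detail, recommended_action), file=sys.stderr)
    sys.exit(2)
```

Restricting reason to a fixed set keeps the exception block machine-routable: downstream code can branch on the code without parsing free text.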

Concept: hooks ↗

04 · One concrete run

Data flow

05 · Build it

Eight steps to production

Author the invoice JSON schema as a tool definition

Define the output shape in tools[0].input_schema. Every required field listed in required[]. Currency is an enum that includes an 'unclear' escape hatch. PO_reference is ['string', 'null'] because cash invoices and credit memos have no PO. Every numeric field has minimum: 0. Line items are an array with description, quantity, unit_price, total. The schema is the contract; everything downstream depends on it being right.

from anthropic import Anthropic
client = Anthropic()

EXTRACT_INVOICE_TOOL = {
    "name": "extract_invoice",
    "description": "Extract a structured invoice record from a PDF or image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_id": {"type": "string"},
            "invoice_number": {"type": "string"},
            "invoice_date": {"type": "string", "format": "date"},
            "due_date": {"type": "string", "format": "date"},
            "currency": {
                "type": "string",
                "enum": ["USD", "EUR", "GBP", "INR", "JPY", "unclear"],
            },
            "total_amount": {"type": "number", "minimum": 0},
            "tax_amount": {"type": ["number", "null"], "minimum": 0},
            "PO_reference": {"type": ["string", "null"]},
            "line_items": {
                "type": "array",
                "minItems": 1,
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number", "minimum": 0},
                        "unit_price": {"type": "number", "minimum": 0},
                        "total": {"type": "number", "minimum": 0},
                    },
                    "required": ["description", "quantity", "unit_price", "total"],
                },
            },
        },
        "required": [
            "vendor_id", "invoice_number", "invoice_date", "due_date",
            "currency", "total_amount", "line_items",
        ],
    },
}

↪ Concept: structured-outputs

Force tool_choice and run extraction with vision input

Set tool_choice: { type: 'tool', name: 'extract_invoice' } so the model has no choice but to fire the tool with arguments matching the schema. Pass the invoice as a vision input (PDF page rasterized to image, or direct image upload). The model emits a structured tool_use; the harness extracts tool_use.input as the candidate record.

import base64

def extract_invoice(invoice_image_bytes: bytes, mime_type: str = "image/png") -> dict:
    image_b64 = base64.b64encode(invoice_image_bytes).decode("ascii")
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        tools=[EXTRACT_INVOICE_TOOL],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": mime_type, "data": image_b64},
                    },
                    {"type": "text", "text": "Extract this invoice into the schema."},
                ],
            }
        ],
    )
    for block in resp.content:
        if block.type == "tool_use" and block.name == "extract_invoice":
            return block.input
    raise RuntimeError("forced tool_choice did not yield tool_use")

↪ Concept: tool-choice

Wrap extraction in a validation-retry loop

Schema guarantees structure; semantics need code. After parsing, validate: sum(line_items[].total) equals total_amount within 0.01 tolerance; currency in the enum; due_date format and >= invoice_date. On failure, feed the specific error back via tool_result with is_error: true so the model sees what was wrong; retry up to 3 times. Most failures converge in 1-2 retries because the model now knows what the validator rejected.

from datetime import date

def validate(record: dict) -> list[str]:
    errors = []
    items_sum = sum(it.get("total", 0) for it in record.get("line_items", []))
    total = record.get("total_amount", 0)
    # Compare with a small tolerance to absorb rounding in the source document.
    if abs(items_sum - total) > 0.01:
        errors.append(
            f"line items sum to {items_sum:.2f} but total_amount is "
            f"{total:.2f}; reconcile"
        )
    if record.get("currency") not in {"USD", "EUR", "GBP", "INR", "JPY", "unclear"}:
        errors.append(f"currency {record.get('currency')!r} not in ISO 4217 enum")
    try:
        inv_date = date.fromisoformat(record.get("invoice_date", ""))
        due_date = date.fromisoformat(record.get("due_date", ""))
        if due_date < inv_date:
            errors.append(
                f"due_date {due_date} is before invoice_date {inv_date}"
            )
    except (TypeError, ValueError) as e:
        # ValueError: malformed date string; TypeError: non-string value such as null.
        errors.append(f"date parse failed: {e}")
    return errors

def extract_with_retry(invoice_image_bytes: bytes, max_retries: int = 3) -> dict:
    image_b64 = base64.b64encode(invoice_image_bytes).decode("ascii")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "Extract this invoice into the schema."},
        ],
    }]
    for attempt in range(max_retries):
        resp = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            tools=[EXTRACT_INVOICE_TOOL],
            tool_choice={"type": "tool", "name": "extract_invoice"},
            messages=messages,
        )
        tool_use = next(b for b in resp.content if b.type == "tool_use")
        record = tool_use.input
        errors = validate(record)
        if not errors:
            return record
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": "Validation failed: " + "; ".join(errors) + ". Re-extract.",
                "is_error": True,
            }],
        })
    raise ValueError(f"extraction did not converge in {max_retries} attempts")

↪ Concept: evaluation

Run a three-way match against PO and goods receipt

Query the PO master by PO_reference and the goods-receipt ledger by the same key. Compare amount (variance <= 2% OK for FX rounding and minor price changes), vendor identity (normalized fuzzy match on vendor name), and line-item count (must match exactly). Variance above any threshold routes to human review with a structured exception block; otherwise auto-proceed.

def three_way_match(invoice: dict, po: dict, grn: dict) -> dict:
    """Reconcile invoice with purchase order and goods receipt note."""
    issues = []
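    inv_amount = invoice["total_amount"]
    po_amount = po.get("total_amount", 0)
    if po_amount > 0:
        variance_pct = abs(inv_amount - po_amount) / po_amount * 100
        if variance_pct > 2.0:
            issues.append(
                f"amount variance {variance_pct:.2f}% exceeds 2% threshold"
            )
    else:
        # A missing or zero PO amount cannot be reconciled; never auto-approve it.
        issues.append("PO total_amount missing or zero; cannot compute variance")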

    if normalize_vendor(invoice["vendor_id"]) != normalize_vendor(po["vendor_id"]):
        issues.append(
            f"vendor mismatch: invoice {invoice['vendor_id']!r} "
            f"vs PO {po['vendor_id']!r}"
        )

    if len(invoice["line_items"]) != len(grn.get("line_items", [])):
        issues.append(
            f"line-item count mismatch: invoice {len(invoice['line_items'])} "
            f"vs GRN {len(grn.get('line_items', []))}"
        )

    return {
        "match": len(issues) == 0,
        "issues": issues,
        "routed_to": "auto-approve" if not issues else "human-review",
    }

def normalize_vendor(name: str) -> str:
    return "".join(ch.lower() for ch in name if ch.isalnum())

↪ Concept: evaluation

Wire the PreToolUse cap and duplicate-detection hook

Hook on approve_payment. Three checks. (1) Cap: vendor_ytd_spend + amount <= vendor_authorization_cap. (2) Blocklist: vendor not on the active blocklist. (3) Duplicate: no audit-log row with the same (vendor_id, invoice_number) in the last 90 days. Any check fails and the hook exits 2 with a structured stderr message; the agent observes the deny and routes to an exception block for the AP analyst. Deterministic, no prompt-injection bypass.

# .claude/hooks/invoice_approval.py
import sys, json, os, sqlite3
from datetime import date, timedelta

DB = sqlite3.connect(os.environ.get("AUDIT_DB", "audit.sqlite3"))

def vendor_cap_check(vendor_id: str, amount: float) -> str | None:
    row = DB.execute(
        "SELECT cap, ytd_spend FROM vendor_master WHERE vendor_id = ?",
        (vendor_id,),
    ).fetchone()
    if not row:
        return f"vendor {vendor_id!r} not in master; escalate"
    cap, ytd = row
    if ytd + amount > cap:
        remaining = cap - ytd
        return (
            f"vendor cap exceeded: ytd_spend={ytd:.2f} + amount={amount:.2f} > "
            f"cap={cap:.2f}; cap_remaining={remaining:.2f}"
        )
    return None

def blocklist_check(vendor_id: str) -> str | None:
    row = DB.execute(
        "SELECT 1 FROM vendor_blocklist WHERE vendor_id = ?", (vendor_id,)
    ).fetchone()
    if row:
        return f"vendor {vendor_id!r} on active blocklist"
    return None

def duplicate_check(vendor_id: str, invoice_number: str) -> str | None:
    cutoff = (date.today() - timedelta(days=90)).isoformat()
    row = DB.execute(
        "SELECT approved_at FROM audit_log WHERE vendor_id = ? "
        "AND invoice_number = ? AND approved_at >= ? ORDER BY approved_at DESC LIMIT 1",
        (vendor_id, invoice_number, cutoff),
    ).fetchone()
    if row:
        return (
            f"duplicate detected: same (vendor_id, invoice_number) approved on "
            f"{row[0]}; reject this submission"
        )
    return None

def main():
    payload = json.loads(sys.stdin.read())
    if payload["tool_name"] != "approve_payment":
        sys.exit(0)
    inp = payload["tool_input"]
    for check in (
        vendor_cap_check(inp["vendor_id"], inp["amount"]),
        blocklist_check(inp["vendor_id"]),
        duplicate_check(inp["vendor_id"], inp["invoice_number"]),
    ):
        if check:
            print(check, file=sys.stderr)
            sys.exit(2)
    sys.exit(0)

if __name__ == "__main__":
    main()

↪ Concept: hooks

Cache the schema and the vendor master

The schema is the largest stable token cost (~1500 tokens for invoice extraction). The vendor master (caps, blocklist, name normalization rules) is also stable per session. Mark both with cache_control: ephemeral so a 5-minute TTL keeps them warm across sustained AP traffic. Realistic savings: ~80% on cached portions, ~50% reduction on overall steady-state cost.

def extract_with_cache(invoice_image_bytes: bytes, vendor_master_blob: str) -> dict:
    image_b64 = base64.b64encode(invoice_image_bytes).decode("ascii")
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": (
                    "You are an AP-automation extraction agent. Return only "
                    "structured tool_use; never prose."
                ),
                "cache_control": {"type": "ephemeral"},
            },
            {
                "type": "text",
                "text": vendor_master_blob,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        tools=[
            {**EXTRACT_INVOICE_TOOL, "cache_control": {"type": "ephemeral"}},
        ],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text", "text": "Extract this invoice into the schema."},
            ],
        }],
    )
    print(f"cache_creation: {resp.usage.cache_creation_input_tokens}")
    print(f"cache_read:     {resp.usage.cache_read_input_tokens}")
    return next(b.input for b in resp.content if b.type == "tool_use")

↪ Concept: prompt-caching

Use Batch API for overnight bulk runs

Sync API for inbox-arrival latency. For nightly backfills (10K invoices), the Batch API gives a flat 50% discount with a 24-hour SLA. Combined with schema and vendor-master caching (per-100-item sub-batches keep the ephemeral cache warm), bulk extraction cost drops ~75% versus naive sync. Resubmit failures the next morning as a fresh batch, with the specific validation error appended to each request's messages.

def submit_bulk_extraction(invoices: list[dict]) -> str:
    """Submit a batch of invoice extractions for overnight processing."""
    requests = []
    for inv in invoices:
        image_b64 = base64.b64encode(inv["image_bytes"]).decode("ascii")
        requests.append({
            "custom_id": f"extract-{inv['id']}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 2048,
                "tools": [EXTRACT_INVOICE_TOOL],
                "tool_choice": {"type": "tool", "name": "extract_invoice"},
                "messages": [{
                    "role": "user",
                    "content": [
                        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                        {"type": "text", "text": "Extract this invoice into the schema."},
                    ],
                }],
            },
        })
    batch = client.messages.batches.create(requests=requests)
    print(f"Batch {batch.id} submitted with {len(requests)} extractions")
    return batch.id

def harvest_batch(batch_id: str):
    batch = client.messages.batches.retrieve(batch_id)
    if batch.processing_status != "ended":
        return {"status": "not_ready"}
    accepted, rejected = [], []
    for r in client.messages.batches.results(batch_id):
        if r.result.type == "succeeded":
            tu = next(b for b in r.result.message.content if b.type == "tool_use")
            if not validate(tu.input):
                accepted.append(tu.input)
                continue
        rejected.append(r.custom_id)
    return {"accepted": accepted, "rejected_for_retry": rejected}

↪ Concept: batch-api
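
The per-100-item sub-batches mentioned in this step can be produced with a small chunking helper. This is a sketch: the name chunked is hypothetical and the default size of 100 simply mirrors the figure in the text.

```python
from typing import Iterator, List


def chunked(items: List, size: int = 100) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical usage with the submit function from this step:
#   batch_ids = [submit_bulk_extraction(chunk) for chunk in chunked(invoices)]
```

Each chunk becomes its own batch submission, so a failure in one sub-batch does not force a resubmit of the whole night's run.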

Audit-log every approval, rejection, and hook decision

PostToolUse hook on every approve_payment call. Append a row to durable storage: timestamp, vendor_id, invoice_number, amount, currency, three-way-match outcome, hook decisions (cap, blocklist, duplicate), final routing (approved | human-review | denied). Retain at least 7 years for audit compliance. The audit log is the replay tool when finance asks 'why did we approve this in May?' three months later.

import datetime, json, sqlite3

AUDIT_DB = sqlite3.connect("audit.sqlite3")
AUDIT_DB.execute("""
CREATE TABLE IF NOT EXISTS audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key; parallel writes can share a timestamp
    ts TEXT NOT NULL,
    vendor_id TEXT,
    invoice_number TEXT,
    amount REAL,
    currency TEXT,
    match_outcome TEXT,
    hook_decisions TEXT,
    final_routing TEXT,
    approved_at TEXT
)
""")

def audit(invoice: dict, match_result: dict, hook_decisions: dict, routing: str):
    AUDIT_DB.execute(
        "INSERT INTO audit_log (ts, vendor_id, invoice_number, amount, currency, "
        "match_outcome, hook_decisions, final_routing, approved_at) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            # Timezone-aware UTC timestamp (utcnow() is deprecated since Python 3.12).
            datetime.datetime.now(datetime.timezone.utc).isoformat(),
            invoice["vendor_id"],
            invoice["invoice_number"],
            invoice["total_amount"],
            invoice["currency"],
            json.dumps(match_result),
            json.dumps(hook_decisions),
            routing,
            datetime.date.today().isoformat() if routing == "approved" else None,
        ),
    )
    AUDIT_DB.commit()

↪ Concept: evaluation
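
Replaying a decision from the audit log could look like the sketch below. It assumes the audit_log columns defined in this step; the function name replay_decisions is hypothetical.

```python
import json
import sqlite3
from typing import List


def replay_decisions(conn: sqlite3.Connection, vendor_id: str, invoice_number: str) -> List[dict]:
    """Return every audit row for one invoice, oldest first, with JSON columns decoded."""
    columns = ("ts", "amount", "currency", "match_outcome", "hook_decisions", "final_routing")
    rows = conn.execute(
        f"SELECT {', '.join(columns)} FROM audit_log "
        "WHERE vendor_id = ? AND invoice_number = ? ORDER BY ts",
        (vendor_id, invoice_number),
    ).fetchall()
    decisions = []
    for row in rows:
        record = dict(zip(columns, row))
        # match_outcome and hook_decisions are stored as JSON text by audit().
        record["match_outcome"] = json.loads(record["match_outcome"])
        record["hook_decisions"] = json.loads(record["hook_decisions"])
        decisions.append(record)
    return decisions
```

Ordering by ts shows the full history of an invoice, for example a human-review hold followed by a later approval.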

06 · Configuration decisions

The four decisions

Decision	Right answer	Wrong answer	Why
Output shape guarantee on extraction	Forced tool_choice with input_schema as the contract	Prompt 'output JSON' or 'respond with valid JSON only'	Prompt-only is probabilistic (~85% adherence); ~15% leakage on edge invoices (handwritten notes, mixed languages, credit memos, rotated scans). Forced tool_use is structural (100% adherence). The cost is identical; the reliability gap is decisive in finance.
Vendor authorization cap enforcement	PreToolUse hook reads vendor_ytd_spend, exits 2 on violation	System prompt: 'never approve above the vendor cap'	Prompts leak ~3% in production. Hooks are deterministic. For policy-bearing limits (cap, duplicate, blocklist), the deterministic gate is the only credible architecture. Prompt-only enforcement is a finding waiting to be flagged in the next audit.
Same invoice arriving twice (vendor re-sends after delivery)	PreToolUse duplicate-detection hook keyed on (vendor_id, invoice_number) over last 90 days	Trust the model to notice duplicates in conversation context	Context memory is unreliable across multi-turn or batch runs. The hook is stateless, queries the audit log, and prevents race conditions when two parallel extractions hit the same invoice within seconds.
Bulk overnight processing of 10K invoices	Batch API + schema and vendor-master caching	Sync API in a tight loop or sync API without caching	Batch API gives a flat 50% discount with a 24-hour SLA. Caching adds another ~80% off the schema and vendor-master tokens. Combined: ~75% savings versus naive sync. Sync API is reserved for inbox-arrival latency.

07 · Failure modes

Where it breaks

Five failure pairs. Each one is one exam question. The fix is always architectural: deterministic gates, structured fields, pinned state.

❌ Prompt-only field extraction

Prompt 'extract this invoice as JSON' leaks ~15% on edge invoices. Downstream parser breaks every seventh document; AP analyst spends the morning re-keying invoices the agent botched.

AP-INV-01

✅ Fix

Forced tool_choice: { type: 'tool', name: 'extract_invoice' } plus a strict JSON schema in tools[0].input_schema. The model has no choice but to fire the tool with arguments matching the schema. 100% structural adherence.

❌ No semantic validation

Single-pass extraction with no math check. The model returns a structurally-valid record where line totals sum to 4950 but the header total says 5000. Bad data ships downstream; quarterly close finds the discrepancy three months later.

AP-INV-02

✅ Fix

Validation-retry loop. After parse, validate sum(line_items[].total) == total_amount (within 0.01 tolerance), currency in ISO 4217 enum, due_date >= invoice_date. On failure, feed the specific error back; retry up to 3 times; route to human review if still failing.

❌ No three-way match

Agent approves an invoice that has no matching purchase order, or where the goods receipt was for fewer items, or where the vendor name on the invoice does not match the vendor on the PO. AP pays for goods never received, or pays the wrong vendor.

AP-INV-03

✅ Fix

Three-way match service queries PO master and goods-receipt ledger. Compares amount (variance <= 2% OK), normalized vendor name, line-item count. Variance above thresholds routes to human review with a structured exception block.

❌ Cap policy in the system prompt

System prompt: 'never approve more than the vendor authorization cap'. Production logs show ~3% of approvals exceed the cap because the prompt language leaks under unusual phrasing or when the agent is processing many invoices in one session.

AP-INV-04

✅ Fix

PreToolUse hook on approve_payment reads tool_input.vendor_id and tool_input.amount, queries the vendor master for vendor_ytd_spend + cap, exits 2 on violation with a structured message including cap_remaining. Deterministic, not probabilistic.

❌ No duplicate-invoice check

Vendor re-sends the same invoice number after delivery confirmation, or the same invoice is uploaded twice through different channels (email + portal). The agent approves both. AP discovers the duplicate payment in next month's reconciliation.

AP-INV-05

✅ Fix

PreToolUse hook queries the audit log for any row with the same (vendor_id, invoice_number) in the last 90 days. On match, exits 2 with the prior approval date. Stateless, auditable, prevents race conditions in parallel runs.

08 · Budget

Cost & latency

Per-invoice synchronous extraction (cached schema)

~$0.001 to $0.003

Schema ~1500 tokens at cache-read price plus image vision tokens (~1000-2000) plus ~150 output. Sustained AP traffic with cache hits >= 70% drops effective cost predictably.

Three-way match service

~$0 token cost; ~10-30 ms latency

Pure SQL queries against PO master and goods-receipt ledger. No LLM call. Latency is dominated by the database round-trip.

PreToolUse hook overhead

~$0; ~5 ms latency

Subprocess reads stdin JSON, runs three SQL queries (vendor cap, blocklist, duplicate), exits 0 or 2. No LLM call. Latency below the noise floor of any tool dispatch.

Batch overnight (10K invoices, batch + caching)

~75% off naive sync

Batch API flat 50% discount times schema and vendor-master cache (~80% off cached portion). 10K invoices at typical complexity drop from ~$30 sync uncached to ~$8 batch cached.
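
How the two discounts combine depends on what share of input tokens is actually served from cache. The sketch below is a simplified model under stated assumptions: it applies only to input tokens, ignores output tokens and any cache-write premium, and takes the 50% and ~80% figures from the text as defaults. The function name is hypothetical.

```python
def combined_cost_factor(
    cached_fraction: float,
    batch_discount: float = 0.50,
    cache_discount: float = 0.80,
) -> float:
    """Fraction of the naive sync input cost that remains after both discounts."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    # Only the cached share of tokens gets the cache discount.
    cache_factor = 1.0 - cache_discount * cached_fraction
    # The batch discount applies to the whole request.
    return (1.0 - batch_discount) * cache_factor
```

Under this model, a cached fraction of 0.625 leaves a factor of 0.25, that is, 75% savings. With nothing cached, the factor is 0.5, the batch discount alone.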

Validation-retry overhead

~+25% on records that retry

5-10% of records retry once; 1-2% retry twice. Specific-error feedback converges quickly. Pipeline cost up ~5% to gain ~99% schema-conformance plus ~99% semantic-conformance.

Per-1000-invoices total (steady state)

~$1.00 to $3.00

Sync cached extraction at scale. Adding human review of unconverged records adds operator-time cost but recovers the long tail of edge invoices.

09 · Ship gates

Ship checklist

Two passes. Build-time gates verify the code; run-time gates verify the system in production.

Build-time

Invoice JSON schema lives in tools[0].input_schema with required and nullable fields explicit ↗ structured-outputs
tool_choice forced to extract_invoice on every extraction call ↗ tool-choice
Currency field is an enum with an 'unclear' escape hatch ↗ structured-outputs
Validation-retry loop with sum check, currency enum, date sanity ↗ evaluation
Three-way match service against PO master and goods-receipt ledger ↗ evaluation
PreToolUse cap-and-duplicate hook on approve_payment ↗ hooks
Schema and vendor master cached with cache_control: ephemeral ↗ prompt-caching
Batch API for nightly bulk runs (greater than 100 invoices) ↗ batch-api
PostToolUse audit log writes every approval, rejection, and hook decision; 7-year retention
Stratified accuracy reporting by vendor, currency, document type
Human-review queue for invoices that fail validation, three-way match, or hook checks
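
Stratified accuracy reporting, as listed above, can be sketched as a group-by over validation outcomes. The record shape (a dict with a boolean "passed" field plus the grouping keys) and both function names are assumptions. The 0.90 default is the 90% pass-rate alert threshold this checklist names for vendors.

```python
from collections import defaultdict
from typing import Dict, List


def pass_rate_by(records: List[dict], key: str) -> Dict[str, float]:
    """Pass rate per group, where `key` is e.g. 'vendor_id', 'currency', 'document_type'."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["passed"]:
            passed[group] += 1
    return {group: passed[group] / totals[group] for group in totals}


def below_threshold(rates: Dict[str, float], threshold: float = 0.90) -> List[str]:
    """Groups whose pass rate falls under the alert threshold, sorted for stable output."""
    return sorted(group for group, rate in rates.items() if rate < threshold)
```

Reporting per group matters because a healthy overall pass rate can hide one vendor or document type that fails most of the time.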

Run-time

JSON schema versioned in source control; PR-reviewed before deploy
Vendor master kept current; cap and blocklist updates flow through change control
Validation-retry loop unit-tested for line-total mismatch, currency drift, date inversion
Three-way match service tested against synthetic PO + GRN cases including 1.9% and 2.1% variance edge cases
PreToolUse hook unit-tested for cap exceeded, blocklisted vendor, duplicate within 90 days, all three pass
PostToolUse audit log retention confirmed at 7 years; index on (vendor_id, invoice_number, date)
Schema cache hit rate monitored; alert if drops below 50%
Stratified accuracy dashboard by vendor and document type; alert on any vendor below 90% pass rate
Human-review queue with SLA documented and on-call for invoices held more than 48 hours
Batch API job for nightly backfill with auto-resubmit on transient failures
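
The 1.9% and 2.1% variance edge cases from the list above can be pinned down with a small test. This is a sketch with hypothetical helper names; it isolates the amount check rather than calling the full match service.

```python
def amount_variance_pct(invoice_amount: float, po_amount: float) -> float:
    """Absolute variance of the invoice against the PO, as a percentage of the PO."""
    if po_amount <= 0:
        raise ValueError("po_amount must be positive")
    return abs(invoice_amount - po_amount) / po_amount * 100


def route_on_variance(invoice_amount: float, po_amount: float, threshold_pct: float = 2.0) -> str:
    """Route to human review only when variance is strictly above the threshold."""
    variance = amount_variance_pct(invoice_amount, po_amount)
    return "human-review" if variance > threshold_pct else "auto-approve"


def test_variance_edges() -> None:
    assert route_on_variance(1019.0, 1000.0) == "auto-approve"   # 1.9% over the PO
    assert route_on_variance(1021.0, 1000.0) == "human-review"   # 2.1% over the PO
    assert route_on_variance(979.0, 1000.0) == "human-review"    # 2.1% under the PO
    assert route_on_variance(1000.0, 1000.0) == "auto-approve"   # exact match
```

The exact 2.0% boundary is left out on purpose: with binary floats its outcome can depend on rounding, so a production test of that case is safer with Decimal amounts.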

10 · Question patterns

Five exam-pattern questions

Your invoice extraction agent uses prompt-only extraction. Production logs show ~15% of records arrive with prose wrapping ('Sure, here is the JSON:') and the downstream parser breaks. What is the architectural fix?

Move the schema into tools[0].input_schema and set tool_choice: { type: 'tool', name: 'extract_invoice' }. This forces the model to emit a structured tool_use call instead of free text. No prose wrapping, no preamble, no probabilistic adherence. The 15% leak collapses to 0% because the harness reads tool_use.input directly and never has to parse prose; the validation-retry loop then catches any record that is structurally valid but semantically wrong. Tagged to AP-INV-01.

You notice line-item totals do not match the invoice header total. The model extracted all items correctly but the math is wrong. How do you prevent this from shipping bad data downstream?

Add a validation-retry loop. After parsing, compute sum(line_items[].total) in code. If it differs from total_amount by more than 0.01, send the specific error back to the model in a tool_result with is_error: true ('line items sum to 4950 but header total says 5000'); retry up to 3 times. About 95% of records converge on the second attempt because the model now sees what was wrong. Pair with currency enum and date sanity checks. Tagged to AP-INV-02.
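The loop above might be sketched as follows. `call_model` is a hypothetical callable that runs one extraction attempt and accepts the previous error message (or None on the first attempt); the record layout with `line_items[].total` and `total_amount` follows the field names in the answer, and everything else is an assumption.

```python
from decimal import Decimal
from typing import Callable, Optional

MAX_RETRIES = 3
TOLERANCE = Decimal("0.01")


def line_total_error(record: dict) -> Optional[str]:
    """Return a specific error message, or None when the totals agree."""
    line_sum = sum(
        (Decimal(str(item["total"])) for item in record["line_items"]),
        Decimal("0"),
    )
    header = Decimal(str(record["total_amount"]))
    if abs(line_sum - header) > TOLERANCE:
        return f"line items sum to {line_sum} but header total says {header}"
    return None


def retry_message(tool_use_id: str, error: str) -> dict:
    """Build the user turn that reports the validation failure to the model."""
    return {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": tool_use_id,
                "is_error": True,
                "content": error,
            }
        ],
    }


def extract_with_retries(call_model: Callable[[Optional[str]], dict]) -> Optional[dict]:
    """Try once, then retry up to MAX_RETRIES times with the error as feedback."""
    feedback = None
    for _ in range(1 + MAX_RETRIES):
        record = call_model(feedback)
        feedback = line_total_error(record)
        if feedback is None:
            return record
    return None  # still failing: the caller routes the invoice to human review
```

Using Decimal instead of float keeps the 0.01 tolerance check free of binary rounding noise.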

An invoice arrives with no matching purchase order in your system. The vendor claims the PO was issued verbally last quarter. Should the agent approve based on the vendor's reputation?

No. The three-way match service must find an active PO before the agent can approve. No PO means the invoice routes to human review with a structured exception block; the AP analyst either creates a retroactive PO and reprocesses, or rejects the invoice. The agent never bypasses the three-way match; verbal POs are not a valid input to the workflow. Tagged to AP-INV-03.
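A rough sketch of that routing rule. The 2% amount tolerance is an assumption inferred from the 1.9% and 2.1% edge cases in the ship checklist, and the record fields (`vendor_id`, `amount`, `quantity`) and reason codes are hypothetical.

```python
from decimal import Decimal
from typing import Optional

VARIANCE_LIMIT = Decimal("0.02")  # assumed 2% tolerance on invoice vs PO amount


def review(reason: str) -> dict:
    """Structured exception block for the human-review queue."""
    return {"status": "human_review", "reason": reason}


def three_way_match(invoice: dict, po: Optional[dict], grn: Optional[dict]) -> dict:
    # No active PO means no approval path at all; verbal POs are not an input.
    if po is None:
        return review("no_matching_po")
    if grn is None:
        return review("no_goods_receipt")
    if invoice["vendor_id"] != po["vendor_id"]:
        return review("vendor_mismatch")
    po_amount = Decimal(po["amount"])
    variance = abs(Decimal(invoice["amount"]) - po_amount) / po_amount
    if variance > VARIANCE_LIMIT:
        return review("amount_variance")
    if invoice["quantity"] != grn["quantity"]:
        return review("quantity_mismatch")
    return {"status": "matched", "reason": None}
```

Amounts are passed as strings and parsed with Decimal so that a 1.9% variance and a 2.1% variance land on opposite sides of the limit exactly.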

Your system prompt says 'never approve invoices above the vendor authorization cap'. Production logs show ~3% of approvals still exceed the cap. What is the architectural fix?

Move the constraint to a PreToolUse hook on approve_payment. The hook reads tool_input.vendor_id and tool_input.amount, queries the vendor master for vendor_ytd_spend and the vendor's cap, and exits 2 on violation with a structured stderr message that includes cap_remaining. Deterministic, not probabilistic. Pair with blocklist and duplicate checks in the same hook. Prompts leak; hooks do not. Tagged to AP-INV-04.
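A sketch of such a hook script, assuming the hook payload arrives as JSON on stdin with the tool arguments under `tool_input`, and that exit code 2 blocks the call and returns stderr to the model. The in-memory `VENDOR_MASTER` and the error codes are hypothetical; a real hook would query the vendor master.

```python
import json
import sys
from typing import Tuple

# Hypothetical in-memory vendor master; a real hook would query a database.
VENDOR_MASTER = {
    "V-001": {"cap": 100_000, "ytd_spend": 96_000, "blocklisted": False},
    "V-666": {"cap": 50_000, "ytd_spend": 0, "blocklisted": True},
}


def check_payment(tool_input: dict, vendors: dict) -> Tuple[int, str]:
    """Return (exit_code, stderr_message); exit code 2 blocks the tool call."""
    vendor = vendors.get(tool_input["vendor_id"])
    if vendor is None:
        return 2, json.dumps({"error": "unknown_vendor"})
    if vendor["blocklisted"]:
        return 2, json.dumps({"error": "blocklisted_vendor"})
    cap_remaining = vendor["cap"] - vendor["ytd_spend"]
    if tool_input["amount"] > cap_remaining:
        return 2, json.dumps({"error": "cap_exceeded", "cap_remaining": cap_remaining})
    return 0, ""


def main() -> int:
    event = json.load(sys.stdin)  # hook payload; tool arguments sit under "tool_input"
    code, message = check_payment(event["tool_input"], VENDOR_MASTER)
    if message:
        print(message, file=sys.stderr)
    return code

# When installed as a hook, the script would end with: sys.exit(main())
```

Keeping the decision in a pure function (`check_payment`) makes the cap-exceeded, blocklisted, and passing cases easy to unit-test without spawning the hook process.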

A vendor re-sends the same invoice 30 days later because they did not see the payment confirmation. How does your agent prevent paying the same invoice twice?

The PreToolUse hook on approve_payment queries the audit log for any row with the same (vendor_id, invoice_number) in the last 90 days. On a match, it exits 2 with the prior approval date and routes the invoice to a structured exception block. The hook holds no state of its own (the audit log is the source of truth), every decision is auditable, and the same check catches two parallel extractions that hit the same invoice within seconds, provided the first approval is committed to the audit log before the second hook runs. The 90-day window is a configurable policy. Tagged to AP-INV-05.
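The lookup could be sketched like this, using an in-memory SQLite table as a stand-in for the audit log. The table and column names (`audit_log`, `approved_on`) are assumptions; only the (vendor_id, invoice_number) key and the 90-day window come from the text.

```python
import sqlite3
from datetime import date, timedelta
from typing import Optional


def open_audit_log() -> sqlite3.Connection:
    """Create a throwaway audit log; production would use the real store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_log (vendor_id TEXT, invoice_number TEXT, approved_on TEXT)"
    )
    return conn


def find_duplicate(
    conn: sqlite3.Connection,
    vendor_id: str,
    invoice_number: str,
    today: date,
    window_days: int = 90,  # configurable policy window
) -> Optional[str]:
    """Return the most recent prior approval date inside the window, else None."""
    cutoff = (today - timedelta(days=window_days)).isoformat()
    row = conn.execute(
        "SELECT approved_on FROM audit_log "
        "WHERE vendor_id = ? AND invoice_number = ? AND approved_on >= ? "
        "ORDER BY approved_on DESC LIMIT 1",
        (vendor_id, invoice_number, cutoff),
    ).fetchone()
    return row[0] if row else None
```

Dates are stored as ISO 8601 strings, which sort lexicographically in date order, so the `>=` comparison works on plain text. A hook would exit 2 and include the returned date in its stderr message whenever the result is not None.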

11 · FAQ

Frequently asked

How do you handle handwritten or scanned invoices with poor image quality?

Vision-capable extraction handles most cases. For edge cases (rotated scans, faded ink, handwritten amendments), the validation-retry loop catches arithmetic mismatches and the three-way match catches structural issues. Records that still fail after 3 retries route to human review with the original image attached. Stratified accuracy reporting by vendor and document type quickly surfaces vendors whose invoices need a layout-aware preprocessing step.

What happens if a vendor has multiple naming variations (Apple Inc, APPLE, Apple, Inc.)?

The vendor master holds the canonical vendor_id and a list of name variations. The extraction schema requires the model to extract the vendor as text; a normalization step (lowercase, strip punctuation, fuzzy match against the vendor master) resolves it to a vendor_id. The duplicate-detection hook keys on vendor_id, not the raw name, so naming variation does not break uniqueness.
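One way the normalization step might look, using the standard library's difflib for the fuzzy match. The `VENDOR_VARIATIONS` table, the vendor ID, and the 0.85 similarity cutoff are all illustrative assumptions.

```python
import difflib
import string
from typing import Dict, List, Optional

# Hypothetical slice of the vendor master: canonical ID -> known name variations.
VENDOR_VARIATIONS: Dict[str, List[str]] = {
    "V-001": ["Apple Inc", "APPLE", "Apple, Inc."],
}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    stripped = name.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def resolve_vendor_id(
    raw_name: str,
    variations: Dict[str, List[str]] = VENDOR_VARIATIONS,
    cutoff: float = 0.85,  # assumed similarity threshold for the fuzzy match
) -> Optional[str]:
    index = {normalize(v): vid for vid, names in variations.items() for v in names}
    key = normalize(raw_name)
    if key in index:  # exact hit after normalization
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None  # None: route to human review
```

Returning None for an unresolved name, instead of guessing, keeps the duplicate-detection key trustworthy: an invoice with no confident vendor_id should not reach the hook at all.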

Can the agent process multi-currency invoices in one workflow?

Yes. The schema enforces currency as an ISO 4217 enum. The cap policy and duplicate detection key on vendor_id and amount in the invoice currency; the cap can be denominated per-vendor in the vendor master. For consolidated reporting, a daily FX-rate table converts to a base currency at audit-log write time.

How do you handle credit memos (negative invoices)?

Credit memos use the same schema with total_amount representing the credit (positive number) and a separate document_type enum field that distinguishes invoice from credit_memo. The PreToolUse hook treats credit memos as vendor_ytd_spend - amount (effectively decreasing YTD spend). Three-way match runs against the original invoice and the credit-memo reason code instead of a PO and GRN.
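The spend arithmetic can be illustrated with a small sketch. `projected_ytd_spend` is a hypothetical helper, and the assumption that an ordinary invoice adds its amount to YTD spend is implied by the cap policy but not stated outright in the text.

```python
from decimal import Decimal


def projected_ytd_spend(
    vendor_ytd_spend: Decimal, amount: Decimal, document_type: str
) -> Decimal:
    """YTD spend after this document; amount is always a positive number."""
    if amount < 0:
        raise ValueError("amount must be positive; use document_type for credits")
    if document_type == "credit_memo":
        return vendor_ytd_spend - amount  # a credit lowers YTD spend
    if document_type == "invoice":
        return vendor_ytd_spend + amount  # assumed: an invoice raises YTD spend
    raise ValueError(f"unknown document_type: {document_type}")
```

Keeping the sign in `document_type` instead of in the amount means one schema and one validation path serve both document kinds.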

Should the agent auto-approve, or always route to human review?

Auto-approve only when all gates pass: schema valid, semantic validation passed, three-way match within thresholds, PreToolUse hook approved (cap, blocklist, duplicate). Any failure routes to human review with a structured exception block. Auto-approval rate at steady state is typically 75-85%; the remaining 15-25% needs an analyst's eye. The point of the agent is not to remove the analyst; it is to make the analyst's queue much smaller and every queued invoice well-explained.
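The all-gates rule can be expressed as a small routing function. The gate names and the shape of the result are hypothetical; the only rule taken from the text is that any failed gate sends the invoice to human review.

```python
from typing import Dict

# Gate names are illustrative labels for the four checks listed above.
GATES = ("schema_valid", "semantic_valid", "three_way_match", "pretooluse_hook")


def route(gate_results: Dict[str, bool]) -> dict:
    """Auto-approve only if every gate passed; a missing gate counts as failed."""
    failed = [gate for gate in GATES if not gate_results.get(gate, False)]
    if failed:
        # The failed-gate list feeds the structured exception block.
        return {"decision": "human_review", "failed_gates": failed}
    return {"decision": "auto_approve", "failed_gates": []}
```

Treating a missing gate result as a failure is a deliberate fail-closed choice: an invoice whose check never ran should not be paid automatically.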

How long do you retain the audit log?

At least 7 years for financial-record compliance (US SOX, EU equivalent). The log uses an append-only schema with immutable rows, indexed by vendor_id, invoice_number, and date. A replay tool reconstructs any approval decision in seconds when finance asks 'why did we approve this in May?' three months later.

Is Batch API worth using for fewer than 1000 invoices a night?

Sometimes. The Batch API gives a flat 50% discount, but results can take up to 24 hours. For under 500 invoices, the Batch overhead and the latency may not be worth it; synchronous extraction is cheaper end-to-end when AP needs same-day processing. For nightly backfills of historical invoices or large vendor consolidations (more than 1000 documents), the Batch API earns its keep.

