The problem
What the customer needs
- Schema-conformant extraction on every invoice: vendor, number, line items, total, currency, due date, PO reference. No prose wrapping; downstream systems must parse cleanly.
- Three-way match before approval: invoice, purchase order, goods receipt all agree on amount, vendor, and quantities.
- Cap-policy enforcement that cannot be bypassed by clever invoice phrasing: vendor authorization caps, duplicate detection, blocklisted-vendor checks.
- Audit-grade trail of every approval and rejection so finance can replay any decision in a quarterly close.
Why naive approaches fail
- Prompt 'output JSON' for invoice extraction: ~15% leakage on edge invoices (handwritten notes, mixed languages, credit memos, rotated scans).
- Single-pass extraction with no semantic validation: line totals do not match the header; corrupted records ship downstream.
- No three-way match: the agent approves an invoice for goods that were never received, or against a PO that does not exist.
- Cap policy in the system prompt: ~3% of approvals exceed authorization cap because prompts leak under unusual phrasing.
- No duplicate-invoice check: the same invoice number gets paid twice when the vendor re-sends after a delivery confirmation.
- Forced tool_choice: { type: 'tool', name: 'extract_invoice' } on every extraction call.
- JSON schema requires vendor_id, invoice_number, line_items[], total_amount, currency (ISO 4217 enum), due_date (ISO 8601), PO_reference (nullable).
- Validation-retry loop confirms sum(line_items) == total, currency in enum, due_date >= invoice_date.
- Three-way match service reconciles invoice + PO + GRN; variance > 2% routes to human review.
- PreToolUse hook on approve_payment: deny on cap exceeded, vendor blocklisted, or duplicate (vendor_id, invoice_number) in the last 90 days.
- PostToolUse audit log writes every approval / rejection / hook decision.
The system
What each part does
Five components; each owns one concept.
Invoice JSON Schema
the contract, in tools[0].input_schema
The output shape lives inside a tool definition, not as freeform text. Required: vendor_id, invoice_number, line_items[], total_amount, currency (ISO 4217 enum), due_date (ISO 8601 string). Optional and nullable: PO_reference, tax_amount, notes. Every numeric field has a minimum: 0. Every line item has description, quantity, unit_price, total.
Configuration
tools = [{ name: 'extract_invoice', input_schema: { type: 'object', properties: { vendor_id: {type: 'string'}, invoice_number: {type: 'string'}, total_amount: {type: 'number', minimum: 0}, currency: {type: 'string', enum: ['USD', 'EUR', 'GBP', 'INR', 'JPY', 'unclear']}, due_date: {type: 'string', format: 'date'}, line_items: {type: 'array', items: {...}}, PO_reference: {type: ['string', 'null']} }, required: ['vendor_id', 'invoice_number', 'total_amount', 'currency', 'due_date', 'line_items'] } }]
Forced tool_use Extractor
tool_choice: { type: 'tool', name: 'extract_invoice' }
Forces the model to fire extract_invoice with arguments matching the schema. No prose preamble, no probabilistic adherence. Vision-capable invocation reads the PDF or image; the model emits a structured tool_use. Pair with few-shot examples that show currency: 'unclear' on truly ambiguous source.
Configuration
tool_choice: { type: 'tool', name: 'extract_invoice' }. Use auto only on triage-style flows. Forced is for mandatory extraction.
Validation-Retry Loop
sum check, currency enum, date sanity
Schema enforces shape. Code enforces meaning. After parse: sum(line_items[].total) == total_amount (within a 0.01 tolerance in the invoice currency, to absorb rounding); currency in the enum; due_date in YYYY-MM-DD format; due_date >= invoice_date. On failure, feed the specific error back to the model ('line totals sum to 4950 but header total is 5000'); typical convergence in 1-2 retries.
Configuration
loop: extract -> parse -> validate_semantically -> on failure, append { role: 'user', content: tool_result with is_error: true and a specific error } -> retry. Max retries: 3. After 3, route to human review.
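The control flow above can be sketched without any API call. This is a minimal version under stated assumptions: `extract_once` is a hypothetical callable standing in for the model call (it receives the previous validation feedback, or None on the first attempt), `validate` returns a list of error strings, and `RetryOutcome` is an illustrative shape, not part of any SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RetryOutcome:
    record: Optional[dict]   # last candidate record produced
    routed_to: str           # "accepted" or "human-review"
    attempts: int            # number of extraction attempts made
    errors: list = field(default_factory=list)


def run_validation_retry(
    extract_once: Callable[[Optional[str]], dict],
    validate: Callable[[dict], list],
    max_attempts: int = 3,
) -> RetryOutcome:
    """Extract, validate, and retry with specific feedback; hold for review if stuck."""
    feedback: Optional[str] = None
    record: Optional[dict] = None
    errors: list = []
    for attempt in range(1, max_attempts + 1):
        record = extract_once(feedback)
        errors = validate(record)
        if not errors:
            return RetryOutcome(record, "accepted", attempt)
        # Feed the specific failure back so the next attempt can correct it.
        feedback = "Validation failed: " + "; ".join(errors)
    return RetryOutcome(record, "human-review", max_attempts, errors)


# Hypothetical stand-ins used only to exercise the loop.
def fake_extract(feedback: Optional[str]) -> dict:
    total = 5000 if feedback is None else 4950  # corrected after feedback
    return {"line_items": [{"total": 4950}], "total_amount": total}


def sum_check(record: dict) -> list:
    items = sum(it["total"] for it in record["line_items"])
    if abs(items - record["total_amount"]) > 0.01:
        return [f"line totals sum to {items} but header total is {record['total_amount']}"]
    return []


outcome = run_validation_retry(fake_extract, sum_check)
```

Returning a routing decision instead of raising keeps the unconverged record and its errors available for the human-review queue.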
Three-Way Match Service
invoice + PO + goods receipt
Queries the PO master and the goods-receipt ledger by PO_reference. Compares amount (variance <= 2% OK for FX rounding and small price changes), vendor identity (normalized vendor name fuzzy match), line-item count (must match), and date sanity (invoice date >= PO date; receipt date >= PO date). Variance above thresholds returns a structured exception; invoice is held pending human review.
Configuration
match(invoice, po, grn) -> { match: bool, variance_pct, mismatched_fields[], routed_to: 'auto-approve' | 'human-review' }. Threshold: amount variance > 2% -> human-review. Vendor mismatch -> human-review. Line-item count mismatch -> human-review.
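One way to produce the result shape above, including variance_pct and mismatched_fields, is sketched here. It assumes the invoice and PO carry a `vendor_name` field (a hypothetical name; the extraction schema uses vendor_id) and uses the standard library's `difflib.SequenceMatcher` as the fuzzy matcher with an illustrative 0.9 similarity threshold. Neither choice comes from the source.

```python
from difflib import SequenceMatcher
from typing import Optional


def normalize(name: str) -> str:
    """Lowercase and drop punctuation and whitespace."""
    return "".join(ch.lower() for ch in name if ch.isalnum())


def vendor_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized vendor names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match(invoice: dict, po: dict, grn: dict,
          max_variance_pct: float = 2.0,
          min_vendor_similarity: float = 0.9) -> dict:
    mismatched = []
    variance_pct: Optional[float] = None

    po_amount = po.get("total_amount", 0)
    if po_amount > 0:
        variance_pct = abs(invoice["total_amount"] - po_amount) / po_amount * 100
        if variance_pct > max_variance_pct:
            mismatched.append("amount")
    else:
        # No usable PO amount: variance cannot be computed, so hold the invoice.
        mismatched.append("amount")

    if vendor_similarity(invoice["vendor_name"], po["vendor_name"]) < min_vendor_similarity:
        mismatched.append("vendor")

    if len(invoice["line_items"]) != len(grn.get("line_items", [])):
        mismatched.append("line_item_count")

    return {
        "match": not mismatched,
        "variance_pct": variance_pct,
        "mismatched_fields": mismatched,
        "routed_to": "human-review" if mismatched else "auto-approve",
    }


po = {"total_amount": 1000.0, "vendor_name": "APPLE, Inc"}
grn = {"line_items": [{}, {}]}
under = match({"total_amount": 1019.0, "vendor_name": "Apple Inc.", "line_items": [{}, {}]}, po, grn)
over = match({"total_amount": 1021.0, "vendor_name": "Apple Inc.", "line_items": [{}, {}]}, po, grn)
```

The two sample invoices sit at 1.9% and 2.1% variance, the edge cases on either side of the threshold.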
PreToolUse Cap and Duplicate Hook
deterministic policy gate before approve_payment
Sits between the model's tool_use for approve_payment and actual execution. Reads tool_input.vendor_id, tool_input.amount, tool_input.invoice_number. Three checks. (1) Cap: vendor_ytd_spend + amount <= vendor_authorization_cap. (2) Blocklist: vendor not in the active blocklist. (3) Duplicate: no row in the audit log with the same (vendor_id, invoice_number) in the last 90 days. Any check fails and the hook exits 2 with a structured stderr message; the agent observes the deny as tool_result is_error: true and routes to a structured exception block for the AP analyst.
Configuration
matcher: 'approve_payment'. Hook exits 2 with stderr { reason: 'cap_exceeded' | 'vendor_blocklisted' | 'duplicate_detected', detail: ..., recommended_action: ... }. SDK forwards stderr to the model as a tool_result with is_error: true.
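The structured deny message described above could be built as follows. This is a sketch under assumptions: the three reason codes come from the text, while the `recommended_action` strings and the helper names `build_deny` and `deny` are hypothetical.

```python
import json
import sys

DENY_EXIT_CODE = 2  # exit code the hook uses to block the tool call

# Hypothetical recommended actions, one per reason code from the text.
RECOMMENDED_ACTIONS = {
    "cap_exceeded": "route to AP analyst for cap review",
    "vendor_blocklisted": "hold payment and notify vendor management",
    "duplicate_detected": "reject submission and link the prior approval",
}


def build_deny(reason: str, detail: str) -> str:
    """Serialize a deny decision as a single JSON line."""
    if reason not in RECOMMENDED_ACTIONS:
        raise ValueError(f"unknown deny reason: {reason!r}")
    return json.dumps({
        "reason": reason,
        "detail": detail,
        "recommended_action": RECOMMENDED_ACTIONS[reason],
    })


def deny(reason: str, detail: str) -> None:
    """Write the structured message to stderr and exit with the deny code."""
    print(build_deny(reason, detail), file=sys.stderr)
    sys.exit(DENY_EXIT_CODE)
```

Keeping the message machine-readable lets the exception block shown to the AP analyst be assembled from fields instead of parsed from free text.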
Data flow
Eight steps to production
Author the invoice JSON schema as a tool definition
Define the output shape in tools[0].input_schema. Every required field listed in required[]. Currency is an enum that includes an 'unclear' escape hatch. PO_reference is ['string', 'null'] because cash invoices and credit memos have no PO. Every numeric field has minimum: 0. Line items are an array with description, quantity, unit_price, total. The schema is the contract; everything downstream depends on it being right.
```python
from anthropic import Anthropic

client = Anthropic()

EXTRACT_INVOICE_TOOL = {
    "name": "extract_invoice",
    "description": "Extract a structured invoice record from a PDF or image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_id": {"type": "string"},
            "invoice_number": {"type": "string"},
            "invoice_date": {"type": "string", "format": "date"},
            "due_date": {"type": "string", "format": "date"},
            "currency": {
                "type": "string",
                "enum": ["USD", "EUR", "GBP", "INR", "JPY", "unclear"],
            },
            "total_amount": {"type": "number", "minimum": 0},
            "tax_amount": {"type": ["number", "null"], "minimum": 0},
            "PO_reference": {"type": ["string", "null"]},
            "line_items": {
                "type": "array",
                "minItems": 1,
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number", "minimum": 0},
                        "unit_price": {"type": "number", "minimum": 0},
                        "total": {"type": "number", "minimum": 0},
                    },
                    "required": ["description", "quantity", "unit_price", "total"],
                },
            },
        },
        "required": [
            "vendor_id", "invoice_number", "invoice_date", "due_date",
            "currency", "total_amount", "line_items",
        ],
    },
}
```
Force tool_choice and run extraction with vision input
Set tool_choice: { type: 'tool', name: 'extract_invoice' } so the model has no choice but to fire the tool with arguments matching the schema. Pass the invoice as a vision input (PDF page rasterized to image, or direct image upload). The model emits a structured tool_use; the harness extracts tool_use.input as the candidate record.
```python
import base64


def extract_invoice(invoice_image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Run one forced-tool extraction on a single invoice image."""
    image_b64 = base64.b64encode(invoice_image_bytes).decode("ascii")
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # API model IDs use hyphens, not dots
        max_tokens=2048,
        tools=[EXTRACT_INVOICE_TOOL],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {"type": "base64", "media_type": mime_type, "data": image_b64},
                    },
                    {"type": "text", "text": "Extract this invoice into the schema."},
                ],
            }
        ],
    )
    # The candidate record is the input of the extract_invoice tool_use block.
    for block in resp.content:
        if block.type == "tool_use" and block.name == "extract_invoice":
            return block.input
    raise RuntimeError("forced tool_choice did not yield tool_use")
```
Wrap extraction in a validation-retry loop
Schema guarantees structure; semantics need code. After parsing, validate: sum(line_items[].total) equals total_amount within 0.01 tolerance; currency in the enum; due_date format and >= invoice_date. On failure, feed the specific error back via tool_result with is_error: true so the model sees what was wrong; retry up to 3 times. Most failures converge in 1-2 retries because the model now knows what the validator rejected.
```python
from datetime import date


def validate(record: dict) -> list[str]:
    """Return a list of semantic errors; an empty list means the record passes."""
    errors = []
    # Line totals must reconcile with the header total.
    items_sum = sum(it.get("total", 0) for it in record.get("line_items", []))
    total = record.get("total_amount", 0)
    if abs(items_sum - total) > 0.01:
        errors.append(
            f"line items sum to {items_sum:.2f} but total_amount is "
            f"{total:.2f}; reconcile"
        )
    if record.get("currency") not in {"USD", "EUR", "GBP", "INR", "JPY", "unclear"}:
        errors.append(f"currency {record.get('currency')!r} not in ISO 4217 enum")
    # Dates must parse as YYYY-MM-DD and be in order.
    try:
        inv_date = date.fromisoformat(record.get("invoice_date", ""))
        due_date = date.fromisoformat(record.get("due_date", ""))
        if due_date < inv_date:
            errors.append(
                f"due_date {due_date} is before invoice_date {inv_date}"
            )
    except (TypeError, ValueError) as e:
        errors.append(f"date parse failed: {e}")
    return errors


def extract_with_retry(invoice_image_bytes: bytes, max_retries: int = 3) -> dict:
    image_b64 = base64.b64encode(invoice_image_bytes).decode("ascii")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            # Text blocks carry their payload under "text", not "content".
            {"type": "text", "text": "Extract this invoice into the schema."},
        ],
    }]
    for attempt in range(max_retries):
        resp = client.messages.create(
            model="claude-sonnet-4-5",  # API model IDs use hyphens, not dots
            max_tokens=2048,
            tools=[EXTRACT_INVOICE_TOOL],
            tool_choice={"type": "tool", "name": "extract_invoice"},
            messages=messages,
        )
        tool_use = next(b for b in resp.content if b.type == "tool_use")
        record = tool_use.input
        errors = validate(record)
        if not errors:
            return record
        # Replay the failed attempt, then report the specific validation errors.
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": "Validation failed: " + "; ".join(errors) + ". Re-extract.",
                "is_error": True,
            }],
        })
    # The caller should catch this and route the invoice to human review.
    raise ValueError(f"extraction did not converge in {max_retries} attempts")
```
Run a three-way match against PO and goods receipt
Query the PO master by PO_reference and the goods-receipt ledger by the same key. Compare amount (variance <= 2% OK for FX rounding and minor price changes), vendor identity (normalized fuzzy match on vendor name), and line-item count (must match exactly). Variance above any threshold routes to human review with a structured exception block; otherwise auto-proceed.
```python
def three_way_match(invoice: dict, po: dict, grn: dict) -> dict:
    """Reconcile invoice with purchase order and goods receipt note."""
    issues = []

    # Amount: allow up to 2% variance against the PO total.
    inv_amount = invoice["total_amount"]
    po_amount = po.get("total_amount", 0)
    if po_amount > 0:
        variance_pct = abs(inv_amount - po_amount) / po_amount * 100
        if variance_pct > 2.0:
            issues.append(
                f"amount variance {variance_pct:.2f}% exceeds 2% threshold"
            )
    else:
        # Without a usable PO amount the variance check cannot pass silently.
        issues.append("PO total_amount missing or zero; cannot compute variance")

    # Vendor: exact comparison after normalization (not a true fuzzy match).
    if normalize_vendor(invoice["vendor_id"]) != normalize_vendor(po["vendor_id"]):
        issues.append(
            f"vendor mismatch: invoice {invoice['vendor_id']!r} "
            f"vs PO {po['vendor_id']!r}"
        )

    # Line items: the invoice and the goods receipt must list the same count.
    if len(invoice["line_items"]) != len(grn.get("line_items", [])):
        issues.append(
            f"line-item count mismatch: invoice {len(invoice['line_items'])} "
            f"vs GRN {len(grn.get('line_items', []))}"
        )

    return {
        "match": len(issues) == 0,
        "issues": issues,
        "routed_to": "auto-approve" if not issues else "human-review",
    }


def normalize_vendor(name: str) -> str:
    """Lowercase and strip everything except letters and digits."""
    return "".join(ch.lower() for ch in name if ch.isalnum())
```
Wire the PreToolUse cap and duplicate-detection hook
Hook on approve_payment. Three checks. (1) Cap: vendor_ytd_spend + amount <= vendor_authorization_cap. (2) Blocklist: vendor not on the active blocklist. (3) Duplicate: no audit-log row with the same (vendor_id, invoice_number) in the last 90 days. Any check fails and the hook exits 2 with a structured stderr message; the agent observes the deny and routes to an exception block for the AP analyst. Deterministic, no prompt-injection bypass.
```python
# .claude/hooks/invoice_approval.py
import sys, json, os, sqlite3
from datetime import date, timedelta

DB = sqlite3.connect(os.environ.get("AUDIT_DB", "audit.sqlite3"))


def vendor_cap_check(vendor_id: str, amount: float) -> str | None:
    row = DB.execute(
        "SELECT cap, ytd_spend FROM vendor_master WHERE vendor_id = ?",
        (vendor_id,),
    ).fetchone()
    if not row:
        return f"vendor {vendor_id!r} not in master; escalate"
    cap, ytd = row
    if ytd + amount > cap:
        remaining = cap - ytd
        return (
            f"vendor cap exceeded: ytd_spend={ytd:.2f} + amount={amount:.2f} > "
            f"cap={cap:.2f}; cap_remaining={remaining:.2f}"
        )
    return None


def blocklist_check(vendor_id: str) -> str | None:
    row = DB.execute(
        "SELECT 1 FROM vendor_blocklist WHERE vendor_id = ?", (vendor_id,)
    ).fetchone()
    if row:
        return f"vendor {vendor_id!r} on active blocklist"
    return None


def duplicate_check(vendor_id: str, invoice_number: str) -> str | None:
    # Look back 90 days for a prior approval of the same invoice.
    cutoff = (date.today() - timedelta(days=90)).isoformat()
    row = DB.execute(
        "SELECT approved_at FROM audit_log WHERE vendor_id = ? "
        "AND invoice_number = ? AND approved_at >= ? ORDER BY approved_at DESC LIMIT 1",
        (vendor_id, invoice_number, cutoff),
    ).fetchone()
    if row:
        return (
            f"duplicate detected: same (vendor_id, invoice_number) approved on "
            f"{row[0]}; reject this submission"
        )
    return None


def main():
    payload = json.loads(sys.stdin.read())
    if payload["tool_name"] != "approve_payment":
        sys.exit(0)
    inp = payload["tool_input"]
    # All three checks run; the first failure is reported and blocks the call.
    for check in (
        vendor_cap_check(inp["vendor_id"], inp["amount"]),
        blocklist_check(inp["vendor_id"]),
        duplicate_check(inp["vendor_id"], inp["invoice_number"]),
    ):
        if check:
            print(check, file=sys.stderr)
            sys.exit(2)
    sys.exit(0)


if __name__ == "__main__":
    main()
```
Cache the schema and the vendor master
The schema is the largest stable token cost (~1500 tokens for invoice extraction). The vendor master (caps, blocklist, name normalization rules) is also stable per session. Mark both with cache_control: ephemeral so a 5-minute TTL keeps them warm across sustained AP traffic. Realistic savings: ~80% on cached portions, ~50% reduction on overall steady-state cost.
```python
def extract_with_cache(invoice_image_bytes: bytes, vendor_master_blob: str) -> dict:
    image_b64 = base64.b64encode(invoice_image_bytes).decode("ascii")
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # API model IDs use hyphens, not dots
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": (
                    "You are an AP-automation extraction agent. Return only "
                    "structured tool_use; never prose."
                ),
                "cache_control": {"type": "ephemeral"},
            },
            {
                # Stable per session, so it is worth a cache breakpoint.
                "type": "text",
                "text": vendor_master_blob,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        tools=[
            {**EXTRACT_INVOICE_TOOL, "cache_control": {"type": "ephemeral"}},
        ],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text", "text": "Extract this invoice into the schema."},
            ],
        }],
    )
    # Usage counters show whether this call wrote to or read from the cache.
    print(f"cache_creation: {resp.usage.cache_creation_input_tokens}")
    print(f"cache_read: {resp.usage.cache_read_input_tokens}")
    return next(b.input for b in resp.content if b.type == "tool_use")
```
Use Batch API for overnight bulk runs
Sync API for inbox-arrival latency. For nightly backfills (10K invoices), the Batch API gives a flat 50% discount with a 24-hour SLA. Combined with schema and vendor-master caching (per-100-item sub-batches keep ephemeral cache warm), bulk extraction cost drops ~75% versus naive sync. Resubmit failures the next morning as a fresh batch with the specific error in the next message.
```python
def submit_bulk_extraction(invoices: list[dict]) -> str:
    """Submit a batch of invoice extractions for overnight processing."""
    requests = []
    for inv in invoices:
        image_b64 = base64.b64encode(inv["image_bytes"]).decode("ascii")
        requests.append({
            "custom_id": f"extract-{inv['id']}",
            "params": {
                "model": "claude-sonnet-4-5",  # API model IDs use hyphens, not dots
                "max_tokens": 2048,
                "tools": [EXTRACT_INVOICE_TOOL],
                "tool_choice": {"type": "tool", "name": "extract_invoice"},
                "messages": [{
                    "role": "user",
                    "content": [
                        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                        {"type": "text", "text": "Extract this invoice into the schema."},
                    ],
                }],
            },
        })
    batch = client.messages.batches.create(requests=requests)
    print(f"Batch {batch.id} submitted with {len(requests)} extractions")
    return batch.id


def harvest_batch(batch_id: str):
    batch = client.messages.batches.retrieve(batch_id)
    if batch.processing_status != "ended":
        return {"status": "not_ready"}
    accepted, rejected = [], []
    for r in client.messages.batches.results(batch_id):
        if r.result.type == "succeeded":
            tu = next(b for b in r.result.message.content if b.type == "tool_use")
            # validate() returns a list of errors; empty means the record passes.
            if not validate(tu.input):
                accepted.append(tu.input)
                continue
        # Failed requests and records that fail validation are queued for retry.
        rejected.append(r.custom_id)
    return {"accepted": accepted, "rejected_for_retry": rejected}
```
Audit-log every approval, rejection, and hook decision
PostToolUse hook on every approve_payment call. Append a row to durable storage: timestamp, vendor_id, invoice_number, amount, currency, three-way-match outcome, hook decisions (cap, blocklist, duplicate), final routing (approved | human-review | denied). Retain at least 7 years for audit compliance. The audit log is the replay tool when finance asks 'why did we approve this in May?' three months later.
```python
import datetime, json, sqlite3

AUDIT_DB = sqlite3.connect("audit.sqlite3")
# A surrogate key avoids collisions when two rows share a timestamp.
AUDIT_DB.execute("""
CREATE TABLE IF NOT EXISTS audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts TEXT NOT NULL,
    vendor_id TEXT,
    invoice_number TEXT,
    amount REAL,
    currency TEXT,
    match_outcome TEXT,
    hook_decisions TEXT,
    final_routing TEXT,
    approved_at TEXT
)
""")


def audit(invoice: dict, match_result: dict, hook_decisions: dict, routing: str):
    """Append one decision row; approved_at is set only for approvals."""
    now_utc = datetime.datetime.now(datetime.timezone.utc)
    AUDIT_DB.execute(
        "INSERT INTO audit_log (ts, vendor_id, invoice_number, amount, currency, "
        "match_outcome, hook_decisions, final_routing, approved_at) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            now_utc.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            invoice["vendor_id"],
            invoice["invoice_number"],
            invoice["total_amount"],
            invoice["currency"],
            json.dumps(match_result),
            json.dumps(hook_decisions),
            routing,
            datetime.date.today().isoformat() if routing == "approved" else None,
        ),
    )
    AUDIT_DB.commit()
```
The four decisions
| Decision | Right answer | Wrong answer | Why |
|---|---|---|---|
| Output shape guarantee on extraction | Forced tool_choice with input_schema as the contract | Prompt 'output JSON' or 'respond with valid JSON only' | Prompt-only is probabilistic (~85% adherence); ~15% leakage on edge invoices (handwritten notes, mixed languages, credit memos, rotated scans). Forced tool_use is structural (100% adherence). The cost is identical; the reliability gap is decisive in finance. |
| Vendor authorization cap enforcement | PreToolUse hook reads vendor_ytd_spend, exits 2 on violation | System prompt: 'never approve above the vendor cap' | Prompts leak ~3% in production. Hooks are deterministic. For policy-bearing limits (cap, duplicate, blocklist), the deterministic gate is the only credible architecture. Prompt-only enforcement is a finding waiting to be flagged in the next audit. |
| Same invoice arriving twice (vendor re-sends after delivery) | PreToolUse duplicate-detection hook keyed on (vendor_id, invoice_number) over last 90 days | Trust the model to notice duplicates in conversation context | Context memory is unreliable across multi-turn or batch runs. The hook is stateless, queries the audit log, and prevents race conditions when two parallel extractions hit the same invoice within seconds. |
| Bulk overnight processing of 10K invoices | Batch API + schema and vendor-master caching | Sync API in a tight loop or sync API without caching | Batch API gives a flat 50% discount with a 24-hour SLA. Caching adds another ~80% off the schema and vendor-master tokens. Combined: ~75% savings versus naive sync. Sync API is reserved for inbox-arrival latency. |
Where it breaks
Five failure pairs. Each one is one exam question. The fix is always architectural: deterministic gates, structured fields, pinned state.
Prompt 'extract this invoice as JSON' leaks ~15% on edge invoices. Downstream parser breaks every seventh document; AP analyst spends the morning re-keying invoices the agent botched.
AP-INV-01: Forced tool_choice: { type: 'tool', name: 'extract_invoice' } plus a strict JSON schema in tools[0].input_schema. The model has no choice but to fire the tool with arguments matching the schema. 100% structural adherence.
Single-pass extraction with no math check. The model returns a structurally-valid record where line totals sum to 4950 but the header total says 5000. Bad data ships downstream; quarterly close finds the discrepancy three months later.
AP-INV-02: Validation-retry loop. After parse, validate sum(line_items[].total) == total_amount (within 0.01 tolerance), currency in ISO 4217 enum, due_date >= invoice_date. On failure, feed the specific error back; retry up to 3 times; route to human review if still failing.
Agent approves an invoice that has no matching purchase order, or where the goods receipt was for fewer items, or where the vendor name on the invoice does not match the vendor on the PO. AP pays for goods never received, or pays the wrong vendor.
AP-INV-03: Three-way match service queries PO master and goods-receipt ledger. Compares amount (variance <= 2% OK), normalized vendor name, line-item count. Variance above thresholds routes to human review with a structured exception block.
System prompt: 'never approve more than the vendor authorization cap'. Production logs show ~3% of approvals exceed the cap because the prompt language leaks under unusual phrasing or when the agent is processing many invoices in one session.
AP-INV-04: PreToolUse hook on approve_payment reads tool_input.vendor_id and tool_input.amount, queries the vendor master for vendor_ytd_spend + cap, exits 2 on violation with a structured message including cap_remaining. Deterministic, not probabilistic.
Vendor re-sends the same invoice number after delivery confirmation, or the same invoice is uploaded twice through different channels (email + portal). The agent approves both. AP discovers the duplicate payment in next month's reconciliation.
AP-INV-05: PreToolUse hook queries the audit log for any row with the same (vendor_id, invoice_number) in the last 90 days. On match, exits 2 with the prior approval date. Stateless, auditable, prevents race conditions in parallel runs.
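For the parallel-run case in AP-INV-05, a query-then-approve check can be backed by a storage-level guard. The sketch below assumes SQLite and a hypothetical `approved_invoices` table with a unique constraint on (vendor_id, invoice_number); it is stricter than the 90-day window because the constraint never expires, so treat it as one possible complement to the hook, not the document's own design.

```python
import sqlite3


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical approvals ledger."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS approved_invoices ("
        " vendor_id TEXT NOT NULL,"
        " invoice_number TEXT NOT NULL,"
        " approved_at TEXT NOT NULL,"
        " UNIQUE (vendor_id, invoice_number))"
    )
    return conn


def record_approval(conn: sqlite3.Connection, vendor_id: str,
                    invoice_number: str, approved_at: str) -> bool:
    """Return True if the approval was recorded, False if it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO approved_invoices (vendor_id, invoice_number, approved_at) "
                "VALUES (?, ?, ?)",
                (vendor_id, invoice_number, approved_at),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected a second approval of the same invoice.
        return False


ledger = open_ledger()
first = record_approval(ledger, "V-100", "INV-7", "2024-05-01")
second = record_approval(ledger, "V-100", "INV-7", "2024-05-01")
```

Because the database enforces uniqueness at insert time, two workers that both pass the lookup cannot both record the approval.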
Cost & latency
Schema ~1500 tokens at cache-read price plus image vision tokens (~1000-2000) plus ~150 output. Sustained AP traffic with cache hits >= 70% drops effective cost predictably.
Pure SQL queries against PO master and goods-receipt ledger. No LLM call. Latency is dominated by the database round-trip.
Subprocess reads stdin JSON, runs three SQL queries (vendor cap, blocklist, duplicate), exits 0 or 2. No LLM call. Latency below the noise floor of any tool dispatch.
Batch API flat 50% discount times schema and vendor-master cache (~80% off cached portion). 10K invoices at typical complexity drop from ~$30 sync uncached to ~$8 batch cached.
5-10% of records retry once; 1-2% retry twice. Specific-error feedback converges quickly. Pipeline cost up ~5% to gain ~99% schema-conformance plus ~99% semantic-conformance.
Sync cached extraction at scale. Adding human review of unconverged records adds operator-time cost but recovers the long tail of edge invoices.
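The combined batch-plus-cache figure can be checked with a small worked calculation. This sketch uses only the discounts quoted above (50% batch, ~80% off cached tokens) and a simplified model that is an assumption of this example: cost is proportional to input tokens, and output tokens and cache-write surcharges are ignored.

```python
def relative_cost(cached_fraction: float,
                  cache_discount: float = 0.80,
                  batch_discount: float = 0.50) -> float:
    """Cost relative to naive sync (1.0) for a given share of cacheable input cost."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = cached_fraction * (1 - cache_discount)   # discounted cached share
    uncached = 1 - cached_fraction                    # full-price share (e.g. image tokens)
    return (1 - batch_discount) * (cached + uncached)


batch_only = relative_cost(0.0)      # batch discount alone
mixed = relative_cost(0.625)         # 62.5% of input cost is cacheable
fully_cached = relative_cost(1.0)    # upper bound on savings
```

Under this model, batch alone halves the cost, and the ~75% combined saving corresponds to roughly 60% of input cost sitting in cacheable tokens. Per-invoice image tokens are never cached, which is why the combined figure stays well short of the 90% bound.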
Ship checklist
Two passes. Build-time gates verify the code; run-time gates verify the system in production.
Build-time
- Invoice JSON schema lives in tools[0].input_schema with required and nullable fields explicit
- tool_choice forced to extract_invoice on every extraction call
- Currency field is an enum with an 'unclear' escape hatch
- Validation-retry loop with sum check, currency enum, date sanity
- Three-way match service against PO master and goods-receipt ledger
- PreToolUse cap-and-duplicate hook on approve_payment
- Schema and vendor master cached with cache_control: ephemeral
- Batch API for nightly bulk runs (greater than 100 invoices)
- PostToolUse audit log writes every approval, rejection, and hook decision; 7-year retention
- Stratified accuracy reporting by vendor, currency, document type
- Human-review queue for invoices that fail validation, three-way match, or hook checks
Run-time
- JSON schema versioned in source control; PR-reviewed before deploy
- Vendor master kept current; cap and blocklist updates flow through change control
- Validation-retry loop unit-tested for line-total mismatch, currency drift, date inversion
- Three-way match service tested against synthetic PO + GRN cases including 1.9% and 2.1% variance edge cases
- PreToolUse hook unit-tested for cap exceeded, blocklisted vendor, duplicate within 90 days, all three pass
- PostToolUse audit log retention confirmed at 7 years; index on (vendor_id, invoice_number, date)
- Schema cache hit rate monitored; alert if drops below 50%
- Stratified accuracy dashboard by vendor and document type; alert on any vendor below 90% pass rate
- Human-review queue with SLA documented and on-call for invoices held more than 48 hours
- Batch API job for nightly backfill with auto-resubmit on transient failures
Five exam-pattern questions
Your invoice extraction agent uses prompt-only extraction. Production logs show ~15% of records arrive with prose wrapping ('Sure, here is the JSON:') and the downstream parser breaks. What is the architectural fix?
Define the schema in tools[0].input_schema and set tool_choice: { type: 'tool', name: 'extract_invoice' }. This forces the model to emit a structured tool_use call shaped by the schema. No prose wrapping, no preamble. The prose-wrapping leak goes away because the response is a tool_use block rather than free text; keep validating the arguments in code, since forcing the tool guarantees the call, not the correctness of its values. Tagged to AP-INV-01.
You notice line-item totals do not match the invoice header total. The model extracted all items correctly but the math is wrong. How do you prevent this from shipping bad data downstream?
Compute sum(line_items[].total) in code. If it differs from total_amount by more than 0.01, send the specific error back to the model in a tool_result with is_error: true ('line items sum to 4950 but header total says 5000'); retry up to 3 times. About 95% of records converge on the second attempt because the model now sees what was wrong. Pair with currency enum and date sanity checks. Tagged to AP-INV-02.
An invoice arrives with no matching purchase order in your system. The vendor claims the PO was issued verbally last quarter. Should the agent approve based on the vendor reputation?
Your system prompt says 'never approve invoices above the vendor authorization cap'. Production logs show ~3% of approvals still exceed the cap. What is the architectural fix?
Add a PreToolUse hook on approve_payment. The hook reads tool_input.vendor_id and tool_input.amount, queries the vendor master for vendor_ytd_spend + cap, exits 2 on violation with a structured stderr message including cap_remaining. Deterministic, not probabilistic. Pair with blocklist and duplicate checks in the same hook. Prompts leak; hooks do not. Tagged to AP-INV-04.
A vendor re-sends the same invoice 30 days later because they did not see the payment confirmation. How does your agent prevent paying the same invoice twice?
A PreToolUse hook on approve_payment queries the audit log for any row with the same (vendor_id, invoice_number) in the last 90 days. On match, it exits 2 with the prior approval date and routes to a structured exception block. The check is stateless, auditable, and prevents race conditions when two parallel extractions hit the same invoice within seconds. The 90-day window is a configurable policy. Tagged to AP-INV-05.
Frequently asked
How do you handle handwritten or scanned invoices with poor image quality?
What happens if a vendor has multiple naming variations (Apple Inc, APPLE, Apple, Inc.)?
The vendor master stores a canonical vendor_id and a list of name variations. The extraction schema requires the model to extract the vendor as text; a normalization step (lowercase, strip punctuation, fuzzy match against the vendor master) resolves it to a vendor_id. The duplicate-detection hook keys on vendor_id, not the raw name, so naming variation does not break uniqueness.
Can the agent process multi-currency invoices in one workflow?
The schema captures currency as an ISO 4217 enum. The cap policy and duplicate detection key on vendor_id and amount in the invoice currency; the cap can be denominated per-vendor in the vendor master. For consolidated reporting, a daily FX-rate table converts to a base currency at audit-log write time.
How do you handle credit memos (negative invoices)?
Extract them with total_amount representing the credit (positive number) and a separate document_type enum field that distinguishes invoice from credit_memo. The PreToolUse hook treats credit memos as vendor_ytd_spend - amount (effectively decreasing YTD spend). Three-way match runs against the original invoice and the credit-memo reason code instead of a PO and GRN.
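The credit-memo rule in the last answer can be sketched as a small variation of the cap check. The `document_type` values come from the text; the function names `projected_ytd` and `within_cap` are hypothetical, and the sketch assumes amounts are always positive numbers as described.

```python
def projected_ytd(ytd_spend: float, amount: float, document_type: str) -> float:
    """YTD spend after applying an invoice (adds) or a credit memo (subtracts)."""
    if amount < 0:
        raise ValueError("amount must be positive; the sign comes from document_type")
    if document_type == "invoice":
        return ytd_spend + amount
    if document_type == "credit_memo":
        return ytd_spend - amount
    raise ValueError(f"unknown document_type: {document_type!r}")


def within_cap(ytd_spend: float, amount: float, document_type: str, cap: float) -> bool:
    """True when the projected YTD spend stays at or under the vendor cap."""
    return projected_ytd(ytd_spend, amount, document_type) <= cap


invoice_ok = within_cap(9000.0, 1500.0, "invoice", 10000.0)
credit_ok = within_cap(9000.0, 1500.0, "credit_memo", 10000.0)
```

Keeping the amount positive and carrying the sign in document_type keeps the schema's minimum: 0 constraint intact for both document kinds.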