The problem
What the customer needs
- Process 200-page contracts without max_tokens errors and without losing the order/case ID partway through.
- Audit-grade citations. Every extracted clause traces back to a specific chunk and page number.
- Bulk overnight runs for backfills. 1000 documents in one batch, results next morning.
Why naive approaches fail
- Stuff the whole document into the prompt → max_tokens at page 120; lost-in-the-middle drops the order ID established on page 1.
- Progressive summarization of facts → '$247.83' becomes '~$250' in the case-facts; audit fails because exact values were paraphrased.
- RAG without citation tracking → model hallucinates source pages; auditor can't verify any claim against the original document.
- Semantic chunking (paragraph + section boundaries), not fixed-size, with 10-20% overlap
- Top-K retrieval (K = 5 typical) returns only relevant chunks; full document never enters context
- CASE_FACTS block pinned at every prompt top. Never summarized, only the conversation history is
- Checkpoint-and-resume on
max_tokens: state saved, fresh session, resume from checkpoint - Citations (chunk_id + page) propagate through every extraction; auditor can verify each claim
- Batch API for bulk overnight extraction (≥ 100 docs at 50% off)
The system
What each part does
5 components, each owns a concept. Click any card to drill into the underlying primitive.
Semantic Chunker
paragraph + section boundaries, not fixed-size
Splits the document at natural boundaries (sections, paragraphs, list items) rather than at fixed byte counts. Preserves the meaning of each chunk; a sentence is never cut in half. Adds 10-20% overlap between adjacent chunks so a clause that spans a boundary still appears whole in at least one chunk. Fixed-size chunking destroys context; semantic chunking preserves it.
Configuration
Chunk size: 500-2000 tokens (≈400-1500 words). Overlap: 10-20%. Boundary precedence: section header → paragraph → sentence (never break mid-sentence). Each chunk gets a deterministic chunk_id and a page number for citation.
Retrieval Index (top-K)
embeddings + cosine similarity
Embeds every chunk once at ingest time; stores embeddings in a vector index (FAISS, pgvector, Pinecone). At extraction time, the agent's question gets embedded; the index returns the top-K most-similar chunks (K = 5 typical). Only those K chunks enter context. The full document never does. Latency p95 < 100ms even on 10K-chunk documents.
Configuration
Embedding model: Voyage-3 or OpenAI text-embedding-3-small (~$0.13 / M tokens). Distance: cosine similarity. K: 5 (sweet spot. Bigger dilutes context, smaller misses relevance). Re-rank with Claude Haiku for the top-20 → top-5 if precision matters.
CASE_FACTS Block
immutable anchor at prompt top
Pinned at the very top of every system prompt iteration. Holds doc_id, extracted_count, decisions already made, policy_cap. Survives every chunk swap. NEVER summarized. Exact values like '$247.83' stay exact across hundreds of turns. The conversation history below it CAN be summarized; case-facts cannot. The architectural difference between 'reliable extraction' and 'paraphrased nonsense'.
Configuration
system: "CASE_FACTS (immutable; re-read every turn): doc_id={doc_id}, extracted_count={count}, last_clause_id={cid}". Updated by hooks after state-changing tool calls.
Checkpoint-and-Resume
the architectural fix for max_tokens
When the model returns stop_reason: max_tokens, the harness writes the current case-facts + last extracted record + chunk position to a durable store (Convex DB, S3, local JSONL), then starts a FRESH session and reads the checkpoint as its case-facts. The new session continues from where the old left off. No data loss; no re-processing; no manual intervention.
Configuration
On stop_reason==max_tokens: persist({doc_id, extracted_count, last_chunk_id, partial_extraction}). New session: system_prompt loads case-facts from checkpoint. Idempotent on the chunk_id key.
Citation Tracker
chunk_id + page on every output
Every tool result includes the chunk_id(s) and page number(s) that supported the extraction. The model's output schema requires citations: [{chunk_id, page, span?}]; downstream consumers can click any extracted value and see the exact paragraph in the original document. Audit-grade provenance, structurally enforced.
Configuration
extract_clause output_schema: { clause_text, clause_type, citations: [{ chunk_id, page, span?: 'character offsets within chunk' }] }. The model can't emit a clause without at least one citation.
Data flow
Eight steps to production
Semantic chunking with overlap
Walk the document; split at section / paragraph / sentence boundaries (in that precedence). Aim for 500-2000-token chunks; add 10-20% overlap between adjacent chunks so a clause spanning a boundary stays whole in at least one chunk. Each chunk gets a deterministic chunk_id (hash of content) and a page number for citation.
import hashlib
from typing import TypedDict
class Chunk(TypedDict):
chunk_id: str
page: int
text: str
def chunk_document(pages: list[str], target_tokens: int = 1200, overlap: float = 0.15) -> list[Chunk]:
"""Semantic chunking with overlap. Split at section, paragraph, sentence."""
chunks = []
buffer = ""
page_buffer_started = 1
for page_num, page_text in enumerate(pages, start=1):
for paragraph in split_into_paragraphs(page_text):
# If adding this paragraph would exceed target, emit current buffer
if approx_tokens(buffer + paragraph) > target_tokens and buffer:
chunks.append({
"chunk_id": hashlib.md5(buffer.encode()).hexdigest()[:12],
"page": page_buffer_started,
"text": buffer.strip(),
})
# Carry forward the last 15% as overlap
tail = buffer[-int(len(buffer) * overlap):]
buffer = tail + paragraph + "\n\n"
page_buffer_started = page_num
else:
buffer += paragraph + "\n\n"
if buffer.strip():
chunks.append({
"chunk_id": hashlib.md5(buffer.encode()).hexdigest()[:12],
"page": page_buffer_started,
"text": buffer.strip(),
})
return chunks
def split_into_paragraphs(page_text: str) -> list[str]:
return [p for p in page_text.split("\n\n") if p.strip()]
def approx_tokens(text: str) -> int:
return len(text) // 4 # rule of thumbEmbed and index every chunk
At ingest time, embed each chunk once with a strong embedding model (Voyage-3 or OpenAI text-embedding-3-small) and store in a vector index. Index by chunk_id; the embedding becomes the search key. Re-embedding only fires on content change (hash-keyed cache). For a 200-page document at ~500 chunks, embedding is a ~$0.05 one-time cost.
# embed_and_index.py
import voyageai
vo = voyageai.Client() # picks up VOYAGE_API_KEY
def embed_and_index(chunks: list[Chunk], collection: str):
"""Embed every chunk; store in vector DB keyed by chunk_id."""
texts = [c["text"] for c in chunks]
# Voyage supports batched embedding. Much cheaper than per-chunk
embeddings = vo.embed(texts, model="voyage-3", input_type="document").embeddings
# Pseudo-code for vector store; real impl uses pgvector / Pinecone / FAISS
for chunk, embedding in zip(chunks, embeddings):
vector_store.upsert(
collection=collection,
id=chunk["chunk_id"],
vector=embedding,
metadata={"page": chunk["page"], "text_preview": chunk["text"][:200]},
)
return len(embeddings)
# Re-embedding cache: skip if chunk content hash hasn't changed
def smart_reindex(doc_id: str, new_chunks: list[Chunk]):
existing = vector_store.list_chunks(collection=doc_id)
existing_ids = {c["id"] for c in existing}
new_ids = {c["chunk_id"] for c in new_chunks}
to_delete = existing_ids - new_ids
to_add = [c for c in new_chunks if c["chunk_id"] not in existing_ids]
vector_store.delete_many(doc_id, list(to_delete))
embed_and_index(to_add, doc_id)Retrieve top-K chunks; never the full document
When the agent asks a question (e.g., 'what's the indemnification clause?'), embed the question, retrieve the top-K=5 most-similar chunks, and pass ONLY those into context. The full document never enters the prompt. K=5 is the sweet spot. Bigger K dilutes context with marginally-relevant chunks; smaller K misses the right one. Re-rank top-20 with Claude Haiku if precision matters.
def retrieve_top_k(question: str, doc_id: str, k: int = 5) -> list[dict]:
"""Top-K retrieval. Full doc never enters context."""
q_embed = vo.embed([question], model="voyage-3", input_type="query").embeddings[0]
candidates = vector_store.search(collection=doc_id, vector=q_embed, top_k=20)
# Optional re-rank with Haiku for precision
if len(candidates) > k:
candidates = rerank_with_haiku(question, candidates)[:k]
return [
{
"chunk_id": c["id"],
"page": c["metadata"]["page"],
"text": c["metadata"]["text_full"], # rehydrate from chunk store
"score": c["score"],
}
for c in candidates[:k]
]
def rerank_with_haiku(question: str, candidates: list[dict]) -> list[dict]:
"""Use Haiku to re-rank embedding candidates (cheap, focused)."""
prompt = (
f"Question: {question}\n\n"
+ "\n\n".join(f"[{i}] {c['metadata']['text_preview']}"
for i, c in enumerate(candidates))
+ "\n\nReturn the indices of the top 5 most relevant chunks, JSON only:"
)
resp = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=64,
messages=[{"role": "user", "content": prompt}],
)
indices = json.loads(resp.content[0].text)
return [candidates[i] for i in indices]Pin CASE_FACTS at the prompt top. Never summarize
Every prompt iteration starts with a CASE_FACTS block: doc_id, extracted_count, last_clause_id, decisions already made. The block is rebuilt from durable state every turn. It survives summarization, model swaps, and session resets. Critically, EXACT VALUES ($247.83, not ~$250; cust_4711, not the customer) stay verbatim. The conversation history below it CAN be summarized; the case-facts cannot.
def build_system_prompt(case_facts: dict, retrieved_chunks: list[dict]) -> str:
"""Pin CASE_FACTS at the top; retrieved chunks below; conversation last."""
chunks_text = "\n\n".join(
f"### CHUNK {c['chunk_id']} (page {c['page']})\n{c['text']}"
for c in retrieved_chunks
)
return f"""You are a long-document extraction agent.
CASE_FACTS (immutable; re-read every turn; values are EXACT, never paraphrased):
- doc_id: {case_facts['doc_id']}
- doc_type: {case_facts.get('doc_type', 'unknown')}
- extracted_count: {case_facts.get('extracted_count', 0)}
- last_clause_id: {case_facts.get('last_clause_id', 'none')}
- policy_cap: ${case_facts.get('policy_cap', 0):,.2f}
Constraints:
- Cite every extraction with chunk_id + page from the chunks below.
- Never paraphrase exact values from CASE_FACTS or chunk text.
- Branch on stop_reason. On max_tokens, save state. The harness will resume.
RETRIEVED CHUNKS (top-K; only these are in context):
{chunks_text}"""
def update_case_facts(case_facts: dict, new_extraction: dict) -> dict:
"""Hook-style update; preserves all values verbatim."""
return {
**case_facts,
"extracted_count": case_facts.get("extracted_count", 0) + 1,
"last_clause_id": new_extraction["clause_id"],
}Checkpoint on max_tokens; resume in a fresh session
When stop_reason == 'max_tokens', the harness writes the current case-facts + last extracted record + chunk position to a durable store, then starts a fresh session and re-loads the checkpoint as its case-facts. Because case-facts are at the prompt top, the new session continues exactly where the old left off. No data loss; no manual intervention; no need for the agent to even know.
import json
from datetime import datetime
CHECKPOINT_DIR = ".checkpoints"
def save_checkpoint(case_facts: dict, last_record: dict, position: dict):
"""Persist state on max_tokens. Idempotent on doc_id."""
path = f"{CHECKPOINT_DIR}/{case_facts['doc_id']}.json"
with open(path, "w") as f:
json.dump({
"case_facts": case_facts,
"last_record": last_record,
"position": position, # {chunk_id, paragraph_offset}
"saved_at": datetime.utcnow().isoformat() + "Z",
}, f)
return path
def load_checkpoint(doc_id: str) -> dict | None:
path = f"{CHECKPOINT_DIR}/{doc_id}.json"
if not os.path.exists(path):
return None
with open(path) as f:
return json.load(f)
def extract_with_resume(doc_id: str, question: str, max_iter: int = 50):
"""Top-level loop with automatic checkpoint-and-resume."""
checkpoint = load_checkpoint(doc_id) or {}
case_facts = checkpoint.get("case_facts") or {"doc_id": doc_id}
for iteration in range(max_iter):
chunks = retrieve_top_k(question, doc_id, k=5)
resp = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=4096,
system=build_system_prompt(case_facts, chunks),
tools=[EXTRACT_CLAUSE_TOOL],
messages=[{"role": "user", "content": question}],
)
if resp.stop_reason == "end_turn":
return {"status": "complete", "case_facts": case_facts}
if resp.stop_reason == "max_tokens":
# Save and continue in a fresh session
save_checkpoint(case_facts, last_record={}, position={})
continue # next iteration starts a fresh session
if resp.stop_reason == "tool_use":
tool_use = next(b for b in resp.content if b.type == "tool_use")
case_facts = update_case_facts(case_facts, tool_use.input)
return {"status": "iteration_cap", "case_facts": case_facts}Citations. Chunk_id + page on every output
Every extraction tool emits a citations: [{ chunk_id, page }] array; the schema makes citations REQUIRED. The model can't extract a clause without pointing at the chunks that supported it. Downstream consumers (auditors, reviewers, regulators) click any extracted value and see the exact paragraph in the original. Audit-grade provenance, structurally enforced.
EXTRACT_CLAUSE_TOOL = {
"name": "extract_clause",
"description": (
"Extract a contractual clause from the retrieved chunks.\n"
"Use this when the user asks for a specific clause type.\n"
"Edge cases: if the clause type is not present in any retrieved chunk, "
"emit clause_text='not_found' with empty citations.\n"
"ALWAYS cite chunk_id and page for every extraction."
),
"input_schema": {
"type": "object",
"properties": {
"clause_id": {"type": "string"},
"clause_type": {
"type": "string",
"enum": ["indemnification", "termination", "payment", "ip", "other"],
},
"clause_text": {"type": "string"},
"citations": {
"type": "array",
"minItems": 1, # require at least one citation
"items": {
"type": "object",
"properties": {
"chunk_id": {"type": "string"},
"page": {"type": "integer", "minimum": 1},
"span": {
"type": "string",
"description": "character offsets within the chunk, e.g. '128-340'",
},
},
"required": ["chunk_id", "page"],
},
},
},
"required": ["clause_id", "clause_type", "clause_text", "citations"],
},
}Bulk extraction via Batch API (50% off, 24h)
When the use case is 'extract every payment clause from 1000 contracts overnight', the Batch API earns its 50% discount. Submit at 6 PM, results ready at 6 AM. No real-time retry inside the batch; failures get resubmitted in the next batch with their specific error in the next message. Combined with prompt caching on the system prompt + tool registry, bulk extraction cost drops 95%+ vs naive sync calls.
def submit_bulk_extraction(docs: list[dict], clause_type: str) -> str:
"""Submit a batch of clause-extraction requests for overnight processing."""
requests = []
for doc in docs:
chunks = retrieve_top_k(f"Find {clause_type} clauses", doc["id"], k=5)
case_facts = load_checkpoint(doc["id"]) or {"doc_id": doc["id"]}
requests.append({
"custom_id": f"clause-{doc['id']}-{clause_type}",
"params": {
"model": "claude-sonnet-4.5",
"max_tokens": 2048,
"system": [{
"type": "text",
"text": build_system_prompt(case_facts, chunks),
"cache_control": {"type": "ephemeral"}, # cache the system prompt
}],
"tools": [
{**EXTRACT_CLAUSE_TOOL, "cache_control": {"type": "ephemeral"}},
],
"tool_choice": {"type": "tool", "name": "extract_clause"},
"messages": [{"role": "user", "content": f"Extract all {clause_type} clauses."}],
},
})
batch = client.messages.batches.create(requests=requests)
return batch.id
# Next morning. Fetch + harvest
def harvest_bulk(batch_id: str):
results = client.messages.batches.results(batch_id)
accepted, retry_queue = [], []
for r in results:
if r.result.type == "succeeded":
tu = next(b for b in r.result.message.content if b.type == "tool_use")
accepted.append(tu.input)
else:
retry_queue.append(r.custom_id)
return {"accepted": accepted, "retry": retry_queue}Stratified accuracy + adversarial 'silent source' tests
Aggregate accuracy hides per-document-type weakness. Stratify by doc_type (MSA vs DPA vs SOW), by clause_type, by page-section (front/middle/back). Surface the worst stratum. Pair with an adversarial test set of 50 documents where the requested clause is GENUINELY absent. The right behaviour is clause_text='not_found' with empty citations, NEVER an invented clause. Hallucinated extractions = audit-fail.
from collections import defaultdict
def stratified_accuracy(extractions: list[dict]) -> dict:
"""Pass rate by doc_type × clause_type × page-section."""
buckets = defaultdict(lambda: {"pass": 0, "fail": 0})
for e in extractions:
section = ("front" if e["citations"][0]["page"] <= 30
else "back" if e["citations"][0]["page"] > 150
else "middle")
key = (e["doc_type"], e["clause_type"], section)
bucket = "pass" if validate_extraction(e) else "fail"
buckets[key][bucket] += 1
report = {}
for (doc_type, clause_type, section), counts in buckets.items():
total = counts["pass"] + counts["fail"]
report[f"{doc_type}/{clause_type}/{section}"] = {
"total": total,
"pass_rate": counts["pass"] / total if total else 0,
}
return dict(sorted(report.items(), key=lambda kv: kv[1]["pass_rate"]))
def adversarial_silent_source_test() -> float:
"""50 docs where the requested clause is GENUINELY absent."""
correct = 0
for doc in load_silent_source_docs(): # known to NOT contain the clause
result = extract_with_resume(doc["id"], "Find the indemnification clause")
# Right: clause_text='not_found' with empty citations
# Wrong: ANY invented clause text
if result.get("case_facts", {}).get("last_extraction", {}).get("clause_text") == "not_found":
correct += 1
return correct / 50 # target: ≥ 95%The four decisions
| Decision | Right answer | Wrong answer | Why |
|---|---|---|---|
| 200-page document into the prompt | Chunk + index + retrieve top-K (K=5) | Stuff the whole document into one prompt | Stuffing hits max_tokens around page 120 and triggers lost-in-the-middle. Top-K retrieval keeps the prompt small and focused. The full document never enters context, so length is bounded by chunk count not document size. |
| Storing transactional values across many turns | CASE_FACTS block. Exact values, never paraphrased | Progressive summarization that paraphrases the conversation including facts | Summarization erodes precision. '$247.83' becomes '~$250'; 'cust_4711' becomes 'the customer'. CASE_FACTS keeps exact values verbatim; conversation history can be summarized, facts cannot. |
| Hit max_tokens mid-document | Save state + start fresh session + reload from checkpoint | Increase max_tokens or just retry from scratch | Larger windows defer the problem; checkpoint-and-resume permanently solves it. Restarting from scratch loses everything extracted so far. The architectural pattern scales to documents of any length. |
| Bulk overnight processing of 1000 documents | Batch API + cached system prompt + cached tool registry | Sync API in a tight loop | Batch API gives a flat 50% discount with a 24h SLA. Fine for non-blocking backfill. Caching adds another ~90% off the system + tools. Combined: ~95% savings vs naive sync. Sync API is for latency-critical extraction only. |
Where it breaks
Five failure pairs. Each one is one exam question. The fix is always architectural, deterministic gates, structured fields, pinned state.
Try to paste a 150-page contract into a single prompt. Hits max_tokens at page 120; lost-in-the-middle drops the order ID established on page 1; agent makes contradictory recommendations on later pages.
AP-LDP-01Chunk + index + top-K retrieval. Only K=5 chunks ever enter context. The full document is searchable but never present. Length is bounded by retrieval, not document size.
Long conversation summarizes every 10 turns. Refund amount '$247.83' becomes '~$250' in the summary; customer ID 'cust_4711' becomes 'the customer'. Audit fails because exact values were paraphrased.
AP-LDP-02CASE_FACTS block at every prompt top. Never summarized. Holds exact values verbatim. Only the message history below is summarized; the case-facts persist verbatim across every iteration.
Long batch job processes 200 pages, hits max_tokens at turn 15. The whole pipeline aborts; everything extracted so far is lost; operator has to restart from page 1.
AP-LDP-03Checkpoint-and-resume: on stop_reason: max_tokens, persist state (case_facts + last extraction + chunk position) and start a fresh session that reloads the checkpoint as its case-facts. Idempotent on chunk_id.
Agent retrieves chunks and emits extracted clauses without saying which chunk supported each one. Auditor asks 'where does this come from?' and there's no answer; auditor flags the run as un-verifiable.
AP-LDP-04Citation tracker: every extraction emits citations: [{ chunk_id, page }] with minItems: 1 in the schema. The model can't extract a clause without pointing at the supporting chunks. Audit-grade provenance is structurally enforced.
Fixed-size chunks (every 1000 characters) split mid-sentence and mid-paragraph. A clause that spans a chunk boundary appears truncated in both adjacent chunks; retrieval misses it; extraction is wrong.
AP-LDP-05Semantic chunking with overlap. Split at section / paragraph / sentence boundaries (in that precedence). Add 10-20% overlap between adjacent chunks so a boundary-spanning clause stays whole in at least one chunk. Each chunk is meaning-complete.
Cost & latency
500 chunks × ~1000 tokens × Voyage-3 at $0.13/M tokens ≈ $0.06. One-time cost at ingest; never re-paid unless content changes.
Embedding query (~$0.0001) + vector lookup (~$0.0001) + 5 chunks × ~1000 tokens (cached) + ~500 output tokens. Cache hit rate ≥ 70% drops effective per-call cost ~80%.
Batch API 50% discount × prompt caching ~90% off the system + tools = ~95% off naive sync. 1000 extractions @ $0.008 each = $8 total overnight.
JSON dump of case_facts + last extraction + chunk position. Negligible per-document; at 1000 docs in flight, <15MB total. Idempotent re-loads add no cost.
Embed query (50ms) + vector lookup (50ms) + Claude call (2-3s with cache hit). Acceptable for interactive review of contracts; bulk uses Batch API.
Ship checklist
Two passes. Build-time gates verify the code; run-time gates verify the system in production.
Build-time
- Semantic chunking (paragraph + section boundaries) with 10-20% overlap↗ context-window
- Each chunk has a deterministic chunk_id and a page number
- Embeddings indexed once at ingest; re-embedding only on content change↗ context-window
- Top-K retrieval (K=5 typical); full document never enters context
- CASE_FACTS block pinned at every prompt top. Exact values, never paraphrased↗ case-facts-block
- Checkpoint-and-resume on max_tokens; idempotent on chunk_id↗ checkpoints
- Citations REQUIRED in every extraction tool's schema (minItems: 1)↗ structured-outputs
- System prompt + tool registry cached with cache_control: ephemeral↗ prompt-caching
- Batch API for bulk overnight runs (≥ 100 docs)↗ batch-api
- Stratified accuracy: doc_type × clause_type × page-section↗ evaluation
- Adversarial 'silent source' test (≥ 50 docs known to lack the clause); target ≥ 95% not_found rate
Run-time
- Semantic chunker tested on 5 representative document types (contracts, papers, manuals)
- Embedding cache by chunk content hash; rebuild only on change
- CASE_FACTS schema versioned; migration plan documented
- Checkpoint write is atomic; idempotency tested with deliberate restarts
- Citations schema enforced (minItems: 1); CI lint catches schemas missing the constraint
- Batch-API job retries failures in next batch with the specific error in the next message
- Stratified accuracy dashboard updated daily; alert on any stratum < 90%
- Adversarial 'silent source' eval runs weekly; ≥ 95% not_found rate required
6 exam-pattern questions
Tap an answer to check it instantly, then expand the full breakdown for the rationale per option, the mental model under test, and the priority order across distractors.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
1 more questions on this scenario in the practice pillar. Take a mock exam ↗
Frequently asked
What's the optimal chunk size?
Should I retrieve all matching chunks or top-K?
How do I prevent lost-in-the-middle?
Does prompt caching help with RAG?
Where do I store the checkpoint?
doc_id. The checkpoint write must be atomic; partial writes confuse the resume path. Retain checkpoints until the document is fully processed; delete on completion or after 30 days, whichever comes first.Can I combine Batch API with checkpoint-and-resume?
max_tokens, the harvest step writes a checkpoint and includes that document in the NEXT batch with the checkpoint as case-facts. Two batches usually finish a long document; rare cases need three.How do I handle a 500-page doc that exceeds even the chunked + paged max_tokens cap?
max_tokens automatically; at 500 pages you'll see ~5-8 checkpoint events across multiple sessions. Each session is bounded; the document length isn't.