Long Document Processing | Claude Architect Certification Prep

Q: What's the optimal chunk size?

500-2000 tokens (~400-1500 words). Smaller and you make too many retrieval calls; larger and you lose the granularity that makes top-K retrieval useful. 1000-1200 tokens is a good default. Always pair with 10-20% overlap so boundary-spanning content survives.

Q: Should I retrieve all matching chunks or top-K?

Top-K (default K=5). All-matching floods context with marginally-relevant noise; the model wastes attention. Top-5 keeps the prompt focused. If precision is critical, retrieve top-20 with embeddings and re-rank with Claude Haiku to top-5.

Q: How do I prevent lost-in-the-middle?

Anchor critical facts at context top (CASE_FACTS), retrieve top-K only, trim verbose tool results. Long contexts dilute attention to middle content. Keep the prompt structure: CASE_FACTS at top, retrieved chunks in the middle, the user's latest message at the end. Don't put case-facts in the middle of the prompt.

Q: Does prompt caching help with RAG?

Partially. Cache the stable parts: system prompt + tool registry + (optionally) the CASE_FACTS scaffold. The retrieved chunks change every query, so they're always fresh. Realistic savings: ~30-50% total cost reduction depending on system-prompt size. Not as dramatic as caching a 200-page document would have been, but RAG never had that overhead in the first place.

Q: Where do I store the checkpoint?

Durable, idempotent storage. Convex DB, S3, or a local JSONL file in dev. Key by doc_id. The checkpoint write must be atomic; partial writes confuse the resume path. Retain checkpoints until the document is fully processed; delete on completion or after 30 days, whichever comes first.

Q: Can I combine Batch API with checkpoint-and-resume?

Yes. For the long-running ones. Submit each document as one batch request. If a request hits max_tokens, the harvest step writes a checkpoint and includes that document in the NEXT batch with the checkpoint as case-facts. Two batches usually finish a long document; rare cases need three.

Q: How do I handle a 500-page doc that exceeds even the chunked + paged max_tokens cap?

Checkpoint after every ~50 pages or natural boundary (chapter, section). The harness writes checkpoints on max_tokens automatically; at 500 pages you'll see ~5-8 checkpoint events across multiple sessions. Each session is bounded; the document length isn't.

01 · Problem framing

The problem

What the customer needs

Process 200-page contracts without max_tokens errors and without losing the order/case ID partway through.
Audit-grade citations. Every extracted clause traces back to a specific chunk and page number.
Bulk overnight runs for backfills. 1000 documents in one batch, results next morning.

Why naive approaches fail

Stuff the whole document into the prompt → max_tokens at page 120; lost-in-the-middle drops the order ID established on page 1.
Progressive summarization of facts → '$247.83' becomes '~$250' in the case-facts; audit fails because exact values were paraphrased.
RAG without citation tracking → model hallucinates source pages; auditor can't verify any claim against the original document.

Definition of done

Semantic chunking (paragraph + section boundaries), not fixed-size, with 10-20% overlap
Top-K retrieval (K = 5 typical) returns only relevant chunks; full document never enters context
CASE_FACTS block pinned at every prompt top. Never summarized, only the conversation history is
Checkpoint-and-resume on max_tokens: state saved, fresh session, resume from checkpoint
Citations (chunk_id + page) propagate through every extraction; auditor can verify each claim
Batch API for bulk overnight extraction (≥ 100 docs at 50% off)

02 · Architecture

The system

03 · Component detail

What each part does

5 components, each owns a concept. Click any card to drill into the underlying primitive.

Semantic Chunker

paragraph + section boundaries, not fixed-size

Splits the document at natural boundaries (sections, paragraphs, list items) rather than at fixed byte counts. Preserves the meaning of each chunk; a sentence is never cut in half. Adds 10-20% overlap between adjacent chunks so a clause that spans a boundary still appears whole in at least one chunk. Fixed-size chunking destroys context; semantic chunking preserves it.

Configuration

Chunk size: 500-2000 tokens (≈400-1500 words). Overlap: 10-20%. Boundary precedence: section header → paragraph → sentence (never break mid-sentence). Each chunk gets a deterministic chunk_id and a page number for citation.

Concept: context-window ↗

Retrieval Index (top-K)

embeddings + cosine similarity

Embeds every chunk once at ingest time; stores embeddings in a vector index (FAISS, pgvector, Pinecone). At extraction time, the agent's question gets embedded; the index returns the top-K most-similar chunks (K = 5 typical). Only those K chunks enter context. The full document never does. Latency p95 < 100ms even on 10K-chunk documents.

Configuration

Embedding model: Voyage-3 or OpenAI text-embedding-3-small (~$0.13 / M tokens). Distance: cosine similarity. K: 5 (sweet spot. Bigger dilutes context, smaller misses relevance). Re-rank with Claude Haiku for the top-20 → top-5 if precision matters.

Concept: context-window ↗

CASE_FACTS Block

immutable anchor at prompt top

Pinned at the very top of every system prompt iteration. Holds doc_id, extracted_count, decisions already made, policy_cap. Survives every chunk swap. NEVER summarized. Exact values like '$247.83' stay exact across hundreds of turns. The conversation history below it CAN be summarized; case-facts cannot. The architectural difference between 'reliable extraction' and 'paraphrased nonsense'.

Configuration

system: "CASE_FACTS (immutable; re-read every turn): doc_id={doc_id}, extracted_count={count}, last_clause_id={cid}". Updated by hooks after state-changing tool calls.

Concept: case-facts-block ↗

Checkpoint-and-Resume

the architectural fix for max_tokens

When the model returns stop_reason: max_tokens, the harness writes the current case-facts + last extracted record + chunk position to a durable store (Convex DB, S3, local JSONL), then starts a FRESH session and reads the checkpoint as its case-facts. The new session continues from where the old left off. No data loss; no re-processing; no manual intervention.

Configuration

On stop_reason==max_tokens: persist({doc_id, extracted_count, last_chunk_id, partial_extraction}). New session: system_prompt loads case-facts from checkpoint. Idempotent on the chunk_id key.

Concept: checkpoints ↗

Citation Tracker

chunk_id + page on every output

Every tool result includes the chunk_id(s) and page number(s) that supported the extraction. The model's output schema requires citations: [{chunk_id, page, span?}]; downstream consumers can click any extracted value and see the exact paragraph in the original document. Audit-grade provenance, structurally enforced.

Configuration

extract_clause output_schema: { clause_text, clause_type, citations: [{ chunk_id, page, span?: 'character offsets within chunk' }] }. The model can't emit a clause without at least one citation.

Concept: structured-outputs ↗

04 · One concrete run

Data flow

05 · Build it

Eight steps to production

Semantic chunking with overlap

Walk the document; split at section / paragraph / sentence boundaries (in that precedence). Aim for 500-2000-token chunks; add 10-20% overlap between adjacent chunks so a clause spanning a boundary stays whole in at least one chunk. Each chunk gets a deterministic chunk_id (hash of content) and a page number for citation.

Semantic chunking with overlap

import hashlib
from typing import TypedDict

class Chunk(TypedDict):
    chunk_id: str
    page: int
    text: str

def chunk_document(pages: list[str], target_tokens: int = 1200, overlap: float = 0.15) -> list[Chunk]:
    """Semantic chunking with overlap. Split at section, paragraph, sentence."""
    chunks = []
    buffer = ""
    page_buffer_started = 1

    for page_num, page_text in enumerate(pages, start=1):
        for paragraph in split_into_paragraphs(page_text):
            # If adding this paragraph would exceed target, emit current buffer
            if approx_tokens(buffer + paragraph) > target_tokens and buffer:
                chunks.append({
                    "chunk_id": hashlib.md5(buffer.encode()).hexdigest()[:12],
                    "page": page_buffer_started,
                    "text": buffer.strip(),
                })
                # Carry forward the last 15% as overlap
                tail = buffer[-int(len(buffer) * overlap):]
                buffer = tail + paragraph + "\n\n"
                page_buffer_started = page_num
            else:
                buffer += paragraph + "\n\n"

    if buffer.strip():
        chunks.append({
            "chunk_id": hashlib.md5(buffer.encode()).hexdigest()[:12],
            "page": page_buffer_started,
            "text": buffer.strip(),
        })
    return chunks

def split_into_paragraphs(page_text: str) -> list[str]:
    return [p for p in page_text.split("\n\n") if p.strip()]

def approx_tokens(text: str) -> int:
    return len(text) // 4  # rule of thumb

↪ Concept: context-window

Embed and index every chunk

At ingest time, embed each chunk once with a strong embedding model (Voyage-3 or OpenAI text-embedding-3-small) and store in a vector index. Index by chunk_id; the embedding becomes the search key. Re-embedding only fires on content change (hash-keyed cache). For a 200-page document at ~500 chunks, embedding is a ~$0.05 one-time cost.

Embed and index every chunk

# embed_and_index.py
import voyageai

vo = voyageai.Client()  # picks up VOYAGE_API_KEY

def embed_and_index(chunks: list[Chunk], collection: str):
    """Embed every chunk; store in vector DB keyed by chunk_id."""
    texts = [c["text"] for c in chunks]
    # Voyage supports batched embedding. Much cheaper than per-chunk
    embeddings = vo.embed(texts, model="voyage-3", input_type="document").embeddings

    # Pseudo-code for vector store; real impl uses pgvector / Pinecone / FAISS
    for chunk, embedding in zip(chunks, embeddings):
        vector_store.upsert(
            collection=collection,
            id=chunk["chunk_id"],
            vector=embedding,
            metadata={"page": chunk["page"], "text_preview": chunk["text"][:200]},
        )
    return len(embeddings)

# Re-embedding cache: skip if chunk content hash hasn't changed
def smart_reindex(doc_id: str, new_chunks: list[Chunk]):
    existing = vector_store.list_chunks(collection=doc_id)
    existing_ids = {c["id"] for c in existing}
    new_ids = {c["chunk_id"] for c in new_chunks}
    to_delete = existing_ids - new_ids
    to_add = [c for c in new_chunks if c["chunk_id"] not in existing_ids]
    vector_store.delete_many(doc_id, list(to_delete))
    embed_and_index(to_add, doc_id)

↪ Concept: context-window

Retrieve top-K chunks; never the full document

When the agent asks a question (e.g., 'what's the indemnification clause?'), embed the question, retrieve the top-K=5 most-similar chunks, and pass ONLY those into context. The full document never enters the prompt. K=5 is the sweet spot. Bigger K dilutes context with marginally-relevant chunks; smaller K misses the right one. Re-rank top-20 with Claude Haiku if precision matters.

Retrieve top-K chunks; never the full document

def retrieve_top_k(question: str, doc_id: str, k: int = 5) -> list[dict]:
    """Top-K retrieval. Full doc never enters context."""
    q_embed = vo.embed([question], model="voyage-3", input_type="query").embeddings[0]
    candidates = vector_store.search(collection=doc_id, vector=q_embed, top_k=20)
    # Optional re-rank with Haiku for precision
    if len(candidates) > k:
        candidates = rerank_with_haiku(question, candidates)[:k]
    return [
        {
            "chunk_id": c["id"],
            "page": c["metadata"]["page"],
            "text": c["metadata"]["text_full"],  # rehydrate from chunk store
            "score": c["score"],
        }
        for c in candidates[:k]
    ]

def rerank_with_haiku(question: str, candidates: list[dict]) -> list[dict]:
    """Use Haiku to re-rank embedding candidates (cheap, focused)."""
    prompt = (
        f"Question: {question}\n\n"
        + "\n\n".join(f"[{i}] {c['metadata']['text_preview']}"
                       for i, c in enumerate(candidates))
        + "\n\nReturn the indices of the top 5 most relevant chunks, JSON only:"
    )
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=64,
        messages=[{"role": "user", "content": prompt}],
    )
    indices = json.loads(resp.content[0].text)
    return [candidates[i] for i in indices]

↪ Concept: context-window

Pin CASE_FACTS at the prompt top. Never summarize

Every prompt iteration starts with a CASE_FACTS block: doc_id, extracted_count, last_clause_id, decisions already made. The block is rebuilt from durable state every turn. It survives summarization, model swaps, and session resets. Critically, EXACT VALUES ($247.83, not ~$250; cust_4711, not the customer) stay verbatim. The conversation history below it CAN be summarized; the case-facts cannot.

Pin CASE_FACTS at the prompt top. Never summarize

def build_system_prompt(case_facts: dict, retrieved_chunks: list[dict]) -> str:
    """Pin CASE_FACTS at the top; retrieved chunks below; conversation last."""
    chunks_text = "\n\n".join(
        f"### CHUNK {c['chunk_id']} (page {c['page']})\n{c['text']}"
        for c in retrieved_chunks
    )
    return f"""You are a long-document extraction agent.

CASE_FACTS (immutable; re-read every turn; values are EXACT, never paraphrased):
- doc_id: {case_facts['doc_id']}
- doc_type: {case_facts.get('doc_type', 'unknown')}
- extracted_count: {case_facts.get('extracted_count', 0)}
- last_clause_id: {case_facts.get('last_clause_id', 'none')}
- policy_cap: ${case_facts.get('policy_cap', 0):,.2f}

Constraints:
- Cite every extraction with chunk_id + page from the chunks below.
- Never paraphrase exact values from CASE_FACTS or chunk text.
- Branch on stop_reason. On max_tokens, save state. The harness will resume.

RETRIEVED CHUNKS (top-K; only these are in context):
{chunks_text}"""

def update_case_facts(case_facts: dict, new_extraction: dict) -> dict:
    """Hook-style update; preserves all values verbatim."""
    return {
        **case_facts,
        "extracted_count": case_facts.get("extracted_count", 0) + 1,
        "last_clause_id": new_extraction["clause_id"],
    }

↪ Concept: case-facts-block

Checkpoint on max_tokens; resume in a fresh session

When stop_reason == 'max_tokens', the harness writes the current case-facts + last extracted record + chunk position to a durable store, then starts a fresh session and re-loads the checkpoint as its case-facts. Because case-facts are at the prompt top, the new session continues exactly where the old left off. No data loss; no manual intervention; no need for the agent to even know.

Checkpoint on max_tokens; resume in a fresh session

import json
from datetime import datetime

CHECKPOINT_DIR = ".checkpoints"

def save_checkpoint(case_facts: dict, last_record: dict, position: dict):
    """Persist state on max_tokens. Idempotent on doc_id."""
    path = f"{CHECKPOINT_DIR}/{case_facts['doc_id']}.json"
    with open(path, "w") as f:
        json.dump({
            "case_facts": case_facts,
            "last_record": last_record,
            "position": position,  # {chunk_id, paragraph_offset}
            "saved_at": datetime.utcnow().isoformat() + "Z",
        }, f)
    return path

def load_checkpoint(doc_id: str) -> dict | None:
    path = f"{CHECKPOINT_DIR}/{doc_id}.json"
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)

def extract_with_resume(doc_id: str, question: str, max_iter: int = 50):
    """Top-level loop with automatic checkpoint-and-resume."""
    checkpoint = load_checkpoint(doc_id) or {}
    case_facts = checkpoint.get("case_facts") or {"doc_id": doc_id}

    for iteration in range(max_iter):
        chunks = retrieve_top_k(question, doc_id, k=5)
        resp = client.messages.create(
            model="claude-sonnet-4.5",
            max_tokens=4096,
            system=build_system_prompt(case_facts, chunks),
            tools=[EXTRACT_CLAUSE_TOOL],
            messages=[{"role": "user", "content": question}],
        )
        if resp.stop_reason == "end_turn":
            return {"status": "complete", "case_facts": case_facts}
        if resp.stop_reason == "max_tokens":
            # Save and continue in a fresh session
            save_checkpoint(case_facts, last_record={}, position={})
            continue  # next iteration starts a fresh session
        if resp.stop_reason == "tool_use":
            tool_use = next(b for b in resp.content if b.type == "tool_use")
            case_facts = update_case_facts(case_facts, tool_use.input)
    return {"status": "iteration_cap", "case_facts": case_facts}

↪ Concept: checkpoints

Citations. Chunk_id + page on every output

Every extraction tool emits a citations: [{ chunk_id, page }] array; the schema makes citations REQUIRED. The model can't extract a clause without pointing at the chunks that supported it. Downstream consumers (auditors, reviewers, regulators) click any extracted value and see the exact paragraph in the original. Audit-grade provenance, structurally enforced.

Citations. Chunk_id + page on every output

EXTRACT_CLAUSE_TOOL = {
    "name": "extract_clause",
    "description": (
        "Extract a contractual clause from the retrieved chunks.\n"
        "Use this when the user asks for a specific clause type.\n"
        "Edge cases: if the clause type is not present in any retrieved chunk, "
        "emit clause_text='not_found' with empty citations.\n"
        "ALWAYS cite chunk_id and page for every extraction."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "clause_id": {"type": "string"},
            "clause_type": {
                "type": "string",
                "enum": ["indemnification", "termination", "payment", "ip", "other"],
            },
            "clause_text": {"type": "string"},
            "citations": {
                "type": "array",
                "minItems": 1,  # require at least one citation
                "items": {
                    "type": "object",
                    "properties": {
                        "chunk_id": {"type": "string"},
                        "page": {"type": "integer", "minimum": 1},
                        "span": {
                            "type": "string",
                            "description": "character offsets within the chunk, e.g. '128-340'",
                        },
                    },
                    "required": ["chunk_id", "page"],
                },
            },
        },
        "required": ["clause_id", "clause_type", "clause_text", "citations"],
    },
}

↪ Concept: structured-outputs

Bulk extraction via Batch API (50% off, 24h)

When the use case is 'extract every payment clause from 1000 contracts overnight', the Batch API earns its 50% discount. Submit at 6 PM, results ready at 6 AM. No real-time retry inside the batch; failures get resubmitted in the next batch with their specific error in the next message. Combined with prompt caching on the system prompt + tool registry, bulk extraction cost drops 95%+ vs naive sync calls.

Bulk extraction via Batch API (50% off, 24h)

def submit_bulk_extraction(docs: list[dict], clause_type: str) -> str:
    """Submit a batch of clause-extraction requests for overnight processing."""
    requests = []
    for doc in docs:
        chunks = retrieve_top_k(f"Find {clause_type} clauses", doc["id"], k=5)
        case_facts = load_checkpoint(doc["id"]) or {"doc_id": doc["id"]}
        requests.append({
            "custom_id": f"clause-{doc['id']}-{clause_type}",
            "params": {
                "model": "claude-sonnet-4.5",
                "max_tokens": 2048,
                "system": [{
                    "type": "text",
                    "text": build_system_prompt(case_facts, chunks),
                    "cache_control": {"type": "ephemeral"},  # cache the system prompt
                }],
                "tools": [
                    {**EXTRACT_CLAUSE_TOOL, "cache_control": {"type": "ephemeral"}},
                ],
                "tool_choice": {"type": "tool", "name": "extract_clause"},
                "messages": [{"role": "user", "content": f"Extract all {clause_type} clauses."}],
            },
        })
    batch = client.messages.batches.create(requests=requests)
    return batch.id

# Next morning. Fetch + harvest
def harvest_bulk(batch_id: str):
    results = client.messages.batches.results(batch_id)
    accepted, retry_queue = [], []
    for r in results:
        if r.result.type == "succeeded":
            tu = next(b for b in r.result.message.content if b.type == "tool_use")
            accepted.append(tu.input)
        else:
            retry_queue.append(r.custom_id)
    return {"accepted": accepted, "retry": retry_queue}

↪ Concept: batch-api

Stratified accuracy + adversarial 'silent source' tests

Aggregate accuracy hides per-document-type weakness. Stratify by doc_type (MSA vs DPA vs SOW), by clause_type, by page-section (front/middle/back). Surface the worst stratum. Pair with an adversarial test set of 50 documents where the requested clause is GENUINELY absent. The right behaviour is clause_text='not_found' with empty citations, NEVER an invented clause. Hallucinated extractions = audit-fail.

Stratified accuracy + adversarial 'silent source' tests

from collections import defaultdict

def stratified_accuracy(extractions: list[dict]) -> dict:
    """Pass rate by doc_type × clause_type × page-section."""
    buckets = defaultdict(lambda: {"pass": 0, "fail": 0})
    for e in extractions:
        section = ("front" if e["citations"][0]["page"] <= 30
                   else "back" if e["citations"][0]["page"] > 150
                   else "middle")
        key = (e["doc_type"], e["clause_type"], section)
        bucket = "pass" if validate_extraction(e) else "fail"
        buckets[key][bucket] += 1

    report = {}
    for (doc_type, clause_type, section), counts in buckets.items():
        total = counts["pass"] + counts["fail"]
        report[f"{doc_type}/{clause_type}/{section}"] = {
            "total": total,
            "pass_rate": counts["pass"] / total if total else 0,
        }
    return dict(sorted(report.items(), key=lambda kv: kv[1]["pass_rate"]))

def adversarial_silent_source_test() -> float:
    """50 docs where the requested clause is GENUINELY absent."""
    correct = 0
    for doc in load_silent_source_docs():  # known to NOT contain the clause
        result = extract_with_resume(doc["id"], "Find the indemnification clause")
        # Right: clause_text='not_found' with empty citations
        # Wrong: ANY invented clause text
        if result.get("case_facts", {}).get("last_extraction", {}).get("clause_text") == "not_found":
            correct += 1
    return correct / 50  # target: ≥ 95%

↪ Concept: evaluation

06 · Configuration decisions

The four decisions

Decision	Right answer	Wrong answer	Why
200-page document into the prompt	Chunk + index + retrieve top-K (K=5)	Stuff the whole document into one prompt	Stuffing hits max_tokens around page 120 and triggers lost-in-the-middle. Top-K retrieval keeps the prompt small and focused. The full document never enters context, so length is bounded by chunk count not document size.
Storing transactional values across many turns	CASE_FACTS block. Exact values, never paraphrased	Progressive summarization that paraphrases the conversation including facts	Summarization erodes precision. '$247.83' becomes '~$250'; 'cust_4711' becomes 'the customer'. CASE_FACTS keeps exact values verbatim; conversation history can be summarized, facts cannot.
Hit max_tokens mid-document	Save state + start fresh session + reload from checkpoint	Increase max_tokens or just retry from scratch	Larger windows defer the problem; checkpoint-and-resume permanently solves it. Restarting from scratch loses everything extracted so far. The architectural pattern scales to documents of any length.
Bulk overnight processing of 1000 documents	Batch API + cached system prompt + cached tool registry	Sync API in a tight loop	Batch API gives a flat 50% discount with a 24h SLA. Fine for non-blocking backfill. Caching adds another ~90% off the system + tools. Combined: ~95% savings vs naive sync. Sync API is for latency-critical extraction only.

07 · Failure modes

Where it breaks

Five failure pairs. Each one is one exam question. The fix is always architectural, deterministic gates, structured fields, pinned state.

❌ Stuff the whole document into the prompt

Try to paste a 150-page contract into a single prompt. Hits max_tokens at page 120; lost-in-the-middle drops the order ID established on page 1; agent makes contradictory recommendations on later pages.

AP-LDP-01

✅ Fix

Chunk + index + top-K retrieval. Only K=5 chunks ever enter context. The full document is searchable but never present. Length is bounded by retrieval, not document size.

❌ Progressive summarization of facts

Long conversation summarizes every 10 turns. Refund amount '$247.83' becomes '~$250' in the summary; customer ID 'cust_4711' becomes 'the customer'. Audit fails because exact values were paraphrased.

AP-LDP-02

✅ Fix

CASE_FACTS block at every prompt top. Never summarized. Holds exact values verbatim. Only the message history below is summarized; the case-facts persist verbatim across every iteration.

❌ No checkpoint on max_tokens

Long batch job processes 200 pages, hits max_tokens at turn 15. The whole pipeline aborts; everything extracted so far is lost; operator has to restart from page 1.

AP-LDP-03

✅ Fix

Checkpoint-and-resume: on stop_reason: max_tokens, persist state (case_facts + last extraction + chunk position) and start a fresh session that reloads the checkpoint as its case-facts. Idempotent on chunk_id.

❌ RAG without citations

Agent retrieves chunks and emits extracted clauses without saying which chunk supported each one. Auditor asks 'where does this come from?' and there's no answer; auditor flags the run as un-verifiable.

AP-LDP-04

✅ Fix

Citation tracker: every extraction emits citations: [{ chunk_id, page }] with minItems: 1 in the schema. The model can't extract a clause without pointing at the supporting chunks. Audit-grade provenance is structurally enforced.

❌ Chunking destroys context

Fixed-size chunks (every 1000 characters) split mid-sentence and mid-paragraph. A clause that spans a chunk boundary appears truncated in both adjacent chunks; retrieval misses it; extraction is wrong.

AP-LDP-05

✅ Fix

Semantic chunking with overlap. Split at section / paragraph / sentence boundaries (in that precedence). Add 10-20% overlap between adjacent chunks so a boundary-spanning clause stays whole in at least one chunk. Each chunk is meaning-complete.

08 · Budget

Cost & latency

One-time embedding (200-page doc, ~500 chunks)

~$0.05

500 chunks × ~1000 tokens × Voyage-3 at $0.13/M tokens ≈ $0.06. One-time cost at ingest; never re-paid unless content changes.

Per-extraction retrieval + Claude call (cached system + tools)

~$0.008-0.015

Embedding query (~$0.0001) + vector lookup (~$0.0001) + 5 chunks × ~1000 tokens (cached) + ~500 output tokens. Cache hit rate ≥ 70% drops effective per-call cost ~80%.

Bulk overnight (Batch API + caching)

~$0.004-0.008 per extraction

Batch API 50% discount × prompt caching ~90% off the system + tools = ~95% off naive sync. 1000 extractions @ $0.008 each = $8 total overnight.

Checkpoint storage

~5-15 KB per document

JSON dump of case_facts + last extraction + chunk position. Negligible per-document; at 1000 docs in flight, <15MB total. Idempotent re-loads add no cost.

p95 latency per extraction (cached, sync)

~2-4 seconds

Embed query (50ms) + vector lookup (50ms) + Claude call (2-3s with cache hit). Acceptable for interactive review of contracts; bulk uses Batch API.

09 · Ship gates

Ship checklist

Two passes. Build-time gates verify the code; run-time gates verify the system in production.

Build-time

Semantic chunking (paragraph + section boundaries) with 10-20% overlap↗ context-window
Each chunk has a deterministic chunk_id and a page number
Embeddings indexed once at ingest; re-embedding only on content change↗ context-window
Top-K retrieval (K=5 typical); full document never enters context
CASE_FACTS block pinned at every prompt top. Exact values, never paraphrased↗ case-facts-block
Checkpoint-and-resume on max_tokens; idempotent on chunk_id↗ checkpoints
Citations REQUIRED in every extraction tool's schema (minItems: 1)↗ structured-outputs
System prompt + tool registry cached with cache_control: ephemeral↗ prompt-caching
Batch API for bulk overnight runs (≥ 100 docs)↗ batch-api
Stratified accuracy: doc_type × clause_type × page-section↗ evaluation
Adversarial 'silent source' test (≥ 50 docs known to lack the clause); target ≥ 95% not_found rate

Run-time

Semantic chunker tested on 5 representative document types (contracts, papers, manuals)
Embedding cache by chunk content hash; rebuild only on change
CASE_FACTS schema versioned; migration plan documented
Checkpoint write is atomic; idempotency tested with deliberate restarts
Citations schema enforced (minItems: 1); CI lint catches schemas missing the constraint
Batch-API job retries failures in next batch with the specific error in the next message
Stratified accuracy dashboard updated daily; alert on any stratum < 90%
Adversarial 'silent source' eval runs weekly; ≥ 95% not_found rate required

10 · Question patterns

Five exam-pattern questions

150-page contract; the agent processes it in one prompt and dies at page 120 with `max_tokens`. Recovery without losing everything?

Checkpoint-and-resume. On stop_reason: max_tokens, persist {case_facts, last_extraction, chunk_position} to durable storage; start a FRESH session whose system prompt loads the checkpoint as its CASE_FACTS; continue from chunk_position. The new session has clean context but inherits exactly the state of the old one. Idempotent on chunk_id. Re-running a chunk just produces the same extraction. Tagged to AP-LDP-03.

RAG: should you retrieve all chunks similar to the query, or top-K ranked?

Top-K (K=5 typical). All-similar floods context with marginally-relevant chunks; the model wastes attention. Top-5 keeps the prompt focused on the most relevant evidence. Re-rank top-20 with Claude Haiku for the top-5 if precision matters; the cost is negligible and the precision lift on hard queries is real.

Chunking strategy: fixed-size or semantic?

Semantic (break at section / paragraph / sentence boundaries, in that precedence). Fixed-size (every 1000 chars) splits mid-sentence; clauses that span a boundary appear truncated in both adjacent chunks. Semantic chunking preserves meaning; pair with 10-20% overlap so a clause spanning a paragraph boundary still appears whole in at least one chunk. Tagged to AP-LDP-05.

How do you preserve citations through long-document extraction?

Make citations REQUIRED in the extraction tool's schema (minItems: 1). Every extracted clause emits citations: [{ chunk_id, page, span? }]. The model literally cannot return a clause without pointing at the supporting chunks. Downstream auditors click any extracted value and see the exact paragraph in the original document. Audit-grade provenance, structurally enforced. The model has no way to forget it. Tagged to AP-LDP-04.

Long conversation summarized at every 10 turns. Customer ID 'cust_4711' becomes 'the customer'; refund amount '$247.83' becomes '~$250'. The audit fails. What's the architectural fix?

CASE_FACTS block. Pinned at the top of every system prompt iteration, holding exact transactional values verbatim. The conversation history below it CAN be summarized; the case-facts CANNOT. Structurally separate the two: facts go in case-facts (immutable, exact), reasoning chains go in conversation history (summarizable). Tagged to AP-LDP-02.

11 · FAQ

Frequently asked

What's the optimal chunk size?

500-2000 tokens (~400-1500 words). Smaller and you make too many retrieval calls; larger and you lose the granularity that makes top-K retrieval useful. 1000-1200 tokens is a good default. Always pair with 10-20% overlap so boundary-spanning content survives.

Should I retrieve all matching chunks or top-K?

Top-K (default K=5). All-matching floods context with marginally-relevant noise; the model wastes attention. Top-5 keeps the prompt focused. If precision is critical, retrieve top-20 with embeddings and re-rank with Claude Haiku to top-5.

How do I prevent lost-in-the-middle?

Anchor critical facts at context top (CASE_FACTS), retrieve top-K only, trim verbose tool results. Long contexts dilute attention to middle content. Keep the prompt structure: CASE_FACTS at top, retrieved chunks in the middle, the user's latest message at the end. Don't put case-facts in the middle of the prompt.

Does prompt caching help with RAG?

Partially. Cache the stable parts: system prompt + tool registry + (optionally) the CASE_FACTS scaffold. The retrieved chunks change every query, so they're always fresh. Realistic savings: ~30-50% total cost reduction depending on system-prompt size. Not as dramatic as caching a 200-page document would have been, but RAG never had that overhead in the first place.

Where do I store the checkpoint?

Durable, idempotent storage. Convex DB, S3, or a local JSONL file in dev. Key by doc_id. The checkpoint write must be atomic; partial writes confuse the resume path. Retain checkpoints until the document is fully processed; delete on completion or after 30 days, whichever comes first.

Can I combine Batch API with checkpoint-and-resume?

Yes. For the long-running ones. Submit each document as one batch request. If a request hits max_tokens, the harvest step writes a checkpoint and includes that document in the NEXT batch with the checkpoint as case-facts. Two batches usually finish a long document; rare cases need three.

How do I handle a 500-page doc that exceeds even the chunked + paged max_tokens cap?

Checkpoint after every ~50 pages or natural boundary (chapter, section). The harness writes checkpoints on max_tokens automatically; at 500 pages you'll see ~5-8 checkpoint events across multiple sessions. Each session is bounded; the document length isn't.

Long Document Processing.

The problem

What the customer needs

Why naive approaches fail

The system

What each part does

Semantic Chunker

Retrieval Index (top-K)

CASE_FACTS Block

Checkpoint-and-Resume

Citation Tracker

Data flow

Eight steps to production

Semantic chunking with overlap

Embed and index every chunk

Retrieve top-K chunks; never the full document

Pin CASE_FACTS at the prompt top. Never summarize

Checkpoint on max_tokens; resume in a fresh session

Citations. Chunk_id + page on every output

Bulk extraction via Batch API (50% off, 24h)

Stratified accuracy + adversarial 'silent source' tests

The four decisions

Where it breaks

Cost & latency

Ship checklist

Build-time

Run-time

Five exam-pattern questions

Frequently asked

Long Document Processing, complete.

Long Document Processing.

The problem

What the customer needs

Why naive approaches fail

The system

What each part does

Semantic Chunker

Retrieval Index (top-K)

CASE_FACTS Block

Checkpoint-and-Resume

Citation Tracker

Data flow

Eight steps to production

Semantic chunking with overlap

Embed and index every chunk

Retrieve top-K chunks; never the full document

Pin CASE_FACTS at the prompt top. Never summarize

Checkpoint on max_tokens; resume in a fresh session

Citations. Chunk_id + page on every output

Bulk extraction via Batch API (50% off, 24h)

Stratified accuracy + adversarial 'silent source' tests

The four decisions

Where it breaks

Cost & latency

Ship checklist

Build-time

Run-time

Five exam-pattern questions

Frequently asked

Long Document Processing, complete.

Share this primitive