The problem
What the customer needs
- Pick up the conversation at turn 15 and still see the customer ID, decision, and contact preference from turn 2.
- Be honored immediately when they say 'I want to speak to a human'. No negotiation, no 'let me try first'.
- Not be re-asked the same clarifying question three turns after they already answered it.
Why naive approaches fail
- Single-block message history hits lost-in-the-middle by turn 9; the agent loses the order ID and re-asks.
- Prompt-only clarification language ('don't repeat questions') leaks 8% of cases; the agent re-asks anyway.
- Sentiment-triggered escalation creates 50% false positives: angry-but-valid customers get escalated unnecessarily.
- Turn-15 retrieval of customer_id + prior decision = 100% (case-facts pinned, never summarized)
- Repeat-clarification rate < 1% (programmatic prerequisite block, not prompt language)
- False-escalation rate < 5% (policy-gap + explicit-request triggers only, sentiment ignored)
- p95 turn latency < 5s including hook overhead and history compression
The system
What each part does
5 components, each owns a concept. Click any card to drill into the underlying primitive.
Case-Facts Block
immutable customer state, top of prompt
Pinned at the very top of every system-prompt iteration. Holds customer_id, decision_made, contact_preference, escalation_requested, policy_cap. Survives history compression and is re-read every turn. That is the entire point.
Configuration
system: f"CASE_FACTS:\n customer={cust_id} · decision={decision} · contact={pref} · escalated={escalated}". Updated by hooks after any state-changing tool call. Never paraphrased; always exact.
Session State Manager
decisions + flags between turns
Tracks the structured state that case-facts cannot: which clarification questions have been answered, which tool results are still in play, whether the customer has explicitly asked for a human. Updated post-each-tool-call. Read by the hook before any subsequent tool dispatch.
Configuration
state: {clarifications_answered: [order_id, refund_or_credit], last_tool_result, escalation_requested: false, contact_preference: 'email'}. Persisted in session store, loaded into prompt as a serialized block.
History Summarizer
turns 2-14 → 3 lines at turn 15
Watches conversation length. When the message list exceeds 15 entries, replaces turns 2-N-1 with a single 3-line summary preserving decisions, not transcripts. Case-facts stays untouched at the prompt top. The summary lives in the message list, not in case-facts.
Configuration
if len(messages) > 15: summary = compress_to_3_lines(messages[1:-1]); messages = [messages[0], {role: 'user', content: summary}, messages[-1]]. Keeps token count flat while preserving decision continuity.
Clarification Gate Hook
PreToolUse · prerequisite block
Sits between Claude's tool_use request and tool execution. If a downstream tool needs verified_id and case-facts.verified_id is null, exits 2 with a deterministic message routing Claude to call get_customer first. This is the difference between probabilistic prompt language and 100% prerequisite enforcement.
Configuration
Hook fires before process_refund / update_account / escalate_to_human. Reads case_facts.verified_id and conversation flags. Exit 2 with stderr message routes Claude back; no leakage, no exceptions.
Stop-Reason Loop Control
branch on the field, not the text
Reads stop_reason after every API response. end_turn → exit cleanly. tool_use → execute, append result, continue. max_tokens → save partial state and escalate (never silently truncate). Never branches on response text containing 'done' or 'goodbye'.
Configuration
while True: resp = client.messages.create(...). if resp.stop_reason "end_turn": return. if resp.stop_reason "tool_use": dispatch + append. if resp.stop_reason == "max_tokens": persist + escalate.
Data flow
Eight steps to production
Define the case-facts anchor block
Pin the immutable customer facts at the very top of the system prompt. These survive compression, are re-read every turn, and are never paraphrased. The block is the single load-bearing pattern of the whole scenario. Get this wrong and turn 15 forgets turn 2.
from anthropic import Anthropic
client = Anthropic()
def build_system_prompt(case_facts: dict) -> str:
return f"""You are a conversational support agent.
CASE_FACTS (immutable; re-read every turn; never paraphrased):
- customer_id: {case_facts['customer_id']}
- decision_made: {case_facts.get('decision_made', 'none')}
- contact_preference: {case_facts.get('contact_preference', 'unset')}
- escalation_requested: {case_facts.get('escalation_requested', False)}
- policy_cap: ${case_facts.get('cap', 500)}
Constraints:
- Branch on stop_reason. Never on response text.
- If escalation_requested is True: route to human queue, no negotiation.
- If a clarifying question was already answered, do not re-ask (state below)."""Build the session-state structure
Case-facts holds immutable customer state; session-state holds the conversational state. Answered clarifications, last tool result, escalation flag. Together they replace the lost-in-the-middle problem with structural retrieval. Loaded into the prompt as a serialized block right after CASE_FACTS.
def build_session_block(state: dict) -> str:
return f"""SESSION_STATE (updated post-each-turn):
- clarifications_answered: {state.get('clarifications_answered', [])}
- last_tool: {state.get('last_tool')} → {state.get('last_tool_result_summary')}
- escalation_requested: {state.get('escalation_requested', False)}
- contact_preference: {state.get('contact_preference', 'email')}
"""
# After every tool call, update + persist
state['clarifications_answered'].append('order_id')
state['last_tool'] = 'lookup_order'
state['last_tool_result_summary'] = 'order_status=delivered'
session_store.save(case_facts['customer_id'], state)Wire the PreToolUse clarification hook
Programmatic prerequisite enforcement. Before any account-modifying tool, the hook checks case-facts + session-state. Missing prerequisites → exit 2 with a structured stderr message; Claude reads it and routes to the prerequisite tool first. Prompt language alone leaks 8%; this hook is 100%.
# .claude/hooks/clarification_gate.py
import sys, json
def main():
payload = json.loads(sys.stdin.read())
tool_name = payload["tool_name"]
case_facts = payload.get("case_facts", {})
session = payload.get("session_state", {})
# Account-modifying tools require verified identity
if tool_name in ("process_refund", "update_account", "escalate_to_human"):
if not case_facts.get("verified_id"):
print("verified_id missing. Call get_customer first", file=sys.stderr)
sys.exit(2)
# Honor explicit escalation request. No further tool calls
if session.get("escalation_requested") and tool_name != "escalate_to_human":
print("user requested human; route to escalation queue, no other tools", file=sys.stderr)
sys.exit(2)
sys.exit(0)
if __name__ == "__main__":
main()Run the loop on stop_reason, not text
The single most-tested distractor in this scenario is parsing response text for 'done'. Claude can return text + tool_use in the same message; the structured stop_reason field is the only authoritative termination signal. Branch on it. Always.
def run_conversation_turn(user_msg: str, case_facts: dict, state: dict, max_iter: int = 12):
messages = load_history(case_facts['customer_id']) + [{"role": "user", "content": user_msg}]
system = build_system_prompt(case_facts) + "\n\n" + build_session_block(state)
for _ in range(max_iter):
resp = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=2048,
system=system,
tools=tools,
messages=messages,
)
if resp.stop_reason == "end_turn":
persist_history(case_facts['customer_id'], messages, resp)
return extract_text(resp)
if resp.stop_reason == "tool_use":
tool_uses = [b for b in resp.content if b.type == "tool_use"]
results = [execute_tool(t, case_facts, state) for t in tool_uses]
messages.append({"role": "assistant", "content": resp.content})
messages.append({"role": "user", "content": results})
update_state(state, tool_uses, results)
continue
if resp.stop_reason == "max_tokens":
persist_partial(case_facts, state, resp)
return {"status": "partial_escalate", "text": extract_text(resp)}
return {"status": "iteration_cap"}Compress conversation history at turn 15
When the message list exceeds 15 entries, replace turns 2 through N-1 with a single summary that preserves decisions, not transcripts. Case-facts stays at the prompt top, untouched. The summary lives in the message list. This frees ~40% of tokens with zero decision loss.
def compress_history(messages: list, threshold: int = 15) -> list:
if len(messages) <= threshold:
return messages
# Preserve the original user message + the most recent exchange.
first = messages[0]
last = messages[-1]
middle = messages[1:-1]
# Summarize: extract decision points only (not full transcript)
decisions = extract_decisions(middle) # e.g. ["user_asked_for_refund", "agent_verified_customer", "policy_allowed_$50"]
summary = "CONVERSATION SUMMARY (turns 2-" + str(len(messages) - 1) + "):\n" + "\n".join(f"- {d}" for d in decisions)
return [first, {"role": "user", "content": summary}, last]
# Usage in the turn loop
messages = compress_history(messages, threshold=15)Honor explicit human-handoff requests immediately
When the customer says 'speak to a human', the agent does not negotiate. The hook flips session_state.escalation_requested → true. The next tool dispatch is escalate_to_human; everything else is blocked. Sentiment is orthogonal. Angry customers with valid requests still get the answer first.
# Detection lives in the agent's prompt; latching lives in state
EXPLICIT_HUMAN_PHRASES = [
"speak to a human",
"talk to a person",
"i want a human",
"transfer me",
"give me a human",
]
def detect_explicit_handoff(user_msg: str) -> bool:
msg = user_msg.lower()
return any(phrase in msg for phrase in EXPLICIT_HUMAN_PHRASES)
def handle_user_turn(user_msg: str, case_facts: dict, state: dict):
if detect_explicit_handoff(user_msg):
state['escalation_requested'] = True
# The next agent loop will see this in session_state and the hook
# will block any tool except escalate_to_human.
return run_conversation_turn(user_msg, case_facts, state)
# Sentiment is intentionally NOT consulted here.Cache the system prompt + tools
System prompt + tool definitions are stable across turns; only case-facts and session-state change. Mark the stable parts with cache_control: ephemeral and pay ~90% less for those bytes on every turn after the first. With 5-min TTL on continuous traffic, hit rate stays above 70%.
# Split system into stable (cached) + dynamic (fresh) blocks
resp = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=2048,
system=[
{
"type": "text",
"text": STABLE_SYSTEM_PREAMBLE, # role + constraints, never changes
"cache_control": {"type": "ephemeral"},
},
{
"type": "text",
"text": build_case_facts_block(case_facts) + build_session_block(state), # changes per turn
},
],
tools=tools, # tools array also auto-cached when stable
messages=messages,
)
# Inspect resp.usage.cache_creation_input_tokens / cache_read_input_tokens to verify hit rate.Audit-log the conversation arc
Every closed conversation writes a structured row: customer_id, turn_count, tool_calls_in_order, escalation_reason (if any), elapsed_ms_total, csat. Skip the full transcript. The structured trace is enough to replay any failure and is 50× smaller. Store for 90 days minimum.
def audit_conversation(case_facts: dict, state: dict, agent_path: list, elapsed_ms: int, csat: int | None):
db.audit.insert({
"ts": datetime.utcnow(),
"customer_id": case_facts["customer_id"],
"turn_count": len(agent_path),
"tool_calls": [c["name"] for c in agent_path if c.get("type") == "tool_use"],
"stop_reasons": [c.get("stop_reason") for c in agent_path],
"escalation_reason": state.get("escalation_reason"),
"compression_fired_at_turn": state.get("compression_at_turn"),
"elapsed_ms": elapsed_ms,
"csat": csat,
})The four decisions
| Decision | Right answer | Wrong answer | Why |
|---|---|---|---|
| Multi-turn customer state | case-facts block at top of prompt + session-state block | progressive summarization of customer_id + amount | Transactional values must be pinned, never paraphrased. Summarization erodes precision; case-facts is structural. |
| Customer says 'speak to a human' | set escalation_requested → block all tools except escalate_to_human | negotiate ('let me try once more') or suggest alternatives | Explicit user requests are non-negotiable. Cost of overriding stated preference (churn, complaint escalation) exceeds any benefit of one-more-attempt. |
| Long conversation context fills | compress turns 2 through N-1 into 3 lines at turn 15; case-facts stays untouched | keep all messages OR summarize the case-facts block | Conversation history has diminishing returns; case-facts are structural. Compress history, never facts. |
| Angry customer with valid request | process the request normally; sentiment does not trigger escalation | escalate on negative sentiment | Sentiment is orthogonal to escalation need. Angry-but-valid customers should get the answer; only policy gaps + tool limits + explicit requests warrant escalation. |
Where it breaks
Five failure pairs. Each one is one exam question. The fix is always architectural, deterministic gates, structured fields, pinned state.
By turn 9, agent has summarized cust_4711's order ID to 'a recent order'. Treats turn 10 as a new conversation.
AP-35Pin CASE_FACTS at top of system prompt. Re-read every turn. Never paraphrased. Compression only touches the message list, never the case-facts block.
System prompt says 'do not re-ask answered questions'. Agent re-asks 'which order?' on turn 4 and turn 8. 8% leakage.
AP-02Track answered clarifications in session_state. PreToolUse hook checks state and blocks downstream tools if a prerequisite clarification is unanswered. Deterministic, not probabilistic.
50 turns fill the context window. Lost-in-the-middle effect drops the order ID. Agent makes contradictory recommendations.
AP-03At turn 15, summarize turns 2 through N-1 into 3 lines preserving decisions only. Case-facts stays at prompt top. Frees ~40% tokens with zero decision loss.
User said 'I do not want to be contacted by phone' on turn 3. Agent suggests phone callback on turn 9.
AP-04Persist session_state with contact_preference + decision flags. Read into prompt every turn alongside case-facts.
Angry customer with a valid refund request is escalated because tone is negative. 50% false-positive rate, customers learn that anger = faster service.
AP-22Escalation triggers only on (a) policy gap, (b) tool limit, (c) explicit user request. Sentiment is logged for reporting but never gates escalation.
Cost & latency
12 avg turns × (cached system + tools + dynamic case-facts/session-state + accumulating history). Cache hits ~70% on stable preamble + tools.
Pre-cache: ~$0.05. With ephemeral cache on stable system + tools: ~$0.022. ~55% reduction on long conversations.
Streaming first token in ~150ms. Tool round-trips 1.5-2s each. Average 1.5 tool calls per turn + hook (<100ms) + compose.
12 verbose message blocks (≈8K tokens) → 3-line summary (≈250 tokens). Frees the long-tail conversations from lost-in-the-middle and OOM-on-history.
5-min ephemeral TTL on stable preamble + tool definitions. Continuous chat traffic keeps cache warm. Per-turn case-facts/session-state stays fresh, as it should.
Ship checklist
Two passes. Build-time gates verify the code; run-time gates verify the system in production.
Build-time
- Case-facts block pinned at top of every system-prompt iteration↗ case-facts-block
- Session-state block serialized into prompt after case-facts↗ session-state
- PreToolUse clarification gate hook (deterministic prerequisite block)↗ hooks
- Loop branches on stop_reason, never on response text↗ agentic-loops
- History summarizer fires at turn 15; case-facts left untouched↗ context-window
- Explicit human-handoff phrase detection latches escalation_requested = true↗ escalation
- Sentiment is logged but never gates escalation
- Stable system preamble cached with cache_control: ephemeral↗ prompt-caching
- Tool definitions cached (unchanged across turns)↗ tool-calling
- Audit log per closed conversation (structured, not transcript)
- Iteration cap (max_iter=12) as a safety net, not the primary control
Run-time
- Unit tests for case-facts block construction (immutable, ordered, exact-string)
- Integration test: 20-turn conversation with case-facts assertion at every turn
- Hook test: missing verified_id → exit 2; escalation_requested + non-escalate tool → exit 2
- Compression test: at turn 15, message list shrinks; case-facts unchanged; decisions preserved
- Explicit-handoff test: 'I want a human' → escalation_requested latches true → hook blocks other tools
- Latency monitor: alert if p95 turn-latency > 6s for ≥ 5 min
- Cost monitor: alert if per-conversation cost > $0.04 (signals cache hit rate dropped)
- False-escalation monitor: alert if sentiment-only escalations exceed 1% of total
5 exam-pattern questions
Tap an answer to check it instantly, then expand the full breakdown for the rationale per option, the mental model under test, and the priority order across distractors.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Frequently asked
Why pin case-facts in the system prompt instead of passing them as a tool result?
What's the right threshold for triggering history compression?
If case-facts is immutable, how do I update it when the customer changes their decision?
Why a PreToolUse hook for clarification, instead of putting the rule in the prompt?
Should I cache the case-facts block?
cache_control: ephemeral) and a dynamic block (case-facts + session-state, fresh every turn). You get ~70% hit rate on the cached portion and zero staleness on the dynamic portion.What if the customer switches topics mid-conversation (refund → tech support)?
How do I test that conversation continuity is actually working?
What goes in the audit log for a long conversation?
customer_id, turn_count, ordered list of tool_calls (just names + timestamps), per-turn stop_reasons, compression_fired_at_turn (if any), escalation_reason (if any), elapsed_ms_total, csat (if surveyed). Skip the full transcript. The structured trace is 50× smaller and replays any failure path. Retain 90 days minimum for production debugging and exam-style retrospectives.