P3.9 · D1 + D5 · Process

Claude for Operations.

Think of this as the on-call agent that picks up the 3 AM page so a human doesn't have to. But only for the boring, predictable incidents, and only when the action is reversible. The agent reads a structured alert, classifies it into one of four buckets (transient blip vs permission failure vs bad data vs business policy), matches the right runbook, and executes the safe steps. The dangerous steps (`rm -rf prod`, dropping a production database) are blocked by a deterministic hook before they ever run. The agent literally cannot execute them, no matter how cleverly the alert text is phrased. Anything ambiguous gets a structured escalation block (incident ID, service, severity, root cause, recommended next step) routed to a human in PagerDuty. The whole point is composure under pressure: hooks for safety, structured isError for clarity, audit logs for after-action review.

26 min build·5 components·8 concepts

An on-call ops agent. Alerts arrive structured (service, severity, root_cause_signal, 4-bucket isError: Transient · Permission · Data · Business); the agent matches a runbook from a small registry (3-5 safe playbooks: restart, rollback, drain, scale); a PreToolUse hook denies destructive Bash (rm -rf, sudo, drop database) before it executes; on ambiguity or denial, the harness emits a structured escalation block (incident_id, service, severity, root_cause, partial_status, recommended_action) to PagerDuty; every tool call writes a PostToolUse audit log entry with the full call context. The most-tested distractor: sentiment-based escalation. Sentiment is orthogonal to ops. Escalate on policy gaps and access failures, never on log-message tone.

42% exam weight
SourceBeyond-guide scenario · OP-claimed (Reddit 1s34iyl) · architecture matches Anthropic public guidance
What do the colours mean?
Green
Official Anthropic doc or API contract
Yellow
Partial doc / inferred
Orange
Community-derived
Red
Disputed / changes frequently
Stack
Claude SDK · monitoring webhook · PagerDuty/Slack · durable audit log
Needs
Hooks (Pre/Post) · structured escalation · 4-bucket isError
Exam
42% of CCA-F (D1 + D5). 27% D1 · 15% D5. Highest-weight scenario on the test. Master this one and you've covered most of it.
Loop — the ACP mascot — illustrated as a calm customer-support agent at a walnut desk with headset, notebook, and a small speech-bubble holding an inbound question.
End-to-end flow42% of CCA-F (D1 + D5)
01 · Problem framing

The problem

What the customer needs

  1. Auto-resolve boring incidents. Pod restarts, cache flushes, log rotation. Without paging a human.
  2. Block destructive commands deterministically. rm -rf /prod must NEVER execute, no matter what the alert text says.
  3. Hand off ambiguous incidents cleanly. When the agent can't safely act, the on-call human gets a structured block (not a raw transcript) and can decide in 30 seconds.

Why naive approaches fail

  1. Agent runs commands without a hook gate → `rm -rf /prod` executes because the alert text said 'clear stuck pods'.
  2. Permission-denied confused with empty result → agent retries forever thinking 'no data yet'; backoff never converges.
  3. Sentiment-trigger escalation → angry-but-correct alerts wake humans; calm-but-broken alerts get ignored.
  4. No audit log → post-mortem can't reconstruct what the agent did at 3 AM; trust evaporates.
Definition of done
  • Every monitoring alert classified into the 4-bucket isError contract (Transient · Permission · Data · Business)
  • 3-5 safe runbooks in the registry; rare/unique incidents always escalate (don't add a 6th runbook for an edge case)
  • PreToolUse hook denies destructive commands by regex; exit 2 returns a model-readable reason
  • Structured escalation block for every human handoff: incident_id, service, severity, root_cause, partial_status, recommended_action
  • PostToolUse audit log writes every tool call with input, output, latency, and the bucket the result fell into
  • Sentiment is logged but never gates routing. Escalation triggers are policy gaps, access failures, and explicit user request only
02 · Architecture

The system

03 · Component detail

What each part does

5 components, each owns a concept. Click any card to drill into the underlying primitive.

Monitoring Alert Parser

structured isError, 4 buckets

Receives raw alert payloads (Datadog, Prometheus, Sentry, custom monitors) and projects them into a stable 4-bucket isError contract: {bucket: 'Transient'|'Permission'|'Data'|'Business', service, severity, root_cause_signal, retryable}. The agent reads bucket and retryable and routes; it never parses raw alert text. Without this contract, the agent retries permission errors forever and panics on transient blips.

Configuration

Webhook receiver → schema validator → bucket classifier (rule-based or Haiku-classified) → enriched alert. Schema requires bucket + retryable; the alert ingestor rejects alerts missing the contract.

Concept: structured-outputs

Runbook Registry (3-5 Safe Playbooks)

small, audited, reversible

A tight registry of 3-5 named runbooks the agent can execute. Each runbook is a sequence of safe, reversible commands with explicit preconditions, timeouts, and rollback steps. Rare or unique incidents NEVER get a runbook. They escalate. The registry stays small on purpose; expanding it past 5 erodes the agent's routing accuracy and increases blast radius.

Configuration

Registry: ["restart_service", "drain_region", "rollback_release", "rotate_credentials", "scale_replicas"]. Each has {preconditions: [...], steps: [...], timeout_s, rollback: [...]}. Reviewed in PRs; production-deployed via the same CI/CD as application code.

Concept: tool-calling

PreToolUse Hook (Blocklist Gate)

deterministic destructive-command guard

Sits between the model's tool_use request and Bash/destructive tool execution. Reads the proposed command; matches against an explicit blocklist regex (rm -rf, sudo, drop database, kill -9, chmod 777, >:). On match, exits 2 with a model-readable reason. The agent observes the deny in the next turn as a tool_result with is_error: true and re-plans (typically by escalating). Deterministic. No prompt-injection bypass.

Configuration

matcher: "Bash". Blocklist regex (compiled once): r"\b(rm\s+-rf|sudo\s|drop\s+(database|table)|kill\s+-9|chmod\s+777|>:)\b". Exit 2 with stderr message. Allowlist for known-safe binaries (kubectl, docker, journalctl, curl, jq).

Concept: hooks

Structured Escalation Block

30-second human triage

When the agent can't (or shouldn't) act. Unknown runbook, hook denied, ambiguous alert, explicit request. The harness writes a STRUCTURED escalation block to PagerDuty / Slack. Six fields, every field required: incident_id, service, severity, root_cause_signal, partial_status (what the agent already did), recommended_action. Humans triage in ~30 seconds vs ~5 minutes reading a raw transcript.

Configuration

Schema: {incident_id, service, severity: "P1|P2|P3", root_cause_signal: "permission|transient|data|business|unknown", partial_status: "what agent did before stopping", recommended_action: "single sentence"}. Posted to PagerDuty REST + Slack #ops-incidents.

Concept: escalation

PostToolUse Audit Log

the post-mortem replay tool

Fires AFTER every tool call (Bash, runbook, escalate, log query). Writes a canonical row: ts, tool_name, tool_input, tool_result_bucket, latency_ms, stop_reason_context, hook_decisions. Append-only JSONL on durable storage. Indispensable for the 4 AM 'what did the agent do?' post-mortem; without it, trust in the agent collapses on the first incident.

Configuration

matcher: '*'. Append to audit/{YYYY-MM-DD}.jsonl. Retain ≥ 90 days. Searchable by incident_id, service, tool_name. Includes both successful and denied calls.

Concept: evaluation
04 · One concrete run

Data flow

05 · Build it

Eight steps to production

01

Parse alerts into the 4-bucket isError contract

The webhook receiver projects raw monitoring payloads (Datadog/Prometheus/Sentry/custom) into a stable shape: {bucket, service, severity, root_cause_signal, retryable, raw}. The bucket is the single most important field. It's what the agent's routing logic branches on. Bucket assignment is rule-based for known patterns; ambiguous alerts get classified by Haiku (cheap, fast) and emit the bucket + a confidence score.

Parse alerts into the 4-bucket isError contract
from enum import Enum
from typing import TypedDict, Literal

class Bucket(str, Enum):
    TRANSIENT  = "Transient"   # network blip, rate limit. Retry
    PERMISSION = "Permission"  # 403/401. Escalate, won't fix itself
    DATA       = "Data"        # malformed input. Surface to user
    BUSINESS   = "Business"    # policy violation. Block + log + escalate

class Alert(TypedDict):
    bucket: Bucket
    service: str
    severity: Literal["P1", "P2", "P3"]
    root_cause_signal: str
    retryable: bool
    raw: dict

def classify_alert(raw: dict) -> Alert:
    """Project arbitrary monitoring payload into the 4-bucket contract."""
    msg = (raw.get("message") or "").lower()
    if any(s in msg for s in ["503", "timeout", "rate limit", "circuit breaker"]):
        bucket, retryable = Bucket.TRANSIENT, True
    elif any(s in msg for s in ["403", "401", "forbidden", "unauthorized"]):
        bucket, retryable = Bucket.PERMISSION, False
    elif any(s in msg for s in ["malformed", "schema", "validation", "400"]):
        bucket, retryable = Bucket.DATA, False
    elif any(s in msg for s in ["policy", "compliance", "limit exceeded"]):
        bucket, retryable = Bucket.BUSINESS, False
    else:
        # Ambiguous. Classify with Haiku (cheap) and trust the bucket it returns
        bucket, retryable = haiku_classify_bucket(raw)
    return {
        "bucket": bucket,
        "service": raw.get("service", "unknown"),
        "severity": raw.get("severity", "P2"),
        "root_cause_signal": raw.get("root_cause") or msg[:100],
        "retryable": retryable,
        "raw": raw,
    }

def haiku_classify_bucket(raw: dict) -> tuple[Bucket, bool]:
    """Cheap fallback classifier for ambiguous alerts."""
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=64,
        messages=[{"role": "user", "content":
            f"Classify into one bucket (Transient|Permission|Data|Business) "
            f"+ retryable (true|false). Alert: {raw}\n"
            f"Output JSON only: {{\"bucket\": ..., \"retryable\": ...}}"}],
    )
    parsed = json.loads(resp.content[0].text)
    return Bucket(parsed["bucket"]), parsed["retryable"]
↪ Concept: structured-outputs
02

Define a small runbook registry (3-5 entries)

Every runbook is a named sequence of safe, reversible commands with explicit preconditions, timeouts, and rollback steps. The registry stays at 3-5 entries. The agent's routing accuracy degrades past 5, and a sixth runbook for an edge case is exactly the wrong move (escalate the edge case instead). Each runbook is reviewed in PRs and deployed through CI like application code.

Define a small runbook registry (3-5 entries)
# Runbooks live in code, version-controlled, PR-reviewed
from typing import TypedDict

class RunbookStep(TypedDict):
    cmd: str           # the actual shell command
    timeout_s: int
    rollback: str      # the inverse command, run on failure

class Runbook(TypedDict):
    name: str
    description: str   # 4-line pattern (what / when / edges / ordering)
    preconditions: list[str]   # gating conditions
    steps: list[RunbookStep]
    requires_human_confirm: bool

RUNBOOKS: list[Runbook] = [
    {
        "name": "restart_service",
        "description": (
            "Restart a stuck Kubernetes pod or systemd service.\n"
            "Use when service health check fails 3 consecutive times.\n"
            "Edge cases: returns 'still-stuck' if pod re-enters CrashLoopBackOff; escalate.\n"
            "Prerequisite: service is in the runbook allowlist."
        ),
        "preconditions": ["service in ALLOWLIST", "alert.bucket == Transient"],
        "steps": [
            {"cmd": "kubectl rollout restart deploy/{service}", "timeout_s": 60,
             "rollback": "kubectl rollout undo deploy/{service}"},
            {"cmd": "kubectl rollout status deploy/{service} --timeout=2m", "timeout_s": 120,
             "rollback": ""},
        ],
        "requires_human_confirm": False,
    },
    {
        "name": "rollback_release",
        "description": (
            "Roll the service back to the last known-good release.\n"
            "Use when a release introduces a P1/P2 regression.\n"
            "Edge cases: returns 'no-prior-release' if no rollback target; escalate.\n"
            "Prerequisite: service has a deploy history."
        ),
        "preconditions": ["service in ALLOWLIST", "deploy_history_count >= 2"],
        "steps": [
            {"cmd": "kubectl rollout undo deploy/{service}", "timeout_s": 60,
             "rollback": "kubectl rollout undo deploy/{service} --to-revision={current}"},
        ],
        "requires_human_confirm": True,  # destructive, gates on human click
    },
    # ... drain_region, rotate_credentials, scale_replicas. 5 total
]

def match_runbook(alert: Alert) -> Runbook | None:
    """Pick the runbook whose preconditions match. Returns None to escalate."""
    for rb in RUNBOOKS:
        if all(eval_precondition(p, alert) for p in rb["preconditions"]):
            return rb
    return None  # no match. Escalate
↪ Concept: tool-calling
03

Wire the PreToolUse hook with a destructive blocklist

The hook is the deterministic safety gate. It reads tool_name + tool_input.command from stdin JSON, applies a regex blocklist (compiled once), exits 2 with a stderr message on match. The model sees the deny as a tool_result with is_error: true and re-plans. No prompt-injection bypass; the blocklist is in code, not in the prompt.

Wire the PreToolUse hook with a destructive blocklist
# .claude/hooks/ops_blocklist.py
import sys, json, re

BLOCKLIST = re.compile(
    r"\b("
    r"rm\s+-rf"            # destructive recursive delete
    r"|sudo\s+"            # privilege escalation
    r"|drop\s+(database|table)"
    r"|kill\s+-9"          # uncatchable termination
    r"|chmod\s+777"        # world-writable
    r"|>(:?\s*/[^\s]+/(etc|usr|var)/)"  # redirect to system paths
    r"|curl\s+[^\s]+\s*\|\s*sh"  # remote shell exec
    r")\b",
    re.IGNORECASE,
)

ALLOWLIST_BINS = {"kubectl", "docker", "journalctl", "curl", "jq", "ps", "df", "top"}

def main():
    payload = json.loads(sys.stdin.read())
    if payload["tool_name"] != "Bash":
        sys.exit(0)
    cmd = payload["tool_input"].get("command", "")
    # Hard blocklist
    if BLOCKLIST.search(cmd):
        print(
            f"BLOCKED: command matches destructive pattern. "
            f"Escalate via the structured escalation block. command={cmd!r}",
            file=sys.stderr,
        )
        sys.exit(2)  # DENY
    # Soft allowlist on the binary (first token)
    first = cmd.strip().split()[0] if cmd.strip() else ""
    if first and first not in ALLOWLIST_BINS:
        print(
            f"BLOCKED: binary {first!r} not on the ops allowlist. "
            f"Allowed: {sorted(ALLOWLIST_BINS)}",
            file=sys.stderr,
        )
        sys.exit(2)
    sys.exit(0)

if __name__ == "__main__":
    main()
↪ Concept: hooks
04

Build the structured escalation block

When the agent can't act safely. Unknown runbook, hook denied, ambiguous alert, explicit escalation request. The harness emits a structured escalation block to PagerDuty / Slack. Six required fields. Humans triage in ~30 seconds vs ~5 minutes reading a raw transcript. The block is the single most important UX artifact in the whole system; tune the wording to be a one-screen read.

Build the structured escalation block
import json, requests
from typing import TypedDict, Literal

class EscalationBlock(TypedDict):
    incident_id: str
    service: str
    severity: Literal["P1", "P2", "P3"]
    root_cause_signal: str
    partial_status: str
    recommended_action: str

PAGERDUTY_KEY = os.environ["PAGERDUTY_INTEGRATION_KEY"]

def escalate(alert: Alert, partial_actions: list[str], reason: str) -> EscalationBlock:
    """Emit the structured handoff block. Six fields, every field required."""
    block: EscalationBlock = {
        "incident_id": f"PD-{alert['service']}-{int(time.time())}",
        "service": alert["service"],
        "severity": alert["severity"],
        "root_cause_signal": alert["root_cause_signal"],
        "partial_status": (
            "; ".join(partial_actions) if partial_actions else "agent stopped before any action"
        ),
        "recommended_action": derive_recommendation(alert, reason),
    }
    # Post to PagerDuty
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PAGERDUTY_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": f"[{block['severity']}] {block['service']}: {block['root_cause_signal']}",
                "severity": "critical" if block["severity"] == "P1" else "warning",
                "source": "claude-ops-agent",
                "custom_details": block,
            },
        },
    )
    # Also write to audit log
    audit_log("escalation", block)
    return block

def derive_recommendation(alert: Alert, reason: str) -> str:
    """Single-sentence recommended action for the on-call human."""
    if reason == "hook_denied":
        return f"Run the blocked command manually after verifying it is safe; agent will not retry."
    if reason == "unknown_runbook":
        return f"No runbook matched. Investigate root cause; consider adding a runbook if pattern recurs."
    if alert["bucket"] == "Permission":
        return f"Rotate or grant the missing credential; agent cannot self-recover from Permission errors."
    return f"Investigate the alert and apply the appropriate remediation manually."
↪ Concept: escalation
05

Wire the PostToolUse audit log

PostToolUse fires AFTER every tool call (Bash, runbook, escalate, log query, hook decisions). Writes a canonical row: ts, tool_name, tool_input, tool_result, latency_ms, hook_decisions, incident_id. Append-only JSONL on durable storage; retain ≥ 90 days. Indexed by incident_id, service, tool_name. The post-mortem replay tool depends on it; without it, trust collapses on the first incident.

Wire the PostToolUse audit log
# .claude/hooks/ops_audit.py. PostToolUse
import sys, json, datetime
from pathlib import Path

AUDIT_DIR = Path("audit")
AUDIT_DIR.mkdir(exist_ok=True)

def main():
    payload = json.loads(sys.stdin.read())
    today = datetime.date.today().isoformat()
    row = {
        "ts": datetime.datetime.utcnow().isoformat() + "Z",
        "tool_name": payload["tool_name"],
        "tool_input": payload.get("tool_input", {}),
        "tool_result": payload.get("tool_result", {}),
        "latency_ms": payload.get("latency_ms"),
        "hook_decisions": payload.get("hook_decisions", []),
        "incident_id": payload.get("incident_id") or "no-incident",
        "stop_reason_context": payload.get("stop_reason"),
    }
    with open(AUDIT_DIR / f"{today}.jsonl", "a") as f:
        f.write(json.dumps(row) + "\n")
    sys.exit(0)

if __name__ == "__main__":
    main()

# Replay tool. Used in post-mortems
def replay(incident_id: str) -> list[dict]:
    """Reconstruct what the agent did on a given incident."""
    rows = []
    for path in sorted(AUDIT_DIR.glob("*.jsonl")):
        for line in path.read_text().splitlines():
            row = json.loads(line)
            if row["incident_id"] == incident_id:
                rows.append(row)
    return sorted(rows, key=lambda r: r["ts"])
↪ Concept: evaluation
06

Run the agent loop with stop_reason branching

The on-call agent loop is a strict stop_reason FSM. end_turn: success, log + close incident. tool_use: extract the tool call, run hooks, execute, append result, continue. tool_use → hook denied: append the deny as a tool_result with is_error: true, continue (the agent re-plans, usually by escalating). max_tokens: persist partial state, escalate. The loop NEVER branches on response text. The structured stop_reason is the only contract.

Run the agent loop with stop_reason branching
OPS_TOOLS = [BASH_TOOL, RUNBOOK_TOOL, ESCALATE_TOOL, GET_LOGS_TOOL]  # 4 tools

def run_ops_agent(alert: Alert, max_iter: int = 10) -> dict:
    """On-call agent loop. Strict stop_reason branching."""
    case_facts = {
        "incident_id": alert.get("incident_id") or alert["raw"].get("incident_id"),
        "service": alert["service"],
        "bucket": alert["bucket"],
        "severity": alert["severity"],
    }
    messages = [{"role": "user", "content": json.dumps(alert)}]
    partial_actions = []

    for iteration in range(max_iter):
        resp = client.messages.create(
            model="claude-sonnet-4.5",
            max_tokens=2048,
            system=build_ops_system_prompt(case_facts),
            tools=OPS_TOOLS,
            messages=messages,
        )
        if resp.stop_reason == "end_turn":
            audit_log("incident_resolved", {"case_facts": case_facts,
                                             "partial_actions": partial_actions})
            return {"status": "resolved", "actions": partial_actions}

        if resp.stop_reason == "tool_use":
            tool_uses = [b for b in resp.content if b.type == "tool_use"]
            results = []
            for tu in tool_uses:
                # PreToolUse hook runs here in the SDK; if it exits 2,
                # the SDK emits a tool_result with is_error: true automatically.
                result = execute_tool_with_hooks(tu, case_facts)
                results.append(result)
                if not result.get("is_error"):
                    partial_actions.append(f"{tu.name}: {tu.input}")
            messages.append({"role": "assistant", "content": resp.content})
            messages.append({"role": "user", "content": results})
            continue

        if resp.stop_reason == "max_tokens":
            # Persist + escalate; harness will resume in a fresh session
            block = escalate(alert, partial_actions, reason="max_tokens_exhaustion")
            return {"status": "escalated", "block": block}

    # Iteration cap. Escalate
    block = escalate(alert, partial_actions, reason="iteration_cap")
    return {"status": "escalated", "block": block}
↪ Concept: agentic-loops
07

Pin incident metadata in CASE_FACTS

Multi-turn ops conversations need the incident metadata anchored. Pin incident_id, service, severity, bucket, partial_status, runbook_in_progress at the top of every system prompt iteration. This survives summarization and ensures the agent on turn 8 still knows it's working on PD-4711, not a generic incident. Without case-facts pinning, multi-turn ops conversations regress to single-turn quality.

Pin incident metadata in CASE_FACTS
def build_ops_system_prompt(case_facts: dict) -> str:
    return f"""You are a 3-AM on-call ops agent. Composure under pressure.

CASE_FACTS (immutable; re-read every turn; values are EXACT):
- incident_id: {case_facts['incident_id']}
- service: {case_facts['service']}
- severity: {case_facts['severity']}
- bucket: {case_facts['bucket']}
- runbook_in_progress: {case_facts.get('runbook', 'none')}
- partial_status: {case_facts.get('partial_status', 'no actions yet')}

Constraints:
- ESCALATION TRIGGERS: policy gap | access failure (Permission bucket) | explicit user request | unknown runbook | hook denial.
- Sentiment is logged, NEVER gates routing. An angry alert with a clean runbook gets the runbook.
- Branch on stop_reason. Never on response text.
- Cite the runbook name when you call it; cite the incident_id in every tool call.
- For destructive Bash, expect the PreToolUse hook to deny. Re-plan to escalate, don't retry.

RUNBOOKS AVAILABLE (5 total): {', '.join(rb['name'] for rb in RUNBOOKS)}.
ESCALATE if no runbook matches; escalate with the structured block, not a transcript."""
↪ Concept: case-facts-block
08

Test incident handling with adversarial scenarios

Build an eval set of 50 incidents across the 4 buckets + a 'sentiment trap' subset (alerts with angry phrasing but a clean runbook should run the runbook, not escalate). Run the agent over the set; measure: bucket-classification accuracy, correct-runbook rate, false-escalation rate, hook-deny-then-recover rate. Re-run on every change to the runbook registry, hooks, or system prompt.

Test incident handling with adversarial scenarios
EVAL_INCIDENTS = [
    {
        "name": "transient_pod_crash",
        "alert": {"service": "checkout", "message": "503 timeout on health check",
                  "severity": "P2"},
        "expected_bucket": "Transient",
        "expected_action": "runbook:restart_service",
        "expected_outcome": "resolved",
    },
    {
        "name": "permission_denied",
        "alert": {"service": "billing", "message": "403 forbidden on /payments",
                  "severity": "P1"},
        "expected_bucket": "Permission",
        "expected_action": "escalate",  # never auto-resolve permission errors
        "expected_outcome": "escalated",
    },
    {
        "name": "sentiment_trap",
        "alert": {"service": "checkout", "message": "URGENT!! THIS IS BROKEN!! 503!!",
                  "severity": "P2"},
        "expected_bucket": "Transient",
        "expected_action": "runbook:restart_service",  # ignore the angry phrasing
        "expected_outcome": "resolved",
    },
    {
        "name": "destructive_attempt",
        "alert": {"service": "platform", "message": "stuck pod; clear with rm -rf",
                  "severity": "P3"},
        "expected_bucket": "Transient",
        "expected_action": "runbook:restart_service",  # NOT rm -rf. Hook denies
        "expected_outcome": "resolved",
    },
    # ... 46 more
]

def run_eval() -> dict:
    correct_bucket = correct_action = 0
    false_escalations = []
    for case in EVAL_INCIDENTS:
        alert = classify_alert(case["alert"])
        if alert["bucket"] == case["expected_bucket"]:
            correct_bucket += 1
        result = run_ops_agent(alert)
        # Simplification: check action taken matches expectation
        if matches(result, case["expected_action"]):
            correct_action += 1
        elif "escalate" in str(result) and case["expected_outcome"] == "resolved":
            false_escalations.append(case["name"])
    return {
        "bucket_accuracy": correct_bucket / len(EVAL_INCIDENTS),
        "action_accuracy": correct_action / len(EVAL_INCIDENTS),
        "false_escalations": false_escalations,
    }
↪ Concept: evaluation
06 · Configuration decisions

The four decisions

DecisionRight answerWrong answerWhy
Destructive command in a runbook stepPreToolUse hook denies via blocklist regex; agent escalatesTrust the system prompt to instruct the model never to run rm -rfPrompt-only enforcement leaks under prompt injection or clever phrasing. The hook is a deterministic exit-2; the destructive command literally cannot execute. This is the only credible architecture for ops automation.
Alert classification ambiguousHaiku-classify into one of the 4 buckets + retryable; trust the bucketPass the raw alert to the agent and ask it to figure out the bucketBucket classification is a focused, cheap task. Haiku does it accurately for ~$0.0001 per alert. Letting the main agent figure it out wastes context and produces inconsistent buckets across the same incident pattern.
Runbook registry size3-5 safe, reversible runbooks; rare incidents always escalateAdd a 6th, 7th, 8th runbook for edge cases as they appearPast 5 runbooks, routing accuracy drops and blast radius grows. The cost of ONE wrong runbook firing on a 'similar but different' incident exceeds the cost of escalating that edge case to a human. Stay small on purpose.
Agent can't resolve. EscalateStructured 6-field escalation block to PagerDutyForward the full transcript to the on-call humanHumans triage a structured block in ~30 seconds (incident_id, service, severity, root_cause, partial_status, recommended_action. One-screen read). They take 5+ minutes parsing a raw transcript at 3 AM. The block is the single most-impactful UX artifact in the system.
07 · Failure modes

Where it breaks

Five failure pairs. Each one is one exam question. The fix is always architectural, deterministic gates, structured fields, pinned state.

Runbook execution without hook enforcement

Agent calls Bash with rm -rf /tmp/stale_pods. Typo'd as rm -rf / in production. The command runs because the only guard was a system-prompt warning. Service goes down; recovery takes hours.

AP-OP-01
✅ Fix

PreToolUse hook with a blocklist regex (rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh). Exit 2 with stderr message on match. The agent observes the deny, escalates instead of retrying. Deterministic, no bypass.

Prompt-only constraints

System prompt says 'never run dangerous commands'. Production logs show the agent sometimes runs them anyway (3-5%) because the alert text was very persuasive or used unusual phrasing.

AP-OP-02
✅ Fix

Move the constraint to a deterministic PreToolUse hook. Exit 2 on the blocklist match. The system prompt becomes a soft suggestion that complements the hard hook, not a substitute for it.

Ambiguous incident classification

Agent reads raw alert text and infers severity / type. Same alert pattern produces different classifications on different days; routing drifts; on-call humans get inconsistent escalations.

AP-OP-03
✅ Fix

Project every alert into the 4-bucket isError contract at the webhook receiver (rule-based + Haiku fallback). The agent reads bucket and retryable; never parses raw alert text. Classification is consistent and auditable.

Silent escalation, no audit trail

Agent escalates ambiguous incidents but writes nothing to a durable log. Post-mortem at 9 AM can't reconstruct what the agent saw, what it tried, what it skipped. Trust collapses on the first incident.

AP-OP-04
✅ Fix

PostToolUse audit hook on every tool call: append to audit/{date}.jsonl with ts, tool_name, tool_input, tool_result, latency_ms, hook_decisions, incident_id. Retain ≥ 90 days. Replay tool reconstructs any incident in seconds.

Access-failure as empty-result

Monitoring API returns 403 forbidden; agent treats empty response as 'no alerts to handle' and goes back to sleep. Real incident festers undetected for hours; eventually a human finds the auth-token expired.

AP-OP-05
✅ Fix

Structured isError bucket (Permission). The monitoring tool wrapper returns {is_error: true, bucket: 'Permission', retryable: false, detail: 'auth token expired'}. Agent reads bucket: Permission and immediately escalates. Permission errors NEVER look like empty results.

08 · Budget

Cost & latency

Per-incident cost (typical, with cache)
~$0.005-0.015

Haiku alert classification (~$0.0001) + Sonnet 4.5 ops loop ~3-5 turns × ~1500 cached input + ~300 output tokens. Total per resolved incident: ~$0.01. PagerDuty integration adds no token cost.

Hook overhead
~5-15ms per tool call; ~0% token cost

PreToolUse + PostToolUse run as subprocesses reading stdin JSON. No LLM call. Pure local Python/TS. Latency below the noise floor of typical Bash / kubectl execution.

Audit log storage
~5-10 MB/month at 1000 incidents/month

Each row ~5 KB; 1000 incidents × ~5 tool calls each = 5000 rows/month ≈ 25 MB. At $0.023/GB/month object storage, <$0.001/month. Negligible.

Eval set run (50 incidents, weekly)
~$0.30/run

50 × $0.006 average per incident. Weekly = $1.20/month. Insurance against runbook registry / hook regression; gates deploys.

False-escalation cost (avoided)
~$30-100 per false-escalation prevented

A false escalation wakes an on-call human at 3 AM (~30 min wasted at $60/hr fully-loaded ≈ $30) or worse, trains them to ignore alerts (compounding cost). The eval set + sentiment-trap test pays for itself many times over by preventing this.

09 · Ship gates

Ship checklist

Two passes. Build-time gates verify the code; run-time gates verify the system in production.

Build-time

  1. Webhook receiver projects alerts into 4-bucket isError contractstructured-outputs
  2. Runbook registry capped at 3-5 entries; PR-reviewed; deployed via CItool-calling
  3. PreToolUse hook with destructive blocklist regex; allowlist of safe binarieshooks
  4. PostToolUse audit log writes every tool call to durable storageevaluation
  5. Structured 6-field escalation block (incident_id, service, severity, root_cause_signal, partial_status, recommended_action)escalation
  6. Sentiment is logged but NEVER gates escalation; sentiment-trap eval passessystem-prompts
  7. CASE_FACTS pins incident metadata across multi-turn ops conversationscase-facts-block
  8. Agent loop branches strictly on stop_reason, never on response textagentic-loops
  9. 50-incident eval set covers all 4 buckets + sentiment-trap + destructive-attempt; gated in CI
  10. Audit log retention ≥ 90 days; replay tool reconstructs any incident
  11. PagerDuty integration delivers structured block with custom_details intact

Run-time

  • Alert webhook receiver enforces 4-bucket schema; rejects malformed payloads
  • Runbook registry size capped at ≤ 5 in CI lint; PRs to add a 6th require explicit override
  • PreToolUse blocklist regex tested against an adversarial 'destructive-attempt' eval subset
  • PostToolUse audit hook writes every tool call; retention ≥ 90 days
  • Structured escalation block schema enforced; PagerDuty integration delivers custom_details intact
  • Sentiment-trap eval subset gates deploys at ≥ 95% correct-action rate
  • Replay tool reconstructs any incident from audit logs in <5 seconds
  • On-call human runbook exists for the case 'agent escalates and no human is on PagerDuty' (escalation-of-escalation)
10 · Question patterns

Five exam-pattern questions

An ops agent processes alerts. In ~8% of cases it calls `restart_pod` when the alert was actually a 403 permission-denied error from monitoring. What architectural change distinguishes access-failure from actual pod crash?
Project every alert into the 4-bucket isError contract at the webhook receiver: {bucket: 'Transient'|'Permission'|'Data'|'Business', retryable, ...}. The agent reads bucket and retryable. It never parses raw alert text. Permission → escalate (won't fix itself); Transient → retry / runbook; Data → surface to user; Business → block + log + escalate. Classification is consistent across the same pattern; the agent's routing logic becomes reliable. Tagged to AP-OP-05.
Your runbook enforcement uses a system-prompt warning: 'never run destructive commands like rm -rf'. Production sees ~3% of refunds that violate the cap and ~3-5% of destructive commands that slip through despite the warning. The agent keeps requesting `rm -rf` after a clever-phrased alert. Architectural response?
Move the constraint to a PreToolUse hook with a blocklist regex. Hook reads tool_input.command, matches against a compiled regex (rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh), exits 2 on match. The agent observes the deny as tool_result: {is_error: true} and re-plans (typically by escalating). Hooks are deterministic; prompts are probabilistic. This is the only credible architecture for ops automation. Tagged to AP-OP-01.
An incident-response agent escalates ambiguous cases. Today's baseline: escalation triggers on negative-sentiment phrasing OR low confidence. False-escalation rate is ~50%. What deterministic criteria should replace those heuristics?
Structural triggers only: (a) policy gap, (b) access failure (Permission bucket), (c) explicit user request, (d) unknown runbook (no match in the registry), (e) hook denial. Sentiment is logged for analytics but NEVER gates routing. An angry alert with a clean runbook gets the runbook; a calm-but-broken alert escalates if it hits one of (a)-(e). The eval set's 'sentiment-trap' subset specifically tests this and gates deploys at ≥ 95% pass rate.
You need to audit every ops action for compliance. Should audit logging live in a PostToolUse hook or in a prompt instruction?
PostToolUse hook. Tool-based logging via the agent is *easily skipped* (the model can simply not emit the audit-log tool call); hooks are *automatic* and run on every tool execution by the SDK. The PostToolUse hook writes a canonical row (ts, tool_name, tool_input, tool_result, latency_ms, hook_decisions, incident_id) to append-only JSONL. Replay tool reconstructs any incident in seconds. Without this, post-mortems can't reconstruct what the agent did at 3 AM and trust evaporates. Tagged to AP-OP-04.
Runbook spec: 'Restart payment service if latency > 5s for >2min'. This combines monitoring signal parsing + action criteria. Where should this logic live: agent prompt, hook, or specialized tool?
Specialized tool with the threshold logic in code, not in the prompt. The runbook's preconditions field encodes the structural condition (latency_p99 > 5s AND duration > 120s); the agent's job is to match the alert to the runbook, not to evaluate the threshold itself. Putting the threshold in the prompt makes it probabilistic (the model might decide 4.8s is 'close enough'); putting it in the hook is too narrow (hooks gate, they don't compute); a typed runbook with code-evaluated preconditions is the right composition.
11 · FAQ

Frequently asked

How do I distinguish a permission-denied error from an actual service failure?
Both bucket projection AND structured isError on the monitoring tool. The webhook receiver classifies the alert into one of 4 buckets; the monitoring tool wrapper that fetches additional context returns {is_error: true, bucket: 'Permission', retryable: false, detail: '...'} instead of an empty result. The agent reads bucket and retryable and routes. Permission → escalate, Transient → retry. Don't parse error text.
Can the hook block dangerous commands?
Yes. That's its whole job. PreToolUse hook with a blocklist regex on Bash. Exit 2 with a stderr message on match; the agent sees the message as tool_result: {is_error: true, content: <stderr>} and re-plans. Tested against rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh. Pair with an allowlist of safe binaries (kubectl, docker, journalctl, etc.) so unknown binaries also deny.
Should audit logging be a tool call or a hook?
A PostToolUse hook. Tool-based logging via the agent is easily skipped (the model decides not to call it on a busy turn); hooks run automatically by the SDK on every tool execution. The hook writes a canonical row to append-only JSONL; the replay tool reconstructs any incident later. This is the difference between 'we have logs sometimes' and 'we have logs always'.
What triggers an escalation?
Five structural triggers, in priority order: (1) explicit user request, (2) hook denial of a needed action, (3) unknown runbook (no match in the registry), (4) Permission bucket (access failure won't self-recover), (5) max_tokens or iteration cap. Sentiment is logged but NEVER triggers escalation. The 'sentiment trap' eval subset specifically tests that an angry-but-clean alert gets the runbook, not an escalation.
How does the agent know which runbook to execute?
Runbook registry has explicit preconditions; the agent's `match_runbook(alert)` function returns the first runbook whose preconditions all match. No match → escalate (don't bend a runbook to fit an edge case). The match logic is deterministic (rule evaluation), not LLM-judged. Adding a runbook means adding a precondition definition + the steps + the rollback; it's a code change, PR-reviewed, deployed through CI.
What if the operator approves an escalation later?
Round-trip via the escalation block + a separate approval flow. The structured block hits PagerDuty / Slack; the operator clicks approve in their tool; an approval webhook fires; the harness picks up the approval, loads the case-facts checkpoint, and resumes the agent loop with the approved action injected. The agent never bypasses the escalation. Humans authorize the next step explicitly.
Can the agent execute arbitrary bash?
No. Two layers: (1) PreToolUse hook with a destructive blocklist regex denies known-dangerous patterns; (2) allowlist of safe binaries (kubectl, docker, journalctl, curl, jq, etc.). Anything else is denied even if not on the blocklist. Both layers are deterministic. The agent's Bash access is narrow enough that a prompt-injection attack can't escalate to repo-wide damage.
P3.9 · D1 · Agentic Architectures

Claude for Operations, complete.

You've covered the full ten-section breakdown for this primitive, definition, mechanics, code, false positives, comparison, decision tree, exam patterns, and FAQ. One technical primitive down on the path to CCA-F.

Share your win →