The problem
What the customer needs
- Run real Python or shell scripts as part of the agent's workflow, not just simulate them.
- Untrusted code stays contained. A misbehaving script does not destroy the host's filesystem or exhaust its memory.
- Predictable termination. A runaway loop or infinite recursion stops at exactly the configured time limit.
- Consistent output shape. The agent sees a predictable JSON contract regardless of which tool ran or how the script printed.
Why naive approaches fail
- Use Bash for everything (including
cat file.txtinstead of Read). Audit trail is opaque, file I/O and execution conflate. - No PreToolUse blocklist. A clever prompt-injection in the alert text gets
rm -rf /prodto execute. - No resource limits. A loop allocates 10 GB or runs forever; the sandbox runner is exhausted.
- Heterogeneous raw output passed to the agent. The agent parses inconsistently and routes wrong.
- Schema-only validation. The output matches
{status: string}butstatusis'banana'. The agent acts on nonsense.
- File I/O routes to Read / Write / Edit. Bash is reserved for actual command execution.
- PreToolUse hook on Bash with a destructive blocklist (regex). Exit 2 on match.
- Sandbox runtime (Docker or Firecracker) with kernel-level limits: CPU 2, memory 1GB, timeout 30s, network deny.
- PostToolUse hook normalizes raw output to JSON:
{status, stdout, stderr, duration_ms, peak_memory_mb}. - Semantic validator confirms result shape matches the task type.
- Audit log: every code-exec invocation writes an append-only row.
The system
What each part does
5 components, each owns a concept. Click any card to drill into the underlying primitive.
Bash Tool with Destructive Blocklist
PreToolUse gate, regex-driven
Bash sits behind a PreToolUse hook with a compiled regex blocklist (rm -rf, sudo, drop database, kill -9, chmod 777, curl ... | sh). Match exits 2 with a model-readable stderr message; agent observes the deny as tool_result: is_error: true and re-plans. No prompt-injection bypass: the blocklist is in code, not in the prompt.
Configuration
matcher: 'Bash'. Blocklist regex compiled at hook-load time. Allowlist of safe binaries (kubectl, docker, journalctl, jq, ps, df, top). Exit 2 with stderr.
Sandbox Runtime (Docker or Firecracker)
fresh sandbox per invocation
Each code-exec invocation runs in a freshly spawned isolated environment. Docker for most cases; Firecracker for stronger isolation when running fully untrusted user code. The image is cached so spin-up stays fast (~500ms warm). The sandbox is destroyed after the run; no state leaks between invocations.
Configuration
Sandbox config: { image: code-exec:latest, cpus: 2, memory_mb: 1024, timeout_sec: 30, network: deny, ipc: private, pid: private }. Image is read-only with a small writable tmpfs scratch space.
Resource Limit Enforcement
kernel-level via cgroups or systemd
CPU, memory, time, and network limits are enforced at the kernel level (cgroups for Docker, jailer for Firecracker, systemd-run --property=TimeoutStartSec=30s for raw process spawn). Kernel limits cannot be caught or ignored. Python signal-handler-based timeouts are the canonical wrong answer: a busy-loop or a try: pass swallows them.
Configuration
cgroup limits: cpu.max=2, memory.max=1G, network deny via iptables egress rule. Timeout via systemd-run --property=TimeoutStartSec=30s. Process exit code 137 (SIGKILL) means OOM; 124 means timeout.
PostToolUse Output Normalizer
raw bytes to structured JSON
Real shell output is messy: mixed Unix timestamps and ISO 8601, mixed status code conventions, multiline stack traces, ANSI color codes. The PostToolUse hook normalizes everything into a stable contract: {status, stdout, stderr, duration_ms, peak_memory_mb, exit_code}. Timestamps converted to ISO 8601 UTC. Color codes stripped. Long stdout truncated.
Configuration
matcher: 'Bash'. Hook reads stdin: {tool_name, tool_input, tool_result, latency_ms, peak_memory_mb}. Returns normalized JSON. Truncates stdout > 4 KB. Strips ANSI codes. Always exits 0.
Semantic Result Validator
task-aware sanity check
Schema validation guarantees shape; semantic validation guarantees meaning. Given the task context, check that the normalized result is sensible: passed + failed + skipped equals total; counts are non-negative. Failed semantic validation routes the agent back with a specific error message; it does NOT propagate bad data.
Configuration
Per-task validators registered by Skill. Test-runner validator: passed + failed + skipped == total. Data-analysis validator: row count > 0; required columns present.
Data flow
Eight steps to production
Route file I/O to built-in tools; reserve Bash for execution
The first layer of safety is tool selection. cat file.txt should be a Read call, not a Bash call. grep -r foo should be a Grep call. find . -name '*.py' should be a Glob call. Bash is reserved for what the built-ins cannot do: compile code, run tests, execute a Python data-analysis script. This single distinction shrinks the Bash blast-radius by ~80%.
# Wrong: Bash for everything
# tool_use: Bash, command: "cat config.json"
# tool_use: Bash, command: "grep -r 'TODO' src/"
# Right: built-in tools for I/O; Bash only for execution
# tool_use: Read, file_path: "config.json"
# tool_use: Grep, pattern: "TODO", path: "src/"
# tool_use: Glob, pattern: "**/*.py"
# Bash legitimately for execution:
# tool_use: Bash, command: "pytest tests/ --json-report"
# tool_use: Bash, command: "python analyze.py --input data.csv"
import re
FILE_IO_VIA_BASH = re.compile(
r"^\s*(cat|head|tail|less|more|grep|find|ls|wc|sort|uniq|cut|awk|sed)\s",
)
def warn_on_io_via_bash(tool_name: str, command: str) -> str | None:
if tool_name != "Bash":
return None
if FILE_IO_VIA_BASH.match(command):
first = command.strip().split()[0]
return (
f"Bash command starts with {first!r}. "
f"For file I/O prefer Read / Grep / Glob; reserve Bash for execution."
)
return NoneWire the PreToolUse blocklist hook on Bash
The destructive blocklist runs before the sandbox is even spawned. Compiled regex against rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh. Match exits 2; the agent sees the deny as a tool_result with is_error: true and re-plans.
# .claude/hooks/codeexec_blocklist.py
import sys, json, re
BLOCKLIST = re.compile(
r"\b("
r"rm\s+-rf"
r"|sudo\s+"
r"|drop\s+(database|table)"
r"|kill\s+-9"
r"|chmod\s+777"
r"|>\s*/(etc|usr|var)/"
r"|curl\s+[^|]+\|\s*sh"
r")\b",
re.IGNORECASE,
)
ALLOWLIST_BINS = {
"python", "python3", "pytest", "node", "npm", "pnpm",
"tsc", "eslint", "prettier", "ruff", "black", "go", "cargo",
"kubectl", "docker", "jq",
}
def main():
payload = json.loads(sys.stdin.read())
if payload["tool_name"] != "Bash":
sys.exit(0)
cmd = (payload["tool_input"].get("command") or "").strip()
if BLOCKLIST.search(cmd):
print(f"BLOCKED: command matches destructive pattern. command={cmd!r}", file=sys.stderr)
sys.exit(2)
first = cmd.split()[0] if cmd else ""
if first and first not in ALLOWLIST_BINS:
print(f"BLOCKED: binary {first!r} not on the code-exec allowlist.", file=sys.stderr)
sys.exit(2)
sys.exit(0)
if __name__ == "__main__":
main()Spawn the sandbox with kernel-level limits
Once the blocklist allows the command, the sandbox runs the actual code. Docker is the default; Firecracker for stronger isolation. The sandbox is fresh per invocation, runs read-only with a tmpfs scratch space, and enforces CPU / memory / time / network limits at the kernel level via cgroups.
import subprocess, time
def run_in_sandbox(command: str, timeout_s: int = 30) -> dict:
"""Run command in a Docker sandbox with kernel-level limits."""
start = time.monotonic()
try:
result = subprocess.run(
[
"docker", "run", "--rm",
"--cpus", "2",
"--memory", "1024m",
"--memory-swap", "1024m",
"--network", "none",
"--read-only",
"--tmpfs", "/tmp:size=128m",
"--ipc", "private",
"--pid", "private",
"code-exec:latest",
"bash", "-c", command,
],
capture_output=True, text=True,
timeout=timeout_s,
)
return {
"status": "ok" if result.returncode == 0 else "exit_nonzero",
"exit_code": result.returncode,
"stdout": result.stdout,
"stderr": result.stderr,
"duration_ms": int((time.monotonic() - start) * 1000),
}
except subprocess.TimeoutExpired:
return {
"status": "timeout",
"exit_code": 124,
"stdout": "",
"stderr": f"command exceeded {timeout_s}s timeout",
"duration_ms": timeout_s * 1000,
}Use kernel timeouts, not Python signal handlers
The canonical wrong answer to 'how do I time-out a script after 30 seconds?' is signal.signal(signal.SIGALRM, handler). Python signal handlers can be caught (try: ... except: pass), can be ignored, and do not fire inside C extensions. Use kernel-level timeouts: systemd-run --property=TimeoutStartSec=30s, Docker's intrinsic timeout. Kernel timeouts cannot be caught.
# Wrong: Python signal handler
# import signal
# def handler(signum, frame):
# raise TimeoutError("timed out")
# signal.signal(signal.SIGALRM, handler)
# signal.alarm(30)
# # User script can do try/except and swallow the SIGALRM.
# Right: kernel timeout via systemd-run
import subprocess
def run_with_kernel_timeout(command: str, timeout_s: int = 30) -> int:
result = subprocess.run(
[
"systemd-run", "--user", "--scope",
f"--property=TimeoutStartSec={timeout_s}s",
"--property=MemoryMax=1G",
"--property=CPUQuota=200%",
"bash", "-c", command,
],
)
# Exit code 124 means systemd killed it for timeout; user script cannot prevent.
return result.returncodePostToolUse output normalizer
Real shell output is messy. The PostToolUse hook normalizes everything into a stable contract before the agent sees it. Strip ANSI color codes, convert timestamps to ISO 8601 UTC, truncate stdout / stderr above 4 KB.
import sys, json, re, datetime
ANSI = re.compile(r"\x1b\[[0-9;]*m")
UNIX_TS = re.compile(r"\b1[6-9]\d{8}\b")
def normalize_stdout(text: str, max_bytes: int = 4096) -> str:
text = ANSI.sub("", text)
text = UNIX_TS.sub(
lambda m: datetime.datetime.utcfromtimestamp(int(m.group())).isoformat() + "Z",
text,
)
if len(text) > max_bytes:
head = text[: max_bytes // 2]
tail = text[-max_bytes // 2 :]
omitted_lines = text[max_bytes // 2 : -max_bytes // 2].count("\n")
text = f"{head}\n... truncated ({omitted_lines} more lines) ...\n{tail}"
return text
def main():
payload = json.loads(sys.stdin.read())
if payload["tool_name"] != "Bash":
print(json.dumps(payload))
sys.exit(0)
raw_result = payload.get("tool_result") or {}
normalized = {
"status": raw_result.get("status", "unknown"),
"exit_code": raw_result.get("exit_code", -1),
"stdout": normalize_stdout(raw_result.get("stdout", "")),
"stderr": normalize_stdout(raw_result.get("stderr", "")),
"duration_ms": raw_result.get("duration_ms"),
"peak_memory_mb": raw_result.get("peak_memory_mb"),
}
payload["tool_result"] = normalized
print(json.dumps(payload))
sys.exit(0)
if __name__ == "__main__":
main()Validate the result semantically
Schema validation guarantees shape; semantic validation guarantees meaning. After normalization, run a task-specific validator. For a test runner: passed + failed + skipped == total; counts are non-negative. Failed validation returns is_error: true with the specific check that failed.
import json
from typing import TypedDict
class ValidationResult(TypedDict):
valid: bool
errors: list[str]
def validate_pytest_output(result: dict) -> ValidationResult:
errors = []
try:
report = json.loads(result["stdout"])
except json.JSONDecodeError:
errors.append("pytest stdout is not valid JSON")
return {"valid": False, "errors": errors}
summary = report.get("summary", {})
passed = summary.get("passed", 0)
failed = summary.get("failed", 0)
skipped = summary.get("skipped", 0)
total = summary.get("total", 0)
if any(v < 0 for v in (passed, failed, skipped, total)):
errors.append(f"negative test counts in summary: {summary}")
if passed + failed + skipped != total:
errors.append(f"counts inconsistent: passed={passed} + failed={failed} + skipped={skipped} != total={total}")
if total == 0 and result.get("exit_code") == 0:
errors.append("pytest reported total=0 but exited 0; expected tests to run")
return {"valid": len(errors) == 0, "errors": errors}Test resource limits and timeouts adversarially
Ship the sandbox config; then break it on purpose. Run scripts that allocate 10 GB; verify the kernel kills them at 1 GB with exit code 137. Run busy loops; verify timeout at 30s with exit code 124. Run network-egress attempts; verify the deny rule fires.
def adversarial_oom_test() -> bool:
"""Allocate 10 GB. Sandbox should kill at 1 GB cap."""
code = "x = bytearray(10 * 1024 * 1024 * 1024)"
result = run_in_sandbox(f"python3 -c {code!r}")
if result["exit_code"] != 137:
print(f"FAIL: expected exit 137 (OOM kill), got {result['exit_code']}")
return False
print("PASS: OOM kill at 1 GB cap")
return True
def adversarial_timeout_test() -> bool:
"""Busy loop. Sandbox should kill at 30s cap."""
result = run_in_sandbox("python3 -c 'while True: pass'", timeout_s=30)
if result["status"] != "timeout":
print(f"FAIL: expected status=timeout, got {result['status']}")
return False
print("PASS: timeout kill at 30s")
return TrueAudit-log every code-exec invocation
Every code-exec invocation writes an append-only row to durable storage: timestamp, command, hook decisions, sandbox metrics (duration, peak memory, exit code), validation outcome, agent that requested it. Retain for at least 90 days.
import datetime, json
from pathlib import Path
AUDIT_DIR = Path("audit")
AUDIT_DIR.mkdir(exist_ok=True)
def audit_code_exec(command, pre_decision, sandbox_result, validation, agent_id):
today = datetime.date.today().isoformat()
row = {
"ts": datetime.datetime.utcnow().isoformat() + "Z",
"agent_id": agent_id,
"command": command[:200],
"pre_decision": pre_decision,
"sandbox": {
"status": sandbox_result.get("status"),
"exit_code": sandbox_result.get("exit_code"),
"duration_ms": sandbox_result.get("duration_ms"),
"peak_memory_mb": sandbox_result.get("peak_memory_mb"),
},
"validation": validation,
}
with open(AUDIT_DIR / f"{today}.jsonl", "a") as f:
f.write(json.dumps(row) + "\n")The four decisions
| Decision | Right answer | Wrong answer | Why |
|---|---|---|---|
| Reading the contents of `config.json` from agent code | Use the Read tool. file_path: 'config.json'. | Use Bash. command: 'cat config.json'. | Built-in tools are auditable, fast, and structurally distinct from execution. Bash conflates file I/O with execution and bloats the audit trail; the PreToolUse blocklist also has to reason about every cat/grep/find. |
| Preventing destructive Bash commands at runtime | PreToolUse hook on Bash with a compiled regex blocklist. Exit 2 on match. | System prompt instruction: 'never run destructive commands'. | Prompts are probabilistic and leak under prompt injection or unusual phrasing. Hooks are deterministic, run before the sandbox spawns, and emit a model-readable stderr message that becomes a tool_result is_error: true. |
| Stopping a runaway script after 30 seconds | Kernel timeout: systemd-run --property=TimeoutStartSec=30s, or Docker --timeout, or cgroup limit. | Python signal handler: signal.signal(signal.SIGALRM, handler). | Signal handlers can be caught, ignored, or never delivered (e.g. blocked inside a C extension). Kernel timeouts cannot be caught by the running code; the kernel sends SIGKILL and the process exits regardless. |
| Validating that a test-runner result is sane | Semantic validation: passed + failed + skipped equals total; counts are non-negative; total > 0 for a non-empty run. | Schema validation only: the JSON has the expected shape. | Schema guarantees shape; semantic guarantees meaning. A schema-valid output of {passed: -5, failed: 2, total: 0} is structurally fine but semantically nonsense. |
Where it breaks
Five failure pairs. Each one is one exam question. The fix is always architectural, deterministic gates, structured fields, pinned state.
Agent calls Bash with cat data.json, grep -r foo, find . -name *.py. Audit trail is opaque and the PreToolUse blocklist has to reason about every cat / grep / find.
Route file I/O to built-in tools: Read, Grep, Glob. Reserve Bash for actual command execution. The first layer of safety is tool selection.
A clever prompt-injection in alert text gets rm -rf /prod past the agent. The Bash command runs because no hook scanned it first.
PreToolUse hook with a compiled regex blocklist (rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh). Match exits 2 with stderr.
An agent's Python script allocates 10 GB and crashes the runner. Another runs an infinite loop and starves the queue.
AP-CODEEXEC-03Sandbox config with kernel-level limits: CPU 2, memory 1024 MB, timeout 30 s, network deny. Enforced via Docker cgroups, Firecracker jailer, or systemd-run scopes.
Bash output goes straight back: mixed Unix timestamps and ISO 8601, ANSI color codes, multiline stack traces. The agent parses inconsistently.
AP-CODEEXEC-04PostToolUse hook normalizes everything: strip ANSI, convert timestamps to ISO 8601 UTC, truncate stdout above 4 KB, emit a stable contract.
The result JSON has the expected shape but passed: -5, total: 0. Schema-valid; semantically nonsense. The agent acts on it.
Semantic validators registered per task: passed + failed + skipped equals total; counts are non-negative; total > 0 for a non-empty run.
Cost & latency
Skill body ~500 tokens system + parameters ~50 tokens + working tokens ~1500-3000 input + ~500 output.
Docker image cached in the runner. Warm spin-up is dominated by container start and cgroup setup.
Sandbox spin-up (500 ms) + actual code (1-5 s) + PostToolUse normalization (50 ms) + semantic validation (20 ms).
Most data-analysis or test-runner Skills are short and small. Resource caps prevent the long tail from dominating.
JSONL row ~2-5 KB per invocation. 10 K invocations per month ~30-50 MB. Negligible at object-storage prices.
Ship checklist
Two passes. Build-time gates verify the code; run-time gates verify the system in production.
Build-time
- File I/O routes to Read / Write / Edit. Bash reserved for actual execution↗ tool-calling
- PreToolUse hook on Bash with compiled regex blocklist. Allowlist of safe binaries↗ hooks
- Sandbox runtime (Docker or Firecracker) per invocation. Image cached for warm spin-up↗ subagents
- Kernel-level resource limits: CPU 2, memory 1024 MB, timeout 30 s, network deny↗ tool-calling
- PostToolUse hook normalizes raw output to JSON contract↗ structured-outputs
- Per-task semantic validators (test runner, data analysis, type checker)↗ evaluation
- Adversarial test suite: OOM kill, timeout kill, network deny each verified end-to-end
- Audit log: append-only JSONL with command, hook decisions, sandbox metrics, validation outcome↗ evaluation
- Retention 90+ days. Indexed by agent_id and timestamp
- Telemetry: per-invocation duration, peak_memory, hook deny rate, validation pass rate
Run-time
- Sandbox image built and cached on every runner; cold-start time documented
- PreToolUse blocklist regex tested against an adversarial 'destructive-attempt' eval set
- Kernel-level resource limits verified end-to-end (OOM kill, timeout kill, network deny)
- PostToolUse normalizer tested against 5 representative output shapes
- Per-task semantic validators registered for every Skill that emits code-exec results
- Audit log retention 90+ days; indexed; replay tool reconstructs any invocation in seconds
- Allowlist of safe binaries kept narrow; PR review on every additional binary
- Telemetry: per-invocation duration, peak memory, hook deny rate, validation pass rate
14 exam-pattern questions
Tap an answer to check it instantly, then expand the full breakdown for the rationale per option, the mental model under test, and the priority order across distractors.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
Tap your answer to check it.
9 more questions on this scenario in the practice pillar. Take a mock exam ↗
Frequently asked
What languages can a code-execution Skill run?
python:3.12-slim plus a few preinstalled libraries covers 90% of use cases.Can code access the network?
--network none denies all egress at the kernel level. Opt in for specific tasks by spawning the sandbox with --network bridge and specific iptables rules.What happens if code runs out of memory?
status: oom in the normalized result.Can code persist state across Skill invocations?
How do I validate output semantically?
Can a Skill call another Skill that does code execution?
What is the timeout for code execution?
sandbox_timeout_sec parameter in the Skill frontmatter. The kernel enforces it; the running code cannot extend or ignore it.