# Agent Skills with Code Execution

> A code-execution Skill with four layers of safety. Layer 1: route file I/O to built-in tools (Read, Write, Edit) and reserve Bash only for actual execution. Layer 2: a PreToolUse hook scans the proposed Bash command for destructive patterns and exits 2 on match. Layer 3: a Docker or Firecracker sandbox runs the code with kernel-level limits (CPU 2, memory 1GB, timeout 30s, network deny). Layer 4: a PostToolUse hook normalizes the raw output to JSON, then a semantic validator confirms the result shape matches the task. The most-tested distractor: Python signal-handler timeouts. The right answer is kernel-level via systemd-run or cgroups; signal handlers can be caught.

**Sub-marker:** P3.13
**Domains:** D2 · Tool Design + Integration, D3 · Agent Operations
**Exam weight:** 38% of CCA-F (D2 + D3)
**Build time:** 24 minutes
**Source:** 🟡 Beyond-guide scenario. OP-claimed (Reddit 1s34iyl). Architecture matches Anthropic public guidance.
**Canonical:** https://claudearchitectcertification.com/scenarios/agent-skills-with-code-execution
**Last reviewed:** 2026-05-04

## In plain English

Think of this as the way you let an agent run actual code (a Python data-analysis script, a one-off shell command, a compiled binary) without giving it the keys to your machine. The script runs inside a sandbox: a small isolated container with strict limits on CPU, memory, time, and network. A PreToolUse hook scans the proposed command BEFORE it runs and refuses anything destructive. After the script runs, a PostToolUse hook normalizes the messy raw output into a clean structured result. A semantic validator confirms the output makes sense given the task before the agent acts on it. The whole point is that code execution is too dangerous to be a free tool; it needs four layers of containment.
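The four layers described above compose into a single gate-then-run-then-check pipeline. A hypothetical sketch: the helper names (`blocklist`, `sandbox`, `normalize`, `validate`) stand in for the components built later in this scenario and are injected as parameters rather than being real APIs.

```python
from typing import Callable, Optional

def execute_code_request(
    command: str,
    blocklist: Callable[[str], Optional[str]],  # Layer 2: deny reason or None
    sandbox: Callable[[str], dict],             # Layer 3: kernel-limited run
    normalize: Callable[[dict], dict],          # Layer 4a: stable JSON contract
    validate: Callable[[dict], list],           # Layer 4b: semantic error list
) -> dict:
    # Layer 1 happens upstream: file I/O routes to Read/Write/Edit, so only
    # genuine execution requests reach this function at all.
    deny = blocklist(command)
    if deny is not None:
        return {"status": "denied", "is_error": True, "stderr": deny}
    result = normalize(sandbox(command))
    errors = validate(result)
    if errors:
        return {**result, "is_error": True, "validation_errors": errors}
    return result
```

The ordering is the point: the blocklist runs before any sandbox is spawned, and the validator runs before the agent ever sees the result.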

## Exam impact

Domain 2 (Tool Design + Integration, 18%) tests the Bash-vs-built-ins distinction, the PreToolUse blocklist contract, and the structured-output normalization shape. Domain 3 (Agent Operations, 20%) tests sandboxing strategy, kernel-level resource enforcement, and the four-layer containment model. The 'why does a Python signal-handler timeout fail to actually stop a runaway script?' question is the canonical exam distractor.

## The problem

### What the customer needs
- Run real Python or shell scripts as part of the agent's workflow, not just simulate them.
- Untrusted code stays contained. A misbehaving script does not destroy the host's filesystem or exhaust its memory.
- Predictable termination. A runaway loop or infinite recursion stops at exactly the configured time limit.
- Consistent output shape. The agent sees a predictable JSON contract regardless of which tool ran or how the script printed.

### Why naive approaches fail
- Use Bash for everything (including cat file.txt instead of Read). The audit trail is opaque, and file I/O is conflated with execution.
- No PreToolUse blocklist. A clever prompt-injection in the alert text gets rm -rf /prod to execute.
- No resource limits. A loop allocates 10 GB or runs forever; the sandbox runner is exhausted.
- Heterogeneous raw output passed to the agent. The agent parses inconsistently and routes wrong.
- Schema-only validation. The output matches {status: string} but status is 'banana'. The agent acts on nonsense.

### Definition of done
- File I/O routes to Read / Write / Edit. Bash is reserved for actual command execution.
- PreToolUse hook on Bash with a destructive blocklist (regex). Exit 2 on match.
- Sandbox runtime (Docker or Firecracker) with kernel-level limits: CPU 2, memory 1GB, timeout 30s, network deny.
- PostToolUse hook normalizes raw output to JSON: {status, stdout, stderr, duration_ms, peak_memory_mb}.
- Semantic validator confirms result shape matches the task type.
- Audit log: every code-exec invocation writes an append-only row.
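The normalized contract named in the bullets above can be pinned down as a type. A sketch: the `Literal` status values and the `exit_code` field are assumptions drawn from the build steps later in this scenario, not part of the stated contract.

```python
from typing import Literal, Optional, TypedDict

class CodeExecResult(TypedDict):
    # status values assumed from the sandbox runner in the build steps
    status: Literal["ok", "exit_nonzero", "timeout", "denied"]
    exit_code: int
    stdout: str
    stderr: str
    duration_ms: int
    peak_memory_mb: Optional[float]
```

Pinning the contract as a type lets the PostToolUse normalizer and the semantic validators agree on one shape instead of each re-deriving it.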

## Concepts in play

- 🟢 **Skills** (`skills`), Code-execution capability packaged as a Skill
- 🟢 **Tool calling** (`tool-calling`), Bash vs built-in tool selection
- 🟢 **Hooks** (`hooks`), PreToolUse blocklist and PostToolUse normalizer
- 🟢 **Evaluation** (`evaluation`), Semantic validation beyond schema
- 🟢 **Structured outputs** (`structured-outputs`), Normalized result contract
- 🟢 **Subagents** (`subagents`), Sandbox child session for isolated execution
- 🟢 **Context window** (`context-window`), Truncate huge stdout before feeding to the agent
- 🟢 **Agentic loops** (`agentic-loops`), stop_reason branching on tool_use and is_error

## Components

### Bash Tool with Destructive Blocklist, PreToolUse gate, regex-driven

Bash sits behind a PreToolUse hook with a compiled regex blocklist (rm -rf, sudo, drop database, kill -9, chmod 777, curl ... | sh). Match exits 2 with a model-readable stderr message; agent observes the deny as tool_result: is_error: true and re-plans. No prompt-injection bypass: the blocklist is in code, not in the prompt.

**Configuration:** matcher: 'Bash'. Blocklist regex compiled at hook-load time. Allowlist of safe binaries (kubectl, docker, journalctl, jq, ps, df, top). Exit 2 with stderr.
**Concept:** `hooks`

### Sandbox Runtime (Docker or Firecracker), fresh sandbox per invocation

Each code-exec invocation runs in a freshly spawned isolated environment. Docker for most cases; Firecracker for stronger isolation when running fully untrusted user code. The image is cached so spin-up stays fast (~500ms warm). The sandbox is destroyed after the run; no state leaks between invocations.

**Configuration:** Sandbox config: { image: code-exec:latest, cpus: 2, memory_mb: 1024, timeout_sec: 30, network: deny, ipc: private, pid: private }. Image is read-only with a small writable tmpfs scratch space.
**Concept:** `subagents`

### Resource Limit Enforcement, kernel-level via cgroups or systemd

CPU, memory, time, and network limits are enforced at the kernel level (cgroups for Docker, the jailer for Firecracker, systemd-run --property=RuntimeMaxSec=30s for a raw process spawn). Kernel limits cannot be caught or ignored. Python signal-handler-based timeouts are the canonical wrong answer: a busy loop or a try: pass swallows them.

**Configuration:** cgroup limits: cpu.max=2, memory.max=1G, network deny via iptables egress rule. Timeout via systemd-run --property=RuntimeMaxSec=30s. Exit code 137 (128 + SIGKILL) means OOM kill; the sandbox wrapper reports 124 for timeout.
**Concept:** `tool-calling`

### PostToolUse Output Normalizer, raw bytes to structured JSON

Real shell output is messy: mixed Unix timestamps and ISO 8601, mixed status code conventions, multiline stack traces, ANSI color codes. The PostToolUse hook normalizes everything into a stable contract: {status, stdout, stderr, duration_ms, peak_memory_mb, exit_code}. Timestamps converted to ISO 8601 UTC. Color codes stripped. Long stdout truncated.

**Configuration:** matcher: 'Bash'. Hook reads stdin: {tool_name, tool_input, tool_result, latency_ms, peak_memory_mb}. Returns normalized JSON. Truncates stdout > 4 KB. Strips ANSI codes. Always exits 0.
**Concept:** `structured-outputs`

### Semantic Result Validator, task-aware sanity check

Schema validation guarantees shape; semantic validation guarantees meaning. Given the task context, check that the normalized result is sensible: passed + failed + skipped equals total; counts are non-negative. Failed semantic validation routes the agent back with a specific error message; it does NOT propagate bad data.

**Configuration:** Per-task validators registered by Skill. Test-runner validator: passed + failed + skipped equals total. Data-analysis validator: row count > 0; required columns present.
**Concept:** `evaluation`

## Build steps

### 1. Route file I/O to built-in tools; reserve Bash for execution

The first layer of safety is tool selection. cat file.txt should be a Read call, not a Bash call. grep -r foo should be a Grep call. find . -name '*.py' should be a Glob call. Bash is reserved for what the built-ins cannot do: compile code, run tests, execute a Python data-analysis script. This single distinction shrinks the Bash blast-radius by ~80%.

**Python:**

```python
# Wrong: Bash for everything
# tool_use: Bash, command: "cat config.json"
# tool_use: Bash, command: "grep -r 'TODO' src/"

# Right: built-in tools for I/O; Bash only for execution
# tool_use: Read, file_path: "config.json"
# tool_use: Grep, pattern: "TODO", path: "src/"
# tool_use: Glob, pattern: "**/*.py"

# Bash legitimately for execution:
# tool_use: Bash, command: "pytest tests/ --json-report"
# tool_use: Bash, command: "python analyze.py --input data.csv"

import re
FILE_IO_VIA_BASH = re.compile(
    r"^\s*(cat|head|tail|less|more|grep|find|ls|wc|sort|uniq|cut|awk|sed)\s",
)

def warn_on_io_via_bash(tool_name: str, command: str) -> str | None:
    if tool_name != "Bash":
        return None
    if FILE_IO_VIA_BASH.match(command):
        first = command.strip().split()[0]
        return (
            f"Bash command starts with {first!r}. "
            f"For file I/O prefer Read / Grep / Glob; reserve Bash for execution."
        )
    return None
```

**TypeScript:**

```typescript
// Wrong: Bash for everything
// tool_use: Bash, command: "cat config.json"

// Right: built-in tools for I/O; Bash only for execution
// tool_use: Read, file_path: "config.json"
// tool_use: Grep, pattern: "TODO", path: "src/"

const FILE_IO_VIA_BASH = /^\s*(cat|head|tail|less|more|grep|find|ls|wc|sort|uniq|cut|awk|sed)\s/;

export function warnOnIoViaBash(toolName: string, command: string): string | null {
  if (toolName !== "Bash") return null;
  const m = FILE_IO_VIA_BASH.exec(command);
  if (!m) return null;
  const first = command.trim().split(/\s+/)[0];
  return (
    `Bash command starts with "${first}". ` +
    `For file I/O prefer Read / Grep / Glob; reserve Bash for execution.`
  );
}
```

Concept: `tool-calling`

### 2. Wire the PreToolUse blocklist hook on Bash

The destructive blocklist runs before the sandbox is even spawned. Compiled regex against rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh. Match exits 2; the agent sees the deny as a tool_result with is_error: true and re-plans.

**Python:**

```python
# .claude/hooks/codeexec_blocklist.py
import sys, json, re

BLOCKLIST = re.compile(
    r"\b("
    r"rm\s+-rf"
    r"|sudo\s+"
    r"|drop\s+(database|table)"
    r"|kill\s+-9"
    r"|chmod\s+777"
    r"|curl\s+[^|]+\|\s*sh"
    r")\b"
    # The redirect pattern lives outside the \b group: ">" is not a word
    # character, so a leading \b would prevent it matching after whitespace.
    r"|>\s*/(etc|usr|var)/",
    re.IGNORECASE,
)

ALLOWLIST_BINS = {
    "python", "python3", "pytest", "node", "npm", "pnpm",
    "tsc", "eslint", "prettier", "ruff", "black", "go", "cargo",
    "kubectl", "docker", "jq",
}

def main():
    payload = json.loads(sys.stdin.read())
    if payload["tool_name"] != "Bash":
        sys.exit(0)
    cmd = (payload["tool_input"].get("command") or "").strip()
    if BLOCKLIST.search(cmd):
        print(f"BLOCKED: command matches destructive pattern. command={cmd!r}", file=sys.stderr)
        sys.exit(2)
    first = cmd.split()[0] if cmd else ""
    if first and first not in ALLOWLIST_BINS:
        print(f"BLOCKED: binary {first!r} not on the code-exec allowlist.", file=sys.stderr)
        sys.exit(2)
    sys.exit(0)

if __name__ == "__main__":
    main()
```

**TypeScript:**

```typescript
// .claude/hooks/codeexec-blocklist.ts
import { readFileSync } from "node:fs";

const BLOCKLIST = new RegExp(
  String.raw`\b(` +
    String.raw`rm\s+-rf` +
    String.raw`|sudo\s+` +
    String.raw`|drop\s+(database|table)` +
    String.raw`|kill\s+-9` +
    String.raw`|chmod\s+777` +
    String.raw`|curl\s+[^|]+\|\s*sh` +
    String.raw`)\b` +
    // The redirect pattern lives outside the \b group: ">" is not a word
    // character, so a leading \b would prevent it matching after whitespace.
    String.raw`|>\s*/(etc|usr|var)/`,
  "i",
);

const ALLOWLIST_BINS = new Set([
  "python", "python3", "pytest", "node", "npm", "pnpm",
  "tsc", "eslint", "prettier", "ruff", "black", "go", "cargo",
  "kubectl", "docker", "jq",
]);

const payload = JSON.parse(readFileSync(0, "utf8"));
if (payload.tool_name !== "Bash") process.exit(0);
const cmd = String(payload.tool_input?.command ?? "").trim();
if (BLOCKLIST.test(cmd)) {
  process.stderr.write(`BLOCKED: command matches destructive pattern. command=${JSON.stringify(cmd)}\n`);
  process.exit(2);
}
const first = cmd.split(/\s+/)[0] ?? "";
if (first && !ALLOWLIST_BINS.has(first)) {
  process.stderr.write(`BLOCKED: binary ${JSON.stringify(first)} not on the code-exec allowlist.\n`);
  process.exit(2);
}
process.exit(0);
```

Concept: `hooks`

### 3. Spawn the sandbox with kernel-level limits

Once the blocklist allows the command, the sandbox runs the actual code. Docker is the default; Firecracker for stronger isolation. The sandbox is fresh per invocation, runs read-only with a tmpfs scratch space, and enforces CPU / memory / time / network limits at the kernel level via cgroups.

**Python:**

```python
import subprocess, time

def run_in_sandbox(command: str, timeout_s: int = 30) -> dict:
    """Run command in a Docker sandbox with kernel-level limits."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            [
                "docker", "run", "--rm",
                "--cpus", "2",
                "--memory", "1024m",
                "--memory-swap", "1024m",
                "--network", "none",
                "--read-only",
                "--tmpfs", "/tmp:size=128m",
                "--ipc", "private",
                "--pid", "private",
                "code-exec:latest",
                "bash", "-c", command,
            ],
            capture_output=True, text=True,
            timeout=timeout_s,
        )
        return {
            "status": "ok" if result.returncode == 0 else "exit_nonzero",
            "exit_code": result.returncode,
            "stdout": result.stdout,
            "stderr": result.stderr,
            "duration_ms": int((time.monotonic() - start) * 1000),
        }
    except subprocess.TimeoutExpired:
        # Caveat: the timeout kills the `docker run` client, not necessarily
        # the container; pair with --name plus `docker kill` to be certain.
        return {
            "status": "timeout",
            "exit_code": 124,
            "stdout": "",
            "stderr": f"command exceeded {timeout_s}s timeout",
            "duration_ms": timeout_s * 1000,
        }
```

**TypeScript:**

```typescript
import { spawn } from "node:child_process";

interface SandboxResult {
  status: "ok" | "exit_nonzero" | "timeout";
  exit_code: number;
  stdout: string;
  stderr: string;
  duration_ms: number;
}

export async function runInSandbox(command: string, timeoutSec = 30): Promise<SandboxResult> {
  const start = Date.now();
  return new Promise((resolve) => {
    const child = spawn(
      "docker",
      ["run", "--rm",
       "--cpus", "2",
       "--memory", "1024m",
       "--memory-swap", "1024m",
       "--network", "none",
       "--read-only",
       "--tmpfs", "/tmp:size=128m",
       "--ipc", "private",
       "--pid", "private",
       "code-exec:latest",
       "bash", "-c", command],
      { stdio: ["ignore", "pipe", "pipe"] },
    );
    let stdout = "";
    let stderr = "";
    child.stdout.on("data", (d) => { stdout += d.toString(); });
    child.stderr.on("data", (d) => { stderr += d.toString(); });
    const killer = setTimeout(() => {
      // Caveat: this kills the `docker run` client, not necessarily the
      // container; pair with --name plus `docker kill` to be certain.
      child.kill("SIGKILL");
      resolve({
        status: "timeout",
        exit_code: 124,
        stdout: "",
        stderr: `command exceeded ${timeoutSec}s timeout`,
        duration_ms: timeoutSec * 1000,
      });
    }, timeoutSec * 1000);
    child.on("close", (code) => {
      clearTimeout(killer);
      resolve({
        status: code === 0 ? "ok" : "exit_nonzero",
        exit_code: code ?? -1,
        stdout,
        stderr,
        duration_ms: Date.now() - start,
      });
    });
  });
}
```

Concept: `subagents`

### 4. Use kernel timeouts, not Python signal handlers

The canonical wrong answer to 'how do I time-out a script after 30 seconds?' is signal.signal(signal.SIGALRM, handler). Python signal handlers can be caught (try: ... except: pass), can be ignored, and are not delivered while a C extension holds the interpreter. Use kernel-level timeouts: systemd-run --property=RuntimeMaxSec=30s, cgroup limits, or an external supervisor that SIGKILLs the sandbox. A kernel-delivered SIGKILL cannot be caught.

**Python:**

```python
# Wrong: Python signal handler
# import signal
# def handler(signum, frame):
#     raise TimeoutError("timed out")
# signal.signal(signal.SIGALRM, handler)
# signal.alarm(30)
# # User script can do try/except and swallow the SIGALRM.

# Right: kernel timeout via systemd-run
import subprocess

def run_with_kernel_timeout(command: str, timeout_s: int = 30) -> int:
    result = subprocess.run(
        [
            "systemd-run", "--user", "--scope",
            f"--property=TimeoutStartSec={timeout_s}s",
            "--property=MemoryMax=1G",
            "--property=CPUQuota=200%",
            "bash", "-c", command,
        ],
    )
    # On timeout systemd SIGKILLs the scope; the user script cannot prevent it.
    # Python reports death-by-signal as a negative returncode (-9).
    return result.returncode
```

**TypeScript:**

```typescript
// Wrong: an in-process setTimeout watchdog. If the untrusted script runs in
// the same process, a tight loop blocks the event loop and the timer never fires.

// Right: kernel timeout via spawn with systemd-run
import { spawnSync } from "node:child_process";

export function runWithKernelTimeout(command: string, timeoutSec = 30): number {
  const result = spawnSync(
    "systemd-run",
    [
      "--user", "--scope",
      `--property=RuntimeMaxSec=${timeoutSec}s`,
      "--property=MemoryMax=1G",
      "--property=CPUQuota=200%",
      "bash", "-c", command,
    ],
    { stdio: "inherit" },
  );
  return result.status ?? -1;
}
```

Concept: `tool-calling`

### 5. PostToolUse output normalizer

Real shell output is messy. The PostToolUse hook normalizes everything into a stable contract before the agent sees it. Strip ANSI color codes, convert timestamps to ISO 8601 UTC, truncate stdout / stderr above 4 KB.

**Python:**

```python
import sys, json, re, datetime

ANSI = re.compile(r"\x1b\[[0-9;]*m")
UNIX_TS = re.compile(r"\b1[6-9]\d{8}\b")

def normalize_stdout(text: str, max_bytes: int = 4096) -> str:
    text = ANSI.sub("", text)
    text = UNIX_TS.sub(
        # Explicit UTC tz; datetime.utcfromtimestamp is deprecated since 3.12.
        lambda m: datetime.datetime.fromtimestamp(
            int(m.group()), tz=datetime.timezone.utc
        ).strftime("%Y-%m-%dT%H:%M:%SZ"),
        text,
    )
    if len(text) > max_bytes:
        head = text[: max_bytes // 2]
        tail = text[-max_bytes // 2 :]
        omitted_lines = text[max_bytes // 2 : -max_bytes // 2].count("\n")
        text = f"{head}\n... truncated ({omitted_lines} more lines) ...\n{tail}"
    return text

def main():
    payload = json.loads(sys.stdin.read())
    if payload["tool_name"] != "Bash":
        print(json.dumps(payload))
        sys.exit(0)
    raw_result = payload.get("tool_result") or {}
    normalized = {
        "status": raw_result.get("status", "unknown"),
        "exit_code": raw_result.get("exit_code", -1),
        "stdout": normalize_stdout(raw_result.get("stdout", "")),
        "stderr": normalize_stdout(raw_result.get("stderr", "")),
        "duration_ms": raw_result.get("duration_ms"),
        "peak_memory_mb": raw_result.get("peak_memory_mb"),
    }
    payload["tool_result"] = normalized
    print(json.dumps(payload))
    sys.exit(0)

if __name__ == "__main__":
    main()
```

**TypeScript:**

```typescript
import { readFileSync } from "node:fs";

const ANSI = /\x1b\[[0-9;]*m/g;
const UNIX_TS = /\b1[6-9]\d{8}\b/g;

function normalizeStdout(text: string, maxBytes = 4096): string {
  let out = text.replace(ANSI, "");
  out = out.replace(UNIX_TS, (m) => new Date(Number(m) * 1000).toISOString());
  if (out.length > maxBytes) {
    const head = out.slice(0, maxBytes / 2);
    const tail = out.slice(-maxBytes / 2);
    const omittedLines = out.slice(maxBytes / 2, -maxBytes / 2).split("\n").length - 1;
    out = `${head}\n... truncated (${omittedLines} more lines) ...\n${tail}`;
  }
  return out;
}

const payload = JSON.parse(readFileSync(0, "utf8"));
if (payload.tool_name !== "Bash") {
  process.stdout.write(JSON.stringify(payload));
  process.exit(0);
}
const raw = payload.tool_result ?? {};
const normalized = {
  status: raw.status ?? "unknown",
  exit_code: raw.exit_code ?? -1,
  stdout: normalizeStdout(String(raw.stdout ?? "")),
  stderr: normalizeStdout(String(raw.stderr ?? "")),
  duration_ms: raw.duration_ms,
  peak_memory_mb: raw.peak_memory_mb,
};
payload.tool_result = normalized;
process.stdout.write(JSON.stringify(payload));
process.exit(0);
```

Concept: `structured-outputs`

### 6. Validate the result semantically

Schema validation guarantees shape; semantic validation guarantees meaning. After normalization, run a task-specific validator. For a test runner: passed + failed + skipped equals total; counts are non-negative. Failed validation returns is_error: true with the specific check that failed.

**Python:**

```python
import json
from typing import TypedDict

class ValidationResult(TypedDict):
    valid: bool
    errors: list[str]

def validate_pytest_output(result: dict) -> ValidationResult:
    errors = []
    try:
        report = json.loads(result["stdout"])
    except json.JSONDecodeError:
        errors.append("pytest stdout is not valid JSON")
        return {"valid": False, "errors": errors}

    summary = report.get("summary", {})
    passed = summary.get("passed", 0)
    failed = summary.get("failed", 0)
    skipped = summary.get("skipped", 0)
    total = summary.get("total", 0)

    if any(v < 0 for v in (passed, failed, skipped, total)):
        errors.append(f"negative test counts in summary: {summary}")
    if passed + failed + skipped != total:
        errors.append(f"counts inconsistent: passed={passed} + failed={failed} + skipped={skipped} != total={total}")
    if total == 0 and result.get("exit_code") == 0:
        errors.append("pytest reported total=0 but exited 0; expected tests to run")
    return {"valid": len(errors) == 0, "errors": errors}
```

**TypeScript:**

```typescript
interface ValidationResult { valid: boolean; errors: string[]; }

export function validatePytestOutput(result: { stdout: string; exit_code: number }): ValidationResult {
  const errors: string[] = [];
  let report: { summary?: Record<string, number> };
  try {
    report = JSON.parse(result.stdout);
  } catch {
    return { valid: false, errors: ["pytest stdout is not valid JSON"] };
  }
  const summary = report.summary ?? {};
  const passed = summary.passed ?? 0;
  const failed = summary.failed ?? 0;
  const skipped = summary.skipped ?? 0;
  const total = summary.total ?? 0;
  if ([passed, failed, skipped, total].some((v) => v < 0)) {
    errors.push(`negative test counts in summary: ${JSON.stringify(summary)}`);
  }
  if (passed + failed + skipped !== total) {
    errors.push(`counts inconsistent: passed=${passed} + failed=${failed} + skipped=${skipped} != total=${total}`);
  }
  if (total === 0 && result.exit_code === 0) {
    errors.push("pytest reported total=0 but exited 0; expected tests to run");
  }
  return { valid: errors.length === 0, errors };
}
```
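The component description earlier says validators are "registered by Skill," which neither block above shows. A minimal registry sketch, assuming hypothetical helper names (`register_validator`, `validate_result`); the data-analysis validator here is an illustrative stand-in, and in practice the test-runner entry would wrap validate_pytest_output from above.

```python
from typing import Callable, TypedDict

class ValidationResult(TypedDict):
    valid: bool
    errors: list

VALIDATORS: dict = {}

def register_validator(task_type: str):
    """Decorator: a Skill registers its task-specific validator at load time."""
    def deco(fn: Callable[[dict], ValidationResult]):
        VALIDATORS[task_type] = fn
        return fn
    return deco

def validate_result(task_type: str, result: dict) -> ValidationResult:
    validator = VALIDATORS.get(task_type)
    if validator is None:
        # Fail closed: an unknown task type is a validation failure, not a pass.
        return {"valid": False, "errors": [f"no validator for {task_type!r}"]}
    return validator(result)

@register_validator("data-analysis")
def validate_data_analysis(result: dict) -> ValidationResult:
    errors = []
    if result.get("row_count", 0) <= 0:
        errors.append("row count must be > 0")
    return {"valid": not errors, "errors": errors}
```

Failing closed on unregistered task types matters: a Skill that forgets to register a validator should not silently skip layer four.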

Concept: `evaluation`

### 7. Test resource limits and timeouts adversarially

Ship the sandbox config; then break it on purpose. Run scripts that allocate 10 GB; verify the kernel kills them at 1 GB with exit code 137. Run busy loops; verify timeout at 30s with exit code 124. Run network-egress attempts; verify the deny rule fires.

**Python:**

```python
def adversarial_oom_test() -> bool:
    """Allocate 10 GB. Sandbox should kill at 1 GB cap."""
    code = "x = bytearray(10 * 1024 * 1024 * 1024)"
    result = run_in_sandbox(f"python3 -c {code!r}")
    if result["exit_code"] != 137:
        print(f"FAIL: expected exit 137 (OOM kill), got {result['exit_code']}")
        return False
    print("PASS: OOM kill at 1 GB cap")
    return True

def adversarial_timeout_test() -> bool:
    """Busy loop. Sandbox should kill at 30s cap."""
    result = run_in_sandbox("python3 -c 'while True: pass'", timeout_s=30)
    if result["status"] != "timeout":
        print(f"FAIL: expected status=timeout, got {result['status']}")
        return False
    print("PASS: timeout kill at 30s")
    return True
```

**TypeScript:**

```typescript
export async function adversarialOomTest(): Promise<boolean> {
  const result = await runInSandbox(`python3 -c "x = bytearray(10 * 1024 * 1024 * 1024)"`);
  if (result.exit_code !== 137) {
    console.log(`FAIL: expected exit 137, got ${result.exit_code}`);
    return false;
  }
  console.log("PASS: OOM kill at 1 GB cap");
  return true;
}

export async function adversarialTimeoutTest(): Promise<boolean> {
  const result = await runInSandbox(`python3 -c "while True: pass"`, 30);
  if (result.status !== "timeout") {
    console.log(`FAIL: expected status=timeout, got ${result.status}`);
    return false;
  }
  console.log("PASS: timeout kill at 30s");
  return true;
}
```
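The prose above also calls for a network-egress check, which neither block covers. A sketch in the same style, with the sandbox runner injected as a parameter so it can be pointed at run_in_sandbox from step 3; the probe URL is an arbitrary placeholder.

```python
from typing import Callable

def adversarial_network_test(run: Callable[[str], dict]) -> bool:
    """Attempt egress. With network=deny the fetch must fail (nonzero exit)."""
    probe = (
        "python3 -c \"import urllib.request; "
        "urllib.request.urlopen('http://example.com', timeout=5)\""
    )
    result = run(probe)
    if result["exit_code"] == 0:
        print("FAIL: network egress succeeded despite network deny")
        return False
    print("PASS: network egress denied")
    return True
```

Invoke as `adversarial_network_test(run_in_sandbox)` alongside the OOM and timeout tests.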

Concept: `evaluation`

### 8. Audit-log every code-exec invocation

Every code-exec invocation writes an append-only row to durable storage: timestamp, command, hook decisions, sandbox metrics (duration, peak memory, exit code), validation outcome, agent that requested it. Retain for at least 90 days.

**Python:**

```python
import datetime, json
from pathlib import Path

AUDIT_DIR = Path("audit")
AUDIT_DIR.mkdir(exist_ok=True)

def audit_code_exec(command, pre_decision, sandbox_result, validation, agent_id):
    today = datetime.date.today().isoformat()
    row = {
        "ts": datetime.datetime.utcnow().isoformat() + "Z",
        "agent_id": agent_id,
        "command": command[:200],
        "pre_decision": pre_decision,
        "sandbox": {
            "status": sandbox_result.get("status"),
            "exit_code": sandbox_result.get("exit_code"),
            "duration_ms": sandbox_result.get("duration_ms"),
            "peak_memory_mb": sandbox_result.get("peak_memory_mb"),
        },
        "validation": validation,
    }
    with open(AUDIT_DIR / f"{today}.jsonl", "a") as f:
        f.write(json.dumps(row) + "\n")
```

**TypeScript:**

```typescript
import { appendFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";

const AUDIT_DIR = "audit";
mkdirSync(AUDIT_DIR, { recursive: true });

export function auditCodeExec(
  command: string,
  preDecision: string,
  sandboxResult: { status?: string; exit_code?: number; duration_ms?: number; peak_memory_mb?: number },
  validation: ValidationResult,
  agentId: string,
) {
  const today = new Date().toISOString().slice(0, 10);
  const row = {
    ts: new Date().toISOString(),
    agent_id: agentId,
    command: command.slice(0, 200),
    pre_decision: preDecision,
    sandbox: {
      status: sandboxResult.status,
      exit_code: sandboxResult.exit_code,
      duration_ms: sandboxResult.duration_ms,
      peak_memory_mb: sandboxResult.peak_memory_mb,
    },
    validation,
  };
  appendFileSync(join(AUDIT_DIR, `${today}.jsonl`), JSON.stringify(row) + "\n");
}
```

Concept: `evaluation`

## Decision matrix

| Decision | Right answer | Wrong answer | Why |
|---|---|---|---|
| Reading the contents of config.json from agent code | Use the Read tool. file_path: 'config.json'. | Use Bash. command: 'cat config.json'. | Built-in tools are auditable, fast, and structurally distinct from execution. Bash conflates file I/O with execution and bloats the audit trail; the PreToolUse blocklist also has to reason about every cat/grep/find. |
| Preventing destructive Bash commands at runtime | PreToolUse hook on Bash with a compiled regex blocklist. Exit 2 on match. | System prompt instruction: 'never run destructive commands'. | Prompts are probabilistic and leak under prompt injection or unusual phrasing. Hooks are deterministic, run before the sandbox spawns, and emit a model-readable stderr message that becomes a tool_result is_error: true. |
| Stopping a runaway script after 30 seconds | Kernel timeout: systemd-run --property=RuntimeMaxSec=30s, a cgroup limit, or an external supervisor that SIGKILLs the sandbox. | Python signal handler: signal.signal(signal.SIGALRM, handler). | Signal handlers can be caught, ignored, or never delivered (e.g. blocked inside a C extension). Kernel timeouts cannot be caught by the running code; the kernel sends SIGKILL and the process exits regardless. |
| Validating that a test-runner result is sane | Semantic validation: passed + failed + skipped equals total; counts are non-negative; total > 0 for a non-empty run. | Schema validation only: the JSON has the expected shape. | Schema guarantees shape; semantic guarantees meaning. A schema-valid output of {passed: -5, failed: 2, total: 0} is structurally fine but semantically nonsense. |

## Failure modes

| Anti-pattern | Failure | Fix |
|---|---|---|
| AP-CODEEXEC-01 · Using Bash for file I/O | Agent calls Bash with cat data.json, grep -r foo, find . -name *.py. Audit trail is opaque and the PreToolUse blocklist has to reason about every cat / grep / find. | Route file I/O to built-in tools: Read, Grep, Glob. Reserve Bash for actual command execution. The first layer of safety is tool selection. |
| AP-CODEEXEC-02 · No PreToolUse blocklist on Bash | A clever prompt-injection in alert text gets rm -rf /prod past the agent. The Bash command runs because no hook scanned it first. | PreToolUse hook with a compiled regex blocklist (rm -rf, sudo, drop database, kill -9, chmod 777, curl \| sh). Match exits 2 with stderr. |
| AP-CODEEXEC-03 · No resource limits on code execution | An agent's Python script allocates 10 GB and crashes the runner. Another runs an infinite loop and starves the queue. | Sandbox config with kernel-level limits: CPU 2, memory 1024 MB, timeout 30 s, network deny. Enforced via Docker cgroups, Firecracker jailer, or systemd-run scopes. |
| AP-CODEEXEC-04 · Heterogeneous raw output to the agent | Bash output goes straight back: mixed Unix timestamps and ISO 8601, ANSI color codes, multiline stack traces. The agent parses inconsistently. | PostToolUse hook normalizes everything: strip ANSI, convert timestamps to ISO 8601 UTC, truncate stdout above 4 KB, emit a stable contract. |
| AP-CODEEXEC-05 · Schema-only validation | The result JSON has the expected shape but passed: -5, total: 0. Schema-valid; semantically nonsense. The agent acts on it. | Semantic validators registered per task: passed + failed + skipped equals total; counts are non-negative; total > 0 for a non-empty run. |

## Implementation checklist

- [ ] File I/O routes to Read / Write / Edit. Bash reserved for actual execution (`tool-calling`)
- [ ] PreToolUse hook on Bash with compiled regex blocklist. Allowlist of safe binaries (`hooks`)
- [ ] Sandbox runtime (Docker or Firecracker) per invocation. Image cached for warm spin-up (`subagents`)
- [ ] Kernel-level resource limits: CPU 2, memory 1024 MB, timeout 30 s, network deny (`tool-calling`)
- [ ] PostToolUse hook normalizes raw output to JSON contract (`structured-outputs`)
- [ ] Per-task semantic validators (test runner, data analysis, type checker) (`evaluation`)
- [ ] Adversarial test suite: OOM kill, timeout kill, network deny each verified end-to-end
- [ ] Audit log: append-only JSONL with command, hook decisions, sandbox metrics, validation outcome (`evaluation`)
- [ ] Retention 90+ days. Indexed by agent_id and timestamp
- [ ] Telemetry: per-invocation duration, peak_memory, hook deny rate, validation pass rate

## Cost & latency

- **Per-invocation Claude API:** ~$0.005 to $0.015. Skill body ~500 tokens system + parameters ~50 tokens + working tokens ~1,500-3,000 input + ~500 output.
- **Sandbox spin-up overhead:** ~500 ms warm; ~3 s cold (Docker image cached in the runner). Warm spin-up is dominated by container start and cgroup setup.
- **Code execution duration p95:** ~3 to 8 seconds end-to-end: sandbox spin-up (500 ms) + actual code (1-5 s) + PostToolUse normalization (50 ms) + semantic validation (20 ms).
- **Sandbox resource usage at the runner level:** ~50-200 MB memory and ~100-500 ms CPU per typical invocation. Most data-analysis or test-runner Skills are short and small; resource caps prevent the long tail from dominating.
- **Storage for audit log:** ~30-50 MB per month at 10 K invocations (JSONL rows of ~2-5 KB each). Negligible at object-storage prices.

## Domain weights

- **D2 · Tool Design + Integration (18%):** Bash vs built-ins. PreToolUse blocklist. PostToolUse normalizer. JSON contract.
- **D3 · Agent Operations (20%):** Sandbox config. Kernel-level limits. Adversarial verification. Audit-log integration.

## Practice questions

### Q1. A Skill executes Python code. The agent calls Bash with command: 'cat data.json'. What is the correct tool to use here and why?

Use the Read tool with file_path: 'data.json'. Reserve Bash for actual command execution (compiling, running tests, executing a Python data-analysis script). Built-in tools are auditable and structurally distinct from execution; the PreToolUse blocklist has a smaller reasoning surface when Bash usage is narrow. Tagged to AP-CODEEXEC-01.

### Q2. Your code-execution Skill runs untrusted Python. What prevents the code from running rm -rf / or other destructive commands?

A PreToolUse hook on Bash with a compiled regex blocklist (rm -rf, sudo, drop (database|table), kill -9, chmod 777, curl | sh). On match, the hook exits 2 with a model-readable stderr message; the SDK delivers that as a tool_result with is_error: true. The agent observes the deny and re-plans. The blocklist lives in code, not in the prompt; prompt-injection cannot bypass it. Tagged to AP-CODEEXEC-02.

### Q3. A Skill executes a data-analysis script. The script can run for 5 minutes (max) or 10 seconds (expected). How should you enforce the time limit?

Kernel-level timeout, not a Python signal handler. Use systemd-run --property=RuntimeMaxSec=30 for a hard runtime ceiling, or an equivalent hard deadline on the sandbox container; the cgroup limits cover CPU and memory, and the deadline ends in SIGKILL. SIGKILL cannot be caught: the kernel terminates the process regardless. Python signal.signal(signal.SIGALRM, ...) is the canonical wrong answer because the user script can try: ... except: pass it. Exit code 124 is the standard timeout signature. Tagged to AP-CODEEXEC-03.
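The distractor can be demonstrated directly: untrusted code can install a no-op handler for catchable signals (the moral equivalent of wrapping a SIGALRM timeout in try/except), but SIGKILL is delivered by the kernel and cannot be trapped. A self-contained sketch (POSIX only):

```python
import signal, subprocess, sys, textwrap, time

# Hostile guest code: traps SIGTERM with a no-op handler, then keeps running.
hostile = textwrap.dedent("""
    import signal, time
    signal.signal(signal.SIGTERM, lambda signum, frame: None)  # swallow TERM
    time.sleep(60)
""")

proc = subprocess.Popen([sys.executable, "-c", hostile])
time.sleep(1.0)                      # give the guest time to install its handler
proc.send_signal(signal.SIGTERM)     # catchable: the guest ignores it
try:
    proc.wait(timeout=1.0)
    survived_term = False            # would mean SIGTERM actually stopped it
except subprocess.TimeoutExpired:
    survived_term = True             # still running a second after SIGTERM

proc.kill()                          # SIGKILL: the kernel ends the process
proc.wait()
# On Linux: survived_term is True and proc.returncode == -9 (killed by signal 9).
```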

### Q4. A PostToolUse hook normalizes code-execution output. The raw output is heterogeneous: Unix timestamp, ISO 8601 date, status code, ANSI color codes, multi-line stack trace. How should the hook normalize this?

Emit a stable JSON contract: {status, exit_code, stdout, stderr, duration_ms, peak_memory_mb}. Strip ANSI color codes via regex. Convert Unix timestamps to ISO 8601 UTC. Truncate stdout above 4 KB with a ... truncated (N more lines) ... marker. Map exit_code to status (0 -> ok, 137 -> oom, 124 -> timeout). Always exit the hook with code 0; this hook is for shape, not for denial. Tagged to AP-CODEEXEC-04.
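A minimal sketch of such a normalizer, following the contract above; the helper names are illustrative:

```python
import re
from datetime import datetime, timezone

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")                 # strip color codes
STATUS_BY_EXIT = {0: "ok", 124: "timeout", 137: "oom"}  # exit-code mapping
MAX_STDOUT = 4096                                       # truncate above 4 KB

def to_iso8601(unix_ts: float) -> str:
    """Convert a Unix timestamp to ISO 8601 UTC."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).isoformat()

def normalize(raw_stdout: str, raw_stderr: str, exit_code: int,
              duration_ms: int, peak_memory_mb: float) -> dict:
    stdout = ANSI_RE.sub("", raw_stdout)
    if len(stdout) > MAX_STDOUT:
        dropped = stdout[MAX_STDOUT:].count("\n")
        stdout = stdout[:MAX_STDOUT] + f"\n... truncated ({dropped} more lines) ..."
    return {
        "status": STATUS_BY_EXIT.get(exit_code, "error"),
        "exit_code": exit_code,
        "stdout": stdout,
        "stderr": ANSI_RE.sub("", raw_stderr),
        "duration_ms": duration_ms,
        "peak_memory_mb": peak_memory_mb,
    }
```

The hook wrapping this function always exits 0; an unexpected exit code is still reported, just with status error.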

### Q5. A Skill validates the result semantically. It runs a test suite and gets back: {passed: 5, failed: 1, skipped: 0, total: 6}. What semantic check should validate this result?

Three semantic checks beyond schema. (1) passed + failed + skipped equals total (here 5 + 1 + 0 = 6: passes). (2) All counts are non-negative. (3) total > 0 for a non-empty run. If any check fails, return is_error: true with the specific failure. Schema validation alone catches malformed JSON; semantic validation catches schema-valid nonsense like {passed: -5, total: 0}. Tagged to AP-CODEEXEC-05.
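The three checks can be sketched as a validator function (the name is illustrative):

```python
def validate_test_result(result: dict) -> tuple[bool, str]:
    """Semantic checks beyond schema for a test-runner result."""
    passed, failed, skipped, total = (
        result.get(k, -1) for k in ("passed", "failed", "skipped", "total"))
    if any(n < 0 for n in (passed, failed, skipped, total)):
        return False, "counts must be non-negative"
    if total <= 0:
        return False, "total must be > 0 for a non-empty run"
    if passed + failed + skipped != total:
        return False, f"{passed} + {failed} + {skipped} != total {total}"
    return True, ""
```

On failure the harness surfaces the message as a tool_result with is_error: true so the agent can re-plan rather than act on nonsense.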

## FAQ

### Q1. What languages can a code-execution Skill run?

Anything in the sandbox base image. Common: Python, Node.js, Go, Rust, shell. The image determines available runtimes. For most teams, python:3.12-slim plus a few preinstalled libraries covers 90% of use cases.

### Q2. Can code access the network?

Not by default. --network none denies all egress at the kernel level. Opt in per task by spawning the sandbox with --network bridge plus targeted iptables rules.

### Q3. What happens if code runs out of memory?

The kernel sends SIGKILL when the cgroup memory limit is hit; the process exits with code 137. The harness detects 137 and emits status: oom in the normalized result.

### Q4. Can code persist state across Skill invocations?

Not by default. Each invocation gets a fresh sandbox with no state from prior runs. Opt-in persistence via a mounted volume on the runner.

### Q5. How do I validate output semantically?

Per-task validators registered by Skill key. Test-runner validator: passed + failed + skipped equals total. Data-analysis validator: row count, required columns, aggregate sanity.

### Q6. Can a Skill call another Skill that does code execution?

Yes. Skill-to-Skill calls are tool calls. Each Skill invocation gets its own sandbox; nesting does not share resources.

### Q7. What is the timeout for code execution?

30 seconds by default. Configurable per Skill via the sandbox_timeout_sec parameter in the Skill frontmatter. The kernel enforces it; the running code cannot extend or ignore it.

## Production readiness

- [ ] Sandbox image built and cached on every runner; cold-start time documented
- [ ] PreToolUse blocklist regex tested against an adversarial 'destructive-attempt' eval set
- [ ] Kernel-level resource limits verified end-to-end (OOM kill, timeout kill, network deny)
- [ ] PostToolUse normalizer tested against 5 representative output shapes
- [ ] Per-task semantic validators registered for every Skill that emits code-exec results
- [ ] Audit log retention 90+ days; indexed; replay tool reconstructs any invocation in seconds
- [ ] Allowlist of safe binaries kept narrow; PR review on every additional binary
- [ ] Telemetry: per-invocation duration, peak memory, hook deny rate, validation pass rate

---

**Source:** https://claudearchitectcertification.com/scenarios/agent-skills-with-code-execution
**Vault sources:** ACP-T05 Scenario 13 (yellow beyond-guide; OP-claimed Reddit 1s34iyl); ACP-T07 Lab 13 spec (Skills with code execution); ACP-T08 section 3.13 (sandbox strategy, resource caps, normalization); Course 01 Claude 101 lesson 7 (working with Skills); GAI-K05 CCA exam questions and scenarios; ACP-T06 (5 practice Qs tagged to components)
**Last reviewed:** 2026-05-04

**Evidence tiers:** 🟢 official Anthropic doc · 🟡 partial doc / inferred · 🟠 community-derived · 🔴 disputed.
