Agent Skills with Code Execution | Claude Architect Certification Prep

Q: Can code access the network?

No by default. --network none denies all egress at the kernel level. Opt in for specific tasks by spawning the sandbox with --network bridge and specific iptables rules.

Q: What is the timeout for code execution?

30 seconds by default. Configurable per Skill via the sandbox_timeout_sec parameter in the Skill frontmatter. The kernel enforces it; the running code cannot extend or ignore it.

01 · Problem framing

The problem

What the customer needs

Run real Python or shell scripts as part of the agent's workflow, not just simulate them.
Untrusted code stays contained. A misbehaving script does not destroy the host's filesystem or exhaust its memory.
Predictable termination. A runaway loop or infinite recursion stops at exactly the configured time limit.
Consistent output shape. The agent sees a predictable JSON contract regardless of which tool ran or how the script printed.

Why naive approaches fail

Use Bash for everything (including cat file.txt instead of Read). Audit trail is opaque, file I/O and execution conflate.
No PreToolUse blocklist. A clever prompt-injection in the alert text gets rm -rf /prod to execute.
No resource limits. A loop allocates 10 GB or runs forever; the sandbox runner is exhausted.
Heterogeneous raw output passed to the agent. The agent parses inconsistently and routes wrong.
Schema-only validation. The output matches {status: string} but status is 'banana'. The agent acts on nonsense.

Definition of done

File I/O routes to Read / Write / Edit. Bash is reserved for actual command execution.
PreToolUse hook on Bash with a destructive blocklist (regex). Exit 2 on match.
Sandbox runtime (Docker or Firecracker) with kernel-level limits: CPU 2, memory 1GB, timeout 30s, network deny.
PostToolUse hook normalizes raw output to JSON: {status, stdout, stderr, duration_ms, peak_memory_mb}.
Semantic validator confirms result shape matches the task type.
Audit log: every code-exec invocation writes an append-only row.

02 · Architecture

The system

03 · Component detail

What each part does

5 components, each owns a concept. Click any card to drill into the underlying primitive.

Bash Tool with Destructive Blocklist

PreToolUse gate, regex-driven

Bash sits behind a PreToolUse hook with a compiled regex blocklist (rm -rf, sudo, drop database, kill -9, chmod 777, curl ... | sh). Match exits 2 with a model-readable stderr message; agent observes the deny as tool_result: is_error: true and re-plans. No prompt-injection bypass: the blocklist is in code, not in the prompt.

Configuration

matcher: 'Bash'. Blocklist regex compiled at hook-load time. Allowlist of safe binaries (kubectl, docker, journalctl, jq, ps, df, top). Exit 2 with stderr.

Concept: hooks ↗

Sandbox Runtime (Docker or Firecracker)

fresh sandbox per invocation

Each code-exec invocation runs in a freshly spawned isolated environment. Docker for most cases; Firecracker for stronger isolation when running fully untrusted user code. The image is cached so spin-up stays fast (~500ms warm). The sandbox is destroyed after the run; no state leaks between invocations.

Configuration

Sandbox config: { image: code-exec:latest, cpus: 2, memory_mb: 1024, timeout_sec: 30, network: deny, ipc: private, pid: private }. Image is read-only with a small writable tmpfs scratch space.

Concept: subagents ↗

Resource Limit Enforcement

kernel-level via cgroups or systemd

CPU, memory, time, and network limits are enforced at the kernel level (cgroups for Docker, jailer for Firecracker, systemd-run --property=TimeoutStartSec=30s for raw process spawn). Kernel limits cannot be caught or ignored. Python signal-handler-based timeouts are the canonical wrong answer: a busy-loop or a try: pass swallows them.

Configuration

cgroup limits: cpu.max=2, memory.max=1G, network deny via iptables egress rule. Timeout via systemd-run --property=TimeoutStartSec=30s. Process exit code 137 (SIGKILL) means OOM; 124 means timeout.

Concept: tool-calling ↗

PostToolUse Output Normalizer

raw bytes to structured JSON

Real shell output is messy: mixed Unix timestamps and ISO 8601, mixed status code conventions, multiline stack traces, ANSI color codes. The PostToolUse hook normalizes everything into a stable contract: {status, stdout, stderr, duration_ms, peak_memory_mb, exit_code}. Timestamps converted to ISO 8601 UTC. Color codes stripped. Long stdout truncated.

Configuration

matcher: 'Bash'. Hook reads stdin: {tool_name, tool_input, tool_result, latency_ms, peak_memory_mb}. Returns normalized JSON. Truncates stdout > 4 KB. Strips ANSI codes. Always exits 0.

Concept: structured-outputs ↗

Semantic Result Validator

task-aware sanity check

Schema validation guarantees shape; semantic validation guarantees meaning. Given the task context, check that the normalized result is sensible: passed + failed + skipped equals total; counts are non-negative. Failed semantic validation routes the agent back with a specific error message; it does NOT propagate bad data.

Configuration

Per-task validators registered by Skill. Test-runner validator: passed + failed + skipped == total. Data-analysis validator: row count > 0; required columns present.

Concept: evaluation ↗

04 · One concrete run

Data flow

05 · Build it

Eight steps to production

Route file I/O to built-in tools; reserve Bash for execution

The first layer of safety is tool selection. cat file.txt should be a Read call, not a Bash call. grep -r foo should be a Grep call. find . -name '*.py' should be a Glob call. Bash is reserved for what the built-ins cannot do: compile code, run tests, execute a Python data-analysis script. This single distinction shrinks the Bash blast-radius by ~80%.

Route file I/O to built-in tools; reserve Bash for execution

# Wrong: Bash for everything
# tool_use: Bash, command: "cat config.json"
# tool_use: Bash, command: "grep -r 'TODO' src/"

# Right: built-in tools for I/O; Bash only for execution
# tool_use: Read, file_path: "config.json"
# tool_use: Grep, pattern: "TODO", path: "src/"
# tool_use: Glob, pattern: "**/*.py"

# Bash legitimately for execution:
# tool_use: Bash, command: "pytest tests/ --json-report"
# tool_use: Bash, command: "python analyze.py --input data.csv"

import re
FILE_IO_VIA_BASH = re.compile(
    r"^\s*(cat|head|tail|less|more|grep|find|ls|wc|sort|uniq|cut|awk|sed)\s",
)

def warn_on_io_via_bash(tool_name: str, command: str) -> str | None:
    if tool_name != "Bash":
        return None
    if FILE_IO_VIA_BASH.match(command):
        first = command.strip().split()[0]
        return (
            f"Bash command starts with {first!r}. "
            f"For file I/O prefer Read / Grep / Glob; reserve Bash for execution."
        )
    return None

↪ Concept: tool-calling

Wire the PreToolUse blocklist hook on Bash

The destructive blocklist runs before the sandbox is even spawned. Compiled regex against rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh. Match exits 2; the agent sees the deny as a tool_result with is_error: true and re-plans.

Wire the PreToolUse blocklist hook on Bash

# .claude/hooks/codeexec_blocklist.py
import sys, json, re

BLOCKLIST = re.compile(
    r"\b("
    r"rm\s+-rf"
    r"|sudo\s+"
    r"|drop\s+(database|table)"
    r"|kill\s+-9"
    r"|chmod\s+777"
    r"|>\s*/(etc|usr|var)/"
    r"|curl\s+[^|]+\|\s*sh"
    r")\b",
    re.IGNORECASE,
)

ALLOWLIST_BINS = {
    "python", "python3", "pytest", "node", "npm", "pnpm",
    "tsc", "eslint", "prettier", "ruff", "black", "go", "cargo",
    "kubectl", "docker", "jq",
}

def main():
    payload = json.loads(sys.stdin.read())
    if payload["tool_name"] != "Bash":
        sys.exit(0)
    cmd = (payload["tool_input"].get("command") or "").strip()
    if BLOCKLIST.search(cmd):
        print(f"BLOCKED: command matches destructive pattern. command={cmd!r}", file=sys.stderr)
        sys.exit(2)
    first = cmd.split()[0] if cmd else ""
    if first and first not in ALLOWLIST_BINS:
        print(f"BLOCKED: binary {first!r} not on the code-exec allowlist.", file=sys.stderr)
        sys.exit(2)
    sys.exit(0)

if __name__ == "__main__":
    main()

↪ Concept: hooks

Spawn the sandbox with kernel-level limits

Once the blocklist allows the command, the sandbox runs the actual code. Docker is the default; Firecracker for stronger isolation. The sandbox is fresh per invocation, runs read-only with a tmpfs scratch space, and enforces CPU / memory / time / network limits at the kernel level via cgroups.

Spawn the sandbox with kernel-level limits

import subprocess, time

def run_in_sandbox(command: str, timeout_s: int = 30) -> dict:
    """Run command in a Docker sandbox with kernel-level limits."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            [
                "docker", "run", "--rm",
                "--cpus", "2",
                "--memory", "1024m",
                "--memory-swap", "1024m",
                "--network", "none",
                "--read-only",
                "--tmpfs", "/tmp:size=128m",
                "--ipc", "private",
                "--pid", "private",
                "code-exec:latest",
                "bash", "-c", command,
            ],
            capture_output=True, text=True,
            timeout=timeout_s,
        )
        return {
            "status": "ok" if result.returncode == 0 else "exit_nonzero",
            "exit_code": result.returncode,
            "stdout": result.stdout,
            "stderr": result.stderr,
            "duration_ms": int((time.monotonic() - start) * 1000),
        }
    except subprocess.TimeoutExpired:
        return {
            "status": "timeout",
            "exit_code": 124,
            "stdout": "",
            "stderr": f"command exceeded {timeout_s}s timeout",
            "duration_ms": timeout_s * 1000,
        }

↪ Concept: subagents

Use kernel timeouts, not Python signal handlers

The canonical wrong answer to 'how do I time-out a script after 30 seconds?' is signal.signal(signal.SIGALRM, handler). Python signal handlers can be caught (try: ... except: pass), can be ignored, and do not fire inside C extensions. Use kernel-level timeouts: systemd-run --property=TimeoutStartSec=30s, Docker's intrinsic timeout. Kernel timeouts cannot be caught.

Use kernel timeouts, not Python signal handlers

# Wrong: Python signal handler
# import signal
# def handler(signum, frame):
#     raise TimeoutError("timed out")
# signal.signal(signal.SIGALRM, handler)
# signal.alarm(30)
# # User script can do try/except and swallow the SIGALRM.

# Right: kernel timeout via systemd-run
import subprocess

def run_with_kernel_timeout(command: str, timeout_s: int = 30) -> int:
    result = subprocess.run(
        [
            "systemd-run", "--user", "--scope",
            f"--property=TimeoutStartSec={timeout_s}s",
            "--property=MemoryMax=1G",
            "--property=CPUQuota=200%",
            "bash", "-c", command,
        ],
    )
    # Exit code 124 means systemd killed it for timeout; user script cannot prevent.
    return result.returncode

↪ Concept: tool-calling

PostToolUse output normalizer

Real shell output is messy. The PostToolUse hook normalizes everything into a stable contract before the agent sees it. Strip ANSI color codes, convert timestamps to ISO 8601 UTC, truncate stdout / stderr above 4 KB.

PostToolUse output normalizer

import sys, json, re, datetime

ANSI = re.compile(r"\x1b\[[0-9;]*m")
UNIX_TS = re.compile(r"\b1[6-9]\d{8}\b")

def normalize_stdout(text: str, max_bytes: int = 4096) -> str:
    text = ANSI.sub("", text)
    text = UNIX_TS.sub(
        lambda m: datetime.datetime.utcfromtimestamp(int(m.group())).isoformat() + "Z",
        text,
    )
    if len(text) > max_bytes:
        head = text[: max_bytes // 2]
        tail = text[-max_bytes // 2 :]
        omitted_lines = text[max_bytes // 2 : -max_bytes // 2].count("\n")
        text = f"{head}\n... truncated ({omitted_lines} more lines) ...\n{tail}"
    return text

def main():
    payload = json.loads(sys.stdin.read())
    if payload["tool_name"] != "Bash":
        print(json.dumps(payload))
        sys.exit(0)
    raw_result = payload.get("tool_result") or {}
    normalized = {
        "status": raw_result.get("status", "unknown"),
        "exit_code": raw_result.get("exit_code", -1),
        "stdout": normalize_stdout(raw_result.get("stdout", "")),
        "stderr": normalize_stdout(raw_result.get("stderr", "")),
        "duration_ms": raw_result.get("duration_ms"),
        "peak_memory_mb": raw_result.get("peak_memory_mb"),
    }
    payload["tool_result"] = normalized
    print(json.dumps(payload))
    sys.exit(0)

if __name__ == "__main__":
    main()

↪ Concept: structured-outputs

Validate the result semantically

Schema validation guarantees shape; semantic validation guarantees meaning. After normalization, run a task-specific validator. For a test runner: passed + failed + skipped == total; counts are non-negative. Failed validation returns is_error: true with the specific check that failed.

Validate the result semantically

import json
from typing import TypedDict

class ValidationResult(TypedDict):
    valid: bool
    errors: list[str]

def validate_pytest_output(result: dict) -> ValidationResult:
    errors = []
    try:
        report = json.loads(result["stdout"])
    except json.JSONDecodeError:
        errors.append("pytest stdout is not valid JSON")
        return {"valid": False, "errors": errors}

    summary = report.get("summary", {})
    passed = summary.get("passed", 0)
    failed = summary.get("failed", 0)
    skipped = summary.get("skipped", 0)
    total = summary.get("total", 0)

    if any(v < 0 for v in (passed, failed, skipped, total)):
        errors.append(f"negative test counts in summary: {summary}")
    if passed + failed + skipped != total:
        errors.append(f"counts inconsistent: passed={passed} + failed={failed} + skipped={skipped} != total={total}")
    if total == 0 and result.get("exit_code") == 0:
        errors.append("pytest reported total=0 but exited 0; expected tests to run")
    return {"valid": len(errors) == 0, "errors": errors}

↪ Concept: evaluation

Test resource limits and timeouts adversarially

Ship the sandbox config; then break it on purpose. Run scripts that allocate 10 GB; verify the kernel kills them at 1 GB with exit code 137. Run busy loops; verify timeout at 30s with exit code 124. Run network-egress attempts; verify the deny rule fires.

Test resource limits and timeouts adversarially

def adversarial_oom_test() -> bool:
    """Allocate 10 GB. Sandbox should kill at 1 GB cap."""
    code = "x = bytearray(10 * 1024 * 1024 * 1024)"
    result = run_in_sandbox(f"python3 -c {code!r}")
    if result["exit_code"] != 137:
        print(f"FAIL: expected exit 137 (OOM kill), got {result['exit_code']}")
        return False
    print("PASS: OOM kill at 1 GB cap")
    return True

def adversarial_timeout_test() -> bool:
    """Busy loop. Sandbox should kill at 30s cap."""
    result = run_in_sandbox("python3 -c 'while True: pass'", timeout_s=30)
    if result["status"] != "timeout":
        print(f"FAIL: expected status=timeout, got {result['status']}")
        return False
    print("PASS: timeout kill at 30s")
    return True

↪ Concept: evaluation

Audit-log every code-exec invocation

Every code-exec invocation writes an append-only row to durable storage: timestamp, command, hook decisions, sandbox metrics (duration, peak memory, exit code), validation outcome, agent that requested it. Retain for at least 90 days.

Audit-log every code-exec invocation

import datetime, json
from pathlib import Path

AUDIT_DIR = Path("audit")
AUDIT_DIR.mkdir(exist_ok=True)

def audit_code_exec(command, pre_decision, sandbox_result, validation, agent_id):
    today = datetime.date.today().isoformat()
    row = {
        "ts": datetime.datetime.utcnow().isoformat() + "Z",
        "agent_id": agent_id,
        "command": command[:200],
        "pre_decision": pre_decision,
        "sandbox": {
            "status": sandbox_result.get("status"),
            "exit_code": sandbox_result.get("exit_code"),
            "duration_ms": sandbox_result.get("duration_ms"),
            "peak_memory_mb": sandbox_result.get("peak_memory_mb"),
        },
        "validation": validation,
    }
    with open(AUDIT_DIR / f"{today}.jsonl", "a") as f:
        f.write(json.dumps(row) + "\n")

↪ Concept: evaluation

06 · Configuration decisions

The four decisions

Decision	Right answer	Wrong answer	Why
Reading the contents of `config.json` from agent code	Use the Read tool. file_path: 'config.json'.	Use Bash. command: 'cat config.json'.	Built-in tools are auditable, fast, and structurally distinct from execution. Bash conflates file I/O with execution and bloats the audit trail; the PreToolUse blocklist also has to reason about every cat/grep/find.
Preventing destructive Bash commands at runtime	PreToolUse hook on Bash with a compiled regex blocklist. Exit 2 on match.	System prompt instruction: 'never run destructive commands'.	Prompts are probabilistic and leak under prompt injection or unusual phrasing. Hooks are deterministic, run before the sandbox spawns, and emit a model-readable stderr message that becomes a tool_result is_error: true.
Stopping a runaway script after 30 seconds	Kernel timeout: systemd-run --property=TimeoutStartSec=30s, or Docker --timeout, or cgroup limit.	Python signal handler: signal.signal(signal.SIGALRM, handler).	Signal handlers can be caught, ignored, or never delivered (e.g. blocked inside a C extension). Kernel timeouts cannot be caught by the running code; the kernel sends SIGKILL and the process exits regardless.
Validating that a test-runner result is sane	Semantic validation: passed + failed + skipped equals total; counts are non-negative; total > 0 for a non-empty run.	Schema validation only: the JSON has the expected shape.	Schema guarantees shape; semantic guarantees meaning. A schema-valid output of {passed: -5, failed: 2, total: 0} is structurally fine but semantically nonsense.

07 · Failure modes

Where it breaks

Five failure pairs. Each one is one exam question. The fix is always architectural, deterministic gates, structured fields, pinned state.

❌ Using Bash for file I/O

Agent calls Bash with cat data.json, grep -r foo, find . -name *.py. Audit trail is opaque and the PreToolUse blocklist has to reason about every cat / grep / find.

AP-CODEEXEC-01

✅ Fix

Route file I/O to built-in tools: Read, Grep, Glob. Reserve Bash for actual command execution. The first layer of safety is tool selection.

❌ No PreToolUse blocklist on Bash

A clever prompt-injection in alert text gets rm -rf /prod past the agent. The Bash command runs because no hook scanned it first.

AP-CODEEXEC-02

✅ Fix

PreToolUse hook with a compiled regex blocklist (rm -rf, sudo, drop database, kill -9, chmod 777, curl | sh). Match exits 2 with stderr.

❌ No resource limits on code execution

An agent's Python script allocates 10 GB and crashes the runner. Another runs an infinite loop and starves the queue.

AP-CODEEXEC-03

✅ Fix

Sandbox config with kernel-level limits: CPU 2, memory 1024 MB, timeout 30 s, network deny. Enforced via Docker cgroups, Firecracker jailer, or systemd-run scopes.

❌ Heterogeneous raw output to the agent

Bash output goes straight back: mixed Unix timestamps and ISO 8601, ANSI color codes, multiline stack traces. The agent parses inconsistently.

AP-CODEEXEC-04

✅ Fix

PostToolUse hook normalizes everything: strip ANSI, convert timestamps to ISO 8601 UTC, truncate stdout above 4 KB, emit a stable contract.

❌ Schema-only validation

The result JSON has the expected shape but passed: -5, total: 0. Schema-valid; semantically nonsense. The agent acts on it.

AP-CODEEXEC-05

✅ Fix

Semantic validators registered per task: passed + failed + skipped equals total; counts are non-negative; total > 0 for a non-empty run.

08 · Budget

Cost & latency

Per-invocation Claude API

~$0.005 to $0.015

Skill body ~500 tokens system + parameters ~50 tokens + working tokens ~1500-3000 input + ~500 output.

Sandbox spin-up overhead

~500 ms warm; ~3 s cold

Docker image cached in the runner. Warm spin-up is dominated by container start and cgroup setup.

Code execution duration p95

~3 to 8 seconds end-to-end

Sandbox spin-up (500 ms) + actual code (1-5 s) + PostToolUse normalization (50 ms) + semantic validation (20 ms).

Sandbox resource usage at the runner level

~50-200 MB memory, ~100-500 ms CPU per typical invocation

Most data-analysis or test-runner Skills are short and small. Resource caps prevent the long tail from dominating.

Storage for audit log

~1 GB per month at 10 K invocations

JSONL row ~2-5 KB per invocation. 10 K invocations per month ~30-50 MB. Negligible at object-storage prices.

09 · Ship gates

Ship checklist

Two passes. Build-time gates verify the code; run-time gates verify the system in production.

Build-time

File I/O routes to Read / Write / Edit. Bash reserved for actual execution↗ tool-calling
PreToolUse hook on Bash with compiled regex blocklist. Allowlist of safe binaries↗ hooks
Sandbox runtime (Docker or Firecracker) per invocation. Image cached for warm spin-up↗ subagents
Kernel-level resource limits: CPU 2, memory 1024 MB, timeout 30 s, network deny↗ tool-calling
PostToolUse hook normalizes raw output to JSON contract↗ structured-outputs
Per-task semantic validators (test runner, data analysis, type checker)↗ evaluation
Adversarial test suite: OOM kill, timeout kill, network deny each verified end-to-end
Audit log: append-only JSONL with command, hook decisions, sandbox metrics, validation outcome↗ evaluation
Retention 90+ days. Indexed by agent_id and timestamp
Telemetry: per-invocation duration, peak_memory, hook deny rate, validation pass rate

Run-time

Sandbox image built and cached on every runner; cold-start time documented
PreToolUse blocklist regex tested against an adversarial 'destructive-attempt' eval set
Kernel-level resource limits verified end-to-end (OOM kill, timeout kill, network deny)
PostToolUse normalizer tested against 5 representative output shapes
Per-task semantic validators registered for every Skill that emits code-exec results
Audit log retention 90+ days; indexed; replay tool reconstructs any invocation in seconds
Allowlist of safe binaries kept narrow; PR review on every additional binary
Telemetry: per-invocation duration, peak memory, hook deny rate, validation pass rate

10 · Question patterns

Five exam-pattern questions

A Skill executes Python code. The agent calls Bash with command: 'cat data.json'. What is the correct tool to use here and why?

Use the Read tool with file_path: 'data.json'. Reserve Bash for actual command execution (compiling, running tests, executing a Python data-analysis script). Built-in tools are auditable and structurally distinct from execution; the PreToolUse blocklist has a smaller reasoning surface when Bash usage is narrow. Tagged to AP-CODEEXEC-01.

Your code-execution Skill runs untrusted Python. What prevents the code from running `rm -rf /` or other destructive commands?

A PreToolUse hook on Bash with a compiled regex blocklist (rm -rf, sudo, drop (database|table), kill -9, chmod 777, curl | sh). On match, the hook exits 2 with a model-readable stderr message; the SDK delivers that as a tool_result with is_error: true. The agent observes the deny and re-plans. The blocklist lives in code, not in the prompt; prompt-injection cannot bypass it. Tagged to AP-CODEEXEC-02.

A Skill executes a data-analysis script. The script can run for 5 minutes (max) or 10 seconds (expected). How should you enforce the time limit?

Kernel-level timeout, not a Python signal handler. Use systemd-run --property=TimeoutStartSec=30s, or Docker's intrinsic timeout via cgroups. Kernel timeouts cannot be caught: the kernel sends SIGKILL and the process exits regardless. Python signal.signal(signal.SIGALRM, ...) is the canonical wrong answer because the user script can try: ... except: pass it. Exit code 124 is the standard timeout signature. Tagged to AP-CODEEXEC-03.

A PostToolUse hook normalizes code-execution output. The raw output is heterogeneous: Unix timestamp, ISO 8601 date, status code, ANSI color codes, multi-line stack trace. How should the hook normalize this?

Emit a stable JSON contract: {status, exit_code, stdout, stderr, duration_ms, peak_memory_mb}. Strip ANSI color codes via regex. Convert Unix timestamps to ISO 8601 UTC. Truncate stdout above 4 KB with a ... truncated (N more lines) ... marker. Map exit_code to status (0 -> ok, 137 -> oom, 124 -> timeout). Always exit the hook with code 0; this hook is for shape, not for denial. Tagged to AP-CODEEXEC-04.

A Skill validates the result semantically. It runs a test suite and gets back: `{passed: 5, failed: 1, skipped: 0, total: 6}`. What semantic check should validate this result?

Three semantic checks beyond schema. (1) passed + failed + skipped == total (here 5 + 1 + 0 = 6: passes). (2) All counts are non-negative. (3) total > 0 for a non-empty run. If any check fails, return is_error: true with the specific failure. Schema validation alone catches malformed JSON; semantic validation catches schema-valid nonsense like {passed: -5, total: 0}. Tagged to AP-CODEEXEC-05.

11 · FAQ

Frequently asked

What languages can a code-execution Skill run?

Anything in the sandbox base image. Common: Python, Node.js, Go, Rust, shell. The image determines available runtimes. For most teams, python:3.12-slim plus a few preinstalled libraries covers 90% of use cases.

Can code access the network?

No by default. --network none denies all egress at the kernel level. Opt in for specific tasks by spawning the sandbox with --network bridge and specific iptables rules.

What happens if code runs out of memory?

The kernel sends SIGKILL when the cgroup memory limit is hit; the process exits with code 137. The harness detects 137 and emits status: oom in the normalized result.

Can code persist state across Skill invocations?

Not by default. Each invocation gets a fresh sandbox with no state from prior runs. Opt-in persistence via a mounted volume on the runner.

How do I validate output semantically?

Per-task validators registered by Skill key. Test-runner validator: passed + failed + skipped == total. Data-analysis validator: row count, required columns, aggregate sanity.

Can a Skill call another Skill that does code execution?

Yes. Skill-to-Skill calls are tool calls. Each Skill invocation gets its own sandbox; nesting does not share resources.

What is the timeout for code execution?

30 seconds by default. Configurable per Skill via the sandbox_timeout_sec parameter in the Skill frontmatter. The kernel enforces it; the running code cannot extend or ignore it.

Agent Skills with Code Execution.

The problem

What the customer needs

Why naive approaches fail

The system

What each part does

Bash Tool with Destructive Blocklist

Sandbox Runtime (Docker or Firecracker)

Resource Limit Enforcement

PostToolUse Output Normalizer

Semantic Result Validator

Data flow

Eight steps to production

Route file I/O to built-in tools; reserve Bash for execution

Wire the PreToolUse blocklist hook on Bash

Spawn the sandbox with kernel-level limits

Use kernel timeouts, not Python signal handlers

PostToolUse output normalizer

Validate the result semantically

Test resource limits and timeouts adversarially

Audit-log every code-exec invocation

The four decisions

Where it breaks

Cost & latency

Ship checklist

Build-time

Run-time

Five exam-pattern questions

Frequently asked

Agent Skills with Code Execution, complete.

Agent Skills with Code Execution.

The problem

What the customer needs

Why naive approaches fail

The system

What each part does

Bash Tool with Destructive Blocklist

Sandbox Runtime (Docker or Firecracker)

Resource Limit Enforcement

PostToolUse Output Normalizer

Semantic Result Validator

Data flow

Eight steps to production

Route file I/O to built-in tools; reserve Bash for execution

Wire the PreToolUse blocklist hook on Bash

Spawn the sandbox with kernel-level limits

Use kernel timeouts, not Python signal handlers

PostToolUse output normalizer

Validate the result semantically

Test resource limits and timeouts adversarially

Audit-log every code-exec invocation

The four decisions

Where it breaks

Cost & latency

Ship checklist

Build-time

Run-time

Five exam-pattern questions

Frequently asked

Agent Skills with Code Execution, complete.

Share this primitive