Advanced · 18 min

Code Execution Agents

Build agents that generate, execute, and iterate on code safely. Covers managed sandboxes (Claude's native code execution tool, E2B), self-hosted Docker, the security gap between 'code ran' and 'answer is correct', and cost math for each option.

Quick Reference

  • Start with managed sandboxes: Claude's code execution tool or E2B — don't build Docker infra until you've outgrown managed options
  • The REPL pattern: generate → validate → execute → observe → iterate — cap at 3 retries before failing gracefully
  • Sandboxing is necessary but not sufficient: prompt injection can still reach the sandbox, and sandbox output can still be wrong
  • Claude code execution tool: free when paired with web_search or web_fetch; standalone, $0.05/hr after the first 1,550 free hrs/mo
  • E2B: ~$0.083/hr, sub-150ms cold starts, any language — choose it when you need multi-language or faster cold starts than Docker
  • Validate output, not just execution: exit code 0 and a hallucinated answer look identical from the outside
  • Never mount user data as read-write without explicit task need — one 'cleanup' step can delete a dataset
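
The REPL pattern and the "validate output, not just execution" rule from the list above can be sketched together as a single loop. This is a minimal sketch, not a reference implementation: `generate`, `execute`, and `check_output` are hypothetical callables you would supply (an LLM call, a sandbox client, and a task-specific result check), and the cap is read here as three attempts in total.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # assumption: "cap at 3 retries" is treated as 3 attempts total


@dataclass
class RunResult:
    exit_code: int
    stdout: str
    stderr: str


def validate_syntax(code: str) -> Optional[str]:
    """Pre-execution check: return an error message, or None if the code parses."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg} (line {exc.lineno})"
    return None


def run_agent_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical: (task, feedback) -> code
    execute: Callable[[str], RunResult],            # hypothetical: runs code in a sandbox
    check_output: Callable[[str], bool],            # hypothetical: task-specific result check
) -> Optional[str]:
    """generate -> validate -> execute -> observe -> iterate, with a hard cap."""
    feedback: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        code = generate(task, feedback)

        # Validate before spending sandbox time on code that cannot parse.
        error = validate_syntax(code)
        if error is not None:
            feedback = error
            continue

        result = execute(code)

        # Observe: a non-zero exit code becomes feedback for the next attempt.
        if result.exit_code != 0:
            feedback = result.stderr
            continue

        # Exit code 0 is not enough: check the output itself.
        if not check_output(result.stdout):
            feedback = f"Output failed validation: {result.stdout!r}"
            continue

        return result.stdout

    return None  # fail gracefully: the caller decides how to report the failure
```

Returning `None` instead of raising keeps the failure path explicit for the caller. The syntax check is only a cheap first gate; it does not replace the sandbox or the output check.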

Should You Build Your Own Sandbox?

The first decision in code execution agent design is not which sandbox to use — it is whether to build one at all. In 2026, the default answer for most teams is no. Managed sandboxes handle isolation, resource limits, and file I/O for you. You pay a small per-hour rate instead of engineering days. Build your own only when managed options cannot fit your requirements.
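
To make concrete what a managed sandbox takes off your plate, here is a rough sketch of the settings you would own when self-hosting with Docker. The function name, image, paths, and limit values are illustrative assumptions; the flags are standard `docker run` options. User data is mounted read-only unless the caller explicitly opts in to write access.

```python
from pathlib import Path
from typing import List


def docker_run_args(
    image: str,
    script: Path,
    data_dir: Path,
    writable_data: bool = False,  # opt in only when the task needs to write
) -> List[str]:
    """Build a locked-down `docker run` command (illustrative limits)."""
    data_mode = "rw" if writable_data else "ro"
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access from the sandbox
        "--read-only",                          # read-only root filesystem
        "--tmpfs", "/tmp",                      # scratch space that vanishes on exit
        "--memory", "512m",                     # assumed limit; tune per workload
        "--cpus", "1",
        "--pids-limit", "128",                  # guard against fork bombs
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        "-v", f"{script.resolve()}:/sandbox/main.py:ro",
        "-v", f"{data_dir.resolve()}:/data:{data_mode}",
        image, "python", "/sandbox/main.py",
    ]
```

The resulting list can be passed to `subprocess.run(args, capture_output=True, timeout=...)`. Every value in it is a decision, and a maintenance burden, that the managed options make for you.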

Decision flow:
  1. Custom runtime or air-gap needed?
       Yes → Self-hosted: Docker / gVisor / Firecracker
       No  → go to question 2
  2. Using the Anthropic API?
       Yes → Claude code exec tool (★ start here)
       No  → E2B / Daytona / Modal (API-based, sub-150ms cold starts, ~$0.083/hr)

Two questions pick your sandbox — most teams start with the managed path
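
The two questions in the figure can be written as a small function, which is handy if you want the choice recorded in configuration code rather than in someone's head. The function and parameter names are hypothetical; the branches mirror the figure.

```python
def choose_sandbox(needs_custom_runtime_or_air_gap: bool, uses_anthropic_api: bool) -> str:
    """Return the sandbox family suggested by the two-question decision flow."""
    # Question 1: hard requirements that managed options cannot meet.
    if needs_custom_runtime_or_air_gap:
        return "self-hosted (Docker / gVisor / Firecracker)"
    # Question 2: already on the Anthropic API? Start with the native tool.
    if uses_anthropic_api:
        return "Claude code execution tool"
    # Otherwise use an API-based managed sandbox.
    return "managed API sandbox (E2B / Daytona / Modal)"
```

The order of the checks matters: a custom-runtime or air-gap requirement overrides everything else, even for teams already on the Anthropic API.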

| Option | Cold Start | Python RAM | Pricing | When to Choose |
|---|---|---|---|---|
| Claude code exec tool | None (managed) | 5 GiB | Free w/ web search; $0.05/hr otherwise | You use the Anthropic API and need data analysis or computation |
| E2B | ~150 ms | Configurable | ~$0.083/hr (1 vCPU + 1 GiB) | Multi-language, no Anthropic dependency, fast setup |
| Daytona | ~90 ms | Configurable | ~$0.083/hr | Sub-100ms cold starts, dev-environment style isolation |
| Modal | <1 s | Configurable | ~$0.12/hr | GPU access, ML workloads, Python-first teams |
| Docker (self-hosted) | 1–5 s | You set it | Compute cost only | Air-gap, compliance, full control over runtime |
| gVisor / Firecracker | 1–5 s | You set it | Compute cost only | High-security self-hosted with kernel-level isolation |

Pyodide is archived and has known sandbox escape vulnerabilities

The Pyodide repository was archived in 2025 with no active maintainer. Known sandbox escape vulnerabilities remain unpatched. Do not use Pyodide for production code execution — use a managed sandbox or a proper container runtime instead.

Use managed sandboxes when you are prototyping, when your team has no container infrastructure, or when your use case is data analysis with the Anthropic API. Shift to self-hosted when you need a custom runtime, specific language versions, or air-gap deployment, or when your query volume is high enough that raw compute costs less than per-hour managed pricing.
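
The volume threshold can be estimated with simple arithmetic. The sketch below uses the per-hour rates quoted in this section and assumes a free allowance is consumed first, with billing only for hours beyond it. The self-hosted monthly figure is a hypothetical placeholder that you would replace with your own compute and maintenance estimate. It does not model the case where the Claude tool is free because it is paired with web search.

```python
# Rates quoted in this section (USD per sandbox-hour).
CLAUDE_RATE = 0.05
CLAUDE_FREE_HOURS = 1550.0  # assumption: free allowance applies before billing starts
E2B_RATE = 0.083
MODAL_RATE = 0.12


def managed_monthly_cost(hours: float, rate_per_hour: float, free_hours: float = 0.0) -> float:
    """Monthly bill for a managed sandbox: only hours beyond the free tier are charged."""
    billable_hours = max(0.0, hours - free_hours)
    return billable_hours * rate_per_hour


def break_even_hours(self_hosted_monthly: float, rate_per_hour: float, free_hours: float = 0.0) -> float:
    """Monthly sandbox-hours above which a fixed self-hosted cost becomes cheaper."""
    return free_hours + self_hosted_monthly / rate_per_hour


if __name__ == "__main__":
    SELF_HOSTED_MONTHLY = 200.0  # hypothetical: servers plus upkeep, in USD
    for name, rate, free in [
        ("Claude code exec tool", CLAUDE_RATE, CLAUDE_FREE_HOURS),
        ("E2B", E2B_RATE, 0.0),
        ("Modal", MODAL_RATE, 0.0),
    ]:
        hours = break_even_hours(SELF_HOSTED_MONTHLY, rate, free)
        print(f"{name}: self-hosting wins above ~{hours:,.0f} sandbox-hours/month")
```

The model treats self-hosting as a flat monthly cost, which understates it if engineering time grows with usage. Treat the result as a rough lower bound on when to switch.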