Code Execution Agents
Build agents that generate, execute, and iterate on code safely. Covers managed sandboxes (Claude's native code execution tool, E2B), self-hosted Docker, the security gap between 'code ran' and 'answer is correct', and cost math for each option.
Quick Reference
- Start with managed sandboxes: Claude's code execution tool or E2B. Don't build Docker infra until you've outgrown managed options
- The REPL pattern: generate → validate → execute → observe → iterate, capped at 3 retries before failing gracefully
- Sandboxing is necessary but not sufficient: prompt injection can still reach the sandbox, and sandbox output can still be wrong
- Claude code execution tool: free when paired with web_search or web_fetch; $0.05/hr standalone with 1,550 free hrs/mo
- E2B: ~$0.083/hr, sub-150ms cold starts, any language. Choose it when you need multi-language support or faster cold starts than Docker
- Validate output, not just execution: exit code 0 and a hallucinated answer look identical from the outside
- Never mount user data as read-write without explicit task need: one 'cleanup' step can delete a dataset
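
The REPL pattern and the "validate output, not just execution" rule from the list above can be sketched together in one loop. This is a minimal illustration, not a reference implementation: `generate`, `validate_code`, `execute`, and `validate_output` are hypothetical callables you would supply (an LLM call, a static check, a sandbox client, and a task-specific answer check). The sketch assumes "3 retries" means one initial attempt plus up to three more.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_RETRIES = 3  # retry cap from the quick reference


@dataclass
class ExecResult:
    exit_code: int
    stdout: str
    stderr: str


@dataclass
class AgentOutcome:
    ok: bool
    answer: Optional[str]
    attempts: int
    reason: str


def run_repl_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],    # hypothetical: (task, feedback) -> code
    validate_code: Callable[[str], Optional[str]],    # hypothetical: returns a problem or None
    execute: Callable[[str], ExecResult],             # hypothetical: runs code in the sandbox
    validate_output: Callable[[str], bool],           # hypothetical: is the answer plausible?
    max_retries: int = MAX_RETRIES,
) -> AgentOutcome:
    feedback: Optional[str] = None
    total_attempts = 1 + max_retries  # assumption: initial try plus retries
    for attempt in range(1, total_attempts + 1):
        code = generate(task, feedback)

        # Validate before anything reaches the sandbox.
        problem = validate_code(code)
        if problem is not None:
            feedback = f"rejected before execution: {problem}"
            continue

        # Execute, then observe the result.
        result = execute(code)
        if result.exit_code != 0:
            feedback = f"exit code {result.exit_code}: {result.stderr}"
            continue

        # Exit code 0 is not enough: check the output itself.
        if not validate_output(result.stdout):
            feedback = f"output failed validation: {result.stdout!r}"
            continue

        return AgentOutcome(True, result.stdout, attempt, "ok")

    # Fail gracefully: report the last observation instead of raising.
    return AgentOutcome(False, None, total_attempts, feedback or "no attempts made")
```

Passing the dependencies in as callables keeps the loop independent of any particular sandbox, so the same control flow can be exercised with fakes in tests and with a real sandbox client in production.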
Should You Build Your Own Sandbox?
The first decision in code execution agent design is not which sandbox to use — it is whether to build one at all. In 2026, the default answer for most teams is no. Managed sandboxes handle isolation, resource limits, and file I/O for you. You pay a small per-hour rate instead of engineering days. Build your own only when managed options cannot fit your requirements.
Two questions pick your sandbox — most teams start with the managed path
| Option | Cold Start | Python RAM | Pricing | When to Choose |
|---|---|---|---|---|
| Claude code exec tool | None (managed) | 5 GiB | Free with web_search or web_fetch; otherwise $0.05/hr after 1,550 free hrs/mo | You use the Anthropic API and need data analysis or computation |
| E2B | ~150 ms | Configurable | ~$0.083/hr (1 vCPU + 1 GiB) | Multi-language, no Anthropic dependency, fast setup |
| Daytona | ~90 ms | Configurable | ~$0.083/hr | Sub-100ms cold starts, dev-environment style isolation |
| Modal | <1 s | Configurable | ~$0.12/hr | GPU access, ML workloads, Python-first teams |
| Docker (self-hosted) | 1–5 s | You set it | Compute cost only | Air-gap, compliance, full control over runtime |
| gVisor / Firecracker | 1–5 s | You set it | Compute cost only | High-security self-hosted with kernel-level isolation |
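
To make the pricing column concrete, here is a small cost sketch. It uses only the rates quoted in this guide ($0.05/hr with 1,550 free hours per month for the Claude code execution tool, about $0.083/hr for E2B) and treats them as assumptions. The function names are illustrative, and real billing may differ in granularity, so check the vendors' current pricing pages before budgeting.

```python
# Rates copied from this guide's quick reference and comparison table.
# Treat them as assumptions and check current vendor pricing before budgeting.
CLAUDE_RATE_USD_PER_HR = 0.05
CLAUDE_FREE_HOURS_PER_MONTH = 1_550
E2B_RATE_USD_PER_HR = 0.083  # approximate, 1 vCPU + 1 GiB


def claude_code_exec_monthly_cost(hours: float, paired_with_web_tools: bool = False) -> float:
    """Monthly cost of the Claude code execution tool under the guide's pricing."""
    if paired_with_web_tools:
        # The guide lists the tool as free when paired with web_search or web_fetch.
        return 0.0
    billable_hours = max(0.0, hours - CLAUDE_FREE_HOURS_PER_MONTH)
    return billable_hours * CLAUDE_RATE_USD_PER_HR


def flat_rate_monthly_cost(hours: float, rate_usd_per_hr: float) -> float:
    """Monthly cost for a sandbox billed at a flat hourly rate with no free tier."""
    return hours * rate_usd_per_hr


if __name__ == "__main__":
    for hours in (500, 2_000, 10_000):
        claude = claude_code_exec_monthly_cost(hours)
        e2b = flat_rate_monthly_cost(hours, E2B_RATE_USD_PER_HR)
        print(f"{hours:>6} h/mo  Claude standalone ${claude:,.2f}  E2B ${e2b:,.2f}")
```

At 2,000 sandbox hours per month, for example, the standalone Claude tool bills 450 hours past the free allowance (450 × $0.05 = $22.50), while a flat $0.083/hr comes to $166.00.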
The Pyodide repository was archived in 2025 with no active maintainer. Known sandbox escape vulnerabilities remain unpatched. Do not use Pyodide for production code execution — use a managed sandbox or a proper container runtime instead.
Use managed sandboxes when you are prototyping, your team has no container infrastructure, or your use case is data analysis with the Anthropic API. Shift to self-hosted when you need a custom runtime, specific language versions, or air-gap deployment, or when your query volume is high enough that self-hosted compute costs less than per-hour managed pricing.
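
The managed-versus-self-hosted criteria above can be written down as a small decision helper. This is a sketch under stated assumptions: the `Requirements` fields and `choose_path` are hypothetical names, the default managed rate is the approximate E2B figure from the table, and `self_hosted_monthly_usd` is your own estimate (it should include engineering time, not only compute).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Requirements:
    # Hard requirements that managed sandboxes may not satisfy.
    needs_air_gap: bool = False
    needs_custom_runtime: bool = False
    needs_specific_language_versions: bool = False
    # Cost inputs. All values are your own estimates.
    sandbox_hours_per_month: float = 0.0
    managed_rate_usd_per_hr: float = 0.083          # approximate figure from the table
    self_hosted_monthly_usd: float = float("inf")   # unknown until you estimate it


def choose_path(req: Requirements) -> Tuple[str, str]:
    """Return ("managed" | "self-hosted", reason) following the guide's criteria."""
    if req.needs_air_gap:
        return ("self-hosted", "air-gap deployment")
    if req.needs_custom_runtime:
        return ("self-hosted", "custom runtime")
    if req.needs_specific_language_versions:
        return ("self-hosted", "specific language versions")

    # Volume check: does per-hour managed pricing exceed the self-hosted estimate?
    managed_monthly_usd = req.sandbox_hours_per_month * req.managed_rate_usd_per_hr
    if managed_monthly_usd > req.self_hosted_monthly_usd:
        return ("self-hosted", "managed pricing exceeds self-hosted cost at this volume")

    return ("managed", "no requirement rules out managed sandboxes")
```

Leaving `self_hosted_monthly_usd` at infinity by default means the helper only recommends self-hosting on cost grounds once you have supplied an actual estimate, which mirrors the guide's advice to start managed.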