Advanced18 min

Code Execution Agents

Build agents that generate, execute, and iterate on code safely. Covers managed sandboxes (Claude's native code execution tool, E2B), self-hosted Docker, the security gap between 'code ran' and 'answer is correct', and cost math for each option.

Quick Reference

→Start with managed sandboxes: Claude's code execution tool or E2B — don't build Docker infra until you've outgrown managed options
→The REPL pattern: generate → validate → execute → observe → iterate — cap at 3 retries before failing gracefully
→Sandboxing is necessary but not sufficient: prompt injection can still reach the sandbox, and sandbox output can still be wrong
→Claude code execution tool: free when paired with web_search or web_fetch; $0.05/hr standalone with 1,550 free hrs/mo
→E2B: ~$0.083/hr, sub-150ms cold starts, any language — choose it when you need multi-language or faster cold starts than Docker
→Validate output, not just execution: exit code 0 and a hallucinated answer look identical from the outside
→Never mount user data as read-write without explicit task need — one 'cleanup' step can delete a dataset

Should You Build Your Own Sandbox?

The first decision in code execution agent design is not which sandbox to use — it is whether to build one at all. In 2026, the default answer for most teams is no. Managed sandboxes handle isolation, resource limits, and file I/O for you. You pay a small per-hour rate instead of engineering days. Build your own only when managed options cannot fit your requirements.

Two questions pick your sandbox — most teams start with the managed path

Option	Cold Start	Python RAM	Pricing	When to Choose
Claude code exec tool	None (managed)	5 GiB	Free w/ web search; $0.05/hr otherwise	You use the Anthropic API and need data analysis or computation
E2B	~150 ms	Configurable	~$0.083/hr (1 vCPU + 1 GiB)	Multi-language, no Anthropic dependency, fast setup
Daytona	~90 ms	Configurable	~$0.083/hr	Sub-100ms cold starts, dev-environment style isolation
Modal	<1 s	Configurable	~$0.12/hr	GPU access, ML workloads, Python-first teams
Docker (self-hosted)	1–5 s	You set it	Compute cost only	Air-gap, compliance, full control over runtime
gVisor / Firecracker	1–5 s	You set it	Compute cost only	High-security self-hosted with kernel-level isolation

Pyodide is archived and has known sandbox escape vulnerabilities

The Pyodide repository was archived in 2025 with no active maintainer. Known sandbox escape vulnerabilities remain unpatched. Do not use Pyodide for production code execution — use a managed sandbox or a proper container runtime instead.

Use managed sandboxes when: you are prototyping, your team has no container infrastructure, or your use case is data analysis with the Anthropic API. Shift to self-hosted when: you need a custom runtime, specific language versions, air-gap deployment, or your query volume is high enough that compute cost beats per-hour managed pricing.

The REPL Loop: Generate, Validate, Execute, Observe, Iterate

Every code execution agent runs the same five-phase loop, regardless of which sandbox sits underneath. The loop is cheap when it works and expensive when it does not — the mitigation column in the table below is what separates a production agent from a demo.

Managed Sandboxes: Claude Code Execution Tool and E2B

For teams using the Anthropic API, the Claude code execution tool is the lowest-friction starting point. You add one tool definition to your messages.create call and Claude runs Python and Bash in Anthropic's sandboxed container — no Docker, no infrastructure. For multi-language workloads or teams not using Anthropic's API, E2B is the practical alternative.

Sign in to read this article

This is a premium article. Sign in with your Google account to continue.