Sandboxed Execution for AI Agents: Why Isolation Is the Real Problem

January 14, 2026 · 8 min read

The first time I let an AI agent execute shell commands on a server I cared about, nothing bad happened. That's the worst possible outcome — because it taught me exactly the wrong lesson.

The agent ran some Python, wrote a few files, installed a package. Everything worked. So I kept doing it. It wasn't until I watched an agent try to rm -rf a directory it had no business touching that the actual problem became visceral. The agent wasn't malicious. It was doing exactly what I'd asked — cleaning up temporary files. It just had a different mental model of "temporary" than I did.

This is the core issue with agentic systems that most tutorials skip past: agents need to execute code, and the code they execute is, by definition, not fully predictable. If you could predict every command an agent would run, you wouldn't need an agent. You'd write a script.

The isolation question

Once you accept that agents will do unpredictable things, the question becomes where they do them. Your options, roughly:

Local execution is the fastest and simplest. The agent runs code directly on your machine or server. No network latency, no additional costs, full access to your filesystem and tools. This is what most people start with, and for development it's fine. The problem is obvious — there's no boundary between the agent and everything else. A runaway process eats your CPU. A bad pip install corrupts your Python environment. An overly creative agent modifies files outside its working directory.

Docker containers add a boundary. The agent runs inside a container with its own filesystem, its own process space, its own network stack. This is better. But Docker wasn't designed for ephemeral, on-demand agent workloads. Spinning up containers is slower than you'd like for interactive use. Managing container lifecycle, cleanup, and resource limits takes real engineering. And if you're running Docker on your own server, you're still responsible for the host machine.

Cloud sandboxes — this is where E2B sits — push the isolation off your infrastructure entirely. Each agent session gets its own Firecracker microVM in someone else's cloud. The VM boots, the agent does its work, the VM dies. Your infrastructure never touches the arbitrary code.

I use E2B to run agents in sandboxed environments for a web platform I built. The architecture is straightforward: a user connects via WebSocket, the server spins up an E2B sandbox from a pre-built template, the agent runs inside the sandbox, and output streams back through the WebSocket. When the session ends, the sandbox is destroyed. Nothing persists on my infrastructure.
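That per-session lifecycle can be sketched as a context manager that guarantees teardown. This is a minimal, self-contained illustration, not the E2B SDK — `LocalSandbox` is a hypothetical stand-in whose `run`/`kill` surface only gestures at what a real client exposes:

```python
import contextlib
import subprocess
import sys

class LocalSandbox:
    """Stand-in for a cloud sandbox client. Names here are illustrative;
    the real SDK's create/run/kill calls differ."""

    def run(self, code: str) -> str:
        # Execute a snippet and capture its output, as the agent loop would
        # before streaming it back over the WebSocket.
        result = subprocess.run(
            [sys.executable, "-c", code], capture_output=True, text=True
        )
        return result.stdout

    def kill(self) -> None:
        # A real client would tear down the remote microVM here.
        pass

@contextlib.contextmanager
def agent_session():
    """One sandbox per session: create on connect, destroy on disconnect,
    even if the agent's code raises mid-session."""
    sandbox = LocalSandbox()
    try:
        yield sandbox
    finally:
        sandbox.kill()  # nothing persists after the session ends

if __name__ == "__main__":
    with agent_session() as sbx:
        print(sbx.run("print(2 + 2)"), end="")
```

The `finally` block is the whole point: the sandbox dies no matter how the session ends, which is what makes "nothing persists on my infrastructure" an invariant rather than a hope.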

What Firecracker actually gives you

E2B runs on Firecracker microVMs — the same technology behind AWS Lambda. The important number is boot time: roughly 300 milliseconds from "create sandbox" to "ready to execute." That's fast enough for on-demand creation per user session, which changes the economics of isolation.

With Docker, the practical pattern is keeping containers warm — pre-creating them and assigning them to users as needed. That means paying for idle compute and managing a pool. With 300ms boot times, you can create a fresh environment for every session and destroy it after. No pool management. No stale state leaking between sessions. No zombie containers to clean up.
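The economics are easy to make concrete. The rates and session counts below are hypothetical round numbers for illustration, not E2B's actual pricing:

```python
def warm_pool_cost(pool_size: int, hourly_rate: float, hours: float) -> float:
    """Warm containers bill for every hour they sit in the pool,
    whether or not a user is attached."""
    return pool_size * hourly_rate * hours

def per_session_cost(sessions: int, avg_session_hours: float,
                     hourly_rate: float) -> float:
    """Boot-on-demand bills only for time a session is actually live."""
    return sessions * avg_session_hours * hourly_rate

# Hypothetical: $0.05/hr per environment, a 20-container warm pool
# vs. 100 daily sessions averaging 10 minutes each.
pool = warm_pool_cost(pool_size=20, hourly_rate=0.05, hours=24)
on_demand = per_session_cost(sessions=100, avg_session_hours=10 / 60,
                             hourly_rate=0.05)
print(f"warm pool: ${pool:.2f}/day, on-demand: ${on_demand:.2f}/day")
```

With these made-up numbers the warm pool costs $24.00/day while per-session boot costs $0.83/day. The gap only exists because boot is fast enough that you never pay for idle capacity.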

The isolation is also stronger than containers. Firecracker VMs have their own kernel. A container escape — which is a real class of vulnerability — doesn't apply because there's no shared kernel to escape into. For a system where untrusted agent-generated code is running, that distinction matters.

Template pre-baking vs. install-on-demand

The single most impactful optimization I found with E2B has nothing to do with the agent itself. It's about how you build your sandbox template.

The naive approach: create a base sandbox, then install your dependencies at runtime. pip install pandas numpy, npm install, whatever your agent needs. This works. It also means every session starts with 30-60 seconds of package installation before the agent can do anything useful.

The better approach: pre-bake everything into a custom template. Install your dependencies, copy your application code, set up your directory structure — all at template build time. When the sandbox boots from that template, everything is already there. Your 300ms boot time stays close to 300ms instead of inflating to a minute.

This is a real trade-off. Pre-baked templates are faster but less flexible. If your agent needs a package that isn't in the template, you're back to runtime installation. The move is to pre-bake the common case and accept the occasional runtime install for edge cases.
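The fallback half of that trade-off fits in a few lines. A sketch of the check-then-install pattern, run inside the sandbox itself:

```python
import importlib.util
import subprocess
import sys

def ensure_package(name: str) -> bool:
    """Pre-bake the common case; fall back to runtime install for the rest.

    Returns True if the package was already present (the pre-baked path),
    False if it had to be installed at runtime."""
    if importlib.util.find_spec(name) is not None:
        return True  # baked into the template image: zero extra latency
    # Edge case: pay the install cost once, inside this sandbox only.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", name],
        check=True,
    )
    return False
```

The common case returns immediately; only the packages you failed to anticipate pay the 30-60 second tax, and they pay it per sandbox, not per command.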

The trade-offs nobody talks about

E2B solves real problems. But it's not free, and the costs aren't just financial.

Latency. Every sandbox operation goes through an API call to E2B's infrastructure. Creating a sandbox, executing a command, reading a file — each one adds network round-trip time. For interactive applications where users are watching output stream in real time, this latency is noticeable. It's not terrible — maybe 50-100ms per operation on top of whatever the operation itself takes — but it adds up across a session with dozens of command executions.

Debugging is harder. When something goes wrong inside a sandbox, you can't just SSH in and poke around. You're working through the SDK — executing diagnostic commands, reading log files through API calls, piecing together what happened from stdout and stderr output. It's manageable but slower than the local debugging experience by a significant margin.

Cost. E2B charges per compute-second. For a hobby project or low-traffic tool, this is fine — probably a few dollars a month. For a production system with hundreds of concurrent users, each running long agent sessions, the compute costs become a real line item. You start thinking about session timeouts and idle detection in a way you wouldn't with your own servers.
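Idle detection is simpler than it sounds: track the last activity timestamp and kill sandboxes that exceed a threshold. A sketch of just the decision logic (a real server would run the check on a timer and call the sandbox teardown API when it fires):

```python
import time

class IdleTracker:
    """Decide when an idle sandbox has earned a kill.

    The injectable clock exists so the logic is testable; production
    code would just use time.monotonic."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_activity = clock()

    def touch(self) -> None:
        # Call on every command execution or file operation.
        self.last_activity = self.clock()

    def should_kill(self) -> bool:
        return self.clock() - self.last_activity > self.timeout
```

Per-compute-second billing is what makes this worth writing: every idle minute you reclaim is money back, which is not true when the hardware is yours either way.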

Vendor dependency. Your sandbox lifecycle is coupled to E2B's API availability. If their service goes down, your agents can't run. For development this is irrelevant. For production, it's a risk you should acknowledge even if you decide it's acceptable.
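If you decide the risk needs mitigating rather than just acknowledging, one option is a degraded local path. A sketch, with the caveat baked in: falling back to local execution is only acceptable when you trust the code — for untrusted code, failing closed is the safer choice. `remote_run` is a hypothetical callable standing in for the vendor API:

```python
import subprocess
import sys

def run_with_fallback(code: str, remote_run=None) -> str:
    """Try the vendor sandbox first; degrade to local execution if the
    API is unreachable. Only sane for code you would run locally anyway --
    for untrusted code, raise instead of falling through."""
    if remote_run is not None:
        try:
            return remote_run(code)
        except ConnectionError:
            pass  # vendor outage: fall through to the degraded path
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    return result.stdout
```

Even if you never ship the fallback, writing it forces the question of what your system does during an outage, which is the part people skip.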

I think the trade-offs are worth it for production workloads where agents are executing untrusted or semi-trusted code. For development and testing, I run agents locally with no sandbox at all. The isolation overhead isn't worth it when you're iterating on prompts and tool implementations.

The alternatives

E2B is designed specifically for the agent sandboxing use case, but it's not the only way to get isolation.

Modal takes a different approach — it's focused on running Python functions in the cloud with automatic scaling. Less of a general-purpose sandbox, more of a "deploy this function" service. If your agent workload is primarily Python computation, Modal might be simpler. If you need full system access — installing packages, running arbitrary commands, modifying files — E2B's model is more flexible.

Fly.io gives you lightweight VMs that you control more directly. You get more configurability but less agent-specific tooling. No built-in sandbox lifecycle management, no SDK for command execution. You're closer to managing your own infrastructure.

Local Docker is the self-hosted middle ground. You get container isolation without vendor dependency, but you're responsible for everything — resource limits, cleanup, security hardening, scaling. For a single-user development setup this is often the right call. For multi-user production, it's a significant engineering commitment.

The honest answer is that the "best" option depends on what you're building. A personal development tool where you trust the agent and want fast iteration? Run locally, no sandbox. A multi-user platform where agents execute code on behalf of strangers? Cloud sandboxes earn their keep. Something in between? Docker on your own infrastructure is probably fine.

Persistent sessions and what they enable

One thing that surprised me about E2B is how persistent sessions change the design space. A sandbox can stay alive for hours or days, maintaining its filesystem state, running processes, keeping packages installed. The agent can come back to the same environment across multiple interactions.

This matters because agent work is often iterative. An agent analyzes data, you ask a follow-up question, the agent builds on what it already computed. Without persistence, every interaction starts from scratch — reloading data, re-installing packages, recreating files. With a persistent sandbox, the second interaction picks up where the first left off. The filesystem is the agent's working memory.

It also opens up workflows that don't work with ephemeral execution. An agent can start a long-running process — a training job, a data pipeline, a web server — and you can check back on it later. The sandbox keeps running independently of the client connection.
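The detach-and-poll pattern is the same whether the process lives in a microVM or on your laptop. A local sketch of the idea (inside a persistent sandbox, the process would survive the client disconnecting entirely):

```python
import subprocess
import sys

def start_background_job(code: str) -> subprocess.Popen:
    """Kick off a long-running process and return a handle immediately.
    The caller is free to go away and check back later."""
    return subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE, text=True,
    )

def check_on(job: subprocess.Popen):
    """Non-blocking status check: None while still running,
    the exit code once the job has finished."""
    return job.poll()
```

The handle is what changes the interaction model: the agent's "start the training job" and your "how's it going?" become two separate conversations against the same live environment.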

Where this is going

The sandbox-per-agent pattern is still early. Most agent frameworks don't have first-class support for execution isolation. Most tutorials show agents running code in the same process as the orchestrator, which works until it doesn't.

I think sandboxed execution becomes table stakes as agents get more capable and more autonomous. The more an agent can do, the more damage it can do when it gets confused. Isolation isn't a feature — it's damage containment. And the tooling is getting good enough that isolation no longer means a ten-second startup penalty or a weekend configuring Docker networking.

The question I'm still working through: what does the right abstraction look like for agents that need to coordinate across multiple sandboxes? One agent per sandbox is clean but limiting. Multiple agents sharing a sandbox is efficient but reintroduces the isolation problems you were trying to solve. There's probably a pattern here that doesn't have a name yet.
