
Best Code Execution Sandboxes for AI Agents (2026)

Ajay Kumar · 12 min read

If you're building an AI agent that writes and runs its own code, the sandbox is the single most load-bearing piece of infrastructure you'll choose. It's the boundary between 'the model generated a shell command' and 'the model wiped a host,' and the thing your agent loop blocks on dozens of times per task, so its latency shapes your product's feel. In 2026 there are a lot of credible options — good for buyers, confusing when every vendor's own blog ranks itself first. This is a best-of that tries to be an honest broker: real decision criteria, a fair pass over the field, and specifics only where I can stand behind them.

The field covered here: PandaStack (our project — open-source Firecracker microVMs, self-hostable, broad platform), E2B, Modal, Daytona, Northflank, Vercel Sandbox, and Fly.io Sprites, plus the open-source building blocks (Firecracker, gVisor, Kata, microsandbox) several are built on. Rather than rank these into a leaderboard that ignores your workload, we'll walk the six decisions that actually separate them, with a per-option summary and an honest 'pick this one when…' for each. The hosted-vs-self-host companion is /blog/e2b-alternatives; per-vendor deep dives are linked throughout.

Disclosure: I'm the founder of PandaStack, so read this as a vendor's roundup and weight it accordingly. I keep it honest the only way that works — I cite specific numbers (latency, fork times, license) only for PandaStack, and I describe every other tool in general, qualitative terms drawn from its own docs rather than inventing internals or quoting figures I can't stand behind. I deliberately don't print competitor latency or dollar pricing, because both are easy to mis-measure and change monthly. For anything load-bearing to your decision, verify against each vendor's own current docs and pricing page before you commit.

The criteria that actually separate them

Almost every sandbox in this market will run a Python script and hand you back stdout. That baseline tells you nothing. The differences that decide whether a tool fits your agent live in six places: the isolation model, hosted vs self-host, cold-start latency, forking and copy-on-write state, platform breadth, and pricing posture. Work out which is forcing your hand before you compare anything — the right answer changes completely depending on which one matters to you.

Criterion 1: the isolation model

When an agent runs code it wrote itself, you're running untrusted code by definition — the model is not an authority you can trust with your host. Isolation strength is therefore the criterion that matters most, and the one most often blurred by marketing. There are three broad models, in increasing order of strength, covered in depth in /blog/code-isolation-hierarchy.

  • Containers (namespaces + cgroups + seccomp): fast and cheap, but every container shares the one host kernel, so a kernel-level escape is a host compromise. Many products marketed as 'sandboxes' are really hardened containers — fine for trusted code, riskier for arbitrary agent output (/blog/why-docker-is-not-a-sandbox).
  • User-space kernel (gVisor): a second kernel in user space intercepts guest syscalls so they don't hit the host kernel directly, shrinking the attack surface without a full VM — a real step up from a plain container, with workload-dependent compatibility and performance trade-offs.
  • Hardware-virtualized microVMs (Firecracker, Kata): each sandbox gets its own guest kernel, isolated by KVM. Guest code never touches the host kernel; the exposed surface is the much smaller, better-audited VMM. The right default for arbitrary untrusted code — see /blog/firecracker-vs-docker and /blog/what-is-a-microvm.

Where the field lands, from each vendor's own primary sources: E2B and Vercel Sandbox both run sandboxes as Firecracker microVMs (E2B in its infra docs; Vercel states it plainly and links the project), as does PandaStack. Modal uses gVisor, as confirmed in its own security docs, which is a different bet from a hardware-virtualized VM. Northflank offers a choice of Kata Containers (via Cloud Hypervisor), Firecracker, or gVisor per workload. Fly.io Sprites are widely reported to be Firecracker-based, consistent with Fly Machines, though you should confirm that in Fly's docs. Daytona describes a dedicated kernel and complete isolation (VM-like) without naming a hypervisor, so I won't name one. The honest spine: for arbitrary agent code, microVM-class isolation (Firecracker or Kata) clears the bar, gVisor is a meaningful middle, and a plain shared-kernel container generally does not.

Don't over-read 'microVM' as 'immune.' Hardware virtualization is a far stronger boundary than a shared kernel, and a minimal VMM's attack surface is much smaller and better-audited than the full Linux syscall interface a container shares — but it is not zero. VMMs have had bugs; KVM has had bugs. The honest claim is 'dramatically smaller, better-audited attack surface,' not 'unbreakable.' Defense in depth — a privilege-dropping jailer, seccomp, per-sandbox egress controls — still matters on top of the VM boundary. We cover that layering in /blog/secure-code-execution-for-ai-agents.

Criterion 2: hosted vs self-host

This is the cleanest structural fork in the road, and frequently the reason a team looks past the obvious hosted choice. The question is concrete: where does the agent's code physically execute, and who operates the machines? The word 'self-hosted' is badly overloaded here, so be precise about three distinct things.

  • Hosted-only managed service: you call an API, code runs on the vendor's infrastructure, you never touch a host. Modal, Vercel Sandbox, and Fly.io Sprites are hosted-only — none documents a self-host path. The least operational work, full stop.
  • Bring-your-own-cloud (BYOC): a proprietary control plane manages compute in your cloud account. Northflank offers this — it gives data locality, but it isn't source-available software you run end-to-end, so don't conflate it with open-source.
  • Genuinely open-source and self-hostable: source you deploy on your own machines. E2B (Apache-2.0), Daytona (AGPL-3.0; managed/self-host/hybrid), microsandbox (libkrun, OSS), and PandaStack all live here — with the usual caveat that licenses change, so confirm in each repo.

PandaStack's core is Apache-2.0 and designed to self-host on your own Linux KVM hosts — anything with /dev/kvm. You run the control-plane API and a per-host agent; sandboxes execute entirely on your infrastructure. There's a hosted offering too, but self-host is first-class: same binaries, same agent, configurable SDK base URL so identical code points at either. The honest counterweight applies to every self-hostable option here: self-hosting is real operational weight — KVM hosts, an agent fleet, networking, snapshot storage. If you don't have an infra team or the appetite to grow one, a hosted-only provider is genuinely less work, and that's a legitimate reason to stay hosted.
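Before planning a self-hosted deployment of any KVM-based sandbox, a quick first check is whether the host actually exposes /dev/kvm to the user that will run the VMM. The sketch below is a generic Linux check, not part of any vendor's tooling; the function name kvm_status and the shape of its return value are illustrative choices of mine.

```python
import os
import stat


def kvm_status(path: str = "/dev/kvm") -> dict:
    """Report whether a KVM device node exists and is usable by this user."""
    status = {
        "path": path,
        "exists": False,
        "char_device": False,
        "read_write": False,
    }
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing node: KVM module not loaded, or (in a cloud VM)
        # nested virtualization is not enabled for this instance.
        return status
    status["exists"] = True
    # /dev/kvm is a character device; anything else is suspicious.
    status["char_device"] = stat.S_ISCHR(mode)
    # A VMM process has to open the device for both reading and writing.
    status["read_write"] = os.access(path, os.R_OK | os.W_OK)
    return status


if __name__ == "__main__":
    print(kvm_status())
```

If exists is true but read_write is false, the usual cause is that the current user lacks permission on the device node (commonly fixed through group membership), which is worth sorting out before debugging anything higher in the stack.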

Criterion 3: cold-start and create latency

Inside an agent loop, how long create() blocks is often the difference between a tool that feels instant and one that feels broken — and the cost compounds across a long trajectory. PandaStack's design choice is specific, because it's where our numbers come from: there is no warm pool of idle VMs. Every create restores a baked Firecracker snapshot on demand — a snapshot that already contains a booted kernel, a running guest agent, and an open network stack, so 'starting' a sandbox is really 'restore memory pages and resume.' That lands at 179ms p50, roughly 203ms p99. The only slow path is the first-ever spawn of a brand-new template, which cold-boots (~3s) and bakes the snapshot; every create after is on the fast restore path (mechanics in /docs/internals/snapshot-restore).

Most other providers also advertise fast startup, several quoting sub-second or millisecond figures. I deliberately won't reprint competitor latency numbers, because cold-start is the metric easiest to mis-measure across vendors: warm pool versus true cold boot, snapshot resume versus full boot, your region versus theirs, your real template versus a trivial one. The only number you should trust is the one you measure yourself, on your template, in your region. Treat every vendor's headline figure (ours included) as a hypothesis to benchmark, not a settled fact.
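Measuring it yourself takes very little code. The harness below is a minimal sketch: it times any zero-argument callable and reports nearest-rank p50 and p99. The fake_create function is a stand-in I made up so the script runs on its own; swap in a call to whichever vendor SDK you are evaluating. The warmup parameter exists because a first call may take a slow path (for example, baking a snapshot) that you probably want to report separately.

```python
import math
import time
from typing import Callable, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(create: Callable[[], object], runs: int = 50, warmup: int = 1) -> dict:
    """Time `create` `runs` times after `warmup` untimed calls."""
    for _ in range(warmup):
        create()  # not counted: may hit a one-off slow path
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        create()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "max_ms": max(latencies_ms),
    }


def fake_create() -> None:
    """Hypothetical stand-in for a real SDK create() call."""
    time.sleep(0.002)


if __name__ == "__main__":
    print(benchmark(fake_create, runs=20))
```

Run it from the region your agents will run in, against the template you will actually ship, and keep the raw samples if you want to compare vendors later. With only a few dozen runs the p99 is effectively the maximum, so use a larger run count before reading much into the tail.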

Criterion 4: forking and copy-on-write state

Forking is where the microVM model pays off in ways containers can't easily match, and a real point of divergence between providers — so evaluate it directly, not from a feature matrix. PandaStack exposes full snapshots and forks as first-class primitives. A full snapshot captures the whole machine — guest memory plus rootfs. A fork clones a running sandbox via copy-on-write: guest memory shared through MAP_PRIVATE (the kernel copies a page only when it's written), the rootfs cloned with an XFS reflink — an O(metadata) operation where data stays shared until something writes (dm-snapshot is also supported). A same-host fork completes in about 400ms; a cross-host fork (download plus restore) runs 1.2–3.5s. The workload this fits: tree-search, agent rollouts, 'try five fixes and keep the one that passes' — warm the environment once (dependencies installed, dataset loaded, REPL hot), then fork it N times in parallel without re-running setup. Internals in /docs/concepts/snapshots-and-forks and /docs/internals/fork-cow; the conceptual walkthrough is /blog/snapshot-and-fork-explained.
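The MAP_PRIVATE behavior described above can be seen at toy scale with Python's standard mmap module on Linux or macOS. This is an illustration of the kernel mechanism only, applied to a small temporary file rather than a guest memory image; it is not PandaStack's code, and cow_demo is a name I chose for the example.

```python
import mmap
import os
import tempfile


def cow_demo():
    """Map one file twice with MAP_PRIVATE, write through one mapping,
    and return (writer's view, sibling's view, bytes on disk)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"base-image-page")
        path = f.name
    try:
        size = os.path.getsize(path)
        with open(path, "r+b") as fh:
            prot = mmap.PROT_READ | mmap.PROT_WRITE
            # Two private mappings of the same backing file, standing in
            # for two forks that share one memory snapshot.
            fork_a = mmap.mmap(fh.fileno(), size, flags=mmap.MAP_PRIVATE, prot=prot)
            fork_b = mmap.mmap(fh.fileno(), size, flags=mmap.MAP_PRIVATE, prot=prot)
            # The write triggers a private copy of the touched page for
            # fork_a only; the file and fork_b are left alone.
            fork_a[0:4] = b"FORK"
            views = (bytes(fork_a[:]), bytes(fork_b[:]))
            fork_a.close()
            fork_b.close()
        with open(path, "rb") as fh:
            on_disk = fh.read()
    finally:
        os.unlink(path)
    return views[0], views[1], on_disk


if __name__ == "__main__":
    print(cow_demo())
```

Only the mapping that wrote sees its change; the sibling mapping and the file keep the original bytes. That is the property that lets many forks share one snapshot while each pays only for the pages it modifies.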

Persistence is the flip side of forking, and the field splits philosophically here. Fly.io Sprites, for instance, make persistence the default — the filesystem survives indefinitely between sessions and the VM scales to zero when idle. That's a genuinely different bet on the same Firecracker primitive than PandaStack's snapshot-restore-on-every-create with no warm pool. Neither is wrong; they optimize for different shapes of work — long-lived stateful environments versus cheap, identical, disposable creates. The honest framing is 'two designs on one isolation tech,' not 'one is faster.' If durable per-agent state is your core requirement, weight that fit heavily — see /blog/pandastack-vs-fly-sprites.

Criterion 5: platform breadth

Sandboxes are ephemeral by design, so the interesting question is what holds state and structure around them. Some tools are deliberately focused — a sandbox primitive and nothing more, which is a legitimate strength (simpler to reason about, easier to swap out) if all you need is to run code. Others bundle a wider platform, consolidating onto one substrate and one bill but coupling you more tightly: on the broad end, Modal positions around serverless AI/ML compute, and Northflank is a full managed cloud platform where sandboxes are one feature among many. PandaStack runs everything on one microVM substrate:

  • Managed PostgreSQL 16 — each database its own dedicated Firecracker microVM with a durable volume, so a per-tenant database is a first-class object on the same substrate as your sandboxes.
  • Git-driven app hosting — connect a repo, auto-detect the framework, blue-green deploys, and scale-to-zero via auto-hibernate when idle.
  • Serverless functions with cron schedules, durable volumes for state beyond the ephemeral CoW rootfs, and first-party templates (base, code-interpreter, agent, browser, postgres-16, claude-agent) so common agent shapes start ready rather than from a bare OS — the data-analyst path is in /docs/guides/code-interpreter.

The point isn't 'more features win' — it's a fit question. If you're building an AI product that also needs a database per tenant and a place to host the app, having it on one substrate and one bill is the argument for breadth. If all you need is to run code, that breadth is irrelevant and a focused tool is cleaner. Decide which side of that line you're on first, because breadth you don't use is just surface area you have to learn.

Criterion 6: pricing posture

I won't quote dollar figures for anyone — ourselves included — because pricing here changes often enough that any number I print will be stale. Go to each vendor's live pricing page and date what you find. The posture, though, shapes your real bill more than the per-unit rate does:

  • Metered usage is the norm: most hosted options bill on a mix of CPU time, memory, creations, storage, and egress, usually per-second, usually with a free or credit tier. The headline rate matters less than how idle time is treated.
  • Idle treatment can dominate: agent workloads spend a lot of wall-clock time waiting on model calls. A design that bills only active CPU, or scales a sandbox to zero when it sleeps, can cost dramatically less for bursty loops than one billing wall-clock for an idle VM — this is where bills diverge most.
  • Self-host changes the equation: with an open-source option you trade a per-second hosted bill for your own hardware plus operational cost. At low volume the hosted bill almost always wins; at scale or under a data-residency constraint, owning the substrate can flip it. Run the math at your projected volume.
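To run that math, a back-of-the-envelope model is usually enough. The sketch below compares wall-clock billing with active-only billing and computes a naive hosted versus self-host break-even. Every number in it is a placeholder in abstract "units", not any vendor's rate, and the model ignores storage, egress, free tiers, and staffing, so treat it as a template to fill with figures you pull from live pricing pages.

```python
def wall_clock_cost(sessions: int, session_seconds: float, rate_per_second: float) -> float:
    """Bill for every second the sandbox exists, idle or not."""
    return sessions * session_seconds * rate_per_second


def active_only_cost(sessions: int, session_seconds: float,
                     active_fraction: float, rate_per_second: float) -> float:
    """Bill only for the share of each session spent doing work."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return sessions * session_seconds * active_fraction * rate_per_second


def break_even_sessions(fixed_monthly_cost: float, hosted_cost_per_session: float) -> float:
    """Monthly sessions above which a fixed self-host budget is cheaper
    than paying the hosted per-session cost."""
    return fixed_monthly_cost / hosted_cost_per_session


if __name__ == "__main__":
    # Placeholder inputs: 10,000 sessions of 10 minutes, 15% of the time
    # spent executing code, the rest waiting on model calls.
    sessions, seconds, active, rate = 10_000, 600, 0.15, 0.0001
    print("wall-clock :", wall_clock_cost(sessions, seconds, rate))
    print("active-only:", active_only_cost(sessions, seconds, active, rate))
    per_session = wall_clock_cost(1, seconds, rate)
    print("break-even sessions/month:", break_even_sessions(3_000, per_session))
```

With these made-up inputs the active-only bill is 15% of the wall-clock bill by construction, which shows why the idle fraction of your agent loop is worth measuring before you compare per-second rates.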

The field, at a glance

The short version of each option, by isolation model and hosting, with a link to the head-to-head where one exists. The deep dives follow the same discipline: specific numbers only for PandaStack, the competitor in general terms with a 'verify against their docs' caveat, and an honest 'pick the other one when…' section.

  • PandaStack — open-source (Apache-2.0) Firecracker microVMs, self-hostable; snapshot-restore on every create (179ms p50), first-class CoW forking (~400ms same-host), plus managed Postgres, app hosting, and functions on one substrate.
  • E2B — focused, mature, hosted-first Firecracker sandbox with an Apache-2.0 core that's also self-hostable. See /blog/pandastack-vs-e2b.
  • Modal — hosted serverless AI/ML compute with a Sandbox primitive on gVisor. See /blog/pandastack-vs-modal.
  • Daytona — a dev-environment-and-sandbox angle (AGPL-3.0; managed, self-host, or hybrid). See /blog/pandastack-vs-daytona.
  • Northflank — a managed full-stack cloud platform with a sandbox feature, BYOC, and a choice of Kata/Firecracker/gVisor. See /blog/pandastack-vs-northflank.
  • Vercel Sandbox — hosted, Firecracker-backed, tightly integrated with the Vercel AI SDK; client SDK open source, runtime not. See /blog/pandastack-vs-vercel-sandbox.
  • Fly.io Sprites — persistent-by-default Firecracker-based VMs that scale to zero. See /blog/pandastack-vs-fly-sprites.
  • OSS building blocks — Firecracker (the minimal Rust VMM under much of the market), Kata, gVisor, microsandbox. Roundup at /blog/best-open-source-code-sandboxes.

Where PandaStack fits (and where I'm allowed to be specific)

To pull the threads together: every PandaStack sandbox is a Firecracker microVM with its own guest kernel (5.10, Ubuntu 24.04 guest), isolated by KVM, running under a jailer that drops privileges and exposes only a minimal virtio device model (net, block, vsock). Networking is per-sandbox — its own Linux netns, veth pair, and tap from 16,384 pre-allocated /30 subnets per agent (/docs/concepts/networking-natid). Boot is snapshot-restore on every create (179ms p50, no warm pool), with an optional UFFD streaming mode that pages guest memory on demand from object storage — HTTP Range GET in 4 MiB chunks, zero-page elision, a prefetch trace, a shared per-host chunk cache, optional 2 MiB hugepages — so it begins restoring before the whole memory image is local (/docs/internals/streaming-restore). You drive it from the pandastack Python package, the @pandastack/sdk TypeScript SDK, or the pandastack CLI — each reads PANDASTACK_API_KEY (keys prefixed pds_) with a configurable base URL, so the same code targets the hosted offering or your self-hosted control plane unchanged. It's the right pick when you want an open-source Firecracker platform you can own end-to-end — and the wrong pick if you have no infra appetite and a hosted-only service would do.
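The subnet count is easy to sanity-check: a /30 holds four addresses, and splitting a /16 pool into /30s gives 2^14 = 16,384 of them. The snippet below shows that arithmetic with Python's ipaddress module. The pool address 10.128.0.0/16 is a placeholder of mine, and reading the two usable addresses in each /30 as "host side" and "guest side" is an assumption about typical tap setups, not something the PandaStack docs quoted here state.

```python
import ipaddress


def carve_subnets(pool: str, prefix: int = 30):
    """Split an address pool into equally sized subnets of the given prefix."""
    network = ipaddress.ip_network(pool)
    return list(network.subnets(new_prefix=prefix))


if __name__ == "__main__":
    subnets = carve_subnets("10.128.0.0/16")  # hypothetical pool
    first = subnets[0]
    usable = list(first.hosts())  # excludes network and broadcast addresses
    print("subnets in pool :", len(subnets))
    print("first subnet    :", first)
    print("addresses each  :", first.num_addresses)
    print("usable addresses:", [str(a) for a in usable])
```

A /30 leaves exactly two usable addresses once the network and broadcast addresses are excluded, which fits a point-to-point link between a host-side interface and a single guest.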

When something other than PandaStack is the right call

Being an honest broker means saying plainly when another tool fits better. Map your situation to the option, not the reverse:

  • Pick E2B (or another focused hosted sandbox) when you want zero infrastructure to operate and would rather adopt a mature default than own a substrate.
  • Pick Vercel Sandbox when you're already building on the Vercel AI SDK and want the tightest path from 'the LLM writes code' to 'it runs safely' inside that stack.
  • Pick Modal when your real workload is scale-out AI/ML compute (GPU jobs, batch inference) and the sandbox is incidental, and you want it fully managed.
  • Pick Daytona when its dedicated-kernel, development-environment model maps to how your team works better than a raw sandbox primitive.
  • Pick Northflank when you want a unified managed platform — sandboxes alongside apps, databases, and GPU — possibly in your own cloud via BYOC, and don't need the software itself to be open-source.
  • Pick Fly.io Sprites when persistence is your core requirement: long-lived agent environments that keep state across sessions, not identical disposable creates.
  • Pick an OSS building block (Firecracker, Kata, gVisor, microsandbox) when you want maximum control and have the appetite to build or self-host the platform layer yourself — see /blog/best-open-source-code-sandboxes.

One honest note on the locked-in baseline: many teams start from OpenAI's hosted Code Interpreter (Python-only, hosted-only, no self-host) and outgrow it. OpenAI's own agent stack now treats the sandbox as a pluggable, swappable backend — the clearest signal that the sandbox is a layer you get to choose, not a default you're stuck with. Where these options diverge is everything above the isolation boundary — boot path, fork semantics, persistence, breadth, and whether you can own the substrate. Concentrate your evaluation there, because isolation strength is roughly the thing the good ones already agree on.

Don't pick from this post (or any roundup, including the ones written by the vendors themselves) on the strength of a description. Isolation backends get swapped, licenses change, pricing shifts monthly, and 'microVM' covers a wide range of real behavior. Pull every quantitative claim (price, boot time, region availability) live from each vendor's own page and date it. Then build a one-hour spike against your top two: create a sandbox in your own region, fork into the branching pattern you actually use, run your real agent code under realistic load, and measure it. An hour of hands-on testing settles more than a week of reading comparison pages.

The bottom line

There is no single best code execution sandbox for AI agents — there's a best one for your six constraints. The serious options share the thing that matters most: hardware-virtualized microVM (or, for Modal, gVisor) isolation is the correct foundation for running code your agent wrote, and that's not where they differ. They differ on whether you can self-host, how fast create() returns, whether forking is first-class, how much platform comes with it, and how the bill is shaped. Start from the criterion forcing your hand — usually self-host, cold-start, or forking — shortlist the two that fit, and prototype against both before you commit. PandaStack's bet, for the record, is an Apache-2.0 Firecracker core wrapped in a full platform — snapshot-restore on every create (179ms p50), CoW forking (~400ms same-host), per-sandbox networking, managed services — that you run end-to-end on your own hardware. If that matches your constraints, benchmark it against the field and keep us honest.

Frequently asked questions

What is the best code execution sandbox for AI agents?

There's no universal winner — the best choice depends on six constraints: isolation model, hosted vs self-host, cold-start latency, forking semantics, platform breadth, and pricing posture. For arbitrary agent code, the serious options share strong isolation: E2B, Vercel Sandbox, and PandaStack run Firecracker microVMs (own guest kernel, KVM isolation), Fly.io Sprites are widely reported to be Firecracker-based, Modal uses gVisor, and Northflank offers a choice of Kata/Firecracker/gVisor. They diverge above the isolation boundary. PandaStack is the open-source (Apache-2.0) option you can self-host on your own KVM hosts, with snapshot-restore on every create (179ms p50, no warm pool) and first-class copy-on-write forking (~400ms same-host). Decide which of the six criteria is forcing your decision, then prototype your top two against your real workload before committing.

Do AI code execution sandboxes use Firecracker microVMs?

Many of the strongest ones do, but not all. E2B, Vercel Sandbox, and PandaStack each run sandboxes as Firecracker microVMs, where every sandbox gets its own guest kernel and is isolated by hardware virtualization (KVM). Fly.io Sprites are widely reported to be Firecracker-based, consistent with Fly Machines. Modal takes a different approach with gVisor, a user-space kernel that mediates guest syscalls. Northflank offers a choice of Kata Containers, Firecracker, or gVisor per workload. Both Firecracker and Kata are microVM-class isolation, which is the right bar for untrusted agent code; gVisor is a meaningful step up from a plain container but a different bet from a full VM. Always confirm a given provider's isolation backend in its own documentation, since 'microVM-based' is sometimes used loosely.

Can I self-host a code execution sandbox for AI agents?

Yes — several options are genuinely open-source and self-hostable, though the term is overloaded. PandaStack's core is Apache-2.0 and runs end-to-end on your own Linux KVM hosts (control-plane API plus a per-host agent; sandboxes execute on your infrastructure; the same binaries power the hosted offering, with a configurable SDK base URL). E2B (Apache-2.0), Daytona (AGPL-3.0), and microsandbox are also self-hostable, as are the OSS building blocks Firecracker, Kata, and gVisor. Be careful with neighboring claims: Northflank's 'self-hosted' means Bring-Your-Own-Cloud (a proprietary control plane in your cloud), and Vercel open-sources only its client SDK, not the runtime. Modal, Vercel Sandbox, and Fly Sprites are hosted-only. The honest trade-off for any self-host path is operational weight — KVM hosts, an agent fleet, networking, and storage — so verify each candidate's current license in its own repo first.

How should I evaluate a sandbox for my AI agent?

Decide which of six things is forcing the decision — isolation model, hosted vs self-host, cold-start latency, fork/snapshot semantics, platform breadth, and pricing posture — then evaluate only your top two candidates against it. Don't trust a feature matrix or any vendor's headline latency or price (including ours): cold-start and fork timings are easy to mis-measure across providers, and pricing changes monthly. Pull quantitative claims live from each vendor's own page and date them. Then build a short spike: call create() in your own region, fork into the branching pattern your agent actually uses, and run your real code under realistic load. An hour of measurement on your own workload settles more than a week of reading comparison posts.

Written by Ajay Kumar, Founder, PandaStack.