
The Code Isolation Hierarchy

Ajay Kumar · 11 min read

There is a ladder of code-isolation techniques — bare process, container, gVisor, Kata, microVM, confidential VM, with WASM standing off to one side — and almost every argument about "is this a real sandbox?" is really an argument about which rung you're on and whether it matches your threat. This is the canonical walk up that ladder. For each rung the only two questions that matter are: what are you defending against, and what overhead can you pay? Get those two right and the rest of the choice makes itself.

The framing matters because the rungs are not a single "more secure" scale. Most of the ladder is about protecting the host from the code. The top rung flips the threat model entirely and protects the code from the host. WASM isn't on the ladder at all — it's a different axis of isolation. Treat "higher" as "different," not strictly "better," and you'll stop over-paying for boundaries you don't need and under-protecting against the ones you do.

The only two questions: threat model and overhead

Every isolation technique is a trade between two things. The first is the threat model: who is the adversary, and what are they allowed to touch? Your own reviewed first-party service is a different adversary from an LLM executing commands derived from a web page it just scraped. The second is overhead: boot latency, memory footprint, syscall cost, and operational complexity. A boundary you can't afford to create per task is a boundary you'll quietly skip under load, which is worse than picking a cheaper one honestly. The whole ladder is the space of answers to "how much overhead will you pay for how strong a boundary, against which adversary?"
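To make the overhead question concrete, here is a rough arithmetic sketch of how much of each task's wall time goes to creating the boundary. The function name and the sample numbers are illustrative assumptions, not measurements.

```python
def boundary_overhead_fraction(create_ms: float, task_ms: float) -> float:
    """Share of total wall time spent creating the isolation boundary."""
    if create_ms < 0 or task_ms <= 0:
        raise ValueError("create_ms must be >= 0 and task_ms must be > 0")
    return create_ms / (create_ms + task_ms)


# Hypothetical workload: a 2-second task behind boundaries of different cost.
samples = [("5 ms boundary", 5.0), ("200 ms boundary", 200.0), ("2 s boundary", 2000.0)]
for label, create_ms in samples:
    share = boundary_overhead_fraction(create_ms, task_ms=2000.0)
    print(f"{label}: {share:.1%} of wall time")
```

The same boundary that is negligible for a long-running job can dominate a short one, which is why the affordable rung depends on task length as much as on the technique.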

The ladder, rung by rung

Here is the full spectrum in order of increasing (and, at the top, different) isolation, with the cost that buys it. Read it as a menu, not a ranking — the right rung is the lowest one that covers your actual threat.

  1. Bare process: your code runs as a normal OS process with whatever permissions, environment variables, secrets, and network the parent had. There is no boundary. This is the right "rung" only for code you wrote and trust completely; it is never acceptable for untrusted or model-generated input. Overhead: zero.
  2. Container (Docker/runc): a process wrapped in Linux kernel features: namespaces (what it can see), cgroups (what it can consume), capability drops, a seccomp-bpf syscall filter, and optionally an LSM profile (AppArmor/SELinux). A genuine, useful isolation improvement, but every one of those mechanisms is enforced by the same kernel the container shares with the host. Overhead: near-zero. Threat model: trusted-ish code; co-tenancy of untrusted code is where it gets risky.
  3. gVisor / runsc: a user-space kernel (the Sentry, written in Go) that intercepts the application's syscalls and re-implements them itself, so the workload mostly talks to gVisor instead of the host kernel. It interposes on and shrinks the host kernel attack surface dramatically, at the cost of syscall-interception overhead. Overhead: measurable on syscall- and I/O-heavy workloads. Threat model: untrusted code where you want a smaller host-kernel surface without a full VM.
  4. Kata Containers: the container/OCI experience (an image, a pod, kubectl) but each workload runs inside a real lightweight VM with its own guest kernel. Kata is the runtime/orchestration layer; the actual VMM underneath is pluggable (QEMU, Cloud Hypervisor, or Firecracker). Overhead: VM-class, amortized by the container UX. Threat model: untrusted/multi-tenant workloads that need a hardware boundary but want to keep container tooling.
  5. microVM (Firecracker / Cloud Hypervisor): each workload gets its own guest kernel, isolated by CPU hardware virtualization (Intel VT-x / AMD-V, exposed to Linux via KVM). The host no longer shares its kernel syscall surface with the guest; the surface that remains is the VMM plus the KVM ioctl interface plus a deliberately minimal virtio device model. Overhead: a few MB of per-VM memory overhead on top of the guest's own RAM, plus a guest kernel boot. Threat model: arbitrary untrusted code at scale; this is the purpose-built rung for it.
  6. Confidential VM (AMD SEV-SNP / Intel TDX): a VM whose memory is encrypted at runtime and remotely attestable, so even the host hypervisor and operator cannot read the guest's plaintext memory. This is a different threat model, not just more of the same: it protects the guest from the host. Overhead: hardware-specific, plus attestation plumbing. Threat model: you don't trust the operator of the machine you're running on.

The most common misconception is reading this list as a single security dial where you should always turn it up. It isn't. Rungs 1–5 protect the host from the code; rung 6 protects the code from the host, a different question entirely. And a confidential VM does nothing extra to contain a guest that escapes its own kernel. Match the rung to the threat, not to the height of the ladder.
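The "lowest rung that covers the threat" rule can be sketched as a small decision function. This is one illustrative policy, not a definitive one: the `Threat` fields and the `choose_rung` name are hypothetical, and the mapping simply restates the threat models listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    untrusted_code: bool                   # code you did not write or review, incl. model-generated
    distrust_operator: bool = False        # the host operator must not read guest memory
    keep_container_tooling: bool = False   # want OCI images and kubectl on top of a VM
    hardware_virt_available: bool = True   # KVM is usable on the hosts you run on


def choose_rung(t: Threat) -> str:
    """Return the lowest rung that covers the stated threat (illustrative policy)."""
    if t.distrust_operator:
        # A different threat model: protect the guest from the host.
        return "confidential VM"
    if t.untrusted_code:
        if not t.hardware_virt_available:
            # Smaller host-kernel surface without a full VM.
            return "gVisor"
        return "Kata Containers" if t.keep_container_tooling else "microVM"
    # Reviewed first-party code: a shared-kernel boundary is a reasonable default.
    return "container"


print(choose_rung(Threat(untrusted_code=False)))
print(choose_rung(Threat(untrusted_code=True)))
print(choose_rung(Threat(untrusted_code=True, keep_container_tooling=True)))
```

The sketch never returns "bare process": a container costs close to nothing, so this policy treats it as the floor even for trusted code.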

Why the container rung is the one people over-trust

Containers deserve the close look because they're where most teams stop, often without realizing the boundary they bought. A container is assembled from real kernel security features — namespaces isolate visibility, cgroups limit consumption, capability drops remove privileged operations, seccomp filters the syscall set, and an LSM can add mandatory access control on top. Each of those meaningfully reduces attack surface and stops many real attacks. The honest claim is not that they provide no security; it's that they are not a hard isolation boundary of the strength a hypervisor provides, because the kernel enforcing them is the same kernel shared with the host.

That structural fact is the whole point. The kernel is simultaneously the thing running the container and the thing being protected from it. So three kinds of failure can all reach the host: a kernel privilege-escalation bug reachable through the syscall interface; a runtime bug (the canonical historical example is the 2019 runc escape, CVE-2019-5736, in which a container process overwrote the host's runc binary through /proc/self/exe; it was fixed long ago); or a dangerous misconfiguration (a --privileged container, a mounted docker.sock, host bind mounts, CAP_SYS_ADMIN). Seccomp helps here: Docker's default profile blocks several dozen of the most dangerous syscalls. But it still allows 300-plus, so it shrinks the shared-kernel risk without eliminating it. We go deep on the escape mechanics in the dedicated post on why a container is not a sandbox; the takeaway for the ladder is that the container rung is great for your code and structurally weak for someone else's.
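The misconfigurations named above are easy to check mechanically. Here is a minimal sketch that scans one container object from `docker inspect` output for them. It assumes the usual `HostConfig` keys (`Privileged`, `CapAdd`, `Binds`); the function name and the example data are hypothetical, and it does not look at the separate `Mounts` list or at seccomp and LSM settings.

```python
from typing import Any

DANGEROUS_CAPS = {"SYS_ADMIN", "CAP_SYS_ADMIN", "ALL"}


def risky_settings(inspect: dict[str, Any]) -> list[str]:
    """Flag host-reaching misconfigurations in one `docker inspect` container object."""
    host = inspect.get("HostConfig") or {}
    findings = []
    if host.get("Privileged"):
        findings.append("privileged container")
    added = {c.upper() for c in (host.get("CapAdd") or [])} & DANGEROUS_CAPS
    if added:
        findings.append("dangerous capability added: " + ", ".join(sorted(added)))
    for bind in host.get("Binds") or []:
        source = bind.split(":", 1)[0]  # "source:target[:mode]"
        if source.endswith("docker.sock"):
            findings.append(f"Docker socket mounted: {source}")
        elif source.startswith("/"):
            findings.append(f"host bind mount: {source}")
    return findings


# Hypothetical container configuration, for illustration only.
example = {
    "HostConfig": {
        "Privileged": False,
        "CapAdd": ["SYS_ADMIN"],
        "Binds": ["/var/run/docker.sock:/var/run/docker.sock", "/srv/data:/data:ro"],
    }
}
for finding in risky_settings(example):
    print(finding)
```

A clean result here only means these specific foot-guns are absent; the shared-kernel exposure described above remains either way.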

gVisor: a user-space kernel, not a VM

gVisor sits between the container and microVM rungs, and it's frequently mis-described, so precision pays off. Its Sentry component re-implements the Linux syscall interface in Go and runs in user space; the application's syscalls are intercepted and serviced by the Sentry rather than passed straight to the host kernel. The Sentry itself still makes a constrained, seccomp-filtered set of host syscalls — it interposes on and reduces the host kernel surface, it does not stop touching the host kernel entirely. The price is the interposition overhead, which is real and most visible on syscall- and I/O-heavy workloads.

gVisor needs a "platform" for syscall interception and address-space switching, and one option is a KVM platform — but even in KVM mode gVisor is NOT a hardware VM like Firecracker. It keeps a process model with no virtualized hardware layer; it borrows CPU virtualization extensions for faster, safer address-space isolation. Calling gVisor "a VM" is the fastest way to lose a security reviewer. It's a user-space kernel that can use virtualization extensions, not a guest with its own kernel.
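Operationally, gVisor is selected as an alternate OCI runtime rather than booted like a VM, which a short sketch can show. It assumes gVisor's `runsc` binary is installed and registered in Docker's daemon configuration under the runtime name `runsc`; the helper name and the image tag are hypothetical, and nothing is executed here.

```python
import shlex


def gvisor_run_argv(image: str, command: list[str], runtime: str = "runsc") -> list[str]:
    """Build a `docker run` argv that selects an alternate OCI runtime."""
    if not image:
        raise ValueError("image is required")
    # The workload is still a container image; only the runtime that services
    # its syscalls changes.
    return ["docker", "run", "--rm", f"--runtime={runtime}", image, *command]


argv = gvisor_run_argv("alpine:3.20", ["uname", "-a"])
print(shlex.join(argv))
```

The same image and command run unchanged under runc. The difference is which component answers the syscalls, which is the sense in which gVisor is a user-space kernel rather than a guest.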

Kata and microVMs: the hardware boundary

The next step up is a genuine hardware-virtualization boundary, and two rungs share it. Kata Containers keeps the container experience but runs each workload inside a lightweight VM with its own guest kernel; Kata is the runtime that orchestrates this, sitting on top of a pluggable VMM (QEMU, Cloud Hypervisor, or Firecracker). microVMs are the same isolation primitive without the container framing: a thin VMM boots a minimal guest, and the boundary is enforced by the CPU's virtualization extensions through KVM rather than by a shared kernel.

What makes this rung qualitatively different from a container is the shape of the remaining attack surface. A container exposes the full Linux syscall ABI, hundreds of syscalls, to the host kernel. A microVM exposes the VMM plus the KVM ioctl interface plus a small virtio device model. Firecracker leans into this: it's written in Rust, ships a minimal device model (virtio-net, virtio-block, virtio-vsock, plus a serial console and a trivial keyboard controller used only to signal reboot), and runs behind a jailer that chroots the process, applies cgroups and seccomp, and drops privileges as defense-in-depth. The boundary is much smaller than the shared kernel's syscall surface, and therefore far easier to audit, which is exactly why this rung is the right default for arbitrary untrusted code. For the full mechanics of how that boundary is built and where it ends, see the microVM explainer and the Firecracker-vs-Docker comparison.
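To give a feel for how small the device model is, here is a sketch that builds a minimal Firecracker-style VM configuration. The field names follow Firecracker's JSON config-file format as I understand it, so check them against the version you run. The helper name and the file paths are hypothetical.

```python
import json


def firecracker_config(kernel: str, rootfs: str, vcpus: int = 1, mem_mib: int = 128) -> dict:
    """Build a minimal VM config: one kernel, one root block device, no network."""
    if vcpus < 1 or mem_mib < 1:
        raise ValueError("vcpus and mem_mib must be positive")
    return {
        "boot-source": {
            "kernel_image_path": kernel,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [
            {
                "drive_id": "rootfs",
                "path_on_host": rootfs,
                "is_root_device": True,
                "is_read_only": False,
            }
        ],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
    }


# Hypothetical paths; nothing is booted by this script.
config = firecracker_config("/srv/vm/vmlinux", "/srv/vm/rootfs.ext4", vcpus=2, mem_mib=256)
print(json.dumps(config, indent=2))
```

Everything the guest can reach is listed in that one document. Because no network interface is configured, no virtio-net device exists for a hostile guest to probe.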

Smaller surface is not zero surface. The VMM's virtio device emulation is the primary thing a hostile guest probes, KVM has had real guest-to-host escape CVEs (Google's kvmCTF pays up to $250,000 for one), and microarchitectural side channels — Spectre-class branch-target injection, MDS, and newer guest-to-host variants — cross the VM boundary in principle, because branch predictors and caches are shared at the hardware level. A microVM is a memory-isolation boundary, not a microarchitectural-isolation one. It's meaningfully stronger than a shared kernel, not absolute.

Confidential VMs: flipping the threat model

The top rung answers a question none of the others do. AMD SEV-SNP and Intel TDX encrypt guest memory at runtime with a per-VM (or per-trust-domain) key the hypervisor never holds, add memory-integrity protection against hypervisor-driven remapping and replay, and provide hardware-rooted remote attestation so a relying party can verify the guest before trusting it. The result: the host hypervisor and operator can't read the guest's plaintext memory. Where rungs 1–5 protect the host from the guest, a confidential VM additionally protects the guest from the host.

That makes it the right rung only when your adversary includes the operator of the machine — running on infrastructure you don't control or audit, handling data the cloud provider must not see. It's important to be honest about its limits: it removes the host from the trust boundary for guest memory, but it does not by itself defend against all side-channel or microarchitectural attacks (both SEV-SNP and TDX have published interface and side-channel research against them), and it does nothing to shrink the guest's own kernel attack surface — a confidential VM still runs a full guest kernel that can still be exploited from inside. Different threat model, not a free strict upgrade.
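The attestation step can be sketched from the relying party's side. This is a deliberately simplified model: it assumes the report's hardware signature and certificate chain have already been verified by a vendor-specific library, and it covers only the policy decision that follows. The class, field, and function names are simplified stand-ins, not the SEV-SNP or TDX report formats.

```python
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifiedReport:
    """Fields from an attestation report whose signature was ALREADY verified."""
    measurement: bytes   # launch measurement of the guest image
    report_data: bytes   # caller-supplied data bound into the report (here: a nonce)


def accept_guest(report: VerifiedReport, expected_measurements: set[bytes], nonce: bytes) -> bool:
    """Relying-party policy: the guest is a known image AND the report is fresh."""
    known_image = any(
        hmac.compare_digest(report.measurement, m) for m in expected_measurements
    )
    fresh = hmac.compare_digest(report.report_data, nonce)
    return known_image and fresh


# Hypothetical values for illustration.
allowed = {b"\x01" * 48}
nonce = b"challenge-1234"
print(accept_guest(VerifiedReport(b"\x01" * 48, nonce), allowed, nonce))           # True
print(accept_guest(VerifiedReport(b"\x02" * 48, nonce), allowed, nonce))           # False
print(accept_guest(VerifiedReport(b"\x01" * 48, b"old-nonce"), allowed, nonce))    # False
```

Only after a check like this passes would a relying party release secrets to the guest. The check says nothing about bugs inside the guest kernel, which is the limit described above.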

WASM/WASI: off to the side, not up the ladder

WebAssembly belongs in this conversation but not on this ladder, because it isolates on a different axis. A WASM module runs in its own linear memory and cannot address anything outside it, and WASI uses capability-based, deny-by-default security: a module gets no ambient host access — no files, no sockets — unless the host explicitly hands it a capability (a pre-opened directory, a socket handle). That's least-privilege enforced at the language and runtime level, not at the OS-kernel or hardware level. It's excellent for fine-grained, embedded, untrusted plugins where you control the surface tightly. It is not a drop-in for "run an arbitrary Linux process, or a Python script with native dependencies" — for that you're back on the kernel-to-VM ladder.
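The deny-by-default idea is easy to model in a few lines. The sketch below is a plain-Python model of the pre-opened-directory rule, not the WASI API or any runtime's implementation; the class and method names are hypothetical.

```python
from pathlib import PurePosixPath


class Capabilities:
    """Deny-by-default model: access is allowed only under explicitly granted directories."""

    def __init__(self, preopened_dirs=()):
        self._dirs = [PurePosixPath(d) for d in preopened_dirs]

    def may_open(self, path: str) -> bool:
        target = PurePosixPath(path)
        if not target.is_absolute() or ".." in target.parts:
            return False  # reject relative paths and parent-directory traversal outright
        # Allowed only if the target is a granted directory or lives beneath one.
        return any(target == d or d in target.parents for d in self._dirs)


no_grants = Capabilities()
plugin = Capabilities(preopened_dirs=["/data"])
print(no_grants.may_open("/etc/passwd"))       # False: nothing was granted
print(plugin.may_open("/data/input.txt"))      # True: under a granted directory
print(plugin.may_open("/data/../etc/passwd"))  # False: traversal is rejected
```

A module with no grants can open nothing at all. That default is the difference from an ordinary process, which starts with the parent's ambient access and has to be restricted after the fact.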

Where most people should land — and why

Match the rung to the threat. For your own reviewed code with no untrusted input, a container is a perfectly reasonable boundary and the simplest to operate — don't pay for a VM you don't need. For fine-grained plugins you can constrain to a capability set, WASM is a clean fit. But for the case that dominates modern infrastructure — arbitrary, untrusted, often model-generated code, frequently multi-tenant — you want a hardware boundary, which puts you on the microVM rung. That's where you get a separate guest kernel and a small, audited host surface without flipping into the operator-distrust threat model that confidential VMs solve at extra cost. The guide at /blog/how-to-sandbox-untrusted-code frames the whole decision, /blog/why-docker-is-not-a-sandbox explains why the container rung fails for hostile code, and /blog/run-untrusted-code-safely walks the operational pattern.

This is the rung PandaStack is built on. Every sandbox is a Firecracker microVM with its own guest kernel (5.10, Ubuntu 24.04 guest), isolated by KVM hardware virtualization — not a shared-kernel container. The historical objection to the VM rung was cost, and that's the part PandaStack engineers away: there's no warm pool of idle VMs; every create restores a baked Firecracker snapshot on demand, at a p50 of 179ms (about 203ms p99). A same-host fork — copy-on-write guest memory plus an XFS-reflink rootfs — lands around 400ms. The core is open source and Apache-2.0, so you can self-host it on your own Linux KVM hosts and keep the microVM boundary on infrastructure you control. Picking the microVM rung used to mean choosing isolation over speed; here it costs you a couple hundred milliseconds, which is the whole point of putting arbitrary untrusted code on the right rung instead of an easy one.

Frequently asked questions

What is the code isolation hierarchy?

It's the spectrum of techniques for isolating running code, ordered by increasing (and, at the top, different) isolation strength: bare process, container (namespaces/cgroups/seccomp on a shared host kernel), gVisor (a user-space kernel that intercepts syscalls), Kata Containers (container UX inside a lightweight VM), microVM (Firecracker/Cloud Hypervisor, its own guest kernel under hardware virtualization), and confidential VM (SEV-SNP/TDX, encrypted memory that excludes even the host). WASM/WASI sits off to the side as a language-level, capability-based sandbox. The right choice is the lowest rung that covers your actual threat model at an overhead you can afford.

Is gVisor a virtual machine like Firecracker?

No. gVisor is a user-space kernel (the Sentry, written in Go) that intercepts an application's syscalls and re-implements them, shrinking the host kernel attack surface without booting a separate guest kernel. It can use a KVM platform for faster address-space isolation, but even then it keeps a process model with no virtualized hardware layer: it borrows CPU virtualization extensions rather than being a hardware VM. Firecracker, by contrast, boots a real guest kernel isolated by KVM hardware virtualization. They sit on different rungs: gVisor above containers and below the VM-backed rungs (Kata and microVMs), Firecracker at the microVM rung.

Is a higher rung always more secure?

No — higher means different, not strictly better. Rungs from bare process up through microVM all protect the host from the code. A confidential VM (SEV-SNP/TDX) flips the threat model to protect the code from the host, which only matters if you distrust the machine's operator; it does nothing extra to contain a guest that escapes its own kernel. WASM isolates on a separate axis entirely (language-level capabilities). Always match the rung to what you're actually defending against rather than reaching for the top of the ladder by default.

Which isolation level should I use for untrusted or AI-generated code?

For arbitrary untrusted code — especially multi-tenant or model-generated commands — the microVM rung is the right default: each workload gets its own guest kernel isolated by hardware virtualization (KVM), so an escape must break a small, audited VMM/KVM boundary instead of the full shared Linux syscall surface a container exposes. A container is fine for your own trusted code, and WASM fits tightly-scoped plugins. PandaStack runs every sandbox as a Firecracker microVM, created in about 179ms via snapshot-restore, so the hardware boundary costs roughly a couple hundred milliseconds rather than a full VM boot.

Written by Ajay Kumar, Founder, PandaStack.