The Firecracker Jailer Explained

Ajay Kumar · 9 min read

People reach for Firecracker because of one property: the guest runs under hardware virtualization (KVM), so untrusted code in the VM can't touch the host. True. But that's only half the picture, and it's the half everyone already knows. The other half should keep you up at night if you run multi-tenant workloads: the Virtual Machine Monitor (the Firecracker process itself) is ordinary host code. It parses guest I/O, emulates virtio devices, and handles whatever the guest throws at it. If a bug in that code is exploitable, the attacker doesn't land in the guest. They land on your host, in the VMM's process. The jailer exists to make sure that landing is as small and survivable as possible.

The `jailer` is a separate binary that ships with Firecracker. You don't run `firecracker` directly in production; you run `jailer`, which sets up a locked-down environment and then `exec`s Firecracker inside it. This post is an internals walk-through of exactly what it sets up and why each layer matters. I'm Ajay, and I built PandaStack. Every sandbox, database, and app we run is a jailed Firecracker microVM, so this is the security model we bet the platform on.

Why jail a process that's already running a VM?

The instinct is reasonable: if KVM already separates guest from host, isn't a second layer redundant? It isn't, because the two layers defend different boundaries. KVM defends the guest-to-host boundary — the guest can only escape by breaking the hypervisor's hardware-enforced isolation. The jailer defends the VMM-to-host boundary — what happens if the VMM process itself is compromised, before any guest escape even matters.

Concretely: the guest sends data to emulated devices, and that data is parsed by Firecracker running on the host. A memory-safety bug in a device model, a logic flaw in the API socket handler, a malformed virtio descriptor that triggers undefined behaviour — any of these could give an attacker control of the VMM process without ever "escaping" KVM. If that process runs as root in the host's namespaces with the full filesystem visible, you've lost the host. If it runs as an unprivileged user, in its own chroot, in its own PID and network namespaces, with most syscalls blocked by seccomp, the attacker has popped a process that can see almost nothing and do almost nothing. That gap is the entire point of defense in depth.

The mental model: KVM is the wall between the guest and the VMM. The jailer is the wall around the VMM itself. You want both, because a VMM bug shouldn't equal host root — it should equal a compromised process that's already boxed in.

What the jailer actually does, step by step

When you launch the jailer, it performs a sequence of host-side hardening steps and then drops itself into the box it just built before handing control to Firecracker. The order matters — privileged setup happens first, privilege-dropping happens last — but the layers are:

  • cgroup placement: the jailer creates a cgroup for the microVM and moves itself into it, so CPU and memory limits apply to the VMM and its guest from the start.
  • Namespaces: it always unshares into a new mount namespace, joins the pre-created network namespace you pass with `--netns`, and, when started with `--new-pid-ns`, runs Firecracker in a new PID namespace. The VMM then can't see the host mount tree or host processes, and it lives on an isolated network path.
  • chroot + pivot_root: it builds a minimal chroot jail (typically /srv/jailer/firecracker/<id>/root) and pivot_roots into it, so the only filesystem the VMM can reach is the handful of files you explicitly placed there.
  • Device setup: it creates the few device nodes the VMM needs inside the jail, principally /dev/kvm and /dev/net/tun, via mknod, and chowns them to the jailed uid/gid.
  • Drop privileges: it calls setgid/setuid with the unprivileged uid/gid you pass in, so Firecracker never runs as root once setup is done.
  • seccomp-bpf: once it is running, Firecracker installs a seccomp filter on itself, allowing only the small set of syscalls the VMM legitimately needs and terminating the process on anything else.

After all of that, the jailer `exec`s the Firecracker binary inside the jail. From that moment, the running VMM is an unprivileged process that can see one directory, one or two device nodes, no host PIDs, and can call only a vetted syscall set. Everything else is gone.
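One way to check those properties on a live host is to read them back from /proc. The following is a minimal sketch, not part of Firecracker: the helper names and the `PROC_ROOT` override are my own, and it assumes pid 1 sits in the host's initial mount namespace.

```shell
#!/bin/sh
# Sketch: read /proc to sanity-check that a VMM process looks jailed.
# PROC_ROOT is a hypothetical override so the helpers can run against fixtures.
PROC_ROOT="${PROC_ROOT:-/proc}"

# effective_uid <pid>: the "Uid:" line holds real, effective, saved, fs uids.
effective_uid() {
  awk '/^Uid:/ { print $3 }' "$PROC_ROOT/$1/status"
}

# in_separate_ns <type> <pid_a> <pid_b>: succeeds when the two processes are
# in different namespaces of that type (mnt, pid, net). Each ns entry is a
# symlink such as "mnt:[4026531840]"; equal targets mean a shared namespace.
in_separate_ns() {
  [ "$(readlink "$PROC_ROOT/$2/ns/$1")" != "$(readlink "$PROC_ROOT/$3/ns/$1")" ]
}

# looks_jailed <pid>: non-root and not in the mount namespace of pid 1.
looks_jailed() {
  _uid="$(effective_uid "$1")"
  [ -n "$_uid" ] && [ "$_uid" != "0" ] && in_separate_ns mnt 1 "$1"
}
```

Run it as root on the host with the VMM's pid, for example `looks_jailed 12345 && echo jailed`. Reading another user's `ns` links needs elevated privileges, and the same `in_separate_ns` helper works for `pid` and `net` if you launched with `--new-pid-ns` and `--netns`.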

A realistic jailer invocation

Here's roughly what launching a jailed microVM looks like. The arguments before the `--` configure the jail; everything after `--` is passed through to the Firecracker binary it execs. The flags map directly onto the layers above.

# Each microVM gets a unique id; the jailer builds its jail under
# <chroot-base>/firecracker/<id>/root and pivot_roots into it.
sudo jailer \
  --id 6f3c1a9e-microvm \
  --exec-file /usr/bin/firecracker \
  --uid 30000 \
  --gid 30000 \
  --chroot-base-dir /srv/jailer \
  --netns /var/run/netns/ns-6f3c1a9e \
  --new-pid-ns \
  --cgroup-version 2 \
  --cgroup cpu.max="50000 100000" \
  --cgroup memory.max=2147483648 \
  -- \
  --api-sock /run/api.socket \
  --config-file /vm-config.json

# Note the paths after `--` are relative to the jail root, not the host.
# /run/api.socket really lives at <chroot-base>/firecracker/<id>/root/run/api.socket

A few things worth noticing. The `--uid`/`--gid` are the unprivileged identity Firecracker ends up as — never 0. The `--netns` is the pre-created network namespace the VMM joins, which is how its traffic stays on an isolated path. The `--cgroup` flags pin CPU and memory limits before the guest even boots. And critically, the paths after `--` (`/run/api.socket`, `/vm-config.json`) are interpreted from inside the chroot — which is why every file Firecracker needs (kernel image, rootfs, sockets) must be staged into the jail directory first. If it isn't in the jail, the VMM literally cannot see it.
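To make the `--cgroup` values above concrete: in cgroup v2, `cpu.max` takes `<quota> <period>` in microseconds, so `50000 100000` means 50ms of CPU time per 100ms window (half a core), and `memory.max` is a byte count, so 2147483648 is 2 GiB. A small sketch of that arithmetic follows; the helper names are hypothetical and the 100000µs period is simply the common default.

```shell
#!/bin/sh
# cpu_max_for_percent <percent-of-one-cpu> [period-us]
# Prints a cgroup v2 cpu.max value: "<quota> <period>", both in microseconds.
# Percentages above 100 grant more than one full CPU.
cpu_max_for_percent() {
  _period="${2:-100000}"
  echo "$(( $1 * _period / 100 )) $_period"
}

# mib_to_bytes <MiB>: memory.max is expressed in bytes.
mib_to_bytes() {
  echo "$(( $1 * 1024 * 1024 ))"
}

# Example: the values used in the invocation above.
#   cpu_max_for_percent 50    -> 50000 100000
#   mib_to_bytes 2048         -> 2147483648
```

Keep the memory limit comfortably above the guest's configured RAM, since the cgroup accounts for the VMM's own overhead as well as guest memory.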

The jailer hardens the VMM, not the guest. It does not make a careless guest configuration safe — if you bind-mount host secrets into the jail or hand the guest a privileged device, the jailer will happily lock that mistake inside the box with the VMM. Stage only what the VM genuinely needs.
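Staging can be sketched as a few lines of shell. This is an illustration under assumptions, not PandaStack's actual tooling: the helper names and the `vmlinux` / `rootfs.ext4` file names are hypothetical, the boot arguments are just a common example, and the jail path follows the `<chroot-base>/<exec-file-name>/<id>/root` layout described above.

```shell
#!/bin/sh
# jail_root <chroot-base> <exec-file> <id>
# Prints the directory the jailer will pivot_root into.
jail_root() {
  echo "$1/$(basename "$2")/$3/root"
}

# stage_into_jail <jail-root> <uid> <gid> <file>...
# Copies each file to the top level of the jail and hands it to the jailed
# uid/gid, so the unprivileged VMM can open it after the privilege drop.
stage_into_jail() {
  _jr="$1"; _uid="$2"; _gid="$3"; shift 3
  mkdir -p "$_jr"
  for _f in "$@"; do
    cp "$_f" "$_jr/$(basename "$_f")" || return 1
    chown "$_uid:$_gid" "$_jr/$(basename "$_f")" || return 1
  done
}

# write_vm_config <jail-root>
# Every path inside the JSON is relative to the jail root, not the host.
write_vm_config() {
  cat > "$1/vm-config.json" <<'EOF'
{
  "boot-source": {
    "kernel_image_path": "/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/rootfs.ext4",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "machine-config": { "vcpu_count": 2, "mem_size_mib": 1024 }
}
EOF
}
```

A typical call sequence would be `root=$(jail_root /srv/jailer /usr/bin/firecracker "$id")`, then `stage_into_jail "$root" 30000 30000 ./vmlinux ./rootfs.ext4` and `write_vm_config "$root"`, all before invoking `jailer`. Copying (rather than bind-mounting) keeps the staged set explicit: whatever is in that directory is the entire filesystem a compromised VMM could see.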

The seccomp layer: filtering the VMM's own syscalls

The layer that surprises people is seccomp, because it constrains Firecracker, not the guest. seccomp-bpf is a Linux kernel feature that lets a process install a BPF program the kernel runs on every syscall the process makes. The program inspects the syscall number (and optionally argument values) and returns a verdict — allow, return an error, or kill the process. Firecracker ships a default filter that allows only the syscalls a correctly-behaving VMM needs and traps everything else.

Why this matters for defense in depth: even if an attacker fully controls the VMM process via a device-emulation bug, they're standing inside a process that can't `execve` a shell, can't `open` arbitrary files, can't `socket` to wherever they like — because those syscalls aren't on the allowlist. Combined with the unprivileged uid and the empty chroot, the set of useful things a compromised VMM can do shrinks toward zero. Conceptually, the filter looks like a syscall allowlist:

# Conceptual shape of a Firecracker-style seccomp allowlist.
# Default action for anything not listed: KILL the process.

default_action: kill_process

allowed_syscalls:
  - read            # serve guest I/O
  - write
  - epoll_wait      # the event loop
  - epoll_ctl
  - ioctl           # but argument-filtered: only KVM_RUN, KVM_* ioctls
  - mmap            # set up guest memory
  - munmap
  - futex
  - exit
  - exit_group

# Note `ioctl` is allowed only with specific KVM arguments, not blanket.
# A syscall like `execve` or `ptrace` is simply absent -> kill on attempt.

The real filter is more nuanced: it's architecture-specific, installed per thread, and filters certain syscalls by argument (for example, allowing `ioctl` only for the KVM operations the VMM actually issues). But the shape is what matters: a tight allowlist with a kill-on-violation default, so an unexpected syscall doesn't return an error the attacker can probe around. It terminates the process outright. Always check the current Firecracker docs for the exact default filter, since it evolves with the codebase.
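You can confirm that a filter is actually in force without reading the filter itself: the kernel reports each thread's seccomp mode on the `Seccomp:` line of its /proc status file (0 = disabled, 1 = strict, 2 = filter). A minimal sketch with hypothetical helper names, assuming you expect every thread of the VMM to carry a filter:

```shell
#!/bin/sh
# seccomp_mode <status-file>
# <status-file> is /proc/<pid>/status or /proc/<pid>/task/<tid>/status.
# The pattern matches only the "Seccomp:" line, not "Seccomp_filters:".
seccomp_mode() {
  case "$(awk '/^Seccomp:/ { print $2 }' "$1")" in
    0) echo disabled ;;
    1) echo strict ;;
    2) echo filter ;;
    *) echo unknown ;;
  esac
}

# all_threads_filtered <proc-dir>
# <proc-dir> is e.g. /proc/12345. Succeeds only if every thread reports
# filter mode; fails on the first thread that does not.
all_threads_filtered() {
  for _s in "$1"/task/*/status; do
    [ "$(seccomp_mode "$_s")" = "filter" ] || return 1
  done
}
```

For a running VMM with pid 12345, `all_threads_filtered /proc/12345 && echo "seccomp on"` is a cheap smoke test to wire into deployment checks. It tells you a filter is installed, not which syscalls it allows.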

The layered model, end to end

Put the pieces together and you get three concentric boundaries an attacker would have to defeat in sequence, each enforced by a different mechanism:

  • Layer 1 — KVM (hardware): the guest is confined to its virtual CPU and memory by the CPU's virtualization extensions. To reach the VMM at all, guest code must exploit the hypervisor through the narrow virtio device interface.
  • Layer 2 — jailer host-side hardening (kernel namespaces + chroot + cgroups + unprivileged uid): even with control of the VMM process, the attacker sees no host filesystem, no host PIDs, an isolated network namespace, capped resources, and zero privileges.
  • Layer 3 — seccomp-bpf (syscall allowlist): the compromised VMM can only make the handful of syscalls a legitimate VMM makes; anything else kills the process before it does damage.

No single layer is asked to be perfect. The guest-to-host wall (KVM, plus Firecracker's deliberately minimal device model) is heavily audited, but the jailer assumes it might fail. The VMM is written in memory-safe Rust, but the jailer assumes it might have a bug anyway. seccomp assumes the chroot and uid drop might be bypassed, and vice versa. That's what "defense in depth" actually means in practice: each layer is designed on the pessimistic assumption that the others have already been defeated.

How PandaStack uses this

Every workload on PandaStack — a code-interpreter sandbox, a managed Postgres database, a hosted app — runs as a jailed Firecracker microVM with the layered model above. The guest gets its own kernel under KVM; the VMM is jailed into its own chroot, namespaces, and cgroup as an unprivileged user; and seccomp constrains the VMM's syscalls. None of that costs you latency at create time: because every create restores a baked snapshot on demand rather than cold-booting, a sandbox is live at p50 around 179ms (cold first boot is roughly 3s, and that happens once per template). The jail is rebuilt fresh per VM, so the security boundary is per-sandbox, not shared.

The takeaway: "it runs in a VM" is necessary but not sufficient for running untrusted code. The VM protects the host from the guest. The jailer protects the host from the VMM. If you're building on Firecracker directly, use the jailer — running raw `firecracker` as root skips the entire second wall. If you're building on a platform, this is the kind of hardening you should expect under the hood before you hand it someone else's code.

Frequently asked questions

What is the Firecracker jailer?

The jailer is a separate binary shipped with Firecracker that sets up a locked-down host-side environment for the VMM process and then execs Firecracker inside it. It places the process in a cgroup, unshares into a new mount namespace (plus a new PID namespace with `--new-pid-ns`, and a pre-created network namespace joined via `--netns`), chroots (via pivot_root) into a minimal jail directory, creates only the device nodes the VMM needs (like /dev/kvm), drops to an unprivileged uid/gid, and lets Firecracker install a seccomp filter on itself. The goal is that a compromised VMM process can't become host root.

If KVM already isolates the guest, why is the jailer needed?

KVM defends the guest-to-host boundary — guest code is confined by hardware virtualization. The jailer defends a different boundary: the VMM (Firecracker) process is ordinary host code that parses guest I/O, so a bug in device emulation or the API handler could compromise that process without any KVM escape. The jailer ensures that if the VMM is compromised, the attacker lands in an unprivileged process inside an empty chroot with isolated namespaces and a restrictive seccomp filter, not on the host as root.

What does the seccomp filter in Firecracker do?

It constrains the VMM's own syscalls. seccomp-bpf installs a BPF program the kernel runs on every syscall Firecracker makes; Firecracker's default filter allows only the small set of syscalls a legitimate VMM needs (and filters some, like ioctl, by argument so only specific KVM operations are permitted) and kills the process on anything else. So even an attacker who fully controls the VMM process can't execve a shell, open arbitrary files, or open arbitrary sockets. Check the current Firecracker docs for the exact default filter, as it evolves.

Does the jailer make the guest configuration safe automatically?

No. The jailer hardens the VMM process, not the guest workload. If you stage host secrets into the jail directory or expose a privileged device to the VM, the jailer will lock that mistake inside the box rather than prevent it. You still have to stage only the files and devices the VM genuinely needs (kernel image, rootfs, the API socket, /dev/kvm) into the jail root, since the chroot means anything not staged is invisible to the VMM.

Does PandaStack run Firecracker through the jailer?

Yes. Every PandaStack workload — sandboxes, managed databases, and hosted apps — runs as a jailed Firecracker microVM: the guest under KVM, the VMM jailed into its own chroot, namespaces, and cgroup as an unprivileged user, with seccomp on the VMM's syscalls. The jail is rebuilt per VM so the boundary is per-sandbox, and because every create restores a baked snapshot rather than cold-booting, sandboxes are live at roughly 179ms p50 despite the full hardening stack.

Written by Ajay Kumar, Founder, PandaStack.