
Firecracker Networking Explained: TAP, netns, NAT

Ajay Kumar · 10 min read

When a Firecracker microVM sends a packet, that packet takes a longer journey than the guest realizes. Inside the VM, the guest kernel sees an ordinary network interface, brings it up, gets an IP, and routes traffic through it exactly as it would on bare metal. It has no idea that the interface is a paravirtualized device, that the wire on the other end is a host-side TAP device, that the TAP lives in a Linux network namespace it can never see, or that every outbound packet gets NAT'd through a chain of iptables rules before it touches a real network. The guest thinks it has a normal NIC. It has a polite illusion of one — and that illusion is what makes it safe to run someone else's code. This post walks the entire path, from the virtio-net device the guest believes in down to the per-sandbox namespace and NAT that contain it, then shows how PandaStack's NATID design makes that whole apparatus cheap enough to stand up on every single create.

What the guest sees: a virtio-net device

Start inside the VM. A conventional hypervisor would emulate a real network card — an Intel e1000 or similar — register by register, so the guest's stock driver can talk to it. That emulation is faithful but slow: every access to a hardware register traps out to the host and back. Firecracker doesn't do that. It exposes a virtio-net device, which is a paravirtualized NIC. "Paravirtualized" means the guest knows, by way of its virtio driver, that it's talking to a hypervisor rather than to silicon, and the two sides cooperate through shared-memory ring buffers (virtqueues) instead of pretending a physical chip exists. The guest posts packet buffers into a queue, kicks the host once, and the host drains the queue — far fewer traps, far higher throughput.

From the guest's point of view none of that complexity is visible. It sees an interface — typically eth0 — with a MAC address and a link that comes up. It runs DHCP or a static config, installs a default route, and starts sending. The virtio-net device is part of Firecracker's deliberately minimal device model: net, block, and vsock, plus a serial console, and almost nothing else. That small surface is a big part of why a microVM is a tighter boundary than a shared-kernel container — there are far fewer emulated devices to attack. We cover the broader version of that argument in /blog/what-is-a-microvm. For networking specifically, the thing to hold onto is that the guest's NIC is one end of a pipe, and the interesting half of the pipe is on the host.

The host side: a TAP device

On the host, the other end of that pipe is a TAP device. A TAP is a virtual Layer 2 (Ethernet) interface provided by the Linux kernel: instead of being backed by physical hardware, it's backed by a userspace program — here, Firecracker. When the guest transmits an Ethernet frame, it appears on the host's TAP device as if a real machine had sent it down a cable; when the host writes a frame to the TAP, it arrives at the guest's virtio-net interface. The TAP is the seam between the virtual world and the host's normal Linux networking stack.

This matters because once frames are on a TAP, they're just ordinary traffic on an ordinary Linux interface. Everything the kernel can do to packets — route them, NAT them, filter them with iptables, rate-limit them, drop them — applies. Firecracker itself does not implement any network policy. It hands the guest a virtio-net device wired to a TAP and stops there. All of the routing, address translation, and isolation is done by the host's network stack around that TAP. So the real question of "how do I network a microVM safely" becomes "where do I put that TAP, and what rules surround it." The naive answer — put every TAP on one shared bridge — is also the dangerous one, and that's the next section.

Two distinct interfaces, easy to conflate: the guest's eth0 (a virtio-net device, inside the VM) and the host's TAP (outside the VM). They're the two ends of the same link. Firecracker connects them; everything else — routing, NAT, filtering — happens on the host side of the TAP, where the guest has no reach.
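As a rough sketch of what "wired to a TAP" means in practice, Firecracker's HTTP API takes a network interface definition that names the host device. The snippet below builds that request body and, in a second function, sends it over the API socket. The socket path, MAC address, and function names are assumptions for illustration, not PandaStack's configuration.

```shell
#!/bin/sh
# Sketch only: attach a host TAP to the guest as a virtio-net device through
# Firecracker's API socket. Socket path, MAC, and function names are assumed.

# Build the JSON body for PUT /network-interfaces/<iface_id>.
fc_net_payload() {
    iface_id=$1 host_dev=$2 guest_mac=$3
    printf '{"iface_id":"%s","host_dev_name":"%s","guest_mac":"%s"}\n' \
        "$iface_id" "$host_dev" "$guest_mac"
}

# Send the body to a running Firecracker process, before the guest is started.
fc_attach_net() {
    sock=$1 iface_id=$2 host_dev=$3 guest_mac=$4
    curl --silent --show-error --unix-socket "$sock" -X PUT \
        "http://localhost/network-interfaces/$iface_id" \
        -H 'Content-Type: application/json' \
        -d "$(fc_net_payload "$iface_id" "$host_dev" "$guest_mac")"
}

# Print the payload only; no Firecracker process is needed for this line.
fc_net_payload eth0 tap0 AA:FC:00:00:00:01
```

With a Firecracker process listening, `fc_attach_net /tmp/firecracker.socket eth0 tap0 AA:FC:00:00:00:01` would issue the call. A TAP is looked up by name in the caller's network namespace, so if the TAP lives in a namespace, Firecracker has to be started inside that namespace to open it.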

Per-sandbox isolation: a network namespace per VM

Here's the failure mode you want to design out from the start. If you create one Linux bridge on the host and plug every sandbox's TAP into it, all those guests share a Layer 2 segment. That means sandbox A can ARP for sandbox B, send it packets directly, and — depending on your firewall — reach it. It also means every guest sits on the same subnet as your host's other services, so a curious or hostile guest can scan for an internal database, an admin API, or the cloud metadata endpoint at 169.254.169.254. For trusted workloads that might be fine. For untrusted or multi-tenant code — exactly the workload microVMs exist to run — it's a cross-tenant data leak waiting to happen.

The fix is a Linux network namespace per sandbox. A network namespace (netns) is an isolated copy of the kernel's entire networking stack: its own interfaces, its own routing table, its own iptables/NAT rules, its own ARP cache. A process or interface inside one netns cannot see interfaces in another. If you put each sandbox's TAP inside its own netns, there is no shared segment for guests to discover each other on — each sandbox's network world is a private island. To connect that island to the outside, you run a veth pair: a virtual Ethernet cable with two ends, one inside the sandbox's namespace and one in the host's root namespace. Traffic that needs to leave the sandbox goes guest → TAP → (inside the netns) → veth → (out to the root namespace) → and then through the host's normal egress path.

The payoff is that isolation becomes structural rather than a filter you remember to apply. One sandbox literally cannot address another's traffic, because they're in different namespaces with no link between them. And because the routing table and the iptables rules are per-namespace, the namespace is exactly where you express egress policy: "this sandbox may reach the package registry and nothing else" is enforced at a boundary the guest can't touch. Tearing the sandbox down removes the whole netns, and every interface, route, and rule inside it disappears with it; only state kept in the root namespace, such as a host-side NAT rule, needs separate cleanup. We go deeper on why this matters for agents and untrusted code in /blog/ai-agent-isolation-filesystem-network.

NAT and egress: how packets actually leave

A sandbox in a private namespace with a private IP can't route to the internet on its own: the rest of the world has no idea how to reach 10.x addresses living in a namespace on your host. So the last piece is NAT (network address translation). In the host's root namespace, an iptables MASQUERADE rule rewrites the source address of outbound packets from the sandbox's private IP to the host's real address, tracks the connection, and rewrites replies back. (The host also needs IP forwarding enabled, or it won't route the sandbox's packets at all.) This is the same mechanism your home router uses to put a dozen devices behind one public IP. Outbound connections work; unsolicited inbound connections don't, because no translation mapping exists until the guest initiates a connection.

That default — outbound allowed, inbound denied — is a reasonable start, but for untrusted code it is not the finish. NAT lets a guest reach anything its routing allows, which by default is the entire internet. If the threat model includes data exfiltration or a prompt-injected agent phoning home, you want the inverse posture: default-deny egress, then an allowlist of the few destinations the task genuinely needs. The per-sandbox namespace is the right place to enforce that, because its iptables chain governs only that one guest. None of this comes for free from the isolation boundary itself — the namespace gives you the enforcement point; the policy is yours to write.

For untrusted or multi-tenant code, treat egress as default-deny. A per-sandbox namespace stops one guest reaching another, but NAT alone still lets a guest reach the whole internet — and that's the channel exfiltration uses. Allowlist only what the task needs (a package registry, one API), and verify the cloud metadata endpoint (169.254.169.254) is unreachable from inside the guest. The boundary enforces your policy; it doesn't choose it for you.
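A minimal sketch of that default-deny posture, written as a dry run that prints the iptables commands instead of executing them. It assumes traffic from tap0 is routed (not bridged) inside the namespace, so it passes through that namespace's filter-table FORWARD chain; the function name, the namespace name, and the allowlisted address are hypothetical.

```shell
#!/bin/sh
# Sketch only: print default-deny egress rules for one sandbox namespace.
# Assumes routed (not bridged) traffic from tap0 inside the namespace.

egress_rules() {
    ns=$1; shift
    run="ip netns exec $ns iptables"
    # Drop everything that is forwarded unless a later rule accepts it.
    echo "$run -P FORWARD DROP"
    # Let replies to connections the guest opened come back in.
    echo "$run -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
    # Block the cloud metadata endpoint explicitly, ahead of the allowlist.
    echo "$run -A FORWARD -i tap0 -d 169.254.169.254/32 -j DROP"
    # Allow only the listed destinations.
    for cidr in "$@"; do
        echo "$run -A FORWARD -i tap0 -d $cidr -j ACCEPT"
    done
}

# Dry run: 203.0.113.10 is a documentation address standing in for a registry.
egress_rules ns-demo 203.0.113.10/32
```

Printing the rules first makes them easy to review; once they look right, the same lines can be run as root. The explicit metadata rule is redundant under a DROP policy, but it keeps the endpoint blocked even if a broad allowlist entry is added later.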

What the platform automates, in shell

It's worth seeing the moving parts as plain commands, because that demystifies the whole thing. Conceptually, wiring up one sandbox is: create a namespace, create the veth pair and move one end into it, create the TAP inside the namespace, and add the NAT rule on the host. The snippet below is illustrative — it shows the shape of what's happening, not the exact commands PandaStack runs. On a real platform you do not type this per sandbox; the agent does it for you (and, as the next section explains, pre-builds most of it ahead of time).

# ILLUSTRATIVE ONLY — this is the concept the platform automates per sandbox.
# You never run these by hand; the agent does, and pre-builds them in advance.

# 1. A private network namespace for this one sandbox.
ip netns add ns-demo

# 2. A veth pair: vg-demo (guest side) <-> vh-demo (host side).
ip link add vg-demo type veth peer name vh-demo
ip link set vg-demo netns ns-demo            # move guest end into the namespace

# 3. The TAP device Firecracker drives as the guest's NIC, inside the netns.
ip netns exec ns-demo ip tuntap add tap0 mode tap
ip netns exec ns-demo ip link set tap0 up

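# 4. Address the /30 subnet and bring the links up.
ip netns exec ns-demo ip addr add 10.200.0.2/30 dev tap0
ip addr add 10.200.0.1/30 dev vh-demo        # gateway, host side
ip link set vh-demo up
ip netns exec ns-demo ip link set vg-demo up

# 5. NAT outbound traffic so the private IP can reach the world.
sysctl -w net.ipv4.ip_forward=1              # the host must forward packets for NAT to apply
iptables -t nat -A POSTROUTING -s 10.200.0.0/30 -j MASQUERADE
# (For untrusted code, add default-deny egress rules in the filter table's
# FORWARD chain; the nat table is not the place to drop packets.)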

The reason you don't want to run those commands on the create path is cost. Doing ip netns add, building the veth pair, creating the TAP, and installing iptables rules cold takes on the order of 100ms — fine occasionally, ruinous if it's on the latency budget of every sandbox you create. That's the problem PandaStack's NATID design exists to solve.

PandaStack's NATID: pre-allocate the network

PandaStack's networking layer is called NATID, and its core move is to do all of that namespace-and-veth-and-TAP work ahead of time instead of on the hot path. Each agent pre-allocates a pool of 16,384 /30 subnets out of 10.200.0.0/16, handed out as sequential /30 blocks (10.200.0.0/30, 10.200.0.4/30, and so on). A /30 holds exactly two usable addresses — the gateway plus one guest — so there is no room in a sandbox's own subnet for any other guest to even exist. Each slot is pre-built as a complete unit: the network namespace, the veth pair, the TAP device, and the iptables rules, all standing and ready before any sandbox needs them.
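The subnet arithmetic is easy to check by hand. The sketch below maps a slot index to its /30, assuming the gateway takes the first usable address and the guest the second (which matches the 10.200.0.1 and 10.200.0.2 used in the illustrative commands earlier). The function name `natid_slot` is hypothetical, not part of PandaStack.

```shell
#!/bin/sh
# Sketch only: map a slot index (0..16383) to its /30 inside 10.200.0.0/16.
# Assumes gateway = first usable address, guest = second usable address.

natid_slot() {
    slot=$1
    if [ "$slot" -lt 0 ] || [ "$slot" -gt 16383 ]; then
        echo "slot out of range: $slot" >&2
        return 1
    fi
    offset=$((slot * 4))          # each /30 spans four addresses
    third=$((offset / 256))       # third octet of the network address
    fourth=$((offset % 256))      # fourth octet of the network address
    echo "subnet=10.200.$third.$fourth/30" \
         "gateway=10.200.$third.$((fourth + 1))" \
         "guest=10.200.$third.$((fourth + 2))"
}

natid_slot 0        # subnet=10.200.0.0/30 gateway=10.200.0.1 guest=10.200.0.2
natid_slot 1        # subnet=10.200.0.4/30 gateway=10.200.0.5 guest=10.200.0.6
natid_slot 16383    # subnet=10.200.255.252/30 gateway=10.200.255.253 guest=10.200.255.254
```

A /16 holds 65,536 addresses and each /30 uses four of them, which is where the 16,384 figure comes from. The other two addresses in each block are the network and broadcast addresses.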

Because the expensive structural work is already done, allocating networking for a new sandbox is roughly a 9ms patch rather than the ~100ms cold setup of building a namespace from scratch. Configuring the TAP for the specific guest — patching its MAC and routes — is about 6ms of that. This is a large part of why a PandaStack create lands at 179ms p50: the network stage is a fast patch of a pre-built slot, not a from-scratch ip netns add. The mechanics are documented at /docs/concepts/networking-natid and the engineering reference at /docs/internals/networking; the create pipeline as a whole, with per-stage timings, is in /blog/how-firecracker-boots-fast.

Each sandbox's network resources follow a consistent naming scheme keyed to its id, which makes them trivial to find and to clean up:

  • ns-<id> — the sandbox's dedicated Linux network namespace, isolating its entire networking stack from every other sandbox.
  • vh-<id> — the host-side end of the veth pair, living in the root namespace and acting as the gateway.
  • vg-<id> — the guest-side end of the veth pair, inside ns-<id>.
  • tap0 — the TAP device inside the namespace that Firecracker drives as the guest's virtio-net NIC.
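The naming scheme above can be sketched as a small helper. One practical constraint worth encoding: Linux interface names are limited to 15 characters, so with a three-character prefix such as `vh-` the id can be at most 12 characters. The function name and the example id are hypothetical; the post does not say what PandaStack's ids look like.

```shell
#!/bin/sh
# Sketch only: derive the per-sandbox resource names from an id.
# Interface names max out at 15 characters, so the id is capped at 12 here.

slot_names() {
    id=$1
    if [ "${#id}" -lt 1 ] || [ "${#id}" -gt 12 ]; then
        echo "id must be 1 to 12 characters: '$id'" >&2
        return 1
    fi
    echo "netns=ns-$id host_veth=vh-$id guest_veth=vg-$id tap=tap0"
}

slot_names a1b2c3   # netns=ns-a1b2c3 host_veth=vh-a1b2c3 guest_veth=vg-a1b2c3 tap=tap0
```

Because tap0 sits inside its own namespace, the same device name can be reused for every sandbox without collisions; only the veth ends and the namespace need the id.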

Baked guest identity: why MAC and routes get patched

There's a wrinkle that comes from the fact that PandaStack creates sandboxes by restoring a snapshot, not by cold-booting. When a template is first baked, the guest boots, configures its network interface, and gets frozen with a specific IP, MAC, and gateway recorded in its snapshot. Every later create restores that same frozen guest — which means the guest wakes up believing it still has the exact IP and MAC it had at bake time. The host has to make that belief true.

So on restore, the agent patches the host side to match the baked values: it sets the TAP's MAC and the routes so that the restored guest's frozen network identity lines up with the namespace it's now placed in. The guest does no reconfiguration — it has no idea it was ever frozen — and from its perspective the network simply works, with the identity it remembers. This is the networking half of the snapshot-restore boot path: the guest's view of its NIC is preserved exactly, and the host quietly rewires the slot underneath it. The broader snapshot-restore story is in /blog/how-firecracker-boots-fast.

The whole path, end to end

Put it together and a packet from a PandaStack sandbox travels: guest userspace → guest kernel → virtio-net eth0 → (Firecracker) → tap0 inside ns-<id> → vg-<id> → vh-<id> in the root namespace → iptables NAT → the host's real network. Every hop after the TAP is host-controlled and per-sandbox: a private namespace so guests can't see each other, a veth pair as the only door out, per-namespace iptables rules where you enforce egress policy, and NAT on the way out. The guest sees none of it, just a normal-looking NIC that happens to be a polite illusion.

The reason all of this is worth doing per sandbox, rather than sharing one bridge, is the threat model: untrusted and multi-tenant code must not be able to reach another tenant's traffic, your internal services, or the metadata endpoint, and must not be able to exfiltrate freely. Per-sandbox namespaces make the first structural and take guests off the segment your internal services sit on; for the metadata endpoint and for exfiltration they give you the enforcement point, and the egress rules you write there do the rest. NATID's contribution is making that isolation cheap enough (pre-allocated slots, a ~9ms patch instead of ~100ms of cold setup) that a fully isolated network is the default on every create, not a luxury you ration. PandaStack's core is open source under Apache-2.0, so you can stand up the agent on your own Linux KVM hosts and watch the namespaces and TAPs get created yourself. For the design reference, start at /docs/internals/networking.

Frequently asked questions

How does networking work for a Firecracker microVM?

Inside the VM, the guest sees a paravirtualized virtio-net device (typically eth0) with a MAC and a link it brings up like any normal NIC. On the host, the other end of that link is a TAP device — a virtual Layer 2 interface backed by Firecracker. Frames the guest sends appear on the TAP as ordinary Linux traffic, where the host stack routes, NATs, and filters them. Firecracker implements no network policy itself; it just connects the guest's virtio-net device to a TAP and leaves all routing, NAT, and isolation to the host's networking stack around that TAP.

What is the TAP device in Firecracker networking?

A TAP is a virtual Ethernet (Layer 2) interface provided by the Linux kernel that's backed by a userspace program instead of physical hardware. For a microVM, Firecracker is that program: when the guest transmits a frame on its virtio-net NIC it surfaces on the host's TAP device, and frames written to the TAP arrive at the guest. The TAP is the seam between the virtualized guest and the host's normal Linux network stack — once traffic is on the TAP, the kernel can route it, NAT it, or filter it with iptables like any other interface.

Why put each microVM in its own network namespace?

If every sandbox's TAP shares one Linux bridge, all the guests sit on the same Layer 2 segment — so one can ARP for and reach another, and all of them can scan for your internal services or the cloud metadata endpoint. Putting each sandbox's TAP in its own network namespace gives it a private, isolated copy of the networking stack (its own interfaces, routing table, and iptables rules). Guests in different namespaces can't address each other's traffic, and the per-namespace rule set is exactly where you enforce egress policy. For untrusted or multi-tenant code, that isolation is the difference between a sandbox and a cross-tenant leak.

How does PandaStack's NATID networking make per-sandbox isolation fast?

NATID pre-allocates a pool of 16,384 /30 subnets per agent out of 10.200.0.0/16, building each slot — network namespace, veth pair, TAP device, and iptables rules — ahead of time rather than on the create path. Building all of that cold costs around 100ms; allocating a pre-built slot is roughly a 9ms patch (about 6ms of which is configuring the TAP's MAC and routes for the specific guest). That pre-allocation is a big reason a PandaStack create lands at 179ms p50. Because PandaStack restores a snapshot rather than cold-booting, the agent also patches the TAP MAC and routes on restore to match the IP/MAC/gateway frozen into the guest at bake time, so the restored guest sees the network identity it remembers.

Does network isolation stop an untrusted sandbox from exfiltrating data?

Not by itself. A per-sandbox network namespace stops one guest from reaching another and from sitting on your internal segment, but the default NAT setup still lets a guest reach the whole internet outbound — and that outbound path is exactly what exfiltration uses. To close it, run default-deny egress and allowlist only the destinations the task needs, and verify the cloud metadata endpoint (169.254.169.254) is unreachable from the guest. The namespace gives you a clean per-sandbox enforcement point; the egress policy itself is yours to write.

Written by Ajay Kumar, Founder, PandaStack.