The Snapshot-Restore Boot Path: Every Sandbox in Under 200ms
Most platforms that promise fast sandboxes keep a warm pool: a fleet of idle VMs already booted and waiting, so a create just hands you one. It works, and it's expensive — you pay for RAM and CPU that sit doing nothing, sized for your peak, burning money during every trough. PandaStack takes the opposite bet. There is no warm pool of idle VMs. Every single create restores a baked Firecracker snapshot on demand, and still lands at a 179ms p50 with a roughly 203ms p99. This post walks the actual create pipeline that makes on-demand restore fast enough to skip the pool entirely — allocate a pre-built network slot, copy-on-write the rootfs, fork+exec the VMM, load the snapshot, resume the vCPUs, probe that the guest is up — plus the memory copy-on-write and on-demand streaming that keep the snapshot-load step cheap, and the one-time auto-bake that produces the snapshot in the first place.
Why there's no warm pool
A warm pool exists to hide boot latency. Booting a VM takes seconds, so you boot a bunch ahead of time and keep them parked. The catch is that a parked VM is a fully resident machine: it holds its RAM, it has a scheduled vCPU, and the host carries it whether or not anyone ever uses it. To absorb a burst you have to keep the pool sized for the burst, which means most of the time you're paying for capacity that's idle. Worse, a pool VM has to be reset or recycled before it can serve another tenant, which adds its own latency and its own correctness risk.
Snapshot restore removes the reason the pool exists. If you can produce a ready-to-use machine from cold storage in under 200ms, you don't need to keep one warm. You keep a baked snapshot on disk — a frozen, already-booted machine — and stamp out a copy whenever a create arrives. Between creates, nothing runs. The host's only standing cost is the artifacts on disk and a small pool of pre-built network plumbing, both cheap. That's how PandaStack gets idle cost down to roughly zero while keeping create latency in warm-pool territory.
- Warm pool — keep N idle VMs booted and parked; a create leases one. Fast to hand out, but you pay for N machines' worth of RAM and CPU continuously, you size N for peak, and each VM must be reset between tenants. Idle cost scales with the pool.
- Snapshot-restore on demand (PandaStack) — keep a baked snapshot on disk and a small pool of pre-built network slots; a create restores a fresh copy. No machine runs between creates, so idle cost is ~0; you pay only for sandboxes that actually exist, and every restore starts from the same clean baked baseline.
A warm pool pays to keep machines running so create is fast. Snapshot restore makes create fast enough that you never keep a machine running you don't need.
What "restore" actually means
A Firecracker snapshot is a running microVM frozen at a single instant, serialized to a small set of files: the guest's entire physical RAM (vm.mem), the VMM's device and CPU state (vm.state) — vCPU registers, the in-kernel interrupt controller, the clock, every virtio device's configuration — and the rootfs disk it was running against. Restoring it is not a boot. The guest doesn't run init, doesn't re-probe devices, doesn't start systemd. The kernel was already up, the page cache was already warm, processes were already running, and restore brings all of that back exactly as it was frozen.
Mechanically, restore is closer to "map a memory file and resume the vCPUs" than to "start a computer." That's why it lands in tens of milliseconds rather than seconds, and it's the property the whole no-warm-pool design rests on. The deeper mechanics live in /blog/firecracker-memory-snapshots and /blog/how-firecracker-boots-fast; what matters here is that restore is cheap enough to do on every create.
The create pipeline, step by step
Restoring the snapshot is only one stage of a create. To get from an API call to a sandbox you can run commands in, the agent has to allocate networking, lay down a writable rootfs, launch the VMM, load the snapshot, resume, and confirm the guest is reachable. Here is the pipeline with rough per-stage costs on the fast path. Several stages overlap in practice; these are the contributions, not a strict serial sum.
- Allocate a NATID network slot (~1ms). PandaStack keeps a small warm pool of pre-built Linux network namespaces — netns + veth pair + tap + iptables — so create grabs a ready slot instead of doing ip netns add / ip link add cold, which would cost ~100ms. The address space tops out at 16,384 /30 subnets per agent; the warm pool is just the prebuilt depth in front of that, and if it drains a slot is built on demand. (This is the one thing kept warm — cheap plumbing, not running VMs.)
- Configure the tap device in the namespace (~6ms). The guest's baked snapshot expects a specific IP, MAC, and gateway, so the agent patches the tap's MAC and routes to match the values frozen at bake time. The restored guest sees the exact network identity it remembers.
- Reflink the rootfs (~4ms). The writable disk is a reflink clone of the template's ext4 rootfs image, made on the host's XFS filesystem: an O(metadata) copy-on-write clone. Blocks are shared with the template until the guest writes, so this is effectively constant-time regardless of image size. (dm-snapshot is also supported.)
- Fork+exec Firecracker under the jailer (~25ms). The VMM process starts in its dropped-privilege jail. There's no firmware phase or device enumeration — it comes up ready to be handed a snapshot.
- PUT /snapshot/load (~49ms). Firecracker memory-maps vm.mem and loads vm.state. The memory mapping is lazy and copy-on-write (MAP_PRIVATE), so pages fault in only as the guest touches them rather than being eagerly copied up front, which is why this step doesn't scale with guest RAM size.
- PATCH /vm with state Resumed (~6ms). The vCPUs are unpaused. The guest is now running, mid-instruction, exactly where the snapshot froze it: no reboot, no init.
- Probe TCP :22 (~40ms). The agent confirms the guest's network stack is live and accepting connections before declaring the sandbox ready, so a create only returns success once the box is actually usable.
- Insert the sandbox row (~6ms, async). Bookkeeping in the agent's database, done off the critical path so it doesn't gate readiness.
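The reflink step in the list above can be sketched in Python. This is a minimal illustration, not PandaStack's agent code: the function name clone_rootfs and any paths passed to it are hypothetical, and the fallback to a full byte copy is an assumption added only so the sketch also runs on filesystems without reflink support. FICLONE is the Linux ioctl that cp --reflink uses.

```python
import fcntl
import shutil

# Linux FICLONE ioctl request number (_IOW(0x94, 9, int)).
FICLONE = 0x40049409


def clone_rootfs(template_path: str, dest_path: str) -> bool:
    """Clone template_path to dest_path, sharing blocks when possible.

    Returns True if the filesystem performed a reflink (metadata-only)
    clone, False if this sketch fell back to copying the bytes.
    """
    with open(template_path, "rb") as src, open(dest_path, "wb") as dst:
        try:
            # Ask the filesystem to make dst share src's extents.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # Reflink is not supported here (for example ext4 or tmpfs),
            # so copy the data instead. This path is O(image size).
            shutil.copyfileobj(src, dst)
            return False
```

Either way the clone is independent of the template: writes to the new file never change the original. Only the reflink path has the metadata-only cost the list describes.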
The two stages that dominate are the snapshot load (~49ms) and the readiness probe (~40ms). Everything else — networking, rootfs, VMM launch — is engineered to be small and to overlap, which is how the whole pipeline lands at a 179ms p50 with a tight p99 of ~203ms. None of these stages involves a pre-booted, parked VM; each create builds its machine from cold artifacts and warm plumbing.
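The readiness probe, one of the two dominant stages, amounts to retrying a TCP connect until the guest accepts it. The sketch below shows one way to write such a probe. The function name wait_for_tcp, the deadline, the per-attempt timeout, and the back-off interval are all illustrative assumptions, not values taken from the agent.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, deadline_s: float = 2.0,
                 attempt_timeout_s: float = 0.05) -> float:
    """Poll host:port until a TCP connect succeeds; return elapsed seconds.

    Raises TimeoutError if the port is still not accepting connections
    once deadline_s has passed.
    """
    start = time.monotonic()
    while True:
        try:
            # A completed handshake means the guest's network stack is up
            # and something is listening on the port.
            with socket.create_connection((host, port),
                                          timeout=attempt_timeout_s):
                return time.monotonic() - start
        except OSError:
            if time.monotonic() - start >= deadline_s:
                raise TimeoutError(
                    f"{host}:{port} not reachable after {deadline_s}s")
            time.sleep(0.005)  # short back-off before the next attempt
```

A caller would invoke something like wait_for_tcp(guest_ip, 22) after the resume step and report the sandbox as ready only when it returns.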
// The agent drives Firecracker's HTTP API over its Unix socket.
// Step 5 of the pipeline: load the baked snapshot into a fresh VMM.
// PUT /snapshot/load
{
  "snapshot_path": "/seed/base/vm.state",
  "mem_backend": {
    "backend_type": "File",              // map local vm.mem copy-on-write (MAP_PRIVATE)
    "backend_path": "/seed/base/vm.mem"
  },
  "resume_vm": false                     // resume is a separate PATCH so the agent
}                                        // can patch device state first
// Step 6: unpause the vCPUs. The guest resumes mid-instruction.
// PATCH /vm -> { "state": "Resumed" }

Memory copy-on-write: why restore is cheap
The naive mental model of "load a 2 GiB snapshot" is "read 2 GiB off disk into RAM, then run." If that were true, on-demand restore would be too slow to skip the warm pool, and the whole design would collapse. It isn't true. When Firecracker loads the File backend, it memory-maps vm.mem with MAP_PRIVATE, and that flag is doing the heavy lifting.
A MAP_PRIVATE mapping is lazy and copy-on-write at page granularity. Nothing is eagerly copied at load time — the mapping just establishes that the guest's physical RAM is backed by the file. A read fault on a page the guest hasn't touched since restore is resolved by the kernel mapping that 4 KiB page straight from the file, no copy; multiple guests restoring the same snapshot share those identical pages in the page cache. A write fault triggers the copy-on-write: the kernel makes a private 4 KiB copy of just that one page, and the original stays pristine for everyone else.
The consequence is that a create pays, in time and RAM, only for the pages the guest actually touches between resume and ready — its working set — not for the full image. That's what makes restore-on-demand viable: the cost of materializing a fresh machine is the cost of its working set, not the cost of its declared memory. It's the same copy-on-write trick that makes fork() of a Unix process cheap, applied to an entire VM's RAM.
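The MAP_PRIVATE behavior is easy to observe outside Firecracker. The sketch below (Unix only) maps a small temporary file twice with MAP_PRIVATE, writes through one mapping, and checks what the other mapping and the file on disk see. The file here is a stand-in for vm.mem, and the function name is hypothetical.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def private_write_demo() -> tuple[bytes, bytes, bytes]:
    """Return (bytes seen by writer, by a second mapping, on disk)."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"S" * (2 * PAGE))  # stand-in for the snapshot memory file
        f.flush()

        prot = mmap.PROT_READ | mmap.PROT_WRITE
        # Two independent private mappings, like two guests restoring
        # the same snapshot.
        guest_a = mmap.mmap(f.fileno(), 2 * PAGE,
                            flags=mmap.MAP_PRIVATE, prot=prot)
        guest_b = mmap.mmap(f.fileno(), 2 * PAGE,
                            flags=mmap.MAP_PRIVATE, prot=prot)

        # The write faults, and the kernel gives guest_a a private copy
        # of just this page.
        guest_a[0:4] = b"AAAA"

        seen_by_a = guest_a[0:4]
        seen_by_b = guest_b[0:4]
        f.seek(0)
        on_disk = f.read(4)

        guest_a.close()
        guest_b.close()
        return seen_by_a, seen_by_b, on_disk


print(private_write_demo())  # (b'AAAA', b'SSSS', b'SSSS')
```

Only the writer sees its change. The second mapping and the file both stay pristine, which is the property that lets many restores share one baked vm.mem.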
When the memory file isn't local: streaming with userfaultfd
The pipeline above assumes the snapshot's memory file is already on the host's disk. On a fresh agent, or one that has never served a given template, vm.mem might live in object storage instead. Downloading a multi-gigabyte memory image before you can restore would blow the latency budget and reintroduce exactly the kind of dead time the no-pool design exists to avoid. So PandaStack can stream it on demand.
This uses userfaultfd, a Linux kernel feature that delivers page-fault events to a user-space handler. Instead of pointing Firecracker at a local memory file, the agent points it at a Unix socket served by the agent's page-fault handler; Firecracker registers guest memory with a userfaultfd and passes that descriptor over the socket, so the agent backs guest memory itself. The flow per fault: the guest touches a page that isn't resident → the kernel raises a fault and parks the faulting vCPU → userfaultfd hands that fault to the agent's handler → the handler translates the address into an offset in vm.mem and fetches the surrounding 4 MiB chunk from object storage over an HTTP Range GET → the handler installs it with UFFDIO_COPY, which wakes the parked vCPU. The guest only ever pulls the pages it actually touches, and it never knows a page came over the network.
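The address-translation step in that flow is plain arithmetic. The sketch below shows one way to turn a faulting address into the 4 MiB chunk that covers it and the matching HTTP Range header. It assumes guest memory is a single contiguous region that starts at offset 0 of vm.mem. The function and parameter names are hypothetical.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, the chunk size the post describes


def chunk_range_for_fault(fault_addr: int, region_base: int,
                          mem_size: int) -> tuple[int, int, str]:
    """Map a faulting address to (chunk_start, chunk_len, range_header).

    fault_addr  -- address reported by the userfaultfd event
    region_base -- address where guest memory is mapped in the VMM
    mem_size    -- size of vm.mem in bytes
    """
    offset = fault_addr - region_base
    if not 0 <= offset < mem_size:
        raise ValueError("fault address outside the guest memory region")

    # Round down to the chunk boundary; clip the final chunk to the file.
    chunk_start = (offset // CHUNK_SIZE) * CHUNK_SIZE
    chunk_len = min(CHUNK_SIZE, mem_size - chunk_start)

    # HTTP byte ranges are inclusive on both ends.
    last_byte = chunk_start + chunk_len - 1
    return chunk_start, chunk_len, f"bytes={chunk_start}-{last_byte}"


base = 0x7F0000000000
print(chunk_range_for_fault(base + 5 * 1024 * 1024 + 123, base, 2 * 1024**3))
# (4194304, 4194304, 'bytes=4194304-8388607')
```

A fault a little past the 5 MiB mark therefore pulls the whole second chunk (4 MiB to 8 MiB), so neighboring faults in that range need no further fetch.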
Two optimizations keep streaming close to local-disk speed for the common case. All-zero chunks are elided entirely — a fault in a zeroed region is served with a zero-fill and no fetch — and a prefetch trace recorded at bake time replays the hot page set in the background so chunks are often already in a shared per-host cache by the time a vCPU faults. The first restore on a host pays object-storage latency once; later restores of the same template read locally. Be precise about scope, though: userfaultfd streams memory, not the disk. The rootfs always has to be a local file because copy-on-write disk cloning (reflink or dm-snapshot) needs a local block device. Streaming removes the big vm.mem download specifically. The full mechanics are in /blog/userfaultfd-explained.
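The two optimizations can be sketched together as a small chunk source. Everything here is a hypothetical stand-in: the class name, the fetch callable (which would perform the Range GET), the set of known-zero chunk indexes, and the in-memory dict playing the role of the shared per-host cache. The post does not say how zero pages are installed, so the zero-fill below simply returns a zeroed buffer.

```python
from typing import Callable, Dict, Iterable, Set

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB chunks, as in the post


class ChunkSource:
    """Serve vm.mem chunks, skipping fetches for zero and cached chunks."""

    def __init__(self, fetch: Callable[[int], bytes], zero_chunks: Set[int]):
        self._fetch = fetch             # hypothetical remote fetch by chunk index
        self._zero = zero_chunks        # chunk indexes known to be all zero
        self._cache: Dict[int, bytes] = {}  # stand-in for the per-host cache
        self.fetches = 0                # remote fetches actually performed

    def get(self, index: int) -> bytes:
        if index in self._zero:
            return bytes(CHUNK_SIZE)    # zero-fill, no network round trip
        if index not in self._cache:
            self._cache[index] = self._fetch(index)
            self.fetches += 1
        return self._cache[index]

    def prefetch(self, trace: Iterable[int]) -> None:
        """Replay a recorded hot-chunk trace so later faults hit the cache."""
        for index in trace:
            self.get(index)
```

With a trace replayed ahead of time, a fault on a hot chunk is served from the cache and a fault in a zeroed region never touches the network. Only a miss outside both sets pays object-storage latency.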
Where the snapshot comes from: auto-bake
Restore needs a snapshot to restore, and a snapshot needs a booted machine to capture. The very first time a template is ever spawned on an agent, there's no snapshot yet, so PandaStack does a real cold boot — the full Firecracker boot plus the guest's own userspace coming up to a ready state. That takes on the order of 3 seconds. Once the guest is ready, the agent bakes it: it freezes the running machine's memory and device state to disk and records the network identity it booted with.
From that point on, every create of that template restores the baked snapshot through the fast path. The ~3s cold boot is a one-time cost amortized across every subsequent create. Re-baking a template — a new kernel, new rootfs contents — invalidates the old snapshot and triggers a fresh cold-boot-and-bake on the next spawn. So the honest description of the system is: it cold-boots rarely and restores constantly, and the sub-200ms figure describes the constant case. Templates can also be pre-baked at agent startup so the very first user-facing create already hits the fast path.
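The bake-once, restore-after decision can be sketched as a lookup keyed on what the template was built from. How PandaStack actually detects that a template changed is not described here. The fingerprint over a kernel version and a rootfs digest, and all names below, are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


def fingerprint(kernel_version: str, rootfs_digest: str) -> str:
    """Hypothetical identity of a template build."""
    data = f"{kernel_version}\n{rootfs_digest}".encode()
    return hashlib.sha256(data).hexdigest()


@dataclass
class SnapshotStore:
    # template name -> fingerprint of the build that was baked
    baked: Dict[str, str] = field(default_factory=dict)

    def create_path(self, template: str, kernel_version: str,
                    rootfs_digest: str) -> str:
        """Return which path a create would take, baking if needed."""
        fp = fingerprint(kernel_version, rootfs_digest)
        if self.baked.get(template) == fp:
            return "restore"            # fast path: snapshot is current
        # No snapshot yet, or the template changed: cold boot, then bake.
        self.baked[template] = fp
        return "cold-boot-and-bake"


store = SnapshotStore()
print(store.create_path("base", "6.1", "rootfs-v1"))  # cold-boot-and-bake
print(store.create_path("base", "6.1", "rootfs-v1"))  # restore
print(store.create_path("base", "6.1", "rootfs-v2"))  # cold-boot-and-bake
```

Pre-baking at agent startup corresponds to calling the cold-boot branch for each known template before any user-facing create arrives.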
What this looks like from the SDK
All of the above is invisible to the caller. From the PandaStack SDK a create is a single call that typically returns a ready sandbox in under 200ms (179ms p50, ~203ms p99), and the next line can already run a command in it, because by the time create returns, the readiness probe has confirmed the guest is up.
from pandastack import Sandbox
# No warm pool is consulted. This restores the baked "base" template
# snapshot on demand: p50 ~179ms (~49ms of that is the snapshot-load step),
# ~203ms p99. Nothing was kept running waiting for you.
box = Sandbox.create(template="base", ttl_seconds=3600)
# create() already probed the guest's network stack, so it's usable now.
out = box.exec("python -c 'print(2 ** 10)'")
print(out.stdout) # 1024
# Between this create and the next, no machine of yours is running on the
# host. You pay for the sandbox that exists, not for a pool that might.
box.exec("echo 'idle cost between creates: ~0'")

The first create of a brand-new template on an agent may hit the ~3s cold-boot-and-bake; every create after that takes the restore path. There's no flag to flip and no pool to size: the system bakes once and restores forever.
The summary
PandaStack keeps no warm pool of idle VMs. Every create restores a baked Firecracker snapshot on demand through a pipeline that's been shaved to the bone: grab a pre-built NATID network slot (~1ms), patch the tap (~6ms), reflink the rootfs copy-on-write (~4ms), fork+exec the jailed VMM (~25ms), PUT /snapshot/load to map vm.mem with MAP_PRIVATE (~49ms), resume the vCPUs (~6ms), and probe the guest is reachable (~40ms). The memory map is lazy and copy-on-write, so a create only pays for the pages the guest actually touches; userfaultfd extends that to streaming vm.mem from object storage when it isn't local. The snapshot itself comes from a one-time ~3s cold boot and auto-bake, after which the system cold-boots rarely and restores constantly. The result is a 179ms p50 create with idle cost near zero: warm-pool speed without the warm-pool bill.
Every sandbox, managed Postgres database, and git-driven app on PandaStack runs on exactly this path, and the core is open source under Apache-2.0 — so you can run the control-plane API and per-host agent on your own Linux KVM hosts and watch the create timings yourself. For the boot-vs-restore distinction start with /blog/how-firecracker-boots-fast; for the memory mechanics, /blog/firecracker-memory-snapshots; for streaming, /blog/userfaultfd-explained.
Frequently asked questions
Does PandaStack keep a warm pool of idle VMs?
No. There is no warm pool of idle VMs. Every create restores a baked Firecracker snapshot on demand. The only thing kept warm is a small pool of pre-built Linux network slots (netns + veth + tap + iptables), which is cheap plumbing, not running machines. Between creates nothing runs, so idle cost is roughly zero — you pay only for sandboxes that actually exist, not for a pool sized to your peak.
How does a create land under 200ms without a warm pool?
By restoring a frozen machine instead of booting one. The create pipeline allocates a pre-built NATID network slot (~1ms), patches the tap device (~6ms), reflinks the rootfs copy-on-write (~4ms), fork+execs Firecracker under the jailer (~25ms), sends PUT /snapshot/load to map vm.mem and load device state (~49ms), resumes the vCPUs (~6ms), and probes TCP :22 to confirm the guest is reachable (~40ms). These stages overlap, and the whole create lands at a 179ms p50 with a ~203ms p99.
Why doesn't the snapshot-load step get slower for guests with more RAM?
Because Firecracker maps vm.mem with MAP_PRIVATE, which is lazy and copy-on-write at the page level. At load time nothing is copied — the mapping just points the guest's RAM at the file. A page faults in only when the guest first touches it, and a write triggers a private copy of just that 4 KiB page. So a create pays only for the working set the guest actually touches between resume and ready, not the full image, which is why the ~49ms load step doesn't scale with declared guest memory.
Where does the baked snapshot come from?
From a one-time auto-bake. The first time a template is spawned on an agent there's no snapshot, so PandaStack does a real cold boot — the full Firecracker boot plus userspace coming up — which takes about 3 seconds. The agent then freezes that ready machine to disk as a snapshot. Every create after that restores the snapshot in under 200ms. Re-baking a template invalidates the old snapshot and triggers a fresh cold-boot-and-bake on the next spawn, so the system cold-boots rarely and restores constantly.
How can an agent restore a snapshot whose memory file isn't local?
By streaming the memory on demand with userfaultfd. Instead of a local memory file, Firecracker is pointed at a Unix socket served by the agent's page-fault handler and passes it a userfaultfd covering guest memory, so the agent backs guest memory itself. When the guest faults on a page, the kernel parks the vCPU and posts the fault to the agent, which fetches the surrounding 4 MiB chunk from object storage over an HTTP Range GET and installs it with UFFDIO_COPY. All-zero chunks are elided with no fetch, and a prefetch trace warms the hot set into a shared per-host cache. userfaultfd streams memory only; the rootfs still has to be a local file for copy-on-write disk cloning.