How to Optimize MicroVM Cold Start

Ajay Kumar · June 27, 2026 · 9 min read

Cold start is the latency between "I asked for a machine" and "the machine will run my code." For a microVM that means: allocate networking, lay down a disk, launch the VMM, get a kernel and userspace to a ready state, and confirm the guest is reachable. Every one of those steps is a place to lose tens of milliseconds — or, if you do it naively, tens of seconds. This post is a practical tour of the techniques that actually move the number, in roughly the order they matter, with the mechanism behind each one rather than a list of flags to copy.

The single most important idea up front: the largest cold-start win is not making the boot faster. It is not booting. Everything else in this post is a refinement once you've made that one structural choice. PandaStack creates a fresh Firecracker microVM with a median latency of 179ms (p99 ~203ms), and that number is a snapshot restore, not a kernel boot — the cold boot happens once per template (~3s) and is then amortized away. Let's start there, because it's the lever that dwarfs the rest.

The ranking that matters: snapshot-restore is the structural win (seconds → tens of ms). The other five techniques are what keep a restore fast and make it work across a fleet — they shave the remaining milliseconds and remove the eager downloads. Get #1 right first.

The techniques, ranked by impact

If you only remember one ordering, remember this one. The further up the list, the more cold-start time it removes:

1. Snapshot-restore instead of cold boot: the structural win. Boot once, snapshot a running machine, restore per request. Turns a multi-second boot into a tens-of-milliseconds restore. Nothing else comes close.
2. Copy-on-write rootfs (reflink): make the per-create disk clone O(metadata) instead of a full image copy, so create cost is constant regardless of image size.
3. MAP_PRIVATE memory + lazy page-in: map the snapshot's memory file copy-on-write so restore doesn't read the whole RAM image up front; pages fault in only as the guest touches them.
4. Prewarmed network namespaces: build the netns + veth + tap + iptables ahead of time so the hot path grabs a ready slot instead of paying ~100ms to construct one cold.
5. Minimal kernel + tiny device model: fewer drivers to probe and no firmware phase, which makes the one cold boot you do pay for cheap, and shrinks the snapshot's working set.
6. userfaultfd / on-demand memory streaming: stream the RAM image from object storage so a fresh host can restore without first downloading multiple gigabytes of vm.mem.

1. Snapshot-restore instead of cold boot

A cold boot starts a computer. The kernel initializes, drivers probe hardware, init brings up userspace, your service starts and binds its port. Even a lean microVM is doing real work there, and your application's own startup often dominates. A snapshot restore does none of it. A Firecracker snapshot serializes a running microVM at one instant — the guest's entire physical RAM (vm.mem) plus the VMM state (vm.state): vCPU registers, the interrupt controller, the clock, every virtio device's config. Restoring it maps that memory back and resumes the vCPUs mid-instruction. The kernel was already up, the page cache already warm, your process already running and listening. You're resuming a paused machine, not building one.

This is why the shape of the latency changes so completely. Cold boot scales with how much your kernel and app have to do to become ready. Restore scales with how fast you can map a file and unpause CPUs. The first is seconds; the second is tens of milliseconds. On PandaStack the snapshot-load step itself is roughly 49ms, and the whole create — networking, disk, VMM launch, load, resume, readiness probe — lands at 179ms p50. The cold boot still exists, but you pay it once per template (~3s) when you bake the snapshot, then amortize it across every create after.
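The amortization argument is easy to check with the figures quoted above. The sketch below is only arithmetic: it takes the ~3s bake and the 179ms p50 restore as given, and the helper name is illustrative rather than anything from PandaStack.

```python
# Figures quoted in this post; the helper name is illustrative.
BAKE_MS = 3000      # one cold boot per template (~3s), paid at bake time
RESTORE_MS = 179    # p50 create via snapshot restore


def amortized_create_ms(creates: int) -> float:
    """Average cost per create once the one-time bake is spread over `creates`."""
    if creates < 1:
        raise ValueError("creates must be >= 1")
    return RESTORE_MS + BAKE_MS / creates


for n in (1, 10, 100, 10_000):
    print(f"{n:>6} creates -> {amortized_create_ms(n):.1f} ms per create")
```

After about a hundred creates the bake contributes roughly 30ms per create, and it keeps shrinking toward zero from there.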

Here's the difference sketched in shell. The cold path boots a kernel and waits for the app; the warm path is captured once, then every subsequent start is a load-and-resume:

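# fc: small helper around the Firecracker API socket (method, path, JSON body)
fc() { curl -sS --unix-socket /run/fc.sock -X "$1" "http://localhost$2" -H 'Content-Type: application/json' -d "$3"; }

# COLD BOOT (the slow path you want to do exactly once)
# kernel init + driver probe + userspace + your app starting
firecracker --api-sock /run/fc.sock &
fc PUT /boot-source '{"kernel_image_path": "vmlinux"}'
fc PUT /drives/rootfs '{"drive_id": "rootfs", "path_on_host": "rootfs.ext4", "is_root_device": true, "is_read_only": false}'
fc PUT /network-interfaces/eth0 '{"iface_id": "eth0", "host_dev_name": "tap0"}'
fc PUT /actions '{"action_type": "InstanceStart"}'
wait_for_app_ready        # placeholder for your own readiness check; seconds: init + service startup

# BAKE ONCE: pause the guest, then freeze the running, ready machine to disk
fc PATCH /vm '{"state": "Paused"}'
fc PUT /snapshot/create '{"snapshot_type": "Full", "snapshot_path": "vm.state", "mem_file_path": "vm.mem"}'

# WARM RESTORE (every create from now on, in a fresh firecracker process)
firecracker --api-sock /run/fc.sock &
fc PUT /snapshot/load '{"snapshot_path": "vm.state", "mem_backend": {"backend_type": "File", "backend_path": "vm.mem"}}'   # ~tens of ms
fc PATCH /vm '{"state": "Resumed"}'    # vCPUs unpaused
# guest is already booted, app already listening: no init, no reboot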

Re-baking a template (new kernel, new rootfs contents) invalidates the old snapshot — the next spawn cold-boots and bakes again. That's correct behavior, not a bug: a snapshot is only valid against the exact disk and kernel it was frozen from. Treat the bake as part of your build, not your request path.
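One way to make that invalidation rule mechanical is to key each baked snapshot by the exact kernel and rootfs it was frozen from. This is a minimal sketch under that assumption: the function names and the choice of SHA-256 are illustrative, not a description of how PandaStack tracks templates.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in 1 MiB chunks so large images don't fill RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def snapshot_key(kernel: Path, rootfs: Path) -> str:
    """Identity of the (kernel, rootfs) pair a snapshot was frozen from."""
    combined = f"{file_digest(kernel)}:{file_digest(rootfs)}"
    return hashlib.sha256(combined.encode()).hexdigest()


def needs_rebake(kernel: Path, rootfs: Path, baked_key: Optional[str]) -> bool:
    """True if there is no snapshot yet, or the inputs changed since the bake."""
    return baked_key != snapshot_key(kernel, rootfs)
```

Hashing a multi-gigabyte rootfs is slow, which is another reason to run this check in the build step and store the key next to the snapshot instead of computing it per request.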

2. Copy-on-write rootfs (reflink)

Every sandbox needs its own writable disk, but copying a multi-gigabyte rootfs per create would obliterate your latency budget and your disk. The fix is a copy-on-write clone. On XFS, a reflink (cp --reflink) creates a new file that shares the original's data blocks and only allocates new blocks when the clone is written to. The clone is O(metadata) — it copies the block map, not the blocks — so it's effectively constant-time no matter how big the image is. dm-snapshot gives the equivalent at the device-mapper layer.

Why this matters for cold start: it makes the disk stage of create a fixed, tiny cost (single-digit milliseconds) instead of something that grows with your image. The template's rootfs stays read-mostly and shared across every clone, so you also get better host page-cache behavior because the common blocks are hot. The catch worth internalizing: CoW cloning needs local storage. A reflink only works between files on the same reflink-capable filesystem, and dm-snapshot needs a local block device, which is why the rootfs always has to be a local file even when other artifacts can live remotely. There's a deeper treatment in /blog/copy-on-write-rootfs.
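For a programmatic equivalent of cp --reflink, Linux exposes the FICLONE ioctl. The sketch below assumes a Linux host; the function name and the fallback to a full copy are illustrative choices so the example also runs on filesystems without reflink support, where the constant-time property obviously does not hold.

```python
import fcntl
import shutil

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int)
FICLONE = 0x40049409


def clone_rootfs(template: str, clone: str) -> str:
    """Clone `template` to `clone`; return "reflink" or "copy" to say how."""
    with open(template, "rb") as src, open(clone, "wb") as dst:
        try:
            # Shares all data blocks with the template; only metadata is written.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            pass  # not a reflink-capable filesystem (e.g. ext4, tmpfs)
    # Fallback for illustration only: O(image size), not O(metadata).
    shutil.copyfile(template, clone)
    return "copy"
```

Either way the clone is independently writable: writes to it allocate new blocks and never touch the template.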

3. MAP_PRIVATE memory + lazy page-in

A 2 GiB guest produces a 2 GiB memory file. If restore had to read all of it before the guest could run, restore latency would scale with RAM size and the whole snapshot advantage would erode. It doesn't, because the memory file is mapped with MAP_PRIVATE — a private, copy-on-write mapping. The kernel doesn't copy the file in; it sets up the mapping and faults pages in lazily, only as the guest actually touches them. Writes get a private copy so the underlying snapshot file is never mutated and can back many restores at once.

The leverage here is the working-set effect: a guest with 2 GiB of RAM usually touches only a few hundred megabytes to reach a ready state. Lazy page-in means you pay for the working set, not the allocation. This is the reason the ~49ms snapshot load doesn't grow with guest memory size — restoring a bigger guest maps a bigger file but still only faults in what gets used. It's the same copy-on-write, lazy-by-default philosophy as the reflinked rootfs, applied to memory.
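The copy-on-write behavior is easy to see from user space with Python's mmap module. This is a small sketch, not Firecracker's restore code: the temporary file stands in for vm.mem, and the two mappings stand in for two guests restored from the same snapshot.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def map_private(path):
    """Map a file copy-on-write: pages fault in lazily, writes stay private."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        # No data is read here; the kernel only sets up the mapping.
        return mmap.mmap(f.fileno(), size,
                         flags=mmap.MAP_PRIVATE,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)


def demo():
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"A" * (4 * PAGE))      # stand-in for a memory image
        mem_file = tmp.name
    try:
        guest_a = map_private(mem_file)
        guest_b = map_private(mem_file)
        guest_a[0:5] = b"dirty"           # private copy of the first page only
        with open(mem_file, "rb") as f:
            on_disk = f.read(5)
        result = (bytes(guest_a[0:5]), bytes(guest_b[0:5]), on_disk)
        guest_a.close()
        guest_b.close()
        return result
    finally:
        os.unlink(mem_file)


print(demo())
```

The first mapping sees its own write, while the second mapping and the file on disk still hold the original bytes. That is what lets one snapshot file back many restores at once.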

4. Prewarmed network namespaces

Networking is a sneaky cold-start cost. Building a Linux network namespace from scratch — ip netns add, ip link add for the veth pair, moving an end into the namespace, creating the tap, installing iptables rules — costs on the order of 100ms when done cold. That's more than half a fast restore, spent on plumbing, on the hot path.

So don't do it on the hot path. PandaStack pre-allocates a pool of network slots (netns + veth pair + tap + iptables, fully constructed) so a create grabs a ready slot and only patches the tap's MAC to match the value the guest's snapshot was baked with (a few milliseconds instead of ~100ms). Each agent pre-allocates up to 16,384 /30 subnets (a full /16 worth of slots), and the warm pool is the prebuilt depth sitting in front of that; if it drains, a slot is built on demand rather than failing the request. The principle generalizes beyond networking: anything expensive and reusable should be constructed off the critical path and handed out ready. The networking design is covered in /blog/firecracker-networking-explained.
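The slot arithmetic and the pool behavior can be sketched in a few lines. Everything named here is hypothetical: the 10.0.0.0/16 base network, the NetSlot and NetSlotPool classes, and the stubbed-out build step are assumptions for illustration, not PandaStack's implementation.

```python
import ipaddress
from collections import deque
from dataclasses import dataclass

# Assumed address plan: any private /16 carved into /30s works the same way.
BASE_NET = ipaddress.ip_network("10.0.0.0/16")
ALL_SUBNETS = list(BASE_NET.subnets(new_prefix=30))   # 65,536 / 4 = 16,384


@dataclass
class NetSlot:
    subnet: ipaddress.IPv4Network
    prebuilt: bool   # True if constructed ahead of time, off the hot path


class NetSlotPool:
    def __init__(self, warm_depth: int):
        self._free = iter(ALL_SUBNETS)
        self._warm = deque(self._build(prebuilt=True) for _ in range(warm_depth))

    def _build(self, prebuilt: bool) -> NetSlot:
        # Stand-in for the ~100ms of real work: netns, veth pair, tap, iptables.
        subnet = next(self._free, None)
        if subnet is None:
            raise RuntimeError("no free /30 subnets left")
        return NetSlot(subnet=subnet, prebuilt=prebuilt)

    def acquire(self) -> NetSlot:
        if self._warm:
            return self._warm.popleft()        # hot path: grab a ready slot
        return self._build(prebuilt=False)     # pool drained: build on demand


pool = NetSlotPool(warm_depth=2)
slot = pool.acquire()
host_ip, guest_ip = slot.subnet.hosts()        # a /30 has two usable addresses
print(len(ALL_SUBNETS), slot.subnet, host_ip, guest_ip)
```

A real pool would also refill the warm queue in the background and return slots on delete; both are omitted to keep the hot-path logic visible.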

5. Minimal kernel + tiny device model

This one makes the cold boot you can't fully avoid as cheap as possible — which matters because that bake cost sets a floor, and a smaller working set also makes restores leaner. A conventional VM emulates a whole PC: BIOS/UEFI firmware, a PCI bus, legacy interrupt controllers, emulated disk and network controllers, a VGA console. The guest firmware probes and initializes all of it, then a bootloader loads a general-purpose kernel that re-discovers the same hardware. That discovery-and-init dance is where the seconds go.

No firmware phase — Firecracker loads the guest kernel directly and jumps into it. There is no BIOS/UEFI to sit through.
A minimal virtio device model — net, block, vsock, plus a serial console. No PCI bus to enumerate, no legacy device emulation, no VGA. Almost nothing for the guest to probe.
A stripped, purpose-built guest kernel — compiled with only the drivers a microVM actually has, so kernel init skips the long tail of hardware it will never see.
A small, single-purpose VMM under a jailer — fast to fork+exec, with a deliberately tiny host attack surface.

Fewer drivers to probe is the throughline: every device the guest doesn't have is initialization it doesn't run, and a leaner ready state is a smaller memory working set to fault back in on restore. So this technique compounds with the others — it lowers the cold-boot floor and trims what every restore has to page in.

6. userfaultfd / on-demand memory streaming

Lazy page-in via MAP_PRIVATE assumes vm.mem is already on local disk. In a multi-host fleet it often isn't — snapshots live in object storage so any host can restore any template. Downloading a multi-gigabyte memory image before you can start is pure dead time, and most of those bytes are pages the guest will never touch this boot. userfaultfd deletes that wait by extending the same lazy idea across the network.

userfaultfd is a Linux feature that routes page faults to a user-space handler. Firecracker can restore in UFFD mode: instead of pointing at a local memory file, it hands a userfaultfd to an external process that backs the guest's RAM. When the guest faults on an absent page, the kernel posts the fault to the handler, which fetches the surrounding 4 MiB chunk from object storage over an HTTP Range GET and installs it with UFFDIO_COPY, waking the parked vCPU. The guest never knows the page came over the network. All-zero regions are elided (no fetch), a prefetch trace warms the hot set in the background, and a shared per-host chunk cache makes the first restore pay object-storage latency once while every later restore reads locally. The full mechanics are in /blog/userfaultfd-explained — the cold-start point is that a fresh host can start serving restores without first downloading the whole image.
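The fault-to-chunk mapping described above can be sketched as follows. This covers only the bookkeeping: which 4 MiB chunk a faulting offset lands in, the Range header that would fetch it, and a per-host cache. The fetch callable is a hypothetical stand-in for an object-storage client, and the actual page installation via UFFDIO_COPY is not shown because Python has no standard-library userfaultfd binding.

```python
CHUNK_SIZE = 4 * 1024 * 1024   # 4 MiB, the chunk size quoted above
PAGE_SIZE = 4096


def chunk_for_fault(offset: int, mem_size: int):
    """Return (chunk_index, start, end_inclusive) of the chunk covering `offset`."""
    if not 0 <= offset < mem_size:
        raise ValueError("fault offset outside the memory image")
    index = offset // CHUNK_SIZE
    start = index * CHUNK_SIZE
    end = min(start + CHUNK_SIZE, mem_size) - 1   # last chunk may be short
    return index, start, end


def range_header(start: int, end: int) -> dict:
    """HTTP Range header for an inclusive byte range."""
    return {"Range": f"bytes={start}-{end}"}


class ChunkCache:
    """Per-host cache: the first fault in a chunk fetches it, later faults hit locally."""

    def __init__(self, fetch, mem_size: int):
        self._fetch = fetch          # hypothetical: callable(start, end) -> bytes
        self._mem_size = mem_size
        self._chunks = {}
        self.fetches = 0

    def page_for(self, offset: int) -> bytes:
        index, start, end = chunk_for_fault(offset, self._mem_size)
        if index not in self._chunks:
            self._chunks[index] = self._fetch(start, end)
            self.fetches += 1
        page_start = (offset // PAGE_SIZE) * PAGE_SIZE - start
        return self._chunks[index][page_start:page_start + PAGE_SIZE]
```

A handler built this way pays one network round trip per chunk rather than per page, which is the point of fetching the surrounding chunk instead of a single 4 KiB page.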

Streaming applies to memory, not disk. The rootfs must stay a local file because copy-on-write cloning needs a local block device. userfaultfd removes the big vm.mem download specifically; agents still sync and reflink the rootfs locally.

What a fast create looks like in practice

All six techniques are infrastructure-side; from the caller's perspective the payoff is just that create returns fast and the machine is immediately usable. With the PandaStack SDK a create is a single call that resolves once the guest is actually reachable, so the next line can run a command without polling or retry loops:

from pandastack import Sandbox

# Returns once the microVM is restored and reachable — a snapshot
# restore under the hood, not a cold boot. ~179ms p50.
sbx = Sandbox.create(template="code-interpreter")

# The guest is already booted; run code immediately.
result = sbx.exec("python -c 'print(2 ** 10)'")
print(result.stdout)  # -> 1024

sbx.delete()

There's no warm-pool config to tune and no "wait for ready" loop, because the readiness probe is part of create: it returns success only after confirming the guest's network stack is accepting connections. The 179ms you observe is the sum of the techniques above — prewarmed networking, reflinked rootfs, MAP_PRIVATE load, resume, and the readiness check — with the cold boot already amortized into the template's baked snapshot.

The same primitive shows up everywhere

Once you see create as "restore a snapshot," related operations fall out of the same machinery. A fork snapshots a specific running sandbox and restores that copy-on-write. Same-host forks run in roughly 400–750ms because the parent's memory is resident and the rootfs reflinks locally; cross-host forks take 1.2–3.5s because the artifacts move over the network first. Managed Postgres databases are the honest counterexample: creating one takes 30–90s because it blocks on the database actually bootstrapping and becoming ready, which is real boot work that no snapshot trick can skip. The lesson is to apply snapshot-restore where the expensive part is reaching a reusable ready state, and accept a real boot where the work is genuinely per-instance.

If you want to optimize your own microVM cold start, apply the techniques in order: get off cold boot and onto snapshot-restore first (it's the order-of-magnitude win), make the rootfs copy-on-write, map memory lazily, prewarm your networking, slim the kernel and device model, and reach for userfaultfd streaming once you're running a fleet and the memory-image download becomes the bottleneck. PandaStack's control-plane API and per-host agent are open source under Apache-2.0, so you can read the whole create pipeline and measure these timings on your own Linux KVM hosts. Start with /blog/how-firecracker-boots-fast for the create pipeline step by step.

Frequently asked questions

What is the single biggest way to reduce microVM cold start?

Stop cold-booting. The largest win by far is snapshot-restore: boot a machine once to a ready state, snapshot its memory and device state, and restore that snapshot per request instead of booting fresh. A cold boot scales with kernel init plus your app's startup (seconds); a restore scales with mapping a file and resuming vCPUs (tens of milliseconds). On PandaStack the snapshot load is ~49ms and the full create is 179ms p50 — the ~3s cold boot is paid once per template at bake time and then amortized across every create.

Why doesn't snapshot-restore latency grow with guest RAM size?

Because the memory file is mapped MAP_PRIVATE — a copy-on-write mapping that the kernel pages in lazily — rather than copied up front. Restoring a 2 GiB guest doesn't read 2 GiB off disk before the guest runs; it sets up the mapping and faults pages in only as the guest touches them. Since a guest typically touches only a few hundred megabytes to reach a ready state, you pay for the working set, not the full allocation.

How does copy-on-write rootfs speed up create?

A reflink clone (cp --reflink on XFS, or dm-snapshot) shares the template's data blocks and only allocates new blocks when the clone is written. The clone copies the block map, not the data, so it's O(metadata), effectively constant-time regardless of image size. That turns the per-create disk stage into a fixed single-digit-millisecond cost instead of a full copy that grows with the image. The trade-off is that the rootfs must live on local storage, so it can't be streamed remotely the way memory can.

When should I use userfaultfd memory streaming?

When you run a multi-host fleet and snapshots live in object storage, so a host may need to restore a template whose multi-gigabyte memory image isn't local yet. userfaultfd lets the guest start and pull only the pages it touches over HTTP Range GETs, instead of blocking on a full download. It pays off most when many creates of the same template share a per-host chunk cache. For a single one-off restore on a cold host, a plain local restore can be simpler — measure your working sets and cache hit rates first.

Why are prewarmed network namespaces worth it?

Constructing a Linux network namespace cold — ip netns add, the veth pair, the tap device, iptables rules — costs around 100ms, which can be more than half of a fast restore. Pre-allocating those slots ahead of time lets create grab a ready namespace and just patch the tap MAC to match the baked snapshot, dropping the network stage to a few milliseconds. The general principle: anything expensive and reusable should be built off the critical path and handed out ready.

Written by Ajay Kumar, Founder, PandaStack.