Thaw: How We Made a Cold Start Take 164ms
Every scale-to-zero platform makes the same trade. Scale to zero and you pay nothing while idle — but the next request eats a cold start. The whole game for a decade has been: how do you avoid making the user wait while a machine starts up?
Vercel's Fluid compute is the most polished answer on the avoid-the-cold-start side. Per their docs, it reduces cold starts through bytecode caching and function pre-warming on production, and it shares a single instance across concurrent invocations — reusing idle resources before scaling up. It bills active CPU (the time your code is executing) plus provisioned memory. Fluid doesn't make a cold start fast so much as arrange for you to rarely hit one. That's a genuinely good product decision, and for a lot of workloads it's the right one. (I'm describing it from Vercel's public docs — verify against theirs, they move fast.)
We wanted the other thing. Actually zero — the sandbox deleted, nothing left to run or bill while idle — and a cold start fast enough that you stop caring it happened. This post is how we got a real, deployed app to wake from a complete stop in about 164 milliseconds. We call it Thaw.
Start with the two honest numbers
Before Thaw, a scaled-to-zero app on PandaStack woke by cold-booting. We delete the whole sandbox when an app goes idle — CPU, RAM, and disk all freed, nothing left running — and on the next request we boot a fresh microVM from the app's baked disk image, pulled from object storage. That's true scale-to-zero, and it's honest: $0 while idle. But the wake was about 54 seconds. Break it down and most of it is unavoidable with a cold boot: roughly 26s to pull the multi-gigabyte image, ~8s for the kernel and init, then ~20s for the app process to start and bind its port.
The other number we already had was 64 milliseconds. That's how long it takes us to restore a baked Firecracker snapshot for a base template — the same snapshot-restore path that gives our sandboxes a 179ms p50 create with no warm pool. A snapshot restore isn't a boot. The kernel is already up, init has already run, the process is already in memory. You're mapping a frozen machine back into existence, not building one.
So the question wrote itself. App wake was 54 seconds because it was a boot. Snapshot restore was 64 milliseconds because it wasn't. What would it take to make app wake a restore?
Why apps couldn't just use snapshots already
Because we'd been burned by exactly this, and the scar tissue was load-bearing. Earlier in PandaStack's life an app's deploy did capture a memory snapshot — and it caused a genuinely nasty incident. The app would sleep, wake, serve perfectly for the smoke test, and then start throwing filesystem I/O errors hours later. The cause was subtle and worth stating precisely, because it's the whole reason Thaw is designed the way it is.
A Firecracker memory snapshot captures RAM and CPU state — and RAM includes the kernel's page cache, the in-memory copy of disk blocks the guest has touched. The snapshot recorded a pointer to the rootfs, not its bytes. Sleep then deleted that rootfs. On wake we restored the frozen page cache over a blank template disk. For a while everything served from RAM and looked fine. The instant the guest had to read a block that wasn't cached — or write one back — it hit a disk that no longer matched what memory believed was there. ext4 errors, 500s, hours after a green smoke test.
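The failure mode above can be reproduced with a toy model. This is a minimal sketch, not our real restore path: `ToyGuest`, the block numbers, and the block contents are all invented for illustration, and a real page cache is far more involved. It shows only the core mismatch, where cached reads look healthy and the first uncached read does not.

```python
# Toy model of the incident: a guest whose page cache outlives its disk.
# ToyGuest, block numbers, and contents are hypothetical, for illustration only.

class ToyGuest:
    def __init__(self, disk, page_cache=None):
        self.disk = disk                          # block number -> bytes on the virtual disk
        self.page_cache = dict(page_cache or {})  # block number -> bytes held in RAM

    def read(self, block):
        # Cached blocks are served from RAM without touching the disk.
        if block in self.page_cache:
            return self.page_cache[block]
        # Uncached blocks come from whatever disk the guest is sitting on.
        data = self.disk.get(block, b"\x00")
        self.page_cache[block] = data
        return data


# The app's real rootfs at snapshot time.
app_disk = {0: b"superblock", 1: b"inode-table", 2: b"app-code", 3: b"app-data"}

# Before sleep the guest touched blocks 0-2, so they sit in the page cache.
running = ToyGuest(app_disk)
for block in (0, 1, 2):
    running.read(block)
frozen_cache = dict(running.page_cache)  # what the memory snapshot captures

# Wake the broken way: frozen RAM restored over a blank template disk.
broken = ToyGuest({}, frozen_cache)

# Wake the coherent way: frozen RAM restored over a copy of the same disk.
coherent = ToyGuest(dict(app_disk), frozen_cache)

smoke_test_ok = broken.read(2) == b"app-code"       # cached, so it looks healthy
late_read_ok = broken.read(3) == b"app-data"        # uncached, the disk disagrees
coherent_read_ok = coherent.read(3) == b"app-data"  # uncached, the disk agrees
```

The smoke test passes on the broken guest because it never leaves RAM. Only the later, uncached read exposes the mismatch.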
How Thaw works
Thaw bakes a per-app seed — a complete frozen microVM — and restores it on wake. The two halves that make it safe are when we bake and what the restore lands on.
The bake: one atomic pause, app stopped
We don't bake the seed at deploy time. We bake it the first time the app goes idle — which is the perfect moment, because the app is doing nothing and the sandbox is about to be deleted anyway. Inside the guest, we stop the app process and delete its environment file first. Then, under a single atomic pause, we do three things from one frozen instant: snapshot the memory, snapshot the CPU/device state, and copy that exact rootfs into the seed.
Two properties fall out of that ordering, and both matter. Because the app is stopped before the snapshot, the memory we capture has no in-flight request, no half-held mutex, no live socket — and no plaintext secrets, because the env file is gone. It's a quiesced, secret-free machine. And because the disk copy happens inside the same pause as the memory snapshot, the RAM and the disk come from the identical instant. They cannot drift, because there was no time between them in which anything could change.
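The ordering described above can be sketched as follows. This is an illustration under stated assumptions, not our orchestrator: `RecordingVM` and every method on it are hypothetical stand-ins that only record call order, and no real Firecracker API calls are shown. The point is the shape: quiesce first, then take all three captures inside one pause.

```python
from contextlib import contextmanager


class RecordingVM:
    """Hypothetical stand-in for a microVM handle. It only records call order."""

    def __init__(self):
        self.log = []
        self.paused = False

    def _step(self, name, needs_pause=False):
        # Captures are only valid while the guest is frozen.
        if needs_pause and not self.paused:
            raise RuntimeError(f"{name} must run while the VM is paused")
        self.log.append(name)

    def stop_app(self):
        self._step("stop_app")

    def delete_env_file(self):
        self._step("delete_env_file")

    def snapshot_memory(self):
        self._step("snapshot_memory", needs_pause=True)

    def snapshot_cpu_and_devices(self):
        self._step("snapshot_cpu_and_devices", needs_pause=True)

    def copy_rootfs_to_seed(self):
        self._step("copy_rootfs_to_seed", needs_pause=True)

    @contextmanager
    def pause(self):
        self.paused = True
        self.log.append("pause")
        try:
            yield
        finally:
            self.paused = False
            self.log.append("end_pause")


def bake_seed(vm):
    # Quiesce the guest: no live app process, no env file left on disk.
    vm.stop_app()
    vm.delete_env_file()
    # One pause covers all three captures, so RAM and disk share one instant.
    with vm.pause():
        vm.snapshot_memory()
        vm.snapshot_cpu_and_devices()
        vm.copy_rootfs_to_seed()
    return vm.log


order = bake_seed(RecordingVM())
```

Raising when a capture is attempted outside the pause is a design choice for the sketch: it makes the "same instant" rule something the code enforces rather than something callers have to remember.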
The restore: onto the seed's own disk, never a template
On wake we restore over the seed's own byte-identical disk — the copy we took during that pause — gated by a SHA-256 hash. Never a fresh template clone. This is the single line that the old incident violated, and making it structurally impossible is the point. The frozen page cache in the restored memory is now sitting over exactly the disk it was cached from. There's nothing for it to lie about.
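A hash gate of this kind can be sketched in a few lines. This is a minimal illustration, assuming the digest is recorded at bake time and compared before the restore is allowed to proceed. The helper names and file name are hypothetical, and hashing the whole file on every check is a simplification that may not match how the production gate is computed.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream the file through SHA-256 so large disk images never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def disk_matches_seed(disk_path, expected_sha256):
    """Return True only if the disk is byte-identical to what the seed recorded."""
    return sha256_of_file(disk_path) == expected_sha256


# Demo with a tiny stand-in file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    seed_disk = os.path.join(tmp, "seed-rootfs.img")
    with open(seed_disk, "wb") as f:
        f.write(b"rootfs bytes captured during the pause")

    recorded = sha256_of_file(seed_disk)  # stored alongside the seed at bake time
    gate_before = disk_matches_seed(seed_disk, recorded)

    # Any change to the disk, even one byte, must close the gate.
    with open(seed_disk, "ab") as f:
        f.write(b"!")
    gate_after = disk_matches_seed(seed_disk, recorded)
```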
Then we hand off to a fresh process. The app was stopped in the seed, so on restore we re-deliver the environment (including any secrets that rotated while the app was asleep — they're applied at wake, never frozen into the snapshot) and start the app as a brand-new process on the warm machine. Before we route a single request to it, a liveness gate writes a file, fsyncs it, and reads it back — forcing a real disk round-trip. If memory and disk were ever going to disagree, that's where it surfaces, and we fall back rather than serve corruption.
So Thaw is not 'resurrect your exact running process mid-request.' We're deliberately not doing that — it's where the danger lives. Thaw is: restore a warm, coherent machine in milliseconds, then start a clean process on it. You get the speed of a snapshot with the safety of a cold boot.
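The liveness gate mentioned above (write a file, fsync it, read it back) can be sketched like this. It is an illustration, not our agent code: the function name and probe file name are hypothetical, and the real check runs inside the guest before any request is routed.

```python
import os
import tempfile


def disk_liveness_check(directory):
    """Write a probe file, fsync it, read it back. True means a clean round-trip."""
    probe_path = os.path.join(directory, ".liveness-probe")
    payload = os.urandom(32)
    try:
        fd = os.open(probe_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # ask the kernel to push the data to the block device
        finally:
            os.close(fd)
        with open(probe_path, "rb") as f:
            read_back = f.read()
        os.remove(probe_path)
        return read_back == payload
    except OSError:
        # An I/O error here is the signal to fall back instead of serving.
        return False


with tempfile.TemporaryDirectory() as tmp:
    healthy = disk_liveness_check(tmp)

# The temporary directory is gone now, so the probe cannot be written.
unusable = disk_liveness_check(tmp)
```

Treating any `OSError` as a failed check keeps the gate conservative: a disk that cannot complete one small write is not one to route traffic to.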
What we measured
Here's the part that matters. We ran a real Node HTTP app through the full lifecycle on our production fleet — deploy, idle, the seed bake on first sleep, delete the sandbox, then a request to wake it — and measured the wake end to end.
THAW WAKE (real Node app, production, kernel 6.17)
restore wall time: 164 ms
boot_mode: snapshot-natid (a restore, not a boot)
app response: THAW_OK marker=<survived> who=<rotated env> pid=<new>
ext4 errors: 0
Every property we designed for showed up in that one response. The wake was 164ms: a restore, confirmed by the boot mode, not a cold boot. A marker we wrote to disk before the bake survived the delete-and-restore cycle, which proves the restore landed on the seed's own coherent disk. The process ID was new, proving a fresh process rather than a resurrected one. The environment was the rotated value applied at wake, not the stale one frozen at deploy. And zero ext4 errors: the incident class that haunted the first attempt is, this time, structurally absent.
From 54 seconds to 164 milliseconds. That's not an optimization; it's a different mechanism. The 54-second path built a machine. The 164ms path unfroze one.
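For a sense of scale, the arithmetic behind those two numbers is below. It uses only the figures quoted in this post. The variable names are ours, and the cold-boot components are the rough estimates given earlier.

```python
# Rough cold-boot breakdown quoted earlier in this post, in seconds.
image_pull_s = 26
kernel_init_s = 8
app_start_s = 20
cold_boot_s = image_pull_s + kernel_init_s + app_start_s

# Measured Thaw wake, in milliseconds.
thaw_wake_ms = 164

speedup = cold_boot_s * 1000 / thaw_wake_ms

print(f"cold boot: {cold_boot_s} s")
print(f"thaw wake: {thaw_wake_ms} ms")
print(f"speedup: about {speedup:.0f}x")
```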
Thaw vs Fluid: different bets
It's worth being precise about how this compares to Fluid, because they're solving the same problem from opposite directions and both bets are defensible.
Fluid keeps cold starts rare — pre-warming, bytecode caching, and sharing a single instance across concurrent invocations — and bills active CPU plus provisioned memory. Its great strength is that, under steady traffic, you essentially never hit a cold start at all. It's a managed, no-config model that's an excellent default when your traffic keeps instances busy.
Thaw bets on the substrate instead. Because a Firecracker microVM freezes to a file and thaws in milliseconds, we don't work to avoid the cold start; we make it cheap enough to stop avoiding. The sandbox is deleted while idle, so there's nothing to run or bill. The trade is honest too: the first start after each deploy is still the slower cold-boot path (no seed has been baked yet), and every wake after the first idle is the sub-second Thaw. So Thaw shines for the long tail: the thousands of apps, preview environments, and side projects that are idle most of the time and need to be instant the moment someone shows up.
The honest one-liner: Fluid works to avoid the cold start. Thaw makes the cold start fast — and goes all the way to zero.
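The lifecycle behind that trade can be sketched as a small decision function. This is a simplified model, not our scheduler: `App`, `deploy`, `go_idle`, and `start_path` are hypothetical names, and it assumes that a new deploy discards the previous seed and that a failed hash gate or liveness check falls back to a cold boot.

```python
from dataclasses import dataclass


@dataclass
class App:
    has_seed: bool = False


def deploy(app):
    # Assumption: a new deploy discards any seed baked from the old build.
    app.has_seed = False


def go_idle(app):
    # The seed is baked the first time the app goes idle; later idles reuse it.
    app.has_seed = True


def start_path(app, seed_ok=True):
    # seed_ok stands in for the hash gate and the liveness check both passing.
    if app.has_seed and seed_ok:
        return "thaw"
    return "cold-boot"


app = App()
deploy(app)
first_start = start_path(app)              # no seed yet
go_idle(app)
wake_after_idle = start_path(app)          # seed baked on first idle
fallback = start_path(app, seed_ok=False)  # a gate failed, so do not thaw
deploy(app)
after_redeploy = start_path(app)           # seed discarded by the new deploy
```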
Why this is possible on this substrate
Thaw isn't a clever trick layered on top of containers — it falls out of the microVM substrate we already run on. A container can't be frozen to a coherent file and thawed; it shares the host kernel, so there's no self-contained machine state to snapshot. A Firecracker microVM is a complete machine — its own guest kernel, its own memory, its own virtual disk — and that's exactly what makes it freezable. The same property that gives a microVM stronger isolation than a container is the property that lets it cold-restore in milliseconds.
Which is the thing we keep coming back to. The boring, security-driven choice to run real microVMs instead of hardened containers is what unlocked the exciting performance result. You don't get Thaw on a substrate that was never a real machine in the first place.
Thaw is live on PandaStack's git-driven app hosting. Connect a repo, let it scale to zero, and the next visitor gets a machine that was frozen — thawed back in about the time it takes to read this sentence.
Frequently asked questions
How fast is a Thaw cold restore?
About 164 milliseconds for the microVM restore, measured end-to-end on a real Node app on our production fleet. That's the restore of a frozen microVM (snapshot-restore), not a cold boot — for comparison, the cold-boot path it replaces took roughly 54 seconds.
How is Thaw different from Vercel Fluid?
They solve the same problem from opposite directions. Fluid works to avoid cold starts — pre-warming, bytecode caching, and sharing one instance across concurrent invocations — and bills active CPU plus provisioned memory. Thaw instead scales all the way to zero (the sandbox is deleted, so there's nothing to run while idle) and makes the cold start itself fast by restoring a frozen microVM in ~164ms instead of booting a new one.
Is restoring a memory snapshot safe? Doesn't that risk corruption?
It's safe because of two deliberate choices. The seed is baked from an app-stopped guest under a single atomic pause, so the memory and the disk are captured from the same instant and can't drift. And on wake we restore over the seed's own byte-identical, SHA-256-gated disk — never a fresh template — so the restored page cache sits over exactly the disk it was cached from. A disk-touching liveness check runs before any request is served.
Does Thaw resurrect my exact running process?
No, by design. Thaw restores a warm, coherent machine in milliseconds, then starts your app as a fresh process on it (with current environment and any rotated secrets applied at wake). It's the speed of a snapshot with the safety of a clean process start — we deliberately don't resurrect in-flight request state, which is where the danger lives.
Why can PandaStack do this when container platforms can't?
Because PandaStack runs every app in a real Firecracker microVM — a complete machine with its own guest kernel, memory, and virtual disk. That's what can be frozen to a file and thawed back. A container shares the host kernel, so there's no self-contained machine to snapshot. The same property that makes microVMs more isolated than containers is what makes them cold-restorable in milliseconds.