Memory Overcommit & Page Sharing: How MicroVMs Get Dense
Here is a number that confuses people the first time they see it: a PandaStack sandbox on the base template is configured with 2 GiB of RAM, but the firecracker process backing it will often show a resident set of a few hundred megabytes. Multiply that gap across a host full of sandboxes and you arrive at the trick that makes microVM platforms economical — a host can carry far more guests than a naive (host RAM ÷ configured guest size) division would allow. This post is about the kernel and VMM mechanics that make that legal: the difference between configured and resident memory, copy-on-write page sharing on snapshot restore, the virtio-balloon as a reclaim lever, how the kernel accounts for overcommit, and demand paging so idle guest memory need not even be resident. It's a companion to the economics story — that post asks whether per-tenant isolation is affordable; this one opens the hood on why the memory adds up the way it does. And it ends on an honest note, because overcommit is a bet on statistical idleness, not a free lunch.
Configured RAM is a promise; resident RAM is a bill
When you give a guest 2 GiB, you are not handing it 2 GiB of physical memory up front. You are promising that up to 2 GiB of guest-physical addresses will resolve to something when the guest touches them. The host only allocates a real physical page the moment the guest actually reads or writes a given guest page. Until then, that address is a promise the kernel hasn't had to keep. This is the whole foundation of density: the guest's configured size is its ceiling, but its resident set — the pages the host has genuinely backed with RAM — is what shows up on the bill.
And most sandboxes touch a fraction of their ceiling. A code-interpreter guest that boots, runs a short Python snippet, and idles waiting for the next request has warmed its kernel, its libc, its interpreter — call it a few hundred megabytes — and left the other gigabyte-and-change of its address space completely untouched. The kernel never allocated physical pages for memory the guest never referenced. You provisioned a 2 GiB machine and are paying for the working set, which for a mostly-idle guest is a small, quiet corner of it.
You can watch the gap directly. The resident set of a firecracker process is what it's costing the host right now, regardless of the guest's configured size:
# Configured guest size lives in the VM config; resident size is what the
# host is actually backing with physical RAM right now. They rarely match.
$ ps -o rss,command -C firecracker --no-headers | awk '{rss=$1; $1=""; printf "%d MiB%s\n", rss/1024, $0}'
287 MiB /usr/bin/firecracker --api-sock ... # a 2 GiB-configured guest, mostly idle
# Host-wide accounting: Committed_AS can exceed MemTotal under overcommit.
# That overshoot IS the density bet — it only pays off if guests stay idle.
$ grep -E 'MemTotal|MemAvailable|Committed_AS|CommitLimit' /proc/meminfo
MemTotal: 32900120 kB
MemAvailable: 21044880 kB
Committed_AS: 41230512 kB # promised more than MemTotal — fine, until it isn't
$ sysctl vm.overcommit_memory vm.overcommit_ratio
vm.overcommit_memory = 0 # heuristic overcommit (the usual default)
vm.overcommit_ratio = 50
That Committed_AS-over-MemTotal overshoot is the density bet stated in one line: the host has promised more memory than it physically has, wagering that not every guest will call the promise in at the same instant. Whether that's brilliant or reckless depends entirely on how idle your guests actually are, which is the theme we'll keep returning to.
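To put a number on that overshoot, here is a minimal Python sketch that computes promised-over-physical memory from /proc/meminfo-style text. The function name is ours and the sample values are copied from the output above; on a real host you would pass in the contents of /proc/meminfo instead.

```python
def overcommit_ratio(meminfo_text: str) -> float:
    """Return Committed_AS / MemTotal from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    return fields["Committed_AS"] / fields["MemTotal"]


# Sample values taken from the host output shown above.
SAMPLE = """\
MemTotal: 32900120 kB
MemAvailable: 21044880 kB
Committed_AS: 41230512 kB
"""

ratio = overcommit_ratio(SAMPLE)
print(f"promised / physical = {ratio:.2f}")  # about 1.25 for this sample
```

A ratio above 1.0 means the host has promised more than it owns. It says nothing about how much of that promise is currently resident, which is why MemAvailable is worth watching alongside it.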
Copy-on-write page sharing: one physical copy, a hundred readers
The configured-vs-resident gap explains why one guest is cheap. Page sharing explains why the hundredth guest is nearly free. PandaStack restores every sandbox from a baked Firecracker snapshot — there's no warm pool of idle VMs — and the snapshot's memory image is mapped copy-on-write, with MAP_PRIVATE over the shared snapshot file. That single word, MAP_PRIVATE, is doing an enormous amount of work.
Here's the consequence. A hundred sandboxes restored from the same template start with byte-for-byte identical memory: same guest kernel, same booted userland, same warmed-up runtime. Because they all map the same snapshot file MAP_PRIVATE, every one of them reads those pages out of a single shared physical copy. The kernel does not duplicate the template's warmed pages a hundred times. A read fault on a shared page maps it straight from the one shared file; only a write fault allocates a fresh private page — and only for the specific 4 KiB page that guest just dirtied. Everyone else keeps reading the original.
So a hundred identical restores don't cost a hundred times the snapshot's RAM. They cost one shared copy of the common pages plus whatever each guest has individually scribbled since it started. The common case — the guest kernel, the interpreter, the libraries every sandbox loads and never modifies — is paid for exactly once on the host and shared across every guest that restored from that template. This is the same copy-on-write principle PandaStack uses for the rootfs disk (XFS reflink), just applied to RAM pages instead of disk blocks: shared until written, private only where it diverges.
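The write-isolation half of this is easy to see from userspace. The sketch below is a toy, not PandaStack's restore path: a temporary file stands in for the snapshot memory image, and two MAP_PRIVATE mappings stand in for two guests. It shows only the visible semantics (a write in one mapping never reaches the other mapping or the file); the fact that unwritten pages are served from one physical copy is the kernel's page-cache behavior and is not observable from this script.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Stand-in for a snapshot memory file: two pages of known content.
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * PAGE + b"B" * PAGE)


def map_private(file_fd: int) -> mmap.mmap:
    # MAP_PRIVATE: reads are served from the file's pages; a write copies
    # just the touched page into private memory and never reaches the file.
    return mmap.mmap(file_fd, 2 * PAGE, flags=mmap.MAP_PRIVATE,
                     prot=mmap.PROT_READ | mmap.PROT_WRITE)


guest_a = map_private(fd)
guest_b = map_private(fd)

guest_a[0:4] = b"XXXX"            # "guest A" dirties its first page

a_view = bytes(guest_a[0:4])      # A sees its own write
b_view = bytes(guest_b[0:4])      # B still reads the original page
os.lseek(fd, 0, os.SEEK_SET)
file_view = os.read(fd, 4)        # the backing file is unchanged

print(a_view, b_view, file_view)

guest_a.close()
guest_b.close()
os.close(fd)
os.unlink(path)
```

Only the first page of the first mapping diverged. The second page of both mappings, and the whole of the second mapping, are still backed by the file.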
Demand paging: idle memory need not even be resident
Copy-on-write keeps shared pages from being duplicated. Demand paging goes one step further: it keeps cold pages from being resident at all. With UFFD memory streaming enabled, the guest's memory isn't loaded up front from a local file — it's paged in lazily over the network from object storage as the guest touches it. A userfaultfd handler sits between the guest and its memory: the guest touches an address, the kernel raises a page-fault event, the handler fetches the relevant 4 MiB chunk from GCS via an HTTP Range GET and installs it, and the guest continues. Pages the guest never touches are never fetched and never occupy host RAM.
There's a nice refinement that makes this even cheaper: zero-elision. A baked header records which chunks of the snapshot are non-zero. A page fault on a chunk that's known-zero is satisfied by zero-filling in place — no fetch, no network, no stored bytes. A freshly-booted guest's address space is mostly zeroes, so a large share of its 'memory' costs nothing to serve. Combine that with a prefetch trace that warms the known-hot chunks in the background, and idle guest memory converges on genuinely free: not duplicated (copy-on-write), not resident (demand-paged), and not even fetched (zero-elided). The full mechanism is in /blog/userfaultfd-explained and /docs/internals/streaming-restore.
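The decision the handler makes on each fault can be modeled in a few lines. This is a toy under stated assumptions: the class, its fields, and the fake fetch function are all hypothetical, and nothing here talks to userfaultfd or to object storage. A real handler would install pages with userfaultfd ioctls such as UFFDIO_COPY and UFFDIO_ZEROPAGE rather than store bytes in a dict.

```python
CHUNK = 4 * 1024 * 1024  # 4 MiB, the chunk size described above


def range_header(offset: int, length: int) -> str:
    """HTTP Range header value for one chunk (byte ranges are inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


class LazyMemory:
    """Toy model of a fault handler with zero-elision (all names hypothetical)."""

    def __init__(self, nonzero_chunks, fetch):
        self.nonzero = set(nonzero_chunks)  # what the baked header records
        self.fetch = fetch                  # callable(offset, length) -> bytes
        self.resident = {}                  # chunk index -> installed bytes
        self.fetches = 0

    def on_fault(self, addr: int) -> bytes:
        idx = addr // CHUNK
        if idx in self.resident:
            return self.resident[idx]       # already installed, nothing to do
        if idx in self.nonzero:
            data = self.fetch(idx * CHUNK, CHUNK)  # would be a Range GET
            self.fetches += 1
        else:
            data = bytes(CHUNK)             # known-zero: fill locally, no I/O
        self.resident[idx] = data
        return data


def fake_fetch(offset, length):
    # Stand-in for object storage; a real fetch would send range_header(...).
    return b"\x01" * length


mem = LazyMemory(nonzero_chunks={0, 3}, fetch=fake_fetch)
mem.on_fault(100)             # chunk 0: non-zero, fetched
mem.on_fault(5 * CHUNK + 7)   # chunk 5: known-zero, no fetch
mem.on_fault(3 * CHUNK)       # chunk 3: non-zero, fetched
mem.on_fault(200)             # chunk 0 again: already resident
print(mem.fetches, len(mem.resident))
```

Four faults produce two fetches and three resident chunks. Chunks the guest never touched appear nowhere in the model.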
The virtio-balloon: a lever to take memory back
Copy-on-write and demand paging keep memory from being allocated. The virtio-balloon is the mechanism for reclaiming memory a guest already got but no longer needs. The balloon is a paravirtual device: a driver inside the guest that the host can 'inflate.' When the host asks the balloon to inflate, the guest driver allocates guest pages and hands them back to the host, effectively telling the guest kernel 'this memory is spoken for, don't use it' — and the host is then free to reclaim those physical pages for someone else. Deflate the balloon and the guest gets them back.
The balloon matters because overcommit is a statistical bet, and every bet wants a hedge. A guest whose memory use briefly swelled to run a heavy job and then went idle is holding physical pages it isn't using: the guest kernel has freed them, but the host still has them resident. Inflating its balloon lets the host claw that RAM back and hand it to a guest that needs it now, without killing anyone. It's cooperative: it depends on a working guest driver and a guest that has genuinely free memory to surrender, so it's a pressure-relief valve, not a guarantee. But it turns 'this guest over-reserved' from a permanent loss into a recoverable one, which is exactly the kind of tool you want when you've promised more RAM than you have.
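As a rough illustration of the host-side decision, the sketch below sizes a balloon from a guest's configured memory and its current in-use estimate. The policy, the function names, and the 128 MiB headroom are all hypothetical. Firecracker exposes balloon resizing as a PATCH to /balloon with an amount_mib field, so the second helper builds that body. Whether and how PandaStack drives the balloon is not stated here, and sending the request is left out.

```python
import json


def balloon_target_mib(configured_mib: int, in_use_mib: int,
                       headroom_mib: int = 128) -> int:
    """MiB the balloon could take: whatever lies above in-use plus headroom."""
    return max(0, configured_mib - in_use_mib - headroom_mib)


def balloon_patch_body(amount_mib: int) -> str:
    # JSON body for a PATCH /balloon request on the Firecracker API socket.
    return json.dumps({"amount_mib": amount_mib})


idle_target = balloon_target_mib(configured_mib=2048, in_use_mib=300)
busy_target = balloon_target_mib(configured_mib=2048, in_use_mib=2000)
print(idle_target, busy_target, balloon_patch_body(idle_target))
```

The clamp at zero is the cooperative part in miniature: a guest with no free memory has nothing to surrender, so the policy asks for nothing.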
Overcommit accounting, and the honest risk
All of this rests on the Linux kernel being willing to promise memory it doesn't currently have. That willingness is governed by vm.overcommit_memory, and it's worth being precise about, because this is where the bet is actually placed.
- vm.overcommit_memory = 0 (heuristic, the usual default) — the kernel allows overcommit but rejects allocations that look wildly unreasonable. Pragmatic middle ground: most promises succeed, egregious ones are refused up front.
- vm.overcommit_memory = 1 (always) — the kernel never refuses an allocation on accounting grounds. Maximum density, maximum faith. You are fully committed to the idleness bet.
- vm.overcommit_memory = 2 (never) — the kernel refuses to promise more than CommitLimit (swap + a ratio of RAM). Safest, least dense: you can only promise what you can actually back. This forecloses the density trick entirely.
Modes 0 and 1 are what make packing possible: they let Committed_AS exceed MemTotal, so a host can hand out 2 GiB promises to more guests than it has physical 2 GiB slots. That's not a bug. It's the same principle airlines use to sell more seats than the plane has, wagering that not everyone shows up. And it works for the same reason: guests, like passengers, rarely claim everything they reserved at the same moment.
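To see why mode 2 is so restrictive, it helps to work the CommitLimit arithmetic. The sketch below uses the simplified form (swap plus vm.overcommit_ratio percent of RAM) and ignores huge pages and the kernel's reserve settings, so treat it as an approximation. The inputs are the MemTotal and ratio from the host output earlier in the post, with zero swap assumed.

```python
GIB_KB = 1024 * 1024  # one GiB expressed in kB


def commit_limit_kb(mem_total_kb: int, swap_total_kb: int,
                    overcommit_ratio: int) -> int:
    """Simplified CommitLimit: swap + RAM * ratio / 100 (huge pages ignored)."""
    return swap_total_kb + mem_total_kb * overcommit_ratio // 100


def mode2_would_refuse(committed_kb: int, request_kb: int, limit_kb: int) -> bool:
    # Under vm.overcommit_memory = 2 a request fails once total commitments
    # would pass CommitLimit, no matter how idle the existing guests are.
    return committed_kb + request_kb > limit_kb


limit = commit_limit_kb(mem_total_kb=32900120, swap_total_kb=0, overcommit_ratio=50)
full_promises = limit // (2 * GIB_KB)  # 2 GiB promises that fit under the limit

print(limit, full_promises)
print(mode2_would_refuse(full_promises * 2 * GIB_KB, 2 * GIB_KB, limit))
```

With these inputs the limit is about half of physical RAM, so only seven 2 GiB promises fit before the next one is refused, and that is before counting anything else running on the host.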
This is why 'statistical idleness' is the load-bearing phrase. Overcommit pays off precisely to the degree that your guests' memory demand is uncorrelated and mostly below their ceiling. A fleet of sandboxes doing independent, bursty, mostly-idle work is an ideal case — the peaks rarely align, so the host rides the average. A fleet that all runs the same memory-hungry job on the same cron tick is the adversarial case — the peaks align perfectly, and the average is a lie. The mechanics in this post don't change which case you're in; they just make the good case dramatically more profitable and give you levers (the balloon, demand paging) to soften the bad one.
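The gap between the two cases can be made concrete with a small probability sketch. Every number here is hypothetical: 100 guests, each at its memory peak 5% of the time, on a host that can absorb 15 simultaneous peaks. If the peaks are independent, the count of guests peaking at once is binomial. If they are perfectly correlated (the cron-tick case), all guests peak together with that same 5% probability.

```python
from math import comb


def p_more_than(k: int, n: int, p: float) -> float:
    """P(X > k) for X ~ Binomial(n, p): more than k of n guests peak at once."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))


GUESTS = 100        # hypothetical fleet size on one host
P_PEAK = 0.05       # hypothetical chance a given guest is at its peak
CAPACITY = 15       # hypothetical number of simultaneous peaks the host absorbs

independent = p_more_than(CAPACITY, GUESTS, P_PEAK)
correlated = P_PEAK  # all guests peak together, so capacity is exceeded 5% of the time

print(f"independent: {independent:.2e}  correlated: {correlated:.2e}")
```

The per-guest behavior is identical in both cases; only the correlation differs, and it moves the chance of exceeding capacity by orders of magnitude.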
Four flavors of a guest's memory, by who's really paying
It helps to stop thinking of a guest's memory as one number and start sorting it by who actually bears the cost. Every page in a running guest falls into one of four buckets, and only one of them is genuinely expensive:
- CoW-shared — pages identical to the template snapshot the guest hasn't written. Backed by a single physical copy shared across every guest from that template. Marginal host cost per additional guest: essentially zero. This is the bulk of an idle guest's memory.
- Private / dirtied — pages the guest has written since restore, so copy-on-write forked them off into a private physical page. This is the real per-guest cost, and it grows exactly as fast as the guest modifies memory — no faster.
- Balloon-reclaimed — pages the guest was given but has surrendered back to the host via an inflated balloon. Counted against the guest's configured size but not costing the host physical RAM; reclaimable on demand for other guests.
- Demand-paged (not resident) — configured pages the guest hasn't touched yet. Under UFFD streaming they aren't even fetched, and zero pages aren't stored at all. Cost to the host: nothing until first touch.
Read that list and the density result stops feeling like sleight of hand. A mostly-idle guest is overwhelmingly CoW-shared and demand-paged, with a thin sliver of private dirtied pages. The host is paying real RAM only for that sliver, times the number of guests, plus one shared copy of the common image. That's the arithmetic behind 'more sandboxes than the sum of their configured RAM' — not a fabricated packing ratio, just an honest accounting of which pages cost anything.
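That arithmetic fits in two functions. The inputs below are placeholders chosen only to show the shape of the calculation, not measurements from any PandaStack host; substitute the shared-image size and per-guest private footprint you observe on your own machines.

```python
def host_ram_mib(guests: int, shared_mib: int, private_mib: int) -> int:
    """One shared copy of the common image plus each guest's private pages."""
    return shared_mib + guests * private_mib


def naive_ram_mib(guests: int, configured_mib: int) -> int:
    """What the host would need if every guest were backed at its ceiling."""
    return guests * configured_mib


# Placeholder inputs, not measurements.
GUESTS = 100
SHARED_MIB = 250       # hypothetical size of the shared template image
PRIVATE_MIB = 40       # hypothetical dirtied pages per mostly-idle guest
CONFIGURED_MIB = 2048  # the 2 GiB base template size from the post

actual = host_ram_mib(GUESTS, SHARED_MIB, PRIVATE_MIB)
naive = naive_ram_mib(GUESTS, CONFIGURED_MIB)
print(actual, naive)
```

The shared term is paid once regardless of guest count, so the per-guest private footprint is the number that decides how the total scales.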
Why no warm pool makes idle sandboxes nearly free
There's one more mechanism that compounds all of the above, and it's architectural rather than kernel-level: PandaStack keeps no warm pool of idle VMs. Every create restores a baked snapshot on demand — p50 179ms, p99 ~203ms, with the snapshot-load step around 49ms and only the very first spawn of a fresh template paying the ~3s cold boot before the snapshot is baked for everyone after. A same-host fork lands in 400–750ms by mapping the parent's memory copy-on-write; a cross-host fork runs 1.2–3.5s because the bytes have to travel first.
Why does that matter for memory? Because the classic way to hide slow boots is to keep a rack of VMs warm and idle, and warm idle VMs are the single worst thing for the memory bet. They hold resident pages, they don't balloon down, and they consume the physical RAM you were counting on lending to active guests. Fast snapshot-restore lets you skip the warm pool entirely: an idle sandbox can simply be snapshotted and deleted, freeing its host RAM completely, then recreated in roughly 200ms (p50 179ms, p99 ~203ms) when the next request lands. Idle stops being a resident-memory tax and becomes, near enough, nothing. The overcommit bet is safest precisely when your idle guests aren't sitting on physical RAM at all, and the no-warm-pool model is how you get there. The boot-path details are in /blog/how-firecracker-boots-fast and /docs/internals/snapshot-restore.
Configured RAM is a promise, not a purchase. Copy-on-write shares it, the balloon reclaims it, demand paging defers it — and overcommit dares to promise more than you have, betting the guests won't all call it in at once. Density is that bet, made carefully.
The takeaway: density is a bet you can engineer, not a law you can break
Put the four mechanisms together and the picture is coherent. Configured memory is a ceiling, not a purchase — resident memory is the real bill. Copy-on-write page sharing means identical guests share one physical copy of their common image. Demand paging means untouched memory need not be resident, and zero pages need not exist at all. The balloon gives the host a lever to reclaim over-provisioned RAM cooperatively. And overcommit accounting lets the kernel promise more than it holds, on the wager that guests are statistically idle relative to their configuration. Each one widens the gap between what you promised and what you're paying, and density lives in that gap.
The discipline is in never forgetting it's a wager. The OOM killer is real, correlated load is real, and no mechanism here manufactures physical memory. What they do is make the good case — independent, bursty, mostly-idle guests — pay off handsomely, and give you tools to survive the bad case. That's the honest version of 'microVMs get dense': not magic, but a well-understood statistical bet with kernel-level machinery to tilt the odds. PandaStack's core is open source under Apache-2.0, so you can watch the RSS-vs-configured gap on your own hosts and size your own overcommit against your own load. For the disk side of the same copy-on-write idea, see /blog/copy-on-write-rootfs; for the economics the mechanics pay for, /blog/microvm-density-economics; and for the demand-paging internals, /blog/userfaultfd-explained.
Frequently asked questions
What's the difference between a guest's configured RAM and its resident RAM?
Configured RAM is the ceiling you give the guest — the maximum guest-physical memory it's allowed to address (e.g. 2 GiB on PandaStack's base template). Resident RAM is what the host has actually backed with physical pages, which shows up as the RSS of the firecracker process. The host only allocates a real page when the guest first touches a given address, so a mostly-idle guest that boots and runs a short task often has a few hundred megabytes resident against a 2 GiB configuration. Density lives entirely in that gap: you provision for the ceiling but pay for the working set.
How does copy-on-write let many microVMs share memory?
PandaStack restores every sandbox from a baked snapshot mapped copy-on-write (MAP_PRIVATE over the shared snapshot file). Guests restored from the same template start with byte-for-byte identical memory, so every one of them reads the common pages — guest kernel, runtime, libraries — out of a single shared physical copy. A read fault maps the shared page directly; only a write fault allocates a private 4 KiB page, and only for the page that guest actually dirtied. So a hundred identical restores cost one shared copy of the common image plus each guest's private writes, not a hundred full copies.
What does vm.overcommit_memory control, and what's the risk?
vm.overcommit_memory governs whether the Linux kernel will promise more memory than it physically has. Mode 0 (heuristic, the usual default) allows overcommit but rejects wildly unreasonable allocations; mode 1 (always) never refuses on accounting grounds for maximum density; mode 2 (never) refuses to promise beyond CommitLimit, which forecloses the density trick. Modes 0 and 1 let Committed_AS exceed MemTotal — the host hands out more RAM promises than it can back, betting guests stay idle. The risk is real: if enough guests touch enough of their configured memory at once, the host runs out of physical pages and the OOM killer starts terminating processes. Overcommit is a statistical bet, not magic.
What is the virtio-balloon and how does it reclaim memory?
The virtio-balloon is a paravirtual device with a driver inside the guest that the host can inflate to reclaim memory. On inflate, the guest driver allocates guest pages and hands them back to the host, telling the guest kernel not to use them — and the host is then free to give those physical pages to another guest. Deflate returns them. It's cooperative: it depends on a working guest driver and a guest that has genuinely free memory to surrender, so it's a pressure-relief valve rather than a guarantee. Its value is turning an over-provisioned guest's unused RAM from a permanent loss into something recoverable — a useful hedge when you've overcommitted.
Does an idle sandbox cost memory on PandaStack?
Very little, and it can be made to cost essentially nothing. An idle guest's memory is overwhelmingly copy-on-write-shared with its template (one physical copy across all guests) and demand-paged (untouched pages aren't resident, and under UFFD streaming zero pages aren't even fetched), leaving only a thin sliver of privately dirtied pages as real per-guest cost. Because PandaStack keeps no warm pool and restores from a baked snapshot at a p50 of ~179ms, the strongest move is to snapshot-and-delete an idle sandbox entirely, freeing its host RAM completely, and recreate it in roughly 200ms on the next request. That turns idle from a resident-memory tax into roughly nothing.