The Firecracker virtio-balloon Device, Explained

Ajay Kumar·July 5, 2026·8 min read

A microVM you hand to an AI agent asks for 2 GiB of RAM and then, most of the time, sits there using a few hundred megabytes of it. The guest booted, ran a command, and is now idle — but from the host's point of view that 2 GiB is spoken for. Multiply by a few hundred idle sandboxes on one box and you're wasting most of your physical RAM on memory nobody is touching. The virtio-balloon device is Firecracker's answer to that waste: a cooperative channel through which the host can ask a guest to give some of its idle RAM back, so that memory can be spent on a guest that actually needs it. This post explains what the balloon actually is, the three knobs Firecracker gives you, how you drive it at runtime, and the honest caveats of a mechanism that runs entirely on the honor system.

What the balloon actually is

The name is a good mental model. Inside the guest there's a small driver — the balloon — that can inflate and deflate. When the host wants memory back, it tells the balloon to inflate to some size. The guest driver responds by allocating that many pages from its own free memory and pinning them, then telling the host "I'm not using these anymore, and I promise not to." The host, now holding that promise, can reclaim the underlying physical pages and hand them to another guest or another process. When the guest needs the memory back, the host tells the balloon to deflate: the driver frees the pinned pages back into the guest's allocator, and the host repopulates them on demand.

The elegant part is what the balloon does not do. It doesn't compress memory, it doesn't swap, and it doesn't page anything to disk. It's a negotiation over which guest pages the balloon currently claims: pages the guest has agreed not to touch, and that the host is therefore free to reuse. Inflating the balloon is the guest handing pages over; deflating is the guest getting them back. The balloon is the guest politely handing RAM back to the host on the honor system, and that honor system is both why it's cheap and why it has sharp edges we'll get to.

Ballooning moves the free-memory boundary, not data. A page inside an inflated balloon is a page the guest has agreed to stop using, so the host can safely reclaim its physical backing. Nothing is copied — the win is that idle RAM stops being reserved.
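To make the "boundary, not data" idea concrete, here is a toy accounting model in Python. It is only a sketch: ToyGuest and its fields are hypothetical names invented for illustration, not Firecracker or kernel code, and it assumes 4 KiB pages.

```python
from dataclasses import dataclass


@dataclass
class ToyGuest:
    """Toy page accounting for one guest (illustrative only)."""
    total_pages: int
    used_pages: int          # pages holding live guest data
    balloon_pages: int = 0   # pages pinned by the balloon driver

    @property
    def free_pages(self) -> int:
        return self.total_pages - self.used_pages - self.balloon_pages

    def inflate_to(self, target_pages: int) -> int:
        """Move the balloon toward target_pages and return the size reached."""
        if target_pages < 0:
            raise ValueError("target_pages must be non-negative")
        if target_pages > self.balloon_pages:
            # Inflate: only free pages can be pinned, so a busy guest stops short.
            grow = min(target_pages - self.balloon_pages, self.free_pages)
            self.balloon_pages += grow
        else:
            # Deflate: pinned pages go back to the guest's free pool.
            self.balloon_pages = target_pages
        return self.balloon_pages


# A 2 GiB guest with 256 MiB in use, asked to hand back 512 MiB.
guest = ToyGuest(total_pages=524_288, used_pages=65_536)
reached = guest.inflate_to(131_072)
```

Note that used_pages never changes: inflating only moves pages from the free pool into the balloon, which is the whole trick.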

Why this matters: density and overcommit

The reason a serverless-microVM platform cares about ballooning is memory overcommit. If every guest that asks for 2 GiB actually pinned 2 GiB of physical RAM the whole time it existed, the number of microVMs you could pack on a host would be exactly (host RAM ÷ 2 GiB), full stop, and most of that RAM would sit idle behind sandboxes that booted and went quiet. Ballooning breaks that ceiling. When a guest is idle, the host inflates its balloon and reclaims the slack; when the guest gets busy again, the host deflates and gives it back. You can safely run microVMs whose combined nominal memory exceeds the host's physical RAM, as long as their combined working set (the memory they're actually touching at any instant) fits.
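A back-of-the-envelope version of that arithmetic, as a Python sketch. The host size (256 GiB), the per-guest working set (300 MiB), and the 20% headroom are all assumed numbers chosen for illustration, not measurements from any real fleet.

```python
def naive_capacity(host_ram_mib: int, guest_size_mib: int) -> int:
    """Guests that fit if every guest pins its full nominal size."""
    return host_ram_mib // guest_size_mib


def overcommit_capacity(host_ram_mib: int, working_set_mib: int,
                        headroom_percent: int = 20) -> int:
    """Guests that fit if only working sets must be resident.

    headroom_percent of host RAM is held back for bursts (an assumed policy).
    """
    usable_mib = host_ram_mib * (100 - headroom_percent) // 100
    return usable_mib // working_set_mib


HOST_RAM_MIB = 256 * 1024                           # assumed 256 GiB host
naive = naive_capacity(HOST_RAM_MIB, 2048)          # 2 GiB per guest
dense = overcommit_capacity(HOST_RAM_MIB, 300)      # ~300 MiB working set
```

Under these assumptions the naive ceiling is 128 guests, while the working-set bound is 699. The second number only holds while the guests stay quiet, which is why the reclaim has to be reversible.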

This is the same bet an operating system makes when it lets processes allocate more virtual memory than physical RAM: the sum of what's promised can exceed what exists, because not everyone spends at once. Ballooning gives the host a lever to enforce that bet at the VM level, reclaiming from the quiet guests to fund the loud ones. It pairs naturally with Firecracker's copy-on-write memory model — where guests restored from the same snapshot already share identical read-only pages — to push memory density well past the naive per-guest reservation.

The three knobs Firecracker gives you

Firecracker's balloon is deliberately small — three fields, matching its minimalist device philosophy. You configure it once before boot (or, for the target size, at runtime), and there's nothing else to tune.

amount_mib: the target balloon size in MiB. This is how much RAM you want the guest to hand back. Set it to 512 and the guest inflates its balloon to pin ~512 MiB of its own free pages, letting the host reclaim that much. Zero means a fully deflated balloon (the guest keeps all its RAM). This is the field you drive at runtime to reclaim or return memory.
deflate_on_oom: a safety valve. When true, if the guest hits an out-of-memory condition, the balloon automatically deflates to give pages back rather than letting the guest's OOM killer start killing processes. It's the difference between "the host over-reclaimed, so the balloon quietly gives memory back" and "the host over-reclaimed, so the guest kills your workload." You almost always want this on.
stats_polling_interval_s: how often (in seconds) the guest reports memory statistics back to the host over the balloon's stats virtqueue: free memory, available memory, swap activity, page-fault counts. Zero disables stats. Non-zero lets the host see how much slack a guest actually has before deciding how far to inflate; reclaim blind and you risk squeezing a guest that has nothing to give. The interval can be changed on a running guest with PATCH /balloon/statistics, but stats cannot be switched on or off after boot.

The stats knob is what turns ballooning from a guess into a policy. Without it you're inflating balloons and hoping the guest had the room; with it, the host reads each guest's free/available memory and inflates only the guests with genuine slack, and only as far as their reported headroom. That feedback loop is what makes aggressive density safe rather than reckless.
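One way that feedback loop could look, as a minimal Python sketch. It assumes the statistics payload exposes an available_memory field in bytes; the reserve and step sizes are invented policy values, and choose_target_mib is a hypothetical helper, not part of Firecracker.

```python
MIB = 1024 * 1024


def choose_target_mib(stats: dict, current_mib: int,
                      reserve_mib: int = 256, max_step_mib: int = 256) -> int:
    """Pick the next balloon target from one stats sample.

    Assumes stats["available_memory"] is the guest's available memory in
    bytes. The guest always keeps reserve_mib of headroom, and the target
    moves by at most max_step_mib per decision, in either direction.
    """
    available_mib = stats["available_memory"] // MIB
    headroom_mib = available_mib - reserve_mib
    # Positive headroom inflates further; negative headroom gives memory back.
    step_mib = max(-max_step_mib, min(headroom_mib, max_step_mib))
    return max(0, current_mib + step_mib)


idle_guest = {"available_memory": 1024 * MIB}
tight_guest = {"available_memory": 100 * MIB}
next_idle = choose_target_mib(idle_guest, current_mib=0)
next_tight = choose_target_mib(tight_guest, current_mib=100)
```

Clamping the step keeps one noisy sample from swinging the balloon by gigabytes, and letting the step go negative means the same function also backs off when a guest runs short.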

Configuring the balloon and driving it at runtime

Like every Firecracker device, the balloon is wired up through the REST-over-Unix-socket API. You PUT the initial config before InstanceStart, then PATCH the target size while the guest runs. The config is a single small object; the runtime knob is a single field.

# PUT /balloon, before InstanceStart.
# Start fully deflated, with the OOM safety valve on and
# stats reported once a second so the host can see guest slack.
curl --unix-socket /run/fc.sock -X PUT 'http://localhost/balloon' \
  -H 'Content-Type: application/json' \
  -d '{
    "amount_mib": 0,
    "deflate_on_oom": true,
    "stats_polling_interval_s": 1
  }'
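If your control plane is not a shell script, the same PUT can be issued from code. The sketch below uses only the Python standard library; UnixHTTPConnection, balloon_config, and put_balloon are hypothetical helper names, and the socket path is whatever you passed to Firecracker at launch.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that dials a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 5.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def balloon_config(amount_mib: int, deflate_on_oom: bool,
                   stats_polling_interval_s: int) -> dict:
    """Build the three-field balloon config, rejecting negative values."""
    if amount_mib < 0 or stats_polling_interval_s < 0:
        raise ValueError("amount and interval must be non-negative")
    return {
        "amount_mib": amount_mib,
        "deflate_on_oom": deflate_on_oom,
        "stats_polling_interval_s": stats_polling_interval_s,
    }


def put_balloon(socket_path: str, config: dict) -> int:
    """Send PUT /balloon and return the HTTP status code."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("PUT", "/balloon", body=json.dumps(config),
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status
    finally:
        conn.close()
```

A call such as put_balloon("/run/fc.sock", balloon_config(0, True, 1)) would mirror the config shown above; treat any non-2xx status as a failed configuration.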

Once the guest is running, you don't tear anything down to reclaim memory — you just move the target. A single PATCH to /balloon inflates or deflates the balloon to the new amount_mib, and the guest driver reacts asynchronously: it allocates and pins pages to reach the target on inflate, or frees them on deflate.

# --- Reclaim ~512 MiB from an idle guest ---
# Inflate the balloon: the guest pins ~512 MiB of its free
# pages and the host reclaims the physical backing.
curl --unix-socket /run/fc.sock -X PATCH 'http://localhost/balloon' \
  -H 'Content-Type: application/json' \
  -d '{"amount_mib": 512}'

# --- The guest gets busy again: give the memory back ---
# Deflate fully: the driver frees the pinned pages back into
# the guest allocator; the host repopulates on demand.
curl --unix-socket /run/fc.sock -X PATCH 'http://localhost/balloon' \
  -H 'Content-Type: application/json' \
  -d '{"amount_mib": 0}'

# --- Read the guest's memory stats to decide how far to inflate ---
curl --unix-socket /run/fc.sock -X GET 'http://localhost/balloon/statistics'

That's the whole runtime surface: one PATCH to reclaim, one PATCH to return, one GET to see what the guest can spare. A control plane driving a fleet reads the stats, finds the idle guests with real slack, inflates their balloons to reclaim it, and deflates the moment a guest shows signs of needing its memory back.
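For the fleet-level step ("find the idle guests with real slack"), one plausible policy is greedy: when the host needs some amount of memory, take it from the guests with the most slack first. This is a sketch of that idea only; plan_reclaim and the guest IDs are hypothetical, and a real control plane would likely weigh more than slack.

```python
from typing import Dict


def plan_reclaim(needed_mib: int, slack_mib_by_guest: Dict[str, int]) -> Dict[str, int]:
    """Decide how many MiB to reclaim from each guest.

    Greedy: visit guests from most slack to least, taking as much as each
    can spare until needed_mib is covered. Guests with no slack are skipped.
    """
    plan: Dict[str, int] = {}
    remaining = needed_mib
    by_slack = sorted(slack_mib_by_guest.items(), key=lambda kv: kv[1], reverse=True)
    for guest_id, slack_mib in by_slack:
        if remaining <= 0:
            break
        take = min(slack_mib, remaining)
        if take > 0:
            plan[guest_id] = take
            remaining -= take
    return plan


# Hypothetical fleet: slack is each guest's reported headroom in MiB.
plan = plan_reclaim(600, {"sbx-a": 512, "sbx-b": 64, "sbx-c": 256})
```

Each entry in the plan would then become one PATCH to that guest's /balloon, adding the planned amount to its current amount_mib.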

Inflation is a request, not a command with a deadline. The PATCH returns immediately; the guest driver reaches the target on its own schedule as it can allocate free pages. A busy guest with little free memory will inflate slowly or not at all — which is exactly the behavior you want, since it means the guest has nothing to spare.
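Because the PATCH only states intent, a caller that cares about the outcome has to poll. Here is a small Python sketch of that wait loop. It assumes you have some way to read the balloon's current size in MiB (for example an actual_mib value from the statistics endpoint, which is an assumption about the payload); wait_for_balloon is a hypothetical helper.

```python
import time
from typing import Callable


def wait_for_balloon(read_actual_mib: Callable[[], int], target_mib: int,
                     timeout_s: float = 10.0, poll_s: float = 0.5,
                     tolerance_mib: int = 1,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the balloon is within tolerance of target, or time out.

    Returns True if the target was reached and False if the guest stalled.
    A False result is not an error: it usually means the guest had no
    free pages left to give.
    """
    deadline = clock() + timeout_s
    while True:
        if abs(read_actual_mib() - target_mib) <= tolerance_mib:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting; in production you would leave them at their defaults.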

The honest caveats: it's cooperative

Everything above depends on the guest playing along, and that's the catch you have to design around. Ballooning is cooperative: the host asks, and a well-behaved guest with a working balloon driver complies. But the host cannot force it. This has three practical consequences worth stating plainly.

A hostile or driverless guest simply won't inflate. If the guest never loaded the virtio-balloon driver, or is running code that deliberately ignores balloon requests, your PATCH does nothing — the guest keeps its RAM. In a multi-tenant setting where the guest runs untrusted code, you cannot assume the balloon will cooperate.
deflate_on_oom is not optional in practice. If you inflate a balloon and the guest later needs that memory, an inflated balloon competing with real allocations can drive the guest into OOM. Turning deflate_on_oom on means the balloon yields first, protecting the workload; leaving it off means the guest's OOM killer may fire. For anything you don't want killed, set it on.
The balloon is not a memory limit. It's a way to reclaim slack from a cooperating guest, not a cap you can enforce. If you need a hard ceiling on what a guest can consume — the guarantee that a tenant cannot exceed N MiB no matter what — that has to come from the host: the microVM's configured memory size, plus host-level cgroup limits on the Firecracker process. Ballooning is the soft, cooperative layer on top of those hard limits, not a replacement for them.

Treat the balloon as an optimization, never as isolation. It reclaims idle memory from friendly guests to improve density; it does not defend against a guest that wants to hoard RAM. The real containment is the microVM's configured memory size and a host cgroup limit. The balloon rides on top of those; it doesn't stand in for them.
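For contrast with the soft balloon, here is roughly what the hard layer looks like on a cgroup v2 host: a byte limit written to the memory.max file of the cgroup that contains the Firecracker process. This is a sketch under assumptions. The 64 MiB allowance for the VMM process itself is an invented figure, and the cgroup directory is whatever your launcher created.

```python
from pathlib import Path

MIB = 1024 * 1024


def cgroup_memory_max_bytes(guest_mem_mib: int, vmm_overhead_mib: int = 64) -> int:
    """Hard ceiling for the Firecracker process: guest RAM plus VMM overhead.

    vmm_overhead_mib is an assumed allowance; measure your own before relying on it.
    """
    if guest_mem_mib <= 0 or vmm_overhead_mib < 0:
        raise ValueError("sizes must be positive")
    return (guest_mem_mib + vmm_overhead_mib) * MIB


def apply_memory_cap(cgroup_dir: Path, limit_bytes: int) -> None:
    """Write the limit to the cgroup v2 memory.max file in cgroup_dir."""
    (cgroup_dir / "memory.max").write_text(f"{limit_bytes}\n")


limit = cgroup_memory_max_bytes(2048)  # cap for a 2 GiB guest
```

Unlike a balloon target, this number is enforced by the host kernel no matter what the guest does, which is why it belongs underneath the balloon rather than beside it.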

Balloon vs. a hard cap vs. UFFD streaming

It helps to place ballooning next to the other two memory levers a Firecracker platform has, because they solve different problems and compose rather than compete.

virtio-balloon — a cooperative, runtime-adjustable channel to reclaim idle RAM from a running guest and hand it back later. Soft: depends on the guest driver complying. Best for density and overcommit across many mostly-idle guests. Not a security boundary and not a hard limit.
Hard memory cap (guest size + host cgroup) — the enforced ceiling. The microVM's configured RAM plus a cgroup limit on the Firecracker process is what a guest physically cannot exceed. Enforced by the host, needs no guest cooperation. This is the isolation guarantee; ballooning operates strictly inside it.
UFFD memory streaming — solves a different axis entirely: not how much RAM a running guest holds, but how a guest's memory is loaded at restore. Instead of reading the whole memory image up front, guest pages are faulted in on demand (streamed from object storage), so a guest pulls only its working set. It shrinks the cost of starting a guest; ballooning shrinks the cost of keeping an idle one around.

The three stack cleanly. The hard cap sets the ceiling no guest can cross. UFFD streaming makes each guest cheap to bring up by paging in only what it touches. And ballooning reclaims the slack from guests that came up, went quiet, and are sitting on memory they aren't using. Together they let one host hold far more microVMs than a naive per-guest reservation would allow — which is the whole economic argument for microVMs over always-on containers or full VMs.

How this fits PandaStack's density model

PandaStack's entire cost model rests on idle sandboxes being close to free. Every sandbox, managed database, and hosted app is its own Firecracker microVM, and the platform is built to pack many of them onto a single KVM host — the networking layer alone pre-allocates 16,384 /30 subnets per agent so per-sandbox network isolation is never the bottleneck. The binding constraint is memory, and that's exactly where the balloon earns its keep: a sandbox that booted (a create lands at a 179ms p50, roughly 203ms p99, restoring a baked snapshot rather than the ~3s of a first cold boot) and then went idle is a prime candidate to have its slack reclaimed, so the RAM funds a sandbox that's actually working.

Ballooning composes with the copy-on-write memory that snapshot-restore already gives you. Guests restored from the same template snapshot share identical read-only pages, so their baseline footprint is smaller than their nominal size to begin with; ballooning then reclaims the idle slack on top of that sharing. The same primitive underlies forks — a same-host fork runs in roughly 400–750ms because it shares the parent's resident memory copy-on-write, and a cross-host fork is 1.2–3.5s once the artifacts move over the network. Dense, cheap-when-idle microVMs are the point, and the balloon is one of the levers that gets you there. PandaStack's core is open source under Apache-2.0, so you can run the control-plane API and per-host agent on your own KVM hosts and watch the memory math yourself. For the copy-on-write memory model that ballooning builds on, start at /blog/firecracker-memory-snapshots; for the minimal device model the balloon belongs to, /blog/firecracker-virtio-devices.

Frequently asked questions

What is the Firecracker virtio-balloon device?

It's a cooperative memory-reclamation channel. A small balloon driver inside the guest can inflate — allocating and pinning free guest pages and telling the host it won't use them — so the host can reclaim that physical RAM and give it to another guest. When the guest needs the memory back, the host deflates the balloon and the driver frees the pinned pages. Nothing is copied or swapped; ballooning just moves the boundary of which pages are claimed, letting a host reclaim idle guests' slack. It's the guest handing RAM back on the honor system, which is why it's cheap and also why it's cooperative rather than enforced.

What are the three Firecracker balloon configuration options?

amount_mib is the target balloon size — how much RAM you want the guest to give back (0 means fully deflated, guest keeps all its memory); this is the field you PATCH at runtime to reclaim or return memory. deflate_on_oom, when true, makes the balloon automatically deflate if the guest hits an out-of-memory condition, so the balloon yields memory instead of the guest's OOM killer firing — you almost always want it on. stats_polling_interval_s sets how often (in seconds) the guest reports memory stats (free, available, swap, page faults) back to the host, so the host can see how much slack a guest actually has before deciding how far to inflate; 0 disables stats.

How do you inflate or deflate a Firecracker balloon at runtime?

Through Firecracker's REST-over-Unix-socket API. You configure the balloon with a PUT /balloon before InstanceStart, then adjust it while the guest runs with a PATCH /balloon carrying a new amount_mib. To reclaim about 512 MiB from an idle guest you PATCH amount_mib to 512; to give the memory back you PATCH it to 0. The PATCH returns immediately and the guest driver reaches the target asynchronously as it can allocate free pages. You can read the guest's memory statistics with GET /balloon/statistics to decide how far to inflate — reclaiming blind risks squeezing a guest that has nothing to spare.

Is memory ballooning a substitute for a real memory limit?

No. Ballooning is cooperative and soft — it depends on the guest's balloon driver complying, so a hostile or driverless guest can simply ignore inflation requests and keep its RAM. It reclaims idle slack from friendly guests to improve density; it is not a security boundary and not an enforced cap. A hard memory ceiling has to come from the host: the microVM's configured memory size plus a cgroup limit on the Firecracker process, which a guest physically cannot exceed regardless of cooperation. Use ballooning as an optimization layered on top of those hard limits, and keep deflate_on_oom on so an over-reclaimed guest yields memory rather than killing your workload.

How does ballooning enable running more microVMs than physical RAM?

It lets a host overcommit memory safely. Most microVMs boot, do a little work, and go idle holding far less than they reserved. By inflating idle guests' balloons the host reclaims that slack and reallocates it to guests that are actually busy, then deflates when an idle guest wakes up. As long as the combined working set of all guests fits in physical RAM at any instant, you can run more microVMs than a naive per-guest reservation would allow. It pairs with copy-on-write snapshot memory (guests from the same template share read-only pages) and with on-demand UFFD paging, so idle guests cost close to nothing — which is the economic core of packing many sandboxes onto one host.

Written by Ajay Kumar, Founder, PandaStack.