KVM explained for developers: the hardware boundary under microVMs
If you've read anything about Firecracker, microVMs, or "how do I run untrusted code safely," you've hit the term KVM, usually wedged next to phrases like "hardware isolation" and "VT-x" with no explanation. This post fixes that. KVM (Kernel-based Virtual Machine) is the Linux feature that turns your CPU's virtualization extensions into a real isolation boundary — and it's the layer doing the actual security work underneath Firecracker, QEMU, Cloud Hypervisor, AWS Lambda, and every microVM platform including PandaStack. Understanding it once clears up a lot of fuzzy mental models.
What KVM actually is
KVM is a Linux kernel module (kvm.ko, plus a CPU-specific kvm-intel.ko or kvm-amd.ko). Loading it does something specific and slightly mind-bending: it turns the Linux kernel itself into a hypervisor. Not a separate product you install on bare metal like ESXi, and not purely a userspace program — the kernel gains the ability to run guest virtual machines directly, using hardware features built into the CPU.
It sits in an awkward spot in the old type-1 / type-2 taxonomy. A type-1 (bare-metal) hypervisor runs directly on hardware; a type-2 (hosted) hypervisor runs as an application on a normal OS. KVM is type-2-ish: it's part of a general-purpose Linux that's also running your other processes, but because it's the kernel and it drives the CPU's virtualization hardware directly, it performs like a type-1. People argue about which bucket it belongs in; the practical answer is "it's Linux, and Linux is the hypervisor."
The crucial detail: KVM is not a software emulator. It does not interpret guest instructions. The guest's code runs natively on the physical CPU, at full speed, because modern CPUs have a dedicated hardware mode for exactly this.
The CPU extensions that make it real
KVM is a thin wrapper around hardware virtualization extensions that have shipped in essentially every server and laptop CPU for over a decade: Intel calls theirs VT-x, AMD calls theirs AMD-V (SVM), and ARM has its own virtualization extensions (EL2). These add a new privilege axis to the CPU. Without them, the classic x86 ring model goes ring 0 (kernel) down to ring 3 (userspace). With them, the CPU gains a separate guest mode (Intel's terms: VMX non-root) and host mode (VMX root), each with its own full set of rings.
So a guest kernel can run in its own ring 0 — it genuinely believes it owns the machine — while still being one layer below the host's hypervisor. The guest gets a real ring 0; it just isn't the host's ring 0. That separation is enforced in silicon, not by a software supervisor checking every instruction, which is why virtualized code runs at near-native speed.
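Whether a given x86 CPU exposes these extensions can be read straight from CPUID. The sketch below is illustrative only: the helper name cpu_has_hw_virt is made up for this post, it relies on the compiler-provided cpuid.h header from GCC/Clang, and it covers x86 only (ARM reports its virtualization support differently). The bit positions used are the VMX feature bit (leaf 1, ECX bit 5) and the SVM feature bit (leaf 0x80000001, ECX bit 2).

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: returns 1 if CPUID advertises VT-x or AMD-V,
 * 0 if it does not, and -1 on non-x86 builds where CPUID does not exist. */
int cpu_has_hw_virt(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* Intel VT-x: CPUID leaf 1, ECX bit 5 (VMX). */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
        return 1;

    /* AMD-V: CPUID leaf 0x80000001, ECX bit 2 (SVM). */
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2)))
        return 1;

    return 0;
#else
    return -1;
#endif
}
```

A result of 0 inside a cloud VM usually just means the outer hypervisor is hiding the bit from its guest, not that the physical CPU lacks the feature.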
The VM-exit: where the actual security boundary lives
Here's the mechanism that matters most, and the one that's worth carrying around in your head. While a guest is doing ordinary work — arithmetic, memory access to its own RAM, running userspace — the CPU executes its instructions directly with zero hypervisor involvement. But the moment the guest does something privileged or boundary-crossing — touches an I/O port, hits a device register, executes a sensitive instruction, faults on a page it doesn't own — the CPU traps. It freezes the guest, switches from guest mode back to host mode, and hands control to KVM. This transition is called a VM-exit (a VMEXIT).
The VM-exit is the security boundary. The guest cannot reach the host by "asking nicely" — there is no syscall it can make into the host kernel, no shared call gate. Its only way to affect anything outside its own sandbox is to do something that traps, at which point the host CPU — not the guest — decides what happens next and hands the situation to host-side code that the guest can't see or influence. The CPU itself plays bouncer; the guest never gets to walk up to the host kernel and request a favor.
After a VM-exit, KVM (or the userspace VMM above it) inspects why the trap happened, emulates whatever the guest was trying to do ("you wrote to that virtio device register, fine, here's the result"), and then issues a VM-entry to resume the guest exactly where it left off. Apart from the time that passed, the guest has no indication it was paused. This trap-emulate-resume loop is the entire game, and minimizing how often it happens (fewer exits) is a big part of why microVMs are fast.
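To make the shape of that loop concrete, here is a toy model in plain C. It is not KVM code: the guest is just an array of made-up operations (the guest_op enum, enter_guest, and run_vmm are all names invented for this sketch), where ordinary work stays "inside the guest" and only I/O or a halt hands control back to the host side.

```c
#include <stddef.h>

/* Toy guest "instructions": only some of them trap to the host. */
enum guest_op { OP_ALU, OP_IO_WRITE, OP_HALT };

struct vmm_stats {
    size_t native_ops;  /* executed without leaving guest mode */
    size_t vm_exits;    /* traps handled on the host side */
};

/* Stand-in for VM-entry: runs from *pc until an op traps, returns that op. */
static enum guest_op enter_guest(const enum guest_op *prog, size_t len,
                                 size_t *pc, struct vmm_stats *st)
{
    while (*pc < len) {
        enum guest_op op = prog[(*pc)++];
        if (op != OP_ALU)
            return op;          /* "VM-exit": control returns to the host */
        st->native_ops++;       /* ordinary work: no host involvement */
    }
    return OP_HALT;             /* ran off the end: treat as a halt */
}

/* Host-side trap-emulate-resume loop. */
struct vmm_stats run_vmm(const enum guest_op *prog, size_t len)
{
    struct vmm_stats st = {0, 0};
    size_t pc = 0;

    for (;;) {
        enum guest_op why = enter_guest(prog, len, &pc, &st);
        st.vm_exits++;
        if (why == OP_HALT)
            break;              /* guest is done */
        /* OP_IO_WRITE: a real VMM would emulate the device here,
         * then loop around to re-enter the guest. */
    }
    return st;
}
```

For a program of two ALU ops, one I/O write, one more ALU op, and a halt, this counts three native operations and two exits. The ratio between those two numbers is the thing a low-overhead VMM tries to push up.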
A container asks the host kernel to do things, all day, through hundreds of syscalls. A KVM guest can't ask the host anything — it can only trap, and the hardware decides the trap is now the host's problem, not the guest's lever.
How a VMM like Firecracker drives /dev/kvm
KVM exposes itself to userspace as a device file: /dev/kvm. It deliberately does not include a device model: KVM gives you CPU and memory virtualization (plus, optionally, in-kernel interrupt controller and timer emulation) and leaves disks, network cards, and consoles to userspace. A Virtual Machine Monitor (VMM) like Firecracker, QEMU, or Cloud Hypervisor is the userspace program that opens /dev/kvm and drives it through ioctl calls to assemble an actual machine: it creates the VM, carves out guest memory, creates virtual CPUs, and emulates the devices (the virtio-net card, the block device, the serial console) that KVM intentionally leaves out.
The conceptual sequence a VMM runs is short and the same everywhere:
/* Conceptual VMM control flow against /dev/kvm (error handling omitted) */
int kvm = open("/dev/kvm", O_RDWR);
/* 1. Create the VM container */
int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);
/* 2. Hand the guest some host memory to use as its physical RAM */
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &mem_region);
/* 3. Create a virtual CPU (one ioctl per vCPU) */
int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
/* 4. Map the vCPU's shared kvm_run area, where KVM reports each exit */
int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, vcpufd, 0);
/* 5. Set initial register state (entry point, etc.), then run the guest */
for (;;) {
    ioctl(vcpufd, KVM_RUN, 0);    /* blocks: CPU runs guest in guest mode */
    switch (run->exit_reason) {   /* a VM-exit landed us back here */
    case KVM_EXIT_IO:             /* guest touched an I/O port -> VMM emulates it */
    case KVM_EXIT_MMIO:           /* guest hit a device register -> VMM emulates it */
    case KVM_EXIT_HLT:            /* guest halted */
        /* handle, then loop back into KVM_RUN to resume the guest */
        break;
    }
}
That KVM_RUN ioctl is the loop. It blocks in the kernel while the vCPU runs the guest natively in guest mode, and it returns to userspace only when a VM-exit happens that KVM can't or shouldn't handle on its own, for example when the guest pokes a virtio device the VMM is responsible for emulating. The VMM reads the exit reason from the shared kvm_run structure, does its bit, then calls KVM_RUN again to re-enter the guest. A real VMM wraps this with a thread per vCPU, device emulation, and a control plane, but the heart of every one of them is this open / create / run loop against /dev/kvm.
Firecracker's design choice is to keep the userspace side of this as small as possible: a minimal Rust VMM that emulates only a handful of virtio devices and a serial console, with no BIOS, no PCI, no legacy hardware. Fewer emulated devices means fewer kinds of VM-exit to handle and a smaller host-side attack surface — the part of the system that isn't enforced by the CPU.
Why this beats the container model on attack surface
This is the whole reason the industry runs untrusted and multi-tenant workloads on KVM-backed microVMs instead of bare containers. A container is a Linux process. Its isolation comes from namespaces and cgroups, but it shares the host's single Linux kernel, and it interacts with that kernel through the full system-call interface — hundreds of syscalls, each a potential bug. A kernel vulnerability or a container escape reachable through any of them puts the host and every neighboring container at risk. Compare the two surfaces:
- Shared-kernel container — the guest calls into the host's kernel directly through the full syscall ABI (hundreds of entry points). The boundary is software (namespaces/cgroups) on a kernel that's shared by everyone on the box. One kernel bug or container escape = host and all neighbors compromised.
- KVM guest (microVM) — the guest has its own kernel and cannot make a syscall into the host at all. Its only interaction is the CPU-enforced VM-exit, handled by a tiny VMM emulating a few virtio devices. An escape has to break hardware virtualization or the small VMM — a deliberately narrow, heavily audited surface.
- Enforcement layer — container: a software supervisor (the shared kernel) decides what's allowed. KVM guest: the CPU's virtualization hardware traps the guest first, before any host software runs.
- Blast radius of a bug — container: the host and every co-tenant. KVM guest: one VM.
Neither boundary is magic: KVM and the VMM have had their own CVEs, and you still want a privilege-dropping jailer, seccomp filters, and KVM kept patched. But the surface area is wildly different. "Hundreds of syscalls into a kernel you share with strangers" versus "a hardware trap into a small, audited Rust VMM (tens of thousands of lines, not millions)" is not a close contest when the code you're running was written by someone, or something, you don't trust. That's exactly why AWS built Firecracker on KVM to run Lambda and Fargate, and why PandaStack runs every sandbox, managed database, and hosted app as its own KVM-backed Firecracker microVM.
Nested virtualization, and "needs /dev/kvm" in practice
Because KVM relies on the CPU's hardware mode, it needs that hardware to be present and exposed. Two practical consequences trip people up. First: nested virtualization. If you're already inside a VM — say a cloud instance — running KVM inside it means the guest mode has to be virtualized too. This works, but only when the outer hypervisor explicitly exposes the virtualization extensions to its guest (on most clouds this is a specific instance type or a flag like enable-nested-virtualization). If those extensions aren't passed through, /dev/kvm simply won't appear inside the VM, and a VMM that needs it can't start.
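When the outer hypervisor is itself Linux/KVM, the "is nesting switched on" question is answered by a module parameter on that outer host: /sys/module/kvm_intel/parameters/nested or /sys/module/kvm_amd/parameters/nested, which reads as Y/N (or 1/0 on older kernels). The helper below is a small illustrative reader for such a file; its name and return convention are assumptions of this sketch, and it says nothing about clouds that run a different hypervisor underneath.

```c
#include <stdio.h>

/* Hypothetical helper: reads a kvm module "nested" parameter file.
 * Returns 1 if enabled ("Y" or "1"), 0 if disabled ("N" or "0"),
 * and -1 if the file is missing or holds something unexpected. */
int nested_virt_enabled(const char *param_path)
{
    FILE *f = fopen(param_path, "r");
    if (!f)
        return -1;              /* module not loaded, or not a KVM host */

    int c = fgetc(f);           /* the first character is enough */
    fclose(f);

    if (c == 'Y' || c == 'y' || c == '1')
        return 1;
    if (c == 'N' || c == 'n' || c == '0')
        return 0;
    return -1;
}
```

On an Intel host you would pass the kvm_intel path, on AMD the kvm_amd one. A return of -1 for both usually means the vendor module is not loaded on that machine at all.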
Second: Apple Silicon Macs. The Mac's M-series CPU has ARM virtualization extensions, but macOS surfaces them through Apple's Virtualization.framework, not /dev/kvm (which is a Linux thing). To run Firecracker on a Mac you boot a Linux VM via Apple's framework, and that Linux guest gets nested KVM, provided the Mac's chip and macOS version support nested virtualization. So /dev/kvm exists inside the Linux VM and Firecracker runs there. That's exactly the path PandaStack's local dev setup uses on Apple Silicon: Apple Virtualization.framework provides the outer nested-virt boundary, Linux's KVM provides the inner one for the microVMs.
Concretely, checking for KVM support takes three quick checks: confirm the CPU advertises the extensions, confirm the device node exists and is usable, and confirm the kernel modules are loaded.
# 1. Does the CPU advertise virtualization extensions? (x86 flags)
#    vmx = Intel VT-x, svm = AMD-V. Count > 0 means yes.
grep -Ec '(vmx|svm)' /proc/cpuinfo
# 2. Is the KVM device node present and accessible?
ls -l /dev/kvm
#    crw-rw---- 1 root kvm 10, 232 ... /dev/kvm   <- good; you can run a VMM
#    No such file or directory                    <- KVM not available
#    (BIOS virtualization disabled, or you're in a VM
#     without nested virtualization enabled)
# 3. Are the kernel modules loaded? (matches kvm, kvm_intel, kvm_amd)
lsmod | grep -E '^kvm'
If /proc/cpuinfo shows the flag but /dev/kvm is missing, it's almost always one of three things: virtualization disabled in the BIOS/firmware, the kvm modules not loaded, or an environment such as a plain container where the device node isn't passed through. If the flag itself is missing, you're on hardware, or inside a VM without nested virtualization, that doesn't expose the extensions at all.
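A program can make the same distinction without shelling out, because the errno from opening the device already separates "node is missing" from "node exists but you may not use it". The following is a rough sketch under that assumption; the devkvm_status enum and both function names are invented for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum devkvm_status {
    DEVKVM_OK,             /* node exists and we may open it read/write */
    DEVKVM_MISSING,        /* no /dev/kvm: modules, firmware, or nesting */
    DEVKVM_NO_PERMISSION,  /* node exists; user likely not in the kvm group */
    DEVKVM_OTHER_ERROR
};

/* Maps an errno value from open("/dev/kvm") to a coarse diagnosis. */
enum devkvm_status classify_open_errno(int err)
{
    switch (err) {
    case 0:      return DEVKVM_OK;
    case ENOENT: return DEVKVM_MISSING;
    case EACCES: return DEVKVM_NO_PERMISSION;
    default:     return DEVKVM_OTHER_ERROR;
    }
}

/* Tries to open the device the way a VMM would and reports what happened. */
enum devkvm_status probe_dev_kvm(void)
{
    int fd = open("/dev/kvm", O_RDWR);
    if (fd < 0)
        return classify_open_errno(errno);
    close(fd);
    return DEVKVM_OK;
}
```

DEVKVM_NO_PERMISSION is the case the ls -l output hints at: the node is owned by root with group kvm, so a user outside that group sees the file but cannot open it.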
The mental model to keep
Strip away the acronyms and KVM is three ideas. One: it's a Linux kernel module that turns the kernel into a hypervisor by driving the CPU's built-in virtualization extensions (VT-x / AMD-V / ARM EL2). Two: guest code runs natively on the real CPU in a separate hardware-enforced guest mode, and the only way out is a VM-exit — a CPU trap that hands control to the host, which is the actual security boundary. Three: a VMM like Firecracker opens /dev/kvm, builds a machine out of ioctls, and runs the guest in a KVM_RUN loop, emulating the few devices KVM leaves out.
That's the layer doing the real isolation work under every microVM. When a platform says it gives untrusted code "hardware isolation," this is what it means: not a software supervisor checking the guest's behavior, but a CPU that won't let the guest touch anything outside its sandbox without trapping first. If you want to see it end to end, start with /blog/what-is-a-microvm for the layer above, and /blog/firecracker-vs-docker for the contrast with the shared-kernel model — KVM is the reason that contrast exists. The PandaStack core is open source under Apache-2.0, so you can run the agent on your own KVM hosts and watch the VM-exits yourself.
Frequently asked questions
What is KVM in simple terms?
KVM (Kernel-based Virtual Machine) is a Linux kernel module that turns the Linux kernel into a hypervisor. It uses the CPU's hardware virtualization extensions (Intel VT-x, AMD-V, or ARM's virtualization extensions) so guest VMs run their code natively on the real CPU at near-native speed, while staying isolated in a separate hardware-enforced guest mode. It's the layer that does the actual isolation under Firecracker, QEMU, and microVM platforms.
Is KVM a type-1 or type-2 hypervisor?
It's debated, because KVM blurs the line. It runs as part of a general-purpose Linux (which looks type-2/hosted), but because it's the kernel itself driving the CPU's virtualization hardware directly, it performs like a type-1/bare-metal hypervisor. The practical answer: Linux is the hypervisor, and it has near-type-1 performance.
Why is a KVM microVM safer than a container for untrusted code?
A container shares the host's single Linux kernel and reaches it through the full system-call interface — hundreds of entry points, any of which could carry a bug enabling a host compromise. A KVM guest has its own kernel and can't make a syscall into the host at all; its only interaction with the outside is a CPU-enforced VM-exit handled by a tiny VMM. That's a far smaller, more heavily audited attack surface, which is why AWS Lambda and platforms like PandaStack run untrusted code in KVM-backed microVMs.
What does it mean when something "needs /dev/kvm"?
/dev/kvm is the device file the kernel exposes once the KVM module is loaded and the CPU's virtualization extensions are available. A VMM like Firecracker opens it and drives KVM through ioctls to create the VM, memory, and vCPUs. "Needs /dev/kvm" means you require a real CPU with virtualization extensions exposed to your environment — present on bare metal and KVM-capable cloud instances, but absent in a plain container or a VM without nested virtualization enabled.
How do I run KVM (and Firecracker) on an Apple Silicon Mac?
macOS doesn't expose /dev/kvm — that's a Linux interface. Instead, you boot a Linux VM through Apple's Virtualization.framework (which uses the M-series chip's ARM virtualization extensions), and that Linux guest gets nested KVM. So /dev/kvm exists inside the Linux VM and Firecracker runs there. PandaStack's local dev setup on Apple Silicon uses exactly this: Apple's framework as the outer boundary, Linux KVM as the inner one for the microVMs.