seccomp explained for developers: filtering syscalls to shrink the kernel attack surface

Ajay Kumar · 9 min read

seccomp (secure computing mode) is a Linux kernel feature that lets a process pre-commit to the set of system calls it's allowed to make. Once a filter is installed, any syscall outside the allowlist is intercepted by the kernel before it does any work: the process is killed, the call fails with an error, or one of a few other configurable outcomes applies. It's one of the highest-leverage isolation primitives in Linux, it's what Docker quietly applies by default to every container you run, and it's the second, software-only layer that Firecracker stacks on top of KVM's hardware boundary. This post explains what seccomp actually is, the syscall-filtering model in concrete terms, how it shrinks the kernel attack surface, and the part most explainers skip: why seccomp on its own is not a substitute for a separate kernel, and how a microVM uses it as defense in depth rather than the whole defense.

What seccomp actually is

Everything a process does that touches the outside world — open a file, send a packet, fork a child, map memory, talk to a device — goes through a system call. The syscall interface is the one door between userspace and the kernel. It's also the entire attack surface a malicious process has against the kernel: if there's an exploitable bug in the kernel, it's reachable through some syscall (often an obscure one with a rarely-exercised argument). The Linux syscall table has well over 300 entries, and the average program uses a few dozen of them.

seccomp's premise follows directly: if a process only needs 40 syscalls, why leave the other 300 reachable? seccomp lets a process voluntarily drop its ability to make calls it doesn't need. The original 2005 mode (now called strict mode) was draconian: once enabled, the process could make exactly four syscalls (read, write, _exit, and sigreturn) and nothing else, and any other call got it killed with SIGKILL. Useful for a pure compute worker that only reads and writes file descriptors it already holds, useless for anything that needs to open a file or a socket. The version everyone actually uses arrived in 2012: seccomp-bpf.

seccomp-bpf (filter mode) lets you attach a BPF program, a small classic-BPF bytecode filter, that the kernel runs on every syscall the process makes. The filter inspects the syscall number and, with limits, its arguments, then returns a verdict: allow the call, kill the process, fail the call with an errno, raise SIGSYS for an in-process handler, or hand the decision to a tracer or supervising process. The filter is the allowlist (or denylist) in executable form. Crucially, it's one-way and inherited: once installed you can't remove or loosen it (you can only stack further filters, and the most restrictive verdict wins), and it carries across fork and execve, so a process can lock itself, and everything it spawns, into a reduced syscall vocabulary it can never escape.
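The one-way, inherited behaviour can be checked directly. The sketch below is a minimal illustration against the raw kernel interface, not code from any particular project: stack_filter and filter_is_sticky are hypothetical names, getppid is just a harmless syscall picked as the probe, and only x86-64 and AArch64 are handled. A child denies getppid with EPERM, then tries to "undo" that by stacking a second filter that allows it, then forks a grandchild to make the call. It exercises fork only; inheritance across execve is not tested here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Hypothetical helper: stack one more filter on the calling process that
 * returns `verdict` for getppid() and allows every other syscall. */
static int stack_filter(unsigned int verdict)
{
    struct sock_filter insns[] = {
        /* Kill on a foreign architecture, then look at the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, verdict),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof insns / sizeof insns[0]),
        .filter = insns,
    };

    /* Unprivileged processes must set no_new_privs before adding a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 1 if getppid() is still denied in a grandchild after an attempt
 * to loosen the filter, 0 if it was loosened, -1 on setup failure. */
int filter_is_sticky(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        pid_t grandchild;
        int st;

        /* 1. Deny getppid(). 2. Try to re-allow it with a second filter. */
        if (stack_filter(SECCOMP_RET_ERRNO | EPERM) != 0 ||
            stack_filter(SECCOMP_RET_ALLOW) != 0)
            _exit(2);

        /* 3. Probe from a grandchild: filters are inherited across fork. */
        grandchild = fork();
        if (grandchild < 0)
            _exit(2);
        if (grandchild == 0) {
            errno = 0;
            _exit(syscall(SYS_getppid) == -1 && errno == EPERM ? 1 : 0);
        }
        if (waitpid(grandchild, &st, 0) < 0 || !WIFEXITED(st))
            _exit(2);
        _exit(WEXITSTATUS(st));
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

Calling filter_is_sticky() from a test program should report 1: the kernel runs every stacked filter and applies the most restrictive verdict, so the later "allow" cannot override the earlier "deny".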

The mental model: seccomp is a programmable bouncer standing at the one door into the kernel. Every syscall has to show ID. The bouncer checks it against a list you wrote in advance and either waves it through or stops it cold — and you can never bribe the bouncer to relax the list once the shift has started.

The syscall-filtering model, concretely

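A seccomp-bpf filter receives a small struct (struct seccomp_data) describing the syscall in flight: the syscall number, the CPU architecture (you must check this, because syscall numbers differ across architectures and forgetting it is a classic bypass), the instruction pointer, and the syscall's six arguments as raw 64-bit values. Raw means raw: a pointer argument arrives as an address the filter cannot dereference, so you can match on flags and integers but not on a path string. Classic BPF only allows forward jumps, so the filter is a loop-free chain of comparisons: load a field, compare the syscall number against the ones you permit, and return an action. The actions that matter in practice, under their libseccomp names: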

  • SCMP_ACT_ALLOW: let the syscall proceed normally. This is the verdict for everything on your allowlist.
  • SCMP_ACT_ERRNO(errno): block the syscall but let the process keep running, returning a chosen errno (commonly EPERM). The program sees a permission error instead of dying. It's gentler, and it's what Docker's default profile uses for most blocked calls.
  • SCMP_ACT_KILL_PROCESS: terminate the whole process immediately on a forbidden syscall. The harshest, safest verdict: no graceful degradation, no chance to retry a different way. This is what you want for a hardened sandbox.
  • SCMP_ACT_TRAP: raise SIGSYS so the process can handle the violation itself.
  • SCMP_ACT_NOTIFY (user notification): hand the syscall to a supervising process to inspect and decide, which powers more advanced policies.
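To make the struct fields and verdicts concrete, here is a rough sketch of the same model against the raw kernel interface, where the verdicts are spelled SECCOMP_RET_* instead of SCMP_ACT_*. The function name outcome_for_verdict is hypothetical, getppid is only a harmless stand-in for "a syscall the policy cares about", and the filter is a one-entry denylist purely to keep the example short. It runs in a forked child so the caller stays unfiltered.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Fork a child, apply `verdict` to getppid(), call it, and report:
 *   0      the syscall succeeded
 *   > 0    the syscall failed with that errno
 *   < 0    the child was killed by signal -(return value)
 *   -1000  setup failed */
int outcome_for_verdict(unsigned int verdict)
{
    struct sock_filter insns[] = {
        /* seccomp_data.arch: refuse anything but the native ABI. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* seccomp_data.nr: `verdict` for getppid, allow everything else. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, verdict),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof insns / sizeof insns[0]),
        .filter = insns,
    };
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1000;
    if (pid == 0) {
        long r;

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);
        errno = 0;
        r = syscall(SYS_getppid);
        _exit(r == -1 ? errno : 0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1000;
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return WEXITSTATUS(status) == 255 ? -1000 : WEXITSTATUS(status);
}
```

Passing SECCOMP_RET_ALLOW should yield 0, SECCOMP_RET_ERRNO | EPERM should yield EPERM, and SECCOMP_RET_KILL_PROCESS should yield -SIGSYS, because the kernel terminates the process as though SIGSYS had killed it.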

Writing raw BPF by hand is miserable, so almost nobody does. Most code uses libseccomp, which gives you a readable API — create a filter context with a default action, add per-syscall rules, load it — and compiles it to BPF for you. The shape is always the same: pick a default verdict (deny-by-default is the only sane choice), then enumerate the exceptions.

/* Deny-by-default seccomp filter with libseccomp (link with -lseccomp).
 * Default verdict: kill the process. Then allow only what we need. */
#include <fcntl.h>    /* O_ACCMODE, O_RDONLY */
#include <seccomp.h>

/* Builds and installs the filter. Returns 0 on success, negative on error. */
int install_filter(void)
{
    /* Default action for any syscall NOT explicitly allowed below:
     * kill the whole process. Deny-by-default is the only safe baseline. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    int rc;

    if (!ctx)
        return -1;

    /* The allowlist: the handful of syscalls this program legitimately needs. */
    rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    if (rc == 0)
        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    if (rc == 0)
        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    if (rc == 0)
        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);

    /* You can also match on arguments, not just the syscall number.
     * openat(dirfd, path, flags, mode): flags is argument index 2.
     * O_RDONLY is 0, so compare only the access-mode bits; a plain
     * equality test would reject O_RDONLY | O_CLOEXEC. */
    if (rc == 0)
        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 1,
                              SCMP_A2(SCMP_CMP_MASKED_EQ, O_ACCMODE, O_RDONLY));

    /* Compile to BPF and install it. After this returns, the filter is
     * irreversible and is inherited across fork() and execve(). */
    if (rc == 0)
        rc = seccomp_load(ctx);

    /* The context is only the builder; the loaded filter stays in the kernel. */
    seccomp_release(ctx);
    return rc;
}

/* Once install_filter() succeeds, any syscall not on the allowlist kills
 * the process before the kernel does any of the work the syscall asked for. */

Two properties make this worth the effort. First, the check happens before the syscall executes: a blocked call never reaches the vulnerable kernel code path at all, so it's not "detect and clean up," it's "the door never opens." Second, deny-by-default means a syscall you forgot to think about, including one added to the kernel after you shipped, is denied automatically rather than silently allowed. An allowlist fails closed; a denylist fails open. That asymmetry is the whole reason to make deny the default; whether the deny verdict is kill or an errno is a separate choice about how loudly to fail.
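For readers curious what a deny-by-default filter looks like underneath libseccomp, the following is a rough sketch that builds one by hand: an architecture check, one comparison per allowed syscall, then the default verdict. The names install_allowlist, allowlist_probe, and MAX_ALLOWED are hypothetical. The default here is EPERM rather than kill so the probe can observe the denial, and getppid stands in for "a syscall nobody thought to list".

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

#define MAX_ALLOWED 32

/* Install a filter that allows only the syscalls in `allowed` and fails
 * everything else with EPERM. Returns 0 on success, -1 on error. */
static int install_allowlist(const int *allowed, size_t n)
{
    struct sock_filter f[MAX_ALLOWED + 6];
    struct sock_fprog prog;
    size_t len = 0, k;

    if (n > MAX_ALLOWED)
        return -1;

    /* Architecture check, then load the syscall number. */
    f[len++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                            offsetof(struct seccomp_data, arch));
    f[len++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                            NATIVE_ARCH, 1, 0);
    f[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                            SECCOMP_RET_KILL_PROCESS);
    f[len++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                            offsetof(struct seccomp_data, nr));

    /* One compare per allowed syscall. On a match, jump forward over the
     * remaining compares and the deny instruction, landing on ALLOW. */
    for (k = 0; k < n; k++)
        f[len++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                                (unsigned int)allowed[k],
                                                (unsigned char)(n - k), 0);

    /* Fall-through is the default: anything not listed is denied. */
    f[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                            SECCOMP_RET_ERRNO | EPERM);
    f[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);

    prog.len = (unsigned short)len;
    prog.filter = f;
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 0 if a listed syscall works and an unlisted one gets EPERM,
 * 1 if the filter behaved differently, -1 on setup failure. */
int allowlist_probe(void)
{
    static const int allowed[] = { SYS_getpid, SYS_exit_group };
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        int listed_ok, unlisted_denied;

        if (install_allowlist(allowed, sizeof allowed / sizeof allowed[0]) != 0)
            _exit(2);
        listed_ok = syscall(SYS_getpid) > 0;
        errno = 0;
        unlisted_denied = syscall(SYS_getppid) == -1 && errno == EPERM;
        _exit(listed_ok && unlisted_denied ? 0 : 1);   /* uses exit_group */
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

The filter never mentions getppid, yet the call is refused: denial is what happens when no rule matches, which is the fail-closed property in miniature.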

How this shrinks the kernel attack surface

The value isn't in blocking the syscalls a program never calls anyway — it's in blocking the ones an exploit would reach for. A huge fraction of kernel privilege-escalation CVEs live in syscalls that ordinary workloads never touch: exotic socket families, obscure ioctl paths, keyring management, BPF program loading, namespace and mount operations, ptrace, kexec. A web server has no business calling any of them. If your filter denies them, an attacker who achieves code execution inside that process can't pivot through those bugs to the kernel, because the door to them was nailed shut before the attacker ever arrived.

This is precisely what Docker does. Every container runs under a default seccomp profile that is itself an allowlist: the default action fails the call with an error, and the profile then permits the large common set that real programs need. The net effect, per Docker's documentation, is that around 44 syscalls out of 300-plus are blocked, including the ones most associated with container escapes and kernel exploitation. You almost never notice it because the profile is tuned to permit normal workloads. You can swap or disable it per-container, which is the single most instructive seccomp experiment you can run:

# Docker applies its DEFAULT seccomp profile automatically.
# You can point at a custom profile:
docker run --security-opt seccomp=/path/to/profile.json myimage

# ...or, to see what the filter is actually buying you, turn it OFF
# (DON'T do this in production -- it widens the kernel attack surface):
docker run --security-opt seccomp=unconfined myimage

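# A minimal custom profile is just a default action + an allowlist.
# (Illustrative only: a list this short is likely too tight to start a real
# container, which needs execve and more just to launch its process.)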
# {
#   "defaultAction": "SCMP_ACT_ERRNO",
#   "architectures": ["SCMP_ARCH_X86_64"],
#   "syscalls": [
#     { "names": ["read","write","openat","close","fstat","mmap",
#                 "exit_group","rt_sigreturn"],
#       "action": "SCMP_ACT_ALLOW" }
#   ]
# }
#
# Same model as the C above: deny-by-default, then enumerate exceptions --
# just expressed as JSON the container runtime compiles into a BPF filter.

Run something under seccomp=unconfined next to the default and you can watch certain exploit techniques that the default profile quietly neutralizes become reachable again. That's the point: seccomp narrows the set of kernel code paths a compromised process can drive.
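A quick way to tell which side of that experiment a process is on is the Seccomp field in /proc/self/status, which the kernel reports as 0 (disabled), 1 (strict mode), or 2 (filter mode). The helper below is a small sketch with a hypothetical name, seccomp_mode_of_self. Inside a container started with Docker's default profile it would be expected to return 2, and 0 under seccomp=unconfined, though that depends on how the host and runtime are configured.

```c
#include <stdio.h>

/* Returns the calling process's seccomp mode as reported by the kernel:
 * 0 = disabled, 1 = strict, 2 = filter. Returns -1 if the field cannot
 * be read (for example, /proc is not mounted). */
int seccomp_mode_of_self(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    char line[256];
    int mode = -1;

    if (!fp)
        return -1;
    while (fgets(line, sizeof line, fp)) {
        /* Matches "Seccomp:\t2" but not the separate "Seccomp_filters:" line. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }
    fclose(fp);
    return mode;
}
```

Compiling this into a tiny program and running it once under the default profile and once with --security-opt seccomp=unconfined makes the difference between the two docker run commands directly visible.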

What seccomp is NOT: the surface is still the host kernel

Here's the part that gets glossed over, and the reason this post exists. seccomp narrows the door, but the room on the other side of the door is still the host's single shared kernel. A seccomp-confined container is a process running directly on the host kernel, and the syscalls you do allow — and you have to allow a substantial set for anything useful to run — execute against that same kernel that every other tenant on the box shares. seccomp shrinks the attack surface; it does not change whose kernel that surface belongs to.

seccomp is the bouncer; the host kernel is still the whole nightclub, and everyone in the building is sharing it. A tighter guest list at the door doesn't give each guest their own building — it just means fewer people get in. If there's a structural fault inside the club, a smaller crowd doesn't fix it; it only lowers the odds that the one person who can trigger it walked through the door tonight.

Two consequences follow. First, seccomp reduces probability, not blast radius. If a kernel bug is reachable through one of the syscalls you had to allow — and real exploits routinely live in common, must-allow calls like ioctl, futex, or memory-management calls, not just exotic ones — a confined process can still drive it, and a successful kernel compromise still owns the host and every co-tenant on it, because they all share that kernel. A smaller allowlist makes a successful exploit less likely; it does nothing to contain one that lands. Second, seccomp filtering is itself code, and it has had its own bypasses (architecture-confusion tricks, argument-check gaps, races) — a narrower door is still a door.

So seccomp is a genuinely valuable hardening layer and a poor isolation boundary to rely on alone for untrusted, multi-tenant code. The thing that actually changes the blast radius is giving each workload its own kernel — which is what a KVM-backed microVM does, and where seccomp finds its proper role.

How Firecracker layers seccomp on top of KVM

Firecracker's isolation story is two distinct boundaries doing two different jobs, and seccomp is the second one. The first boundary is hardware: each microVM is a KVM guest with its own guest kernel, and guest code can't make a syscall into the host at all — its only way out is a CPU-enforced VM-exit trap (covered in the KVM explainer). That's the layer that gives each workload a separate kernel and shrinks the blast radius to a single VM.

But the Firecracker VMM is itself a host userspace process — the program that opens /dev/kvm, runs the KVM_RUN loop, and emulates the handful of virtio devices the guest sees. If a guest found a bug in that device-emulation code and broke out of the VM, it would land in the Firecracker process on the host. So Firecracker installs a strict seccomp-bpf filter on itself at startup, allowlisting only the small set of syscalls the VMM legitimately needs to run a VM — the relevant ioctls for KVM, the I/O calls for its devices and sockets, memory mapping, and little else. Everything else is denied, with kill as the default.

  • KVM (hardware boundary) — gives each microVM its own guest kernel; guest code cannot syscall into the host, only trap via VM-exit. This is what makes the blast radius one VM instead of the whole host. It's the boundary seccomp alone can't provide.
  • seccomp on the VMM (software boundary) — confines the Firecracker host process itself, so even a guest that breaks out of KVM into the VMM hits a process that can make only a tiny, audited set of syscalls. A breakout lands in a straitjacket, not in a shell.
  • Defense in depth — the two are complementary, not redundant. KVM is the strong boundary that changes blast radius; seccomp is the backstop that hardens the small host-side surface (the VMM and its device emulation) that sits below KVM. You want both because each covers the other's failure mode.
  • Plus a jailer — Firecracker is typically run under a jailer that drops privileges, sets up a chroot, and uses cgroups/namespaces, so the seccomp-confined VMM is also unprivileged and resource-bounded. Layers all the way down.

Read in that order, the difference from the container model is sharp. A bare seccomp-confined container is one layer: a confined process on the shared host kernel — narrow door, but if something gets through, it's straight into the kernel everyone shares. A Firecracker microVM is seccomp applied where it's strongest — backstopping a tiny host-side process — sitting underneath the real boundary, which is a separate guest kernel enforced by the CPU. seccomp isn't the thing keeping the guest away from the host; KVM is. seccomp is there in case KVM's small host-side surface ever fails.

The mental model to keep

seccomp is three ideas. One: it's a kernel feature that lets a process pre-commit, irreversibly, to an allowlist of syscalls it may make — a BPF filter the kernel runs on every call, returning allow, kill, errno, or trap. Two: deny-by-default filtering shrinks the kernel attack surface by making the obscure, exploit-favored syscalls unreachable before an attacker arrives, which is exactly why Docker confines every container with one. Three — the one to actually internalize: it narrows the door, but the room behind it is still the shared host kernel, so seccomp lowers the odds of a successful kernel exploit without changing who gets owned when one lands. That's why it's a hardening layer, not an isolation boundary you stake multi-tenant untrusted code on alone.

The right way to use it is the way Firecracker does: as defense in depth on top of a real boundary. PandaStack runs every sandbox, managed database, and hosted app as its own KVM-backed Firecracker microVM — separate guest kernel for blast-radius, seccomp on the VMM and a privilege-dropping jailer to harden the thin host-side surface underneath. For the layer that actually does the blast-radius work, read /blog/kvm-explained-for-developers; for the contrast with the shared-kernel model that seccomp can only narrow, not replace, read /blog/why-docker-is-not-a-sandbox. The PandaStack core is open source under Apache-2.0, so you can read the exact seccomp filters the VMM ships with and run the whole stack on your own KVM hosts.

Frequently asked questions

What is seccomp in simple terms?

seccomp (secure computing mode) is a Linux kernel feature that lets a process pre-commit to the set of system calls it's allowed to make. With seccomp-bpf, you attach a small BPF filter that the kernel runs on every syscall the process makes; anything outside the allowlist is blocked before it executes — the process is killed, or the call returns an error, depending on the verdict you chose. The filter is irreversible once installed and is inherited across fork and exec, so a process can lock itself and its children into a reduced syscall vocabulary.

How does seccomp reduce the kernel attack surface?

Every interaction a process has with the kernel goes through a syscall, and any exploitable kernel bug is reachable through some syscall — often an obscure one. A deny-by-default seccomp filter allows only the few dozen syscalls a program actually needs and blocks the hundreds it doesn't, including the exotic calls that many privilege-escalation exploits rely on. A compromised process then can't reach those vulnerable kernel code paths at all, because the check happens before the syscall runs. Docker applies a default seccomp profile to every container for exactly this reason.

Is seccomp enough to safely run untrusted code?

Not on its own. seccomp narrows the syscall interface, but the syscalls you do have to allow still execute against the host's single shared kernel — the same kernel every other tenant uses. seccomp lowers the probability of a successful kernel exploit; it doesn't change the blast radius if one lands, because a kernel compromise still owns the host and every co-tenant. For untrusted, multi-tenant code you want a separate kernel per workload (a KVM-backed microVM), with seccomp used as a hardening layer on top rather than the whole boundary.

How does Firecracker use seccomp?

Firecracker layers seccomp on top of KVM as defense in depth. KVM gives each microVM its own guest kernel, so guest code can't syscall into the host — that's the boundary that limits blast radius to one VM. But the Firecracker VMM is itself a host process (it drives /dev/kvm and emulates a few virtio devices), so Firecracker installs a strict seccomp-bpf filter on itself that allowlists only the small set of syscalls the VMM needs. If a guest ever broke out of KVM into the VMM, it would land in a process that can make almost no syscalls — a straitjacket, not a shell. Firecracker is typically also run under a privilege-dropping jailer.

What's the difference between a seccomp-confined container and a microVM?

A seccomp-confined container is a single layer: a process with a narrowed syscall allowlist running directly on the shared host kernel. If something reaches the kernel through an allowed syscall, the host and all co-tenants are at risk. A microVM uses seccomp where it's strongest — backstopping a tiny host-side VMM process — underneath the real boundary, which is a separate guest kernel enforced by the CPU via KVM. seccomp isn't what keeps the guest away from the host in a microVM; KVM is. seccomp is the backstop in case KVM's small host-side surface ever fails.

Written by Ajay Kumar, Founder, PandaStack.