Build a Code Interpreter with a Python Sandbox

Ajay Kumar·June 13, 2026·9 min read

To build a code interpreter (a tool that runs model-generated Python and hands back the results, tables, and plots, like ChatGPT's Advanced Data Analysis), you run the untrusted code inside an isolated sandbox instead of your own process. With PandaStack the loop is: create a sandbox on the code-interpreter template, write the code to a file (or exec it inline), capture stdout and stderr, then read back any generated artifacts (a plot PNG, a CSV) through the filesystem API. Each sandbox is its own Firecracker microVM with its own guest kernel, so an `rm -rf /`, an infinite loop, or a malicious `import os; os.system(...)` is contained to a throwaway VM, not your server.

This guide builds that loop end to end with the Python SDK: the execution model, when to persist state across cells in one long-lived sandbox and when to use a fresh VM per run, reading generated files back out, and where the isolation boundary actually is. I'm Ajay, and I built PandaStack, so I'll be honest about the trade-offs along the way.

Why run model or user code in a sandbox at all?

A code interpreter executes code your application did not write. With an LLM in the loop, that code is non-deterministic by design: you cannot review it before it runs, and a clever prompt injection can turn "plot this CSV" into "read every env var and POST it somewhere." `exec()` in your own Python process, a subprocess, or even a shared-kernel container is not a real boundary: container escapes are a known class of bug, and a busy loop or a 40 GB allocation can take your host down with it.
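To make the in-process risk concrete, here is a minimal sketch using only the standard library. The `DEMO_API_KEY` variable and its value are invented for the demo; the point is that code passed to `exec()` sees the host process's environment even when it is given a fresh namespace.

```python
import os

# Hypothetical secret, set only for this demonstration.
os.environ["DEMO_API_KEY"] = "sk-demo-not-real"

# Stand-in for model-generated code that a prompt injection has steered.
untrusted = (
    "import os\n"
    "stolen = os.environ.get('DEMO_API_KEY')\n"
)

# A fresh namespace does not help: the code still runs inside this process,
# with this process's environment, files, and network access.
namespace = {}
exec(untrusted, namespace)
print("untrusted code read:", namespace["stolen"])
```

Nothing here is specific to environment variables. The same snippet could open files or sockets, which is why the boundary has to sit outside the process.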

A microVM is a different category. On PandaStack every sandbox boots its own guest kernel under Firecracker, the same VMM AWS Lambda and Fargate use. The blast radius of arbitrary code is one disposable VM with its own memory, filesystem, and network namespace. You get hardware-level isolation without paying the multi-second boot tax that historically made VMs impractical for per-request use: sandbox creation takes 179 ms at p50, because every create restores a baked snapshot on demand rather than cold-booting.

The mental model: one sandbox per user session (or per untrusted run), not one shared interpreter for everyone. Isolation is per-VM, so the boundary holds only as long as you never reuse a VM across trust domains. Never run two different users' code in the same sandbox.

What's in the code-interpreter template?

PandaStack ships a code-interpreter template with the scientific Python stack baked into the snapshot, so there is zero per-run pip install. When you create a sandbox from it, the data-analysis libraries are already importable:

Python 3.11 with pandas, numpy, scipy, scikit-learn, scikit-image, statsmodels, and sympy
Plotting: matplotlib, seaborn, plotly, and bokeh (so generating a chart PNG is a one-liner)
Jupyter/IPython kernel, plus image, audio, and NLP libraries (Pillow, OpenCV, librosa, spaCy, NLTK)
Node.js 22 alongside Python, and openai/anthropic SDKs if the generated code calls an LLM itself
A working directory at /workspace — write scripts and read artifacts there

Because all of that lives in the baked snapshot, the libraries are paged in lazily on restore rather than installed at runtime. If you need a package that isn't baked, you can `pip install` it inside the sandbox at run time, or bake your own template — see the templates docs for building a custom image.
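As a rough sketch of the run-time install path, the helper below checks whether a module imports in the guest and only runs pip if it does not. It assumes nothing beyond the `exec(command, timeout_seconds=...)` call and `exit_code` field used in this guide. The name `ensure_package` and its parameters are hypothetical, and it assumes `pip` is available to the guest's `python3`.

```python
import shlex


def ensure_package(sbx, import_name, pip_name=None, timeout_seconds=120):
    """Install a package in the guest only if importing it fails.

    `sbx` is any object exposing exec(command, timeout_seconds=...) whose
    result has an `exit_code`. Returns True if the module imports afterwards.
    """
    pip_name = pip_name or import_name  # e.g. import "yaml", install "pyyaml"
    check = f"python3 -c {shlex.quote('import ' + import_name)}"

    if sbx.exec(check, timeout_seconds=30).exit_code == 0:
        return True  # already importable, nothing to install

    install = f"python3 -m pip install {shlex.quote(pip_name)}"
    if sbx.exec(install, timeout_seconds=timeout_seconds).exit_code != 0:
        return False  # install failed (no network, bad name, ...)

    # Re-check so the caller gets a truthful answer, not just "pip exited 0".
    return sbx.exec(check, timeout_seconds=30).exit_code == 0
```

If the same package is needed on every run, installing it per sandbox repeats the cost each time, which is the case the custom-template route is for.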

The core loop: write code, exec, capture output

The simplest interpreter does three things: get a sandbox, run the code, return stdout and stderr. Set PANDASTACK_TOKEN in your environment and the SDK picks it up; the base URL defaults to https://api.pandastack.ai. The Sandbox context manager kills the VM on exit (for non-persistent sandboxes), so cleanup is automatic.

from pandastack import Sandbox

# A throwaway interpreter for one untrusted snippet.
user_code = """
import pandas as pd
df = pd.DataFrame({"city": ["NYC", "SF", "LA"], "pop_m": [8.3, 0.87, 3.9]})
print(df.sort_values("pop_m", ascending=False).to_string(index=False))
print("total:", df.pop_m.sum())
"""

with Sandbox.create(template="code-interpreter", ttl_seconds=300) as sbx:
    # Write the model-generated code to a file in the guest, then run it.
    sbx.filesystem.write("/workspace/cell.py", user_code)
    result = sbx.exec("python3 /workspace/cell.py", timeout_seconds=30)

    print("exit:", result.exit_code)
    print("stdout:\n", result.stdout)
    if result.stderr:
        print("stderr:\n", result.stderr)
# sandbox is destroyed here

`exec` returns an ExecResult with `stdout`, `stderr`, `exit_code`, and `duration_ms`. Always pass a `timeout_seconds` — model-written code loops more often than you'd like, and the timeout is your circuit breaker. Set a `ttl_seconds` on create too, so a sandbox you forget to kill reaps itself.

Writing to a file scales better than inlining code as a `-c` string: you avoid shell-quoting hazards with multi-line code, and the source is on disk if you want to re-run or inspect it. For one-liners, the SDK also has `run_code(code, language="python")`, which shells out to `python3 -c` for you with proper quoting.
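The quoting hazard is easy to reproduce locally with the standard library's `shlex` module. This sketch only builds and parses command strings and does not call the SDK; the sample `code` string is made up for illustration.

```python
import shlex

# A two-line program containing both kinds of quote.
code = 'name = "O\'Brien"\nprint(f"hello, {name}")\n'

# Naive interpolation: quotes inside the code collide with the wrapper's.
naive = f'python3 -c "{code}"'

# shlex.quote wraps the whole program as a single shell word.
safe = f"python3 -c {shlex.quote(code)}"

# Parse both the way a POSIX shell would split them into arguments.
try:
    shlex.split(naive)
    naive_parses = True
except ValueError:
    naive_parses = False  # unbalanced quotes

safe_argv = shlex.split(safe)
print("naive parses cleanly:", naive_parses)
print("safe round-trips:", safe_argv == ["python3", "-c", code])
```

Writing the code to a file sidesteps the problem entirely, because the program text never passes through shell parsing.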

Reading back plots and generated files

A real code interpreter doesn't just print text — it makes charts. The pattern is: the generated code writes a file to /workspace, and your host reads it back with `filesystem.read`, which returns raw bytes. Save a matplotlib figure as a PNG in the guest, pull the bytes out, and you have something to render in your UI or attach to a chat message.

from pandastack import Sandbox

plot_code = """
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 4 * np.pi, 400)
plt.figure(figsize=(8, 4))
plt.plot(x, np.sin(x), label="sin")
plt.plot(x, np.cos(x), label="cos")
plt.legend(); plt.title("trig")
plt.savefig("/workspace/plot.png", dpi=120, bbox_inches="tight")
print("wrote /workspace/plot.png")
"""

with Sandbox.create(template="code-interpreter", ttl_seconds=300) as sbx:
    sbx.filesystem.write("/workspace/plot.py", plot_code)
    result = sbx.exec("python3 /workspace/plot.py", timeout_seconds=60)
    assert result.exit_code == 0, result.stderr

    # Read the generated artifact back as bytes and save it locally.
    png_bytes = sbx.filesystem.read("/workspace/plot.png")
    with open("plot.png", "wb") as f:
        f.write(png_bytes)
    print(f"pulled {len(png_bytes)} bytes")

The same pattern covers any artifact: a cleaned CSV, an Excel export, a generated audio clip, a JSON result. Have the code write a known path, check the exit code, then `filesystem.read` it. For larger result sets it's often cleaner to have the code write structured JSON to /workspace/result.json and parse that on the host than to scrape stdout.
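A minimal sketch of the JSON-result variant follows. It uses only the `filesystem.write`, `exec`, and `filesystem.read` calls shown above; the helper name `run_for_json`, the `job.py` path, and the sample guest code are assumptions made for illustration.

```python
import json

RESULT_PATH = "/workspace/result.json"

# Guest-side code: compute something and write a structured result file.
guest_code = f"""
import json
import pandas as pd

df = pd.DataFrame({{"x": [1, 2, 3, 4]}})
summary = {{"rows": len(df), "mean_x": float(df.x.mean())}}
with open("{RESULT_PATH}", "w") as f:
    json.dump(summary, f)
"""


def run_for_json(sbx, code, result_path=RESULT_PATH, timeout_seconds=60):
    """Run `code` in the guest and return the parsed JSON it wrote."""
    sbx.filesystem.write("/workspace/job.py", code)
    result = sbx.exec("python3 /workspace/job.py", timeout_seconds=timeout_seconds)
    if result.exit_code != 0:
        # The result file was probably never written; surface stderr instead.
        raise RuntimeError(f"guest code failed: {result.stderr}")
    raw = sbx.filesystem.read(result_path)  # raw bytes
    return json.loads(raw)  # json.loads accepts bytes directly
```

Parsing a file the code wrote on purpose is more robust than scraping stdout, since stray prints or library warnings cannot corrupt the payload.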

Persisting variables across cells: one VM or a fresh one?

ChatGPT's interpreter feels stateful — you load a dataframe in one cell and reference it in the next. That state lives in a single long-running Python process. You have two ways to get this, and the right choice depends on your trust model.

Option A: one long-lived sandbox per session

Keep a sandbox alive for the whole user session and run each cell against it. State persists naturally on the guest filesystem — load data once, then reference the saved artifact in later cells. This is the right shape for an interactive notebook UI or an agent that iterates on a dataset.

from pandastack import Sandbox

# Persistent sandbox survives until you kill it; the TTL is a backstop
# (one hour here, an illustrative value).
sbx = Sandbox.create(template="code-interpreter", persistent=True, ttl_seconds=3600)
try:
    # Cell 1: load + cache a dataframe to disk.
    sbx.filesystem.write("/workspace/c1.py", (
        "import pandas as pd\n"
        "df = pd.DataFrame({'x': range(1000)})\n"
        "df['y'] = df.x ** 2\n"
        "df.to_parquet('/workspace/state.parquet')\n"
        "print('rows:', len(df))\n"
    ))
    print(sbx.exec("python3 /workspace/c1.py", timeout_seconds=30).stdout)

    # Cell 2: reuse the cached state from cell 1.
    sbx.filesystem.write("/workspace/c2.py", (
        "import pandas as pd\n"
        "df = pd.read_parquet('/workspace/state.parquet')\n"
        "print('max y:', int(df.y.max()))\n"
    ))
    print(sbx.exec("python3 /workspace/c2.py", timeout_seconds=30).stdout)
finally:
    sbx.kill()

For true in-memory continuity (keeping live Python objects, not just files, between cells) run a persistent Jupyter kernel inside the sandbox and send each cell to it — the kernel binaries are already baked into the template. The filesystem-cache approach above is simpler and survives a guest restart; the kernel approach matches notebook semantics exactly. To keep a session warm cheaply between bursts of activity, call `hibernate()` — it snapshots memory and disk and stops the VM, and the next request auto-wakes it.
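To show what in-memory continuity means in practice, here is a small stand-in for a kernel: one long-lived object that executes each cell in the same namespace, so variables from earlier cells stay alive. This is not the Jupyter protocol and not part of the SDK. `CellRunner` is a hypothetical name, and how cells reach such a process inside the guest (stdin, a socket) is left out.

```python
import contextlib
import io
import traceback


class CellRunner:
    """Run cells one after another in a single shared namespace."""

    def __init__(self):
        self.namespace = {"__name__": "__main__"}

    def run(self, source):
        out = io.StringIO()
        error = None
        with contextlib.redirect_stdout(out):
            try:
                exec(compile(source, "<cell>", "exec"), self.namespace)
            except Exception:
                # Keep the process alive; report the traceback to the caller.
                error = traceback.format_exc()
        return {"stdout": out.getvalue(), "error": error}


runner = CellRunner()
runner.run("import math\nradius = 3")                       # cell 1 defines state
second = runner.run("print(round(math.pi * radius ** 2, 2))")  # cell 2 reuses it
print(second["stdout"], end="")
```

The trade-off mirrors the paragraph above: live objects survive between cells, but they are lost if the process or guest restarts, whereas files on disk are not.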

Option B: a fresh sandbox per run

For a stateless "run this snippet and return the answer" tool — or any time the code comes from a different user than the last run — create a new sandbox each time and destroy it after. No state leaks between runs, which is the safe default for multi-tenant or fully untrusted input. The 179ms create makes this practical per request, and if you want every run to start from a known-good post-setup state, snapshot a configured sandbox once and `fork` it — a same-host fork is around 400ms and shares memory copy-on-write.

Reusing one long-lived sandbox across different users mixes their data and code in the same VM — the isolation boundary is the VM, so this defeats it. Long-lived sandboxes are for a single trusted session; for multi-tenant traffic, use a fresh sandbox (or fork) per run and kill it after.
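The fresh-per-run shape can be wrapped in one small function. In this sketch `create_sandbox` is a hypothetical zero-argument factory that you supply; it is assumed to return a context-managed sandbox like the `with Sandbox.create(...)` form shown earlier. The name `run_isolated` is also made up.

```python
def run_isolated(create_sandbox, code, timeout_seconds=30):
    """Run one snippet in its own sandbox and return a plain dict.

    `create_sandbox` is a zero-argument callable returning a sandbox that
    works as a context manager. Passing a factory keeps the helper free of
    SDK imports and easy to test with a stub.
    """
    with create_sandbox() as sbx:
        sbx.filesystem.write("/workspace/cell.py", code)
        result = sbx.exec(
            "python3 /workspace/cell.py", timeout_seconds=timeout_seconds
        )
        # Copy what we need out before the context manager tears the VM down.
        return {
            "exit_code": result.exit_code,
            "stdout": result.stdout,
            "stderr": result.stderr,
        }
```

With the SDK, a call might look like `run_isolated(lambda: Sandbox.create(template="code-interpreter", ttl_seconds=300), user_code)`. Each call gets its own VM, so nothing written by one run is visible to the next.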

Wiring it into an agent as a tool

To give an LLM a code_interpreter tool, expose a single function that takes a code string, runs it in a sandbox, and returns the exit code, stdout, and stderr. Artifacts the code writes can be read back separately with `filesystem.read`, as in the plot example. Keep the sandbox alive across the agent's turns so the model can build on earlier results, and hand exit code and stderr straight back to the model so it can self-correct on errors.

from pandastack import Sandbox

class CodeInterpreter:
    def __init__(self):
        # TTL is a backstop in case close() is never called (illustrative value).
        self.sbx = Sandbox.create(template="code-interpreter", persistent=True, ttl_seconds=3600)
        self._n = 0

    def run(self, code: str) -> dict:
        """Tool the model calls. Returns a dict you serialize back to it."""
        self._n += 1
        path = f"/workspace/cell_{self._n}.py"
        self.sbx.filesystem.write(path, code)
        r = self.sbx.exec(f"python3 {path}", timeout_seconds=60)
        return {"exit_code": r.exit_code, "stdout": r.stdout, "stderr": r.stderr}

    def close(self):
        self.sbx.kill()

# interp = CodeInterpreter()
# print(interp.run("print(2 ** 10)"))
# interp.close()

If the agent's tool calls can come from different end users, give each user their own CodeInterpreter instance (their own VM) rather than sharing one. For a deeper look at the lifecycle methods used here, see the sandboxes and the snapshots and forks docs.
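One way to enforce the one-interpreter-per-user rule is a small registry keyed by user ID. This is a sketch under assumptions: `InterpreterRegistry` and its method names are hypothetical, and `factory` is any zero-argument callable returning an object with a `close()` method, such as the `CodeInterpreter` class above.

```python
import threading


class InterpreterRegistry:
    """Hand out one interpreter per user and never share across users."""

    def __init__(self, factory):
        self._factory = factory  # e.g. the CodeInterpreter class
        self._by_user = {}
        self._lock = threading.Lock()

    def get(self, user_id):
        # Creation happens under the lock so two concurrent requests from
        # the same user cannot end up with two separate VMs.
        with self._lock:
            if user_id not in self._by_user:
                self._by_user[user_id] = self._factory()
            return self._by_user[user_id]

    def close(self, user_id):
        with self._lock:
            interp = self._by_user.pop(user_id, None)
        if interp is not None:
            interp.close()

    def close_all(self):
        with self._lock:
            interps = list(self._by_user.values())
            self._by_user.clear()
        for interp in interps:
            interp.close()
```

Holding the lock during creation serializes sandbox startup across users, which is acceptable for a sketch; a per-user lock would avoid that at the cost of more code.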

Cleanup, timeouts, and honest limits

Sandboxes cost memory while they run, so reap them. The `with Sandbox.create(...) as sbx:` form kills non-persistent sandboxes on block exit; for persistent ones call `kill()` explicitly or rely on a `ttl_seconds` backstop. A few operational notes from running this in production:

Always set both timeout_seconds on exec and ttl_seconds on create — defense in depth against runaway code and leaked VMs.
Check exit_code before reading artifacts; a non-zero exit usually means the artifact was never written.
For multi-tenant workloads, prefer fresh-per-run or fork-per-run over a shared long-lived VM.
Network egress from the guest is real — if the threat model includes data exfiltration, restrict outbound access at the network layer rather than trusting the code.
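The exit-code check from the notes above can be folded into a small helper. This sketch assumes only the `exit_code` field and the `filesystem.read` call used throughout this guide. `collect_artifacts` is a hypothetical name, and base64 is just one convenient encoding for passing bytes through JSON.

```python
import base64


def collect_artifacts(sbx, result, paths):
    """Read expected artifacts only when the run succeeded.

    Returns {path: base64 string}, or an empty dict if the exit code was
    non-zero, in which case the files were most likely never written.
    """
    if result.exit_code != 0:
        return {}
    artifacts = {}
    for path in paths:
        data = sbx.filesystem.read(path)  # raw bytes
        artifacts[path] = base64.b64encode(data).decode("ascii")
    return artifacts
```

Listing the expected paths explicitly means the host never reads a file the generated code was not supposed to produce.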

When is a sandbox the wrong tool? If you fully control the code and it's not user- or model-generated, a plain subprocess is simpler and faster — don't reach for a VM to run code you wrote and trust. And a sandbox isolates execution, not your secrets: never inject API keys or credentials the running code shouldn't see into the guest environment. Within those bounds, a per-VM code interpreter gives you the convenience of ChatGPT's data analysis with the isolation properties of a real hypervisor, self-hosted on your own infrastructure.

Frequently asked questions

How do I build a code interpreter that runs Python safely?

Run the generated code inside an isolated sandbox instead of your own process. With PandaStack, create a sandbox on the code-interpreter template, write the code to a file in the guest, exec it with a timeout, and read stdout, stderr, and any artifacts back through the filesystem API. Each sandbox is its own Firecracker microVM with a separate guest kernel, so arbitrary code is contained to a disposable VM rather than your host.

How do I persist variables between code interpreter cells?

Keep one long-lived sandbox alive for the whole session and run each cell against it; state persists on the guest filesystem (e.g. save a dataframe to parquet in cell one, read it in cell two). For true in-memory object continuity, run a persistent Jupyter kernel inside the sandbox and send each cell to it. For multi-tenant or fully untrusted input, use a fresh sandbox per run instead, so no state leaks between users.

What libraries are in the code-interpreter template?

The PandaStack code-interpreter template bakes in Python 3.11 with pandas, numpy, scipy, scikit-learn, statsmodels, and the matplotlib/seaborn/plotly/bokeh plotting stack, plus a Jupyter/IPython kernel, image/audio/NLP libraries, and Node.js 22. Because they live in the baked snapshot, there is no per-run install cost — imports work immediately on a fresh sandbox.

How do I read a generated plot or file out of a sandbox?

Have the generated code write the file to a known path under /workspace (for example `plt.savefig('/workspace/plot.png')`), check that exec returned `exit_code` 0, then call `sandbox.filesystem.read('/workspace/plot.png')`, which returns the raw bytes. Save those bytes locally or render them in your UI. The same pattern works for any artifact: a CSV, an Excel export, or a JSON result file.

Written by Ajay Kumar, Founder, PandaStack.