all posts

Fork a microVM for Tree-of-Thought Agents

Ajay Kumar··9 min read

A reasoning agent spends most of its life at a fork in the road. It has a bug in front of it and three plausible fixes; a slow endpoint and two libraries that might speed it up; a gnarly module and several ways to refactor it. Single-path agents handle this the way an anxious person handles a menu: pick one, commit, and if it's wrong, back out and try again — serially, one dead end at a time. Tree-of-thought says do the obvious thing instead: try all of them at once, score the outcomes, and keep the best. The catch has always been cost. Running N trajectories usually means N environments and N times the setup. Copy-on-write microVM forking removes that catch: you snapshot the agent's exact running state and fork it into N children that share the parent's memory and disk until they diverge — so you pay for the divergence, not for N full copies. This post is about turning that primitive into a branch-and-explore loop.

The idea: branch the machine, not just the prompt

Tree-of-thought is usually described as a prompting technique — sample several continuations of the model's reasoning, evaluate them, expand the promising ones. That framing quietly assumes the only state that matters lives in the context window. For a coding or reasoning agent it doesn't. The real state is a whole computer: a cloned repo, an installed dependency tree, a built target, a loaded dataset, a half-applied migration, environment variables, a warm interpreter with objects in memory. Branch only the prompt and every child re-derives that computer from scratch.

Branch the machine and each child inherits all of it. The agent reaches a decision point, you freeze the VM exactly as it is, and you fork it N ways. Child 1 tries the JWT fix, child 2 tries OAuth, child 3 rewrites the whole auth module out of spite. Each runs its candidate against the same starting reality — same files, same deps, same warm caches — then you run the tests, score each branch by whether they pass, keep the winner, and delete the rest. The losers cost almost nothing because they never diverged far enough to allocate much of their own.

The unit of a thought here is a real, isolated machine — not a string. Each branch is a full microVM with its own guest kernel, so one child's confidently-wrong `rm -rf` can't reach its siblings or the parent. The tree is made of computers, and the dead branches get thrown away for the cost of a delete().

Why copy-on-write fork is what makes this affordable

The reason nobody does this with normal VMs or even containers is that N independent environments cost N times the setup, and setup is the expensive part. Cloning a repo and running an install can take minutes; do it five times in parallel and you've spent five times the minutes and five times the RAM before a single candidate has been tried. Best-of-N stops being worth it somewhere around N=2.

Copy-on-write inverts that math. A same-host fork on PandaStack lands in roughly 400–750ms, and it does no bulk data movement at all — it's a reflink of the rootfs plus a memory map, then resume. The N children start out sharing essentially all of the parent's pages and disk blocks. They only allocate their own storage for the bytes they actually change. So the cost of a fork isn't proportional to the size of the environment; it's proportional to how much each branch diverges. Five branches that each touch a few files and run a test suite share the giant common baseline — the cloned repo, the node_modules, the built artifacts — and pay only for their own small deltas.

Two copy-on-write mechanisms do the work, one for each half of the machine.

Copy-on-write memory (MAP_PRIVATE)

The parent's RAM is captured as a memory file. When a child restores from it, PandaStack maps that file with MAP_PRIVATE, which is lazy and copy-on-write at page granularity. Nothing is eagerly copied. When the child's guest reads a page, the kernel serves it straight from the shared snapshot file — no copy, and every sibling reading the same untouched page shares it in the page cache. When the child writes a page, the kernel faults, makes a private 4 KiB copy for just that child, and leaves the original intact for everyone else. A branch that reads its whole 2 GiB working set but only modifies a handful of pages holds almost no private memory of its own. That's why you can fan out to several branches on one host without the RAM bill scaling linearly with N.

The disk gets the same treatment via XFS reflinks. A reflink (cp --reflink) creates a new rootfs file that points at the same underlying data extents as the parent's — an O(metadata) operation that returns in about the time it takes to write a little filesystem bookkeeping, regardless of whether the rootfs is 2 GiB or 20. The two files share blocks until one writes, at which point the filesystem copies only the affected blocks for the writer. So when child 2 runs `git apply` and edits three source files, it privately owns those three files' blocks and continues sharing everything else — the .git objects, the dependency tree, the OS itself — with the parent and its siblings.

A fork is those two things happening together: MAP_PRIVATE on the memory file and a reflink on the rootfs, neither of which copies real data up front. The child is a full, independent machine that happens to start life sharing almost everything with its parent.

The concrete demo: best-of-N code fix

The cleanest place to see this pay off is bug-fixing, because it comes with a built-in judge: the test suite. Agents are unreliable per-attempt but much stronger in aggregate — ask for one fix and it's a coin flip, ask for five and score them by whether the tests go green and your odds jump. The naive implementation spins up five environments and clones-plus-installs in each. The fork implementation warms one environment, then forks it per candidate so every child inherits the clone and the install for free.

Here's the whole loop with the Python SDK. Set PANDASTACK_API_KEY in your environment first; install with pip install pandastack.

from pandastack import Sandbox
import concurrent.futures as cf

# 1. Warm ONE environment. Pay the expensive setup exactly once.
parent = Sandbox.create(template="code-interpreter", persistent=True, ttl_seconds=1800)
parent.exec("git clone --depth 1 https://github.com/acme/service /work")
parent.exec("cd /work && pip install -r requirements.txt", timeout_seconds=300)

# 2. Freeze the warm state as a clean checkpoint we can always resume from.
checkpoint = parent.snapshot()
print("checkpoint:", checkpoint)

# 3. Five different model-written fixes for the same failing bug.
candidates = generate_n_fixes(n=5)   # each is a unified diff

def try_fix(patch: str) -> dict:
    # Fork the warm parent (~400-750ms same-host). The child inherits the
    # cloned repo AND the installed deps via copy-on-write -- no re-setup.
    child = parent.fork()
    try:
        child.filesystem.write("/work/fix.patch", patch)
        child.exec("cd /work && git apply fix.patch")
        # The judge: exit_code 0 means the suite passed.
        r = child.exec("cd /work && pytest -q", timeout_seconds=180)
        return {"patch": patch, "score": 1 if r.exit_code == 0 else 0,
                "sandbox": child.id, "tail": r.stdout[-800:]}
    finally:
        # A losing branch barely diverged, so deleting it reclaims almost
        # nothing -- copy-on-write already made it nearly free to exist.
        child.delete()

# 4. Explore all five branches in parallel, each in its own microVM.
with cf.ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(try_fix, candidates))

# 5. Keep the winner. Ties broken by whoever you trust more.
winners = [r for r in results if r["score"] == 1]
print(f"{len(winners)}/5 candidates passed the suite")
best = winners[0] if winners else None

# The parent still holds the checkpoint -- resume and branch again with a
# new batch of ideas if nothing passed. The trunk outlives the branches.
parent.delete()

The shape is always the same: warm once, snapshot, fork per candidate, run, score on a real signal (here an exit code, not the model's opinion of its own work), keep the winner, delete the rest. Because the losers barely diverged, deleting them reclaims almost nothing — copy-on-write already made them nearly free to exist. That's the property that makes N a knob you can actually turn up.

Fork clones the machine, not the outside world. If a branch sends an email, charges a card, or writes to a shared external database, every branch does it — you'll deliver five emails and keep one fix. Score branches on in-VM signals (tests, builds, local checks), and make outside side effects the job of the winner only, after you've picked it.

Deeper: the fork tree

Best-of-N is one level deep — fork, score, pick. But nothing stops a child from being a parent. Each fork is a full sandbox you can snapshot and fork again, and PandaStack tracks the parent-child relationships in its database, so what you actually get is a tree. That's what makes it a genuine tree-of-thought search rather than a flat batch: expand the most promising branch, fork it into its own children, prune the ones that stop looking good, and keep going down the line that's working.

  1. Root: the warm environment after setup — repo cloned, deps installed, tests currently red.
  2. Level 1: fork into a handful of high-level strategies (fix the auth layer, fix the caller, fix the schema). Run a cheap check on each; prune the obviously-broken ones immediately.
  3. Level 2: take the surviving strategy and fork it into concrete variants (three ways to implement the chosen fix). Run the full suite on each.
  4. Keep the green leaf; every other node in the tree gets deleted. The path from root to that leaf is your solution, and copy-on-write meant the whole tree cost roughly one environment plus everyone's small deltas.

Because you snapshot before each branch point, every node is a known-good resume point. If a whole subtree turns out to be a dead end, you don't unwind anything — you just abandon it and re-expand a different node from its own snapshot. This is version-control branching applied to a live, executing machine: cheap branches, cheap merges-by-selection, and history you can walk back up.

Forking shares unmodified state — it does not isolate you from the parent's bugs. If the parent is wedged or mid-failure when you fork, every child inherits that exact wedged state and you've parallelized a broken starting point. Branch from a clean checkpoint, and take a fresh snapshot right before fanning out if you want a guaranteed-good root.

When fork-and-explore beats single-path-with-retries

Fork-and-explore is not free complexity you should sprinkle everywhere. It earns its keep when three things are true: setup is expensive relative to each attempt, you have a cheap and honest way to score an outcome, and the attempts are genuinely independent (trying OAuth doesn't teach you anything about trying JWT). When those hold, exploring in parallel dominates retrying in series — you get wall-clock latency close to a single attempt instead of N attempts back to back, and you get the best of N instead of the first that limps past.

It's the wrong tool when attempts should learn from each other (if attempt 2 should read attempt 1's stack trace and adjust, that's a sequential loop, not a fan-out), when there's no reliable scorer (best-of-N with a bad judge just launders randomness), or when setup is trivial (if a fresh environment costs a 179ms create from the template, just create N of them and skip the forking machinery entirely). Here's the trade-off laid out plainly:

  • Fork-and-explore — warm the environment once, snapshot, fork N children via copy-on-write, run all candidates in parallel, keep the winner. Wall-clock is roughly one attempt. Cost is one environment plus each branch's small divergence. Best when setup is expensive, attempts are independent, and you have a real scorer.
  • Single-path-with-retries — try one candidate, and if it fails, undo and try the next in the same environment, serially. Wall-clock is the sum of every attempt you make. Simple, no fork machinery, and correct when attempts must learn from each other — but you feel every failed attempt in latency and you keep the first that passes, not the best.
  • Full-copy-per-branch — provision N independent environments and re-run the full clone-and-install in each. You get parallelism, but you pay N times the setup in both time and RAM, so N stays small before it stops being worth it. This is the option copy-on-write forking is designed to make obsolete for shared-state exploration.

The honest summary: reach for retries when the loop is inherently sequential, reach for plain per-task creates when there's no shared state worth preserving (a 179ms create is simpler than a fork and gives you a guaranteed-clean baseline), and reach for fork-and-explore precisely when you've built up expensive shared context and want to spend it exploring several futures at once.

The mechanics that make the branches cheap

It's worth being precise about what you're paying for, because the whole approach lives or dies on the fork staying cheap. A same-host fork — the default, because PandaStack's scheduler places a fork on the parent's host — is roughly 400–750ms: the parent's memory is already resident, the rootfs reflinks locally, no bytes cross the network. A cross-host fork runs 1.2–3.5s because the memory and disk artifacts have to be moved to the target machine before it can restore, and a reflink only works within one filesystem. For fan-out into a tree you almost always want same-host: it's faster and it lets every branch share the parent's page cache, which is exactly the copy-on-write sharing that keeps N branches from costing N times the RAM.

For context on the surrounding numbers: a normal create restores a baked template snapshot at 179ms p50 (~203ms p99), the first-ever spawn of a template does one real cold boot around 3s before it's baked, and the same underlying snapshot machinery is what a fork reuses — a fork just restores your specific running sandbox instead of the generic template. Deleting a branch is immediate; its private pages and blocks are freed and its shared ones were never really its own. That whole cost model — cheap to branch, cheap to discard, expensive only where you diverge — is what turns tree-of-thought from a nice diagram into something you'd actually run in a loop.

Put it together and the branch-and-explore agent is a small idea with a sharp edge: stop making your agent guess at every fork in the road when you can afford to try every road and keep the one that arrives. Copy-on-write forking is what makes 'try every road' cost about as much as trying one — you pay for divergence, not for the number of branches — and a test suite is what tells you which one arrived. Warm once, fork N ways, keep the winner, delete the regrets. The regrets, mercifully, were almost free.

Frequently asked questions

What is a tree-of-thought agent and how does forking help?

A tree-of-thought agent explores several candidate trajectories from a decision point instead of committing to one, then keeps the best. The challenge is that each trajectory needs its own environment, and re-running setup N times is expensive. Copy-on-write microVM forking solves this: you snapshot the agent's exact running state and fork it into N children that share the parent's memory and disk until they diverge. Each branch inherits the whole warm environment — cloned repo, installed dependencies, loaded data — for free, so exploring N paths costs roughly one environment plus each branch's small deltas rather than N full copies.

Why is copy-on-write fork cheap enough to run N branches?

A fork does no bulk data movement. The rootfs is cloned with an XFS reflink — an O(metadata) operation that shares data blocks until a branch writes — and the parent's memory is mapped MAP_PRIVATE, so pages are shared read-only across siblings and copied privately only when a branch modifies them. The result is that a fork's cost is proportional to how much each branch diverges, not to the size of the environment or the number of branches. On PandaStack a same-host fork lands in roughly 400–750ms, so best-of-N stays affordable as you turn N up.

How does best-of-N code fixing work with microVM forking?

You warm one environment (clone the repo, install dependencies) once, snapshot it, then fork it once per candidate fix. Each child applies a different model-written patch and runs the test suite in parallel, scored by whether the tests pass (exit code 0). You keep a winning branch and delete the rest — and because losing branches barely diverged, copy-on-write made them nearly free to exist and free to discard. The key is scoring on a real in-VM signal like an exit code rather than the model's own opinion of its work.

When should I use fork-and-explore instead of retrying one path?

Use fork-and-explore when setup is expensive relative to each attempt, the attempts are independent (trying one approach teaches you nothing about another), and you have a cheap, honest scorer like a test suite. Then parallel exploration gives you wall-clock latency close to a single attempt and returns the best of N. Use single-path retries instead when attempts must learn from each other (a sequential loop where attempt 2 reads attempt 1's error), and use plain per-task creates when there's no shared state worth preserving — a 179ms create from the template is simpler than forking and gives a guaranteed-clean baseline.

What are the limits of forking for parallel agents?

Forking clones the machine, not the outside world, so external side effects don't fork safely — if a branch sends an email or charges a card, every branch does it, so keep side effects to the winner after you've picked it. Forking also shares the parent's exact state including its bugs; fork from a clean checkpoint, not a wedged one. And a same-host fork (~400–750ms) is much cheaper than cross-host (1.2–3.5s, because artifacts move over the network and reflinks are single-filesystem), so fan out on the parent's host to keep branches fast and sharing the page cache.

Run code in a microVM in one API call.

49ms p50 cold start. Fork, snapshot, and scale to zero.

Start free
Written by Ajay Kumar, Founder, PandaStack.