Run Flaky Parallel Tests in Isolated MicroVMs

Ajay Kumar·June 28, 2026·9 min read

Most flaky integration tests are not flaky. They are deterministic — you just can't see the determinant, because it lives in shared state. Two tests hit the same Postgres row, race for port 8080, both write /tmp/cache, or one bumps a kernel tunable the next one reads. Run them alone and they pass. Run them together, or in a different order, or on a busy CI box, and one of them fails roughly whenever Mercury is in retrograde. The fix is to stop sharing: give each test (or each shard) its own throwaway Firecracker microVM with a clean kernel, a clean filesystem, its own network namespace and port space, and optionally its own managed database. Fork a fully-seeded base, run the test, kill the VM. There is nothing left to leak.

Flaky tests are usually shared mutable state

Unit tests are easy to isolate — pure functions don't fight each other. Integration and end-to-end tests are where it falls apart, because they touch real resources, and real resources are global by default. When you run them in parallel on one machine (or sequentially on a runner that never resets), they share everything that isn't explicitly partitioned.

Shared database: tests truncate, seed, and read the same tables. Test A's INSERT is still visible to test B; test B's TRUNCATE pulls the rug out from under test A mid-run. Transactional rollback helps until a test commits, spawns a background job, or crosses a connection.
Shared ports: two tests both bind 8080, 5432, or a random-but-not-random-enough port. The second one gets EADDRINUSE, or worse, connects to the first one's server and 'passes' against the wrong process.
Shared filesystem: /tmp, ~/.config, a fixtures directory, an upload folder. One test writes a file another test globs, and now ordering decides the result.
Shared kernel state: containers share the host kernel. A test that tweaks a sysctl, exhausts file descriptors, fills the conntrack table, or changes the system clock leaks that into every other container on the box. You don't even get a stack trace — you get 'sometimes the network call times out.'
Shared singletons in the test process itself: module-level caches, connection pools, a global temp dir, a monkeypatch that wasn't undone. Parallelism turns these from latent bugs into intermittent failures.

The tell: a test that passes alone but fails only under load, in parallel, or in a particular order is not flaky; it has a hidden dependency on shared state. 'Flaky' is just the name we give to a race we haven't localized yet.
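
The port case is the easiest of these to reproduce on a single machine. The sketch below uses only Python's standard `socket` module; the helper names (`bind_listener`, `second_bind_errno`) are illustrative and not part of any test framework. It starts a listener on an OS-assigned loopback port, then tries to bind a second listener to the same port, which is roughly what two parallel tests with the same hard-coded port do to each other.

```python
import errno
import socket
from typing import Optional


def bind_listener(port: int) -> socket.socket:
    """Bind and listen on 127.0.0.1:port (port 0 lets the OS pick a free one)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()  # don't leak the socket if the bind fails
        raise
    return sock


def second_bind_errno() -> Optional[int]:
    """Return the errno raised by a second bind on a busy port, or None."""
    first = bind_listener(0)            # "test A" starts its server
    port = first.getsockname()[1]
    try:
        second = bind_listener(port)    # "test B" wants the same port
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return None
    finally:
        first.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(errno.errorcode.get(code, code))
```

On Linux the second bind is expected to fail with EADDRINUSE. The nastier variant from the list above, where the second test connects to the first test's server and passes against the wrong process, produces no error at all, which is why it is harder to spot.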

Why containers don't fully fix it

The obvious move is 'run each test in its own container.' That fixes the easy half — filesystem and process namespace — and it's a real improvement over a shared runner. But a container is a process group sharing the host kernel, so the leaks that matter most for flaky integration tests can still get through.

Kernel tunables are global: non-namespaced sysctls, the conntrack table, the entropy pool, system-wide fd limits, the system clock. One container's change is every container's reality, because there's one kernel. A privileged test that changes a non-namespaced sysctl or sets the system clock forward affects its neighbors.
Host port collisions: containers on a shared host network, or with carelessly mapped ports, still race for the same host port. You end up writing port-allocation logic — which is a thing you now have to test.
Shared host resources under load: page cache, the disk, the noisy-neighbor CPU. These don't 'leak' state but they leak timing, and timing is what most integration flake is actually about.
Weaker isolation boundary: a shared kernel means a kernel bug or a container escape crosses between tests. For trusted first-party tests that's a lower concern than for untrusted code — but it's still one more thing the container model doesn't give you for free.

A microVM closes those gaps because it isn't sharing the thing that's leaking. Each PandaStack sandbox is a full Firecracker microVM with its own guest kernel, its own network namespace and full port space (no host-port arbitration — every VM has its own :8080), and its own filesystem. A sysctl change, a clock skew, a conntrack flood, a /tmp collision — all of it is scoped to one VM you're about to throw away.

The pattern: seed a base once, fork it per test

Giving every test its own VM sounds expensive, and if you cold-booted a fresh OS and reinstalled dependencies and re-seeded a database for each one, it would be. The microVM model removes that cost with snapshot-and-fork. You pay the setup once, then stamp out identical copies cheaply.

Build a 'test base' once: create a sandbox, install your app and its dependencies, run migrations, load fixtures/seed data, start whatever services the suite needs — then snapshot it. This is your known-good starting state.
Fork the base per test: each test gets its own VM that starts from byte-for-byte identical state. A same-host fork is ~400–750ms and uses copy-on-write memory and rootfs, so the Nth fork doesn't recopy gigabytes of seed data — it shares pages with the base until it writes.
Run the test inside its fork: it mutates its own database, binds its own ports, scribbles its own /tmp. Nothing it does is visible to any sibling fork.
Tear down by killing the VM: no truncate, no cleanup fixture, no 'reset the world' script that's only as correct as the last person who edited it. Delete the VM and every side effect dies with it.

Forking from a seeded snapshot also makes tests reproducible: every test starts from identical, known-good state. 'It only fails on CI' usually means CI's shared environment differs from yours. When every test forks the same snapshot, CI and your laptop start from the same bytes.
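
The copy-on-write behavior described above can be pictured with a toy model. This is an analogy in plain Python using `collections.ChainMap`, not how Firecracker or PandaStack implement forking; the dictionary keys and the `fork` helper are made up for illustration. Reads fall through to the shared base, and a write lands in a private layer that only the writing fork can see.

```python
from collections import ChainMap

# Stand-in for the seeded snapshot. In the real system these would be
# memory and disk pages, not dictionary entries.
base_pages = {"users": "3 seeded rows", "invoices": "empty"}


def fork(base: dict) -> ChainMap:
    """Return an overlay: reads fall through to base, writes stay private."""
    return ChainMap({}, base)


fork_a = fork(base_pages)
fork_b = fork(base_pages)

# Test A writes; the write goes into fork_a's private layer only.
fork_a["invoices"] = "1 row written by test A"

print(fork_a["invoices"])      # test A sees its own write
print(fork_b["invoices"])      # test B still sees the seeded value
print(base_pages["invoices"])  # the base itself is untouched
print(fork_a.maps[0])          # only the key that was written got "copied"
```

The point of the model is the last line: a fork pays only for what it changes, which is why the Nth fork does not need its own full copy of the seed data.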

Forking a seeded base sandbox per test in Python

Here's the whole shape: build the seeded base once, then fork it for each test, run the test against the fork, collect the result, and kill the fork. Set PANDASTACK_API_KEY in your environment first. Each fork is an independent microVM, so this loop is safe to run with as many tests in flight as your hosts have memory for.

from pandastack import Sandbox

# ---- One-time setup: build a fully-seeded 'test base' and snapshot it. ----
# Do this once per suite run (or cache it across runs keyed on your lockfile +
# migration hash). Everything below forks from this known-good state.
base = Sandbox.create(template="base", ttl_seconds=1800)
for step in (
    "git clone --depth 1 https://github.com/acme/widget.git /app",
    "cd /app && npm ci",               # deps installed once
    "cd /app && npm run db:migrate",   # schema applied once
    "cd /app && npm run db:seed",      # fixtures loaded once
):
    r = base.exec(step)
    if r.exit_code != 0:               # never snapshot a half-built base
        base.kill()
        raise RuntimeError(f"base setup failed: {step}\n{r.stderr}")
seed = base.snapshot()                          # known-good starting state
base.kill()

# ---- Per test: fork the seed, run, tear down. ----
def run_one_test(test_cmd: str) -> dict:
    # Fork = a fresh microVM starting from identical seeded state (~400-750ms,
    # copy-on-write). Its DB, ports, and /tmp are entirely its own.
    sbx = seed.fork()
    try:
        r = sbx.exec(f"cd /app && {test_cmd}", timeout_seconds=300)
        return {"cmd": test_cmd, "passed": r.exit_code == 0, "log": r.stdout}
    finally:
        sbx.kill()   # every side effect dies with the VM

# Run shards in parallel — no shared mutable state means no cross-talk.
from concurrent.futures import ThreadPoolExecutor
tests = [
    "npm test -- tests/billing",
    "npm test -- tests/auth",
    "npm test -- tests/webhooks",
]
with ThreadPoolExecutor(max_workers=len(tests)) as pool:
    results = list(pool.map(run_one_test, tests))

for res in results:
    print(("PASS" if res["passed"] else "FAIL"), res["cmd"])

The important property is in the `finally`: cleanup is `sbx.kill()`, full stop. There is no per-test database reset, no port bookkeeping, no 'remember to delete the temp file.' The fork is the unit of isolation and the unit of teardown at the same time. If a test corrupts its database or leaves a daemon running or fills its disk, it corrupted a VM that no longer exists by the time the next test starts.
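
The setup comment above suggests caching the base across runs, keyed on your lockfile plus a migration hash. Here is a minimal sketch of computing such a key with the standard library. The function name, the `package-lock.json` file name, and the `migrations/` directory layout are assumptions for illustration; how you store and look up a snapshot under that key is left out, since that depends on your setup.

```python
import hashlib
import tempfile
from pathlib import Path


def base_cache_key(lockfile: Path, migrations_dir: Path) -> str:
    """Hash the lockfile plus every migration file into one stable hex key."""
    digest = hashlib.sha256()
    digest.update(lockfile.read_bytes())
    # Sort so the key does not depend on directory listing order.
    for path in sorted(p for p in migrations_dir.rglob("*") if p.is_file()):
        # Include the relative name so a rename also changes the key.
        digest.update(path.relative_to(migrations_dir).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


# Demo against a throwaway directory with a hypothetical project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    lock = root / "package-lock.json"
    migrations = root / "migrations"
    migrations.mkdir()
    lock.write_text('{"lockfileVersion": 3}')
    (migrations / "001_init.sql").write_text("create table invoices (id int);")

    key_before = base_cache_key(lock, migrations)
    key_same = base_cache_key(lock, migrations)

    (migrations / "002_add_total.sql").write_text(
        "alter table invoices add column total int;"
    )
    key_after = base_cache_key(lock, migrations)

print(key_before == key_same)   # unchanged inputs give the same key
print(key_before != key_after)  # a new migration changes the key
```

If the key matches a base you already built, you can skip the install, migrate, and seed steps; if it differs, rebuild the base and snapshot again.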

A per-test database fixture (pytest)

Shared databases are the number-one source of integration flake, so they deserve a dedicated pattern. You have two clean options. The lightweight one: the database lives inside each forked microVM, so forking the seeded base gives every test its own Postgres with the schema and fixtures already loaded — no per-test create cost. The heavyweight one: provision a managed database per test for true network-level separation. Here's the in-VM version as a pytest fixture, which is what most suites want.

import pytest
from pandastack import Sandbox

# Built once per session: a base VM with Postgres running, schema migrated,
# and fixtures seeded. The DB lives inside this VM's filesystem.
@pytest.fixture(scope="session")
def seeded_snapshot():
    base = Sandbox.create(template="base", ttl_seconds=3600)
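    # Clone the app first, as in the earlier example; /app must exist before npm runs.
    base.exec("git clone --depth 1 https://github.com/acme/widget.git /app")
    base.exec("service postgresql start")
    base.exec("cd /app && npm ci && npm run db:migrate && npm run db:seed")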
    snap = base.snapshot()       # Postgres + schema + fixtures, frozen
    base.kill()
    yield snap

# Built per test: fork the snapshot -> a private VM with its OWN Postgres,
# pre-seeded, on its own :5432. Tests can't see each other's writes.
@pytest.fixture
def test_sandbox(seeded_snapshot):
    sbx = seeded_snapshot.fork()          # ~400-750ms, copy-on-write
    try:
        yield sbx
    finally:
        sbx.kill()                        # DB and all writes vanish

def test_charge_creates_invoice(test_sandbox):
    # This test mutates its own database freely. No truncate, no rollback,
    # no interference from the test running next to it.
    r = test_sandbox.exec(
        "cd /app && node -e \"require('./billing').charge('cust_1', 999)\"",
        timeout_seconds=60,
    )
    assert r.exit_code == 0, r.stderr
    rows = test_sandbox.exec(
        "psql -tAc \"select count(*) from invoices\" -U app app",
        timeout_seconds=30,
    )
    # Assumes the seed data contains no invoices, so this charge's row is the only one.
    assert rows.stdout.strip() == "1"

Because the database is inside the fork, there is no shared instance to truncate between tests and no global migration race at suite startup: the schema was applied once, before the snapshot. If you need a database that survives independently of the test VM, or you want the network boundary too, PandaStack can provision a managed PostgreSQL per test instead. A managed database create takes 30–90s (it blocks until Postgres is ready), so reserve the per-test managed-DB approach for the cases that truly need an out-of-VM database, and use the in-VM seeded snapshot for the common path.
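
To put those two numbers side by side, here is a back-of-the-envelope calculation. The suite size (200 tests) and parallelism (8 workers) are hypothetical, and the model is deliberately crude: tests run in rounds of `workers` at a time, and every test pays the same setup cost. The per-test costs are the figures quoted in this article (up to 750ms for a same-host fork, 30–90s for a managed database create).

```python
import math


def setup_overhead_seconds(n_tests: int, per_test_seconds: float, workers: int) -> float:
    """Wall-clock seconds spent only on per-test environment setup.

    Assumes tests run in rounds of `workers` at a time and that the
    setup cost is identical for every test.
    """
    rounds = math.ceil(n_tests / workers)
    return rounds * per_test_seconds


N_TESTS, WORKERS = 200, 8  # hypothetical suite size and parallelism

fork_worst = setup_overhead_seconds(N_TESTS, 0.75, WORKERS)     # 750ms fork
managed_best = setup_overhead_seconds(N_TESTS, 30.0, WORKERS)   # 30s create
managed_worst = setup_overhead_seconds(N_TESTS, 90.0, WORKERS)  # 90s create

print(fork_worst, managed_best, managed_worst)  # 18.75 750.0 2250.0
```

Under these assumptions the in-VM path spends under 20 seconds of wall-clock time on setup for the whole suite, while a managed database per test spends between 12.5 and 37.5 minutes before any test logic runs.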

Shared-runner tests vs. a microVM per test

Database: shared runner gives one DB that tests truncate/seed and race over; microVM-per-test gives each test its own seeded Postgres with no cross-talk and no reset fixture.
Ports: shared runner forces port-allocation logic and risks EADDRINUSE or connecting to the wrong process; each microVM has its own full port space, so every test can bind :8080.
Filesystem: shared runner shares /tmp, fixtures, and upload dirs across tests; each microVM has a private filesystem that's deleted on teardown.
Kernel state: shared runner (and containers) share one kernel, so sysctls, conntrack, fd limits, and the clock leak between tests; each microVM has its own guest kernel, so those changes are scoped to one VM.
Teardown: shared runner needs a cleanup script that's only as correct as its last edit; microVM teardown is 'kill the VM' — every side effect goes with it.
Parallelism: on a shared runner, running in parallel is the moment latent races become failures; microVM-per-test has no shared mutable state, so parallelism is safe by construction.
Reproducibility: shared runner state drifts and differs between laptop and CI; every microVM forks the same snapshot, so all tests start from identical bytes.
Cost of a clean start: shared runner pays a slow scrub-or-reseed between tests; a same-host fork is ~400–750ms with copy-on-write, so a clean start per test is cheap.

Parallelism stops being scary

The reason teams cap test parallelism at -j2 and pray isn't a shortage of CPUs; it's fear of the races that wider parallelism exposes. When there is no shared mutable state, that fear evaporates. You can fan out as wide as your hosts have memory for, because fork N and fork N+1 have nothing in common except copy-on-write pages inherited from the base, and a write by either one lands in a private copy the other never sees. Each agent supports a very large number of concurrent sandboxes (the per-agent network design alone pre-allocates 16,384 /30 subnets), so the binding constraint becomes memory and CPU, not isolation. Throw more hosts at the suite and the wall-clock time drops roughly linearly, because the tests genuinely don't interact.
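
As a sanity check on that subnet figure: 16,384 is exactly the number of /30 subnets that fit in a /16. The snippet below shows the arithmetic with Python's `ipaddress` module. The 10.200.0.0/16 block is an assumption chosen for illustration; the address range an agent actually uses is not stated here.

```python
import ipaddress

# Assumed address block, for illustration only.
agent_block = ipaddress.ip_network("10.200.0.0/16")

subnets = list(agent_block.subnets(new_prefix=30))
print(len(subnets))                # 16384 = 2 ** (30 - 16)

first = subnets[0]
print(first, list(first.hosts()))  # each /30 has two usable host addresses
```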

There's a debugging dividend too. When a test fails, you can keep its VM alive instead of killing it, then shell in and inspect the exact, frozen state that produced the failure — its database, its logs, its /tmp — rather than trying to reconstruct a shared environment that three other tests have since stomped on. The thing that made the test flaky in the first place (shared state mutating underneath you) is the same thing that made flaky tests impossible to debug.
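
A variant of `run_one_test` that keeps the VM on failure might look like the sketch below. It relies only on the sandbox methods already used in this article (`exec`, `exit_code`, `kill`) and takes the fork step as a callable, so you would pass `seed.fork` in real use. `FakeSandbox` is a stand-in written for this example so the sketch runs without the SDK; it is not a PandaStack class.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict


def run_keep_on_failure(fork: Callable[[], Any], test_cmd: str) -> Dict[str, Any]:
    """Run test_cmd in a fresh fork; kill the VM only if the test passed."""
    sbx = fork()
    try:
        r = sbx.exec(f"cd /app && {test_cmd}", timeout_seconds=300)
    except Exception:
        sbx.kill()  # the exec call itself failed: nothing useful to inspect
        raise
    passed = r.exit_code == 0
    if passed:
        sbx.kill()
    # On failure the sandbox is handed back alive for inspection;
    # the caller is responsible for killing it afterwards.
    return {"cmd": test_cmd, "passed": passed, "sandbox": None if passed else sbx}


class FakeSandbox:
    """Stand-in for a forked sandbox so this sketch runs without the SDK."""

    def __init__(self, exit_code: int) -> None:
        self._exit_code = exit_code
        self.killed = False

    def exec(self, cmd: str, timeout_seconds: int = 60) -> SimpleNamespace:
        return SimpleNamespace(exit_code=self._exit_code, stdout="", stderr="")

    def kill(self) -> None:
        self.killed = True


ok = run_keep_on_failure(lambda: FakeSandbox(0), "npm test -- tests/auth")
bad = run_keep_on_failure(lambda: FakeSandbox(1), "npm test -- tests/billing")
print(ok["passed"], ok["sandbox"])           # True None
print(bad["passed"], bad["sandbox"].killed)  # False False
```

The trade-off is that a kept VM holds memory until someone kills it, so this is better suited to a debug flag than to the default CI path.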

When a VM per test is overkill

This is a sledgehammer, and not every test is a walnut. Be honest about where it pays off.

Pure unit tests with no I/O don't need it — they're already isolated, and an in-process test runner is far faster than any VM. Don't fork a microVM to assert that add(2, 2) == 4.
Suites that are already fast and reliable shouldn't be 'fixed.' If your integration tests pass deterministically at high parallelism today, you've already solved isolation some other way; leave it alone.
Tests against a shared external system you don't control (a third-party staging API, a shared message bus) won't be isolated by your VM — the shared thing is on the other side of the network. A microVM isolates your side, not theirs.
Very large numbers of tiny tests can spend more time forking than testing. Shard them: fork one VM per worker/shard and run many tests inside it, rather than one VM per individual assertion. You still kill the worker's VM at the end, so no state survives across shards.
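
The sharding advice in the last item can be sketched as a small helper that deals test paths round-robin into a fixed number of shards, then builds one command per shard. The function names are illustrative, and joining several paths into one `npm test --` invocation assumes your test runner accepts multiple paths.

```python
from typing import List, Sequence


def make_shards(tests: Sequence[str], n_shards: int) -> List[List[str]]:
    """Deal tests round-robin into n_shards lists, dropping empty shards."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    shards: List[List[str]] = [[] for _ in range(n_shards)]
    for i, test in enumerate(tests):
        shards[i % n_shards].append(test)
    return [s for s in shards if s]


def shard_command(shard: Sequence[str]) -> str:
    """Build one command per shard, so a single fork runs many test paths."""
    return "npm test -- " + " ".join(shard)


tests = [f"tests/unit_{i:02d}" for i in range(7)]
for shard in make_shards(tests, 3):
    print(shard_command(shard))
```

Each resulting command string can be run in its own fork, so the fork cost is paid once per shard rather than once per test. Tests inside one shard do share a VM, so this only makes sense for tests that are safe to run together.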

The sweet spot is the suite everyone dreads: integration and end-to-end tests that touch a database, bind ports, hit the filesystem, and fail nondeterministically under parallelism. For those, a forked-from-seed microVM per test (or per shard) trades a few hundred milliseconds of fork time for the elimination of an entire category of bugs. Clean state by construction, real kernel isolation, and teardown that's just `kill()` — that's how you make 'flaky' a word your team stops using.

Frequently asked questions

Why are my integration tests flaky when they pass individually?

Almost always shared mutable state. Tests that pass alone but fail in parallel or in a different order have a hidden dependency on something global — the same database rows, the same port, the same /tmp file, or a kernel tunable one test changes and another reads. It isn't true randomness; it's a race you haven't localized. Giving each test its own isolated environment (its own database, ports, filesystem, and kernel) removes the shared resource the race depends on, and the 'flake' disappears.

Don't containers already isolate my tests?

Containers isolate the easy parts (filesystem and process namespace), and that's a real improvement over a shared runner. But containers share the host kernel, so global kernel state still leaks between tests: sysctls, the conntrack table, file-descriptor limits, and the clock. Containers on a shared host network can also collide on host ports. Those are exactly the things that cause hard-to-reproduce integration flake. A Firecracker microVM gives each test its own guest kernel and full port space, so a sysctl change or a clock skew is scoped to one disposable VM instead of every test on the box.

Isn't a microVM per test too slow to be practical?

Not with snapshot-and-fork. You seed a 'test base' VM once — install dependencies, run migrations, load fixtures — and snapshot it. Each test then forks that snapshot, which on the same host takes about 400–750ms and uses copy-on-write memory and disk, so it doesn't recopy your seed data. The dominant cost stays your actual test work, not provisioning. For very small tests, fork one VM per shard/worker and run many tests inside it instead of one VM per assertion.

How do I give each test its own database?

The simplest way is to put Postgres inside the seeded base VM: start it, migrate, and load fixtures before you snapshot. Every test that forks the snapshot then gets its own pre-seeded Postgres on its own :5432, with no shared instance to truncate and no startup migration race. If you need a database that lives outside the test VM (or want a network boundary too), PandaStack can provision a managed PostgreSQL per test — but a managed create takes 30–90s while it waits for Postgres to be ready, so reserve that for cases that genuinely need an out-of-VM database.

How do I tear down a test environment cleanly?

Kill the VM. That's the whole teardown. Because each test runs in its own throwaway microVM, deleting it removes every side effect — database writes, leftover processes, temp files, port bindings — at once. There's no per-test cleanup fixture to maintain and no 'reset the world' script that silently rots. If a test fails, you can instead keep its VM alive and shell in to inspect the exact frozen state that produced the failure, then kill it when you're done.

Written by Ajay Kumar, Founder, PandaStack.