Building a Reproducible Wasm Benchmark Harness

This guide builds a small JavaScript harness that benchmarks an exported WebAssembly function so the result is reproducible: it warms the JIT, times a fixed number of iterations, computes median/p95/stddev instead of a single sample, and runs inside a Web Worker to reduce scheduler noise — while guarding against the optimizer deleting the work you meant to measure.

Prerequisites

  • [ ] Node.js 20+ (for process.hrtime.bigint() nanosecond timing) or a browser with Web Workers
  • [ ] A compiled .wasm module exporting one hot function (here, run(i32) -> i32)
  • [ ] binaryen 116+ if you want to compare optimized variants (wasm-opt --version)
  • [ ] A quiet machine: AC power, no background builds, turbo boost disabled where possible

Procedure

A reproducible harness is built from six non-negotiable parts: a stated unit of work, instantiation outside the timed region, a discarded warmup phase, a high-resolution monotonic clock, a result sink to defeat dead-code elimination, and robust statistics over the samples. Skip any one and the number becomes a coin flip. The steps below assemble them in order; the seventh moves the whole thing into a Web Worker, which is usually the largest single noise reduction available in a browser.

1. Decide what one “iteration” is

Pick a unit of work large enough to dominate timer overhead but small enough to be representative. A single call into a kernel that runs for 5 ns is mostly boundary overhead; a call that processes a 4,096-element array runs long enough that the boundary is negligible. State the unit explicitly so a reader can convert your ns/iter into elements/s.
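The conversion from time per iteration to throughput is small enough to keep next to the harness. A minimal sketch, assuming one iteration always processes the same number of elements; the helper name melemPerSec is hypothetical:

```javascript
// Convert a per-iteration time into throughput.
// nsPerIter:    measured time for one iteration, in nanoseconds.
// elemsPerIter: number of elements one iteration processes (the stated unit of work).
function melemPerSec(nsPerIter, elemsPerIter) {
  // elements per nanosecond * 1e9 ns/s / 1e6 elements per Melem = elements/ns * 1e3
  return (elemsPerIter / nsPerIter) * 1e3;
}

console.log(melemPerSec(2630, 4096).toFixed(0)); // prints 1557
```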

2. Instantiate once, outside the timed region

Compilation and linking are one-time costs unrelated to steady-state throughput. Instantiate before any timing so the timed loop measures only the kernel.

import { readFile } from "node:fs/promises";

const bytes = await readFile(process.argv[2]);
const { instance } = await WebAssembly.instantiate(bytes, {});
const run = instance.exports.run;
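To try the harness before a real kernel exists, a tiny module can be embedded as bytes. The sketch below is an assumption for illustration only: a hand-assembled 40-byte stand-in that exports run(i32) -> i32 and returns its argument plus one. It uses the synchronous constructors so it runs without top-level await; for real modules the asynchronous WebAssembly.instantiate shown above is the better fit.

```javascript
// Hand-assembled stand-in module (illustrative, not a real kernel):
//   (func (export "run") (param i32) (result i32) local.get 0  i32.const 1  i32.add)
const standInBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,                   // "\0asm", version 1
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,                   // type section: (i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                           // function section: func 0 has type 0
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x00,             // export section: "run" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x41, 0x01, 0x6a, 0x0b, // code section: body of func 0
]);

// Compile and instantiate once, before any timing starts.
const standInModule = new WebAssembly.Module(standInBytes);
const standInRun = new WebAssembly.Instance(standInModule, {}).exports.run;

console.log(standInRun(41)); // prints 42
```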

3. Warm up, and discard every warmup sample

Drive the function untimed until the engine tiers it from the baseline compiler to the optimizing compiler. A fixed warmup of max(1000, M/10) calls is a reasonable default; verify it is enough by checking that the median is stable across runs.

const M = Number(process.argv[3] ?? 100_000);
const WARMUP = Math.max(1000, M / 10);
let sink = 0;
for (let i = 0; i < WARMUP; i++) sink ^= run(i);
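One rough way to check that the warmup was long enough is to compare the start and the end of the timed samples in the order they were recorded. This is a sketch under assumptions: the samples must be passed before any sorting (step 6 sorts in place, so keep a copy), and the name looksWarm and its 5% default tolerance are hypothetical choices, not part of the harness.

```javascript
// Median of a copy, so the caller's recording order is preserved.
function medianOf(values) {
  const sorted = Float64Array.from(values).sort(); // typed arrays sort numerically
  return sorted[sorted.length >> 1];
}

// Compare the first and last tenth of the samples, in recording order.
// If the early slice is noticeably slower, tier-up probably happened inside the timed loop.
function looksWarm(samplesInOrder, tolerance = 0.05) {
  const k = Math.max(1, Math.floor(samplesInOrder.length / 10));
  const head = medianOf(samplesInOrder.slice(0, k));
  const tail = medianOf(samplesInOrder.slice(-k));
  return head <= tail * (1 + tolerance);
}
```

If looksWarm returns false, lengthen the warmup and re-run rather than trimming the slow samples after the fact.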

4. Time the loop with nanosecond resolution

Use process.hrtime.bigint() in Node: it is monotonic and reports time in nanosecond units, though the effective granularity depends on the platform clock. Record one sample per iteration into a pre-allocated Float64Array so that storing a sample never grows an array inside the timed region.

const samples = new Float64Array(M);
for (let i = 0; i < M; i++) {
  const t0 = process.hrtime.bigint();
  sink ^= run(i);
  const t1 = process.hrtime.bigint();
  samples[i] = Number(t1 - t0);
}

5. Consume the result so DCE cannot delete it

Fold every return value into an accumulator and observe it after the loop. If the engine can prove the result is unused, dead-code elimination may remove the call entirely.

globalThis.benchSink = sink; // publish the accumulator so the engine cannot prove the results unused

6. Compute robust statistics

Sort once, then index for percentiles. Report the median (typical cost), p95 (tail), and stddev (spread). The mean is included only as a cross-check — if mean and median diverge sharply, outliers are present.

samples.sort((a, b) => a - b);
const at = (q) => samples[Math.min(M - 1, Math.floor(M * q))];
const mean = samples.reduce((a, b) => a + b, 0) / M;
const stddev = Math.sqrt(samples.reduce((a, b) => a + (b - mean) ** 2, 0) / M);
console.log({
  median_ns: +at(0.5).toFixed(1),
  p95_ns: +at(0.95).toFixed(1),
  stddev_ns: +stddev.toFixed(1),
  n: M,
});
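To see how the percentile indexing behaves, and why the mean is only a cross-check, it can help to run the same formulas on a tiny sample set. The five values below are invented for illustration:

```javascript
// Five made-up samples (ns): four typical calls and one outlier.
const demo = Float64Array.from([12, 10, 1000, 11, 13]).sort(); // -> 10, 11, 12, 13, 1000
const n = demo.length;
const pick = (q) => demo[Math.min(n - 1, Math.floor(n * q))];
const demoMean = demo.reduce((a, b) => a + b, 0) / n;

console.log(pick(0.5));  // prints 12: the median ignores the outlier
console.log(pick(0.95)); // prints 1000: the tail statistic catches it
console.log(demoMean);   // prints 209.2: the mean is dragged far above the median
```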

7. Pin to a Web Worker to cut noise (browser)

On the main thread, your timed loop competes with rendering, input handling, and requestAnimationFrame callbacks. Moving the harness into a dedicated Worker takes it off the main thread's event loop and usually produces tighter distributions. Use performance.now() there, but time a batch and divide, because its resolution is clamped.

// worker.js
self.onmessage = async ({ data }) => {
  const { instance } = await WebAssembly.instantiate(data.bytes, {});
  const run = instance.exports.run;
  const M = data.iterations, BATCH = 1000;
  let sink = 0;
  for (let i = 0; i < Math.max(1000, M / 10); i++) sink ^= run(i); // warmup, same rule as step 3
  const samples = new Float64Array(Math.max(1, Math.floor(M / BATCH))); // at least one batch
  for (let b = 0; b < samples.length; b++) {
    const t0 = performance.now();
    for (let i = 0; i < BATCH; i++) sink ^= run(i);
    samples[b] = (performance.now() - t0) * 1e6 / BATCH; // ns per call
  }
  samples.sort((a, b) => a - b);
  self.postMessage({ median_ns: samples[samples.length >> 1], sink });
};
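The worker needs a main-thread counterpart that hands it the module bytes and waits for the result. A minimal sketch, assuming the worker script above is served as worker.js; the helper name runInWorker and the file names in the call site are hypothetical:

```javascript
// main.js (browser main thread). `runInWorker` is a hypothetical helper name.
function runInWorker(workerUrl, bytes, iterations) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerUrl);
    worker.onmessage = ({ data }) => {
      worker.terminate(); // one benchmark per worker keeps runs independent
      resolve(data);      // { median_ns, sink } as posted by worker.js
    };
    worker.onerror = (event) => {
      worker.terminate();
      reject(event);
    };
    // Transfer the ArrayBuffer rather than copying it; `bytes` is detached afterwards.
    worker.postMessage({ bytes, iterations }, [bytes]);
  });
}

// Possible call site (file names are assumptions):
//   const bytes = await (await fetch("kernel.O3.wasm")).arrayBuffer();
//   const { median_ns } = await runInWorker("worker.js", bytes, 200_000);
```

Starting a fresh worker per measurement costs a little startup time but avoids state leaking between runs.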

Expected output

A clean run on a quiet laptop for a 4,096-element SAXPY kernel, three back-to-back invocations:

$ node bench.mjs kernel.O3.wasm 200000
{ median_ns: 2630.4, p95_ns: 2811.0, stddev_ns: 240.7, n: 200000 }
{ median_ns: 2628.9, p95_ns: 2799.5, stddev_ns: 233.1, n: 200000 }
{ median_ns: 2631.7, p95_ns: 2805.2, stddev_ns: 238.8, n: 200000 }

Medians within ~0.1% across runs is the signal that the harness is reproducible. Converting: 2,630 ns for 4,096 elements is ~1,557 Melem/s. If your three medians spread by more than a couple of percent, the machine is too noisy — re-run pinned to a Worker or quiet the system before trusting the number.
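The reproducibility check can be made mechanical. A small sketch, assuming you collect the median from each invocation yourself; medianSpread is a hypothetical helper and the 2% threshold simply encodes "a couple of percent":

```javascript
// Relative spread of repeated medians: (max - min) / min.
function medianSpread(medians) {
  const lo = Math.min(...medians);
  const hi = Math.max(...medians);
  return (hi - lo) / lo;
}

const spread = medianSpread([2630.4, 2628.9, 2631.7]); // medians from three runs
console.log((spread * 100).toFixed(2) + "%");               // prints 0.11%
console.log(spread <= 0.02 ? "reproducible" : "too noisy"); // prints reproducible
```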

Read the three statistics together rather than reporting only one. The median is your headline cost. The p95 sitting ~7% above it (2,811 vs 2,630 ns here) is healthy: a tight tail with no pathological spikes. If instead p95 were 2–3× the median, something is contaminating a minority of iterations: a GC cycle, a memory.grow, or the scheduler stealing the core. The stddev confirms the picture: at 240 ns on a 2,630 ns median it is under 10% of the median, so the distribution is narrow. A stddev approaching or exceeding the median is a red flag for heavy outliers or a bimodal distribution, often the signature of a warmup that ended partway through the timed loop, mixing baseline-tier and optimized-tier iterations.
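These reading rules can be turned into automatic warnings. The sketch below is one possible encoding, not a standard: diagnose is a hypothetical helper, and the thresholds (p95 at 2× the median, stddev at or above the median) are rough rules of thumb taken from the paragraph above.

```javascript
// Flag distributions that should not be trusted as a headline number.
function diagnose({ median_ns, p95_ns, stddev_ns }) {
  const warnings = [];
  if (p95_ns >= 2 * median_ns) {
    warnings.push("fat tail: check for GC, memory.grow, or scheduler preemption");
  }
  if (stddev_ns >= median_ns) {
    warnings.push("very wide spread: possibly bimodal, try a longer warmup");
  }
  return warnings;
}

console.log(diagnose({ median_ns: 2630.4, p95_ns: 2811.0, stddev_ns: 240.7 })); // prints []
```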

Gotchas

Sub-nanosecond results mean DCE won. If the median prints as 0.3 ns or 0.0 ns, the engine deleted the call because the result was unused, or wasm-opt --precompute folded a constant-input kernel at build time. Fix: make the input runtime-dependent (read from a buffer filled at startup) and keep the sink ^= run(i) accumulator with a final observation.
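One way to make the input runtime-dependent while keeping runs comparable is to fill an input table once at startup from a seeded generator. This is a sketch under assumptions: makeInputs is a hypothetical helper, the generator is a plain 32-bit linear congruential generator (any generator would do), and kernel is a JavaScript stand-in for the exported Wasm function.

```javascript
// Fill an input table once, before timing. Same seed, same sequence, so runs stay comparable.
function makeInputs(count, seed) {
  const inputs = new Int32Array(count);
  let state = seed >>> 0;
  for (let i = 0; i < count; i++) {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0; // LCG step modulo 2^32
    inputs[i] = state | 0;
  }
  return inputs;
}

const inputs = makeInputs(1024, 42); // 1024 is a power of two, so `& 1023` wraps the index
const kernel = (x) => (x * 3) | 0;   // stand-in for instance.exports.run
let sinkDemo = 0;
for (let i = 0; i < 10_000; i++) sinkDemo ^= kernel(inputs[i & 1023]);
globalThis.sinkDemo = sinkDemo;      // observe the accumulator after the loop
```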

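performance.now() resolution is clamped. Chromium-based browsers coarsen it to 100 µs without cross-origin isolation and to 5 µs with it; other browsers can be coarser still. Timing a single nanosecond-scale call therefore returns 0. Always time a batch and divide, as in step 7, and size the batch so that it runs for many multiples of the timer resolution.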
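How large the batch has to be follows from the timer resolution. A back-of-the-envelope sketch, assuming a single reading can be off by up to one resolution step; batchSizeFor and the 1% default error budget are hypothetical choices:

```javascript
// Smallest batch whose timer quantization error stays under `maxError`.
// The relative error of one batch is roughly resolutionNs / (batch * kernelNs).
function batchSizeFor(kernelNs, resolutionNs, maxError = 0.01) {
  return Math.ceil(resolutionNs / (maxError * kernelNs));
}

console.log(batchSizeFor(2630, 100_000)); // prints 3803 (100 µs resolution)
console.log(batchSizeFor(2630, 5_000));   // prints 191 (5 µs resolution)
```

By the same estimate, a batch of 1000 calls to a 2,630 ns kernel under a 100 µs clamp carries close to 4% quantization error per sample, so a larger batch is worth considering when cross-origin isolation is unavailable.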

Thermal throttling drifts your numbers upward over time. A long run heats the CPU, the governor drops the clock, and later iterations look slower — inflating p95 and stddev. Keep runs short (a few seconds), watch for a rising trend across repeated invocations, and benchmark on AC power so the governor does not switch profiles mid-run.

Allocating inside the timed loop measures the allocator. Pre-allocate samples and any scratch buffers before timing. A push into a growing array triggers reallocation and GC that has nothing to do with your kernel — and worse, it injects GC pauses precisely into the window you are measuring, fattening the very tail you are trying to read.

Comparing two variants in separate processes loses cache state. V8 caches compiled Wasm within a single node process but not across invocations, so if you benchmark a.wasm in one run and b.wasm in another, each pays its own cold-compile cost. That part is fair, and it sits outside the timed region anyway. The larger problem is that the two runs see different machine conditions (clock frequency, temperature, background load), and you forfeit any chance to detect cross-variant interference. To compare two artifacts apples-to-apples, instantiate both in one process, warm both, and time them back to back, ideally alternating which goes first to cancel ordering bias.
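An alternating comparison can be sketched as follows, assuming Node and two already-instantiated functions. The names nsPerCall and compareAlternating, and the default round and batch counts, are hypothetical; in practice runA and runB would be the run exports of the two modules.

```javascript
// Time `batch` calls of fn under one timer pair and return ns per call.
function nsPerCall(fn, batch) {
  let sink = 0;
  const t0 = process.hrtime.bigint();
  for (let i = 0; i < batch; i++) sink ^= fn(i);
  const elapsed = Number(process.hrtime.bigint() - t0);
  globalThis.compareSink = sink; // keep the results observable
  return elapsed / batch;
}

const medianOfSorted = (xs) => xs.sort((a, b) => a - b)[xs.length >> 1];

// Interleave A and B in one process, swapping which goes first each round.
function compareAlternating(runA, runB, rounds = 50, batch = 1000) {
  const a = [];
  const b = [];
  nsPerCall(runA, batch); // untimed pass to warm A
  nsPerCall(runB, batch); // untimed pass to warm B
  for (let r = 0; r < rounds; r++) {
    if (r % 2 === 0) {
      a.push(nsPerCall(runA, batch));
      b.push(nsPerCall(runB, batch));
    } else {
      b.push(nsPerCall(runB, batch));
      a.push(nsPerCall(runA, batch));
    }
  }
  return { a_ns: medianOfSorted(a), b_ns: medianOfSorted(b) };
}
```

A single warm pass per variant is the bare minimum here; the warmup rule from step 3 applies to each variant separately.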

Performance note

The harness overhead itself — two hrtime.bigint() calls plus an XOR — is about 30–60 ns per iteration on V8. For a kernel that runs in 2,600 ns this is under 3% and safely ignorable, but for a 5 ns kernel it would dominate. When the unit of work is that small, switch to the batch-timing form (time BATCH calls under one timer pair, divide) so the per-call timer cost amortizes to near zero.
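The overhead figure varies by machine, so it is worth measuring on yours before deciding whether the unit of work is large enough. A minimal sketch for Node; timerPairOverheadNs is a hypothetical helper, and the result is meant for sizing the unit of work, not for subtracting from samples.

```javascript
// Estimate how much time a back-to-back hrtime.bigint() pair adds to one sample, in ns.
function timerPairOverheadNs(trials = 100_000) {
  const deltas = new Float64Array(trials);
  for (let i = 0; i < trials; i++) {
    const t0 = process.hrtime.bigint();
    const t1 = process.hrtime.bigint();
    deltas[i] = Number(t1 - t0);
  }
  deltas.sort(); // typed arrays sort numerically
  return deltas[trials >> 1]; // median, to ignore occasional preemption spikes
}

const overhead = timerPairOverheadNs();
console.log(`timer pair: ~${overhead} ns`);
```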

Frequently Asked Questions

How many warmup iterations are enough? Enough that the median stops changing. Start with M/10, then double the warmup and re-run; if the median is stable, you were warm. For most kernels a few thousand calls reaches the optimizing tier; pathological cases (large functions, on-stack replacement boundaries) can need more.

Why a Web Worker instead of just running on the main thread? The main thread’s event loop interleaves your loop with rendering, timers, and input, injecting multi-millisecond gaps that fatten the tail. A dedicated Worker has none of that traffic, so the distribution tightens and the p95 reflects the kernel rather than the scheduler.

Should I subtract a measured timer overhead from each sample? Generally no. It adds a calibration step that can itself be wrong, and the timer cost is roughly constant, so it shifts every sample by the same amount: absolute differences between variants are unaffected, and ratios are only slightly compressed. Just keep the unit of work large enough that timer overhead is a small fraction of the sample.

How many timed iterations do I actually need? Enough that the median stabilizes and the standard error is small relative to the differences you care about. For nanosecond-scale work, 100,000–200,000 iterations is typically plenty; for millisecond-scale work, a few hundred passes suffices because each pass already contains enormous internal work. The empirical test is the same in both cases: run three times and confirm the medians agree to within your tolerance. If they do not, increase iterations or quiet the machine — do not just average the disagreement away.
