Wasm Performance Benchmarking
Most WebAssembly benchmarks you see online are wrong. They run a tight loop once, print the
first performance.now() delta they get, and conclude that Wasm is either 50× faster than
JavaScript or barely faster at all — usually both, on the same machine, depending on the day.
The truth is that a microbenchmark is a measuring instrument, and an uncalibrated instrument
produces noise. This guide treats benchmarking as engineering: a reproducible harness with
warmup and fixed iteration counts, statistical reporting instead of single samples, and a clear
separation between what you meant to measure and the overhead you accidentally measured. It also
covers reading the optimizer — wasm-opt’s pass pipeline and the Binaryen text IR it emits — so
you can explain why a number changed, not just observe that it did.
Prerequisites
- [ ] binaryen 116 or newer on your PATH (wasm-opt --version prints wasm-opt version 116)
- [ ] Node.js 20+ for process.hrtime.bigint() and stable --allow-natives-syntax behaviour
- [ ] A Chromium-based browser (Chrome 120+) and Firefox 121+ for DevTools profiling
- [ ] wabt for wasm-objdump and wasm-validate (verification cross-checks); wasm-dis ships with binaryen
- [ ] A quiet machine: close other tabs, disable turbo boost if you can, and run on AC power
- [ ] WebAssembly global available in your runtime (all the above satisfy this)
A benchmark is only reproducible if the inputs are pinned. Lock your toolchain versions in CI the
same way the guide on setting up CI/CD for Rust Wasm projects
pins the Rust toolchain: a binaryen version bump can change which passes run by default and move
your numbers by several percent.
The harness as a measuring loop
Every honest microbenchmark has the same shape: drive the function untimed until the JIT and the Wasm tiering compiler have settled (warmup), then run a fixed number of timed iterations, then aggregate the per-iteration samples into robust statistics. The diagram below is the loop you are building.
Two properties make this trustworthy. First, warmup is discarded, never averaged in — the
first few hundred calls run in the baseline tier (Liftoff in V8, the baseline compiler in
SpiderMonkey) before the optimizing tier (TurboFan in V8, Ion in SpiderMonkey) kicks in, and mixing cold and
hot samples produces a meaningless mean. Second, you report the distribution, not the mean —
the median is robust to the occasional GC pause or scheduler preemption, and the p95 tells you the
tail. A single number hides both.
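To make the second property concrete, here is a minimal sketch of the aggregation step. The `summarize` helper, its nearest-rank percentile rule, and the sample data are illustrative assumptions, not part of any library.

```javascript
// Summarize latency samples (any unit) with robust statistics.
function summarize(samples) {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = Float64Array.from(samples).sort(); // typed arrays sort numerically
  const n = sorted.length;
  // Nearest-rank percentile: smallest sample with at least fraction p at or below it.
  const pct = (p) => sorted[Math.min(n - 1, Math.max(0, Math.ceil(p * n) - 1))];
  const mean = sorted.reduce((a, b) => a + b, 0) / n;
  return { median: pct(0.5), p95: pct(0.95), mean };
}

// Hypothetical data: 99 steady iterations at 100 ns plus one 2 ms GC pause.
const demo = new Array(99).fill(100).concat([2_000_000]);
const stats = summarize(demo);
console.log(stats); // median and p95 stay at 100; the mean is dragged to 20099
```

One outlier in a hundred samples moves the mean by two orders of magnitude while the median and p95 do not move at all, which is the case for reporting the distribution.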
The reason warmup is not optional in WebAssembly specifically is that browsers and Node deliberately compile a module twice. The baseline compiler trades code quality for compile speed so the page can start running almost immediately; a background thread then recompiles the hot functions with the optimizing compiler and hot-swaps them in. During that window a function can run 3–10× slower than it will once tiered. If your timed loop straddles the swap, you average two different machines together and the result is reproducible only by accident. The fix is mechanical: run the function untimed long enough that every hot function has tiered up, confirm by checking that the median stops moving when you double the warmup, and only then start recording.
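The doubling check can be automated. The sketch below is one possible shape under stated assumptions: `settleWarmup`, `medianNs`, the 2% tolerance, the probe size, and the stand-in JavaScript kernel are all hypothetical, and you would pass your exported Wasm function in place of `standIn`.

```javascript
// Median time in ns over `calls` individually timed calls of fn.
function medianNs(fn, calls) {
  const samples = new Float64Array(calls);
  let sink = 0;
  for (let i = 0; i < calls; i++) {
    const t0 = process.hrtime.bigint();
    sink ^= fn(i);
    samples[i] = Number(process.hrtime.bigint() - t0);
  }
  samples.sort(); // typed arrays sort numerically
  return { median: samples[calls >> 1], sink };
}

// Double the untimed warmup until the probed median moves by less than
// `tolerance` between two consecutive rounds (assumed cutoff: 2%).
function settleWarmup(fn, { start = 1000, max = 1 << 20, tolerance = 0.02, probe = 2000 } = {}) {
  let sink = 0;
  let done = 0; // untimed warmup calls made so far
  let previous = null;
  for (let target = start; target <= max; target *= 2) {
    for (; done < target; done++) sink ^= fn(done);
    const round = medianNs(fn, probe);
    sink ^= round.sink;
    const settled =
      previous !== null && Math.abs(round.median - previous) <= tolerance * previous;
    if (settled) return { warmup: target, median: round.median, settled: true, sink };
    previous = round.median;
  }
  return { warmup: done, median: previous, settled: false, sink };
}

// Stand-in kernel so the sketch runs on its own; pass instance.exports.run instead.
const standIn = (x) => (Math.imul(x, 31) + 7) | 0;
const warm = settleWarmup(standIn);
console.log(`warmup ${warm.warmup} calls, median ${warm.median} ns, settled=${warm.settled}`);
```

The probe calls themselves also warm the function, so the reported warmup count is a lower bound on the calls actually made. For a very small kernel the per-call median is mostly timer cost, but it still shifts when the function tiers up, which is all this check needs.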
The choice of clock matters just as much as the warmup. In Node, process.hrtime.bigint() returns a
monotonic nanosecond counter that never jumps backwards and is not affected by wall-clock adjustments —
exactly what you want for differences. In the browser, performance.now() is the equivalent, but its
resolution is intentionally coarsened to defend against high-resolution timing attacks, so you time a
batch of iterations under one clock read and divide. Never use Date.now() for either: it is
millisecond-resolution and not guaranteed monotonic, so a single NTP correction can produce a negative
duration.
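Before trusting a clock, it helps to measure how coarse it is on the runtime in front of you. The following is a rough sketch, not a calibrated tool: `clockStep` is a hypothetical helper that spins until the clock value changes and keeps the smallest step it saw.

```javascript
// Estimate the smallest step a clock can report by spinning until its value changes.
function clockStep(readClock, trials = 50) {
  let smallest = Infinity;
  for (let i = 0; i < trials; i++) {
    const start = readClock();
    let next = readClock();
    while (next === start) next = readClock(); // busy-wait for the next tick
    smallest = Math.min(smallest, next - start);
  }
  return smallest;
}

// performance.now() returns milliseconds; convert the step to nanoseconds for display.
const stepMs = clockStep(() => performance.now());
console.log(`performance.now() smallest observed step: ${(stepMs * 1e6).toFixed(0)} ns`);
```

If the observed step is larger than the kernel you want to time, a single iteration cannot be resolved and batching is the only option.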
Step-by-step workflow
The workflow below produces comparable artifacts, proves the optimizer actually changed the body, runs the harness, and localizes any regression. Each step is a single runnable command or a focused edit; run them in order, because attributing a throughput change to a flag is only meaningful when the two binaries you compare differ by exactly that flag and nothing else.
1. Build an optimized and an unoptimized artifact
Compile your kernel, then produce explicit variants so you can attribute differences to the optimizer rather than the compiler:
# baseline: whatever your toolchain emits, no post-processing
cp kernel.wasm kernel.O0.wasm
# the three levels you will actually compare
wasm-opt kernel.O0.wasm -O2 -o kernel.O2.wasm
wasm-opt kernel.O0.wasm -O3 -o kernel.O3.wasm
wasm-opt kernel.O0.wasm -Os -o kernel.Os.wasm
2. Inspect what wasm-opt did before trusting the number
Dump the Binaryen text IR so a faster run has an explanation:
wasm-opt kernel.O0.wasm -O3 --print -o /dev/null | head -n 40
--print runs the full -O3 pipeline and then prints the optimized module as Binaryen IR. If a
function shrank from a loop to a constant, you will see it here — and you will know the optimizer
folded your benchmark away. Reading that IR fluently is its own skill, covered in
reading Binaryen IR from wasm-opt.
3. Measure size and instruction counts
wasm-opt kernel.O3.wasm --metrics -o /dev/null
ls -l kernel.O0.wasm kernel.O3.wasm
--metrics prints a per-category instruction census (total, binary, call, load, loop,
etc.). A throughput win should correlate with fewer loop/call nodes or a tighter binary; if the
metrics are identical, your “improvement” is measurement noise.
4. Run the harness
node bench.mjs kernel.O3.wasm 200000
The harness — built in the next section — instantiates the module once, warms up, runs the timed loop, and prints a stats table. Run it three times; if the medians disagree by more than a couple of percent, your machine is too noisy and you need to pin to a worker and quiet the system.
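The "run it three times" check is easy to mechanize. A minimal sketch, assuming you have copied the three medians out of the harness output; `medianSpread`, the 2% cutoff, and the sample values are hypothetical.

```javascript
// Relative spread of repeated-run medians: (max - min) / min.
function medianSpread(medians) {
  if (medians.length < 2) throw new RangeError("need at least two runs");
  const lo = Math.min(...medians);
  const hi = Math.max(...medians);
  return (hi - lo) / lo;
}

// Hypothetical medians (ns) from three runs of the harness.
const runs = [412.0, 418.5, 409.3];
const spread = medianSpread(runs);
const THRESHOLD = 0.02; // "a couple of percent"; an assumed cutoff, tune to taste
console.log(
  `spread ${(spread * 100).toFixed(2)}%`,
  spread <= THRESHOLD ? "ok" : "too noisy",
);
```

With these made-up numbers the spread lands just above 2%, so the sketch would flag the machine as too noisy to compare optimization levels that differ by only a few percent.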
5. Profile to localize a regression
When a number moves the wrong way, open Chrome DevTools → Performance, record the timed loop, and
look at the bottom-up tree. Wasm frames appear with their function index or name (if a name
section survives); a hot Liftoff frame that never tiers up is a warmup bug, not a kernel problem.
Firefox’s profiler shows the same with explicit baseline/ion tier annotations.
DevTools profiling adds a second dimension the harness alone cannot give you: where the time goes.
A median tells you the kernel is slow; a flame chart tells you it is slow inside a single bounds-checked
load in an inner loop, or that half the samples land in a memory.grow you did not expect. In Chrome,
enable “Memory” in the Performance recording to overlay GC events — a sawtooth heap with frequent minor
collections during the timed window explains a fat p95 immediately. In Firefox, the per-frame tier badge
is the fastest way to confirm warmup: if the badge says baseline on a frame you expected to be hot, the
function never tiered and your warmup is too short or the function is too large for the inliner. Treat the
harness and the profiler as complementary — the harness produces the number, the profiler explains it.
A reproducible harness
This is a minimal but honest Node harness. It instantiates once, warms up, times M iterations with
process.hrtime.bigint() (nanosecond resolution, monotonic), consumes the result so dead-code
elimination cannot delete the work, and reports median/p95/stddev.
// bench.mjs, run with: node bench.mjs <file.wasm> <iterations>
import { readFile } from "node:fs/promises";

const [, , file, iterArg] = process.argv;
const M = Number(iterArg ?? 100_000);
const WARMUP = Math.max(1000, Math.floor(M / 10));

const bytes = await readFile(file);
const { instance } = await WebAssembly.instantiate(bytes, {});
const kernel = instance.exports.run; // exported i32->i32 hot function

let sink = 0; // accumulator the optimizer cannot prove is dead

// warmup: drive the tiering compiler, results discarded
for (let i = 0; i < WARMUP; i++) sink ^= kernel(i);

// timed loop: one sample per call
const samples = new Float64Array(M);
for (let i = 0; i < M; i++) {
  const t0 = process.hrtime.bigint();
  sink ^= kernel(i);
  const t1 = process.hrtime.bigint();
  samples[i] = Number(t1 - t0); // nanoseconds
}

// consume sink: printing it makes the accumulated result observable
console.log("checksum", sink);

samples.sort((a, b) => a - b);
const median = samples[Math.floor(M * 0.5)];
const p95 = samples[Math.floor(M * 0.95)];
const mean = samples.reduce((a, b) => a + b, 0) / M;
const stddev = Math.sqrt(
  samples.reduce((a, b) => a + (b - mean) ** 2, 0) / M,
);
console.log(
  `median ${median.toFixed(1)} ns p95 ${p95.toFixed(1)} ns ` +
    `stddev ${stddev.toFixed(1)} ns (n=${M})`,
);
The single most important line is sink ^= kernel(i). Without it, if you call kernel(i) and
throw the result away, an optimizing compiler that can prove the call has no observable effect is
free to delete it, and you end up timing an empty loop at ~0.3 ns/iter. Always feed the result into
something the runtime cannot prove is dead, such as an XOR accumulator you print at the end. In the
browser, swap process.hrtime.bigint() for performance.now(), but be aware its resolution is clamped
(see the Frequently Asked Questions), so prefer timing a batch of iterations and dividing.
There is a deliberate asymmetry in this harness worth naming. It records one timer pair per
iteration, which is correct only when the kernel runs long enough — hundreds of nanoseconds or more —
that the two hrtime calls (≈30–60 ns of overhead together) are a small fraction of the sample. For a
genuinely tiny kernel that asymmetry inverts: the timer dominates, every sample is mostly clock-read
cost, and the distribution is meaningless. The remedy is batch timing — wrap BATCH calls in one timer
pair and divide the elapsed time by BATCH — which amortizes the timer overhead to near zero at the
cost of losing per-iteration granularity. Pick per-iteration timing when you want the full distribution
and the kernel is large; pick batch timing when the kernel is small and you only need a robust central
estimate. The companion guide, Building a reproducible Wasm benchmark harness,
shows both forms side by side and when to reach for each.
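For reference, batch timing can be sketched in a few lines. This is a minimal illustration, not the companion guide's harness: `timeBatched`, its default batch size and count, and the stand-in kernel are assumptions.

```javascript
// Time `batches` groups of `batchSize` calls with one timer pair per group,
// then report the median per-call cost across groups.
function timeBatched(fn, { batchSize = 10_000, batches = 50 } = {}) {
  const perCallNs = new Float64Array(batches);
  let sink = 0;
  let arg = 0;
  for (let b = 0; b < batches; b++) {
    const t0 = process.hrtime.bigint();
    for (let i = 0; i < batchSize; i++) sink ^= fn(arg++);
    const t1 = process.hrtime.bigint();
    perCallNs[b] = Number(t1 - t0) / batchSize; // timer cost amortized over the batch
  }
  perCallNs.sort(); // typed arrays sort numerically
  return { medianNs: perCallNs[batches >> 1], sink };
}

// Stand-in kernel so the sketch runs on its own; pass instance.exports.run instead.
const standIn = (x) => (Math.imul(x, 31) + 7) | 0;
const { medianNs, sink } = timeBatched(standIn);
console.log(`~${medianNs.toFixed(2)} ns per call (includes loop overhead), checksum ${sink}`);
```

The per-call figure still contains the inner loop's own increment and compare, so treat it as an upper bound on the kernel cost. The median across batches keeps the estimate robust to a batch that happened to absorb a GC pause.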
Note also that the importObject here is empty ({}). If your kernel imports host functions — a
Math.random, a logging callback, a memory you supply — those imports become part of what you measure,
and a slow JavaScript import called inside the hot loop will swamp the kernel. Keep the timed function
pure where you can, and if it must call back into the host, measure that import’s cost separately so you
know how much of the number belongs to Wasm and how much to the boundary it crosses.
Optimization flags & tradeoffs, with numbers
The three levels you compared in step 1 trade throughput against binary size. Representative figures
for a numeric kernel (a SAXPY inner loop, y[i] = a*x[i] + y[i] over 1M elements) compiled from
Rust and post-processed with wasm-opt:
| Level | Throughput (Melem/s) | .wasm size | When to pick it |
|---|---|---|---|
| -O0 (none) | 410 | 12.4 KB | never ship this; baseline only |
| -Os | 980 | 7.1 KB | size-constrained bundles, cold-start sensitive |
| -O2 | 1180 | 8.0 KB | the safe default: most of -O3, smaller |
| -O3 | 1240 | 9.6 KB | compute-bound hot paths; aggressive inlining |
The headline: -O3 buys only ~5% over -O2 here but costs 20% more bytes, because the SAXPY loop is
memory-bandwidth bound and inlining cannot help bandwidth. On a branchy, call-heavy kernel the gap
can widen to 20–40% because the more aggressive inlining at -O3 (the --inlining-optimizing pass with
looser size limits) removes call overhead the bandwidth-bound case never had. This is exactly why you
measure your kernel rather than trusting a table: the
right level is workload-dependent. The size side of this tradeoff — and how wasm-opt achieves it —
is the focus of reducing Wasm bundle size with wasm-opt,
and the full flag matrix lives in Wasm optimization flags & size reduction.
Two practical corollaries follow. First, the size column is not free even when you are chasing
throughput: a larger .wasm takes longer to download and compile, so on a cold start the
instantiate cost can erase a steady-state win you only realize after thousands of calls. If your
module runs a kernel a handful of times per page load, -Os may beat -O3 end-to-end despite being
slower per iteration. Second, --converge (repeating the pass pipeline until the output stabilizes) can
squeeze another few percent of size out of any level, but it does not change throughput meaningfully and
roughly doubles optimize time — reach for it on shipping artifacts, not on every benchmark build. The
discipline is to benchmark the level you actually intend to ship, on the input distribution you actually
expect, rather than reporting the headline peak from a synthetic loop.
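The cold-start corollary is simple arithmetic. The sketch below works it through with entirely hypothetical figures (the startup penalty and per-call costs are made up for illustration and are not taken from the table above); `breakEvenCalls` is likewise an assumed helper name.

```javascript
// Calls needed before the faster-per-call build repays its extra startup cost.
function breakEvenCalls({ extraStartupNs, slowPerCallNs, fastPerCallNs }) {
  const savedPerCall = slowPerCallNs - fastPerCallNs;
  if (savedPerCall <= 0) return Infinity; // the "fast" build never catches up
  return Math.ceil(extraStartupNs / savedPerCall);
}

// Hypothetical: the larger build costs 0.4 ms more to download and compile,
// and saves 200 ns on each call (1200 ns vs 1000 ns).
const calls = breakEvenCalls({
  extraStartupNs: 400_000,
  slowPerCallNs: 1200,
  fastPerCallNs: 1000,
});
console.log(`break-even after ${calls} calls`);
```

Under those assumptions the larger build only pays for itself after 2000 calls; a page that invokes the kernel a handful of times never gets there. Plug in your own measured startup and steady-state numbers before deciding.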
Gotchas & failure modes
The optimizer deleted your benchmark. If a “Wasm” run reports sub-nanosecond per-iteration
times, the optimizer constant-folded the kernel or DCE removed the loop. The fix is twofold: make the
input data-dependent (read from a buffer the harness fills at runtime, not a compile-time constant)
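and consume the output. In Wasm specifically, wasm-opt --precompute will evaluate any expression
with constant operands at optimize time, so a call such as run(42) made from inside the module, with
42 hard-coded, can collapse into a single i32.const once the callee is inlined. wasm-opt only sees
constants inside the module, which is one more reason to feed inputs from the harness at runtime.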
You never warmed up the JIT. Timing the first 100 calls measures the baseline tier
(Liftoff in V8, the baseline compiler in SpiderMonkey), which can be 3–10× slower than the
optimized tier. The classic symptom is “Wasm is barely faster than JS”, because you compared cold,
not-yet-optimized Wasm against already-hot JS. Warm both to steady state before timing.
You measured the boundary, not the kernel. A call from JavaScript into Wasm costs a few nanoseconds of marshaling; if your kernel itself takes 5 ns, half your number is boundary overhead. Either amortize by doing more work per call (process 10,000 elements, not one), or — if the boundary is what you care about — measure it deliberately and label it as such.
GC noise contaminates the mean. A major GC during the timed loop adds a multi-millisecond spike. The median ignores it; the mean does not. This is the whole reason to report the median, and to look at p95 separately to see whether the tail is acceptable.
Cold vs hot tiers across runs. V8 keeps compiled, tiered-up Wasm code within a process but not
across node invocations. If you compare two .wasm files by running node twice, each run starts cold
and pays its own warmup, so neither contaminates the other, but the two runs also see different
system conditions (thermal state, background load). Keep the harness in one process and benchmark
both variants back to back if you want apples-to-apples.
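A single-process comparison might look like the sketch below. It is an illustration under stated assumptions: `compareInProcess` and `load` are hypothetical helpers, the defaults are arbitrary, and `TINY` is a hand-assembled stand-in module (exporting run(i32) -> i32 as x + 1) used only so the example runs without files. In practice you would pass the bytes of kernel.O2.wasm and kernel.O3.wasm.

```javascript
// Stand-in module: exports run(i32) -> i32 computing x + 1.
const TINY = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f, // type section: (i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: func 0 uses type 0
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x00, // export "run" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x41, 0x01, 0x6a, 0x0b, // local.get 0; i32.const 1; i32.add
]);

// Synchronous compile keeps the sketch free of top-level await.
function load(bytes) {
  return new WebAssembly.Instance(new WebAssembly.Module(bytes), {}).exports.run;
}

function compareInProcess(variants, { warmup = 20_000, batchSize = 10_000, rounds = 30 } = {}) {
  const kernels = Object.entries(variants).map(([name, bytes]) => ({
    name,
    run: load(bytes),
    samples: [],
  }));
  let sink = 0;
  // Warm every variant before any timed sample is taken.
  for (const k of kernels) for (let i = 0; i < warmup; i++) sink ^= k.run(i);
  // Interleave rounds so drift (thermal, background load) hits all variants alike.
  for (let r = 0; r < rounds; r++) {
    for (const k of kernels) {
      const t0 = process.hrtime.bigint();
      for (let i = 0; i < batchSize; i++) sink ^= k.run(i);
      k.samples.push(Number(process.hrtime.bigint() - t0) / batchSize);
    }
  }
  const medianNsPerCall = {};
  for (const k of kernels) {
    k.samples.sort((a, b) => a - b);
    medianNsPerCall[k.name] = k.samples[rounds >> 1];
  }
  return { medianNsPerCall, sink };
}

const report = compareInProcess({ O2: TINY, O3: TINY });
console.log(report.medianNsPerCall, "checksum", report.sink);
```

Because both entries here are the same stand-in module, the two medians should land close together; a large gap between identical variants is itself a sign the machine is too noisy. Interleaving rounds rather than running one variant to completion first is a design choice that spreads slow drift evenly across the variants.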
Verification
Before publishing a number, prove the artifact is what you think it is:
# instruction census — confirm -O3 actually changed the body
wasm-opt kernel.O0.wasm --metrics -o /dev/null
wasm-opt kernel.O3.wasm --metrics -o /dev/null
# disassemble the hot function and confirm the loop survived
wasm-objdump -d kernel.O3.wasm | grep -A 20 'func\[.*run'
# structural sanity
wasm-validate kernel.O3.wasm
If --metrics shows the same instruction counts for -O0 and -O3, the optimizer had nothing to
do (or you optimized the wrong file). If wasm-objdump shows your loop replaced by a single constant
return, DCE/precompute folded it — your benchmark is measuring nothing. These two checks catch the
majority of bogus benchmark results before they reach a slide deck.
In this guide
- Building a reproducible Wasm benchmark harness: the full harness, with warmup, fixed iterations, median/p95/stddev, and pinning to a worker to cut noise.
- Reading Binaryen IR from wasm-opt: dump the text IR with --print/wasm-dis, and see what --vacuum, --inlining-optimizing, and --precompute actually change.
- Measuring Wasm vs JavaScript throughput: benchmark the same numeric kernel in both, warm both JITs, isolate compute from marshaling, and read the results table.
Frequently Asked Questions
Why report the median and p95 instead of the average?
Microbenchmarks are contaminated by rare, large outliers: a GC pause, a scheduler preemption, a thermal throttle event. These skew the arithmetic mean by an unpredictable amount but barely move the median. The median answers “what does a typical iteration cost?” and the p95 answers “how bad is the tail?”. The mean answers neither reliably, which is why it is the wrong default for latency data.
Should I include WebAssembly.instantiate time in the benchmark?
Only if startup is what you are measuring. Instantiation (compile + link) is a one-time cost that has
nothing to do with steady-state throughput, so for a compute benchmark you instantiate once before
the timed loop. For a cold-start benchmark — “how fast can I go from bytes to first result?” — you
measure exactly that and label it separately. Conflating the two is the most common way Wasm looks
artificially slow.
Is performance.now() accurate enough for nanosecond-scale work?
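No. Browsers coarsen performance.now() as a Spectre mitigation: the High Resolution Time
specification allows 5 µs in cross-origin-isolated contexts and 100 µs otherwise, and some browsers
clamp more coarsely still, so a single 5 ns iteration is unmeasurable. Time a batch of, say, 100,000
iterations and divide. In Node, process.hrtime.bigint() reports nanoseconds, so resolution is not the
obstacle, but the cost of the two clock reads still swamps a kernel that small, so batch there too.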
Why does my Wasm benchmark get faster the second time I run the function?
Tiering. The engine first runs your module in a fast-to-compile baseline tier, then recompiles hot functions in an optimizing tier in the background. The transition can take a few hundred to a few thousand calls, which is exactly why the warmup phase exists: to reach steady state before any timed sample is recorded.
Does wasm-opt -O3 always beat -O2 on throughput?
No. -O3 adds more aggressive (and slower-to-run) passes like extra inlining, but on
bandwidth-bound or already-simple kernels the gain is in the noise while the binary grows. Treat
-O2 as the default and only adopt -O3 when your harness shows a real, repeatable win on your
kernel.
Related
- Wasm optimization flags & size reduction: the -O2/-O3/-Os/-Oz flag matrix and size tradeoffs.
- Reducing Wasm bundle size with wasm-opt: the post-compilation pass pipeline in depth.
- Setting up CI/CD for Rust Wasm projects: pinning toolchain versions so benchmarks stay reproducible.
- Is WebAssembly faster than JavaScript for DOM manipulation?: where the boundary, not the kernel, decides the winner.