Measuring Wasm vs JavaScript Throughput

This guide benchmarks the same numeric kernel in JavaScript and WebAssembly fairly: implement an identical algorithm in both, warm both JITs to steady state, isolate the compute from the JavaScript–Wasm boundary, and report a results table that shows when Wasm wins and when an optimized JIT ties it.

Prerequisites

  • [ ] Node.js 20+ or a browser with Web Workers and cross-origin isolation
  • [ ] A .wasm kernel and a JavaScript implementation of the same algorithm over the same data layout
  • [ ] binaryen 116+ so the Wasm side is -O2/-O3 optimized, not -O0
  • [ ] The harness from “Building a reproducible Wasm benchmark harness”

Procedure

1. Pick a kernel that exercises real compute

Use a tight numeric loop where the work, not the call, dominates. SAXPY (y[i] = a*x[i] + y[i]) over a large array is ideal: it is pure arithmetic over a flat buffer, so it isolates raw throughput from allocation and branching. A Mandelbrot escape-time loop is a good branchy alternative.

// JavaScript reference kernel — operates on a Float32Array in place
function saxpyJS(a, x, y, n) {
  for (let i = 0; i < n; i++) y[i] = a * x[i] + y[i];
}
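
For the branchy alternative, a minimal escape-time sketch might look like the following. The function name mandelbrotJS, its parameters, and the viewport (real axis from -2 to 1, imaginary axis from -1 to 1) are assumptions chosen for illustration; they are not part of the original harness.

```javascript
// Hypothetical branchy kernel: writes the escape iteration count of every pixel
// into `out`, a typed array with at least width * height elements.
function mandelbrotJS(width, height, maxIter, out) {
  for (let py = 0; py < height; py++) {
    const cy = -1.0 + (2.0 * py) / height;      // imaginary part of c
    for (let px = 0; px < width; px++) {
      const cx = -2.0 + (3.0 * px) / width;     // real part of c
      let zx = 0, zy = 0, iter = 0;
      // Data-dependent exit: the loop stops as soon as |z|^2 exceeds 4.
      while (iter < maxIter && zx * zx + zy * zy <= 4.0) {
        const next = zx * zx - zy * zy + cx;
        zy = 2.0 * zx * zy + cy;
        zx = next;
        iter++;
      }
      out[py * width + px] = iter;
    }
  }
}
```

Unlike SAXPY, the inner loop runs for a different number of iterations at every pixel, which is what makes this kernel a useful contrast.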

2. Write the identical kernel for Wasm

The Wasm version must do the same arithmetic over the same data layout — operating directly on linear memory so neither side has an unfair representation advantage.

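;; saxpy: y[i] = a * x[i] + y[i] for i in [0, n); $xp and $yp are byte offsets into linear memory
(func (export "saxpy") (param $a f32) (param $xp i32) (param $yp i32) (param $n i32)
  (local $i i32)
  ;; the loop tests its condition at the bottom, so return early when n == 0
  (if (i32.eqz (local.get $n)) (then (return)))
  (loop $L
    ;; (i << 2) converts the element index into a byte offset
    (f32.store
      (i32.add (local.get $yp) (i32.shl (local.get $i) (i32.const 2)))
      (f32.add
        (f32.mul (local.get $a)
          (f32.load (i32.add (local.get $xp) (i32.shl (local.get $i) (i32.const 2)))))
        (f32.load (i32.add (local.get $yp) (i32.shl (local.get $i) (i32.const 2))))))
    (local.set $i (i32.add (local.get $i) (i32.const 1)))
    (br_if $L (i32.lt_u (local.get $i) (local.get $n)))))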
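
One way to check the "same arithmetic" requirement before timing anything is to run both kernels on identical input and compare the outputs. The sketch below is self-contained, so it uses saxpyF32 as a stand-in for the Wasm kernel: it is an assumption that rounds with Math.fround after every operation to mimic f32.mul followed by f32.add. The helper maxRelDiff and the 1e-6 tolerance are likewise illustrative choices.

```javascript
// Reference kernel from step 1: JavaScript does the arithmetic in double
// precision and rounds to f32 only when the result is stored.
function saxpyJS(a, x, y, n) {
  for (let i = 0; i < n; i++) y[i] = a * x[i] + y[i];
}

// Stand-in for the Wasm kernel (assumption): Math.fround after every
// operation mimics the rounding of f32.mul followed by f32.add.
function saxpyF32(a, x, y, n) {
  const af = Math.fround(a);
  for (let i = 0; i < n; i++) {
    y[i] = Math.fround(Math.fround(af * x[i]) + y[i]);
  }
}

// Largest relative difference between two equally sized outputs.
function maxRelDiff(p, q) {
  let worst = 0;
  for (let i = 0; i < p.length; i++) {
    const scale = Math.max(Math.abs(p[i]), Math.abs(q[i]), Number.MIN_VALUE);
    worst = Math.max(worst, Math.abs(p[i] - q[i]) / scale);
  }
  return worst;
}

// Deterministic positive inputs so the check is repeatable.
const n = 1024;
const x = Float32Array.from({ length: n }, (_, i) => ((i % 97) + 1) / 97);
const yA = Float32Array.from({ length: n }, (_, i) => ((i % 89) + 1) / 89);
const yB = yA.slice();

saxpyJS(1.5, x, yA, n);
saxpyF32(1.5, x, yB, n);
const diff = maxRelDiff(yA, yB);
console.log(diff <= 1e-6 ? 'kernels agree within f32 rounding' : 'kernels diverge');
```

With a real module, the second output would come from calling instance.exports.saxpy and reading the y region of linear memory. A small nonzero difference is expected, because the JavaScript kernel rounds once and the f32 path rounds twice.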

3. Share one buffer — do not copy per call

Place x and y once in the module’s linear memory and pass pointers. Copying the arrays in on every call would time the copy rather than the kernel, and would unfairly penalize Wasm.

const mem = new Float32Array(instance.exports.memory.buffer);
const N = 1 << 20;                 // 1,048,576 elements
const XP = 0, YP = N * 4;          // byte offsets for x and y
mem.set(x, XP / 4);
mem.set(y, YP / 4);
// timed call touches only the kernel, no marshaling:
instance.exports.saxpy(2.0, XP, YP, N);
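
The mem.set calls above only succeed if linear memory is large enough for both arrays. The following sketch works through the sizing arithmetic. The WebAssembly.Memory created here is a stand-in so the example runs on its own (with a real module you would use instance.exports.memory), and the names PAGE_BYTES, pagesNeeded, xView and yView are assumptions.

```javascript
const N = 1 << 20;                                   // 1,048,576 elements
const BYTES_PER_F32 = Float32Array.BYTES_PER_ELEMENT; // 4
const PAGE_BYTES = 65536;                            // size of one Wasm memory page

const XP = 0;                                        // byte offset of x
const YP = N * BYTES_PER_F32;                        // byte offset of y, directly after x
const bytesNeeded = YP + N * BYTES_PER_F32;          // 8 MiB for both arrays
const pagesNeeded = Math.ceil(bytesNeeded / PAGE_BYTES);

// Stand-in memory (assumption): starts at one page and is grown to fit.
const memory = new WebAssembly.Memory({ initial: 1 });
const currentPages = memory.buffer.byteLength / PAGE_BYTES;
if (currentPages < pagesNeeded) memory.grow(pagesNeeded - currentPages);

// Create the views only after grow(): growing detaches the previous ArrayBuffer,
// so any typed array built on the old buffer becomes unusable.
const xView = new Float32Array(memory.buffer, XP, N);
const yView = new Float32Array(memory.buffer, YP, N);
console.log({ pagesNeeded, bytes: memory.buffer.byteLength });
```

Starting x at offset 0 is safe for the hand-written module in step 2. Modules produced by a C or Rust toolchain typically keep their own data and stack at low addresses, so with such a module the offsets should come from its allocator instead.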

4. Warm both JITs identically

Run each implementation untimed to steady state before timing either. JavaScript’s optimizing tier needs the same chance to warm up that Wasm’s does; comparing cold JS to hot Wasm (or vice versa) is the most common way these benchmarks lie.

for (let i = 0; i < 50; i++) saxpyJS(2.0, x, y, N);                 // warm JS
for (let i = 0; i < 50; i++) instance.exports.saxpy(2.0, XP, YP, N); // warm Wasm

5. Time both under one harness and report distributions

Use the same timing code for both, batch enough work that the boundary is negligible, and report median/p95 as in the harness guide. Run several passes back to back and alternate the order to cancel any warm-cache bias.

function timeKernel(fn, passes) {
  for (let i = 0; i < 50; i++) fn();            // warm to steady state
  const samples = new Float64Array(passes);
  for (let p = 0; p < passes; p++) {
    const t0 = process.hrtime.bigint();
    fn();                                        // one full pass over N elements
    samples[p] = Number(process.hrtime.bigint() - t0) / 1e6; // ms
  }
  samples.sort((a, b) => a - b);
  return samples[passes >> 1]; // median ms
}

const jsMs = timeKernel(() => saxpyJS(2.0, x, y, N), 200);
const wasmMs = timeKernel(() => instance.exports.saxpy(2.0, XP, YP, N), 200);
console.log({ jsMs, wasmMs, ratio: +(jsMs / wasmMs).toFixed(2) });

Because each pass already touches a million elements, per-pass timing is fine here — the kernel runs for hundreds of microseconds, so the two hrtime reads are negligible and there is no need for the batch form. The ratio is what you report; the absolute milliseconds depend on the machine.
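
The timeKernel function above reports only the median and times one implementation after the other. A sketch of the interleaved variant that step 5 describes, reporting both median and p95, could look like this. The names compareKernels and percentile, the nearest-rank percentile method, and the injectable now parameter (added so the clock can be replaced in tests) are assumptions, not part of the original harness.

```javascript
// Nearest-rank percentile of an ascending-sorted array; p is a fraction in (0, 1].
function percentile(sorted, p) {
  const rank = Math.ceil(p * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

// Time two kernels in interleaved passes, swapping which one runs first on
// every pass so neither consistently benefits from a warm cache.
// `now` must return nanoseconds as a BigInt (Node's process.hrtime.bigint by default).
function compareKernels(fnA, fnB, passes, now = () => process.hrtime.bigint()) {
  for (let i = 0; i < 50; i++) { fnA(); fnB(); }     // identical warm-up for both

  const timeOne = (fn) => {
    const t0 = now();
    fn();
    return Number(now() - t0) / 1e6;                 // milliseconds
  };

  const a = new Float64Array(passes);
  const b = new Float64Array(passes);
  for (let p = 0; p < passes; p++) {
    if (p % 2 === 0) { a[p] = timeOne(fnA); b[p] = timeOne(fnB); }
    else             { b[p] = timeOne(fnB); a[p] = timeOne(fnA); }
  }

  a.sort();                                          // typed arrays sort numerically
  b.sort();
  const summarize = (s) => ({ median: percentile(s, 0.5), p95: percentile(s, 0.95) });
  const result = { a: summarize(a), b: summarize(b) };
  result.ratio = result.a.median / result.b.median;  // > 1 means b is faster
  return result;
}
```

Called as compareKernels(() => saxpyJS(2.0, x, y, N), () => instance.exports.saxpy(2.0, XP, YP, N), 200), the ratio field corresponds to jsMs / wasmMs in the code above. A p95 far above the median suggests the run was disturbed by GC or scheduling and is worth repeating.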

Expected output

A representative run (Node 20, V8, 1,048,576-element Float32Array, 200 timed passes, -O3 Wasm):

| Kernel                          | Impl       | Median / pass | Throughput    | Ratio |
|---------------------------------|------------|---------------|---------------|-------|
| SAXPY, N=1,048,576              | JavaScript | 0.74 ms       | 1,420 Melem/s | 1.00× |
| SAXPY, N=1,048,576              | Wasm -O3   | 0.68 ms       | 1,540 Melem/s | 1.08× |
| Mandelbrot, 800×600, 1000 iters | JavaScript | 41.2 ms       |               | 1.00× |
| Mandelbrot, 800×600, 1000 iters | Wasm -O3   | 14.9 ms       |               | 2.77× |
| SAXPY, N=64 (tiny)              | JavaScript | 38 ns         |               | 1.00× |
| SAXPY, N=64 (tiny)              | Wasm -O3   | 71 ns         |               | 0.54× |
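
The Throughput and Ratio columns follow from the medians by simple arithmetic. As a rough sketch (the helper names throughputMelemPerSec and speedup are assumptions), the derivation is:

```javascript
// Elements processed per second, in millions, from a median time in milliseconds.
function throughputMelemPerSec(elements, medianMs) {
  const seconds = medianMs / 1000;
  return elements / seconds / 1e6;
}

// How many times faster the candidate is than the baseline (above 1 means faster).
function speedup(baselineMs, candidateMs) {
  return baselineMs / candidateMs;
}

const N = 1 << 20; // 1,048,576 elements
console.log(Math.round(throughputMelemPerSec(N, 0.74))); // 1417, shown as 1,420 in the table
console.log(Math.round(throughputMelemPerSec(N, 0.68))); // 1542, shown as 1,540 in the table
console.log(speedup(41.2, 14.9).toFixed(2));             // "2.77"
```

Ratios recomputed from the rounded medians can differ from the table in the last digit, presumably because the table was built from unrounded timings.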

Three lessons fall out of this table. On large SAXPY the two are within ~8%: the loop is memory-bandwidth bound, and V8’s TurboFan compiles a monomorphic Float32Array loop down to tight machine code, so there is little left for Wasm to win. On branchy Mandelbrot, Wasm is ~2.8× faster in this run; a plausible explanation is that the JIT’s speculative code carries checks and deoptimization exits that statically typed Wasm does not need. On the tiny N=64 case Wasm is slower, because the per-call boundary crossing (~30 ns) dwarfs 64 elements of work, which shows that input size, not language, decides the winner at the small end.

The deeper reason these results are not contradictory is that JavaScript and Wasm fail in different places. A JIT is fast when the code stays monomorphic and the type feedback it gathered during warmup keeps holding; it slows down, sometimes catastrophically via deoptimization back to a lower tier, when a value’s type changes, an array goes from “packed doubles” to “holey”, or a path it never saw during warmup starts executing. Wasm relies on far less of that speculation: its types are fixed at compile time, so there is no comparable deopt cliff to fall off, which is the most likely reason the branchy Mandelbrot kernel favors it. Conversely, Wasm cannot beat physics: on a bandwidth-bound stream like SAXPY both languages are waiting on the same memory bus, so they converge. The practical reading: expect Wasm’s advantage to track how unpredictable your control flow is, not how much arithmetic you do.

Gotchas

Comparing cold Wasm against warm JS. If you measure Wasm on its first few calls (baseline tier) against JavaScript that has already tiered up, Wasm looks slow for the wrong reason. Warm both with the same iteration count before timing, every time.

Including instantiate time in the Wasm number. Compilation and linking are one-time startup costs. Folding them into a throughput benchmark makes Wasm look catastrophically slow on small inputs. Instantiate once before the timed region; benchmark startup separately if you care about it.

Tiny inputs dominated by call overhead. At small N the ~30 ns boundary crossing per call swamps the compute, so Wasm “loses”. This is not a kernel result — it is the boundary. Either batch many small problems into one call or report the crossover N where Wasm overtakes JS (here, around N=256).

Letting the data be a compile-time constant. If a or the array contents are literals the optimizer can fold, one side computes nothing. Fill the buffers with runtime data and consume the output so neither implementation is hollowed out by dead-code elimination.
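
A minimal sketch of the last point, filling the buffers with values that are unknown at compile time and folding the output into a value that is actually used. The helper names fillRuntime and checksum and the LCG constants are illustrative assumptions, and the plain loop stands in for a timed kernel call.

```javascript
// Fill a typed array with pseudo-random values between 0 and 1, derived from a
// seed that is only known at run time (the current clock by default).
function fillRuntime(arr, seed = Date.now()) {
  let state = seed >>> 0;
  for (let i = 0; i < arr.length; i++) {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0; // 32-bit LCG step
    arr[i] = state / 4294967296;
  }
  return arr;
}

// Fold the whole output into one number so the optimizer cannot discard the work.
function checksum(arr) {
  let sum = 0;
  for (let i = 0; i < arr.length; i++) sum += arr[i];
  return sum;
}

const x = fillRuntime(new Float32Array(1024));
const y = fillRuntime(new Float32Array(1024), Date.now() ^ 0x9e3779b9);

let sink = 0;
for (let i = 0; i < x.length; i++) y[i] = 2.0 * x[i] + y[i]; // stand-in for a timed kernel call
sink += checksum(y);
console.log('checksum (printed so the result is observably used):', sink);
```

Compute the checksum outside the timed region so it does not add to the measurement; it only needs to make the output observable.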

Performance note

The crossover point matters more than the peak ratio. For SAXPY, Wasm only pulls ahead once N is large enough that compute outweighs the ~30 ns call boundary; below roughly 256 elements, JavaScript ties or wins outright because the boundary dominates the cost. Whether Wasm is “faster” is therefore a question about your input distribution, not the language; the same lens applies when the boundary is the DOM, as explored in “Is WebAssembly faster than JavaScript for DOM manipulation?”
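
Finding the crossover amounts to sweeping N and recording the first size at which Wasm is at least as fast. The sketch below shows the sweep logic only. The function findCrossover is an assumed helper, and the two cost functions are a synthetic model (a fixed 30 ns boundary plus a per-element cost, with coefficients picked so the crossover lands near 256), not measurements from the table.

```javascript
// Return the smallest swept size at which the candidate is at least as fast
// as the baseline, or null if it never catches up within the sweep.
function findCrossover(sizes, timeBaseline, timeCandidate) {
  for (const n of sizes) {
    if (timeCandidate(n) <= timeBaseline(n)) return n;
  }
  return null;
}

// Synthetic cost model in nanoseconds (made-up coefficients, for illustration only).
const modelJsNs = (n) => 0.60 * n;          // no boundary cost, slower per element
const modelWasmNs = (n) => 30 + 0.48 * n;   // 30 ns boundary, faster per element

const sizes = [16, 32, 64, 128, 256, 512, 1024];
const crossover = findCrossover(sizes, modelJsNs, modelWasmNs);
console.log('first swept N where the candidate is at least as fast:', crossover); // 256
```

In a real sweep the two callbacks would return measured medians for each N. Because real timings are noisy near the crossover, it is worth repeating the sweep and reporting a range rather than a single size.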

Frequently Asked Questions

Why does JavaScript sometimes tie Wasm on numeric loops? Modern JITs like TurboFan aggressively optimize monomorphic loops over typed arrays into tight machine code that works on unboxed numbers. When the kernel is simple and bandwidth-bound, there is little headroom left for Wasm to exploit, so the two converge. Wasm’s edge grows with branch complexity, integer-heavy work, and avoiding GC, not with simple float loops.

Should marshaling cost count against Wasm? It depends on the question. If you are deciding whether to move an algorithm into Wasm, the marshaling is a real cost you will pay, so include it. If you are measuring the kernel’s intrinsic speed to decide which is the better compute engine, exclude it and benchmark with data resident in linear memory. State which one you did — conflating them is how benchmarks mislead.

How big must the input be for Wasm to win? Large enough that compute time exceeds the per-call boundary (tens of nanoseconds). For SAXPY that crossover is around N=256 on V8; for branchy kernels like mandelbrot, Wasm can win even at small sizes because the JIT’s branch handling is the bottleneck. Find your own crossover by sweeping N.

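Does it matter whether I use Float32Array or Float64Array? Yes, and it can flip the result. JavaScript numbers are doubles, so a Float64Array kernel maps onto the JIT’s native representation with no conversions, while a Float32Array kernel widens each value to a double on load and narrows it back to f32 on store. Wasm, by contrast, has first-class f32 and f64 and pays no such penalty. The rounding differs too: JavaScript rounds a * x[i] + y[i] to f32 once, at the store, whereas f32.mul followed by f32.add rounds twice, so the outputs can differ in the last bit unless the JavaScript side wraps intermediates in Math.fround. Benchmark the precision you will actually ship; measuring f32 Wasm against f64 JS (or vice versa) compares two different algorithms and tells you nothing useful about either.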
