Reading Binaryen IR from wasm-opt

This guide shows how to dump Binaryen’s text intermediate representation (IR) with wasm-opt --print and wasm-dis, read a small function’s IR, and compare the IR before and after three concrete optimization passes (--vacuum, --inlining-optimizing, and --precompute), so that a benchmark change has an explanation instead of a shrug.

Prerequisites

  • [ ] binaryen 116+ on your PATH (wasm-opt --version should report wasm-opt version 116 or later)
  • [ ] A .wasm module to inspect, ideally one with a small arithmetic function
  • [ ] wabt, for cross-checking against standard wat (wasm-dis ships with Binaryen; wasm2wat ships with wabt)

Procedure

1. Dump the IR without optimizing

wasm-opt --print prints the module as Binaryen IR after running whatever passes you asked for. With no optimization flags, it prints the IR as parsed — the cleanest way to see the starting point:

wasm-opt kernel.wasm --print -o /dev/null

-o /dev/null discards the binary output; you only want the printed IR on stdout. An equivalent way to get unoptimized text is wasm-dis, which is a pure disassembler (no passes, ever):

wasm-dis kernel.wasm -o kernel.ir.wat

2. Read one function

Binaryen IR looks like wat but is a distinct, more regular S-expression form. Consider a function that squares its argument and adds a constant:

(func $square_plus (param $x i32) (result i32)
  (i32.add
    (i32.mul
      (local.get $x)
      (local.get $x))
    (i32.const 7)))

Read it inside-out, like any S-expression: local.get $x twice feeds i32.mul, whose result feeds i32.add alongside i32.const 7. Binaryen always names locals and prints a fully parenthesized, folded form. There is no flat stack-machine listing here, which is one way the IR differs from wat written in its linear form.

The folded form is the whole point. The underlying binary, and wat in its linear form, describe a stack machine: operands are pushed, then an opcode consumes them. Binaryen instead keeps an expression tree, where each node’s children are its operands, because trees are what optimizers manipulate: you can match a subtree, replace it, and re-fold without tracking an implicit operand stack. When you read (i32.add (i32.mul ...) (i32.const 7)) you are seeing the data structure the passes rewrite. This is why a pass description like “replaces a subexpression with a constant” maps so cleanly onto what you observe in --print output: the node really is swapped in the tree. Get comfortable reading these trees and the rest of this guide follows easily.
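
To make the stack-versus-tree distinction concrete, here is a minimal Python sketch that folds a flat stack-machine listing for $square_plus into the nested form shown above. It is a toy model, not Binaryen code: the ARITY table, the tuple node shape, and the fold and show helper names are all assumptions made for illustration.

```python
# Toy model: fold a flat stack-machine listing into an expression tree.
# ARITY and the (opcode, immediate, children) node shape are illustrative
# assumptions, not Binaryen data structures.
ARITY = {"local.get": 0, "i32.const": 0, "i32.mul": 2, "i32.add": 2}

def fold(instructions):
    """Turn [(opcode, immediate), ...] into one nested tree."""
    stack = []
    for opcode, immediate in instructions:
        # Pop this opcode's operands, then restore left-to-right order.
        children = [stack.pop() for _ in range(ARITY[opcode])][::-1]
        stack.append((opcode, immediate, children))
    assert len(stack) == 1, "a well-formed body leaves exactly one value"
    return stack[0]

def show(node):
    """Print a tree in folded S-expression form."""
    opcode, immediate, children = node
    parts = [opcode]
    if immediate is not None:
        parts.append(str(immediate))
    parts.extend(show(child) for child in children)
    return "(" + " ".join(parts) + ")"

# Flat listing: what a stack machine executes, one opcode at a time.
flat = [
    ("local.get", "$x"),
    ("local.get", "$x"),
    ("i32.mul", None),
    ("i32.const", 7),
    ("i32.add", None),
]
tree = fold(flat)
print(show(tree))
# (i32.add (i32.mul (local.get $x) (local.get $x)) (i32.const 7))
```

Once the body is a tree, "replace this subtree" is a plain recursive rewrite, which is the shape every pass in the following steps takes.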

3. Watch --vacuum remove dead code

--vacuum deletes expressions whose results are unused and that have no side effects. Start with IR that computes a value and throws it away:

(func $noisy (param $x i32) (result i32)
  (drop (i32.mul (local.get $x) (local.get $x)))  ;; dead: result dropped
  (i32.add (local.get $x) (i32.const 1)))

Run only that pass:

wasm-opt kernel.wasm --vacuum --print -o /dev/null

The dropped multiply disappears because it is pure and unused:

(func $noisy (param $x i32) (result i32)
  (i32.add (local.get $x) (i32.const 1)))

This is the pass that punishes a sloppy benchmark: if you do not consume your kernel’s result, --vacuum (and the engine’s own DCE) can erase the work.
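
The rule "unused result plus no side effects means deletable" can be sketched in a few lines of Python. This is a toy model of the idea, not Binaryen's implementation: the PURE set, the tuple encoding of nodes, and the hypothetical $log callee are assumptions for illustration.

```python
# Toy model of the idea behind --vacuum; not Binaryen's implementation.
# Assumption: only these opcodes count as side-effect free in this sketch.
PURE = {"local.get", "i32.const", "i32.mul", "i32.add", "i32.shl"}

def is_pure(node):
    """A node is pure if its opcode is pure and so are all child nodes."""
    opcode, *rest = node
    children = [part for part in rest if isinstance(part, tuple)]
    return opcode in PURE and all(is_pure(child) for child in children)

def vacuum(body):
    """Remove (drop <pure expr>) statements; keep everything else."""
    kept = []
    for stmt in body:
        if stmt[0] == "drop" and is_pure(stmt[1]):
            continue  # result unused and no side effects: safe to delete
        kept.append(stmt)
    return kept

noisy = [
    ("drop", ("i32.mul", ("local.get", "$x"), ("local.get", "$x"))),
    ("i32.add", ("local.get", "$x"), ("i32.const", 1)),
]
# A dropped call must stay: the callee ($log is hypothetical) may have effects.
noisy_with_call = [
    ("drop", ("call", "$log", ("local.get", "$x"))),
    ("i32.add", ("local.get", "$x"), ("i32.const", 1)),
]

cleaned = vacuum(noisy)
cleaned_with_call = vacuum(noisy_with_call)
print(cleaned)
print(len(cleaned_with_call))
```

The second case is the one to remember when benchmarking: work survives only if something observable depends on it.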

4. Watch --precompute constant-fold

--precompute evaluates expressions whose operands are all constants. Consider a call with a hard-coded argument, which stays opaque until the inliner exposes the callee’s body:

(func $cube (param $x i32) (result i32)
  (i32.mul (local.get $x) (i32.mul (local.get $x) (local.get $x))))
(func $main (result i32)
  (call $cube (i32.const 4)))   ;; constant argument

After --inlining-optimizing exposes the body and --precompute folds it, $main collapses to a single constant:

wasm-opt kernel.wasm --inlining-optimizing --precompute --print -o /dev/null
(func $main (result i32)
  (i32.const 64))   ;; 4 * 4 * 4, computed at optimize time

A benchmark that calls run(4) with a literal can be folded to a single i32.const like this, which is why benchmark inputs must be runtime data, not literals.
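
The folding step itself can be sketched as a small recursive evaluator. This Python toy is an assumption-laden illustration, not Binaryen's --precompute: it handles three i32 operators, stores constants as unsigned 32-bit values, and uses a tuple encoding invented for this example.

```python
# Toy constant folder illustrating the idea behind --precompute.
# Constants are kept as unsigned 32-bit values (an assumption of this sketch).
MASK = 0xFFFFFFFF
OPS = {
    "i32.add": lambda a, b: (a + b) & MASK,
    "i32.mul": lambda a, b: (a * b) & MASK,
    "i32.shl": lambda a, b: (a << (b & 31)) & MASK,
}

def precompute(node):
    """Fold a binary op bottom-up when both operands are constants."""
    opcode, *args = node
    if opcode not in OPS:
        return node  # leaves (i32.const, local.get) pass through unchanged
    left, right = (precompute(arg) for arg in args)
    if left[0] == "i32.const" and right[0] == "i32.const":
        return ("i32.const", OPS[opcode](left[1], right[1]))
    return (opcode, left, right)

# $main after inlining: every local.get $x has become the literal 4.
inlined_main = ("i32.mul", ("i32.const", 4),
                ("i32.mul", ("i32.const", 4), ("i32.const", 4)))
# $cube itself: operands are runtime values, so nothing can fold.
cube_body = ("i32.mul", ("local.get", "$x"),
             ("i32.mul", ("local.get", "$x"), ("local.get", "$x")))

print(precompute(inlined_main))  # ('i32.const', 64)
print(precompute(cube_body) == cube_body)  # True
```

The second call shows the defense: when the operand is a local.get fed from runtime data, the folder has nothing to evaluate.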

5. Watch --inlining-optimizing flatten a call

--inlining-optimizing inlines small callees and re-optimizes the caller. Given:

(func $double (param $x i32) (result i32)
  (i32.shl (local.get $x) (i32.const 1)))
(func $use (param $x i32) (result i32)
  (i32.add (call $double (local.get $x)) (i32.const 3)))

After the pass, the call $double is gone and its body is spliced into $use:

(func $use (param $x i32) (result i32)
  (i32.add (i32.shl (local.get $x) (i32.const 1)) (i32.const 3)))

This removes the call overhead a throughput benchmark would otherwise pay, and is a large part of why -O3 (which inlines more aggressively than -O2) can win on call-heavy kernels.
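
Splicing a callee body into its caller amounts to substituting arguments for parameters. The Python sketch below models only that substitution, under a loud assumption: it copies the argument expression wherever the parameter is read, which is safe here because the argument is a side-effect-free local.get. A real inliner has to handle arguments with side effects and parameters read more than once, typically through fresh locals; the FUNCS table and helper names are hypothetical.

```python
# Toy model of call inlining by substitution; not Binaryen's implementation.
# FUNCS maps a function name to (parameter names, body expression).
FUNCS = {
    "$double": (["$x"], ("i32.shl", ("local.get", "$x"), ("i32.const", 1))),
}

def substitute(node, bindings):
    """Replace (local.get $p) in a callee body with the bound argument."""
    if not isinstance(node, tuple):
        return node
    if node[0] == "local.get" and node[1] in bindings:
        return bindings[node[1]]
    return tuple(substitute(part, bindings) for part in node)

def inline(node, funcs):
    """Replace (call $f args...) with $f's body when $f is known."""
    if not isinstance(node, tuple):
        return node
    node = tuple(inline(part, funcs) for part in node)
    if node[0] == "call" and node[1] in funcs:
        params, body = funcs[node[1]]
        return substitute(body, dict(zip(params, node[2:])))
    return node

use_body = ("i32.add",
            ("call", "$double", ("local.get", "$x")),
            ("i32.const", 3))
inlined = inline(use_body, FUNCS)
print(inlined)
```

The result has the same shape as the post-pass $use above: the call node is gone and the shift sits directly under the add.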

6. Census the module with --metrics

--metrics prints a per-category instruction count instead of IR — the fastest way to confirm a pass actually changed something:

wasm-opt kernel.wasm -O3 --metrics -o /dev/null

Expected output

A --metrics run prints a census like this; compare the -O0 and -O3 numbers to attribute a speedup to fewer calls or loops:

total
 [exports]      : 3
 [funcs]        : 12
 [globals]      : 1
 [total]        : 489
 binary         : 142
 call           : 9
 const          : 78
 global.get     : 6
 load           : 41
 local.get      : 96
 loop           : 4
 store          : 12

After -O3 on a call-heavy module you typically see call and funcs drop (inlining) and total fall (vacuum + simplification). If call is unchanged, the inliner judged the callees too large — that is a finding, not a failure.

The categories that move tell you which lever the optimizer pulled. A drop in loop means a loop was eliminated or restructured; a drop in call usually means inlining fired; a drop in const plus a shrinking total usually means --precompute folded constant expressions; a drop in load/store can indicate that local common-subexpression elimination removed redundant memory traffic. Pairing a throughput measurement with a before/after --metrics diff turns “it got 12% faster” into “it got 12% faster because four calls were inlined and two redundant loads were removed”, which is the difference between a number you can defend and one you merely observed. Save both censuses to files and diff them when a result surprises you.
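
Diffing two saved censuses is easy to script. The sketch below assumes the "name : count" line layout shown in the expected output above; the helper names are hypothetical, the "before" rows are a subset of that sample, and the "after" numbers are invented purely to exercise the code.

```python
# Parse and diff two --metrics censuses. Assumes "name : count" lines as in
# the sample above; lines without a numeric count are skipped.
def parse_metrics(text):
    counts = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counts[name.strip()] = int(value)
    return counts

def diff_metrics(before, after):
    """Return {category: after - before} for categories that changed."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        delta = after.get(key, 0) - before.get(key, 0)
        if delta != 0:
            changed[key] = delta
    return changed

BEFORE = """total
 [funcs]        : 12
 [total]        : 489
 call           : 9
 load           : 41
 loop           : 4
"""
# Hypothetical post-optimization census, made up for this example.
AFTER = """total
 [funcs]        : 8
 [total]        : 431
 call           : 5
 load           : 39
 loop           : 4
"""

delta = diff_metrics(parse_metrics(BEFORE), parse_metrics(AFTER))
for category, change in delta.items():
    print(f"{category:<10} {change:+d}")
```

Categories that did not move (loop here) are omitted, so the output is the list of levers that were actually pulled.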

Gotchas

Binaryen IR is not exactly wat. It is a normalized, fully-folded form: locals are always named, some sugar is expanded, and block/loop labeling differs. Do not paste --print output into a wat assembler and expect a byte-identical round-trip; run wabt’s wasm2wat on the binary if you need canonical text format, and wat2wasm to re-assemble.

Pass ordering changes the result. Passes are not commutative. --precompute only folds the call $cube (i32.const 4) example after --inlining-optimizing has exposed the body; run them in the other order and the call is still opaque when --precompute sees it. The -O3 meta-pass picks a tuned ordering for you, which is why hand-picking individual passes can do worse than -O3 if you order them naively.
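
The ordering effect can be reproduced with two toy passes applied in both orders. As with any model this small, it is a sketch of the principle rather than of Binaryen: the tuple encoding, the FUNCS table, and the helper names are assumptions, and the folder only understands i32.mul.

```python
# Toy demonstration that pass order matters; not Binaryen's implementation.
MASK = 0xFFFFFFFF
FUNCS = {
    "$cube": (["$x"], ("i32.mul", ("local.get", "$x"),
                       ("i32.mul", ("local.get", "$x"), ("local.get", "$x")))),
}

def bind(node, env):
    """Substitute bound arguments for (local.get $param) in a callee body."""
    if not isinstance(node, tuple):
        return node
    if node[0] == "local.get" and node[1] in env:
        return env[node[1]]
    return tuple(bind(part, env) for part in node)

def inline(node):
    """Replace calls to known functions with their bodies."""
    if not isinstance(node, tuple):
        return node
    node = tuple(inline(part) for part in node)
    if node[0] == "call" and node[1] in FUNCS:
        params, body = FUNCS[node[1]]
        return bind(body, dict(zip(params, node[2:])))
    return node

def precompute(node):
    """Fold i32.mul when both operands are constants."""
    if not isinstance(node, tuple):
        return node
    node = tuple(precompute(part) for part in node)
    if (node[0] == "i32.mul"
            and node[1][0] == "i32.const" and node[2][0] == "i32.const"):
        return ("i32.const", (node[1][1] * node[2][1]) & MASK)
    return node

main_body = ("call", "$cube", ("i32.const", 4))
inline_then_fold = precompute(inline(main_body))
fold_then_inline = inline(precompute(main_body))
print(inline_then_fold)   # ('i32.const', 64)
print(fold_then_inline)   # multiplies remain: the folder ran too early
```

In the second order the folder sees only an opaque call node, does nothing, and the multiplies exposed afterwards are never revisited.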

--print shows post-pass IR; wasm-dis never runs passes. If you want “before”, use wasm-dis or --print with no optimization flags. Mixing them up (printing after -O3 and calling it “before”) is an easy way to misread what a pass did.
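
One way to keep "before" and "after" from being confused is to generate both from the same helper and label them in the diff. The Python sketch below wraps the wasm-opt invocation used in this guide; the function names are hypothetical, kernel.wasm is the placeholder module name from the earlier steps, and /dev/null assumes a Unix-like system.

```python
import difflib
import shutil
import subprocess

def print_command(wasm_path, passes=()):
    """Build the wasm-opt --print invocation used throughout this guide."""
    return ["wasm-opt", wasm_path, *passes, "--print", "-o", "/dev/null"]

def dump_ir(wasm_path, passes=()):
    """Return printed IR; with no passes this is the true 'before'."""
    if shutil.which("wasm-opt") is None:
        raise RuntimeError("wasm-opt not found on PATH")
    result = subprocess.run(print_command(wasm_path, passes),
                            check=True, capture_output=True, text=True)
    return result.stdout

def ir_diff(before, after, label="after"):
    """Unified diff of two IR dumps with explicit before/after labels."""
    return "\n".join(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before (no passes)", tofile=label, lineterm=""))

# Typical use, assuming kernel.wasm exists and wasm-opt is installed:
#   before = dump_ir("kernel.wasm")
#   after = dump_ir("kernel.wasm", ["--vacuum"])
#   print(ir_diff(before, after, label="after --vacuum"))
```

Because dump_ir with an empty pass list is the only source of the "before" text, an already-optimized dump cannot slip into that slot by accident.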

A missing name section turns functions into indices. Once you strip the custom name section (common in shipping builds), --print labels functions by index ($0, $1, and so on) instead of $square_plus. The IR is identical, but it is far harder to read. Note that wasm-opt only writes a name section to its output when you pass -g (--debuginfo). When you are diagnosing, optimize a copy that keeps the names (wasm-opt -O3 -g, without --strip-debug) so the IR stays legible, and strip only the artifact you actually ship.

Performance note

Reading IR is a diagnostic, not an optimization: it costs you nothing at runtime and saves hours of guessing. The single highest-value check is running --metrics before and after your chosen pass level. A throughput win with no corresponding change in the census (fewer call, loop, or binary nodes, for example) is often measurement noise, and the census lets you check that in one command.

Frequently Asked Questions

Why does --print show different code than wasm-objdump -d? --print shows Binaryen’s structured IR (folded S-expressions); wasm-objdump -d shows the linear stack-machine bytecode disassembly. They describe the same module at different abstraction levels — IR is what the optimizer manipulates, the disassembly is what the engine decodes.

Can I run a single pass to see its effect in isolation? Yes — pass just that flag, e.g. wasm-opt in.wasm --vacuum --print -o /dev/null. Isolating one pass is the clearest way to learn what it does, but remember the real pipelines (-O2/-O3) run dozens in a tuned order, so a pass in isolation may do less than it does in context.

Does --print modify my output binary? No. --print is a side channel that writes IR to stdout; whatever binary the passes produce still goes to the -o path, and with -o /dev/null it is simply discarded. To both optimize and inspect, use -o real.wasm --print.
