Reading Binaryen IR from wasm-opt
This guide shows how to dump Binaryen’s text intermediate representation with wasm-opt --print
and wasm-dis, read a small function’s IR, and watch concrete optimization passes — --vacuum,
--inlining-optimizing, and --precompute — rewrite that IR before and after, so a benchmark
change has an explanation instead of a shrug.
Prerequisites
- [ ] binaryen 116+ on your PATH (wasm-opt --version → wasm-opt version 116)
- [ ] A .wasm module to inspect, ideally one with a small arithmetic function
- [ ] wabt for cross-checking against true wat (wasm-dis ships with Binaryen, wasm2wat with wabt)
Procedure
1. Dump the IR without optimizing
wasm-opt --print prints the module as Binaryen IR. It runs as a pass in command-line order, so it shows
the IR after whatever passes precede it on the command line. With no optimization flags, it prints the IR
as parsed, which is the cleanest way to see the starting point:
wasm-opt kernel.wasm --print -o /dev/null
-o /dev/null discards the binary output; you only want the printed IR on stdout. An equivalent way
to get unoptimized text is wasm-dis, which is a pure disassembler (no passes, ever):
wasm-dis kernel.wasm -o kernel.ir.wat
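If you capture “before” and “after” dumps often, a small wrapper keeps the command lines consistent. The following is a minimal sketch, not part of Binaryen: the helper names print_ir_command and dump_ir are hypothetical, and it assumes wasm-opt is on your PATH. It uses only the flags shown above.

```python
import os
import subprocess


def print_ir_command(wasm_path, passes=()):
    """Build a wasm-opt command line that prints IR and discards the binary.

    Hypothetical helper: passes are placed before --print so the printed IR
    reflects them, matching the commands used in this guide.
    """
    return ["wasm-opt", wasm_path, *passes, "--print", "-o", os.devnull]


def dump_ir(wasm_path, out_path, passes=()):
    """Run wasm-opt and save the printed IR (its stdout) to out_path."""
    result = subprocess.run(
        print_ir_command(wasm_path, passes),
        check=True,           # raise if wasm-opt exits with an error
        capture_output=True,  # --print writes the IR to stdout
        text=True,
    )
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(result.stdout)
```

Calling dump_ir("kernel.wasm", "before.ir") and then dump_ir("kernel.wasm", "vacuum.ir", passes=["--vacuum"]) would leave two text files you can diff; both output file names are placeholders.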
2. Read one function
Binaryen IR looks like wat but is a distinct, more regular S-expression form. Consider a function
that squares its argument and adds a constant:
(func $square_plus (param $x i32) (result i32)
  (i32.add
    (i32.mul
      (local.get $x)
      (local.get $x))
    (i32.const 7)))
Read it inside-out, like any S-expression: local.get $x twice feeds i32.mul, whose result feeds
i32.add alongside i32.const 7. Binaryen always names locals and prints fully parenthesized,
folded form; there is no flat instruction listing here, which is one way its output differs from the flat wat that wasm2wat emits by default.
The folded form is the whole point. Flat wat and the underlying binary describe a stack machine: operands
are pushed, then an opcode consumes them. Binaryen instead keeps an expression tree, where each node’s
children are its operands, because trees are what optimizers manipulate — you can match a subtree, replace
it, and re-fold without tracking an implicit operand stack. When you read (i32.add (i32.mul ...) (i32.const 7))
you are seeing the exact data structure the passes rewrite. This is why a pass description like “replaces a
subexpression with a constant” maps so cleanly onto what you observe in --print output: the node really is
swapped in the tree. Get comfortable reading these trees and the rest of this guide reads itself.
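To make the tree view concrete, here is a minimal sketch of $square_plus as a data structure. It is an illustration only: the class names Const, LocalGet, and Binary are assumptions chosen for this example and are not Binaryen’s actual C++ types, and only the two operators used above are modeled.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF  # i32 arithmetic wraps modulo 2**32


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class LocalGet:
    name: str


@dataclass(frozen=True)
class Binary:
    op: str
    left: object
    right: object


def to_sexpr(node):
    """Print a node in folded form: each node's children are its operands."""
    if isinstance(node, Const):
        return f"(i32.const {node.value})"
    if isinstance(node, LocalGet):
        return f"(local.get ${node.name})"
    return f"({node.op} {to_sexpr(node.left)} {to_sexpr(node.right)})"


def evaluate(node, local_values):
    """Evaluate inside-out, the same order you read the S-expression."""
    if isinstance(node, Const):
        return node.value & MASK32
    if isinstance(node, LocalGet):
        return local_values[node.name] & MASK32
    a = evaluate(node.left, local_values)
    b = evaluate(node.right, local_values)
    if node.op == "i32.add":
        return (a + b) & MASK32
    if node.op == "i32.mul":
        return (a * b) & MASK32
    raise ValueError(f"unsupported op: {node.op}")


square_plus = Binary(
    "i32.add",
    Binary("i32.mul", LocalGet("x"), LocalGet("x")),
    Const(7),
)
```

Here to_sexpr(square_plus) reproduces the folded text, and evaluate(square_plus, {"x": 5}) walks the same tree to compute 5 * 5 + 7.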
3. Watch --vacuum remove dead code
--vacuum deletes expressions whose results are unused and that have no side effects. Start with IR
that computes a value and throws it away:
(func $noisy (param $x i32) (result i32)
(drop (i32.mul (local.get $x) (local.get $x))) ;; dead: result dropped
(i32.add (local.get $x) (i32.const 1)))
Run only that pass:
wasm-opt kernel.wasm --vacuum --print -o /dev/null
The dropped multiply disappears because it is pure and unused:
(func $noisy (param $x i32) (result i32)
(i32.add (local.get $x) (i32.const 1)))
This is the pass that punishes a sloppy benchmark: if you do not consume your kernel’s result,
--vacuum (and the engine’s own DCE) can erase the work.
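The rule “unused and free of side effects” can be sketched as a toy pass over a list of statements. This is a simplified model under stated assumptions, not Binaryen’s implementation: expressions are plain tuples, and the PURE_OPS set is a hand-picked assumption covering only the operators that appear in this guide. Anything outside it (calls, stores, and so on) is conservatively treated as having side effects.

```python
# Toy model: an expression is a tuple (op, operand, ...).
# Assumption: only these operators are known to be side-effect free.
PURE_OPS = {"local.get", "i32.const", "i32.add", "i32.mul", "i32.shl"}


def is_pure(expr):
    """True if expr and all of its sub-expressions are in PURE_OPS."""
    op, *operands = expr
    if op not in PURE_OPS:
        return False
    return all(is_pure(o) for o in operands if isinstance(o, tuple))


def vacuum(body):
    """Drop statements of the form (drop <pure expr>); keep everything else."""
    kept = []
    for stmt in body:
        if stmt[0] == "drop" and is_pure(stmt[1]):
            continue  # result unused and no side effects: safe to delete
        kept.append(stmt)
    return kept


noisy = [
    ("drop", ("i32.mul", ("local.get", "x"), ("local.get", "x"))),
    ("i32.add", ("local.get", "x"), ("i32.const", 1)),
]
```

Running vacuum(noisy) keeps only the i32.add, mirroring the output above. A dropped call would survive in this model because the toy cannot prove the callee is pure.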
4. Watch --precompute constant-fold
--precompute evaluates expressions whose operands are all constants. It cannot see through a call, so a
hard-coded argument only becomes foldable once the inliner has exposed the callee’s body. Start with:
(func $cube (param $x i32) (result i32)
(i32.mul (local.get $x) (i32.mul (local.get $x) (local.get $x))))
(func $main (result i32)
(call $cube (i32.const 4))) ;; constant argument
After --inlining-optimizing exposes the body and --precompute folds it, $main collapses to a
single constant:
wasm-opt kernel.wasm --inlining-optimizing --precompute --print -o /dev/null
(func $main (result i32)
(i32.const 64)) ;; 4 * 4 * 4, computed at optimize time
A benchmark that calls run(4) with a literal can be folded to a single i32.const in the same way, which is why
benchmark inputs must be runtime data, not literals.
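Constant folding itself is a small recursive rewrite. The sketch below is a toy model of the idea, not Binaryen’s --precompute: expressions are tuples, only three i32 operators are handled (an assumption for brevity), and constants are kept as unsigned 32-bit patterns.

```python
MASK32 = 0xFFFFFFFF  # i32 results wrap modulo 2**32

# Assumption: a deliberately tiny operator table for illustration.
FOLDABLE = {
    "i32.add": lambda a, b: (a + b) & MASK32,
    "i32.mul": lambda a, b: (a * b) & MASK32,
    "i32.shl": lambda a, b: (a << (b & 31)) & MASK32,  # shift count is mod 32
}


def precompute(expr):
    """Fold bottom-up: a node becomes a constant once both operands are."""
    op, *operands = expr
    if op not in FOLDABLE:
        return expr  # constants, local.get, and calls are left untouched
    left, right = (precompute(o) for o in operands)
    if left[0] == "i32.const" and right[0] == "i32.const":
        value = FOLDABLE[op](left[1] & MASK32, right[1] & MASK32)
        return ("i32.const", value)
    return (op, left, right)


four = ("i32.const", 4)
inlined_cube = ("i32.mul", four, ("i32.mul", four, four))
opaque_call = ("call", "cube", four)
```

In this model precompute(inlined_cube) collapses to a single constant, while precompute(opaque_call) comes back unchanged because the toy folder has no view into the callee.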
5. Watch --inlining-optimizing flatten a call
--inlining-optimizing inlines small callees and re-optimizes the caller. Given:
(func $double (param $x i32) (result i32)
(i32.shl (local.get $x) (i32.const 1)))
(func $use (param $x i32) (result i32)
(i32.add (call $double (local.get $x)) (i32.const 3)))
After the pass, the call $double is gone and its body is spliced into $use:
(func $use (param $x i32) (result i32)
(i32.add (i32.shl (local.get $x) (i32.const 1)) (i32.const 3)))
This removes the call overhead a throughput benchmark would otherwise pay, and is a large part of why
-O3 (which inlines more aggressively than -O2) can win on call-heavy kernels.
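Splicing a callee’s body into the caller can be modeled as parameter substitution. The following is a rough sketch under strong assumptions, not how Binaryen’s inliner is implemented: expressions are tuples, every function is a single expression, arguments are pure (substitution may duplicate them), and only one level of calls is inlined. The functions table format is hypothetical.

```python
def substitute(expr, bindings):
    """Replace each (local.get p) in a callee body with the bound argument."""
    op, *operands = expr
    if op == "local.get":
        return bindings.get(operands[0], expr)
    return (op, *(substitute(o, bindings) if isinstance(o, tuple) else o
                  for o in operands))


def inline_calls(expr, functions):
    """Replace (call name args...) with the callee body, arguments spliced in.

    functions maps a name to (parameter_names, body_expression).
    Assumption: arguments have no side effects, so duplicating them is safe.
    """
    op, *operands = expr
    if op == "call":
        name, *args = operands
        params, body = functions[name]
        inlined_args = [inline_calls(a, functions) for a in args]
        return substitute(body, dict(zip(params, inlined_args)))
    return (op, *(inline_calls(o, functions) if isinstance(o, tuple) else o
                  for o in operands))


functions = {
    "double": (["x"], ("i32.shl", ("local.get", "x"), ("i32.const", 1))),
}
use_body = ("i32.add", ("call", "double", ("local.get", "x")), ("i32.const", 3))
```

Applying inline_calls(use_body, functions) yields the same shape as the post-pass $use shown above: the call node is gone and the shift sits directly under the add.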
6. Census the module with --metrics
--metrics prints a per-category instruction count instead of IR — the fastest way to confirm a pass
actually changed something:
wasm-opt kernel.wasm -O3 --metrics -o /dev/null
Expected output
A --metrics run prints a census like this; compare the -O0 and -O3 numbers to attribute a
speedup to fewer calls or loops:
total
 [exports]      : 3
 [funcs]        : 12
 [globals]      : 1
 [total]        : 489
 binary         : 142
 call           : 9
 const          : 78
 global.get     : 6
 load           : 41
 local.get      : 96
 loop           : 4
 store          : 12
After -O3 on a call-heavy module you typically see call and funcs drop (inlining) and total
fall (vacuum + simplification). If call is unchanged, the inliner judged the callees too large —
that is a finding, not a failure.
The categories that move tell you which lever the optimizer pulled. A drop in loop means a loop was
unrolled or eliminated; a drop in call means inlining fired; a drop in const plus a shrinking
total usually means --precompute folded constant expressions; a drop in load/store can indicate
local subexpression elimination removed redundant memory traffic. Pairing a throughput measurement with
a before/after --metrics diff turns “it got 12% faster” into “it got 12% faster because four calls were
inlined and two redundant loads were removed” — which is the difference between a number you can defend
and one you merely observed. Save both censuses to files and diff them when a result surprises you.
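Diffing two saved censuses is easy to script. The sketch below assumes the “name : count” line layout shown in the expected output above; real wasm-opt output may carry extra columns or differ between versions, so treat the regular expression as an assumption to adjust. The function names are hypothetical.

```python
import re

# Assumption: each census line looks like " call : 9" or " [funcs] : 12".
METRIC_LINE = re.compile(r"^\s*(\[?[\w.]+\]?)\s*:\s*(\d+)")


def parse_metrics(text):
    """Turn saved --metrics output into a {category: count} dict."""
    counts = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line)
        if match:  # header lines such as "total" have no colon and are skipped
            counts[match.group(1)] = int(match.group(2))
    return counts


def diff_metrics(before, after):
    """Return only the categories whose count changed, as after - before."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        delta = after.get(key, 0) - before.get(key, 0)
        if delta:
            changes[key] = delta
    return changes
```

Feeding it the text of an unoptimized census and an -O3 census gives a short dict of deltas, where a negative value for call or load is the kind of evidence the paragraph above describes.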
Gotchas
Binaryen IR is not exactly wat. It is a normalized, fully-folded form: locals are always named,
some sugar is expanded, and block/loop labeling differs. Do not paste --print output into a wat
assembler and expect a byte-identical round-trip; use wasm2wat if you need canonical text
format, and wat2wasm to re-assemble.
Pass ordering changes the result. Passes are not commutative. --precompute only folds the
run(4) example after --inlining-optimizing has exposed the body; run them in the other order and
the call is still opaque. The -O3 meta-pass picks a tuned ordering for you, which is why hand-picking
individual passes can do worse than -O3 if you order them naively.
--print shows the IR after the passes listed before it; wasm-dis runs no passes at all. If you want
“before”, use wasm-dis or --print with no optimization flags. Mixing them up (printing after -O3 and
calling it “before”) is an easy way to misread what a pass did.
A missing name section turns functions into indices. Once you strip the custom name section
(common in shipping builds), --print and --metrics label functions as $0, $1, and so on instead
of $square_plus. The IR is identical, but it is far harder to read. When you are diagnosing, optimize a
copy that keeps the names (pass -g so wasm-opt emits the name section, and leave out --strip-debug, which
removes it) so the IR stays legible, and strip only the artifact you actually ship.
Performance note
Reading IR is a diagnostic, not an optimization: it costs you nothing at runtime and saves hours of
guessing. The single highest-value check is running --metrics before and after your chosen pass
level. A throughput win that does not show up as fewer call/loop/binary nodes deserves a second
measurement before you trust it: it may be noise, or it may come from a change the census does not
count, and the census tells you in one command which results need that second look.
Frequently Asked Questions
Why does --print show different code than wasm-objdump -d?
--print shows Binaryen’s structured IR (folded S-expressions); wasm-objdump -d shows the linear
stack-machine bytecode disassembly. They describe the same module at different abstraction levels —
IR is what the optimizer manipulates, the disassembly is what the engine decodes.
Can I run a single pass to see its effect in isolation?
Yes — pass just that flag, e.g. wasm-opt in.wasm --vacuum --print -o /dev/null. Isolating one pass
is the clearest way to learn what it does, but remember the real pipelines (-O2/-O3) run dozens in
a tuned order, so a pass in isolation may do less than it does in context.
Does --print modify my output binary?
No. --print only writes IR to stdout and does not change what is written to -o. With -o /dev/null the
binary is simply discarded. To both optimize and inspect, use -o real.wasm --print.
Related
- Building a reproducible Wasm benchmark harness — the harness whose results this IR explains.
- Measuring Wasm vs JavaScript throughput — attribute a Wasm win to specific passes.
- Reducing Wasm bundle size with wasm-opt — the same passes, viewed through binary size.