Encoding Strings Across the Wasm Boundary

This guide answers one task precisely: how to send a string from JavaScript into a WebAssembly module and get a string back, byte for byte, without truncation, leaks, or traps.

A Wasm function signature carries only numbers, so a string never crosses the boundary as a string. It crosses as a pointer and a length — two i32 values that locate UTF-8 bytes in the module’s linear memory. JavaScript strings are UTF-16 internally and Rust strings are UTF-8, so every crossing is also a transcode. Get the encoding, the byte count, and the freeing right and strings just work; get any one wrong and you read garbage or leak memory.

The reason strings are the trickiest of the common payloads is that two independent things can go wrong at once: the encoding (UTF-16 to UTF-8 and back) and the bookkeeping (allocating, copying, and freeing the bytes in a heap the module owns). The good news is that both follow a fixed recipe. Once you have walked the recipe a couple of times — and seen that wasm-bindgen runs the exact same steps under the hood — you can write or debug any string boundary by hand, including the cases where the generated glue does not give you the control you need.

Prerequisites

  • [ ] A module exporting alloc(size) -> ptr, dealloc(ptr, size), and memory (the parent guide shows a minimal one)
  • [ ] Browser or Node 20+ with global TextEncoder and TextDecoder
  • [ ] wabt for wat2wasm and wasm-objdump if you build the module by hand
  • [ ] Familiarity with the (ptr, len) convention and typed-array views over memory.buffer

JS → Wasm: send a string in

The inbound path is encode, allocate, copy, call, free — five steps that map one-to-one onto the manual ABI.

  1. Encode the JS string to UTF-8 bytes. TextEncoder always emits UTF-8; the result is a Uint8Array whose .length is the byte count you will pass as len.

    const enc = new TextEncoder();
    const bytes = enc.encode("café");     // 5 bytes: 'é' is two bytes in UTF-8
  2. Allocate that many bytes in linear memory. Call the module’s exported allocator; it returns a pointer into the heap it owns.

    const ptr = instance.exports.alloc(bytes.length);
  3. Copy the bytes to that offset. Build a Uint8Array view over the current memory.buffer and set() the payload at ptr. Always rebuild the view here: the preceding alloc may have grown memory, which detaches any ArrayBuffer you obtained from memory.buffer earlier.

    new Uint8Array(instance.exports.memory.buffer, ptr, bytes.length).set(bytes);
  4. Call the function with (ptr, len). The module reads exactly len bytes starting at ptr.

    const outLen = instance.exports.uppercase_ascii(ptr, bytes.length);
  5. Free the input. Whoever allocated frees. Call dealloc(ptr, len) once the module has finished reading — wrap it in finally so an exception cannot leak the buffer.

    instance.exports.dealloc(ptr, bytes.length);
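
The five inbound steps can be folded into one helper so that the free in step 5 cannot be skipped. What follows is a minimal sketch rather than a definitive implementation: withWasmString and makeStandInExports are hypothetical names, and the stand-in (a JavaScript bump allocator over a real WebAssembly.Memory) exists only so the example runs without a compiled module. With a real module you would pass instance.exports instead.

```javascript
// Hypothetical helper: copy `str` into linear memory as UTF-8, run
// fn(ptr, len), and free the buffer even if fn throws.
function withWasmString(exports, str, fn) {
  const bytes = new TextEncoder().encode(str);   // 1. encode
  const ptr = exports.alloc(bytes.length);       // 2. allocate
  try {
    // 3. copy; the view is built after alloc, so it sees the current buffer
    new Uint8Array(exports.memory.buffer, ptr, bytes.length).set(bytes);
    return fn(ptr, bytes.length);                // 4. call
  } finally {
    exports.dealloc(ptr, bytes.length);          // 5. free
  }
}

// Stand-in for instance.exports (an assumption, for illustration only):
// a bump allocator that tracks live allocations so leaks are visible.
function makeStandInExports() {
  const memory = new WebAssembly.Memory({ initial: 1 });
  const live = new Map();
  let next = 8;
  return {
    memory,
    alloc(size) { const ptr = next; next += size; live.set(ptr, size); return ptr; },
    dealloc(ptr, size) { live.delete(ptr); },
    liveCount: () => live.size,
  };
}

const standIn = makeStandInExports();
// The callback plays the role of the exported function: it reads len bytes at ptr.
const seen = withWasmString(standIn, "café", (ptr, len) =>
  Array.from(new Uint8Array(standIn.memory.buffer, ptr, len)));
console.log(seen, standIn.liveCount());
```

Putting the copy inside the try block means a failure while writing the payload still releases the allocation, which is the point of the finally advice in step 5.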

Wasm → JS: read a string out

When the module returns a string it hands back a pointer and a length — usually via the multi-value return (result i32 i32), or by writing both into an out-pointer. The host then decodes and frees.

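  1. Receive (ptr, len). With multi-value, the call returns a two-element array. Here ptr and bytes are the encoded input from the inbound steps, which must still be allocated while the call runs.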

    const [outPtr, outLen] = instance.exports.build_greeting(ptr, bytes.length);
  2. Decode the bytes to a JS string. Slice a view at (outPtr, outLen) and run it through TextDecoder. Use .slice() (a copy) before the next alloc, or decode immediately, because the next allocation could detach the buffer.

    const out = new Uint8Array(instance.exports.memory.buffer, outPtr, outLen);
    const text = new TextDecoder().decode(out);   // decode copies into a JS string
  3. Free the module’s output buffer. The string the module allocated for its result is now yours to release.

    instance.exports.dealloc(outPtr, outLen);
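
The decode-then-free sequence above can be wrapped the same way. This is a sketch under stated assumptions: takeWasmString is a hypothetical name, and the stand-in object simulates a module that has already written its result at offset 64, so the block runs without a .wasm file.

```javascript
// Hypothetical helper: decode (ptr, len) as UTF-8, then release the buffer.
function takeWasmString(exports, ptr, len) {
  try {
    // Build the view from the current memory.buffer and decode right away;
    // decode() copies into a JS string, so freeing afterwards is safe.
    const view = new Uint8Array(exports.memory.buffer, ptr, len);
    return new TextDecoder().decode(view);
  } finally {
    exports.dealloc(ptr, len);
  }
}

// Stand-in exports (an assumption, for illustration only).
const memory = new WebAssembly.Memory({ initial: 1 });
const freed = [];
const standIn = {
  memory,
  dealloc(ptr, len) { freed.push([ptr, len]); },
};

// Simulate the module having written its result into linear memory.
const result = new TextEncoder().encode("Hello, café!");
const outPtr = 64;
new Uint8Array(memory.buffer, outPtr, result.length).set(result);

const text = takeWasmString(standIn, outPtr, result.length);
console.log(text, freed);
```

Because the helper takes ownership of the buffer, the caller should treat outPtr as invalid once it returns.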

What wasm-bindgen does

With wasm-bindgen you write the Rust and the glue is generated — but the glue performs exactly the steps above. This function takes and returns an owned string:

use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn build_greeting(name: &str) -> String {
    format!("Hello, {name}!")
}

The emitted JavaScript calls an internal passStringToWasm helper that runs TextEncoder.encodeInto straight into linear memory (avoiding a temporary array), passes (ptr, len), then reads the returned (ptr, len) back with TextDecoder.decode and calls the generated __wbindgen_free. The wasm-bindgen deep dive annotates that shim line by line. The point of seeing the raw version first is that nothing magic happens: you can reproduce wasm-bindgen’s string handling by hand when a hot path needs it.

Two details are worth lifting out of the generated code. First, wasm-bindgen uses encodeInto, not encode followed by a set() — that writes the UTF-8 directly into the module heap in a single pass, saving an intermediate Uint8Array allocation and one copy. When you hand-roll a hot string path it is worth doing the same: enc.encodeInto(str, new Uint8Array(memory.buffer, ptr, capacity)) returns a { read, written } result so you know exactly how many bytes landed. Second, for a &str argument the generated glue frees the input buffer for you immediately after the call, and for a returned String it frees the module’s result buffer right after decoding — the same finally/decode-then-free discipline shown above, just generated. Knowing this is what lets you reason about why a wasm-bindgen call allocates and frees twice per string, and where a manual ABI could avoid one of those round trips.
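
A hand-rolled version of the encodeInto path might look like the sketch below. It is not wasm-bindgen's actual code: encodeIntoWasm is a hypothetical name, the stand-in exports are an assumption so the block runs without a module, and sizing the buffer at three bytes per UTF-16 code unit is one simple way to guarantee the encoded string fits.

```javascript
// Hypothetical helper: encode `str` directly into linear memory.
// Returns { ptr, len, capacity }; the caller later frees ptr with `capacity`.
function encodeIntoWasm(exports, str) {
  // Worst case for UTF-8 is 3 bytes per UTF-16 code unit.
  const capacity = str.length * 3;
  const ptr = exports.alloc(capacity);
  const target = new Uint8Array(exports.memory.buffer, ptr, capacity);
  const { read, written } = new TextEncoder().encodeInto(str, target);
  if (read !== str.length) {
    // Should not happen with a worst-case capacity; checked defensively.
    exports.dealloc(ptr, capacity);
    throw new Error("destination too small for encoded string");
  }
  return { ptr, len: written, capacity };
}

// Stand-in exports (an assumption, for illustration only).
const memory = new WebAssembly.Memory({ initial: 1 });
let next = 8;
const standIn = {
  memory,
  alloc(size) { const ptr = next; next += size; return ptr; },
  dealloc(ptr, size) {},
};

const { ptr, len, capacity } = encodeIntoWasm(standIn, "café");
const landed = Array.from(new Uint8Array(memory.buffer, ptr, len));
console.log(len, capacity, landed);
```

The trade-off is over-allocation: the buffer is sized for the worst case, and with a size-aware dealloc it must be released with capacity rather than len.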

Expected output

Running the round trip and logging the decoded result and a byte-length check:

> build_greeting("café")
input bytes:  [0x63,0x61,0x66,0xc3,0xa9]   // 5 bytes, JS .length was 4
output text:  "Hello, café!"
output bytes: 13                            // not 12 — 'é' is still 2 bytes
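
The byte counts in that log can be checked without a module at all, since they depend only on the encoding. A small sketch (the variable names are illustrative):

```javascript
const enc = new TextEncoder();
const input = "café";
const output = "Hello, café!";

const inputBytes = enc.encode(input);
const outputBytes = enc.encode(output);

// Format the input bytes the same way as the log above.
const hex = Array.from(inputBytes, (b) => "0x" + b.toString(16).padStart(2, "0"));

console.log(hex.join(","));                       // 0x63,0x61,0x66,0xc3,0xa9
console.log(input.length, inputBytes.length);     // 4 5
console.log(output.length, outputBytes.length);   // 12 13
```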

Confirm any static strings compiled into the module land where you expect with a data-section dump:

wasm-objdump -s -j data greeting.wasm
# Data[0]: ... 48 65 6c 6c 6f 2c 20   "Hello, "

Gotchas

  • UTF-8 vs UTF-16 length mismatch. "café".length is 4 (UTF-16 code units) but its UTF-8 encoding is 5 bytes. Passing the JS .length as len truncates the last byte and corrupts the trailing character — or, for emoji and CJK, drops whole code points. Fix: always pass new TextEncoder().encode(s).length, never s.length.

  • Assuming null termination. The (ptr, len) convention carries an explicit length alongside the pointer; the bytes are neither NUL-terminated nor length-prefixed in memory. A module that calls strlen on your buffer will read past len until it finds a zero byte, running into adjacent allocations or off the end of memory (RuntimeError: memory access out of bounds). Fix: have the module use the explicit length, or, if it truly needs a C string, allocate len + 1 bytes and write a trailing 0.

  • Forgetting to free → leak. Each alloc for an input string, and each result buffer the module returns, must be freed. Skip it and the heap fills until the allocator has to grow memory, so memory.buffer.byteLength ratchets upward in 64 KiB pages and never comes back down. Fix: free the input in finally; free the output right after decoding.

  • Stale view after a grow. A Uint8Array built before an alloc that grows memory is left pointing at a detached, zero-length buffer: indexed writes silently vanish, and set() throws a TypeError. Fix: construct the view from memory.buffer after every alloc, as the steps above do.
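
Three of these gotchas can be reproduced in a few lines without a compiled module. The sketch below is illustrative only: byteLen and toCStringBytes are hypothetical helpers, and the explicit memory.grow(1) stands in for what an allocator may do inside alloc.

```javascript
const enc = new TextEncoder();

// Length mismatch: UTF-16 code units versus UTF-8 bytes.
const byteLen = (s) => enc.encode(s).length;
const cafe = ["café".length, byteLen("café")];             // [4, 5]
const emoji = ["\u{1F600}".length, byteLen("\u{1F600}")];  // [2, 4]

// C string: one extra byte, left as 0, after the UTF-8 payload.
function toCStringBytes(s) {
  const bytes = enc.encode(s);
  const out = new Uint8Array(bytes.length + 1); // zero-filled, so the last byte is NUL
  out.set(bytes);
  return out;
}

// Stale view: a view created before a grow is detached afterwards.
const memory = new WebAssembly.Memory({ initial: 1 });
const stale = new Uint8Array(memory.buffer, 0, 16);
memory.grow(1);                                     // detaches the old buffer
const fresh = new Uint8Array(memory.buffer, 0, 16); // rebuilt from the new buffer
stale[0] = 42;                                      // silently ignored
fresh[0] = 42;

console.log(cafe, emoji, Array.from(toCStringBytes("hi")));
console.log(stale.length, stale[0], fresh.length, fresh[0]);
```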

Performance note

The transcode cost scales with byte count, not call count: TextEncoder.encode and TextDecoder.decode are roughly linear at ~1–3 GB/s in modern engines, so a 1 KB string costs about a microsecond or less while a 1 MB string costs a few hundred microseconds. The fixed per-call overhead (the alloc/free round trip and the Wasm call itself) is a few hundred nanoseconds. The takeaway: many tiny string calls are dominated by fixed overhead, so batch them, while a few large ones are dominated by the linear transcode, which only zero-copy layouts (rarely possible for strings, since the encodings differ) can avoid.

A concrete example makes the batching point sharp. Suppose you need to uppercase 10,000 short labels. Calling the module once per label pays the ~300 ns fixed overhead 10,000 times — about 3 ms of pure overhead before any work happens. Concatenating the labels with a separator, sending them as one buffer, and splitting the result on the JavaScript side pays that overhead once and lets the linear transcode dominate, where it belongs. The same logic applies in reverse: if a module produces many small strings, have it write them into one contiguous buffer with a length table rather than returning each through its own call. When you cannot batch — interactive, one-string-at-a-time work — prefer encodeInto over encode to shave the intermediate allocation, and reuse a single scratch buffer across calls so the allocator is not churning. None of these tricks change the asymptotics, but for string-heavy boundaries they are routinely the difference between marshaling being invisible and marshaling being the profile.
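
The batching idea can be sketched as a thin wrapper. Everything here is hypothetical: batchCall and SEP are illustrative names, callOnce stands for a single boundary crossing that takes one string and returns one string, and fakeUppercase replaces the real module so the block runs on its own. The sketch also assumes the module's transform leaves the separator character unchanged.

```javascript
// ASCII unit separator; assumed never to appear inside a label.
const SEP = "\u001f";

// Hypothetical wrapper: one boundary crossing for the whole batch.
function batchCall(labels, callOnce) {
  if (labels.length === 0) return [];
  if (labels.some((label) => label.includes(SEP))) {
    throw new Error("label contains the separator character");
  }
  return callOnce(labels.join(SEP)).split(SEP);
}

// Stand-in for the module call (an assumption); counts boundary crossings.
let crossings = 0;
const fakeUppercase = (s) => { crossings++; return s.toUpperCase(); };

const labels = Array.from({ length: 10000 }, (_, i) => `label-${i}`);
const out = batchCall(labels, fakeUppercase);
console.log(out.length, out[0], crossings);
```

A separator is the simplest framing; the length-table layout mentioned above avoids reserving a character, at the cost of a little more bookkeeping on both sides.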

Frequently Asked Questions

Can I avoid the copy entirely for strings? Almost never, because JavaScript stores UTF-16 and Wasm wants UTF-8 — the transcode itself is a copy. TextEncoder.encodeInto lets you encode directly into linear memory (one pass instead of two), which is the closest you get. Truly zero-copy hand-offs are for byte buffers that need no transcode.

Why does the decoded string sometimes have a replacement character (�)? You decoded a byte range that does not start and end on UTF-8 code-point boundaries — usually a wrong len, or slicing a multibyte character in half. Confirm len is the exact encoded byte count the module wrote, and that you are not truncating the view.
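
One way to surface this during development is a TextDecoder constructed with fatal: true, which throws on malformed input instead of substituting U+FFFD. A minimal sketch that cuts "café" one byte short to trigger both behaviours:

```javascript
const bytes = new TextEncoder().encode("café");   // 5 bytes
const truncated = bytes.subarray(0, 4);           // cuts the two-byte 'é' in half

// Default decoder: substitutes U+FFFD for the incomplete sequence.
const lenient = new TextDecoder().decode(truncated);

// Strict decoder: throws a TypeError on the same input.
const strict = new TextDecoder("utf-8", { fatal: true });
let error = null;
try {
  strict.decode(truncated);
} catch (e) {
  error = e;
}

console.log(JSON.stringify(lenient), error instanceof TypeError);
```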

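Do I free the input buffer before or after reading the output? After the call returns, never during it: the module may still reference the input while producing the output. Once the call has returned, free the input, then decode and free the output. The one exception is a result that aliases the input buffer (an in-place transform), where you must decode before freeing. Never reuse a pointer after dealloc.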
