Decode inbound lone-surrogate frames instead of dropping them #77

Merged: linkdata merged 2 commits into main from fix/wire-inbound-lone-surrogate on Jun 18, 2026
@linkdata

Owner

Defect (per-package code review)

Parse decoded the quoted Data of inbound Input/Click/ContextMenu/Remove frames with strconv.Unquote and reported failure when it errored, before the strings.ToValidUTF8 sanitizer ever ran. ReadLoop then silently discarded the frame.

The browser encodes that data with JSON.stringify, whose string grammar is not a subset of strconv.Unquote's. A lone UTF-16 surrogate is emitted as the literal escape \udXXX (ES2019 well-formed JSON.stringify does not throw on lone surrogates), and strconv.Unquote rejects \udXXX: strconv/quote.go calls utf8.ValidRune, which is false for U+D800–U+DFFF. An Input value is a UTF-16 DOMString and can hold lone surrogates (emoji split, IME/paste, programmatic set), so a legitimate user event was silently lost rather than sanitized.
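The mismatch can be reproduced in isolation. The following is a minimal sketch using only the standard library; unquoteErr is a hypothetical helper written for this illustration and is not part of the package.

```go
package main

import (
	"fmt"
	"strconv"
)

// unquoteErr (hypothetical helper) returns the error, if any, that
// strconv.Unquote reports for the quoted literal s.
func unquoteErr(s string) error {
	_, err := strconv.Unquote(s)
	return err
}

func main() {
	// The text JSON.stringify produces for a string holding one lone high
	// surrogate: a quoted literal containing the escape \ud800.
	lone := `"\ud800"`
	// An escape that is valid in both grammars.
	plain := `"caf\u00e9"`

	fmt.Println("plain:", unquoteErr(plain))
	fmt.Println("lone: ", unquoteErr(lone))
}
```

Running it should show a nil error for the ordinary escape and a syntax error for the lone surrogate, which is the condition that previously caused the frame to be discarded.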

The package doc already noted the two grammars "overlap rather than nest" but only reasoned about the outbound appendJSONQuote direction; the inbound JSON.stringify → Unquote direction was unguarded.

Fix

Fall back to a JSON string decode (which maps a lone surrogate to U+FFFD) when strconv.Unquote fails, so the event is delivered sanitized rather than dropped. The message is rejected only if both decoders fail; the common quoted path, the unquoted path and the verbatim Set/Call path are unchanged.
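The shape of the fallback can be sketched as below. This is an illustration under assumptions, not the merged code: decodeQuoted is a hypothetical name, and only the name jsonUnquoteString comes from this PR (its body here is a guess based on the description).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// jsonUnquoteString decodes a JSON string literal. encoding/json replaces
// an unpaired surrogate escape with U+FFFD instead of failing.
// (Name taken from the PR text; the body is an assumption.)
func jsonUnquoteString(quoted string) (s string, err error) {
	err = json.Unmarshal([]byte(quoted), &s)
	return s, err
}

// decodeQuoted (hypothetical) tries the cheap strconv.Unquote path first
// and pays for the JSON decode only when that fails. It reports false
// only if both decoders reject the input.
func decodeQuoted(quoted string) (string, bool) {
	if s, err := strconv.Unquote(quoted); err == nil {
		return s, true
	}
	if s, err := jsonUnquoteString(quoted); err == nil {
		return s, true
	}
	return "", false
}

func main() {
	for _, in := range []string{`"hello"`, `"\ud800"`, `"unterminated`} {
		s, ok := decodeQuoted(in)
		fmt.Printf("%-16s ok=%-5v decoded=%+q\n", in, ok, s)
	}
}
```

Ordering the decoders this way keeps the common frame on the strconv.Unquote path, which is the property the benchmarks in the next section were used to confirm.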

How the implementation was chosen (benchmarked)

Three decoders were benchmarked (benchstat -col /impl, count=10): strconv.Unquote only, JSON only, and Unquote-with-JSON-fallback. Pure JSON regressed every inbound frame ~10× with 4–5 extra allocations; the fallback matches strconv.Unquote on the common path and pays the JSON cost only on the rare surrogate case.

The benchmark also exposed a subtle regression: inlining json.Unmarshal([]byte(data), &data) forced Parse's data local to escape to the heap on every call (a uniform +1 alloc on all frames, including the unquoted and Set/Call paths that never decode). Moving the address-taken target into a small jsonUnquoteString helper keeps data on the stack.
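The escape behaviour described above can be explored with a small program. This is a sketch under assumptions: decodeInline and decodeHelper are hypothetical stand-ins for Parse, not the real function, and the allocation counts it prints depend on the Go version and on inlining decisions, so none are asserted here. Building with go build -gcflags=-m shows the compiler's own escape report.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"testing"
)

// decodeInline (hypothetical) takes &data directly. Because the pointer is
// handed to json.Unmarshal, data escapes and is heap-allocated on entry,
// even for input that never reaches the JSON branch.
func decodeInline(data string) (string, bool) {
	if len(data) > 1 && data[0] == '"' {
		if s, err := strconv.Unquote(data); err == nil {
			data = s
		} else if err := json.Unmarshal([]byte(data), &data); err != nil {
			return "", false
		}
	}
	return data, true
}

// jsonUnquoteString owns the address-taken variable, so only s escapes.
// (Name from the PR text; the body is an assumption.)
func jsonUnquoteString(quoted string) (s string, err error) {
	err = json.Unmarshal([]byte(quoted), &s)
	return s, err
}

// decodeHelper (hypothetical) never takes the address of data.
func decodeHelper(data string) (string, bool) {
	if len(data) > 1 && data[0] == '"' {
		if s, err := strconv.Unquote(data); err == nil {
			data = s
		} else if s, err := jsonUnquoteString(data); err == nil {
			data = s
		} else {
			return "", false
		}
	}
	return data, true
}

var sink string // keeps results alive so the calls are not optimized away

func main() {
	frame := "plain unquoted data" // never reaches either decoder
	inline := testing.AllocsPerRun(1000, func() { sink, _ = decodeInline(frame) })
	helper := testing.AllocsPerRun(1000, func() { sink, _ = decodeHelper(frame) })
	fmt.Printf("inline: %v allocs/op, helper: %v allocs/op\n", inline, helper)
}
```

Both variants return the same results; they differ only in where the address-taken variable lives, which is what the helper in this PR is meant to control.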

BenchmarkParse (kept as a regression guard) confirms the common, unquoted and verbatim paths are allocation-identical to main; only the previously-dropped surrogate frame costs more:

| frame | before (main) | after |
|---|---|---|
| input_plain | 1 alloc | 1 alloc |
| input_unquoted | 1 alloc | 1 alloc |
| set_verbatim | 1 alloc | 1 alloc |
| input_surrogate | dropped | decoded (6 allocs) |

Tests

  • Test_wsParse_InboundLoneSurrogate feeds Input\tJid.1\t"\ud800"\n and asserts the event is delivered with Data = U+FFFD, not dropped. It fails on the unpatched code.
  • Fuzz_appendJSONQuote / the existing outbound round-trip remain green (outbound path untouched).

Verification

go vet, gofumpt -l, staticcheck, go build, go test -race ./..., and go test -tags debug -race ./... all pass.

linkdata added 2 commits June 18, 2026 05:05
Parse decoded the quoted Data of inbound Input/Click/ContextMenu/Remove frames
with strconv.Unquote, returning failure when it errored. The browser encodes that
data with JSON.stringify, whose grammar is not a subset of strconv.Unquote's: a
lone UTF-16 surrogate is emitted as the literal escape "\udXXX" (JSON.stringify
does not throw on lone surrogates), which strconv.Unquote rejects. ReadLoop then
silently discarded the whole frame, before the ToValidUTF8 sanitizer ran, losing a
legitimate user event. Input values are UTF-16 DOMStrings and can hold lone
surrogates (emoji split, IME/paste, programmatic set).

Fall back to a JSON string decode (which maps a lone surrogate to U+FFFD) when
strconv.Unquote fails, so the event is delivered sanitized rather than dropped. The
message is rejected only if both decoders fail; the common and verbatim paths are
unchanged.

The fallback was chosen by benchmarking three decoders (strconv.Unquote only, JSON
only, and Unquote-with-JSON-fallback): pure JSON regressed every inbound frame ~10x
with 4-5 extra allocations, while the fallback matches strconv.Unquote on the common
path and pays the JSON cost only on the rare surrogate case. The benchmark also
exposed that inlining json.Unmarshal(&data) forced Parse's data local to escape to
the heap on every call (a uniform +1 alloc on all frames, including the unquoted and
Set/Call paths that never decode); moving the address-taken target into
jsonUnquoteString keeps data on the stack. BenchmarkParse (kept as a regression
guard) confirms the common, unquoted and verbatim paths are allocation-identical to
before; only the previously-dropped surrogate frame costs more.
@linkdata linkdata merged commit 2643d11 into main Jun 18, 2026
6 checks passed
@linkdata linkdata deleted the fix/wire-inbound-lone-surrogate branch June 18, 2026 03:17