Decode inbound lone-surrogate frames instead of dropping them#77
Merged
Conversation
Parse decoded the quoted Data of inbound Input/Click/ContextMenu/Remove frames with strconv.Unquote, returning failure when it errored. The browser encodes that data with JSON.stringify, whose grammar is not a subset of strconv.Unquote's: a lone UTF-16 surrogate is emitted as the literal escape "\udXXX" (JSON.stringify does not throw on lone surrogates), which strconv.Unquote rejects. ReadLoop then silently discarded the whole frame, before the ToValidUTF8 sanitizer ran, losing a legitimate user event. Input values are UTF-16 DOMStrings and can hold lone surrogates (emoji split, IME/paste, programmatic set). Fall back to a JSON string decode (which maps a lone surrogate to U+FFFD) when strconv.Unquote fails, so the event is delivered sanitized rather than dropped. The message is rejected only if both decoders fail; the common and verbatim paths are unchanged. The fallback was chosen by benchmarking three decoders (strconv.Unquote only, JSON only, and Unquote-with-JSON-fallback): pure JSON regressed every inbound frame ~10x with 4-5 extra allocations, while the fallback matches strconv.Unquote on the common path and pays the JSON cost only on the rare surrogate case. The benchmark also exposed that inlining json.Unmarshal(&data) forced Parse's data local to escape to the heap on every call (a uniform +1 alloc on all frames, including the unquoted and Set/Call paths that never decode); moving the address-taken target into jsonUnquoteString keeps data on the stack. BenchmarkParse confirms the common, unquoted and verbatim paths are allocation-identical to before; only the previously-dropped surrogate frame costs more. Kept as a regression guard.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Defect (per-package code review)
Parsedecoded the quotedDataof inboundInput/Click/ContextMenu/Removeframes withstrconv.Unquote, returning failure (andReadLoopthen silently discarding the frame) when it errored — before thestrings.ToValidUTF8sanitizer ran.The browser encodes that data with
JSON.stringify, whose string grammar is not a subset ofstrconv.Unquote's: a lone UTF-16 surrogate is emitted as the literal escape\udXXX(ES2019 well-formedJSON.stringifydoes not throw on lone surrogates), andstrconv.Unquoterejects\udXXX(strconv/quote.gocallsutf8.ValidRune, false for U+D800–U+DFFF). Inputvalueis a UTF-16DOMStringand can hold lone surrogates (emoji split, IME/paste, programmatic set), so a legitimate user event was silently lost rather than sanitized.The package doc already noted the two grammars "overlap rather than nest" but only reasoned about the outbound
appendJSONQuotedirection; the inboundJSON.stringify → Unquotedirection was unguarded.Fix
Fall back to a JSON string decode (which maps a lone surrogate to U+FFFD) when
strconv.Unquotefails, so the event is delivered sanitized rather than dropped. The message is rejected only if both decoders fail; the common quoted path, the unquoted path and the verbatim Set/Call path are unchanged.How the implementation was chosen (benchmarked)
Three decoders were benchmarked (
benchstat -col /impl, count=10):strconv.Unquoteonly, JSON only, and Unquote-with-JSON-fallback. Pure JSON regressed every inbound frame ~10× with 4–5 extra allocations; the fallback matchesstrconv.Unquoteon the common path and pays the JSON cost only on the rare surrogate case.The benchmark also exposed a subtle regression: inlining
json.Unmarshal([]byte(data), &data)forcedParse'sdatalocal to escape to the heap on every call (a uniform +1 alloc on all frames, including the unquoted and Set/Call paths that never decode). Moving the address-taken target into a smalljsonUnquoteStringhelper keepsdataon the stack.BenchmarkParse(kept as a regression guard) confirms the common, unquoted and verbatim paths are allocation-identical tomain; only the previously-dropped surrogate frame costs more:Tests
Test_wsParse_InboundLoneSurrogatefeedsInput\tJid.1\t"\ud800"\nand asserts the event is delivered withData= U+FFFD, not dropped. It fails on the unpatched code.Fuzz_appendJSONQuote/ the existing outbound round-trip remain green (outbound path untouched).Verification
go vet,gofumpt -l,staticcheck,go build,go test -race ./..., andgo test -tags debug -race ./...all pass.