Fix/linebreaks conversion quotes#7511
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #7511 +/- ##
==========================================
+ Coverage 99.61% 99.62% +0.01%
==========================================
Files 283 285 +2
Lines 11877 11971 +94
Branches 2898 2920 +22
==========================================
+ Hits 11831 11926 +95
+ Misses 46 45 -1
Pull request overview
This pull request normalizes quote selector creation and anchoring for both HTML and PDF documents so that stored selectors match what users see in the rendered text. Line breaks (from <br> tags and block elements in HTML, or newlines in PDF) are now converted to spaces, and consecutive whitespace is collapsed to a single space. For example, <p>foo<br>bar</p> was previously stored as "foobar"; it is now stored as "foo bar", which matches the visual rendering.
Changes:
- Introduced `rendered-text.ts` module that builds normalized text from HTML DOM with offset mappings between raw and normalized positions
- Updated `TextQuoteAnchor` to use normalized text when creating and matching selectors
- Applied consistent PDF text normalization in selector creation and anchoring
- Normalized quote display in the UI component to match the stored format
- Updated test fixtures and baselines to reflect normalized selector output
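The whitespace collapsing and offset mapping described in the overview can be illustrated on a plain string. The following is a minimal sketch only: `normalizeWhitespace` and `NormalizedText` are hypothetical names, and this is not the code in `rendered-text.ts`. It assumes every run of whitespace becomes one space and does not trim leading or trailing whitespace.

```typescript
// Hypothetical types and names for illustration; not the PR's actual API.
type NormalizedText = {
  text: string;
  // rawOffsets[i] is the index in the raw string of normalized character i.
  rawOffsets: number[];
};

function normalizeWhitespace(raw: string): NormalizedText {
  let text = '';
  const rawOffsets: number[] = [];
  let prevWasSpace = false;

  for (let i = 0; i < raw.length; i++) {
    const ch = raw.charAt(i);
    if (/\s/.test(ch)) {
      // Emit one space for the first character of a whitespace run and
      // skip the rest of the run.
      if (!prevWasSpace) {
        text += ' ';
        rawOffsets.push(i);
      }
      prevWasSpace = true;
    } else {
      text += ch;
      rawOffsets.push(i);
      prevWasSpace = false;
    }
  }
  return { text, rawOffsets };
}

const normalized = normalizeWhitespace('foo\n  bar');
```

Here `normalized.text` is `"foo bar"`, and `normalized.rawOffsets[4]` is `6`, the raw index of the `b` in `bar`, so a match found in the normalized text can be translated back to a raw position.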
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/annotator/anchoring/rendered-text.ts | New module providing HTML text normalization with offset mapping for converting between raw and normalized positions |
| src/annotator/anchoring/types.ts | Updated TextQuoteAnchor to use normalized text for selector creation and matching |
| src/annotator/anchoring/pdf.ts | Applied consistent PDF text normalization in describe() and anchor() paths |
| src/sidebar/components/Annotation/AnnotationQuote.tsx | Normalized quote display in UI to match stored format |
| src/annotator/anchoring/test/rendered-text-test.js | New tests for the rendered-text normalization module |
| src/annotator/anchoring/test/types-test.js | Updated test expectations to match normalized selector format and relaxed some assertions |
| src/annotator/anchoring/test/pdf-test.js | Updated test expectations and relaxed some assertions to accommodate normalization |
| src/annotator/anchoring/test/html-test.js | Added normalization helpers and updated tests to compare normalized selectors |
| src/annotator/anchoring/test/html-baselines/wikipedia-regression-testing.json | Updated baseline expectations with normalized prefix/suffix values |
| src/annotator/anchoring/test/html-baselines/minimal.json | Updated baseline expectations with normalized prefix/suffix values |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.
// This is the bug the PR fixes: a selection that crosses a `<br>` used
// to produce `exact: "foobar"` because `textContent` doesn't include a
// character for the `<br>`. With the substitution, the stored quote
// reflects what the user actually sees in the rendered page.
I'd suggest removing the PR reference from this comment. Something like:
// A selection crossing a `<br>` should produce `exact: "foo bar"` (with
// a space), not `"foobar"`, so the stored quote reflects what the user
// sees on the rendered page.
Summary
Fix the line-break bug where annotations spanning a `<br>` element were stored and matched as run-together words (e.g. `<p>foo<br>bar</p>` was stored as "foobar" instead of "foo bar").
This PR changes the format at both the storage and matching layers so they stay aligned: new annotations store the quote with `<br>` substituted by a space, and `matchQuote` runs against the same substituted text. As a consequence, the DB export and the in-app export both reflect what the sidebar shows, without any per-consumer logic.
How it works
- `renderedTextFromRange(range)`: clones the range's contents, replaces `<br>` elements with single-space text nodes, and returns the resulting `textContent`. Used in `TextQuoteAnchor.fromRange` to produce `exact` from the user's selection and `prefix`/`suffix` from the surrounding `TextRange` regions.
- `renderedTextOf(root)`: walks the DOM in document order and returns `{ text, brPositionsInText }`: the BR-substituted text of the root plus the offsets of every synthesized space. Used by `TextQuoteAnchor.toPositionAnchor` so `matchQuote` sees the same format as what was stored.
- `renderedOffsetToRaw(brPositions, offset)`: translates a match offset back to a raw `textContent` offset by subtracting the number of `<br>` positions that precede it.
`TextPositionAnchor` continues to operate in raw offsets, so existing position-based code is unaffected.
Backward compatibility
Existing annotations stay in their current stored format and continue to re-anchor through `matchQuote`'s fuzzy tolerance (50% edit distance). For typical cases (one or two `<br>` tags in the selection) this works out of the box. The pathological case (pre-existing annotations whose selection spans many consecutive `<br>` tags in a short window) may fail to re-anchor and become orphans, but this is expected to be rare.
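The `<br>` substitution and offset translation described under "How it works" can be sketched without a browser. This is an illustration under stated assumptions, not the PR's implementation: `SketchNode` is a hypothetical stand-in for DOM nodes, and `renderedTextOfSketch` and `renderedOffsetToRawSketch` are hypothetical names that only mirror the described behavior of `renderedTextOf` and `renderedOffsetToRaw`.

```typescript
// Hypothetical stand-in for DOM nodes so the sketch runs outside a browser.
type SketchNode =
  | { kind: 'text'; data: string }
  | { kind: 'element'; tag: string; children: SketchNode[] };

type RenderedText = { text: string; brPositionsInText: number[] };

// Walk the tree in document order, emitting a single space for each <br>
// and recording where in the output that synthesized space landed.
function renderedTextOfSketch(root: SketchNode): RenderedText {
  let text = '';
  const brPositionsInText: number[] = [];

  const visit = (node: SketchNode): void => {
    if (node.kind === 'text') {
      text += node.data;
    } else if (node.tag === 'br') {
      brPositionsInText.push(text.length);
      text += ' ';
    } else {
      node.children.forEach(visit);
    }
  };

  visit(root);
  return { text, brPositionsInText };
}

// Translate an offset in the substituted text back to a raw offset by
// subtracting the synthesized spaces that come strictly before it.
function renderedOffsetToRawSketch(brPositions: number[], offset: number): number {
  const preceding = brPositions.filter(pos => pos < offset).length;
  return offset - preceding;
}

// Model of <p>foo<br>bar</p>.
const sample: SketchNode = {
  kind: 'element',
  tag: 'p',
  children: [
    { kind: 'text', data: 'foo' },
    { kind: 'element', tag: 'br', children: [] },
    { kind: 'text', data: 'bar' },
  ],
};

const rendered = renderedTextOfSketch(sample);
const rawStart = renderedOffsetToRawSketch(
  rendered.brPositionsInText,
  rendered.text.indexOf('bar'),
);
```

For this input `rendered.text` is `"foo bar"` with one recorded position at offset 3, and `rawStart` is `3`, which is where `bar` begins in the raw text `"foobar"`. Keeping the position list separate from the text is what lets position-based code keep working in raw offsets.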