Skip to content

Fix/linebreaks conversion quotes#7511

Merged
Elimpizza merged 29 commits into
mainfrom
fix/linebreaks-conversion-quotes
May 21, 2026
Merged

Fix/linebreaks conversion quotes#7511
Elimpizza merged 29 commits into
mainfrom
fix/linebreaks-conversion-quotes

Conversation

@Elimpizza
Copy link
Copy Markdown
Contributor

@Elimpizza Elimpizza commented Feb 26, 2026

Summary

Fix the line-break bug where annotations spanning a <br> element were stored and matched as run-together words (e.g. <p>foo<br>bar</p> was stored as "foobar" instead of "foo bar").

This PR changes the format at both the storage and matching layers so they stay aligned: new annotations store the quote with <br> substituted by a space, and matchQuote runs against the same substituted text. As a consequence, the DB export and the in-app export both reflect what the sidebar shows, without any per-consumer logic.

How it works

  • renderedTextFromRange(range): clones the range's contents, replaces <br> elements with single-space text nodes, returns the resulting textContent. Used in TextQuoteAnchor.fromRange to produce exact from the user's selection and prefix/suffix from the surrounding TextRange regions.
  • renderedTextOf(root): walks the DOM in document order and returns { text, brPositionsInText } — the BR-substituted text of the root plus the offsets of every synthesized space. Used by TextQuoteAnchor.toPositionAnchor so matchQuote sees the same format as what was stored.
  • renderedOffsetToRaw(brPositions, offset): translates a match offset back to a raw textContent offset by subtracting the number of <br> positions that precede it. TextPositionAnchor continues to operate in raw offsets, so existing position-based code is unaffected.

Backward compatibility

Existing annotations stay in their current stored format and continue to re-anchor through matchQuote's fuzzy tolerance (50% edit distance). For typical cases (one or two <br> tags in the selection) this works out of the box. The pathological case (pre-existing annotations whose selection spans many consecutive <br> tags in a short window) may fail to re-anchor and become orphans, but this is expected to be rare.

@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.62%. Comparing base (b07c1b8) to head (1efc9a4).
⚠️ Report is 35 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7511      +/-   ##
==========================================
+ Coverage   99.61%   99.62%   +0.01%     
==========================================
  Files         283      285       +2     
  Lines       11877    11971      +94     
  Branches     2898     2920      +22     
==========================================
+ Hits        11831    11926      +95     
+ Misses         46       45       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Elimpizza Elimpizza marked this pull request as ready for review February 26, 2026 16:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request normalizes quote selector creation and anchoring for both HTML and PDF documents to ensure that stored selectors match what users see in rendered text. The key change is that line breaks (from <br> tags and block elements in HTML, or newlines in PDF) are now converted to spaces, and consecutive whitespace is collapsed to single spaces. This prevents issues where text like <p>foo<br>bar</p> was previously stored as "foobar" but is now correctly stored as "foo bar" to match the visual rendering.

Changes:

  • Introduced rendered-text.ts module that builds normalized text from HTML DOM with offset mappings between raw and normalized positions
  • Updated TextQuoteAnchor to use normalized text when creating and matching selectors
  • Applied consistent PDF text normalization in selector creation and anchoring
  • Normalized quote display in the UI component to match the stored format
  • Updated test fixtures and baselines to reflect normalized selector output

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/annotator/anchoring/rendered-text.ts New module providing HTML text normalization with offset mapping for converting between raw and normalized positions
src/annotator/anchoring/types.ts Updated TextQuoteAnchor to use normalized text for selector creation and matching
src/annotator/anchoring/pdf.ts Applied consistent PDF text normalization in describe() and anchor() paths
src/sidebar/components/Annotation/AnnotationQuote.tsx Normalized quote display in UI to match stored format
src/annotator/anchoring/test/rendered-text-test.js New tests for the rendered-text normalization module
src/annotator/anchoring/test/types-test.js Updated test expectations to match normalized selector format and relaxed some assertions
src/annotator/anchoring/test/pdf-test.js Updated test expectations and relaxed some assertions to accommodate normalization
src/annotator/anchoring/test/html-test.js Added normalization helpers and updated tests to compare normalized selectors
src/annotator/anchoring/test/html-baselines/wikipedia-regression-testing.json Updated baseline expectations with normalized prefix/suffix values
src/annotator/anchoring/test/html-baselines/minimal.json Updated baseline expectations with normalized prefix/suffix values

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/annotator/anchoring/types.ts Outdated
Comment thread src/annotator/anchoring/types.ts Outdated
Comment thread src/sidebar/components/Annotation/AnnotationQuote.tsx Outdated
Comment thread src/annotator/anchoring/test/types-test.js Outdated
Comment thread src/annotator/anchoring/test/pdf-test.js Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/annotator/anchoring/pdf.ts
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/annotator/anchoring/types.ts Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Comment thread src/annotator/util/normalize.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/sidebar/components/Annotation/AnnotationQuote.tsx Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts
Comment thread src/annotator/anchoring/types.ts Outdated
Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/annotator/anchoring/test/html-test.js Outdated
Comment thread src/annotator/anchoring/test/html-test.js Outdated
Comment thread src/annotator/anchoring/types.ts
Comment thread src/sidebar/components/Annotation/AnnotationQuote.tsx Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/types.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment on lines +45 to +48
// This is the bug the PR fixes: a selection that crosses a `<br>` used
// to produce `exact: "foobar"` because `textContent` doesn't include a
// character for the `<br>`. With the substitution, the stored quote
// reflects what the user actually sees in the rendered page.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest removing the PR reference from this comment. Something like:

// A selection crossing a `<br>` should produce `exact: "foo bar"` (with
// a space), not `"foobar"`, so the stored quote reflects what the user
// sees on the rendered page.

Copy link
Copy Markdown
Contributor

@karenrasmussen karenrasmussen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing job!!!!

@Elimpizza Elimpizza merged commit f3af36f into main May 21, 2026
4 checks passed
@Elimpizza Elimpizza deleted the fix/linebreaks-conversion-quotes branch May 21, 2026 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants