Skip to content

feat(kamado)!: replace jsdom with linkedom#115

Merged
YusukeHirao merged 30 commits into
devfrom
feat/replace-jsdom-with-linkedom
Jun 12, 2026
Merged

feat(kamado)!: replace jsdom with linkedom#115
YusukeHirao merged 30 commits into
devfrom
feat/replace-jsdom-with-linkedom

Conversation

@YusukeHirao

Copy link
Copy Markdown
Member

Summary

  • jsdom を linkedom 0.18.12 に置き換え — yarn bench --full --pages=500 --runs=3median 約3.9倍速、RSS 約47%削減yarn.lock から38件超の transitive を削減
  • 新 API: ctx.baseURL / ctx.getHref(el)manipulateDOM hook 第3引数 (ManipulateDOMHookContext) に追加し、jsdom 時代の window.location.href ベースの URL 解決を取り戻す
  • 新 export: resolveHref(el, base?) / DomWindow / DomDocument / DomElementkamado/utils/dom から公開
  • セキュリティ強化: javascript: / data: / vbscript: / file: スキームを resolveHref が拒否、basic-auth credential を結果から除去、<img src> の path traversal (.. 経由の rootDir 脱出) をブロック

Benchmarks

yarn bench --full --pages=500 --runs=3

jsdom (dev) linkedom (this PR) 改善
median build 2.84s 0.73s 約3.9倍
throughput 176 pages/s 685 pages/s 約3.9倍
RSS 687 MB 364 MB 約47%

Breaking changes

削除 移行先
DomSerializeOptions.url hook は manipulateDOM の第3引数 ctx.baseURL を読む
ManipulateDOMOptions.host ctx.baseURL を上記経由で取得、または直接 pkg.production.baseURL を設定
manipulateDOM hook 第3引数の型 TransformContext ManipulateDOMHookContext (TransformContext を拡張、baseURLgetHref を追加)
hook 引数の lib.dom 型 (Element[] / Window) linkedom 由来の DomElement[] / DomWindow (getComputedStyle 等の layout API は不在、<a>.href も絶対URLを返さない — ctx.getHref を使用)

Migration の詳細・before/after の書き換え例は packages/kamado/ARCHITECTURE.md (DOM Library セクション) と packages/@kamado-io/page-compiler/README.md を参照してください。

linkedom 挙動差分の補正 (kamado 側で吸収)

linkedom の挙動 kamado の補正
<head> を自動補完しない getDOM が documentElement の children を走査して欠落時に挿入
parseHTML('<!doctype html>')documentElementnull null ガードで空文字を返却 (throw しない)
--!> を comment terminator として認識しない 入力段階で --> に正規化
Document.baseURI / Window.location 未実装 ctx.getHref(el) を新設

Test plan

  • yarn lint — clean
  • yarn build — 全 5 パッケージ成功
  • yarn test467 passed (39 test files)
  • dev ブランチで yarn bench --full --pages=500 --runs=3 を取り、本 PR と比較済み
  • sagawa など実利用 config を回し、url / host を渡している箇所がないか確認 (TypeScript で typechecking 漏れた as any 経由の余地あり)
  • dev server (yarn dev) を起動し、サンプルページのスクリプト/メタタグ注入 (inject-to-head) が引き続き機能することを目視確認

Notes

  • alpha バージョン (2.0.0-alpha.16) なので破壊変更を伴うが、リリース時に 2.0.0-alpha.17 以降で公開予定
  • セッション中に xhigh-effort code review と QA review を回し、検出された 25 件以上の指摘 (security 含む) と対応するテストも本 PR に含まれる

🤖 Generated with Claude Code

BREAKING CHANGE: DomSerializeOptions no longer accepts `url`. The
previous lib.dom-shaped Window/Element types are replaced with linkedom
derived DomWindow/DomElement types, so hook signatures change.
…linkedom

BREAKING CHANGE: ManipulateDOMOptions no longer accepts `host`; it
existed only to pass `url` into the (now removed) DomSerializeOptions.
Hook signatures use DomElement/DomWindow from kamado/utils/dom.
- Strip leading HTML comments before fragment detection so a license
  banner ahead of <!doctype html> no longer routes the document through
  the fragment path.
- Serialize fragments via tmpContainer.innerHTML so text and comment
  nodes survive (linkedom's .children is element-only).
- Null-guard documentElement so a bare "<!doctype html>" input returns
  "" instead of throwing on .outerHTML.
- Auto-create <head> when missing so downstream regex transforms (e.g.
  inject-to-head) still find an anchor under linkedom (which does not
  auto-insert one).
…tor scope

- Detect real "data:" URLs (single colon), not just "data://", so genuine
  data URIs no longer slip past the protocol guard.
- Strip query string and fragment before the extension check so a cache-
  busted src like "hero.png?v=abc" is sized normally.
- Apply selector only to <img>; <picture> > <source> elements always
  participate so an img-shaped selector doesn't silently drop sources.
- Drop the redundant array spread on linkedom's querySelectorAll
  (already a real Array at runtime).
- Re-parse the imageSizes output via linkedom and assert through DOM
  queries instead of fragile substring/regex matches, so the selector
  test cannot pass when target/other lookups silently return empty.
- Craft the URL-guard srcs so each one would slip into trackDependency
  if its protocol/extension guard were removed, then assert the captured
  dependency set is empty. The earlier version relied on fs.stat
  returning null and passed even when the guards were missing.
- New packages/kamado/src/utils/dom.spec.ts directly covers the four
  recent getDOM hardening points so a regression in fragment detection,
  text/comment preservation, null documentElement, or auto-head
  insertion fails on its own.
- Re-parse imageSizes output via linkedom in the <img> and
  <picture> > <source> tests so attribute order or quoting shifts
  cannot let the assertions pass vacuously.
- Drop "should pass through pug file without compiler" — it documented
  an undefined-behavior path (pug source reaching the DOM serializer
  without a compiler) and the recent linkedom switch turned its
  former empty-string output into "p Hello, world!\n", a quiet
  behavioral surprise nobody actually relies on.
- Add a comment on the layout snapshot explaining that <head></head>
  is the auto-inserted one and must not disappear.
…ints

Add a "DOM Library (linkedom)" section to ARCHITECTURE.md / .ja.md
listing every place linkedom diverges from jsdom (auto-head, null
documentElement, baseURI/location, .children excluding text/comments,
setAttribute ordering) and the file in src/utils/dom.ts that
compensates for each. Names dom.spec.ts and manipulate-dom.spec.ts as
the front-line guards so future linkedom bumps know what to re-verify.
…ntext

linkedom does not populate window.location / Document.baseURI, so reading
<a>.href returns the raw attribute instead of an absolute URL. The
previous host/url plumbing into JSDOM was removed in this branch and
left no first-class replacement — hooks had to fish baseURL out of
ctx.context.pkg.production manually.

- ManipulateDOMHookContext now extends TransformContext with:
    baseURL: string | undefined  — serve→devServer URL, build→pkg.production.baseURL/host
    getHref(el): string | null   — resolves the element's href against baseURL
- New kamado/utils/dom export resolveHref(el, base?) underpins getHref
  and is available for direct use outside the manipulateDOM pipeline.
- Tests cover resolveHref unit-level (absolute, relative, protocol-
  relative, malformed, missing attr, URL object base) and integration
  via manipulateDOM (build vs serve base, no-base→null, absolute pass-
  through).
- ARCHITECTURE.md / .ja.md updated to point at the new hook surface.
resolveHref previously returned whatever new URL() produced, so a hook
that interpolated the result into HTML could become an XSS sink for
`javascript:` / `data:` / `vbscript:` / `file:` hrefs, and a base URL
with basic-auth credentials leaked the password into every resolved
href. The helper now refuses opaque/local schemes and clears
URL.username / URL.password before returning, keeping it safe-by-default
for HTML output.
imageSizes joined rootDir + src segments with no escape check, so a
crafted attribute like `<img src="../../../etc/passwd.png">` resolved
outside rootDir; fs.stat would then read arbitrary bytes, image-size
would try to parse them, and trackDependency would record the escaped
path in the incremental-build manifest. Switch to path.resolve and
verify the result stays within rootDir before touching the filesystem.
…erver gaps

- Trim baseURL/host and treat empty strings as missing so `baseURL:
  process.env.PROD_BASE_URL ?? ''` no longer silently disables every
  ctx.getHref() in production.
- Validate the candidate URL via `new URL(...)` and return undefined on
  parse failure so misconfigured values like `'example.com'` (no scheme)
  surface as 'no base' instead of throwing inside resolveHref on every
  href.
- Default the pkg.production.host fallback to `https://` (it lands in
  user-visible HTML now via ctx.getHref — http:// would be a mixed-
  content / SEO regression).
- Optional-chain devServer access so partial test/dev harnesses don't
  TypeError on missing host/port; return undefined instead.
- Trim the src attribute before the protocol guards so a stray leading
  space (common in CMS-authored content) no longer slips past the
  http(s)://, //, and data: checks and ends up pointing trackDependency
  at a nonsense path under rootDir.
- Cache by srcPath (with `?` / `#` stripped) plus stat size, not raw src.
  Otherwise rotating cache-buster query strings forced a fresh
  image-size parse on every build despite the on-disk file being
  unchanged, defeating the cache layer the PR added for exactly that
  case.
…etter side effect

- Comment-strip regex now matches both --> and --!> (the HTML5-tolerated
  variant) so a banner like `<!-- copyright --!><html>...</html>` is no
  longer misclassified as a fragment.
- Auto-head detection walks documentElement.children explicitly rather
  than reading document.head, which in some linkedom versions
  materializes a <head> on first access. The previous if/insert branch
  was effectively dead code carried by that side effect; switching to a
  children walk makes the insertion path actually run when needed and
  survives a linkedom upgrade that drops the auto-creation behavior.
The previous `as unknown as readonly DomElement[]` cast assumed linkedom's
querySelectorAll returns a real Array. It currently does (0.18.12), but a
future shape change to a non-Array iterable would silently break flatMap
flattening and crash on the first <img>. Switch to a for...of loop that
works for both arrays and iterables so future linkedom upgrades degrade
gracefully.
…inator

- Add resolveHref tests for the rejected schemes (javascript:, data:,
  vbscript:, file:) and for basic-auth credential stripping on both the
  relative-via-base and absolute paths.
- Add a fragment-classification test that pins the --!> comment
  terminator as recognized: `<!-- copyright --!><html>...</html>` must
  serialize as a full document, not a stripped-down fragment.
- Restore the pug-without-compiler test (page-compiler.spec.ts) with the
  new expected output ('p Hello, world!\\n') so the linkedom-era
  text-pass-through behavior is pinned. It's undefined behavior in the
  formal sense (users should always register a compiler) but worth a
  test to keep the symptom from silently flipping again.
Adds a getBaseURL spec helper and 9 tests that pin each branch of
resolveBaseURL — empty/whitespace baseURL → undefined, scheme-less
baseURL → undefined, trim before validation, host-only → https://
fallback, empty host → undefined, serve-mode missing devServer →
undefined (no TypeError), and serve-mode partial devServer → undefined.
The hardening that landed in c53bfc0 was previously only assured by
default-context tests; reverting the patch now fails the matching
case-specific test instead of silently passing.
- Path-traversal test: `<img src="../../../etc/passwd.png">` must not
  reach trackDependency or emit width/height. Pre-fix, the join-segments
  approach would have escaped rootDir; the fix uses path.resolve +
  path.relative to enforce containment.
- Site-absolute src test: `/site-root.svg` must rebase under rootDir
  (not the filesystem root) so the traversal guard doesn't over-reject
  legitimate site-root-relative URLs.
- Whitespace trim tests: cover both the leading-space external URL case
  (must still register as external) and a local src wrapped in spaces
  (must still resolve and size).
…urface

Adds 4 integration tests reaching the resolveHref guards through the
manipulateDOM hook (the surface users actually touch):

- javascript: URL → ctx.getHref returns null (XSS guard)
- data: URL → null
- baseURL with basic-auth userinfo → resolved href has credentials
  stripped and contains neither the username nor the password
- empty href attribute → null

The unit tests in dom.spec.ts already pin resolveHref; this layer
guarantees the wiring through manipulateDOM's hook context is not
silently bypassed.
…ow-list

- Auto-inserted <head> must be the first child of <html>, not appended
  somewhere arbitrary. The earlier 'contains <head>' assertion would
  pass even if linkedom dropped the head after <body>.
- An existing <head> placed AFTER <body> must not be duplicated; the
  children-walk detection in getDOM has to be position-agnostic.
- mailto: and tel: URLs are intentionally NOT in REJECTED_SCHEMES; pin
  that decision so a future contributor doesn't blanket-block opaque
  schemes by accident.
…rcPath-based keying

- beforeEach clears the on-disk @d-zero/builder/image-sizes cache so a
  same-name same-byte-length fixture from a previous run can't leak
  width/height into the next test (cache key is srcPath:size, not a
  full file path).
- New "cache-busted src values share the cache entry" test rewrites the
  file with same byte length but different parsed sizes between the
  two transforms and asserts the second access returns the cached
  result. If the cache key regressed to raw src, the test would catch
  it because the second parse would yield the new bytes.
resolveHref's REJECTED_SCHEMES set names `vbscript:` as one of the
dangerous schemes it refuses to emit. Documentation that mentions the
list trips cspell unless the word is whitelisted.
…olveHref

- DomWindow/DomDocument/DomElement: explain the re-export rationale and
  call out that linkedom does not implement layout APIs / URL reflection.
- DomSerializeOptions.hook: spell out when the hook runs and that
  mutations land in the returned string.
- domSerialize: document the fragment-vs-full-document shape preservation
  and the bare-doctype empty-string fallback.
- resolveHref: consolidate the duplicated JSDoc blocks into a single one
  with explicit @param / @returns; mention dangerous-scheme rejection and
  basic-auth credential scrubbing in the contract.
…M API

README updates:
- Replace the stale hook signature (`Element[]` / `Window` /
  `TransformContext`) with `DomElement[]` / `DomWindow` /
  `ManipulateDOMHookContext`, and document `ctx.baseURL` + `ctx.getHref`.
- Remove the deleted `options.host` entry.
- Spell out the selector's <img>-only scope and the query/fragment
  stripping on extension checks under imageSizes.
- Add a "Resolving Anchor URLs with ctx.getHref" worked example.

JSDoc updates (manipulate-dom.ts, image.ts):
- ManipulateDOMHookContext: document each member with @param/@returns
  semantics; correct the host fallback to https:// (matching the impl).
- ManipulateDOMOptions: fill in hook/imageSizes member docs.
- manipulateDOM: rewrite the summary to mention the hook context wiring.
- resolveBaseURL: document return-undefined-on-bad-input contract.
- ImageSizesOptions / imageSizes: scope selector to <img>, mention the
  query/fragment strip and path-traversal guard.
@YusukeHirao YusukeHirao requested a review from yusasa16 as a code owner June 12, 2026 05:58
…ager

Lockfile metadata had drifted to version 9 (written by a stray yarn
4.14.1 invocation during local development) while the project pins
`packageManager: "yarn@4.12.0"`. CI's `yarn install --immutable` refuses
to rewrite the version field, failing the install step. Regenerate with
the pinned yarn so the lockfile matches what CI uses.
@YusukeHirao YusukeHirao merged commit fc31d07 into dev Jun 12, 2026
1 check passed
@YusukeHirao YusukeHirao deleted the feat/replace-jsdom-with-linkedom branch June 12, 2026 06:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant