feat(beholder): expose extractMetaFromDocument for jsdom-backed meta extraction #879

Merged · YusukeHirao merged 1 commit into dev from feat/beholder-extract-meta-from-document · Jun 17, 2026

@YusukeHirao

Member

Summary

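  • Added a public API, extractMetaFromDocument(window, context), that covers the use case of turning an HTML string into a DOM (with jsdom or similar) and extracting meta from it, all without Puppeteer.
  • Extracted the <head> traversal logic used on the Puppeteer path into a pure function, collectHeadFromDocument, in meta/collect-head.ts, so that the jsdom path and the Puppeteer path share a single source (it is passed to page.evaluate(string) via Function.prototype.toString).
  • The shape of the returned Meta is identical to that of Scraper.scrapeStart(), so downstream consumers do not need to branch on the input path.
  • Pinned jsdom as a devDependency and added 12 jsdom-driven specs.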
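
As an illustration of the single-source idea above, here is a minimal, self-contained sketch of serializing one function with Function.prototype.toString and evaluating the resulting string against a `window` argument. The names `collectTitle`, `buildEvaluateSource`, and `fakeEvaluate` are hypothetical stand-ins, not beholder code, and serializing the globals with `JSON.stringify` is an assumption; `fakeEvaluate` only imitates what `page.evaluate(string)` would do in a browser.

```typescript
// Hypothetical stand-in for the shared collector (the real one walks <head>).
// It has no nested functions, so its source is fully self-contained.
function collectTitle(
  win: { document: { title: string } },
  globals: string[],
): { title: string; globals: string[] } {
  return { title: win.document.title, globals };
}

// Build the expression string that would be handed to page.evaluate(string).
// Assumption: the globals list is serialized with JSON.stringify.
function buildEvaluateSource(
  fn: (...args: never[]) => unknown,
  globals: string[],
): string {
  return `(${fn.toString()})(window, ${JSON.stringify(globals)})`;
}

// Stand-in for the browser side: evaluate the string with `window` in scope.
function fakeEvaluate(source: string, fakeWindow: unknown): unknown {
  return new Function('window', `return ${source};`)(fakeWindow);
}

const source = buildEvaluateSource(collectTitle, ['dataLayer']);
const result = fakeEvaluate(source, { document: { title: 'Hello' } });
console.log(result);
```

The same `collectTitle` could also be called directly with a jsdom window, which is the point of keeping the function pure: one body, two call paths.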

Why

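The meta extraction logic in @d-zero/beholder is powerful, but until now it could only be used by passing a Puppeteer Page. This change makes the same extraction available when the HTML string is already in hand (fetch results, archives, fixtures). Supplying a DOM implementation such as jsdom is left to userland, so beholder stays independent of any DOM library.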

What's new

  • extractMetaFromDocument(window, context): Promise<Meta>, exported from index.ts
  • ExtractMetaContext type: url / html? / statusCode? / headers? / includeRaw?
  • meta/collect-head.ts: a realm-independent DOM traversal function. It destructures the HTML class constructors from window so that instanceof resolves in the caller's realm
  • collectHeadOnPage refactor: it now calls the shared implementation via page.evaluate(`(${fn.toString()})(window, ${globals})`), so the same logic is not written in two places
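
To show why the constructors are read off the passed window, here is a small sketch of the cross-realm instanceof problem using Node's `vm` module. `Array` stands in for the HTML class constructors because plain Node has no DOM, and `isArrayOfRealm` is a hypothetical helper, not part of beholder.

```typescript
import * as vm from 'node:vm';

// Minimal structural type: any global-like object exposing an Array constructor.
type GlobalLike = { Array: ArrayConstructor };

// Realm-agnostic check: destructure the constructor off the passed object so
// instanceof resolves against that realm, not against this module's globals.
function isArrayOfRealm(win: GlobalLike, value: unknown): boolean {
  const { Array: RealmArray } = win;
  return value instanceof RealmArray;
}

// A fresh vm context plays the role of a foreign realm (like a jsdom window).
// It hands back its own Array constructor and an array created inside it.
const foreign = vm.runInNewContext('({ Array: Array, sample: [1, 2, 3] })') as {
  Array: ArrayConstructor;
  sample: unknown;
};

// Checked against this module's Array, the foreign array is not recognized.
const againstLocal = foreign.sample instanceof Array;
// Checked against the foreign realm's Array, it is.
const againstForeign = isArrayOfRealm(foreign, foreign.sample);

console.log({ againstLocal, againstForeign });
```

A collector that hard-coded the module's own constructors would silently miss elements created in another realm, which is presumably why the PR routes every instanceof through the window argument.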

Test plan

  • yarn build: 27 packages success
  • yarn test packages/@d-zero/beholder: 131 tests passed (119 existing + 12 new)
  • yarn lint: 0 issues
  • The new spec extract-meta.spec.ts verifies the following via jsdom:
    • title / lang / description
    • og:* / twitter:*
    • viewport / robots / theme-color media branches
    • canonical / alternate hreflang
    • JSON-LD (valid + parseError)
    • microdata / RDFa
    • base / iframe
    • fallback to documentElement.outerHTML when context.html is omitted
    • _raw returned when includeRaw: true
    • kind: 'window-global' branch (triggered artificially via dom.window.dataLayer = [])
    • headers / statusCode pass-through
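
The outerHTML fallback listed above can be sketched as follows. The field names of the context come from this PR; the field types, the `ExtractMetaContextSketch` and `WindowLike` names, and the use of `??` are assumptions for illustration, since the actual implementation is not shown here.

```typescript
// Field names are taken from the PR description; the types are assumptions.
interface ExtractMetaContextSketch {
  url: string;
  html?: string;
  statusCode?: number;
  headers?: Record<string, string>;
  includeRaw?: boolean;
}

// Only the part of Window this sketch needs.
type WindowLike = { document: { documentElement: { outerHTML: string } } };

// Prefer the caller-supplied HTML; otherwise serialize the live document.
function resolveHtml(win: WindowLike, context: ExtractMetaContextSketch): string {
  return context.html ?? win.document.documentElement.outerHTML;
}

// A hand-built object stands in for a jsdom window.
const fakeWindow: WindowLike = {
  document: {
    documentElement: { outerHTML: '<html><head></head><body></body></html>' },
  },
};

const fromContext = resolveHtml(fakeWindow, {
  url: 'https://example.com/',
  html: '<html lang="ja"></html>',
});
const fromDocument = resolveHtml(fakeWindow, { url: 'https://example.com/' });

console.log(fromContext, fromDocument);
```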

🤖 Generated with Claude Code

…extraction

Add a Puppeteer-free public entry point that accepts any `Window` (e.g. jsdom)
and returns the same `Meta` shape produced by `Scraper.scrapeStart()`. The DOM
walk previously inlined inside `collectHeadOnPage` is extracted to
`meta/collect-head.ts` and reused by both the Puppeteer path
(`page.evaluate(string)` over the shared function source) and the new
`extractMetaFromDocument` path — keeping a single source of truth for the raw
head collector.

- New: `extractMetaFromDocument(window, context)` and `ExtractMetaContext`,
  exported from `index.ts`.
- New: `meta/collect-head.ts` houses the realm-agnostic collector. The
  function reads HTML class constructors off the passed `window` so
  `instanceof` resolves against the caller's realm (browser or jsdom).
- Refactor: `collectHeadOnPage` now does `page.evaluate` over the shared
  function source via `Function.prototype.toString`, eliminating the
  duplicated inline body.
- Tests: 12 new `extract-meta.spec.ts` cases driven by jsdom, covering title,
  og/twitter, viewport/robots/theme-color media branches, link canonical,
  JSON-LD (valid + parse error), microdata/RDFa, base/iframe, outerHTML
  fallback, `_raw` debug, `window-global` simulation, and headers/statusCode
  forwarding.
- Docs: README gains a "Puppeteer なし" usage section; JSDoc on the new API
  carries `@example` and documents the `as unknown as Window` cast jsdom
  requires.
- Deps: pin `jsdom@29.1.1` and `@types/jsdom@28.0.3` as devDependencies
  (test-only; no runtime impact).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@YusukeHirao YusukeHirao requested a review from yusasa16 as a code owner June 17, 2026 03:27
@YusukeHirao YusukeHirao merged commit 58ab2fd into dev Jun 17, 2026
2 checks passed
@YusukeHirao YusukeHirao deleted the feat/beholder-extract-meta-from-document branch June 17, 2026 04:25