feat(beholder): expose extractMetaFromDocument for jsdom-backed meta extraction #879
Merged
…extraction

Add a Puppeteer-free public entry point that accepts any `Window` (e.g. jsdom) and returns the same `Meta` shape produced by `Scraper.scrapeStart()`. The DOM walk previously inlined inside `collectHeadOnPage` is extracted to `meta/collect-head.ts` and reused by both the Puppeteer path (`page.evaluate(string)` over the shared function source) and the new `extractMetaFromDocument` path, keeping a single source of truth for the raw head collector.

- New: `extractMetaFromDocument(window, context)` and `ExtractMetaContext`, exported from `index.ts`.
- New: `meta/collect-head.ts` houses the realm-agnostic collector. The function reads HTML class constructors off the passed `window` so `instanceof` resolves against the caller's realm (browser or jsdom).
- Refactor: `collectHeadOnPage` now does `page.evaluate` over the shared function source via `Function.prototype.toString`, eliminating the duplicated inline body.
- Tests: 12 new `extract-meta.spec.ts` cases driven by jsdom, covering title, og/twitter, viewport/robots/theme-color media branches, link canonical, JSON-LD (valid + parse error), microdata/RDFa, base/iframe, outerHTML fallback, `_raw` debug, `window-global` simulation, and headers/statusCode forwarding.
- Docs: README gains a "Puppeteer なし" ("without Puppeteer") usage section; JSDoc on the new API carries `@example` and documents the `as unknown as Window` cast jsdom requires.
- Deps: pin `jsdom@29.1.1` and `@types/jsdom@28.0.3` as devDependencies (test-only; no runtime impact).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
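The `Function.prototype.toString` technique named in the commit message can be illustrated in isolation. The following is a minimal sketch, not the code from this PR: `FakeWindow`, `collectTitle`, and `buildEvaluateSource` are hypothetical names, and `new Function` is used here only as a stand-in for Puppeteer's `page.evaluate(string)`, which would evaluate the same source text inside the browser page.

```typescript
// Hypothetical minimal "window" shape used only for this sketch.
interface FakeWindow {
  document: { title: string };
}

// A collector that depends only on its arguments, so its source text
// can be serialized and re-evaluated in another realm.
function collectTitle(win: FakeWindow, globals: { prefix: string }): string {
  return globals.prefix + win.document.title;
}

// Build the string that would be handed to page.evaluate(string).
// The function source is wrapped in parentheses and invoked immediately.
function buildEvaluateSource(
  fn: (...args: never[]) => unknown,
  globals: unknown,
): string {
  return `(${fn.toString()})(window, ${JSON.stringify(globals)})`;
}

const source = buildEvaluateSource(collectTitle, { prefix: 'title: ' });

// Stand-in for the browser: evaluate the source with `window` bound to a stub.
const run = new Function('window', `return ${source};`) as (w: FakeWindow) => string;
const viaString = run({ document: { title: 'Example' } });

// Direct call, as a jsdom-style path would make with a real Window object.
const direct = collectTitle({ document: { title: 'Example' } }, { prefix: 'title: ' });

console.log(viaString === direct);
```

The sketch only works because the serialized function closes over nothing outside its parameters; any helper or import it needed would have to be passed in as an argument or inlined, since the receiving realm sees only the source string.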
Summary
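Added `extractMetaFromDocument(window, context)`. The `<head>` traversal logic is extracted into the pure function `collectHeadFromDocument` in `meta/collect-head.ts`, so the jsdom path and the Puppeteer path share a single source (the function is fed to `page.evaluate(string)` via `Function.prototype.toString`). The `Meta` shape exactly matches that of `Scraper.scrapeStart()`, so downstream consumers do not need to branch on the input path.

Why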
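The meta extraction logic in `@d-zero/beholder` is powerful, but until now it could only be used by passing a Puppeteer `Page`. This change makes the same extraction available when the HTML string is already in hand (fetch results, archives, fixtures). Supplying a DOM implementation such as jsdom is left to userland, and beholder stays independent of any DOM library.

What's new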
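- `extractMetaFromDocument(window, context): Promise<Meta>`, exported from `index.ts`
- `ExtractMetaContext` type: `url` / `html?` / `statusCode?` / `headers?` / `includeRaw?`
- `meta/collect-head.ts`: a realm-agnostic DOM traversal function. It destructures the HTML class constructors from `window` so that `instanceof` resolves in the caller's realm
- `collectHeadOnPage` refactor: now calls the shared implementation via `` page.evaluate(`(${fn.toString()})(window, ${globals})`) `` (the same logic is no longer written in two places)

Test plan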
- `yarn build`: 27 packages succeeded
- `yarn test packages/@d-zero/beholder`: 131 tests passed (119 existing + 12 new)
- `yarn lint`: 0 issues
- `extract-meta.spec.ts` verifies the following via jsdom:
  - fallback to `documentElement.outerHTML` when `context.html` is omitted
  - `_raw` is returned with `includeRaw: true`
  - the `kind: 'window-global'` branch (triggered artificially with `dom.window.dataLayer = []`)
  - `headers` / `statusCode` forwarding

🤖 Generated with Claude Code
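The realm-agnostic collector listed under "What's new" relies on reading constructors from the `window` it is given. Here is a rough sketch of why that matters, assuming a toy model in which each "realm" owns its own class objects. `FakeElement`, `RealmWindow`, `createFakeRealm`, and `collectMetaNames` are hypothetical names for illustration and are not part of `@d-zero/beholder`, jsdom, or Puppeteer.

```typescript
// Hypothetical element and window shapes used only for this sketch.
interface FakeElement {
  name: string;
}

interface RealmWindow {
  HTMLMetaElement: new (name: string) => FakeElement;
  HTMLLinkElement: new (name: string) => FakeElement;
  document: { head: { children: FakeElement[] } };
}

// Each call creates fresh class objects, mimicking how a browser page and
// a jsdom window each own a distinct HTMLMetaElement constructor.
function createFakeRealm(): RealmWindow {
  class HTMLMetaElement implements FakeElement {
    name: string;
    constructor(name: string) {
      this.name = name;
    }
  }
  class HTMLLinkElement implements FakeElement {
    name: string;
    constructor(name: string) {
      this.name = name;
    }
  }
  return {
    HTMLMetaElement,
    HTMLLinkElement,
    document: {
      head: {
        children: [new HTMLMetaElement('description'), new HTMLLinkElement('canonical')],
      },
    },
  };
}

// Realm-agnostic collector: constructors are read from the window argument,
// so `instanceof` is checked against the classes of the realm that owns the nodes.
function collectMetaNames(win: RealmWindow): string[] {
  const { HTMLMetaElement, document } = win;
  return document.head.children
    .filter((el) => el instanceof HTMLMetaElement)
    .map((el) => el.name);
}

const realmA = createFakeRealm();
const realmB = createFakeRealm();

const namesFromA = collectMetaNames(realmA);

// Checking realm A's nodes against realm B's constructor matches nothing,
// which is the failure mode a collector bound to one realm's globals would hit.
const crossRealmMatches = realmA.document.head.children.filter(
  (el) => el instanceof realmB.HTMLMetaElement,
).length;

console.log(namesFromA, crossRealmMatches);
```

Destructuring from the parameter rather than referencing a global also keeps the function free of outer references, which is what allows the same source to be serialized for the Puppeteer path.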