feat(site-migrator): generate YAML frontmatter from nitpicker DB and prepend to extracted pages#904
Merged
Merged
Conversation
Used in the site-migrator JSDoc and README for the next commit to describe nitpicker DB columns that store pre-absolutised URLs (canonical, og:image, twitter:image, etc.).
…prepend to extracted pages Switch frontmatter generation away from parse5-based HTML scraping (the now- deleted `extractFrontmatter`) onto direct reads from the `.nitpicker` DB via `@nitpicker/query`'s `getPageDetail`, which exposes the page meta as a flat column schema. The new `getFrontmatter(session, url)` maps each non-empty DB column onto the existing `Frontmatter` shape, runs title-shaped fields through `splitTitle` (because the DB stores the full `<title>` text and has no pre-split column), and drops whitespace-only / null values so downstream consumers never see placeholder strings. `splitTitle` is lifted out of the old parse5 extractor into its own `html/split-title.ts` module — pure string operation, parse5-free. `formatFrontmatter` (new) serialises a `Frontmatter` to the `---\n…\n---\n` YAML block consumed by the downstream scaffold pipeline. Conventions are pinned for that pipeline's expectations (js-yaml with `forceQuotes`, `quotingType: '"'`, `lineWidth: -1`, `indent: 2`, `noRefs: true`; stable key order; nested `og:` / `twitter:` maps; empty sub-objects omitted). `extractPages` now runs `getPageHtml` and `getFrontmatter` in parallel via `Promise.allSettled`, prepends the formatted YAML block to the extracted HTML, and writes the combined buffer. Meta reads are best-effort: if `getFrontmatter` rejects (e.g. transient SQLite contention), the page's already-extracted body is still written without a frontmatter block, rather than discarding the body along with the metadata failure. `TwitterFrontmatter.url` is removed — `.nitpicker` has no corresponding column and the standard convention is to read `og.url` when a Twitter-specific URL is needed. CLI `USAGE` now mentions the YAML prepend so operators reading `--help` understand the output format. Public API churn (package is `private: true` so no npm consumers): - removed: `extractFrontmatter` - added: `getFrontmatter`, `formatFrontmatter`, `splitTitle`, `TitlePair`
d0519a4 to
3a03c6d
Compare
…ypes JSDoc
jsdoc/no-undefined-types warns because eslint-plugin-jsdoc only resolves
{@link} against type / class names, not arbitrary export functions. The
warnings were surfacing as CI annotations on the PR diff (R14 / R23).
Switch the function references to backtick-quoted identifiers; the actual
type links (ArchiveSession.close) keep their {@link}.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
@d-zero/site-migratorの frontmatter 生成を parse5 ベース (extractFrontmatter) から.nitpickerDB 直読 (getFrontmatter+@nitpicker/queryのgetPageDetail) へ全面置換formatFrontmatterを新設し、後続の scaffold パイプライン互換の---\n…\n---\nYAML ブロックを生成。extractPagesのパイプライン (getPageHtml∥getFrontmatter→extractMainContent→ prepend →writeFile) に統合.htmlは「YAML frontmatter ブロック + レイアウト剥がし後の HTML」という中間成果物。後続の scaffold パイプラインで消費される前提主要変更
src/archive/get-frontmatter.ts(新)@nitpicker/queryのgetPageDetailを呼び、flat meta カラム ⇄Frontmatterをマップ。title 系はsplitTitleを通し、whitespace-only カラムは空扱いsrc/html/split-title.ts(新)extractFrontmatterからsplitTitleを抽出src/html/format-frontmatter.ts(新)js-yaml(forceQuotes/quotingType: '\"'/lineWidth: -1/indent: 2/noRefs) で scaffold パイプライン互換出力。og.* / twitter.* は nested map、空はサブツリーごと省略src/page-extractor/extract-pages.tsPromise.allSettledでgetPageHtml/getFrontmatterを並列実行。meta 失敗は fail-soft(本文は書き出す)src/types.tsTwitterFrontmatter.urlを削除(DB に対応カラム無し。og.urlで代替)src/index.tsextractFrontmatterを export から削除、getFrontmatter/formatFrontmatter/splitTitleを追加src/html/extract-frontmatter.{ts,spec.ts}src/cli.tsUSAGE冒頭に「YAML frontmatter prepended」の概要を追記package.json@nitpicker/{crawler,query}をバンプ、js-yaml+@types/js-yaml追加README.mdcspell.json設計判断
rawTitleはsplitTitle経由で再現: DB に rawTitle カラムは無いが、pages.title全文に分割演算を適用すれば等価結果が得られるgetFrontmatter例外時は HTML 本文だけ書き出す。本文抽出は成功しているので、メタ失敗で道連れにすると損失が非対称twitter.url削除: DB 非保持。og.urlで代替する慣習に従うテスト
get-frontmatter.spec.ts7 件、split-title.spec.ts7 件、format-frontmatter.spec.ts9 件、extract-pages.spec.ts追加 3 件、migrate.spec.ts追加 1 件)chrome-devtoolsMCP (canary + flags 有効) で出力.htmlをfile://で開き、frontmatter prepend 後の DOM・console・network を確認。Quirks Mode 警告は中間成果物として想定内(README に明記)注意
.yarnrc.ymlのnpmMinimalAgeGate: 7dは変更していない。新リリースの@nitpicker/{crawler,query}は yarn.lock に pin 済みで、yarn installは resolution ではなく lockfile 解決のため 7 日ゲートを通る(ゲートはyarn add等の resolution 時のみ走る)。Test plan
yarn buildyarn lintyarn test.nitpickerでdz-migrateを流し、出力.html先頭に YAML frontmatter ブロックがあり、og.* / twitter.* が nested、空フィールドが省略されていることを目視🤖 Generated with Claude Code