
feat: add --inventory mode for orphan / unused file discovery #87

Merged
YusukeHirao merged 9 commits into dev from feat/inventory on Jun 18, 2026


@YusukeHirao YusukeHirao commented Jun 17, 2026


Summary

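Adds crawl --inventory <urls.txt>, which cross-references a URL list obtained on the server side against an existing .nitpicker archive and imports only the URLs that are not yet registered. This makes "isolated LPs" and "unused files on the server", which cannot be reached through the crawl's link graph, discoverable from the query CLI, MCP, and viewer.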

Goal

  • Identify island pages that are cut off from the graph (fully isolated LPs)
  • Identify unneeded files on the server that are safe to delete

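Behavior

  1. Open the existing archive with Archive.open() and create <archive>.bak (restored on failure, deleted on success)
  2. Parse the URL list file (one URL per line; blank lines and # comments allowed) with @d-zero/readtext/list
  3. URLs outside the archived scope are skipped with a warning; URLs already present in pages / resources are skipped
  4. For new URLs only, determine the content-type with HEAD (falling back to GET if needed)
    • HTML → puppeteer rendering + recursive crawl, source='inventory-seed' (taken directly from the list) / 'inventory-discovered' (found by recursion)
    • Non-HTML → registered in resources from the HEAD response alone, source='inventory-seed'
  5. Sub-resources seen during the recursive crawl (new ones only) get source='inventory-discovered'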
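
Step 4's content-type branch can be sketched as a pure function. This is only an illustration: `isHtmlContentType`, `routeNovelUrl`, and the `ProbeRoute` labels are hypothetical names, and treating `application/xhtml+xml` as HTML is an assumption, not something the PR states.

```typescript
// Hypothetical labels for the two branches in step 4.
type ProbeRoute = 'render-and-crawl' | 'register-resource';

// Assumption: a response counts as HTML when its media type (ignoring parameters
// such as charset) is text/html or application/xhtml+xml.
function isHtmlContentType(header: string | null | undefined): boolean {
  if (!header) return false;
  const mediaType = (header.split(';')[0] ?? '').trim().toLowerCase();
  return mediaType === 'text/html' || mediaType === 'application/xhtml+xml';
}

// A URL taken directly from the list is always labelled 'inventory-seed';
// only the way it is processed differs.
function routeNovelUrl(contentType: string | null | undefined): {
  route: ProbeRoute;
  source: 'inventory-seed';
} {
  return {
    route: isHtmlContentType(contentType) ? 'render-and-crawl' : 'register-resource',
    source: 'inventory-seed',
  };
}
```

A missing header falls into the non-HTML branch here; the actual fallback to GET described in step 4 is not modelled.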

Mutually exclusive with --append / --retry-failed / --resume / --diff / --output / --list / --list-file / --single. A second or later --inventory run does not touch existing rows, so it is non-destructive ('inventory-seed' is never demoted).
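
A minimal sketch of how such an exclusivity check might look. Only the flag names come from this PR; the `CrawlFlags` shape, the value types, and `findInventoryConflicts` are hypothetical.

```typescript
// Hypothetical flag shape; the real CLI's types may differ.
type CrawlFlags = {
  inventory?: string;
  append?: boolean;
  retryFailed?: boolean;
  resume?: string;
  diff?: boolean;
  output?: string;
  list?: string[];
  listFile?: string;
  single?: boolean;
};

// Flags listed in the PR as incompatible with --inventory.
const INVENTORY_EXCLUSIVE = [
  ['append', '--append'],
  ['retryFailed', '--retry-failed'],
  ['resume', '--resume'],
  ['diff', '--diff'],
  ['output', '--output'],
  ['list', '--list'],
  ['listFile', '--list-file'],
  ['single', '--single'],
] as const;

// Returns the CLI names of every flag that clashes with --inventory.
function findInventoryConflicts(flags: CrawlFlags): string[] {
  if (flags.inventory === undefined) return [];
  const conflicts: string[] = [];
  for (const [key, label] of INVENTORY_EXCLUSIVE) {
    const value = flags[key];
    const isSet = Array.isArray(value)
      ? value.length > 0
      : value !== undefined && value !== false;
    if (isSet) conflicts.push(label);
  }
  return conflicts;
}
```

Returning the full list of clashes, rather than stopping at the first, lets a caller report every conflicting flag in one error message.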

Output surfaces

  • CLI: nitpicker query <archive> isolated-pages / unused-resources
  • MCP tool: list_isolated_pages / list_unused_resources
  • viewer: /isolated-pages / /unused-resources views with a SourceBadge (crawled / inv:seed / inv:disc)

Isolation is judged by having zero referrers, regardless of source, with archived roots excluded. The source column is returned only for badge display.

DB schema changes

  • pages.source TEXT NOT NULL DEFAULT 'crawled' + INDEX(source)
  • resources.source TEXT NOT NULL DEFAULT 'crawled' + INDEX(source)
  • Existing 0.10 archives are auto-migrated at runtime by migratePagesResourcesSource (idempotent; does not run on a read-only viewer attach)
  • In the #insertPage UPDATE, source is only promoted one way, via CASE WHEN source = 'crawled' THEN ? ELSE source END: a placeholder created earlier through an anchor (DEFAULT 'crawled') is promoted to the correct inventory label on the rendering scrape, while a row that is already 'inventory-*' is not demoted on later passes

Commit breakdown

  • 1a78fe1 feat(crawler): add source column to pages/resources + types + migration
  • 109d346 feat(crawler): wire source through Crawler → Archive → Database
  • 869e99e feat(crawler): CrawlerOrchestrator.inventory() + existing-url helpers
  • 61b9054 feat(cli): crawl --inventory flag + mutual-exclusion validation
  • 5a467d3 feat(query): listIsolatedPages / listUnusedResources
  • e0c0c5c feat(cli,mcp): query CLI subcommands + two MCP tools
  • f4939ed feat(viewer): viewer routes + UI + SourceBadge
  • dbe71e1 test(crawler): inventory E2E + fixtures + finishing docs + source promotion fix
  • efb5226 test(viewer): fix the Resources nav lookup with exact: true (ambiguity introduced by the new Unused Resources entry)

Test plan

  • yarn build (14 packages)
  • yarn lint (CSpell 0 issues)
  • yarn test (1147 unit tests pass / 5 skipped)
  • yarn vitest run --config vitest.e2e.config.ts packages/test-server/src/__tests__/e2e/inventory.e2e.ts (3 cases pass: inventory-seed / inventory-discovered / non-HTML inventory-seed)
  • yarn workspace @nitpicker/viewer test:e2e (26 cases pass, Playwright)
  • Full CI (runs after push)

Usage example

# import the inventory
nitpicker crawl ./my-site.nitpicker --inventory ./server-urls.txt

# find isolated pages / unused files
nitpicker query ./my-site.nitpicker isolated-pages --pretty
nitpicker query ./my-site.nitpicker unused-resources --pretty

# viewer
nitpicker viewer ./my-site.nitpicker
# → open "Isolated Pages" / "Unused Resources" from the sidebar

🤖 Generated with Claude Code

YusukeHirao and others added 8 commits June 17, 2026 23:03
Add `pages.source` and `resources.source` (NOT NULL DEFAULT 'crawled') so
later commits can label rows brought in by `crawl --inventory` as
`inventory-seed` / `inventory-discovered`. Isolation queries judge orphans
by `referrer = 0` — `source` only labels the row for the viewer badge.

The runtime-idempotent `migratePagesResourcesSource` runs on writer-side
`Database.connect` to upgrade existing 0.10 archives in place; read-only
viewer attaches skip it. SQLite ALTER TABLE ADD COLUMN with NOT NULL
DEFAULT applies the default to every existing row, so no explicit backfill
is needed.
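
The idempotency described above can be illustrated with a small planning function. This is a sketch under assumptions, not the code of `migratePagesResourcesSource`: `planSourceMigration` and its input shape are hypothetical, and index creation is left out.

```typescript
// Hypothetical helper: given the column names each table already has, return the
// ALTER TABLE statements that are still needed. An already-migrated archive yields
// an empty plan, which is what makes re-running the migration harmless.
function planSourceMigration(
  columnsByTable: Record<string, readonly string[]>,
): string[] {
  const statements: string[] = [];
  for (const table of ['pages', 'resources']) {
    const columns = columnsByTable[table] ?? [];
    if (!columns.includes('source')) {
      // NOT NULL is allowed here because a non-NULL DEFAULT is supplied;
      // existing rows read back the default without an explicit backfill.
      statements.push(
        `ALTER TABLE ${table} ADD COLUMN source TEXT NOT NULL DEFAULT 'crawled'`,
      );
    }
  }
  return statements;
}
```

In a real migration the column lists would come from the database itself (for example via `PRAGMA table_info`), and a read-only connection would simply never call the planner.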

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…chive → Database

Add CrawlerOptions.inventoryMode (a Set<string> of seed URLs, or null
outside inventory). The Crawler emits an optional `source` on `page` /
`externalPage` / `response` events: `'inventory-seed'` for URLs that came
straight off the user-supplied list, `'inventory-discovered'` for anything
found by following links from those pages (or any sub-resource captured by
puppeteer while rendering them). Outside inventory mode, `source` is left
undefined and the DB DEFAULT `'crawled'` applies.

The orchestrator forwards `source` into `Archive.setPage` /
`setExternalPage` / `setResources`, which thread it into
`Database.insertResource` and `Database.updatePage` → `#insertPage`.
Provenance is written ONLY when the row is freshly inserted (in
`#getIdByUrl`'s INSERT path); existing rows keep their original label. This
is what makes a second `crawl --inventory` non-destructive — a page first
discovered as `'inventory-seed'` never gets demoted on later passes.

Also moves the `PageSource` enum to the crawler package (its rightful
owner) and re-exports it from `@nitpicker/query` so CLI / MCP / viewer
keep a single import path. Adds a `derivePageSource` pure helper plus
spec covering inventoryMode=null and seed/discover discrimination.
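
As a rough sketch of the seed/discovered discrimination described in this commit (the real `derivePageSource` signature is not shown in the PR, so the name `derivePageSourceSketch`, the parameter order, and the string-union form of `PageSource` are assumptions):

```typescript
// Assumption: PageSource modelled as a string union of the three labels in this PR.
type PageSource = 'crawled' | 'inventory-seed' | 'inventory-discovered';

// inventoryMode is the set of seed URLs, or null outside inventory mode.
function derivePageSourceSketch(
  url: string,
  inventoryMode: ReadonlySet<string> | null,
): PageSource | undefined {
  // Outside inventory mode nothing is emitted, so the DB DEFAULT 'crawled' applies.
  if (inventoryMode === null) return undefined;
  // A URL that came straight off the list is a seed; anything else was reached
  // by following links from a seed.
  return inventoryMode.has(url) ? 'inventory-seed' : 'inventory-discovered';
}
```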

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pers

`CrawlerOrchestrator.inventory(archivePath, urls)` cross-references a
user-supplied URL list against an existing `.nitpicker` and imports only
URLs the archive does not yet track. HTML responses become Crawler seeds
in `inventoryMode` (rendered pages are labelled `'inventory-seed'`, links
followed from them `'inventory-discovered'`); non-HTML URLs land in
`resources` directly as `'inventory-seed'` without a browser launch.
`.bak` is taken before any mutation and restored on throw — same
backup-and-restore contract as `append()`.

Guards: list-mode archives and archives with unfinished `pending` URLs
are rejected up front, since the latter would otherwise inherit the
inventory provenance label by mistake. Scope-foreign URLs are warned and
skipped (inventory is per-server, not cross-host).

`Archive.getExistingPageUrls` / `getExistingResourceUrls` (backed by
chunked `Database.getExistingPageUrls` / `getExistingResourceUrls` reads
that stay under SQLite's IN-parameter cap) provide the filter that lets
the second (and N-th) `--inventory` pass be a no-op for already-known
URLs — `'inventory-seed'` rows are never demoted on later passes. The
two DB lookups now run in parallel via `Promise.all`. HEAD probes over
novel URLs also run concurrently to keep wall-clock proportional to the
slowest single response, not the sum.
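
The chunked, parallel lookup can be sketched as follows. The cap value, `chunk`, `findExisting`, and `filterNovel` are illustrative names and numbers; the `lookup` callbacks stand in for the real `Database.getExistingPageUrls` / `getExistingResourceUrls` queries, whose signatures are not shown here.

```typescript
// Assumed cap for illustration; SQLite's real parameter limit depends on the build.
const MAX_PARAMS = 500;

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// `lookup` stands in for one "WHERE url IN (?, ?, ...)" query over a single batch.
async function findExisting(
  urls: readonly string[],
  lookup: (batch: string[]) => Promise<string[]>,
): Promise<Set<string>> {
  const found = new Set<string>();
  for (const batch of chunk(urls, MAX_PARAMS)) {
    for (const url of await lookup(batch)) found.add(url);
  }
  return found;
}

// Run the pages and resources lookups in parallel, then keep only unknown URLs.
async function filterNovel(
  urls: readonly string[],
  lookupPages: (batch: string[]) => Promise<string[]>,
  lookupResources: (batch: string[]) => Promise<string[]>,
): Promise<string[]> {
  const [pages, resources] = await Promise.all([
    findExisting(urls, lookupPages),
    findExisting(urls, lookupResources),
  ]);
  return urls.filter((url) => !pages.has(url) && !resources.has(url));
}
```

Because already-known URLs are filtered out before any probing, a repeated run over the same list has nothing left to import.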

`deriveResourceSource` pure helper encodes the "sub-resources are never
seeds" rule alongside `derivePageSource`; the pair stay in lockstep if
`PageSource` grows new variants. Also fixes a missed emit path for
external pages (the `!fetchExternal && isExternal` branch in
`Crawler.#runDeal`) so it carries the inventory source label like the
other emits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the `--inventory <urls.txt>` flag to the `crawl` subcommand and a
matching `inventoryCrawl()` dispatch that hands the parsed URL list to
`CrawlerOrchestrator.inventory()`. Mutually exclusive with
`--append` / `--retry-failed` / `--resume` / `--diff` / `--output` /
`--list` / `--list-file` / `--single`; the dispatch enforces this with
explicit error messages so the operator picks one existing-archive mode
intentionally.

URL list parsing reuses `@d-zero/readtext/list` — same blank-line +
`#`-comment conventions as `--list-file`. Empty files raise
`No URLs found in inventory file: <path>` before any archive mutation,
and `validateUrls()` rejects malformed entries up front so the operator
fixes the list before HEAD probes touch the network.
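
The parsing and up-front validation described above might look roughly like this. It is a stand-in, not the API of `@d-zero/readtext/list` or the real `validateUrls()`: `parseUrlList`, `findInvalidUrls`, and `loadInventory` are hypothetical, as is the assumption that only absolute http(s) URLs are accepted. Only the empty-file error message is taken from the commit.

```typescript
// Stand-in parser: trims each line, then drops blank lines and '#' comment lines.
function parseUrlList(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith('#'));
}

// Assumption: an entry is valid only if it parses as an absolute http(s) URL.
function findInvalidUrls(entries: readonly string[]): string[] {
  return entries.filter((entry) => {
    try {
      const { protocol } = new URL(entry);
      return protocol !== 'http:' && protocol !== 'https:';
    } catch {
      return true; // new URL() throws on malformed input
    }
  });
}

// Fails before any archive mutation or network access would happen.
function loadInventory(text: string, path: string): string[] {
  const urls = parseUrlList(text);
  if (urls.length === 0) {
    throw new Error(`No URLs found in inventory file: ${path}`);
  }
  const invalid = findInvalidUrls(urls);
  if (invalid.length > 0) {
    throw new Error(`Invalid URL(s) in ${path}: ${invalid.join(', ')}`);
  }
  return urls;
}
```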

crawl.spec.ts gains 13 cases covering the dispatch (happy path, empty
file, invalid URL, missing positional, extra positional, plus 8 mutually-
exclusive flag combinations). pipeline.ts's startCrawl call gets an
explicit `inventory: undefined` so the InferFlags shape stays exact.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`listIsolatedPages(accessor)` returns internal HTML pages with no inbound
anchors, excluding archived roots — the "orphan LP" surface the
inventory feature was built to expose. `listUnusedResources(accessor)`
returns internal sub-resources that no archived page references — the
"file on the server nobody loads" surface.

Both queries judge isolation purely by the link graph
(`anchors.hrefId IS NULL` / `resources-referrers.id IS NULL`) and IGNORE
`source` in the WHERE clause: a `'crawled'` row that lost its referrers
and an `'inventory-seed'` row that never gained any both surface here.
`source` is returned as a per-row badge for the viewer to render.

`info.roots` is JSON-decoded once per call to exclude the seed URLs from
the isolated list (otherwise they would dominate the result set). The
`?? 'crawled'` fallback on the source column tolerates archives created
before the runtime migration ran.
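
An in-memory analogue of the isolation rule may make the logic easier to follow. The real helpers run SQL against the archive; the row shape, `listIsolatedSketch`, and passing anchor targets as a set are assumptions made purely for illustration, and the HTML-only filter is omitted.

```typescript
type PageSource = 'crawled' | 'inventory-seed' | 'inventory-discovered';

// Hypothetical row shape; source may be missing on archives that predate the migration.
type PageRow = { url: string; isExternal: boolean; source?: PageSource | null };

function listIsolatedSketch(
  pages: readonly PageRow[],
  anchorTargets: ReadonlySet<string>, // URLs that at least one anchor points to
  roots: readonly string[], // decoded once from info.roots
): { url: string; source: PageSource }[] {
  const rootSet = new Set(roots);
  return pages
    // source is deliberately absent from the filter: only the link graph decides.
    .filter((p) => !p.isExternal && !anchorTargets.has(p.url) && !rootSet.has(p.url))
    // source is read back only as a badge, defaulting to 'crawled' when missing.
    .map((p) => ({ url: p.url, source: p.source ?? 'crawled' }));
}
```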

Specs cover the three discriminations that matter: link-graph
inclusion/exclusion, internal-vs-external filtering, and the
`crawled / inventory-seed / inventory-discovered` source badge being
read back from the DB column rather than synthesised by the helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… and MCP tools

`nitpicker query <archive> isolated-pages [--limit --offset]` and
`unused-resources [--limit --offset]` surface the new query helpers
through the CLI; `--archiveId`-keyed `list_isolated_pages` and
`list_unused_resources` MCP tools mirror them for AI assistants.

`mapFlagsToQueryOptions` uses a fallthrough case for the two sub-commands
since both take only pagination flags — `archiveId` arrives via the MCP
server context layer, not as a CLI flag, so no flag-side validation is
needed here. The exhaustive switch over `QuerySubCommand` enforces that
both cases are handled at compile time.

Specs cover the new dispatch paths end-to-end: `dispatch-query.spec.ts`
verifies the new query helpers receive the limit/offset pass-through;
`map-flags-to-query-options.spec.ts` checks both shapes; the MCP server
spec asserts the tool count and the presence of the two new tool names.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `/isolated-pages` and `/unused-resources` views, the matching Hono
endpoints (`/api/isolated-pages`, `/api/unused-resources`), TanStack Query
hooks, and sidebar entries so the surfaces are discoverable without URL
typing. Each table row shows the new `SourceBadge` component, which gives
`crawled` / `inventory-seed` / `inventory-discovered` three distinct
visual treatments — neutral, warn, muted-warn — so audit operators can
scan a long list and spot inventory-sourced rows at a glance.

`translations.ts` gains both `nav.*` labels and the full `views.*` view
copy for en and ja so the existing en/ja-parity spec keeps passing. CSS
adds the three badge variants under the existing `--badge-warn-*`
palette so theme switching keeps working.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `inventoryRoutes` to the test-server (a hidden landing page, a page
linked only from it, and a stray PDF) so the orchestrator E2E can
exercise the three labelling paths end-to-end: an HTML URL on the list
gets `'inventory-seed'`, a page reached by following links from a seed
gets `'inventory-discovered'`, and a non-HTML URL on the list lands in
`resources` as `'inventory-seed'` without a browser launch.

The second path (the `discovered` page) exposed a real gap: when an anchor
from a seed page named a URL the crawl had never seen, `#getIdByUrl`'s
INSERT path created a placeholder row with the DB DEFAULT `'crawled'`,
and the subsequent scrape went through the UPDATE branch — which
deliberately did NOT touch `source` — leaving the row mis-labelled. The
fix is a one-way source promotion in `#insertPage`: the UPDATE now sets
`source = CASE WHEN source = 'crawled' THEN ? ELSE source END`. So a
placeholder labelled `'crawled'` gets promoted to its inventory variant
on the rendering scrape, while a row that was already labelled
`'inventory-seed'` (or `'inventory-discovered'`) is never demoted on
later passes.
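
The semantics of that CASE expression can be mirrored in a few lines. This is only a model of the SQL rule quoted above; `promoteSource` and the string-union `PageSource` are hypothetical names, not code from the PR.

```typescript
type PageSource = 'crawled' | 'inventory-seed' | 'inventory-discovered';

// Mirrors: source = CASE WHEN source = 'crawled' THEN ? ELSE source END
function promoteSource(current: PageSource, incoming: PageSource): PageSource {
  return current === 'crawled' ? incoming : current;
}

// A placeholder row is promoted when the page is actually rendered...
const promoted = promoteSource('crawled', 'inventory-discovered');
// ...but an existing inventory label survives any later write.
const kept = promoteSource('inventory-seed', 'inventory-discovered');
```

Here `promoted` is `'inventory-discovered'` and `kept` stays `'inventory-seed'`, which is the one-way behaviour the fix relies on.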

README + CLAUDE.md gain the `--inventory` mode description alongside the
existing `--append` / `--retry-failed` docs so operators discover the
feature from the entry-point reference.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@YusukeHirao YusukeHirao marked this pull request as ready for review June 17, 2026 15:34
…urces' clash

The new 'Unused Resources' sidebar entry made the existing
`getByRole('link', { name: 'Resources' })` resolve to two links under
playwright's strict mode (substring match), failing the
"サイドバーから各ビューへ遷移できる" test. `exact: true` pins the
lookup to the original 'Resources' entry without touching the new one.
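
The clash can be reproduced without a browser using a simplified model of accessible-name matching. `matchesName` is a hypothetical stand-in that ignores details such as whitespace normalisation, and the sidebar labels other than 'Resources' and 'Unused Resources' are made up.

```typescript
// Simplified model: by default a name matches as a case-insensitive substring;
// with exact = true it must match the whole string, case-sensitively.
function matchesName(accessibleName: string, wanted: string, exact = false): boolean {
  return exact
    ? accessibleName === wanted
    : accessibleName.toLowerCase().includes(wanted.toLowerCase());
}

// 'Pages' is an illustrative entry; the other two come from the commit message.
const sidebarLinks = ['Pages', 'Resources', 'Unused Resources'];

const looseMatches = sidebarLinks.filter((label) => matchesName(label, 'Resources'));
const exactMatches = sidebarLinks.filter((label) => matchesName(label, 'Resources', true));
```

`looseMatches` contains both 'Resources' and 'Unused Resources', which is the ambiguity strict mode rejects, while `exactMatches` contains only 'Resources'.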

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@YusukeHirao YusukeHirao merged commit 210037e into dev Jun 18, 2026
5 checks passed
@YusukeHirao YusukeHirao deleted the feat/inventory branch June 18, 2026 00:22