feat: add --inventory mode for orphan / unused file discovery by YusukeHirao · Pull Request #87 · d-zero-dev/nitpicker

YusukeHirao · 2026-06-17T14:03:55Z

Summary

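Adds crawl --inventory <urls.txt>, which cross-references a URL list obtained on the server side against an existing .nitpicker archive and imports only the URLs that are not yet registered. This makes "isolated LPs" and "unused files on the server", which cannot be reached through the crawl's link graph, discoverable from the query CLI, MCP, and the viewer.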

Goal

Identify island pages in the link graph (fully isolated LPs)
Identify unneeded files on the server that are safe to delete

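Behavior

Open the existing archive with Archive.open() and create <archive>.bak (restored on failure, deleted on success)
Parse the URL list file (one URL per line; blank lines and # comments allowed) with @d-zero/readtext/list
URLs outside the archived scope are skipped with a warning; URLs already present in pages / resources are skipped
For new URLs only, determine the content-type via HEAD (falling back to GET if needed)
- HTML → puppeteer rendering + recursive crawl, source='inventory-seed' (taken directly from the list) / 'inventory-discovered' (found recursively)
- Non-HTML → registered in resources via HEAD only, source='inventory-seed'
Sub-resources captured during the recursive crawl (new ones only) get source='inventory-discovered'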
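
The list handling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: the real implementation parses with @d-zero/readtext/list and probes new URLs with HEAD, whereas `parseUrlList` and `selectNovelUrls` are hypothetical names, and "archived scope" is approximated here as a simple URL prefix.

```typescript
// Hypothetical helper: one URL per line; blank lines and lines starting with "#" are ignored.
function parseUrlList(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}

interface UrlSelection {
  novel: string[]; // not yet in the archive: these would be probed and imported
  known: string[]; // already in pages / resources: skipped silently
  outOfScope: string[]; // outside the archived scope: skipped with a warning
}

// Assumption: scope is modelled as a plain prefix such as "https://example.com/".
function selectNovelUrls(
  urls: readonly string[],
  scopePrefix: string,
  existing: ReadonlySet<string>,
): UrlSelection {
  const result: UrlSelection = { novel: [], known: [], outOfScope: [] };
  for (const url of urls) {
    if (!url.startsWith(scopePrefix)) {
      result.outOfScope.push(url);
    } else if (existing.has(url)) {
      result.known.push(url);
    } else {
      result.novel.push(url);
    }
  }
  return result;
}
```

Only the `novel` bucket would go on to the content-type probe; the other two buckets never touch the network.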

Mutually exclusive with --append / --retry-failed / --resume / --diff / --output / --list / --list-file / --single. A second or later --inventory run does not touch existing rows, so it is non-destructive ('inventory-seed' is never demoted).

Output surfaces

CLI: nitpicker query <archive> isolated-pages / unused-resources
MCP tool: list_isolated_pages / list_unused_resources
viewer: /isolated-pages / /unused-resources views with SourceBadge (crawled / inv:seed / inv:disc)

Isolation is judged by having zero referrers, regardless of source, with archived roots excluded. The source column is included in the result only for badge display.
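
The isolation rule can be illustrated with an in-memory sketch. The real queries run against the archive database; the row shapes and the `listIsolatedSketch` name below are assumptions made for illustration only.

```typescript
type PageSource = "crawled" | "inventory-seed" | "inventory-discovered";

interface PageRow {
  url: string;
  source: PageSource;
}

interface AnchorRow {
  fromUrl: string;
  toUrl: string;
}

// A page is isolated when no anchor points at it and it is not an archived root.
// `source` plays no part in the decision; it is only carried along for the badge.
function listIsolatedSketch(
  pages: readonly PageRow[],
  anchors: readonly AnchorRow[],
  roots: readonly string[],
): PageRow[] {
  const referenced = new Set(anchors.map((a) => a.toUrl));
  const rootSet = new Set(roots);
  return pages.filter((p) => !referenced.has(p.url) && !rootSet.has(p.url));
}
```

Under this rule a 'crawled' page that lost its referrers and an 'inventory-seed' page that never had any are both reported.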

DB schema changes

pages.source TEXT NOT NULL DEFAULT 'crawled' + INDEX(source)
resources.source TEXT NOT NULL DEFAULT 'crawled' + INDEX(source)
Existing 0.10 archives are auto-migrated at runtime by migratePagesResourcesSource (idempotent; does not run on a read-only viewer attach)
The UPDATE in #insertPage changes source only through a one-way promotion, CASE WHEN source = 'crawled' THEN ? ELSE source END: a placeholder created earlier via an anchor (DEFAULT 'crawled') is promoted to the correct inventory label on the rendering scrape, and rows that are already 'inventory-*' are not demoted on the second or later passes
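
The one-way promotion can be mirrored as a pure function to make its behaviour easy to check. This is only a sketch of the CASE expression's logic; `promoteSource` is a hypothetical name and is not part of the PR.

```typescript
type PageSource = "crawled" | "inventory-seed" | "inventory-discovered";

// Mirrors: CASE WHEN source = 'crawled' THEN ? ELSE source END
// `current` is the value already stored in the row; `incoming` is the bound parameter.
function promoteSource(current: PageSource, incoming: PageSource): PageSource {
  return current === "crawled" ? incoming : current;
}

// A placeholder row created via an anchor is promoted on the rendering scrape.
const promoted = promoteSource("crawled", "inventory-discovered");

// A row that already carries an inventory label keeps it on later passes.
const kept = promoteSource("inventory-seed", "inventory-discovered");
```

Because only 'crawled' can change, applying the update any number of times leaves an inventory label untouched.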

Commit breakdown

1a78fe1 feat(crawler): add source column to pages/resources + types + migration
109d346 feat(crawler): wire source through Crawler → Archive → Database
869e99e feat(crawler): CrawlerOrchestrator.inventory() + existing-url helpers
61b9054 feat(cli): crawl --inventory flag + mutual-exclusion validation
5a467d3 feat(query): listIsolatedPages / listUnusedResources
e0c0c5c feat(cli,mcp): query CLI subcommands + 2 MCP tools
f4939ed feat(viewer): viewer routes + UI + SourceBadge
dbe71e1 test(crawler): inventory E2E + fixtures + final documentation + source promotion fix
efb5226 test(viewer): fix the Resources nav lookup with exact: true (ambiguity introduced by the new Unused Resources entry)

Test plan

yarn build (14 packages)
yarn lint (CSpell 0 issues)
yarn test (1147 unit pass / 5 skipped)
yarn vitest run --config vitest.e2e.config.ts packages/test-server/src/__tests__/e2e/inventory.e2e.ts (3 cases pass: inventory-seed / inventory-discovered / non-HTML inventory-seed)
yarn workspace @nitpicker/viewer test:e2e (26 cases pass, Playwright)
Full CI (runs after push)

Usage example

# Import the inventory
nitpicker crawl ./my-site.nitpicker --inventory ./server-urls.txt

# Find isolated pages / unused files
nitpicker query ./my-site.nitpicker isolated-pages --pretty
nitpicker query ./my-site.nitpicker unused-resources --pretty

# viewer
nitpicker viewer ./my-site.nitpicker
# → open "Isolated Pages" / "Unused Resources" from the sidebar

🤖 Generated with Claude Code

Add `pages.source` and `resources.source` (NOT NULL DEFAULT 'crawled') so later commits can label rows brought in by `crawl --inventory` as `inventory-seed` / `inventory-discovered`. Isolation queries judge orphans by `referrer = 0` — `source` only labels the row for the viewer badge. The runtime-idempotent `migratePagesResourcesSource` runs on writer-side `Database.connect` to upgrade existing 0.10 archives in place; read-only viewer attaches skip it. SQLite ALTER TABLE ADD COLUMN with NOT NULL DEFAULT applies the default to every existing row, so no explicit backfill is needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
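
One way to picture the idempotent migration is as a planning step that emits only the statements still needed. The sketch below is an assumption about the shape of such logic, not the actual `migratePagesResourcesSource`: the `planSourceMigration` name and the `idx_<table>_source` index names are hypothetical, and the existing column list is assumed to come from something like `PRAGMA table_info`.

```typescript
type SourceTable = "pages" | "resources";

// Returns the SQL still required to bring `table` up to the new schema.
// Running it again after the column exists yields no ALTER, which keeps it idempotent.
function planSourceMigration(table: SourceTable, existingColumns: readonly string[]): string[] {
  const statements: string[] = [];
  if (existingColumns.indexOf("source") === -1) {
    // SQLite fills every existing row with the DEFAULT, so no backfill statement is planned.
    statements.push(`ALTER TABLE ${table} ADD COLUMN source TEXT NOT NULL DEFAULT 'crawled'`);
  }
  // Hypothetical index name; IF NOT EXISTS makes re-running harmless.
  statements.push(`CREATE INDEX IF NOT EXISTS idx_${table}_source ON ${table} (source)`);
  return statements;
}
```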

…chive → Database Add CrawlerOptions.inventoryMode (a Set<string> of seed URLs, or null outside inventory). The Crawler emits an optional `source` on `page` / `externalPage` / `response` events: `'inventory-seed'` for URLs that came straight off the user-supplied list, `'inventory-discovered'` for anything found by following links from those pages (or any sub-resource captured by puppeteer while rendering them). Outside inventory mode, `source` is left undefined and the DB DEFAULT `'crawled'` applies. The orchestrator forwards `source` into `Archive.setPage` / `setExternalPage` / `setResources`, which thread it into `Database.insertResource` and `Database.updatePage` → `#insertPage`. Provenance is written ONLY when the row is freshly inserted (in `#getIdByUrl`'s INSERT path); existing rows keep their original label. This is what makes a second `crawl --inventory` non-destructive — a page first discovered as `'inventory-seed'` never gets demoted on later passes. Also moves the `PageSource` enum to the crawler package (its rightful owner) and re-exports it from `@nitpicker/query` so CLI / MCP / viewer keep a single import path. Adds a `derivePageSource` pure helper plus spec covering inventoryMode=null and seed/discover discrimination. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
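
The seed/discovered discrimination might look roughly like this. The commit names a `derivePageSource` helper but does not show its signature, so the parameter order and return type below are assumptions, and the function is deliberately given a different name.

```typescript
type InventorySource = "inventory-seed" | "inventory-discovered";

// `inventoryMode` is the set of seed URLs, or null outside inventory mode.
// Returning undefined lets the DB DEFAULT 'crawled' apply.
function derivePageSourceSketch(
  url: string,
  inventoryMode: ReadonlySet<string> | null,
): InventorySource | undefined {
  if (inventoryMode === null) {
    return undefined;
  }
  return inventoryMode.has(url) ? "inventory-seed" : "inventory-discovered";
}
```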

…pers `CrawlerOrchestrator.inventory(archivePath, urls)` cross-references a user-supplied URL list against an existing `.nitpicker` and imports only URLs the archive does not yet track. HTML responses become Crawler seeds in `inventoryMode` (rendered pages are labelled `'inventory-seed'`, links followed from them `'inventory-discovered'`); non-HTML URLs land in `resources` directly as `'inventory-seed'` without a browser launch. `.bak` is taken before any mutation and restored on throw — same backup-and-restore contract as `append()`. Guards: list-mode archives and archives with unfinished `pending` URLs are rejected up front, since the latter would otherwise inherit the inventory provenance label by mistake. Scope-foreign URLs are warned and skipped (inventory is per-server, not cross-host). `Archive.getExistingPageUrls` / `getExistingResourceUrls` (backed by chunked `Database.getExistingPageUrls` / `getExistingResourceUrls` reads that stay under SQLite's IN-parameter cap) provide the filter that lets the second (and N-th) `--inventory` pass be a no-op for already-known URLs — `'inventory-seed'` rows are never demoted on later passes. The two DB lookups now run in parallel via `Promise.all`. HEAD probes over novel URLs also run concurrently to keep wall-clock proportional to the slowest single response, not the sum. `deriveResourceSource` pure helper encodes the "sub-resources are never seeds" rule alongside `derivePageSource`; the pair stay in lockstep if `PageSource` grows new variants. Also fixes a missed emit path for external pages (the `!fetchExternal && isExternal` branch in `Crawler.#runDeal`) so it carries the inventory source label like the other emits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
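
The chunked lookup idea can be sketched as below. SQLite limits the number of bound parameters per statement (999 by default before version 3.32.0, 32766 from 3.32.0 on), so a long URL list has to be split before it goes into an IN clause. The function names and the synchronous `lookupChunk` callback are assumptions for illustration; the real reads are asynchronous.

```typescript
// Split `items` into consecutive slices of at most `size` elements.
function chunkItems<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// `lookupChunk` stands in for one "WHERE url IN (?, ?, ...)" query and
// returns the subset of its input that already exists in the archive.
function findExistingUrls(
  urls: readonly string[],
  lookupChunk: (part: readonly string[]) => string[],
  chunkSize: number,
): Set<string> {
  const found = new Set<string>();
  for (const part of chunkItems(urls, chunkSize)) {
    for (const url of lookupChunk(part)) {
      found.add(url);
    }
  }
  return found;
}
```

Whatever chunk size is chosen, it only needs to stay at or below the parameter limit of the SQLite build in use.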

Adds the `--inventory <urls.txt>` flag to the `crawl` subcommand and a matching `inventoryCrawl()` dispatch that hands the parsed URL list to `CrawlerOrchestrator.inventory()`. Mutually exclusive with `--append` / `--retry-failed` / `--resume` / `--diff` / `--output` / `--list` / `--list-file` / `--single`; the dispatch enforces this with explicit error messages so the operator picks one existing-archive mode intentionally. URL list parsing reuses `@d-zero/readtext/list` — same blank-line + `#`-comment conventions as `--list-file`. Empty files raise `No URLs found in inventory file: <path>` before any archive mutation, and `validateUrls()` rejects malformed entries up front so the operator fixes the list before HEAD probes touch the network. crawl.spec.ts gains 13 cases covering the dispatch (happy path, empty file, invalid URL, missing positional, extra positional, plus 8 mutually- exclusive flag combinations). pipeline.ts's startCrawl call gets an explicit `inventory: undefined` so the InferFlags shape stays exact. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
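
A mutual-exclusion check of this kind could be written along these lines. This is a sketch only: the flag object is modelled as a plain record keyed by the kebab-case flag names listed above, and `findInventoryConflicts` is a hypothetical name, not the dispatch code from the PR.

```typescript
// Flags that cannot be combined with --inventory, as listed in the commit message.
const EXCLUSIVE_WITH_INVENTORY = [
  "append",
  "retry-failed",
  "resume",
  "diff",
  "output",
  "list",
  "list-file",
  "single",
] as const;

// Returns the conflicting flags (with their leading dashes), or an empty array
// when --inventory is absent or used on its own.
function findInventoryConflicts(flags: Readonly<Record<string, unknown>>): string[] {
  if (flags["inventory"] === undefined) {
    return [];
  }
  return EXCLUSIVE_WITH_INVENTORY
    .filter((name) => flags[name] !== undefined && flags[name] !== false)
    .map((name) => `--${name}`);
}
```

A caller would turn a non-empty result into an explicit error message before any archive mutation happens.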

`listIsolatedPages(accessor)` returns internal HTML pages with no inbound anchors, excluding archived roots — the "orphan LP" surface the inventory feature was built to expose. `listUnusedResources(accessor)` returns internal sub-resources that no archived page references — the "file on the server nobody loads" surface. Both queries judge isolation purely by the link graph (`anchors.hrefId IS NULL` / `resources-referrers.id IS NULL`) and IGNORE `source` in the WHERE clause: a `'crawled'` row that lost its referrers and an `'inventory-seed'` row that never gained any both surface here. `source` is returned as a per-row badge for the viewer to render. `info.roots` is JSON-decoded once per call to exclude the seed URLs from the isolated list (otherwise they would dominate the result set). The `?? 'crawled'` fallback on the source column tolerates archives created before the runtime migration ran. Specs cover the three discriminations that matter: link-graph inclusion/exclusion, internal-vs-external filtering, and the `crawled / inventory-seed / inventory-discovered` source badge being read back from the DB column rather than synthesised by the helper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… and MCP tools `nitpicker query <archive> isolated-pages [--limit --offset]` and `unused-resources [--limit --offset]` surface the new query helpers through the CLI; `--archiveId`-keyed `list_isolated_pages` and `list_unused_resources` MCP tools mirror them for AI assistants. `mapFlagsToQueryOptions` uses a fallthrough case for the two sub-commands since both take only pagination flags — `archiveId` arrives via the MCP server context layer, not as a CLI flag, so no flag-side validation is needed here. The exhaustive switch over `QuerySubCommand` enforces that both cases are handled at compile time. Specs cover the new dispatch paths end-to-end: `dispatch-query.spec.ts` verifies the new query helpers receive the limit/offset pass-through; `map-flags-to-query-options.spec.ts` checks both shapes; the MCP server spec asserts the tool count and the presence of the two new tool names. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
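
The fallthrough-plus-exhaustiveness pattern mentioned here can be sketched as follows. The union below is a stand-in: only the two new sub-command names come from the PR, while "placeholder-other" and the option shape are hypothetical.

```typescript
// Stand-in for QuerySubCommand; "placeholder-other" is hypothetical.
type QuerySubCommandSketch = "isolated-pages" | "unused-resources" | "placeholder-other";

interface PaginationOptions {
  limit: number | undefined;
  offset: number | undefined;
}

function mapFlagsSketch(
  subCommand: QuerySubCommandSketch,
  flags: { limit?: number; offset?: number },
): PaginationOptions {
  switch (subCommand) {
    // Both new sub-commands take only pagination flags, so they share one branch.
    case "isolated-pages":
    case "unused-resources":
      return { limit: flags.limit, offset: flags.offset };
    case "placeholder-other":
      return { limit: undefined, offset: undefined };
    default: {
      // If a new member is added to the union and not handled above,
      // this assignment stops compiling.
      const unreachable: never = subCommand;
      throw new Error(`Unhandled sub-command: ${String(unreachable)}`);
    }
  }
}
```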

Adds `/isolated-pages` and `/unused-resources` views, the matching Hono endpoints (`/api/isolated-pages`, `/api/unused-resources`), TanStack Query hooks, and sidebar entries so the surfaces are discoverable without URL typing. Each table row shows the new `SourceBadge` component, which gives `crawled` / `inventory-seed` / `inventory-discovered` three distinct visual treatments — neutral, warn, muted-warn — so audit operators can scan a long list and spot inventory-sourced rows at a glance. `translations.ts` gains both `nav.*` labels and the full `views.*` view copy for en and ja so the existing en/ja-parity spec keeps passing. CSS adds the three badge variants under the existing `--badge-warn-*` palette so theme switching keeps working. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
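
The badge mapping might be expressed as a lookup table like the one below. The short labels come from the PR description and the variant names from this commit message, but pairing them in this order is an assumption, as are the `badgeFor` name and the `Badge` shape; the real `SourceBadge` is a viewer component.

```typescript
type PageSource = "crawled" | "inventory-seed" | "inventory-discovered";
type BadgeVariant = "neutral" | "warn" | "muted-warn";

interface Badge {
  label: string;
  variant: BadgeVariant;
}

// Using Record<PageSource, Badge> makes the compiler flag a missing entry
// if PageSource ever grows a new variant.
const BADGES: Record<PageSource, Badge> = {
  crawled: { label: "crawled", variant: "neutral" },
  "inventory-seed": { label: "inv:seed", variant: "warn" },
  "inventory-discovered": { label: "inv:disc", variant: "muted-warn" },
};

// Rows from archives without a source value are treated as 'crawled'.
function badgeFor(source: PageSource | null | undefined): Badge {
  return BADGES[source ?? "crawled"];
}
```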

Adds `inventoryRoutes` to the test-server (a hidden landing page, a page linked only from it, and a stray PDF) so the orchestrator E2E can exercise the three labelling paths end-to-end: an HTML URL on the list gets `'inventory-seed'`, a page reached by following links from a seed gets `'inventory-discovered'`, and a non-HTML URL on the list lands in `resources` as `'inventory-seed'` without a browser launch. The second path (the `discovered` page) exposed a real gap: when an anchor from a seed page named a URL the crawl had never seen, `#getIdByUrl`'s INSERT path created a placeholder row with the DB DEFAULT `'crawled'`, and the subsequent scrape went through the UPDATE branch (which deliberately did NOT touch `source`), leaving the row mis-labelled. The fix is a one-way source promotion in `#insertPage`: the UPDATE now sets `source = CASE WHEN source = 'crawled' THEN ? ELSE source END`. So a placeholder labelled `'crawled'` gets promoted to its inventory variant on the rendering scrape, while a row that was already labelled `'inventory-seed'` (or `'inventory-discovered'`) is never demoted on later passes. README + CLAUDE.md gain the `--inventory` mode description alongside the existing `--append` / `--retry-failed` docs so operators discover the feature from the entry-point reference. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…urces' clash The new 'Unused Resources' sidebar entry made the existing `getByRole('link', { name: 'Resources' })` resolve to two links under playwright's strict mode (substring match), failing the "サイドバーから各ビューへ遷移できる" test. `exact: true` pins the lookup to the original 'Resources' entry without touching the new one. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

YusukeHirao and others added 8 commits June 17, 2026 23:03

YusukeHirao marked this pull request as ready for review June 17, 2026 15:34

YusukeHirao merged commit 210037e into dev Jun 18, 2026
5 checks passed

YusukeHirao deleted the feat/inventory branch June 18, 2026 00:22

YusukeHirao merged 9 commits into dev from feat/inventory