feat: add --inventory mode for orphan / unused file discovery #87
Merged
Add `pages.source` and `resources.source` (NOT NULL DEFAULT 'crawled') so later commits can label rows brought in by `crawl --inventory` as `inventory-seed` / `inventory-discovered`. Isolation queries judge orphans by `referrer = 0` — `source` only labels the row for the viewer badge. The runtime-idempotent `migratePagesResourcesSource` runs on writer-side `Database.connect` to upgrade existing 0.10 archives in place; read-only viewer attaches skip it. SQLite ALTER TABLE ADD COLUMN with NOT NULL DEFAULT applies the default to every existing row, so no explicit backfill is needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
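The idempotency described above can be sketched as a pure planning step: inspect the columns a table already has and emit DDL only when `source` is missing. This is a minimal sketch, not the actual `migratePagesResourcesSource`; the function name `planSourceMigration`, the `ColumnInfo` shape, and the index name are assumptions made for illustration.

```typescript
// One row of `PRAGMA table_info(<table>)`; only `name` matters here.
interface ColumnInfo {
  name: string;
}

type SourcedTable = 'pages' | 'resources';

/**
 * Returns the statements needed to add `source` to `table`, or an empty
 * list when the column already exists, so a second run is a no-op.
 * The index name is hypothetical.
 */
function planSourceMigration(
  table: SourcedTable,
  columns: readonly ColumnInfo[],
): string[] {
  if (columns.some((column) => column.name === 'source')) {
    return [];
  }
  return [
    // SQLite fills existing rows with the DEFAULT, so no backfill statement.
    `ALTER TABLE ${table} ADD COLUMN source TEXT NOT NULL DEFAULT 'crawled'`,
    `CREATE INDEX IF NOT EXISTS idx_${table}_source ON ${table} (source)`,
  ];
}
```

A writer-side connect would run the returned statements for both tables, while a read-only attach would skip the call entirely.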
…chive → Database Add CrawlerOptions.inventoryMode (a Set<string> of seed URLs, or null outside inventory). The Crawler emits an optional `source` on `page` / `externalPage` / `response` events: `'inventory-seed'` for URLs that came straight off the user-supplied list, `'inventory-discovered'` for anything found by following links from those pages (or any sub-resource captured by puppeteer while rendering them). Outside inventory mode, `source` is left undefined and the DB DEFAULT `'crawled'` applies. The orchestrator forwards `source` into `Archive.setPage` / `setExternalPage` / `setResources`, which thread it into `Database.insertResource` and `Database.updatePage` → `#insertPage`. Provenance is written ONLY when the row is freshly inserted (in `#getIdByUrl`'s INSERT path); existing rows keep their original label. This is what makes a second `crawl --inventory` non-destructive — a page first discovered as `'inventory-seed'` never gets demoted on later passes. Also moves the `PageSource` enum to the crawler package (its rightful owner) and re-exports it from `@nitpicker/query` so CLI / MCP / viewer keep a single import path. Adds a `derivePageSource` pure helper plus spec covering inventoryMode=null and seed/discover discrimination. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
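The seed/discovered discrimination can be illustrated with a small pure function. This is a sketch under stated assumptions: the real `derivePageSource` signature is not shown in this PR text, `PageSource` is modelled here as a string union rather than an enum, and URLs are compared by exact string match with no normalisation.

```typescript
type PageSource = 'crawled' | 'inventory-seed' | 'inventory-discovered';

/**
 * Hypothetical signature. `inventoryMode` is the set of seed URLs,
 * or null outside inventory mode.
 */
function derivePageSource(
  url: string,
  inventoryMode: ReadonlySet<string> | null,
): PageSource | undefined {
  if (inventoryMode === null) {
    // Outside inventory mode the event carries no source,
    // so the DB DEFAULT 'crawled' applies on insert.
    return undefined;
  }
  return inventoryMode.has(url) ? 'inventory-seed' : 'inventory-discovered';
}

// Usage: a URL from the list versus one reached by following a link.
const seeds = new Set(['https://example.com/hidden-lp']);
const seedLabel = derivePageSource('https://example.com/hidden-lp', seeds);
const linkedLabel = derivePageSource('https://example.com/linked', seeds);
```

Returning `undefined` rather than `'crawled'` keeps the default in one place, the schema.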
…pers `CrawlerOrchestrator.inventory(archivePath, urls)` cross-references a user-supplied URL list against an existing `.nitpicker` and imports only URLs the archive does not yet track. HTML responses become Crawler seeds in `inventoryMode` (rendered pages are labelled `'inventory-seed'`, links followed from them `'inventory-discovered'`); non-HTML URLs land in `resources` directly as `'inventory-seed'` without a browser launch. `.bak` is taken before any mutation and restored on throw — same backup-and-restore contract as `append()`. Guards: list-mode archives and archives with unfinished `pending` URLs are rejected up front, since the latter would otherwise inherit the inventory provenance label by mistake. Scope-foreign URLs are warned and skipped (inventory is per-server, not cross-host). `Archive.getExistingPageUrls` / `getExistingResourceUrls` (backed by chunked `Database.getExistingPageUrls` / `getExistingResourceUrls` reads that stay under SQLite's IN-parameter cap) provide the filter that lets the second (and N-th) `--inventory` pass be a no-op for already-known URLs — `'inventory-seed'` rows are never demoted on later passes. The two DB lookups now run in parallel via `Promise.all`. HEAD probes over novel URLs also run concurrently to keep wall-clock proportional to the slowest single response, not the sum. `deriveResourceSource` pure helper encodes the "sub-resources are never seeds" rule alongside `derivePageSource`; the pair stay in lockstep if `PageSource` grows new variants. Also fixes a missed emit path for external pages (the `!fetchExternal && isExternal` branch in `Crawler.#runDeal`) so it carries the inventory source label like the other emits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
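The chunked existence filter and the parallel lookups can be sketched as follows. The helper names (`chunk`, `getExistingUrls`, `filterNovelUrls`), the injected `UrlLookup` callback, and the chunk size of 500 are assumptions for illustration; the real `Database` methods are not reproduced here.

```typescript
// A lookup runs one `SELECT url ... WHERE url IN (?, ?, ...)` for the
// given URLs and returns those that exist. Injected so the sketch stays
// independent of any SQLite driver.
type UrlLookup = (urls: readonly string[]) => Promise<string[]>;

/** Splits `items` into consecutive slices of at most `size` elements. */
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

/**
 * Collects existing URLs chunk by chunk. SQLite caps bound parameters
 * per statement (999 by default before 3.32.0), so 500 is an assumed
 * conservative chunk size.
 */
async function getExistingUrls(
  urls: readonly string[],
  lookup: UrlLookup,
  chunkSize = 500,
): Promise<Set<string>> {
  const existing = new Set<string>();
  for (const part of chunk(urls, chunkSize)) {
    for (const url of await lookup(part)) {
      existing.add(url);
    }
  }
  return existing;
}

/** Runs the pages and resources lookups in parallel, keeps unknown URLs. */
async function filterNovelUrls(
  urls: readonly string[],
  lookupPages: UrlLookup,
  lookupResources: UrlLookup,
): Promise<string[]> {
  const [pages, resources] = await Promise.all([
    getExistingUrls(urls, lookupPages),
    getExistingUrls(urls, lookupResources),
  ]);
  return urls.filter((url) => !pages.has(url) && !resources.has(url));
}
```

On a second `--inventory` pass every listed URL is already known, so `filterNovelUrls` would return an empty list and nothing downstream runs.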
Adds the `--inventory <urls.txt>` flag to the `crawl` subcommand and a matching `inventoryCrawl()` dispatch that hands the parsed URL list to `CrawlerOrchestrator.inventory()`. Mutually exclusive with `--append` / `--retry-failed` / `--resume` / `--diff` / `--output` / `--list` / `--list-file` / `--single`; the dispatch enforces this with explicit error messages so the operator picks one existing-archive mode intentionally. URL list parsing reuses `@d-zero/readtext/list`, with the same blank-line + `#`-comment conventions as `--list-file`. Empty files raise `No URLs found in inventory file: <path>` before any archive mutation, and `validateUrls()` rejects malformed entries up front so the operator fixes the list before HEAD probes touch the network. crawl.spec.ts gains 13 cases covering the dispatch (happy path, empty file, invalid URL, missing positional, extra positional, plus 8 mutually exclusive flag combinations). pipeline.ts's startCrawl call gets an explicit `inventory: undefined` so the InferFlags shape stays exact. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
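The mutual-exclusion guard can be sketched as a pure check over the parsed flags. The camelCase flag keys, the `findInventoryConflict` name, and the error wording are assumptions for illustration; only the list of conflicting flags comes from the commit message.

```typescript
// Flags that cannot be combined with --inventory (from the commit message).
const EXCLUSIVE_WITH_INVENTORY = [
  'append',
  'retryFailed',
  'resume',
  'diff',
  'output',
  'list',
  'listFile',
  'single',
] as const;

type ExclusiveFlag = (typeof EXCLUSIVE_WITH_INVENTORY)[number];
type CrawlFlags = { inventory?: string } & Partial<Record<ExclusiveFlag, unknown>>;

/** Converts an assumed camelCase key back to its CLI spelling. */
function toCliName(flag: string): string {
  return '--' + flag.replace(/[A-Z]/g, (c) => '-' + c.toLowerCase());
}

/**
 * Returns an error message for the first conflicting flag, or null when
 * the combination is fine. The message text is hypothetical.
 */
function findInventoryConflict(flags: CrawlFlags): string | null {
  if (flags.inventory === undefined) {
    return null;
  }
  for (const flag of EXCLUSIVE_WITH_INVENTORY) {
    const value = flags[flag];
    if (value !== undefined && value !== false) {
      return `--inventory cannot be combined with ${toCliName(flag)}`;
    }
  }
  return null;
}
```

Running this before the file is read means a bad flag combination fails without touching the archive or the network.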
`listIsolatedPages(accessor)` returns internal HTML pages with no inbound anchors, excluding archived roots — the "orphan LP" surface the inventory feature was built to expose. `listUnusedResources(accessor)` returns internal sub-resources that no archived page references — the "file on the server nobody loads" surface. Both queries judge isolation purely by the link graph (`anchors.hrefId IS NULL` / `resources-referrers.id IS NULL`) and IGNORE `source` in the WHERE clause: a `'crawled'` row that lost its referrers and an `'inventory-seed'` row that never gained any both surface here. `source` is returned as a per-row badge for the viewer to render. `info.roots` is JSON-decoded once per call to exclude the seed URLs from the isolated list (otherwise they would dominate the result set). The `?? 'crawled'` fallback on the source column tolerates archives created before the runtime migration ran. Specs cover the three discriminations that matter: link-graph inclusion/exclusion, internal-vs-external filtering, and the `crawled / inventory-seed / inventory-discovered` source badge being read back from the DB column rather than synthesised by the helper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
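The isolation rule can be illustrated over in-memory rows instead of SQL. This is a sketch, not the real query: the row shapes (`PageRow`, `AnchorRow`) and the function name are assumptions, and the HTML-only filter is omitted. It shows that `source` plays no part in the filter and is only carried through for the badge.

```typescript
type PageSource = 'crawled' | 'inventory-seed' | 'inventory-discovered';

// Hypothetical row shapes, loosely modelled on the description above.
interface PageRow {
  id: number;
  url: string;
  isExternal: boolean;
  source?: PageSource | null; // may be missing on pre-migration archives
}

interface AnchorRow {
  pageId: number; // page containing the link
  hrefId: number; // page the link points at
}

interface IsolatedPage {
  url: string;
  source: PageSource;
}

/** Internal pages with no inbound anchors, excluding archived roots. */
function listIsolatedPagesInMemory(
  pages: readonly PageRow[],
  anchors: readonly AnchorRow[],
  roots: readonly string[],
): IsolatedPage[] {
  const linkedIds = new Set(anchors.map((anchor) => anchor.hrefId));
  const rootUrls = new Set(roots);
  return pages
    .filter((page) => !page.isExternal)
    .filter((page) => !linkedIds.has(page.id)) // link graph only
    .filter((page) => !rootUrls.has(page.url)) // roots never have referrers
    .map((page) => ({ url: page.url, source: page.source ?? 'crawled' }));
}
```

A `'crawled'` page that lost its last referrer and an `'inventory-seed'` page that never had one both come back from the same call, differing only in the badge value.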
… and MCP tools `nitpicker query <archive> isolated-pages [--limit --offset]` and `unused-resources [--limit --offset]` surface the new query helpers through the CLI; `--archiveId`-keyed `list_isolated_pages` and `list_unused_resources` MCP tools mirror them for AI assistants. `mapFlagsToQueryOptions` uses a fallthrough case for the two sub-commands since both take only pagination flags — `archiveId` arrives via the MCP server context layer, not as a CLI flag, so no flag-side validation is needed here. The exhaustive switch over `QuerySubCommand` enforces that both cases are handled at compile time. Specs cover the new dispatch paths end-to-end: `dispatch-query.spec.ts` verifies the new query helpers receive the limit/offset pass-through; `map-flags-to-query-options.spec.ts` checks both shapes; the MCP server spec asserts the tool count and the presence of the two new tool names. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
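The fallthrough case plus compile-time exhaustiveness can be sketched like this. The union below is deliberately reduced: `'summary'` is a hypothetical stand-in for the existing sub-commands, and the option shapes and the `Sketch` suffix mark this as an illustration rather than the real `mapFlagsToQueryOptions`.

```typescript
// Reduced union; 'summary' is a hypothetical placeholder.
type QuerySubCommand = 'summary' | 'isolated-pages' | 'unused-resources';

interface PaginationFlags {
  limit?: number;
  offset?: number;
}

type QueryOptions =
  | { kind: 'summary' }
  | {
      kind: 'isolated-pages' | 'unused-resources';
      limit: number | undefined;
      offset: number | undefined;
    };

function mapFlagsToQueryOptionsSketch(
  subCommand: QuerySubCommand,
  flags: PaginationFlags,
): QueryOptions {
  switch (subCommand) {
    case 'summary':
      return { kind: 'summary' };
    // Both new sub-commands take only pagination flags, so they share a body.
    case 'isolated-pages':
    case 'unused-resources':
      return { kind: subCommand, limit: flags.limit, offset: flags.offset };
    default: {
      // If a new member is added to QuerySubCommand without a case,
      // this assignment stops compiling.
      const unhandled: never = subCommand;
      throw new Error(`Unhandled sub-command: ${String(unhandled)}`);
    }
  }
}
```

The `never` assignment is what turns a forgotten case into a type error rather than a runtime surprise.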
Adds `/isolated-pages` and `/unused-resources` views, the matching Hono endpoints (`/api/isolated-pages`, `/api/unused-resources`), TanStack Query hooks, and sidebar entries so the surfaces are discoverable without URL typing. Each table row shows the new `SourceBadge` component, which gives `crawled` / `inventory-seed` / `inventory-discovered` three distinct visual treatments — neutral, warn, muted-warn — so audit operators can scan a long list and spot inventory-sourced rows at a glance. `translations.ts` gains both `nav.*` labels and the full `views.*` view copy for en and ja so the existing en/ja-parity spec keeps passing. CSS adds the three badge variants under the existing `--badge-warn-*` palette so theme switching keeps working. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
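The badge mapping can be sketched as a lookup table, independent of the rendering framework. The variant names follow the order given above (neutral, warn, muted-warn) and the short labels follow the PR summary; the `describeSourceBadge` name and the descriptor shape are assumptions, and the real `SourceBadge` component is not reproduced.

```typescript
type PageSource = 'crawled' | 'inventory-seed' | 'inventory-discovered';
type BadgeVariant = 'neutral' | 'warn' | 'muted-warn';

interface BadgeDescriptor {
  label: string;
  variant: BadgeVariant;
}

// Record<PageSource, ...> makes a missing variant a compile-time error
// if PageSource ever grows.
const SOURCE_BADGES: Record<PageSource, BadgeDescriptor> = {
  crawled: { label: 'crawled', variant: 'neutral' },
  'inventory-seed': { label: 'inv:seed', variant: 'warn' },
  'inventory-discovered': { label: 'inv:disc', variant: 'muted-warn' },
};

/** Falls back to 'crawled' for rows read from pre-migration archives. */
function describeSourceBadge(
  source: PageSource | null | undefined,
): BadgeDescriptor {
  return SOURCE_BADGES[source ?? 'crawled'];
}
```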
Adds `inventoryRoutes` to the test-server (a hidden landing page, a page linked only from it, and a stray PDF) so the orchestrator E2E can exercise the three labelling paths end-to-end: an HTML URL on the list gets `'inventory-seed'`, a page reached by following links from a seed gets `'inventory-discovered'`, and a non-HTML URL on the list lands in `resources` as `'inventory-seed'` without a browser launch. The third path (the `discovered` page) exposed a real gap: when an anchor from a seed page named a URL the crawl had never seen, `#getIdByUrl`'s INSERT path created a placeholder row with the DB DEFAULT `'crawled'`, and the subsequent scrape went through the UPDATE branch — which deliberately did NOT touch `source` — leaving the row mis-labelled. The fix is a one-way source promotion in `#insertPage`: the UPDATE now sets `source = CASE WHEN source = 'crawled' THEN ? ELSE source END`. So a placeholder labelled `'crawled'` gets promoted to its inventory variant on the rendering scrape, while a row that was already labelled `'inventory-seed'` (or `'inventory-discovered'`) is never demoted on later passes. README + CLAUDE.md gain the `--inventory` mode description alongside the existing `--append` / `--retry-failed` docs so operators discover the feature from the entry-point reference. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
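The one-way promotion can be mirrored in TypeScript to make its truth table explicit. This is a sketch of the rule expressed by `CASE WHEN source = 'crawled' THEN ? ELSE source END`, not code from the PR; the `promoteSource` name is an assumption.

```typescript
type PageSource = 'crawled' | 'inventory-seed' | 'inventory-discovered';

/**
 * Mirrors the SQL expression
 *   CASE WHEN source = 'crawled' THEN ? ELSE source END
 * where `current` is the stored value and `incoming` is the bound parameter.
 */
function promoteSource(current: PageSource, incoming: PageSource): PageSource {
  return current === 'crawled' ? incoming : current;
}

// A placeholder created via an anchor is promoted on the rendering scrape.
const promoted = promoteSource('crawled', 'inventory-discovered');
// A seed row keeps its label when a later pass reaches it by link.
const kept = promoteSource('inventory-seed', 'inventory-discovered');
```

Because only `'crawled'` is ever overwritten, applying the update repeatedly gives the same result as applying it once.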
…urces' clash
The new 'Unused Resources' sidebar entry made the existing
`getByRole('link', { name: 'Resources' })` resolve to two links under
Playwright's strict mode (substring match), failing the
"サイドバーから各ビューへ遷移できる" ("can navigate from the sidebar to each view") test. `exact: true` pins the
lookup to the original 'Resources' entry without touching the new one.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
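Adds `crawl --inventory <urls.txt>`, which cross-references a URL list obtained on the server side against an existing `.nitpicker` archive and imports only the URLs that are not yet registered. "Orphan landing pages" and "files on the server that nothing uses", neither of which is reachable through the crawl's link graph, become discoverable from the query CLI, MCP, and the viewer.
Goal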
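Behavior
- The archive is opened with `Archive.open()` and `<archive>.bak` is created (restored on failure, deleted on success).
- The URL list (`#` comments allowed) is parsed with `@d-zero/readtext/list`.
- URLs already present in `pages` / `resources` are skipped.
- HTML URLs are crawled with `source='inventory-seed'` (taken directly from the list) / `'inventory-discovered'` (reached recursively).
- Non-HTML URLs are registered in `resources` with `source='inventory-seed'`.
- Sub-resources captured while rendering get `source='inventory-discovered'`.

Mutually exclusive with `--append` / `--retry-failed` / `--resume` / `--diff` / `--output` / `--list` / `--list-file` / `--single`. Second and later `--inventory` runs do not touch existing rows, so they are non-destructive (`'inventory-seed'` is never demoted).
Output surfaces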
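- CLI: `nitpicker query <archive> isolated-pages` / `unused-resources`
- MCP: `list_isolated_pages` / `list_unused_resources`
- Viewer: `/isolated-pages` / `/unused-resources` views with `SourceBadge` (crawled / inv:seed / inv:disc)

Isolation is judged by zero referrers regardless of `source`, with archived roots excluded. The `source` column is returned only for badge display.
DB schema changes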
- `pages.source TEXT NOT NULL DEFAULT 'crawled'` + `INDEX(source)`
- `resources.source TEXT NOT NULL DEFAULT 'crawled'` + `INDEX(source)`
- Runtime auto-migration via `migratePagesResourcesSource` (idempotent; does not run on read-only viewer attach)
- In the `#insertPage` UPDATE, `source` only undergoes a one-way promotion via `CASE WHEN source = 'crawled' THEN ? ELSE source END`: a placeholder created earlier through an anchor (DEFAULT `'crawled'`) is promoted to the correct inventory label on the rendering scrape, and rows already labelled `'inventory-*'` are not demoted on later passes.
Commit breakdown
- 1a78fe1 feat(crawler): add `source` column to pages/resources + types + migration
- 109d346 feat(crawler): wire `source` through Crawler → Archive → Database
- 869e99e feat(crawler): `CrawlerOrchestrator.inventory()` + existing-url helpers
- 61b9054 feat(cli): `crawl --inventory` flag + mutual-exclusion validation
- 5a467d3 feat(query): `listIsolatedPages` / `listUnusedResources`
- e0c0c5c feat(cli,mcp): query CLI subcommand + 2 MCP tools
- f4939ed feat(viewer): viewer routes + UI + `SourceBadge`
- dbe71e1 test(crawler): inventory E2E + fixtures + finishing docs + source promotion fix
- efb5226 test(viewer): fix the `Resources` nav lookup with `exact: true` (ambiguity introduced by the new `Unused Resources` entry)
Test plan
- `yarn build` (14 packages)
- `yarn lint` (CSpell 0 issues)
- `yarn test` (1147 unit pass / 5 skipped)
- `yarn vitest run --config vitest.e2e.config.ts packages/test-server/src/__tests__/e2e/inventory.e2e.ts` (3 cases pass: inventory-seed / inventory-discovered / non-HTML inventory-seed)
- `yarn workspace @nitpicker/viewer test:e2e` (26 cases pass, Playwright)
Example
🤖 Generated with Claude Code