fix(crawler): converge --retry-failed via classifier expansion, rescues, and scope-cred guard by YusukeHirao · Pull Request #90 · d-zero-dev/nitpicker

YusukeHirao · 2026-06-19T11:23:53Z

Summary

This branch started as a tiny fix for an interval-delay quirk on DNS-burned hosts and grew into a wider hardening pass on the crawl convergence story — keeping --retry-failed honest, classifying every observable failure shape, and adding two new safety nets for puppeteer-side oddities. Also fixes a credential-leak guard for the scope-auth flow.

Convergence: `--retry-failed` stops looping on permanent failures

New PERMANENT_ERROR_KINDS derived constant; Database.resetFailedPages now consults classifyErrorKind and refuses to reset rows whose latest message is permanent (NXDOMAIN, expired cert, ERR_BLOCKED_BY_CLIENT, HTTP parse error, ECONNREFUSED). Without this, every pass replayed the same failure forever.
New client-blocked ErrorKind for Chromium's ERR_BLOCKED_BY_* family, plus dns-transient split (EAI_AGAIN no longer punishes hosts via the DNS-burned cache).
Classifier matchers tightened: tls regex anchored against false-positive altnames substrings, protocol matcher catches Attempted to use detached Frame and the Page.<method> returned family.

Rescue 1: HEAD-resolved chain when puppeteer dies on the destination

When HEAD pre-flight resolved a real 3xx chain but puppeteer threw on the final URL (HTTPS→HTTP downgrade is the canonical case), record the source as a redirect-edge instead of status=-1. Avoids the retry-forever loop on URLs whose chain truth is already known.

Rescue 2: JS-redirect via `page.url()`

New buildJsRedirectEdge + deriveJsRedirectTarget + isJsRedirectErrorShape helpers. When puppeteer's page.goto() resolves to null (window.location.replace(...) / <meta refresh>) and page.url() reports a meaningfully-different destination, record the source as a redirect-edge. WHATWG URL canonicalisation filters case / slash / port / credential noise; browser-internal schemes are rejected.
One late-stage bug fix: beholder's scrapeStart returns {type:'error'} rather than throwing on internal #fetchData failures, so the page.url() capture had to be added on the return path too — the original catch-arm-only capture would have made the rescue dead in production.

Rescue 3: HEAD/GET both dead → one puppeteer try

When HEAD pre-flight + its GET fallback both exhaust retries on a URL that looks HTML-shaped, and the residual error suggests middlebox interference (timeout / connection-reset / parse-error), launch puppeteer exactly once. Some WAF configurations drop bare probes but answer real navigations.

Throughput / hygiene

HEAD pre-flight timeout escalates 10s → 30s → 60s per retry attempt (was fixed 10s) so slow but eventually-responsive hosts get rescued without paying the slow-server tax on the happy path.
GET fallback inside fetchDestination when HEAD response is unusable (405 / parse-error / etc.).
Per-URL interval delay is skipped for hosts in the DNS-burned cache — no point making the crawl wait between hits to a host that will short-circuit anyway.

Security: scope-credential leak

page.authenticate is now registered unconditionally to drain Chromium's native HTTP-auth dialog (previously hung navigations on sub-resources demanding Basic auth).
URL-embedded scope credentials (user:pass@host) are stripped before page.goto() so they never enter Chromium's HTTP-auth cache and never leak to cross-origin sub-resources on the same IP/port. Pinned by a new E2E (scope-auth-leak.e2e.ts).

Query / viewer

getSummary excludes excludes-pattern pages from status / content-type distributions (they were inflating the Unknown bucket).
Viewer translates the new client-blocked / dns-transient / local-network / parse-error ErrorKind buckets.

Docs

CLAUDE.md / ARCHITECTURE.md updated with the rescue paths, the --retry-failed convergence rule, and the ErrorKind derived-constants review checklist.

Live verification

A previously stuck source row (status=-1 / UnknownError after 15 --retry-failed passes on a real archive) flipped to status=301 Moved Permanently with redirectDestId set after one run with the rescue. Removing it from the retry pool permanently. Debug log: JS-redirect rescue fired for <source> → <dest> (HEAD available) followed by Set redirect / Set redirected url archive writes.

Test plan

yarn build — all 14 projects pass
yarn test — 1861 unit tests pass (5 skipped)
yarn vitest run --config vitest.e2e.config.ts — 25 files / 113 E2E pass
yarn lint — eslint + prettier + cspell + textlint + markuplint all clean
code-review xhigh sweep — all surviving findings addressed
live crawl on a real archive verifies the JS-redirect rescue end-to-end
CI green

🤖 Generated with Claude Code

Move the per-URL interval wait out of dealer and into the worker callback so that hosts already known to be DNS-burned can skip it. With dealer-managed interval the wait fires before #sendHeadRequest sees the cache hit, so every short-circuited URL still consumed the configured interval — exactly the cost the cache was meant to avoid. The non-burned path keeps the same delay() + %countdown(...) wait so the dealer display reads identically; only burned hosts now pull URLs through with no spacing.

…ame code When pre-flight HEAD returned 405 / 501 / 503 and the GET fallback was also rejected with the same status, fetch-destination used to throw an Error. The orchestrator then recorded the page as `status: -1` / `statusText: 'UnknownError'`, losing the only useful diagnostic the server had given us (e.g. a real 503 Service Unavailable became indistinguishable from a DNS failure in the archive). When the second-pass status matches the first, the server's verdict is genuine — resolve with the PageData instead so the archive remembers the real status. The HEAD-only quirk path (different status on GET) is unchanged.

…error Extend ErrorKind with three new buckets so the noise previously dumped in `unknown` becomes actionable: - `dns-transient` — `EAI_AGAIN` and `EREFUSED`, evaluated before `dns` so the shared `getaddrinfo` token does not pull a local-resolver hiccup into the persistent DNS bucket. The DNS-burned host cache (which marks on `=== 'dns'`) therefore stops learning these transients, removing the "WiFi blip burns every host" failure mode. - `local-network` — Chromium `ERR_INTERNET_DISCONNECTED`, `ERR_NETWORK_CHANGED`, `ERR_NETWORK_IO_SUSPENDED`, `ERR_ADDRESS_UNREACHABLE`, `ERR_NETWORK_UNREACHABLE`, plus Node `ENETUNREACH`, `EHOSTUNREACH`, `EADDRNOTAVAIL`, `ENOTCONN`, `EPIPE`. Catches OS-sleep / WiFi-change / unreachable-route symptoms that were silently being labelled unknown. - `parse-error` — `Parse Error`, `Expected HTTP/`, `Unexpected end of stream`. Distinguishes broken proxies and corrupt responses from protocol-level browser failures. `timeout` is widened to also catch the `Timeout: <url>` form emitted by `NetTimeoutError` (the HEAD pre-flight race) so it stops landing in `unknown`. The SQL pre-filter in `Database.listDnsBurnedHostCandidates` drops `%EAI_AGAIN%` to mirror the new classification: only persistent NXDOMAIN-shaped messages flow into the DNS-burned cache preload. cspell: allow NXDOMAIN, libc, RTSP — vocabulary used by the new tests.

A flat 10s race timeout on every HEAD retry kept missing slow-but-reachable sites (government / overloaded servers that need 20-40s to answer). The work function inside `Crawler.#sendHeadRequest` now tracks the attempt index and passes a `[10s, 30s, 60s]` budget to `fetchDestination`, so the first try stays fast for healthy hosts while later retries give a slow server real room to respond. Attempts past the array length clamp to the final entry so any `retry > escalation.length - 1` configuration just keeps the longest budget. `fetchDestination` gains a `timeout` option (default 10s, preserving back-compat for non-crawler callers) and forwards it to the 405/501/503 GET fallback so the second-pass request uses the same budget as the original HEAD attempt. `NetTimeoutError` results are intentionally NOT written into `destinationCache`: caching them would let a future retry re-throw the stale 10s timeout instead of letting the 30s/60s attempt actually run. Test stub for `retryCall` now honours `retries`, invoking the fn up to `retries + 1` times so the new escalation behaviour can be observed without paying the actual back-off seconds.

Add en/ja labels for dns-transient, local-network, and parse-error so the Summary -1 breakdown and the Errors view stop showing raw kind strings (the i18n fallback) for the three buckets that used to land in unknown. Also widen the get-error-kind-label round-trip test to cover every member of the eleven-entry ErrorKind union.

Two WHY callouts in the DNS-burned host cache section: - Why `EAI_AGAIN` is classified as `dns-transient` rather than `dns` (avoids the "WiFi blip burns every host" failure mode, codifies the matcher-order contract). - Why HEAD pre-flight timeout escalates `10s → 30s → 60s` across retries (slow-but-reachable servers get a real chance without making every healthy URL pay the slow-server tax up front).

Three classification holes that were silently dumping rows into `unknown`, identified by reading the live crawl_errors / page_errors of an in-flight archive: - New `client-blocked` kind for Chromium's ERR_BLOCKED_* family — per the upstream net_error_list.h comment, `ERR_BLOCKED_BY_CLIENT` is literally "The client chose to block the request." (i.e. it is the browser's deliberate refusal, not a server response). Covers BY_CLIENT, BY_ADMINISTRATOR, BY_RESPONSE, BY_CSP, BY_ORB, BY_FINGERPRINTING_PROTECTION, IN_INCOGNITO_BY_ADMINISTRATOR, CLEARTEXT_NOT_PERMITTED, NETWORK_ACCESS_REVOKED. - Tls matcher gains `Hostname/IP does not match certificate` and `altnames` so Node's tls hostname-mismatch errors stop landing in `unknown`. - Timeout matcher swaps `^Timeout:` (line-anchored) for `Timeout: https?:` (URL-tailed). Beholder wraps retry-exhausted errors as `[Retried N times] Timeout: <url>`, which never has `Timeout:` at line start; the anchored form was missing every archived NetTimeoutError record.

WAF/CDN middleboxes routinely refuse HEAD with parse-error/connection reset/timeout while still accepting GET — the HEAD-only pre-flight was therefore mislabelling reachable hosts as unreachable, exhausting their retry budget and leaving them as status=-1 in the archive. When the HEAD attempt returns a NetTimeoutError, a connection-reset, or a parse-error, retry once with method=GET reusing the same cache key and userAgent. Only these three causes trigger the fallback: a proper 4xx/5xx response is already a definitive answer and is left alone (the existing 405/501/503 ladder still applies). NetTimeoutError is deliberately not cached so the crawler-level retry budget keeps applying across attempts — caching would freeze the first slow probe as the verdict for every later URL on the host.

Mirror the new `client-blocked` ErrorKind from the crawler in the viewer dictionaries (en / ja) and add it to the round-trip union test, so the Errors panel surfaces "Blocked by browser" / "ブラウザがブロック" instead of falling back to the raw `client-blocked` slug.

Pages discovered via link extraction and immediately matched by an `excludes` pattern are still inserted into the `pages` table (so the discovery layer can dedupe further occurrences) but never actually scraped — they carry `isSkipped = 1` and NULL `status` / `contentType`. Counting them in the Summary status histogram and content-type distribution produced a confusing all-NULL bucket (2,799 rows on the archive that surfaced the bug, dwarfing the real 5xx tail) that obscured the actual crawl outcome. Skip them in both aggregations. `OR isSkipped IS NULL` keeps pre-flag-era rows visible.

`--retry-failed` was not converging across iterations: NXDOMAIN / expired-cert / `ERR_BLOCKED_BY_CLIENT` / HTTP parse-error / refused TCP hosts all met the SQL fail-shape filter (`status=-1` / NULL / 5xx), got reset every pass, were re-attempted, failed identically, and rejoined the candidate pool — leaving the retry target stuck at its original size forever. `resetFailedPages` now resolves the latest recorded message for each SQL candidate (`page_errors` → `crawl_errors`, no `error.log` to keep the writer dependency surface narrow) and filters out any candidate whose message classifies into `PERMANENT_ERROR_KINDS`. An absent message keeps the candidate (still `unknown`, still retried) so the change errs on the side of investigation. The set is pinned by `permanent-error-kinds.spec.ts` so accidental widening triggers a CI failure; the membership rationale (and the "deliberately retryable" cases — `dns-transient`, `connection-reset`, `timeout`, `local-network`, `protocol`, `unknown`) is documented in the constant's JSDoc. Known limitation: pre-`crawl_errors`-era archives whose failures live only in `error.log` lose the exclusion — `getFailedPageMessages` deliberately omits the text-log parser. Run one fresh crawl pass first to populate the structured tables. Documented in the helper's JSDoc.

…HTML URL When the HEAD pre-flight (+ its GET companion in `fetchDestination`) exhausts retries on a URL that looks HTML-shaped, with a residual error that suggests middlebox / WAF interference rather than the host being unreachable (kinds `timeout` / `connection-reset` / `parse-error`), launch a full puppeteer navigation exactly once as a last-chance safety net. Some real-world WAF configurations drop bare HEAD/GET probes but answer a real browser navigation — those URLs were otherwise lost as `status=-1` indefinitely. The fallback set is deliberately narrow: - `dns` / `dns-transient` / `tls` / `client-blocked` / `connection-refused` / `local-network` / `connection-timeout` all reproduce the same outcome under puppeteer, paying browser-launch cost for nothing. - `protocol` is a puppeteer-side race; bouncing back to puppeteer recreates it. - `unknown` is excluded by design — Chromium per unclassifiable error is too expensive. When a real WAF pattern lands in `unknown`, add a matcher to `classifyErrorKind` so it bubbles into one of the three included kinds. `PreloadShortCircuitError` is filtered out automatically: its synthesised `getaddrinfo ENOTFOUND` message classifies as `dns`. Result handling: - `fallback.type === 'success'` → same post-processing as the normal path (`markBrowserScrape`, `#scrapedDestinations` claim). - `fallback.type === 'skipped'` → propagated as a skip (excludeKeywords hit inside puppeteer); without this branch the page would be recorded as `status=-1` with the HEAD's timeout message, conflating operator-intent skip with network failure. - `fallback.type === 'error'` or `#launchBrowserAndScrape` throws → `Unreachable (fallback failed)` lane label, HEAD's original error surfaced in `crawl_errors` (puppeteer's noisier wrapper would bury the root cause). `metadataOnly` mode opts out — the path is a bandwidth-saving scrape that wouldn't have launched the browser anyway.

Capture the WHY behind the two new safety nets so future maintainers adding a kind to `ErrorKind` know which derived constants need a parallel review: - `PERMANENT_ERROR_KINDS` (`--retry-failed` convergence contract) and - `PUPPETEER_FALLBACK_KINDS` (HEAD/GET → puppeteer one-shot retry). Adding an ErrorKind without checking both sets risks a silent regression — `--retry-failed` looping on a permanent failure forever or a recoverable URL never reaching the puppeteer safety net. CLAUDE.md / ARCHITECTURE.md / README.md updates: - Replace the stale "8 種" enumeration with a pointer to the `ErrorKind` union (12 kinds today; the source-of-truth is the code). - Document the GET fallback inside `fetchDestination` and the puppeteer fallback inside `Crawler.#scrapePage` as the two-layer middlebox safety net (parse-error / connection-reset / timeout). - Note the `--retry-failed` convergence rule and the known limitation for pre-`crawl_errors`-era archives whose failures live only in `error.log`. - README's `--retry-failed` block gains one sentence on the new permanent-kind exclusion so users reading "なぜ retry 対象が減った？" find the answer without diving into source. `types.ts` JSDoc adds a "derived constants that MUST be reviewed when this union changes" section right under the kind table, pointing at both derived constants.

…TTP-auth dialogs `page.authenticate` was only called when the URL itself carried `user:pass@` credentials, so a page whose sub-resource (third-party script, ad pixel, abandoned CDN domain that now serves `WWW-Authenticate: Basic ...`) triggered Chromium's NATIVE HTTP-auth dialog had no handler — and Chromium blocks the request waiting for input that never arrives, hanging the entire page navigation until puppeteer's navigation timeout fires. Downstream that surfaces as a `timeout` ErrorKind even though the root cause is "one resource demands Basic auth, nobody at the dialog." Unlike `alert/confirm/prompt`, the HTTP-auth dialog is NOT captured by `page.on('dialog')` — the only way to drain it is to register credentials via `page.authenticate` BEFORE navigation. Registering with empty strings makes Chromium respond immediately with an empty `Authorization` header. By spec the server replies with another 401 (or accepts, on the rare intentional public endpoint), so the offending sub-resource fails fast as a definitive HTTP-level failure and the rest of the page continues rendering — no hang either way. URL-embedded credentials still win when present, so existing scope-auth flows behave identically. `?? ''` because ExURL types `username` / `password` as `string | null` while puppeteer's `Credentials` interface needs `string`.

…op cross-origin leak Previous behaviour: when a crawl was seeded with a scope-auth URL (`http://user:pass@host/`), puppeteer's page.goto received the URL verbatim. Chromium promoted those credentials into its HTTP-auth cache keyed on (scheme, host, port, realm), and the network stack then re-attached the cached `Authorization` header to every subsequent same-origin request — INCLUDING cross-origin sub-resources that share the same IP / port (e.g. an embedded `<img src="http://127.0.0.1:8010/…">` loaded from a `localhost:8010` page). The `Fetch.authRequired` event never fires for these pre-emptive cache attachments, so nothing inside puppeteer or this crawler could filter the leak after the fact. Fix: rebuild the navigation URL with `parseUrl(url.href)` and clear its `username` / `password` before handing it to `scrapeStart`. The copy is detached from the queued URL, so other code paths (`linkList`, redirect tracking, dealer dedup) still see the credentialed form. The scope auth is still applied via `page.authenticate({user, pass})` — invoked unconditionally so that in-scope challenges resolve and Chromium's native HTTP-auth dialog gets drained (empty credentials suffice; sub-resources demanding Basic auth then fail with a definitive 401 instead of hanging the navigation until the puppeteer timeout fires). The new E2E in `test-server` pins both halves of the contract: removing either the auth call (main 401 hangs) or the URL strip (scope cred reaches the off-scope endpoint) makes the test fail.

…ss-origin E2E A 401-protected `/scope-auth-leak/main` route embeds an `<img src="http://127.0.0.1:8010/scope-auth-leak/external-asset.png">`, which is treated as off-scope because the hostname differs from the crawl seed (`localhost`) even though both resolve to the same IP / port — the canonical "same-IP cross-origin" shape used elsewhere in this test-server for off-scope checks. The external endpoint records every `Authorization` header it observes. The E2E asserts both halves of the contract: 1. The in-scope main page must scrape successfully — proves the crawler IS forwarding the scope credentials to in-scope origins, so the test cannot trivially pass by sending no auth anywhere. 2. NO observed Authorization header on the off-scope endpoint may carry the scope credentials (full-string match, plus a substring defence-in-depth against future variants of the leak). Removing either side of the crawler fix (the `page.authenticate` call OR the URL-strip before `scrapeStart`) makes this test fail. A small `/scope-auth-leak/reset` endpoint is exposed so the test runs independently of any previous run's recorded state.

Five small but real correctness issues surfaced by an adversarial recall-mode review of the branch — collected into one fix because each is a one-or-two-line tightening of an existing landing point, and the regression-test coverage for them lives in the same set of spec files. - **TLS regex anchored** (`classify-error-kind.ts`): the bare `altnames` alternative matched any message containing the substring — including a request path like `/altnames/lookup` in a 5xx body. Since `tls ∈ PERMANENT_ERROR_KINDS`, a false positive permanently excluded that page from `--retry-failed`. Now anchored to `certificate'?s? altnames` (covers both the Node `Hostname/IP does not match certificate's altnames` and the bare `certificate's altnames` phrasings). New `does NOT misclassify the word "altnames"…` test pins the guard. - **Navigation URL rebuilt, not mutated** (`crawler.ts`): the credential-leak guard previously did `parseUrl(url.href); navigateUrl.username = ''; ...`. ExURL pre-computes `href` / `withoutHash` / `withoutHashAndAuth` at parse time, so the post-hoc field assignment left those derived strings carrying `user:pass@host`. The leak guard happened to work only because beholder's current `scrapeStart` reads `url.withoutHashAndAuth` for `page.goto` — a future bump that consumes `url.href` instead would silently re-introduce the leak with the existing E2E still passing. Now built from `parseUrl(url.withoutHashAndAuth)` so every derived field is credential-free. - **Cache exemption widened** (`fetch-destination.ts`): only `NetTimeoutError` was excluded from `destinationCache`. HEAD `parse-error` / `connection-reset` (the kinds `shouldGetFallbackOnHeadFailure` already singles out as possibly-recoverable) got cached on attempt 1, so the 30s/60s `HEAD_TIMEOUT_ESCALATION_MS` retries hit the cache, re-threw the stale 10s failure, and the escalation never reached the network. The exemption now mirrors `shouldGetFallbackOnHeadFailure`'s kind set. - **Empty `page_errors.message` falls through** (`get-failed-page-messages.ts`): a scraper phase can record a trigger event with an empty message. The old `if (!messageByPageId.has(id))` short-circuit treated the empty string as "resolved" and skipped the `crawl_errors` lookup — so a page whose `page_errors` row was empty but whose `crawl_errors` row classified as `dns` / `tls` / `client-blocked` ended up looking like `unknown` to `--retry-failed`, defeating convergence on exactly the cases the guard is supposed to catch. - **Fallback `'error'` branch logs** (`crawler.ts`): `#launchBrowserAndScrape` catches its own exceptions and returns `{type:'error', shutdown:...}` rather than throwing, so the catch arm never saw that path. The puppeteer-side failure mode (including any `shutdown` flag the scraper attached) was invisible to operators reading the diagnostic log. Now emits a `crawlerLog('Puppeteer fallback returned error for %s: %s (shutdown=%s)', ...)` breadcrumb before falling through to surface the HEAD error as the user-visible cause. Regression tests added: `database.spec.ts` pins the all-permanent → `[]` short-circuit and the `[Retried N times] getaddrinfo ENOTFOUND` wrapped form. `get-failed-page-messages.spec.ts` covers the empty-message fallthrough and the `crawl_errors.url IS NULL` defensive guard.

…e predicate The `where('isSkipped', 0).orWhereNull('isSkipped')` carve-out is applied twice in `getSummary` (once for the page-status histogram, once for the content-type distribution) — and a near-identical inline form lives in `Database.resetFailedPages` over in the crawler package. The code-review xhigh sweep flagged the duplication: any future panel that aggregates `pages` without the carve-out re-introduces the "Unknown / Errored" bucket bloat the filter was added to eliminate, and three copies of the same predicate make the convention easy to miss. This commit pulls the in-package duplication into `exclude-skipped-pages.ts` (one file, one export, JSDoc the rationale once) and rewrites the two call sites in `getSummary` as `.where(excludeSkippedPages)`. `Database.resetFailedPages` keeps its inline copy — `@nitpicker/crawler` cannot depend on `@nitpicker/query` (reverse direction), so cross-package deduplication would require either a tier-0 helper package or a re-export contract neither of which is justified for a 2-line knex predicate. The crawler's own JSDoc cross-links the convention.

…spec missing helpers QA review (qa-engineer skill) flagged two test gaps that turn into silent regressions if the underlying logic is ever changed: - The cache-exemption rule inside `fetch-destination.ts` is the guard that lets `HEAD_TIMEOUT_ESCALATION_MS` actually escalate on parse-error / connection-reset retries. It was inlined as the `shouldGetFallbackOnHeadFailure` private helper, with no direct unit coverage — removing the helper or narrowing its kind set would not fail a single existing spec. - `excludeSkippedPages` (the new `getSummary` predicate dedup helper from the earlier code-review fix) had no spec at all, violating the project rule "1 関数 1 ファイルにはユニットテスト必須". Promote `shouldGetFallbackOnHeadFailure` to its own `should-get-fallback-on-head-failure.ts` module (one file, one export, JSDoc the contract) so `fetch-destination.ts` consumes it via import. Pin every branch of the kind set with a `should-get-fallback-on-head-failure.spec.ts`: NetTimeoutError / parse-error / connection-reset opt IN; dns / tls / connection-refused / connection-timeout / timeout / client-blocked / unknown opt OUT. Each `expect` matches one documented branch in the helper's JSDoc, so changing the helper without updating the spec is now impossible to do silently. `excludeSkippedPages` gets a spec that walks all three states (`isSkipped = 0` / `1` / `NULL`) against a real `pages` table through `Archive.create`. Demonstrates the legacy-NULL backward- compat behavior and the empty-result edge case.

…ghten scope-leak assertion Two QA-review gap closures in the E2E layer: - `retry-failed.e2e.ts` gains a third `describe` block that proves the `PERMANENT_ERROR_KINDS` exclusion (Layer 2 of this PR) converges across multiple `--retry-failed` passes. Baseline crawl seeds a status=500 candidate from `/flaky/recoverable`, then injects a `getaddrinfo ENOTFOUND` message directly into `page_errors` to classify the candidate as `dns ∈ PERMANENT_ERROR_KINDS` (real DNS failures can't be simulated hermetically on CI runners). Healing the endpoint and running `retryFailed` TWICE in a row then proves the page stays at status=500 — if the exclusion were removed, the first pass would heal it to 200. The previous test set only covered "5xx page heals on retry"; nothing pinned "permanent kinds are excluded across iterations". - `scope-auth-leak.e2e.ts` per-header loop drops the nested `if (header !== null)` and folds the four assertions into a single contract: `not.toBe(SCOPE_AUTH_HEADER)` + `not.toContain(SCOPE_USER / PASS)` + `expect([null, EMPTY_AUTH_HEADER]).toContain(header)`. Same invariant, no branching inside the test body — clearer about what the leak guard is actually promising. `/scope-auth-leak/external-asset.png` substring assertion uses `header ?? ''` so the null path stays valid without an `if`.

… frame" from unknown A live archive (~1700 crawl_errors rows) had 53 messages slipping into `unknown` despite having clear classification signal. Two matchers were under-reaching: - **tls** missed the bare Node phrasing `certificate has expired` — it emits without an `ERR_CERT_*` / `ERR_SSL_*` prefix when the peer's `notAfter` is past. 9 distinct expired-cert hosts were falling out of tls and (since `tls ∈ PERMANENT_ERROR_KINDS`) also losing the `--retry-failed` exclusion meant for permanent failures. - **protocol** required `frame (?:was |got )?detached` word order and missed puppeteer's current `Frame.ts` emit "Attempted to use detached Frame '<id>'" (reverse order). 9 distinct frame IDs landed in unknown. The tls token is end-anchored as `certificate has expired\s*$` (NOT a bare substring) so application-layer error bodies that happen to echo the phrase mid-string — e.g. an upstream JSON 5xx body that includes "License certificate has expired (LICENSE_EXPIRED) — please renew" — do NOT misclassify as tls (which would permanently exclude them from `--retry-failed`). The protocol token is anchored to puppeteer's exact prefix `Attempted to use detached frame` (its current `Frame.ts` emitter, the `/i` flag catches the uppercase `Frame` form automatically), not the bare two-token substring. A console / logger message that mentions the same two tokens in unrelated context falls through to unknown. Regression coverage: - positive: bare `certificate has expired` + `[Retried 3 times]` wrapper both classify as tls - negative: three mid-string `certificate has expired` shapes fall through to unknown - trailing-whitespace bound: space / tab / newline forms still classify - positive: puppeteer `Attempted to use detached Frame '...'` + suffix variants classify as protocol - case insensitive: lowercase + mixed case both work - negative: unrelated `detached frame` console diagnostics fall through - wildcard pin: `Page.<method> returned` matches reload / evaluate / close beyond the original `goto` observation These changes are derived-on-read — no archive re-crawl is needed. Just re-opening an existing archive with the new library reclassifies the affected rows from `unknown` to `tls` / `protocol` automatically.

…eturns null A URL that the HEAD pre-flight resolves through a 3xx chain to a status-200 final destination, but whose final destination puppeteer then fails to render (e.g. HTTPS→HTTP downgrade redirects that Chromium silently refuses to follow, surfacing as "The method Page.goto returned null"), used to drop the entire chain on the floor: the page row landed at `status = -1 / UnknownError` with `redirectDestId = NULL`, then re-entered `--retry-failed` every iteration because the SQL fail-shape filter caught it AND no `redirectDestId` filtered it out. Two cooperating fixes: - **`Crawler.#scrapePage`** now folds the browser-failure-with-known- redirect-chain case into the existing `redirect-edge` path. When `browserResult.type === 'error'` AND `headCheckResult.redirectPaths.length > 0`, the chain that HEAD already proved is authoritative — return `redirect-edge` so the orchestrator routes it through `Archive.setRedirect` → `recordRedirect` → `#linkRedirectSources`, which sets `redirectDestId` and excludes the row from `--retry-failed` naturally. - **`#linkRedirectSources`** now stamps a redirect source's row with `status = 301 / statusText = 'Moved Permanently'` ONLY when the row carries no definitive status yet (NULL or `-1`). HEAD pre-flight does not retain each hop's individual status code (`redirectPaths` is a URL[] without statuses), so any 3xx hop must collapse to one representative — 301 is the canonical pick. Existing definitive statuses (200 / 302 / 307 / etc.) from prior direct scrapes are preserved verbatim — the conditional flip avoids rewriting accurate history during migration. Result for the migration shape: opening an existing archive with the new library and running `--retry-failed` reclassifies legacy HTTPS→HTTP-downgrade rows as 301 redirect sources and drops them out of the retry target set. For fresh crawls, 302/307 URLs that get their status captured directly keep their accurate code; only the "never directly scraped" placeholder rows pick up the 301 marker. Regression coverage in `database.spec.ts`: - redirect source whose row was `-1 / UnknownError` is flipped to `301 / Moved Permanently` - redirect source whose row already had `302 / Found` is NOT overwritten - redirect source whose placeholder row had NULL status is stamped 301

When `page.goto()` resolves to `null` because a client-side `window.location.replace(...)` / `<meta refresh>` fires before the original response materialises, the scraper throws `The method Page.goto returned null` — classified as `protocol`, neither permanent nor a puppeteer-fallback kind. Without a rescue the source row is persisted as `status = -1` and `--retry-failed` resets it every pass without converging. `buildJsRedirectEdge` reads `page.url()` from the still-alive puppeteer page after the throw, canonicalises and credential-strips the destination, and synthesises a `RedirectEdgeResult`. The dest is re-enqueued; `linkList.done` is told NOT to fold the dest into the done-set (`completion.includeRedirectPaths: false`) so the add() can route it into pending. Trigger is narrow: only the exact `Page.goto returned null` sentinel AND a meaningfully-different post-navigation URL (WHATWG canonicalisation filters case/slash/ port/credential noise; browser-internal schemes are rejected). Two call sites share the helper: - HEAD-success then puppeteer-throw: PageData spreads HEAD, preserving the real HTTP 200 status. - HEAD-fail then puppeteer-fallback then fallback-throw: PageData is a `linkToPageData` placeholder with status=-1 (later flipped to 301 by `#linkRedirectSources`). Genuine browser failures (TLS / target-closed / timeout) fall through unchanged. `RedirectEdgeResult.source` discriminator (`http-chain` / `js-redirect`) lets `#runDeal` decide between fold-into-done-set (http-chain) and enqueue-the-dest (js-redirect). Test-server gets a `/js-redirect/` route + E2E that exercises the client-side window.location.replace shape. Note: under the current puppeteer version the rescue path itself is not triggered (puppeteer follows the JS navigation cleanly), so the E2E observes the beholder-side redirect outcome rather than the rescue. The rescue logic itself is pinned by 46 unit cases across derive-js-redirect-target, is-js-redirect-error-shape, build-js-redirect-edge, and link-list includeRedirectPaths. Known limitation: on the HEAD-fail path the upstream's truthful HTTP status is lost (row reads as 301). Strictly better than the `status = -1` retry-loop; lift deferred at 0.x.

…URE.md Adds a CLAUDE.md block under the "HEAD pre-flight の救済" section explaining the rescue's WHY (Page.goto returned null lands in `protocol`, not in `PERMANENT_ERROR_KINDS` nor `PUPPETEER_FALLBACK_KINDS` → infinite retry loop without intervention), the two firing call sites (HEAD-success then puppeteer-throw, HEAD-fail then puppeteer-fallback), the narrow trigger contract (`isJsRedirectErrorShape` + WHATWG-canonical `deriveJsRedirectTarget`), the known 200-to-301 status flip on the fallback path, and the E2E gap (current puppeteer version follows the JS navigation rather than throwing, so the rescue path itself is pinned by unit tests not the E2E). Adds one row to the ARCHITECTURE.md scraper error-handling table pointing at the new `buildJsRedirectEdge` / `derive-js-redirect-target` / `is-js-redirect-error-shape` JSDoc as the authoritative source. No source-code changes.

…o JS-redirect rescue fires The rescue introduced in a39a486 read `page.url()` only inside the catch arm of `#launchBrowserAndScrape`, but beholder's `scrapeStart` catches `#fetchData` throws internally and *returns* `{ type: 'error', shutdown: true, ... }` rather than re-throwing. The catch arm therefore never ran for the exact failure shape the rescue was designed to handle (`Page.goto returned null` from a client-side `window.location.replace()` / `<meta refresh>`), and the rescue effectively no-op'd in production. Capture `page.url()` on the error-return path too, while `page` is still alive (finally has not called `handleBrowserClose`), and attach `postNavigationUrl` to the returned envelope. The catch-arm capture is kept as the safety net for outer throws (browser.newPage, page.setUserAgent, ...) that bypass scrapeStart entirely. Verified live on a customer archive that previously stayed `status = -1 / UnknownError` across many `--retry-failed` passes: after this fix the source row flips to `status = 301 / Moved Permanently` with `redirectDestId` wired to the JS-redirect destination, and the page leaves the retry pool. Debug log shows `JS-redirect rescue fired for ... → ... (HEAD available)` followed by `Set redirect` / `Set redirected url` archive writes.

YusukeHirao added 25 commits June 18, 2026 23:46

YusukeHirao merged commit 57f2b23 into dev Jun 19, 2026
5 checks passed

YusukeHirao deleted the fix/dns-burned-host-skip-interval branch June 19, 2026 11:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(crawler): converge --retry-failed via classifier expansion, rescues, and scope-cred guard#90

fix(crawler): converge --retry-failed via classifier expansion, rescues, and scope-cred guard#90
YusukeHirao merged 25 commits into
devfrom
fix/dns-burned-host-skip-interval

YusukeHirao commented Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

YusukeHirao commented Jun 19, 2026

Summary

Convergence: --retry-failed stops looping on permanent failures

Rescue 1: HEAD-resolved chain when puppeteer dies on the destination

Rescue 2: JS-redirect via page.url()

Rescue 3: HEAD/GET both dead → one puppeteer try

Throughput / hygiene

Security: scope-credential leak

Query / viewer

Docs

Live verification

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Convergence: `--retry-failed` stops looping on permanent failures

Rescue 2: JS-redirect via `page.url()`