Skip to content

fix(crawler): converge --retry-failed via classifier expansion, rescues, and scope-cred guard#90

Merged
YusukeHirao merged 25 commits into
devfrom
fix/dns-burned-host-skip-interval
Jun 19, 2026
Merged

fix(crawler): converge --retry-failed via classifier expansion, rescues, and scope-cred guard#90
YusukeHirao merged 25 commits into
devfrom
fix/dns-burned-host-skip-interval

Conversation

@YusukeHirao

Copy link
Copy Markdown
Member

Summary

This branch started as a tiny fix for an interval-delay quirk on DNS-burned hosts and grew into a wider hardening pass on the crawl convergence story — keeping --retry-failed honest, classifying every observable failure shape, and adding two new safety nets for puppeteer-side oddities. Also fixes a credential-leak guard for the scope-auth flow.

Convergence: --retry-failed stops looping on permanent failures

  • New PERMANENT_ERROR_KINDS derived constant; Database.resetFailedPages now consults classifyErrorKind and refuses to reset rows whose latest message is permanent (NXDOMAIN, expired cert, ERR_BLOCKED_BY_CLIENT, HTTP parse error, ECONNREFUSED). Without this, every pass replayed the same failure forever.
  • New client-blocked ErrorKind for Chromium's ERR_BLOCKED_BY_* family, plus dns-transient split (EAI_AGAIN no longer punishes hosts via the DNS-burned cache).
  • Classifier matchers tightened: tls regex anchored against false-positive altnames substrings, protocol matcher catches Attempted to use detached Frame and the Page.<method> returned family.

Rescue 1: HEAD-resolved chain when puppeteer dies on the destination

  • When HEAD pre-flight resolved a real 3xx chain but puppeteer threw on the final URL (HTTPS→HTTP downgrade is the canonical case), record the source as a redirect-edge instead of status=-1. Avoids the retry-forever loop on URLs whose chain truth is already known.

Rescue 2: JS-redirect via page.url()

  • New buildJsRedirectEdge + deriveJsRedirectTarget + isJsRedirectErrorShape helpers. When puppeteer's page.goto() resolves to null (window.location.replace(...) / <meta refresh>) and page.url() reports a meaningfully-different destination, record the source as a redirect-edge. WHATWG URL canonicalisation filters case / slash / port / credential noise; browser-internal schemes are rejected.
  • One late-stage bug fix: beholder's scrapeStart returns {type:'error'} rather than throwing on internal #fetchData failures, so the page.url() capture had to be added on the return path too — the original catch-arm-only capture would have made the rescue dead in production.

Rescue 3: HEAD/GET both dead → one puppeteer try

  • When HEAD pre-flight + its GET fallback both exhaust retries on a URL that looks HTML-shaped, and the residual error suggests middlebox interference (timeout / connection-reset / parse-error), launch puppeteer exactly once. Some WAF configurations drop bare probes but answer real navigations.

Throughput / hygiene

  • HEAD pre-flight timeout escalates 10s → 30s → 60s per retry attempt (was fixed 10s) so slow but eventually-responsive hosts get rescued without paying the slow-server tax on the happy path.
  • GET fallback inside fetchDestination when HEAD response is unusable (405 / parse-error / etc.).
  • Per-URL interval delay is skipped for hosts in the DNS-burned cache — no point making the crawl wait between hits to a host that will short-circuit anyway.

Security: scope-credential leak

  • page.authenticate is now registered unconditionally to drain Chromium's native HTTP-auth dialog (previously hung navigations on sub-resources demanding Basic auth).
  • URL-embedded scope credentials (user:pass@host) are stripped before page.goto() so they never enter Chromium's HTTP-auth cache and never leak to cross-origin sub-resources on the same IP/port. Pinned by a new E2E (scope-auth-leak.e2e.ts).

Query / viewer

  • getSummary excludes excludes-pattern pages from status / content-type distributions (they were inflating the Unknown bucket).
  • Viewer translates the new client-blocked / dns-transient / local-network / parse-error ErrorKind buckets.

Docs

  • CLAUDE.md / ARCHITECTURE.md updated with the rescue paths, the --retry-failed convergence rule, and the ErrorKind derived-constants review checklist.

Live verification

A previously stuck source row (status=-1 / UnknownError after 15 --retry-failed passes on a real archive) flipped to status=301 Moved Permanently with redirectDestId set after one run with the rescue. Removing it from the retry pool permanently. Debug log: JS-redirect rescue fired for <source> → <dest> (HEAD available) followed by Set redirect / Set redirected url archive writes.

Test plan

  • yarn build — all 14 projects pass
  • yarn test — 1861 unit tests pass (5 skipped)
  • yarn vitest run --config vitest.e2e.config.ts — 25 files / 113 E2E pass
  • yarn lint — eslint + prettier + cspell + textlint + markuplint all clean
  • code-review xhigh sweep — all surviving findings addressed
  • live crawl on a real archive verifies the JS-redirect rescue end-to-end
  • CI green

🤖 Generated with Claude Code

Move the per-URL interval wait out of dealer and into the worker
callback so that hosts already known to be DNS-burned can skip it.
With dealer-managed interval the wait fires before #sendHeadRequest
sees the cache hit, so every short-circuited URL still consumed the
configured interval — exactly the cost the cache was meant to avoid.

The non-burned path keeps the same delay() + %countdown(...) wait so
the dealer display reads identically; only burned hosts now pull URLs
through with no spacing.
…ame code

When pre-flight HEAD returned 405 / 501 / 503 and the GET fallback was
also rejected with the same status, fetch-destination used to throw an
Error. The orchestrator then recorded the page as `status: -1` /
`statusText: 'UnknownError'`, losing the only useful diagnostic the
server had given us (e.g. a real 503 Service Unavailable became
indistinguishable from a DNS failure in the archive).

When the second-pass status matches the first, the server's verdict is
genuine — resolve with the PageData instead so the archive remembers
the real status. The HEAD-only quirk path (different status on GET) is
unchanged.
…error

Extend ErrorKind with three new buckets so the noise previously dumped in
`unknown` becomes actionable:

- `dns-transient` — `EAI_AGAIN` and `EREFUSED`, evaluated before `dns` so
  the shared `getaddrinfo` token does not pull a local-resolver hiccup
  into the persistent DNS bucket. The DNS-burned host cache (which marks
  on `=== 'dns'`) therefore stops learning these transients, removing
  the "WiFi blip burns every host" failure mode.
- `local-network` — Chromium `ERR_INTERNET_DISCONNECTED`,
  `ERR_NETWORK_CHANGED`, `ERR_NETWORK_IO_SUSPENDED`, `ERR_ADDRESS_UNREACHABLE`,
  `ERR_NETWORK_UNREACHABLE`, plus Node `ENETUNREACH`, `EHOSTUNREACH`,
  `EADDRNOTAVAIL`, `ENOTCONN`, `EPIPE`. Catches OS-sleep / WiFi-change /
  unreachable-route symptoms that were silently being labelled unknown.
- `parse-error` — `Parse Error`, `Expected HTTP/`, `Unexpected end of
  stream`. Distinguishes broken proxies and corrupt responses from
  protocol-level browser failures.

`timeout` is widened to also catch the `Timeout: <url>` form emitted by
`NetTimeoutError` (the HEAD pre-flight race) so it stops landing in
`unknown`. The SQL pre-filter in `Database.listDnsBurnedHostCandidates`
drops `%EAI_AGAIN%` to mirror the new classification: only persistent
NXDOMAIN-shaped messages flow into the DNS-burned cache preload.

cspell: allow NXDOMAIN, libc, RTSP — vocabulary used by the new tests.
A flat 10s race timeout on every HEAD retry kept missing slow-but-reachable
sites (government / overloaded servers that need 20-40s to answer). The
work function inside `Crawler.#sendHeadRequest` now tracks the attempt
index and passes a `[10s, 30s, 60s]` budget to `fetchDestination`, so the
first try stays fast for healthy hosts while later retries give a slow
server real room to respond. Attempts past the array length clamp to the
final entry so any `retry > escalation.length - 1` configuration just
keeps the longest budget.

`fetchDestination` gains a `timeout` option (default 10s, preserving
back-compat for non-crawler callers) and forwards it to the 405/501/503
GET fallback so the second-pass request uses the same budget as the
original HEAD attempt. `NetTimeoutError` results are intentionally NOT
written into `destinationCache`: caching them would let a future retry
re-throw the stale 10s timeout instead of letting the 30s/60s attempt
actually run.

Test stub for `retryCall` now honours `retries`, invoking the fn up to
`retries + 1` times so the new escalation behaviour can be observed
without paying the actual back-off seconds.
Add en/ja labels for dns-transient, local-network, and parse-error so
the Summary -1 breakdown and the Errors view stop showing raw kind
strings (the i18n fallback) for the three buckets that used to land in
unknown. Also widen the get-error-kind-label round-trip test to cover
every member of the eleven-entry ErrorKind union.
Two WHY callouts in the DNS-burned host cache section:

- Why `EAI_AGAIN` is classified as `dns-transient` rather than `dns`
  (avoids the "WiFi blip burns every host" failure mode, codifies the
  matcher-order contract).
- Why HEAD pre-flight timeout escalates `10s → 30s → 60s` across
  retries (slow-but-reachable servers get a real chance without making
  every healthy URL pay the slow-server tax up front).
Three classification holes that were silently dumping rows into
`unknown`, identified by reading the live crawl_errors / page_errors
of an in-flight archive:

- New `client-blocked` kind for Chromium's ERR_BLOCKED_* family — per
  the upstream net_error_list.h comment, `ERR_BLOCKED_BY_CLIENT` is
  literally "The client chose to block the request." (i.e. it is the
  browser's deliberate refusal, not a server response). Covers
  BY_CLIENT, BY_ADMINISTRATOR, BY_RESPONSE, BY_CSP, BY_ORB,
  BY_FINGERPRINTING_PROTECTION, IN_INCOGNITO_BY_ADMINISTRATOR,
  CLEARTEXT_NOT_PERMITTED, NETWORK_ACCESS_REVOKED.
- Tls matcher gains `Hostname/IP does not match certificate` and
  `altnames` so Node's tls hostname-mismatch errors stop landing in
  `unknown`.
- Timeout matcher swaps `^Timeout:` (line-anchored) for
  `Timeout: https?:` (URL-tailed). Beholder wraps retry-exhausted
  errors as `[Retried N times] Timeout: <url>`, which never has
  `Timeout:` at line start; the anchored form was missing every
  archived NetTimeoutError record.
WAF/CDN middleboxes routinely refuse HEAD with parse-error/connection
reset/timeout while still accepting GET — the HEAD-only pre-flight
was therefore mislabelling reachable hosts as unreachable, exhausting
their retry budget and leaving them as status=-1 in the archive.

When the HEAD attempt returns a NetTimeoutError, a connection-reset,
or a parse-error, retry once with method=GET reusing the same cache
key and userAgent. Only these three causes trigger the fallback: a
proper 4xx/5xx response is already a definitive answer and is left
alone (the existing 405/501/503 ladder still applies).

NetTimeoutError is deliberately not cached so the crawler-level
retry budget keeps applying across attempts — caching would freeze
the first slow probe as the verdict for every later URL on the host.
Mirror the new `client-blocked` ErrorKind from the crawler in the
viewer dictionaries (en / ja) and add it to the round-trip union
test, so the Errors panel surfaces "Blocked by browser" / "ブラウザ
がブロック" instead of falling back to the raw `client-blocked` slug.
Pages discovered via link extraction and immediately matched by an
`excludes` pattern are still inserted into the `pages` table (so the
discovery layer can dedupe further occurrences) but never actually
scraped — they carry `isSkipped = 1` and NULL `status` / `contentType`.
Counting them in the Summary status histogram and content-type
distribution produced a confusing all-NULL bucket (2,799 rows on the
archive that surfaced the bug, dwarfing the real 5xx tail) that
obscured the actual crawl outcome.

Skip them in both aggregations. `OR isSkipped IS NULL` keeps
pre-flag-era rows visible.
`--retry-failed` was not converging across iterations: NXDOMAIN /
expired-cert / `ERR_BLOCKED_BY_CLIENT` / HTTP parse-error / refused
TCP hosts all met the SQL fail-shape filter (`status=-1` / NULL /
5xx), got reset every pass, were re-attempted, failed identically,
and rejoined the candidate pool — leaving the retry target stuck at
its original size forever.

`resetFailedPages` now resolves the latest recorded message for each
SQL candidate (`page_errors` → `crawl_errors`, no `error.log` to
keep the writer dependency surface narrow) and filters out any
candidate whose message classifies into `PERMANENT_ERROR_KINDS`. An
absent message keeps the candidate (still `unknown`, still retried)
so the change errs on the side of investigation.

The set is pinned by `permanent-error-kinds.spec.ts` so accidental
widening triggers a CI failure; the membership rationale (and the
"deliberately retryable" cases — `dns-transient`, `connection-reset`,
`timeout`, `local-network`, `protocol`, `unknown`) is documented in
the constant's JSDoc.

Known limitation: pre-`crawl_errors`-era archives whose failures
live only in `error.log` lose the exclusion — `getFailedPageMessages`
deliberately omits the text-log parser. Run one fresh crawl pass
first to populate the structured tables. Documented in the helper's
JSDoc.
…HTML URL

When the HEAD pre-flight (+ its GET companion in `fetchDestination`)
exhausts retries on a URL that looks HTML-shaped, with a residual
error that suggests middlebox / WAF interference rather than the host
being unreachable (kinds `timeout` / `connection-reset` /
`parse-error`), launch a full puppeteer navigation exactly once as a
last-chance safety net. Some real-world WAF configurations drop bare
HEAD/GET probes but answer a real browser navigation — those URLs
were otherwise lost as `status=-1` indefinitely.

The fallback set is deliberately narrow:
- `dns` / `dns-transient` / `tls` / `client-blocked` /
  `connection-refused` / `local-network` / `connection-timeout` all
  reproduce the same outcome under puppeteer, paying browser-launch
  cost for nothing.
- `protocol` is a puppeteer-side race; bouncing back to puppeteer
  recreates it.
- `unknown` is excluded by design — Chromium per unclassifiable error
  is too expensive. When a real WAF pattern lands in `unknown`, add a
  matcher to `classifyErrorKind` so it bubbles into one of the three
  included kinds.

`PreloadShortCircuitError` is filtered out automatically: its
synthesised `getaddrinfo ENOTFOUND` message classifies as `dns`.

Result handling:
- `fallback.type === 'success'` → same post-processing as the normal
  path (`markBrowserScrape`, `#scrapedDestinations` claim).
- `fallback.type === 'skipped'` → propagated as a skip (excludeKeywords
  hit inside puppeteer); without this branch the page would be
  recorded as `status=-1` with the HEAD's timeout message, conflating
  operator-intent skip with network failure.
- `fallback.type === 'error'` or `#launchBrowserAndScrape` throws →
  `Unreachable (fallback failed)` lane label, HEAD's original error
  surfaced in `crawl_errors` (puppeteer's noisier wrapper would
  bury the root cause).

`metadataOnly` mode opts out — the path is a bandwidth-saving scrape
that wouldn't have launched the browser anyway.
Capture the WHY behind the two new safety nets so future maintainers
adding a kind to `ErrorKind` know which derived constants need a
parallel review:

- `PERMANENT_ERROR_KINDS` (`--retry-failed` convergence contract) and
- `PUPPETEER_FALLBACK_KINDS` (HEAD/GET → puppeteer one-shot retry).

Adding an ErrorKind without checking both sets risks a silent
regression — `--retry-failed` looping on a permanent failure forever
or a recoverable URL never reaching the puppeteer safety net.

CLAUDE.md / ARCHITECTURE.md / README.md updates:
- Replace the stale "8 種" enumeration with a pointer to the
  `ErrorKind` union (12 kinds today; the source-of-truth is the code).
- Document the GET fallback inside `fetchDestination` and the
  puppeteer fallback inside `Crawler.#scrapePage` as the two-layer
  middlebox safety net (parse-error / connection-reset / timeout).
- Note the `--retry-failed` convergence rule and the known limitation
  for pre-`crawl_errors`-era archives whose failures live only in
  `error.log`.
- README's `--retry-failed` block gains one sentence on the new
  permanent-kind exclusion so users reading "なぜ retry 対象が減った
  ?" find the answer without diving into source.

`types.ts` JSDoc adds a "derived constants that MUST be reviewed
when this union changes" section right under the kind table, pointing
at both derived constants.
…TTP-auth dialogs

`page.authenticate` was only called when the URL itself carried
`user:pass@` credentials, so a page whose sub-resource (third-party
script, ad pixel, abandoned CDN domain that now serves
`WWW-Authenticate: Basic ...`) triggered Chromium's NATIVE HTTP-auth
dialog had no handler — and Chromium blocks the request waiting for
input that never arrives, hanging the entire page navigation until
puppeteer's navigation timeout fires. Downstream that surfaces as a
`timeout` ErrorKind even though the root cause is "one resource
demands Basic auth, nobody at the dialog."

Unlike `alert/confirm/prompt`, the HTTP-auth dialog is NOT captured
by `page.on('dialog')` — the only way to drain it is to register
credentials via `page.authenticate` BEFORE navigation.

Registering with empty strings makes Chromium respond immediately
with an empty `Authorization` header. By spec the server replies
with another 401 (or accepts, on the rare intentional public
endpoint), so the offending sub-resource fails fast as a definitive
HTTP-level failure and the rest of the page continues rendering —
no hang either way. URL-embedded credentials still win when present,
so existing scope-auth flows behave identically.

`?? ''` because ExURL types `username` / `password` as `string |
null` while puppeteer's `Credentials` interface needs `string`.
…op cross-origin leak

Previous behaviour: when a crawl was seeded with a scope-auth URL
(`http://user:pass@host/`), puppeteer's page.goto received the URL
verbatim. Chromium promoted those credentials into its HTTP-auth
cache keyed on (scheme, host, port, realm), and the network stack
then re-attached the cached `Authorization` header to every
subsequent same-origin request — INCLUDING cross-origin
sub-resources that share the same IP / port (e.g. an embedded
`<img src="http://127.0.0.1:8010/…">` loaded from a `localhost:8010`
page). The `Fetch.authRequired` event never fires for these
pre-emptive cache attachments, so nothing inside puppeteer or this
crawler could filter the leak after the fact.

Fix: rebuild the navigation URL with `parseUrl(url.href)` and clear
its `username` / `password` before handing it to `scrapeStart`. The
copy is detached from the queued URL, so other code paths
(`linkList`, redirect tracking, dealer dedup) still see the
credentialed form. The scope auth is still applied via
`page.authenticate({user, pass})` — invoked unconditionally so that
in-scope challenges resolve and Chromium's native HTTP-auth dialog
gets drained (empty credentials suffice; sub-resources demanding
Basic auth then fail with a definitive 401 instead of hanging the
navigation until the puppeteer timeout fires).

The new E2E in `test-server` pins both halves of the contract:
removing either the auth call (main 401 hangs) or the URL strip
(scope cred reaches the off-scope endpoint) makes the test fail.
…ss-origin E2E

A 401-protected `/scope-auth-leak/main` route embeds an
`<img src="http://127.0.0.1:8010/scope-auth-leak/external-asset.png">`,
which is treated as off-scope because the hostname differs from the
crawl seed (`localhost`) even though both resolve to the same IP /
port — the canonical "same-IP cross-origin" shape used elsewhere in
this test-server for off-scope checks.

The external endpoint records every `Authorization` header it
observes. The E2E asserts both halves of the contract:

1. The in-scope main page must scrape successfully — proves the
   crawler IS forwarding the scope credentials to in-scope origins,
   so the test cannot trivially pass by sending no auth anywhere.
2. NO observed Authorization header on the off-scope endpoint may
   carry the scope credentials (full-string match, plus a substring
   defence-in-depth against future variants of the leak).

Removing either side of the crawler fix (the `page.authenticate`
call OR the URL-strip before `scrapeStart`) makes this test fail.

A small `/scope-auth-leak/reset` endpoint is exposed so the test
runs independently of any previous run's recorded state.
Five small but real correctness issues surfaced by an adversarial
recall-mode review of the branch — collected into one fix because
each is a one-or-two-line tightening of an existing landing point,
and the regression-test coverage for them lives in the same set of
spec files.

- **TLS regex anchored** (`classify-error-kind.ts`): the bare
  `altnames` alternative matched any message containing the
  substring — including a request path like
  `/altnames/lookup` in a 5xx body. Since `tls ∈ PERMANENT_ERROR_KINDS`,
  a false positive permanently excluded that page from
  `--retry-failed`. Now anchored to `certificate'?s? altnames`
  (covers both the Node `Hostname/IP does not match certificate's
  altnames` and the bare `certificate's altnames` phrasings). New
  `does NOT misclassify the word "altnames"…` test pins the guard.

- **Navigation URL rebuilt, not mutated** (`crawler.ts`): the
  credential-leak guard previously did
  `parseUrl(url.href); navigateUrl.username = ''; ...`. ExURL
  pre-computes `href` / `withoutHash` / `withoutHashAndAuth` at parse
  time, so the post-hoc field assignment left those derived strings
  carrying `user:pass@host`. The leak guard happened to work only
  because beholder's current `scrapeStart` reads
  `url.withoutHashAndAuth` for `page.goto` — a future bump that
  consumes `url.href` instead would silently re-introduce the leak
  with the existing E2E still passing. Now built from
  `parseUrl(url.withoutHashAndAuth)` so every derived field is
  credential-free.

- **Cache exemption widened** (`fetch-destination.ts`): only
  `NetTimeoutError` was excluded from `destinationCache`. HEAD
  `parse-error` / `connection-reset` (the kinds
  `shouldGetFallbackOnHeadFailure` already singles out as
  possibly-recoverable) got cached on attempt 1, so the 30s/60s
  `HEAD_TIMEOUT_ESCALATION_MS` retries hit the cache, re-threw the
  stale 10s failure, and the escalation never reached the
  network. The exemption now mirrors
  `shouldGetFallbackOnHeadFailure`'s kind set.

- **Empty `page_errors.message` falls through** (`get-failed-page-messages.ts`):
  a scraper phase can record a trigger event with an empty message.
  The old `if (!messageByPageId.has(id))` short-circuit treated the
  empty string as "resolved" and skipped the `crawl_errors` lookup —
  so a page whose `page_errors` row was empty but whose
  `crawl_errors` row classified as `dns` / `tls` / `client-blocked`
  ended up looking like `unknown` to `--retry-failed`, defeating
  convergence on exactly the cases the guard is supposed to catch.

- **Fallback `'error'` branch logs** (`crawler.ts`):
  `#launchBrowserAndScrape` catches its own exceptions and returns
  `{type:'error', shutdown:...}` rather than throwing, so the
  catch arm never saw that path. The puppeteer-side failure mode
  (including any `shutdown` flag the scraper attached) was invisible
  to operators reading the diagnostic log. Now emits a
  `crawlerLog('Puppeteer fallback returned error for %s: %s
  (shutdown=%s)', ...)` breadcrumb before falling through to
  surface the HEAD error as the user-visible cause.

Regression tests added: `database.spec.ts` pins the
all-permanent → `[]` short-circuit and the
`[Retried N times] getaddrinfo ENOTFOUND` wrapped form.
`get-failed-page-messages.spec.ts` covers the empty-message
fallthrough and the `crawl_errors.url IS NULL` defensive guard.
…e predicate

The `where('isSkipped', 0).orWhereNull('isSkipped')` carve-out is
applied twice in `getSummary` (once for the page-status histogram,
once for the content-type distribution) — and a near-identical
inline form lives in `Database.resetFailedPages` over in the
crawler package. The code-review xhigh sweep flagged the
duplication: any future panel that aggregates `pages` without the
carve-out re-introduces the "Unknown / Errored" bucket bloat the
filter was added to eliminate, and three copies of the same
predicate make the convention easy to miss.

This commit pulls the in-package duplication into
`exclude-skipped-pages.ts` (one file, one export, JSDoc the
rationale once) and rewrites the two call sites in `getSummary` as
`.where(excludeSkippedPages)`. `Database.resetFailedPages` keeps
its inline copy — `@nitpicker/crawler` cannot depend on
`@nitpicker/query` (reverse direction), so cross-package
deduplication would require either a tier-0 helper package or a
re-export contract neither of which is justified for a 2-line
knex predicate. The crawler's own JSDoc cross-links the convention.
…spec missing helpers

QA review (qa-engineer skill) flagged two test gaps that turn into
silent regressions if the underlying logic is ever changed:

- The cache-exemption rule inside `fetch-destination.ts` is the
  guard that lets `HEAD_TIMEOUT_ESCALATION_MS` actually escalate on
  parse-error / connection-reset retries. It was inlined as the
  `shouldGetFallbackOnHeadFailure` private helper, with no direct
  unit coverage — removing the helper or narrowing its kind set
  would not fail a single existing spec.
- `excludeSkippedPages` (the new `getSummary` predicate dedup
  helper from the earlier code-review fix) had no spec at all,
  violating the project rule "1 関数 1 ファイルにはユニットテスト必須".

Promote `shouldGetFallbackOnHeadFailure` to its own
`should-get-fallback-on-head-failure.ts` module (one file, one
export, JSDoc the contract) so `fetch-destination.ts` consumes it
via import. Pin every branch of the kind set with a
`should-get-fallback-on-head-failure.spec.ts`:
NetTimeoutError / parse-error / connection-reset opt IN;
dns / tls / connection-refused / connection-timeout / timeout /
client-blocked / unknown opt OUT. Each `expect` matches one
documented branch in the helper's JSDoc, so changing the helper
without updating the spec is now impossible to do silently.

`excludeSkippedPages` gets a spec that walks all three states
(`isSkipped = 0` / `1` / `NULL`) against a real `pages` table
through `Archive.create`. Demonstrates the legacy-NULL backward-
compat behavior and the empty-result edge case.
…ghten scope-leak assertion

Two QA-review gap closures in the E2E layer:

- `retry-failed.e2e.ts` gains a third `describe` block that proves
  the `PERMANENT_ERROR_KINDS` exclusion (Layer 2 of this PR) converges
  across multiple `--retry-failed` passes. Baseline crawl seeds a
  status=500 candidate from `/flaky/recoverable`, then injects a
  `getaddrinfo ENOTFOUND` message directly into `page_errors` to
  classify the candidate as `dns ∈ PERMANENT_ERROR_KINDS` (real DNS
  failures can't be simulated hermetically on CI runners). Healing
  the endpoint and running `retryFailed` TWICE in a row then proves
  the page stays at status=500 — if the exclusion were removed, the
  first pass would heal it to 200. The previous test set only
  covered "5xx page heals on retry"; nothing pinned "permanent kinds
  are excluded across iterations".

- `scope-auth-leak.e2e.ts` per-header loop drops the nested
  `if (header !== null)` and folds the four assertions into a single
  contract: `not.toBe(SCOPE_AUTH_HEADER)` + `not.toContain(SCOPE_USER
  / PASS)` + `expect([null, EMPTY_AUTH_HEADER]).toContain(header)`.
  Same invariant, no branching inside the test body — clearer about
  what the leak guard is actually promising.

`/scope-auth-leak/external-asset.png` substring assertion uses
`header ?? ''` so the null path stays valid without an `if`.
… frame" from unknown

A live archive (~1700 crawl_errors rows) had 53 messages slipping into
`unknown` despite having clear classification signal. Two matchers were
under-reaching:

- **tls** missed the bare Node phrasing `certificate has expired` — it
  emits without an `ERR_CERT_*` / `ERR_SSL_*` prefix when the peer's
  `notAfter` is past. 9 distinct expired-cert hosts were falling out
  of tls and (since `tls ∈ PERMANENT_ERROR_KINDS`) also losing the
  `--retry-failed` exclusion meant for permanent failures.

- **protocol** required `frame (?:was |got )?detached` word order and
  missed puppeteer's current `Frame.ts` emit
  "Attempted to use detached Frame '<id>'" (reverse order). 9 distinct
  frame IDs landed in unknown.

The tls token is end-anchored as `certificate has expired\s*$` (NOT a
bare substring) so application-layer error bodies that happen to echo
the phrase mid-string — e.g. an upstream JSON 5xx body that includes
"License certificate has expired (LICENSE_EXPIRED) — please renew" —
do NOT misclassify as tls (which would permanently exclude them from
`--retry-failed`).

The protocol token is anchored to puppeteer's exact prefix
`Attempted to use detached frame` (its current `Frame.ts` emitter, the
`/i` flag catches the uppercase `Frame` form automatically), not the
bare two-token substring. A console / logger message that mentions
the same two tokens in unrelated context falls through to unknown.

Regression coverage:
- positive: bare `certificate has expired` + `[Retried 3 times]`
  wrapper both classify as tls
- negative: three mid-string `certificate has expired` shapes fall
  through to unknown
- trailing-whitespace bound: space / tab / newline forms still classify
- positive: puppeteer `Attempted to use detached Frame '...'` +
  suffix variants classify as protocol
- case insensitive: lowercase + mixed case both work
- negative: unrelated `detached frame` console diagnostics fall through
- wildcard pin: `Page.<method> returned` matches reload / evaluate /
  close beyond the original `goto` observation

These changes are derived-on-read — no archive re-crawl is needed.
Just re-opening an existing archive with the new library reclassifies
the affected rows from `unknown` to `tls` / `protocol` automatically.
…eturns null

A URL that the HEAD pre-flight resolves through a 3xx chain to a
status-200 final destination, but whose final destination puppeteer
then fails to render (e.g. HTTPS→HTTP downgrade redirects that
Chromium silently refuses to follow, surfacing as
"The method Page.goto returned null"), used to drop the entire chain
on the floor: the page row landed at `status = -1 / UnknownError`
with `redirectDestId = NULL`, then re-entered `--retry-failed` every
iteration because the SQL fail-shape filter caught it AND no
`redirectDestId` filtered it out.

Two cooperating fixes:

- **`Crawler.#scrapePage`** now folds the browser-failure-with-known-
  redirect-chain case into the existing `redirect-edge` path. When
  `browserResult.type === 'error'` AND
  `headCheckResult.redirectPaths.length > 0`, the chain that HEAD
  already proved is authoritative — return `redirect-edge` so the
  orchestrator routes it through `Archive.setRedirect` →
  `recordRedirect` → `#linkRedirectSources`, which sets
  `redirectDestId` and excludes the row from `--retry-failed`
  naturally.

- **`#linkRedirectSources`** now stamps a redirect source's row with
  `status = 301 / statusText = 'Moved Permanently'` ONLY when the
  row carries no definitive status yet (NULL or `-1`). HEAD
  pre-flight does not retain each hop's individual status code
  (`redirectPaths` is a URL[] without statuses), so any 3xx hop must
  collapse to one representative — 301 is the canonical pick.
  Existing definitive statuses (200 / 302 / 307 / etc.) from prior
  direct scrapes are preserved verbatim — the conditional flip avoids
  rewriting accurate history during migration.

Result for the migration shape: opening an existing archive with the
new library and running `--retry-failed` reclassifies legacy
HTTPS→HTTP-downgrade rows as 301 redirect sources and drops them out
of the retry target set. For fresh crawls, 302/307 URLs that get
their status captured directly keep their accurate code; only the
"never directly scraped" placeholder rows pick up the 301 marker.

Regression coverage in `database.spec.ts`:
- redirect source whose row was `-1 / UnknownError` is flipped to
  `301 / Moved Permanently`
- redirect source whose row already had `302 / Found` is NOT
  overwritten
- redirect source whose placeholder row had NULL status is stamped
  301
When `page.goto()` resolves to `null` because a client-side
`window.location.replace(...)` / `<meta refresh>` fires before the
original response materialises, the scraper throws `The method
Page.goto returned null` — classified as `protocol`, neither
permanent nor a puppeteer-fallback kind. Without a rescue the
source row is persisted as `status = -1` and `--retry-failed`
resets it every pass without converging.

`buildJsRedirectEdge` reads `page.url()` from the still-alive
puppeteer page after the throw, canonicalises and credential-strips
the destination, and synthesises a `RedirectEdgeResult`. The dest
is re-enqueued; `linkList.done` is told NOT to fold the dest into
the done-set (`completion.includeRedirectPaths: false`) so the
add() can route it into pending. Trigger is narrow: only the exact
`Page.goto returned null` sentinel AND a meaningfully-different
post-navigation URL (WHATWG canonicalisation filters case/slash/
port/credential noise; browser-internal schemes are rejected).

Two call sites share the helper:
- HEAD-success then puppeteer-throw: PageData spreads HEAD,
  preserving the real HTTP 200 status.
- HEAD-fail then puppeteer-fallback then fallback-throw: PageData
  is a `linkToPageData` placeholder with status=-1 (later flipped
  to 301 by `#linkRedirectSources`).

Genuine browser failures (TLS / target-closed / timeout) fall
through unchanged.

`RedirectEdgeResult.source` discriminator (`http-chain` /
`js-redirect`) lets `#runDeal` decide between fold-into-done-set
(http-chain) and enqueue-the-dest (js-redirect).

Test-server gets a `/js-redirect/` route + E2E that exercises the
client-side window.location.replace shape. Note: under the current
puppeteer version the rescue path itself is not triggered (puppeteer
follows the JS navigation cleanly), so the E2E observes the
beholder-side redirect outcome rather than the rescue. The rescue
logic itself is pinned by 46 unit cases across
derive-js-redirect-target, is-js-redirect-error-shape,
build-js-redirect-edge, and link-list includeRedirectPaths.

Known limitation: on the HEAD-fail path the upstream's truthful
HTTP status is lost (row reads as 301). Strictly better than the
`status = -1` retry-loop; lift deferred at 0.x.
…URE.md

Adds a CLAUDE.md block under the "HEAD pre-flight の救済" section
explaining the rescue's WHY (Page.goto returned null lands in `protocol`,
not in `PERMANENT_ERROR_KINDS` nor `PUPPETEER_FALLBACK_KINDS` → infinite
retry loop without intervention), the two firing call sites
(HEAD-success then puppeteer-throw, HEAD-fail then puppeteer-fallback),
the narrow trigger contract (`isJsRedirectErrorShape` + WHATWG-canonical
`deriveJsRedirectTarget`), the known 200-to-301 status flip on the
fallback path, and the E2E gap (current puppeteer version follows the
JS navigation rather than throwing, so the rescue path itself is pinned
by unit tests not the E2E).

Adds one row to the ARCHITECTURE.md scraper error-handling table
pointing at the new `buildJsRedirectEdge` / `derive-js-redirect-target`
/ `is-js-redirect-error-shape` JSDoc as the authoritative source.

No source-code changes.
…o JS-redirect rescue fires

The rescue introduced in a39a486 read `page.url()` only inside the
catch arm of `#launchBrowserAndScrape`, but beholder's `scrapeStart`
catches `#fetchData` throws internally and *returns*
`{ type: 'error', shutdown: true, ... }` rather than re-throwing. The
catch arm therefore never ran for the exact failure shape the rescue
was designed to handle (`Page.goto returned null` from a client-side
`window.location.replace()` / `<meta refresh>`), and the rescue
effectively no-op'd in production.

Capture `page.url()` on the error-return path too, while `page` is
still alive (finally has not called `handleBrowserClose`), and attach
`postNavigationUrl` to the returned envelope. The catch-arm capture
is kept as the safety net for outer throws (browser.newPage,
page.setUserAgent, ...) that bypass scrapeStart entirely.

Verified live on a customer archive that previously stayed
`status = -1 / UnknownError` across many `--retry-failed` passes:
after this fix the source row flips to `status = 301 / Moved
Permanently` with `redirectDestId` wired to the JS-redirect
destination, and the page leaves the retry pool. Debug log shows
`JS-redirect rescue fired for ... → ... (HEAD available)` followed
by `Set redirect` / `Set redirected url` archive writes.
@YusukeHirao YusukeHirao merged commit 57f2b23 into dev Jun 19, 2026
5 checks passed
@YusukeHirao YusukeHirao deleted the fix/dns-burned-host-skip-interval branch June 19, 2026 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant