fix(crawler): converge --retry-failed via classifier expansion, rescues, and scope-cred guard#90
Merged
Merged
Conversation
Move the per-URL interval wait out of dealer and into the worker callback so that hosts already known to be DNS-burned can skip it. With dealer-managed interval the wait fires before #sendHeadRequest sees the cache hit, so every short-circuited URL still consumed the configured interval — exactly the cost the cache was meant to avoid. The non-burned path keeps the same delay() + %countdown(...) wait so the dealer display reads identically; only burned hosts now pull URLs through with no spacing.
…ame code When pre-flight HEAD returned 405 / 501 / 503 and the GET fallback was also rejected with the same status, fetch-destination used to throw an Error. The orchestrator then recorded the page as `status: -1` / `statusText: 'UnknownError'`, losing the only useful diagnostic the server had given us (e.g. a real 503 Service Unavailable became indistinguishable from a DNS failure in the archive). When the second-pass status matches the first, the server's verdict is genuine — resolve with the PageData instead so the archive remembers the real status. The HEAD-only quirk path (different status on GET) is unchanged.
…error Extend ErrorKind with three new buckets so the noise previously dumped in `unknown` becomes actionable: - `dns-transient` — `EAI_AGAIN` and `EREFUSED`, evaluated before `dns` so the shared `getaddrinfo` token does not pull a local-resolver hiccup into the persistent DNS bucket. The DNS-burned host cache (which marks on `=== 'dns'`) therefore stops learning these transients, removing the "WiFi blip burns every host" failure mode. - `local-network` — Chromium `ERR_INTERNET_DISCONNECTED`, `ERR_NETWORK_CHANGED`, `ERR_NETWORK_IO_SUSPENDED`, `ERR_ADDRESS_UNREACHABLE`, `ERR_NETWORK_UNREACHABLE`, plus Node `ENETUNREACH`, `EHOSTUNREACH`, `EADDRNOTAVAIL`, `ENOTCONN`, `EPIPE`. Catches OS-sleep / WiFi-change / unreachable-route symptoms that were silently being labelled unknown. - `parse-error` — `Parse Error`, `Expected HTTP/`, `Unexpected end of stream`. Distinguishes broken proxies and corrupt responses from protocol-level browser failures. `timeout` is widened to also catch the `Timeout: <url>` form emitted by `NetTimeoutError` (the HEAD pre-flight race) so it stops landing in `unknown`. The SQL pre-filter in `Database.listDnsBurnedHostCandidates` drops `%EAI_AGAIN%` to mirror the new classification: only persistent NXDOMAIN-shaped messages flow into the DNS-burned cache preload. cspell: allow NXDOMAIN, libc, RTSP — vocabulary used by the new tests.
A flat 10s race timeout on every HEAD retry kept missing slow-but-reachable sites (government / overloaded servers that need 20-40s to answer). The work function inside `Crawler.#sendHeadRequest` now tracks the attempt index and passes a `[10s, 30s, 60s]` budget to `fetchDestination`, so the first try stays fast for healthy hosts while later retries give a slow server real room to respond. Attempts past the array length clamp to the final entry so any `retry > escalation.length - 1` configuration just keeps the longest budget. `fetchDestination` gains a `timeout` option (default 10s, preserving back-compat for non-crawler callers) and forwards it to the 405/501/503 GET fallback so the second-pass request uses the same budget as the original HEAD attempt. `NetTimeoutError` results are intentionally NOT written into `destinationCache`: caching them would let a future retry re-throw the stale 10s timeout instead of letting the 30s/60s attempt actually run. Test stub for `retryCall` now honours `retries`, invoking the fn up to `retries + 1` times so the new escalation behaviour can be observed without paying the actual back-off seconds.
Add en/ja labels for dns-transient, local-network, and parse-error so the Summary -1 breakdown and the Errors view stop showing raw kind strings (the i18n fallback) for the three buckets that used to land in unknown. Also widen the get-error-kind-label round-trip test to cover every member of the eleven-entry ErrorKind union.
Two WHY callouts in the DNS-burned host cache section: - Why `EAI_AGAIN` is classified as `dns-transient` rather than `dns` (avoids the "WiFi blip burns every host" failure mode, codifies the matcher-order contract). - Why HEAD pre-flight timeout escalates `10s → 30s → 60s` across retries (slow-but-reachable servers get a real chance without making every healthy URL pay the slow-server tax up front).
Three classification holes that were silently dumping rows into `unknown`, identified by reading the live crawl_errors / page_errors of an in-flight archive: - New `client-blocked` kind for Chromium's ERR_BLOCKED_* family — per the upstream net_error_list.h comment, `ERR_BLOCKED_BY_CLIENT` is literally "The client chose to block the request." (i.e. it is the browser's deliberate refusal, not a server response). Covers BY_CLIENT, BY_ADMINISTRATOR, BY_RESPONSE, BY_CSP, BY_ORB, BY_FINGERPRINTING_PROTECTION, IN_INCOGNITO_BY_ADMINISTRATOR, CLEARTEXT_NOT_PERMITTED, NETWORK_ACCESS_REVOKED. - Tls matcher gains `Hostname/IP does not match certificate` and `altnames` so Node's tls hostname-mismatch errors stop landing in `unknown`. - Timeout matcher swaps `^Timeout:` (line-anchored) for `Timeout: https?:` (URL-tailed). Beholder wraps retry-exhausted errors as `[Retried N times] Timeout: <url>`, which never has `Timeout:` at line start; the anchored form was missing every archived NetTimeoutError record.
WAF/CDN middleboxes routinely refuse HEAD with parse-error/connection reset/timeout while still accepting GET — the HEAD-only pre-flight was therefore mislabelling reachable hosts as unreachable, exhausting their retry budget and leaving them as status=-1 in the archive. When the HEAD attempt returns a NetTimeoutError, a connection-reset, or a parse-error, retry once with method=GET reusing the same cache key and userAgent. Only these three causes trigger the fallback: a proper 4xx/5xx response is already a definitive answer and is left alone (the existing 405/501/503 ladder still applies). NetTimeoutError is deliberately not cached so the crawler-level retry budget keeps applying across attempts — caching would freeze the first slow probe as the verdict for every later URL on the host.
Mirror the new `client-blocked` ErrorKind from the crawler in the viewer dictionaries (en / ja) and add it to the round-trip union test, so the Errors panel surfaces "Blocked by browser" / "ブラウザ がブロック" instead of falling back to the raw `client-blocked` slug.
Pages discovered via link extraction and immediately matched by an `excludes` pattern are still inserted into the `pages` table (so the discovery layer can dedupe further occurrences) but never actually scraped — they carry `isSkipped = 1` and NULL `status` / `contentType`. Counting them in the Summary status histogram and content-type distribution produced a confusing all-NULL bucket (2,799 rows on the archive that surfaced the bug, dwarfing the real 5xx tail) that obscured the actual crawl outcome. Skip them in both aggregations. `OR isSkipped IS NULL` keeps pre-flag-era rows visible.
`--retry-failed` was not converging across iterations: NXDOMAIN / expired-cert / `ERR_BLOCKED_BY_CLIENT` / HTTP parse-error / refused TCP hosts all met the SQL fail-shape filter (`status=-1` / NULL / 5xx), got reset every pass, were re-attempted, failed identically, and rejoined the candidate pool — leaving the retry target stuck at its original size forever. `resetFailedPages` now resolves the latest recorded message for each SQL candidate (`page_errors` → `crawl_errors`, no `error.log` to keep the writer dependency surface narrow) and filters out any candidate whose message classifies into `PERMANENT_ERROR_KINDS`. An absent message keeps the candidate (still `unknown`, still retried) so the change errs on the side of investigation. The set is pinned by `permanent-error-kinds.spec.ts` so accidental widening triggers a CI failure; the membership rationale (and the "deliberately retryable" cases — `dns-transient`, `connection-reset`, `timeout`, `local-network`, `protocol`, `unknown`) is documented in the constant's JSDoc. Known limitation: pre-`crawl_errors`-era archives whose failures live only in `error.log` lose the exclusion — `getFailedPageMessages` deliberately omits the text-log parser. Run one fresh crawl pass first to populate the structured tables. Documented in the helper's JSDoc.
…HTML URL When the HEAD pre-flight (+ its GET companion in `fetchDestination`) exhausts retries on a URL that looks HTML-shaped, with a residual error that suggests middlebox / WAF interference rather than the host being unreachable (kinds `timeout` / `connection-reset` / `parse-error`), launch a full puppeteer navigation exactly once as a last-chance safety net. Some real-world WAF configurations drop bare HEAD/GET probes but answer a real browser navigation — those URLs were otherwise lost as `status=-1` indefinitely. The fallback set is deliberately narrow: - `dns` / `dns-transient` / `tls` / `client-blocked` / `connection-refused` / `local-network` / `connection-timeout` all reproduce the same outcome under puppeteer, paying browser-launch cost for nothing. - `protocol` is a puppeteer-side race; bouncing back to puppeteer recreates it. - `unknown` is excluded by design — Chromium per unclassifiable error is too expensive. When a real WAF pattern lands in `unknown`, add a matcher to `classifyErrorKind` so it bubbles into one of the three included kinds. `PreloadShortCircuitError` is filtered out automatically: its synthesised `getaddrinfo ENOTFOUND` message classifies as `dns`. Result handling: - `fallback.type === 'success'` → same post-processing as the normal path (`markBrowserScrape`, `#scrapedDestinations` claim). - `fallback.type === 'skipped'` → propagated as a skip (excludeKeywords hit inside puppeteer); without this branch the page would be recorded as `status=-1` with the HEAD's timeout message, conflating operator-intent skip with network failure. - `fallback.type === 'error'` or `#launchBrowserAndScrape` throws → `Unreachable (fallback failed)` lane label, HEAD's original error surfaced in `crawl_errors` (puppeteer's noisier wrapper would bury the root cause). `metadataOnly` mode opts out — the path is a bandwidth-saving scrape that wouldn't have launched the browser anyway.
Capture the WHY behind the two new safety nets so future maintainers adding a kind to `ErrorKind` know which derived constants need a parallel review: - `PERMANENT_ERROR_KINDS` (`--retry-failed` convergence contract) and - `PUPPETEER_FALLBACK_KINDS` (HEAD/GET → puppeteer one-shot retry). Adding an ErrorKind without checking both sets risks a silent regression — `--retry-failed` looping on a permanent failure forever or a recoverable URL never reaching the puppeteer safety net. CLAUDE.md / ARCHITECTURE.md / README.md updates: - Replace the stale "8 種" enumeration with a pointer to the `ErrorKind` union (12 kinds today; the source-of-truth is the code). - Document the GET fallback inside `fetchDestination` and the puppeteer fallback inside `Crawler.#scrapePage` as the two-layer middlebox safety net (parse-error / connection-reset / timeout). - Note the `--retry-failed` convergence rule and the known limitation for pre-`crawl_errors`-era archives whose failures live only in `error.log`. - README's `--retry-failed` block gains one sentence on the new permanent-kind exclusion so users reading "なぜ retry 対象が減った ?" find the answer without diving into source. `types.ts` JSDoc adds a "derived constants that MUST be reviewed when this union changes" section right under the kind table, pointing at both derived constants.
…TTP-auth dialogs
`page.authenticate` was only called when the URL itself carried
`user:pass@` credentials, so a page whose sub-resource (third-party
script, ad pixel, abandoned CDN domain that now serves
`WWW-Authenticate: Basic ...`) triggered Chromium's NATIVE HTTP-auth
dialog had no handler — and Chromium blocks the request waiting for
input that never arrives, hanging the entire page navigation until
puppeteer's navigation timeout fires. Downstream that surfaces as a
`timeout` ErrorKind even though the root cause is "one resource
demands Basic auth, nobody at the dialog."
Unlike `alert/confirm/prompt`, the HTTP-auth dialog is NOT captured
by `page.on('dialog')` — the only way to drain it is to register
credentials via `page.authenticate` BEFORE navigation.
Registering with empty strings makes Chromium respond immediately
with an empty `Authorization` header. By spec the server replies
with another 401 (or accepts, on the rare intentional public
endpoint), so the offending sub-resource fails fast as a definitive
HTTP-level failure and the rest of the page continues rendering —
no hang either way. URL-embedded credentials still win when present,
so existing scope-auth flows behave identically.
`?? ''` because ExURL types `username` / `password` as `string |
null` while puppeteer's `Credentials` interface needs `string`.
…op cross-origin leak Previous behaviour: when a crawl was seeded with a scope-auth URL (`http://user:pass@host/`), puppeteer's page.goto received the URL verbatim. Chromium promoted those credentials into its HTTP-auth cache keyed on (scheme, host, port, realm), and the network stack then re-attached the cached `Authorization` header to every subsequent same-origin request — INCLUDING cross-origin sub-resources that share the same IP / port (e.g. an embedded `<img src="http://127.0.0.1:8010/…">` loaded from a `localhost:8010` page). The `Fetch.authRequired` event never fires for these pre-emptive cache attachments, so nothing inside puppeteer or this crawler could filter the leak after the fact. Fix: rebuild the navigation URL with `parseUrl(url.href)` and clear its `username` / `password` before handing it to `scrapeStart`. The copy is detached from the queued URL, so other code paths (`linkList`, redirect tracking, dealer dedup) still see the credentialed form. The scope auth is still applied via `page.authenticate({user, pass})` — invoked unconditionally so that in-scope challenges resolve and Chromium's native HTTP-auth dialog gets drained (empty credentials suffice; sub-resources demanding Basic auth then fail with a definitive 401 instead of hanging the navigation until the puppeteer timeout fires). The new E2E in `test-server` pins both halves of the contract: removing either the auth call (main 401 hangs) or the URL strip (scope cred reaches the off-scope endpoint) makes the test fail.
…ss-origin E2E A 401-protected `/scope-auth-leak/main` route embeds an `<img src="http://127.0.0.1:8010/scope-auth-leak/external-asset.png">`, which is treated as off-scope because the hostname differs from the crawl seed (`localhost`) even though both resolve to the same IP / port — the canonical "same-IP cross-origin" shape used elsewhere in this test-server for off-scope checks. The external endpoint records every `Authorization` header it observes. The E2E asserts both halves of the contract: 1. The in-scope main page must scrape successfully — proves the crawler IS forwarding the scope credentials to in-scope origins, so the test cannot trivially pass by sending no auth anywhere. 2. NO observed Authorization header on the off-scope endpoint may carry the scope credentials (full-string match, plus a substring defence-in-depth against future variants of the leak). Removing either side of the crawler fix (the `page.authenticate` call OR the URL-strip before `scrapeStart`) makes this test fail. A small `/scope-auth-leak/reset` endpoint is exposed so the test runs independently of any previous run's recorded state.
Five small but real correctness issues surfaced by an adversarial
recall-mode review of the branch — collected into one fix because
each is a one-or-two-line tightening of an existing landing point,
and the regression-test coverage for them lives in the same set of
spec files.
- **TLS regex anchored** (`classify-error-kind.ts`): the bare
`altnames` alternative matched any message containing the
substring — including a request path like
`/altnames/lookup` in a 5xx body. Since `tls ∈ PERMANENT_ERROR_KINDS`,
a false positive permanently excluded that page from
`--retry-failed`. Now anchored to `certificate'?s? altnames`
(covers both the Node `Hostname/IP does not match certificate's
altnames` and the bare `certificate's altnames` phrasings). New
`does NOT misclassify the word "altnames"…` test pins the guard.
- **Navigation URL rebuilt, not mutated** (`crawler.ts`): the
credential-leak guard previously did
`parseUrl(url.href); navigateUrl.username = ''; ...`. ExURL
pre-computes `href` / `withoutHash` / `withoutHashAndAuth` at parse
time, so the post-hoc field assignment left those derived strings
carrying `user:pass@host`. The leak guard happened to work only
because beholder's current `scrapeStart` reads
`url.withoutHashAndAuth` for `page.goto` — a future bump that
consumes `url.href` instead would silently re-introduce the leak
with the existing E2E still passing. Now built from
`parseUrl(url.withoutHashAndAuth)` so every derived field is
credential-free.
- **Cache exemption widened** (`fetch-destination.ts`): only
`NetTimeoutError` was excluded from `destinationCache`. HEAD
`parse-error` / `connection-reset` (the kinds
`shouldGetFallbackOnHeadFailure` already singles out as
possibly-recoverable) got cached on attempt 1, so the 30s/60s
`HEAD_TIMEOUT_ESCALATION_MS` retries hit the cache, re-threw the
stale 10s failure, and the escalation never reached the
network. The exemption now mirrors
`shouldGetFallbackOnHeadFailure`'s kind set.
- **Empty `page_errors.message` falls through** (`get-failed-page-messages.ts`):
a scraper phase can record a trigger event with an empty message.
The old `if (!messageByPageId.has(id))` short-circuit treated the
empty string as "resolved" and skipped the `crawl_errors` lookup —
so a page whose `page_errors` row was empty but whose
`crawl_errors` row classified as `dns` / `tls` / `client-blocked`
ended up looking like `unknown` to `--retry-failed`, defeating
convergence on exactly the cases the guard is supposed to catch.
- **Fallback `'error'` branch logs** (`crawler.ts`):
`#launchBrowserAndScrape` catches its own exceptions and returns
`{type:'error', shutdown:...}` rather than throwing, so the
catch arm never saw that path. The puppeteer-side failure mode
(including any `shutdown` flag the scraper attached) was invisible
to operators reading the diagnostic log. Now emits a
`crawlerLog('Puppeteer fallback returned error for %s: %s
(shutdown=%s)', ...)` breadcrumb before falling through to
surface the HEAD error as the user-visible cause.
Regression tests added: `database.spec.ts` pins the
all-permanent → `[]` short-circuit and the
`[Retried N times] getaddrinfo ENOTFOUND` wrapped form.
`get-failed-page-messages.spec.ts` covers the empty-message
fallthrough and the `crawl_errors.url IS NULL` defensive guard.
…e predicate
The `where('isSkipped', 0).orWhereNull('isSkipped')` carve-out is
applied twice in `getSummary` (once for the page-status histogram,
once for the content-type distribution) — and a near-identical
inline form lives in `Database.resetFailedPages` over in the
crawler package. The code-review xhigh sweep flagged the
duplication: any future panel that aggregates `pages` without the
carve-out re-introduces the "Unknown / Errored" bucket bloat the
filter was added to eliminate, and three copies of the same
predicate make the convention easy to miss.
This commit pulls the in-package duplication into
`exclude-skipped-pages.ts` (one file, one export, JSDoc the
rationale once) and rewrites the two call sites in `getSummary` as
`.where(excludeSkippedPages)`. `Database.resetFailedPages` keeps
its inline copy — `@nitpicker/crawler` cannot depend on
`@nitpicker/query` (reverse direction), so cross-package
deduplication would require either a tier-0 helper package or a
re-export contract neither of which is justified for a 2-line
knex predicate. The crawler's own JSDoc cross-links the convention.
…spec missing helpers QA review (qa-engineer skill) flagged two test gaps that turn into silent regressions if the underlying logic is ever changed: - The cache-exemption rule inside `fetch-destination.ts` is the guard that lets `HEAD_TIMEOUT_ESCALATION_MS` actually escalate on parse-error / connection-reset retries. It was inlined as the `shouldGetFallbackOnHeadFailure` private helper, with no direct unit coverage — removing the helper or narrowing its kind set would not fail a single existing spec. - `excludeSkippedPages` (the new `getSummary` predicate dedup helper from the earlier code-review fix) had no spec at all, violating the project rule "1 関数 1 ファイルにはユニットテスト必須". Promote `shouldGetFallbackOnHeadFailure` to its own `should-get-fallback-on-head-failure.ts` module (one file, one export, JSDoc the contract) so `fetch-destination.ts` consumes it via import. Pin every branch of the kind set with a `should-get-fallback-on-head-failure.spec.ts`: NetTimeoutError / parse-error / connection-reset opt IN; dns / tls / connection-refused / connection-timeout / timeout / client-blocked / unknown opt OUT. Each `expect` matches one documented branch in the helper's JSDoc, so changing the helper without updating the spec is now impossible to do silently. `excludeSkippedPages` gets a spec that walks all three states (`isSkipped = 0` / `1` / `NULL`) against a real `pages` table through `Archive.create`. Demonstrates the legacy-NULL backward- compat behavior and the empty-result edge case.
…ghten scope-leak assertion Two QA-review gap closures in the E2E layer: - `retry-failed.e2e.ts` gains a third `describe` block that proves the `PERMANENT_ERROR_KINDS` exclusion (Layer 2 of this PR) converges across multiple `--retry-failed` passes. Baseline crawl seeds a status=500 candidate from `/flaky/recoverable`, then injects a `getaddrinfo ENOTFOUND` message directly into `page_errors` to classify the candidate as `dns ∈ PERMANENT_ERROR_KINDS` (real DNS failures can't be simulated hermetically on CI runners). Healing the endpoint and running `retryFailed` TWICE in a row then proves the page stays at status=500 — if the exclusion were removed, the first pass would heal it to 200. The previous test set only covered "5xx page heals on retry"; nothing pinned "permanent kinds are excluded across iterations". - `scope-auth-leak.e2e.ts` per-header loop drops the nested `if (header !== null)` and folds the four assertions into a single contract: `not.toBe(SCOPE_AUTH_HEADER)` + `not.toContain(SCOPE_USER / PASS)` + `expect([null, EMPTY_AUTH_HEADER]).toContain(header)`. Same invariant, no branching inside the test body — clearer about what the leak guard is actually promising. `/scope-auth-leak/external-asset.png` substring assertion uses `header ?? ''` so the null path stays valid without an `if`.
… frame" from unknown A live archive (~1700 crawl_errors rows) had 53 messages slipping into `unknown` despite having clear classification signal. Two matchers were under-reaching: - **tls** missed the bare Node phrasing `certificate has expired` — it emits without an `ERR_CERT_*` / `ERR_SSL_*` prefix when the peer's `notAfter` is past. 9 distinct expired-cert hosts were falling out of tls and (since `tls ∈ PERMANENT_ERROR_KINDS`) also losing the `--retry-failed` exclusion meant for permanent failures. - **protocol** required `frame (?:was |got )?detached` word order and missed puppeteer's current `Frame.ts` emit "Attempted to use detached Frame '<id>'" (reverse order). 9 distinct frame IDs landed in unknown. The tls token is end-anchored as `certificate has expired\s*$` (NOT a bare substring) so application-layer error bodies that happen to echo the phrase mid-string — e.g. an upstream JSON 5xx body that includes "License certificate has expired (LICENSE_EXPIRED) — please renew" — do NOT misclassify as tls (which would permanently exclude them from `--retry-failed`). The protocol token is anchored to puppeteer's exact prefix `Attempted to use detached frame` (its current `Frame.ts` emitter, the `/i` flag catches the uppercase `Frame` form automatically), not the bare two-token substring. A console / logger message that mentions the same two tokens in unrelated context falls through to unknown. Regression coverage: - positive: bare `certificate has expired` + `[Retried 3 times]` wrapper both classify as tls - negative: three mid-string `certificate has expired` shapes fall through to unknown - trailing-whitespace bound: space / tab / newline forms still classify - positive: puppeteer `Attempted to use detached Frame '...'` + suffix variants classify as protocol - case insensitive: lowercase + mixed case both work - negative: unrelated `detached frame` console diagnostics fall through - wildcard pin: `Page.<method> returned` matches reload / evaluate / close beyond the original `goto` observation These changes are derived-on-read — no archive re-crawl is needed. Just re-opening an existing archive with the new library reclassifies the affected rows from `unknown` to `tls` / `protocol` automatically.
…eturns null A URL that the HEAD pre-flight resolves through a 3xx chain to a status-200 final destination, but whose final destination puppeteer then fails to render (e.g. HTTPS→HTTP downgrade redirects that Chromium silently refuses to follow, surfacing as "The method Page.goto returned null"), used to drop the entire chain on the floor: the page row landed at `status = -1 / UnknownError` with `redirectDestId = NULL`, then re-entered `--retry-failed` every iteration because the SQL fail-shape filter caught it AND no `redirectDestId` filtered it out. Two cooperating fixes: - **`Crawler.#scrapePage`** now folds the browser-failure-with-known- redirect-chain case into the existing `redirect-edge` path. When `browserResult.type === 'error'` AND `headCheckResult.redirectPaths.length > 0`, the chain that HEAD already proved is authoritative — return `redirect-edge` so the orchestrator routes it through `Archive.setRedirect` → `recordRedirect` → `#linkRedirectSources`, which sets `redirectDestId` and excludes the row from `--retry-failed` naturally. - **`#linkRedirectSources`** now stamps a redirect source's row with `status = 301 / statusText = 'Moved Permanently'` ONLY when the row carries no definitive status yet (NULL or `-1`). HEAD pre-flight does not retain each hop's individual status code (`redirectPaths` is a URL[] without statuses), so any 3xx hop must collapse to one representative — 301 is the canonical pick. Existing definitive statuses (200 / 302 / 307 / etc.) from prior direct scrapes are preserved verbatim — the conditional flip avoids rewriting accurate history during migration. Result for the migration shape: opening an existing archive with the new library and running `--retry-failed` reclassifies legacy HTTPS→HTTP-downgrade rows as 301 redirect sources and drops them out of the retry target set. For fresh crawls, 302/307 URLs that get their status captured directly keep their accurate code; only the "never directly scraped" placeholder rows pick up the 301 marker. Regression coverage in `database.spec.ts`: - redirect source whose row was `-1 / UnknownError` is flipped to `301 / Moved Permanently` - redirect source whose row already had `302 / Found` is NOT overwritten - redirect source whose placeholder row had NULL status is stamped 301
When `page.goto()` resolves to `null` because a client-side `window.location.replace(...)` / `<meta refresh>` fires before the original response materialises, the scraper throws `The method Page.goto returned null` — classified as `protocol`, neither permanent nor a puppeteer-fallback kind. Without a rescue the source row is persisted as `status = -1` and `--retry-failed` resets it every pass without converging. `buildJsRedirectEdge` reads `page.url()` from the still-alive puppeteer page after the throw, canonicalises and credential-strips the destination, and synthesises a `RedirectEdgeResult`. The dest is re-enqueued; `linkList.done` is told NOT to fold the dest into the done-set (`completion.includeRedirectPaths: false`) so the add() can route it into pending. Trigger is narrow: only the exact `Page.goto returned null` sentinel AND a meaningfully-different post-navigation URL (WHATWG canonicalisation filters case/slash/ port/credential noise; browser-internal schemes are rejected). Two call sites share the helper: - HEAD-success then puppeteer-throw: PageData spreads HEAD, preserving the real HTTP 200 status. - HEAD-fail then puppeteer-fallback then fallback-throw: PageData is a `linkToPageData` placeholder with status=-1 (later flipped to 301 by `#linkRedirectSources`). Genuine browser failures (TLS / target-closed / timeout) fall through unchanged. `RedirectEdgeResult.source` discriminator (`http-chain` / `js-redirect`) lets `#runDeal` decide between fold-into-done-set (http-chain) and enqueue-the-dest (js-redirect). Test-server gets a `/js-redirect/` route + E2E that exercises the client-side window.location.replace shape. Note: under the current puppeteer version the rescue path itself is not triggered (puppeteer follows the JS navigation cleanly), so the E2E observes the beholder-side redirect outcome rather than the rescue. The rescue logic itself is pinned by 46 unit cases across derive-js-redirect-target, is-js-redirect-error-shape, build-js-redirect-edge, and link-list includeRedirectPaths. Known limitation: on the HEAD-fail path the upstream's truthful HTTP status is lost (row reads as 301). Strictly better than the `status = -1` retry-loop; lift deferred at 0.x.
…URE.md Adds a CLAUDE.md block under the "HEAD pre-flight の救済" section explaining the rescue's WHY (Page.goto returned null lands in `protocol`, not in `PERMANENT_ERROR_KINDS` nor `PUPPETEER_FALLBACK_KINDS` → infinite retry loop without intervention), the two firing call sites (HEAD-success then puppeteer-throw, HEAD-fail then puppeteer-fallback), the narrow trigger contract (`isJsRedirectErrorShape` + WHATWG-canonical `deriveJsRedirectTarget`), the known 200-to-301 status flip on the fallback path, and the E2E gap (current puppeteer version follows the JS navigation rather than throwing, so the rescue path itself is pinned by unit tests not the E2E). Adds one row to the ARCHITECTURE.md scraper error-handling table pointing at the new `buildJsRedirectEdge` / `derive-js-redirect-target` / `is-js-redirect-error-shape` JSDoc as the authoritative source. No source-code changes.
…o JS-redirect rescue fires The rescue introduced in a39a486 read `page.url()` only inside the catch arm of `#launchBrowserAndScrape`, but beholder's `scrapeStart` catches `#fetchData` throws internally and *returns* `{ type: 'error', shutdown: true, ... }` rather than re-throwing. The catch arm therefore never ran for the exact failure shape the rescue was designed to handle (`Page.goto returned null` from a client-side `window.location.replace()` / `<meta refresh>`), and the rescue effectively no-op'd in production. Capture `page.url()` on the error-return path too, while `page` is still alive (finally has not called `handleBrowserClose`), and attach `postNavigationUrl` to the returned envelope. The catch-arm capture is kept as the safety net for outer throws (browser.newPage, page.setUserAgent, ...) that bypass scrapeStart entirely. Verified live on a customer archive that previously stayed `status = -1 / UnknownError` across many `--retry-failed` passes: after this fix the source row flips to `status = 301 / Moved Permanently` with `redirectDestId` wired to the JS-redirect destination, and the page leaves the retry pool. Debug log shows `JS-redirect rescue fired for ... → ... (HEAD available)` followed by `Set redirect` / `Set redirected url` archive writes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This branch started as a tiny fix for an interval-delay quirk on DNS-burned hosts and grew into a wider hardening pass on the crawl convergence story — keeping
--retry-failedhonest, classifying every observable failure shape, and adding two new safety nets for puppeteer-side oddities. Also fixes a credential-leak guard for the scope-auth flow.Convergence:
--retry-failedstops looping on permanent failuresPERMANENT_ERROR_KINDSderived constant;Database.resetFailedPagesnow consultsclassifyErrorKindand refuses to reset rows whose latest message is permanent (NXDOMAIN, expired cert,ERR_BLOCKED_BY_CLIENT, HTTP parse error,ECONNREFUSED). Without this, every pass replayed the same failure forever.client-blockedErrorKindfor Chromium'sERR_BLOCKED_BY_*family, plusdns-transientsplit (EAI_AGAINno longer punishes hosts via the DNS-burned cache).tlsregex anchored against false-positivealtnamessubstrings,protocolmatcher catchesAttempted to use detached Frameand thePage.<method> returnedfamily.Rescue 1: HEAD-resolved chain when puppeteer dies on the destination
status=-1. Avoids the retry-forever loop on URLs whose chain truth is already known.Rescue 2: JS-redirect via
page.url()buildJsRedirectEdge+deriveJsRedirectTarget+isJsRedirectErrorShapehelpers. When puppeteer'spage.goto()resolves tonull(window.location.replace(...)/<meta refresh>) andpage.url()reports a meaningfully-different destination, record the source as a redirect-edge. WHATWG URL canonicalisation filters case / slash / port / credential noise; browser-internal schemes are rejected.scrapeStartreturns{type:'error'}rather than throwing on internal#fetchDatafailures, so thepage.url()capture had to be added on the return path too — the original catch-arm-only capture would have made the rescue dead in production.Rescue 3: HEAD/GET both dead → one puppeteer try
timeout/connection-reset/parse-error), launch puppeteer exactly once. Some WAF configurations drop bare probes but answer real navigations.Throughput / hygiene
10s → 30s → 60sper retry attempt (was fixed 10s) so slow but eventually-responsive hosts get rescued without paying the slow-server tax on the happy path.fetchDestinationwhen HEAD response is unusable (405 / parse-error / etc.).Security: scope-credential leak
page.authenticateis now registered unconditionally to drain Chromium's native HTTP-auth dialog (previously hung navigations on sub-resources demanding Basic auth).user:pass@host) are stripped beforepage.goto()so they never enter Chromium's HTTP-auth cache and never leak to cross-origin sub-resources on the same IP/port. Pinned by a new E2E (scope-auth-leak.e2e.ts).Query / viewer
getSummaryexcludesexcludes-pattern pages from status / content-type distributions (they were inflating the Unknown bucket).client-blocked/dns-transient/local-network/parse-errorErrorKindbuckets.Docs
--retry-failedconvergence rule, and theErrorKindderived-constants review checklist.Live verification
A previously stuck source row (
status=-1 / UnknownErrorafter 15--retry-failedpasses on a real archive) flipped tostatus=301 Moved PermanentlywithredirectDestIdset after one run with the rescue. Removing it from the retry pool permanently. Debug log:JS-redirect rescue fired for <source> → <dest> (HEAD available)followed bySet redirect/Set redirected urlarchive writes.Test plan
yarn build— all 14 projects passyarn test— 1861 unit tests pass (5 skipped)yarn vitest run --config vitest.e2e.config.ts— 25 files / 113 E2E passyarn lint— eslint + prettier + cspell + textlint + markuplint all clean🤖 Generated with Claude Code