feat(beholder): rewrite getAnchorList with single AX tree + parallel describeNode (#876) #877
Merged
…describeNode (#876)

Replaces the per-anchor `page.accessibility.snapshot({ root })` call (one CDP round-trip plus a Chrome-side AX subtree computation, ~42ms each) with:

1. A single `Accessibility.getFullAXTree` fetch that maps every non-ignored AX node by `backendDOMNodeId → accessible name`.
2. A parallel `DOM.describeNode({ objectId })` per anchor handle to resolve each anchor's backend node id.
3. `Promise.all` across every anchor with a lazy `textContent` fallback when the AX tree has no matching node (ignored / aria-hidden / display:none).

On the regression page (1181 anchors), `getAnchors` drops from ~53s to ~0.4s.

Resilience:

- AX tree fetch, `DOM.describeNode`, and the overall operation are each wrapped in `raceWithTimeout`; on the outer race firing, the function returns a snapshot of whatever anchors were collected so far.
- A `cancelled` flag short-circuits in-flight per-anchor work after the outer race trips so abandoned CDP work cannot mutate the returned array.
- `resolveAnchor` is wrapped in try/catch so one disposed handle cannot poison the `Promise.all` over the remaining anchors.

`Scraper#scrapeStart` now emits `%countdown(...)%s` placeholders on the `getAnchors` / `getMeta` / `extractImages` `changePhase` events so the Display renders a live countdown for these phases, matching `openPage`.

Behavior note: `DEFAULT_DOM_EVALUATION_TIMEOUT` raised from 30_000 to 180_000 so a single DOM-evaluation phase has the full upstream retryable budget (3 min) before forcing a retry. Public API surface and return shapes are unchanged; callers that relied on the implicit 30s default now wait up to 180s on unresponsive pages before the fallback kicks in. Pass an explicit `timeout` (or `ScraperOptions.domEvaluationTimeout`) to restore the prior budget.

Closes #876
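The three numbered steps in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `AXNodeLike`, `buildNameMap`, and `resolveNames` are hypothetical names, the node shape covers only the fields of the CDP `Accessibility.getFullAXTree` response used here, and the `describe` / `readText` callbacks stand in for the `DOM.describeNode` call and the `textContent` fetch.

```typescript
// Subset of the CDP AXNode shape used here (assumed; the real type has more fields).
interface AXNodeLike {
  ignored: boolean;
  backendDOMNodeId?: number;
  name?: { value?: unknown };
}

// Step 1: index every non-ignored AX node by its backend DOM node id.
function buildNameMap(nodes: AXNodeLike[]): Map<number, string> {
  const names = new Map<number, string>();
  for (const node of nodes) {
    if (node.ignored || node.backendDOMNodeId === undefined) continue;
    const value = node.name?.value;
    if (typeof value === "string") names.set(node.backendDOMNodeId, value);
  }
  return names;
}

// Steps 2 and 3: resolve every handle in parallel; read text content only
// when the AX map has no entry for the handle's backend node id.
async function resolveNames<H>(
  handles: H[],
  names: Map<number, string>,
  describe: (handle: H) => Promise<number>, // stand-in for DOM.describeNode
  readText: (handle: H) => Promise<string>, // stand-in for the textContent fetch
): Promise<string[]> {
  return Promise.all(
    handles.map(async (handle) => {
      const backendNodeId = await describe(handle);
      const name = names.get(backendNodeId);
      return name !== undefined ? name : readText(handle);
    }),
  );
}
```

Because the map is built once, each anchor costs one lookup plus one `describe` call, and `readText` runs only for anchors missing from the AX tree.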
166b51b to 7e5b089
… coverage

Mocking `_client()` in unit tests was masking a real risk: if puppeteer ever renames the internal accessor (it is not part of the public API), `getInternalCDPClient` returns null and every anchor silently falls back to textContent-only mode, so the perf rewrite reverts to per-anchor CDP without any failing test.

- Emit a one-shot WARN log via the beholder:dom debug namespace when `_client()` is missing or throws so the degraded mode is observable in production logs.
- Add a tripwire spec that loads the installed `puppeteer-core` CDP Page source and asserts the `_client()` method still exists. If a future puppeteer release removes or renames it, this test fails and forces a maintainer to update the accessor path instead of silently degrading.
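A guarded accessor with a one-shot warning might look like the sketch below. It is an illustration under assumptions, not the PR's code: `createClientAccessor` is a hypothetical factory, and the `warn` callback stands in for the `beholder:dom` debug logger. Only the idea of probing the non-public `_client()` method, returning `null` on failure, and warning once comes from the commit message.

```typescript
type Warn = (message: string) => void;

// Builds an accessor that probes page._client() and warns at most once
// when the method is missing or throws.
function createClientAccessor(warn: Warn) {
  let warned = false;

  const warnOnce = (reason: string): null => {
    if (!warned) {
      warned = true;
      warn(`CDP client unavailable (${reason}); falling back to textContent`);
    }
    return null;
  };

  return function getInternalCDPClient(page: unknown): unknown {
    // _client is an internal accessor, so it is read defensively.
    const accessor = (page as { _client?: unknown } | null)?._client;
    if (typeof accessor !== "function") return warnOnce("_client() is missing");
    try {
      return accessor.call(page);
    } catch {
      return warnOnce("_client() threw");
    }
  };
}
```

Keeping the warned state inside the factory closure means each accessor logs once, however many anchors hit the degraded path.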
Summary
Closes #876.
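Replaces the per-anchor `page.accessibility.snapshot({ root })` call in `getAnchorList` (one CDP round-trip plus a Chrome-side AX subtree computation per anchor, ~42ms) with Strategy F:

1. Call `Accessibility.getFullAXTree` exactly once and store every non-ignored AX node in a Map of `backendDOMNodeId → accessible name`.
2. Call `DOM.describeNode({ objectId })` in parallel to resolve the backend node ids.
3. Fall back to a `textContent` fetch when the AX tree has no matching node.

On the regression page (1181 anchors), the `getAnchors` phase drops from ~53s to ~0.4s (~130×).

Hardening

- The AX tree fetch, `DOM.describeNode`, and the overall operation are each wrapped in `raceWithTimeout`. When the outer race fires, the function returns a snapshot of the anchors collected so far (partial results).
- A `cancelled` flag short-circuits in-flight per-anchor work so that no late mutation reaches the array that has already been returned.
- The whole of `resolveAnchor` is wrapped in try/catch so that a single disposed handle cannot take down the entire `Promise.all`.

Added alongside (related UX improvement)

- Adds a `%countdown(${domEvaluationTimeout},...)%s` placeholder to each of the `getAnchors` / `getMeta` / `extractImages` `changePhase` emits in `Scraper#scrapeStart`. The Dealer's Display then renders a live countdown, as it does for `openPage`.

Behavior note (API contract unchanged)

- `DEFAULT_DOM_EVALUATION_TIMEOUT` is raised from 30_000 to 180_000. This aligns it with the retryable timeout (3 min) of the upstream `Scraper#fetchData`, so that a single phase does not run out of budget.
- Types, signatures, and return values are unchanged. Callers that rely on the default now wait 180s instead of 30s on unresponsive pages before the fallback kicks in. To keep the previous behavior, pass an explicit value via `getAnchorList(page, options, timeout)` or `ScraperOptions.domEvaluationTimeout`.
- The docstring for `ScraperOptions.domEvaluationTimeout` in `types.ts` is updated from 30s to 180s to match.

Test plan

- `yarn lint` passes
- `yarn build` passes
- `yarn test` passes (24 cases added or rewritten, all 689 cases green)
- … `nodes`) / describeNode malformed (no `node`)
- Falls back to textContent when `_client()` is absent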
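The timeout and cancellation behavior described in this PR can be illustrated with a small sketch. The `raceWithTimeout` signature here is assumed (the PR text does not show it), and `collectWithBudget` and `resolveOne` are hypothetical names. The sketch only shows how an outer timeout, a `cancelled` flag, and a per-item try/catch can combine to return partial results without late mutation.

```typescript
// Assumed helper: resolves with fallback() if `work` has not settled within `ms`.
function raceWithTimeout<T>(work: Promise<T>, ms: number, fallback: () => T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback()), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

async function collectWithBudget<I, O>(
  items: I[],
  resolveOne: (item: I) => Promise<O>,
  ms: number,
): Promise<O[]> {
  const collected: O[] = [];
  let cancelled = false;

  const all = Promise.all(
    items.map(async (item) => {
      try {
        const out = await resolveOne(item);
        // Work that finishes after the outer timeout is dropped.
        if (!cancelled) collected.push(out);
      } catch {
        // A single failing item must not reject the whole Promise.all.
      }
    }),
  ).then(() => collected);

  return raceWithTimeout(all, ms, () => {
    cancelled = true;
    return collected.slice(); // snapshot of what was gathered so far
  });
}
```

In this sketch results are appended in completion order, not input order; an implementation that needs stable ordering would write into indexed slots instead.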