feat(beholder): rewrite getAnchorList with single AX tree + parallel describeNode (#876) #877

Merged: YusukeHirao merged 3 commits into dev from perf/beholder-getanchor-list-876 on Jun 16, 2026
@YusukeHirao YusukeHirao commented Jun 16, 2026

Summary

Closes #876.

Replaces the per-anchor `page.accessibility.snapshot({ root })` call in `getAnchorList` (one CDP round-trip per anchor plus a Chrome-side AX subtree computation, ~42ms each) with Strategy F:

  1. Call `Accessibility.getFullAXTree` exactly once and store every non-ignored AX node in a `Map` of `backendDOMNodeId → accessible name`
  2. Call `DOM.describeNode({ objectId })` in parallel for each anchor handle to resolve its backend node id
  3. Process all anchors in parallel with `Promise.all`, falling back to a lazy `textContent` fetch only on an AX miss
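
Step 1 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `buildAxNameMap` and the `AXNodeLike` types are hypothetical names, and only the `ignored`, `backendDOMNodeId`, and `name.value` fields of the CDP `AXNode` shape are assumed.

```typescript
// Minimal subset of the CDP Accessibility.AXNode shape used by this sketch.
interface AXValueLike {
  value?: unknown;
}

interface AXNodeLike {
  ignored: boolean;
  backendDOMNodeId?: number;
  name?: AXValueLike;
}

// Hypothetical helper: index every non-ignored AX node by its backend DOM node id.
// A malformed response (no `nodes`) yields an empty map, so every anchor would
// count as an AX miss.
function buildAxNameMap(
  response: { nodes?: AXNodeLike[] } | null | undefined,
): Map<number, string> {
  const names = new Map<number, string>();
  for (const node of response?.nodes ?? []) {
    // Ignored nodes and nodes without a DOM counterpart are skipped.
    if (node.ignored || node.backendDOMNodeId === undefined) continue;
    const value = node.name?.value;
    // Assumption: a missing or non-string name is stored as an empty string.
    names.set(node.backendDOMNodeId, typeof value === "string" ? value : "");
  }
  return names;
}
```

Building the map once turns each per-anchor lookup into an in-memory `Map#get` instead of a CDP round-trip.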

On the regression page (1181 anchors), the `getAnchors` phase drops from ~53 s to ~0.4 s (~130×).

Resilience

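  • The AX tree fetch, `DOM.describeNode`, and the overall operation are each wrapped in `raceWithTimeout`. If the outer race fires, the function returns a snapshot of the anchors collected so far (partial result)
  • After the outer race fires, a `cancelled` flag short-circuits in-flight per-anchor work so that no late mutation reaches the array already returned
  • All of `resolveAnchor` is wrapped in try/catch so that a single disposed handle cannot take down the whole `Promise.all`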
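
The timeout and cancellation behavior above can be sketched as follows. The real `raceWithTimeout` signature is not shown in this PR, so the version below is a hypothetical stand-in, and `collectWithBudget` is an illustrative name for the outer operation.

```typescript
// Sentinel returned when the timer wins the race.
const TIMED_OUT = Symbol("timed-out");

// Hypothetical stand-in for raceWithTimeout: resolves with the work's result,
// or with TIMED_OUT once `ms` milliseconds have elapsed.
async function raceWithTimeout<T>(
  work: Promise<T>,
  ms: number,
): Promise<T | typeof TIMED_OUT> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // do not keep the event loop alive after the race settles
  }
}

// Runs all tasks in parallel and returns whatever finished within the budget.
async function collectWithBudget<T>(
  tasks: Array<() => Promise<T>>,
  ms: number,
): Promise<T[]> {
  const collected: T[] = [];
  let cancelled = false;
  const all = Promise.all(
    tasks.map(async (task) => {
      try {
        const value = await task();
        // Work that finishes after the deadline is dropped, not appended.
        if (!cancelled) collected.push(value);
      } catch {
        // One failing task (e.g. a disposed handle) must not reject the batch.
      }
    }),
  );
  await raceWithTimeout(all, ms);
  cancelled = true;
  return [...collected]; // snapshot copy, so the caller's array is never mutated later
}
```

Returning a copy and flipping `cancelled` are two independent guards: the copy protects the caller's array, while the flag stops abandoned work from doing anything further with its results.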

Added alongside (related UX improvement)

  • Added a `%countdown(${domEvaluationTimeout},...)%s` placeholder to the `changePhase` emits for `getAnchors` / `getMeta` / `extractImages` in `Scraper#scrapeStart`. Dealer's Display now renders a live countdown for these phases, as it does for `openPage`

Behavior notes (the API contract is unchanged)

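  • `DEFAULT_DOM_EVALUATION_TIMEOUT` is raised from 30_000 to 180_000. This matches the retryable timeout (3 minutes) of the upstream `Scraper#fetchData` and gives a single DOM-evaluation phase that full budget before a retry is forced. Types, signatures, and return values are unchanged. Callers that rely on the default now wait up to 180s instead of 30s on an unresponsive page before the fallback kicks in. To restore the previous behavior, pass an explicit value to `getAnchorList(page, options, timeout)` or `ScraperOptions.domEvaluationTimeout`
  • The docstring for `ScraperOptions.domEvaluationTimeout` in `types.ts` is also updated from 30s to 180s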

Test plan

  • yarn lint passes
  • yarn build passes
  • yarn test passes (24 cases added or rewritten; all 689 cases green)
  • New test coverage:
    • AX hit / empty AX name / AX miss / AX tree fetch reject / describeNode reject
    • AX tree malformed (no nodes) / describeNode malformed (no node)
    • ignored AX node (aria-hidden / display:none) falls back to textContent
    • a disposed handle does not take other anchors down with it
    • partial results are returned when the overall timeout fires
    • textContent fallback when `_client()` is absent
  • Verification in a real browser (reproduce on the regression page from issue #876, "[Beholder] getAnchorList is extremely slow on link-heavy pages because of N per-anchor CDP round-trips"; to be done manually during review)

@YusukeHirao YusukeHirao requested a review from yusasa16 as a code owner June 16, 2026 09:14
…describeNode (#876)

Replaces the per-anchor `page.accessibility.snapshot({ root })` call (one CDP
round-trip plus a Chrome-side AX subtree computation, ~42ms each) with:

1. A single `Accessibility.getFullAXTree` fetch that maps every non-ignored
   AX node by `backendDOMNodeId → accessible name`.
2. A parallel `DOM.describeNode({ objectId })` per anchor handle to resolve
   each anchor's backend node id.
3. `Promise.all` across every anchor with a lazy `textContent` fallback when
   the AX tree has no matching node (ignored / aria-hidden / display:none).
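
Steps 2 and 3 can be sketched as below, assuming the AX name map from step 1 already exists. `resolveNames`, `AnchorHandleLike`, and `AnchorDeps` are hypothetical names; the two injected functions stand in for the `DOM.describeNode` call and the lazy `textContent` read, whose real wiring is not shown here.

```typescript
interface AnchorHandleLike {
  objectId: string;
}

interface AnchorDeps {
  // Stand-in for CDP DOM.describeNode({ objectId }); only backendNodeId is used.
  describeNode(objectId: string): Promise<{ node?: { backendNodeId?: number } }>;
  // Stand-in for the lazy textContent fetch used on an AX miss.
  readTextContent(objectId: string): Promise<string>;
}

// Resolves one accessible name per anchor, in input order, with all anchors in flight at once.
async function resolveNames(
  handles: AnchorHandleLike[],
  axNames: Map<number, string>,
  deps: AnchorDeps,
): Promise<string[]> {
  return Promise.all(
    handles.map(async (handle) => {
      try {
        const described = await deps.describeNode(handle.objectId);
        const id = described.node?.backendNodeId;
        const name = id === undefined ? undefined : axNames.get(id);
        // Assumption: only a missing map entry counts as an AX miss;
        // an empty AX name is returned as-is.
        if (name !== undefined) return name;
      } catch {
        // describeNode rejected (e.g. disposed handle): use the fallback below.
      }
      // The fallback is only reached on a miss, so AX hits cost no extra round-trip.
      return deps.readTextContent(handle.objectId).catch(() => "");
    }),
  );
}
```

Because every per-anchor callback catches its own errors, the `Promise.all` itself never rejects in this sketch.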

On the regression page (1181 anchors), `getAnchors` drops from ~53s to ~0.4s.

Resilience:
- AX tree fetch, `DOM.describeNode`, and the overall operation are each
  wrapped in `raceWithTimeout`; on the outer race firing, the function
  returns a snapshot of whatever anchors were collected so far.
- A `cancelled` flag short-circuits in-flight per-anchor work after the
  outer race trips so abandoned CDP work cannot mutate the returned array.
- `resolveAnchor` is wrapped in try/catch so one disposed handle cannot
  poison the `Promise.all` over the remaining anchors.

`Scraper#scrapeStart` now emits `%countdown(...)%s` placeholders on the
`getAnchors` / `getMeta` / `extractImages` `changePhase` events so the
Display renders a live countdown for these phases, matching `openPage`.

Behavior note: `DEFAULT_DOM_EVALUATION_TIMEOUT` raised from 30_000 to
180_000 so a single DOM-evaluation phase has the full upstream retryable
budget (3 min) before forcing a retry. Public API surface and return
shapes are unchanged; callers that relied on the implicit 30s default
now wait up to 180s on unresponsive pages before the fallback kicks in.
Pass an explicit `timeout` (or `ScraperOptions.domEvaluationTimeout`) to
restore the prior budget.

Closes #876
@YusukeHirao YusukeHirao force-pushed the perf/beholder-getanchor-list-876 branch from 166b51b to 7e5b089 Compare June 16, 2026 09:21
@YusukeHirao YusukeHirao changed the title from "feat(beholder)!: rewrite getAnchorList with single AX tree + parallel describeNode (#876)" to "feat(beholder): rewrite getAnchorList with single AX tree + parallel describeNode (#876)" Jun 16, 2026
… coverage

Mocking `_client()` in unit tests was masking a real risk: if puppeteer
ever renames the internal accessor (it is not part of the public API),
`getInternalCDPClient` returns null and every anchor silently falls back
to textContent-only mode — the perf rewrite reverts to per-anchor CDP
without any failing test.

- Emit a one-shot WARN log via the beholder:dom debug namespace when
  `_client()` is missing or throws so the degraded mode is observable
  in production logs.
- Add a tripwire spec that loads the installed `puppeteer-core` CDP Page
  source and asserts the `_client()` method still exists. If a future
  puppeteer release removes or renames it, this test fails and forces a
  maintainer to update the accessor path instead of silently degrading.
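
The two safeguards in this commit could look roughly like the sketch below. Both helpers are hypothetical: the actual spec reads the installed `puppeteer-core` source from disk and the actual warning goes through the `beholder:dom` debug namespace, whereas here the source text and the logger are passed in as plain arguments to keep the example self-contained.

```typescript
// Hypothetical tripwire check: does the class source still declare `methodName`
// as a method? A call such as `this._client()` deliberately does not count.
function declaresMethod(source: string, methodName: string): boolean {
  const escaped = methodName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // A line that starts (after indentation) with `name(...) {`.
  const declaration = new RegExp(`^\\s*${escaped}\\s*\\([^)]*\\)\\s*\\{`, "m");
  return declaration.test(source);
}

// Hypothetical one-shot logger: forwards the first message and swallows the rest,
// so degraded mode is visible in logs without one line per anchor.
function createOneShotWarn(log: (message: string) => void): (message: string) => void {
  let warned = false;
  return (message) => {
    if (warned) return;
    warned = true;
    log(message);
  };
}
```

A text-level check like this is coarse, but it fails loudly on a rename, which is the property the tripwire is after.
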
@YusukeHirao YusukeHirao merged commit df97f3f into dev Jun 16, 2026
2 checks passed
@YusukeHirao YusukeHirao deleted the perf/beholder-getanchor-list-876 branch June 16, 2026 10:19