Skip to content

Parallelise skill file fetches and reuse HTTPS connections#5335

Open
jamesbroadhead wants to merge 2 commits into
mainfrom
jb/speed-up-skills-install
Open

Parallelise skill file fetches and reuse HTTPS connections#5335
jamesbroadhead wants to merge 2 commits into
mainfrom
jb/speed-up-skills-install

Conversation

@jamesbroadhead
Copy link
Copy Markdown
Contributor

Summary

  • Hoist a single shared http.Client in libs/aitools/installer/installer.go so the transport pool reuses TCP+TLS connections across fetches. MaxIdleConnsPerHost is bumped from Go's default 2 → 16 so parallel fetches to raw.githubusercontent.com actually reuse connections instead of churning handshakes.
  • Fetch a skill's files concurrently in installSkillToDir via errgroup.WithContext with SetLimit(8). First error cancels in-flight peers, preserving the prior bail-on-first-error semantics.

Why

databricks aitools install was sequential: every file constructed a fresh &http.Client{} and threw it away, paying the full TCP+TLS handshake per file. For --experimental (26 skills × ~5 files each = ~120 HTTPS GETs to GitHub raw) that meant wall-clock was dominated by handshake round-trips.

Benchmarked locally against https://raw.githubusercontent.com/databricks/databricks-agent-skills/v0.2.0 with a freshly built CLI on each side:

Condition Baseline This change Speedup
Cold (no DNS/CDN warm) 42.7 s 0.52 s ~80×
Warm 0.55–0.79 s 0.20–0.21 s ~3–4×

The cold-start number is what a first-time databricks aitools install --experimental user actually pays.

The fetchFileFn package var that tests stub remains unchanged, so no test mocks need updating.

Test plan

  • go test ./libs/aitools/... — all green.
  • gofmt -l libs/aitools/installer/installer.go — clean.
  • go vet ./libs/aitools/... — clean.
  • End-to-end timing against live GitHub raw — see table above.

This pull request and its description were written by Isaac.

Each skill file was fetched sequentially with a fresh http.Client per call,
so every file paid the full TCP+TLS handshake to raw.githubusercontent.com
and `databricks aitools install --experimental` (26 skills, ~120 files) took
~40 s on a cold network. Hoist a shared http.Client with a tuned transport,
and fetch a skill's files concurrently with errgroup. Wall-clock drops to
under a second.

Co-authored-by: Isaac
@eng-dev-ecosystem-bot
Copy link
Copy Markdown
Collaborator

eng-dev-ecosystem-bot commented May 27, 2026

Commit: 48261fe

Run: 26502959179

Cover concurrent skill file fetches and errgroup cancellation so the installer performance path keeps its intended behavior.
Copy link
Copy Markdown
Member

@simonfaltum simonfaltum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Focused change, the two halves (parallel fetches + shared transport with bumped MaxIdleConnsPerHost) belong together. Tests verify the right invariants (both goroutines reach the fetcher before either releases; errgroup propagates cancellation to in-flight peers) and survived -race -count=50 locally.

Nits, none blocking:

  • The 8 concurrency limit and * 2 idle-conn slack would benefit from a one-line rationale.
  • sync.OnceValue is the codebase's lazy-init idiom (e.g. libs/clicompat/clicompat.go); the IIFE works fine since init is eager, just flagging for consistency.
  • Skills are still installed serially across the outer loop, so the cold-start win mostly comes from the shared client keeping idle conns warm between skills, not intra-skill parallelism. Worth a sentence in the PR body.
  • TestInstallSkillToDirCancelsInFlightFetchesOnError uses a 1s deadline; bumping to ~5s would be cheap CI insurance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants