perf: vectorize KV cache prefix matching with numpy #2179

Open

nausicaalii wants to merge 1 commit into abetlen:main from nausicaalii:perf/vectorize-prefix-match

Conversation

@nausicaalii

Summary

  • Replace the O(n) Python for-loop in generate() KV cache prefix matching and longest_token_prefix() with a numpy vectorized element-wise comparison
  • Use np.argmin on a boolean equality array to find the first mismatch position in a single vectorized pass
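The approach described in the bullets above might be sketched roughly as follows. This is a hypothetical standalone version for illustration, not the PR's actual code; the function name and signature are assumptions. Note that np.argmin returns 0 when every element matches (there is no False in the array), so the full-match case needs an explicit check.

```python
import numpy as np

def longest_token_prefix_np(a, b):
    """Length of the longest common prefix of two token sequences.

    Illustrative sketch of the vectorized technique: compare the
    overlapping region element-wise, then locate the first mismatch
    with np.argmin on the boolean result.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0
    # Boolean array: True where tokens agree, False at mismatches.
    eq = np.asarray(a[:n]) == np.asarray(b[:n])
    # argmin finds the first False; if there is no False it returns 0,
    # so handle the all-match case separately.
    if eq.all():
        return n
    return int(np.argmin(eq))
```

The np.all check is the one subtlety: without it, two identical sequences would report a prefix length of 0 instead of n.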

Motivation

The current prefix matching iterates token-by-token in Python to find where the cached prompt diverges from the new prompt. This is fine for short prompts, but becomes a bottleneck as conversation history grows — multi-turn chat sessions can accumulate 10K–100K+ tokens in input_ids, and the linear Python loop runs on every generate() call.

Numpy's vectorized comparison runs in optimized C/SIMD, giving significant speedup for large token sequences while preserving identical behavior.

Test plan

  • Verified longest_token_prefix correctness across edge cases: empty sequences, full match, partial match, single element, no match, different lengths, large sequences (10K tokens)
  • test_real_model — passes (low-level batch decode)
  • test_real_llama — passes (multiple sequential create_completion calls that exercise prefix matching)
  • test_real_llama_embeddings — passes
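The edge cases listed in the test plan can be exercised with a quick standalone check like the one below. The helper is a sketch of the vectorized approach under the same assumptions as above; it is illustrative, not the repository's actual test code.

```python
import numpy as np

def longest_token_prefix(a, b):
    # Vectorized first-mismatch search, as described in the PR summary.
    n = min(len(a), len(b))
    if n == 0:
        return 0
    eq = np.asarray(a[:n]) == np.asarray(b[:n])
    return n if eq.all() else int(np.argmin(eq))

# Edge cases from the test plan:
assert longest_token_prefix([], []) == 0                    # empty sequences
assert longest_token_prefix([1, 2, 3], [1, 2, 3]) == 3      # full match
assert longest_token_prefix([1, 2, 3], [1, 2, 9]) == 2      # partial match
assert longest_token_prefix([7], [7]) == 1                  # single element
assert longest_token_prefix([1, 2], [9, 9]) == 0            # no match
assert longest_token_prefix([1, 2, 3, 4], [1, 2]) == 2      # different lengths
big = list(range(10_000))
assert longest_token_prefix(big, big) == 10_000             # large sequences
```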

Replace O(n) Python for-loop in KV cache prefix matching and
longest_token_prefix() with numpy vectorized comparison.

The element-wise numpy comparison runs in optimized C/SIMD
instead of Python's interpreter loop, which matters as
conversation history grows (10K+ tokens).

No change in behavior — both paths find the first position
where cached and new token sequences diverge.