Added the NEON implementation for distance calculation and added the corresponding UT test. #657
AmberLZY wants to merge 2 commits into
@AmberLZY, could you let us know roughly what the performance improvement was with the NEON optimization on the platform you tested on?
@kiplingw, this is the performance result we obtained with our own benchmark on an ARM chip (KUNPENG 920 7280Z). If you have other benchmarks, we can use them for testing.
Just to be clear, the first column is a dataset and the next two columns are what?
The next two columns indicate the throughput (QPS) of the entire system.
Great! Out of curiosity, do you have a benchmark for construction as well?
Yes, we built it ourselves.
Can you show us the benchmark for constructing the index with and without the NEON optimization?
Here is the link to our benchmark: The NEON optimization can be enabled or disabled with the -DUSE_NEON option at compile time.
Thank you, but I don't have a test bed available. What I was asking is whether you could provide benchmark data, like you did in your table above, but specifically on the time improvement to construct the index with the NEON ISA hardware acceleration.
We did not test improvements in the time taken to construct the index. Our tests focus on the improvement of QPS in the search phase.
It would be helpful if you could benchmark construction time, since your optimizations, at least on ARM, likely affect it. Search time is important, but so is construction time for some users.
Thank you very much for your suggestion! We will consider this for future improvements.
Hi @ilyajob05 and @yurymalkov, hope you're both doing well! We wanted to follow up on this PR, which adds ARM NEON hardware acceleration for IP and L2 distance calculations. We understand you're busy, and we truly appreciate the effort the maintainers have been putting into the project recently. A quick summary of what this PR brings: NEON-optimized implementations for both IP and L2 distance, with build-time opt-in via -DUSE_NEON. With the growing adoption of ARM-based servers in HPC and cloud environments, we believe this optimization could benefit a meaningful portion of the community. If there's anything we can do to help review or refine this PR, please let us know. Thank you again for your time and for maintaining this project!
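For readers who have not looked at the diff, the following is a minimal sketch of what NEON-accelerated squared-L2 and inner-product kernels with a build-time opt-in can look like. It is not the code from this PR. All function names (`l2sqr`, `dot`, and the `_scalar`/`_neon` variants) and the `SKETCH_HAVE_NEON` macro are hypothetical, and `USE_NEON` is treated here as a plain preprocessor macro; how the PR wires the option through the build is not shown in this thread. The NEON path assumes AArch64, because it uses the `vaddvq_f32` horizontal add.

```cpp
#include <cstddef>

// USE_NEON mirrors the -DUSE_NEON build option discussed in this thread.
// All function names here are hypothetical, not taken from the PR's diff.
#if defined(USE_NEON) && defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

// Portable reference: squared Euclidean distance.
inline float l2sqr_scalar(const float* a, const float* b, std::size_t dim) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Portable reference: dot product (the quantity an IP distance is built on).
inline float dot_scalar(const float* a, const float* b, std::size_t dim) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) sum += a[i] * b[i];
    return sum;
}

#if SKETCH_HAVE_NEON
// Four floats per iteration, with a scalar loop for the leftover elements.
inline float l2sqr_neon(const float* a, const float* b, std::size_t dim) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    std::size_t i = 0;
    for (; i + 4 <= dim; i += 4) {
        const float32x4_t d = vsubq_f32(vld1q_f32(a + i), vld1q_f32(b + i));
        acc = vfmaq_f32(acc, d, d);  // acc += d * d, lane-wise
    }
    float sum = vaddvq_f32(acc);     // horizontal add of the four lanes
    for (; i < dim; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

inline float dot_neon(const float* a, const float* b, std::size_t dim) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    std::size_t i = 0;
    for (; i + 4 <= dim; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    float sum = vaddvq_f32(acc);
    for (; i < dim; ++i) sum += a[i] * b[i];
    return sum;
}
#endif

// Dispatch is resolved at compile time, so the scalar build is unaffected
// when the option is off or the target is not AArch64.
inline float l2sqr(const float* a, const float* b, std::size_t dim) {
#if SKETCH_HAVE_NEON
    return l2sqr_neon(a, b, dim);
#else
    return l2sqr_scalar(a, b, dim);
#endif
}

inline float dot(const float* a, const float* b, std::size_t dim) {
#if SKETCH_HAVE_NEON
    return dot_neon(a, b, dim);
#else
    return dot_scalar(a, b, dim);
#endif
}
```

In this sketch the scalar tail loop handles dimensions that are not a multiple of four, which is the case a vectorized kernel most often gets wrong.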
Hello @wjunLu, @AmberLZY
Hello @ilyajob05, we have removed the company header. If there are any other requirements, please let us know.
Hello @kiplingw, based on your previous suggestion, we tested the index construction time on the ARM platform, and the comparison results show that NEON optimization can help reduce the index construction time.
Good work @AmberLZY. There are improvements in most of your data sets, and it looks like the best improvements are with 16-bit floats.
Hi @ilyajob05, thank you so much for your thoughtful feedback on the copyright header. We really appreciate you taking the time to explain the reasoning. Just wanted to let you know that @AmberLZY has already addressed your concern: the corporate copyright header has been removed from the code. In addition, based on @kiplingw's earlier suggestion, construction-time benchmark results have also been provided, showing measurable improvements on ARM platforms. We believe the PR is now in good shape and ready for re-review. We genuinely hope this contribution can be merged: ARM NEON optimization is increasingly relevant as ARM-based servers grow in adoption, and we'd love for the broader hnswlib community to benefit from it. If there's anything else you'd like us to adjust or clarify, we're very happy to do so. Thank you again for your time and for maintaining this wonderful project!


Summary
Changes
Design Decisions
Testing
Unit tests for the NEON implementation of distance calculations have been added.
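As an illustration of the kind of check such a unit test can perform (a sketch under stated assumptions, not the test added in this PR), the snippet below compares a kernel under test against a scalar reference across every dimension from 1 to `max_dim`, so that lengths that are not a multiple of the vector width are covered. The names `ref_l2sqr`, `lanes4_l2sqr`, `DistFn`, and `first_mismatch` are hypothetical. `lanes4_l2sqr` is a portable stand-in that mimics a 4-lane accumulation pattern so the sketch runs on any machine; a real test would pass the NEON kernel instead.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Scalar reference for squared L2 distance.
static float ref_l2sqr(const float* a, const float* b, std::size_t dim) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Portable stand-in for the kernel under test: four independent accumulators
// plus a scalar tail, which reorders the additions the way a 4-lane SIMD
// kernel would.
static float lanes4_l2sqr(const float* a, const float* b, std::size_t dim) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= dim; i += 4) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            const float d = a[i + lane] - b[i + lane];
            acc[lane] += d * d;
        }
    }
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < dim; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

using DistFn = float (*)(const float*, const float*, std::size_t);

// Returns the first dimension in [1, max_dim] at which the two kernels differ
// by more than a relative tolerance, or 0 if they agree everywhere.
static std::size_t first_mismatch(DistFn ref, DistFn opt,
                                  std::size_t max_dim, float rel_tol) {
    std::mt19937 rng(42);  // fixed seed keeps the test reproducible
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    for (std::size_t dim = 1; dim <= max_dim; ++dim) {
        std::vector<float> a(dim), b(dim);
        for (std::size_t i = 0; i < dim; ++i) {
            a[i] = dist(rng);
            b[i] = dist(rng);
        }
        const float r = ref(a.data(), b.data(), dim);
        const float o = opt(a.data(), b.data(), dim);
        if (std::fabs(r - o) > rel_tol * (1.0f + std::fabs(r))) return dim;
    }
    return 0;
}
```

A relative tolerance is used rather than exact equality because a vectorized kernel sums the terms in a different order from the scalar loop, so the two results can differ in the last few bits.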
Benchmark results show improved search throughput (QPS) on an ARM server (KUNPENG 920 7280Z).
Notes
We are happy to work with the maintainers to refine the code structure as needed.
If this contribution is well received, we have additional optimizations to offer, such as a NEON implementation for FP16 distance calculations and an optimization for ID reordering in the HNSW base library, which can be submitted in subsequent PRs.