Add NIXL to test suite #16
This looks good, I've cherry-picked it onto |
Thank you, that makes me so happy! I can see the scripts and version pins are on master now. I noticed Geoff dropped the placeholder row I manually added to the README. Did I miss a step or does the row appear once there are actual results for nixl? Either way, if there's anything I can do to make your lives easier, just let me know! ^^
Indeed, the row should appear once there are some results, probably around this time tomorrow. Nothing missed on your part ^^ If you could submit any future PRs against master that would be helpful; don't worry about rebasing the one you already have open for code_saturne though - I'll cherry-pick that once I've reviewed it.
Summary
Adds clone, build, and example scripts for NIXL (e128059), NVIDIA's network interconnect library for GPU-to-GPU transfers across heterogeneous fabrics. The validation builds NIXL against a locally-compiled UCX and runs the bundled C++ transfer example.
Dependencies
NIXL requires UCX with multi-thread support (--enable-mt). The Ubuntu-packaged UCX (1.16) is too old and lacks APIs that NIXL uses. 00-clone.sh therefore clones UCX (c982cef) alongside NIXL, and 01-build.sh builds and installs UCX from source before building NIXL.
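For readers who have not opened 00-clone.sh, pinning each repository to a commit might look roughly like the sketch below. The helper name clone_pinned is hypothetical and not taken from the script, and the repository URLs in the usage comment are assumptions; only the short hashes come from this description.

```shell
# Hypothetical helper (not from 00-clone.sh): clone a repository if it is not
# already present, then check out one pinned commit so builds are reproducible.
clone_pinned() {
    url=$1
    dir=$2
    commit=$3
    if [ ! -d "$dir/.git" ]; then
        git clone --quiet "$url" "$dir" || return 1
    fi
    # Leaves the working tree on a detached HEAD at the pinned commit.
    git -C "$dir" checkout --quiet "$commit"
}

# Intended use (URLs are assumptions; hashes are the ones quoted above):
#   clone_pinned https://github.com/openucx/ucx.git ucx c982cef
#   clone_pinned https://github.com/ai-dynamo/nixl.git nixl e128059
```

Pinning by commit rather than by branch keeps the UCX and NIXL versions fixed between runs, which is presumably why the description quotes hashes instead of branch names.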
Build notes
UCX is built first into ucx-install/ using autoconf. The --with-cuda flag points at the CUDA toolkit that test.sh provides via environment variables.
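As a rough illustration of the UCX step, the configure arguments described above could be assembled as follows. This is a sketch, not the contents of 01-build.sh: the function name ucx_configure_args is hypothetical, and CUDA_PATH is an assumed variable name, since test.sh may export the toolkit location under a different one.

```shell
# Sketch only: print the UCX configure arguments described above, one per line.
# $1 = install prefix, $2 = CUDA toolkit directory.
ucx_configure_args() {
    prefix=$1
    cuda_dir=$2
    printf '%s\n' "--prefix=$prefix" "--enable-mt" "--with-cuda=$cuda_dir"
}

# Intended use from inside the UCX checkout (not run here; CUDA_PATH is an
# assumed name, and the unquoted expansion assumes paths without spaces):
#   ./autogen.sh
#   ./configure $(ucx_configure_args "$PWD/../ucx-install" "$CUDA_PATH")
#   make -j"$(nproc)" && make install
```

Installing into a private prefix such as ucx-install/ keeps the source-built UCX separate from the Ubuntu-packaged 1.16.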
NIXL is then built with Meson and Ninja. Two points to note:
Validation
02-example.sh runs nixl/build/examples/cpp/nixl_example and checks that the output contains Test done, which the example prints on a successful end-to-end transfer. Two environment variables are required at runtime:
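Independently of those variables, the pass/fail check on the example's output could be sketched as below. The function name check_example_output is hypothetical, and 02-example.sh may implement the check differently; only the binary path and the Test done marker come from this description.

```shell
# Sketch of the output check: read the example's output on stdin and succeed
# only if it contains the literal marker string "Test done".
check_example_output() {
    grep -qF "Test done"
}

# Intended use (not run here):
#   if nixl/build/examples/cpp/nixl_example 2>&1 | check_example_output; then
#       echo PASS
#   else
#       echo FAIL
#   fi
```

Matching on the marker string instead of trusting the exit code alone guards against a run that exits cleanly without completing the transfer.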
Status
Passes locally on sm_75 under both native CUDA 13.1 and SCALE 1.7.0. Results are marked ? pending CI validation on the repo's target hardware.