
Improve CUDA resource management for MPI jobs#185

Open
vmitq wants to merge 2 commits into wavefunction91:master from vmitq:feature/cuda-oversubscribe

Conversation

@vmitq

@vmitq vmitq commented Mar 13, 2026

The code detects the number of local MPI processes and the available CUDA devices, then assigns a GPU to each process in round-robin fashion. When reporting available memory, each GPU's free memory is divided evenly among the processes sharing that GPU.

This simplifies GPU resource management for jobs that run multiple GPUs per host or multiple processes per GPU.

Split memory evenly between processes on one GPU

Copilot AI left a comment


Pull request overview

This PR makes device-backend initialization MPI-aware (when built with MPI), enabling round-robin per-node GPU assignment and adjusting the reported available GPU memory for multi-process-per-GPU runs.

Changes:

  • Extend make_device_backend to accept an MPI communicator (when MPI is enabled) and propagate it from the device runtime environment.
  • Add an MPI-aware CUDABackend constructor that selects a GPU based on local shared-memory rank.
  • Adjust CUDABackend::get_available_mem() to scale down available memory based on local process/device sharing.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

File: Description
src/runtime_environment/device/hip/hip_backend.cxx: Updates backend factory signature to accept an MPI communicator (currently unused).
src/runtime_environment/device/device_runtime_environment_impl.hpp: Passes the runtime communicator into make_device_backend (MPI builds).
src/runtime_environment/device/device_backend.hpp: Updates factory declaration to accept an MPI communicator; adds an include to access MPI types/macros.
src/runtime_environment/device/cuda/cuda_backend.hpp: Adds MPI-related state and an MPI-aware constructor to CUDABackend.
src/runtime_environment/device/cuda/cuda_backend.cxx: Implements MPI-aware CUDA init (local rank → GPU) and memory-splitting logic.


Comment thread src/runtime_environment/device/cuda/cuda_backend.cxx
Comment thread src/runtime_environment/device/cuda/cuda_backend.cxx
Comment thread src/runtime_environment/device/cuda/cuda_backend.cxx
Comment thread src/runtime_environment/device/cuda/cuda_backend.cxx Outdated
Comment thread src/runtime_environment/device/device_backend.hpp Outdated
Comment thread src/runtime_environment/device/hip/hip_backend.cxx
- Free local_comm in CUDABackend destructor
- Guard against ndev <= 0
- Move MPI_Barrier from get_available_mem() to the call site
- Narrow include in device_backend.hpp
- Suppress unused MPI_Comm parameter warning in HIP backend
@awvwgk awvwgk added the cuda CUDA related Issue label May 4, 2026

Labels

cuda CUDA related Issue


3 participants