cuda: add simple cudaMalloc/cudaFree allocator as opt-in (workaround for #2038) by T0nd3 · Pull Request #2044 · OpenNMT/CTranslate2

T0nd3 · 2026-05-12T15:19:33Z

Summary

Addresses #2038 (`del model` deadlocks on ROCm 7.2.1 + Windows + gfx1100 inside the hipcub CachingDeviceAllocator's free path).

Adds a third option for `CT2_CUDA_ALLOCATOR`:

`cub_caching` (existing default on Windows / pre-CUDA-11.2)
`cuda_malloc_async` (existing default on Linux+CUDA / Linux+HIP)
`simple` / `none` (new) — stateless `cudaMalloc` / `cudaFree`, no caching, no per-block events

The default behaviour is unchanged. The `simple` allocator is an opt-in workaround for users hitting the deadlock until the ROCm runtime bug is fixed upstream.
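For anyone scripting the opt-in, the following is a minimal sketch of setting the variable for a child process in a shell-independent way (the `VAR=value command` form is POSIX shell syntax and does not work in `cmd.exe`). The helper name `allocator_env` and the validation step are hypothetical and not part of CTranslate2; only the variable name and the four values come from this description.

```python
import os

# Values listed in this PR description. The check below is illustrative only
# and says nothing about how CTranslate2 itself treats unknown values.
KNOWN_ALLOCATORS = {"cub_caching", "cuda_malloc_async", "simple", "none"}


def allocator_env(name, base_env=None):
    """Return a copy of the environment with CT2_CUDA_ALLOCATOR set to `name`."""
    if name not in KNOWN_ALLOCATORS:
        raise ValueError(f"unknown allocator {name!r}")
    # Copy so the caller's environment (or os.environ) is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["CT2_CUDA_ALLOCATOR"] = name
    return env
```

A repro script could then be launched with `subprocess.run([sys.executable, "repro.py"], env=allocator_env("simple"))`, which behaves the same on Windows and Linux.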

Why not just hardcode the workaround on Windows-HIP?

I checked whether `hipMallocAsync` would work as a drop-in (it's currently disabled on Windows via `CT2_USE_ASYNC_ALLOC = !_WIN32` per #1072). On ROCm 7.2.0 + RX 7900 XTX + Windows 11, the first `generate()` call still crashes with `IndexError: invalid vector subscript` — so `hipMallocAsync` isn't usable as a Windows default yet. Reverted the experiment, kept the existing guard.

The hipcub deadlock itself isn't reproducible on ROCm 7.2.0 in my local setup (the reporter sees it on 7.2.1), so I can't fix the bug at its root in this repo. The opt-in `simple` allocator just routes around every code path the upstream bug is sensitive to: no cached blocks, no `hipEventRecord`, no per-block streams.

Trade-off

Every allocation becomes a fresh `cudaMalloc` (no cache reuse).
For typical inference workloads (load model → many forward passes → unload), allocation happens once at load time and then mostly stays the same, so the impact is small.
For workloads with frequent allocation/deallocation churn, expect a measurable slowdown. Those workloads can stay on `cub_caching`.
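The trade-off can be illustrated with a toy model that counts how often each strategy reaches the device runtime. This is a sketch under simplifying assumptions, not the hipcub or CTranslate2 implementation: the classes are hypothetical, the "device" only counts calls, and the caching variant reuses blocks on an exact size match with no events or streams.

```python
from collections import defaultdict


class CountingDevice:
    """Stand-in for the GPU runtime: it only counts malloc/free calls."""

    def __init__(self):
        self.mallocs = 0
        self.frees = 0
        self._next_ptr = 0

    def malloc(self, size):
        self.mallocs += 1
        self._next_ptr += 1
        return self._next_ptr  # fake pointer value

    def free(self, ptr):
        self.frees += 1


class SimpleAllocator:
    """Stateless: every request goes straight to the device."""

    def __init__(self, device):
        self.device = device

    def allocate(self, size):
        return self.device.malloc(size)

    def free(self, ptr, size):
        self.device.free(ptr)


class CachingAllocator:
    """Toy cache: freed blocks are kept per size and handed out again."""

    def __init__(self, device):
        self.device = device
        self.bins = defaultdict(list)

    def allocate(self, size):
        if self.bins[size]:
            return self.bins[size].pop()  # cache hit, no device call
        return self.device.malloc(size)

    def free(self, ptr, size):
        self.bins[size].append(ptr)  # keep the block instead of freeing it


def churn(allocator, rounds, size=1024):
    """Allocate and immediately free one block, `rounds` times."""
    for _ in range(rounds):
        ptr = allocator.allocate(size)
        allocator.free(ptr, size)


simple_dev, caching_dev = CountingDevice(), CountingDevice()
churn(SimpleAllocator(simple_dev), rounds=100)
churn(CachingAllocator(caching_dev), rounds=100)
print(simple_dev.mallocs, caching_dev.mallocs)  # prints: 100 1
```

In this model the stateless allocator makes one device call per request, while the cache makes one in total. The same absence of cached state is what keeps the `simple` path away from the per-block bookkeeping described above.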

Test plan

Build with `-DWITH_HIP=ON` on gfx1100 (Windows 11, ROCm 7.2.0).
`CT2_CUDA_ALLOCATOR=simple python repro.py` → Whisper-medium load + inference + `del model` returns in ~10 ms (vs ~22 ms with `cub_caching`).
Default allocator path still works: `CT2_CUDA_ALLOCATOR=cub_caching` → identical token output, identical timing.
All 15 tests in my local `python/tests/test_flash_attention.py` pass under `CT2_CUDA_ALLOCATOR=simple` with the same correctness thresholds as on `cub_caching`.
CI: build & wheel jobs.
Confirmation from the #2038 reporter that `CT2_CUDA_ALLOCATOR=simple` avoids the hang on ROCm 7.2.1.
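Since the failure mode is a hang, not a crash, a reproduction check is easier to automate with a timeout. The sketch below uses only the Python standard library; the helper name `run_with_timeout` and the three result labels are hypothetical, and the timeout value would need to be chosen well above the normal run time of the script being tested.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s, env=None):
    """Run argv; return 'ok', 'failed', or 'hang' if it exceeds timeout_s."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run(argv, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if result.returncode == 0 else "failed"


# Self-check with a trivial child process that exits immediately.
print(run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30))  # prints: ok
```

Passing an `env` dict with `CT2_CUDA_ALLOCATOR` set to each allocator name in turn would let the reporter compare `cub_caching` and `simple` against the same repro script.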

cc @sssshhhhhh @jordimas

Closes #2038 once confirmed by the reporter.

…MT#2038)

Issue OpenNMT#2038 reports that `del model` deadlocks during cleanup on ROCm 7.2.1 + Windows + gfx1100. The trace points at the hipcub CachingDeviceAllocator's DeviceFree path: both the per-block hipEventRecord (recached branch) and the synchronous hipFree (non-recached branch) call into the ROCm runtime, and at least one of them hangs there indefinitely during a `del model`.

The CudaAsyncAllocator already exists as an alternative, but it's disabled on Windows (OpenNMT#1072 comment) and a quick check on the local ROCm 7.2 wheels confirms hipMallocAsync still misbehaves there (invalid vector subscript during the first generate() call), so that's not a usable fallback.

Add a third allocator option, "simple" / "none", that is a stateless cudaMalloc / cudaFree wrapper. It has no cache, no per-block ready events, and no per-block associated streams, so it can't trip any of the state-tracking code paths in CachingDeviceAllocator that the upstream runtime bug is sensitive to.

The trade-off is that every allocation becomes a fresh cudaMalloc: fine for typical inference workloads that allocate once and reuse, more costly for workloads that allocate often. Selected via `CT2_CUDA_ALLOCATOR=simple` (or `none`). Default behaviour is unchanged.

Verified locally on RX 7900 XTX (gfx1100, ROCm 7.2.0, Windows 11):

- Whisper-medium load + inference + `del model` succeeds in both allocator modes (deadlock not reproducible on 7.2.0; reporter sees it on 7.2.1).
- Existing flash-attention pytest suite (15 tests) passes under `CT2_CUDA_ALLOCATOR=simple` with identical results.

T0nd3 wants to merge 1 commit into OpenNMT:master from T0nd3:fix/hip-allocator-deadlock-2038
