cuda: add simple cudaMalloc/cudaFree allocator as opt-in (workaround for #2038) #2044

Open
T0nd3 wants to merge 1 commit into OpenNMT:master from T0nd3:fix/hip-allocator-deadlock-2038
@T0nd3 T0nd3 commented May 12, 2026

Summary

Addresses #2038 (`del model` deadlocks on ROCm 7.2.1 + Windows + gfx1100 inside the free path of the hipcub `CachingDeviceAllocator`).

Adds a third option for `CT2_CUDA_ALLOCATOR`:

  • `cub_caching` (existing default on Windows / pre-CUDA-11.2)
  • `cuda_malloc_async` (existing default on Linux+CUDA / Linux+HIP)
  • `simple` / `none` (new) — stateless `cudaMalloc` / `cudaFree`, no caching, no per-block events

The default behaviour is unchanged. The `simple` allocator is an opt-in workaround for users hitting the deadlock until the ROCm runtime bug is fixed upstream.
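
The `VAR=value command` prefix form used further down is POSIX-shell syntax and is not available in `cmd.exe` or PowerShell, so on Windows the variable has to be set some other way. A minimal sketch of a cross-platform launcher follows. The helper names `allocator_env` and `run_with_allocator` are hypothetical, and the list of accepted values is copied from this description rather than read from the library. It sets the variable in a child process, which avoids assuming anything about when CTranslate2 reads it.

```python
import os
import subprocess
import sys

# Allocator names as listed in this PR description (not queried from the library).
KNOWN_ALLOCATORS = ("cub_caching", "cuda_malloc_async", "simple", "none")


def allocator_env(name, base_env=None):
    """Return a copy of the environment with CT2_CUDA_ALLOCATOR set to `name`."""
    if name not in KNOWN_ALLOCATORS:
        raise ValueError(
            f"unknown allocator {name!r}; expected one of {KNOWN_ALLOCATORS}"
        )
    # Copy so the caller's (or the current process's) environment is untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["CT2_CUDA_ALLOCATOR"] = name
    return env


def run_with_allocator(name, argv):
    """Run `argv` in a child process whose environment selects the allocator."""
    return subprocess.run(argv, env=allocator_env(name), check=False)
```

For example, `run_with_allocator("simple", [sys.executable, "repro.py"])` would run a reproduction script with the new allocator selected, on Windows and Linux alike.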

Why not just hardcode the workaround on Windows-HIP?

I checked whether `hipMallocAsync` would work as a drop-in (it's currently disabled on Windows via `CT2_USE_ASYNC_ALLOC = !_WIN32` per #1072). On ROCm 7.2.0 + RX 7900 XTX + Windows 11, the first `generate()` call still crashes with `IndexError: invalid vector subscript`, so `hipMallocAsync` isn't usable as a Windows default yet. I reverted the experiment and kept the existing guard.

The hipcub deadlock itself isn't reproducible on ROCm 7.2.0 in my local setup (the reporter sees it on 7.2.1), so I can't fix the bug at its root in this repo. The opt-in `simple` allocator just routes around every code path the upstream bug is sensitive to: no cached blocks, no `hipEventRecord`, no per-block streams.

Trade-off

  • Every allocation becomes a fresh `cudaMalloc` (no cache reuse).
  • For typical inference workloads (load model → many forward passes → unload), most allocation happens at load time and the set of live buffers then stays largely stable, so the impact is small.
  • For workloads with frequent allocation/deallocation churn, expect a measurable slowdown. Those workloads can stay on `cub_caching`.
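
The difference between the two strategies can be illustrated with a toy model that only counts calls into the underlying device allocator. This is a sketch under loud assumptions: `CountingBackend` stands in for `cudaMalloc` / `cudaFree`, and `ToyCachingAllocator` is a deliberately simplified exact-size free list, not the binning, event, or stream logic of the real `CachingDeviceAllocator`. All class and function names are hypothetical.

```python
from collections import defaultdict


class CountingBackend:
    """Stand-in for cudaMalloc/cudaFree that only counts calls."""

    def __init__(self):
        self.mallocs = 0
        self.frees = 0
        self._next_handle = 0

    def malloc(self, size):
        self.mallocs += 1
        self._next_handle += 1
        return self._next_handle  # fake device pointer

    def free(self, handle):
        self.frees += 1


class SimpleAllocator:
    """Stateless pass-through: every request reaches the backend."""

    def __init__(self, backend):
        self.backend = backend

    def allocate(self, size):
        return self.backend.malloc(size)

    def free(self, handle):
        self.backend.free(handle)


class ToyCachingAllocator:
    """Keeps freed blocks in a per-size list and reuses them on the next request."""

    def __init__(self, backend):
        self.backend = backend
        self._cached = defaultdict(list)  # size -> free handles
        self._live = {}                   # handle -> size

    def allocate(self, size):
        if self._cached[size]:
            handle = self._cached[size].pop()    # cache hit: no backend call
        else:
            handle = self.backend.malloc(size)   # cache miss
        self._live[handle] = size
        return handle

    def free(self, handle):
        size = self._live.pop(handle)
        self._cached[size].append(handle)        # block is kept, not released


def churn(allocator, size, iterations):
    """Allocate and immediately free one block, `iterations` times."""
    for _ in range(iterations):
        allocator.free(allocator.allocate(size))
```

Running `churn` for 1000 iterations against the pass-through allocator produces 1000 backend `malloc` calls and 1000 `free` calls, while the toy caching allocator produces a single `malloc` and no `free`. That call-count gap is where a slowdown under heavy churn would come from; the model says nothing about the actual latency of each call.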

Test plan

  • Build with `-DWITH_HIP=ON` on gfx1100 (Windows 11, ROCm 7.2.0).
  • `CT2_CUDA_ALLOCATOR=simple python repro.py` → Whisper-medium load + inference + `del model` returns in ~10 ms (vs ~22 ms with `cub_caching`).
  • Default allocator path still works: `CT2_CUDA_ALLOCATOR=cub_caching` → identical token output, identical timing.
  • All 15 tests in my local `python/tests/test_flash_attention.py` pass under `CT2_CUDA_ALLOCATOR=simple` with the same correctness thresholds as on `cub_caching`.
  • CI: build & wheel jobs.
  • Confirmation from the #2038 reporter that `CT2_CUDA_ALLOCATOR=simple` avoids the hang on ROCm 7.2.1.

cc @sssshhhhhh @jordimas

Closes #2038 once confirmed by the reporter.

…MT#2038)

Issue OpenNMT#2038 reports that `del model` deadlocks during cleanup on ROCm 7.2.1
+ Windows + gfx1100.  The trace points at the hipcub CachingDeviceAllocator's
DeviceFree path — both the per-block hipEventRecord (recached branch) and
the synchronous hipFree (non-recached branch) call into the ROCm runtime,
and at least one of them hangs there indefinitely during a `del model`.

The CudaAsyncAllocator already exists as an alternative, but it's disabled
on Windows (OpenNMT#1072 comment) and a quick check on the local ROCm 7.2 wheels
confirms hipMallocAsync still misbehaves there (invalid vector subscript
during the first generate() call), so that's not a usable fallback.

Add a third allocator option, "simple" / "none", that is a stateless
cudaMalloc / cudaFree wrapper.  It has no cache, no per-block ready
events, and no per-block associated streams, so it can't trip any of the
state-tracking code paths in CachingDeviceAllocator that the upstream
runtime bug is sensitive to.  The trade-off is that every allocation
becomes a fresh cudaMalloc — fine for typical inference workloads that
allocate once and reuse, more costly for workloads that allocate often.

Selected via `CT2_CUDA_ALLOCATOR=simple` (or `none`).  Default behaviour
is unchanged.

Verified locally on RX 7900 XTX (gfx1100, ROCm 7.2.0, Windows 11):
  - Whisper-medium load + inference + `del model` succeeds in both
    allocator modes (deadlock not reproducible on 7.2.0; reporter sees
    it on 7.2.1).
  - Existing flash-attention pytest suite (15 tests) passes under
    `CT2_CUDA_ALLOCATOR=simple` with identical results.