
sc.get.aggregate memory leak for Dask array #4074

@mumichae

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

When working with large data (>8M cells) and generating many pseudobulks (>100k), memory usage explodes, which defeats the purpose of using Dask instead of loading everything into memory.
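
For scale, here is a rough estimate of how large the aggregated result alone could be. This is only a sketch under stated assumptions: the gene count of 30,000 is a hypothetical placeholder (not taken from the dataset in this report), and it assumes the result is materialized as a dense float64 array, which is not confirmed here. The helper name `dense_result_gib` is made up for illustration.

```python
def dense_result_gib(n_groups: int, n_genes: int, itemsize: int = 8) -> float:
    """Size in GiB of a dense (n_groups x n_genes) array with `itemsize` bytes per value."""
    return n_groups * n_genes * itemsize / 2**30


# 100_000 groups comes from the report (">100k" pseudobulks);
# 30_000 genes is an assumed placeholder value.
size = dense_result_gib(100_000, 30_000)
print(f"dense float64 result: ~{size:.1f} GiB")
```

If the observed memory usage is far above such an estimate, the output size alone would not explain it.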

Minimal code sample

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "scanpy@git+https://github.com/scverse/scanpy.git@main",
# ]
# ///
#
# This script automatically imports the development branch of scanpy to check for issues

import scanpy as sc
import anndata as ad

# Placeholder: path to a large file with >8M cells
adata = ad.experimental.read_lazy("<large file with >8M cells>")
adata.obs = adata.obs.to_memory()

adata.obs['group'] = adata.obs[['donor_id', 'cluster']].astype(str).agg('-'.join, axis=1)
pb_data = sc.get.aggregate(adata, 'group', 'sum')

pb_data.layers['sum'].compute()
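
To put a number on the memory growth, the final `compute()` call could be wrapped in a small measurement helper. The following is a minimal sketch using only the standard library's `tracemalloc`; the helper name `peak_traced_mib` is hypothetical, and the `bytearray` workload is a stand-in so the snippet runs without the large file. `tracemalloc` only sees allocations routed through Python's allocators, so memory allocated directly by native libraries may be missed, and tracing adds overhead.

```python
import tracemalloc


def peak_traced_mib(fn, *args, **kwargs):
    """Run fn and return (result, peak MiB of allocations traced by tracemalloc)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


# Stand-in workload that allocates 50 MiB; for this report the callable
# would instead be: lambda: pb_data.layers['sum'].compute()
result, peak = peak_traced_mib(lambda: bytearray(50 * 2**20))
print(f"peak traced memory: {peak:.1f} MiB")
```

Comparing the reported peak across different numbers of groups would show whether memory scales with the number of pseudobulks.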

Error output

There appear to be 2 leaked semaphore objects to clean up at shutdown

Versions

scanpy  1.12.1
----    ----
wrapt   2.1.2
stack_data      0.6.3
fast-array-utils        1.4.1
matplotlib      3.10.8
anndata 0.12.10
scikit-learn    1.8.0
packaging       26.1
parso   0.8.6
h5py    3.16.0
traitlets       5.14.3
cycler  0.12.1
six     1.17.0
pyarrow 23.0.1
psutil  6.1.1
legacy-api-wrap 1.5
typing_extensions       4.15.0
MarkupSafe      3.0.3
prompt_toolkit  3.0.52
scipy   1.16.3
Deprecated      1.3.1
PyYAML  6.0.3
ipython 9.12.0
executing       2.2.1
pytz    2026.1.post1
numcodecs       0.15.1
msgpack 1.1.2
sparse  0.18.0
colorama        0.4.6
tblib   3.2.2
natsort 8.4.0
joblib  1.5.3
threadpoolctl   3.6.0
setuptools      82.0.1
decorator       5.2.1
pure_eval       0.2.3
python-dateutil 2.9.0.post0
jedi    0.19.2
pyparsing       3.3.2
asttokens       3.0.1
asciitree       0.3.3
pillow  12.2.0
xarray  2026.4.0
numba   0.65.0
Pygments        2.20.0
cloudpickle     3.1.2
kiwisolver      1.5.0
llvmlite        0.47.0
numpy   2.4.4
session-info2   0.4.1
zarr    2.18.7
pandas  2.3.3
Jinja2  3.1.6
wcwidth 0.6.0
dask    2024.7.1
toolz   1.1.0
----    ----
Python  3.12.13 | packaged by conda-forge | (main, Mar  5 2026, 16:50:00) [GCC 14.3.0]
OS      Linux-4.18.0-553.33.1.el8_10.x86_64-x86_64-with-glibc2.28
CPU     128/128 logical CPU cores, x86_64
GPU     No GPU found

Labels: Triage 🩺 (This issue needs to be triaged by a maintainer)