Please make sure these conditions are met
What happened?
When working with a large dataset (>8M cells) and generating many pseudobulks (>100k groups), memory usage explodes during the aggregation, which defeats the purpose of using dask instead of loading everything into memory.
Minimal code sample
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "scanpy@git+https://github.com/scverse/scanpy.git@main",
# ]
# ///
#
# The inline metadata above installs scanpy from the development branch (main) so the issue can be checked against the latest code
import scanpy as sc
import anndata as ad

# Lazily open the on-disk store; X stays out of memory as a dask array
adata = ad.experimental.read_lazy(<large file with >8M cells>)
# Materialise only the (small) obs metadata so the grouping column can be built
adata.obs = adata.obs.to_memory()
# One pseudobulk group per donor/cluster combination (>100k groups in total)
adata.obs['group'] = adata.obs[['donor_id', 'cluster']].astype(str).agg('-'.join, axis=1)
pb_data = sc.get.aggregate(adata, 'group', 'sum')
# Triggering the computation is where memory usage explodes
pb_data.layers['sum'].compute()
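For reference, a minimal sketch of how the memory growth can be observed, assuming a single-machine dask scheduler and using psutil (already present in the environment listed below) to read the resident set size around the compute call; the exact numbers will of course depend on the dataset:

import psutil

proc = psutil.Process()
rss_before = proc.memory_info().rss / 1024**3  # RSS in GiB before triggering the compute
pb_data.layers['sum'].compute()
rss_after = proc.memory_info().rss / 1024**3   # RSS in GiB afterwards; grows far beyond a single chunk
print(f"RSS before: {rss_before:.1f} GiB, after: {rss_after:.1f} GiB")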
Error output
There appear to be 2 leaked semaphore objects to clean up at shutdown
Versions
Details
scanpy 1.12.1
---- ----
wrapt 2.1.2
stack_data 0.6.3
fast-array-utils 1.4.1
matplotlib 3.10.8
anndata 0.12.10
scikit-learn 1.8.0
packaging 26.1
parso 0.8.6
h5py 3.16.0
traitlets 5.14.3
cycler 0.12.1
six 1.17.0
pyarrow 23.0.1
psutil 6.1.1
legacy-api-wrap 1.5
typing_extensions 4.15.0
MarkupSafe 3.0.3
prompt_toolkit 3.0.52
scipy 1.16.3
Deprecated 1.3.1
PyYAML 6.0.3
ipython 9.12.0
executing 2.2.1
pytz 2026.1.post1
numcodecs 0.15.1
msgpack 1.1.2
sparse 0.18.0
colorama 0.4.6
tblib 3.2.2
natsort 8.4.0
joblib 1.5.3
threadpoolctl 3.6.0
setuptools 82.0.1
decorator 5.2.1
pure_eval 0.2.3
python-dateutil 2.9.0.post0
jedi 0.19.2
pyparsing 3.3.2
asttokens 3.0.1
asciitree 0.3.3
pillow 12.2.0
xarray 2026.4.0
numba 0.65.0
Pygments 2.20.0
cloudpickle 3.1.2
kiwisolver 1.5.0
llvmlite 0.47.0
numpy 2.4.4
session-info2 0.4.1
zarr 2.18.7
pandas 2.3.3
Jinja2 3.1.6
wcwidth 0.6.0
dask 2024.7.1
toolz 1.1.0
---- ----
Python 3.12.13 | packaged by conda-forge | (main, Mar 5 2026, 16:50:00) [GCC 14.3.0]
OS Linux-4.18.0-553.33.1.el8_10.x86_64-x86_64-with-glibc2.28
CPU 128/128 logical CPU cores, x86_64
GPU No GPU found
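A possible workaround sketch, not verified on this dataset: aggregate one donor at a time so that each sc.get.aggregate call only has to materialise a slice of the dask array, then concatenate the per-donor pseudobulks with ad.concat. The donor_id and cluster columns are the ones from the reproducer above.

import anndata as ad
import scanpy as sc

donors = adata.obs['donor_id'].unique()
pieces = []
for donor in donors:
    # Subset lazily; only this donor's rows should be touched by the aggregation
    sub = adata[adata.obs['donor_id'] == donor]
    pieces.append(sc.get.aggregate(sub, 'cluster', 'sum'))
pb_data = ad.concat(pieces, label='donor_id', keys=list(donors))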