Scrublet failure using batch_key #4098

@40YTGan

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

I was running sc.pp.scrublet on the FACS-resolved scRNA-seq dataset GSE144273 in order to reproduce:

Pei, W., Shang, F., Wang, X., Fanti, A.-K., Greco, A., Busch, K., Klapproth, K., Zhang, Q., Quedenau, C., Sauer, S., Feyerabend, T. B., Höfer, T., & Rodewald, H.-R. (2020). Resolving Fates and Single-Cell Transcriptomes of Hematopoietic Stem Cell Clones by PolyloxExpress Barcoding. Cell Stem Cell, 27(3), 383-395.e8. https://doi.org/10.1016/j.stem.2020.07.018

I concatenated all AnnData count matrices with concat() to avoid dropping cell-type-specific genes during QC, and assigned each a batch identifier (one per cell type per mouse). I then ran scrublet to detect doublets within each FACS-purified batch. Scrublet consistently fails with the same dimension mismatch error:

ValueError: shape is inconsistent with obs (3597130 rows instead of 168972)

Note: By the time scrublet is called, the AnnData object has 168972 obs, and scrublet ALWAYS reports 3597130 rows in the error. The minimal code sample below fails in the same way, with equally consistent shapes (43200 rows instead of 10800).

Minimal code sample

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "scanpy@git+https://github.com/scverse/scanpy.git@main",
# ]
# ///
#
# This script automatically imports the development branch of scanpy to check for issues

import scanpy as sc
# your reproducer code

# Create batched data
exps = []
for exp in ["exp1", "exp2", "exp3", "exp4"]: #different batches
    exp_adata = sc.datasets.pbmc3k()
    exp_obs = exp_adata.obs.copy()
    exp_adata.obs = exp_obs.assign(
        sample = exp
    )
    exps.append(exp_adata)

adata = sc.concat(adatas=exps)

#QC filtering doesn't matter

# scrublet
sc.pp.scrublet(adata, batch_key="sample", n_neighbors=10) # fails

Error output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[45], line 22
     18 
     19 #QC filtering doesn't matter
     20 
     21 # scrublet
---> 22 sc.pp.scrublet(adata, batch_key="sample", n_neighbors=10) # fails

    [... skipping hidden 1 frame]

File d:\Bioinfo\.venv\Lib\site-packages\scanpy\preprocessing\_scrublet\__init__.py:281, in scrublet(adata, adata_sim, batch_key, sim_doublet_ratio, expected_doublet_rate, stdev_doublet_rate, synthetic_doublet_umi_subsampling, knn_dist_metric, normalize_variance, log_transform, mean_center, n_prin_comps, use_approx_neighbors, get_doublet_neighbor_parents, n_neighbors, threshold, verbose, copy, random_state)
    276 scrubbed_obs = pd.concat([scrub["obs"] for scrub in scrubbed]).astype(
    277     adata.obs.dtypes
    278 )
    280 # Now reset the obs to get the scrublet scores
--> 281 adata.obs = scrubbed_obs.loc[adata.obs_names.values]
    283 # Save the .uns from each batch separately
    284 adata.uns["scrublet"] = {}

File d:\Bioinfo\.venv\Lib\site-packages\anndata\_core\anndata.py:856, in AnnData.obs(self, value)
    854 @obs.setter
    855 def obs(self, value: pd.DataFrame | XDataset):
--> 856     self._set_dim_df(value, "obs")

File d:\Bioinfo\.venv\Lib\site-packages\anndata\_core\anndata.py:787, in AnnData._set_dim_df(self, value, attr)
    786 def _set_dim_df(self, value: pd.DataFrame | XDataset, attr: Literal["obs", "var"]):
--> 787     value = _gen_dataframe(
    788         value,
    789         [f"{attr}_names", f"{'row' if attr == 'obs' else 'col'}_names"],
    790         source="shape",
    791         attr=attr,
    792         length=self.n_obs if attr == "obs" else self.n_vars,
    793     )
    794     raise_value_error_if_multiindex_columns(value, attr)
    795     value_idx = self._prep_dim_index(value.index, attr)

File D:\python313\Lib\functools.py:934, in singledispatch.<locals>.wrapper(*args, **kw)
    931 if not args:
    932     raise TypeError(f'{funcname} requires at least '
    933                     '1 positional argument')
--> 934 return dispatch(args[0].__class__)(*args, **kw)

File d:\Bioinfo\.venv\Lib\site-packages\anndata\_core\aligned_df.py:89, in _gen_dataframe_df(anno, index_names, source, attr, length)
     87     raise ValueError(msg)
     88 if length is not None and length != len(anno):
---> 89     raise _mk_df_error(source, attr, length, len(anno))
     90 anno = anno.copy(deep=False)
     91 if not is_string_dtype(anno.index[~anno.index.isna()]):

ValueError: `shape` is inconsistent with `obs` (43200 rows instead of 10800)
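One possible reading of these numbers, stated as an assumption rather than a confirmed diagnosis: pbmc3k has 2700 cells, so four concatenated copies give 10800 obs in which every cell name occurs four times. Label-based `.loc` selection on a non-unique index returns every matching row for every requested label, so 10800 labels with 4 matches each would yield 43200 rows. The pandas-only sketch below reproduces that row multiplication with made-up barcodes; it does not call scanpy.

```python
import pandas as pd

# Hypothetical barcodes, reused by every batch (as when identical
# datasets are concatenated without renaming the cells).
barcodes = ["AAAC-1", "AAAG-1", "AACT-1"]
batches = ["exp1", "exp2", "exp3", "exp4"]

# Per-batch obs tables stacked into one frame with a non-unique index.
obs = pd.concat(
    [pd.DataFrame({"sample": [b] * len(barcodes)}, index=barcodes) for b in batches]
)

# Selecting by the full (duplicated) list of labels returns all matches
# for each label: 12 labels x 4 matches each.
selected = obs.loc[obs.index.values]

print(obs.index.is_unique)       # False
print(len(obs), len(selected))   # 12 48
```

The same pattern would explain why the reported row count is stable across runs: it depends only on how often each obs name is repeated, not on any random state.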

Versions

<stdin>-0:1: FutureWarning: Use `print_header` instead
scanpy  1.12.1
----    ----
legacy-api-wrap 1.5
charset-normalizer      3.4.7
matplotlib      3.10.8
pyparsing       3.3.2
kiwisolver      1.5.0
natsort 8.4.0
typing_extensions       4.15.0
donfig  0.8.1.post1
session-info2   0.4.1
numcodecs       0.16.5
pyreadline3     3.5.4
colorama        0.4.6
google-crc32c   1.8.0
joblib  1.5.3
pyarrow 23.0.1
pillow  12.2.0
scipy   1.17.1
fast-array-utils        1.4.1
threadpoolctl   3.6.0
llvmlite        0.47.0
six     1.17.0
pytz    2026.1.post1
psutil  7.2.2
pandas  2.3.3
numpy   2.4.4
numba   0.65.0
cycler  0.12.1
python-dateutil 2.9.0.post0
packaging       26.0
PyYAML  6.0.3
scikit-learn    1.8.0
h5py    3.16.0
anndata 0.12.10
----    ----
Python  3.13.7 (tags/v3.13.7:bcee1c3, Aug 14 2025, 14:15:11) [MSC v.1944 64 bit (AMD64)]
OS      Windows-11-10.0.26200-SP0
CPU     32/32 logical CPU cores, Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
GPU     ID: 0, NVIDIA GeForce RTX 4060 Laptop GPU, Driver: 596.36, Memory: 8188 MiB
Updated 2026-05-01 14:56
