Please make sure these conditions are met
What happened?
I was running pp.scrublet on a FACS-resolved scRNA-seq dataset (GSE144273) to reproduce:
Pei, W., Shang, F., Wang, X., Fanti, A.-K., Greco, A., Busch, K., Klapproth, K., Zhang, Q., Quedenau, C., Sauer, S., Feyerabend, T. B., Höfer, T., & Rodewald, H.-R. (2020). Resolving Fates and Single-Cell Transcriptomes of Hematopoietic Stem Cell Clones by PolyloxExpress Barcoding. Cell Stem Cell, 27(3), 383-395.e8. https://doi.org/10.1016/j.stem.2020.07.018
I concat()ed all AnnData count matrices together to avoid dropping cell-type-specific genes in QC, and assigned each one a batch identifier (per cell type per mouse). Then I ran scrublet to detect doublets within each FACS-purified batch. There, scrublet consistently fails with the same dimension mismatch error:
ValueError: shape is inconsistent with obs (3597130 rows instead of 168972)
Note: By the time scrublet is called, the AnnData object has 168972 obs, and scrublet ALWAYS reports 3597130 rows in the error. The minimal code sample below fails in the same way, with a consistent mismatch of its own (43200 rows instead of 10800).
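One possible reading of these numbers, offered as an unconfirmed assumption: if observation names repeat across batches, a label-based lookup returns every matching row for every requested name, so a name that occurs c times contributes c × c rows. The sketch below computes that expected row count from a list of names. `rows_after_label_lookup` is a hypothetical helper written for illustration, not part of scanpy or anndata.

```python
from collections import Counter


def rows_after_label_lookup(names):
    """Expected row count when every name is looked up once per occurrence
    and each lookup matches all rows carrying that name.

    Hypothetical diagnostic helper, not a scanpy/anndata function.
    """
    counts = Counter(names)
    # A name occurring c times is requested c times and matches c rows each time.
    return sum(c * c for c in counts.values())


# Toy stand-in for obs_names: 3 barcodes reused in each of 4 batches.
toy_names = ["AAAC", "AAAG", "AAAT"] * 4
print(len(toy_names), rows_after_label_lookup(toy_names))  # 12 48
```

Calling the helper on `adata.obs_names` of the real object and comparing the result with 3597130 would support or rule out this explanation.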
Minimal code sample
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "scanpy@git+https://github.com/scverse/scanpy.git@main",
# ]
# ///
#
# This script automatically imports the development branch of scanpy to check for issues
import scanpy as sc
# your reproducer code
# Create batched data: four copies of pbmc3k, each tagged as its own batch
exps = []
for exp in ["exp1", "exp2", "exp3", "exp4"]:  # different batches
    exp_adata = sc.datasets.pbmc3k()
    exp_obs = exp_adata.obs.copy()
    exp_adata.obs = exp_obs.assign(sample=exp)
    exps.append(exp_adata)
# Plain concatenation; obs names are kept exactly as they are in each copy
adata = sc.concat(adatas=exps)
#QC filtering doesn't matter
# scrublet
sc.pp.scrublet(adata, batch_key="sample", n_neighbors=10) # fails
Error output
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[45], line 22
18
19 #QC filtering doesn't matter
20
21 # scrublet
---> 22 sc.pp.scrublet(adata, batch_key="sample", n_neighbors=10) # fails
[... skipping hidden 1 frame]
File d:\Bioinfo\.venv\Lib\site-packages\scanpy\preprocessing\_scrublet\__init__.py:281, in scrublet(adata, adata_sim, batch_key, sim_doublet_ratio, expected_doublet_rate, stdev_doublet_rate, synthetic_doublet_umi_subsampling, knn_dist_metric, normalize_variance, log_transform, mean_center, n_prin_comps, use_approx_neighbors, get_doublet_neighbor_parents, n_neighbors, threshold, verbose, copy, random_state)
276 scrubbed_obs = pd.concat([scrub["obs"] for scrub in scrubbed]).astype(
277 adata.obs.dtypes
278 )
280 # Now reset the obs to get the scrublet scores
--> 281 adata.obs = scrubbed_obs.loc[adata.obs_names.values]
283 # Save the .uns from each batch separately
284 adata.uns["scrublet"] = {}
File d:\Bioinfo\.venv\Lib\site-packages\anndata\_core\anndata.py:856, in AnnData.obs(self, value)
854 @obs.setter
855 def obs(self, value: pd.DataFrame | XDataset):
--> 856 self._set_dim_df(value, "obs")
File d:\Bioinfo\.venv\Lib\site-packages\anndata\_core\anndata.py:787, in AnnData._set_dim_df(self, value, attr)
786 def _set_dim_df(self, value: pd.DataFrame | XDataset, attr: Literal["obs", "var"]):
--> 787 value = _gen_dataframe(
788 value,
789 [f"{attr}_names", f"{'row' if attr == 'obs' else 'col'}_names"],
790 source="shape",
791 attr=attr,
792 length=self.n_obs if attr == "obs" else self.n_vars,
793 )
794 raise_value_error_if_multiindex_columns(value, attr)
795 value_idx = self._prep_dim_index(value.index, attr)
File D:\python313\Lib\functools.py:934, in singledispatch.<locals>.wrapper(*args, **kw)
931 if not args:
932 raise TypeError(f'{funcname} requires at least '
933 '1 positional argument')
--> 934 return dispatch(args[0].__class__)(*args, **kw)
File d:\Bioinfo\.venv\Lib\site-packages\anndata\_core\aligned_df.py:89, in _gen_dataframe_df(anno, index_names, source, attr, length)
87 raise ValueError(msg)
88 if length is not None and length != len(anno):
---> 89 raise _mk_df_error(source, attr, length, len(anno))
90 anno = anno.copy(deep=False)
91 if not is_string_dtype(anno.index[~anno.index.isna()]):
ValueError: `shape` is inconsistent with `obs` (43200 rows instead of 10800)
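The factor of four (43200 = 4 × 10800) matches what pandas does when `.loc` receives labels that are not unique in the index: every label returns all of its matches. The sketch below mimics the reproducer's shapes with pandas alone. The barcode names and the `-sample` suffix are assumptions made for illustration, and this is a guess at the mechanism behind line 281 of the traceback, not a confirmed diagnosis.

```python
import pandas as pd

n_cells, n_batches = 2700, 4  # sizes mirroring four copies of pbmc3k

# Stand-in barcodes; every batch reuses the same names, as after a plain concat.
barcodes = [f"cell{i}" for i in range(n_cells)]
obs = pd.DataFrame(
    {"sample": [f"exp{b + 1}" for b in range(n_batches) for _ in range(n_cells)]},
    index=barcodes * n_batches,
)

# Label-based selection on a non-unique index returns every match per label.
inflated = obs.loc[obs.index.values]
print(len(obs), len(inflated))  # 10800 43200

# Suffixing each name with its batch makes the index unique and the lookup 1:1.
unique_obs = obs.copy()
unique_obs.index = [f"{name}-{sample}" for name, sample in zip(obs.index, obs["sample"])]
print(len(unique_obs.loc[unique_obs.index.values]))  # 10800
```

On an AnnData object the corresponding step would presumably be `adata.obs_names_make_unique()` or passing `index_unique` to `concat`. Whether that is enough for pp.scrublet to run through with `batch_key` is untested here.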
Versions
<stdin>-0:1: FutureWarning: Use `print_header` instead
scanpy 1.12.1
---- ----
legacy-api-wrap 1.5
charset-normalizer 3.4.7
matplotlib 3.10.8
pyparsing 3.3.2
kiwisolver 1.5.0
natsort 8.4.0
typing_extensions 4.15.0
donfig 0.8.1.post1
session-info2 0.4.1
numcodecs 0.16.5
pyreadline3 3.5.4
colorama 0.4.6
google-crc32c 1.8.0
joblib 1.5.3
pyarrow 23.0.1
pillow 12.2.0
scipy 1.17.1
fast-array-utils 1.4.1
threadpoolctl 3.6.0
llvmlite 0.47.0
six 1.17.0
pytz 2026.1.post1
psutil 7.2.2
pandas 2.3.3
numpy 2.4.4
numba 0.65.0
cycler 0.12.1
python-dateutil 2.9.0.post0
packaging 26.0
PyYAML 6.0.3
scikit-learn 1.8.0
h5py 3.16.0
anndata 0.12.10
---- ----
Python 3.13.7 (tags/v3.13.7:bcee1c3, Aug 14 2025, 14:15:11) [MSC v.1944 64 bit (AMD64)]
OS Windows-11-10.0.26200-SP0
CPU 32/32 logical CPU cores, Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
GPU ID: 0, NVIDIA GeForce RTX 4060 Laptop GPU, Driver: 596.36, Memory: 8188 MiB
Updated 2026-05-01 14:56