wilcoxon DE, vs_all_perturbed #4114

@zboldyga

What kind of feature would you like to request?

Additional function parameters / changed functionality / changed defaults?

Please describe your wishes

@ilan-gold Another piece of the optimizations I added while running Perturb-seq benchmarks was a vs_all_perturbed mode for Wilcoxon DE. This has emerged as a technique in Perturb-seq: when performing differential expression for each target perturbation, the reference is all other cells except the controls (so only cells that received a perturbation on a coding gene), rather than all other cells.

It is introduced, for example, in sections 3.3 and 4.2 of https://arxiv.org/pdf/2506.22641 , and you can see it in the corresponding code: https://github.com/shiftbioscience/Diversity_By_Design/blob/main/data/norman19/get_data.py#L40 . It is also used in the more recent Dynamic Range Fraction paper on Perturb-seq, and probably in other papers derived from these works.

Practically, I think this is most relevant when the non-targeting (control) cell count is high relative to the total cell count. That does tend to happen, because every batch usually includes non-targeting cells to compensate for batch effects.
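To make the difference in reference sets concrete, here is a minimal sketch in plain Python. It is not scanpy code: `reference_labels`, the label names, and the cell counts are all made up for illustration. It only shows which cells end up in the reference for one target under the usual 'rest' versus the "all perturbed" variant.

```python
from collections import Counter


def reference_labels(labels, target, control=None):
    """Labels of the cells used as the reference for `target`.

    With control=None this is the usual 'rest' (every non-target cell).
    Otherwise, cells labelled `control` are left out of the reference too.
    """
    return [lab for lab in labels if lab != target and lab != control]


# Toy screen (invented numbers): 3 perturbations with 20 cells each,
# plus 140 non-targeting control cells.
labels = (
    ["pert_A"] * 20 + ["pert_B"] * 20 + ["pert_C"] * 20 + ["non-targeting"] * 140
)

rest = Counter(reference_labels(labels, "pert_A"))
perturbed_only = Counter(reference_labels(labels, "pert_A", control="non-targeting"))

# Share of the plain 'rest' reference that consists of control cells.
control_share = rest["non-targeting"] / sum(rest.values())
print(f"controls in 'rest': {control_share:.0%}")
print("groups in perturbed-only reference:", sorted(perturbed_only))
```

In this toy layout the plain 'rest' reference is dominated by controls, so the test is closer to "target vs. control" than "target vs. other perturbations".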

That said, it seems to be a pattern that is here to stay, since it appears to be superior under certain conditions.

So my first question is -- does this deserve support in the scanpy wilcoxon DE implementation? My alternative was to filter data beforehand, but that is clunky.

If so:

  1. This should use a one-versus-rest approach where the ranking is performed once over the retained cells and reused for every perturbation, rather than re-doing the ranking per perturbation. That was the hoist issue I fixed in the other PR (the 2 line change). So this would just need to hit that path.

  2. I can spearhead getting this implemented end-to-end in scanpy and illico.

  3. As for how, I'm thinking it might make sense to add a parameter that specifies a single group to exclude from 'rest' when 'rest' is the reference. It should be clear from the parameter name that this only affects the reference cells and is unrelated to 'groups' (the perturbations DE is computed for). This seemed a better option than adding another mode (e.g. vs_all_perturbed) because there are non-Perturb-seq uses of Wilcoxon DE. It also didn't seem necessary to allow multiple exclusion groups; I can't think of any other case where excluding more than one group would help. This exclusion approach should also be a lightweight change.
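A rough sketch of how points 1 and 3 fit together, in plain Python rather than scanpy's actual code. `exclude_from_rest` is a hypothetical parameter name, and the z-score uses the normal approximation without tie correction. The excluded group is dropped first, the remaining cells are ranked once, and every group's rank sum is read off that single ranking.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_scores(values, labels, exclude_from_rest=None):
    """Wilcoxon rank-sum z-score of each group vs. 'rest' for one gene.

    `exclude_from_rest` (hypothetical name) removes one group, e.g. the
    controls, from the reference; that group gets no score of its own here.
    """
    keep = [i for i, lab in enumerate(labels) if lab != exclude_from_rest]
    ranks = average_ranks([values[i] for i in keep])  # ranked once, not per group
    n_total = len(keep)

    sums, counts = defaultdict(float), defaultdict(int)
    for rank, i in zip(ranks, keep):
        sums[labels[i]] += rank
        counts[labels[i]] += 1

    scores = {}
    for group, rank_sum in sums.items():
        n1, n2 = counts[group], n_total - counts[group]
        if n2 == 0:
            continue  # no reference cells left for this group
        mean = n1 * (n_total + 1) / 2
        std = sqrt(n1 * n2 * (n_total + 1) / 12)
        scores[group] = (rank_sum - mean) / std
    return scores


# Toy data for a single gene (invented values).
values = [5.0, 6.0, 1.0, 2.0, 9.0, 9.5]
labels = ["A", "A", "B", "B", "ctrl", "ctrl"]

print(rank_sum_scores(values, labels))
print(rank_sum_scores(values, labels, exclude_from_rest="ctrl"))
```

In this toy case group A sits exactly at the mean rank when the controls are part of 'rest', but ranks clearly above B once they are excluded, which is the effect the exclusion parameter is meant to expose.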

Thoughts?

If you think this is right, I can draft a PR to show what it looks like in scanpy and ping the illico author(s) to coordinate on that end.
