233 changes: 180 additions & 53 deletions README.md
@@ -2,85 +2,212 @@

[![Tests](https://github.com/YosefLab/csde/actions/workflows/test.yml/badge.svg)](https://github.com/YosefLab/csde/actions/workflows/test.yml)

`csde` (Corrected Spatial Differential Expression) is a Python package designed to **identify differentially expressed (DE) genes between spatially-resolved cell populations** (e.g., T-cells inside vs. outside a tumor).
Automated pipelines for spatial transcriptomics produce cell quantifications (cell-by-gene expression matrices and label assignments) that contain systematic errors, e.g., due to mis-segmentation of cell boundaries.
These errors can propagate into downstream analyses of differential expression, leading to false discoveries or missed signals.

Standard analysis relies on cell population assignments (e.g., "infiltrating" vs. "non-infiltrating") obtained automatically from clustering/ML that are often prone to errors. `csde` corrects for these inaccuracies by leveraging a small subset of validated "ground-truth" data, providing rigorous statistical guarantees for spatially-resolved DE analyses.
CSDE corrects for these errors by combining the large automated dataset with a small set of manually validated cells, using prediction-powered inference to recover unbiased estimates with valid confidence intervals.

Refer to the preprint and the [project repository](https://github.com/YosefLab/csde) for more details.
The current codebase focuses on the comparison of a given cell type across two spatial regions.
It allows users to:
1. export per-cell annotation panels for a small subset of cells (e.g. 600)
2. manually validate the segmentation and type assignment for these cells
3. run the CSDE model to get corrected DE estimates for all genes

Refer to the [preprint](https://www.biorxiv.org/content/10.64898/2026.01.15.699786v1) for details on the method.

### Input requirements

The workflow takes a [SpatialData](https://spatialdata.scverse.org/) zarr as input. Its `"table"` AnnData must contain:

- **raw expression counts** in `.X` or a named layer
- **the following `obs` columns:**

| obs column | content |
| --- | --- |
| `cell_type` (configurable) | cell-type label for each cell |
| `spatial_group` (configurable) | binary spatial region label (e.g. `0` = outside tumour, `1` = inside tumour) |
| `center_x`, `center_y` | cell centroid in microns |

The zarr must also expose at least one **fluorescence image channel** (e.g. `"DAPI"`, `"Cellbound2"`) used to render the per-cell annotation panels.

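Before exporting panels, it can help to confirm that the table carries the columns listed above. The following is a minimal sketch, not part of `csde`: the helper name `missing_obs_columns` is hypothetical, and it only assumes that the column names (for example `adata.obs.columns`) are available as an iterable of strings.

```python
# Hypothetical helper (not part of csde): report which required obs columns are absent.
REQUIRED_FIXED_COLUMNS = ("center_x", "center_y")


def missing_obs_columns(obs_columns, cell_type_key="cell_type",
                        spatial_group_key="spatial_group"):
    """Return the required column names that are not present in obs_columns."""
    required = (cell_type_key, spatial_group_key, *REQUIRED_FIXED_COLUMNS)
    present = set(obs_columns)
    return [name for name in required if name not in present]
```

Passing the column names of the `"table"` AnnData returns an empty list when all four columns are present, and the names of the missing ones otherwise.
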
## Installation

```bash
pip install csde
pip install "csde[cuda12]" # GPU (CUDA 12)
pip install "csde[annotate]" # annotation UI (Step 2, requires streamlit)
pip install "csde[cuda12,annotate]" # both
```

By default, this installs JAX with CPU support. To enable GPU support (CUDA), install with the appropriate extra (e.g., for CUDA 12):
## Workflow overview

```
SpatialData zarr
1. Export annotation panels ←─ scripts/export.py
(importance-sampled cells,
one image per cell)
2. Manual validation ←─ scripts/annotate.py
(annotator marks each cell
as correctly / incorrectly labelled)
3. Run CSDE ←─ scripts/differential_expression.py
(corrected DE estimates)
```

---

## Step 1 — Export annotation panels (`scripts/export.py`)

Before running the statistical model, a small subset of cells must be manually validated. `csde` provides tooling to generate the per-cell images needed for that step.

```bash
pip install "csde[cuda12]"
python scripts/export.py \
    --sdata /path/to/region.zarr \
    --out /path/to/annotation_dir \
    --cell-type-key cell_type \
    --cell-type-of-interest macrophages \
    --target-proportion 0.4 \
    --gene-colors scripts/gene_colors_file.json \
    --image-channel Cellbound2 \
    --n-cells 600
```

## Data Requirements
`--target-proportion` controls the fraction of cells of interest in the subsample. Cells of interest are upweighted accordingly (importance sampling); the unnormalized weight for each sampled cell is stored in `metadata.csv` for downstream use.

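To make the role of the sampling weight concrete, here is a minimal sketch of one way to oversample a cell type and record an unnormalized weight per sampled cell. It is an illustration under stated assumptions, not the logic of `export.py`: the function name `stratified_subsample` is hypothetical, cells are drawn uniformly without replacement within each stratum, and the weight is the inverse inclusion probability of that stratum.

```python
import random


def stratified_subsample(cell_types, cell_type_of_interest,
                         target_proportion, n_cells, seed=0):
    """Return {cell index: unnormalized weight} for a subsample of n_cells.

    Assumption: roughly target_proportion of the subsample comes from the
    cell type of interest, the rest from all other cells.
    """
    rng = random.Random(seed)
    interest = [i for i, t in enumerate(cell_types) if t == cell_type_of_interest]
    other = [i for i, t in enumerate(cell_types) if t != cell_type_of_interest]

    n_interest = min(len(interest), round(target_proportion * n_cells))
    n_other = min(len(other), n_cells - n_interest)

    sampled = {}
    for pool, k in ((interest, n_interest), (other, n_other)):
        if k == 0:
            continue
        # Inverse inclusion probability: each sampled cell stands for len(pool) / k cells.
        weight = len(pool) / k
        for idx in rng.sample(pool, k):
            sampled[idx] = weight
    return sampled
```

With this scheme the weights of the sampled cells sum to the size of the full population, so weighted averages over the subsample estimate population averages even though the cell type of interest is over-represented.
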
`csde` requires two `AnnData` objects containing gene expression counts. Typically, these are obtained by splitting your full dataset into two groups:
The script writes:

### 1. `adata_pred`: The dataset to analyze
This object contains the bulk of your cells (e.g., the majority of the tissue) where only standard (predicted) cell population assignments are available.
```
/path/to/annotation_dir/
├── images/
│   ├── cell_<id>.png      # one panel per cell
│   └── ...
├── config.json            # all export arguments (read by annotate.py)
├── metadata.csv           # cell_id, cell_type, image_path, sampling_weight, center_x, center_y
└── annotations.json       # {cell_id: true/false}, written by annotate.py
```

**Requirements:**
* A column in `.obs` (e.g., `"cell_population"`) containing cell population labels (e.g., "T cell (infiltrating)" vs. "T cell (non-infiltrating)"). These labels can be derived from heuristics (e.g., distance to tumor) and/or computational classifiers.
Each panel contains:
- **Left** — fluorescence image crop + cell boundaries + transcript dots for genes listed in `gene_colors`
- **Right** — top expressed genes (bar chart); genes in `gene_colors` use their assigned colour, others are grey

### 2. `adata_gt`: The correction set
This object contains a small subset of randomly sampled cells whose cell population assignments have been **validated** to serve as a ground truth. This set allows `csde` to estimate the error rate of the standard predictions.
### Gene color file

**Requirements:**
* **Prediction column:** The same column name as in `adata_pred` (e.g., `"cell_population"`), containing the automated labels.
* **Validation column:** A **boolean** column in `.obs` (e.g., `"is_correct"`) indicating if the automated label matches the validation ground truth (see [How to construct `adata_gt`?](#how-to-construct-adata_gt)).
A simple JSON mapping gene names to colours:

## Usage
```json
{
  "CD68": "#e41a1c",
  "MRC1": "#377eb8",
  "C1QA": "#4daf4a",
  "FCGR3A": "#ff7f00"
}
```

```python
from csde import run_csde
<details>
<summary>Python API</summary>

results = run_csde(
# `AnnData` datasets to analyze
adata_pred=adata_pred,
adata_gt=adata_gt,
# Column containing the predicted labels (in BOTH datasets)
pred_cell_pop_key="cell_population",
# The two populations to compare
cell_pop_a="T-cell (infiltrating)", # Reference group
cell_pop_b="T-cell (non-infiltrating)", # Target group

# Boolean column in adata_gt verifying the prediction
gt_key="is_correct",

# Optional: Use a specific layer for counts (default uses .X)
layer_name="counts"
```python
import json
import spatialdata as sd
from csde import export_cell_panels, subsample_cells, plot_top_genes

sdata = sd.read_zarr("/path/to/region.zarr")
with open("gene_colors.json") as f:
    gene_colors = json.load(f)

metadata = export_cell_panels(
    sdata=sdata,
    annotation_dir="/path/to/annotation_dir",
    cell_type_key="cell_type",
    cell_type_of_interest="macrophages",
    target_proportion=0.4,
    gene_colors=gene_colors,
    image_channel="Cellbound2",
    n_cells=600,
)
```
</details>

# Returns a DataFrame with log_fold_change, p_value, and adjusted p_value
print(results.head())
---

## Step 2 — Manual validation (`scripts/annotate.py`)

For each exported image, an annotator decides whether the automated cell-type label is correct. The result is a boolean correctness label per cell, saved to `annotations.json`, which becomes the `is_correct` column of `adata_gt` in Step 3.

```bash
streamlit run scripts/annotate.py -- --dir /path/to/annotation_dir
```

VS Code Remote forwards the Streamlit port automatically. Open the URL printed in the terminal, then use:

- **`1`** — label as correct
- **`2`** — label as incorrect

Progress is saved after every keypress to `annotations.json`. Re-running the command resumes from where you left off. You can also start annotating while `export.py` is still running — the UI picks up newly exported cells automatically.

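To check how far an annotation session has progressed, the two files can be joined by cell id. This is a minimal sketch and not part of `csde`: the helper name `load_annotation_progress` is hypothetical, and it assumes the `metadata.csv` columns and the `{cell_id: true/false}` layout of `annotations.json` described in Step 1.

```python
import csv
import json
from pathlib import Path


def load_annotation_progress(annotation_dir):
    """Split metadata rows into annotated (with is_correct) and pending cells."""
    annotation_dir = Path(annotation_dir)
    with open(annotation_dir / "metadata.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # annotations.json may not exist yet if no cell has been annotated.
    annotations = {}
    annotations_path = annotation_dir / "annotations.json"
    if annotations_path.exists():
        with open(annotations_path) as f:
            annotations = json.load(f)

    done, pending = [], []
    for row in rows:
        cell_id = str(row["cell_id"])  # JSON object keys are always strings
        if cell_id in annotations:
            done.append({**row, "is_correct": bool(annotations[cell_id])})
        else:
            pending.append(row)
    return done, pending
```

The lengths of `done` and `pending` give the number of validated and remaining cells.
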
---

## Step 3 — Differential expression (`scripts/differential_expression.py`)

```bash
python scripts/differential_expression.py --dir /path/to/annotation_dir
```

### Output Columns
The returned DataFrame is indexed by gene name and contains:
* `log_fold_change`: The estimated log-fold change of expression (Target vs. Reference). Positive values indicate upregulation in `cell_pop_b`.
* `p_value`: The raw p-value from the hypothesis test (two-sided).
* `p_value_adj`: The p-value adjusted for multiple testing (Benjamini-Hochberg FDR).
Reads all export settings from `config.json` and writes gene-level results to `<dir>/results.csv`.

| option | default | description |
|---|---|---|
| `--dir` | *(required)* | annotation directory (output of steps 1 & 2) |
| `--out` | `<dir>/results.csv` | output CSV path |
| `--spatial-group-key` | `spatial_group` | obs column encoding the two spatial populations |
| `--n-cells-expressed-threshold` | `10` | min annotated cells expressing a gene for it to be tested |
| `--noise-model` | `poisson` | `poisson` or `nb` (negative binomial) |

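As an illustration of what `--n-cells-expressed-threshold` filters on, the sketch below keeps genes detected in at least a given number of cells. It is not the `csde` implementation: the function name `genes_passing_threshold` is hypothetical, and it assumes that "expressing" means a raw count greater than zero.

```python
def genes_passing_threshold(counts, gene_names, n_cells_expressed_threshold=10):
    """Return the genes with a nonzero count in at least the threshold number of cells.

    counts is a list of per-cell lists of raw counts, one entry per gene.
    """
    n_expressing = [0] * len(gene_names)
    for cell_counts in counts:
        for j, value in enumerate(cell_counts):
            if value > 0:
                n_expressing[j] += 1
    return [
        gene
        for gene, n in zip(gene_names, n_expressing)
        if n >= n_cells_expressed_threshold
    ]
```

Genes below the threshold have too few expressing annotated cells to be tested and are dropped from the results.
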
### Output columns

## How to construct `adata_gt`?
| column | description |
|---|---|
| `log_fold_change` | estimated LFC (positive = upregulated in target population) |
| `p_value` | raw two-sided p-value |
| `p_value_adj` | Benjamini-Hochberg adjusted p-value |

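For reference, the Benjamini-Hochberg adjustment behind `p_value_adj` can be sketched in a few lines. This is the standard step-up procedure written in plain Python for illustration; the function name `benjamini_hochberg` is hypothetical and this is not necessarily the code path `csde` uses.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Calling a gene significant when its adjusted value is below a level such as 0.05 controls the false discovery rate at that level under the usual assumptions of the procedure.
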
Constructing `adata_gt` requires validating the cell population labels for a small subset of cells (e.g., random sample). This involves:
1. **Sampling**: Select a small random subset of cells from your dataset.
2. **Data Access**: Extract the relevant data for these cells: their gene expression profile, their spatial coordinates, and importantly, a **high-resolution image crop** of the cell (with segmentation boundaries if available) to assess morphology.
3. **Validation**: Visually inspect these data points to determine the true cell identity.
4. **Annotation**: Create the `is_correct` boolean column based on your assessment.
<details>
<summary>Python API</summary>

These steps can be performed manually or using dedicated tools.
Our [experimental repository](https://github.com/YosefLab/csde/blob/main/csde_experiments)
provides an example of how these steps were performed for MERFISH data.
The full CSDE statistical model is callable directly from Python, without going through the CLI scripts.

To streamline this process, for MERFISH or other spatial transcriptomics data, we recommend using **[SpatialData](https://spatialdata.scverse.org/)** to access the data and perform the manual validation.
`prepare_csde_inputs` reads `config.json`, `metadata.csv`, and `annotations.json` from the annotation directory produced by Steps 1 & 2. It returns two AnnData objects restricted to the same gene set:

- `adata_gt` — the manually validated cells, with an `is_correct` boolean column in `.obs` and a `sampling_weight` column reflecting the importance-sampling weight assigned during export
- `adata_other` — all remaining cells (not manually validated); their `obs` must contain a `prediction` column (integer) encoding the spatial population each cell was assigned to by the automated pipeline: `0` = reference region, `1` = target region, `2` = neither

```python
from csde import prepare_csde_inputs, run_csde

inputs = prepare_csde_inputs(
    annotation_dir="/path/to/annotation_dir",  # same dir as Steps 1 & 2
    spatial_group_key="spatial_group",
    n_cells_expressed_threshold=10,
)
adata_gt = inputs["adata_gt"] # manually validated cells
adata_other = inputs["adata_other"] # all other cells

results = run_csde(
    adata_pred=adata_other,
    adata_gt=adata_gt,
    pred_cell_pop_key="prediction",  # obs column: 0=reference, 1=target, 2=other
    cell_pop_a=0,                    # reference population
    cell_pop_b=1,                    # target population (LFC = log(target/reference))
    gt_key="is_correct",             # boolean correctness label from Step 2
    layer_name="counts",
    importance_weights=adata_gt.obs["sampling_weight"].values,  # from metadata.csv
)
# DataFrame indexed by gene: log_fold_change, p_value, p_value_adj
print(results.head())
```

</details>
1 change: 1 addition & 0 deletions pyproject.toml
@@ -33,6 +33,7 @@ dev = [
"isort",
"flake8",
]
annotate = ["streamlit"]
cuda12 = ["jax[cuda12]"]
cuda13 = ["jax[cuda13]"]
