talkowski-lab/POGZ

Code for: CRISPR-engineered deletion of POGZ alters transcription factor binding at promoters of genes involved in synaptic signaling

This repository contains the analysis code associated with the publication:

[Authors]. [Paper Title]. [Journal], [Year]. DOI: [DOI]

Overview

This study investigates the transcriptional and chromatin accessibility consequences of heterozygous POGZ loss-of-function in human iPSC-derived neurons (iN) and neural stem cells (NSC). Experiments were performed in two independent iPSC genetic backgrounds (GM8330 and MGH) to identify robust, reproducible effects.

The analyses include:

  • Bulk RNA-seq differential expression and co-expression analysis
  • ATAC-seq peak calling and differential chromatin accessibility analysis
  • Transcription factor footprinting and differential TF binding analysis

Repository structure

code/
├── RNA-seq/
│   ├── rnaseq_analysis_iN_final.Rmd           # Main RNA-seq analysis notebook
│   ├── co-expression_analysis_allsamples.R    # WGCNA co-expression analysis
│   ├── co-expression_data.R                   # Preprocessing for co-expression
│   └── RNAseq.analysis/R/                     # Supporting R functions
│       ├── rnaseq_analysis_functions.R        # DESeq2 wrappers, volcano plots, PCA
│       ├── rnaseq_qc_functions.R              # QC helper functions
│       ├── co_expression_functions.R          # WGCNA helper functions
│       ├── enrichment.R                       # Pathway enrichment functions
│       └── read_pathway_db.R                  # Load pathway databases
└── ATAC-seq/
    ├── ATACSeq/                               # ATAC-seq processing pipeline (Python)
    │   └── bin/
    │       ├── callpeaks.py                   # Peak calling (MACS2)
    │       ├── poolpeaks.py                   # Merge peaks across samples
    │       ├── callfootprints.py              # TF footprinting (TOBIAS ATACorrect + ScoreBigwig)
    │       └── calltfbindings.py              # Differential TF binding (TOBIAS BINDetect)
    ├── Differential_accessibility/
    │   ├── diff_peaks_v2.R                    # DiffBind differential accessibility script
    │   └── differential_accessibility.Rmd    # Differential accessibility analysis notebook
    └── Transcription_factor_footprint/
        ├── 1. merge_ATACpeaks.sh              # Merge peak files per cell type / background
        ├── 2. get_bamfile_list.sh             # Collect BAM file lists per group
        ├── 3. call_TF_footprints.sh           # Run TOBIAS ATACorrect + ScoreBigwig
        ├── 4. iN_merge_bigwigs.sh             # Merge per-sample bigWigs per group
        ├── 5. call_TF_bindings.sh             # Run TOBIAS BINDetect (DEL vs WT)
        └── tf_binding.Rmd                     # TF binding analysis and visualization notebook

Dependencies

R (≥ 4.0)

Package                      Purpose
DESeq2                       Differential expression analysis
sva                          Surrogate variable analysis (batch correction)
WGCNA                        Weighted gene co-expression network analysis
DiffBind                     Differential chromatin accessibility
GenomicRanges                Genomic interval operations
biomaRt                      Gene annotation retrieval
ggplot2, ggrepel, ggpubr     Visualization
pheatmap                     Heatmaps
VennDiagram                  Overlap visualization
dplyr, stringr, data.table   Data manipulation

Python (≥ 3.7) and command-line tools

Tool        Purpose
samtools    BAM processing
bedtools    Genomic interval operations
MACS2       Peak calling
TOBIAS      TF footprinting and differential binding
deepTools   BigWig generation

Analysis workflows

Note: The code provided here covers the iN heterozygous deletion analysis as an example. The same workflows were applied to NSC and compound heterozygous samples in the paper.

RNA-seq

  1. Preprocessing and QC: Gene counts and library-size-normalized counts per million (CPM) are computed from the count matrix of STAR-aligned reads. Lowly expressed genes are removed; a gene is kept only if it reaches CPM ≥ 0.5 in at least one group.

  2. Differential expression: DESeq2 is run separately for the GM8330 and MGH backgrounds. Surrogate variables are estimated with SVA and included as covariates to remove unwanted technical variation. Results are saved as CSV tables with log₂ fold changes and adjusted p-values.

  3. Cross-background replication: Differentially expressed genes (DEGs) that are significant at FDR < 0.1 in both backgrounds and change in the same direction are taken as the consensus POGZ DEG set.

  4. Pathway enrichment: The consensus DEG set is tested for enrichment in curated gene sets (GO, KEGG, and other pathway databases) using hypergeometric tests.

  5. Co-expression analysis: WGCNA is run on SVA-corrected, log₂-normalized counts from all iN samples. Module–trait correlations are computed against genotype and other metadata covariates.
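The low-expression filter in step 1 can be illustrated with a minimal Python sketch. This is not the repository's R code: the function names, sample names, and gene names are hypothetical, and comparing the per-group mean CPM against the threshold is an assumption, since the exact per-group rule is defined in the notebook.

```python
from collections import defaultdict


def cpm(counts):
    """Convert one sample's raw gene counts to counts per million (CPM)."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def filter_low_expression(sample_counts, sample_groups, min_cpm=0.5):
    """Return genes whose mean CPM is >= min_cpm in at least one group.

    Using the group mean is an assumption made for this sketch.
    """
    group_sums = defaultdict(lambda: defaultdict(float))
    group_sizes = defaultdict(int)
    for sample, counts in sample_counts.items():
        group = sample_groups[sample]
        group_sizes[group] += 1
        for gene, value in cpm(counts).items():
            group_sums[group][gene] += value

    kept = set()
    for group, sums in group_sums.items():
        for gene, total in sums.items():
            if total / group_sizes[group] >= min_cpm:
                kept.add(gene)
    return kept


# Hypothetical toy data: one sample per group.
sample_counts = {
    "WT_1": {"GENE_A": 999_990, "GENE_B": 10, "GENE_C": 0},
    "DEL_1": {"GENE_A": 1_999_999, "GENE_B": 1, "GENE_C": 0},
}
sample_groups = {"WT_1": "WT", "DEL_1": "DEL"}
kept_genes = filter_low_expression(sample_counts, sample_groups)
```

In this toy input GENE_C has zero counts in every sample and is dropped, while GENE_B passes because it clears the threshold in the WT group.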

The main differential expression notebook is code/RNA-seq/rnaseq_analysis_iN_final.Rmd. Co-expression analysis is in code/RNA-seq/co-expression_analysis_allsamples.R.
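Steps 3 and 4 of the RNA-seq workflow (cross-background replication and hypergeometric enrichment) can be sketched in plain Python as follows. The repository implements these in R; the function names, the input layout (gene mapped to a pair of log₂ fold change and adjusted p-value), and the example genes are assumptions made for illustration only.

```python
from math import comb


def consensus_degs(results_a, results_b, fdr=0.1):
    """Genes significant in both result sets with the same sign of change.

    Each argument maps gene -> (log2_fold_change, adjusted_p_value).
    """
    shared = set()
    for gene, (lfc_a, padj_a) in results_a.items():
        if gene not in results_b:
            continue
        lfc_b, padj_b = results_b[gene]
        same_direction = lfc_a * lfc_b > 0
        if padj_a < fdr and padj_b < fdr and same_direction:
            shared.add(gene)
    return shared


def hypergeom_enrichment_p(universe_size, set_size, n_hits, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    universe_size: number of background genes
    set_size:      genes in the pathway that are part of the background
    n_hits:        size of the DEG list
    overlap:       DEGs that fall in the pathway
    """
    upper = min(set_size, n_hits)
    tail = sum(
        comb(set_size, i) * comb(universe_size - set_size, n_hits - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(universe_size, n_hits)


# Hypothetical per-background results.
gm8330 = {"GENE_A": (1.2, 0.01), "GENE_B": (-0.8, 0.04), "GENE_C": (0.5, 0.30)}
mgh = {"GENE_A": (0.9, 0.05), "GENE_B": (0.7, 0.02), "GENE_C": (0.6, 0.01)}
consensus = consensus_degs(gm8330, mgh)

# 10 background genes, 4 in the pathway, 3 DEGs, 2 of them in the pathway.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
```

Here GENE_B is excluded because its direction differs between backgrounds, and GENE_C because it is not significant in the first background. The choice of background universe strongly affects the enrichment p-value, so it should match the set of genes that passed the expression filter.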


ATAC-seq

Step 1 — Peak calling and IDR

python code/ATAC-seq/ATACSeq/bin/callpeaks.py --bam <sample.bam> --outdir <peaks_dir>

Peaks are called with MACS2. IDR (Irreproducible Discovery Rate) analysis is used to define the conservative and optimal peak sets across replicates.

Step 2 — Differential accessibility (DiffBind)

Run diff_peaks_v2.R with a DiffBind sample sheet:

Rscript code/ATAC-seq/Differential_accessibility/diff_peaks_v2.R \
  <metadata.csv> <background> <tissue> <output_dir> <use_overlapped_peaks> [is_compound_het]

Arguments:

  1. DiffBind sample sheet CSV
  2. Genetic background (e.g. GM or MGH)
  3. Cell type (e.g. iN or NSC)
  4. Output directory
  5. Whether to use overlapped peaks across sample groups (TRUE/FALSE)
  6. Whether samples are compound heterozygous (optional, defaults to FALSE)
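When the script is run for several backgrounds or cell types, it can help to assemble the positional arguments programmatically so their order stays consistent. The helper below is a hypothetical convenience wrapper, not part of the repository; it only builds the command line in the order documented above and assumes the script accepts the literal strings TRUE and FALSE.

```python
SCRIPT = "code/ATAC-seq/Differential_accessibility/diff_peaks_v2.R"


def r_bool(flag):
    """Render a Python boolean as the TRUE/FALSE string used on the command line."""
    return "TRUE" if flag else "FALSE"


def build_diffbind_command(sample_sheet, background, tissue, output_dir,
                           use_overlapped_peaks, is_compound_het=None):
    """Build the Rscript argument list in the documented positional order.

    The sixth argument is appended only when given, so the script's own
    default (FALSE) applies otherwise.
    """
    cmd = ["Rscript", SCRIPT, sample_sheet, background, tissue,
           output_dir, r_bool(use_overlapped_peaks)]
    if is_compound_het is not None:
        cmd.append(r_bool(is_compound_het))
    return cmd


# Hypothetical file and directory names.
command = build_diffbind_command("metadata.csv", "GM", "iN", "results/GM_iN", True)
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.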

Differentially accessible regions (DARs) are identified at FDR < 0.05. The full analysis including annotation and visualization is in code/ATAC-seq/Differential_accessibility/differential_accessibility.Rmd.
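To make the FDR < 0.05 cutoff concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment followed by thresholding. It assumes the reported FDR values are Benjamini-Hochberg adjusted p-values, which is typical for this kind of analysis but is not stated in this README; DiffBind computes these values itself, so the code is only an illustration, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so each adjusted value
    # is the minimum of p * n / rank over this rank and all higher ranks.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four peaks.
raw_p = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(raw_p)
significant = [i for i, q in enumerate(fdr) if q < 0.05]
```

Only the first peak survives the threshold in this example: the peaks with raw p-values of 0.03 and 0.04 both receive an adjusted value of about 0.053.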

Step 3 — TF footprinting and differential binding (TOBIAS)

Run the numbered shell scripts in order within code/ATAC-seq/Transcription_factor_footprint/:

# 1. Merge IDR peaks per cell type and background
bash "code/ATAC-seq/Transcription_factor_footprint/1. merge_ATACpeaks.sh"

# 2. Collect per-group BAM file lists
bash "code/ATAC-seq/Transcription_factor_footprint/2. get_bamfile_list.sh"

# 3. Compute ATAC-seq bias correction and footprint scores (TOBIAS ATACorrect + ScoreBigwig)
bash "code/ATAC-seq/Transcription_factor_footprint/3. call_TF_footprints.sh"

# 4. Merge per-sample bigWigs per group
bash "code/ATAC-seq/Transcription_factor_footprint/4. iN_merge_bigwigs.sh"

# 5. Run differential TF binding analysis (TOBIAS BINDetect, DEL vs WT)
bash "code/ATAC-seq/Transcription_factor_footprint/5. call_TF_bindings.sh"
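Because each step consumes the output of the previous one, the scripts should stop at the first failure. A small driver along the following lines could enforce that; it is a hypothetical helper, not part of the repository, and it assumes the scripts take no arguments and can be launched from the current working directory.

```python
import re
import subprocess
from pathlib import Path

# Matches names such as "3. call_TF_footprints.sh" and captures the step number.
STEP_PATTERN = re.compile(r"^(\d+)\.\s.+\.sh$")


def ordered_step_scripts(filenames):
    """Return the numbered shell scripts sorted by their numeric prefix."""
    steps = []
    for name in filenames:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def run_steps(directory):
    """Run each numbered script with bash, aborting on the first non-zero exit."""
    directory = Path(directory)
    names = [path.name for path in directory.iterdir()]
    for name in ordered_step_scripts(names):
        subprocess.run(["bash", str(directory / name)], check=True)


# Ordering check on the file names listed in the repository structure.
listing = [
    "tf_binding.Rmd",
    "5. call_TF_bindings.sh",
    "2. get_bamfile_list.sh",
    "4. iN_merge_bigwigs.sh",
    "1. merge_ATACpeaks.sh",
    "3. call_TF_footprints.sh",
]
ordered = ordered_step_scripts(listing)
```

Calling `run_steps("code/ATAC-seq/Transcription_factor_footprint")` would then execute steps 1 to 5 in sequence; the notebook tf_binding.Rmd is skipped because it does not match the pattern.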

TF motifs are from the JASPAR vertebrates non-redundant collection, supplemented with custom POGZ motifs. Downstream statistical analysis and visualization are in code/ATAC-seq/Transcription_factor_footprint/tf_binding.Rmd.


Data availability

Raw sequencing data and processed files are deposited at GEO under accession [GEO accession].


Citation

If you use this code, please cite:

[Authors]. [Paper Title]. [Journal], [Year]. DOI: [DOI]

About

RNAseq and ATACseq analyses of CRISPR-edited POGZ in human neural stem cells and induced neurons
