This project builds a complete machine learning pipeline to predict DNA methylation status based on genomic features. The goal is to identify epigenetic patterns that can serve as early detection biomarkers for rare genetic vision disorders, such as Retinitis Pigmentosa.
After evaluating multiple algorithms, a simple linear model outperformed complex tree-based models, indicating that the feature boundaries in this specific dataset are highly linear.
- Winning Model: Logistic Regression
- Accuracy: 97.00%
The data is sourced from the Kaggle dataset: DNA Methylation Data - Epigenetic Biomarkers. It contains 1,000 samples with four key genomic features.
| Feature Name | Type | Description |
|---|---|---|
cpg_density |
Genomic Feature | The density or concentration of CpG sites in the specific genomic region. |
genomic_location |
Genomic Feature | The normalized physical position of the site within the chromosome. |
regulatory_score |
Genomic Feature | A metric indicating the potential regulatory activity (like promoters or enhancers) of the region. |
conservation_score |
Genomic Feature | A score representing how well this genomic sequence has been preserved across evolution. |
methylation_status |
Target Variable | The binary classification target (0 = Unmethylated, 1 = Methylated). |