gil-son/machine-learning

Repository Purpose

This repository is a comprehensive guide to Machine Learning, designed to bridge theoretical concepts with practical, hands-on implementations. It serves as a learning lab for anyone—from beginners to practitioners—looking to deepen their understanding of core ML foundations and algorithms.

Goals

  • Demystify Machine Learning through structured explanations and illustrative examples
  • Organize ML algorithms into key paradigms: Supervised, Unsupervised, Semi-Supervised, and Reinforcement Learning
  • Enable experimentation with interactive Jupyter Notebooks for real-world learning
  • Support understanding of mathematical concepts and simplify complex topics like optimization, statistics, and linear algebra

What is Machine Learning?

Machine Learning (ML) is a subset of Artificial Intelligence that allows systems to learn from experience (data) and improve their performance on a task without being explicitly programmed with rules. Instead of following hardcoded instructions, the system identifies patterns in data and uses those patterns to make predictions or decisions.

Analogy:

Think of a baby learning to recognize animals. At first, the baby is shown pictures of cats and dogs. Over time, the baby begins to notice patterns — cats have pointy ears, dogs often have longer snouts. Eventually, the baby can identify a new picture as a "dog" or "cat" based on what they’ve seen before — even without being told the rules. Machine Learning works in a similar way: it learns from examples instead of being told exactly what to do.

Example:

A machine learning model learns to recommend movies based on a user's viewing history and preferences — just like how a friend might suggest a movie based on what you’ve enjoyed before.


Types of Machine Learning

Supervised Learning

This is by far the most widely used type of ML in real-world applications.

  • What it is: You train a model on labeled data (i.e., the input and expected output are both known).
  • Use Cases:
    • Email spam detection
    • Credit scoring
    • Medical diagnosis
    • House price prediction

Popular Algorithms

Linear Regression
  • Concept: Predicts a continuous value (e.g., student test score) based on one or more input features.

  • Essential Math:

    $y = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$

  • It minimizes the Mean Squared Error (MSE) between predicted and actual values.

  • Use Case: Predicting prices, trends, or scores.
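The weights and bias can be fit in closed form by minimizing the MSE. A minimal sketch with numpy — the hours-studied vs. test-score data below is invented for illustration:

```python
import numpy as np

# Toy data (hypothetical): hours studied -> test score, following score = 10*hours + 50
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([60.0, 70.0, 80.0, 90.0, 100.0])

# Append a column of ones so the intercept b is learned alongside the weight w
A = np.column_stack([X, np.ones_like(X)])

# np.linalg.lstsq minimizes the squared error ||A @ [w, b] - y||^2 (MSE up to a constant)
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"w = {w:.2f}, b = {b:.2f}")              # recovers w = 10, b = 50 on this noiseless data
print(f"prediction for 6 hours: {w * 6 + b:.1f}")  # 110.0
```

On real, noisy data the recovered `w` and `b` would only approximate the underlying trend, but the fitting call is identical.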

Logistic Regression
  • Concept: Used for binary classification (e.g., pass/fail, spam/ham).

  • Essential Math:

    $P(y = 1 \mid x) = \sigma(w_1x_1 + w_2x_2 + \cdots + w_nx_n + b)$

    Where the sigmoid function is:

    $\sigma(z) = \frac{1}{1 + e^{-z}}$

  • Use Case: Disease prediction, marketing response, fraud detection.
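The sigmoid above is easy to verify numerically — a short pure-Python sketch with toy inputs:

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into (0, 1), interpreted as P(y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-z))

# The decision boundary sits at z = 0, where the probability is exactly 0.5
print(sigmoid(0.0))    # 0.5
print(sigmoid(4.0))    # ~0.982 -> confidently class 1
print(sigmoid(-4.0))   # ~0.018 -> confidently class 0
```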

Decision Trees
  • Concept: A flowchart-like structure where each internal node splits the data based on a feature.

  • Essential Math:

    • Gini Impurity:

      $G = 1 - \sum_{i=1}^{C} p_i^2$

    • Entropy (for Information Gain):

      $H = - \sum_{i=1}^{C} p_i \log_2(p_i)$

  • Use Case: Customer segmentation, credit risk modeling.
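Both impurity measures follow directly from the formulas above. A minimal sketch with invented class proportions:

```python
import math

def gini(probs):
    """Gini impurity G = 1 - sum(p_i^2); 0 means a pure node."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy H = -sum(p_i * log2(p_i)); also 0 for a pure node."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(gini([1.0, 0.0]))      # 0.0 (pure node)
print(gini([0.5, 0.5]))      # 0.5 (maximum impurity for 2 classes)
print(entropy([0.5, 0.5]))   # 1.0 (maximum entropy for 2 classes)
```

A tree learner picks the split that lowers these values the most across the resulting child nodes.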

Random Forest
  • Concept: An ensemble of decision trees trained on random subsets of data and features.

  • Essential Math:

    • For Regression (average the tree predictions):

      $\hat{y} = \frac{1}{T} \sum_{t=1}^{T} y_t$

    • For Classification:

      $\hat{y} = \text{majority vote of } (y_1, y_2, \ldots, y_T)$

  • Use Case: Robust classification and regression tasks, e.g., loan approval, stock prediction.
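The two aggregation rules can be sketched in a few lines — the individual tree outputs below are invented for illustration:

```python
from collections import Counter

def aggregate_regression(tree_preds):
    """Average the T individual tree predictions."""
    return sum(tree_preds) / len(tree_preds)

def aggregate_classification(tree_preds):
    """Majority vote across the T trees."""
    return Counter(tree_preds).most_common(1)[0][0]

# Hypothetical outputs from T = 5 trees
print(aggregate_regression([210.0, 190.0, 205.0, 200.0, 195.0]))                    # 200.0
print(aggregate_classification(["approve", "deny", "approve", "approve", "deny"]))  # approve
```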

Support Vector Machines (SVM)
  • Concept: Finds the hyperplane that best separates the data into classes.
  • Essential Math:
    • Decision boundary:

      $w \cdot x + b = 0$

    • Optimization constraint:

      $y_i(w \cdot x_i + b) \geq 1$

    • Margin to maximize:

      $\frac{2}{\lVert w \rVert}$

  • Can use the kernel trick (e.g., RBF kernel) to handle non-linear decision boundaries.
  • Use Case: Text classification, face recognition, bioinformatics.
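A minimal sketch of the decision rule and margin, using hypothetical values for $w$ and $b$ (as if a linear SVM had already been trained):

```python
import numpy as np

# Hypothetical learned parameters of a linear SVM
w = np.array([2.0, -1.0])
b = -1.0

def classify(x):
    """Sign of w . x + b decides the class; |w . x + b| / ||w|| is the distance to the boundary."""
    score = np.dot(w, x) + b
    return 1 if score >= 0 else -1

print(classify(np.array([2.0, 1.0])))   # w.x + b = 4 - 1 - 1 = 2  -> class +1
print(classify(np.array([0.0, 2.0])))   # w.x + b = 0 - 2 - 1 = -3 -> class -1
print(2 / np.linalg.norm(w))            # margin 2/||w|| ~ 0.894
```

Training consists of choosing `w` and `b` to maximize that last quantity subject to the constraint above; kernels replace the dot product with a similarity function.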
k-Nearest Neighbors (kNN)
  • Concept: Classifies a sample based on the majority vote (classification) or average (regression) of its k closest neighbors.

  • Essential Math:

    • Euclidean Distance:

      $d(x, x') = \sqrt{ \sum_{i=1}^{n} (x_i - x'_i)^2 }$

  • Other distance metrics can be used, such as Manhattan, Cosine, or Minkowski, depending on the data.

  • Use Case: Recommender systems, image classification, anomaly detection.
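The whole algorithm fits in a few lines. A sketch with Euclidean distance and an invented toy dataset:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda pt: euclidean(pt[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Tiny invented 2-D dataset: (features, label)
train = [((1, 1), "cat"), ((1, 2), "cat"), ((2, 1), "cat"),
         ((6, 6), "dog"), ((7, 6), "dog"), ((6, 7), "dog")]

print(knn_classify(train, (2, 2)))  # cat -- its 3 nearest neighbors are all cats
print(knn_classify(train, (6, 5)))  # dog
```

Note that kNN has no training phase at all; the cost is paid at prediction time, which is why libraries use spatial indexes (KD-trees, ball trees) instead of this brute-force scan.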


Unsupervised Learning

  • What it is: The model tries to find patterns and groupings in the data without labeled outputs.
  • Use Cases:
    • Customer segmentation
    • Market basket analysis
    • Anomaly detection
  • Popular Algorithms:
    • k-Means Clustering
    • DBSCAN
    • PCA (Principal Component Analysis)
  • Python Libraries: scikit-learn, scipy, matplotlib
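As a taste of clustering, here is a minimal k-means sketch in plain numpy; the two blobs and the fixed initial centroids are invented to keep the demo deterministic (in practice you would reach for scikit-learn's KMeans, which also handles initialization):

```python
import numpy as np

def k_means(X, centroids, n_iters=10):
    """Minimal k-means: alternate assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(n_iters):
        # Distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])
    return labels, centroids

# Two invented blobs around (0, 0) and (10, 10)
X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], dtype=float)
init = np.array([[0.0, 0.0], [10.0, 10.0]])  # fixed init keeps the demo deterministic

labels, centers = k_means(X, init)
print(labels)   # [0 0 0 1 1 1]
print(centers)  # approx [[0.33, 0.33], [10.33, 10.33]]
```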

Reinforcement Learning

  • What it is: An agent learns to make decisions by interacting with an environment and getting feedback (rewards or penalties).
  • Use Cases:
    • Robotics
    • Game playing (e.g., AlphaGo)
    • Self-driving cars
  • Popular Libraries: OpenAI Gym, Stable-Baselines, TensorFlow, PyTorch
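The core of tabular Q-learning is a one-line update. A minimal sketch on an invented 5-state corridor where the agent earns a reward of 1 for reaching the rightmost state (no RL library needed; exploration here is purely random, which is fine because Q-learning is off-policy):

```python
import random

# Toy corridor: states 0..4; stepping onto state 4 (the goal) gives reward 1 and ends the episode
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                     # move left, move right
alpha, gamma = 0.5, 0.9                # learning rate and discount factor

Q = [[0.0, 0.0] for _ in range(N_STATES)]
random.seed(0)

for episode in range(2000):
    s = 0
    while s != GOAL:
        a = random.randrange(2)        # fully random exploration
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Core update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

greedy = [max((0, 1), key=lambda i: Q[s][i]) for s in range(GOAL)]
print(greedy)  # [1, 1, 1, 1] -> the learned policy is "always move right"
```

Despite acting randomly, the agent learns the optimal action-values ($Q(s, \text{right}) = \gamma^{3-s}$ here), which is exactly the off-policy property that distinguishes Q-learning from SARSA.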

Semi-Supervised Learning

  • What it is: Combines a small amount of labeled data with a large amount of unlabeled data to improve learning when labeling is expensive.
  • Use Cases:
    • Web page classification
    • Medical imaging
    • Speech recognition
    • Fraud detection
  • Popular Algorithms:
    • Self-training
    • Label propagation
    • Semi-supervised Support Vector Machines (S3VM)
    • Graph-based methods
  • Python Libraries: scikit-learn, sklearn.semi_supervised, TensorFlow, PyTorch
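Self-training can be sketched without any library: start from a few labeled seeds, pseudo-label the unlabeled points the current model is most confident about, and repeat. The data, the 1-nearest-neighbor "model", and the distance threshold below are all invented for illustration:

```python
import math

def nearest(labeled, x):
    """Return (distance, label) of the closest labeled point to x."""
    return min((math.dist(p, x), lab) for p, lab in labeled)

# Two labeled seeds plus several unlabeled points (invented 2-D data)
labeled = [((0.0, 0.0), "A"), ((10.0, 10.0), "B")]
unlabeled = [(1.0, 1.0), (9.0, 9.0), (2.0, 1.0), (8.0, 9.0), (5.0, 5.0)]

# Self-training loop: pseudo-label confident points (within distance 3 of a labeled
# point -- an arbitrary threshold), grow the labeled set, and repeat until nothing changes.
changed = True
while changed:
    changed = False
    for x in list(unlabeled):
        dist, lab = nearest(labeled, x)
        if dist <= 3.0:
            labeled.append((x, lab))
            unlabeled.remove(x)
            changed = True

print([lab for _, lab in labeled])
print(unlabeled)  # [(5.0, 5.0)] stays unlabeled: never close enough to either cluster
```

Real self-training replaces the distance threshold with the classifier's own confidence score, but the loop structure is the same.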

Types of Data

Machine learning problems and evaluation metrics depend heavily on the type of data being used.
Understanding data types helps determine:

  • Which algorithm to use
  • Which preprocessing steps to apply
  • Which evaluation metrics are appropriate

```mermaid
flowchart TD
    A[Data Types] --> B[Numerical]
    A --> C[Categorical]
    A --> D[Binary]
    A --> E[Time Series]
    A --> F[Text]
    A --> G[Image]
    A --> H[Tabular]

    B --> B1[Discrete]
    B --> B2[Continuous]

    C --> C1[Nominal]
    C --> C2[Ordinal]
```


Numerical Data

Numerical data represents quantitative values and can be measured.

Examples:

  • Age: 25
  • Salary: 5000
  • Temperature: 22.5°C
  • House price: 350000

Types of Numerical Data

Discrete
Integer values (counts)

  • Number of customers
  • Number of clicks
  • Number of purchases

Continuous
Real values (measurements)

  • Height
  • Weight
  • Time
  • Price

Used in:

  • Regression
  • Forecasting
  • Time series

```mermaid
graph TD
    A[Numerical Data] --> B[Discrete]
    A --> C[Continuous]

    B --> B1[Counts]
    B --> B2[Integer values]

    C --> C1[Measurements]
    C --> C2[Real numbers]
```

Categorical Data

Categorical data represents labels or categories.

Examples:

  • Color: Red, Blue, Green
  • Country: Brazil, USA, Canada
  • Payment method: Credit Card, Pix, Cash

Types of Categorical Data

Nominal (No order)

  • Dog, Cat, Bird
  • Red, Blue, Green
  • Male, Female

No ranking exists.

Ordinal (Has order)

  • Small, Medium, Large
  • Low, Medium, High
  • Bronze, Silver, Gold

Ranking exists, but distances are not meaningful.

Used in:

  • Classification
  • Segmentation
  • Decision trees

```mermaid
graph TD
    A[Categorical Data] --> B[Nominal]
    A --> C[Ordinal]

    B --> B1[No order]
    B --> B2[Labels]

    C --> C1[Ordered categories]
    C --> C2[Ranking exists]
```

Binary Data

Binary data is a special case of categorical data with two values.

Examples:

  • Yes / No
  • True / False
  • Spam / Not Spam
  • Fraud / Not Fraud

Used in:

  • Binary classification
  • Logistic regression
  • Precision / Recall / F1 evaluation

```mermaid
graph TD
    A[Binary Data] --> B[Yes / No]
    A --> C[True / False]
    A --> D[0 / 1]
    A --> E[Spam / Not Spam]
```

Time Series Data

Data indexed by time order.

Examples:

  • Stock prices
  • Sales per day
  • Temperature over time
  • Website traffic

Characteristics:

  • Order matters
  • Cannot shuffle randomly
  • Requires special validation

Used in:

  • Forecasting
  • Anomaly detection
  • Sequential models

```mermaid
graph TD
    A[Time Series Data] --> B[Ordered by Time]
    A --> C[Temporal Dependency]
    A --> D[Trend]
    A --> E[Seasonality]
```
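The "requires special validation" point deserves a concrete example: instead of random splits, use expanding windows so the training data always precedes the test data. A minimal sketch (the same idea as scikit-learn's TimeSeriesSplit, though its exact fold sizing differs slightly):

```python
def time_series_splits(n, n_splits=3):
    """Expanding-window splits: each fold trains on everything before its test block,
    so the model never sees the future."""
    fold = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, fold * i))
        test = list(range(fold * i, min(fold * (i + 1), n)))
        yield train, test

for train, test in time_series_splits(8, n_splits=3):
    print("train:", train, "| test:", test)
# train: [0, 1] | test: [2, 3]
# train: [0, 1, 2, 3] | test: [4, 5]
# train: [0, 1, 2, 3, 4, 5] | test: [6, 7]
```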

Text Data

Unstructured textual information.

Examples:

  • Emails
  • Reviews
  • Tweets
  • Documents

Requires preprocessing:

  • Tokenization
  • Embeddings
  • Vectorization

Used in:

  • NLP
  • Sentiment analysis
  • Classification
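Vectorization can be illustrated with a minimal bag-of-words sketch — a simplified version of what a library vectorizer (e.g. scikit-learn's CountVectorizer) does; the documents are invented:

```python
def build_vocab(docs):
    """Map each unique token to a column index."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(doc, vocab):
    """Bag-of-words count vector: entry i = how often vocab word i appears in doc."""
    vec = [0] * len(vocab)
    for token in doc.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

docs = ["good movie", "bad movie", "good good plot"]
vocab = build_vocab(docs)
print(vocab)                               # {'good': 0, 'movie': 1, 'bad': 2, 'plot': 3}
print(vectorize("good good plot", vocab))  # [2, 0, 0, 1]
```

Word order is lost in this representation, which is why NLP models that need it use sequences or embeddings instead.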

```mermaid
graph TD
    A[Text Data] --> B[Unstructured]
    A --> C[Requires Processing]

    C --> D[Tokenization]
    C --> E[Embedding]
    C --> F[Vectorization]
```

Image Data

Visual pixel-based data.

Examples:

  • Photos
  • Medical images
  • Satellite images

Used in:

  • Computer Vision
  • Object detection
  • Image classification

```mermaid
graph TD
    A[Image Data] --> B[Pixels]
    A --> C[Channels]

    C --> D[RGB]
    C --> E[Grayscale]

    A --> F[Computer Vision Tasks]
```

Tabular Data

Most common structured format (rows and columns).

Example:

| Age | Salary | Country | Bought |
|-----|--------|---------|--------|
| 25  | 3000   | Brazil  | Yes    |
| 40  | 8000   | USA     | No     |

Used in:

  • Most ML problems
  • Classification
  • Regression

```mermaid
graph TD
    A[Tabular Data] --> B[Rows]
    A --> C[Columns]
    A --> D[Features]
    A --> E[Target]
```

Data Type vs Problem Type

| Data Type   | Typical Task          |
|-------------|-----------------------|
| Numerical   | Regression            |
| Categorical | Classification        |
| Binary      | Binary Classification |
| Time Series | Forecasting           |
| Text        | NLP                   |
| Image       | Computer Vision       |
| Tabular     | General ML            |

Why This Matters for Evaluation

The data type determines the evaluation metric:

| Data Type   | Task           | Metrics           |
|-------------|----------------|-------------------|
| Categorical | Classification | Accuracy, F1      |
| Binary      | Classification | Precision, Recall |
| Numerical   | Regression     | MAE, MSE          |
| Ranking     | Retrieval      | Precision@K       |
| Time Series | Forecasting    | MAE, RMSE         |
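As a concrete example of the binary-classification metrics, here are precision, recall, and F1 computed from scratch on invented spam-detector predictions:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 from scratch for a binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0   # of everything flagged, how much was right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of all positives, how much was caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented predictions for a spam detector (1 = spam)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.75 recall=0.75 f1=0.75
```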

Machine Learning Techniques

  • Classification
    A supervised learning task where the model learns to categorize data into predefined classes or labels.
    Example: Predicting if an email is spam or not spam.

  • Regression
    A supervised learning task where the goal is to predict a continuous value.
    Example: Predicting the price of a house based on size, location, etc.

  • Clustering
    An unsupervised learning method where the algorithm groups data into clusters based on similarity—without predefined labels.
    Example: Segmenting customers into groups based on their behavior or purchases.

  • Anomaly Detection
    Identifying data points that are unusual or deviate significantly from the majority.
    Example: Detecting fraudulent credit card transactions.

  • Sequence Mining
    Analyzing and identifying patterns in ordered data (sequences), especially over time.
    Example: Finding common sequences in customer purchases or website navigation.

  • Dimension Reduction
    Reducing the number of features (dimensions) in a dataset while keeping important information—used to simplify models and visualize high-dimensional data.
    Example: Using PCA (Principal Component Analysis) to reduce image data with thousands of pixels into just a few features.

  • Recommendation System
    A system that suggests items (movies, products, etc.) to users based on their preferences or behaviors.
    Example: Netflix recommending movies or shows based on your watch history.


Machine Learning Model Lifecycle

  • Problem Definition
    Clearly define the objective of the machine learning task.
    Example: Predict customer churn or classify product reviews as positive or negative.

  • Data Collection
    Gather relevant and sufficient raw data from various sources like databases, APIs, sensors, or manual input.
    Example: Collecting user behavior logs or survey results.

  • Data Preparation
    Clean, transform, and structure the data for training. This includes handling missing values, encoding categories, and normalizing values.
    Example: Converting text into numeric form or removing outliers.

  • Model Development and Evaluation
    Choose a model type, train it using prepared data, and evaluate its accuracy, precision, recall, or other relevant metrics.
    Example: Training a decision tree and evaluating it using cross-validation.

  • Model Deployment
    Integrate the trained model into a production environment where it can receive real input and make predictions.
    Example: Deploying a fraud detection model via an API to monitor real-time transactions.


Planned Repository Structure

```
machine-learning/
│
├── README.md                             # High-level introduction to Machine Learning
│
├── evaluations/
│   ├── README.md
│   │
│   ├── 01.metrics/
│   │   ├── 01.classification/
│   │   │   ├── confusion_matrix.md
│   │   │   ├── accuracy.md
│   │   │   ├── precision.md
│   │   │   ├── recall.md
│   │   │   ├── f1_score.md
│   │   │   ├── roc_auc.md
│   │   │   └── log_loss.md
│   │   │
│   │   ├── 02.regression/
│   │   │   ├── mae.md
│   │   │   ├── mse.md
│   │   │   ├── rmse.md
│   │   │   ├── r2_score.md
│   │   │   └── mape.md
│   │   │
│   │   └── 03.ranking/
│   │       ├── precision_at_k.md
│   │       ├── recall_at_k.md
│   │       └── ndcg.md
│   │
│   └── 02.validation/
│       ├── train_test_split.md
│       ├── cross_validation.md
│       ├── k_fold.md
│       ├── stratified_k_fold.md
│       ├── leave_one_out.md
│       ├── time_series_split.md
│       └── bootstrap.md
│
├── supervised/
│   ├── README.md                       # Core concepts: labeled data, overfitting, etc.
│   ├── 01.linear_regression.md
│   ├── 02.logistic_regression.md
│   ├── 03.k_nearest_neighbors.md
│   ├── 04.naive_bayes.md
│   ├── 05.svm.md
│   ├── 06.decision_trees.md
│   ├── 07.random_forest.md
│   ├── 08.gradient_boosting.md
│   ├── 09.neural_networks.md
│   ├── algorithms/
│   └── notebooks/
│
├── unsupervised/
│   ├── README.md                       # Key ideas: clustering, dimensionality reduction, etc.
│   ├── 01.k_means.md
│   ├── 02.dbscan.md
│   ├── 03.hierarchical_clustering.md
│   ├── 04.pca.md
│   ├── 05.tsne.md
│   ├── algorithms/
│   └── notebooks/
│
├── reinforcement_learning/
│   ├── README.md                       # Basics of agents, environments, rewards, etc.
│   ├── 01.q_learning.md
│   ├── 02.sarsa.md
│   ├── 03.deep_q_network.md
│   ├── 04.policy_gradient.md
│   ├── algorithms/
│   └── notebooks/
│
├── semi_supervised_learning/
│   ├── README.md                       # Hybrid between supervised and unsupervised
│   ├── 01.self_training.md
│   ├── 02.label_propagation.md
│   ├── algorithms/
│   └── notebooks/
│
└── shared_resources/
    ├── datasets/                         # Sample datasets used across topics
    ├── utils/                            # Reusable utility functions
    └── references.md                     # Useful academic references and links
```
