gil-son/machine-learning

Repository Purpose

This repository is a comprehensive guide to Machine Learning, designed to bridge theoretical concepts with practical, hands-on implementations. It serves as a learning lab for anyone—from beginners to practitioners—looking to deepen their understanding of core ML foundations and algorithms.

Goals

  • Demystify Machine Learning through structured explanations and illustrative examples
  • Organize ML algorithms into key paradigms: Supervised, Unsupervised, Semi-Supervised, and Reinforcement Learning
  • Enable experimentation with interactive Jupyter Notebooks for real-world learning
  • Support understanding of mathematical concepts and simplify complex topics like optimization, statistics, and linear algebra

What is Machine Learning?

Machine Learning (ML) is a subset of Artificial Intelligence that allows systems to learn from experience (data) and improve their performance on a task without being explicitly programmed with rules. Instead of following hardcoded instructions, the system identifies patterns in data and uses those patterns to make predictions or decisions.

Analogy:

Think of a baby learning to recognize animals. At first, the baby is shown pictures of cats and dogs. Over time, the baby begins to notice patterns — cats have pointy ears, dogs often have longer snouts. Eventually, the baby can identify a new picture as a "dog" or "cat" based on what they’ve seen before — even without being told the rules. Machine Learning works in a similar way: it learns from examples instead of being told exactly what to do.

Example:

A machine learning model learns to recommend movies based on a user's viewing history and preferences — just like how a friend might suggest a movie based on what you’ve enjoyed before.


Types of Machine Learning

Supervised Learning

This is by far the most widely used type of ML in real-world applications.

  • What it is: You train a model on labeled data (i.e., the input and expected output are both known).
  • Use Cases:
    • Email spam detection
    • Credit scoring
    • Medical diagnosis
    • House price prediction

Popular Algorithms

Linear Regression
  • Concept: Predicts a continuous value (e.g., student test score) based on one or more input features.

  • Essential Math:

    $y = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$

  • It minimizes the Mean Squared Error (MSE) between predicted and actual values.

  • Use Case: Predicting prices, trends, or scores.
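The weights and bias can be fit in closed form by minimizing the MSE. A minimal sketch with numpy — the hours-studied vs. test-score data below is invented for illustration:

```python
import numpy as np

# Toy data (hypothetical): hours studied -> test score, following score = 10*hours + 50
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([60.0, 70.0, 80.0, 90.0, 100.0])

# Append a column of ones so the intercept b is learned alongside the weight w
A = np.column_stack([X, np.ones_like(X)])

# np.linalg.lstsq minimizes the squared error ||A @ [w, b] - y||^2 (MSE up to a constant)
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"w = {w:.2f}, b = {b:.2f}")              # recovers w = 10, b = 50 on this noiseless data
print(f"prediction for 6 hours: {w * 6 + b:.1f}")  # 110.0
```

On real, noisy data the recovered `w` and `b` would only approximate the underlying trend, but the fitting call is identical.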

Logistic Regression
  • Concept: Used for binary classification (e.g., pass/fail, spam/ham).

  • Essential Math:

    $P(y = 1 \mid x) = \sigma(w_1x_1 + w_2x_2 + \cdots + w_nx_n + b)$

    Where the sigmoid function is:

    $\sigma(z) = \frac{1}{1 + e^{-z}}$

  • Use Case: Disease prediction, marketing response, fraud detection.
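The sigmoid above is easy to verify numerically — a short pure-Python sketch with toy inputs:

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into (0, 1), interpreted as P(y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-z))

# The decision boundary sits at z = 0, where the probability is exactly 0.5
print(sigmoid(0.0))    # 0.5
print(sigmoid(4.0))    # ~0.982 -> confidently class 1
print(sigmoid(-4.0))   # ~0.018 -> confidently class 0
```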

Decision Trees
  • Concept: A flowchart-like structure where each internal node splits the data based on a feature.

  • Essential Math:

    • Gini Impurity:

      $G = 1 - \sum_{i=1}^{C} p_i^2$

    • Entropy (for Information Gain):

      $H = - \sum_{i=1}^{C} p_i \log_2(p_i)$

  • Use Case: Customer segmentation, credit risk modeling.
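Both impurity measures follow directly from the formulas above. A minimal sketch with invented class proportions:

```python
import math

def gini(probs):
    """Gini impurity G = 1 - sum(p_i^2); 0 means a pure node."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy H = -sum(p_i * log2(p_i)); also 0 for a pure node."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(gini([1.0, 0.0]))      # 0.0 (pure node)
print(gini([0.5, 0.5]))      # 0.5 (maximum impurity for 2 classes)
print(entropy([0.5, 0.5]))   # 1.0 (maximum entropy for 2 classes)
```

A tree learner picks the split that lowers these values the most across the resulting child nodes.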

Random Forest
  • Concept: An ensemble of decision trees trained on random subsets of data and features.

  • Essential Math:

    • For Regression (average the tree predictions):

      $\hat{y} = \frac{1}{T} \sum_{t=1}^{T} y_t$

    • For Classification:

      $\hat{y} = \text{majority vote of } (y_1, y_2, \ldots, y_T)$

  • Use Case: Robust classification and regression tasks, e.g., loan approval, stock prediction.
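The two aggregation rules can be sketched in a few lines — the individual tree outputs below are invented for illustration:

```python
from collections import Counter

def aggregate_regression(tree_preds):
    """Average the T individual tree predictions."""
    return sum(tree_preds) / len(tree_preds)

def aggregate_classification(tree_preds):
    """Majority vote across the T trees."""
    return Counter(tree_preds).most_common(1)[0][0]

# Hypothetical outputs from T = 5 trees
print(aggregate_regression([210.0, 190.0, 205.0, 200.0, 195.0]))                    # 200.0
print(aggregate_classification(["approve", "deny", "approve", "approve", "deny"]))  # approve
```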

Support Vector Machines (SVM)
  • Concept: Finds the hyperplane that best separates the data into classes.
  • Essential Math:
    • Decision boundary:

      $w \cdot x + b = 0$

    • Optimization constraint:

      $y_i(w \cdot x_i + b) \geq 1$

    • Margin to maximize:

      $\frac{2}{\lVert w \rVert}$

  • Can use the kernel trick (e.g., RBF kernel) to handle non-linear decision boundaries.
  • Use Case: Text classification, face recognition, bioinformatics.
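A minimal sketch of the decision rule and margin, using hypothetical values for $w$ and $b$ (as if a linear SVM had already been trained):

```python
import numpy as np

# Hypothetical learned parameters of a linear SVM
w = np.array([2.0, -1.0])
b = -1.0

def classify(x):
    """Sign of w . x + b decides the class; |w . x + b| / ||w|| is the distance to the boundary."""
    score = np.dot(w, x) + b
    return 1 if score >= 0 else -1

print(classify(np.array([2.0, 1.0])))   # w.x + b = 4 - 1 - 1 = 2  -> class +1
print(classify(np.array([0.0, 2.0])))   # w.x + b = 0 - 2 - 1 = -3 -> class -1
print(2 / np.linalg.norm(w))            # margin 2/||w|| ~ 0.894
```

Training consists of choosing `w` and `b` to maximize that last quantity subject to the constraint above; kernels replace the dot product with a similarity function.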
k-Nearest Neighbors (kNN)
  • Concept: Classifies a sample based on the majority vote (classification) or average (regression) of its k closest neighbors.

  • Essential Math:

    • Euclidean Distance:

      $d(x, x') = \sqrt{ \sum_{i=1}^{n} (x_i - x'_i)^2 }$

  • Other distance metrics can be used, such as Manhattan, Cosine, or Minkowski, depending on the data.

  • Use Case: Recommender systems, image classification, anomaly detection.
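The whole algorithm fits in a few lines. A sketch with Euclidean distance and an invented toy dataset:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda pt: euclidean(pt[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Tiny invented 2-D dataset: (features, label)
train = [((1, 1), "cat"), ((1, 2), "cat"), ((2, 1), "cat"),
         ((6, 6), "dog"), ((7, 6), "dog"), ((6, 7), "dog")]

print(knn_classify(train, (2, 2)))  # cat -- its 3 nearest neighbors are all cats
print(knn_classify(train, (6, 5)))  # dog
```

Note that kNN has no training phase at all; the cost is paid at prediction time, which is why libraries use spatial indexes (KD-trees, ball trees) instead of this brute-force scan.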


Unsupervised Learning

  • What it is: The model tries to find patterns and groupings in the data without labeled outputs.
  • Use Cases:
    • Customer segmentation
    • Market basket analysis
    • Anomaly detection
  • Popular Algorithms:
    • k-Means Clustering
    • DBSCAN
    • PCA (Principal Component Analysis)
  • Python Libraries: scikit-learn, scipy, matplotlib
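As a taste of clustering, here is a minimal k-means sketch in plain numpy; the two blobs and the fixed initial centroids are invented to keep the demo deterministic (in practice you would reach for scikit-learn's KMeans, which also handles initialization):

```python
import numpy as np

def k_means(X, centroids, n_iters=10):
    """Minimal k-means: alternate assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(n_iters):
        # Distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])
    return labels, centroids

# Two invented blobs around (0, 0) and (10, 10)
X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], dtype=float)
init = np.array([[0.0, 0.0], [10.0, 10.0]])  # fixed init keeps the demo deterministic

labels, centers = k_means(X, init)
print(labels)   # [0 0 0 1 1 1]
print(centers)  # approx [[0.33, 0.33], [10.33, 10.33]]
```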

Reinforcement Learning

  • What it is: An agent learns to make decisions by interacting with an environment and getting feedback (rewards or penalties).
  • Use Cases:
    • Robotics
    • Game playing (e.g., AlphaGo)
    • Self-driving cars
  • Popular Libraries: OpenAI Gym, Stable-Baselines, TensorFlow, PyTorch
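The core of tabular Q-learning is a one-line update. A minimal sketch on an invented 5-state corridor where the agent earns a reward of 1 for reaching the rightmost state (no RL library needed; exploration here is purely random, which is fine because Q-learning is off-policy):

```python
import random

# Toy corridor: states 0..4; stepping onto state 4 (the goal) gives reward 1 and ends the episode
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                     # move left, move right
alpha, gamma = 0.5, 0.9                # learning rate and discount factor

Q = [[0.0, 0.0] for _ in range(N_STATES)]
random.seed(0)

for episode in range(2000):
    s = 0
    while s != GOAL:
        a = random.randrange(2)        # fully random exploration
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Core update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

greedy = [max((0, 1), key=lambda i: Q[s][i]) for s in range(GOAL)]
print(greedy)  # [1, 1, 1, 1] -> the learned policy is "always move right"
```

Despite acting randomly, the agent learns the optimal action-values ($Q(s, \text{right}) = \gamma^{3-s}$ here), which is exactly the off-policy property that distinguishes Q-learning from SARSA.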

Semi-Supervised Learning

  • What it is: Combines a small amount of labeled data with a large amount of unlabeled data to improve learning when labeling is expensive.
  • Use Cases:
    • Web page classification
    • Medical imaging
    • Speech recognition
    • Fraud detection
  • Popular Algorithms:
    • Self-training
    • Label propagation
    • Semi-supervised Support Vector Machines (S3VM)
    • Graph-based methods
  • Python Libraries: scikit-learn, sklearn.semi_supervised, TensorFlow, PyTorch
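Self-training can be sketched without any library: start from a few labeled seeds, pseudo-label the unlabeled points the current model is most confident about, and repeat. The data, the 1-nearest-neighbor "model", and the distance threshold below are all invented for illustration:

```python
import math

def nearest(labeled, x):
    """Return (distance, label) of the closest labeled point to x."""
    return min((math.dist(p, x), lab) for p, lab in labeled)

# Two labeled seeds plus several unlabeled points (invented 2-D data)
labeled = [((0.0, 0.0), "A"), ((10.0, 10.0), "B")]
unlabeled = [(1.0, 1.0), (9.0, 9.0), (2.0, 1.0), (8.0, 9.0), (5.0, 5.0)]

# Self-training loop: pseudo-label confident points (within distance 3 of a labeled
# point -- an arbitrary threshold), grow the labeled set, and repeat until nothing changes.
changed = True
while changed:
    changed = False
    for x in list(unlabeled):
        dist, lab = nearest(labeled, x)
        if dist <= 3.0:
            labeled.append((x, lab))
            unlabeled.remove(x)
            changed = True

print([lab for _, lab in labeled])
print(unlabeled)  # [(5.0, 5.0)] stays unlabeled: never close enough to either cluster
```

Real self-training replaces the distance threshold with the classifier's own confidence score, but the loop structure is the same.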

Types of Data

Machine learning problems and evaluation metrics depend heavily on the type of data being used.
Understanding data types helps determine:

  • Which algorithm to use
  • Which preprocessing steps to apply
  • Which evaluation metrics are appropriate

```mermaid
flowchart TD
    A[Data Types] --> B[Numerical]
    A --> C[Categorical]
    A --> D[Binary]
    A --> E[Time Series]
    A --> F[Text]
    A --> G[Image]
    A --> H[Tabular]

    B --> B1[Discrete]
    B --> B2[Continuous]

    C --> C1[Nominal]
    C --> C2[Ordinal]
```


Numerical Data

Numerical data represents quantitative values and can be measured.

Examples:

  • Age: 25
  • Salary: 5000
  • Temperature: 22.5°C
  • House price: 350000

Types of Numerical Data

Discrete
Integer values (counts)

  • Number of customers
  • Number of clicks
  • Number of purchases

Continuous
Real values (measurements)

  • Height
  • Weight
  • Time
  • Price

Used in:

  • Regression
  • Forecasting
  • Time series

```mermaid
graph TD
    A[Numerical Data] --> B[Discrete]
    A --> C[Continuous]

    B --> B1[Counts]
    B --> B2[Integer values]

    C --> C1[Measurements]
    C --> C2[Real numbers]
```

Categorical Data

Categorical data represents labels or categories.

Examples:

  • Color: Red, Blue, Green
  • Country: Brazil, USA, Canada
  • Payment method: Credit Card, Pix, Cash

Types of Categorical Data

Nominal (No order)

  • Dog, Cat, Bird
  • Red, Blue, Green
  • Male, Female

No ranking exists.

Ordinal (Has order)

  • Small, Medium, Large
  • Low, Medium, High
  • Bronze, Silver, Gold

Ranking exists, but distances are not meaningful.

Used in:

  • Classification
  • Segmentation
  • Decision trees

```mermaid
graph TD
    A[Categorical Data] --> B[Nominal]
    A --> C[Ordinal]

    B --> B1[No order]
    B --> B2[Labels]

    C --> C1[Ordered categories]
    C --> C2[Ranking exists]
```

Binary Data

Binary data is a special case of categorical data with two values.

Examples:

  • Yes / No
  • True / False
  • Spam / Not Spam
  • Fraud / Not Fraud

Used in:

  • Binary classification
  • Logistic regression
  • Precision / Recall / F1 evaluation

```mermaid
graph TD
    A[Binary Data] --> B[Yes / No]
    A --> C[True / False]
    A --> D[0 / 1]
    A --> E[Spam / Not Spam]
```

Time Series Data

Data indexed by time order.

Examples:

  • Stock prices
  • Sales per day
  • Temperature over time
  • Website traffic

Characteristics:

  • Order matters
  • Cannot shuffle randomly
  • Requires special validation

Used in:

  • Forecasting
  • Anomaly detection
  • Sequential models

```mermaid
graph TD
    A[Time Series Data] --> B[Ordered by Time]
    A --> C[Temporal Dependency]
    A --> D[Trend]
    A --> E[Seasonality]
```
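The "requires special validation" point deserves a concrete example: instead of random splits, use expanding windows so the training data always precedes the test data. A minimal sketch (the same idea as scikit-learn's TimeSeriesSplit, though its exact fold sizing differs slightly):

```python
def time_series_splits(n, n_splits=3):
    """Expanding-window splits: each fold trains on everything before its test block,
    so the model never sees the future."""
    fold = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, fold * i))
        test = list(range(fold * i, min(fold * (i + 1), n)))
        yield train, test

for train, test in time_series_splits(8, n_splits=3):
    print("train:", train, "| test:", test)
# train: [0, 1] | test: [2, 3]
# train: [0, 1, 2, 3] | test: [4, 5]
# train: [0, 1, 2, 3, 4, 5] | test: [6, 7]
```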

Text Data

Unstructured textual information.

Examples:

  • Emails
  • Reviews
  • Tweets
  • Documents

Requires preprocessing:

  • Tokenization
  • Embeddings
  • Vectorization

Used in:

  • NLP
  • Sentiment analysis
  • Classification
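Vectorization can be illustrated with a minimal bag-of-words sketch — a simplified version of what a library vectorizer (e.g. scikit-learn's CountVectorizer) does; the documents are invented:

```python
def build_vocab(docs):
    """Map each unique token to a column index."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(doc, vocab):
    """Bag-of-words count vector: entry i = how often vocab word i appears in doc."""
    vec = [0] * len(vocab)
    for token in doc.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

docs = ["good movie", "bad movie", "good good plot"]
vocab = build_vocab(docs)
print(vocab)                               # {'good': 0, 'movie': 1, 'bad': 2, 'plot': 3}
print(vectorize("good good plot", vocab))  # [2, 0, 0, 1]
```

Word order is lost in this representation, which is why NLP models that need it use sequences or embeddings instead.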

```mermaid
graph TD
    A[Text Data] --> B[Unstructured]
    A --> C[Requires Processing]

    C --> D[Tokenization]
    C --> E[Embedding]
    C --> F[Vectorization]
```

Image Data

Visual pixel-based data.

Examples:

  • Photos
  • Medical images
  • Satellite images

Used in:

  • Computer Vision
  • Object detection
  • Image classification

```mermaid
graph TD
    A[Image Data] --> B[Pixels]
    A --> C[Channels]

    C --> D[RGB]
    C --> E[Grayscale]

    A --> F[Computer Vision Tasks]
```

Tabular Data

Most common structured format (rows and columns).

Example:

| Age | Salary | Country | Bought |
|-----|--------|---------|--------|
| 25  | 3000   | Brazil  | Yes    |
| 40  | 8000   | USA     | No     |

Used in:

  • Most ML problems
  • Classification
  • Regression

```mermaid
graph TD
    A[Tabular Data] --> B[Rows]
    A --> C[Columns]
    A --> D[Features]
    A --> E[Target]
```

Data Type vs Problem Type

| Data Type   | Typical Task          |
|-------------|-----------------------|
| Numerical   | Regression            |
| Categorical | Classification        |
| Binary      | Binary Classification |
| Time Series | Forecasting           |
| Text        | NLP                   |
| Image       | Computer Vision       |
| Tabular     | General ML            |

Why This Matters for Evaluation

The data type determines the evaluation metric:

| Data Type   | Task           | Metrics           |
|-------------|----------------|-------------------|
| Categorical | Classification | Accuracy, F1      |
| Binary      | Classification | Precision, Recall |
| Numerical   | Regression     | MAE, MSE          |
| Ranking     | Retrieval      | Precision@K       |
| Time Series | Forecasting    | MAE, RMSE         |
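As a concrete example of the binary-classification metrics, here are precision, recall, and F1 computed from scratch on invented spam-detector predictions:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 from scratch for a binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0   # of everything flagged, how much was right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of all positives, how much was caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented predictions for a spam detector (1 = spam)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.75 recall=0.75 f1=0.75
```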

Machine Learning Techniques

  • Classification
    A supervised learning task where the model learns to categorize data into predefined classes or labels.
    Example: Predicting if an email is spam or not spam.

  • Regression
    A supervised learning task where the goal is to predict a continuous value.
    Example: Predicting the price of a house based on size, location, etc.

  • Clustering
    An unsupervised learning method where the algorithm groups data into clusters based on similarity—without predefined labels.
    Example: Segmenting customers into groups based on their behavior or purchases.

  • Anomaly Detection
    Identifying data points that are unusual or deviate significantly from the majority.
    Example: Detecting fraudulent credit card transactions.

  • Sequence Mining
    Analyzing and identifying patterns in ordered data (sequences), especially over time.
    Example: Finding common sequences in customer purchases or website navigation.

  • Dimension Reduction
    Reducing the number of features (dimensions) in a dataset while keeping important information—used to simplify models and visualize high-dimensional data.
    Example: Using PCA (Principal Component Analysis) to reduce image data with thousands of pixels into just a few features.

  • Recommendation System
    A system that suggests items (movies, products, etc.) to users based on their preferences or behaviors.
    Example: Netflix recommending movies or shows based on your watch history.


Machine Learning Model Lifecycle

  • Problem Definition
    Clearly define the objective of the machine learning task.
    Example: Predict customer churn or classify product reviews as positive or negative.

  • Data Collection
    Gather relevant and sufficient raw data from various sources like databases, APIs, sensors, or manual input.
    Example: Collecting user behavior logs or survey results.

  • Data Preparation
    Clean, transform, and structure the data for training. This includes handling missing values, encoding categories, and normalizing values.
    Example: Converting text into numeric form or removing outliers.

  • Model Development and Evaluation
    Choose a model type, train it using prepared data, and evaluate its accuracy, precision, recall, or other relevant metrics.
    Example: Training a decision tree and evaluating it using cross-validation.

  • Model Deployment
    Integrate the trained model into a production environment where it can receive real input and make predictions.
    Example: Deploying a fraud detection model via an API to monitor real-time transactions.


Planned Repository Structure

```
machine-learning/
│
├── README.md                             # High-level introduction to Machine Learning
│
├── evaluations/
│   ├── README.md
│   │
│   ├── 01.metrics/
│   │   ├── 01.classification/
│   │   │   ├── confusion_matrix.md
│   │   │   ├── accuracy.md
│   │   │   ├── precision.md
│   │   │   ├── recall.md
│   │   │   ├── f1_score.md
│   │   │   ├── roc_auc.md
│   │   │   └── log_loss.md
│   │   │
│   │   ├── 02.regression/
│   │   │   ├── mae.md
│   │   │   ├── mse.md
│   │   │   ├── rmse.md
│   │   │   ├── r2_score.md
│   │   │   └── mape.md
│   │   │
│   │   └── 03.ranking/
│   │       ├── precision_at_k.md
│   │       ├── recall_at_k.md
│   │       └── ndcg.md
│   │
│   └── 02.validation/
│       ├── train_test_split.md
│       ├── cross_validation.md
│       ├── k_fold.md
│       ├── stratified_k_fold.md
│       ├── leave_one_out.md
│       ├── time_series_split.md
│       └── bootstrap.md
│
├── supervised/
│   ├── README.md                       # Core concepts: labeled data, overfitting, etc.
│   ├── 01.linear_regression.md
│   ├── 02.logistic_regression.md
│   ├── 03.k_nearest_neighbors.md
│   ├── 04.naive_bayes.md
│   ├── 05.svm.md
│   ├── 06.decision_trees.md
│   ├── 07.random_forest.md
│   ├── 08.gradient_boosting.md
│   ├── 09.neural_networks.md
│   ├── algorithms/
│   └── notebooks/
│
├── unsupervised/
│   ├── README.md                       # Key ideas: clustering, dimensionality reduction, etc.
│   ├── 01.k_means.md
│   ├── 02.dbscan.md
│   ├── 03.hierarchical_clustering.md
│   ├── 04.pca.md
│   ├── 05.tsne.md
│   ├── algorithms/
│   └── notebooks/
│
├── reinforcement_learning/
│   ├── README.md                       # Basics of agents, environments, rewards, etc.
│   ├── 01.q_learning.md
│   ├── 02.sarsa.md
│   ├── 03.deep_q_network.md
│   ├── 04.policy_gradient.md
│   ├── algorithms/
│   └── notebooks/
│
├── semi_supervised_learning/
│   ├── README.md                       # Hybrid between supervised and unsupervised
│   ├── 01.self_training.md
│   ├── 02.label_propagation.md
│   ├── algorithms/
│   └── notebooks/
│
└── shared_resources/
    ├── datasets/                         # Sample datasets used across topics
    ├── utils/                            # Reusable utility functions
    └── references.md                     # Useful academic references and links
```
