Machine Learning Classification Models Repository

A comprehensive implementation and comparison of five fundamental classification algorithms in machine learning, designed for educational purposes and practical application.

Overview

This repository provides end-to-end implementations of classical and modern classification algorithms, demonstrating their application on real-world datasets. Each model is implemented with clear, well-documented code that emphasizes both theoretical understanding and practical deployment considerations.

Classification is a supervised learning task where models learn to predict categorical outcomes from labeled training data. This repository showcases how different algorithms approach the same problem using distinct mathematical frameworks and optimization strategies, allowing practitioners to understand their relative strengths, weaknesses, and appropriate use cases.

Repository Structure

├── data/
│   ├── raw/                    # Original datasets
│   ├── processed/              # Cleaned and preprocessed data
│   └── README.md               # Data documentation
├── models/
│   ├── logistic_regression/    # Logistic regression implementation
│   ├── decision_tree/          # Decision tree classifier
│   ├── random_forest/          # Random forest ensemble
│   ├── xgboost/                # XGBoost gradient boosting
│   └── neural_network/         # Neural network classifier
├── notebooks/
│   ├── exploratory_analysis.ipynb
│   ├── model_comparison.ipynb
│   └── hyperparameter_tuning.ipynb
├── src/
│   ├── preprocessing.py        # Data preprocessing utilities
│   ├── evaluation.py           # Model evaluation metrics
│   ├── visualization.py        # Plotting functions
│   └── utils.py                # Helper functions
├── results/
│   ├── metrics/                # Performance metrics
│   ├── visualizations/         # Plots and charts
│   └── model_comparison.csv    # Consolidated results
├── requirements.txt
└── README.md

Implemented Models

This repository implements five classification algorithms, each representing different approaches to learning decision boundaries and making predictions.

| Model | Type | Core Principle | Key Strengths | Common Use Cases | Interpretability |
|---|---|---|---|---|---|
| Logistic Regression | Linear Model | Uses the sigmoid function to model class probabilities through weighted feature combinations | Fast training, probabilistic outputs, works well with linearly separable data, low computational cost | Binary classification, baseline modeling, risk assessment, credit scoring | High - coefficient weights directly show feature importance |
| Decision Tree | Tree-Based | Recursively splits data based on feature values to create hierarchical decision rules | Handles non-linear relationships, requires minimal preprocessing, intuitive visualization | Rule-based systems, exploratory analysis, feature interaction discovery | Very High - can be visualized and explained as if-then rules |
| Random Forest | Ensemble (Bagging) | Combines multiple decision trees trained on bootstrapped samples with random feature subsets | Reduces overfitting, robust to outliers, handles missing values, provides feature importance | General-purpose classification, high-dimensional data, imbalanced datasets | Moderate - feature importance is available, but individual predictions are harder to trace |
| XGBoost | Ensemble (Boosting) | Sequentially builds trees that correct the errors of previous trees using gradient optimization | State-of-the-art performance, handles sparse data, built-in regularization, parallel processing | Kaggle competitions, structured data problems, risk modeling, ranking tasks | Moderate - SHAP values help explain predictions, but tree interactions are complex |
| Neural Network | Deep Learning | Learns hierarchical feature representations through multiple layers of weighted transformations | Captures complex patterns, scales to large datasets, flexible architecture, end-to-end learning | Image/text classification, complex non-linear problems, large-scale applications | Low - "black box" nature; requires interpretation techniques such as LIME or attention |

Model Implementations

Logistic Regression

Logistic regression models the probability that an instance belongs to a particular class using the logistic (sigmoid) function. Despite its name, it's a classification algorithm that transforms a linear combination of features into probabilities between 0 and 1.

Mathematical Foundation: The model learns weights (coefficients) for each feature that maximize the likelihood of correctly predicting the training labels. For binary classification, it uses the sigmoid function σ(z) = 1/(1 + e^(-z)) to map linear predictions to probabilities.

Implementation Highlights:

  • Feature scaling using standardization for optimal convergence
  • Regularization (L1/L2) to prevent overfitting
  • Probability calibration for reliable confidence scores
  • Support for both binary and multiclass classification (one-vs-rest)

When to Use: Choose logistic regression when you need interpretable results, have linearly separable classes, require fast inference, or need probabilistic predictions for downstream decision-making.
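
A minimal sketch of this setup with scikit-learn is shown below. The CSV path and the "target" column name are illustrative assumptions, not the repository's actual files; the repository's own code lives under models/logistic_regression/.

# Hedged sketch: standardized features + L2-regularized logistic regression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("data/processed/dataset.csv")             # illustrative path
X, y = df.drop(columns="target"), df["target"]              # "target" column assumed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# Scaling inside the pipeline keeps preprocessing fitted on training data only.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:5])                        # class probabilities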

Decision Tree

Decision trees learn a hierarchy of if-then rules by recursively partitioning the feature space. At each node, the algorithm selects the feature and threshold that best separates the classes according to a purity criterion (Gini impurity or entropy).

Implementation Highlights:

  • Gini impurity and information gain splitting criteria
  • Pruning strategies to prevent overfitting (max depth, min samples split)
  • Handling of both numerical and categorical features
  • Visualization tools for tree structure interpretation

When to Use: Decision trees excel when you need fully interpretable models, have mixed data types, require quick prototyping, or need to extract explicit business rules from data.
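
For illustration, a short scikit-learn sketch of a pre-pruned tree and its extracted rules (parameter values are illustrative, and X_train/y_train are assumed to come from the split in the logistic regression sketch above):

from sklearn.tree import DecisionTreeClassifier, export_text

# max_depth and min_samples_split act as pre-pruning to limit overfitting.
tree = DecisionTreeClassifier(
    criterion="gini",            # or "entropy" for information gain
    max_depth=4,
    min_samples_split=20,
    random_state=42,
)
tree.fit(X_train, y_train)

# Print the learned hierarchy as plain if-then rules.
print(export_text(tree, feature_names=list(X_train.columns)))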

Random Forest

Random Forest is an ensemble method that constructs multiple decision trees during training and outputs the mode of their predictions. It introduces randomness through bootstrap sampling (bagging) and random feature selection at each split.

Implementation Highlights:

  • Configurable number of trees (n_estimators) for bias-variance tradeoff
  • Out-of-bag error estimation for internal validation
  • Feature importance ranking through mean decrease in impurity
  • Parallel tree construction for computational efficiency

When to Use: Random forests are ideal for general-purpose classification, especially with high-dimensional data, when you need robust performance without extensive tuning, or when interpretability is secondary to accuracy.
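
A possible scikit-learn sketch of these ideas (hyperparameter values are illustrative; X_train/y_train are assumed to be a pandas DataFrame and Series from an earlier split):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# oob_score=True reuses the rows left out of each bootstrap sample as internal validation.
forest = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",         # random feature subset at each split
    oob_score=True,
    n_jobs=-1,                   # build trees in parallel
    random_state=42,
)
forest.fit(X_train, y_train)

print("OOB accuracy:", forest.oob_score_)
importances = pd.Series(forest.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))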

XGBoost

XGBoost (eXtreme Gradient Boosting) builds an ensemble of trees sequentially, where each new tree attempts to correct the residual errors of the previous ensemble. It uses gradient descent optimization and includes sophisticated regularization.

Implementation Highlights:

  • Second-order gradient information for better convergence
  • L1/L2 regularization on leaf weights and tree structure
  • Handling of missing values through learned directions
  • Learning rate scheduling and early stopping
  • GPU acceleration support for large datasets

When to Use: XGBoost shines in competitive scenarios (Kaggle), structured/tabular data problems, when maximum predictive performance is required, or when dealing with complex non-linear relationships.
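
A hedged sketch using the xgboost scikit-learn wrapper (X_valid/y_valid stand for a held-out validation split not defined in the snippets above; passing early_stopping_rounds to the constructor assumes xgboost 1.6 or newer):

from xgboost import XGBClassifier

# Regularized boosted trees with early stopping on a validation split.
model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    reg_lambda=1.0,              # L2 penalty on leaf weights
    eval_metric="logloss",
    early_stopping_rounds=50,    # assumes xgboost >= 1.6
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("Best iteration:", model.best_iteration)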

Neural Network

Neural networks learn hierarchical representations through multiple layers of neurons with non-linear activation functions. Each layer transforms the input into increasingly abstract representations suitable for classification.

Implementation Highlights:

  • Multi-layer perceptron (MLP) architecture with configurable depth and width
  • ReLU activation functions and dropout regularization
  • Batch normalization for training stability
  • Adam optimizer with learning rate scheduling
  • Cross-entropy loss for classification objectives

When to Use: Neural networks are appropriate when you have large datasets, complex non-linear patterns, require transfer learning capabilities, or are working with unstructured data (after appropriate preprocessing).
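
One possible Keras version of such an MLP (layer sizes, dropout rate, and training settings are illustrative assumptions; the repository may use a different framework or architecture):

from tensorflow import keras

# Convert to NumPy arrays; integer-encoded class labels are assumed.
X_tr, y_tr = X_train.to_numpy(), y_train.to_numpy()
n_classes = len(set(y_tr))

model = keras.Sequential([
    keras.layers.Input(shape=(X_tr.shape[1],)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(
    X_tr, y_tr,
    validation_split=0.1,
    epochs=50,
    batch_size=64,
    callbacks=[keras.callbacks.ReduceLROnPlateau(patience=3)],  # simple LR scheduling
)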

Evaluation Methodology

All models are evaluated using a consistent framework to ensure fair comparison:

Metrics Implemented:

  • Accuracy: Overall correctness, suitable for balanced datasets
  • Precision: Proportion of positive predictions that are correct (low false positives)
  • Recall: Proportion of actual positives correctly identified (low false negatives)
  • F1-Score: Harmonic mean of precision and recall, balances both metrics
  • AUC-ROC: Area under ROC curve, measures discrimination ability across thresholds
  • Confusion Matrix: Detailed breakdown of prediction types
  • Classification Report: Comprehensive per-class metrics

Cross-Validation: All models undergo 5-fold stratified cross-validation to ensure robust performance estimates and detect overfitting. Stratification maintains class proportions in each fold, critical for imbalanced datasets.

Hyperparameter Tuning: Grid search and random search strategies are employed to identify optimal hyperparameters for each algorithm, documented in the respective model directories.
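
The sketch below illustrates this shared framework with scikit-learn: hold-out metrics on the test split plus 5-fold stratified cross-validation (model stands in for any of the five classifiers; the binary case is assumed for precision, recall, F1, and AUC-ROC):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix,
                             classification_report)

# Hold-out metrics on the test split.
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]     # probability of the positive class
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# 5-fold stratified CV keeps class proportions constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
print("CV F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))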

Dataset Requirements

This repository is designed to work with tabular classification datasets. Your data should include:

  • Feature columns: Numerical or categorical predictors
  • Target column: Categorical outcome variable (binary or multiclass)
  • Sufficient samples: At least 1000 instances recommended for neural networks
  • Clean data: Missing values handled, outliers addressed

Example datasets used for demonstration:

  • Binary classification: Credit default prediction, disease diagnosis
  • Multiclass classification: Iris species, handwritten digit recognition (tabular features)

Getting Started

Prerequisites

Python 3.8+
pip install -r requirements.txt

Installation

# Clone the repository
git clone https://github.com/yourusername/classification-models.git
cd classification-models

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Quick Start

# Example: Training and evaluating all models
from src.preprocessing import load_and_preprocess_data
from models import train_all_models
from src.evaluation import compare_models

# Load data
X_train, X_test, y_train, y_test = load_and_preprocess_data('data/raw/dataset.csv')

# Train all models
trained_models = train_all_models(X_train, y_train)

# Evaluate and compare
results = compare_models(trained_models, X_test, y_test)
print(results)

Key Features

Comprehensive Implementation: Each model includes data preprocessing, training, evaluation, and interpretation components with best practices.

Fair Comparison Framework: Standardized evaluation metrics and cross-validation procedures ensure meaningful model comparisons.

Educational Focus: Well-commented code with explanations of algorithmic decisions, hyperparameter choices, and performance trade-offs.

Production-Ready Patterns: Includes model serialization, logging, error handling, and validation checks suitable for deployment.

Visualization Tools: Automated generation of performance plots, decision boundaries, feature importance charts, and confusion matrices.

Results and Analysis

After running all models, the repository generates:

  1. Comparative Performance Table: Side-by-side metrics for all algorithms
  2. ROC Curves: Visual comparison of model discrimination ability
  3. Feature Importance Rankings: Understanding which predictors drive decisions
  4. Training Time Analysis: Computational efficiency comparison
  5. Prediction Confidence Distributions: Reliability of model outputs

These outputs are saved in the results/ directory for further analysis and reporting.

Best Practices Demonstrated

  • Data Leakage Prevention: Strict train-test separation, with preprocessing fitted only on training data (see the sketch after this list)
  • Reproducibility: Random seeds set for all stochastic components
  • Scalability Considerations: Efficient data structures and vectorized operations
  • Model Versioning: Serialized models with metadata for tracking experiments
  • Documentation: Comprehensive docstrings and inline comments
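
A minimal sketch of the leakage-prevention and reproducibility points above, plus simple model serialization with joblib (the output path is illustrative):

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Putting the scaler inside the pipeline guarantees it is fitted only on the
# training data (or training folds during CV), preventing leakage; the fixed
# random_state keeps the run reproducible.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
pipe.fit(X_train, y_train)

# Serialize the fitted pipeline for experiment tracking and later reuse.
joblib.dump(pipe, "results/logistic_regression_pipeline.joblib")   # illustrative path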

Contributing

Contributions are welcome! Areas for enhancement include:

  • Additional algorithms (SVM, Naive Bayes, k-NN)
  • Advanced ensemble techniques (stacking, blending)
  • Automated machine learning (AutoML) integration
  • Additional datasets and domain applications
  • Enhanced visualization capabilities
  • Deployment examples (Flask API, Docker containers)

Please open an issue to discuss proposed changes before submitting pull requests.

Dependencies

Core libraries utilized:

  • scikit-learn: Classical machine learning algorithms and utilities
  • xgboost: Gradient boosting implementation
  • tensorflow/keras or pytorch: Neural network frameworks
  • pandas: Data manipulation and analysis
  • numpy: Numerical computing
  • matplotlib/seaborn: Visualization
  • joblib: Model serialization

See requirements.txt for complete dependency list with versions.

License

This project is licensed under the MIT License - see LICENSE file for details.

Acknowledgments

This repository is designed for educational purposes, drawing on established machine learning theory and best practices from the research community. It aims to bridge the gap between theoretical understanding and practical implementation of classification algorithms.

Contact

For questions, suggestions, or collaboration opportunities, please open an issue or reach out through the repository's discussion forum.


Note: This repository focuses on tabular data classification. For computer vision or natural language processing tasks, specialized architectures (CNNs, RNNs, Transformers) would be more appropriate and may be covered in separate repositories.
