
# sms-binary-classification

A production-ready binary classifier that detects spam SMS messages using engineered features and XGBoost.

**Key result:** The model catches 76% of all spam SMS messages, an 18-percentage-point improvement over the current system (58% → 76%), significantly reducing phishing risk. At the same time, 89.8% of messages flagged as spam are truly spam, so only about 1 in 10 messages in the spam folder is legitimate, well below the 15% false-alarm tolerance.
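
The headline figures follow from standard confusion-matrix arithmetic. The counts below are hypothetical, chosen only to reproduce the reported recall and precision; the actual test-set counts live in the evaluation report.

```python
# Confusion-matrix arithmetic behind the headline numbers.
# tp/fn/fp are HYPOTHETICAL counts chosen to match the reported metrics,
# not the project's real test-set counts.
tp = 114  # spam correctly flagged
fn = 36   # spam missed
fp = 13   # ham wrongly flagged as spam

recall = tp / (tp + fn)     # fraction of all spam that is caught
precision = tp / (tp + fp)  # fraction of flagged messages that are truly spam

print(round(recall, 4), round(precision, 4))  # 0.76 0.8976
```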

## Quick Start

### 1. Activate the environment and install dependencies

This project uses the uv package manager.

```shell
# create the environment
uv init

# activate the environment
source .venv/bin/activate

# install dependencies
uv sync
```

### 2. Dataset

The dataset used for this model is the SMS Spam Collection Dataset.

First, run `notebooks/01_eda.ipynb` to split the dataset into a train set and a test set; this prevents data leakage. You will find the resulting splits in `data/processed/`.

Each split must contain:

  • `message`: raw SMS text
  • `label`: `"spam"` or `"ham"`
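
The leakage-safe split described above can be sketched as follows. The toy DataFrame and the 25% test fraction are illustrative assumptions, not the notebook's actual data or parameters.

```python
# Sketch of a leakage-safe, stratified train/test split like the one
# notebooks/01_eda.ipynb is described as producing. The toy data and the
# 25% test fraction are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "message": [
        "WIN a FREE prize now!!!", "See you at lunch", "URGENT: claim cash",
        "ok thanks", "FREE entry, reply WIN", "running late, sorry",
        "You have won $1000", "call me when free",
    ],
    "label": ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"],
})

# Stratifying on the label keeps the spam/ham ratio identical in both
# splits; a fixed random_state makes the split reproducible.
train_set, test_set = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=42
)
```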

### 3. Run the full pipeline

```shell
# preprocess data → train models → evaluate
python src/dataset.py
python src/modeling/train.py
```

or use the Makefile:

```shell
make data
make train
```

### 4. Make a prediction

```shell
# use the default spam message
python src/modeling/predict.py
# OR
make predict

# classify a custom message
python src/modeling/predict.py -m "YOUR MESSAGE"
# OR
make predict TEXT="YOUR MESSAGE"
```
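
Under the hood, `predict.py` presumably reloads a model that `train.py` saved as a `.pkl` file. A minimal sketch of that pickle round-trip, using a stand-in classifier and made-up feature rows rather than the repository's real ones:

```python
# Pickle round-trip: save a trained model, reload it, classify a new row.
# The LogisticRegression stand-in and the toy feature rows (length, caps
# ratio, punctuation count) are assumptions; the repo ships XGBoost .pkl files.
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

X = [[36, 0.54, 4], [12, 0.00, 0], [40, 0.60, 5], [15, 0.05, 1]]
y = [1, 0, 1, 0]  # 1 = spam, 0 = ham
model = LogisticRegression().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)    # roughly what train.py does with models/*.pkl
with open(path, "rb") as f:
    loaded = pickle.load(f)  # roughly what predict.py does at startup

label = "spam" if loaded.predict([[38, 0.50, 4]])[0] == 1 else "ham"
```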

## Illustration

*(image: sms-spam)*

To classify your own message, edit `main()` in `predict.py`; you can either use CLI input or hardcode the message.

## Results

| Model                   | Precision | Recall | F1-Score | Accuracy |
|-------------------------|-----------|--------|----------|----------|
| MVP Baseline (target)   | ≥ 0.85    | ≥ 0.75 | ≥ 0.80   | –        |
| Dumb Baseline (All Ham) | 0.0000    | 0.0000 | 0.0000   | 0.8655   |
| Logistic Regression     | 0.7909    | 0.5800 | 0.6692   | 0.9229   |
| Random Forest           | 0.9237    | 0.7267 | 0.8134   | 0.9552   |
| XGBoost                 | 0.8976    | 0.7600 | 0.8231   | 0.9561   |

The selected model meets both the security goal (high recall) and the user-experience goal (low false positives).

Model: XGBoost (best F1-score; outperformed Logistic Regression and Random Forest)

Features: Text length, capitalization ratio, punctuation counts, keyword flags (free, win, urgent, etc.)
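
The features listed above can be sketched as below; the exact function names, punctuation set, and keyword list are assumptions — see `src/dataset.py` for the real implementation.

```python
# Sketch of the engineered features named above (text length, capitalization
# ratio, punctuation counts, keyword flags). Names and keyword list are
# assumptions, not the repository's actual code.
import re

KEYWORDS = ("free", "win", "urgent")  # subset mentioned in this README

def engineer_features(message: str) -> dict:
    letters = [c for c in message if c.isalpha()]
    return {
        "text_length": len(message),
        # share of letters that are uppercase (guard against empty input)
        "caps_ratio": sum(c.isupper() for c in letters) / max(len(letters), 1),
        "punct_count": len(re.findall(r"[!?.,]", message)),
        # one binary flag per spam keyword
        **{f"kw_{k}": int(k in message.lower()) for k in KEYWORDS},
    }

feats = engineer_features("URGENT! You have WON a FREE prize!!!")
```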

Note: This model is trained on English-language SMS. Performance may degrade on non-English or highly obfuscated spam.

Full evaluation in `notebooks/evaluation_model.ipynb`.

Evaluation metrics are auto-generated during training and saved in `reports/`.

## Output Locations

| Output            | Path                                                                     |
|-------------------|--------------------------------------------------------------------------|
| Processed data    | `data/processed/train.parquet`, `test.parquet`                           |
| Trained models    | `models/xg_boosting.pkl`, `random_forest.pkl`, `Logistic_regression.pkl` |
| Evaluation report | `reports/model_evaluation_summary.md`                                    |
| EDA & analysis    | `notebooks/`                                                             |

## Project Organization

(Standardized via Cookiecutter Data Science)

```text
├── data/processed/        ← Input: train_set.csv, test_set.csv
├── models/                ← Output: .pkl model files
├── reports/               ← Model summary & evaluation
├── notebooks/             ← EDA and experimentation
└── src/
    ├── dataset.py         ← Preprocessing & feature engineering
    └── modeling/
        ├── train.py       ← Trains 3 models, saves all
        └── predict.py     ← Classifies new SMS
```


Model Card – Full transparency on model behavior, limitations, and ethics

