A production-ready binary classifier that detects spam SMS messages using engineered features and XGBoost.

Key result: the model catches 76% of all spam SMS messages, an 18-percentage-point improvement over the current system (58% → 76%), which significantly reduces phishing risk. At the same time, 89.8% of all SMS messages flagged as spam are truly spam, meaning only about 1 in 10 messages in the spam folder is actually legitimate, well below the 15% false-alarm tolerance.
This project uses the uv package manager:
```bash
# create environment
uv init

# activate environment
source .venv/bin/activate

# install dependencies
uv sync
```

To get started, grab the dataset this model is trained on: the SMS Spam Collection Dataset.
First, run notebooks/01_eda.ipynb to split the dataset into a train set and a test set; splitting once, up front, prevents data leakage. The resulting splits are saved in data/processed/.
Each split must contain the following columns:
- message: raw SMS text
- label: "spam" or "ham"
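For reference, here is a minimal sketch of the kind of stratified split the notebook performs (the raw file path, column names, and test fraction are assumptions; the notebook is the source of truth):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw SMS Spam Collection export (path/columns are assumptions;
# the common Kaggle export uses latin-1 encoding and columns v1/v2).
df = pd.read_csv("data/raw/spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "message"]  # "spam" / "ham" labels plus raw text

# Stratify on the label so both splits keep the same spam/ham ratio,
# and split once, before any modeling, so no test message leaks into training.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

train_df.to_csv("data/processed/train_set.csv", index=False)
test_df.to_csv("data/processed/test_set.csv", index=False)
```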
```bash
# preprocess data → train models → evaluate
python src/dataset.py
python src/modeling/train.py
```

or use the Makefile:

```bash
make data
make train
```

To classify a message:

```bash
# use the default spam message
python src/modeling/predict.py
# OR
make predict

# add a custom message
python src/modeling/predict.py -m "YOUR MESSAGE"
# OR
make predict TEXT="YOUR MESSAGE"
```

To classify your own message, either pass it on the CLI with -m as shown above or edit main() in predict.py to hardcode the message (a sketch of such a main() follows).
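For orientation, a hedged sketch of what a CLI-driven main() in predict.py could look like; the model path, the extract_features helper, and the default message are assumptions, not the repo's actual code:

```python
import argparse
import pickle

from src.dataset import extract_features  # assumed helper; adjust to the real module

DEFAULT_MESSAGE = "WINNER!! Claim your free prize now!"  # hypothetical default

def main():
    parser = argparse.ArgumentParser(description="Classify an SMS as spam or ham.")
    parser.add_argument("-m", "--message", default=DEFAULT_MESSAGE,
                        help="SMS text to classify (defaults to a hardcoded message)")
    args = parser.parse_args()

    # Load the trained XGBoost model saved by train.py.
    with open("models/xg_boosting.pkl", "rb") as f:
        model = pickle.load(f)

    # Featurize exactly as in training, then predict (1 = spam, 0 = ham).
    features = [extract_features(args.message)]
    label = "spam" if model.predict(features)[0] == 1 else "ham"
    print(f"{label}: {args.message}")

if __name__ == "__main__":
    main()
```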
| Model | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| MVP Baseline | >=0.85 | >=0.75 | >=0.80 | - |
| Dumb Baseline (All Ham) | 0.0000 | 0.0000 | 0.0000 | 0.8655 |
| Logistic Regression | 0.7909 | 0.5800 | 0.6692 | 0.9229 |
| Random Forest | 0.9237 | 0.7267 | 0.8134 | 0.9552 |
| XGBoost | 0.8976 | 0.7600 | 0.8231 | 0.9561 |
XGBoost meets both the security goal (high recall) and the user-experience goal (low false positives).
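For context, these metrics come straight from scikit-learn; a minimal sketch on toy labels (1 = spam, 0 = ham), illustrating why the all-ham baseline scores 0 on precision/recall yet high on accuracy:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels, not the real test set: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # flagged spam that is truly spam
print("recall:   ", recall_score(y_true, y_pred))     # true spam that was caught
print("f1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))

# Predicting all ham gives 0 precision/recall, but accuracy equals the ham
# share of the data, which is how the dumb baseline reaches 0.8655.
```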
Model: XGBoost (outperformed Logistic Regression and Random Forest)
Features: Text length, capitalization ratio, punctuation counts, keyword flags (free, win, urgent, etc.); a sketch of this extraction follows the note below.
Note: This model is trained on English-language SMS. Performance may degrade on non-English or highly obfuscated spam.
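A hedged sketch of the feature extraction described above (the keyword list and exact feature set are assumptions; the real implementation lives in src/dataset.py):

```python
import string

SPAM_KEYWORDS = {"free", "win", "winner", "urgent", "prize", "claim", "cash"}  # assumed list

def extract_features(message: str) -> list[float]:
    """Turn one SMS into the numeric features the models consume."""
    n_chars = len(message)
    n_caps = sum(ch.isupper() for ch in message)
    n_punct = sum(ch in string.punctuation for ch in message)
    words = (w.strip(string.punctuation) for w in message.lower().split())
    keyword_hits = sum(w in SPAM_KEYWORDS for w in words)
    return [
        float(n_chars),                        # text length
        n_caps / n_chars if n_chars else 0.0,  # capitalization ratio
        float(n_punct),                        # punctuation count
        float(keyword_hits),                   # keyword flags
    ]

print(extract_features("WIN a FREE prize!! Claim now, urgent!"))
```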
Full evaluation in notebooks/evaluation_model.ipynb. Evaluation metrics are auto-generated during training and saved in reports/.
| Output | Path |
|---|---|
| Processed data | data/processed/train.parquet, test.parquet |
| Trained models | models/xg_boosting.pkl, random_forest.pkl, Logistic_regression.pkl |
| Evaluation report | reports/model_evaluation_summary.md |
| EDA & analysis | notebooks/ |
(Standardized via Cookiecutter Data Science)
```
├── data/processed/    ← Input: train_set.csv, test_set.csv
├── models/            ← Output: .pkl model files
├── reports/           ← Model summary & evaluation
├── notebooks/         ← EDA and experimentation
└── src/
    ├── dataset.py     ← Preprocessing & feature engineering
    └── modeling/
        ├── train.py   ← Trains 3 models, saves all
        └── predict.py ← Classifies new SMS
```
Model Card – Full transparency on model behavior, limitations, and ethics
