A production-ready binary classifier that detects spam SMS messages using engineered features and XGBoost.

Key result: the model catches 76% of all spam SMS messages, an 18-percentage-point improvement over the current system (58% → 76%), which significantly reduces phishing risk. At the same time, 89.8% of all SMS messages flagged as spam are truly spam, meaning only about 1 in 10 messages in the spam folder is actually legitimate, well below the 15% false-alarm tolerance.
This project uses the uv package manager:
```bash
# create environment
uv init

# activate environment
source .venv/bin/activate

# install dependencies
uv sync
```

To get started, grab the dataset this model is trained on: the SMS Spam Collection Dataset.
First, run notebooks/01_eda.ipynb to split the dataset into a train set and a test set; splitting once, up front, prevents data leakage. The resulting splits are saved in data/processed/.
Each split must contain the following columns:
- message: raw SMS text
- label: "spam" or "ham"
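For reference, here is a minimal sketch of the kind of stratified split the notebook performs (the raw file path, column names, and test fraction are assumptions; the notebook is the source of truth):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw SMS Spam Collection export (path/columns are assumptions;
# the common Kaggle export uses latin-1 encoding and columns v1/v2).
df = pd.read_csv("data/raw/spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "message"]  # "spam" / "ham" labels plus raw text

# Stratify on the label so both splits keep the same spam/ham ratio,
# and split once, before any modeling, so no test message leaks into training.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

train_df.to_csv("data/processed/train_set.csv", index=False)
test_df.to_csv("data/processed/test_set.csv", index=False)
```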
```bash
# preprocess data → train models → evaluate
python src/dataset.py
python src/modeling/train.py
```

or use the Makefile:

```bash
make data
make train
```

To classify a message:

```bash
# use the default spam message
python src/modeling/predict.py
# OR
make predict

# add a custom message
python src/modeling/predict.py -m "YOUR MESSAGE"
# OR
make predict TEXT="YOUR MESSAGE"
```

To classify your own message, either pass it on the CLI with -m as shown above or edit main() in predict.py to hardcode the message (a sketch of such a main() follows).
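For orientation, a hedged sketch of what a CLI-driven main() in predict.py could look like; the model path, the extract_features helper, and the default message are assumptions, not the repo's actual code:

```python
import argparse
import pickle

from src.dataset import extract_features  # assumed helper; adjust to the real module

DEFAULT_MESSAGE = "WINNER!! Claim your free prize now!"  # hypothetical default

def main():
    parser = argparse.ArgumentParser(description="Classify an SMS as spam or ham.")
    parser.add_argument("-m", "--message", default=DEFAULT_MESSAGE,
                        help="SMS text to classify (defaults to a hardcoded message)")
    args = parser.parse_args()

    # Load the trained XGBoost model saved by train.py.
    with open("models/xg_boosting.pkl", "rb") as f:
        model = pickle.load(f)

    # Featurize exactly as in training, then predict (1 = spam, 0 = ham).
    features = [extract_features(args.message)]
    label = "spam" if model.predict(features)[0] == 1 else "ham"
    print(f"{label}: {args.message}")

if __name__ == "__main__":
    main()
```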
| Model | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| MVP Baseline | >=0.85 | >=0.75 | >=0.80 | - |
| Dumb Baseline (All Ham) | 0.0000 | 0.0000 | 0.0000 | 0.8655 |
| Logistic Regression | 0.7909 | 0.5800 | 0.6692 | 0.9229 |
| Random Forest | 0.9237 | 0.7267 | 0.8134 | 0.9552 |
| XGBoost | 0.8976 | 0.7600 | 0.8231 | 0.9561 |
XGBoost meets both the security goal (high recall) and the user-experience goal (low false positives).
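For context, these metrics come straight from scikit-learn; a minimal sketch on toy labels (1 = spam, 0 = ham), illustrating why the all-ham baseline scores 0 on precision/recall yet high on accuracy:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels, not the real test set: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # flagged spam that is truly spam
print("recall:   ", recall_score(y_true, y_pred))     # true spam that was caught
print("f1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))

# Predicting all ham gives 0 precision/recall, but accuracy equals the ham
# share of the data, which is how the dumb baseline reaches 0.8655.
```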
Model: XGBoost (outperformed Logistic Regression and Random Forest)
Features: Text length, capitalization ratio, punctuation counts, keyword flags (free, win, urgent, etc.); a sketch of this extraction follows the note below.
Note: This model is trained on English-language SMS. Performance may degrade on non-English or highly obfuscated spam.
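A hedged sketch of the feature extraction described above (the keyword list and exact feature set are assumptions; the real implementation lives in src/dataset.py):

```python
import string

SPAM_KEYWORDS = {"free", "win", "winner", "urgent", "prize", "claim", "cash"}  # assumed list

def extract_features(message: str) -> list[float]:
    """Turn one SMS into the numeric features the models consume."""
    n_chars = len(message)
    n_caps = sum(ch.isupper() for ch in message)
    n_punct = sum(ch in string.punctuation for ch in message)
    words = (w.strip(string.punctuation) for w in message.lower().split())
    keyword_hits = sum(w in SPAM_KEYWORDS for w in words)
    return [
        float(n_chars),                        # text length
        n_caps / n_chars if n_chars else 0.0,  # capitalization ratio
        float(n_punct),                        # punctuation count
        float(keyword_hits),                   # keyword flags
    ]

print(extract_features("WIN a FREE prize!! Claim now, urgent!"))
```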
Full evaluation in notebooks/evaluation_model.ipynb. Evaluation metrics are auto-generated during training and saved in reports/.
| Output | Path |
|---|---|
| Processed data | data/processed/train.parquet, test.parquet |
| Trained models | models/xg_boosting.pkl, random_forest.pkl, Logistic_regression.pkl |
| Evaluation report | reports/model_evaluation_summary.md |
| EDA & analysis | notebooks/ |
(Standardized via Cookiecutter Data Science)
```
├── data/processed/    ← Input: train_set.csv, test_set.csv
├── models/            ← Output: .pkl model files
├── reports/           ← Model summary & evaluation
├── notebooks/         ← EDA and experimentation
└── src/
    ├── dataset.py     ← Preprocessing & feature engineering
    └── modeling/
        ├── train.py   ← Trains 3 models, saves all
        └── predict.py ← Classifies new SMS
```
Model Card – Full transparency on model behavior, limitations, and ethics
