A simple Naive Bayes text classifier that detects whether a message is Spam or Ham (Not Spam).
- Preprocesses text (lowercasing + tokenization with regex)
- Builds a vocabulary from the training data
- Uses Multinomial Naive Bayes with Laplace smoothing (see the sketch after this list)
- Reports accuracy, precision, recall, and F1-score
- Includes an interactive demo to test custom messages
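A minimal sketch of the approach described above, assuming a plain fit/predict interface; the class and function names here are illustrative and not necessarily the ones used in `main.py`:

```python
import re
import math
from collections import defaultdict

def tokenize(text):
    # Lowercase and extract word-like tokens with a regex
    return re.findall(r"[a-z0-9']+", text.lower())

class MultinomialNaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, messages, labels):
        self.classes = set(labels)
        self.vocab = set()
        self.class_counts = defaultdict(int)                       # messages per class
        self.word_counts = defaultdict(lambda: defaultdict(int))   # token counts per class
        self.total_words = defaultdict(int)                        # total tokens per class
        for text, label in zip(messages, labels):
            self.class_counts[label] += 1
            for token in tokenize(text):
                self.vocab.add(token)
                self.word_counts[label][token] += 1
                self.total_words[label] += 1
        self.n_messages = len(messages)

    def predict(self, text):
        tokens = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label in self.classes:
            # Log prior P(class) plus sum of log likelihoods P(token | class)
            score = math.log(self.class_counts[label] / self.n_messages)
            denom = self.total_words[label] + len(self.vocab)      # Laplace smoothing
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```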
The dataset is the SMS Spam Collection from Kaggle: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset?resource=download
The CSV file has two columns:
- v1 → Label (ham or spam)
- v2 → Message text
Example rows:

| v1   | v2                                                                  |
|------|---------------------------------------------------------------------|
| SPAM | "Congratulations! You've won a $1000 Walmart gift card. Call now!"  |
| HAM  | "Can we reschedule our meeting to 3 PM tomorrow?"                   |
1. Clone this repository
2. Run the script:
python main.py
- Accuracy: 0.95
- Precision: 0.94
- Recall: 0.91
- F1-Score: 0.92
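For reference, the four metrics above can be computed from counts of true/false positives and negatives; this is a generic sketch (with `spam` treated as the positive class), not the exact evaluation code in `main.py`:

```python
def evaluate(y_true, y_pred, positive="spam"):
    # Count true/false positives and negatives with "spam" as the positive class
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```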
--- Spam Classifier Demo ---
Enter a message to classify:
"WIN A FREE iPhone! Click now!"
Prediction: SPAM
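The interactive demo boils down to a simple input loop; this sketch assumes `clf` is a trained classifier with a `predict` method like the one in the earlier sketch:

```python
# Assumes `clf` is already trained, e.g.:
#   clf = MultinomialNaiveBayes(); clf.fit(messages, labels)
print("--- Spam Classifier Demo ---")
while True:
    message = input("Enter a message to classify (blank line to quit): ")
    if not message:
        break
    print("Prediction:", clf.predict(message).upper())
```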
🔮 Future Improvements
- Add stopword removal and stemming/lemmatization
- Try different classifiers (Logistic Regression, SVM, Neural Networks); a quick scikit-learn sketch follows this list
- Deploy as a simple web app with Flask/Streamlit
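As one possible direction for the first two points, a scikit-learn pipeline could combine TF-IDF features with English stopword removal and a Logistic Regression classifier; this is a hypothetical extension, not part of the current script:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# `messages` and `labels` as loaded from the CSV earlier
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),   # stopword removal + TF-IDF weighting
    LogisticRegression(max_iter=1000),       # alternative to Naive Bayes
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```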