This project analyzes text messages from different chat sources using various Natural Language Processing (NLP) techniques and machine learning models.
The project implements multiple approaches to analyze and classify chat messages:
- Logistic Regression Model
  - Custom implementation of logistic regression from scratch
  - Feature extraction from text data
  - Model evaluation and error analysis
  - PyTorch implementation for comparison
- Naive Bayes Classification
  - Implementation of Naive Bayes algorithm for text classification
  - Calculation of prior and likelihood probabilities
  - Word frequency analysis
  - Model accuracy evaluation
- Cosine Similarity Analysis
  - Using Gensim for text embedding
  - Creation of TF-IDF and LSI models
  - Document similarity comparisons
  - Vector space modeling
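The prior and likelihood calculation listed under Naive Bayes Classification can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from the notebooks: the function names, the `chat_a`/`chat_b` labels, the toy messages, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(messages, labels):
    """Estimate log priors and add-one-smoothed log likelihoods.

    `messages` is a list of token lists and `labels` the matching class names.
    Add-one smoothing is an assumption of this sketch.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(messages, labels):
        word_counts[label].update(tokens)

    vocab = {word for counts in word_counts.values() for word in counts}
    total = len(labels)

    # Prior: fraction of training messages that belong to each class.
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}

    # Likelihood: smoothed relative frequency of each word within a class.
    log_likelihood = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + 1) / denom) for word in vocab
        }
    return log_prior, log_likelihood


def classify(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data standing in for two chat sources.
toy_messages = [
    ["lunch", "at", "noon"],
    ["meeting", "moved", "to", "noon"],
    ["deploy", "the", "build"],
    ["build", "failed", "again"],
]
toy_labels = ["chat_a", "chat_a", "chat_b", "chat_b"]

log_prior, log_likelihood = train_naive_bayes(toy_messages, toy_labels)
print(classify(["build", "deploy"], log_prior, log_likelihood))  # chat_b
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.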
Data processing includes:
- Text preprocessing including tokenization and cleaning
- Feature extraction for machine learning models
- Vocabulary building and frequency analysis
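The preprocessing, vocabulary, and feature-extraction steps above might look something like the sketch below. It uses only the standard library as a stand-in for NLTK; the cleaning rules (lowercasing, dropping URLs), the helper names, and the bag-of-words features are assumptions for illustration rather than the contents of utils.py.

```python
import re
from collections import Counter


def tokenize(message):
    """Lowercase a message, remove URLs, and split it into word tokens."""
    text = message.lower()
    text = re.sub(r"https?://\S+", " ", text)
    return re.findall(r"[a-z0-9']+", text)


def build_vocabulary(messages, min_count=1):
    """Count token frequencies and map each kept token to a column index."""
    counts = Counter(tok for msg in messages for tok in tokenize(msg))
    kept = sorted(word for word, n in counts.items() if n >= min_count)
    vocab = {word: index for index, word in enumerate(kept)}
    return vocab, counts


def bag_of_words(message, vocab):
    """Turn a message into a count vector over the vocabulary."""
    vector = [0] * len(vocab)
    for tok in tokenize(message):
        if tok in vocab:
            vector[vocab[tok]] += 1
    return vector


vocab, counts = build_vocabulary(["Hello world", "hello again"])
print(vocab)                                     # {'again': 0, 'hello': 1, 'world': 2}
print(bag_of_words("hello hello world", vocab))  # [0, 2, 1]
```

Raising `min_count` drops rare tokens, which keeps the feature vectors smaller at the cost of ignoring infrequent words.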
Models implemented:
- Custom Logistic Regression implementation
- Naive Bayes classifier
- Gensim-based similarity models
- PyTorch neural network implementation
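To give a feel for what a from-scratch logistic regression involves, here is a small sketch trained with batch gradient descent. It is written in plain Python rather than NumPy or PyTorch so that it runs anywhere; the two-column feature layout, learning rate, epoch count, and function names are assumptions for the example and are not taken from the notebooks.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log loss with respect to the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the gradient averaged over all samples.
        w = [wj - lr * gj / n_samples for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def predict_label(x, w, b):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical features: [count of words typical of chat A, count typical of chat B].
X = [[3, 0], [2, 1], [0, 3], [1, 2]]
y = [1, 1, 0, 0]

w, b = train_logistic_regression(X, y)
print([predict_label(x, w, b) for x in X])  # [1, 1, 0, 0]
```

A vectorized NumPy or PyTorch version would replace the inner loops with matrix operations, but the update rule stays the same.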
Libraries used:
- NumPy for numerical computations
- Pandas for data manipulation
- NLTK for natural language processing
- Gensim for text similarity analysis
- PyTorch for deep learning implementation
- Matplotlib for visualization
The project achieves high accuracy in classification tasks:
- Logistic Regression accuracy: ~99.2%
- Naive Bayes accuracy: ~99.7%
The cosine similarity analysis successfully identifies semantically similar messages across different chat sources.
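The idea behind the similarity comparison can be illustrated without Gensim. The sketch below builds TF-IDF vectors by hand and compares them with cosine similarity; it is a plain-Python stand-in, not Gensim's API, and it leaves out the LSI step. The `log(N/df) + 1` weighting and the toy messages are assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors ({term: weight}) from tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # The +1 keeps terms that occur in every document from getting zero weight
    # (a choice made for this sketch).
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({term: count * idf[term] for term, count in term_freq.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical tokenized messages from different chats.
docs = [
    ["see", "you", "at", "lunch"],
    ["lunch", "at", "noon"],
    ["build", "failed", "again"],
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]) > cosine_similarity(vectors[0], vectors[2]))  # True
```

Messages that share no terms score exactly 0.0 under this scheme, which is one reason a latent model such as LSI can help: it can relate messages that use different words for similar topics.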
Project files:
- Logistic_regression_which_chat.ipynb: Logistic regression implementation and analysis
- Naive_Bayes_which_chat.ipynb: Naive Bayes classifier implementation
- Cosine_similarity.ipynb: Text similarity analysis using Gensim
- utils.py: Helper functions for text processing and model implementation
The project is implemented in Jupyter notebooks. To run the analysis:
- Install required dependencies
- Load the chat data
- Run the notebooks in sequence
- Examine results and visualizations
Requirements:
- Python 3.x
- NumPy
- Pandas
- NLTK
- Gensim
- PyTorch
- Matplotlib
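Before opening the notebooks, it can help to confirm that the packages listed above are importable. The snippet below is one possible check, using only the standard library; the `find_missing` helper is a hypothetical name, and the import names are the usual ones for these packages (PyTorch is imported as `torch`).

```python
import importlib.util

# Top-level import names for the listed packages (PyTorch installs as "torch").
REQUIRED = ["numpy", "pandas", "nltk", "gensim", "torch", "matplotlib"]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```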