A machine learning pipeline for detecting early signs of depression from social media data (Twitter/X). The project uses sentiment analysis and multiple classification algorithms to analyze tweets and author bios for mental health indicators.
- Data Collection: Automated tweet collection using Twitter API
- Text Preprocessing: Comprehensive text cleaning (URL removal, emoji handling, stemming, stopword removal)
- Sentiment Analysis: Multi-method sentiment analysis using TextBlob, VADER, and NRCLex
- Classification Models: Comparison of 7+ ML algorithms (Decision Tree, Random Forest, SVM, KNN, etc.)
- Dual Analysis: Separate models for tweet content and author bios
- step1_collecting_data.py: Collect tweets and metadata from Twitter API
- step2_preprocessing.py: Data cleaning and preprocessing (tokenization, normalization, stemming)
- step3_sentiment_analysis.py: Sentiment and emotion analysis using multiple libraries
- step4_model_tweet.py: Train and save Decision Tree model on tweet text
- step5_model_bio.py: Train and save Decision Tree model on author bio
- try_classification_bio.py: Compare multiple classification algorithms on bio features
- try_classification_tweet.py: Compare multiple classification algorithms on tweet features
- requirements.txt: Python package dependencies
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('words'); nltk.download('wordnet'); nltk.download('vader_lexicon'); nltk.download('omw-1.4')"
Create a .env file in the project root:
TWITTER_CONSUMER_KEY=your_consumer_key
TWITTER_CONSUMER_SECRET=your_consumer_secret
TWITTER_ACCESS_TOKEN=your_access_token
TWITTER_ACCESS_TOKEN_SECRET=your_access_token_secret
TWITTER_BEARER_TOKEN=your_bearer_token
Run the pipeline steps in order:
python step1_collecting_data.py
Update the script to collect tweets for your desired keywords.
python step2_preprocessing.py
Update input/output paths in the script before running.
python step3_sentiment_analysis.py
Performs sentiment labeling using TextBlob and emotion analysis with NRCLex.
python step4_model_tweet.py
python step5_model_bio.py
Trains and saves Decision Tree models for predictions.
python try_classification_tweet.py
python try_classification_bio.py
Compare performance of multiple algorithms (Random Forest, SVM, Logistic Regression, KNN, Naive Bayes, etc.)
The project focuses on mental health-related keywords including:
- Depression indicators: depressed, depression, dysthymia, feeling hopeless
- Anxiety indicators: anxious, stress, overwhelmed
- Emotional states: grief, frustration, anger, isolation
- Support topics: #mentalhealthsupport, #mentalhealthrecovery
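
The keyword groups above have to be turned into a search query before tweets can be collected. The sketch below shows one way to do that, assuming Twitter API v2 search operators (`OR`, quoted phrases, `lang:`, `-is:retweet`). The function name `build_query` and the 512-character limit are assumptions for illustration, not taken from `step1_collecting_data.py`; the actual limit depends on the API access tier.

```python
def build_query(keywords, lang="en", exclude_retweets=True, max_len=512):
    """Combine keywords into one OR-query (hypothetical helper)."""
    terms = []
    for kw in keywords:
        kw = kw.strip()
        if not kw:
            continue
        # Quote multi-word phrases so they are matched as an exact phrase
        terms.append(f'"{kw}"' if " " in kw else kw)
    if not terms:
        raise ValueError("at least one non-empty keyword is required")

    query = "(" + " OR ".join(terms) + ")"
    if lang:
        query += f" lang:{lang}"
    if exclude_retweets:
        query += " -is:retweet"

    # max_len is an assumed limit; check the limit for your access tier
    if len(query) > max_len:
        raise ValueError(f"query is {len(query)} characters, limit is {max_len}")
    return query


if __name__ == "__main__":
    print(build_query(["depressed", "feeling hopeless", "#mentalhealthsupport"]))
```

If the full keyword list does not fit in one query, it can be split into several smaller queries and the results merged afterwards.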
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
- Naive Bayes
- Passive Aggressive Classifier
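
A comparison of these models could look roughly like the sketch below. It is not the code from `try_classification_tweet.py`; it assumes scikit-learn, TF-IDF features, and cross-validated accuracy as the metric, and it covers six of the listed models (with `LinearSVC` standing in for the SVM). The function name `compare_classifiers` is hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(texts, labels, cv=3):
    """Return mean cross-validated accuracy per model (hypothetical helper)."""
    candidates = {
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "linear_svm": LinearSVC(),
        "knn": KNeighborsClassifier(n_neighbors=3),
        "naive_bayes": MultinomialNB(),
    }
    results = {}
    for name, clf in candidates.items():
        # Fit the vectorizer inside the pipeline so each fold only sees
        # vocabulary from its own training split
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
        results[name] = float(scores.mean())
    return results
```

Putting the vectorizer inside the pipeline keeps test-fold text out of the fitted vocabulary, which makes the scores of different models comparable.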
- TextBlob: Polarity-based sentiment classification
- VADER: Social media-optimized sentiment analysis
- NRCLex: Emotion lexicon analysis (fear, anger, joy, sadness, etc.)
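
Polarity-based labeling comes down to thresholding a score in the range -1 to 1. The sketch below is one possible version, not the logic in `step3_sentiment_analysis.py`. The ±0.05 cut-offs are the ones conventionally used with VADER's compound score; applying the same cut-offs to TextBlob polarity is an assumption, as are the helper names.

```python
def label_polarity(score, pos_threshold=0.05, neg_threshold=-0.05):
    """Map a polarity score in [-1, 1] to a label (thresholds are assumptions)."""
    if score >= pos_threshold:
        return "positive"
    if score <= neg_threshold:
        return "negative"
    return "neutral"


def textblob_polarity(text):
    """TextBlob polarity; imported lazily so the labeler works without it."""
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def vader_compound(text):
    """VADER compound score; needs nltk.download('vader_lexicon') first."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]


if __name__ == "__main__":
    # Scores are hard-coded here so the demo runs without extra packages
    for score in (0.6, 0.0, -0.3):
        print(score, label_polarity(score))
```

With the libraries installed, `label_polarity(textblob_polarity(text))` and `label_polarity(vader_compound(text))` give two labels per tweet that can be compared or combined.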
- queries_for_*.csv: Raw collected tweets
- preprocessed_data_with_stemming.csv: Cleaned and processed data
- new_data_labeled_stemmed.csv: Sentiment-labeled dataset
- decision_tree_model_tweet.pkl: Trained tweet classifier
- decision_tree_model_bio.pkl: Trained bio classifier
- tfidf_v_final_tweet.pkl: TF-IDF vectorizer for tweets
- tfidf_v_final_bio.pkl: TF-IDF vectorizer for bios
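
To reuse the saved artifacts, the vectorizer and the classifier need to be loaded together and applied in that order. The sketch below assumes each `.pkl` file holds a single fitted scikit-learn object with the usual `transform` and `predict` methods; the helper names are hypothetical.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled object. Only unpickle files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def predict_texts(texts, vectorizer, model):
    """Vectorize already-preprocessed texts and return predicted labels."""
    features = vectorizer.transform(texts)
    return list(model.predict(features))


# Example usage (assumes the files exist in the working directory):
# vectorizer = load_pickle("tfidf_v_final_tweet.pkl")
# model = load_pickle("decision_tree_model_tweet.pkl")
# print(predict_texts(["feel hopeless today"], vectorizer, model))
```

Input text should go through the same preprocessing that was used for training, and the scikit-learn version used for loading should match the one used for saving, since pickled estimators are not guaranteed to load across versions.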
- Input and output paths are set inside each script; update them before running
- Twitter API credentials are required for data collection
- The preprocessing pipeline includes: URL removal, emoji conversion, punctuation removal, contraction expansion, stemming, and stopword removal
- Models are saved using pickle for later use in production
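
The preprocessing steps listed in the notes above can be sketched as a single function. This is a minimal illustration, not the code in `step2_preprocessing.py`: the contraction and stopword tables are tiny stand-ins, emoji conversion is left out, the stemmer is passed in as a parameter, and the order of operations is an assumption. Contractions are expanded before punctuation is stripped, because removing apostrophes first would make them unrecognizable.

```python
import re
import string

# Tiny illustrative tables; real runs would use fuller resources
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am",
                "it's": "it is", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "i", "am", "is", "it", "to", "and", "of", "so", "do"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stem=lambda word: word):
    """Clean one tweet and return a list of tokens (hypothetical helper)."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # URL removal
    text = text.replace("\u2019", "'")    # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = text.translate(PUNCT_TABLE)    # punctuation removal (also drops '#')
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [stem(t) for t in tokens]


if __name__ == "__main__":
    print(preprocess("I can't sleep, it's so hard https://t.co/abc #depressed"))
```

To add stemming, pass a stemmer's function as `stem`, for example `nltk.stem.PorterStemmer().stem` once NLTK is installed.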
See LICENSE file for details.
See CONTRIBUTING.md file for details.