This project involves building a logistic regression model from scratch to classify movie reviews as either positive (1) or negative (0) using word-level sentiment scores from the VADER Sentiment Analysis tool.
The objective is to implement logistic regression with gradient descent to accurately predict sentiment based on the dataset provided.
- Dataset: A CSV file (
train.csv) containing movie reviews and their corresponding sentiment labels. - Data Cleaning:
- Drop rows with Neutral Sentiment.
- Convert text into numerical features using VADER's word-level sentiment scores.
- Replace each word with its corresponding sentiment score from the VADER lexicon.
- If a word is not found in the VADER lexicon, assign a score of 0.
- Ensure all embeddings have the same length by padding shorter sequences with zeros.
"I really love this amazing product!"
[0, 0.1779, 3.2, 0, 2.9, 0]
[0, 0.1779, 3.2, 0, 2.9, 0, 0, 0]
- Implement logistic regression from scratch.
- Use gradient descent for optimization.
- Do not use pre-built deep learning models.
- Utilize libraries like
pandas,numpy, andmatplotlibfor data processing and visualization.
- Split
train.csvinto training (80%) and validation (20%) sets. - Use a classification threshold of 0.5:
- Predicted Probability ≥ 0.5 → Classify as 1 (Positive).
- Predicted Probability < 0.5 → Classify as 0 (Negative).
- Use gradient descent to optimize model weights.
- Evaluate model performance using accuracy score.
- Metric Used: Accuracy Score
- Formula:
Accuracy = (Number of Correct Predictions) / (Total Predictions)
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download VADER lexicon
nltk.download('vader_lexicon')
# Initialize VADER
analyzer = SentimentIntensityAnalyzer()
# Access the VADER lexicon
vaderLex = analyzer.lexicon
# Example
print(vaderLex.get("happy")) # Output: 2.7The final output must be saved as a CSV file with two columns:
| id | Vader_Binary_Sentiment |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
- Ensure your submission file matches this format to be correctly evaluated.
- Install required dependencies:
pip install numpy pandas nltk matplotlib scikit-learn
- Run the preprocessing script to clean the dataset.
- Train the logistic regression model using gradient descent.
- Evaluate the model using accuracy score.
- Generate predictions on the test dataset.
- Save predictions as
submission.csv.
Accuracy ScoreSentiment AnalysisLogistic RegressionGradient Descent
Authors: Moughel Mohamed Souhail,Haytham Raiss,Mouad Wadi
For questions or feedback, feel free to reach out!