
Sentiment Analysis on IMDB Dataset

This project performs sentiment analysis on the IMDB movie review dataset using Natural Language Processing (NLP) techniques for data preprocessing and machine learning algorithms for classification. The goal is to classify movie reviews as either positive or negative based on their content.

Table of Contents

  • Project Overview
  • Dataset
  • Preprocessing Steps
  • Feature Extraction
  • Models Used
  • Evaluation
  • Results
  • Libraries and Dependencies
  • Future Improvements

Project Overview

This project leverages the IMDB dataset to classify movie reviews as positive or negative. It consists of two main parts:

  1. Data Preprocessing: Cleaning and preparing the text data.
  2. Model Training and Evaluation: Training various machine learning models using TF-IDF vectorized features and evaluating their performance.

Dataset

The dataset used in this project is the IMDB movie review dataset, which contains 50,000 reviews labeled as either positive or negative.
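The repository's data-loading code is not reproduced in this README; the following is a minimal sketch, assuming the common Kaggle CSV export ("IMDB Dataset.csv" with review and sentiment columns) and an 80/20 train/test split — the file name, column names, and split ratio are all assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the reviews (file and column names are assumptions)
df = pd.read_csv("IMDB Dataset.csv")

# Encode labels: positive -> 1, negative -> 0
df["label"] = (df["sentiment"] == "positive").astype(int)

# Hold out 20% of the reviews for testing (split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)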

Preprocessing Steps

The text preprocessing includes the following steps; a condensed sketch of the pipeline follows the list:

  1. HTML Parsing: Remove HTML tags using BeautifulSoup.
  2. Tag and Symbol Removal: Remove unwanted tags like [br], excessive dots, and asterisks using regex.
  3. Contraction Expansion: Expand contractions like "don't" to "do not" using the contractions library.
  4. Text Normalization: Convert text to ASCII format and remove non-ASCII characters.
  5. Noise Removal: Remove URLs, mentions, and hashtags.
  6. Character Deduplication: Reduce repeated characters (e.g., "loooove" becomes "loove").
  7. Non-Alphanumeric Removal: Remove all non-alphanumeric characters except spaces.
  8. Lowercasing: Convert the text to lowercase.
  9. Negation Handling: Handle negations, where words following a negation are prefixed with "NOT_".
  10. Tokenization: Split text into individual words.
  11. Stopword Removal: Remove common stopwords like "the", "and", etc.
  12. Lemmatization: Reduce each word to its base form (e.g., "running" becomes "run").
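A condensed sketch of such a pipeline is below. The function structure, the negation rule (prefixing only the single word after a negator), and the use of nltk throughout are illustrative assumptions rather than the repository's exact code:

import re
import contractions
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires the nltk "punkt", "stopwords", and "wordnet" data packages
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
negators = {"not", "no", "never"}  # assumed negation cue words

def preprocess(text):
    # Steps 1-2: strip HTML tags and leftover markup
    text = BeautifulSoup(text, "html.parser").get_text()
    # Step 3: expand contractions ("don't" -> "do not")
    text = contractions.fix(text)
    # Step 4: keep ASCII characters only
    text = text.encode("ascii", "ignore").decode()
    # Step 5: remove URLs, mentions, and hashtags
    text = re.sub(r"http\S+|www\.\S+|[@#]\w+", " ", text)
    # Step 6: reduce characters repeated 3+ times to two ("loooove" -> "loove")
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Steps 7-8: drop non-alphanumeric characters, then lowercase
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text).lower()
    # Step 10: tokenize into words
    tokens = word_tokenize(text)
    # Step 9: prefix the word following a negator with "NOT_"
    result, negate = [], False
    for tok in tokens:
        if tok in negators:
            negate = True
            continue
        result.append("NOT_" + tok if negate else tok)
        negate = False
    # Steps 11-12: remove stopwords and lemmatize
    # (POS-aware lemmatization would be needed for "running" -> "run")
    return " ".join(lemmatizer.lemmatize(t) for t in result if t not in stop_words)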

Feature Extraction

We used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with an n-gram range of (1, 3) to transform the text data into features suitable for machine learning models.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer with n-gram range (1, 3)
vectorizer = TfidfVectorizer(ngram_range=(1, 3))

# Fit the vectorizer on the training data and transform both training and test data
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

Models Used

Three machine learning models were trained on the TF-IDF features:

  1. Support Vector Machine (SVM)
  2. Naive Bayes (MultinomialNB)
  3. Random Forest Classifier

Each model was trained and evaluated using the accuracy score and a classification report.

from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Initialize the three classifiers with default hyperparameters
svm_model = SVC()
nb_model = MultinomialNB()
rf_model = RandomForestClassifier()

# Fit each model on the TF-IDF features of the training data
svm_model.fit(X_train_tfidf, y_train)
nb_model.fit(X_train_tfidf, y_train)
rf_model.fit(X_train_tfidf, y_train)
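With the models fitted, the prediction arrays referenced in the evaluation code below can be produced from the held-out test features:

# Predict labels for the TF-IDF test features
y_pred_svm = svm_model.predict(X_test_tfidf)
y_pred_naive_bayes = nb_model.predict(X_test_tfidf)
y_pred_random_forest = rf_model.predict(X_test_tfidf)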

Evaluation

The models were evaluated using the accuracy score and classification report (precision, recall, f1-score) to measure their performance on the test data.

from sklearn.metrics import classification_report, accuracy_score

# Function to evaluate models
def evaluate_model(y_true, y_pred, model_name):
    print(f"Evaluation for {model_name}:")
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
    print(classification_report(y_true, y_pred))

# Evaluate Random Forest
evaluate_model(y_test, y_pred_random_forest, "Random Forest")

# Evaluate SVM
evaluate_model(y_test, y_pred_svm, "SVM")

# Evaluate Naive Bayes
evaluate_model(y_test, y_pred_naive_bayes, "Naive Bayes")

Results

Result with Naive Bayes

[NaiveBayes results image]

Result with SVM

[SVM results image]

Result with Random Forest

[RF results image]

Libraries and Dependencies

The following libraries were used:

  • pandas: Data manipulation and analysis.
  • scikit-learn: Machine learning models and evaluation metrics.
  • spacy: NLP preprocessing tasks.
  • nltk: Tokenization, stopword removal, and lemmatization.
  • contractions: Expanding contractions during preprocessing.
  • BeautifulSoup: HTML parsing and text cleaning.
  • matplotlib and seaborn: Data visualization.
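nltk additionally requires its data files to be downloaded once. Which packages are needed depends on the exact preprocessing code, but these three cover the tokenization, stopword, and lemmatization steps described above:

import nltk

# One-time downloads for tokenization, stopwords, and lemmatization
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")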

Future Improvements

  • Fine-tune hyperparameters for better model performance (see the sketch after this list).
  • Experiment with deep learning models such as LSTMs or transformers.
  • Use word embeddings (e.g., Word2Vec, GloVe) instead of TF-IDF for feature extraction.
  • Handle more complex cases of negation and sarcasm in reviews.
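As one example of the first item, a minimal grid search over the SVM's hyperparameters with scikit-learn's GridSearchCV — the grid values here are illustrative assumptions, not tuned results:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid; the values are assumptions
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy", n_jobs=-1)
grid.fit(X_train_tfidf, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)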
