This project began as a task for a job application. I have since reworked it with a more modern approach and focused on explaining what sentiment analysis is and how it works, so that it can serve as an educational resource for anyone interested in the topic (note that I am a beginner myself).
A modern, production-ready sentiment analysis system for hotel reviews that automatically classifies customer feedback as positive or negative sentiment. This project demonstrates advanced natural language processing (NLP) and machine learning preprocessing techniques.
Sentiment Analysis is a branch of Natural Language Processing (NLP) that determines the emotional tone or attitude expressed in text. In the context of hotel reviews, it helps businesses:
- Understand Customer Satisfaction: Automatically categorize thousands of reviews as positive or negative
- Monitor Brand Reputation: Track sentiment trends over time
- Improve Services: Identify common themes in negative feedback
- Automate Review Processing: Replace manual review categorization with AI
Manual Processing:
- Time-consuming human review of each comment
- Inconsistent categorization between reviewers
- Cannot scale to thousands of reviews
This AI Approach:
- Process 26,000+ reviews in seconds
- Consistent, objective classification criteria
- Scalable to any dataset size
- Configurable sensitivity thresholds
This dataset contains 26,386 real hotel reviews scraped from Booking.com, providing a rich source of authentic customer feedback.
Dataset Overview:
├── 26,386 hotel reviews
├── 15 data columns
├── Reviews from multiple countries
├── Ratings on a 1.0 to 10.0 scale
└── Raw text reviews + metadata
| Column | Description | Example |
|---|---|---|
| review_text | Customer's written review | "The hotel was clean and staff friendly..." |
| rating | Numerical rating (1-10) | 8.5 |
| hotel_name | Name of the hotel | "Villa Pura Vida" |
| nationality | Reviewer's country | "Belgium" |
| reviewed_at | Date of review | "11 July 2021" |
Hotel: Villa Pura Vida
Rating: 8.5/10
Review: "Everything was perfect! Quiet, cozy place to relax.
The breakfast was excellent and the staff was very helpful..."
Nationality: Poland
Date: July 2021
Rating Statistics:
├── Range: 1.0 - 10.0
├── Average: 8.45/10
├── Negative (< 5): 462 reviews (1.8%)
├── Neutral (5-7): 6,725 reviews (25.5%)
└── Positive (7+): 19,199 reviews (72.7%)
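The three buckets above can be reproduced with a small helper. This is a minimal sketch: the function name `rating_bucket` is hypothetical, the sample ratings are made up, and it assumes a rating of exactly 7 counts as positive, since the "5-7" and "7+" ranges overlap at that value.

```python
from collections import Counter

def rating_bucket(rating: float) -> str:
    """Map a 1-10 rating to a coarse bucket (boundary handling is an assumption)."""
    if rating < 5:
        return "negative"
    if rating < 7:
        return "neutral"
    return "positive"

# Toy ratings for illustration only, not taken from the dataset
sample_ratings = [2.5, 4.9, 5.0, 6.5, 7.0, 8.5, 10.0]
counts = Counter(rating_bucket(r) for r in sample_ratings)
for bucket in ("negative", "neutral", "positive"):
    share = counts[bucket] / len(sample_ratings)
    print(f"{bucket}: {counts[bucket]} reviews ({share:.1%})")
```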
I expected to find a balanced distribution of positive, neutral, and negative reviews, similar to typical product review datasets (roughly 60% positive, 25% neutral, 15% negative).
After binary labeling, the 26,363 reviews that remain after preprocessing turned out to be heavily skewed toward positive sentiment:
Expected Distribution:
├── Positive: ~60%
├── Neutral: ~25%
└── Negative: ~15%
Actual Distribution:
├── Positive: 95.6% (25,198 reviews)
└── Negative: 4.4% (1,165 reviews)
- Selection Bias: People are more likely to review when they have extreme experiences
- Platform Effect: Booking.com may pre-filter very negative reviews
- Hotel Quality: Dataset may focus on higher-rated establishments
- Review Incentives: Hotels may encourage satisfied customers to review
This imbalanced dataset presents classic ML challenges:
- Class Imbalance: Need techniques like stratified sampling
- Model Bias: Risk of always predicting "positive"
- Evaluation Metrics: Accuracy alone is misleading (95.6% by always guessing positive)
- Real-world Value: The practical value of the model lies in detecting the rare negative reviews
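To see why accuracy alone is misleading here, consider a "model" that always predicts positive. The sketch below uses only the class counts reported above. Balanced accuracy (the mean of per-class recall) is one alternative metric, shown as an illustration rather than as the metric this project necessarily reports.

```python
# Class counts reported above for the processed dataset
n_positive = 25_198
n_negative = 1_165
total = n_positive + n_negative

# A "model" that always predicts positive gets every positive right
# and every negative wrong.
accuracy = n_positive / total
recall_positive = n_positive / n_positive  # all positives are found
recall_negative = 0 / n_negative           # no negatives are found
balanced_accuracy = (recall_positive + recall_negative) / 2

print(f"Accuracy:          {accuracy:.1%}")
print(f"Negative recall:   {recall_negative:.1%}")
print(f"Balanced accuracy: {balanced_accuracy:.1%}")
```

The baseline reaches about 95.6% accuracy while catching none of the negative reviews, which is why precision, recall, and the confusion matrix matter more than accuracy for this dataset.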
SentimentAnalysis/
├── DataPreprocess.py # Main preprocessing pipeline
├── example_usage.py # Usage demonstration
├── analyze_ratings.py # Data distribution analysis
├── sentiment_gui.py # Interactive GUI application
├── launch_gui.py # GUI launcher script
├── requirements.txt # Python dependencies
├── booking_reviews copy.csv # Hotel reviews dataset
└── README.md # This documentation
Modern object-oriented preprocessing pipeline featuring:
- Automatic column detection for any CSV structure
- Advanced text cleaning (HTML, URLs, punctuation)
- Smart sentiment classification with configurable thresholds
- Robust error handling and validation
- Professional logging and type hints
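As an illustration of what automatic column detection can look like, here is a minimal sketch that matches column names against keyword hints. The hint lists and the function name `detect_column` are assumptions made for this example. The actual logic in DataPreprocess.py may differ.

```python
from typing import Optional, Sequence

# Hypothetical keyword lists; the real pipeline may use different ones
TEXT_HINTS = ("review_text", "review", "text", "comment")
RATING_HINTS = ("rating", "score", "stars")

def detect_column(columns: Sequence[str], hints: Sequence[str]) -> Optional[str]:
    """Return the first column whose lowercased name contains a hint, or None."""
    for hint in hints:  # earlier hints take priority over later ones
        for name in columns:
            if hint in name.lower():
                return name
    return None

columns = ["hotel_name", "nationality", "rating", "reviewed_at", "review_text"]
text_col = detect_column(columns, TEXT_HINTS)
rating_col = detect_column(columns, RATING_HINTS)
if text_col is None or rating_col is None:
    raise ValueError(f"Could not find text/rating columns in {columns}")
print(text_col, rating_col)
```

Ordering the hints from most to least specific matters: a bare "review" hint checked first would match reviewed_at before review_text.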
User-friendly graphical interface featuring:
- Real-time sentiment analysis with confidence scores
- Detailed explanations of why text was classified as positive/negative
- Interactive model training with progress indicators
- Example reviews to test the system
- Model performance metrics and statistics
- Custom data loading for your own CSV files
Interactive demonstration showing:
- Complete preprocessing workflow
- Sample output and statistics
- Performance metrics
- Ready-to-use ML data
Comprehensive analysis tool for:
- Rating distribution visualization
- Column structure examination
- Data quality assessment
- Statistical summaries
Demonstration script showing:
- GUI capabilities overview
- Feature explanations
- Example analysis output
- Perfect for headless environments
# Install dependencies
pip install -r requirements.txt
# Launch the interactive GUI
python launch_gui.py
The GUI provides:
- Interactive Analysis: Type any hotel review and get instant sentiment analysis
- AI Explanations: Detailed breakdown of why the AI made its decision
- Model Training: Train the AI model on your data with progress tracking
- Performance Metrics: See how well the model performs
- Example Reviews: Try pre-loaded examples to see how it works
# Run the complete analysis
python example_usage.py
# Explore data distribution
python analyze_ratings.py
from DataPreprocess import ReviewDataPreprocessor
# Initialize and process
preprocessor = ReviewDataPreprocessor('booking_reviews copy.csv')
X, y = preprocessor.prepare_data()
# Results: X = processed text, y = sentiment labels
print(f"Dataset: {len(X)} reviews")
print(f"Positive sentiment: {y.mean():.1%}")
The sentiment analysis GUI provides a comprehensive, user-friendly interface for analyzing hotel reviews with detailed explanations.
- Real-time Analysis: Type any hotel review and get instant sentiment classification
- Confidence Scores: See how confident the AI is in its prediction (0-100%)
- Visual Results: Clear positive/negative indicators with color coding
- Word-level Analysis: See which specific words influenced the decision
- Impact Scores: Understand how much each word contributed to the final sentiment
- Model Transparency: Detailed breakdown of the AI's decision-making process
- One-click Training: Train the AI model on 26,000+ hotel reviews
- Progress Tracking: Real-time progress bar during model training
- Performance Metrics: See accuracy, precision, recall, and confusion matrix
- Custom Data: Load your own CSV files for analysis
- Pre-loaded Examples: Try positive, negative, and neutral review samples
- Custom Input: Analyze any hotel review text you want to test
- Processed Text View: See how the AI cleans and processes your input
- Step-by-step Explanations: Learn how sentiment analysis works
- Feature Importance: Understand which words matter most
- Model Architecture: See the technical details behind the predictions
Main Interface Layout:
├── Control Panel: Train model, load data, view status
├── Input Section: Enter reviews, load examples
├── Analysis Tab: Sentiment results and confidence
├── Explanation Tab: Why this sentiment? (Word analysis)
└── Model Info Tab: Performance metrics and details
When you analyze a review, the GUI shows:
- Overall Sentiment: Positive or Negative with confidence percentage
- Key Influencing Words:
  - Words that made it seem positive (e.g., "excellent", "friendly", "clean")
  - Words that made it seem negative (e.g., "terrible", "dirty", "rude")
- Impact Scores: Numerical values showing how much each word mattered
- Processing Steps: How the raw text was cleaned and prepared
- Model Details: Technical information about the AI algorithm
Input Review: "The hotel was absolutely fantastic! Great location and friendly staff."
AI Analysis:
- Sentiment: POSITIVE (89.2% confidence)
- Key Positive Words: "fantastic" (+0.245), "great" (+0.156), "friendly" (+0.134)
- Explanation: The model detected strong positive language with words like "fantastic" and "great" that are highly associated with positive hotel experiences in the training data.
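One common way to produce per-word impact scores like these is a linear model, where each word's contribution is its feature value multiplied by its learned weight. The sketch below assumes such a model. The weights, the bias, and the function name `explain` are made up for illustration and are not the values behind the example above.

```python
import math

# Hypothetical learned weights (positive values push toward "positive")
WEIGHTS = {"fantastic": 1.8, "great": 1.2, "friendly": 1.0, "dirty": -1.6, "rude": -2.0}
BIAS = 0.5  # hypothetical intercept

def explain(tokens):
    """Return (probability of positive, per-word contributions sorted by magnitude)."""
    contributions = {}
    for token in tokens:
        if token in WEIGHTS:
            # Feature value here is a raw count; a TF-IDF value would work the same way
            contributions[token] = contributions.get(token, 0.0) + WEIGHTS[token]
    score = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-score))  # logistic (sigmoid) function
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked

tokens = "hotel absolutely fantastic great location friendly staff".split()
probability, ranked = explain(tokens)
label = "POSITIVE" if probability >= 0.5 else "NEGATIVE"
print(f"{label} ({max(probability, 1 - probability):.1%} confidence)")
for word, impact in ranked:
    print(f"  {word}: {impact:+.3f}")
```

Words the model has no weight for ("hotel", "location") contribute nothing, which is why only a handful of words show up in the explanation.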
- Object-Oriented Design: Clean, maintainable class structure
- Type Hints: Full static type checking support
- Error Handling: Graceful failure with meaningful messages
- Logging: Structured debug information
- Documentation: Comprehensive docstrings
# What the preprocessing does:
"<p>Great hotel! Visit https://example.com</p>"
↓
"great hotel visit"
# Removes: HTML tags, URLs, punctuation, stopwords, short words
# Keeps: Meaningful content words for sentiment analysis
- Column Auto-Detection: Works with any CSV structure
- Missing Data: Robust handling of null/invalid entries
- Rating Flexibility: Configurable sentiment thresholds
- Stratified Splitting: Maintains class balance in train/test
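Configurable thresholds and missing-data handling can be combined in one labeling step. The following is a sketch under stated assumptions: the function name `label_sentiment` and the default threshold of 6.0 are hypothetical, and the project's real cut-off is not stated here.

```python
from typing import Optional

def label_sentiment(rating, threshold: float = 6.0) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None for missing/invalid ratings.

    The default threshold of 6.0 is an assumption for illustration.
    """
    try:
        value = float(rating)
    except (TypeError, ValueError):
        return None  # e.g. None or non-numeric text
    if value != value:  # NaN is the only value not equal to itself
        return None
    if not 1.0 <= value <= 10.0:  # outside the rating scale
        return None
    return 1 if value >= threshold else 0

raw_ratings = [8.5, "9.0", None, "n/a", float("nan"), 3.0, 42]
labels = [label_sentiment(r) for r in raw_ratings]
kept = [label for label in labels if label is not None]
print(labels)
print(f"Kept {len(kept)} of {len(raw_ratings)} rows")
```

Rows that come back as `None` would be dropped before training, which is one plausible reason the processed review count is slightly lower than the raw count.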
Successfully processed: 26,363 reviews
Sentiment distribution: 95.6% positive, 4.4% negative
Train/test split: 21,090 / 5,273 samples
Processing time: ~30 seconds
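The idea behind a stratified split is to divide each class separately so that both sets keep the same positive/negative ratio. In practice scikit-learn's `train_test_split` with its `stratify` argument handles this. The plain-Python sketch below only illustrates the principle, using synthetic labels with the class counts reported above. The function name `stratified_split` and the seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps the same share in train and test."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test count
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Synthetic labels with the reported class counts (1 = positive, 0 = negative)
labels = [1] * 25_198 + [0] * 1_165
train_idx, test_idx = stratified_split(labels)

def positive_share(indices):
    return sum(labels[i] for i in indices) / len(indices)

print(f"Train/test sizes: {len(train_idx)} / {len(test_idx)}")
print(f"Positive share: train {positive_share(train_idx):.1%}, test {positive_share(test_idx):.1%}")
```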
Original: "The hotel was absolutely fantastic! Great location near the beach.
Staff were super helpful. Would definitely recommend!"
Processed: "hotel absolutely fantastic great location near beach staff
super helpful would definitely recommend"
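A transformation like the one above can be sketched with a few regular expressions. This is a simplified illustration, not the project's actual implementation: the function name `clean_review`, the minimum word length, and the tiny stopword list are assumptions.

```python
import re

# Small illustrative stopword list (an assumption; a full list is much longer)
STOPWORDS = {"the", "was", "were", "and", "are", "for", "with", "this", "that"}

def clean_review(text: str, min_length: int = 3) -> str:
    """Strip HTML tags, URLs, punctuation, stopwords and very short words."""
    text = re.sub(r"<[^>]+>", " ", text)                # drop HTML tags first
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # then drop URLs
    text = re.sub(r"[^a-z\s]", " ", text.lower())       # lowercase, keep letters only
    tokens = [t for t in text.split() if len(t) >= min_length and t not in STOPWORDS]
    return " ".join(tokens)

print(clean_review("<p>Great hotel! Visit https://example.com</p>"))
```

Tags are removed before URLs so that a closing tag directly after a link is not swallowed by the URL pattern.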
The preprocessed data is ready for machine learning:
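from sklearn.feature_extraction.text import TfidfVectorizer
# X_train holds the training texts (the training part of the train/test split)
# Keep only the 5,000 most frequent terms as TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X_vectorized = vectorizer.fit_transform(X_train)
- Logistic Regression: Fast, interpretable baseline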
- Random Forest: Handles feature interactions
- SVM: Good for text classification
- Neural Networks: LSTM/BERT for advanced performance
- SMOTE: Synthetic minority oversampling
- Class weights: Penalize mistakes on the minority class more heavily
- Threshold tuning: Optimize decision boundary
- Ensemble methods: Combine multiple approaches
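To make the class-weight idea concrete, the sketch below computes "balanced" weights as n_samples / (n_classes * class_count), the same formula scikit-learn uses for `class_weight="balanced"`. The labels are synthetic and use the class counts reported earlier. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 1 = positive, 0 = negative
labels = [1] * 25_198 + [0] * 1_165
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.2f}")
```

With these counts, an error on a negative review costs roughly 20 times as much as an error on a positive one, which discourages the model from always predicting "positive".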
| Issue | Solution |
|---|---|
| NLTK download fails | Script includes fallback stopword lists |
| Column not found | Use analyze_ratings.py to check structure |
| Memory issues | Process data in chunks for large datasets |
| Encoding errors | Ensure CSV is UTF-8 encoded |
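The stopword fallback mentioned in the table can follow a simple try/except pattern. This is a sketch of the idea, not the script's actual code: the function name `load_stopwords` and the short fallback list are assumptions.

```python
# Tiny fallback list for illustration; a real fallback would be much longer
FALLBACK_STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to", "was", "were"}

def load_stopwords():
    """Use NLTK's English stopwords if available, otherwise the fallback set."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # ImportError: NLTK is not installed
        # LookupError: the stopwords corpus has not been downloaded
        return set(FALLBACK_STOPWORDS)

stop_words = load_stopwords()
print(f"Loaded {len(stop_words)} stopwords")
```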
Problem: "GUI won't launch on remote server"
Solution: The GUI requires a graphical display. Use the command-line tools instead: python example_usage.py
Problem: "Model training takes too long"
Solution: Training on 26,000 reviews takes 30-60 seconds on modern hardware. The progress bar shows activity.
Problem: "Analysis seems inaccurate"
Solution: Remember the model is trained on hotel reviews specifically and may not work well for other domains.
pandas >= 2.0.0 # Data manipulation
nltk >= 3.8.0 # Natural language processing
scikit-learn >= 1.3.0 # Machine learning tools
numpy >= 1.24.0 # Numerical computing