@devZenta (Owner) commented Dec 8, 2025

This pull request adds a new script, preprocessing.py, that implements a complete preprocessing and modeling pipeline for predicting whether a movie is "good" (IMDB score ≥ 7) using a Random Forest classifier. The script covers data loading, cleaning, feature engineering, model training with stability testing, and provides an example prediction.

Key changes include:

Data Loading and Preprocessing:

  • Loads the processed IMDB dataset from the data/processed/imdb_bdd.csv file and performs data cleaning by dropping rows with missing or invalid values in key columns.
  • Applies one-hot encoding to the genres column and selects relevant numeric and categorical features for modeling.
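The cleaning and genre-encoding steps described above could look roughly like the sketch below. It is not the code from this PR: the column names (`imdb_score`, `genres`, `director_name`), the `|` separator in `genres`, and the helper name `preprocess` are assumptions made for illustration, and a small in-memory frame stands in for data/processed/imdb_bdd.csv.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the frame and one-hot encode genres (hypothetical helper)."""
    # Drop rows missing any key column (this column list is an assumption).
    key_cols = ["imdb_score", "genres", "director_name"]
    df = df.dropna(subset=key_cols).copy()

    # Binary target: 1 when the IMDB score is at least 7, else 0.
    df["is_good"] = (df["imdb_score"] >= 7).astype(int)

    # One indicator column per genre, assuming "|"-separated genre strings.
    genre_dummies = df["genres"].str.get_dummies(sep="|").add_prefix("genre_")
    return pd.concat([df.drop(columns=["genres"]), genre_dummies], axis=1)


# Tiny made-up frame standing in for the real CSV.
sample = pd.DataFrame({
    "imdb_score": [7.9, 5.1, None],
    "genres": ["Fantasy|Drama", "Comedy", "Action"],
    "director_name": ["Tim Burton", "Director A", "Director B"],
})
processed = preprocess(sample)
print(processed.columns.tolist())
```

With the real file, the frame would presumably come from `pd.read_csv("data/processed/imdb_bdd.csv")` instead of the inline sample.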

Model Training and Evaluation:

  • Trains a RandomForestClassifier using selected features and evaluates its stability over 5 different random splits, reporting accuracy and F1-score for each run.
  • Uses an OrdinalEncoder to handle categorical variables, ensuring unknown categories are encoded safely.
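As a rough illustration of the stability test, the sketch below repeats the split, encode, train, and score cycle over 5 seeds. The synthetic data, the feature names (`director_name`, `duration`), and the hyperparameters are assumptions rather than values from the PR. Encoding unknown categories as -1 is one way to do it "safely" with scikit-learn's `OrdinalEncoder`; the script may use a different value.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the movie data (all values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "director_name": rng.choice(["A", "B", "C", "D"], size=n),
    "duration": rng.normal(110, 15, size=n),
})
df["is_good"] = (df["duration"] + rng.normal(0, 10, size=n) > 110).astype(int)

results = []
for seed in range(5):
    train, test = train_test_split(df, test_size=0.2, random_state=seed)

    # Fit the encoder on the training split only; categories unseen at
    # fit time are mapped to -1 instead of raising an error.
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    cat_train = encoder.fit_transform(train[["director_name"]])
    cat_test = encoder.transform(test[["director_name"]])

    X_train = np.hstack([cat_train, train[["duration"]].to_numpy()])
    X_test = np.hstack([cat_test, test[["duration"]].to_numpy()])

    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, train["is_good"])
    pred = model.predict(X_test)

    acc = accuracy_score(test["is_good"], pred)
    f1 = f1_score(test["is_good"], pred)
    results.append((acc, f1))
    print(f"run {seed}: accuracy={acc:.3f}, f1={f1:.3f}")
```

Fitting the encoder inside the loop keeps each run independent, so a director who appears only in the test split is treated as unknown rather than leaking into training.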

Reporting and Example Prediction:

  • Prints detailed classification reports and summarizes model stability based on the standard deviation of accuracy across runs.
  • Provides a worked example: predicts the probability that a Tim Burton/Johnny Depp movie with specific genres is classified as "good" (src/preprocessing/preprocessing.py, R1-R125).
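The stability summary could be as simple as comparing the spread of per-run accuracies to a cutoff. The sketch below is one possible version: the helper name `summarize_stability`, the 0.02 cutoff, and the accuracy values are placeholders, not results or settings from this PR.

```python
from statistics import mean, pstdev


def summarize_stability(accuracies, max_std=0.02):
    """Return (mean, std, verdict) for a list of per-run accuracies.

    max_std is a hypothetical cutoff; the script may use another value.
    """
    avg = mean(accuracies)
    spread = pstdev(accuracies)  # population standard deviation
    verdict = "stable" if spread <= max_std else "unstable"
    return avg, spread, verdict


# Placeholder accuracies, for illustration only.
avg, spread, verdict = summarize_stability([0.81, 0.80, 0.82, 0.79, 0.81])
print(f"mean accuracy={avg:.3f}, std={spread:.4f} -> {verdict}")
```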

- Load and clean movie data from CSV, removing rows with missing crucial information and outliers.
- Perform feature engineering by creating dummy variables for movie genres.
- Split the dataset into training and testing sets.
- Encode categorical text features using OrdinalEncoder.
- Optimize RandomForestClassifier parameters using GridSearchCV with F1-score as the scoring metric.
- Evaluate model performance with a custom probability threshold for classifying movies as Good or Bad.
- Include an example prediction for a specific movie scenario.
- Updated data loading path to accommodate project structure changes.
- Simplified data cleaning process by removing unnecessary print statements.
- Enhanced feature engineering by directly creating genre dummy variables.
- Implemented stability testing with multiple iterations to assess model performance.
- Replaced GridSearchCV with fixed hyperparameters for RandomForestClassifier.
- Added detailed accuracy and F1-Score reporting for each iteration.
- Included stability assessment based on accuracy standard deviation.
- Streamlined example prediction section to utilize the latest trained model.
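One of the entries above mentions a custom probability threshold for the Good/Bad decision. A minimal sketch of that idea with `predict_proba` follows; the 0.6 threshold, the synthetic features, and the hyperparameters are assumptions, not values taken from the script.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the encoded movie features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

THRESHOLD = 0.6  # hypothetical cutoff; predict() alone would use 0.5

# Column 1 of predict_proba holds the probability of class 1 ("good").
proba_good = model.predict_proba(X[:5])[:, 1]
labels = np.where(proba_good >= THRESHOLD, "Good", "Bad")

for p, label in zip(proba_good, labels):
    print(f"P(good)={p:.2f} -> {label}")
```

Raising the threshold above 0.5 trades recall for precision on the "Good" class, which is one reason to tune it against F1 rather than accuracy.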

@devZenta added the documentation and enhancement labels Dec 8, 2025

@francoisdotdev (Collaborator) left a comment

Perfect.

@devZenta merged commit 39bc239 into main Dec 8, 2025
4 checks passed