Feature/scikit #1

devZenta · 2025-12-08T13:57:05Z

This pull request adds a new script, preprocessing.py, that implements a complete preprocessing and modeling pipeline for predicting whether a movie is "good" (IMDB score ≥ 7) using a Random Forest classifier. The script covers data loading, cleaning, feature engineering, model training with stability testing, and provides an example prediction.

Key changes include:

Data Loading and Preprocessing:

Loads the processed IMDB dataset from the data/processed/imdb_bdd.csv file and performs data cleaning by dropping rows with missing or invalid values in key columns.
Applies one-hot encoding to the genres column and selects relevant numeric and categorical features for modeling.
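For reviewers who want a quick picture of the loading and cleaning step, here is a minimal sketch. It is not the code from preprocessing.py: the column names (`director_name`, `genres`, `budget`, `imdb_score`), the pipe-separated genre format, and the inline CSV standing in for data/processed/imdb_bdd.csv are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/processed/imdb_bdd.csv.
# Column names and the "|" genre separator are assumptions.
CSV_TEXT = """director_name,genres,budget,imdb_score
Tim Burton,Fantasy|Comedy,1000000,7.3
Jane Doe,Action,,5.1
John Roe,Drama|Comedy,500000,6.9
,Horror,200000,4.0
"""


def load_and_prepare(source):
    """Load the CSV, drop incomplete rows, one-hot encode genres, add the target."""
    df = pd.read_csv(source)
    # Drop rows that are missing any of the key columns.
    df = df.dropna(subset=["director_name", "genres", "budget", "imdb_score"])
    # One column per genre, 1 if the movie has that genre and 0 otherwise.
    genre_dummies = df["genres"].str.get_dummies(sep="|")
    df = pd.concat([df.drop(columns=["genres"]), genre_dummies], axis=1)
    # Target: a movie is "good" when its IMDB score is at least 7.
    df["is_good"] = (df["imdb_score"] >= 7).astype(int)
    return df


prepared = load_and_prepare(io.StringIO(CSV_TEXT))
print(prepared)
```

In the real script the same function would presumably be called with the CSV path instead of an in-memory buffer.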

Model Training and Evaluation:

Trains a RandomForestClassifier using selected features and evaluates its stability over 5 different random splits, reporting accuracy and F1-score for each run.
Uses an OrdinalEncoder to handle categorical variables, ensuring unknown categories are encoded safely.
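The stability test described above can be sketched roughly as follows. This is an illustration on synthetic data, not the script itself: the feature names, the 80/20 split, `n_estimators=50`, and the choice of `-1` as the code for unknown categories are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in data; the real script uses the IMDB features instead.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "director_name": rng.choice(["A", "B", "C", "D"], size=n),
    "budget": rng.uniform(1, 100, size=n),
})
y = ((X["director_name"] == "A") | (X["budget"] > 70)).astype(int)


def build_model(seed):
    """Ordinal-encode the text column, then fit a random forest."""
    encoder = ColumnTransformer(
        [(
            "cat",
            # Categories unseen during fit are mapped to -1 instead of raising.
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            ["director_name"],
        )],
        remainder="passthrough",
    )
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    return make_pipeline(encoder, forest)


def stability_runs(X, y, n_runs=5):
    """Train and score the model on n_runs different random splits."""
    results = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y
        )
        model = build_model(seed).fit(X_tr, y_tr)
        pred = model.predict(X_te)
        results.append((accuracy_score(y_te, pred), f1_score(y_te, pred)))
    return results


runs = stability_runs(X, y)
for i, (acc, f1) in enumerate(runs, start=1):
    print(f"run {i}: accuracy={acc:.3f} f1={f1:.3f}")
```

Wrapping the encoder and the classifier in one pipeline keeps the encoder fitted on the training split only, so a director that appears only in the test split goes through the unknown-category path.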

Reporting and Example Prediction:

Prints detailed classification reports and summarizes model stability based on the standard deviation of accuracy across runs.
Provides a worked example: predicts the probability that a Tim Burton/Johnny Depp movie with specific genres is classified as "good". (src/preprocessing/preprocessing.py, R1-R125)
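A stability summary based on the standard deviation of accuracy could look like the sketch below. The PR does not state which cutoff the script uses, so the `max_std=0.02` default and the sample accuracies are purely hypothetical.

```python
from statistics import mean, pstdev


def summarize_stability(accuracies, max_std=0.02):
    """Summarize per-run accuracies; max_std is a hypothetical cutoff."""
    spread = pstdev(accuracies)  # population standard deviation across runs
    return {
        "mean": mean(accuracies),
        "std": spread,
        "stable": spread <= max_std,
    }


# Made-up accuracies for five runs, for illustration only.
summary = summarize_stability([0.81, 0.80, 0.82, 0.79, 0.81])
print(summary)
```

A small spread means the score does not depend much on which rows land in the test split; it says nothing about whether the score itself is good.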

- Load and clean movie data from CSV, removing rows with missing crucial information and outliers.
- Perform feature engineering by creating dummy variables for movie genres.
- Split the dataset into training and testing sets.
- Encode categorical text features using OrdinalEncoder.
- Optimize RandomForestClassifier parameters using GridSearchCV with F1-score as the scoring metric.
- Evaluate model performance with a custom probability threshold for classifying movies as Good or Bad.
- Include an example prediction for a specific movie scenario.
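The first commit's approach (grid search scored on F1, then a custom probability threshold) can be sketched like this. The parameter grid, the 0.6 threshold, and the synthetic data are assumptions; the commit message does not give the actual values, and a later commit replaces GridSearchCV with fixed hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the movie features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical grid; candidates are ranked by cross-validated F1-score.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    scoring="f1",
    cv=3,
)
search.fit(X_tr, y_tr)


def classify_with_threshold(model, X, threshold):
    """Label a sample 1 ("Good") when P(class 1) >= threshold, else 0 ("Bad")."""
    proba_good = model.predict_proba(X)[:, 1]
    return (proba_good >= threshold).astype(int)


THRESHOLD = 0.6  # hypothetical value, stricter than the default 0.5
labels = classify_with_threshold(search.best_estimator_, X_te, THRESHOLD)
print(search.best_params_, int(labels.sum()), "predicted Good")
```

Raising the threshold trades recall for precision on the "Good" class, which is why it is worth reporting alongside the F1-score.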

- Updated data loading path to accommodate project structure changes.
- Simplified data cleaning process by removing unnecessary print statements.
- Enhanced feature engineering by directly creating genre dummy variables.
- Implemented stability testing with multiple iterations to assess model performance.
- Replaced GridSearchCV with fixed hyperparameters for RandomForestClassifier.
- Added detailed accuracy and F1-Score reporting for each iteration.
- Included stability assessment based on accuracy standard deviation.
- Streamlined example prediction section to utilize the latest trained model.

francoisdotdev

Perfect.

GARATONCODE added 3 commits December 8, 2025 10:50

feat: add preprocessing module for data preparation

7614b9e

devZenta requested review from Degalax, GARATONCODE, Lockxii, eather55 and francoisdotdev December 8, 2025 13:57

devZenta assigned GARATONCODE Dec 8, 2025

devZenta added the documentation and enhancement labels Dec 8, 2025

Merge branch 'main' into feature/scikit

913c172

francoisdotdev approved these changes Dec 8, 2025

devZenta merged commit 39bc239 into main Dec 8, 2025
4 checks passed

4 participants
