Feature/scikit #1
Merged
This pull request adds a new script, `preprocessing.py`, that implements a complete preprocessing and modeling pipeline for predicting whether a movie is "good" (IMDB score ≥ 7) using a Random Forest classifier. The script covers data loading, cleaning, feature engineering, and model training with stability testing, and it provides an example prediction.

Key changes include:
Data Loading and Preprocessing:
- Loads the `data/processed/imdb_bdd.csv` file and performs data cleaning by dropping rows with missing or invalid values in key columns.
- Processes the `genres` column and selects relevant numeric and categorical features for modeling.

Model Training and Evaluation:
- Trains a `RandomForestClassifier` using selected features and evaluates its stability over 5 different random splits, reporting accuracy and F1-score for each run.
- Uses `OrdinalEncoder` to handle categorical variables, ensuring unknown categories are encoded safely.

Reporting and Example Prediction:
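
For reviewers who want a concrete picture of the flow described in this summary, here is a minimal sketch. It is not the code from `preprocessing.py`. The column names (`imdb_score`, `duration`, `budget`, `genres`, `country`), the pipe-separated genre format, the derived `main_genre` feature, the 80/20 split, and the helper names are all assumptions made for illustration, and a small synthetic frame stands in for `data/processed/imdb_bdd.csv` so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature names; the real script may select different columns.
NUMERIC = ["duration", "budget"]
CATEGORICAL = ["main_genre", "country"]


def make_toy_frame(n=200, seed=0):
    """Synthetic stand-in for the CSV so the sketch runs without the data file."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "imdb_score": rng.uniform(3.0, 9.5, n).round(1),
        "duration": rng.integers(70, 200, n).astype(float),
        "budget": rng.uniform(1e5, 2e8, n),
        "genres": rng.choice(["Drama|Romance", "Action|Thriller", "Comedy"], n),
        "country": rng.choice(["USA", "UK", "France"], n),
    })


def prepare(df):
    """Drop incomplete rows, derive one genre feature, build the binary target."""
    df = df.dropna(subset=["imdb_score", "genres", "country", *NUMERIC]).copy()
    # Assumption: genres are pipe-separated and the first entry is the main one.
    df["main_genre"] = df["genres"].str.split("|").str[0]
    X = df[NUMERIC + CATEGORICAL]
    y = (df["imdb_score"] >= 7).astype(int)  # "good" movie: IMDB score >= 7
    return X, y


def build_model(seed):
    """Ordinal-encode categoricals (unseen categories map to -1), then fit a forest."""
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    pre = ColumnTransformer([("cat", encoder, CATEGORICAL)], remainder="passthrough")
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    return Pipeline([("pre", pre), ("rf", forest)])


def stability_check(X, y, seeds=range(5)):
    """Train and score on several random splits; return (accuracy, F1) per run."""
    results = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y
        )
        pred = build_model(seed).fit(X_tr, y_tr).predict(X_te)
        results.append(
            (accuracy_score(y_te, pred), f1_score(y_te, pred, zero_division=0))
        )
    return results


if __name__ == "__main__":
    X, y = prepare(make_toy_frame())
    for run, (acc, f1) in enumerate(stability_check(X, y), start=1):
        print(f"run {run}: accuracy={acc:.3f} f1={f1:.3f}")

    # Example prediction; "Narnia" never appears in training and is encoded as -1.
    model = build_model(0).fit(X, y)
    new_movie = pd.DataFrame([{
        "duration": 120.0, "budget": 5e7, "main_genre": "Drama", "country": "Narnia",
    }])
    print("predicted good movie:", bool(model.predict(new_movie)[0]))
```

Setting `handle_unknown="use_encoded_value"` with `unknown_value=-1` is one way to get the "encoded safely" behavior mentioned above: a category absent from the training split does not raise an error at prediction time. Because the toy data is random, the printed scores only demonstrate the reporting format and say nothing about the real model's quality.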