
Conversation

@gbrunin (Owner) commented Apr 12, 2023

Basically, after the featurization the NaNs are no longer replaced by 0, and infinite values are replaced by NaNs.
The NaNs are then handled when fitting the model by a SimpleImputer, which can be chosen. The imputer is stored as an attribute of the model and re-used when predicting new values.
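The fit-time imputation and its re-use at prediction time could look like the following minimal sketch (the `Model` class and `_imputer` attribute names are hypothetical illustrations, not MODNet's actual API):

```python
import numpy as np
from sklearn.impute import SimpleImputer

class Model:
    """Hypothetical sketch of the described flow, not the actual MODNet class."""

    def __init__(self, imputer_strategy="mean"):
        # The imputer strategy can be chosen; it is stored on the model.
        self._imputer = SimpleImputer(strategy=imputer_strategy)

    def fit(self, X):
        X = np.where(np.isinf(X), np.nan, X)   # infinite values -> NaN
        X = self._imputer.fit_transform(X)     # NaNs handled here; imputer fitted and kept
        # ... fit the underlying model on X ...
        return X

    def predict(self, X_new):
        X_new = np.where(np.isinf(X_new), np.nan, X_new)
        return self._imputer.transform(X_new)  # re-use the fitted imputer
```

The key point is that `transform` at prediction time uses the statistics learned during `fit`, so new data is imputed consistently with the training data.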

Things to consider:

  • The SimpleImputer is used when fitting a model and predicting values with Vanilla or Ensemble models. For the Bayesian model, I'm not sure the Scaler is even used, so I did not do more for the Imputer there.
  • The default fill value of -1 is not suitable when combined with a StandardScaler; this should be fixed (e.g., by forcing mean imputation when a StandardScaler is used, or by going further to -10). It should be fixed in both fit_genetic and vanilla.
  • This imputer is not used when performing feature selection. If there are NaNs at that stage, the features are scaled to (-0.5, 0.5) and the NaNs are replaced by -1. I think this is fine for this step, but it can be discussed.
  • Many tests are broken by these changes...
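The feature-selection pre-treatment described in the third point above can be sketched as follows (an illustration of the described behaviour, not the exact MODNet code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix with one scattered NaN.
X = np.array([[1.0, np.nan],
              [2.0, 10.0],
              [3.0, 20.0]])

# Scale features to the (-0.5, 0.5) range; sklearn's MinMaxScaler
# ignores NaNs when fitting and passes them through unchanged.
X_scaled = MinMaxScaler(feature_range=(-0.5, 0.5)).fit_transform(X)

# Remaining NaNs are then replaced by -1 for the feature-selection step.
X_filled = np.nan_to_num(X_scaled, nan=-1.0)
```

Since all scaled values lie in (-0.5, 0.5), the -1 fill value sits clearly outside the data range, which is presumably why it is acceptable for this step.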

@ppdebreuck

Thanks Guillaume for the PR.

To be discussed when we meet. A few points on my side:

  • Do not worry about the Bayesian model: it is deprecated and will be removed.
  • Using 0 is still a valid choice when merging dataframes. Missing features (full columns) from matminer often correspond to "not present" => 0. To be checked rigorously.
  • Another, probably better, option is to fix the feature names for each featurizer beforehand: missing features (full columns) are set to the default value (0), while individual NaNs are handled with the suggested Imputer.
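The scheme suggested in the last point could be sketched like this (illustrative only; `EXPECTED_FEATURES` is a hypothetical pre-fixed feature list for a given featurizer):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical fixed feature list for one featurizer.
EXPECTED_FEATURES = ["feat_a", "feat_b", "feat_c"]

# Featurized data: feat_c is entirely missing, feat_a has one scattered NaN.
df = pd.DataFrame({"feat_a": [1.0, np.nan], "feat_b": [2.0, 3.0]})

# 1) Missing features (entire columns) are added with the default value 0.
df = df.reindex(columns=EXPECTED_FEATURES, fill_value=0.0)

# 2) Individual NaNs are handled by the chosen imputer.
imputed = SimpleImputer(strategy="mean").fit_transform(df)
```

This separates the two cases cleanly: a fully absent column gets a semantically meaningful default (0), while isolated NaNs are imputed from the observed values of that column.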

@gbrunin (Owner, Author) commented Apr 13, 2023

Thanks for your comments, PP.
By using clean_df at the pre-processing step, fully-NaN features are now removed (this was already the case when a preset is used as the featurizer, but it now also happens when another featurizer is used). Is that satisfactory? See here.
I'm not sure I understood your last point, so it may be easier to discuss it live. Let me know.
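The removal of fully-NaN feature columns at pre-processing can be sketched in one line with pandas (illustrative; not the actual clean_df implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feat_a": [1.0, 2.0],
                   "feat_b": [np.nan, np.nan]})

# Drop columns that are entirely NaN; columns with only some NaNs are kept
# for the imputer to handle later.
df = df.dropna(axis=1, how="all")
```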

