During my Petroleum Engineering days I was always dealing with data analytics, but at some point I ran out of tools in my toolbox. I decided to learn data science on my own time and started by reading the books referenced at the end. The biggest challenge for me was the lack of structured examples that capture the modeling process from start to finish 🤨.
I started organizing the material and putting the pieces of the puzzle together during the learning process 🤔. I'm hoping people who are facing the same challenges can use this repository to answer some of their own questions. Don't be afraid to start; the biggest hurdle is usually self-doubt.
Topics covered within this repo:
- EDA
- Preprocessing
- Feature selection / extraction
- Modeling
- Evaluations and improvements
P.S. There are sections with repetitive code that can be easily placed inside a function. Since repetition is the best path to learning (cross-validation would concur), this was done intentionally.
Below is a brief description of what's in each folder. More detailed information can be found within each project.
Folders
- Craigslist Car Pricing - regression model that predicts car posting price.
- Credit Score Classification - classification of people with good/bad credit.
- Maunaloa Volcano CO2 Levels - time series forecasting. Sourced from A. Muller's lectures.
- US Population Income - classification model for predicting people making >$50k.
- Wine Ratings - predicting wine ratings from free text reviews.
Files
- ML Cheat Sheet - summary of models, theory and assumptions.
Datasets are not attached, but can be downloaded by following the links mentioned within each project. Below are the libraries and IDE used in each project.
IDE
Libraries
- Sklearn
- Numpy
- Matplotlib
- Pandas
- Scipy
- Statsmodels
- Category_encoders
- Seaborn
- Math (part of the Python standard library, no install needed)
- Skopt
Libraries can be installed with pip from the terminal/command line, for example:
pip install numpy
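To see which of the libraries listed above are still missing before opening a notebook, a small check like the one below may help. It is only a sketch and not part of the repository: the `find_missing` helper and the `REQUIRED` mapping are names made up for illustration, and the mapping from import names to pip package names (for example `sklearn` to `scikit-learn`, `skopt` to `scikit-optimize`) is my assumption about how these libraries are usually installed.

```python
import importlib.util

# Import name -> pip package name.
# Assumption: these are the usual PyPI names; they are not taken from the repo.
# "math" is left out because it ships with Python.
REQUIRED = {
    "sklearn": "scikit-learn",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "pandas": "pandas",
    "scipy": "scipy",
    "statsmodels": "statsmodels",
    "category_encoders": "category_encoders",
    "seaborn": "seaborn",
    "skopt": "scikit-optimize",
}


def find_missing(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        # Print a ready-to-copy install command for whatever is missing.
        print("pip install " + " ".join(missing))
    else:
        print("All listed libraries are available.")
```

Running the script prints either a single `pip install ...` line for the missing packages or a confirmation that everything is in place.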
If all the packages mentioned above are installed, you don't need anything else. Time to download the notebooks and get going! 🚀
I used the following books to learn statistics, Python and ML algorithms. They should suffice to get started with ML, with the occasional supplemental Google search.
