The name of this repo is Data-Analysis-Tools, but it actually contains the whole data analysis pipeline, including:

| folder | meaning |
|---|---|
| data_wrangling | Turn very raw data into usable, easy-to-understand raw data. |
| data_analysis | Turn raw data into easy-to-visualize or machine-learning-ready data. |
| data_visualization | Visualize the data. |
| datasets | Datasets to experiment with across those folders. |
| projects | Very domain-specific data analysis projects. |

Preparation -> Preprocessing -> Analysis -> Postprocessing
My understanding of each step:

Preparation is the most valuable step, and it is always underestimated.
It covers not only databases, but also the activities needed to actually collect the data from the real world and to generate features.
We need to know how to work with databases and how to use pandas to generate features (after step 3).

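
As a rough illustration of working with a database and generating features with pandas, here is a minimal sketch. The `trips` table, its columns, and the in-memory SQLite database are all hypothetical stand-ins, not something this repo defines; a real project would connect to its own data source.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (vehicle_id TEXT, distance_km REAL, duration_h REAL)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("a", 120.0, 2.0), ("a", 60.0, 1.5), ("b", 30.0, 0.5)],
)
conn.commit()

# Pull the raw rows into a DataFrame.
trips = pd.read_sql_query("SELECT * FROM trips", conn)
conn.close()

# Derive a per-row feature, then aggregate it per vehicle.
trips["avg_speed_kmh"] = trips["distance_km"] / trips["duration_h"]
features = trips.groupby("vehicle_id").agg(
    total_distance_km=("distance_km", "sum"),
    mean_speed_kmh=("avg_speed_kmh", "mean"),
)
print(features)
```

The resulting `features` frame has one row per vehicle, which is the kind of table that the later analysis step can consume directly.
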
Preprocessing is also called data wrangling.

Analysis is where pandas and ML techniques come into play.

Postprocessing requires a variety of data visualization skills.

- Do research on the topic, know the data by heart, and use as much human knowledge as possible.
- Watch for outliers.
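
One common way to spot outliers is the interquartile range (IQR) rule. The sketch below is only an assumption about how that could look with pandas; the helper name `flag_outliers_iqr`, the multiplier `k = 1.5`, and the sample readings are illustrative and not taken from this repo.

```python
import pandas as pd


def flag_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up sample data with one obviously suspicious reading.
readings = pd.Series([10.0, 11.0, 12.0, 11.5, 10.5, 95.0])
mask = flag_outliers_iqr(readings)
outliers = readings[mask]
print(outliers)
```

Whether a flagged value is an error or a real but rare event still depends on domain knowledge, so the mask is a prompt to look closer, not a rule for deleting rows.
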

Project Athena: an AI-based automotive data analysis tool that acts like an experienced data scientist. It tells you important facts about the dataframe and interacts with you to draw conclusions and make predictions.