Welcome to the CS_6820 wiki!
Summary statistics
- Percentiles can help identify the range that covers most of the data.
- Averages and medians can describe central tendency.
- Correlations can indicate strong relationships between variables.
- Box plots can help identify outliers.
- Density plots and histograms show the spread of the data.
- Scatter plots can describe bivariate relationships (see the sketch after this list).
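As a minimal sketch of these checks, assuming pandas and matplotlib are available (the `df` data here is a hypothetical example):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; any numeric DataFrame works the same way.
df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46, 29, 61, 38],
    "income": [31000, 52000, 47000, 88000, 72000, 39000, 95000, 58000],
})

# Percentiles, averages, and medians in one call.
print(df.describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]))

# Correlations between numeric variables.
print(df.corr())

# Box plot (outliers), histogram (spread), scatter plot (bivariate relation).
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
df.boxplot(column="income", ax=axes[0])
df["income"].plot.hist(ax=axes[1], bins=5)
df.plot.scatter(x="age", y="income", ax=axes[2])
plt.tight_layout()
plt.show()
```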
Missing data
- Missing data affects some models more than others.
- Even models that can handle missing data can be sensitive to it; missing values in certain variables can lead to poor predictions.
- Missing data can be more common in production.
- Missing value imputation can get very sophisticated (a simple sketch follows this list).
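A minimal imputation sketch, assuming scikit-learn is available (the matrix `X` is hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing values.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Simple strategy: replace missing values with the column median.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)
print(X_imputed)

# More sophisticated strategies (e.g., model-based imputation such as
# sklearn.impute.IterativeImputer) estimate each missing value from the
# other columns instead of a single summary statistic.
```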
Outliers
- What counts as an outlier is somewhat subjective.
- Outliers can be very common in multidimensional data.
- Some models are less sensitive to outliers than others (e.g., tree models are more robust, while regression models are less so).
- Outliers can be the result of bad data collection, or they can be legitimate extreme (or unusual) values.
- Sometimes outliers are the interesting data points we want to model; other times they just get in the way (a simple detection sketch follows this list).
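A minimal sketch of the common 1.5 × IQR rule (the same rule box plots use to flag points), assuming pandas; the series is hypothetical:

```python
import pandas as pd

# Hypothetical univariate data with one extreme value.
s = pd.Series([12, 14, 15, 13, 16, 14, 15, 120])

# Flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers)  # flags 120

# Whether to drop, cap, or keep such points depends on whether they are
# bad data or legitimate extreme values we actually want to model.
```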
Data granularity and aggregation
- Raw data is often too granular for modeling.
- The granularity of the data affects the interpretation of our model.
- Aggregating data can also reduce bias introduced by more frequent observations in the raw data.
- Aggregating data can also lessen the number of missing values and the effect of outliers (see the sketch after this list).
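A minimal aggregation sketch, assuming pandas; the transaction data is hypothetical:

```python
import pandas as pd

# Hypothetical row-level transaction data: one row per purchase.
rows = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b", "b"],
    "amount": [10.0, None, 30.0, 5.0, 7.0, 500.0],
})

# Aggregate to one row per customer: this coarsens the granularity,
# reduces the weight of frequently observed customers, absorbs the
# missing value, and (with a robust statistic like the median)
# dampens the effect of the 500.0 outlier.
per_customer = rows.groupby("customer")["amount"].agg(["count", "mean", "median"])
print(per_customer)
```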
Feature engineering can:
- Make the model easier to interpret (e.g., binning).
- Capture more complex relationships (e.g., neural networks).
- Reduce data redundancy and dimensionality (e.g., PCA).
- Rescale variables (e.g., standardizing or normalizing; see the sketch after this list).
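A minimal sketch of rescaling and PCA, assuming scikit-learn and NumPy; the generated data is hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with correlated, differently scaled columns.
rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 100)
X = np.column_stack([x1,
                     3 * x1 + rng.normal(0, 0.1, 100),  # nearly redundant
                     rng.normal(0, 100, 100)])          # different scale

# Rescale so each column has mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# PCA removes redundancy: the first two components capture nearly all
# the variance because two of the original columns are highly correlated.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)
```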
A machine learning algorithm uses data to learn rules automatically, simplifying the complexity of the data into relationships described by those rules. A predictive model is an algorithm that learns the prediction rules.
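To make "learning the rules" concrete, here is a minimal sketch, assuming scikit-learn: a shallow decision tree learns prediction rules from the bundled iris data, and the learned rules can be printed:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# The algorithm learns prediction rules from the data automatically.
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The learned model is a compact set of rules summarizing the data.
print(export_text(model, feature_names=["sepal len", "sepal wid",
                                        "petal len", "petal wid"]))
```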
Specifics of the model itself
- Accurate: Are we making good predictions?
- Interpretable: How easy is it to explain how the predictions are made?
- Fast: How long does it take to build a model, and how long does the model take to make predictions?
- Scalable: How much longer do we have to wait if we build or predict using a lot more data? (A timing sketch follows this list.)
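A minimal timing sketch for the "fast" and "scalable" questions, assuming scikit-learn and NumPy; the data and sizes are hypothetical:

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data, just to put rough numbers on "fast" and "scalable".
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100)

t0 = time.perf_counter()
model.fit(X, y)       # time to build the model
t1 = time.perf_counter()
model.predict(X)      # time to make predictions
t2 = time.perf_counter()

print(f"fit: {t1 - t0:.2f}s, predict: {t2 - t1:.2f}s")
# Rerunning with 10x the rows shows how the model scales.
```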
A model is more complex when:
- It relies on more features to learn and predict (e.g., 2 vs. 10 features to predict a target).
- It relies on more complex feature engineering (e.g., polynomial terms, interactions, or principal components).
- It has more computational overhead (e.g., a single decision tree vs. a random forest of 100 trees).
For example:
- A regression model can have more features, or polynomial terms and interaction terms.
- A decision tree can have more or less depth.
- A neural network is similar to regression but much more complex in its feature engineering.
- A random forest is similar to a decision tree, but more complex because it builds many trees (see the sketch after this list).
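A minimal sketch of the simple-vs-complex trade-off, assuming scikit-learn: a shallow single tree against a 100-tree random forest on hypothetical generated data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical classification data.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# A single shallow tree: simple, interpretable, cheap to evaluate.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# A random forest: the same idea, but 100 trees, so more computational
# overhead in exchange for (usually) better accuracy.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("tree", tree), ("forest", forest)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```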
Deployment considerations
- A model that is consumed by a web app needs to be fast.
- A model that is used to predict in batch needs to be scalable.
- A model that updates a dashboard as data streams in may need to be both fast and scalable.
- Is this a good model in terms of accuracy?
The data science process
- Define the objective: What problem am I solving?
- Collect and manage data: What information do I need?
- Build the model: Find patterns in the data that lead to a solution.
- Evaluate and critique the model: Does the model solve my problem?
- Present results and document: Establish that I can solve the problem, and how.
- Deploy the model: Put the model to work solving the problem in the real world.
When choosing a model, consider:
- Whether the model meets the business goals.
- How much pre-processing the model needs.
- How accurate the model is.
- How explainable the model is.
- How fast the model is at making predictions.
- How scalable the model is (for both building and predicting).
(Info collected from Microsoft Azure Machine Learning Presentation)