- features = predictor variables = independent variables;
- target variable = dependent variable = response variable;
- training a model on the data = fitting a model to the data.
Supervised learning uses labeled data as input, usually represented in a table structure. It is divided into:
- Classification: outputs a label, which is a value in a set C (categories).
- Confusion matrix: table used to evaluate the performance of a classification model.
- k-nearest neighbors (KNN): predicts the label of a data point by looking at the k closest labeled data points. Code example: sklearn iris species example
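A minimal sketch of the KNN iris example above, using scikit-learn's built-in iris dataset (the split ratio, random seed, and k = 5 are illustrative choices, not prescribed values):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset: 150 flowers, 4 features, 3 species labels
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Predict each point's species from its k = 5 closest labeled neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)  # mean accuracy on held-out data
```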
- Logistic Regression: outputs probabilities.
- One of the most commonly used ML algorithms for two-class classification;
- The dependent variable follows a Bernoulli distribution;
- The classification threshold is 0.5 by default.
- ROC curve (Receiver Operating Characteristic curve): plots the true positive rate against the false positive rate as the classification threshold varies;
- Estimation: Maximum Likelihood Estimation (MLE);
- Model fit can be assessed through concordance and the KS statistic;
- It is divided into:
- Binary Logistic Regression: target variable has two possible outputs. Examples: spam detection, diabetes prediction, cancer detection, and whether a user will click on an advertisement link or buy a product. Code example: sklearn breast cancer example
- Multinomial Logistic Regression: target variable has 3 or more nominal categories. Examples: types of iris flowers, and types of wine. Code example: sklearn iris species example
- Ordinal Logistic Regression: target variable has 3 or more ordinal categories. Example: restaurant or product rating (from 1 to 5), and classification of documents into categories.
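A minimal sketch of the binary logistic regression breast cancer example, using scikit-learn's built-in dataset (the split, seed, and standardization step are illustrative choices; standardizing helps the solver converge on this data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two-class target: malignant (0) vs. benign (1)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features, then fit logistic regression by maximum likelihood
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
preds = proba >= 0.5                     # default 0.5 decision threshold
auc = roc_auc_score(y_test, proba)       # area under the ROC curve
```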
- Linear Regression: output is a value in R (continuous);
- Estimation: OLS (Ordinary Least Squares), which minimizes the sum of the squares of the residuals (the same as minimizing the mean squared error of the predictions on the training set);
- Code example: sklearn Boston housing prices example
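A minimal linear regression sketch. Note that the Boston housing dataset was removed from scikit-learn in version 1.2, so this sketch uses synthetic data with a known linear relationship instead (the true coefficients 3.0 and 5.0 are arbitrary illustrative values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y ≈ 3x + 5 plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 0.5, size=200)

# Fitting minimizes the residual sum of squares (OLS)
model = LinearRegression()
model.fit(X, y)

# The fitted slope and intercept should be close to 3.0 and 5.0
slope, intercept = model.coef_[0], model.intercept_
```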
Linear Regression vs. Logistic Regression:
- Linear regression gives a continuous output (example: house price); logistic regression gives probabilities that map to discrete classes (example: spam or not spam);
- Linear regression is estimated by Ordinary Least Squares (OLS); logistic regression is estimated by Maximum Likelihood Estimation (MLE);
In Python, we can use scikit-learn (sklearn), TensorFlow, and Keras for Supervised Learning.
Unsupervised learning tries to discover hidden patterns in unlabeled data. Main tasks:
- Clustering: tries to discover the underlying groups (clusters) in a dataset;
- k-means clustering: finds clusters given a number of clusters (scikit-learn);
- Data visualization: hierarchical clustering and t-SNE;
- Dimension reduction techniques: Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF).
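The clustering and dimension reduction techniques above can be sketched with scikit-learn (the iris data and the choice of 3 clusters / 2 components are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Labels are discarded: unsupervised methods work on the features alone
X, _ = load_iris(return_X_y=True)

# k-means: the number of clusters must be chosen up front
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# PCA: project the 4-dimensional iris data onto 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
```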
Reinforcement learning: interaction with the environment to learn how to optimize behavior, using a system of rewards and punishments.
Useful links: