From 43a585a17d82019440f33c9ac6dfdf0c7f575c03 Mon Sep 17 00:00:00 2001 From: Bryan Kwong Date: Thu, 23 Apr 2020 23:32:38 -0700 Subject: [PATCH 1/2] Added section stated in the branch --- .../7_2_Visualizing_Higher_Dimensions.ipynb | 731 ++++++++++++++++++ 10-Textual-Data/10.3 Distance Metrics.ipynb | 660 ++++++++++++++++ .../10.4 Visual Analysis of Text.ipynb | 505 ++++++++++++ 3 files changed, 1896 insertions(+) create mode 100644 07-Unsupervised-Learning/7_2_Visualizing_Higher_Dimensions.ipynb create mode 100644 10-Textual-Data/10.3 Distance Metrics.ipynb create mode 100644 10-Textual-Data/10.4 Visual Analysis of Text.ipynb diff --git a/07-Unsupervised-Learning/7_2_Visualizing_Higher_Dimensions.ipynb b/07-Unsupervised-Learning/7_2_Visualizing_Higher_Dimensions.ipynb new file mode 100644 index 0000000..c2cb755 --- /dev/null +++ b/07-Unsupervised-Learning/7_2_Visualizing_Higher_Dimensions.ipynb @@ -0,0 +1,731 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "7.2 Visualizing Higher Dimensions.ipynb", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "XTPLBCWKzfWS", + "colab_type": "text" + }, + "source": [ + "# 7.2 Visualizing Higher Dimensions\n", + "\n", + "By this point, we hope we've convinced you how important it is to visualize your data. While summary statistics are helpful, they don't give us a good grasp of what the entire dataset looks like. In two dimensions, we can use a 2-D scatter plot. In three dimensions, we can use a 3-D scatter plot. But what if we have more than three dimensions? This section discusses how we can visualize data with more than three dimensions." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f1mm_fCWm9dO", + "colab_type": "text" + }, + "source": [ + "## Goal of Visualizing Higher Dimensions\n", + "\n", + ">\"I am a Tralfamadorian, seeing all time as you might see a stretch of the Rocky Mountains. All time is all time. It does not change. It does not lend itself to warnings or explanations. It simply is.\" \n", + ">\n", + ">-Kurt Vonnegut in \"Slaughterhouse-Five\"\n", + "\n", + "Unfortunately, we are not Tralfamadorians. We are three-dimensional beings who can't see a fourth dimension as if it were a location in the Rocky Mountains. However, this doesn't mean that the fourth dimension is meaningless to us. We can derive a lot of understanding from the higher dimensions. The problem is that we can't see them, and thus we can't plot them directly. \n", + "\n", + "Fortunately, very clever mathematicians throughout history have invented techniques that let us approximate what higher-dimensional data would look like. The rest of the section will discuss these techniques." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XR-JkcQuwozv", + "colab_type": "text" + }, + "source": [ + "## Using Size and Color\n", + "\n", + "This is more of a review from previous sections, but one way to visualize more dimensions is to use the size and color attributes of your scatter plots. This is rather intuitive. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "kgChDoxMwyKa", + "colab_type": "code", + "outputId": "cc578404-1e67-476a-f736-e070bc9c233e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 368 + } + }, + "source": [ + "import pandas as pd\n", + "from scipy import stats\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", + "import altair as alt\n", + "\n", + "df_bordeaux = pd.read_csv(\"http://dlsun.github.io/pods/data/bordeaux.csv\")\n", + "\n", + "alt.Chart(df_bordeaux).mark_circle().encode(\n", + " alt.X('age',\n", + " scale=alt.Scale(zero=False)\n", + " ),\n", + " alt.Y('sep',\n", + " scale=alt.Scale(zero=False)\n", + " ),\n", + " color=\"summer\",\n", + " size=\"win\"\n", + ")" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "alt.Chart(...)" + ], + "text/html": [ + "\n", + "
\n", + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 30 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gxV83MG_y2LJ", + "colab_type": "text" + }, + "source": [ + "I am sure you can see the limitation of this method: you can only go up to 4 dimensions (5 if you use a 3-D scatter plot). This is still worth mentioning because sometimes this may be all you need. \n", + "\n", + "For higher dimensions, we should consider either feature selection or dimensionality reduction. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tlWDTyDsr5Zv", + "colab_type": "text" + }, + "source": [ + "## Feature Selection \n", + "\n", + "If we want to compress 10 dimensions worth of data into 2 dimensions, we're bound to lose some detail during that compression. We can measure how much detail we kept at the end with the explained variance ratio, which gives the fraction of the original variance (detail) we kept after the compression. The higher the ratio, the more variance and detail we kept. \n", + "\n", + "One very simple, almost trivial, way to visualize higher-dimensional data is to plot only the two dimensions that explain the most variation in the data. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "g9kxLPD30dxa", + "colab_type": "code", + "outputId": "c1aedb62-573b-406c-cfa6-844ab4054122", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 136 + } + }, + "source": [ + "sclr = StandardScaler()\n", + "df_bordeaux = pd.DataFrame(sclr.fit_transform(df_bordeaux.dropna()), columns=df_bordeaux.columns)\n", + "\n", + "X = df_bordeaux.drop(\"price\", axis=1)\n", + "y = df_bordeaux[\"price\"]\n", + "\n", + "reg = LinearRegression()\n", + "reg.fit(X, y)\n", + "\n", + "scores = pd.Series(dtype=float, name=\"R^2 Values\")\n", + "for column in X.columns: \n", + "    reg = LinearRegression()\n", + "    reg.fit(X[[column]], y)\n", + "    scores[column] = reg.score(X[[column]], y)\n", + "\n", + "scores.sort_values(ascending=False)" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "summer 0.343538\n", + "sep 0.323582\n", + "age 0.206936\n", + "year 0.206936\n", + "har 0.199621\n", + "win 0.053456\n", + "Name: R^2 Values, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qHprVgDLxKyE", + "colab_type": "text" + }, + "source": [ + "As we can see above, the average summer temperature (**summer**) and the average September temperature (**sep**) are the two variables that individually explain the most variance in the wine price (**price**). Thus, if we want to represent the dataset with only two of its original dimensions, we can make a scatterplot of **summer** vs **sep**. However, even with these two variables, we only capture 33% of the variance of the original features. The other 67% is lost with the features we chose to ignore. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FcUCeYGOx2nP", + "colab_type": "code", + "outputId": "16dc19dd-f2ac-44b7-8ac6-0236cac283d4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 330 + } + }, + "source": [ + "explained_var = X[[\"summer\", \"sep\"]].var(axis=0).sum() / X.var(axis=0).sum() \n", + "\n", + "print(\"% Variance Explained:\", explained_var, end=\"\\n\\n\")\n", + "df_bordeaux.plot.scatter(x=\"summer\", y=\"sep\")" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "stream", + "text": [ + "% Variance Explained: 0.3333333333333333\n", + "\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 32 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEGCAYAAABsLkJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAUNklEQVR4nO3dfYxcV33G8efZeLu26rQYrwvEDpiQqLwaJ2zTgCmF8NKQVqZgKFDRgqByU4RU1Ao7CLX0TY1spP5RUVqshAKCBigmtVsCIWAoL2rSrINf8kJCQEmzbkrM1glZsLfrzK9/zF083njt2ezcOefO+X6kVWbvzO793ZvxPHvOPedcR4QAAOUZSl0AACANAgAACkUAAEChCAAAKBQBAACFWpK6gIUYHR2NtWvXpi4DABpl7969P4yIVXO3NyoA1q5dq/Hx8dRlAECj2L7vVNvpAgKAQhEAAFAoAgAACkUAAEChCAAAKBQBAABzTE5Na//9D2lyajp1KbVq1DBQAKjbrn2HtHXnAQ0PDWmm1dL2Teu0cf3q1GXVghYAAFQmp6a1decBHZtp6ZHp4zo209KWnQcGtiVAAABAZeLIUQ0PnfyxODw0pIkjRxNVVC8CAAAqa1Ys00yrddK2mVZLa1YsS1RRvQgAAKisXD6i7ZvWaenwkM4eWaKlw0PavmmdVi4fSV1aLbgIDAAdNq5frQ3nj2riyFGtWbFsYD/8JQIAAB5j5fKRgf7gn0UXEAAUigAAgEIRAABQKAIAAApFAABAoQgAACgUAQAAhSIAAKBQBAAAFIoAAIBCEQAAUCgCAAAKlSwAbJ9r+6u277B9u+0/TFULAJQo5WqgxyX9cUTcavtsSXtt3xgRdySsCQCKkawFEBEPRMSt1eNHJN0paTDvvAwAGcriGoDttZIulHTzKZ7bbHvc9vjhw4f7XRoADKzkAWB7uaSdkt4dET+a+3xE7IiIsYgYW7VqVf8LBIABlTQAbA+r/eH/yYj4XMpaAKA0KUcBWdI1ku6MiL9JVQcAlCplC2CDpN+RdKntfdXX5QnrAYCiJBsGGhHflORU+weA0iW/CAwASIMAAIBCEQAAUCgCAAAKRQAAQKEIAAAoFAEAAIUiAACgUAQ
AABSKAACAQhEAAFAoAgAACkUAAEChCACgzyanprX//oc0OTWduhQULtly0ECJdu07pK07D2h4aEgzrZa2b1qnjetXpy4LhaIFAPTJ5NS0tu48oGMzLT0yfVzHZlrasvMALQEkQwAAfTJx5KiGh07+Jzc8NKSJI0cTVYTSEQBAn6xZsUwzrdZJ22ZaLa1ZsSxRRSgdAQD0ycrlI9q+aZ2WDg/p7JElWjo8pO2b1mnl8pHUpaFQXAQG+mjj+tXacP6oJo4c1ZoVy/jwR1IEANBnK5eP8MGPLNAFBACFIgAAoFAEAADMUcpsba4BAECHkmZr0wIAgEpps7UJAACo9Gu2di5dTHQBAUClH7O1c+piogUAAJW6Z2vn1sVECwAAOtQ5W3u2i+mYTrQyZruYUkwOJAAAYI66ZmvntiAgXUAA0Ce5LQhICwAA+iinBQEJAADos1wWBKQLCAAKlTQAbH/E9oO2b0tZBwCUKHUL4KOSLktcAxool5mU/VDSsaK/kl4DiIiv216bsgY0T04zKetW0rGi/1K3AM7I9mbb47bHDx8+nLocJJbbTMo6lXSsSCP7AIiIHRExFhFjq1atSl0OEuvXYl05KOlYkUb2AQB0ym0mZZ1KOlakQQCgUXKbSVmnko4VaTgi0u3cvlbSSyWNSvqBpPdHxDXzvX5sbCzGx8f7VB1yNjk1ncVMyn4o6VhRD9t7I2Js7vbUo4DenHL/aK5cZlL2Q0nHiv6iCwgACkUAAEChCAAAKBQBAACFIgAAoFAEAFAwFporGzeEAQrFQnOgBQAUiIXmIBEAQJEGYaE5uq8Wjy4goEBNX2iO7qveoAUAFKjJC83RfdU7tACAQm1cv1obzh9t3EJzs91Xx3SiBTPbfdWUY8gFAQAUrIkLzTW9+yondAEBaJQmd1/lhhYAgMZpavdVbggAAI3UxO6r3NAFBACFIgAAoFAEAAAUigAAgEIRAABQKAIAAApFAABAoRYUALZ/zvbZdRUDAOifrgLA9i/ZPijpgKTbbO+3/YJ6S0PTsV47kLduZwJfI+mdEfENSbL9Ykn/KGldXYWh2VivHchft11Aj85++EtSRHxT0vF6SkLTsV470AzdtgD+3faHJV0rKSS9UdLXbF8kSRFxa031oYFYrx1ohm4D4PnVf98/Z/uFagfCpT2rCI03COu1T05NF7HSZCnHiVPrKgAi4mV1F4LBMbte+5Y51wCa8gFTyvWLUo4T83NEnPlF9pMk/bWkcyLi1bafLemFEXFN3QV2Ghsbi/Hx8X7uEovQxL8uJ6emtWHbHh2bOdGCWTo8pG9tvbQxx9CNUo4Tbbb3RsTY3O3dXgT+qKQbJJ1TfX+3pHf3pjQMqpXLR/T8c5/QqA+U2esXnWavX8ynicNdH89xPh5NPDcl6fYawGhEfMb2eyUpIo7bfrTGuoAkFnr9oqndKP24TtPUc1OSblsAP7a9Uu0LvrJ9iaSHa6sKSGQh95tt8nDXuu+r2+RzU5JuWwB/JGm3pGfY/pakVZJeX1tVQELd3m+26cNd67yvbtPPTSm6DYBnSHq1pHMlbZL0ywv4WaBxurnf7CAMd63rvrqDcG5K0G0X0J9ExI8krZD0MkkfkvT3i9257cts32X7HttXLvb3Af1UdzdKk3FumqHbYaDfjogLbV8l6WBE/NPstse9Y/sstUcTvVLShKRbJL05Iu6Y72cYBoocNXG4a79wbvIw3zDQbrtxDlVLQbxS0jbbI1r8vQQulnRPRHy/KvBTkl4jad4AAHJUVzfKIODc5K3bD/HfUnsewK9FxEOSnijpPYvc92pJ93d8P1FtO4ntzbbHbY8fPnx4kbsEAMzqdimIn0j6XMf3D0h6oK6i5ux7h6QdUrsLqB/7BIASpLwl5CG1RxXNWlNtA5ApZvYOlpRDOW+RdIHtp6v9wf8mSb+dsB4Ap8HM3sGTrAUQEcclvUvtawt3SvpMRNyeqh4A82Nm72BKOpkrIq6XdH3KGkrGED1
0K8eZvbx/F4/ZvIWiOY+FyG1mL+/f3kh5ERiJ0JzHQuU0s5f3b+/QAihQjs155K/OxeMWgvdv7xAABcqtOY/myGFmL+/f3qELqEA5NeeBheL92ztdLQaXCxaD6y1GUTQD/59OjfPSvcUuBocBlENzHqfHaJf58f5dPLqAgEwx2gV1IwCATM2Oduk0O9oF6AUCADiFHBY9Y7QL6sY1AGCOXPrdZ0e7bJlTC/3e6BUCAOjQ2e8+O9Foy84D2nD+aJIP3lwmX2EwEQBAhxxnmTLaBXXhGgDQYc2KZTo6c/ykbUdnjtPvjoFEAABz2D7t98CgIACADhNHjmrpkrNO2rZ0yVkMvcRAIgCADgy9REkIAKADC42hJIwCQteavPjWQmpf6NDLJp8XlI0AQFdymRz1eDye2rsdetnk8wLQBYQzavKiZHXW3uTzAkgEALrQ5EXJ6qy9yecFkAgAdKHJI2PqrL3J5wXNUtfihAQAzqjJI2PqrL3J5wXNsWvfIW3Ytkdvufpmbdi2R7v3HerZ7+aWkOhak0e71Fl7k88L8jY5Na0N2/bo2MyJlubS4SF9a+ulC3qvcUtILFqTFyWrs/Ymnxfkre7FCekCAoBM1X2diQAAgEzVfZ2JLiAAyFidNwUiAAAgc3VdZ6ILCAAKRQAAfVbXpB5goegCAvqIxeOQE1oAQJ+weBxyQwAAfcLicchNkgCw/Qbbt9tu2X7M9GRgELF4HHKTqgVwm6TXSfp6ov0DfcficchNkovAEXGnJNlOsXsgmTon9QALlf0oINubJW2WpKc+9amJqwEWj8XjkIvaAsD2lyU9+RRPvS8idnX7eyJih6QdUns56B6VBwDFqy0AIuIVdf1uAMDiMQwUAAqVahjoa21PSHqhpM/bviFFHQBQslSjgK6TdF2KfQMA2ugCAoBCEQAAUCgCAAAKRQAAQKEIAAAoFAEAAIUiAACgUAQAABSKAACAQhEAAFAoAgAACkUAAEChCAAAKBQBAACFIgAAoFAEAAAUigAAgEIRAABQKAIAAApFAABAoQiAzE1OTWv//Q9pcmo6dSkABsyS1AVgfrv2HdLWnQc0PDSkmVZL2zet08b1q1OXBWBA0ALI1OTUtLbuPKBjMy09Mn1cx2Za2rLzAC0BAD1DAGRq4shRDQ+d/L9neGhIE0eOJqoIwKAhADK1ZsUyzbRaJ22babW0ZsWyRBUBGDQEQKZWLh/R9k3rtHR4SGePLNHS4SFt37ROK5ePpC4NwIAo4iLw5NS0Jo4c1ZoVyxr1Abpx/WptOH+0kbUDyN/AB0DTR9KsXD7CBz+AWgx0FxAjaQBgfgMdAIykAYD5DXQAMJIGAOY30AHASBoAmN/AXwRmJA0AnNrAB4DESBoAOJWB7gICAMyPAACAQiUJANsfsP0d2wdsX2f7CSnqANBc3Ctj8VK1AG6U9NyIWCfpbknvTVQHgAbate+QNmzbo7dcfbM2bNuj3fsOpS6pkZIEQER8KSKOV9/eJGlNijoANA8z/Hsnh2sAb5f0hfmetL3Z9rjt8cOHD/exLAA5YoZ/79Q2DNT2lyU9+RRPvS8idlWveZ+k45I+Od/viYgdknZI0tjYWNRQKoAGYYZ/79QWABHxitM9b/ttkn5D0ssjgg92AF2ZneG/Zc4qv8z1WbgkE8FsXyZpi6RfjYifpKgBQHMxw783Us0E/qCkEUk32pakmyLiikS1AGggZvgvXpIAiIjzU+wXAHBCDqOAAAAJEAAAUCgCAAAKRQAAQKHcpCH4th+RdFfqOuYYlfTD1EWcQo515ViTlGddOdYk5VlXjjVJedX1tIhYNXdj024Ic1dEjKUuopPt8dxqkvKsK8eapDzryrEmKc+6cqxJyreuTnQBAUChCAAAKFTTAmBH6gJOIceapDzryrEmKc+6cqxJyrOuHGuS8q3rpxp1ERgA0DtNawEAAHqEAACAQmUdAN3ePN72vbYP2t5nezyTmi6zfZfte2xfWWdN1f7eYPt
22y3b8w496/O56ramfp+rJ9q+0fZ3q/+umOd1j1bnaZ/t3TXVctpjtz1i+9PV8zfbXltHHQus6W22D3ecm9/rQ00fsf2g7dvmed62/7aq+YDti+quqcu6Xmr74Y5z9af9qKtrEZHtl6RXSVpSPd4mads8r7tX0mguNUk6S9L3JJ0n6Wck7Zf07JrrepakX5T0NUljp3ldP8/VGWtKdK62S7qyenzlad5XUzXXccZjl/ROSf9QPX6TpE9nUNPbJH2wH++hjn2+RNJFkm6b5/nL1b61rCVdIunmTOp6qaR/6+e5WshX1i2AyPDm8V3WdLGkeyLi+xHxf5I+Jek1Ndd1Z0RkNUu6y5r6fq6q3/+x6vHHJP1mzfubTzfH3lnrZyW93NVNNBLW1HcR8XVJ/3ual7xG0sej7SZJT7D9lAzqylrWATDH6W4eH5K+ZHuv7c0Z1LRa0v0d309U23KQ6lzNJ8W5elJEPFA9/h9JT5rndUttj9u+yXYdIdHNsf/0NdUfHg9LWllDLQupSZI2VV0tn7V9bo31dCvnf3MvtL3f9hdsPyd1MZ2SLwXRo5vHvzgiDtn+BbXvMvadKplT1tRz3dTVhb6fqxROV1fnNxERtucbC/206lydJ2mP7YMR8b1e19pA/yrp2oiYtv37ardQLk1cU65uVft9NGX7ckn/IumCxDX9VPIAiB7cPD4iDlX/fdD2dWo3Yx/3h1oPajokqfOvojXVtkU5U11d/o6+nqsu9P1c2f6B7adExANVN8GD8/yO2XP1fdtfk3Sh2v3jvdLNsc++ZsL2Ekk/L2myhzUsuKaI6Nz/1WpfU0mtlvfRYkXEjzoeX2/7Q7ZHIyKLReKy7gLyiZvHb4x5bh5v+2dtnz37WO2LtKe8It+vmiTdIukC20+3/TNqX7yrZRTJQvT7XHUpxbnaLemt1eO3SnpMS8X2Ctsj1eNRSRsk3dHjOro59s5aXy9pz3x/CPWrpjl96xsl3VljPd3aLel3q9FAl0h6uKObLxnbT569ZmP7YrU/c+sM8IVJfRX6dF+S7lG7X29f9TU7GuIcSddXj89Te6TCfkm3q931kLSm6vvLJd2t9l+MtdZU7e+1avd7Tkv6gaQbMjhXZ6wp0blaKekrkr4r6cuSnlhtH5N0dfX4RZIOVufqoKR31FTLY45d0l+o/QeGJC2V9M/V++4/JZ3Xh/Nzppquqt4/+yV9VdIz+1DTtZIekDRTvafeIekKSVdUz1vS31U1H9RpRsL1ua53dZyrmyS9qB91dfvFUhAAUKisu4AAAPUhAACgUAQAABSKAACAQhEAAFAoAgAACkUAAJmpJjPxbxO1402GIlWzoj9fLdJ1m+03un2vhNHq+bFq+QfZ/jPbH7P9Ddv32X6d7e1u31fhi7aHq9fda/uqat33cdsX2b7B9vdsX9Gx7/fYvqVaTO3Pq21r3V6D/+Nqz87OYYE1DDgCAKW6TNJ/R8TzI+K5kr54htc/Q+0FzzZK+oSkr0bE8yQdlfTrHa/7r4hYL+kbkj6q9vINl0ia/aB/ldqLgV0sab2kF9h+SfWzF0j6UEQ8JyLuW/whAqdHAKBUByW90vY2278SEQ+f4fVfiIiZ6ufO0onAOChpbcfrdndsvzkiHomIw5Km3b573Kuqr2+rvVLkM3Vidcj7or2WPdAXyVcDBVKIiLvdvm3g5ZL+yvZX1F7ee/aPoqVzfmS6+rmW7Zk4sYZKSyf/O5ru2D7dsX32dZZ0VUR8uPOXu32rxx8v5piAhaIFgCLZPkfSTyLiE5I+oPZt/e6V9ILqJZtq2vUNkt5ue3lVx+rq3gxA39ECQKmeJ+kDtltqr+T4B5KWSbrG9l+qfR/jnouIL9l+lqT/qFYJnpL0FkmP1rE/4HRYDRQACkUXEAAUigAAgEIRAABQKAIAAApFAABAoQgAACgUAQAAhfp/rOG9WPsCRBEAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y6gWdO1CINp-", + "colab_type": "text" + }, + "source": [ + "Additionally, if we look below, using only two features has hindered our predictive accuracy. This sucks! Fortunately, some very clever mathematicians came up with ways to get around this. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "l7re2E6KHUbs", + "colab_type": "code", + "outputId": "24c2f5c2-ff71-420e-874c-2bbdee6f9a8f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "reg = LinearRegression()\n", + "\n", + "reg.fit(X, y)\n", + "print(\"All Features:\\t\\t\", reg.score(X, y))\n", + "reg.fit(X[[\"summer\", \"sep\"]], y)\n", + "print(\"With Two Features:\\t\", reg.score(X[[\"summer\", \"sep\"]], y), end=\"\\n\\n\")" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "stream", + "text": [ + "All Features:\t\t 0.7526018827767169\n", + "With Two Features:\t 0.4633153344681292\n", + "\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tflTZWS688xW", + "colab_type": "text" + }, + "source": [ + "## Dimensionality Reduction\n", + "\n", + "With feature selection, we were only able to capture 33% of the original variance, which isn't great. To capture more variation while still using only two variables, we have to utilize some clever math.\n", + "\n", + "These clever mathematical techniques are known as feature creation: we create new variables that help us visualize higher-dimensional data. There are many different techniques for dimensionality reduction, each of which attempts to accomplish something different. Let's start off with the simplest and most popular one: **Principal Component Analysis**." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cZ52OQlGAm6Q", + "colab_type": "text" + }, + "source": [ + "### Principal Component Analysis (PCA)\n", + "\n", + "Simply put, Principal Component Analysis creates new features that maximize variance. PCA is **not** doing feature selection; rather, it creates entirely new features, each of which is a linear combination of all the original features. \n", + "\n", + "PCA involves some simple linear algebra, but scikit-learn has a PCA implementation. Note that all dimensionality reduction algorithms in scikit-learn operate very similarly to the machine learning models you learned in the previous chapters: create the object and then run `fit()` or `fit_transform()`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FESc4NoI8-3v", + "colab_type": "code", + "outputId": "1d1b00d1-65f5-40dd-ca5f-93f9ba84ef7c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 255 + } + }, + "source": [ + "from sklearn.decomposition import PCA\n", + "\n", + "# We want to be able to plot this on a 2-D scatter plot so we choose 2 Dimensions\n", + "dimension_we_want = 2\n", + "\n", + "pca = PCA(n_components=dimension_we_want)\n", + "X_2d = pca.fit_transform(X) \n", + "X_2d = pd.DataFrame(X_2d, columns=[\"PCA1\", \"PCA2\"])\n", + "\n", + "print(\"\\n% Variance Explained:\", pca.explained_variance_ratio_.sum(), end=\"\\n\\n\")\n", + "X_2d.head()" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "stream", + "text": [ + "\n", + "% Variance Explained: 0.6462543462926849\n", + "\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " PCA1 PCA2\n", + "0 2.620822 -1.437859\n", + "1 2.166446 0.732196\n", + "2 2.359265 -0.067934\n", + "3 1.518218 -0.792417\n", + "4 1.519920 0.451721" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 36 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Gj87JXmmBdNu", + "colab_type": "code", + "outputId": "4574186e-9f94-4b91-9cf9-ba4a68b9f866", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "source": [ + "X_2d.plot.scatter(x=\"PCA1\", y=\"PCA2\")" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 37 + }, + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEGCAYAAABsLkJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAATe0lEQVR4nO3dcWzcZ33H8c/nEtfxcLZGjpWucUMqUtCqYIKwOjZv0lrYKIgFtVk1OsHEYIuQhkQlpoSqbGOaJtGwMaSViUWUwaSKqpvbBdFWNFVgUAQFB7mmaQrqGKyOGA1eSmOwXaf33R93Xm3X8Tm5+93zu3veL8mS73eX+30vd/597nme3/P8HBECAOSnkroAAEAaBAAAZIoAAIBMEQAAkCkCAAAytTF1ARdi69atsXPnztRlAEBHOX78+E8iYnDl9o4KgJ07d2p8fDx1GQDQUWz/cLXtdAEBQKYIAADIFAEAAJkiAAAgUwQAAGSKAADQEtMz83rs6Wc1PTOfuhSsU0edBgqgnI5MnNLBsUn1VCpaqFZ1aN+w9u7ZnrosNEALAEBTpmfmdXBsUnMLVZ2dP6e5haoOjE3SEugABACApkydmVVPZfmhpKdS0dSZ2UQVYb0IAABNGdrSp4Vqddm2hWpVQ1v6ElWE9SIAADRloL9Xh/YNa1NPRZt7N2pTT0WH9g1roL83dWlogEFgAE3bu2e7Rndt1dSZWQ1t6ePg3yEIAAAtMdDfy4G/w9AFBACZIgAAIFPJAsD2JtvftP2Y7RO2/ypVLQCQo5RjAPOSrouIGds9kh6x/WBEfCNhTQCQjWQBEBEhaaZ+s6f+E6nqAYDcJB0DsL3B9oSkZyQdjYhHV3nMftvjtsdPnz7d/iIBoEslDYCIeCEi9kgaknSN7d2rPOZwRIxExMjg4EuuaQwAuEilOAsoIp6V9CVJ16euBQBykfIsoEHbl9Z/75P025KeTFUPAOQm5VlAvyzps7Y3qBZE90TEFxLWAwBZSXkW0KSk16baPwDkrhRjAACA9iMAACBTBAAAZIoAAIBMEQAAkCkCAAAyRQAAQKYIAADIFAEAAJkiAAAgUwQAAGSKAACATBEAAJApAgAAMkUAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMhUsgCwfYXtL9l+wvY
J2+9PVQsA5Ghjwn2fk/SBiPi27c2Sjts+GhFPJKwJALKRrAUQET+KiG/Xfz8r6aSk7anqAYDclGIMwPZOSa+V9Ogq9+23PW57/PTp0+0uDQC6VvIAsN0vaUzSLRHx3Mr7I+JwRIxExMjg4GD7CwSALpU0AGz3qHbwvysi7k1ZCwDkJuVZQJZ0p6STEfGxVHUAQK5StgBGJb1T0nW2J+o/b0lYDwBkJdlpoBHxiCSn2j8A5C75IDAAIA0CAAAyRQAAQKYIAADIFAEAAJkiAIAVpmfm9djTz2p6Zj51KUChUq4GCpTOkYlTOjg2qZ5KRQvVqg7tG9bePaxRiO5ECwCom56Z18GxSc0tVHV2/pzmFqo6MDZJSwBdiwAA6qbOzKqnsvxPoqdS0dSZ2UQVAcUiAIC6oS19WqhWl21bqFY1tKUvUUVAsQgAoG6gv1eH9g1rU09Fm3s3alNPRYf2DWugvzd1aUAhGAQGlti7Z7tGd23V1JlZDW3p4+CPrkYAACsM9Pdy4EcW6AICgEwRAACQKQIAQCGYUV1+jAEAaDlmVHcGWgB1fFsBWoMZ1Z2DFoD4tgK00uKM6jm9OKlucUY1Z1eVS/YtAL6tdB9ac2kxo7pzZBUAqx0YWP+luxyZOKXR24/pHZ96VKO3H9PnJ06lLik7zKjuHNl0AZ2vm4dvK91jaWtusfvhwNikRndt5eDTZsyo7gxZtADW6ubh20r3oDVXLgP9vXrNFZfyt1RiWbQAGg1K8W2lO9CaAy5M0haA7U/bfsb240XuZz0HBr6tdD5ac8CFSd0C+IykOyT9S5E7WTwwHFgxBsCBofvQmgPWL2kARMRXbO9sx744MOSD1TyB9UndAmjI9n5J+yVpx44dTT0XBwYAeFHpzwKKiMMRMRIRI4ODg6nLAdqOiW0oSulbAEDOWKYERSp9CwDIFcuUoGipTwP9nKSvS3qV7Snb70lZD1AmTGxD0VKfBXRzyv0DZcbENhSNLiCgpJjYVl7dMjDPIDBQYsxfKZ9uGpgnAICSY/5KeXTbirN0AQHAOnXbwDwBAADr9LJLNmj+3AvLtnXywDwBsIZuGejBi3hPcbGOTJzSW+94RJWKJUm9G9zxA/OMAZxHNw30oIb3FBdrad//orB1//t+Q7u2bU5YWXPW1QKw3bPKtq2tL6cYF/qtjxmY3Yf3FM1Yre+/d0NFP3v+hfP8i86wZgDYvtb2lKQf2X5oxdLNDxVZWKtczEXCu22gB7ynaE63Tspr1AI4JOlNEbFV0mFJR22/vn6fC62sBS72W1+3vtk54z3tTGUZs+nWSXmNxgAuiYgTkhQR/2b7pKR7bR+UFIVX16RG1wI+H64g1n14TztP2cZsunFSXqMAWLB9WUT8jyRFxAnbb5D0BUmvKLy6JjXzra8b3+zc8Z52jrJOuOq2SXmNuoA+KGnb0g0RMSXptyR9pKCaWqbZZhsXiu8+vKedgTGb9lizBRARD5/nrs2Snm99Oa3Ht758TM/M8z53CcZs2mPd8wBsD0q6SdLNki6XdF9RRbVatzXb8FJl6y9GcxizaY81A8D2Zkk3SvoDSa+UdK+kKyNiqA21AetS1v5iNIfWe/EatQCekfRNSR+S9EhEhO0bii8LWL+LPdsL5UfrvViNBoFvldQr6R8l3Wq79Gf+ID/0FwMXZ80AiIiPR8TrJb2tvunfJV1u+6DtVxZeHbAO3TpJByiaIy5sPpft3aoNBP9+ROwqpKrzGBkZifHx8XbuEh2Es4CA1dk+HhEjK7c3GgTeJWlbRHxtcVtEPG77QUn/3Poy0clSH4DpLwYuTKNB4I+rNg6w0k8l/b2k3215RehInIYJdJ5Gg8DbIuI7KzfWt+0spCJ0HJZaBjpTowC4dI37OMUCkso5bb8sq0gCZdYoAMZt/8nKjbb/WNLxZndu+3rb37X9lO0PNvt8SKNsp2FezDUggBw1CoBbJP2R7S/b/rv6z39Ieo+k9zezY9sbJH1C0pslXS3
pZttXN/OcSKNMp2HSHQWsX6PF4H4s6ddtXytpd33z/RFxrAX7vkbSUxHxfUmyfbdq8w2eaMFzo83KMm2fWcEoo9RnyJ1Po9NAN0l6r6Rdkr4j6c6IONeifW+X9PSS21OSfnWVGvZL2i9JO3bsaNGuUYQynIZZtu4ooMxnyDXqAvqspBHVDv5vlvS3hVe0QkQcjoiRiBgZHBxs9+7RYcrUHQWUvUuy0TyAqyPi1ZJk+07VFoZrlVOSrlhye6i+DWhKWbqjgLJ3STa8JOTiLxFxzm7pdeC/Jekq21eqduB/u2rLTgNNK0N3FFD2LslGXUCvsf1c/eespOHF320/18yO62MJ75P0RUknJd2zeAF6AOgGZe+SbHQW0IYidx4RD0h6oMh9AEBKZe6SXPclIQEAF6esXZKNuoAAAF2KAACATBEAAJApAgAAMkUAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMgUAQAAmSIAACBTBABWNT0zr8eeflbTM/OpS0HG+BwWiyuC4SWOTJzSwbFJ9VQqWqhWdWjfsPbu2Z66LGSGz2HxaAFgmemZeR0cm9TcQlVn589pbqGqA2OTfANDW/E5bA8CAMtMnZlVT2X5x6KnUtHUmdlEFSFHfA7bgwDAMkNb+rRQrS7btlCtamhLX6KKOgt91q3B57A9CAAsM9Dfq0P7hrWpp6LNvRu1qaeiQ/uGNdDfm7q00jsycUqjtx/TOz71qEZvP6bPT5xKXVLH4nPYHo6I9u/UvknShyX9iqRrImJ8Pf9uZGQkxsfX9VA0aXpmXlNnZjW0pY8/unWYnpnX6O3HNLfw4rfWTT0Vfe3gdfz/NYHPYWvYPh4RIyu3pzoL6HFJN0r6p0T7RwMD/b0d8QdXlgPEYp/1nF4MgMU+6074fyyrTvkcdqokARARJyXJdordo0uU6TRB+qzRiRgDQEdq1WmCrRq0pc8anaiwFoDthyVdtspdt0XEkQt4nv2S9kvSjh07WlQdOl0rulxa3YLYu2e7RndtLUWXFLAehQVARLyxRc9zWNJhqTYI3IrnROdrtstlaQtiMUQOjE1qdNfWpg7c9Fmjk9AFhI7UbJcLE42ARIPAtm+Q9A+SBiXdb3siIt6UohZ0rma6XBi0BRK1ACLivogYiojeiNjGwR8Xa6C/V6+54tIL7nZh0BZgNVBkjEFb5I4AQNYYtEXOGAQGgEwRAACQKQIAADJFAABApggAACi5oi40xFlAAFBiRa56SwsAAEqqVaveng8BAAAlVfSaVQQAAJRU0WtWEQAAUFJFr1nFIDAAlFiRa1YRAABQckWtWUUXELJU1HnVQCehBYDsFHledSebnplnaezMEADISlHXAu50hGKe6AJCVrgW8EsVPdkI5UUAICtcC/ilCMV8EQDICtcCfilCMV+MASA7XAt4ucVQPLBiDCD3/5ccEADIEtcCXo5QzBMBAEASoZgjxgAAIFNJAsD2R20/aXvS9n22L01RBwDkLFUL4Kik3RExLOl7km5NVAcAZCtJAETEQxFxrn7zG5KGUtQBADkrwxjAuyU9eL47be+3PW57/PTp020sqxxYtAxAUQo7C8j2w5IuW+Wu2yLiSP0xt0k6J+mu8z1PRByWdFiSRkZGooBSS4v1WQAUqbAAiIg3rnW/7XdJequkN0REVgf29WDRMgBFS3UW0PWSDkjaGxE/T1FD2bE+C4CipRoDuEPSZklHbU/Y/mSiOkqL9VkAFC3JTOCI2JViv52E9VkAFI2lIEqM9VkAFIkAKDnWZwFQlDLMAwAAJEAAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMgUAQAAmSIAACBTBACyMD0zr8eeflbTM/OpSwFKg+Wg0fWOTJzSwRUX1tm7Z3vqsoDkaAGgq03PzOvg2KTmFqo6O39OcwtVHRibpCUAiABAl5s6M6ueyvKPeU+loqk
zs4kqAsqDAEBXG9rSp4Vqddm2hWpVQ1v6ElUElAcBgK420N+rQ/uGtamnos29G7Wpp6JD+4a5zCYgBoGRgb17tmt011ZNnZnV0JY+Dv5AHQGALAz093LgB1agCwgAMpUkAGz/te1J2xO2H7J9eYo6ACBnqVoAH42I4YjYI+kLkv4iUR0AkK0kARARzy25+TJJkaIOAMhZskFg238j6Q8l/VTStWs8br+k/ZK0Y8eO9hQHABlwRDFfvm0/LOmyVe66LSKOLHncrZI2RcRfruM5T0v6YeuqLNRWST9JXUQb8Xq7W06vtxtf68sjYnDlxsICYL1s75D0QETsTlpIi9kej4iR1HW0C6+3u+X0enN6ranOArpqyc23SXoyRR0AkLNUYwAfsf0qSVXVunTem6gOAMhWkgCIiH0p9ttmh1MX0Ga83u6W0+vN5rUmHwMAAKTBUhAAkCkCAAAyRQAUyPZHbT9ZX/foPtuXpq6pSLZvsn3CdtV2V55GZ/t629+1/ZTtD6aup2i2P237GduPp66laLavsP0l20/UP8fvT11T0QiAYh2VtDsihiV9T9Ktiesp2uOSbpT0ldSFFMH2BkmfkPRmSVdLutn21WmrKtxnJF2fuog2OSfpAxFxtaTXS/rTbn9/CYACRcRDEXGufvMbkoZS1lO0iDgZEd9NXUeBrpH0VER8PyKel3S3avNYulZEfEXS/6auox0i4kcR8e3672clnZS0PW1VxSIA2ufdkh5MXQSasl3S00tuT6nLDxC5sr1T0mslPZq2kmJxRbAmrWfNI9u3qda8vKudtRVhvWs8AZ3Kdr+kMUm3rFi5uOsQAE2KiDeudb/td0l6q6Q3RBdMumj0ervcKUlXLLk9VN+GLmG7R7WD/10RcW/qeopGF1CBbF8v6YCkvRHx89T1oGnfknSV7SttXyLp7ZI+n7gmtIhtS7pT0smI+FjqetqBACjWHZI2Szpav/zlJ1MXVCTbN9iekvRrku63/cXUNbVSfUD/fZK+qNoA4T0RcSJtVcWy/TlJX5f0KttTtt+TuqYCjUp6p6Tr6n+vE7bfkrqoIrEUBABkihYAAGSKAACATBEAAJApAgAAMkUAAECmCABgFbZfqJ8G+Ljtf7X9C/Xtl9m+2/Z/2j5u+wHbr1zy726xPWf7l5ZsG6ivMjlj+44UrwdYDQEArG42IvZExG5Jz0t6b32i0H2SvhwRr4iI16m2wuu2Jf/uZtUmjN24ZNucpD+X9GftKR1YHwIAaOyrknZJulbSQkT8/4S+iHgsIr4qSbZfIalf0odUC4LFx/wsIh5RLQiA0iAAgDXY3qja+v/fkbRb0vE1Hv521ZaI/qpqM2e3rfFYIDkCAFhdn+0JSeOS/lu1NWIauVnS3RFRVW1BsZsKrA9oGquBAqubjYg9SzfYPiHp91Z7sO1XS7pKtXWfJOkSSf+l2npQQCnRAgDW75ikXtv7FzfYHrb9m6p9+/9wROys/1wu6XLbL09VLNAIi8EBq7A9ExH9q2y/XNLHJb1OtUHdH0i6RbUVQt8SEU8ueezHJP04Im63/QNJv6hay+BZSb8TEU8U/TqAtRAAAJApuoAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMjU/wFur9UWE4oSkQAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cT7Rv9lEDSji", + "colab_type": "text" + }, + "source": [ + "As we can see from above, the two new features (known as principal components) are not like any of our input features. Additionally, these two new components explain 64.6% of the original variation. Let's see how well they perform in explaining our dataset." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eBHVGhd-CJF8", + "colab_type": "code", + "colab": {} + }, + "source": [ + "reg = LinearRegression()\n", + "\n", + "reg.fit(X, y)\n", + "print(\"All Features:\\t\\t\", reg.score(X, y))\n", + "reg.fit(X_2d, y)\n", + "print(\"With PCA Features:\\t\", reg.score(X_2d, y), end=\"\\n\\n\")" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bsoN_a8XI4p2", + "colab_type": "text" + }, + "source": [ + "As expected, with a higher explained variance ratio, we perform better in predicting the price of the wine. With dimensionality reduction, we were able to capture most of the variation in the dataset while still being able to view it in two dimensions. In the most basic of terms, PCA creates a variable by projecting the data along the axis of maximum variance.\n", + "\n", + "![](https://raw.githubusercontent.com/bfkwong/data/master/IMG_0187%202.jpg)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vwDiZRX3JYLN", + "colab_type": "text" + }, + "source": [ + "### Optional: Linear Algebra Behind PCA\n", + "\n", + "Principal Component Analysis chooses principal components along the **axis of greatest variance**. In linear algebra terms, the axis of greatest variance is the **eigenvector with the largest eigenvalue of the covariance matrix ($\\Sigma$)**.\n", + "\n", + "The following are the steps to do PCA manually. \n", + "\n", + "1. 
Given data matrix $M$, compute the covariance matrix of $M$, denoted $\\Sigma$\n", + "2. Compute the eigenvectors and eigenvalues of the covariance matrix $\\Sigma$\n", + "3. Project the data onto the eigenvector with the largest eigenvalue using the dot product (for more than one component, use a matrix product with the top eigenvectors)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "J26dEAoAD0q7", + "colab_type": "code", + "colab": {} + }, + "source": [ + "import numpy as np\n", + "\n", + "# Step 1: Calculate covariance matrix \n", + "cov_mtrx = np.cov(X.T)\n", + "\n", + "# Step 2: Calculate Eigenvector and Eigenvalue\n", + "W,v = np.linalg.eig(cov_mtrx)\n", + "\n", + "# Step 3: Find the largest Eigenvalue and project our data onto the corresponding Eigenvector\n", + "idx_largest_eigenval = np.argmax(W)\n", + "eigenvec = v[:,idx_largest_eigenval]\n", + "\n", + "total = []\n", + "for row in X.index: \n", + "    total.append(np.dot(X.loc[row], eigenvec))\n", + "\n", + "pd.DataFrame(pd.Series(total), columns=[\"PCA1\"]).head()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_G7NWh_1nf_k", + "colab_type": "text" + }, + "source": [ + "One interesting thing you may see here is that each eigenvalue equals the variance of the data projected onto its corresponding eigenvector. 
The eigenvalue of PCA1 is 2.26 and the variance of PCA1 is also 2.26." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9QCALn43iTYd", + "colab_type": "code", + "colab": {} + }, + "source": [ + "idx_largest_eigenval = np.argmax(W)\n", + "variance = pd.Series(total).var()\n", + "\n", + "print(\"Largest Eigenvalue:\\t\", W[idx_largest_eigenval])\n", + "print(\"Variance of PCA1:\\t\", variance)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bmLOaYzvEeG7", + "colab_type": "text" + }, + "source": [ + "### Multidimensional Scaling (MDS)\n", + "\n", + "We now pay a visit to our good friend, the Euclidean distance. One incredibly useful aspect of Euclidean distance is that it works in higher dimensions. The formula $$\\sqrt{x_1^2 + x_2^2}$$ gives the Euclidean distance in two dimensions (where $x_1$ and $x_2$ are the differences between two points along each axis), and extending it to a third dimension is as easy as adding an $x_3^2$ term under the square root. One thing to recognize is that the Euclidean distance is a single number no matter how many dimensions we work in. \n", + "\n", + "Why am I rambling on about something you learned in middle school? Well, the realization that Euclidean distance is a scalar in any number of dimensions means that we can preserve the variance of n-dimensional data in two dimensions, as long as we try to ensure that the Euclidean distances in the n-dimensional space are proportional to the Euclidean distances in 2 dimensions. That was a lot to take in; the following image explains the concept.\n", + "\n", + "![](https://raw.githubusercontent.com/bfkwong/data/master/IMG_0188.jpg)\n", + "\n", + "Notice how when we reduced our dimensions from 2 to 1, the distances between points A, B, and C remained the same. That is, the distances x, y, and z remained the same between the two representations. While distances may not always be preserved perfectly between dimensions, MDS attempts to preserve them as well as possible. 
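To make the picture concrete, here is a minimal sketch under stated assumptions: three hypothetical points A, B, and C are placed on a straight line in 2-D, and each is then described by a single coordinate along that line. The helper name pairwise_distances and the point values are illustrative only and are not part of the original notebook. Because these points are collinear, the 1-D version keeps every pairwise distance exactly; for general data, MDS can only approximate this.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    '''Return the Euclidean distance between every pair of labelled points.

    Hypothetical helper for illustration; math.dist works in any number
    of dimensions, which is the property the text relies on.
    '''
    return {
        (a, b): math.dist(points[a], points[b])
        for a, b in combinations(sorted(points), 2)
    }


# Assumed example data: three points that lie on the line y = x in 2-D.
points_2d = {'A': (0.0, 0.0), 'B': (1.0, 1.0), 'C': (3.0, 3.0)}

# A 1-D embedding: each point's position measured along that line
# (its distance from the origin, since all coordinates are non-negative).
points_1d = {name: (math.hypot(x, y),) for name, (x, y) in points_2d.items()}

dist_2d = pairwise_distances(points_2d)
dist_1d = pairwise_distances(points_1d)

# Print each pair with its distance before and after the reduction.
for pair in dist_2d:
    print(pair, round(dist_2d[pair], 4), round(dist_1d[pair], 4))
```

Comparing the two columns of distances pair by pair is the same check one could run on the output of any embedding to see how faithfully it kept the original distances.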
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "L4AjVFrwn78I", + "colab_type": "code", + "colab": {} + }, + "source": [ + "from sklearn.manifold import MDS\n", + "# We want to be able to plot this on a 2-D scatter plot so we choose 2 Dimensions\n", + "dimension_we_want = 2\n", + "\n", + "mds = MDS(n_components=dimension_we_want)\n", + "X_2d = mds.fit_transform(X) \n", + "X_2d = pd.DataFrame(X_2d, columns=[\"Dimension 1\", \"Dimension 2\"])\n", + "\n", + "display(X_2d.head())\n", + "print(\"Fraction of variance explained:\", X_2d.var().sum()/X.var().sum())\n", + "X_2d.plot.scatter(x=\"Dimension 1\", y=\"Dimension 2\")" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mlIA-o_cEnSD", + "colab_type": "text" + }, + "source": [ + "By using MDS, we were actually able to preserve over 95% of the variance from the original dataset. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "88kI9JxzFO6n", + "colab_type": "text" + }, + "source": [ + "### Linear vs Nonlinear Dimensionality Reduction \n", + "\n", + "Thus far, we have explored PCA (a linear reduction technique) and MDS (a nonlinear reduction technique). While there are many differences between the two methods, the key difference can be boiled down to the following statement: **linear dimensionality reduction techniques only stretch and shift the data, while nonlinear techniques make more drastic changes to the data**.\n", + "\n", + "This sometimes leads to nonlinear techniques being better at capturing variance but losing the overall shape of the data, whereas linear techniques are better at keeping the general shape of the original data but lose more variance along the way. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7uprQWlUGC37", + "colab_type": "text" + }, + "source": [ + "# Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UnMJKgsTkDoZ", + "colab_type": "text" + }, + "source": [ + "1. Consider the Iris dataset (https://raw.githubusercontent.com/dlsun/pods/master/data/iris.csv). Drop the \"SepalWidth\" and \"PedalWidth\" columns and then apply PCA on \"SepalLength\" and \"PedalLength\" with `n_components = 2`. How many percent of the variance was PCA able to capture in this case? What happens when we use PCA to compress 2D data into 2D data?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HCXKtA8AlSQq", + "colab_type": "code", + "colab": {} + }, + "source": [ + "" + ], + "execution_count": 0, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/10-Textual-Data/10.3 Distance Metrics.ipynb b/10-Textual-Data/10.3 Distance Metrics.ipynb new file mode 100644 index 0000000..f5137d7 --- /dev/null +++ b/10-Textual-Data/10.3 Distance Metrics.ipynb @@ -0,0 +1,660 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "10.3 Distance Metrics.ipynb", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "PBgTRCHclclP", + "colab_type": "text" + }, + "source": [ + "# 10.3 Distance Metrics\n", + "\n", + "In the last section, we used Cosine Similarity to measure how similar two documents are. However, there are many other measures of similarity that exists. It will be up to us to decide which metric is best for our application. In this section, we will explore these other methods for measuring text similarity. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9BO29LPWh9Bn", + "colab_type": "text" + }, + "source": [ + "## Hamming Distance\n", + "\n", + "The most simple and intuitive method for measuring string similarity is the hamming distance. Given two strings of equal lengths, it measures the number of symbols that are different at each corresponding index. Take the example of the strings `\"BRYAN\"` and `\"BRIAN\"`. Here, we can see that the Hamming difference between the two strings is 1 because they are different at one location (third letter). " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vhw83KbRkRC9", + "colab_type": "code", + "outputId": "a7706f2a-5a7c-4921-c2c2-446ab4789101", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + } + }, + "source": [ + "def hamming_distance(str1, str2): \n", + " if len(str1) != len(str2): \n", + " return None \n", + "\n", + " differences = 0\n", + " for x in range(len(str1)): \n", + " if str1[x] != str2[x]: \n", + " differences += 1 \n", + " \n", + " return differences\n", + "\n", + "hamming_distance(\"Bryan\", \"Brian\") " + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N3B-DJ3HlO0t", + "colab_type": "text" + }, + "source": [ + "In theory, we can modify the Hamming Distance to accept two strings of different lengths, and whatever is left over is counted as a difference. 
The following is a simple implementation of this modified Hamming Distance" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I7S6HPYxlKy3", + "colab_type": "code", + "outputId": "9bfdfa7b-15a9-49f9-f119-8e3a6bf596a3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "def modified_hamming_distance(str1, str2): \n", + "    differences = 0\n", + "\n", + "    # Every leftover character in the longer string counts as a difference\n", + "    if len(str1) != len(str2):\n", + "        differences += abs(len(str1) - len(str2))\n", + "\n", + "    # Compare only the positions that exist in both strings\n", + "    for x in range(min(len(str1), len(str2))): \n", + "        if str1[x] != str2[x]: \n", + "            differences += 1 \n", + " \n", + "    return differences\n", + "\n", + "modified_hamming_distance(\"Bryan\", \"Brian!!!\") " + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "4" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 23 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BethmiSBoTJj", + "colab_type": "text" + }, + "source": [ + "## Edit Distance\n", + "\n", + "A more widely applicable string metric is the edit distance. Generally, the edit distance is the smallest number of operations needed to turn one string into another. These operations may include insertions, deletions, substitutions, transpositions, etc. Edit distances have different names based on their set of operations. For this chapter, we will focus on the Damerau-Levenshtein distance.\n", + "\n", + "### Damerau-Levenshtein Distance\n", + "\n", + "The Damerau-Levenshtein distance has four operations: deletion, insertion, substitution, and transposition. Here's what these operations mean:\n", + "\n", + "* Deletion: **s**cat --> cat
\n", + "\"scat\" and \"cat\" have an edit distance of 1 because we can delete the s in \"scat\" to turn it into \"cat\". \n", + "\n", + "* Insertion: at --> **m**at
\n", + "\"at\" and \"mat\" have an edit distance of 1 because we can insert an m into \"at\" to turn it into \"mat\". \n", + "\n", + "* Substitution: **m**at --> **c**at
\n", + "\"mat\" and \"cat\" have an edit distance of 1 because we can substitute a c for the m in \"mat\" to turn \"mat\" into \"cat\".\n", + "\n", + "* Transposition: e**at** --> e**ta**
\n", + "\"eat\" and \"eta\" have an edit distance of 1 because we can transpose the adjacent characters 'a' and 't' to turn \"eat\" into \"eta\". \n", + "\n", + "Measuring how many steps it takes to turn one string into another tells us how similar the two strings are. While the concept may be simple, it is widely applicable. One of the more interesting applications of the Damerau-Levenshtein distance is linguistic comparison. By computing the DL distance between two strings in different languages that have the same meaning, we can get a general idea of how similar the two languages are. \n", + "\n", + "Damerau-Levenshtein distance is conceptually simple but it is rather tedious to implement efficiently as it requires Dynamic Programming. NLTK provides an implementation through the `edit_distance()` function. Note that by default `edit_distance()` computes only the Levenshtein distance (without transposition). To make `edit_distance()` compute the Damerau-Levenshtein distance, we specify the parameter `transpositions = True`.\n", + "\n", + "\n", + "Below is an example of DL distance in practice, using translations of an excerpt from Winston Churchill's famous \"We shall fight on the beaches\" speech into five different languages.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IKge7Vi6oSR5", + "colab_type": "code", + "outputId": "31720555-6f7e-4da6-987a-e1365eaffd38", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 266 + } + }, + "source": [ + "from nltk.metrics import edit_distance\n", + "import pandas as pd\n", + "\n", + "df_speech = pd.read_csv(\"https://raw.githubusercontent.com/bfkwong/data/master/we_shall_never_surrender.csv\", index_col=\"language\")\n", + "\n", + "# Distance between the English text and the given translated text\n", + "def getDLDistance(language): \n", + "    return edit_distance(df_speech.loc[\"english\"][\"text\"], language, transpositions=True)\n", + "\n", + "df_speech[\"Distance to English\"] = df_speech[\"text\"].apply(getDLDistance)\n", + "df_speech" + ], 
+ "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textDistance to English
language
englishWe shall go on to the end. We shall fight in F...0
spanishLlegaremos hasta el final, lucharemos en Franc...276
germanWir werden bis zum Ende weitermachen. Wir werd...334
portugueseIremos até ao fim. Lutaremos na França. Lutare...281
frenchNous irons jusqu'au bout, nous nous battrons e...326
italianndremo avanti fino alla fine. Combatteremo in ...292
\n", + "
" + ], + "text/plain": [ + " text Distance to English\n", + "language \n", + "english We shall go on to the end. We shall fight in F... 0\n", + "spanish Llegaremos hasta el final, lucharemos en Franc... 276\n", + "german Wir werden bis zum Ende weitermachen. Wir werd... 334\n", + "portuguese Iremos até ao fim. Lutaremos na França. Lutare... 281\n", + "french Nous irons jusqu'au bout, nous nous battrons e... 326\n", + "italian ndremo avanti fino alla fine. Combatteremo in ... 292" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 73 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wHvsnPmzfu-A", + "colab_type": "text" + }, + "source": [ + "When it comes to the vocabularies in this excerpt from Churchill's speech. English is most similar to Spanish, followed by Portuguese, Italian, German, and finally French. It is important that we don't extrapolate this conclusion to the entire language, as these results only shows that with the words and punctuations used in this speech." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LSKNeVzBooXL", + "colab_type": "text" + }, + "source": [ + "## Jaccard Distance\n", + "\n", + "This method of comparing string similarity is more based on statistics than pure comparison like Hamming Distance and Edit Distance. The formulation of this metric is as follows: \n", + "\n", + "$$\n", + "d = 1 - {|X \\cap Y| \\over |X \\cup Y|}\n", + "$$\n", + "\n", + "What this equation tells us is that given two strings, we want to find the number of characters they have in common and divide it by the number of characters in total. Then we subtract it by 1, so its is consistent with all other distance metric where the more dissimilar two strings are, the larger their distance metric. 
Consider the following toy example: " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Qh_exrbztvKp", + "colab_type": "code", + "outputId": "340e9db4-4e95-4494-d7a5-f31df7bf987d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "documentX = \"Republic of Korea\"\n", + "documentY = \"The Republic of Albania\"\n", + "\n", + "setX = set(documentX)\n", + "setY = set(documentY)\n", + "\n", + "intersection = setX.intersection(setY)\n", + "union = setX.union(setY)\n", + "\n", + "jaccard_dist = 1 - (len(intersection)/len(union))\n", + "jaccard_dist" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.33333333333333337" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 94 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yQG8k-bptvfA", + "colab_type": "text" + }, + "source": [ + "Of course, this can be applied to words of a string as well instead of individual characters. 
Consider the following code: " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KPR7puJQuslG", + "colab_type": "code", + "outputId": "b1a4d95f-0530-4e8f-a742-8ab394c73995", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "import pandas as pd\n", + "import requests\n", + "import re\n", + "\n", + "fed_1 = requests.get(\"http://dlsun.github.io/pods/data/federalist/1.txt\").text.lower()\n", + "fed_2 = requests.get(\"http://dlsun.github.io/pods/data/federalist/2.txt\").text.lower()\n", + "\n", + "fed_1 = set([x for x in re.split(\"\\n| |,|\\.|\\(|\\)\", fed_1) if x != '' and x != '\"'])\n", + "fed_2 = set([x for x in re.split(\"\\n| |,|\\.|\\(|\\)\", fed_2) if x != '' and x != '\"'])\n", + "\n", + "intersect = fed_1.intersection(fed_2)\n", + "union = fed_1.union(fed_2)\n", + "\n", + "jaccard_dist = 1 - (len(intersect)/len(union))\n", + "jaccard_dist" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.7766895200783546" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 118 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6OfWaiK40pp-", + "colab_type": "text" + }, + "source": [ + "Fortunately for us, NLTK has a Jaccard Distance implementation for us to use so that we don't have to write all this code. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZT_pa4cj03Xb", + "colab_type": "code", + "outputId": "0bab68e3-59f1-4a22-fa76-225dac2daf39", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "jaccard_distance(fed_1, fed_2)" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.7766895200783546" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 120 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eiUHy1wPFKM0", + "colab_type": "text" + }, + "source": [ + "On the surface, Jaccard distance seems a bit generic, but that is what makes this such a popular metric. Its only requirement is that the inputs are two sets which means that we can compare basically anything. One additional benefits of Jaccard distance is that duplicates does not affect the results since Jaccard distance operates on sets. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YALHZEiVtxO3", + "colab_type": "text" + }, + "source": [ + "## Building Similarity Matrix\n", + "\n", + "Just like in our previous section, we want to build a similarity matrix that tells us which document is most similar to which other. To create a similarity matrix $S$, we want to ensure that $S_{ij}$ gives us the similarity between string i and string j. Note that in all the formulations we have given above, 0 represent identical strings, this can of course be inverted to fit your needs. 
\n", + "\n", + "The code below generates a similarity matrix for the languages used in the translations of the \"We shall fight on the beaches\" speech. Note that `edit_distance()` is called here without `transpositions=True`, so the entries are plain Levenshtein distances.\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oh6l7PK5frn8", + "colab_type": "code", + "outputId": "d64698fa-a6bb-4536-dac1-5009489c7f01", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 266 + } + }, + "source": [ + "import numpy as np\n", + "\n", + "mtrx = np.zeros((df_speech.shape[0], df_speech.shape[0]))\n", + "for i in range(df_speech.shape[0]):\n", + "    # The matrix is symmetric, so compute each pair once and mirror it\n", + "    for j in range(i + 1, df_speech.shape[0]):\n", + "        d = edit_distance(df_speech.iloc[i][\"text\"], df_speech.iloc[j][\"text\"])\n", + "        mtrx[i][j] = d\n", + "        mtrx[j][i] = d\n", + "\n", + "mtrx = pd.DataFrame(mtrx, index=df_speech.index, columns=df_speech.index)\n", + "mtrx" + ], + "execution_count": 0, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
languageenglishspanishgermanportuguesefrenchitalian
language
english0.0276.0334.0281.0328.0293.0
spanish276.00.0338.0130.0280.0220.0
german334.0338.00.0344.0368.0345.0
portuguese281.0130.0344.00.0280.0190.0
french328.0280.0368.0280.00.0288.0
italian293.0220.0345.0190.0288.00.0
\n", + "
" + ], + "text/plain": [ + "language english spanish german portuguese french italian\n", + "language \n", + "english 0.0 276.0 334.0 281.0 328.0 293.0\n", + "spanish 276.0 0.0 338.0 130.0 280.0 220.0\n", + "german 334.0 338.0 0.0 344.0 368.0 345.0\n", + "portuguese 281.0 130.0 344.0 0.0 280.0 190.0\n", + "french 328.0 280.0 368.0 280.0 0.0 288.0\n", + "italian 293.0 220.0 345.0 190.0 288.0 0.0" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 87 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lP0zNA9Svyb4", + "colab_type": "text" + }, + "source": [ + "This similarity matrix tells us how similar languages are to each other. It makes sense for this matrix to suggest that Spanish is more similar to Portuguese and Italian than English, French, or German. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lUmS6CXZwtdT", + "colab_type": "text" + }, + "source": [ + "# Exercises " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q3K2dw5oCd2Z", + "colab_type": "text" + }, + "source": [ + "1. Autocorrects determine if a word is correct by comparing it to an existing list of words. Download a dictionary of words from (https://raw.githubusercontent.com/dwyl/english-words/master/words_dictionary.json) and write a function that takes in a string and determine if there are any misspelling using the Hamming Distance." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p3tY5LkR4qVY", + "colab_type": "text" + }, + "source": [ + "2. Damerau-Levenshtein distance is often used to measure how similar two RNA/DNA sequences are. Download the data from (link) and determine which animal HIV came from.\n", + "\n", + " Note: It may take a while to calculate Damerau-Levenshtein distance for long sequences.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bDt8YbtoHhlP", + "colab_type": "text" + }, + "source": [ + "3. 
Use the Jaccard distance to create a similarity matrix for the translations of Churchill's speech. Can you see any advantages or disadvantages of using the Jaccard distance in this context?" + ] + } + ] +} \ No newline at end of file diff --git a/10-Textual-Data/10.4 Visual Analysis of Text.ipynb b/10-Textual-Data/10.4 Visual Analysis of Text.ipynb new file mode 100644 index 0000000..eadfd70 --- /dev/null +++ b/10-Textual-Data/10.4 Visual Analysis of Text.ipynb @@ -0,0 +1,505 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "10.4 Visual Analysis of Text.ipynb", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Lv8WJDNEe4og", + "colab_type": "text" + }, + "source": [ + "# 10.4 Visual Analysis of Texts\n", + "\n", + "An important part of Exploratory Data Analysis is data visualization, and it is no different for textual data. Visualizing textual data can help us see patterns that our brains might otherwise miss. This section will teach us ways to look at text data. 
The dataset for this section will be the Federalist Papers, which are processed below:\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "A-1uNmVtxQaT", + "colab_type": "code", + "colab": {} + }, + "source": [ + "import pandas as pd\n", + "import requests\n", + "\n", + "federalist_dir = \"http://dlsun.github.io/pods/data/federalist/\"\n", + "df_author = pd.read_csv(\"http://dlsun.github.io/pods/data/federalist/authorship.csv\")\n", + "\n", + "df_author = df_author.set_index(\"Paper\")\n", + "\n", + "# Download the text of each paper listed in the authorship table\n", + "papers = []\n", + "for paper in df_author.index:\n", + "    response = requests.get(federalist_dir + str(paper) + \".txt\")\n", + "    papers.append(response.text)\n", + "\n", + "df_author[\"Text\"] = papers" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GXgnmrm-oPXD", + "colab_type": "text" + }, + "source": [ + "## N-Gram Frequency\n", + "\n", + "One way to visualize text data is by looking at how often an N-gram occurs. The simplest case is the unigram, whose count is just the term frequency of a single word. This allows us to see how often a specific word is used. In the example below, let us examine how often the authors of the Federalist Papers mention the 3 branches of government. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_yzmXUqPkolH", + "colab_type": "code", + "outputId": "c9abdcc2-281e-4727-e9ff-56b3a4b7ea3e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + } + }, + "source": [ + "from sklearn.feature_extraction.text import CountVectorizer\n", + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "# Create a CountVectorizer like in 10.2\n", + "vec = CountVectorizer()\n", + "vec.fit(df_author[\"Text\"]) \n", + "tf_sparse = vec.transform(df_author[\"Text\"]).todense()\n", + "\n", + "# Turn the sparse matrix into a Pandas DataFrame \n", + "df_unigram = pd.DataFrame(tf_sparse, columns=vec.get_feature_names(), index=df_author.index)\n", + "\n", + "# Get the three terms we want to examine\n", + "df_ngram = df_unigram[[\"executive\", \"legislature\", \"judiciary\"]]\n", + "df_ngram.sum()" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "executive 246\n", + "legislature 190\n", + "judiciary 91\n", + "dtype: int64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-v0j32ya10N0", + "colab_type": "text" + }, + "source": [ + "We can see here the authors of the federalist papers were particularly interested in discussing the executive branch of government. This aligns with consensus understanding of the federalist papers where the three authors were particularly concerned about the overreaching executive. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4EaKhuJH12FV", + "colab_type": "code", + "outputId": "5f09fe54-3562-4867-e6e0-f61fe85c4179", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "source": [ + "df_ngram.plot.line()" + ], + "execution_count": 4, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 4 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eMlGFN_qx0x1", + "colab_type": "text" + }, + "source": [ + "Looking at the line graph, particularly at Federalist 46 (the huge spike in the middle), there is an unusal spike in the number of times all three branches of government are mentioned, which leads us to assume that this paper discusses the government as a whole and not any particular branch. This assumption would be proven right as this essay by Madison is comparing the difference between the state and federal government. \n", + "\n", + "Another way to interpret this chart would be to treat the papers as a time variable. Since the papers were written in chronological order, we can see that as time went on, the authors wrote more and more about the branches of government.\n", + "\n", + "N-gram analysis provides us with a very simple way to analyze our text by seeing what word(s) occured most frequently. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GKjN-Y2t03aQ", + "colab_type": "text" + }, + "source": [ + "## Dimensionality Reduction\n", + "\n", + "While N-grams are great for visualizing a few words over time, it does struggle in visualizing all the documents together. To visualize the whole Count Vector, we will use dimensionality reduction to construct two \"Principle Component\" vocabulary to present the entire corpus. Recall from the previous chapters that the Principle Component does not represent any real vocabulary but rather it is a mathematical construct that maximizes variance. \n", + "\n", + "### Representing Documents\n", + "\n", + "In an exercise in section 10.2, we used a the KNN algorithm to predict federalist papers authors. However, one question we may have is whether or not topics plays a role in the classification. 
Below, we run PCA on the 85 federalist papers, making sure we don't plot points where the author is unknown to simplify our analysis. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "nPKk-reyzq5l", + "colab_type": "code", + "outputId": "42cc2807-f769-4fa5-b17a-214cedddbfe3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 368 + } + }, + "source": [ + "from sklearn.decomposition import PCA\n", + "import numpy as np\n", + "from altair import *\n", + "\n", + "pca = PCA(n_components=2)\n", + "df_data = pca.fit_transform(df_unigram)\n", + "\n", + "# Turn numpy array back into dataframe and reset index \n", + "df_data = pd.DataFrame(df_data, columns=[\"PCA1\", \"PCA2\"], index=np.arange(1,86,1))\n", + "\n", + "# Merge to get Author name \n", + "df_data = df_data.merge(df_author, left_index=True, right_index=True)\n", + "\n", + "# Drop all papers where the author is not certain \n", + "df_data = df_data[~df_data[\"Author\"].isnull()]\n", + "df_data = df_data.reset_index()\n", + "\n", + "Chart(df_data).mark_circle().encode(\n", + " x=\"PCA1\",\n", + " y=\"PCA2\", \n", + " color=\"Author\",\n", + " tooltip=\"index\"\n", + ")" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "alt.Chart(...)" + ], + "text/html": [ + "\n", + "
\n", + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 5 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-eTIHF96DnO6", + "colab_type": "text" + }, + "source": [ + "From above, we can see that the way that the papers are clustered is very much based on the Count Vector of the author. Each author forms a very distinct cluster. Topics don't really play a role in this chart, as the location of each topic in the plot is rather random. For example, Federalist 78 to 83 are papers by Hamilton about the Judiciary, but they're scattered throughout the graph.\n", + "\n", + "There are many other dimensionality reduction techniques that could be applied in a similar fashion; experiment with them to see which gives the best results. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3CQRLXxh6wut", + "colab_type": "text" + }, + "source": [ + "### Representing Words\n", + "\n", + "We can apply a similar version of the dimensionality reduction technique to the words of the document. Instead of reducing the number of words so that we can visualize the documents, we can reduce the number of documents so that we can visualize the words. In this context, PCA creates two new documents, PCA Doc 1 and PCA Doc 2, where each value represents the number of times that a word appears in the PCA docs.\n", + "\n", + "-|PCA Document 1|PCA Document 2\n", + "---|---|---\n", + "**word #1**|# of times word #1 occurs in PCA Doc 1|# of times word #1 occurs in PCA Doc 2\n", + "**word #2**|...|...\n", + "\n", + "This technique has the added benefit of capturing some of the meaning of the word. Consider the following block of code where we reduce the number of documents to 2, and plot the word counts." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ktPaYIbg9BOk", + "colab_type": "code", + "outputId": "d1f0b81e-9863-4996-8210-0b820e05d06e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "source": [ + "from sklearn.decomposition import PCA, KernelPCA\n", + "from sklearn.manifold import MDS\n", + "\n", + "from sklearn.feature_extraction.text import CountVectorizer\n", + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "# Create a CountVectorizer like in 10.2\n", + "vec = CountVectorizer()\n", + "vec.fit(df_author[\"Text\"]) \n", + "tf_sparse = vec.transform(df_author[\"Text\"]).todense()\n", + "\n", + "# Turn the sparse matrix into a Pandas DataFrame \n", + "df_unigram = pd.DataFrame(tf_sparse, columns=vec.get_feature_names(), index=df_author.index)\n", + "\n", + "# Transpose so that each row is a word and each column is a document\n", + "df_unigram = df_unigram.T\n", + "\n", + "# Use a non linear model\n", + "pca = KernelPCA(n_components=2, kernel=\"sigmoid\")\n", + "df_data = pca.fit_transform(df_unigram)\n", + "\n", + "df_data = pd.DataFrame(df_data, columns=[\"Principle Document 1\", \"Principle Document 2\"])\n", + "ax = df_data.plot.scatter(x=\"Principle Document 1\", y=\"Principle Document 2\", alpha=0.35)\n", + "\n", + "ax.annotate(\"monarch\", tuple(df_data.iloc[list(df_unigram.index).index(\"monarch\")]))\n", + "ax.annotate(\"evil\", tuple(df_data.iloc[list(df_unigram.index).index(\"evil\")]))\n", + "ax.annotate(\"good\", tuple(df_data.iloc[list(df_unigram.index).index(\"good\")]))\n", + "ax.annotate(\"executive\", tuple(df_data.iloc[list(df_unigram.index).index(\"executive\")]))" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Text(0.4415900652159762, -0.09777202180619515, 'executive')" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 6 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j7uONWzz9-Nz", + "colab_type": "text" + }, + "source": [ + "We know that the writers of the federalist papers tried very hard to convince the American public that the executive under the constitution will be good and just because their actions will be checked by other branches of government. This is in contrast of their view that the unchecked monarch is evil and cruel. This conclusion is definitely supported by our graph above where we see that the word \"good\" is close to \"executive\" and \"evil\" is close to \"monarch\"." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UWsQHmBc_HP0", + "colab_type": "text" + }, + "source": [ + "# Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mtXxcvtgH4E3", + "colab_type": "text" + }, + "source": [ + "1. Download the nucleotide sequence for Immunodeficient viruses and plot each virus on a 2D scatter plot. Treat every character as its own feature (column). Can you tell which viruses are most similar to each other? " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vz5jx7wvc5N5", + "colab_type": "text" + }, + "source": [ + "2. Download the spreadsheet of names at (link) and visualize the names on 2 dimensions. Are you able to derive any meaning from this chart?" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_9mOjCNLL9Xh", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + }, + "outputId": "ec768d94-5473-47b5-dc33-c7a44615636a" + }, + "source": [ + "from sklearn.manifold import MDS\n", + "from sklearn.decomposition import PCA, KernelPCA\n", + "import pandas as pd \n", + "import numpy as np\n", + "import altair as alt\n", + "\n", + "\n", + "df_names = pd.read_csv(\"https://raw.githubusercontent.com/dlsun/pods/master/data/names/yob2017.txt\", header=None)\n", + "df_names.columns = [\"Name\", \"Gender\", \"Number\"]\n", + "df_names[\"Names\"] = df_names[\"Name\"]\n", + "df_names = df_names.set_index(\"Names\")\n", + "\n", + "def count_alpha(string): \n", + " counts = [0] * 26\n", + " for x in string.lower():\n", + " counts[ord(x) - 97] += 1\n", + " return counts\n", + "\n", + "pca = PCA(n_components=2)\n", + "df_data = pd.DataFrame(pca.fit_transform(pd.DataFrame(np.array(list(df_names.Name.apply(count_alpha))))), columns=[\"PCA Alphabet 1\", \"PCA Alphabet 2\"], index=df_names.index)\n", + "df_data = df_data.merge(df_names, left_index=True, right_index=True)\n", + "pd.DataFrame(pca.fit_transform(pd.DataFrame(np.array(list(df_names.Name.apply(count_alpha))))), columns=[\"PCA Alphabet 1\", \"PCA Alphabet 2\"]).plot.scatter(x=\"PCA Alphabet 1\", y=\"PCA Alphabet 2\", c = df_names.Gender.map({\"M\": \"blue\", \"F\": \"red\"}), s = [5] * df_names.shape[0])\n" + ], + "execution_count": 33, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 33 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + } + ] +} \ No newline at end of file From ac9ab8754555747bd9030bd39d9d2a7e1f485b04 Mon Sep 17 00:00:00 2001 From: Bryan Kwong Date: Fri, 1 May 2020 12:54:54 -0700 Subject: [PATCH 2/2] Added Changes Requested PR #1 --- 07-Unsupervised-Learning/.DS_Store | Bin 0 -> 6148 bytes ...alizing_Higher_Dimensions-checkpoint.ipynb | 738 +++++++ .../7.3 Visualizing_Higher_Dimensions.ipynb | 738 +++++++ .../7_2_Visualizing_Higher_Dimensions.ipynb | 731 ------- 10-Textual-Data/10.5 Sentiment_Analysis.ipynb | 1700 +++++++++++++++++ 5 files changed, 3176 insertions(+), 731 deletions(-) create mode 100644 07-Unsupervised-Learning/.DS_Store create mode 100644 07-Unsupervised-Learning/.ipynb_checkpoints/7.3 Visualizing_Higher_Dimensions-checkpoint.ipynb create mode 100644 07-Unsupervised-Learning/7.3 Visualizing_Higher_Dimensions.ipynb delete mode 100644 07-Unsupervised-Learning/7_2_Visualizing_Higher_Dimensions.ipynb create mode 100644 10-Textual-Data/10.5 Sentiment_Analysis.ipynb diff --git a/07-Unsupervised-Learning/.DS_Store b/07-Unsupervised-Learning/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..8ada7b08d0c364897a146b305050c2a6823d3b28 GIT binary patch literal 6148 zcmeHKu};H441HHR6tQ%Jf#Eh-s8lg9utcRuAjE{!wX}fRpwd)?4lw0+_)2*8Sru*R ziV)b6?>Ua`yS$6y9DvZ7-?xD#fCg2tw$0%ik^7=+QZvsMF=&k|9N`)d7+|^=dg1+5XrTD^}hubya68JnX}+&tU+%4|($jXA@N75cObt~jl` zg-3cv%c~8$EBUfw7iYj3a0dPd1MJx%jh%!(Is?vtGq7Mlz7LTqm>E_H_0z#Yj{w9v z-72)@S5R`IVP;q*^_D6kaaF=a3Jq<= fjFncrM^&Leib2c_tAzAW{6|1(@WC1QQ3gH%DT_#? 
literal 0 HcmV?d00001 diff --git a/07-Unsupervised-Learning/.ipynb_checkpoints/7.3 Visualizing_Higher_Dimensions-checkpoint.ipynb b/07-Unsupervised-Learning/.ipynb_checkpoints/7.3 Visualizing_Higher_Dimensions-checkpoint.ipynb new file mode 100644 index 0000000..13d4466 --- /dev/null +++ b/07-Unsupervised-Learning/.ipynb_checkpoints/7.3 Visualizing_Higher_Dimensions-checkpoint.ipynb @@ -0,0 +1,738 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "XTPLBCWKzfWS" + }, + "source": [ + "# 7.3 Visualizing Higher Dimensions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "f1mm_fCWm9dO" + }, + "source": [ + ">\"I am a Tralfamadorian, seeing all time as you might see a stretch of the Rocky Mountains. All time is all time. It does not change. It does not lend itself to warnings or explanations. It simply is.\" \n", + ">\n", + ">-Kurt Vonnegut in \"Slaughterhouse-Five\"\n", + "\n", + "We unfortunately are not Tralfamadorians. Instead, we are three-dimensional beings who can't see a fourth dimension as if it were a location in the Rocky Mountains. However, this doesn't mean that the fourth dimension is meaningless to us. We can gain a lot of understanding from the higher dimensions. The problem is, we can't see them and thus we can't plot them directly. \n", + "\n", + "Fortunately, very clever mathematicians throughout history have invented techniques that let us approximate what higher-dimensional data would look like. The rest of the section will discuss these techniques." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "XR-JkcQuwozv" + }, + "source": [ + "## Using Size and Color\n", + "\n", + "This is more of a review from previous sections, but one way to visualize more dimensions is to use the size and color attributes of your scatter plots. This is rather intuitive. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 368 + }, + "colab_type": "code", + "id": "kgChDoxMwyKa", + "outputId": "cc578404-1e67-476a-f736-e070bc9c233e" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "" + ], + "text/plain": [ + "alt.Chart(...)" + ] + }, + "execution_count": 30, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "from scipy import stats\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", + "import altair as alt\n", + "\n", + "df_bordeaux = pd.read_csv(\"http://dlsun.github.io/pods/data/bordeaux.csv\")\n", + "\n", + "alt.Chart(df_bordeaux).mark_circle().encode(\n", + " alt.X('age',\n", + " scale=alt.Scale(zero=False)\n", + " ),\n", + " alt.Y('sep',\n", + " scale=alt.Scale(zero=False)\n", + " ),\n", + " color=\"summer\",\n", + " size=\"win\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "gxV83MG_y2LJ" + }, + "source": [ + "I am sure you can see the limitation of this method: you can only go up to 4 dimensions (5 if you use a 3-D scatter plot). It is still worth mentioning because sometimes this may be all you need. \n", + "\n", + "For higher dimensions, we should consider either feature selection or dimensionality reduction. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "tlWDTyDsr5Zv" + }, + "source": [ + "## Feature Selection \n", + "\n", + "If we want to compress 10 dimensions' worth of data into 2 dimensions, we're bound to lose some detail in the compression. We can measure how much detail we kept with the explained variance ratio, which gives the fraction of the original variance that remains after the compression. The higher the ratio, the more variance and detail we kept. \n", + "\n", + "One very simple, almost trivial, way to visualize higher dimensional data is to plot only the two features that individually explain the most variation in the response (here, the price). 
" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 136 + }, + "colab_type": "code", + "id": "g9kxLPD30dxa", + "outputId": "c1aedb62-573b-406c-cfa6-844ab4054122" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "summer 0.343538\n", + "sep 0.323582\n", + "age 0.206936\n", + "year 0.206936\n", + "har 0.199621\n", + "win 0.053456\n", + "Name: R^2 Values, dtype: float64" + ] + }, + "execution_count": 31, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + } + ], + "source": [ + "sclr = StandardScaler()\n", + "df_bordeaux = pd.DataFrame(sclr.fit_transform(df_bordeaux.dropna()), columns=df_bordeaux.columns)\n", + "\n", + "X = df_bordeaux.drop(\"price\", axis=1)\n", + "y = df_bordeaux[\"price\"]\n", + "\n", + "reg = LinearRegression()\n", + "reg.fit(X, y)\n", + "\n", + "scores = pd.Series(dtype=float, name=\"R^2 Values\")\n", + "for column in X.columns: \n", + " reg = LinearRegression()\n", + " reg.fit(X[[column]], y)\n", + " scores[column] = reg.score(X[[column]], y)\n", + "\n", + "scores.sort_values(ascending=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "qHprVgDLxKyE" + }, + "source": [ + "As we can see above, the average summer temperature (**summer**) and the average September temperature (**sep**) are the two variables that individually explain the most variance in the price of the wine (**price**), which we use as a stand-in for its quality. Thus, if we want the best representation of the dataset with only two of the original features, we can make a scatterplot of **summer** vs **sep**. However, even these two variables capture only 33% of the total variance of the features: after standardization every feature has the same variance, so any two of the six features account for 2/6 of the total. The other 67% is lost with the features we chose to ignore. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 330 + }, + "colab_type": "code", + "id": "FcUCeYGOx2nP", + "outputId": "16dc19dd-f2ac-44b7-8ac6-0236cac283d4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "% Variance Explained: 0.3333333333333333\n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 32, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEGCAYAAABsLkJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAUNklEQVR4nO3dfYxcV33G8efZeLu26rQYrwvEDpiQqLwaJ2zTgCmF8NKQVqZgKFDRgqByU4RU1Ao7CLX0TY1spP5RUVqshAKCBigmtVsCIWAoL2rSrINf8kJCQEmzbkrM1glZsLfrzK9/zF083njt2ezcOefO+X6kVWbvzO793ZvxPHvOPedcR4QAAOUZSl0AACANAgAACkUAAEChCAAAKBQBAACFWpK6gIUYHR2NtWvXpi4DABpl7969P4yIVXO3NyoA1q5dq/Hx8dRlAECj2L7vVNvpAgKAQhEAAFAoAgAACkUAAEChCAAAKBQBAABzTE5Na//9D2lyajp1KbVq1DBQAKjbrn2HtHXnAQ0PDWmm1dL2Teu0cf3q1GXVghYAAFQmp6a1decBHZtp6ZHp4zo209KWnQcGtiVAAABAZeLIUQ0PnfyxODw0pIkjRxNVVC8CAAAqa1Ys00yrddK2mVZLa1YsS1RRvQgAAKisXD6i7ZvWaenwkM4eWaKlw0PavmmdVi4fSV1aLbgIDAAdNq5frQ3nj2riyFGtWbFsYD/8JQIAAB5j5fKRgf7gn0UXEAAUigAAgEIRAABQKAIAAApFAABAoQgAACgUAQAAhSIAAKBQBAAAFIoAAIBCEQAAUCgCAAAKlSwAbJ9r+6u277B9u+0/TFULAJQo5WqgxyX9cUTcavtsSXtt3xgRdySsCQCKkawFEBEPRMSt1eNHJN0paTDvvAwAGcriGoDttZIulHTzKZ7bbHvc9vjhw4f7XRoADKzkAWB7uaSdkt4dET+a+3xE7IiIsYgYW7VqVf8LBIABlTQAbA+r/eH/yYj4XMpaAKA0KUcBWdI1ku6MiL9JVQcAlCplC2CDpN+RdKntfdXX5QnrAYCiJBsGGhHflORU+weA0iW/CAwASIMAAIBCEQAAUCgCAAAKRQAAQKEIAAAoFAEAAIUiAACgUAQAABSKAACAQhEAAFAoAgAACkUAAEChCACgzyanprX//oc0OTWduhQULtly0ECJdu07pK07D2h4aEgzrZa2b1qnjetXpy4LhaIFAPTJ5NS0tu48oGMzLT0yfVzHZlrasvMALQEkQwAAfTJx5KiGh07+Jzc8NKSJI0cTVYTSEQBAn6xZsUwzrdZJ22ZaLa1ZsSxRRSgdAQD0ycrlI9q+aZ2WDg/p7JElWjo8pO2b1mnl8pHUpaFQXAQG+mjj+tXacP6oJo4c1ZoVy/jwR1IEA
NBnK5eP8MGPLNAFBACFIgAAoFAEAADMUcpsba4BAECHkmZr0wIAgEpps7UJAACo9Gu2di5dTHQBAUClH7O1c+piogUAAJW6Z2vn1sVECwAAOtQ5W3u2i+mYTrQyZruYUkwOJAAAYI66ZmvntiAgXUAA0Ce5LQhICwAA+iinBQEJAADos1wWBKQLCAAKlTQAbH/E9oO2b0tZBwCUKHUL4KOSLktcAxool5mU/VDSsaK/kl4DiIiv216bsgY0T04zKetW0rGi/1K3AM7I9mbb47bHDx8+nLocJJbbTMo6lXSsSCP7AIiIHRExFhFjq1atSl0OEuvXYl05KOlYkUb2AQB0ym0mZZ1KOlakQQCgUXKbSVmnko4VaTgi0u3cvlbSSyWNSvqBpPdHxDXzvX5sbCzGx8f7VB1yNjk1ncVMyn4o6VhRD9t7I2Js7vbUo4DenHL/aK5cZlL2Q0nHiv6iCwgACkUAAEChCAAAKBQBAACFIgAAoFAEAFAwFporGzeEAQrFQnOgBQAUiIXmIBEAQJEGYaE5uq8Wjy4goEBNX2iO7qveoAUAFKjJC83RfdU7tACAQm1cv1obzh9t3EJzs91Xx3SiBTPbfdWUY8gFAQAUrIkLzTW9+yondAEBaJQmd1/lhhYAgMZpavdVbggAAI3UxO6r3NAFBACFIgAAoFAEAAAUigAAgEIRAABQKAIAAApFAABAoRYUALZ/zvbZdRUDAOifrgLA9i/ZPijpgKTbbO+3/YJ6S0PTsV47kLduZwJfI+mdEfENSbL9Ykn/KGldXYWh2VivHchft11Aj85++EtSRHxT0vF6SkLTsV470AzdtgD+3faHJV0rKSS9UdLXbF8kSRFxa031oYFYrx1ohm4D4PnVf98/Z/uFagfCpT2rCI03COu1T05NF7HSZCnHiVPrKgAi4mV1F4LBMbte+5Y51wCa8gFTyvWLUo4T83NEnPlF9pMk/bWkcyLi1bafLemFEXFN3QV2Ghsbi/Hx8X7uEovQxL8uJ6emtWHbHh2bOdGCWTo8pG9tvbQxx9CNUo4Tbbb3RsTY3O3dXgT+qKQbJJ1TfX+3pHf3pjQMqpXLR/T8c5/QqA+U2esXnWavX8ynicNdH89xPh5NPDcl6fYawGhEfMb2eyUpIo7bfrTGuoAkFnr9oqndKP24TtPUc1OSblsAP7a9Uu0LvrJ9iaSHa6sKSGQh95tt8nDXuu+r2+RzU5JuWwB/JGm3pGfY/pakVZJeX1tVQELd3m+26cNd67yvbtPPTSm6DYBnSHq1pHMlbZL0ywv4WaBxurnf7CAMd63rvrqDcG5K0G0X0J9ExI8krZD0MkkfkvT3i9257cts32X7HttXLvb3Af1UdzdKk3FumqHbYaDfjogLbV8l6WBE/NPstse9Y/sstUcTvVLShKRbJL05Iu6Y72cYBoocNXG4a79wbvIw3zDQbrtxDlVLQbxS0jbbI1r8vQQulnRPRHy/KvBTkl4jad4AAHJUVzfKIODc5K3bD/HfUnsewK9FxEOSnijpPYvc92pJ93d8P1FtO4ntzbbHbY8fPnx4kbsEAMzqdimIn0j6XMf3D0h6oK6i5ux7h6QdUrsLqB/7BIASpLwl5CG1RxXNWlNtA5ApZvYOlpRDOW+RdIHtp6v9wf8mSb+dsB4Ap8HM3sGTrAUQEcclvUvtawt3SvpMRNyeqh4A82Nm72BKOpkrIq6XdH3KGkrGED10K8eZvbx/F4/ZvIWiOY+FyG1mL+/f3kh5ERiJ0JzHQuU0s5f3b+/QAihQjs155K/OxeMWgvdv7xAABcqtOY/myGFmL+/f3qELqEA5NeeBheL92ztdLQaXCxaD6y1GUTQD/59OjfPSvcUuBocBlENzHqfHaJf58f5dPLqAgEwx2gV1IwCATM2Oduk0O9oF6AUCADiFHBY9Y7QL6sY1AGCOXPrdZ0e7bJlTC/3e6BUCAOjQ2e8+O9Foy84D2nD+aJIP3lwmX2EwEQBAhxxnm
TLaBXXhGgDQYc2KZTo6c/ykbUdnjtPvjoFEAABz2D7t98CgIACADhNHjmrpkrNO2rZ0yVkMvcRAIgCADgy9REkIAKADC42hJIwCQteavPjWQmpf6NDLJp8XlI0AQFdymRz1eDye2rsdetnk8wLQBYQzavKiZHXW3uTzAkgEALrQ5EXJ6qy9yecFkAgAdKHJI2PqrL3J5wXNUtfihAQAzqjJI2PqrL3J5wXNsWvfIW3Ytkdvufpmbdi2R7v3HerZ7+aWkOhak0e71Fl7k88L8jY5Na0N2/bo2MyJlubS4SF9a+ulC3qvcUtILFqTFyWrs/Ymnxfkre7FCekCAoBM1X2diQAAgEzVfZ2JLiAAyFidNwUiAAAgc3VdZ6ILCAAKRQAAfVbXpB5goegCAvqIxeOQE1oAQJ+weBxyQwAAfcLicchNkgCw/Qbbt9tu2X7M9GRgELF4HHKTqgVwm6TXSfp6ov0DfcficchNkovAEXGnJNlOsXsgmTon9QALlf0oINubJW2WpKc+9amJqwEWj8XjkIvaAsD2lyU9+RRPvS8idnX7eyJih6QdUns56B6VBwDFqy0AIuIVdf1uAMDiMQwUAAqVahjoa21PSHqhpM/bviFFHQBQslSjgK6TdF2KfQMA2ugCAoBCEQAAUCgCAAAKRQAAQKEIAAAoFAEAAIUiAACgUAQAABSKAACAQhEAAFAoAgAACkUAAEChCAAAKBQBAACFIgAAoFAEAAAUigAAgEIRAABQKAIAAApFAABAoQiAzE1OTWv//Q9pcmo6dSkABsyS1AVgfrv2HdLWnQc0PDSkmVZL2zet08b1q1OXBWBA0ALI1OTUtLbuPKBjMy09Mn1cx2Za2rLzAC0BAD1DAGRq4shRDQ+d/L9neGhIE0eOJqoIwKAhADK1ZsUyzbRaJ22babW0ZsWyRBUBGDQEQKZWLh/R9k3rtHR4SGePLNHS4SFt37ROK5ePpC4NwIAo4iLw5NS0Jo4c1ZoVyxr1Abpx/WptOH+0kbUDyN/AB0DTR9KsXD7CBz+AWgx0FxAjaQBgfgMdAIykAYD5DXQAMJIGAOY30AHASBoAmN/AXwRmJA0AnNrAB4DESBoAOJWB7gICAMyPAACAQiUJANsfsP0d2wdsX2f7CSnqANBc3Ctj8VK1AG6U9NyIWCfpbknvTVQHgAbate+QNmzbo7dcfbM2bNuj3fsOpS6pkZIEQER8KSKOV9/eJGlNijoANA8z/Hsnh2sAb5f0hfmetL3Z9rjt8cOHD/exLAA5YoZ/79Q2DNT2lyU9+RRPvS8idlWveZ+k45I+Od/viYgdknZI0tjYWNRQKoAGYYZ/79QWABHxitM9b/ttkn5D0ssjgg92AF2ZneG/Zc4qv8z1WbgkE8FsXyZpi6RfjYifpKgBQHMxw783Us0E/qCkEUk32pakmyLiikS1AGggZvgvXpIAiIjzU+wXAHBCDqOAAAAJEAAAUCgCAAAKRQAAQKHcpCH4th+RdFfqOuYYlfTD1EWcQo515ViTlGddOdYk5VlXjjVJedX1tIhYNXdj024Ic1dEjKUuopPt8dxqkvKsK8eapDzryrEmKc+6cqxJyreuTnQBAUChCAAAKFTTAmBH6gJOIceapDzryrEmKc+6cqxJyrOuHGuS8q3rpxp1ERgA0DtNawEAAHqEAACAQmUdAN3ePN72vbYP2t5nezyTmi6zfZfte2xfWWdN1f7eYPt22y3b8w496/O56ramfp+rJ9q+0fZ3q/+umOd1j1bnaZ/t3TXVctpjtz1i+9PV8zfbXltHHQus6W22D3ecm9/rQ00fsf2g7dvmed62/7aq+YDti+quqcu6Xmr74Y5z9af9qKtrEZHtl6RXSVpSPd4mads8r7tX0mguNUk6S9L3JJ0n6Wck7Zf07JrrepakX5T0NUljp3ldP8/VGWtKdK62S7qyenzlad5XUzXXccZjl/ROSf9QPX6TpE9nUNPbJH2wH++hjn2+RNJFkm6b5
/nL1b61rCVdIunmTOp6qaR/6+e5WshX1i2AyPDm8V3WdLGkeyLi+xHxf5I+Jek1Ndd1Z0RkNUu6y5r6fq6q3/+x6vHHJP1mzfubTzfH3lnrZyW93NVNNBLW1HcR8XVJ/3ual7xG0sej7SZJT7D9lAzqylrWATDH6W4eH5K+ZHuv7c0Z1LRa0v0d309U23KQ6lzNJ8W5elJEPFA9/h9JT5rndUttj9u+yXYdIdHNsf/0NdUfHg9LWllDLQupSZI2VV0tn7V9bo31dCvnf3MvtL3f9hdsPyd1MZ2SLwXRo5vHvzgiDtn+BbXvMvadKplT1tRz3dTVhb6fqxROV1fnNxERtucbC/206lydJ2mP7YMR8b1e19pA/yrp2oiYtv37ardQLk1cU65uVft9NGX7ckn/IumCxDX9VPIAiB7cPD4iDlX/fdD2dWo3Yx/3h1oPajokqfOvojXVtkU5U11d/o6+nqsu9P1c2f6B7adExANVN8GD8/yO2XP1fdtfk3Sh2v3jvdLNsc++ZsL2Ekk/L2myhzUsuKaI6Nz/1WpfU0mtlvfRYkXEjzoeX2/7Q7ZHIyKLReKy7gLyiZvHb4x5bh5v+2dtnz37WO2LtKe8It+vmiTdIukC20+3/TNqX7yrZRTJQvT7XHUpxbnaLemt1eO3SnpMS8X2Ctsj1eNRSRsk3dHjOro59s5aXy9pz3x/CPWrpjl96xsl3VljPd3aLel3q9FAl0h6uKObLxnbT569ZmP7YrU/c+sM8IVJfRX6dF+S7lG7X29f9TU7GuIcSddXj89Te6TCfkm3q931kLSm6vvLJd2t9l+MtdZU7e+1avd7Tkv6gaQbMjhXZ6wp0blaKekrkr4r6cuSnlhtH5N0dfX4RZIOVufqoKR31FTLY45d0l+o/QeGJC2V9M/V++4/JZ3Xh/Nzppquqt4/+yV9VdIz+1DTtZIekDRTvafeIekKSVdUz1vS31U1H9RpRsL1ua53dZyrmyS9qB91dfvFUhAAUKisu4AAAPUhAACgUAQAABSKAACAQhEAAFAoAgAACkUAAJmpJjPxbxO1402GIlWzoj9fLdJ1m+03un2vhNHq+bFq+QfZ/jPbH7P9Ddv32X6d7e1u31fhi7aHq9fda/uqat33cdsX2b7B9vdsX9Gx7/fYvqVaTO3Pq21r3V6D/+Nqz87OYYE1DDgCAKW6TNJ/R8TzI+K5kr54htc/Q+0FzzZK+oSkr0bE8yQdlfTrHa/7r4hYL+kbkj6q9vINl0ia/aB/ldqLgV0sab2kF9h+SfWzF0j6UEQ8JyLuW/whAqdHAKBUByW90vY2278SEQ+f4fVfiIiZ6ufO0onAOChpbcfrdndsvzkiHomIw5Km3b573Kuqr2+rvVLkM3Vidcj7or2WPdAXyVcDBVKIiLvdvm3g5ZL+yvZX1F7ee/aPoqVzfmS6+rmW7Zk4sYZKSyf/O5ru2D7dsX32dZZ0VUR8uPOXu32rxx8v5piAhaIFgCLZPkfSTyLiE5I+oPZt/e6V9ILqJZtq2vUNkt5ue3lVx+rq3gxA39ECQKmeJ+kDtltqr+T4B5KWSbrG9l+qfR/jnouIL9l+lqT/qFYJnpL0FkmP1rE/4HRYDRQACkUXEAAUigAAgEIRAABQKAIAAApFAABAoQgAACgUAQAAhfp/rOG9WPsCRBEAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light", + "tags": [] + }, + "output_type": "display_data" + } + ], + "source": [ + "explained_var = X[[\"summer\", \"sep\"]].var(axis=0).sum() / X.var(axis=0).sum() \n", + "\n", + "print(\"% Variance Explained:\", explained_var, end=\"\\n\\n\")\n", + "df_bordeaux.plot.scatter(x=\"summer\", y=\"sep\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Y6gWdO1CINp-" + }, + "source": [ + "Additionally, if we look below, using only two features has hindered our predictive accuracy. This sucks! Fortunately, some very clever mathematicians came up with ways to get around this. " + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + }, + "colab_type": "code", + "id": "l7re2E6KHUbs", + "outputId": "24c2f5c2-ff71-420e-874c-2bbdee6f9a8f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "All Features:\t\t 0.7526018827767169\n", + "With PCA Features:\t 0.4633153344681292\n", + "\n" + ] + } + ], + "source": [ + "reg = LinearRegression()\n", + "\n", + "reg.fit(X, y)\n", + "print(\"All Features:\\t\\t\", reg.score(X, y))\n", + "reg.fit(X[[\"summer\", \"sep\"]], y)\n", + "print(\"With PCA Features:\\t\", reg.score(X[[\"summer\", \"sep\"]], y), end=\"\\n\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "tflTZWS688xW" + }, + "source": [ + "## Dimensionality Reduction\n", + "\n", + "With feature selection, we were only able to capture 33% of the original variance, which isn't great. To capture more variation while still remaining in two variables, have to utilize some clever math.\n", + "\n", + "These clever mathematical techniques are known as feature creation, where we try to create new variables that helps us visualize higher dimensional data. 
There are many different techniques for dimensionality reduction, each of which attempts to accomplish something different. Let's start off with the simplest and most popular one: **Principal Component Analysis**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "cZ52OQlGAm6Q" + }, + "source": [ + "### Principal Component Analysis (PCA)\n", + "\n", + "Simply put, Principal Component Analysis creates new features that capture as much of the variation as possible. What PCA is **not** doing is feature selection; rather, it creates entirely new features, each of which is a linear combination of all the original features. \n", + "\n", + "PCA involves some simple linear algebra, but scikit-learn has a PCA implementation. Note that all dimensionality reduction algorithms in scikit-learn work very similarly to the machine learning algorithms you learned in the previous chapters. Create the object and then run `fit()` or `fit_transform()`." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 255 + }, + "colab_type": "code", + "id": "FESc4NoI8-3v", + "outputId": "1d1b00d1-65f5-40dd-ca5f-93f9ba84ef7c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "% Variance Explained: 0.6462543462926849\n", + "\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PCA1PCA2
02.620822-1.437859
12.1664460.732196
22.359265-0.067934
31.518218-0.792417
41.5199200.451721
\n", + "
" + ], + "text/plain": [ + " PCA1 PCA2\n", + "0 2.620822 -1.437859\n", + "1 2.166446 0.732196\n", + "2 2.359265 -0.067934\n", + "3 1.518218 -0.792417\n", + "4 1.519920 0.451721" + ] + }, + "execution_count": 36, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.decomposition import PCA\n", + "\n", + "# We want to be able to plot this on a 2-D scatter plot so we choose 2 Dimensions\n", + "dimension_we_want = 2\n", + "\n", + "pca = PCA(n_components=dimension_we_want)\n", + "X_2d = pca.fit_transform(X) \n", + "X_2d = pd.DataFrame(X_2d, columns=[\"PCA1\", \"PCA2\"])\n", + "\n", + "print(\"\\n% Variance Explained:\", pca.explained_variance_ratio_.sum(), end=\"\\n\\n\")\n", + "X_2d.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + }, + "colab_type": "code", + "id": "Gj87JXmmBdNu", + "outputId": "4574186e-9f94-4b91-9cf9-ba4a68b9f866" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 37, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + }, + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYAAAAEGCAYAAABsLkJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAATe0lEQVR4nO3dcWzcZ33H8c/nEtfxcLZGjpWucUMqUtCqYIKwOjZv0lrYKIgFtVk1OsHEYIuQhkQlpoSqbGOaJtGwMaSViUWUwaSKqpvbBdFWNFVgUAQFB7mmaQrqGKyOGA1eSmOwXaf33R93Xm3X8Tm5+93zu3veL8mS73eX+30vd/597nme3/P8HBECAOSnkroAAEAaBAAAZIoAAIBMEQAAkCkCAAAytTF1ARdi69atsXPnztRlAEBHOX78+E8iYnDl9o4KgJ07d2p8fDx1GQDQUWz/cLXtdAEBQKYIAADIFAEAAJkiAAAgUwQAAGSKAADQEtMz83rs6Wc1PTOfuhSsU0edBgqgnI5MnNLBsUn1VCpaqFZ1aN+w9u7ZnrosNEALAEBTpmfmdXBsUnMLVZ2dP6e5haoOjE3SEugABACApkydmVVPZfmhpKdS0dSZ2UQVYb0IAABNGdrSp4Vqddm2hWpVQ1v6ElWE9SIAADRloL9Xh/YNa1NPRZt7N2pTT0WH9g1roL83dWlogEFgAE3bu2e7Rndt1dSZWQ1t6ePg3yEIAAAtMdDfy4G/w9AFBACZIgAAIFPJAsD2JtvftP2Y7RO2/ypVLQCQo5RjAPOSrouIGds9kh6x/WBEfCNhTQCQjWQBEBEhaaZ+s6f+E6nqAYDcJB0DsL3B9oSkZyQdjYhHV3nMftvjtsdPnz7d/iIBoEslDYCIeCEi9kgaknSN7d2rPOZwRIxExMjg4EuuaQwAuEilOAsoIp6V9CVJ16euBQBykfIsoEHbl9Z/75P025KeTFUPAOQm5VlAvyzps7Y3qBZE90TEFxLWAwBZSXkW0KSk16baPwDkrhRjAACA9iMAACBTBAAAZIoAAIBMEQAAkCkCAAAyRQAAQKYIAADIFAEAAJkiAAAgUwQAAGSKAACATBEAAJApAgAAMkUAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMhUsgCwfYXtL9l+wvYJ2+9PVQsA5Ghjwn2fk/SBiPi27c2Sjts+GhFPJKwJALKRrAUQET+KiG/Xfz8r6aSk7anqAYDclGIMwPZOSa+V9Ogq9+23PW57/PTp0+0uDQC6VvIAsN0vaUzSLRHx3Mr7I+JwRIxExMjg4GD7CwSALpU0AGz3qHbwvysi7k1ZCwDkJuVZQJZ0p6STEfGxVHUAQK5StgBGJb1T0nW2J+o/b0lYDwBkJdlpoBHxiCSn2j8A5C75IDAAIA0CAAAyRQAAQKYIAADIFAEAAJkiAIAVpmfm9djTz2p6Zj51KUChUq4GCpTOkYlTOjg2qZ5KRQvVqg7tG9bePaxRiO5ECwCom56Z18GxSc0tVHV2/pzmFqo6MDZJSwBdiwAA6qbOzKqnsvxPoqdS0dSZ2UQVAcUiAIC6oS19WqhWl21bqFY1tKUvUUVAsQgAoG6gv1eH9g1rU09Fm3s3alNPRYf2DWugvzd1aUAhGAQGlti7Z7tGd23V1JlZDW3p4+CPrkYAACsM9Pdy4EcW6AICgEwRAACQKQIAQCGYUV1+jAEAaDlmVHcGWgB1fFsBWoMZ1Z2DFoD4tgK00uKM6jm9OKlucUY1Z1eVS/YtAL6tdB9ac2kxo7pzZBUAqx0YWP+luxyZOKXR24/pHZ96VKO3H9PnJ06lLik7zKjuHNl0AZ2vm4dvK91jaWtusfvhwNikRndt5eDTZsyo7gxZtADW6ubh20r3oDVXLgP9vXrNFZfyt1RiWbQAGg1K8W2lO9CaAy5M0haA7U/bfsb240XuZz0HBr6tdD5ac8C
FSd0C+IykOyT9S5E7WTwwHFgxBsCBofvQmgPWL2kARMRXbO9sx744MOSD1TyB9UndAmjI9n5J+yVpx44dTT0XBwYAeFHpzwKKiMMRMRIRI4ODg6nLAdqOiW0oSulbAEDOWKYERSp9CwDIFcuUoGipTwP9nKSvS3qV7Snb70lZD1AmTGxD0VKfBXRzyv0DZcbENhSNLiCgpJjYVl7dMjDPIDBQYsxfKZ9uGpgnAICSY/5KeXTbirN0AQHAOnXbwDwBAADr9LJLNmj+3AvLtnXywDwBsIZuGejBi3hPcbGOTJzSW+94RJWKJUm9G9zxA/OMAZxHNw30oIb3FBdrad//orB1//t+Q7u2bU5YWXPW1QKw3bPKtq2tL6cYF/qtjxmY3Yf3FM1Yre+/d0NFP3v+hfP8i86wZgDYvtb2lKQf2X5oxdLNDxVZWKtczEXCu22gB7ynaE63Tspr1AI4JOlNEbFV0mFJR22/vn6fC62sBS72W1+3vtk54z3tTGUZs+nWSXmNxgAuiYgTkhQR/2b7pKR7bR+UFIVX16RG1wI+H64g1n14TztP2cZsunFSXqMAWLB9WUT8jyRFxAnbb5D0BUmvKLy6JjXzra8b3+zc8Z52jrJOuOq2SXmNuoA+KGnb0g0RMSXptyR9pKCaWqbZZhsXiu8+vKedgTGb9lizBRARD5/nrs2Snm99Oa3Ht758TM/M8z53CcZs2mPd8wBsD0q6SdLNki6XdF9RRbVatzXb8FJl6y9GcxizaY81A8D2Zkk3SvoDSa+UdK+kKyNiqA21AetS1v5iNIfWe/EatQCekfRNSR+S9EhEhO0bii8LWL+LPdsL5UfrvViNBoFvldQr6R8l3Wq79Gf+ID/0FwMXZ80AiIiPR8TrJb2tvunfJV1u+6DtVxZeHbAO3TpJByiaIy5sPpft3aoNBP9+ROwqpKrzGBkZifHx8XbuEh2Es4CA1dk+HhEjK7c3GgTeJWlbRHxtcVtEPG77QUn/3Poy0clSH4DpLwYuTKNB4I+rNg6w0k8l/b2k3215RehInIYJdJ5Gg8DbIuI7KzfWt+0spCJ0HJZaBjpTowC4dI37OMUCkso5bb8sq0gCZdYoAMZt/8nKjbb/WNLxZndu+3rb37X9lO0PNvt8SKNsp2FezDUggBw1CoBbJP2R7S/b/rv6z39Ieo+k9zezY9sbJH1C0pslXS3pZttXN/OcSKNMp2HSHQWsX6PF4H4s6ddtXytpd33z/RFxrAX7vkbSUxHxfUmyfbdq8w2eaMFzo83KMm2fWcEoo9RnyJ1Po9NAN0l6r6Rdkr4j6c6IONeifW+X9PSS21OSfnWVGvZL2i9JO3bsaNGuUYQynIZZtu4ooMxnyDXqAvqspBHVDv5vlvS3hVe0QkQcjoiRiBgZHBxs9+7RYcrUHQWUvUuy0TyAqyPi1ZJk+07VFoZrlVOSrlhye6i+DWhKWbqjgLJ3STa8JOTiLxFxzm7pdeC/Jekq21eqduB/u2rLTgNNK0N3FFD2LslGXUCvsf1c/eespOHF320/18yO62MJ75P0RUknJd2zeAF6AOgGZe+SbHQW0IYidx4RD0h6oMh9AEBKZe6SXPclIQEAF6esXZKNuoAAAF2KAACATBEAAJApAgAAMkUAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMgUAQAAmSIAACBTBABWNT0zr8eeflbTM/OpS0HG+BwWiyuC4SWOTJzSwbFJ9VQqWqhWdWjfsPbu2Z66LGSGz2HxaAFgmemZeR0cm9TcQlVn589pbqGqA2OTfANDW/E5bA8CAMtMnZlVT2X5x6KnUtHUmdlEFSFHfA7bgwDAMkNb+rRQrS7btlCtamhLX6KKOgt91q3B57A9CAAsM9Dfq0P7hrWpp6LNvRu1qaeiQ/uGNdDfm7q00jsycUqjtx/TOz71qEZvP6bPT5xKXVLH4nPYHo6I9u/UvknShyX9iqR
rImJ8Pf9uZGQkxsfX9VA0aXpmXlNnZjW0pY8/unWYnpnX6O3HNLfw4rfWTT0Vfe3gdfz/NYHPYWvYPh4RIyu3pzoL6HFJN0r6p0T7RwMD/b0d8QdXlgPEYp/1nF4MgMU+6074fyyrTvkcdqokARARJyXJdordo0uU6TRB+qzRiRgDQEdq1WmCrRq0pc8anaiwFoDthyVdtspdt0XEkQt4nv2S9kvSjh07WlQdOl0rulxa3YLYu2e7RndtLUWXFLAehQVARLyxRc9zWNJhqTYI3IrnROdrtstlaQtiMUQOjE1qdNfWpg7c9Fmjk9AFhI7UbJcLE42ARIPAtm+Q9A+SBiXdb3siIt6UohZ0rma6XBi0BRK1ACLivogYiojeiNjGwR8Xa6C/V6+54tIL7nZh0BZgNVBkjEFb5I4AQNYYtEXOGAQGgEwRAACQKQIAADJFAABApggAACi5oi40xFlAAFBiRa56SwsAAEqqVaveng8BAAAlVfSaVQQAAJRU0WtWEQAAUFJFr1nFIDAAlFiRa1YRAABQckWtWUUXELJU1HnVQCehBYDsFHledSebnplnaezMEADISlHXAu50hGKe6AJCVrgW8EsVPdkI5UUAICtcC/ilCMV8EQDICtcCfilCMV+MASA7XAt4ucVQPLBiDCD3/5ccEADIEtcCXo5QzBMBAEASoZgjxgAAIFNJAsD2R20/aXvS9n22L01RBwDkLFUL4Kik3RExLOl7km5NVAcAZCtJAETEQxFxrn7zG5KGUtQBADkrwxjAuyU9eL47be+3PW57/PTp020sqxxYtAxAUQo7C8j2w5IuW+Wu2yLiSP0xt0k6J+mu8z1PRByWdFiSRkZGooBSS4v1WQAUqbAAiIg3rnW/7XdJequkN0REVgf29WDRMgBFS3UW0PWSDkjaGxE/T1FD2bE+C4CipRoDuEPSZklHbU/Y/mSiOkqL9VkAFC3JTOCI2JViv52E9VkAFI2lIEqM9VkAFIkAKDnWZwFQlDLMAwAAJEAAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMgUAQAAmSIAACBTBACyMD0zr8eeflbTM/OpSwFKg+Wg0fWOTJzSwRUX1tm7Z3vqsoDkaAGgq03PzOvg2KTmFqo6O39OcwtVHRibpCUAiABAl5s6M6ueyvKPeU+loqkzs4kqAsqDAEBXG9rSp4Vqddm2hWpVQ1v6ElUElAcBgK420N+rQ/uGtamnos29G7Wpp6JD+4a5zCYgBoGRgb17tmt011ZNnZnV0JY+Dv5AHQGALAz093LgB1agCwgAMpUkAGz/te1J2xO2H7J9eYo6ACBnqVoAH42I4YjYI+kLkv4iUR0AkK0kARARzy25+TJJkaIOAMhZskFg238j6Q8l/VTStWs8br+k/ZK0Y8eO9hQHABlwRDFfvm0/LOmyVe66LSKOLHncrZI2RcRfruM5T0v6YeuqLNRWST9JXUQb8Xq7W06vtxtf68sjYnDlxsICYL1s75D0QETsTlpIi9kej4iR1HW0C6+3u+X0enN6ranOArpqyc23SXoyRR0AkLNUYwAfsf0qSVXVunTem6gOAMhWkgCIiH0p9ttmh1MX0Ga83u6W0+vN5rUmHwMAAKTBUhAAkCkCAAAyRQAUyPZHbT9ZX/foPtuXpq6pSLZvsn3CdtV2V55GZ/t629+1/ZTtD6aup2i2P237GduPp66laLavsP0l20/UP8fvT11T0QiAYh2VtDsihiV9T9Ktiesp2uOSbpT0ldSFFMH2BkmfkPRmSVdLutn21WmrKtxnJF2fuog2OSfpAxFxtaTXS/rTbn9/CYACRcRDEXGufvMbkoZS1lO0iDgZEd9NXUeBrpH0VER8PyKel3S3avNYulZEfEXS/6auox0i4kcR8e3672clnZS0PW1VxSIA2ufdkh5MXQSasl3S00tuT6nLDxC5sr1T0mslPZq2kmJxRbAmrWfNI9u
3qda8vKudtRVhvWs8AZ3Kdr+kMUm3rFi5uOsQAE2KiDeudb/td0l6q6Q3RBdMumj0ervcKUlXLLk9VN+GLmG7R7WD/10RcW/qeopGF1CBbF8v6YCkvRHx89T1oGnfknSV7SttXyLp7ZI+n7gmtIhtS7pT0smI+FjqetqBACjWHZI2Szpav/zlJ1MXVCTbN9iekvRrku63/cXUNbVSfUD/fZK+qNoA4T0RcSJtVcWy/TlJX5f0KttTtt+TuqYCjUp6p6Tr6n+vE7bfkrqoIrEUBABkihYAAGSKAACATBEAAJApAgAAMkUAAECmCABgFbZfqJ8G+Ljtf7X9C/Xtl9m+2/Z/2j5u+wHbr1zy726xPWf7l5ZsG6ivMjlj+44UrwdYDQEArG42IvZExG5Jz0t6b32i0H2SvhwRr4iI16m2wuu2Jf/uZtUmjN24ZNucpD+X9GftKR1YHwIAaOyrknZJulbSQkT8/4S+iHgsIr4qSbZfIalf0odUC4LFx/wsIh5RLQiA0iAAgDXY3qja+v/fkbRb0vE1Hv521ZaI/qpqM2e3rfFYIDkCAFhdn+0JSeOS/lu1NWIauVnS3RFRVW1BsZsKrA9oGquBAqubjYg9SzfYPiHp91Z7sO1XS7pKtXWfJOkSSf+l2npQQCnRAgDW75ikXtv7FzfYHrb9m6p9+/9wROys/1wu6XLbL09VLNAIi8EBq7A9ExH9q2y/XNLHJb1OtUHdH0i6RbUVQt8SEU8ueezHJP04Im63/QNJv6hay+BZSb8TEU8U/TqAtRAAAJApuoAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMjU/wFur9UWE4oSkQAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light", + "tags": [] + }, + "output_type": "display_data" + } + ], + "source": [ + "X_2d.plot.scatter(x=\"PCA1\", y=\"PCA2\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "cT7Rv9lEDSji" + }, + "source": [ + "As we can see from above, the two new features (known as Principle Components) are not like any of our input features. Additionally, these two new components explains 64.6% of all the original variation. Let's see how well this performs in explaining our dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "eBHVGhd-CJF8" + }, + "outputs": [], + "source": [ + "reg = LinearRegression()\n", + "\n", + "reg.fit(X, y)\n", + "print(\"All Features:\\t\\t\", reg.score(X, y))\n", + "reg.fit(X_2d, y)\n", + "print(\"With PCA Features:\\t\", reg.score(X_2d, y), end=\"\\n\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "bsoN_a8XI4p2" + }, + "source": [ + "As expected, with a higher explained variance ratio, we perform better in predicting the quality of the wine. With Dimensionality Reduction, we were able to capture most of the variation in the dataset while still being able to view it in two dimensions. In the most basic of terms, PCA creates a variable projected along the axis of maximum variable.\n", + "\n", + "![](https://raw.githubusercontent.com/bfkwong/data/master/IMG_0187%202.jpg)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "vwDiZRX3JYLN" + }, + "source": [ + "### Optional: Linear Algebra Behind PCA\n", + "\n", + "Principle Component Analysis chooses principle components along te **axis of greatest variance**. 
In Linear Algebra terms, the axis of greatest variance is the **Eigenvector with the largest Eigenvalue of the covariance matrix ($\\Sigma$)**\n", + "\n", + "The following are the steps in order to do PCA manually. \n", + "\n", + "1. Given data matrix $M$, generate the covariance matrix of $M$ denoted as $\\Sigma$\n", + "2. We then compute the Eigenvector and Eigenvalue of covariance matrix $\\Sigma$\n", + "3. Project the features to the Eigenvector with the largest Eigenvalue using the dot product (cross product for more than 1 dimensions)" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "J26dEAoAD0q7" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "# Step 1: Calculate covariance matrix \n", + "cov_mtrx = np.cov(X.T)\n", + "\n", + "# Step 2: Calculate Eigenvector and Eigenvalue\n", + "W,v = np.linalg.eig(cov_mtrx)\n", + "\n", + "# Step 3: Find the largest Eigenvalue and project our data onto the corresponding Eigenvector\n", + "idx_largest_eigenval = np.argmax(W)\n", + "eigenvec = v[:,idx_largest_eigenval]\n", + "\n", + "total = []\n", + "for row in X.index: \n", + " total.append(np.dot(X.loc[row], eigenvec))\n", + "\n", + "pd.DataFrame(pd.Series(total), columns=[\"PCA1\"]).head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "_G7NWh_1nf_k" + }, + "source": [ + "One interesting thing you may see here is that the eigenvalue corresponds to the variance explained by its corresponding eigenvalue. 
The eigenvalue of PCA1 is 2.26 and the variance of PCA1 is also 2.26." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "9QCALn43iTYd" + }, + "outputs": [], + "source": [ + "idx_largest_eigenval = np.argmax(W)\n", + "variance = pd.Series(total).var()\n", + "\n", + "print(\"Largest Eigenvalue:\\t\", W[idx_largest_eigenval])\n", + "print(\"Variance of PCA1:\\t\", variance)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "bmLOaYzvEeG7" + }, + "source": [ + "### Multidimensional Scaling (MDS)\n", + "\n", + "We now pay a visit to our good friend, the Euclidean distance. One incredibly useful aspect of Euclidean distance is that it works in higher dimensions. The formula $$\\sqrt{x_1^2 + x_2^2}$$ gives the two dimensional Euclidean distance, where $x_1$ and $x_2$ are the differences between the two points along each axis. Moving to a third dimension is as easy as adding an $x_3^2$ term under the square root. One thing to recognize is that Euclidean distance in any number of dimensions is still a single number. \n", + "\n", + "Why am I rambling on about something you learned in middle school? Well, the realization that Euclidean distance is a scalar in any number of dimensions means that we can preserve the variance of n-dimensional data in two dimensions, as long as we try to ensure that the Euclidean distances between points in n dimensions are proportional to the Euclidean distances between the same points in 2 dimensions. That was a lot to take in, so the following image explains the concept.\n", + "\n", + "![](https://raw.githubusercontent.com/bfkwong/data/master/IMG_0188.jpg)\n", + "\n", + "Notice how when we reduced our dimensions from 2 to 1, the distances between points A, B, and C remained the same, meaning that x, y, and z remained the same between the two dimensions. While distances may not always be preserved perfectly between dimensions, MDS attempts to preserve them as well as possible. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "L4AjVFrwn78I" + }, + "outputs": [], + "source": [ + "from sklearn.manifold import MDS\n", + "# We want to be able to plot this on a 2-D scatter plot so we choose 2 Dimensions\n", + "dimension_we_want = 2\n", + "\n", + "mds = MDS(n_components=dimension_we_want)\n", + "X_2d = mds.fit_transform(X) \n", + "X_2d = pd.DataFrame(X_2d, columns=[\"Dimension 1\", \"Dimension 2\"])\n", + "\n", + "display(X_2d.head())\n", + "print(\"Percentage variance explained:\", X_2d.var().sum()/X.var().sum())\n", + "X_2d.plot.scatter(x=\"Dimension 1\", y=\"Dimension 2\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "mlIA-o_cEnSD" + }, + "source": [ + "By using MDS, we were actually able to preserve over 95% of the variance from the original dataset. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "88kI9JxzFO6n" + }, + "source": [ + "### Linear vs Nonlinear Dimensionality Reduction \n", + "\n", + "Thus far, we have explored PCA (a linear reduction technique) and MDS (a nonlinear reduction technique). While there are a lot of differences between the two methods, the key difference can be boiled down to the following statement: **linear dimensionality reduction techniques only stretch and shift the data while nonlinear techniques make more drastic changes to the data**.\n", + "\n", + "This sometimes leads to nonlinear techniques being better at capturing variance but losing the overall shape of the data, whereas linear techniques are better at keeping the general shape of the original data but lose more variance along the way. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "7uprQWlUGC37" + }, + "source": [ + "# Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "UnMJKgsTkDoZ" + }, + "source": [ + "1. Consider the Iris dataset (https://raw.githubusercontent.com/dlsun/pods/master/data/iris.csv). Drop the \"SepalWidth\" and \"PetalWidth\" columns and then apply PCA on \"SepalLength\" and \"PetalLength\" with `n_components = 2`. What percentage of the variance was PCA able to capture in this case? What happens when we use PCA to compress 2D data into 2D data?" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "HCXKtA8AlSQq" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "7.2 Visualizing Higher Dimensions.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/07-Unsupervised-Learning/7.3 Visualizing_Higher_Dimensions.ipynb b/07-Unsupervised-Learning/7.3 Visualizing_Higher_Dimensions.ipynb new file mode 100644 index 0000000..13d4466 --- /dev/null +++ b/07-Unsupervised-Learning/7.3 Visualizing_Higher_Dimensions.ipynb @@ -0,0 +1,738 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "XTPLBCWKzfWS" + }, + "source": [ + "# 7.3 Visualizing Higher Dimensions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "f1mm_fCWm9dO" + }, + "source": [ + ">\"I am a Tralfamadorian, seeing all time as you might see a 
stretch of the Rocky Mountains. All time is all time. It does not change. It does not lend itself to warnings or explanations. It simply is.\" \n", + ">\n", + ">-Kurt Vonnegut in \"Slaughterhouse-Five\"\n", + "\n", + "We unfortunately are not Tralfamadorians. Instead, we are three dimensional beings who can't visually see a fourth dimension like it's a location on the Rocky Mountains. However, this doesn't mean that the fourth dimension is meaningless to us. We can derive a lot of insight from understanding the higher dimensions. The problem is, we can't see them and thus we can't plot them. \n", + "\n", + "Fortunately, very clever mathematicians throughout history have invented techniques that allow us to simulate what the higher dimensions would look like. The rest of the section will discuss these techniques." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "XR-JkcQuwozv" + }, + "source": [ + "## Using Size and Color\n", + "\n", + "This is more of a review from previous sections, but one way to visualize more dimensions is by using the size and color attributes of your scatter plots. This is rather intuitive. " + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 368 + }, + "colab_type": "code", + "id": "kgChDoxMwyKa", + "outputId": "cc578404-1e67-476a-f736-e070bc9c233e" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "" + ], + "text/plain": [ + "alt.Chart(...)" + ] + }, + "execution_count": 30, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "from scipy import stats\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", + "import altair as alt\n", + "\n", + "df_bordeaux = pd.read_csv(\"http://dlsun.github.io/pods/data/bordeaux.csv\")\n", + "\n", + "alt.Chart(df_bordeaux).mark_circle().encode(\n", + " alt.X('age',\n", + " scale=alt.Scale(zero=False)\n", + " ),\n", + " alt.Y('sep',\n", + " scale=alt.Scale(zero=False)\n", + " ),\n", + " color=\"summer\",\n", + " size=\"win\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "gxV83MG_y2LJ" + }, + "source": [ + "I am sure you can see the limitations of this method: you can only go up to 4 dimensions (5 if you use a 3-D scatter plot). This is still worth mentioning as sometimes this may be all you need. \n", + "\n", + "For higher dimensions, we should consider either feature selection or dimensionality reduction. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "tlWDTyDsr5Zv" + }, + "source": [ + "## Feature Selection \n", + "\n", + "If we want to compress 10 dimensions worth of data into 2 dimensions, we're bound to lose some detail during that compression. We can measure how much detail we kept at the end with the explained variance ratio, which gives the percentage of variance/detail we kept after the compression. The higher the ratio, the more variance and detail we kept. \n", + "\n", + "One very simple, almost trivial, way to visualize higher dimensional data is to plot only the two dimensions that explain the most variation in the data. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 136 + }, + "colab_type": "code", + "id": "g9kxLPD30dxa", + "outputId": "c1aedb62-573b-406c-cfa6-844ab4054122" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "summer 0.343538\n", + "sep 0.323582\n", + "age 0.206936\n", + "year 0.206936\n", + "har 0.199621\n", + "win 0.053456\n", + "Name: R^2 Values, dtype: float64" + ] + }, + "execution_count": 31, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + } + ], + "source": [ + "sclr = StandardScaler()\n", + "df_bordeaux = pd.DataFrame(sclr.fit_transform(df_bordeaux.dropna()), columns=df_bordeaux.columns)\n", + "\n", + "X = df_bordeaux.drop(\"price\", axis=1)\n", + "y = df_bordeaux[\"price\"]\n", + "\n", + "reg = LinearRegression()\n", + "reg.fit(X, y)\n", + "\n", + "scores = pd.Series(dtype=float, name=\"R^2 Values\")\n", + "for column in X.columns: \n", + " reg = LinearRegression()\n", + " reg.fit(X[[column]], y)\n", + " scores[column] = reg.score(X[[column]], y)\n", + "\n", + "scores.sort_values(ascending=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "qHprVgDLxKyE" + }, + "source": [ + "As we can see above, average summer temperature **summer** and average September temperature **sep** are the two variables that explain the most variance in the quality of the wine **price**. Thus, if we want to get the best representation of the dataset with only two dimensions, we can make a scatterplot of **summer** vs **sep**. However, even with the two variables that explain the most variation, we can only capture 33% of the variation of the original data: every feature was standardized to the same variance, so two of the six features hold 2/6 of the total. The other 67% is lost to the other features we chose to ignore. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 330 + }, + "colab_type": "code", + "id": "FcUCeYGOx2nP", + "outputId": "16dc19dd-f2ac-44b7-8ac6-0236cac283d4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "% Variance Explained: 0.3333333333333333\n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 32, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEGCAYAAABsLkJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAUNklEQVR4nO3dfYxcV33G8efZeLu26rQYrwvEDpiQqLwaJ2zTgCmF8NKQVqZgKFDRgqByU4RU1Ao7CLX0TY1spP5RUVqshAKCBigmtVsCIWAoL2rSrINf8kJCQEmzbkrM1glZsLfrzK9/zF083njt2ezcOefO+X6kVWbvzO793ZvxPHvOPedcR4QAAOUZSl0AACANAgAACkUAAEChCAAAKBQBAACFWpK6gIUYHR2NtWvXpi4DABpl7969P4yIVXO3NyoA1q5dq/Hx8dRlAECj2L7vVNvpAgKAQhEAAFAoAgAACkUAAEChCAAAKBQBAABzTE5Na//9D2lyajp1KbVq1DBQAKjbrn2HtHXnAQ0PDWmm1dL2Teu0cf3q1GXVghYAAFQmp6a1decBHZtp6ZHp4zo209KWnQcGtiVAAABAZeLIUQ0PnfyxODw0pIkjRxNVVC8CAAAqa1Ys00yrddK2mVZLa1YsS1RRvQgAAKisXD6i7ZvWaenwkM4eWaKlw0PavmmdVi4fSV1aLbgIDAAdNq5frQ3nj2riyFGtWbFsYD/8JQIAAB5j5fKRgf7gn0UXEAAUigAAgEIRAABQKAIAAApFAABAoQgAACgUAQAAhSIAAKBQBAAAFIoAAIBCEQAAUCgCAAAKlSwAbJ9r+6u277B9u+0/TFULAJQo5WqgxyX9cUTcavtsSXtt3xgRdySsCQCKkawFEBEPRMSt1eNHJN0paTDvvAwAGcriGoDttZIulHTzKZ7bbHvc9vjhw4f7XRoADKzkAWB7uaSdkt4dET+a+3xE7IiIsYgYW7VqVf8LBIABlTQAbA+r/eH/yYj4XMpaAKA0KUcBWdI1ku6MiL9JVQcAlCplC2CDpN+RdKntfdXX5QnrAYCiJBsGGhHflORU+weA0iW/CAwASIMAAIBCEQAAUCgCAAAKRQAAQKEIAAAoFAEAAIUiAACgUAQAABSKAACAQhEAAFAoAgAACkUAAEChCACgzyanprX//oc0OTWduhQULtly0ECJdu07pK07D2h4aEgzrZa2b1qnjetXpy4LhaIFAPTJ5NS0tu48oGMzLT0yfVzHZlrasvMALQEkQwAAfTJx5KiGh07+Jzc8NKSJI0cTVYTSEQBAn6xZsUwzrdZJ22ZaLa1ZsSxRRSgdAQD0ycrlI9q+aZ2WDg/p7JElWjo8pO2b1mnl8pHUpaFQXAQG+mjj+tXacP6oJo4c1ZoVy/jwR1IEA
NBnK5eP8MGPLNAFBACFIgAAoFAEAADMUcpsba4BAECHkmZr0wIAgEpps7UJAACo9Gu2di5dTHQBAUClH7O1c+piogUAAJW6Z2vn1sVECwAAOtQ5W3u2i+mYTrQyZruYUkwOJAAAYI66ZmvntiAgXUAA0Ce5LQhICwAA+iinBQEJAADos1wWBKQLCAAKlTQAbH/E9oO2b0tZBwCUKHUL4KOSLktcAxool5mU/VDSsaK/kl4DiIiv216bsgY0T04zKetW0rGi/1K3AM7I9mbb47bHDx8+nLocJJbbTMo6lXSsSCP7AIiIHRExFhFjq1atSl0OEuvXYl05KOlYkUb2AQB0ym0mZZ1KOlakQQCgUXKbSVmnko4VaTgi0u3cvlbSSyWNSvqBpPdHxDXzvX5sbCzGx8f7VB1yNjk1ncVMyn4o6VhRD9t7I2Js7vbUo4DenHL/aK5cZlL2Q0nHiv6iCwgACkUAAEChCAAAKBQBAACFIgAAoFAEAFAwFporGzeEAQrFQnOgBQAUiIXmIBEAQJEGYaE5uq8Wjy4goEBNX2iO7qveoAUAFKjJC83RfdU7tACAQm1cv1obzh9t3EJzs91Xx3SiBTPbfdWUY8gFAQAUrIkLzTW9+yondAEBaJQmd1/lhhYAgMZpavdVbggAAI3UxO6r3NAFBACFIgAAoFAEAAAUigAAgEIRAABQKAIAAApFAABAoRYUALZ/zvbZdRUDAOifrgLA9i/ZPijpgKTbbO+3/YJ6S0PTsV47kLduZwJfI+mdEfENSbL9Ykn/KGldXYWh2VivHchft11Aj85++EtSRHxT0vF6SkLTsV470AzdtgD+3faHJV0rKSS9UdLXbF8kSRFxa031oYFYrx1ohm4D4PnVf98/Z/uFagfCpT2rCI03COu1T05NF7HSZCnHiVPrKgAi4mV1F4LBMbte+5Y51wCa8gFTyvWLUo4T83NEnPlF9pMk/bWkcyLi1bafLemFEXFN3QV2Ghsbi/Hx8X7uEovQxL8uJ6emtWHbHh2bOdGCWTo8pG9tvbQxx9CNUo4Tbbb3RsTY3O3dXgT+qKQbJJ1TfX+3pHf3pjQMqpXLR/T8c5/QqA+U2esXnWavX8ynicNdH89xPh5NPDcl6fYawGhEfMb2eyUpIo7bfrTGuoAkFnr9oqndKP24TtPUc1OSblsAP7a9Uu0LvrJ9iaSHa6sKSGQh95tt8nDXuu+r2+RzU5JuWwB/JGm3pGfY/pakVZJeX1tVQELd3m+26cNd67yvbtPPTSm6DYBnSHq1pHMlbZL0ywv4WaBxurnf7CAMd63rvrqDcG5K0G0X0J9ExI8krZD0MkkfkvT3i9257cts32X7HttXLvb3Af1UdzdKk3FumqHbYaDfjogLbV8l6WBE/NPstse9Y/sstUcTvVLShKRbJL05Iu6Y72cYBoocNXG4a79wbvIw3zDQbrtxDlVLQbxS0jbbI1r8vQQulnRPRHy/KvBTkl4jad4AAHJUVzfKIODc5K3bD/HfUnsewK9FxEOSnijpPYvc92pJ93d8P1FtO4ntzbbHbY8fPnx4kbsEAMzqdimIn0j6XMf3D0h6oK6i5ux7h6QdUrsLqB/7BIASpLwl5CG1RxXNWlNtA5ApZvYOlpRDOW+RdIHtp6v9wf8mSb+dsB4Ap8HM3sGTrAUQEcclvUvtawt3SvpMRNyeqh4A82Nm72BKOpkrIq6XdH3KGkrGED10K8eZvbx/F4/ZvIWiOY+FyG1mL+/f3kh5ERiJ0JzHQuU0s5f3b+/QAihQjs155K/OxeMWgvdv7xAABcqtOY/myGFmL+/f3qELqEA5NeeBheL92ztdLQaXCxaD6y1GUTQD/59OjfPSvcUuBocBlENzHqfHaJf58f5dPLqAgEwx2gV1IwCATM2Oduk0O9oF6AUCADiFHBY9Y7QL6sY1AGCOXPrdZ0e7bJlTC/3e6BUCAOjQ2e8+O9Foy84D2nD+aJIP3lwmX2EwEQBAhxxnm
TLaBXXhGgDQYc2KZTo6c/ykbUdnjtPvjoFEAABz2D7t98CgIACADhNHjmrpkrNO2rZ0yVkMvcRAIgCADgy9REkIAKADC42hJIwCQteavPjWQmpf6NDLJp8XlI0AQFdymRz1eDye2rsdetnk8wLQBYQzavKiZHXW3uTzAkgEALrQ5EXJ6qy9yecFkAgAdKHJI2PqrL3J5wXNUtfihAQAzqjJI2PqrL3J5wXNsWvfIW3Ytkdvufpmbdi2R7v3HerZ7+aWkOhak0e71Fl7k88L8jY5Na0N2/bo2MyJlubS4SF9a+ulC3qvcUtILFqTFyWrs/Ymnxfkre7FCekCAoBM1X2diQAAgEzVfZ2JLiAAyFidNwUiAAAgc3VdZ6ILCAAKRQAAfVbXpB5goegCAvqIxeOQE1oAQJ+weBxyQwAAfcLicchNkgCw/Qbbt9tu2X7M9GRgELF4HHKTqgVwm6TXSfp6ov0DfcficchNkovAEXGnJNlOsXsgmTon9QALlf0oINubJW2WpKc+9amJqwEWj8XjkIvaAsD2lyU9+RRPvS8idnX7eyJih6QdUns56B6VBwDFqy0AIuIVdf1uAMDiMQwUAAqVahjoa21PSHqhpM/bviFFHQBQslSjgK6TdF2KfQMA2ugCAoBCEQAAUCgCAAAKRQAAQKEIAAAoFAEAAIUiAACgUAQAABSKAACAQhEAAFAoAgAACkUAAEChCAAAKBQBAACFIgAAoFAEAAAUigAAgEIRAABQKAIAAApFAABAoQiAzE1OTWv//Q9pcmo6dSkABsyS1AVgfrv2HdLWnQc0PDSkmVZL2zet08b1q1OXBWBA0ALI1OTUtLbuPKBjMy09Mn1cx2Za2rLzAC0BAD1DAGRq4shRDQ+d/L9neGhIE0eOJqoIwKAhADK1ZsUyzbRaJ22babW0ZsWyRBUBGDQEQKZWLh/R9k3rtHR4SGePLNHS4SFt37ROK5ePpC4NwIAo4iLw5NS0Jo4c1ZoVyxr1Abpx/WptOH+0kbUDyN/AB0DTR9KsXD7CBz+AWgx0FxAjaQBgfgMdAIykAYD5DXQAMJIGAOY30AHASBoAmN/AXwRmJA0AnNrAB4DESBoAOJWB7gICAMyPAACAQiUJANsfsP0d2wdsX2f7CSnqANBc3Ctj8VK1AG6U9NyIWCfpbknvTVQHgAbate+QNmzbo7dcfbM2bNuj3fsOpS6pkZIEQER8KSKOV9/eJGlNijoANA8z/Hsnh2sAb5f0hfmetL3Z9rjt8cOHD/exLAA5YoZ/79Q2DNT2lyU9+RRPvS8idlWveZ+k45I+Od/viYgdknZI0tjYWNRQKoAGYYZ/79QWABHxitM9b/ttkn5D0ssjgg92AF2ZneG/Zc4qv8z1WbgkE8FsXyZpi6RfjYifpKgBQHMxw783Us0E/qCkEUk32pakmyLiikS1AGggZvgvXpIAiIjzU+wXAHBCDqOAAAAJEAAAUCgCAAAKRQAAQKHcpCH4th+RdFfqOuYYlfTD1EWcQo515ViTlGddOdYk5VlXjjVJedX1tIhYNXdj024Ic1dEjKUuopPt8dxqkvKsK8eapDzryrEmKc+6cqxJyreuTnQBAUChCAAAKFTTAmBH6gJOIceapDzryrEmKc+6cqxJyrOuHGuS8q3rpxp1ERgA0DtNawEAAHqEAACAQmUdAN3ePN72vbYP2t5nezyTmi6zfZfte2xfWWdN1f7eYPt22y3b8w496/O56ramfp+rJ9q+0fZ3q/+umOd1j1bnaZ/t3TXVctpjtz1i+9PV8zfbXltHHQus6W22D3ecm9/rQ00fsf2g7dvmed62/7aq+YDti+quqcu6Xmr74Y5z9af9qKtrEZHtl6RXSVpSPd4mads8r7tX0mguNUk6S9L3JJ0n6Wck7Zf07JrrepakX5T0NUljp3ldP8/VGWtKdK62S7qyenzlad5XUzXXccZjl/ROSf9QPX6TpE9nUNPbJH2wH++hjn2+RNJFkm6b5
/nL1b61rCVdIunmTOp6qaR/6+e5WshX1i2AyPDm8V3WdLGkeyLi+xHxf5I+Jek1Ndd1Z0RkNUu6y5r6fq6q3/+x6vHHJP1mzfubTzfH3lnrZyW93NVNNBLW1HcR8XVJ/3ual7xG0sej7SZJT7D9lAzqylrWATDH6W4eH5K+ZHuv7c0Z1LRa0v0d309U23KQ6lzNJ8W5elJEPFA9/h9JT5rndUttj9u+yXYdIdHNsf/0NdUfHg9LWllDLQupSZI2VV0tn7V9bo31dCvnf3MvtL3f9hdsPyd1MZ2SLwXRo5vHvzgiDtn+BbXvMvadKplT1tRz3dTVhb6fqxROV1fnNxERtucbC/206lydJ2mP7YMR8b1e19pA/yrp2oiYtv37ardQLk1cU65uVft9NGX7ckn/IumCxDX9VPIAiB7cPD4iDlX/fdD2dWo3Yx/3h1oPajokqfOvojXVtkU5U11d/o6+nqsu9P1c2f6B7adExANVN8GD8/yO2XP1fdtfk3Sh2v3jvdLNsc++ZsL2Ekk/L2myhzUsuKaI6Nz/1WpfU0mtlvfRYkXEjzoeX2/7Q7ZHIyKLReKy7gLyiZvHb4x5bh5v+2dtnz37WO2LtKe8It+vmiTdIukC20+3/TNqX7yrZRTJQvT7XHUpxbnaLemt1eO3SnpMS8X2Ctsj1eNRSRsk3dHjOro59s5aXy9pz3x/CPWrpjl96xsl3VljPd3aLel3q9FAl0h6uKObLxnbT569ZmP7YrU/c+sM8IVJfRX6dF+S7lG7X29f9TU7GuIcSddXj89Te6TCfkm3q931kLSm6vvLJd2t9l+MtdZU7e+1avd7Tkv6gaQbMjhXZ6wp0blaKekrkr4r6cuSnlhtH5N0dfX4RZIOVufqoKR31FTLY45d0l+o/QeGJC2V9M/V++4/JZ3Xh/Nzppquqt4/+yV9VdIz+1DTtZIekDRTvafeIekKSVdUz1vS31U1H9RpRsL1ua53dZyrmyS9qB91dfvFUhAAUKisu4AAAPUhAACgUAQAABSKAACAQhEAAFAoAgAACkUAAJmpJjPxbxO1402GIlWzoj9fLdJ1m+03un2vhNHq+bFq+QfZ/jPbH7P9Ddv32X6d7e1u31fhi7aHq9fda/uqat33cdsX2b7B9vdsX9Gx7/fYvqVaTO3Pq21r3V6D/+Nqz87OYYE1DDgCAKW6TNJ/R8TzI+K5kr54htc/Q+0FzzZK+oSkr0bE8yQdlfTrHa/7r4hYL+kbkj6q9vINl0ia/aB/ldqLgV0sab2kF9h+SfWzF0j6UEQ8JyLuW/whAqdHAKBUByW90vY2278SEQ+f4fVfiIiZ6ufO0onAOChpbcfrdndsvzkiHomIw5Km3b573Kuqr2+rvVLkM3Vidcj7or2WPdAXyVcDBVKIiLvdvm3g5ZL+yvZX1F7ee/aPoqVzfmS6+rmW7Zk4sYZKSyf/O5ru2D7dsX32dZZ0VUR8uPOXu32rxx8v5piAhaIFgCLZPkfSTyLiE5I+oPZt/e6V9ILqJZtq2vUNkt5ue3lVx+rq3gxA39ECQKmeJ+kDtltqr+T4B5KWSbrG9l+qfR/jnouIL9l+lqT/qFYJnpL0FkmP1rE/4HRYDRQACkUXEAAUigAAgEIRAABQKAIAAApFAABAoQgAACgUAQAAhfp/rOG9WPsCRBEAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light", + "tags": [] + }, + "output_type": "display_data" + } + ], + "source": [ + "explained_var = X[[\"summer\", \"sep\"]].var(axis=0).sum() / X.var(axis=0).sum() \n", + "\n", + "print(\"% Variance Explained:\", explained_var, end=\"\\n\\n\")\n", + "df_bordeaux.plot.scatter(x=\"summer\", y=\"sep\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Y6gWdO1CINp-" + }, + "source": [ + "Additionally, if we look below, using only two features has hindered our predictive accuracy. This sucks! Fortunately, some very clever mathematicians came up with ways to get around this. " + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + }, + "colab_type": "code", + "id": "l7re2E6KHUbs", + "outputId": "24c2f5c2-ff71-420e-874c-2bbdee6f9a8f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "All Features:\t\t 0.7526018827767169\n", + "With 2 Selected Features:\t 0.4633153344681292\n", + "\n" + ] + } + ], + "source": [ + "reg = LinearRegression()\n", + "\n", + "reg.fit(X, y)\n", + "print(\"All Features:\\t\\t\", reg.score(X, y))\n", + "reg.fit(X[[\"summer\", \"sep\"]], y)\n", + "print(\"With 2 Selected Features:\\t\", reg.score(X[[\"summer\", \"sep\"]], y), end=\"\\n\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "tflTZWS688xW" + }, + "source": [ + "## Dimensionality Reduction\n", + "\n", + "With feature selection, we were only able to capture 33% of the original variance, which isn't great. To capture more variation while still remaining in two variables, we have to utilize some clever math.\n", + "\n", + "These clever mathematical techniques are known as feature creation, where we try to create new variables that help us visualize higher dimensional data. 
There are many different techniques for dimensionality reduction, each of which attempts to accomplish something different. Let's start off with the simplest and most popular one: **Principal Component Analysis**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "cZ52OQlGAm6Q" + }, + "source": [ + "### Principal Component Analysis (PCA)\n", + "\n", + "Simply put, Principal Component Analysis creates new features that maximize variation. What PCA is **not** doing is feature selection; rather, it is creating entirely new features, each of which is a linear combination of all the original features. \n", + "\n", + "PCA involves some simple linear algebra, but scikit-learn has a PCA implementation. Note that all dimensionality reduction algorithms in scikit-learn are operated very similarly to the machine learning algorithms you learned in the previous chapters. Create the object and then run `fit()` or `fit_transform()`." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 255 + }, + "colab_type": "code", + "id": "FESc4NoI8-3v", + "outputId": "1d1b00d1-65f5-40dd-ca5f-93f9ba84ef7c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "% Variance Explained: 0.6462543462926849\n", + "\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PCA1PCA2
02.620822-1.437859
12.1664460.732196
22.359265-0.067934
31.518218-0.792417
41.5199200.451721
\n", + "
" + ], + "text/plain": [ + " PCA1 PCA2\n", + "0 2.620822 -1.437859\n", + "1 2.166446 0.732196\n", + "2 2.359265 -0.067934\n", + "3 1.518218 -0.792417\n", + "4 1.519920 0.451721" + ] + }, + "execution_count": 36, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.decomposition import PCA\n", + "\n", + "# We want to be able to plot this on a 2-D scatter plot so we choose 2 Dimensions\n", + "dimension_we_want = 2\n", + "\n", + "pca = PCA(n_components=dimension_we_want)\n", + "X_2d = pca.fit_transform(X) \n", + "X_2d = pd.DataFrame(X_2d, columns=[\"PCA1\", \"PCA2\"])\n", + "\n", + "print(\"\\n% Variance Explained:\", pca.explained_variance_ratio_.sum(), end=\"\\n\\n\")\n", + "X_2d.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + }, + "colab_type": "code", + "id": "Gj87JXmmBdNu", + "outputId": "4574186e-9f94-4b91-9cf9-ba4a68b9f866" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 37, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + }, + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYAAAAEGCAYAAABsLkJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAATe0lEQVR4nO3dcWzcZ33H8c/nEtfxcLZGjpWucUMqUtCqYIKwOjZv0lrYKIgFtVk1OsHEYIuQhkQlpoSqbGOaJtGwMaSViUWUwaSKqpvbBdFWNFVgUAQFB7mmaQrqGKyOGA1eSmOwXaf33R93Xm3X8Tm5+93zu3veL8mS73eX+30vd/597nme3/P8HBECAOSnkroAAEAaBAAAZIoAAIBMEQAAkCkCAAAytTF1ARdi69atsXPnztRlAEBHOX78+E8iYnDl9o4KgJ07d2p8fDx1GQDQUWz/cLXtdAEBQKYIAADIFAEAAJkiAAAgUwQAAGSKAADQEtMz83rs6Wc1PTOfuhSsU0edBgqgnI5MnNLBsUn1VCpaqFZ1aN+w9u7ZnrosNEALAEBTpmfmdXBsUnMLVZ2dP6e5haoOjE3SEugABACApkydmVVPZfmhpKdS0dSZ2UQVYb0IAABNGdrSp4Vqddm2hWpVQ1v6ElWE9SIAADRloL9Xh/YNa1NPRZt7N2pTT0WH9g1roL83dWlogEFgAE3bu2e7Rndt1dSZWQ1t6ePg3yEIAAAtMdDfy4G/w9AFBACZIgAAIFPJAsD2JtvftP2Y7RO2/ypVLQCQo5RjAPOSrouIGds9kh6x/WBEfCNhTQCQjWQBEBEhaaZ+s6f+E6nqAYDcJB0DsL3B9oSkZyQdjYhHV3nMftvjtsdPnz7d/iIBoEslDYCIeCEi9kgaknSN7d2rPOZwRIxExMjg4EuuaQwAuEilOAsoIp6V9CVJ16euBQBykfIsoEHbl9Z/75P025KeTFUPAOQm5VlAvyzps7Y3qBZE90TEFxLWAwBZSXkW0KSk16baPwDkrhRjAACA9iMAACBTBAAAZIoAAIBMEQAAkCkCAAAyRQAAQKYIAADIFAEAAJkiAAAgUwQAAGSKAACATBEAAJApAgAAMkUAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMhUsgCwfYXtL9l+wvYJ2+9PVQsA5Ghjwn2fk/SBiPi27c2Sjts+GhFPJKwJALKRrAUQET+KiG/Xfz8r6aSk7anqAYDclGIMwPZOSa+V9Ogq9+23PW57/PTp0+0uDQC6VvIAsN0vaUzSLRHx3Mr7I+JwRIxExMjg4GD7CwSALpU0AGz3qHbwvysi7k1ZCwDkJuVZQJZ0p6STEfGxVHUAQK5StgBGJb1T0nW2J+o/b0lYDwBkJdlpoBHxiCSn2j8A5C75IDAAIA0CAAAyRQAAQKYIAADIFAEAAJkiAIAVpmfm9djTz2p6Zj51KUChUq4GCpTOkYlTOjg2qZ5KRQvVqg7tG9bePaxRiO5ECwCom56Z18GxSc0tVHV2/pzmFqo6MDZJSwBdiwAA6qbOzKqnsvxPoqdS0dSZ2UQVAcUiAIC6oS19WqhWl21bqFY1tKUvUUVAsQgAoG6gv1eH9g1rU09Fm3s3alNPRYf2DWugvzd1aUAhGAQGlti7Z7tGd23V1JlZDW3p4+CPrkYAACsM9Pdy4EcW6AICgEwRAACQKQIAQCGYUV1+jAEAaDlmVHcGWgB1fFsBWoMZ1Z2DFoD4tgK00uKM6jm9OKlucUY1Z1eVS/YtAL6tdB9ac2kxo7pzZBUAqx0YWP+luxyZOKXR24/pHZ96VKO3H9PnJ06lLik7zKjuHNl0AZ2vm4dvK91jaWtusfvhwNikRndt5eDTZsyo7gxZtADW6ubh20r3oDVXLgP9vXrNFZfyt1RiWbQAGg1K8W2lO9CaAy5M0haA7U/bfsb240XuZz0HBr6tdD5ac8C
FSd0C+IykOyT9S5E7WTwwHFgxBsCBofvQmgPWL2kARMRXbO9sx744MOSD1TyB9UndAmjI9n5J+yVpx44dTT0XBwYAeFHpzwKKiMMRMRIRI4ODg6nLAdqOiW0oSulbAEDOWKYERSp9CwDIFcuUoGipTwP9nKSvS3qV7Snb70lZD1AmTGxD0VKfBXRzyv0DZcbENhSNLiCgpJjYVl7dMjDPIDBQYsxfKZ9uGpgnAICSY/5KeXTbirN0AQHAOnXbwDwBAADr9LJLNmj+3AvLtnXywDwBsIZuGejBi3hPcbGOTJzSW+94RJWKJUm9G9zxA/OMAZxHNw30oIb3FBdrad//orB1//t+Q7u2bU5YWXPW1QKw3bPKtq2tL6cYF/qtjxmY3Yf3FM1Yre+/d0NFP3v+hfP8i86wZgDYvtb2lKQf2X5oxdLNDxVZWKtczEXCu22gB7ynaE63Tspr1AI4JOlNEbFV0mFJR22/vn6fC62sBS72W1+3vtk54z3tTGUZs+nWSXmNxgAuiYgTkhQR/2b7pKR7bR+UFIVX16RG1wI+H64g1n14TztP2cZsunFSXqMAWLB9WUT8jyRFxAnbb5D0BUmvKLy6JjXzra8b3+zc8Z52jrJOuOq2SXmNuoA+KGnb0g0RMSXptyR9pKCaWqbZZhsXiu8+vKedgTGb9lizBRARD5/nrs2Snm99Oa3Ht758TM/M8z53CcZs2mPd8wBsD0q6SdLNki6XdF9RRbVatzXb8FJl6y9GcxizaY81A8D2Zkk3SvoDSa+UdK+kKyNiqA21AetS1v5iNIfWe/EatQCekfRNSR+S9EhEhO0bii8LWL+LPdsL5UfrvViNBoFvldQr6R8l3Wq79Gf+ID/0FwMXZ80AiIiPR8TrJb2tvunfJV1u+6DtVxZeHbAO3TpJByiaIy5sPpft3aoNBP9+ROwqpKrzGBkZifHx8XbuEh2Es4CA1dk+HhEjK7c3GgTeJWlbRHxtcVtEPG77QUn/3Poy0clSH4DpLwYuTKNB4I+rNg6w0k8l/b2k3215RehInIYJdJ5Gg8DbIuI7KzfWt+0spCJ0HJZaBjpTowC4dI37OMUCkso5bb8sq0gCZdYoAMZt/8nKjbb/WNLxZndu+3rb37X9lO0PNvt8SKNsp2FezDUggBw1CoBbJP2R7S/b/rv6z39Ieo+k9zezY9sbJH1C0pslXS3pZttXN/OcSKNMp2HSHQWsX6PF4H4s6ddtXytpd33z/RFxrAX7vkbSUxHxfUmyfbdq8w2eaMFzo83KMm2fWcEoo9RnyJ1Po9NAN0l6r6Rdkr4j6c6IONeifW+X9PSS21OSfnWVGvZL2i9JO3bsaNGuUYQynIZZtu4ooMxnyDXqAvqspBHVDv5vlvS3hVe0QkQcjoiRiBgZHBxs9+7RYcrUHQWUvUuy0TyAqyPi1ZJk+07VFoZrlVOSrlhye6i+DWhKWbqjgLJ3STa8JOTiLxFxzm7pdeC/Jekq21eqduB/u2rLTgNNK0N3FFD2LslGXUCvsf1c/eespOHF320/18yO62MJ75P0RUknJd2zeAF6AOgGZe+SbHQW0IYidx4RD0h6oMh9AEBKZe6SXPclIQEAF6esXZKNuoAAAF2KAACATBEAAJApAgAAMkUAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMgUAQAAmSIAACBTBABWNT0zr8eeflbTM/OpS0HG+BwWiyuC4SWOTJzSwbFJ9VQqWqhWdWjfsPbu2Z66LGSGz2HxaAFgmemZeR0cm9TcQlVn589pbqGqA2OTfANDW/E5bA8CAMtMnZlVT2X5x6KnUtHUmdlEFSFHfA7bgwDAMkNb+rRQrS7btlCtamhLX6KKOgt91q3B57A9CAAsM9Dfq0P7hrWpp6LNvRu1qaeiQ/uGNdDfm7q00jsycUqjtx/TOz71qEZvP6bPT5xKXVLH4nPYHo6I9u/UvknShyX9iqR
rImJ8Pf9uZGQkxsfX9VA0aXpmXlNnZjW0pY8/unWYnpnX6O3HNLfw4rfWTT0Vfe3gdfz/NYHPYWvYPh4RIyu3pzoL6HFJN0r6p0T7RwMD/b0d8QdXlgPEYp/1nF4MgMU+6074fyyrTvkcdqokARARJyXJdordo0uU6TRB+qzRiRgDQEdq1WmCrRq0pc8anaiwFoDthyVdtspdt0XEkQt4nv2S9kvSjh07WlQdOl0rulxa3YLYu2e7RndtLUWXFLAehQVARLyxRc9zWNJhqTYI3IrnROdrtstlaQtiMUQOjE1qdNfWpg7c9Fmjk9AFhI7UbJcLE42ARIPAtm+Q9A+SBiXdb3siIt6UohZ0rma6XBi0BRK1ACLivogYiojeiNjGwR8Xa6C/V6+54tIL7nZh0BZgNVBkjEFb5I4AQNYYtEXOGAQGgEwRAACQKQIAADJFAABApggAACi5oi40xFlAAFBiRa56SwsAAEqqVaveng8BAAAlVfSaVQQAAJRU0WtWEQAAUFJFr1nFIDAAlFiRa1YRAABQckWtWUUXELJU1HnVQCehBYDsFHledSebnplnaezMEADISlHXAu50hGKe6AJCVrgW8EsVPdkI5UUAICtcC/ilCMV8EQDICtcCfilCMV+MASA7XAt4ucVQPLBiDCD3/5ccEADIEtcCXo5QzBMBAEASoZgjxgAAIFNJAsD2R20/aXvS9n22L01RBwDkLFUL4Kik3RExLOl7km5NVAcAZCtJAETEQxFxrn7zG5KGUtQBADkrwxjAuyU9eL47be+3PW57/PTp020sqxxYtAxAUQo7C8j2w5IuW+Wu2yLiSP0xt0k6J+mu8z1PRByWdFiSRkZGooBSS4v1WQAUqbAAiIg3rnW/7XdJequkN0REVgf29WDRMgBFS3UW0PWSDkjaGxE/T1FD2bE+C4CipRoDuEPSZklHbU/Y/mSiOkqL9VkAFC3JTOCI2JViv52E9VkAFI2lIEqM9VkAFIkAKDnWZwFQlDLMAwAAJEAAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMgUAQAAmSIAACBTBACyMD0zr8eeflbTM/OpSwFKg+Wg0fWOTJzSwRUX1tm7Z3vqsoDkaAGgq03PzOvg2KTmFqo6O39OcwtVHRibpCUAiABAl5s6M6ueyvKPeU+loqkzs4kqAsqDAEBXG9rSp4Vqddm2hWpVQ1v6ElUElAcBgK420N+rQ/uGtamnos29G7Wpp6JD+4a5zCYgBoGRgb17tmt011ZNnZnV0JY+Dv5AHQGALAz093LgB1agCwgAMpUkAGz/te1J2xO2H7J9eYo6ACBnqVoAH42I4YjYI+kLkv4iUR0AkK0kARARzy25+TJJkaIOAMhZskFg238j6Q8l/VTStWs8br+k/ZK0Y8eO9hQHABlwRDFfvm0/LOmyVe66LSKOLHncrZI2RcRfruM5T0v6YeuqLNRWST9JXUQb8Xq7W06vtxtf68sjYnDlxsICYL1s75D0QETsTlpIi9kej4iR1HW0C6+3u+X0enN6ranOArpqyc23SXoyRR0AkLNUYwAfsf0qSVXVunTem6gOAMhWkgCIiH0p9ttmh1MX0Ga83u6W0+vN5rUmHwMAAKTBUhAAkCkCAAAyRQAUyPZHbT9ZX/foPtuXpq6pSLZvsn3CdtV2V55GZ/t629+1/ZTtD6aup2i2P237GduPp66laLavsP0l20/UP8fvT11T0QiAYh2VtDsihiV9T9Ktiesp2uOSbpT0ldSFFMH2BkmfkPRmSVdLutn21WmrKtxnJF2fuog2OSfpAxFxtaTXS/rTbn9/CYACRcRDEXGufvMbkoZS1lO0iDgZEd9NXUeBrpH0VER8PyKel3S3avNYulZEfEXS/6auox0i4kcR8e3672clnZS0PW1VxSIA2ufdkh5MXQSasl3S00tuT6nLDxC5sr1T0mslPZq2kmJxRbAmrWfNI9u
3qda8vKudtRVhvWs8AZ3Kdr+kMUm3rFi5uOsQAE2KiDeudb/td0l6q6Q3RBdMumj0ervcKUlXLLk9VN+GLmG7R7WD/10RcW/qeopGF1CBbF8v6YCkvRHx89T1oGnfknSV7SttXyLp7ZI+n7gmtIhtS7pT0smI+FjqetqBACjWHZI2Szpav/zlJ1MXVCTbN9iekvRrku63/cXUNbVSfUD/fZK+qNoA4T0RcSJtVcWy/TlJX5f0KttTtt+TuqYCjUp6p6Tr6n+vE7bfkrqoIrEUBABkihYAAGSKAACATBEAAJApAgAAMkUAAECmCABgFbZfqJ8G+Ljtf7X9C/Xtl9m+2/Z/2j5u+wHbr1zy726xPWf7l5ZsG6ivMjlj+44UrwdYDQEArG42IvZExG5Jz0t6b32i0H2SvhwRr4iI16m2wuu2Jf/uZtUmjN24ZNucpD+X9GftKR1YHwIAaOyrknZJulbSQkT8/4S+iHgsIr4qSbZfIalf0odUC4LFx/wsIh5RLQiA0iAAgDXY3qja+v/fkbRb0vE1Hv521ZaI/qpqM2e3rfFYIDkCAFhdn+0JSeOS/lu1NWIauVnS3RFRVW1BsZsKrA9oGquBAqubjYg9SzfYPiHp91Z7sO1XS7pKtXWfJOkSSf+l2npQQCnRAgDW75ikXtv7FzfYHrb9m6p9+/9wROys/1wu6XLbL09VLNAIi8EBq7A9ExH9q2y/XNLHJb1OtUHdH0i6RbUVQt8SEU8ueezHJP04Im63/QNJv6hay+BZSb8TEU8U/TqAtRAAAJApuoAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMjU/wFur9UWE4oSkQAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light", + "tags": [] + }, + "output_type": "display_data" + } + ], + "source": [ + "X_2d.plot.scatter(x=\"PCA1\", y=\"PCA2\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "cT7Rv9lEDSji" + }, + "source": [ + "As we can see from above, the two new features (known as Principal Components) are not like any of our input features. Additionally, these two new components explain 64.6% of all the original variation. Let's see how well this performs in explaining our dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "eBHVGhd-CJF8" + }, + "outputs": [], + "source": [ + "reg = LinearRegression()\n", + "\n", + "reg.fit(X, y)\n", + "print(\"All Features:\\t\\t\", reg.score(X, y))\n", + "reg.fit(X_2d, y)\n", + "print(\"With PCA Features:\\t\", reg.score(X_2d, y), end=\"\\n\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "bsoN_a8XI4p2" + }, + "source": [ + "As expected, with a higher explained variance ratio, we perform better in predicting the quality of the wine. With Dimensionality Reduction, we were able to capture most of the variation in the dataset while still being able to view it in two dimensions. In the most basic of terms, PCA creates a variable projected along the axis of maximum variance.\n", + "\n", + "![](https://raw.githubusercontent.com/bfkwong/data/master/IMG_0187%202.jpg)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "vwDiZRX3JYLN" + }, + "source": [ + "### Optional: Linear Algebra Behind PCA\n", + "\n", + "Principal Component Analysis chooses principal components along the **axis of greatest variance**. 
In Linear Algebra terms, the axis of greatest variance is the **Eigenvector with the largest Eigenvalue of the covariance matrix ($\\Sigma$)**.\n", + "\n", + "The following are the steps to do PCA manually. \n", + "\n", + "1. Given data matrix $M$, compute the covariance matrix of $M$, denoted $\\Sigma$\n", + "2. Compute the eigenvectors and eigenvalues of the covariance matrix $\\Sigma$\n", + "3. Project the data onto the eigenvector with the largest eigenvalue using the dot product (to keep more than one dimension, use a matrix product with the eigenvectors that have the largest eigenvalues)" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "J26dEAoAD0q7" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "# Step 1: Calculate covariance matrix \n", + "cov_mtrx = np.cov(X.T)\n", + "\n", + "# Step 2: Calculate eigenvalues (W) and eigenvectors (columns of v)\n", + "W,v = np.linalg.eig(cov_mtrx)\n", + "\n", + "# Step 3: Find the largest Eigenvalue and project our data onto the corresponding Eigenvector\n", + "idx_largest_eigenval = np.argmax(W)\n", + "eigenvec = v[:,idx_largest_eigenval]\n", + "\n", + "total = []\n", + "for row in X.index: \n", + " total.append(np.dot(X.loc[row], eigenvec))\n", + "\n", + "pd.DataFrame(pd.Series(total), columns=[\"PCA1\"]).head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "_G7NWh_1nf_k" + }, + "source": [ + "One interesting thing you may see here is that each eigenvalue equals the variance of the data projected onto its corresponding eigenvector. 
The eigenvalue of PCA1 is 2.26 and the variance of PCA1 is also 2.26." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "9QCALn43iTYd" + }, + "outputs": [], + "source": [ + "idx_largest_eigenval = np.argmax(W)\n", + "variance = pd.Series(total).var()\n", + "\n", + "print(\"Largest Eigenvalue:\\t\", W[idx_largest_eigenval])\n", + "print(\"Variance of PCA1:\\t\", variance)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "bmLOaYzvEeG7" + }, + "source": [ + "### Multidimensional Scaling (MDS)\n", + "\n", + "We now pay a visit to our good friend, the Euclidean distance. One incredibly useful aspect of Euclidean distance is that it works in higher dimensions. The formula $$\\sqrt{x_1^2 + x_2^2}$$ gives the two dimensional Euclidean distance, where $x_1$ and $x_2$ are the differences between the two points along each axis. Moving to a third dimension is as easy as adding an $x_3^2$ term under the square root. One thing to recognize is that Euclidean distance in any number of dimensions is still a single number. \n", + "\n", + "Why am I rambling on about something you learned in middle school? Well, the realization that Euclidean distance is a scalar in any number of dimensions means that we can preserve the variance of n-dimensional data in two dimensions, as long as we try to ensure that the Euclidean distances between points in n dimensions are proportional to the Euclidean distances between the same points in 2 dimensions. That was a lot to take in, so the following image explains the concept.\n", + "\n", + "![](https://raw.githubusercontent.com/bfkwong/data/master/IMG_0188.jpg)\n", + "\n", + "Notice how when we reduced our dimensions from 2 to 1, the distances between points A, B, and C remained the same, meaning that x, y, and z remained the same between the two dimensions. While distances may not always be preserved perfectly between dimensions, MDS attempts to preserve them as well as possible. 
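
As a minimal sketch of this idea (using three hypothetical points rather than the coordinates from the figure, and a 1-D layout picked by hand rather than computed by MDS), we can check that points lying on one straight line in 2-D keep all of their pairwise Euclidean distances when laid out in 1-D:

```python
import numpy as np

# Three hypothetical points A, B, C that lie on one straight line in 2-D.
points_2d = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Hand-picked 1-D layout: each point sits at its distance from A along the line.
points_1d = np.array([[0.0], [5.0], [10.0]])


def pairwise_distances(points):
    # Euclidean distance between every pair of rows.
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))


d2 = pairwise_distances(points_2d)
d1 = pairwise_distances(points_1d)

# True when every pairwise distance survives the move from 2-D to 1-D.
print(np.allclose(d2, d1))
```

Real data rarely sits on a line like this, which is why MDS generally has to settle for distances that are only approximately preserved.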
" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "L4AjVFrwn78I" + }, + "outputs": [], + "source": [ + "from sklearn.manifold import MDS\n", + "# We want to be able to plot this on a 2-D scatter plot so we choose 2 Dimensions\n", + "dimension_we_want = 2\n", + "\n", + "mds = MDS(n_components=dimension_we_want)\n", + "X_2d = mds.fit_transform(X) \n", + "X_2d = pd.DataFrame(X_2d, columns=[\"Dimension 1\", \"Dimension 2\"])\n", + "\n", + "display(X_2d.head())\n", + "print(\"Percentage variance explained:\", X_2d.var().sum()/X.var().sum())\n", + "X_2d.plot.scatter(x=\"Dimension 1\", y=\"Dimension 2\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "mlIA-o_cEnSD" + }, + "source": [ + "By using MDS, we were actually able to preserve over 95% of the variance from the original datasets. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "88kI9JxzFO6n" + }, + "source": [ + "### Linear vs Nonlinear Dimensionality Reduction \n", + "\n", + "Thus far, we have explored PCA (a linear reduction technique) and MDS (a nonlinear reduction technique). While there are a lot of differences between the two methods. The key differences could be boiled down to just the following statement: **linear dimensionality reduction technique only stretch and shift the data while nonlinear techniques make more drastic changes to the data**.\n", + "\n", + "This sometimes leads to nonlinear techniques being better at capturing variance but losing the overall shape of the data whereas linear techniques are better at keeping the general shape of the original data but loses more variance along the way. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "7uprQWlUGC37" + }, + "source": [ + "# Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "UnMJKgsTkDoZ" + }, + "source": [ + "1. Consider the Iris dataset (https://raw.githubusercontent.com/dlsun/pods/master/data/iris.csv). Drop the \"SepalWidth\" and \"PedalWidth\" columns and then apply PCA on \"SepalLength\" and \"PedalLength\" with `n_components = 2`. How many percent of the variance was PCA able to capture in this case? What happens when we use PCA to compress 2D data into 2D data?" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "HCXKtA8AlSQq" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "7.2 Visualizing Higher Dimensions.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/07-Unsupervised-Learning/7_2_Visualizing_Higher_Dimensions.ipynb b/07-Unsupervised-Learning/7_2_Visualizing_Higher_Dimensions.ipynb deleted file mode 100644 index c2cb755..0000000 --- a/07-Unsupervised-Learning/7_2_Visualizing_Higher_Dimensions.ipynb +++ /dev/null @@ -1,731 +0,0 @@ -{ - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "name": "7.2 Visualizing Higher Dimensions.ipynb", - "provenance": [], - "collapsed_sections": [] - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - } - }, - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "XTPLBCWKzfWS", - "colab_type": 
"text" - }, - "source": [ - "# 7.2 Visualizing Higher Dimensions\n", - "\n", - "By this point, we hope we've convinced you how important it is to visualize your data. While summary statistics are helpful, it doesn't provide us with a good grasp of what the entire dataset looks like. In two dimension, we can use a 2-D scatter plot. In three dimension, we can use a 3-D scatter plot. But what if we have more than three dimensions? This chapter talks about how we can visualize data that is beyond 3 dimensions." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "f1mm_fCWm9dO", - "colab_type": "text" - }, - "source": [ - "## Goal of Visualizing Higher Dimensions\n", - "\n", - ">\"I am a Tralfamadorian, seeing all time as you might see a stretch of the Rocky Mountains. All time is all time. It does not change. It does not lend itself to warnings or explanations. It simply is.\" \n", - ">\n", - ">-Kurt Vonnegate in \"Slaughterhour-Five\"\n", - "\n", - "We unfortunately are not Tralfamadorians, instead we are three dimensional beings who can't visually see a fourth dimension like its a location on the Rocky Mountains. However, this doesn't mean that the fourth dimension is meaingless to us. We can derive a lot of understand from understanding the higher dimensions. The problem is, we can't it and thus we can't plot it. \n", - "\n", - "Fortunately, very clever mathematicians throughout history has invented techniques to allow us to simulate what the higher dimension would look like. The rest of the section will discuss these techniques." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XR-JkcQuwozv", - "colab_type": "text" - }, - "source": [ - "## Using Size and Color\n", - "\n", - "This is more of a review from previous sections but one way to visualize more dimensions is by using the size and color attributes of your scatter plots. This is rather intuitive. 
" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "kgChDoxMwyKa", - "colab_type": "code", - "outputId": "cc578404-1e67-476a-f736-e070bc9c233e", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 368 - } - }, - "source": [ - "import pandas as pd\n", - "from scipy import stats\n", - "from sklearn.linear_model import LinearRegression\n", - "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", - "import altair as alt\n", - "\n", - "df_bordeaux = pd.read_csv(\"http://dlsun.github.io/pods/data/bordeaux.csv\")\n", - "\n", - "alt.Chart(df_bordeaux).mark_circle().encode(\n", - " alt.X('age',\n", - " scale=alt.Scale(zero=False)\n", - " ),\n", - " alt.Y('sep',\n", - " scale=alt.Scale(zero=False)\n", - " ),\n", - " color=\"summer\",\n", - " size=\"win\"\n", - ")" - ], - "execution_count": 0, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "alt.Chart(...)" - ], - "text/html": [ - "\n", - "
 \n", - "" - ] - }, - "metadata": { - "tags": [] - }, - "execution_count": 30 - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gxV83MG_y2LJ", - "colab_type": "text" - }, - "source": [ - "I am sure you can see the limitation of this method: you can only go up to 4 dimensions (5 if you use a 3-D scatter plot). It is still worth mentioning because sometimes this may be all you need. \n", - "\n", - "For higher dimensions, we should consider either feature selection or feature reduction. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tlWDTyDsr5Zv", - "colab_type": "text" - }, - "source": [ - "## Feature Selection \n", - "\n", - "If we want to compress 10 dimensions' worth of data into 2 dimensions, we're bound to lose some detail during that compression. We can measure how much detail we kept at the end with the explained variance ratio, which gives the fraction of the variance (detail) we kept after the compression. The higher the ratio, the more variance and detail we kept. \n", - "\n", - "One very simple, almost trivial, way to visualize higher-dimensional data is to plot only the two dimensions that explain the most variation in the data. 
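
As a rough sketch of the explained variance ratio for feature selection, the snippet below uses made-up data (not the Bordeaux dataset) and takes *explains the most variation* to mean *has the largest variance*, which is only one possible reading. The scale factors and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 4 features with standard deviations 3, 2, 1, 0.5.
data = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Variance of each feature (sample variance, like pandas' default).
variances = data.var(axis=0, ddof=1)

# Keep the two features with the largest variance.
top_two = np.argsort(variances)[::-1][:2]

# Explained variance ratio: variance kept divided by total variance.
ratio = variances[top_two].sum() / variances.sum()

print("Selected feature indices:", top_two)
print("Explained variance ratio:", ratio)
```

Note that if the columns are standardized first, each one has variance 1, so keeping 2 of 6 columns always gives a ratio of exactly 2/6 no matter which two are kept.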
" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "g9kxLPD30dxa", - "colab_type": "code", - "outputId": "c1aedb62-573b-406c-cfa6-844ab4054122", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 136 - } - }, - "source": [ - "sclr = StandardScaler()\n", - "df_bordeaux = pd.DataFrame(sclr.fit_transform(df_bordeaux.dropna()), columns=df_bordeaux.columns)\n", - "\n", - "X = df_bordeaux.drop(\"price\", axis=1)\n", - "y = df_bordeaux[\"price\"]\n", - "\n", - "reg = LinearRegression()\n", - "reg.fit(X, y)\n", - "\n", - "scores = pd.Series(dtype=float, name=\"R^2 Values\")\n", - "for column in X.columns: \n", - " reg = LinearRegression()\n", - " reg.fit(X[[column]], y)\n", - " scores[column] = reg.score(X[[column]], y)\n", - "\n", - "scores.sort_values(ascending=False)" - ], - "execution_count": 0, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "summer 0.343538\n", - "sep 0.323582\n", - "age 0.206936\n", - "year 0.206936\n", - "har 0.199621\n", - "win 0.053456\n", - "Name: R^2 Values, dtype: float64" - ] - }, - "metadata": { - "tags": [] - }, - "execution_count": 31 - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qHprVgDLxKyE", - "colab_type": "text" - }, - "source": [ - "As we can see above, average summer temperature **summer** and average september temperature **sep** are the two variables that explain the most variance in the quality of the wine **price**. Thus, if we want to get the best representation of the dataset with only two dimensions, we can make a scatterplot of **summer** vs **sep**. However, even with the two variables that explain the most variation, we can only capture 33% of the variation of the original data. The other 66% is lost to the other features we chose to ignore. 
" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "FcUCeYGOx2nP", - "colab_type": "code", - "outputId": "16dc19dd-f2ac-44b7-8ac6-0236cac283d4", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 330 - } - }, - "source": [ - "explained_var = X[[\"summer\", \"sep\"]].var(axis=0).sum() / X.var(axis=0).sum() \n", - "\n", - "print(\"% Variance Explained:\", explained_var, end=\"\\n\\n\")\n", - "df_bordeaux.plot.scatter(x=\"summer\", y=\"sep\")" - ], - "execution_count": 0, - "outputs": [ - { - "output_type": "stream", - "text": [ - "% Variance Explained: 0.3333333333333333\n", - "\n" - ], - "name": "stdout" - }, - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "" - ] - }, - "metadata": { - "tags": [] - }, - "execution_count": 32 - }, - { - "output_type": "display_data", - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEGCAYAAABsLkJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAUNklEQVR4nO3dfYxcV33G8efZeLu26rQYrwvEDpiQqLwaJ2zTgCmF8NKQVqZgKFDRgqByU4RU1Ao7CLX0TY1spP5RUVqshAKCBigmtVsCIWAoL2rSrINf8kJCQEmzbkrM1glZsLfrzK9/zF083njt2ezcOefO+X6kVWbvzO793ZvxPHvOPedcR4QAAOUZSl0AACANAgAACkUAAEChCAAAKBQBAACFWpK6gIUYHR2NtWvXpi4DABpl7969P4yIVXO3NyoA1q5dq/Hx8dRlAECj2L7vVNvpAgKAQhEAAFAoAgAACkUAAEChCAAAKBQBAABzTE5Na//9D2lyajp1KbVq1DBQAKjbrn2HtHXnAQ0PDWmm1dL2Teu0cf3q1GXVghYAAFQmp6a1decBHZtp6ZHp4zo209KWnQcGtiVAAABAZeLIUQ0PnfyxODw0pIkjRxNVVC8CAAAqa1Ys00yrddK2mVZLa1YsS1RRvQgAAKisXD6i7ZvWaenwkM4eWaKlw0PavmmdVi4fSV1aLbgIDAAdNq5frQ3nj2riyFGtWbFsYD/8JQIAAB5j5fKRgf7gn0UXEAAUigAAgEIRAABQKAIAAApFAABAoQgAACgUAQAAhSIAAKBQBAAAFIoAAIBCEQAAUCgCAAAKlSwAbJ9r+6u277B9u+0/TFULAJQo5WqgxyX9cUTcavtsSXtt3xgRdySsCQCKkawFEBEPRMSt1eNHJN0paTDvvAwAGcriGoDttZIulHTzKZ7bbHvc9vjhw4f7XRoADKzkAWB7uaSdkt4dET+a+3xE7IiIsYgYW7VqVf8LBIABlTQAbA+r/eH/yYj4XMpaAKA0KUcBWdI1ku6MiL9JVQcAlCplC2CDpN+RdKntfdXX5QnrAYCiJBsGGhHflORU+weA0iW/CAwASIMAAIBCEQAAUCgCAAAKRQAAQKEIAAAoFAEAAIUiAACgUAQ
AABSKAACAQhEAAFAoAgAACkUAAEChCACgzyanprX//oc0OTWduhQULtly0ECJdu07pK07D2h4aEgzrZa2b1qnjetXpy4LhaIFAPTJ5NS0tu48oGMzLT0yfVzHZlrasvMALQEkQwAAfTJx5KiGh07+Jzc8NKSJI0cTVYTSEQBAn6xZsUwzrdZJ22ZaLa1ZsSxRRSgdAQD0ycrlI9q+aZ2WDg/p7JElWjo8pO2b1mnl8pHUpaFQXAQG+mjj+tXacP6oJo4c1ZoVy/jwR1IEANBnK5eP8MGPLNAFBACFIgAAoFAEAADMUcpsba4BAECHkmZr0wIAgEpps7UJAACo9Gu2di5dTHQBAUClH7O1c+piogUAAJW6Z2vn1sVECwAAOtQ5W3u2i+mYTrQyZruYUkwOJAAAYI66ZmvntiAgXUAA0Ce5LQhICwAA+iinBQEJAADos1wWBKQLCAAKlTQAbH/E9oO2b0tZBwCUKHUL4KOSLktcAxool5mU/VDSsaK/kl4DiIiv216bsgY0T04zKetW0rGi/1K3AM7I9mbb47bHDx8+nLocJJbbTMo6lXSsSCP7AIiIHRExFhFjq1atSl0OEuvXYl05KOlYkUb2AQB0ym0mZZ1KOlakQQCgUXKbSVmnko4VaTgi0u3cvlbSSyWNSvqBpPdHxDXzvX5sbCzGx8f7VB1yNjk1ncVMyn4o6VhRD9t7I2Js7vbUo4DenHL/aK5cZlL2Q0nHiv6iCwgACkUAAEChCAAAKBQBAACFIgAAoFAEAFAwFporGzeEAQrFQnOgBQAUiIXmIBEAQJEGYaE5uq8Wjy4goEBNX2iO7qveoAUAFKjJC83RfdU7tACAQm1cv1obzh9t3EJzs91Xx3SiBTPbfdWUY8gFAQAUrIkLzTW9+yondAEBaJQmd1/lhhYAgMZpavdVbggAAI3UxO6r3NAFBACFIgAAoFAEAAAUigAAgEIRAABQKAIAAApFAABAoRYUALZ/zvbZdRUDAOifrgLA9i/ZPijpgKTbbO+3/YJ6S0PTsV47kLduZwJfI+mdEfENSbL9Ykn/KGldXYWh2VivHchft11Aj85++EtSRHxT0vF6SkLTsV470AzdtgD+3faHJV0rKSS9UdLXbF8kSRFxa031oYFYrx1ohm4D4PnVf98/Z/uFagfCpT2rCI03COu1T05NF7HSZCnHiVPrKgAi4mV1F4LBMbte+5Y51wCa8gFTyvWLUo4T83NEnPlF9pMk/bWkcyLi1bafLemFEXFN3QV2Ghsbi/Hx8X7uEovQxL8uJ6emtWHbHh2bOdGCWTo8pG9tvbQxx9CNUo4Tbbb3RsTY3O3dXgT+qKQbJJ1TfX+3pHf3pjQMqpXLR/T8c5/QqA+U2esXnWavX8ynicNdH89xPh5NPDcl6fYawGhEfMb2eyUpIo7bfrTGuoAkFnr9oqndKP24TtPUc1OSblsAP7a9Uu0LvrJ9iaSHa6sKSGQh95tt8nDXuu+r2+RzU5JuWwB/JGm3pGfY/pakVZJeX1tVQELd3m+26cNd67yvbtPPTSm6DYBnSHq1pHMlbZL0ywv4WaBxurnf7CAMd63rvrqDcG5K0G0X0J9ExI8krZD0MkkfkvT3i9257cts32X7HttXLvb3Af1UdzdKk3FumqHbYaDfjogLbV8l6WBE/NPstse9Y/sstUcTvVLShKRbJL05Iu6Y72cYBoocNXG4a79wbvIw3zDQbrtxDlVLQbxS0jbbI1r8vQQulnRPRHy/KvBTkl4jad4AAHJUVzfKIODc5K3bD/HfUnsewK9FxEOSnijpPYvc92pJ93d8P1FtO4ntzbbHbY8fPnx4kbsEAMzqdimIn0j6XMf3D0h6oK6i5ux7h6QdUrsLqB/7BIASpLwl5CG1RxXNWlNtA5ApZvYOlpRDOW+RdIHtp6v9wf8mSb+dsB4Ap8HM3sGTrAUQEcclvUvtawt3SvpMRNyeqh4A82Nm72BKOpkrIq6XdH3KGkrGED1
0K8eZvbx/F4/ZvIWiOY+FyG1mL+/f3kh5ERiJ0JzHQuU0s5f3b+/QAihQjs155K/OxeMWgvdv7xAABcqtOY/myGFmL+/f3qELqEA5NeeBheL92ztdLQaXCxaD6y1GUTQD/59OjfPSvcUuBocBlENzHqfHaJf58f5dPLqAgEwx2gV1IwCATM2Oduk0O9oF6AUCADiFHBY9Y7QL6sY1AGCOXPrdZ0e7bJlTC/3e6BUCAOjQ2e8+O9Foy84D2nD+aJIP3lwmX2EwEQBAhxxnmTLaBXXhGgDQYc2KZTo6c/ykbUdnjtPvjoFEAABz2D7t98CgIACADhNHjmrpkrNO2rZ0yVkMvcRAIgCADgy9REkIAKADC42hJIwCQteavPjWQmpf6NDLJp8XlI0AQFdymRz1eDye2rsdetnk8wLQBYQzavKiZHXW3uTzAkgEALrQ5EXJ6qy9yecFkAgAdKHJI2PqrL3J5wXNUtfihAQAzqjJI2PqrL3J5wXNsWvfIW3Ytkdvufpmbdi2R7v3HerZ7+aWkOhak0e71Fl7k88L8jY5Na0N2/bo2MyJlubS4SF9a+ulC3qvcUtILFqTFyWrs/Ymnxfkre7FCekCAoBM1X2diQAAgEzVfZ2JLiAAyFidNwUiAAAgc3VdZ6ILCAAKRQAAfVbXpB5goegCAvqIxeOQE1oAQJ+weBxyQwAAfcLicchNkgCw/Qbbt9tu2X7M9GRgELF4HHKTqgVwm6TXSfp6ov0DfcficchNkovAEXGnJNlOsXsgmTon9QALlf0oINubJW2WpKc+9amJqwEWj8XjkIvaAsD2lyU9+RRPvS8idnX7eyJih6QdUns56B6VBwDFqy0AIuIVdf1uAMDiMQwUAAqVahjoa21PSHqhpM/bviFFHQBQslSjgK6TdF2KfQMA2ugCAoBCEQAAUCgCAAAKRQAAQKEIAAAoFAEAAIUiAACgUAQAABSKAACAQhEAAFAoAgAACkUAAEChCAAAKBQBAACFIgAAoFAEAAAUigAAgEIRAABQKAIAAApFAABAoQiAzE1OTWv//Q9pcmo6dSkABsyS1AVgfrv2HdLWnQc0PDSkmVZL2zet08b1q1OXBWBA0ALI1OTUtLbuPKBjMy09Mn1cx2Za2rLzAC0BAD1DAGRq4shRDQ+d/L9neGhIE0eOJqoIwKAhADK1ZsUyzbRaJ22babW0ZsWyRBUBGDQEQKZWLh/R9k3rtHR4SGePLNHS4SFt37ROK5ePpC4NwIAo4iLw5NS0Jo4c1ZoVyxr1Abpx/WptOH+0kbUDyN/AB0DTR9KsXD7CBz+AWgx0FxAjaQBgfgMdAIykAYD5DXQAMJIGAOY30AHASBoAmN/AXwRmJA0AnNrAB4DESBoAOJWB7gICAMyPAACAQiUJANsfsP0d2wdsX2f7CSnqANBc3Ctj8VK1AG6U9NyIWCfpbknvTVQHgAbate+QNmzbo7dcfbM2bNuj3fsOpS6pkZIEQER8KSKOV9/eJGlNijoANA8z/Hsnh2sAb5f0hfmetL3Z9rjt8cOHD/exLAA5YoZ/79Q2DNT2lyU9+RRPvS8idlWveZ+k45I+Od/viYgdknZI0tjYWNRQKoAGYYZ/79QWABHxitM9b/ttkn5D0ssjgg92AF2ZneG/Zc4qv8z1WbgkE8FsXyZpi6RfjYifpKgBQHMxw783Us0E/qCkEUk32pakmyLiikS1AGggZvgvXpIAiIjzU+wXAHBCDqOAAAAJEAAAUCgCAAAKRQAAQKHcpCH4th+RdFfqOuYYlfTD1EWcQo515ViTlGddOdYk5VlXjjVJedX1tIhYNXdj024Ic1dEjKUuopPt8dxqkvKsK8eapDzryrEmKc+6cqxJyreuTnQBAUChCAAAKFTTAmBH6gJOIceapDzryrEmKc+6cqxJyrOuHGuS8q3rpxp1ERgA0DtNawEAAHqEAACAQmUdAN3ePN72vbYP2t5nezyTmi6zfZfte2xfWWdN1f7eYPt
22y3b8w496/O56ramfp+rJ9q+0fZ3q/+umOd1j1bnaZ/t3TXVctpjtz1i+9PV8zfbXltHHQus6W22D3ecm9/rQ00fsf2g7dvmed62/7aq+YDti+quqcu6Xmr74Y5z9af9qKtrEZHtl6RXSVpSPd4mads8r7tX0mguNUk6S9L3JJ0n6Wck7Zf07JrrepakX5T0NUljp3ldP8/VGWtKdK62S7qyenzlad5XUzXXccZjl/ROSf9QPX6TpE9nUNPbJH2wH++hjn2+RNJFkm6b5/nL1b61rCVdIunmTOp6qaR/6+e5WshX1i2AyPDm8V3WdLGkeyLi+xHxf5I+Jek1Ndd1Z0RkNUu6y5r6fq6q3/+x6vHHJP1mzfubTzfH3lnrZyW93NVNNBLW1HcR8XVJ/3ual7xG0sej7SZJT7D9lAzqylrWATDH6W4eH5K+ZHuv7c0Z1LRa0v0d309U23KQ6lzNJ8W5elJEPFA9/h9JT5rndUttj9u+yXYdIdHNsf/0NdUfHg9LWllDLQupSZI2VV0tn7V9bo31dCvnf3MvtL3f9hdsPyd1MZ2SLwXRo5vHvzgiDtn+BbXvMvadKplT1tRz3dTVhb6fqxROV1fnNxERtucbC/206lydJ2mP7YMR8b1e19pA/yrp2oiYtv37ardQLk1cU65uVft9NGX7ckn/IumCxDX9VPIAiB7cPD4iDlX/fdD2dWo3Yx/3h1oPajokqfOvojXVtkU5U11d/o6+nqsu9P1c2f6B7adExANVN8GD8/yO2XP1fdtfk3Sh2v3jvdLNsc++ZsL2Ekk/L2myhzUsuKaI6Nz/1WpfU0mtlvfRYkXEjzoeX2/7Q7ZHIyKLReKy7gLyiZvHb4x5bh5v+2dtnz37WO2LtKe8It+vmiTdIukC20+3/TNqX7yrZRTJQvT7XHUpxbnaLemt1eO3SnpMS8X2Ctsj1eNRSRsk3dHjOro59s5aXy9pz3x/CPWrpjl96xsl3VljPd3aLel3q9FAl0h6uKObLxnbT569ZmP7YrU/c+sM8IVJfRX6dF+S7lG7X29f9TU7GuIcSddXj89Te6TCfkm3q931kLSm6vvLJd2t9l+MtdZU7e+1avd7Tkv6gaQbMjhXZ6wp0blaKekrkr4r6cuSnlhtH5N0dfX4RZIOVufqoKR31FTLY45d0l+o/QeGJC2V9M/V++4/JZ3Xh/Nzppquqt4/+yV9VdIz+1DTtZIekDRTvafeIekKSVdUz1vS31U1H9RpRsL1ua53dZyrmyS9qB91dfvFUhAAUKisu4AAAPUhAACgUAQAABSKAACAQhEAAFAoAgAACkUAAJmpJjPxbxO1402GIlWzoj9fLdJ1m+03un2vhNHq+bFq+QfZ/jPbH7P9Ddv32X6d7e1u31fhi7aHq9fda/uqat33cdsX2b7B9vdsX9Gx7/fYvqVaTO3Pq21r3V6D/+Nqz87OYYE1DDgCAKW6TNJ/R8TzI+K5kr54htc/Q+0FzzZK+oSkr0bE8yQdlfTrHa/7r4hYL+kbkj6q9vINl0ia/aB/ldqLgV0sab2kF9h+SfWzF0j6UEQ8JyLuW/whAqdHAKBUByW90vY2278SEQ+f4fVfiIiZ6ufO0onAOChpbcfrdndsvzkiHomIw5Km3b573Kuqr2+rvVLkM3Vidcj7or2WPdAXyVcDBVKIiLvdvm3g5ZL+yvZX1F7ee/aPoqVzfmS6+rmW7Zk4sYZKSyf/O5ru2D7dsX32dZZ0VUR8uPOXu32rxx8v5piAhaIFgCLZPkfSTyLiE5I+oPZt/e6V9ILqJZtq2vUNkt5ue3lVx+rq3gxA39ECQKmeJ+kDtltqr+T4B5KWSbrG9l+qfR/jnouIL9l+lqT/qFYJnpL0FkmP1rE/4HRYDRQACkUXEAAUigAAgEIRAABQKAIAAApFAABAoQgAACgUAQAAhfp/rOG9WPsCRBEAAAAASUVORK5CYII=\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "tags": [], - "needs_background": "light" - } - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Y6gWdO1CINp-", - "colab_type": "text" - }, - "source": [ - "Additionally, if we look below, using only two features has hindered our predictive accuracy. This sucks! Fortunately, some very clever mathematicians came up with ways to get around this. " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "l7re2E6KHUbs", - "colab_type": "code", - "outputId": "24c2f5c2-ff71-420e-874c-2bbdee6f9a8f", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 68 - } - }, - "source": [ - "reg = LinearRegression()\n", - "\n", - "reg.fit(X, y)\n", - "print(\"All Features:\\t\\t\", reg.score(X, y))\n", - "reg.fit(X[[\"summer\", \"sep\"]], y)\n", - "print(\"With PCA Features:\\t\", reg.score(X[[\"summer\", \"sep\"]], y), end=\"\\n\\n\")" - ], - "execution_count": 0, - "outputs": [ - { - "output_type": "stream", - "text": [ - "All Features:\t\t 0.7526018827767169\n", - "With PCA Features:\t 0.4633153344681292\n", - "\n" - ], - "name": "stdout" - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tflTZWS688xW", - "colab_type": "text" - }, - "source": [ - "## Dimensionality Reduction\n", - "\n", - "With feature selection, we were only able to capture 33% of the original variance, which isn't great. To capture more variation while still remaining in two variables, have to utilize some clever math.\n", - "\n", - "These clever mathematical techniques are known as feature creation, where we try to create new variables that helps us visualize higher dimensional data. There are many different techniques for dimensionality reduction all of which attempts to accomplish different things. Let's start off with the simplest and most popular one: **Principle Component Analysis**." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cZ52OQlGAm6Q", - "colab_type": "text" - }, - "source": [ - "### Principal Component Analysis (PCA)\n", - "\n", - "Simply put, Principal Component Analysis creates new features that maximize variation. What PCA is **not** doing is feature selection; rather, it creates entirely new features, each of which is a linear combination of all the original features. \n", - "\n", - "PCA involves some simple linear algebra, but scikit-learn has a PCA implementation. Note that all dimensionality reduction algorithms in scikit-learn are used much like the machine learning algorithms you learned in the previous chapters: create the object and then call `fit()` or `fit_transform()`." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "FESc4NoI8-3v", - "colab_type": "code", - "outputId": "1d1b00d1-65f5-40dd-ca5f-93f9ba84ef7c", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 255 - } - }, - "source": [ - "from sklearn.decomposition import PCA\n", - "\n", - "# We want to be able to plot this on a 2-D scatter plot so we choose 2 dimensions\n", - "dimension_we_want = 2\n", - "\n", - "pca = PCA(n_components=dimension_we_want)\n", - "X_2d = pca.fit_transform(X) \n", - "X_2d = pd.DataFrame(X_2d, columns=[\"PCA1\", \"PCA2\"])\n", - "\n", - "print(\"\\n% Variance Explained:\", pca.explained_variance_ratio_.sum(), end=\"\\n\\n\")\n", - "X_2d.head()" - ], - "execution_count": 0, - "outputs": [ - { - "output_type": "stream", - "text": [ - "\n", - "% Variance Explained: 0.6462543462926849\n", - "\n" - ], - "name": "stdout" - }, - { - "output_type": "execute_result", - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " PCA1 PCA2\n", - "0 2.620822 -1.437859\n", - "1 2.166446 0.732196\n", - "2 2.359265 -0.067934\n", - "3 1.518218 -0.792417\n", - "4 1.519920 0.451721" - ] - }, - "metadata": { - "tags": [] - }, - "execution_count": 36 - } - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "Gj87JXmmBdNu", - "colab_type": "code", - "outputId": "4574186e-9f94-4b91-9cf9-ba4a68b9f866", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 296 - } - }, - "source": [ - "X_2d.plot.scatter(x=\"PCA1\", y=\"PCA2\")" - ], - "execution_count": 0, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "" - ] - }, - "metadata": { - "tags": [] - }, - "execution_count": 37 - }, - { - "output_type": "display_data", - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEGCAYAAABsLkJ6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAATe0lEQVR4nO3dcWzcZ33H8c/nEtfxcLZGjpWucUMqUtCqYIKwOjZv0lrYKIgFtVk1OsHEYIuQhkQlpoSqbGOaJtGwMaSViUWUwaSKqpvbBdFWNFVgUAQFB7mmaQrqGKyOGA1eSmOwXaf33R93Xm3X8Tm5+93zu3veL8mS73eX+30vd/597nme3/P8HBECAOSnkroAAEAaBAAAZIoAAIBMEQAAkCkCAAAytTF1ARdi69atsXPnztRlAEBHOX78+E8iYnDl9o4KgJ07d2p8fDx1GQDQUWz/cLXtdAEBQKYIAADIFAEAAJkiAAAgUwQAAGSKAADQEtMz83rs6Wc1PTOfuhSsU0edBgqgnI5MnNLBsUn1VCpaqFZ1aN+w9u7ZnrosNEALAEBTpmfmdXBsUnMLVZ2dP6e5haoOjE3SEugABACApkydmVVPZfmhpKdS0dSZ2UQVYb0IAABNGdrSp4Vqddm2hWpVQ1v6ElWE9SIAADRloL9Xh/YNa1NPRZt7N2pTT0WH9g1roL83dWlogEFgAE3bu2e7Rndt1dSZWQ1t6ePg3yEIAAAtMdDfy4G/w9AFBACZIgAAIFPJAsD2JtvftP2Y7RO2/ypVLQCQo5RjAPOSrouIGds9kh6x/WBEfCNhTQCQjWQBEBEhaaZ+s6f+E6nqAYDcJB0DsL3B9oSkZyQdjYhHV3nMftvjtsdPnz7d/iIBoEslDYCIeCEi9kgaknSN7d2rPOZwRIxExMjg4EuuaQwAuEilOAsoIp6V9CVJ16euBQBykfIsoEHbl9Z/75P025KeTFUPAOQm5VlAvyzps7Y3qBZE90TEFxLWAwBZSXkW0KSk16baPwDkrhRjAACA9iMAACBTBAAAZIoAAIBMEQAAkCkCAAAyRQAAQKYIAADIFAEAAJkiAAAgUwQAAGSKAACATBEAAJApAgAAMkUAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMhUsgCwfYXtL9l+wvY
J2+9PVQsA5Ghjwn2fk/SBiPi27c2Sjts+GhFPJKwJALKRrAUQET+KiG/Xfz8r6aSk7anqAYDclGIMwPZOSa+V9Ogq9+23PW57/PTp0+0uDQC6VvIAsN0vaUzSLRHx3Mr7I+JwRIxExMjg4GD7CwSALpU0AGz3qHbwvysi7k1ZCwDkJuVZQJZ0p6STEfGxVHUAQK5StgBGJb1T0nW2J+o/b0lYDwBkJdlpoBHxiCSn2j8A5C75IDAAIA0CAAAyRQAAQKYIAADIFAEAAJkiAIAVpmfm9djTz2p6Zj51KUChUq4GCpTOkYlTOjg2qZ5KRQvVqg7tG9bePaxRiO5ECwCom56Z18GxSc0tVHV2/pzmFqo6MDZJSwBdiwAA6qbOzKqnsvxPoqdS0dSZ2UQVAcUiAIC6oS19WqhWl21bqFY1tKUvUUVAsQgAoG6gv1eH9g1rU09Fm3s3alNPRYf2DWugvzd1aUAhGAQGlti7Z7tGd23V1JlZDW3p4+CPrkYAACsM9Pdy4EcW6AICgEwRAACQKQIAQCGYUV1+jAEAaDlmVHcGWgB1fFsBWoMZ1Z2DFoD4tgK00uKM6jm9OKlucUY1Z1eVS/YtAL6tdB9ac2kxo7pzZBUAqx0YWP+luxyZOKXR24/pHZ96VKO3H9PnJ06lLik7zKjuHNl0AZ2vm4dvK91jaWtusfvhwNikRndt5eDTZsyo7gxZtADW6ubh20r3oDVXLgP9vXrNFZfyt1RiWbQAGg1K8W2lO9CaAy5M0haA7U/bfsb240XuZz0HBr6tdD5ac8CFSd0C+IykOyT9S5E7WTwwHFgxBsCBofvQmgPWL2kARMRXbO9sx744MOSD1TyB9UndAmjI9n5J+yVpx44dTT0XBwYAeFHpzwKKiMMRMRIRI4ODg6nLAdqOiW0oSulbAEDOWKYERSp9CwDIFcuUoGipTwP9nKSvS3qV7Snb70lZD1AmTGxD0VKfBXRzyv0DZcbENhSNLiCgpJjYVl7dMjDPIDBQYsxfKZ9uGpgnAICSY/5KeXTbirN0AQHAOnXbwDwBAADr9LJLNmj+3AvLtnXywDwBsIZuGejBi3hPcbGOTJzSW+94RJWKJUm9G9zxA/OMAZxHNw30oIb3FBdrad//orB1//t+Q7u2bU5YWXPW1QKw3bPKtq2tL6cYF/qtjxmY3Yf3FM1Yre+/d0NFP3v+hfP8i86wZgDYvtb2lKQf2X5oxdLNDxVZWKtczEXCu22gB7ynaE63Tspr1AI4JOlNEbFV0mFJR22/vn6fC62sBS72W1+3vtk54z3tTGUZs+nWSXmNxgAuiYgTkhQR/2b7pKR7bR+UFIVX16RG1wI+H64g1n14TztP2cZsunFSXqMAWLB9WUT8jyRFxAnbb5D0BUmvKLy6JjXzra8b3+zc8Z52jrJOuOq2SXmNuoA+KGnb0g0RMSXptyR9pKCaWqbZZhsXiu8+vKedgTGb9lizBRARD5/nrs2Snm99Oa3Ht758TM/M8z53CcZs2mPd8wBsD0q6SdLNki6XdF9RRbVatzXb8FJl6y9GcxizaY81A8D2Zkk3SvoDSa+UdK+kKyNiqA21AetS1v5iNIfWe/EatQCekfRNSR+S9EhEhO0bii8LWL+LPdsL5UfrvViNBoFvldQr6R8l3Wq79Gf+ID/0FwMXZ80AiIiPR8TrJb2tvunfJV1u+6DtVxZeHbAO3TpJByiaIy5sPpft3aoNBP9+ROwqpKrzGBkZifHx8XbuEh2Es4CA1dk+HhEjK7c3GgTeJWlbRHxtcVtEPG77QUn/3Poy0clSH4DpLwYuTKNB4I+rNg6w0k8l/b2k3215RehInIYJdJ5Gg8DbIuI7KzfWt+0spCJ0HJZaBjpTowC4dI37OMUCkso5bb8sq0gCZdYoAMZt/8nKjbb/WNLxZndu+3rb37X9lO0PNvt8SKNsp2FezDUggBw1CoBbJP2R7S/b/rv6z39Ieo+k9zezY9sbJH1C0pslXS3
pZttXN/OcSKNMp2HSHQWsX6PF4H4s6ddtXytpd33z/RFxrAX7vkbSUxHxfUmyfbdq8w2eaMFzo83KMm2fWcEoo9RnyJ1Po9NAN0l6r6Rdkr4j6c6IONeifW+X9PSS21OSfnWVGvZL2i9JO3bsaNGuUYQynIZZtu4ooMxnyDXqAvqspBHVDv5vlvS3hVe0QkQcjoiRiBgZHBxs9+7RYcrUHQWUvUuy0TyAqyPi1ZJk+07VFoZrlVOSrlhye6i+DWhKWbqjgLJ3STa8JOTiLxFxzm7pdeC/Jekq21eqduB/u2rLTgNNK0N3FFD2LslGXUCvsf1c/eespOHF320/18yO62MJ75P0RUknJd2zeAF6AOgGZe+SbHQW0IYidx4RD0h6oMh9AEBKZe6SXPclIQEAF6esXZKNuoAAAF2KAACATBEAAJApAgAAMkUAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMgUAQAAmSIAACBTBABWNT0zr8eeflbTM/OpS0HG+BwWiyuC4SWOTJzSwbFJ9VQqWqhWdWjfsPbu2Z66LGSGz2HxaAFgmemZeR0cm9TcQlVn589pbqGqA2OTfANDW/E5bA8CAMtMnZlVT2X5x6KnUtHUmdlEFSFHfA7bgwDAMkNb+rRQrS7btlCtamhLX6KKOgt91q3B57A9CAAsM9Dfq0P7hrWpp6LNvRu1qaeiQ/uGNdDfm7q00jsycUqjtx/TOz71qEZvP6bPT5xKXVLH4nPYHo6I9u/UvknShyX9iqRrImJ8Pf9uZGQkxsfX9VA0aXpmXlNnZjW0pY8/unWYnpnX6O3HNLfw4rfWTT0Vfe3gdfz/NYHPYWvYPh4RIyu3pzoL6HFJN0r6p0T7RwMD/b0d8QdXlgPEYp/1nF4MgMU+6074fyyrTvkcdqokARARJyXJdordo0uU6TRB+qzRiRgDQEdq1WmCrRq0pc8anaiwFoDthyVdtspdt0XEkQt4nv2S9kvSjh07WlQdOl0rulxa3YLYu2e7RndtLUWXFLAehQVARLyxRc9zWNJhqTYI3IrnROdrtstlaQtiMUQOjE1qdNfWpg7c9Fmjk9AFhI7UbJcLE42ARIPAtm+Q9A+SBiXdb3siIt6UohZ0rma6XBi0BRK1ACLivogYiojeiNjGwR8Xa6C/V6+54tIL7nZh0BZgNVBkjEFb5I4AQNYYtEXOGAQGgEwRAACQKQIAADJFAABApggAACi5oi40xFlAAFBiRa56SwsAAEqqVaveng8BAAAlVfSaVQQAAJRU0WtWEQAAUFJFr1nFIDAAlFiRa1YRAABQckWtWUUXELJU1HnVQCehBYDsFHledSebnplnaezMEADISlHXAu50hGKe6AJCVrgW8EsVPdkI5UUAICtcC/ilCMV8EQDICtcCfilCMV+MASA7XAt4ucVQPLBiDCD3/5ccEADIEtcCXo5QzBMBAEASoZgjxgAAIFNJAsD2R20/aXvS9n22L01RBwDkLFUL4Kik3RExLOl7km5NVAcAZCtJAETEQxFxrn7zG5KGUtQBADkrwxjAuyU9eL47be+3PW57/PTp020sqxxYtAxAUQo7C8j2w5IuW+Wu2yLiSP0xt0k6J+mu8z1PRByWdFiSRkZGooBSS4v1WQAUqbAAiIg3rnW/7XdJequkN0REVgf29WDRMgBFS3UW0PWSDkjaGxE/T1FD2bE+C4CipRoDuEPSZklHbU/Y/mSiOkqL9VkAFC3JTOCI2JViv52E9VkAFI2lIEqM9VkAFIkAKDnWZwFQlDLMAwAAJEAAAECmCAAAyBQBAACZIgAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMgUAQAAmSIAACBTBACyMD0zr8eeflbTM/OpSwFKg+Wg0fWOTJzSwRUX1tm7Z3vqsoDkaAGgq03PzOvg2KTmFqo6O39OcwtVHRibpCUAiABAl5s6M6ueyvKPeU+loqk
zs4kqAsqDAEBXG9rSp4Vqddm2hWpVQ1v6ElUElAcBgK420N+rQ/uGtamnos29G7Wpp6JD+4a5zCYgBoGRgb17tmt011ZNnZnV0JY+Dv5AHQGALAz093LgB1agCwgAMpUkAGz/te1J2xO2H7J9eYo6ACBnqVoAH42I4YjYI+kLkv4iUR0AkK0kARARzy25+TJJkaIOAMhZskFg238j6Q8l/VTStWs8br+k/ZK0Y8eO9hQHABlwRDFfvm0/LOmyVe66LSKOLHncrZI2RcRfruM5T0v6YeuqLNRWST9JXUQb8Xq7W06vtxtf68sjYnDlxsICYL1s75D0QETsTlpIi9kej4iR1HW0C6+3u+X0enN6ranOArpqyc23SXoyRR0AkLNUYwAfsf0qSVXVunTem6gOAMhWkgCIiH0p9ttmh1MX0Ga83u6W0+vN5rUmHwMAAKTBUhAAkCkCAAAyRQAUyPZHbT9ZX/foPtuXpq6pSLZvsn3CdtV2V55GZ/t629+1/ZTtD6aup2i2P237GduPp66laLavsP0l20/UP8fvT11T0QiAYh2VtDsihiV9T9Ktiesp2uOSbpT0ldSFFMH2BkmfkPRmSVdLutn21WmrKtxnJF2fuog2OSfpAxFxtaTXS/rTbn9/CYACRcRDEXGufvMbkoZS1lO0iDgZEd9NXUeBrpH0VER8PyKel3S3avNYulZEfEXS/6auox0i4kcR8e3672clnZS0PW1VxSIA2ufdkh5MXQSasl3S00tuT6nLDxC5sr1T0mslPZq2kmJxRbAmrWfNI9u3qda8vKudtRVhvWs8AZ3Kdr+kMUm3rFi5uOsQAE2KiDeudb/td0l6q6Q3RBdMumj0ervcKUlXLLk9VN+GLmG7R7WD/10RcW/qeopGF1CBbF8v6YCkvRHx89T1oGnfknSV7SttXyLp7ZI+n7gmtIhtS7pT0smI+FjqetqBACjWHZI2Szpav/zlJ1MXVCTbN9iekvRrku63/cXUNbVSfUD/fZK+qNoA4T0RcSJtVcWy/TlJX5f0KttTtt+TuqYCjUp6p6Tr6n+vE7bfkrqoIrEUBABkihYAAGSKAACATBEAAJApAgAAMkUAAECmCABgFbZfqJ8G+Ljtf7X9C/Xtl9m+2/Z/2j5u+wHbr1zy726xPWf7l5ZsG6ivMjlj+44UrwdYDQEArG42IvZExG5Jz0t6b32i0H2SvhwRr4iI16m2wuu2Jf/uZtUmjN24ZNucpD+X9GftKR1YHwIAaOyrknZJulbSQkT8/4S+iHgsIr4qSbZfIalf0odUC4LFx/wsIh5RLQiA0iAAgDXY3qja+v/fkbRb0vE1Hv521ZaI/qpqM2e3rfFYIDkCAFhdn+0JSeOS/lu1NWIauVnS3RFRVW1BsZsKrA9oGquBAqubjYg9SzfYPiHp91Z7sO1XS7pKtXWfJOkSSf+l2npQQCnRAgDW75ikXtv7FzfYHrb9m6p9+/9wROys/1wu6XLbL09VLNAIi8EBq7A9ExH9q2y/XNLHJb1OtUHdH0i6RbUVQt8SEU8ueezHJP04Im63/QNJv6hay+BZSb8TEU8U/TqAtRAAAJApuoAAIFMEAABkigAAgEwRAACQKQIAADJFAABApggAAMjU/wFur9UWE4oSkQAAAABJRU5ErkJggg==\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "tags": [], - "needs_background": "light" - } - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cT7Rv9lEDSji", - "colab_type": "text" - }, - "source": [ - "As we can see from above, the two new features (known as Principle Components) are not like any of our input features. Additionally, these two new components explains 64.6% of all the original variation. Let's see how well this performs in explaining our dataset." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "eBHVGhd-CJF8", - "colab_type": "code", - "colab": {} - }, - "source": [ - "reg = LinearRegression()\n", - "\n", - "reg.fit(X, y)\n", - "print(\"All Features:\\t\\t\", reg.score(X, y))\n", - "reg.fit(X_2d, y)\n", - "print(\"With PCA Features:\\t\", reg.score(X_2d, y), end=\"\\n\\n\")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bsoN_a8XI4p2", - "colab_type": "text" - }, - "source": [ - "As expected, with a higher explained variance ratio, we perform better in predicting the quality of the wine. With Dimensionality Reduction, we were able to capture most of the variation in the dataset while still being able to view it in two dimensions. In the most basic of terms, PCA creates a variable projected along the axis of maximum variable.\n", - "\n", - "![](https://raw.githubusercontent.com/bfkwong/data/master/IMG_0187%202.jpg)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vwDiZRX3JYLN", - "colab_type": "text" - }, - "source": [ - "### Optional: Linear Algebra Behind PCA\n", - "\n", - "Principle Component Analysis chooses principle components along te **axis of greatest variance**. In Linear Algebra terms, the axis of greatest variance is the **Eigenvector with the largest Eigenvalue of the covariance matrix ($\\Sigma$)**\n", - "\n", - "The following are the steps in order to do PCA manually. \n", - "\n", - "1. 
Given data matrix $M$, generate the covariance matrix of $M$ denoted as $\\Sigma$\n", - "2. We then compute the Eigenvector and Eigenvalue of covariance matrix $\\Sigma$\n", - "3. Project the features to the Eigenvector with the largest Eigenvalue using the dot product (cross product for more than 1 dimensions)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "J26dEAoAD0q7", - "colab_type": "code", - "colab": {} - }, - "source": [ - "import numpy as np\n", - "\n", - "# Step 1: Calculate covariance matrix \n", - "cov_mtrx = np.cov(X.T)\n", - "\n", - "# Step 2: Calculate Eigenvector and Eigenvalue\n", - "W,v = np.linalg.eig(cov_mtrx)\n", - "\n", - "# Step 3: Find the largest Eigenvalue and project our data onto the corresponding Eigenvector\n", - "idx_largest_eigenval = np.argmax(W)\n", - "eigenvec = v[:,idx_largest_eigenval]\n", - "\n", - "total = []\n", - "for row in X.index: \n", - " total.append(np.dot(X.loc[row], eigenvec))\n", - "\n", - "pd.DataFrame(pd.Series(total), columns=[\"PCA1\"]).head()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_G7NWh_1nf_k", - "colab_type": "text" - }, - "source": [ - "One interesting thing you may see here is that the eigenvalue corresponds to the variance explained by its corresponding eigenvalue. 
The eigenvalue of PCA1 is 2.26 and the variance of PCA1 is also 2.26." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "9QCALn43iTYd", - "colab_type": "code", - "colab": {} - }, - "source": [ - "idx_largest_eigenval = np.argmax(W)\n", - "variance = pd.Series(total).var()\n", - "\n", - "print(\"Largest Eigenvalue:\\t\", W[idx_largest_eigenval])\n", - "print(\"Variance of PCA1:\\t\", variance)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bmLOaYzvEeG7", - "colab_type": "text" - }, - "source": [ - "### Multidimensional Scaling (MDS)\n", - "\n", - "We now pay a visit to our good friend, the Euclidean distance. One incredibly useful aspect of Euclidean distance is that it works in higher dimensions. The formula $$\\sqrt{x_1^2 + x_2^2}$$ is for two-dimensional Euclidean distance, but moving it to a third dimension is as easy as adding an $x_3^2$ term under the square root. One thing to recognize is that Euclidean distance in any number of dimensions is still a single number. \n", - "\n", - "Why am I rambling on about something you learned in middle school? Well, the realization that Euclidean distance is a scalar in all dimensions means that we can preserve the variance of n-dimensional data in two dimensions as long as we try to ensure that the Euclidean distances in n dimensions are proportional to the Euclidean distances in 2 dimensions. That was a lot to take in; the following image illustrates the concept.\n", - "\n", - "![](https://raw.githubusercontent.com/bfkwong/data/master/IMG_0188.jpg)\n", - "\n", - "Notice how when we reduced our dimensions from 2 to 1, the distances between points A, B, and C remained the same, meaning that x, y, and z are the same in both dimensions. While distances may not always be preserved perfectly between dimensions, MDS attempts to preserve them as well as possible. 
" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "L4AjVFrwn78I", - "colab_type": "code", - "colab": {} - }, - "source": [ - "from sklearn.manifold import MDS\n", - "# We want to be able to plot this on a 2-D scatter plot so we choose 2 dimensions\n", - "dimension_we_want = 2\n", - "\n", - "mds = MDS(n_components=dimension_we_want)\n", - "X_2d = mds.fit_transform(X) \n", - "X_2d = pd.DataFrame(X_2d, columns=[\"Dimension 1\", \"Dimension 2\"])\n", - "\n", - "display(X_2d.head())\n", - "print(\"Fraction of variance explained:\", X_2d.var().sum()/X.var().sum())\n", - "X_2d.plot.scatter(x=\"Dimension 1\", y=\"Dimension 2\")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mlIA-o_cEnSD", - "colab_type": "text" - }, - "source": [ - "By using MDS, we were actually able to preserve over 95% of the variance from the original dataset. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "88kI9JxzFO6n", - "colab_type": "text" - }, - "source": [ - "### Linear vs Nonlinear Dimensionality Reduction \n", - "\n", - "Thus far, we have explored PCA (a linear reduction technique) and MDS (a nonlinear reduction technique). While there are a lot of differences between the two methods, the key difference can be boiled down to the following statement: **linear dimensionality reduction techniques only stretch and shift the data while nonlinear techniques make more drastic changes to the data**.\n", - "\n", - "This sometimes leads to nonlinear techniques being better at capturing variance but losing the overall shape of the data, whereas linear techniques are better at keeping the general shape of the original data but lose more variance along the way. 
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7uprQWlUGC37", - "colab_type": "text" - }, - "source": [ - "# Exercises" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UnMJKgsTkDoZ", - "colab_type": "text" - }, - "source": [ - "1. Consider the Iris dataset (https://raw.githubusercontent.com/dlsun/pods/master/data/iris.csv). Drop the \"SepalWidth\" and \"PetalWidth\" columns and then apply PCA on \"SepalLength\" and \"PetalLength\" with `n_components = 2`. What percentage of the variance was PCA able to capture in this case? What happens when we use PCA to compress 2D data into 2D data?" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "HCXKtA8AlSQq", - "colab_type": "code", - "colab": {} - }, - "source": [ - "" - ], - "execution_count": 0, - "outputs": [] - } - ] -} \ No newline at end of file diff --git a/10-Textual-Data/10.5 Sentiment_Analysis.ipynb b/10-Textual-Data/10.5 Sentiment_Analysis.ipynb new file mode 100644 index 0000000..a235b4f --- /dev/null +++ b/10-Textual-Data/10.5 Sentiment_Analysis.ipynb @@ -0,0 +1,1700 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Copy of 10.5 Regular Expression and Sentiment Analysis.ipynb", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "3UYXoMwEpG40", + "colab_type": "text" + }, + "source": [ + "# 10.5 Regular Expression and Sentiment Analysis\n", + "\n", + "Sentiment analysis is the use of natural language processing to quantify subjective information. Our goal in this section is to use machine learning to identify whether a piece of text captures positive, negative, or neutral emotions. Sentiment analysis has become more prevalent in our world through its application in algorithmic trading, recommendation systems, and market research. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2vgbSDB4v5rF", + "colab_type": "text" + }, + "source": [ + "## Sentiment Analysis\n", + "\n", + "Consider the following sentences:\n", + "\n", + "1. \"I am so happy to be here right now!\"\n", + "2. \"I'm pretty sad about this whole thing.\"\n", + "\n", + "Most people would agree that the first sentence exhibits positive emotion and the second sentence exhibits negative emotion. We perceive it to be this way because the first sentence has the word happy and the second sentence has the word sad. With this very simple idea in mind, we can build a very naïve classifier that determines if a sentence exhibits positive, negative, or neutral emotion. \n", + "\n", + "Our output is a score between -1 and 1: sentences scoring towards -1 have negative sentiment and sentences scoring towards +1 have positive sentiment.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5tcQToKM0eBY", + "colab_type": "code", + "outputId": "8f297650-236b-475d-d25c-992a963b8a4b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 68 + } + }, + "source": [ + "positive_words = set([\"happy\", \"great\", \"fantastic\", \"love\", \"appreciate\", \"grateful\"])\n", + "negative_words = set([\"sad\", \"gross\", \"disturbing\", \"bitter\", \"sorry\", \"pathetic\"])\n", + "\n", + "def sentiment_analyzer_v2(sentence): \n", + "    sentence = sentence.lower().split(\" \")\n", + "\n", + "    pos_word_cnt = 0\n", + "    neg_word_cnt = 0\n", + "\n", + "    for word in sentence: \n", + "        if word in positive_words: \n", + "            pos_word_cnt += 1\n", + "        elif word in negative_words: \n", + "            neg_word_cnt += 1\n", + "\n", + "    # A sentence with no known sentiment words is treated as neutral\n", + "    # (this also avoids dividing by zero)\n", + "    total_cnt = pos_word_cnt + neg_word_cnt\n", + "    if total_cnt == 0: \n", + "        return 0.0\n", + "    return (pos_word_cnt - neg_word_cnt) / total_cnt\n", + "\n", + "print(sentiment_analyzer_v2(\"This is making me happy !\"))\n", + "print(sentiment_analyzer_v2(\"This is making me sad !\"))\n", + "print(sentiment_analyzer_v2(\"I am neither happy nor sad .\"))" + ], + "execution_count": 1, + 
"outputs": [ + { + "output_type": "stream", + "text": [ + "1.0\n", + "-1.0\n", + "0.0\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gk2g-nvq1-Xf", + "colab_type": "text" + }, + "source": [ + "Given a big enough dictionary of positive and negative words, this algorithm can work pretty well. But it takes a lot of work to figure out which words are happy and then type them into a list; it's just not efficient. So, being the clever Data Scientists that we are, let's create a machine learning algorithm. \n", + "\n", + "Here is the plan: let's get a list of texts that are labelled as either positive, negative, or neutral. We use a count vector to see which words occurred in which text and how many times each word occurred. We then train a machine learning algorithm on this count vector along with the sentiment label. In essence, this algorithm is saying \"if these words occurred $n$ times in a text, then the text is likely to have a specific sentiment\".\n", + "\n", + "Download a set of tweets from (link) and you will see that there are two features that we care about: polarity and text. Polarity tells us what the sentiment is (0 for negative, 2 for neutral, 4 for positive). To keep things consistent with our naive algorithm above, let's map the polarity such that positive is +1, neutral is 0, and negative is -1. With this encoding, the closer our prediction is to 1, the more positive the sentiment is and the closer our prediction is to -1, the more negative the sentiment is. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "d1Vy9AYt4F3-", + "colab_type": "code", + "outputId": "a6c0836d-268a-4fda-9983-5dc182601e16", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + } + }, + "source": [ + "import pandas as pd\n", + "\n", + "df_tweets = pd.read_csv(\"https://raw.githubusercontent.com/bfkwong/data/master/twitter_sentiment.csv\")\n", + "df_tweets[\"polarity\"] = df_tweets[\"polarity\"].map({4:1, 2:0, 0:-1})\n", + "df_tweets.head()" + ], + "execution_count": 2, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
polaritytweet_idtweet_datequeryusertext
013Mon May 11 03:17:40 UTC 2009kindle2tpryan@stellargirl I loooooooovvvvvveee my Kindle2. ...
114Mon May 11 03:18:03 UTC 2009kindle2vcu451Reading my kindle2... Love it... Lee childs i...
215Mon May 11 03:18:54 UTC 2009kindle2chadfuOk, first assesment of the #kindle2 ...it fuck...
316Mon May 11 03:19:04 UTC 2009kindle2SIX15@kenburbary You'll love your Kindle2. I've had...
417Mon May 11 03:21:41 UTC 2009kindle2yamarama@mikefish Fair enough. But i have the Kindle2...
\n", + "
" + ], + "text/plain": [ + " polarity ... text\n", + "0 1 ... @stellargirl I loooooooovvvvvveee my Kindle2. ...\n", + "1 1 ... Reading my kindle2... Love it... Lee childs i...\n", + "2 1 ... Ok, first assesment of the #kindle2 ...it fuck...\n", + "3 1 ... @kenburbary You'll love your Kindle2. I've had...\n", + "4 1 ... @mikefish Fair enough. But i have the Kindle2...\n", + "\n", + "[5 rows x 6 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 2 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zIFResqfRiOt", + "colab_type": "text" + }, + "source": [ + "### Text Normalization\n", + "\n", + "The goal of normalizing is to remove excess noise so that the algorithm only has to focus on what is important. Think of this process as the text version of `StandardScaler`.\n", + "\n", + "**Lemmatization** is one popular normalization technique. During lemmatization, the words `studies` and `studying` both get lemmatized to `study`. In essence, the process of lemmatization turns different forms of the same word (e.g. studies, studying) into the same base lemma (study). This helps reduce the noise in our dataset. \n", + "\n", + "**Stop word removal** is another way to normalize our text in order to reduce noise. Stop words such as \"there\", \"how\", \"then\", \"we\" offer no additional clues for deciding what the sentiment of a sentence is. Thus, it makes sense for us to remove these words before training our algorithm. \n", + "\n", + "Lemmatization and stop word removal are both tedious tasks for us to do by hand, which is why NLTK provides us with functions that do both for us. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tloDK32sSxNt", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 306 + }, + "outputId": "bb0d1c6d-14a7-42c9-959e-2be71e324325" + }, + "source": [ + "import nltk\n", + "\n", + "nltk.download('punkt')\n", + "nltk.download('stopwords')\n", + "nltk.download('wordnet')\n", + "\n", + "from nltk.tokenize import word_tokenize\n", + "from nltk.corpus import stopwords\n", + "from nltk.stem.wordnet import WordNetLemmatizer\n", + "\n", + "stop_words=set(stopwords.words(\"english\"))\n", + "tweets = list(df_tweets[\"text\"])\n", + "for tweet in range(len(tweets)): \n", + " tweets[tweet] = [x for x in word_tokenize(tweets[tweet]) if x not in stop_words]\n", + "\n", + "lem = WordNetLemmatizer()\n", + "for tweet in range(len(tweets)): \n", + " tweets[tweet] = [lem.lemmatize(x) for x in tweets[tweet]]\n", + "\n", + "df_tweets[\"processed_text\"] = [\" \".join(x) for x in tweets]\n", + "df_tweets.head()" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "text": [ + "[nltk_data] Downloading package punkt to /root/nltk_data...\n", + "[nltk_data] Package punkt is already up-to-date!\n", + "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", + "[nltk_data] Package stopwords is already up-to-date!\n", + "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", + "[nltk_data] Package wordnet is already up-to-date!\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
polaritytweet_idtweet_datequeryusertextprocessed_text
013Mon May 11 03:17:40 UTC 2009kindle2tpryan@stellargirl I loooooooovvvvvveee my Kindle2. ...@ stellargirl I loooooooovvvvvveee Kindle2 . N...
114Mon May 11 03:18:03 UTC 2009kindle2vcu451Reading my kindle2... Love it... Lee childs i...Reading kindle2 ... Love ... Lee child good re...
215Mon May 11 03:18:54 UTC 2009kindle2chadfuOk, first assesment of the #kindle2 ...it fuck...Ok , first assesment # kindle2 ... fucking roc...
316Mon May 11 03:19:04 UTC 2009kindle2SIX15@kenburbary You'll love your Kindle2. I've had...@ kenburbary You 'll love Kindle2 . I 've mine...
417Mon May 11 03:21:41 UTC 2009kindle2yamarama@mikefish Fair enough. But i have the Kindle2...@ mikefish Fair enough . But Kindle2 I think '...
\n", + "
" + ], + "text/plain": [ + " polarity ... processed_text\n", + "0 1 ... @ stellargirl I loooooooovvvvvveee Kindle2 . N...\n", + "1 1 ... Reading kindle2 ... Love ... Lee child good re...\n", + "2 1 ... Ok , first assesment # kindle2 ... fucking roc...\n", + "3 1 ... @ kenburbary You 'll love Kindle2 . I 've mine...\n", + "4 1 ... @ mikefish Fair enough . But Kindle2 I think '...\n", + "\n", + "[5 rows x 7 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zMUQylZN4JQC", + "colab_type": "text" + }, + "source": [ + "With this data, let's use the CountVectorizer to turn the tweets into word counts. To make sure we exclude any hashtags and @ symbols, we will also specify a regular expression tokenizer that only keeps alphabetic characters, passed through the `tokenizer` parameter.\n", + "\n", + "Additionally, we want to specify `ngram_range = (1,2)`. This is to help provide context to the words. Consider the double negative string `I do not dislike`, which carries positive sentiment. If we split the words into unigrams, we get words like `not` and `dislike`, which are negative words. By using bigrams, we are able to train the algorithm to realize that `not dislike` is actually a positive term. This gives the algorithm a chance to handle hard-to-decipher sentiments such as double negatives. One caveat: common English stop word lists, including the NLTK and scikit-learn lists used in this section, contain `not`, so negation bigrams like `not dislike` only survive if `not` is kept out of the stop word list. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "25ZsIkbos6gR", + "colab_type": "code", + "colab": {} + }, + "source": [ + "from sklearn.feature_extraction.text import CountVectorizer\n", + "from nltk.tokenize import RegexpTokenizer\n", + "\n", + "token = RegexpTokenizer(r'[a-zA-Z]+')\n", + "cv = CountVectorizer(lowercase=True,stop_words='english',ngram_range = (1,2),tokenizer = token.tokenize)\n", + "tweet_text_cv = cv.fit_transform(df_tweets['processed_text'])" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mSjQ6Pms4Ang", + "colab_type": "text" + }, + "source": [ + "Why are we doing this? Given that we have a label for whether the piece of text exhibits positive, negative, or neutral emotions, we can use the count vector to see what words tend to occur in positive sentences, and etc. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I9h-lDkt5QAb", + "colab_type": "code", + "outputId": "15b3389d-78b5-4c80-e6bf-f3eb13fc43c2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 270 + } + }, + "source": [ + "vocab = [[x, cv.vocabulary_[x]] for x in cv.vocabulary_]\n", + "vocab.sort(key=lambda x:x[1])\n", + "vocab = [x[0] for x in vocab]\n", + "\n", + "df_twitter_cv = pd.DataFrame(tweet_text_cv.todense(), columns=vocab)\n", + "df_twitter_cv[\"polarity\"] = df_tweets[\"polarity\"]\n", + "\n", + "df_twitter_cv.head()" + ], + "execution_count": 40, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
aaplaapl esabortionabortion zealotabsolutelyabsolutely blowabsolutely hilariousaccannisaccannis edogaccessaccess damnaccess throttleaccidentaccident guessaccident locationaccordingaccording createaccostsaccosts rogeraccountaccount requestacgacg customachingaciaacia pillsactuallyactually quiteadad adobead wadamadam lambertaddadd peopleaddictionaddiction thankaddictiveadidasadidas billups...years greatyeeeeeyeezyyeezy khakiyemayesyes gmyes lolyes myes videoyesterdayyesterday cbsykyoyo teachyorkyork timesyoutubeyoutube adobeyryr oldytzyuanyuan investedyummmmmyzealotzealot nzerozero desirezetzet oziczlffzomgzomg gzoomzoom lebronzydrunaszydrunas awesomepolarity
00000000000000000000000000000000000000000...0000000000000000000000000000000000000001
10000000000000000000000000000000000000000...0000000000000000000000000000000000000001
20000000000000000000000000000000000000000...0000000000000000000000000000000000000001
30000000000000000000000000000000000000000...0000000000000000000000000000000000000001
40000000000000000000000000000000000000000...0000000000000000000000000000000000000001
\n", + "

5 rows × 5483 columns

\n", + "
" + ], + "text/plain": [ + " aapl aapl es abortion ... zydrunas zydrunas awesome polarity\n", + "0 0 0 0 ... 0 0 1\n", + "1 0 0 0 ... 0 0 1\n", + "2 0 0 0 ... 0 0 1\n", + "3 0 0 0 ... 0 0 1\n", + "4 0 0 0 ... 0 0 1\n", + "\n", + "[5 rows x 5483 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 40 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qRecjyTw7iB0", + "colab_type": "text" + }, + "source": [ + "If a certain word occurs very frequently in texts that are labeled as positive texts, then we can make the assumption that the word is positive. So if in the future we encounter a sentence with this word, we should classify the sentence as positive. \n", + "\n", + "With this idea in mind, let's train a model to predict whether a sentence exhibits positive, negative, or neutral emotions. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5MQxI6NTtkQ6", + "colab_type": "code", + "outputId": "d5cd83c2-9d91-44ac-c4ca-a2806be56931", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 51 + } + }, + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn import metrics\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(tweet_text_cv, \n", + " df_tweets['polarity'], \n", + " test_size=0.3, \n", + " random_state=1)\n", + "\n", + "sentiment_analyzer = LinearRegression().fit(X_train, y_train)\n", + "predicted = sentiment_analyzer.predict(X_test)\n", + "print(\"LinearRegression R^2:\\t\", sentiment_analyzer.score(X_test, y_test))\n", + "print(\"LinearRegression MSE:\\t\",metrics.mean_squared_error(y_test, predicted))" + ], + "execution_count": 41, + "outputs": [ + { + "output_type": "stream", + "text": [ + "LinearRegression R^2:\t 0.3904061882859056\n", + "LinearRegression MSE:\t 0.43156532563528044\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"336lT6cO9C5A", + "colab_type": "text" + }, + "source": [ + "Let's see this bad boy in action. Consider the following sentences. The model correctly classifies the first sentence as having positive-leaning polarity and the second as having negative-leaning polarity." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "truyzLGe8iB6", + "colab_type": "code", + "outputId": "58d6be48-f315-400b-cad3-15b1de6e870f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "source": [ + "# Clearly positive sentence\n", + "test = cv.transform([\"I love being here!\"]).toarray()\n", + "sentiment_analyzer.predict(test)" + ], + "execution_count": 47, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.23681886])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 47 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QeZ66XQUUj5A", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "759a34ab-8f7d-4606-e8d3-911d219fe465" + }, + "source": [ + "# Clearly negative sentence\n", + "test = cv.transform([\"I hate this.\"]).toarray()\n", + "sentiment_analyzer.predict(test)" + ], + "execution_count": 48, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([-0.41152944])" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 48 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QnY1HqkAVH09", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "49e33736-cd9a-4090-af71-fee58fb0fa5c" + }, + "source": [ + "# Ambiguous positive sentence with double negative\n", + "test = cv.transform([\"I do not dislike school.\"]).toarray()\n", + "sentiment_analyzer.predict(test)" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([0.05733004])" + ] + }, + "metadata": { + "tags": [] + }, + 
"execution_count": 9 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZYD55k4NXLSp", + "colab_type": "text" + }, + "source": [ + "Since we used a linear model, we can analyze the coefficient of each variable to see what contributes most to positive and negative sentiment. Recall that each variable represents an n-gram (either a unigram or a bigram): the n-gram with the largest coefficient is the `most positive` and the n-gram with the smallest coefficient is the `most negative`. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "zXUbPfHOYb9o", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 450 + }, + "outputId": "859752b3-6459-416e-f604-499074f36d5a" + }, + "source": [ + "ngrams = [x for x in df_twitter_cv.columns if x != \"polarity\"]\n", + "\n", + "df_word_coef = pd.DataFrame([ngrams,sentiment_analyzer.coef_], index=[\"word\", \"coef\"]).T.set_index(\"word\")\n", + "df_word_coef.sort_values(\"coef\", ascending=False)" + ], + "execution_count": 10, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
coef
word
loves twitter0.441781
loves0.441781
g0.404112
cool0.387307
loved0.38585
......
gm-0.389355
comcast-0.390205
fighting-0.406231
fighting latex-0.406231
hate-0.421888
\n", + "

5482 rows × 1 columns

\n", + "
" + ], + "text/plain": [ + " coef\n", + "word \n", + "loves twitter 0.441781\n", + "loves 0.441781\n", + "g 0.404112\n", + "cool 0.387307\n", + "loved 0.38585\n", + "... ...\n", + "gm -0.389355\n", + "comcast -0.390205\n", + "fighting -0.406231\n", + "fighting latex -0.406231\n", + "hate -0.421888\n", + "\n", + "[5482 rows x 1 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 10 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-ivL_T4qa16K", + "colab_type": "text" + }, + "source": [ + "As expected, words like `hate` and `fighting` have very negative connotations while words like `loves` and `cool` have very positive connotations. Another thing we can look at is the intercept, which is the model's prediction for a text containing none of the known n-grams and can be read as a rough baseline sentiment for the training corpus. As you can see below, this baseline is close to neutral." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rocJ7syaYd82", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "49a88a1a-c8b6-4e89-c529-049eecd3a8a6" + }, + "source": [ + "sentiment_analyzer.intercept_" + ], + "execution_count": 11, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.01035837109566326" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 11 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4iu58XS1bjc0", + "colab_type": "text" + }, + "source": [ + "## NLTK Implementation \n", + "\n", + "This is a lot of tedious work, and whenever there is a lot of tedious work, you can bet that there's a library for that. The following is the NLTK sentiment analysis workflow, which uses a technique very similar to the one we implemented above. \n", + "\n", + "NLTK is a different library from scikit-learn, so it will require us to do our preprocessing a bit differently. The remainder of the section will walk you through the differences."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yzphuVSGrUIr", + "colab_type": "text" + }, + "source": [ + "### Text Preprocessing\n", + "\n", + "NLTK requires that your training examples are in the form of a list of 2-element tuples, where the first element is a list of word tokens and the second element is the class label. An example would be: \n", + "\n", + "```\n", + "[([\"I\", \"love\", \"Pepsi\"], 1), ([\"I\", \"am\", \"not\", \"a\", \"fan\", \"of\", \"Coke\"], -1), ...]\n", + "```\n", + "\n", + "The following code creates this encoding as well as a train/test split:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KSz0heZnmATQ", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 425 + }, + "outputId": "1d597d9d-7b74-4b8b-a0ca-a8a924ec7c42" + }, + "source": [ + "from nltk.tokenize import TweetTokenizer\n", + "from nltk.classify import NaiveBayesClassifier\n", + "from nltk.sentiment import SentimentAnalyzer\n", + "from nltk.sentiment.util import *\n", + "import random\n", + "\n", + "tweet_tknize = TweetTokenizer()\n", + "df_tweetsnltk = df_tweets.copy()[[\"polarity\", \"processed_text\"]]\n", + "\n", + "polarity_score = df_tweetsnltk.polarity\n", + "tokenized_tweets = list(df_tweetsnltk.processed_text.apply(tweet_tknize.tokenize))\n", + "\n", + "# Pair each token list with its label: (tokens, polarity)\n", + "tweets_formatted = []\n", + "for x in range(len(polarity_score)):\n", + "    tweets_formatted.append((tokenized_tweets[x], polarity_score[x]))\n", + "\n", + "# First 70% of the tweets for training, the remaining 30% for testing\n", + "training_tweets = tweets_formatted[:int(0.70 * len(tweets_formatted))]\n", + "testing_tweets = tweets_formatted[int(0.70 * len(tweets_formatted)):]\n", + "\n", + "tweets[0:2]" + ], + "execution_count": 28, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[['@',\n", + " 'stellargirl',\n", + " 'I',\n", + " 'loooooooovvvvvveee',\n", + " 'Kindle2',\n", + " '.',\n", + " 'Not',\n", + " 'DX',\n", + " 'cool',\n", + " ',',\n", + " '2',\n", + " 'fantastic',\n", + " 
'right',\n", + " '.'],\n", + " ['Reading',\n", + " 'kindle2',\n", + " '...',\n", + " 'Love',\n", + " '...',\n", + " 'Lee',\n", + " 'child',\n", + " 'good',\n", + " 'read',\n", + " '.']]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 28 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "znirIYg1srFx", + "colab_type": "text" + }, + "source": [ + "### Feature Extraction\n", + "\n", + "Now that we have the tokenized tweets and their labels, let's create our `SentimentAnalyzer` object and extract unigrams to prepare the tweets for training: " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hIByTveaenDj", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Create our SentimentAnalyzer\n", + "sentim_analyzer = SentimentAnalyzer()\n", + "\n", + "# Collect every token in the training set into one flat list,\n", + "# which is the input the next function expects\n", + "all_words = sentim_analyzer.all_words(training_tweets)\n", + "\n", + "# Keep only the unigrams (single tokens) that clear the min_freq frequency threshold\n", + "unigram_feats = sentim_analyzer.unigram_word_feats(all_words, min_freq=4)\n", + "\n", + "# Register the unigram feature extractor with our SentimentAnalyzer object\n", + "sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ISXRyMlfvMji", + "colab_type": "text" + }, + "source": [ + "### Training our Sentiment Analyzer\n", + "\n", + "We will be using a `NaiveBayesClassifier` for this instance. 
NLTK's `SentimentAnalyzer` only supports classifiers, so each polarity value is treated as a class label." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ICZ4jrtcsUZd", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "ba1e989e-2e4a-4560-8ebd-486d181619ee" + }, + "source": [ + "# Convert each (tokens, label) pair into a (feature dict, label) pair\n", + "training_set = sentim_analyzer.apply_features(training_tweets)\n", + "# Pass the training function itself; SentimentAnalyzer.train calls it on the training set\n", + "trainer = NaiveBayesClassifier.train\n", + "classifier = sentim_analyzer.train(trainer, training_set)" + ], + "execution_count": 57, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Training classifier\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TZLNMDM8vaun", + "colab_type": "text" + }, + "source": [ + "### Testing our Sentiment Analyzer" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wigp271ivLsI", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 204 + }, + "outputId": "e39420fe-1650-4f94-f9f0-0f7a858f4f9f" + }, + "source": [ + "testing_set = sentim_analyzer.apply_features(testing_tweets)\n", + "sorted(sentim_analyzer.evaluate(testing_set).items())" + ], + "execution_count": 58, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Evaluating NaiveBayesClassifier results...\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[('Accuracy', 0.6333333333333333),\n", + " ('F-measure [-1]', 0.6837606837606838),\n", + " ('F-measure [0]', 0.6304347826086957),\n", + " ('F-measure [1]', 0.5714285714285714),\n", + " ('Precision [-1]', 0.625),\n", + " ('Precision [0]', 0.8055555555555556),\n", + " ('Precision [1]', 0.52),\n", + " ('Recall [-1]', 0.7547169811320755),\n", + " ('Recall [0]', 0.5178571428571429),\n", + " ('Recall [1]', 0.6341463414634146)]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 58 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rUFp7fSwvvD7", + "colab_type": "text" + }, + 
"source": [ + "Now that we know it works, let's test it with some random tweets we pulled from Twitter." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AptkeGbNvcbC", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "b709bf50-b3f1-4e87-b2ea-0dd815693f98" + }, + "source": [ + "# Clearly positive sentence\n", + "sentim_analyzer.classify(\"I love being here!\".split(\" \"))" + ], + "execution_count": 63, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 63 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "j8jolufOwzVy", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "f430a625-7430-4598-8448-acc1c286a163" + }, + "source": [ + "# Clearly negative sentence\n", + "sentim_analyzer.classify(\"I hate this.\".split(\" \"))" + ], + "execution_count": 64, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "-1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 64 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2j7DvCPIyG3s", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "095aff26-2077-49e4-e14c-bdfb1622965d" + }, + "source": [ + "# Ambiguous positive sentence with double negative\n", + "sentim_analyzer.classify(\"I do not dislike school.\".split(\" \"))" + ], + "execution_count": 65, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "1" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 65 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZXaLsyo7y4Z3", + "colab_type": "text" + }, + "source": [ + "# Exercises" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "x5ANpEW3y5Mx", + "colab_type": "code", + "colab": {} + }, 
+ "source": [ + "" + ], + "execution_count": 0, + "outputs": [] + } + ] +} \ No newline at end of file