a #4

Binary file added .DS_Store
Binary file not shown.
307 changes: 307 additions & 0 deletions knn/.ipynb_checkpoints/INFO370-KNN_Exercise-checkpoint.ipynb
@@ -0,0 +1,307 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Import modules"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.datasets import load_boston\n",
"from sklearn.cross_validation import train_test_split\n",
"from sklearn.preprocessing import scale\n",
"from sklearn.neighbors import KNeighborsRegressor\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.cross_validation import KFold\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Load data\n",
"\n",
"For this exercise, we will be using a dataset of housing prices in Boston during the 1970s. Python's super-awesome sklearn package already has the data we need to get started. Below is the command to load the data. The data is stored as a dictionary. \n",
"\n",
"The 'DESCR' key holds a description of the data, and the command for printing it is below. Note all the features we have to work with. From the dictionary, we need the data and the target variable (in this case, housing price). Store these as variables named \"data\" and \"price\", respectively. Once you have these, print their shapes to check that everything matches the DESCR."
]
},
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Boston House Prices dataset\n",
"\n",
"Notes\n",
"------\n",
"Data Set Characteristics: \n",
"\n",
" :Number of Instances: 506 \n",
"\n",
" :Number of Attributes: 13 numeric/categorical predictive\n",
" \n",
" :Median Value (attribute 14) is usually the target\n",
"\n",
" :Attribute Information (in order):\n",
" - CRIM per capita crime rate by town\n",
" - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n",
" - INDUS proportion of non-retail business acres per town\n",
" - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n",
" - NOX nitric oxides concentration (parts per 10 million)\n",
" - RM average number of rooms per dwelling\n",
" - AGE proportion of owner-occupied units built prior to 1940\n",
" - DIS weighted distances to five Boston employment centres\n",
" - RAD index of accessibility to radial highways\n",
" - TAX full-value property-tax rate per $10,000\n",
" - PTRATIO pupil-teacher ratio by town\n",
" - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n",
" - LSTAT % lower status of the population\n",
" - MEDV Median value of owner-occupied homes in $1000's\n",
"\n",
" :Missing Attribute Values: None\n",
"\n",
" :Creator: Harrison, D. and Rubinfeld, D.L.\n",
"\n",
"This is a copy of UCI ML housing dataset.\n",
"http://archive.ics.uci.edu/ml/datasets/Housing\n",
"\n",
"\n",
"This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n",
"\n",
"The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n",
"prices and the demand for clean air', J. Environ. Economics & Management,\n",
"vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n",
"...', Wiley, 1980. N.B. Various transformations are used in the table on\n",
"pages 244-261 of the latter.\n",
"\n",
"The Boston house-price data has been used in many machine learning papers that address regression\n",
"problems. \n",
" \n",
"**References**\n",
"\n",
" - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n",
" - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n",
" - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)\n",
"\n"
]
}
],
"source": [
"boston = load_boston()\n",
"print(boston.DESCR)"
]
},
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(506, 13)\n",
"(506,)\n"
]
}
],
"source": [
"data = boston.data\n",
"price = boston.target\n",
"print(data.shape)\n",
"print(price.shape)"
]
},
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train-Test split\n",
"\n",
"Now, using sklearn's train_test_split (see [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for more; I've already imported it for you), let's make a random train-test split with the test size equal to 30% of our data (i.e. set the test_size parameter to 0.3). For consistency, let's also set the random_state parameter to 11.\n",
"\n",
"Name the variables data_train, price_train for the training data and data_test, price_test for the test data. As a sanity check, let's also print the shapes of these variables."
]
},
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data_train, data_test, price_train, price_test = train_test_split(data, price, test_size = 0.30, random_state=11)"
]
},
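{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Sanity check suggested in the markdown above: confirm the shapes of the 70/30 split.\n",
"print(data_train.shape, price_train.shape)\n",
"print(data_test.shape, price_test.shape)"
]
},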
{
"cell_type": "raw",
"metadata": {},
"source": [
"print (data_train)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scale our data\n",
"\n",
"Before we get too far ahead, let's scale our data. Let's subtract the min from each column (feature) and divide by the difference between the max and min for each column. \n",
"\n",
"Here's where things can get tricky. Remember, our test data is unseen, yet we still need to scale it. We cannot scale using its own min/max, because unseen data might not be available to us en masse. Instead, we use the training min/max to scale the test data.\n",
"\n",
"Be sure to check which axis you use to take the mins/maxes!\n",
"\n",
"Let's add a \"\\_stand\" suffix to our train/test variable names for the standardized values"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"mins = np.min(data_train, axis = 0)\n",
"maxes = np.max(data_train, axis = 0)\n",
"diff = maxes - mins"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"diff"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 8.89698800e+01, 9.50000000e+01, 2.72800000e+01,\n",
" 1.00000000e+00, 4.86000000e-01, 4.86200000e+00,\n",
" 9.38000000e+01, 1.09969000e+01, 2.30000000e+01,\n",
" 5.24000000e+02, 9.40000000e+00, 3.96580000e+02,\n",
" 3.62400000e+01])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_train_stand = (data_train - mins) / diff\n",
"data_test_stand = (data_test - mins) / diff"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# K-Fold CV\n",
"\n",
"Now, here's where things might get really messy. Let's implement 10-Fold Cross Validation on K-NN across a range of K values (given below - 9 total). We'll keep our K for K-fold CV constant at 10. \n",
"\n",
"Let's determine our accuracy using an RMSE (root-mean-square-error) value based on Euclidean distance. Save the errors for each fold at each K value (10 folds x 9 K values = 90 values) as you loop through.\n",
"\n",
"Take a look at [sklearn's K-fold CV](http://scikit-learn.org/0.17/modules/generated/sklearn.cross_validation.KFold.html). Also, sklearn has its own [K-NN implementation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor). There is also an implementation of [mean squared error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html), though you'll have to take the root yourself. I've imported these for you already. :)"
]
},
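{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of one way to structure the loop (not the official solution). It assumes the standardized training data `data_train_stand` and targets `price_train` defined above, a hypothetical grid of nine K values (substitute the values actually given), and the older `sklearn.cross_validation.KFold` interface imported at the top of this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Minimal sketch: 10-fold CV over a hypothetical grid of 9 K values.\n",
"# Uses the deprecated sklearn.cross_validation.KFold API imported above;\n",
"# with sklearn.model_selection you would build KFold(n_splits=10) and loop over kf.split(X) instead.\n",
"k_values = [1, 2, 3, 5, 7, 9, 11, 15, 21]  # hypothetical grid; replace with the 9 values from the exercise\n",
"rmse = np.zeros((10, len(k_values)))       # 10 folds x 9 K values = 90 errors\n",
"kf = KFold(len(data_train_stand), n_folds=10, shuffle=True, random_state=11)\n",
"for j, k in enumerate(k_values):\n",
"    for i, (train_idx, val_idx) in enumerate(kf):\n",
"        knn = KNeighborsRegressor(n_neighbors=k)  # Euclidean (Minkowski p=2) distance by default\n",
"        knn.fit(data_train_stand[train_idx], price_train[train_idx])\n",
"        preds = knn.predict(data_train_stand[val_idx])\n",
"        rmse[i, j] = np.sqrt(mean_squared_error(price_train[val_idx], preds))"
]
},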
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Plot Results\n",
"\n",
"Plot your cross-validation error (RMSE) across all folds as a function of K. What do you see?"
]
},
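{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal plotting sketch, assuming the `rmse` array (folds x K values) and the hypothetical `k_values` grid from the sketch above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Sketch: average the RMSE over the 10 folds and plot it against K (assumes rmse and k_values from above).\n",
"mean_rmse = rmse.mean(axis=0)\n",
"plt.plot(k_values, mean_rmse, marker='o')\n",
"plt.xlabel('K (number of neighbors)')\n",
"plt.ylabel('Mean 10-fold CV RMSE')\n",
"plt.title('Cross-validation error vs. K')\n",
"plt.show()"
]
},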
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}