DAT6 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (2/21/15 - 5/2/15). View student work in the student repository.

Instructors: Sinan Ozdemir and Josiah Davis.

Office hours: 5-7pm on Tuesdays and 5-7pm on Saturdays at General Assembly.

Machine Learning Overview: See here for an overview of the machine learning models! Please treat this only as a broad general overview, not as law; it makes some generalizations.

Course Project Information

Saturday | Topic | Project Milestone
--- | --- | ---
2/21 | Introduction / Pandas |
2/28 | Git(hub) / Getting Data |
3/7 | Advanced Pandas / Machine Learning | One-Page Write-up with Data
3/14 (π == τ/2 day) | Model Evaluation / Logistic Regression |
3/21 | Linear Regression | 2-3 Minute Presentation
3/28 | Data Problem / Clustering and Visualization |
4/2 | Naive Bayes / Natural Language Processing | Deadline for Topic Changes
4/11 | Trees / Ensembles | First Draft Due (Peer Review)
4/18 | PCA / Databases / MapReduce |
4/25 | Recommendation Engines |
5/2 | Project Presentations | Presentation

Installation and Setup

  • Install the Anaconda distribution of Python 2.7.x.
  • Install Git and create a GitHub account.
  • Once you receive an email invitation from Slack, join our "DAT6 team" and add your photo!

Class 1: Introduction and Pandas

Agenda:

  • Introduction to General Assembly
  • Course overview: our philosophy and expectations (slides)
  • Data science overview (slides)
  • Data Analysis in Python (code)
  • Tools: check for proper setup of Anaconda, overview of Slack

Homework:

Optional:

  • Review your base Python (code)

Class 2: Git(hub) and Getting Data

Agenda:

Homework:

Resources:


Class 3: Advanced Pandas and Machine Learning

Agenda:

Homework:

  • Complete the advanced Pandas homework (Submit on the Dat6-students repo via a pull request)
  • Continue to develop your project. If you have a dataset, explore it with pandas. If you don't have a dataset yet, you should prioritize getting the data. (Nothing to turn in for next week.)
  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it next class. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read (a short KNN sketch after this list may help you experiment):
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?
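
To make these questions concrete, here is a minimal sketch, assuming scikit-learn and synthetic data rather than the article's Party Registration example, of how training and test accuracy respond to K:

```python
# A minimal sketch (not the article's data): watch training vs. test
# accuracy as K changes on a synthetic two-feature problem.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(400, 2)                                   # two numeric features
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=400) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for k in [1, 5, 15, 50]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("K=%2d  train=%.3f  test=%.3f"
          % (k, knn.score(X_train, y_train), knn.score(X_test, y_test)))
# Small K: near-perfect training accuracy but a wiggly boundary (high variance).
# Large K: a smoother, more biased boundary, so both accuracies sag (high bias).
```
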

Resources:


Class 4: Model Evaluation and Logistic Regression

Agenda:

Homework:

Resources:


Class 5: Linear Regression

Agenda:

  • Project Status Updates
  • Linear Regression and Evaluation (slides, code; a brief sketch follows below)
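
A minimal sketch, on synthetic data rather than the class dataset, of fitting a linear regression with scikit-learn and evaluating it with RMSE:

```python
# A minimal sketch: fit a straight line to synthetic data and report RMSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10                        # one synthetic feature
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.5, size=100)

model = LinearRegression().fit(X, y)
print("intercept=%.2f  slope=%.2f" % (model.intercept_, model.coef_[0]))

rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
print("RMSE=%.2f" % rmse)                        # root mean squared error
```
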

Homework:

  • Your homework for this week is to continue to develop your project. April 2nd is the deadline for project topic changes.

Resources:


Class 6: Data Problem and Clustering and Visualization

  • Today we will work on a real-world data problem! Our data is seven months of stock data for a fictional company, ZYX, including Twitter sentiment, volume, and stock price. Our goal is to create a predictive model that predicts forward returns.

  • Today we will also be covering our first unsupervised machine learning algorithm, clustering. Our scope will be to explore the k-means algorithm (slides, code). In particular we will address:

    • What are the applications of cluster analysis?
    • How does the k-means algorithm work on a conceptual level?
    • How can we create our own k-means clustering routine in Python? (A minimal sketch follows this list.)
    • What are the different options for visualizing the output of k-means clustering?
    • How do we measure the quality of our cluster analysis and tune our modeling procedure? (additional code)
    • What are some of the limitations of cluster analysis using k-means? (additional code)
  • Project overview (documentation)

    • Be sure to read documentation thoroughly and ask questions! We may not have included all of the information you want...
    • Remember, the goal is prediction. We are given labeled data and we must build a supervised model in order to predict forward stock return. When building your models, be sure to use examples from previous classes to build and evaluate them.
    • Metrics are key! Be sure to know which metrics are relevant to the model you chose. For example, RMSE only makes sense for regression, and ROC/AUC only works for classification.
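
Here is the minimal sketch promised above: the k-means loop (Lloyd's algorithm) in plain NumPy on synthetic blobs. The class code and scikit-learn's KMeans are the fuller references.

```python
# A minimal sketch of the k-means loop in plain NumPy on synthetic blobs.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]      # random initial centers
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points.
        # (Assumes no cluster goes empty, which holds for these toy blobs.)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):              # converged
            break
        centers = new_centers
    return labels, centers

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])
labels, centers = kmeans(X, k=3)
print(centers)
```
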

Homework:

  • Read Paul Graham's A Plan for Spam and be prepared to discuss it in class next time. Here are some questions to think about while you read:
    • Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    • Before he tried the "statistical approach" to spam filtering, what was his approach?
    • How exactly does his statistical filtering system work?
    • What did Paul say were some of the benefits of the statistical approach?
    • How good was his prediction of the "spam of the future"?
  • Below are the foundational topics upon which Monday's class will depend. Please review these materials before class:
    • Confusion matrix: this guide roughly mirrors the lecture from class.
    • Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes). (A short sketch after this homework list shows how to compute both from a confusion matrix.)
    • Basics of probability: These slides are very good. Pay specific attention to these terms: probability, mutually exclusive, independent. You may also find videos of Sinan teaching similar ideas in the class videos section of Slack.
  • Complete the k-means clustering exercise on the UN dataset and submit a pull request to the GitHub repo. (homework, solutions)
  • Conduct k-means clustering on your own dataset and submit a pull request to the GitHub repository.
  • Download all of the NLTK collections.
    • In Python, use the following commands to bring up the download menu:
    • `import nltk`
    • `nltk.download()`
    • Choose "all".
    • Alternatively, just type `nltk.download('all')`.
  • Install two new packages: textblob and lda.
    • Open a terminal or command prompt.
    • Type `pip install textblob` and `pip install lda`.
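
For the confusion matrix and sensitivity/specificity review above, a minimal sketch with hypothetical labels, using scikit-learn's confusion_matrix:

```python
# A minimal sketch: sensitivity and specificity from a 2x2 confusion matrix,
# computed on hypothetical labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]     # hypothetical actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]     # hypothetical predicted classes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = float(tp) / (tp + fn)   # true positive rate
specificity = float(tn) / (tn + fp)   # true negative rate
print("sensitivity=%.2f  specificity=%.2f" % (sensitivity, specificity))
```
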

The deadline for topic changes to your final project is next week!

Resources:


Class 7 Part 1: Natural Language Processing

  • Overview of Natural Language Processing (slides)
  • Real World Examples
  • Natural Language Processing (code)
  • NLTK: tokenization, stemming, lemmatization, part-of-speech tagging, stopwords, Named Entity Recognition (Stanford NER Tagger), TF-IDF, LDA, document summarization (a brief sketch of a few of these steps follows this list)
  • Alternative: TextBlob
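
A minimal sketch of a few of the NLTK steps listed above, assuming the NLTK collections downloaded in the Class 6 homework:

```python
# A minimal sketch of tokenization, stemming, lemmatization, and POS tagging;
# assumes nltk.download('all') has already been run.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cats are running quickly through the gardens."

tokens = nltk.word_tokenize(text)                   # tokenization
print(tokens)

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])            # stemming: crude suffix stripping

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])    # lemmatization: dictionary-based

print(nltk.pos_tag(tokens))                         # part-of-speech tagging
```
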

Resources:

Class 7 Part 2: Naive Bayes

Resources:


Class 8: Trees and Ensembles

  • Briefly review ROC curves and confusion matrix terminology
  • Classification and Regression Trees (code, slides)
  • Brief Introduction to the IPython notebook
  • Ensemble Techniques (notebook; a minimal sketch follows this list)
    • Ensembling
    • Random Forests
    • Boosted Trees

Homework

  • Mandatory: You will be assigned to review the project drafts of two of your peers. See guidelines for feedback.
  • Optional: You should try to create your own Titanic Kaggle submission by building on the techniques we covered in class. Here are some ideas.

Resources

Classification and Regression Trees

  • Section 8.1 of An Introduction to Statistical Learning also covers the basics of Classification and Regression Trees.
  • The scikit-learn documentation has a nice summary of the strengths and weaknesses of Trees.
  • For those of you with a background in JavaScript, d3.js has a nice tree layout that would make for more presentable tree diagrams:
    • Here is a link to a static version, as well as a link to a dynamic version with collapsible nodes.
    • If this is something you are interested in, Gary Sieling wrote a nice function in Python to take the output of a scikit-learn tree and convert it into JSON format.
    • If you are interested in learning d3.js, this is a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
  • Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes an R code walkthrough

Ensemble Methods

  • Leo Breiman's paper on Random Forests
  • yhat has a brief primer on Random Forests that can provide a review of many of the topics we covered today.
  • Here is a link to some Kaggle competitions that were won using Random Forests
  • Chapter 10 of the Elements of Statistical Learning covers Boosting. See page 339 for the algorithm presented in class.
  • Dr. Justin Esarey has a nice tutorial on Boosting. Watch from 32:00 – 59:00 for relevant material.
  • Tutorial by Professor Rob Schapire of Princeton on the AdaBoost Algorithm

IPython Notebook

R

Class 9: Dimension Reduction (PCA) / Databases / MapReduce

Resources

  • PCA using the iris data set here and with 2 components here (a minimal sketch below mirrors this)
  • PCA step by step here
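
Mirroring the iris resources above, a minimal sketch of PCA with two components in scikit-learn:

```python
# A minimal sketch: project the 4-D iris measurements onto 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                               # (150, 2)
print(pca.explained_variance_ratio_)            # share of variance per component
```
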

Homework:

Class 10: Recommenders

  • Recommendation Engines slides
  • Recommendation Engine Example code (a minimal sketch follows)
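
A minimal sketch of one common approach, item-based collaborative filtering with cosine similarity, on a hypothetical ratings matrix (the slides and example code are the authoritative versions):

```python
# A minimal sketch: item-based collaborative filtering on a hypothetical
# user-by-item ratings matrix (0 = unrated).
import numpy as np

ratings = np.array([[5, 4, 0, 1],      # rows: users, columns: items
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
item_sim = ratings.T.dot(ratings) / np.outer(norms, norms)

# Predict user 0's rating of item 2 as a similarity-weighted average
# over the items that user has actually rated.
user, item = 0, 2
rated = ratings[user] > 0
weights = item_sim[item, rated]
prediction = np.dot(weights, ratings[user, rated]) / weights.sum()
print("predicted rating: %.2f" % prediction)
```
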

Resources:
