GitHub

DAT6 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (2/21/14 - 5/2/15). View student work in the student repository.

Instructors: Sinan Ozdemir and Josiah Davis.

Office hours: 5-7pm on Tuesday and 5-7pm on Saturday at General Assembly

Machine Learning Overview: See here for an overview of the machine learning models! Please only use this as a broad general overview and shouldn't be taken as law. There are some generalizations that I am making.

Course Project information

Saturday	Topic	Project Milestone
2/21:	Introduction / Pandas
2/28:	Git(hub) / Getting Data
3/7:	Advanced Pandas / Machine Learning	One Page Write-up with Data
π == τ/2 day	Model Evaluation / Logistic Regression
3/21:	Linear Regression	2-3 Minute Presentation
3/28:	Data Problem / Clustering and Visualization
4/2	Naive Bayes / Natural Language Processing	Deadline for Topic Changes
4/11	Trees / Ensembles	First Draft Due (Peer Review)
4/18	PCA / Databases / MapReduce
4/25	Recommendation Engines
5/2	Project Presentations	Presentation

Installation and Setup

Install the Anaconda distribution of Python 2.7x.
Install Git and create a GitHub account.
Once you receive an email invitation from Slack, join our "DAT6 team" and add your photo!

Class 1: Introduction and Pandas

Agenda:

Introduction to General Assembly
Course overview: our philosophy and expectations (slides)
Data science overview (slides)
Data Analysis in Python (code)
Tools: check for proper setup of Anaconda, overview of Slack

Homework:

Pandas Homework

Optional:

Review your base python (code)

Class 2: Git(hub) and Getting Data

Agenda:

Github: (slides)
Getting Data (slides)
Regular Expressions (code)
Getting Data (code)

Homework:

Complete the first Project Milestone (Submit on the Dat6-students repo via a pull request)

Resources:

Forbes: The Facebook Experiment
Hacking OkCupid
Videos on Git and GitHub. Created by one of our very own General Assembly Instructors, Kevin Markham.
Reference for common Git commands (created by Kevin Markham).
Solutions to last week's pandas homework assignment (code)

Class 3: Advanced Pandas and Machine Learning

Agenda:

Advanced pandas (code)
Iris exploration exercise (exercise, solutions)
Intro. to Machine Learning (slides, code)

Homework:

Complete the advanced Pandas homework (Submit on the Dat6-students repo via a pull request)
Continue to develop your project. If you have a dataset, explore it with pandas. If you don't have a dataset yet, you should prioritize getting the data.(Nothing to turn in for next week).
Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it next class. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
- In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
- In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
- In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
- How does the choice of K affect model bias? How about variance?
- As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
- Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
- Does a high value for K cause over-fitting or under-fitting?

Resources:

For more on Pandas plotting, read the visualization page from the official Pandas documentation.
To learn how to customize your plots further, browse through this notebook on matplotlib (long!) and check out the matplotlib documentation.
To explore different types of visualizations and when to use them, Columbia's Data Mining class has an excellent slide deck.
For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
To learn about NumPy check out this reference code

Class 4: Model Evaluation and Logistic Regression

Agenda:

The Bias-Variance Tradeoff
Model Evaluation Procedures (slides, code)
Logistic Regression (slides, exercise, solutions)
Model Evalutation Metrics (slides, code)

Homework:

Your Homework for this week is to prepare a 2-3 minute presenation and submit a pull request before class.

Resources:

For more on the ROC Curve / AUC, watch the video (14 minutes total) created by one of our very own GA instructors, Kevin Markham.
For more on logistic regression, watch the first three videos (30 minutes total) from Chapter 4 of An Introduction to Statistical Learning.
UCLA's IDRE has a handy table to help you remember the relationship between probability, odds, and log-odds.
Better Explained has a very friendly introduction (with lots of examples) to the intuition behind "e".
Here are some useful lecture notes on interpreting logistic regression coefficients.

Class 5: Linear Regression

Agenda:

Project Status Updates
Linear Regression and Evaluation (slides, code)

Homework:

Your Homework for this week is to continue to develop your project. April 2nd is the deadline for project changes.

Resources:

To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning. Alternatively, watch the related videos that covers the key points from that chapter.
The introduction to linear regression is much more mathmatical and thorough, and includes a lot of practical advice.
The aforementioned article has a particularly helpful section on the assumptions of linear regression.

Class 6: Data Problem and Clustering and Visualization

Today we will work on a real world data problem! Our data is stock data over 7 months of a fictional company ZYX including twitter sentiment, volume and stock price. Our goal is to create a predictive model that predicts forward returns.
Today we will also be covering our first unsupervised machine learning algorithm, clustering. Our scope will be to explore the kmeans algorithm (slides, code). In particular we will address:
- What are the applications of cluster analysis?
- How does the kmeans algorithm work on a conceptual level?
- How we can create our own kmeans clustering routine in python.
- What are the different options for visualizing the output of Kmeans clustering?
- How do we measure the quality of our cluster analysis and tune our modeling procedure? (additional code)
- What are some of the limitations of cluster analysis using Kmeans? (additional code)
Project overview (documentation)
- Be sure to read documentation thoroughly and ask questions! We may not have included all of the information you want...
- Remember, the goal is prediction. We are given labeled data and we must build a supervised model in order to predict forward stock return. When building your models, be sure to use examples from previous classes to build and evaluate them.
- Metrics are key! Be sure to know which metrics are relevant to the model you chose. For example RMSE only makes sense for regression and ROC/AUC only works for classification.

Homework:

Read Paul Graham's A Plan for Spam and be prepared to discuss it in class next time. Here are some questions to think about while you read:
- Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
- Before he tried the "statistical approach" to spam filtering, what was his approach?
- How exactly does his statistical filtering system work?
- What did Paul say were some of the benefits of the statistical approach?
- How good was his prediction of the "spam of the future"?
Below are the foundational topics upon which Monday's class will depend. Please review these materials before class:
- Confusion matrix: guide roughly mirrors the lecture from class.
- Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
- Basics of probability: These slides are very good. Pay specific attention to these terms: probability, mutually exclusive, independent. You may also find videos of Sinan teaching similar ideas in the class videos section of Slack.
Complete the kmeans clustering exercise on the UN dataset and submit a pull request to the GitHub repo. (homework, solutions)
Conduct kmeans clustering on your own dataset and submit a pull request to the GitHub repository
Download all of the NLTK collections.
- In Python, use the following commands to bring up the download menu.
- import nltk
- nltk.download()
- Choose "all".
- Alternatively, just type nltk.download('all')
Install two new packages: textblob and lda.
- Open a terminal or command prompt.
- Type pip install textblob and pip install lda.

Deadline for topic changes for your final project is next week!

Resources:

Introduction to Data Mining has a great chapter on cluster analysis.
The scikit-learn user guide has a section on clustering.

Class 7 Part 1: Natural Language Processing

Overview of Natural Language Processing (slides)
Real World Examples
Natural Language Processing (code)
NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition (Stanford NER Tagger), TF-IDF, LDA, document summarization
Alternative: TextBlob

Resources:

Natural Language Processing with Python: free online book to go in-depth with NLTK
NLP online course: no sessions are available, but video lectures and slides are still accessible
Brief slides on the major task areas of NLP
Detailed slides on a lot of NLP terminology
A visual survey of text visualization techniques: for exploration and inspiration
DC Natural Language Processing: active Meetup group
Stanford CoreNLP: suite of tools if you want to get serious about NLP
Getting started with regex: Python introductory lesson and reference guide, real-time regex tester, in-depth tutorials
SpaCy: a new NLP package

Class 7 Part 2: Naive Bayes

Briefly discuss A Plan for Spam
Probability and Bayes' theorem
- Slides part 1
- Visualization of conditional probability
Naive Bayes classification
- Slides part 2
- Example with spam email
- Airport security example
Naive Bayes classification in scikit-learn (code)
- Data set: SMS Spam Collection
- scikit-learn documentation: CountVectorizer, Naive Bayes

Resources:

The first part of the slides was adapted from Visualizing Bayes' theorem, which includes an additional example (using Venn diagrams) of how this applies to testing for breast cancer.
For an alternative introduction to Bayes' Theorem, Bayes' Rule for Ducks, this 5-minute video on conditional probability, or these slides on conditional probability may be helpful.
For more details on Naive Bayes classification, Wikipedia has two useful articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has an excellent Q&A.
If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.

Class 8: Trees and Ensembles

Briefly review ROC curves and Confusion Matrix Terminology
Classification and Regression Trees (code, slides)
Brief Introduction to the IPython notebook
Ensemble Techniques (notebook)
- Ensembling
- Random Forests
- Boosted Trees

Homework

Mandatory: You will be assigned to review the project drafts of two of your peers. See guidelines for feedback.
Optional: You should try to create your own Titanic Kaggle submission by building on the techniques we covered in class. Here are some ideas.

Resources

Classification and Regression Trees

Chapter 8.1 of An Introduction to Statistical Learning also covers the basics of Classification and Regression Trees
The scikit-learn documentation has a nice summary of the strengths and weaknesses of Trees.
For those of you with background in javascript, d3.js has a nice tree layout that would make more presentable tree diagrams:
- Here is a link to a static version, as well as a link to a dynamic version with collapsable nodes.
- If this is something you are interested in, Gary Sieling wrote a nice function in python to take the output of a scikit-learn tree and convert into json format.
- If you are intersted in learning d3.js, this a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes an R code walkthrough

Ensemble Methods

Leo Brieman's paper on Random Forests
yhat has a brief primer on Random Forests that can provide a review of many of the topics we covered today.
Here is a link to some Kaggle competitions that were won using Random Forests
Chapter 10 of the Elements of Statistical Learning covers Boosting. See page 339 for the algorithm presented in class.
Dr. Justin Esary has a nice tutorial on Boosting. Watch from 32:00 – 59:00 for relevant material.
Tutorial by Professor Rob Schapire of Princeston on the AdaBoost Algorithm

IPython Notebook

IPython documentation in website form and notebook form: does not focus exclusively on the IPython Notebook
IPython notebook keyboard shortcuts
IPython notebook viewer

R

I created a script that implements a Classification Tree in R
Here are some resources for helping you to learn R
- Intro to R put on by Google Developers (21 videos, 2-3 minutes each).
- Computing for Data Analysis created through Coursera (27 videos, 5-30 minutes each).
- Cheat sheets created by RStudio can be a helpful reference.
- Kevin Markham has a helpful video on Data Manipulation in R
- R in a Nutshell is a free e-book created by O'Reilly.
- The book An Introduction to Statistical Learning contains R code examples for the techniques we use in this class.
To learn more about R, contact one of us during office hours!

Class 9: PCA / Databases / MapReduce Dimension Reduction

PCA
- Slides
- Code: PCA and SVD
- Code: image compression with PCA (original source)
Mapreduce
- Slides
- Code: PCA and SVD

Resources

PCA using the iris data set here and with 2 components here
PCA step by step here

Homework:

Class 10: Recommenders

Recommendation Engines slides
Recommendation Engine Example code

Resources:

The Netflix Prize
Why Netflix never implemented the winning solution
Visualization of the Music Genome Project
The People Inside Your Machine (23 minutes) is a Planet Money podcast episode about how Amazon Mechanical Turks can assist with recommendation engines (and machine learning in general).

Name		Name	Last commit message	Last commit date
Latest commit History 173 Commits
code		code
data		data
homework		homework
images		images
notebooks		notebooks
other		other
slides		slides
.DS_Store		.DS_Store
README.md		README.md
peer_review.md		peer_review.md
project.md		project.md
public_data.md		public_data.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DAT6 Course Repository

Installation and Setup

Class 1: Introduction and Pandas

Class 2: Git(hub) and Getting Data

Class 3: Advanced Pandas and Machine Learning

Class 4: Model Evaluation and Logistic Regression

Class 5: Linear Regression

Class 6: Data Problem and Clustering and Visualization

Class 7 Part 1: Natural Language Processing

Class 7 Part 2: Naive Bayes

Class 8: Trees and Ensembles

Class 9: PCA / Databases / MapReduce Dimension Reduction

Class 10: Recommenders

About

Uh oh!

Releases

Packages

Languages

araveret/DAT6

Folders and files

Latest commit

History

Repository files navigation

DAT6 Course Repository

Installation and Setup

Class 1: Introduction and Pandas

Class 2: Git(hub) and Getting Data

Class 3: Advanced Pandas and Machine Learning

Class 4: Model Evaluation and Logistic Regression

Class 5: Linear Regression

Class 6: Data Problem and Clustering and Visualization

Class 7 Part 1: Natural Language Processing

Class 7 Part 2: Naive Bayes

Class 8: Trees and Ensembles

Class 9: PCA / Databases / MapReduce Dimension Reduction

Class 10: Recommenders

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages