Course materials for General Assembly's Data Science course in Washington, DC (2/21/15 - 5/2/15). View student work in the student repository.
Instructors: Sinan Ozdemir and Josiah Davis.
Office hours: 5-7pm on Tuesday and 5-7pm on Saturday at General Assembly
Machine Learning Overview: See here for an overview of the machine learning models! Please treat it only as a broad, general overview; it shouldn't be taken as law, since I am making some generalizations.
| Saturday | Topic | Project Milestone |
|---|---|---|
| 2/21 | Introduction / Pandas | |
| 2/28 | Git(hub) / Getting Data | |
| 3/7 | Advanced Pandas / Machine Learning | One Page Write-up with Data |
| 3/14 (π day) | Model Evaluation / Logistic Regression | |
| 3/21 | Linear Regression | 2-3 Minute Presentation |
| 3/28 | Data Problem / Clustering and Visualization | |
| 4/4 | Naive Bayes / Natural Language Processing | Deadline for Topic Changes |
| 4/11 | Trees / Ensembles | First Draft Due (Peer Review) |
| 4/18 | PCA / Databases / MapReduce | |
| 4/25 | Recommendation Engines | |
| 5/2 | Project Presentations | Presentation |
- Install the Anaconda distribution of Python 2.7.x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "DAT6 team" and add your photo!
Agenda:
- Introduction to General Assembly
- Course overview: our philosophy and expectations (slides)
- Data science overview (slides)
- Data Analysis in Python (code)
- Tools: check for proper setup of Anaconda, overview of Slack
Homework:
Optional:
- Review your base Python (code)
Agenda:
Homework:
- Complete the first Project Milestone (Submit on the Dat6-students repo via a pull request)
Resources:
- Forbes: The Facebook Experiment
- Hacking OkCupid
- Videos on Git and GitHub, created by one of our very own General Assembly instructors, Kevin Markham.
- Reference for common Git commands (created by Kevin Markham).
- Solutions to last week's pandas homework assignment (code)
Agenda:
- Advanced pandas (code)
- Iris exploration exercise (exercise, solutions)
- Intro. to Machine Learning (slides, code)
Homework:
- Complete the advanced Pandas homework (Submit on the Dat6-students repo via a pull request)
- Continue to develop your project. If you have a dataset, explore it with pandas. If you don't have a dataset yet, prioritize getting the data. (Nothing to turn in for next week.)
- Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it next class. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
- In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
- In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
- In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
- How does the choice of K affect model bias? How about variance?
- As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
- Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
- Does a high value for K cause over-fitting or under-fitting?
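To experiment beyond the article's interactive visualization, here is a minimal sketch (scikit-learn on a synthetic dataset, not the party-registration data from the article) showing how training and testing accuracy move as K changes:

```python
# How the choice of K affects training vs. testing accuracy.
# Synthetic data stands in for the article's party-registration example.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for k in [1, 5, 15, 50]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Low K: flexible boundary, high variance. High K: smooth boundary, high bias.
    print('K=%2d  train=%.3f  test=%.3f' % (k, knn.score(X_train, y_train),
                                            knn.score(X_test, y_test)))
```

With K=1 you should see near-perfect training accuracy but weaker testing accuracy; that gap is the over-fitting (high variance) end of the tradeoff.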
Resources:
- For more on Pandas plotting, read the visualization page from the official Pandas documentation.
- To learn how to customize your plots further, browse through this notebook on matplotlib (long!) and check out the matplotlib documentation.
- To explore different types of visualizations and when to use them, Columbia's Data Mining class has an excellent slide deck.
- For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
- To learn about NumPy, check out this reference code.
Agenda:
- The Bias-Variance Tradeoff
- Model Evaluation Procedures (slides, code)
- Logistic Regression (slides, exercise, solutions)
- Model Evaluation Metrics (slides, code)
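As a quick reference for the workflow these lessons cover, here is a minimal sketch (on made-up data, not the class dataset) tying together a train/test split, logistic regression, and two evaluation metrics:

```python
# Minimal model-evaluation sketch: train/test split, logistic regression,
# and two of the metrics covered in class (accuracy and ROC AUC).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

logreg = LogisticRegression().fit(X_train, y_train)
y_pred = logreg.predict(X_test)              # class predictions -> accuracy
y_prob = logreg.predict_proba(X_test)[:, 1]  # predicted probabilities -> AUC

print('accuracy: %.3f' % accuracy_score(y_test, y_pred))
print('AUC:      %.3f' % roc_auc_score(y_test, y_prob))
```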
Homework:
- Your homework for this week is to prepare a 2-3 minute presentation and submit a pull request before class.
Resources:
- For more on the ROC Curve / AUC, watch the video (14 minutes total) created by one of our very own GA instructors, Kevin Markham.
- For more on logistic regression, watch the first three videos (30 minutes total) from Chapter 4 of An Introduction to Statistical Learning.
- UCLA's IDRE has a handy table to help you remember the relationship between probability, odds, and log-odds.
- Better Explained has a very friendly introduction (with lots of examples) to the intuition behind "e".
- Here are some useful lecture notes on interpreting logistic regression coefficients.
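The conversions in the UCLA table are one-liners; here is a small sketch of moving between probability, odds, and log-odds (the scale on which logistic regression coefficients live):

```python
# Converting between probability, odds, and log-odds.
import numpy as np

p = 0.75
odds = p / (1 - p)       # 3.0: the event is 3 times as likely as not
log_odds = np.log(odds)  # ~1.0986: what the linear part of logistic regression predicts

# And back again, via the logistic (sigmoid) function:
p_again = np.exp(log_odds) / (1 + np.exp(log_odds))
print('odds=%.3f  log-odds=%.4f  p=%.2f' % (odds, log_odds, p_again))
```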
Agenda:
Homework:
- Your homework for this week is to continue to develop your project. April 4th is the deadline for topic changes.
Resources:
- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning. Alternatively, watch the related videos that cover the key points from that chapter.
- The introduction to linear regression is much more mathematical and thorough, and includes a lot of practical advice.
- The aforementioned article has a particularly helpful section on the assumptions of linear regression.
Today we will work on a real-world data problem! Our data is seven months of stock data for a fictional company, ZYX, including Twitter sentiment, trading volume, and stock price. Our goal is to create a predictive model that predicts forward returns.
Today we will also be covering our first unsupervised machine learning algorithm, clustering. Our scope will be to explore the kmeans algorithm (slides, code). In particular, we will address:
- What are the applications of cluster analysis?
- How does the kmeans algorithm work on a conceptual level?
- How can we create our own kmeans clustering routine in Python? (A sketch follows after this list.)
- What are the different options for visualizing the output of kmeans clustering?
- How do we measure the quality of our cluster analysis and tune our modeling procedure? (additional code)
- What are some of the limitations of cluster analysis using kmeans? (additional code)
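As a starting point for the write-your-own-routine question above, here is a bare-bones kmeans sketch in NumPy (random initialization and a fixed number of iterations; treat it as a teaching sketch, not a replacement for scikit-learn's KMeans):

```python
# A bare-bones kmeans sketch: assign each point to the nearest
# centroid, then move each centroid to the mean of its points.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Initialize centroids as k randomly chosen data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid (keep the old one if a cluster is empty)
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    return labels, centroids

X = np.random.rand(200, 2)
labels, centroids = kmeans(X, k=3)
```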
Project overview (documentation)
- Be sure to read the documentation thoroughly and ask questions! We may not have included all of the information you want...
- Remember, the goal is prediction. We are given labeled data and we must build a supervised model in order to predict forward stock return. When building your models, be sure to use examples from previous classes to build and evaluate them.
- Metrics are key! Be sure to know which metrics are relevant to the model you choose. For example, RMSE only makes sense for regression, and ROC/AUC only works for classification. (A short RMSE sketch follows below.)
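To make the metrics point concrete, here is a sketch of fitting and scoring a regression model with RMSE. The column names (sentiment, volume, fwd_return) are placeholders, not the actual fields in the ZYX dataset:

```python
# Sketch: a supervised regression model scored with RMSE.
# 'sentiment', 'volume', and 'fwd_return' are placeholder column names.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.DataFrame({'sentiment': np.random.randn(300),
                   'volume': np.random.rand(300),
                   'fwd_return': np.random.randn(300)})  # stand-in data

X_train, X_test, y_train, y_test = train_test_split(
    df[['sentiment', 'volume']], df['fwd_return'], random_state=1)

model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print('RMSE: %.4f' % rmse)
```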
Homework:
- Read Paul Graham's A Plan for Spam and be prepared to discuss it in class next time. Here are some questions to think about while you read:
- Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
- Before he tried the "statistical approach" to spam filtering, what was his approach?
- How exactly does his statistical filtering system work?
- What did Paul say were some of the benefits of the statistical approach?
- How good was his prediction of the "spam of the future"?
- Below are the foundational topics upon which next week's class will depend. Please review these materials before class:
- Confusion matrix: this guide roughly mirrors the lecture from class.
- Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
- Basics of probability: These slides are very good. Pay specific attention to these terms: probability, mutually exclusive, independent. You may also find videos of Sinan teaching similar ideas in the class videos section of Slack.
- Complete the kmeans clustering exercise on the UN dataset and submit a pull request to the GitHub repo. (homework, solutions)
- Conduct kmeans clustering on your own dataset and submit a pull request to the GitHub repository.
- Download all of the NLTK collections.
  - In Python, use the following commands to bring up the download menu: `import nltk`, then `nltk.download()`.
  - Choose "all".
  - Alternatively, just type `nltk.download('all')`.
- Install two new packages: `textblob` and `lda`.
  - Open a terminal or command prompt.
  - Type `pip install textblob` and `pip install lda`.
The deadline for topic changes to your final project is next week!
Resources:
- Introduction to Data Mining has a great chapter on cluster analysis.
- The scikit-learn user guide has a section on clustering.
- Overview of Natural Language Processing (slides)
- Real World Examples
- Natural Language Processing (code)
- NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition (Stanford NER Tagger), TF-IDF, LDA, document summarization
- Alternative: TextBlob
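For reference after class, here is a minimal sketch of a few of the NLTK steps listed above (it assumes you completed the `nltk.download()` homework from last week, since tokenization, stopwords, and lemmatization all rely on downloaded data):

```python
# A few of the NLTK building blocks from class: tokenization,
# stopword removal, stemming, and lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cats are running faster than the dogs ran yesterday."
tokens = nltk.word_tokenize(text.lower())

# Drop punctuation and common words like "the" and "are"
stops = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stops]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])         # crude suffix chopping
print([lemmatizer.lemmatize(t) for t in tokens]) # dictionary-based normalization
```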
Resources:
- Natural Language Processing with Python: free online book to go in-depth with NLTK
- NLP online course: no sessions are available, but video lectures and slides are still accessible
- Brief slides on the major task areas of NLP
- Detailed slides on a lot of NLP terminology
- A visual survey of text visualization techniques: for exploration and inspiration
- DC Natural Language Processing: active Meetup group
- Stanford CoreNLP: suite of tools if you want to get serious about NLP
- Getting started with regex: Python introductory lesson and reference guide, real-time regex tester, in-depth tutorials
- SpaCy: a new NLP package
- Briefly discuss A Plan for Spam
- Probability and Bayes' theorem
- Naive Bayes classification
- Slides part 2
- Example with spam email
- Airport security example
- Naive Bayes classification in scikit-learn (code)
- Data set: SMS Spam Collection
- scikit-learn documentation: CountVectorizer, Naive Bayes
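The core scikit-learn pattern from today's code looks roughly like this sketch, with toy messages standing in for the SMS Spam Collection:

```python
# Sketch of the CountVectorizer + Naive Bayes pattern,
# with toy messages standing in for the SMS Spam Collection.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ['WINNER! Claim your free prize now',
            'Are we still meeting for lunch?',
            'Free entry to win cash, text now',
            'See you at class on Saturday']
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Turn text into a document-term matrix of token counts
vect = CountVectorizer()
dtm = vect.fit_transform(messages)

nb = MultinomialNB().fit(dtm, labels)
print(nb.predict(vect.transform(['free prize, text to claim'])))
```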
Resources:
- The first part of the slides was adapted from Visualizing Bayes' theorem, which includes an additional example (using Venn diagrams) of how this applies to testing for breast cancer.
- For an alternative introduction to Bayes' Theorem, Bayes' Rule for Ducks, this 5-minute video on conditional probability, or these slides on conditional probability may be helpful.
- For more details on Naive Bayes classification, Wikipedia has two useful articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has an excellent Q&A.
- If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
- Briefly review ROC curves and Confusion Matrix Terminology
- Classification and Regression Trees (code, slides)
- Brief Introduction to the IPython notebook
- Ensemble Techniques (notebook)
- Ensembling
- Random Forests
- Boosted Trees
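For a quick reference, here is a minimal sketch (synthetic data rather than the Titanic dataset) contrasting a single decision tree with a random forest:

```python
# Sketch: a single decision tree vs. a random forest ensemble.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
# An ensemble of trees, each trained on a bootstrap sample of the data
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

print('tree:   %.3f' % tree.score(X_test, y_test))
print('forest: %.3f' % forest.score(X_test, y_test))
```

Averaging many deep, decorrelated trees typically reduces the variance that makes a single tree overfit, which is why the forest usually scores higher on the test set.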
Homework:
- Mandatory: You will be assigned to review the project drafts of two of your peers. See guidelines for feedback.
- Optional: You should try to create your own Titanic Kaggle submission by building on the techniques we covered in class. Here are some ideas.
Resources:
Classification and Regression Trees
- Chapter 8.1 of An Introduction to Statistical Learning also covers the basics of Classification and Regression Trees
- The scikit-learn documentation has a nice summary of the strengths and weaknesses of Trees.
- For those of you with a background in JavaScript, d3.js has a nice tree layout that would make for more presentable tree diagrams:
- Here is a link to a static version, as well as a link to a dynamic version with collapsible nodes.
- If this is something you are interested in, Gary Sieling wrote a nice function in Python to take the output of a scikit-learn tree and convert it into JSON format.
- If you are interested in learning d3.js, this is a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
- Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes an R code walkthrough
Ensemble Methods
- Leo Brieman's paper on Random Forests
- yhat has a brief primer on Random Forests that can provide a review of many of the topics we covered today.
- Here is a link to some Kaggle competitions that were won using Random Forests
- Chapter 10 of the Elements of Statistical Learning covers Boosting. See page 339 for the algorithm presented in class.
- Dr. Justin Esarey has a nice tutorial on Boosting. Watch from 32:00 to 59:00 for relevant material.
- Tutorial by Professor Rob Schapire of Princeton on the AdaBoost Algorithm
IPython Notebook
- IPython documentation in website form and notebook form: does not focus exclusively on the IPython Notebook
- IPython notebook keyboard shortcuts
- IPython notebook viewer
R
- I created a script that implements a Classification Tree in R
- Here are some resources for helping you to learn R
- Intro to R put on by Google Developers (21 videos, 2-3 minutes each).
- Computing for Data Analysis created through Coursera (27 videos, 5-30 minutes each).
- Cheat sheets created by RStudio can be a helpful reference.
- Kevin Markham has a helpful video on Data Manipulation in R
- R in a Nutshell is a free e-book created by O'Reilly.
- The book An Introduction to Statistical Learning contains R code examples for the techniques we use in this class.
- To learn more about R, contact one of us during office hours!
PCA
- Slides
- Code: PCA and SVD
- Code: image compression with PCA (original source)
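As a minimal reference for the PCA workflow, here is a sketch that reduces the four-dimensional iris data to two principal components with scikit-learn:

```python
# Sketch: reducing the 4-dimensional iris data to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(iris.data)

# How much of the original variance each component retains
print(pca.explained_variance_ratio_)
print(X_2d.shape)  # (150, 2)
```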
MapReduce
- Slides
- Code
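The canonical MapReduce example is a word count. Here is a sketch that simulates the map, shuffle, and reduce phases in plain Python (no Hadoop required), just to make the pattern concrete:

```python
# Sketch: the word-count MapReduce pattern simulated in plain Python.
from itertools import groupby
from operator import itemgetter

lines = ['the quick brown fox', 'the lazy dog', 'the quick dog']

# Map: emit a (word, 1) pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group pairs by key (Hadoop does this between the phases)
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each word
for word, group in groupby(mapped, key=itemgetter(0)):
    print('%s\t%d' % (word, sum(count for _, count in group)))
```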
Resources:
Homework:
Resources:
- The Netflix Prize
- Why Netflix never implemented the winning solution
- Visualization of the Music Genome Project
- The People Inside Your Machine (23 minutes) is a Planet Money podcast episode about how Amazon Mechanical Turks can assist with recommendation engines (and machine learning in general).