
ultratin/WS


Web Science Assignment

This project is a Web Science assignment for crawling Twitter and clustering the collected tweets. This README walks through setting up the project environment and running the programs.

Getting Started

Prerequisites

This program uses the following technologies.

  • Python 3.6
  • Jupyter Notebook
  • MongoDB

Packages needed

Several packages need to be installed before the program can start.

  • Tweepy for accessing the Twitter API
  • spaCy for Named Entity Recognition (NER) analysis of the tweets
  • NumPy, a prerequisite for other packages
  • SciPy, a prerequisite for other packages
  • scikit-learn for creating the bag of words and clustering
  • NLTK for lemmatization, tokenization and part-of-speech tagging
  • Matplotlib for plotting graphs
  • PRAW for accessing the Reddit API
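To give a sense of what the clustering step builds on, here is a minimal, stdlib-only sketch of the bag-of-words idea. The project itself uses scikit-learn for this; the function name and sample tweets below are purely illustrative.

```python
from collections import Counter

def bag_of_words(tweets):
    # Toy bag of words: one count vector per tweet over a shared vocabulary.
    # scikit-learn's CountVectorizer does this (and more) in the real pipeline.
    vocab = sorted({word for tweet in tweets for word in tweet.lower().split()})
    vectors = [
        [Counter(tweet.lower().split())[word] for word in vocab]
        for tweet in tweets
    ]
    return vocab, vectors

vocab, vectors = bag_of_words(["web science is fun", "science is science"])
# vocab   -> ['fun', 'is', 'science', 'web']
# vectors -> [[1, 1, 1, 1], [0, 1, 2, 0]]
```

The resulting count vectors are what a clustering algorithm such as k-means groups by similarity.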

Installing

Run the following command to install all packages

pip install tweepy spacy numpy scipy scikit-learn nltk matplotlib praw

spaCy requires a language model to run, which can be downloaded with the following command (on spaCy 3.x and later, download en_core_web_sm instead of the old en shortcut):

python -m spacy download en
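The downloaded model is what powers the NER analysis mentioned above. A minimal sketch of how it is typically used is shown below; the helper name is illustrative and not taken from the project's code.

```python
def extract_entities(text, model="en"):
    """Run spaCy NER over one piece of text and return (entity, label) pairs.

    `model` must match the model downloaded above; on spaCy 3.x pass
    "en_core_web_sm" instead of the old "en" shortcut.
    """
    import spacy  # imported here so the sketch loads even before setup
    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```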

Running the program

Please ensure you have API keys for Twitter and Reddit before running. You can enter the keys directly in the code or, alternatively, place them in the provided config.py. Instructions are as follows.

If using config.py, place your API keys in the fields below and skip the next two instructions.

# Twitter API keys
consumer_token = ""
consumer_secret = ""
access_token = ""
access_secret = ""

# Reddit API keys
reddit_id = ""
reddit_secret = ""
reddit_password = ""
reddit_user_agent = "crawling script for web science"
reddit_username = ""

If not using config.py, input the Twitter API keys in twitter.py at lines 18-21.

#API KEYS
consumer_token = config.consumer_token
consumer_secret = config.consumer_secret
access_token = config.access_token
access_secret = config.access_secret
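These keys are typically wired into Tweepy along the following lines. This is a sketch of the standard Tweepy OAuth flow, not a copy of the project's twitter.py, which may differ in detail.

```python
def connect_twitter(config):
    # Standard Tweepy OAuth flow; twitter.py's actual code may differ.
    import tweepy
    auth = tweepy.OAuthHandler(config.consumer_token, config.consumer_secret)
    auth.set_access_token(config.access_token, config.access_secret)
    return tweepy.API(auth)
```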

For Reddit, you will need your own account; provide the API key and account information at lines 31-35 and 75-79.

# Reddit connection
api = praw.Reddit(client_id=config.reddit_id,
                  client_secret=config.reddit_secret,
                  password=config.reddit_password,
                  user_agent=config.reddit_user_agent,
                  username=config.reddit_username)
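Once connected, the api object can pull submissions from a subreddit, for example. This is a hypothetical usage sketch using PRAW's subreddit listing API; the subreddit name is illustrative and the project's crawler may query differently.

```python
def crawl_subreddit(api, name="all", limit=10):
    # Sketch: fetch titles of hot submissions from one subreddit via PRAW.
    return [post.title for post in api.subreddit(name).hot(limit=limit)]
```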

To run the Twitter crawler, cd into the directory and run the command.

python twitter.py

To run the Twitter counter, open Jupyter with the command jupyter notebook and select the twitter_count.ipynb notebook.

To run the clustering, open Jupyter with the command jupyter notebook and select the clustering.ipynb notebook.

To run the Reddit crawler, cd into the directory and run the command.

python reddit_crawler.py

To run the Reddit counter, cd into the directory and run the command.

python count_reddit.py
