- Question definition
- Data collection
- Data annotation
- Data analysis
- Interpretation
- Communication
annotation_swift_data.csv - All articles and fields with MANUAL annotation of topics and sentiment
prediction.csv - Naive Bayes results
pos_tag_sentiment_analysis.csv - All articles and fields with COMPUTATIONAL sentiment score
shuffled_untokenized_article_swift_data.json - Main data file for 500 articles
formatted_swift_data.json - Original pull of 7.8K articles
10_most_distinctive_words_by_sentiment_by_category.json - TF-IDF values
- Each is separated into its respective folder
- data_collection_pipeline/datacollector.py - DONE - Pulls articles from a web API (for initial data collection)
- returns data_collection_pipeline/formatted_swift_data.json
- data_collection_pipeline/raw_data_preparation.py - DONE - Minor script to remove padding. The comments section describes the manual process that leads up to how this script is used.
- returns `data_collection_pipeline/formatted_swift_data.json`
- data_collection_pipeline/cleanfilter.py - DONE - Removes URL duplicates and any articles without 'Taylor Swift' in the title
- Returns 2984 valid articles
- returns data_collection_pipeline/formatted_swift_data.json
- data_collection_pipeline/priorityselector.py - DONE - Generates two data files: (1) articles from sources on the URL Reference List and (2) the remaining articles
- Based on the URL Reference List, we have:
- 336 articles sourced from a hostname in the list
- 2648 articles that are not sourced from a hostname in the list
- returns data_collection_pipeline/filtered_swift_data.json
- data_collection_pipeline/contentscraper.py - DONE - Follows each article's URL to pull the article content, for 500 articles
- returns shuffled_untokenized_article_swift_data.json
- data_collection_pipeline/createcsv.py - DONE
- returns data_collection_pipeline/formatted_swift_data.json - Creates a CSV file and individual text files for reading
- returns annotation_swift_data.csv
data_collection_pipeline/amend.py - DONE - Script that fixes the missing content files and amends data issues by resampling new article sources.
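
The filtering and splitting steps described for cleanfilter.py and priorityselector.py can be sketched roughly as follows. This is a minimal illustration, not the actual scripts: the `url` and `title` field names, the case-sensitive title match, the function names, and the sample data are all assumptions.

```python
from urllib.parse import urlparse


def clean_filter(articles, phrase="Taylor Swift"):
    """Drop articles with a repeated URL or without `phrase` in the title."""
    seen_urls = set()
    kept = []
    for article in articles:
        url = article.get("url")
        title = article.get("title") or ""
        if url in seen_urls or phrase not in title:
            continue
        seen_urls.add(url)
        kept.append(article)
    return kept


def split_by_hostname(articles, reference_hostnames):
    """Split articles into (sourced from the reference list, remaining)."""
    listed, remaining = [], []
    for article in articles:
        hostname = urlparse(article["url"]).hostname
        if hostname in reference_hostnames:
            listed.append(article)
        else:
            remaining.append(article)
    return listed, remaining


# Hypothetical sample records, for illustration only.
sample_articles = [
    {"url": "https://example.com/a", "title": "Taylor Swift announces tour"},
    {"url": "https://example.com/a", "title": "Taylor Swift announces tour"},
    {"url": "https://news.example.org/b", "title": "Weather update"},
    {"url": "https://news.example.org/c", "title": "Review: Taylor Swift live"},
]

kept = clean_filter(sample_articles)
listed, remaining = split_by_hostname(kept, {"example.com"})
print(len(kept), len(listed), len(remaining))
```

Matching on the parsed hostname rather than the full URL keeps the reference list short, since one entry covers every article from that source.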
get_annotation_stats.py - DONE
py get_annotation_stats.py -d ../data/annotation_swift_data.csv
- Returns num_articles_by_category_for_500_articles.json
compile_word_counts.py - DONE
py compile_word_counts.py -o ../results/word_count_with_frequency_threshold_of_5.json -d ../data/annotation_swift_data.csv -f 5
- Returns word_count_with_frequency_threshold_of_5.json and tokenized_words.txt
compute_tfidf.py - DONE
py compute_tfidf.py -c ../results/word_count_with_frequency_threshold_of_5.json -n 10
- Returns 10_most_distinctive_words_by_sentiment_by_category.json
journal.ipynb - DONE - Computes a Naive Bayes model for topics
- returns predictions.csv
positional_tag_sentiment_analysis.ipynb - DONE - Computes the sentiment score for each article using smooth TF-IDF and SentiWordNet
- returns pos_tag_sentiment_analysis.csv
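
As a rough illustration of the word-count and TF-IDF steps above (the `-f 5` frequency threshold and the `-n 10` top words per category), here is a minimal sketch. It is not the code in compile_word_counts.py or compute_tfidf.py: whitespace tokenization, applying the threshold to a word's total count across categories, the plain `tf * log(N / df)` weighting, and all function names are assumptions.

```python
import math
from collections import Counter


def compile_word_counts(docs_by_category, min_frequency=5):
    """Count words per category, keeping words whose total count
    across all categories is at least `min_frequency`."""
    counts = {
        category: Counter(word for doc in docs for word in doc.lower().split())
        for category, docs in docs_by_category.items()
    }
    totals = Counter()
    for category_counts in counts.values():
        totals.update(category_counts)
    return {
        category: {w: n for w, n in category_counts.items() if totals[w] >= min_frequency}
        for category, category_counts in counts.items()
    }


def top_tfidf_words(word_counts, n=10):
    """Rank each category's words by tf * log(N / df), where N is the number
    of categories and df is the number of categories containing the word."""
    num_categories = len(word_counts)
    doc_freq = Counter()
    for category_counts in word_counts.values():
        doc_freq.update(category_counts.keys())
    top_words = {}
    for category, category_counts in word_counts.items():
        scores = {
            w: tf * math.log(num_categories / doc_freq[w])
            for w, tf in category_counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top_words[category] = ranked[:n]
    return top_words


# Hypothetical toy input, for illustration only.
toy_docs = {
    "positive": ["great great show", "great tour"],
    "negative": ["bad show", "bad bad tour"],
}
toy_counts = compile_word_counts(toy_docs, min_frequency=2)
print(top_tfidf_words(toy_counts, n=1))
```

Words that appear in every category get a score of zero under this weighting, which is why shared words such as "show" and "tour" in the toy input never rank as distinctive.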
- For the purposes of this project, only the main files are provided, not every file.
- Chelsea Chisholm - chelsea.chisholm2@mail.mcgill.ca
- Chen - xi.chen20@mail.mcgill.ca
- Yu Hao Tian - yu.h.tian@mail.mcgill.ca
- Donald Szeto - donald.szeto@mail.mcgill.ca