yuhao239/NLP---Sentiment-Analysis

COMP 370 Fall 2023 Project

Roadmap

  • Question definition
  • Data collection
  • Data annotation
  • Data analysis
  • Interpretation
  • Communication

DATA Files

  • annotation_swift_data.csv
    • All articles and fields with MANUAL Annotation of Topics and Sentiment
  • prediction.csv
    • Naive Bayes Results
  • pos_tag_sentiment_analysis.csv
    • All Articles and fields with COMPUTATIONAL Sentiment Score
  • shuffled_untokenized_article_swift_data.json
    • Main Data file for 500 articles
  • formatted_swift_data.json
    • Original pull of 7.8K articles
  • 10_most_distinctive_words_by_sentiment_by_category.json
    • TF-IDF Values

SCRIPTS

  • Each separated into its respective folder

Data Collection Pipeline

  • data_collection_pipeline/datacollector.py - DONE

    • Pulls articles from a web API (for initial data collection)
    • returns data_collection_pipeline/formatted_swift_data.json
  • data_collection_pipeline/raw_data_preparation.py - DONE

    • Minor script to remove padding. Its comments section describes the manual process that leads up to running this script.
    • returns data_collection_pipeline/formatted_swift_data.json
  • data_collection_pipeline/cleanfilter.py - DONE

    • Removes URL duplicates and any articles without 'Taylor Swift' in the title
    • Returns 2984 valid articles
    • returns data_collection_pipeline/formatted_swift_data.json
  • data_collection_pipeline/priorityselector.py - DONE

    • Generates two data files: (1) Articles on URL Reference List and (2) Remaining articles
    • Based on URL Reference List, we have:
      • 336 articles sourced from a hostname in the list
      • 2648 articles that are not sourced from a hostname in the list
    • returns data_collection_pipeline/filtered_swift_data.json
  • data_collection_pipeline/contentscraper.py - DONE

    • Visits each article URL and pulls the article content, for 500 articles
    • returns shuffled_untokenized_article_swift_data.json
  • data_collection_pipeline/createcsv.py - DONE

    • returns data_collection_pipeline/formatted_swift_data.json
    • Creates a CSV file and individual text files for reading
    • returns annotation_swift_data.csv
  • data_collection_pipeline/amend.py - DONE
    • Fixes missing content files and amends data issues by resampling new article sources.
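
The filtering steps described for cleanfilter.py and priorityselector.py can be sketched roughly as below. This is a minimal illustration, not the scripts' actual code: the `url` and `title` field names, the case-insensitive title match, the stripping of a leading `www.`, and the sample hostnames are all assumptions.

```python
from urllib.parse import urlparse


def clean_filter(articles, required="Taylor Swift"):
    """Drop URL duplicates and articles whose title lacks `required`."""
    seen = set()
    kept = []
    for art in articles:
        url = art.get("url")          # assumed field name
        title = art.get("title") or ""  # assumed field name
        # Assumption: the title check is case-insensitive.
        if url in seen or required.lower() not in title.lower():
            continue
        seen.add(url)
        kept.append(art)
    return kept


def split_by_hostname(articles, reference_hosts):
    """Split articles into (on the reference list, remaining) by hostname."""
    on_list, remaining = [], []
    for art in articles:
        host = urlparse(art["url"]).hostname or ""
        if host.startswith("www."):   # assumption: ignore a leading "www."
            host = host[4:]
        if host in reference_hosts:
            on_list.append(art)
        else:
            remaining.append(art)
    return on_list, remaining


# Made-up sample records, for illustration only.
articles = [
    {"url": "https://www.example.com/a", "title": "Taylor Swift announces tour"},
    {"url": "https://www.example.com/a", "title": "Taylor Swift announces tour"},
    {"url": "https://news.example.org/b", "title": "Other pop news"},
    {"url": "https://news.example.org/c", "title": "Review: Taylor Swift live"},
]
valid = clean_filter(articles)
on_list, remaining = split_by_hostname(valid, {"example.com"})
```

Splitting on the parsed hostname rather than on a substring of the URL avoids accidental matches against paths or query strings.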

Data Annotation Analysis

  • get_annotation_stats.py - DONE
    • Usage: py get_annotation_stats.py -d ../data/annotation_swift_data.csv
    • Returns num_articles_by_category_for_500_articles.json
  • compile_word_counts.py - DONE
    • Usage: py compile_word_counts.py -o ../results/word_count_with_frequency_threshold_of_5.json -d ../data/annotation_swift_data.csv -f 5
    • Returns word_count_with_frequency_threshold_of_5.json and tokenized_words.txt
  • compute_tfidf.py - DONE
    • Usage: py compute_tfidf.py -c ../results/word_count_with_frequency_threshold_of_5.json -n 10
    • Returns 10_most_distinctive_words_by_sentiment_by_category.json
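
As a rough sketch of the last step, the snippet below ranks words per group by TF-IDF, given per-group word counts. It assumes the word-count JSON maps a group name to a `{word: count}` dictionary and uses the plain `count * log(N / df)` weighting; the layout and exact formula used by compute_tfidf.py may differ, and the sample counts are made up.

```python
import math


def top_tfidf(word_counts, n):
    """Return the n highest-scoring TF-IDF words for each group."""
    num_groups = len(word_counts)

    # Document frequency: in how many groups does each word appear?
    df = {}
    for counts in word_counts.values():
        for word in counts:
            df[word] = df.get(word, 0) + 1

    result = {}
    for group, counts in word_counts.items():
        scores = {
            word: count * math.log(num_groups / df[word])
            for word, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[group] = [word for word, _ in ranked[:n]]
    return result


# Made-up counts, for illustration only.
counts = {
    "positive": {"album": 6, "love": 9, "tour": 5},
    "negative": {"album": 7, "feud": 8},
    "neutral": {"album": 5, "tour": 6, "chart": 5},
}
top = top_tfidf(counts, 2)
```

A word that appears in every group (here "album") gets an IDF of zero, which is why it never ranks as distinctive despite its high raw count.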

Naive Bayes Topic Analysis

  • journal.ipynb - DONE
    • computes a Naive Bayes model for topics
    • returns predictions.csv
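
For readers unfamiliar with the technique, here is a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written in plain Python. It is only an illustration of the idea: the notebook's actual implementation is not shown here and may rely on a library, and the topic labels and sample texts below are invented.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and words per class."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior for `text`."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs non-zero.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data with hypothetical topic labels.
docs = [
    "eras tour tickets sold out",
    "new album tops the chart",
    "tour dates added for stadium shows",
    "album review praises songwriting",
]
labels = ["tour", "music", "tour", "music"]
model = train_nb(docs, labels)
pred_tour = predict_nb(model, "stadium tour tickets")
pred_music = predict_nb(model, "album chart")
```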

SentiWordNet Sentiment Analysis

  • positional_tag_sentiment_analysis.ipynb - DONE
    • computes the sentiment score for each article using smooth TF-IDF and SentiWordNet
    • returns pos_tag_sentiment_analysis.csv
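
The scoring idea can be sketched as a lexicon score weighted by smoothed TF-IDF. The snippet below makes several assumptions: `TOY_LEXICON` is a hypothetical stand-in for SentiWordNet's positive and negative scores, the smoothing is taken to be `ln((1 + N) / (1 + df)) + 1`, and the part-of-speech tagging implied by the notebook's name is omitted. The notebook itself may combine these pieces differently.

```python
import math
from collections import Counter

# Hypothetical stand-in for SentiWordNet: word -> (positive, negative).
TOY_LEXICON = {
    "great": (0.75, 0.0),
    "love": (0.625, 0.0),
    "bad": (0.0, 0.625),
    "boring": (0.0, 0.5),
}


def smooth_idf(docs_tokens):
    """Smoothed IDF, assumed here to be ln((1 + N) / (1 + df)) + 1."""
    n = len(docs_tokens)
    df = Counter()
    for toks in docs_tokens:
        df.update(set(toks))
    return {w: math.log((1 + n) / (1 + d)) + 1 for w, d in df.items()}


def sentiment_scores(docs):
    """Sum TF-IDF-weighted (positive - negative) scores per document."""
    tokens = [d.lower().split() for d in docs]
    idf = smooth_idf(tokens)
    scores = []
    for toks in tokens:
        tf = Counter(toks)
        total = 0.0
        for word, count in tf.items():
            pos, neg = TOY_LEXICON.get(word, (0.0, 0.0))
            total += count * idf[word] * (pos - neg)
        scores.append(total)
    return scores


# Invented sample texts, for illustration only.
docs = ["great show love it", "boring set bad sound", "the show went on"]
scores = sentiment_scores(docs)
```

A positive score suggests positive sentiment, a negative score negative sentiment, and a document with no lexicon words scores zero.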

Contact

About

COMP370 - Introduction to Data Science Final Project
