The goal is to build a machine learning model that collects historical stock news through an API or web scraping and performs sentiment analysis on the news headlines to predict how the news will affect the stock market. A secondary goal is to examine how other assets relate to the stock market.
To collect the historical stock news I chose the FinViz website. FinViz is a stock screener that provides stock information and market prices along with the news related to each stock. It also has up-to-date information on the performance of each sector, industry, and major stock index. Below is a quick look at how the stock news is arranged for each ticker, for example 'AMZN':
I used Python to parse this website and get the date, ticker, and headline of each news item over a period of time. I used libraries such as BeautifulSoup, requests, and json to parse the website and extract the news from the HTML, which looks like this:
As we can see, the news items are stored in a table with id='news-table', with one set of tags holding the time data and another holding the news headline. A similar method can be used to get the news URL for further use.
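A minimal sketch of this parsing step is shown here. It assumes the time sits in a `td` cell and the headline in an `a` link inside each row of `news-table`; those two tag names, the sample HTML, and the helper `parse_news_table` are illustrative assumptions, not the project's actual code.

```python
from bs4 import BeautifulSoup

# Hand-written sample standing in for a downloaded page (not real FinViz output).
SAMPLE_HTML = """
<table id="news-table">
  <tr><td>Jan-02-23 09:30AM</td><td><a href="https://example.com/a">Headline one</a></td></tr>
  <tr><td>10:15AM</td><td><a href="https://example.com/b">Headline two</a></td></tr>
</table>
"""


def parse_news_table(html):
    """Return (time_text, headline, url) tuples from the table with id='news-table'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="news-table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cell = tr.find("td")  # assumed: first cell holds the time text
        link = tr.find("a")   # assumed: the link holds the headline and URL
        if cell is None or link is None:
            continue  # skip rows that carry no headline
        rows.append((cell.get_text(strip=True), link.get_text(strip=True), link.get("href")))
    return rows


rows = parse_news_table(SAMPLE_HTML)
```

In the project the HTML would come from a `requests` call for the ticker's page; a fixed string is used here so the sketch runs offline.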
-
Preprocessing/Cleanup
For preprocessing, I iterate through all the tr tags from the previous step and extract the text for the date, time, ticker, and headline. I then format the result as a list of lists, one entry per headline, in the form [['ticker1', 'YYYY-MM-DD', 'HH:MMPM', 'News Headline1'], ['ticker1', 'YYYY-MM-DD', 'HH:MMPM', 'News Headline2']].
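The formatting step could look roughly like the sketch below. It assumes the scraped time text is either a date plus a time (such as 'Jan-02-23 09:30AM') or a time alone that inherits the date of the row above it; that input format and the helper `format_rows` are assumptions made for illustration.

```python
from datetime import datetime


def format_rows(ticker, raw_rows):
    """Turn (time_text, headline) pairs into [ticker, 'YYYY-MM-DD', 'HH:MMPM', headline]."""
    formatted = []
    current_date = None  # carried forward for rows that only show a time
    for time_text, headline in raw_rows:
        parts = time_text.split()
        if len(parts) == 2:
            # Assumed input format, e.g. 'Jan-02-23 09:30AM'
            current_date = datetime.strptime(parts[0], "%b-%d-%y").strftime("%Y-%m-%d")
            time_part = parts[1]
        else:
            time_part = parts[0]
        formatted.append([ticker, current_date, time_part, headline])
    return formatted


raw = [("Jan-02-23 09:30AM", "Headline one"), ("10:15AM", "Headline two")]
news = format_rows("AMZN", raw)
```

Carrying the date forward keeps every row self-describing, which makes it easier to group headlines by day later on.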
-
Sentiment Analysis
The first step after collecting the stock news was to perform sentiment analysis on the headlines and assign each one a compound score, where -1 is the most negative, 1 is the most positive, and 0 is neutral. For example, each 'AMZN' headline received its own compound sentiment score.
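To illustrate how a compound score in the range -1 to 1 can arise, here is a toy lexicon-based scorer. This write-up does not name the analyzer that was used, so everything below is an assumption: the tiny lexicon is made up, and the squashing function x / sqrt(x^2 + alpha) is one common normalization (VADER uses this form), not necessarily the one behind the project's scores.

```python
import math

# Tiny made-up lexicon for illustration only; a real analyzer uses a much larger one.
TOY_LEXICON = {
    "surges": 2.0, "beats": 1.5, "record": 1.0,
    "falls": -1.5, "lawsuit": -2.0, "misses": -1.5,
}


def compound_score(headline, alpha=15.0):
    """Sum word valences and squash the total into the open interval (-1, 1)."""
    words = headline.lower().split()
    total = sum(TOY_LEXICON.get(w.strip(".,!?:;'\""), 0.0) for w in words)
    return total / math.sqrt(total * total + alpha)


positive = compound_score("Amazon beats estimates, stock surges")
negative = compound_score("Amazon faces lawsuit as revenue misses")
neutral = compound_score("Amazon schedules annual meeting")
```

A headline with no lexicon words scores exactly 0, and stronger wording pushes the score toward either end without ever reaching -1 or 1.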
I used Yahoo Finance to download the historical stock prices, which include date, open_stock, close_stock, high_stock, low_stock, and volume_stock for each day.
To explore the relationship, I looked for a correlation between the stock price and the previous day's sentiment score.
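One way to check such a lagged relationship is sketched below with plain Python. The numbers are made up for illustration, and pairing the day-over-day price change (rather than the raw price) with the previous day's sentiment is an assumption about how the comparison might be set up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up daily values, ordered by date, purely for illustration.
sentiment = [0.10, -0.30, 0.50, 0.20, -0.10, 0.40]
close = [100.0, 101.0, 99.5, 102.0, 103.0, 102.5]

# Pair each day's price change with the sentiment of the day before.
price_change = [today - yesterday for yesterday, today in zip(close, close[1:])]
previous_sentiment = sentiment[:-1]

lagged_corr = pearson(previous_sentiment, price_change)
```

With the data in a pandas DataFrame, the same idea is usually expressed by shifting the sentiment column by one row before calling `corr`.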
Plotting the sentiment over time:
Plotting the stock price over the same time period:
-
Prediction using Machine Learning
After collecting the sentiment scores and historical price data, I predicted the future stock price using machine learning models. I used an ARIMA (Autoregressive Integrated Moving Average) model to make the predictions, giving it both the past values of the stock price and other features, namely the sentiment scores, as input.
Model 1: ARIMA Model Predictions:
Model 2: A neural network model with dropout regularization, trained using binary cross-entropy loss and the Adam optimizer with a learning rate of 0.001 for 500 epochs.
The accuracy of the model was about 68%.
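For reference, the loss and the accuracy metric mentioned above can be computed as in the sketch below. The labels and probabilities are made up, and encoding the target as 1 for "price went up" is an assumption; the write-up does not state what the two classes are.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up labels (1 = price went up, an assumed encoding) and model outputs.
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.4, 0.7, 0.6]

loss = binary_cross_entropy(labels, probs)
acc = accuracy(labels, probs)
```

Accuracy only counts which side of the threshold a prediction lands on, while the loss also penalizes confident wrong answers, so the two can move differently during training.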
- My computer's GPU was not the best one for web scraping; it would have been less time-consuming with a more powerful GPU.
- Train on more data and try other models such as LSTM, then compare the results with the ARIMA model.
- Work with the parameters to get more accurate predictions.
- Incorporate asset prices such as gold and silver to predict prices along with the sentiment from their news, and see whether there is a correlation between assets and stock news.
- modules:
- scrapestocknews.py: To scrape news from FinViz stock screener
- formatparseddata.py: To parse the collected HTML data to correct format of date, time, news, tickers
- sentimentanalysis.py: To do sentiment analysis on the news for all collected historical data
- eda_price_vs_date.py: Visualize stock price over time
- eda_sentiments_vs_date.py: Visualize sentiments over time
- pricediff_vs_sentiment.py: Visualize price difference and sentiment to see the trend
- model1: ARIMA model used to make predictions
- model2: Neural network model used to make predictions
- Packages used in notebook: numpy, pandas, matplotlib, tensorflow, sklearn, BeautifulSoup, requests, json, yfinance, seaborn, statsmodels.api








