Explore_Webscraper

Web scraping financial data to help in predicting bankruptcy.

Technical set-up

Ubuntu: 18.04
Python: 3.8.5
scrapy: 1.6.0
Firefox Browser: 85.0 (64-bit)
Gecko Driver: v0.28.0-linux64

Note on drivers

I use Firefox, so I used the Firefox browser along with the geckodriver version listed above. You can find more information about Selenium drivers here if you want to use a different driver. The download page for the driver I used is here (there may be an updated version by now).
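If you drive Firefox through Selenium yourself, a minimal sketch of starting a session with geckodriver looks like the following (Selenium 4 style API; the driver path is an assumption, point it at wherever you extracted the binary):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

options = Options()
options.add_argument("--headless")  # run Firefox without opening a window

# Hypothetical geckodriver location; adjust to your own install path.
service = Service(executable_path="/usr/local/bin/geckodriver")
driver = webdriver.Firefox(service=service, options=options)
try:
    driver.get("https://example.com")
    print(driver.title)
finally:
    driver.quit()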

Running the spiders

Make sure that when you run a spider you are two directories above the spiders folder, i.e. in the following directory:

*/Explore_Webscraper/explore_webscraper

Throttling

Throttling is used to limit the rate of requests to a single domain. If too many requests are sent, the server cannot handle them fast enough and starts returning errors, which leads to lost data. The throttling parameters can be found in the settings file. Throttling was used for all spiders.
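As a rough sketch, Scrapy's built-in throttling knobs live in settings.py and look like this (the values below are illustrative, not the ones used for this project):

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # maximum delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average number of parallel requests per server
DOWNLOAD_DELAY = 0.5                   # base delay between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # hard cap on parallel requests per domain

Lower delays speed up the crawl but make 503 errors more likely; tune these for your own network.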

Pipelines

Pipelines were used when collecting the financial and status information because they allow those spiders to run smoothly in parallel. When running the companies_filtering and board spiders, please ensure that ITEM_PIPELINES is commented out in the settings file. It must be enabled (uncommented) when running the driver file. Also make sure, when running the driver file, that the files you are writing to in the pipeline are correct for your setup.
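For reference, a Scrapy item pipeline is switched on through the ITEM_PIPELINES setting and implements process_item. A minimal sketch of a pipeline that appends items to a CSV file (the class name, module path and output file below are illustrative, not the ones used in this repository):

# In settings.py (commented out for companies_filtering and board, uncommented for the driver):
# ITEM_PIPELINES = {"explore_webscraper.pipelines.CsvWriterPipeline": 300}

import csv

class CsvWriterPipeline:
    def open_spider(self, spider):
        # One output file per spider; change the path to suit your setup.
        self.file = open(f"{spider.name}_items.csv", "a", newline="")
        self.writer = None

    def process_item(self, item, spider):
        row = dict(item)
        if self.writer is None:
            self.writer = csv.DictWriter(self.file, fieldnames=list(row.keys()))
            if self.file.tell() == 0:
                self.writer.writeheader()
        self.writer.writerow(row)
        return item

    def close_spider(self, spider):
        self.file.close()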

Company names and id collection

First disable the item pipelines by commenting out the ITEM_PIPELINES setting in the settings file. It is commented out by default.

Run the following command to start and run the companies_filtering.py spider:

$ scrapy crawl companies_filtering 

If you want to store your data in a file run the following command:

$ scrapy crawl companies_filtering -o path_to_file/file.(csv, json)

You need to specify either a .csv or .json file extension.

Financials and status collection

First enable the item pipelines by uncommenting the ITEM_PIPELINES setting in the settings file. Also ensure that throttling is enabled, and set the throttling parameters to values that maximise speed while minimising 503 errors on your network. Finally, make sure that the pipeline code is writing to the files you want to write to.

Use the following command to run the financials and status spiders in parallel:

$ python3 driver
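One way a single script can run two spiders in parallel is Scrapy's CrawlerProcess; a minimal sketch of that pattern is shown below (the spider names financials and status are assumptions based on the description above, and the actual driver file may differ):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings picks up the throttling and ITEM_PIPELINES settings.
process = CrawlerProcess(get_project_settings())
process.crawl("financials")  # assumed spider name
process.crawl("status")      # assumed spider name
process.start()              # runs both spiders concurrently and blocks until they finish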

Board collection

First disable the item pipelines by commenting out the ITEM_PIPELINES setting in the settings file. Also ensure that throttling is enabled, and set the throttling parameters to values that maximise speed while minimising 503 errors on your network.

Run the following command to start and run the board.py spider:

$ scrapy crawl board 

If you want to store your data in a file run the following command:

$ scrapy crawl board -o path_to_file/file.(csv, json)

You need to specify either a .csv or .json file extension.

Notebooks

Name and ID Collection Notebook

The notebook in this folder is used to merge the ID and name batch data into one full dataset containing the IDs and names of all the companies. It also filters out companies whose IDs contain XXXX, as most of them do not have financials.
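As an illustration of that merge-and-filter step, a sketch with pandas (the batch file pattern and the id column name are assumptions; the notebook itself may differ):

import glob
import pandas as pd

# Concatenate the per-batch name/ID files into one dataset.
batches = [pd.read_csv(path) for path in glob.glob("name_id_batch_*.csv")]
companies = pd.concat(batches, ignore_index=True).drop_duplicates()

# Drop companies whose ID contains XXXX, since most of them have no financials.
companies = companies[~companies["id"].astype(str).str.contains("XXXX")]
companies.to_csv("companies_full.csv", index=False)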

Cleaning Notebooks

These notebooks are used to merge and clean the financials, status and board datasets into the full feature dataset. There are a number of notebooks in this directory, but the ones labelled (1), (2) and (3) are the final notebooks used to clean the data, and they should be run in that order. The other notebooks were used in intermediate steps. Notebooks (1) and (2) create the dataset prior to adding the board information; notebook (3) adds the board data to produce the final dataset.

Modeling Notebooks

These notebooks were used to create some basic models and investigate which methods worked best. The three notebooks correspond to modeling done at slightly different stages of the data collection process. 'Testing methods on how to deal with unbalanced data.ipynb' was used to test how a default model performed on the unbalanced data, using only the status and financial information. 'Board_data_modeling.ipynb' was used to assess the predictive power of the board features in isolation from the financial and status information. 'Modeling with new features.ipynb' first assesses models run without board information and then with it, to see how much the board features add to the predictive power of the models.
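For context on the unbalanced-data experiments, one common baseline is to reweight the classes in an otherwise default classifier; a minimal sketch with scikit-learn (the file and column names are illustrative and not taken from the notebooks):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

data = pd.read_csv("final_dataset.csv")  # hypothetical cleaned dataset
X = data.drop(columns=["bankrupt"])      # hypothetical target column
y = data["bankrupt"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the rare (bankrupt) class during training.
model = RandomForestClassifier(class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))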
