The following links are helpful for the project:
- 10 minutes to Pandas
- Beautiful Soup
- Requests
- plotly
- MovieLens Dataset
- OMDB API
- Markdown Quick Tutorial
dataset01.csv and dataset02.csv consist of 27000 entries.
For this project, we filtered the dataset to the years 1990-2014, country USA, and language English, which leaves 10060 entries.
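The filter described above can be sketched with pandas roughly as follows. This is only an illustration: the column names (`imdbID`, `Year`, `Country`, `Language`) and the sample rows are assumptions, and the actual notebook may use different names or logic.

```python
import pandas as pd

# Tiny stand-in for the raw data. Column names and rows are assumptions
# made for this example, not taken from dataset01.csv or dataset02.csv.
movies = pd.DataFrame({
    "imdbID": ["tt01", "tt02", "tt02", "tt03", "tt04", "tt05"],
    "Year": [1995, 2001, 2001, 1985, 2010, 2014],
    "Country": ["USA", "USA", "USA", "USA", "France", "USA"],
    "Language": ["English", "English", "English", "English", "French", "English"],
})


def filter_movies(df, first_year=1990, last_year=2014,
                  country="USA", language="English"):
    """Keep rows in the year range (inclusive) for one country and language,
    then drop rows that repeat an ID."""
    mask = (
        df["Year"].between(first_year, last_year)
        & (df["Country"] == country)
        & (df["Language"] == language)
    )
    return df[mask].drop_duplicates(subset="imdbID").reset_index(drop=True)


filtered = filter_movies(movies)
print(filtered)
```

Dropping duplicates after filtering keeps the first occurrence of each ID, which is the pandas default.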
- Run filteringDataset.ipynb to filter the dataset and remove duplicate IDs. After executing, we get datasetWithoutBoxOffice.csv.
- Run extractBoxOffice.ipynb to extract the box office using the WebCrawl class in webcrawl.py. After executing, we get datasetWithBoxOffice.csv.
  Optional (but suggested): we made 10 copies of extractBoxOffice.ipynb, each handling 1000 entries, and then merged all the resulting CSVs with mergeCSV.ipynb to get datasetWithBoxOffice.csv.
  Alternatively, you can run extractBoxOfficeAllEntries.ipynb to extract the box office for all entries at once, but this takes a lot of time (hours).
- Run extractTicketInflationPrice.ipynb to extract the table of ticket price inflation by year. After executing, we get ticketPriceInflation.csv.
- Run adjustTicketPriceInflation.ipynb. After executing, we get finalDataset.csv.
- Run plotDataset1.ipynb and plotDataset2.ipynb to visualise the dataset.
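To give an idea of what the inflation adjustment step might look like, here is a minimal sketch. It assumes the adjustment rescales each box office figure by the ratio of a reference year's average ticket price to the release year's price; the README does not state the formula used in adjustTicketPriceInflation.ipynb, so treat this as one plausible approach. The prices, titles, and field names are made up for illustration.

```python
# Illustrative average ticket prices per year.
# These are made-up numbers, not values from ticketPriceInflation.csv.
ticket_price_by_year = {1990: 4.00, 2000: 5.00, 2014: 8.00}


def adjust_box_office(gross, release_year, prices, reference_year=2014):
    """Express a box office gross in reference-year ticket prices.

    Assumed formula: estimate tickets sold from the release-year price,
    then value those tickets at the reference-year price.
    """
    tickets_sold = gross / prices[release_year]
    return tickets_sold * prices[reference_year]


# Hypothetical entries; the field names are assumptions.
movies = [
    {"title": "Movie A", "year": 1990, "box_office": 100_000_000},
    {"title": "Movie B", "year": 2014, "box_office": 100_000_000},
]

for movie in movies:
    movie["adjusted_box_office"] = adjust_box_office(
        movie["box_office"], movie["year"], ticket_price_by_year
    )
    print(movie["title"], movie["adjusted_box_office"])
```

With these sample prices, a 1990 gross doubles when expressed in 2014 ticket prices, while a 2014 gross is unchanged.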
On Windows, use UTF-8 encoding when converting to CSV.
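As a rough sketch of why this matters: when no encoding is given, Python's `open()` uses the locale's default encoding, which on Windows is often a legacy code page, so non-ASCII titles can fail to write or come back garbled. The example below uses the standard `csv` module with an explicit encoding; the file name and rows are hypothetical, and the notebooks may write their CSVs differently.

```python
import csv
import os
import tempfile

# Hypothetical rows; the accented title is there to exercise non-ASCII text.
rows = [
    {"title": "Amélie", "year": 2001},
    {"title": "Movie B", "year": 2014},
]

# Hypothetical output path inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "example.csv")

# encoding="utf-8" overrides the platform default encoding.
# newline="" lets the csv module control line endings, which avoids
# blank lines between rows on Windows.
with open(out_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "year"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8", newline="") as f:
    read_back = list(csv.DictReader(f))

print(read_back)
```

Note that values read back from a CSV are strings, so the year comes back as "2001" rather than 2001.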
Images
- Snapshot of the final dataset
- One of the plots of the dataset

