
Conversation

@hechmik

@hechmik hechmik commented May 15, 2019

As part of a university project I needed to download data from vgchartz. I found your script really useful, but after a few downloads the requests were blocked. To solve this problem I've improved the code as follows:

  • Introduce a sleep time between requests: this way I was able to download all 55 pages without running into HTTP error 429
  • Use a random user agent for each request: the reason is the same as above
  • Add a requirements.txt file: this makes the script really easy to use for people working with virtual environments
  • Use logging instead of print statements: logs are displayed on stdout and written to a dedicated file
  • Split the code into multiple functions: this should make it easier to maintain and improve in the future
  • Read many parameters from a JSON configuration file, such as the range of pages to download, the output filename, the sleep-time range, and the application log filename
  • Make downloading the genre optional: this is the most expensive operation, because you currently need 56 requests to get most of the data but 55 000+ additional requests just to obtain the game genre. Whether to download it can be set in the JSON file
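The throttling idea behind the first two bullets can be sketched roughly as below. This is an illustrative sketch, not the PR's actual code: the function names, the user-agent strings, and the config keys are all hypothetical stand-ins for whatever the real configuration file defines.

```python
import json
import random
import time

# Hypothetical config mirroring the JSON parameters described above
CONFIG = json.loads("""{
    "start_page": 1,
    "end_page": 55,
    "min_sleep": 1.0,
    "max_sleep": 3.0,
    "include_genre": false
}""")

# A small pool of browser-like user agents (illustrative values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def random_headers():
    """Pick a different user agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_pause(min_sleep, max_sleep):
    """Sleep a random interval between requests and return the delay used."""
    delay = random.uniform(min_sleep, max_sleep)
    time.sleep(delay)
    return delay
```

Each page fetch would then send `random_headers()` with the request and call `polite_pause(CONFIG["min_sleep"], CONFIG["max_sleep"])` afterwards, which is what keeps the crawler under the server's rate limit.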

@Jenifael

Jenifael commented Mar 4, 2020

Thanks a lot for this fix. It's really well written.

FYI they switched from HTTP to HTTPS. To make it work, you just need to specify "https://www.vgchartz.com/game/" instead of "http://www.vgchartz.com/game/"

@hechmik
Author

hechmik commented Mar 4, 2020

I changed the link in the configuration file, it should be ok now :)

@Jenifael

Jenifael commented Mar 4, 2020

Actually it was for the vgchartzfull.py file:

    game_tags = list(filter(
        lambda x: x.attrs['href'].startswith('https://www.vgchartz.com/game/'),
        # discard the first 10 elements because those
        # links are in the navigation bar
        soup.find_all("a")
    ))[10:]

without it, game_tags would return nothing :)
Thanks again

@hechmik
Author

hechmik commented Mar 4, 2020

Thanks a lot, I forgot that bit!

@Pelirrojo

I've just worked with this fork, which is working fine at the time of writing.

@imirkin

imirkin commented Jan 2, 2021

FWIW I had to apply the following diff to get it to run:

diff --git a/vgchartz-full-crawler.py b/vgchartz-full-crawler.py
index 7c6c30c..47c2397 100644
--- a/vgchartz-full-crawler.py
+++ b/vgchartz-full-crawler.py
@@ -186,11 +186,11 @@ def download_data(*, start_page, end_page, include_genre):
 
         # We locate the game through search <a> tags with game urls in the main table
         game_tags = list(filter(
-            lambda x: x.attrs['href'].startswith('https://www.vgchartz.com/game/'),
+            lambda x: 'href' in x.attrs and x.attrs['href'].startswith('https://www.vgchartz.com/game/'),
             # discard the first 10 elements because those
             # links are in the navigation bar
             soup.find_all("a")
-        ))[10:]
+        ))
 
         for tag in game_tags:

Hope this helps someone.
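The first hunk of the diff guards against `<a>` tags that have no `href` attribute, which otherwise raise `KeyError('href')` (the error reported further down the thread); the second drops the old `[10:]` slice, presumably because the `startswith` filter already excludes the navigation-bar links, so the slice would now discard the first ten games. A minimal sketch of the guarded filter, using plain dicts as stand-ins for BeautifulSoup tags (the real script operates on `bs4` tag objects):

```python
GAME_PREFIX = 'https://www.vgchartz.com/game/'

# Stand-ins for BeautifulSoup <a> tags: each has an attrs dict that may lack 'href'
anchors = [
    {'attrs': {'href': 'https://www.vgchartz.com/game/123/foo/'}},  # a game link
    {'attrs': {}},                                    # no href -> KeyError without the guard
    {'attrs': {'href': 'https://www.vgchartz.com/'}}, # navigation link, filtered out
]

def is_game_link(tag):
    # 'href' in ... short-circuits, so tags without an href never reach startswith
    return 'href' in tag['attrs'] and tag['attrs']['href'].startswith(GAME_PREFIX)

game_tags = [t for t in anchors if is_game_link(t)]
```

Only the first anchor survives the filter; the hrefless tag no longer crashes the crawl.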

@skyye-99

Hi, after experimenting a little with older versions of this, I realized you have the most usable version by far. But I'm getting an error that I can't pin down, similar to one in the original version that was caused by a change to the website's layout. (I'm new to Python, so forgive any stupid questions.) The error I'm getting is "Unexpected error: (<class 'KeyError'>, KeyError('href'), <traceback object at 0x000002073555C040>)". Does anyone know why this is happening?

@Pelirrojo

Thanks @imirkin, with your fix in the lambda function it's working fine.
I've just run it with Python 3.7 (MiniConda) and it runs smoothly @huertaj2

Owner

@GregorUT left a comment


Thanks for your additions.
