
Conversation

@hechmik

@hechmik hechmik commented May 15, 2019

As part of a university project I needed to download data from vgchartz. I found your script really useful, but after a few downloads the requests were blocked. To solve this problem I've improved the code as follows:

  • Introduce a sleep time between requests: this way I was able to download all 55 pages without running into HTTP error 429
  • Use a random user agent for each request: the reason is the same as above
  • Add a requirements.txt file: this makes the script really easy to use for people working with virtual environments
  • Use logging instead of print statements: logs are displayed on stdout and written to a dedicated file
  • Split the code into multiple functions: this should make it easier to maintain and improve in the future
  • Read many parameters from a JSON configuration file, such as the range of pages to download, the output filename, the sleep-time range, and the application log filename
  • Make downloading the genre optional: this is the most expensive operation, because you currently need 56 requests to get most of the data but 55 000+ additional requests just to obtain the game genre. Whether to download it can be set in the JSON file
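The throttling idea behind the first two bullets can be sketched roughly as below. This is an illustrative sketch, not the PR's actual code: the function names, the user-agent strings, and the config keys are all hypothetical stand-ins for whatever the real configuration file defines.

```python
import json
import random
import time

# Hypothetical config mirroring the JSON parameters described above
CONFIG = json.loads("""{
    "start_page": 1,
    "end_page": 55,
    "min_sleep": 1.0,
    "max_sleep": 3.0,
    "include_genre": false
}""")

# A small pool of browser-like user agents (illustrative values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def random_headers():
    """Pick a different user agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_pause(min_sleep, max_sleep):
    """Sleep a random interval between requests and return the delay used."""
    delay = random.uniform(min_sleep, max_sleep)
    time.sleep(delay)
    return delay
```

Each page fetch would then send `random_headers()` with the request and call `polite_pause(CONFIG["min_sleep"], CONFIG["max_sleep"])` afterwards, which is what keeps the crawler under the server's rate limit.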

@Jenifael

Jenifael commented Mar 4, 2020

Thanks a lot for this fix. It's really well written.

FYI they switched from HTTP to HTTPS. To make it work, you just need to specify "https://www.vgchartz.com/game/" instead of "http://www.vgchartz.com/game/"

@hechmik
Author

hechmik commented Mar 4, 2020

I changed the link in the configuration file, it should be ok now :)

@Jenifael

Jenifael commented Mar 4, 2020

Actually it was for the vgchartzfull.py file:

    game_tags = list(filter(
        lambda x: x.attrs['href'].startswith('https://www.vgchartz.com/game/'),
        # discard the first 10 elements because those
        # links are in the navigation bar
        soup.find_all("a")
    ))[10:]

without it, game_tags would return nothing :)
Thanks again

@hechmik
Author

hechmik commented Mar 4, 2020

Thanks a lot, I forgot that bit!

@Pelirrojo

I've just worked with this fork, which is working fine at the time of writing.

@imirkin

imirkin commented Jan 2, 2021

FWIW I had to apply the following diff to get it to run:

diff --git a/vgchartz-full-crawler.py b/vgchartz-full-crawler.py
index 7c6c30c..47c2397 100644
--- a/vgchartz-full-crawler.py
+++ b/vgchartz-full-crawler.py
@@ -186,11 +186,11 @@ def download_data(*, start_page, end_page, include_genre):
 
         # We locate the game through search <a> tags with game urls in the main table
         game_tags = list(filter(
-            lambda x: x.attrs['href'].startswith('https://www.vgchartz.com/game/'),
+            lambda x: 'href' in x.attrs and x.attrs['href'].startswith('https://www.vgchartz.com/game/'),
             # discard the first 10 elements because those
             # links are in the navigation bar
             soup.find_all("a")
-        ))[10:]
+        ))
 
         for tag in game_tags:

Hope this helps someone.
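The first hunk of the diff guards against `<a>` tags that have no `href` attribute, which otherwise raise `KeyError('href')` (the error reported further down the thread); the second drops the old `[10:]` slice, presumably because the `startswith` filter already excludes the navigation-bar links, so the slice would now discard the first ten games. A minimal sketch of the guarded filter, using plain dicts as stand-ins for BeautifulSoup tags (the real script operates on `bs4` tag objects):

```python
GAME_PREFIX = 'https://www.vgchartz.com/game/'

# Stand-ins for BeautifulSoup <a> tags: each has an attrs dict that may lack 'href'
anchors = [
    {'attrs': {'href': 'https://www.vgchartz.com/game/123/foo/'}},  # a game link
    {'attrs': {}},                                    # no href -> KeyError without the guard
    {'attrs': {'href': 'https://www.vgchartz.com/'}}, # navigation link, filtered out
]

def is_game_link(tag):
    # 'href' in ... short-circuits, so tags without an href never reach startswith
    return 'href' in tag['attrs'] and tag['attrs']['href'].startswith(GAME_PREFIX)

game_tags = [t for t in anchors if is_game_link(t)]
```

Only the first anchor survives the filter; the hrefless tag no longer crashes the crawl.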

@skyye-99

Hi, after experimenting a little with older versions of this, I realized you have the most usable version by far. But I'm getting an error that I can't pin down, similar to one in the original version that was caused by a change to the website's layout. (I'm new to Python, so forgive any stupid questions.) The error I'm getting is "Unexpected error: (<class 'KeyError'>, KeyError('href'), <traceback object at 0x000002073555C040>)". Does anyone know why this is happening?

@Pelirrojo

Thanks @imirkin, with your fix in the lambda function it's working fine.
I've just run it with Python 3.7 (MiniConda) and it runs smoothly @huertaj2

Owner

@GregorUT left a comment


Thanks for your additions.
