@adeak adeak commented Jan 3, 2022

The profile pages on SO/SE have been completely rewritten (see announcement from December 7, 2021), which means much of this library has to be rewritten.

Since the profile pages are now an opaque mess of nested divs (starting to look a lot like Twitter's HTML), the easiest approach I could find was to look for divs with titles like this:

<div class="p12 bb bc-black-075" title="0 non-wiki questions (0 score). 70 non-wiki answers (898 score).">

Each tag on the tag page gets one of these divs, and its title already gives us the tag score. Inside the div there's an element whose text is the tag's name. I didn't want to rely on the random-looking strings in the class attribute, so I match on the title instead.
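To illustrate the idea, here is a minimal sketch of title-based scraping. It is not the code in this PR: it uses only the standard library's `html.parser`, the names `TITLE_RE`, `parse_title` and `TagScoreParser` are made up for the example, the title wording is inferred from the single div shown above, and it assumes the tag's name sits in an `<a>` element inside that div.

```python
import re
from html.parser import HTMLParser

# Wording inferred from the one example title above (assumption).
TITLE_RE = re.compile(
    r"(\d+) non-wiki questions? \((-?\d+) score\)\. "
    r"(\d+) non-wiki answers? \((-?\d+) score\)\."
)


def parse_title(title):
    """Return the four numbers in a matching title, or None if it doesn't match."""
    match = TITLE_RE.search(title or "")
    if match is None:
        return None
    questions, question_score, answers, answer_score = map(int, match.groups())
    return {
        "questions": questions,
        "question_score": question_score,
        "answers": answers,
        "answer_score": answer_score,
    }


class TagScoreParser(HTMLParser):
    """Map each tag name to the numbers found in its enclosing div's title."""

    def __init__(self):
        super().__init__()
        self.scores = {}
        self._pending = None   # parsed title of the div we are currently inside
        self._in_link = False  # True while inside an <a> within such a div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            parsed = parse_title(dict(attrs).get("title"))
            if parsed is not None:
                self._pending = parsed
        elif tag == "a" and self._pending is not None:
            # Assumption: the first link inside the div holds the tag's name.
            self._in_link = True

    def handle_data(self, data):
        if self._in_link and data.strip():
            self.scores[data.strip()] = self._pending
            self._pending = None
            self._in_link = False

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False


# Hypothetical markup: only the outer div's attributes come from the real page.
sample = (
    '<div class="p12 bb bc-black-075" '
    'title="0 non-wiki questions (0 score). 70 non-wiki answers (898 score).">'
    '<div><a href="/tags/python">python</a></div></div>'
)
parser = TagScoreParser()
parser.feed(sample)
print(parser.scores["python"]["answer_score"])  # 898
```

Nothing here depends on the class attribute, so a change to those class strings should not affect the match, although any change to the title wording would.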

I've also changed a handful of things (some of them stylistic):

  • Reorganise imports (standard library first, alphabetically), remove no-op returns.
  • Remove some of the modularity, because it's straightforward to scrape based on the title I showed above. This also means we don't know ahead of time how many pages there will be. We could add back additional logic for this if it's something we need.
  • Add self-throttling with a default interval of 1.5 seconds between requests. During testing with Jon Skeet's profile I started seeing 5-second and longer pushbacks from the server, and even timeout errors on occasion. I didn't want to add the complexity of doing retries with timeouts for the requests (see e.g. https://stackoverflow.com/a/35636367/5067311), so the manual wait seemed reasonable.
  • Remove the sorting keyword parameter from the function: this wasn't used in the original version either.
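
For reference, the self-throttling in the second-to-last bullet can be sketched roughly as follows. This is a simplified illustration rather than the exact code in the PR: the `Throttle` class and its `wait` method are hypothetical names, and the demo uses a much shorter interval than the 1.5-second default so that it finishes quickly.

```python
import time


class Throttle:
    """Make sure at least `interval` seconds pass between consecutive wait() calls."""

    def __init__(self, interval=1.5):
        self.interval = interval
        self._last = None  # monotonic timestamp of the previous call, if any

    def wait(self):
        if self._last is not None:
            remaining = self.interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Demo with a short interval; a real scraper would keep the 1.5 s default.
throttle = Throttle(interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()
    # ... the HTTP request for the next page would go here ...
elapsed = time.monotonic() - start
print(f"3 throttled iterations took {elapsed:.2f} s")
```

The first call returns immediately and only later calls sleep, so a single-page scrape pays no penalty. This does not retry failed requests, which matches the decision above to avoid that complexity.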

No doubt the company will add arbitrary small changes in a few weeks just to break scrapers like this. Until then this should work (even if slow due to the throttling/pushbacks).

@rayryeng

As of this date, I had to downgrade pillow to pillow==9.5.0. I also had to pin numpy to numpy==1.24.4 so that it works on macOS with M1/M2 chips, and I had to install lxml as well: lxml==5.2.2. Please consider updating the requirements.txt file in your PR accordingly.
