Rewrite most of the scraping logic for the revamped profile design #5

adeak · 2022-01-03T21:27:18Z

The profile pages on SO/SE have been completely rewritten (see announcement from December 7, 2021), which means much of this library has to be rewritten.

Since the profile pages are now an opaque mess of nested divs (starting to look a lot like Twitter's HTML), the easiest approach I could find was to look for divs with titles like this:

<div class="p12 bb bc-black-075" title="0 non-wiki questions (0 score). 70 non-wiki answers (898 score).">

Each tag on the tag page gets one of these divs, and the title already gives us the tag score. Inside the div there's an HTML element whose text is the tag's name. I didn't want to rely on those random-looking strings in the class attribute.
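As a rough illustration of the title-based approach (a sketch, not the code in this PR), the snippet below uses only the standard library to pick out divs whose title matches the wording of the example above and to read the counts and scores from it. The names `TITLE_RE`, `TagScoreParser` and `parse_tag_scores` are hypothetical, the regular expression is inferred from that single example title, and the allowance for thousands separators is a guess.

```python
import re
from html.parser import HTMLParser

# Pattern inferred from the single example title above (an assumption);
# commas are tolerated in case large numbers use thousands separators.
TITLE_RE = re.compile(
    r"([\d,]+) non-wiki questions? \((-?[\d,]+) score\)\. "
    r"([\d,]+) non-wiki answers? \((-?[\d,]+) score\)\."
)


def _to_int(text):
    """Convert '1,234' or '-5' to an int."""
    return int(text.replace(",", ""))


class TagScoreParser(HTMLParser):
    """Collect counts and scores from every div whose title matches TITLE_RE."""

    def __init__(self):
        super().__init__()
        self.scores = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # Attribute values can be None for valueless attributes.
        title = dict(attrs).get("title") or ""
        match = TITLE_RE.fullmatch(title)
        if match is None:
            return
        questions, q_score, answers, a_score = map(_to_int, match.groups())
        self.scores.append(
            {
                "questions": questions,
                "question_score": q_score,
                "answers": answers,
                "answer_score": a_score,
            }
        )


def parse_tag_scores(html):
    """Return one dict per matching div, in document order."""
    parser = TagScoreParser()
    parser.feed(html)
    return parser.scores
```

Matching on the title text alone means the class attribute is never inspected, so changes to those class names would not affect this kind of lookup. Extracting the tag's name from inside the div is left out here because it depends on the inner markup.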

I've also changed a handful of things (some of them stylistic):

Reorganise imports (standard library first, alphabetically), remove no-op returns.
Remove some of the modularity, because it's straightforward to scrape based on the title I showed above. This also means we don't know ahead of time how many pages there will be. We could add back additional logic for this if it's something we need.
Add self-throttling with a default interval of 1.5 seconds between requests. During testing with Jon Skeet's profile I started seeing pushbacks of 5 seconds and longer from the server, and even occasional timeout errors. I didn't want to add the complexity of doing retries with timeouts for the requests (see e.g. https://stackoverflow.com/a/35636367/5067311), so the manual wait seemed reasonable.
Remove the sorting keyword parameter from the function: this wasn't used in the original version either.
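For reference, the self-throttling idea from the list above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: the `Throttle` class and its `wait` method are hypothetical names, and the injectable `clock` and `sleep` arguments are only there so the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, interval=1.5, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough that calls are at least `interval` apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each HTTP request would space the requests out. Time already spent parsing the previous page counts toward the interval, so the scraper only sleeps for whatever remains.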

No doubt the company will add arbitrary small changes in a few weeks just to break scrapers like this. Until then this should work (even if slow due to the throttling/pushbacks).

rayryeng · 2024-06-10T06:37:15Z

As of this date, I had to downgrade pillow to pillow==9.5.0. I also had to pin numpy to numpy==1.24.4 so that it works on macOS with M1/M2 chips, and install lxml (lxml==5.2.2). Please consider updating the requirements.txt file in your PR accordingly.

Rewrite most of the scraping logic for the revamped profile design

8cbc2a2

rayryeng mentioned this pull request Jun 10, 2024

IndexError: list index out of range #6

Open
