Description
The current Delving Bitcoin scraper skips indexing when a document with the same ID already exists. This means updates to existing posts are not accounted for.
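The difference between the skip-if-exists check and an update-aware check can be sketched roughly as follows. This is only an illustration: the `needs_indexing` helper and the idea of storing an `updated_at` field alongside each indexed document are assumptions, not part of the current scraper.

```python
from typing import Optional


def needs_indexing(existing: Optional[dict], incoming: dict) -> bool:
    """Decide whether an incoming post should be written to the index.

    `existing` is the document already stored under the same ID (or None),
    `incoming` is the freshly scraped post. Both are assumed to carry an
    `updated_at` timestamp string (hypothetical field in our index).
    """
    if existing is None:
        # New post: always index it.
        return True
    # Existing post: re-index only if its timestamp changed.
    return incoming.get("updated_at") != existing.get("updated_at")


# In-memory stand-in for the index, keyed by document ID.
index = {"post-1": {"id": "post-1", "updated_at": "2024-01-01T00:00:00.000Z"}}
edited = {"id": "post-1", "updated_at": "2024-02-01T00:00:00.000Z"}

if needs_indexing(index.get(edited["id"]), edited):
    index[edited["id"]] = edited  # overwrite instead of skipping
```

With a check like this, an edited post replaces the stored copy, while an unchanged post is still skipped.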
When adding Delving Bitcoin to scraperv2, we should address this issue and consider the following approaches:
Approach 1: Leverage Metadata Stored in ES Index
James' discourse-archive uses an on-disk .metadata.json file to store the last_sync_date, allowing it to avoid syncing from scratch each time. Currently, we are not using this functionality. Each GitHub Action workflow syncs everything from the beginning, which is manageable since Delving Bitcoin's history is relatively short.
In scraperv2, metadata from scrape jobs is stored in our Elasticsearch (ES) index. We can take advantage of this existing functionality to manage last_sync_date and only sync new or updated posts. However, this may require modifications to James' initial code to integrate this metadata mechanism.
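A minimal sketch of the incremental selection this approach implies, assuming each post carries an ISO 8601 `updated_at` timestamp and that `last_sync_date` has already been read from the stored scrape metadata. The function names are hypothetical and do not exist in scraperv2 or discourse-archive.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def posts_to_sync(posts: Iterable[dict],
                  last_sync_date: Optional[datetime]) -> List[dict]:
    """Return posts created or updated after the last sync.

    With no previous sync (None), everything is returned.
    """
    if last_sync_date is None:
        return list(posts)
    return [p for p in posts if parse_ts(p["updated_at"]) > last_sync_date]


def next_sync_date(posts: Iterable[dict],
                   previous: Optional[datetime]) -> Optional[datetime]:
    """Compute the value to store as last_sync_date after a run."""
    stamps = [parse_ts(p["updated_at"]) for p in posts]
    if previous is not None:
        stamps.append(previous)
    return max(stamps) if stamps else None


posts = [
    {"id": 1, "updated_at": "2024-01-01T10:00:00.000Z"},
    {"id": 2, "updated_at": "2024-03-01T10:00:00.000Z"},
]
last_sync = datetime(2024, 2, 1, tzinfo=timezone.utc)
changed = posts_to_sync(posts, last_sync)      # only post 2
new_marker = next_sync_date(changed, last_sync)
```

Storing the newest `updated_at` seen, rather than the wall-clock time of the run, avoids missing posts that are edited while a sync is in progress.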
Approach 2: Use a Custom Delving Bitcoin Archive
Another option is to stop running discourse-archive on every scraping run. Instead, we could maintain our own instance of the delving-bitcoin-archive and read the necessary data directly from a GitHub repository. This would be similar to how we handle data in the GitHubMetadataScraper.
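As a rough sketch of what reading from such an archive could look like, assuming the repository has been cloned locally and stores one JSON file per post under a `posts/` directory. That layout and the `iter_archived_posts` helper are assumptions for illustration; the actual archive structure should be checked first.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_archived_posts(archive_root: Path) -> Iterator[dict]:
    """Yield each archived post as a dict.

    Assumes a hypothetical layout: <archive_root>/posts/**/*.json,
    one post per file.
    """
    posts_dir = archive_root / "posts"
    for path in sorted(posts_dir.rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            yield json.load(fh)
```

A scraper run would then iterate over `iter_archived_posts(Path("path/to/clone"))` after pulling the latest commit, instead of calling the Discourse API itself.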
Related
- The current code in delvingbitcoin_2_elasticsearch/achieve.py is extracted from jamesob/discourse-archive.
- https://delvingbitcoin.org/t/public-archive-for-delving-bitcoin/87/5
- https://meta.discourse.org/t/fetch-all-posts-from-a-topic-using-the-api/260886
- https://h-rd.org/archiving-a-discourse-forum/