Delving Bitcoin Scraper Ignores Updates to Existing Posts #97

The current Delving Bitcoin scraper skips indexing when a document with the same ID already exists. As a result, edits to posts that are already indexed never reach the index.
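To illustrate the alternative behavior, here is a minimal sketch that re-indexes a post when its timestamp is newer than the stored copy, instead of skipping on ID alone. It assumes each post carries an ISO 8601 `updated_at` field and uses a plain dict as a stand-in for the index. The names `parse_ts`, `should_index` and `upsert` are hypothetical and not part of the current scraper.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def should_index(existing: Optional[dict], incoming: dict) -> bool:
    """Return True when the post is new or newer than the stored copy."""
    if existing is None:
        return True
    # Assumption: both documents carry an `updated_at` field.
    return parse_ts(incoming["updated_at"]) > parse_ts(existing["updated_at"])


def upsert(index: dict, doc_id: str, incoming: dict) -> bool:
    """Write the post into the stand-in index if needed; report whether it was written."""
    if should_index(index.get(doc_id), incoming):
        index[doc_id] = incoming
        return True
    return False
```

With this check, an unchanged post is still skipped, so the cost of re-running the scraper stays close to what it is today.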

When adding Delving Bitcoin to `scraperv2`, we should address this issue. Two possible approaches:

#### Approach 1: Leverage Metadata Stored in ES Index
James' `discourse-archive` uses an on-disk `.metadata.json` file to store the `last_sync_date`, which lets it avoid syncing from scratch each time. We do not currently use this functionality: each GitHub Actions workflow run syncs everything from the beginning, which is manageable because Delving Bitcoin's history is relatively short.

In `scraperv2`, metadata from scrape jobs is stored in our Elasticsearch (ES) index. We can use this existing functionality to manage `last_sync_date` and sync only new or updated posts. However, this may require modifying James' original code so that it reads and writes `last_sync_date` through this metadata mechanism.
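A rough sketch of the incremental sync this approach describes, under stated assumptions: the `metadata` dict stands in for whatever `scraperv2` stores in ES, posts expose an ISO 8601 `updated_at` field, and `posts_to_sync` and `run_sync` are hypothetical names rather than existing functions.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def posts_to_sync(posts: Iterable[dict], last_sync_date: Optional[datetime]) -> List[dict]:
    """Keep only posts changed after the last sync; keep everything on a first run."""
    if last_sync_date is None:
        return list(posts)
    return [p for p in posts if parse_ts(p["updated_at"]) > last_sync_date]


def run_sync(posts: Iterable[dict], metadata: dict, now: Optional[datetime] = None) -> List[dict]:
    """Return the posts that need indexing and advance `last_sync_date`."""
    # Capture the start time first, so edits made while the run is in
    # progress are picked up by the next run.
    started = now or datetime.now(timezone.utc)
    last = metadata.get("last_sync_date")
    changed = posts_to_sync(posts, parse_ts(last) if last else None)
    metadata["last_sync_date"] = started.isoformat()
    return changed
```

In a real integration the filtering would ideally happen when fetching from the forum rather than after the fact. The sketch only shows how the stored date gates what gets indexed.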

#### Approach 2: Use a Custom Delving Bitcoin Archive
Another option is to forgo running `discourse-archive` on every scraping run. Instead, we could maintain our own instance of the [delving-bitcoin-archive](https://github.com/jamesob/delving-bitcoin-archive) and read the necessary data directly from that GitHub repository. This would be similar to how we handle data in the [`GitHubMetadataScraper`](https://github.com/bitcoinsearch/scraper/blob/master/scraper/scrapers/github_metadata.py).
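As a sketch of what reading from such an archive could look like, the snippet below walks a local checkout and yields every archived post. The layout is an assumption (one `.json` file per post somewhere under the checkout), and `iter_archived_posts` is a hypothetical helper, not something `GitHubMetadataScraper` provides.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_archived_posts(archive_root: Path) -> Iterator[dict]:
    """Yield each post stored as a JSON file under a local archive checkout."""
    for path in sorted(archive_root.rglob("*.json")):
        # Skip dotfiles such as `.metadata.json`, which hold sync state
        # rather than post content.
        if path.name.startswith("."):
            continue
        with path.open(encoding="utf-8") as fh:
            yield json.load(fh)
```

Posts read this way would still need the same new-or-updated check before indexing. The archive only replaces the step of fetching from the forum.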

**Related**
- The current code in `delvingbitcoin_2_elasticsearch/achieve.py` is extracted from [jamesob/discourse-archive](https://github.com/jamesob/discourse-archive).
- https://delvingbitcoin.org/t/public-archive-for-delving-bitcoin/87/5
- https://meta.discourse.org/t/fetch-all-posts-from-a-topic-using-the-api/260886
- https://h-rd.org/archiving-a-discourse-forum/
