Delving Bitcoin Scraper Ignores Updates to Existing Posts #97

@kouloumos

The current Delving Bitcoin scraper skips indexing whenever a document with the same ID already exists in the index. As a result, edits made to a post after it was first indexed are never picked up.
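To make the difference concrete, here is a minimal sketch. It assumes an in-memory dict standing in for the ES index and an `updated_at` timestamp on each post; the function names are illustrative and are not the scraper's actual code.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Normalize a trailing "Z" so datetime.fromisoformat accepts the timestamp
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def index_skip_existing(index: dict, post: dict) -> bool:
    """Behavior described above: never touch a document whose ID is already indexed."""
    if post["id"] in index:
        return False
    index[post["id"]] = post
    return True


def index_with_updates(index: dict, post: dict) -> bool:
    """Possible alternative: reindex when the incoming post is newer than the stored copy."""
    existing = index.get(post["id"])
    if existing is not None and parse_ts(post["updated_at"]) <= parse_ts(existing["updated_at"]):
        return False  # stored copy is already up to date
    index[post["id"]] = post
    return True
```

With the first function an edited post is silently dropped; with the second, the stale copy is overwritten only when the timestamp has advanced.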

When adding Delving Bitcoin to scraperv2, we should address this issue and consider the following approaches:

Approach 1: Leverage Metadata Stored in ES Index

James' discourse-archive stores a last_sync_date in an on-disk .metadata.json file, which lets it avoid syncing from scratch each time. We are not currently using this functionality: each GitHub Actions workflow run syncs everything from the beginning. That is manageable for now, since Delving Bitcoin's history is relatively short.

In scraperv2, metadata from scrape jobs is stored in our Elasticsearch (ES) index. We can take advantage of this existing functionality to manage last_sync_date and only sync new or updated posts. However, this may require modifications to James' initial code to integrate this metadata mechanism.
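A rough sketch of what this could look like, under stated assumptions: `MetadataStore` is a hypothetical stand-in for the scrape-job metadata kept in ES, `index` is a plain dict, and each post carries an `updated_at` datetime. None of these names come from scraperv2 or discourse-archive.

```python
from datetime import datetime, timezone


class MetadataStore:
    """Hypothetical stand-in for scrape-job metadata stored in the ES index."""

    def __init__(self):
        self._data = {}

    def get_last_sync(self, source: str):
        return self._data.get(source)

    def set_last_sync(self, source: str, when: datetime) -> None:
        self._data[source] = when


def sync(source, posts, store, index, now=None):
    """Index only posts created or updated since the last recorded sync."""
    # Capture the sync time up front so edits made during the run are caught next time
    now = now or datetime.now(timezone.utc)
    last_sync = store.get_last_sync(source)
    changed = [p for p in posts if last_sync is None or p["updated_at"] > last_sync]
    for post in changed:
        index[post["id"]] = post  # overwrite, so edits replace the stale copy
    store.set_last_sync(source, now)
    return [p["id"] for p in changed]
```

The first run has no last_sync_date and therefore indexes everything; later runs only touch posts whose timestamp is newer than the stored value.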

Approach 2: Use a Custom Delving Bitcoin Archive

Another option is to forgo running discourse-archive on every scraping run. Instead, we could maintain our own instance of the delving-bitcoin-archive and read the necessary data directly from a GitHub repository. This would be similar to how we handle data in the GitHubMetadataScraper.
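As an illustration only, reading from a local checkout of such an archive might look like the sketch below. The layout is an assumption (one JSON file per post, each with an `id` field), and `load_archived_posts` is a hypothetical helper, not part of delving-bitcoin-archive or the GitHubMetadataScraper.

```python
import json
from pathlib import Path


def load_archived_posts(archive_dir):
    """Yield post dicts from every *.json file under a local archive checkout."""
    for path in sorted(Path(archive_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            record = json.load(fh)
        # Skip files that are not post records (for example a sync-state file)
        if isinstance(record, dict) and "id" in record:
            yield record
```

The scraper would then only need to pull the repository and feed the yielded posts into the indexing step, leaving the Discourse crawl to the archive's own sync job.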
