Switch DataLoader from HuggingFace to R2 #49

@KMFODA


Once `feature/diloco` is merged, we will be using a DataLoader that takes the current blockchain block and the miner's uid as a seed, and uses that seed to pick a random subset of the data through HuggingFace's DataLoader.
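To make the seeding idea concrete, here is a minimal sketch of how a (block, uid) pair could be turned into a reproducible subset selection. The function names `derive_seed` and `select_shards`, the hashing scheme, and the notion of numbered shards are all assumptions for illustration; the `feature/diloco` branch may do this differently.

```python
import hashlib
import random
from typing import List


def derive_seed(block: int, uid: int) -> int:
    """Hash (block, uid) into a 64-bit integer seed (hypothetical scheme)."""
    digest = hashlib.sha256(f"{block}:{uid}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def select_shards(block: int, uid: int, n_shards: int, k: int) -> List[int]:
    """Pick k distinct shard indices, reproducibly for a given (block, uid)."""
    # A private Random instance avoids touching the global RNG state.
    rng = random.Random(derive_seed(block, uid))
    return rng.sample(range(n_shards), k)
```

Because the selection depends only on the block and the uid, anyone who knows both can recompute which subset a miner was supposed to load, regardless of where the data is hosted.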

Given ongoing issues with HuggingFace's dataset API, it is worth investing in a DataLoader that loads the data in the same format from an R2 instance where we host the data ourselves.

Templar does this in a very clean way here: https://github.com/tplr-ai/templar/blob/main/src/tplr/r2_dataset.py, which can be used as inspiration.
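For orientation, a rough sketch of reading a single parquet shard from R2 through its S3-compatible endpoint, using `boto3` and `pyarrow`. This is not Templar's implementation. The key layout in `shard_key`, the environment variable names, and the helper names are all hypothetical placeholders.

```python
import io
import os


def r2_endpoint(account_id: str) -> str:
    """S3-compatible endpoint for a Cloudflare R2 account."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


def shard_key(prefix: str, index: int) -> str:
    """Hypothetical key layout: <prefix>/shard_00042.parquet."""
    return f"{prefix.rstrip('/')}/shard_{index:05d}.parquet"


def read_shard(bucket: str, key: str):
    """Download one parquet object from R2 and return it as a pyarrow Table."""
    # Imported lazily so the helpers above work without these packages installed.
    import boto3
    import pyarrow.parquet as pq

    client = boto3.client(
        "s3",
        # Environment variable names below are assumptions, not project config.
        endpoint_url=r2_endpoint(os.environ["R2_ACCOUNT_ID"]),
        aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
        region_name="auto",
    )
    body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pq.read_table(io.BytesIO(body))
```

Reading the whole object into memory keeps the sketch short; a real loader would probably want to stream or read individual row groups instead.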

The format of the parquet files stored within the dataset can be explored using the HF API: https://huggingface.co/docs/dataset-viewer/en/parquet#using-the-dataset-viewer-api
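As a starting point for that exploration, a small sketch that queries the dataset viewer's `/parquet` endpoint and groups the returned file URLs by split. The dataset id `org/name` is a placeholder, since the issue does not name the dataset, and the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

API = "https://datasets-server.huggingface.co/parquet"


def parquet_listing_url(dataset: str) -> str:
    """Build the dataset viewer URL that lists a dataset's parquet files."""
    return f"{API}?{urllib.parse.urlencode({'dataset': dataset})}"


def fetch_listing(dataset: str) -> dict:
    """Fetch the listing over the network (not exercised offline)."""
    with urllib.request.urlopen(parquet_listing_url(dataset)) as resp:
        return json.load(resp)


def urls_by_split(listing: dict) -> Dict[str, List[str]]:
    """Group parquet file URLs by split name."""
    out: Dict[str, List[str]] = {}
    for entry in listing.get("parquet_files", []):
        out.setdefault(entry["split"], []).append(entry["url"])
    return out
```

Calling `urls_by_split(fetch_listing("org/name"))` with the real dataset id would give the list of files to mirror into R2, one list per split.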


Labels: help wanted (open to contributions from the community)
