Open
Labels: help wanted (open to contributions from the community)
Description
Once feature/diloco is merged, we will be using a DataLoader, built on HuggingFace's DataLoader, that uses the current blockchain block and the miner's uid to seed a random subset of the data.
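To make the seeding idea concrete, here is a minimal sketch of how a block number and a miner uid could be turned into a reproducible subset of row indices. This is not the code in feature/diloco: the function names, the SHA-256 based seed derivation, and the `n_rows` / `n_samples` parameters are all assumptions made for illustration.

```python
import hashlib
import random


def derive_seed(block: int, uid: int) -> int:
    """Hash (block, uid) into a 64-bit integer seed (hypothetical scheme)."""
    digest = hashlib.sha256(f"{block}:{uid}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def sample_row_indices(block: int, uid: int, n_rows: int, n_samples: int) -> list[int]:
    """Pick n_samples distinct row indices, reproducible for a given (block, uid)."""
    # A private Random instance avoids touching the global RNG state.
    rng = random.Random(derive_seed(block, uid))
    return rng.sample(range(n_rows), n_samples)


if __name__ == "__main__":
    # Same block and uid always select the same rows, so any party that knows
    # the (block, uid) pair can re-derive the subset.
    print(sample_row_indices(block=100, uid=7, n_rows=1000, n_samples=5))
```

Because the subset depends only on the (block, uid) pair, the same selection logic can sit on top of any storage backend, which matters if the data source is later swapped out.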
Given ongoing issues with HuggingFace's dataset API, it is worth investing in a DataLoader that loads the data in the same format from an R2 instance where we will host the data ourselves.
Templar does this in a very clean way here: https://github.com/tplr-ai/templar/blob/main/src/tplr/r2_dataset.py, which can be used as inspiration.
The format of the parquet files stored within the dataset can be explored using the HF API: https://huggingface.co/docs/dataset-viewer/en/parquet#using-the-dataset-viewer-api
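As a starting point for exploring the parquet layout, the sketch below builds a request to the dataset viewer's `/parquet` endpoint and groups the returned file URLs by config and split. The helper names are hypothetical, the dataset name is a placeholder, and the response shape (a `parquet_files` list whose entries carry `config`, `split`, and `url`) is taken from the linked documentation and should be checked against a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DATASET_VIEWER_URL = "https://datasets-server.huggingface.co/parquet"


def parquet_listing_url(dataset: str) -> str:
    """Build the dataset viewer URL that lists a dataset's parquet files."""
    return f"{DATASET_VIEWER_URL}?{urlencode({'dataset': dataset})}"


def fetch_parquet_listing(dataset: str) -> dict:
    """Fetch the listing as JSON (requires network access)."""
    with urlopen(parquet_listing_url(dataset)) as response:
        return json.load(response)


def files_by_split(payload: dict) -> dict:
    """Group parquet file URLs by (config, split)."""
    grouped: dict = {}
    for entry in payload.get("parquet_files", []):
        key = (entry["config"], entry["split"])
        grouped.setdefault(key, []).append(entry["url"])
    return grouped


if __name__ == "__main__":
    # "org/name" is a placeholder; substitute the dataset actually used.
    print(parquet_listing_url("org/name"))
```

The grouped URLs give the shard list that would need to be mirrored to R2 so the new DataLoader sees the same files and format.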