This repository was archived by the owner on Nov 18, 2025. It is now read-only.
One good point from today's talk was that reproducibility problems often stem from data inconsistencies. To that end, I think we should have a DataDownloader component that downloads data from URLs and saves it locally to disk.
If a file already exists locally, the downloader can skip the download.
The downloader should calculate checksums for downloaded files. It should produce a checksums.cfg file to simplify reusing these in configuration later.
The downloader should allow checksums to be configured in the experiment file. When set, the downloader would verify that the downloaded file is the same as the one specified in the experiment.
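The three behaviours above could be sketched roughly as follows. This is only an illustration, not a proposed implementation: the function names (`sha256_of`, `fetch`, `write_checksums`, `read_checksums`), the choice of SHA-256, and the INI layout of checksums.cfg (a single `[checksums]` section mapping file name to digest) are all assumptions, since none of them are pinned down here yet.

```python
import configparser
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url, local_dir, name, expected=None):
    """Download url to local_dir/name unless it already exists, then verify it.

    Returns the checksum of the local file. `expected` is the optional
    checksum taken from the experiment file (hypothetical parameter name).
    """
    target = Path(local_dir) / name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(target))
    actual = sha256_of(target)
    if expected is not None and actual != expected:
        raise ValueError(f"checksum mismatch for {name}: {actual} != {expected}")
    return actual


def write_checksums(checksums, path):
    """Write {file name: checksum} to an INI-style file (assumed format)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep file names case-sensitive
    parser["checksums"] = dict(checksums)
    with open(path, "w") as handle:
        parser.write(handle)


def read_checksums(path):
    """Read back a file produced by write_checksums."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str
    parser.read(path)
    return dict(parser["checksums"])
```

Note that in this sketch an existing file is still hashed and verified even when the download is skipped, so a stale or corrupted local copy would be caught rather than silently reused.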
An example JSON config could look like this, where the "checksums" entry points to a file produced by a previous download:
{
  "_name": "Downloader",
  "local_dir": "$my_path",
  "checksums": "$WORK_DIR/checksums_2019_05_23.cfg",
  "sentences.txt.gz": {
    "url": "$BASE_URL/sentences.txt.gz",
    "decompress": true
  },
  "word_embeddings.npy": {
    "url": "$BASE_URL/word_embeddings.npy"
  }
}
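To make the intended shape of this config concrete, here is a minimal sketch of how it might be read. It assumes that "_name", "local_dir", and "checksums" are reserved keys and every other top-level key is a file entry, that "decompress" defaults to false, and that `$NAME` placeholders are filled in by simple string substitution. The helper names and the variable values used below are hypothetical; how the framework actually resolves `$my_path`, `$WORK_DIR`, and `$BASE_URL` is not specified in this issue.

```python
import json
from string import Template

# Assumed reserved keys; everything else is treated as a file to download.
RESERVED_KEYS = {"_name", "local_dir", "checksums"}


def expand(value, variables):
    """Recursively substitute $NAME placeholders in strings."""
    if isinstance(value, str):
        return Template(value).substitute(variables)
    if isinstance(value, dict):
        return {key: expand(item, variables) for key, item in value.items()}
    return value


def parse_downloader_config(text, variables):
    """Split a Downloader config into component settings and file entries."""
    raw = expand(json.loads(text), variables)
    settings = {key: raw[key] for key in RESERVED_KEYS if key in raw}
    files = {
        name: {"url": spec["url"], "decompress": spec.get("decompress", False)}
        for name, spec in raw.items()
        if name not in RESERVED_KEYS
    }
    return settings, files


EXAMPLE = """
{
  "_name": "Downloader",
  "local_dir": "$my_path",
  "checksums": "$WORK_DIR/checksums_2019_05_23.cfg",
  "sentences.txt.gz": {"url": "$BASE_URL/sentences.txt.gz", "decompress": true},
  "word_embeddings.npy": {"url": "$BASE_URL/word_embeddings.npy"}
}
"""

# Placeholder values chosen only for illustration.
VARIABLES = {
    "my_path": "/tmp/data",
    "WORK_DIR": "/tmp/work",
    "BASE_URL": "https://example.com/data",
}

settings, files = parse_downloader_config(EXAMPLE, VARIABLES)
```

One consequence of mixing reserved keys and file names at the same level is that a file could never be called "checksums" or "local_dir"; nesting the file entries under a dedicated key would avoid that, at the cost of a slightly more verbose config.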