diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 38b13d9..1c47403 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -7,17 +7,123 @@ There are `TODOs` that better enhance the reproducability and accuracy of datase
 ## Adding a dataset
 
-See `datasets/diseases` as an example of a dataset. Datasets take some form of raw data from an online service and convert it into usable datasets
-with associated gold standards for SPRAS to run on.
-
-To add a dataset:
-1. Check that your dataset provider isn't already added (some of these datasets act as providers for multiple datasets)
-1. Create a new folder under `datasets/`
-1. Add an attached Snakefile that converts your `raw` data to `processed` data.
-   - Make sure to use `uv` here. See `diseases`'s Snakefile for an example.
-1. Add your Snakefile to the top-level `run_snakemake.sh` file.
-1. Add your datasets to the appropiate `configs`
-   - If your dataset has gold standards, make sure to include them here.
+**Check that your data provider isn't already a dataset in `datasets`.** Some datasets are able to serve more data than they
+currently use: these datasets can be extended to fit your needs.
+
+The goal of a dataset is to take raw data and produce data to be fed to SPRAS.
+We'll follow along with `datasets/contributing`. This mini-tutorial assumes that you are already familiar with SPRAS
+[as per its contributing guide](https://spras.readthedocs.io/en/latest/contributing/index.html).
+
+### Uploading raw data
+
+This is a fake dataset: the data can be generated by running `datasets/contributing/raw_generation.py`, which outputs the following artifacts:
+- `sources.txt`
+- `targets.txt`
+- `gold-standard.tsv`
+- `interactome.tsv`
+
+Unlike this example, the data used in other datasets comes from external sources (whether that's supplementary information in a paper or a
+biological database like UniProt). These artifacts can be large and occasionally change, so we store them in Google Drive for caching and download
+them when we want to reconstruct a dataset.
+
+Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice.
+Share each file, allowing _Anyone with the link_ to _View_ it.
+
+Once shared, the copied URL should look something like:
+
+> https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing
+
+We always drop the entire `/view?...` suffix and replace `/file/d/` with `/uc?id=`, which turns the URL into a direct download link (these links
+are downloaded internally with [gdown](https://github.com/wkentaro/gdown)). Those post-processing steps should make the URL look like so:
+
+> https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+
+Now, add an entry to `cache/directory.py` under `Contributing`. Since this data doesn't have an online URL, it should use `CacheItem.cache_only` to
+indicate that no other online database serves it.
+
+Your new entry under the `directory` dictionary should look something like this, with one item for every artifact:
+
+```python
+...,
+"Contributing": {
+    "interactome.tsv": CacheItem.cache_only(
+        name="Randomly-generated contributing interactome",
+        cached="https://drive.google.com/uc?id=..."
+    ),
+    ...
+}
+```
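+
+The URL rewrite above is mechanical, so if you have many share links to convert, you can script it. Here's a minimal sketch (a hypothetical
+helper, not part of this repo) that performs the same substitution:
+
+```python
+def to_direct_download(share_url: str) -> str:
+    """Hypothetical helper: turn a Drive share URL into its direct download form."""
+    # e.g. https://drive.google.com/file/d/<id>/view?usp=sharing -> extract <id>
+    file_id = share_url.split("/file/d/")[1].split("/")[0]
+    return f"https://drive.google.com/uc?id={file_id}"
+
+assert to_direct_download(
+    "https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing"
+) == "https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h"
+```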
+
+### Setting up a workflow
+
+Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile` that handles:
+- Artifact downloading
+- Script running
+
+`sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.
+
+Create a `Snakefile` under your dataset with the top-level directives:
+
+```python
+# This provides the `produce_fetch_rules` util that allows us to automatically fetch the Google Drive data.
+include: "../../cache/Snakefile"
+
+rule all:
+    input:
+        # The two files we will be passing to SPRAS
+        "raw/sources.txt",
+        "raw/targets.txt",
+        # The two files we will be processing
+        "processed/gold-standard.tsv",
+        "processed/interactome.tsv"
+```
+
+We'll then generate four `fetch` rules: rules that tell Snakemake to download the data we uploaded to Google Drive earlier.
+
+```python
+produce_fetch_rules({
+    # The value array is a path into the dictionary from `cache/directory.py`.
+    "raw/sources.txt": ["Contributing", "sources.txt"],
+    # ...and so on for targets, gold-standard, and interactome.
+    # Note that leaving those three out stops the Snakefile from working, by design!
+    ...
+})
+```
+
+Create two scripts that make `gold-standard.tsv` and `interactome.tsv` SPRAS-ready, consulting
+the [SPRAS file format documentation](https://spras.readthedocs.io/en/latest/output.html). You can use any dependencies inside the top-level
+`pyproject.toml`, and you can test out your scripts with `uv run`.
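+
+As a starting point, a processing script might look like the following sketch. This is illustrative only: the input and output column names
+below are assumptions, so consult the SPRAS file format documentation linked above for the exact layout SPRAS expects.
+
+```python
+# Hypothetical sketch of a processing script (e.g. process_interactome.py).
+import pandas as pd
+
+raw = pd.read_csv("raw/interactome.tsv", sep="\t")
+
+processed = pd.DataFrame({
+    "Interactor1": raw["protein_a"],  # assumed raw column name
+    "Interactor2": raw["protein_b"],  # assumed raw column name
+    "Weight": raw["confidence"],      # assumed raw column name
+    "Direction": "U",                 # assuming every edge is undirected
+})
+
+# Write tab-separated output; check the SPRAS docs for the expected header and column layout.
+processed.to_csv("processed/interactome.tsv", sep="\t", index=False)
+```
+
+Finally, add rules to your `Snakefile` that run these scripts, turning the `raw/` inputs into the `processed/` outputs that `rule all` expects.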