diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 1c47403..21084ee 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,68 +1,137 @@
-# Contributing
+# Contributing a new benchmarking dataset

-## Helping Out
+This guide walks new contributors through the process of adding a new dataset for SPRAS benchmarking and running SPRAS on that dataset. It is a companion to the [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html), which walks users through adding an algorithm to the SPRAS software. It is useful, but not necessary, to complete that contributing guide before beginning this one.

-There are `TODOs` that better enhance the reproducability and accuracy of datasets or analysis of algorithm outputs, as well as
-[open resolvable issues](https://github.com/Reed-CompBio/spras-benchmarking/).
+## Prerequisites

-## Adding a dataset
+Before following this guide, a contributor will need:

-**Check that your data provider isn't already a dataset in `datasets`.** There are some datasets that are able to serve more data, and only use
-a subset of it: these datasets can be extended for your needs.
+- Familiarity with Python ([Carpentries introduction](https://swcarpentry.github.io/python-novice-inflammation/))
+- Familiarity with Git and GitHub ([Carpentries introduction](https://swcarpentry.github.io/git-novice/))
+- Snakemake ([Carpentries introduction](https://carpentries-incubator.github.io/workflows-snakemake/)
+  or [beginner's guide](http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html))
+- The ability to post files to Google Drive

-The goal of a dataset is to take raw data and produce data to be fed to SPRAS.
-We'll follow along with `datasets/contributing`. This mini-tutorial assumes that you already have familiarity with SPRAS
-[as per its contributing guide](https://spras.readthedocs.io/en/latest/contributing/index.html).
+## Step 0: Fork the repository and create a branch

-### Uploading raw data
+From the [spras-benchmarking repository](https://github.com/Reed-CompBio/spras-benchmarking),
+click the "Fork" button in the upper right corner to create a copy of
+the repository in your own GitHub account. Do not change the "Repository
+name". Then click the green "Create fork" button.

-This is a fake dataset: the data can be generated by running `datasets/contributing/raw_generation.py`, where the following artifacts will output:
+The simplest way to set up SPRAS benchmarking for local development is to clone your
+fork of the repository to your local machine. You can do that with a
+graphical development environment or from the command line. After
+cloning the repository, create a new git branch called
+`example-dataset` for example dataset development. In the
+following commands, replace the example username `agitter` with your
+GitHub username.
+
+```sh
+git clone https://github.com/agitter/spras-benchmarking.git
+git checkout -b example-dataset
+```
+
+Then you can make commits and push them to your fork of the repository
+on the `example-dataset` branch.
+
+```sh
+git push origin example-dataset
+```
+
+For this example dataset only, you will not merge the changes
+back to the original SPRAS benchmarking repository. Instead, you can open a pull
+request to your fork so that the SPRAS benchmarking maintainers can still provide
+feedback. For example, use the "New pull request" button from
+https://github.com/agitter/spras-benchmarking/pulls and set `agitter/spras-benchmarking` as both
+the base repository and the head repository with `example-dataset` as the compare branch.
+
+The [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html) also provides instructions for pushing changes to both the Reed-CompBio version of spras-benchmarking and your fork.
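+
+If you prefer the command line, the same pull request can be opened with the GitHub CLI. This is an optional sketch, not part of the required workflow: it assumes `gh` is installed and authenticated, and that your fork's default branch is `main` (adjust `--base` if it differs).
+
+```sh
+# Open a pull request from example-dataset into your own fork's default branch
+gh pr create --repo agitter/spras-benchmarking --base main --head example-dataset \
+  --title "Add example dataset" --body "Walkthrough dataset for maintainer feedback"
+```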
+
+## Step 1: Install `uv`
+
+Unlike the main SPRAS repository, we use `uv`, a Python package and project manager that takes the place of `pip`, for running our dataset pipelines. As we will see in step 1.1,
+we still use Conda for running SPRAS itself.
+
+You can follow `uv`'s installation instructions [on their website](https://docs.astral.sh/uv/getting-started/installation/).
+
+### 1.1: Create the spras environment and install SPRAS from its submodule
+
+This repository depends on SPRAS. If you want to reproduce the results of running SPRAS on datasets locally,
+you will need to set up SPRAS. SPRAS depends on [Docker](https://www.docker.com/) and [Conda](https://docs.conda.io/projects/conda/en/stable/). If either of these tools is difficult to install,
+a [devcontainer](https://containers.dev/) is available for easy setup. Because SPRAS is included as a git submodule, you may first need to initialize it (for example with `git submodule update --init`) if the `spras/` directory is empty.
+
+```sh
+conda env create -f spras/environment.yml
+conda activate spras
+pip install ./spras
+```
+
+## Step 2: Add a dataset
+
+The goal of a dataset is to take raw data and produce data to be fed to SPRAS. In this guide, we will add a dataset that is provided in `datasets/example`.
+
+### 2.1: Generate an example dataset
+
+Generate a fake dataset by running:
+
+```sh
+uv run datasets/example/raw_generation.py
+```
+
+The following artifacts will be placed in `datasets/example/`:

 - `sources.txt`
 - `targets.txt`
 - `gold-standard.tsv`
 - `interactome.tsv`

-Unlike in this example, the data used in other datasets comes from other sources (whether that's supplementary info in a paper, or out of
-biological databases like UniProt.) These artifacts can be large, and occasionally update, so we store them in Google Drive for caching and download
+### 2.2: Place the example dataset on Google Drive
+
+In more realistic scenarios, the data used in other datasets comes from other sources (whether that's supplementary information in a paper or data from biological databases like UniProt). These artifacts can be large, and may occasionally be updated, so we store them in Google Drive for caching and download
 them when we want to reconstruct a dataset.

-Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice.
-Share the file and allow for _Anyone with the link_ to _View_ the file.
+Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice. Set the sharing settings so that _Anyone with the link_ can _View_ each file.
 Once shared, copying the URL should look something like:

-> https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing
+```
+https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing
+```

 We always drop the entire `/view?...` suffix, and replace `/file/d/` with `/uc?id=`, which turns the URL to a direct download link, which is internally downloaded with [gdown](https://github.com/wkentaro/gdown).
 Those post-processing steps should make the URL now look as so:

-> https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```
+https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```
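+
+The transformation is easy to do by hand, and the `produce_fetch_rules` helper described below performs the download for you, but the sketch here may clarify what happens internally. The file ID and output path are placeholders taken from the example above, and it assumes `gdown` is installed:
+
+```python
+import gdown
+
+# Google Drive "share" URL copied from the sharing dialog (placeholder ID)
+share_url = "https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing"
+
+# Drop the /view?... suffix and swap /file/d/ for /uc?id= to build a direct download link
+file_id = share_url.split("/file/d/")[1].split("/")[0]
+direct_url = f"https://drive.google.com/uc?id={file_id}"
+
+# gdown fetches the file from the direct link (the output path is a placeholder)
+gdown.download(direct_url, "raw/interactome.tsv", quiet=False)
+```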
+
+### 2.3: Add the example dataset's location to `cache/directory.py`

-Now, add a directive to `cache/directory.py` under `Contributing`. Since this doesn't have an online URL, this should use `CacheItem.cache_only`, to
-indicate that no other online database serves this URL.
+Now, add a directive to `cache/directory.py` that specifies the location of the example dataset. Add it as a new (key, value) pair in the `directory` variable. Since the example dataset doesn't have an online URL, use `CacheItem.cache_only` to indicate that no other online database serves this file.

-Your new directive under the `directory` dictionary should look something as so, with one entry for every artifact:
+Your new directive under the `directory` dictionary should look something like this, with one entry for each of the four artifacts:

 ```python
 ...,
-"Contributing": {
+"ExampleData": {
     "interactome.tsv": CacheItem.cache_only(
-        name="Randomly-generated contributing interactome",
+        name="Randomly-generated example data interactome",
         cached="https://drive.google.com/uc?id=..."
     ),
     ...
 }
 ```

-### Setting up a workflow
+## Step 3: Set up a workflow to run the example dataset
+
+Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile`, which will handle downloading the artifacts from the Google Drive links and running any scripts to reformat the artifacts into SPRAS-compatible formats.

-Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile`, which will handle:
-- Artifact downloading
-- Script running.
+In the example dataset, `sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.

-`sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.
+### 3.1: Write a `Snakefile` to fetch the raw data

-Create a `Snakefile` under your dataset with the top-level directives:
+Navigate to the `datasets/example` directory and create a `Snakefile` with the following top-level directives:

 ```python
 # This provides the `produce_fetch_rules` util to allows us to automatically fetch the Google Drive data.
@@ -90,16 +159,22 @@ produce_fetch_rules({
 })
 ```

-Create two scripts that make `gold-standard.tsv` and `interactome.tsv` SPRAS-ready, consulting
+### 3.2: Write code to put example dataset files in a SPRAS-compatible format
+
+Create two scripts that output SPRAS-ready variants of `raw/gold-standard.tsv` and `raw/interactome.tsv` to `processed/`, consulting
 the [SPRAS file format documentation](https://spras.readthedocs.io/en/latest/output.html). You can use any dependencies inside the top-level
-`pyproject.toml`, and you can test out your scripts with `uv run
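+
+As a starting point for step 3.2, one such conversion script might look like the sketch below. This is a hypothetical example, not the repository's actual script: the raw column names (`protein_a`, `protein_b`, `score`) and the SPRAS-ready output columns are placeholders, so check the SPRAS file format documentation for the exact columns and separators. It assumes `pandas` is available through the top-level `pyproject.toml`.
+
+```python
+from pathlib import Path
+
+import pandas as pd
+
+RAW = Path("raw/interactome.tsv")
+PROCESSED = Path("processed/interactome.tsv")
+
+# Read the raw interactome; the column names below are assumptions about the example data
+edges = pd.read_csv(RAW, sep="\t")
+
+# Keep and rename only the columns SPRAS needs (placeholder names; verify against the SPRAS docs)
+edges = edges.rename(columns={"protein_a": "Interactor1", "protein_b": "Interactor2", "score": "Weight"})
+edges = edges[["Interactor1", "Interactor2", "Weight"]]
+
+# Write a tab-separated, SPRAS-ready file into processed/
+PROCESSED.parent.mkdir(parents=True, exist_ok=True)
+edges.to_csv(PROCESSED, sep="\t", index=False)
+```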