From cb55710ada4ea0f7aed0562966a84116279ab049 Mon Sep 17 00:00:00 2001
From: Anna Ritz
Date: Thu, 29 Jan 2026 11:08:34 -0800
Subject: [PATCH 1/6] partway through contrib guide

---
 CONTRIBUTING.md                               | 78 +++++++++++++++----
 datasets/{contributing => example}/.gitignore |  0
 datasets/{contributing => example}/README.md  |  2 +-
 .../raw_generation.py                         |  0
 4 files changed, 64 insertions(+), 16 deletions(-)
 rename datasets/{contributing => example}/.gitignore (100%)
 rename datasets/{contributing => example}/README.md (93%)
 rename datasets/{contributing => example}/raw_generation.py (100%)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 1c47403..178f8a2 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,22 +1,67 @@
-# Contributing
+# Contributing a new benchmarking dataset
 
-## Helping Out
+This guide walks new contributors through the process of adding a new dataset for SPRAS benchmarking and running SPRAS on that dataset. It is a companion to the [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html), which walks users through adding an algorithm to the SPRAS software. It is useful, but not necessary, to complete that contributing guide before beginning this one.
 
-There are `TODOs` that better enhance the reproducability and accuracy of datasets or analysis of algorithm outputs, as well as
-[open resolvable issues](https://github.com/Reed-CompBio/spras-benchmarking/).
+Prerequisites
+-------------
+
+Before following this guide, a contributor will need
+
+- Familiarity with Git and GitHub (`Carpentries
+  introduction <https://swcarpentry.github.io/git-novice/>`__)
+- Snakemake `Carpentries
+  introduction <https://carpentries-incubator.github.io/workflows-snakemake/>`__
+  or `beginner's
+  guide <http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html>`__
+- The ability to post files to Google Drive
+
+Step 0: Fork the repository and create a branch
+-----------------------------------------------
+
+From the `spras-benchmarking repository <https://github.com/Reed-CompBio/spras-benchmarking>`__,
+click the "Fork" button in the upper right corner to create a copy of
+the repository in your own GitHub account. Do not change the "Repository
+name". Then click the green "Create fork" button.
+
+The simplest way to set up SPRAS benchmarking for local development is to clone your
+fork of the repository to your local machine. You can do that with a
+graphical development environment or from the command line. After
+cloning the repository, create a new git branch called
+``example-dataset`` for developing the example dataset. In the
+following commands, replace the example username ``agitter`` with your
+GitHub username.
+
+.. code:: bash
+
+git clone https://github.com/agitter/spras-benchmarking.git
+git checkout -b example-dataset
 
-## Adding a dataset
+Then you can make commits and push them to your fork of the repository
+on the ``example-dataset`` branch
 
-**Check that your data provider isn't already a dataset in `datasets`.** There are some datasets that are able to serve more data, and only use
-a subset of it: these datasets can be extended for your needs.
+.. code:: bash
 
-The goal of a dataset is to take raw data and produce data to be fed to SPRAS.
-We'll follow along with `datasets/contributing`. This mini-tutorial assumes that you already have familiarity with SPRAS
-[as per its contributing guide](https://spras.readthedocs.io/en/latest/contributing/index.html).
+    git push origin example-dataset
+
+For this example dataset only, you will not merge the changes
+back to the original SPRAS benchmarking repository. Instead, you can open a pull
+request to your fork so that the SPRAS benchmarking maintainers can still provide
+feedback. For example, use the "New pull request" button from
+https://github.com/agitter/spras-benchmarking/pulls and set ``agitter/spras-benchmarking`` as both
+the base repository and the head repository with ``example-dataset`` as the compare branch.
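+
+Alternatively, the same pull request can be opened from the command line with the
+[GitHub CLI](https://cli.github.com/). This is an optional sketch, not a required
+step; it assumes ``gh`` is authenticated and that your fork's default branch is ``main``:
+
+```sh
+# Base and head both live in the same repository: your fork.
+gh pr create --repo agitter/spras-benchmarking --base main --head example-dataset
+```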
+
+The [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html) also provides instructions so you can push changes to both the Reed-CompBio version of spras-benchmarking and your fork.
+
+Adding an example dataset
+--------------
+
+The goal of a dataset is to take raw data and produce data to be fed to SPRAS. In this guide, we will add a dataset that is provided in `datasets/example`.
+
+## TODO ended here.
 
 ### Uploading raw data
 
-This is a fake dataset: the data can be generated by running `datasets/contributing/raw_generation.py`, where the following artifacts will output:
+This is a fake dataset: the data can be generated by running `datasets/example/raw_generation.py`, where the following artifacts will be output:
 - `sources.txt`
 - `targets.txt`
 - `gold-standard.tsv`
 - `interactome.tsv`
@@ -125,8 +170,11 @@ then refer to individual files when linking node or edge files in the configurat
 
 To test these, use the `conda` environment from the `spras` submodule to run `snakemake` with SPRAS.
 
-## Adding an algorithm
+## Making contributions
+
+You can now add your own datasets to the `spras-benchmarking` repo, which will be reviewed by the maintainers. **Check that your data provider isn't already a dataset in `datasets`.** Some existing datasets use only a subset of the data their provider serves; those datasets can be extended for your needs. Code contributions will be licensed under the project's MIT license.
+
+If you wish to contribute to the codebase beyond adding datasets, there are `TODOs` that would improve the reproducibility and accuracy of datasets or the analysis of algorithm outputs, as well as
+[open resolvable issues](https://github.com/Reed-CompBio/spras-benchmarking/).
 
-If you want to add an algorithm, refer to the [SPRAS repository](https://github.com/Reed-CompBio/SPRAS) instead.
-If you want to test your new algorithm you PRed to SPRAS, you can swap out the `spras` submodule that this repository uses
-with your fork of SPRAS.
+If you want to add an algorithm to SPRAS, refer to the [SPRAS repository](https://github.com/Reed-CompBio/SPRAS) instead. If you want to test a new algorithm that you have submitted to SPRAS in a pull request, you can swap out the `spras` submodule that this repository uses with your fork of SPRAS.
diff --git a/datasets/contributing/.gitignore b/datasets/example/.gitignore
similarity index 100%
rename from datasets/contributing/.gitignore
rename to datasets/example/.gitignore
diff --git a/datasets/contributing/README.md b/datasets/example/README.md
similarity index 93%
rename from datasets/contributing/README.md
rename to datasets/example/README.md
index 01519e2..355f9f3 100644
--- a/datasets/contributing/README.md
+++ b/datasets/example/README.md
@@ -1,4 +1,4 @@
-# Contributing Guide dataset
+# Example dataset for the contributing guide
 
 **This is an artificial dataset** used to show how to make datasets.
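+
+To regenerate the artificial data, run the generation script that sits next to
+this README (a usage note; the outputs differ on every run):
+
+```sh
+# Run from the repository root; writes the four artifacts into datasets/example/
+python datasets/example/raw_generation.py
+```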
diff --git a/datasets/contributing/raw_generation.py b/datasets/example/raw_generation.py
similarity index 100%
rename from datasets/contributing/raw_generation.py
rename to datasets/example/raw_generation.py

From 68234cc1982d8eec7e51810bd2a29be1216d5c02 Mon Sep 17 00:00:00 2001
From: Anna Ritz
Date: Thu, 29 Jan 2026 15:02:25 -0800
Subject: [PATCH 2/6] finished reading through CONTRIBUTING

---
 CONTRIBUTING.md | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 178f8a2..04b98b0 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -7,18 +7,16 @@
 Prerequisites
 -------------
 
 Before following this guide, a contributor will need
 
-- Familiarity with Git and GitHub (`Carpentries
-  introduction <https://swcarpentry.github.io/git-novice/>`__)
-- Snakemake `Carpentries
-  introduction <https://carpentries-incubator.github.io/workflows-snakemake/>`__
-  or `beginner's
-  guide <http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html>`__
+- Familiarity with Python ([Carpentries introduction](https://swcarpentry.github.io/python-novice-inflammation/))
+- Familiarity with Git and GitHub ([Carpentries introduction](https://swcarpentry.github.io/git-novice/))
+- Familiarity with Snakemake ([Carpentries introduction](https://carpentries-incubator.github.io/workflows-snakemake/)
+  or [beginner's guide](http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html))
 - The ability to post files to Google Drive
 
 Step 0: Fork the repository and create a branch
 -----------------------------------------------
 
-From the `spras-benchmarking repository <https://github.com/Reed-CompBio/spras-benchmarking>`__,
+From the [spras-benchmarking repository](https://github.com/Reed-CompBio/spras-benchmarking),
 click the "Fork" button in the upper right corner to create a copy of
 the repository in your own GitHub account. Do not change the "Repository
 name". Then click the green "Create fork" button.
@@ -31,17 +29,17 @@ cloning the repository, create a new git branch called
 following commands, replace the example username ``agitter`` with your
 GitHub username.
 
-.. code:: bash
-
+```
 git clone https://github.com/agitter/spras-benchmarking.git
 git checkout -b example-dataset
+```
 
 Then you can make commits and push them to your fork of the repository
 on the ``example-dataset`` branch
 
-.. code:: bash
-
-    git push origin example-dataset
+```
+git push origin example-dataset
+```
 
 For this example dataset only, you will not merge the changes
 back to the original SPRAS benchmarking repository. Instead, you can open a pull
@@ -52,15 +50,27 @@ the base repository and the head repository with ``example-dataset`` as the comp
 
 The [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html) also provides instructions so you can push changes to both the Reed-CompBio version of spras-benchmarking and your fork.
 
+Step 1: Activate the spras environment and install SPRAS as a submodule
+---------------
+
+This repository depends on SPRAS. If you want to reproduce the results of benchmarking locally,
+you will need to set up SPRAS. SPRAS depends on [Docker](https://www.docker.com/) and [Conda](https://docs.conda.io/projects/conda/en/stable/). If either of these tools is difficult to install,
+a [devcontainer](https://containers.dev/) is available for easy setup.
+
+```sh
+conda env create -f spras/environment.yml
+conda activate spras
+pip install ./spras
+```
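+
+As an optional sanity check (assuming the commands above succeeded), confirm that
+the `spras` package and Snakemake are now available in the environment:
+
+```sh
+python -c "import spras; print(spras.__file__)"
+snakemake --version
+```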
+
 Adding an example dataset
 --------------
 
 The goal of a dataset is to take raw data and produce data to be fed to SPRAS. In this guide, we will add a dataset that is provided in `datasets/example`.
 
-## TODO ended here.
-
-### Uploading raw data
+### Generate example data
 
+Generate a fake dataset by running:
 This is a fake dataset: the data can be generated by running `datasets/example/raw_generation.py`, where the following artifacts will be output:
 - `sources.txt`
 - `targets.txt`
 - `gold-standard.tsv`
 - `interactome.tsv`
@@ -164,6 +174,8 @@ Make sure that your `Snakefile` is run inside the top-level `run_snakemake.sh` f
 
 ### Adding to the SPRAS config
 
+TODO: add note to activate the `spras` conda environment.
+
 Since this is a pathway problem and not a disease mining problem, we'll mutate `configs/pra.yaml`.
 Add your dataset and gold standard to the configuration. Since this dataset passes in a mix of raw and processed files, it would be best to make the `data_dir` set to `datasets/contributing`,
 then refer to individual files when linking node or edge files in the configuration.

From 0dcaf085552db7561a09afb22aee0bc7f23d23ae Mon Sep 17 00:00:00 2001
From: Anna Ritz
Date: Thu, 29 Jan 2026 15:07:48 -0800
Subject: [PATCH 3/6] adding some last changes.

---
 CONTRIBUTING.md | 79 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 26 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 04b98b0..3d991e1 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -63,61 +63,71 @@ conda activate spras
 pip install ./spras
 ```
 
-Adding an example dataset
+Step 2: Add a dataset
 --------------
 
 The goal of a dataset is to take raw data and produce data to be fed to SPRAS. In this guide, we will add a dataset that is provided in `datasets/example`.
 
-### Generate example data
+### 2.1: Generate an example dataset
 
 Generate a fake dataset by running:
-This is a fake dataset: the data can be generated by running `datasets/example/raw_generation.py`, where the following artifacts will be output:
+
+```py
+python datasets/example/raw_generation.py
+```
+The following artifacts will be placed in `datasets/example/`:
 - `sources.txt`
 - `targets.txt`
 - `gold-standard.tsv`
 - `interactome.tsv`
 
-Unlike in this example, the data used in other datasets comes from other sources (whether that's supplementary info in a paper, or out of
-biological databases like UniProt.) These artifacts can be large, and occasionally update, so we store them in Google Drive for caching and download
+### 2.2: Place the example dataset on Google Drive
+
+In more realistic scenarios, the data used in other datasets comes from other sources (whether that's supplementary info in a paper or biological databases like UniProt). These artifacts can be large, and may occasionally be updated, so we store them in Google Drive for caching and download
 them when we want to reconstruct a dataset.
 
-Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice.
-Share the file and allow for _Anyone with the link_ to _View_ the file.
+Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice. Set the sharing settings so that _Anyone with the link_ can _View_ the file.
 
 Once shared, the copied URL should look something like:
 
-> https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing
+```
+https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing
+```
 
 We always drop the entire `/view?...` suffix, and replace `/file/d/` with `/uc?id=`, which turns the URL into a direct download
-link, which is internally downloaded with [gdown](https://github.com/wkentaro/gdown). Those post-processing steps should make the URL now look as so:
+link, which is internally downloaded with [gdown](https://github.com/wkentaro/gdown).
+After those post-processing steps, the URL should look like this:
 
-> https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```
+https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```
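+
+If you have many links to convert, the same rewrite is easy to script. A minimal
+sketch (this helper is hypothetical and not part of the repository):
+
+```py
+def drive_direct_url(share_url: str) -> str:
+    """Turn a Google Drive share URL into the direct-download form above."""
+    file_id = share_url.split("/file/d/")[1].split("/")[0]
+    return f"https://drive.google.com/uc?id={file_id}"
+
+print(drive_direct_url(
+    "https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing"
+))  # prints https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```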
 
-Now, add a directive to `cache/directory.py` under `Contributing`. Since this doesn't have an online URL, this should use `CacheItem.cache_only`, to
-indicate that no other online database serves this URL.
+### 2.3: Add the example dataset's location to `cache/directory.py`
 
-Your new directive under the `directory` dictionary should look something as so, with one entry for every artifact:
+Now, add a directive to `cache/directory.py` that specifies the location of the example dataset. This should be added as a new (key, value) pair to the `directory` variable. Since the example dataset doesn't have an online URL, this should use `CacheItem.cache_only` to indicate that no other online database serves this URL.
+
+Your new directive under the `directory` dictionary should look something like this, with one entry for each of the four artifacts:
 
 ```python
 ...,
-"Contributing": {
+"ExampleData": {
     "interactome.tsv": CacheItem.cache_only(
-        name="Randomly-generated contributing interactome",
+        name="Randomly-generated example data interactome",
         cached="https://drive.google.com/uc?id=..."
     ),
     ...
 }
 ```
 
-### Setting up a workflow
+Step 3: Set up a workflow to run the example dataset
+----------
+
+Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile`, which will handle downloading the artifacts from the Google Drive links and running any scripts to reformat the artifacts into SPRAS-compatible formats.
 
-Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile`, which will handle:
-- Artifact downloading
-- Script running.
+In the example dataset, `sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.
 
-`sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.
+### 3.1: Write a `Snakefile` to fetch datasets
 
-Create a `Snakefile` under your dataset with the top-level directives:
+Navigate to the `datasets/example` directory and create a `Snakefile` with the top-level directives:
 
 ```python
 # This provides the `produce_fetch_rules` util that allows us to automatically fetch the Google Drive data.
@@ -145,6 +155,8 @@ produce_fetch_rules({
 })
 
+### 3.2: Write code to put example dataset files in a SPRAS-compatible format
+
 Create two scripts that make `gold-standard.tsv` and `interactome.tsv` SPRAS-ready, consulting the
 [SPRAS file format documentation](https://spras.readthedocs.io/en/latest/output.html). You can use any dependencies inside the
 top-level `pyproject.toml`, and you can test out your scripts with `uv run
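+
+As a sketch of what one of these scripts might look like (the file names, paths,
+and column headers below are hypothetical; confirm the exact headers against the
+SPRAS file format documentation):
+
+```py
+# datasets/example/process_interactome.py (hypothetical name)
+import pandas as pd
+
+# Assume the raw interactome is a headerless TSV: two node columns and a score.
+edges = pd.read_csv("interactome.tsv", sep="\t", header=None,
+                    names=["node1", "node2", "score"])
+
+# Rename to SPRAS-style edge columns and mark every edge as undirected.
+edges = edges.rename(columns={"node1": "Interactor1", "node2": "Interactor2",
+                              "score": "Weight"})
+edges["Direction"] = "U"  # "U" = undirected; check the docs for your SPRAS version
+edges.to_csv("interactome-spras.tsv", sep="\t", index=False)
+```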