Reed-CompBio · annaritz · Jan 29, 2026 · Jan 29, 2026 · Jan 29, 2026 · Jan 30, 2026
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -1,68 +1,137 @@
-# Contributing
+# Contributing a new benchmarking dataset
 
-## Helping Out
+This guide walks new contributors through the process of adding a new dataset for SPRAS benchmarking and running SPRAS on that dataset. It is considered a companion to the [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html), which walks users through adding and algorithm to the SPRAS software. It is useful, but not necessary, to complete that contributing guide before beginning this one.
 
-There are `TODOs` that better enhance the reproducability and accuracy of datasets or analysis of algorithm outputs, as well as
-[open resolvable issues](https://github.com/Reed-CompBio/spras-benchmarking/).
+## Prerequisites
 
-## Adding a dataset
+Before following this guide, a contributor will need
 
-**Check that your data provider isn't already a dataset in `datasets`.** There are some datasets that are able to serve more data, and only use
-a subset of it: these datasets can be extended for your needs.
+- Familiarity with Python ([Carpentries introduction](https://swcarpentry.github.io/python-novice-inflammation/))
+- Familiarity with Git and GitHub ([Carpentries introduction](https://swcarpentry.github.io/git-novice/))
+- Snakemake ([Carpentries introduction](https://carpentries-incubator.github.io/workflows-snakemake/)
+  or [beginner's guide](http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html))
+- The ability to post files to Google Drive
 
-The goal of a dataset is to take raw data and produce data to be fed to SPRAS.
-We'll follow along with `datasets/contributing`. This mini-tutorial assumes that you already have familiarity with SPRAS
-[as per its contributing guide](https://spras.readthedocs.io/en/latest/contributing/index.html).
+## Step 0: Fork the repository and create a branch
 
-### Uploading raw data
+From the [spras-benchmarking repository](https://github.com/Reed-CompBio/spras-benchmarking),
+click the "Fork" button in the upper right corner to create a copy of
+the repository in your own GitHub account. Do not change the "Repository
+name". Then click the green "Create fork" button.
 
-This is a fake dataset: the data can be generated by running `datasets/contributing/raw_generation.py`, where the following artifacts will output:
+The simplest way to set up SPRAS benchmarking for local development is to clone your
+fork of the repository to your local machine. You can do that with a
+graphical development environment or from the command line. After
+cloning the repository, create a new git branch called
+``example-dataset`` for local neighborhood development. In the
+following commands, replace the example username ``agitter`` with your
+GitHub username.
+
+```sh
+git clone https://github.com/agitter/spras-benchmarking.git
+git checkout -b example-dataset
+```
+
+Then you can make commits and push them to your fork of the repository
+on the ``example-dataset`` branch
+
+```sh
+git push origin example-dataset
+```
+
+For this example dataset only, you will not merge the changes
+back to the original SPRAS benchmarking repository. Instead, you can open a pull
+request to your fork so that the SPRAS benchmarking maintainers can still provide
+feedback. For example, use the "New pull request" button from
+https://github.com/agitter/spras-benchmarking/pulls and set ``agitter/spras-benchmarking`` as both
+the base repository and the head repository with ``example-dataset`` as the compare branch.
+
+The [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html) also provides instructions so you can push changes to both the Reed-CompBio version of spras-benchmarking and your fork.
+
+### Step 1: Install `uv`
+
+Unlike in the main SPRAS repository, we use `uv`, an equivalent to `pip`, for running our dataset pipelines. As we will see later in `1.1`,
+we still use Conda for running SPRAS itself.
+
+You can follow `uv`'s installation instructions [on their website](https://docs.astral.sh/uv/getting-started/installation/).
+
+### 1.1: Activate the spras environment and install SPRAS as a submdule.
+
+This repository depends on SPRAS. If you want to reproduce the results of running SPRAS on datasets locally,
+you will need to setup SPRAS. SPRAS depends on [Docker](https://www.docker.com/) and [Conda](https://docs.conda.io/projects/conda/en/stable/). If it is hard to install either of these tools,
+a [devcontainer](https://containers.dev/) is available for easy setup.
+
+```sh
+conda env create -f spras/environment.yml
+conda activate spras
+pip install ./spras
+```
+
+## Step 2: Add a dataset
+
+The goal of a dataset is to take raw data and produce data to be fed to SPRAS. In this guide, we will add a dataset that is provided in `datasets/example`.
+
+### 2.1: Generate an example dataset
+
+Generate a fake dataset by running
+
+```sh
+uv run datasets/example/raw_generation.py
+```
+
+The following artifacts will be placed in `dataset/example/`:
 - `sources.txt`
 - `targets.txt`
 - `gold-standard.tsv`
 - `interactome.tsv`
 
-Unlike in this example, the data used in other datasets comes from other sources (whether that's supplementary info in a paper, or out of
-biological databases like UniProt.) These artifacts can be large, and occasionally update, so we store them in Google Drive for caching and download
+### 2.2: Place the example dataset on Google Drive
+
+In more realistic scenarios, the data used in other datasets comes from other sources (whether that's supplementary info in a paper, or out of biological databases like UniProt.) These artifacts can be large, and may occasionally be updated, so we store them in Google Drive for caching and download
 them when we want to reconstruct a dataset.
 
-Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice.
-Share the file and allow for _Anyone with the link_ to _View_ the file.
+Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice. Set the Sharing settings so that _Anyone with the link_ can _View_ the file.
 
 Once shared, copying the URL should look something like:
 
-> https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing
+```
+https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing
+```
 
 We always drop the entire `/view?...` suffix, and replace `/file/d/` with `/uc?id=`, which turns the URL to a direct download link, which is internally
 downloaded with [gdown](https://github.com/wkentaro/gdown). Those post-processing steps should make the URL now look as so:
 
-> https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```
+https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```
+
+### 2.3: Add the example dataset's location to `cache/directory.py`
 
-Now, add a directive to `cache/directory.py` under `Contributing`. Since this doesn't have an online URL, this should use `CacheItem.cache_only`, to
-indicate that no other online database serves this URL.
+Now, add a directive to `cache/directory.py` that specifies the location of the example dataset. This should be added as a new (key,value) pair to the `directory` variable. Since the example dataset doesn't have an online URL, this should use `CacheItem.cache_only`, to indicate that no other online database serves this URL.
 
-Your new directive under the `directory` dictionary should look something as so, with one entry for every artifact:
+Your new directive under the `directory` dictionary should look something as so, with one entry for each of the four artifacts:
 
 ```python
 ...,
-"Contributing": {
+"ExampleData": {
     "interactome.tsv": CacheItem.cache_only(
-        name="Randomly-generated contributing interactome",
+        name="Randomly-generated example data interactome",
         cached="https://drive.google.com/uc?id=..."
     ),
     ...
 }
 ```
 
-### Setting up a workflow
+Step 3: Set up a workflow to run the example dataset
+----------
+
+Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile`, which will handle downloading the artifacts from the Google Drive links and running any scripts to reformat the artifacts into SPRAS-compatible formats.
 
-Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile`, which will handle:
-- Artifact downloading
-- Script running.
+In the example dataset, `sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.
 
-`sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.
+### 3.1: Write a `Snakefile` to fetch datasets
 
-Create a `Snakefile` under your dataset with the top-level directives:
+Navigate to the `dataset/example` directory and create a `Snakefile` with the top-level directives:
 
 ```python
 # This provides the `produce_fetch_rules` util to allows us to automatically fetch the Google Drive data.
@@ -90,16 +159,22 @@ produce_fetch_rules({
 })
 ```
 
-Create two scripts that make `gold-standard.tsv` and `interactome.tsv` SPRAS-ready, consulting
+### 3.2: Write code to put example dataset files in a SPRAS-compatible format
+
+Create two scripts that output SPRAS-ready variants of `raw/gold-standard.tsv` and `raw/interactome.tsv` to `processed/`, consulting
 the [SPRAS file format documentation](https://spras.readthedocs.io/en/latest/output.html). You can use any dependencies inside the top-level
-`pyproject.toml`, and you can test out your scripts with `uv run <script>`, an installation requirement from the top-level README.
+`pyproject.toml`, though [pandas](https://pandas.pydata.org/) should suffice, and you can test out your scripts with `uv run <script>`,
+an installation requirement from Step 1.
 
 
 > [!TIP]
-> Getting the current directory of your script prevents path errors. We use the snippet `Path(__file__).parent.resolve()`
-> throughout the repository.
+> While scripts will usually be run through `Snakemake`, they can also be run standalone through `uv run <script>.py`.
+> Users not running scripts from `datasets/<dataset>` will encounter path errors, unless you resolve the file's current directory
+> to be its current location through `current_dir = Path(__file__).parent.resolve()`.
+
+### 3.3: Write Snakemake rules to produce SPRAS-compatible files
 
-Once you have your scripts, add rules that consume the raw data and produce your processed data. For example:
+Once you have your scripts, add rules to the `Snakefile` that consume the raw data and produce your processed data. For example:
 
 ```py
 rule interactome:
@@ -111,22 +186,42 @@ rule interactome:
         "uv run scripts/process_interactome.py"
 ```
 
-Once you do the same for `gold-standard.tsv`, your data pipeline is ready! You can test it with `uv run snakemake --cores 1`.
+Once you do the same for `gold-standard.tsv`, your dataset recreation pipeline is ready! This will not run SPRAS itself, but it will allow
+your processed dataset files to be reproduced. You can test it with `uv run snakemake --cores 1`.
+
+## Step 4: Add the example dataset to the set of benchmark data
+
+To make sure your dataset is run along with all other datasets when benchmarking is run,
+you need to run your new `Snakefile` to `run_snakemake.sh` file in the top-level directory, and add it to the appropiate SPRAS configuration in `configs`.
+
+The example dataset inputs indicate that algorithms designed for pathway reconstruction analysis should be run on this example (as opposed to a disease mining analysis, which would not have sources and targets). Therefore, we will add this dataset to be run when pathway reconstruction analysis (PRA) methods are used. The configuration file for these methods is in `configs/pra.yaml`.
 
 ### Adding to `run_snakemake.sh`
 
 Make sure that your `Snakefile` is run inside the top-level `run_snakemake.sh` file.
 
 ### Adding to the SPRAS config
 
-Since this is a pathway problem and not a disease mining problem, we'll mutate `configs/pra.yaml`. Add your dataset and gold standard to the
-configuration. Since this dataset passes in a mix of raw and processed files, it would be best to make the `data_dir` set to `datasets/contributing`,
-then refer to individual files when linking node or edge files in the configuration.
+Since this is a pathway problem and not a disease mining problem, we'll mutate `configs/pra.yaml`. Add your dataset and gold standard to the configuration. Since this dataset passes in a mix of raw and processed files, it would be best to make the `data_dir` set to `datasets/example`, then refer to individual files when linking node or edge files in the configuration. Under the `datasets` tag, add lines like this:
+
+```yaml
+  - label: exampleDataset
+    node_files: ["raw/sources.txt", "raw/targets.txt"]
+    edge_files: ["processed/interactome.tsv"]
+    data_dir: "datasets/example"
+```
+
+To test these, use the `conda` environment from the `spras` submodule to run `snakemake` with SPRAS:
+
+```sh
+snakemake --cores 1 --configfile configs/pra.yaml --show-failed-logs -s spras/Snakefile
+```
+
+## Making contributions
 
-To test these, use the `conda` environment from the `spras` submodule to run `snakemake` with SPRAS.
+You can now add your own datasets to the `spras-benchmarking` repo, which will be reviewed by the maintainers. **Check that your data provider isn't already a dataset in `datasets`.** There are some datasets that are able to serve more data, and only use a subset of it: these datasets can be extended for your needs. Code contributions will be licensed using the project's MIT license.
 
-## Adding an algorithm
+If you wish to contribute to the codebase beyond adding datasets, there are `TODOs` that better enhance the reproducibility and accuracy of datasets or analysis of algorithm outputs, as well as
+[open resolvable issues](https://github.com/Reed-CompBio/spras-benchmarking/issues).
 
-If you want to add an algorithm, refer to the [SPRAS repository](https://github.com/Reed-CompBio/SPRAS) instead.
-If you want to test your new algorithm you PRed to SPRAS, you can swap out the `spras` submodule that this repository uses
-with your fork of SPRAS.
+If you want to add an algorithm to SPRAS, refer to the [SPRAS repository](https://github.com/Reed-CompBio/SPRAS) instead. If you want to test your new algorithm you PRed to SPRAS, you can swap out the `spras` submodule that this repository uses with your fork of SPRAS.
diff --git a/datasets/contributing/.gitignore → datasets/example/.gitignore b/datasets/contributing/.gitignore → datasets/example/.gitignore
diff --git a/datasets/contributing/README.md → datasets/example/README.md b/datasets/contributing/README.md → datasets/example/README.md
@@ -1,4 +1,4 @@
-# Contributing Guide dataset
+# Example dataset for the contributing guide
 
 **This is an artificial dataset** for how to make datasets.
 

diff --git a/datasets/contributing/raw_generation.py → datasets/example/raw_generation.py b/datasets/contributing/raw_generation.py → datasets/example/raw_generation.py
@@ -25,17 +25,19 @@ def gnp_noise(graph: networkx.DiGraph, p: float):
 
 def generate_parser():
     parser = argparse.ArgumentParser(prog='Pathway generator')
-    parser.add_argument("--path-count", type=int, default=10)
-    parser.add_argument("--path-length", type=int, default=7)
+    parser.add_argument("--path-count", type=int, default=10, help="The number of paths, whose starts and ends are marked as sources and targets.")
+    parser.add_argument("--path-length", type=int, default=7, help="The length of every path from --path-count.")
 
     parser.add_argument("--sources-output", type=str, default="sources.txt")
     parser.add_argument("--targets-output", type=str, default="targets.txt")
 
-    parser.add_argument("--gold-standard-noise", type=float, default=0.03)
+    parser.add_argument("--gold-standard-noise", type=float, default=0.03,
+                        help="The probability that edges in the gold standard are connected to each other.")
     parser.add_argument("--gold-standard-output", type=str, default="gold-standard.tsv")
 
     parser.add_argument("--interactome-extra-nodes", type=int, default=400)
-    parser.add_argument("--interactome-noise", type=float, default=0.01)
+    parser.add_argument("--interactome-noise", type=float, default=0.01,
+                        help="The probability that edges in the larger interactome are connected to each other.")
     parser.add_argument("--interactome-output", type=str, default="interactome.tsv")
     return parser