From cb55710ada4ea0f7aed0562966a84116279ab049 Mon Sep 17 00:00:00 2001
From: Anna Ritz
Date: Thu, 29 Jan 2026 11:08:34 -0800
Subject: [PATCH 1/6] partway through contrib guide

---
 CONTRIBUTING.md                               | 78 +++++++++++++++----
 datasets/{contributing => example}/.gitignore |  0
 datasets/{contributing => example}/README.md  |  2 +-
 .../raw_generation.py                         |  0
 4 files changed, 64 insertions(+), 16 deletions(-)
 rename datasets/{contributing => example}/.gitignore (100%)
 rename datasets/{contributing => example}/README.md (93%)
 rename datasets/{contributing => example}/raw_generation.py (100%)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 1c47403..178f8a2 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,22 +1,67 @@
-# Contributing
+# Contributing a new benchmarking dataset
 
-## Helping Out
+This guide walks new contributors through the process of adding a new dataset for SPRAS benchmarking and running SPRAS on that dataset. It is a companion to the [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html), which walks users through adding an algorithm to the SPRAS software. It is useful, but not necessary, to complete that contributing guide before beginning this one.
 
-There are `TODOs` that better enhance the reproducability and accuracy of datasets or analysis of algorithm outputs, as well as
-[open resolvable issues](https://github.com/Reed-CompBio/spras-benchmarking/).
+Prerequisites
+-------------
+
+Before following this guide, a contributor will need
+
+- Familiarity with Git and GitHub (`Carpentries
+  introduction <https://swcarpentry.github.io/git-novice/>`__)
+- Snakemake `Carpentries
+  introduction <https://carpentries-incubator.github.io/workflows-snakemake/>`__
+  or `beginner's
+  guide <http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html>`__
+- The ability to post files to Google Drive
+
+Step 0: Fork the repository and create a branch
+-----------------------------------------------
+
+From the `spras-benchmarking repository <https://github.com/Reed-CompBio/spras-benchmarking>`__,
+click the "Fork" button in the upper right corner to create a copy of
+the repository in your own GitHub account. Do not change the "Repository
+name". Then click the green "Create fork" button.
+
+The simplest way to set up SPRAS benchmarking for local development is to clone your
+fork of the repository to your local machine. You can do that with a
+graphical development environment or from the command line. After
+cloning the repository, create a new git branch called
+``example-dataset`` for developing the example dataset. In the
+following commands, replace the example username ``agitter`` with your
+GitHub username.
+
+.. code:: bash
+
+git clone https://github.com/agitter/spras-benchmarking.git
+git checkout -b example-dataset
 
-## Adding a dataset
+Then you can make commits and push them to your fork of the repository
+on the ``example-dataset`` branch
 
-**Check that your data provider isn't already a dataset in `datasets`.** There are some datasets that are able to serve more data, and only use
-a subset of it: these datasets can be extended for your needs.
+.. code:: bash
 
-The goal of a dataset is to take raw data and produce data to be fed to SPRAS.
-We'll follow along with `datasets/contributing`. This mini-tutorial assumes that you already have familiarity with SPRAS
-[as per its contributing guide](https://spras.readthedocs.io/en/latest/contributing/index.html).
+    git push origin example-dataset
+
+For this example dataset only, you will not merge the changes
+back to the original SPRAS benchmarking repository. Instead, you can open a pull
+request to your fork so that the SPRAS benchmarking maintainers can still provide
+feedback. For example, use the "New pull request" button from
+https://github.com/agitter/spras-benchmarking/pulls and set ``agitter/spras-benchmarking`` as both
+the base repository and the head repository with ``example-dataset`` as the compare branch.
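+
+Alternatively, the same pull request can be opened from the command line with the
+[GitHub CLI](https://cli.github.com/). This is an optional sketch, not a required
+step; it assumes ``gh`` is authenticated and that your fork's default branch is ``main``:
+
+```sh
+# Base and head both live in the same repository: your fork.
+gh pr create --repo agitter/spras-benchmarking --base main --head example-dataset
+```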
+
+The [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html) also provides instructions so you can push changes to both the Reed-CompBio version of spras-benchmarking and your fork.
+
+Adding an example dataset
+--------------
+
+The goal of a dataset is to take raw data and produce data to be fed to SPRAS. In this guide, we will add a dataset that is provided in `datasets/example`.
+
+## TODO ended here.
 
 ### Uploading raw data
 
-This is a fake dataset: the data can be generated by running `datasets/contributing/raw_generation.py`, where the following artifacts will output:
+This is a fake dataset: the data can be generated by running `datasets/example/raw_generation.py`, where the following artifacts will be output:
 - `sources.txt`
 - `targets.txt`
 - `gold-standard.tsv`
 - `interactome.tsv`
@@ -125,8 +170,11 @@ then refer to individual files when linking node or edge files in the configurat
 
 To test these, use the `conda` environment from the `spras` submodule to run `snakemake` with SPRAS.
 
-## Adding an algorithm
+## Making contributions
+
+You can now add your own datasets to the `spras-benchmarking` repo, which will be reviewed by the maintainers. **Check that your data provider isn't already a dataset in `datasets`.** Some existing datasets use only a subset of the data their provider serves; those datasets can be extended for your needs. Code contributions will be licensed under the project's MIT license.
+
+If you wish to contribute to the codebase beyond adding datasets, there are `TODOs` that would improve the reproducibility and accuracy of datasets or the analysis of algorithm outputs, as well as
+[open resolvable issues](https://github.com/Reed-CompBio/spras-benchmarking/).
 
-If you want to add an algorithm, refer to the [SPRAS repository](https://github.com/Reed-CompBio/SPRAS) instead.
-If you want to test your new algorithm you PRed to SPRAS, you can swap out the `spras` submodule that this repository uses
-with your fork of SPRAS.
+If you want to add an algorithm to SPRAS, refer to the [SPRAS repository](https://github.com/Reed-CompBio/SPRAS) instead. If you want to test a new algorithm that you have submitted to SPRAS in a pull request, you can swap out the `spras` submodule that this repository uses with your fork of SPRAS.
diff --git a/datasets/contributing/.gitignore b/datasets/example/.gitignore
similarity index 100%
rename from datasets/contributing/.gitignore
rename to datasets/example/.gitignore
diff --git a/datasets/contributing/README.md b/datasets/example/README.md
similarity index 93%
rename from datasets/contributing/README.md
rename to datasets/example/README.md
index 01519e2..355f9f3 100644
--- a/datasets/contributing/README.md
+++ b/datasets/example/README.md
@@ -1,4 +1,4 @@
-# Contributing Guide dataset
+# Example dataset for the contributing guide
 
 **This is an artificial dataset** used to show how to make datasets.
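+
+To regenerate the artificial data, run the generation script that sits next to
+this README (a usage note; the outputs differ on every run):
+
+```sh
+# Run from the repository root; writes the four artifacts into datasets/example/
+python datasets/example/raw_generation.py
+```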
diff --git a/datasets/contributing/raw_generation.py b/datasets/example/raw_generation.py
similarity index 100%
rename from datasets/contributing/raw_generation.py
rename to datasets/example/raw_generation.py

From 68234cc1982d8eec7e51810bd2a29be1216d5c02 Mon Sep 17 00:00:00 2001
From: Anna Ritz
Date: Thu, 29 Jan 2026 15:02:25 -0800
Subject: [PATCH 2/6] finished reading through CONTRIBUTING

---
 CONTRIBUTING.md | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 178f8a2..04b98b0 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -7,18 +7,16 @@
 Prerequisites
 -------------
 
 Before following this guide, a contributor will need
 
-- Familiarity with Git and GitHub (`Carpentries
-  introduction <https://swcarpentry.github.io/git-novice/>`__)
-- Snakemake `Carpentries
-  introduction <https://carpentries-incubator.github.io/workflows-snakemake/>`__
-  or `beginner's
-  guide <http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html>`__
+- Familiarity with Python ([Carpentries introduction](https://swcarpentry.github.io/python-novice-inflammation/))
+- Familiarity with Git and GitHub ([Carpentries introduction](https://swcarpentry.github.io/git-novice/))
+- Familiarity with Snakemake ([Carpentries introduction](https://carpentries-incubator.github.io/workflows-snakemake/)
+  or [beginner's guide](http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html))
 - The ability to post files to Google Drive
 
 Step 0: Fork the repository and create a branch
 -----------------------------------------------
 
-From the `spras-benchmarking repository <https://github.com/Reed-CompBio/spras-benchmarking>`__,
+From the [spras-benchmarking repository](https://github.com/Reed-CompBio/spras-benchmarking),
 click the "Fork" button in the upper right corner to create a copy of
 the repository in your own GitHub account. Do not change the "Repository
 name". Then click the green "Create fork" button.
@@ -31,17 +29,17 @@ cloning the repository, create a new git branch called
 following commands, replace the example username ``agitter`` with your
 GitHub username.
 
-.. code:: bash
-
+```
 git clone https://github.com/agitter/spras-benchmarking.git
 git checkout -b example-dataset
+```
 
 Then you can make commits and push them to your fork of the repository
 on the ``example-dataset`` branch
 
-.. code:: bash
-
-    git push origin example-dataset
+```
+git push origin example-dataset
+```
 
 For this example dataset only, you will not merge the changes
 back to the original SPRAS benchmarking repository. Instead, you can open a pull
@@ -52,15 +50,27 @@ the base repository and the head repository with ``example-dataset`` as the comp
 
 The [SPRAS Contributing Guide](https://spras.readthedocs.io/en/latest/contributing/index.html) also provides instructions so you can push changes to both the Reed-CompBio version of spras-benchmarking and your fork.
 
+Step 1: Activate the spras environment and install SPRAS as a submodule
+---------------
+
+This repository depends on SPRAS. If you want to reproduce the results of benchmarking locally,
+you will need to set up SPRAS. SPRAS depends on [Docker](https://www.docker.com/) and [Conda](https://docs.conda.io/projects/conda/en/stable/). If either of these tools is difficult to install,
+a [devcontainer](https://containers.dev/) is available for easy setup.
+
+```sh
+conda env create -f spras/environment.yml
+conda activate spras
+pip install ./spras
+```
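+
+As an optional sanity check (assuming the commands above succeeded), confirm that
+the `spras` package and Snakemake are now available in the environment:
+
+```sh
+python -c "import spras; print(spras.__file__)"
+snakemake --version
+```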
+
 Adding an example dataset
 --------------
 
 The goal of a dataset is to take raw data and produce data to be fed to SPRAS. In this guide, we will add a dataset that is provided in `datasets/example`.
 
-## TODO ended here.
-
-### Uploading raw data
+### Generate example data
 
+Generate a fake dataset by running:
 This is a fake dataset: the data can be generated by running `datasets/example/raw_generation.py`, where the following artifacts will be output:
 - `sources.txt`
 - `targets.txt`
 - `gold-standard.tsv`
 - `interactome.tsv`
@@ -164,6 +174,8 @@ Make sure that your `Snakefile` is run inside the top-level `run_snakemake.sh` f
 
 ### Adding to the SPRAS config
 
+TODO: add note to activate the `spras` conda environment.
+
 Since this is a pathway problem and not a disease mining problem, we'll mutate `configs/pra.yaml`.
 Add your dataset and gold standard to the configuration. Since this dataset passes in a mix of raw and processed files, it would be best to make the `data_dir` set to `datasets/contributing`,
 then refer to individual files when linking node or edge files in the configuration.

From 0dcaf085552db7561a09afb22aee0bc7f23d23ae Mon Sep 17 00:00:00 2001
From: Anna Ritz
Date: Thu, 29 Jan 2026 15:07:48 -0800
Subject: [PATCH 3/6] adding some last changes.

---
 CONTRIBUTING.md | 79 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 26 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 04b98b0..3d991e1 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -63,61 +63,71 @@ conda activate spras
 pip install ./spras
 ```
 
-Adding an example dataset
+Step 2: Add a dataset
 --------------
 
 The goal of a dataset is to take raw data and produce data to be fed to SPRAS. In this guide, we will add a dataset that is provided in `datasets/example`.
 
-### Generate example data
+### 2.1: Generate an example dataset
 
 Generate a fake dataset by running:
-This is a fake dataset: the data can be generated by running `datasets/example/raw_generation.py`, where the following artifacts will be output:
+
+```py
+python datasets/example/raw_generation.py
+```
+The following artifacts will be placed in `datasets/example/`:
 - `sources.txt`
 - `targets.txt`
 - `gold-standard.tsv`
 - `interactome.tsv`
 
-Unlike in this example, the data used in other datasets comes from other sources (whether that's supplementary info in a paper, or out of
-biological databases like UniProt.) These artifacts can be large, and occasionally update, so we store them in Google Drive for caching and download
+### 2.2: Place the example dataset on Google Drive
+
+In more realistic scenarios, the data used in other datasets comes from other sources (whether that's supplementary info in a paper or biological databases like UniProt). These artifacts can be large, and may occasionally be updated, so we store them in Google Drive for caching and download
 them when we want to reconstruct a dataset.
 
-Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice.
-Share the file and allow for _Anyone with the link_ to _View_ the file.
+Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice. Set the sharing settings so that _Anyone with the link_ can _View_ the file.
 
 Once shared, the copied URL should look something like:
 
-> https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing
+```
+https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing
+```
 
 We always drop the entire `/view?...` suffix, and replace `/file/d/` with `/uc?id=`, which turns the URL into a direct download
-link, which is internally downloaded with [gdown](https://github.com/wkentaro/gdown). Those post-processing steps should make the URL now look as so:
+link, which is internally downloaded with [gdown](https://github.com/wkentaro/gdown).
+After those post-processing steps, the URL should look like this:
 
-> https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```
+https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```
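+
+If you have many links to convert, the same rewrite is easy to script. A minimal
+sketch (this helper is hypothetical and not part of the repository):
+
+```py
+def drive_direct_url(share_url: str) -> str:
+    """Turn a Google Drive share URL into the direct-download form above."""
+    file_id = share_url.split("/file/d/")[1].split("/")[0]
+    return f"https://drive.google.com/uc?id={file_id}"
+
+print(drive_direct_url(
+    "https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing"
+))  # prints https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
+```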
 
-Now, add a directive to `cache/directory.py` under `Contributing`. Since this doesn't have an online URL, this should use `CacheItem.cache_only`, to
-indicate that no other online database serves this URL.
+### 2.3: Add the example dataset's location to `cache/directory.py`
 
-Your new directive under the `directory` dictionary should look something as so, with one entry for every artifact:
+Now, add a directive to `cache/directory.py` that specifies the location of the example dataset. This should be added as a new (key, value) pair to the `directory` variable. Since the example dataset doesn't have an online URL, this should use `CacheItem.cache_only` to indicate that no other online database serves this URL.
+
+Your new directive under the `directory` dictionary should look something like this, with one entry for each of the four artifacts:
 
 ```python
 ...,
-"Contributing": {
+"ExampleData": {
     "interactome.tsv": CacheItem.cache_only(
-        name="Randomly-generated contributing interactome",
+        name="Randomly-generated example data interactome",
         cached="https://drive.google.com/uc?id=..."
     ),
     ...
 }
 ```
 
-### Setting up a workflow
+Step 3: Set up a workflow to run the example dataset
+----------
+
+Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile`, which will handle downloading the artifacts from the Google Drive links and running any scripts to reformat the artifacts into SPRAS-compatible formats.
 
-Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile`, which will handle:
-- Artifact downloading
-- Script running.
+In the example dataset, `sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.
 
-`sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.
+### 3.1: Write a `Snakefile` to fetch datasets
 
-Create a `Snakefile` under your dataset with the top-level directives:
+Navigate to the `datasets/example` directory and create a `Snakefile` with the top-level directives:
 
 ```python
 # This provides the `produce_fetch_rules` util that allows us to automatically fetch the Google Drive data.
@@ -145,6 +155,8 @@ produce_fetch_rules({
 })
 
+### 3.2: Write code to put example dataset files in a SPRAS-compatible format
+
 Create two scripts that make `gold-standard.tsv` and `interactome.tsv` SPRAS-ready, consulting the
 [SPRAS file format documentation](https://spras.readthedocs.io/en/latest/output.html). You can use any dependencies inside the
 top-level `pyproject.toml`, and you can test out your scripts with `uv run
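+
+As a sketch of what one of these scripts might look like (the file names, paths,
+and column headers below are hypothetical; confirm the exact headers against the
+SPRAS file format documentation):
+
+```py
+# datasets/example/process_interactome.py (hypothetical name)
+import pandas as pd
+
+# Assume the raw interactome is a headerless TSV: two node columns and a score.
+edges = pd.read_csv("interactome.tsv", sep="\t", header=None,
+                    names=["node1", "node2", "score"])
+
+# Rename to SPRAS-style edge columns and mark every edge as undirected.
+edges = edges.rename(columns={"node1": "Interactor1", "node2": "Interactor2",
+                              "score": "Weight"})
+edges["Direction"] = "U"  # "U" = undirected; check the docs for your SPRAS version
+edges.to_csv("interactome-spras.tsv", sep="\t", index=False)
+```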