128 changes: 117 additions & 11 deletions CONTRIBUTING.md
@@ -7,17 +7,123 @@ There are `TODOs` that would further enhance the reproducibility and accuracy of datasets

## Adding a dataset

See `datasets/diseases` as an example of a dataset. Datasets take some form of raw data from an online service and convert it into usable datasets
with associated gold standards for SPRAS to run on.

To add a dataset:
1. Check that your dataset provider isn't already added (some of these datasets act as providers for multiple datasets)
1. Create a new folder under `datasets/<your-dataset>`
1. Add an attached Snakefile that converts your `raw` data to `processed` data.
   - Make sure to use `uv` here. See the `diseases` Snakefile for an example.
1. Add your Snakefile to the top-level `run_snakemake.sh` file.
1. Add your datasets to the appropriate `configs`
- If your dataset has gold standards, make sure to include them here.
**Check that your data provider isn't already a dataset in `datasets`.** Some providers serve more data than an existing dataset currently uses; those datasets can be extended to fit your needs.

The goal of a dataset is to take raw data and produce data to be fed to SPRAS.
We'll follow along with `datasets/contributing`. This mini-tutorial assumes that you already have familiarity with SPRAS
[as per its contributing guide](https://spras.readthedocs.io/en/latest/contributing/index.html).

### Uploading raw data

This is a fake dataset: the data can be generated by running `datasets/contributing/raw_generation.py`, which outputs the following artifacts:
- `sources.txt`
- `targets.txt`
- `gold-standard.tsv`
- `interactome.tsv`

Unlike in this example, the data used in other datasets comes from external sources (whether that's supplementary info in a paper or a biological database like UniProt). These artifacts can be large and occasionally change upstream, so we store them in Google Drive for caching and download
them when we want to reconstruct a dataset.

Note that the four artifacts above change every time `raw_generation.py` is run. Upload those artifacts to Google Drive in a folder of your choice.
Share each file and allow _Anyone with the link_ to _View_ it.

Once shared, copying the URL should look something like:

> https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing

We always drop the entire `/view?...` suffix and replace `/file/d/` with `/uc?id=`, which turns the URL into a direct download link that we download internally
with [gdown](https://github.com/wkentaro/gdown). After those post-processing steps, the URL should look like:

> https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
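
If you have many artifacts, a tiny helper can do the rewrite for you. This is a minimal sketch (the `to_direct_download` helper is ours, not part of the repository), assuming the sharing URL follows the `/file/d/<id>/view` pattern shown above:

```python
import re

def to_direct_download(share_url: str) -> str:
    """Convert a Google Drive sharing URL into a direct download URL."""
    # Grab the file ID out of the `/file/d/<id>/view?...` sharing URL.
    match = re.search(r"/file/d/([^/]+)", share_url)
    if match is None:
        raise ValueError(f"Unrecognized Google Drive URL: {share_url}")
    return f"https://drive.google.com/uc?id={match.group(1)}"

print(to_direct_download(
    "https://drive.google.com/file/d/1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h/view?usp=sharing"
))
# https://drive.google.com/uc?id=1Agte0Aezext-8jLhGP4GmaF3tS7gHX-h
```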

Now, add a directive to `cache/directory.py` under `Contributing`. Since this data isn't served by an online database, the directive should use `CacheItem.cache_only`
to indicate that only the cached Google Drive copy exists.

Your new directive under the `directory` dictionary should look something like this, with one entry for every artifact:

```python
...,
"Contributing": {
    "interactome.tsv": CacheItem.cache_only(
        name="Randomly-generated contributing interactome",
        cached="https://drive.google.com/uc?id=..."
    ),
    ...
}
```

### Setting up a workflow

Now, we need to make these files SPRAS-compatible. To do this, we'll set up a `Snakefile`, which will handle:
- Downloading the artifacts
- Running the processing scripts

`sources.txt` and `targets.txt` are already in a SPRAS-ready format, but we need to process `gold-standard.tsv` and `interactome.tsv`.

Create a `Snakefile` under your dataset with the top-level directives:

```python
# This provides the `produce_fetch_rules` util that allows us to automatically fetch the Google Drive data.
include: "../../cache/Snakefile"

rule all:
    input:
        # The two files we will be passing to SPRAS
        "raw/sources.txt",
        "raw/targets.txt",
        # The two files we will be processing
        "processed/gold-standard.tsv",
        "processed/interactome.tsv"
```

We'll generate four `fetch` rules, or rules that tell Snakemake to download the data we uploaded to Google Drive earlier.

```python
produce_fetch_rules({
    # The value array is a path into the dictionary from `cache/directory.py`.
    "raw/sources.txt": ["Contributing", "sources.txt"],
    # and so on for targets, gold-standard, and interactome:
    # note that excluding these three stops the Snakemake file from working by design!
    ...
})
```

Create two scripts that make `gold-standard.tsv` and `interactome.tsv` SPRAS-ready, consulting
the [SPRAS file format documentation](https://spras.readthedocs.io/en/latest/output.html). You can use any dependencies from the top-level
`pyproject.toml`, and you can test your scripts with `uv run <script>` (`uv` is an installation requirement listed in the top-level README).


> [!TIP]
> Getting the current directory of your script prevents path errors. We use the snippet `Path(__file__).parent.resolve()`
> throughout the repository.
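
For instance, an interactome-processing script could follow the sketch below. This is only a sketch: the output columns (two interactors, a weight, and a direction) and the headerless layout are assumptions to verify against the SPRAS file format documentation, and the `scripts/` location simply matches the rule shown further down.

```python
# scripts/process_interactome.py -- a sketch; check the SPRAS file format
# documentation for the exact columns it expects.
from pathlib import Path

import pandas

# The script lives in `scripts/`, so the dataset folder is one level up.
dataset_dir = Path(__file__).parent.parent.resolve()

raw = pandas.read_csv(dataset_dir / "raw" / "interactome.tsv", sep="\t")

processed = pandas.DataFrame({
    "Interactor1": raw["Source"],
    "Interactor2": raw["Target"],
    # raw_generation.py emits unweighted, directed edges.
    "Weight": 1,
    "Direction": "D",
})

(dataset_dir / "processed").mkdir(exist_ok=True)
processed.to_csv(dataset_dir / "processed" / "interactome.tsv", sep="\t", index=False, header=False)
```

You can run it directly with `uv run scripts/process_interactome.py` from the dataset folder to check the output before wiring it into the `Snakefile`.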

Once you have your scripts, add rules that consume the raw data and produce your processed data. For example:

```py
rule interactome:
    input:
        "raw/interactome.tsv"
    output:
        "processed/interactome.tsv"
    shell:
        "uv run scripts/process_interactome.py"
```

Once you do the same for `gold-standard.tsv`, your data pipeline is ready! You can test it with `uv run snakemake --cores 1`.

### Adding to `run_snakemake.sh`

Make sure that your `Snakefile` is run inside the top-level `run_snakemake.sh` file.

### Adding to the SPRAS config

Since this is a pathway problem and not a disease mining problem, we'll mutate `configs/pra.yaml`. Add your dataset and gold standard to the
configuration. Since this dataset passes in a mix of raw and processed files, it is best to set `data_dir` to `datasets/contributing` and
then refer to individual files when linking node or edge files in the configuration.

To test these, use the `conda` environment from the `spras` submodule to run `snakemake` with SPRAS.

## Adding an algorithm

4 changes: 4 additions & 0 deletions datasets/contributing/.gitignore
@@ -0,0 +1,4 @@
gold-standard.tsv
interactome.tsv
sources.txt
targets.txt
10 changes: 10 additions & 0 deletions datasets/contributing/README.md
@@ -0,0 +1,10 @@
# Contributing Guide dataset

**This is an artificial dataset** that demonstrates how to make datasets.

This comes with a `raw_generation.py` script, which produces the associated raw data: the gold standard is `k` paths of length `n` with additional
Erdős-Rényi noise edges, where the sources and targets are the start and end of each path. The background interactome is the gold standard with
more edge and node noise. This is not a topologically-accurate emulation of (signaling) pathways, but it suffices to trick most pathway reconstruction
algorithms.

This does not cover the (very common!) task of ID mapping, as it varies considerably between datasets.
78 changes: 78 additions & 0 deletions datasets/contributing/raw_generation.py
@@ -0,0 +1,78 @@
import argparse
import itertools
from pathlib import Path
import random
import networkx
import uuid
import pandas

def random_id() -> str:
return uuid.uuid4().hex

def assign_ids(graph: networkx.DiGraph) -> networkx.DiGraph:
"""Assigns new IDs to a graph based on `random_id`"""
mapping = {node: random_id() for node in graph}
return networkx.relabel_nodes(graph, mapping)

def gnp_noise(graph: networkx.DiGraph, p: float):
"""
The mutative equivalent to networkx.gnp_random_graph,
whose original implementation does not consume a graph.
"""
for e in itertools.permutations(graph.nodes, 2):
if random.random() < p:
graph.add_edge(*e)

def generate_parser():
parser = argparse.ArgumentParser(prog='Pathway generator')
parser.add_argument("--path-count", type=int, default=10)
parser.add_argument("--path-length", type=int, default=7)

parser.add_argument("--sources-output", type=str, default="sources.txt")
parser.add_argument("--targets-output", type=str, default="targets.txt")

parser.add_argument("--gold-standard-noise", type=float, default=0.03)
parser.add_argument("--gold-standard-output", type=str, default="gold-standard.tsv")

parser.add_argument("--interactome-extra-nodes", type=int, default=400)
parser.add_argument("--interactome-noise", type=float, default=0.01)
parser.add_argument("--interactome-output", type=str, default="interactome.tsv")
return parser

def main():
args = generate_parser().parse_args()

graph = networkx.DiGraph()
sources: list[str] = []
targets: list[str] = []

# Add the path graphs to form the base of the pathway, while getting sources and targets as well.
for _ in range(args.path_count):
path_graph = networkx.path_graph(args.path_length, create_using=networkx.DiGraph())
path_graph = assign_ids(path_graph)

topological_sort = list(networkx.topological_sort(path_graph))
first_node, last_node = (topological_sort[0], topological_sort[-1])
sources.append(first_node)
targets.append(last_node)

graph = networkx.union(graph, path_graph)

Path(args.sources_output).write_text("\n".join(sources))
Path(args.targets_output).write_text("\n".join(targets))

# Then, we'll add some noise: this will be our gold standard.
gnp_noise(graph, args.gold_standard_noise)
gold_standard = pandas.DataFrame(((a, b) for a, b, _data in networkx.to_edgelist(graph)), columns=["Source", "Target"])
# We make the gold standard output a little annoying to force some post-processing with pandas.
gold_standard.insert(1, "Interaction-Type", "pp")
gold_standard.to_csv(args.gold_standard_output, index=False, sep='\t')

# and we'll follow along similarly to above to build our interactome.
graph.add_nodes_from((random_id() for _ in range(args.interactome_extra_nodes)))
gnp_noise(graph, args.interactome_noise)
interactome = pandas.DataFrame(((a, b) for a, b, _data in networkx.to_edgelist(graph)), columns=["Source", "Target"])
interactome.to_csv(args.interactome_output, index=False, sep='\t')

if __name__ == "__main__":
main()
1 change: 1 addition & 0 deletions pyproject.toml
@@ -8,6 +8,7 @@ dependencies = [
"bioservices>=1.12.1",
"gdown>=5.2.0",
"more-itertools>=10.7.0",
"networkx>=3.6.1",
"pandas>=2.3.0",
]

11 changes: 11 additions & 0 deletions uv.lock
