
Tiresias

Computational framework for disease gene prediction through supervised learning on multiplex biological networks.

Version

Tiresias is still under development. The current release is a beta version.

Getting started

This software has been tested on Ubuntu 18.04. It is meant to be run on Linux systems.

Prerequisites

Installing

  1. Clone the repository and change into its directory.
git clone https://github.com/RausellLab/Tiresias.git && cd Tiresias
  2. Set up the Conda environment and activate it.
conda env create -f environment.yml && conda activate tiresias
  3. Pull the node2vec Docker image.
make node2vec-image

If this step fails, first check that Docker is configured properly. See "Manage Docker as a non-root user" in the Docker documentation.
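
Before troubleshooting further, it can help to confirm that the command-line tools used in the steps above are on the PATH. The following is a minimal sketch, not part of Tiresias: the function name and the tool list are assumptions, derived only from the commands shown in this README (and from the guess that `make node2vec-image` calls `docker`).

```python
import shutil

# Assumed list: tools invoked by the installation commands in this README.
# It is not taken from an official requirements file.
REQUIRED_TOOLS = ("git", "conda", "docker", "make")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

Finding `docker` on the PATH does not prove that the current user is allowed to talk to the Docker daemon, so a permissions problem can still occur after this check passes.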

You can now check that the framework works properly by launching the data pipeline with dummy input data:

make pipeline

You can then browse the results in the MLflow UI. Launch it with:

mlflow ui

Then, go to http://localhost:5000/ in your web browser.

Once you're done, clean up the intermediary and output files created by the pipeline:

make clean-all

Usage

Configuration

  1. Datasets

Paths to the input datasets must be set in config.yml.

  2. Parameters

Set the parameters used for featurization and for running the models by editing the JSON files located in parameters/.

  • parameters/features.json: parameters used to run random walks and to create node embeddings.
  • parameters/models_validation.json: model parameters used during the validation stage.
  • parameters/models_test.json: model parameters used during the test stage. These parameters are also used to output the final ranking of the nodes during the predict stage.

  3. System and models

System resources and the set of enabled models are configured in config.yml.
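
A syntax error in one of the JSON parameter files would only surface once the corresponding pipeline step runs, so a quick parse check after editing can save time. The sketch below is an illustration under assumptions: `check_parameter_files` is a hypothetical helper, not part of Tiresias, and it only verifies that the three files named above exist and contain valid JSON. It does not know which keys the pipeline expects.

```python
import json
from pathlib import Path

# File names taken from the list above; the directory is relative to the
# repository root.
PARAMETER_FILES = ("features.json", "models_validation.json", "models_test.json")


def check_parameter_files(directory="parameters"):
    """Map each expected file name to None if it parses, else to an error message."""
    status = {}
    for name in PARAMETER_FILES:
        path = Path(directory) / name
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
            status[name] = None
        except FileNotFoundError:
            status[name] = "file not found"
        except json.JSONDecodeError as error:
            status[name] = f"invalid JSON at line {error.lineno}: {error.msg}"
    return status


if __name__ == "__main__":
    for name, problem in check_parameter_files().items():
        print(f"{name}: {problem or 'ok'}")
```

config.yml is left out on purpose: parsing YAML needs a third-party package, and this sketch stays within the standard library.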

Run pipeline

Before running any pipeline step, the Conda environment must be activated with:

conda activate tiresias

The pipeline can be run from start to finish with a single command:

make pipeline

It is also possible to run the pipeline steps individually.

  1. Data preprocessing
make data
  2. Random walks
make random-walks
  3. Node embeddings
make embeddings
  4. Validation
make validation
  5. Test
make test
  6. Predict
make predict
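
When only part of the pipeline needs to be rerun, the individual targets above can be chained from a small script. This is a sketch under assumptions: `targets_between` and `run_targets` are hypothetical helpers, not part of Tiresias, and the script assumes it is started from the repository root with the Conda environment already activated.

```python
import subprocess

# Make targets in pipeline order, as listed above.
PIPELINE_TARGETS = ["data", "random-walks", "embeddings", "validation", "test", "predict"]


def targets_between(first, last):
    """Return the make targets from `first` to `last`, inclusive, in pipeline order."""
    start = PIPELINE_TARGETS.index(first)  # raises ValueError for unknown targets
    stop = PIPELINE_TARGETS.index(last)
    if start > stop:
        raise ValueError(f"{first!r} comes after {last!r} in the pipeline")
    return PIPELINE_TARGETS[start:stop + 1]


def run_targets(first, last, dry_run=False):
    """Run `make <target>` for each selected step, stopping at the first failure."""
    for target in targets_between(first, last):
        command = ["make", target]
        print("Running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if make exits with an error.
            subprocess.run(command, check=True)
```

For example, `run_targets("embeddings", "test")` would run the embeddings, validation and test steps in that order, and `run_targets("data", "predict", dry_run=True)` only prints the commands.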

Visualize results with MLflow

Model run results are recorded with MLflow and stored in the mlruns/ directory.

To browse them, launch the MLflow UI:

mlflow ui

Then, go to http://localhost:5000/ in your web browser.

Reports

The validation, test and predict steps of the pipeline each generate a report summarizing their results. The report files are written to reports/.

Intermediary files

Intermediary files generated when running pipeline steps and their associated metadata are stored in artifacts/.

Note that the pipeline scripts automatically pick up the files present in artifacts/ as inputs for downstream pipeline steps. This allows for more flexibility when running pipeline steps separately. However, you may want to rerun experiments from scratch, or start over with completely different data or parameters. In that case, move or delete the existing artifacts/ directory, then re-run the pipeline from the start with the new inputs.
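
Moving the directory aside rather than deleting it keeps earlier intermediary files available for comparison. A minimal sketch of that step, assuming it is run from the repository root; `archive_artifacts` and the timestamped naming scheme are hypothetical, not part of Tiresias.

```python
import shutil
import time
from pathlib import Path


def archive_artifacts(artifacts_dir="artifacts"):
    """Rename the artifacts directory with a timestamp suffix.

    Returns the new path, or None if there was nothing to archive.
    """
    source = Path(artifacts_dir)
    if not source.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    destination = source.with_name(f"{source.name}-{stamp}")
    if destination.exists():
        # Avoid moving the directory inside an existing archive by accident.
        raise FileExistsError(f"{destination} already exists")
    shutil.move(str(source), str(destination))
    return destination
```

After calling `archive_artifacts()`, running `make pipeline` starts again from the input datasets, since no artifacts/ directory is left for the scripts to pick up.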

Built with

References

Mordelet, F. and Vert, J.-P. (2011). ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics, 12(1):389.

Mordelet, F. and Vert, J.-P. (2014). A bagging SVM to learn from positive and unlabeled examples. Pattern Recognition Letters, 37:201–209.

Lovász, L. (1993). Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty, 2(1):1–46.

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328. MIT Press.

Li, Y. and Li, J. (2012). Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data. BMC Genomics, 13(7):S27.

Valdeolivas, A., Tichit, L., Navarro, C., Perrin, S., Odelin, G., Levy, N., ... & Baudot, A. (2018). Random walk with restart on multiplex and heterogeneous biological networks. Bioinformatics, 35(3), 497-505.

Grover, A. and Leskovec, J. (2016). node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, pages 855–864, San Francisco, California, USA. ACM Press.

Kipf, T. N. and Welling, M. (2016). Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907 [cs, stat].

Schlichtkrull, M., Kipf, T. N., Bloem, P., van den Berg, R., Titov, I., and Welling, M. (2017). Modeling Relational Data with Graph Convolutional Networks. arXiv:1703.06103 [cs, stat].

License

Tiresias is released under the Apache License 2.0.

Contact

Please address comments and questions about Tiresias to thibaud.martinez@gmail.com, stefani.dritsa@institutimagine.org, chloe-agathe.azencott@mines-paristech.fr and antonio.rausell@inserm.fr.
