Designed a cloud-native LLM pipeline for rare disease drug discovery, integrating fine-tuned generative models with phenotypic screening and semantic reasoning. Leveraged AWS EC2/S3 for scalable training and deployed TinyLlama-1.1B using LoRA/SFT to extract gene-disease-drug associations with pathway and literature validation.

Jaber-Valinejad/Target_Deconvolution

Target Identification and Drug Discovery for Rare Diseases through High-Throughput Screening and Phenotypic Assays

What is it?

This project focuses on identifying potential therapeutic targets for rare diseases by using phenotypic assays and high-throughput screening of compound libraries. The goal is to uncover small-molecule modulators that can address disease-specific cellular processes. By leveraging innovative target deconvolution methods and secondary screening, the project aims to accelerate drug discovery for rare diseases with limited treatment options, contributing to the development of effective therapies.

🗂️ Repository Structure

Jaber-Valinejad/
├── Data/                  <- Raw or processed datasets (not tracked by Git if large or sensitive)
├── Docs/                  <- Project documentation, notes, or references
├── Figs/                  <- Figures, charts, or visualizations generated during the project
├── Methods/               <- Scripts, notebooks, or descriptions of methods and workflows
├── LICENSE                <- License file specifying terms of use and distribution
├── README.md              <- Overview of the project, setup instructions, and repository guide (You are here!)
└── requirements-dev.txt   <- Python dependencies for development and testing

Data Collection and Sources

All data used for this project can be found in the Data folder. For literature mining, we also use the publication database in RDAS for the target diseases. To retrieve these publications, run the following query in Neo4j:

// Neo4j (Cypher)
MATCH (p)<-[m:MENTIONED_IN]-(g:GARD)
WHERE g.GardId = "GARD:0002027"
OPTIONAL MATCH (p:Article)-[r:ANNOTATION_FOR]-(t:PubtatorAnnotation)
WITH p, collect(t.text) AS texts
WITH p, reduce(all_texts = [], t IN texts | all_texts + t) AS all_texts
RETURN p.pubmed_id, p.title, p.abstractText, p.publicationYear, apoc.coll.toSet([text IN all_texts | toLower(text)]) AS unique_texts
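
The last two steps of the query flatten the collected annotation texts and reduce them to a lower-cased set. For readers who prefer to post-process raw results on the client side, the same aggregation can be sketched in Python. This is only an illustration: the function name and the sample input are hypothetical, and the result is sorted for reproducibility, which `apoc.coll.toSet` does not do.

```python
def unique_annotation_texts(texts):
    """Flatten annotation texts and return lower-cased unique values, sorted."""
    flat = []
    for item in texts:
        if item is None:
            # OPTIONAL MATCH can yield no annotation for an article.
            continue
        if isinstance(item, str):
            flat.append(item)
        else:
            # An annotation's text may itself be a list of strings.
            flat.extend(item)
    return sorted({text.lower() for text in flat})


# Hypothetical annotation texts for one article.
example = [["GeneA", "genea"], "Example Syndrome", None]
print(unique_annotation_texts(example))
```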

Target Deconvolution

Main Approach 1: Target Deconvolution via Predicted Genes using Pathway.ipynb

In this approach, we use tools like SwissDrugDesign or SuperPRED to predict genes associated with the newly identified compounds. The following methods are used to assess the association:

  • We first check the association between the predicted genes and the target disease.

  • Pathway Enrichment Analysis: After predicting the genes, we perform a pathway enrichment analysis using ShinyGO to identify enriched pathways. We then check the association between these enriched pathways and the target disease.

  • Enriched Biological Terms: Biological terms enriched in the target genes are analyzed using ShinyGO with all available gene sets as pathway databases. We then check the association between these enriched biological terms and the target disease.

Key Metrics:

  • Fold Enrichment: The percentage of genes in your list that belong to a pathway, divided by the corresponding percentage in the background genes. Higher values indicate stronger enrichment.
  • False Discovery Rate (FDR): Calculated using the Benjamini-Hochberg method to adjust for multiple comparisons.
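
The two metrics above can be reproduced by hand, which is useful for sanity-checking an enrichment table. The following is a minimal sketch, not ShinyGO's own code: the function names and the example counts are illustrative assumptions, and the adjustment is the standard Benjamini-Hochberg step-up procedure.

```python
def fold_enrichment(hits_in_list, list_size, hits_in_background, background_size):
    """Share of list genes in a pathway divided by the share in the background."""
    if list_size <= 0 or background_size <= 0 or hits_in_background <= 0:
        raise ValueError("sizes and background hits must be positive")
    return (hits_in_list / list_size) / (hits_in_background / background_size)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Illustrative numbers: 10 of 100 list genes vs 200 of 20000 background genes.
print(fold_enrichment(10, 100, 200, 20000))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```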

Main Approach 2: Target Deconvolution via Similar Compounds (CTD) using CTD.ipynb

This method leverages the Comparative Toxicogenomics Database (CTD) and ChEMBL to identify similar compounds related to the target disease. Steps include:

  • Identifying Targets: Using the 'chembl_webresource_client' with the organism 'Homo sapiens' to find targets.

  • Identifying Similar Compounds: We use the ChEMBL API to find similar compounds, and then find the related genes using UniProt.

  • Identifying Related Diseases: We look up the compound in PubChem, follow the CTD link given there, and use that information to find related diseases.

  • Association Checking: We check associations between the genes, phenotypes, and diseases related to the identified compounds and the target gene, using resources such as OMIM, Orphanet, and the Human Phenotype Ontology (HPO).
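
The similar-compound step depends on a similarity cut-off (0.8 is the value mentioned for this approach in the next section). As a rough illustration of how such a cut-off filters candidates, the sketch below assumes a Tanimoto-style score over fingerprint bits, which is the common choice for fingerprint similarity. The fingerprints, compound names, and helper functions are made up for the example and are not taken from CTD.ipynb.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_compounds(query_fp, library, threshold=0.8):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints (sets of on-bit positions).
query = {1, 2, 3, 4, 5}
library = {
    "cmpd_A": {1, 2, 3, 4, 5},
    "cmpd_B": {1, 2, 3, 4},
    "cmpd_C": {1, 2, 9},
}
print(similar_compounds(query, library, threshold=0.8))
```

Lowering the threshold returns more candidates at the cost of weaker structural similarity, which is the trade-off the CID-based approach tries to work around.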

Main Approach 3: Target Deconvolution via Similar Compounds (CID) using CID.ipynb

When a threshold of 0.8 is used in the CTD approach, only a few similar compounds may be identified. To increase the chance of finding associations, we start from PubChem Compound Identifiers (CIDs) and convert them to CTD codes, which link the compounds to relevant diseases and phenotypes. The conversion maps each CID to its corresponding CTD code. The steps are:

  1. Using SID map: To convert CIDs to CTD codes, we can use the SID-Map file, which contains the mapping between substances (SID), their registry identifiers, and their standardized CID. This is a gzipped text file that lists substances with their corresponding SID, source names, registry identifiers, and the CID (if available). We can use command-line tools to filter the SID-Map file and extract relevant mappings.
(structuredev12)(structure) gzc $PUBCHEM_FTP/Substance/Extras/SID-Map.gz | grep "Comparative Toxicogenomics Database" | egrep ' 30131$'
134223583       Comparative Toxicogenomics Database (CTD)       D013993 30131

SID-Map.gz: This is a listing of all (live) SIDs with their source names and registry identifiers, and the standardized CID if present. It is a gzipped text file where each line contains at least three columns: SID, tab, source name, tab, registry identifier; then a fourth column of tab, CID if there is a standardized CID for the given SID. This SID-Map file helps identify the standardized CID for substances and their corresponding CTD identifier, enabling the association between compounds and diseases.

  2. API Integration: Additional conversion can be done via the PubChem API to map CIDs to related diseases and phenotypes. Please refer to SID-Map.
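
The SID-Map filtering in step 1 can also be done in Python rather than on the command line. The sketch below assumes the tab-separated layout described above (SID, source name, registry identifier, optional CID). The function names and the local file path are hypothetical, and the sample line is the one shown in the command output.

```python
import gzip

CTD_SOURCE = "Comparative Toxicogenomics Database"


def parse_sid_map_line(line):
    """Split one SID-Map line into (sid, source, registry_id, cid or None)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        return None
    cid = parts[3] if len(parts) > 3 and parts[3] else None
    return parts[0], parts[1], parts[2], cid


def ctd_ids_for_cids(lines, cids):
    """Map each requested CID to the CTD registry identifiers found in lines."""
    wanted = {str(cid) for cid in cids}
    mapping = {}
    for line in lines:
        record = parse_sid_map_line(line)
        if record is None:
            continue
        _sid, source, registry_id, cid = record
        if cid in wanted and CTD_SOURCE in source:
            mapping.setdefault(cid, []).append(registry_id)
    return mapping


def ctd_ids_from_file(path, cids):
    """Stream a gzipped SID-Map file (path is an assumed local download)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return ctd_ids_for_cids(handle, cids)


sample = ["134223583\tComparative Toxicogenomics Database (CTD)\tD013993\t30131\n"]
print(ctd_ids_for_cids(sample, [30131]))
```

Streaming the file line by line keeps memory use flat, which matters because the full SID-Map lists every live SID.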

Association Checking

The association checking process is multi-faceted and involves:

  1. Literature Search: This includes the following steps: 1) a comprehensive search of the relevant scientific literature; 2) verification of associations against known datasets; 3) a check for co-occurrence at the sentence level. Please refer to CTD.ipynb.
  2. Semantic Similarity: Evaluating similarities between biological terms, diseases, genes, and phenotypes. Please refer to Pathway.ipynb.
  3. Scientific Evidence Mining using Translator: Using tools like Translator to mine scientific evidence for associations.

- We assess the association between genes, phenotypes, diseases, and target genes.

- In addition to the original terms, we consider their synonyms, descriptions, and clinical features obtained from sources such as OMIM and Orphanet.

- Synonyms for diseases and biological terms can be accessed through OMIM, Orphanet, and similar resources. For pathways, we refer to the Gene Ontology database.
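
A sentence-level check with synonym expansion, as described above, might look like the following sketch. It is a simplified illustration rather than the code in CTD.ipynb: the sentence splitter is a naive regular expression, the helper names are hypothetical, and the gene, disease, and abstract text are placeholders.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence, terms):
    """True if any term (or synonym) appears as a whole word, case-insensitive."""
    lowered = sentence.lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
        for term in terms
    )


def cooccurring_sentences(text, terms_a, terms_b):
    """Sentences that mention at least one term from each group."""
    return [
        sentence
        for sentence in split_sentences(text)
        if mentions(sentence, terms_a) and mentions(sentence, terms_b)
    ]


# Placeholder abstract, gene name, and disease synonyms.
abstract = (
    "Variants in GENE1 were reported in example syndrome. "
    "GENE1 expression was unchanged. "
    "Patients with ES showed anemia."
)
print(cooccurring_sentences(abstract, ["GENE1"], ["example syndrome", "ES"]))
```

Whole-word matching keeps short synonyms such as abbreviations from matching inside longer words, at the cost of missing terms that end in punctuation.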

Fine-Tuned LLM Model for Association Discovery

The annotation datasets are obtained through The Human Phenotype Ontology. These datasets include:

In addition to these datasets, we used a fine-tuning dataset available in FT_data_v2.csv to construct the final fine-tuning dataset. The final dataset was generated using the finetuning_datasets.ipynb notebook.

The fine-tuning process is detailed in the Lora.ipynb notebook.
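
As background on why LoRA keeps fine-tuning cheap: the original weight matrix stays frozen and only two low-rank factors are trained, one of shape (d_out × r) and one of shape (r × d_in). The sketch below counts the trainable parameters for a single projection. The dimensions and rank are illustrative assumptions and may differ from the settings used in Lora.ipynb.

```python
def full_params(d_in, d_out):
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two LoRA factors: B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Assumed example: a 2048 x 2048 projection adapted with rank 8.
d_in, d_out, rank = 2048, 2048, 8
dense = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(dense, lora, lora / dense)
```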

Dependencies

  • Python (version 3.x)
  • RDAS Python package
  • ShinyGO for pathway enrichment analysis
  • SwissDrugDesign and SuperPRED for gene prediction
  • ChEMBL API for compound information
  • HPO and OMIM for disease and phenotype data

Documentation

For more detailed documentation, please refer to the Docs folder.

Getting Help

For any issues or questions, please open an issue in the GitHub repository or contact the project maintainers.

Discussion and Development

We are working towards developing machine learning and deep learning models to predict genes associated with newly identified compounds. Currently, we are using tools like SwissDrugDesign and SuperPRED for gene-target predictions, which involve predicting genes based on the compounds' chemical structures. However, these tools have limitations, such as the inability to set prediction thresholds, leading to lower-confidence predictions (e.g., probabilities around 0.1). As we move forward, we plan to integrate machine learning and deep learning techniques to enhance the accuracy and reliability of these predictions. This will enable us to refine our approach, increase the confidence of gene-target associations, and accelerate the identification of promising therapeutic targets for rare diseases. To discuss new ideas, improvements, or any questions, please join the conversation in the Discussions section of the repository.
