Target Identification and Drug Discovery for Rare Diseases through High-Throughput Screening and Phenotypic Assays
This project focuses on identifying potential therapeutic targets for rare diseases by using phenotypic assays and high-throughput screening of compound libraries. The goal is to uncover small-molecule modulators that can address disease-specific cellular processes. By leveraging innovative target deconvolution methods and secondary screening, the project aims to accelerate drug discovery for rare diseases with limited treatment options, contributing to the development of effective therapies.
- Data Collection and Sources
- Target Deconvolution
- Association Checking
- Fine-Tuned LLM Model for Association Discovery
- Dependencies
- Documentation
- Getting Help
- Discussion and Development
Jaber-Valinejad/
├── Data/ <- Raw or processed datasets (not tracked by Git if large or sensitive)
├── Docs/ <- Project documentation, notes, or references
├── Figs/ <- Figures, charts, or visualizations generated during the project
├── Methods/ <- Scripts, notebooks, or descriptions of methods and workflows
├── LICENSE <- License file specifying terms of use and distribution
├── README.md <- Overview of the project, setup instructions, and repository guide (You are here!)
└── requirements-dev.txt <- Python dependencies for development and testing
All data used for this project can be found in the Data folder. In addition, for literature mining, we use the publication database in RDAS related to the target diseases. To retrieve these publications, we can run the following query in Neo4j:
# Neo4j
MATCH (p)<-[m:MENTIONED_IN]-(g:GARD)
WHERE g.GardId = "GARD:0002027"
OPTIONAL MATCH (p:Article)-[r:ANNOTATION_FOR]-(t:PubtatorAnnotation)
WITH p, collect(t.text) AS texts
WITH p, reduce(all_texts = [], t IN texts | all_texts + t) AS all_texts
RETURN p.pubmed_id, p.title, p.abstractText, p.publicationYear, apoc.coll.toSet([text IN all_texts | toLower(text)]) AS unique_texts

Main Approach 1: Target Deconvolution via Predicted Genes using Pathway.ipynb
In this approach, we use tools like SwissDrugDesign or SuperPRED to predict genes associated with the newly identified compounds. The following methods are used to assess the association:
- We first check the association between the predicted genes and the target disease.
- Pathway Enrichment Analysis: After predicting the genes, we perform a pathway enrichment analysis using ShinyGO to identify any enriched pathways. Then, we check the association between these enriched pathways and the target disease.
- Enriched Biological Terms: Biological terms enriched in the target genes are analyzed using ShinyGO with all available gene sets as pathway databases. Then, we check the association between these enriched biological terms and the target disease.
- Fold Enrichment: The percentage of genes in your list that belong to a pathway, divided by the corresponding percentage in the background genes. Higher values indicate stronger enrichment.
- False Discovery Rate (FDR): Calculated using the Benjamini-Hochberg method to adjust for multiple comparisons.
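The two metrics above can be illustrated with a minimal Python sketch. It is not ShinyGO's implementation: the function names and argument layout are assumptions made for illustration, and the Benjamini-Hochberg step is the textbook procedure.

```python
from typing import List


def fold_enrichment(hits_in_list: int, list_size: int,
                    hits_in_background: int, background_size: int) -> float:
    """Fraction of list genes in a pathway divided by the background fraction."""
    list_fraction = hits_in_list / list_size
    background_fraction = hits_in_background / background_size
    return list_fraction / background_fraction


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, a pathway hit by 10 of 100 list genes and by 200 of 20,000 background genes has a fold enrichment of about 10.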
Main Approach 2: Target Deconvolution via Similar Compounds (CTD) using CTD.ipynb
This method leverages the Comparative Toxicogenomics Database (CTD) and ChEMBL to identify similar compounds related to the target disease. Steps include:
- Identifying Targets: We use the 'chembl_webresource_client', restricted to the organism "Homo sapiens", to find targets.
- Identifying Similar Compounds: We use the ChEMBL API to find similar compounds, then find the related genes using UniProt.
- Identifying Related Diseases: We look up the compound on PubChem, follow its CTD link, and use that information to find related diseases.
- Association Checking: We check associations between the target gene and any genes, phenotypes, and diseases related to the identified compounds, using resources such as OMIM, Orphanet, and the Human Phenotype Ontology (HPO).
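The similar-compound step might look roughly like the sketch below, which uses the similarity endpoint of 'chembl_webresource_client'. The helper names, the default cutoff of 80 percent, and the choice of returned fields are assumptions for illustration; the notebook CTD.ipynb may do this differently. The query function needs network access to ChEMBL.

```python
from typing import Dict, Iterable, List


def fetch_similar_compounds(smiles: str, threshold: int = 80) -> List[Dict]:
    """Query ChEMBL for molecules at least `threshold` percent similar to `smiles`."""
    # Imported lazily because the client contacts the ChEMBL web service.
    from chembl_webresource_client.new_client import new_client

    records = new_client.similarity.filter(
        smiles=smiles, similarity=threshold
    ).only(["molecule_chembl_id", "similarity"])
    return list(records)


def ids_above(records: Iterable[Dict], min_similarity: float) -> List[str]:
    """Keep ChEMBL IDs whose similarity score meets a stricter cutoff."""
    # Scores are converted with float() in case they arrive as strings.
    return [
        r["molecule_chembl_id"]
        for r in records
        if float(r["similarity"]) >= min_similarity
    ]
```

Separating the network call from the filtering makes it easy to re-filter one set of results at several cutoffs without querying ChEMBL again.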
Main Approach 3: Target Deconvolution via Similar Compounds (CID) using CID.ipynb
When a threshold of 0.8 is used in the CTD approach, few similar compounds may be identified. To increase the chance of finding associations, we use Compound Identifiers (CIDs) and convert them to CTD codes, which lets us link the compounds to relevant diseases and phenotypes. The conversion maps each CID to its respective CTD code. Below are the steps to accomplish this:
- Using SID map: To convert CIDs to CTD codes, we can use the SID-Map file, which contains the mapping between substances (SID), their registry identifiers, and their standardized CID. This is a gzipped text file that lists substances with their corresponding SID, source names, registry identifiers, and the CID (if available). We can use command-line tools to filter the SID-Map file and extract relevant mappings.
(structuredev12)(structure) gzc $PUBCHEM_FTP/Substance/Extras/SID-Map.gz | grep "Comparative Toxicogenomics Database" | egrep ' 30131$'
134223583 Comparative Toxicogenomics Database (CTD) D013993 30131

SID-Map.gz: This is a listing of all (live) SIDs with their source names and registry identifiers, and the standardized CID if present. It is a gzipped text file where each line contains at least three columns: SID, tab, source name, tab, registry identifier; then a fourth column of tab, CID if there is a standardized CID for the given SID. This SID-Map file helps identify the standardized CID for substances and their corresponding CTD identifier, enabling the association between compounds and diseases.
- API Integration: Additional conversion can be done via the PubChem API to map CIDs to related diseases and phenotypes. Please refer to SID-Map.
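The SID-Map filtering can also be done in Python. The sketch below assumes the tab-separated column layout described above (SID, source name, registry identifier, optional CID) and that the CTD registry identifier is the code we want. The function names and the source-name prefix match are illustrative assumptions.

```python
import gzip
from typing import Dict, Iterable

CTD_SOURCE_PREFIX = "Comparative Toxicogenomics Database"


def cid_to_ctd(lines: Iterable[str],
               source_prefix: str = CTD_SOURCE_PREFIX) -> Dict[str, str]:
    """Map each standardized CID to the CTD registry identifier found in SID-Map rows."""
    mapping: Dict[str, str] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Rows with fewer than four columns have no standardized CID.
        if len(fields) < 4 or not fields[1].startswith(source_prefix):
            continue
        registry_id, cid = fields[2], fields[3]
        if cid:
            # Keep the first CTD identifier seen for a given CID.
            mapping.setdefault(cid, registry_id)
    return mapping


def load_sid_map(path: str) -> Dict[str, str]:
    """Read a local copy of SID-Map.gz and build the CID-to-CTD mapping."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return cid_to_ctd(handle)
```

With the example row shown above, `cid_to_ctd` would map CID 30131 to the CTD identifier D013993.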
The association checking process is multi-faceted and involves:
- Literature Search: This includes the following steps: 1) a comprehensive search through relevant scientific literature; 2) verifying associations through known datasets; 3) checking co-occurrence at the sentence level. Please refer to CTD.ipynb.
- Semantic Similarity: Evaluating similarities between biological terms, diseases, genes, and phenotypes. Please refer to Pathway.ipynb.
- Scientific Evidence Mining using Translator: Using tools like Translator to mine scientific evidence for associations.
- We assess the association between genes, phenotypes, diseases, and target genes.
- In addition to the original terms, we consider their synonyms, descriptions, and clinical features obtained from sources such as OMIM and Orphanet.
- Synonyms for diseases and biological terms can be accessed through sources such as OMIM and Orphanet. For pathways, we refer to the Gene Ontology database.
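A sentence-level check of whether two kinds of terms appear together can be sketched as follows. This is a simplified illustration, not the code in CTD.ipynb: the naive sentence splitter, the function names, and the case-insensitive whole-word matching are all assumptions. Synonyms are handled by passing them in the term lists.

```python
import re
from typing import Iterable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence: str, terms: Iterable[str]) -> bool:
    """True if any term occurs in the sentence as a whole word, ignoring case."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", sentence, flags=re.IGNORECASE)
        for term in terms
    )


def cooccurring_sentences(text: str, terms_a: Iterable[str],
                          terms_b: Iterable[str]) -> List[str]:
    """Return sentences that mention a term from each list."""
    a, b = list(terms_a), list(terms_b)
    return [s for s in split_sentences(text) if mentions(s, a) and mentions(s, b)]
```

Applied to an abstract, with a gene and its aliases in one list and a disease and its synonyms in the other, the function returns only the sentences that name both.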
The annotation datasets are obtained through The Human Phenotype Ontology. These datasets include:
In addition to these datasets, we utilized a fine-tuned dataset available in FT_data_v2.csv to construct the final fine-tuning dataset. The final dataset was generated using the finetuning_datasets.ipynb notebook.
The fine-tuning process is detailed in the Lora.ipynb notebook.
- Python (version 3.x)
- RDAS Python package
- ShinyGO for pathway enrichment analysis
- SwissDrugDesign and SuperPRED for gene prediction
- ChEMBL API for compound information
- HPO and OMIM for disease and phenotype data
For more detailed documentation, please refer to the Docs folder.
For any issues or questions, please open an issue in the GitHub repository or contact the project maintainers.
We are working towards developing machine learning and deep learning models to predict genes associated with newly identified compounds. Currently, we are using tools like SwissDrugDesign and SuperPRED for gene-target predictions, which involve predicting genes based on the compounds' chemical structures. However, these tools have limitations, such as the inability to set prediction thresholds, leading to lower-confidence predictions (e.g., probabilities around 0.1). As we move forward, we plan to integrate machine learning and deep learning techniques to enhance the accuracy and reliability of these predictions. This will enable us to refine our approach, increase the confidence of gene-target associations, and accelerate the identification of promising therapeutic targets for rare diseases. To discuss new ideas, improvements, or any questions, please join the conversation in the Discussions section of the repository.
