Designing Universal Primers for PCR amplification of the barcoding genes in all plant species

GenoRobotics-EPFL/Primer-Design

Important Note

  • Each of our Jupyter notebooks takes about an hour to run, because:

    • the function that chooses the number of clusters (pairwiseCrossValidation) takes about 20 minutes
    • the function that runs the alignment algorithm (runClustalInRange) takes about 30 minutes
  • DNABERT6 and Finetuned_DNABERT6 take about 4-6 hours to compute the embeddings

For these reasons, we provide all the files needed to reproduce our results in this Drive folder: https://drive.google.com/drive/folders/1KLiCzlLoEf0avWA5S5f40Lqq-E_cfNEX?usp=sharing

Repository Organization

  • run.ipynb : the model with the best results. The training loop is commented out, so you can simply load the .pth file.

    • We recommend not running the "runClustalInRange" function; the required files are available in the Drive folder mentioned above.
  • helper.py : contains all our utility functions; every notebook imports this file

  • plot_clusters.py : this code was provided to us by Antoine Tappy, who is working on a similar project

  • data_preprosessing.ipynb : contains all our data preprocessing; all the data we used is in Data.zip

  • Other_Notebook : since we tried many different models, we keep all our other notebooks here. Some files, such as helper.py, are duplicated there to simplify importing.

Requirements

  • clustal
  • biopython
  • numpy
  • sklearn
  • scipy
  • pytorch
  • matplotlib
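
Before starting a notebook, it can help to confirm that everything in the list above is available. The following is a minimal sketch, not part of the repository. The import names (`Bio` for biopython, `sklearn`, `torch` for pytorch) are the usual ones for these packages. The Clustal executable names are an assumption, since this README does not say which Clustal variant `runClustalInRange` calls.

```python
import importlib.util
import shutil

# Requirement name (as listed above) -> Python import name.
PYTHON_MODULES = {
    "biopython": "Bio",
    "numpy": "numpy",
    "sklearn": "sklearn",
    "scipy": "scipy",
    "pytorch": "torch",
    "matplotlib": "matplotlib",
}

# Assumption: candidate executable names for Clustal; the README does not
# state which one the notebooks expect.
CLUSTAL_BINARIES = ("clustalo", "clustalw", "clustalw2")


def check_requirements():
    """Return a dict mapping each requirement to True if it was found."""
    status = {
        name: importlib.util.find_spec(module) is not None
        for name, module in PYTHON_MODULES.items()
    }
    # Clustal is an external program, so look for it on the PATH instead.
    status["clustal"] = any(
        shutil.which(binary) is not None for binary in CLUSTAL_BINARIES
    )
    return status


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")
```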

Data & Pretrained Models:

Data

We downloaded our data from the NCBI website.

The procedure is as follows:

  • Search for rbcL
  • Filter for plants only
  • Select the sequence length
    • rbcL : 600 to 1000 -> ~90k sequences, ~80 MB
  • Download the records, which contain the metadata and the DNA sequence: click "Send to" (top right corner) > Complete Record > File > Format = Fasta > Sort by Taxonomy ID
  • Put the FASTA file into /Data
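
Once the file is in /Data, the length filter from the procedure above can be re-checked locally. The sketch below is illustrative only and is not the code used in data_preprosessing.ipynb. It uses a small hand-written FASTA reader so that it runs without dependencies; Biopython's `SeqIO.parse(handle, "fasta")` could be used instead. The helper names `read_fasta` and `filter_by_length` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=600, max_len=1000):
    """Keep records whose sequence length lies in [min_len, max_len]."""
    return [
        (header, seq)
        for header, seq in records
        if min_len <= len(seq) <= max_len
    ]
```

To use it, open the downloaded file and pass the file handle to `read_fasta`, then pass the result to `filter_by_length`. The default bounds match the 600 to 1000 range given for rbcL.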

Pretrained models

The pretrained models are also available in our Drive folder.

Running run.ipynb

  • To run run.ipynb, extract Data.zip into a folder with the same name (Data), in the same directory as run.ipynb
  • As mentioned in "Important Note", we highly recommend downloading the /clustal, /clusters and /plots folders from our Drive folder to avoid running the notebook for an hour
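
A quick check that the folders are in place can save a failed run. This is a minimal sketch under the assumption that the extracted Data folder and the downloaded clustal, clusters and plots folders all sit next to run.ipynb. The helper name `missing_dirs` is hypothetical and not part of helper.py.

```python
from pathlib import Path

# Folder names come from this README; placing all of them next to
# run.ipynb is an assumption.
EXPECTED_DIRS = ("Data", "clustal", "clusters", "plots")


def missing_dirs(root="."):
    """Return the expected folder names that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```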
