Application of Protein Structure Encodings and Sequence Embeddings for Transporter Substrate Prediction

Installation

Requirements

- Linux (tested on Ubuntu 24.04 LTS in WSL2)
- miniforge
To recreate feature datasets (optional):
- Up to 200GB disk storage (PSSMs, for BLAST databases)
- GPU compatible with CUDA Toolkit 12.6+, >=16GB VRAM (embeddings)
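
Before recreating the feature datasets, it can help to check the two optional requirements above. The following is a minimal sketch, not part of the project code: the function names are hypothetical, and the GPU query assumes that the `nvidia-smi` command-line tool is installed and on the PATH.

```python
import shutil
import subprocess


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1000**3


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per line; skip lines that are not plain numbers."""
    values = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def query_vram_mib() -> list[int]:
    """Return total memory in MiB for each visible GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_vram_mib(result.stdout) if result.returncode == 0 else []
```

Calling `has_free_space(".", 200)` from the repository root checks the upper bound quoted for the BLAST databases, and `query_vram_mib()` lists the memory of each GPU so that it can be compared against the 16GB requirement.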

Create environment, install project code as python package

conda env create -f environment_full.yml
conda activate subpred_deeplearning
pip install -e .

A separate environment is used for the DNN notebooks (notebooks 08-14). All package versions are the same, except that the dnn_cpu environment uses the CPU version of TensorFlow to train DNNs, whereas the subpred_deeplearning environment contains the CUDA-accelerated variant of the package to generate embeddings on the GPU. DNN training is currently incompatible with the latest generation of Nvidia GPUs. Using the CPU version of TensorFlow also ensures full reproducibility of all results.

conda env create -f environment_dnn_cpu_full.yml
conda activate dnn_cpu
pip install -e .

Recreating results from manuscript

The raw data is available here:

/data/raw (113GB)

Running the 01_preprocessing notebook will turn the raw data into pre-processed pickles. To vastly speed up the feature computation, we saved the PSSMs and embeddings that we calculated for all proteins in the dataset in a cache folder. Once they are extracted into the appropriate folder, the feature generation methods will read these files instead of calculating everything from scratch. The preprocessed pickles, along with cached PSSMs and embeddings, are available for download here:

/data/datasets (1.4GB)
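
The caching behaviour described above follows a common "load if present, otherwise compute and store" pattern. The sketch below illustrates that pattern only. The function name, the one-pickle-per-protein layout, and the file naming are assumptions made for illustration; the actual cache format used by the project may differ.

```python
import pickle
from pathlib import Path
from typing import Callable


def load_or_compute(cache_dir: Path, protein_id: str, compute: Callable[[str], object]) -> object:
    """Return the cached feature for `protein_id`, computing and storing it on a cache miss."""
    # Hypothetical layout: one pickle file per protein accession.
    cache_file = cache_dir / f"{protein_id}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)

    # Cache miss: run the expensive computation (e.g. a PSSM or an embedding).
    result = compute(protein_id)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

With a populated cache folder, `compute` is never called, which is why extracting the provided archive avoids the BLAST searches and GPU inference entirely.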

After extracting the data into the matching folders (tar -xf from the root directory of the repository), the notebooks can be re-run. Run the SVM notebooks (02-06) first, using the subpred_deeplearning conda environment, and then the DNN notebooks (07-12) using the dnn_cpu environment, for the reasons given above. The SVM notebooks must run first because they export their feature data, which ensures that the DNN notebooks use the same datasets. Alternatively, the ML feature data that the SVM notebooks create and the DNN notebooks read is available as an archive here:

/data/tmp_data (37MB)
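
The run order described above (SVM notebooks in one environment, then DNN notebooks in the other) could be scripted roughly as follows. This is a sketch under assumptions: it presumes the notebooks are named with a numeric prefix followed by an underscore (as in 01_preprocessing), and the helper names are hypothetical. The number ranges should be adjusted to match the notebooks actually present in the repository.

```python
import re
from pathlib import Path


def notebooks_in_range(notebook_dir: Path, first: int, last: int) -> list[Path]:
    """Return notebooks whose numeric prefix lies in [first, last], ordered by that prefix."""
    selected = []
    for path in notebook_dir.glob("*.ipynb"):
        match = re.match(r"(\d+)_", path.name)
        if match and first <= int(match.group(1)) <= last:
            selected.append((int(match.group(1)), path.name, path))
    return [path for _, _, path in sorted(selected)]


def nbconvert_command(env_name: str, notebook: Path) -> list[str]:
    """Build a command that executes a notebook in place inside the given conda environment."""
    return [
        "conda", "run", "-n", env_name,
        "jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
        str(notebook),
    ]


def build_run_plan(notebook_dir: Path) -> list[list[str]]:
    """SVM notebooks first (subpred_deeplearning), then DNN notebooks (dnn_cpu)."""
    plan = [nbconvert_command("subpred_deeplearning", nb) for nb in notebooks_in_range(notebook_dir, 2, 6)]
    plan += [nbconvert_command("dnn_cpu", nb) for nb in notebooks_in_range(notebook_dir, 7, 12)]
    return plan
```

Each command in the returned plan can be passed to `subprocess.run(cmd, check=True)` in sequence, so that a failing SVM notebook stops the run before any DNN notebook reads incomplete feature data.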

Finally, the evaluation scores from all iterations of the repeated 5-fold cross validation, along with generated plots, are available here:

/data/results (3MB)
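
For readers unfamiliar with the evaluation scheme, repeated 5-fold cross validation reshuffles the data before each repetition and then uses every fold once as the test set. The sketch below shows the idea in plain Python. It is not the project's implementation: it is unstratified, the function name is hypothetical, and the number of repetitions (3 here) is an arbitrary placeholder, since the value used in the manuscript is not stated in this README.

```python
import random
from typing import Iterator


def repeated_kfold(
    n_samples: int, n_splits: int = 5, n_repeats: int = 3, seed: int = 0
) -> Iterator[tuple[int, int, list[int], list[int]]]:
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold cross validation."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        # A fresh shuffle per repetition gives a different partition each time.
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_splits folds of near-equal size.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for fold_idx, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != fold_idx for i in fold]
            yield repeat, fold_idx, sorted(train), sorted(test)
```

One evaluation score is produced per yielded split, so 5 folds with `n_repeats` repetitions give `5 * n_repeats` scores per model.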

How the raw data was assembled (do not run)

All commands used to assemble /data/raw are saved in the preprocessing folder. Note that these scripts always download the latest version of each database, so the contents of the datasets might change in the future. Uniref is version 2022_01 (it contains enough proteins to create evolutionary profiles, and we already had pre-calculated PSSMs for most proteins from a previous project); everything else was downloaded on 11.05.2025. The scripts were executed in this order:

GO annotations

./preprocessing/download_goa.sh

GOA UniProt (version 226), released on 06 May, 2025 and assembled using the publicly released data available in the source databases on 28 April, 2025.

AlphafoldDB PDB files for model organisms

./preprocessing/download_alphafolddb.sh

Version 4 released in 2022, downloaded on 15.05.2025

Additional tar files from alphafolddb (https://www.alphafold.ebi.ac.uk/download) can be added to the script, to include more organisms. They will automatically be pre-processed.

Uniprot data

conda activate subpred_deeplearning
./preprocessing/download_uniprot.sh

Uniprot Version 2025_02 released on 23.04.2025

GO OBO

./preprocessing/download_go.sh

GO version 2025-03-16

Uniref

./preprocessing/download_uniref.sh

Uniref version 2022_01

Interpro annotation names

./preprocessing/download_interpro.sh

Downloaded version 2025-04-22

Create 3Di fasta file and blast databases:

conda activate subpred_deeplearning
./preprocessing/create_blastdbs.sh
./preprocessing/create_3Di_fasta.sh
