emb2dis: protein disorder prediction tool

This repository contains a deep learning tool for predicting intrinsically disordered regions (IDRs) in protein sequences.

The tool generates embeddings from raw protein sequences using a pre-trained protein language model (pLM), then predicts per-residue disorder probabilities with a deep learning model trained on the DisProt dataset (2023_12) and tested on the CAID3v3 benchmarks. The output includes per-residue disorder scores, plots of disorder along the sequence, and summary statistics.
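To make the "summary statistics" output more concrete, here is a minimal sketch of how per-residue scores could be summarized. The 0.5 threshold, the function names `disordered_regions` and `summarize`, and the reported fields are assumptions for illustration; the statistics that emb2dis actually prints may differ.

```python
from typing import Dict, List, Sequence, Tuple


def disordered_regions(
    scores: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (start, end) pairs (0-based, end exclusive) for each run of
    consecutive residues whose score is >= threshold.

    The 0.5 threshold is an assumption, not a documented emb2dis setting.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new disordered run begins here
        elif start is not None:
            regions.append((start, i))  # the run ended at the previous residue
            start = None
    if start is not None:
        regions.append((start, len(scores)))  # run extends to the sequence end
    return regions


def summarize(scores: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Compute simple per-sequence disorder statistics (hypothetical fields)."""
    regions = disordered_regions(scores, threshold)
    n = len(scores)
    n_disordered = sum(end - start for start, end in regions)
    return {
        "length": n,
        "disordered_residues": n_disordered,
        "disordered_fraction": n_disordered / n if n else 0.0,
        "num_regions": len(regions),
        "longest_region": max((end - start for start, end in regions), default=0),
    }


if __name__ == "__main__":
    example_scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.6, 0.95, 0.3]
    print(summarize(example_scores))
```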

Environment setup

Clone the repository:

git clone https://github.com/sofiaaduarte/emb2dis.git
cd emb2dis

Create a virtual environment:

conda create -n emb2dis python=3.11
conda activate emb2dis

Install required packages:

pip install -r requirements.txt

Usage

The main script is predict_disorder.py. You can provide a FASTA file containing one or more protein sequences:

python predict_disorder.py --fasta data/samples.fasta

This script will:

1. Read all sequences from the FASTA file.
2. Generate embeddings using the specified pLM (ProtT5 by default).
3. Predict disorder scores for each residue using a sliding window approach.
4. Save results (CSV and plots) to the output directory (./results/ by default).
5. Print disorder statistics to the console.
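The sliding window step can be illustrated with a short sketch. It assumes that a model scores every residue inside a fixed-size window and that overlapping windows are averaged per residue; the window size, the step, the averaging rule, and the names `sliding_window_scores` and `score_window` are all hypothetical and are not taken from the emb2dis source.

```python
from typing import Callable, List, Sequence

Embedding = Sequence[float]


def sliding_window_scores(
    embeddings: Sequence[Embedding],
    score_window: Callable[[Sequence[Embedding]], List[float]],
    window: int = 5,
    step: int = 1,
) -> List[float]:
    """Average per-residue scores over overlapping windows.

    `score_window` stands in for the trained model: it takes the embeddings
    of one window and returns one score per residue in that window.
    """
    if window < 1 or not 1 <= step <= window:
        raise ValueError("require window >= 1 and 1 <= step <= window")
    n = len(embeddings)
    if n == 0:
        return []
    window = min(window, n)  # short sequences fit in a single window
    starts = list(range(0, n - window + 1, step))
    if starts[-1] != n - window:
        starts.append(n - window)  # make sure the tail residues are covered
    totals = [0.0] * n
    counts = [0] * n
    for start in starts:
        window_scores = score_window(embeddings[start:start + window])
        for offset, score in enumerate(window_scores):
            totals[start + offset] += score
            counts[start + offset] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    # Toy "model": the score of a residue is the mean of its embedding vector.
    toy_model = lambda win: [sum(vec) / len(vec) for vec in win]
    toy_embeddings = [[0.0], [1.0], [0.5], [0.25]]
    print(sliding_window_scores(toy_embeddings, toy_model, window=2))
```

Averaging is only one way to merge overlapping windows; taking the score from the window centered on each residue is another common choice.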

Command-line Arguments

| Argument | Short | Description |
|---|---|---|
| `--fasta` | `-f` | Path to input FASTA file (required). |
| `--model` | `-m` | Protein language model: `ProtT5` (default) or `ESM2`. |
| `--output-dir` | `-o` | Directory to save predictions (.csv) and plots (.png) (default: `./results/`). |
| `--device` | `-d` | Device: `cpu`, `cuda` (default), `cuda:0`, etc. |
| `--verbose` | `-v` | Enable verbose output for detailed progress (default: `False`). |
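For readers who want to see how these options fit together, the following is a sketch of an `argparse` parser that mirrors the table above. It is not copied from predict_disorder.py; the function name `build_parser` and the help strings are assumptions, and the real script may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Predict intrinsically disordered regions from a FASTA file."
    )
    parser.add_argument("-f", "--fasta", required=True,
                        help="Path to input FASTA file.")
    parser.add_argument("-m", "--model", choices=["ProtT5", "ESM2"],
                        default="ProtT5", help="Protein language model.")
    parser.add_argument("-o", "--output-dir", default="./results/",
                        help="Directory for predictions (.csv) and plots (.png).")
    parser.add_argument("-d", "--device", default="cuda",
                        help="Device, e.g. cpu, cuda, cuda:0.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose output.")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real file.
    args = build_parser().parse_args(
        ["--fasta", "data/samples.fasta", "--model", "ESM2", "--device", "cpu"]
    )
    print(args)
```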

Examples

1. Specify output directory and verbose mode:

python predict_disorder.py --fasta data/samples.fasta --output-dir my_results/ --verbose

2. Use ESM2 model on CPU:

python predict_disorder.py --fasta data/samples.fasta --model ESM2 --device cpu

3. Use a specific GPU:

python predict_disorder.py --fasta data/samples.fasta --device cuda:1

Models

Supported Protein Language Models

| Model | Description | Embedding Size | Reference | Repository |
|---|---|---|---|---|
| ESM2 | ESM-2 (650M parameters) | 1280 | Lin et al., 2023 | facebookresearch/esm |
| ProtT5 | ProtT5-XL (half precision) | 1024 | Elnaggar et al., 2021 | rostlab/ProtTrans |

The disorder prediction models are trained specifically for each pLM.
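Because each predictor is tied to one pLM, an embedding produced by the wrong model will have the wrong dimensionality. The sketch below shows one way such a mismatch could be caught early, using the embedding sizes listed for ESM2 and ProtT5; the names `EMBEDDING_SIZES` and `check_embedding` are hypothetical and do not come from the emb2dis code.

```python
from typing import Sequence

# Embedding sizes as listed in this README.
EMBEDDING_SIZES = {"ESM2": 1280, "ProtT5": 1024}


def check_embedding(model: str, embedding: Sequence[Sequence[float]]) -> None:
    """Raise ValueError if any per-residue vector has the wrong size
    for the chosen pLM (illustrative helper, not part of emb2dis)."""
    if model not in EMBEDDING_SIZES:
        raise ValueError(f"unknown model: {model!r}")
    expected = EMBEDDING_SIZES[model]
    for index, vector in enumerate(embedding):
        if len(vector) != expected:
            raise ValueError(
                f"residue {index}: expected {expected} dimensions for "
                f"{model}, got {len(vector)}"
            )


if __name__ == "__main__":
    # Two residues with ProtT5-sized vectors pass the check silently.
    check_embedding("ProtT5", [[0.0] * 1024, [0.0] * 1024])
    print("embedding shape is consistent with ProtT5")
```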

Additional models will be added in future releases.

Additional notes

ESM2 Sequence Limit: The ESM2 model supports protein sequences up to 1024 residues. Any input exceeding this length will be truncated automatically, and a warning will be issued if this occurs.
Sequence preprocessing: Non-canonical amino acids (U, Z, O, B) are automatically converted to 'X' before generating embeddings.
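The two notes above can be sketched in a few lines. This is an illustration of the described behavior, not the tool's own code: the function name `preprocess`, the constants, and the warning text are assumptions, and the sketch only handles uppercase residue letters.

```python
import warnings

NON_CANONICAL = "UZOB"   # residues replaced by 'X', as listed above
ESM2_MAX_LENGTH = 1024   # ESM2 sequence limit, as stated above

# Translation table mapping each non-canonical residue to 'X'.
_TO_X = str.maketrans(NON_CANONICAL, "X" * len(NON_CANONICAL))


def preprocess(sequence: str, model: str = "ProtT5") -> str:
    """Replace non-canonical residues and, for ESM2 only, truncate
    sequences longer than the limit while issuing a warning."""
    cleaned = sequence.translate(_TO_X)
    if model == "ESM2" and len(cleaned) > ESM2_MAX_LENGTH:
        warnings.warn(
            f"Sequence of length {len(cleaned)} truncated to "
            f"{ESM2_MAX_LENGTH} residues for ESM2."
        )
        cleaned = cleaned[:ESM2_MAX_LENGTH]
    return cleaned


if __name__ == "__main__":
    print(preprocess("MUZOBK"))
    print(len(preprocess("A" * 1100, model="ESM2")))
```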
