DockBox2

Graph Neural Network Model to improve docking predictions

DockBox2 (DBX2) is a sequel to DockBox that combines the concept of consensus docking with machine learning to improve docking predictions. In short, DBX2 provides the ability to train and run a GNN model based on inductive representation learning (GraphSAGE) to better interpret docking results (e.g., generated by DBX). DBX2 can be used in two modes: the 'node' mode, which estimates pose correctness, and the 'graph' mode, which estimates binding affinity.

Installation

The easiest way to install DockBox2 is to create a virtual environment. In this way, DockBox2 and its dependencies can easily be installed in user space without clashing with potentially incompatible system-wide packages.

Once virtualenv has been properly installed, simply type (and press the return key)

virtualenv env

on the command line followed by

source env/bin/activate

to activate the virtual environment (do not forget to activate your environment every time you log into a new shell environment).

Finally, the DockBox2 package can be set up by changing to the DockBox2 installation directory and typing:

python setup.py install

Installation is complete!

Procedure

[1] Split data and retrain the model

[1.1] Split the training data

The trained models and training data are available at https://doi.org/10.5281/zenodo.17254874. The training data were prepared from PDBbind2016 (https://www.pdbbind-plus.org.cn/), as mentioned in the paper.

To re-train the model, the training data need to be split into training, validation, and test sets by using the "split_train_val_dbx2" command:

split_train_val_dbx2 -g <training data> --normalize --cutoff <A> --maxnodes <B> --seed <C> --train <D>
  • <training data>: training data prepared as graphs (.pkl format).
  • <A>: the RMSD cutoff used to assign correctness labels.
  • <B>: the maximum number of poses per graph in the training set.
  • <C>: the random seed.
  • <D>: the fraction of data assigned to the training set.

For example:

split_train_val_dbx2 -g graphs.pkl --normalize --cutoff 2.0 --maxnodes 70 --seed 123 --train 0.9

This will automatically split the training data into training, validation, and test sets, and provide the Alpha coefficient for pose correctness. After this step, three pickle files will be generated (e.g., train.pkl, val.pkl, test.pkl).
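
As a rough illustration of what the cutoff, seed, and training fraction control, the sketch below labels poses by an RMSD cutoff and performs a seeded three-way split. It is not the code behind split_train_val_dbx2: the function names are hypothetical, the use of "<=" at the cutoff is an assumption, and so is the even division of the non-training data between validation and test sets.

```python
import random


def label_poses(rmsds, cutoff=2.0):
    """Label a pose 1 (correct) if its RMSD to the reference is within the cutoff.

    Assumption: the boundary value counts as correct (<=).
    """
    return [1 if r <= cutoff else 0 for r in rmsds]


def split_items(items, train_fraction=0.9, seed=123):
    """Shuffle items reproducibly and split them into train/val/test lists.

    Assumption: the data left over after the training fraction is divided
    evenly between validation and test.
    """
    rng = random.Random(seed)          # local generator, so the seed fully fixes the split
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    rest = shuffled[n_train:]
    n_val = len(rest) // 2
    return shuffled[:n_train], rest[:n_val], rest[n_val:]


# Made-up RMSD values and 100 placeholder items standing in for graphs.
labels = label_poses([0.8, 1.9, 2.0, 3.5], cutoff=2.0)
train, val, test = split_items(range(100), train_fraction=0.9, seed=123)
```

Using the same seed gives the same split, which is why the seed is worth recording alongside the generated pickle files.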

[1.2] Re-train the model

The model can be re-trained by using the following command:

traindbx2 -f <A> -t <B> -v <C> -p <D> --seed <E> --task <F> -w <G>
  • <A>: the configuration file (e.g., config.ini) that provides all settings for training and prediction.
  • <B> and <C>: the graphs for training and validating the model, generated in step [1.1].
  • <D>: the patience value used for early stopping (default = 3).
  • <E>: the random seed.
  • <F>: the task level for training the model (default = node). The node level is recommended so that the model can predict both binding pose (node level) and binding affinity (graph level).
  • <G>: the name of the trained model, saved as an .h5 file.

For example:

traindbx2 -f config.ini -t train.pkl -v val.pkl -p 4 --seed 123 --task node graph -w model_w.h5
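
To make the role of the patience value concrete, here is a minimal sketch of patience-based early stopping. It assumes that the monitored quantity is a validation loss that should decrease; which metric traindbx2 actually monitors is not stated here, and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None if it never stops early.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 2.
losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.63]
stop_default = early_stopping_epoch(losses, patience=3)
stop_patient = early_stopping_epoch(losses, patience=4)
```

With these made-up losses, a patience of 3 stops training one epoch earlier than a patience of 4, so a larger value gives the model more chances to recover from a temporary plateau at the cost of longer training.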

[2] Create graph for DBX2 prediction

DBX2 is a GNN model that learns and makes predictions based on pose ensembles. Therefore, users need to generate multiple protein–ligand binding poses and calculate the pairwise RMSD (i.e., the RMSD between each pair of binding poses).

[2.1] Generation of protein–ligand binding poses and rescoring

To create graphs for prediction with DBX2, users need to perform molecular docking to generate multiple binding poses for each protein–ligand complex using various docking programs (e.g., AutoDock, Vina, Gnina, DOCK, etc.). We recommend using DBX1 (https://github.com/jp43/DockBox), as it can automatically perform molecular docking and rescoring. After molecular docking, the generated binding poses need to be rescored with several docking programs.

We recommend generating a total of 80-140 poses for each protein-ligand complex.

For example:

  • Total of 140 poses: 60 from AutoDock, 20 from Vina, 60 from DOCK
  • Total of 80 poses: 60 from AutoDock, 20 from Vina

The final data used to create the graphs need to be saved as a CSV file. An example is provided as "results.csv" at https://doi.org/10.5281/zenodo.17254874.

Importantly, all poses need to be rescored with Vina, because all energetic terms used as node features must be calculated by Vina (e.g., gauss1_inter, gauss2_inter, gauss1_intra, gauss2_intra, hydrophobic_inter, hydrophobic_intra, etc.). These energetic features should be saved as CSV files in a directory named vina_scores, with file names such as vina_scores_001.csv, vina_scores_002.csv, etc. If you have only a small number of protein-ligand systems, a single file (vina_scores_001.csv) is sufficient. An example CSV file of energetic features can be found in vina_scores.tar (https://doi.org/10.5281/zenodo.17254874).

Make sure that the mol2file entries match between the Vina CSV files (vina_scores_<number>.csv) and the docking CSV (results.csv). This is important because the script generates graphs by matching data from the mol2file column.
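
A quick consistency check along these lines can catch mismatches before graph creation. The sketch below compares the mol2file column of two CSV files using only the standard library. The function names, the extra column names, and the rows are made up for illustration; only the mol2file column name comes from the description above.

```python
import csv
import io


def mol2file_set(handle):
    """Collect the values of the mol2file column from an open CSV file."""
    return {row["mol2file"] for row in csv.DictReader(handle)}


def compare_mol2files(results_handle, vina_handle):
    """Return (entries only in the docking CSV, entries only in the Vina CSV)."""
    results = mol2file_set(results_handle)
    vina = mol2file_set(vina_handle)
    return sorted(results - vina), sorted(vina - results)


# In-memory stand-ins for results.csv and vina_scores_001.csv (made-up rows).
results_csv = io.StringIO(
    "mol2file,score\nposes/a_1.mol2,-7.1\nposes/a_2.mol2,-6.4\n"
)
vina_csv = io.StringIO(
    "mol2file,gauss1_inter\nposes/a_1.mol2,55.2\nposes/a_3.mol2,48.9\n"
)
only_in_results, only_in_vina = compare_mol2files(results_csv, vina_csv)
```

For real files, the two handles would come from open("results.csv", newline="") and the corresponding file in vina_scores. Both returned lists should be empty before proceeding.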

[2.2] Calculate the pairwise RMSD for each pose

DBX2 builds one graph for each protein-ligand complex, in which each binding pose is a node and the nodes are connected according to the RMSD between poses. The pairwise RMSD between all poses of each protein-ligand system therefore needs to be calculated.

The pairwise RMSD can be calculated with Open Babel (e.g., command: obrms {pose_x} {pose_y}). The pairwise RMSD values need to be saved as CSV files in a directory named rmsd, following the same convention as for vina_scores. An example RMSD file can be found in rmsd.tar (https://doi.org/10.5281/zenodo.17254874).
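
To show what a pairwise RMSD matrix contains, the following sketch computes one in plain Python. It is a simplified stand-in for obrms, not a replacement: it assumes that every pose lists the same atoms in the same order, performs no superposition (poses are compared in the shared receptor frame), and ignores molecular symmetry. The function names and coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two poses with identical atom ordering."""
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must contain the same number of atoms")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def pairwise_rmsd(poses):
    """Return the symmetric matrix of RMSD values between all pairs of poses."""
    n = len(poses)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(poses[i], poses[j])
    return matrix


# Three made-up two-atom poses: the second is shifted along z, the third along y.
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
    [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0)],
]
matrix = pairwise_rmsd(poses)
```

Only the upper triangle is computed explicitly, since the RMSD between pose i and pose j equals that between pose j and pose i.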

[2.3] Create graph for DBX2

Before creating the graphs for DBX2, make sure that the following are in place:

  • "results.csv": contains the docking/rescoring results and is located in the same directory as the rmsd and vina_scores directories. Example file: "results.csv" at https://doi.org/10.5281/zenodo.17254874.
  • Directory "vina_scores": contains the energetic features from Vina as CSV files named "vina_scores_<number>.csv" (e.g., vina_scores_001.csv). Example file: "vina_scores.tar" at https://doi.org/10.5281/zenodo.17254874.
  • Directory "rmsd": contains the pairwise RMSD values as CSV files named "rmsd_<number>.csv" (e.g., rmsd_001.csv). Example file: "rmsd.tar" at https://doi.org/10.5281/zenodo.17254874.

For_prediction
├── results.csv
├── vina_scores/
│   ├── vina_scores_001.csv
│   ├── vina_scores_002.csv
│   └── ...
└── rmsd/
    ├── rmsd_001.csv
    ├── rmsd_002.csv
    └── ...
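
The layout above can be verified with a short script before running the graph creation step. This is a minimal sketch using only the standard library. The function name check_layout is hypothetical, and the checks cover only what is listed above (the presence of results.csv and at least one correctly named CSV file in each directory).

```python
import re
import tempfile
from pathlib import Path


def check_layout(root):
    """Return a list of problems found in a For_prediction-style directory."""
    root = Path(root)
    problems = []
    if not (root / "results.csv").is_file():
        problems.append("missing results.csv")
    for name in ("vina_scores", "rmsd"):
        folder = root / name
        if not folder.is_dir():
            problems.append(f"missing directory {name}/")
            continue
        # Expected file names: vina_scores_<number>.csv or rmsd_<number>.csv
        pattern = re.compile(rf"{name}_\d+\.csv")
        if not any(pattern.fullmatch(p.name) for p in folder.iterdir()):
            problems.append(f"no {name}_<number>.csv files in {name}/")
    return problems


# Demonstration on a temporary directory that deliberately lacks rmsd/.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "results.csv").write_text("mol2file\n")
    (demo_root / "vina_scores").mkdir()
    (demo_root / "vina_scores" / "vina_scores_001.csv").write_text("mol2file\n")
    problems = check_layout(demo_root)
```

An empty list means that the directory matches the expected structure. In the demonstration, the missing rmsd directory is reported.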

Then generate the graphs using "create_graphs.py", available at https://doi.org/10.5281/zenodo.17254874. This will create a graph file for DBX2 named "Pred_graph.pkl".

[3] Prediction

Predictions can be run with the "rundbx2" command:

rundbx2 -f <A> -w <B> -t <C> --task <D> --seed 123 -o <E>

  • <A>: the configuration file that was used to train the model (config.ini).
  • <B>: the trained model.
  • <C>: the graph file generated in step [2.3].
  • <D>: the level at which DBX2 makes predictions (node, graph, or both).
  • <E>: the output file for the prediction results (.score file).

For example:

rundbx2 -f config.ini -w model_w.h5 -t Pred_graph.pkl --task graph node --seed 123 -o Pred.score

This will produce "Pred_node.score" and "Pred_graph.score", which contain the protein-ligand binding pose and binding affinity predictions, respectively.
