
TERMinator

This repo contains code for Neural Network-Derived Potts Models for Structure-Based Protein Design using Backbone Atomic Coordinates and Tertiary Motifs by Alex J. Li, Mindren Lu, Israel Desta, Vikram Sundar, Gevorg Grigoryan, and Amy E. Keating.

The following instructions assume you are running on SuperCloud, one of MIT's HPCC systems. The SLURM scripts referenced below are included in the repo and can be adapted to other HPCC systems if necessary.

Documentation

Documentation will be hosted at ReadTheDocs. You can also build the docs and view them locally following the instructions in the docs folder. The "Getting Started" guide is copied below for convenience.

Requirements

  • python3
  • pytorch = 1.10.1
  • pandas
  • numpy
  • scipy
  • tqdm
  • matplotlib
  • seaborn
  • pylint
  • pytest-pylint
  • yapf
  • pyg

These dependencies should be all that's needed; an env.yaml file specifying them is included in the repo.

Setup

For all scripts, provide absolute file paths, as not all scripts have been tested to work with relative paths.

Set up the conda environment using the env.yaml file (e.g. conda env create -f env.yaml). This will create a conda env called terminator, which can be activated using conda activate terminator.

Next, run pip install -e . in the root directory. This installs the TERMinator software suite in editable (develop) mode, making it available as an importable module terminator in the environment.

Additionally, you'll need to modify scripts/config.sh to specify the path to your MST installation (e.g. ~/MST_workspace/MST). See Mosaist for the latest updates. This is necessary for using MST/bin/design.
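
Putting these steps together, a typical setup session might look like the sketch below. The MST path is just the example from above, and the exact variable to edit inside scripts/config.sh should be taken from that file itself.

# create and activate the environment
conda env create -f env.yaml
conda activate terminator

# install TERMinator as an editable module
pip install -e .

# edit scripts/config.sh so it points at your MST installation,
# e.g. an MST checkout at ~/MST_workspace/MST
nano scripts/config.sh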

TERMinator Feature Generation

First, you'll need a folder of dTERMen runs, i.e. a folder with the structure <dataset>/<pdb_id>/<pdb_id>.<ext>, where <ext> must include the .dat and .red.pdb files output by running MST/bin/design.

To generate feature files from this folder, use

python scripts/data/preprocessing/generateDataset.py \
    --in_folder <input_data_folder> \
    --out_folder <output_features_folder> \
    -n <num_cores> \
    <-u if you want to overwrite existing feature files>

which will create a dataset <output_features_folder> that you can feed into TERMinator.
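
For example, a hypothetical invocation with placeholder paths (16 cores, overwriting existing feature files) might look like:

python scripts/data/preprocessing/generateDataset.py \
    --in_folder /path/to/dtermen_runs \
    --out_folder /path/to/features \
    -n 16 \
    -u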

COORDinator Feature Generation

COORDinator requires a bit of extra preprocessing. To generate feature files from your raw data, use

python scripts/data/preprocessing/cleanStructs.py \
    --in_list_path <pdb_paths_file> \
    --out_folder <output_folder> \
    -n <num_processes>

which will clean the PDB files listed in <pdb_paths_file>. Be sure that <pdb_paths_file> is a file containing a list of PDB paths, one path per line. The resulting <output_folder> can then be fed into generateDataset.py above with the additional flag --coords_only to featurize these structures for COORDinator.
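
As a sketch with placeholder paths, the two COORDinator preprocessing steps might be chained like this:

# clean the raw PDB files (pdb_paths.txt lists one PDB path per line)
python scripts/data/preprocessing/cleanStructs.py \
    --in_list_path /path/to/pdb_paths.txt \
    --out_folder /path/to/cleaned_pdbs \
    -n 16

# featurize the cleaned structures for COORDinator
python scripts/data/preprocessing/generateDataset.py \
    --in_folder /path/to/cleaned_pdbs \
    --out_folder /path/to/coordinator_features \
    -n 16 \
    --coords_only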

Training and evaluation

To train a new model, run

./scripts/models/train/submit_train.sh \
    <dataset_dir> \
    <model_hparams_path> \
    <run_hparams_path> \
    <run_dir> \
    <output_dir> \
    <run_wall_time>

This will submit a job to train on the given dataset with the given hyperparameters, place the trained model and related files in the run directory, and place results in the output directory. For TERMinator, use model_hparams=hparams/model/terminator.json and run_hparams=hparams/run/default.json. For COORDinator, use model_hparams=hparams/model/coordinator.json and run_hparams=hparams/run/seq_batching.json; you may also want to remove the --lazy option in scripts/models/train/run_train_gpu.sh, which can speed up training by loading the whole dataset into memory first.

Note that the train script assumes you placed the train, val, and test splits in <dataset_dir>/train.in, <dataset_dir>/validation.in, and <dataset_dir>/test.in, respectively. If you need different behavior, you can call the scripts/models/train/train.py script directly. The model will automatically evaluate on the test set and dump the results into net.out in <output_dir>, which can be used in postprocessing.
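
For example, a hypothetical TERMinator training submission might look like the following, where the directories are placeholders and the final argument is assumed to be a SLURM-style wall time:

./scripts/models/train/submit_train.sh \
    /path/to/features \
    hparams/model/terminator.json \
    hparams/run/default.json \
    /path/to/runs/terminator_run1 \
    /path/to/outputs/terminator_run1 \
    96:00:00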

If you instead want to evaluate on a pretrained model, run

  ./scripts/models/eval/submit_eval.sh \
      <model_directory> \
      <dataset_dir> \
      <output_dir> \
      [subset_file]

This will load the model, evaluate the features in dataset_dir using that model, and place the results in the output directory. subset_file is optional: if provided, only that subset of dataset_dir will be evaluated; otherwise, the whole dataset will be evaluated.
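
For example, with placeholder paths and no subset file (so the whole dataset is evaluated):

./scripts/models/eval/submit_eval.sh \
    /path/to/runs/terminator_run1 \
    /path/to/features \
    /path/to/outputs/terminator_eval

To restrict evaluation to a subset, append a subset file as a fourth argument.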

Postprocessing

To perform postprocessing, run

./scripts/data/postprocessing/submit_etab.sh \
    <dtermen_data_root> \
    <pdb_root> \
    <output_dir>

dtermen_data_root should be the parent directory of all the dTERMen runs you've performed (e.g. for every dTERMen dataset DATA, the directory should be structured DATA/<pdb_id>/(dTERMen run files for pdb_id)). For COORDinator, you can use the outputs of scripts/data/preprocessing/cleanStructs.py.

pdb_root is similar, but should be the parent directory of all databases in databaseCreator format (e.g. for database DATA, the directory should be structured DATA/PDB/<pdb_id_mid>/<pdb_id>.pdb).
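
A hypothetical postprocessing submission with placeholder paths might be:

./scripts/data/postprocessing/submit_etab.sh \
    /path/to/dtermen_runs \
    /path/to/pdb_databases \
    /path/to/outputs/terminator_run1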

Automatic steps afterwards

submit_etab.sh automatically calls the following two scripts to complete postprocessing.

First, it submits a batch job array to run dTERMen on each of the etabs. In case you want to run this step manually, the command is

python scripts/data/postprocessing/batch_arr_dTERMen.py \
    --output_dir=<output_directory_name> \
    --pdb_root=<pdb_root> \
    --dtermen_data=<dtermen_data_root> \
    --batch_size=48

batch_size specifies how many dTERMen runs each job in the job array will run in parallel. Each job in the job array should take 5-10 minutes, but could take longer depending on protein size. The resultant files are also dumped in <output_dir>/etabs/.
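
If you do need to launch this step by hand, a placeholder invocation mirroring the template above could be:

python scripts/data/postprocessing/batch_arr_dTERMen.py \
    --output_dir=/path/to/outputs/terminator_run1 \
    --pdb_root=/path/to/pdb_databases \
    --dtermen_data=/path/to/dtermen_runs \
    --batch_size=48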

After the previous step completes, a summarization script should also automatically run. This command is

  python scripts/data/postprocessing/summarize_results.py \
      --output_dir=<output_dir> \
      --dtermen_data=<dtermen_data_root>

The resulting summary will be written to <output_dir>/summary_results.csv.

Although these two steps run automatically, some dTERMen jobs may not finish (e.g. jobs sometimes stall when placed on a busy node and hit the wall time). If summary_results.csv is missing from the output directory or is empty, rerun the step above; it will resubmit all dTERMen jobs that didn't complete.
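
Per the note above, that means rerunning the summarization command with the same arguments (placeholder paths here); it resubmits any incomplete dTERMen jobs and, once they finish, produces the summary:

python scripts/data/postprocessing/summarize_results.py \
    --output_dir=/path/to/outputs/terminator_run1 \
    --dtermen_data=/path/to/dtermen_runs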

Redesign using Complexity-based Penalty

If you have a low-complexity sequence, you may want to redesign it using our complexity-based penalty. To do that, run

bash design_complex_penalt.sh \
    <input_etab> \
    <output_file_path>

which will run MCMC optimization with our complexity-based penalty.
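
For instance, with a placeholder energy table and output path (the .etab extension here is just an assumption based on the etab files described below):

bash design_complex_penalt.sh \
    /path/to/etabs/my_protein.etab \
    /path/to/my_protein_redesign.txt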

Other Potentially Useful Scripts

To convert dTERMen etabs to numpy etabs, run

python scripts/analysis/dtermen2npEtabs.py \
  --out_folder=<np_etab_folder> \
  --in_list=<file containing list of paths to .etab files> \
  --num_cores=N

This will read the etab files listed in in_list, convert them into numpy files, and dump the results in np_etab_folder.
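
For example, with placeholder paths, where etab_paths.txt is a hypothetical file listing one .etab path per line:

python scripts/analysis/dtermen2npEtabs.py \
  --out_folder=/path/to/np_etabs \
  --in_list=/path/to/etab_paths.txt \
  --num_cores=16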

To compress etab files, run

./scripts/data/postprocessing/submit_compress_files.sh <output_dir>
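
For example, pointing it at the same placeholder output directory used above:

./scripts/data/postprocessing/submit_compress_files.sh /path/to/outputs/terminator_run1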

Acknowledgements

Much of this work adapts code from two sources:

License

We license this work under the MIT license. Copyright (c) 2022 Alex J. Li, Mindren Lu, Israel Desta, Vikram Sundar, Gevorg Grigoryan, Amy E. Keating.
