OpenMM Generate

A Python-based molecular dynamics simulation toolkit using OpenMM for generating all-atom protein trajectories at scale.

Overview

This repository provides a module for running standard molecular dynamics simulations of all-atom protein systems with OpenMM. It is part of a larger pipeline for molecular dynamics and machine learning research and is designed to run efficiently on GPU-accelerated HPC systems.

Status

⚠️ IMPORTANT: This code is currently being ported and refactored from private repositories for public release. Full documentation and tutorials will be provided within 1-2 weeks.

Features

  • Automated Protein Preparation: Download PDB files, add missing residues/atoms, solvate systems
  • Ligand Parameterization: Automatic parameterization of small molecules using GAFF
  • Flexible Simulation Options: Configurable timesteps, temperatures, and integrators
  • Efficient Data Storage: HDF5-based trajectory storage with forces and energies
  • GPU Acceleration: Multi-GPU support with automatic distribution
  • Resumable Simulations: Checkpoint-based continuation of interrupted runs
  • Batch Processing: Process multiple proteins in parallel across GPUs
  • HPC Integration: SLURM scripts for Perlmutter, Delta, and other HPC systems

Installation

Prerequisites

  • Python 3.11 or 3.12
  • CUDA 12.1+ (for GPU acceleration)
  • Conda or Mamba package manager

Environment Setup

Create the conda environment using the provided configuration:

conda env create -f conda_envs/delta.yml -n openmm_generate
conda activate openmm_generate

Or for a minimal setup:

conda env create -f conda/openmm_cu121_py312.yml -n openmm
conda activate openmm

Dependencies

Key dependencies include:

  • OpenMM 8.1+
  • OpenMM Forcefields
  • OpenFF Toolkit
  • PDBFixer
  • RDKit
  • MDTraj
  • h5py

Quick Start

Basic Usage

Prepare and simulate a single protein:

cd scripts
python batch_generate.py <pdb_id> --prepare --steps 10000 --report-steps 10

Example Commands

Run a single PDB for 1000 frames, removing ligands:

./batch_generate.py 1ABC --prepare --remove-ligands --data-dir=../data --steps 1000

Run all PDBs from a JSON file for 10,000 frames with ligands:

./batch_generate.py openmm_2023.12.20_N1.json --prepare --data-dir=../data --steps 10000

Run with custom integrator (4fs timestep at 350K):

./batch_generate.py pdb_list.json --prepare --data-dir=../data --steps 10000 \
  --integrator=integrator_350k_4fs.json

Process local PDB files from a directory:

./batch_generate.py --prepare --steps 10000 --report-steps=10 \
  --input-dir=../data/input/ --data-dir=../data/output

Multi-GPU parallel processing:

./batch_generate.py pdb_list.json --prepare --steps 100000 --report-steps=100 \
  --gpus=0,1,2,3 --pool-size=8 --data-dir=../data

Command Line Options

Core Options

  • pdbid_list: PDB IDs to process, or path to JSON file with PDB ID array
  • --prepare: Run preparation (add solvent, fix residues) if not already done
  • --prepare-implicit: Use implicit solvent model instead of explicit
  • --remove-ligands: Remove ligands instead of parameterizing them
  • --steps: Total number of simulation steps (default: 10000)
  • --report-steps: Save a frame every n simulation steps (default: 1)
  • --data-dir: Output data directory (default: ../data/)
  • --input-dir: Input directory for local PDB files

Advanced Options

  • --integrator: Path to JSON file with integrator parameters (dt, temperature, etc.)
  • --gpus: Comma-separated list of GPU IDs to use (e.g., "0,1,2,3")
  • --pool-size: Number of simultaneous simulations (default: 10)
  • --batch-size: Split input list into batches of this size
  • --batch-index: Select which batch to run (for array jobs)
  • --timeout: Maximum runtime in hours; fractional values are allowed (e.g., 2.5)
  • --force: Force re-run of completed simulations

Integrator Configuration

Custom integrator parameters can be specified via JSON files:

integrator_4fs.json (4 femtosecond timestep):

{
    "dt": "0.004*picoseconds"
}

integrator_350k_4fs.json (4fs timestep at 350K):

{
    "dt": "0.004*picoseconds",
    "temperature": "350*kelvin"
}

Available parameters:

  • dt: Timestep (default: 0.002 ps)
  • temperature: Temperature (default: 300 K)
  • friction: Friction coefficient (default: 1.0/ps)
  • pressure: Pressure for barostat (default: 1.0 atm)
  • barostatInterval: Barostat update frequency (default: 25 steps)
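
For reference, all of the above can be combined in a single configuration file. The sketch below writes one in Python; the dt and temperature strings follow the format shown above, while the unit strings for friction and pressure (and the file name) are assumptions:

import json

config = {
    "dt": "0.004*picoseconds",
    "temperature": "350*kelvin",
    "friction": "1.0/picosecond",    # assumption: same unit-string style as dt
    "pressure": "1.0*atmospheres",   # assumption
    "barostatInterval": 25,
}

# Hypothetical file name
with open("integrator_npt_350k_4fs.json", "w") as f:
    json.dump(config, f, indent=4)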

HPC Usage

SLURM Job Submission

The repository includes SLURM scripts for various HPC systems:

Perlmutter (NERSC):

sbatch perlmutter1x4_N1000.slurm

Delta (NCSA):

sbatch delta.slurm

Example SLURM Configuration

#!/bin/bash
#SBATCH --gpus-per-task=4
#SBATCH --cpus-per-task=32
#SBATCH --time=2:00:00

export OMP_NUM_THREADS=32

srun python3 batch_generate.py pdb_list.json \
  --batch-index $SLURM_ARRAY_TASK_ID \
  --batch-size 25 \
  --pool-size 8 \
  --gpus="0,1,2,3" \
  --prepare \
  --steps=1000000 \
  --report-steps=100 \
  --timeout 1.8 \
  --data-dir="${OUTPUT_DIR}"

Output Structure

For each processed protein, the following directory structure is created:

data/
└── <pdb_id>/
    ├── raw/                      # Original PDB file
    │   └── <pdb_id>.pdb
    ├── processed/                # Prepared system
    │   ├── <pdb_id>_processed.pdb
    │   ├── <pdb_id>_processed_ligands_smiles.json
    │   ├── <pdb_id>_processed_ligands_cache.json
    │   ├── forcefield.json
    │   ├── <pdb_id>_process.log
    │   └── finished.txt
    ├── simulation/               # Simulation files
    │   ├── system.xml
    │   ├── integrator.xml
    │   ├── checkpoint.chk
    │   ├── log.txt
    │   ├── <pdb_id>_simulation.log
    │   ├── final_state.xml
    │   ├── final_state.pdb
    │   └── finished.txt
    └── result/                   # Trajectory data
        └── output_<pdb_id>.h5    # HDF5 trajectory file

HDF5 Output Format

The HDF5 trajectory files contain:

  • coordinates: Atomic positions (nm)
  • forces: Atomic forces (kJ/mol/nm)
  • cell_lengths: Periodic box dimensions (nm)
  • cell_angles: Periodic box angles (degrees)
  • time: Simulation time (ps)
  • potentialEnergy: Potential energy (kJ/mol)
  • kineticEnergy: Kinetic energy (kJ/mol)
  • topology: System topology in JSON format
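
For a quick sanity check, the arrays can be inspected with h5py. The sketch below assumes the dataset names match the keys listed above and sit at the top level of the file; the actual group layout may differ, so print the keys first:

import h5py

# Hypothetical output path; substitute a real file produced by batch_generate.py
path = "data/1ABC/result/output_1ABC.h5"

with h5py.File(path, "r") as f:
    print(list(f.keys()))                 # inspect the actual layout first
    coords = f["coordinates"][:]          # (n_frames, n_atoms, 3), in nm
    forces = f["forces"][:]               # (n_frames, n_atoms, 3), in kJ/mol/nm
    energies = f["potentialEnergy"][:]    # (n_frames,), in kJ/mol
    print(coords.shape, forces.shape, energies.shape)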

Module Structure

module/function.py

Utility functions for file management and topology operations.

module/preprocess.py

Protein preparation pipeline:

  • Download PDB files from RCSB
  • Fix missing residues and atoms
  • Add hydrogens at specified pH
  • Add explicit or implicit solvent
  • Parameterize ligands
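
The same preparation steps can be reproduced by hand with PDBFixer and OpenMM. The sketch below is illustrative only (hypothetical PDB ID, explicit solvent); the module adds ligand handling, logging, and force field bookkeeping on top of this:

from pdbfixer import PDBFixer
from openmm.app import PDBFile
from openmm.unit import nanometer

# Download a structure from RCSB and repair it (hypothetical PDB ID)
fixer = PDBFixer(pdbid="1ABC")
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()
fixer.addMissingHydrogens(pH=7.0)

# Add an explicit water box with 1 nm of padding around the protein
fixer.addSolvent(padding=1.0 * nanometer)

with open("1ABC_processed.pdb", "w") as out:
    PDBFile.writeFile(fixer.topology, fixer.positions, out)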

module/simulation.py

Core simulation engine:

  • System setup and minimization
  • Equilibration
  • Production MD
  • Checkpoint management
  • Resumable simulations
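
In plain OpenMM terms, these stages correspond roughly to the sketch below; it is a minimal illustration, and the module's actual settings, reporters, and checkpoint handling differ:

from openmm import LangevinMiddleIntegrator, unit
from openmm.app import (CheckpointReporter, ForceField, HBonds, PDBFile, PME,
                        Simulation)

pdb = PDBFile("1ABC_processed.pdb")  # hypothetical prepared structure
forcefield = ForceField("amber14-all.xml", "amber14/tip3pfb.xml")
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME,
                                 nonbondedCutoff=1.0 * unit.nanometer,
                                 constraints=HBonds)

integrator = LangevinMiddleIntegrator(300 * unit.kelvin, 1.0 / unit.picosecond,
                                      0.002 * unit.picoseconds)
simulation = Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)

simulation.minimizeEnergy()                                        # minimization
simulation.step(10_000)                                            # short equilibration
simulation.reporters.append(CheckpointReporter("checkpoint.chk", 10_000))
simulation.step(100_000)                                           # production MD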

module/ligands.py

Ligand handling:

  • RCSB ligand template retrieval
  • SMILES-based parameterization
  • GAFF force field generation
  • Topology insertion
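
The SMILES-based route maps onto the standard openmmforcefields workflow; a hedged sketch, with benzene as a stand-in ligand:

from openff.toolkit.topology import Molecule
from openmm.app import ForceField
from openmmforcefields.generators import GAFFTemplateGenerator

# Hypothetical ligand; in the pipeline the SMILES comes from the RCSB lookup
ligand = Molecule.from_smiles("c1ccccc1")

# Register a GAFF template generator so ForceField can parameterize the ligand on the fly
gaff = GAFFTemplateGenerator(molecules=[ligand])
forcefield = ForceField("amber14-all.xml", "amber14/tip3pfb.xml")
forcefield.registerTemplateGenerator(gaff.generator)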

module/reporters.py

Custom HDF5 reporter with:

  • Threaded I/O for performance
  • Force trajectory storage
  • H5MD-compatible format
  • Compression support

module/pdb_lookup.py

RCSB database query utilities for ligand SMILES strings.

Force Field Configuration

Default force fields:

  • Protein: AMBER14
  • Water: TIP3P-FB
  • Ligands: GAFF (via OpenFF)
  • Implicit Solvent: GBN2

Force field selection is automatic based on preparation options and stored in forcefield.json.
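
These defaults map onto the standard force field files shipped with OpenMM; as a rough guide (the exact entries recorded in forcefield.json depend on the preparation options):

from openmm.app import ForceField

# Explicit solvent: AMBER14 protein parameters with TIP3P-FB water
ff_explicit = ForceField("amber14-all.xml", "amber14/tip3pfb.xml")

# Implicit solvent: AMBER14 with the GBn2 generalized Born model
ff_implicit = ForceField("amber14-all.xml", "implicit/gbn2.xml")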

Data Processing Workflow

Cecilia Data Cleaning

The cecilia_data_cleaning/ directory contains tools for processing existing MD datasets:

Fix Tyrosine Residues:

python cecilia_data_cleaning/fix_cecilia_tyrosol.py

Generate Starting Poses via TICA:

python cecilia_data_cleaning/tica_starting_pos/main.py

This performs time-lagged independent component analysis (TICA) on bond distances to identify diverse starting conformations for enhanced sampling.
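
A minimal sketch of that idea, using MDTraj for the distance features and deeptime for TICA; the library choice, file names, and atom selection here are assumptions, and the repository's script may differ:

import mdtraj as md
import numpy as np
from deeptime.decomposition import TICA

traj = md.load("traj.h5")  # hypothetical trajectory file with embedded topology

# Featurize: distances between consecutive C-alpha atoms (stand-in for bond distances)
ca = traj.topology.select("name CA")
pairs = np.array([[ca[i], ca[i + 1]] for i in range(len(ca) - 1)])
features = md.compute_distances(traj, pairs)

# Project onto the two slowest TICA components; well-spread points suggest diverse starting poses
tica = TICA(lagtime=10, dim=2).fit(features).fetch_model()
projection = tica.transform(features)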

Performance Considerations

GPU Utilization

  • Each simulation requires ~2-4GB GPU memory
  • Pool size should be a multiple of GPU count for efficiency
  • Recommended: 2 simulations per GPU (pool_size = 2 × num_gpus)

CPU Threading

  • Set OMP_NUM_THREADS to match allocated cores
  • Ligand parameterization temporarily limits threads to 2
  • Typical: 32 threads for 4-GPU nodes

I/O Performance

  • HDF5 reporter uses threaded writes to minimize overhead
  • report_steps controls output frequency and file size
  • Compression can be enabled via use_gzip=True

Troubleshooting

Common Issues

CUDA out of memory:

  • Reduce pool size
  • Decrease number of atoms (remove water with atomSubset)

Ligand parameterization fails:

  • Check RCSB connectivity
  • Verify ligand is in PDB Chemical Component Dictionary
  • Use --remove-ligands if parameterization not needed

Simulation doesn't resume:

  • Ensure checkpoint files exist
  • Check that HDF5 file isn't corrupted
  • Verify simulation time matches last recorded frame

Force field errors:

  • Check for non-standard residues
  • Verify all ligands have templates
  • Review preparation logs

Contributing

We welcome contributions! Please use GitHub Issues to:

  • Report bugs
  • Request features
  • Ask questions
  • Suggest improvements

License

License information will be added soon.

Acknowledgments

This work was performed on:

  • Perlmutter at NERSC (National Energy Research Scientific Computing Center)
  • Delta at NCSA (National Center for Supercomputing Applications)

Development supported by computational resources and expertise from these facilities.

Contact

For questions and support, please open an issue on GitHub.
