A Python-based molecular dynamics simulation toolkit using OpenMM for generating all-atom protein trajectories at scale.
This repository provides a module for running standard molecular dynamics simulations of all-atom protein systems with OpenMM. It is part of a larger pipeline for molecular dynamics and machine learning research, designed to run efficiently on GPU-accelerated HPC systems.
- Automated Protein Preparation: Download PDB files, add missing residues/atoms, solvate systems
- Ligand Parameterization: Automatic parameterization of small molecules using GAFF
- Flexible Simulation Options: Configurable timesteps, temperatures, and integrators
- Efficient Data Storage: HDF5-based trajectory storage with forces and energies
- GPU Acceleration: Multi-GPU support with automatic distribution
- Resumable Simulations: Checkpoint-based continuation of interrupted runs
- Batch Processing: Process multiple proteins in parallel across GPUs
- HPC Integration: SLURM scripts for Perlmutter, Delta, and other HPC systems
- Python 3.11 or 3.12
- CUDA 12.1+ (for GPU acceleration)
- Conda or Mamba package manager
Create the conda environment using the provided configuration:
```bash
conda env create -f conda_envs/delta.yml -n openmm_generate
conda activate openmm_generate
```

Or for a minimal setup:

```bash
conda env create -f conda/openmm_cu121_py312.yml -n openmm
conda activate openmm
```

Key dependencies include:
- OpenMM 8.1+
- OpenMM Forcefields
- OpenFF Toolkit
- PDBFixer
- RDKit
- MDTraj
- h5py
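To verify that the environment works and that the CUDA platform is visible, you can run OpenMM's built-in self-test:

```bash
python -m openmm.testInstallation
```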
Prepare and simulate a single protein:
```bash
cd scripts
python batch_generate.py <pdb_id> --prepare --steps 10000 --report-steps 10
```

Run a single PDB for 1000 frames, removing ligands:

```bash
./batch_generate.py 1ABC --prepare --remove-ligands --data-dir=../data --steps 1000
```

Run all PDBs from a JSON file for 10,000 frames with ligands:

```bash
./batch_generate.py openmm_2023.12.20_N1.json --prepare --data-dir=../data --steps 10000
```

Run with a custom integrator (4 fs timestep at 350 K):

```bash
./batch_generate.py pdb_list.json --prepare --data-dir=../data --steps 10000 \
    --integrator=integrator_350k_4fs.json
```

Process local PDB files from a directory:

```bash
./batch_generate.py --prepare --steps 10000 --report-steps=10 \
    --input-dir=../data/input/ --data-dir=../data/output
```

Multi-GPU parallel processing:

```bash
./batch_generate.py pdb_list.json --prepare --steps 100000 --report-steps=100 \
    --gpus=0,1,2,3 --pool-size=8 --data-dir=../data
```

Command-line options:

- `pdbid_list`: PDB IDs to process, or path to a JSON file containing an array of PDB IDs
- `--prepare`: Run preparation (add solvent, fix residues) if not already done
- `--prepare-implicit`: Use an implicit solvent model instead of explicit solvent
- `--remove-ligands`: Remove ligands instead of parameterizing them
- `--steps`: Total number of simulation steps (default: 10000)
- `--report-steps`: Save data every n frames (default: 1)
- `--data-dir`: Output data directory (default: `../data/`)
- `--input-dir`: Input directory for local PDB files
- `--integrator`: Path to a JSON file with integrator parameters (dt, temperature, etc.)
- `--gpus`: Comma-separated list of GPU IDs to use (e.g., `0,1,2,3`)
- `--pool-size`: Number of simultaneous simulations (default: 10)
- `--batch-size`: Split the input list into batches of this size
- `--batch-index`: Select which batch to run, for array jobs (see the example below)
- `--timeout`: Maximum runtime in hours (e.g., 2.5 for 2.5 hours)
- `--force`: Force re-run of completed simulations
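As an illustration of how `--batch-size` and `--batch-index` combine with SLURM array jobs (the total of 1,000 PDB IDs here is hypothetical):

```bash
# 1,000 PDB IDs split into batches of 25 gives 40 batches (indices 0-39);
# each array task processes one batch.
sbatch --array=0-39 delta.slurm
# Inside the script: --batch-size 25 --batch-index $SLURM_ARRAY_TASK_ID
```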
Custom integrator parameters can be specified via JSON files:
integrator_4fs.json (4 femtosecond timestep):
```json
{
  "dt": "0.004*picoseconds"
}
```

`integrator_350k_4fs.json` (4 fs timestep at 350 K):

```json
{
  "dt": "0.004*picoseconds",
  "temperature": "350*kelvin"
}
```

Available parameters:

- `dt`: Timestep (default: 0.002 ps)
- `temperature`: Temperature (default: 300 K)
- `friction`: Friction coefficient (default: 1.0/ps)
- `pressure`: Pressure for the barostat (default: 1.0 atm)
- `barostatInterval`: Barostat update frequency (default: 25 steps)
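For reference, a hypothetical file setting every supported key at its default; the exact unit strings for `friction` and `pressure` are assumptions based on the `dt` and `temperature` examples above:

```json
{
  "dt": "0.002*picoseconds",
  "temperature": "300*kelvin",
  "friction": "1.0/picoseconds",
  "pressure": "1.0*atmospheres",
  "barostatInterval": 25
}
```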
The repository includes SLURM scripts for various HPC systems:
Perlmutter (NERSC):
```bash
sbatch perlmutter1x4_N1000.slurm
```

Delta (NCSA):

```bash
sbatch delta.slurm
```

A typical script looks like:

```bash
#!/bin/bash
#SBATCH --gpus-per-task=4
#SBATCH --cpus-per-task=32
#SBATCH --time=2:00:00

export OMP_NUM_THREADS=32

srun python3 batch_generate.py pdb_list.json \
    --batch-index $SLURM_ARRAY_TASK_ID \
    --batch-size 25 \
    --pool-size 8 \
    --gpus="0,1,2,3" \
    --prepare \
    --steps=1000000 \
    --report-steps=100 \
    --timeout 1.8 \
    --data-dir="${OUTPUT_DIR}"
```

For each processed protein, the following directory structure is created:
```
data/
└── <pdb_id>/
    ├── raw/                                      # Original PDB file
    │   └── <pdb_id>.pdb
    ├── processed/                                # Prepared system
    │   ├── <pdb_id>_processed.pdb
    │   ├── <pdb_id>_processed_ligands_smiles.json
    │   ├── <pdb_id>_processed_ligands_cache.json
    │   ├── forcefield.json
    │   ├── <pdb_id>_process.log
    │   └── finished.txt
    ├── simulation/                               # Simulation files
    │   ├── system.xml
    │   ├── integrator.xml
    │   ├── checkpoint.chk
    │   ├── log.txt
    │   ├── <pdb_id>_simulation.log
    │   ├── final_state.xml
    │   ├── final_state.pdb
    │   └── finished.txt
    └── result/                                   # Trajectory data
        └── output_<pdb_id>.h5                    # HDF5 trajectory file
```
The HDF5 trajectory files contain:
- `coordinates`: Atomic positions (nm)
- `forces`: Atomic forces (kJ/mol/nm)
- `cell_lengths`: Periodic box dimensions (nm)
- `cell_angles`: Periodic box angles (degrees)
- `time`: Simulation time (ps)
- `potentialEnergy`: Potential energy (kJ/mol)
- `kineticEnergy`: Kinetic energy (kJ/mol)
- `topology`: System topology in JSON format
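A minimal sketch of reading a trajectory with h5py, using the dataset names listed above (the actual file may nest these under H5MD-style groups):

```python
import h5py

# Hypothetical path, following the directory layout above.
with h5py.File("data/1ABC/result/output_1ABC.h5", "r") as f:
    coords = f["coordinates"][:]      # (n_frames, n_atoms, 3), in nm
    forces = f["forces"][:]           # (n_frames, n_atoms, 3), in kJ/mol/nm
    time = f["time"][:]               # (n_frames,), in ps
    energy = f["potentialEnergy"][:]  # (n_frames,), in kJ/mol
    print(f"{len(time)} frames, last at {time[-1]:.1f} ps")
```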
Utility functions for file management and topology operations.
Protein preparation pipeline:
- Download PDB files from RCSB
- Fix missing residues and atoms
- Add hydrogens at specified pH
- Add explicit or implicit solvent
- Parameterize ligands
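These steps correspond closely to the standard PDBFixer workflow; the following is a sketch under that assumption, not the module's exact code:

```python
from pdbfixer import PDBFixer
from openmm import app, unit

fixer = PDBFixer(pdbid="1ABC")        # download directly from RCSB
fixer.findMissingResidues()           # detect gaps in the chains
fixer.findMissingAtoms()
fixer.addMissingAtoms()
fixer.addMissingHydrogens(pH=7.0)     # protonate at the chosen pH
fixer.addSolvent(padding=1.0 * unit.nanometers)  # explicit water box

with open("1ABC_processed.pdb", "w") as out:
    app.PDBFile.writeFile(fixer.topology, fixer.positions, out)
```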
Core simulation engine:
- System setup and minimization
- Equilibration
- Production MD
- Checkpoint management
- Resumable simulations
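Under the hood this follows the standard OpenMM pattern; a minimal sketch (illustrative, not the repository's exact code):

```python
import openmm as mm
from openmm import app, unit

# Load the prepared structure and the serialized system
# (paths follow the output layout described above).
pdb = app.PDBFile("data/1ABC/processed/1ABC_processed.pdb")
with open("data/1ABC/simulation/system.xml") as f:
    system = mm.XmlSerializer.deserialize(f.read())

integrator = mm.LangevinMiddleIntegrator(
    300 * unit.kelvin, 1.0 / unit.picosecond, 0.002 * unit.picoseconds)
simulation = app.Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)

simulation.minimizeEnergy()   # local energy minimization
simulation.step(10_000)       # production MD
simulation.saveCheckpoint("data/1ABC/simulation/checkpoint.chk")
# A resumed run calls simulation.loadCheckpoint(...) and continues stepping.
```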
Ligand handling:
- RCSB ligand template retrieval
- SMILES-based parameterization
- GAFF force field generation
- Topology insertion
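GAFF parameterization in OpenMM is typically done with openmmforcefields' template generator; a sketch assuming a ligand SMILES string retrieved from RCSB (the example SMILES is arbitrary):

```python
from openff.toolkit import Molecule
from openmm import app
from openmmforcefields.generators import GAFFTemplateGenerator

# Build an OpenFF molecule from the ligand SMILES (arbitrary example).
ligand = Molecule.from_smiles("CC(=O)Oc1ccccc1C(=O)O")

# Register a GAFF residue template generator with the force field so the
# ligand receives parameters when the system is created.
forcefield = app.ForceField("amber14-all.xml", "amber14/tip3pfb.xml")
gaff = GAFFTemplateGenerator(molecules=ligand)
forcefield.registerTemplateGenerator(gaff.generator)
```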
Custom HDF5 reporter with:
- Threaded I/O for performance
- Force trajectory storage
- H5MD-compatible format
- Compression support
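The reporter plugs into OpenMM's reporter protocol (`describeNextReport`/`report`); a skeletal sketch of that interface, with the repository's threaded HDF5 writing elided:

```python
class MinimalHDF5Reporter:
    """Skeleton of an OpenMM reporter; the real class writes HDF5 on a worker thread."""

    def __init__(self, report_interval):
        self._interval = report_interval

    def describeNextReport(self, simulation):
        steps = self._interval - simulation.currentStep % self._interval
        # (steps until next report, need positions, velocities, forces, energies)
        return (steps, True, False, True, True)

    def report(self, simulation, state):
        positions = state.getPositions(asNumpy=True)   # nm
        forces = state.getForces(asNumpy=True)         # kJ/mol/nm
        # ...enqueue the arrays for the background HDF5 writer...
```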
RCSB database query utilities for ligand SMILES strings.
Default force fields:
- Protein: AMBER14
- Water: TIP3P-FB
- Ligands: GAFF (via OpenFF)
- Implicit Solvent: GBn2
Force field selection is automatic based on preparation options and stored in forcefield.json.
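In OpenMM terms, these defaults correspond roughly to the bundled force-field XML files (a sketch of the selection, not the repository's loading code):

```python
from openmm import app

# Explicit solvent (default): AMBER14 protein force field + TIP3P-FB water.
ff_explicit = app.ForceField("amber14-all.xml", "amber14/tip3pfb.xml")

# Implicit solvent (--prepare-implicit): AMBER14 + GBn2.
ff_implicit = app.ForceField("amber14-all.xml", "implicit/gbn2.xml")
```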
The cecilia_data_cleaning/ directory contains tools for processing existing MD datasets:
Fix Tyrosine Residues:
```bash
python cecilia_data_cleaning/fix_cecilia_tyrosol.py
```

Generate Starting Poses via TICA:

```bash
python cecilia_data_cleaning/tica_starting_pos/main.py
```

This performs time-lagged independent component analysis (TICA) on bond distances to identify diverse starting conformations for enhanced sampling.
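A self-contained sketch of the TICA step itself (plain NumPy/SciPy, not the repository's implementation): given a feature matrix of bond distances over time, project onto the slowest-decorrelating components by solving a generalized eigenvalue problem on time-lagged covariances.

```python
import numpy as np
from scipy.linalg import eigh

def tica_project(X, lag=10, n_components=2):
    """Project features X (n_frames, n_features) onto the slowest TICA components."""
    X = X - X.mean(axis=0)
    X0, Xt = X[:-lag], X[lag:]
    n = len(X0)
    C0 = (X0.T @ X0 + Xt.T @ Xt) / (2 * n)  # instantaneous covariance (symmetrized)
    Ct = (X0.T @ Xt + Xt.T @ X0) / (2 * n)  # time-lagged covariance (symmetrized)
    # Generalized symmetric eigenproblem Ct v = lambda C0 v
    # (assumes C0 is positive definite, i.e., nondegenerate features).
    eigvals, eigvecs = eigh(Ct, C0)
    order = np.argsort(eigvals)[::-1][:n_components]  # slowest components first
    return X @ eigvecs[:, order]

# Frames far apart in this projection are candidates for diverse starting poses.
```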
- Each simulation requires ~2-4GB GPU memory
- Pool size should be a multiple of GPU count for efficiency
- Recommended: 2 simulations per GPU (pool_size = 2 × num_gpus)
- Set `OMP_NUM_THREADS` to match allocated cores
- Ligand parameterization temporarily limits threads to 2
- Typical: 32 threads for 4-GPU nodes
- HDF5 reporter uses threaded writes to minimize overhead
- `report_steps` controls output frequency and file size
- Compression can be enabled via `use_gzip=True`
CUDA out of memory:
- Reduce pool size
- Decrease number of atoms (remove water with `atomSubset`)
Ligand parameterization fails:
- Check RCSB connectivity
- Verify ligand is in PDB Chemical Component Dictionary
- Use `--remove-ligands` if parameterization is not needed
Simulation doesn't resume:
- Ensure checkpoint files exist
- Check that HDF5 file isn't corrupted
- Verify simulation time matches last recorded frame
Force field errors:
- Check for non-standard residues
- Verify all ligands have templates
- Review preparation logs
We welcome contributions! Please use GitHub Issues to:
- Report bugs
- Request features
- Ask questions
- Suggest improvements
License information will be added soon.
This work was performed on:
- Perlmutter at NERSC (National Energy Research Scientific Computing Center)
- Delta at NCSA (National Center for Supercomputing Applications)
Development supported by computational resources and expertise from these facilities.
For questions and support, please open an issue on GitHub.