This repository implements the PepMorph pipeline from the paper "Morphology-Specific Peptide Discovery via Masked Conditional Generative Modeling" (Costa & Zavadlav, 2025). It provides a conditional, mask-aware CVAE for peptide generation, plus utilities to train the models, validate them, and compare generated sequences against morphology targets.
Peptide self-assembly prediction offers a powerful bottom-up strategy for designing biocompatible, low-toxicity materials for large-scale synthesis in a broad range of biomedical and energy applications. However, screening the vast sequence space for categorization of aggregate morphology remains intractable. We introduce PepMorph, an end-to-end peptide discovery pipeline that generates novel sequences that are not only prone to aggregate but whose self-assembly is steered toward fibrillar or spherical morphologies by conditioning on isolated peptide descriptors that serve as morphology proxies. To this end, we compiled a new dataset by leveraging existing aggregation propensity datasets and extracting geometric and physicochemical descriptors. This dataset is then used to train a Transformer-based Conditional Variational Autoencoder with a masking mechanism, which generates novel peptides under arbitrary conditioning. After filtering to ensure design specifications and validation of generated sequences through coarse-grained molecular dynamics simulations, PepMorph yielded 83% success rate under our CG-MD validation protocol and morphology criterion for the targeted class, showcasing its promise as a framework for application-driven peptide discovery.
├── data/
│ ├── raw/ # source datasets + input FASTA (peptides.fst)
│ ├── processed/ # merged/normalized descriptor CSVs
│ └── splits/ # deterministic train/val/test indices
├── artifacts/
│ ├── data_figs/ # dataset figures
│ ├── descriptor_calc/ # descriptor-calc logs/timings
│ ├── models/ # pretrained weights (AP predictor, masked CVAE)
│ ├── validation/ # validation outputs (figs/results/gen_peptides)
│ ├── md_sims/ # validation artifacts from the paper (MD sims, analysis)
│ └── legacy_runs/ # older outputs moved out of working dirs
├── notebooks/
│ └── *.ipynb # exploratory notebooks
├── src/
│ ├── shared/
│ │ └── plot_style.py # shared plotting style
│ ├── scripts/ # dataset merge/analysis utilities
│ ├── descriptor_calc/ # PEP-FOLD/descriptor generation pipeline
│ ├── modeling/ # AP predictor + CVAE training/validation
│ └── md_sims/ # CG robustness analysis scripts
├── README.md
├── environment.yml # pinned Python dependencies
- Create the environment:
    conda env create -f environment.yml
    conda activate pepmorph
- Prepare the processed dataset:
    python src/scripts/merge.py
- Create deterministic splits (stratified by length, seed=42):
    python src/scripts/make_splits.py
- Train the AP/SA predictor:
    python src/modeling/ap_model/ap_sa_pred.py --config src/modeling/ap_model/config.yaml
- Train the masked CVAE (all hyperparameters in config):
    python src/modeling/masked_cvae/train.py --config src/modeling/masked_cvae/config.yaml
- Generate peptides and evaluate:
    python src/modeling/validation/generate.py --mode all
    python src/modeling/validation/evaluate_novelty_diversity_sim.py
    python src/modeling/validation/validation_plotting.py
  Outputs go to artifacts/validation/{gen_peptides,results,figs}.
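To illustrate what "deterministic splits, stratified by length" can mean in practice, here is a minimal sketch. It is not the code in src/scripts/make_splits.py: the function name `stratified_length_split`, the 80/10/10 fractions, and the toy peptide list are all assumptions made for this example. Only the stratification by length and the seed value of 42 come from the steps above.

```python
import random
from collections import defaultdict


def stratified_length_split(sequences, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists, stratified by sequence length.

    Hypothetical helper: the fractions are an assumption, not the repo's values.
    """
    # Group indices by peptide length so every split samples each length bucket.
    by_length = defaultdict(list)
    for idx, seq in enumerate(sequences):
        by_length[len(seq)].append(idx)

    # A local RNG keeps the result reproducible without touching global state.
    rng = random.Random(seed)
    train, val, test = [], [], []
    for length in sorted(by_length):  # sorted keys give a stable bucket order
        bucket = by_length[length]
        rng.shuffle(bucket)
        n_train = round(fractions[0] * len(bucket))
        n_val = round(fractions[1] * len(bucket))
        train.extend(bucket[:n_train])
        val.extend(bucket[n_train:n_train + n_val])
        test.extend(bucket[n_train + n_val:])  # remainder goes to test
    return sorted(train), sorted(val), sorted(test)


# Toy input: made-up peptides, repeated so each length bucket has several members.
peptides = ["FF", "KLVFF", "GAV", "FFK", "WW", "AVG", "KLVFA", "DFNKF"] * 5
train_idx, val_idx, test_idx = stratified_length_split(peptides)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the shuffle uses its own seeded generator and the buckets are visited in sorted order, calling the function twice on the same input returns identical index lists, which is the property that makes it safe to store split files on disk.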
- Environment: environment.yml pins all major dependencies (Torch, ESM, numpy/pandas/sklearn, plotting).
- Splits: data/splits/*.txt are deterministic (seed=42), generated by src/scripts/make_splits.py.
- Seeds: training/evaluation scripts accept --seed and call a shared deterministic seed routine.
- AP model config: AP hyperparameters are in src/modeling/ap_model/config.yaml and logged to artifacts/models/ap_model/config_used.yaml on each run.
- Masked CVAE config: Masked CVAE hyperparameters are in src/modeling/masked_cvae/config.yaml and logged to artifacts/models/masked_cvae/config_used.yaml on each run.
- Checkpoints: pretrained weights live in artifacts/models/{ap_model,masked_cvae}.
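The seeding and config-logging conventions above could look roughly like the sketch below. The names `seed_everything` and `snapshot_config` are hypothetical and are not taken from the repository; the sketch only assumes that a run seeds every available random number generator and copies the config it was given to a config_used.yaml file in its output directory.

```python
import os
import random
import shutil
from pathlib import Path


def seed_everything(seed: int = 42) -> None:
    """Seed every RNG available in the current environment (hypothetical helper)."""
    random.seed(seed)
    # Affects only Python processes started after this point, not the running one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        # Trade some speed for reproducible cuDNN kernel selection.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # torch not installed: nothing to seed


def snapshot_config(config_path, out_dir):
    """Copy the config used for a run to <out_dir>/config_used.yaml."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "config_used.yaml"
    shutil.copyfile(config_path, target)
    return target
```

A training script following this pattern would call `seed_everything(args.seed)` before building any model or data loader, then `snapshot_config(args.config, "artifacts/models/ap_model")` so the exact hyperparameters of the run are stored next to its checkpoints.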
If you use PepMorph, please cite:
@misc{costa2025morphologyspecificpeptidediscoverymasked,
title={Morphology-Specific Peptide Discovery via Masked Conditional Generative Modeling},
author={Nuno Costa and Julija Zavadlav},
year={2025},
eprint={2509.02060},
archivePrefix={arXiv},
primaryClass={q-bio.BM},
url={https://arxiv.org/abs/2509.02060},
}

For questions regarding the implementation, please send an email to nuno.costa@tum.de
