DipGNNome is the first deep learning–based assembler for diploid de novo genome assembly. It provides the following features:
- First ML-ready assembly graph pipeline: We implement the first publicly available pipeline for constructing machine learning–ready assembly graphs with ground-truth labels, enabling supervised training in the diploid setting.
- Novel assembly algorithm: We develop an assembly algorithm that integrates model predictions with a beam search strategy to efficiently traverse long, string-like graphs with limited branching, a design that may generalize to other path-finding tasks.
- Competitive performance: We show that a GNN-based assembler can achieve assemblies comparable to state-of-the-art methods while following a fundamentally different, learning-driven paradigm.
The dataset for DipGNNome is available at this Google Drive link.
DipGNNome operates in three main stages:

Stage A - Data generation:
- HiFi reads are assembled into a unitig graph with hifiasm
- Graphs are simplified and annotated with haplotype information using trio-derived k-mers
- The result is a machine learning–ready dataset with ground-truth labels

Stage B - Model training:
- Simulated diploid reads generate labeled graphs for supervised GNN training
- Edge classification models learn to predict optimal assembly paths
- The model is a Symmetric Gated Graph Convolutional Network (SymGatedGCN)

Stage C - Assembly:
- Real data are processed as in stage A
- The trained GNN scores edges for optimal path selection
- A beam search algorithm reconstructs phased maternal and paternal haplotypes
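As a rough illustration of the decoding idea, here is a minimal beam search over an edge-scored graph. This is a hypothetical simplified sketch, not the actual implementation in `decoding/graph_walk.py`; the toy graph and the `beam_search` helper are invented for illustration, with edge scores standing in for GNN edge predictions:

```python
import heapq

def beam_search(adj, score, start, beam_width=3):
    """Keep the `beam_width` best partial paths ranked by cumulative
    edge score, and extend them until no path can grow further."""
    beams = [(0.0, [start])]          # each entry: (total score, path)
    best = (0.0, [start])
    while beams:
        candidates = []
        for total, path in beams:
            node = path[-1]
            for nxt in adj.get(node, []):
                if nxt in path:       # avoid revisiting nodes in the walk
                    continue
                candidates.append((total + score[(node, nxt)], path + [nxt]))
        if not candidates:
            break
        # Prune to the top `beam_width` extensions
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        best = max([best] + beams, key=lambda c: c[0])
    return best

# Toy edge-scored graph
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
score = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "d"): 0.8, ("c", "d"): 0.95}
total, path = beam_search(adj, score, "a", beam_width=2)
print(path)  # → ['a', 'b', 'd']
```

Unlike a purely greedy walk, the beam keeps several alternatives alive, which helps on string-like graphs where an early low-scoring edge can still lead to the best overall path.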
Requirements:
- Python 3.8+
- Git, Make, GCC (for external tools)
- CUDA toolkit (optional, for GPU acceleration)
Choose one of the following installation methods:

```bash
python install_tools.py
```

For GPU support:

```bash
./setup_gpu.sh
```

For CPU-only:

```bash
./setup_cpu.sh
```

Or install manually:

```bash
# Install Python dependencies
pip install -r requirements.txt
# Install external tools (see INSTALLATION.md for details)
```

DipGNNome includes a pre-trained model (`dipgnnome_trained.pt`) that has been trained on synthetic diploid data. This allows you to skip the training step and directly perform genome assembly.
Generate unitig graphs with haplotype information from HiFi reads:

```bash
python data_gen/unitig_gen_data_class.py \
    --data_path /path/to/hifi/reads/ \
    --config configs/real_full.yml
```

- `--data_path`: Path to directory containing HiFi reads
- `--config`: Configuration file specifying parameters

Train the Graph Neural Network on synthetic diploid data:

```bash
python training/train.py \
    --data_path /path/to/training/graphs/ \
    --data_config configs/example_data_config.yml \
    --device cuda:0 \
    --run_name your_experiment_name \
    --wandb your_wandb_project
```

- `--data_path`: Path to training graph datasets
- `--data_config`: Dataset configuration file
- `--device`: GPU device (e.g., `cuda:0`, `cuda:6`) or `cpu`
- `--run_name`: Name for this training run
- `--wandb`: Weights & Biases project name for experiment tracking

Run diploid genome assembly using the trained model:

```bash
python decoding/main.py \
    --model dipgnnome_trained.pt \
    --ref reference_genome_name \
    --ass_out_dir /path/to/output/ \
    --filename output_prefix \
    --dataset /path/to/input/data/ \
    --strategy beam
```

- `--model`: Path to trained model checkpoint (use `dipgnnome_trained.pt` for the pre-trained model)
- `--ref`: Reference genome identifier
- `--ass_out_dir`: Output directory for assembly results
- `--filename`: Prefix for output files
- `--dataset`: Path to input dataset
- `--strategy`: Assembly strategy (`beam`, `greedy`, etc.)
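The three stages above can also be chained from a driver script. The sketch below only builds the command lines documented above and prints them (dry-run by default); the `stage_commands` and `run_pipeline` helpers, the `--ref` value, and the directory names are hypothetical placeholders, not part of DipGNNome itself:

```python
import subprocess

def stage_commands(reads_dir, graphs_dir, out_dir,
                   ref="reference_genome_name", model="dipgnnome_trained.pt"):
    """Build the three DipGNNome stage invocations documented above."""
    return [
        ["python", "data_gen/unitig_gen_data_class.py",
         "--data_path", reads_dir, "--config", "configs/real_full.yml"],
        ["python", "training/train.py",
         "--data_path", graphs_dir,
         "--data_config", "configs/example_data_config.yml",
         "--device", "cuda:0", "--run_name", "demo"],
        ["python", "decoding/main.py",
         "--model", model, "--ref", ref,
         "--ass_out_dir", out_dir, "--filename", "demo",
         "--dataset", graphs_dir, "--strategy", "beam"],
    ]

def run_pipeline(reads_dir, graphs_dir, out_dir, dry_run=True):
    for cmd in stage_commands(reads_dir, graphs_dir, out_dir):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # abort on the first failure

run_pipeline("/data/hifi/", "/data/graphs/", "/data/asm/")
```

With the pre-trained model, the training invocation can simply be dropped from the list.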
```
DipGNNome/
├── data_gen/               # Data processing and graph generation
│   ├── unitig_gen_data_class.py
│   └── utg_builder.py
├── training/               # Model training and architecture
│   ├── train.py
│   ├── SymGatedGCN.py
│   └── utils.py
├── decoding/               # Assembly and path-finding algorithms
│   ├── main.py
│   ├── inference.py
│   ├── graph_walk.py
│   └── eval.py
├── configs/                # Configuration files
├── install_tools.py        # Installation script
├── requirements.txt        # Python dependencies
├── environment_*.yml       # Conda environments
└── setup_*.sh              # Setup scripts
```
Python dependencies:
- Deep Learning: PyTorch, DGL (Deep Graph Library)
- Scientific Computing: NumPy, SciPy, Pandas, Scikit-learn
- Bioinformatics: BioPython, Edlib, PyLiftover
- Graph Processing: NetworkX
- Utilities: tqdm, PyYAML, Wandb
External tools:
- hifiasm (v0.25.0) - HiFi read assembly
- PBSIM3 - PacBio read simulation
- yak - k-mer counting and analysis
DipGNNome uses YAML configuration files for easy customization:
- `configs/config.yml` - Main training and model parameters
- `configs/example_data_config.yml` - Example dataset configuration with placeholder paths
- `configs/dataset_*.yml` - Dataset-specific configurations (legacy/example variants)
- `decoding/decode_strategies.yml` - Assembly strategy parameters
Update the following placeholders in `configs/example_data_config.yml` to match your environment:
```yaml
paths:
  pbsim_path: /path/to/pbsim3          # or use vendor/pbsim3 after `python install_tools.py`
  hifiasm_path: /path/to/hifiasm_025/  # directory containing the hifiasm binary
  hifiasm_dump: /path/to/hifiasm_dump  # writable temp directory for hifiasm dumps
  yak_path: /path/to/yak               # or vendor/yak after `python install_tools.py`
  centromere_coords: /path/to/centromere_coordinates.yml
```
Notes:
- Running `python install_tools.py` installs third-party tools under `vendor/`. If you use that, set `yak_path: vendor/yak` and `pbsim_path: vendor/pbsim3`.
- You still need to provide `hifiasm_path` if hifiasm is not installed system-wide.
- Prefer paths relative to the repository root (e.g., `data/...`, `vendor/...`).
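The note above about repository-relative paths can be made concrete with a small stdlib-only sketch. The `resolve_config_paths` helper below is hypothetical (DipGNNome's actual config handling may differ): it interprets relative entries such as `vendor/yak` against the repository root while leaving absolute paths untouched:

```python
from pathlib import PurePosixPath

def resolve_config_paths(paths: dict, repo_root: str) -> dict:
    """Resolve configured tool paths: relative entries are joined to the
    repository root, absolute entries are kept as-is."""
    root = PurePosixPath(repo_root)
    resolved = {}
    for key, value in paths.items():
        p = PurePosixPath(value)
        resolved[key] = p if p.is_absolute() else root / p
    return resolved

# Example: mixing vendor/ paths with an absolute hifiasm install
paths = {
    "yak_path": "vendor/yak",
    "pbsim_path": "vendor/pbsim3",
    "hifiasm_path": "/opt/hifiasm_025/",
}
resolved = resolve_config_paths(paths, "/home/user/DipGNNome")
print(resolved["yak_path"])      # → /home/user/DipGNNome/vendor/yak
print(resolved["hifiasm_path"])  # → /opt/hifiasm_025
```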
DipGNNome achieves comparable performance to state-of-the-art diploid assemblers. The following table compares DipGNNome (both Greedy and Beam Search variants) with hifiasm across multiple genomes:
| Genome | Metric | DipGNNome Greedy | DipGNNome Beam Search | hifiasm |
|---|---|---|---|---|
| H. sapiens | Length (Mb) | 2840.0 / 2946.1 | 2856.4 / 2970.3 | 2938.5 / 3032.3 |
| | Rdup (%) | 0.1 / 0.1 | 0.2 / 0.3 | 0.8 / 0.2 |
| | NG50 (Mb) | 35.7 / 25.7 | 63.0 / 65.2 | 56.0 / 59.4 |
| | NGA50 (Mb) | 33.6 / 25.7 | 48.4 / 49.3 | 49.7 / 58.6 |
| | Switch Err (%) | 1.1 / 1.2 | 1.1 / 1.2 | 0.8 / 1.0 |
| | Hamming Err (%) | 2.2 / 2.5 | 1.9 / 3.0 | 0.8 / 0.8 |
| P. paniscus | Length (Mb) | 3114.5 / 2934.2 | 3125.5 / 2947.4 | 3204.4 / 3066.0 |
| | Rdup (%) | 2.0 / 0.8 | 1.9 / 1.0 | 1.2 / 1.1 |
| | NG50 (Mb) | 52.6 / 48.5 | 94.0 / 70.3 | 100.1 / 63.0 |
| | NGA50 (Mb) | 35.2 / 39.6 | 50.1 / 42.3 | 55.2 / 47.1 |
| | Switch Err (%) | 0.3 / 1.8 | 0.3 / 1.7 | 0.2 / 1.5 |
| | Hamming Err (%) | 1.3 / 2.3 | 1.7 / 3.1 | 0.1 / 1.3 |
| G. gorilla | Length (Mb) | 3366.2 / 3267.0 | 3389.8 / 3281.6 | 3528.4 / 3351.6 |
| | Rdup (%) | 1.1 / 1.4 | 1.2 / 1.4 | 1.2 / 1.1 |
| | NG50 (Mb) | 69.0 / 44.3 | 108.1 / 100.6 | 94.7 / 82.6 |
| | NGA50 (Mb) | 45.7 / 33.6 | 56.1 / 48.9 | 55.5 / 48.6 |
| | Switch Err (%) | 0.3 / 0.2 | 0.3 / 0.2 | 0.2 / 0.2 |
| | Hamming Err (%) | 1.1 / 1.0 | 1.4 / 1.0 | 0.2 / 0.2 |
| P. troglodytes | Length (Mb) | 3056.0 / 2938.1 | 3080.7 / 2956.3 | 3149.0 / 3032.9 |
| | Rdup (%) | 0.1 / 0.1 | 0.5 / 0.2 | 0.1 / 0.1 |
| | NG50 (Mb) | 72.1 / 74.8 | 136.8 / 136.6 | 126.0 / 121.9 |
| | NGA50 (Mb) | 62.2 / 68.0 | 102.8 / 76.1 | 101.9 / 121.9 |
| | Switch Err (%) | 0.1 / 0.2 | 0.1 / 0.2 | 0.1 / 0.2 |
| | Hamming Err (%) | 0.3 / 0.4 | 0.7 / 1.3 | 0.1 / 0.1 |
| S. syndactylus | Length (Mb) | 3131.6 / 3029.6 | 3133.5 / 3038.1 | 3230.0 / 3127.6 |
| | Rdup (%) | 0.2 / 0.1 | 0.3 / 0.3 | 0.1 / 0.7 |
| | NG50 (Mb) | 67.6 / 50.7 | 90.8 / 69.6 | 84.9 / 75.4 |
| | NGA50 (Mb) | 55.4 / 45.8 | 56.1 / 58.6 | 76.1 / 38.7 |
| | Switch Err (%) | 0.1 / 0.2 | 0.1 / 0.2 | 0.1 / 0.2 |
| | Hamming Err (%) | 0.3 / 0.4 | 0.5 / 0.6 | 0.1 / 0.1 |
| P. abelii | Length (Mb) | 3297.8 / 3487.0 | 3332.7 / 3522.1 | 2967.5 / 3361.0 |
| | Rdup (%) | 10.5 / 12.7 | 11.1 / 13.6 | 2.3 / 7.5 |
| | NG50 (Mb) | 55.9 / 53.7 | 79.8 / 80.8 | 40.1 / 94.2 |
| | NGA50 (Mb) | 43.0 / 40.6 | 45.7 / 57.5 | 34.1 / 65.3 |
| | Switch Err (%) | 6.5 / 4.6 | 6.5 / 4.6 | 4.8 / 3.8 |
| | Hamming Err (%) | 6.3 / 4.4 | 6.7 / 4.4 | 5.0 / 3.8 |
| P. pygmaeus | Length (Mb) | 3166.4 / 3272.6 | 3176.3 / 3304.0 | 3092.5 / 3212.2 |
| | Rdup (%) | 5.8 / 5.8 | 6.2 / 6.4 | 2.7 / 6.5 |
| | NG50 (Mb) | 51.8 / 52.5 | 108.2 / 100.9 | 65.8 / 85.0 |
| | NGA50 (Mb) | 45.6 / 51.6 | 78.9 / 92.2 | 55.7 / 47.4 |
| | Switch Err (%) | 7.3 / 8.9 | 7.2 / 8.8 | 6.9 / 8.7 |
| | Hamming Err (%) | 7.7 / 9.6 | 7.5 / 9.0 | 7.8 / 10.7 |
Note: Values are shown as paternal/maternal haplotypes.
If you use DipGNNome in your research, please cite:
To be made citable.

This project is licensed under the BSD 3-Clause License; see the LICENSE file for details.
For installation issues, see INSTALLATION.md for detailed troubleshooting.
For questions about usage or methodology, please open an issue on GitHub.
- Built on top of hifiasm for initial assembly
- Extends GNNome, the first deep learning–based assembler for haploid genomes