RBAM is a framework that leverages variational autoencoders (VAE) to learn latent genotype representations, facilitating representation-informed association mapping and phenotype classification. This approach addresses limitations of traditional GWAS by accounting for polygenicity, epistatic interactions, and linkage disequilibrium.
Genome-wide association studies (GWAS) have provided key insights into the genetic architecture of complex diseases. However, traditional approaches often fall short in accounting for polygenicity, epistatic interactions, and linkage disequilibrium, leading to reduced power. We present Representation Learning-Based Association Mapping (RBAM), a framework that leverages variational autoencoders (VAE) to learn latent genotype representations, facilitating representation-informed association mapping and phenotype classification. Using GWAS samples from the UK Biobank, dbGaP, and WTCCC for 17 complex disorders and traits (spanning brain disorders, immunological traits, cancers, cardiometabolic traits, and quantitative phenotypes), RBAM demonstrates superior power to detect gene-disease associations validated against DisGeNET disease-specific databases. Simulation studies confirm that RBAM maintains a controlled Type I error rate. Functional analysis reveals overlapping genetic pathways among different diseases. Overall, RBAM provides a robust and interpretable framework, bridging the gap between unsupervised representation learning and association mapping.
Keywords: Representation learning, Variational auto-encoder, Genome-wide association study, Kernel association testing, Complex traits, Polygenic risk prediction
1. Clone the RBAM repository:
git clone https://github.com/davidenoma/rbam.git
2. Clone the MOKA pipeline (required for association mapping):
git clone https://github.com/davidenoma/moka.git ~/moka
3. (Recommended) Create a new Python 3.9 environment and activate it (the command below assumes a virtual environment named rbam_env):
source rbam_env/bin/activate
4. Install Python dependencies:
#### Using Conda
```bash
conda install --file requirements.txt
```
#### Using pip
```bash
pip install -r requirements.txt
```
To ensure reproducibility, the following Python package versions were used:
numpy 1.21.6, pandas 1.3.5, tensorflow 2.6.0, keras 2.6.0, hyperopt 0.2.7, matplotlib 3.4.3, scikit-learn 0.24.2, xgboost 1.5.2, shap 0.39.0, seaborn 0.11.2, and pyyaml 5.4.1.
These versions are specified in the requirements.txt file and should be installed to replicate the computational environment used in this study.
For the benchmarking and method comparisons reported in this work, we used the following external tools and versions:
- REGENIE v2.2.4: used for single-variant and whole-genome regression-based association comparisons (https://github.com/rgcgithub/regenie)
- SKAT R package v2.0.5: used for sequence/kernel-based association testing comparisons (available via CRAN/Bioconductor)
These versions were selected to ensure reproducibility of the results.
- PLINK - For genotype data processing
- MOKA Pipeline - For association mapping
export PATH=$PATH:/path/to/plink
Your input genotype data must be in PLINK binary format (.bed, .bim, .fam files).
The framework supports both:
- Case-control studies (binary phenotypes)
- Quantitative traits (continuous phenotypes)
python runner/rbam_main.py test_geno/test_geno.raw test_geno/test_geno.bim cc
Parameters:
- `test_geno/test_geno.raw`: Path to PLINK raw format genotype file, which can be generated from the binary files with:
plink --recode A --bfile genotype_file --out genotype_file
- `test_geno/test_geno.bim`: Path to corresponding BIM file
- `<phenotype_type>`: Either `"cc"` (case-control) or `"quantitative"`
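
To make the input format concrete, the sketch below reads a PLINK `.raw` file into a genotype matrix. It relies only on the standard `.raw` layout (six leading columns `FID IID PAT MAT SEX PHENOTYPE`, then one additive-coded column per SNP, with `NA` for missing calls). The helper name `load_raw`, the toy data, and the per-SNP mean imputation are illustrative assumptions, not necessarily what `rbam_main.py` does internally.

```python
import io

import pandas as pd

# Tiny stand-in for a real .raw file produced by `plink --recode A`.
RAW_TEXT = """FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G rs3_T
f1 i1 0 0 1 2 0 1 NA
f2 i2 0 0 2 1 1 NA 2
f3 i3 0 0 1 2 2 1 0
"""

# The six non-genotype columns that lead every .raw file.
META_COLS = ["FID", "IID", "PAT", "MAT", "SEX", "PHENOTYPE"]


def load_raw(handle):
    """Hypothetical helper: return (genotypes, phenotype) from a .raw file."""
    df = pd.read_csv(handle, sep=r"\s+", na_values=["NA"])
    genotypes = df.drop(columns=META_COLS).astype(float)
    # Assumption: fill missing calls with the per-SNP mean dosage.
    genotypes = genotypes.fillna(genotypes.mean())
    return genotypes, df["PHENOTYPE"]


genotypes, phenotype = load_raw(io.StringIO(RAW_TEXT))
print(genotypes)
```

Passing a file path instead of the `io.StringIO` object works the same way, since `pandas.read_csv` accepts both.
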
Output:
- Trained VAE model saved in `model/` or `model_cc_com_qt/`
- Reconstruction metrics (MSE and R²)
- Encoder and decoder weights extracted to `output_weights/`
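
For readers unfamiliar with the reconstruction metrics listed above, here is a minimal sketch of how MSE and R² can be computed between an original and a reconstructed genotype matrix. The function name and the pooled (all-entries) definition of R² are assumptions made for illustration; RBAM's own scripts may aggregate differently, for example per SNP.

```python
import numpy as np


def reconstruction_metrics(original, reconstructed):
    """Hypothetical helper: pooled MSE and R² over all matrix entries."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    residual = original - reconstructed
    mse = float(np.mean(residual ** 2))
    # R² = 1 - SS_res / SS_tot, with SS_tot taken around the global mean.
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((original - original.mean()) ** 2))
    return mse, 1.0 - ss_res / ss_tot


# Toy dosages (samples x SNPs) and an imperfect reconstruction.
x = np.array([[0, 1, 2], [2, 1, 0]])
x_hat = np.array([[0.1, 0.9, 1.8], [1.9, 1.2, 0.1]])
mse, r2 = reconstruction_metrics(x, x_hat)
print(f"MSE={mse:.3f}, R2={r2:.3f}")
```
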
python runner/rbam_XAI_main.py test_geno/test_geno.raw test_geno/test_geno.bim cc
Features:
- Hyperparameter optimization using Hyperopt
- SHAP (SHapley Additive exPlanations) for explainable AI
- Feature importance extraction
- Reconstruction quality metrics
Output:
- SHAP values for feature importance: `model_outputs/hopt_AE/shap_values_*/`
- Merged SNP weights: `*_merged_snp_and_weights.csv`
- Visualization plots: SHAP bar plots
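
As a rough illustration of what a merged SNP-weight table could look like, the sketch below collapses a SHAP matrix (samples × SNPs) to one weight per SNP using the mean absolute SHAP value, then joins it to the BIM annotation. The BIM column order (chromosome, SNP ID, genetic distance, base-pair position, two alleles) is the standard PLINK layout. The function name, the output column names, and the assumption that SHAP columns follow BIM row order are hypothetical, so the actual `*_merged_snp_and_weights.csv` may be laid out differently.

```python
import io

import numpy as np
import pandas as pd

# Toy BIM content: chrom, SNP id, cM, bp, allele 1, allele 2 (no header).
BIM_TEXT = "1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n2\trs3\t0\t1500\tG\tA\n"
BIM_COLS = ["chrom", "snp", "cm", "bp", "a1", "a2"]


def merge_importance(bim_handle, shap_values):
    """Hypothetical helper: one mean-|SHAP| weight per SNP, joined to BIM rows."""
    bim = pd.read_csv(bim_handle, sep=r"\s+", header=None, names=BIM_COLS)
    importance = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    if len(importance) != len(bim):
        raise ValueError("SHAP columns and BIM rows must describe the same SNPs")
    merged = bim[["chrom", "snp", "bp"]].copy()
    merged["weight"] = importance  # assumes SHAP columns are in BIM order
    return merged.sort_values("weight", ascending=False).reset_index(drop=True)


shap_values = np.array([[0.2, -0.5, 0.0], [-0.4, 0.3, 0.0]])
merged = merge_importance(io.StringIO(BIM_TEXT), shap_values)
print(merged)
```
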
python runner/rbam_predictor.py test_geno/test_geno.raw
Classifiers Implemented:
- Logistic Regression
- Random Forest
- XGBoost
- Neural Network (TensorFlow)
Features:
- Automated hyperparameter tuning
- Cross-validation
- Class imbalance handling
- Multiple performance metrics (Accuracy, AUC, R²)
Output:
- Classification results: `model_outputs/rbam_classifier/`
- Performance metrics for each classifier
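
The class imbalance handling mentioned above can take several forms. The sketch below shows two common ones in plain NumPy: balanced class weights, using the `n_samples / (n_classes * class_count)` heuristic that scikit-learn applies for `class_weight="balanced"`, and random downsampling of the majority class. Both function names are hypothetical and the code is an illustration of the general techniques, not a copy of `rbam_predictor.py`.

```python
import numpy as np


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


def downsample_indices(labels, seed=0):
    """Return sorted row indices with every class cut to the rarest class size."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = int(counts.min())
    kept = [
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))


y = np.array([0] * 8 + [1] * 2)  # 8 controls, 2 cases
weights = balanced_class_weights(y)
idx = downsample_indices(y)
print(weights, y[idx])
```

Weights of this form can be passed to classifiers that accept per-class weights, while the index array can be used to subset both the genotype matrix and the label vector.
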
For comprehensive analysis including association mapping:
python single_folder_reconstruction_and_moka.py <folder_path> [options]
Options:
- `--quantitative`: For quantitative traits (default: binary/case-control)
- `--reconstruction`: Enable genotype reconstruction
- `--plink-path <path>`: Specify the path to the PLINK binary (default: `plink` in PATH)
Example:
# Binary trait analysis with default PLINK and genotype reconstruction enabled
python single_folder_reconstruction_and_moka.py /path/to/genotype_folder --reconstruction
# Quantitative trait analysis
python single_folder_reconstruction_and_moka.py /path/to/genotype_folder --quantitative
# Without genotype reconstruction, which is faster after model training
python single_folder_reconstruction_and_moka.py /path/to/genotype_folder
# Specify custom PLINK binary location
python single_folder_reconstruction_and_moka.py /path/to/genotype_folder --plink-path /usr/local/bin/plink
- The folder name must match the genotype file prefix. For example, if your genotype files are named `test_geno.bed`, `test_geno.bim`, and `test_geno.fam`, they must be placed in a folder named `test_geno/`.
- Example structure:
test_geno/
├── test_geno.bed
├── test_geno.bim
└── test_geno.fam
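
Because the folder name doubles as the file prefix, a quick pre-flight check can catch naming mistakes before a long run. The helper below is a hypothetical sketch using only the standard library; it is not part of the RBAM scripts.

```python
import tempfile
from pathlib import Path

REQUIRED_SUFFIXES = (".bed", ".bim", ".fam")


def missing_genotype_files(folder):
    """Hypothetical helper: list expected <folder_name>.<ext> files that are absent."""
    folder = Path(folder)
    prefix = folder.name  # the folder name must equal the PLINK file prefix
    return [
        prefix + suffix
        for suffix in REQUIRED_SUFFIXES
        if not (folder / (prefix + suffix)).is_file()
    ]


# Demonstration on a throwaway folder that deliberately lacks the .fam file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "test_geno"
    demo.mkdir()
    for suffix in (".bed", ".bim"):
        (demo / f"test_geno{suffix}").touch()
    missing = missing_genotype_files(demo)

print(missing)
```

An empty list means the folder satisfies the naming rule described above.
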
Remember to clone the MOKA pipeline (required for association mapping):
git clone https://github.com/davidenoma/moka.git ~/moka
- If you do not provide `--plink-path` and PLINK is not found in your system PATH, the pipeline will automatically download and unzip PLINK for you (Linux/macOS supported).
- You can still specify a custom PLINK binary using `--plink-path` if needed.
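
The lookup order described above (explicit path first, then the system PATH) can be sketched with the standard library as follows. The function name `resolve_plink` is hypothetical, and the sketch deliberately stops at returning `None` when nothing is found; it does not reproduce the pipeline's automatic download step.

```python
import shutil
from pathlib import Path


def resolve_plink(plink_path=None):
    """Hypothetical helper: prefer an explicit binary, else search PATH."""
    if plink_path is not None:
        candidate = Path(plink_path)
        if not candidate.is_file():
            raise FileNotFoundError(f"No PLINK binary at {candidate}")
        return str(candidate)
    # shutil.which returns None when `plink` is not on PATH.
    return shutil.which("plink")


found = resolve_plink()
print(found if found else "PLINK not on PATH; the pipeline would download it")
```
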
# With genotype reconstruction
python single_folder_reconstruction_and_moka.py test_geno --reconstruction
# For quantitative trait model
python single_folder_reconstruction_and_moka.py test_geno --quantitative
# Run the pipeline on the provided test_geno example (binary trait)
python single_folder_reconstruction_and_moka.py test_geno
# Specify custom PLINK binary location
python single_folder_reconstruction_and_moka.py test_geno --plink-path /usr/local/bin/plink
Note: If you do not specify `--plink-path` and PLINK is not installed, the script will automatically download and use PLINK.
Workflow:
1. Data Preprocessing:
   - LD pruning using PLINK (`--indep-pairwise 50 5 0.2`)
   - Genotype folder must bear the same name as the genotype (bed, bim, and fam) files, e.g.
     - test_geno/
       - test_geno.bim
       - test_geno.fam
       - test_geno.bed
   - Conversion to raw format
   - Missing value imputation
2. Model Training:
   - VAE and Autoencoder training
   - Weight extraction (encoder, decoder, combined)
   - SHAP analysis for feature importance
3. Association Mapping:
   - Integration with MOKA pipeline
   - Multiple weight types analysis:
     - `enc`: Encoder weights
     - `dec`: Decoder weights
     - `enc_dec`: Combined weights
     - `shap`: SHAP-based weights
4. Results Generation:
   - GWAS results for each weight type
   - Manhattan plots
   - Merged association results
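
The preprocessing stage of the workflow can be pictured as three PLINK calls: find an LD-pruned SNP set, extract it, and recode to raw format. The sketch below only builds the argument lists so they can be inspected before anything runs. The flags are standard PLINK 1.9 options, and `--indep-pairwise` writes the retained SNPs to `<out>.prune.in`. The function name and the `_pruned` output suffix are assumptions; the pipeline's actual file naming may differ.

```python
def ld_prune_commands(prefix, plink="plink", window=50, step=5, r2=0.2):
    """Hypothetical helper: PLINK argument lists for prune, extract, recode."""
    pruned = f"{prefix}_pruned"  # assumed output prefix, for illustration only
    find_snps = [
        plink, "--bfile", prefix,
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", pruned,
    ]
    extract = [
        plink, "--bfile", prefix,
        "--extract", f"{pruned}.prune.in",
        "--make-bed", "--out", pruned,
    ]
    recode = [plink, "--bfile", pruned, "--recode", "A", "--out", pruned]
    return [find_snps, extract, recode]


commands = ld_prune_commands("test_geno/test_geno")
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once PLINK is available.
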
├── model/ # Trained models (case-control)
├── model_cc_com_qt/ # Trained models (quantitative)
├── model_outputs/ # Analysis results
│ ├── hopt_AE/ # Autoencoder results
│ │ └── shap_values_*/ # SHAP analysis
│ └── rbam_classifier/ # Classification results
├── output_weights/ # Extracted weights
│ ├── hopt/ # Binary trait weights
│ └── hopt_cc_com_or_quant/ # Quantitative trait weights
└── moka_pipeline/ # MOKA pipeline results
    ├── result_folder/ # Association results
    └── output_plots/ # Manhattan plots
- Variational Autoencoders (VAE): Learn latent representations accounting for genetic architecture
- Hyperparameter Optimization: Automated tuning using Hyperopt
- Cross-validation: Robust model evaluation
- SHAP Analysis: Feature importance for individual SNPs
- Multiple Weight Types: Encoder, decoder, and combined weights
- Visualization: Bar plots and importance rankings
- Multiple Algorithms: Logistic Regression, Random Forest, XGBoost, Neural Networks
- Class Imbalance Handling: Balanced class weights and downsampling
- Performance Metrics: Accuracy, AUC, R² scores
- Integration with MOKA: Seamless pipeline for GWAS
- Multiple Weight Strategies: Different biological interpretations
- Spectral Decorrelation: Optional preprocessing for population structure
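
For context on the VAE component listed above, the sketch below evaluates the textbook Gaussian VAE objective: a squared-error reconstruction term plus the KL divergence between the encoder's diagonal Gaussian and a standard normal prior. This README does not spell out RBAM's exact loss, so the squared-error choice, the `beta` weighting parameter, and the function name are assumptions for illustration.

```python
import numpy as np


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Standard VAE objective, averaged over samples (illustrative only)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Reconstruction term: squared error summed over SNPs.
    recon = float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
    # KL(N(mu, sigma^2) || N(0, I)) summed over latent dimensions.
    kl_per_sample = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    kl = float(np.mean(kl_per_sample))
    return recon + beta * kl, recon, kl


x = np.array([[0.0, 1.0, 2.0]])
total, recon, kl = vae_loss(x, x, mu=[[1.0, 0.0]], log_var=[[0.0, 0.0]])
print(total, recon, kl)
```

With a perfect reconstruction the first term vanishes, and the KL term penalizes only the latent mean's distance from zero.
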
The framework provides comprehensive evaluation:
- Reconstruction Quality: MSE and R² scores
- Feature Importance: Encoder and decoder VAE weight distributions and SHAP values on Autoencoder
- Classification Performance: Cross Validated prediction metrics (Accuracy, AUC)
- Association Results: Manhattan plots, KEGG and GO enrichment analyses
- GPU Memory Issues: Reduce batch size or use CPU-only mode
- PLINK Errors: Ensure PLINK is in PATH and data format is correct
- Missing Dependencies: Install all required packages and external tools
- Use smaller batch sizes for large datasets
- Enable TensorFlow memory growth: `tf.config.experimental.set_memory_growth()`
- Consider data chunking for very large genotype files
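
The memory growth tip above needs two arguments in practice: `tf.config.experimental.set_memory_growth` takes a physical device and a boolean, and it has to be called before TensorFlow initializes the GPUs. A minimal sketch follows; the wrapper function name is hypothetical, and the `ImportError` guard is there only so the snippet also runs on machines without TensorFlow.

```python
def enable_gpu_memory_growth():
    """Hypothetical helper: turn on memory growth for every visible GPU.

    Returns the number of GPUs configured, or None if TensorFlow is absent.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Allocate GPU memory on demand instead of reserving it all up front.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


configured = enable_gpu_memory_growth()
print(f"GPUs configured for memory growth: {configured}")
```

Call it once at the top of a script, before any model or tensor is created.
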
If you use RBAM in your research, please cite: Representation learning-based genome-wide association mapping discovers genes underlying complex traits https://doi.org/10.21203/rs.3.rs-7624342/v1
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and support:
- Open an issue on GitHub
- Contact: david.enoma@ucalgary.ca or quan.long@ucalgary.ca
- UK Biobank, dbGaP, and WTCCC for providing genetic data
- DisGeNET for disease-gene association validation
- The MOKA pipeline for association mapping framework

