Predicting CRE-Gene Interactions from DNA Sequence by Fine-Tuning Enformer with Single-Cell Multiome Data
This repository contains the data preparation and training scripts. The feature_linkages folder, hg38.pk, and gencode.v32.annotation.gtf are omitted because of GitHub storage constraints.
The omitted files and the processed data are available on Google Drive: https://drive.google.com/drive/folders/1u6fTEUJmviggkTk2OXMYRfot0fvLfZj8?usp=drive_link
Although deep learning models like Enformer achieve excellent performance on a variety of genomic prediction tasks, they struggle to accurately model gene expression variation across individuals. This suggests a gap in the model's fundamental understanding of how cis-regulatory elements (CREs) and genes interact. To address this, we hypothesized that training a model to explicitly predict CRE-gene linkages would improve its regulatory understanding. In this study, we fine-tuned Enformer on the 10x Genomics Human PBMC Single-Cell Multiome dataset, using paired chromatin accessibility and gene expression data to generate high-confidence peak-gene linkages. We took a supervised learning approach with a balanced mean squared error loss function and added explicit distance tracks to the model input. We compared a linear probe baseline against a model in which the last transformer layer was fine-tuned. While the baseline model failed to localize regulatory elements (