DeepASMM: Decoding Cis-Regulatory Mechanisms via Quantification of Motif Autonomy and Contextual Synergy
DeepASMM (Deep Learning Driven Autonomous and Synergic Motif Mining Framework) is a neural network–based approach that quantifies both autonomous functionality and sequence context synergy of motifs through forward-propagation perturbation analysis.
For additional details, see the DeepASMM publication: DeepASMM: Decoding Cis-Regulatory Mechanisms via Quantification of Motif Autonomy and Contextual Synergy.
We developed a deep learning–based framework called DeepASMM to identify motifs with autonomous effects and sequence context synergy in genomic sequences.
DeepASMM consists of four main steps. First, deep learning–based genomic sequence prediction models are constructed to capture regulatory information embedded in sequences. One-hot encoding is used to preprocess sequences, enabling the model to effectively learn predictive rules. Second, background sequences are selected from true positive samples, ensuring that the sequences used for motif discovery contain real regulatory signals. Third, candidate motif localization is performed by scanning background sequences to identify all occurrences of each motif. Fourth, motif functionality assessment is conducted: the motif autonomous functionality score quantifies the intrinsic regulatory effect of a motif by embedding it into empty sequences, while the sequence context synergy score measures how the surrounding sequence context influences the motif’s effect.
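The encoding, localization, and scoring steps can be illustrated with a minimal sketch. It is not the DeepASMM implementation: the function names, the all-zero representation of an "empty" sequence, the exact-match scan, and the two score formulas (autonomy as the prediction change from embedding the motif in an empty sequence, synergy as the in-context effect minus that autonomy) are assumptions made for illustration, and `toy_predict` is a hypothetical stand-in for a trained model.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as an (L, 4) array; non-ACGT bases become all-zero rows."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out


def find_occurrences(seq, motif):
    """Return every start index of an exact motif match, overlaps included."""
    hits, start = [], seq.find(motif)
    while start != -1:
        hits.append(start)
        start = seq.find(motif, start + 1)
    return hits


def autonomy_score(predict, motif, length, pos):
    """Assumed definition: prediction change when the motif alone is placed
    in an all-zero ("empty") sequence of the given length."""
    empty = np.zeros((length, 4), dtype=np.float32)
    embedded = empty.copy()
    embedded[pos:pos + len(motif)] = one_hot(motif)
    return float(predict(embedded) - predict(empty))


def synergy_score(predict, seq, motif, pos):
    """Assumed definition: in-context motif effect (original minus
    motif-masked prediction) minus the autonomous effect."""
    x = one_hot(seq)
    masked = x.copy()
    masked[pos:pos + len(motif)] = 0.0
    in_context = float(predict(x) - predict(masked))
    return in_context - autonomy_score(predict, motif, len(seq), pos)


def toy_predict(x):
    """Hypothetical model: A count plus an A-by-G interaction term,
    so that sequence context changes the effect of an A-containing motif."""
    n_a = float(x[:, 0].sum())
    n_g = float(x[:, 2].sum())
    return n_a + 0.5 * n_a * n_g


seq, motif = "GGTATAGG", "TATA"
for pos in find_occurrences(seq, motif):
    print(pos,
          autonomy_score(toy_predict, motif, len(seq), pos),
          synergy_score(toy_predict, seq, motif, pos))
```

In practice `predict` would wrap a forward pass of the trained sequence model, which is why the scores require no gradient computation.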
In this study, we used the dataset of our previously developed maize gene expression prediction model DeepCBA for the experiments (Wang et al., 2024). This dataset includes chromatin interaction and gene expression data of three tissues (shoot, ear, and tassel) of maize (B73).
The maize chromatin accessibility prediction task also uses data from three tissues: shoot, ear, and tassel (Peng et al., 2019; Li et al., 2019; Sun et al., 2020). For each set of chromatin accessibility peaks, we took the 300 bp region centered on the peak's central locus as a positive sample. Negative samples were randomly selected from the maize B73 reference genome in equal number to the positive samples, with no overlap with positive regions. All samples were randomly split into training, validation, and test sets at a ratio of 6:2:2.
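The window construction and the 6:2:2 split can be sketched as follows. This is an illustration under stated assumptions, not the project's preprocessing code: the function names are hypothetical, the "central locus" is taken to be the midpoint of the peak interval, and windows are clamped at coordinate 0 (chromosome ends are not handled).

```python
import random


def center_window(start, end, width=300):
    """Return a fixed-width [start, end) window centered on the peak midpoint.

    Assumption: the central locus is the interval midpoint; the window is
    clamped so that it never starts below coordinate 0.
    """
    center = (start + end) // 2
    new_start = max(0, center - width // 2)
    return new_start, new_start + width


def split_622(samples, seed=0):
    """Shuffle samples and split them into train/validation/test at 6:2:2."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


peaks = [(1000, 1100), (5000, 5420), (10, 50)]
windows = [center_window(s, e) for s, e in peaks]
train, val, test = split_622(range(10))
print(windows, len(train), len(val), len(test))
```

Integer arithmetic is used for the split sizes to avoid floating-point rounding, and any remainder falls into the test set.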
For the human chromatin accessibility prediction task, we used the dataset reported in the Basset model (Kelley et al., 2016). This dataset contains 2,071,886 sequences of 600bp covering 164 human cell types. In this dataset, 1,930,000 sequences were randomly selected as the training set, 70,000 as the validation set, and 71,886 as the test set.
If you run this project on a GPU, configure CUDA and cuDNN to match the following versions.
| Component | Version |
|---|---|
| CUDA | 11.8 |
| cuDNN | 8.6 |
This project is based on Python 3.8.13. The required environment is as follows:
| Packages | Version |
|---|---|
| numpy | 1.19.5 |
| pandas | 1.2.4 |
| tensorflow | 2.4.0 |
| logomaker | 0.8 |
| matplotlib | 3.4.3 |
| tqdm | 4.62.3 |
Some test cases have also been verified to run on tensorflow 1.15.
For more required packages, please refer to the requirements.txt file in this project.
- For detailed instructions, please refer to the DeepASMM Manual in this repository.
- Parallel execution: DeepASMM supports Python-based multi-processing acceleration. Depending on your hardware configuration, up to 10× speedup can be achieved.
- Demo examples are available at DeepASMM Demos.
- If multiple motif sequence alignments are required, we recommend using our extended tool, TOMTOM Parallelization Tool. This tool is built on Python's multi-processing framework and wraps the MEME Suite TOMTOM module, achieving up to 100× faster alignment performance.
For questions or suggestions, please reach out: Li_jie@webmail.hzau.edu.cn, liujianxiao@mail.hzau.edu.cn.
