Implementation of *Recursive Cleaning for Large-scale Protein Data via Multimodal Learning* by Zixuan Jiang*, Sitao Zhang*, Jiahang Cao*, Qiang Zhang, Shiyi Liu, Yuetong Fang, Lingfeng Zhang, Rui Qing, and Renjing Xu. Please feel free to reach out to us at zjiang597@connect.hkust-gz.edu.cn with any questions.
We introduce ProtAC, a scalable Automatic Cleaning framework that corrects large protein datasets by leveraging both sequence and functional information through multimodal learning.
Our approach is a cyclic process consisting of three stages: first pretraining the model on a large noisy dataset, then finetuning the model on a small manually annotated dataset, and finally cleaning the noisy dataset using the finetuned model.
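A minimal pseudocode sketch of this cycle, using illustrative stand-in functions for the three stages (none of these names are the repo's actual API):

```python
# Illustrative sketch of the three-stage recursive cleaning cycle.
# The stage functions below are hypothetical stand-ins, not ProtAC's actual API.

def pretrain(model, noisy_data):    # stage 1: pre-train on the large noisy dataset
    return model

def finetune(model, curated_data):  # stage 2: fine-tune on the small curated dataset
    return model

def clean(model, noisy_data):       # stage 3: re-annotate the noisy dataset with the fine-tuned model
    return noisy_data

def recursive_clean(model, noisy_data, curated_data, num_rounds=4):
    for _ in range(num_rounds):
        model = pretrain(model, noisy_data)
        model = finetune(model, curated_data)
        noisy_data = clean(model, noisy_data)  # the cleaned set feeds the next round
    return model, noisy_data
```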
We achieve:

- a state-of-the-art (SOTA) model that outperforms competitors at diverse parameter scales, evaluated on multiple function-related downstream tasks;
- a cleaned UniRef50 dataset containing ~50M proteins with well-annotated functions;

and demonstrate that:

- our model is able to understand the relationships between intricate functional annotations in proteins, and we substantiate the validity of the proposed functional annotation revisions through extensive biological analysis (please see our paper for details).
To set up the environment and run our code, use the following commands in the terminal.

First, clone this repo:

```bash
git clone https://github.com/AzusaXuan/ProtAC.git
cd ProtAC
```

Then set up the environment:

```bash
conda create -n protac_env python=3.9
conda activate protac_env
pip3 install -r requirements.txt  # we offer this requirements list for reference; installing the packages yourself is encouraged
```
We provide `ckpt_kw` as the model checkpoint for the keyword-related task and `ckpt` for the other tasks.
| Model Version | Parameters | Layers | Heads | Checkpoints |
|---|---|---|---|---|
| ProtAC-ESM2-8M | 29M | 6 | 4 | ckpt, ckpt_kw |
| ProtAC-ESM2-35M | 79M | 12 | 8 | ckpt, ckpt_kw |
| ProtAC-ESM2-150M | 192M | 12 | 8 | ckpt, ckpt_kw |
| ProtAC-ESM2-650M | 824M | 24 | 16 | ckpt |
| ProtAC-PB | 27M | 6 | 4 | ckpt |
General usage:

```bash
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode <task_type: e.g. train, finetune, caption, eval> --checkpoint <path/to/ckpt> --depth <Layers> --attn_heads <Heads>
```

For evaluation:

```bash
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode eval
```
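For example, to evaluate the ProtAC-ESM2-35M model (12 layers, 8 heads, per the table above) on 4 GPUs, the call would look like the following; the checkpoint path is a placeholder for wherever you saved the downloaded `ckpt`:

```bash
torchrun --nproc_per_node=4 data_clean_main.py --mode eval --checkpoint ./checkpoints/protac_esm2_35m.pth --depth 12 --attn_heads 8
```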
For downstream tasks 2 and 3, we mainly use the code from ProtST.
```bash
torchrun --nproc_per_node=<number_of_gpus> run_downstream_GO.py --branch <MF/BP/CC>
torchrun --nproc_per_node=<number_of_gpus> run_downstream_EC.py
```
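For example, to run GO prediction on the Molecular Function (MF) branch with 4 GPUs:

```bash
torchrun --nproc_per_node=4 run_downstream_GO.py --branch MF
```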
```bash
# keyword prediction
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode kw_pred

# captioning on SwissProt
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode caption_sw

# one round of the recursive workflow: pre-train, fine-tune, then clean (caption)
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode train --actual_epoch <round_num>
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode finetune --actual_epoch <round_num>
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode caption --actual_epoch <round_num>
```
You can also run the whole workflow using `run_epochs.sh`.
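A minimal sketch of what such a workflow could look like, assuming it simply chains the three modes above for a fixed number of rounds (the round and GPU counts are placeholders; see `run_epochs.sh` in the repo for the actual script):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the recursive cleaning workflow, not the shipped script.
NUM_GPUS=4
for round in 0 1 2 3; do
  torchrun --nproc_per_node=$NUM_GPUS data_clean_main.py --mode train --actual_epoch $round     # pre-train on the noisy dataset
  torchrun --nproc_per_node=$NUM_GPUS data_clean_main.py --mode finetune --actual_epoch $round  # fine-tune on the curated dataset
  torchrun --nproc_per_node=$NUM_GPUS data_clean_main.py --mode caption --actual_epoch $round   # re-annotate (clean) the noisy dataset
done
```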
We offer the original datasets for training and the cleaned dataset for further training and evaluation. The first five datasets contain proteins as untokenized sequences together with binarized GO arrays (use the GO dict file to look up the corresponding GO ids); the last two instead carry binarized keyword arrays (use the KW_dict file to look up keyword terms). A decoding sketch follows the table below.
| Dataset | Amount (size) | Description |
|---|---|---|
| UniRef50-2018 | ~30M (11.1GB) | Original UniRef50 dataset as of May 2018, used for pre-training and the captioning task |
| UniRef50-cleaned | ~30M (11.7GB) | Cleaned UniRef50 with GO annotations created by ProtAC |
| SwissProt-train | ~530K (234MB) | SwissProt dataset updated to July 2023, used for pre-training and fine-tuning |
| SwissProt-test | ~30K (13MB) | SwissProt dataset updated to July 2023, used for evaluation |
| SwissProt-caption | 458 (300KB) | Sequences newly added to SwissProt from 2023 to January 2024 |
| SwissProt-keyword-train | 18K (325MB) | Train split for the keyword prediction task: the SwissProt test set is further split in a 3:2 ratio, assigning 18,000 sequences to the train set and 12,000 to the test set |
| SwissProt-keyword-test | 12K (307MB) | Test split for the keyword prediction task (see above) |
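A minimal sketch of decoding one binarized GO array back to GO ids, assuming the GO dict file can be loaded as a mapping from array index to GO id (the toy mapping and array below are illustrative; adapt the loading step to the files actually shipped with the datasets):

```python
import numpy as np

def decode_go(go_array, idx_to_go):
    """Return the GO ids whose bits are set in a binarized GO annotation array."""
    return [idx_to_go[i] for i in np.flatnonzero(go_array)]

# In practice idx_to_go would come from the GO dict file shipped with the
# datasets (its exact name and format are not assumed here); for illustration
# we use a toy index -> GO id mapping and a toy 5-dimensional binarized array.
toy_dict = {1: "GO:0005524", 3: "GO:0016301"}
toy_array = np.array([0, 1, 0, 1, 0])

print(decode_go(toy_array, toy_dict))  # ['GO:0005524', 'GO:0016301']
```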
The code and model weights are released under the MIT license. See LICENSE for details.
```bibtex
@article{jiang2024recursive,
  title={Recursive Cleaning for Large-scale Protein Data via Multimodal Learning},
  author={Jiang, Zixuan and Zhang, Sitao and Cao, Jiahang and Zhang, Qiang and Liu, Shiyi and Fang, Yuetong and Zhang, Lingfeng and Qing, Rui and Xu, Renjing},
  journal={bioRxiv},
  pages={2024--10},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```
