Implementation of *Recursive Cleaning for Large-scale Protein Data via Multimodal Learning* by Zixuan Jiang*, Sitao Zhang*, Jiahang Cao*, Qiang Zhang, Shiyi Liu, Yuetong Fang, Lingfeng Zhang, Rui Qing, and Renjing Xu. Please feel free to reach out to us at zjiang597@connect.hkust-gz.edu.cn with any questions.
We introduce ProtAC, a scalable Automatic Cleaning framework that corrects large protein datasets by leveraging both sequence and functional information through multimodal learning.
Our approach is a cyclic process consisting of three stages: first pretraining the model on a large noisy dataset, then finetuning the model on a small manually annotated dataset, and finally cleaning the noisy dataset using the finetuned model.
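A minimal pseudocode sketch of this cycle, using illustrative stand-in functions for the three stages (none of these names are the repo's actual API):

```python
# Illustrative sketch of the three-stage recursive cleaning cycle.
# The stage functions below are hypothetical stand-ins, not ProtAC's actual API.

def pretrain(model, noisy_data):    # stage 1: pre-train on the large noisy dataset
    return model

def finetune(model, curated_data):  # stage 2: fine-tune on the small curated dataset
    return model

def clean(model, noisy_data):       # stage 3: re-annotate the noisy dataset with the fine-tuned model
    return noisy_data

def recursive_clean(model, noisy_data, curated_data, num_rounds=4):
    for _ in range(num_rounds):
        model = pretrain(model, noisy_data)
        model = finetune(model, curated_data)
        noisy_data = clean(model, noisy_data)  # the cleaned set feeds the next round
    return model, noisy_data
```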
We achieve:

- a state-of-the-art (SOTA) model that outperforms competitors at diverse parameter scales, evaluated on multiple function-related downstream tasks;
- a cleaned UniRef50 dataset containing ~50M proteins with well-annotated functions;

and demonstrate that:

- our model is able to understand the relationships between intricate functional annotations in proteins, and we substantiate the validity of the proposed functional annotation revisions through extensive biological analysis (please see our paper for details).
To set up the environment and run our code, use the following commands in the terminal.

First, clone this repo:

```bash
git clone https://github.com/AzusaXuan/ProtAC.git
cd ProtAC
```

Then set up the environment:

```bash
conda create -n protac_env python=3.9
conda activate protac_env
pip3 install -r requirements.txt  # we offer this requirements list for reference; installing the packages yourself is encouraged
```
We provide `ckpt_kw` as the model checkpoint for the keyword-related task and `ckpt` for the other tasks.
| Model Version | Parameters | Layers | Heads | Checkpoints |
|---|---|---|---|---|
| ProtAC-ESM2-8M | 29M | 6 | 4 | ckpt, ckpt_kw |
| ProtAC-ESM2-35M | 79M | 12 | 8 | ckpt, ckpt_kw |
| ProtAC-ESM2-150M | 192M | 12 | 8 | ckpt, ckpt_kw |
| ProtAC-ESM2-650M | 824M | 24 | 16 | ckpt |
| ProtAC-PB | 27M | 6 | 4 | ckpt |
General usage:

```bash
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode <task_type: e.g. train, finetune, caption, eval> --checkpoint <path/to/ckpt> --depth <Layers> --attn_heads <Heads>
```

For evaluation:

```bash
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode eval
```
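For example, to evaluate the ProtAC-ESM2-35M model (12 layers, 8 heads, per the table above) on 4 GPUs, the call would look like the following; the checkpoint path is a placeholder for wherever you saved the downloaded `ckpt`:

```bash
torchrun --nproc_per_node=4 data_clean_main.py --mode eval --checkpoint ./checkpoints/protac_esm2_35m.pth --depth 12 --attn_heads 8
```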
For downstream tasks 2 and 3, we mainly use the code from ProtST.
```bash
torchrun --nproc_per_node=<number_of_gpus> run_downstream_GO.py --branch <MF/BP/CC>
torchrun --nproc_per_node=<number_of_gpus> run_downstream_EC.py
```
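For example, to run GO prediction on the Molecular Function (MF) branch with 4 GPUs:

```bash
torchrun --nproc_per_node=4 run_downstream_GO.py --branch MF
```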
```bash
# keyword prediction
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode kw_pred

# captioning on SwissProt
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode caption_sw

# one round of the recursive workflow: pre-train, fine-tune, then clean (caption)
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode train --actual_epoch <round_num>
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode finetune --actual_epoch <round_num>
torchrun --nproc_per_node=<number_of_gpus> data_clean_main.py --mode caption --actual_epoch <round_num>
```
You can also run the whole workflow using `run_epochs.sh`.
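A minimal sketch of what such a workflow could look like, assuming it simply chains the three modes above for a fixed number of rounds (the round and GPU counts are placeholders; see `run_epochs.sh` in the repo for the actual script):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the recursive cleaning workflow, not the shipped script.
NUM_GPUS=4
for round in 0 1 2 3; do
  torchrun --nproc_per_node=$NUM_GPUS data_clean_main.py --mode train --actual_epoch $round     # pre-train on the noisy dataset
  torchrun --nproc_per_node=$NUM_GPUS data_clean_main.py --mode finetune --actual_epoch $round  # fine-tune on the curated dataset
  torchrun --nproc_per_node=$NUM_GPUS data_clean_main.py --mode caption --actual_epoch $round   # re-annotate (clean) the noisy dataset
done
```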
We offer the original datasets for training and the cleaned dataset for further training and evaluation. The first five datasets contain proteins as untokenized sequences together with binarized GO arrays (use the GO dict file to look up the corresponding GO ids); the last two instead carry binarized keyword arrays (use the KW_dict file to look up keyword terms). A decoding sketch follows the table below.
| Dataset | Amount (size) | Description |
|---|---|---|
| UniRef50-2018 | ~30M (11.1GB) | Original UniRef50 dataset as of May 2018, used for pre-training and the captioning task |
| UniRef50-cleaned | ~30M (11.7GB) | Cleaned UniRef50 with GO annotations created by ProtAC |
| SwissProt-train | ~530K (234MB) | SwissProt dataset updated to July 2023, used for pre-training and fine-tuning |
| SwissProt-test | ~30K (13MB) | SwissProt dataset updated to July 2023, used for evaluation |
| SwissProt-caption | 458 (300KB) | Sequences newly added to SwissProt from 2023 to January 2024 |
| SwissProt-keyword-train | 18K (325MB) | Train split for the keyword prediction task: the SwissProt test set is further split in a 3:2 ratio, assigning 18,000 sequences to the train set and 12,000 to the test set |
| SwissProt-keyword-test | 12K (307MB) | Test split for the keyword prediction task (see above) |
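A minimal sketch of decoding one binarized GO array back to GO ids, assuming the GO dict file can be loaded as a mapping from array index to GO id (the toy mapping and array below are illustrative; adapt the loading step to the files actually shipped with the datasets):

```python
import numpy as np

def decode_go(go_array, idx_to_go):
    """Return the GO ids whose bits are set in a binarized GO annotation array."""
    return [idx_to_go[i] for i in np.flatnonzero(go_array)]

# In practice idx_to_go would come from the GO dict file shipped with the
# datasets (its exact name and format are not assumed here); for illustration
# we use a toy index -> GO id mapping and a toy 5-dimensional binarized array.
toy_dict = {1: "GO:0005524", 3: "GO:0016301"}
toy_array = np.array([0, 1, 0, 1, 0])

print(decode_go(toy_array, toy_dict))  # ['GO:0005524', 'GO:0016301']
```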
The code and model weights are released under the MIT license. See LICENSE for details.
```bibtex
@article{jiang2024recursive,
  title={Recursive Cleaning for Large-scale Protein Data via Multimodal Learning},
  author={Jiang, Zixuan and Zhang, Sitao and Cao, Jiahang and Zhang, Qiang and Liu, Shiyi and Fang, Yuetong and Zhang, Lingfeng and Qing, Rui and Xu, Renjing},
  journal={bioRxiv},
  pages={2024--10},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```
