This README provides an overview of the environment setup and directory/configuration file organization required for multitask Graph Neural Network (GNN) training on ADME property prediction and subsequent fine-tuning, as employed in the paper “Improving ADME Prediction with Multitask Graph Neural Networks and Assessing Explainability in Lead Optimization” by Ito S. et al.
By following this document, you will understand the essential requirements for data preparation and configuration placement. Detailed training and fine-tuning procedures are not covered here.
- Data Curator: Shoma Ito
- Repository Manager: Takuto Koyama (koyama.takuto.82j[at]st.kyoto-u.ac.jp)
- kmol v1.1.7 (GitHub): the command-line tool for model training and evaluation
Install required packages in a dedicated environment:
git clone https://github.com/elix-tech/kmol.git
cd kmol
git checkout v1.1.7
make create-env
conda activate kmol

ADMET_MTFT/
├── dataset/
│   └── all_train_data_2022/
│       ├── standardize-drug_list/
│       │   └── ADME/
│       │       └── all_data/
│       │           └── log_data/   # per-target log-transformed values
│       └── multitask.csv           # combined multitask labels and values
└── configs/
    └── accuracy_drug/
        └── adme/
            ├── multitask/                  # config files for GNN multitask training
            └── finetuning_learning_rate/   # config files for GNN fine-tuning
- dataset/all_train_data_2022:
  - standardize-drug_list/ADME/all_data/log_data: per-target log-transformed values
  - .../multitask.csv: combined multitask labels and values
- configs/accuracy_drug/adme:
  - multitask/: config files for GNN multitask training
  - finetuning_learning_rate/: config files for GNN fine-tuning
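As a quick sanity check on the layout described above, the following minimal Python sketch lists which of the expected paths are missing under a given root directory. The function name `missing_paths` and the constant `EXPECTED_PATHS` are hypothetical helpers introduced only for illustration; the relative paths themselves are taken from the directory tree in this README.

```python
from pathlib import Path

# Relative paths taken from the directory layout described in this README.
EXPECTED_PATHS = [
    "dataset/all_train_data_2022/standardize-drug_list/ADME/all_data/log_data",
    "dataset/all_train_data_2022/multitask.csv",
    "configs/accuracy_drug/adme/multitask",
    "configs/accuracy_drug/adme/finetuning_learning_rate",
]


def missing_paths(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

Calling `missing_paths("/path/to/your/checkout")` returns an empty list when every expected directory and file is present, and otherwise the relative paths that still need to be created or moved.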
Before training, ensure your data is properly formatted:
- Individual target data: Place log-transformed values in dataset/all_train_data_2022/standardize-drug_list/ADME/all_data/log_data/
- Multitask data: Combine all targets into dataset/all_train_data_2022/multitask.csv
- Configuration: Adjust paths in config files to match your directory structure
For detailed data preprocessing steps (including log transformation, duplicate removal, and epsilon handling), please refer to:
dataset/all_train_data_2022/standardize-drug_list/ADME/all_data/README.md
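The step of combining per-target files into one multitask file can be sketched as follows. This is only an illustration under stated assumptions, not the preprocessing used for the paper: it assumes a long format with one row per compound and target, and the column names `smiles` and `value` are hypothetical. Only `target_id` comes from this README (it is the `task_column_name` in the loader configuration). The function name `combine_targets` is likewise hypothetical.

```python
import csv


def combine_targets(target_files, output_path,
                    smiles_col="smiles", value_col="value",
                    task_col="target_id"):
    """Merge per-target CSV files into one long-format CSV.

    The position of each file in `target_files` is used as its task index.
    Column names other than `target_id` are assumptions for illustration.
    Returns the number of data rows written.
    """
    n_rows = 0
    with open(output_path, "w", newline="") as out:
        writer = csv.DictWriter(
            out, fieldnames=[smiles_col, value_col, task_col]
        )
        writer.writeheader()
        for task_id, path in enumerate(target_files):
            with open(path, newline="") as src:
                for row in csv.DictReader(src):
                    writer.writerow({
                        smiles_col: row[smiles_col],
                        value_col: row[value_col],
                        task_col: task_id,
                    })
                    n_rows += 1
    return n_rows
```

Check the actual column layout against the files in log_data/ and the preprocessing README before reusing this; the real files may use different column names or a different arrangement.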
To perform multitask GNN training on ADME properties:
# Activate the kmol environment
conda activate kmol
# Run multitask training using the configuration files
kmol train /path/to/ADME_MTFT/configs/accuracy_drug/adme/multitask/train/itr1/config.json

Multitask training uses the combined dataset (dataset/all_train_data_2022/standardize-drug_list/ADME/all_data/log_data/multitask.csv) to train a single GNN model on multiple ADME properties simultaneously.
Key configuration features for multitask training:
- Multitask loader configuration:
    "loader": {
        "type": "multitask",
        "input_path": "multitask.csv",
        "task_column_name": "target_id",
        "max_num_tasks": 10
    }
- Target-specific standardization: Each ADME property (targets 0-9) has its own mean and standard deviation for normalization:
    "transformers": [
        {
            "type": "standardize",
            "target": 0,
            "mean": 1.5067472653927056,
            "std": 0.6985528512457979
        },
        // ... for targets 1-9
    ]
- Training parameters:
  - "is_finetuning": false (initial training from scratch)
  - "checkpoint_path": null (no pre-trained model)
After multitask training, fine-tune the model for specific ADME parameters:
# Run fine-tuning with the same command
kmol train /path/to/ADME_MTFT/configs/accuracy_drug/adme/finetuning_learning_rate/train/clint/itr1/config.json

Key configuration features for fine-tuning:
- Single-target data: Uses target-specific CSV files (e.g., clint.csv for intrinsic clearance)
    "loader": {
        "type": "multitask",
        "input_path": "clint.csv",
        "task_column_name": "target_id",
        "max_num_tasks": 10
    }
- Pre-trained model loading:
    "is_finetuning": true,
    "checkpoint_path": "../ADMET/configs/accuracy_drug/adme/multitask/train/itr1/checkpoint.best.pt"

To make predictions on new compounds:
# Run prediction using the fine-tuned model
kmol predict /path/to/ADME_MTFT/configs/accuracy_drug/adme/finetuning_learning_rate/test/clint/itr1/config.json

Key configuration features for prediction:
- Test data configuration:
    "loader": {
        "input_path": "test/clint.csv"
    },
    "splitter": {
        "type": "index",
        "splits": {
            "test": 1.0
        }
    }
- Fine-tuned model loading:
    "is_finetuning": true,
    "checkpoint_path": "../ADMET/configs/accuracy_drug/adme/finetuning_learning_rate/train/clint/itr1/checkpoint.best.pt"
- Same standardization parameters: Uses the same transformers as training so that data preprocessing stays consistent
- Evaluation metrics: Configured to compute R², MAE, and RMSE on the test split
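To make the standardization and evaluation steps above concrete, here is a minimal, self-contained Python sketch. It is not kmol's implementation: the function names are hypothetical, standardization is assumed to be the usual z-score `(x - mean) / std`, and R² is assumed to be the coefficient of determination (kmol may compute it differently). The mean and standard deviation are the target 0 values quoted in this README; the data values are made up for illustration.

```python
import math


def standardize(values, mean, std):
    """Map raw (log-transformed) values to z-scores."""
    return [(v - mean) / std for v in values]


def destandardize(values, mean, std):
    """Invert `standardize`, returning values on the original log scale."""
    return [v * std + mean for v in values]


def regression_metrics(y_true, y_pred):
    """Compute R² (coefficient of determination), MAE, and RMSE."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        # R² is undefined when all true values are identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Mean and std for target 0, as listed in the multitask configuration.
MEAN_0, STD_0 = 1.5067472653927056, 0.6985528512457979

# Illustrative values only; not taken from the dataset.
y_true = [1.0, 2.0, 3.0]
z = standardize(y_true, MEAN_0, STD_0)
restored = destandardize(z, MEAN_0, STD_0)
metrics = regression_metrics(y_true, [1.0, 2.0, 4.0])
```

The round trip through `standardize` and `destandardize` only recovers the original values if the same mean and std are used in both directions, which is why the prediction config reuses the training transformers.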
- Update the paths in the commands above to match your actual directory structure
- Ensure the kmol environment is activated before running any commands
- For detailed configuration options, refer to the kmol documentation
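When adapting paths for your own setup, the differences between the multitask and fine-tuning configurations described in this README (single-target `input_path`, `is_finetuning`, and `checkpoint_path`) can be applied programmatically. The sketch below is an illustration only: the helper `make_finetuning_config` is hypothetical, and it assumes that `loader`, `is_finetuning`, and `checkpoint_path` are top-level keys of the config file, which this README does not state explicitly. The shipped fine-tuning configs may also differ in other fields.

```python
import copy
import json


def make_finetuning_config(multitask_config, input_path, checkpoint_path):
    """Return a copy of a multitask config adjusted for fine-tuning.

    Assumes `loader`, `is_finetuning`, and `checkpoint_path` sit at the
    top level of the config (an assumption, not confirmed by the README).
    """
    config = copy.deepcopy(multitask_config)
    config["loader"]["input_path"] = input_path
    config["is_finetuning"] = True
    config["checkpoint_path"] = checkpoint_path
    return config


# Fragment of a multitask config, using the values shown in this README.
base = {
    "loader": {
        "type": "multitask",
        "input_path": "multitask.csv",
        "task_column_name": "target_id",
        "max_num_tasks": 10,
    },
    "is_finetuning": False,
    "checkpoint_path": None,
}

finetune = make_finetuning_config(
    base,
    input_path="clint.csv",
    # Replace with the location of your own multitask checkpoint.
    checkpoint_path="/path/to/multitask/train/itr1/checkpoint.best.pt",
)
finetune_json = json.dumps(finetune, indent=4)
```

Working on a deep copy leaves the multitask config untouched, so the same base can be reused to derive one fine-tuning config per ADME target.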
