In this work, we constructed Weighted Cell-Specific Networks (WCSN) based on highly variable genes, capturing both gene expression patterns and gene-gene interaction strengths. These networks are further refined by integrating high-confidence gene interaction data, enhancing their biological relevance. A graph neural network is then employed to extract features from the refined WCSN, enabling accurate cell type annotation. We term our model WCSGNet.
We also propose two optimization strategies for constructing WCSNs: first, a logarithmic transformation of network edge weights to mitigate the skewed distribution of gene expression levels; second, integration of high-confidence gene regulatory networks to supplement the WCSNs, thereby enhancing the integrity of the network structure and its biological interpretability.
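As a rough illustration of the first strategy, the sketch below compresses a skewed set of edge weights with a logarithm. The choice of `log1p`, the function name, and the example weights are assumptions made for illustration; the transformation actually used in the repository may differ.

```python
import numpy as np

def log_transform_weights(weights):
    """Compress a right-skewed weight distribution with log(1 + w).

    log1p is an assumed choice; it keeps zero weights at zero.
    """
    weights = np.asarray(weights, dtype=float)
    return np.log1p(weights)

# Hypothetical, heavily skewed edge weights.
raw = np.array([0.0, 1.0, 10.0, 1000.0])
scaled = log_transform_weights(raw)
```

The ordering of the weights is preserved, while the gap between the largest and smallest values shrinks considerably.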
Yi-Ran Wang, Pu-Feng Du*. WCSGNet: a graph neural network approach using weighted cell-specific networks for cell-type annotation in scRNA-seq. Frontiers in Genetics (2025). doi: 10.3389/fgene.2025.1553352.
- Windows 11
- Tesla P100-PCIE-16GB
| Requirements | Release |
|---|---|
| CUDA | 12.7 |
| Python | 3.8.20 |
| torch | 1.12.1+cu113 |
| torch_geometric | 2.5.1 |
| numpy | 1.24.4 |
| scikit-learn | 1.3.2 |
| pandas | 2.0.3 |
| matplotlib | 3.7.5 |
This folder stores the code files.
-
DataPreprocessing
Jupyter notebooks for preprocessing individual datasets. The processed data generated by these notebooks is saved in the data/pre_data/scRNAseq_datasets directory.
Gene_interaction.ipynb: Jupyter notebook for preprocessing high-confidence gene interaction data. The processed data is saved in the data/pre_data/Network directory.
-
draw
This folder contains code for drawing.
-
Tables
This folder includes folders for nine scRNA-seq datasets.
- Every subfolder includes *_true.csv and *_pred.csv, where '*' denotes the method: wcsgnet or one of 8 baseline methods (the baseline results were obtained using GitHub - tabdelaal/scRNAseq_Benchmark at snakemake_and_docker).
This folder also includes folders that contain the performance (F1, Mean-F1, Accuracy) of the different methods on each dataset.
-
data_partitioning.py
This script generates the five-fold cross-validation splits for the corresponding dataset. The generated files are stored in the result/datasets/ folder.
-
gene_filter.py
This script stores the indices of the highly variable genes (e.g., 2000) in the gene expression matrix for each dataset.
-
get_gene_ncbi.py
This file generates the Entrez Gene IDs for each gene in the scRNA-seq dataset.
-
get_net_adj.py
This file generates the high-confidence interaction subnet for all genes in the scRNA-seq dataset. The result is saved as *_net_adj.npy in the corresponding scRNA-seq dataset directory under result/datasets/.
-
up_sample.py
This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as *_train_index_imputed.npy in the corresponding scRNA-seq dataset directory under result/datasets/.
-
wcsn_constr_train.py
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set.
-
wcsn_constr_test.py
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold testing set.
-
model.py
This file contains the code for the WCSGNet model.
-
datasets_wcsn.py
This file defines a custom PyTorch Geometric dataset class, MyDataset, which is designed to handle graph data. It reads the WCSN data directly, without supplementing the network.
-
datasets_wcsn_LT.py
This file defines a custom PyTorch Geometric dataset class, MyDataset2, which is designed to handle graph data. Its main function is to apply a logarithmic transformation to the WCSN weights.
-
datasets_ewcsn.py
This file defines a custom PyTorch Geometric dataset class, MyDataset2, designed to handle graph data. Its main function is to supplement the WCSN with high-confidence interactions; the weights of the newly added edges are set to the mean weight of the existing edges after the logarithmic transformation.
-
wcsn_classify_train.py
This step generates the 5-fold training set models using WCSN and saves them in
result/wcsn_models. -
wcsn_classify_test.py
This step generates the predicted results for the testing sets using WCSN and saves them in
result/wcsn_preds. -
LT_wcsn_classify_train.py
This step generates the 5-fold training set models using WCSN (logarithmic transformation) and saves them in
result/LT_wcsn_models. -
LT_wcsn_classify_test.py
This step generates the predicted results for the testing sets using WCSN (logarithmic transformation) and saves them in
result/LT_wcsn_preds. -
ewcsn_classify_train.py
This step generates the 5-fold training set models using EWCSN and saves them in
result/ewcsn_models. -
ewcsn_classify_test.py
This step generates the predicted results for the testing sets using EWCSN and saves them in
result/ewcsn_preds.
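The edge supplementation described for datasets_ewcsn.py can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's code: the function name `supplement_edges`, the use of `log1p` as the logarithmic transformation, and the treatment of edges as directed pairs are all hypothetical.

```python
import numpy as np

def supplement_edges(edge_index, edge_weight, prior_edges):
    """Add prior-knowledge edges that are missing from a cell's network.

    edge_index : (2, E) integer array of existing edges (gene indices)
    edge_weight: (E,) raw weights of the existing edges
    prior_edges: iterable of (i, j) high-confidence gene pairs
    New edges receive the mean of the log-transformed existing weights.
    log1p is an assumed form of the logarithmic transformation.
    """
    edge_index = np.asarray(edge_index, dtype=int)
    log_w = np.log1p(np.asarray(edge_weight, dtype=float))
    existing = set(map(tuple, edge_index.T))
    # Keep only prior edges that the cell-specific network lacks.
    new = sorted({(i, j) for i, j in prior_edges} - existing)
    if not new:
        return edge_index, log_w
    fill = log_w.mean() if log_w.size else 0.0
    new_index = np.array(new, dtype=int).T
    return (np.hstack([edge_index, new_index]),
            np.concatenate([log_w, np.full(len(new), fill)]))

# Two existing edges (0->1, 1->2) whose log1p weights are 1.0 and 3.0.
ei = np.array([[0, 1], [1, 2]])
ew = np.array([np.e - 1, np.e ** 3 - 1])
new_ei, new_w = supplement_edges(ei, ew, [(0, 1), (2, 0)])
```

In this example only the pair (2, 0) is added, and it receives the weight 2.0, the mean of the two transformed weights.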
Stores the downloaded raw data and the preprocessed datasets.
-
raw
-
scRNAseq_Benchmark_datasets
The downloaded scRNA-seq datasets include: Muraro, Zheng 68k, Zhang_T, Kang, Baron, AMB, and TM.
-
Network:
The downloaded high-confidence gene interaction data includes: HumanNet GSP, BIOGRID, and Alliance of Genome Resources.
-
MGI
Data from MGI.
- List of Mouse Genetic Markers (sorted alphabetically by marker symbol, tab-delimited) (Includes withdrawn marker symbols)
- MGI Marker Associations to Entrez Gene (tab-delimited)
-
HGNC
Data from HGNC
- Current tab separated hgnc_complete_set file from HGNC
-
-
pre_data
-
scRNA-seq_datasets
Preprocessed scRNA-seq datasets generated using the .ipynb files located in the src/DataProcessing directory.
-
network
Gene interaction files stored in TSV format. Each file contains two columns representing gene pairs, where the genes are identified using Entrez Gene IDs.
-
-
datasets
Stores the data generated during the processing of each scRNA-seq dataset. This includes the five-fold splits of the dataset, the Entrez Gene IDs of genes, the filtered list of highly variable genes, the generated high-confidence interaction subnetworks, the indices obtained from up-sampling the training set, and the WCSNs generated for each fold of the training and testing sets.
-
Figures
Stores the result figures.
-
wcsn_models
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/wcsn_classify_train.py -
LT_wcsn_models
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/LT_wcsn_classify_train.py -
ewcsn_models
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/ewcsn_classify_train.py.
-
wcsn_preds
This folder contains the predicted results generated by
src/wcsn_classify_test.py -
LT_wcsn_preds
This folder contains the predicted results generated by
src/LT_wcsn_classify_test.py. -
ewcsn_preds
This folder contains the predicted results generated by
src/ewcsn_classify_test.py.
Muraro, Zheng 68k, Baron, AMB, and TM: Available for direct download from Zenodo.
Zhang T: Accessible via GEO under accession number GSE108989.
Kang: Accessible via GEO under accession number GSE96583.
Save to the data/raw/scRNAseq_Benchmark_datasets directory.
HumanNet GSP: Available for download from HumanNet Search.
BIOGRID v4.4.235: Downloadable from the BioGRID File Repository.
Alliance of Genome Resources Molecular Interaction Datasets: Human and mouse gene interaction data are accessible from the Alliance of Genome Resources.
Save to the data/raw/Network directory.
HGNC: https://www.genenames.org/download/archive/
MGI: https://www.informatics.jax.org/downloads/reports/index.html
Save to the data/raw/HGNC and data/raw/MGI directories.
-
Initial preprocessing
The scRNA-seq datasets downloaded in section 3.1.1 require initial preprocessing using the .ipynb files in the src/DataProcessing directory. The preprocessing steps include:
- Filtering out cell types with fewer than 10 cells and cells with unclear annotations.
- Filtering out genes expressed in fewer than 10 cells.
After preprocessing, the resulting data should be saved in the dataset/pre_data/scRNA-seq_datasets directory.
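The two count-based filters above can be sketched with NumPy as follows. The function name, the (cells, genes) matrix layout, and the order of the two filters are assumptions; removal of cells with unclear annotations is dataset-specific and is not shown.

```python
import numpy as np

def filter_cells_and_genes(expr, labels, min_cells=10):
    """expr: (cells, genes) count matrix; labels: (cells,) cell-type names."""
    expr = np.asarray(expr)
    labels = np.asarray(labels)
    # Drop cell types represented by fewer than min_cells cells.
    types, counts = np.unique(labels, return_counts=True)
    keep_cells = np.isin(labels, types[counts >= min_cells])
    expr, labels = expr[keep_cells], labels[keep_cells]
    # Drop genes expressed (count > 0) in fewer than min_cells cells.
    keep_genes = (expr > 0).sum(axis=0) >= min_cells
    return expr[:, keep_genes], labels, keep_genes

# Toy data: 25 cells, 4 genes; "rare" has only 3 cells.
expr = np.ones((25, 4), dtype=int)
expr[:, 3] = 0    # gene 3 is never expressed
expr[5:, 2] = 0   # gene 2 is expressed in only 5 cells
labels = np.array(["alpha"] * 12 + ["beta"] * 10 + ["rare"] * 3)
f_expr, f_labels, gene_mask = filter_cells_and_genes(expr, labels)
```

Here the three "rare" cells are removed first, and genes 2 and 3 are then dropped because fewer than 10 of the remaining cells express them.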
-
Five-Fold Cross-Validation Splits
! python src/data_partitioning.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
- -outdir: Default='dataset/5fold_data/', Specify the output directory.
- --n_splits: Default=5, Indicates Five-fold cross-validation.
This step generates a seq_dict.npz file for each dataset in the result/datasets/ directory. The file stores the five-fold cross-validation splits for that dataset, so that training and evaluation are consistent and reproducible.
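One way to produce such splits is sketched below. It assumes the folds are stratified by cell type and fixed by a random seed; the source does not state either, and the function name and seed are hypothetical.

```python
import numpy as np

def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Return a list of (train_idx, test_idx) pairs, one per fold."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cell_type in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cell_type))
        # Deal the shuffled cells of this type round-robin into folds.
        fold_of[idx] = np.arange(len(idx)) % n_splits
    all_idx = np.arange(len(labels))
    return [(all_idx[fold_of != k], all_idx[fold_of == k])
            for k in range(n_splits)]

# Toy labels: 20 cells of type "a" and 10 of type "b".
labels = np.array(["a"] * 20 + ["b"] * 10)
folds = stratified_kfold_indices(labels)
```

Storing the resulting index arrays once (for example with `np.savez`) and reloading them in later steps keeps every method on identical folds.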
-
Up-sampling
! python src/up_sample.py
Optional parameters
- -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
- -outdir: Default='result/datasets/', Specify the output directory.
- --n_splits: Default=5, Indicates five-fold cross-validation.
This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as *_train_index_imputed.npy in the corresponding scRNA-seq dataset directory under result/datasets/.
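A minimal sketch of index-level up-sampling is given below. The target size `min_per_type`, the function name, and sampling with replacement are assumptions for illustration; the criterion used by up_sample.py is not described here.

```python
import numpy as np

def upsample_indices(train_idx, labels, min_per_type=50, seed=0):
    """Return training indices in which rare cell types are resampled.

    min_per_type is a hypothetical target size, not taken from the source.
    """
    train_idx = np.asarray(train_idx)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    out = [train_idx]
    train_labels = labels[train_idx]
    for cell_type in np.unique(train_labels):
        members = train_idx[train_labels == cell_type]
        shortfall = min_per_type - len(members)
        if shortfall > 0:
            # Sample existing cells with replacement to fill the gap.
            out.append(rng.choice(members, size=shortfall, replace=True))
    return np.concatenate(out)

# Toy training set: 60 "common" cells and 5 "rare" cells.
labels = np.array(["common"] * 60 + ["rare"] * 5)
up_idx = upsample_indices(np.arange(65), labels)
```

Only indices are duplicated, so the expression matrix itself stays unchanged and the up-sampled set can be stored as a single index array.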
-
Selection of highly variable genes (HVGs)
! python src/gene_filter.py
Optional parameters
- -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
- -outdir: Default='result/datasets/', Specify the output directory.
- -hvgs: Default=2000, Specify the number of HVGs.
This step generates a .npy file for each dataset, containing 2000 HVGs. The file stores the indices of the highly variable genes in the gene expression matrix.
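The idea of storing HVG indices can be sketched as follows. Ranking genes by the variance of log-transformed counts is an assumption made for this sketch; gene_filter.py may use a different variability criterion.

```python
import numpy as np

def top_variable_gene_indices(expr, n_hvgs=2000):
    """Return sorted column indices of the n_hvgs highest-variance genes.

    expr is a (cells, genes) matrix. Ranking by the variance of log1p
    counts is an assumption made for this sketch.
    """
    log_expr = np.log1p(np.asarray(expr, dtype=float))
    variances = log_expr.var(axis=0)
    n_hvgs = min(n_hvgs, variances.size)
    top = np.argsort(variances)[::-1][:n_hvgs]
    return np.sort(top)

# Toy matrix: gene 0 is constant, gene 1 varies strongly, gene 2 slightly.
expr = np.array([[0, 5, 1],
                 [0, 0, 1],
                 [0, 9, 2],
                 [0, 1, 1]])
hvg_idx = top_variable_gene_indices(expr, n_hvgs=2)
```

The returned indices can be saved with `np.save` and later used to slice the expression matrix, e.g. `expr[:, hvg_idx]`.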
-
Processing Gene Interaction Network Data
src/DataProcessing/Gene_interaction.ipynb
This step processes the high-confidence gene interaction data obtained in 3.1.2. For each dataset, it generates a gene interaction file in TSV format, containing two columns of genes represented by Entrez Gene IDs, and saves the files in the data/pre_data/network directory. The HumanNet GSP does not require processing.
-
Obtain Entrez Gene IDs for genes in different scRNA-seq datasets
! python src/get_gene_ncbi.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -outdir: Default='dataset/5fold_data/'. Specify the output directory.
- -gn: Default='hgnc'. Gene name format: 'hgnc', 'mgi', 'ensembl', or 'ncbi'.
- -species: Default='human'. Options: 'human' or 'mouse'.
This step retrieves the Entrez Gene IDs for each gene in the scRNA-seq dataset. If an ID is unavailable, it is set to None. The gene list is saved as genes_ncbi.npy in the corresponding scRNA-seq dataset directory under result/datasets/.
-
Obtain the high-confidence interaction subnet for each gene in the scRNA-seq dataset.
! python src/get_net_adj.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -net: Default='dataset/pre_data/network/HumanNet-GSP.tsv'. Specify the high-confidence gene interaction dataset.
- -outdir: Default='dataset/5fold_data/'. Specify the output directory.
- -gt: Default='ncbi'. Options: 'ncbi'
This step retrieves the high-confidence interaction subnet for each gene in the scRNA-seq dataset. The result is saved as *_net_adj.npy in the corresponding scRNA-seq dataset directory under result/datasets/.
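Restricting an interaction list to the genes of one dataset can be sketched as below. The dense symmetric 0/1 matrix, the function name, and the placeholder IDs are assumptions for illustration; the layout of the actual *_net_adj.npy file is not specified here.

```python
import numpy as np

def build_subnet_adjacency(gene_ids, edge_pairs):
    """Build a symmetric 0/1 adjacency matrix over the dataset's genes.

    gene_ids  : Entrez IDs in expression-matrix order (None if unknown)
    edge_pairs: iterable of (id_a, id_b) read from the interaction file
    """
    index = {g: i for i, g in enumerate(gene_ids) if g is not None}
    adj = np.zeros((len(gene_ids), len(gene_ids)), dtype=np.uint8)
    for a, b in edge_pairs:
        # Skip pairs with a gene outside the dataset, and self-loops.
        if a in index and b in index and a != b:
            adj[index[a], index[b]] = 1
            adj[index[b], index[a]] = 1
    return adj

# Placeholder IDs, not real Entrez Gene IDs.
gene_ids = ["101", None, "102", "103"]
pairs = [("101", "102"), ("102", "999"), ("103", "103")]
adj = build_subnet_adjacency(gene_ids, pairs)
```

Genes without an Entrez Gene ID keep an all-zero row and column, so the matrix stays aligned with the expression matrix.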
! python src/wcsn_constr_train.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -outdir: Default='result/datasets/'. Specify the output directory.
- -cuda: Default=True.
- -hvgs: Default=2000, The number of HVGs.
- -ca: Default=0.01, Significance level.
- --n_splits: Default=5, Indicates Five-fold cross-validation.
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set. The graph for each training set cell is saved as a .pt file in the processed folder of the corresponding fold (e.g., train_f1) within the WCSN_a0.01_hvgs2000 folder, which is located under the corresponding dataset folder in result/datasets/.
! python src/wcsn_constr_test.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -outdir: Default='dataset/5fold_data/'. Specify the output directory.
- -cuda: Default=True.
- -hvgs: Default=2000, The number of HVGs.
- -ca: Default=0.01, Significance level.
- --n_splits: Default=5, Indicates Five-fold cross-validation.
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold testing set. The graph for each testing set cell is saved as a .pt file in the processed folder of the corresponding fold (e.g., train_f1) within the WCSN_a0.01_hvgs2000 folder, which is located under the corresponding dataset folder in result/datasets/.
Training
! python src/wcsn_classify_train.py
Optional parameters
- -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -outdir: Default='result/wcsn_models'. Specify the output directory.
- -ca: Default=0.01, Significance level.
- -hvgs: Default=2000, The number of HVGs.
- -bs: Default=32, The batch size of this training.
This step generates the 5-fold training set models using WCSN and saves them in result/wcsn_models.
Testing
! python src/wcsn_classify_test.py
Optional parameters
- -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -outdir: Default='result/'. Specify the output directory.
- -ca: Default=0.01, Significance level.
- -hvgs: Default=2000, The number of HVGs.
- -bs: Default=32, The batch size.
This step generates the predicted results for the testing sets and saves them in result/wcsn_preds.
The results include:
*_Prediction.h5: Contains the true labels and predicted labels for the test set cells, the probability matrix for each predicted cell type, and the cell embeddings for each cell.
*_F1.csv: Includes the accuracy, Mean F1-Score, and the F1-Score for each cell type.
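The three metrics in *_F1.csv can be computed from the true and predicted labels as sketched below. Averaging the F1-score over the cell types present in the true labels is an assumption; the function name is hypothetical.

```python
import numpy as np

def per_type_f1(y_true, y_pred):
    """Return (accuracy, mean F1, {cell type: F1}) for two label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = {}
    for cell_type in np.unique(y_true):
        tp = np.sum((y_true == cell_type) & (y_pred == cell_type))
        fp = np.sum((y_true != cell_type) & (y_pred == cell_type))
        fn = np.sum((y_true == cell_type) & (y_pred != cell_type))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN)
        scores[str(cell_type)] = float(2 * tp / denom) if denom else 0.0
    accuracy = float(np.mean(y_true == y_pred))
    mean_f1 = float(np.mean(list(scores.values())))
    return accuracy, mean_f1, scores

# Toy example with one misclassified "a" cell.
acc, mean_f1, f1_by_type = per_type_f1(["a", "a", "b", "b"],
                                       ["a", "b", "b", "b"])
```

In this toy example the accuracy is 0.75, the F1-score is 2/3 for "a" and 0.8 for "b", and the Mean F1-score is their unweighted average.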
Training
! python src/LT_wcsn_classify_train.py
The optional parameters are mostly the same as those in wcsn_classify_train.py.
This step generates the 5-fold training set models using WCSN (logarithmic transformation) and saves them in result/LT_wcsn_models.
Testing
! python src/LT_wcsn_classify_test.py
The optional parameters are mostly the same as those in wcsn_classify_test.py.
This step generates the predicted results for the testing sets using WCSN (logarithmic transformation) and saves them in result/LT_wcsn_preds. The results include *_Prediction.h5 and *_F1.csv.
Training
! python src/ewcsn_classify_train.py
Optional parameters
- -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -net: Default='HumanNet-GSP'. Specify the high-confidence gene interaction network name.
- -outdir: Default='result/ewcsn_models'. Specify the output directory.
- -ca: Default=0.01, Significance level.
- -hvgs: Default=2000, The number of HVGs.
- -bs: Default=32, The batch size of this training.
This step generates the 5-fold training set models and saves them in result/ewcsn_models.
Testing
! python src/ewcsn_classify_test.py
Optional parameters
- -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -net: Default='HumanNet-GSP'. Specify the high-confidence gene interaction network name.
- -outdir: Default='result/'. Specify the output directory.
- -ca: Default=0.01, Significance level.
- -hvgs: Default=2000, The number of HVGs.
- -bs: Default=32.
This step generates the predicted results for the testing sets and saves them in result/ewcsn_preds.
The results include:
*_Prediction.h5: Contains the true labels and predicted labels for the test set cells, the probability matrix for each predicted cell type, and the cell embeddings for each cell.
*_F1.csv: Includes the accuracy, Mean F1-Score, and the F1-Score for each cell type.
All plotting code is located in
src/Figures/
-
Figure 4-1, 4-2, 4-3, 4-4
src/draw/baseline.ipynb
-
Figure 4-5
src/draw/sankey.ipynb
Sankey diagram of the different datasets under WCSGNet's 5-fold cross-validation.
-
Figure 4-6, 4-7, 4-9, 4-10
src/draw/analysis_Baron_Human.ipynb
src/draw/R/Figure-hub-genes.R
src/draw/R/Figure-high-weight.R
-
Figure 4-8, 4-11
src/draw/analysis_AMB.ipynb
src/draw/R/Figure-AMB-gene.R
src/draw/R/Figure-AMB-edge.R
-
Figure 5-1, 5-2
src/draw/log_trans.py
-
Figure 5-3
src/draw/LT_compare.ipynb
-
Figure 5-4, 5-5
src/draw/ewcsn.ipynb
The following factors may result in slight differences in the Mean F1-score and Accuracy for cell type classification when reproducing the results, compared to those reported in the paper.
- The DataLoader applies a shuffle operation on the training dataset during model training, leading to some randomness in the input sequence of the training data.
- The use of the Dropout mechanism in the model introduces variability in the trained models across different runs.
- Parameter initialization also produces some randomness.
However, these differences do not have a disruptive impact on the conclusions of the paper.
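For readers who want to reduce this run-to-run variation, a seeding helper along the following lines could be called before training. This is a sketch, not part of the repository: the function name and seed value are hypothetical, and some GPU operations can remain nondeterministic even with fixed seeds.

```python
import random

import numpy as np

def set_seed(seed=0):
    """Seed the Python, NumPy and (if installed) PyTorch generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        # PyTorch is optional for this sketch.
        return
    # Seeds the CPU generator and, when present, the CUDA generators.
    torch.manual_seed(seed)

set_seed(1)
first = (random.random(), float(np.random.rand()))
set_seed(1)
second = (random.random(), float(np.random.rand()))
```

With the same seed, `first` and `second` are identical, which covers the shuffle order and initial parameters to the extent that they are drawn from these generators.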