Yi-ellen/CellTypeAnnotation

WCSGNet

In this work, we constructed Weighted Cell-Specific Networks (WCSN) based on highly variable genes, capturing both gene expression patterns and gene-gene interaction strengths. These networks are further refined by integrating high-confidence gene interaction data, enhancing their biological relevance. A graph neural network is then employed to extract features from the refined WCSN, enabling accurate cell type annotation. We term our model WCSGNet.

We also propose two optimization strategies for constructing WCSNs. First, a logarithmic transformation of the network edge weights mitigates the skewed distribution of gene expression levels. Second, high-confidence gene regulatory networks are integrated to supplement the WCSNs, improving the completeness of the network structure and its biological interpretability.
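The effect of the first strategy can be illustrated with a minimal sketch. It assumes a `log(1 + w)` transform, which is an illustrative choice only: the text above does not state the exact base or offset, and `log_transform_weights` is a hypothetical helper name, not a function from this repository.

```python
import math


def log_transform_weights(weights):
    """Compress skewed, non-negative edge weights with log(1 + w).

    Assumption: log1p is used for illustration; the repository may use a
    different base or offset.
    """
    return [math.log1p(w) for w in weights]


# A skewed set of edge weights in which one edge dominates the others.
raw = [0.0, 1.0, 9.0, 999.0]
scaled = log_transform_weights(raw)
print(scaled)
```

After the transform the ordering of the weights is preserved, but the ratio between the largest and the smallest non-zero weight is much smaller.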

Citation

Yi-Ran Wang, Pu-Feng Du*. WCSGNet: a graph neural network approach using weighted cell-specific networks for cell-type annotation in scRNA-seq. Frontiers in Genetics (2025). doi: 10.3389/fgene.2025.1553352.

1. Platform and Dependency

1.1 Platform

  • Windows 11
  • Tesla P100-PCIE-16GB

1.2 Dependency

Requirements Release
CUDA 12.7
Python 3.8.20
torch 1.12.1+cu113
torch_geometric 2.5.1
numpy 1.24.4
scikit-learn 1.3.2
pandas 2.0.3
matplotlib 3.7.5

2. Project Catalog Structure

2.1 src

This folder stores the code files.

  • DataPreprocessing

    Jupyter notebooks for preprocessing individual datasets. The processed data generated by these notebooks is saved in the data/pre_data/scRNAseq_datasets directory.

    Gene_interaction.ipynb: Jupyter notebook for preprocessing high-confidence gene interaction data. The processed data is saved in the data/pre_data/Network directory.

  • draw

    This folder contains code for drawing.

  • Tables

    This folder includes folders for nine scRNA-seq datasets.

    This folder also includes folders that contain the performance (F1, Mean-F1, Accuracy) of different methods on different datasets.

  • data_partitioning.py

    This script generates the five-fold cross-validation splits for the corresponding dataset. The generated files are stored in the result/datasets/ folder.

  • gene_filter.py

    This script saves the indices of the highly variable genes (e.g., 2000) in the gene expression matrix for each dataset.

  • get_gene_ncbi.py

    This file generates the Entrez Gene IDs for each gene in the scRNA-seq dataset.

  • get_net_adj.py

    This file generates the high-confidence interaction subnet for all genes in the scRNA-seq dataset. The result is saved as *_net_adj.npy in the corresponding scRNA-seq dataset directory under result/datasets/.

  • up_sample.py

    This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as *_train_index_imputed.npy in the corresponding scRNA-seq dataset directory under result/datasets/.

  • wcsn_constr_train.py

    This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set.

  • wcsn_constr_test.py

    This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold testing set.

  • model.py

    This file contains the code for the WCSGNet model.

  • datasets_wcsn.py

    This file defines a custom PyTorch Geometric dataset class, MyDataset, which is designed to handle graph data. It reads the WCSN data directly, without supplementing it.

  • datasets_wcsn_LT.py

    This file defines a custom PyTorch Geometric dataset class MyDataset2, which is designed to handle graph data. Its main functionalities include applying a logarithmic transformation to the WCSN weights.

  • datasets_ewcsn.py

    This file defines a custom PyTorch Geometric dataset class, MyDataset2, designed to handle graph data. Its main functionalities include supplementing WCSN with high-confidence interactions, where the weight values of the newly added edges are set to the mean weight of the existing edges after applying a logarithmic transformation.

  • wcsn_classify_train.py

    This step generates the 5-fold training set models using WCSN and saves them in result/wcsn_models.

  • wcsn_classify_test.py

    This step generates the predicted results for the testing sets using WCSN and saves them in result/wcsn_preds.

  • LT_wcsn_classify_train.py

    This step generates the 5-fold training set models using WCSN (logarithmic transformation) and saves them in result/LT_wcsn_models.

  • LT_wcsn_classify_test.py

    This step generates the predicted results for the testing sets using WCSN (logarithmic transformation) and saves them in result/LT_wcsn_preds.

  • ewcsn_classify_train.py

    This step generates the 5-fold training set models using EWCSN and saves them in result/ewcsn_models.

  • ewcsn_classify_test.py

    This step generates the predicted results for the testing sets using EWCSN and saves them in result/ewcsn_preds.
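The edge supplementation described for datasets_ewcsn.py can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: `supplement_edges` is a hypothetical name, the transform is assumed to be `log(1 + w)`, and "mean weight of the existing edges after applying a logarithmic transformation" is read here as the mean of the already transformed weights.

```python
import math


def supplement_edges(edges, prior_pairs):
    """Log-transform existing edge weights and add missing prior edges.

    edges: dict mapping (i, j) with i < j to a non-negative raw weight.
    prior_pairs: iterable of (i, j) high-confidence gene pairs.
    Assumption: each added edge gets the mean of the transformed weights.
    """
    transformed = {edge: math.log1p(w) for edge, w in edges.items()}
    # The mean is computed once, before any prior edge is added.
    mean_w = sum(transformed.values()) / len(transformed) if transformed else 0.0
    for i, j in prior_pairs:
        key = (min(i, j), max(i, j))
        if key[0] != key[1] and key not in transformed:
            transformed[key] = mean_w
    return transformed


# Two existing edges whose transformed weights are 1.0 and 3.0.
wcsn_edges = {(0, 1): math.e - 1, (1, 2): math.e ** 3 - 1}
# One prior pair is new (0-2); the other already exists (0-1).
enriched = supplement_edges(wcsn_edges, [(2, 0), (0, 1)])
print(enriched)
```

Existing edges keep their own transformed weight; only pairs absent from the WCSN receive the mean.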

2.2 data

Stores the downloaded raw data and the preprocessed datasets.

  • raw

    • scRNAseq_Benchmark_datasets

      The downloaded scRNA-seq datasets include: Muraro, Zheng 68k, Zhang_T, Kang, Baron, AMB, and TM.

    • Network:

      The downloaded high-confidence gene interaction data includes: HumanNet GSP, BIOGRID, and Alliance of Genome Resources.

    • MGI

      Data from MGI.

      • List of Mouse Genetic Markers (sorted alphabetically by marker symbol, tab-delimited) (Includes withdrawn marker symbols)
      • MGI Marker Associations to Entrez Gene (tab-delimited)
    • HGNC

      Data from HGNC

      • Current tab separated hgnc_complete_set file from HGNC
  • pre_data

    • scRNA-seq_datasets

      Preprocessed scRNA-seq datasets generated using the .ipynb files located in the src/DataProcessing directory.

    • network

      Gene interaction files stored in TSV format. Each file contains two columns representing gene pairs, where the genes are identified using Entrez Gene IDs.

2.3 result

  • datasets

    Stores the data generated during the processing of each scRNA-seq dataset. This includes the five-fold splits of the dataset, the Entrez Gene IDs of genes, the filtered list of highly variable genes, the generated high-confidence interaction subnetworks, the indices obtained from up-sampling the training set, and the WCSNs generated for each fold of the training and testing sets.

  • Figures

    Stores the result figures.

  • wcsn_models

    This folder contains the trained models and the models obtained from each fold of the cross validations by src/wcsn_classify_train.py

  • LT_wcsn_models

    This folder contains the trained models and the models obtained from each fold of the cross validations by src/LT_wcsn_classify_train.py

  • ewcsn_models

    This folder contains the trained models and the models obtained from each fold of the cross validations by src/ewcsn_classify_train.py

  • wcsn_preds

    This folder contains the predicted results generated by src/wcsn_classify_test.py

  • LT_wcsn_preds

    This folder contains the predicted results generated by src/LT_wcsn_classify_test.py.

  • ewcsn_preds

    This folder contains the predicted results generated by src/ewcsn_classify_test.py.

3. Workflow

3.1 Data Collection and Preprocessing

3.1.1 Download scRNA-seq dataset

Muraro, Zheng 68k, Baron, AMB, and TM: Available for direct download from Zenodo.

Zhang T: Accessible via GEO under accession number GSE108989.

Kang: Accessible via GEO under accession number GSE96583.

Save to the data/raw/scRNAseq_Benchmark_datasets directory.

3.1.2 Download high-confidence gene interaction data

HumanNet GSP: Available for download from HumanNet Search.

BIOGRID v4.4.235: Downloadable from the BioGRID File Repository.

Alliance of Genome Resources Molecular Interaction Datasets: Human and mouse gene interaction data are accessible from the Alliance of Genome Resources.

Save to the data/raw/Network directory.

3.1.3 Download HGNC and MGI

HGNC: https://www.genenames.org/download/archive/

MGI: https://www.informatics.jax.org/downloads/reports/index.html

Save to the data/raw/HGNC and data/raw/MGI directories.

3.1.4 scRNA-seq preprocessing
  • Initial preprocessing

    The scRNA-seq datasets downloaded in Section 3.1.1 require initial preprocessing using the ipynb files in the src/DataProcessing directory. The preprocessing steps include:

    1. Filtering out cell types with fewer than 10 cells and cells with unclear annotations.
    2. Filtering out genes expressed in fewer than 10 cells.

    After preprocessing, save the resulting data in the dataset/pre_data/scRNA-seq_datasets directory.
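The two filtering steps can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: `filter_cells_and_genes` is a hypothetical name, a gene counts as "expressed" in a cell when its value is greater than zero, genes are filtered after cells, and the removal of cells with unclear annotations is not shown.

```python
from collections import Counter


def filter_cells_and_genes(expr, labels, min_cells=10):
    """Apply the two filters to a cells-by-genes matrix.

    expr: list of rows (cells), each a list of per-gene values.
    labels: one cell-type label per cell.
    Returns (filtered_expr, kept_labels, kept_gene_indices).
    """
    # Step 1: drop cells whose type has fewer than min_cells cells.
    counts = Counter(labels)
    keep_cells = [i for i, lab in enumerate(labels) if counts[lab] >= min_cells]
    rows = [expr[i] for i in keep_cells]
    kept_labels = [labels[i] for i in keep_cells]

    # Step 2: drop genes expressed (value > 0) in fewer than min_cells cells.
    n_genes = len(expr[0]) if expr else 0
    keep_genes = [
        g for g in range(n_genes)
        if sum(1 for row in rows if row[g] > 0) >= min_cells
    ]
    filtered = [[row[g] for g in keep_genes] for row in rows]
    return filtered, kept_labels, keep_genes


# 12 "alpha" cells and 3 "rare" cells; gene 1 is expressed in only 5 cells.
expr = [[1, 0]] * 10 + [[2, 5]] * 2 + [[3, 7]] * 3
labels = ["alpha"] * 12 + ["rare"] * 3
filtered, kept_labels, keep_genes = filter_cells_and_genes(expr, labels)
print(len(filtered), keep_genes)
```

In this toy input the "rare" type is removed in step 1, after which gene 1 is expressed in only two remaining cells and is removed in step 2.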
  • Five-Fold Cross-Validation Splits

! python src/data_partitioning.py

Optional parameters

  • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
  • -outdir: Default='dataset/5fold_data/', Specify the output directory.
  • --n_splits: Default=5, Indicates Five-fold cross-validation.

This step generates a seq_dict.npz file for each dataset located in the result/datasets/ directory. These files are used to store the five-fold cross-validation splits for the corresponding dataset, ensuring consistent and reproducible training and evaluation.
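A five-fold split of this kind can be sketched in plain Python. The sketch assumes a stratified split (each fold keeps a similar cell-type mix) and a fixed seed for reproducibility; whether data_partitioning.py stratifies, and how it lays out seq_dict.npz, is not stated here, so `stratified_kfold_indices` is purely illustrative.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Return a list of (train_idx, test_idx) pairs, one per fold.

    Cells of each type are shuffled and dealt round-robin across folds so
    that every fold sees a similar class mix.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)

    folds = [[] for _ in range(n_splits)]
    for lab in sorted(by_label):
        members = by_label[lab]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % n_splits].append(idx)

    splits = []
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        splits.append((train_idx, test_idx))
    return splits


labels = ["alpha"] * 20 + ["beta"] * 10
splits = stratified_kfold_indices(labels, n_splits=5, seed=0)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Fixing the seed means the same cell indices land in the same folds on every run, which is what makes downstream training and evaluation comparable.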

  • Up-sampling

    ! python src/up_sample.py

    Optional parameters

    • -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.

    • -outdir: Default='result/datasets/', Specify the output directory.

    • --n_splits: Default=5, Indicates Five-fold cross-validation.

    This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as *_train_index_imputed.npy in the corresponding scRNA-seq dataset directory under result/datasets/.
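Up-sampling by index can be sketched as below. The threshold (`min_count`) and the rule of sampling with replacement are assumptions made for illustration; the description above only says that cell types with fewer cells are up-sampled, and `upsample_indices` is a hypothetical name.

```python
import random
from collections import defaultdict


def upsample_indices(train_idx, labels, min_count, seed=0):
    """Return training indices in which every type has at least min_count cells.

    Rare types are padded by sampling their own cells with replacement;
    the original indices are kept unchanged at the front of the result.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i in train_idx:
        by_label[labels[i]].append(i)

    out = list(train_idx)
    for lab in sorted(by_label):
        members = by_label[lab]
        shortfall = min_count - len(members)
        if shortfall > 0:
            out.extend(rng.choices(members, k=shortfall))
    return out


labels = ["alpha"] * 8 + ["rare"] * 2
train_idx = list(range(10))
up = upsample_indices(train_idx, labels, min_count=5)
print(len(up))
```

Only indices are duplicated, so the expression matrix itself does not need to be copied.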

  • Selection of highly variable genes (HVGs)

    ! python src/gene_filter.py

    Optional parameters

    • -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.

    • -outdir: Default='result/datasets/', Specify the output directory.

    • -hvgs: Default=2000, Specify the number of HVGs.

    This step generates a .npy file for each dataset, containing 2000 HVGs. The file stores the indices of the highly variable genes in the gene expression matrix.
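Selecting gene indices by variability can be sketched as follows. Ranking by raw per-gene variance is an assumption made for illustration; gene_filter.py may use a different criterion (for example a normalized dispersion), and `top_variable_gene_indices` is a hypothetical name.

```python
import statistics


def top_variable_gene_indices(expr, n_hvgs=2000):
    """Return sorted column indices of the n_hvgs most variable genes.

    expr: list of cells, each a list of per-gene expression values.
    Assumption: variability is measured by population variance.
    """
    n_genes = len(expr[0])
    variances = [
        statistics.pvariance([row[g] for row in expr]) for g in range(n_genes)
    ]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_hvgs])


# Gene 0 is constant, gene 1 varies strongly, gene 2 varies slightly.
expr = [
    [1.0, 0.0, 5.0],
    [1.0, 4.0, 5.5],
    [1.0, 8.0, 4.5],
]
hvg_idx = top_variable_gene_indices(expr, n_hvgs=2)
print(hvg_idx)  # [1, 2]
```

Storing indices rather than gene names lets later steps slice the expression matrix directly.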

  • Processing Gene Interaction Network Data

    src/DataProcessing/Gene_interaction.ipynb

    This step processes the high-confidence gene interaction data obtained in 3.1.2. For each dataset, generate a gene interaction file in TSV format, containing two columns of genes represented by Entrez Gene IDs, and save the files in the data/pre_data/network directory. The HumanNet GSP does not require processing.

  • Obtain Entrez Gene IDs for genes in different scRNA-seq datasets

    ! python src/get_gene_ncbi.py

    Optional parameters

    • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
    • -outdir: Default='dataset/5fold_data/'. Specify the output directory.
    • -gn: Default='hgnc'. Gene name format: 'hgnc', 'mgi', 'ensembl', or 'ncbi'.
    • -species: Default='human'. Options: 'human' or 'mouse'.

    This step retrieves the Entrez Gene IDs for each gene in the scRNA-seq dataset. If an ID is unavailable, it is set to None. The gene list is saved as genes_ncbi.npy in the corresponding scRNA-seq dataset directory under result/datasets/.
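The "None when unavailable" behavior can be sketched with a dictionary lookup. The lookup table below is a toy stand-in; in practice it would be built from the HGNC or MGI downloads, and `map_to_entrez` is a hypothetical name.

```python
def map_to_entrez(gene_symbols, symbol_to_entrez):
    """Map each gene symbol to its Entrez Gene ID, or None when no ID is known."""
    return [symbol_to_entrez.get(symbol) for symbol in gene_symbols]


# Toy lookup table with two human genes.
lookup = {"INS": "3630", "GCG": "2641"}
ids = map_to_entrez(["INS", "GCG", "UNKNOWN_GENE"], lookup)
print(ids)  # ['3630', '2641', None]
```

Keeping a None placeholder preserves the gene order, so position i in the ID list still refers to column i of the expression matrix.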

  • Obtain the high-confidence interaction subnet for each gene in the scRNA-seq dataset.

    ! python src/get_net_adj.py

    Optional parameters

    • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
    • -net: Default='dataset/pre_data/network/HumanNet-GSP.tsv'. Specify the high-confidence gene interaction dataset.
    • -outdir: Default='dataset/5fold_data/'. Specify the output directory.
    • -gt: Default='ncbi'. Options: 'ncbi'

    This step retrieves the high-confidence interaction subnet for each gene in the scRNA-seq dataset. The result is saved as *_net_adj.npy in the corresponding scRNA-seq dataset directory under result/datasets/.
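Restricting an interaction list to the genes of one dataset can be sketched as below. The sketch assumes an undirected, unweighted network stored as a symmetric 0/1 matrix; the actual contents of *_net_adj.npy are not specified here, and both function names are hypothetical.

```python
def parse_pairs(tsv_text):
    """Parse two-column, tab-separated text into (id_a, id_b) tuples."""
    pairs = []
    for line in tsv_text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def build_subnet_adjacency(gene_ids, pairs):
    """Build a symmetric 0/1 adjacency matrix over the dataset's genes.

    gene_ids: Entrez IDs in dataset order (None where no ID was found).
    pairs: iterable of (id_a, id_b) interactions.
    Pairs with a gene outside the dataset are ignored.
    """
    index = {gid: i for i, gid in enumerate(gene_ids) if gid is not None}
    n = len(gene_ids)
    adj = [[0] * n for _ in range(n)]
    for a, b in pairs:
        if a in index and b in index and a != b:
            i, j = index[a], index[b]
            adj[i][j] = 1
            adj[j][i] = 1
    return adj


gene_ids = ["3630", None, "2641", "5078"]
tsv = "3630\t2641\n2641\t9999\n"
adj = build_subnet_adjacency(gene_ids, parse_pairs(tsv))
for row in adj:
    print(row)
```

Genes without an Entrez ID simply end up with an all-zero row and column.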

3.2 WCSN Construction

3.2.1 WCSN construction for reference dataset

! python src/wcsn_constr_train.py

Optional parameters

  • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
  • -outdir: Default='result/datasets/'. Specify the output directory.
  • -cuda: Default=True.
  • -hvgs: Default=2000, The number of HVGs.
  • -ca: Default=0.01, Significance level.
  • --n_splits: Default=5, Indicates Five-fold cross-validation.

This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set. The graph for each training set cell is saved as a .pt file in the processed folder of the corresponding fold (e.g., train_f1) within the WCSN_a0.01_hvgs2000 folder, which is located under the corresponding dataset folder in result/datasets/.
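Graph libraries such as PyTorch Geometric describe a graph by an edge index of shape [2, num_edges] plus per-edge attributes. The sketch below shows how a dense weighted adjacency matrix maps to that layout in plain Python. It is illustrative only: the exact contents of the saved .pt files are not described here, and `dense_to_edge_list` is a hypothetical name.

```python
def dense_to_edge_list(adj):
    """Convert a square weighted adjacency matrix to coordinate (COO) form.

    Returns (edge_index, edge_weight), where edge_index is
    [sources, targets] and edge_weight lists the matching weights.
    Zero entries are treated as missing edges.
    """
    sources, targets, weights = [], [], []
    for i, row in enumerate(adj):
        for j, w in enumerate(row):
            if w != 0:
                sources.append(i)
                targets.append(j)
                weights.append(w)
    return [sources, targets], weights


adj = [
    [0.0, 2.5, 0.0],
    [2.5, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
edge_index, edge_weight = dense_to_edge_list(adj)
print(edge_index)   # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(edge_weight)  # [2.5, 2.5, 1.0, 1.0]
```

An undirected edge appears twice, once per direction, which is the usual convention for this layout.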

3.2.2 WCSN construction for query dataset

! python src/wcsn_constr_test.py

Optional parameters

  • -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
  • -outdir: Default='dataset/5fold_data/'. Specify the output directory.
  • -cuda: Default=True.
  • -hvgs: Default=2000, The number of HVGs.
  • -ca: Default=0.01, Significance level.
  • --n_splits: Default=5, Indicates Five-fold cross-validation.

This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold testing sets. The graph for each testing set cell is saved as a .pt file in the processed folder of the corresponding fold (e.g., train_f1) within the WCSN_a0.01_hvgs2000 folder, which is located under the corresponding dataset folder in result/datasets/.

3.3 Training and Prediction

3.3.1 Training and prediction using WCSN

Training

! python src/wcsn_classify_train.py

Optional parameters

  • -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.

  • -outdir: Default='result/wcsn_models'. Specify the output directory.

  • -ca: Default=0.01, Significance level.

  • -hvgs: Default=2000, The number of HVGs.

  • -bs: Default=32, The batch size of this training.

This step generates the 5-fold training set models using WCSN and saves them in result/wcsn_models.

Testing

! python src/wcsn_classify_test.py

Optional parameters

  • -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.

  • -outdir: Default='result/'. Specify the output directory.

  • -ca: Default=0.01, Significance level.

  • -hvgs: Default=2000, The number of HVGs.

  • -bs: Default=32.

This step generates the predicted results for the testing sets and saves them in result/wcsn_preds.

The results include:

*_Prediction.h5: Contains the true labels and predicted labels for the test set cells, the probability matrix for each predicted cell type, and the cell embeddings for each cell.

*_F1.csv: Includes the accuracy, Mean F1-Score, and the F1-Score for each cell type.
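The three reported metrics can be computed from true and predicted labels as sketched below. The sketch assumes that Mean F1-Score is the unweighted (macro) average of the per-type F1 scores; the text above does not define it, and the function names are hypothetical.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that appears in y_true."""
    scores = {}
    for lab in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores[lab] = 2 * tp / denom if denom else 0.0
    return scores


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = ["alpha", "alpha", "beta", "beta", "delta"]
y_pred = ["alpha", "beta", "beta", "beta", "delta"]

f1 = per_class_f1(y_true, y_pred)
mean_f1 = sum(f1.values()) / len(f1)  # assumed macro average
acc = accuracy(y_true, y_pred)
print(f1, mean_f1, acc)
```

A macro average weights rare and common cell types equally, which is why it can differ noticeably from accuracy on imbalanced datasets.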

3.3.2 Training and prediction using WCSN (logarithmic transformation)

Training

! python src/LT_wcsn_classify_train.py

The optional parameters are mostly the same as those in wcsn_classify_train.py.

This step generates the 5-fold training set models using WCSN (logarithmic transformation) and saves them in result/LT_wcsn_models.

Testing

! python src/LT_wcsn_classify_test.py

The optional parameters are mostly the same as those in wcsn_classify_test.py.

This step generates the predicted results for the testing sets using WCSN (logarithmic transformation) and saves them in result/LT_wcsn_preds. The results include *_Prediction.h5 and *_F1.csv.

3.3.3 Training and prediction using EWCSN

Training

! python src/ewcsn_classify_train.py

Optional parameters

  • -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.

  • -net: Default='HumanNet-GSP'. Specify the high-confidence gene interaction network name.

  • -outdir: Default='result/ewcsn_models'. Specify the output directory.

  • -ca: Default=0.01, Significance level.

  • -hvgs: Default=2000, The number of HVGs.

  • -bs: Default=32, The batch size of this training.

This step generates the 5-fold training set models and saves them in result/ewcsn_models.

Testing

! python src/ewcsn_classify_test.py

Optional parameters

  • -expr: Default='data/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
  • -net: Default='HumanNet-GSP'. Specify the high-confidence gene interaction network name.
  • -outdir: Default='result/'. Specify the output directory.
  • -ca: Default=0.01, Significance level.
  • -hvgs: Default=2000, The number of HVGs.
  • -bs: Default=32.

This step generates the predicted results for the testing sets and saves them in result/ewcsn_preds.

The results include:

*_Prediction.h5: Contains the true labels and predicted labels for the test set cells, the probability matrix for each predicted cell type, and the cell embeddings for each cell.

*_F1.csv: Includes the accuracy, Mean F1-Score, and the F1-Score for each cell type.

4. Figures in this study

All plotting code is located in src/draw/.

  • Figure 4-1, 4-2, 4-3, 4-4

    src/draw/baseline.ipynb

  • Figure 4-5

    src/draw/sankey.ipynb

    Sankey diagram of the different datasets under WCSGNet's 5-fold cross-validation.

  • Figure 4-6, 4-7, 4-9, 4-10

    src/draw/analysis_Baron_Human.ipynb

    src/draw/R/Figure-hub-genes.R

    src/draw/R/Figure-high-weight.R

  • Figure 4-8, 4-11

    src/draw/analysis_AMB.ipynb

    src/draw/R/Figure-AMB-gene.R

    src/draw/R/Figure-AMB-edge.R

  • Figure 5-1, 5-2

    src/draw/log_trans.py

  • Figure 5-3

    src/draw/LT_compare.ipynb

  • Figure 5-4, 5-5

    src/draw/ewcsn.ipynb

5. Repeatability

The following factors may result in slight differences in the Mean F1-score and Accuracy for cell type classification when reproducing the results, compared to those reported in the paper.

  1. The DataLoader applies a shuffle operation on the training dataset during model training, leading to some randomness in the input sequence of the training data.
  2. The use of the Dropout mechanism in the model introduces variability in the trained models across different runs.
  3. Parameter initialization also produces some randomness.

However, these differences do not have a disruptive impact on the conclusions of the paper.
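For readers who want to narrow run-to-run variation, the sources of randomness listed above can be partly controlled by seeding. The sketch below is a common pattern rather than something this repository is known to do; `seed_everything` is a hypothetical helper, and seeding alone may not remove all nondeterminism on a GPU.

```python
import random


def seed_everything(seed):
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.
    try:
        import torch
        torch.manual_seed(seed)           # CPU generator (weight init, dropout)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)  # True
```

Calling the helper once at the start of a training script fixes the shuffle order, dropout masks, and parameter initialization for that run.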
