Skip to content

PCA program for a plink genotype file; it is fast, handles relatives, handles large number of subjects

Notifications You must be signed in to change notification settings

NCI-CGR/Software_CGRPCA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This package has the PCA program to calculate principal components for a genotype file in the Plink format.

It was initially developped by Shengchao Alfred Li.

The input to the PCA program is genotypes in the Plink format as well as an ibd file generated by plink.

The program use the ibd file to identify founders and their relatives.
Relatives are defined to be with PI_HAT > 0.2.
A progressive algorithm is used to keep the largest number of founders.
First 20 PCs for non-related subjects are calculated with a singular value decomposition (SVD) algorithm,
PCs for the relatives are then generated by project them onto the space spanned by the 20 PCs.
This method was previously used to deal with incomplete ancient DNA.
We wrote the script from scratch for calculate PCA for genotype data with relatives, to work with Plink format input,
to use the 'smartpca' normalization method, to enable automatic identification of unrelated-subjects and relatives.

This program is easy to use, can handle large datasets (tested with ~100K) and can potentially modified to handle even more,
and is fast. It is 5 times faster than the software smartpca fast-mode; and is much faster than smartpca, and can handle datasets so large that smartpca fails (but smartpca fast-mode can still handle). It is more accurate than smartpca fast-mode for later PCs (for example, PC7 and above for 1000 subjects)

The R script uses package "functional", "genio" and "irlba".
They have been pre-installed in our R/3.6.3.

If you have to install by yourself to your R 4.x distribution, note that 'genio' may be old and is not available to your R 4.x. In this case, you can locate an older version for R 3.x and install to your R 4.x and it most likely will be just fine.

We use "genio" for its Plink read functions.
We use "irlba" for its SVD functions.
We use "functional" for its "Compose" function for easy handling NAs. This package may be circumvented by a few more codes.

File preparation for the PCA calculation:

We suggest a LD pruned genotype file to save space and speed;
We suggest the genotype files also be filtered for autosomes, call rate and MAF. In the example we use call rate > 90% and MAF > 0.2.

A ibd file (the .genome generated by Plink) is needed. We suggest it is generated with option --min 0.2 to save space and processing time.

This is especially important for large datasets.
The PCA program use PI_HAT > 0.2 only, anyway.

Example:

The Example genotype is from the HapMap3, downloaded as vcf.gz. Although it is too big to be placed here,
we included files generated so you can test the PCA program.

******** Preparation steps ***********

#Suppose you are in a subdirectory ./bin
module load plink/1.9

#convert genotypes in the vcf format to the Plink format;
#the files ../genotypes/HapMap3.bed, bim, fam are generated by this command,
#These files are not included because they are large.
plink --vcf ../genotypes/HapMap3.vcf.gz --make-bed --out ../genotypes/HapMap3

#prune the genotypes; file ../genotypes/HapMap3.prune.in is generated by this command,
#It is included in this package;
plink --bfile ../genotypes/HapMap3 --geno 0.1 --maf 0.2 --chr 1-22 --indep-pairwise 50 5 0.1 --out ../genotypes/HapMap3

#generate pruned genotypes,
#The files ../genotypes/HapMap3_pruned.bed, bim, fam are generated by this command,
#They are included in this package;
plink --bfile ../genotypes/HapMap3 --extract ../genotypes/HapMap3.prune.in --make-bed --out ../genotypes/HapMap3.pruned

#Use Plink --genome to find ibd information, to identify relatives,
#The file ../genotypes/HapMap3.pruned.genome is generated,
#It is included in this package;
plink --bfile ../genotypes/HapMap3.pruned --genome --min 0.2 --out ../genotypes/HapMap3.pruned

#generate the relatives file to hold subjects that are not used in computing the initial PCs;
#They are later projected onto the top PC spaces.
Rscript ../modules/IBD_identify_relatives.R ../genotypes/HapMap3.pruned.genome ../genotypes/HapMap3.pruned.genome.relatives

******** end Preparation steps ***********

#Run the PCA program; It needs two input files and two specified output files;
#use your own file names to replace those specified in this example.
#1, the Plink format genotype in bed format, in this example it is specified as ../genotypes/HapMap3.pruned;
#2, the genome file, in this example it is specified as ../genotypes/HapMap3.pruned.genome;
#Two output files are also specified,
#3, the principal component file, in this example itis specified as ../pca/HapMap3.pruned.eigen
#4, the weight file for the principal components, in this example it is specified as ../pca/HapMap3.pruned.weights
Rscript ../modules/PCA_svd_project.R ../genotypes/HapMap3.pruned ../genotypes/HapMap3.pruned.genome.relatives ../pca/HapMap3.pruned.eigen ../pca/HapMap3.pruned.weights

#Plot the principal components for inspection; It needs one input and two outputs; use your own file names to replace those used in this example.
#1, the principal component file, in this example itis specified as ../pca/HapMap3.pruned.eigen
#Two output files are also specified,
#2, The main plots; in this example it is specified as ../pca/HapMap3.pruned.eigen.png
#3, Additional plots; in this example it is specified as ../pca/HapMap3.pruned.eigen.more.png
Rscript ../modules/plot_for_CGRPCA.R ../pca/HapMap3.pruned.eigen ../pca/HapMap3.pruned.eigen.png ../pca/HapMap3.pruned.eigen.more.png

About

PCA program for a plink genotype file; it is fast, handles relatives, handles large number of subjects

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages