coffm049 edited this page Jan 4, 2024 · 1 revision

Estimate.py estimates SNP heritability using a closed-form formula, taking a single Genetic Relatedness Matrix (GRM) as input. We suggest running this version on a server with sufficient memory when the sample size is below 100k. In our paper, analyzing a 45k sample took less than 2 minutes and about 40 GB of memory.

Please check the input description with ./Estimate.py --help.

Arguments

It is recommended that users define a .json file containing all of the arguments for an analysis, which helps with both organization and reproducibility. In that case, --argfile is the only argument needed. Users can also specify all file paths and variable selections manually using command-line flags if desired.
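As a rough sketch of the manual route, the snippet below assembles an Estimate.py command line from a Python dictionary. Only the flag names come from the Arguments table; the file names and values are hypothetical placeholders, and the helper `build_command` is introduced here purely for illustration.

```python
import shlex

# Hypothetical file names and values; only the flags are taken from the Arguments table.
args = {
    "--prefix": "data/example_grm",      # expects example_grm.grm.bin, .grm.N.bin, .grm.id
    "--pheno": "data/example.phen",
    "--mpheno": 1,                       # first phenotype, i.e. the third column of the file
    "--PC": "data/example.eigenvec",
    "--npc": 5,
    "--covar": "data/example_covar.csv",
}


def build_command(script, args):
    """Flatten a flag -> value mapping into an argv-style list."""
    cmd = [script]
    for flag, value in args.items():
        cmd.extend([flag, str(value)])
    return cmd


cmd = build_command("./Estimate.py", args)
print(shlex.join(cmd))
# To actually launch the analysis: subprocess.run(cmd, check=True)
```

Keeping the arguments in one mapping like this serves the same reproducibility goal as the .json argfile, which remains the recommended option.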

| Input | Description |
| --- | --- |
| --argfile ARGFILE.json | CONDITIONALLY REQUIRED. ARGFILE.json, string, is the name of the file containing all information for PCs, covariates, phenotypes, and GRM. This takes priority over all other arguments. See the example argfile included under the Example directory. |
| --prefix PREFIX | REQUIRED. PREFIX, string, is the prefix of the GRM files in GCTA binary GRM format (PREFIX.grm.bin, PREFIX.grm.N.bin and PREFIX.grm.id). |
| --pheno PHENO.phen | REQUIRED. PHENO.phen, string, is the name of the phenotype file, following the GCTA phenotype file format (space-delimited text file) but with column names. The first two columns are FID and IID, and phenotypes start from the third column. |
| --mpheno m | OPTIONAL. Integer or list of integers, Default = 1. If the file contains multiple phenotypes, select one with --mpheno m; otherwise, the first phenotype will be used. Note that 1 refers to the third column of the file, since the FID and IID columns are skipped. If passed a list, estimates will be computed for every phenotype specified. |
| --PC PC | OPTIONAL. PC, string, is the name of the PCs file, following the GCTA --pca file format (space delimited, no column names; same as plink --pca). The third column is the first PC, the fourth column is the second PC, and so on. |
| --npc n | OPTIONAL. Integer, Default = all PCs in the PC file. Use --npc n to adjust for only the top n PCs. |
| --covar COVAR | OPTIONAL. COVAR, string, is the name of the covariate file, following the GCTA --qcovar file format or .csv file format. It may contain sex, age, etc. Note that this file does not include principal components, which need to be included separately with --PC PC. |
| --covars COVARS | OPTIONAL. COVARS is the list of integers specifying which covariates from the covariate file to control for. Column numbering does not include the FID and IID columns (therefore 1 refers to the third column of the file). Note that this is an ordered list if used in conjunction with the --loop_covs flag. |
| --k k | OPTIONAL. Integer. The number of rows read at a time when restoring the GCTA binary GRM file into a matrix. If not provided, the whole GRM is processed at once. With a relatively large sample size, specifying --k k can speed up the computation and save memory. |
| --std | OPTIONAL. Run SAdj-HE by specifying --std. Otherwise, UAdj-HE will be computed. (There are potential bugs with the standardized version, so it is recommended to use the unstandardized version for now.) |
| --loop_covs | OPTIONAL. Default = False. If True, loop over the ORDERED set of user-defined covariates, including all previous covariates in each iteration. **Note: The order in which the covariates are controlled for is based on the researcher's best judgment. In other words, include the most likely** |
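To make the --prefix and --k entries more concrete, here is a minimal sketch of restoring a GCTA binary GRM a few rows at a time. It is not the code used by Estimate.py. It assumes the standard GCTA layout (PREFIX.grm.bin holds the lower triangle, diagonal included, as 4-byte floats stored row by row) and assumes little-endian byte order; the function names `read_grm_rows` and `read_grm` are hypothetical, and it uses only the standard library for portability, whereas a real analysis would more likely use NumPy arrays.

```python
import struct


def read_grm_rows(bin_path, n, k=None):
    """Yield (row_index, values) for the lower triangle of an n x n GRM.

    Reads k rows per file read (k=None reads all rows at once).
    Assumes little-endian float32 values, where row i holds i + 1 entries.
    """
    k = n if k is None else k
    with open(bin_path, "rb") as fh:
        for start in range(0, n, k):
            stop = min(start + k, n)
            # Rows start..stop-1 together hold (start+1) + ... + stop values.
            count = sum(range(start + 1, stop + 1))
            values = struct.unpack(f"<{count}f", fh.read(4 * count))
            pos = 0
            for i in range(start, stop):
                yield i, values[pos:pos + i + 1]
                pos += i + 1


def read_grm(prefix, k=None):
    """Return (ids, grm): the (FID, IID) pairs and the full symmetric matrix."""
    with open(prefix + ".grm.id") as fh:
        ids = [tuple(line.split()[:2]) for line in fh if line.strip()]
    n = len(ids)
    grm = [[0.0] * n for _ in range(n)]
    for i, row in read_grm_rows(prefix + ".grm.bin", n, k):
        for j, value in enumerate(row):
            grm[i][j] = grm[j][i] = value  # mirror the lower triangle
    return ids, grm
```

A call such as `read_grm("data/example_grm", k=1000)` (hypothetical prefix) would read 1000 rows per pass, so only one block of the raw file is buffered at a time.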

Description of Inputs

Here are illustrative examples of what the files might look like. The phenotype, covariate, and principal component files all have the Family ID (FID) and the Individual ID (IID) as their first two columns. These are followed by values specific to each file type: phenotypes for the phenotype file, covariates for the covariate file, and PC loadings for the PC file (an example explaining how to compute PCs is given in the section "Example of computing GRM and eigenvectors from .bed files" below). Note: Both the phenotype and covariate files should have column headers, whereas the PC file should not. See the example pheno, covariate, and PC files in the examples folder.

| Column | Column Contents |
| --- | --- |
| 1 | FID, string. Unique family identifier for each family in the study. |
| 2 | IID, string. Unique individual identifier for each individual in the study. |
| 3 onward | Numeric. Measurements for PC loadings, phenotypes, or covariates, depending on the file type. |

**Note: All files need the first two columns to be FID and IID, respectively. Also, any missing values will remove the observation from the given analysis.**
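The column conventions above can be sketched in a few lines of Python. This is an illustration rather than the parser used by Estimate.py: the function name `read_phenotype`, the toy data, and the set of strings treated as missing ("NA", "nan") are all assumptions made for the example.

```python
def read_phenotype(lines, mpheno=1, missing=("NA", "nan")):
    """Return (phenotype_name, {(FID, IID): value}) for one phenotype column.

    mpheno is 1-based and counts from the third column, so mpheno=1 maps
    to 0-based column index 2. Rows with a missing value are dropped.
    The missing-value codes are an assumption, not taken from the tool.
    """
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    col = mpheno + 1  # skip the FID and IID columns
    values = {}
    for fields in body:
        if fields[col] in missing:
            continue  # a missing value removes this observation
        values[(fields[0], fields[1])] = float(fields[col])
    return header[col], values


# Toy phenotype file with a header row (values are made up for illustration).
example = """FID IID height bmi
F1 I1 170.5 22.1
F2 I2 NA 25.3
F3 I3 181.0 NA
"""

name, values = read_phenotype(example.splitlines(), mpheno=1)
print(name, values)
```

With `mpheno=1` the individual F2/I2 is dropped because height is missing, while with `mpheno=2` the individual F3/I3 would be dropped instead.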

Output

The output is a .csv file containing the heritability estimate (h2), standard error (SE), phenotype used (Pheno), number of principal components controlled for (PCs), list of covariates controlled for, separated by "+" (Covariates), computational time (Time for analysis(s)), and peak memory (Memory Usage (Mb)). See the results included in the Example folder.
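A results file of this shape could be read back as sketched below. The column names are assumed to match the names given in parentheses above, and the single data row contains made-up numbers that only illustrate the layout; check the file in the Example folder for the real header.

```python
import csv
import io

# Made-up values purely to show the layout; they are not real results.
example_csv = """h2,SE,Pheno,PCs,Covariates,Time for analysis(s),Memory Usage (Mb)
0.41,0.05,height,5,age+sex,1.8,120.4
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
for row in rows:
    # The Covariates field joins covariate names with "+".
    covariates = row["Covariates"].split("+") if row["Covariates"] else []
    h2, se = float(row["h2"]), float(row["SE"])
    print(f'{row["Pheno"]}: h2 = {h2} (SE {se}), PCs = {row["PCs"]}, covariates = {covariates}')
```

For a real run, replace the `io.StringIO` wrapper with `open(path, newline="")` on the output file.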
