Description
ISSUE:
When directly typed variants are rare but also present in the reference panel (target VCF), their imputed genotypes may be unreliable. The estimated imputation quality (R2) in these cases is reported as high, whereas the empirical ER2 is, as expected, low.
minimac4 nevertheless writes the imputed hard GT calls and dosages (DS) for these markers, even when the ER2 shows that the imputed data are unreliable.
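To make the pattern concrete, here is a minimal sketch of how such markers could be listed from an uncompressed, tab-delimited dose VCF. The function names and both thresholds (`min_r2`, `max_er2`) are my own illustrative assumptions, not minimac4 options or recommended cut-offs.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flags (e.g. TYPED) map to True."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
        elif item:
            fields[item] = True
    return fields


def flag_discordant_typed(vcf_lines, min_r2=0.8, max_er2=0.3):
    """Return TYPED records whose R2 is high while ER2 is low.

    min_r2 and max_er2 are hypothetical thresholds chosen for illustration.
    """
    flagged = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header and blank lines
        cols = line.split("\t")
        info = parse_info(cols[7])
        # Only markers that were both genotyped and imputed carry ER2.
        if "TYPED" not in info or "R2" not in info or "ER2" not in info:
            continue
        r2, er2 = float(info["R2"]), float(info["ER2"])
        if r2 >= min_r2 and er2 <= max_er2:
            flagged.append((cols[0], int(cols[1]), cols[2], r2, er2))
    return flagged
```

With the thresholds above, the record pasted below (R2=0.88798, ER2=0.02747, TYPED) would be flagged.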
EXAMPLE:
The data are from publicly available samples, namely Coriell samples (see the Imputed data, Genotype data (pre-phasing) and Genotype data (post-phasing) pasted below).
SOLUTION:
While I am using the older release (v1.0.2), I assume the same can be observed in the later releases as well (e.g., v4.1.x).
If possible, could an argument be added so that the final output for markers flagged TYPED is either the imputed genotype (the default; the current behaviour) or the phased, directly typed genotype? In the latter case, the hard-called GT fields would be copied from the phased input data.
Right now, in order to achieve that, one must do VCF curation, which is always a messy affair.
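For reference, the curation I have in mind looks roughly like the sketch below. It is not minimac4 code: `restore_typed_gt` and its helper are hypothetical names, and it assumes uncompressed, tab-delimited VCF lines, identical sample IDs in both files, and records matched strictly on CHROM, POS, REF and ALT.

```python
def _index_phased(phased_lines):
    """Map (CHROM, POS, REF, ALT) -> {sample: GT} from the phased input VCF."""
    samples, calls = [], {}
    for line in phased_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]
            continue
        fmt = cols[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        key = (cols[0], cols[1], cols[3], cols[4])
        calls[key] = {s: c.split(":")[gt_idx] for s, c in zip(samples, cols[9:])}
    return calls


def restore_typed_gt(imputed_lines, phased_lines):
    """Copy phased input GT into TYPED records of the imputed VCF.

    Only the GT subfield is replaced; DS and GP are left as imputed.
    Records without an exact match in the phased input are left untouched.
    """
    phased = _index_phased(phased_lines)
    samples, out = [], []
    for line in imputed_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("#CHROM"):
                samples = line.split("\t")[9:]
            out.append(line)
            continue
        cols = line.split("\t")
        key = (cols[0], cols[1], cols[3], cols[4])
        fmt = cols[8].split(":")
        is_typed = "TYPED" in cols[7].split(";")
        if is_typed and key in phased and "GT" in fmt:
            gt_idx = fmt.index("GT")
            for i, sample in enumerate(samples):
                gt = phased[key].get(sample)
                if gt is None:
                    continue  # sample absent from the phased input
                fields = cols[9 + i].split(":")
                fields[gt_idx] = gt
                cols[9 + i] = ":".join(fields)
        out.append("\t".join(cols))
    return out
```

Two caveats with this approach: after the swap, DS and GP no longer agree with GT for the affected markers, and the strict allele match means a record whose REF/ALT differ between the two files (as in the excerpts below, C/G in the imputed data versus G/C in the input) would not be touched.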
Much obliged,
Vid
Imputed data (N.B.: only the hard-called GT fields were extracted from the imputed data set; if required, the unmodified VCF can be provided)
[v@server Imputed_Data]$ tabix -h Chr6.dose.vcf.gz "6:18143955-18143955"
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##filedate=2024.3.19
##source=Minimac4.v1.0.2
##contig=<ID=6>
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy (R-square)">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
##INFO=<ID=IMPUTED,Number=0,Type=Flag,Description="Marker was imputed but NOT genotyped">
##INFO=<ID=TYPED,Number=0,Type=Flag,Description="Marker was genotyped AND imputed">
##INFO=<ID=TYPED_ONLY,Number=0,Type=Flag,Description="Marker was genotyped but NOT imputed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
##minimac4_Command=/filler/minimac4 --cpus 10 --allTypedSites --minRatio 0.00001 --refHaps /filler/6.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz --format GT,DS,GP --haps ./filler/Chr6_Phased.vcf.gz --prefix ./filler/Chr6
##bcftools_normVersion=1.19+htslib-1.19
##bcftools_normCommand=norm -c s -f /filler/ucsc.hg19.fasta -o ./filler/Chr6.norm.vcf.gz -O z ./filler/Chr6.sorted.vcf.gz; Date=Fri Mar 22 11:28:43 2024
##bcftools_annotateVersion=1.19+htslib-1.19
##bcftools_annotateCommand=annotate --set-id %CHROM:%POS:%REF:%ALT -o ../filler/Chr6.annotate.vcf.gz -O z ../filler/Chr6.norm.vcf.gz; Date=Fri Mar 22 11:30:47 2024
##bcftools_viewVersion=1.19+htslib-1.19
##bcftools_viewCommand=view -h ../filler/Chr6.annotate.vcf.gz; Date=Fri Mar 22 11:32:47 2024
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00111 NA06989 NA10854 NA11839 NA12236 NA12815 NA17137 NA17290 NA17641 NA18544 NA18564 NA19122 NA19143 NA19317 NA23246 NA23275 NA23297 NA24027 HG00133 HG00373 HG00589 HG01083 HG01086 HG01094 HG01680 HG03225 HG03246 HG03643 NA07029 NA10831 NA10855 NA12753 NA17074 NA17169 NA17176 NA17673 NA18484 NA18552 NA18855 NA18973 NA18992 NA19109 NA19174 NA19178 NA19207 NA19213 NA19226 NA19239 NA19785 NA19908 NA19917 NA20289 NA20509 NA23874 NA23877 NA23878 HG00111_duplo NA06989_duplo NA12236_duplo NA18564_duplo NA19317_duplo NA23246_duplo NA23297_duplo NA24027_duplo HG01083_duplo HG03225_duplo HG03246_duplo NA12753_duplo NA19207_duplo NA19226_duplo NA19785_duplo NA17137_duplo HG00373_duplo HG01094_duplo NA07029_duplo NA17074_duplo NA18973_duplo NA19109_duplo NA20509_duplo
6 18143955 6:18143955:C:G C G . PASS AF=0.19201;MAF=0.19201;R2=0.88798;ER2=0.02747;TYPED GT 1|0 0|1 1|1 1|0 0|0 0|1 0|1 0|0 0|0 0|1 0|0 1|0 0|0 0|1 0|1 0|1 0|0 0|0 0|1 0|1 0|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|1 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|1 0|1 1|0 0|0 0|0 0|1 1|0 0|0 0|0 0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|1 0|1 1|0 0|1 0|0 0|0 0|0 0|0
Genotype data (pre-phasing)
##fileformat=VCFv4.3
##fileDate=20250325
##source=PLINKv2.00
##contig=<ID=6>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00111 NA06989 NA10854 NA11839 NA12236 NA12815 NA17137 NA17290 NA17641 NA18544 NA18564 NA19122 NA19143 NA19317 NA23246 NA23275 NA23297 NA24027 HG00133 HG00373 HG00589 HG01083 HG01086 HG01094 HG01680 HG03225 HG03246 HG03643 NA07029 NA10831 NA10855 NA12753 NA17074 NA17169 NA17176 NA17673 NA18484 NA18552 NA18855 NA18973 NA18992 NA19109 NA19174 NA19178 NA19207 NA19213 NA19226 NA19239 NA19785 NA19908 NA19917 NA20289 NA20509 NA23874 NA23877 NA23878 HG00111_duplo NA06989_duplo NA12236_duplo NA18564_duplo NA19317_duplo NA23246_duplo NA23297_duplo NA24027_duplo HG01083_duplo HG03225_duplo HG03246_duplo NA12753_duplo NA19207_duplo NA19226_duplo NA19785_duplo NA17137_duplo HG00373_duplo HG01094_duplo NA07029_duplo NA17074_duplo NA18973_duplo NA19109_duplo NA20509_duplo
6 18143955 rs1800462 G C . . PR GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Genotype data (post-phasing)
##fileformat=VCFv4.1
##fileDate=25032025_11h16m45s
##source=SHAPEIT2.v904
##log_file=./filler/Chr6_Phased.vcf.log.temp.log
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00111 NA06989 NA10854 NA11839 NA12236 NA12815 NA17137 NA17290 NA17641 NA18544 NA18564 NA19122 NA19143 NA19317 NA23246 NA23275 NA23297 NA24027 HG00133 HG00373 HG00589 HG01083 HG01086 HG01094 HG01680 HG03225 HG03246 HG03643 NA07029 NA10831 NA10855 NA12753 NA17074 NA17169 NA17176 NA17673 NA18484 NA18552 NA18855 NA18973 NA18992 NA19109 NA19174 NA19178 NA19207 NA19213 NA19226 NA19239 NA19785 NA19908 NA19917 NA20289 NA20509 NA23874 NA23877 NA23878 HG00111_duplo NA06989_duplo NA12236_duplo NA18564_duplo NA19317_duplo NA23246_duplo NA23297_duplo NA24027_duplo HG01083_duplo HG03225_duplo HG03246_duplo NA12753_duplo NA19207_duplo NA19226_duplo NA19785_duplo NA17137_duplo HG00373_duplo HG01094_duplo NA07029_duplo NA17074_duplo NA18973_duplo NA19109_duplo NA20509_duplo
6 18143955 rs1800462 G C . PASS . GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0