One row per sample/allele in csv output #33

@domenico-simone

VCF file:

##fileformat=VCFv4.0
##reference=chrRCRS
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=.,Type=Integer,Description="Reads covering the REF position">
##FORMAT=<ID=HF,Number=.,Type=Float,Description="Heteroplasmy Frequency of variant allele">
##FORMAT=<ID=CILOW,Number=.,Type=Float,Description="Value defining the lower limit of the confidence interval of the heteroplasmy fraction">
##FORMAT=<ID=CIUP,Number=.,Type=Float,Description="Value defining the upper limit of the confidence interval of the heteroplasmy fraction">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SRR043366	SRR043354
chrRCRS	263	0	A	G,C	0	PASS	AC=1;AN=2	GT:DP:HF:CILOW:CIUP	0/1:167:0.994:0.963:1.0	0/2:167:0.994:0.963:1.0

Idea for a CSV output:

SAMPLE	CHROM	POS	ID	REF	ALT	QUAL	AC	AN	Locus	FunctionalLocus	CodonPosition	AaChange	HF	CILOW	CIUP	…
SRR043366	chrRCRS	263	0	A	G	0	2;-1	4	MT-DLOOP	MT-HV2 (Hypervariable segment 2)	.	.	0.994	0.963	1	…
SRR043354	chrRCRS	263	0	A	C	0	2;-1	4	MT-DLOOP	MT-HV2 (Hypervariable segment 2)	.	.	0.853	0.79	0.9	…

So we have one row for each SAMPLE-ALT pair. This means that if a sample has more than one ALT allele, that sample gets as many rows as it has ALT alleles. This would make the table easy to use in downstream processing, e.g. wrangling and plotting with tidyverse packages, or creating dynamic (sortable/filterable) plots with HTMLwidgets.
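The reshaping described above could be sketched roughly as follows. This is only an illustration in plain Python, not the project's implementation: the names `vcf_to_long_rows`, `rows_to_tsv` and `COLUMNS` are hypothetical, the annotation columns (Locus, AaChange, etc.) and AC/AN are left out, and it assumes that the comma-separated HF/CILOW/CIUP values of a sample are listed in the same order as the ALT alleles in that sample's GT.

```python
import csv
import io

# Fixed VCF columns that precede the per-sample columns.
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
# Output columns (hypothetical subset of the table proposed above).
COLUMNS = ["SAMPLE", "CHROM", "POS", "ID", "REF", "ALT", "QUAL",
           "HF", "CILOW", "CIUP"]


def vcf_to_long_rows(vcf_text):
    """Yield one dict per (sample, ALT allele) pair found in vcf_text."""
    samples = []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[len(FIXED):]
            continue
        record = dict(zip(FIXED, fields))
        alts = record["ALT"].split(",")
        keys = record["FORMAT"].split(":")
        for sample, raw in zip(samples, fields[len(FIXED):]):
            call = dict(zip(keys, raw.split(":")))
            # ALT indices carried by this sample, e.g. "0/2" -> [2]
            gt = call.get("GT", "").replace("|", "/").split("/")
            alt_idx = sorted({int(a) for a in gt if a.isdigit() and int(a) > 0})
            for k, idx in enumerate(alt_idx):
                row = {name: record[name] for name in
                       ("CHROM", "POS", "ID", "REF", "QUAL")}
                row["SAMPLE"] = sample
                row["ALT"] = alts[idx - 1]
                # Assumption: the k-th value belongs to the k-th ALT in GT.
                for key in ("HF", "CILOW", "CIUP"):
                    values = call.get(key, "").split(",")
                    row[key] = values[k] if k < len(values) and values[k] else "."
                yield row


def rows_to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Calling `rows_to_tsv(vcf_to_long_rows(text))` on the VCF above would give one line for SRR043366 with ALT G and one for SRR043354 with ALT C; a sample with GT `1/2` would get two lines.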

Labels: enhancement (New feature or request)
