JLodewijk/VCFTools-Java

VCFtools is a tool for analysing VCF files. Variant Call Format (VCF) files store DNA polymorphism data such as SNPs. The VCF format was developed for the 1000 Genomes Project.
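
To make the format concrete, here is a minimal Java sketch of how one VCF data line breaks down into its eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The class name `VcfRecordSketch` and the sample line are illustrative assumptions; this is not code from this repository, which reads VCF files through GATK.

```java
/** Illustrative sketch only; not part of VCFTools-Java, which relies on GATK for parsing. */
final class VcfRecordSketch {
    final String chrom;   // chromosome identifier
    final int pos;        // 1-based position
    final String id;      // variant ID such as an rs number, or "." when absent
    final String ref;     // reference allele
    final String[] alt;   // alternate alleles (comma-separated in the file)
    final String qual;    // kept as text because "." marks a missing value
    final String filter;  // "PASS" or the names of failed filters
    final String info;    // semicolon-separated INFO entries

    private VcfRecordSketch(String[] f) {
        chrom = f[0];
        pos = Integer.parseInt(f[1]);
        id = f[2];
        ref = f[3];
        alt = f[4].split(",");
        qual = f[5];
        filter = f[6];
        info = f[7];
    }

    /** Parses one tab-separated VCF data line (not a "#" header line). */
    static VcfRecordSketch parse(String line) {
        String[] fields = line.split("\t");
        if (line.startsWith("#") || fields.length < 8) {
            throw new IllegalArgumentException("not a VCF data line: " + line);
        }
        return new VcfRecordSketch(fields);
    }

    /** True when REF and every ALT allele are a single base, i.e. a SNP rather than an indel. */
    boolean isSnp() {
        if (!ref.matches("[ACGTNacgtn]")) {
            return false;
        }
        for (String a : alt) {
            if (!a.matches("[ACGTNacgtn]")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical example line in the style of the VCF specification.
        String line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14";
        VcfRecordSketch r = parse(line);
        // prints: 20:14370 rs6054257 SNP=true
        System.out.println(r.chrom + ":" + r.pos + " " + r.id + " SNP=" + r.isSnp());
    }
}
```

These are the fields that the site filters listed below (chromosome, position, SNP ID, quality, FILTER and INFO flags) operate on.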

This version of VCFtools is written in Java and uses the Genome Analysis Toolkit (GATK) 2.7-4 to read VCF files. GATK is a package for analyzing next-generation sequencing data. The new tool is a multi-platform command-line application that also has a user-friendly web interface, which makes it easier for biologists to analyze VCF files.

To use this program, type:

java -jar "nl.bioinf.vcftools.jar"

Parameters that can be used are:

-bcf This option defines the BCF file to be processed

-bed Include a set of sites on the basis of a BED file

-chr Chromosome identifier of the sites to include. Multiple chromosomes can be included; separate the identifiers with ',' if multiple identifiers are given

-count Outputs the raw allele counts per site of the given VCF file to a file with the suffix .frq.count

-depth Generates a file containing the mean depth per individual. This file has the suffix .idepth

-excludeBed Exclude a set of sites on the basis of a BED file

-excludePositions Exclude a set of sites. Separate the sites with ',' if multiple sites are given

-excludePositionsFile Exclude a set of sites on the basis of a list of positions in a file

-excludeSnp Exclude SNPs which are given by the user. Separate the SNPs with ',' if multiple SNPs are given

-excludeSnpFile Exclude a list of SNPs given in a file. The file should contain a list of SNP IDs, with one ID per line

-freq Outputs the allele frequency to a file with the suffix .frq

-fromBp This option defines the physical start position of the site which will be processed. An integer is expected. This option must be used right after -chr

-geno Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 1 indicates no missing data allowed). A double is expected

-gvcf This option defines the compressed VCF file to be processed

-h,--help Help function

-hwe Sites with a p-value below the threshold defined by this option are taken to be out of the Hardy-Weinberg Equilibrium and therefore excluded. A double is expected

-invertMask This option can be used to specify a mask file that will be inverted before being applied

-keepFiltered This option can be used to select sites on the basis of specific filter flags

-keepIndv Specify an individual to be kept in the analysis. This option can accept multiple arguments to specify multiple individuals. Each individual should be separated with a ','. A string is expected.

-keepIndvFile Provide a file containing a list of individuals to include in subsequent analysis. Each individual ID (as defined in the VCF header line) should be included on a separate line.

-keepInfo This option can be used to select sites on the basis of specific INFO flags. -keepInfo is applied before -removeInfo if both are given.

-keepOnlyIndels Include only sites that contain an indel

-mac Include only sites with Minor Allele Count which is higher than the given value. A double is expected

-maf Include only sites with Minor Allele Frequency which is higher than the given value. A double is expected

-mask Include sites on the basis of a MASK file

-maskMin Set the threshold value which determines if sites are filtered or not. A double is expected

-maxAlleles Include only sites with a number of alleles which is lower than the given value. For example, to include only biallelic sites, one could use -minAlleles 2 together with -maxAlleles 2. A double is expected

-maxDp Exclude all genotypes with a sequencing depth which is higher than the given value

-maxIndv Randomly thins individuals so that only the specified number are retained. A double is expected

-maxIndvMeanDp Calculate the mean coverage on a per-individual basis. Only individuals with a coverage lower than the given value are included in subsequent analyses. A double is expected

-maxMac Include only sites with Minor Allele Count which is lower than the given value. A double is expected

-maxMaf Include only sites with Minor Allele Frequency which is lower than the given value. A double is expected

-maxMeanDp Include sites with mean Depth which is lower than the given value. A double is expected

-maxMissingCount Exclude sites with more than the given number of missing chromosomes. A double is expected

-maxNonRefAc Include only sites where all Non-Reference Allele Counts are lower than the given value. A double is expected

-maxNonRefAf Include only sites where all Non-Reference Allele Frequencies are lower than the given value. A double is expected

-minAlleles Include only sites with a number of alleles which is higher than the given value. For example, to include only biallelic sites, one could use -minAlleles 2 together with -maxAlleles 2. A double is expected

-mind Specify the minimum call rate threshold for each individual. A double is expected

-minDp Exclude all genotypes with a sequencing depth which is lower than the given value

-minGq Exclude all genotypes with a quality below the threshold specified by this option

-minIndvMeanDp Calculate the mean coverage on a per-individual basis. Only individuals with a coverage which is higher than the given value are included in subsequent analyses. A double is expected

-minMeanDp Include sites with mean Depth which is higher than the given value. A double is expected

-minQ Include only sites with Quality above this threshold. A double is expected

-nonRefAc Include only sites where all Non-Reference Allele Counts are higher than the given value. A double is expected

-nonRefAf Include only sites where all Non-Reference Allele Frequencies are higher than the given value. A double is expected

-notChr Chromosome identifier of the sites to exclude. Multiple chromosomes can be excluded; separate the identifiers with ',' if multiple identifiers are given

-out This option defines the output filename prefix for all files generated by vcftools

-phased Only include phased data

-positions Include a set of sites. Separate the sites with ',' if multiple sites are given

-positionsFile Include a set of sites on the basis of a list of positions in a file

-removeFiltered Exclude sites with a specific filter flag

-removeFilteredAll This option removes all sites with a FILTER flag

-removeFilteredGeno This option removes all genotypes based on a specific filter flag

-removeFilteredGenoAll This option removes all genotypes based on a filter flag. Default filter flag is '.' or everything not equal to PASS

-removeIndels Exclude sites that contain an indel

-removeIndv Specify an individual to be removed from the analysis. A string is expected. If -keepIndv is also used, -keepIndv will be applied first

-removeIndvFile Provide a file containing a list of individuals to exclude in subsequent analysis. Each individual ID (as defined in the VCF header line) should be included on a separate line. If -keepIndvFile is also used, -keepIndvFile will be applied first

-removeInfo This option can be used to exclude sites with a specific INFO flag

-snp This option defines a SNP which will be processed

-snpFile Include a list of SNPs given in a file, with one ID per line

-thin Thin sites so that no two sites are within the specified distance. A double is expected

-toBp This option defines the physical stop position of the site which will be processed. An integer is expected. This option must be used right after -fromBp

-vcf This option defines the VCF file to be processed
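
Several of the options above (-count, -freq, -maf, -maxMaf) revolve around allele counts and allele frequencies per site. The following Java sketch shows one way such values could be derived from the genotype (GT) strings of a single site. It is an illustration under stated assumptions, not the implementation used by this tool: the class `AlleleStats` and its methods are hypothetical, and the minor allele frequency is taken here simply as the smallest allele frequency at the site.

```java
import java.util.Arrays;

/** Illustrative helper; not part of the VCFTools-Java code base. */
final class AlleleStats {

    /**
     * Counts alleles at one site from GT strings such as "0/1" or "1|1".
     * Missing alleles (".") are skipped. Any ploidy is accepted, e.g. "0/0/1/1".
     */
    static int[] countAlleles(String[] genotypes, int numAlleles) {
        int[] counts = new int[numAlleles];
        for (String gt : genotypes) {
            // "/" separates unphased alleles, "|" separates phased alleles.
            for (String allele : gt.split("[/|]")) {
                if (!allele.equals(".")) {
                    counts[Integer.parseInt(allele)]++;
                }
            }
        }
        return counts;
    }

    /** Frequency of each allele: its count divided by the number of called alleles. */
    static double[] frequencies(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double[] freqs = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            freqs[i] = total == 0 ? 0.0 : (double) counts[i] / total;
        }
        return freqs;
    }

    /** Assumption: the minor allele frequency is the smallest allele frequency at the site. */
    static double minorAlleleFrequency(int[] counts) {
        return Arrays.stream(frequencies(counts)).min().orElse(0.0);
    }

    public static void main(String[] args) {
        // Five diploid individuals, one with a missing genotype.
        String[] genotypes = {"0/0", "0/1", "1|1", "./.", "0/0"};
        int[] counts = countAlleles(genotypes, 2);
        System.out.println("counts = " + Arrays.toString(counts)); // counts = [5, 3]
        System.out.println("maf = " + minorAlleleFrequency(counts)); // maf = 0.375
    }
}
```

With values like these, a threshold such as -maf 0.05 would keep this example site, while -maxMaf 0.3 would drop it. Because the genotype string is split on every separator, the same counting works for polyploid genotypes.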

About

Porting VCFTools to Java and adding support for polyploids
