This is a collection of simple utilities for biology that were created during my son's PhD research.
vcf.py
A set of functions to manipulate VCF files.
Named functions:
fields, meta, info, filter, format, data, fileformat = vcf.read('some_file_name_for_vcf', getdata=True)
if getdata=False then the returned 'data' is an empty list of dictionaries.
------------------------------------------------------------------------------------------
fields = vcf.getFields('some_file_name_for_vcf')
returns a list with all the fields contained in the 'some_file_name_for_vcf' file
------------------------------------------------------------------------------------------
meta = vcf.getMeta('some_file_name_for_vcf')
returns a dictionary with all the META info contained in the 'some_file_name_for_vcf' file
------------------------------------------------------------------------------------------
info = vcf.getInfo('some_file_name_for_vcf')
returns a list with all the INFO contained in the 'some_file_name_for_vcf' file
------------------------------------------------------------------------------------------
filter = vcf.getFilter('some_file_name_for_vcf')
returns a list with all the FILTERs contained in the 'some_file_name_for_vcf' file
------------------------------------------------------------------------------------------
format = vcf.getFormat('some_file_name_for_vcf')
returns a list with all the FORMAT info contained in the 'some_file_name_for_vcf' file
------------------------------------------------------------------------------------------
fileformat = vcf.getFileformat('some_file_name_for_vcf')
returns a string containing the type and version of 'some_file_name_for_vcf' file
------------------------------------------------------------------------------------------
data = vcf.getData('some_file_name_for_vcf', startLine=0, num_of_lines=1)
returns a list of dictionaries of the data contained in the 'some_file_name_for_vcf' file
starting at the 'startLine' line of data. The list contains 'num_of_lines' dictionaries
with data.
'startLine' is the 1-based line number from which reading starts, so if it is left at
zero, the function returns an empty list of dictionaries
------------------------------------------------------------------------------------------
firstline = vcf.getFirstDataline('some_file_name_for_vcf')
returns the line number of the first line containing data
------------------------------------------------------------------------------------------
sampledata = vcf.getSampleData('some_file_name_for_vcf', line_number)
returns a dictionary with all the 'sample_name':'value' pairs for the 'line_number' data-line
------------------------------------------------------------------------------------------
[to console] vcf.printSampleData('some_file_name_for_vcf', line_number)
prints the 'sample_name':'value' pairs for all samples contained in the 'line_number' data-line
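To illustrate the kind of 'sample_name':'value' dictionary described above, here is a minimal sketch that does not use vcf.py at all. The helper name sample_pairs, the tab-separated layout and the nine standard columns before the samples are assumptions made only for this example.

```python
def sample_pairs(header_line, data_line, stdcols=9):
    """Pair each sample name with its value on one data line.

    Hypothetical helper, not part of vcf.py. Assumes tab-separated
    columns and 'stdcols' standard columns before the first sample.
    """
    names = header_line.rstrip("\n").split("\t")[stdcols:]
    values = data_line.rstrip("\n").split("\t")[stdcols:]
    return dict(zip(names, values))


# Toy header and data line, invented for the example
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
line = "chr1\t100\t.\tA\tT\t50\tPASS\t.\tGT:DP\t0:4\t1:25"

pairs = sample_pairs(header, line)
for name, value in pairs.items():
    print(f"{name}: {value}")
```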
-------------------------------------------------------------------------------------------------------
VCFformatColumn.py
Problem description:
We have a VCF file ('filename') containing lines with n columns, where the sample columns
follow the FORMAT column.
We want to reproduce this file, replacing each sample column with the relevant subcolumn of FORMAT.
Example: suppose that in a data line the FORMAT column is "GT:AD:DP:GQ:PL" and the three sample columns
are ".:0,0:0:.:0,0" "0:4,0:4:99:0,169" "1:0,25:25:99:1078,0" and we want to replace these columns with
the respective subcolumn "AD" (second subcolumn in FORMAT). In this case the sample columns will
become "0,0" "4,0" "0,25"
Input arguments:
filename: the name of the file we want to parse
formatColumn: the number of the FORMAT column. The numbering is zero-based
subColumnToGet: the part of the FORMAT column that we want to extract (e.g. "GT")
Output:
The output of the script is a new file (filename + "_out") containing the lines
of the filename with sample columns replaced
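The replacement described above can be sketched for a single data line as follows. This is only an illustration, not the code of VCFformatColumn.py: the function name extract_subcolumn is made up, the line is assumed to be tab-separated, the FORMAT column itself is left as it is, and a line whose FORMAT lacks the requested subcolumn is returned unchanged (both are assumptions).

```python
def extract_subcolumn(line, format_column, sub_column):
    """Replace every sample column with one subcolumn of FORMAT.

    Hypothetical helper. 'format_column' is zero-based, as in the
    description above. The FORMAT column is kept unchanged (assumption).
    """
    fields = line.rstrip("\n").split("\t")
    keys = fields[format_column].split(":")
    if sub_column not in keys:
        # Assumption: nothing to extract, so return the line unchanged
        return "\t".join(fields)
    idx = keys.index(sub_column)
    samples = []
    for sample in fields[format_column + 1:]:
        parts = sample.split(":")
        # A sample with fewer subcolumns than FORMAT gets "." (assumption)
        samples.append(parts[idx] if idx < len(parts) else ".")
    return "\t".join(fields[:format_column + 1] + samples)


# Data line built from the example in the problem description
line = ("chr1\t100\t.\tA\tT\t50\tPASS\t.\tGT:AD:DP:GQ:PL"
        "\t.:0,0:0:.:0,0\t0:4,0:4:99:0,169\t1:0,25:25:99:1078,0")
print(extract_subcolumn(line, 8, "AD"))
```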
find_length.py
Problem description:
We have a file ('filename') containing lines with n columns, having 3 columns (col1, col2 and col3)
with strings like 'GTACT'
We want to reproduce this file adding two more columns at the end of each line. These columns
will have the string lengths of these two columns, setting the biggest value in the first
added column and the smaller value at the second added column
Input arguments:
filename: the name of the file we want to parse
col1: the number of the first (of the 3 columns) we want to calculate
the length. The numbering is zero-based
col2: the number of the second (of the 3 columns) we want to calculate
the length. The numbering is zero-based
col3: the number of the third (of the 3 columns) we want to calculate
the length. The numbering is zero-based
non_count_char: a string containing the characters that we do NOT want to count during
length calculation (like "," or "*"). This string can have more than
one character (like ",*") or no characters at all (like "")
Output:
The output of the script is a new file (filename + "_out") containing the lines
of the filename appended with the 3 new columns
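The length calculation can be sketched like this. It is a rough illustration rather than the code of find_length.py: the names counted_length and lengths_descending are invented, and it is assumed that one length is produced per selected column and that the lengths are written from largest to smallest.

```python
def counted_length(text, non_count_char=""):
    """Length of 'text', ignoring every character found in 'non_count_char'."""
    return sum(1 for ch in text if ch not in non_count_char)


def lengths_descending(fields, columns, non_count_char=""):
    """Lengths of the selected columns (zero-based), largest first.

    Hypothetical helper; the ordering is an assumption based on the
    description above.
    """
    lengths = [counted_length(fields[c], non_count_char) for c in columns]
    return sorted(lengths, reverse=True)


# Toy line, invented for the example
fields = "id1\tGTACT\tG,T\tGTA*".split("\t")
new_columns = lengths_descending(fields, (1, 2, 3), ",*")
print("\t".join(fields + [str(n) for n in new_columns]))
```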
complete_file.py
Problem description:
We have a file ('gene_file') containing lines with n columns. We also have another
file ('details_file') containing the above-mentioned n columns plus another column at the end
of each line containing the 'details'
We need to create an output file that contains each line of the
'gene_file' with the (n+1)th column of the matching 'details_file' line added at the end
Input arguments:
gene_file: the name of the file containing the genes (n columns)
details_file: the name of the file containing the details (n+1 columns)
include_not_found: a boolean that tells whether the output file should also include
the lines for which no match was found (in which case
the line is appended with asterisks: ************************)
Output:
The output of the script is a new file (gene_file + "-out") containing the lines
of the gene_file appended with the found details
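One possible way to do the matching is a dictionary lookup, sketched below on lists of lines instead of files. This is not necessarily how complete_file.py works: the function name complete_lines is invented, the columns are assumed to be tab-separated, a match is assumed to mean that all n columns are identical, and the length of the asterisk filler is an assumption.

```python
FILLER = "*" * 24  # assumption: the exact number of asterisks may differ


def complete_lines(gene_lines, details_lines, include_not_found=True):
    """Append the last column of the matching details line to each gene line.

    Hypothetical helper working on lists of tab-separated lines.
    """
    details = {}
    for line in details_lines:
        # Split off the last column: the key is the first n columns
        key, _, detail = line.rstrip("\n").rpartition("\t")
        details[key] = detail

    out = []
    for line in gene_lines:
        key = line.rstrip("\n")
        if key in details:
            out.append(key + "\t" + details[key])
        elif include_not_found:
            out.append(key + "\t" + FILLER)
    return out


# Toy data, invented for the example
genes = ["g1\tx", "g2\ty"]
details = ["g1\tx\tkinase"]
for row in complete_lines(genes, details):
    print(row)
```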
sort_file
Problem description:
We have a file (tab separated) in the form:
***************************************************
CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
Bgt_chr-01 392817 2 225 A:0.946667 T:0.0533333
Bgt_chr-01 393045 2 225 T:0.946667 C:0.0533333
Bgt_chr-01 393150 2 225 T:0.995556 C:0.00444444
Bgt_chr-01 452402 2 225 G:0.946667 A:0.0533333
Bgt_chr-01 452453 2 225 C:0.946667 T:0.0533333
Bgt_chr-01 3895640 3 219 A:0.00913242 G:0.0273973 *:0.96347
Bgt_chr-01 452641 2 225 A:0.911111 T:0.0888889
Bgt_chr-01 452680 2 225 T:0.946667 C:0.0533333
Bgt_chr-01 452701 2 225 A:0.893333 G:0.106667
***************************************************
Allele frequencies start at column 'startCol' (default=4)
Default filename is 'test_navalwstiseiragiabins'
We want to create another file (with the same name plus an '_out' suffix)
in which all the lines are written in the same order, but in each line
the allele columns are sorted by their frequencies
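Sorting the allele columns of one data line can be sketched as follows. The function name sort_alleles is invented, and the sort direction is an assumption (highest frequency first by default, switchable with the 'descending' argument), since the description does not state it.

```python
def sort_alleles(line, start_col=4, descending=True):
    """Sort the ALLELE:FREQ columns of one tab-separated data line.

    Hypothetical helper. Columns before 'start_col' (zero-based) are
    kept in place. The sort direction is an assumption.
    """
    fields = line.rstrip("\n").split("\t")
    fixed, alleles = fields[:start_col], fields[start_col:]
    # The frequency is the part after the last ':' of each allele column
    alleles.sort(key=lambda a: float(a.rpartition(":")[2]), reverse=descending)
    return "\t".join(fixed + alleles)


# Data line taken from the sample above
line = "Bgt_chr-01\t3895640\t3\t219\tA:0.00913242\tG:0.0273973\t*:0.96347"
print(sort_alleles(line))
```

The header line has no numeric frequencies, so in this sketch it has to be copied to the output without calling sort_alleles on it.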
gene_create_file2.py
Problem description:
We have a bunch of files (in fasta format) containing samples' genes.
The files reside in the 'datapath_in' folder and are in the form:
>FRA_SYROS_2000_5_Bg_tritici_BgtE-4528
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRL
>FRA_SYROS_2000_5_Bg_tritici_Bgt-3232
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
............................................................
>FRA_SYROS_2000_5_Bg_tritici_Bgt-6670
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
We have also a 'gene_file' where we keep the gene names together with their
alternatives, in the form:
Bgt-10_p USA_Ken_2_5_Bgtritici_Bgt-10_p
Bgt-10_p USA_KAN_43_Bgtritici_Bgt-10_p
Bgt-10_p USA_J2_1_Bgtritici_Bgt-10_p
Bgt-1000_p USA_KAN_43_Bgtritici_Bgt-1000_p
Bgt-1000_p USA_J2_1_Bgtritici_Bgt-1000_p
Bgt-1000_p USA_C4_6_Bgtritici_Bgt-1000_p
...........................................................
where there are two genes (Bgt-10_p and Bgt-1000_p) with 3 alternatives each
This file is a tab-separated file where, in each line, the first string is
the gene name and the following string is an alternative name
We want to create another bunch of files (one for each gene contained in the
'gene_file') in the folder 'datapath_out', with each file containing all the
sample sequences with the said gene and its alternatives, in fasta format.
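The grouping step can be sketched on in-memory lines as shown below. This is an illustration only: the names read_fasta and group_by_gene are invented, the sequences are toy data, and it is assumed that a fasta header matches a gene when it is exactly equal to one of the alternative names in the 'gene_file'. The real script may match differently.

```python
def read_fasta(lines):
    """Return {header: sequence} from fasta lines (hypothetical helper)."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join the sequence lines of each record into one string
    return {n: "".join(parts) for n, parts in records.items()}


def group_by_gene(records, gene_lines):
    """Group fasta records under the gene their header is an alternative of.

    Assumption: the header is identical to the alternative name.
    """
    alt_to_gene = {}
    for line in gene_lines:
        parts = line.strip().split("\t")
        if len(parts) >= 2:
            alt_to_gene[parts[1]] = parts[0]
    groups = {}
    for name, seq in records.items():
        gene = alt_to_gene.get(name)
        if gene is not None:
            groups.setdefault(gene, []).append((name, seq))
    return groups


# Toy input: names follow the gene_file example, sequences are invented
fasta = [">USA_Ken_2_5_Bgtritici_Bgt-10_p", "MTEIT", "AAMVK",
         ">USA_KAN_43_Bgtritici_Bgt-1000_p", "SATVS",
         ">UNLISTED_sample", "MMMM"]
genes = ["Bgt-10_p\tUSA_Ken_2_5_Bgtritici_Bgt-10_p",
         "Bgt-1000_p\tUSA_KAN_43_Bgtritici_Bgt-1000_p"]

groups = group_by_gene(read_fasta(fasta), genes)
for gene, members in groups.items():
    print(gene, members)
```

Writing one fasta file per key of 'groups' into 'datapath_out' would complete the task; that part is left out of the sketch.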
filterdata.py
Problem description:
We have a file (filein) containing several columns and many lines.
We read each line (one by one) from the filein
For each line read, we get values for the following variables:
gene = the text of the first column
gene2 = the first 9 letters of the text in the second column
gene2_total = the full text of the second column
length = the text in the third column (it is actually a number, but we read it as text). It is not used
size = the text in the fourth column. We read it as text and later convert it to float
We read each line and create blocks of lines that have the same gene (first column).
If a line has the same gene2 as another line in the block, we check
the sizes to see whether the fraction is less than 1e-50.
If it is, we continue normally; otherwise we discard the block.
NEW ADDITIONS
We need all four genes to be present in a block, else we drop it
We only keep the genes that are nearest to zero (only four items in every block)
If two gene2 are identical then we accept them without testing the size
changefiledata.py
standard columns (default=9) tab separated
# 0 1 2 3 4 5 6 7 8
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
Reads a text file and, for each line where the REF column differs from
the INFO value (after 'AA='):
changes the Info value (?),
interchanges the REF and ALT columns,
changes the DT format value from 1 to 0 or from 0 to 1 and
interchanges the values inside the AD and PL subcolumns of each sample.
Arguments:
filein: the filename of the data file from where data is read. If it is
not given then it asks for a filename
fileout: the output file where both the unchanged and the changed data will be written.
If this argument is not given, the script uses the 'filein' name with
'_out' added at the end
stdcols (default=9): the count of standard columns after which the samples start
log_console (default=False): whether the script writes information to the console
markchanges (default=True): if True then in each changed line an asterisk (*)
is added at the start of the line
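The swap for one data line can be sketched roughly as below. Several things here are assumptions and not taken from changefiledata.py: the function name swap_line is invented, the genotype subcolumn is assumed to be named GT with a single 0 or 1 per sample, AD and PL are assumed to hold comma-separated values whose order is simply reversed, and the INFO column is left untouched because the description does not say how it changes.

```python
def swap_line(line, stdcols=9):
    """Swap REF/ALT on one data line when REF differs from the 'AA=' value.

    Hypothetical helper. Returns (new_line, changed).
    """
    fields = line.rstrip("\n").split("\t")
    ref, alt, info, fmt = fields[3], fields[4], fields[7], fields[8]

    ancestral = None
    for item in info.split(";"):
        if item.startswith("AA="):
            ancestral = item[3:]
    if ancestral is None or ancestral == ref:
        return "\t".join(fields), False

    fields[3], fields[4] = alt, ref
    keys = fmt.split(":")
    flip = {"0": "1", "1": "0"}
    for i in range(stdcols, len(fields)):
        parts = fields[i].split(":")
        for k, key in enumerate(keys[:len(parts)]):
            if key == "GT":  # assumption: genotype subcolumn name
                parts[k] = flip.get(parts[k], parts[k])
            elif key in ("AD", "PL"):
                parts[k] = ",".join(reversed(parts[k].split(",")))
        fields[i] = ":".join(parts)
    return "\t".join(fields), True


# Toy line, invented for the example (REF=A, ancestral allele AA=T)
line = ("chr1\t100\t.\tA\tT\t50\tPASS\tAA=T\tGT:AD:DP:GQ:PL"
        "\t0:4,0:4:99:0,169\t1:0,25:25:99:1078,0")
new_line, changed = swap_line(line)
print(("*" if changed else "") + new_line)
```

The last line of the example mimics the 'markchanges' behaviour by putting an asterisk in front of a changed line.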