SGA Scoring script and related utilities
These are customary, and (I think) not hardcoded anywhere except in the parameters you supply. Thus, you can change them as long as you set those parameter-paths up accordingly. Sub-directories and symbolic links etc also work fine.
rawdata/: for input filesscored/: for output filesrefdata/: for supplemental input files
-
Mandatory
inputfiledatabase dump to score (9 columns) (customarily underrawdata/)outputfileoutput file pattern (no extension) (scored/)smfitnessfilepath to query / array fitness file (refdataetc )linkagefilepath to linkage definition file (ignored if linkage is skipped, see below)coord_filepath to orf coordinate file (ignored if linkage is skipped)removearraylistpath to list of known bad arrays (to be removed)
-
Overrideable (have a hard-coded default, but may drop you into the debugger for confirmation if not supplied): -
border_strain_orfORF_strainid found on the border -wild_typeORF_strainid to use as wild-type -skip_perl_stepskips a preprocessing step. If you're scoring aninputfileyou've already scored, you can skip this to save some time -
Optional:
skip_linkage_detectionset totrueto skip linkage detection altogetherskip_linkage_maskset the totrueto replace linkage colonies after correctionsskip_wt_removeset this totrueto include WT query strain/s in the outputeps_qnorm_refpath to mat-file containing a quantile normalization table
You can paste this into matlab, (% are comments)
% -------------------------------------------------------------------------------------
% October 9, 2014
% nolink score of fg30 for adrian verster
% in which linkage data are held out from corrections, then replaced for analysis
% -------------------------------------------------------------------------------------
inputfile = 'rawdata/Collab/AdrianVerster/raw_sga_fg_t30_131130_adrian.txt';
outputfile= 'scored/Collab/AdrianVerster/nolink_sga_fg_t30_131130_adrian_scored_141009';
skip_perl_step = false;
skip_linkage_mask = true;
skip_wt_remove = false;
wild_type = 'URA3control_sn4757';
border_strain_orf = 'YOR202W_dma1';
smfitnessfile = 'refdata/smf_t30_130417.txt';
linkagefile = 'refdata/linkage-est_sga_merged_131122.txt';
coord_file = 'refdata/chrom_coordinates_111220.txt';
removearraylist = 'refdata/bad_array_strains_140526.csv';
compute_sgascore
The script creates 4 output files, beginning with the supplied output pattern, but each with a unique extension:
.txtfinal scores in 12-column format.loga record of events, pretty much a copy of what you see on the screen when you run the script.mata matfile containing everything in the workspace at the end of scoring, for post-mortems.orfa list of unique strain_ids from columns 1 & 2 of.txtwhich may come in handy (e.g. for building a hashmap to load the.txtfile
The script has a number of places marked:
% SAFE ENTRY
Which means "The only variables which change after this point, are created after this point."
These have been determined by manual inspection and are by no means exhaustive. They are used to resume
the script mid-execution to save time when testing changes. I generally use this procedure:
- Locate the position of the change you wish to test (line XXX)
- Locate the first SAFE ENTRY above XXX, (line YYY)
- Delete lines 1-YYY and save the results to
resume.m - To supress logging to the log-file, set this magic number at the top of
resume.m:lfid = -11;- Or to log:
lfid = fopen([outputfile '.log'], 'a')(ato append,wto overwrite)
- Or to log:
- load the mat-file you want to resume (if not already loaded)
- run
resume
See the Column_Key for description of input and outputfiles
- MATLAB version (other than 2010b which has a bug in
svm()) - Image Processing Toolbox (image_toolbox)
- Statistics Toolbox (statistics_toolbox)