-
Notifications
You must be signed in to change notification settings - Fork 214
Evaluating alignment performance using simulation
Understanding mapper performance is difficult in the case of real data. We can't easily know the true position of a read in a reference genome, and if we could it would only be possible by using a read aligner to find the likely positions. In consequence, we need to simulate reads in order to be certain of the true alignment position. Here we walk through a process that allows us to evaluate mapper performance using reads sampled with error from a pangenome.
We can use any variation graph as a basis for comparison, but our comparison will ultimately rely on embedded reference paths in the graph in order to generate comparisons with non-graph aligners. Thus make sure that the graph used has a reference path. See Changing References to set any given path as a reference. An example, using graphs/cactus-BRCA2.gfa and small/x.fa_1.fastq in the test directory of vg:
# Put path "13" into PanSN format so we can make it a reference later
# https://github.com/pangenome/PanSN-spec
sed "s/P\t13/P\tref#0#13/" graphs/cactus-BRCA2.gfa > graph.gfa
# Index GFA to magically produce a GBZ
vg autoindex --prefix graph --workflow sr-giraffe --gfa graph.gfa
# Convert path "13" to a reference
vg gbwt -Z --set-tag "reference_samples=ref" --gbz-format -g graph.ref.gbz graph.giraffe.gbzWe'll also need a comparable linear reference. We extract the reference path:
vg paths --extract-fasta -x graph.ref.gbz --paths-by ref > linear.fa
# Index for bwa-mem (note: Minimap2 is a better comparison for long reads)
bwa index linear.faFor a real test, you should use a genome-wide graph, such as those released by the HPRC
Next we generate reads with vg sim. Make sure to annotate them with --multi-position for better correctness checking.
# Simulate single-end reads
vg sim -x graph.ref.gbz -n 1000 -F small/x.fa_1.fastq -a --multi-position --random-seed 47 > sim_errors.gam
# Convert to FASTQ for linear tools
vg view --fastq-out --align-in sim_errors.gam > sim_errors.fqNote that vg sim can also handle long reads. Simply provide a different FASTQ training file and/or use a larger -l argument. Paired-end reads are also possible. See the vg sim wiki for more details.
Time for the main event!
vg giraffe -b default -Z graph.ref.gbz -f sim_errors.fq > giraffe.gam
bwa mem linear.fa sim_errors.fq > bwa.samIt's possible to export the GAM file into SAM format, but since this is a vg wiki page we'll work in vg's native formats. Thus, we can inject the SAM alignments into graph-space:
# Get rid of secondary and supplementary alignments (which confuse vg's comparison tools)
samtools view -b -F 256 -F 2048 bwa.sam > bwa.bam
# Inject into graph-space
vg inject -x graph.ref.gbz --add-identity bwa.bam > bwa.gamAnd we probably want to be "fair", and the linear alignments will necessarily look worse because they can't align against the variation present in the graph. So, let's force all the alignments to be on the reference:
# Onto the reference
vg surject -x graph.ref.gbz giraffe.gam > giraffe.surject.gamThere are a few ways to compare these read alignments. High-level statistics come from vg stats -a <file.gam>. Read-by-read information comes from vg filter. We can also check the correctness of the alignments using vg gamcompare. The --range 0 here means that we're not allowing any wiggle room as to alignment locations. Long reads may benefit from being more flexible by using vg annotate --multi-position and vg gamcompare --range 200 to allow for a little ambiguity, e.g. in repetitive regions.
# Annotate first, so that gamcompare knows what it's looking at
vg annotate -x graph.ref.gbz --gam giraffe.surject.gam > giraffe.annot.gam
vg annotate -x graph.ref.gbz --gam bwa.gam > bwa.annot.gam
# Compare for correctness using the original simulations as truth
vg gamcompare --range 0 giraffe.annot.gam sim_errors.gam > giraffe.compare.gam
vg gamcompare --range 0 bwa.annot.gam sim_errors.gam > bwa.compare.gam