Evaluating alignment performance using simulation

Understanding mapper performance is difficult in the case of real data. We can't easily know the true position of a read in a reference genome, and if we could it would only be possible by using a read aligner to find the likely positions. In consequence, we need to simulate reads in order to be certain of the true alignment position. Here we walk through a process that allows us to evaluate mapper performance using reads sampled with error from a pangenome.

Preparing the references

We can use any variation graph as a basis for comparison, but our comparison will ultimately rely on embedded reference paths in the graph in order to generate comparisons with non-graph aligners. Thus make sure that the graph used has a reference path. See Changing References to set any given path as a reference. An example, using graphs/cactus-BRCA2.gfa and small/x.fa_1.fastq in the test directory of vg:

# Put path "13" into PanSN format so we can make it a reference later
# https://github.com/pangenome/PanSN-spec
sed "s/P\t13/P\tref#0#13/" graphs/cactus-BRCA2.gfa > graph.gfa
# Index GFA to magically produce a GBZ
vg autoindex --prefix graph --workflow sr-giraffe --gfa graph.gfa
# Convert path "13" to a reference
vg gbwt -Z --set-tag "reference_samples=ref" --gbz-format -g graph.ref.gbz graph.giraffe.gbz

We'll also need a comparable linear reference. We extract the reference path:

vg paths --extract-fasta -x graph.ref.gbz --paths-by ref > linear.fa
# Index for bwa-mem (note: Minimap2 is a better comparison for long reads)
bwa index linear.fa

For a real test, you should use a genome-wide graph, such as those released by the HPRC

Generating reads with known-truth positions

Next we generate reads with vg sim. Make sure to annotate them with --multi-position for better correctness checking.

# Simulate single-end reads
vg sim -x graph.ref.gbz -n 1000 -F small/x.fa_1.fastq -a --multi-position --random-seed 47 > sim_errors.gam
# Convert to FASTQ for linear tools
vg view --fastq-out --align-in sim_errors.gam > sim_errors.fq

Note that vg sim can also handle long reads. Simply provide a different FASTQ training file and/or use a larger -l argument. Paired-end reads are also possible. See the vg sim wiki for more details.

Getting read alignments

Time for the main event!

vg giraffe -b default -Z graph.ref.gbz -f sim_errors.fq > giraffe.gam
bwa mem linear.fa sim_errors.fq > bwa.sam

It's possible to export the GAM file into SAM format, but since this is a vg wiki page we'll work in vg's native formats. Thus, we can inject the SAM alignments into graph-space:

# Get rid of secondary and supplementary alignments (which confuse vg's comparison tools)
samtools view -b -F 256 -F 2048 bwa.sam > bwa.bam
# Inject into graph-space
vg inject -x graph.ref.gbz --add-identity bwa.bam > bwa.gam

And we probably want to be "fair", and the linear alignments will necessarily look worse because they can't align against the variation present in the graph. So, let's force all the alignments to be on the reference:

# Onto the reference
vg surject -x graph.ref.gbz giraffe.gam > giraffe.surject.gam

Comparing the alignments

There are a few ways to compare these read alignments. High-level statistics come from vg stats -a <file.gam>. Read-by-read information comes from vg filter. We can also check the correctness of the alignments using vg gamcompare. The --range 0 here means that we're not allowing any wiggle room as to alignment locations. Long reads may benefit from being more flexible by using vg annotate --multi-position and vg gamcompare --range 200 to allow for a little ambiguity, e.g. in repetitive regions.

# Annotate first, so that gamcompare knows what it's looking at
vg annotate -x graph.ref.gbz --gam giraffe.surject.gam > giraffe.annot.gam
vg annotate -x graph.ref.gbz --gam bwa.gam > bwa.annot.gam

# Compare for correctness using the original simulations as truth
vg gamcompare --range 0 giraffe.annot.gam sim_errors.gam > giraffe.compare.gam
vg gamcompare --range 0 bwa.annot.gam sim_errors.gam > bwa.compare.gam

Evaluating alignment performance using simulation

Preparing the references

Generating reads with known-truth positions

Getting read alignments

Comparing the alignments

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally