-
Notifications
You must be signed in to change notification settings - Fork 16
Description
I came across this bug when using Piranha to analyse eCLIP data. I have already filtered for the second read with samtools (which is the read used to determine the crosslink site), and was using the BAM file from that filtering as input to Piranha. However, I found that the read binning was giving inaccurate results. Through a bit of exploration, I found that this was caused by the SEQ and QUAL fields.
I've uploaded two BAM files, which were generated as follows:
# retrieve the header
samtools view -H data/HepG2_RBFOX2.bam > output/header
# get the first 10,000 reads from chr1
samtools view data/HepG2_RBFOX2.bam | grep "chr1\t" | head -10000 > output/reads
# replace the SEQ and QUAL fields with *
cat output/reads | awk 'BEGIN{OFS="\t"}{$10="*"; $11="*"; print}' > output/reads_no_seq
# combine the header and the reads and convert to BAM files
cat output/header output/reads | samtools view -b > output/reads.bam
cat output/header output/reads_no_seq | samtools view -b > output/reads_no_seq.bam
# run Piranha for each bam file
docker run -v $(pwd):/mount quay.io/biocontainers/piranha:1.2.1--h4f60aae_8 \
Piranha -v -s -b 10 -o /mount/output/reads.bed /mount/output/reads.bam
docker run -v $(pwd):/mount quay.io/biocontainers/piranha:1.2.1--h4f60aae_8 \
Piranha -v -s -b 10 -o /mount/output/reads_no_seq.bed /mount/output/reads_no_seq.bam
The BAM files look as follows:
$ samtools view output/reads.bam | head -2
CCTTG:SN1001:449:HGTN3ADXX:1:1206:8464:69989 147 chr1 14771 255 43M = 14681 -133 CACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGT B<FFFFFB<0<<<IIFBF<07FFFBFIFFFFFBB<B<BBFFFB NH:i:1 HI:i:1 AS:i:80 nM:i:0 NM:i:0 MD:Z:43 jM:B:c,-1 jI:B:i,-1 RG:Z:foo-44A07990
GAGAG:SN1001:449:HGTN3ADXX:2:2103:11297:36292 147 chr1 16211 255 43M = 16137 -117 TAGCCATGCTCTGACAGTCTCAGTTGTACACACGAGCCAGCAG FBBFFBBFBFBBFFFFFBFIFFFFFFBFFFFBFFFFFFFFFFB NH:i:1 HI:i:1 AS:i:76 nM:i:1 NM:i:1 MD:Z:26C16 jM:B:c,-1 jI:B:i,-1 RG:Z:foo-44A07990
$ samtools view reads_no_seq.bam | head -2
CCTTG:SN1001:449:HGTN3ADXX:1:1206:8464:69989 147 chr1 14771 255 43M = 14681 -133 * * NH:i:1 HI:i:1 AS:i:80 nM:i:0 NM:i:0 MD:Z:43 jM:B:c,-1 jI:B:i,-1 RG:Z:foo-44A07990
GAGAG:SN1001:449:HGTN3ADXX:2:2103:11297:36292 147 chr1 16211 255 43M = 16137 -117 * * NH:i:1 HI:i:1 AS:i:76 nM:i:1 NM:i:1 MD:Z:26C16 jM:B:c,-1 jI:B:i,-1 RG:Z:foo-44A07990
And the outputs from Piranha are:
$ head output/reads.bed
chr1 630710 630720 X 26 + 2.0749e-06
chr1 632620 632630 X 11 + 0.00152753
chr1 846530 846540 X 33 + 4.3324e-08
chr1 960080 960090 X 16 + 0.000144504
chr1 960420 960430 X 11 + 0.00152753
chr1 960440 960450 X 13 + 0.000637789
chr1 960890 960900 X 16 + 0.000144504
chr1 961020 961030 X 11 + 0.00152753
chr1 961090 961100 X 14 + 0.000398054
chr1 961150 961160 X 12 + 0.000983474
$ head output/reads_no_seq.bed
chr1 630710 630720 X 97 + 1.17462e-14
chr1 633990 634000 X 22 + 0.000243025
chr1 846520 846540 X 29 + 6.64929e-05
chr1 960420 960430 X 18 + 0.000666816
chr1 960440 960450 X 18 + 0.000666816
chr1 960810 960830 X 19 + 0.000536713
chr1 960870 960880 X 20 + 0.000406609
chr1 960890 960920 X 24.6667 + 0.000146072
chr1 962160 962170 X 19 + 0.000529893
chr1 1038700 1038710 X 19 + 0.000529893
The read counts in reads_no_seq.bed are consistently higher, and seem to correspond to the correct number of reads in that window (from eyeballing the reads in a genome browser).
The first window, 630710-20 shows a particularly large disparity, so I looked at the reads in that window in more detail (attached). There were 97, as indicated in the reads_no_seq.bed sample, However, I couldn't work out how 26 reads were counted for that region!