Inconsistent results depending on whether SEQ and QUAL fields are set

I came across this bug when using Piranha to analyse eCLIP data. I have already filtered for the second read with `samtools` (which is the read used to determine the crosslink site), and was using the BAM file from that filtering as input to Piranha. However, I found that the read binning was giving inaccurate results. Through a bit of exploration, I found that this was caused by the SEQ and QUAL fields.

I've uploaded two BAM files, which were generated as follows:

```
# retrieve the header
samtools view -H data/HepG2_RBFOX2.bam > output/header
# get the first 10,000 reads from chr1
samtools view data/HepG2_RBFOX2.bam | grep "chr1\t" | head -10000 > output/reads
# replace the SEQ and QUAL fields with *
cat output/reads | awk 'BEGIN{OFS="\t"}{$10="*"; $11="*"; print}' > output/reads_no_seq

# combine the header and the reads and convert to BAM files
cat output/header output/reads | samtools view -b > output/reads.bam
cat output/header output/reads_no_seq | samtools view -b > output/reads_no_seq.bam

# run Piranha for each bam file
docker run -v $(pwd):/mount quay.io/biocontainers/piranha:1.2.1--h4f60aae_8 \
    Piranha -v -s -b 10 -o /mount/output/reads.bed /mount/output/reads.bam
docker run -v $(pwd):/mount quay.io/biocontainers/piranha:1.2.1--h4f60aae_8 \
    Piranha -v -s -b 10 -o /mount/output/reads_no_seq.bed /mount/output/reads_no_seq.bam
```

The BAM files look as follows:

```
$ samtools view output/reads.bam | head -2
CCTTG:SN1001:449:HGTN3ADXX:1:1206:8464:69989    147     chr1    14771   255     43M     =       14681   -133    CACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGT     B<FFFFFB<0<<<IIFBF<07FFFBFIFFFFFBB<B<BBFFFB     NH:i:1  HI:i:1  AS:i:80 nM:i:0  NM:i:0  MD:Z:43 jM:B:c,-1       jI:B:i,-1       RG:Z:foo-44A07990
GAGAG:SN1001:449:HGTN3ADXX:2:2103:11297:36292   147     chr1    16211   255     43M     =       16137   -117    TAGCCATGCTCTGACAGTCTCAGTTGTACACACGAGCCAGCAG     FBBFFBBFBFBBFFFFFBFIFFFFFFBFFFFBFFFFFFFFFFB     NH:i:1  HI:i:1  AS:i:76 nM:i:1  NM:i:1  MD:Z:26C16      jM:B:c,-1       jI:B:i,-1       RG:Z:foo-44A07990
$ samtools view reads_no_seq.bam | head -2
CCTTG:SN1001:449:HGTN3ADXX:1:1206:8464:69989    147     chr1    14771   255     43M     =       14681   -133    *       *       NH:i:1  HI:i:1  AS:i:80 nM:i:0  NM:i:0  MD:Z:43 jM:B:c,-1       jI:B:i,-1       RG:Z:foo-44A07990
GAGAG:SN1001:449:HGTN3ADXX:2:2103:11297:36292   147     chr1    16211   255     43M     =       16137   -117    *       *       NH:i:1  HI:i:1  AS:i:76 nM:i:1  NM:i:1  MD:Z:26C16      jM:B:c,-1       jI:B:i,-1       RG:Z:foo-44A07990
```

And the outputs from Piranha are:

```
$ head output/reads.bed 
chr1    630710  630720  X       26      +       2.0749e-06
chr1    632620  632630  X       11      +       0.00152753
chr1    846530  846540  X       33      +       4.3324e-08
chr1    960080  960090  X       16      +       0.000144504
chr1    960420  960430  X       11      +       0.00152753
chr1    960440  960450  X       13      +       0.000637789
chr1    960890  960900  X       16      +       0.000144504
chr1    961020  961030  X       11      +       0.00152753
chr1    961090  961100  X       14      +       0.000398054
chr1    961150  961160  X       12      +       0.000983474
$ head output/reads_no_seq.bed 
chr1    630710  630720  X       97      +       1.17462e-14
chr1    633990  634000  X       22      +       0.000243025
chr1    846520  846540  X       29      +       6.64929e-05
chr1    960420  960430  X       18      +       0.000666816
chr1    960440  960450  X       18      +       0.000666816
chr1    960810  960830  X       19      +       0.000536713
chr1    960870  960880  X       20      +       0.000406609
chr1    960890  960920  X       24.6667 +       0.000146072
chr1    962160  962170  X       19      +       0.000529893
chr1    1038700 1038710 X       19      +       0.000529893
```

The read counts in `reads_no_seq.bed` are consistently higher, and seem to correspond to the correct number of reads in that window (from eyeballing the reads in a genome browser).

The first window, 630710-20 shows a particularly large disparity, so I looked at the reads in that window in more detail (attached). There were 97, as indicated in the `reads_no_seq.bed` sample, However, I couldn't work out how 26 reads were counted for that region!

[reads_no_seq.bam.gz](https://github.com/smithlabcode/piranha/files/7258418/reads_no_seq.bam.gz)
[reads.bam.gz](https://github.com/smithlabcode/piranha/files/7258419/reads.bam.gz)
[chr1_630710_20_reads.txt](https://github.com/smithlabcode/piranha/files/7258581/chr1_630710_20_reads.txt)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent results depending on whether SEQ and QUAL fields are set #17

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Inconsistent results depending on whether SEQ and QUAL fields are set #17

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions