Description
I'm trying to de-duplicate a large SAM file (~200 GB). This tool seems promising, although I'd like to know more about how it handles unpaired reads, which I would still like to use in my pipeline. Before I worry about that, I'd like to get it working.
Here are some system details:
OS: Linux 4.18.0-492.el8.x86_64
machine architecture: x86_64
compiler: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-21)
The following command:
samblaster -r -i SC-13c-TCMP_S214_L001_R1_001.fastq.merged.sam -o SC-13c.rmdup.sam.gz
produces this output:
samblaster: Version 0.1.26
samblaster: Opening SC-13c-TCMP_S214_L001_R1_001.fastq.merged.sam for read.
samblaster: Opening SC-13c.rmdup.sam.gz for write.
samblaster: Loaded 439926763 header sequence entries.
samblaster: Unable to allocate signature set array.samblaster: Premature exit
The SAM file is merged from many tens of BAM files, each made by mapping the same query sequence set to a different reference set. The reference sets may contain duplicate entries, which is why I am trying to de-duplicate the merged file. This looks like a memory-related error, but I have over 900 GB of free memory on this machine. Any help is greatly appreciated.
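In case it helps with diagnosis: the log reports 439926763 header sequence entries, so it may be worth checking how many `@SQ` lines the merged header really contains and how many distinct `SN` names are among them. Below is a minimal Python sketch for that check. `count_sq_entries` is a hypothetical helper name (not part of samblaster), and the sketch assumes an uncompressed SAM file whose header sits in one block at the top. Whether the header size is what triggers the allocation failure is an assumption; it is not confirmed from samblaster's source.

```python
import sys
from collections import Counter


def count_sq_entries(lines):
    """Count how often each SN name appears in the @SQ header lines.

    `lines` is any iterable of SAM text lines. Scanning stops at the
    first line that does not start with '@', i.e. the first alignment
    record, so only the leading header block is read.
    """
    names = Counter()
    for line in lines:
        if not line.startswith("@"):
            break  # end of the header block
        if line.startswith("@SQ"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names[field[3:]] += 1
                    break
    return names


# Hypothetical usage: python count_sq.py merged.sam
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        names = count_sq_entries(handle)
    total = sum(names.values())
    print(f"@SQ lines: {total}, unique SN names: {len(names)}")
```

With hundreds of millions of entries the `Counter` itself will need a lot of memory, which should be fine on a machine with this much RAM. If the number of unique names is far below the number of `@SQ` lines, the header carries many repeated reference entries.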