de-duplicating a large sam file

I'm trying to de-duplicate a large SAM file (~200 GB). This tool seems promising, although I'd like to know more about how it handles unpaired reads, which I would still like to use in my pipeline. Before I worry about that, I'd like to get it working.

Here are some system details:
```
OS: Linux 4.18.0-492.el8.x86_64
machine architecture: x86_64
compiler: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-21)
```
The following command:

`samblaster -r -i SC-13c-TCMP_S214_L001_R1_001.fastq.merged.sam -o SC-13c.rmdup.sam.gz`

produces this output:

```
samblaster: Version 0.1.26
samblaster: Opening SC-13c-TCMP_S214_L001_R1_001.fastq.merged.sam for read.
samblaster: Opening SC-13c.rmdup.sam.gz for write.
samblaster: Loaded 439926763 header sequence entries.
samblaster: Unable to allocate signature set array.samblaster: Premature exit
```
The SAM file was merged from many tens of BAM files, each produced by mapping the same set of query sequences to a different reference set. The reference sets may contain duplicate entries, which is why I am trying to de-duplicate the merged file. The failure looks memory-related, but this machine has over 900 GB of free memory. Any help is greatly appreciated.
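
In case it helps with diagnosis, here is a minimal sketch (not part of samblaster) of one way to check the merged header independently. It counts the `@SQ` lines and how many distinct `SN` names they carry, which would show whether the roughly 440 million header sequence entries reported above include repeated reference names. The function name `count_sq_entries` is hypothetical. The sketch assumes an uncompressed SAM file whose header lines all come before the first alignment record. Whether the header size is actually what triggers the allocation failure is not something this confirms.

```python
import sys
from collections import Counter
from typing import Iterable, Tuple


def count_sq_entries(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of @SQ lines with an SN tag, number of distinct SN names).

    Hypothetical helper: reads only the header and stops at the first
    alignment record, so the body of a large file is never scanned.
    """
    names = Counter()
    for line in lines:
        if not line.startswith("@"):
            break  # header lines start with '@'; the first other line ends the header
        if line.startswith("@SQ\t"):
            # Fields are tab-separated TAG:VALUE pairs; SN holds the reference name.
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names[field[3:]] += 1
                    break
    return sum(names.values()), len(names)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_sq.py merged.sam
    with open(sys.argv[1]) as handle:
        total, distinct = count_sq_entries(handle)
    print(f"@SQ lines: {total}, distinct SN names: {distinct}")
```

Holding every reference name in a `Counter` needs memory proportional to the number of distinct names, which is large for a header of this size but should fit on a machine with several hundred GB free.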



de-duplicating a large sam file #59
