12 changes: 2 additions & 10 deletions _posts/2018-08-11-Projects2018.txt.md
@@ -19,19 +19,11 @@ tags: participants apply
----------------------------------------

## [Project 1: REUSE](https://github.com/chorltsd/REUSE)
**Team Lead:** Sam Chorlton (sam.chorlton [AT] pm.me)

*Help change the world by filtering unneeded sequences from a next-generation sequencing dataset, enriching signal from noise and enabling rapid pathogen discovery, isolation of sequence types (eg. rRNA), contaminant removal and more.*
**Team Lead:** [Sam Chorlton](samchorlton.com) (sam [AT] samchorlton.com)

**Abstract:**

Filtering unwanted sequences from nucleic acid sequencing data is an important step in many analyses. It has been used to remove technical artefacts (eg. PhiX), discover known and novel pathogens, isolate nucleic acid types (eg rRNA), and remove noise in metagenomic studies. This step significantly improves the speed and quality of subsequent analyses.

Here I propose an end-to-end pipeline (REUSE) for Rapidly Eliminating Unwanted SEquences from large sequencing datasets. The result of REUSE will be sequences that do not belong to a reference sequence. This pipeline will be based on previously established techniques for isolating known and novel pathogens among sequencing data. It will seek to dramatically speed up the process, optimize flaws in other pipelines, and automate it from start to finish. It will likely include a k-mer filter, read alignment, read assembly, and contig alignment. Some of these steps will be based on publicly available tools, such as RNA-STAR and Trinity, whereas others will need to be programmed from the ground up.

The work at hackseq18 will focus on development of the most novel and needed module, the k-mer filter (k-REUSE). Previous evidence indicates that k-mers can be used to rapidly screen and filter sequences, and that a k-mer of 21 basepairs is sufficient to discriminate between unrelated species.(1) Currently published applications, such as Kontaminant(1), Cookiecutter(2), BBDuk(3) and others have several limitations, including lack of parallelization, high memory requirements (>50 GB for the human genome), and lack of ability to save the reference index to disk. Other techniques, such as read alignment, are too slow to use on large datasets.

The goal of HackSeq18 will be the development of k-REUSE and comparison to other filters. Further development will likely be needed after the hackathon for integration of k-REUSE into the complete REUSE pipeline and ultimate application to extremely large datasets.
Filtering unwanted sequences from nucleic acid sequencing data is an important step in many analyses. It has been used to remove technical artefacts (eg. PhiX), discover known and novel pathogens, isolate nucleic acid types (eg. rRNA), and remove noise in metagenomic studies. This step significantly improves the speed and quality of subsequent analyses. Commonly used approaches include mapping of reads to a reference set, or filtration via Bloom trees or k-mer hashes. K-mers have been shown to accurately differentiate reads between species in a fraction of the time required by traditional read mapping approaches; however, currently implemented approaches such as BBDuk, Kontaminant and Cookiecutter are limited by high memory usage, lack of parallelization, and other practical constraints. Here, we develop REUSE, a program to Rapidly Eliminate Useless SEquences. REUSE implements a minimal perfect hash function to generate a reference index with limited RAM and time. Searching the index is performed using the complete k-mer set from each read, and reads can be discarded or retained, depending on user preference, if they contain a pre-specified number of k-mers found in the index. In comparisons against other tools on simulated and real data, REUSE is consistently faster and uses less RAM. REUSE demonstrates similar accuracy to traditional read mappers, and produces identical results to other k-mer based tools. REUSE is publicly available at https://github.com/chorltsd/REUSE.
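
The filtering rule described in the abstract can be illustrated with a small Python sketch. This is not REUSE's implementation: a plain Python `set` stands in for the minimal perfect hash index, reverse complements are ignored, and the names `kmers`, `build_index`, `count_hits`, `filter_reads`, `min_hits` and `keep_matching` are hypothetical, chosen only for this example.

```python
from typing import Iterable, Iterator, List, Set


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of seq (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references: Iterable[str], k: int) -> Set[str]:
    """Collect all reference k-mers.

    Assumption: a Python set replaces the minimal perfect hash
    function mentioned in the abstract.
    """
    index: Set[str] = set()
    for ref in references:
        index.update(kmers(ref, k))
    return index


def count_hits(read: str, index: Set[str], k: int) -> int:
    """Count how many of the read's k-mers occur in the index."""
    return sum(1 for kmer in kmers(read, k) if kmer in index)


def filter_reads(reads: Iterable[str], index: Set[str], k: int,
                 min_hits: int = 1, keep_matching: bool = False) -> List[str]:
    """Discard (default) or retain reads with at least min_hits indexed k-mers."""
    kept = []
    for read in reads:
        matched = count_hits(read, index, k) >= min_hits
        if matched == keep_matching:
            kept.append(read)
    return kept


if __name__ == "__main__":
    index = build_index(["ACGTACGTGGCCTTAA"], k=5)
    reads = ["ACGTACGTGG", "TTTTTTTTTT"]
    # Default mode drops reads that match the reference.
    print(filter_reads(reads, index, k=5))
    # keep_matching=True retains only the matching reads instead.
    print(filter_reads(reads, index, k=5, keep_matching=True))
```

The `keep_matching` flag mirrors the "discarded or retained, depending on user preference" behaviour, and `min_hits` plays the role of the pre-specified number of k-mers. A real tool would also need to handle reverse complements and use a far more memory-efficient index.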

----------------------------------------
