kennedydane/sdclust

Introduction

Expressed Sequence Tags, or ESTs, are an artificial data set derived from sequenced mRNA. As such, they provide insight into the transcriptome of many organisms. They are particularly useful when trying to find genes in organisms whose genome has not yet been sequenced. For an introduction to ESTs, see http://en.wikipedia.org/wiki/Expressed_sequence_tag.

EST clustering is an important step in going from the biological sample to the final digital data describing the expressed DNA. When the cDNA clones are sequenced, we generally get many copies of the same expressed sequence, but their starting and ending points can differ. The clustering stage identifies which ESTs are related to one another so that the data can be assembled and a consensus sequence identified.

Clustering poses two main difficulties. First, there is a lot of data, and because every EST may need to be compared with every other, the number of comparisons grows quadratically with the number of sequences, which can result in very long computation times. Second, the data can contain errors, so exact matches between two related sequences are rare.

This program aims to tackle these problems using several tactics. To handle the complexity issue, two sequences are treated as a potential match only when they have a certain number of words in common. Since sequences from different portions of expressed DNA are less likely to share a large number of words, unnecessary comparisons are avoided. The error-prone data is dealt with in the final comparison, which uses the d2 distance measure. This is a validated method that uses word frequencies to determine sequence similarity.
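
The two steps described above can be illustrated with a minimal Python sketch. It is not taken from sdclust's source. It assumes that a "word" is an overlapping substring of fixed length `k`, that candidates are found through a word index, and that d2 is computed as the sum of squared differences between word counts over whole sequences (d2 is often evaluated over sliding windows, which is omitted here). The function names, `k`, and `min_shared` are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def words(seq, k):
    """Return every overlapping length-k word in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def candidate_pairs(seqs, k, min_shared):
    """Return index pairs of sequences sharing at least min_shared distinct words.

    Pairs are discovered through a word index, so sequences that share
    no word at all are never compared with each other.
    """
    index = defaultdict(set)  # word -> ids of the sequences containing it
    for sid, seq in enumerate(seqs):
        for w in set(words(seq, k)):
            index[w].add(sid)

    shared = Counter()  # (i, j) -> number of distinct shared words
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1
    return sorted(p for p, n in shared.items() if n >= min_shared)


def d2_squared(a, b, k):
    """Sum of squared differences between the k-word counts of a and b."""
    ca, cb = Counter(words(a, k)), Counter(words(b, k))
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())


if __name__ == "__main__":
    # Toy sequences: the first two overlap, the third is unrelated.
    seqs = ["ACGTACGTAC", "CGTACGTACG", "TTTTGGGGCC"]
    for i, j in candidate_pairs(seqs, k=4, min_shared=3):
        print(i, j, d2_squared(seqs[i], seqs[j], k=4))
```

Only the pairs that pass the shared-word filter reach the more expensive d2 comparison. A low d2 value means the two sequences have similar word frequencies, so a sequencing error changes a handful of word counts rather than ruling out the match entirely.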

About

Automatically exported from code.google.com/p/sdclust
