SST: An Algorithm for Searching Sequence Databases
in
Time Proportional to the Logarithm of the Database Size
Eldar Giladi, Michael G. Walker, James Z. Wang and Wayne Volkmuth
Abstract:
We have developed an algorithm, called SST (Sequence Search Tree),
that searches a database of DNA sequences for near exact matches, in
time proportional to the logarithm of the database size n. In SST, we
partition each sequence into fragments of fixed length called
"windows" using multiple offets. Each window is mapped into a vector
of dimension 4^k which contains the frequency of occurrence of its
component k-tuples, with k a parameter typically in the range 4 - 6.
Then we create a tree-structured index of the windows in vector space,
using tree structured vector quantization (TSVQ). We identify the
nearest-neighbors of a query sequence by partitioning the query into
windows and searching the tree-structured index for nearest neighbor
windows in the database. This yields an O (log n ) complexity for the
search. SST is most effective for applications in which the target
sequences show a high degree of similarity to the query sequence, such
as assembling shotgun sequences or matching ESTs to genomic sequence.
The algorithm is also an effective filtration method. Specifically,
it can be used as a preprocessing step for other search methods to
reduce the complexity of searching one large database against another.
For the problem of identifying overlapping fragments in the assembly
of 120,000 fragments from a 1.5 megabase genomic sequence, SST is 17
to 35 times faster than BLAST when we consider both building and
searching the tree. For searching alone (i.e., after building the tree
index), SST is 50 to 100 times faster than BLAST.
Abstract in Color
(PDF)
Abstract in Color
(PostScript)
Full Paper in Color
(PDF)
Full Paper in Color
(PostScript)
Citation:
Eldar Giladi, Michael G. Walker, James Z. Wang and Wayne Volkmuth,
``SST: An Algorithm for Searching Sequence Databases in Time
Proportional to the Logarithm of the Database Size,'' Currents in
Computational Molecular Biology, Poster Proc. International Conference
on Research in Computational Molecular Biology (RECOMB), Frontiers
Science Series no. 30, S. Miyano, R. Shamir, T. Takagi, (eds.),
Universal Academy Press, Inc., Tokyo, Japan, April 2000.
Copyright 2000 Universal Academy Press.
Personal use of this
material is permitted. However, permission to reprint/republish this
material for advertising or promotional purposes or for creating new
collective works for resale or redistribution to servers or lists, or
to reuse any copyrighted component of this work in other works, must
be obtained from the Universal Academy Press.
Last Modified:
October 1 1999
© 1999, James Z. Wang