Towards Declarative and Efficient Querying on Protein Structures

The fairly recent publication of a draft of the human genome has
served to fuel an already explosive area of research in the life
sciences. Even with the human genome sequence now in hand, life
science researchers still face a number of challenges such as
determining exact gene locations and functions. Increasingly, a
critical aspect of such research requires analyzing large volumes of
biological data sets. Unfortunately, existing approaches used in such
research rely on awkward procedural querying methods and often use
query evaluation algorithms that do not scale as the data set size
increases. Many biological data sets are growing exponentially, which
will make these existing methods even more cumbersome in the future.
Efficient and declarative methods for querying these data sets
are urgently needed.

In this talk, I will describe our research efforts in building a
database management system, called Periscope, to meet these
challenges. Our current focus in this project is on supporting the
database querying needs in the area of functional proteomics. I will
touch upon various aspects of Periscope, including an algebra that we
have developed for querying protein sequences and geometrical
structures. I will spend most of the talk describing a new sequence
matching algorithm that is often more accurate and faster than BLAST,
the popular sequence search tool. (BLAST is the “Google” equivalent
for searching biological sequence data sets.) I will
conclude by pointing to some actual life sciences problems that are
being investigated using Periscope, and by highlighting the benefits
that declarative and efficient querying can bring to the life sciences
community.