Supporting Annotation Layers for Natural Language Processing
          Marti Hearst, UC Berkeley School of Information

Today, most natural language processing (NLP) algorithms make use of
the results of previous processing steps. For example, a word sense
disambiguation algorithm may use the output of a tokenizer, a
part-of-speech tagger, a phrase boundary recognizer, and a module that
classifies noun phrases into semantic categories. Currently, there is
no standard way to represent and store the results of such processing
for efficient retrieval.

We propose an annotation framework for marking text up with processing
results, a query language for flexibly accessing portions of text that
have been so annotated, and indexing architectures for efficiently
performing retrievals against the annotated text. The model allows for
both hierarchical and overlapping layers of annotation as well as for
querying at multiple levels of description. We demonstrate the power
of the query language and the efficiency of the indexing architecture
on a wide variety of query types that have been published in the NLP
literature (our focus is on NLP for bioscience journal articles).
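
To make the idea of hierarchical and overlapping layers concrete, here is
a minimal sketch in Python. It is not the framework or query language
described in this talk; it assumes a stand-off representation in which
every annotation is a character span with a layer name and a label. The
class Annotation, the helper functions, the layer names, and the toy
sentence are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation over a character span (assumed model)."""
    layer: str   # hypothetical layer name, e.g. "token", "phrase", "semantic"
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str


def contains(outer: Annotation, inner: Annotation) -> bool:
    """True if inner lies entirely within outer (hierarchical nesting)."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def phrases_containing(annotations, phrase_label, semantic_label):
    """Phrases with the given label that enclose a span carrying semantic_label."""
    phrases = [a for a in annotations
               if a.layer == "phrase" and a.label == phrase_label]
    targets = [a for a in annotations
               if a.layer == "semantic" and a.label == semantic_label]
    return [p for p in phrases if any(contains(p, t) for t in targets)]


# Toy data: offsets index into `text`; labels are invented for the example.
text = "BRCA1 protein binds DNA"
annotations = [
    Annotation("token", 0, 5, "BRCA1"),
    Annotation("token", 6, 13, "protein"),
    Annotation("token", 14, 19, "binds"),
    Annotation("token", 20, 23, "DNA"),
    Annotation("phrase", 0, 13, "NP"),
    Annotation("phrase", 20, 23, "NP"),
    Annotation("semantic", 0, 5, "gene"),
    # This span crosses the first NP boundary: overlapping, not nested.
    Annotation("semantic", 6, 19, "interaction"),
]

hits = phrases_containing(annotations, "NP", "gene")
hit_texts = [text[p.start:p.end] for p in hits]
```

Here the first noun phrase encloses the span labeled "gene", so it is the
only hit, while the span labeled "interaction" overlaps that phrase
without being nested in it. Keeping each layer as independent spans is
what lets both relationships coexist in this sketch.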

The architecture is built on top of an RDBMS, and so can take
advantage of advanced indexing structures supplied by such systems. We
believe this work is the first to experiment with different indexing
structures in order to determine how to make annotation-based queries
scale to very large corpora with many layers of annotations.
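
As a rough illustration of storing annotation layers in a relational
database, the sketch below uses Python's built-in sqlite3 module with an
in-memory database. The table layout, column names, index, and sample
rows are assumptions made for this example; the abstract does not specify
the actual schema, the RDBMS used, or the indexing structures compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per annotation, stored as a character span.
conn.execute("""
    CREATE TABLE annotation (
        doc_id    INTEGER NOT NULL,
        layer     TEXT    NOT NULL,
        label     TEXT    NOT NULL,
        start_pos INTEGER NOT NULL,
        end_pos   INTEGER NOT NULL
    )
""")

# One possible composite index; other column orders are equally plausible.
conn.execute("""
    CREATE INDEX idx_layer_label_span
    ON annotation (layer, label, doc_id, start_pos, end_pos)
""")

# Invented sample rows for two documents.
rows = [
    (1, "phrase", "NP", 0, 13),
    (1, "phrase", "NP", 20, 23),
    (1, "semantic", "gene", 0, 5),
    (2, "phrase", "NP", 0, 4),
    (2, "semantic", "gene", 10, 14),
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?)", rows)

# Find noun phrases that enclose a span labeled "gene" in the same document.
query = """
    SELECT p.doc_id, p.start_pos, p.end_pos
    FROM annotation AS p
    JOIN annotation AS s
      ON s.doc_id = p.doc_id
     AND s.start_pos >= p.start_pos
     AND s.end_pos <= p.end_pos
    WHERE p.layer = 'phrase' AND p.label = 'NP'
      AND s.layer = 'semantic' AND s.label = 'gene'
    ORDER BY p.doc_id, p.start_pos
"""
result = conn.execute(query).fetchall()
```

With these rows, only the first noun phrase of document 1 qualifies. A
cross-layer query of this kind becomes a self-join on span containment,
which is the sort of access pattern where the choice of index matters as
the corpus and the number of layers grow.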

URL: http://biotext.berkeley.edu
Joint work with Preslav Nakov, Ariel Schwartz, and Brian Wolf
Supported by NSF DBI-0317510 and a gift from Genentech