Supporting Annotation Layers for Natural Language Processing

Marti Hearst, UC Berkeley School of Information

Today, most natural language processing (NLP) algorithms make use of the results of previous processing steps. For example, a word sense disambiguation algorithm may use the output of a tokenizer, a part-of-speech tagger, a phrase boundary recognizer, and a module that classifies noun phrases into semantic categories. Currently there is no standard way to represent and store the results of such processing for efficient retrieval.

We propose an annotation framework for marking up text with processing results, a query language for flexibly accessing portions of text that have been so annotated, and indexing architectures for efficiently performing retrievals against the annotated text. The model allows for both hierarchical and overlapping layers of annotation, as well as for querying at multiple levels of description. We demonstrate the power of the query language and the efficiency of the indexing architecture on a wide variety of query types that have been published in the NLP literature; our focus is on NLP for bioscience journal articles. The architecture is built on top of an RDBMS, and so can take advantage of the advanced indexing structures supplied by such systems. We believe this work is the first to experiment with different indexing structures in order to determine how to make annotation-based queries scale to very large corpora with many layers of annotations.

URL: http://biotext.berkeley.edu
Joint work with Preslav Nakov, Ariel Schwartz, and Brian Wolf
Supported by NSF DBI-0317510 and a gift from Genentech
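
To illustrate the idea of layered annotations stored in an RDBMS and queried across layers, here is a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in database. The table layout, column names, index, sample sentence, and the helper functions build_store and spans_within are all assumptions made for this example. They are not the schema, query language, or indexing architecture of the system described in the abstract.

```python
import sqlite3

# Hypothetical sample sentence; annotations refer to it by character offsets.
TEXT = "BRCA1 binds the p53 protein"

# (layer, start_char, end_char, label). Offsets are half-open: [start, end).
# Each layer is stored independently, so layers may nest or overlap freely.
ANNOTATIONS = [
    ("token", 0, 5, "BRCA1"),
    ("token", 6, 11, "binds"),
    ("token", 12, 15, "the"),
    ("token", 16, 19, "p53"),
    ("token", 20, 27, "protein"),
    ("pos", 0, 5, "NNP"),
    ("pos", 6, 11, "VBZ"),
    ("pos", 12, 15, "DT"),
    ("pos", 16, 19, "NN"),
    ("pos", 20, 27, "NN"),
    ("phrase", 0, 5, "NP"),
    ("phrase", 12, 27, "NP"),
    ("semantic", 0, 5, "gene"),
    ("semantic", 16, 27, "protein"),
]


def build_store(annotations, doc_id=1):
    """Load annotations into an in-memory table (assumed, illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        "doc_id INTEGER, layer TEXT, start_char INTEGER, "
        "end_char INTEGER, label TEXT)"
    )
    # One possible index choice; the abstract compares several, unspecified here.
    conn.execute(
        "CREATE INDEX idx_layer_label "
        "ON annotation (layer, label, doc_id, start_char)"
    )
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
        [(doc_id, *a) for a in annotations],
    )
    return conn


def spans_within(conn, inner_layer, outer_layer, outer_label):
    """Return inner-layer spans contained in outer-layer spans with a given label."""
    rows = conn.execute(
        """
        SELECT i.start_char, i.end_char, i.label
        FROM annotation AS o
        JOIN annotation AS i
          ON i.doc_id = o.doc_id
         AND i.start_char >= o.start_char
         AND i.end_char <= o.end_char
        WHERE o.layer = ? AND o.label = ? AND i.layer = ?
        ORDER BY i.start_char
        """,
        (outer_layer, outer_label, inner_layer),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_store(ANNOTATIONS)
    # Cross-layer query: which semantic categories occur inside noun phrases?
    for start, end, label in spans_within(conn, "semantic", "phrase", "NP"):
        print(label, repr(TEXT[start:end]))
```

Storing every layer as rows of character offsets over the same text is what lets hierarchical and overlapping annotations coexist in this sketch: containment becomes a range comparison in a self-join, and the database's own indexes can be varied without changing the query.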