Structured computer representations of biomedical data:
application to the ribosome

Russ B. Altman, M.D., Ph.D.
Stanford Medical Informatics

Abstract

The information explosion gripping molecular biology has challenged our traditional mechanisms for the collection, storage, and analysis of experimental data. In particular, it is becoming more difficult to create explanatory and predictive models that are consistent both internally and with the huge volumes of published data. The difficulty increases when a large variety of heterogeneous experimental approaches is used to gather data from multiple perspectives. A central strategy for managing this information overload is the creation of technologies that store and represent these data in novel ways. To facilitate computational processing, it is especially critical to develop standardized, structured formats for representing biological data.

The large majority of biological experiments have no standardized template for reporting their results. These results are still predominantly disseminated as published text, accompanied by figures and tables for summary and convenience. While this format is useful for knowledge extraction by readers on a per-article basis, it does not allow efficient integration of all data relevant to a particular topic, and it is certainly not amenable to computer-based data extraction for further computation on these data.

To show the value of structured representations of data in dealing with these critical issues, we have built a prototype knowledge base (RiboWEB) of structural data pertaining to the small (30S) ribosomal subunit of E. coli. Diverse types of data, taken principally from published journal articles, are represented using a set of templates within this knowledge base, and these data are richly linked to one another. This representation not only makes data retrieval easier and more convenient for human users, but also facilitates automated data analysis by computer programs. We believe that formal representations of the data and models within scientific subdisciplines hold promise as a key method for delivering the next generation of scientific data resources, and that they represent the way in which scientific data should be published in the future.
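
As a rough illustration of what templated, linked records might look like, the following Python sketch defines two templates and a small in-memory knowledge base. It is not the RiboWEB schema: the class names (Publication, Observation, KnowledgeBase), their fields, and the sample records are all hypothetical placeholders chosen only to show how links between records support both retrieval and simple automated queries.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Publication:
    """Hypothetical template for the article an observation came from."""
    key: str
    title: str


@dataclass(frozen=True)
class Observation:
    """Hypothetical template for one experimental result."""
    kind: str            # experimental approach, e.g. "cross-linking"
    objects: tuple       # names of the molecular components involved
    source: Publication  # link back to the publication record


@dataclass
class KnowledgeBase:
    """Minimal in-memory store; a real system would persist its records."""
    observations: list = field(default_factory=list)

    def add(self, obs):
        self.observations.append(obs)

    def about(self, name):
        """Return every observation that mentions the named component."""
        return [o for o in self.observations if name in o.objects]

    def by_kind(self, kind):
        """Return every observation gathered with the given approach."""
        return [o for o in self.observations if o.kind == kind]


# Placeholder records for illustration only, not real experimental data.
paper = Publication(key="ref-1", title="Placeholder article")
kb = KnowledgeBase()
kb.add(Observation("cross-linking", ("protein-A", "rna-region-1"), paper))
kb.add(Observation("footprinting", ("protein-A", "rna-region-2"), paper))
kb.add(Observation("cross-linking", ("protein-B", "rna-region-2"), paper))

# Integrate across experiment types: everything known about one component.
hits = kb.about("protein-A")
kinds = sorted({o.kind for o in hits})
```

Because every observation carries the same fields and a link to its source, a program can collect all results about one component across heterogeneous experiment types, which is difficult to do when the same results exist only as free text.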