Abstracts for Genome Databases Seminar

"Overview of Genome Databases"
Peter D. Karp, SRI International

Bioinformatics is an emerging scientific discipline at the intersection of computer science and life science. Genome databases is a subarea of bioinformatics that is concerned with managing the large volumes of experimental data that are generated by new high-throughput techniques in biology, and the symbolic computational theories that result from integrating across those data. This talk will survey research issues in genome databases, and illustrate those issues with an analogy to reverse engineering of computer viruses. The talk will also examine why the computer science field of databases has had so little impact on the field of genome databases, and outline a proposed symbolic computing curriculum for scientists that I argue is essential in this industrial age of science.

"Scaling Genomics Data Integration - Moving Bioinformatics into the Present"
Terence Critchlow, Lawrence Livermore National Laboratory

Despite huge investments of time, energy, and money over the past decade, data integration remains one of the key computer science problems facing genomics researchers. The DataFoundry project at LLNL is an ongoing effort to improve geneticists' access to and interactions with their data, and as such has been trying to address this problem for years. While describing our historical and current research efforts will be the technical focus of this presentation, several key reasons why the data integration problem continues to loom over genomics research will also be identified.

"Database Design - Data Modeling"
Iqbal Panesar, Incyte Genomics

Data modeling is an essential initial step in database design during the analysis/design phase of projects. There are numerous opportunities in the industry that present generic models. Good modeling techniques help translate business requirements into entities and relationships. There are points to consider while following the concepts of entity-relationship modeling. The talk also covers the concepts of entity, relationship, attribute, and domain, their types, and normalization and denormalization. An example case study is used to illustrate the entity-relationship model, and the talk concludes with other challenges in database design.

"Gene Expression Data Management: Key Challenges and a Case Study"
Victor M. Markowitz, GeneLogic Inc.

Gene expression data from a vast number of microarray experiments are generated at numerous pharmaceutical and biotech companies, as well as academic laboratories. Comprehensive gene expression data analysis requires integrating these data with information on the samples involved in the gene expression experiments and on the genes and ESTs on the microarrays. The massive and rapidly growing size of the gene expression data poses a continuous challenge to the effective use of traditional data management techniques for collecting, storing, analyzing, and distributing these data. Furthermore, the inherent imprecision of the algorithms used to generate gene expression data requires supporting multiple and evolving interpretations of the observed scientific phenomena, and compounds the problem of managing already massive amounts of data. An additional challenge is posed by the need to incorporate data generated at disparate sites under different experimental conditions.

We will discuss these challenges in the context of Gene Logic's Genesis™ gene expression data management system, the platform used for delivering gene expression data to subscribers of the GeneExpress® Data Suites and for integrating Gene Logic with customer gene expression data.

"Developments at the Protein Data Bank"
Phil Bourne, San Diego Supercomputer Center

Approximately 3 years ago the PDB moved to the Research Collaboratory for Structural Bioinformatics from Brookhaven National Lab, where it had been for 27 years. As such it represents one of the oldest databases in biology, yet, like other areas of biology, it is now growing rapidly with the advent of structural genomics. We are reengineering the PDB at this time to take advantage of this expected influx of data, yet must remain backward compatible. This presentation will thus cover the past, present, and future of the PDB.

"PharmGKB: The Pharmacogenetics Knowledge Base"
Daniel Rubin, Stanford University

The explosive growth of biomedical data becoming available for analysis gives us an unprecedented opportunity to learn the genetic basis for disease. Pharmacogenetics is a discipline that seeks to understand the genetic basis of the variation in drug response among people. Large-scale studies in pharmacogenetics are being done to collect genotypes and phenotypes in large populations. But to make sense of this information, computational tools capable of efficiently accessing and analyzing these data are needed. Genetic data are complex, and simply storing raw sequences in a relational database will be inadequate to meet the detailed query support and analytical needs of pharmacogenetics.

We created the PharmGKB (http://pharmgkb.org/), a robust storage and retrieval system of pharmacogenetics data that contains genetic sequences, cellular and molecular phenotype, and clinical data. One of the goals of this resource is to provide analytical functionality to connect genotype with phenotype. To accomplish this, we are modeling the domain of pharmacogenomics using an ontology, and we are building a knowledge base to integrate data from multiple study centers as well as from external databases. Building a production system based on an ontology has presented many unique challenges as well as exciting research opportunities. In this talk I will discuss the benefits and challenges of our approach to developing this resource.

"Integrated Data Systems for Interpreting Genome-Focused Data in Cancer "
Ajay Jain, University of California, San Francisco

Biology is rapidly evolving into a quantitative molecular science. Many factors contribute to the acceleration of this evolution, including completion of the human genome sequence and improvements in measurement technology for DNA, RNA, and proteins. The result of this trend is an increasing pool of quantitative biological data. Given quantitative data, it is possible to induce constraints and interrelationships to construct predictive models of biological systems. Such models can provide context for the rapid interpretation of experimental observations. We believe that by integrating quantitative analytical methods and data visualization approaches with annotation information about biological entities, we can accelerate biological research by stimulating the generation of hypotheses that would otherwise be missed. Our collaborations are generating extensive and growing sets of microarray-based expression data and/or high-resolution genomic copy number data in multiple human cancers. This seminar will present a data system that enables integrated analysis combining experimental data with genomic and genetic annotations.

"Saccharomyces Genome Database"
J. Michael Cherry, Stanford University

The Saccharomyces Genome Database (SGD) is a resource for the biomedical research community. The project provides information on the genes, including the genetics, physical interactions, and mutant phenotypes. It also provides results from functional genomics projects: microarray, systematic deletion, SAGE, and two-hybrid analysis. This talk will provide an overview of the SGD curation process, the types of information available from SGD, and the annotation system referred to as the "Gene Ontology".

"The Elucidation of Regulatory Networks in Complex Biological Systems: The Convergence of Biology, Medicine and Computing"
George Poste, Health Technology Networks

The focus of biological research will shift increasingly to the analysis of the regulation of complex biological systems in both health and disease. Elucidation of the information content encoded in genomic and proteomic networks, and of how these instructions are integrated to specify and control body form and function, will present formidable intellectual, technical and logistical challenges. The rise of "systems biology" heralds the dawn of big biology. Just as "big physics" evolved to accommodate a massive expansion of research data and computer-intensive problems, biology and medicine are poised for the same transition. Systems biology will generate data streams on an unprecedented scale. Academia and healthcare industry organizations are ill prepared for the technical, financial and organizational implications of large-scale computing. The union of biology and medicine with computing will occur with accelerating momentum and will be dominated by the following requirements: scale, speed, standards, simulation and security. The rise of computational biology has profound implications for research, clinical practice, education, intellectual property and industrial competitiveness. Universities will need to replace anachronistic curricula and outdated organizational structures based on narrow specialties with new interdisciplinary units for teaching and research. Government funding agencies must adopt new policies to adapt to the expanded scale of modern research and its cross-sector linkages between healthcare, computing and telecommunications. In its more advanced iterations, computational biology will usher in a new era of biology in silico, in which simulation of complex biological systems will have sufficient predictive accuracy to replace many aspects of costly experimentation and will create new physician decision-support tools to identify the optimum approaches to clinical care that best reflect the genetic and phenotypic characteristics of individual patients.