Clio: A Tool for Data Mapping via Query Discovery

Laura Haas
IBM Almaden Research Center

laura@almaden.ibm.com

Abstract

At the heart of many data-intensive applications is the problem of quickly and accurately transforming data into a new form. With the growing power of modern database engines, it is appealing to use declarative queries to represent the transformations needed, and reap the benefits of query optimization, parallelism, and logical data independence. Yet tools for creating, managing, and understanding the complex queries necessary for data transformation are still too primitive to permit widespread adoption of this approach. Clio addresses this problem in several novel ways. In this talk, we describe a new paradigm for interactive mapping creation that relies on value correspondences, which show how a value of a target attribute can be created from a set of values of source attributes. We also present an algorithm for incrementally deriving a query from an evolving set of value correspondences. This algorithm, at the heart of Clio, allows complex SQL queries to be generated from simple value correspondences. Data mapping queries require the use of complex, non-associative operators (such as multiple outer joins). Reasoning about such operators can be extremely difficult, and as a result users may have difficulty understanding the subtle distinctions between two complex queries. However, to scale to large schemas, mapping tools must permit users to incrementally create, evolve, and compose such complex queries. Another feature of Clio is a new framework that uses data examples as the basis for understanding and refining declarative schema mappings. The framework presents the user with carefully selected examples that both illuminate a specific mapping (helping the user to understand the mapping) and illustrate any differences from alternative mappings (helping the user to differentiate mappings).
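To make the idea of value correspondences concrete, the following is a minimal sketch, not Clio's actual algorithm. It assumes a toy source schema (`emp`, `dept`) and a hypothetical helper `build_mapping_query`. Each correspondence pairs a target attribute with an SQL expression over source attributes, and the query is regenerated as correspondences are added. The join path is supplied by hand here, whereas the abstract describes deriving the query from the correspondences themselves.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueCorrespondence:
    """Hypothetical record: how one target attribute is computed from source values."""
    target_attr: str   # attribute name in the target schema
    source_expr: str   # SQL expression over source attributes


def build_mapping_query(correspondences, root, joins=()):
    """Assemble a single SELECT from the current set of correspondences.

    root  -- source table whose rows must all appear in the result
    joins -- (table, condition) pairs, applied as LEFT OUTER JOINs so that
             root rows without a join partner are kept (padded with NULLs)
    """
    select_list = ", ".join(
        f"{c.source_expr} AS {c.target_attr}" for c in correspondences
    )
    from_clause = root + "".join(
        f" LEFT OUTER JOIN {table} ON {cond}" for table, cond in joins
    )
    return f"SELECT {select_list} FROM {from_clause}"


# Toy source data (an assumption made for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (id INTEGER, first TEXT, last TEXT, dept_id INTEGER);
    CREATE TABLE dept (id INTEGER, name TEXT);
    INSERT INTO emp VALUES (1, 'Ada', 'Lovelace', 10), (2, 'Alan', 'Turing', NULL);
    INSERT INTO dept VALUES (10, 'Research');
""")

# Step 1: one correspondence, one source table.
correspondences = [
    ValueCorrespondence("full_name", "emp.first || ' ' || emp.last"),
]
query_v1 = build_mapping_query(correspondences, root="emp")
rows_v1 = conn.execute(query_v1).fetchall()

# Step 2: the user adds a correspondence that draws on a second table,
# so the regenerated query needs an outer join to keep every employee.
correspondences.append(ValueCorrespondence("dept_name", "dept.name"))
query_v2 = build_mapping_query(
    correspondences, root="emp", joins=[("dept", "emp.dept_id = dept.id")]
)
rows_v2 = conn.execute(query_v2).fetchall()
```

In this sketch, the employee with no department still appears in `rows_v2`, with a NULL `dept_name`. An inner join would silently drop that row, which is one reason mapping queries tend to rely on outer joins.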

Biography

Laura Haas is a research staff member and manager at IBM's Almaden Research Center. Dr. Haas joined IBM in 1981 to work on the R* distributed relational database management project, and subsequently managed the Starburst extensible query processing work, which forms the basis of the DB2 UDB query processor. Dr. Haas then headed the Exploratory Database Systems Department at Almaden for three and a half years. She returned to project management to start the Garlic project on heterogeneous middleware systems. Technology from the Garlic project enables access to heterogeneous data sources in the latest releases of DB2, and forms the basis for a new IBM offering (DiscoveryLink) for Life Sciences R&D. Dr. Haas is currently "acting CTO" for the DiscoveryLink offering, as well as manager of an exploratory research project on schema mapping (Clio). Dr. Haas was vice-chair of ACM SIGMOD from 1989 to 1997. She has served as an Associate Editor of the ACM journal Transactions on Database Systems and as Program Chair of the 1998 ACM SIGMOD technical conference, and was recently elected to the VLDB Board of Trustees. She has received IBM awards for Outstanding Technical Achievement and Outstanding Contributions, and a YWCA Tribute to Women in INdustry (TWIN) award.