Clio: XML Schema Mapping and Data Translation

Renee Miller
University of Toronto

miller@cs.toronto.edu

Abstract

Mapping and translating data stored in different formats continues to be an important problem in modern information systems. We present a novel framework for mapping among XML and relational schemas in which a high-level mapping is translated into semantically meaningful queries that transform source data to conform to a target schema. Our approach works in two phases. In the first phase, a high-level mapping, expressed as a set of attribute-to-attribute correspondences, is converted into a logical mapping that captures the design choices made in the source and target schemas (including their hierarchical organization and the grouping of attributes into nested tables and sets). The second phase translates the logical mapping into a query that can be executed over the source data and is guaranteed to produce data satisfying the constraints and structure of the target schema. To this end, target attribute values may need to be invented so that the data respects the constraints (including nested referential constraints) and the (possibly nested) structure of the target schema. Our approach is unique in that 1) we consider not only relational schemas, but also nested relational schemas with (nested) constraints; 2) for this large class of schemas, the mapping algorithm is complete in that it produces all mappings that are consistent with the schema constraints; and 3) our data translation algorithm correctly translates source data even when the target contains attributes with no correspondence to the source. We have implemented our algorithms in Clio, a schema mapping tool that uses the data itself to help users understand and choose among alternative mappings.
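
To make the idea of inventing target values concrete, the following is a minimal sketch in Python. It is not Clio's algorithm or generated query. The schemas, the attribute names (cname, gid, amount, id, name, fundings), and the helper functions skolem and translate are all hypothetical, chosen only for illustration. The sketch assumes a flat source relation and a nested target in which each organization carries a set of fundings and a required id attribute that has no correspondence to the source. The id is invented as a deterministic term over the source company name, so all rows for the same company are grouped under one target organization.

```python
from typing import Any, Dict, List

# Hypothetical flat source relation: one row per (company, grant).
source_rows: List[Dict[str, Any]] = [
    {"cname": "Acme", "gid": "g1", "amount": 100},
    {"cname": "Acme", "gid": "g2", "amount": 250},
    {"cname": "Bolt", "gid": "g3", "amount": 75},
]


def skolem(name: str, *args: Any) -> str:
    """Invent a value as a deterministic term over source values.

    Equal arguments always yield the same invented value, which is what
    lets rows for one company share a single target identifier.
    """
    return f"{name}({', '.join(str(a) for a in args)})"


def translate(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Translate flat rows into a nested target (illustrative only).

    Assumed correspondences: cname -> name, gid -> gid, amount -> amount.
    The target attribute 'id' has no source correspondence, so it is invented.
    """
    orgs: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        org_id = skolem("OrgId", row["cname"])
        # Create the organization once, then nest each funding under it.
        org = orgs.setdefault(
            org_id, {"id": org_id, "name": row["cname"], "fundings": []}
        )
        org["fundings"].append({"gid": row["gid"], "amount": row["amount"]})
    return list(orgs.values())


if __name__ == "__main__":
    for org in translate(source_rows):
        print(org["id"], org["name"], len(org["fundings"]))
```

Choosing the arguments of the invented term determines the grouping: here the id depends only on the company name, so the two Acme rows are nested under a single organization rather than producing two.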

Time permitting, I will discuss the relationship of this work to foundational work on the Universal Relation Model and to more recent work on answering queries using views.

This is joint work with Ron Fagin, Mauricio Hernandez, and Lucian Popa (IBM Almaden), and with Yannis Velegrakis (University of Toronto).