Most of today's data is still stored in files rather than in databases. This fact has become even more evident with the growth of the World Wide Web in the 1990s. Because of that observation, the research area of semistructured data has evolved. Semistructured data is typically stored in documents and has an irregular, partial, and implicit structure. The thesis presents a new framework for querying semistructured data. Traditional database management requires design and ensures declarativity. The possibilities to design are limited in the field of semistructured data, thus, a more flexible approach is needed.
We argue that semistructured data should be represented by a set of partial schemata rather than by one complete schema. Because of irregularities of the data, a complete schema would be very large and not representative. Instead, partial schemata can serve as good representations of parts of the data. While finding a complete schema turns out to be difficult, a database designer may be able to provide partial schemata for the database. Also, partial schemata can be extracted from user queries if the query language is designed appropriately. We suggest to split the notion of query into a ``What''- and a ``How''-part. Partial schemata represent the ``What''-part. They cover semantically richer concepts than database schemata traditionally do. Among these concepts are predicates, variable definitions, and path descriptions. Schemata can be used for query optimization, but they also give users hints on the content of the database. Finding the occurrences (matches) of such a schema forms the most important part of query execution. All queries of our approach, such as the focus query or the transformation query, are based on this matching. Query execution can be optimized using knowledge about containment relationships between different schemata.
Our approach and the optimization techniques are conceptually modeled and implemented as a prototype on the basis of Constraint Satisfaction Problems (CSPs). CSPs form a general class of search problems for which many techniques and heuristics exist. A CSP consists of variables that have a domain associated to them. Constraints restrict the values that variables can simultaneously take. We transform the problem of finding the matches of a schema in a database to a CSP. We prove that under certain conditions the matches of a schema can be found without any search and in polynomial time. For optimization purposes the containment relationship between schemata is explored. We formulate a sufficient condition for schema containment and test it again using CSP techniques. The containment relationship can be used in two ways depending on the direction of the containment: It is either possible to reduce the search space when looking for matches of a schema, or it is possible to present the first few matches immediately without any search. Our approach has been implemented into the constraint system ECLiPSe and tested using XML documents.