Next: Matching Criteria for Up: Change Detection in Previous: The Complete Algorithm

Finding Good Matchings

In this section we consider the Good Matching problem, motivated in Section 3. We want to find an appropriate matching between the nodes of trees and that can serve as input to Algorithm EditScript. In applications in which the data has object-ids or keys, we can match nodes using these object-ids or keys. However, as described in Section 1, our focus here is on applications where information may not have keys or object-ids that can be used to match ``fragments'' of objects in one version with those in another. We use the term keyless data for hierarchical data that may not have identifying keys or object-ids.

When comparing versions of keyless data, there may be more than one way to match objects. Thus we need to define matching criteria that a matching must satisfy to be considered ``good'' or appropriate. In general, the matching criteria will depend on the domain being considered. One way of evaluating matchings that is desirable in many situations is to consider the minimum cost edit scripts that conform to the matchings (and transform into ). Intuitively, a matching that allows us to transform one tree to the other at a lower cost is a better matching. Formally, for matchings M and , we say that M is better than if a minimum cost edit script that conforms to M is cheaper than a minimum cost edit script that conforms to . Our goal is to find a best matching, that is, a matching M that satisfies the given matching criteria and such that there is no better matching that also satisfies the criteria.

Unfortunately, if our matching criterion only requires that matched nodes have the same label, then finding the best matching has two difficulties. The first difficulty is that many matchings that satisfy only this trivial matching criterion may be unnatural in certain domains. For example, when matching documents, we may only want to match textual units (paragraphs, sections, subsections, etc.) that have more than a certain percentage of sentences in common. The second difficulty is one of complexity: the only algorithm known to us to compute the best matching as defined above (based on post-processing the output of the algorithm in [ZS89]) runs in time where n is the number of tree nodes [Zha95]. To solve the first difficulty, we restrict the set of matchings we consider by introducing stronger matching criteria, as described below. These criteria also permit us to design efficient algorithms for matching. In the rest of this section, we describe some matching criteria for keyless data, using structured documents as an example.

Next: Matching Criteria for Up: Change Detection in Previous: The Complete Algorithm

Sudarshan S. Chawathe
Wed Jun 19 08:22:34 PDT 1996