A Matching Algorithm

Next: Implementation and Performance Up: Finding Good Matchings Previous: Matching Criteria for

A Matching Algorithm

Theorem 5.1 allows us to design a simple algorithm for computing the best matching. This algorithm, called algorithm Match, is presented in [CRGMW95], and runs in quadratic time. Below, we present a faster algorithm, called FastMatch, for computing the unique maximal matching. Our algorithm uses a function equal to compare nodes. For leaf nodes, is true if and only if and , where f is a parameter valued between 0 and 1. For internal nodes, is true if and only if and , where is a parameter.

Figure 10 shows Algorithm FastMatch, which uses the longest common subsequence (LCS) routine, introduced earlier in Section 4.2, to perform an initial matching of nodes that appear in the same order. Nodes still unmatched after the call to LCS are processed using linear search. We assume that all nodes with a given label l in tree T are chained together from left to right. Let denote the chain of nodes with label l in tree T. Node x occurs to the left of node y in if x appears before y in the in-order traversal of T when siblings are visited left-to-right.

Figure 10: Algorithm FastMatch

Now we analyze the running time of Algorithm FastMatch. Define the weighted edit distance e between trees and as follows. Let be the shortest edit script that transforms to . Then the weighted edit distance is given by where , for , is 1 if is an insert or a delete, |x| if is a move of the subtree rooted at x, and 0 otherwise. Recall that |x| denotes the number of leaf nodes that are descendants of node x. Intuitively, the weighted edit distance indicates how different the two trees are ``structurally,'' where the degree of difference associated with the move of a subtree depends on the number of leaves in the subtree. In [CRGMW95] we show that the running time of Algorithm FastMatch is proportional to where n is the total number of leaf nodes, c is the average cost of comparing two leaves (using the compare function), l is the number of internal node labels, and e is the weighted edit distance between and .

Next: Implementation and Performance Up: Finding Good Matchings Previous: Matching Criteria for

Sudarshan S. Chawathe
Wed Jun 19 08:22:34 PDT 1996