Recall from Section 5 that the running time of
Algorithm *FastMatch* is given by an expression of the form . In this expression, represents the number of
leaf node comparisons (invocations of function *compare*), **c** is
the average cost of comparing leaf nodes, and represents the
number of node partner checks. Partner checks are implemented in *
LaDiff* as integer comparisons. We know that is bounded by , and that is bounded by ** 2lne **, where **n** is the
number of tree nodes, **e** is the weighted edit distance between the
two trees, and **l** is the number of internal node labels. The
parameter **e** depends on the nature of the differences between the
trees (recall the definition of weighted edit distance in Section
5.2).

There are two reasons for studying the performance of FastMatch
empirically. The first reason is that the formula for the running
time contains the weighted edit distance, **e**, which is difficult to
estimate in terms of the input. A more natural measure of the input
size is the number of edit operations in an optimal edit script, which
we call the *unweighted* edit distance, **d**. We can show
analytically that the ratio is bounded by for a large
class of inputs, but we believe that in real cases, its value is much
lower than . We therefore study the relationship between **e**
and **d** empirically. The second reason is that we would like to test
our conjecture that the analytical bound on the running time of
FastMatch is ``loose,'' and in most practical situations the algorithm
runs much faster.

**Figure 11:** Weighted edit distance

For our performance study, we used three sets of files. The files in
each set represent different versions of a document (a conference
paper). We ran FastMatch on pairs of files within each of these three
sets. (Comparing files across sets is not meaningful because we would
be comparing two completely different documents.) In
Figure 11 we indicate how the weighted edit distance
(**e**) varies with the unweighted edit distance (**d**), for each of the
three document sets. Recall that **n** is the number of tree leaves,
that is, the number of sentences in the document. We see that the
relationship between **e** and **d** is close to linear. Furthermore,
the variance with respect to the three document sets is not high,
suggesting that is not very sensitive to the size of the
documents (**n**). The average value of is 3.4 for these
documents.

In Figure 12 we plot how the running time of
FastMatch varies with the weighted edit distance **e**. The vertical
axis is the running time as measured by the number of comparisons made
by FastMatch and the horizontal axis is the weighted edit distance.
Note that the analytical bound on the number of comparisons made by
FastMatch is much higher than the numbers depicted in
Figure 12; on the average, FastMatch makes
approximately 20 times fewer comparisons than those predicted by the
analytical bound, supporting our conjecture that the analytical
bound on the running time is a loose one. We also observe that
Figure 12 suggests an approximately linear relation
between the running time and **e**, although there is a high variance.
This variance may be explained by our first observation that the
actual running time is far below the predicted bound.

Wed Jun 19 08:22:34 PDT 1996