Next: Related Work Up: Meaningful Change Detection in Previous: Computing a Min-Cost Edge

Implementation and Performance

In this section, we describe our implementation of MH-DIFF, and discuss its analytical and empirical performance. Figure 9 depicts the overall architecture of our implementation, with rectangles representing the modules (numbered, for reference) of the program, and other shapes representing data. Given two trees and as input, Module 1 constructs the induced graph (Section 3.1). This induced graph is next pruned (Module 2) using the pruning rules of Section 5.3 to give the pruned induced graph. In Module 2, the update cost for each edge in the induced graph is computed using the domain-dependent comparison function for node labels (Section 2.2). The next three modules together compute a minimum-cost edge cover of the pruned induced graph using the reduction of the edge cover problem to a weighted matching problem [&make_named_href('', "node21.html#combopt","[PS82]")]. That is, the pruned induced graph is first translated (by Module 3) into an instance of a weighted matching problem. This weighted matching problem is solved using a package (Module 4) [&make_named_href('', "node21.html#wmatch","[Rot]")] based on standard techniques [&make_named_href('', "node21.html#combopt","[PS82]")]. The output of the weighted matching solver is a minimum-cost matching, which is translated by Module 5 into , a minimum-cost edge cover of the pruned induced graph. Next, Module 6 uses the minimum-cost edge cover computed, to produce the desired edit script, using the method described in Section 4.2).

Figure 9: System Architecture

Recall that since we use a heuristic cost function to compute a minimum-cost edge cover, the edge cover produced by our program, and hence the edit script may not be the optimal one. We have also implemented a simple search module that starts with minimum-cost edge cover (see Figure 9) computed by our program and explores its neighborhood of minimal edge covers in an effort to find a better solution. The search proceeds by first exploring minimal edge covers that contain only one edge not in . Next, we explore minimal edge covers containing two edges not in , and so on. The intuition is that we expect the optimal solution to be ``close'' to the initial solution . Although, in the worst case, such an exploration may be extremely time-consuming, note that as a result of pruning edges, the search space is typically much smaller than the worst case. Due to space constraints, we do not describe the details of this search phase in this paper.

We have used our implementation to compute the differences between query results as part of the Tsimmis and projects at Stanford [&make_named_href('', "node21.html#TsimmisOverview","[CGMH 94]"), &make_named_href('', "node21.html#c3","[WU95]")]. These projects use the OEM data model, which is a simple labeled-object model to represent tree-structured query results. In particular, we have run our system on the output of Tsimmis queries over a bibliographic information source that contains information about database-related publications in a format similar to BibTeX. Since the data in this information source is mainly textual, we treat all labels as strings. For the domain-dependent label-update cost function, we use a weighted character-frequency histogram difference scheme that compares strings based on the number of occurrences of each character of the alphabet in them. For example, consider comparing the labels ``foobar'' and ``crowbar.'' The character-frequency histograms are, respectively, and . The difference histogram is . Adding up the magnitudes of the differences gives us 5, which we then normalize by the total number of characters in the strings (13), and scale by a parameter (currently 5), to get the update cost (5/13)*5 = 1.9 .

Let us now analyze the running time of our program. Let n be the total number of nodes in both input trees and . Constructing the induced graph (Module 1, in Figure 9) involves building a complete bipartite graph with O(n) nodes on each side. We also evaluate the domain-dependent label-comparison function for each pair of nodes, and store this cost on the corresponding edge. Thus, building the induced graph requires time , where k is the cost of the domain-dependent comparison function. Next, consider the pruning phase (Module 2). By maintaining a priority queue (based on edge costs) of edges incident on each node of the induced graph, the test to determine whether an edge may be pruned can be performed in constant time. If the edge is pruned, removing it from the induced graph requires constant time, while removing it from the priority queues at each of its nodes requires O(log n) time. When an edge [m,n] is pruned, we also record the changes to the costs , , , and , which can be done in constant time. Thus, pruning an edge requires O(log n) time. Since at most are pruned, the total worst case cost of the pruning phase is . Let e be the number of edges that remain in the induced graph after pruning. The minimum-cost edge cover is computed in time O(ne) by Modules 3, 4, and 5. The computation of the edit script from the minimum-cost edge cover can be done in O(n) time by Module 6. (Note that the number of edges in a minimal edge cover is always O(n) .)

The number of edges that remain in the induced graph after pruning (denoted by e above) is an important metric for three main reasons. Firstly, as seen above, a lower number of edges results in faster execution of the minimum-cost edge cover algorithm. Secondly, a smaller number of edges decreases the possibility of finding a suboptimal edge cover, since there are fewer choices that need to be made by the algorithm. Thirdly, having a smaller number of edges in the induced graph reduces exponentially the size of the space of candidate minimal edge covers that the search module needs to explore.

Figure 10: Effectiveness of pruning

Given the importance of the metric e, we have conducted a number of experiments to study the relationship between e and n. We start with four ``input'' trees representing actual results of varying sizes from our Tsimmis system. For each input tree, we generate a batch of ``output'' trees by applying a number of random edits. The number of random edits is either 10% or 20% of the number of nodes in the input tree. Then for each output tree, we run MH-DIFF on it and its original input tree. The results are summarized by the graph in Figure 10. The horizontal axis indicates the total number of nodes in the two trees being compared (and hence, in the induced graph). The vertical axis indicates the number of edges that remain after pruning the induced graph. Note that the ideal case (best possible pruning) corresponds to , since we need at least edges to cover n nodes, whereas the worst case is (no pruning at all). For comparison, we have also plotted e = n/2 and on the graph in Figure 10. We observe that the relationship between e and n is close to linear, and that the observed values of e are much closer to n/2 than to .

Note that in Figure 10 we have plotted the results for two different values of d, the percentage of random edit operations applied to the input tree. We see that, for a given value of n, a higher value of d results in a higher value of e, in general. We note that some points with a higher d value seem to have a lower value of e than the general trend. This is because applying d random edits is not the same as having the input and output trees separated by d edits, due to the possibility of redundant edit operations. Thus, some data points, even though they were obtained by applying d random edits, actually correspond to fewer changes in the tree.

We have also studied the quality of the initial solution produced by MH-DIFF. In particular, we are interested in finding out in what fraction of cases our method produces suboptimal initial solutions, and by how much the cost of the suboptimal solution exceeds that of the optimal. Given the exponential (in e) size of the search space of minimal edge covers of the induced graph, it is not feasible to try exhaustive searches on large datasets. However, we have exhaustively searched the space of minimal edge covers, and corresponding edit scripts, for smaller datasets. We ran 50 experiments, starting with an input tree derived as in the experiments for e above, and using 6 randomly generated edit operations to generate an output tree. We searched the space of minimal edge covers of the pruned induced graph exhaustively for these cases, and found that the MH-DIFF initial solution differed from the minimum-cost one in only 2 cases out of 50. That is, in 96% of the cases MH-DIFF found the minimum cost edit script, and of course it did this in much less time than the exhaustive method. In the two cases where MH-DIFF missed, the resulting script cost about 15% more that the minimum cost possible.

Next: Related Work Up: Meaningful Change Detection in Previous: Computing a Min-Cost Edge

Sudarshan S. Chawathe
Sat Feb 22 12:28:02 PST 1997