Next: Conclusion Up: Meaningful Change Detection in Previous: Implementation and Performance

Related Work

The general problem of detecting changes from snapshots of data has been studied before from different angles. For example, [&make_named_href('', "node21.html#WF74","[WF74]")] defines a string-to-string correction problem as the problem of finding the best sequence of insert, delete, and update operations that transform one string to another. The problem is developed further in [&make_named_href('', "node21.html#Wag75","[Wag75]")], which adds the ``swap'' operation to the list of edit operations. These papers also introduce the structure of a ``trace'' or a matching between the characters of the strings being compared as a useful tool for computing an edit script. A simpler change detection problem for strings, using only insertions and deletions as edit operations has been studied extensively [&make_named_href('', "node21.html#Mye86","[Mye86]"), &make_named_href('', "node21.html#WMM90","[WMG90]")]. The idea of a longest common subsequence replaces the idea of a trace in this simpler problem. A variant of the algorithm presented in [&make_named_href('', "node21.html#Mye86","[Mye86]")] for computing the longest common subsequence is implemented in the gnudiff [&make_named_href('', "node21.html#gnudiff","[HHS ]")] program. All these algorithms work with strings, that is, with flat-file, or relational data, and are not suitable for computing changes in structured data.

In [&make_named_href('', "node21.html#ZS89","[ZS89]"), &make_named_href('', "node21.html#SZ90","[SZ90]")], the authors define a change detection problem for ordered trees, using insertion, deletion, and label-update as the edit operations, observing its added difficulty compared to the equivalent problem for strings; they also present an efficient dynamic-programming based algorithm to solve that problem. A proof of the -hardness of a similar change detection problem (using insertion, deletion, and label-update) for unordered trees is presented in [&make_named_href('', "node21.html#ZWS95","[ZWS95]")], which also presents an algorithm for a restricted version of the change detection problem. In [&make_named_href('', "node21.html#markSearchTreeMatching","[SWZS94]")], the authors present an enumerative (exponential time) algorithm for the change detection problem for unordered trees, as well as heuristic algorithms based on search techniques such as simulated annealing. An important assumption made by the algorithms in [&make_named_href('', "node21.html#ZS89","[ZS89]"), &make_named_href('', "node21.html#SZ90","[SZ90]"), &make_named_href('', "node21.html#ZWS95","[ZWS95]"), &make_named_href('', "node21.html#markSearchTreeMatching","[SWZS94]")] is that the cost of updating any label to any other label is always less than the cost of deleting a node with the old label and inserting a node with the new label. While this restriction is reasonable for some domains, it does not always lead to intuitive results. For example, consider two trees with the same structure, but completely different labels on the nodes (e.g., two trees representing different query results, but with a similar structure). Assuming the cost of label update is always lower than the cost of the corresponding insertion and deletion will result in an edit script that simply updates all the labels in the trees. While this is technically sound, it is not the semantically desirable result for this example.

In [&make_named_href('', "node21.html#tdiff","[CRGMW96]")] we defined a variant of the change detection problem for ordered trees, using subtree moves as an edit operation in addition to insertions, deletions, and updates, and presented an efficient algorithm for solving it. That algorithm uses domain characteristics to find a solution efficiently. A major drawback of the algorithm in [&make_named_href('', "node21.html#tdiff","[CRGMW96]")] is that it assumes that the number of duplicates (or near duplicates) in the labels found in the input trees is very small. Another drawback of of the algorithm in [&make_named_href('', "node21.html#tdiff","[CRGMW96]")] is that it assumes each node of the input trees has a special tag that describes its semantics. (For example, an ordered tree representing a document may have tags ``paragraph,'' ``section,'' etc.) Furthermore, that algorithm assumes the existence of a total order over these tags such that a node with tag cannot be the child of a node with tag unless . While these assumptions are reasonable in a text comparison scenario, there are many domains in which they do not hold.

The work presented in this paper differs from previous work in several important ways. Firstly, we detect the change detection problem for unordered trees, which is inherently harder than the similar problem for ordered trees. Secondly, we consider a rich set of edit operations, including copy and move operations, that make the edit script computed more meaningful and intuitively usable. Furthermore, we do not assume that the nodes of the input trees are ``tagged'' in a manner required by the algorithm in [&make_named_href('', "node21.html#tdiff","[CRGMW96]")], nor do we assume the absence of duplicates (or near duplicates) in the labels of the nodes in the input trees. Finally, we do not assume that the cost of updating any label to any other label is always less than the cost of deletion and insertion.

Next: Conclusion Up: Meaningful Change Detection in Previous: Implementation and Performance

Sudarshan S. Chawathe
Sat Feb 22 12:28:02 PST 1997