The general problem of detecting changes from snapshots of data has
been studied before from different angles. For example, [&make_named_href('',
"node21.html#WF74","[WF74]")]
defines a string-to-string correction problem as the problem of
finding the best sequence of insert, delete, and update operations
that transform one string to another. The problem is developed
further in [&make_named_href('',
"node21.html#Wag75","[Wag75]")], which adds the ``swap'' operation to the list
of edit operations. These papers also introduce the structure of a
``trace'' or a matching between the characters of the strings being
compared as a useful tool for computing an edit script. A simpler
change detection problem for strings, using only insertions and
deletions as edit operations has been studied extensively
[&make_named_href('',
"node21.html#Mye86","[Mye86]"), &make_named_href('',
"node21.html#WMM90","[WMG90]")]. The idea of a longest common subsequence
replaces the idea of a trace in this simpler problem. A variant of
the algorithm presented in [&make_named_href('',
"node21.html#Mye86","[Mye86]")] for computing the longest
common subsequence is implemented in the gnudiff [&make_named_href('',
"node21.html#gnudiff","[HHS ]")]
program. All these algorithms work with strings, that is, with
flat-file, or relational data, and are not suitable for computing
changes in structured data.
In [&make_named_href('',
"node21.html#ZS89","[ZS89]"), &make_named_href('',
"node21.html#SZ90","[SZ90]")], the authors define a change detection problem for
ordered trees, using insertion, deletion, and label-update as
the edit operations, observing its added difficulty compared to the
equivalent problem for strings; they also present an efficient
dynamic-programming based algorithm to solve that problem. A proof of
the -hardness of a similar change detection problem (using
insertion, deletion, and label-update) for unordered trees is
presented in [&make_named_href('',
"node21.html#ZWS95","[ZWS95]")], which also presents an algorithm for a
restricted version of the change detection problem. In
[&make_named_href('',
"node21.html#markSearchTreeMatching","[SWZS94]")], the authors present an enumerative
(exponential time) algorithm for the change detection problem for
unordered trees, as well as heuristic algorithms based on search
techniques such as simulated annealing. An important assumption made
by the algorithms in [&make_named_href('',
"node21.html#ZS89","[ZS89]"), &make_named_href('',
"node21.html#SZ90","[SZ90]"), &make_named_href('',
"node21.html#ZWS95","[ZWS95]"), &make_named_href('',
"node21.html#markSearchTreeMatching","[SWZS94]")] is
that the cost of updating any label to any other label is always less
than the cost of deleting a node with the old label and inserting a
node with the new label. While this restriction is reasonable for
some domains, it does not always lead to intuitive results. For
example, consider two trees with the same structure, but completely
different labels on the nodes (e.g., two trees representing different
query results, but with a similar structure). Assuming the cost of
label update is always lower than the cost of the corresponding
insertion and deletion will result in an edit script that simply
updates all the labels in the trees. While this is technically sound,
it is not the semantically desirable result for this example.
In [&make_named_href('',
"node21.html#tdiff","[CRGMW96]")] we defined a variant of the change detection problem
for ordered trees, using subtree moves as an edit operation in
addition to insertions, deletions, and updates, and presented an
efficient algorithm for solving it. That algorithm uses domain
characteristics to find a solution efficiently. A major drawback of the
algorithm in [&make_named_href('',
"node21.html#tdiff","[CRGMW96]")] is that it assumes that the number of
duplicates (or near duplicates) in the labels found in the input trees
is very small. Another drawback of of the algorithm in [&make_named_href('',
"node21.html#tdiff","[CRGMW96]")] is
that it assumes each node of the input trees has a special tag that
describes its semantics. (For example, an ordered tree representing a
document may have tags ``paragraph,'' ``section,'' etc.) Furthermore,
that algorithm assumes the existence of a total order over these
tags such that a node with tag
cannot be the child of a node with
tag
unless
. While these assumptions are
reasonable in a text comparison scenario, there are many domains in
which they do not hold.
The work presented in this paper differs from previous work in several important ways. Firstly, we detect the change detection problem for unordered trees, which is inherently harder than the similar problem for ordered trees. Secondly, we consider a rich set of edit operations, including copy and move operations, that make the edit script computed more meaningful and intuitively usable. Furthermore, we do not assume that the nodes of the input trees are ``tagged'' in a manner required by the algorithm in [&make_named_href('', "node21.html#tdiff","[CRGMW96]")], nor do we assume the absence of duplicates (or near duplicates) in the labels of the nodes in the input trees. Finally, we do not assume that the cost of updating any label to any other label is always less than the cost of deletion and insertion.