Next: Model and Problem Definition Up: Meaningful Change Detection in Previous: Meaningful Change Detection in

Introduction

Detection of changes between data structures is an important function in many applications. For example, in the World-Wide Web an analyst may be interested in knowing how a competitor's site has changed since the last time visited. This may be achieved by saving a snapshot of the previous HTML pages at the site (something that most browsers do for efficiency anyway). In a CAD design environment, an engineer may wish to understand the differences between two related but concurrently developed chip designs. In a distributed file system, an administrator may need to detect differences between two mirror file systems that became partitioned and independently modified. In a warehousing environment, the changes at a site need to be identified so that a materialized view can be incrementally maintained.

In this paper we present an efficient algorithm, MH-DIFF, for meaningful change detection between two hierarchically structured data snapshots, or trees. The key word here is meaningful (the ``M'' in the name). That is, our goal is to portray the changes between two trees in a succinct and descriptive way. As is commonly done, we portray the changes as an edit script that gives the sequence of operations needed to transform one tree into another. However, in this paper we use a richer set of operations than has ever been used before, and this leads, we believe, to much higher quality edit scripts.

In particular, we use move and copy operations, in addition to the more traditional insert, delete, and update operations. Thus, if a substructure (e.g., a section of text, a shift register) is moved to another location, our algorithm will report it as a single operation. If the substructure is copied (e.g., a second shift register is added which is identical to one already in the circuit), then our algorithm will identify it as such. Traditional change detection algorithms would report such changes as sequences of inserts and deletes (or simply inserts in the case of a copy), which do not convey the true meaning of the change.

Note that detecting moves and copies becomes more important if the moved or copied subtree is large. For instance, if we are comparing file systems, and a large directory with thousands of files is mounted elsewhere, we clearly do not wish to report the change as thousands of file deletes followed by thousands of file creations. Also note that to detect moves and copies, it is essential that our algorithm understand the structure as well as the content of the data. Thus, our algorithm cannot treat the data as ``flat'' information, e.g., as files with records or relations with tuples. This means that techniques developed for flat change detection [&make_named_href('', "node21.html#Mye86","[Mye86]"), &make_named_href('', "node21.html#WilburtDiff","[LGM96]")] are not applicable here.

Algorithm MH-DIFF has two additional important features:

It does not rely on the existence of node (atomic object) identifiers that can match nodes in one tree to nodes in the other. In many applications such identifiers do not exist. For instance, sentences and paragraphs in text documents do not come with unique identifiers attached. Even when the nodes are stored in a database system (e.g., circuit components), we may be comparing copies with the same content but different identifiers. Thus, for full generality, MH-DIFF does not assume unique identifiers that span the two trees, and instead compares the contents of nodes to determine if they are related. (If the trees have such identifiers, MH-DIFF could easily take advantage of them, but we do not discuss that here.)
Algorithm MH-DIFF is based on a fairly flexible cost model. Each operation in the repertoire is given a user-defined fixed cost, except for the update operation, whose cost is determined by a user-provided function that compares the values of two nodes. This gives end users great latitude in saying what types of edit scripts are preferable for an application.

There is a good reason why difference algorithms with the features we have described here have not been developed earlier, even though they are clearly desirable. The reason is the inherent complexity of the problem; one can show that the problem is -hard. Algorithm MH-DIFF provides a heuristic solution, which is based on transforming the problem to the ``edge cover domain.'' That is, instead of working with edit scripts, the algorithm works with edge covers that represent how one set of nodes match another set. In this transformation, the costs of the edit operations are translated into costs on the edges of the cover.

In an earlier paper [&make_named_href('', "node21.html#tdiff","[CRGMW96]")] we studied a much simpler version of the change detection problem. In that work we did not consider copy operations, we assumed that the number of duplicates of a node was very limited, we assumed ordered trees, and we assumed that nodes had ``tags'' that reflect the structural constraints on the input trees. (For example, nodes were tagged as say ``paragraphs'' or ``sections,'' making it easier to match nodes.) All these restrictions made it much simpler to find a minimum-cost edit script, and indeed we developed an efficient algorithm that found a minimum-cost script. Here, on the other hand, here we drop these restrictions, and introduce copy operations. This leads to an algorithm that is very different from the one in [&make_named_href('', "node21.html#tdiff","[CRGMW96]")], and that yields a heuristic solution in worst-case time, where n is the number of nodes, but most often in roughly time. In Section 7 we compare in more detail MH-DIFF to our earlier work, as well as to other work on change detection.

Next: Model and Problem Definition Up: Meaningful Change Detection in Previous: Meaningful Change Detection in

Sudarshan S. Chawathe
Sat Feb 22 12:28:02 PST 1997