Most previous work in change management has dealt only with flat-file and relational data. For example, [LGM95] presents algorithms for efficiently comparing sets of records that have keys. The GNU diff utility compares flat text files by computing the LCS (longest common subsequence) of their lines using the algorithm described in [Mye86]. There are also a number of front-ends to this standard diff program that display its results in a more comprehensible manner; the ediff program [Kif95] is a good example. However, since the standard diff program does not understand the hierarchical structure of data, such utilities suffer from certain inherent drawbacks. Given large data files with several changes, diff often mismatches regions of data. (For example, while comparing LaTeX files, an item is sometimes matched to a section, a sentence is sometimes matched to a LaTeX command, and so on.) Furthermore, these utilities do not detect moves of data; moves are always reported as deletions and insertions.

Some commercial word processors have facilities for comparing documents and marking changes. For example, Microsoft Word has a ``revisions'' feature that can detect simple updates, inserts, and deletes of text, but it cannot detect moves. WordPerfect has a ``mark changes'' facility that can detect some move operations; however, it restricts how documents can be compared (on either a word, phrase, sentence, or paragraph basis). Furthermore, these approaches do not generalize to non-document data.
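To make concrete why an LCS-based line diff reports a moved block as a deletion plus an insertion, here is a minimal sketch of such a diff. This is a simple quadratic dynamic program for illustration only; GNU diff itself uses the faster algorithm of [Mye86], and the function names here are ours.

```python
def lcs_diff(a, b):
    """Return an edit script as (op, line) pairs, op in {'=', '-', '+'}."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of the suffixes a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table, keeping common lines and emitting edits otherwise.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(('=', a[i])); i += 1; j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            out.append(('-', a[i])); i += 1
        else:
            out.append(('+', b[j])); j += 1
    out.extend(('-', line) for line in a[i:])
    out.extend(('+', line) for line in b[j:])
    return out

# A paragraph moved to the end shows up as a delete plus an insert,
# because line-based LCS has no notion of a "move":
old = ["intro", "moved para", "body", "end"]
new = ["intro", "body", "end", "moved para"]
for op, line in lcs_diff(old, new):
    print(op, line)
```

Running this prints `- moved para` near the top and `+ moved para` at the bottom: the same content appears twice in the script, which is exactly the drawback described above.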
The general problem of finding the minimum-cost edit distance between ordered trees has been studied in [ZS89]. Compared to the algorithm presented there, our algorithm is more restrictive in that we make some assumptions about the nature of the data being represented. Our algorithm always yields correct results, but if the assumptions do not hold it may produce suboptimal results. Because of our assumptions, we are able to design an algorithm with a lower running-time complexity. In particular, our algorithm runs in time O(ne + e^2), where n is the number of tree leaves and e is the ``weighted edit distance'' (typically, e << n). The algorithm in [ZS89] runs in time O(n^2 log^2 n) for balanced trees (and even higher for unbalanced trees). Our work also uses a different set of edit operations from those used in [ZS89]. The two sets of edit operations are equivalent in the sense that any state reachable using one set is also reachable using the other. A more detailed comparison of the two sets of edit operations appears in [CRGMW95].
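The sense in which the two operation sets are equivalent can be sketched on a toy ordered labeled tree. The class and function names below are our own illustration, with deliberately simplified semantics; they are not the exact formal definitions used in [ZS89] or in our algorithm. The key point is that a subtree move reaches the same state as a delete of the subtree followed by an insert of it elsewhere.

```python
# Illustrative sketch only: a toy ordered labeled tree with simplified
# edit operations (not the formal definitions of either paper).

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children if children is not None else []

def update(node, new_label):
    """Relabel a node (an operation present in both sets)."""
    node.label = new_label

def insert(parent, subtree, pos):
    """Insert a subtree as the pos-th child of parent."""
    parent.children.insert(pos, subtree)

def delete(parent, pos):
    """Delete (and return) the pos-th child subtree of parent."""
    return parent.children.pop(pos)

def move(old_parent, pos, new_parent, new_pos):
    """Move a subtree in one step; with the other operation set the
    same state is reached by a delete followed by an insert."""
    insert(new_parent, delete(old_parent, pos), new_pos)

def shape(node):
    """Structural snapshot, used to compare tree states."""
    return (node.label, [shape(c) for c in node.children])

# Moving subtree "b" under "a" in one step, or by delete + insert,
# yields the same tree state:
t1 = Node("doc", [Node("a"), Node("b", [Node("c")])])
t2 = Node("doc", [Node("a"), Node("b", [Node("c")])])
move(t1, 1, t1.children[0], 0)            # single move operation
insert(t2.children[0], delete(t2, 1), 0)  # delete, then insert
```

A cost model would of course price these scripts differently (one operation versus two), which is why the choice of operation set affects which edit script is reported as minimal, even though the reachable states coincide.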
We believe our approach and that of [ZS89] are complementary; the choice of algorithm depends on the characteristics of the domain. In an application where the amount of data is small (e.g., small tree structures), or where we are willing to spend more time (e.g., biochemical structures), the more thorough algorithm of [ZS89] may be preferred. However, in applications with large amounts of data (e.g., object hierarchies, database dumps), or with strict running-time requirements, we would use our algorithm. The efficiency of our method is based on exploiting certain domain characteristics. Even in domains where these characteristics do not hold for all of the data, it may be preferable to get a quick, correct, but not necessarily optimal, solution using our approach.