next up previous
Next: Related Work Up: Change Detection in Previous: Change Detection in

Introduction

 

We study the problem of detecting and representing changes to hierarchically structured information. Detecting changes to data (henceforth referred to as deltas) is a basic function of many important database facilities and applications, including active databases [WC96], data warehousing [HGMW95,IK93,ZGMHW95], view maintenance [GM95], and version and configuration management [HKG94].

For example, consider the World-Wide Web. A user may visit certain (HTML) documents repeatedly and is interested in knowing how each document has changed since the last visit. Assuming we have saved the old version of the document (which many web browsers do already for efficiency), we can detect the changes by comparing the old and new versions of the document. In addition, we are interested in presenting the changes in a meaningful way. For example, a paragraph that has moved could be marked with a ``tombstone'' in its old position and be highlighted in its new position. Similarly, insertions, deletions, and updates could be marked using changes in colors and fonts.

The work on change detection reported in this paper has four key characteristics:

To describe a delta between two versions of hierarchical data, we use the notion of a minimum cost edit script. The minimum cost edit script for two trees is defined using node insert, node delete, node update, and subtree move as the basic editing operations. An interesting feature of our approach is that there is a clean separation of the change detection problem into two subproblems: (1) finding a matching between objects in the two versions, and (2) computing an edit script. If objects have unique identifiers, the first problem is simplified, and we can use this property to achieve a speed-up.

Although a minimum cost edit script is a good formal notion of the delta between two trees, it is not always the most convenient method for displaying or querying deltas. We have developed a second, equivalent representation scheme called a delta tree for this purpose. Due to lack of space, we do not describe delta trees in this paper; they are described in [CRGMW95].

To demonstrate our approach and algorithms, we have implemented a system to detect, mark, and display changes in structured documents, based on their hierarchical structure. Our system, called LaDiff, takes two versions of a Latex document as input and produces as output a Latex document with the changes marked. We have used this system to experimentally evaluate the performance of our algorithms; results are presented in Section 6.2. We have also implemented our algorithms in change detection modules for HTML pages and for a simple nested-object model [PGMW95].

The remainder of the paper is organized as follows. We discuss related work in Section 2. Section 3 describes our general approach, divides our problem into two distinct subproblems, and provides preliminary definitions. Our algorithms for solving the two subproblems are discussed in Sections 4 and 5. Section 6 describes the application of our techniques to hierarchically structured documents and presents our empirical performance study. Conclusions and ongoing work are covered in Section 7. Due to space constraints, several details and proofs of theorems are not presented in this paper, and may be found in [CRGMW95].



next up previous
Next: Related Work Up: Change Detection in Previous: Change Detection in



Sudarshan S. Chawathe
Wed Jun 19 08:22:34 PDT 1996