Implementation

Next: Empirical Evaluation of Up: Implementation and Performance Previous: Implementation and Performance

Implementation

In the following description, we focus on Latex documents, but the implementation also handles other kinds of structured documents (e.g., HTML). LaDiff takes as input two files containing the old and new versions of a Latex document. These files are first parsed to produce their tree representations (the old tree and new tree, respectively). Currently, we parse a subset of Latex consisting of sentences, paragraphs, subsections, sections, lists, items, and document. It is easy to extend our parser to handle a larger subset of Latex , and we plan to do so in the future. Next, the edit script and delta tree are computed using the algorithms of Sections 4--5. Our program takes the match threshold t (Section 5) as a parameter. Our comparison function for leaf nodes---which are sentences---first computes the LCS (recall Section 4.2) of the words in the sentences, then counts the number of words not in the LCS. Interior nodes (paragraphs, items, sections, etc.) are compared as described in Section 5. Finally, a preorder traversal of the delta tree is performed to produce an output Latex \ document with annotations describing the changes. Our implementation uses a modified version of the LCS algorithm from [Mye86]. Note that we cannot use the LCS algorithm used by the standard UNIX diff program, because it requires inequality comparisons in addition to equality comparisons.

Sudarshan S. Chawathe
Wed Jun 19 08:22:34 PDT 1996