Using Edge Covers to Generate Edit Scripts

Next: Finding the Edge Cover Up: Edge Covers and Edit Previous: Edge Cover Induced by

Using Edge Covers to Generate Edit Scripts

The goal of using an edge cover is that it should capture the essential aspects of an edit script; that is, no important information should be lost in going from an edit script to the edge cover induced by it. However, there are certain edit scripts for which this property does not hold. For example, consider an edit script that inserts a node p as the parent of ten siblings (children of the same parent) , then moves p to another location in the tree, and finally deletes p. The node p is absent from both the initial tree and the final tree. Therefore, an edge cover of the initial and final trees contains no record of the temporary insertion of node p. Thus, we have lost some information in going from to the edge cover.

Is the fact that our edge covers cannot capture edit scripts like a problem? On the one hand, could be the minimum cost edit script MH-DIFF is trying to find. For example, say that insert, delete, and move operations all cost one unit. The cost of would then be the cost of one insert, plus the cost of one move, plus the cost of one delete, for a total cost of 3. If we do not use the ``bulk move trick'' that uses, we need to move each of individually, for a cost of 10. Thus, could be the minimum cost edit script, and if we rule it out, then MH-DIFF would miss it.

On the other hand, scripts like do not represent transformations that are meaningful or intuitive to an end user. In other words, if a user saw , he would not understand why node p was inserted, since it really has no function in his application. True, the costs provided by the user are intended to describe the desirability of edit operations, but if we abuse these numbers we can end up with ``tricky'' scripts like that are more confusing than helpful.

Another example of a potentially unintuitive edit script is the following: Consider an edit script that moves a node to become a child of another node , then makes several copies of the subtree rooted at (thus making copies of as well), and finally deletes the original copy of . This edit script moves to a place where it does not need to be (under ) only to generate free copies of .

The cause of the unintuitive nature of the edit scripts described above is an interaction between different edit operations, which gives rise to a ``compound'' effect. For example, in the edit script above, the effect of the move operation is compounded because it acts on a node that was previously inserted. Similarly, in edit script above, the effects of the copy operations are compounded because they act on a subtree into which a node was previously moved. Our approach is to disallow such unintuitive compound effects.

A simple way of characterizing edit scripts that disallow undesirable compound effects is to require edit operations to occur in phases, and to order the phases appropriately. In the following discussion, we use the names INS, DEL, etc. to denote phases consisting of, respectively, INS operations, DEL operations, etc. First, we require that the INS phase occur after the DEL phase, so that an edit script cannot first insert a node and then delete it. Next, we require the other edit phases (UPD, MOV, CPY, and GLU) to occur after the DEL phase (so that nodes operated on by these phases cannot be later deleted), and before the INS phase (so that inserted nodes cannot be operated on by these phases). Furthermore, we require that the UPD (respectively, MOV) phase occur after the CPY phase and before the GLU phase, so that an edit script cannot compound the effect of an UPD (respectively, MOV) operation by copying the updated node (and similarly for glues). These ordering constraints yield the following order of edit phases: DEL, CPY, UPD, MOV, GLU, INS. (We chose the relative order of the UPD and MOV phases arbitrarily.) One additional restriction, not covered by the above ordering constraint, is the following: A node in a subtree operated on by a CPY operation cannot be operated on by a GLU operation. We call edit scripts that satisfy these restrictions structured edit scripts. In the sequel, we consider only structured edit scripts. Structured edit scripts have the following important property that allows us to consider only minimal edge covers in the sequel. (A minimal edge cover is an edge cover that is not a proper superset of any edge cover.)

The reader may observe that, in addition to disallowing unintuitive compound effects, the above restrictions also disallow some intuitive sequences of operations. For example, a structured edit script cannot delete a node produced as a result of a CPY operation. Therefore, a structured edit script cannot copy a subtree containing 100 nodes if 99 of them are needed, because it would be unable to delete the unwanted copy of the 100th node. An analogous situation exists for INS and GLU operations. Our algorithms [&make_named_href('', "node21.html#bbdiffExtended","[CGM97]")] actually do permit such deletions (called ghost deletions) after copies, and insertions (called ghost insertions) before glues. For similar reasons, we also permit certain move operations to occur before the CPY phase. Furthermore, we allow a move or copy operation to a destination that is currently unavailable (e.g., because it is produced by a copy operation) to be ``paused'' until the destination becomes available. Lemma 4.1 remains true under these weaker restrictions.

We now describe how, given a minimal edge cover K of the graph induced by trees and , we compute a minimum-cost edit script corresponding to this edge cover. As explained in Section 3, we also represent the edit operations of such an edit script as annotations on the affected edges. Due to space constraints, we do not present the full details of our algorithm CtoS (cover-to-script) in this paper, and present instead a brief explanation of the basic ideas behind the algorithm. The detailed algorithm is presented in [&make_named_href('', "node21.html#bbdiffExtended","[CGM97]")].

The algorithm proceeds in phases that roughly reflect the phases of a structured edit script described above. We refer to edges belonging to the given edge cover K as K-edges. We say two nodes are matched to each other if there is a K-edge connecting them. The first phase of the algorithms is the delete phase, in which we generate an edit operation for each node m that is matched to the special node . We claim that any edit script that matches m to must contain this operation, due to the following observations: Firstly, any node matched to is absent from the final tree. Furthermore, there are only two ways in which a node can be made to disappear: either it is deleted explicitly, or it is glued to some other node. (We use here the fact that structured edit scripts cannot first glue a node to another and then delete the second node.) However, the second method will not result in m matching in the edge cover induced by the script; instead, m will match the node to which it was glued. Therefore we can safely produce a operation for all such nodes m.

The next phase of the algorithm handles copy operations. In particular, it looks for sets two or more of K-edges incident on a common node . Note that from Lemma 4.1, and the observation that minimal edge covers cannot contain any path of length three, it follows that if e = [m, n] is such an edge, there can be no other K-edge incident on n. We call such a set of edges a flower with base m. This set of edges represents copies of the node m. However, as we have seen in Section 3, some of the copies of m could be produced as a result of some ancestor of m being copied. We call such copies free copies of m. Our algorithm considers flowers in preorder of the base nodes. As copy operations are generated for some node m, we also keep track of the number of free copies of nodes in the copied subtree. Knowing the number of available free copies allows us to determine exactly which flowers correspond to explicit copy operations and which correspond to implicit (free) copies. Furthermore, any unused free copies are nodes that need to be deleted after the copy operation is performed. These are the ghost deletions we introduced above. Finally, note that a free copy may need to be moved to its final location; this situation is easily detected by checking whether the parents of the affected nodes match.

The update phase of the algorithm is straightforward, and produces an update operation for each edge [m,n] such that the labels of m and n differ. Since we are considering only structured edit scripts, there is no way to avoid such an update; in particular, ``tricks'' like updating a node and then copying it are disallowed. The glue and delete phases of the algorithm are analogous to the copy and insert phases, respectively. The details are in [&make_named_href('', "node21.html#bbdiffExtended","[CGM97]")].

Next: Finding the Edge Cover Up: Edge Covers and Edit Previous: Edge Cover Induced by

Sudarshan S. Chawathe
Sat Feb 22 12:28:02 PST 1997