The goal of using an edge cover is that it should capture the
essential aspects of an edit script; that is, no important information
should be lost in going from an edit script to the edge cover induced
by it. However, there are certain edit scripts for which this
property does not hold. For example, consider an edit script
that inserts a node p as the parent of ten siblings (children of the
same parent)
, then moves p to another location
in the tree, and finally deletes p. The node p is absent from
both the initial tree and the final tree. Therefore, an edge cover of
the initial and final trees contains no record of the temporary
insertion of node p. Thus, we have lost some information in going
from
to the edge cover.
Is the fact that our edge covers cannot capture edit scripts like
a problem? On the one hand,
could be the minimum
cost edit script MH-DIFF is trying to find. For example, say that
insert, delete, and move operations all cost one unit. The cost of
would then be the cost of one insert, plus the cost of one
move, plus the cost of one delete, for a total cost of 3. If we do
not use the ``bulk move trick'' that
uses, we need to move
each of
individually, for a cost of 10. Thus,
could be the minimum cost edit script, and if we rule it
out, then MH-DIFF would miss it.
On the other hand, scripts like do not represent
transformations that are meaningful or intuitive to an end user. In
other words, if a user saw
, he would not understand why
node p was inserted, since it really has no function in his
application. True, the costs provided by the user are intended to
describe the desirability of edit operations, but if we abuse these
numbers we can end up with ``tricky'' scripts like
that are
more confusing than helpful.
Another example of a potentially unintuitive edit script is the
following: Consider an edit script that moves a node
to become a child of another node
, then makes several copies of
the subtree rooted at
(thus making copies of
as well), and
finally deletes the original copy of
. This edit script moves
to a place where it does not need to be (under
) only to
generate free copies of
.
The cause of the unintuitive nature of the edit scripts described
above is an interaction between different edit operations, which gives
rise to a ``compound'' effect. For example, in the edit script
above, the effect of the move operation is compounded
because it acts on a node that was previously inserted. Similarly, in
edit script
above, the effects of the copy operations are
compounded because they act on a subtree into which a node was
previously moved. Our approach is to disallow such unintuitive
compound effects.
A simple way of characterizing edit scripts that disallow undesirable compound effects is to require edit operations to occur in phases, and to order the phases appropriately. In the following discussion, we use the names INS, DEL, etc. to denote phases consisting of, respectively, INS operations, DEL operations, etc. First, we require that the INS phase occur after the DEL phase, so that an edit script cannot first insert a node and then delete it. Next, we require the other edit phases (UPD, MOV, CPY, and GLU) to occur after the DEL phase (so that nodes operated on by these phases cannot be later deleted), and before the INS phase (so that inserted nodes cannot be operated on by these phases). Furthermore, we require that the UPD (respectively, MOV) phase occur after the CPY phase and before the GLU phase, so that an edit script cannot compound the effect of an UPD (respectively, MOV) operation by copying the updated node (and similarly for glues). These ordering constraints yield the following order of edit phases: DEL, CPY, UPD, MOV, GLU, INS. (We chose the relative order of the UPD and MOV phases arbitrarily.) One additional restriction, not covered by the above ordering constraint, is the following: A node in a subtree operated on by a CPY operation cannot be operated on by a GLU operation. We call edit scripts that satisfy these restrictions structured edit scripts. In the sequel, we consider only structured edit scripts. Structured edit scripts have the following important property that allows us to consider only minimal edge covers in the sequel. (A minimal edge cover is an edge cover that is not a proper superset of any edge cover.)
The reader may observe that, in addition to disallowing unintuitive compound effects, the above restrictions also disallow some intuitive sequences of operations. For example, a structured edit script cannot delete a node produced as a result of a CPY operation. Therefore, a structured edit script cannot copy a subtree containing 100 nodes if 99 of them are needed, because it would be unable to delete the unwanted copy of the 100th node. An analogous situation exists for INS and GLU operations. Our algorithms [&make_named_href('', "node21.html#bbdiffExtended","[CGM97]")] actually do permit such deletions (called ghost deletions) after copies, and insertions (called ghost insertions) before glues. For similar reasons, we also permit certain move operations to occur before the CPY phase. Furthermore, we allow a move or copy operation to a destination that is currently unavailable (e.g., because it is produced by a copy operation) to be ``paused'' until the destination becomes available. Lemma 4.1 remains true under these weaker restrictions.
We now describe how, given a minimal edge cover K of the graph
induced by trees and
, we compute a minimum-cost edit
script corresponding to this edge cover. As explained in
Section 3, we also represent the edit operations of
such an edit script as annotations on the affected edges. Due to
space constraints, we do not present the full details of our algorithm
CtoS (cover-to-script) in this paper, and present instead a
brief explanation of the basic ideas behind the algorithm. The
detailed algorithm is presented in [&make_named_href('',
"node21.html#bbdiffExtended","[CGM97]")].
The algorithm proceeds in phases that roughly reflect the phases of a
structured edit script described above. We refer to edges belonging
to the given edge cover K as K-edges. We say two nodes are
matched to each other if there is a K-edge connecting them.
The first phase of the algorithms is the delete phase, in which we
generate an edit operation for each node m that is matched
to the special node
. We claim that any edit script that
matches m to
must contain this
operation, due to
the following observations: Firstly, any node matched to
is
absent from the final tree. Furthermore, there are only two
ways in which a node can be made to disappear: either it is deleted
explicitly, or it is glued to some other node. (We use here the fact
that structured edit scripts cannot first glue a node to another and
then delete the second node.) However, the second method will not
result in m matching
in the edge cover induced by the
script; instead, m will match the node to which it was glued.
Therefore we can safely produce a
operation for all such
nodes m.
The next phase of the algorithm handles copy operations. In
particular, it looks for sets two or more of K-edges incident on a
common node . Note that from Lemma 4.1, and
the observation that minimal edge covers cannot contain any path of
length three, it follows that if e = [m, n] is such an edge, there
can be no other K-edge incident on n. We call such a set of edges a
flower with base m. This set of edges represents copies of the node
m. However, as we have seen in Section 3, some of
the copies of m could be produced as a result of some ancestor of
m being copied. We call such copies free copies of m. Our
algorithm considers flowers in preorder of the base nodes. As copy
operations are generated for some node m, we also keep track of the
number of free copies of nodes in the copied subtree. Knowing the
number of available free copies allows us to determine exactly which
flowers correspond to explicit copy operations and which correspond to
implicit (free) copies. Furthermore, any unused free copies are nodes
that need to be deleted after the copy operation is performed. These
are the ghost deletions we introduced above. Finally, note that a
free copy may need to be moved to its final location; this situation
is easily detected by checking whether the parents of the affected
nodes match.
The update phase of the algorithm is straightforward, and produces an update operation for each edge [m,n] such that the labels of m and n differ. Since we are considering only structured edit scripts, there is no way to avoid such an update; in particular, ``tricks'' like updating a node and then copying it are disallowed. The glue and delete phases of the algorithm are analogous to the copy and insert phases, respectively. The details are in [&make_named_href('', "node21.html#bbdiffExtended","[CGM97]")].