In this section we describe how MH-DIFF finds a minimal edge cover of the induced graph. The resulting cover will serve as input to algorithm CtoS (Section 4). Our goal is to find not just any minimal edge cover, but one that corresponds to a minimum-cost edit script. Let us call such an minimal edge cover the target cover.
Consider an edge e in our pruned induced graph. To get to the target cover, MH-DIFF must decide whether e should be included in the cover. To reach this decision, it would be nice if MH-DIFF knew the ``cost'' of e. That is, if e remains in the target cover, then it would be annotated (by algorithm CtoS) with some operation, and we could say that the cost of this operation is the cost of e. Unfortunately, we have a ``chicken and the egg problem'' here: CtoS cannot run until we have the target cover, and we cannot get the target cover until we know the costs it will imply. To break the impasse, our approach uses the following idea:
Instead of trying to compute the actual cost of e, we compute an upper and lower bound to this cost. These bounds can be computed without the knowledge of which other edges are included in the target cover, and serve two purposes: Firstly, they allow us to design pruning rules that are used to conservatively eliminate unnecessary edges from the induced graph. Secondly, after pruning, the bounds can guide our search for the target cover.
As an enhancement, we actually use a variation on the edge cost
suggested above. The following example shows that simply ``charging''
each annotation to the edge it is on is not entirely ``fair.'' We are
given a tree containing two nodes,
and
with the same
label l. Furthermore
has children
and
with
labels a and b, respectively, and
has children
and
with labels c and d, respectively. Suppose
is a
logical copy of
. (That is,
and
are isomorphic.)
Consider an edge cover that matches each node in
to its copy in
except that it ``cross matches''
and
across the
trees, as shown in Figure 8. Given this edge cover,
algorithm CtoS will produce a move operation for each of
the nodes
,
,
, and
. However, these
move operations were caused not by any mismatching of the nodes
,
,
, or
, but instead, by the
mismatching of
and
. Therefore it would be intuitively
more fair to charge these move operations to the edges responsible for
the mismatch, viz.
and
. To
achieve this, we use the following scheme: If e is annotated with
INS, DEL, or UPD in the target cover, we do charge e for this
operation. However, if e is annotated by MOV, CPY, or GLU, then
the parent of e, and not e is charged. We call the edge
costs computed in such a fashion fair costs, and define them
below:
Figure 8: Distributing edge costs fairly