In this section, we formulate the change detection problem and split it into two subproblems that are discussed in later sections. We first introduce these problems informally using an example, and then present the formal definitions and terms used in the rest of the paper.
Hierarchically structured information can be represented as
ordered trees---trees in which the children of each node have a
designated order. We address our problem of detecting and
representing changes in the context of such trees. (Hereafter, when
we use the term ``tree'' we mean an ordered tree.) We consider trees
in which each node has a label and a value. We also assume
that each tree node has a unique identifier; identifiers may be
generated by our algorithms when they are not provided in the data
itself. Note that the nodes that represent the same real-world entity
in different versions may not have the same identifier. We refer to
the node with identifier x as ``node x'' for conciseness.
As a running example, consider trees and
shown in
Figure 1, and ignore the dashed lines for the
moment. The number inside each node is the node's identifier and the
letter beside each node is its label. All of the interior nodes have
null values, not shown. Leaf nodes have the values indicated in
parentheses. (These trees could represent two structured documents,
where the labels D, P, and S denote Document,
Paragraph, and Sentence, respectively. The values of the
sentence nodes are the sentences themselves.) We are interested in
finding the delta between these two trees. We will assume that
represents the ``old'' data and
the ``new'' data, so we want to
determine an appropriate transformation from tree
to tree
.
Our first task in finding such a transformation is to determine nodes
in the two trees that correspond to one another. Intuitively, these
are nodes that either remain unchanged or have their value updated in
the transformation from to
(rather than, say, deleting the
old node and inserting a new one). For example, node 5 in
has
the same value as node 15 in
, so nodes 5 and 15 should probably
correspond. Similarly, nodes 4 and 13 have one child node each, and
the child nodes have the same value, so nodes 4 and 13 should probably
correspond. The notion of a correspondence between nodes that have
identical or similar values is formalized as a
matching between node identifiers. Matchings are one-to-one.
We say that a matching is partial if only some nodes in the two
trees participate, while a matching is total if all nodes
participate. Hereafter, we use the term ``matching'' to mean a
partial matching unless stated otherwise.
Hence, one of our problems is to find an appropriate matching for the trees we are comparing. We call this problem the Good Matching problem. In some application domains the Good Matching problem is easy, such as when data objects contain object identifiers or unique keys. In other domains, such as structured documents, the matching is based on labels and values only, so the Good Matching problem is more difficult. Furthermore, not only do we want to match nodes that are identical (with respect to the labels and values of the nodes and their children), but we also want to match nodes that are ``approximately equal.'' For instance, node 3 in Figure 1 probably should match node 14 even though node 3 is missing one of the children of 14. Details of the Good Matching problem---including what constitutes a ``good'' matching---are addressed in Section 5. A matching for our running example is illustrated by the dashed lines in Figure 1.
We say that two trees are isomorphic if they are identical
except for node identifiers. For trees and
, once we have
found a good (partial) matching M, our next step is to find a
sequence of ``change operations'' that transforms tree
into a
tree
that is isomorphic to
. Changes may include
inserting (leaf) nodes, deleting nodes, updating the values of nodes,
and moving nodes along with their subtrees. Intuitively, as
is
transformed into
, the partial matching M is extended into a
total matching
between the nodes of
and
. The total
matching
then defines the isomorphism between trees
and
. We call the sequence of change operations an edit
script, and we say that the edit script conforms to the
original matching M provided that
. (As will be
seen, an edit script conforms to partial matching M as long as the
script does not insert or delete nodes participating in M.) Edit
scripts are defined in more detail shortly.
We would like our edit script to transform tree as little as
possible in order to obtain a tree isomorphic to
. To capture
minimality of transformations, we introduce the notion of the
cost of an edit script, and we look for a script of minimum cost.
Thus, our second main problem is the problem of finding such a minimum
cost edit script; we refer to this as the Minimum Conforming
Edit Script (MCES) problem. The remainder of this section formally
defines edit operations and edit scripts. Our algorithm for the MCES
problem is presented in Section 4, and
Section 5 presents our algorithm for the Good
Matching problem. Note that we consider the MCES problem before the
Good Matching problem, despite the fact that our method requires
finding a matching before generating an edit script. As will be seen,
the definition of a good matching relies on certain aspects of edit
scripts, so for presentation purposes we consider the details of our
edit script algorithms first.