next up previous
Next: Method Overview Up: Model and Problem Definition Previous: Edit Operations and Edit

Cost Model

 

Given a pair of trees, there are, in general, several edit scripts that transform one tree to the other. For example, there is the trivial edit script that deletes all the nodes of one tree and then inserts all the nodes of the second tree. There are many other edit scripts that, informally, do more work than seems necessary. Formally, we would like to find an edit script that is ``minimal'' in the sense that it does no more work that what is absolutely required. To this end, we define a cost model for edit operations and edit scripts.

There are two major criteria for choosing a cost model. Firstly, the cost model should accurately capture the domain characteristics of the data being considered. For example, if we are comparing the schematics for two printed-circuit boards, we may prefer an edit script that has as few inserts as possible, and instead describes changes with moves and copies of the old components. However, if we are comparing text documents, we may prefer to see a paragraph as a new insertion, rather than a description of how it was assembled from bits and pieces of sentences from the old document. Secondly, the cost model should be simple to specify, and should require little effort from the user. For example, a cost model that requires the user to specify dozens of parameters is not desirable by this criterion, even though it may accurately model the domain.

Another issue is the trade-off between generality of the cost model and difficulty in computing a minimum-cost edit script. For example, a very general cost model would have a user-specified function to determine the cost of each edit operation, based on the type of the edit operation, as well as the particular nodes on which it operates. However, such a model is not amenable to the design of efficient algorithms for computing the minimum-cost edit script, since it does not permit us to reason about the relative costs of the possible edit operations.

With the above criteria in mind, we propose a simple cost model in which the costs of insertion, deletion, move, copy, and glue operations are given by constants, , , , , and , respectively. Furthermore, given the symmetry between INS and DEL, and CPY and GLU, it is reasonable to use , and . Since, intuitively, a MOV operation causes a smaller change than either CPY or GLU, it is also reasonable to use . Note, however, that our algorithms do not depend on these relationships between the cost parameters. The cost of an update operation depends on the old and new values of the label being updated; that is, , where is the old label of n, and is a domain-dependent function that returns a non-negative real number.

Finally, the cost of an edit script , denoted by , is defined as the sum of the costs of the edit operations in . That is, .

Problem Statement: Given two rooted, labeled trees and , find an edit script such that transforms to a tree that is isomorphic to , and such that for every edit script with this property, .


next up previous
Next: Method Overview Up: Model and Problem Definition Previous: Edit Operations and Edit

Sudarshan S. Chawathe
Sat Feb 22 12:28:02 PST 1997