ACM Computing Surveys 28A(4), December 1996, http://www.acm.org/surveys/1996/WidomIntegrating/. Copyright © 1996 by the Association for Computing Machinery, Inc. See the permissions statement below.
In the research community, most approaches to solving the data integration problem are based very roughly on the following two-step process:
The natural alternative to a lazy approach is an eager or in-advance approach to data integration. In an eager approach:
A lazy approach to integration is appropriate for information that changes rapidly, for clients with unpredictable needs, and for queries that operate over vast amounts of data from very large numbers of information sources (e.g., the World Wide Web). However, the lazy approach may incur inefficiency and delay in query processing, especially when queries are issued multiple times, when information sources are slow, expensive, or periodically unavailable, and when significant processing is required for the translation, filtering, and merging steps. In cases where information sources do not permit ad-hoc queries, the lazy approach is simply not feasible.
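As a rough illustration of the lazy approach and of why repeated queries can be costly, the following minimal sketch contacts every source at query time, then translates and merges the results. The two sources, their record layouts, the common schema, and all function names are hypothetical and chosen only for this example; they do not describe any particular system.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]

# Log of source accesses, kept only to make the cost of each query visible.
CALL_LOG: List[str] = []

# Two hypothetical information sources with different native schemas.
# Each accepts a subquery (here: a city name) and filters at the source.
def personnel_source(city: str) -> List[Record]:
    CALL_LOG.append("personnel")
    rows = [{"emp_name": "Ada", "loc": "Paris"},
            {"emp_name": "Bob", "loc": "Oslo"}]
    return [r for r in rows if r["loc"] == city]

def payroll_source(city: str) -> List[Record]:
    CALL_LOG.append("payroll")
    rows = [{"name": "Ada", "city": "Paris", "salary": 120},
            {"name": "Cy", "city": "Paris", "salary": 90}]
    return [r for r in rows if r["city"] == city]

# Translators map each native record into an assumed common schema.
def translate_personnel(r: Record) -> Record:
    return {"name": r["emp_name"], "city": r["loc"]}

def translate_payroll(r: Record) -> Record:
    return {"name": r["name"], "city": r["city"], "salary": r["salary"]}

SOURCES: List[Tuple[Callable[[str], List[Record]],
                    Callable[[Record], Record]]] = [
    (personnel_source, translate_personnel),
    (payroll_source, translate_payroll),
]

def lazy_query(city: str) -> List[Record]:
    """Answer a query by contacting every source at query time."""
    merged: Dict[object, Record] = {}
    for source, translate in SOURCES:
        for raw in source(city):          # subquery sent to the source
            rec = translate(raw)          # translation step
            # Merging step: records with the same name are combined.
            merged.setdefault(rec["name"], {}).update(rec)
    return sorted(merged.values(), key=lambda r: str(r["name"]))
```

In this sketch, calling `lazy_query("Paris")` twice appends four entries to `CALL_LOG`: every query pays the full cost of accessing, translating, and merging from all sources, even when the answer has not changed.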
In the warehousing (eager) approach, the integrated information is available for immediate querying and analysis by clients. Thus, the warehousing approach is appropriate for: (1) clients requiring specific predictable portions of the available information; (2) clients requiring high query performance but not necessarily over the most recent state of the information; (3) environments in which native applications at the information sources require high performance (large multi-source queries are executed at the warehouse instead); (4) clients wanting access to private copies of the information so that it can be modified, annotated, summarized, and so on; and (5) clients wanting to save information that is not maintained at the sources (such as historical information).
As stated above, in the past the database research community has focused primarily on lazy approaches to integration, although recent work has begun to consider the warehousing approach. What's the answer? Certainly there are scenarios that clearly favor one approach over the other. However, it is our belief that many of the complex, large-scale inter-database applications of the future will require both approaches. The ideal information integration system to handle such applications will be one in which some data is fetched, processed, and integrated in advance and stored in the system's warehouse, while other data is fetched and processed only in response to user queries. Aside from performance and "freshness" considerations, the difference between pre-fetched (eager) and fetched-on-demand (lazy) data should be fully transparent to the client.
Surprisingly, extremely little research to date has considered the numerous technical problems associated with this flexible and general approach to the problem of integrating heterogeneous databases.
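To make the combined approach concrete, here is a minimal sketch of an integrator in which some sources are materialized in advance into a warehouse while others are fetched only on demand, behind a single query interface. The class name `HybridIntegrator`, its methods, and the representation of sources as zero-argument functions are assumptions made for illustration, not a design proposed in this paper.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]

class HybridIntegrator:
    """Hypothetical integrator mixing eager (warehoused) and lazy sources."""

    def __init__(self,
                 sources: Dict[str, Callable[[], List[Record]]],
                 eager: Iterable[str]) -> None:
        self.sources = sources
        self.eager = set(eager)                  # names to pre-fetch
        self.warehouse: Dict[str, List[Record]] = {}
        # Per-source access counter, kept only to show where cost is paid.
        self.fetch_count = {name: 0 for name in sources}

    def _fetch(self, name: str) -> List[Record]:
        """Access one source and tag each record with its origin."""
        self.fetch_count[name] += 1
        return [dict(r, source=name) for r in self.sources[name]()]

    def refresh(self) -> None:
        """Eager step: extract and store the designated sources in advance."""
        for name in self.eager:
            self.warehouse[name] = self._fetch(name)

    def query(self, predicate: Callable[[Record], bool]) -> List[Record]:
        """Answer a query; the client cannot tell which path served it."""
        results: List[Record] = []
        for name in self.sources:
            if name in self.warehouse:
                rows = self.warehouse[name]      # pre-fetched (eager) data
            else:
                rows = self._fetch(name)         # fetched on demand (lazy)
            results.extend(r for r in rows if predicate(r))
        return results
```

With a `catalog` source marked eager and a `stock` source left lazy, one call to `refresh()` followed by two calls to `query(...)` accesses `catalog` once and `stock` twice, yet both queries return results in the same form. The trade-off is the one discussed above: warehoused answers may be stale until the next `refresh()`, while lazy answers reflect the current state of the source at a higher per-query cost.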
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.
widom@db.stanford.edu