Mediation to Deal with Heterogeneous Data Sources.
The objective of interoperation is to increase the value of information when information from multiple sources is accessed, related, and combined. However, care is required to realize this benefit. One problem to be addressed in this context is that a simple integration over the ever-expanding number of resources available on-line leads to what customers perceive as information overload. In actuality, the customers experience data overload, making it nearly impossible for them to extract relevant points of information out of a huge haystack of data.
Information should support the making of decisions and actions. We distinguish interoperation of information from integration of data and databases, since we do not expect to combine the sources, but only selected results derived from them [Kim:95]. If much of the data obtained from the sources is materialized, then the integration of information overlaps with the topic of data warehousing [Kimball:96]. In the interoperation paradigm we prefer that merging be performed as the need arises, relying on articulation points that have been found and defined earlier [ColletHS:91]. If the base sources are transient, a warehouse can provide a suitable persistent resource.
Interoperation requires knowledge and intelligence, but substantially increases the value to the consumer. For instance, domain knowledge which combines merchant ship data with trucking and railroad information permits a customer to analyze and plan multi-modal shipping. Interoperating over multiple, distinct information domains, such as shipping, cost-of-money, and weather, requires broader knowledge, but will further improve the value of the information. Consider here the manager who deals with delivery of goods, who must combine information about shipping, the cost of inventory that is delayed, and the effects of weather on possible delays. This knowledge is tied to the customer's task model, which provides an intersecting context over several source domains.
The required value-added service tasks, such as selection of relevant and high-quality data, matching of source data, creating fused data objects, summarizing, and abstracting, fall outside of the capabilities of the sources, and are costly to implement in individual applications. The provision of such services requires an architecture for computing systems that recognizes their intermediate functionality. In such an architecture mediating services create an opportunity for novel on-line business ventures, which will replace the traditional services provided by consultants, analysts, and publishers.
We define the architecture of a software system to be the partitioning of a system into major pieces or modules. Modules will have independent software operation, and are likely located on distinct, but networked hardware as well. Criteria for partitioning are technical and social. The prime technical criterion is having a modest bandwidth requirement across the interfaces among the modules. The prime social criterion is having a well-defined domain for management, with local authority and responsibilities. Luckily, these two criteria often match.
It is now obvious that building a single, integrated system for any substantial enterprise, encompassing all possible source domains and knowledge about them is an impossible task. Even abstract modeling of a single enterprise in sufficient detail has been frustrating. When such proposals were made in the past, the scope of information processing in an enterprise was poorly understood, and data-processing often focused on financial management. Modern enterprises use a mix of public market and service information in concert with their own data. Many have also delegated data-processing, together with profit-and-loss responsibilities, to smaller units within their organizations. An integrated system warehousing all the diverse sources would not be maintainable. Each single source, even if quite stable, will still change its structure every few years, as capabilities and environments change, companies merge, and new rate-structures develop.
Integrating hundreds of such sources is futile.
Figure 1: A Client-Server Architecture.
Today, a popular architecture is represented by client-server systems (Figure 1). Simple middleware, such as CORBA and COM [HelalB:95], provides communication between the two layers. However, these 2-layer systems do not scale well as the number of available services grows. While assembly of a new client is easy if all the required services exist, a major maintenance problem arises if any change is needed in an existing service to accommodate the new client. First of all, all other clients have to be inspected to see if they use any of the services being updated, and those that do have to be updated when the service changes, in perfect synchrony. Scheduling the change-over to a date that is suitable for all the affected clients induces delays. Those delays in turn cause other update needs to arise, and these will have to be inserted on that same day. The changeover becomes a major event, costly and risky.
Hence, dealing with many, say hundreds, of data servers entails constant changes. A client-server architecture of that size is likely never to be able to serve the customers. To make such large systems work, an architectural alternative is required. We will see that changes can be gradually accommodated in a mediated architecture, as a result of an improved assignment of functions.
Figure 2: A Mediated Architecture
1.1 Mediator Architecture
The mediator architecture envisages a partitioning of resources and services in two dimensions, as shown in Figure 2 [Wiederhold:92]:
The modules in the various layers will contribute data and information to each other, but they will not be strictly matched (i.e., not be stovepiped). The vertical partitioning in the mediating layer is based on having expertise in a service domain, and within that layer modules may call on each other. For instance, logistics expertise, such as knowledge about merchant shippers, will be kept in a single mediating module, and a superior mediating module dealing with shared concepts about transportation will integrate ship, trucking, and railroad information. At the client layer several distinct domains, such as weather and cost of shipping, will be brought together. These domains do not have commensurate metrics, so that a service layer cannot provide reliable interoperation (Figure 3). The client layer and, in it, the logistics customer, has to weigh the combination and make the final decision to balance costs and risks. Similarly, a farmer may combine harvest and weather information. Moving the vagueness of combining information from dissimilar domains to the client layer reduces the overall complexity of the system.
Figure 3: Formal and Pragmatic Interoperation.
1.2 Task assignment
In a 2-layer client-server architecture all functions had to be assigned either to the server or to the client modules. The current debates on thin versus fat clients and servers illustrate that the alternatives are not clear, even though some function assignments are obvious. With a third, intermediate layer, which mediates between the users and the sources, many functions, particularly those that add value and require maintenance to retain that value, can be assigned there. We will review those assignments now.
Server: Selection of data is a function which is best performed at the server since one does not want to ship large amounts of unneeded data to the client or the mediator. The effectiveness of the SELECT statement of SQL is evidence of that assignment; not many languages can make do with one verb for most of their functionality. Making those data accessible may require a wrapper at or near the server, so that access can be performed using standard interfaces.
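The assignment of selection to the server can be illustrated with a minimal sketch. It assumes a hypothetical `shipments` table held in an in-memory SQLite database; the schema, the values, and the function names are illustrative only and do not come from any system described here.

```python
import sqlite3

# Hypothetical source table; schema and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (id INTEGER, port TEXT, tons REAL)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?, ?)",
    [(1, "Oakland", 120.0), (2, "Rotterdam", 80.0), (3, "Oakland", 45.5)],
)

def select_at_server(conn, port):
    """Let the server filter: only the matching rows leave the source."""
    cur = conn.execute(
        "SELECT id, tons FROM shipments WHERE port = ? ORDER BY id", (port,)
    )
    return cur.fetchall()

def select_at_client(conn, port):
    """Ship every row, then filter: same answer, but more data is moved."""
    rows = conn.execute(
        "SELECT id, port, tons FROM shipments ORDER BY id"
    ).fetchall()
    return [(i, t) for (i, p, t) in rows if p == port]

oakland = select_at_server(conn, "Oakland")
```

Both functions return the same rows; the difference is only in how much data crosses the interface, which is the quantity the server assignment is meant to minimize.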
Client: Interaction with the user is an obvious function for the clients. Local response must be rapid and reliable. Adaptation to the wide variety of local devices is best understood and maintained locally. For instance, moving from displays and keyboards to voice output and gesture input requires local feedback. Images and maps may have to be scaled to suit local displays. When maps are scaled, the labeling has to be adjusted [AonumiIK:89].
Mediator: Functions such as the integration of data from multiple servers, and the transformation of those data to information that is effective for the client program, are suitable for assignment neither to a server nor to a client. Requiring that any server can interoperate with any other possibly relevant server imposes requirements that are hard to establish and impossible to maintain. The resulting $n^2$ complexity is obvious. Similarly, requiring that servers can prepare views for any client is also onerous; in practice the load of adaptation would fall on the client. To resolve this issue of assignment for interoperation we define an intermediate layer, and establish modules in that layer, which will be referred to as mediators. The next section will deal with such modules, and focus on some of the roles that arise in geographic-based processing.
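The $n^2$ argument can be made concrete with a small count. The sketch below assumes that direct interoperation needs one directed translator per ordered pair of servers, and that a mediated arrangement needs one wrapper per server; both counting rules are simplifying assumptions, not measurements.

```python
def pairwise_adapters(n: int) -> int:
    """Directed translators needed if every server must understand every other."""
    return n * (n - 1)

def mediated_adapters(n: int) -> int:
    """Assumed cost with a mediating layer: one wrapper per server."""
    return n

# Compare the two growth rates for a few system sizes.
growth = {n: (pairwise_adapters(n), mediated_adapters(n)) for n in (5, 50, 500)}
```

Under these assumptions the number of interfaces to maintain grows quadratically in the first case and linearly in the second.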
Interoperation with the diversity of available sources requires a variety of functions. The mediator architecture has to accommodate multiple types of modules, and allow them to be combined as required. For instance, facilitators will search for likely resources and ways to access them [WiederholdG:97]. To serve interoperation, related information that is relevant for the domain has to be selected and acquired from multiple sources. Query processors will reformulate an initial query to enhance the chance of obtaining relevant data [ArensKS:96, ChuQ:94]. Text associated with images can be processed to yield additional keys [GugliemoR:96]. Selection then obtains potentially useful data from the sources, and has to balance relevance with cost of moving the data to the mediator. After selection, further processing is needed for integration and making the results relevant to the client. In this exposition we will focus on issues that relate to spatial information and focus on two topics, integration and transformation. The references given can be used to explore other areas.
Selection from multiple sources will obtain data that is redundant, mismatched, and contains excessive detail. Web searches today demonstrate these weaknesses; they focus on breadth of selection and leave the extraction of useful information to the user.
Omitting redundancy: When information is obtained from a broad selection of sources, as on the web, redundancy is unavoidable. But since sources often represent data in their own formats, omitting overlaps has to be based on similarity assessment, rather than on exact matches [GarciaGS:96]. When geographic regions overlap, the sources that are most relevant to the customer in terms of content and detail are best. Assessing the similarity of images requires new technologies; wavelets appear to be promising [ChangLW:99].
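A minimal sketch of similarity-based omission for textual records follows. It uses the standard-library `difflib.SequenceMatcher` as a stand-in similarity measure; the 0.85 threshold and the sample titles are arbitrary assumptions, and the cited work uses other techniques.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Case-insensitive similarity test; the threshold is an assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def omit_redundant(records, threshold=0.85):
    """Keep a record only if it is not similar to one already kept."""
    kept = []
    for rec in records:
        if not any(similar(rec, k, threshold) for k in kept):
            kept.append(rec)
    return kept

# Illustrative input: the first two entries differ only in case and punctuation.
titles = [
    "Port of Oakland container terminal",
    "Port of Oakland Container Terminal.",
    "Rotterdam harbour rail link",
]
unique = omit_redundant(titles)
```

The comparison is quadratic in the number of records kept, which is acceptable for a sketch but would need indexing for large result sets.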
Quality of data is a complementary issue. A mediator may have rules such as 'Source A is preferable over Source B', or 'more recent data are better', but sometimes differences among the data values obtained cannot be resolved at the mediating level, because the metrics for judgement are absent. If the differences are significant, both results, identified with their sources, can be reported to the client [AgarwalKSW:95].
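The two kinds of rules, and the fallback of reporting both results, can be sketched as follows. The `Reading` type, the rank table, and the 5% significance tolerance are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Reading:
    source: str
    value: float
    as_of: date

def resolve(r1, r2, rank, tolerance=0.05):
    """Return the readings to report to the client.

    rank maps source names to a preference (lower is better); it is a
    hypothetical stand-in for the mediator's quality rules.
    """
    ra, rb = rank.get(r1.source), rank.get(r2.source)
    if ra is not None and rb is not None and ra != rb:
        return [r1 if ra < rb else r2]              # source preference rule
    if r1.as_of != r2.as_of:
        return [r1 if r1.as_of > r2.as_of else r2]  # recency rule
    scale = max(abs(r1.value), abs(r2.value), 1e-12)
    if abs(r1.value - r2.value) / scale > tolerance:
        return [r1, r2]     # no rule applies and the difference is significant
    return [r1]             # no rule applies, difference is negligible

d1, d2 = date(1997, 1, 1), date(1997, 6, 1)
preferred = resolve(Reading("A", 10.0, d1), Reading("B", 12.0, d1), {"A": 0, "B": 1})
newest = resolve(Reading("X", 10.0, d1), Reading("Y", 10.0, d2), {})
both = resolve(Reading("X", 10.0, d1), Reading("Y", 20.0, d1), {})
```

In the last case each reading still carries its `source` field, so the client can see where the conflicting values came from.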
Matching: Integration of information requires matching of articulation points, the identifiers that are used to link entities from distinct sources. Matching of data from sources is based mainly on terms and measures. We now have to link complementary information, say text and maps. When sources use differing terminologies we need ontological tools to find matching points for their articulation [ColletHS:91].
While articulation of textual information is based on matching of abstract terms, when systems need to exchange actual goods and services, physical proximity is paramount. This means that for problems in logistics, in military planning, in service delivery, and in responding to natural disasters, geographic markers are of prime importance.
Georeferencing: Unfortunately, the representation of geographic fiducial points varies greatly among sources and their representations. We commonly use names to denote geographic entities, but the naming differs among contexts. Even names of major entities, such as countries, differ among respected resources. While the U.N. web pages refer to "The Gambia", most other sources call the country simply "Gambia". If we include temporal variations then the names of the components of the former USSR and Yugoslavia induce more complexity. Based on current sources we would not be able to find in which country the 1984 Winter Olympics were held [JanninkEa:98]. When native representations use differing alphabets another level of complexity ensues.
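One simple way a mediator can articulate sources that name the same place differently is an alias table. The sketch below is only that: the table contents, the function names, and the sample records are hypothetical, and it ignores the temporal and alphabet problems mentioned above.

```python
# Hypothetical alias table maintained in the mediator; entries are illustrative.
ALIASES = {
    "the gambia": "gambia",
}

def canonical(name: str) -> str:
    """Normalize case and whitespace, then map known variants to one key."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, key)

def match_by_place(source_a, source_b):
    """Join two lists of (place_name, payload) pairs on the canonical name."""
    index = {canonical(name): payload for name, payload in source_a}
    return [
        (canonical(name), index[canonical(name)], payload)
        for name, payload in source_b
        if canonical(name) in index
    ]

pairs = match_by_place(
    [("The Gambia", "record from source A")],
    [("Gambia", "record from source B"), ("Senegal", "record from source B")],
)
```

Keeping the alias table in one mediating module means that a renamed entity requires one update there, rather than one in every client.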
The problems get worse at finer granularity. Names and extents of towns and roads change over time, making global information unreliable. For delivery of goods to a specific loading dock at a warehouse, local knowledge becomes essential. Such local knowledge must be delegated to the lowest level in the system to allow responsive maintenance and flexibility. In modern delivery systems, such as those used by the Federal Express delivery service, the driver makes the final judgement and records the location as well as the recipient.
Using latitude and longitude can provide a common underpinning. The wide availability of GPS has popularized this representation. While commercial GPS is limited to about 100 m precision, the increasing capabilities of ground-based emitters (pseudolites), used in combination with space-based transmitters, can conveniently increase the precision to a meter, allowing, for instance, the matching of trucks to loading gates. The translations required to move from geographical named areas and points to areas described by vertices are now well understood, although they remain sufficiently complex that mediators are required to relieve clients of performing such transformations.
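Matching by proximity on latitude and longitude can be sketched with the haversine great-circle distance. The spherical Earth radius is an approximation, and the gate names and coordinates are invented for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean radius; spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearest_gate(truck, gates):
    """Return the name of the gate closest to the truck position."""
    return min(gates, key=lambda g: haversine_m(truck[0], truck[1], *gates[g]))

# Hypothetical loading gates roughly 90 m apart, and one truck position.
gates = {"gate-1": (37.8000, -122.3000), "gate-2": (37.8000, -122.2990)}
truck = (37.80001, -122.29905)
chosen = nearest_gate(truck, gates)
```

With gates this close together, a position error of 100 m could select the wrong gate, which is why the meter-level precision mentioned above matters for this use.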
Matching interacts with selection, so that the integration process is not a simple pipeline.
The initial data selection must balance breadth of retrieval with cost of access and transmission. After matching, retrieval of further articulated data can ensue. To access ancillary geographic sources, the names or spatial parameters that serve as keys must be used. When areas are to be located, circumscribing boxes must be defined so that all possibly relevant material is included, and the result must then be filtered locally [GaedeG:98]. Again, many of these techniques are well understood, but require the right architectural setting to become available as services to a larger user population [DolinAAD:97].
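The two-step retrieval described here, a coarse selection by circumscribing box followed by an exact local filter, can be sketched in planar coordinates. The ray-casting test is a standard technique chosen for this sketch; points lying exactly on an edge are not treated specially.

```python
def bounding_box(polygon):
    """Smallest axis-aligned box (x0, y0, x1, y1) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def in_box(pt, box):
    x, y = pt
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def in_polygon(pt, polygon):
    """Ray casting: count edge crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        xa, ya = polygon[i]
        xb, yb = polygon[(i + 1) % n]
        if (ya > y) != (yb > y):                      # edge spans the ray's y
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside

def locate(points, polygon):
    box = bounding_box(polygon)
    candidates = [p for p in points if in_box(p, box)]        # coarse selection
    return [p for p in candidates if in_polygon(p, polygon)]  # exact local filter

triangle = [(0, 0), (4, 0), (0, 4)]
found = locate([(1, 1), (3, 3), (5, 5)], triangle)
```

The box test is cheap enough to run at or near the source, while the exact test only sees the candidates that survive it.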
Integration brings together information from autonomous sources, and that means also that data is represented at differing levels of detail. For instance, geographic results must be brought into the proper context for the application domain. Often detailed data must be aggregated to a higher level of granularity. For instance, to assess sales in a region, detailed data from all stores in the region must be aggregated. The aggregation may require multiple hierarchical levels, where postal codes and town names provide intermediate levels. Such a hierarchy can be modeled in the mediator, so that the client is relieved from that computation. The summarization will also reduce the volume of data, relieving the network and the processors from high demands.
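A hierarchy of this kind can be modeled in the mediator as a chain of mappings, with one roll-up step per level. The store identifiers, the mappings, and the sales figures below are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy: store -> postal code -> town -> region.
STORE_TO_POSTAL = {"s1": "94301", "s2": "94301", "s3": "94040"}
POSTAL_TO_TOWN = {"94301": "Palo Alto", "94040": "Mountain View"}
TOWN_TO_REGION = {"Palo Alto": "Peninsula", "Mountain View": "Peninsula"}

def roll_up(totals, mapping):
    """Aggregate totals one level up the hierarchy described by mapping."""
    out = defaultdict(float)
    for key, amount in totals.items():
        out[mapping[key]] += amount
    return dict(out)

store_sales = {"s1": 100.0, "s2": 50.0, "s3": 25.0}
by_postal = roll_up(store_sales, STORE_TO_POSTAL)
by_town = roll_up(by_postal, POSTAL_TO_TOWN)
by_region = roll_up(by_town, TOWN_TO_REGION)
```

Each step shrinks the data passed on, so the client receives one figure per region instead of one per store.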
Summarization: The actual computation of quantitative summaries can again be allocated to the source, to the mediating layer, or to the client. Languages used for server access, such as SQL, provide some means for grouping and summarization, although expressing the criteria correctly is difficult for end-users. Warehouse and data-mining technology is addressing these issues today [AgrawalIS:93], but supporting a wide variety of aggregation models with materialized data is very costly. The mediator can use its model to drive the computation. However, server capabilities may be limited. Even when SQL is available, the lack of an operator to compute the variance, complementing the AVERAGE operator, also motivates moving aggregating computations out of the server. While in 90% of the cases the average is a valid descriptor of a data set, not warning the end-user that the distribution is far from normal (bi-modal or having major outliers) is fraught with the danger of misinterpretation. Knowledge encoded in a mediator can provide warnings to the client, appropriate to the type of service being provided, that the data is not trustworthy.
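A mediator-side summary that accompanies the average with the spread and with warnings might look like the sketch below. The two tests are crude heuristics chosen for illustration, not established statistical procedures: values more than three standard deviations from the mean are called outliers, and a set with almost no values near its mean is flagged as possibly bi-modal. All thresholds are assumptions.

```python
import statistics

def summarize(values, outlier_sigmas=3.0, near=0.5, min_near_fraction=0.1):
    """Return mean, standard deviation, and warnings about the distribution."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    warnings = []
    if stdev > 0:
        # Heuristic 1: any value far out in the tails.
        if any(abs(v - mean) > outlier_sigmas * stdev for v in values):
            warnings.append("major outliers")
        # Heuristic 2: hardly any values close to the mean.
        close = sum(1 for v in values if abs(v - mean) <= near * stdev)
        if close / len(values) < min_near_fraction:
            warnings.append("possibly bi-modal")
    return {"mean": mean, "stdev": stdev, "warnings": warnings}

well_behaved = summarize([4, 5, 5, 6, 5, 5, 4, 6])
two_peaks = summarize([1, 1, 1, 9, 9, 9])
with_outlier = summarize([5] * 20 + [100])
```

In the second data set the mean is 5 although no observation lies near 5, which is exactly the situation in which reporting the average alone would mislead the client.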
While numeric processing for summarization is well understood, dealing with other data types is harder. We now have experimental abstractors that will summarize text for customers [KupiecPC:95]. Such summarizations may also be cascaded if the documents can be placed into a hierarchical customer structure [Pratt:97].
Aggregation may also be required prior to integrating data from diverse sources. Autonomous sources will often differ in detail, because of differing information requirements of their own clientele. For instance, cost data from local schools must be aggregated to county level before it can be compared with other county budget items. The knowledge to perform aggregation to enable matching is best maintained by a specialist within the school administration; at the county budgeting level it is easy to miss changes in the school system. Other transformations that are commonly needed before integration can be performed are to resolve temporal inconsistencies [GohMS:94], or context differences, as seen in data about countries collected from different international agencies [JanninkEa:98].
The complexity of these transformations is such that they are not appropriate for assignment to the client. Transformations performed on results of integration can, of course, not be assigned to servers.
Object-structuring: Anyone using the web today can attest to the complexity that linearly presented data imposes on the customer who seeks relevant information. Most clients are best served by structuring their information in object-oriented form. That means not only carrying forward the top-level summarization, but also the details that contribute to the summaries. Structural modeling tools can transform relational source data into diverse object-oriented formats, as needed by the client [BarsalouSKW:91]. The base model can cover multiple sources.
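Carrying the details forward together with the summary can be sketched by restructuring flat relational rows into nested objects. The `Region` and `Store` classes and the sample rows are hypothetical and are not the structural model of the cited tools.

```python
from dataclasses import dataclass, field

@dataclass
class Store:
    name: str
    sales: float

@dataclass
class Region:
    name: str
    stores: list = field(default_factory=list)

    @property
    def total_sales(self) -> float:
        """Top-level summary, computed from the retained details."""
        return sum(s.sales for s in self.stores)

def structure(rows):
    """Turn flat (region, store, sales) rows into nested Region objects."""
    regions = {}
    for region_name, store_name, sales in rows:
        region = regions.setdefault(region_name, Region(region_name))
        region.stores.append(Store(store_name, sales))
    return regions

rows = [("North", "a", 10.0), ("North", "b", 5.0), ("South", "c", 7.0)]
regions = structure(rows)
```

A client can read `regions["North"].total_sales` for the summary and still descend into `regions["North"].stores` when it needs the contributing detail.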
Differing contexts require alternate hierarchies. In geography we distinguish political, social, topographical, and other hierarchies. While geographically-based hierarchies are common, other aggregations may be based on social criteria, such as income or age of customers. Layering of geographic criteria and social criteria is also common.
Digital Libraries: Related research is being performed within the Digital Library Project, supported by NSF, DARPA, and NASA. For publications such as journals and books, mediating selection services were traditionally provided through reviewers and editors, while libraries, through their indexers and local storage capabilities, provided dissemination services to the clients. The technical challenge in automating the process is again dealing with the lack of common structure [HammerEa:97], the heterogeneity of sources [NavatheD:95], and the redundancy [ShivakumarG:96] in the source data. For geographic libraries the base material is graphics and images, identified by related text [SmithF:95]. There are many opportunities for innovative value-added services in this area [ButtenfieldG:96].
For building and maintaining multi-layer systems, interface standards are crucial. When legacy files can be structured into tables, SQL will become the access language, as is being done by many extensions of relational systems [Cattell:91]. In addition to accepted standards for data, such as SQL, ODL [Cattell:94], and CORBA [OMG:91], a number of new interfaces have appeared. For instance, a transmission protocol for knowledge and data querying and manipulation being used in related research is KQML [LabrouF:97]. KQML provides for specification of the ontology being used in a transmission, to assure that the contents can be understood by communicating modules. Currently XML is gathering much momentum [Connolly:97]. When data cannot be structured well, the XML format provides an alternative. Such semi-structured data have been the topic of much recent research [PapakonstantinouGW:95]. XML structures can be defined for specific domains, using domain-specific document type definitions (DTDs). Those DTDs will be developed by specialists, and will help in matching the meaning of the information being shared.
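The way XML accommodates data that does not fit a fixed table can be seen in a small sketch using the standard-library `xml.etree.ElementTree` parser. The element names and contents are invented, and this parser does not validate against a DTD, so any DTD-based checking would have to be done with another tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: the two records carry different fields.
DOC = """<shipments>
  <shipment id="1"><port>Oakland</port><tons>120</tons></shipment>
  <shipment id="2"><port>Rotterdam</port><note>delayed by weather</note></shipment>
</shipments>"""

def to_records(xml_text):
    """Turn each <shipment> into a dict holding whatever fields it has."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.findall("shipment"):
        rec = {"id": elem.get("id")}
        for child in elem:
            rec[child.tag] = (child.text or "").strip()
        records.append(rec)
    return records

records = to_records(DOC)
```

The second record has no `tons` field and the first has no `note`; a relational table would need null columns for both, while the tagged form simply omits them.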
The alternative server-based technology, provided by pure Java, does allow uploading of functions to the client, but maintaining support for all user applications in the server or mediator is costly, as is shipping of all presentation alternatives for all client types. Furthermore, since we envision that pragmatic integration and processing will occur in the client, we must transmit information in a form suitable for further processing, and not just for display. As a language, however, Java is attractive, and we are likely to see Java programs in the client interoperating with Java-beans in the mediators.
Many new conventions are being considered for standardization, which will provide stability, and solidify market share. However, it is wise to wait before imposing any such standards on the community until adequate practical experience exists. It remains an open question how beneficial researcher involvement in the standards development process will be, but researchers will certainly be affected by the outcomes [Libicki:95].
Capabilities for data collection are increasing rapidly, advances in communications accelerate the flow, and the situations that the clients must deal with are increasingly varied. Military intelligence systems were among the first users of this technology, even before solid research results were obtained and documented. Fusion of sensor data and images was already common. Geographic systems were integrated in several of these systems, but the interfaces to other data sources are still not very smooth.
Most operational mediating systems have been explicitly programmed. This means that the knowledge the mediators embody is in the form of computer codes. Moving to more formal descriptions is the objective of much current development. Building new systems can become more effective if there is reuse of technology and knowledge [Musen:92]. Use of rules makes the mediator easier to manage, important when the number of potential sources is large [PanchapagesanEa:99]. The leverage offered by modest, domain-specific knowledge bases should be substantial, but still has to be proven. In geography, such concepts have been proposed, but their use for interoperation has not yet been shown [BeardSH:97].
As software suppliers gain experience there will be spinoffs into pure commercial work [DeBellisH:95]. An early example is the use of matchmaking mediators leading now to application in the Lockheed-sponsored venture for distribution of space satellite images [MarkTMS:92]. A list of software suppliers was prepared for [Wiederhold:98] and is maintained in related web pages (http://www-db.stanford.edu/LIC/mediator.html).
Commercial dissemination of mediating modules will only occur if the information service paradigm proves to be effective. Interposition of a mediating layer into the client-server model incurs costs. A system's performance cost may be offset through reduction in transmitted data volume, as the information density increases.
Crucial benefit/cost ratios are in balancing service quality and system maintenance [Wiederhold:95]. The bane of artificial intelligence technology has been the cost-versus-benefit of knowledge maintenance. Mediation provides a focus for such maintenance, in divorcing it from the operational pressures at the servers and the short-range needs at the clients, as shown in Figure 4. Reduced long-term maintenance costs may become the most powerful driver towards the use of mediating technologies, since software maintenance absorbs anywhere from 60% to 90% of computing budgets, and increases disproportionately with scale.
Figure 4: Mediation assigns responsibility for maintenance
Interoperation, while adding value, also adds risks. Combining information from multiple sources, aided by helpful agents that retrieve relevant information which was not directly requested, increases the risk of violation of individual and commercial privacy. Issues of privacy protection [JonesCW:95] and security must be addressed if broad access to valuable data is to become commonplace. A project on security mediation focuses on this issue [WiederholdBD:98]. A security mediator is a distinct module in an enterprise firewall, which complements traditional access protection with mechanisms to filter results before releasing them to the outside world. In a security mediator the owner is the security officer in charge of an organizationally defined domain [GongQ:96].
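The filtering role of a security mediator can be sketched as a function applied to results before release. The policy shown, a set of releasable fields plus a per-row predicate, is a hypothetical simplification; the field names and rows are invented, and the cited project defines its own rule system.

```python
def release(rows, allowed_fields, row_ok):
    """Filter query results before they leave the protected domain."""
    released = []
    for row in rows:
        if not row_ok(row):
            continue  # withhold the whole row
        # Strip every field the security officer has not cleared.
        released.append({k: v for k, v in row.items() if k in allowed_fields})
    return released

results = [
    {"customer": "c1", "order_total": 10.0, "region": "east", "public": True},
    {"customer": "c2", "order_total": 99.0, "region": "west", "public": False},
]
outgoing = release(results, {"order_total", "region"}, lambda r: r["public"])
```

The check is applied to what was actually retrieved rather than to the query text, which complements access control on the way in.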
Having a need is not in itself an adequate motivation for research investment; there also has to be a reasonable hope of moving towards solutions. In many areas, say in dealing with problems of strife and hunger, we are frustrated by complexity and a lack of leverage points. Providing information to agencies to effectively marshal and deploy their resources is a motivation for our research. Finding the right balance of the possible and the ideal is the major strategic issue in defining fundamental research. A tactical issue is finding the right time-point.
Research to solve problems that industry recognizes tends to be futile for academics. Industry will be able to devote sufficient resources to provide adequate, focused solutions. If academics can determine what solutions industry will adopt, then there are opportunities to go beyond. Going beyond can involve depth or breadth. Going in depth may mean dealing with likely omissions. In integration that might mean providing for translation of terms that do not match, but not providing the triggers needed when domains change so that translations have to be updated. Going for breadth in the same problem domain may mean devising rules that can work for multiple domains, rather than for some specific translation.
These tasks in information generation are complex and have to be adaptable to evolving needs of the customers, to changes in data resources, and to upgrades of system environments. The number of research issues needing solutions in the field is great.
As the technical and syntactical problems of interoperation are being dealt with in industry, the semantic issues come to the forefront. Data resources, and especially databases, carry implicit or explicit domain definitions --- no database customer expects a merchant shipping database to deal with interest rates. Similarly, a financial database is expected to ignore details about ships and a weather database is innocent of both. In all three domains the knowledge needed to adequately describe the data is manageable, but great leverage is provided by the many ground instances that knowledge-based rules can refer to.
Figure 5: Moving towards a New Science.
4.2 Alternate Sources
While integration started out in dealing with well-structured databases, much current focus is on semi-structured data, and the textual contents of those data. Images, maps and graphs are brought in mostly through associated keys. Content analysis of these sources is making progress, and will become input to data integration. Video and speech are being analyzed as well, but their volume makes integrated delivery to clients more problematic.
For planning and decision-making, results from simulations also need to be integrated [ArensCHIK:94]. That will allow the clients not only to view the past, but also to extrapolate timelines into the future [WiederholdJG:98]. Today this function is left wholly to the clients, and the tools they have, such as spreadsheets, are not well integrated into their processing systems.
In the meantime, mediated systems are being built where alternatives are not feasible, for instance, where source data is hidden in legacy systems that cannot be converted, or where the planning cycle needed for data system integration is excessive.
Starting in 1992 the Advanced Research Projects Agency (ARPA, now DARPA), the agency for joint research over all service branches of the U.S. Department of Defense, initiated a new research program in Intelligent Integration of Information (I3). Many results cited in this paper were initiated with ARPA support. Later research presented here was supported by NSF CISE and AFOSR. We thank the many participants and students who have helped in developing and realizing the concepts presented here.
Mediated systems are still in their infancy. We hope that ongoing development and deployment will fuel an effective research cycle. Having a clean architecture allows also a partitioning of research tasks, since the overall problem presented by information systems is greater than any single project can handle. Interoperation will require a variety of articulation points among sources and domain-specific knowledge.
The architecture we presented allows multiple application hierarchies to be overlaid, so that the structure forms a directed acyclic graph from client to resource, although the information flow is in the opposite direction. The complexity is still an order less than that implied by arbitrary networks, simplifying composition both in terms of research and operational management. The final vision is summarized in Figure 5, indicating the inputs we have in order to move towards a new science, focusing on integration of capabilities and competencies needed to make large systems work.
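The directed acyclic structure can be checked mechanically. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later); the module names and the call graph are hypothetical. Each key calls the modules in its value set, and because the sorter lists dependencies first, the resulting order runs from resources towards the client, which is the direction of information flow.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical overlay: each module maps to the modules it calls on.
CALLS = {
    "logistics-client": {"transport-mediator", "weather-mediator"},
    "transport-mediator": {"ship-mediator", "rail-db"},
    "ship-mediator": {"ship-db"},
    "weather-mediator": {"weather-db"},
}

def information_flow_order(calls):
    """List modules so that every module appears after the ones it calls."""
    return list(TopologicalSorter(calls).static_order())

def is_acyclic(calls):
    """True if the call structure forms a directed acyclic graph."""
    try:
        information_flow_order(calls)
        return True
    except CycleError:
        return False

order = information_flow_order(CALLS)
```

A mediator that called back into one of its own clients would introduce a cycle, and `is_acyclic` would then return `False`.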
Proc. 6th Intern. Conf. on Information and Knowledge Management (CIKM'97), Las Vegas, NV, Nov. 1997, pp. 348-55. http://pharos.alexandria.ucsb.edu/publications/cikm97.ps.