An Information Food Chain
Advanced Applications on the WWW
In Proc. 4th European Conference on Research and Advanced Technology for Digital Libraries, Springer LNCS Vol. 1923, 2000, pages 490-498.
Stefan Decker, Jan Jannink, Prasenjit Mitra, Gio Wiederhold
Stanford University, CS Dept, CA 94306
The Internet and especially the World Wide Web are growing at a tremendous rate. More and more information is becoming directly available for human consumption. But humans have a limited information processing capacity and are often not able to find and process the relevant information. Automation, in the form of processing agents that provide a powerful extension to human capabilities, is required. However, with current technology it is difficult and expensive to build such automated agents and to support human users with their information processing power, since agents are not able to understand the meaning of the natural language terms found on today's webpages. To remedy this situation we look at different existing technologies and put them together into a new information food chain [Etzioni 1997] for agents that enables advanced applications on the WWW.
To really facilitate automated agents on the web, agent-interpretable formal data is required. However, creating formal data about a particular domain is a high-effort task, and it is not immediately clear what kind of data and tools should be created to support this task. This paper aims at clarifying which data and tools are necessary for the creation and deployment of formal data on the web by presenting an information food chain [Etzioni 1997] for agents and for formal data aimed at agents. Every part of the food chain provides information that enables the existence of the next part. A part of the food chain is basically information on the web, which is used to create more advanced information.
Note that, in contrast to other agent approaches, we do not focus on inter-agent communication approaches like KQML [Wiederhold 1990][Finin 1997]. Instead we investigate the infrastructure necessary to enable automated single agents on the Web, which we believe has to be created first, before inter-agent communication becomes necessary.
Our goal is not to provide a new technical solution to a sub-problem; instead we aim at putting all pieces of current technology in the right place to enable automated agents.
The rest of the paper is organized as follows:
In the next section we present an overview of the agent information food chain.
Representing information on the web requires a joint representation language for data on the web, such that the building of wrappers is no longer necessary. Given the current situation, this will be most likely XML [XML 99] based. However, XML itself is not suitable for this task (although the current hype around XML might suggest this) because the semantics common to all XML languages is just the parse tree of the documents, which is as useful as an HTML parse tree. If general XML is used, creation of wrappers becomes necessary again. So a specific XML-based language is necessary for formal agent communication.
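The limitation of general XML can be illustrated with a minimal sketch. It parses two encodings of the same flight fact with Python's standard `xml.etree.ElementTree` module. The element names, attribute names, and the helpers `shape`, `wrap_a`, and `wrap_b` are invented for this illustration and do not come from any real vocabulary. The two parse trees share no structure although the meaning is identical, so an agent still needs one hand-written wrapper per vocabulary.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same fact (all names are invented).
DOC_A = "<flight><from>Washington</from><to>New York</to></flight>"
DOC_B = '<connection origin="Washington" destination="New York"/>'


def shape(element):
    """Return the parse tree as nested tuples: (tag, attribute names, children)."""
    return (
        element.tag,
        tuple(sorted(element.attrib)),
        tuple(shape(child) for child in element),
    )


# The only semantics shared by all XML languages is this tree structure.
tree_a = shape(ET.fromstring(DOC_A))
tree_b = shape(ET.fromstring(DOC_B))
same_structure = tree_a == tree_b  # the trees differ, the meaning does not


def wrap_a(root):
    """Wrapper for vocabulary A: the cities are stored in child elements."""
    return (root.findtext("from"), root.findtext("to"))


def wrap_b(root):
    """Wrapper for vocabulary B: the cities are stored in attributes."""
    return (root.get("origin"), root.get("destination"))


fact_a = wrap_a(ET.fromstring(DOC_A))
fact_b = wrap_b(ET.fromstring(DOC_B))
```

Only after both wrappers have been written do the two documents yield the same fact, which is the effort a specific, shared XML-based language is meant to avoid.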
For data exchange on the web it is also necessary to have a specification of the terminology of the particular domain that the data is about. Raw data available on the web without an explanation of its semantics is useless (e.g. the tag <flighttime> in data from an air-cargo carrier might mean the starting time of the flight or the landing time).
Ontologies [Fridman Noy, 1997] are a means for knowledge sharing and reuse and capture the semantics of a domain of interest. Ontologies are consensual and formal specifications of vocabularies used to describe a specific domain, and they facilitate information interchange among information systems. Since there will be no ontology available describing all possible data, we need multiple ontologies for different application domains (DTDs, as used e.g. at Biztalk, can be regarded as a very limited kind of ontology: they describe the grammar of a document).
However, ontologies themselves are just formal data on the web that needs to be exchanged. So it is desirable for the XML-based data exchange language to also be able to represent an ontology for a particular domain.
Examples of XML-based languages that allow explicit ontology definition are RDF (Resource Description Framework) [Lassila 1999], RDF Schema [Brickley 1999], XOL (XML-based Ontology Exchange Language) [Karp 1999], OML/CKML (Ontology Markup Language/Conceptual Knowledge Markup Language) [Kent 1999], and the recent DAML program (DARPA Agent Markup Language).
A specific XML-based language and ontologies are one foundation for automated agents on the web. However, to deploy the formal data and to make its creation manageable, an infrastructure (e.g. support and deployment tools), the information food chain, is necessary. The information food chain is depicted in Figure 1.
Figure 1: Agents Information Food Chain
The food chain starts with the construction of an ontology, preferably with a support tool, the Ontology Construction Tool. An ontology is the "explicit specification of a conceptualization", which provides all the required terminology and a basis for information exchange within a community of interest. The ontology defines the terms that can be used to annotate information in webpages, using the aforementioned XML-based representation language. A Webpage Annotation Tool provides means to browse the ontology, to select appropriate terms of the ontology, and to map them to sections of a webpage. The webpage annotation process creates a set of annotated webpages, which are available to an automated Agent to achieve its tasks. An Agent itself needs several sub-components, specifically an Inference System for the evaluation of rules and queries and for general inferences, and an Ontology Articulation Toolkit for mediation among information obtained from different ontologies. The data from the annotations can be used to construct additional websites, e.g. a Community Web Portal that presents a community of interest to the outside world in a concise manner. And finally, information-seeking users can give specific retrieval tasks to an OnTo-Agent, or they can query a Community Web Portal for immediate access to the information.
The parts of the agent information food chain that we believe to be essential will be described in greater detail in the next section.
Ontologies are engineering artifacts, and constructing an ontology involves human interaction. Thus means are necessary to keep the costs of ontology creation as low as possible. Therefore an Ontology Construction Tool is necessary to provide means to construct ontologies in a cost-effective manner.
Ontologies evolve and change over time as our knowledge, needs, and capabilities change. Reducing the acquisition and maintenance cost of ontologies is an important task. Examples of ontology editors are the Protégé framework for customized knowledge-base system construction [Grosso et al., 1999] and the WebOnto framework [Domingue 1998]. WebOnto aims at the distributed development of ontologies. It supports collaborative browsing, creation and editing of ontologies by providing a direct manipulation interface that displays ontological expressions. However, none of these tools is yet used for creating ontologies for web-based agent applications.
Creating formal data on the web is a costly process. But effective deployment of agents requires the creation of data. Hence, an important goal of an information food chain is to reduce the cost of ontology-based data creation as much as possible. The annotation process requires tools that support users in the creation of this kind of data. The biggest source of information on the web today is HTML pages. The creation of explicit ontology-based data and metadata for HTML pages requires much effort if done naively. Most HTML pages we find on the web are created using a visual HTML editor, so users often already know how to use these tools. However, HTML editors focus on the presentation of information, whereas we need support for the creation of ontology-based metadata to describe the content of information. Thus the information food chain needs a practical, easy-to-use, ontology-based, semantic annotation tool for HTML pages.
A basic tool, Onto-Pad, was already created in the project Ontobroker [Decker, 1999][Fensel 1999]. Onto-Pad is an extension of a Java-based HTML editor, which allows normal browsing and editing of the HTML page, and supports the annotation of the HTML-page with ontology-based metadata. The annotator can select a portion of the text from a webpage and choose to add a semantic annotation, which is inserted into the HTML source. However, for significant annotation tasks a basic annotation tool is not sufficient. It still takes a long time to annotate large pages, although a significant improvement was reported when compared to the manual task.
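The basic annotation step described above, selecting a portion of text and inserting a semantic annotation into the HTML source, can be sketched as follows. This is only an illustration under stated assumptions: the function `annotate`, the attribute name `data-onto`, and the term `Researcher` are hypothetical and do not reproduce the annotation syntax actually used by Onto-Pad or Ontobroker.

```python
from html import escape


def annotate(html_source, selected_text, ontology_term):
    """Wrap the first occurrence of selected_text in a span carrying the term.

    The attribute name data-onto is an assumption made for this sketch.
    """
    if selected_text not in html_source:
        raise ValueError("selected text not found in page")
    marked = '<span data-onto="{}">{}</span>'.format(
        escape(ontology_term, quote=True), selected_text
    )
    # Replace only the first occurrence, as a user would select one passage.
    return html_source.replace(selected_text, marked, 1)


page = "<p>Stefan Decker works at Stanford University.</p>"
annotated = annotate(page, "Stefan Decker", "Researcher")
```

Even in this toy form every annotation requires one explicit selection by the user, which is why annotating large pages remains slow without further support.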
So a practical tool should also exploit information extraction techniques for semi-automatic metadata creation. The precision of linguistic processing technology is far from perfect, and reasonably exact automatic annotation is not possible. However, much linguistic processing technology currently exists that is well suited to helping users with their annotation tasks [Alembic 1999][Day 1997][Neumann 1997]. Often it is sufficient to give the user a choice to annotate a phrase with one of two or three ontological expressions. Resources like WordNet [Fellbaum 1998] and results obtained from the Scalable Knowledge Composition (SKC) project [Mitra 2000] can provide high-level background knowledge to guide the annotation support.
Often a collection of similar pages has to be semantically annotated. Here document templates with built-in annotations simplify the creation of collections of annotated documents considerably. Document templates enable the reuse of annotations: a document template has to be created only once.
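A document template with built-in annotations might look like the following sketch, which uses Python's standard `string.Template`. The annotation attribute `data-onto`, the ontology terms, and the sample records are all hypothetical and serve only to show that the annotations are written once and reused for every generated page.

```python
from string import Template

# Hypothetical template: the annotations (data-onto attributes) are written once.
STAFF_PAGE = Template(
    '<h1><span data-onto="Researcher.name">$name</span></h1>\n'
    '<p>Email: <span data-onto="Researcher.email">$email</span></p>'
)

# Invented sample records standing in for a collection of similar pages.
people = [
    {"name": "A. Example", "email": "a@example.org"},
    {"name": "B. Example", "email": "b@example.org"},
]

# Every generated page inherits the annotations from the template.
pages = [STAFF_PAGE.substitute(person) for person in people]
```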
We can also benefit from the intensive, ongoing efforts in the W3C Extensible Markup Language (XML) developments. XML promotes the creation of semantic markup. With the definition of a mapping between XML elements and the concepts of an ontology [Erdmann 1999], XML documents can be treated as formal information sources.
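As a rough sketch of such a mapping, the following code translates the elements of an XML document into ontology-level facts. The mapping table, the element names, the ontology terms, and the function `extract_facts` are assumptions made for this illustration. They are not the mapping formalism of [Erdmann 1999].

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML element names to ontology terms.
ELEMENT_TO_TERM = {
    "flight": "Flight",
    "dep": "departureAirport",
    "arr": "arrivalAirport",
}

DOCUMENT = (
    "<schedule><flight>"
    "<dep>IAD</dep><arr>JFK</arr><gate>12</gate>"
    "</flight></schedule>"
)


def extract_facts(xml_text, mapping):
    """Return (concept, property, value) facts for all mapped elements."""
    facts = []
    for record in ET.fromstring(xml_text).iter():
        concept = mapping.get(record.tag)
        if concept is None:
            continue  # element has no counterpart in the ontology
        for child in record:
            prop = mapping.get(child.tag)
            if prop is not None and child.text:
                facts.append((concept, prop, child.text.strip()))
    return facts


facts = extract_facts(DOCUMENT, ELEMENT_TO_TERM)
```

Elements without a counterpart in the ontology, such as `gate` here, are simply ignored, so the mapping only has to cover the part of the document that matters to the agent.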
In order to solve a task that involves consulting multiple information sources that have separate ontologies, an automated Agent needs to bridge the semantic gap between the several ontologies it finds on the web.
In our Scalable Knowledge Composition (SKC) project [Mitra 2000] we have developed tools that allow an expert in the articulation to define rules that link semantically disjoint ontologies, i.e., ontologies from different sources that differ in their terminology and semantic relationships. These rules then define a new articulation ontology, which serves the application and translates terms that are involved in the intersection to those in the source domains. Typical disjoint source domains with an important articulation in a logistics example would be trucking and air cargo. But actual shipping companies can also use terms differently. Information produced within the source domain in response to a task can then be reported up to the application using the inverse process. The results are now translated to a consistent terminology, so that the application can produce an integrated result. The major advantage of the SKC approach is that not all of the terms in the source ontologies have to be aligned, i.e., be made globally consistent. Aligning just two ontologies completely requires a major effort for a practical application, as well as ongoing maintenance. For instance, it is unlikely that United Airlines and British Airways will want to agree on a fully common ontology, but a logistics expert can easily link the terms that have matching semantics, or define rules that resolve partial match problems. Partial matches occur when terms differ in scope, say when one airline includes unaccompanied luggage in its definition of cargo and another does not. Partial matches are dealt with by rules in the articulation that indicate which portions of two partially matching concepts belonging to different ontologies are semantically similar and can thus be combined. Articulation also provides scalability.
Since there are hundreds of domains just in the logistics area, and different applications need to use them in various combinations, the global-consistency approach would require that all domains that can interact at some time must be made consistent. No single application can take that responsibility, and it is unlikely that even any government can mandate national semantic consistency (the French might try).
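The idea of an articulation with a partial-match rule can be made concrete with a small sketch. It is not the SKC rule language. The data layout, the source names, the terms, and the function `query` are assumptions chosen to mirror the cargo example above, in which one carrier counts unaccompanied luggage as cargo and the other does not.

```python
# Hypothetical source data; each source keeps its own terminology.
SOURCE_A = [  # carrier A counts unaccompanied luggage as cargo
    {"term": "cargo", "kind": "pallet", "kg": 400},
    {"term": "cargo", "kind": "unaccompanied_luggage", "kg": 30},
]
SOURCE_B = [  # carrier B does not
    {"term": "freight", "kind": "pallet", "kg": 250},
]
SOURCES = {"A": SOURCE_A, "B": SOURCE_B}

# Articulation rules: articulation term -> (source, source term, condition).
# The condition restricts a partially matching concept to the shared portion.
ARTICULATION = {
    "Cargo": [
        ("A", "cargo", lambda item: item["kind"] != "unaccompanied_luggage"),
        ("B", "freight", lambda item: True),
    ],
}


def query(articulation_term, sources):
    """Translate the term into each source and report results in common terms."""
    results = []
    for source_name, source_term, condition in ARTICULATION[articulation_term]:
        for item in sources[source_name]:
            if item["term"] == source_term and condition(item):
                results.append(
                    {"term": articulation_term, "source": source_name, "kg": item["kg"]}
                )
    return results


shipments = query("Cargo", SOURCES)
total_kg = sum(shipment["kg"] for shipment in shipments)
```

Only the terms that the application needs appear in the articulation. The sources themselves are left untouched, so no global alignment of their ontologies is required.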
However, an automated agent can also come across a new domain, with its own ontology about which it has no prior knowledge. The customer still needs information to be extracted from the new domain. This is a hard problem because the agent does not possess any prior information about the semantics of the terms and the relationships in the new information source.
To address this problem, methods for generating dynamic articulations on an as-needed basis are necessary. This will allow the creation of working articulation ontologies or linkages among the source ontologies. If the end-user has the expertise to perform the articulation, then no specialist expert who understands the linkages among the domains is needed. For instance, any frequent traveler today can deal with most domain differences among airlines.
Combining information from different sources enables new applications and exploits available information as much as possible. This requires inference techniques that enable the ontology language to relate different pieces of information to each other. An agent's inference system provides the means for effective and efficient reasoning. We expect that this technology will enable agents to reason with distributed available knowledge.
Metadata on the Web is usually distributed, and added value can be generated by combining metadata from several metadata offerings: for example, in the transport domain, one metadata offerer might state that a flight is available from Washington D.C. to New York.
Another statement from another server states the availability of a flight from New York to Frankfurt. Combining these two pieces of information, an agent's inference system can deduce that a connection is available from Washington D.C. to Frankfurt, although this is not explicitly stated anywhere. Please note that even such simple rules as those needed here are evaluable neither by commercial databases (because of their inability to perform recursion) nor by simple PROLOG systems (because of potential infinite looping). Thus neither fundamental type of system is usable for this task.
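The flight example can be written as two recursive rules and evaluated bottom-up until a fixpoint is reached. The sketch below uses naive bottom-up evaluation, which is one simple technique from the deductive database literature and not necessarily the one used by any particular inference system. The predicate names `flight` and `connection` and the added return flight are assumptions made for illustration.

```python
def connections(direct_flights):
    """Naive bottom-up evaluation of the rules

        connection(X, Y) :- flight(X, Y).
        connection(X, Z) :- connection(X, Y), flight(Y, Z).

    Facts are derived until nothing new appears, so evaluation always
    terminates on finite data, even when the flights form a cycle.
    """
    known = set(direct_flights)
    while True:
        derived = {
            (x, z)
            for (x, y) in known
            for (y2, z) in direct_flights
            if y == y2
        }
        new_facts = derived - known
        if not new_facts:
            return known
        known |= new_facts


# Statements collected from different servers; the return flight
# (an assumption for this sketch) makes the data cyclic.
flights = {
    ("Washington", "New York"),
    ("New York", "Frankfurt"),
    ("Frankfurt", "New York"),
}
reachable = connections(flights)
has_connection = ("Washington", "Frankfurt") in reachable
```

A depth-first, PROLOG-style evaluation of the same left-recursive rule can loop forever, whereas the fixpoint computation stops as soon as a round derives no new fact.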
Inference engines also reduce metadata creation and maintenance cost. Having to state every single assertion of metadata explicitly leads to a large metadata creation and maintenance overhead. So tools and techniques are necessary that help to reduce the amount of explicitly stated metadata by inferring further implicit metadata. Implicit metadata can be derived from already known metadata by using general background knowledge, as stated in the ontologies.
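How background knowledge from an ontology yields implicit metadata can be sketched with a subclass hierarchy. The class names, the instance identifier, and the function `infer_types` are hypothetical. Only the most specific class is stated explicitly, and the remaining type statements are derived.

```python
# Hypothetical ontology fragment: each class has at most one superclass here.
SUBCLASS_OF = {
    "CargoFlight": "Flight",
    "Flight": "TransportService",
}

# Explicitly stated metadata: (instance, class).
explicit_types = {("flight-4711", "CargoFlight")}


def infer_types(explicit, subclass_of):
    """Add every superclass of an explicitly stated class as implicit metadata."""
    inferred = set(explicit)
    for instance, cls in explicit:
        seen = {cls}
        parent = subclass_of.get(cls)
        # Walk up the hierarchy; the seen set guards against cyclic definitions.
        while parent is not None and parent not in seen:
            inferred.add((instance, parent))
            seen.add(parent)
            parent = subclass_of.get(parent)
    return inferred


all_types = infer_types(explicit_types, SUBCLASS_OF)
implicit_types = all_types - explicit_types
```

One explicit statement yields two implicit ones in this example, and the saving grows with the depth of the hierarchy.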
To exploit axioms defined in ontologies for available metadata, an automated agent needs to have reasoning capabilities. The properties of these reasoning capabilities have to be chosen carefully. A reasoning mechanism that is too powerful has intractable computational properties, whereas one that is too limited does not realize the full potential of inference. Deductive database techniques have proven to be a good compromise between these extremes.
Figure 2: Inference Enabling DAML Markup Deployment
From the very beginning, communities of interest have formed on the web around topics that they deemed to be of interest to their group. These users create what are now commonly known as community web portals.
Community web portals are hierarchically structured, similar to Yahoo!, and require a collaborative process with limited resources (manpower, money) for maintaining and editing the portal. Since a number of people are involved, update and maintenance is a major effort. Strangely enough, technology for supporting communities of interest has not quite kept up with the complexity of the tasks of managing community web portals.
Recent research has demonstrated that authoring, as well as reading and understanding of websites, profits from conceptual models underlying document structures [Fröhlich 1998][Kesseler 1995]. The ontology underlying a community of interest provides such a conceptual model, since the ontology formally represents common knowledge and interests that people share within their community. It is used to support the major tasks of a portal, namely accessing the portal through manifold, dynamic, and conceptually plausible views onto the information of interest in a particular community, and providing information in a number of ways that reflect the different types of application requirements posed by the individuals.
In the Ontobroker [Decker 1999] and SHOE [Henflin 1998] projects, means were investigated to annotate webpages with ontology-based metadata, thus realizing part of the food chain. Agent-based architectures (see [] for an overview) usually focus on inter-agent communication instead of ontology creation and deployment.
We presented an information food chain that delivers an infrastructure that empowers intelligent agents on the web and deploys applications that will facilitate automation of information processing on the web. Fundamental to that approach is the use of a formal markup language for annotation of web resources.
We expect this information infrastructure to be the basis for the "Semantic Web" idea: that the World Wide Web (WWW) will become more than a collection of linked HTML pages for human perusal and will be a source of formal knowledge that can be exploited by automated agents. Without automation and precision of operations, business and governmental uses of the information will remain limited.
