The Scalable Knowledge Composition (SKC) project is being initiated to develop a specific approach to resolving semantic heterogeneity in information systems. The SKC approach requires developing an algebra over ontologies that represent the terminologies of distinct, typically autonomous domains. Intersection will be the most crucial operation, since it identifies the articulation (Guha and Lenat's term), namely the terms where linkage occurs among the domains. The intersection, and all other SKC algebra operations, will themselves be knowledge-driven, using articulation rules. The source ontologies can then be largely maintained autonomously, while the articulation rules will be maintained by the groups benefiting from sharing and interoperation among domains. The SKC project will develop methods using the resulting articulation that allow ontologies to interoperate. The problem of managing large knowledge bases is thus reduced to one of composition. No global agreement is needed among maintainers of disjoint ontologies. We believe that this distributed approach to knowledge maintenance is the best (only?) approach to make semantic interoperation scalable. We are convinced that methods to enforce consistency, even if supported by edicts from `higher authorities' to force distinct, autonomous groups to use language coherently, won't work. (The French academy can continue to tell us otherwise.) The project is hence conceptually quite innovative. SKC also requires the building of some solid demonstrations to show the world the feasibility of this approach. There are many open research questions that we expect to uncover in this process.
This research effort is to be funded by AFOSR, with cooperation of the DARPA High-Performance Knowledge Base (HPKB) program.
We have received some complementary funding from the Hughes Research Institute, Malibu, CA.
Earlier exploratory work was supported by the CommerceNet Consortium.
An introductory SKC presentation was made at the HPKB West Coast Introductory Meeting, held at Stanford University, March 26-27, 1997.
The ability of SKC to compose knowledge from independent sources also supports the acquisition objective of DARPA's HPKB program. It is easy to motivate small groups of experts to develop ontologies in their specific domains of expertise. It is costly, especially in time, to convene large groups to establish and update broadly based ontologies. Furthermore, larger ontologies often require compromises, reducing their precision. Composition empowers small groups to contribute to large tasks.
Increasingly powerful computers and better processing algorithms will help to establish and maintain large knowledge bases, but equally crucial is improving the management of knowledge and its components. Composability of independently developed chunks of knowledge provides the basis for such management. The SKC project will provide the operations to support composition, as well as intersection and selection of chunked knowledge. An intersection operation permits focusing on critical linkages. Logical partitioning of knowledge into chunks reduces computational complexity by exponential factors, while enabling distribution of computations to be processed in parallel over many processors.
The Ontology Algebra will itself be knowledge-driven, a necessary feature to deal with the complexities and inconsistencies that arise when distinct knowledge resources are merged. By formulating the needed operations as an algebra, SKC provides a sound basis for extensive and incremental knowledge manipulation. The knowledge that will drive the Ontology Algebra is limited to rules that enable articulation, the linking of disjoint knowledge resources, and interoperation, the processing of information based on the articulated knowledge. We believe strongly that a disciplined manipulation of knowledge resources will be essential to achieve the needed scalability.
The Ontology Algebra supports the objective of the HPKB program: making much relevant knowledge available to applications, without incurring the problems of building, managing, and maintaining huge, integrated knowledge bases.
Figure 1: Knowledge Composition.
The articulation knowledge drives the ontology algebra and is distinctly maintained in chunks relevant to interacting domains to deal with the complexities and inconsistencies that arise when distinct knowledge resources are merged [Wiederhold:91]. We observe that, in the past, the database field was able to make progress once an algebra over data had been defined [Codd:1970]. We also observe that partitioning was necessary to make progress with a truly large knowledge base, CYC at MCC [LenatG:90]. However, CYC is still physically integrated and does not possess a formal basis for operations that deal with interactions among its partitions. The application knowledge bases can then be arbitrarily composed, using rules associated with the ontology algebra. Composition and articulation among distinctly maintained knowledge-based resources achieves effective use, reuse, and scalability.
We will make two more points on which the SKC approach is truly innovative. Section G will provide the full rationale and show examples to clarify the distinctions.
The SKC proposal recognizes and addresses the following problems:
It is possible to write programs and employ thesauri to aid in matching knowledge entries that may refer to identical items named differently. The thesauri needed may be generated automatically, by processing documents and noting overlaps. Such systems will create many interesting matches, increasing coverage, but will also report many false and even ludicrous matches, since they have low precision. Some web tools, such as AltaVista, use such technologies. However, systems which broaden searches with little restraint cause information overload for the customer, and are not suitable in situations with responsible practitioners.
Tools have been developed to merge diverse knowledge bases, with the expectation that more is better [Humphries:95]. However, knowledge from independent sources displays significant differences, as discussed above, so that the aggregation will be inconsistent and error-prone. No central organization can resolve all the differences that will show up, since each resolution requires knowledge about the source domains and their intersection. The SKC proposal introduces a knowledge-driven Ontology Algebra, which is intended to bring formality and generality to the manipulation of ontologies. The Ontology Algebra will allow combining distinct knowledge resources into application-specific knowledge bases. We use the term algebra to indicate that the operations themselves will be composable, so that a variety of outputs can be generated and optimal sequences constructed. The operations of an algebra will be more disciplined than simply merging the resources, and involve selecting, matching, transforming, and intersecting the base resources to achieve the desired application-specific results. The application knowledge bases will hence typically be smaller than the union of their resources, but will be more effective and more economical to process. The Ontology Algebra will itself be knowledge-driven, to deal explicitly with the complexities of composing knowledge.
Most work on merging ontologies assumes that identically spelled words imply identical concepts, but such an approach is unproven and overreaching. This assumption, basic to mathematics, where X=X is a fundamental axiom, is invalid in the real world of natural language, where individuals express themselves in the manner most effective for them [Garfield:87]. When ontologies are created by a merger based on matching spellings, many errors and problems will be found. The result has to be patched until it appears to operate satisfactorily. The errors that occur can be on several levels.
As long as the processing of military information, and especially its fusion, was the task of human experts, context was implicitly recognized and most errors were easily avoided. However, performance demands that more and more military information be automatically processed. Then these problems will become explicit and obvious. Providing the needed operations for automatic processing of knowledge that is inconsistent will be the primary objective of SKC.
HOUSE (carpentry) = HOUSE (householder)
TABLE (carpentry) = TABLE (householder)
Such rules are needed if the application deals with houses and their furniture in both the owner's and the maintenance context. That application will probably not need a rule
SALARY (carpentry) = SALARY (householder) .
The house maintenance application may need the rule
SINKER (carpentry) in NAIL (householder)
denoting that the householder will refer to large nails (SINKERs) as NAILs. However, BRAD (carpentry), not used by the householder, need not be so defined. The matching rules themselves may be bounded within an articulation context.
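Such matching rules are naturally data, not code. A minimal sketch in Python of how context-qualified terms and rules might be represented; the names and structures are illustrative, not the SKC implementation:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Term:
        name: str       # e.g. "HOUSE"
        context: str    # e.g. "carpentry" or "householder"

    # Equivalence rules from the carpentry/householder articulation:
    EQUALS = {
        (Term("HOUSE", "carpentry"), Term("HOUSE", "householder")),
        (Term("TABLE", "carpentry"), Term("TABLE", "householder")),
    }

    # Subsumption rules, e.g. SINKER (carpentry) in NAIL (householder):
    SUBSUMED_IN = {
        (Term("SINKER", "carpentry"), Term("NAIL", "householder")),
    }

    def articulated(a: Term, b: Term) -> bool:
        # Linked directly, symmetrically, or by subsumption.
        return (a, b) in EQUALS or (b, a) in EQUALS or (a, b) in SUBSUMED_IN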
We will use another example, that of purchasing goods for a department store, where the goods are in the context of wholesale purchasing. Here the domain experts, defining the articulation knowledge, are the purchasing agents, who all should be members of the American Society of Purchasing Agents (ASPA). For SHOE purchases there will be the matching rule:
SHOE (store) = SHOE (factory)
The terms SHOE are not terminals in the store and factory ontologies, but rather the roots of subhierarchies, with entries such as PUMPS, WEDGIES, LOAFERS, and SANDALS:
{PUMPS, WEDGIES, LOAFERS, SANDALS} (store) ISA SHOE (factory)
Not all terms in the subhierarchies will match, so the ASPA will have further rules, such as
PUMPS (store) = SHOE (factory) if HEEL (factory) > 5cm.
Parts of shoes needed to specify purchases, such as HEEL, will also appear in the algebra's rule base; but by the time NAIL (factory) is reached, no matching rule is needed. Confusion with NAIL (anatomy), which may be an entry in the shoe store's ontology, is hence avoided. We already have an ontology for anatomy, provided by the College of Pathologists [ACP:92], should the store need an articulation with health problems due to high heels.
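Conditional rules such as the PUMPS rule above suggest that matches may depend on attribute values. A sketch, again with illustrative names; a real system would query the factory ontology for the HEEL height:

    def pumps_matches_shoe(factory_item: dict) -> bool:
        # PUMPS (store) = SHOE (factory) if HEEL (factory) > 5cm
        return factory_item.get("HEEL_CM", 0) > 5

    # A rule base can pair context-qualified terms with predicates:
    CONDITIONAL_RULES = [
        (("PUMPS", "store"), ("SHOE", "factory"), pumps_matches_shoe),
    ]

    def match(store_term: str, factory_term: str, factory_item: dict) -> bool:
        for s, f, predicate in CONDITIONAL_RULES:
            if s == (store_term, "store") and f == (factory_term, "factory"):
                return predicate(factory_item)
        return False

    # match("PUMPS", "SHOE", {"HEEL_CM": 7.5})  ->  True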
The fact that a term does not appear in the articulation intersection
does not mean that it is not accessible for domain-specific
computation, using methods which are available to the application
(that's why there are double-headed arrows in Figure 1). For
instance, NAIL (factory) may be a term used in a computational method
DURABILITY whose execution remains local to the factory system, but
which can be invoked from the articulated result, if there is a
matching rule, say
DURABILITY (store) = DURABILITY (factory).
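A sketch of such a by-name invocation; the registry, the method body, and all names here are hypothetical:

    # Methods stay executable only in their home resource; the
    # articulation carries their names, never their code.
    FACTORY_METHODS = {
        # Runs locally at the factory, using factory-only terms like NAIL:
        "DURABILITY": lambda shoe: shoe["nail_count"] * shoe["nail_quality"],
    }

    def invoke_at_factory(method_name: str, argument: dict):
        # Invoke a factory method by name from the articulated result.
        return FACTORY_METHODS[method_name](argument)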
Maintenance of the rules, to deal, say, with changes in shoe fashions, is now assigned to ASPA rather than to the factory or the store.
Stores could, of course, add their own mappings, a local issue, but
one that can still be aided by having the ontology algebra. In
general, subsidiary entries of shared entries, such as SIZE and COLOR for SHOEs, are candidates for sharing as well. This observation provides the basis for tools that aid in the creation and maintenance of the articulation algebra.
The competency of an ontology depends on its depth and relevance
[GruningerF:95]. An algebra can combine these competencies for
distributed execution, but the articulation points have to be located,
joined, minimized, and made accessible. The ontologies can be
completely distinct, overlapping, or be subsets of each other. Each
relevant set has to be computable.
The ontology algebra will need to include the common set operations among knowledge bases, or rather their ontologies: intersection, union, and difference.
The essential innovation in SKC is that the definition of shared is
determined by rules in the articulation knowledge bases, and not by
syntactic match. Examples of such rules were presented above. Whenever entries match, their dependent entries will also be investigated.
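A minimal sketch of such a rule-driven intersection, where matches() stands in for the articulation rule base and the ontologies are plain term graphs; all names are illustrative:

    def intersect(onto_a: dict, onto_b: dict, matches) -> set:
        # Ontologies map each term to its dependent (e.g. subsumed) terms.
        # Seed with all rule-matched pairs, then follow dependents.
        links = set()
        frontier = [(a, b) for a in onto_a for b in onto_b if matches(a, b)]
        while frontier:
            a, b = frontier.pop()
            if (a, b) in links:
                continue
            links.add((a, b))
            # Dependent entries of matched entries are also investigated:
            for a2 in onto_a.get(a, []):
                for b2 in onto_b.get(b, []):
                    if matches(a2, b2):
                        frontier.append((a2, b2))
        return links

The result is the articulation: only the linked terms, typically a modest fraction of the base knowledge.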
To support the base operations of the ontology algebra we also need
some infrastructure operations, which will access different knowledge
representations. These mapping operations will have to be specific to
the underlying representation, so that as little functionality of the resources as possible is lost.
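One way to realize such mapping operations is a small adapter interface per underlying representation; a sketch with a hypothetical interface:

    from abc import ABC, abstractmethod

    class OntologyAdapter(ABC):
        @abstractmethod
        def terms(self) -> set:
            ...  # all term names in the underlying representation

        @abstractmethod
        def related(self, term: str) -> set:
            ...  # terms directly related to `term`

    class FrameAdapter(OntologyAdapter):
        # Adapter for a simple frame representation: name -> slot values.
        def __init__(self, frames: dict):
            self.frames = frames

        def terms(self) -> set:
            return set(self.frames)

        def related(self, term: str) -> set:
            return set(self.frames.get(term, ()))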
We do not expect to have to map executable codes, only their names.
Our algebra is hence an ontology algebra, rather than a general
knowledge algebra. We are not hopeful that we can describe arbitrary
methods in knowledge-based information systems in a way which permits
formal, scalable, and reliable manipulation. Any execution of
available methods must be carried out in the local domain resource,
since semantics embedded in executable code cannot be expected to
survive a transfer into another context. Methods that perform remote
update of resource knowledge bases must always be executed in a local
context, so that side-effects are consistently managed. This approach
enables safe update, within the limits of authorization. Articulation
knowledge can be similarly updated. We do expect, however, that most updates will be made locally, by the owners of the knowledge bases.
Remote execution is well supported by modern computing technology
[Burback:96], although the problems of remote service execution in
differing context are just now being addressed [HowieKL:96].
We must note that an ontology is more than a collection of terms. It
also includes definition of relationships, constraints, and behaviors
[GruningerF:95]. Whenever a term is matched, its subsuming,
subsumed, and otherwise related terms also become candidates for
matching. We will further develop a tool for the creator and
maintainer of an articulation knowledge base that provides suggestions
and guidance. However, it is quite unwise to automatically move all candidate terms into the articulation knowledge base, since that would forfeit the focus on relevant linkages that keeps the articulation small and maintainable.
Most procedures used in AI have greatly increasing costs as the number of inferencing rules (N) increases, since all O(N^2) combinations must be investigated. That factor increases when more complex relationships are needed. If we can partition a problem into P chunks, then the processing cost becomes on the order of O((N/P)^2 + P^2). If the partitions are hierarchically structured, the cost becomes O((N/P)^2 + P). For P=10 and N=250 the ratio is about 80 for the general partitioned alternative. For P=20 and N=2500 this ratio is nearly 400. The benefits keep increasing by O(P^2) as the number of partitions increases; however, we wish to keep the number of knowledge resources used by any application in bounds.
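These ratios can be checked directly; a small sketch:

    # Cost without partitioning is N^2; with P general partitions it is
    # (N/P)^2 + P^2, so the speedup ratio is:
    def ratio(n: int, p: int) -> float:
        return n**2 / ((n / p)**2 + p**2)

    print(round(ratio(250, 10)))    # 86, "about 80" in the text above
    print(round(ratio(2500, 20)))   # 390, "nearly 400" in the text above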
Further benefits ensue by being able to ignore all knowledge outside
of the intersections described by the articulation rules. In Figure 1
the resource D, and the articulation rules for the intersection (C ∩ D), will not participate in the application being served. In
broadly based systems, i.e., systems serving many applications, these
savings will be significant.
Furthermore, the resulting computations can be naturally distributed
over all active partitions (8 in Figure 1), giving a factor of over
500 for this simple case. There will be human benefits that at least
equal the quantified improvements. It is difficult for a human to
understand reasoning covering more than a dozen items and more than
five plies deep. Since the articulated structure built during
knowledge composition can be reported, it is possible to trace any
unexpected result to its sources and locate the source or the specific
interaction. Systems that cannot present their reasoning paths
clearly will require much human effort to understand issues when they
do not perform as expected, or be rejected if the investigation effort
seems excessive.
Scalability interacts with maintenance, since maintenance of systems
beyond their design size becomes increasingly costly. Design size is
not simply a size metric, but must also consider complexity. Arbitrary
structural interactions increase the complexity of maintenance
exponentially, since the effects of any change spread wider and wider.
Most working ontologies are hierarchical, to permit growth. But
limiting larger domains to hierarchies ignores the needs of
intelligent applications, which cannot be limited to simplistic views
of the world.
The model used by SKC provides for distributed, domain-specific
maintenance by experts. In the shoe example, we find maintenance
being needed at the factory level, at the store level, and at the
purchasing agent level.
In the factory example, ontologies may be maintained as two layers:
first, all factories that are members of the International Shoe Cartel
share a common base ontology; but secondly, specific factories may add
terms for more specialized needs. A similar layering will exist in
the store. There will be purchasing of many flavors of goods,
and the managers also have to deal with sales quotas, personnel, real
estate and the like. The articulation knowledge, exploited by the
Ontology Algebra, is maintained by a third unit, the ASPA affiliated
purchasing agents, again possibly with a local extension. Note that
the purchasing agents will not be tempted to create rules merging the
personnel or real estate concepts of the factory and store ontologies.
Those are irrelevant to purchasing.
In an articulation to support taxation, with rules produced by the tax
lawyers, one may find some rules for shared concepts for real estate.
Not only are the knowledge resources partitioned; their processing and inferential rules are partitioned as well. It is not necessary to
build overly fancy processing modules to take care of cases that are
beyond current concerns.
While this maintenance scheme looks complex, it actually focuses
authority and responsibility. The need for committees assembled from experts of many domains, with the resulting delays and compromises, is avoided. Since maintenance cost is often 60% to 80% of computer
systems costs [ref:xx], explicitly dealing with organizational
precepts and tools for making maintenance efficient is wiser than
sweeping the issue under the rug.
Since the processing is partitioned, the computational load can be
distributed over the nodes supporting the base resources, the nodes
performing the algebra on selected knowledge classes and their
instances, and the nodes presenting the integrated information. Such
a natural distribution is likely to be more effective than approaches
where massive, centralized computations are defined initially, which
are then analyzed to infer possible parallelism. However, when
computations are initially massive, they are likely to have internal
linkages and shortcuts, defeating parallelization, unless the
computations are restructured. SKC programs are naturally free from certain unstructured shortcuts, so the result is immediately suitable for parallel execution. Binary ground instances are concentrated in the base resources, so that maximal distribution occurs at the layer with the greatest volume of processing.
Problems to be addressed when joining knowledge bases are differences
among them in representation, structure, and semantics. Many of these
differences are legitimate, and reflect differences in context,
objectives, and tools used by the contributors. Imposing top-down
standards would disconnect contributors from their interests and
productive approaches. SKC faces the issues raised by these realities
by introducing articulation knowledge. Articulation knowledge captures
the experience of an integrator for re-execution and reuse in
alternate configurations and applications. SKC provides a
partitioning for effective, distributed maintenance. Source
ontologies are best maintained locally by domain experts.
Articulation knowledge is partitioned as well, and maintained by
integrators. Tools for constructing articulation knowledge will
exploit the base knowledge resources, in itself an example of
knowledge reuse.
We hence can use distributed computations, or rather their results, to
minimize the effect of semantic heterogeneity [Nadis:96]. The
demonstrations in SKC will show that arbitrary knowledge bases, as
needed by applications, can be composed from independently developed
ground resources. The partitioning can also provide isolation, when
needed for security purposes. Strictly, no active interaction with
base resources is required to build the articulation. Updates of the
base resources will be available to the application as they are made.
Within this proposal we do not expect to profit from this capability,
but we may be able to exploit it within another DARPA-funded project on
Survivability Access Wrappers (SAW) [WiederholdBCS:96].
A crucial feature of the approach in SKC is its scalability. SKC
achieves large-scale operation not by massiveness, but by being able
to focus on relevant information. Intersection operations extract
only the knowledge needed for articulation, typically a modest
fraction of the base knowledge. The union operation allows the merging of
knowledge-bases for joint access, without requiring deep consistency.
Base knowledge remains accessible when needed to compute or expand
selected, relevant instances. Methods remain executable in base
knowledge bases, assuring correct context for their execution.
While SKC will not directly focus on high-performance computing or
communications technology, its underlying structure supports advances
in these areas. Parallel and distributed multi-computer operation are
natural implementations for production systems following the SKC
paradigm and employing the discipline of an Ontology Algebra.
Transmission volume is minimized when articulated intersections of
base knowledge are passed towards the applications.
Shared attributes may also need conversion rules. Shoe sizes, for instance, are encoded differently by European factories, leading to a conditional rule:
if (LOCATION (factory) = `EUROPE')
then SIZE (factory) = SIZETABLE (SIZE (store));
else SIZE (factory) = SIZE (store);
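The size rule above translates directly into code; the table entries below are illustrative stand-ins for a real size-conversion table:

    SIZETABLE = {8: 41, 9: 42, 10: 43}   # store size -> European factory size

    def factory_size(store_size: int, factory_location: str) -> int:
        # SIZE (factory) = SIZETABLE (SIZE (store)) for European factories,
        # SIZE (factory) = SIZE (store) otherwise.
        if factory_location == "EUROPE":
            return SIZETABLE[store_size]
        return store_size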
Color specifications will certainly need a table, since the factory
is likely to use a code
COLORCODE (factory) = COLORTABLE (COLOR (store)),
allowing the store to refer to COLORCODE `XY14WZ' as `Spring
Pink'. These examples should illustrate both the need for an
Ontology Algebra and the approach needed to achieve its
implementation. The number of rules needed to purchase shoes will be modest; otherwise purchasing agents today would already have an impossible task.

4. An Ontology Algebra
We now have introduced the general SKC approach, the problems to be
addressed, and the knowledge needed to achieve the goal of dealing
with information from distinct domains. The crucial features needed
for the Ontology Algebra should now have become obvious: operations
that manipulate knowledge, using articulation knowledge which
specifies relationships among knowledge collections. We expect that
articulation knowledge can be sparse, since only relationships
relevant to an application domain must be described.
We plan to implement difference rather than negation to be assured
that no sets of near-infinite size can be produced.
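A sketch of why difference is safer than negation: difference ranges over a concrete, finite ontology, whereas negation would range over an unbounded universe of terms (names illustrative):

    def difference(onto_a: set, onto_b: set, matches) -> set:
        # Terms of onto_a that the articulation does not link to onto_b;
        # always a subset of the finite onto_a, never an open-ended set.
        return {a for a in onto_a if not any(matches(a, b) for b in onto_b)}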
5. The Scalability Problem and Effect of the SKC Partitioning.
We consider scalability, together with maintainability, the most
crucial issue facing knowledge-based approaches. Systems that work
well in practice have been of modest size [FeigenbaumMN:88]. Analysis of practical applications rarely shows more than a few hundred
inferential rules, although those may be complemented by many ground
(database) instances. They demonstrate the power of the knowledge
paradigm, but not its scalability. Research to improve the scale of
knowledge-based processing by moving those ground instances into
conventional or specialized, high-performance databases is important,
and will actually clarify what constitutes the ground knowledge and what comprises the inferencing part [KarpP:95].

6. Maintenance
For long-lived knowledge-based systems continuing maintenance is
essential. Our knowledge of the world changes over time, and systems
that do not provide for maintenance will be short-lived and not repay
their investment (we have seen such systems in Artificial
Intelligence). As discussed above, maintenance costs of large knowledge bases are likely to be huge. We do not have published figures for knowledge bases, but observe that general software, encoding knowledge procedurally, experiences lifetime maintenance costs equal to or exceeding its acquisition cost. A secondary factor is that much knowledge base maintenance leads to growth, because we keep on learning about the world and how to deal with it.

7. Efficiency in Processing
Knowledge extracted for articulation from multiple resource sets may
be merged, creating larger, yet still modest, knowledge bases. Little
processing will be wasted wading through irrelevant information.
Since many intelligent processing schemes in artificial intelligence,
even when heuristics are employed, have high order polynomial factors
in the size of the knowledge bases, such an economy is essential to
obtain effective processing. We have observed few systems using
several hundred inferential rules; that is, rules that are not
grounded in static, factual atomic values.

Conclusion
The ability of the SKC approach to focus on relevant knowledge, and
to perform distributed search, models the capabilities of effective
human analysts, who will focus on relevant connections, and only when
those connections seem promising, will drill down into the detailed
instances [Fodor:83]. SKC provides tools, based on an algebra that
manipulates the ontologies of the information resources. The
partitioning that is supported matches modern, distributed
organizations which depend already on delegation of authority and
responsibility. Well-structured delegation enhances longevity
through effective maintenance. Having a large, but poorly maintained
and error-prone system is worse than having a smaller up-to-date
system. SKC supports large-scale integrated inferencing over small
knowledge modules, by always selecting relevant knowledge, and placing
it in a hierarchical execution model.
Gio Wiederhold, email to: gio@cs.stanford.edu.
Secretary: Marianne Siroker, Gates Computer Science Bldg. 4A, room 436, email to: siroker@cs.stanford.edu.