Professor Jack Milton
Department of Mathematics
University of California at Davis
Davis, CA 95616
Prof.~Stefano Ceri (summers 1982-1988)
Politecnico di Milano
Milano, Italy
David Beech
Hewlett-Packard Research Laboratory, Palo Alto, CA; Oracle
Robert Blum
MD, PhD, now in practice, Menlo Park CA.
Ralph Cherubini
Digital Equipment Corporation, Hudson MA
Christian Esculier
Paris and Univ.~of Grenoble, France
Barbara Grosz
PhD, SRI International; now Professor at Harvard Univ.
David Hartzband
Digital Equipment Corporation.
Gary Hendrix, PhD
SRI International; now at Symantec, Menlo Park
Jerrold Kaplan, PhD
Lotus Corp.; Go Corp.; ONSale Corp.
Prof.~Sham Navathe
University of Florida, Gainesville, FL; Georgia Tech.
Gordon Novak, PhD (1977)
SRI International, now at Univ.~of Texas at Austin
ChoChun Hsu
Beijing University
Arthur M. Keller, PhD
Stanford University and University of Texas at Austin
Kurt Konolige, PhD
SRI International
Ingeborg M. Kuhn, PhD
Stanford AAMRS project; later at Veterans Admin., San Francisco
Prof.~Robert Paige
Rutgers University, New Brunswick NJ; now at NYU
Udo Pletat, PhD (summer 1988)
IBM Research, Stuttgart, FRG; now a professor in Germany
Tore Risch, PhD (1990-1991)
HP fellow, now Professor at Linköping Univ., Sweden
Earl Sacerdoti, PhD (1977-1980)
SRI International; now at Teknowledge, Palo Alto
Daniel Sagalowicz (1977-1981)
PhD, later at Teknowledge, Palo Alto; Framentec, Monaco, and now at Oracle.
ZaiSheng Song
Beijing University
ShiWei Tang
Beijing University
Mike Walker
Stanford University (later MIS PhD); now a Consulting Professor in the MIS program.
Prof.~David Warren
SRI International; now Professor at Manchester Univ., Great Britain
Richard Weyhrauch, PhD
SRI International and Stanford University
Prof.~Marianne Winslett* (1983-1988)
Professor Univ.~of Illinois
JuanFen Yu
Beijing University
(Format of entries: name, years of participation where known, and subsequent affiliations.)
A task (6) on disjoint domains was supported directly by SPAWAR. Implementation of concepts developed earlier is supported by the DoD STARS program, through a contract with SRI International. Knowledge-acquisition research was partially supported through the RX project, funded by the National Center for Health Services Research (HS-3650 and HS-4389) and by the National Library of Medicine (LM-4334).
Work in natural-language interaction has been partially supported by NSF (IST-8023889).
Computer facilities for medically related applications were provided by the Division of Research Resources of NIH (RR-785). Additional equipment is being provided through a cooperative grant provided by Digital Equipment Corporation, through their Hudson, MA, Artificial Intelligence Research Group.
The views and conclusions contained in this document are those of the authors and should not be interpreted as representative of the official policies, either expressed or implied, of any agency of the U.S. Government.
We have investigated a wide variety of types of knowledge about data and databases, from relatively formal semantics based on relational models to probabilistic and uncertain concepts. The goal of the individual investigations is to devise methods which make access to data more convenient, more effective, or more efficient.
Data and knowledge bases will be of value only if they are comprehensive within their scope and up to date. We have devoted much of our research effort to the problems arising in the maintenance of large databases and, more recently, large knowledge bases.
Managers are aware of the value of control of and access to information; they depend on control of the data developed and used by their enterprises. The centralization of this data, which until recently was an essential part of having and sharing large databases, has resulted in a loss of control because of awkward access and dependence on remote personnel. While economic and technological considerations have favored centralization of data storage and processing during the past, recent directions in hardware development, including VLSI technology, are making available networks of smaller, yet very powerful, special purpose computers that can be used to collect, store, and process information close to its source. Loosely-coupled networks of different machines, personal workstations, etc., are likely to be the dominant type of computing environment within a few years. Consequently, local control of local information will be the rule rather than the exception. Such distributed environments pose new difficulties for the management of data resources. For instance, while control over local data has improved, access to remote data may be more difficult.
Another challenge presented by the move to distributed data and distributed systems is that the knowledge and authority required to manage the diverse sources of information are themselves distributed among multiple sites [Wiederhold;82e]. Familiar methods of data management--in which perfect information about the structure, semantics, and content of the data is normally assumed--are not adequate to deal with the less structured environment encountered in truly distributed systems. We have found such areas, where knowledge is not perfect, ideal for the application of artificial intelligence techniques.
Developments in artificial intelligence over the last two decades provide concepts that are of value in increasing the responsiveness of database systems to their user's needs. To pursue our work in applying artificial intelligence techniques to the practical domain of database management, we study the problems of data management in centralized and distributed computing environments, and the management of the knowledge that pertains to that task.
Difficulties occur in every aspect of data management: database design, understanding the information potential of the database, and actual use of the computer. Specific issues include data abstraction, query processing, security, redundancy, optimization, integrity, and decentralized processing. The management of knowledge becomes more specific when applied to databases. While the quantity of knowledge needed may be large, it has a scope defined by the corresponding databases and operates in a world which in many respects is closed relative to that scope.
Our research probes the quantitative and complexity boundaries of data management. Large quantities of data, complex data relationships, and the limits of storage technology all require that advanced techniques be pursued if realistic problems are to be approached effectively.
In order to understand the problems formally we typically begin by investigating methods and algorithms available in database design and operation [Wiederhold;83b]. We first ascertain the limits of algorithmic approaches and new technology, and subsequently develop heuristic methods to probe beyond those boundaries. In much of our research we subsequently bring to bear the techniques developed in the artificial intelligence (AI) community for dealing with partially understood, arbitrarily structured problem domains [Barr;80], [Wiederhold;83f].
The introduction of heuristics leads to imprecision. We can quote, however, the British mathematician Alan Turing [Hodges;83], who stated in 1946:
In other words then, if a machine is expected to be infallible, it cannot also be intelligent.
The capability of symbols, manipulated through these AI techniques, to represent generalizations and abstractions, provides the basis for intelligent behavior. Recognition of uncertainty, which is an essential aspect of dealing with higher level concepts rather than with specific instances, is hence an essential feature of knowledge in our problem domain.
The specific AI tools are based on such concepts as semantic modeling and build on existing database technology. The databases are effective keepers of information describing the instances to which the knowledge is applied [Wiederhold;86a]. Although the traditional province of database management is business data processing, we are also considering the data management of large-scale scientific problems [Wiederhold;82g]. The diagram in Section 10 of this report provides a layout of the tasks we have addressed.
To provide an experimental basis for our research we maintain several large databases. One of these includes data on $34\,000$ merchant ships, ports, and shipping information. A second database contains design specifications for the dec pdp-11 computer.
Most of the early work was carried out on dec-10 and 20 computers at SRI International and Stanford. For the new tasks, described in Section 5, we have equipment configured to emulate a modern workstation, with 8 megabytes of memory and 600 megabytes of storage, and access to the ethernet. The current computing engine is a VAX 750, with graphic and speech output capabilities. Primary system software is Common Lisp and the rdb relational database under VAX vms.
A variety of database-management approaches have been used in our research: network [codasyl;71,74], relational [Codd;70,71], [Blasgen;77], [Held;71], and special-purpose ones. The medical data are kept on tod [Weyl;75]. The bibliographic database is available for search in fully inverted form using the news service (ns) system on the Stanford sail computer. This bibliography can also be shipped over the arpa net. A printout can be furnished upon request, but we consider it too large for manual perusal.
We concentrate on promising task areas within our research definition rather than on building complete systems. Since this note is intended principally as a review of our own research, we have not cited all of the many publications of other authors that have provided us with pertinent results, important ideas, and useful guidance in respect to general research directions. General references which contributed significantly to our research are listed in Section 12 of this report, while the papers published under the aegis of the KBMS project are listed in Section 11.
We have categorized our efforts into six areas, although many issues overlap. The six sections (4 through 9) which follow provide short summarizations of our research. Many subsections represent a PhD dissertation and related research. Section 10 presents general conclusions about the results obtained.
We will now present these topics in more detail and cite the relevant references. Note that this precis is limited to work actually performed. Work being planned or in progress will be reported in further editions of this precis, as well as in separate publications.
The databases, the mediators, and the applications will all reside on nodes of powerful networks. The end-users will always have computers available to serve their specific tasks. We refer to those machines as application workstations, although they may at times be large and powerful processors.
The data accessed by the first four SoDs are distinct views of an extensive bibliography of knowledge and database references. Information exploited includes type of publication (for the Quality-SoD), authors (the principal identifiers), author's location (for the Conflict-avoidance-SoD), publication details and sequence with dates (for the Conflict-avoidance-SoD), coauthor lists (for the Coauthor avoidance-SoD), as well as title, abstract, and classification (the last three are used by the Relevance-SoD).
Two of the SoDs are currently implemented--relevance and conflict avoidance. They are implemented as Lisp programs which have access to the object system and a commercial relational database. As we gain more experience, we intend to replace the Lisp code with a more declarative representation.
At the simplest level, the relevance SoD takes a keyword (or list of keywords) describing the subject of the paper to be reviewed and looks in the database for authors who have written papers on the keyword(s). If enough authors are returned by this database query, this is all that happens. If, however, the database query does not find enough authors, or if the application asks for more candidate reviewers later, the relevance SoD replaces the original query with a more general one, in order to increase the cardinality of the result.
This capability is an example of query generalization [Chaudhuri;89]. It is possible because the SoD makes use of some of the semantics of the keywords. The keywords are arranged in a hierarchy, in which the parent is the more general keyword and the children are more specific. If a query does not return sufficiently many results, a concept of semantic distance in the hierarchy is used to suggest alternate keywords to try.
The data structures used by the SoD are designed to support this kind of iterated query style efficiently. A set of authors found by a succession of related queries can be assembled with about the same total effort as would be needed to find that same set of authors with one more general query.
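For illustration, a minimal Common Lisp sketch of this generalization loop follows; the keyword hierarchy, the author index, and the function names are hypothetical stand-ins for the SoD's actual data structures and database access.

;;; Hypothetical sketch of query generalization over a keyword hierarchy.
(defparameter *keyword-parent*
  '((deductive-databases . knowledge-base)
    (object-databases    . knowledge-base)
    (knowledge-base      . databases))
  "Child -> parent links of the keyword hierarchy.")

(defparameter *authors-by-keyword*
  '((deductive-databases "author-1" "author-2")
    (object-databases    "author-3")
    (knowledge-base      "author-4"))
  "Toy author index; normally the result of a relational query.")

(defun authors-for (keyword)
  "Authors directly indexed under KEYWORD."
  (cdr (assoc keyword *authors-by-keyword*)))

(defun children-of (keyword)
  (loop for (child . parent) in *keyword-parent*
        when (eq parent keyword) collect child))

(defun descendants (keyword)
  "KEYWORD together with all keywords below it in the hierarchy."
  (cons keyword (mapcan #'descendants (children-of keyword))))

(defun find-candidate-authors (keyword wanted)
  "Collect at least WANTED authors, generalizing the query upward through
the keyword hierarchy whenever the result is too small."
  (loop with current = keyword
        for authors = (remove-duplicates
                       (loop for k in (descendants current)
                             append (authors-for k))
                       :test #'string=)
        when (>= (length authors) wanted)
          return authors
        do (let ((parent (cdr (assoc current *keyword-parent*))))
             (if parent
                 (setf current parent)
                 (return authors)))))

;; Example: one generalization step is needed to satisfy the request.
;; (find-candidate-authors 'deductive-databases 3)
;; => ("author-4" "author-1" "author-2" "author-3")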
The application interface is simply a set of Lisp functions which the application can use. As our system evolves, we intend to build a higher-level interface. In our design, an application which is looking for reviewers would submit a query of the form:
  select best 3 reviewer from relevance-SoD
    where relevant = 'knowledge-base'
    and reviewer not in
      (select all friend from conflict-avoidance-SoD
       where author = 'Gio Wiederhold')

Note that this query refers to two different SoDs. Since a particular SoD can only answer queries about its own domain, this query is translated into a slightly lower-level form, which specifies the individual queries to the SoDs and the information flow between them:

  X := select all friend from conflict-avoidance-SoD
       where author = 'Gio Wiederhold'
  Y := select best 3 reviewer from relevance-SoD
       where relevant = 'knowledge-base' and not in X
  RETURN Y

Note that even this second query is fairly high level. It refers to such abstractions as relevant, which are implemented by the SoDs.
At the end of the funding period we shall be able to demonstrate the complete system, with all five SoDs. In addition, we expect to be able to demonstrate advanced techniques for combining the knowledge and results from the five SoDs to achieve a balanced assessment of the reviewers. In particular, this requires a means to weigh conflicting information from the SoDs, as a given individual may be outstanding according to some criteria but poor according to others.
The primary intent of our formal model is to remedy the lack of structural abstraction in the relational model. Objects are specified recursively as follows: A base object is a non-empty set of projections defined on the base relations. A composite object is a set of other objects. Composite objects can be further subdivided into non-recursive and recursive objects. To render practical the mapping between the relational store and the object working representation, we add constraints to this general definition. Those additional constraints rely on the use of a semantic data model--the structural model--and enable us, among other things, to derive a unique hierarchical structure for base objects, to support record-valued and set-valued domains, and to guarantee termination for specific types of composite recursive objects.
Our prototype system, called penguin, defines three components for the object layer: (1) The object generator maps relations into object templates where each template can be a complex combination of join and projection operations on the base relations. (2) The object instantiator provides nonprocedural access to the actual object instances. Combining the database-access function (stored in the template), and a declarative query, penguin automatically generates the relational query and transmits it to the dbms, which in turn transmits back the set of matching relational tuples. Those tuples are then assembled into new object instances; (3) The object decomposer implements the inverse function; that is, it maps the object instances back to the base relations. This component is invoked when changes to some object instances need to be made persistent at the database level.
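As a rough, hypothetical miniature of this division of labor (the toy schema and all names below are ours, not those of penguin itself), an object template can carry both the relational access function used by the instantiator and the procedure that assembles the returned tuples into object instances:

;;; Hypothetical miniature of the object layer (toy schema, invented names).
(defstruct object-template
  name          ; symbol naming the object type
  access-fn     ; closure producing the relational query for the DBMS
  assemble-fn   ; closure turning the returned tuples into object instances
  decompose-fn) ; closure mapping changed instances back to base relations

(defparameter *ship-template*
  (make-object-template
   :name 'ship
   :access-fn
   (lambda (where)
     (format nil "SELECT s.id, s.name, v.port FROM ships s, visits v ~
                  WHERE s.id = v.ship_id AND ~a" where))
   :assemble-fn
   (lambda (tuples)
     ;; Group the flat join result into one object instance per ship id.
     (let ((ports (make-hash-table :test #'equal))  ; id -> list of ports
           (names (make-hash-table :test #'equal))  ; id -> ship name
           (order '()))
       (dolist (row tuples)
         (destructuring-bind (id name port) row
           (unless (gethash id names)
             (setf (gethash id names) name)
             (push id order))
           (push port (gethash id ports))))
       (mapcar (lambda (id)
                 (list :id id :name (gethash id names)
                       :ports (nreverse (gethash id ports))))
               (nreverse order))))))

(defun instantiate-objects (template where run-query)
  "Generate the relational query for TEMPLATE, submit it via RUN-QUERY
\(a stand-in for the DBMS interface), and assemble object instances."
  (funcall (object-template-assemble-fn template)
           (funcall run-query
                    (funcall (object-template-access-fn template) where))))

;; Example, with the DBMS call stubbed out:
;; (instantiate-objects *ship-template* "s.flag = 'Liberia'"
;;   (lambda (sql) (declare (ignore sql))
;;     '(("S1" "Maria" "Oslo") ("S1" "Maria" "Rotterdam"))))
;; => ((:ID "S1" :NAME "Maria" :PORTS ("Oslo" "Rotterdam")))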
We expect our results to indicate that an object-based architecture provides (1) efficient access to and manipulation of complex units of information while preserving the advantages associated with persistent storage of data in relational format, and (2) a domain-independent, bidirectional communication channel between relational database systems and expert systems.
For ad hoc querying, linear query languages have been widely used. sql and restricted forms of natural language are such languages. Most ad hoc query interfaces based on linear query languages are, however, passive. Their behavior is characterized by the passive reception of user queries and the provision of isolated textual user guidance. Passive ad hoc query interfaces are more frustrating to end users than a typical operating system interface, because database queries tend to be more complex than operating system commands, and thus more prone to various types of query failure. Such query failures are a major obstacle to end-user productivity.
Kaleidoscope is a cooperative ad hoc query interface designed with the productivity of end users in mind. Two query languages are considered: sql and a restricted form of natural language. The key feature of Kaleidoscope designed for increasing the productivity of end users is multi-step in-query cooperative guidance. Instead of receiving query constituents passively, the system leads users through a sequence of cooperative interactions to guide them away from the potential failures in a query. At each step, a set of legitimate query constituents or templates for such constituents are proposed to the user based on the system's knowledge of query language grammar and the structure and content of underlying databases. In addition, any information that is derivable by adding a query constituent to the partial query structure (but that may be unknown to the user) is displayed on a system message window. Essential to Kaleidoscope's cooperative interactions is a menu front-end that facilitates the communication between the user and the system as in the so-called menu-based natural language interfaces NLMenu and inglish.
Kaleidoscope's in-query cooperative guidance is related to past work on post-query cooperative response. Many types of knowledge turn out to be useful for both in-query guidance and post-query response. However, Kaleidoscope's in-query guidance is based on a more active way of utilizing such knowledge than post-query cooperative systems: if the system is knowledgeable enough to suggest the post-query correction of failed queries, its knowledge should instead be used to guide users away from query failure. Even if most types of query failure are prevented in Kaleidoscope, it is still possible that the result of a query may be unsatisfactory to the user. For instance, a query may produce no tuples, or it may produce too many tuples. In such cases, post-query cooperative response may complement in-query guidance.
We consider knowledge to be represented as logical theories, perhaps augmented with non-monotonic assumptions, such as a closed world. Integrating parts of a knowledge base then becomes semantically equivalent to the process of combining subtheories to give a global theory. Since the resulting knowledge base may contain much information which is irrelevant to a given user, it is then helpful to have a capability for customizing the subsets of the integrated knowledge base to give user views.
We need a firm foundation before such operations can be defined. Therefore a portion of this work has focused on issues of semantics of knowledge bases. First order logic is not capable of representing the non-monotonic assumptions which a knowledge base needs to accurately depict the world. To gain such a non-monotonic capability, we have turned to a circumscription-based formalism.
Circumscription provides a very powerful and precise tool for defining unique name and closed world assumptions. Investigations into the unique name assumption have resulted in a new form of circumscription that is well suited for representing the semantics of knowledge bases. It is the first form of circumscription to handle equality in a principled fashion, and the first to provide unique names assumptions by default.
A prioritized extension of equality circumscription can then be used to define the semantics of integrating knowledge components and of defining views on the resulting knowledge base.
A practical benefit is that the semantic description provides a guide for implementation. In particular, it provides a test for when queries to the global knowledge base may be decomposed into queries to the components, as well as a technique for performing the decomposition.
This formalism also suggests the beginnings of a design theory for knowledge bases. In particular, it provides checks for when a knowledge base contains redundant information, as well as a test for when a view faithfully reflects the state of the global knowledge base [Rathmann;89].
We present a solution to the problem of correctly performing relational database query operations despite data domain mismatch. We use the mechanisms of domain mappings, virtual attributes, and partial values to resolve domain differences and to map conflicting attribute values to common domains. We formalize operations over relations containing such virtual attributes and partial values by the definition of an extended relational algebra. We analyze the properties of this extended algebra and describe the system we have implemented that uses it to resolve domain mismatch and to perform operations over information from mismatched domains. This work enables information derived from multiple sources to be presented in an integrated form [DeMichiel;89a, b].
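To convey the flavor of partial values (the Lisp representation below is our own simplification, not the extended algebra of [DeMichiel;89a]), an attribute whose exact value is unknown after a domain mapping can be carried as the set of values it may take, with comparisons yielding a three-valued result:

;;; Hypothetical sketch of partial values: an attribute whose exact value is
;;; unknown is represented by the set of values it may take.
(defun partial-value (&rest possibilities)
  "A partial value is the non-empty set of values the attribute may have."
  possibilities)

(defun pv-equal (a b)
  "Three-valued equality over partial values (singletons model ordinary values)."
  (cond ((and (null (cdr a)) (null (cdr b)))
         (if (equal (car a) (car b)) :true :false))
        ((intersection a b :test #'equal)
         :maybe)
        (t :false)))

;; Example: a grade reported as "B" at one site maps to the score range
;; 80-89 at another; joining on score can then yield maybe-tuples.
;; (pv-equal (partial-value 85)
;;           (apply #'partial-value (loop for s from 80 to 89 collect s)))
;; => :MAYBE
;; (pv-equal (partial-value 85) (partial-value 85)) => :TRUE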
The results in this thesis are validated using several different synthetic benchmarks and two different models of join-processing costs: a conventional disk-based model and a model for memory-resident databases. We develop a cost model for join processing in main-memory databases. This model is validated for the uniform distribution and for more general distributions, including the Zipf distribution. Simulations based on the cost model are used to study the performance of various join methods.
To study the characteristics of the search space of query plans, we obtain the statistical distributions of query plan costs. We observe that, as the number of joins increases, the difficulty of the optimization problem increases dramatically in the range considered. We study the effectiveness of the heuristic of postponing cross products as late as possible. We find that the number of sampled query plans of low cost increases significantly using the heuristic. For 10-join queries, the percentage of low-cost query plans increases from 2.4\% to 79.6\%, and for 50-join queries from less than 0.1\% to 3.8\%.
We next address the problem of determining good methods for optimization. Taking advantage of the commonly used heuristics of pushing selections down and performing projections as soon as possible, the problem of optimizing large join queries is reduced to the problem of determining the optimal join order. The latter problem is a hard combinatorial optimization problem. We apply general techniques, such as iterative improvement and simulated annealing, that have proved effective in a variety of combinatorial optimization problems. We then study the use of heuristic techniques, such as augmentation and local improvement, in combination with the general combinatorial techniques. We observe great differences between the many different methods considered. We find that two combinations of the augmentation heuristic and the iterative improvement technique are superior to all the other methods.
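A minimal sketch of the iterative-improvement component is given below; the cost function and the simple swap neighborhood are placeholders, and the thesis combines such local search with the augmentation heuristic and the detailed cost model described above.

;;; Hypothetical sketch of iterative improvement over join orders.
;;; A join order is represented as a permutation (list) of relation names.
(defun random-order (relations)
  "Return a uniformly random permutation of RELATIONS (Fisher-Yates)."
  (let ((v (coerce (copy-list relations) 'vector)))
    (loop for i from (1- (length v)) downto 1
          do (rotatef (aref v i) (aref v (random (1+ i)))))
    (coerce v 'list)))

(defun swap-two (order)
  "Return a neighboring join order obtained by exchanging two positions."
  (let* ((v (coerce (copy-list order) 'vector))
         (i (random (length v)))
         (j (random (length v))))
    (rotatef (aref v i) (aref v j))
    (coerce v 'list)))

(defun iterative-improvement (relations cost-fn &key (starts 10) (moves 200))
  "Repeated local search over join orders: from several random starting
orders, apply cost-decreasing swaps and keep the best order seen."
  (let (best best-cost)
    (loop repeat starts do
      (let* ((order (random-order relations))
             (cost  (funcall cost-fn order)))
        (loop repeat moves do
          (let* ((candidate (swap-two order))
                 (c (funcall cost-fn candidate)))
            (when (< c cost)
              (setf order candidate
                    cost  c))))
        (when (or (null best-cost) (< cost best-cost))
          (setf best order best-cost cost))))
    (values best best-cost)))

;; Example with a toy cost function that merely penalizes a late appearance
;; of relation R3 (a real cost model would estimate I/O and join sizes):
;; (iterative-improvement '(r1 r2 r3 r4 r5)
;;                        (lambda (order) (position 'r3 order)))
;; => (R3 ...), 0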
Database programming is an instance of programming. Besides the common properties it shares with programming in general, database programming requires knowledge of database semantics--the rules and constraints applied to the database so that it remains consistent and sharable by many users. Such knowledge is necessary both to maintain database integrity and to explore more optimization opportunities. Automated programming in databases is common. The fruitful research in query processing, semantic query optimization, and logic programming shows that automated programming of data-retrieval programs is not only possible but also feasible. Similar approaches can be taken to programming data-manipulation programs, namely transactions [Qian;86, 87, 88a, 88b, 88c].
The paradigm of database management makes automated programming more desirable. Since most database systems today support to various degrees the specification of database semantics, it is time to provide automated tools that exploit this semantics. User-supplied transaction specifications are often small increments to the specification of database semantics, which is oriented towards human understanding rather than machine execution, and transaction programming involves routine translation of such combined specifications into procedural code. Programmers rarely have a complete understanding of database semantics. For reasons such as ease of use and security, databases often provide programmers with access to views, which are selected small portions of the data. Transactions are reused over long periods of time through changing situations, while programmers do not stay. Moreover, tremendous opportunities for optimization exist because transactions usually perform small changes to database states. Since manual optimization carries extremely high risks of violating database semantics, it should be the responsibility of database systems to synthesize, from transaction specifications, executable code that preserves the validity of database semantics.
Database technology makes automation more feasible. Less effort is needed in writing transaction specifications because a large portion of them, namely the database semantics, is already present. Transactions are dominated by data manipulation instead of computation. On the average, eighty percent of the common transaction code is about constraint checking and enforcement. A small number of language constructs, such as insert, delete, modify, test, and iterate, constitute a reasonable transaction language. Simple algorithms and regular data structures suffice for most of the data manipulation jobs. These features significantly simplify the synthesis task.
Certain aspects of database management raise new challenges to automated programming. Transactions are imperative programs which manipulate database states. Specifications are often incomplete, requiring the synthesis of transactions with attached applicability conditions. Database semantics must be utilized to optimize both the synthesis process to reduce search effort and the transactions generated to enhance runtime performance.
We approach the synthesis of transactions which preserve the validity of semantic integrity constraints using deductive techniques. A class of iterative transactions is defined and shown to have exactly polynomial-time expressive power. An iterative transaction logic is developed as the formalism with which the synthesis of iterative transactions can be conducted. Transactions are generated as the by-product of proving specifications in the logic. The deductive-tableau proof system developed by Manna and Waldinger is extended with inference rules for the extraction of iterative transactions from proofs, which require the cooperation of multiple tableaux. It is shown that the proof system is sound and complete for the synthesis of the subclass of first-order transactions. Properties of transaction specifications are studied to identify a class of bounded stratified specifications for which the synthesis process always terminates. Control strategies are also developed which utilize the database semantics to reduce search effort in the synthesis process and to enhance the runtime efficiency of the transactions generated. Our approach is demonstrated with a prototype implementation.
Further avenues of design oriented investigation have been distributed database design and the design of database management systems.
Our work on database design is based on the use of these structurally relevant semantics, which form the basis for defined connections in the database. Three connection types are adequate to model any structural database constraint other than arbitrary functions: ownership, reference, and subset. When the expected usage loads are mapped to defined connections, the codified knowledge will provide the quantifiable guidance for the physical design. The correctness of binding decisions is necessary to make large databases capable of acceptable performance [Wiederhold;81b], [Ceri;81,85b]. The existence of a model also allows making decisions about reconfiguration of a database as the loads change over time. The problems of asynchronous operation of distributed databases are being modeled using the identity connection [Wiederhold;87].
It does not matter in what manner the connections are implemented. This permits the logical design to proceed independently of the eventual implementation. This freedom is critical for databases which are expected to outlive current technologies and for databases which may be implemented differently on multiple sites in a distributed system.
The primitives of the structural model are also applicable to databases serving non-traditional functions, such as engineering information systems for hardware [Wiederhold;82b] and maintenance control systems for software [Olumi;83].
An alternative approach, distributing integrity authority and responsibility to autonomous local sites has been explored in the context of networks of personal computers. The bulk of actual maintenance actions can be delegated to rule-driven agents, which inspect the update messages [Wiederhold;84d].
A straightforward algorithmic optimization approach is not feasible when designing a multi-file database, since the number of combinations in the decision space greatly exceeds reasonable time bounds. Addressing this problem, we have developed a method which will provide implementation guidelines for a relational database design in close to linear time [Whang;81a, b, 83a, 84, 85], [Wiederhold;83a]. The concept developed is termed separability of joins between the participating relations.
When the relations of a database are separable, the allocation of clustering and indexing choices for a database implementation can proceed relation by relation. The assumptions required for this approach appear to be fulfilled over a wide range of practical cases. We have also extended this approach to network databases [Whang;82].
A by-product of this research is an improved equation for the estimation of block accesses when retrieving records from files [Whang;85]. This formula has been adapted for the optimization of query processing in relational systems.
Our work on database partitioning uses structurally relevant semantics to define connections in the database. These connections and the expected usage loads provide the guidance for the physical binding design decisions [Ceri;81].
This issue maps directly into distributed systems: the cost of connections that go via communication links becomes even more critical. We assume initial user input to select candidate fragments of the database for partitioning and a list of important transactions using those fragments. With each transaction an origin site is associated. A simple model that permits replication but ignores storage and update costs would have that site contain all its needed fragments.
When storage and update costs are considered, the optimal site assignment for the fragments is very difficult. We have been able to formulate and solve an integer programming model for this problem for a modest number of fragments and transactions when there is no replication. We have also analyzed the heuristics required when replication is permitted [Navathe;82], [Ceri;83].
Multiple processors may also be used in clusters, perhaps connected by a high-bandwidth local net, in order to achieve higher aggregate performance than a single large processor can provide. When the processors are clustered, the entry point for a transaction is of little concern. Partitioning is now constrained by limits imposed by processor, input-output, and communication capability [Sacca;83], [Sacca;85].
However, in distributed databases a major technique to achieve acceptable performance is the replication of data [Garcia;78c], [Barbera;82]. We are beginning to model this technique using the concept of an identity connection [Wiederhold;83b]. When replication is used in the centralized environment, the binding is implemented quite rigidly (i.e., no access is permitted until updated replicated elements are all set to identical values) [Wiederhold;81b]. We find that the cost of maintaining rigid identity connections is quite high in distributed systems, and are investigating methods that permit a relaxation of those constraints.
In all those approaches to database design, the limits of algorithmic design methods seem to have been reached. The additional complexity that occurs when the fragments can be replicated always requires heuristic approaches [Ceri;81].
When databases are not maintained through a central administrative service, we speak of a loosely coupled system. These systems will require techniques to deal with the uncertainty of identifying relevant data, locating their instances, and determining their status in terms of currency and quality. We have begun to investigate techniques where expert systems help manage these issues [Apers;84b, c]. While there is an inherent level of uncertainty in the obtained results, it appears that the uncertainty can be quantified and manipulated. Development of adequate techniques in this arena is essential so that the benefits of autonomous systems are not outweighed by the difficulties of sharing data in these systems.
This research is motivated by a plan to re-engineer a large multi-site command and control system, WWMCCS, using modern software principles. In accordance with Department of Defense (DoD) policies, the software for the WWMCCS Information System (WIS) is designed to be implemented in Ada.
The methodology of the specific work is to start out fresh. This permits the use of modern software and data engineering approaches to database systems, using the power of Ada in the most appropriate way. Most of us believe strongly that the cost of adapting re-engineered applications to 10- and 20-year-old technology, as exemplified by existing DBMSs, is greater than the cost of rethinking and a new DBMS implementation. If we do not put the principles we have learned in the past twenty years to work now, we will be overtaken by others who do not carry the baggage of the past.
The proposed architecture takes advantage of the structure of Ada. Since Ada supports alternative code bodies with the same declarative specification, it becomes natural to maintain and use alternate modules. However, in order to keep the specification unchanged, the initial design must consider the information requirements by any candidate module. In practice, modules developed in the prototyping stage will not use information needed for an optimizing version of the same module type. Similarly, modules for a workstation database will not use much of the security information available for protection in a shared environment [Wiederhold;83e], [Friedman;86].
The context of this research project reinforced the consideration of security and access controls within the initial design, rather than treating security as an add-on. If a high level of access protection is needed, overall reliability and performance will be improved by incorporating clean and appropriate interfaces in the original system design. Whenever the requirements are less strict, the cost of having null or minimal modules in their place will be small [Spooner;85].
Proposing a fresh start for a major system carries a considerable risk. Doing this in the database area is made more feasible due to the initial orientation of Ada. Since Ada was designed as a language for embedded systems, it did not inherit any prior data-processing notions, a problem other general purpose languages such as pl/i have to cope with. Only the most primitive input-output functions are specified in Ada. Hence, we were able to start with a relatively clean slate.
This project presents an example of a data-oriented design methodology. In software projects where we expect that a common data structure will serve many application functions, a data semantics-oriented approach may be a better alternative to a top-down design starting from a procedural objective. In this data engineering approach we still proceed top-down, but start from the information requirements.
Especially when we envisage that modules are replaceable, procedural specifications do not provide a solid base. No procedure can generate information unless the required data and knowledge are made available to the module. When data and knowledge are identified initially, then, in most cases, a competent programmer will have few problems in coding the required transformation procedures and generating a successful program.
On reflection, it appears that the architectural approach proposed, as a conceptualization of architectures found in current DBMS, bears a significant resemblance to the blackboard concepts used in some Artificial Intelligence Systems. An early example of a blackboard is found in hearsay [Reddy;76] and recent work is bb1 [Hayes;85].
Making the distinction in this design process of:
helped reveal the design issues and the rules for information management within the design.
The experience gained in the KBMS research and from other sources is being marshalled for a major subproject, ksys. The other sources include rx, implementing Knowledge Extraction from Databases [Blum;80, 86a, 86b]. Rx uses frame technology to manage a fairly large database [Wiederhold;81], [Wiederhold;85a]. Additional experience is derived from infobase, developed at DEC [Cherubini;84] and made available to us.
The ksys project addresses specifically the need for an effective interface between AI Systems and Database Management Systems [Ceri;86]. The knowledge managed by the AI system represents an enhanced form of the meta-data used to describe databases. The DBMS maintains the facts which represent the continuously changing world and permit new inferences to be drawn by the AI systems linked to the database. We have found, however, that the problems in communicating between AI and database systems are not just in building an interface itself. Each side depends on its own representation structures and inference methods [Wiederhold;86a]. In an integrated system the strengths of each must be exploited. This direction reflects a maturation of the research in this field, where we can go beyond existence proofs and consider quantitative measures and architectural feasibility as well.
The ksys architecture uses:
To support the learning process it is necessary to abstract details, stored in the database, to higher-level concepts. By moving to abstractions we can bridge missing data [Blum;81]. If we let the concepts extend over time, we can also start recognizing change, a stronger impetus to action than conventional reports describing a snapshot view of the world [Downs;86], leading to graphic summarizations of a patient's history [deZegher-Geets;88, 88].
However, databases are often quite large and subject to frequent updates. Both of these characteristics render impractical the encoding and maintenance of a lexicon associated solely with the query-language processor, since such a lexicon requires separate maintenance.
We have developed and implemented a technique for reducing the number of lexical ambiguities for unknown terms by deferring lexical decisions as long as possible. The database itself is then accessed to try to resolve the ambiguities and complete the parsing. A simple cost model selects an appropriate method for resolving any remaining ambiguities [Davidson;80].
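A minimal sketch of the deferral step, with hypothetical names (the actual technique and cost model are described in [Davidson;80]): an unknown word is carried together with its candidate interpretations, and the database is probed to prune them before any clarification is requested from the user.

;;; Hypothetical sketch: defer the lexical decision for an unknown word and
;;; let the database prune the candidate interpretations.
(defun resolve-unknown-word (word candidate-columns occurs-in-column-p)
  "Return the candidate columns in which WORD actually occurs.
OCCURS-IN-COLUMN-P is a stand-in for a database probe; if more than one
column survives, a cost model would choose how to resolve the remainder."
  (remove-if-not (lambda (column) (funcall occurs-in-column-p word column))
                 candidate-columns))

;; Example: "Liberia" is unknown to the lexicon; probing shows it occurs
;; only in the flag column, so no clarification dialog is needed.
;; (resolve-unknown-word "Liberia" '(ship-name flag port)
;;                       (lambda (w c) (and (string= w "Liberia") (eq c 'flag))))
;; => (FLAG)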
Dealing with incompleteness in databases is a major thrust of the KBMS project. It is motivated by the observation that databases, in practice, do not contain data for all data attributes for all stored entities. This is in part an essential result from the concept of sharing data, and cannot be avoided by exhorting users to enter all data. A database pools data from a variety of sources and each source operates under different preconditions and in a different environment. It cannot be expected that a single person or unit will have the global, detailed knowledge necessary to manage the entire database; rather, each specific user of the database will be qualified to manage only a subset of this large collection.
Specific areas of our research have included: view management, natural language or ad-hoc planning updates, subset definition, and as discussed earlier, the effect of null values and other forms of incomplete information.
In the AI domain, the problem of updating a database of incomplete information arises when an agent tries to incorporate new, possibly contradictory information into a database of beliefs reflecting its partial knowledge of the world. In the logic domain, one might choose to represent such a database as a logical theory, and view the models of the theory as possible states of the real world.
How can new information (i.e., updates) be incorporated into the database? For example, given the new information that ``b or c is true," how can we get rid of all outdated information about b and c, add the new information, and yet in the process not disturb any other information in the database? The burden may be placed on the user or other omniscient authority to determine exactly which changes in the theory will bring about the desired set of models. But what's really needed is a way to specify an update intensionally, by stating some well-formed formula that the state of the world is now known to satisfy and letting the database management system automatically figure out how to accomplish that update.
Alternatives for semantics of databases with incomplete information and the ramifications of these semantics have been summarized in a paper which won the student award at the IEEE Data Engineering conference in Los Angeles, 1984 [Keller;84a, 85b].
We have explored a technique for updating databases containing incomplete information. Our approach embeds the incomplete database and the updates in the language of first-order logic, which we believe has strong advantages over relational tables and traditional data manipulation languages when information is incomplete. We have described semantics and found polynomial-time algorithms for update operators [Winslett;86, 85b]. An implementation of these algorithms has been coded in C. When null values are present in the database or in the updates, incorporating the updates directly in the database may lead to unacceptable expansion in the database size. Using methods reminiscent of those used in concurrency control, we have investigated a lazy evaluation technique which regulates this growth.
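The following toy propositional fragment is our own simplification of these semantics: the incomplete database is represented by the set of worlds it admits, and an update keeps, for each admitted world, the satisfying worlds that differ from it by a set-inclusion-minimal set of changed atoms.

;;; A toy propositional illustration (our own simplification of the
;;; semantics studied in [Winslett;86]).
(defparameter *atoms* '(a b c))

(defun all-worlds (&optional (atoms *atoms*))
  "All truth assignments, each represented as the set of its true atoms."
  (if (null atoms)
      (list '())
      (let ((rest (all-worlds (cdr atoms))))
        (append rest
                (mapcar (lambda (w) (cons (car atoms) w)) rest)))))

(defun diff (w1 w2)
  "Atoms on which two worlds disagree."
  (union (set-difference w1 w2) (set-difference w2 w1)))

(defun strict-subset-p (s1 s2)
  (and (subsetp s1 s2) (not (subsetp s2 s1))))

(defun minimal-results (world satisfies-p)
  "Worlds satisfying SATISFIES-P whose change set w.r.t. WORLD is minimal."
  (let ((candidates (remove-if-not satisfies-p (all-worlds))))
    (remove-if (lambda (w)
                 (some (lambda (v) (strict-subset-p (diff world v)
                                                    (diff world w)))
                       candidates))
               candidates)))

(defun update-database (worlds satisfies-p)
  "Update every world admitted by the incomplete database; pool the results."
  (remove-duplicates
   (loop for w in worlds append (minimal-results w satisfies-p))
   :test (lambda (x y) (and (subsetp x y) (subsetp y x)))))

;; Example: the database knows only that a is true (b and c are unknown),
;; so it admits four worlds.  Updating with ``b or c'' replaces the outdated
;; information about b and c while leaving a undisturbed.
;; (update-database '((a) (a b) (a c) (a b c))
;;                  (lambda (w) (or (member 'b w) (member 'c w))))
;; => ((A B) (A C) (A B C))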
To provide access to a specific subset, we may define a database view. Use of defined subsets cannot resolve all cases where nulls will occur.
These restrictions lead to a class of problems when the database is to be updated. An update expressed through a view typically has multiple reasonable interpretations in the shared database. For that reason, in most current DBMSs, views can only be used to retrieve information; users operating within a view are not permitted to do updates. This restriction increases the burden on the centralized authority with power to update the database, and introduces errors due to separation of authority over data and authority over database operations.
We have developed a theory which permits enumeration of all possible interpretations of a view update, and a complete collection of algorithms for the implementation of these interpretations. The theory is based on notions of formal correctness of updates [Keller;82, 85a, c].
When the view includes the join columns, view update is generally feasible, especially when the key for the view is entirely contained in one of the underlying relations [Keller;81b]. This work has been extended to include other views formed by standard relational operators (selection, projection). This algorithmic approach is in contrast with the information-theoretic approach based on minimal complements [Bancilhon;81], [Keller;84b].
In addition, we have developed an interactive approach to permit a database administrator to choose the appropriate interpretation for updates at view definition time. Once the interpretation has been fixed, an update expressed against a view will have an unambiguous interpretation in the shared database [Keller;86a].
Although considerable research has been devoted to the problem of processing queries expressed in natural language, the issues and techniques for performing natural-language database updates have had little attention. The need to perform natural language updates arises when computer databases support ad-hoc planning. When using the database to plan for hypothetical future scenarios, update interpretation should not depend on the insights of a database administrator. For this situation we have developed artificial intelligence techniques to select a plausible update interpretation based on the user's previous interactions with the database, database semantic constraints, and the goal of minimal side effects. If one update interpretation clearly dominates, the user can proceed in the planning scenario without being forced into a tedious disambiguation dialog.
A theory has been developed that makes it possible to identify a particular user's view of the database and to determine the legality, ambiguity, and potential side effects of updates expressed in natural language. A formal analysis of the problems associated with updates expressed on views (database submodels) is central to this work. A system, pique, has been implemented, based on this approach, that processes natural-language updates, explaining problems or options to the user in terms the user can understand and that, moreover, makes changes in the underlying database with a minimal disruption of other views [Kaplan;81a] [Davidson;82, 83a, 83b].
The separation of read-transactions into three classes, namely those that require no or minimal consistency, those that require auditable consistency, and those that require time-critical consistency, can improve aggregate system performance. The problems arising from serving transactions that support planning functions, which require access to large granules of the database, can be greatly reduced by lowering their consistency demands [Garcia;82].
Such a system is considered resilient with respect to failures, rather than failure-proof. A local node may not be able to provide full local recoverability by itself. This can be due to economic considerations, since the cost of local secure redundant storage can be excessive for small systems, or can be due to operation in a high-risk environment. Techniques to use other nodes within the distributed network to restore failed nodes, whose local storage has been damaged, have been explored and documented [Minoura;78, 82, 83a, b]. We are obtaining feedback about their use in financial and manufacturing systems.
Subsequent work in this area deals with a classification of transactions in relation to resiliency. By formally defining a set of transactions which can proceed when the database is partitioned, a set which can be conditionally handled, and a set which cannot proceed, we expect to enhance the operability of distributed databases in conditions of adversity [Apers;84a,d].
We have extended the approaches to natural language understanding addressed in earlier work [Gardner;81]. The intent here was to avoid reinventing the wheel, so that we could build on research performed earlier and study the many remaining problematical issues in regard to human interaction with databases [Rowe;82c].
When issues of incomplete user knowledge arise during querying, the structural model can provide guidance in query formulation and in response processing. We also tested the approach to response analysis developed by [Kaplan;79a, b, f, 80].
Information from the structural model can provide useful hints for summarization. Since a referenced entity provides a higher level abstraction of the referencing objects, the typically small set of reference connections can provide the candidate relations for descriptive responses.
By incorporating a task model into the KBMS, these characteristics can be used in a variety of ways to improve the utility of the system. Four main uses of task models have been explored in a KBMS:
Semantic query optimization is an approach to improve the performance of information requests that uses such domain knowledge automatically. The objective is to transform a query into a semantically equivalent one (i.e., one that produces the same answer but can be processed more efficiently).
This work demonstrates that semantic knowledge about the database can be used in a new way to achieve efficient data retrieval. The method supplements conventional query optimization methods. Drawing on semantic, structural, and processing knowledge, it makes use of the heuristics of query processing developed in prior research. A system, quist, has been implemented to explore this approach. Improvements have been demonstrated in a range of query types by means of transformations that add or delete constraints on attributes or even entire files, as warranted by the database and query structure. In other cases, the system detects unsatisfiable query conditions without inspecting the database, or aborts inference when it determines that no improvement can be expected. Analysis overhead is low in all cases [King;79, 80a, b, 81a, b].
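To indicate the flavor of such transformations (the rules, units, and code below are an invented miniature, not the quist rule language), integrity constraints can be applied to the conditions of a query to add implied restrictions and to detect unsatisfiable combinations before any data are read:

;;; Hypothetical sketch of semantic query transformation.  Each rule states
;;; that any tuple satisfying the antecedent also satisfies the consequent.
(defparameter *integrity-rules*
  '(((ship-type = supertanker) . (capacity > 400))   ; thousands of tons
    ((capacity > 400)          . (draft > 20))))     ; metres

(defun implied-conditions (conditions)
  "Close CONDITIONS under the integrity rules."
  (let ((result (copy-list conditions)))
    (loop for changed = nil
          do (loop for (ante . conseq) in *integrity-rules*
                   when (and (member ante result :test #'equal)
                             (not (member conseq result :test #'equal)))
                     do (push conseq result)
                        (setf changed t))
          while changed)
    result))

(defun contradictory-p (conditions)
  "Detect simple numeric contradictions such as x > 20 together with x < 15."
  (loop for (attr1 op1 v1) in conditions
        thereis (loop for (attr2 op2 v2) in conditions
                      thereis (and (eq attr1 attr2)
                                   (eq op1 '>) (eq op2 '<)
                                   (>= v1 v2)))))

;; Example: a query about supertankers in a shallow port is recognized as
;; unsatisfiable before any data are read.
;; (contradictory-p (implied-conditions '((ship-type = supertanker) (draft < 15))))
;; => T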
We have written a program that detects whether or not the result of one query can be used to support answering another query. Finkelstein describes both the query graph and the first two settings for optimization described below [Finkelstein;82].
Compared with conventional natural-language interface systems, our approach is based on a normative system assumption. The system works with a relatively small grammar, vocabulary, and knowledge base; moreover, all three can be grown incrementally.
To demonstrate the effectiveness of the menu-guided cooperative ad hoc query interface described in this section, we plan to implement it on a XEROX lisp machine with a SQL DBMS configured on a remote server. The idea of a menu-guided cooperative system carries over to formal query language interfaces as well. In addition to testing over a small subset of natural-language grammar, we plan to create a menu-guided cooperative interface for a subset of SQL. We expect the SQL interface to be useful to casual and advanced database users, while the interface based on natural language should suit a wider range of users.
The inference rules comprise a new example of an artificial-intelligence expert system. There are several important advantages of this approach over sampling methods, as has been demonstrated experimentally in estimating a variety of statistics on two different databases [Rowe;81b, 82b, d, e, f, 83a, b, d].
Logic and circuit design information of two devices, an ALU and a pdp-11 CPU, and the component library required to build them, have been placed into databases. The lower level data elements can be automatically generated using an existing design automation program. Performance measurements indicated only a performance degradation by a factor of about 2 versus use of specialized design files [Wiederhold;80b]. Modification of lower level elements during the design process is signalled automatically, using a height-first algorithm, to the related parent levels, so that this detailed knowledge can be incorporated in the higher level abstraction when these are accessed during successive design iterations [Beetem;82], [Wiederhold;82b].
We have developed some access techniques for the design data which permit retrieval of data that is partially stored. If required elements are not stored, they are generated from higher level circuit or logic specifications. Partial storage is important. The volume of lower level elements in a VLSI design can be greater than is manageable by the largest storage devices now available, so that automatic methods for VLSI design automation will need access to a mix of generatable, regular elements and instantiated, irregular elements if they are to handle large circuits that are not totally regular [Wiederhold;82b].
Our initial experiments utilized a codasyl database system, DBMS-20. We are in the process of evaluating our approaches for other database environments to make our results more accessible to practitioners. We have ported the database to a relational and object oriented system at XEROX Palo Alto Research Center and evaluated the performance of the VLSI-design programs in both a purely relational and an entity-relationship model implementation [Winslett;84, 85a]. The entity-relationship model improved performance, but the overriding factor in both versions was the cost of writing the results to disk.
An important byproduct of this work is an assessment of the tradeoffs among database implementation models. Much argument has been generated in the arena of relational versus network implementations. We believe that some solid experiments in this one critical area can help in quantifying the issue in an objective fashion [Milton;83].
We have begun to consider the issue of the automatic management of design versions. This task is performed now by databases that support engineering change control [Olumi;83]. The same problem is being addressed in conventional databases that are concerned with time or planning updates, but there are as yet few results to draw upon, so that in fact the design area may be able to lead in this research field [Belady;77], [Bennett;82], [Parnas;76], [Rochkind;75]. The high-capacity permanent storage provided by optical disks permits a greater retention of past versions [Rathmann;84].
In research performed at the SRI International AI Center theoretical issues in planning are being investigated. A method which permits plans to be created, although not necessarily executed, efficiently is the focus of current research [Pednault;84], [Manna;86]. The method can be related to many techniques which had a heuristic base.
Content analysis requires large collections of well maintained data. Many years of operational use were required to get to this stage. In this research we have developed tools that use artificial intelligence techniques to assist in the processing of time-oriented data. A frame-based model is used to capture domain knowledge. This model assists in combing the database for unexpected time-lagged correlations and reporting such events as prospective hypotheses. If analysis of the finding appears warranted, a statistical model is built that includes all possible covariates given in the knowledge base.
The appropriate statistical tests are selected by applying a collection of rules that use the descriptions of the data from the database schema and the conditions that are satisfied by the statistical procedures. If the findings turn out to be significant, they are added to the knowledge base for use in another iteration of hypotheses generation and testing. This approach to knowledge generation complements the familiar, but less formal approach taken by expert systems [Shortliffe;73, 76, 79].
This approach has been demonstrated by using a small subset of a time-oriented rheumatology database (aramis) [Fries;79]. The statistical methods now available are limited to multi-variate regression. Tentative hypotheses were generated and automatically tested for statistical significance. Several verified hypotheses could be added to the original loaded knowledge base [Blum;82a, b, c].
Work is continuing in the medical arena, but its applications to other databases (e.g., military manpower management) are also under way. Basic research to deal with reduction of the inherent complexity of hypotheses generation is continuing.
Modules developed and under consideration include a symmetric file access system, a database definition language processor, a relational algebra processor, and recovery support. We hope to integrate these on the naxos system, a vax 750 dedicated to experimentation with advanced databases.
While traditionally the notion of a primary key has implied both a high degree of locality and uniqueness of the key, it is desirable for a database-driven system not to make that distinction. Included in that development is a port to dec vax equipment. The available unix software is particularly deficient in the area of file management.
Such an intelligent file server would respond to high level queries. Whereas conventional equipment requires the processor to specify a device and a storage address for retrieval, we foresee having to submit only a symbolic name and a key. The objects to be retrieved will no longer be fixed-length blocks, but sets of variable length records.
The difficulties of moving advanced file functions out of the processor's realm lie in the respecification of the interface. Linkages for serial access within the retrieved sets, algorithms for memory allocation, and methods for setting and managing locks for concurrent access all need analysis and formalization.
These results have led to a ``high-level'' design for a specialized non-von Neumann machine, portions of which would be implemented in VLSI, which would support the highly efficient evaluation of the most computationally complex operators of a relational algebra, specifically projection with elimination of duplicate tuples and the equi-join. Based on a hierarchy of associative storage devices, this architecture permits an O(log n) improvement in time complexity over the best known evaluation methods for these operators on a conventional computer system, without the use of redundant storage, and using currently available and potentially competitive technology. In many cases of practical import, the proposed architecture should also permit a very significant improvement (by a factor roughly proportional to the size of the primary associative storage device) over the performance of previously implemented or proposed database machine architectures based on associative secondary storage devices [Shaw;80a, b, 81]. This work is the basis for further VLSI data-oriented architectural research in progress at Columbia University.
While research on rewritable optical media is progressing, it seems that the linear density will be reduced by at least a factor of 2, leading to a reduction in storage capacity by a factor of 4.
With this development, new problems have to be addressed. At this time we recognize that:
We have developed a technique where indexing structures, in particular tries and various kinds of trees [Knuth;73], are stored incrementally within the data blocks as they are rewritten [Rathmann;84]. This appears to solve the basic problem, although further investigations are foreseen. The trie approach has been implemented [SYTH;85], in an optical version of flash, a language independent portable file access system [Allchin;80a].
Request from Gio, 5-12-86:
I am happy to tell you that my paper "Subsumption and Implication",
written last summer at Stanford and sponsored by DARPA Contract
N39-84-C-211, has been accepted for publication in Information Processing Letters.
George Gottlob.

Conceptual Interactions of KBMS Components
---------------------------
| Users |
---------------------------
Free form|| /\ /\ data
queries || || || flow
and || ||
updates || ||
\/ || ^ control
------- ------------------ || | flow
| |====>| Natural language | ||
| User | | front end | ||
| | ------------------ || Responses
| Focus | | | || ||
| | Lexical | | || Knowledge ||
| | data | | || base ||
| | inter- | | || queries ||
| | action | | || ||
| | v | \/ ||
| | ----------------- ------------- ------------
| | | Domain | | Cooperative | | Database |
| |<====| Knowledge |===| response |<===| Abstract |
------- | Base | | analysis | | |
----------------- ------------- ------------
^ | | || /\ ^ /\^
Organi- | | | || Database || | Access ||
zation | | | || queries || | path ||
advice | | | || || | information ||
| | | || || | ||
-------------\ | | | || || | ||
| Data | \ | | | || || | ||
| View | \ | | | \/ || | ||
| Models | \ ------------------------------------ ||
| of |\ Structural Schema | ||
| Applica-| | / ------------------------------------ ||
| tions | | / | | | || || ^ ||
| | |/ | | | || Data || | ||
------------/| | | | || base || | Access ||
| | | | || queries || | path ||
Structural | | | || || | information ||
information | | | || || | ||
| | | \/ || | ---------- ||
------------------------------------ | Access | ||
| Access Path Organizer |<=>|statistics| ||
------------------------------------ | file | ||
| | | || /\ --------- ||
directives | | | || queries || ||
| v | \/ || ||
------------------- ----------------- ------------------ ||
| Modern File access| | Commercial DBMS | | VLSI based hard- | ||
| software, FLASH | | a.o.CODASYL | | ware concepts | ||
------------------- ----------------- ------------------ ||
|| /\ ||
|| || ||
|| || ||
-------------------------------------------------------------
| DATA BASES |
| 1: Merchant ships, ports, cargo, voyages |
| 2: Logic of a DEC-11 machine, circuits, components |
| 3: Time-oriented data from Arthritis patients |
-------------------------------------------------------------
Stanford Reports and Papers
These books, papers and reports document research performed
fully or partially under current and preceding ARPA contracts
to the investigators collaborating in the KBMS project.
Entries with `*' are KBMS documents; others are ancillary documents.
The following papers are relevant to, but were not created by, the KBMS research.
" A system to experiment with Knowledge-base- database interaction>>
We have performed some experiments using PROLOG to develop and validate
methods for acquiring database information into a knowledge base.
We consider ground rules in PROLOG and bring them, before or during
execution, into the knowledge base. A `Tracker' records the portions
placed in the knowledge base so that repeated database accesses are avoided.
When all data relating to a predicate have been placed, all further
accesses for that predicate are avoided by inserting a negative rule.
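A minimal Common Lisp sketch of the Tracker idea follows; the names and the fetch interface are hypothetical, and the actual experiments operate on PROLOG ground rules.

;;; Hypothetical sketch of the Tracker: cache database facts per predicate
;;; and close a predicate once all of its facts are in the knowledge base.
(defparameter *kb-facts* (make-hash-table :test #'eq)
  "Predicate symbol -> list of ground tuples already in the knowledge base.")

(defparameter *closed-predicates* '()
  "Predicates for which all database facts have been transferred;
this plays the role of the negative rule mentioned above.")

(defun kb-lookup (predicate fetch-fn)
  "Return all tuples for PREDICATE, consulting the database (via FETCH-FN,
a stand-in for the DBMS interface) only if the predicate is still open."
  (if (member predicate *closed-predicates*)
      (gethash predicate *kb-facts*)
      (let ((tuples (funcall fetch-fn predicate)))
        (setf (gethash predicate *kb-facts*) tuples)
        (push predicate *closed-predicates*)
        tuples)))

;; Example: the second call never reaches the database.
;; (kb-lookup 'port (lambda (p) (declare (ignore p)) '((oslo) (rotterdam))))
;; (kb-lookup 'port (lambda (p) (error "should not be called")))
;; => ((OSLO) (ROTTERDAM))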
Related work is described in a paper by Ceri, Gottlob, and Wiederhold on interfacing PROLOG with relational databases.