Final Version Published in "Intelligent Systems", Z.W.Ras and M.Zemankova, editors, Ellis Horwood, publisher, May 1990, pages 415-428.
1. Introduction

Practical applications of knowledge-based systems are hindered by the knowledge acquisition problem, i.e., the difficulty of building and maintaining knowledge bases. There are several types of automated knowledge acquisition. They are complementary, and require different levels of analysis and base data.
Data summarization is important because of the problem of information overload. It changes the level of abstraction to match an expert's need, but does not make choices.
Aids to scientific discovery increase the productivity of researchers in formulating hypotheses. Finding and pruning inconsistencies between already encoded knowledge and the data, again at an appropriate level of abstraction, augments the paper-based exploration tools that are used now.
Validation of factors is essential for systems dealing with any form of probabilistic reasoning. It is unreasonable to expect a human to quantify consistently the interactions of the many factors needed for expert decisions in even moderately complex situations. It is certainly not possible to obtain consistent quantifications from multiple experts on topics that partially overlap. Today few commercial expert system products incorporate probabilistic reasoning, and others use ad hoc algorithms such as certainty factors. Having well-defined quantifications, and using them appropriately, can permit such systems to deal correctly with uncertainty.
An important issue is maintenance. The cost of building a knowledge base is only the down payment. We can assume that in most interesting applications the knowledge changes by at least 5% per year.
2. Database summarization

Our work on database summarization in the RX project is described in detail in Downs (Downs, Walker, & Blum, 1986) and DeZegher (DeZegher-Geets, Freeman, Walker, Blum, & Wiederhold, 1987; DeZegher-Geets, Freeman, Walker, Blum, & Wiederhold, 1988). In this chapter we provide a short overview.

Both the summarization and discovery components of RX use patient records from a clinical database (Figure 1).
Figure 1. Small portion of a medical database used by RX.
The goal of the Summarization Module is to provide
concise, high-level summaries of such patient records. The
program uses a domain-specific knowledge base to
identify and summarize important events. It provides an
interactive, graphic representation of the patient record,
with active regions on the screen display that are
selectable by the user to display evidence supporting its
conclusions.
Figure 2. Sample output of the Summarization Module.
Figure 2 shows a small portion of a summary: a graphical
display of some of the important events relevant to this
patient. The summary consists of high-level descriptions
such as renal failure and glomerulonephritis that the
program has inferred from the patient's laboratory data.
These high-level descriptions are typically diagnoses or patient states. The user can see the evidence supporting each description by selecting the corresponding region of the display.
The Summarization program uses a hypothetico-deductive
algorithm in which disease states are evoked based on
attributes with abnormal values, and the evoked diseases
are confirmed or rejected by matching their templates
against the data in the patient record. Likelihood ratios
are used for Bayesian updating of the probability of the
evoked disease states. Confirmed hypotheses, probable
tentative hypotheses, and unexplained abnormalities are
the labels on the time-oriented graphical summary.
The algorithm begins with a data-driven hypothesis
generation phase, followed by a goal-driven hypothesis-testing phase.
By alternating forward and backward search, the program considers only
those hypotheses suggested by the data. The hypothesis
generation phase involves searching a visit in the patient
record for attributes with abnormal values. Each
abnormality found is associated with a list of possible
labels or hypotheses that may explain it. These
hypotheses are instantiated and placed on an agenda for
confirmation or refutation. To avoid the large number of
hypotheses that would result from considering all
abnormal values, only those attributes determined to be
significant for the application are considered. This
determination of significance is based on the prior
probability of the abnormality and on the cost or utility of
missing an important finding.
As each significant abnormality is found in a given visit,
an instance of each hypothesis on its differential list is
created and placed on the agenda of hypotheses. Each
hypothesis has an initial probability equal to its prior
probability. The hypotheses on the agenda are taken in
turn and evidence for and against is sought in the
database. The evidence found is used to update the
probability of the hypothesis according to Bayes' rule.
When all the evidence for a given hypothesis has been
evaluated the probability is fully updated. If the
probability of the hypothesis is low the label may be
removed as determined by the user. When the probability
exceeds a threshold, the labels in its differential are
instantiated and placed on the agenda, and the display is
updated.
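To make the control flow concrete, here is a minimal Python sketch of the agenda mechanism just described. The differential lists, priors, thresholds, and likelihood ratios below are invented for illustration; they are not the actual Summarization Module knowledge base.

    # Hypothetical sketch of the hypothetico-deductive loop; all names and
    # numbers are illustrative assumptions, not RX data.
    DIFFERENTIALS = {
        "high_creatinine": ["renal_failure"],
        "proteinuria": ["glomerulonephritis", "renal_failure"],
    }
    PRIORS = {"renal_failure": 0.05, "glomerulonephritis": 0.02}
    LR = {("renal_failure", "high_bun"): (20.0, 0.5)}  # (LR if present, LR if absent)
    CONFIRM, DISCARD = 0.90, 0.01

    def summarize_visit(visit):
        """visit: dict attribute -> bool (True means the finding is abnormal)."""
        # Data-driven phase: significant abnormalities evoke hypotheses.
        evoked = set()
        for attr, abnormal in visit.items():
            if abnormal:
                evoked.update(DIFFERENTIALS.get(attr, []))
        agenda = [(h, PRIORS[h]) for h in evoked]
        labels = []
        # Goal-driven phase: seek evidence for and against each hypothesis.
        while agenda:
            h, p = agenda.pop()
            odds = p / (1.0 - p)
            for finding, present in visit.items():
                lr_pos, lr_neg = LR.get((h, finding), (1.0, 1.0))
                odds *= lr_pos if present else lr_neg   # Bayesian update
            p = odds / (1.0 + odds)
            if p >= CONFIRM:
                labels.append(("confirmed", h, p))
            elif p > DISCARD:
                labels.append(("tentative", h, p))      # else: label removed
        return labels

    print(summarize_visit({"high_creatinine": True, "high_bun": True}))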
3. Discovery: RX

Our work on automated discovery using medical databases began with the RX project, which is described in detail in (Blum, 1982; Blum, 1986) and (Walker & Blum, 1986). Walker (1987) reviews systems for automated discovery. RX uses a large time-oriented patient database as its experience base, and a substantial knowledge base of clinical observations, intermediate findings, syndromes, and diagnoses. The knowledge base is also initialized with links that describe categorical, definitional, and causal relationships (Wiederhold, Blum, & Walker, 1986). Discovery in RX consists of proposing and validating previously unknown causal links among nodes in the knowledge base.

The principal innovation of the RX project was the addition of a knowledge base and a Study Module to the usual statistical package that investigators use in analyzing a database. Besides a database (Figure 1), the RX program consists of four major parts: the Discovery Module, the Study Module, a statistical analysis package, and the medical knowledge base.
The RX Discovery Module depends on statistical
approaches. It carries out time-lagged, non-parametric
correlation of pairs of attributes in the database, and
produces a list of hypotheses consisting of the
correlations ranked according to their probability. The RX
Study Module then designs a comprehensive study of the
most promising hypotheses. It identifies the probable causal event of the correlation by requiring consistent time precedence. Information from the medical knowledge base
is used to identify known covariate factors that may have
produced an association between the tentative cause and
effect. The Study Module uses quantitative information
about the data to assure that adequate data are available
for statistical testing. Knowledge about statistical methods,
a component of the knowledge base, is used to design the
statistical model of the hypothesis. The statistical analysis
package is invoked by the Study Module to test the
statistical model. The analysis package accesses relevant
data from patient records and then applies the statistical
model to the data. The results are returned to the Study
Module for interpretation.
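A rough sketch of the Discovery Module's screening step might look as follows. The attribute names, the fixed lag, and the significance cutoff are assumptions for illustration, not RX's actual parameters.

    # Sketch: time-lagged, non-parametric (Spearman) correlation of pairs
    # of attributes, ranked by significance. Illustrative assumptions only.
    import numpy as np
    from scipy.stats import spearmanr

    def lagged_hypotheses(records, lag=1, alpha=0.05):
        """records: dict attribute -> aligned numpy time series.
        Returns candidate causal hypotheses, most promising first."""
        hypotheses = []
        for cause in records:
            for effect in records:
                if cause == effect:
                    continue
                x = records[cause][:-lag]   # earlier values of candidate cause
                y = records[effect][lag:]   # later values of candidate effect
                rho, p = spearmanr(x, y)
                if p < alpha:
                    hypotheses.append((cause, effect, rho, p))
        return sorted(hypotheses, key=lambda h: h[3])

    # Toy data: cholesterol follows prednisone with a one-step lag.
    rng = np.random.default_rng(0)
    pred = rng.normal(size=50)
    chol = np.roll(pred, 1) + 0.5 * rng.normal(size=50)
    print(lagged_hypotheses({"prednisone": pred, "cholesterol": chol}))

Only the prednisone-to-cholesterol direction survives the screen, since the reverse pairing has no consistent time precedence.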
One of the unexpected results of the Study Module was in confirming and quantifying the hypothesis, proposed by the Discovery Module, that the drug prednisone increases cholesterol in the blood. We found that this result had previously been recorded in one regional medical literature reference, but was not widely known. The effect was not only proposed but also demonstrated in detail by the program, since the database contained much more experience than the prior studies. These results are described in (Blum, 1986).
From the point of view of knowledge acquisition, RX was
of interest in demonstrating how statistical and
knowledge-based methods could be integrated in one
program.
4. Quantitative knowledge acquisition

Quantitative knowledge acquisition obtains further knowledge about relationships among concepts defined in the initial knowledge base. An example is deriving likelihood ratios or other statistical parameters for rules in knowledge bases by statistical analysis of a database. Conventional statistics offer a variety of methods that are potentially useful for knowledge acquisition and validation. Pattern recognition approaches to quantitative knowledge acquisition, for example, are described in standard texts (James, 1985). Part of our work has been to define and develop a knowledge acquisition program that incorporates such statistical methods.

Because of the many possible sources of uncertainty,
error, and noise in the input to a classifier (such as
incomplete data, incorrect data, inconsistent data, user
inability to correctly observe, and so on), knowledge base
validation and performance evaluation are fundamentally
statistical problems. The Bayesian approach to
classification, while normative, is plagued by its
requirement for very large amounts of data (when using
the actuarial method) and its fundamentally exponential
nature (to deal with dependent evidence). We have
explored the advantages and consequences of using full
joint dependencies, of using the assumption of conditional
independence, and the use of special-purpose combining
functions. We will describe each of these approaches, and
our conclusions. Our program provides functions that
analyze a database using the different approaches, and
that aid in comparing the results of consultations using
each approach.
There are many classification algorithms described in
the statistics and pattern recognition literature
(James, 1985). It is useful to think of them as
alternative methods for combining and weighting
evidence, or as evidence-combining functions. In the
set of evidence-combining functions we include rule
based systems, (including those that use certainty
factors), perceptrons and neural nets, parametric
statistical methods such as linear and quadratic
discriminant analysis, and non-parametric statistical
methods such as kernel discriminant analysis, k-
nearest neighbor, and Classification and Regression
Trees (CART) (Breiman, Friedman, Olshen, & Stone,
1984).
The standard approach to quantitative knowledge
acquisition is this: We construct the classifier using a
set of examples from a database. We assume that the database has examples of all the
classes of interest. The resulting classifier is used to
classify previously unseen objects. The classifier is
evaluated in terms of its error rate on either objects
in the training set or, preferably, previously unseen
objects.
All classification systems use Bayes' rule, either
implicitly or explicitly. This is true for rule-based
systems, neural nets, discriminant analysis, and
every other classification system.
For a set of classes, C, and a set of measurements in
the measurement space X, the probability that a new
object with measurement vector x belongs to class Ci
is given by Bayes' rule:
P(Ci | x) = P(x | Ci) P(Ci) / Σj P(x | Cj) P(Cj)
where
P(Ci | x) = the probability that the object with
measurement vector x belongs to class Ci.
P(Ci ) = the prior probability of class Ci.
P(x | Ci ) = the probability that an object in class Ci
has measurement vector x.
The objective of all classification programs is to
determine P(Ci | x), the probability that the object
with measurement vector x belongs to class Ci. A
new object to be classified is assigned to the class Ci
for which the value of P(Ci | x) is greatest.
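As a minimal sketch, the decision rule can be written directly from the formula; the priors and likelihood function here are assumed inputs for illustration, not anything estimated by our program.

    # Sketch of the decision rule: assign x to the class maximizing P(Ci|x).
    def classify(priors, likelihood, x):
        """priors: dict class -> P(Ci); likelihood(c, x) returns P(x|Ci)."""
        post = {c: likelihood(c, x) * p for c, p in priors.items()}
        z = sum(post.values())                    # Bayes' rule denominator
        post = {c: v / z for c, v in post.items()}
        return max(post, key=post.get), post

    priors = {"A": 0.7, "B": 0.3}
    lik = lambda c, x: {"A": 0.2, "B": 0.9}[c] if x else {"A": 0.8, "B": 0.1}[c]
    print(classify(priors, lik, True))   # evidence shifts the decision to B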
To construct the classifier we can use a training set
taken from a database to determine P(Ci ), the prior
probability of class Ci, and P(x | Ci), the probability that an object in class Ci has measurement vector x. The training set should be an unbiased,
random sample from the population of objects that
are to be classified.
P(Ci ) can be easily estimated from the training set; it
is simply the proportion of all the objects in the
training set that belong to class Ci. In most of the
applications of rule-based expert systems, the prior
probabilities of all the classes are assumed to be
equal. That is, P(Ci ) is assumed to be the same for all
Ci. This is often not a good assumption.
P(x | Ci ) also may be estimated from the training set.
However, making these estimates may require more
data than are in the training set. The full multinomial
approach to Bayes requires calculating and storing,
for each possible conclusion, the likelihood of all
possible combinations of evidence given that
conclusion. If each of n pieces of evidence can take
one of two values (i.e., it is a binary categorical
variable) there are 2^n possible combinations,
meaning we must estimate 2^n parameters. This number
increases greatly when the evidence is not binary. When
more parameters are estimated from a given set of
data, the variance of the parameters increases.
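A two-line computation shows how quickly the parameter counts per class diverge; the linear count assumes the conditional-independence model discussed below.

    # Parameters per class for n binary evidence variables: the full joint
    # model needs one probability per combination (2^n), while conditional
    # independence (discussed below) needs only one per variable.
    for n in (5, 10, 20, 30):
        print(f"n={n:2d}: joint 2^n = {2**n:>13,}   independent: {n}")
    # At n=30 the joint model already needs over a billion parameters.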
The same problem of combinatoric explosion seen
with categorical data occurs with continuous-valued
data. Formally, the probability of any exact point value of a continuous variable is zero. However, we can
estimate probability density over small regions
rather than at points. This is, in effect, building a
multi-dimensional histogram. However, once the
multi-dimensional histogram is specified the
problem arises of either not having data in most of
the histogram's partitions as is the case with
categorical data, or having partitions so large that the
probability cannot be accurately estimated.
Dependencies

Several approaches for dealing with dependencies have been proposed. Each of the approaches has weaknesses, which we discuss next.

The full multinomial approach described above requires calculating and storing, for each possible conclusion, the likelihood of all possible combinations of evidence given that conclusion. For n binary variables there are 2^n
possible combinations. The use of full joint probabilities
suffers from an exponential explosion in the number of
combinations of pieces of evidence (the power set), high
variance of calculated probabilities due to small sample
size of each member of the power set, and difficulty
obtaining either actuarial data or subjective estimates.
The number of required calculations may be reduced by
assuming that each piece of evidence is conditionally
independent. For binary-valued evidence this requires 2n
marginal probabilities. If there is a database available,
these marginal probabilities may be calculated directly
from the database; this is sometimes called the actuarial
method. Conditional independence, using the database
actuarial method, may suffer loss of classification accuracy,
and is limited by the difficulty of obtaining actuarial data.
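A sketch of this actuarial method in Python, with hypothetical case records; the +1 smoothing for zero counts is an assumption, since the text does not specify how empty cells are handled.

    # Conditional independence with marginals counted from a database.
    from collections import Counter, defaultdict

    def train(cases):
        """cases: list of (evidence_dict, class_label); evidence is binary."""
        class_counts = Counter(label for _, label in cases)
        counts = defaultdict(int)
        for ev, label in cases:
            for name, value in ev.items():
                counts[(label, name, value)] += 1
        priors = {c: n / len(cases) for c, n in class_counts.items()}
        def p_marginal(c, name, value):
            # +1 smoothing so unseen combinations stay non-zero (assumption)
            return (counts[(c, name, value)] + 1) / (class_counts[c] + 2)
        return priors, p_marginal

    def classify(priors, p_marginal, evidence):
        post = {}
        for c, prior in priors.items():
            p = prior
            for name, value in evidence.items():
                p *= p_marginal(c, name, value)   # conditional independence
            post[c] = p
        z = sum(post.values())
        return {c: v / z for c, v in post.items()}

    cases = [({"fever": 1, "rash": 1}, "measles"),
             ({"fever": 1, "rash": 0}, "flu"),
             ({"fever": 1, "rash": 1}, "measles"),
             ({"fever": 0, "rash": 0}, "flu")]
    priors, p_marginal = train(cases)
    print(classify(priors, p_marginal, {"fever": 1, "rash": 1}))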
Another method, similar to conditional independence
using a database in that only marginal probabilities are
used rather than full joint probabilities, is to have
marginal probabilities estimated by an expert rather than
being calculated from a database. Conditional
independence using subjective estimates has the same
problem of reduced classification accuracy, and may be
limited by the inability or unwillingness of an expert to
provide estimates, along with the need to resolve
inconsistent estimates by the expert.
Norusis and Jacquez (Norusis, & Jacquez, 1975a; Norusis, &
Jacquez, 1975b) propose a method which they call
"attribute cluster models" in which dependent pieces of
evidence are grouped into subsets (attribute clusters).
Calculation of P(conclusion) proceeds as follows. Within
each subset, all joint probabilities are calculated in the
same manner as for full joint probabilities. The sets of
subsets are then treated as independent pieces of
evidence, and the probability of each conclusion is
calculated using the conditional independence assumption.
The number of required calculations depends on the
number and size of the subsets of dependent evidence.
Taking subsets of dependent evidence and treating them
as independent sets (the attribute clustering method) is
potentially exponentially explosive, though not as much so
as full joint probabilities.
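The following sketch shows the attribute-cluster computation: joint probabilities within each cluster, independence across clusters. The clusters and probability tables are invented for illustration.

    # Attribute-cluster posterior; illustrative assumptions throughout.
    def cluster_posterior(priors, clusters, joint_tables, evidence):
        """clusters: list of attribute-name tuples treated as dependent groups.
        joint_tables[(class, cluster)][values] = P(values | class)."""
        post = {}
        for c, prior in priors.items():
            p = prior
            for cluster in clusters:
                values = tuple(evidence[a] for a in cluster)
                p *= joint_tables[(c, cluster)][values]  # joint within cluster
            post[c] = p                                  # independent across clusters
        z = sum(post.values())
        return {c: v / z for c, v in post.items()}

    # Two dependent attributes grouped; a third forms its own cluster.
    priors = {"disease": 0.1, "healthy": 0.9}
    clusters = [("fever", "chills"), ("rash",)]
    tables = {
        ("disease", ("fever", "chills")): {(1, 1): 0.6, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.2},
        ("healthy", ("fever", "chills")): {(1, 1): 0.05, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.75},
        ("disease", ("rash",)): {(1,): 0.7, (0,): 0.3},
        ("healthy", ("rash",)): {(1,): 0.1, (0,): 0.9},
    }
    print(cluster_posterior(priors, clusters, tables,
                            {"fever": 1, "chills": 1, "rash": 1}))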
Dependent pieces of evidence may be grouped into subsets, as
later proposed by Norusis and Jacquez, but only one piece
of evidence from each subset may be used to calculate the
probabilities of the conclusions. This eliminates the need
for full joint probabilities of the members of each subset
of evidence. Eliminating all but one member of each set of
dependent attributes may produce a loss of classification
accuracy, and may perform much worse than using the
conditional independence assumption.
Functions can be written by the user to specify how to
combine pieces of dependent evidence (e.g., linear
discriminant, second-order Bahadur, special purpose). This
requires a number of functions comparable to the
power set of combinations of evidence needed for full
joint probabilities. Special-purpose functions for combining dependent attributes incur high computational and/or storage expense, and do not perform uniformly better
than using conditional independence.
There is no shortage of additional alternative approaches,
of which we will note only a few. Norusis and Jacquez
(Norusis, et al., 1975a) describe several proposals. Further
departures from the models we have considered include
causal network models and belief network models
(Cooper, 1989), which are an active research area.
It is worth briefly noting the situations in which the
conditional independence assumption is likely to be a
problem, because it need not always be so.
Misclassification is most likely to occur when P(C|E) is
similar across diseases. As pointed out by Miller (Miller,
Westphal, & Reigart, 1981) and Norusis and Jacquez
(Norusis, et al., 1975a), if the probabilities are dissimilar
then diseases can be readily distinguished; the conditional
independence assumption has little effect. When P(C|E) is
similar the conditional independence assumption may
substantially increase misclassification. Finally, if the
features (evidence) used to distinguish the diseases are
usually poor discriminators, then whether or not conditional independence is assumed the classification performance will be poor, and only the selection of new
features will make a substantial difference.
5. Knowledge Base Validation

Validating knowledge bases with data comprises examining and comparing the classification accuracy of alternative knowledge resources.
In pattern recognition terms, such a validation is equivalent to
comparison of different classifiers. The validation
code in our program provides standard statistical methods
to determine the overall accuracy of a classifier, to
determine class-by-class accuracy (false negative and
false positive rates), to identify pairs of classes that are
most often confused, and other functions. These provide
the data necessary to make decisions such as providing
additional training examples, modifying the decision
boundaries, adding new parameters to help distinguish
the classes, modifying rule sets, modifying prior
probabilities, or trying alternative classifiers. Users may
evaluate new classification results to compare the relative
performance.
Knowledge base validation in our program includes
deriving conclusions for a set of cases using a specified
inference method, comparing the inferred classification to
the known conclusions, comparing the accuracy of one
classification method to the results of using alternative
inference methods, and identifying cases that are most
frequently misclassified (including both false positives
and false negatives).
The program stores the results of running a set of
evaluation cases. Users can generate displays from this
data, for example lists of cases with the evidence, true
conclusion, and probability of each conclusion as
determined by each inference method. The probabilities
used to conclude about a particular case (prior, marginal,
and conditional probabilities) may be derived from the
test results table and from the knowledge base. The
number of cases used to calculate the probabilities, which
the program stores explicitly in the knowledge base as a
beta distribution, indicates the variance; a high variance
suggests that the relationship is a candidate for validation.
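For instance, with s positive cases out of n, the stored counts determine a Beta(s+1, n-s+1) distribution whose variance flags weakly supported probabilities. This small sketch assumes that parameterization; the text does not give the exact convention used.

    # Mean and variance of Beta(s+1, n-s+1) for a probability estimated
    # from s positive cases out of n (parameterization is an assumption).
    def beta_summary(s, n):
        a, b = s + 1, n - s + 1
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean, var

    print(beta_summary(3, 10))      # few cases: high variance -> validate
    print(beta_summary(300, 1000))  # many cases: similar mean, far lower variance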
Consider a hypothetical classification of objects into five classes (A, B, C, D, E). The program can produce a misclassification matrix for these classes (Figure 3). The
vertical axis shows the true class (the correct conclusion),
while the horizontal axis shows the class that was inferred
using a given classifier. The values in each cell of the
matrix are the number of cases in which the classifier
concluded the inferred class for a case whose correct
conclusion is shown in the actual class. For example, taking
the actual class A, we see that this classifier correctly
classified 18 A cases, misclassified 5 A cases as B, and
misclassified 3 B cases as A. This particular classifier did
not make any errors in classifying C, D, or E. The ideal
classifier would produce a diagonal matrix.
                     Inferred class
    True class    A    B    C    D    E   Total
        A        18    5    0    0    0     23
        B         3   13    0    0    0     16
        C         0    0    9    0    0      9
        D         0    0    0    6    0      6
        E         0    0    0    0    4      4
      Total      21   18    9    6    4     58

Figure 3. Misclassification matrix for a classifier.
The program can use the misclassification matrix to
determine the false positive rate, false negative rate, and
classification accuracy for each class, along with the
overall classification accuracy (Figure 4). In this example,
we see that the classifier has most difficulty in
distinguishing class A from class B. If the rate of
misclassification of class A and class B is unacceptable we
should consider providing additional training examples,
modifying the decision boundaries, adding new
parameters to help distinguish the classes, or trying an
alternative classifier to deal with the misclassification. If one of these alternatives is chosen, the user may then
evaluate the new classification results in the same way as
in the example here, and compare the relative
performance.
             False -ve       False +ve        Correct
           Count    %      Count    %      Count     %
    A       5/23  21.7      3/23  13.0     18/23  78.3
    B       3/16  18.7      5/16  31.2     13/16  81.2
    C        0/9   0.0       0/9   0.0       9/9 100.0
    D        0/6   0.0       0/6   0.0       6/6 100.0
    E        0/4   0.0       0/4   0.0       4/4 100.0

Figure 4. False positive and false negative rates derived from the misclassification matrix in Figure 3. Overall accuracy is 50/58 cases = 86.2%.
The misclassification matrix, and the derived false positive
and false negative rates, identify the cases which are most
likely to be mis-classified, and for which the knowledge
base needs modification. It is also desirable to determine
the probability that the true classification accuracy (as
opposed to the accuracy calculated from the testing set) is
within an acceptable range, bounded by given confidence
intervals. The method is described in Jensen (1986).
Conclusions

Neural nets provide a form of knowledge acquisition from data as well. Where our approach builds on high-level abstractions, neural nets provide an exhaustive set of nodes for all combinations of the base data. However, the internal nodes are hidden and not labeled, so that the encoded knowledge cannot be exploited in novel ways, but is limited to reuse over similar cases.
Future automated knowledge acquisition systems should be able to create new
nodes as well as new relationships among the nodes.
An expert can check whether such new nodes correspond to real-world concepts that can be integrated into larger systems. Some of our current work focuses on how to compose results from multiple modestly-sized intelligent systems (Wiederhold, 1992).
A system that is linked, as ours are, to a database has a well-defined
domain, and is likely to be maintainable.
However, powerful systems to aid in complex decision-making need access
to multiple, distinct sources of knowledge.
We believe that careful composition, rather than overwhelming size,
provides paths for future intelligent information systems.
Acknowledgements

We gratefully acknowledge support for this work from the IBM Knowledge Based Systems Group, Menlo Park, from the SUMEX-AIM resources, funded by NLM, and from DARPA.

References

Blum, R. L. (1982). Discovery, Confirmation, and Incorporation of Causal Relationships from a Large Time-Oriented Clinical Database: The RX Project. Computers and Biomedical Research, 15(2), 164-187.
Blum, R. L. (1986). Computer-Assisted Design of Studies
Using Routine Clinical Data: Analyzing the Association of
Prednisone and Serum Cholesterol. Annals of Internal
Medicine, 104(6), 858-868.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J.
(1984). Classification and Regression Trees (CART). Monterey, CA:
Wadsworth & Brooks.
Cooper, G. F. (1989). Current Research Directions in the
Development of Expert Systems Based on Belief Networks.
Applied Stochastic Models and Data Analysis, 5, 39-52.
DeZegher-Geets, I. M., Freeman, A. G., Walker, M. G., Blum,
R. L., & Wiederhold, G. (1987). Computer-Aided
Summarization of a Time-Oriented Medical Data Base.
Third Annual International Conference on Computerization of Medical
Records -- Creating Patient Information Systems, Institute for Medical Record
Economics, Chicago, March 1987.
DeZegher-Geets, I. M., Freeman, A. G., Walker, M. G., Blum,
R. L., & Wiederhold, G. (1988). Intelligent Summarization
and Display of On-Line Medical Records. M.D. Computing,
5(3), 38-54.
Downs, S., Walker, M. G., & Blum, R. L. (1986). Automated Summaries of Patient Medical Records. SCAMC, Washington, D.C.
James, M. (1985). Classification Algorithms. London,
England: Collins.
Miller, M. C., Westphal, M. C., & Reigart, J. R. (1981).
Mathematical Models in Medical Diagnosis. New York:
Praeger Publishers.
Norusis, M. J., & Jacquez, J. A. (1975a). Diagnosis. I.
Symptom Nonindependence in Mathematical Models for
Diagnosis. Computers and Biomedical Research, 8, 156-
172.
Norusis, M. J., & Jacquez, J. A. (1975b). Diagnosis. II.
Diagnostic Models Based on Attribute Clusters: A Proposal
and Comparisons. Computers and Biomedical Research, 8,
173-188.
Walker, M. G. (1987). How Feasible is Automated
Discovery? IEEE Expert, 2(1), 70-82.
Walker, M. G., & Blum, R. L. (1986). Towards Automated Discovery from Clinical Databases: The RADIX Project. SCAMC, Washington, D.C.
Wiederhold, G. (1992). Mediators in the Architecture of Future Information Systems. IEEE Computer, March 1992, 38-49. Based on a 1989 report.
Wiederhold, G., Blum, R. L., & Walker, M. G. (1986). An Integration of Knowledge and Data Representation. In Brodie, Mylopoulos, & Schmidt (Eds.), On Knowledge Base Management Systems: Integrating Artificial Intelligence and Database Technologies (pp. 431-444). Springer Verlag.