Final Version Published in "Intelligent Systems", Z.W.Ras and M.Zemankova, editors, Ellis Horwood, publisher, May 1990, pages 415-428.
1. Introduction

Practical applications of knowledge-based systems are hindered by the knowledge acquisition problem, i.e., the difficulty of building and maintaining knowledge bases. There are several types of automated knowledge acquisition. They are complementary, and require different levels of analysis and base data.
Data summarization is important because of the problem of information overload. It changes the level of abstraction to match an expert's need, but does not make choices.
Aids to scientific discovery increase the productivity of researchers in formulating hypotheses. Finding and pruning inconsistencies between already encoded knowledge and the data, again at an appropriate level of abstraction, augments the paper-based exploration tools that are used now.
Validation of factors is essential for systems dealing with any form of probabilistic reasoning. It is unreasonable to expect a human to quantify consistently the interactions of the many factors needed for expert decisions in even moderately complex situations. It is certainly not possible to obtain consistent quantifications from multiple experts on topics that partially overlap. Today few commercial expert system products incorporate probabilistic reasoning, and others use ad hoc algorithms such as certainty factors. Having well-defined quantifications, and using them appropriately, can permit such systems to deal correctly with uncertainty.
An important issue is maintenance. The cost of building a knowledge base is only the down payment. We can assume that in most interesting applications the knowledge changes by at least 5% per year.
2. Database summarization

Our work on database summarization in the RX project is described in detail in Downs (Downs, Walker, & Blum, 1986) and DeZegher (DeZegher-Geets, Freeman, Walker, Blum, & Wiederhold, 1987; DeZegher-Geets, Freeman, Walker, Blum, & Wiederhold, 1988). In this chapter we provide a short overview.

Both the summarization and discovery components of RX use patient records from a clinical database (Figure 1).
Figure 1. Small portion of a medical database used by RX.
The goal of the Summarization Module is to provide
concise, high-level summaries of such patient records. The
program uses a domain-specific knowledge base to
identify and summarize important events. It provides an
interactive, graphic representation of the patient record,
with active regions on the screen display that are
selectable by the user to display evidence supporting its
conclusions.
Figure 2. Sample output of the Summarization Module.
Figure 2 shows a small portion of a summary: a graphical
display of some of the important events relevant to this
patient. The summary consists of high-level descriptions
such as renal failure and glomerulonephritis that the
program has inferred from the patient's laboratory data.
These high-level descriptions are typically diagnoses or patient states. The user can see the evidence supporting each description by selecting the corresponding region of the display.
The Summarization program uses a hypothetico-deductive
algorithm in which disease states are evoked based on
attributes with abnormal values, and the evoked diseases
are confirmed or rejected by matching their templates
against the data in the patient record. Likelihood ratios
are used for Bayesian updating of the probability of the
evoked disease states. Confirmed hypotheses, probable
tentative hypotheses, and unexplained abnormalities are
the labels on the time-oriented graphical summary.
The algorithm begins with a data-driven hypothesis
generation phase, followed by a goal-driven hypothesis-testing phase.
By alternating forward and backward search, the program considers only
those hypotheses suggested by the data. The hypothesis
generation phase involves searching a visit in the patient
record for attributes with abnormal values. Each
abnormality found is associated with a list of possible
labels or hypotheses that may explain it. These
hypotheses are instantiated and placed on an agenda for
confirmation or refutation. To avoid the large number of
hypotheses that would result from considering all
abnormal values, only those attributes determined to be
significant for the application are considered. This
determination of significance is based on the prior
probability of the abnormality and on the cost or utility of
missing an important finding.
As each significant abnormality is found in a given visit,
an instance of each hypothesis on its differential list is
created and placed on the agenda of hypotheses. Each
hypothesis has an initial probability equal to its prior
probability. The hypotheses on the agenda are taken in
turn and evidence for and against is sought in the
database. The evidence found is used to update the
probability of the hypothesis according to Bayes' rule.
When all the evidence for a given hypothesis has been
evaluated the probability is fully updated. If the
probability of the hypothesis is low the label may be
removed as determined by the user. When the probability
exceeds a threshold, the labels in its differential are
instantiated and placed on the agenda, and the display is
updated.
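To make the control flow concrete, here is a minimal Python sketch of the agenda mechanism just described. The differential lists, priors, thresholds, and likelihood ratios below are invented for illustration; they are not the actual Summarization Module knowledge base.

    # Hypothetical sketch of the hypothetico-deductive loop; all names and
    # numbers are illustrative assumptions, not RX data.
    DIFFERENTIALS = {
        "high_creatinine": ["renal_failure"],
        "proteinuria": ["glomerulonephritis", "renal_failure"],
    }
    PRIORS = {"renal_failure": 0.05, "glomerulonephritis": 0.02}
    LR = {("renal_failure", "high_bun"): (20.0, 0.5)}  # (LR if present, LR if absent)
    CONFIRM, DISCARD = 0.90, 0.01

    def summarize_visit(visit):
        """visit: dict attribute -> bool (True means the finding is abnormal)."""
        # Data-driven phase: significant abnormalities evoke hypotheses.
        evoked = set()
        for attr, abnormal in visit.items():
            if abnormal:
                evoked.update(DIFFERENTIALS.get(attr, []))
        agenda = [(h, PRIORS[h]) for h in evoked]
        labels = []
        # Goal-driven phase: seek evidence for and against each hypothesis.
        while agenda:
            h, p = agenda.pop()
            odds = p / (1.0 - p)
            for finding, present in visit.items():
                lr_pos, lr_neg = LR.get((h, finding), (1.0, 1.0))
                odds *= lr_pos if present else lr_neg   # Bayesian update
            p = odds / (1.0 + odds)
            if p >= CONFIRM:
                labels.append(("confirmed", h, p))
            elif p > DISCARD:
                labels.append(("tentative", h, p))      # else: label removed
        return labels

    print(summarize_visit({"high_creatinine": True, "high_bun": True}))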
3. Discovery: RX

Our work on automated discovery using medical databases began with the RX project, which is described in detail in (Blum, 1982; Blum, 1986) and (Walker & Blum, 1986). Walker (1987) reviews systems for automated discovery. RX uses a large time-oriented patient database as its experience base, and a substantial knowledge base of clinical observations, intermediate findings, syndromes, and diagnoses. The knowledge base is also initialized with links that describe categorical, definitional, and causal relationships (Wiederhold, Blum, & Walker, 1986). Discovery in RX consists of proposing and validating previously unknown causal links among nodes in the knowledge base.

The principal innovation of the RX project was the addition of a knowledge base and a Study Module to the usual statistical package that investigators use in analyzing a database. Besides a database (Figure 1), the RX program consists of four major parts: the Discovery Module, the Study Module, a statistical analysis package, and the medical knowledge base.
The RX Discovery Module depends on statistical
approaches. It carries out time-lagged, non-parametric
correlation of pairs of attributes in the database, and
produces a list of hypotheses consisting of the
correlations ranked according to their probability. The RX
Study Module then designs a comprehensive study of the
most promising hypotheses. It identifies the probable causal event of the correlation by requiring consistent time precedence. Information from the medical knowledge base
is used to identify known covariate factors that may have
produced an association between the tentative cause and
effect. The Study Module uses quantitative information
about the data to assure that adequate data are available
for statistical testing. Knowledge about statistical methods,
a component of the knowledge base, is used to design the
statistical model of the hypothesis. The statistical analysis
package is invoked by the Study Module to test the
statistical model. The analysis package accesses relevant
data from patient records and then applies the statistical
model to the data. The results are returned to the Study
Module for interpretation.
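A rough sketch of the Discovery Module's screening step might look as follows. The attribute names, the fixed lag, and the significance cutoff are assumptions for illustration, not RX's actual parameters.

    # Sketch: time-lagged, non-parametric (Spearman) correlation of pairs
    # of attributes, ranked by significance. Illustrative assumptions only.
    import numpy as np
    from scipy.stats import spearmanr

    def lagged_hypotheses(records, lag=1, alpha=0.05):
        """records: dict attribute -> aligned numpy time series.
        Returns candidate causal hypotheses, most promising first."""
        hypotheses = []
        for cause in records:
            for effect in records:
                if cause == effect:
                    continue
                x = records[cause][:-lag]   # earlier values of candidate cause
                y = records[effect][lag:]   # later values of candidate effect
                rho, p = spearmanr(x, y)
                if p < alpha:
                    hypotheses.append((cause, effect, rho, p))
        return sorted(hypotheses, key=lambda h: h[3])

    # Toy data: cholesterol follows prednisone with a one-step lag.
    rng = np.random.default_rng(0)
    pred = rng.normal(size=50)
    chol = np.roll(pred, 1) + 0.5 * rng.normal(size=50)
    print(lagged_hypotheses({"prednisone": pred, "cholesterol": chol}))

Only the prednisone-to-cholesterol direction survives the screen, since the reverse pairing has no consistent time precedence.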
One of the unexpected results of the Study Module was in confirming and quantifying the hypothesis, proposed by the Discovery Module, that the drug prednisone increases cholesterol in the blood. We found that this result had previously been recorded in one regional medical literature reference, but was not widely known. The effect was not only proposed but also demonstrated in detail by the program, since the database contained much more experience than the prior studies. These results are described in (Blum, 1986).
From the point of view of knowledge acquisition, RX was
of interest in demonstrating how statistical and
knowledge-based methods could be integrated in one
program.
4. Quantitative knowledge acquisition

Quantitative knowledge acquisition obtains further knowledge about relationships among concepts defined in the initial knowledge base. An example is deriving likelihood ratios or other statistical parameters for rules in knowledge bases by statistical analysis of a database. Conventional statistics offer a variety of methods that are potentially useful for knowledge acquisition and validation. Pattern recognition approaches to quantitative knowledge acquisition, for example, are described in standard texts (James, 1985). Part of our work has been to define and develop a knowledge acquisition program that incorporates such statistical methods.

Because of the many possible sources of uncertainty,
error, and noise in the input to a classifier (such as
incomplete data, incorrect data, inconsistent data, user
inability to correctly observe, and so on), knowledge base
validation and performance evaluation are fundamentally
statistical problems. The Bayesian approach to
classification, while normative, is plagued by its
requirement for very large amounts of data (when using
the actuarial method) and its fundamentally exponential
nature (to deal with dependent evidence). We have
explored the advantages and consequences of using full
joint dependencies, of using the assumption of conditional
independence, and the use of special-purpose combining
functions. We will describe each of these approaches, and
our conclusions. Our program provides functions that
analyze a database using the different approaches, and
that aid in comparing the results of consultations using
each approach.
There are many classification algorithms described in
the statistics and pattern recognition literature
(James, 1985). It is useful to think of them as
alternative methods for combining and weighting
evidence, or as evidence-combining functions. In the
set of evidence-combining functions we include rule
based systems, (including those that use certainty
factors), perceptrons and neural nets, parametric
statistical methods such as linear and quadratic
discriminant analysis, and non-parametric statistical
methods such as kernel discriminant analysis, k-
nearest neighbor, and Classification and Regression
Trees (CART) (Breiman, Friedman, Olshen, & Stone,
1984).
The standard approach to quantitative knowledge
acquisition is this: We construct the classifier using a
set of examples from a database. We assume that the database has examples of all the
classes of interest. The resulting classifier is used to
classify previously unseen objects. The classifier is
evaluated in terms of its error rate on either objects
in the training set or, preferably, previously unseen
objects.
All classification systems use Bayes' rule, either
implicitly or explicitly. This is true for rule-based
systems, neural nets, discriminant analysis, and
every other classification system.
For a set of classes, C, and a set of measurements in
the measurement space X, the probability that a new
object with measurement vector x belongs to class Ci
is given by Bayes' rule:
P(Ci | x) = P(x | Ci) P(Ci) / Σj P(x | Cj) P(Cj)
where
P(Ci | x) = the probability that the object with
measurement vector x belongs to class Ci.
P(Ci ) = the prior probability of class Ci.
P(x | Ci ) = the probability that an object in class Ci
has measurement vector x.
The objective of all classification programs is to
determine P(Ci | x), the probability that the object
with measurement vector x belongs to class Ci. A
new object to be classified is assigned to the class Ci
for which the value of P(Ci | x) is greatest.
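As a minimal sketch, the decision rule can be written directly from the formula; the priors and likelihood function here are assumed inputs for illustration, not anything estimated by our program.

    # Sketch of the decision rule: assign x to the class maximizing P(Ci|x).
    def classify(priors, likelihood, x):
        """priors: dict class -> P(Ci); likelihood(c, x) returns P(x|Ci)."""
        post = {c: likelihood(c, x) * p for c, p in priors.items()}
        z = sum(post.values())                    # Bayes' rule denominator
        post = {c: v / z for c, v in post.items()}
        return max(post, key=post.get), post

    priors = {"A": 0.7, "B": 0.3}
    lik = lambda c, x: {"A": 0.2, "B": 0.9}[c] if x else {"A": 0.8, "B": 0.1}[c]
    print(classify(priors, lik, True))   # evidence shifts the decision to B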
To construct the classifier we can use a training set
taken from a database to determine P(Ci ), the prior
probability of class Ci, and P(x | Ci), the probability that an object in class Ci has measurement vector x. The training set should be an unbiased,
random sample from the population of objects that
are to be classified.
P(Ci ) can be easily estimated from the training set; it
is simply the proportion of all the objects in the
training set that belong to class Ci. In most of the
applications of rule-based expert systems, the prior
probabilities of all the classes are assumed to be
equal. That is, P(Ci ) is assumed to be the same for all
Ci. This is often not a good assumption.
P(x | Ci ) also may be estimated from the training set.
However, making these estimates may require more
data than are in the training set. The full multinomial
approach to Bayes requires calculating and storing,
for each possible conclusion, the likelihood of all
possible combinations of evidence given that
conclusion. If each of n pieces of evidence can take
one of two values (i.e., it is a binary categorical
variable) there are 2^n possible combinations,
meaning we must estimate 2^n parameters. This number
increases greatly when the evidence is not binary. When
more parameters are estimated from a given set of
data, the variance of the parameters increases.
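A two-line computation shows how quickly the parameter counts per class diverge; the linear count assumes the conditional-independence model discussed below.

    # Parameters per class for n binary evidence variables: the full joint
    # model needs one probability per combination (2^n), while conditional
    # independence (discussed below) needs only one per variable.
    for n in (5, 10, 20, 30):
        print(f"n={n:2d}: joint 2^n = {2**n:>13,}   independent: {n}")
    # At n=30 the joint model already needs over a billion parameters.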
The same problem of combinatoric explosion seen
with categorical data occurs with continuous-valued
data. Formally, the probability of any exact point value of a continuous variable is zero. However, we can
estimate probability density over small regions
rather than at points. This is, in effect, building a
multi-dimensional histogram. However, once the
multi-dimensional histogram is specified the
problem arises of either not having data in most of
the histogram's partitions as is the case with
categorical data, or having partitions so large that the
probability cannot be accurately estimated.
Dependencies

Several approaches for dealing with dependencies have been proposed. Each of the approaches has weaknesses, which we discuss next.

The full multinomial approach described above requires calculating and storing, for each possible conclusion, the likelihood of all possible combinations of evidence given that conclusion. For n binary variables there are 2^n
possible combinations. The use of full joint probabilities
suffers from an exponential explosion in the number of
combinations of pieces of evidence (the power set), high
variance of calculated probabilities due to small sample
size of each member of the power set, and difficulty
obtaining either actuarial data or subjective estimates.
The number of required calculations may be reduced by
assuming that each piece of evidence is conditionally
independent. For binary-valued evidence this requires 2n
marginal probabilities. If there is a database available,
these marginal probabilities may be calculated directly
from the database; this is sometimes called the actuarial
method. Conditional independence, using the database
actuarial method, may suffer loss of classification accuracy,
and is limited by the difficulty of obtaining actuarial data.
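A sketch of this actuarial method in Python, with hypothetical case records; the +1 smoothing for zero counts is an assumption, since the text does not specify how empty cells are handled.

    # Conditional independence with marginals counted from a database.
    from collections import Counter, defaultdict

    def train(cases):
        """cases: list of (evidence_dict, class_label); evidence is binary."""
        class_counts = Counter(label for _, label in cases)
        counts = defaultdict(int)
        for ev, label in cases:
            for name, value in ev.items():
                counts[(label, name, value)] += 1
        priors = {c: n / len(cases) for c, n in class_counts.items()}
        def p_marginal(c, name, value):
            # +1 smoothing so unseen combinations stay non-zero (assumption)
            return (counts[(c, name, value)] + 1) / (class_counts[c] + 2)
        return priors, p_marginal

    def classify(priors, p_marginal, evidence):
        post = {}
        for c, prior in priors.items():
            p = prior
            for name, value in evidence.items():
                p *= p_marginal(c, name, value)   # conditional independence
            post[c] = p
        z = sum(post.values())
        return {c: v / z for c, v in post.items()}

    cases = [({"fever": 1, "rash": 1}, "measles"),
             ({"fever": 1, "rash": 0}, "flu"),
             ({"fever": 1, "rash": 1}, "measles"),
             ({"fever": 0, "rash": 0}, "flu")]
    priors, p_marginal = train(cases)
    print(classify(priors, p_marginal, {"fever": 1, "rash": 1}))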
Another method, similar to conditional independence
using a database in that only marginal probabilities are
used rather than full joint probabilities, is to have
marginal probabilities estimated by an expert rather than
being calculated from a database. Conditional
independence using subjective estimates has the same
problem of reduced classification accuracy, and may be
limited by the inability or unwillingness of an expert to
provide estimates, along with the need to resolve
inconsistent estimates by the expert.
Norusis and Jacquez (Norusis, & Jacquez, 1975a; Norusis, &
Jacquez, 1975b) propose a method which they call
"attribute cluster models" in which dependent pieces of
evidence are grouped into subsets (attribute clusters).
Calculation of P(conclusion) proceeds as follows. Within
each subset, all joint probabilities are calculated in the
same manner as for full joint probabilities. The sets of
subsets are then treated as independent pieces of
evidence, and the probability of each conclusion is
calculated using the conditional independence assumption.
The number of required calculations depends on the
number and size of the subsets of dependent evidence.
Taking subsets of dependent evidence and treating them
as independent sets (the attribute clustering method) is
potentially exponentially explosive, though not as much so
as full joint probabilities.
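The following sketch shows the attribute-cluster computation: joint probabilities within each cluster, independence across clusters. The clusters and probability tables are invented for illustration.

    # Attribute-cluster posterior; illustrative assumptions throughout.
    def cluster_posterior(priors, clusters, joint_tables, evidence):
        """clusters: list of attribute-name tuples treated as dependent groups.
        joint_tables[(class, cluster)][values] = P(values | class)."""
        post = {}
        for c, prior in priors.items():
            p = prior
            for cluster in clusters:
                values = tuple(evidence[a] for a in cluster)
                p *= joint_tables[(c, cluster)][values]  # joint within cluster
            post[c] = p                                  # independent across clusters
        z = sum(post.values())
        return {c: v / z for c, v in post.items()}

    # Two dependent attributes grouped; a third forms its own cluster.
    priors = {"disease": 0.1, "healthy": 0.9}
    clusters = [("fever", "chills"), ("rash",)]
    tables = {
        ("disease", ("fever", "chills")): {(1, 1): 0.6, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.2},
        ("healthy", ("fever", "chills")): {(1, 1): 0.05, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.75},
        ("disease", ("rash",)): {(1,): 0.7, (0,): 0.3},
        ("healthy", ("rash",)): {(1,): 0.1, (0,): 0.9},
    }
    print(cluster_posterior(priors, clusters, tables,
                            {"fever": 1, "chills": 1, "rash": 1}))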
Dependent pieces of evidence may be grouped into subsets, as
later proposed by Norusis and Jacquez, but only one piece
of evidence from each subset may be used to calculate the
probabilities of the conclusions. This eliminates the need
for full joint probabilities of the members of each subset
of evidence. Eliminating all but one member of each set of
dependent attributes may produce a loss of classification
accuracy, and may perform much worse than using the
conditional independence assumption.
Functions can be written by the user to specify how to
combine pieces of dependent evidence (e.g., linear
discriminant, second-order Bahadur, special purpose). This
requires a number of functions comparable to the
power set of combinations of evidence needed for full
joint probabilities. Special-purpose functions for combining dependent attributes incur high computational and/or storage expense, and do not perform uniformly better
than using conditional independence.
There is no shortage of additional alternative approaches,
of which we will note only a few. Norusis and Jacquez
(Norusis, et al., 1975a) describe several proposals. Further
departures from the models we have considered include
causal network models and belief network models
(Cooper, 1989), which are an active research area.
It is worth briefly noting the situations in which the
conditional independence assumption is likely to be a
problem, because it need not always be so.
Misclassification is most likely to occur when P(C|E) is
similar across diseases. As pointed out by Miller (Miller,
Westphal, & Reigart, 1981) and Norusis and Jacquez
(Norusis, et al., 1975a), if the probabilities are dissimilar
then diseases can be readily distinguished; the conditional
independence assumption has little effect. When P(C|E) is
similar the conditional independence assumption may
substantially increase misclassification. Finally, if the
features (evidence) used to distinguish the diseases are
usually poor discriminators, then whether or not conditional independence is assumed the classification performance will be poor, and only the selection of new
features will make a substantial difference.
5. Knowledge Base Validation

Validating knowledge bases with data comprises examining and comparing the classification accuracy of alternative knowledge resources.
In pattern recognition terms, such a validation is equivalent to
comparison of different classifiers. The validation
code in our program provides standard statistical methods
to determine the overall accuracy of a classifier, to
determine class-by-class accuracy (false negative and
false positive rates), to identify pairs of classes that are
most often confused, and other functions. These provide
the data necessary to make decisions such as providing
additional training examples, modifying the decision
boundaries, adding new parameters to help distinguish
the classes, modifying rule sets, modifying prior
probabilities, or trying alternative classifiers. Users may
evaluate new classification results to compare the relative
performance.
Knowledge base validation in our program includes
deriving conclusions for a set of cases using a specified
inference method, comparing the inferred classification to
the known conclusions, comparing the accuracy of one
classification method to the results of using alternative
inference methods, and identifying cases that are most
frequently misclassified (including both false positives
and false negatives).
The program stores the results of running a set of
evaluation cases. Users can generate displays from this
data, for example lists of cases with the evidence, true
conclusion, and probability of each conclusion as
determined by each inference method. The probabilities
used to conclude about a particular case (prior, marginal,
and conditional probabilities) may be derived from the
test results table and from the knowledge base. The
number of cases used to calculate the probabilities, which
the program stores explicitly in the knowledge base as a
beta distribution, indicates the variance; a high variance
suggests that the relationship is a candidate for validation.
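For instance, with s positive cases out of n, the stored counts determine a Beta(s+1, n-s+1) distribution whose variance flags weakly supported probabilities. This small sketch assumes that parameterization; the text does not give the exact convention used.

    # Mean and variance of Beta(s+1, n-s+1) for a probability estimated
    # from s positive cases out of n (parameterization is an assumption).
    def beta_summary(s, n):
        a, b = s + 1, n - s + 1
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean, var

    print(beta_summary(3, 10))      # few cases: high variance -> validate
    print(beta_summary(300, 1000))  # many cases: similar mean, far lower variance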
Consider a hypothetical classification of objects into five classes (A, B, C, D, E). The program can produce a misclassification matrix for these classes (Figure 3). The
vertical axis shows the true class (the correct conclusion),
while the horizontal axis shows the class that was inferred
using a given classifier. The values in each cell of the
matrix are the number of cases in which the classifier
concluded the inferred class for a case whose correct
conclusion is shown in the actual class. For example, taking
the actual class A, we see that this classifier correctly
classified 18 A cases, misclassified 5 A cases as B, and
misclassified 3 B cases as A. This particular classifier did
not make any errors in classifying C, D, or E. The ideal
classifier would produce a diagonal matrix.
                     Inferred class
    True class    A    B    C    D    E   Total
        A        18    5    0    0    0     23
        B         3   13    0    0    0     16
        C         0    0    9    0    0      9
        D         0    0    0    6    0      6
        E         0    0    0    0    4      4
      Total      21   18    9    6    4     58

Figure 3. Misclassification matrix for a classifier.
The program can use the misclassification matrix to
determine the false positive rate, false negative rate, and
classification accuracy for each class, along with the
overall classification accuracy (Figure 4). In this example,
we see that the classifier has most difficulty in
distinguishing class A from class B. If the rate of
misclassification of class A and class B is unacceptable we
should consider providing additional training examples,
modifying the decision boundaries, adding new
parameters to help distinguish the classes, or trying an
alternative classifier to deal with the misclassification. If one of these alternatives is chosen, the user may then
evaluate the new classification results in the same way as
in the example here, and compare the relative
performance.
             False -ve       False +ve        Correct
           Count    %      Count    %      Count     %
    A       5/23  21.7      3/23  13.0     18/23  78.3
    B       3/16  18.7      5/16  31.2     13/16  81.2
    C        0/9   0.0       0/9   0.0       9/9 100.0
    D        0/6   0.0       0/6   0.0       6/6 100.0
    E        0/4   0.0       0/4   0.0       4/4 100.0

Figure 4. False positive and false negative rates derived from the misclassification matrix in Figure 3. Overall accuracy is 50/58 cases = 86.2%.
The misclassification matrix, and the derived false positive
and false negative rates, identify the cases which are most
likely to be mis-classified, and for which the knowledge
base needs modification. It is also desirable to determine
the probability that the true classification accuracy (as
opposed to the accuracy calculated from the testing set) is
within an acceptable range, bounded by given confidence
intervals. The method is described in Jensen (1986).
Conclusions

Neural nets provide a form of knowledge acquisition from data as well. Where our approach builds on high-level abstractions, neural nets provide an exhaustive set of nodes for all combinations of the base data. However, the internal nodes are hidden and not labeled, so that the encoded knowledge cannot be exploited in novel ways, but is limited to reuse over similar cases.
Future automated knowledge acquisition systems should be able to create new
nodes as well as new relationships among the nodes.
An expert can check whether such new nodes correspond to real-world concepts that can be integrated into larger systems. Some of our current work focuses on how to compose results from multiple modestly-sized intelligent systems (Wiederhold, 1992).
A system that is linked, as ours are, to a database has a well-defined
domain, and is likely to be maintainable.
However, powerful systems to aid in complex decision-making need access
to multiple, distinct sources of knowledge.
We believe that careful composition, rather than overwhelming size,
provides paths for future intelligent information systems.
Acknowledgements

We gratefully acknowledge support for this work from the IBM Knowledge Based Systems Group, Menlo Park, from the SUMEX-AIM resources, funded by NLM, and from DARPA.

References

Blum, R. L. (1982). Discovery, Confirmation, and Incorporation of Causal Relationships from a Large Time-Oriented Clinical Database: The RX Project. Computers and Biomedical Research, 15(2), 164-187.
Blum, R. L. (1986). Computer-Assisted Design of Studies
Using Routine Clinical Data: Analyzing the Association of
Prednisone and Serum Cholesterol. Annals of Internal
Medicine, 104(6), 858-868.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J.
(1984). Classification and Regression Trees (CART). Monterey, CA:
Wadsworth & Brooks.
Cooper, G. F. (1989). Current Research Directions in the
Development of Expert Systems Based on Belief Networks.
Applied Stochastic Models and Data Analysis, 5, 39-52.
DeZegher-Geets, I. M., Freeman, A. G., Walker, M. G., Blum,
R. L., & Wiederhold, G. (1987). Computer-Aided
Summarization of a Time-Oriented Medical Data Base.
Third Annual International Conference on Computerization of Medical
Records -- Creating Patient Information Systems, Institute for Medical Record
Economics, Chicago, March 1987.
DeZegher-Geets, I. M., Freeman, A. G., Walker, M. G., Blum,
R. L., & Wiederhold, G. (1988). Intelligent Summarization
and Display of On-Line Medical Records. M.D. Computing,
5(3), 38-54.
Downs, S., Walker, M. G., & Blum, R. L. (1986). Automated Summaries of Patient Medical Records. SCAMC, Washington, D.C.
James, M. (1985). Classification Algorithms. London,
England: Collins.
Miller, M. C., Westphal, M. C., & Reigart, J. R. (1981).
Mathematical Models in Medical Diagnosis. New York:
Praeger Publishers.
Norusis, M. J., & Jacquez, J. A. (1975a). Diagnosis. I.
Symptom Nonindependence in Mathematical Models for
Diagnosis. Computers and Biomedical Research, 8, 156-
172.
Norusis, M. J., & Jacquez, J. A. (1975b). Diagnosis. II.
Diagnostic Models Based on Attribute Clusters: A Proposal
and Comparisons. Computers and Biomedical Research, 8,
173-188.
Walker, M. G. (1987). How Feasible is Automated
Discovery? IEEE Expert, 2(1), 70-82.
Walker, M. G., & Blum, R. L. (1986). Towards Automated Discovery from Clinical Databases: The RADIX Project. SCAMC, Washington, D.C.
Wiederhold, G. (1992). Mediators in the Architecture of Future Information Systems. IEEE Computer, March 1992, 38-49. Based on a 1989 report.
Wiederhold, G., Blum, R. L., & Walker, M. G. (1986). An Integration of Knowledge and Data Representation. In Brodie, Mylopoulos, & Schmidt (Eds.), On Knowledge Base Management Systems: Integrating Artificial Intelligence and Database Technologies (pp. 431-444). Springer Verlag.