Stanford University
18 Jan 2003
Position Paper prepared for the SIAM workshop on Security Data Mining, San Francisco, 3 May 2003
A wide variety of data-mining tools is being developed and extended to discover relationships that might pose threats to our security. These tools may be based on traditional statistical inference procedures, on the extraction of association rules, or on any technique that attempts to draw previously unknown information out of large and varied data collections.
Fortunately for us, but unhappily for the data-mining techniques employed today, the information we need is rare, and buried deep in a mass of observations of non-threatening events. Those events display relationships of their own, relationships as complex as any we need to discover, and at frequencies higher than those of the relationships that imply threats.
The objective of traditional statistics is to validate hypothetical relationships, presented in a model that identifies independent and candidate dependent variables. Regression and factor analysis, in their many incarnations, are the primary tools. Relationships are validated when the probability that an observed quantitative dependence arose from random events becomes small, say less than 5% or 1%. Fishing for relationships without a model is discouraged, since eventually (about 20 tries at the 5% level) some relationship will be discovered that is actually due to happenstance. The intent is to establish strong relationships that can be exploited for broad actions: clinical treatments, effective marketing, and the like. Infrequent events are not of interest.
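To make the multiple-comparisons arithmetic explicit: if k independent tests are each run at significance level a, the chance of at least one spurious finding is 1 - (1 - a)^k. At a = 0.05 and k = 20 this is 1 - 0.95^20, or about 0.64, so the odds favor reporting at least one happenstance relationship.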
Data mining opened the search for serendipitous relationships by not insisting on a prior model. Associated events that stand out through frequent co-occurrence are reported, and the receiver of such data will still build a mental model that can identify and relate cause and effect. In the apocryphal finding of a market-basket relationship between purchases of diapers and beer, the actual causal variable – having a child at home – was not available in the observed data. Again, a sufficient frequency is needed for one particular association to rise out of the mass of possible associations. In a world where we wish to test thousands of event types, the number of binary associations will be in the millions. But we obviously have to look for ternary and higher-order associations as well.
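The combinatorics grow rapidly: with, say, 5,000 event types there are C(5000, 2) = 12,497,500, roughly 1.25 x 10^7, candidate binary associations, and already C(5000, 3), roughly 2.1 x 10^10, ternary ones, so exhaustive enumeration of higher-order associations quickly becomes infeasible without pruning.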
A statistical analysis would have had to include such observations (here, the presence of a child at home), although causality would still not be definitively established. If time stamps are available, some inference is possible: if events of type a always precede events of type b, then it is very unlikely that instances of b cause a, and events of type a might well cause b. Still, there might be a missing true causal variable c with consistent onset delays between c → a and c → b. If there is no temporal precedence between a and b, then it is fairly certain that data on the causal, true independent variable c has not been collected, or could not even be observed.
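As a minimal sketch of such a precedence test, consider the fragment below; the per-subject event layout and the simple first-occurrence comparison are assumptions made for illustration, not part of any existing system.

    # Check whether type-a events consistently precede type-b events.
    # histories: subject -> list of (event_type, timestamp) pairs.
    def a_precedes_b(histories, a, b):
        earlier = later = 0
        for events in histories.values():
            ta = [t for typ, t in events if typ == a]
            tb = [t for typ, t in events if typ == b]
            if ta and tb:                  # subject saw both event types
                if min(ta) < min(tb):
                    earlier += 1
                else:
                    later += 1
        total = earlier + later
        return earlier / total if total else None

    histories = {
        "s1": [("a", 1), ("b", 5)],
        "s2": [("a", 2), ("b", 3)],
        "s3": [("b", 1), ("a", 4)],
    }
    print(a_precedes_b(histories, "a", "b"))  # 0.67: precedence not consistent

A fraction near 1.0 would rule out b causing a; a mixed result, as here, suggests the true cause c may never have been observed.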
The lack of true causal variables also confounds many traditional statistical analyses. Most databases, collected without an analytic model in mind, fail to capture the true causes. Even if a model can be constructed, the causal variables may not have been collected. When causes are unobservable, as is still the case for most conditions affected by one's genomic makeup, we can at best look for surrogates, such as race or familial history. And causal conditions rarely act alone. Quantifying the relative contributions of causal factors requires more insight, deeper models, and immense quantities of data.
What conclusion can we draw from this exposition for the task at hand? Realizing that we will be looking for rare events, and not for routine happenings, we must have methods to extract the rare, second-order events from the expected and explainable. In order to document the explainable, we must have a means to represent such knowledge.
To account for expected events it will be necessary to create scenarios. If travel aberrations are to be discovered, we will need to populate scenarios of business travel, vacation travel, family emergencies, and the like. If we have sufficient data to recognize these patterns, then a high frequency of travel events becomes explainable, and we can analyze the remaining, unexpected patterns. We may find cases where two return air tickets in opposite directions are purchased. That can be explained by price differentials for weekday return tickets, and it creates a subclass of business trips. For many of these models new data will have to be acquired: causal data if possible, but more often surrogates.
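A minimal sketch of this explain-then-examine-the-residue step follows; the scenario predicates and trip-record fields are invented for illustration.

    # Discount trips that match a known scenario; keep the residue.
    scenarios = {
        "business": lambda t: t["weekday"] and t["round_trip"],
        "vacation": lambda t: t["duration_days"] >= 7,
    }

    def residue(trips):
        # Keep only trips that no known scenario explains.
        return [t for t in trips
                if not any(match(t) for match in scenarios.values())]

    trips = [
        {"weekday": True,  "round_trip": True,  "duration_days": 2},
        {"weekday": False, "round_trip": True,  "duration_days": 10},
        {"weekday": False, "round_trip": False, "duration_days": 1},
    ]
    print(residue(trips))  # only the third, unexplained trip remains

As scenarios are refined, for instance by adding the weekday-fare subclass above, more of the mass of events is explained away and the residue shrinks toward the genuinely aberrant.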
These models will need to incorporate hierarchies, and most higher-level entries will not be represented directly in the data collections, but only inferred, as with the 'cheap-skate business traveler'. Building these models can be aided by bringing in domain experts. The people who try to maximize air-travel revenue are likely to be aware of many such categories already, although they will not have as broad an access to data as threat analyses will require.
To be effective, the analytical systems need a feature that allows continuous refinement. Today, the models used or inferred in data analyses remain outside the processing systems, typically on paper documents. The model's attributes are used by the statisticians, and they and the data miners explain the extracted results in modeling terms.
To close the loop, the model must be represented in computer-processable form. The results of any completed analysis can then be entered as a known relationship and discounted in subsequent analyses. The model can be updated with new quantitative relationships found among its nodes, with new attributes for which data have to be collected, and with new nodes as needed to identify new submodels. If a previously unknown pattern emerges, it can be entered under a temporary name, and processing can continue while experts try to discover the reason for the aberration, as with the fare-imbalance pattern cited above.
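One way such a processable model could be held is sketched below; the class and its method names are illustrative assumptions, not a description of an existing system.

    # A model that records known relationships for discounting, and admits
    # provisional nodes under temporary names while experts investigate.
    class Model:
        def __init__(self):
            self.known = {}        # pattern -> accepted explanation
            self.provisional = {}  # temporary name -> unexplained pattern

        def record(self, pattern, explanation):
            # Enter a completed analysis as a known relationship.
            self.known[pattern] = explanation

        def is_explained(self, pattern):
            # Subsequent analyses discount patterns found here.
            return pattern in self.known

        def flag(self, temp_name, pattern):
            # Hold a new pattern while its cause is investigated.
            self.provisional[temp_name] = pattern

    m = Model()
    m.record("opposite-return-tickets", "weekday fare differential")
    print(m.is_explained("opposite-return-tickets"))  # True
    m.flag("pattern-X", "unexplained cluster of one-way fares")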
The representation of these models will stress computer-science technology. Many common representations are hierarchical. Analyses that follow the divide-and-conquer paradigm are inherently hierarchical, and any single data structure supporting a model is likely to be hierarchical as well. However, causal relationships are likely to be cross-cutting, and they have to be represented too. Cross-cutting relationships may be discovered at any level. They can be aggregated to higher levels, and additional analyses will be required to partition quantitative findings among nodes at lower levels of the model.
Other models will have their own hierarchies. These will intersect, and the overall model structure becomes a network. It is not necessary that the stored source data themselves be bound into a network structure. The model, being processable, can extract data out of conventional, say relational, databases and restructure it for a specific analysis task. Object-oriented structures can be temporarily populated to match the hierarchies. After processing, any discovered relationships must be mapped into the overall model representation so that they can contribute to the overall knowledge and be accounted for in subsequent analyses.
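The sketch below illustrates one possible shape for such a structure: hierarchical nodes, cross-cutting links between hierarchies, and temporary population from relational rows. All names and fields are illustrative assumptions.

    # A model node: hierarchical children plus cross-cutting relations
    # that may reach into other hierarchies, forming a network.
    class Node:
        def __init__(self, name, parent=None):
            self.name = name
            self.children = []
            self.cross_links = []  # cross-cutting (e.g. causal) relations
            self.instances = []    # data temporarily bound for one task
            if parent is not None:
                parent.children.append(self)

    travel = Node("travel")
    business = Node("business-trip", travel)
    Node("cheap-skate-business-traveler", business)  # inferred, not in the data

    fares = Node("fare-structure")         # a second hierarchy
    fares.cross_links.append(business)     # where the hierarchies intersect

    # Temporarily populate the object structure from relational rows.
    rows = [{"ticket": "A-B", "weekday": True},
            {"ticket": "C-D", "weekday": False}]
    business.instances = [r for r in rows if r["weekday"]]
    print(len(business.instances))  # 1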
In addition to having incomplete models, we will also encounter patterns of missing data. These can also be modeled, but they will not allow quantitative analyses unless surrogates can be found. Surrogates, being abstractions, will appear at higher levels in the hierarchy. Again, experts will be needed to decide whether missing data can be systematically explained or whether further sleuthing and modeling is required to arrive at explanations or to recognize significant aberrations.
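A small sketch of falling back to a higher-level surrogate when a value is missing; the attribute names and the fallback chain are assumed for illustration.

    # Prefer the observed value; otherwise use a modeled surrogate
    # drawn from a higher level of the hierarchy.
    def value_or_surrogate(record, attr, surrogates):
        if record.get(attr) is not None:
            return record[attr], "observed"
        if attr in surrogates:
            return surrogates[attr](record), "surrogate"
        return None, "missing"

    surrogates = {"income": lambda r: r["region_median_income"]}
    rec = {"income": None, "region_median_income": 52000}
    print(value_or_surrogate(rec, "income", surrogates))  # (52000, 'surrogate')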
There is much interest, and potential funding, today in expanding data-mining technologies into very difficult domains. Getting significant results when searching for infrequent events will require more sophistication than is widely found. Pieces of the required technologies exist, but they are owned by distinct communities: data managers, modelers, statisticians, computer scientists, artificial-intelligence specialists, and domain experts. Bringing them together will not be easy. Defining an initial modest demonstration task might be a first step, but even a modest task will need to be of substantial scale, comparable to the larger studies we have seen in health care.
The concepts behind this essay are largely based on the RX project, specifically the PhD thesis research of Dr. Robert Blum in 1982. We are trying to recover some of the material from old archives; to see what we have, please check
http://www-db.stanford.edu/pub/gio/inprogress.html#RX.
The RX study used data about immunological disease findings, collected by Dr. James Fries and residents in the Stanford immunology clinic, and kept in a time-oriented database system. It was published as:
Blum, R.L.: Discovery and Representation of Causal Relationships from a Large Time-Oriented Clinical Database: The RX Project; Springer Verlag Lecture Notes in Medical Informatics No. 19, Lindberg and Reichertz (eds), 1982, 242 pp.
Earlier publications documenting the RX project include:
Blum, R.L. and Wiederhold, G.: Inferring Knowledge from Clinical Data Banks Utilizing Techniques from Artificial Intelligence; Proc. of the Annual Symp. on Computer Applications in Medical Care – SCAMC Vol. 2, IEEE, Nov. 1978.
Blum, R.L.: Automating the Study of Clinical Hypotheses on a Time-Oriented Database: The RX Project; MEDINFO 80, Lindberg and Kaihara (eds), IFIP, North-Holland, 1980, pp. 456-460.
Blum, R.L.: Displaying Clinical Data from a Time-Oriented Database; Computers in Biology and Medicine, Vol. 11 No. 4, 1981.
Blum, Robert L.: Discovery, Confirmation, and Incorporation of Causal Relationships from a Large Time-Oriented Clinical Database: The RX Project; Computers and Biomedical Research, Academic Press, Vol. 12 No. 2, 1982, pp. 164-187.
Blum, R.L. and Wiederhold, G.C.M.: Studying Hypotheses on a Time-Oriented Clinical Database: An Overview of the RX Project; Proc. of the Annual Symp. on Computer Applications in Medical Care – SCAMC Vol. 6 (Oct.-Nov. 1982, Washington DC), IEEE 82 CH1805-1, pp. 725-735.
Blum, Robert L. and Gio Wiederhold: "Studying Hypotheses on a Time-Oriented Clinical Database: An Overview of the RX Project"; in J.A. Reggia and S. Tuhrim: Computer-Assisted Medical Decision Making; Springer Verlag, 1985, pages 245-253.
Blum, Robert L.: Computer-Assisted Design of Studies Using Routine Clinical Data: Analyzing the Association of Prednisone and Cholesterol; Annals of Internal Medicine, Jul. 1986. <<verify, get pages>>
Dannenberg, A.L., Shapiro, A.R., and Fries, J.F.: "Enhancement of Clinical Predictive Ability by Computer Consultation"; Methods of Information in Medicine, Vol. 18 No. 1, Jan. 1979, pp. 10-14.
Fries, J.F.: The Chronic Disease Databank: First Principles for Further Directions; J. Med. Philos., Vol. 9 No. 2, May 1984, pp. 161-180.
McShane, D.J., Harlow, A., Kraines, R.G., and Fries, J.F.: TOD: A Software System for the ARAMIS Data Bank; IEEE Computer, Vol. 12 No. 11, Nov. 1979, pp. 34-40.
Related material is found in:
Albridge, K.M., Standish, J., and Fries, J.F.: Hierarchical Time-Oriented Approaches to Missing Data Inference; Computers and Biomedical Research, Academic Press, Vol. 21 No. 4, Aug. 1988, pp. 349-366.
de Zegher-Geets, I.M., Freeman, A.G., Walker, M.G., Blum, R.L., and Wiederhold, G.: Summarization and Display of On-Line Medical Records; M.D. Computing, Springer Verlag, Vol. 5 No. 3, Mar. 1988.
Parsaye, K., Chignell, M., Khoshafian, S., and Wong, H.: Intelligent Databases: Object-Oriented, Deductive Hypermedia Technologies; John Wiley & Sons, 1989, 510 pp.
Springsteel, Frederic: Complexity of Hypothesis Formation Problems; Int. Journal of Man-Machine Studies, Vol. 15, 1981, pp. 319-332.
Walker, Michael G.: How Feasible is Automated Discovery?; IEEE Expert, Spring 1987, pp. 69-82.
Walker, Michael G. and Gio Wiederhold: "Acquisition and Validation of Knowledge from Data"; in Z.W. Ras and M. Zemankova: Intelligent Systems, State of the Art and Future Directions; Ellis Horwood, 1990, pages 415-428.
Weyl, S., Fries, J., Wiederhold, G., and Germano, F.: A Modular, Self-Describing Clinical Databank System; Computers and Biomedical Research, Academic Press, Vol. 8, 1975, pp. 279-293.
Wiederhold, Gio, Robert L. Blum, and Michael Walker: "An Integration of Knowledge and Data Representation"; Proc. of the Islamorada Workshop, Feb. 1985, Computer Corporation of America, Cambridge MA; also Report KSL-86-13, Stanford University, 1986. Available as safe/pap/ISLAMORADA.TEX.Z. <<put it on-line>>
Wiederhold, Gio: Knowledge Versus Data; in Brodie, Mylopoulos, and Schmidt (eds): On Knowledge Base Management Systems: Integrating Artificial Intelligence and Database Technologies; Springer Verlag, Feb. 1986.
Wiederhold, G.C.M., Walker, M.G., Blum, R.L., and Downs, S.M.: Acquisition of Knowledge from Data; International Symp. on Methodologies for Intelligent Systems, Un. of Tennessee, Oct. 1986, pp. 74-84.
Models and data transformation were the topic of the Penguin project:
Barsalou, Thierry: "An Object-based Architecture for Biomedical Expert Database Systems"; SCAMC 12, IEEE CS Press, 1988.
Barsalou, Thierry, R. Martin Chavez, and Gio Wiederhold: "Hypertext Interfaces for Decision-Support Systems: A Case Study"; Proc. IFIP MEDINFO 89, Beijing and Singapore, Dec. 1989, pages 126-130.
Barsalou, Thierry and Gio Wiederhold: "Knowledge-directed Mediation Between Application Objects and Data"; Proc. Working Conf. on Data and Knowledge Integration, Un. of Keele, England, 1989, Pittman Pub.
Barsalou, T. and G. Wiederhold: "Complex Objects for Relational Databases"; Computer Aided Design, Vol. 22 No. 8, Butterworth, Great Britain, October 1990.
Barsalou, T., N. Siambela, A. Keller, and G. Wiederhold: "Updating Relational Databases through Object-Based Views"; ACM SIGMOD Conf. on the Management of Data 91, Boulder CO, May 1991; on-line at <<>>.
Barsalou, T., W. Sujansky, L.A. Herzenberg, and G. Wiederhold: "Management of Complex Immunogenetics Information Using an Enhanced Relational Model"; IMIA Yearbook in Medical Informatics, International Medical Informatics Association, 1992.
Hara, Yoshinori, Arthur M. Keller, Peter Rathmann, and Gio Wiederhold: "Implementing Hypertext Database Relationships through Aggregations and Exceptions"; Hypertext'91 (Third ACM Conference on Hypertext Proceedings), San Antonio, Texas, December 15-18, 1991, pages 75-90. Abstract on-line at <<>>.
Keller, Arthur M. and Gio Wiederhold: "Penguin: Objects for Programs, Relations for Persistence"; in Roberto Zicari and Akmal Chaudhri (eds.): Succeeding with Object Databases; Wiley, 2000, pp. 75-88. <<on line>>
Law, Kincho H., Gio Wiederhold, Thierry Barsalou, Niki Siambela, Walter Sujansky, David Zingmond, and Harvinder Singh: "Architecture for Managing Design Objects in a Shareable Relational Framework"; International Journal of Systems Automation: Research and Applications (SARA), Vol. 1 No. 1, pages 47-65, 1991.
Wiederhold, Gio, Peter Rathmann, Thierry Barsalou, Byung Suk Lee, and Dallan Quass: "Partitioning and Composing Knowledge"; Information Systems, Vol. 15 No. 1, 1990, pages 61-72.
Wiederhold, Gio, Thierry Barsalou, Walter Sujansky, and David Zingmond: "Sharing Information Among Biomedical Applications"; in T. Timmers and B.I. Blum (eds): Software Engineering in Medical Informatics; IMIA, North-Holland, 1991, pages 49-84.
Wiederhold, Gio and Arthur Keller: "Integrating Data into Objects Using Structural Knowledge"; Third International Symposium on Command and Control Research and Technology (3ICCRTS), National Defense University, 17-20 June 1997, pages 842-853.
------------------ o ------------------------------ o ------------------------------