Overview of the IBM ProbE Data Mining Engine

Ed Pednault
Mathematical Sciences Department
IBM T. J. Watson Research Center
Yorktown Heights, New York

pednault@us.ibm.com

Abstract

IBM ProbE(TM) (pronounced "probe," for PROBabilistic Estimation) is a customizable data mining engine that is being developed to enhance IBM's competitive position in delivering predictive modeling solutions. ProbE incorporates state-of-the-art algorithms that have thus far demonstrated the ability to consistently generate high-quality predictive models on real-world problems, such as customer response modeling for targeted marketing. Such consistency is important from an operational standpoint because it permits models to be periodically refreshed as batch jobs with no human intervention. In addition, models can be exported in the form of executable code, either as SAS code or as parallelizable DB2-UDB user-defined functions (UDFs) coded in C. Automatic generation of UDFs is particularly important for deploying models in operational databases. No physical limits are imposed on data set size: thousands of columns and millions of rows are possible. For example, in a recent joint project with Fingerhut, Inc., a leading direct-mail retailer, data sets comprising 1,400 input attributes and 500,000 rows were routinely used for constructing models. ProbE is currently designed to be an embeddable system so that it can be integrated with an existing product, integrated with customer software, or packaged as a stand-alone application. No restrictions are placed on the source of data. Flat-file and DB2 access are currently implemented; other sources, such as main-memory databases and the Web, are also possible. ProbE's architecture has also been designed to support data-partition parallelism, and efforts are currently underway to implement ProbE as a parallelizable DB2 Extender. This talk will present the scientific principles that form the basis for ProbE's predictive modeling algorithms.

Biography

Ed Pednault joined IBM in 1996, where he is a Research Staff Member in the Data Abstraction Research Group of the Mathematical Sciences Department. From 1986 to 1995, he was a Member of Technical Staff at AT&T Bell Laboratories. Ed received a Ph.D. in electrical engineering from Stanford University in 1987, an M.S. in computer science from Stanford in 1981, and a B.Eng. in electrical engineering from McGill University in 1979. His current research interests center on statistical learning theory and its application to automated predictive modeling.