Database Seminar May 2 1997

Dynamic Itemset Counting (DIC)
You will never miss the train again...

Shalom Tsur
Hitachi America Ltd., R&D Division
Santa Clara

Abstract:

The derivation of associations from data is one of the most commonly used techniques in datamining today. An essential ingredient in this process is the identification of large itemsets--subsets of items that appear together in a significant number of the records of the data analyzed. Itemsets form the basis for the formulation of the association rules of interest. The standard method used in the derivation, Apriori, counts the itemsets by performing multiple passes over the data. Other methods, discussed by H. Toivonen, use sampling methods to estimate the itemsets and so to reduce the number of passes required.

Dynamic Itemset Counting (DIC) is a method developed by S. Brin, R. Motwani and J. Ullman at Stanford and by S. Tsur and G.D. Ramkumar of Hitachi America. The method seeks to strike a balance between the performance of Apriori and the accuracy of sampling methods by using multiple counts during each pass over the data. The process can be thought of as a train that makes multiple stops during a pass over the data. At each stop a new counting process can commence and it has been shown that the total number of passes can be reduced relative to other methods.

The method lends itself to parallel computation and some early results will be reported.