Database Seminar May 2 1997
Dynamic Itemset Counting (DIC)
You will never miss the train again...
Hitachi America Ltd., R&D Division
The derivation of associations from data is one of the most commonly used techniques in datamining today. An
essential ingredient in this process is the identification of large itemsets--subsets of items that appear together in a
significant number of the records of the data analyzed. Itemsets form the basis for the formulation of the association
rules of interest. The standard method used in the derivation, Apriori, counts the itemsets by performing multiple
passes over the data. Other methods, discussed by H. Toivonen, use sampling methods to estimate the itemsets and so
to reduce the number of passes required.
Dynamic Itemset Counting (DIC) is a method developed by S. Brin, R. Motwani and J. Ullman at Stanford and by S.
Tsur and G.D. Ramkumar of Hitachi America. The method seeks to strike a balance between the performance of
Apriori and the accuracy of sampling methods by using multiple counts during each pass over the data. The process
can be thought of as a train that makes multiple stops during a pass over the data. At each stop a new counting process
can commence and it has been shown that the total number of passes can be reduced relative to other methods.
The method lends itself to parallel computation and some early results will be reported.