Set Containment Joins: Testbed

This page offers for download a testbed for evaluating set containment join algorithms. Current version is v3.0.

Previous versions of the testbed were used for conducting the experiments published in [1] (v1.0), and [2] (v2.0).

Version v3.0 contains an implementation of the following algorithms: Adaptive Peek-and-Sweep Join (APSJ), Adaptive Divide-and-Conquer Join (ADCJ), Partitioning Set Join (PSJ), Signature Hash Join (SHJ), and hashtree join from [3]. The hashtree join algorithm is new in v3.0 and has not yet been evaluated against the other algorithms.

The testbed includes the complete source code and is free for academic use. Your feedback will be appreciated. Please acknowledge the author if you extend it or publish results obtained using the testbed.

Installation

Download and install Berkeley DB with Java API support. Include the libraries (db.jar) in your CLASSPATH. Note: the testbed was tested with Berkeley DB v4.0.14.

Download the current version of the testbed (about 200K) and include it in your CLASSPATH.

Running

Create a directory <DIR> to hold the relations and the intermediate results.

Create two default synthetic relations R and S using the following command:

   java -Dscj.base_dir=<DIR> edu.stanford.db.alg.join.Loader R S

Run a sample join algorithm (APSJ followed by a signature nested loop):

   java -Dscj.base_dir=<DIR> edu.stanford.db.alg.join.Main R S -f APSJ -f sigNestedLoop

For measurements, make sure that the file caching of your operating system has been disabled; otherwise the I/O performance is distorted (see [2], Sec. 5).

Run the Loader and Main without any parameters to see the list of available options and algorithms. You should see the following output:

$ java edu.stanford.db.alg.join.Loader
Usage: create <R> <S> [options] | dump <R>
 where [options] are:
	 [-domain <10000>]
	 [-payload <100>]           size of payload per tuple (in bytes)
	 [-setcard <50> <100>]      average set cardinalities
	 [-relcard <10000> <20000>] relation cardinalities
	 [-cdist <g> <g>]           set cardinality distribution (u=uniform, g=Gaussian, z=Zipfian, p=Poisson)
	 [-udeviation <20> <40>]    deviation for uniform set cardinality distribution
	 [-gdeviation <20> <40>]    deviation for Gaussian set cardinality distribution (sigma value)
	 [-zskew <0.01> <0.01>]     skew for Zipfian set cardinality distribution
	 [-vdist <u>]               value distribution (u=uniform, g=Gaussian)
	 [-vdeviation <1000>]       deviation for uniform value distribution
	 [-cfactor <0.0>]           correlation factor (0..1)
	 [-cportion <0>]            correlated domain portion
 Use option -Dscj.base_dir=<DIR> to set base directory.

$ java edu.stanford.db.alg.join.Main
Testbed for evaluating set containment join algorithms (v3.0) by Sergey Melnik
See http://www-db.stanford.edu/~melnik/scj for more information.

Usage: <R> <S>
	 [-part <32>]              number of partitions
	 [-sig <160>]              signature size in bits
	 [-sample <0.05>]          percentage of relations to sample
	 [-dbcache <1.0>]          database cache size in MB
	 [-mem <1.0>]              available memory in MB
	 [-verify <mem>]           verification procedure (skip|inline|part|block|post|mem)
	 [-verifyAlg <indexScan>]  verification access mode (dynamic|scan|indexScan)
	 [-verifyBlockSize <10>]   verification/result buffer size in MB
	 [-materialize <none>]     materialization before joining (none|disk|mem)
	 [-fill <0.95>             fill factor for materialization]
	 [-fillTarget <none>]      fill target (above|below|none)
	 (-f <LSJ | SHJ | DCJ | PSJ | ADCJ | APSJ | sigHashtree | sigNestedLoop | tupleHashtree | tupleNestedLoop>)+
	                           sequence of algorithm(s) to be applied
 Use option -Dscj.base_dir=<DIR> to set base directory.

References

[1] S. Melnik, H. Garcia-Molina: Divide-and-Conquer Algorithm for Computing Set Containment Joins, Proc. 8th EDBT, Prague, March 2002

[2] S. Melnik, H. Garcia-Molina: Adaptive Algorithms for Set Containment Joins, ACM Trans. on Database Systems, March 2003

[3] R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994

Last modified: Nov 25 2003 by Sergey Melnik