CS245A PROBLEM SET #3
Due by 5pm, Tuesday, February 3, 1998.
4 problems, equal credit. 100 pts.

Problem 1: Linear Hashing

Suppose we start with an empty linear hash table, with only (empty) buckets for 0 and 1. Our hash function produces k bits for some very large k. Blocks hold two records, as in the class example, but unlike the class example, we use the threshold 100%; that is, we do not add an extra bucket until the number of records exceeds twice the number of buckets. To be more precise, the addition of a bucket occurs after a request to insert into the table is made, but before the insertion itself is done.

We insert records into the table, and by strange coincidence, the results of applying the hash function to the first key is 00...000, the result of applying it to the second key is 00...001, for the next 00...010, then 00..011, 00...100, 00...101, and so on. That is, the hash value of the ith key is i-1 in binary.

Question: what is the first record that causes an overflow block to be created? Explain your reasoning briefly.

Problem 2: Grid Files

Suppose we have a relation R(x,y) where the range of both x and y is from 1 to 1000. We'd like to store R in a grid file, using three partitions (i.e., four stripes) in the x direction and the same in the y direction (i.e., 16 regions). Unfortunately, the data in R is surprising; it consists of the 1000 pairs (i,i) for i=1,2,...,1000.

For what partitions is the maximum number of points in a region is minimized? What is that maximum? (Hint: it's not 250.)

Problem 3: Multiple-Key Indexes

There is a relation R(X,Y,Z), where the pair of attributes X and Y together form the key. X ranges from 1 to 100, and Y from 1 to 1000. For each X there are records with 100 different values of Y, and for each Y there are records with 10 different values of X. (i.e., there are 10,000 records in relation R). We need to answer queries of the form
     SELECT Z
     FROM R
     WHERE X = c AND Y = d;
for various values of constants c and d.

We decide to set up a multiple-key index, using blocks that can hold 10 key-pointer pairs (a key can be either an X- or Y-value). The options we have are:

i)
We can choose either X or Y to be the first attribute.
ii)
We can build on top of any index one or more additional levels of index.
Your questions:
a)
Describe a structure with X as the first attribute that minimizes the number of disk reads necessary to answer a query of the above form. How many disk reads do you perform?

b)
Describe a structure with Y as the first attribute that minimizes the number of disk reads necessary to answer a query of the above form. How many disk reads do you perform?

c)
Suppose that you were allowed to buffer 11 blocks in main memory at all times. Which blocks would you choose? How would it affect the average number of disk reads needed to answer a query of the above form?

Problem 4: Partitioned Hashing

Suppose we have a hash table whose buckets are numbered 0 to (2^n)-1 (i.e., bucket numbers have n bits). We wish to store in the table a relation with two attributes, X and Y. A query will either ask for those tuples with a given value of X, or it will ask for those tuples with a given value of Y (never both). With probability p, the value of X will be specified.
a)
Suppose we partition the hash function so that m bits are devoted to X and the remaining n-m to Y. Give a formula in terms of n, m, and p for the expected number of buckets that must be examined to answer a random query.

b)
Find the value of m, as a function of n and p, that minimizes the expected number of buckets examined. Do not worry that this value is unlikely to be an integer.