CS245 PROBLEM SET #2
Solutions

Problem 1: Conventional Indexes (20 points)

Consider two relations R(A, B) and S(B, C) stored as sequential files. There are two types of most frequently asked queries as follows.
Q1:
select * from R where R.A > a1 and R.A < a2
Q2:
select * from R, S where R.B = S.B
Suppose that we are allowed to build conventional dense, sparse, or multi-level indexes over the relations to optimize the above queries. What indexes shall we build over R and S to obtain the best query performance? Briefly specify the following aspects for each index you choose to build.
a)
What is the type of the index (dense, sparse, or multi-level)?
b)
Which attribute is the index built on?
c)
Is it a primary or a secondary index?
d)
How should we use the index for the queries?

Solution:

We can build three indexes as follows:

We use Index 1 for Q1. When executing Q1, we first search for a1 in R using Index 1, and obtain a tuple with the smallest R.A that is greater than a1. We then do a sequential scan from that point til we fetch a tuple with R.A greater than a2.

We use Index 2 and Index 3 for Q2. When executing Q2, we intersect Index 2 and Index 3 to find all pointers to the joined tuples in R and S, then fetch the tuples and return the join result.

NOTE: We store all indexes in memory to achieve the best query performance. In this case, a sparse index on S.B is not as good as a dense index, because not all keys appear in the sparse index, and we may have to fetch a data block for each R.B value, even if it is not in the sparse index on S.B.

Problem 2: B+-tree and B-tree (20 points)

Consider a DBMS that has the following characteristics: We want to build an index on a search key that is 8 bytes long. Calculate the maximum number of records we can index with
a)
a 3-level B+ tree (2 levels plus the root)
b)
a 3-level B tree

Solution:

a)
Let each node of a B+-tree contain at most n pointers and n-1 keys. 8 * (n-1) + 12 * n + 56 <= 2048. Therefore, n <= 100. The leaf level of a B+-tree can hold at most 99 * 100 * 100 record pointers. Therefore, the maximum number of records that can be indexed is 990000.
b)
Let each inner node of a B-tree contain at most n index pointers, n-1 keys, and n-1 record pointers. 8 * (n-1) + 12 * (2n-1) + 56 <= 2048. Therefore, n <= 62. The first level of a B-tree can hold at most 61 record pointers. The second level can hold at most 62 * 61 record pointers. The leaf level can hold at most 62 * 62 * 61 record pointers. Therefore, the maximum number of records that can be indexed is 61 + 62 * 61 + 62 * 62 * 61 = 238327.
NOTE: You can also assume that each leaf node of a B-tree can hold at most 99 record pointers. Then, the answer is 384399.

Problem 3: B+-tree Insertions and Deletions (20 points)

Consider the above B+ tree.

Show the B+ tree that results after inserting (in the given order) 56, 50, 75, 87, 48. From that resulting B+ tree we delete (in the given order) 50, 24, 65, 93, 75. What is the resulting final B+ tree?

Solution:

After insertions:

After deletions:

NOTE: There are alternative correct answers to this problem. However, you need to make sure that your solution follows the insertion and deletion algorithm described in the class.

Problem 4: Extensible and Linear Hashing (20 points)

Consider a dynamic hash structure where buckets can hold up to three records. Initially the structure is empty. Then we insert the following records, in the order below, where we indicate the hashed key in parenthesis (in binary):
   a [010000]
   b [011010]
   c [111100]
   d [001110]
   e [010111]
   f [011010]
   g [101001]
   h [010111]
   i [000110]
   j [101001]
a)
Show the extensible hash structure after these records have been inserted.
b)
Show the linear hash structure after these records have been inserted. Assume that the threshold value is 2. (i.e., when the average number of keys per non-overflow bucket is greater than 2, we allocate another bucket).

Solution

a)

b)

Problem 5: Multiple-Key Index (20 points)

There is a relation R(X, Y, Z) where the pair of attributes X and Y together form the key. Suppose that for each X there are records with 1024 different values of Y, and for each Y there are records with 1024 different values of X. We need to answer queries of the form "select Z from R where C", where C is the selection condition. Suppose we know that
1)
C is "X = x" in 20% of the cases
2)
C is "Y = y" in 30% of the cases
3)
C is "X = x and Y = y" in 50% of the cases
for various constants x and y. We want to build an index for the relation to improve the average query performance, and we have the following options:
a)
Build a multiple-key index with X as the first attribute
b)
Build a multiple-key index with Y as the first attribute
c)
Build two secondary indexes on X and Y, using the "pointer bucket" technique (described in the lecture) to deal with duplicate values.
Both the index and the relation are stored on disk. Suppose that each X or Y value is 8 bytes long, a record pointer is 8 bytes long, and a block is 8K bytes. Compute the average query cost (number of block I/Os) for each index option, and pick the best index.

Solution

Assume that there are totally 1024 X values and 1024 Y values. Further assume that the index entries are stored in contiguous blocks. Each block can hold 1K pointers or 0.5K pairs.

a)
The first level index contains two blocks. For each X value, the second-level index on Y also contains two blocks. For query 1), we first search X = x in the first level index. The average cost for this step is 1.5 I/Os. We then fetch both block of second level index on Y for the X value, which takes 2 I/Os. Finally, we fetch the result tuples, which takes 1024 I/Os. For query 2), we first scan both blocks of the first level index on X. For each X value, we search Y = y in the second level index of that X value. The average cost for this step is 1.5 I/Os. Finally, we fetch the result tuples, which takes 1024 I/Os. Therefore, the average query cost is

20% * (1.5 + 2 + 1024) + 30% * (2 + 1.5 * 1024 + 1024) + 50% * (1.5 + 1.5 + 1) = 976.1 I/Os

b)
Similarly to a), the average query cost is

20% * (2 + 1.5 * 1024 + 1024) + 30% * (1.5 + 2 + 1024) + 50% * (1.5 + 1.5 + 1) = 822.65 I/Os

c)
For query 1), we use the secondary index on X to find all pointers to tuples with X = x. This takes 1.5 + 1 = 2.5 I/Os. We then fetch the result tuples which takes 1024 I/Os. For query 2), we use the secondary index on Y to find all pointers to tuples with Y = y. This takes 1.5 + 1 = 2.5 I/Os. We then fetch the result tuples which takes 1024 I/Os. For query 3), we first use the secondary index on X to find all pointers to tuples with X = x, which takes 2.5 I/Os. We then use the secondary index on Y to find all pointers to tuples with Y = y., which takes 2.5 I/Os. Finally, we intersect the two buckets of pointers in memory, and fetch the result tuples, which takes 1 I/O. The average query cost is

20% * (1.5 + 1 + 1024) + 30% * (1.5 + 1 + 1024) + 50% * (1.5 + 1 + 1.5 + 1 + 1) = 516.25 I/Os

According to the above result, scheme c) is the best.