# CS 245 Midterm Exam

## Summer 2000

Part I. (16 points) Indicate whether each of the following statements is TRUE or FALSE. If you choose FALSE, please correct it or briefly explain why.

1. The LRU algortihm should not be used for the buffer management in B+-tree indexes.

Ans: TRUE.

2. Since a plain B-tree can store record pointers on the inner nodes, it is more compact than a B+-tree and incurs a lower search cost.

Ans: FALSE.
Given fixed-size data blocks, a B-tree node contains less keys than a B+ tree node. So, B-tree is in fact deeper and less compact.

3. Split-records is a useful technique for organizing data records with a hybrid format.

Ans: TRUE.

4. The time for a request to wait to be serviced may be longer using the elevator algorithm than using the first-come-first-served algortihm.

Ans: TRUE.

5. It is always better to store a sequential file in as few blocks as possible, because that can reduce the disk I/Os when the file is accessed.

Ans: FALSE.
We may need to leave some space in each block to reduce the I/O cost for insertion. Also, we can save I/Os by leaving some slack, rather than spanning records at the end of blocks.

6. If V(R,C) = 1000 and T(R) = 10000, we can expect the selection on R on the condition C = 20 will yield a relation with with 20 tuples.

Ans: FALSE.
We expect the selection will yield a relation with 10 tuples.

7. A second level index for an indexed sequential file is usually dense.

Ans: FALSE.
A second level index is always sparse.

8. Consider relations R(A,B) and S(B,C) where T(R) = 4000 and T(S) = 2000, and B is a key on S. The expected number of tuples in R natural-join S is less than or equal to 2000.

Ans: FALSE.
The expected number of tuples is less than or equal to 4000.

Part II. (10 points) Questions 1-2 below refer to two relations R(A,B,C,D) and S(B,E). R contains 1,000 tuples stored in a sequential file sorted on attribute A.   S contains 20,000 tuples stored in a sequential file sorted on attribue B.   Each block can hold 10 R or S tuples.

1. Which of the following indexes is most appropriate for answering the query R natural-join S, assuming we can always store the index in memory.

(a) A dense index on R.B
(b) A sparse index on R.B
(c) A dense index on S.B
(d) A sparse index on S.B
(e) A sparse index on R.A

Ans: (C).
The strategy that should be used in this problem is to read data blocks one by one from one relation, and use the index on the other relation to find the matching tuples.

The choice (E) should be discarded first, because it does not help the join operation at all.

We then compare dense indexes with sparse indexes, and find that dense indexes are better than sparse indexes in this case given that they all fit in memory. For example, with a sparse index on R.B, we first fetch S blocks. For each S tuple, when looking for R tuples with a matching B value, we need to fetch a block of R even though the S.B value is not in the R.B index, because not all key values appear in a sparse index. However, if we use a dense index, as soon as we find out that the S.B value is not in the R.B index, we know that there is no tuples in R with that the matching B value, and we do not need to fetch R's data block any more. So, using a sparse index costs more I/Os than using a dense index in this case.

Therefore, (A) is better than (B), and (C) is better than (D).

Finally, we compare choices (A) and (C). With (A), we have a dense index on R.B. We need to first read all blocks of S, and for each S tuple, we find matching tuples in R using the R.B index. The total I/O cost is about B(S) + T(S)*T(R)/V(R,B) (given that R is not sorted on the B attribute). This value is 2000 + (20,000,000/V(R,B)). With (C), we first read R blocks, then find matching tuples in S. The I/O cost for this join scheme is about B(R) + T(R)*B(S)/V(S,B) (given that S is sorted on the B attribute). This value is 100 + (2,000,000/V(S,B)). With a reasonable assumption that V(R,B) <= V (S,B), we can easily draw the conclusion that (C) achieves lower query cost than (A).

This answer could have been readily intuited, given the sizes of relations R and S. V(R,B) would actually have to have at least 10X the value of V(S,B) before it would even be close! Given that T(R) is already 20X less than T(S), the expectation of this ever occuring is near zero.

2. Which query benefits least from a multi-key index on R with A as the first attribute and B as the second attribute.

(a) select * from R where A = 10
(b) select * from R where B = 20
(c) select * from R where A < B
(d) select * from R where A < C
(e) select * from R where C < 100 and A = 10

Ans: (D)

• Query (A) can be done with just the multi-key index, since A is the first attribute..
• Query (B) can be done with just the multi-key index: we can use the second level index (on the B attribute) in the multi-key index to find all the pointers to the records with B = 20.
• Query (C) likewise can be done with just the multi-key index.
• Query (E) benefit greatly from the multi-key index since A is the first attribute and we can narrow our search to tuples where A = 10.
• That leaves option (D).
For the query in (D), we cannot use the multi-key index at all, because we basically need to fetch all the data records in R and compare their A and C attributes.

Part III. (20 points) Questions 1-2 below refer to a disk with an actual (formatted) capacity of 8 gigabytes. The disk has 16 surfaces and 1024 tracks The disk rotates at 7200 rpm. The average seek time is 9 ms. The block size is 8KB.

1. Suppose that we are reading a file F that occupies exactly one entire track, and we want to estimate how long it takes to read the whole file sequentially. What key parameter is missing for doing this? Select one from the following choices:

(a) the diameter of the disk - d
(b) the average rotational latency - r
(c) the amount of overhead per track - v
(d) none

Ans: (D)

Based on your choice above, how long does it take to read the file? You may use the symbol (after each choice) to represent the parameter that you think is useful in the estimation result.

Ans: ~21.5 ms.
average seek time = 9 ms.
rotational latency = 8.3/2 = 4.15 ms.
transfer time = 7200rpm/60spm = 120rps = 0.0083 spr = 8.3 ms.
access time = seek time + rotational latency + transfer time = 9 + 4.15 + 8.3 = 21.45 = ~21.5 ms.

2. Now, suppose that we want to estimate how long it takes to read a block. What key parameter is missing for doing this? Select one from the following choices:

(a) the diameter of the disk - d
(b) the average rotational latency - r
(c) the amount of overhead per track - v
(d) none

Ans: (C)

Based on your choice above, how long does it take to read a block? You may use the symbol (after each choice) to represent the parameter that you think is useful in the estimation result.

Ans: ~13.15ms + (0.13 ms * (1 - v))    =    ~13.3ms - v*0.13ms
# blocks per track: 8GB / 16 surfaces / 1024 tracks / 8k(per block) = 233 / 24 / 210 / 213 = 26 blocks/track
= 64 blocks/track average seek time = 9 ms.
rotational latency = 8.3/2 = 4.15 ms.
transfer time = 8.3 ms / 26 * (1 - v) = 0.13 ms * (1 - v)

Part IV. (14 points)

Consider a relation R(A,B) stored in a hash table with 1024 (210) buckets. Suppose that 20% of the queries on R have the form SELECT * from R where A = a, and 80% of the queries have the form SELECT * from R where B = b where a and b are various constants. Further suppose that a hash function h1 produces a string of nA bits based on the A value, and another hash function h2 produces a string of nB bits based on the B value. The combination of the two strings form the hash key used to index the hash table. What should the values of nA and nB be in order to minimize the average number of buckets that must be accessed to answer a query?

Ans: nA = 4,nB = 6.
Consider a partitioned hash table that uses a 10-bit hash key containing nA bits of the hash value on A and nB bits of the hash value on B. For the query "select...where A = a" we need to access 210-nA buckets. For the query "select...where B = b" we need to access 210-nB buckets. The average number of buckets that must be accessed for the queries given nA and nB is

0.2 * 210-nA + 0.8 * 210-nB =
0.1 * (2 * 210-nA + 8 * 210-nB) =
0.1 * (211-nA + 213-nB) =
now, recalling that (nA + nB = 10), we get:
0.1 * (211-nA + 2nA+3)

To minimize this query cost, we need 11 - nA = nA + 3; that is, nA = 4 and nB = 6.

Part V. (20 points)

Suppose that we have 8192-byte blocks in which we store records of 614 bytes each. The block header consists of a two-byte counter (indicating the number of entries in the offset table), plus an offset table (or directory), as in Figure 3.8 in the textbook and in slide 52 of Notes 3. In the offe\set table, each record pointer is 2 bytes.

On an average day, three records per block are inserted, and one record per block is deleted. When a record is deleted, a tombstone is left in the offset table. For specificity, assume that the deletion on any day occurs before the insertions.

If a block is initially empty, on what day will there be no more room for the days insertions? (On the first day, day #1, there is no deletion, and three records are inserted.)

Ans: Day 7

On the first day, the block has a counter, 3 pointers, and 3 records. Therefore, a total of 2 + 3 * (614 + 2) = 1850 bytes is used up. Every day,

• We add three records : + 3 * (614 + 2) = 1848 bytes
• Delete one record : -614 bytes
Increase in size is 1848 - 614 = 1234 bytes per day. Notice that we don't reclaim the space allocated to the record pointer since that space is now occupied by the tombstone. After n days the size of the block is S(n) = 1850 + 1234 * n. The condition S(n) < 8192 gives n < 5.14. Thus, we can perform the operations for 5 additional days (i.e, until the 6th day) without any problem. On the seventh day, we will be unable to make the second or third insertions.

Bonus. (5 points)

1. (2 points) Consider a B+ tree index of order n over a file with r records. What is the minimum number of nodes that we may have to examine when searching for a unique record using its key value?

Ans: ceiling(logn+1(r/n)) + 1

We first compute the maximum number of records that can be indexed using a l level B+ tree. Each inner node has a maximum fan-out of n+1. Each leaf node contains at most n record pointers. So, given an l-level B+ tree, we can index at most (n+1)l-1 * n nodes. In other words, (n+1)l-1 * n >= r. Therefore we get l >= ceiling(logn+1(r/n)) + 1.

2. (3 points) Suppose you are designing a file system for a movie database. Each movie has 10 fields that always occur (e.g., title, year) and 20 optional fields that may or may not be relevant or known for a movie (e.g., special efects designer). Each field has a fixed size of 24 bytes. Assume that each optional field is relevant for a particular move with probability p.

You are consider two options for the file records:

1. A fixed-format record
2. A variable-format record where all fields are tagged using a 2-byte tag
For what range of p values is the fixed option better than the variable option according to the expected storage requirement? (hint: you may need to estimate the expected size of a record for each option.)

Ans:    ~0.885 < p <= 1

The size of a fixed-format record is 24*(10+20) = 720 bytes. The average size of a variable- format record is 26 * (10 + 20p) = 260 + 520p bytes. Using fixed-format records is better than using variable-format records (according to the storage requirement) if 720 < 260 + 520p; that is, 23/26 < p <= 1. (23/26 = ~ 0.885)