1. The **LRU** algorithm should not be used for buffer management in **B+**-tree indexes.

**Ans: TRUE.**

2. Since a plain **B**-tree can store record pointers on the inner
nodes,
it is more compact than a **B+**-tree and incurs a lower search cost.

**Ans: FALSE.**

Given fixed-size data blocks, a B-tree node contains fewer keys than a B+-tree node. So a B-tree is in fact deeper and less compact.
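To see the effect concretely, here is a back-of-the-envelope sketch comparing node capacities. The 8192-byte block, 8-byte key, and 8-byte pointer sizes are illustrative assumptions, not part of the question:

```python
# Back-of-the-envelope node capacities. The 8192-byte block, 8-byte key,
# and 8-byte pointer sizes are illustrative assumptions.
BLOCK, KEY, PTR = 8192, 8, 8

# B+-tree inner node: n keys and n+1 child pointers per block.
#   n*KEY + (n+1)*PTR <= BLOCK
bplus_keys = (BLOCK - PTR) // (KEY + PTR)

# B-tree node: n keys, n+1 child pointers, plus n record pointers.
#   n*KEY + (n+1)*PTR + n*PTR <= BLOCK
btree_keys = (BLOCK - PTR) // (KEY + 2 * PTR)

print(bplus_keys, btree_keys)  # → 511 341
```

With fewer keys per node, the B-tree needs more levels to index the same number of records.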

3. Split-records is a useful technique for organizing data records with a hybrid format.

**Ans: TRUE.**

4. The time for a request to wait to be serviced may be longer using the elevator algorithm than using the first-come-first-served algorithm.

**Ans: TRUE.**

5. It is always better to store a sequential file in as few blocks as
possible,
because that can reduce the disk **I/O**s when the file is accessed.

**Ans: FALSE.**

We may need to leave some space in each block to reduce the I/O cost
for insertion. Also, we can save I/Os by leaving some slack, rather
than spanning records at the end of blocks.

6. If **V(R,C) = 1000** and

**Ans: FALSE.**

We expect the selection will yield a relation with **10** tuples.

7. A second level index for an indexed sequential file is usually dense.

**Ans: FALSE.**

A second level index is always sparse.

8. Consider relations **R(A,B)** and

**Ans: FALSE.**

The expected number of tuples is less than or equal to **4000**.

**Part II. (10 points)** Questions 1-2 below refer to two relations **R(A,B,C,D)** and

1. Which of the following indexes is most appropriate for answering the query **R natural-join S**, assuming we can always store the index in memory?

(a) A dense index on *R.B*

(b) A sparse index on *R.B*

(c) A dense index on *S.B*

(d) A sparse index on *S.B*

(e) A sparse index on *R.A*

**Ans: (C).**

The strategy that should be used in this problem is to read data blocks one
by one from one relation, and use the index on the other relation to find the
matching tuples.

The choice (E) should be discarded first, because it does not help the join operation at all.

We then compare dense indexes with sparse indexes, and find that dense indexes are better
than sparse indexes in this case given that they all fit in memory. For example, with a sparse
index on R.B, we first fetch S blocks. For each S tuple, when looking for R tuples with a
matching B value, we need to fetch a block of R even though the S.B value is not in the R.B
index, because not all key values appear in a *sparse* index. However, if we use a dense index, as soon as we find that the S.B value is not in the R.B index, we know that there are no tuples in R with a matching B value, and we do not need to fetch R's data block at all. So, using a *sparse index costs more I/Os than using a dense index* in this case.

Therefore, (A) is better than (B), and (C) is better than (D).

Finally, we compare choices (A) and (C). With (A), we have a dense index on R.B. We
need to first read all blocks of S, and for each S tuple, we find matching tuples in R using
the R.B index. The total I/O cost is about B(S) + T(S)*T(R)/V(R,B) (given that R is
not sorted on the B attribute). This value is **2000 + (20,000,000/V(R,B))**.
With (C), we first read R blocks, then find matching tuples
in S. The I/O cost for this join scheme is about B(R) + T(R)*B(S)/V(S,B) (given that
S is sorted on the B attribute). This value is **100 + (2,000,000/V(S,B))**.
With the reasonable assumption that V(R,B) <= V(S,B), we can easily conclude that (C) achieves a lower query cost than (A).

__This answer could have been readily intuited, given the sizes of relations R and S__. *V(R,B) would actually have to be at least 10X the value of V(S,B) before it would even be close! Given that T(R) is already 20X less than T(S), the expectation of this ever occurring is near zero.*
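As a sanity check, the two cost formulas can be evaluated directly. The sketch below plugs in the statistics implied by the solution (B(R) = 100, B(S) = 2000, T(R) = 1000, T(S) = 20000) and the illustrative case where V(R,B) is already 10X V(S,B):

```python
# Cost formulas from the solution, with the implied statistics:
# B(R)=100, B(S)=2000, T(R)=1000, T(S)=20000.
B_R, B_S, T_R, T_S = 100, 2000, 1000, 20000

def cost_a(v_rb):
    # plan (A), dense index on R.B: scan S, then probe R per matching tuple
    return B_S + T_S * T_R / v_rb

def cost_c(v_sb):
    # plan (C), dense index on S.B: scan R, then probe S (S sorted on B)
    return B_R + T_R * B_S / v_sb

# Even when V(R,B) is ten times V(S,B), plan (C) still wins:
print(cost_a(1000), cost_c(100))  # → 22000.0 20100.0
```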

2. Which query benefits least from a multi-key index on **R** with key attributes (A, B)?

(a) select * from R where A = 10

(b) select * from R where B = 20

(c) select * from R where A < B

(d) select * from R where A < C

(e) select * from R where C < 100 and A = 10

**Ans: (D)**

- Query (A) can be done with just the multi-key index, since A is the first attribute.

- Query (B) can be done with just the multi-key index: we can use the second level index (on
the B attribute) in the multi-key index to find all the pointers to the records with B = 20.

- Query (C) likewise can be done with just the multi-key index.

- Query (E) benefits greatly from the multi-key index since A is the first attribute and we can narrow our search to tuples where A = 10.

- That leaves option (D).

For the query in (D), we cannot use the multi-key index at all, because we basically need to fetch all the data records in R and compare their A and C attributes.

**Part III. (20 points)** Questions 1-2 below refer to a disk with an actual (formatted) capacity of 8 gigabytes. The disk has 16 surfaces and 1024 tracks per surface. The disk rotates at 7200 rpm. The average seek time is 9 ms. The block size is 8KB.

1. Suppose that we are reading a file *F* that occupies exactly one
entire track, and we want to estimate how long it takes to read the whole file
sequentially. What key parameter is missing for doing this? Select one
from the following choices:

(a) the diameter of the disk - *d*

(b) the average rotational latency - *r*

(c) the amount of overhead per track - *v*

(d) none

**Ans: (D)**

Based on your choice above, how long does it take to read the file? You may use the symbol (after each choice) to represent the parameter that you think is useful in the estimation result.

**Ans: ~21.5 ms.**

average seek time = 9 ms.

rotational latency = 8.3 / 2 = 4.15 ms.

transfer time = one full revolution = 60 s / 7200 revolutions ≈ 0.0083 s = 8.3 ms.

access time = seek time + rotational latency + transfer time = 9 + 4.15 + 8.3 = 21.45 ≈ **21.5 ms**.

2. Now, suppose that we want to estimate how long it takes to read a
block. What key parameter is missing for doing this? Select one from the following
choices:

(a) the diameter of the disk - *d*

(b) the average rotational latency - *r*

(c) the amount of overhead per track - *v*

(d) none

**Ans: (C)**

Based on your choice above, how long does it take to read a block? You may use the symbol (after each choice) to represent the parameter that you think is useful in the estimation result.

**Ans: ~13.15 ms + 0.13 ms * (1 - v) ≈ 13.3 ms - v * 0.13 ms**

blocks per track = 8 GB / 16 surfaces / 1024 tracks / 8 KB per block = 2^{33} / 2^{4} / 2^{10} / 2^{13} = 2^{6} = 64 blocks/track

average seek time = 9 ms.

rotational latency = 8.3 / 2 = 4.15 ms.

transfer time = (8.3 ms / 64) * (1 - v) = 0.13 ms * (1 - v)
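The arithmetic for both Part III questions can be checked with a short script, where `v` stands for the symbolic per-track overhead fraction (the numeric case shown uses v = 0):

```python
# Part III arithmetic; v is the fraction of each track lost to overhead.
rpm = 7200
rotation_ms = 60_000 / rpm            # one revolution: ~8.33 ms
rot_latency_ms = rotation_ms / 2      # average rotational latency: ~4.17 ms
seek_ms = 9.0

# Question 1: a full-track read costs a seek, half a rotation, one full rotation.
track_read_ms = seek_ms + rot_latency_ms + rotation_ms

# Question 2: first find how many 8KB blocks fit on one track of the 8GB disk.
blocks_per_track = (8 * 2**30) // 16 // 1024 // (8 * 2**10)  # 2^6 = 64

def block_read_ms(v):
    # one block is (1 - v)/64 of a rotation's worth of transfer
    return seek_ms + rot_latency_ms + rotation_ms * (1 - v) / blocks_per_track

print(round(track_read_ms, 1), blocks_per_track, round(block_read_ms(0.0), 2))
# → 21.5 64 13.3
```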

**Part IV. (14 points)**

Consider a relation *R(A,B)* stored in a hash table with 1024 (2^{10}) buckets. Suppose that 20% of the queries on *R* have the form **SELECT * from R where A = a**, and 80% of the queries have the form **SELECT * from R where B = b**, where *a* and *b* are various constants. Further suppose that a hash function *h*_{1} produces a string of *n_{A}* bits based on the *A* value, and a hash function *h*_{2} produces a string of *n_{B}* bits based on the *B* value.

**Ans: n_{A} = 4, n_{B} = 6.**

Consider a partitioned hash table that uses a 10-bit hash key containing the *n_{A}* bits from *h*_{1} followed by the *n_{B}* bits from *h*_{2}, so that *n_{A}* + *n_{B}* = 10. A query on A must examine 2^{10-nA} buckets (one for each combination of the unknown B bits), and a query on B must examine 2^{10-nB} buckets. The expected number of buckets examined per query is therefore

0.2 * 2^{10-nA} + 0.8 * 2^{10-nB} =

0.1 * (2 * 2^{10-nA} + 8 * 2^{10-nB}) =

0.1 * (2^{11-nA} + 2^{13-nB})

Now, recalling that *n_{A}* + *n_{B}* = 10, we can substitute *n_{B}* = 10 - *n_{A}*:

0.1 * (2^{11-nA} + 2^{3+nA})

To minimize this query cost, we need 11 - *n_{A}* = 3 + *n_{A}*, since a sum of two powers of 2 with a fixed exponent sum is smallest when the exponents are equal. That is, *n_{A}* = 4 and *n_{B}* = 6.
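A brute-force check over all 11 possible splits of the 10 bits confirms the optimum:

```python
# Average buckets touched per query for each way of splitting the 10 key bits.
def avg_cost(n_a):
    n_b = 10 - n_a
    return 0.2 * 2 ** (10 - n_a) + 0.8 * 2 ** (10 - n_b)

best = min(range(11), key=avg_cost)
print(best, 10 - best)  # → 4 6
```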

**Part V. (20 points)**

Suppose that we have 8192-byte blocks in which we store records of 614 bytes each. The block header consists of a two-byte counter (indicating the number of entries in the offset table), plus an offset table (or directory), as in Figure 3.8 in the textbook and in slide 52 of Notes 3. In the offset table, each record pointer is 2 bytes.

On an average day, three records per block are inserted, and one record per block is deleted. When a record is deleted, a tombstone is left in the offset table. For specificity, assume that the deletion on any day occurs before the insertions.

If a block is initially empty, on what day will there be no more room for the day's insertions? (On the first day, day #1, there is no deletion, and three records are inserted.)

**Ans: Day 7**

On the first day, the block holds the counter, 3 offset entries, and 3 records, so a total of 2 + 3 * (614 + 2) = 1850 bytes is used. On every subsequent day,

- we delete one record: -614 bytes (its 2-byte tombstone entry remains), and
- we insert three records: +3 * (614 + 2) = +1848 bytes,

a net gain of 1234 bytes per day. At the end of day *n* the block therefore uses 1850 + 1234 * (*n* - 1) bytes, which is 8020 bytes at the end of day 6. On day 7, the deletion brings usage down to 7406 bytes, leaving only 8192 - 7406 = 786 free bytes, which is not enough for the 1848 bytes that the day's three insertions require.
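A direct day-by-day simulation of the block confirms the answer:

```python
# Simulate the block: 8192-byte block, 614-byte records, 2-byte counter,
# 2-byte offset entries; a deleted record's tombstone keeps its offset entry.
BLOCK, REC, PTR, COUNTER = 8192, 614, 2, 2

used = COUNTER
day = 0
while True:
    day += 1
    if day > 1:
        used -= REC                      # one deletion frees only the record bytes
    if used + 3 * (REC + PTR) > BLOCK:
        break                            # today's three insertions no longer fit
    used += 3 * (REC + PTR)

print(day)  # → 7
```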

1. **(2 points)** Consider a B+ tree index of order *n* over a
file with *r* records.
What is the minimum number of nodes that we may have to examine when
searching
for a unique record using its key value?

**Ans: ceiling(log_{n+1}(r/n)) + 1**

We first compute the maximum number of records that can be indexed using an *l*-level B+ tree. Each inner node has a maximum fan-out of *n*+1. Each leaf node contains at most *n* record pointers. So, given an *l*-level B+ tree, we can index at most (*n*+1)^{l-1} * *n* records. In other words, (*n*+1)^{l-1} * *n* >= *r*. Therefore we get *l* >= ceiling(log_{n+1}(*r*/*n*)) + 1.
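As a quick numeric check of the bound, a small helper (with illustrative values *n* = 100 and *r* = 1,000,000, not from the question) gives:

```python
import math

# Smallest number of levels l such that (n+1)**(l-1) * n >= r.
def min_levels(n, r):
    return math.ceil(math.log(r / n, n + 1)) + 1

print(min_levels(100, 1_000_000))  # → 3
```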

2. **(3 points)** Suppose you are designing a file system for a movie
database.
Each movie has 10 fields that always occur (e.g., title, year) and 20 optional fields that may or may not be relevant or known for a movie (e.g., special effects designer). Each field has a fixed size of 24 bytes. Assume that each optional field is relevant for a particular movie with probability *p*.

You are considering two options for the file records:

- A fixed-format record
- A variable-format record where all fields are tagged using a 2-byte tag

**Ans: ~0.885 < p <= 1**

The size of a fixed-format record is 24 * (10 + 20) = 720 bytes. The average size of a variable-format record is 26 * (10 + 20p) = 260 + 520p bytes. Using fixed-format records is better than using variable-format records (according to the storage requirement) if 720 < 260 + 520p; that is, 23/26 < p <= 1. (23/26 ≈ 0.885.)
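The break-even probability can be verified numerically:

```python
# 24-byte fields; in the variable format every stored field carries a 2-byte tag.
fixed = 24 * (10 + 20)                       # 720 bytes, all fields always stored

def variable(p):
    # 10 required fields always stored; each optional field stored w.p. p
    return (24 + 2) * (10 + 20 * p)

p_break = (fixed - variable(0)) / (26 * 20)  # solve 720 = 260 + 520 * p
print(round(p_break, 3))  # → 0.885
```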