CS245 Winter 1999 Assignment 3 due in class on Thursday January 28 PROBLEM 1 Consider an indexed sequential file consisting of 1,000,000 contiguous blocks. Each block contains ten fixed size records. For each key value found in the file, there are at most 4 records that share that value. (a) How large (in blocks) would a sparse one-level, primary sequential index be? Assume blocks are 4 KB, block pointers are 4 bytes, a key is 5 bytes, and that the records of the index are contiguous. Design the index so it would use the minimum amount of space; however, assume that keys are not spanned across blocks. (b) Suppose you now construct a second level sequential index on the index of part (a)? How large would it be, in blocks? Also assume you want to minimize space (but no spanning). (c) Next you construct a one level, dense secondary index. Compute its minimum size (in blocks), assuming that record pointers take 5 bytes, block pointers take 4 bytes, the secondary keys are also 5 bytes long, blocks are 4 KB, and index blocks are contiguous. Also assume that no buckets are used; if multiple values of a secondary key occur in the file, the keys are replicated in the index. (We do not expect many duplicates, so it is not worth optimizing for them.) Assume that [key, pointer] pairs are not spanned across blocks. (d) Suppose you construct a second level sequential index for the index of part (c). How large would it be in blocks? Also assume you want to minimize space (but no spanning). PROBLEM 2 Consider a B+ tree of order 4 for this problem, that is initially empty. (a) You insert the key 1 (and a pointer to this record) into the tree. Show what the tree looks like at this point. (b) Next you insert keys 2, 3, 4, and 5 (in this order) into the tree. Show what the tree looks like at this point. (c) Next you insert keys 6, 7, 8, 9, 10, 11, 12, 13, and 14 (in this order). Show what the tree looks like at this point. PROBLEM 3 (a) What is the minimum number of record pointers that a B+ tree of order n can contain, when it has two levels (root plus one more)? (All record pointers are found in the second level.) (b) What is the minimum number of record pointers that a B+ tree of order n can contain, when it has three levels (root plus two more)? (The third level contains the leaf nodes in this case.) (c) What is the general expression for the minimum number of record pointers an order n B+ tree can contain, when it has j levels? (d) If we are told that an order n B+ tree points to r records, then what is the maximum number of levels, j, it may have? That is, derive an expression of the form j <= f(n, r). Note that this is a bound on the number of B+ tree nodes we need to examine for looking up a record (given its key value), when the file we are indexing contains r records. PROBLEM 4 Consider an index organized as a B+ tree. The leaf nodes contain pointers to a total of N records, and each block that makes up the index has m pointers. We wish to choose the value of m that will minimize search times on a particular disk device. We are given the following information: (a) For the disk that will hold the index, the time to read a given block into memory can be approximated by (70 + .05*m) milliseconds. The 70 milliseconds represent the seek and latency components of the read, the .05*m milliseconds is the transfer time. (That is, as m becomes larger, the larger the block will be and the more time it will take to read it into memory.) (b) Once the block is in memory a binary search is used to find the right pointer. So the time to process a block in main memory is (a + b log_2 m) milliseconds, for some constants a , b. ("log_2 m" means log base 2 of m.) (c) The main memory time constant "a" is much smaller than the disk seek and latency time of 70 milliseconds. (d) The index is full, so that the number of blocks that must be examined per search is log_m N. Questions: (i) What value of m minimizes the time to search for a given record? An approximate answer is OK. (HINT: If you come up with an equation which is hard to solve algebraically, try plugging in values to locate the root of the equation.) The value you obtain should be independent of "b"! (ii) What happens as the seek and latency constant (70ms) decreases? For instance, if this constant is cut in half, how does the optimum m value change? END OF HOMEWORK 3