CS245 Summer 2000 Assignment 3 due in class on Wednesday July 26 PROBLEM 1 Consider an indexed sequential file consisting of 500,000 blocks. Each block contains 20 fixed size records. For each key value found in the file, there are at most 3 records that share that value. For this problem, assume that: Pointers to blocks are 8 bytes long Pointers to records are 9 bytes long (block pointer plus 1 byte offset) Index blocks are 2KB There is no spanning of records in the file. There is no spanning of "records" in the index, e.g., if a [key, pointer] does not fit in an index block, it must move to another block. Search keys for file records are 12 bytes long. (a) Assume that the file and index blocks are stored contiguously on disk. How large (in blocks) would a sparse one-level, primary sequential index be? Design the index so it would use the minimum amount of space. (b) Assume that the file and index blocks are *not* contigous on disk. How large (in blocks) would a sparse one-level, primary sequential index be? Design the index so it would use the minimum amount of space. (c) Suppose you now construct a second level sequential index on the index of part (b)? How large would it be, in blocks? Also assume you want to minimize space (but no spanning). (d) Next you construct a one level, dense secondary index. Compute its minimum size (in blocks). The secondary keys are also 12 bytes, and index blocks are contiguous. Also assume that no buckets are used; if multiple values of a secondary key occur in the file, the keys are replicated in the index. (We do not expect many duplicates, so it is not worth optimizing for them.) (e) Suppose you construct a second level sequential index for the index of part (d). How large would it be in blocks? Also assume you want to minimize space (but no spanning). PROBLEM 2 Consider a B+ tree of order 3 for this problem, that is initially empty. (a) You insert the key 1 (and a pointer to this record) into the tree. Show what the tree looks like at this point. (b) Next you insert keys 2, 3, 4, and 5 (in this order) into the tree. Show what the tree looks like at this point. (c) Next you insert keys 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, and 6 (in this order). Show what the tree looks like at this point. PROBLEM 3 (a) What is the minimum number of record pointers that a B+ tree of order n can contain, when it has two levels (root plus one more)? (All record pointers are found in the second level.) (b) What is the minimum number of record pointers that a B+ tree of order n can contain, when it has three levels (root plus two more)? (The third level contains the leaf nodes in this case.) (c) What is the general expression for the minimum number of record pointers an order n B+ tree can contain, when it has j levels? (d) If we are told that an order n B+ tree points to r records, then what is the maximum number of levels, j, it may have? That is, derive an expression of the form j <= f(n, r). Note that this is a bound on the number of B+ tree nodes we need to examine for looking up a record (given its key value), when the file we are indexing contains r records. PROBLEM 4 Consider an index organized as a B+ tree. The leaf nodes contain pointers to a total of N records, and each block that makes up the index has m pointers. We wish to choose the value of m that will minimize search times on a particular disk device. We are given the following information: (a) For the disk that will hold the index, the time to read a given block into memory can be approximated by (25 + .02*m) milliseconds. The 25 milliseconds represent the seek and latency components of the read, the .02*m milliseconds is the transfer time. (That is, as m becomes larger, the larger the block will be and the more time it will take to read it into memory.) (b) Once the block is in memory a binary search is used to find the right pointer. So the time to process a block in main memory is (a + b log_2 m) milliseconds, for some constants a , b. ("log_2 m" means log base 2 of m.) (c) The main memory time constant "a" is much smaller than the disk seek and latency time of 25 milliseconds. (d) The index is full, so that the number of blocks that must be examined per search is log_m N. Questions: (i) What value of m minimizes the time to search for a given record? An approximate answer is OK. (HINT: If you come up with an equation which is hard to solve algebraically, try plugging in values to locate the root of the equation.) The value you obtain should be independent of "b"! (ii) What happens as the seek and latency constant (25ms) decreases? For instance, if this constant is cut in half, how does the optimum m value change? END OF HOMEWORK 3