Problem Set 1 ------------- 1. Compute the gamma-codes of the following integers: (i) 1 (ii) 38 (iv) 1026 2. Consider an inverted index over a collection of 50,000 documents. Let the sequence (3,5,1524,1600,29307,31520) be the postings list of a certain term in the index. Compute the size of this postings list for the following six cases: Case 1: The postings list is NOT stored as a sequence of gaps and the numbers are encoded using the following schemes: (i) Unary (ii) Fixed-length binary (iii) Gamma Case 2: The postings list is stored as a sequence of gaps and the numbers are encoded using (i) Unary (ii) Fixed-length binary (iii) Gamma Note: Unary code of x is (x-1) 1's followed by one 0. For example, the unary code of 5 is 11110. 3. For this problem, we will be dealing with a collection of 2 million documents containing 500K distinct terms. Assume that term occurrences follow the Zipf distribution and that the gaps for each term are uniformly distributed (as done in class). Estimate the size of the entire postings file under the following conditions (in all cases, the postings are stored using gaps): (i) Unary encoded postings list (ii) Gamma-encoded postings lists (iii) Repeat (ii) when the 100 most frequent words are not stored in the index. 4. Using the search engine of your choice, find an online Porter's stemming tool, and stem the following words: automobile automotive cars information informative 5. In an inverted index over 0.5 million documents, the following term-frequency statistics were observed: eyes 213312 kaleidoscope 87009 marmalade 107913 skies 271658 tangerine 46653 trees 316812 Recommend a query processing order for the following queries: (i) (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) (ii) tangerine AND (NOT marmalade) AND (NOT trees) 6. Given the following fragment from an inverted index (augmented with positional information), state how often the phrase "The undiscovered country" appears in each document. the: 3:34,38,55; 5:12,16,25,44; 7:67,87,90,101; 10:33,39,45,62; undiscovered: 3:12,15,19; 5:3,5,17,41,45,96; 6:21,25,55,62; 7:4,68,70,85,110; 10:15,34,40,65,81; country: 3:22,26; 5:18,46,52,65; 7:5,69,91,105; 8:32,42,65,93; 10:32,44,75,83; 7. Work out the exercise described on slide 11 of lecture 2.