CS347 (Spring 2010)
Map-Reduce Homework
-------------------

Problem 1

In class we discussed a map-reduce strategy for sorting a set of numbers, using an example (Slide 16, MapReduce Notes). In that example, the input was a set of 3 unsorted files, each containing (key, value) pairs. The output was two files sorted on the key field.

Say we are not happy with this output because the sorted files still need to be merged. Instead, we want the output to be two sorted files, where the first one contains all keys less than or equal to 5, and the second one contains all keys larger than 5. Thus, for this example input data, one file should contain (in key order) the pairs (1,a), (2,b), (3,c), (4,d) and (5,e), and the other should contain (6,f), (6,f*), (7,g), (8,h) and (9,i).

Write the map and reduce functions that will produce these two sorted files. You can use pseudo-code.

Problem 2

Read Section 1 of the following paper:

  Swoosh: a generic approach to entity resolution
  Available here: http://ilpubs.stanford.edu:8090/859/

(Read the rest of the paper if you like :-) )

We want to use map-reduce to perform entity resolution on a set of records that describe products. Each input record r has the following three fields:

  r.C: the product category (e.g., camera, book, ...)
  r.D: the product description (text)
  r.E: the rest of the product information.

The match function we will use is:

  M(r,s) = true   if r.C = s.C and sim(r.D, s.D) > 0.9
           false  otherwise

The function sim(r.D, s.D) evaluates the textual similarity between the two descriptions, returning a real number between 0 (completely different) and 1 (identical).

To simplify the problem, assume we do not merge matching records. Instead, we simply want to output the pairs of matching records. For example, if r matches s and s matches t (but r does not match t), then the output should contain {r,s} and {s,t} but not {r,t}. The precise format of the output file can vary depending on your code. For example, some pairs may be grouped together into sets, e.g., { {r,s}, {s,t} }. The output file can of course be partitioned across multiple computers.

Part A: Write map and reduce functions to generate the desired output. You can use pseudo-code. Keep your code as simple and easy-to-read as possible.

Part B: Suppose we want to perform transitive closure on the pairs of matching records. For instance, if r matches s and s matches t, we want the output to include {r, s, t}. Can this output be generated with a single map-reduce pass? If yes, explain how. If not, explain why not, and discuss how many passes might be necessary. You do not have to write code for Part B; just discuss in English.

Please keep your answers short and clear.
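
Illustration for Problem 2: the match function M and the expected (non-transitive) pair output can be sketched in Python as below. This is only a single-machine, brute-force reference for checking answers on small inputs, not a map-reduce solution. The names Record, difflib_sim, match and matching_pairs are hypothetical, and the assignment does not say how sim is computed, so difflib's ratio is used purely as a stand-in.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical container for the three fields named in Problem 2."""
    C: str  # product category
    D: str  # product description (text)
    E: str  # rest of the product information


def difflib_sim(a: str, b: str) -> float:
    """Assumed stand-in for sim(): returns a value in [0, 1], 1 for identical text."""
    return SequenceMatcher(None, a, b).ratio()


def match(r: Record, s: Record,
          sim: Callable[[str, str], float] = difflib_sim) -> bool:
    """M(r,s): same category and description similarity above 0.9."""
    return r.C == s.C and sim(r.D, s.D) > 0.9


def matching_pairs(records: List[Record],
                   sim: Callable[[str, str], float] = difflib_sim
                   ) -> List[Tuple[Record, Record]]:
    """Compare every unordered pair once and keep only the pairs that match."""
    return [(r, s) for r, s in combinations(records, 2) if match(r, s, sim)]
```

Passing sim as a parameter makes it easy to plug in a toy similarity table and confirm that, when r matches s and s matches t but r does not match t, only {r,s} and {s,t} are reported.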