6.033 - Computer System Engineering MapReduce Assignment

Hands-on 4: MapReduce

Complete the following hands-on assignment. Do the activities described, and submit your solutions using the online submission site by 11:59p.

This assignment asks you to write a simple parallel program with the MapReduce library using a single-machine python implementation.

This assignment involves programming.

I. Warmup

Download the program mapreduce.py and store it in a file mapreduce.py. Download an ASCII bible and save it in kjv12.txt. Then, run:

$ python mapreduce.py kjv12.txt 
After running for a little while, the output should be as follows:
and 12846
i 8854
god 4114
israel 2574
the 1843
for 1743
but 1558
then 1374
lord 1071
o 1066
david 1064
jesus 978
moses 847
judah 816
jerusalem 814
he 754
now 643
so 622
egypt 611
behold 596

The output has two columns: the first column has a lower-case version of a title-cased word that appears in the ASCII bible and the second column has a count of the number of times that word appears in the bible. The output is trimmed to only display the top 20 results sorted by descending word count.

II. Studying mapreduce.py

We will now study mapreduce.py. The program begins execution after the following statement:

if __name__ == '__main__':

We then create an instance of the WordCount class using a few parameters. The last parameter comes from the command line. In our example, it is "kjv12.txt". This parameter controls which file we will be executing MapReduce on. Immediately after initialization, we call run on the WordCount instance. When we call run on our WordCount instance, the Python MapReduce library runs the MapReduce algorithm using the map and reduce methods defined in the WordCount class.

You may find the Python Reference useful in answering the following questions. In particular, the sections on Multiprocessing and Process Pools may be useful.

Question 1: What do the first two parameters to WordCount's __init__ method control?

Question 2: Briefly explain how calling run triggers calls to both the map and reduce methods of the WordCount instance.

For the following questions, do not assume that the name of the parameters correspond to their function!

Question 3: What do the parameters keyvalue and value of the map method in WordCount represent?

Question 4: What do the parameters key and keyvalues of the reduce method in WordCount represent?

III. Modifying mapreduce.py

Modify the program so that you can answer the following questions:

Question 5: How many invocations are there to doMap and how many to doReduce? Why?

Question 6: Which invocations run in parallel? (Assuming there are enough cores.)

Question 7: How much input (in number of bytes) does a single doMap process?

Question 8: How much input (in number of keys) does a single doReduce process?

Question 9: For which parameters of maptask and reducetask do you see speedup? Why do you observe no speedup for some parameters? (You might see no speedup at all on a busy machine or a machine with a single core.)

IV. Reverse word index

Extend the program with a ReverseIndex class. The Map function should produce for each word in the input a pair (word, offset), where offset is the byte offset in the input file. The Reduce function should output (word, [offset, offset, ...]), sorted by ascending offset. This should require few lines of code (~25); if you find yourself writing much more code, you might be on the wrong track; ask for help to double check that you have the right plan.

While developing ReverseIndex, you want to use a small input file for which you know what the right answer is, so that you can quickly iterate to the correct solution.

Once you have a correct implementation, run it on the bible and look at the top 20 results. Your output should be as follows (without the top two entries "a" and "aaron"):

aaronites [1817624, 1875693]
abaddon [4789761]
abagtha [2165842]
abana [1643159]
abarim [732687, 767117, 767187, 944076]
abase [2304339, 2823950, 3322367, 3494485]
abased [3830632, 4056754, 4075715, 4589626]
abasing [4530190]
abated [28468, 29098, 29656, 573980, 950776, 1106814]
abba [3947502, 4412428, 4550068]
abda [1495474, 2149589]
abdeel [3121198]
abdi [1785594, 2005637, 2096175]
abdiel [1779568]
abdon [1048236, 1135107, 1135327, 1789463, 1797454, 1797941, 1804405, 2036728]
abednego [3468578, 3480066, 3482484, 3482731, 3482889, 3483421, 3483989, 3484242, 3484673, 3484736, 3485390, 3485499, 3485959, 3486358, 3486570]
abel [14223, 14233, 14448, 14563, 15039, 15141, 15228, 17635, 1222829, 1448386, 1448550, 1449216, 3834153, 4040933, 4686214, 4695492]
abelbethmaachah [1570895, 1700412]

Question 10: Include the code for your ReverseIndex class in your submission. How long did it take you to complete this assignment?

Spare-time challenge

If you have spare time, you can could modify mapreduce.py to spawn processes remotely and shares intermediate output files through AFS.

Go to 6.033 Home Page