M.I.T. DEPARTMENT OF EECS
|6.033 - Computer System Engineering||MapReduce Assignment|
Complete the following hands-on assignment. Do the activities described, and submit your solutions using the online submission site by 11:59p.
This assignment asks you to write a simple parallel program with the MapReduce library using a single-machine python implementation.
This assignment involves programming.
Download the program mapreduce.py and store it in a file
mapreduce.py. Download an ASCII bible and save it
in kjv12.txt. Then, run:
$ python mapreduce.py kjv12.txtAfter running for a little while, the output should be as follows:
and 12846 i 8854 god 4114 israel 2574 the 1843 for 1743 but 1558 then 1374 lord 1071 o 1066 david 1064 jesus 978 moses 847 judah 816 jerusalem 814 he 754 now 643 so 622 egypt 611 behold 596
The output has two columns: the first column has a lower-case version of a title-cased word that appears in the ASCII bible and the second column has a count of the number of times that word appears in the bible. The output is trimmed to only display the top 20 results sorted by descending word count.
We will now study mapreduce.py. The program begins execution after the following statement:
if __name__ == '__main__':
We then create an instance of the WordCount class using a few parameters. The
last parameter comes from the command line. In our example, it is "kjv12.txt".
This parameter controls which file we will be executing MapReduce on.
Immediately after initialization, we call
run on the WordCount
instance. When we call run on our WordCount instance, the Python MapReduce
library runs the MapReduce algorithm using the
reduce methods defined in the WordCount class.
You may find the Python Reference useful in answering the following questions. In particular, the sections on Multiprocessing and Process Pools may be useful.
Question 1: What do the first two parameters to WordCount's
__init__ method control?
Question 2: Briefly explain how calling
calls to both the
reduce methods of the
For the following questions, do not assume that the name of the parameters correspond to their function!
Question 3: What do the parameters
value of the
map method in WordCount represent?
Question 4: What do the parameters
keyvalues of the
reduce method in WordCount represent?
Question 5: How many invocations are there to
how many to
Question 6: Which invocations run in parallel? (Assuming there are enough cores.)
Question 7: How much input (in number of bytes) does a single
Question 8: How much input (in number of keys) does a single
Question 9: For which parameters of
reducetask do you see speedup? Why do you observe no speedup for
some parameters? (You might see no speedup at all on a busy machine or a
machine with a single core.)
Extend the program with a ReverseIndex class. The
should produce for each word in the input a pair
offset is the byte offset in the input file. The
Reduce function should output
(word, [offset, offset, ...]), sorted by ascending offset.
This should require few lines of code (~25); if you find yourself writing much
more code, you might be on the wrong track; ask for help to double check that
you have the right plan.
While developing ReverseIndex, you want to use a small input file for which you know what the right answer is, so that you can quickly iterate to the correct solution.
Once you have a correct implementation, run it on the bible and look at the top 20 results. Your output should be as follows (without the top two entries "a" and "aaron"):
aaronites [1817624, 1875693] abaddon  abagtha  abana  abarim [732687, 767117, 767187, 944076] abase [2304339, 2823950, 3322367, 3494485] abased [3830632, 4056754, 4075715, 4589626] abasing  abated [28468, 29098, 29656, 573980, 950776, 1106814] abba [3947502, 4412428, 4550068] abda [1495474, 2149589] abdeel  abdi [1785594, 2005637, 2096175] abdiel  abdon [1048236, 1135107, 1135327, 1789463, 1797454, 1797941, 1804405, 2036728] abednego [3468578, 3480066, 3482484, 3482731, 3482889, 3483421, 3483989, 3484242, 3484673, 3484736, 3485390, 3485499, 3485959, 3486358, 3486570] abel [14223, 14233, 14448, 14563, 15039, 15141, 15228, 17635, 1222829, 1448386, 1448550, 1449216, 3834153, 4040933, 4686214, 4695492] abelbethmaachah [1570895, 1700412]
Question 10: Include the code for your ReverseIndex class in your submission. How long did it take you to complete this assignment?
If you have spare time, you can could modify
spawn processes remotely and shares intermediate output files through AFS.
|Go to 6.033 Home Page|