M.I.T. DEPARTMENT OF EECS
6.033 - Computer System Engineering | MapReduce Assignment |
Complete the following hands-on assignment. Do the activities described, and submit your solutions using the online submission site by 11:59p.
This assignment asks you to write a simple parallel program with the MapReduce library using a single-machine python implementation.
This assignment involves programming.
Download the program mapreduce.py and store it in a file
mapreduce.py
. Download an ASCII bible and save it
in kjv12.txt. Then, run:
$ python mapreduce.py kjv12.txtAfter running for a little while, the output should be as follows:
and 12846 i 8854 god 4114 israel 2574 the 1843 for 1743 but 1558 then 1374 lord 1071 o 1066 david 1064 jesus 978 moses 847 judah 816 jerusalem 814 he 754 now 643 so 622 egypt 611 behold 596
The output has two columns: the first column has a lower-case version of a title-cased word that appears in the ASCII bible and the second column has a count of the number of times that word appears in the bible. The output is trimmed to only display the top 20 results sorted by descending word count.
We will now study mapreduce.py. The program begins execution after the following statement:
if __name__ == '__main__':
We then create an instance of the WordCount class using a few parameters. The
last parameter comes from the command line. In our example, it is "kjv12.txt".
This parameter controls which file we will be executing MapReduce on.
Immediately after initialization, we call run
on the WordCount
instance. When we call run on our WordCount instance, the Python MapReduce
library runs the MapReduce algorithm using the map
and
reduce
methods defined in the WordCount class.
You may find the Python Reference useful in answering the following questions. In particular, the sections on Multiprocessing and Process Pools may be useful.
Question 1: What do the first two parameters to WordCount's
__init__
method control?
Question 2: Briefly explain how calling run
triggers
calls to both the map
and reduce
methods of the
WordCount instance.
For the following questions, do not assume that the name of the parameters correspond to their function!
Question 3: What do the parameters keyvalue
and
value
of the map
method in WordCount represent?
Question 4: What do the parameters key
and
keyvalues
of the reduce
method in WordCount represent?
Question 5: How many invocations are there to doMap
and
how many to doReduce
? Why?
Question 6: Which invocations run in parallel? (Assuming there are enough cores.)
Question 7: How much input (in number of bytes) does a single
doMap
process?
Question 8: How much input (in number of keys) does a single
doReduce
process?
Question 9: For which parameters of maptask
and
reducetask
do you see speedup? Why do you observe no speedup for
some parameters? (You might see no speedup at all on a busy machine or a
machine with a single core.)
Extend the program with a ReverseIndex class. The Map
function
should produce for each word in the input a pair (word, offset)
,
where offset
is the byte offset in the input file. The
Reduce
function should output
(word, [offset, offset, ...])
, sorted by ascending offset.
This should require few lines of code (~25); if you find yourself writing much
more code, you might be on the wrong track; ask for help to double check that
you have the right plan.
While developing ReverseIndex, you want to use a small input file for which you know what the right answer is, so that you can quickly iterate to the correct solution.
Once you have a correct implementation, run it on the bible and look at the top 20 results. Your output should be as follows (without the top two entries "a" and "aaron"):
aaronites [1817624, 1875693] abaddon [4789761] abagtha [2165842] abana [1643159] abarim [732687, 767117, 767187, 944076] abase [2304339, 2823950, 3322367, 3494485] abased [3830632, 4056754, 4075715, 4589626] abasing [4530190] abated [28468, 29098, 29656, 573980, 950776, 1106814] abba [3947502, 4412428, 4550068] abda [1495474, 2149589] abdeel [3121198] abdi [1785594, 2005637, 2096175] abdiel [1779568] abdon [1048236, 1135107, 1135327, 1789463, 1797454, 1797941, 1804405, 2036728] abednego [3468578, 3480066, 3482484, 3482731, 3482889, 3483421, 3483989, 3484242, 3484673, 3484736, 3485390, 3485499, 3485959, 3486358, 3486570] abel [14223, 14233, 14448, 14563, 15039, 15141, 15228, 17635, 1222829, 1448386, 1448550, 1449216, 3834153, 4040933, 4686214, 4695492] abelbethmaachah [1570895, 1700412]
Question 10: Include the code for your ReverseIndex class in your submission. How long did it take you to complete this assignment?
If you have spare time, you can could modify mapreduce.py
to
spawn processes remotely and shares intermediate output files through AFS.
Go to 6.033 Home Page |