## Assignment 1 MPI

Due: Tuesday, February 20, 1996

The purpose of this assignment is to jump right in and learn how to write message passing code with the message passing interface language MPI .

My suggestion is install MPI, and run the pi computation program before reading the manuals. If you are determined to, you can buy the MPI book Using MPI which is available from MIT press. I suspect it may be purchased at the COOP and the MIT Press bookstore, but call ahead and ask. (If somebody does this, let me know, and I will post.) It is not necessary to purchase the book, a good first step would be to see if you can understand or guess at what is probably going on in the example Fortran program or C program to compute pi by integrating f(x)= 4/(1+x^2) from 0 to 1. If you are willing to understand most, but not all, of the code, you will probably make very fast progress. The next step is to execute the code (see below), and the third step is to make sure you understand it by playing with the code, and perhaps reading some of the tutorials. The tutorial by Peter Pacheco looks like a particularly good one. You can also lookup individual commands in the index . If you like, we can arrange to do all this together in an electronic classroom some evening just to see how it all goes and avoid time wasting systems problems.

1) Obtaining an MPI for yourself:

If your department has a network of workstations such as the unix workstations in the mathematics department, it is fun to install MPI yourself. One easy choice (I tried it myself) is the easy eight step (actually just do steps one through seven unless you have root password) quick start to loading the free and well written MPICH (MPI chameleon) package from Argonne National Lab. (The origin of the name chameleon goes back to an old portable message passing system from Argonne that predates MPI.) The benefits of using this program is that it works. The only drawback is that vendor supplied implementations for special machines should be faster.

If you are using a Sun network, step 4 of the eight steps is exactly correct. On another network, you probably would want to change the -arch argument, but I think the -device=ch_p4 works just fine. If you want to use two machines with different filesystems one way is to use the -p4pg pgfile switch as described in the fourth paragraph of the section entitled "Heterogeneous networks and closer control" of the MPICH Users Guide. An example pgfile that worked between the mathematics department and the theory group over at the Lab for Computer Science looks like this (processor0 in math), (processor 1 at LCS)
```weierstrass                     0  /tmp_mnt/home/b2/edelman/mpi/mpich/examples/basic/cpi
woodpecker.lcs.mit.edu          1  /c/edelman/mpi/mpich/examples/basic/cpi
```
Keep me posted to other things that you try either out of necessity or curiosity.

If you find that you have devoted more than one hour to this project, stop, give up, and send me email. It is not worth your time. I am getting you all accounts on the IBM SP-2 (sp2.mit.edu) and the SGIs. There MPI is already installed. I just ran the SP-2 code myself thanks to help from Sivan Toledo. sp2.mit.edu doubles as node9. Please consider randomly logging onto sp2n10, sp2n11, and sp2n12 also.

2) Run the c or Fortran pi program. First do it on one processor, then on (P=) two, three, or four processors. Edit the code so that it takes smaller rectangles. Get some timings for different deltas on P processors and on one processor. Compare the former with the latter divided by P on a graph. What conclusion can be reached?

3) Write a program to sum P numbers, where P is an arbitrary power of 2. For simplicity place the single number i on processor i and form the binary tree that would add these numbers. (Of course nobody would ever compute with one number per processor). Use send and receives. (Is blocking or non-blocking better?) (I think I like the introductory tutorial on point to point communication from Cornell.) The program should have a loop with exactly log2(P) steps. There should be no unnecessary communication. The diagram below illustrates the computation on eight processors with communication shown in heavy lines. The inner loop consists of a send, a receive, and a sum. This loop is executed three times when there are eight processors.

4) Continuing with problem 3, rather than broadcasting, the sum to every processor, pass the sum back down the leaves of the binary tree using sends and receives. Again there should be log2(P) steps and no unnecessary communication. The diagram above may be used again, but this time the communication occurs "down" the tree.

5) This is the most difficult problem. It may seem artificial for now, but later in the course, you will see why this is exactly a simplification of the multipole algorithm.

Assume the data structure from problem 3 is intact. In other words, assume the binary tree consisting of all the partial sums is still available. For example, if there are eight processors, there should be four partial sums on processor 0, two on processor 2, and three on processor 4 in memory.

Here is your goal. Each processor must compute the sum of all of the numbers except for itself and its (at most) two neighbors. Therefore processor 3 computes the sum of all the numbers except those in processors 2, 3, and 4. Processor 0 computes the sum of the numbers in processors 2, 3, 4, ....

The catch is that I want you to compute this in a very special way. You might find this way silly at first, but there is an important pedagogical reason I want you to do it this way. Rather than beginning by defining what I want, let me show what will happen in processor 8 in a 32 processor system:

From the point of view of processor 8, the first thing that happens is processor 24 sends over the sum of the numbers from p24 to p31. (Notice the communications lines do not respect the tree stucture, i.e. in this example processor 24 is talking to processor 8 even though no line exists.) On the next step, this is added to the values in p0, p16 and p20. On the third step, the preceding sum is added to the values from p4, p12 and p14. Finally on the last step, p8 gets and adds in the values in p6, p10, and p11. When the computation is done, p8 has the sum of all the numbers except for those from p7, p8, and p9.

How may this be described? What is happening is that in this top to bottom calculation, p8 is receiving as much as it can without loading in information from itself nor from the nearest neighbor at that level. For example, consider the level containing p0, p4, p8, p12, p16, p20, p24, and p28. We do not wish to include information from p8, nor the two nearest neighbors p4 and p12. Also we already have the information from p24 and p28 from the previous step of the calculation. Note: you are allowed to inherit your parent's other sum information, for example, at the first step, P0 gets information from P24, at the second step, P4 can get *this* information from P0.

Your job is to write code that is executed on each processor that simultaneously does this computation for each processor. You may wish to use MPI_sendreceive Programming style matters. The code should be a loop with almost log2(p) iterations, and the loop (except for the first iteration) should consist of an inheritance of a partial sum from a parent (that does not include itself) and three MPI_sendreceive calls which are functions of the processor number. The code should be general enough to work on an arbirary power of 2 number of processors. Neatness and style count. If you don't have a large number of processors, you may simulate having more processors by having multiple processes per processor. If you are using mpich this may be done by repeating node names multiple times.