The specifications of the program you will write for the contest follow.
For a brief summary with pseudocode, go to the end of this message.
**** 1: Getting Input Filename ****
It will either prompt the user for a filename or read it from the command
line. This filename will specify the input data file.
**** 2: Input File Format ****
The input file will contain a variable number of lines where each line
consists of four integers separated by *commas*:
n1,n2,n3,iters
Here, n1, n2, and n3 specify the dimensions of the matrices to multiply.
You will be multiplying an n1xn2 matrix by an n2xn3 matrix to yield an
n1xn3 matrix. (i.e. these are the same as the first three parameters to
mul_mfmf_mf in the uniprocessor contest.) iters specifies the number of
times to perform a matrix multiplication of that size.
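For concreteness, a minimal C sketch of reading such a file is below. It assumes the filename comes from the command line (see section 1) and its names and error handling are illustrative only, not part of the spec:

#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *fp;
    int n1, n2, n3, iters;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <input file>\n", argv[0]);
        return 1;
    }
    fp = fopen(argv[1], "r");
    if (fp == NULL) {
        perror(argv[1]);
        return 1;
    }
    while (fscanf(fp, "%d,%d,%d,%d", &n1, &n2, &n3, &iters) == 4) {
        /* multiply an n1 x n2 matrix by an n2 x n3 matrix, iters times */
    }
    fclose(fp);
    return 0;
}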
**** 4: Outputs ****
For each matrix size, you will compute and output the Mflops of the
multiplication, defined as:
Mflops = 2.0 * n1*n2*n3 * 1.0e-6 * iters / (time for all iters in seconds)
Also, output the matrix size and the number of iterations. See below for
more information on timing.
After all the lines in the input file have been processed, you will also
output the average Mflops (= sum of Mflops for all array sizes / number of
array sizes).
See below for how the computation time is defined.
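As a sketch, the per-line output might be computed like this in C (the function name, output format, and elapsed_seconds are illustrative; elapsed_seconds is the measured time defined in the timing section below). You would sum the returned values and print sum/count after the last line:

#include <stdio.h>

/* Sketch: report the Mflops for one input line. */
static double report_mflops(int n1, int n2, int n3, int iters,
                            double elapsed_seconds)
{
    double mflops = 2.0 * n1 * n2 * n3 * 1.0e-6 * iters / elapsed_seconds;
    printf("%dx%dx%d  iters=%d  Mflops=%g\n", n1, n2, n3, iters, mflops);
    return mflops;
}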
**** 5: Matrix Format ****
To make our times comparable, everyone will start with the same matrix
format. Your code may convert this format to another format internally,
but the conversion time must be included in the time measurements.
All the matrices must be stored in row-major order, also known as "C
order." (e.g. the matrices in the uniprocessor contest were stored in
row-major order.)
That is, if you have an MxN matrix in double *a, the (i,j)th entry is at:
a[i*N + j] (Here, i and j range from 0..M-1 and 0..N-1, respectively.)
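A tiny C sketch of this layout (the helper name is just for illustration):

#include <stdlib.h>

/* Sketch: allocate an M x N matrix in row-major ("C") order as one block.
   Entry (i,j) is then a[i*N + j]. */
double *alloc_matrix(int M, int N)
{
    return (double *) malloc((size_t) M * N * sizeof(double));
}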
Optionally (for those of you using MPI, perhaps), the *rows* may be
distributed across processors.
For example, to distribute the MxN matrix across processes, you might store
the matrix with two arrays: double *a1 on process 1 and double *a2 on
process 2.
Here, the (i,j)th entry of the matrix is stored in:
a1[i*N+j] if i < M/2 and a2[(i-M/2)*N+j] if i >= M/2.
(You need not distribute the matrix equally across processes...you can have
the first row on process 1 and the remaining M-1 rows on process 2 if you
really want to.)
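To restate the two-process indexing rule above as code (this is only a sketch of the layout; in an actual MPI program each process would hold just its own block):

/* Sketch: entry (i,j) of an M x N row-major matrix whose first M/2 rows
   live in a1 (process 1) and whose remaining rows live in a2 (process 2). */
double get_entry(const double *a1, const double *a2,
                 int M, int N, int i, int j)
{
    if (i < M / 2)
        return a1[i * N + j];
    else
        return a2[(i - M / 2) * N + j];
}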
You may (should?) take advantage of the fact that we will be running on a
shared-memory machine, so data does not have to be copied in order to be
sent from one process to another (in theory, at least; this may be hard in
MPI).
You may *not* store any of the matrices in transposed order or distribute
the columns.
Your matrix multiply routine may only write to the output matrix and may
*not* write to the two input matrices. The two input matrices must be
distinct (non-overlapping).
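As a correctness baseline only (not a performance suggestion), a multiply that obeys these rules (reads A and B, writes only C) might look like:

/* Sketch: naive C = A * B with A (n1 x n2), B (n2 x n3), C (n1 x n3),
   all row-major.  Only the output matrix C is written. */
void matmul_naive(int n1, int n2, int n3,
                  const double *A, const double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < n1; ++i)
        for (j = 0; j < n3; ++j) {
            double sum = 0.0;
            for (k = 0; k < n2; ++k)
                sum += A[i * n2 + k] * B[k * n3 + j];
            C[i * n3 + j] = sum;
        }
}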
**** 6: One-time Initializations ****
In many real world situations, one performs many multiplications of
matrices of the same size. To take advantage of this, one might often wish
to do some sort of one-time preparations to optimize matrix multiplications
of a given size.
To reflect this, for each given matrix size, you *may* perform some
preparations that will *not* be included in the timing (see below).
These preparations may be anything that you can do given *only* the matrix
dimensions. They may *not* do anything to the matrices themselves, nor may
they do anything specific to particular input and output data.
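For instance (the struct, the heuristic, and the names below are all assumptions, just to show the flavor), a per-size setup might choose a blocking factor and allocate scratch space using only the dimensions:

#include <stdlib.h>

/* Sketch: one-time preparation that looks only at (n1, n2, n3), never at
   the matrix data: pick a blocking factor and allocate reusable scratch. */
typedef struct {
    int block;        /* blocking factor chosen for this problem size */
    double *scratch;  /* workspace reused by every multiplication */
} mm_plan;

mm_plan make_plan(int n1, int n2, int n3)
{
    mm_plan p;
    p.block = (n2 >= 64) ? 64 : n2;   /* illustrative heuristic only */
    p.scratch = (double *) malloc((size_t) p.block * p.block * sizeof(double));
    (void) n1;
    (void) n3;
    return p;
}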
**** 7: Timing ****
The measured times should *not* include: MPI or other communications
initializations, initializing/allocating the matrices in the format above,
one-time initializations for given matrix dimensions (see 6. above) or
testing.
The measured times *should* include: the loop over the iterations, the
matrix multiplication itself, and conversion to any matrix format other
than the "official" format.
The time you should measure is the total CPU time (in seconds) / # processors.
To get the total CPU time in C, use:
#include <time.h>
clock_t start_t, end_t;
double cpu_time_in_seconds;
start_t = clock();
...
end_t = clock();
cpu_time_in_seconds = (end_t - start_t) * 1.0 / CLOCKS_PER_SEC;
In Fortran or F90, you should use the second() function:
DOUBLE PRECISION start_t, end_t, cpu_time_in_seconds
start_t = second()
...
end_t = second()
cpu_time_in_seconds = end_t - start_t
In both cases, the cpu_time_in_seconds variable then needs to be divided by
the number of processors you utilized. (I think...let me know if I am
wrong about this.)
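Putting section 7 together, the timed region for one input line might look like the sketch below. Here matmul_naive is the illustrative routine from section 5 above and nprocs is however many processors your program uses; both are placeholders for your own code:

#include <time.h>

/* Defined in the earlier sketch; substitute your own multiply routine. */
void matmul_naive(int n1, int n2, int n3,
                  const double *A, const double *B, double *C);

/* Sketch: time only the iteration loop (format conversion + multiply),
   then divide by the number of processors used. */
double time_multiplies(int n1, int n2, int n3, int iters,
                       const double *A, const double *B, double *C,
                       int nprocs)
{
    clock_t start_t, end_t;
    int iter;

    start_t = clock();
    for (iter = 0; iter < iters; ++iter) {
        /* optional conversion to an internal format would go here and
           would count toward the measured time */
        matmul_naive(n1, n2, n3, A, B, C);
    }
    end_t = clock();
    return (end_t - start_t) * 1.0 / CLOCKS_PER_SEC / nprocs;
}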
**** 8. Outline & Summary ****
Pseudocode for your program follows:
begin program
    if necessary, initialize communications (for MPI or whatever)
    get input filename
    open input file
    line count = 0
    total Mflops = 0
    for each line in the input file
        read n1,n2,n3,iters
        allocate & initialize A (n1xn2) [optionally, distributed across n1]
        allocate & initialize B (n2xn3) [optionally, distributed across n2]
        allocate & initialize C (n1xn3) [optionally, distributed across n1]
        perform one-time initializations given (n1,n2,n3) only
        optionally, test C = A * B for correctness
        get start time
        for iter = 1 to iters do
            optionally, convert to internal matrix format(s)
            compute C = A * B
        end
        get end time
        Mflops = 2*n1*n2*n3*1e-6*iters/(end time - start time)
        output n1,n2,n3,iters,Mflops
        line count = line count + 1
        total Mflops = total Mflops + Mflops
        deallocate A, B, C, etcetera as needed
    end
    close input file
    output (total Mflops/line count)
end program
**** 9. Feedback ****
I have tried to make this specification as complete, fair, and unambiguous
as possible. If you have any comments, questions, or concerns, please
email me at stevenj@mit.edu.
Also, watch the Webmeet page for announcements of addendums or corrections
to the contest rules.
Cordially,
Steven G. Johnson