18.337 WebMeeting


Message from: Steven G. Johnson (stevenj@MIT.EDU)
About: SMP Matrix Multiply Rules

Thu, 10 Apr 1997 19:34:46 -0500

  • Next message: Steven G. Johnson: "whoops, duplicate message"
    The second contest will compare matrix multiply routines for a single
    shared-memory multiprocessor machine.

    The specifications of the program you will write for the contest follow.
    =46or a brief summary with pseudocode, go to the end of this message.

    **** 1: Getting Input Filename ****

    It will either prompt the user for a filename or read it from the command
    line. This filename will specify the input data file.

    **** 2: Input File Format ****

    The input file will contain a variable number of lines where each line
    consists of four integers separated by *commas*:

    n1,n2,n3,iters

    Here, n1, n2, and n3 specify the dimensions of the matrices to multiply.
    You will be multiplying an n1xn2 matrix by an n2xn3 matrix to yield an
    n1xn3 matrix. (i.e. these are the same as the first three parameters to
    mul_mfmf_mf in the uniprocessor contest.) iters specifies the number of
    times to perform a matrix multiplication of that size.

    **** 4: Outputs ****

    =46or each matrix size, you will compute and output the Mflops of the
    multiplication, defined as:

    Mflops =3D 2.0 * n1*n2*n3 * 1.0e-6 * iters / (time for all iters in seconds)

    Also, output the matrix size and the number of iterations. See below for
    more information on timing.

    After all the lines in the input file have been processed, you will also out=
    put
    the average Mflops (=3D sum of Mflops for all array sizes / number of array
    sizes).

    See below for how the computation time is defined.

    **** 5: Matrix Format ****

    To make our times comparable, everyone will start with the same matrix
    format. Your code may convert this format to another format internally,
    but the conversion time must be included in the time measurements.

    All the matrices must be stored in row-major order, also known as "C
    order." (e.g. the matrices in the uniprocessor contest were stored in
    row-major order.)

    That is, if you have an MxN matrix in double *a, the (i,j)th entry is at:
    a[i*N + j] (Here, i and j range from 0..M and 0..N, respectively.)

    Optionally (for those of you using MPI, perhaps), the *rows* may be
    distributed across processors

    =46or example, to distribute the MxN matrix across processes, you might stor=
    e
    the matrix with two arrays: double *a1 on process 1 and double *a2 on
    process 2.
    Here, the (i,j)th entry of the matrix is stored in:
    a1[i*N+j] if i < M/2 and a2[(i-M/2)*N+j] if i >=3D M/2.

    (You need not distribute the matrix equally across processes...you can have
    the first row on process 1 and the remaining M-1 rows on process 2 if you
    really want to.)

    You may (should?) take advantage of the fact that we will be running on a
    shared-memory machine, so data does not have be be copied in order to be
    sent from one process to another (in theory, at least; this may be hard in
    MPI).

    You may *not* store any of the matrices in transposed order or distribute
    the columns.

    Your matrix multiply routine may only write to the output matrix and may
    *not* write to the two input matrices. The two input matrices must be
    distinct (non-overlapping).

    **** 6: One-time Initializations ****

    In many real world situations, one performs many multiplications of
    matrices of the same size. To take advantage of this, one might often wish
    to do some sort of one-time preparations to optimize matrix multiplications
    of a given size.

    To reflect this, for each given matrix size, you *may* perform some
    preparations that will *not* be included in the timing (see below).

    These preparations may be anything that you can do given *only* the matrix
    dimensions. They may *not* do anything to the matrices themselves, nor may
    they do anything specific to particular input and output data.

    **** 7: Timing ****

    The measured times should *not* include: MPI or other communications
    initializations, initializing/allocating the matrices in the format above,
    one-time initializations for given matrix dimensions (see 6. above) or
    testing.

    The measured times *should* include: the loop over the iterations, the
    matrix multiplication itself, and conversion to any matrix format other
    than the "official" format.

    The time you should measure is the total CPU time (in seconds) / # processor=
    s.

    To get the total CPU time in C, use:

    #include <time.h>

    clock_t start_t, end_t;
    double cpu_time_in_seconds;

    start_t =3D clock();
    =2E..
    end_t =3D clock();

    cpu_time_in_seconds =3D (end_t - start_t) * 1.0 / CLOCKS_PER_SEC;

    In Fortran or F90, you should use the second() function:

    DOUBLE PRECISION start_t, end_t, cpu_time_in_seconds

    start_t =3D second()
    =2E..
    end_t =3D second()
    cpu_time_in_seconds =3D end_t - start_t

    In both cases, the cpu_time_in_seconds variable then needs to be divided by
    the number of processors you utilized. (I think...let me know if I am
    wrong about this.)

    **** 8. Outline & Summary ****

    Pseudocode for your program follows:

    begin program

    if necessary, initialize communications (for MPI or whatever)

    get input filename
    open input file

    line count =3D 0
    total Mflops =3D 0

    for each line in the input file
    read n1,n2,n3,iters

    allocate & initialize A (n1xn2) [optionally, distributed across n1]
    allocate & initialize B (n2xn3) [optionally, distributed across n2]
    allocate & initialize C (n1xn3) [optionally, distributed across n1]

    perform one-time initializations given (n1,n2,n3) only

    optionally, test =C7 =3D A * B for correctness

    get start time
    for iter =3D 1 to iters do
    optionally, convert to internal matrix format(s)
    compute C =3D A * B
    end
    get end time

    Mflops =3D 2*n1*n2*n3*1e-6*iters/(end time - start time)

    output n1,n2,n3,iters,Mflops

    line count =3D line count + 1
    total Mflops =3D total Mflops + Mflops

    deallocate A, B, C, etcetera as needed
    end

    close input file

    output (total Mflops/line count)

    end program

    **** 9. Feedback ****

    I have tried to make this specification as complete, fair, and unambiguous
    as possible. If you have any comments, questions, or concerns, please
    email me at stevenj@mit.edu.

    Also, watch the Webmeet page for announcements of addendums or corrections
    to the contest rules.

    Cordially,
    Steven G. Johnson



    You may post a follow-up message or a new message. To send a reply directly to the author, you may click on the email address above.

    If you would like to submit a message using your own mail program, send it to: 18.337-wm@webmeet.mit.edu

    If you are following up this article, please include the following line at the beginning of your message:
    In-Reply-To: v03020910af7315bb309d@[18.84.0.54]