Data Parallelization Library

For the optimizer project we have provided you with a simple library for to use if you decide to data-parallelize decaf applications.  The library is included in lib6035.a  Your compiler is not responsible for automatically extracting loop-level data-parallelism.  Instead, you have the option of executing iterations of a forpar loop in parallel, on different cores of our target architecture.  Please see the Decaf Specification for details on the semantics of forpar.  The benchmarks given for the optimizer project have been hand-parallelized. 

The provided library is built using pthreads and is very simple.  Thread creation and termination are extremely fast.  The library has only 2 external methods:

void set_num_threads(int i);

Set the number of threads to i, where i is greater than 1.  This parameter determines how many parallel threads will execute a single forpar loop.  The most recently executed call to
set_num_threads() effects the forpar.  This method must be called before any calls to create_and_run_threads(...).

void create_and_run_threads(void* (*f)(void *));

This method creates i
threads (where i is set by set_num_threads).  Each thread calls the method f with a unique ID.  ID ranges from 0, ..., i-1.  The call to create_and_run_threads() returns when all threads spawned have finished executing their version of f.  If you don't understand the notation for the parameter, don't worry, it is C notation.  It means that create_and_run_threads() is passed a function pointer, f. The function it is passed, f, accepts a single pointer (of type void) and does not return a value.  See below for more details.

Variable Sharing

For a forpar, as stated in the decaf specification, all variables defined in the forpar are local to each iteration.  Any local variable declared in an outer scope and used in the forpar must be made "global" in the generated assembly to enable sharing of this variable. It is illegal (undefined) for any of the iterations to write to a scalar variable.  Only arrays have defined sharing semantics if at most one iteration writes each array element. 

The pthreads library and the underlying hardware cache-coherent memory implement the sharing of global arrays and scalars.  You need to be careful about how you distribute the iterations (and thus the data) of a forpar.  Note that sharing in the hardware is accomplished at a cache-line granularity.  And briefly, only one processor at a time can own write-permission to a cache-line.  Think about what this means for iteration and data distribution.


Implementation

Traditionally, data-parallelization is a high-level transformation, performed on the high-level intermediate representation.  Just to get your started, we are going to briefly discuss some implementation issues. 

First, if you are data-parallelizing, your generated assembly must call
set_num_threads() before calling  create_and_run_threads()

In your high-level IR, you might create a function call containing the body of a
forpar loop.  This function would accept a single argument, a pointer to an integer thread ID, that the function will use to determine which iterations of the forpar loop the thread is responsible for executing.  The lecture on Parallelization  discussed how to accomplish this distribution using a traditional for loop.

We have not discussed pointers much, but you might want to add some hooks in your compiler for dealing with pointer types.  In the compiler-generated function, you are passed a pointer to the ID.  The pointer is passed in %rdi (the first argument register).  You can deference the pointer using an addressing mode:

    movq    (%rdi), %rax    

This will load the value of the memory location pointed to by %rdi into %rax. Now the integer value of the ID is in %rax.

When execution reaches a
forpar loop, the generated assembly code instead calls create_and_run_threads() with the address to the compiler-generated function as the argument.  This is easily accomplished.  If we have a compiler generated function f defined, we can pass its address as an argument with:

    movq     $f, %rdi
    call     create_and_run_threads

The library will then spawn i threads to execute the parallel iterations.  Calling each function (each f in this case) with a different integer ID, from 0 to i-1.  It is up to the pthread's runtime scheduler to assign threads to processors (cores) and to schedule the threads for execution.

Now, when you execute your program, the benchmarking output is a bit more complex, as it gives min, max, mean, stddev across all the threads spawned.  When benchmarking multi-threaded apps, your output will look like this:

libmonitor debug: (P21780,T0x2afb57e246e0) monitor_fini_process()
PapiEx Version:         0.99rc2
Executable:             /u/mgordon/6.035/par_lib/parallel
Processor:              AMD K8 Revision C
Clockrate:              1993.465942
Parent Process ID:      21773
Process ID:             21780
Hostname:               tyner
Options:                PAPI_TOT_CYC,NO_WRITE
Domain:                 User
Num. of threads:        5

                             min                     max                    mean                  stddev
                             ---                     ---                    ----                  ------
Real usecs :                1723                    8366                    3506                    2535
Real cycles :            3435408                16667275                 6987049                 5050770
Proc usecs :                1684                    4266                    2230                    1018
Proc cycles :            3355448                 8500196                 4444706                 2028198
PAPI_TOT_CYC:            2595252                 5923028                 3319077                 1303291

Event descriptions:
Event: PAPI_TOT_CYC
        Derived: No
        Short Description: Total cycles
        Long Description: Total cycles
        Developer's Notes:

The statistic you are interested in is now *max* "Real cycles" (or *max* "Real usec").  So in this case the program took 16667275 cycles of to finish.  You can use this metric to determine the implementation and parameters that work best.  Also, you can compare this directly to the "Real cycles" field in a serial execution of the program to see if there is a speed up from data-parallelization.  Don't use PAPI_TOT_CYC; it only counts user, non-blocked cycles.  For the derby we will be using the "real cycles" field because for parallelization we are generating system calls and we need to count this overhead.  You should arrive at similar results using "Real cycles" or "Real usecs".