Data Parallelization Library
For the optimizer project we have
provided you with a simple library for to use if you decide to
data-parallelize decaf applications. The library is included in
lib6035.a Your compiler is not responsible for automatically
extracting loop-level data-parallelism. Instead, you have the
option of executing iterations of a forpar loop in parallel, on different cores of
our target architecture. Please see the Decaf
Specification for details on the semantics of forpar. The benchmarks given for the
optimizer project have been hand-parallelized.
The provided library is built using pthreads and is very simple.
Thread creation and termination are extremely fast. The library
has only 2 external methods:
void
set_num_threads(int i);
Set the number of threads to i, where i is greater than 1. This
parameter determines how many parallel threads will execute a single
forpar loop. The most recently executed call to set_num_threads() effects the forpar. This method must be called
before any calls to create_and_run_threads(...).
void
create_and_run_threads(void* (*f)(void *));
This method creates i threads
(where i is set by set_num_threads). Each thread calls the method f
with a unique ID. ID ranges from 0, ..., i-1. The call to create_and_run_threads() returns when all threads spawned have
finished executing their version of f. If you don't understand
the notation for the parameter, don't worry, it is C notation. It
means that create_and_run_threads() is passed a function pointer, f. The
function it is passed, f, accepts a single pointer (of type void) and does not return a value. See
below for more details.
Variable Sharing
For a forpar, as stated in the decaf specification,
all variables defined in the forpar are local to each iteration. Any
local variable declared in an outer scope and used in the forpar must be made "global" in the generated
assembly to enable sharing of this variable. It is illegal (undefined)
for any of the iterations to write to a scalar variable. Only
arrays have defined sharing semantics if at most one iteration writes
each array element.
The pthreads library and the underlying hardware cache-coherent memory
implement the sharing of global arrays and scalars. You need to
be careful about how you distribute the iterations (and thus the data)
of a forpar. Note that sharing in the hardware is accomplished at
a cache-line granularity. And briefly, only one processor at a
time can own write-permission to a cache-line. Think about what
this means for iteration and data distribution.
Implementation
Traditionally, data-parallelization
is a high-level transformation, performed on the high-level
intermediate representation. Just to get your started, we are
going to briefly discuss some implementation issues.
First, if you are data-parallelizing, your generated assembly must call
set_num_threads() before calling create_and_run_threads().
In your high-level IR, you might create a function call containing the
body of a forpar loop. This function would accept a
single argument, a pointer to an integer thread ID, that the function
will use to determine which iterations of the forpar loop the thread is
responsible for executing. The lecture on Parallelization
discussed how to accomplish this distribution using a traditional for loop.
We have not discussed pointers much, but you might want to add some
hooks in your compiler for dealing with pointer types. In the
compiler-generated function, you are passed a pointer to the ID. The
pointer is passed in %rdi (the first argument register). You can
deference the pointer using an addressing mode:
movq (%rdi), %rax
This will load the value of the memory location pointed to by %rdi into
%rax. Now the integer value of the ID is in %rax.
When execution reaches a forpar loop, the generated assembly code
instead calls create_and_run_threads() with the address to the
compiler-generated function as the argument. This is easily
accomplished. If we have a compiler generated function f defined,
we can pass its address as an argument with:
movq $f, %rdi
call create_and_run_threads
The library will then spawn i threads to execute the parallel
iterations. Calling each function (each f in this case) with a
different integer ID, from 0 to i-1. It is up to the pthread's
runtime scheduler to assign threads to processors (cores) and to
schedule the threads for execution.
Now, when you execute your program, the benchmarking output is a bit
more complex, as it gives min, max, mean, stddev across all the threads
spawned. When benchmarking multi-threaded apps, your output will
look like this:
libmonitor debug:
(P21780,T0x2afb57e246e0) monitor_fini_process()
PapiEx
Version: 0.99rc2
Executable:
/u/mgordon/6.035/par_lib/parallel
Processor:
AMD K8 Revision C
Clockrate:
1993.465942
Parent Process
ID: 21773
Process
ID:
21780
Hostname:
tyner
Options:
PAPI_TOT_CYC,NO_WRITE
Domain:
User
Num. of
threads: 5
min
max
mean
stddev
---
---
----
------
Real usecs
:
1723
8366
3506
2535
Real cycles
:
3435408
16667275
6987049
5050770
Proc usecs
:
1684
4266
2230
1018
Proc cycles
:
3355448
8500196
4444706
2028198
PAPI_TOT_CYC:
2595252
5923028
3319077
1303291
Event descriptions:
Event: PAPI_TOT_CYC
Derived: No
Short Description: Total cycles
Long Description: Total cycles
Developer's Notes:
The statistic you are interested in is now *max* "Real cycles" (or
*max* "Real usec"). So in this case the program took 16667275
cycles of to finish. You can use this metric to determine the
implementation and parameters that work best. Also, you can
compare this directly to the "Real cycles" field in a serial execution
of the program to see if there is a speed up from
data-parallelization. Don't use PAPI_TOT_CYC; it only counts
user, non-blocked cycles. For the derby we will be using the
"real cycles" field because for parallelization we are generating
system calls and we need to count this overhead. You should
arrive at similar results using "Real cycles" or "Real usecs".