6.173 Lab #5: Shared-memory barrier and TSP

Deadlines

Useful links

Goal

In this lab you will rewrite TSP using Beehive's shared memory. Beehive's shared memory is a bit unusual: it does not maintain cache coherence. When a core loads from memory, it might see an old value from its data cache rather than a newer value stored by another core. When a core stores data, the store may affect only its own cache, without being reflected in DRAM or in other cores' caches. For example, if the data at address a is in both core 2's cache and core 3's cache, and core 2 stores to a, a load from a by core 3 after core 2's store is not guaranteed to see core 2's update: core 2 updates its own cache, but no other caches. Software must manage the caches to obtain cache coherence.

The Beehive provides the following primitives for software to obtain cache coherence (see intercore.as and intercore.h):

cache_flushMem(address, size): write any dirty cache lines holding the given address range back to DRAM.
cache_invalidateMem(address, size): discard any cache lines holding the given address range, so that subsequent loads fetch fresh data from DRAM.

Thus, if a core wants to guarantee that other cores can see its changes, it should call cache_flushMem after it modifies the object at a given address, and other cores should call cache_invalidateMem before they load that object. test1 in lab4/beehive_main.c uses these functions; make sure that code makes sense to you.

Beehive's cache lines are 32 bytes long, and the flush and invalidate functions effectively round the address down and the size up so that they operate on entire cache lines. Thus the following code probably won't work reliably:

int x;
int y;

void core1(void) {
  cache_invalidateMem(&x, sizeof(x));  /* fetch a fresh copy of x */
  x = x + 1;
  cache_flushMem(&x, sizeof(x));       /* write the new x back to DRAM */
}

void core2(void) {
  cache_invalidateMem(&y, sizeof(y));  /* fetch a fresh copy of y */
  y = y + 1;
  cache_flushMem(&y, sizeof(y));       /* write the new y back to DRAM */
}

If two different cores call the two functions at the same time, and x and y are on the same cache line (which is likely), whichever flush finishes last will undo the other core's increment: each flush writes back the entire 32-byte line, including a stale copy of the other core's variable. Use the CACHELINE macro to direct the compiler to align a variable to a 32-byte boundary:

int x CACHELINE;
int y CACHELINE;
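
Putting the primitives together: the sketch below (publish and consume are hypothetical names, not part of the lab code) shows one core handing a value to another through shared memory, assuming cache_flushMem completes before it returns so that the value reaches DRAM before the flag does:

int shared_val CACHELINE;    /* each variable alone on its own cache line */
int shared_ready CACHELINE;

/* Writer core: publish a value, then raise a flag. */
void publish(int v) {
  shared_val = v;
  cache_flushMem(&shared_val, sizeof(shared_val));      /* value first... */
  shared_ready = 1;
  cache_flushMem(&shared_ready, sizeof(shared_ready));  /* ...then the flag */
}

/* Reader core: wait for the flag, then read the value. */
int consume(void) {
  do {
    cache_invalidateMem(&shared_ready, sizeof(shared_ready));
  } while (shared_ready == 0);
  cache_invalidateMem(&shared_val, sizeof(shared_val));
  return shared_val;
}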

Although Beehive does not provide cache coherence, it does not guarantee non-coherence either: when a core loads data from DRAM, the new data may displace existing data in the core's cache, and if the displaced data is dirty, it will be written back to DRAM. Thus cached data may be evicted and flushed even though the software never calls cache_invalidateMem or cache_flushMem.

You may find Beehive's semaphores handy to ensure that only one core at a time does something. There are 64 numbered semaphores; think of each semaphore as a lock (they are not counting semaphores). icSema_P acquires (locks) a semaphore, waiting until it is free, and icSema_V releases (unlocks) it.

These functions are atomic: for example, if multiple cores are waiting for a semaphore and a core unlocks it, only one of the waiting cores will return from icSema_P. A semaphore starts out unlocked. For the detailed spec, see the comments in intercore.h.
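
For example, a core could use a semaphore as a lock around a shared counter. This is a hypothetical sketch (counter_increment is not part of the lab code); it assumes icSema_P and icSema_V take the semaphore number as their argument and that semaphore 3 is otherwise unused:

int counter CACHELINE;   /* alone on its cache line */

void counter_increment(void) {
  icSema_P(3);                                     /* acquire lock 3 */
  cache_invalidateMem(&counter, sizeof(counter));  /* read a fresh copy */
  counter = counter + 1;
  cache_flushMem(&counter, sizeof(counter));       /* publish the update */
  icSema_V(3);                                     /* release the lock */
}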

Task 1: Update your git repo

You must pull the lab 5 code into your git repository. As with the previous labs, you should first commit your changes and then pull the new lab 5 files:

you@eelab01$ git status
the current state of your git repo
you@eelab01$ git add files that you want to commit
you@eelab01$ git commit -a
you@eelab01$ git pull

You should now see a sw/lab5 directory with a beehive_main.c. You will also see an addition to your lib/barrier.c. The pull will also update the makefile to build lab 5 (tsp_sm.img).


Task 2: barrier using shared memory

Your job is to implement a lab4-style barrier using shared memory; you will use lab4 to test it. Write your code in sm_barrier() in lib/barrier.c, check that test1 is invoked with sm_barrier(), build barrier.img, and make sure your code passes lab4's test1.

You will need to use cache_invalidateMem and cache_flushMem. You may find semaphores useful, though it's possible to implement a barrier without them.

Please do not use messages in your barrier implementation.
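
To make the pieces concrete, here is a minimal sketch of one possible approach, a sense-reversing barrier; it is not the required solution. It assumes that semaphore 2 is otherwise unused, that at most 16 cores participate, and that sm_barrier receives this core's index and the number of participating cores as parameters; check lib/barrier.c for the actual interface.

#define BARRIER_SEMA 2        /* assumption: this semaphore is otherwise unused */
#define MAX_CORES    16       /* assumption: enough for a full-sized Beehive */

int arrived CACHELINE;        /* number of cores that have reached the barrier */
int global_sense CACHELINE;   /* flips once per barrier episode */

/* Each core's private sense, padded so no two cores share a cache line. */
static struct { int sense; char pad[28]; } local_sense[MAX_CORES] CACHELINE;

void sm_barrier(int me, int ncores) {
  int s = 1 - local_sense[me].sense;  /* the sense this episode completes with */
  local_sense[me].sense = s;

  icSema_P(BARRIER_SEMA);             /* serialize updates to the counter */
  cache_invalidateMem(&arrived, sizeof(arrived));
  if (arrived + 1 == ncores) {
    arrived = 0;                      /* last core resets the counter... */
    cache_flushMem(&arrived, sizeof(arrived));
    global_sense = s;                 /* ...and releases the waiters */
    cache_flushMem(&global_sense, sizeof(global_sense));
  } else {
    arrived = arrived + 1;
    cache_flushMem(&arrived, sizeof(arrived));
  }
  icSema_V(BARRIER_SEMA);

  do {                                /* spin until the global sense flips */
    cache_invalidateMem(&global_sense, sizeof(global_sense));
  } while (global_sense != s);
}

Sense reversal avoids a race between consecutive barriers: a fast core that reaches the next barrier cannot release cores still spinning in the previous one, because the global sense only flips again once every core has arrived.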


Task 3: TSP using shared memory

Your goal in this task is to create a parallel TSP solver that uses shared memory and semaphores for all interactions between cores. Fill out tsp_sm.c in lab5, and build tsp_sm.img.

As in lab 3, you'll think about partitioning, load balancing, and inter-core coordination of minimum costs and termination. You may want to try to beat the performance of your message-passing implementation from lab 3. We will announce who gets the best performance!

Please report performance on a full-sized Beehive, which you can create with make risc_13.bit in hwv5. This will give you 13 cores on which to run your software.

Please report run-times for the topology in 12_cities.h, for both starting cities 0 and 1. Report the largest number of cycles that any single core spends executing your code.

Hint: you probably don't want to base your shared-memory TSP design too directly on your message-passing TSP solution. Think through a design from scratch.

Hint: which data structures do you want to share between cores? The bound? A job queue? The best path? Remember to put data that different cores might modify at the same time on different cache lines.

Hint: how are you going to get load balance? You may want to have a shared job queue.
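
As one illustration (all names here are hypothetical, not part of the handed-out code), the job queue could be a shared array of tour prefixes that core 0 fills before the workers start, with a locked index from which workers claim work; semaphore 4 is assumed unused:

#define QUEUE_SEMA 4
#define MAX_JOBS   4096

typedef struct {
  char prefix[16];         /* a partial tour to expand */
  int  len;                /* number of cities in the prefix */
} job_t;

job_t jobs[MAX_JOBS];      /* written once by core 0, then read-only */
int num_jobs CACHELINE;    /* total jobs enqueued */
int next_job CACHELINE;    /* index of the next unclaimed job */

/* Claim one job; returns 0 on success, -1 when the queue is drained. */
int job_get(job_t *out) {
  int i = -1;
  icSema_P(QUEUE_SEMA);
  cache_invalidateMem(&next_job, sizeof(next_job));
  cache_invalidateMem(&num_jobs, sizeof(num_jobs));
  if (next_job < num_jobs) {
    i = next_job;
    next_job = i + 1;
    cache_flushMem(&next_job, sizeof(next_job));
  }
  icSema_V(QUEUE_SEMA);
  if (i < 0)
    return -1;
  cache_invalidateMem(&jobs[i], sizeof(jobs[i]));  /* safe: workers never write jobs */
  *out = jobs[i];
  return 0;
}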

Hint: how will cores learn the latest bound?

Hint: how will you manage concurrent access to shared data structures? Do you need locks for the job queue? For the bound? For the best path?
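
As a sketch of one possibility for the bound (bound_update and bound_read are hypothetical names, and semaphore 5 is assumed unused): update it under a lock so that a worse bound never overwrites a better one, but let cores read it without the lock, since a slightly stale bound is safe and merely causes some extra search:

#define BOUND_SEMA 5

int best_bound CACHELINE;  /* initialize to "infinity" before the search starts */

void bound_update(int cost) {
  icSema_P(BOUND_SEMA);
  cache_invalidateMem(&best_bound, sizeof(best_bound));
  if (cost < best_bound) {           /* only ever tighten the bound */
    best_bound = cost;
    cache_flushMem(&best_bound, sizeof(best_bound));
  }
  icSema_V(BOUND_SEMA);
}

int bound_read(void) {               /* lock-free read; staleness is safe */
  cache_invalidateMem(&best_bound, sizeof(best_bound));
  return best_bound;
}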

Hint: how will you coordinate starting and termination? Make sure that you terminate only after all tours whose cost might be less than the bound have been explored. You may find your shared-memory barrier useful.

Start with a simple solution (lock all shared structures appropriately), get it working correctly, understand its performance, and then improve it incrementally (e.g., by eliminating unnecessary locking).


Submit lab5_answers.txt along with your tsp_sm.c and your updated lib/barrier.c.

End of Lab #5!