6.173 Lab #6: reclaimable locks

Deadlines

Useful links

Goal

You will design and implement one new hardware feature: reclaimable locks, which allow you to avoid lock and memory traffic on the ring.

The original semaphore unit (Sem.v) works as follows:

Your task is to modify Sem.v to remember whether the current core was the last holder of each lock. If software on a core asks for a lock and that core was the last holder, Sem.v should grant the lock without sending a Preq slot around the ring. In this case the P should return 3 instead of 1. The overall point is to speed up repeated acquisition of the same lock, and to give software a hint that it doesn't need to invalidate the locked data.

To take advantage of the new Sem.v, we have modified icSema_P in shared/intercore.h: when icSema_tryP returns 1, it stops spinning and returns 0 (indicating success, but this core was not the last holder); when icSema_tryP returns 3, it stops spinning and returns 1 (indicating success, and this core was the last holder).


Task 1: Update your git repo

You must pull the Lab 6 code into your git repository. As with the previous labs, you should first commit your changes and then pull the new Lab 6 files:

you@eelab01$ git status
the current state of your git repo
you@eelab01$ git add files that you want to commit
you@eelab01$ git commit -a
you@eelab01$ git pull

You should now see a sw/lab6 directory with a beehive_main.c. The pull will also update the Makefile to build Lab 6. You will also see updates to simulate/beehive.v and shared/intercore.h.

Task 2: Understand non-reclaimable locks

Before contemplating your changes to Sem.v, you should first understand the current Verilog well. Make sure you can explain every line in Sem.v to yourself.

To solidify your understanding, it may help to run the simulator. We modified beehive.v in the hwv5/simulate directory to print out some handy information; you may want to print more. Before making any changes to Sem.v, you may want to look at what happens with lock 7 in the simulator; this is the lock stressed by test0 in lab6/beehive_main.c (Task 4 below describes it too).

Check that the print statements for the lockUnit in beehive.v are uncommented, and build lock.img by typing make in the sw directory. Edit simulate/Makefile to redirect vsim output to /dev/null, and remove the comment character from the front of the "run" and "exit" lines in simulate/simulate.do.

Now run the simulator (type make sim_3 IMG=../sw/o/lock.img in the hwv5 directory) and inspect simulate/transcript, which lists the printed events for core 2. To see just the lock events, you may want to run:

grep -E "(selL=1|Tin=9|Tout=a)" simulate/transcript 

For your unmodified Sem.v, you will see output like this:

# cycle=73601 pc=00003336 inst=e8400305 state=0 selL=1 P=1 L=07 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=7
# cycle=73602 pc=00003336 inst=e8400305 state=2 selL=1 P=1 L=07 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=1, SrcDestIn=0, data=00000000 Tout=7
# cycle=73603 pc=00003336 inst=e8400305 state=4 selL=1 P=1 L=07 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=1
# cycle=73604 pc=00003336 inst=e8400305 state=5 selL=1 P=1 L=07 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=9
# cycle=73605 pc=00003336 inst=e8400305 state=5 selL=1 P=1 L=07 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=7
# cycle=73606 pc=00003336 inst=e8400305 state=5 selL=1 P=1 L=07 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=7
# cycle=73607 pc=00003336 inst=e8400305 state=5 selL=1 P=1 L=07 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=7
# cycle=73608 pc=00003336 inst=e8400305 state=5 selL=1 P=1 L=07 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=7
# cycle=73609 pc=00003336 inst=e8400305 state=5 selL=1 P=1 L=07 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=1, SrcDestIn=0, data=00000001 Tout=7
# cycle=73610 pc=00003336 inst=e8400305 state=5 selL=1 P=1 L=07 cLocked=0 rqLock=00000001 cWrite=1 cLockD=1 Ring: Tin=9, SrcDestIn=2, data=00000007 Tout=1
# cycle=73631 pc=00003337 inst=00800605 state=0 selL=0 P=1 L=07 cLocked=1 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=73632 pc=00003337 inst=00800605 state=0 selL=0 P=1 L=07 cLocked=1 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=74693 pc=00003388 inst=08400845 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=74694 pc=00003389 inst=00002703 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=75758 pc=0000338d inst=f800ce4c state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=75759 pc=00003386 inst=07408a07 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=76792 pc=00003387 inst=e8400305 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=76793 pc=00003388 inst=08400845 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=77826 pc=0000338b inst=1f400445 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=77827 pc=0000338c inst=f8001a4b state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=78860 pc=00003387 inst=e8400305 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=78861 pc=00003387 inst=e8400305 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=79894 pc=00003389 inst=00002703 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=79895 pc=0000338a inst=f0c03f05 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=80928 pc=0000338d inst=f800ce4c state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=80929 pc=00003386 inst=07408a07 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=81962 pc=00003387 inst=e8400305 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=81963 pc=00003388 inst=08400845 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=82996 pc=0000338b inst=1f400445 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=82997 pc=0000338c inst=f8001a4b state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
# cycle=84030 pc=00003387 inst=e8400305 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=9, SrcDestIn=3, data=00000007 Tout=1
# cycle=84031 pc=00003387 inst=e8400305 state=0 selL=0 P=1 L=01 cLocked=0 rqLock=00000000 cWrite=0 cLockD=0 Ring: Tin=7, SrcDestIn=0, data=00000000 Tout=a
...

You probably want to understand this output before you proceed to implement your own changes. For example, what does Tout=a mean, and why does it show up? Write down your answers in lab6_answers.txt.

Task 3: Implement reclaimable locks

Once you understand the non-reclaimable locks well, decide what you want to change in Sem.v. Here are some issues to consider:

Hint: Each core will need another distributed memory (similar to lockHeldMem) to keep track of whether this core was the last holder of each lock. When do you set a lock's bit? When do you clear it?

Hint: You need to arbitrate between a Preq or Vreq from a core and one from the ring in the same cycle. For example, if a core that was the previous holder of a lock issues a Preq for that lock while a Preq for the same lock is also on the ring, make sure that only one of them succeeds.

Hint: Make sure not to write a distributed memory twice in the same clock cycle.

Hint: Run your design by the TAs before implementing and testing it.

Hint: When compiling your design, redirect the output of make into a file, then search it for Sem to see whether any messages indicate problems with your design.

Task 4: Testing reclaimable locks

Once you have implemented your changes to Sem.v, simulate a 3-core Beehive and try to pass test0 in lab6/beehive_main.c. One way an implementation can be wrong is by allowing several cores to enter a critical section at the same time. To check for this, test0 bangs on the same lock but delays core 2 in the critical section, verifying that no other core enters it too. Your implementation passes test0 if it prints "test0 succeeded" and the number nreclaim is larger than 0. This number counts how many times the test was able to reclaim the lock without having to send a Preq.

You probably want to set N in beehive_main.c to some small value (e.g., 3) and comment out the call to test1; otherwise the test won't finish anytime soon. You can check the console output (which appears on core 1) by grepping for "\["; the simulator prints each console-output character between square brackets. You may have to bump up the number of cycles to simulate in order to complete test0 (see simulate.do).

After you have passed test0 in the simulator, test on the boards. The output for test0 and test1 on the boards might look as follows on 3 cores:

3 cores, clock speed is 100 MHz
1: mqinit
test 0 succeeded: sum is: 200 nreclaim=99
test1 start
test1: 02: reached 200 3 times with 10 tries nreclaim 413

test1 increments a counter until it reaches some value; it then reads the other cores' counters until they have all reached that value. The incrementing and reading of the shared counters are wrapped in a critical section. If your implementation is correct, it will print "test 1 succeeded" along with nreclaim.

Once test0 and test1 work on a 3-core Beehive on the boards, try running them on a 13-core Beehive. It wouldn't be surprising if your implementation stops working, since 13 cores beating on the same set of locks stress it more than 3 cores do. For example, the Verilog that arbitrates between a Preq issued by a core and a Preq from the ring arriving in the same cycle is more likely to be exercised. If everything works, the output for test0 and test1 might look as follows:

13 cores, clock speed is 100 MHz
1: mqinit
test 0 succeeded: sum is: 1200 nreclaim=84
test1 start
test1: 02: reached 1200 3 times with 5 tries nreclaim 403

Task 5: Analyze reclaimable locks

Run your 13-core Beehive with and without reclaimable locks, and report in lab6_answers.txt how many fewer Preq messages your implementation with reclaimable locks sends for test1. Our implementation sends about a factor of 8 fewer Preq messages.

Task 6: Using reclaimable locks

With reclaimable locks one can avoid some cache invalidations and reduce the movement of cache lines between cores through memory. The code in the "ifdef SOL6" section takes advantage of the case in which icSema_P returns 1. Using the performance meters, explain in lab6_answers.txt how much memory traffic the optimization avoids. We observe, for example, a factor of 2 reduction in "Dread" address slots for core 2.

Submit Sem.v and lab6_answers.txt

End of Lab #6!