6.033 Discussion Suggestions

6.033--Computer System Engineering

Suggestions for classroom discussion

Topic: Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: a dynamic race detector for multithreaded programs. ACM Transaction on Computer Systems 15, 4 (November 1997) pages 391-411. Also in the Proc. Sixteenth ACM Symposium on Operating Systems Principles (October 1997).

by J. H. Saltzer, last updated February 1, 2003.

(N.B.: In February, 1998, Martin C. Renard prepared a set of notes about Eraser that approach the material from a quite different perspective.)

Where does the name come from?

(It is a pun on "Eliminate RACEs")

What is a "lock". Does it guarantee anything? I have a variable named X. How do I protect it?

(The first answer is usually lock(x). But that answer illustrates a misunderstanding caused by a misleading name. The ordinary English language meaning of "lock" is that is is a barrier that prevents access, and the obvious analogy to "lock(house)" is wrong.)

So how do you protect variable X?

(By lock(Y), where Y is a flag associated with X. The meaning of "lock" is "I just acquired a flag that is intended to protect object X--if all other programs that touch X are correct. The interesting observation is that it is not a barrier, it doesn't by itself prevent access like a door lock, and most important of all, we can't verify correctness by local analysis. We must inspect every other program that touches X.)

Inability to use local analysis looks like a problem. In terms of 6.033 concepts, what is the problem?

(It contradicts modularity. You have to look at all the other modules to figure out whether or not this module is programmed correctly. We will also see later that locks make it hard to compose large modules out of smaller ones; you need to know what locks the smaller ones are using. Locks and modularity are at odds with one another.)

The inability to be sure you have done the right thing by local inspection is one of the reasons we need Eraser.

On page 392 the authors say "Only the owner of a lock is allowed to release it." Is this true of the lock implemented in chapter 3 (page 3-62)?

(The implementation of chapter 3 certainly doesn't enforce any such restriction, though it would be easy to add it. Some locking systems enforce those semantics, others don't.)

Does Eraser actually depend on this rule?

(It does need to know which thread is setting a lock, in order to run more advanced versions of the lock-set algorithm. But we haven't gotten that far yet, so let's bookmark that question.)

In the example of figure 3 (page 397) lock mu1 protects V at some times, while lock mu2 protects V at other times. But Eraser flags this as a mistake, and the paper says this is a programming mistake that could result in a race. Why?

(The programmer of a different thread that wants to use V needs to know which lock to set. If the answer depends on where some other thread happens to be executing, this leaves the first programmer without any way to decide.)

So let's have the first programmer acquire (and later release) both locks. Doesn't it work then?

(Not if there could be two threads running the original code of figure 3.)

Can we fix that?

(Introduce a third lock at the beginning and end of the code for the first thread. The new lock assures that the first piece of code can be executed by only one thread at a time. Having the second thread set both locks then eliminates the race. But Eraser continues to raise an alert. We just found another category of false alarm that the authors didn't mention.)

So what is the big picture of Eraser?

(Before you run a program, modify the binary--instrument it so that Eraser can watch every data reference, then let it run and see if it violates its own locking protocol. The lockset algorithm is a protocol checker.)

Can that catch all violations?

(No. Any locking errors in parts of the code that the program doesn't happen to execute won't be tested. This is risky.)

What is the risk?

("We have been running this program for four years and it never failed before. Why is it suddenly crashing every five minutes?" Eraser can help find bugs, but it does not provide assurance. If a naive programmer thinks it does, he may have false sense of security. Analogy: running a spell-checker may mean that all the words in your paper are spelled right, but it doesn't mean that there are no typos; you may have typed "along" where you meant "among".)

Why the elaborate state diagram of figure 4 (page 398). Why not just use the first version of the lockset algorithm described on page 396?

(Because not every variable is both shared and modified, and it is only shared-modified variables that can be the source of races. So the state diagram shows a way discovering which variables are actually shared-modified.)

On page 398, it says that "A write access from a new thread changes the state from Exclusive or Shared to the Shared-Modified state..." But figure 4 says that a write by any thread in the Shared state takes it to the Shared-Modified state. This is a contradiction. Which is right?

(Oops, a bug in the description. The figure is right. Looking at the later description of the implementation, any write will take it to shared-modified. Once it is shared it is running the lockset algorithm without giving warnings, which means that the per-variable shadow area contains the lockset pointer, so it can no longer be keeping track of the thread number of the original writer. We can also reason from what it should do. If anyone is writing into a variable that at least one other thread has been reading from, we have a possibility of a race, so we had better we raising alerts if the locking protocol is violated. [a legalistic reading of the text can claim that it is technically accurate; it is true that a write access from a new thread in the Shared state does take it to the Shared-Modified state; they just didn't bother to mention that a write access from the old thread in the Shared state also takes the variable to the Shared-Modified state. Under that interpretation the sin is that the authors forgot to mention one important case.])

Show me an example in which we get a race if only one thread ever writes to the shared variable.

(

    thread 1                thread 2
                              x = 10 (initialization)


    acquire(xlock);
    if x > 5                  x = 2
       y = x*3;
    else
       y = 0;
    release(xlock);

)

Also on page 398, it says that in the exclusive state C(V) is not refined, but in the shared state, C(v) is updated. Is refined the same as updated?

(Yes. Although in the paragraph on page 398 it appears that perhaps there is some subtle difference, from the previous page it is pretty clear that the authors are using the words as synonyms. This is an interesting example of a difference between technical writing and literary writing. In English class you learned to avoid repeating the same word; use synonyms to make the sentence more interesting. In technical prose it is more important to be precise, so using the same word has a benefit. When you use a different word, the reader assumes that--or at least wonders if--you have some subtle distinction in mind. In a technical context, the literary approach can introduce unnecessary ambiguity and slow down the reader.)

How does "program annotation" really work? Does it present any risks?

(By adding calls in the program that say "Eraser, ignore the following code, I know it is OK". The risk is that the following code has a subtle race condition that eraser would have noticed but that the programmer missed.)

In section 3.4, it says that Eraser would have trouble with semaphores because they are not "owned". This takes us back to the earlier question: Does Eraser really depend on ownership?

(The state diagram of figure 4 has arcs labeled "first thread" and "new thread", so it certainly needs to know who is setting a lock. But presumably Erasericould find that out by looking in some currentÐthread system variable. And it is certainly true that a locking protocol in which one thread acquires a lock and another thread releases it is going to be hard to debug. But it doesn't seem that Eraser would give different answers if the discipline of only the owner can release a lock is abandoned.)

As an example of a benign race, the paper exhibits a piece of code from Alta Vista that begins:

   if(p->ip_fp == (N12_XFILE *) 0 ) {...

This is pretty baffling. What does the syntax mean?

(Very few students are likely to be this familiar with the C language. To the left of the == is a pointer-based reference to a variable. To the right is a zero preceded by a parenthetical expression, called a "cast", that coerces the zero to make it into the same type as the thing on the left. Since we haven't seen the declaration for the thing on the left or the definition of N12_XFILE, this is not especially obvious. The cast is to "pointer to an object of type N12_XFILE")

How about an example of a benign race that isn't buried in code written by Mike Burrows?

          if (A < 5)
             acquire (lock_a)
             if(A < 5)
               A = A + 1
             release (lock_a)

Why on earth would anyone write code like that?

(To optimize the common case. Suppose that this code is in the middle of loop that is executed a million times, and A will be less than five only once or twice. If so, in almost every execution the first statement will quickly rule out the possibility that A is less than five. If it finds that A appears to be less than five, then it is worth going to the trouble of acquiring the lock. Of course, A may have changed by now, so it needs to be tested again, under the protection of the lock. If it is still less than five, we increment it. If not we just release the lock and go about our business as if the test had failed when we first looked. Since the outcome of the first test is used only as a hint that is later reverified, it doesn't matter that there is a race.)

In section 3.4 it says that dynamic code generation would interfere with Eraser. But Eraser dynamically follows the executing program. Why is dynamically generated code a problem?

(Eraser's dynamic tracking depends on statically inserting into the program a set of instrumentation probes--actually subroutine calls. Code generated by the program on the fly won't have these probes installed, so it may set locks or touch shared variables that Eraser won't find out about.)

Comments and suggestions: Saltzer@mit.edu