6.033--Computer System Engineering

Suggestions for classroom discussion


Topic: Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems 2, 1 (February, 1984) pages 39-59.

by J. H. Saltzer, February 25, 1996; updated 2/23/98


Environment?: Xerox PARC. But these guys work at DEC SRC now. The team moved.

Citations?: A lot of familiar names, from other papers we are reading.

Date?: 1984. What is this very powerful machine, the "Dorado" (p. 41)?

Sounds like specs for a 1994 laptop. But that is a little unfair--the machine had a wide bus (16 bytes?) and lots of lookahead and overlapping. So it is a 1995 laptop with a 1997 processor chip.

N.B. What does "Process" mean in this paper? (= 6.033 "Thread")

1. (suggested by John Chapin) Section 1.3 (page 41) starts by saying, "The primary purpose of our RPC project was to make distributed computation easy." If we take "distributed computation" to mean using the client/server model, can any RPC package, even if ideally efficient and easy to use, "make distributed computation easy"?

(The problem is that distributed computation is intrinsically not "easy". There is a conflict between one of the fundamental goals of the client/server model--independent failure--and the fundamental goal of RPC--the programmer is intended to see the semantics of an ordinary subroutine call. But for all practical purposes a local subroutine call has only two possible outcomes: success or failure. (If the thing you call never returns, it doesn't matter, because you are not in a position to do anything about it.) Remote service invocation always has three possible outcomes: success, failure, and don't know. The programmer must somehow provide for the third outcome. RPC tries to hide the third outcome, but the best it can do is persistently repeat the invocation, which may not always be the right thing.

In addition, one can argue that it is common to write a local program such that failure of a procedure results in failure of the entire program. In contrast, users of a distributed system usually insist that failure of a single service not take out the entire system. So in addition to coping with the don't know case, the programmer often must be much more careful about handling the failure case.)
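The three-outcome point can be made concrete with a small sketch (hypothetical code, not from the paper): a client stub that retransmits a few times and then is forced to report "don't know" rather than pretend the call simply failed. The transport function and exception names here are invented for illustration.

```python
class RemoteFailure(Exception):
    """Outcome 2: the server ran the procedure and reported an error."""

class DontKnow(Exception):
    """Outcome 3: no response; the request may or may not have run."""

def remote_call(transport, request, retries=3):
    """transport(request) returns reply bytes, or None to model a lost packet."""
    for _ in range(retries):
        reply = transport(request)
        if reply is None:
            continue                      # no answer: retransmit
        if reply.startswith(b"OK "):
            return reply[3:]              # outcome 1: success
        raise RemoteFailure(reply)        # outcome 2: failure
    # A local call could never end up here; a remote one always can.
    raise DontKnow("no response; request may or may not have executed")
```

A local procedure call has no analogue of the final `raise`; that is exactly the semantics RPC cannot fully hide.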

2. Section 1.4, page 42: "We discarded the possibility of emulating some form of shared address space..." What would be the consequence of doing this job that way? Apollo/Domain did. (Would end up doing paging over the net to support virtual memory.)

But that decision is orthogonal to the decision to use the RPC model. The following chart demonstrates that one can actually implement four possibilities:

                          message          shared
                          passing          memory
----------------------------------------------------------
application-              X window         Apollo
designed protocol         System           Domain
----------------------------------------------------------
RPC                       Birrell           LRPC*
----------------------------------------------------------

* We'll read the LRPC paper near the end of the term.

3. Are the performance numbers helpful? (No. They measured instruction counts and cycles, which might still be relevant today. But they divided by the clock frequency and reported microseconds, and didn't tell us the clock frequency. So we can't make much use of the numbers.)

4. List every speedup trick you can find in their RPC design.

5. Packets can be lost or duplicated. What are the differences between "precisely once," "at most once," and "at least once" semantics? Which does this implementation provide? (precisely once) Why? (They want to imitate normal procedure call as closely as possible) How? (duplicate suppression, explored later.) To what extent do they fail to imitate normal procedure calls? (There can be a third outcome: no response, we don't know whether or not your request was processed.)

6. Section 1.4 says that there is no timeout mechanism, yet the probe packet seems to be sent after a timeout. Explain this contradiction. (There is NO timeout on the client program, which can go on forever. There IS a timeout on the network part of the transaction.)

7. What is the consequence of the decision to not have timeouts? Suppose the server goes into a loop? (Just like a procedure call--the caller loses.) What happens if the server machine crashes? (Probe packet notices no ack, caller is alerted.)
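The probe discipline in questions 6 and 7 can be sketched as follows (hypothetical code; `poll_result` and `probe` are invented stand-ins for the real packet exchange). The client waits forever for the result, but each probe must be acknowledged: a looping server keeps acknowledging, so the caller waits just as it would for a looping local procedure, while a crashed machine stops acknowledging and the caller is alerted.

```python
class ServerCrashed(Exception):
    """Raised only when the server machine stops acknowledging probes."""

def await_result(poll_result, probe, max_missed_acks=3):
    """poll_result() -> result or None; probe() -> True iff probe was acked.

    No timeout on the call itself: as long as probes are acked, wait forever.
    """
    missed = 0
    while True:
        result = poll_result()
        if result is not None:
            return result
        if probe():
            missed = 0                    # server alive; keep waiting
        else:
            missed += 1
            if missed >= max_missed_acks:
                raise ServerCrashed("probes unacknowledged")
```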

8. This is an excellent opportunity to reinforce the use of packet timing diagrams. The issue to explore is "what can go wrong?"

Use a simple case (single packet request, single packet response) and start with a normal exchange.

Why is there a callid in a packet? How is it constructed? (calling machine ID + sequence number + process/thread ID) Why? (to be globally unique forever.) Is that overkill? (backing off saves very little and makes it harder to decide whether or not there is a problem.)

N.B. sequence number must be monotonic. Why is that hard? (Because latest value may be lost in a crash. Initialize it from the time-of-century clock.)
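A sketch of callid construction (hypothetical code, not the Mesa implementation): machine ID plus a per-machine sequence number plus thread ID, with the sequence number seeded from the clock so that it stays monotonic across a crash that loses the latest value. The rate assumption in the comment is the usual caveat.

```python
import itertools
import threading
import time

# Seeded from the clock at boot so the sequence stays monotonic across a
# crash -- valid as long as calls are issued slower than the clock ticks.
_seq = itertools.count(int(time.time()))

def new_callid(machine_id):
    """Globally unique forever: (machine, monotonic sequence, thread)."""
    return (machine_id, next(_seq), threading.get_ident())
```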

case 2: server is slow; timeout causes retransmission of original request, which arrives AFTER the response was sent. (lookup answer for this callid and resend it; client notes second response contains an old callid, discards.)

case 3: response is delayed in the network. (Same scenario as case 2.)

case 4: response is lost by network. (Server sees same situation as case 2, but client receives only one response, which it accepts.)

case 5: delayed request; client resends and second copy of request arrives while server is still computing. (Server discards duplicate.)

case 6: Badly delayed request, arriving after second copy of request is handled by server and response sent back. (Both client and server see same thing as in case 2, but the drawing is different. Server doesn't distinguish the first from the second copy of the request; client doesn't distinguish response to first from response to second request.)

case 7: Lost request. (client resends, server responds.)
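The server side of these cases reduces to one table keyed by callid. Here is a hypothetical sketch (invented names, not the paper's code): remember "in progress" while computing, cache the response afterward, and never execute the same callid twice. That single rule handles the retransmissions in cases 2, 5, and 6.

```python
IN_PROGRESS = object()   # sentinel: request received, still computing

def make_server(handler):
    table = {}           # callid -> IN_PROGRESS or the cached response
    def handle(callid, request):
        if callid in table:
            cached = table[callid]
            if cached is IN_PROGRESS:
                return None          # case 5: duplicate while computing; discard
            return cached            # cases 2, 6: resend the saved response
        table[callid] = IN_PROGRESS
        response = handler(request)  # executed exactly once per callid
        table[callid] = response
        return response
    return handle
```

A real implementation must also decide when entries may be discarded; that bookkeeping is omitted here.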

This paper is also the bridge to an upcoming topic, on naming. What are the consequences for the application of choosing each of the three possibilities for binding time mentioned near the end of section 2.3?


Comments and suggestions: Saltzer@mit.edu