
6.033--Computer System Engineering

Suggestions for classroom discussion of:

Andrew D. Birrell, Andy Hisgen, Chuck Jerian, Timothy Mann, and Garrett Swart. The Echo distributed file system. Digital Systems Research Center Technical Report 111 (September, 1993).

preliminary notes, by J. H. Saltzer, version of May 7, 2001, 11:45 p.m.


Random observations, in random order, based on a quick skimming. This is not a coherent set of suggestions!

  • Why didn't they use assignment epochs for the disks? (They did. But the hardware didn't provide it, so they simulated it in the disk controller software; they call it a handle.)
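
    The general idea (a minimal sketch in Python; the class and method names are illustrative assumptions, not Echo's or the controller's actual interface) is that every disk request carries the handle under which the requester believes it owns the disk, and the controller rejects requests bearing a stale handle:

        # Sketch of a handle acting as a simulated assignment epoch.
        # All names here are illustrative, not taken from the paper.

        class StaleHandleError(Exception):
            pass

        class DiskController:
            def __init__(self):
                self.current_handle = 0      # the simulated assignment epoch for this disk

            def reassign(self):
                """Hand the disk to a new owner; every older handle becomes stale."""
                self.current_handle += 1
                return self.current_handle

            def request(self, handle, operation):
                if handle != self.current_handle:
                    raise StaleHandleError("disk was reassigned; obtain the new handle")
                return operation()           # carry out the read or write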

  • They talk about epochs in Echo, but in connection with file servers managing boxes. Is that the same concept? (Yes, sort of.)

  • Why did it seem so much more complicated? (Two reasons: First, a disk port is stateless, so it can be switched from one controller to another with minimum fuss. But the disk itself contains a huge amount of state, and switching its management from one controller to another requires being very careful. The things that Echo calls epoch numbers are assigned not to disks, but to boxes; the epoch number is associated with the contents (state) of the box. Second, the thing issuing epochs isn't a single piece of hardware; it is a set of replicated file servers, so the replicas first have to agree on what epoch number they are issuing.)

  • On page 18 there is a ten-bullet algorithm that is executed by several replicas simultaneously, to come to agreement on the next epoch counter. Why is it so hard to figure out whether this algorithm does what is claimed? (There are too many possible interleavings for the reader to check them all. Also, the algorithm is presented without any insight into why it works or what each step is protecting against. It even uses the letter B to identify two unrelated things, a box name and the maximum of one of the counters.)

  • Page 18, footnote 5: What is an election algorithm? Would it simplify the ten-bullet algorithm? (Having one replica do the ten-bullet algorithm by itself should reduce the number of retries, but since that replica may crash before it is done, you still need the whole ten-bullet algorithm.)

  • Page 18: After bullet 5, can there still be a larger S out there somewhere in an inaccessible replica? (No. The reason is that there is a precondition for running this algorithm: In the previous epoch, more than half of the replicas agreed on the epoch number, and it is bigger than the epoch number of any replica that wasn't participating in that epoch. So if we now survey more than half of the replicas, we are assured of having at least one that participated in that previous epoch. The last bullet sets up the precondition for the next running of the algorithm.)
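
    A minimal sketch of that quorum argument in Python (S follows the notes above; the is_up attribute and the rest of the replica interface are illustrative assumptions): because the previous epoch was recorded at more than half of the replicas, any survey that reaches more than half of them must include at least one of those, so the largest S seen in the survey is the true previous epoch.

        # Sketch of the quorum-intersection reasoning, not Echo's code.
        def survey_max_epoch(replicas):
            reachable = [r for r in replicas if r.is_up()]
            if 2 * len(reachable) <= len(replicas):
                raise RuntimeError("no majority reachable; cannot choose a new epoch")
            return max(r.S for r in reachable)   # some reachable replica holds the true maximum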

  • But what if we crash after setting S in a few replicas, and never get half of them set? Then the precondition isn't met. (Yes, but variable P fixes that. Think of P and S as two copies of the service counter, and we are doing a two-copy recoverable write. We write P to all replicas before starting to write S to any replica; if we crash in the middle of writing the P's, the S's are intact; if we crash in the middle of writing the S's, the P's are intact. [this may be an oversimplification.])
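
    One way to picture that two-copy trick is the sketch below (Python, with illustrative names apart from P and S, and oversimplified in the same way as the bullet): write the new value into every P and force it to stable storage before starting to overwrite any S.

        # Oversimplified two-copy recoverable write of the epoch counter.
        def set_new_epoch(replicas, new_epoch):
            for r in replicas:
                r.P = new_epoch          # phase 1: the provisional copy, everywhere first
                r.flush()                # force to stable storage before phase 2 begins
            for r in replicas:
                r.S = new_epoch          # phase 2: only now overwrite the service copy
                r.flush()

        # A crash part-way through phase 1 leaves every S intact; a crash
        # part-way through phase 2 leaves every P intact, so recovery always
        # finds at least one consistent copy of the counter.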

  • Page 7: Algorithms work in the face of "Non-transitive network service interruptions". That is an interesting additional constraint that we didn't consider in chapter 4. How could such interruptions arise? (On a poorly-terminated Ethernet with lots of echoes, A can hear B and B can hear C, but A can't hear C.)

  • Page 16: Cold start. Why is it so hard to break the circularity? They say that the file system requires name service, but name servers depend on the file system to hold their data bases. Doesn't placing a name server on a dedicated computer fix this? (Only if the dedicated computer has a stand-alone file system. Even using Echo, it ought to be possible to configure an unreplicated box that contains just a single root-level volume containing name server data. That server can boot, then the name server can run and provide help to the rest of the community.)

  • Yes, but it doesn't have any backup. (So build three of them. The tricky part is that you also have to explicitly copy data from one to another to maintain the replicas, rather than expecting Echo to do it automatically. They propose breaking the circle on the file system side; "Higher-level management programs would keep the different root file systems up to date." Either way would work; the key is to identify the base cases and special-case them. The generators used by large power companies have bearings that must be lubricated by high-pressure pumps. The pumps, of course, are driven by electric motors. They have the same problem performing a cold start.)

  • Page 22: What are tokens? (Somewhat as in Coda, they are notes kept by the server recording that some client has a file in its cache, so the server knows whom to call back if necessary.)

  • Do they use call-backs when someone writes a file back? (No. The tokens act as locks; this is pessimistic control. Only one client can have a write token; any number can have read tokens if there is no write token out.)

  • But what if the client crashes? (Every token has a lease, as described on page 24. If the client doesn't renew the lease, the token expires.)
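
    Putting the last three bullets together, a minimal sketch of tokens as leased locks (Python; the class, its fields, and the exact lease length are illustrative assumptions, not Echo's interface):

        import time

        LEASE_SECONDS = 60                       # leases on the order of a minute

        class FileTokens:
            """One outstanding write token, or any number of read tokens."""
            def __init__(self):
                self.read_holders = {}           # client -> lease expiry time
                self.write_holder = None         # (client, expiry) or None

            def _expire(self, now):
                self.read_holders = {c: t for c, t in self.read_holders.items() if t > now}
                if self.write_holder and self.write_holder[1] <= now:
                    self.write_holder = None     # an unrenewed write token simply lapses

            def grant_read(self, client, now=None):
                now = time.time() if now is None else now
                self._expire(now)
                if self.write_holder and self.write_holder[0] != client:
                    return False                 # pessimistic: a writer excludes other readers
                self.read_holders[client] = now + LEASE_SECONDS
                return True

            def grant_write(self, client, now=None):
                now = time.time() if now is None else now
                self._expire(now)
                other_readers = set(self.read_holders) - {client}
                if other_readers or (self.write_holder and self.write_holder[0] != client):
                    return False                 # would first need call-backs to revoke tokens
                self.write_holder = (client, now + LEASE_SECONDS)
                return True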

  • On page 25 it points out that having a write token expire is a disaster from the point of view of the client, yet they didn't bother to write any code to recover from this situation. Why not? (Because, as suggested in the Coda paper, write-write conflicts are rare. Sufficiently rare that it wasn't worth doing anything more than telling the application "you lose.")

  • But Coda went to a lot of trouble to make sure that conflicts were detected properly. Why isn't that needed in Echo? (Because with Coda, you might be trying to reconcile after a week's absence. In Echo, the token lease is one minute. If write-write conflicts are not common over a one-week window, they are probably almost non-existent over a one-minute window.)
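
    (Rough arithmetic: a week is 7 × 24 × 60 = 10,080 minutes, so shrinking the window in which two writers can collide from a week to a minute cuts the opportunity for conflict by roughly four orders of magnitude.)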

  • Page 12: The discussion of tolerating loads following fail-over seems to be an opportunity to explore how the replication actually works. Suppose I have everything in one large box, with two replicas on different disks, one disk on each of two servers. One server is primary, so all traffic goes through it. Writes go from the primary server to the backup server; reads are handled by the primary. If the primary fails, the entire load switches to the backup server.

    Now, split the data into two half-size boxes, each with two replicas on separate disks, one disk on each server. Box one has server A as primary, Box two has server B as primary. Writes still go to both servers, but since reads can be handled by the primary, two reads can go on in parallel. If server A fails, server B now gets to handle all of the read traffic. This seems to be the case that could cause overload (a rough load model sketched below makes the arithmetic concrete), yet they say that they had all the volumes in one box. ??
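
    A back-of-the-envelope load model (a Python sketch; the request rates, the rule that reads go only to a box's primary, and the assumption that every server sees every write are illustrative assumptions, not measurements from the paper):

        R, W = 1000, 100                         # reads/sec and writes/sec, made up

        def per_server_load(boxes_primary_here, total_boxes, failed_over=False):
            reads = R * (total_boxes if failed_over else boxes_primary_here) / total_boxes
            return reads + W                     # every server sees every write

        # One big box: A is primary for everything, B only absorbs writes.
        print("1 box, normal:   A =", per_server_load(1, 1), " B =", W)
        print("1 box, A down:   B =", per_server_load(1, 1, failed_over=True))

        # Two half-size boxes: A primary for box 1, B primary for box 2.
        print("2 boxes, normal: A = B =", per_server_load(1, 2))
        print("2 boxes, A down: B =", per_server_load(2, 2, failed_over=True))

    In the one-box case the survivor takes over a read load that a single server was already carrying; in the two-box case the survivor must absorb roughly double its normal read load, so if each server is sized close to its normal-case load, the two-box configuration is the one that overloads on fail-over.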


    Comments and suggestions: Saltzer@mit.edu