6.033--Computer System Engineering
Suggestions for classroom discussion of:
Andrew D. Birrell, Andy Hisgen, Chuck Jerian, Timothy Mann, and Garrett Swart. The Echo distributed file system. Digital Systems Research Center Technical Report 111 (September, 1993).
preliminary notes, by J. H. Saltzer, version of May 7, 2001, 11:45 p.m..
Random observations, in random order, based on a quick skimming. This is not a coherent set of suggestions!
- Building on the chapter 7 model: If one has genuinely fail-fast replicas, the output of any one replica can be relied upon, and the voter can become a chooser. Much of the discussion of the Echo paper consists of algorithms for choosing (mostly in software, but occasionally with a little help from hardware) which replica of something to use. The replica that is chosen is called the primary; the others are backups.
- Using a chooser rather than a voter is significantly more risky, because we are now depending on the fail-fast detection mechanism to detect every possible failure, rather than actually comparing the N copies to see if they are getting the same answer. (Indeed, Echo doesn't read and compare the replicas; it assumes that if the disk says the first read is good, the first read is good.)
- fail-over is the process of automatically changing which replica is primary, in response to a failure of the current primary.
- A supermodule constructed with a voter generally continues working without pause when a replica fails. If it is constructed with a chooser and a fail-over process, there is generally a short period of non-availability during which the chooser chooses another primary, and that primary is brought up to speed.
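To make the voter/chooser distinction concrete, here is a minimal Python sketch (not code from the paper; the replica interface and the is_healthy test are invented for illustration) of the two kinds of supermodule front end:

    # Hypothetical sketch: a voter compares the answers from all replicas,
    # while a chooser trusts whichever fail-fast replica claims to be healthy
    # and makes it the primary.
    from collections import Counter

    def voter(replies):
        """Return the majority answer among the replies of N replicas."""
        answer, votes = Counter(replies).most_common(1)[0]
        if votes <= len(replies) // 2:
            raise RuntimeError("no majority -- supermodule fails")
        return answer

    def chooser(replicas):
        """Pick the first replica that reports itself healthy as the primary."""
        for r in replicas:
            if r.is_healthy():       # relies entirely on fail-fast detection
                return r             # this replica becomes the primary
        raise RuntimeError("no healthy replica -- fail-over impossible")

The chooser never looks at the data itself, which is exactly why the fail-fast detection has to be trusted.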
- When replicas store information (state), there are two kinds of designs:
- The important thing to replicate is the device; its state can be reconstructed or copied quickly from another replica.
- The important thing to replicate is the state of the device. This is harder; it is tricky to get the replicas to all have the same state.
- In Echo there can be replicas of
- disks
- disk ports
- file servers (a file server is a bunch of code and its temporary state)
- volumes (meaning the stored files in a contiguous chunk of the naming hierarchy)
- boxes
- Other things, such as the local area network, may also be replicated, but that is incidental and not managed by Echo.
- How is load control related to reliability? (1. The Echo design concerns availability, which includes response time. Overload affects response time. 2. Leases and timeouts used to make things fail-fast may expire because of overload rather than because of failure, leading to unnecessary execution of recovery procedures. 3. "just say no" load control can itself cause deadlocks, which in turn may defeat availability guarantees.)
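A tiny sketch of point 2, with an invented lease length and interface: the lease issuer cannot tell a crashed holder from one that is merely too overloaded to renew in time, so overload alone can trigger recovery.

    # Hypothetical sketch: lease expiry as a fail-fast signal.
    import time

    LEASE_SECONDS = 60.0                 # invented lease length

    class LeaseIssuer:
        def __init__(self):
            self.expires = time.monotonic() + LEASE_SECONDS

        def renew(self):                 # called by the lease holder
            self.expires = time.monotonic() + LEASE_SECONDS

        def holder_presumed_dead(self):
            # Also true when the holder is alive but too overloaded to
            # renew on time, which sets off needless fail-over and recovery.
            return time.monotonic() > self.expires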
- What do "boxes" accomplish? (They help separate the user view of the system from the system manager's view. What the user sees is that certain regions of the naming hierarchy--volumes such as /src and /valuables--are stored in a high-reliability box, while other regions--such as /bin and /tmp--are stored in an ordinary-reliability box. The system manager's view is that disks, ports, servers, and file systems need to be configured to provide high reliability to the first box and ordinary reliability to the second.
- Suppose the manager has only one server and only one disk. He chooses to make three replicas of the high-reliability box, and one replica of the ordinary-reliability box. All the replicas are on one disk, with one port, one server, and one file system. Does this configuration have any value? (Yes. It helps protect the contents of the high-reliability box from single-sector disk decay.)
- Does Echo provide support for this configuration? (Yes and no. Yes, in that it will let you create three replicas and put them all on the same disk. No, in the sense that if a sector decays silently there isn't any systematic procedure to discover it and make a third copy somewhere else. Echo provides procedures that make a whole disk fail-fast, but you don't discover that an individual sector has decayed till you try to read it, and by then it may be too late. And an application program can't do a periodic test by reading all three copies, because it can't control which copy it will read from.)
- What are the ways that a fail-fast module might try to convince a voter or chooser that it is healthy (or report that it is sick)?
- power-on signal (simple, but detects only power failures)
- internal checking (can be simple or elaborate)
- pair-and-compare (voter reports up/down)
- periodic test (for components that aren't routinely exercised)
- timeout (timer expiration means it died)
- leases (non-renewal means it died)
- heartbeat (as long as it keeps ticking it is alive)
- keep-alive (if it stops answering challenges, it died)
Echo uses several of these.
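A hypothetical sketch of how a chooser might fold several of these signals into a single health test for one replica; the field names and thresholds are invented, not Echo's:

    import time

    HEARTBEAT_INTERVAL = 5.0             # invented seconds between heartbeats

    class ReplicaHealth:
        def __init__(self):
            self.power_on = True         # power-on signal
            self.self_test_ok = True     # result of internal checking
            self.lease_expires = 0.0     # lease granted to the replica
            self.last_heartbeat = 0.0    # time the last heartbeat arrived

        def presumed_healthy(self):
            now = time.monotonic()
            return (self.power_on
                    and self.self_test_ok
                    and now < self.lease_expires
                    and now - self.last_heartbeat < 2 * HEARTBEAT_INTERVAL)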
- Page 14. Why is disk server said to be more reliable than pass-through? (This seems to be just the dedicated server argument. If the only thing the server has to do is to translate disk protocol requests to disk service requests--probably a one-to-one mapping--then there is only a small risk of it getting confused. If the server is also handling file service requests for other clients, then the server may get bollixed up doing that, lose track of its disk protocol activities, or even crash, which reduces the apparent reliability of its disk service. The same argument applies to putting a name server on the same host as a giant file server. Following a power failure, you would like the name server to come up immediately, but the giant file server probably has to do a salvage/recovery pass over all the disk metadata before it can fire up its file system, delaying the turn-on of the name server.)
- How might you coordinate use of a dual-ported disk? What is the problem? (The Echo paper discusses this on page 15. The problem is that the two ports may be connected to different computers, and both computers may try to use the disk at the same time.)
- Arbiter at the disk. This prevents two requests from interfering electrically, but a seek from port 1 followed by a seek from port 2 and then a write from port 1 could be disastrous. (Assuming seeks are addressed to a track and writes are addressed to a sector within the current track.)
- Two-state lock (states: port A, port B) maintained by the disk, settable by either disk controller. Keeps each controller out of the other's way, but needs a plan for failure of the controller that acquired the lock.
- Two-state lock that can't be reset till a timer expires. Somewhat covers the above problem. Downsides: the timeout can delay fail-over if the primary controller dies; and if the primary controller merely gets overloaded rather than failing, the second controller may take over, and if the second controller then gets overloaded, the same thing can happen again.
- Three-state lock (states: port A, port B, unassigned) with timeout that resets to "unassigned" state. Disk rejects operating (seek/read/write) requests arriving while in unassigned state, accepts only assign requests. (This seems to be the scheme that the Echo disk hardware supports.)
- Disk maintains both a three-state lock and an assignment epoch, and all operating requests must come from the assigned controller and mention the matching epoch, or they will be rejected. At power up, disk starts in unassigned state and the first controller to request gets epoch #1. (This is known as the "resurrecting duckling" model.) Need a plan for demanding a new epoch. Optional feature: may or may not reject requests from the other controller that mention the right epoch. Rejection is safer; acceptance allows dynamic rerouting of requests if the controller software is sufficiently heads-up. Note that as the restrictions tighten up, it is easier to envision failure scenarios in which the disk refuses to accept orders from the only still-working controller.
- Assignment epoch plus lease, issued by the disk to the controller that requests the epoch, extended with each disk request or a separate keep-alive. If a lease expires, the disk goes back to unassigned state and the next controller to request assignment gets the next epoch number.
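A sketch, in Python rather than hardware, of the last two schemes combined (three-state lock, assignment epoch, and lease). Every name and the lease length are invented for illustration; this is not the actual Echo disk or controller interface.

    import time

    LEASE_SECONDS = 10.0                        # invented lease length

    class DualPortedDisk:
        def __init__(self):
            self.owner = None                   # None means "unassigned"
            self.epoch = 0
            self.lease_expires = 0.0

        def _lease_expired(self):
            return time.monotonic() > self.lease_expires

        def assign(self, port):
            """A controller on `port` asks to become the assigned owner."""
            if self.owner is None or self._lease_expired():
                self.owner = port
                self.epoch += 1                 # new owner gets a fresh epoch
                self.lease_expires = time.monotonic() + LEASE_SECONDS
                return self.epoch
            raise PermissionError("disk is assigned to port %s" % self.owner)

        def operate(self, port, epoch, request):
            """Seek/read/write: accepted only from the owner with the current epoch."""
            if self.owner != port or epoch != self.epoch or self._lease_expired():
                raise PermissionError("request rejected; ask for assignment first")
            self.lease_expires = time.monotonic() + LEASE_SECONDS   # renew on use
            return "performed: %s" % request

If the owning controller stops issuing requests, its lease runs out and the other controller's assign request succeeds with the next epoch number, so stale requests tagged with the old epoch are rejected.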
Why didn't they use assignment epochs for the disks? (They did. But the hardware didn't provide it, so they simulated it in the disk controller software; they call it a handle.)
They talk about epochs in Echo, but in connection with file servers managing boxes. Is that the same concept? (Yes, sort of.)
Why did it seem so much more complicated? (Two reasons: First, a disk port is stateless, so it can be switched from one controller to another with minimum fuss. But the disk itself contains a huge amount of state, and switching its management from one controller to another requires being very careful. The things that Echo calls epoch numbers are assigned not to disks, but to boxes; the epoch number is associated with the contents (state) of the box. Second, the thing issuing epochs isn't a single piece of hardware, it is a set of replicated file servers, so the replicas first have to agree on what epoch number they are issuing.)
On page 18 there is a ten-bullet algorithm that is executed by several replicas simultaneously, to come to agreement on the next epoch counter. Why is it so hard to figure out whether this algorithm does what is claimed? (There are too many possible interleavings for the reader to absorb all possible combinations. Also, the algorithm is presented without any insight into why it works or what each step is protecting against. It even uses the letter B to identify two unrelated things, a box name and the maximum of one of the counters.)
Page 18, footnote 5: What is an election algorithm? Would it simplify the ten-bullet algorithm? (Having one replica do the ten-bullet algorithm by itself should reduce the number of retries, but since that replica may crash before it is done, you still need the whole ten-bullet algorithm.)
Page 18. After bullet 5, can there still be a larger S out there somewhere in an inaccessible replica? (No. The reason is that there is a precondition for running this algorithm: In the previous epoch, more than half of the replicas agreed on the epoch number, and it is bigger than the epoch number of any replica that wasn't participating in that epoch. So if we now survey more than half of the replicas, we are assured of having at least one that participated in that previous epoch. The last bullet sets up the precondition for the next running of the algorithm.)
But what if we crash after setting S in a few replicas, and never get half of them set? Then the precondition isn't met. (Yes, but variable P fixes that. Think of P and S as two copies of the service counter, and we are doing a two-copy recoverable write. We write P to all replicas before starting to write S to any replica; if we crash in the middle of writing the P's, the S's are intact; if we crash in the middle of writing the S's, the P's are intact. [this may be an oversimplification.])
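A sketch of the P-and-S idea as a two-copy recoverable write, heavily simplified relative to the real ten-bullet algorithm (it ignores crashes of individual replicas, the more-than-half survey, and concurrent runners):

    # Hypothetical simplification: each replica stores two counters,
    # P (pending) and S (service). Write P everywhere before S anywhere.
    class Replica:
        def __init__(self):
            self.P = 0
            self.S = 0

    def advance_epoch(replicas, new_epoch):
        # If we crash while writing the P's, every S still holds the old epoch.
        # If we crash while writing the S's, every P already holds the new one,
        # so a later run can finish the job by consulting the P's.
        for r in replicas:
            r.P = new_epoch          # phase 1: record the intent
        for r in replicas:
            r.S = new_epoch          # phase 2: make it the service epoch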
Page 7: Algorithms work in the face of "Non-transitive network service interruptions". That is an interesting additional constraint that we didn't consider in chapter 4. How could such interruptions arise? (On a poorly-terminated Ethernet with lots of echoes, A can hear B and B can hear C, but A can't hear C.)
Page 16: Cold start. Why is it so hard to break the circularity? They say that the file system requires name service, but name servers depend on the file system to hold their data bases. Doesn't placing a name server on a dedicated computer fix this? (Only if the dedicated computer has a stand-alone file system. Even using Echo, it ought to be possible to configure an unreplicated box that contains just a single root-level volume containing name server data. That server can boot, then the name server can run and provide help to the rest of the community.)
Yes, but it doesn't have any backup. (So build three of them. The tricky part is that you also have to explicitly copy data from one to another to maintain the replicas, rather than expecting Echo to do it automatically. They propose breaking the circle on the file system side; "Higher-level management programs would keep the different root file systems up to date." Either way would work; the key is to identify the base cases and special-case them. The generators used by large power companies have bearings that must be lubricated by high-pressure pumps. The pumps, of course, are driven by electric motors. They have the same problem performing a cold start.)
Page 22: What are tokens? (Somewhat like Coda, they are notes kept in the server that some client has a file in its cache, so the server knows whom to call back if necessary.)
Do they use call-backs when someone writes a file back? (No. The tokens act as locks; this is pessimistic control. Only one client can have a write token; any number can have read tokens if there is no write token out.)
But what if the client crashes? (Every token has a lease, as described on page 24. If the client doesn't renew the lease, the token expires.)
On page 25 it points out that having a write token expire is a disaster from the point of view of the client, yet they didn't bother to write any code to recover from this situation. Why not? (Because, as suggested in the Coda paper, write-write conflicts are rare. Sufficiently rare that it wasn't worth doing anything more than telling the application "you lose.")
But Coda went to a lot of trouble to make sure that conflicts were detected properly. Why isn't that needed in Echo? (Because with Coda, you might be trying to reconcile after a week's absence. In Echo, the token lease is one minute. If write-write conflicts are not common over a one-week window, they are probably almost non-existent over a one-minute window.)
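A minimal sketch of a server-side token table with leases, behaving like a readers/writer lock as described above. The interface is invented; only the roughly one-minute lease comes from the discussion.

    import time

    TOKEN_LEASE_SECONDS = 60.0      # the token lease is about a minute

    class TokenTable:
        def __init__(self):
            self.readers = {}       # client -> lease expiry time
            self.writer = None      # (client, lease expiry time) or None

        def _drop_expired(self):
            now = time.monotonic()
            self.readers = {c: t for c, t in self.readers.items() if t > now}
            if self.writer is not None and self.writer[1] <= now:
                self.writer = None

        def grant_read(self, client):
            self._drop_expired()
            if self.writer is not None:
                return False        # a write token is out
            self.readers[client] = time.monotonic() + TOKEN_LEASE_SECONDS
            return True

        def grant_write(self, client):
            self._drop_expired()
            if self.writer is not None or self.readers:
                return False        # would conflict with outstanding tokens
            self.writer = (client, time.monotonic() + TOKEN_LEASE_SECONDS)
            return True

Pessimistic control shows up in the two False returns: the server simply refuses to hand out a conflicting token rather than detecting a conflict afterward.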
Page 12: The discussion of tolerating loads following fail-over seems to be an opportunity to explore how the replication actually works. Suppose I have everything in one large box, with two replicas on different disks, one disk on each of two servers. One server is primary, so all traffic goes through it. Writes go from the primary server to the backup server; reads are handled by the primary. If the primary fails, the entire load switches to the backup server.
Now, split the data into two half-size boxes, each with two replicas, each on two disks, one on each server. Box one has server A as primary, Box two has server B as primary. Writes still go to both servers, but since reads can be handled by the primary, two reads can go on in parallel. If server A fails, server B now gets to handle all of the read traffic. This seems to be the case that could cause overload, yet they say that they had all the volumes in one box. ??
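Back-of-the-envelope arithmetic for the two layouts, with invented rates (writes go to both replicas, reads only to a box's primary):

    R, W = 100.0, 20.0                        # invented read and write rates

    # One big box, server A primary:
    one_box_normal = {"A": R + W, "B": W}     # A serves all reads; both take writes
    one_box_after_A_fails = {"B": R + W}      # B suddenly absorbs all the reads

    # Two half-size boxes, A primary for one, B primary for the other:
    two_box_normal = {"A": R / 2 + W, "B": R / 2 + W}
    two_box_after_A_fails = {"B": R + W}      # same failure-time load as before

Either way the survivor ends up carrying R + W; splitting the data into two boxes evens out the normal-case load but does not reduce the load the survivor must tolerate after fail-over.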
Comments and suggestions: Saltzer@mit.edu