6.033--Computer System Engineering
Suggestions for classroom discussion of:
Replication.
preliminary notes, by J. H. Saltzer, version of April 22, 2002, 1:45 p.m. (Some of these ideas came from the Echo paper.) Minor corrections for consistency with notes, January 2005.
Random observations, in random order. This is not a coherent set of suggestions!
- Trade-offs: Kerberos uses shared-secret keys to authenticate users, so it must store a table of {principal identifier, key} pairs. (The key is actually a one-way hash of your password. Why do you suppose it is hashed?) If this table is lost, it will inconvenience a lot of people, so it should be replicated. But the table must be kept very secure. For reliability there are several running Kerberos servers, each with a copy (mirror) of the table, and also there are tape backups of the current table and many previous generations of the table. Lots Of Copies Keep Stuff Safe. But all those copies present a security problem. This is an interesting trade-off.
- Trade-offs, continued: To reduce the security exposure, Kerberos seals every personal key with a master key. That is, the table actually contains
{Kprincipal}Kmaster. Now we can make as many mirrors and backup copies as needed for convenience, reliability, and performance. But there is a new problem: The master key must now be both secure and reliable. Fortunately, it is smaller, so it is (allegedly) easier to deal with. For security, it is not kept on any disk; it must be typed at a keyboard when any server is started, and the server program is designed to hold the master key only in physical RAM, not in virtual memory where it might end up on the disk. For reliability, several people know the key. The number of people who know the secret represents a pure trade-off between rate of leakage and probability of failure.
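A minimal sketch of the sealed-table idea, in Python, using the third-party "cryptography" package as a stand-in cipher (Kerberos's real ciphers, key derivation, and table formats differ); the principal name and the helper names here are invented for illustration:

    # Illustrative only: Fernet stands in for Kerberos's actual ciphers and formats.
    import hashlib
    from cryptography.fernet import Fernet

    master_key = Fernet.generate_key()   # in practice, typed at the keyboard at server start
    master = Fernet(master_key)          # and held only in physical RAM

    def personal_key(password):
        # the "key" stored for a principal is a one-way hash of the password
        return hashlib.sha256(password.encode()).digest()

    def seal_user_key(password):
        # a table entry is {Kprincipal}Kmaster: the personal key sealed under the master key
        return master.encrypt(personal_key(password))

    # This table can now be mirrored and backed up freely; only the master key
    # must remain both secret and available.
    table = {"alice@ATHENA.MIT.EDU": seal_user_key("hunter2")}

    assert master.decrypt(table["alice@ATHENA.MIT.EDU"]) == personal_key("hunter2")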
- Building on the Reliability chapter model: If one has genuinely fail-fast replicas, the output of any one replica can be relied upon, and the voter can become a chooser. Much of the machinery of replicated file systems consists of algorithms for choosing (mostly in software, but occasionally with a little help from hardware) which replica of something to use. The replica that is chosen is called the primary; the others are backups.
- Using a chooser rather than a voter is significantly more risky, because we are now depending on the fail-fast detection mechanism to detect every possible failure, rather than actually comparing the N copies to see if they are getting the same answer. (Indeed, most replicated file systems don't read and compare the replicas; they assume that if the disk reports no error on the first copy it reads, that copy is good.)
- Fail-over is the process of automatically changing which replica is primary, in response to a failure of the current primary.
- A supermodule constructed with a voter generally continues working without pause when a replica fails. If it is constructed with a chooser and a fail-over process, there is generally a short period of non-availability during which the chooser chooses another primary, and that primary is brought up to speed.
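A minimal sketch of a chooser with fail-over, assuming replicas that are genuinely fail-fast (a failed read raises an error rather than returning bad data); the class and method names are invented for illustration:

    # Sketch: a chooser relies on fail-fast detection instead of comparing N answers.
    class Chooser:
        def __init__(self, replicas):
            self.replicas = list(replicas)   # replicas[0] starts out as the primary
            self.primary = 0

        def read(self, block):
            # Try the current primary; on a detected (fail-fast) error, fail over
            # to the next replica.  The risk: an undetected error is returned to
            # the caller as if it were good data.
            for _ in range(len(self.replicas)):
                try:
                    return self.replicas[self.primary].read(block)
                except IOError:
                    # fail-over: choose a new primary; service pauses briefly
                    self.primary = (self.primary + 1) % len(self.replicas)
            raise IOError("all replicas have failed")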
- When replicas store information (state), there are two kinds of designs:
- The important thing to replicate is the device; its state can be reconstructed or copied quickly from another replica.
- The important thing to replicate is the state of the device. This is harder; it is tricky to get the replicas to all have the same state.
- What can we replicate?
- disks
- disk ports
- file servers (a file server is a bunch of code and its temporary state)
- volumes (meaning the stored files in a contiguous chunk of the naming hierarchy)
- Suppose we have only one server and only one disk. We make three replicas of one volume, and one replica of another volume. All the replicas are on one disk, with one port, one server, and one file system. Does this configuration have any value? (Yes. It helps protect the contents of the replicated volume from single-sector disk decay.)
- Is that all there is to it? (No. You can create three replicas and put them all on the same disk. But if a sector decays silently there isn't any systematic procedure to discover it and make a third copy somewhere else. You don't discover that an individual sector has decayed till you try to read it, and by then it may be too late. And an application program can't do a periodic test by reading all three copies, because it can't control which copy it will read from.)
- What to do about that? (It requires periodic testing. Read all the sectors of every disk every once in a while.)
- Does it help to put the three copies on three different disks? (This protects against head crashes, but you still can have silent sector failures, and need to do periodic testing.)
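A minimal sketch of such periodic testing (scrubbing), in Python, assuming the replicas are plain files, one per disk, and that a checksum of each sector was recorded when it was written; all names here are invented, and a real file system would rely on the disk's own error detection rather than application-level checksums:

    # Sketch: periodically read every sector of every replica; when one copy is
    # found to be bad, rewrite it from a copy that still reads correctly.
    import hashlib

    SECTOR = 512

    def read_sector(path, n):
        with open(path, "rb") as f:
            f.seek(n * SECTOR)
            return f.read(SECTOR)

    def write_sector(path, n, data):
        with open(path, "r+b") as f:
            f.seek(n * SECTOR)
            f.write(data)

    def scrub(replicas, num_sectors, checksums):
        # checksums[n] is the expected hash of sector n, recorded when it was written
        for n in range(num_sectors):
            copies = {path: read_sector(path, n) for path in replicas}
            good = [d for d in copies.values()
                    if hashlib.sha256(d).hexdigest() == checksums[n]]
            if not good:
                raise IOError("sector %d has decayed on every replica" % n)
            for path, data in copies.items():
                if hashlib.sha256(data).hexdigest() != checksums[n]:
                    write_sector(path, n, good[0])    # repair from a good copy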
- What are the ways that a fail-fast module might try to convince a voter or chooser that it is healthy (or report that it is sick)?
- power-on signal (simple, but detects only power failures)
- internal checking (can be simple or elaborate)
- pair-and-compare (voter reports up/down)
- periodic test (for components that aren't routinely exercised)
- timeout (timer expiration means it died)
- leases (non-renewal means it died; a minimal sketch appears after this list)
- heartbeat (as long as it keeps ticking it is alive)
- keep-alive (if it stops answering challenges, it died)
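A minimal sketch of the lease idea mentioned above, with invented names; a real system would renew leases with network messages and deal with clock and message delays:

    # Sketch: a lease is a promise of health that expires unless renewed.
    # The chooser treats a replica whose lease has lapsed as dead.
    import time

    LEASE_SECONDS = 5.0

    class Lease:
        def __init__(self):
            self.expires = 0.0               # not yet granted

        def renew(self):
            # called on behalf of the replica (a heartbeat / keep-alive response)
            self.expires = time.time() + LEASE_SECONDS

        def alive(self):
            # called by the chooser; non-renewal means the replica died
            return time.time() < self.expires

    lease = Lease()
    lease.renew()
    assert lease.alive()      # freshly renewed: treated as healthy
    # If the replica stops renewing, alive() becomes False after LEASE_SECONDS,
    # and the chooser begins fail-over.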
- For reliability, I replicate my name server by putting it on three file servers. Why is this a bad idea? (Dedicated name servers will be *much* more reliable. If the only thing the server has to do is to handle name requests, there is only a minimal risk of it getting confused. If the server is also handling file service requests, then the server may get bollixed up doing that, lose track of its name service activities, or even crash, all of which reduce the apparent reliability of the name service. And, following a power failure, you would like the name server to come up immediately, but a large file server probably has to do a salvage/recovery pass over all its disk metadata before it can fire up its file system, delaying the turn-on of the name server.)
- How might you coordinate use of a dual-ported disk? What is the problem? (The problem is that the two ports may be connected to different computers, and both computers may try to use the disk at the same time.)
- Arbiter at the disk. This prevents two requests from interfering electrically, but a seek from port 1 followed by a seek from port 2 and then a write from port 1 could be disastrous. (Assuming seeks are addressed to a track and writes are addressed to a sector within the current track.)
- Two-state lock (states: port A, port B) maintained by the disk, settable by either disk controller. Keeps each controller out of the other's way, but needs a plan for failure of the controller that acquired the lock.
- Two-state lock that can't be reset till a timer expires. Somewhat covers the above problem. Downsides: the timeout can delay fail-over if the primary controller dies. And if the primary controller is merely overloaded rather than dead, the second controller may take over; if the second one then gets overloaded in turn, the same problem can occur again.
- Three-state lock (states: port A, port B, unassigned) with timeout that resets to "unassigned" state. Disk rejects operating (seek/read/write) requests arriving while in unassigned state, accepts only assign requests.
- Disk maintains both a three-state lock and an assignment epoch, and all operating requests must come from the assigned controller and mention the matching epoch, or they will be rejected. At power up, disk starts in unassigned state and the first controller to request gets epoch #1. (This is known as the "resurrecting duckling" model.) Need a plan for demanding a new epoch. Optional feature: may or may not reject requests from the other controller that mention the right epoch. Rejection is safer; acceptance allows dynamic rerouting of requests if the controller software is sufficiently heads-up. Note that as the restrictions tighten up, it is easier to envision failure scenarios in which the disk refuses to accept orders from the only still-working controller.
- Assignment epoch plus lease, issued by the disk to the controller that requests the epoch, extended with each disk request or a separate keep-alive. If a lease expires, the disk goes back to unassigned state and the next controller to request assignment gets the next epoch number.
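A minimal sketch, from the disk's point of view, combining the last two designs (three-state lock, assignment epoch, and lease); all names are invented, and a real disk would receive these requests over its two ports rather than as method calls:

    import time

    LEASE_SECONDS = 2.0

    class DiskArbiter:
        # States: unassigned, or assigned to one port.  Every assignment gets a
        # new epoch number; operating requests must come from the assigned port
        # and quote the current epoch, or they are rejected.
        def __init__(self):
            self.owner = None          # port name, or None = unassigned
            self.epoch = 0
            self.lease_expires = 0.0

        def _lease_lapsed(self):
            return time.time() >= self.lease_expires

        def assign(self, port):
            # the first controller to ask after power-up (or after a lease lapse)
            # gets the next epoch
            if self.owner is not None and not self._lease_lapsed():
                return None            # refused: another port holds a live lease
            self.owner = port
            self.epoch += 1
            self.lease_expires = time.time() + LEASE_SECONDS
            return self.epoch

        def request(self, port, epoch, op):
            # seek/read/write: rejected unless it comes from the assigned port,
            # quotes the matching epoch, and the lease has not expired
            if port != self.owner or epoch != self.epoch or self._lease_lapsed():
                raise PermissionError("request rejected; reacquire assignment")
            self.lease_expires = time.time() + LEASE_SECONDS   # each request extends the lease
            return "did " + op

    disk = DiskArbiter()
    e = disk.assign("portA")
    disk.request("portA", e, "write sector 7")
    # portB cannot take over while portA's lease is still live:
    assert disk.assign("portB") is None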
- Some algorithms are said to work in the face of "non-transitive network service interruptions". That is an interesting additional constraint that we didn't consider in the Networking chapter. How could such interruptions arise? (On a poorly-terminated Ethernet with lots of echoes, A can hear B and B can hear C, but A can't hear C.)
- Cold start. What makes cold starts (e.g., after a general power failure) so hard to accomplish? (Unnoticed design circularity. For example, the file system requires name service, but name servers depend on the file system to hold their databases.)
- Doesn't placing a name server on a dedicated computer fix this? (Only if the dedicated computer has a stand-alone file system. That server can boot, then the name server can run and provide help to the rest of the community.)
- Yes, but it doesn't have any backup. (So build three of them. The tricky part is that you also have to explicitly copy data from one to another to maintain the replicas, rather than expecting some replicated file system to do it automatically. The key is to identify the base cases and special-case them. The generators used by large power companies have bearings that must be lubricated by high-pressure pumps. The pumps, of course, are driven by electric motors. They have the same problem performing a cold start.)
Comments and suggestions: Saltzer@mit.edu