6.033 Spring 2015, Lecture 20 Hari Balakrishnan April 22, 2015 AVAILABILITY VIA REPLICATION A key theme in our treatment of fault-tolerant computing is to achieve fault-tolerance from unreliable components. In the past four lectures, we studied the transactional model, which provides all-or-nothing atomicity and durability via write-ahead logging, and also supports concurrent operations using serializable semantics ("before-or-after"). The transactional model is a powerful, widely-used way to improve the reliability of many systems. It ensures, in particular, that software and hardware crashes preserve the effects of all committed transactions and undo any changes made by aborted transactions. Transactions by themselves don't improve availability (recall that availability is defined as the fraction of time that a system is working). If a transaction system crashes, it can recover to a reliable state, but when it is recovering, the system is not available. One way to improve availability is via replication. That's what we will study here. Our plan is as follows: 1) Some examples (illustrating the trade-off between availability and consistency). 2) Optimistic and pessimistic replication. 3) Optimistic replication: key challenge (concurrent updates) 4) Vector timestamps: one approach to optimistic replication 5) Pessimistic replication with a replicated state machine (overview only) (#5 was not covered in lecture and will not be on Quiz 2) 1. EXAMPLES We have already seen two examples of replication in the course so far: DNS and RAID, and last week we studied PNUTS in recitation. Even between DNS and RAID, the contrast between their consistency semantics is instructive: DNS: Multiple name servers and extensive caching using "time to live" to determine when the cached data is invalid. Tolerates various server failures and various network partitions. The data in the caches could be inconsistent, but the system has high availability compared to a system where clients would always have to contact a primary or backup name server. RAID: When writing data, a quorum of disks must be available and data is always consistent (i.e., read returns result of most recent write). But if more than a certain number of disks fail or not all disks are reachable, system stalls. I.e., we prefer high consistency to high availability. PNUTS: Uses "eventual consistency" as its model rather than ensuring that all reads show the result of some serializable execution. Replicas "eventually" obtain the right data. Where one sits on the availability-consistency trade-off depends on the application and goals of the system one is designing. Consider for example the following two sets of uses: Set 1: ------ DNS Mobile calendar Version control for source/docs File sharing across devices ("Dropbox", "Drive", etc.) File sync'ing between devices ("Unison") Set 2: ------ RAID Replicated transaction database for a bank Distributed lock manager Distributed transaction coordinator (for two-phase commit) For the first set, the designer is likely to prefer availability over consistency. (I'll define what "consistency" means a little later, but for now, assume it means "a read gets the result of the last write"). It is alright for the data to become "eventually consistent" (or even, as in DNS, give up on consistency and live with inconsistency because it allows us to tolerate network partitions and scale to a large size). For the second set, we would like something a lot better because the damage caused by inconsistent states could be immense. One way to state a goal is "single-copy consistency", i.e., we would like the replicated system to behave as if there was a single machine and single copy of all data, except that we would like much higher availability than a single system can provide. For example, it would be a disaster to have a distributed lock manager give a lock on some object to two different clients at the same time, or rely on some "eventually correct" value in our bank account! The techniques used for the first and second set are rather different. To achieve high availability, the first set of applications are willing to tolerate some inconsistency, as long as it may eventually get resolved, and more importantly, as long as conflicting updates are detected and handled. The second set of applications is better served by stalling or blocking if it means that single-copy semantics might be violated. The challenge in achieving our goals come from: (1) concurrent operations, (2) replica failures, and (3) network partitions. 2. OPTIMISTIC AND PESSIMISTIC REPLICATION Optimistic replication techniques generally favor availability over single-copy consistency. They let data be read or written without a priori coordination, by assuming that conflicts are rare. They propagagte updates in the background, they have a way to detect conflicts, and a way to fix conflicts after they happen. (Often fixing a conflict involves propagating the issue to a higher layer or to the user; for example, "you have two different meetings at 10 am"). Pessimistic replication techniques ensure single-copy consistency by synchronously coordinating replicas, and blocking other operations during an update. 3. OPTIMISTIC REPLICATION Consider a simple example of file synchronization between devices. We look at one file. Example 1. Consider the following (time flows from left to right, and H1, H2, etc. are machines): H1: W("A"); Send to H2 H2: R(file); W("B") H2: Send file to H1 H1: Read(file) Here both H1 and H2 have consistent data. Example 2. Now consider: H1: W("A"); Send to H2; Send to H3 H2: Read(file); W("B") H3: Read(file); Send to H2 What should H2 do now? It has two versions of the file; should it retain its own version, or should it use the version that came to it from H3 *after* H2 had saved its local changes? A solution that works in this situation is to track the modification time of the file and use the version with the latest modification time. This solution assumes that the clocks are synchronized across the different machines. Protocols like the Network Time Protocol (NTP) work reasonably well in practice, and could be used for this purpose. The problem, however, is that this solution is not general enough. Consider the following: Example 3. H1: W("A"); Send to H2; W("C") at time 5 H2: Read(file); W("B) at time 6; Send to H1 Here, both H1 and H2 have made changes, and H2's modification time is larger. So which version should we use? The answer is not obvious. H2 made the change *without* knowing about H1's subsequent change, starting with the version written at the start by H1. H1's subsequent change also used the same initial version. Had H2 made its change *after* becoming aware of H1's change, then its changed version ("B") would be appropriate. But as it stands, the changes made by H1 and H2 are in conflict with each other. We would like a way of automatically detecting such conflicts. When a conflict arises, the higher layer application or user must handle it. For example, a calendar application detecting a conflict could apply application-specific methods to merge entries or resolve conflicts, or ask the user what to do. And when there is no conflict, the replication protocol automatically "does the right thing". There are many ways to define a conflict. One approach is to use the following principle: If updates are made sequentially (i.e., update Y was made at a machine after it received update X), then we want to ensure that the most recent update on that update chain is the correct one. If not, the protocol must detect the conflict automatically. 4. VECTOR TIMESTAMPS A method to achieve these consistency semantics is to use VECTOR TIMESTAMPS. The idea is to extend the "modification time" to be a vector, with one timestamp per machine. Vector Timestamp (VT): , where T_i is the timestamp at which machine i modified the object. The protocol assumes that the timestamp assignment on each machine is monotonically increasing; i.e., time must always move forward from update to update. The timestamps may be assigned via a real monotonic clock, or could be logical. If a = and b = are two vector timestamps, then we define: a = b if for all i, a_i = b_i a > b if for all i, a_i >= b_i AND !(a = b) a || b if there exists i, j: a_i > b_i AND a_j < b_j. a > b means a is MORE RECENT than b. a || b means a and b are concurrent, and neither is more recent than the other. The protocol: 0, Initially, each file's vector timestamp is <0, 0, ..., 0>. 1. If a machine i modifies the file, the vector timestamp changes the i'th index to the modification time. 2. When a machine i sends a file to machine j, the vector timestamp at machine j for the file is set to the more recent of the local timestamp or the version that just arrived from machine i, unless the two timestamps are concurrent. 3. If the timestamps are concurrent, flag a conflict. Thus: when a machine has multiple versions of a file, its conflict resolution process is to use the version with the most-recent version timestamp. If it is unable to determine whether one timestamp is more recent than the other, it flags a conflict. This protocol ensures the conflict detection principle described above: If updates are made sequentially (i.e., update Y was made at a machine after it received update X), then we want to ensure that the most recent update on that update chain is the correct one. If not, the protocol must detect the conflict automatically. Notice that with vector timestamps we only need each machine's timestamps to be monotonically increasing; we don't need the clocks on the machines to be synchronized. 5. PESSIMISTIC REPLICATION (Note: we did not cover this in lecture, and it will not be on Quiz 2.) Examples where pessimistic replication ensuring single-copy consistency is useful include replication transaction systems where we would like all the replicas to run the steps in exactly the same order, a distributed lock manager, a distributed transaction coordinator to orchestrate a two-phase commit protocol, etc. If multiple replicas run the same operations in the same order and all operations are deterministic, then all replicas will have the same state. This idea is called a replicated state machine. On a failure, a VIEW SERVER can select a suitable backup replica to become the primary, and clients can use the new primary. There are a few tricky parts to get right: -- Ensuring deterministic operations; e.g., what happens when the computation involves random number generation, local time, etc.? -- What happens when there is a network partition and clients and servers have conflicting opinions on whether the primary is running or not? -- How to ensure that the view server is not a central point of failure? To ensure deterministic operations, we could always use the primary's results and send those to the secondary on operations that conflict. If the primary is a transaction processing system, we could ship its write-ahead log to the replicas, and return from a COMMIT only when the replicas have written the log data to their own disks. The use of a single view server means that its opinions are the right one. Clients and servers both contact the view server periodically to determine the identity of the primary and backups. The view server will need to itself be replicated for fault-tolerance, which is starting to appear like a chicken-and-egg problem! The difference, however, is that in the simplest of situations, the machines in the system only need to achieve consensus on one thing: the identity of the primary replica. (In practice, one needs consensus on a sequence of operations made, making the protocol a lot more complicated.) Various consensus protocols have been developed by researchers, such as Viewstamped Replication, Paxos, and RAFT, and have inspired schemes used in practice. A course on distributed systems such as 6.824 will teach you the principles and practice of these techniques... We're now at the end of our module on fault-tolerance, where we studied: -- the meaning of reliability and avaialibility -- transactions, and how to implement them -- replication