6.033 Spring 2015, Lecture 19 Hari Balakrishnan April 15, 2015 DISTRIBUTED TRANSACTIONS Sometimes one wants to run transactions over data distributed on multiple different machines or even multiple different sites separated by a wide-area network. This lecture discusses how to coordinate a distributed transaction to ensure that either all the different machines (sites) run the steps to completion (COMMIT), or none do. The key idea we will study is a distributed commit protocol called TWO-PHASE COMMIT. 1. EXAMPLES Example 1: A "sharded" database for scalability Suppose you're building a social networking feature into your application and want to maintain a set of friends for each user. Your app is wildly successful and now you need to scale the database that maintains information about users and their friends. Consider a simple table, called FriendsTable in your database: Username Friend -------- ---------- Alice Charles Alice Doug Alice Eve Bob Charles Charles Alice Charles Bob Charles Doug Doug Alice Doug Charles Eve Alice Let's say Alice adds Bob as a friend. Note that the "friend" relation is symmetric, so when Bob accepts Alice as a friend, we would like the database to have two new entries: Alice Bob Bob Alice Because your app has become very popular, a single database can no longer maintain information of all your users. You decide to "shard" the database, i.e., split it into multiple databases running on different machines. For example, one way to do this sharding is to run N different databases, partitioning them by Username. Let's say that in the sharded system, Alice and Bob are now on different machines, call them M1 and M2. What we want to do is to run the following queries within the same transaction (shown here with SQL): BEGIN_TRANSACTION INSERT INTO M1.FriendsTable (Username, Friend) VALUES ('Alice', 'Bob') INSERT INTO M2.FriendsTable (Username, Friend) VALUES (Bob', Alice') END_TRANSACTION Each of these queries is a separate transaction, the first running on M1 and thessecond on M2. Each could independently fail or ABORT for a variety of reasons. An acceptable outcome is both queries succeeding or both queries failing. An unacceptable outcome is exactly one of the queries succeeding. That is, we want the M1 transaction to COMMIT if the M2 transaction COMMITs, and we want the M2 transaction to COMMIT if the M2 transaction COMMITs. The outer transaction has a coordinator (e.g., a transaction management library used by the application). Example 2: Suppose we want to book a vacation itinerary involving two different airlines and a hotel. We want to run these three different transactions within a single larger transaction: they should all succeed or all fail; only one or two of them succeeding isn't an acceptable outcome. 2. CHALLENGE: COMMIT PROTOCOL One can think of a distributed transaction as nested transactions that run on different machines or sites. We want the independent transactions, e.g., each of the two transactions to add a friend (Alice adding Bob and Bob adding Alice) to make their results visible to other transactions only when both have completed. Let's call te first query A and the second query B. Consider the following simple protocol: Coordinator (C) M1 M2 C-->M1: do A C-->M2: do B Run A (but don't commit) M1-->C: ran A Run B (but don't commit) M2-->C: ran B C-->M1: COMMIT C-->M2: COMMIT OR if either has failed (timed out) or ABORTed C-->M1: ABORT C-->M2: ABORT COMMIT A COMMIT B Question: Is this protocol correct? Why not? Answer: Think about what happens under various failure situations... Suppose M2 fails. When it recovers, what does it need to do? It needs to ask the coordinator (C) what happened to the transaction. C needs to tell M2 the outcome. To do this, C needs to maintain a record of the outcome of every transaction (in durable storage, in case C crashes and then recovers). But in this protocol, how does M2 know how to ask C about a pending transaction? It needs something in durable storage! Like in its log. But it might not have written things to its log, because it did not yet receive a COMMIT (in a non-distributed transaction, the COMMIT forces log writes to disk and then a COMMIT record to disk). We need a better protocol to decide how and when to COMMIT. 3. TWO-PHASE COMMIT Time flows downwards: Coordinator (C) M1 M2 C-->M1: do A C-->M2: do B Run A (but don't commit) M1-->C: ran A Run B (but don't commit) M2-->C: ran B C-->M1: PREPARE to COMMIT! C-->M2: PREPARE to COMMIT! OR if either has failed (timed out) or ABORTed C-->M1: ABORT C-->M2: ABORT PREPARED! Log (PREPARED) PREPARED! (or ABORT!) Log(PREPARED) (or, ABORT) if all PREPARED: COMMIT! (this is the COMMIT point: subordinates MUST commit now) else: ABORT log(COMMITTED) (or log(ABORTED) log(COMMITTED) (or log(ABORTED) Correctness of two-phase commit: In the absence of any failures or message losses, the protocol is correct. Before the COMMIT point, results will not be visible, and at some time following the COMMIT point, results will be made visible to other transactions. The results on M1 and M2 cannot be guaranteed to be made visible at exactly the same instant of time (that is fundamentally impossible to achieve). Failure analysis is the more interesting case. We have to consider three cases: 1) Subordinate failure 2) Coordinator failure 3) Message losses 1) Subordinate failure: If the failure occurs before the PREPARED has been logged, the subordinate simply undoes any changes it might have made when it recovers from its log (more generally, it wouldn't have written any data to disk, so there's nothing to do). If the failure occurs after PREPARED! has been written to the subordinate's log, when the subordinate recovers, it asks the coordinator what the fate of each PREPARED transaction was. If the coordinator says "COMMIT", the subordinate logs a COMMIT record and installs the changes to its cell store. If the coordinator says "ABORT", then the subordinate undoes any changes (if needed) and logs an ABORTED record. Once the subordinate has logged a PREPARED record, its fate wrt that transaction is solely in the coordinator's hands. 2) Coordinator failure: If the coordinator fails, the system may, under some conditions, come to a halt. When the coordinator recovers, it knows about all COMMITTED transactions, and any other transaction that it might be asked about would be considered ABORTed. When the coordinator is down, the subordinates can determine from each other (assuming they know about each other) what happened to certain transactions. Any transaction without a PREPARED logged can be ABORTed. Any transaction with PREPARED in all the nodes has to wait until the coordinator recovers. Any transaction with COMMITTED or ABORTed in a subordinate's log may be handled accordingly by another subordinate. 3) Handling a best-effort network with message losses: We use an exactly-once RPC protocol to handle message losses. Such an RPC has the property that each request is executed exactly once at M1 and M2, and the results provided to C. C would reissue requests if it does not get a response within a timeout period. An exactly-once message protocol uses nonces, which are unique identifiers guaranteed (with very high probability) to be used exactly once. (E.g., make up random 128-bit numbers.) The receiver maintains a list of observed nonces, and responds with the stored answer if the nonce (request) has already been seen. It does not re-execute a request for a stale nonce.