6.033 Spring 2015, Lecture 16 April 6, 2015 Hari Balakrishnan (Last update: April 23, 2015) An algorithm for all-or-nothing put and all_or_nothing get (Adapted from notes by Barbara Liskov) This note describes the design of an all-or-nothing disk sector, supporting a "put" and a "get" method. all_or_nothing_put is called when the higher layer (the caller) has some new data to write; for example, it could be a bank account balance. The goal is: a) all_or_nothing_put is all-or-nothing (all_or_nothing_get will see either the old balance or the new one). b) if all_or_nothing_put completes, all_or_nothing_get will see the change (e.g., the new account balance). We will design the all-or-nothing scheme atop two lower-layer disk write/read functions, careful_put() and careful_get(). 1. ASSUMPTIONS a) If careful_put returns, the data has been correctly written to the disk sector in question. Thus the only error that might corrupt the data is one that occurs during the execution of careful_put. We assume that any other disk problems, such as decay, loss of a sector, etc., are handled by other means. b) A failure during the execution of careful_put will, in general, corrupt the data in that sector, but careful_get can detect any corruption (via a checksum, for example). A failure in the execution of careful_put may occur because of a power failure, software crash, etc. A failure may also occur at other times during the execution of the all_or_nothing functions. 2. STRATEGY: USE TWO SECTORS The strategy is to maintain two distinct sectors of data corresponding to the specified disk sector, which we will call sector s. s is an argument to all_or_nothing_put and all_or_nothing_get. We will adopt the "golden rule of atomicity": "never modify the only copy". That is, the sector for which we wish to provide all_or_nothing_put, s, will have two distinct disk sectors, s.D0 and s.D1. We will use s.D0 and s.D1 to to write data using careful_put and read data using careful_get. The plan is to write carefully (using careful_put) so that we can guarantee that at most one of s.D0 and s.D1 is corrupted. (Both will not be corrupted.) 3. THE ALGORITHM The pseudocode for all_or_nothing_put and all_or_nothing_get is: all_or_nothing_put(s, data): status = careful_get(s.D0, buffer) if status == OK: careful_put(s.D1, data); careful_put(s.D0, data); else: careful_put(s.D0, data); careful_put(s.D1, data); all_or_nothing_get(s, data): status = careful_get(s.D0, data) if flag == OK: return; careful_get(s.D1, data); 4. MAIN CLAIM & PROOF The algorithm stated above preserves the following predicate P: at least one of (s.D0, s.D1) is uncorrupted. The proof is by induction. Basis step: initially both copies are OK by assumption, so P holds. Inductive step: assume P holds when all_or_nothing_put is called. There are two cases to consider: a) both copies are good: In this case it doesn't matter what order we write them. If the first write finishes, that copy is good; if it doesn't the other copy is good. Therefore P holds. b) one copy is bad: This is the key step. In this case we must write to the bad copy first. If this write doesn't finish, we still have one good copy. If it finishes, we now have 2 good copies, so we will satisfy P even if the second write fails. Specifically: If D0 is good when all_or_nothing_put starts: In this case, we write to D1 first. If that write doesn't finish we haven't done anything to D0 and therefore it is still good. If that write finishes, D1 is now good, and therefore if doesn't matter whether the write to D0 fails. If there is a failure after the careful_put to D1 completes and before teh careful_put to D0 starts, then both D0 and D1 have uncorrupted data. But they are different. In this case, all_or_nothing_get will return the data in D0. But as soon as D0 starts getting written, the data returned to a subsequent all_or_nothing_get is the data in D1; that data will be the same as in D0 if the careful_put to D0 also succeeds. On the other hand, if D0 is bad when all_or_nothing_put starts: This implies that D1 is good, by our inductive assumption. The algorithm writes to D0 first. If that write doesn't finish, the data in D1 is still good. If the write to D0 finishes, D0 is now good, and therefore it doesn't matter if the write to D1 fails. In either case we are sure that at least one of D0 and D1 is good; the data returned from all_or_nothing_get is D0 if it is good, and D1 if D0 is bad. P implies directly that all_or_nothing_put is all or nothing, so we satisfy our goal. If all_or_nothing_put finishes, it has succeeded in writing the new state to both D0 and D1. Therefore all_or_nothing_get will observe the new state. 5. COMMIT POINT What is the commit point of this algorithm; i.e., the point in time after which daata will be visible to a subsequent caller of all_or_nothing_get? The commit point is different for various executions: -- The new state will be visible as soon as the careful_put to D0 finishes. -- It will also be visible in the case where the careful_put to D1 happens first, successfully, and the careful_put to D0 starts. The commit point in this case is the instant of time at which the careful_put to D0 starts. There are the following cases: D0 was good when all_or_nothing_put started (a) the careful_put to D1 failed. (b) the careful_put to D1 succeeded and then the machine failed. (c) the careful_put to D1 succeeded, but the machine failed during the write to D0. D0 was bad when all_or_nothing_put started (a) the careful_put to D0 failed. (b) the careful_put to D0 succeeded and then the machine failed. (c) the careful_put to D0 succeeded and the machine failed during the write to D1. The new state will be visible in cases 1c, 2b, and 2c. (Of course as indicated, it will also be visible if all_or_nothing_put finishes. An implementation concerned with a concurrent all_or_nothing_put and all_or_nothing_get operations would use locks, as we will see later. Still, it is interesting to note that one all_or_nothing_put and multiple all_or_nothing_get operations can operate concurrently and the get operations will never see a non-atomic result. An in-progress partial write (careful_put) would show up to the reader (in careful_get) as a corrupt sector -- recall that careful_get has an integrity check (via a checksum), returning OK only if the data is not corrupt. Handling multiple concurrent all_or_nothing_put operations requires more explicit coordination (via locks, for example). 6. PROS AND CONS OF THIS ALGORITHM + Works well for a single sector and can be generalized to a single file. - Hard to generalize to multiple files or directories. Might have to place all files in a single directory, or rename subdirs. - Requires copying the entire sector (file) for any (small) change. - Only one put operation can happen at a time (as mentioned above). - Only works for atomic actions that happen on a single computer, single disk.