6.033 2010 Lecture 20: Replication

Fault-tolerance
  so far: crash recovery for a single server
  what about continuing despite failures?
    need multiple servers
  useful for DNS, file server, master for two-phase commit

Replication
  [diagram: multiple clients, two servers]
  if client sees both are up: send request to both
  if one is down: just send to the other one

What can go wrong w/ replication?
  client can't know if a server is down!
    only thing a client knows is whether the server responds to messages
    if not, is the server down, or is the network broken?
  example: network partition
    [diagram]
    server in the other partition looks like it's down (doesn't respond)
    some clients use one server, some the other
  result: inconsistency!
    different clients see different state
    same client may see state switching back and forth!

Easiest plan: tolerate inconsistency (don't care)
  idea: eventual consistency
    put a time-stamp on each item of data
    servers periodically compare time-stamps, both keep the freshest
  this is essentially how DNS works
    a new/modified/deleted record may not take effect for hours or days
    replicas, caches
  this is a very common approach
  works best with a single writer

What if multiple writers?
  one approach: write a copy and apologize later
    e.g. Unison
    conflicts?
  "optimistic replication"

Suppose you need consistency
  e.g. a replicated fault-tolerant lock server
    better not give the same lock to two clients!
  first ask before you update a copy
  "pessimistic replication"

consistency goal: "single-copy consistency"
  from the outside it looks like one copy of the data
    each request sees the results of previous requests
  of course, inside there may be disagreement
    failures, network delays, &c

idea: majority
  remember: clients may disagree about which servers are up
    e.g. if partitioned (or worse)
  what if we have *three* servers, not two?
  require a client to contact any two of the three
    otherwise the client must wait
  can proceed despite one server failure
  if client C1 does an op, then C2, C2 will see C1's operation
    on at least one of the two servers it contacts
    that is, any two majorities intersect at at least one node
  [diagram: three servers]
  handles partition as a special case
    if two servers in one partition, one in the other,
    clients can only make progress in the two-server partition

idea: replicated state machine
  a method to keep replicas the same
    same starting state
    same inputs, in same order
    deterministic operations
    -> same final state

replicated state with logs
  [diagram: three logs]
         S1   S2   S3
    1    Wx   Wx   Wx
    2    Ry   Ry   Ry
    ...
  represent the operations as a log, with a copy in each server
  log entries are client operations
  the entries are numbered

How to ensure replicas all execute client ops in the same order?
  [diagram: clients, leader, other servers]
  primary-copy:
    clients send operations to one of the servers (the leader)
    leader runs one client operation at a time
    leader picks a log entry number
    sends the operation to all servers w/ that number
    when a majority have it, reply to the client
      wouldn't want to reply for an operation that didn't really happen

reminder why it's tricky
  leader may crash in the middle of an operation
    can other servers disagree on whether the operation happened?
  two servers may think they are the leader
    maybe due to network problems or lost packets
    will they append different "next operations" to their logs?

let's just think about a single entry
  e.g. log entry #140
  ok for a server to not know the operation for #140
  all the servers that do know should agree on what operation #140 is
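Before turning to how servers agree, here is a minimal sketch of the replicated-state-machine idea above: replicas that apply the same numbered log of deterministic operations, in order, end up in the same state. The names (Replica, append, apply_ready) and the tiny key/value state are illustrative, not part of the lecture.

  # Minimal sketch of a replicated state machine: each replica applies the
  # same numbered log of deterministic operations, so all replicas end up
  # with the same state.
  class Replica:
      def __init__(self):
          self.state = {}          # key -> value
          self.log = {}            # entry number -> operation
          self.next_entry = 1      # entries must be applied strictly in order

      def append(self, n, op):
          """Record op as log entry number n (as the leader would instruct)."""
          self.log[n] = op

      def apply_ready(self):
          """Apply consecutive entries; stop at the first gap in the log."""
          while self.next_entry in self.log:
              kind, *args = self.log[self.next_entry]
              if kind == "write":        # ("write", key, value)
                  key, value = args
                  self.state[key] = value
              # ("read", key) is deterministic and leaves the state unchanged
              self.next_entry += 1

  # same start state + same ops in same order + determinism => same final state
  replicas = [Replica() for _ in range(3)]
  for r in replicas:
      r.append(1, ("write", "x", 1))     # entry 1: Wx
      r.append(2, ("read", "y"))         # entry 2: Ry
      r.apply_ready()
  assert all(r.state == {"x": 1} for r in replicas)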
Paxos
  a consensus protocol
  allows servers to agree on the operation for each log entry
    despite crashes and network failures (e.g. partitions)
  one Paxos instance per log entry
    Paxos instances for many log entries might run concurrently

overview
  going to need two phases
    can't just send "are you willing to commit to this value?"
    maybe every server commits to a different value, no agreement
  so: prepare, and accept
  "chosen": a majority has accepted the value
    i.e. sent accept_ok
    majority means at most one value can be chosen
  exchange:
    leader                        acceptors
      prepare(n)        ->
                        <-        prepare_ok(n_a, v_a)
      accept(n, v')     ->
                        <-        accept_ok(n)
  why n?
    may need multiple rounds, e.g. if a leader crashes
    want later rounds to supersede earlier ones
    the numbers allow us to compare early/late
  the crucial property:
    it may not be immediately apparent that a value has been chosen
      maybe a majority received and accepted it, but the leader crashed or msgs were lost
    we cannot change our minds!
      if a value was chosen, any subsequent choice must be the same value
      maybe a different leader &c, but the same value!
    so: the leader doesn't send out a value with prepare
      acceptors send back any value they have already accepted
      if there is one, the leader proposes that, to avoid changing an existing choice
      if no value is already accepted, the leader can propose any value (e.g. a client request)
    the leader must get prepare_ok from a majority
      to guarantee intersection with any majority formed by an existing choice

now the protocol

leader:
  choose n (higher than any n seen so far)
  send prepare(n) to all servers (including self)
  if prepare_ok(n_a, v_a) from a majority:
    v' = v_a with highest n_a, or the leader's own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from a majority:
      done! chosen value is v'

acceptor state: must persist across reboots
  n_p (highest prepare seen)
  n_a, v_a (highest accept seen)

acceptor prepare(n) handler:
  if n > n_p
    n_p = n
    reply prepare_ok(n_a, v_a)

acceptor accept(n, v) handler:
  if n >= n_p
    n_a = n
    v_a = v
    reply accept_ok(n)

commit point is when the f+1'st acceptor records n_a/v_a in stable storage

how to find the chosen value
  ordinarily, the winning leader can announce it after a majority of accept_oks
  other servers might time out waiting for the announcement
    can ask the servers for v_a, see if a majority has the same value
    if not, start another Paxos round
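Here is a compact, single-process sketch of the exchange and handlers above. The Acceptor class and run_round function are my naming, messages are plain function calls, and acceptor state is in memory; a real deployment needs RPC, timeouts, and stable storage for n_p/n_a/v_a.

  # One Paxos instance (one log entry), simulated in a single process.
  class Acceptor:
      def __init__(self):
          self.n_p = 0          # highest prepare seen
          self.n_a = 0          # highest accept seen
          self.v_a = None       # value of highest accept seen

      def prepare(self, n):
          if n > self.n_p:
              self.n_p = n
              return ("prepare_ok", self.n_a, self.v_a)
          return None           # ignore a stale prepare

      def accept(self, n, v):
          if n >= self.n_p:
              self.n_a = n
              self.v_a = v
              return ("accept_ok", n)
          return None           # ignore a stale accept

  def run_round(n, own_value, acceptors):
      """One leader round: prepare(n) to all, then accept(n, v') to all.
      Returns the value accepted by a majority, or None."""
      majority = len(acceptors) // 2 + 1

      # phase 1: prepare
      oks = [r for r in (a.prepare(n) for a in acceptors) if r is not None]
      if len(oks) < majority:
          return None
      # if any acceptor already accepted a value, propose the one with the
      # highest n_a; otherwise the leader may propose its own value
      already = [(n_a, v_a) for (_, n_a, v_a) in oks if v_a is not None]
      v = max(already, key=lambda t: t[0])[1] if already else own_value

      # phase 2: accept
      acks = [r for r in (a.accept(n, v) for a in acceptors) if r is not None]
      return v if len(acks) >= majority else None

  acceptors = [Acceptor() for _ in range(3)]
  print(run_round(10, "grant lock to C1", acceptors))   # -> grant lock to C1
  # a later leader with a higher n is forced to re-propose the chosen value:
  print(run_round(11, "grant lock to C2", acceptors))   # -> grant lock to C1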
if one leader and nothing fails, Paxos clearly reaches agreement

what if more than one leader? this is the key danger
  i.e. two different servers proposing a client operation for log record #142
    e.g. due to network failure
  say the two leaders used different n, 10 and 11
  two examples:

example 1:
  S1: p10 p11 a10v10 a11v11
  S2: ""
  S3: ""
  the risk:
    all nodes replied to 11's prepare
    none reported a value (hadn't yet seen 10's accepts)
    so 11 thinks it's free to propose its own value!
    then 10 sends its accepts with its value
    then 11 sends its accepts with a different value
    if this could happen, we would change our minds about chosen values!
  but it's OK:
    n_p and the "if" in accept() cause nodes to ignore 10's accepts
    after they have seen 11's prepare

example 2:
  S1: p10 a10v10        a11v10
  S2: p10 a10v10  p11   a11v10
  S3: p10         p11   a11v10
  the risk:
    10 crashed just after sending its accepts, but a majority arrived
    11 doesn't know it, sends out its prepares
  11's majority of prepare_oks must include 10's majority of accepts
    so 11 will learn of 10's value in prepare_ok()
    and will choose the same value as 10
  conservative: 11 may use v10 even if only one server accepted v10

what if a server fails after receiving an accept?
  if it doesn't restart, maybe the leader doesn't get enough accept_oks;
    timeout, new leader
  if it does restart, it must remember v_a/n_a! (on disk)
    otherwise example 2 breaks: 11 may not see the chosen v10

what if a node reboots after sending prepare_ok?
  does it have to remember n_p on disk?
  yes! example 1 above requires this
    e.g. if some servers restart after getting p11,
    we don't want them to then accept a10v10...

summary
  use replication for fault-tolerant services
  danger: inconsistency
  tolerate it if you can (e.g. DNS)
  use replicated state machines if you must have consistency
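The summary's first option, tolerating inconsistency, was sketched earlier as eventual consistency with per-item timestamps. Here is a minimal illustration of "freshest timestamp wins" reconciliation between two replicas; the reconcile function and the sample records are hypothetical, and real DNS propagation works through TTLs and zone transfers rather than this exact exchange.

  # Minimal sketch of eventual consistency: each item carries a timestamp,
  # and replicas periodically compare items, both keeping the freshest copy.
  def reconcile(replica_a, replica_b):
      """For every key, both replicas keep the copy with the newest timestamp."""
      for key in set(replica_a) | set(replica_b):
          a = replica_a.get(key)          # (timestamp, value) or None
          b = replica_b.get(key)
          freshest = max((x for x in (a, b) if x is not None), key=lambda t: t[0])
          replica_a[key] = freshest
          replica_b[key] = freshest

  # two replicas that diverged during a partition (hypothetical data):
  server1 = {"www.example.com": (100, "18.26.0.1")}
  server2 = {"www.example.com": (175, "18.26.0.2")}   # newer write
  reconcile(server1, server2)
  assert server1 == server2 == {"www.example.com": (175, "18.26.0.2")}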