6.033 2010 Lecture 20: Replication

Fault-tolerance
  so far: crash recovery for a single server
  what about continuing despite failures?
    need multiple servers
  useful for DNS, file server, master for two-phase commit

Replication
  [diagram: multiple clients, two servers]
  if client sees both are up: send request to both
  if one is down: just send to the other one

What can go wrong w/ replication?
  client can't know if a server is down!
    only thing a client knows is whether the server responds to messages
    if not, is the server down, or is the network broken?
  example: network partition
    [diagram]
    server in the other partition looks like it's down (doesn't respond)
    some clients use one server, some the other
  result: inconsistency!
    different clients see different state
    same client may see state switching back and forth!

Easiest plan: tolerate inconsistency (don't care)
  idea: eventual consistency
    put a time-stamp on each item of data
    servers periodically compare time-stamps, both keep the freshest
  this is essentially how DNS works
    a new/modified/deleted record may not take effect for hours or days
    replicas, caches
  this is a very common approach
  works best with a single writer

What if multiple writers?
  one approach: write a copy and apologize later
    e.g. Unison
    conflicts?
  "optimistic replication"

Suppose you need consistency
  e.g. a replicated fault-tolerant lock server
    better not give the same lock to two clients!
  first ask before you update a copy
  "pessimistic replication"

consistency goal: "single-copy consistency"
  from the outside it looks like one copy of the data
    each request sees the results of previous requests
  of course, inside there may be disagreement
    failures, network delays, &c

idea: majority
  remember: clients may disagree about which servers are up
    e.g. if partitioned (or worse)
  what if we have *three* servers, not two?
  require a client to contact any two of the three
    otherwise the client must wait
  can proceed despite one server failure
  if client C1 does an op, then C2, C2 will see C1's operation
    on at least one of the two servers it contacts
    that is, any two majorities intersect at at least one node
  [diagram: three servers]
  handles partition as a special case
    if two servers in one partition, one in the other,
    clients can only make progress in the two-server partition

idea: replicated state machine
  a method to keep replicas the same
    same starting state
    same inputs, in same order
    deterministic operations
    -> same final state

replicated state with logs
  [diagram: three logs]
         S1   S2   S3
    1    Wx   Wx   Wx
    2    Ry   Ry   Ry
    ...
  represent the operations as a log, with a copy in each server
  log entries are client operations
  the entries are numbered

How to ensure replicas all execute client ops in the same order?
  [diagram: clients, leader, other servers]
  primary-copy:
    clients send operations to one of the servers (the leader)
    leader runs one client operation at a time
    leader picks a log entry number
    sends the operation to all servers w/ that number
    when a majority have it, reply to the client
      wouldn't want to reply for an operation that didn't really happen

reminder why it's tricky
  leader may crash in the middle of an operation
    can other servers disagree on whether the operation happened?
  two servers may think they are the leader
    maybe due to network problems or lost packets
    will they append different "next operations" to their logs?

let's just think about a single entry
  e.g. log entry #140
  ok for a server to not know the operation for #140
  all the servers that do know should agree on what operation #140 is
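Before turning to how servers agree, here is a minimal sketch of the replicated-state-machine idea above: replicas that apply the same numbered log of deterministic operations, in order, end up in the same state. The names (Replica, append, apply_ready) and the tiny key/value state are illustrative, not part of the lecture.

  # Minimal sketch of a replicated state machine: each replica applies the
  # same numbered log of deterministic operations, so all replicas end up
  # with the same state.
  class Replica:
      def __init__(self):
          self.state = {}          # key -> value
          self.log = {}            # entry number -> operation
          self.next_entry = 1      # entries must be applied strictly in order

      def append(self, n, op):
          """Record op as log entry number n (as the leader would instruct)."""
          self.log[n] = op

      def apply_ready(self):
          """Apply consecutive entries; stop at the first gap in the log."""
          while self.next_entry in self.log:
              kind, *args = self.log[self.next_entry]
              if kind == "write":        # ("write", key, value)
                  key, value = args
                  self.state[key] = value
              # ("read", key) is deterministic and leaves the state unchanged
              self.next_entry += 1

  # same start state + same ops in same order + determinism => same final state
  replicas = [Replica() for _ in range(3)]
  for r in replicas:
      r.append(1, ("write", "x", 1))     # entry 1: Wx
      r.append(2, ("read", "y"))         # entry 2: Ry
      r.apply_ready()
  assert all(r.state == {"x": 1} for r in replicas)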
Paxos
  a consensus protocol
  allows servers to agree on the operation for each log entry
    despite crashes and network failures (e.g. partitions)
  one Paxos instance per log entry
    Paxos instances for many log entries might run concurrently

overview
  going to need two phases
    can't just send "are you willing to commit to this value?"
    maybe every server commits to a different value, no agreement
  so: prepare, and accept
  "chosen": a majority has accepted the value
    i.e. sent accept_ok
    majority means at most one value can be chosen
  exchange:
    leader                        acceptors
      prepare(n)        ->
                        <-        prepare_ok(n_a, v_a)
      accept(n, v')     ->
                        <-        accept_ok(n)
  why n?
    may need multiple rounds, e.g. if a leader crashes
    want later rounds to supersede earlier ones
    the numbers allow us to compare early/late
  the crucial property:
    it may not be immediately apparent that a value has been chosen
      maybe a majority received and accepted it, but the leader crashed or msgs were lost
    we cannot change our minds!
      if a value was chosen, any subsequent choice must be the same value
      maybe a different leader &c, but the same value!
    so: the leader doesn't send out a value with prepare
      acceptors send back any value they have already accepted
      if there is one, the leader proposes that, to avoid changing an existing choice
      if no value is already accepted, the leader can propose any value (e.g. a client request)
    the leader must get prepare_ok from a majority
      to guarantee intersection with any majority formed by an existing choice

now the protocol

leader:
  choose n (higher than any n seen so far)
  send prepare(n) to all servers (including self)
  if prepare_ok(n_a, v_a) from a majority:
    v' = v_a with highest n_a, or the leader's own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from a majority:
      done! chosen value is v'

acceptor state: must persist across reboots
  n_p (highest prepare seen)
  n_a, v_a (highest accept seen)

acceptor prepare(n) handler:
  if n > n_p
    n_p = n
    reply prepare_ok(n_a, v_a)

acceptor accept(n, v) handler:
  if n >= n_p
    n_a = n
    v_a = v
    reply accept_ok(n)

commit point is when the f+1'st acceptor records n_a/v_a in stable storage

how to find the chosen value
  ordinarily, the winning leader can announce it after a majority of accept_oks
  other servers might time out waiting for the announcement
    can ask the servers for v_a, see if a majority has the same value
    if not, start another Paxos round
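Here is a compact, single-process sketch of the exchange and handlers above. The Acceptor class and run_round function are my naming, messages are plain function calls, and acceptor state is in memory; a real deployment needs RPC, timeouts, and stable storage for n_p/n_a/v_a.

  # One Paxos instance (one log entry), simulated in a single process.
  class Acceptor:
      def __init__(self):
          self.n_p = 0          # highest prepare seen
          self.n_a = 0          # highest accept seen
          self.v_a = None       # value of highest accept seen

      def prepare(self, n):
          if n > self.n_p:
              self.n_p = n
              return ("prepare_ok", self.n_a, self.v_a)
          return None           # ignore a stale prepare

      def accept(self, n, v):
          if n >= self.n_p:
              self.n_a = n
              self.v_a = v
              return ("accept_ok", n)
          return None           # ignore a stale accept

  def run_round(n, own_value, acceptors):
      """One leader round: prepare(n) to all, then accept(n, v') to all.
      Returns the value accepted by a majority, or None."""
      majority = len(acceptors) // 2 + 1

      # phase 1: prepare
      oks = [r for r in (a.prepare(n) for a in acceptors) if r is not None]
      if len(oks) < majority:
          return None
      # if any acceptor already accepted a value, propose the one with the
      # highest n_a; otherwise the leader may propose its own value
      already = [(n_a, v_a) for (_, n_a, v_a) in oks if v_a is not None]
      v = max(already, key=lambda t: t[0])[1] if already else own_value

      # phase 2: accept
      acks = [r for r in (a.accept(n, v) for a in acceptors) if r is not None]
      return v if len(acks) >= majority else None

  acceptors = [Acceptor() for _ in range(3)]
  print(run_round(10, "grant lock to C1", acceptors))   # -> grant lock to C1
  # a later leader with a higher n is forced to re-propose the chosen value:
  print(run_round(11, "grant lock to C2", acceptors))   # -> grant lock to C1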
if one leader and nothing fails, Paxos clearly reaches agreement

what if more than one leader? this is the key danger
  i.e. two different servers proposing a client operation for log record #142
    e.g. due to network failure
  say the two leaders used different n, 10 and 11
  two examples:

example 1:
  S1: p10 p11 a10v10 a11v11
  S2: ""
  S3: ""
  the risk:
    all nodes replied to 11's prepare
    none reported a value (hadn't yet seen 10's accepts)
    so 11 thinks it's free to propose its own value!
    then 10 sends its accepts with its value
    then 11 sends its accepts with a different value
    if this could happen, we would change our minds about chosen values!
  but it's OK:
    n_p and the "if" in accept() cause nodes to ignore 10's accepts
    after they have seen 11's prepare

example 2:
  S1: p10 a10v10        a11v10
  S2: p10 a10v10  p11   a11v10
  S3: p10         p11   a11v10
  the risk:
    10 crashed just after sending its accepts, but a majority arrived
    11 doesn't know it, sends out its prepares
  11's majority of prepare_oks must include 10's majority of accepts
    so 11 will learn of 10's value in prepare_ok()
    and will choose the same value as 10
  conservative: 11 may use v10 even if only one server accepted v10

what if a server fails after receiving an accept?
  if it doesn't restart, maybe the leader doesn't get enough accept_oks;
    timeout, new leader
  if it does restart, it must remember v_a/n_a! (on disk)
    otherwise example 2 breaks: 11 may not see the chosen v10

what if a node reboots after sending prepare_ok?
  does it have to remember n_p on disk?
  yes! example 1 above requires this
    e.g. if some servers restart after getting p11,
    we don't want them to then accept a10v10...

summary
  use replication for fault-tolerant services
  danger: inconsistency
  tolerate it if you can (e.g. DNS)
  use replicated state machines if you must have consistency
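The summary's first option, tolerating inconsistency, was sketched earlier as eventual consistency with per-item timestamps. Here is a minimal illustration of "freshest timestamp wins" reconciliation between two replicas; the reconcile function and the sample records are hypothetical, and real DNS propagation works through TTLs and zone transfers rather than this exact exchange.

  # Minimal sketch of eventual consistency: each item carries a timestamp,
  # and replicas periodically compare items, both keeping the freshest copy.
  def reconcile(replica_a, replica_b):
      """For every key, both replicas keep the copy with the newest timestamp."""
      for key in set(replica_a) | set(replica_b):
          a = replica_a.get(key)          # (timestamp, value) or None
          b = replica_b.get(key)
          freshest = max((x for x in (a, b) if x is not None), key=lambda t: t[0])
          replica_a[key] = freshest
          replica_b[key] = freshest

  # two replicas that diverged during a partition (hypothetical data):
  server1 = {"www.example.com": (100, "18.26.0.1")}
  server2 = {"www.example.com": (175, "18.26.0.2")}   # newer write
  reconcile(server1, server2)
  assert server1 == server2 == {"www.example.com": (175, "18.26.0.2")}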