6.033 2013 Lecture 19: replicated state machine

Plan:
  Consistency semantics
  Replicated state machine
  Use view server to reach consensus

Fault-tolerance
  Goal: building reliable systems from unreliable components.
  So far: transactions for crash-recovery on a single server.
    Important to recover from failures.
  How to continue despite failures?
  General plan: multiple servers, replication.
    Already seen some cases: DNS, RAID, ..
  This week: how to handle harder cases.
    E.g., replicated storage (e.g., mail server, file server, etc.)
    E.g., replicated master for 2PC
    ..
  Often shows up in data center infrastructure at Google, Facebook, Microsoft, etc.

Ideal goal for replicated system: single-copy consistency.
  Property of the externally-visible behavior of a replicated system.
  Operations appear to execute as if there's only a single copy of the data.
    Internally, there may be failures or disagreement, which we have to mask.
  Similar to how we defined the serializability goal ("as if executed serially").
  Many systems give up on this ideal semantics, but some need it.
    E.g., expensive to achieve in wide-area networks:
      DNS and PNUTS settle for weaker semantics.

Replicating a server (e.g., a bank account server).
  [ diagram: two clients, two servers ]
  Strawman: clients send requests to both servers.
  Tolerating faults: if one server is down, clients send to the other.
  Limit ourselves to two servers for now.

[ demo: replicating a server that tracks bank accounts ]
  Two terminals, one on top of another, both with rxvt-font.sh xft:Monospace-18
  top% cd rsm
  top% less srv.py
  top% ./srv.py /tmp/a 5001
  bottom% less client.py
  bottom% ./client.py 5001
  ^C
  bottom% cat /tmp/a
  ^C
  top% rm /tmp/a
  top% ./srv.py /tmp/a 5001 &
  top% ./srv.py /tmp/b 5002 &
  bottom% ./client.py 5001 5002
  ^C
  bottom% cat /tmp/a
  bottom% cat /tmp/b
  ## multiple clients: what goes wrong?
  bottom% ./client.py 5001 5002 &
  bottom% ./client.py 5001 5002
  ^S when first error appears
  ^Q to resume
  ^C
  bottom% fg
  ^C

Problem: replicas can become inconsistent.
  Issue: clients' requests to different servers can arrive in different orders.
  How do we ensure the servers remain consistent?

Replicated state machines
  A general approach to making consistent replicas of a server:
  - Start with the same initial state on each server.
  - Provide each replica with the same input operations, in the same order.
  - Ensure all operations are deterministic.
    E.g., no randomness, no reading of current time, etc.
  These rules ensure each server will end up in the same final state.
  Assumption: independent failure of replicas.
    But:
    - Replicated bugs
    - Shared network
    ...
    Easy to end up with a "fault-tolerant" system that is not actually fault-tolerant.
      E.g., Amazon EC2 failure, Gmail failure, etc.
      http://research.microsoft.com/pubs/191008/FailureRecoveryBeEvil.pdf

Simple implementation of RSM: replicated logs
  We stick to the two-replica example for now.
  [ diagram: each server has a log consisting of Wx, Ry ]
  Log contains client operations, including both writes and reads.
  Log entries are numbered.
  Key issue: agreeing on the order of operations.
  [ diagram: coordinator receives operations from clients, sends to replicas ]
  Coordinator handles one client operation at a time.
  Coordinator chooses an order for all operations (assigns log sequence numbers).
  Coordinator issues the operation to each replica.
  When is it OK to reply to the client?
    Must wait for a majority of replicas to reply.
    Otherwise, if the minority that received the op crashes,
      the remaining servers may continue without it.
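
The demo's coord.py is not reproduced here; the following is a minimal, self-contained
sketch of the coordinator's ordering logic under the assumptions above. In-memory
Replica objects stand in for RPCs to real replicas, and all names are illustrative
rather than the demo's actual code.

  class Replica:
      def __init__(self):
          self.log = {}                             # log sequence number -> operation

      def accept(self, seq, op):
          self.log[seq] = op                        # append op at the assigned slot
          return True                               # acknowledge

  class Coordinator:
      def __init__(self, replicas):
          self.replicas = replicas
          self.next_seq = 0                         # next log sequence number to assign

      def handle_client_op(self, op):
          # One client operation at a time: choose its position in the log.
          seq = self.next_seq
          self.next_seq += 1

          # Issue the operation to every replica and count acknowledgments.
          acks = sum(1 for r in self.replicas if r.accept(seq, op))

          # Reply to the client only once a majority has the op; otherwise a
          # crash of the few replicas that got it could leave the rest without it.
          if acks > len(self.replicas) // 2:
              return "OK"
          return "RETRY"

  coord = Coordinator([Replica(), Replica()])
  print(coord.handle_client_op("deposit(a, 10)"))   # -> OK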
Example 2-node RSM:
  Primary server: coordinator + server
  Backup server: server

[ demo: coordinator ]
  top% kill %1
  top% kill %2
  top% rm /tmp/a /tmp/b
  top% ./srv.py /tmp/a 5001 &
  top% ./srv.py /tmp/b 5002 &
  top% ./coord.py 5099 5001 5002
  bottom% ./client.py 5099 &
  bottom% ./client.py 5099 &
  bottom% jobs
  bottom% kill %1
  bottom% kill %2

Deterministic operations
  Coordinator performs non-deterministic operations and sends results to backup.

This prototype RSM can handle only a fail-stop primary failure.
What wider classes of failure would we like to handle?
  Temporary or permanent loss of connectivity.
  Network partitions == can't know if a server is crashed or just not reachable.

Split-brain syndrome
  What would happen if the RSM saw these failures?
    Clients, primary, backup might not agree about who is primary,
      or about whether the backup is alive (and thus whether it is up to date).
    Result: two primaries.
      E.g., some clients switch to the backup, others don't,
      or one client switches back and forth.
    They don't see each other's updates!
  "Split brain" does *not* achieve the goal of acting like a single server.
  The example RSM is *incorrect* if there are network problems!

More goals:
  Reach consensus on who is primary, despite network partitions.
  Replacement of failed servers.

An idea: view server
  Consensus/agreement: the "view server" decides who is primary and who is backup.
    Clients and servers ask the view server; they don't make independent decisions.
    Only one view server, which avoids multiple machines independently deciding
      who is primary.
  Repair: the view server can co-opt an "idle" server as backup after the old
    backup becomes primary; the primary initializes the new backup's state.
  The tricky part:
    1. Only one primary!
    2. The primary must have the state!
    We will work out some rules to ensure these.

View server maintains a sequence of "views":
    view #, primary, backup
    1: S1 S2
    2: S2 --
    3: S2 S3
  Monitors server liveness:
    Each server periodically sends a Ping RPC.
    "Dead" if it missed N Pings in a row.
    "Live" after a single Ping.
  There can be more than two servers Pinging the view server;
    if more than two, the extras are "idle" servers.
  If the primary is dead:
    new view with the previous backup as primary.
  If the backup is dead, or there is no backup:
    new view with a previously idle server as backup.
  OK to have a view with just a primary, and no backup,
    but -- if an idle server is available, make it the backup.

How to ensure only one server acts as primary?
  Even though more than one may *think* it is primary.
  "Acts as" == executes and responds to client requests.
  The basic idea:
    1: S1 S2
    2: S2 --
    S1 still thinks it is primary.
    S1 must forward ops to S2.
    S2 thinks S2 is primary, so S2 can reject S1's forwarded ops.
  The rules:
    1. Non-backup must reject forwarded requests.
    2. Primary in view i must have been primary or backup in view i-1.
    3. Non-primary must reject direct client requests.
    4. Primary must wait for the backup to accept each request.
  Example:
    1: S1, S2
    View server stops hearing Pings from S1:
    2: S2, --
    It may be a while before S2 hears about view #2.
    Before S2 hears about view #2:
      S1 can process ops from clients; S2 will accept forwarded requests.
      S2 will reject ops from clients who have heard about view #2.
    After S2 hears:
      If S1 receives a client request, it will forward it; S2 will reject it,
        so S1 can no longer act as primary.
      S1 will send an error to the client; the client will ask the view server
        for the new view, and will re-send the request to S2.
    The true moment of switch-over occurs when S2 hears about view #2.
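
To make rules 1, 3, and 4 concrete, here is a sketch of one replica's request
handling (rule 2 is the view server's job, enforced when it chooses views).
Views are (view #, primary, backup) tuples from the view server, RPCs are
replaced by direct method calls on the peer object, and all names are
illustrative rather than the demo's code.

  class Server:
      def __init__(self, name):
          self.name = name
          self.state = {}                           # e.g., account -> balance
          self.view = (0, None, None)               # last view this server heard about

      def learn_view(self, view, primary_server=None):
          viewnum, primary, backup = view
          if backup == self.name and self.view[2] != self.name and primary_server:
              # Just became backup: get a complete copy of the state from the
              # primary before accepting any forwarded requests.
              self.state = dict(primary_server.state)
          self.view = view

      def handle_client(self, op, backup_server=None):
          viewnum, primary, backup = self.view
          if primary != self.name:
              return "ERROR: not primary"           # rule 3: non-primary rejects clients
          if backup is not None:
              # Rule 4: primary must wait for the backup to accept each request.
              if backup_server.handle_forward(op) != "OK":
                  # Backup rejected the forward: a newer view has made us non-primary.
                  return "ERROR: ask the view server for the new view"
          return self.apply(op)

      def handle_forward(self, op):
          viewnum, primary, backup = self.view
          if backup != self.name:
              return "REJECT"                       # rule 1: non-backup rejects forwards
          return self.apply(op)

      def apply(self, op):
          # Apply a deterministic operation to local state (details elided).
          return "OK"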
How can a new backup get the state?
  If S2 is the backup in view i but was not in view i-1,
    S2 should ask the primary to transfer the complete state.

Limitations of our example RSM with a view server:
  The view server may fail.
  See next lecture: distributed consensus with Paxos.

---- more stuff ---

How to ensure the new primary has an up-to-date replica of the state?
  Only promote the previous backup,
    i.e. don't make an idle server the primary.
  But what if the backup hasn't had time to acquire the state?

How to avoid promoting a state-less backup? Example:
  1: S1 S2
    S1 stops pinging the view server
  2: S2 S3
    S2 *immediately* stops pinging
  3: S3 --
  Potential mistake: maybe S3 never got the state from S2.
  Better to stay with 2/S2/S3; maybe S2 will revive.

How can the view server know it's OK to change views?
  One answer: the primary in each view must acknowledge that view to the view server.
  The view server must stay with the current view until it is acknowledged,
    even if the primary seems to have failed.
  No point in proceeding, since not acked == the backup may not be initialized.
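
Pulling the view server's pieces together, a minimal sketch of its liveness
tracking, view changes, and the acknowledgment rule might look like the
following (a counterpart to the Server sketch above; DEAD_AFTER, the method
names, and the Ping signature are illustrative assumptions, and a real version
would sit behind RPC).

  DEAD_AFTER = 5                                    # "dead" after this many missed Pings

  class ViewServer:
      def __init__(self):
          self.viewnum = 0
          self.primary = None
          self.backup = None
          self.acked = 0                            # highest view # acked by its primary
          self.missed = {}                          # server -> consecutive missed Pings

      def ping(self, server, acked_viewnum=0):
          self.missed[server] = 0                   # "live" again after a single Ping
          if server == self.primary:
              self.acked = max(self.acked, acked_viewnum)
          if self.primary is None:                  # bootstrap: first server is primary
              self.primary = server
              self.viewnum += 1
          return (self.viewnum, self.primary, self.backup)

      def tick(self):
          # Called once per Ping interval: age the counters, then check liveness.
          for s in self.missed:
              self.missed[s] += 1
          dead = lambda s: self.missed[s] >= DEAD_AFTER

          # Stay with the current view until its primary has acknowledged it,
          # even if that primary seems dead: not acked == the backup may not
          # be initialized, so there is no point in proceeding.
          if self.acked < self.viewnum:
              return

          if self.primary is not None and dead(self.primary) and self.backup is not None:
              # Promote only the previous backup, never an idle server, so the
              # new primary is guaranteed to have the state.
              self.primary, self.backup = self.backup, None
              self.viewnum += 1
          elif self.backup is not None and dead(self.backup):
              self.backup = None
              self.viewnum += 1
          elif self.primary is not None and self.backup is None:
              idle = [s for s in self.missed if not dead(s) and s != self.primary]
              if idle:                              # co-opt a live idle server as backup
                  self.backup = idle[0]
                  self.viewnum += 1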