# -*- mode: org -*-
#+STARTUP: indent

6.033 2011 Lecture 15: Availability

* 6.033 so far: client/server
  fault isolation by avoiding propagation
  but only benign mistakes (programming errors), and no recovery plan
** rest of semester: keep running despite failures
   broaden the class of failures to include malicious ones

* Threats:
  software faults (e.g., blue screen)
  hardware (e.g., disk failed)
  design (e.g., UI: typo in airport code results in crash; Therac-25)
  operation (e.g., AT&T outage)
  environment (e.g., tsunami)

* Fault-tolerant systems
  reliable systems from unreliable components
  redundancy (replication, error-correcting codes, multiple paths)
  latent fault -> active fault -> error -> failure -> fault (at the next layer) -> etc.
  failure = a component fails to produce the intended result at its interface

* Examples of fault-tolerant systems:
  DNS: replicate name servers
  BGP: alternate path
  TCP: resend packet

* Approach to designing fault-tolerant systems:
  1. identify possible faults
  2. detect & contain (client/server, checksum, etc.)
  3. decide on a plan for handling the fault:
     - do nothing
     - fail-fast
     - fail-safe
     - mask (through redundancy in space or time)
  difficult process, because you must identify *all possible* faults
  easy to miss some possibilities, which then bite you later
  iterative process

* Metric: MTTF and MTBF
  MTTF = mean time to failure
  MTTR = mean time to repair
  MTBF = mean time between failures = MTTF + MTTR
  availability = MTTF / MTBF
  (worked sketch at the end of these notes)

* Examples of availability
  see slide

* Let's look at the MTTF of a disk and how we can recover
  fails reasonably often (see slide)
  building block for fault-tolerant systems

* Software systems fault tolerance:
  split state into two parts:
    non-persistent state (can be safely abandoned on failure)
    persistent state (necessary to recover)
  --> all important state stored persistently on disk

* MTTF of a disk is high!
  longer than the existence of the manufacturer -- how is that possible?
  run 1,000 disks for 3,000 hours; 10 fail -> failure rate is 1 per 300,000 hours
  MTTF = 1 / failure rate  (see the sketch at the end of these notes)
  is that a reasonable inversion?  is the failure rate memoryless?

* Bathtub graph
  no -- the quoted failure rate is computed for the bottom part of the tub
  is aging a real issue?  yes, see graph

* How do we make a fault-tolerant disk?
  1. fail-fast disk (discover faults)
  2. replication: costs us two disks (see tomorrow)
     need to replicate the interconnect between disks, controllers, etc. (see HP slide)
  (fail-fast/mirroring sketch at the end of these notes)

* Software is complicated -> bugs!
  ==> fault-tolerant software systems rely on the correctness of the software inside the disk!
  how to write bug-free software?
  split code into what matters and what doesn't matter:
    most software on the computer doesn't matter; the code in the disk controller does
  for code that does matter, use a stringent development process
  hard part: bugs in the specification
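
* Worked sketch: availability from MTTF and MTTR
  A minimal sketch of the availability arithmetic above (availability = MTTF / MTBF,
  where MTBF = MTTF + MTTR).  The MTTF and MTTR values are illustrative, not from the
  lecture slides.
  #+BEGIN_SRC python
  def availability(mttf_hours, mttr_hours):
      """availability = MTTF / MTBF, where MTBF = MTTF + MTTR."""
      mtbf = mttf_hours + mttr_hours
      return mttf_hours / mtbf

  # Illustrative numbers: a disk with 300,000-hour MTTF that takes 10 hours to replace.
  a = availability(300_000, 10)
  print(f"availability  = {a:.6f}")                          # ~0.999967
  print(f"downtime/year = {(1 - a) * 365 * 24:.2f} hours")   # ~0.29 hours
  #+END_SRC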
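
* Worked sketch: estimating disk MTTF from a failure rate
  The 1,000-disks-for-3,000-hours calculation above, written out.  Inverting the failure
  rate into an MTTF is only sound if the rate is constant (memoryless), i.e., for the
  flat bottom of the bathtub curve.
  #+BEGIN_SRC python
  disks = 1_000
  hours = 3_000
  failures = 10

  disk_hours = disks * hours            # 3,000,000 disk-hours observed
  failure_rate = failures / disk_hours  # 1 failure per 300,000 hours
  mttf = 1 / failure_rate               # only valid if the failure rate is memoryless

  print(f"failure rate   = 1 per {mttf:,.0f} hours")
  print(f"estimated MTTF = {mttf:,.0f} hours (~{mttf / (24 * 365):.0f} years)")
  #+END_SRC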
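
* Sketch: fail-fast reads and masking with a mirror
  A hypothetical sketch of the fail-fast and replication ideas above: each block is
  stored with a checksum so corruption is detected (fail-fast) rather than returned
  silently, and a mirrored copy masks the error.  The block format and function names
  are invented for illustration; a real disk does this in the controller.
  #+BEGIN_SRC python
  import zlib

  def write_block(store, addr, data):
      # Keep a CRC next to the data so later corruption can be detected.
      store[addr] = (data, zlib.crc32(data))

  def read_block(store, addr):
      # Fail-fast: raise an error on a checksum mismatch instead of
      # silently returning bad data.
      data, crc = store[addr]
      if zlib.crc32(data) != crc:
          raise IOError(f"checksum mismatch at block {addr}")
      return data

  def read_mirrored(stores, addr):
      # Mask a fail-fast error by falling back to a replica.
      last_error = None
      for store in stores:
          try:
              return read_block(store, addr)
          except (IOError, KeyError) as e:
              last_error = e
      raise last_error  # every replica failed: the error becomes a failure

  # Usage: two "disks" (dicts) holding the same block.
  disk_a, disk_b = {}, {}
  for d in (disk_a, disk_b):
      write_block(d, 7, b"important persistent state")
  disk_a[7] = (b"garbled", disk_a[7][1])     # simulate corruption on one disk
  print(read_mirrored([disk_a, disk_b], 7))  # still returns the good copy
  #+END_SRC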