# -*- mode: org -*-
#+STARTUP: indent

6.033 2011 Lecture 15: Availability

* 6.033 so far: client/server
  fault isolation by avoiding propagation
  but only benign mistakes (programming errors), and no recovery plan
** rest of semester: keep running despite failures
   broaden the class of failures to include malicious ones

* Threats:
  software faults (e.g., blue screen)
  hardware (e.g., disk failed)
  design (e.g., UI: typo in airport code results in crash; Therac-25)
  operation (e.g., AT&T outage)
  environment (e.g., tsunami)

* Fault-tolerant systems
  reliable systems from unreliable components
  redundancy (replication, error-correcting codes, multiple paths)
  latent fault -> active fault -> error -> failure -> fault (at the next layer) -> etc.
  failure = a component fails to produce the intended result at its interface

* Examples of fault-tolerant systems:
  DNS: replicate name servers
  BGP: alternate path
  TCP: resend packet

* Approach to designing fault-tolerant systems:
  1. identify possible faults
  2. detect & contain (client/server, checksum, etc.)
  3. decide on a plan for handling the fault:
     - do nothing
     - fail-fast
     - fail-safe
     - mask (through redundancy in space or time)
  difficult process, because you must identify *all possible* faults
  easy to miss some possibilities, which then bite you later
  iterative process

* Metric: MTTF and MTBF
  MTTF = mean time to failure
  MTTR = mean time to repair
  MTBF = mean time between failures = MTTF + MTTR
  availability = MTTF / MTBF
  (worked sketch at the end of these notes)

* Examples of availability
  see slide

* Let's look at the MTTF of a disk and how we can recover
  fails reasonably often (see slide)
  building block for fault-tolerant systems

* Software systems fault tolerance:
  split state into two parts:
    non-persistent state (can be safely abandoned on failure)
    persistent state (necessary to recover)
  --> all important state stored persistently on disk

* MTTF of a disk is high!
  longer than the existence of the manufacturer -- how is that possible?
  run 1,000 disks for 3,000 hours; 10 fail -> failure rate is 1 per 300,000 hours
  MTTF = 1 / failure rate  (see the sketch at the end of these notes)
  is that a reasonable inversion?  is the failure rate memoryless?

* Bathtub graph
  no -- the quoted failure rate is computed for the bottom part of the tub
  is aging a real issue?  yes, see graph

* How do we make a fault-tolerant disk?
  1. fail-fast disk (discover faults)
  2. replication: costs us two disks (see tomorrow)
     need to replicate the interconnect between disks, controllers, etc. (see HP slide)
  (fail-fast/mirroring sketch at the end of these notes)

* Software is complicated -> bugs!
  ==> fault-tolerant software systems rely on the correctness of the software inside the disk!
  how to write bug-free software?
  split code into what matters and what doesn't matter:
    most software on the computer doesn't matter; the code in the disk controller does
  for code that does matter, use a stringent development process
  hard part: bugs in the specification
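
* Worked sketch: availability from MTTF and MTTR
  A minimal sketch of the availability arithmetic above (availability = MTTF / MTBF,
  where MTBF = MTTF + MTTR).  The MTTF and MTTR values are illustrative, not from the
  lecture slides.
  #+BEGIN_SRC python
  def availability(mttf_hours, mttr_hours):
      """availability = MTTF / MTBF, where MTBF = MTTF + MTTR."""
      mtbf = mttf_hours + mttr_hours
      return mttf_hours / mtbf

  # Illustrative numbers: a disk with 300,000-hour MTTF that takes 10 hours to replace.
  a = availability(300_000, 10)
  print(f"availability  = {a:.6f}")                          # ~0.999967
  print(f"downtime/year = {(1 - a) * 365 * 24:.2f} hours")   # ~0.29 hours
  #+END_SRC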
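
* Worked sketch: estimating disk MTTF from a failure rate
  The 1,000-disks-for-3,000-hours calculation above, written out.  Inverting the failure
  rate into an MTTF is only sound if the rate is constant (memoryless), i.e., for the
  flat bottom of the bathtub curve.
  #+BEGIN_SRC python
  disks = 1_000
  hours = 3_000
  failures = 10

  disk_hours = disks * hours            # 3,000,000 disk-hours observed
  failure_rate = failures / disk_hours  # 1 failure per 300,000 hours
  mttf = 1 / failure_rate               # only valid if the failure rate is memoryless

  print(f"failure rate   = 1 per {mttf:,.0f} hours")
  print(f"estimated MTTF = {mttf:,.0f} hours (~{mttf / (24 * 365):.0f} years)")
  #+END_SRC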
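
* Sketch: fail-fast reads and masking with a mirror
  A hypothetical sketch of the fail-fast and replication ideas above: each block is
  stored with a checksum so corruption is detected (fail-fast) rather than returned
  silently, and a mirrored copy masks the error.  The block format and function names
  are invented for illustration; a real disk does this in the controller.
  #+BEGIN_SRC python
  import zlib

  def write_block(store, addr, data):
      # Keep a CRC next to the data so later corruption can be detected.
      store[addr] = (data, zlib.crc32(data))

  def read_block(store, addr):
      # Fail-fast: raise an error on a checksum mismatch instead of
      # silently returning bad data.
      data, crc = store[addr]
      if zlib.crc32(data) != crc:
          raise IOError(f"checksum mismatch at block {addr}")
      return data

  def read_mirrored(stores, addr):
      # Mask a fail-fast error by falling back to a replica.
      last_error = None
      for store in stores:
          try:
              return read_block(store, addr)
          except (IOError, KeyError) as e:
              last_error = e
      raise last_error  # every replica failed: the error becomes a failure

  # Usage: two "disks" (dicts) holding the same block.
  disk_a, disk_b = {}, {}
  for d in (disk_a, disk_b):
      write_block(d, 7, b"important persistent state")
  disk_a[7] = (b"garbled", disk_a[7][1])     # simulate corruption on one disk
  print(read_mirrored([disk_a, disk_b], 7))  # still returns the good copy
  #+END_SRC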