
6.033--Computer System Engineering

Suggestions for classroom discussion of:

Jim [N.] Gray and Daniel P. Siewiorek. High-availability computer systems. Computer 24, 9 (September, 1991) pages 39-48.

by J. H. Saltzer, April 24, 1996, from earlier notes; updated 4/9/98.


As a prelude to discussing this paper, it might be useful to review the difference between faults, errors, and failures.

When the lectures move to the subject of atomicity and transactions, it will be appropriate to point out that Gray and Siewiorek use slightly different terminology in that area from that of 6.033.

(Discussion guides on those two topics can be found in the general recitation topics list.)

  • We have seen a lot of different definitions of modules and objects in various papers. In this paper, what is a module? (The unit of failure, fault containment, and repair.)

  • The paper talks about MTTF and MTTR. Other papers talk about MTBF.

    What is the relation?

    {MTTF = Mean Time To Failure
     MTTR = Mean Time To Repair
     MTBF = Mean Time Between Failure
    
     MTBF = MTTF + MTTR
    
    The model:
    
                repair                            repair
                  |                                 |
                  v                                 v
    ---------------------------------------------------------------> time
           |                                   |
           v                                   v
        failure                             failure
    
           |<---- Time Between Failure ------->|
        -->|      |<-- Time To Repair
                  |<-------Time To Failure --->|
    
    MTTF is a statistic of the distribution of TTF's. Its value is estimated by observing a large number of TTF's and calculating the average. It may also be predicted, starting with failure rates of system components and making assumptions about failure independence. Similarly for TTR and TBF.
                          time system is actually available
    Availability =      ---------------------------------------
                         time system is intended to be available
    
    If the system has N failure events during the time it is supposed to be available, we can say
         (time system is actually available)/N  = MTTF
    and
         (time system is intended to be available)/N = MTBF
    
    and thus
    
                    MTTF        MTBF - MTTR            MTTF
    Availability =  ----    =   -----------    =   -----------
                    MTBF           MTBF            MTTF + MTTR
    
    One will find every possible combination of these parameters in different papers, but they are almost always defined as above.)
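
    As a minimal sketch of this bookkeeping (Python; the outage log below is invented purely for illustration):

        # Hypothetical outage log: (time of failure, time repair completed), in hours.
        outages = [(700.0, 702.0), (1500.0, 1501.0), (2300.0, 2305.0)]
        observation_period = 3000.0          # hours the system was intended to be available

        n = len(outages)
        downtime = sum(up - down for down, up in outages)
        uptime = observation_period - downtime

        mttr = downtime / n                  # Mean Time To Repair
        mttf = uptime / n                    # Mean Time To Failure
        mtbf = observation_period / n        # Mean Time Between Failure

        assert abs(mtbf - (mttf + mttr)) < 1e-9       # MTBF = MTTF + MTTR
        availability = mttf / (mttf + mttr)           # same as uptime / observation_period
        print(mttf, mttr, mtbf, availability)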

  • Is it the same thing to calculate MTTF using statistics for one system for a long time, and using statistics for a large ensemble of systems for a short time? (In general, no. In order to do so, you would have to assume that the single system is at equilibrium (that is, not time-varying), and that the ensemble of systems all have the same characteristics. And even then, you can't prove that the two results are the same. You can hypothesize that they are the same; this assumption is known as the ergodic hypothesis.)

  • Disk vendors ship disks with a claimed 300,000-hour (about 34-year) MTTF even though they haven't been in business that long. And these disks usually fail shortly after their five-year warranty period is up. Explain. (They are advertising an ensemble average. The ergodic hypothesis does not apply.)

  • MTTF calculations

    There is a place where one can trip up by blindly applying the analysis approach that seems to be suggested in the Gray & Siewiorek paper. They rather casually toss around the MTTF's from different causes and extrapolate the expected effects of improving one of the causes. The casual tossing about is justified only if the failure processes are independent.

    Consider a system which has been observed for several years to have a hardware crash an average of every two weeks and a software crash an average of every six weeks. Suppose that the time to repair is zero. The composite MTTF is 1.5 weeks, determined most easily by considering what happens if we run the system for, say, 60 weeks. During that time we would expect to see

         10 software failures
         30 hardware failures
        ___
         40 system failures in 60 weeks --> 1.5 weeks between failure
    
    New hardware is installed, identical to the old except that it never fails. The MTTF should jump to 6 weeks, because the only failures are software, right?

    (Not unless the software failure process is known to be *independent* of the hardware failure process.

    Suppose the software failure process is that there is a bug (fault) in the clock updating routine: The bug always crashes the system exactly 420 hours (2 1/2 weeks) after it is booted--if it gets a chance to run that long.

    The old hardware was causing crashes so often that the software bug only occasionally got a chance to do its thing (in G & S terms, to become effective)--only about once every six weeks. Most of the time the recovery from the hardware failure, which involved restarting the system, had the side-effect of resetting the process that triggered the software bug.

    And when the new hardware is installed the system will have an MTTF of only 2.5 weeks, much less than was hoped.

    MTTF's are useful, but one must be sure to understand what assumptions go into their measurement and use. Gray and Siewiorek never make the mistake suggested above, because they always qualify their statements with explicit assumptions of independence. But it is important that the reader of the paper fully comprehend the implications of those qualifications. And it would be a good idea to ask whether or not it is a reasonable assumption that the various failure processes are independent.)
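
    For anyone who wants to experiment with this, here is a rough simulation sketch (Python). The exponential model for the hardware crashes and the specific constants are assumptions; they only roughly reproduce the 2-week/6-week split described above, but they show the punch line: removing the hardware failures raises the MTTF to 2.5 weeks, not 6.

        import random

        HOURS_PER_WEEK = 168.0
        HW_MTTF = 2 * HOURS_PER_WEEK       # hardware crashes: exponential, mean two weeks
        SW_BUG_AT = 420.0                  # the clock bug always crashes 420 hours after boot
        SIM_HOURS = 500000.0               # observe for a long time

        def mean_weeks_between_crashes(hw_mttf):
            # Repair time is zero; every crash reboots the system, and the
            # reboot resets the clock bug's 420-hour fuse.
            random.seed(0)
            t, crashes = 0.0, 0
            while t < SIM_HOURS:
                next_hw = random.expovariate(1.0 / hw_mttf) if hw_mttf else float("inf")
                t += min(next_hw, SW_BUG_AT)   # whichever failure comes first ends this uptime
                crashes += 1
            return (t / crashes) / HOURS_PER_WEEK

        print(mean_weeks_between_crashes(HW_MTTF))   # roughly 1.4 weeks with the old hardware
        print(mean_weeks_between_crashes(None))      # exactly 2.5 weeks with perfect hardware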

  • Why is this paper [page 42, middle column, last paragraph] talking about the number of control wires required to connect up a disk? (Because connectors are a major source of faults, and technology that reduces the number of wires and connectors can have a big impact on reducing the rate of failure.)

  • Gray and Siewiorek suggest at the top of page 44 that airplanes with three engines cost three times as much and have three times as many engine problems as airplanes with one engine. The MTTF will be 1/3 of the single-engine MTTF. So why would anyone consider building a three-engine airplane? (If the single-engine plane has a 1/1000 chance of failing while crossing the Atlantic, the three-engine plane that can limp home on two has a 3/1,000,000 chance of not making it, an improvement that makes it worth lugging the extra two engines along and overhauling them every 1000 hours. And as a bonus, when all three engines are working the plane goes faster.)

  • G & S say things will be worse. The calculation we just did says it is an important improvement. What is going on here? (Both are correct. The MTTF is lower, and the reliability is higher, all at the same time. MTTF doesn't tell you everything you need to know.

    There are two quite different uses of TMR. G & S come from a database world, where the system is intended to run much longer than the MTTF of any one module. Since TMR without repair reduces the MTTF, it looks like a loser. Adding repair hugely increases MTTF, and since it is usually feasible to make repairs, they consider only that approach. In the airplane and space-probe world, the system usually is intended to run for a much shorter time than the MTTF of any module, and in-flight repair isn't usually practical. In this case TMR makes a huge difference in the probability of mission success. But it only pays off if the basic reliability of the engines is very high to start with. Engine mechanics are trained to do what G & S call valid construction. Programmers usually don't bother.

    Here is the mathematics behind it, provided by Bill Dally:

    System reliability is generally expressed as:

       R(t) = P(working(time = t) | working(time = 0)).
    
    If the reliability of a simplex system is R, the reliability of a TMR system Rx is:
       Rx = R^3 + 3R^2(1-R) = 3R^2 - 2R^3
    
    If we assume an exponential (Poisson) failure process, R = exp(-t/MTTF), then integrating Rx over all time gives the MTTF of the TMR system:
       MTTFx = (3/2 - 2/3) * MTTF = (5/6) * MTTF
    
    which is the value given by G & S in table 2. This confirms their observation that a triple-modular system can be expected to fail sooner than the corresponding single-module system.

    Example: Boeing 727 with three 10000-hour engines.

        MTT(first failure)    = 3333 hours
        MTT(second failure)   = 5000 hours after the first failure (memoryless)
        MTT(system fails)     = 8333 hours = 5/6 of 10000 hours
    
    Conclusion: Don't try to fly in a 727 for 10000 hours continuously. But 6-hour flights over the Atlantic Ocean are a different matter.

    For values of t that are small compared with the MTTF, R is very close to one, in which case

         Rx = 3R^2 - 2R^3 >> R   (t << MTTF)
    
    Actually, both Rx and R are close to one; what is really being said is that Rx is much closer to one than R. To see this, replace R with (1 - a), letting a be a measure of how close R is to one. Then we have
         Rx = 3(1-a)^2 - 2(1-a)^3 = 1 - 3a^2 + 2a^3
    
    When a is close to zero, the third term can be ignored, and we see that for times short compared with the MTTF the reliability of the single module compared with the reliability of the TMR system is:
        single module:      R  = 1 - a
        TMR:                Rx = 1 - 3a^2
    
    The rule of thumb corresponding to the small-t approximation is that if R has N leading nines in its representation (e.g., 0.999), Rx has 2N leading nines (the actual value corresponding to 0.999 is 0.999997). Thus for applications like a satellite with limited propellant that will run out long before the MTTF, or a space vehicle that only needs to make one 30-minute flight from North Dakota, this gives a much better probability that the system works when it needs to.)
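
    A quick numeric check of these formulas (Python; the 10,000-hour engine MTTF and the 6-hour flight are the 727 figures above):

        import math

        def r_simplex(t, mttf):
            # Exponential (Poisson) reliability of one module.
            return math.exp(-t / mttf)

        def r_tmr(r):
            # Reliability of a TMR system given simplex reliability r.
            return 3 * r**2 - 2 * r**3

        MTTF = 10000.0                        # hours per engine

        # MTTF of the TMR system = integral of Rx(t) dt = (3/2 - 2/3)*MTTF = (5/6)*MTTF
        # (crude Riemann sum in 1-hour steps)
        mttf_tmr = sum(r_tmr(r_simplex(t, MTTF)) for t in range(200000))
        print(mttf_tmr)                       # about 8333 hours, i.e. 5/6 of 10000

        # Short mission (t << MTTF): R has N leading nines, Rx has about 2N.
        r = r_simplex(6.0, MTTF)              # a 6-hour Atlantic crossing
        print(1 - r, 1 - r_tmr(r))            # ~6e-4 versus ~1e-6 chance of failure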

  • The MTTF equations of table 2 have a typographical error; in the last equation (concerning Triplex plus repair) the MTTR should be squared. Here is a set of explanations for the equations that was assembled by Carl Manning in 1994. (In the last equation, Carl's 6 has been replaced with a 3; there are two ways that both other modules can fail before repair to the first is accomplished.)
        Simplex    Fails when module fails
                   --> SystemMTTF = ModuleMTTF (= 1yr)
    
        Duplex     Fails when either module fails
                   --> SystemMTTF ~= ModuleMTTF/2
                   (SystemMTTF is MTTF to first failure.  Assume Poisson
                   model;  with two units expect 2 failures in 1yr, or 1
                   per half year.)
    
        Triplex    Fails when 2 modules have failed
                   --> SystemMTTF ~= ModuleMTTF/3 + ModuleMTTF/2
                   = ModuleMTTF(5/6)
                   (Expected time to 1st failure, THEN expected time to
                   2nd.  Poisson model, so history doesn't matter.)
    
        Pair&Spare fails when both duplex supermodules have failed
                   --> SystemMTTF ~= ModuleMTTF/4 + ModuleMTTF/2
                   = ModuleMTTF(3/4)
                   (Expected time to first supermodule failure, THEN time
                   to 2nd.  First supermodule fails when any module fails.)
    
        Duplex+Fix fails when 1st unit fails (like duplex), AND THEN the 2nd
                   fails before the 1st is fixed (during the MTTR downtime of
                   the first).  Thus, consider only those vulnerable windows,
                   of average length MTTR, which occur on average every
                   ModuleMTTF/2: how long until they add up to ModuleMTTF?
                   --> SystemMTTF ~= (ModuleMTTF/2)*(ModuleMTTF/MTTR)
    
        Triplex+Fix fails when one unit fails, then both others fail
                   before the first can be fixed (each does so within the
                   MTTR repair window with probability ~ MTTR/ModuleMTTF).
                   --> SystemMTTF ~= (ModuleMTTF/3)*(ModuleMTTF/MTTR)^2
                                  ~= ModuleMTTF^3/(3*MTTR^2)
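
    To make the comparison concrete, here is a sketch that just evaluates these approximations (Python; the one-year module MTTF follows the note above, while the four-hour MTTR is an assumed illustrative value, not a number from the paper):

        MODULE_MTTF = 365 * 24.0     # one year, in hours
        MTTR = 4.0                   # hours; assumed for illustration

        table = {
            "Simplex":      MODULE_MTTF,
            "Duplex":       MODULE_MTTF / 2,
            "Triplex":      MODULE_MTTF * 5 / 6,
            "Pair&Spare":   MODULE_MTTF * 3 / 4,
            "Duplex+Fix":   MODULE_MTTF**2 / (2 * MTTR),
            "Triplex+Fix":  MODULE_MTTF**3 / (3 * MTTR**2),
        }
        for name, system_mttf in table.items():
            print(name, round(system_mttf / (365 * 24.0), 1), "years")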
    
  • There is a subtle change of ground rules encountered when moving down the rows of table 2. The first three lines assume that modules are not self-checking (that is, not fail-fast). In the triplex equation, for example, the supermodule fails when the second component module fails.

    The pair and spare architecture explicitly involves two pairs of two modules each, with the pairs set up with a comparator so they are individually fail-fast; the total cost is 4 modules plus joinery.

    The Duplex plus repair and Triplex plus repair equations assume that there are two (or three) fail-fast modules with a voter that ignores the output of failed modules. Thus the Triplex plus repair supermodule fails only if all three modules have failed before the first failed module is repaired. If we assume that fail-fast modules are created the same way they were for Pair & Spare (by using two modules with a comparator), then the numbers in the cost column should be 4+ and 6+, respectively.

  • What is the relation between fail-fast module design and RAID levels 3 and above? (RAID level 3 is based on the idea that the disk controller can tell when the disk isn't working, so it isn't necessary for the RAID system to figure it out. In terms of G & S, RAID levels 3 and above assume fail-fast disks.)

  • How about implementing a fail-fast voter? That is, if a voter sees that two modules agree and the third doesn't, it marks the third module DOWN and ignores its output until it hears that a repair has been made. Does this help? (With three modules, it doesn't help. But with five it does. Consider three alternatives (it helps to draw a picture):
    failures     failvote       failfast        failfast
                                 modules          voter
    
       0            OK             OK              OK
       1            OK             OK              OK
       2            OK             OK              OK
       3           fail            OK              OK
       4           fail            OK             fail
       5           fail           fail            fail
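
    A tiny sketch of the same comparison (Python). The n-module generalization below is an extrapolation of the table's reasoning, not something stated in the paper:

        def failures_tolerated(n):
            # Maximum module failures survived under each voting discipline.
            return {
                "failvote":         (n - 1) // 2,   # needs a majority of all n to agree
                "failfast modules": n - 1,          # voter counts only modules reporting healthy
                "failfast voter":   n - 2,          # needs two up modules to outvote a newly bad one
            }

        print(failures_tolerated(3))   # failfast voter tolerates 1, same as failvote: no help
        print(failures_tolerated(5))   # {'failvote': 2, 'failfast modules': 4, 'failfast voter': 3}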
    
  • What is the purpose of the recursive design in figure 2? (Because a voter can fail, too. If you assume that modules have inputs as well as outputs, we have to think about stringing modules together. A simple voter takes the output of three modules, computes a composite output, then feeds that output into all three inputs for the next stage. The "recursive" design in effect gives each next module its own voter.)

    [The next three questions were suggested by Carl Manning in 1994]

  • Use the terminology and concepts of this paper to describe how the Ethernet builds a reliable system out of unreliable (but cheap) hardware.
    (  - fail-fast -- jamming when collision detected
       - error recovery -- retry after collision (with random backoff)
       - self-checks -- CRC checks on packets)
    
  • Use the terminology and concepts of this paper to describe how Grapevine builds a reliable system out of unreliable hardware.
    (  - System "pairs" -- replicated servers
       - software modules -- e.g., registry and mail servers communicate
         only by message passing.)
    
  • How should a Therac-25, the Space Shuttle control system, and an airline reservation system differ in their approach to fault-tolerance? (Therac should emphasize detection and fail-fast design. Space Shuttle, once mission begins, should emphasize failure masking to assure mission success. An airline reservation system might tolerate occasional user-visible failures if the result is higher availability.)

Comments and suggestions: Saltzer@mit.edu