Jim [N.] Gray and Daniel P. Siewiorek. High-availability computer systems. Computer 24, 9 (September, 1991) pages 39-48.
by J. H. Saltzer, April 24, 1996, from earlier notes; updated 4/9/98.
When the lectures move to the subject of atomicity and transactions, it will be appropriate to point out that Gray and Siewiorek use slightly different terminology in that area from that of 6.033.
(Discussion guides on those two topics can be found in the general recitation topics list.)
What is the relation?
(MTTF = Mean Time To Failure
MTTR = Mean Time To Repair
MTBF = Mean Time Between Failures

MTBF = MTTF + MTTR

The model:

            repair                                 repair
              |                                      |
              v                                      v
  ---------------------------------------------------------------> time
         |                                      |
         v                                      v
      failure                                failure

         |<-------- Time Between Failure ------>|
      -->|    |<-- Time To Repair
              |<-------- Time To Failure ------>|

MTTF is a statistic of the distribution of TTF's. Its value is estimated by observing a large number of TTF's and calculating the average. It may also be predicted, starting with failure rates of system components and making assumptions about failure independence. Similarly for TTR and TBF.

                      time system is actually available
    Availability = ---------------------------------------
                   time system is intended to be available

If the system has N failure events during the time it is supposed to be available, we can say

    (time system is actually available)/N = MTTF
and
    (time system is intended to be available)/N = MTBF
and thus
                   MTTF     MTBF - MTTR        MTTF
    Availability = ---- = ------------- = -------------
                   MTBF        MTBF        MTTF + MTTR

One will find every possible combination of these parameters in different papers, but they are almost always defined as above.)
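As a quick check of the availability relation, here is a minimal Python sketch. The function name and the sample figures (1000 hours to failure, 1 hour to repair) are illustrative assumptions, not values from the paper.

```python
def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / MTBF, where MTBF = MTTF + MTTR.

    Both arguments must be in the same time unit.
    """
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    mtbf = mttf + mttr
    return mttf / mtbf


# Hypothetical system: fails on average every 1000 hours, takes 1 hour to repair.
print(f"{availability(1000.0, 1.0):.5f}")  # prints 0.99900
```

Note that with zero repair time the availability is 1 no matter how often the system fails, which is why the next discussion can set repair time to zero and talk about MTTF alone.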
There is a place where one can trip up by blindly applying the analysis approach that seems to be suggested in the Gray & Siewiorek paper. They rather casually toss around the MTTF's from different causes and extrapolate the expected effects of improving one of the causes. The casual tossing about is justified only if the failure processes are independent.
Consider a system which has been observed for several years to have a hardware crash an average of every two weeks and a software crash an average of every six weeks. Suppose that the time to repair is zero. The composite MTTF is 1.5 weeks, determined most easily by considering what happens if we run the system for, say, 60 weeks. During that time we would expect to see
    10 software failures
    30 hardware failures
    ___
    40 system failures in 60 weeks --> 1.5 weeks between failures

New hardware is installed, identical to the old except that it never fails. The MTTF should jump to 6 weeks, because the only failures are software, right?
(Not unless the software failure process is known to be *independent* of the hardware failure process.
Suppose the software failure process is that there is a bug (fault) in the clock updating routine: The bug always crashes the system exactly 420 hours (2 1/2 weeks) after it is booted--if it gets a chance to run that long.
The old hardware was causing crashes so often that the software bug only occasionally got a chance to do its thing (in G & S terms, to become effective)--only about once every six weeks. Most of the time the recovery from the hardware failure, which involved restarting the system, had the side-effect of resetting the process that triggered the software bug.
So when the new hardware is installed, the bug gets its chance after every boot, and the system will have an MTTF of only 2.5 weeks, much less than the 6 weeks that was hoped for.
MTTF's are useful, but one must be sure to understand what assumptions go into their measurement and use. Gray and Siewiorek never make the mistake suggested above, because they always qualify their statements with explicit assumptions of independence. But it is important that the reader of the paper fully comprehend the implications of those qualifications. And it would be a good idea to ask whether or not it is a reasonable assumption that the various failure processes are independent.)
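The contrast between the independent and the dependent model can be sketched numerically. The following Python is an illustration under stated assumptions: hardware failures are taken to be exponentially distributed (the text does not say), repair time is zero as in the example, and both function names are invented here. Under the exponential assumption the exact figures need not match the rounded, illustrative numbers in the text.

```python
import math


def composite_mttf_independent(mttfs):
    """Composite MTTF when the failure processes are independent: rates add."""
    return 1.0 / sum(1.0 / m for m in mttfs)


def mttf_with_boot_timer_bug(hw_mttf, bug_delay):
    """MTTF when a bug crashes the system exactly bug_delay after each boot.

    Assumption: hardware crashes are exponential with mean hw_mttf, and every
    crash reboots the system and so resets the bug's timer. Pass math.inf for
    hardware that never fails.
    """
    if math.isinf(hw_mttf):
        return bug_delay
    # Expected value of min(X, bug_delay) for X exponential with mean hw_mttf.
    return hw_mttf * (1.0 - math.exp(-bug_delay / hw_mttf))


# All times in weeks.
print(composite_mttf_independent([2.0, 6.0]))   # naive model, old hardware: about 1.5
print(composite_mttf_independent([6.0]))        # naive prediction, new hardware: 6.0
print(mttf_with_boot_timer_bug(2.0, 2.5))       # dependent model, old hardware
print(mttf_with_boot_timer_bug(math.inf, 2.5))  # dependent model, new hardware: 2.5
```

The naive model predicts a jump to 6 weeks; the dependent model tops out at the bug's 2.5-week timer.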
There are two quite different uses of TMR. G & S come from a database world, where the system is intended to run much longer than the MTTF of any one module. Since TMR without repair reduces the MTTF, it looks like a loser. Adding repair hugely increases MTTF, and since it is usually feasible to make repairs, they consider only that approach. In the airplane and space-probe world, the system usually is intended to run for a much shorter time than the MTTF of any module, and in-flight repair isn't usually practical. In this case TMR makes a huge difference in the probability of mission success. But it only pays off if the basic reliability of the engines is very high to start with. Engine mechanics are trained to do what G & S call valid construction. Programmers usually don't bother.
Here is the mathematics behind it, provided by Bill Dally:
System reliability is generally expressed as:
    R(t) = P(working(time = t) | working(time = 0)).

If the reliability of a simplex system is R, the reliability of a TMR system Rx is:

    Rx = R^3 + 3R^2(1-R) = 3R^2 - 2R^3

If we assume exponentially distributed failure times (a Poisson failure process), R = exp(-t/MTTF), and integrating Rx over all t gives the MTTF of the TMR system, in units of the module MTTF:

    MTTFx = 3/2 - 2/3 = 5/6

which is the value given by G & S in table 2. This confirms their observation that a triple-modular system can be expected to fail sooner than the corresponding single-module system.
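For readers who want the integration step spelled out, here it is as a worked equation. The symbol \lambda = 1/MTTF is shorthand introduced here and does not appear in the notes.

```latex
\begin{aligned}
R(t) &= e^{-\lambda t}, \qquad \lambda = 1/\mathrm{MTTF} \\
\mathrm{MTTF}_x &= \int_0^\infty R_x(t)\,dt
   = \int_0^\infty \left( 3e^{-2\lambda t} - 2e^{-3\lambda t} \right) dt \\
  &= \frac{3}{2\lambda} - \frac{2}{3\lambda}
   = \left( \frac{3}{2} - \frac{2}{3} \right) \mathrm{MTTF}
   = \frac{5}{6}\,\mathrm{MTTF}
\end{aligned}
```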
Example: Boeing 727 with three 10000-hour engines.
    MTT(first failure)  = 3333 hours
    MTT(second failure) = 5000 hours after the first failure (memoryless)
    MTT(system fails)   = 8333 hours = 5/6 of 10000 hours

Conclusion: Don't try to fly in a 727 for 10000 hours continuously. But 6-hour flights over the Atlantic Ocean are a different matter.
For values of t that are small compared with the MTTF, R is very close to one, in which case
    Rx = 3R^2 - 2R^3 >> R     (t << MTTF)

Actually, both Rx and R are close to one; what is really being said is that Rx is much closer to one than R. To see this, replace R with (1 - a), letting a be a measure of how close R is to one. Then we have

    Rx = 3(1-a)^2 - 2(1-a)^3 = 1 - 3a^2 + 2a^3

When a is close to zero, the third term can be ignored, and we see that for times short compared with the MTTF the reliability of the single module compared with the reliability of the TMR system is:

    single module:  R  = 1 - a
    TMR:            Rx = 1 - 3a^2

The rule of thumb corresponding to the small-t approximation is that if R has N leading nines in its representation (e.g., 0.999), Rx has about 2N leading nines (the actual value corresponding to 0.999 is 0.999997). Thus for applications like a satellite with limited propellant which will run out long before MTTF or a space vehicle that only needs to make one 30 minute flight from North Dakota, this gives us much better probability that the system works when it needs to.)
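The short-mission and long-mission cases can be compared with a few lines of Python. This is a sketch under the same exponential-failure assumption used above; the function names are invented here, and the 6-hour flight with 10000-hour modules is borrowed from the 727 example.

```python
import math


def simplex_reliability(t, mttf):
    """R(t) = exp(-t/MTTF), assuming exponentially distributed failure times."""
    return math.exp(-t / mttf)


def tmr_reliability(r):
    """Reliability of a TMR system built from modules of reliability r."""
    return 3 * r**2 - 2 * r**3


# Short mission: 6-hour flight with 10000-hour modules.
r_short = simplex_reliability(6.0, 10_000.0)
print(f"short: simplex failure prob {1 - r_short:.2e}, "
      f"TMR failure prob {1 - tmr_reliability(r_short):.2e}")

# Long mission: run for one full module MTTF.
r_long = simplex_reliability(10_000.0, 10_000.0)
print(f"long:  simplex R = {r_long:.3f}, TMR Rx = {tmr_reliability(r_long):.3f}")
```

For the short mission the TMR failure probability is smaller by several orders of magnitude; for the long one TMR is less reliable than a single module. The two curves cross at R = 0.5.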
Simplex
    Fails when module fails
    --> SystemMTTF = ModuleMTTF (= 1yr)

Duplex
    Fails when either module fails
    --> SystemMTTF ~= ModuleMTTF/2
    (SystemMTTF is MTTF to first failure. Assume Poisson model; with two units expect 2 failures in 1yr, or 1 per half year.)

Triplex
    Fails when 2 modules have failed
    --> SystemMTTF ~= ModuleMTTF/3 + ModuleMTTF/2 = ModuleMTTF(5/6)
    (Expected time to 1st failure, THEN expected time to 2nd. Poisson model, so history doesn't matter.)

Pair&Spare
    Fails when both duplex supermodules have failed
    --> SystemMTTF ~= ModuleMTTF/4 + ModuleMTTF/2 = ModuleMTTF(3/4)
    (Expected time to first supermodule failure, THEN time to 2nd. First supermodule fails when any module fails.)

Duplex+Fix
    Fails when 1st unit fails (like duplex), AND THEN 2nd fails before 1st fixed (during the MTTR downtime of the first). Thus, consider only those units of average time MTTR long which occur on average every ModuleMTTF/2: how long until they add up to ModuleMTTF?
    --> SystemMTTF ~= (ModuleMTTF/2)*(ModuleMTTF/MTTR)

Triplex+Fix
    Fails when one unit fails, then both others fail before first can be fixed.
    --> SystemMTTF ~= (ModuleMTTF/3)*(Duplex+FixMTTF/MTTR) ~= ModuleMTTF^3/(3*MTTR^2)
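To get a feel for the magnitudes, the approximations above can be tabulated. The sketch below simply evaluates the final expressions as stated in these notes; the function name is invented here, and the 4-hour MTTR is a hypothetical value chosen for illustration (only the 1-year module MTTF comes from the text).

```python
def system_mttf(module_mttf, mttr):
    """Approximate system MTTF for each configuration, per the formulas above.

    module_mttf and mttr must be in the same time unit.
    """
    m, r = module_mttf, mttr
    return {
        "Simplex": m,
        "Duplex": m / 2,
        "Triplex": m / 3 + m / 2,         # = (5/6) m
        "Pair&Spare": m / 4 + m / 2,      # = (3/4) m
        "Duplex+Fix": (m / 2) * (m / r),
        "Triplex+Fix": m**3 / (3 * r**2),
    }


HOURS_PER_YEAR = 8760.0
# Module MTTF of 1 year (from the text); MTTR of 4 hours is an assumption.
for name, years in system_mttf(1.0, 4.0 / HOURS_PER_YEAR).items():
    print(f"{name:12s} {years:14.2f} years")
```

Redundancy without repair shortens the system MTTF, while redundancy with repair lengthens it by factors of ModuleMTTF/MTTR.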
The pair and spare architecture explicitly involves two pairs of two modules each, with the pairs set up with a comparator so they are individually fail-fast; the total cost is 4 modules plus joinery.
The Duplex plus repair and Triplex plus repair equations assume that there are two (or three) fail-fast modules with a voter that ignores output of failed modules. Thus the Triplex plus repair supermodule fails only if all three modules have failed before the first failed module is repaired. If we assume that fail-fast modules are created the same way they were for Pair & Spare (by pairing two modules with a comparator) then the numbers in the cost column should be 4+ and 6+, respectively.
9. What is the relation between failfast module design and RAID levels 3 and above? (RAID level 3 is based on the idea that the disk controller can tell when the disk isn't working, so it isn't necessary for the RAID system to figure it out. In terms of G & S, RAID levels 3 and above assume fail-fast disks.)
    failures   failvote   failfast   failfast
                          modules    voter
       0          OK         OK         OK
       1          OK         OK         OK
       2          OK         OK         OK
       3         fail        OK         OK
       4         fail        OK        fail
       5         fail       fail       fail
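The table can be regenerated from three simple survival rules. The sketch below encodes one reading of the column headings, which is an assumption: failvote needs a majority of all modules, a failfast voter ignores failed modules but still needs two working modules to compare, and failfast modules let the system run on a single survivor. The module count of five is inferred from the failvote column, and the function name is invented here.

```python
def survives(n_modules, failures, scheme):
    """Return True if the system still works after the given number of failures."""
    working = n_modules - failures
    if scheme == "failvote":
        # Needs agreement among a majority of ALL modules.
        return working > n_modules / 2
    if scheme == "failfast voter":
        # Voter ignores failed modules, but needs two outputs to compare.
        return working >= 2
    if scheme == "failfast modules":
        # Each module reports its own failure, so one survivor is enough.
        return working >= 1
    raise ValueError(f"unknown scheme: {scheme}")


N = 5  # inferred from the table, not stated in the notes
schemes = ["failvote", "failfast modules", "failfast voter"]
for failures in range(N + 1):
    row = ["OK" if survives(N, failures, s) else "fail" for s in schemes]
    print(failures, *row)
```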
[The next three questions were suggested by Carl Manning in 1994]
(- fail-fast -- jamming when collision detected
 - error recovery -- retry after collision (with random backoff)
 - self-checks -- CRC checks on packets)
(- System "pairs" -- replicated servers
 - software modules -- e.g., registry and mail servers communicate only by message passing.)