6.033--Computer System Engineering

6.033 Design Project 2 Sensor Failure Analysis


Introduction

An exact analysis of sensor failure masking strategies for the Mars rovers in 6.033 design project 2 involves straightforward probability but moderately complex algebra. This note analyzes two measurement protocols, median and pair-and-compare. It also addresses a complication of the particular design of some sensors.

The problem setup

The median protocol with Type I sensors

The protocol: take three independent sensor readings and choose as the result of the experiment the median value.
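As a minimal sketch of the protocol just described, the following Python takes a fixed number of readings and returns their median. The name `median_protocol` and the `read_sensor` callable are illustrative assumptions, not part of the project specification.

```python
from statistics import median
from typing import Callable


def median_protocol(read_sensor: Callable[[], float], readings: int = 3) -> float:
    """Take `readings` independent sensor readings and return the median value."""
    values = [read_sensor() for _ in range(readings)]
    return median(values)


# Hypothetical readings: two working sensors near 20 and one failed sensor.
samples = iter([20.1, 87.3, 19.9])
print(median_protocol(lambda: next(samples)))  # prints 20.1
```

The wild reading 87.3 lands outside the two good values, so one of the working sensors supplies the median.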

This scheme is simple and the time it takes is fixed and thus can be predicted in advance. We first analyze the median protocol for Type I sensors, and then (with appropriate independence assumptions) apply a multiplicative correction to account for Type II sensors. Our goal is to calculate the probability that the median value, Vm, is close to the true value, V.

  1. Calculate Prob(protocol success) = Prob(Vm is close to V).

    If there are no failures, then Vm is certainly close to V.

    If one sensor fails, its random value either lands between the other two (correctly working) sensors or it doesn't. If its random value lands between, it is the median and Vm is close to V. If its value lands outside, one of the working sensors provides the median reading, and Vm is again close to V. So with a single sensor failure Vm is always close to V.

    So the only situation in which Vm is not close to V is when there is more than 1 bad sensor. Prob(more than 1 bad sensor) = Prob(2 or 3 bad sensors) = 3p^2(1-p) + p^3 = 3p^2 - 2p^3.

    There is a small chance that Vm is close to V even if it comes from a faulty sensor, and another small chance that Vm comes from the one good sensor, but let's ignore those two cases and get a lower bound:

    Prob(Vm is close to V) > 1 - 3p^2 + 2p^3

  2. Plugging in, e.g., p = .1, we get Prob(success) > (1 - .03 + .002) = 0.972, so we can meet the 95% specification by taking the median of three sensors, each of which produces accurate output with only 90% probability.
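The two steps above can be checked by brute force. This sketch enumerates all eight failure patterns of three independent sensors and compares the result with the closed form 3p^2 - 2p^3; the function names are illustrative assumptions.

```python
from itertools import product


def prob_two_or_more_failures(p: float) -> float:
    """Sum the probability of every 3-sensor pattern with at least two failures."""
    total = 0.0
    for pattern in product([False, True], repeat=3):  # True means "sensor failed"
        failures = sum(pattern)
        if failures >= 2:
            total += p ** failures * (1 - p) ** (3 - failures)
    return total


def closed_form(p: float) -> float:
    """The closed form for the same probability: 3p^2 - 2p^3."""
    return 3 * p ** 2 - 2 * p ** 3


p = 0.1
success_lower_bound = 1 - prob_two_or_more_failures(p)
print(round(success_lower_bound, 3))  # prints 0.972
```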

As p increases, one can make the protocol more robust by taking the median of more measurements. For example, if p = .2 then the median of 3 experiments gives us a protocol failure probability of about 10% (3p^2 - 2p^3 = 0.104), which isn't good enough. But the median of 7 experiments (median of 5 is insufficient), by the same reasoning as above, can lead us astray only if there are 4 or more sensor failures. This event happens with probability less than 35·p^4 = 0.056; the exact binomial sum is about 0.033, which is below 5%.
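To see how the failure probability shrinks as more readings are taken, one can sum the binomial tail directly. This is a sketch under the same assumption as the text, namely that the protocol is counted as failing whenever a majority of the sensors fail; `median_failure_prob` is an illustrative name.

```python
from math import comb


def median_failure_prob(n: int, p: float) -> float:
    """Probability that a majority of n independent sensors fail (n odd)."""
    majority = n // 2 + 1
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(majority, n + 1))


for n in (3, 5, 7):
    print(n, round(median_failure_prob(n, 0.2), 4))
```

For p = 0.2 this gives roughly 0.104 for n = 3, 0.058 for n = 5, and 0.033 for n = 7, so 7 is the smallest odd number of readings that gets below 5%.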

Type II sensors

The median protocol returns as its result the output of some single sensor, based on examining outputs of several sensors and concluding that the chosen sensor's output value is close to the correct value. As explained in the introduction, if the protocol chooses a sensor that failed but whose random output was close enough to allow it to be chosen, we should, for Type II sensors, identify the result as a failure, rather than a success, of the protocol.

To account for this case, we calculate the probability that the sensor whose output the median protocol chooses was a working sensor, given that it somehow generated an output value that was close to the correct value.

Let M = Pr[working sensor | output value within tolerance]

= Pr[working sensor and output value within tolerance]/Pr[output value within tolerance]

= Pr[working sensor]/Pr[output value within tolerance]

(Since a working sensor always yields output value within tolerance.)

= (1-p)/(Pr[working sensor and output value within tolerance] + Pr[non-working sensor and output value within tolerance])

= (1-p)/(1-p+Pr[non-working sensor]Pr[output value within tolerance | non-working sensor])

M = (1-p)/(1-p+pF)

Because

Pr[working sensor] = Pr[working sensor | output value within tolerance]Pr[output value within tolerance]
the probability M multiplicatively reduces the chance of success with respect to the Pr[output value within tolerance] that we calculated in the previous section. Again choosing F = 0.1, we can obtain the multiplier M for various values of p:
p      M
0.1    0.989
0.2    0.976
0.3    0.959
Since a multiplier below 0.95 will inevitably drag the protocol success rate below the NASA specification of 95%, and M falls below 0.95 once p exceeds roughly 0.34, sensors with failure rates much greater than 0.3 are unusable, no matter how many measurements we take.
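The multiplier is easy to tabulate. The sketch below evaluates M = (1-p)/(1-p+pF) and also inverts the formula to find the failure rate at which M reaches a given floor; the inversion p = (1-m)/(1-m+mF) is a rearrangement added here for illustration, and the function names are assumptions.

```python
def multiplier(p: float, F: float = 0.1) -> float:
    """M = Pr[working sensor | output value within tolerance]."""
    return (1 - p) / (1 - p + p * F)


def failure_rate_at_multiplier(m: float, F: float = 0.1) -> float:
    """Solve M(p) = m for p, obtained by rearranging the formula for M."""
    return (1 - m) / (1 - m + m * F)


for p in (0.1, 0.2, 0.3):
    print(f"{p:.1f}  {multiplier(p):.3f}")

print(round(failure_rate_at_multiplier(0.95), 3))  # prints 0.345
```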

The pair and compare protocol

The protocol: Take two sensor readings, getting values V1 and V2. If V1 and V2 are close, return the output of the first sensor. Otherwise, discard both values and try again.

This protocol has the feature that it may terminate after taking only two sensor readings, and thus produce results more quickly. But it may also run on for a long time, which makes setting timeouts problematic. In this case, the algebra is simplest if we assume Type II sensors from the outset.
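A minimal sketch of the protocol loop follows. The `read_sensor` callable, the `tolerance` parameter, and especially the optional `max_rounds` cap are illustrative assumptions; the protocol as stated simply retries until a pair agrees.

```python
from typing import Callable, Optional


def pair_and_compare(read_sensor: Callable[[], float],
                     tolerance: float,
                     max_rounds: Optional[int] = None) -> Optional[float]:
    """Read pairs until two readings agree; return the first reading of that pair.

    Returns None only if the hypothetical max_rounds cap is reached.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        v1 = read_sensor()
        v2 = read_sensor()
        if abs(v1 - v2) <= tolerance:
            return v1  # V1 and V2 are close: accept the first sensor's output
        rounds += 1    # otherwise discard both values and try again
    return None


# Hypothetical readings: the first pair disagrees, the second pair agrees.
samples = iter([20.1, 87.3, 19.9, 20.0])
print(pair_and_compare(lambda: next(samples), tolerance=0.5))  # prints 19.9
```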

A single pair-and-compare trial has three possible, mutually exclusive outcomes:

  1. We return the output value from a working sensor. Succeed on this try.
    Let Prob(Succeed on this try) = s
  2. We return the output value from a non-working sensor. Fail on this try.
    Let Prob(Fail on this try) = f
  3. We discard this pair of results and do a new, independent pair and compare. Try again.
    Then Prob(Try again) = r = 1-s-f.

Let Z = Prob(protocol fails). We can express Z in terms of s, f, and r by developing a recurrence relation. Since "Try again" restarts the entire protocol, with an outcome that is independent of the trial that just failed,

Z = Prob(Fail on this try) + Prob(Try again)·Z
solving,
Z = Prob(Fail on this try)/(1 - Prob(Try again))
Z = f/(1-r) = f/(s+f)
Conversely,
Prob(Protocol succeeds) = (1 - Z) = s/(s+f)
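The closed form can be sanity-checked by unrolling the recurrence: the protocol fails on round k with probability r^(k-1)·f, and summing over rounds should approach f/(s+f). The values of s and f below are arbitrary illustrative numbers, not results from this note.

```python
def failure_closed_form(s: float, f: float) -> float:
    """Z = f / (s + f), the solution of Z = f + r*Z with r = 1 - s - f."""
    return f / (s + f)


def failure_by_series(s: float, f: float, rounds: int = 200) -> float:
    """Add up the probability of failing on round 1, 2, ..., `rounds`."""
    r = 1 - s - f
    total = 0.0
    reach = 1.0  # probability that every earlier round ended in "try again"
    for _ in range(rounds):
        total += reach * f
        reach *= r
    return total


s, f = 0.81, 0.01  # hypothetical per-trial probabilities
print(failure_closed_form(s, f), failure_by_series(s, f))
```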

Rough and tumble solution

It is not hard to come up with a reasonably good upper bound (good in this case means low) on the probability of protocol failure. If that bound is below the NASA specification that the failure rate be lower than 5%, we are safe. First we calculate a quick and slightly rough bound, then we refine it.

Given that the probability of sensor failure is p, we can calculate the probability of each of the three outcomes.

  1. Determine an upper bound for f: Declare a trial failure if V1 is from a sensor that failed (probability p) and V1 and V2 are close. (This is an upper bound because V1 could accidentally be close to V, and this would be a success for a Type I sensor.) But if V1 is from a failed sensor, its value is random. So regardless of the value of V2, the probability that V1 is close to it is F. So f < pF.
  2. Now, get a lower bound for s: For sure, a particular trial succeeds if both sensors work, which happens with probability (1-p)^2. (This is a lower bound, because sometimes a trial will succeed even if only the first sensor, the one that produces V1, works.) Thus s > (1-p)^2.
  3. Combining, Z = f/(s+f) < f/s < pF/(1-p)^2.

This bound is useful unless p is quite large. Suppose that F = 0.1. Solving for p, the bound tells us that we can meet the NASA specification that Z < 0.05 if p < 0.268. We might be able to get along with even flakier sensors if we had a tighter bound. Let's find out.
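The threshold quoted above comes from setting pF/(1-p)^2 equal to the allowed failure rate and solving the resulting quadratic in p. The sketch below does that; the rearrangement into z·p^2 - (2z + F)·p + z = 0 is added here for illustration, and the function names are assumptions.

```python
from math import sqrt


def rough_bound(p: float, F: float = 0.1) -> float:
    """Upper bound on protocol failure: Z < pF / (1-p)^2."""
    return p * F / (1 - p) ** 2


def max_p_rough(z_max: float = 0.05, F: float = 0.1) -> float:
    """Smaller root of z*p^2 - (2z + F)*p + z = 0, where the bound equals z_max."""
    a, b, c = z_max, -(2 * z_max + F), z_max
    return (-b - sqrt(b * b - 4 * a * c)) / (2 * a)


print(round(max_p_rough(), 3))  # prints 0.268
```

With z_max = 0.05 and F = 0.1 the quadratic reduces to p^2 - 4p + 1 = 0, whose smaller root is 2 - sqrt(3), about 0.268.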

Getting a tighter bound

Let's get a tighter bound on s by doing a more precise analysis of the "succeed on this try" case.

  1. A particular trial succeeds if V1 is from a sensor that works and V2 is close to V1, whether or not the second sensor fails. We know that Prob(sensor works) = (1-p), so
    s = (1-p)·Prob(V2 is close to V1)
  2. The second sensor works correctly with probability (1-p). In this case, V2 is certain to be close to V1.
  3. The second sensor fails with probability p. In this case, V2 will be close to V1 with probability F. (Actually, this is a lower bound, because V1 might be near the edges of the sensor range.) Thus the probability of this case is greater than pF.
  4. Summing (2) and (3), Prob(V2 is close to V1) > (1-p) + pF. Then from (1) above,
    s > (1-p)(1-p+pF)
    and
    Z = f/(f+s) < pF/(pF + (1-p)(1-p+pF))
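To compare the two bounds numerically, the following sketch evaluates both for a few failure rates, assuming F = 0.1 as elsewhere in this note. The function names are illustrative.

```python
def rough_bound(p: float, F: float = 0.1) -> float:
    """Earlier bound: pF / (1-p)^2."""
    return p * F / (1 - p) ** 2


def tight_bound(p: float, F: float = 0.1) -> float:
    """Tighter bound: pF / (pF + (1-p)(1-p+pF))."""
    f_upper = p * F                      # upper bound on f
    s_lower = (1 - p) * (1 - p + p * F)  # lower bound on s
    return f_upper / (f_upper + s_lower)


for p in (0.1, 0.2, 0.3):
    print(f"p={p:.1f}  rough={rough_bound(p):.4f}  tight={tight_bound(p):.4f}")
```

For every p strictly between 0 and 1 the tighter bound is the smaller of the two, since its denominator is strictly larger.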

Wrap up

This equation looks formidable, but we can evaluate it for various values of p and F. Suppose again that the maximum ratio of tolerance to range is F = 0.1. If the maximum probability of sensor failure is p = .1, we have Z < .012, so our protocol has at most a 1.2% failure rate. If p = .3, the bound is Z < .055, about a 5.5% failure rate, which slightly exceeds NASA's 5% specification; the bound drops to 5% at about p = 0.28. Thus our repeated pair-and-compare protocol is guaranteed to meet specifications as long as the probability of sensor failure is less than about 0.28. Our earlier, loose bound calculation allowed a maximum sensor failure rate of 0.268, so the only reward for tightening the bound is to establish that NASA can get along with slightly flakier sensors.
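One way to locate the largest tolerable failure rate under the tighter bound is a bisection search, sketched below for F = 0.1 and a 5% limit. It relies on the bound increasing with p, which holds because pF grows while (1-p)(1-p+pF) shrinks; the function names and the search interval are assumptions made for this example.

```python
def tight_bound(p: float, F: float = 0.1) -> float:
    """Upper bound on Z: pF / (pF + (1-p)(1-p+pF))."""
    f_upper = p * F
    s_lower = (1 - p) * (1 - p + p * F)
    return f_upper / (f_upper + s_lower)


def largest_p_meeting_spec(z_max: float = 0.05, F: float = 0.1) -> float:
    """Bisect for the p at which the bound equals z_max."""
    lo, hi = 0.0, 0.9  # the bound is 0 at p = 0 and far above z_max at p = 0.9
    for _ in range(60):
        mid = (lo + hi) / 2
        if tight_bound(mid, F) < z_max:
            lo = mid
        else:
            hi = mid
    return lo


print(round(tight_bound(0.1), 3))           # prints 0.012
print(round(largest_p_meeting_spec(), 3))   # prints 0.282
```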

A conclusion

As we move downscale toward more unreliable sensors, both the N-median (for sufficiently large N) protocol and the pair-and-compare protocol fail at about the same point (p=.3), so neither has an advantage by that measure. The median protocol has the feature that the number of trials needed to achieve a given success rate can be exactly predicted in advance, which may be an advantage in setting timeouts. The pair-and-compare protocol has the feature that it may stop after the first trial, which may be an advantage in getting on to the next experiment.