6.033 Design Project 2 Sensor failure analysis

6.033--Computer System Engineering

6.033 Design Project 2 Sensor Failure Analysis

Introduction

An exact analysis of sensor failure masking strategies for the Mars rovers in 6.033 design project 2 involves straightforward probability but moderately complex algebra. This note analyzes two measurement protocols, median and pair-and-compare. It also addresses a complication of the particular design of some sensors.

The problem setup

From the specifications, a sensor produces an output value in some range, which we take to be from 0 to R. Let the accuracy of the sensor be A, which means that if the measured value is actually V, a working sensor will produce an output value between (V-A/2) and (V+A/2), uniformly distributed. To simplify the algebra, let F = A/R, with the intended interpretation that F be the probability that a random output value from a faulty sensor lands within tolerance of an output value from a good sensor.
That interpretation is slightly off near the ends of the range, since a random output value can't be less than zero or greater than R. To avoid having to consider those two corner cases, imagine that the sensor "wraps around", so if V_i > R, the sensor returns the output value (V_i-R), and if V_i < 0, the sensor returns the output value (V_i+R). This assumption overestimates the chance that a random sensor output value will appear to be within tolerance, so the resulting estimate of the probability of making a mistake is an upper bound on the actual probability.

We shall say that two values V₁ and V₂ are "close" if they are within A of each other, considering wrap-around.
The complication mentioned above is that it is possible that a sensor fails, producing a random output value, yet that output value luckily lies close to the correct value or close to the output value of another, correctly working sensor. If the protocol chooses to return the faulty sensor's output value as the result of the experiment, the question arises whether or not we should count that outcome as a success or failure of the protocol.

If the sensor is a thermometer, it seems appropriate to consider this case a success. Unfortunately, according to the design specifications, some sensors (such as a camera) produce an output value that is just a summary, such as the average density of the photograph. With this kind of sensor, even if a random output value happens to be close, the photograph is probably worthless, and if we choose this sensor's output as the result to send back to Earth, we should expect the NASA scientists to count it as a failure of the protocol.

We identify sensors such as thermometers as "Type I" and sensors such as cameras as "Type II".

The median protocol with Type I sensors

The protocol: take three independent sensor readings and choose as the result of the experiment the median value.

This scheme is simple and the time it takes is fixed and thus can be predicted in advance. We first analyze the median protocol for Type I sensors, and then (with appropriate independence assumptions) apply a multiplicative correction to account for Type II sensors. Our goal is to calculate the probability that the median value, V_m, is close to the true value, V.

Calculate Prob(protocol success) = Prob(V_m is close to V).

If there are no failures, then V_m is certainly close to V.

If one sensor fails, its random value either lands between the other two (correctly working) sensors or it doesn't. If its random value lands between, it is the median and V_m is close to V. If its value lands outside, one of the working sensors provides the median reading, and V_m is again close to V. So with a single sensor failure V_m is always close to V.

So the only situation in which V_m is not close to V is when there is more than 1 bad sensor. Prob(more than 1 bad sensor) = Prob(2 or 3 bad sensors) = 3p² - 2p³.

There is a small chance that V_m is close to V even if it comes from a faulty sensor, and another small chance that V_m comes from the one good sensor, but let's ignore those two cases and get a lower bound:

Prob(V_m is close to V) > 1 - 3p² + 2p³
Plugging in, e.g., p=.1, we get Prob(success) > (1 - .03 + .002) = 0.972, so we can meet the 95% specification by taking the median of three sensors, each of which produce accurate output with only 90% probability.

As p increases, one can make the protocol more robust by taking the median of more measurements. For example, if p = .2 then the median of 3 experiments gives us a 12% protocol failure probability, which isn't good enough. But the median of 7 experiments (median of 5 is insufficient), by the same reasoning as above, can lead us astray if there are 4 or more sensor failures. This event happens with probability approximately 35·p⁴ , which is a bit less than 5%.

Type II sensors

The median protocol returns as its result the output of some single sensor, based on examining outputs of several sensors and concluding that the chosen sensor's output value is close to the correct value. As explained in the introduction, if the protocol chooses a sensor that failed but whose random output was close enough to allow it to be chosen, we should, for Type II sensors, identify the result as a failure, rather than a success, of a protocol.

To account for this case, we calculate the probability that the sensor whose output the median protocol chooses was a working sensor, given that it somehow generated an output value that was close to the correct value.

Let M = Pr[working sensor | output value within tolerance]

= Pr[working sensor and output value within tolerance]/Pr[output value within tolerance]

= Pr[working sensor]/Pr[output value within tolerance]

(Since a working sensor always yields output value within tolerance.)

= (1-p)/(Pr[working sensor and output value within tolerance] + Pr[non-working sensor and output value within tolerance]

= (1-p)/(1-p+Pr[non-working sensor]Pr[output value within tolerance | non-working sensor])

M = (1-p)/(1-p+pF)

Because

Pr[working sensor] = Pr[working sensor | output value within tolerance]Pr[output value within tolerance]

the probability M multiplicatively reduces the chance of success with respect to the Pr[output value within tolerance] that we calculated in the previous section. Again choosing F = 0.1, we can obtain the multiplier M for various values of p:

p	M
0.1	0.989
0.2	0.976
0.3	0.959

Since a multiplier below 0.95 will inevitably drag the protocol success rate to below the NASA specification of 95%, sensors with failure rates greater than 0.3 are unusable, no matter how many measurements we take.

The pair and compare protocol

The protocol: Take two sensor readings, getting values V₁ and V₂. If V₁ and V₂ are close, return the output of the first sensor. Otherwise, discard both values and try again.

This protocol has the feature that it may terminate after taking only two sensor readings, and thus produce results more quickly. But it may also run on for a long time, which makes setting timeouts problematic. In this case, the algebra is simplest if we assume Type II sensors from the outset.

A single pair-and-compare trial has three possible, mutually exclusive outcomes:

We return the output value from a working sensor. Succeed on this try.
Let Prob(Succeed on this try) = s
We return the output value from a non-working sensor. Fail on this try.
Let Prob(Fail on this try) = f
We discard this pair of results and do a new, independent pair and compare. Try again.
Then Prob(Try again) = r = 1-s-f.

Let Z = Prob(protocol fails). We can express Z in terms of s, f, and r by developing a recurrence relation. Since "Try again" restarts the entire protocol, with an outcome that is independent of the trial that just failed,

Z = Prob(Fail on this try) + Prob(Try again)·Z

solving,

Z = Prob(Fail on this try)/(1-(Prob(Try again))

Z = f/(1-r) = f/(s+f)

Conversely,

Prob(Protocol succeeds) = (1 - Z) = s/(s+f)

Rough and tumble solution

It is not hard to come up with a reasonably good upper bound (good in this case means low) on the probability of protocol failure. If that bound is below the NASA specification that the failure rate be lower than 5%, we are safe. First we calculate a quick and slightly rough bound, then we refine it.

Given that the probability of sensor failure is p, we can calculate the probability of each of the three outcomes.

Determine an upper bound for f: Declare a trial failure if V₁ is from a sensor that failed (probability p) and V₁ and V₂ are close. (This is an upper bound because V₁ could accidentally be close to V, and this would be a success for a Type 1 sensor). But if V₁ is from a failed sensor, its value is random. So regardless of the value of V₂, the probability V₁ is close to it is F. So f < pF.
Now, get a lower bound for s: For sure, a particular trial succeeds if both sensors work, which happens with probability (1-p)². (This is a lower bound, because sometimes a trial will succeed even if only the first sensor V₁ works.) Thus s > (1-p)².
Combining, Z = f/(s+f) < f/s < pF/(1-p)².

This bound is useful unless p is quite large. Suppose that F = 0.1. Solving for p, the bound tells us that we can meet the NASA specification that Z < 0.05 if p < 0.268. We might be able to get along with even flakier sensors if we had a tighter bound. Let's find out.

Getting a tighter bound

Let's get a tighter bound on s by doing a more precise analysis of the "succeed on this try" case.

A particular trial succeeds if V₁ is from a sensor that works and V₂ is close to V₁, whether or not the second sensor fails. We know that Prob(sensor works) = (1-p), so
s = (1-p)·Prob(V₂ is close to V₁)
The second sensor works correctly with probability (1-p). In this case, V₂ is certain to be close to V₁.
The second sensor fails with probability p. In this case, V₂ will be close to V₁ with probability F. (Actually, this is a lower bound, because V₁ might be near the edges of the sensor range.) Thus the probability of this case is greater than pF.
Summing (2) and (3), Prob(V₂ is close to V₁) > (1-p) + pF. Then from (1) above,
s > (1-p)(1-p+pF)
and
Z < f/(f+s) = pF/(pF + (1-p)(1-p+pF))

Wrap up

This equation looks formidable, but we can evaluate it for various values of p and F. Suppose again that the maximum ratio of tolerance to range is F = 0.1. If the maximum probability of sensor failure, p = .1, we have Z < .012, so our protocol has a 1.2% failure rate. If the maximum probability of sensor failure, p = .3, we have a 5% failure rate, just meeting NASA''s specification. Thus our repeated pair-and-compare protocol will meet specifications as long as the probability of sensor failure is less than 0.3. Our earlier, loose bound calculation allowed a maximum sensor failure rate of 0.268, so the only reward for tightening the bound is to establish that NASA can get along with slightly flakier sensors.

A conclusion

As we move downscale toward more unreliable sensors, both the N-median (for sufficiently large N) protocol and the pair-and-compare protocol fail at about the same point (p=.3), so neither has an advantage by that measure. The median protocol has the feature that the number of trials needed to achieve a given success rate can be exactly predicted in advance, which may be an advantage in setting timeouts. The pair-and-compare protocol has the feature that it may stop after the first trial, which may be an advantage in getting on to the next experiment.