An exact analysis of sensor failure masking strategies for the Mars rovers in 6.033 design project 2 involves straightforward probability but moderately complex algebra. This note analyzes two measurement protocols, median and pair-and-compare. It also addresses a complication of the particular design of some sensors.
That interpretation is slightly off near the ends of the range, since a random output value can't be less than zero or greater than R. To avoid having to consider those two corner cases, imagine that the sensor "wraps around", so if Vi > R, the sensor returns the output value (Vi-R), and if Vi < 0, the sensor returns the output value (Vi+R). This assumption overestimates the chance that a random sensor output value will appear to be within tolerance, so the resulting estimate of the probability of making a mistake is an upper bound on the actual probability.
We shall say that two values V1 and V2 are "close" if they are within A of each other, considering wrap-around.
The complication mentioned above is that it is possible that a sensor fails, producing a random output value, yet that output value luckily lies close to the correct value or close to the output value of another, correctly working sensor. If the protocol chooses to return the faulty sensor's output value as the result of the experiment, the question arises whether or not we should count that outcome as a success or failure of the protocol.
If the sensor is a thermometer, it seems appropriate to consider this case a success. Unfortunately, according to the design specifications, some sensors (such as a camera) produce an output value that is just a summary, such as the average density of the photograph. With this kind of sensor, even if a random output value happens to be close, the photograph is probably worthless, and if we choose this sensor's output as the result to send back to Earth, we should expect the NASA scientists to count it as a failure of the protocol.
We identify sensors such as thermometers as "Type I" and sensors such as cameras as "Type II".
The protocol: take three independent sensor readings and choose as the result of the experiment the median value.
This scheme is simple and the time it takes is fixed and thus can be predicted in advance. We first analyze the median protocol for Type I sensors, and then (with appropriate independence assumptions) apply a multiplicative correction to account for Type II sensors. Our goal is to calculate the probability that the median value, Vm, is close to the true value, V.
Calculate Prob(protocol success) = Prob(Vm is close to V).
If there are no failures, then Vm is certainly close to V.
If one sensor fails, its random value either lands between the other two (correctly working) sensors or it doesn't. If its random value lands between, it is the median and Vm is close to V. If its value lands outside, one of the working sensors provides the median reading, and Vm is again close to V. So with a single sensor failure Vm is always close to V.
So the only situation in which Vm is not close to V is when there is more than 1 bad sensor. Prob(more than 1 bad sensor) = Prob(2 or 3 bad sensors) = 3p2 - 2p3.
There is a small chance that Vm is close to V even if it comes from a faulty sensor, and another small chance that Vm comes from the one good sensor, but let's ignore those two cases and get a lower bound:
Prob(Vm is close to V) > 1 - 3p2 + 2p3
As p increases, one can make the protocol more robust by taking the median of more measurements. For example, if p = .2 then the median of 3 experiments gives us a 12% protocol failure probability, which isn't good enough. But the median of 7 experiments (median of 5 is insufficient), by the same reasoning as above, can lead us astray if there are 4 or more sensor failures. This event happens with probability approximately 35·p4 , which is a bit less than 5%.
The median protocol returns as its result the output of some single sensor, based on examining outputs of several sensors and concluding that the chosen sensor's output value is close to the correct value. As explained in the introduction, if the protocol chooses a sensor that failed but whose random output was close enough to allow it to be chosen, we should, for Type II sensors, identify the result as a failure, rather than a success, of a protocol.
To account for this case, we calculate the probability that the sensor whose output the median protocol chooses was a working sensor, given that it somehow generated an output value that was close to the correct value.
Let M = Pr[working sensor | output value within tolerance]
(Since a working sensor always yields output value within tolerance.)= Pr[working sensor and output value within tolerance]/Pr[output value within tolerance]
= Pr[working sensor]/Pr[output value within tolerance]
M = (1-p)/(1-p+pF)= (1-p)/(Pr[working sensor and output value within tolerance] + Pr[non-working sensor and output value within tolerance]
= (1-p)/(1-p+Pr[non-working sensor]Pr[output value within tolerance | non-working sensor])
Because
Pr[working sensor] = Pr[working sensor | output value within tolerance]Pr[output value within tolerance]the probability M multiplicatively reduces the chance of success with respect to the Pr[output value within tolerance] that we calculated in the previous section. Again choosing F = 0.1, we can obtain the multiplier M for various values of p:
p | M |
0.1 | 0.989 |
0.2 | 0.976 |
0.3 | 0.959 |
The protocol: Take two sensor readings, getting values V1 and V2. If V1 and V2 are close, return the output of the first sensor. Otherwise, discard both values and try again.
This protocol has the feature that it may terminate after taking only two sensor readings, and thus produce results more quickly. But it may also run on for a long time, which makes setting timeouts problematic. In this case, the algebra is simplest if we assume Type II sensors from the outset.
A single pair-and-compare trial has three possible, mutually exclusive outcomes:
Let Z = Prob(protocol fails). We can express Z in terms of s, f, and r by developing a recurrence relation. Since "Try again" restarts the entire protocol, with an outcome that is independent of the trial that just failed,
Z = Prob(Fail on this try) + Prob(Try again)·Zsolving,
Z = Prob(Fail on this try)/(1-(Prob(Try again))
Z = f/(1-r) = f/(s+f)Conversely,
Prob(Protocol succeeds) = (1 - Z) = s/(s+f)
It is not hard to come up with a reasonably good upper bound (good in this case means low) on the probability of protocol failure. If that bound is below the NASA specification that the failure rate be lower than 5%, we are safe. First we calculate a quick and slightly rough bound, then we refine it.
Given that the probability of sensor failure is p, we can calculate the probability of each of the three outcomes.
This bound is useful unless p is quite large. Suppose that F = 0.1. Solving for p, the bound tells us that we can meet the NASA specification that Z < 0.05 if p < 0.268. We might be able to get along with even flakier sensors if we had a tighter bound. Let's find out.
Let's get a tighter bound on s by doing a more precise analysis of the "succeed on this try" case.
s = (1-p)·Prob(V2 is close to V1)
s > (1-p)(1-p+pF)and
Z < f/(f+s) = pF/(pF + (1-p)(1-p+pF))
This equation looks formidable, but we can evaluate it for various values of p and F. Suppose again that the maximum ratio of tolerance to range is F = 0.1. If the maximum probability of sensor failure, p = .1, we have Z < .012, so our protocol has a 1.2% failure rate. If the maximum probability of sensor failure, p = .3, we have a 5% failure rate, just meeting NASA''s specification. Thus our repeated pair-and-compare protocol will meet specifications as long as the probability of sensor failure is less than 0.3. Our earlier, loose bound calculation allowed a maximum sensor failure rate of 0.268, so the only reward for tightening the bound is to establish that NASA can get along with slightly flakier sensors.
As we move downscale toward more unreliable sensors, both the N-median (for sufficiently large N) protocol and the pair-and-compare protocol fail at about the same point (p=.3), so neither has an advantage by that measure. The median protocol has the feature that the number of trials needed to achieve a given success rate can be exactly predicted in advance, which may be an advantage in setting timeouts. The pair-and-compare protocol has the feature that it may stop after the first trial, which may be an advantage in getting on to the next experiment.