Nancy G. Leveson and Clark S. Turner. An investigation of the Therac-25 accidents. Computer 26, 7 (July, 1993) pages 18-41.
by J. H. Saltzer, February 13, 1996, updated February 9, 1999, January 2001, February 8 & 17, 2002, and April 25, 2002.
The paper on the Therac-25 incidents is the first encounter with the risks of using computers; this will be a continued theme. This is a topic that everyone can play expert analyst on, so it shouldn't be hard to get the class to participate. Just asking the question "what really went wrong here?" may be enough to generate half a dozen positions and carry a good discussion.
There is a recent similar incident, involving a Cobalt-60 therapy machine known as the Theratron, manufactured by MultiData Systems. In Panama in 2001, 28 patients were overdosed, and five died. A comparison of the company's initial announcement (PDF file) with the NRC report overview (another PDF file) is a good item to bring in sometime during the discussion of the Therac. Since all we have are two formal public announcements, we don't know any details, but we can still see at least one of the same problems (blame the operator, not the system design) in this incident.
Reading this paper provides an excellent opportunity for a form of analysis that should come up at least once during the term:
Let's reconstruct the outline of the introduction. For each of the paragraphs of the introduction, give me a ten-word-or-less summary of the significant content:
1 six serious accidents
2 we are going to explore it
3 fluff about applicability
4 fluff about complexity
5 it is hard to get people to talk about mistakes
6 we are first
7 we are careful
8 we aren't perfect
9 we will tell you what we think
Observations:
- they didn't tell us what went wrong
- the ratio of fluff to content is pretty high. There may be
good stuff in this paper, but there is also going to be a lot we
can ignore. So light skimming, looking for nuggets, is likely to
be the most profitable approach.
So let's try the same technique on the closing section...
1 we must dig deeper
2 patching doesn't work
3 these lessons are broadly applicable
They still didn't tell us what went wrong.
Who are the authors? A professor from Irvine and one of her Ph.D. students. The student is also a lawyer, so he may have a point of view we aren't familiar with.
What papers do they cite? The citations seem narrow in scope:
- supporting data
- supporting opinions
- their own papers
conclusion: be wary. They haven't established their credentials here. Their view may be narrow. The paper has to stand on its own merits. (but it will!)
What journal is this in? Computer, a general-interest rag of the IEEE Computer Group. It is quite hard to get a paper in, and it must be of pretty broad interest. This journal is heavily refereed. (What does that mean?)
Can you find any red herrings? There are lots of them, e.g.:
double-pass design: electron accelerator with mirrors to fold the beam.
depth dose: focus the energy deep inside the body.
neither of those things has anything to do with the accidents.
Old software was brought forward into this design. Is that good or bad?
- it is seasoned, known to work (good)
- it was designed for a different situation (bad)
So how do these Therac things work? (Most students will have a pretty confused idea of the mechanics. Draw a picture of the three-position turntable that blocks with a flattener, diffuses with a magnet, or passes the beam unaltered.) The following table helps get to the bottom of the problem:
turntable position:   electron mode   photon therapy (x-ray) mode   field-light mode
beam energy:          5-25 MeV        25 MeV                        0
beam current:         low             high                          0
beam modifiers:       magnets         flattener                     none
What is the purpose of "field-light mode"? (So the operator can peer in and verify that the patient is positioned properly.)
And given this table, what can go wrong? (If you somehow get the beam current set to high at the same time that the turntable is positioned with magnets rather than with the flattener, you have set the stage for a disaster, with a high X-ray beam current delivered directly to the patient, rather than being attenuated by a factor of 1000 or more. Another way to kill a patient is to turn the beam on at either current level with the turntable in field-light mode.)
An interlock is needed. The hardware interlocks of the Therac-20 were replaced with software interlocks. What is the Therac's concept of a software interlock?
(The "software interlock" is just a boolean flag for which a non-zero value indicates that there is reason to believe that the turntable position should be verified for consistency with the beam setting.)
Get the class to describe the operator input and race condition scenario that leads to the two Tyler disasters.
Here is how the machine seems to work [Thanks to Emmett Witchel for sorting this out]:
Now suppose that the operator sets up a consistent set of parameters and moves the cursor to the bottom field. The keyboard thread notices, acquires the input, and signals the other two setup threads to go into action. Meanwhile, the operator looks over the display, realizes that the parameters are not the ones the doctor ordered, moves the cursor back up to change the parameters to another consistent set, and then moves the cursor back to the bottom field. The keyboard thread again acquires the input and signals the other two threads.
If the operator does this change in less than 8 seconds, the fast thread will discover the second setup signal and will redo setup with the new parameters. The slow thread is busy doing its 8-second procedure and when it returns it has missed the second setup signal, so that part of the machine remains in the first configuration. The display looks right, each of the threads thinks it has done the right thing, and the operator punches the beam-on button. But the turntable and the beam energy are not consistent.
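If the class has trouble seeing the race, a small model may help. The following Python sketch is not the Therac-25 code; it is a deterministic toy in which `run_scenario`, `SLOW_SETUP_SECONDS`, and the mode names are all invented for illustration. It assumes only what the scenario above states: the fast thread honors every setup signal, while the slow thread ignores any signal that arrives during its 8-second procedure.

```python
SLOW_SETUP_SECONDS = 8  # length of the slow procedure, per the scenario above


def run_scenario(signals):
    """Replay setup signals sent by the keyboard thread.

    signals: list of (time_in_seconds, mode) pairs, one per completed
    data-entry pass. Returns (fast_mode, slow_mode), the configuration
    each setup thread ends up having applied.
    """
    fast_mode = None
    slow_mode = None
    slow_busy_until = float("-inf")

    for t, mode in sorted(signals):
        # Fast thread: notices every signal and redoes its setup.
        fast_mode = mode

        # Slow thread: only sees a signal if it is idle when it arrives.
        if t >= slow_busy_until:
            slow_mode = mode
            slow_busy_until = t + SLOW_SETUP_SECONDS
        # Otherwise the signal is simply lost (hypothetical model of the bug).

    return fast_mode, slow_mode
```

With signals at 0 and 5 seconds, `run_scenario([(0, "xray"), (5, "electron")])` leaves the two parts of the machine in different modes; move the second signal out to 10 seconds and they agree again. Nothing in the model ever reports the disagreement, which is the point.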
Question: Why wasn't the setup checked for consistency before turning on the beam? This omission is unexplained and quite puzzling. One possibility is that the philosophy of the interlock design was one of "if we give an order to change something, we should later check to see if it actually changed." Since in the scenario above the problem was that some orders to change never got given, there was, under this philosophy, no need to follow up by checking.
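One way to make the omission concrete is to write down the check that seems to be missing. The sketch below is an assumption about what such a check could look like, not a description of any Therac code: `CONSISTENT_SETTINGS` simply transcribes the mode table earlier in these notes, and `beam_on_permitted` is a hypothetical function that compares the turntable position as actually sensed with the beam current about to be delivered.

```python
# Consistent combinations, transcribed from the mode table in these notes.
# Key: sensed turntable position. Value: the only beam current allowed there.
CONSISTENT_SETTINGS = {
    "magnets": "low",       # electron mode
    "flattener": "high",    # photon (x-ray) mode
    "field-light": "off",   # positioning only; the beam must stay off
}


def beam_on_permitted(sensed_turntable, requested_current):
    """Return True only if the beam may be turned on in this state.

    The check looks at the final state, regardless of which orders
    were or were not given along the way.
    """
    allowed = CONSISTENT_SETTINGS.get(sensed_turntable)
    if allowed is None or allowed == "off":
        return False  # unknown position, or field-light mode
    return requested_current == allowed
```

Under these assumptions `beam_on_permitted("magnets", "high")` is false, so the Tyler configuration would have been refused at the last moment, whatever the history of operator edits.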
We have here a good example of the maxim that Complex Systems Fail for Complex Reasons.
Why is it so hard to figure out what is going on? (This is worth emphasizing strongly to your class; it is about technical writing: First, the design is complicated, there are a lot of things going on, and they are going on in parallel. Only a small number of those details are essential to understanding the problem, but the text includes both relevant and irrelevant details.
Second, the explanation is hard to follow because certain key sentences in the description are passive. Example: "the data entry phase can be exited before all edit changes are made." In that sentence, the reader has to figure out from context which of the several parallel tasks--or the operator--"exited" and "made the changes". And depending on which one did it, there might or might not be a race. As a rule, there is no place for passive sentences in technical writing; they usually obscure who took the action, and thereby make the description ambiguous.)
- operator sets up parameters on the screen.
- operator moves turntable to field-light mode, and visually
checks that the patient is properly positioned.
- operator hits "set" to store the parameters.
At this point, the "Class3" interlock is supposed to tell the software that the turntable position needs to be checked.
But this boolean flag seems to be implemented as a counter. Why?
(The PDP-11 had an "Increment-Byte" instruction that performed the operation (A <- A + 1). It looks like the programmer used that instruction as a way of making the flag non-zero. He probably thought he was doing a good deed, because the operation (A <- 1) on a PDP-11 requires that one materialize the constant value 1, which takes an extra byte to store.)
Seems like a good way to save a couple of bytes of program. What is wrong with it?
(Well, the code that set the flag was in a self-rescheduling routine that repeatedly watches the operator to see when everything is ready to proceed. So it sets the interlock to non-zero over and over again. Unfortunately, every 256 times it sets the interlock it actually sets it to zero.)
so we follow one of two paths. Normally,
- software interlock Class3 causes verification of turntable position; verification fails and system tells operator to go fix it.
But if the interlock variable happens to have just turned over from 255 to zero, the sequence is
- software interlock Class3 indicates that no check is needed, so turntable remains in field-light position.
- operator tells machine to apply dose as set up, injuring patient.
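The arithmetic behind the turnover is easy to demonstrate. The Python sketch below models only the 8-bit wraparound, not the PDP-11 code itself; `increment_byte`, `passes_where_flag_reads_zero`, and `set_flag_by_assignment` are names made up for this illustration.

```python
def increment_byte(value):
    """Model an 8-bit increment: 255 + 1 wraps around to 0."""
    return (value + 1) & 0xFF


def passes_where_flag_reads_zero(total_passes):
    """Set the flag by incrementing it once per pass of the routine.

    Returns the pass numbers after which the flag reads zero,
    that is, the passes on which the interlock says 'no check needed'.
    """
    flag = 0
    zero_passes = []
    for n in range(1, total_passes + 1):
        flag = increment_byte(flag)
        if flag == 0:
            zero_passes.append(n)
    return zero_passes


def set_flag_by_assignment(_old_value):
    """The alternative: store the constant 1, which can never wrap."""
    return 1
```

Calling `passes_where_flag_reads_zero(512)` returns `[256, 512]`: on every 256th pass the "set" flag is actually clear. The assignment version costs a little more to encode but has no such window.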
Fixing any one of the above things might have prevented the disasters. But a good design and implementation should have addressed ALL of these things.
[suggested by Robert Morris] One might draw the lesson that since there was no single cause of failure, no single design change would have contributed much to success. Is that an appropriate conclusion?
(No. There are two objections to this line of reasoning. First, simple, system-wide, end-to-end consistency checks are a potent tool. The design should have included such checks. Second, one single design change would have made a huge difference: use a safety net approach to design. This second point will probably be worth coming back to when we get to chapter 6, which has a discussion of what "safety net" means, and why it can be effective.)
What ethical issues are involved here? For example, should the FDA have ordered the machine off the market immediately after the first accident, at least until someone got to the bottom of the problem?
This question leads to a very interesting tradeoff: Although the paper does not say so explicitly, it implies that the Therac-25 was heavily used in the field for several years before the problem was finally diagnosed and fixed. During this time, presumably several thousand patients were successfully treated, while six received damaging doses of radiation. If a patient is faced with the following choice: "If you don't get this therapy you will almost certainly die within a year. The machine we use is flaky, and it kills one in 1000 patients, but if it doesn't kill you it will cure you." one might expect that many patients would choose to take the risk.
If Lampson's "Hints" paper has been read either before or concurrently with this paper, a good question is "where did the Therac design violate Lampson's suggestions?" Many of Lampson's slogans have to do with performance, which isn't the main issue in the Therac. But practically every non-performance hint was violated:
The following discussion ideas were contributed by Steve Ward:
Hardware vs Software reliability? The paper repeatedly suggests that software is inherently more unreliable than hardware. Is this true? If so, why?
GEDANKEN EXPERIMENT: Suppose a redesigned hardware interface allowed a single, simple, localized software module to specify mode and turntable position coded into a single word. Assume that this software module uses a simple, airtight approach -- say table lookup of known, safe output words -- to guarantee that x-ray mode and magnets are never simultaneously selected. Would this mitigate software reliability concerns?
Are hardware interlocks more trustworthy? Often, due to natural limitations in their possibility of unintentional interaction:
Software systems minimize such limitations, in the interest of generality. How can the reliability be regained?
[Note that this discussion provides a natural lead-in to the topic of industrial-strength firewalls, which is the subject of the next few lectures.]