6.033 Discussion Suggestions (Therac-25 paper)

6.033--Computer System Engineering

Suggestions for classroom discussion of:

Nancy G. Leveson and Clark S. Turner. An investigation of the Therac-25 accidents. Computer 26, 7 (July, 1993) pages 18-41.

by J. H. Saltzer, February 13, 1996, updated February 9, 1999, January 2001, February 8 & 17, 2002, and April 25, 2002.

The paper on the Therac-25 incidents is the first encounter with the risks of using computers; this will be a continued theme. This is a topic that everyone can play expert analyst on, so it shouldn't be hard to get the class to participate. Just asking the question "what really went wrong here?" may be enough to generate half a dozen positions and carry a good discussion.

There is a recent similar incident, involving a Cobalt-60 therapy machine known as the Thyratron, manufactured by MultiData Systems. In Panama in 2001, 28 patients were overdosed, and five died. A comparison of the company's initial announcement (PDF file) with the NRC report overview (another PDF file) is a good item to bring in sometime during the discussion of the Therac. Since all we have are two formal public announcements, we don't know any details, but we can still see at least one of the same problems (blame the operator, not the system design) in this incident.

Reading this paper provides an excellent opportunity for a form of analysis that should come up at least once during the term:

Let's reconstruct the outline of the introduction. For each of the paragraphs of the introduction, give me a ten-word-or-less summary of the significant content:

1 six serious accidents
2 we are going to explore it
3 fluff about applicability
4 fluff about complexity
5 it is hard to get people to talk about mistakes
6 we are first
7 we are careful
8 we aren't perfect
9 we will tell you what we think

Observations:

- they didn't tell us what went wrong
- the ratio of fluff to content is pretty high. There may be good stuff in this paper, but there is also going to be a lot we can ignore. So light skimming, looking for nuggets is likely to be the most profitable approach.

So let's try the same technique on the closing section...

1 we must dig deeper
2 patching doesn't work
3 these lessons are broadly applicable

They still didn't tell us what went wrong.

Who are the authors?: A professor from Irvine, and one of her Ph.D students. But he is a lawyer, so he may have a point of view we aren't familiar with.

What papers do they cite?: The citations seem narrow in scope
- supporting data
- supporting opinions
- their own papers

conclusion: be wary. They haven't established their credentials here. Their view may be narrow. The paper has to stand on its own merits. (but it will!)

What journal is this in? Computer, a general-interest rag of the IEEE Computer Group. It is quite hard to get a paper in, and it must be of pretty broad interest. This journal is heavily refereed. (What does that mean?)

Can you find any red herrings? There are lots of them, e.g.:
double-pass design: electron accelerator with mirrors to fold the beam.
depth dose: focus the energy deep inside the body.

neither of those things has anything to do with the accidents.

Old software was brought forward into this design. Is that good or bad?

- it is seasoned, known to work (good)
- it was designed for a different situation (bad)

So how do these Therac things work? (Most students will have a pretty confused idea of the mechanics. Draw a picture of the three-position turntable that blocks with a flattener, diffuses with a magnet, or passes the beam unaltered.) The following table helps get to the bottom of the problem:

                                           photon
turntable position:    electron therapy   (x-ray)   field-light
                         mode              mode      mode
beam energy:           5-25 MEV             25 MEV    0
beam current:            low                high      0
beam modifiers:         magnets          flattener  none

What is the purpose of "field-light mode"? (So the operator can peer in and verify that the patient is positioned properly.)

And given this table, what can go wrong? (If you somehow get the beam current set to high at the same time that the turntable is positioned with magnets rather than with the flattener, you have set the stage for a disaster, with a high X-ray beam current delivered directly to the patient, rather than being attenuated by a factor of 1000 or more. Another way to kill a patient is to turn the beam on at either current level with the turntable in field-light mode.

An interlock is needed. The hardware interlocks of the Therac-20 were replaced with software interlocks. What is the Therac's concept of a software interlock?

(The "software interlock" is just a boolean flag for which a non-zero value indicates that there is reason to believe that the turntable position should be verified for consistency with the beam setting. )

Get the class to describe the operator input and race condition scenario that leads to the two Tyler disasters.

Here is how the machine seems to work [Thanks to Emmett Witchel for sorting this out]:

There are three parallel threads involved. One thread parses keyboard input, a second thread sets the turntable position, a third thread sets several other things, including the beam energy.
The way that the operator signals that the machine should read the parameters and perform configuration set up is that the operator moves the cursor into the bottom field of the display. The keyboard thread watches for this cursor position, and on detection it parses the input, stores it in a two-byte variable, and sets a flag saying that data entry has occurred.
This flag is read independently by the other two threads, each of which is normally spinning in a loop watching either for a setup signal or a beam-on signal. When a setup signal arrives, the two threads go about their business, each reading the parameter values from the two-byte variable and setting their part of the machine up accordingly. When done, they each come back and again spin in the signal watching loop.
The general parameter-setting thread can take as long as 8 seconds to complete. The turntable-setting thread completes quickly (the time is not mentioned, but apparently much less than 8 seconds).

Now suppose that the operator sets up a consistent set of parameters and moves the cursor to the bottom field. The keyboard thread notices, acquires the input, and signals the other two two setup threads to go into action. Meanwhile, the operator looks over the display, realizes that the parameters are not the ones the doctor ordered, moves the cursor back up to change the parameters to another consistent set, and then moves the cursor back to the bottom field. The keyboard thread again acquires the input and signals the other two threads.

If the operator does this change in less than 8 seconds, the fast thread will discover the second setup signal and will redo setup with the new parameters. The slow thread is busy doing its 8-second procedure and when it returns it has missed the second setup signal, so that part of the machine remains in the first configuration. The display looks right, each of the threads thinks it has done the right thing, and the operator punches the beam-on button. But the turntable and the beam energy are not consistent.

Question: Why wasn't the setup checked for consistency before turning on the beam? This omission is unexplained and quite puzzling. One possibility is that the philosophy of the interlock design was one of "if we give an order to change something, we should later check to see if it actually changed." Since in the scenario above, the problem was that some orders to change never got given, there was, under this philosophy, no need to followup by checking.

We have here a good example of the maxim that Complex Systems Fail for Complex Reasons.

Why is it so hard to figure out what is going on? (This is worth emphasizing strongly to your class; it is about technical writing: First, the design is complicated, there are a lot of things going on, and they are going on in parallel. Only a small number of those details are essential to understanding the problem, but the text includes both relevant and irrelevant details.

Second, the explanation is hard to follow because certain key sentences in the description are passive. Example: "the data entry phase can be exited before all edit changes are made"). In that sentence, the reader has to figure out from context which of the several parallel tasks--or the operator--"exited" and "made the changes". And depending on which one did it, there might or might not be a race. As a rule, there is no place for passive sentences in technical writing; they usually obscure who took the action, and thereby make the description ambiguous.)

If there is time, call for a similar scenario description for the Yakima disaster:

- operator sets up parameters on the screen.
- operator moves turntable to field-light mode, and visually checks that the patient is properly positioned.
- operator hits "set" to store the parameters.

At this point, the "Class3" interlock is supposed to tell the software that the turntable position needs to be checked.

But this boolean flag seems to be implemented as a counter. Why?

(The PDP-11 had an "Increment-Byte" instruction that performed the operation (A <- A + 1). It looks like the programmer used that instruction as a way of making the flag non-zero. He probably thought he was doing a good deed, because the operation (A <- 1) on a PDP-11 requires that one materialize the constant value 1, which takes an extra byte to store.)

Seems like a good way to save a couple of bytes of program. What is wrong with it?

(Well, the code that set the flag was in a self-rescheduling routine that repeatedly watches the operator to see when everything is ready to proceed. So it sets the interlock to non-zero over and over again. Unfortunately, every 256 times it sets the interlock it actually sets it to zero.)

so we follow one of two paths. Normally,

- software interlock Class3 causes verification of turntable position; verification fails and system tells operator to go fix it.

But if the interlock variable happens to have just turned over from 256 to zero the sequence is

- software interlock Class3 indicates that no check is needed, so turntable remains in field-light position.

- operator tells machine to apply dose as set up, injuring patient.

Let's compile a list of faults that contributed to the disaster scenario:

faults in design of the operator interface:
- errors reported only by number
- no documentation in which to look them up
- screen cluttered with a high volume of false alarms. (Learned response: try again, it'll probably work next time)
- garbage frequently left on the screen
faults in the software:
- race in accepting input from operator
- failed to check for new input after setting magnets
- overflow of a variable stored in a one-byte space
faults in the system design approach:
- assumed that one can find and test all possible software failure modes
- failure analysis didn't include software, because it doesn't deteriorate
- unnecessarily complex design--ad hoc, little modularity, machine language
faults in design iteration:
- no followup of frequent, unexplained blown fuses on the Therac-20 with mechanical interlocks
- failure to follow through on initial problem reports
- inability to reproduce the problem
- reported failures not circulated to other users
- failures not reported to coordinating authorities in either U.S. or Canada.
other faults
- lack of any hardware mechanism to prevent beam turnon if a deflector isn't in position
- measuring devices that, when overloaded, report small dosage readings
- lack of software quality control at manufacturer
- lack of documentation on software design

Fixing any one of the above things might have prevented the disasters. But a good design and implementation should have addressed ALL of these things.

[suggested by Robert Morris] On might draw the lesson that since there was no single cause of failure, no single design change would have contributed much to success. Is that an appropriate conclusion?

(No. There are two objections to this line of reasoning. First, simple, system-wide, end-to-end consistency checks are a potent tool. The design should have included such checks. Second, one single design change would have made a huge difference: use a safety net approach to design. This second point will probably be worth coming back to when we get to chapter 6, which has a discussion of what "safety net" means, and why it can be effective.)

What ethical issues are involved here? For example, should the FDA have ordered the machine off the market immediately after the first accident, at least until someone got to the bottom of the problem?

This question leads to a very interesting tradeoff: Although the paper does not say so explicitly, it implies that the Therac-25 was heavily used in the field for several years before the problem was finally diagnosed and fixed. During this time, presumably several thousand patients were successfully treated, while six received damaging doses of radiation. If a patient is faced with the following choice: "If you don't get this therapy you will almost certainly die within a year. The machine we use is flaky, and it kills one in 1000 patients, but if it doesn't kill you it will cure you." one might expect that many patients would choose to take the risk.

If Lampson's "Hints" paper has been read either before or concurrently with this paper, a good question is "where did the Therac design violate Lampson's suggestions" Many of Lampson's slogans have to do with performance, which isn't the main issue in the Therac. But practically every non-performance hint was violated:

Keep it simple
Do one thing at a time
plan to throw one away
keep secrets
safety first
end-to-end error recovery
make actions atomic

The following discussion ideas were contributed by Steve Ward:

Hardware vs Software reliability? The paper repeatedly suggests that software is inherently more unreliable than hardware. Is this true? If so, why?

GEDANKEN EXPERIMENT: Suppose a redesigned hardware interface allowed a single, simple, localized software module to specify mode and turntable position coded into a single word. Assume that this software module uses a simple, airtight approach -- say table lookup of known, safe output words -- to guarantee that x-ray mode and magnets are never simultaneously selected. Would this mitigate software reliability concerns?

Still vulnerable to many single-failure disasters (CPU failures, I/O interface bugs, etc). Of course, these are hardware problems.
What do we have to trust to believe this system won't fry patients?
- Our driver code (simple, hence easy to trust)
- Critical hardware (including virtually all of the computer equipment)
- Other software?
  - sensitivity to stray writes by unrelated software - as well as wild transfers, etc
  - without additional firewalls, we have to trust virtually ALL of the software!
- What can we do to minimize the set of critical software?
Are hardware interlocks more trustworthy? Often, due to natural limitations in their possibility of unintentional interaction:
- physical isolation between modules (distance, cases, limited interfaces)
- dedicated function
- static connectivity (wires, rather than eg procedure calls).
Software systems minimize such limitations, in the interest of generality. How can the reliability be regained?
Firewalls (memory, compiler, communication, resource allocation) can limit interactions
Crash-prone computers, eg Lisp machines (single adr space, shared GC'ed heap); MACs (shared system heap, nice-guy timesharing), PCs in general (unprotected memory combined with unsafe languages)
Relation to hierarchy: suppose interactions are confined strictly to a carefully engineered design hierarchy? Does this help?
Is the generality/reliability tradeoff fundamental?
[Note that this discussion provides a natural lead-in to the topic of industrial-strength firewalls, which is the subject of the next few lectures.]
Can legislated engineering techniques/standards improve reliability? I vaguely remember that years ago the Brits had a go at this. I am amused by the vision of a debate in the House of Lords, half of them asleep, about whether recursion should be allowed...

Comments and suggestions: Saltzer@mit.edu