Clayton Williams
Michael Walfish
TR12
Therac-25: Fatal Implementation Flaws and Testing Omissions
In the 1980s, a
number of patients were injured or killed by accidental radiation
overdoses administered by the Therac-25 medical linear
accelerator. This report is a brief summary of “Medical Devices: The
Therac-25” by Nancy Leveson, and examines two of the factors
that most probably contributed to the accidents: fundamental flaws
in the implementation of the machine and a failure on the part of
its developers to create a thorough testing environment.
The Therac-25
was a significant departure from the earlier Therac-20 in that it
was designed to use fewer hardware interlocks to prevent radiation
overdoses, instead placing that responsibility in the hands of the
software controller program. I think this is a viable design
decision, but the particular software implementation written by the
Therac-25 engineers was needlessly complex and poorly thought
out. For example, while the custom OS designed for the machine
supported multiple simultaneous threads of execution, there was no
reliable means of synchronizing variables between threads because
“…the ‘test’ and ‘set’ for such variables are not
indivisible operations” (Leveson, p. 24). One thread could therefore
change a variable between another thread's check and its update. This
is a fundamental problem in multi-threaded programming, and a
well-designed real-time OS would have provided for it. And
while a task scheduler is a desirable design for a safety-critical
application, the way the user interface of the Therac-25
interacted with the rest of the system was far more complex than it
needed to be. According to Leveson, the UI task was always
running concurrently with the treatment task. The decision to use
the data entered to set up the equipment and prepare for treatment
was made by a complex system of setting, clearing, and checking a
shared variable. This created a system where one misplaced
instruction could cause a serious error. It would have been much
simpler to have the UI routine run by itself until satisfactory
inputs were received, and then start the treatment task with that
data. Even if the operator were then allowed to interrupt the
treatment task to change the input, this would still be a simpler
implementation of the UI process.
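To illustrate the synchronization problem quoted from Leveson above, here is a minimal Python sketch of what an indivisible test-and-set looks like. It is only an illustration under my own assumptions: the names AtomicFlag and claim_once are hypothetical and are not taken from the Therac-25 software. A lock makes the "test" and the "set" happen as one step, so no other thread can slip in between them.

```python
import threading


class AtomicFlag:
    """Hypothetical flag whose test and set form one indivisible step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = False

    def test_and_set(self):
        """Set the flag and return the value it held just before."""
        with self._lock:
            previous = self._value  # the "test"
            self._value = True      # the "set"
            return previous

    def clear(self):
        """Reset the flag so it can be claimed again."""
        with self._lock:
            self._value = False


def claim_once(n_threads=8):
    """Start n_threads workers that race to claim a single flag.

    Returns the ids of the workers that saw the flag as unset.
    """
    flag = AtomicFlag()
    winners = []
    winners_lock = threading.Lock()

    def worker(task_id):
        # Only a worker that observes the flag as previously unset wins.
        if not flag.test_and_set():
            with winners_lock:
                winners.append(task_id)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

Because the check and the update share one lock, exactly one worker can ever win the flag, no matter how the threads are scheduled. If the lock were removed, two workers could both read the flag as unset before either one set it, which is the kind of failure that a non-indivisible test and set allows.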
Another major problem with the software on the Therac-25 is that it was never adequately tested. While the engineers performed large numbers of tests on the complete system, they never (or at least rarely) tested the software by itself, and had no process for testing separate software modules individually. Consequently, the machine was tested thoroughly for the set of “normal” inputs under conditions it would most likely see in the field, but its software components were never tested at the extreme or edge cases. This is likely to have been the cause of one malfunction, when the operator happened to push a button at the exact moment that a one-byte counter overflowed and rolled over to zero. It also seems, from reading FDA documents, that the Therac-25 engineers had no regression tests for the software. As a result, any minor change to the code (such as a bug fix) could break other parts of the program that had been working.
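The rollover described above is the kind of defect that a module-level boundary test can expose. The Python below is a rough sketch under my own assumptions, not the Therac-25 code: the function names are hypothetical, and I assume a one-byte flag where any nonzero value means "a safety check is still required" and zero means "skip the check". Incrementing such a flag wraps it to zero on every 256th pass, while storing a fixed nonzero value never does.

```python
def mark_pending_increment(flag):
    """Hypothetical flawed update: bump a one-byte flag, so 255 wraps to 0."""
    return (flag + 1) & 0xFF


def mark_pending_constant(flag):
    """Hypothetical safer update: always store the same nonzero value."""
    return 1


def passes_where_check_is_skipped(update, passes=512):
    """Run `update` repeatedly and report the passes on which the flag is 0.

    A flag of 0 is assumed to mean "no check needed", so every pass listed
    here is one where the safety check would be skipped by mistake.
    """
    flag = 0
    skipped = []
    for n in range(1, passes + 1):
        flag = update(flag)
        if flag == 0:
            skipped.append(n)
    return skipped


if __name__ == "__main__":
    print(passes_where_check_is_skipped(mark_pending_increment))  # [256, 512]
    print(passes_where_check_is_skipped(mark_pending_constant))   # []
```

A test that only exercises a handful of "normal" passes would never reach pass 256, which is why the boundary has to be tested deliberately. Kept in a suite that runs after every code change, the same check would also serve as a simple regression test.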