Clayton Williams

Michael Walfish

TR12

Therac-25: Fatal Implementation Flaws and Testing Omissions

 

In the 1980s, a number of patients were injured or killed by accidental radiation overdoses administered by the Therac-25 medical linear accelerator. This report briefly summarizes “Medical Devices: The Therac-25” by Nancy Leveson and examines two of the factors that most likely contributed to the accidents: fundamental flaws in the implementation of the machine's software and the developers' failure to create a thorough testing environment.

The Therac-25 was a significant departure from the earlier Therac-20 in that it was designed with fewer hardware interlocks to prevent radiation overdoses, placing that responsibility on the software controller instead. I think this is a viable design decision, but the particular software implementation written by the Therac-25 engineers was needlessly complex and poorly thought out. For example, while the custom OS designed for the machine supported multiple concurrent tasks, there was no reliable means of synchronizing variables shared between them because “…the ‘test’ and ‘set’ for such variables are not indivisible operations” (Leveson, p. 24). Two tasks could therefore both test a shared variable, both see the old value, and both act on it before either one’s update became visible. This is a fundamental problem in concurrent programming, and a well-designed real-time OS would have provided for it.

A task scheduler is a reasonable design for a safety-critical application, but the way the Therac-25 user interface interacted with the rest of the system was far more complex than it needed to be. According to Leveson, the UI task was always running concurrently with the treatment task. The decision to use the entered data to set up the equipment and prepare for treatment was made by a complex scheme of setting, clearing, and checking a certain shared variable. In such a system, one misplaced instruction can cause a serious error. It would have been much simpler to have the UI routine run by itself until satisfactory inputs were received and then start the treatment task with that data. Even if the operator were then allowed to interrupt the treatment task to change the input, this would still be a simpler implementation of the UI process.
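To illustrate what an indivisible test-and-set looks like, here is a minimal Python sketch. It is not the Therac-25 code (which was written in assembly language); the names SharedFlag, test_and_set, and claim_concurrently are hypothetical and chosen only for this example. A lock makes the "test" and the "set" a single step, so that when several threads race to claim the flag, exactly one of them succeeds.

```python
import threading


class SharedFlag:
    """Hypothetical shared flag whose test and set happen as one step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._is_set = False

    def test_and_set(self):
        """Set the flag; return True only for the caller that changed it."""
        # Holding the lock across both the test and the set means no
        # other thread can run between the two operations.
        with self._lock:
            if self._is_set:
                return False
            self._is_set = True
            return True


def claim_concurrently(n_threads=8):
    """Start n_threads that all try to claim one flag; return the winners."""
    flag = SharedFlag()
    winners = []

    def worker(thread_id):
        if flag.test_and_set():
            winners.append(thread_id)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

Without the lock, two threads could both observe the flag as clear and both proceed, which is the kind of interleaving the quoted passage warns about.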

Another major problem with the Therac-25 software is that it was never appropriately tested. While the engineers performed large numbers of tests on the complete system, they never (or at least rarely) tested the software by itself, and they had no process for testing software modules individually. Consequently, the machine was tested thoroughly on the “normal” inputs and conditions it would most likely see in the field, but its software components were never tested at extreme or edge cases. This is likely to have been the cause of one malfunction, in which the operator happened to press a button at the exact moment that a one-byte counter variable overflowed and rolled over to zero. It also seems, from reading FDA documents, that the Therac-25 engineers had no regression tests for the software. As a result, any minor change to the code (such as a bug fix) could break other parts of the program that had been working.
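The rollover described above is the kind of defect that an edge-case test on an isolated module can catch. The following Python sketch is an assumption-laden model, not the actual Therac-25 code: the function names are hypothetical, and the one-byte variable is modeled with arithmetic modulo 256. A nonzero flag means "a check is still required"; incrementing the flag on every pass makes it wrap to zero on every 256th pass, while assigning a fixed nonzero value does not.

```python
def increment_flag(flag):
    """Rollover-prone pattern: bump a one-byte flag on every pass."""
    return (flag + 1) % 256  # a single byte holds only 0..255


def assign_flag(flag):
    """Safer pattern: assign a fixed nonzero value instead of counting."""
    return 1


def passes_with_check_skipped(update, n_passes):
    """Return the pass numbers on which the flag reads zero (check skipped)."""
    flag = 0
    skipped = []
    for pass_number in range(1, n_passes + 1):
        flag = update(flag)
        if flag == 0:  # zero is interpreted as "nothing left to check"
            skipped.append(pass_number)
    return skipped


# A test over a "normal" number of passes finds nothing wrong...
assert passes_with_check_skipped(increment_flag, 100) == []
# ...while a test that deliberately reaches the boundary exposes the defect.
assert passes_with_check_skipped(increment_flag, 600) == [256, 512]
# The same boundary test doubles as a regression test for the fixed version.
assert passes_with_check_skipped(assign_flag, 600) == []
```

Kept in a test suite and rerun after every change, assertions like these would also serve as the regression tests the report notes were missing.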