Justin Pun

Recitation Instructor: Sam Madden

Section: TR 11-12

 

The Contributions of Complex System Development to the Therac-25 Accidents

 

Atomic Energy of Canada Limited (AECL) demonstrated a poor understanding of the precautions and steps required to build a complex system that operates safely. AECL made many mistakes in developing the Therac-25, and these led to the radiation overexposure accidents between 1985 and 1987. Serious flaws in the design and testing stages of the development process were direct causes of the accidents.

In the design phase, AECL followed several poor design practices, ranging from software structure to user interface design. One of the most serious design flaws was the replacement of the traditional electromechanical interlocks with software checks as the means of ensuring that the turntable and attached equipment are in the correct position when treatment starts. Generally, when adding software control to a hardware system with existing safety mechanisms, one should not remove those mechanisms. The software should instead add checks that complement or extend them, especially in complex systems where safety is a top priority. For a mechanism such as the turntable, which can subject patients to lethal doses of radiation if it is not in the proper position, ensuring correct positioning should be of the highest importance. By removing the electromechanical interlocks, AECL placed nearly full responsibility for safety on the software. Because of software flaws that AECL was not initially aware of, patients were exposed to lethal levels of radiation that they would not have received had the electromechanical interlocks still been in place. In the case of the Therac-20, where some of the same software bugs existed, there were no instances of lethal radiation exposure because the hardware verification mechanism was still intact.
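The principle that software checks should complement hardware interlocks rather than replace them can be illustrated with a minimal sketch. This is not the Therac-25 or Therac-20 design; every name here (Mode, TurntablePosition, MachineState, beam_permitted) is hypothetical, and the hardware interlock is reduced to a single boolean that is assumed to come from an independent sensor circuit.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ELECTRON = "electron"
    XRAY = "xray"


class TurntablePosition(Enum):
    ELECTRON_SCAN = "electron_scan"
    XRAY_TARGET = "xray_target"
    FIELD_LIGHT = "field_light"


# Hypothetical mapping from treatment mode to the turntable position it needs.
REQUIRED_POSITION = {
    Mode.ELECTRON: TurntablePosition.ELECTRON_SCAN,
    Mode.XRAY: TurntablePosition.XRAY_TARGET,
}


@dataclass
class MachineState:
    requested_mode: Mode
    software_position: TurntablePosition  # position as the software believes it
    hardware_interlock_closed: bool       # assumed independent sensor circuit


def software_check(state: MachineState) -> bool:
    """Software-only check: does the believed position match the mode?"""
    return state.software_position == REQUIRED_POSITION[state.requested_mode]


def beam_permitted(state: MachineState) -> bool:
    """Permit the beam only when BOTH independent checks agree it is safe."""
    return state.hardware_interlock_closed and software_check(state)


# The software believes everything is fine, but the hardware interlock is open.
faulty = MachineState(Mode.XRAY, TurntablePosition.XRAY_TARGET, False)
print(beam_permitted(faulty))  # False: the hardware check vetoes the beam
```

With both checks in place, a bug that corrupts the software's view of the turntable is not enough by itself to fire the beam, which is the protection the essay attributes to the Therac-20.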

In the testing phase, AECL did not follow any commonly accepted testing scheme. Most of the testing was done at the system level, and unit testing and testing of the software on its own were minimal. Software should always be subjected to an extensive testing regimen that includes unit testing, integration testing, system testing, and regression testing. Applying all of these methods helps ensure that the individual parts, as well as the system as a whole, work correctly. Any one of these methods alone is inadequate for ensuring that the software meets its specification. By not directly testing the software, either as a whole or module by module, AECL failed to verify that the software and all of its parts functioned properly under different scenarios and operating conditions. An East Texas Cancer Center physicist conducted his own investigation and found that a simple test of data entry speed during the editing of prescription information revealed a repeatable malfunction that caused patients to be exposed to an overdose of radiation. A set of simple test scenarios would have exposed this software bug and prevented lethal exposures of up to 25,000 rads to patients.
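A simple test scenario of the kind described above might look like the following sketch. It is a toy model, not the Therac-25 software: the ToyConsole class, its SETUP_TICKS window, and the deliberately planted bug (edits made while setup is in progress are ignored) are all assumptions made for illustration. The test sweeps the delay between an initial entry and a correction, then reports which delays leave the machine configured differently from the prescription.

```python
class ToyConsole:
    """Toy model of an operator console with a planted timing bug."""

    SETUP_TICKS = 8  # arbitrary length of the hypothetical setup window

    def __init__(self):
        self.prescribed_mode = None   # what the operator last entered
        self.configured_mode = None   # what the machine was set up for
        self._setup_remaining = 0

    def enter_mode(self, mode):
        self.prescribed_mode = mode
        if self._setup_remaining == 0:
            # No setup in progress: start one for the entered mode.
            self.configured_mode = mode
            self._setup_remaining = self.SETUP_TICKS
        # Planted bug: an edit that arrives during setup is never noticed.

    def tick(self, n=1):
        """Advance simulated time by n ticks."""
        self._setup_remaining = max(0, self._setup_remaining - n)


def consistent_after_edit(delay_ticks):
    """Enter a mode, correct it after delay_ticks, and compare the results."""
    console = ToyConsole()
    console.enter_mode("xray")
    console.tick(delay_ticks)
    console.enter_mode("electron")          # operator corrects the entry
    console.tick(ToyConsole.SETUP_TICKS)    # let any setup finish
    return console.configured_mode == console.prescribed_mode


def failing_delays(max_delay=12):
    """Return every edit delay that leaves the machine misconfigured."""
    return [d for d in range(max_delay) if not consistent_after_edit(d)]


print(failing_delays())  # every delay shorter than the setup window fails
```

A regression suite that includes a sweep like this would flag the fast-edit case on every run, whereas system-level testing with operators typing at an ordinary pace could miss it entirely.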

The mistakes made in the design phase made radiation overexposure possible. The mistakes made in the testing phase allowed serious design problems and software bugs to go unnoticed. Together, the problems in these two phases led to the fatal accidents that took place between 1985 and 1987. Had the mistakes in either one of these stages been avoided, the accidents would most likely never have happened.